This example runs the FrESCO library on a hierarchical version of the P3B3 benchmark data from the ECP-CANDLE [repository](https://github.com/ECP-CANDLE/Benchmarks), with an added key `groups` that designates how different entries are grouped together. The FrESCO repository includes a preformatted version of the clc dataset for model training. If you have not already done so, go to the data directory and unpack the dataset with the command `$ tar -xf clc.tar.gz`.
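If you would rather unpack the archive from Python (for example, from inside this notebook), the standard-library `tarfile` module does the same job. This is a minimal sketch: the helper name `extract_dataset` is illustrative, and the archive path `../data/clc.tar.gz` is an assumption about where the notebook sits relative to the data directory.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive: Path, dest: Path) -> list[str]:
    """Unpack a .tar.gz archive into dest and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    return names


if __name__ == "__main__":
    # Assumed location: the notebook runs one level below the repository root.
    archive = Path("../data/clc.tar.gz")
    if archive.exists():
        print(extract_dataset(archive, archive.parent))
    else:
        print(f"{archive} not found; adjust the path for your checkout")
```

Only extract archives you trust, as you would with `tar -xf`.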

Training a case-level context (clc) model is a two-step process. First, we need to train a model on the data, which we can do in the same way as for the other datasets. The `configs/` directory contains sample `model_args.yml` files for the three sample datasets; with the default settings in these files, we are ready to train the model for the first step.

In [1]:
import fresco
import argparse

The FrESCO library is typically run from the command line with arguments specifying the model type and model args, so we'll have to set them up manually for this notebook.

In [2]:
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
_ = parser.add_argument("--model", "-m", type=str, default='ie',
                        help="""which type of model to create. Must be either
                                IE (information extraction) or clc (case-level context).""")
_ = parser.add_argument('--model_path', '-mp', type=str, default='',
                        help="""this is the location of the model
                                that will be used to make predictions""")
_ = parser.add_argument('--data_path', '-dp', type=str, default='',
                        help="""where the data will load from. The default is
                                the path saved in the model""")
_ = parser.add_argument('--model_args', '-args', type=str, default='',
                        help="""file specifying the model or clc args; default is in
                                the fresco directory""")

We are going to train a multi-task classification model on the clc dataset. The first step is to train an information extraction model, so we set the model type to `ie` and point the code to the step-one clc model args file.

In [3]:
args = parser.parse_args(args=['-m', 'ie', '-args', '../configs/clc_step1.yml'])
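As a quick sanity check, the parsed namespace can be inspected before it is handed to the training routine. The sketch below rebuilds a reduced, two-option stand-in for the parser so that it runs on its own. In the notebook, calling `vars(args)` on the existing `args` object gives the same view, with the remaining options at their defaults.

```python
import argparse

# Reduced stand-in for the notebook's parser: only the two options used here.
parser = argparse.ArgumentParser()
parser.add_argument("--model", "-m", type=str, default="ie")
parser.add_argument("--model_args", "-args", type=str, default="")

args = parser.parse_args(args=["-m", "ie", "-args", "../configs/clc_step1.yml"])

# argparse derives the attribute names from the long option strings.
print(vars(args))
assert args.model == "ie"
assert args.model_args == "../configs/clc_step1.yml"
```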

With these arguments specified, we just need a few more imports before we're ready to train our model.

In [4]:
from fresco import run_ie, run_clc

from fresco.validate import exceptions

In [5]:
run_ie.run_ie(args)

Validating kwargs in model_args.yml file
../configs/clc_step1.yml
Word embeddings file does not exist; will default to random embeddings.
Loading data and creating DataLoaders
Loading data from ../data/clc/
Num workers: 4, reproducible: True
Training on 15000 validate on 2000

Defining a model
Creating model trainer
Training a mthisan model with 2 cuda device


epoch: 1

training time 18.86
Training loss: 0.855519
        task:      micro        macro
      task_1:     0.5233,     0.0615
      task_2:     0.5517,     0.4335
      task_3:     0.8872,     0.4742
      task_4:     0.4484,     0.3094

epoch 1 validation

epoch 1 val loss: 1.08947620, best val loss: inf
patience counter is at 0 of 5
        task:      micro        macro
      task_1:     0.5430,     0.1018
      task_2:     0.5830,     0.3895
      task_3:     0.8880,     0.4703
      task_4:     0.4780,     0.3428

epoch: 2

training time 17.45
Training loss: 0.847021
        task:      micro        macro
      task_1:    


training time 18.27
Training loss: 0.211272
        task:      micro        macro
      task_1:     0.8859,     0.7868
      task_2:     0.9173,     0.9151
      task_3:     0.9717,     0.9209
      task_4:     0.7873,     0.7607

epoch 16 validation

epoch 16 val loss: 0.32126487, best val loss: 0.33579794
patience counter is at 0 of 5
        task:      micro        macro
      task_1:     0.8980,     0.8099
      task_2:     0.9330,     0.9309
      task_3:     0.9480,     0.8873
      task_4:     0.7990,     0.7785

epoch: 17

training time 18.25
Training loss: 0.208659
        task:      micro        macro
      task_1:     0.8979,     0.8123
      task_2:     0.9223,     0.9203
      task_3:     0.9735,     0.9258
      task_4:     0.8068,     0.7828

epoch 17 validation

epoch 17 val loss: 0.31447380, best val loss: 0.32126487
patience counter is at 0 of 5
        task:      micro        macro
      task_1:     0.9040,     0.8170
      task_2:     0.9300,     0.9277
      task_

epoch 31 val loss: 0.27292886, best val loss: 0.26750629
patience counter is at 1 of 5
        task:      micro        macro
      task_1:     0.9410,     0.8891
      task_2:     0.9450,     0.9436
      task_3:     0.9680,     0.9274
      task_4:     0.9170,     0.9065

epoch: 32

training time 18.29
Training loss: 0.239570
        task:      micro        macro
      task_1:     0.9487,     0.9151
      task_2:     0.9346,     0.9331
      task_3:     0.9861,     0.9623
      task_4:     0.9185,     0.9099

epoch 32 validation

epoch 32 val loss: 0.27532295, best val loss: 0.26750629
patience counter is at 2 of 5
        task:      micro        macro
      task_1:     0.9450,     0.8968
      task_2:     0.9470,     0.9457
      task_3:     0.9680,     0.9274
      task_4:     0.9110,     0.8988

epoch: 33

training time 18.26
Training loss: 0.101796
        task:      micro        macro
      task_1:     0.9469,     0.9121
      task_2:     0.9383,     0.9369
      task_3:     0.98


training time 18.28
Training loss: 0.204627
        task:      micro        macro
      task_1:     0.9658,     0.9471
      task_2:     0.9477,     0.9466
      task_3:     0.9899,     0.9728
      task_4:     0.9447,     0.9393

epoch 47 validation

epoch 47 val loss: 0.23976292, best val loss: 0.23601118
patience counter is at 4 of 5
        task:      micro        macro
      task_1:     0.9560,     0.9220
      task_2:     0.9540,     0.9529
      task_3:     0.9740,     0.9393
      task_4:     0.9290,     0.9201

epoch: 48

training time 18.30
Training loss: 0.110330
        task:      micro        macro
      task_1:     0.9651,     0.9456
      task_2:     0.9504,     0.9494
      task_3:     0.9892,     0.9711
      task_4:     0.9467,     0.9420

epoch 48 validation

epoch 48 val loss: 0.23263797, best val loss: 0.23601118
patience counter is at 0 of 5
        task:      micro        macro
      task_1:     0.9560,     0.9208
      task_2:     0.9540,     0.9529
      task_

Now that we've trained an initial model, we're ready to train the final clc model. First, rename the saved model from step one, then specify the updated name in the args for step two. The default name for step two is `clc_model.h5`. With that in place, we're on our way.
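The rename can be done from the notebook itself. The following is a minimal sketch, not part of FrESCO: the helper `rename_checkpoint` is illustrative, the destination matches the path reported when the step-two run loads the trained model (`./savedmodels/clc_model/clc_model.h5`), and the source filename `step1_model.h5` is a hypothetical placeholder, since the name of the step-one file depends on your configuration.

```python
import shutil
from pathlib import Path


def rename_checkpoint(src: Path, dst: Path) -> Path:
    """Move a saved model file to dst, refusing to overwrite an existing file."""
    if not src.is_file():
        raise FileNotFoundError(f"step-one model not found: {src}")
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite: {dst}")
    dst.parent.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(src), str(dst)))


# Hypothetical source name; replace it with the file written by step one.
# rename_checkpoint(Path("./savedmodels/clc_model/step1_model.h5"),
#                   Path("./savedmodels/clc_model/clc_model.h5"))
```

Refusing to overwrite an existing destination is a deliberate choice here, so that an earlier step-one model is not silently replaced.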

In [6]:
args = parser.parse_args(args=['-m', 'clc', '-args', '../configs/clc_step2.yml'])

In [7]:
run_clc.run_case_level(args)

Validating kwargs in clc_args.yml file
Loading trained model from ./savedmodels/clc_model/clc_model.h5
Word embeddings file does not exist; will default to random embeddings.
Validating kwargs from pretrained model 
Word embeddings file does not exist; will default to random embeddings.
Loading data and creating DataLoaders
Loading data from ../data/clc/
Num workers: 4, reproducible: True
Training on 15000 validate on 2000
model loaded

Defining a CLC model
Creating model trainer
Training a case-level model with 2 cuda device


epoch: 1

training time 1.43
Training loss: 0.931656
        task:      micro        macro
      task_1:     0.4310,     0.0540
      task_2:     0.5139,     0.4975
      task_3:     0.8729,     0.4909
      task_4:     0.4243,     0.3267

epoch 1 validation

epoch 1 val loss: 0.90204819, best val loss: inf
patience counter is at 0 of 5
        task:      micro        macro
      task_1:     0.5430,     0.1173
      task_2:     0.5570,     0.4368
      task_3:  

epoch 15 val loss: 0.85979277, best val loss: 0.85822257
patience counter is at 1 of 5
        task:      micro        macro
      task_1:     0.5430,     0.1173
      task_2:     0.5570,     0.4368
      task_3:     0.8890,     0.4706
      task_4:     0.4325,     0.3017

epoch: 16

training time 1.43
Training loss: 0.854823
        task:      micro        macro
      task_1:     0.5417,     0.1225
      task_2:     0.5428,     0.4932
      task_3:     0.8935,     0.4719
      task_4:     0.4465,     0.3124

epoch 16 validation

epoch 16 val loss: 0.86009715, best val loss: 0.85822257
patience counter is at 2 of 5
        task:      micro        macro
      task_1:     0.5430,     0.1173
      task_2:     0.5410,     0.4546
      task_3:     0.8890,     0.4706
      task_4:     0.4415,     0.2875

epoch: 17

training time 1.43
Training loss: 0.866319
        task:      micro        macro
      task_1:     0.5412,     0.1232
      task_2:     0.5412,     0.4870
      task_3:     0.8935


training time 1.44
Training loss: 0.845142
        task:      micro        macro
      task_1:     0.5456,     0.1204
      task_2:     0.5495,     0.4848
      task_3:     0.8935,     0.4719
      task_4:     0.4543,     0.3126

epoch 31 validation

epoch 31 val loss: 0.85564961, best val loss: 0.85640588
patience counter is at 0 of 5
        task:      micro        macro
      task_1:     0.5430,     0.1173
      task_2:     0.5280,     0.4687
      task_3:     0.8890,     0.4706
      task_4:     0.4445,     0.2579

epoch: 32

training time 1.43
Training loss: 0.845179
        task:      micro        macro
      task_1:     0.5455,     0.1206
      task_2:     0.5508,     0.4790
      task_3:     0.8935,     0.4719
      task_4:     0.4560,     0.3145

epoch 32 validation

epoch 32 val loss: 0.85590241, best val loss: 0.85564961
patience counter is at 1 of 5
        task:      micro        macro
      task_1:     0.5430,     0.1173
      task_2:     0.5280,     0.4687
      task_3: