In [1]:
import os
import pathlib

from model_pipeline import FARMTrainer
from model_pipeline import (
    ModelConfig,
    TokenizerConfig,
    TrainingConfig,
    FileConfig,
    MLFlowConfig,
    ProcessorConfig,
    InferConfig,
)

08/26/2020 04:56:32 - INFO - transformers.file_utils -   PyTorch version 1.5.0 available.


## Training Pipeline

The training pipeline trains the relevance classifier once the dataset has been extracted and curated. The model consists of a transformer (e.g., BERT) that is loaded into the pipeline pre-trained on the Natural Questions (NQ) dataset and then fine-tuned on the curated data for our specific relevance detection task.

Our pipeline includes components provided by the FARM library. FARM is a framework that facilitates transfer learning for BERT-based models. Documentation for FARM is available here: https://farm.deepset.ai.



#### Set parameters

Before starting training, parameters for each component of the training pipeline must be set. For this we create `config` objects which hold these parameters. Default values have already been set but they can be easily changed.

In [2]:
file_config = FileConfig()  # Settings for data files and checkpoints
processor_config = ProcessorConfig()  # Settings for the processor component
tokenizer_config = TokenizerConfig()  # Settings for the tokenizer
model_config = ModelConfig()  # Settings for the model
train_config = TrainingConfig()  # Settings for training
mlflow_config = MLFlowConfig()  # Settings for MLflow experiment tracking

Parameters can be changed as follows:

In [3]:
file_config.experiment_name = "demo_training"

However, we advise that you manually update the parameters in the corresponding config file:

`esg_data_pipeline/config/config_farm_trainer.py`
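As a rough illustration of this pattern, a config class with default values might look like the sketch below. `DemoTrainingConfig` and its fields are hypothetical stand-ins chosen for illustration (the defaults mirror values printed later in this notebook); the actual classes in the config file may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class DemoTrainingConfig:
    """Hypothetical stand-in for a config class; not the real TrainingConfig."""

    n_epochs: int = 1
    batch_size: int = 4
    run_cv: bool = False


# Defaults come from the class definition ...
config = DemoTrainingConfig()
print(config.n_epochs, config.batch_size, config.run_cv)

# ... and a single run can override one of them without editing the file.
config.n_epochs = 3
print(config)
```

Overriding an attribute on the instance only affects the current session, which is why editing the config file is the more reproducible option.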

We can check the value for some parameters:

In [4]:
print(f"Experiment_name: \n {file_config.experiment_name} \n")
print(f"Data directory: \n {file_config.data_dir} \n")
print(f"Curated dataset path: \n {file_config.curated_data} \n")
print(f"Split train/validation ratio: \n{file_config.dev_split} \n")
print(f"Training dataset path: \n {file_config.train_filename} \n")
print(f"Validation dataset path: \n {file_config.dev_filename} \n")
print(f"Directory where trained model is saved: \n {file_config.saved_models_dir} \n")

Experiment_name: 
 demo_training 

Data directory: 
 /model_pipeline/model_pipeline/data 

Curated dataset path: 
 /model_pipeline/model_pipeline/data/curation/esg_TEXT_dataset.csv 

Split train/validation ratio: 
0.2 

Training dataset path: 
 /model_pipeline/model_pipeline/data/train_split_02.csv 

Validation dataset path: 
 /model_pipeline/model_pipeline/data/val_split_02.csv 

Directory where trained model is saved: 
 /model_pipeline/model_pipeline/saved_models/test_farm 



In [5]:
print(f"Max number of tokens per example: {processor_config.max_seq_len} \n")

Max number of tokens per example: 512 



In [6]:
print(f"Use GPU: {train_config.use_cuda} \n")

Use GPU: True 



In [7]:
print(f"Learning_rate: {train_config.learning_rate} \n")
print(f"Number of epochs for fine tuning: {train_config.n_epochs} \n")
print(f"Batch size: {train_config.batch_size} \n")
print(f"Perform Cross validation: {train_config.run_cv} \n")

Learning_rate: 1.2168533249479066e-05 

Number of epochs for fine tuning: 1 

Batch size: 4 

Perform Cross validation: False 



## Table vs. Text


The same model architecture is used to train on both table and text data, although the final trained models for the two data types will be different. We therefore need to train the model twice: once on text data and once on table data.
To switch between the two data types, set the parameter `data_type` in the config file to either `Text` or `Table`, as shown in the following cell. This enables the appropriate pre-processing component of the pipeline.

# Training the Text Model

In [8]:
file_config.data_type = "Text"
print(f"Data type: \n {file_config.data_type} \n")

Data type: 
 Text 



#### Load model trained on NQ dataset

We have already trained a relevance classifier on Google's large NQ dataset. We then saved the model in the following directory: `file_config.saved_models_dir / "relevance_roberta"`

We need to load this model in our pipeline to fine-tune a relevance classifier on our specific ESG curated dataset. For this we have to set the parameter `model_config.load_dir` to be the directory where we saved our first checkpoint. We can check that this is set:

In [9]:
print(f"NQ checkpoint directory: {model_config.load_dir}")

NQ checkpoint directory: /model_pipeline/model_pipeline/saved_models/NQ/relevance_roberta


#### Fine-tune on curated ESG data

Once all the parameters are set, a `FARMTrainer` object can be instantiated by passing it all the configuration objects.

In [10]:
farm_trainer = FARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

Call the method `run()` to start training.

In [11]:
farm_trainer.run()

08/26/2020 04:56:45 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: True
08/26/2020 04:56:45 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
08/26/2020 04:56:46 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
08/26/2020 04:56:46 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at /root/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
The git executable must be specified in one of the following ways:
    - be included in 

08/26/2020 04:56:47 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/___________________________________________                     

ID: 3-0
Clear Text: 
 	text: What is the total volume of natural gas liquid production?
 	text_b: During the past year, NOVATEK’s hydrocarbon production totaled 549.1 million boe, including 68.8 bcm of natural gas and 11,800 thousand tons of liquids
 	text_classification_label: 1
Tokenized: 
 	tokens: ['What', 'Ġis', 'Ġthe', 'Ġtotal', 'Ġvolume', 'Ġof', 'Ġnatural', 'Ġgas', 'Ġliquid', 'Ġproduction', '?']
 	tokens_b: ['During', 'Ġthe', 'Ġpast', 'Ġyear', ',', 'ĠNO', 'V', 'ATE', 'K',

08/26/2020 04:56:53 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/___________________________________________                     

ID: 0-0
Clear Text: 
 	text: What is the target year for climate commitment?
 	text_b: we became the first electric utility in the country to announce our aspiration to produce 100-percent carbon-free electricity for customers by 2050.
 	text_classification_label: 1
Tokenized: 
 	tokens: ['What', 'Ġis', 'Ġthe', 'Ġtarget', 'Ġyear', 'Ġfor', 'Ġclimate', 'Ġcommitment', '?']
 	tokens_b: ['we', 'Ġbecame', 'Ġthe', 'Ġfirst', 'Ġelectric', 'Ġutility', 'Ġin', 'Ġthe', 'Ġcountry', 'Ġto', 'Ġa

08/26/2020 04:57:09 - INFO - transformers.modeling_utils -   All model checkpoint weights were used when initializing RobertaModel.

08/26/2020 04:57:09 - INFO - transformers.modeling_utils -   All the weights of RobertaModel were initialized from the model checkpoint at /model_pipeline/model_pipeline/saved_models/NQ/relevance_roberta/language_model.bin.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use RobertaModel for predictions without further training.
08/26/2020 04:57:09 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
08/26/2020 04:57:09 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
08/26/2020 04:57:09 - INFO - farm.modeling.prediction_head -   Loading prediction head from /model_pipeline/model_pipeline/saved_models/NQ/relevance_roberta/prediction_head_0.bin
08/26/2020 04:57:10 - INFO - farm.modeling.optimization -   Loading optimizer `TransformersAda

08/26/2020 04:58:04 - INFO - farm.eval -   
 _________ text_classification _________
08/26/2020 04:58:04 - INFO - farm.eval -   loss: 8.346077924716973e-06
08/26/2020 04:58:04 - INFO - farm.eval -   task_name: text_classification
08/26/2020 04:58:04 - INFO - farm.eval -   acc: 1.0
08/26/2020 04:58:04 - INFO - farm.eval -   report: 
               precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000         0
           1     1.0000    1.0000    1.0000       167

   micro avg     1.0000    1.0000    1.0000       167
   macro avg     0.5000    0.5000    0.5000       167
weighted avg     1.0000    1.0000    1.0000       167

Train epoch 0/0 (Cur. train loss: 0.0000): 100%|██████████| 167/167 [00:58<00:00,  2.88it/s]
Evaluating: 100%|██████████| 42/42 [00:03<00:00, 13.96it/s]
08/26/2020 04:58:11 - INFO - farm.eval -   

\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**********************************

1.0

At the end of the training process, the model and the processor vocabulary are saved into the directory `file_config.saved_models_dir`.

In [12]:
!ls -al $file_config.saved_models_dir

total 488312
drwxr-xr-x 2 root root      4096 Aug 26 04:58 .
drwxr-xr-x 7 root root      4096 Aug 26 04:58 ..
-rw-r--r-- 1 root root 498630327 Aug 26 04:58 language_model.bin
-rw-r--r-- 1 root root       562 Aug 26 04:58 language_model_config.json
-rw-r--r-- 1 root root    456318 Aug 26 04:58 merges.txt
-rw-r--r-- 1 root root      6879 Aug 26 04:58 prediction_head_0.bin
-rw-r--r-- 1 root root       556 Aug 26 04:58 prediction_head_0_config.json
-rw-r--r-- 1 root root       727 Aug 26 04:58 processor_config.json
-rw-r--r-- 1 root root       772 Aug 26 04:58 special_tokens_map.json
-rw-r--r-- 1 root root       189 Aug 26 04:58 tokenizer_config.json
-rw-r--r-- 1 root root    898822 Aug 26 04:58 vocab.json
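Before pointing another run or the inference component at a checkpoint directory, it can be handy to confirm that the files are all there. The helper below is a minimal sketch, not part of the pipeline: `missing_checkpoint_files` is a hypothetical name, and the list of expected files is simply copied from the listing above.

```python
import pathlib

# File names copied from the directory listing above.
EXPECTED_FILES = (
    "language_model.bin",
    "language_model_config.json",
    "prediction_head_0.bin",
    "prediction_head_0_config.json",
    "processor_config.json",
)


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected file names that are absent from checkpoint_dir."""
    checkpoint_dir = pathlib.Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (checkpoint_dir / name).is_file()]
```

Calling `missing_checkpoint_files(file_config.saved_models_dir)` after training should return an empty list if the checkpoint was written completely.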


## Cross-validation

To better estimate the performance of the model on new data, it is recommended to perform k-fold cross-validation (CV). CV works as follows:

- Split the entire dataset randomly into k folds (usually 5 to 10).
- Fit the model on k - 1 folds, validate it on the remaining fold, and save the scores.
- Repeat until every fold has served as the held-out set, then average the saved scores.
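
The procedure above can be sketched in plain Python as follows. This is only an illustration of the idea, not how FARM implements it; `k_fold_indices`, `cross_validate` and the placeholder scorer are hypothetical names introduced here.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, val_idx) over k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(fit_and_score(train_idx, val_idx))
    return sum(scores) / len(scores)


# Placeholder scorer: reports the fraction of the data used for training.
mean_score = cross_validate(9, 3, lambda train, val: len(train) / (len(train) + len(val)))
print(mean_score)
```

In a real run, `fit_and_score` would train a fresh model on `train_idx` and return its validation metric on `val_idx`.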

`FARMTrainer` includes this feature. To perform 3-fold CV, proceed as follows:

In [13]:
train_config.run_cv = True
train_config.xval_folds = 3
train_config.n_epochs = 3

In [14]:
farm_trainer = FARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

In [None]:
farm_trainer.run()

Note: CV mode does not save a checkpoint; it is only used for validation.

## Inference

We can use the saved model and test it on some real examples.

In [16]:
import pandas as pd
import pathlib

from model_pipeline.relevance_infer import TextRelevanceInfer
from model_pipeline.config_farm_train import InferConfig

### Loading the model

In [17]:
infer_config = InferConfig()

The following cell will load the model trained by 1QBit. Skip it if you want to use your own model.

In [18]:
oneqbit_checkpoint_dir = pathlib.Path("/model_pipeline/model_pipeline/saved_models/1QBit_Pretrained_ESG")
infer_config.load_dir = {
    "Text": oneqbit_checkpoint_dir / "esg_text_checkpoint",
    "Table": oneqbit_checkpoint_dir / "esg_table_checkpoint",
}

In [19]:
component = TextRelevanceInfer(infer_config)

08/26/2020 05:10:18 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
08/26/2020 05:10:18 - INFO - transformers.modeling_utils -   loading weights file /model_pipeline/model_pipeline/saved_models/1QBit_Pretrained_ESG/esg_text_checkpoint/language_model.bin from cache at /model_pipeline/model_pipeline/saved_models/1QBit_Pretrained_ESG/esg_text_checkpoint/language_model.bin
08/26/2020 05:10:23 - INFO - transformers.modeling_utils -   All model checkpoint weights were used when initializing RobertaModel.

08/26/2020 05:10:23 - INFO - transformers.modeling_utils -   All the weights of RobertaModel were initialized from the model checkpoint at /model_pipeline/model_pipeline/saved_models/1QBit_Pretrained_ESG/esg_text_checkpoint/language_model.bin.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use RobertaModel for predictions without further training.
08/26/2020 05:10:23 - INFO 

### Prediction on a Single Example

In [20]:
input_text = "The company is going to reduce 8% in gas production"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)

08/26/2020 05:11:17 - INFO - farm.data_handler.processor -   *** Show 2 random examples ***
08/26/2020 05:11:17 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/___________________________________________                     

ID: 0-0
Clear Text: 
 	text: Is the company going to go green?
 	text_b: The company is going to reduce 8% in gas production
Tokenized: 
 	tokens: ['Is', 'Ġthe', 'Ġcompany', 'Ġgoing', 'Ġto', 'Ġgo', 'Ġgreen', '?']
 	tokens_b: ['The', 'Ġcompany', 'Ġis', 'Ġgoing', 'Ġto', 'Ġreduce', 'Ġ8', '%', 'Ġin', 'Ġgas', 'Ġproduction']
Features: 
 	input_ids: [0, 6209, 5, 138, 164, 7, 213, 2272, 116, 2, 2

[{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is going to reduce 8% in gas production',
    'label': '1',
    'probability': 0.81757015}]}]
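
The returned structure is a list of batches, each with a list of predictions. A small sketch of how it could be post-processed is shown below. It assumes that label `"1"` means "relevant" (as in the training samples logged earlier), that `context` has the form `question|passage` as in the output above, and that `probability` refers to the predicted label; the function name and the 0.5 threshold are hypothetical.

```python
# Result copied from the output above.
result = [
    {
        "task": "text_classification",
        "predictions": [
            {
                "start": None,
                "end": None,
                "context": "Is the company going to go green?|The company is going to reduce 8% in gas production",
                "label": "1",
                "probability": 0.81757015,
            }
        ],
    }
]


def relevant_predictions(result, threshold=0.5):
    """Keep predictions labelled "1" with probability at or above threshold."""
    kept = []
    for batch in result:
        for pred in batch["predictions"]:
            if pred["label"] == "1" and pred["probability"] >= threshold:
                # Split only on the first "|" so passages may contain the character.
                question, passage = pred["context"].split("|", 1)
                kept.append((question, passage, pred["probability"]))
    return kept


for question, passage, prob in relevant_predictions(result):
    print(f"{prob:.2f}  {question!r} -> {passage!r}")
```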

### Prediction on an Entire Folder

`run_folder()` makes predictions on all the JSON files in the /data/extraction folder. This takes some time, around 35 minutes.

In [None]:
component.run_folder()

The results are saved in a CSV file. For each result, the extracted text and the page number from the source PDF file are saved.

In [None]:
df_text_results = pd.read_csv("/model_pipeline/model_pipeline/data/infer/Text.csv")
df_text_results.head(20)

# Training the Table Model


We just need to change the `data_type` to `Table`, and make sure that curated table data is present under `FileConfig.curated_data`.

In [3]:
file_config.data_type = "Table"

print(f"Data type: \n {file_config.data_type} \n")

Data type: 
 Table 



In [5]:
print(file_config.curated_table_data)
os.path.isfile(file_config.curated_table_data)

/model_pipeline/model_pipeline/data/curation/esg_TABLE_dataset.csv


True

As with the text model, we have already trained a relevance classifier on tables from the NQ dataset, and the model has been saved under the name `saved_models/NQ/relevance_roberta_table_headers`.
`ModelConfig.load_dir` points to the text checkpoint by default, so it should be changed for table data.
If it is set to `None`, the model will start training from a RoBERTa language model checkpoint.

In [6]:
model_config.load_dir = pathlib.Path("/model_pipeline/model_pipeline/saved_models/NQ/relevance_roberta_table_headers")

The training pipelines for ESG text data and table data are similar except for the preprocessing. For tables, all the text inside each table (column headers, row headers, and cells containing text data) must first be extracted.
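
To make the extraction step concrete, here is a minimal sketch of flattening a table into a single string. It is an assumption about the general idea, not the pipeline's actual preprocessing: `table_to_text` and `is_number` are hypothetical helpers, the toy table is made up for illustration, and the real component may keep or format cells differently.

```python
def is_number(cell):
    """Return True if the cell parses as a number once thousands separators are removed."""
    try:
        float(str(cell).replace(",", ""))
        return True
    except ValueError:
        return False


def table_to_text(header, rows):
    """Join column headers, row headers and textual cells into one string."""
    # Column headers are always kept, even when they look numeric (e.g. years).
    parts = [str(h).strip() for h in header if str(h).strip()]
    for row in rows:
        for cell in row:
            cell = str(cell).strip()
            # Skip empty cells and purely numeric cells.
            if cell and not is_number(cell):
                parts.append(cell)
    return ", ".join(parts)


# Toy table for illustration only.
header = ["", "2017", "2018"]
rows = [
    ["Scope 1 emissions (kt CO2 eq)", "1,299.7", "1,348.83"],
    ["Scope 2 emissions (kt CO2 eq)", "37.50", "35.68"],
]
print(table_to_text(header, rows))
```

The resulting string plays the role of `text_b` in the question/passage pairs passed to the classifier.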

In [7]:
farm_trainer = FARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

In [8]:
farm_trainer.run()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.dropna(how="any", inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.drop_duplicates(inplace=True)
08/25/2020 23:47:31 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: True
08/25/2020 23:47:31 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'RobertaTokenizer'
08/25/2020 23:47:31 - INFO - filelock -   Lock 140156607137616 acquired on /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b.lock
08/25/2020 23:47:31 - INFO -


08/25/2020 23:47:32 - INFO - transformers.file_utils -   storing https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json in cache at /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
08/25/2020 23:47:32 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
08/25/2020 23:47:32 - INFO - filelock -   Lock 140156607137616 released on /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b.lock
08/25/2020 23:47:32 - INFO - filelock -   Lock 140156473681104 acquired on /root/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea6


08/25/2020 23:47:33 - INFO - transformers.file_utils -   storing https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt in cache at /root/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
08/25/2020 23:47:33 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
08/25/2020 23:47:33 - INFO - filelock -   Lock 140156473681104 released on /root/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
08/25/2020 23:47:33 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /root/.cache/torch/trans

08/25/2020 23:47:33 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/___________________________________________                     

ID: 5-0
Clear Text: 
 	text: What is the total amount of direct greenhouse gases emissions referred to as scope 1 emissions?
 	text_b: TOTAL GHG EMISSIONS, Year GHG emissions/revenues*, 2016 124.5, 2017 144.4, 2018 162.4, Scope 1 emissions Scope 2 emissions, (kt CO eq)2 eq) (kt CO 2 eq), 1,203.40 38.90, 1,299.7  37.50, 1,348.83 35.68, Scope 3 emissions, (kt CO 2
 	text_classification_label: 1
Tokenized: 
 	tokens: ['What', 'Ġis', 'Ġthe', 'Ġtotal', 'Ġamount', 'Ġof', 'Ġdirect', 'Ġ

08/25/2020 23:47:39 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/___________________________________________                     

ID: 1-0
Clear Text: 
 	text: What is the total amount of energy indirect greenhouse gases emissions referred to as scope 2 emissions?
 	text_b: Estimated total energy (MJ) delivered by Shell [A], Estimated greenhouse gas emissions covered by the Net Carbon, Footprint calculation (million tonnes CO2e) [B], 2.105E+13, 2.200E+13, 2.144E+13, 2.093E+13
 	text_classification_label: 0
Tokenized: 
 	tokens: ['What', 'Ġis', 'Ġthe', 'Ġtotal', 'Ġamount', 'Ġof', 'Ġenergy', 'Ġindirect', 'Ġgr


08/25/2020 23:47:43 - INFO - transformers.file_utils -   storing https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json in cache at /root/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690
08/25/2020 23:47:43 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690
08/25/2020 23:47:43 - INFO - filelock -   Lock 140156367390736 released on /root/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690.lock
08/25/2020 23:47:43 - INFO - filelock -   Lock 140156367273936 acquired on /root/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f05d90934111628e0e1c09a33bd4a121358e1.49b88ba7ec2c26a7558dda98


08/25/2020 23:48:00 - INFO - transformers.file_utils -   storing https://cdn.huggingface.co/roberta-base-pytorch_model.bin in cache at /root/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f05d90934111628e0e1c09a33bd4a121358e1.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e
08/25/2020 23:48:00 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f05d90934111628e0e1c09a33bd4a121358e1.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e
08/25/2020 23:48:00 - INFO - filelock -   Lock 140156367273936 released on /root/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f05d90934111628e0e1c09a33bd4a121358e1.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e.lock
08/25/2020 23:48:00 - INFO - transformers.modeling_utils -   loading weights file https://cdn.huggingface.co/roberta-base-pytorch_model.bin from cache at /root/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f0

08/25/2020 23:50:03 - INFO - farm.eval -   loss: 0.5068952876921983
08/25/2020 23:50:03 - INFO - farm.eval -   task_name: text_classification
08/25/2020 23:50:04 - INFO - farm.eval -   acc: 0.8888888888888888
08/25/2020 23:50:04 - INFO - farm.eval -   report: 
               precision    recall  f1-score   support

           0     0.9194    0.9268    0.9231       123
           1     0.8085    0.7917    0.8000        48

    accuracy                         0.8889       171
   macro avg     0.8639    0.8592    0.8615       171
weighted avg     0.8882    0.8889    0.8885       171

Train epoch 1/4 (Cur. train loss: 1.1553):  94%|█████████▍| 160/170 [00:56<00:02,  3.93it/s]
Evaluating: 100%|██████████| 43/43 [00:03<00:00, 14.04it/s]
08/25/2020 23:50:14 - INFO - farm.eval -   

\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
***************************************************
***** EVALUATION | DEV SET | AFTER 330 BATCHES *****
**

08/25/2020 23:52:25 - INFO - farm.eval -   
 _________ text_classification _________
08/25/2020 23:52:25 - INFO - farm.eval -   loss: 0.4542162711160225
08/25/2020 23:52:25 - INFO - farm.eval -   task_name: text_classification
08/25/2020 23:52:25 - INFO - farm.eval -   acc: 0.9122807017543859
08/25/2020 23:52:25 - INFO - farm.eval -   report: 
               precision    recall  f1-score   support

           0     0.9426    0.9350    0.9388       123
           1     0.8367    0.8542    0.8454        48

    accuracy                         0.9123       171
   macro avg     0.8897    0.8946    0.8921       171
weighted avg     0.9129    0.9123    0.9126       171

Train epoch 4/4 (Cur. train loss: 0.6336):  24%|██▎       | 40/170 [00:13<00:33,  3.92it/s]
Evaluating: 100%|██████████| 43/43 [00:03<00:00, 14.02it/s]
08/25/2020 23:52:36 - INFO - farm.eval -   

\\|//       \\|//      \\|//       \\|//     \\|//
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
***********************

08/25/2020 23:53:26 - INFO - model_pipeline.farm_trainer -   Processor vocabulary saved to /model_pipeline/model_pipeline/saved_models/test_farm


0.9064327485380117

The model will be saved in the `FileConfig.saved_models_dir` directory, and we can use it for inference.

## Inference using Table model

For inference:
1. The data must already have been extracted by the extraction component.
2. The model must be trained and its checkpoint saved.

In [9]:
from model_pipeline.relevance_infer import TableRelevanceInfer
from model_pipeline.config_farm_train import InferConfig
import pathlib, os

### Loading the model

The inference component expects to find the trained model under the `"Table"` key of `infer_config.load_dir`.

In [10]:
infer_config = InferConfig()

In [11]:
infer_config.load_dir

{'Table': 'saved_models/test_farm/Table',
 'Text': 'saved_models/test_farm/Text'}

Again, we can load the model pre-trained by 1QBit using the following cell. Skip it if you want to use your own model.

In [12]:
oneqbit_checkpoint_dir = pathlib.Path("/model_pipeline/model_pipeline/saved_models/1QBit_Pretrained_ESG")
infer_config.load_dir = {
    "Text": oneqbit_checkpoint_dir / "esg_text_checkpoint",
    "Table": oneqbit_checkpoint_dir / "esg_table_checkpoint",
}

In [21]:
component = TableRelevanceInfer(infer_config)

08/19/2020 21:15:28 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
08/19/2020 21:15:28 - INFO - transformers.modeling_utils -   loading weights file /model_pipeline/model_pipeline/saved_models/1QBit_Pretrained_ESG/esg_table_checkpoint/language_model.bin from cache at /model_pipeline/model_pipeline/saved_models/1QBit_Pretrained_ESG/esg_table_checkpoint/language_model.bin
08/19/2020 21:15:33 - INFO - transformers.modeling_utils -   All model checkpoint weights were used when initializing RobertaModel.

08/19/2020 21:15:33 - INFO - transformers.modeling_utils -   All the weights of RobertaModel were initialized from the model checkpoint at /model_pipeline/model_pipeline/saved_models/1QBit_Pretrained_ESG/esg_table_checkpoint/language_model.bin.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use RobertaModel for predictions without further training.
08/19/2020 21:15:33 - IN

### Prediction on an Entire Folder

`run_folder` in `TableRelevanceInfer` is the method responsible for making predictions on all the CSV files located in a folder.

In [22]:
!ls  $file_config.data_dir/extraction/*.csv

'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page102_1.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page104_1.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page105_1.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page108_1.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page109_1.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page109_2.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page110_1.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page110_2.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page112_1.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page112_2.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 annual_page113_1.csv'
'/model_pipeline/model_pipeline/data/extraction/NYSE_TOT_2015 ann

In [23]:
component.run_folder()

08/19/2020 21:15:56 - INFO - model_pipeline.relevance_infer -   ###### Received 265 examples for Table, number of questions: 20
08/19/2020 21:17:53 - INFO - farm.data_handler.processor -   *** Show 2 random examples ***
08/19/2020 21:17:53 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/___________________________________________                     

ID: 5238-0
Clear Text: 
 	page: 274
 	pdfname: NYSE_TOT_2015 annual
 	text: What is the target carbon reduction in percentage?
 	text_b: (M$), As of December 31, 2013, Proved properties . . . . . . . . . . . . . . . . . ., Unproved properties . . . . . . . . . . 

Inferencing Samples: 100%|██████████| 332/332 [00:41<00:00,  7.97 Batches/s]
08/19/2020 21:18:34 - INFO - model_pipeline.relevance_infer -   Saved 262 relevant examples for Table in /model_pipeline/model_pipeline/data/infer/Table.csv


In [22]:
!ls $infer_config.result_dir

Text.csv


In [None]:
pd.read_csv(os.path.join(infer_config.result_dir, "Table.csv")).tail()

## Getting statistics for train and val set

In [None]:
import pandas as pd
from farm.infer import Inferencer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Choose which split to evaluate: "train" or "val"
data_set = "val"
file = "../model_pipeline/data/{}_split_02.csv".format(data_set)
test_data = pd.read_csv(file, index_col=0)

# Load the fine-tuned checkpoint and predict a label for every row of the split
model = Inferencer.load("../model_pipeline/saved_models/test_farm/")
result = model.inference_from_file(file)
results = [d for r in result for d in r["predictions"]]
preds = [int(r["label"]) for r in results]
test_data["pred"] = preds

# Compute the metrics separately for each question (stored in the "text" column)
groups = test_data.groupby("text")
scores = {}
for group, data in groups:
    pred = data.pred
    true = data.label
    scores[group] = {}
    scores[group]["accuracy"] = accuracy_score(true, pred)
    scores[group]["f1_score"] = f1_score(true, pred)
    scores[group]["recall_score"] = recall_score(true, pred)
    scores[group]["precision_score"] = precision_score(true, pred)
    scores[group]["support"] = len(pred)

In [16]:
# Keep the DataFrame in scores_df; to_csv() returns None when given a path
scores_df = pd.DataFrame(scores)
scores_df.to_csv("../model_pipeline/data/{}_table_metric.csv".format(data_set))