In [None]:
import os
from esg_data_pipeline.components import FARMTrainer
from esg_data_pipeline.config import (
    ModelConfig,
    TokenizerConfig,
    TrainingConfig,
    FileConfig,
    MLFlowConfig,
    ProcessorConfig,
)

### Training Pipeline

The training pipeline trains the relevance classifier once the dataset has been prepared and curated. The model is a transformer such as BERT, which is loaded pre-trained into the pipeline and then fine-tuned on the curated data for our specific relevance detection task.

Our pipeline includes components provided by the FARM library. FARM is a framework that facilitates transfer learning for BERT-based models. Documentation is available here: https://farm.deepset.ai.

For our demo we use the curated data generated after receiving the last set of annotations from Allianz.

#### Set parameters

Before training starts, the parameters for each component of the training pipeline must be set. For this we create `config` objects that hold these parameters. Default values have already been set, but they can easily be changed.
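
To illustrate the idea of a config object with overridable defaults, here is a minimal sketch. `DemoTrainingConfig` and its default values are hypothetical and only stand in for the real classes in `esg_data_pipeline.config`, whose internals are not shown in this notebook.

```python
from dataclasses import dataclass


@dataclass
class DemoTrainingConfig:
    """Hypothetical stand-in for a config object (not the package's class)."""

    # Illustrative defaults only; the real defaults live in the package.
    learning_rate: float = 1e-5
    n_epochs: int = 3
    batch_size: int = 16


demo_config = DemoTrainingConfig()  # starts with the default values
demo_config.n_epochs = 5  # override a single parameter by assignment
print(demo_config)
```

Keeping all parameters of a component in one object means that the component can be constructed from a single argument, and every run can be reproduced from the stored values.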

In [None]:
file_config = FileConfig()  # Settings for data files and checkpoints
processor_config = ProcessorConfig()  # Settings for the processor component
tokenizer_config = TokenizerConfig()  # Settings for the tokenizer
model_config = ModelConfig()  # Settings for the model
train_config = TrainingConfig()  # Settings for training
mlflow_config = MLFlowConfig()  # Settings for MLflow tracking

Parameters can be changed as follows:

In [None]:
file_config.experiment_name = "demo_training"

However, we advise updating the parameters manually in the corresponding config file: `esg_data_pipeline/config/config_farm_trainer.py`

We can check the value for some parameters:

In [None]:
print(f"Experiment name: \n {file_config.experiment_name} \n")
print(f"Curated dataset path: \n {file_config.curated_data} \n")
print(f"Split train/validation ratio: \n {file_config.test_split} \n")
print(f"Training dataset path: \n {file_config.train_filename} \n")
print(f"Validation dataset path: \n {file_config.dev_filename} \n")
print(f"Directory where trained model is saved: \n {file_config.saved_models_dir} \n")
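
As a rough illustration of what a train/validation split ratio means, the sketch below holds out a fraction of the examples for validation. The helper `split_examples` is hypothetical. How the pipeline itself performs the split (for example, whether it stratifies by label) is not stated here, so this is only an assumption-laden sketch.

```python
import random


def split_examples(examples, test_split, seed=42):
    """Shuffle the examples and hold out a `test_split` fraction for validation.

    Hypothetical helper for illustration; not the pipeline's implementation.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_dev = round(len(shuffled) * test_split)
    return shuffled[n_dev:], shuffled[:n_dev]


train, dev = split_examples(range(10), test_split=0.2)
print(len(train), len(dev))  # prints "8 2"
```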

In [None]:
print(f"Max number of tokens per example: {processor_config.max_seq_len} \n")

In [None]:
print(f"Use GPU: {train_config.use_cuda} \n")

In [None]:
print(f"Learning rate: {train_config.learning_rate} \n")
print(f"Number of epochs for fine-tuning: {train_config.n_epochs} \n")
print(f"Batch size: {train_config.batch_size} \n")
print(f"Perform cross-validation: {train_config.run_cv} \n")
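
For readers unfamiliar with cross-validation, the following sketch shows how k-fold cross-validation partitions example indices so that each example is used for validation exactly once. The function `k_fold_indices` is hypothetical. The number of folds and the exact strategy used when `run_cv` is enabled are not shown in this notebook.

```python
def k_fold_indices(n_examples, n_folds):
    """Yield (train_indices, validation_indices) for each of `n_folds` folds.

    Hypothetical helper for illustration; not the pipeline's implementation.
    """
    indices = list(range(n_examples))
    base, extra = divmod(n_examples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional example.
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_examples=10, n_folds=5))
for train_idx, val_idx in folds:
    print(val_idx)
```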

#### Load model trained on NQ dataset

We have already trained a relevance classifier on Google's large Natural Questions (NQ) dataset and saved the model in the following directory: `file_config.saved_models_dir / "relevance_roberta"`

We need to load this model in our pipeline to fine-tune a relevance classifier on our specific ESG curated dataset. For this we have to set the parameter `model_config.load_dir` to be the directory where we saved our first checkpoint. We can check that this is set:

In [None]:
print(f"NQ checkpoint directory: {model_config.load_dir}")
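
Before fine-tuning, it can be useful to confirm that the checkpoint directory actually exists and is not empty. The helper `checkpoint_exists` below is a hypothetical sketch built on `pathlib`. It is demonstrated on a temporary directory with a placeholder file name, not on a real checkpoint.

```python
import tempfile
from pathlib import Path


def checkpoint_exists(load_dir):
    """Return True if `load_dir` points to a non-empty directory.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    if load_dir is None:
        return False
    path = Path(load_dir)
    return path.is_dir() and any(path.iterdir())


# Demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_ok = checkpoint_exists(tmp)  # directory is still empty
    (Path(tmp) / "placeholder.bin").touch()  # placeholder file name
    filled_ok = checkpoint_exists(tmp)  # directory now contains a file

print(empty_ok, filled_ok)  # prints "False True"
```

In this notebook the same check could be applied as `checkpoint_exists(model_config.load_dir)`.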

#### Fine-tune on curated ESG data

Once all the parameters are set, a `FARMTrainer` object can be instantiated by passing it all the configuration objects.

In [None]:
farm_trainer = FARMTrainer(
    file_config=file_config,
    tokenizer_config=tokenizer_config,
    model_config=model_config,
    processor_config=processor_config,
    training_config=train_config,
    mlflow_config=mlflow_config,
)

Call the `run()` method to start training.

In [None]:
farm_trainer.run()