# How to use our configs? 

> Detailed tutorial about how to pass arguments to embeddings pipelines.

- title-block-banner: true

In [1]:
#| hide
from __future__ import annotations
import numpy as np
from fastcore.test import *
from nbdev.showdoc import *
from nbdev.qmd import *
import warnings
import os


In [2]:
#| hide

# disable HF thousand warnings
warnings.simplefilter("ignore")
# set os environ variable for multiprocesses
os.environ["PYTHONWARNINGS"] = "ignore"

In [3]:
#| hide
from embeddings.config.lightning_config import (
    LightningAdvancedConfig,
    LightningBasicConfig,
)

Two types of config are defined in our library: `BasicConfig` and `AdvancedConfig`.

## BasicConfig

> allows for easy use of the most common parameters in the pipeline. 

In [4]:
show_doc(LightningBasicConfig)

---

### LightningBasicConfig

>      LightningBasicConfig (use_scheduler:bool=True, optimizer:str='Adam',
>                            warmup_steps:int=100, learning_rate:float=0.0001,
>                            adam_epsilon:float=1e-08, weight_decay:float=0.0,
>                            finetune_last_n_layers:int=-1,
>                            classifier_dropout:Optional[float]=None,
>                            max_seq_length:Optional[int]=None,
>                            batch_size:int=32, max_epochs:Optional[int]=None,
>                            early_stopping_monitor:str='val/Loss',
>                            early_stopping_mode:str='min',
>                            early_stopping_patience:int=3)

## AdvancedConfig

> The objects defined in our pipelines are constructed so that they can be further parametrized with keyword arguments. These arguments are supplied by constructing an `AdvancedConfig`.

In [5]:
show_doc(LightningAdvancedConfig)

---

### LightningAdvancedConfig

>      LightningAdvancedConfig (finetune_last_n_layers:int,
>                               task_model_kwargs:Dict[str,Any],
>                               datamodule_kwargs:Dict[str,Any],
>                               task_train_kwargs:Dict[str,Any],
>                               model_config_kwargs:Dict[str,Any],
>                               early_stopping_kwargs:Dict[str,Any],
>                               tokenizer_kwargs:Dict[str,Any],
>                               batch_encoding_kwargs:Dict[str,Any],
>                               dataloader_kwargs:Dict[str,Any])

  
In summary, the `BasicConfig` takes individual arguments and automatically assigns them to the proper keyword groups, while the `AdvancedConfig` takes as input keyword groups that must already be correctly mapped.
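The difference between the two styles can be sketched with a small stand-in class. `FlatConfig` and its `to_groups` method are hypothetical names invented for this illustration, and the routing of each parameter to a group is an assumption, not the library's actual implementation; only the group names are taken from the `LightningAdvancedConfig` signature.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FlatConfig:
    """Hypothetical stand-in for a basic config: a flat set of common parameters."""

    learning_rate: float = 1e-4
    batch_size: int = 32
    max_epochs: Optional[int] = None
    max_seq_length: Optional[int] = None
    classifier_dropout: Optional[float] = None

    def to_groups(self) -> Dict[str, Dict[str, Any]]:
        # Route every flat parameter to the keyword group that would consume it.
        # This routing is an assumption made for illustration only.
        return {
            "task_model_kwargs": {
                "learning_rate": self.learning_rate,
                "train_batch_size": self.batch_size,
                "eval_batch_size": self.batch_size,
            },
            "task_train_kwargs": {"max_epochs": self.max_epochs},
            "datamodule_kwargs": {"max_seq_length": self.max_seq_length},
            "model_config_kwargs": {"classifier_dropout": self.classifier_dropout},
        }


groups = FlatConfig(learning_rate=0.001, batch_size=64, max_epochs=1).to_groups()
for name, kwargs in groups.items():
    print(name, kwargs)
```

A basic-style config hides this routing from the user, whereas an advanced-style config expects the caller to provide the grouped dictionaries directly.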


The list of available configs can be found below.

## Running pipeline with BasicConfig

Let's run an example pipeline on the `polemo2` dataset.

But first we downsample the dataset because of hardware limitations. For that purpose we use `HuggingFacePreprocessingPipeline`.

In [6]:
#|exec_doc
from embeddings.pipeline.hf_preprocessing_pipeline import HuggingFacePreprocessingPipeline

In [7]:
show_doc(HuggingFacePreprocessingPipeline)

---

### HuggingFacePreprocessingPipeline

>      HuggingFacePreprocessingPipeline (dataset_name:str, persist_path:str, sam
>                                        ple_missing_splits:Optional[Tuple[Optio
>                                        nal[float],Optional[float]]]=None, down
>                                        sample_splits:Optional[Tuple[Optional[f
>                                        loat],Optional[float],Optional[float]]]
>                                        =None, ignore_test_subset:bool=False,
>                                        seed:int=441, load_dataset_kwargs:Optio
>                                        nal[Dict[str,Any]]=None)

Preprocessing pipeline dedicated to work with HuggingFace datasets.

Then we need to call the `run` method.

In [8]:
show_doc(HuggingFacePreprocessingPipeline.run)

---

### PreprocessingPipeline.run

>      PreprocessingPipeline.run ()

In [9]:
#|exec_doc
preprocessing = HuggingFacePreprocessingPipeline(
    dataset_name="clarin-pl/polemo2-official",
    persist_path="data/polemo2_downsampled",
    # fractions of rows to keep per split (train, validation, test)
    downsample_splits=(0.001, 0.005, 0.005)
)
preprocessing.run()

No config specified, defaulting to: polemo2-official/all_text
Found cached dataset polemo2-official (/root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)
100%|██████████| 3/3 [00:00<00:00, 686.58it/s]
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-b5e701b965017bbe.arrow and /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-d36fd2c84292ba9d.arrow
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-9d13530ab41d82c9.arrow and /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e

DatasetDict({
    train: Dataset({
        features: ['text', 'target'],
        num_rows: 7
    })
    validation: Dataset({
        features: ['text', 'target'],
        num_rows: 5
    })
    test: Dataset({
        features: ['text', 'target'],
        num_rows: 5
    })
})
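What fraction-based downsampling does can be sketched in plain Python. The `downsample` helper, the split sizes, and the choice to round up are all assumptions made for illustration; the library's own sampling logic may differ. The seed value mirrors the `seed:int=441` default in the signature above.

```python
import math
import random
from typing import Dict, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def downsample(rows: Sequence[T], fraction: Optional[float], seed: int = 441) -> List[T]:
    """Keep a random `fraction` of `rows`; `None` leaves the split untouched."""
    if fraction is None:
        return list(rows)
    # Rounding up is an assumption; it keeps at least one row of a non-empty split.
    n_keep = math.ceil(len(rows) * fraction)
    # A dedicated, seeded generator makes the selection reproducible.
    return random.Random(seed).sample(list(rows), n_keep)


# Hypothetical split sizes, chosen only for illustration.
splits: Dict[str, List[int]] = {
    "train": list(range(6500)),
    "validation": list(range(820)),
    "test": list(range(820)),
}
fractions = (0.001, 0.005, 0.005)

small = {
    name: downsample(rows, fraction)
    for (name, rows), fraction in zip(splits.items(), fractions)
}
sizes = {name: len(rows) for name, rows in small.items()}
print(sizes)
```

Because the generator is seeded, calling the helper twice with the same input returns the same rows, which makes a downsampled experiment repeatable.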

Our data is now prepared locally, so next we need to define our `pipeline`.

Let's start with the config.
We will use parameters from [`clarin-pl/lepiszcze-allegro__herbert-base-cased-polemo2`](https://huggingface.co/clarin-pl/lepiszcze-allegro__herbert-base-cased-polemo2), whose configuration was obtained from an extensive hyperparameter search.

::: {.callout-warning}  
Due to hardware limitations, we limit the `max_epochs` parameter to 1 and leave the `early stopping` configuration parameters at their defaults.
:::
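The early stopping defaults (`early_stopping_monitor='val/Loss'`, `early_stopping_mode='min'`, `early_stopping_patience=3`) describe a patience-based rule. The sketch below is a simplified illustration of that rule, not Lightning's `EarlyStopping` callback, which offers further options; `stopping_epoch` and the loss values are hypothetical.

```python
from typing import Iterable, Optional


def stopping_epoch(values: Iterable[float], mode: str = "min", patience: int = 3) -> Optional[int]:
    """Return the epoch index at which training would stop, or None if it never stops."""
    best: Optional[float] = None
    bad_epochs = 0
    for epoch, value in enumerate(values):
        if best is None:
            improved = True
        elif mode == "min":
            improved = value < best
        else:
            improved = value > best
        if improved:
            # A new best value resets the patience counter.
            best, bad_epochs = value, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch.
val_loss = [1.39, 1.20, 1.25, 1.22, 1.21, 1.10]
print(stopping_epoch(val_loss, mode="min", patience=3))
```

Under this simplified rule, a run limited to a single epoch can never exhaust a patience of 3, so the early stopping defaults have no effect on the short run in this tutorial.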

In [10]:
show_doc(LightningBasicConfig)

---

### LightningBasicConfig

>      LightningBasicConfig (use_scheduler:bool=True, optimizer:str='Adam',
>                            warmup_steps:int=100, learning_rate:float=0.0001,
>                            adam_epsilon:float=1e-08, weight_decay:float=0.0,
>                            finetune_last_n_layers:int=-1,
>                            classifier_dropout:Optional[float]=None,
>                            max_seq_length:Optional[int]=None,
>                            batch_size:int=32, max_epochs:Optional[int]=None,
>                            early_stopping_monitor:str='val/Loss',
>                            early_stopping_mode:str='min',
>                            early_stopping_patience:int=3)

In [11]:
#|exec_doc

config = LightningBasicConfig(
        use_scheduler=True,
        optimizer="Adam",
        warmup_steps=100,
        learning_rate=0.001,
        adam_epsilon=1e-06,
        weight_decay=0.001,
        finetune_last_n_layers=3,
        classifier_dropout=0.2,
        max_seq_length=None,
        batch_size=64,
        max_epochs=1,
)
config

LightningBasicConfig(use_scheduler=True, optimizer='Adam', warmup_steps=100, learning_rate=0.001, adam_epsilon=1e-06, weight_decay=0.001, finetune_last_n_layers=3, classifier_dropout=0.2, max_seq_length=None, batch_size=64, max_epochs=1, early_stopping_monitor='val/Loss', early_stopping_mode='min', early_stopping_patience=3, tokenizer_kwargs={}, batch_encoding_kwargs={}, dataloader_kwargs={})
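The `warmup_steps` and `learning_rate` values interact with the total number of optimizer steps. The sketch below assumes linear warm-up followed by linear decay, the shape produced by `get_linear_schedule_with_warmup` in `transformers`; whether the pipeline uses exactly this schedule is an assumption, and `lr_multiplier` is a hypothetical helper.

```python
def lr_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warm-up phase: grow linearly from 0 to 1.
        return step / max(1, warmup_steps)
    # Decay phase: shrink linearly from 1 to 0 at the final step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 0.001
for step in (0, 1, 50, 100, 550, 1000):
    lr = base_lr * lr_multiplier(step, warmup_steps=100, total_steps=1000)
    print(f"step {step:4d}: lr = {lr:.6f}")
```

Under this assumed schedule, a run that performs only a handful of optimizer steps with `warmup_steps=100` stays within the first few percent of the base learning rate, which would be consistent with the near-zero `train/BaseLR` values reported in the training logs of this tutorial.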

Now we define a pipeline dedicated to text classification: `LightningClassificationPipeline`.

In [12]:
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline


In [13]:
show_doc(LightningClassificationPipeline)

---

### LightningClassificationPipeline

>      LightningClassificationPipeline
>                                       (embedding_name_or_path:Union[str,pathli
>                                       b.Path], dataset_name_or_path:Union[str,
>                                       pathlib.Path], input_column_name:Union[s
>                                       tr,Sequence[str]],
>                                       target_column_name:str,
>                                       output_path:Union[str,pathlib.Path], eva
>                                       luation_filename:str='evaluation.json', 
>                                       config:Union[embeddings.config.lightning
>                                       _config.LightningBasicConfig,embeddings.
>                                       config.lightning_config.LightningAdvance
>                                       dConfig]=LightningBasicConfig(use_schedu
>                                       ler=True, optimizer='Adam',
>                                       warmup_steps=100, learning_rate=0.0001,
>                                       adam_epsilon=1e-08, weight_decay=0.0,
>                                       finetune_last_n_layers=-1,
>                                       classifier_dropout=None,
>                                       max_seq_length=None, batch_size=32,
>                                       max_epochs=None,
>                                       early_stopping_monitor='val/Loss',
>                                       early_stopping_mode='min',
>                                       early_stopping_patience=3,
>                                       tokenizer_kwargs={},
>                                       batch_encoding_kwargs={},
>                                       dataloader_kwargs={}), devices:Union[int
>                                       ,List[int],str,NoneType]='auto', acceler
>                                       ator:Union[str,pytorch_lightning.acceler
>                                       ators.accelerator.Accelerator,NoneType]=
>                                       'auto', logging_config:embeddings.utils.
>                                       loggers.LightningLoggingConfig=Lightning
>                                       LoggingConfig(loggers_names=[],
>                                       tracking_project_name=None,
>                                       wandb_entity=None,
>                                       wandb_logger_kwargs={}), tokenizer_name_
>                                       or_path:Union[pathlib.Path,str,NoneType]
>                                       =None, predict_subset:embeddings.data.da
>                                       taset.LightingDataModuleSubset=<Lighting
>                                       DataModuleSubset.TEST: 'test'>, load_dat
>                                       aset_kwargs:Optional[Dict[str,Any]]=None
>                                       , model_checkpoint_kwargs:Optional[Dict[
>                                       str,Any]]=None)

Helper class that provides a standard way to create an ABC using
inheritance.

In [14]:
from dataclasses import asdict # For metrics conversion
import pandas as pd  # For metrics conversion

In [15]:
#|exec_doc
pipeline = LightningClassificationPipeline(
    embedding_name_or_path="hf-internal-testing/tiny-albert",
    dataset_name_or_path="data/polemo2_downsampled/",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    devices="auto",
    accelerator="cpu",
    config=config
)

100%|██████████| 1/1 [00:00<00:00, 39.46ba/s]
100%|██████████| 1/1 [00:00<00:00, 90.29ba/s]
100%|██████████| 1/1 [00:00<00:00, 86.20ba/s]
Casting the dataset: 100%|██████████| 1/1 [00:00<00:00, 173.49ba/s]
Casting the dataset: 100%|██████████| 1/1 [00:00<00:00, 169.30ba/s]
Casting the dataset: 100%|██████████| 1/1 [00:00<00:00, 172.10ba/s]


As with `HuggingFacePreprocessingPipeline`, we use the `run` method.

In [16]:
show_doc(LightningClassificationPipeline.run)

---

### LightningPipeline.run

>      LightningPipeline.run (run_name:Optional[str]=None)

In [17]:
#|exec_doc
metrics = pipeline.run()

Some weights of the model checkpoint at hf-internal-testing/tiny-albert were not used when initializing AlbertForSequenceClassification: ['predictions.LayerNorm.weight', 'predictions.dense.weight', 'predictions.dense.bias', 'predictions.bias', 'predictions.decoder.weight', 'predictions.LayerNorm.bias', 'predictions.decoder.bias']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at hf-internal-testing/tiny-albert and are newly initialized: ['classifier.b

Epoch 0: 100%|██████████| 2/2 [00:00<00:00, 14.22it/s, loss=1.39, v_num=, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.200, val/MulticlassPrecision=0.050, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.0833]
Testing: 0it [00:00, ?it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test/Loss': 1.3870834112167358,
 'test/MulticlassAccuracy': 0.0,
 'test/MulticlassF1Score': 0.0,
 'test/MulticlassPrecision': 0.0,
 'test/MulticlassRecall': 0.0}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 1/1 [00:00<00:00, 48.23it/s]


Restoring states from the checkpoint path at /app/nbs/01_Tutorials/checkpoints/epoch=0-step=0.ckpt
Loaded model weights from checkpoint at /app/nbs/01_Tutorials/checkpoints/epoch=0-step=0.ckpt


Predicting: 100%|██████████| 1/1 [00:00<?, ?it/s]


In [18]:
metrics = pd.DataFrame.from_dict(asdict(metrics), orient="index", columns=["values"])
metrics

Unnamed: 0,values
accuracy,0.0
f1_macro,0.0
f1_micro,0.0
f1_weighted,0.0
recall_macro,0.0
recall_micro,0.0
recall_weighted,0.0
precision_macro,0.0
precision_micro,0.0
precision_weighted,0.0


## Running pipeline with AdvancedConfig

As mentioned in the previous section, `LightningBasicConfig` is limited to the most important parameters.

Let's walk through an example of defining the parameters in our `LightningAdvancedConfig`.
Tracing back the different kwargs, we find:


1. [`task_train_kwargs`](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-flags)
Parameters that are passed to the `Lightning Trainer` object.


1. [`task_model_kwargs`](https://github.com/CLARIN-PL/embeddings/blob/main/embeddings/model/lightning_module/lightning_module.py#L19)
Parameters that are passed to the `Lightning module` object (we use `TextClassificationModule`, which inherits from `HuggingFaceLightningModule`).

1. [`datamodule_kwargs`](https://github.com/CLARIN-PL/embeddings/blob/main/embeddings/data/datamodule.py#L35)  
Parameters passed to the datamodule classes. Currently `HuggingFaceDataModule` takes several arguments as input, such as `max_seq_length`, `processing_batch_size`, or the downsampling arguments.

1. [`batch_encoding_kwargs`](https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L2456)
Parameters defined in the `__call__` method of the tokenizer. They control how the text is tokenized (truncation, padding, stride, etc.) and the return format of the tokenized text.

1. [`tokenizer_kwargs`](https://github.com/huggingface/transformers/blob/074645e32acda6498f16203a8459bb597610f623/src/transformers/models/auto/tokenization_auto.py#L351)
This is a generic configuration class of the Hugging Face model's tokenizer; the possible parameters depend on the tokenizer that is used. For example, the parameters for the BERT uncased tokenizer are listed here: https://huggingface.co/bert-base-uncased/blob/main/tokenizer_config.json

1. [`load_dataset_kwargs`](https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/loading_methods#datasets.load_dataset)
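Keyword arguments of the `datasets.load_dataset` method, which loads a dataset from the Hugging Face Hub or a local dataset; mostly metadata for downloading, loading, and caching the dataset. In the signatures shown earlier, this argument belongs to the pipeline itself rather than to `LightningAdvancedConfig`.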

1. [`model_config_kwargs`](https://github.com/huggingface/transformers/blob/074645e32acda6498f16203a8459bb597610f623/src/transformers/models/auto/configuration_auto.py#L515)
This is a generic configuration class of the Hugging Face model; the possible parameters depend on the model that is used. For example, the parameters for BERT uncased are listed here: https://huggingface.co/bert-base-uncased/blob/main/config.json

1. [`early_stopping_kwargs`](https://github.com/PyTorchLightning/pytorch-lightning/blob/5d2d9b09df5359226fea6ad2722592839ac0ebc4/pytorch_lightning/callbacks/early_stopping.py#L35)
Parameters defined in `__init__` of the `EarlyStopping` Lightning callback; you can specify a metric to monitor and the conditions under which training stops once that metric stops improving.

1. [`dataloader_kwargs`](https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader)
Parameters defined in `__init__` of the torch `DataLoader` object, which wraps an iterable around the dataset to enable easy access to the samples; use them to set, for example, the number of workers, sampling, or shuffling.
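The common idea behind all of these groups is keyword-argument forwarding: each dictionary is unpacked into one constructor. The sketch below illustrates the pattern with stub classes; `StubTrainer`, `StubDataLoader`, and `build` are hypothetical names that do not exist in the library.

```python
from typing import Any, Dict, Tuple


class StubTrainer:
    """Hypothetical stand-in for a trainer; not a real library class."""

    def __init__(self, max_epochs: int = 10, deterministic: bool = False) -> None:
        self.max_epochs = max_epochs
        self.deterministic = deterministic


class StubDataLoader:
    """Hypothetical stand-in for a data loader; not a real library class."""

    def __init__(self, num_workers: int = 0, shuffle: bool = False) -> None:
        self.num_workers = num_workers
        self.shuffle = shuffle


def build(groups: Dict[str, Dict[str, Any]]) -> Tuple[StubTrainer, StubDataLoader]:
    # Each keyword group is unpacked into exactly one constructor.
    trainer = StubTrainer(**groups["task_train_kwargs"])
    loader = StubDataLoader(**groups["dataloader_kwargs"])
    return trainer, loader


trainer, loader = build({
    "task_train_kwargs": {"max_epochs": 1, "deterministic": True},
    "dataloader_kwargs": {},  # an empty group keeps the constructor defaults
})
print(trainer.max_epochs, trainer.deterministic, loader.num_workers)

# A key placed in the wrong group is rejected by the receiving constructor.
rejected = False
try:
    build({"task_train_kwargs": {"shuffle": True}, "dataloader_kwargs": {}})
except TypeError as err:
    rejected = True
    print("rejected:", err)
```

This is why the groups passed to an advanced-style config must already be mapped correctly: a misplaced key surfaces as an error in whichever object receives it.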


Let's create an advanced config with all the parameters we want to use.

In [19]:
#|exec_doc

advanced_config = LightningAdvancedConfig(
    finetune_last_n_layers=0,
    datamodule_kwargs={
        "max_seq_length": None,
    },
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 0.001,
        "train_batch_size": 64,
        "eval_batch_size": 64,
        "use_scheduler": True,
        "optimizer": "Adam",
        "adam_epsilon": 1e-6,
        "warmup_steps": 100,
        "weight_decay": 0.001,
    },
    early_stopping_kwargs=None,
    model_config_kwargs={"classifier_dropout": 0.2},
    tokenizer_kwargs={},
    batch_encoding_kwargs={},
    dataloader_kwargs={}
)
advanced_config

LightningAdvancedConfig(finetune_last_n_layers=0, task_model_kwargs={'learning_rate': 0.001, 'train_batch_size': 64, 'eval_batch_size': 64, 'use_scheduler': True, 'optimizer': 'Adam', 'adam_epsilon': 1e-06, 'warmup_steps': 100, 'weight_decay': 0.001}, datamodule_kwargs={'max_seq_length': None}, task_train_kwargs={'max_epochs': 1, 'devices': 'auto', 'accelerator': 'cpu', 'deterministic': True}, model_config_kwargs={'classifier_dropout': 0.2}, early_stopping_kwargs=None, tokenizer_kwargs={}, batch_encoding_kwargs={}, dataloader_kwargs={})

Now we can pass the config to the pipeline and run it.

In [20]:
#|exec_doc

pipeline = LightningClassificationPipeline(
    embedding_name_or_path="hf-internal-testing/tiny-albert",
    dataset_name_or_path="data/polemo2_downsampled/",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    devices="auto",
    accelerator="cpu",
    config=advanced_config
)

metrics_adv_cfg = pipeline.run()

Loading cached processed dataset at /app/nbs/01_Tutorials/data/polemo2_downsampled/train/cache-77e994fb05243ad6.arrow
100%|██████████| 1/1 [00:00<00:00, 86.48ba/s]
Loading cached processed dataset at /app/nbs/01_Tutorials/data/polemo2_downsampled/test/cache-75bdd677854c6d1d.arrow
Loading cached processed dataset at /app/nbs/01_Tutorials/data/polemo2_downsampled/train/cache-cf1f2693fe2c5dfe.arrow
Casting the dataset: 100%|██████████| 1/1 [00:00<00:00, 179.47ba/s]
Loading cached processed dataset at /app/nbs/01_Tutorials/data/polemo2_downsampled/test/cache-b8547187eb8006cc.arrow
Some weights of the model checkpoint at hf-internal-testing/tiny-albert were not used when initializing AlbertForSequenceClassification: ['predictions.LayerNorm.weight', 'predictions.dense.weight', 'predictions.dense.bias', 'predictions.bias', 'predictions.decoder.weight', 'predictions.LayerNorm.bias', 'predictions.decoder.bias']
- This IS expected if you are initializing AlbertForSequenceClassification from the 

Epoch 0: 100%|██████████| 2/2 [00:00<00:00, 16.25it/s, loss=1.39, v_num=, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]
Testing: 0it [00:00, ?it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test/Loss': 1.3838523626327515,
 'test/MulticlassAccuracy': 0.6000000238418579,
 'test/MulticlassF1Score': 0.1875,
 'test/MulticlassPrecision': 0.15000000596046448,
 'test/MulticlassRecall': 0.25}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 1/1 [00:00<00:00, 45.59it/s]


Restoring states from the checkpoint path at /app/nbs/01_Tutorials/checkpoints/epoch=0-step=0-v1.ckpt
Loaded model weights from checkpoint at /app/nbs/01_Tutorials/checkpoints/epoch=0-step=0-v1.ckpt


Predicting: 100%|██████████| 1/1 [00:00<?, ?it/s]


Finally, we can check out some of the metrics.

In [21]:
metrics_adv_cfg = pd.DataFrame.from_dict(asdict(metrics_adv_cfg), orient="index", columns=["values"])
metrics_adv_cfg

Unnamed: 0,values
accuracy,0.6
f1_macro,0.25
f1_micro,0.6
f1_weighted,0.45
recall_macro,0.333333
recall_micro,0.6
recall_weighted,0.6
precision_macro,0.2
precision_micro,0.6
precision_weighted,0.36
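How the macro and weighted rows relate to each other can be sketched in plain Python. The labels below are hypothetical (the tutorial does not show the actual predictions); they were picked so that a classifier that always predicts the same class on five samples reproduces the figures in the table above. The `per_class` helper is likewise written for this illustration only.

```python
from typing import Dict, List

y_true = [0, 0, 0, 1, 2]  # hypothetical gold labels
y_pred = [0, 0, 0, 0, 0]  # hypothetical predictions: always class 0


def per_class(true: List[int], pred: List[int]) -> Dict[int, Dict[str, float]]:
    """Precision, recall, F1, and support for every class that occurs."""
    stats: Dict[int, Dict[str, float]] = {}
    for c in sorted(set(true) | set(pred)):
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        stats[c] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return stats


stats = per_class(y_true, y_pred)
n = len(y_true)

report = {"accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n}
for metric in ("precision", "recall", "f1"):
    # Macro: every class counts equally. Weighted: classes count by their support.
    report[f"{metric}_macro"] = sum(s[metric] for s in stats.values()) / len(stats)
    report[f"{metric}_weighted"] = sum(s[metric] * s["support"] for s in stats.values()) / n

for name, value in report.items():
    print(f"{name}: {value:.6f}")
```

In a single-label multiclass setting the micro-averaged precision, recall, and F1 all equal the accuracy, which is why those rows share one value. The gap between the macro and weighted rows shows that a model ignoring the rarer classes is penalized much more heavily by macro averaging.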


We used a very small dataset and a very small language model, so the results are poor. With larger datasets and more capable models, the results should be considerably better.

Good luck in your experiments!