# How to use our configs? 
> Detailed tutorial about how to pass arguments to embeddings pipelines.

- title-block-banner: true


In [None]:
#| hide
from __future__ import annotations
import numpy as np
from fastcore.test import *
from nbdev.showdoc import *
from nbdev.qmd import *
import warnings
import os


In [None]:
#| hide

# disable HF thousand warnings
warnings.simplefilter("ignore")
# set os environ variable for multiprocesses
os.environ["PYTHONWARNINGS"] = "ignore"

Two types of config are defined in our library: `BasicConfig` and `AdvancedConfig`.
`BasicConfig` allows for easy use of the most common parameters in the pipeline. However, the objects defined in our pipelines are constructed so that they can be further parametrized with keyword arguments. These arguments can be passed by constructing an `AdvancedConfig`.   
In summary, `BasicConfig` takes arguments and automatically assigns them to the proper keyword group, while `AdvancedConfig` takes keyword groups that must already be correctly mapped.  
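
The difference between the two config styles can be illustrated with a toy sketch. The classes `ToyBasicConfig` and `ToyAdvancedConfig` below are hypothetical stand-ins written for this tutorial, not the library's classes; they only mimic the idea that a basic config sorts flat arguments into keyword groups, while an advanced config expects the groups directly.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToyAdvancedConfig:
    """Hypothetical stand-in: holds keyword groups that are already mapped."""

    task_train_kwargs: Dict[str, Any] = field(default_factory=dict)
    task_model_kwargs: Dict[str, Any] = field(default_factory=dict)
    datamodule_kwargs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ToyBasicConfig:
    """Hypothetical stand-in: takes flat arguments and sorts them into groups."""

    learning_rate: float = 1e-4
    max_epochs: int = 1
    max_seq_length: Optional[int] = None

    def to_advanced(self) -> ToyAdvancedConfig:
        # Each flat argument is placed in the group of the object that consumes it.
        return ToyAdvancedConfig(
            task_train_kwargs={"max_epochs": self.max_epochs},
            task_model_kwargs={"learning_rate": self.learning_rate},
            datamodule_kwargs={"max_seq_length": self.max_seq_length},
        )


basic = ToyBasicConfig(learning_rate=0.001, max_epochs=1)
advanced = basic.to_advanced()
print(advanced.task_model_kwargs)  # {'learning_rate': 0.001}
```

The grouping used here follows the `LightningAdvancedConfig` example later in this tutorial; how the real classes perform the mapping internally is not shown here.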


The list of available configs can be found below.

In [None]:
#| hide

from embeddings.config.lightning_config import (
    LightningAdvancedConfig,
    LightningBasicConfig,
)

In [None]:
show_doc(LightningBasicConfig)

In [None]:
show_doc(LightningAdvancedConfig)

## Running pipeline with BasicConfig

Let's run an example pipeline on the `polemo2` dataset. For tutorial purposes we use `allegro/herbert-base-cased`.

But first we downsample the dataset due to hardware limitations. For that purpose we use `HuggingFacePreprocessingPipeline`.

In [None]:
#|exec_doc

from embeddings.pipeline.hf_preprocessing_pipeline import HuggingFacePreprocessingPipeline

In [None]:
show_doc(HuggingFacePreprocessingPipeline)

Then we need to use the `run` method.

In [None]:
show_doc(HuggingFacePreprocessingPipeline.run)

In [None]:
#|exec_doc

preprocessing = HuggingFacePreprocessingPipeline(
    dataset_name="clarin-pl/polemo2-official",
    persist_path="data/polemo2_downsampled",
    downsample_splits=(0.001, 0.005, 0.005)
)
preprocessing.run()

Our data is now prepared locally, so we can define our `pipeline`.

Let's start with the config.
We will use parameters from [`clarin-pl/lepiszcze-allegro__herbert-base-cased-polemo2`](https://huggingface.co/clarin-pl/lepiszcze-allegro__herbert-base-cased-polemo2), whose configuration was obtained from an extensive hyperparameter search.

::: {.callout-warning}  
Due to hardware limitations we limit the parameter `max_epochs` to 1 and leave the `early stopping` configuration parameters at their defaults.
:::

In [None]:
show_doc(LightningBasicConfig)

In [None]:
#|exec_doc

# Import here as well, so the executed cell does not depend on a hidden cell
from embeddings.config.lightning_config import LightningBasicConfig

cfg = LightningBasicConfig(
    use_scheduler=True,
    optimizer="Adam",
    warmup_steps=100,
    learning_rate=0.001,
    adam_epsilon=1e-06,
    weight_decay=0.001,
    finetune_last_n_layers=3,
    classifier_dropout=0.2,
    max_seq_length=None,
    batch_size=64,
    max_epochs=1,
)
cfg

Now we define a pipeline dedicated to text classification: `LightningClassificationPipeline`.

In [None]:
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline


In [None]:
show_doc(LightningClassificationPipeline)

In [None]:
from dataclasses import asdict # For metrics conversion
import pandas as pd  # For metrics conversion

In [None]:
#|exec_doc
pipeline = LightningClassificationPipeline(
    embedding_name_or_path="allegro/herbert-base-cased",
    dataset_name_or_path="data/polemo2_downsampled/",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    config=cfg
)

As with `HuggingFacePreprocessingPipeline`, we use the `run` method.

In [None]:
show_doc(LightningClassificationPipeline.run)

In [None]:
#|exec_doc
metrics = pipeline.run()

# Converting metrics to DataFrame for better nb display

metrics = pd.DataFrame.from_dict(asdict(metrics), orient="index", columns=["values"])
metrics
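
The conversion above relies on `dataclasses.asdict`, which turns a dataclass instance into a plain mapping from field name to value; with `orient="index"` each field name then becomes one row of the DataFrame. A minimal sketch of the first step, assuming a hypothetical `ToyMetrics` dataclass with illustrative field names and placeholder values (the real result object returned by `run` may have different fields):

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyMetrics:
    """Hypothetical result object; field names and values are illustrative only."""

    accuracy: float
    f1_macro: float


toy = ToyMetrics(accuracy=0.5, f1_macro=0.25)  # placeholder numbers, not real results
as_mapping = asdict(toy)  # maps each field name to its value

# One line per field, mirroring the one-row-per-metric layout of the DataFrame.
for name, value in as_mapping.items():
    print(f"{name}: {value}")
```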

## Running pipeline with AdvancedConfig

As mentioned in the previous section, `LightningBasicConfig` is limited to the most important parameters.

Let's walk through defining the parameters in our `LightningAdvancedConfig`.
Tracing back the different kwargs, we find:


1. [`task_train_kwargs`](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-flags)
Parameters that are passed to the `Lightning Trainer` object.


1. [`task_model_kwargs`](https://github.com/CLARIN-PL/embeddings/blob/main/embeddings/model/lightning_module/lightning_module.py#L19)
Parameters that are passed to the `Lightning module` object (we use `TextClassificationModule`, which inherits from `HuggingFaceLightningModule`).

1. [`datamodule_kwargs`](https://github.com/CLARIN-PL/embeddings/blob/main/embeddings/data/datamodule.py#L35)  
Parameters passed to the datamodule classes. Currently `HuggingFaceDataModule` takes several arguments as input (such as `max_seq_length`, `processing_batch_size`, or downsampling arguments).

1. [`batch_encoding_kwargs`](https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L2456)
Parameters defined in the `__call__` method of the tokenizer. They control the tokenized text through settings such as truncation, padding, and stride, and specify the return format of the tokenized text.

1. [`tokenizer_kwargs`](https://github.com/huggingface/transformers/blob/074645e32acda6498f16203a8459bb597610f623/src/transformers/models/auto/tokenization_auto.py#L351)
This is the generic configuration of the Hugging Face model's tokenizer; the possible parameters depend on the tokenizer that is used. For example, the parameters for the BERT uncased tokenizer are listed here: https://huggingface.co/bert-base-uncased/blob/main/tokenizer_config.json

1. [`load_dataset_kwargs`](https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/loading_methods#datasets.load_dataset)
Keyword arguments of the `datasets.load_dataset` method, which loads a dataset from the Hugging Face Hub or a local dataset; mostly metadata for downloading, loading, and caching the dataset.

1. [`model_config_kwargs`](https://github.com/huggingface/transformers/blob/074645e32acda6498f16203a8459bb597610f623/src/transformers/models/auto/configuration_auto.py#L515)
This is the generic configuration class of the Hugging Face model; the possible parameters depend on the model that is used. For example, the parameters for BERT uncased are listed here: https://huggingface.co/bert-base-uncased/blob/main/config.json

1. [`early_stopping_kwargs`](https://github.com/PyTorchLightning/pytorch-lightning/blob/5d2d9b09df5359226fea6ad2722592839ac0ebc4/pytorch_lightning/callbacks/early_stopping.py#L35)
Parameters defined in `__init__` of the `EarlyStopping` Lightning callback; you can specify a metric to monitor and the conditions under which training stops when that metric stops improving.

1. [`dataloader_kwargs`](https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader)
Parameters defined in `__init__` of the torch `DataLoader` object, which wraps an iterable around the Dataset to enable easy access to the samples; specify parameters such as the number of workers, sampling, or shuffling.
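
To make the groups above more concrete, here is a small sketch of what three of them might contain. The keys are parameter names of `EarlyStopping.__init__`, the tokenizer's `__call__`, and `DataLoader.__init__`, respectively; the values are placeholders rather than tuned settings, the monitored metric name is hypothetical, and whether the pipeline accepts each key alongside the values it sets itself is an assumption that has not been verified here.

```python
# Illustrative keyword groups; values are placeholders, not tuned settings.
early_stopping_kwargs = {
    "monitor": "val_loss",  # hypothetical metric name; use one the pipeline actually logs
    "mode": "min",          # stop when the monitored value no longer decreases
    "patience": 3,          # number of checks without improvement before stopping
}

batch_encoding_kwargs = {
    "padding": "max_length",  # pad every example to max_length
    "truncation": True,       # cut examples longer than max_length
    "max_length": 128,
}

dataloader_kwargs = {
    "num_workers": 2,     # subprocesses used for data loading
    "pin_memory": False,
}
```

Dictionaries like these would be passed to `LightningAdvancedConfig` under the argument of the same name.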


In [None]:
#|exec_doc

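# Import here as well, so the executed cell does not depend on a hidden cell
from embeddings.config.lightning_config import LightningAdvancedConfig

advanced_config = LightningAdvancedConfig(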
    finetune_last_n_layers=0,
    datamodule_kwargs={
        "max_seq_length": None,
    },
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 0.001,
        "train_batch_size": 64,
        "eval_batch_size": 64,
        "use_scheduler": True,
        "optimizer": "Adam",
        "adam_epsilon": 1e-6,
        "warmup_steps": 100,
        "weight_decay": 0.001,
    },
    early_stopping_kwargs=None,
    model_config_kwargs={"classifier_dropout": 0.2},
    tokenizer_kwargs={},
    batch_encoding_kwargs={},
    dataloader_kwargs={}
)
advanced_config

Now we can run the pipeline.

In [None]:
#|exec_doc

pipeline = LightningClassificationPipeline(
    embedding_name_or_path="allegro/herbert-base-cased",
    dataset_name_or_path="data/polemo2_downsampled/",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    config=advanced_config
)

metrics_adv_cfg = pipeline.run()

# Converting metrics to DataFrame for better nb display

metrics_adv_cfg = pd.DataFrame.from_dict(asdict(metrics_adv_cfg), orient="index", columns=["values"])
metrics_adv_cfg