In [None]:
import os

os.chdir("..")

from embeddings.config.lightning_config import (
    LightningAdvancedConfig,
    LightningBasicConfig,
)
from embeddings.defaults import DATASET_PATH, RESULTS_PATH
from embeddings.pipeline.hf_preprocessing_pipeline import HuggingFacePreprocessingPipeline
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline
from embeddings.utils.utils import build_output_path, format_eval_result

# How to use our configs? 

Our library defines two types of config: `BasicConfig` and `AdvancedConfig`.
`BasicConfig` gives easy access to the most commonly used pipeline parameters. The objects built inside our pipelines can, however, be parametrized further with keyword arguments, and these are set by constructing an `AdvancedConfig`.  
In summary, `BasicConfig` takes plain arguments and automatically assigns them to the proper keyword groups, while `AdvancedConfig` takes keyword groups that must already be mapped correctly.  
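
The difference between the two styles can be illustrated with a minimal, library-independent sketch. The class names `SketchBasicConfig` and `SketchAdvancedConfig` and the field-to-group mapping are hypothetical and only mirror the idea described above; the real `LightningBasicConfig` and `LightningAdvancedConfig` may group parameters differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchAdvancedConfig:
    """Hypothetical advanced config: the caller supplies ready-made keyword groups."""

    finetune_last_n_layers: int
    datamodule_kwargs: Dict[str, Any] = field(default_factory=dict)
    task_model_kwargs: Dict[str, Any] = field(default_factory=dict)
    task_train_kwargs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SketchBasicConfig:
    """Hypothetical basic config: flat arguments that are grouped automatically."""

    learning_rate: float
    max_epochs: int
    max_seq_length: int
    finetune_last_n_layers: int

    def to_advanced(self) -> SketchAdvancedConfig:
        # Assumed mapping, chosen only to match the keyword groups used in this notebook.
        return SketchAdvancedConfig(
            finetune_last_n_layers=self.finetune_last_n_layers,
            datamodule_kwargs={"max_seq_length": self.max_seq_length},
            task_model_kwargs={"learning_rate": self.learning_rate},
            task_train_kwargs={"max_epochs": self.max_epochs},
        )


sketch_basic = SketchBasicConfig(
    learning_rate=0.01, max_epochs=1, max_seq_length=128, finetune_last_n_layers=0
)
sketch_advanced = sketch_basic.to_advanced()
print(sketch_advanced.task_model_kwargs)  # {'learning_rate': 0.01}
print(sketch_advanced.datamodule_kwargs)  # {'max_seq_length': 128}
```

The flat arguments are convenient for common settings, while the grouped form lets any argument accepted by the underlying object be passed through without the config class having to know about it.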

The keyword arguments depend on the type of pipeline. For the Flair pipelines (which are used for static embeddings) they also depend on the task, so separate configs are defined for each type of task.

The available configs are listed below.


### **Flair**:  
   - FlairBasicConfig
   - FlairSequenceLabelingBasicConfig
   - FlairTextClassificationBasicConfig
   - FlairSequenceLabelingAdvancedConfig
   - FlairTextClassificationAdvancedConfig
   
### **Lightning**:
   - LightningBasicConfig
   - LightningAdvancedConfig




## What are the available advanced config keyword arguments and where to find them?

In general, the keyword arguments are passed to the underlying objects when a specific pipeline is constructed. Take, for example, this fragment of `LightningClassificationPipeline`:

```
datamodule = TextClassificationDataModule(
    tokenizer_name_or_path=tokenizer_name_or_path,
    dataset_name_or_path=dataset_name_or_path,
    text_fields=input_column_name,
    target_field=target_column_name,
    train_batch_size=config_space.train_batch_size,
    eval_batch_size=config_space.eval_batch_size,
    tokenizer_kwargs=config_space.tokenizer_kwargs,
    batch_encoding_kwargs=config_space.batch_encoding_kwargs,
    load_dataset_kwargs=load_dataset_kwargs,
    **config_space.datamodule_kwargs
)
task = TextClassificationTask(
    model_name_or_path=embedding_name_or_path,
    output_path=output_path,
    finetune_last_n_layers=config_space.finetune_last_n_layers,
    model_config_kwargs=config_space.model_config_kwargs,
    task_model_kwargs=config_space.task_model_kwargs,
    task_train_kwargs=config_space.task_train_kwargs,
    early_stopping_kwargs=config_space.early_stopping_kwargs,
)
```

We can trace each keyword group back to the object that receives it to find out which arguments can be set in the config kwargs.
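
One practical way to trace a keyword group is to inspect the signature of the object that finally receives it. The sketch below uses the standard-library `inspect` module on a stand-in class. `StandInEarlyStopping` and the helpers `accepted_parameters` and `unknown_kwargs` are hypothetical and exist only for illustration; with the real libraries installed, the same helpers could be pointed at the actual receiving class instead.

```python
import inspect
from typing import Any, Callable, Dict, List

# Parameter kinds that can be supplied by name.
_BY_NAME = (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)


class StandInEarlyStopping:
    """Hypothetical stand-in for a class that receives `early_stopping_kwargs`."""

    def __init__(self, monitor: str, mode: str = "min", patience: int = 3) -> None:
        self.monitor = monitor
        self.mode = mode
        self.patience = patience


def accepted_parameters(target: Callable[..., Any]) -> List[str]:
    """Return the names of the parameters that `target` accepts by keyword."""
    params = inspect.signature(target).parameters
    return [name for name, param in params.items() if param.kind in _BY_NAME]


def unknown_kwargs(target: Callable[..., Any], kwargs: Dict[str, Any]) -> List[str]:
    """Return the keys of `kwargs` that `target` does not accept by keyword."""
    params = inspect.signature(target).parameters
    # A target that declares **kwargs accepts any key, so nothing can be flagged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    accepted = set(accepted_parameters(target))
    return [key for key in kwargs if key not in accepted]


print(accepted_parameters(StandInEarlyStopping))
# ['monitor', 'mode', 'patience']
print(unknown_kwargs(StandInEarlyStopping, {"monitor": "val/Loss", "patienc": 3}))
# ['patienc']
```

When the receiving object itself forwards `**kwargs` further down, the signature alone does not reveal the valid keys, and the linked source or documentation still has to be consulted.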

Let's walk through defining the parameters of our `LightningAdvancedConfig`.
Tracing back the different kwargs, we find:

1. `task_train_kwargs`
https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-flags - parameters that are passed to the `Lightning Trainer` object.

2. `task_model_kwargs`
https://github.com/CLARIN-PL/embeddings/blob/4292d110691c6c67695fefab74c927dbae9acff7/embeddings/model/lightning_module/lightning_module.py#L19 - parameters that are passed to the `Lightning module` object (we use `TextClassificationModule`, which inherits from `HuggingFaceLightningModule`).

3. `datamodule_kwargs` - https://github.com/CLARIN-PL/embeddings/blob/main/embeddings/data/datamodule.py#L35 - parameters passed to the datamodule classes; currently `HuggingFaceDataModule` takes several arguments as input (such as `max_seq_length`, `processing_batch_size` or the downsampling arguments).

4. `batch_encoding_kwargs` https://github.com/huggingface/transformers/blob/db7d6a80e82d66127b2a44b6e3382969fdc8b207/src/transformers/tokenization_utils_base.py#L2359 - parameters defined in the `__call__` method of the tokenizer, which control how the text is tokenized (truncation, padding, stride, etc.) and specify the return format of the tokenized text.

5. `tokenizer_kwargs` https://github.com/huggingface/transformers/blob/074645e32acda6498f16203a8459bb597610f623/src/transformers/models/auto/tokenization_auto.py#L351
This is a generic configuration class for the Hugging Face model's tokenizer; the possible parameters depend on the tokenizer that is used. For example, for the BERT uncased tokenizer, these parameters are listed here: https://huggingface.co/bert-base-uncased/blob/main/tokenizer_config.json

6. `load_dataset_kwargs`
https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/loading_methods#datasets.load_dataset - keyword arguments of the `datasets.load_dataset` method, which loads a dataset from the Hugging Face Hub or a local dataset; mostly metadata for downloading, loading and caching the dataset.

7. `model_config_kwargs`
https://github.com/huggingface/transformers/blob/074645e32acda6498f16203a8459bb597610f623/src/transformers/models/auto/configuration_auto.py#L515
This is a generic configuration class for the Hugging Face model; the possible parameters depend on the model that is used. For example, for BERT uncased, these parameters are listed here: https://huggingface.co/bert-base-uncased/blob/main/config.json

8. `early_stopping_kwargs`  
https://github.com/PyTorchLightning/pytorch-lightning/blob/5d2d9b09df5359226fea6ad2722592839ac0ebc4/pytorch_lightning/callbacks/early_stopping.py#L35 - parameters defined in `__init__` of the `EarlyStopping` Lightning callback; you can specify a metric to monitor and the conditions under which training stops once that metric stops improving.

9. `dataloader_kwargs`
https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader - parameters defined in `__init__` of the torch `DataLoader` object, which wraps an iterable around the dataset to give easy access to the samples; you can specify parameters such as the number of workers, sampling or shuffling.

In [None]:
embedding_name_or_path = "allegro/herbert-base-cased"
dataset_name = "clarin-pl/polemo2-official"
input_columns_name = "text"
target_column_name = "target"

dataset_path = build_output_path(DATASET_PATH, embedding_name_or_path, dataset_name)
dataset_path.mkdir(parents=True, exist_ok=True)

output_path = build_output_path(RESULTS_PATH, embedding_name_or_path, dataset_name)
output_path.mkdir(parents=True, exist_ok=True)

In [None]:
basic_config = LightningBasicConfig(
    learning_rate=0.01, max_epochs=1, max_seq_length=128, finetune_last_n_layers=0
)

In [None]:
advanced_config = LightningAdvancedConfig(
    finetune_last_n_layers=0,
    datamodule_kwargs={
        "max_seq_length": 64,
    },
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 5e-4,
        "train_batch_size": 32,
        "eval_batch_size": 32,
        "use_scheduler": False,
        "optimizer": "AdamW",
        "adam_epsilon": 1e-8,
        "warmup_steps": 100,
        "weight_decay": 0.0,
    },
    early_stopping_kwargs={
        "monitor": "val/Loss",
        "mode": "min",
        "patience": 3,
    },
    model_config_kwargs={"classifier_dropout": 0.5},
)

In [None]:
pipeline = HuggingFacePreprocessingPipeline(
    dataset_name=dataset_name,
    load_dataset_kwargs={
        "train_domains": ["hotels", "medicine"],
        "dev_domains": ["hotels", "medicine"],
        "test_domains": ["hotels", "medicine"],
        "text_cfg": "text",
    },
    persist_path=str(dataset_path),
    sample_missing_splits=None,
    ignore_test_subset=False,
    downsample_splits=(0.01, 0.01, 0.05),
    seed=441,
)
pipeline.run()

In [None]:
pipeline = LightningClassificationPipeline(
    embedding_name_or_path=embedding_name_or_path,
    dataset_name_or_path=str(dataset_path),
    input_column_name=input_columns_name,
    target_column_name=target_column_name,
    output_path=output_path,
    config=advanced_config,  # basic_config could be passed here instead
)

In [None]:
result = pipeline.run()