In [1]:
import os
os.chdir("..")

from embeddings.defaults import RESULTS_PATH
from embeddings.pipeline.config_space import LightningConfigSpace
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline
from embeddings.utils.utils import build_output_path, format_eval_result

-----

### Lightning classification pipeline

The example below shows which parameters can be set directly in our pipeline, without having to pass them through separate kwargs groups.

We need to define an object of class `LightningConfigSpace`. Doing so lets us set some pipeline parameters directly, for example `learning_rate`, `max_epochs`, and `max_seq_length`, which are used in the configuration cell of this notebook.


When a user wants to modify parameters that this class does not cover by default, they can provide them in two ways:

1. After the object is defined, it can be updated with parameters that belong to a specific group of kwargs.

2. During definition, a specific group of kwargs can be passed to `LightningConfigSpace`.
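
To make the two options concrete, here is a minimal, self-contained sketch of the pattern. `ConfigSketch` and `update_group` are hypothetical stand-ins written for illustration only; they are not the `LightningConfigSpace` implementation, and the real class may name or validate its groups differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for a config object; NOT the library's class."""

    learning_rate: float = 1e-3
    max_epochs: int = 1
    # One example kwargs group; a real config object may hold several groups.
    batch_encoding_kwargs: Dict[str, Any] = field(default_factory=dict)

    def update_group(self, params: Dict[str, Any], group_name: str) -> None:
        """Merge `params` into the kwargs group called `group_name`."""
        group = getattr(self, group_name)
        if not isinstance(group, dict):
            raise ValueError(f"{group_name!r} is not a kwargs group")
        group.update(params)


# Way 1: define the object first, then update one kwargs group.
cfg_updated = ConfigSketch(learning_rate=0.01)
cfg_updated.update_group({"truncation": True}, "batch_encoding_kwargs")

# Way 2: pass the whole kwargs group at definition time.
cfg_direct = ConfigSketch(
    learning_rate=0.01,
    batch_encoding_kwargs={"truncation": True},
)

# Both routes end in the same configuration.
print(cfg_updated == cfg_direct)
```

In this sketch both routes produce equal objects, so the choice between them is mostly a matter of whether the extra parameters are known when the object is created.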

Detailed descriptions of all `kwargs` groups can be found at the following links:
1. `task_train_kwargs`
https://github.com/PyTorchLightning/pytorch-lightning/blob/5d2d9b09df5359226fea6ad2722592839ac0ebc4/pytorch_lightning/trainer/trainer.py#L122 - params that are defined in `__init__`

2. `datamodule_kwargs` - ...

3. `task_model_kwargs`
https://github.com/CLARIN-PL/embeddings/blob/4292d110691c6c67695fefab74c927dbae9acff7/embeddings/model/lightning_module/lightning_module.py#L19 - params that are defined in `__init__`

4. `batch_encoding_kwargs` https://github.com/huggingface/transformers/blob/db7d6a80e82d66127b2a44b6e3382969fdc8b207/src/transformers/tokenization_utils_base.py#L2359 - params that are defined in `__call__` method

5. `tokenizer_kwargs` https://github.com/huggingface/transformers/blob/074645e32acda6498f16203a8459bb597610f623/src/transformers/models/auto/tokenization_auto.py#L351
This is the generic class for a Hugging Face model's tokenizer; the available parameters depend on the tokenizer that is used. For example, the parameters for the BERT uncased tokenizer are listed here: https://huggingface.co/bert-base-uncased/blob/main/tokenizer_config.json

6. `load_dataset_kwargs`
https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader - dataloader kwargs

7. `model_config_kwargs`
https://github.com/huggingface/transformers/blob/074645e32acda6498f16203a8459bb597610f623/src/transformers/models/auto/configuration_auto.py#L515
This is the generic configuration class for a Hugging Face model; the available parameters depend on the model that is used. For example, the parameters for BERT uncased are listed here: https://huggingface.co/bert-base-uncased/blob/main/config.json

8. `early_stopping_kwargs`  
https://github.com/PyTorchLightning/pytorch-lightning/blob/5d2d9b09df5359226fea6ad2722592839ac0ebc4/pytorch_lightning/callbacks/early_stopping.py#L35 - params that are defined in `__init__`
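
Several of the groups above map to the parameters of a class's `__init__`. As a quick way to list such parameters without opening the linked source, the standard-library `inspect` module can be used. The sketch below is an illustration under assumptions: `init_param_names` is a helper introduced here, and `DummyTrainer` with its parameters is a hypothetical stand-in so that the example runs without any third-party package installed.

```python
import inspect
from typing import List


def init_param_names(cls: type) -> List[str]:
    """Return the parameter names accepted by `cls.__init__`, excluding `self`."""
    signature = inspect.signature(cls.__init__)
    return [name for name in signature.parameters if name != "self"]


class DummyTrainer:
    """Hypothetical stand-in for a trainer-like class; not a real library class."""

    def __init__(self, max_epochs: int = 1, seed: int = 0, verbose: bool = False) -> None:
        self.max_epochs = max_epochs
        self.seed = seed
        self.verbose = verbose


# List the keys that a kwargs group targeting this class could contain.
print(init_param_names(DummyTrainer))
```

The same helper could be pointed at the class a given kwargs group is forwarded to, assuming that class is importable in your environment.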

In [2]:
embedding_name_or_path = "allegro/herbert-base-cased"
dataset_name = "clarin-pl/polemo2-official"
input_columns_name = "text"
target_column_name = "target"
root = RESULTS_PATH.joinpath("lightning_sequence_classification")

output_path = build_output_path(root, embedding_name_or_path, dataset_name)
output_path.mkdir(parents=True, exist_ok=True)

In [3]:
config_space = LightningConfigSpace(
    learning_rate=0.01,
    max_epochs=1,
    max_seq_length=128,
)

# Below we provide extra parameters that are not covered by default
# by the `LightningConfigSpace` class
config_space.update_specific_params_group(
    {"truncation": True, "is_split_into_words": True}, "batch_encoding_kwargs"
)

In [4]:
pipeline = LightningClassificationPipeline(
    embedding_name_or_path=embedding_name_or_path,
    dataset_name_or_path=dataset_name,
    input_column_name=input_columns_name,
    target_column_name=target_column_name,
    output_path=output_path,
    config_space=config_space
)

In [None]:
result = pipeline.run()

No config specified, defaulting to: pol_emo2/all_text
Reusing dataset pol_emo2 (/Users/lukaszkoziol/.cache/huggingface/datasets/clarin-pl___pol_emo2/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)


  0%|          | 0/3 [00:00<?, ?it/s]

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

Stringifying the column:   0%|          | 0/7 [00:00<?, ?ba/s]

Casting to class labels:   0%|          | 0/7 [00:00<?, ?ba/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]


Some weights of the model checkpoint at allegro/herbert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.sso.sso_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.sso.sso_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification 

Validation sanity check: 0it [00:00, ?it/s]

  rank_zero_warn(


Training: 0it [00:00, ?it/s]