# LEPISZCZE

> Use cases and examples showing how to train models and submit them to the [LEPISZCZE](https://lepiszcze.ml/) benchmark.

- bibliography: references.bib
- title-block-banner: true

In [2]:
#| default_exp lepiszcze

In [3]:
#| hide
from nbdev.showdoc import *

from IPython.display import display, HTML
display(HTML("<style>.container { max-width:1800px !important;width:auto; }</style>"))

import pandas as pd
pd.set_option('display.max_colwidth', None)


> We recommend reading our NeurIPS paper [@augustyniak2022this], which describes the lessons we learned while designing and compiling the LEPISZCZE benchmark.

In [1]:
#| export 
from pathlib import Path

from embeddings.config.lightning_config import LightningBasicConfig, LightningAdvancedConfig
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline


We start by training a text classifier with `embeddings.pipeline.lightning_classification.LightningClassificationPipeline`.

In [4]:
doc(LightningClassificationPipeline)

In [2]:
#| export 
LEPISZCZE_SUBMISSIONS = Path("../lepiszcze-submissions")
LEPISZCZE_SUBMISSIONS.mkdir(exist_ok=True, parents=True)

In [6]:
#| export
config = LightningBasicConfig(
    learning_rate=0.01, max_epochs=1, max_seq_length=128, finetune_last_n_layers=0
)

In [3]:
advanced_config = LightningAdvancedConfig(
    finetune_last_n_layers=0,
    task_train_kwargs={
        "max_epochs": 1,
        "devices": "auto",
        "accelerator": "cpu",
        "deterministic": True,
    },
    task_model_kwargs={
        "learning_rate": 5e-4,
        "use_scheduler": False,
        "optimizer": "AdamW",
        "adam_epsilon": 1e-8,
        "warmup_steps": 100,
        "weight_decay": 0.0,
    },
    datamodule_kwargs={
        "downsample_train": 0.01,
        "downsample_val": 0.01,
        "downsample_test": 0.05,
    },
    dataloader_kwargs={"num_workers": 0},
    # The four arguments below are required: omitting them raises
    # "TypeError: __init__() missing 4 required positional arguments".
    # The values are assumptions; adjust them to your setup.
    early_stopping_kwargs={"monitor": "val/Loss", "mode": "min", "patience": 3},
    tokenizer_kwargs={},
    batch_encoding_kwargs={},
    model_config_kwargs={},
)
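A config class with many keyword arguments is easy to call incompletely. The sketch below shows one way to list the required constructor arguments up front with the standard-library `inspect` module. `ExampleAdvancedConfig`, `required_init_args`, and `missing_args` are hypothetical names used for illustration only; the stand-in class is not the real `LightningAdvancedConfig`, so the field list here is an assumption.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ExampleAdvancedConfig:
    """Hypothetical stand-in, NOT the real LightningAdvancedConfig."""

    finetune_last_n_layers: int
    task_train_kwargs: Dict[str, Any]
    model_config_kwargs: Dict[str, Any]
    early_stopping_kwargs: Dict[str, Any]
    dataloader_kwargs: Dict[str, Any]


def required_init_args(cls: type) -> List[str]:
    """Return the names of constructor parameters that have no default."""
    params = inspect.signature(cls).parameters.values()
    return [
        p.name
        for p in params
        if p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    ]


def missing_args(cls: type, supplied: Dict[str, Any]) -> List[str]:
    """Return required constructor arguments absent from `supplied`."""
    return [name for name in required_init_args(cls) if name not in supplied]


# Check a kwargs dict before constructing the config.
supplied = {
    "finetune_last_n_layers": 0,
    "task_train_kwargs": {"max_epochs": 1},
    "dataloader_kwargs": {"num_workers": 0},
}
print(missing_args(ExampleAdvancedConfig, supplied))
```

If the `embeddings` package is installed, the same helper can be pointed at `LightningAdvancedConfig` itself to see which arguments it expects in that version.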

In [7]:
#| export
pipeline = LightningClassificationPipeline(
    dataset_name_or_path="clarin-pl/polemo2-official",
    embedding_name_or_path="allegro/herbert-base-cased",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    config=config
)

No config specified, defaulting to: polemo2-official/all_text
Found cached dataset polemo2-official (/root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)
100%|██████████| 3/3 [00:00<00:00, 39.80it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-df7f6639fbb755c8.arrow
  0%|          | 0/1 [00:00<?, ?ba/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-56499cf86dfad548.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-e46c214a0dfb2649.arrow
Casting the dataset:   0%|          | 0/1 [00:0

This takes a few seconds, after which the pipeline object is ready and only needs to be run.

In [8]:
#| export
results = pipeline.run()
print(results)

Some weights of the model checkpoint at allegro/herbert-base-cased were not used when initializing BertForSequenceClassification: ['cls.sso.sso_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.sso.sso_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
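The `huggingface/tokenizers` warning above is harmless, but it can be silenced by setting the environment variable it names. A minimal sketch, assuming it runs in the first notebook cell, before `transformers` or `tokenizers` is used in the process:

```python
import os

# Disable tokenizer-level parallelism so that forked DataLoader workers
# do not trigger the "process just got forked" warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the value to `"true"` instead keeps parallel tokenization enabled, at the deadlock risk the warning describes.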


In [4]:
import pprint

import typer

from embeddings.defaults import RESULTS_PATH
from embeddings.pipeline.flair_classification import FlairClassificationPipeline
from embeddings.utils.utils import build_output_path, format_eval_results

In [6]:
root = "."
embedding_name_or_path = "clarin-pl/word2vec-kgr10"
dataset_name = "clarin-pl/polemo2-official"
input_column_name = "text"
target_column_name = "target"

In [7]:
output_path = build_output_path(root, embedding_name_or_path, dataset_name)
pipeline = FlairClassificationPipeline(
    embedding_name=embedding_name_or_path,
    dataset_name=dataset_name,
    input_column_name=input_column_name,
    target_column_name=target_column_name,
    output_path=output_path,
)

2022-11-13 22:35:56,661 - embeddings.embedding.auto_flair - INFO - clarin-pl/word2vec-kgr10 not compatible with Transformers, trying to initialise as static embedding.
Downloading: 100%|██████████| 76.0/76.0 [00:00<00:00, 39.7kB/s]
Downloading: 100%|██████████| 72.0/72.0 [00:00<00:00, 42.1kB/s]
Downloading: 100%|██████████| 139M/139M [01:01<00:00, 2.27MB/s] 
Downloading:  11%|█         | 289M/2.74G [02:10<18:57, 2.16MB/s]  

In [None]:
result = pipeline.run()