# LEPISZCZE

> Use cases and examples showing how to train models and submit them to [LEPISZCZE](https://lepiszcze.ml/).

- bibliography: ../references.bib
- title-block-banner: true

In [1]:
#| default_exp lepiszcze

In [2]:
#| hide
import pandas as pd
from IPython.display import HTML, display
from nbdev.showdoc import *

In [3]:
#| hide
display(HTML("<style>.container { max-width:1800px !important;width:auto; }</style>"))
pd.set_option('display.max_colwidth', None)

> We recommend reading our NeurIPS paper [@augustyniak2022this], which describes the lessons we learned while designing and compiling the LEPISZCZE benchmark.

In [4]:
from pathlib import Path

from embeddings.config.lightning_config import LightningBasicConfig, LightningAdvancedConfig
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

We start by training a text classifier with `embeddings.pipeline.lightning_classification.LightningClassificationPipeline`.

In [5]:
doc(LightningClassificationPipeline)

In [6]:
LEPISZCZE_SUBMISSIONS = Path("../lepiszcze-submissions")
LEPISZCZE_SUBMISSIONS.mkdir(exist_ok=True, parents=True)

In [7]:
config = LightningBasicConfig(
    learning_rate=0.01, max_epochs=1, max_seq_length=128, finetune_last_n_layers=0
)

In [8]:
pipeline = LightningClassificationPipeline(
    dataset_name_or_path="clarin-pl/polemo2-official",
    embedding_name_or_path="distilbert-base-uncased",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    config=config
)

No config specified, defaulting to: polemo2-official/all_text
Found cached dataset polemo2-official (/root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)
100%|██████████| 3/3 [00:00<00:00, 76.32it/s]
100%|██████████| 1/1 [00:02<00:00,  2.29s/ba]
100%|██████████| 1/1 [00:00<00:00,  9.10ba/s]
100%|██████████| 1/1 [00:00<00:00, 10.65ba/s]
Casting the dataset: 100%|██████████| 7/7 [00:00<00:00,  9.94ba/s]
Casting the dataset: 100%|██████████| 1/1 [00:00<00:00, 10.82ba/s]
Casting the dataset: 100%|██████████| 1/1 [00:00<00:00, 11.31ba/s]


Building the pipeline took a few seconds. The pipeline object is now ready, and all that remains is to run it.

In [10]:
results = pipeline.run()
print(results)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

TypeError: __new__() missing 1 required positional argument: 'task'
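
The `TypeError` above ends the run before any results are printed. The traceback is truncated, so its origin is not shown here. One plausible cause (an assumption, not confirmed by this output) is a dependency newer than the one the pipeline was written against, for example torchmetrics, whose releases from 0.11 onward require a `task` argument when constructing metrics such as `Accuracy`. The following is a minimal sketch for recording the installed versions before digging further. The package list is illustrative, and `installed_versions` is a hypothetical helper, not part of `embeddings`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Illustrative list (an assumption); adjust it to the environment being debugged.
for name, ver in installed_versions(
    ["torchmetrics", "pytorch-lightning", "transformers"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

Comparing these versions with the ones pinned by the installed `embeddings` release is a reasonable first check before changing any pipeline code.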

In [11]:
from embeddings.pipeline.flair_classification import FlairClassificationPipeline
from embeddings.utils.utils import build_output_path

In [12]:
root = "."
embedding_name_or_path = "clarin-pl/word2vec-kgr10"
dataset_name = "clarin-pl/polemo2-official"
input_column_name = "text"
target_column_name = "target"

In [13]:
output_path = build_output_path(root, embedding_name_or_path, dataset_name)
pipeline = FlairClassificationPipeline(
    embedding_name=embedding_name_or_path,
    dataset_name=dataset_name,
    input_column_name=input_column_name,
    target_column_name=target_column_name,
    output_path=output_path,
)

2022-12-16 00:16:44,569 - embeddings.embedding.auto_flair - INFO - clarin-pl/word2vec-kgr10 not compatible with Transformers, trying to initialise as static embedding.


In [14]:
result = pipeline.run()

No config specified, defaulting to: polemo2-official/all_text
Found cached dataset polemo2-official (/root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)
100%|██████████| 3/3 [00:00<00:00, 86.55it/s]
2022-12-16 00:18:30,059 - embeddings.transformation.flair_transformation.corpus_transformation - INFO - Info of ['train', 'validation', 'test']:
{'builder_name': 'polemo2-official',
 'citation': '\n'
             '@inproceedings{kocon-etal-2019-multi,\n'
             '    title = "Multi-Level Sentiment Analysis of {P}ol{E}mo 2.0: '
             'Extended Corpus of Multi-Domain Consumer Reviews",\n'
             '    author = "Koco{\'n}, Jan  and\n'
             '      Mi{\\l}kowski, Piotr  and\n'
             '      Za{\'s}ko-Zieli{\'n}ska, Monika",\n'
             '    booktitle = "Proceedings of the 23rd Conference on '
             'Computational Natural Language Learning (CoNLL)",\n'
            

2022-12-16 00:18:42,849 Computing label dictionary. Progress:


6573it [00:00, 56695.52it/s]

2022-12-16 00:18:42,971 Dictionary created for label 'None' with 5 values: 1 (seen 2469 times), 2 (seen 1824 times), 3 (seen 1309 times), 0 (seen 971 times)





2022-12-16 00:18:44,592 ----------------------------------------------------------------------------------------------------
2022-12-16 00:18:44,594 Model: "TextClassifier(
  (decoder): Linear(in_features=300, out_features=5, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (locked_dropout): LockedDropout(p=0.0)
  (word_dropout): WordDropout(p=0.0)
  (loss_function): CrossEntropyLoss()
  (document_embeddings): DocumentPoolEmbeddings(
    fine_tune_mode=none, pooling=mean
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings(
        '/root/.cache/huggingface/hub/894a9a0a7a7c9e5defa71b9ed26e5699b9394e25d3ebce51d39188935f15ac57.3b0d8f1d834bcf9f436953bd7051d5d9761aae9f482a26ed62e8c21283da012b'
        (embedding): Embedding(2283378, 300)
      )
    )
  )
  (weights): None
  (weight_tensor) None
)"
2022-12-16 00:18:44,596 ----------------------------------------------------------------------------------------------------
2022-12-16 00:18:44,597 Corpus: "Corp
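
The label dictionary in the Flair log above reports how often each class occurs among the 6573 items it scanned: label 1 is seen 2469 times, label 2 1824 times, label 3 1309 times, and label 0 971 times. The following is a minimal sketch that turns those logged counts into class shares, which gives the accuracy a majority-class baseline would reach on that data and a reference point for judging the trained classifier. The counts are copied from the log; the variable names are illustrative.

```python
# Label counts copied from the "Dictionary created" log line above.
label_counts = {1: 2469, 2: 1824, 3: 1309, 0: 971}

total = sum(label_counts.values())  # equals the 6573 items scanned in the log
shares = {label: count / total for label, count in label_counts.items()}

# The most frequent label defines the majority-class baseline.
majority_label = max(label_counts, key=label_counts.get)
majority_share = shares[majority_label]

for label in sorted(shares):
    print(f"label {label}: {label_counts[label]:>5} examples ({shares[label]:.1%})")
print(f"majority-class baseline: label {majority_label}, {majority_share:.1%}")
```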
