# LEPISZCZE

> Use cases and examples showing how to train models and submit them to [LEPISZCZE](https://lepiszcze.ml/).

- bibliography: ../references.bib
- title-block-banner: true

In [2]:
#| default_exp lepiszcze

In [9]:
#| hide
import pandas as pd
from IPython.display import HTML, display  # IPython.core.display is deprecated for these names
from nbdev.showdoc import *


In [10]:
#| hide
display(HTML("<style>.container { max-width:1800px !important;width:auto; }</style>"))
pd.set_option('display.max_colwidth', None)

> We recommend reading our NeurIPS paper [@augustyniak2022this], which describes the lessons we learned while designing and compiling the LEPISZCZE benchmark.

In [11]:
from pathlib import Path

from embeddings.config.lightning_config import LightningBasicConfig, LightningAdvancedConfig
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

We start by training a text classifier with `embeddings.pipeline.lightning_classification.LightningClassificationPipeline`.

In [12]:
doc(LightningClassificationPipeline)

In [13]:
LEPISZCZE_SUBMISSIONS = Path("../lepiszcze-submissions")
LEPISZCZE_SUBMISSIONS.mkdir(exist_ok=True, parents=True)

In [14]:
config = LightningBasicConfig(
    learning_rate=0.01, max_epochs=1, max_seq_length=128, finetune_last_n_layers=0
)

In [15]:
pipeline = LightningClassificationPipeline(
    dataset_name_or_path="clarin-pl/polemo2-official",
    embedding_name_or_path="distilbert-base-uncased",
    input_column_name="text",
    target_column_name="target",
    output_path=".",
    config=config
)

No config specified, defaulting to: polemo2-official/all_text
Found cached dataset polemo2-official (/root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)
100%|██████████| 3/3 [00:00<00:00, 277.11it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-f315ad251f6218f5.arrow
  0%|          | 0/1 [00:00<?, ?ba/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-a652821a41c5b53a.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70/cache-2614cc27a4c1f826.arrow
Casting the dataset:   0%|          | 0/1 [00:

After a couple of seconds the pipeline object is ready, and all that remains is to run it.

In [16]:
results = pipeline.run()
print(results)

No config specified, defaulting to: polemo2-official/all_text
Found cached dataset polemo2-official (/root/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)
100%|██████████| 3/3 [00:00<00:00, 254.26it/s]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
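
Once a run finishes, it can be convenient to keep its metrics next to the other artifacts in the submissions directory created earlier. The snippet below is only a minimal sketch of that idea: `save_metrics`, the file name, and the flat JSON layout are hypothetical helpers chosen for illustration, not the official LEPISZCZE submission format or part of the `embeddings` API. It also assumes the metrics have already been extracted from `results` into a plain mapping of metric names to floats.

```python
import json
import tempfile
from pathlib import Path
from typing import Mapping


def save_metrics(metrics: Mapping[str, float], directory: Path, run_name: str) -> Path:
    """Write a flat metric mapping to <directory>/<run_name>.json and return the path.

    Hypothetical helper: the layout is an assumption, not the LEPISZCZE format.
    """
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{run_name}.json"
    with target.open("w", encoding="utf-8") as handle:
        json.dump(dict(metrics), handle, indent=2, sort_keys=True)
    return target


# Placeholder values only; these are NOT results from the run above.
example_metrics = {"accuracy": 0.0, "f1_macro": 0.0}

# A temporary directory stands in for LEPISZCZE_SUBMISSIONS so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_metrics(example_metrics, Path(tmp) / "lepiszcze-submissions", "polemo2_distilbert")
    print(saved.name)
    print(json.loads(saved.read_text(encoding="utf-8")))
```

In the notebook itself, the same helper could be pointed at `LEPISZCZE_SUBMISSIONS` instead of a temporary directory, with the real metrics taken from `results`.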


In [4]:
import pprint

import typer

from embeddings.defaults import RESULTS_PATH
from embeddings.pipeline.flair_classification import FlairClassificationPipeline
from embeddings.utils.utils import build_output_path, format_eval_results

In [6]:
root = "."
embedding_name_or_path = "clarin-pl/word2vec-kgr10"
dataset_name = "clarin-pl/polemo2-official"
input_column_name = "text"
target_column_name = "target"

In [7]:
output_path = build_output_path(root, embedding_name_or_path, dataset_name)
pipeline = FlairClassificationPipeline(
    embedding_name=embedding_name_or_path,
    dataset_name=dataset_name,
    input_column_name=input_column_name,
    target_column_name=target_column_name,
    output_path=output_path,
)

2022-11-13 22:35:56,661 - embeddings.embedding.auto_flair - INFO - clarin-pl/word2vec-kgr10 not compatible with Transformers, trying to initialise as static embedding.
Downloading: 100%|██████████| 76.0/76.0 [00:00<00:00, 39.7kB/s]
Downloading: 100%|██████████| 72.0/72.0 [00:00<00:00, 42.1kB/s]
Downloading: 100%|██████████| 139M/139M [01:01<00:00, 2.27MB/s] 
Downloading: 100%|██████████| 2.74G/2.74G [20:30<00:00, 2.23MB/s] 


In [None]:
# result = pipeline.run()