# MMS - Example training pipeline

This is a simple example of how the MMS dataset can be used. We will fine-tune a transformer model for sentiment classification on Polish social media posts. To limit the number of training examples, we will use the cleanlab self-confidence score to select high-quality texts for training.

In [None]:
#| eval: false
!pip install datasets transformers==4.30.0 torch sacremoses scikit-learn evaluate accelerate

In [None]:
#| eval: false
import os

import evaluate
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

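Our dataset is publicly available, but you need to accept its conditions first. Please see [this link](https://huggingface.co/datasets/Brand24/mms) and accept the terms. Gated datasets also require an authenticated session, for example via `huggingface-cli login`.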

In [10]:
#| eval: false
mms_dataset = load_dataset("Brand24/mms")


There are 14 different dimensions that differentiate the collected datasets. In addition, there is a pre-calculated cleanlab self-confidence score for each sample. All of them can be used to select the examples that best suit our use case.

In [9]:
#| eval: false
mms_dataset.column_names


Select only samples that are in Polish and come from social media.

In [None]:
#| eval: false
pl_sm = mms_dataset["train"].filter(lambda x: x["language"] == "pl" and x["domain"] == "social_media")
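The filter predicate can also be written as a named function, which makes it easier to check on a few rows before running it over the full split. The sketch below is a minimal, dependency-free illustration: the `is_polish_social_media` helper and the sample rows (including the `reviews` domain value) are hypothetical, and only the `language` and `domain` column names and the `pl` / `social_media` values come from the cell above.

```python
def is_polish_social_media(row):
    """Return True for rows written in Polish that come from social media."""
    return row["language"] == "pl" and row["domain"] == "social_media"


# Hypothetical rows that mimic the two columns used by the filter.
sample_rows = [
    {"language": "pl", "domain": "social_media"},
    {"language": "pl", "domain": "reviews"},
    {"language": "en", "domain": "social_media"},
]

kept = [row for row in sample_rows if is_polish_social_media(row)]
print(len(kept))
```

The same function could then be passed to the dataset directly, as in `mms_dataset["train"].filter(is_polish_social_media)`.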

To achieve higher performance, we will keep only samples with a high self-confidence score (above 0.6 here).

In [None]:
#| eval: false
pl_sm_high_confidence = pl_sm.filter(lambda x: x["cleanlab_self_confidence"] > 0.6)

In [None]:
#| eval: false
len(pl_sm_high_confidence)
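The 0.6 threshold trades dataset size against label quality, so it can help to check how many examples survive at several cut-offs before committing to one. A minimal sketch, assuming the scores are available as a plain list; the `count_above` helper and the score values are hypothetical placeholders, not values taken from MMS.

```python
def count_above(scores, threshold):
    """Count scores strictly greater than the threshold, as in the filter above."""
    return sum(1 for score in scores if score > threshold)


# Hypothetical self-confidence scores, for illustration only.
scores = [0.15, 0.42, 0.61, 0.75, 0.93]

for threshold in (0.4, 0.6, 0.8):
    print(threshold, count_above(scores, threshold))
```

With the real data, the scores would come from the `cleanlab_self_confidence` column of `pl_sm`.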

We will use these examples to fine-tune HerBERT, a Polish version of the BERT model.

In [None]:
#| eval: false
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)
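With `padding="max_length"` and `truncation=True`, every tokenized example ends up with the same length: shorter sequences are padded and longer ones are cut. The sketch below illustrates that idea on plain lists of token ids. It is not the tokenizer's implementation (a real tokenizer also handles special tokens); `pad_or_truncate`, the `pad_id` value, and the toy ids are assumptions made for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return a list of exactly max_length ids and a matching attention mask."""
    clipped = token_ids[:max_length]
    padding = max_length - len(clipped)
    input_ids = clipped + [pad_id] * padding
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [1] * len(clipped) + [0] * padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example is padded to the maximum length, memory use grows with that length; passing an explicit `max_length` to the tokenizer call is one way to cap it for short social media posts.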

In [None]:
#| eval: false
tokenized_dataset = pl_sm_high_confidence.map(tokenize, batched=True, batch_size=512)

In [None]:
#| eval: false
model = AutoModelForSequenceClassification.from_pretrained("allegro/herbert-base-cased", num_labels=3)

In [None]:
#| eval: false
split_dataset = tokenized_dataset.train_test_split(test_size=0.1)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]
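The split above is random, so the evaluation set changes from run to run unless a seed is fixed (`train_test_split` accepts a `seed` argument for this). The dependency-free sketch below shows the underlying idea of a seeded 90/10 split; `seeded_split` is a hypothetical helper, not the `datasets` implementation.

```python
import random


def seeded_split(items, test_size=0.1, seed=42):
    """Shuffle a copy of items with a fixed seed and split off a test fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    # First n_test shuffled items form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]


train_a, test_a = seeded_split(range(20))
train_b, test_b = seeded_split(range(20))
print(len(train_a), len(test_a))
print(test_a == test_b)
```

Fixing the seed makes accuracy numbers comparable across runs, since each run is evaluated on the same held-out examples.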

In [None]:
#| eval: false
training_args = TrainingArguments(
    output_dir="PL_SM_SENT",
    evaluation_strategy="epoch",
    num_train_epochs=5,
)
metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
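`compute_metrics` turns the model's raw logits into class predictions with an argmax and then reports accuracy. Below is a minimal sketch of the same computation without `evaluate` or NumPy, using made-up logits for three classes; the helper names and the numbers are illustrative only.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four examples and three classes.
logits = [
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [-0.5, 0.0, 0.9],
    [1.2, 0.4, 0.8],
]
labels = [0, 1, 2, 2]
print(accuracy(logits, labels))
```

Here three of the four predictions match their labels, so the sketch reports an accuracy of 0.75.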

In [None]:
#| eval: false
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
#| eval: false
trainer.train()