<a href="https://colab.research.google.com/github/LUMII-AILab/NLP_Course/blob/main/notebooks/MSP/TextClassificationWithBERT.ipynb" target="_new"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>

# Text Classification with Logistic Regression and BERT

Note: In the Colab runtime settings, select the free GPU (T4).

## The GoEmotions dataset

* Paper: https://aclanthology.org/2020.acl-main.372/
* Original dataset: https://github.com/google-research/google-research/tree/master/goemotions
* Preprocessed **EN** version: https://huggingface.co/datasets/google-research-datasets/go_emotions
* Preprocessed **LV** version: https://huggingface.co/datasets/AiLab-IMCS-UL/go_emotions-lv

## Logistic regression

### Setting up the environment

In [None]:
!pip install scikit-learn
!pip install nltk

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### Downloading the dataset

In [None]:
!wget -q https://raw.githubusercontent.com/google-research/google-research/refs/heads/master/goemotions/data/dev.tsv
!wget -q https://raw.githubusercontent.com/google-research/google-research/refs/heads/master/goemotions/data/train.tsv
!wget -q https://raw.githubusercontent.com/google-research/google-research/refs/heads/master/goemotions/data/test.tsv
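
The downloaded files are tab-separated. Below is a minimal sketch of reading them with the standard library. It assumes each row holds the comment text, a comma-separated list of label ids, and a comment id; this column layout, the helper name `read_goemotions_tsv`, and the sample rows are assumptions for illustration and should be checked against the actual files.

```python
import csv
import io

def read_goemotions_tsv(handle):
    """Return (text, [label ids]) pairs from a GoEmotions-style TSV stream."""
    rows = []
    # QUOTE_NONE: treat quote characters inside comments as ordinary text
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        text, label_field = row[0], row[1]
        labels = [int(x) for x in label_field.split(",")]
        rows.append((text, labels))
    return rows

# Hypothetical sample rows standing in for train.tsv
sample = io.StringIO(
    "That game was great!\t0\tid001\n"
    "Thanks, I guess...\t15,27\tid002\n"
)
rows = read_goemotions_tsv(sample)

# Keep only rows with exactly one label and unwrap that label
single_label = [(text, labels[0]) for text, labels in rows if len(labels) == 1]
print(single_label)
```

With the real files, pass `open("train.tsv", encoding="utf-8", newline="")` in place of the `StringIO` object.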

## BERT

### Setting up the environment

In [None]:
!pip install transformers
!pip install datasets

In [13]:
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import load_dataset

### Choosing a BERT model and loading the tokenizer into RAM

* Official Google BERT models (`base` and `large` versions): https://huggingface.co/google-bert
* Unofficial smaller BERT versions, e.g., `small`: https://huggingface.co/prajjwal1/bert-small

Note: You must use the tokenizer that matches the chosen model.

In [14]:
# Load the tokenizer of the chosen BERT model from Hugging Face into CPU memory
bert_tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')
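
To illustrate why the tokenizer must match the model: BERT tokenizers split words into subword pieces from a fixed vocabulary, and the model looks up embeddings by the ids of exactly that vocabulary. Below is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. The function `wordpiece` and the toy vocabularies `vocab_a` and `vocab_b` are hypothetical, and this is not the `BertTokenizer` implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Two hypothetical toy vocabularies mapping pieces to ids
vocab_a = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3}
vocab_b = {"[UNK]": 0, "pl": 1, "##ay": 2, "##ing": 3}

for vocab in (vocab_a, vocab_b):
    pieces = wordpiece("playing", vocab)
    print(pieces, [vocab[p] for p in pieces])
```

The same word yields different pieces and ids under each vocabulary, so ids produced by one tokenizer carry no meaning for a model trained with another.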

### Loading and preprocessing the dataset

In [15]:
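def is_single_label(sample):
    """Keep only samples annotated with exactly one emotion label."""
    value = sample["labels"]
    if isinstance(value, (list, tuple)):
        return len(value) == 1
    else:
        return False

def to_int_label(sample):
    """Replace the one-element label list with its single integer label."""
    return {"labels": sample["labels"][0]}

def tokenize(batch):
    """Tokenize a batch of texts, truncating to the model's maximum length."""
    return bert_tokenizer(batch["text"], truncation=True)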
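
The effect of the two label helpers can be checked on a few hand-written samples before running them on the full dataset. This is a minimal sketch: the samples are hypothetical, and the dictionary merge only imitates how `Dataset.map` updates each example with the returned fields.

```python
def is_single_label(sample):
    value = sample["labels"]
    return isinstance(value, (list, tuple)) and len(value) == 1

def to_int_label(sample):
    return {"labels": sample["labels"][0]}

# Hypothetical samples in the shape of the "simplified" configuration
samples = [
    {"text": "Great game!", "labels": [0]},
    {"text": "Thanks, I guess...", "labels": [15, 27]},
    {"text": "No labels here", "labels": []},
]

# Filter first, then overwrite the label list with its single integer
kept = [{**s, **to_int_label(s)} for s in samples if is_single_label(s)]
print(kept)
```

Only the first sample survives, and its label becomes a plain integer, which is the format a single-label classification head expects.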

In [None]:
# Load the "simplified" configuration of GoEmotions from Hugging Face
data_set = load_dataset("google-research-datasets/go_emotions", "simplified")

# Keep single-label samples only and turn each label list into a plain integer
filtered_data_set = data_set.filter(is_single_label)
flattened_data_set = filtered_data_set.map(to_int_label)

# Tokenize the texts and keep only the columns used for training
tokenized_data_set = flattened_data_set.map(tokenize, batched=True)
final_data_set = tokenized_data_set.select_columns(["input_ids", "labels"])

# Show the first training sample after each preprocessing step
print("data_set:", data_set["train"][0])
print("filtered_data_set:", filtered_data_set["train"][0])
print("flattened_data_set:", flattened_data_set["train"][0])
print("tokenized_data_set:", tokenized_data_set["train"][0])
print("final_data_set:", final_data_set["train"][0])

# Pick out the predefined splits
train_set = final_data_set["train"]
validation_set = final_data_set["validation"]
test_set = final_data_set["test"]

### Preparing to fine-tune the base model

In [None]:
# Determine the number of distinct classes in the training set
label_count = len(data_set["train"].features["labels"].feature.names)
print("label_count", label_count)

# Load the chosen BERT model into RAM and attach a matching classification head
bert_model = BertForSequenceClassification.from_pretrained(
    'google-bert/bert-base-uncased', num_labels=label_count
)

# Define a simple evaluation metric: accuracy
def eval_metrics(p):
    preds = p.predictions.argmax(-1)
    return {"accuracy": float((preds == p.label_ids).mean())}

# Specify the training hyperparameters
args = TrainingArguments(
    output_dir = "bert-base-uncased-go_emotions",
    learning_rate = 2e-5,              # typical for BERT models
    per_device_train_batch_size = 64,  # depends on GPU memory; may affect the results
    per_device_eval_batch_size = 128,  # depends on GPU memory
    num_train_epochs = 5,
    fp16 = True,                       # for speed on the T4
    metric_for_best_model = "accuracy",
    save_strategy = "epoch",
    eval_strategy = "epoch",
    load_best_model_at_end = True,
    report_to = "none"                 # do not use the W&B service
)

# Create the training "engine"
trainer = Trainer(
    model = bert_model,
    args = args,
    train_dataset = train_set,
    eval_dataset = validation_set,
    compute_metrics = eval_metrics,
    processing_class = bert_tokenizer,
    data_collator = DataCollatorWithPadding(bert_tokenizer)
)
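
`DataCollatorWithPadding` pads each batch only to the length of its longest sequence instead of to one global maximum. A conceptual sketch of that idea in plain Python follows. The function name `pad_batch`, the pad id `0`, and the token ids are assumptions for illustration; the real collator also converts the result to tensors.

```python
def pad_batch(features, pad_id=0):
    """Pad every `input_ids` list to the longest one in the batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = longest - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        batch["labels"].append(f["labels"])
    return batch

# Hypothetical token ids; real ids come from the tokenizer
features = [
    {"input_ids": [101, 7592, 102], "labels": 3},
    {"input_ids": [101, 7592, 2088, 999, 102], "labels": 0},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short comments from being stretched to the length of the longest text in the whole dataset, which saves GPU memory and time.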

### Fine-tuning the base model for the classification task

In [27]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.8883 | 1.400555 | 0.600704 |
| 2 | 1.3204 | 1.289372 | 0.622911 |
| 3 | 1.1478 | 1.281488 | 0.620053 |
| 4 | 1.0374 | 1.299102 | 0.621152 |
| 5 | 0.9286 | 1.329054 | 0.614336 |


TrainOutput(global_step=2840, training_loss=1.2214983658051826, metrics={'train_runtime': 682.411, 'train_samples_per_second': 266.027, 'train_steps_per_second': 4.162, 'total_flos': 3611501508638112.0, 'train_loss': 1.2214983658051826, 'epoch': 5.0})
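
The reported `global_step` can be cross-checked against the other logged numbers. This is a small sketch under two assumptions: training ran on a single device, so one optimizer step consumes one batch of 64 samples, and the throughput figure counts every sample in every epoch.

```python
import math

train_runtime = 682.411       # seconds, from TrainOutput
samples_per_second = 266.027  # from TrainOutput
epochs = 5
batch_size = 64               # per_device_train_batch_size, single GPU assumed

# Total samples processed over all epochs, divided into one epoch
train_size = round(train_runtime * samples_per_second / epochs)

# The last, smaller batch of each epoch still counts as a step
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(train_size, steps_per_epoch, total_steps)
```

The estimated total matches `global_step=2840` in the output above, which suggests the single-label filter left roughly 36 thousand training samples.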

### Testing the best fine-tuned version

In [28]:
trainer.evaluate(test_set)

{'eval_loss': 1.2845581769943237,
 'eval_accuracy': 0.6250544662309369,
 'eval_runtime': 2.4701,
 'eval_samples_per_second': 1858.224,
 'eval_steps_per_second': 14.574,
 'epoch': 5.0}
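
With `load_best_model_at_end = True` and `metric_for_best_model = "accuracy"`, the test evaluation should run on the checkpoint with the highest validation accuracy rather than on the final one. A minimal sketch that picks that epoch from the training log, with the values copied from the log above:

```python
# (epoch, training loss, validation loss, validation accuracy) from the training log
log = [
    (1, 1.8883, 1.400555, 0.600704),
    (2, 1.3204, 1.289372, 0.622911),
    (3, 1.1478, 1.281488, 0.620053),
    (4, 1.0374, 1.299102, 0.621152),
    (5, 0.9286, 1.329054, 0.614336),
]

# Highest accuracy and lowest validation loss need not fall on the same epoch
best_by_accuracy = max(log, key=lambda row: row[3])
best_by_val_loss = min(log, key=lambda row: row[2])

print("best accuracy:", best_by_accuracy[0], best_by_accuracy[3])
print("lowest validation loss:", best_by_val_loss[0], best_by_val_loss[2])
```

Accuracy peaks at epoch 2 while validation loss bottoms out at epoch 3, and training loss keeps falling throughout, which points to overfitting in the later epochs. Which checkpoint gets restored therefore depends on the metric chosen.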