
## Text classification with a fine-tuned BERT model

##### 1. Installing dependencies

In [None]:
!pip install datasets=="2.14.6" transformers=="4.35.0" accelerate=="0.24.1"

In [29]:
import numpy as np
import pandas as pd
from datasets import load_dataset

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, f1_score

In [4]:
pd.set_option('max_colwidth', 100)

RANDOM_STATE = 42

##### 2. Loading the dataset and tokenizer

In [6]:
dataset = load_dataset("ag_news")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Output: the `ag_news` train split has 120000 examples and the test split has 7600 examples.

##### 3. Preparing the train and test subsets

In [8]:
def tokenize_function(batch):
    return tokenizer(batch["text"],
                     padding="max_length",
                     truncation=True,
                     max_length=256)
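
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch on made-up token ids. The helper `pad_or_truncate` and the pad id of 0 are assumptions for illustration and are not part of the `transformers` API. The real tokenizer also manages special tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    # Truncation: keep only the first max_length tokens.
    ids = list(token_ids[:max_length])
    # Attention mask: 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    # Padding: fill the remaining positions with pad_id.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence is padded, a long one is cut off.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=256`, every example occupies 256 positions regardless of its real length. That is simple, but it can waste compute on short texts; padding per batch is a common alternative.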

In [9]:
# Take the first 10,000 training and 2,000 test examples and tokenize them in batches
train_data = dataset['train'].select(range(10000)).map(tokenize_function, batched=True)
test_data = dataset['test'].select(range(2000)).map(tokenize_function, batched=True)


## Model validation

##### 1. Fine-tuning the model

In [10]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
training_args = TrainingArguments(
    output_dir="./results",    # checkpoints are written here
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    do_train=True,
    do_eval=True,
    save_strategy="epoch",     # save a checkpoint at the end of each epoch
    seed=RANDOM_STATE,
)

In [22]:
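def compute_metrics(p):
    # p unpacks into (logits, labels) as supplied by Trainer during evaluation
    pred, labels = p
    pred = np.argmax(pred, axis=1)  # predicted class = index of the largest logit
    accuracy = np.mean(pred == labels)
    f1 = f1_score(labels, pred, average='macro')  # unweighted mean of per-class F1
    return {"accuracy": accuracy,
            "f1": f1}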

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    compute_metrics=compute_metrics,
)
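
The `compute_metrics` function passed to `Trainer` turns logits into class predictions and then scores them. The sketch below walks through the same steps on four made-up logit rows using only NumPy. The hand-written `macro_f1` helper is an illustrative stand-in for `f1_score(..., average='macro')`, not the scikit-learn implementation.

```python
import numpy as np


def macro_f1(labels, preds, num_classes):
    """Unweighted mean of per-class F1 (illustrative helper)."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


# Made-up logits for 4 examples and 4 classes.
logits = np.array([
    [2.0, 0.1, 0.3, 0.0],
    [0.2, 1.5, 0.1, 0.4],
    [0.1, 0.2, 0.3, 2.2],
    [1.1, 0.9, 0.2, 0.1],
])
labels = np.array([0, 1, 2, 0])

preds = np.argmax(logits, axis=1)           # index of the largest logit per row
accuracy = float(np.mean(preds == labels))  # share of correct predictions
f1 = macro_f1(labels, preds, num_classes=4)
print(preds, accuracy, f1)
```

Here three of four predictions are correct, yet macro F1 is lower than accuracy because classes 2 and 3 each contribute an F1 of zero. Macro averaging weights every class equally, so it penalizes a model that ignores a class.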

In [24]:
trainer.train()

| Step | Training Loss |
|---|---|
| 500 | 0.2125 |
| 1000 | 0.1395 |


TrainOutput(global_step=1250, training_loss=0.16620638885498046, metrics={'train_runtime': 433.3491, 'train_samples_per_second': 46.152, 'train_steps_per_second': 2.885, 'total_flos': 1324721233920000.0, 'train_loss': 0.16620638885498046, 'epoch': 2.0})
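
The reported `global_step=1250` can be cross-checked from the subset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation, neither of which the notebook states explicitly.

```python
import math

num_train_examples = 10_000        # size of the training subset
per_device_train_batch_size = 16
num_train_epochs = 2
num_devices = 1                    # assumption: one GPU, no gradient accumulation

steps_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * num_devices)
)
total_steps = steps_per_epoch * num_train_epochs
# Global step reached at the end of each epoch.
epoch_end_steps = [steps_per_epoch * e for e in range(1, num_train_epochs + 1)]
print(steps_per_epoch, total_steps, epoch_end_steps)
```

Under these assumptions the total matches the logged 1250 steps. Because `Trainer` names checkpoint directories `checkpoint-<global_step>`, a run with `save_strategy="epoch"` would be expected to write its checkpoints at the epoch-end steps computed here.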

##### 2. Comparison

In [26]:
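# Compare the base model (randomly initialized classification head) with a fine-tuned checkpoint.
# The checkpoint path must point to a directory that actually exists under ./results.
# With save_strategy="epoch" and 1250 total steps, the run above would be expected to write
# checkpoint-625 and checkpoint-1250, so checkpoint-200 presumably comes from an earlier run.
models = {
    'Before fine-tuning': AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4),
    'After fine-tuning': AutoModelForSequenceClassification.from_pretrained("./results/checkpoint-200", num_labels=4)
}

# Use separate names in the loop so the trained `model` and `trainer` are not overwritten.
for model_name, eval_model in models.items():
    eval_trainer = Trainer(model=eval_model, compute_metrics=compute_metrics)
    results = eval_trainer.evaluate(eval_dataset=test_data)
    predictions = eval_trainer.predict(test_data)
    pred_labels = np.argmax(predictions.predictions, axis=1)

    print(f"\n{model_name}:")

    metrics_df = pd.DataFrame({
        'Metric': ['Accuracy', 'F1 Score'],
        'Value': [results['eval_accuracy'], results['eval_f1']]
    })
    display(metrics_df)

    print("\nConfusion Matrix:")
    display(pd.DataFrame(confusion_matrix(test_data['label'], pred_labels)))
    print("\nClassification Report:")
    report = classification_report(test_data['label'],
                                   pred_labels,
                                   output_dict=True,
                                   zero_division=0)
    display(pd.DataFrame.from_dict(report).T)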

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Before fine-tuning:


| Metric | Value |
|---|---|
| Accuracy | 0.217 |
| F1 Score (macro) | 0.154328 |



Confusion Matrix:


| True \ Predicted | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 186 | 4 | 321 | 0 |
| 1 | 44 | 8 | 471 | 3 |
| 2 | 205 | 4 | 240 | 0 |
| 3 | 248 | 14 | 252 | 0 |



Classification Report:


| Class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.272328 | 0.363992 | 0.311558 | 511.0 |
| 1 | 0.266667 | 0.015209 | 0.028777 | 526.0 |
| 2 | 0.186916 | 0.534521 | 0.276976 | 449.0 |
| 3 | 0.0 | 0.0 | 0.0 | 514.0 |
| accuracy | 0.217 | 0.217 | 0.217 | 0.217 |
| macro avg | 0.181478 | 0.228431 | 0.154328 | 2000.0 |
| weighted avg | 0.181676 | 0.217 | 0.149353 | 2000.0 |
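
As a sanity check on the "Before fine-tuning" numbers, the sketch below compares the observed accuracy with two trivial baselines computed from the support column of the report above: uniform random guessing and always predicting the most frequent class. The variable names are illustrative.

```python
# Per-class support taken from the classification report above.
support = {0: 511, 1: 526, 2: 449, 3: 514}
total = sum(support.values())

# Expected accuracy when guessing uniformly among the 4 classes.
uniform_random_accuracy = 1 / len(support)
# Accuracy when always predicting the most frequent class.
majority_accuracy = max(support.values()) / total
# Accuracy reported for the model before fine-tuning.
observed_accuracy = 0.217

print(total, uniform_random_accuracy, majority_accuracy, observed_accuracy)
```

The observed accuracy sits below both baselines, which is consistent with a randomly initialized classification head rather than a learned one.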



After fine-tuning:


| Metric | Value |
|---|---|
| Accuracy | 0.904 |
| F1 Score (macro) | 0.901347 |



Confusion Matrix:


| True \ Predicted | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 438 | 19 | 30 | 24 |
| 1 | 5 | 513 | 6 | 2 |
| 2 | 12 | 0 | 365 | 72 |
| 3 | 5 | 3 | 14 | 492 |



Classification Report:


| Class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.952174 | 0.857143 | 0.902163 | 511.0 |
| 1 | 0.958879 | 0.975285 | 0.967012 | 526.0 |
| 2 | 0.879518 | 0.812918 | 0.844907 | 449.0 |
| 3 | 0.833898 | 0.957198 | 0.891304 | 514.0 |
| accuracy | 0.904 | 0.904 | 0.904 | 0.904 |
| macro avg | 0.906117 | 0.900636 | 0.901347 | 2000.0 |
| weighted avg | 0.907229 | 0.904 | 0.903574 | 2000.0 |
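
The figures in the "After fine-tuning" report follow directly from its confusion matrix, where rows are true labels and columns are predicted labels. The NumPy sketch below recomputes them; it is a cross-check of the reported values, not the scikit-learn implementation.

```python
import numpy as np

# Confusion matrix after fine-tuning (rows: true class, columns: predicted class).
cm = np.array([
    [438,  19,  30,  24],
    [  5, 513,   6,   2],
    [ 12,   0, 365,  72],
    [  5,   3,  14, 492],
])

true_pos = np.diag(cm)             # correctly classified examples per class
support = cm.sum(axis=1)           # number of true examples per class
predicted = cm.sum(axis=0)         # number of predictions per class

precision = true_pos / predicted
recall = true_pos / support
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / cm.sum()
macro_f1 = f1.mean()

print(np.round(precision, 6), np.round(recall, 6), np.round(f1, 6))
print(accuracy, round(float(macro_f1), 6))
```

The largest off-diagonal entry is 72 examples of true class 2 predicted as class 3, which explains why class 2 has the lowest recall and class 3 the lowest precision. In AG News these labels correspond to Business and Sci/Tech.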
