Homework
1. Pick a pre-trained text sentiment classification model from https://huggingface.co/models.
2. Run predictions on the whole of df_val and compute a quality metric.
3. Fine-tune the model on df_train and compute the quality metric on df_val.

Data on Google Drive: https://drive.google.com/file/d/1Mev_EEput0LlBj8MDHIJkBtahlJ6J901
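Steps 2 and 3 both ask for a quality metric on df_val. As a minimal sketch of what such a metric computes, the helpers below derive accuracy and binary F1 from two label lists. The names `accuracy` and `binary_f1` are illustrative and do not come from the notebook; a library such as torchmetrics would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def binary_f1(y_true, y_pred, positive=1):
    """F1 score of the `positive` class for two equal-length label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, not taken from the dataset.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), binary_f1(y_true, y_pred))
```

On these toy labels both values come out at 2/3: four of six predictions are correct, and precision and recall are each 2/3.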

**Importing libraries**

In [50]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [51]:
!pip install torchmetrics

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [52]:
import numpy as np
import pandas as pd
from google.colab import drive

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam


from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

from tqdm import tqdm
from torchmetrics import F1Score

In [53]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [54]:
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [55]:
df_train = pd.read_csv("/content/drive/MyDrive/data/train.csv")
df_val = pd.read_csv("/content/drive/MyDrive/data/val.csv")

df_train.shape, df_val.shape

((181467, 3), (22683, 3))

In [56]:
df_train.head()

Unnamed: 0,id,text,class
0,0,@alisachachka не уезжаааааааай. :(❤ я тоже не ...,0
1,1,RT @GalyginVadim: Ребята и девчата!\nВсе в кин...,1
2,2,RT @ARTEM_KLYUSHIN: Кто ненавидит пробки ретви...,0
3,3,RT @epupybobv: Хочется котлету по-киевски. Зап...,1
4,4,@KarineKurganova @Yess__Boss босапопа есбоса н...,1


In [57]:
tokenizer = BertTokenizer.from_pretrained('cointegrated/rubert-tiny-toxicity')
model = BertForSequenceClassification.from_pretrained('cointegrated/rubert-tiny-toxicity')

In [58]:
# Sanity-check the model through the pipeline API
sentiment = pipeline("text-classification", model='cointegrated/rubert-tiny-toxicity')
sentiment("Хорошая погода")  # "Nice weather"

[{'label': 'non-toxic', 'score': 0.9994722008705139}]

In [59]:
# Sanity-check the tokenizer

example_text = 'Пример текста для токенизации'  # "Example text for tokenization"
bert_input = tokenizer(example_text, padding='max_length', max_length=15,
                       truncation=True, return_tensors="pt")


print(bert_input['input_ids'])
print(bert_input['attention_mask'])

tensor([[    2,  3086, 10885, 22723,   871, 24302,  3464, 10880,     3,     0,
             0,     0,     0,     0,     0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])
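The two tensors above show what `padding='max_length'` does: nine real token ids (starting with 2 and ending with 3, presumably the [CLS] and [SEP] markers) are followed by six padding ids of 0, and the attention mask flags which positions are real. A simplified pure-Python illustration of that behaviour follows; `pad_to_max_length` is a hypothetical helper, not the tokenizer's actual implementation.

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    # Simplified truncation: the real tokenizer shortens the text first and
    # then still appends the closing special token.
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding


# Token ids copied from the tokenizer output above.
ids = [2, 3086, 10885, 22723, 871, 24302, 3464, 10880, 3]
input_ids, attention_mask = pad_to_max_length(ids, max_length=15)
print(input_ids)
print(attention_mask)
```

The same example sentence already needs nine tokens, which is worth keeping in mind when choosing `max_length` for tweets.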


In [60]:
# Print the model architecture and count its parameters
print(model)
print("Parameters full train:", sum([param.nelement() for param in model.parameters()]))

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29564, 312, padding_idx=0)
      (position_embeddings): Embedding(512, 312)
      (token_type_embeddings): Embedding(2, 312)
      (LayerNorm): LayerNorm((312,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-2): 3 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=312, out_features=312, bias=True)
              (key): Linear(in_features=312, out_features=312, bias=True)
              (value): Linear(in_features=312, out_features=312, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=312, out_features=312, bias=True)
              (LayerNorm): LayerNorm((312,), eps=1e-12, e
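The shapes in the printout are enough to work out by hand where most of the parameters sit. The arithmetic below is a sketch that uses only the embedding and `Linear` shapes shown above; it does not try to reconstruct the total, because the printout is cut off.

```python
hidden = 312          # embedding width from the printout
vocab_size = 29564    # rows of word_embeddings
max_positions = 512   # rows of position_embeddings
token_types = 2       # rows of token_type_embeddings

embedding_params = (
    vocab_size * hidden        # word_embeddings
    + max_positions * hidden   # position_embeddings
    + token_types * hidden     # token_type_embeddings
    + 2 * hidden               # LayerNorm weight and bias
)

# One Linear(312 -> 312) with bias, e.g. the query projection.
linear_params = hidden * hidden + hidden

print(embedding_params, linear_params)  # 9384960 97656
```

The embedding block alone holds about 9.4 million parameters, roughly a hundred times more than a single attention projection.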

**Data preparation**

In [61]:
idx = 20
print(df_train.iloc[idx]['text'])
print('label is', df_train.iloc[idx]['class'])
sentiment(df_train.iloc[idx]['text'])

Австрия ждеееет! Последние формальности уладили) Буду сегодня паковать чемодан!))
label is 1


[{'label': 'non-toxic', 'score': 0.9992603659629822}]
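The checkpoint predicts toxicity labels, while the `class` column holds 0/1 sentiment labels, so its output cannot be compared with the targets directly. One crude way to get a step 2 baseline is to map 'non-toxic' to class 1 and any other label to class 0. That mapping is an assumption made for illustration, not something the model or the dataset defines. `toxicity_to_sentiment` and `predict_classes` are hypothetical helpers, and `fake_classify` merely stands in for the real pipeline so the sketch runs offline.

```python
def toxicity_to_sentiment(label):
    """Assumed mapping: 'non-toxic' -> 1 (positive), any other label -> 0."""
    return 1 if label == "non-toxic" else 0


def predict_classes(classify, texts, batch_size=256):
    """Run `classify` over `texts` in batches and map its labels to 0/1.

    `classify` takes a list of strings and returns one {'label', 'score'}
    dict per string, like the `sentiment` pipeline above.
    """
    preds = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        preds.extend(toxicity_to_sentiment(r["label"]) for r in classify(batch))
    return preds


def fake_classify(batch):
    """Offline stand-in for the pipeline (hypothetical, keyword-based)."""
    return [{"label": "insult" if "hate" in t else "non-toxic", "score": 1.0}
            for t in batch]


preds = predict_classes(fake_classify,
                        ["nice weather", "i hate traffic jams", "hello"],
                        batch_size=2)
print(preds)  # [1, 0, 1]

# In the notebook this would become something like:
# preds = predict_classes(sentiment, df_val["text"].tolist())
```

The resulting list can then be scored against `df_val['class']` with any accuracy or F1 implementation.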

In [62]:
df_train['text'] = df_train['text'].apply(lambda x: x.lower())
df_val['text'] = df_val['text'].apply(lambda x: x.lower())

In [63]:
Dataset

torch.utils.data.dataset.Dataset

In [64]:
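class TwitterDataset(Dataset):
    """Tweets tokenized once up front, paired with their integer labels."""

    def __init__(self, txts, labels):
        # Plain list, so positional indexing works whatever the pandas index is.
        self._labels = list(labels)

        self.tokenizer = BertTokenizer.from_pretrained('cointegrated/rubert-tiny-toxicity')
        # Each text is encoded on its own, so every tensor has shape [1, max_length].
        # max_length=10 cuts every tweet to 10 tokens, special tokens included.
        self._txts = [self.tokenizer(text, padding='max_length', max_length=10,
                                     truncation=True, return_tensors="pt")
                      for text in txts]

    def __len__(self):
        return len(self._txts)

    def __getitem__(self, index):
        return self._txts[index], self._labels[index]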

In [65]:
DataLoader

torch.utils.data.dataloader.DataLoader

In [66]:
y_train = df_train['class']
y_val = df_val['class']

train_dataset = TwitterDataset(df_train['text'], y_train)
valid_dataset = TwitterDataset(df_val['text'], y_val)

train_loader = DataLoader(train_dataset,
                          batch_size=128,
                          shuffle=True,
                          num_workers=0)
valid_loader = DataLoader(valid_dataset,
                          batch_size=128,
                          shuffle=False,
                          num_workers=0)

In [67]:
for txt, lbl in train_loader:
    print(txt.keys())
    print(txt['input_ids'].shape)
    break

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
torch.Size([128, 1, 10])
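The middle dimension in `[128, 1, 10]` appears because every text is tokenized separately with `return_tensors="pt"`, which yields `[1, 10]` tensors that the default collation then stacks. A possible alternative, sketched here under assumptions, is to keep raw texts in the dataset and tokenize a whole batch in a `collate_fn`, which produces `[batch, seq_len]` directly and pads only to the longest text in the batch. `TextDataset`, `make_collate` and the `max_length=64` default are illustrative choices, not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TextDataset(Dataset):
    """Stores raw texts and integer labels; tokenization happens per batch."""

    def __init__(self, texts, labels):
        self.texts = list(texts)
        self.labels = list(labels)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        return self.texts[index], self.labels[index]


def make_collate(tokenizer, max_length=64):
    """Build a collate_fn that tokenizes a batch of (text, label) pairs."""
    def collate(batch):
        texts, labels = zip(*batch)
        # padding=True pads to the longest text in this batch only.
        encoded = tokenizer(list(texts), padding=True, truncation=True,
                            max_length=max_length, return_tensors="pt")
        return encoded, torch.tensor(labels, dtype=torch.long)
    return collate


# Usage with the notebook's objects (not executed here):
# loader = DataLoader(TextDataset(df_train['text'], df_train['class']),
#                     batch_size=128, shuffle=True,
#                     collate_fn=make_collate(tokenizer))
```

With this layout `input_ids` and `attention_mask` share the shape `[batch, seq_len]`, so no `squeeze(1)` is needed in the training loop.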


**Building and training the neural network**

In [68]:
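class BertClassifier(nn.Module):
    """Wraps the pre-trained toxicity checkpoint, whose head has 5 outputs."""

    def __init__(self):
        super().__init__()
        self.pretrained_model = BertForSequenceClassification.from_pretrained('cointegrated/rubert-tiny-toxicity')
        self.sigm = nn.Sigmoid()

    def forward(self, x, mask):
        # With return_dict=False the first element is the classification logits
        # (shape [batch, 5]), not a pooled embedding.
        logits = self.pretrained_model(input_ids=x, attention_mask=mask, return_dict=False)[0]
        # NOTE: nn.CrossEntropyLoss (used below) expects raw logits; the sigmoid
        # squashes them into (0, 1) before the loss sees them.
        return self.sigm(logits)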

In [69]:
model = BertClassifier().to(device)

In [70]:
model.pretrained_model.classifier

Linear(in_features=312, out_features=5, bias=True)

In [75]:
criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.pretrained_model.parameters(), lr=0.001)
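`nn.CrossEntropyLoss` applies log-softmax itself and expects unbounded logits, whereas `BertClassifier.forward` returns sigmoid outputs confined to (0, 1). That caps how confident the softmax can become. The pure-Python sketch below computes the resulting floor on the loss for five outputs; `cross_entropy` is an illustrative re-implementation for a single sample, not the PyTorch function.

```python
import math


def cross_entropy(scores, target):
    """Cross-entropy of softmax(scores) against the class index `target`."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]


# Best case after a sigmoid: the correct class at 1.0, the other four at 0.0.
best_after_sigmoid = cross_entropy([1.0, 0.0, 0.0, 0.0, 0.0], target=0)

# The same preference expressed as raw logits can be far more confident.
with_raw_logits = cross_entropy([10.0, -10.0, -10.0, -10.0, -10.0], target=0)

print(round(best_after_sigmoid, 3), with_raw_logits)
```

With sigmoid outputs the per-sample loss cannot drop below about 0.905 (ln(1 + 4/e)), no matter how well the network separates the classes, while raw logits can push it arbitrarily close to zero.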

In [79]:
# torchmetrics >= 1.0 requires an explicit task; the model emits 5 scores per sample.
# Micro-averaged F1 on single-label predictions equals accuracy, as the log shows.
train_f1 = F1Score(task="multiclass", num_classes=5, average="micro").to(device)
valid_f1 = F1Score(task="multiclass", num_classes=5, average="micro").to(device)

for epoch_num in range(10):
    total_acc_train = 0
    total_loss_train = 0

    model.train()
    for train_input, train_label in tqdm(train_loader):
        # The dataset yields [batch, 1, seq_len] tensors; drop the middle dimension.
        mask = train_input['attention_mask'].squeeze(1).to(device)
        input_id = train_input['input_ids'].squeeze(1).to(device)
        train_label = train_label.to(device)

        output = model(input_id, mask)

        batch_loss = criterion(output, train_label)
        total_loss_train += batch_loss.item()

        train_f1(output, train_label)

        acc = (output.argmax(dim=1) == train_label).sum().item()
        total_acc_train += acc

        optimizer.zero_grad()
        batch_loss.backward()
        optimizer.step()

    model.eval()
    total_loss_val, total_acc_val = 0.0, 0.0
    with torch.no_grad():  # no gradients are needed for validation
        for val_input, val_label in valid_loader:
            val_label = val_label.to(device)
            mask = val_input['attention_mask'].squeeze(1).to(device)
            input_id = val_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)

            batch_loss = criterion(output, val_label)
            total_loss_val += batch_loss.item()

            acc = (output.argmax(dim=1) == val_label).sum().item()
            total_acc_val += acc

            valid_f1(output, val_label)

    # Batch losses are per-batch means, so dividing their sum by the dataset size
    # prints the mean loss divided by roughly the batch size (128).
    print(
        f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_dataset): .3f} \
        | Train Accuracy: {total_acc_train / len(train_dataset): .3f} \
        | Train f1: {train_f1.compute().item(): .3f} \
        | Val Loss: {total_loss_val / len(valid_dataset): .3f} \
        | Val Accuracy: {total_acc_val / len(valid_dataset): .3f} \
        | Val f1: {valid_f1.compute().item(): .3f}')

    train_f1.reset()
    valid_f1.reset()

100%|██████████| 1418/1418 [00:29<00:00, 48.42it/s]
Epochs: 1 | Train Loss:  0.009         | Train Accuracy:  0.494         | Train f1:  0.494         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:26<00:00, 54.38it/s]
Epochs: 2 | Train Loss:  0.009         | Train Accuracy:  0.494         | Train f1:  0.494         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:25<00:00, 54.80it/s]
Epochs: 3 | Train Loss:  0.009         | Train Accuracy:  0.493         | Train f1:  0.493         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:25<00:00, 54.77it/s]
Epochs: 4 | Train Loss:  0.009         | Train Accuracy:  0.493         | Train f1:  0.493         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:26<00:00, 54.33it/s]
Epochs: 5 | Train Loss:  0.009         | Train Accuracy:  0.493         | Train f1:  0.493         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:25<00:00, 54.64it/s]
Epochs: 6 | Train Loss:  0.009         | Train Accuracy:  0.493         | Train f1:  0.493         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:26<00:00, 53.09it/s]
Epochs: 7 | Train Loss:  0.009         | Train Accuracy:  0.493         | Train f1:  0.493         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:25<00:00, 54.62it/s]
Epochs: 8 | Train Loss:  0.009         | Train Accuracy:  0.493         | Train f1:  0.493         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:25<00:00, 54.66it/s]
Epochs: 9 | Train Loss:  0.009         | Train Accuracy:  0.493         | Train f1:  0.493         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495

100%|██████████| 1418/1418 [00:25<00:00, 54.84it/s]
Epochs: 10 | Train Loss:  0.009         | Train Accuracy:  0.493         | Train f1:  0.493         | Val Loss:  0.009         | Val Accuracy:  0.495         | Val f1:  0.495
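The log shows loss, accuracy and F1 unchanged over all ten epochs, with accuracy near 0.49. That looks like chance level if the two classes are roughly balanced (an assumption, since the class balance is not shown). Likely contributors are the sigmoid applied before `CrossEntropyLoss`, a learning rate of 0.001 that is high for fine-tuning a pre-trained transformer, and `max_length=10`, which truncates most tweets. The sketch below shows one training epoch on raw logits with the loss weighted by batch size. `train_one_epoch` is a hypothetical helper, and the linear model and synthetic data are stand-ins so the example runs without downloading anything.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, batches, optimizer, criterion, device="cpu"):
    """One pass over (features, labels) batches; returns (mean loss, accuracy)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for features, labels in batches:
        features, labels = features.to(device), labels.to(device)
        logits = model(features)            # raw scores, no sigmoid or softmax
        loss = criterion(logits, labels)    # CrossEntropyLoss applies log-softmax
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)  # undo the per-batch mean
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen


# Demo on synthetic, linearly separable data with a stand-in linear model.
torch.manual_seed(0)
x = torch.randn(256, 4)
y = (x[:, 0] > 0).long()
batches = [(x[i:i + 32], y[i:i + 32]) for i in range(0, 256, 32)]

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

history = [train_one_epoch(model, batches, optimizer, criterion) for _ in range(20)]
print(history[0], history[-1])
```

For the BERT model itself the same loop would need a forward pass that returns raw logits and, as a commonly used but untuned starting point, a much smaller learning rate such as 2e-5.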