**Lesson 9. Transformer**

Take a pretrained model from https://huggingface.co/models for text sentiment classification.
Run predictions on the whole of df_val and compute a quality metric.
Fine-tune this model on df_train and compute the quality metric on df_val.
The data is on Google Drive: https://drive.google.com/file/d/1Mev_EEput0LlBj8MDHIJkBtahlJ6J901
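
The task leaves the choice of quality metric open. A minimal sketch of two common options, accuracy and macro-averaged F1, in plain Python; the function names and the toy label lists are illustrative assumptions, not part of the assignment.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): a model that always predicts class 0.
y_true = [0, 0, 0, 1, 2, 3]
y_pred = [0, 0, 0, 0, 0, 0]
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Macro F1 drops sharply for a classifier that ignores the rarer classes, which accuracy alone can hide.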


In [1]:
!pip install transformers
!pip install datasets


In [2]:
import numpy as np

import torch
import torch.nn as nn
from torch.optim import Adam
from tqdm import tqdm

from transformers import BertTokenizer, BertModel
from datasets import load_dataset

In [3]:
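DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# MiniLM cross-encoder trained for MS MARCO passage re-ranking; only its BERT encoder is reused below
MODEL = 'cross-encoder/ms-marco-MiniLM-L-12-v2'
EPOCHS = 10
BATCH_SIZE = 128
MAX_LENGTH = 30  # tokens per example after padding/truncation

# "emo" dataset from the Hugging Face hub (train and test splits)
emotion_dataset = load_dataset("emo")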



In [4]:
emotion_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 30160
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5509
    })
})

In [5]:
bertTokenizer = BertTokenizer.from_pretrained(MODEL)


In [6]:
example_text = emotion_dataset["train"]["text"][0]
bert_input = bertTokenizer(example_text, padding='max_length', max_length=MAX_LENGTH, truncation=True, return_tensors="pt")
print(example_text)
print(bert_input)

don't worry  i'm girl hmm how do i know if you are what's ur name
{'input_ids': tensor([[  101,  2123,  1005,  1056,  4737,  1045,  1005,  1049,  2611, 17012,
          2129,  2079,  1045,  2113,  2065,  2017,  2024,  2054,  1005,  1055,
         24471,  2171,   102,     0,     0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 0, 0]])}
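
The relation between `input_ids` and `attention_mask` in the output above can be reproduced by hand. A simplified sketch in plain Python, assuming right-padding with id 0 as in the printout; `pad_and_mask` is a hypothetical helper, and its plain slicing only approximates truncation (the real tokenizer keeps `[SEP]` as the last token when it truncates).

```python
MAX_LENGTH = 30
PAD_ID = 0  # padding id seen in the printed input_ids


def pad_and_mask(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Cut or right-pad token_ids to max_length and build the matching mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Ids of the example sentence copied from the printout (101 = [CLS], 102 = [SEP]).
example_ids = [101, 2123, 1005, 1056, 4737, 1045, 1005, 1049, 2611, 17012,
               2129, 2079, 1045, 2113, 2065, 2017, 2024, 2054, 1005, 1055,
               24471, 2171, 102]

input_ids, attention_mask = pad_and_mask(example_ids)
print(len(input_ids), sum(attention_mask))  # 30 23
```

The 23 ones in the mask mark the real tokens; the 7 trailing zeros tell the encoder to ignore the padding positions.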


In [7]:
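class TwitterDataset(torch.utils.data.Dataset):
    """Tokenizes every text once up front and serves (encoding, label) pairs."""

    def __init__(self, txts, labels):
        self._labels = labels

        self.tokenizer = BertTokenizer.from_pretrained(MODEL)
        # Each text is encoded on its own, so every tensor has shape (1, MAX_LENGTH)
        self._txts = [self.tokenizer(text, padding='max_length', max_length=MAX_LENGTH,
                                     truncation=True, return_tensors="pt")
                      for text in txts]

    def __len__(self):
        return len(self._txts)

    def __getitem__(self, index):
        # The DataLoader stacks these encodings into (batch, 1, MAX_LENGTH) tensors
        return self._txts[index], self._labels[index]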

In [8]:
train_dataset = TwitterDataset(emotion_dataset["train"]['text'], emotion_dataset["train"]["label"])
valid_dataset = TwitterDataset(emotion_dataset["test"]['text'],  emotion_dataset["test"]["label"])

train_loader = torch.utils.data.DataLoader(train_dataset,
                          batch_size=BATCH_SIZE,
                          shuffle=True,
                          num_workers=2)
valid_loader = torch.utils.data.DataLoader(valid_dataset,
                          batch_size=BATCH_SIZE,
                          shuffle=False,
                          num_workers=1)
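
With the `DataLoader` default `drop_last=False`, the number of batches per epoch follows from the split sizes and `BATCH_SIZE`. A small arithmetic check, assuming the split sizes shown for `emotion_dataset`; the variable names are illustrative.

```python
import math

BATCH_SIZE = 128
n_train, n_valid = 30160, 5509  # split sizes reported for the emo dataset

train_batches = math.ceil(n_train / BATCH_SIZE)
valid_batches = math.ceil(n_valid / BATCH_SIZE)
# The last training batch is smaller because 30160 is not a multiple of 128.
last_train_batch = n_train - (train_batches - 1) * BATCH_SIZE

print(train_batches, valid_batches, last_train_batch)  # 236 44 80
```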

In [9]:
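class BertClassifier(nn.Module):
    """BERT encoder followed by dropout, one linear layer and a sigmoid."""

    def __init__(self, dropout=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained(MODEL)
        self.dropout = nn.Dropout(dropout)
        # 384 is the encoder's hidden size; the head has 64 outputs,
        # of which the emo labels only use indices 0-3
        self.linear = nn.Linear(384, 64)
        self.sigm = nn.Sigmoid()

    def forward(self, x, mask):
        # With return_dict=False the model returns (sequence_output, pooled_output)
        _, pooled_output = self.bert(input_ids=x, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        # The sigmoid squashes the scores into (0, 1) before they reach the loss,
        # whereas nn.CrossEntropyLoss is designed to receive raw logits
        final_layer = self.sigm(linear_output)
        return final_layer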

In [10]:
model = BertClassifier().to(DEVICE)
criterion = nn.CrossEntropyLoss()

# Only the linear head is passed to the optimizer; the BERT weights are never updated
optimizer = Adam(model.linear.parameters(), lr=1e-5)

Some weights of the model checkpoint at cross-encoder/ms-marco-MiniLM-L-12-v2 were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
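
`nn.CrossEntropyLoss` applies log-softmax to whatever it receives, so feeding it sigmoid outputs confined to (0, 1) bounds the loss from below. A worked check in plain Python, assuming the 64-unit head and the sigmoid from `BertClassifier`; the helper `cross_entropy` is illustrative, not a library function.

```python
import math

NUM_OUTPUTS = 64  # width of the linear head in BertClassifier


def cross_entropy(scores, target):
    """Cross-entropy of one example: -log softmax(scores)[target]."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target]


# All outputs equal (for instance every sigmoid near 0.5): the loss is ln(64).
uniform_loss = cross_entropy([0.5] * NUM_OUTPUTS, target=0)

# Best case a sigmoid can approach: 1 for the true class, 0 everywhere else.
best_scores = [1.0] + [0.0] * (NUM_OUTPUTS - 1)
floor_loss = cross_entropy(best_scores, target=0)

print(round(uniform_loss, 3), round(floor_loss, 3))  # 4.159 3.185
```

With raw logits instead of sigmoid outputs the same loss has no such floor and can approach 0.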


In [11]:
print(model)
print("Parameters full train:", sum([param.nelement() for param in model.parameters()]))
print("Parameters transfer learning:", sum([param.nelement() for param in model.linear.parameters()]))

BertClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 384, padding_idx=0)
      (position_embeddings): Embedding(512, 384)
      (token_type_embeddings): Embedding(2, 384)
      (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=384, out_features=384, bias=True)
              (key): Linear(in_features=384, out_features=384, bias=True)
              (value): Linear(in_features=384, out_features=384, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=384, out_features=384, bias=True)
              (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=Tru
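
The size of the trainable head follows directly from `nn.Linear(384, 64)`, and the embedding block can be counted from the layer shapes printed above. A small arithmetic sketch; the variable names are illustrative.

```python
HIDDEN = 384    # hidden size shown in the printed model
HEAD_OUT = 64   # out_features of model.linear
VOCAB, POSITIONS, TOKEN_TYPES = 30522, 512, 2  # embedding table sizes from the printout

# Linear head: weight matrix plus bias vector.
head_params = HIDDEN * HEAD_OUT + HEAD_OUT

# Embedding block: three lookup tables plus the LayerNorm scale and shift.
embedding_params = (VOCAB + POSITIONS + TOKEN_TYPES) * HIDDEN + 2 * HIDDEN

print(head_params, embedding_params)  # 24640 11918592
```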

In [12]:
for epoch_num in range(EPOCHS):
    total_acc_train = 0
    total_loss_train = 0

    model.train()
    for train_input, train_label in tqdm(train_loader):
        # Encodings arrive as (batch, 1, MAX_LENGTH); drop the singleton dimension
        mask = train_input['attention_mask'].squeeze(1).to(DEVICE)
        input_id = train_input['input_ids'].squeeze(1).to(DEVICE)
        train_label = train_label.to(DEVICE)

        output = model(input_id, mask)

        # criterion returns the mean over the batch, so this accumulates batch means
        batch_loss = criterion(output, train_label)
        total_loss_train += batch_loss.item()

        acc = (output.argmax(dim=1) == train_label).sum().item()
        total_acc_train += acc

        model.zero_grad()
        batch_loss.backward()
        optimizer.step()

    model.eval()
    total_loss_val, total_acc_val = 0.0, 0.0
    with torch.no_grad():  # no gradients are needed for validation
        for val_input, val_label in valid_loader:
            val_label = val_label.to(DEVICE)
            mask = val_input['attention_mask'].squeeze(1).to(DEVICE)
            input_id = val_input['input_ids'].squeeze(1).to(DEVICE)

            output = model(input_id, mask)

            batch_loss = criterion(output, val_label)
            total_loss_val += batch_loss.item()

            acc = (output.argmax(dim=1) == val_label).sum().item()
            total_acc_val += acc

    # The losses are sums of batch means divided by the sample count, so they come
    # out roughly BATCH_SIZE times smaller than the mean per-example loss
    print(
        f'Epochs: {epoch_num + 1}'
        f' | Train Loss: {total_loss_train / len(train_dataset):.3f}'
        f' | Train Accuracy: {total_acc_train / len(train_dataset):.3f}'
        f' | Val Loss: {total_loss_val / len(valid_dataset):.3f}'
        f' | Val Accuracy: {total_acc_val / len(valid_dataset):.3f}')

Epochs: 1 | Train Loss: 0.032 | Train Accuracy: 0.074 | Val Loss: 0.031 | Val Accuracy: 0.054
Epochs: 2 | Train Loss: 0.030 | Train Accuracy: 0.258 | Val Loss: 0.029 | Val Accuracy: 0.054
Epochs: 3 | Train Loss: 0.028 | Train Accuracy: 0.272 | Val Loss: 0.028 | Val Accuracy: 0.054
Epochs: 4 | Train Loss: 0.028 | Train Accuracy: 0.278 | Val Loss: 0.028 | Val Accuracy: 0.054
Epochs: 5 | Train Loss: 0.027 | Train Accuracy: 0.279 | Val Loss: 0.027 | Val Accuracy: 0.054
Epochs: 6 | Train Loss: 0.027 | Train Accuracy: 0.283 | Val Loss: 0.027 | Val Accuracy: 0.054
Epochs: 7 | Train Loss: 0.026 | Train Accuracy: 0.279 | Val Loss: 0.027 | Val Accuracy: 0.054
Epochs: 8 | Train Loss: 0.026 | Train Accuracy: 0.284 | Val Loss: 0.027 | Val Accuracy: 0.054
Epochs: 9 | Train Loss: 0.026 | Train Accuracy: 0.277 | Val Loss: 0.027 | Val Accuracy: 0.054
Epochs: 10 | Train Loss: 0.026 | Train Accuracy: 0.278 | Val Loss: 0.027 | Val Accuracy: 0.054
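
The logged losses divide a sum of per-batch means by the number of samples, which makes them about `BATCH_SIZE` times smaller than the mean per-example loss. A sketch of the rescaling, assuming the dataset sizes and `BATCH_SIZE = 128` from the cells above; `mean_batch_loss` is a hypothetical helper, and its inputs are the rounded values from the log, so the results are approximate.

```python
import math


def mean_batch_loss(logged, n_samples, batch_size=128):
    """Undo the division by n_samples: return the average of the per-batch mean losses."""
    n_batches = math.ceil(n_samples / batch_size)
    return logged * n_samples / n_batches


first_epoch_train = mean_batch_loss(0.032, 30160)
last_epoch_train = mean_batch_loss(0.026, 30160)
last_epoch_val = mean_batch_loss(0.027, 5509)

print(round(first_epoch_train, 2), round(last_epoch_train, 2), round(last_epoch_val, 2))
# 4.09 3.32 3.38
```

For comparison, ln(64) ≈ 4.16 is the cross-entropy of a uniform guess over the 64 outputs of the head.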
