# Homework

Feel the power of transformers in action

**Goal**:

Learn to work with transformer models and apply them to various NLP tasks.

**Description / step-by-step instructions for the homework:**

Use the RuCoLA dataset for Russian, https://github.com/RussianNLP/RuCoLA (take in_domain_train.csv as the training set and in_domain_dev.csv as the test set).

Split in_domain_train into train and val.

1. Fine-tune and test RuBert or RuRoBerta on this task (any pretrained RuBERT model from huggingface will do, for example ruBert-base/large https://huggingface.co/sberbank-ai/ruBert-base / https://huggingface.co/sberbank-ai/ruBert-large or rubert-base-cased https://huggingface.co/DeepPavlov/rubert-base-cased, ruRoberta-large https://huggingface.co/sberbank-ai/ruRoberta-large, xlm-roberta-base https://huggingface.co/xlm-roberta-base).

2. Take RuGPT3 base or large and solve the task with few-/zero-shot methods:

a) try several prompt variants;

b) test different numbers of few-shot examples (0, 1, 2, 4).

3. Train and test the RuT5 model on this task (a fine-tuning example is available at https://github.com/RussianNLP/RuCoLA/blob/main/baselines/finetune_t5.py).

Compare the results.
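The train/val split requested above could be done with a stratified split, so that both parts keep the same share of acceptable sentences. The following is only a minimal sketch: the helper name `split_train_val`, the validation share, and the seed are illustrative assumptions, and a toy frame stands in for `in_domain_train.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val(df, val_fraction=0.1, seed=42):
    """Split a RuCoLA-style frame into train and val, stratified by the label column."""
    train_df, val_df = train_test_split(
        df,
        test_size=val_fraction,
        stratify=df["acceptable"],  # keep the class ratio in both parts
        random_state=seed,
    )
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)


# Toy data with the same columns as the real file (hypothetical sentences)
toy = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(20)],
    "acceptable": [1] * 15 + [0] * 5,
})
train_part, val_part = split_train_val(toy, val_fraction=0.2)
print(len(train_part), len(val_part))
```

With such a split, `val_part` can drive model selection while `in_domain_dev.csv` stays untouched until the final test.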


In [3]:
import gc
import random
from functools import partial
from tqdm.auto import tqdm, trange

import numpy as np
import pandas as pd

import torch
from torch.optim import Adam
from torch.utils.data import DataLoader

from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    T5Tokenizer,
    T5ForConditionalGeneration,
)
import accelerate

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM
from transformers import pipeline, DataCollatorWithPadding
from datasets import Dataset, DatasetDict, load_metric
from razdel import tokenize

from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

import warnings
warnings.filterwarnings("ignore")

## 1. RuBERT

### 1.1 Load the RuCoLA dataset

In [6]:
df_train = pd.read_csv('data/in_domain_train.csv', index_col=0)
df_test = pd.read_csv('data/in_domain_dev.csv', index_col=0)

In [7]:
df_train

id,sentence,acceptable,error_type,detailed_source
0,"Вдруг решетка беззвучно поехала в сторону, и н...",1,0,Paducheva2004
1,Этим летом не никуда ездили.,0,Syntax,Rusgram
2,Только Иван выразил какую бы то ни было готовн...,1,0,Paducheva2013
3,"Теперь ты видишь собственными глазами, как тут...",1,0,Paducheva2010
4,На поверку вся теория оказалась полной чепухой.,1,0,Paducheva2010
...,...,...,...,...
7864,Установки не было введено в действие.,0,Semantics,Paducheva2004
7865,"Конечно, против такой системы ценностей решите...",0,Semantics,Paducheva2013
7866,Симптомов болезни не исчезло.,0,Semantics,Paducheva2013
7867,Послезавтра температура у больного снижается д...,0,Semantics,Rusgram


In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7869 entries, 0 to 7868
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   sentence         7869 non-null   object
 1   acceptable       7869 non-null   int64 
 2   error_type       7869 non-null   object
 3   detailed_source  7869 non-null   object
dtypes: int64(1), object(3)
memory usage: 307.4+ KB


In [9]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 983 entries, 0 to 982
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   sentence         983 non-null    object
 1   acceptable       983 non-null    int64 
 2   error_type       983 non-null    object
 3   detailed_source  983 non-null    object
dtypes: int64(1), object(3)
memory usage: 38.4+ KB


In [10]:
df_train["acceptable"].nunique()

2

In [11]:
df_train["acceptable"].unique()

array([1, 0])

In [12]:
df_train.groupby(["acceptable"]).count()

acceptable,sentence,error_type,detailed_source
0,2005,2005,2005
1,5864,5864,5864


In [13]:
df_train["acceptable"].value_counts()

acceptable
1    5864
0    2005
Name: count, dtype: int64
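The class counts above show a clear imbalance, so it helps to know what accuracy a trivial "always predict the majority class" model would reach before judging any fine-tuned model. A small sketch; the helper name `majority_baseline_accuracy` is hypothetical, and the counts are copied from the output above.

```python
def majority_baseline_accuracy(class_counts):
    """Return the majority label and the accuracy of always predicting it."""
    total = sum(class_counts.values())
    majority_label = max(class_counts, key=class_counts.get)
    return majority_label, class_counts[majority_label] / total


# Counts taken from df_train["acceptable"].value_counts() above
train_counts = {1: 5864, 0: 2005}
label, acc = majority_baseline_accuracy(train_counts)
print(f"always predict {label}: accuracy = {acc:.3f}")
```

Any model whose accuracy sits near this value may simply be predicting the majority class, which is why class-sensitive metrics are worth tracking alongside accuracy.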

In [14]:
df_train["error_type"].nunique()

4

In [15]:
df_train["error_type"].unique()

array(['0', 'Syntax', 'Semantics', 'Morphology'], dtype=object)

In [16]:
ind = random.randint(0, df_train.shape[0]-1)
ind

7575

In [17]:
df_train.sentence[ind]

'Добрый поступок создает и накапливает добро, сделает жизнь лучше, развивает гуманность.'

### 1.1. Prepare the dataset for the model

In [19]:
train_ds = Dataset.from_dict({'text':df_train.sentence, 'label':df_train.acceptable}, split='train')
train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 7869
})

In [20]:
train_ds['text'][0]

'Вдруг решетка беззвучно поехала в сторону, и на балконе возникла таинственная фигура, прячущаяся от лунного света, и погрозила Ивану пальцем.'

In [21]:
test_ds = Dataset.from_dict({'text':df_test.sentence, 'label':df_test.acceptable}, split='test')
test_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 983
})

In [22]:
test_ds['text'][982]

'На Марсе есть какие-либо (какие бы то ни было) разумные обитатели.'

### 1.2 Load the RuBERT model from Hugging Face

In [24]:
base_model = 'ai-forever/ruBert-base'

In [25]:
tokenizer = AutoTokenizer.from_pretrained(base_model)

In [26]:
type(tokenizer)

transformers.models.bert.tokenization_bert_fast.BertTokenizerFast

In [27]:
train_ds_tokenized = train_ds.map(lambda x: tokenizer(x['text'], truncation=True, max_length=512), batched=True, remove_columns=['text'])

Map:   0%|          | 0/7869 [00:00<?, ? examples/s]

In [28]:
test_ds_tokenized = test_ds.map(lambda x: tokenizer(x['text'], truncation=True, max_length=512), batched=True, remove_columns=['text'])

Map:   0%|          | 0/983 [00:00<?, ? examples/s]

In [29]:
test_ds_tokenized[0]

{'label': 1,
 'input_ids': [101, 104691, 379, 5171, 672, 14207, 126, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [30]:
collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [31]:
train_dataloader = DataLoader(train_ds_tokenized, shuffle=True, batch_size=4, collate_fn=collator)

In [32]:
test_dataloader = DataLoader(test_ds_tokenized, shuffle=False, batch_size=4, collate_fn=collator)

In [33]:
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ai-forever/ruBert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [34]:
type(model)

transformers.models.bert.modeling_bert.BertForSequenceClassification

In [35]:
# Making the code device-agnostic
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [36]:
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(120138, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [37]:
optimizer = Adam(model.parameters(), lr=1e-6)  # with tiny batches, LR should be very small as well

In [38]:
gc.collect()
torch.cuda.empty_cache()

In [39]:
# set initial best loss to infinite
best_eval_loss = float('inf')

# empty list to store loss for each epoch
losses = []

for epoch in trange(1):  # NOTE: debugging value; revert to 5 epochs for the final run
    pbar = tqdm(train_dataloader)
    model.train()
    for i, batch in enumerate(pbar):
        out = model(**batch.to(model.device))
        out.loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        losses.append(out.loss.item())
        pbar.set_description(f'loss: {np.mean(losses[-100:]):2.2f}')

    model.eval()
    eval_losses = []
    eval_preds = []
    eval_targets = []
    for batch in tqdm(test_dataloader):
        with torch.no_grad():
            out = model(**batch.to(model.device))
        eval_losses.append(out.loss.item())
        eval_preds.extend(out.logits.argmax(1).tolist())
        eval_targets.extend(batch['labels'].tolist())
    print('Epoch:', epoch+1, 'Train Loss:', np.mean(losses[-100:]), 'Eval Loss:', np.mean(eval_losses), 'Accuracy', np.mean(np.array(eval_targets) == eval_preds))
    # Save the model with the lowest eval loss so far
    if np.mean(eval_losses) < best_eval_loss:
        best_eval_loss = np.mean(eval_losses)
        torch.save(model.state_dict(), 'bert_saved_weights.pt')

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1968 [00:00<?, ?it/s]

  0%|          | 0/246 [00:00<?, ?it/s]

Epoch: 1 Train Loss: 0.5661448127031327 Eval Loss: 0.5320105203162364 Accuracy 0.7477110885045778
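The training loop above keeps the checkpoint with the lowest eval loss. A natural companion is early stopping, which ends training once the eval loss stops improving. The class below is a hypothetical helper, not part of the notebook, and the loss values are made up purely to show the control flow.

```python
class EarlyStopping:
    """Signal a stop once the eval loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, eval_loss):
        """Record one epoch and return (improved, should_stop)."""
        if eval_loss < self.best_loss - self.min_delta:
            self.best_loss = eval_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
history = []
# Made-up eval losses: improvement for two epochs, then a rise
for epoch, eval_loss in enumerate([0.53, 0.50, 0.51, 0.52, 0.49], start=1):
    improved, should_stop = stopper.step(eval_loss)
    history.append((epoch, improved, should_stop))
    if should_stop:
        break
print(history)
```

In a real loop, `improved` would be the moment to call `torch.save`, and `should_stop` the moment to break out of the epoch loop.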


At some point the validation loss starts to increase, which means the model is overfitting our small dataset; this shows up in longer runs, not in the single debugging epoch above. Load the saved best model and compute the metrics:

In [41]:
def quality(true_y, prediction_y, ndig=3):
    """
    Evaluates and returns the following metrics: Accuracy, Precision, Recall, F1-score, AUC
    """
    accuracy = round(accuracy_score(true_y, prediction_y), ndig)
    precision = round(precision_score(true_y, prediction_y), ndig)
    recall = round(recall_score(true_y, prediction_y), ndig)
    f1 = round(f1_score(true_y, prediction_y), ndig)
    auc = round(roc_auc_score(true_y, prediction_y), ndig)
    print(f" Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"   Recall: {recall}")
    print(f" F1-score: {f1}")
    print(f"      AUC: {auc}")
    return [accuracy, precision, recall, f1, auc]

In [42]:
results = {}

In [43]:
#load weights of best model
path = 'bert_saved_weights.pt'
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [44]:
model.eval()
eval_losses = []
eval_preds = []
eval_targets = []
for batch in tqdm(test_dataloader):
    with torch.no_grad():
        out = model(**batch.to(model.device))
    eval_losses.append(out.loss.item())
    eval_preds.extend(out.logits.argmax(1).tolist())
    eval_targets.extend(batch['labels'].tolist())
print('recent train loss', np.mean(losses[-100:]), 'eval loss', np.mean(eval_losses), 'accuracy', np.mean(np.array(eval_targets) == eval_preds))

  0%|          | 0/246 [00:00<?, ?it/s]

recent train loss 0.5661448127031327 eval loss 0.5320105203162364 accuracy 0.7477110885045778


In [45]:
print(classification_report(eval_targets, eval_preds))

              precision    recall  f1-score   support

           0       1.00      0.01      0.02       250
           1       0.75      1.00      0.86       733

    accuracy                           0.75       983
   macro avg       0.87      0.50      0.44       983
weighted avg       0.81      0.75      0.64       983



In [46]:
results['ruBERT'] = quality(eval_targets, eval_preds)

 Accuracy: 0.748
Precision: 0.747
   Recall: 1.0
 F1-score: 0.855
      AUC: 0.504
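The metrics above suggest that the model labels almost every sentence as acceptable. One way to see this is to rebuild the metrics from a confusion matrix. The counts below are an assumption: they are inferred from the reported accuracy, recall, and class supports (250 and 733), not printed by the notebook. The sketch also computes the Matthews correlation coefficient (MCC) and balanced accuracy; with hard 0/1 predictions for a binary task, an AUC computed from labels reduces to balanced accuracy.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, balanced accuracy, and MCC from binary confusion-matrix counts."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
        # MCC is defined as 0 when any marginal count is zero
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Inferred counts (assumption): 733 acceptable sentences all predicted 1,
# and only 2 of the 250 unacceptable sentences predicted 0
m = binary_metrics(tp=733, fp=248, fn=0, tn=2)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

A high accuracy paired with a balanced accuracy near 0.5 and an MCC near 0 is the typical signature of a near-constant classifier.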


In [47]:
pd.DataFrame(results, index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC']).T

Model,Accuracy,Precision,Recall,F1-score,AUC
ruBERT,0.748,0.747,1.0,0.855,0.504


In [48]:
model.to("cpu")

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(120138, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

### Zero-shot classification

For zero-shot classification we use the standard Hugging Face pipeline.

Links (delete later)
- [GFG: Zero shot text classification](https://www.geeksforgeeks.org/zero-shot-text-classification-using-huggingface-model/)
- [Medium: Map class labels from strings to numbers](https://medium.com/@duzhewang/change-the-class-labels-from-a-string-representation-into-an-integer-format-in-python-using-map-62414d4a1a7e)

In [52]:
# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="ai-forever/rugpt3large_based_on_gpt2", device=device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at ai-forever/rugpt3large_based_on_gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


In [53]:
text = "Установки не было введено в действие."
candidate_labels = ["корректное предложение", "некорректное предложение"]

In [54]:
result = classifier(text, candidate_labels)
print(result)

{'sequence': 'Установки не было введено в действие.', 'labels': ['корректное предложение', 'некорректное предложение'], 'scores': [0.5091734528541565, 0.4908265471458435]}


In [55]:
text = "Иван вчера не позвонил."
candidate_labels = ["некорректное предложение", "корректное предложение"]

In [56]:
result = classifier(text, candidate_labels)
print(result)

{'sequence': 'Иван вчера не позвонил.', 'labels': ['корректное предложение', 'некорректное предложение'], 'scores': [0.5179219245910645, 0.48207807540893555]}


In [57]:
eval_targets[:10]

[1, 0, 1, 1, 1, 1, 1, 1, 1, 1]

In [58]:
len(test_ds['text'])

983

In [59]:
## Getting results in batch mode
zero_shot_out = classifier(test_ds['text'], candidate_labels)

In [60]:
type(zero_shot_out)

list

In [61]:
zero_shot_out[0]['labels']

['корректное предложение', 'некорректное предложение']

In [62]:
# Getting the top label for each item using a list comprehension
first_labels = [item['labels'][0] for item in zero_shot_out]

In [63]:
len(first_labels)

983

In [64]:
zs_preds = list(map(lambda x: 1 if x == 'корректное предложение' else 0, first_labels))

In [65]:
zs_preds[:5]

[1, 0, 1, 1, 1]

In [66]:
print(classification_report(eval_targets, zs_preds))

              precision    recall  f1-score   support

           0       0.36      0.07      0.11       250
           1       0.75      0.96      0.84       733

    accuracy                           0.73       983
   macro avg       0.56      0.51      0.48       983
weighted avg       0.65      0.73      0.66       983



In [67]:
results['Zero-shot'] = quality(eval_targets, zs_preds)

 Accuracy: 0.732
Precision: 0.751
   Recall: 0.959
 F1-score: 0.842
      AUC: 0.514


In [68]:
pd.DataFrame(results, index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC']).T

Model,Accuracy,Precision,Recall,F1-score,AUC
ruBERT,0.748,0.747,1.0,0.855,0.504
Zero-shot,0.732,0.751,0.959,0.842,0.514


In [69]:
classifier = pipeline("zero-shot-classification", model="ai-forever/rugpt3large_based_on_gpt2", device="cpu")

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at ai-forever/rugpt3large_based_on_gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


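In the zero-shot setting the metrics are worse than ruBERT's. Note the warning above: the `score.weight` classification head is newly initialized, which likely explains why the scores stay close to 0.5. Let's try few-shot.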

### Few-shot classification

Here we use a different approach: run model inference on a few-shot prompt and use the loss to judge grammatical acceptability.

In [73]:
tokenizer = AutoTokenizer.from_pretrained("ai-forever/rugpt3large_based_on_gpt2")
model = AutoModelForCausalLM.from_pretrained("ai-forever/rugpt3large_based_on_gpt2")

In [74]:
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1536)
    (wpe): Embedding(2048, 1536)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=4608, nx=1536)
          (c_proj): Conv1D(nf=1536, nx=1536)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=6144, nx=1536)
          (c_proj): Conv1D(nf=1536, nx=6144)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1536, out_features=50257, bias=False)
)

In [75]:
text = 'Иван вчера не позвонил.'
few_shots = ['Предложение далее корректное? ' + 'Солнце садилось за горизонт.' + " Ответ: да.",
             'Предложение далее корректное? ' + 'Не стоит сидеть сложить руки.' + " Ответ: нет."]

In [76]:
# Defining functions to calculate loss and get predictions
def calc_loss(phrase: str, tokenizer, model):
    """Return the mean language-modeling loss of `phrase` under `model`."""
    token_ids = tokenizer.encode(phrase)
    # Append the EOS token if the phrase is a single token, to avoid an error
    if len(token_ids) == 1:
        token_ids.append(tokenizer.eos_token_id)
    input_ids = torch.tensor(token_ids, dtype=torch.long, device=device).unsqueeze(0)
    with torch.no_grad():
        out = model(input_ids, labels=input_ids)
    return out.loss.item()

def get_loss_num(text):
    loss = calc_loss(phrase=text, model=model, tokenizer=tokenizer)
    return loss

def get_correct_prompt(phrase, few_shots=few_shots):
    return '\n'.join(few_shots) +'\nПредложение далее корректное? ' + phrase + " Ответ: да."

def get_incorrect_prompt(phrase, few_shots=few_shots):
    return '\n'.join(few_shots) + '\nПредложение далее корректное? ' + phrase + " Ответ: нет."

def get_few_shot_pred(text):
    res = {}
    #print(get_correct_prompt(text)) ## Debugging
    correct_loss = calc_loss(phrase=get_correct_prompt(text), model=model, tokenizer=tokenizer)
    #print(f"Correct Loss: {correct_loss}") ## Debugging
    
    #print(get_incorrect_prompt(text)) ## Debugging
    incorrect_loss = calc_loss(phrase=get_incorrect_prompt(text), model=model, tokenizer=tokenizer)
    #print(f"Incorrect Loss: {incorrect_loss}") ## Debugging

    pred_num = 1 if correct_loss < incorrect_loss else 0
    
    res["Correct_Loss"] = correct_loss
    res["Inorrect_Loss"] = incorrect_loss
    res["pred"] = pred_num
    return res    

In [77]:
correct_prompt = get_correct_prompt(text, few_shots)
print(correct_prompt)

Предложение далее корректное? Солнце садилось за горизонт. Ответ: да.
Предложение далее корректное? Не стоит сидеть сложить руки. Ответ: нет.
Предложение далее корректное? Иван вчера не позвонил. Ответ: да.


In [78]:
incorrect_prompt = get_incorrect_prompt(text, few_shots)
print(incorrect_prompt)

Предложение далее корректное? Солнце садилось за горизонт. Ответ: да.
Предложение далее корректное? Не стоит сидеть сложить руки. Ответ: нет.
Предложение далее корректное? Иван вчера не позвонил. Ответ: нет.


In [79]:
out = get_few_shot_pred(text)
out

{'Correct_Loss': 2.8430206775665283,
 'Inorrect_Loss': 2.852612257003784,
 'pred': 1}
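The loss used above is averaged over every token of the prompt, including the few-shot prefix that both candidate prompts share, so the two values differ only slightly (2.843 vs 2.853 in the example). One alternative, which is not what the notebook does, is to score only the answer tokens by masking the prompt positions with `-100` in `labels`. The sketch below is an assumption-laden illustration: `answer_only_loss` is a hypothetical helper, token ids are arbitrary integers, and a tiny randomly initialized GPT-2 stands in for RuGPT3 so that nothing has to be downloaded.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel


def answer_only_loss(model, prompt_ids, answer_ids):
    """Mean cross-entropy over the answer tokens only, with the prompt as context."""
    device = next(model.parameters()).device
    input_ids = torch.tensor([prompt_ids + answer_ids], dtype=torch.long, device=device)
    labels = input_ids.clone()
    labels[:, : len(prompt_ids)] = -100  # positions labeled -100 are ignored by the loss
    with torch.no_grad():
        out = model(input_ids=input_ids, labels=labels)
    return out.loss.item()


torch.manual_seed(0)
# Tiny random model (stand-in for RuGPT3): outputs are meaningless, only the mechanics matter
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
toy_model = GPT2LMHeadModel(config).eval()

prompt = [5, 17, 42, 8]  # arbitrary ids standing in for the tokenized prompt
loss_yes = answer_only_loss(toy_model, prompt, [11, 3])  # stand-in for " Ответ: да."
loss_no = answer_only_loss(toy_model, prompt, [12, 3])   # stand-in for " Ответ: нет."
pred = 1 if loss_yes < loss_no else 0
print(loss_yes, loss_no, pred)
```

With the real tokenizer, `prompt_ids` and `answer_ids` would come from encoding the prompt and the answer separately; whether this improves accuracy on RuCoLA would need to be checked experimentally.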

In [80]:
len(test_ds['text'])

983

In [81]:
fewshot_preds = []

for text in tqdm(test_ds['text']):
    out = get_few_shot_pred(text)
    fewshot_preds.append(out['pred'])
    #print(out,'\n') ## Debugging

  0%|          | 0/983 [00:00<?, ?it/s]

In [82]:
len(fewshot_preds)

983

In [83]:
print(classification_report(eval_targets, fewshot_preds))

              precision    recall  f1-score   support

           0       0.21      0.12      0.15       250
           1       0.74      0.84      0.79       733

    accuracy                           0.66       983
   macro avg       0.47      0.48      0.47       983
weighted avg       0.60      0.66      0.63       983



In [84]:
results['Few-shots'] = quality(eval_targets, fewshot_preds)

 Accuracy: 0.66
Precision: 0.738
   Recall: 0.844
 F1-score: 0.788
      AUC: 0.482


In [85]:
pd.DataFrame(results, index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC']).T

Model,Accuracy,Precision,Recall,F1-score,AUC
ruBERT,0.748,0.747,1.0,0.855,0.504
Zero-shot,0.732,0.751,0.959,0.842,0.514
Few-shots,0.66,0.738,0.844,0.788,0.482


In [86]:
model.to("cpu")

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1536)
    (wpe): Embedding(2048, 1536)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=4608, nx=1536)
          (c_proj): Conv1D(nf=1536, nx=1536)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=6144, nx=1536)
          (c_proj): Conv1D(nf=1536, nx=6144)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1536, out_features=50257, bias=False)
)

Surprisingly, the few-shot approach gives worse results than zero-shot; it may be worth experimenting further with the examples in the prompt.
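The assignment asks for several prompt variants and for 0, 1, 2, and 4 few-shot examples, while the run above uses a single template with two examples. A small prompt builder makes both easy to vary. This is only a sketch: `build_prompt` and `shot_pool` are hypothetical names, the template mirrors the one used above, and the two extra examples are sentences shown earlier in the `df_train` preview with their `acceptable` labels.

```python
DEFAULT_TEMPLATE = "Предложение далее корректное? {sentence} Ответ: {answer}."


def build_prompt(sentence, answer, shots, template=DEFAULT_TEMPLATE):
    """Join the few-shot examples and the query sentence, one per line."""
    lines = [template.format(sentence=s, answer=a) for s, a in shots]
    lines.append(template.format(sentence=sentence, answer=answer))
    return "\n".join(lines)


# (sentence, answer) pairs: the first two come from the cell above,
# the last two from the df_train preview (acceptable = 1 and 0)
shot_pool = [
    ("Солнце садилось за горизонт.", "да"),
    ("Не стоит сидеть сложить руки.", "нет"),
    ("На поверку вся теория оказалась полной чепухой.", "да"),
    ("Этим летом не никуда ездили.", "нет"),
]

prompts_by_k = {
    k: build_prompt("Иван вчера не позвонил.", "да", shot_pool[:k])
    for k in (0, 1, 2, 4)
}
print(prompts_by_k[4])
```

Passing a different `template` string covers the prompt-variant part of the task; each resulting prompt can then be scored with the same loss comparison as before.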

## RuT5 finetuning

Train and test the [RuT5](https://huggingface.co/ai-forever/ruT5-base) model on this task; a fine-tuning example is available [here](https://github.com/RussianNLP/RuCoLA/blob/main/baselines/finetune_t5.py).

RuT5 fine-tuning:
```bash
python baselines/finetune_t5.py -m [MODEL_NAME]
```
Afterwards, you can get test set predictions in the format required by the leaderboard for all trained models. To do this, run:
```bash
python baselines/get_csv_predictions.py -m MODEL1 MODEL2 ...
```

In [91]:
## Defining metrics
ACCURACY = load_metric("accuracy", keep_in_memory=True)
PRECISION = load_metric("precision", keep_in_memory=True)
RECALL = load_metric("recall", keep_in_memory=True)
F1 = load_metric("f1", keep_in_memory=True)
ROC_AUC = load_metric("roc_auc", keep_in_memory=True)
MCC = load_metric("matthews_correlation", keep_in_memory=True)

In [208]:
## Defining function to compute metrics
def compute_metrics(p, tokenizer):
    string_preds = tokenizer.batch_decode(p.predictions, skip_special_tokens=True)
    int_preds = [1 if prediction == POS_LABEL else 0 for prediction in string_preds]

    labels = np.where(p.label_ids != -100, p.label_ids, tokenizer.pad_token_id)
    string_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    int_labels = []

    for string_label in string_labels:
        if string_label == POS_LABEL:
            int_labels.append(1)
        elif string_label == NEG_LABEL or string_label == "":  # second case accounts for test data
            int_labels.append(0)
        else:
            raise ValueError()

    acc_result = ACCURACY.compute(predictions=int_preds, references=int_labels)
    precision_result = PRECISION.compute(predictions=int_preds, references=int_labels)
    recall_result = RECALL.compute(predictions=int_preds, references=int_labels)
    f1_result = F1.compute(predictions=int_preds, references=int_labels)
    #auc_result = ROC_AUC.compute(predictions=int_preds, references=int_labels)
    #mcc_result = MCC.compute(predictions=int_preds, references=int_labels)
    
    result = {"accuracy": acc_result["accuracy"],
              "precision": precision_result["precision"],
              "recall": recall_result["recall"],
              "F1-score": f1_result["f1"],
              "AUC": 0.5,  # placeholder: ROC AUC is not computed here (ROC_AUC call is commented out)
             }
    ## Debugging intermediate results on each step 
    ##results['RuT5-in'] = quality(int_labels, int_preds)
    
    return result

In [93]:
model_t5_name = "ai-forever/ruT5-large"

In [94]:
N_SEEDS = 1
N_EPOCHS = 4
LR_VALUES = (1e-3,)
DECAY_VALUES = (1e-4,)
BATCH_SIZES = (128,)

POS_LABEL = "yes"
NEG_LABEL = "no"

In [95]:
tokenizer = T5Tokenizer.from_pretrained(model_t5_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [96]:
DATA_DIR = "./data"
TRAIN_FILE = DATA_DIR + "/" + "in_domain_train.csv"
IN_DOMAIN_DEV_FILE = DATA_DIR + "/" + "in_domain_dev.csv"
OUT_OF_DOMAIN_DEV_FILE = DATA_DIR + "/" + "out_of_domain_dev.csv"
TEST_FILE = DATA_DIR + "/" + "test.csv"

In [97]:
print(f"            TRAIN_FILE -> {TRAIN_FILE}")
print(f"    IN_DOMAIN_DEV_FILE -> {IN_DOMAIN_DEV_FILE}")
print(f"OUT_OF_DOMAIN_DEV_FILE -> {OUT_OF_DOMAIN_DEV_FILE}")
print(f"             TEST_FILE -> {TEST_FILE}")

            TRAIN_FILE -> ./data/in_domain_train.csv
    IN_DOMAIN_DEV_FILE -> ./data/in_domain_dev.csv
OUT_OF_DOMAIN_DEV_FILE -> ./data/out_of_domain_dev.csv
             TEST_FILE -> ./data/test.csv


In [98]:
def read_splits(*, as_datasets):
    train_df, test_df = map(
        pd.read_csv, (TRAIN_FILE, IN_DOMAIN_DEV_FILE)
    )

    # concatenate datasets to get aggregate metrics
    #dev_df = pd.concat((in_domain_dev_df, out_of_domain_dev_df))

    if as_datasets:
        train, test = map(Dataset.from_pandas, (train_df, test_df))
        return DatasetDict(train=train, test=test)
    else:
        return train_df, test_df

In [99]:
# we need to prepare datasets here
splits = read_splits(as_datasets=True)

In [100]:
splits

DatasetDict({
    train: Dataset({
        features: ['id', 'sentence', 'acceptable', 'error_type', 'detailed_source'],
        num_rows: 7869
    })
    test: Dataset({
        features: ['id', 'sentence', 'acceptable', 'error_type', 'detailed_source'],
        num_rows: 983
    })
})

In [101]:
def preprocess_examples(examples, tokenizer):
    result = tokenizer(examples["sentence"], padding=False)

    if "acceptable" in examples:
        label_sequences = []
        for label in examples["acceptable"]:
            if label == 1:
                target_sequence = POS_LABEL
            elif label == 0:
                target_sequence = NEG_LABEL
            else:
                raise ValueError("Unknown class label")
            label_sequences.append(target_sequence)

    else:
        # a hack to avoid the "You have to specify either decoder_input_ids or decoder_inputs_embeds" error
        # for test data
        label_sequences = ["" for _ in examples["sentence"]]

    result["labels"] = tokenizer(label_sequences, padding=False)["input_ids"]
    result["length"] = [len(list(tokenize(sentence))) for sentence in examples["sentence"]]
    return result

In [102]:
# Tokenizing our dataset
tokenized_splits = splits.map(
        partial(preprocess_examples, tokenizer=tokenizer),
        batched=True,
        remove_columns=["sentence"],
    )

Map:   0%|          | 0/7869 [00:00<?, ? examples/s]

Map:   0%|          | 0/983 [00:00<?, ? examples/s]

In [103]:
tokenized_splits

DatasetDict({
    train: Dataset({
        features: ['id', 'acceptable', 'error_type', 'detailed_source', 'input_ids', 'attention_mask', 'labels', 'length'],
        num_rows: 7869
    })
    test: Dataset({
        features: ['id', 'acceptable', 'error_type', 'detailed_source', 'input_ids', 'attention_mask', 'labels', 'length'],
        num_rows: 983
    })
})

In [104]:
data_collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8)

In [105]:
len(LR_VALUES), len(DECAY_VALUES), len(BATCH_SIZES)

(1, 1, 1)

In [106]:
# seed, lr, wd, bs
dev_metrics_per_run = np.empty((N_SEEDS, len(LR_VALUES), len(DECAY_VALUES), len(BATCH_SIZES), 5))

In [107]:
dev_metrics_per_run.shape

(1, 1, 1, 1, 5)

In [108]:
for i, learning_rate in enumerate(LR_VALUES):
    for j, weight_decay in enumerate(DECAY_VALUES):
        for k, batch_size in enumerate(BATCH_SIZES):
            for seed in range(N_SEEDS):
                model = T5ForConditionalGeneration.from_pretrained(model_t5_name)

                run_base_dir = f"{model_t5_name}_{learning_rate}_{weight_decay}_{batch_size}"

                training_args = Seq2SeqTrainingArguments(
                    output_dir=f"checkpoints/{run_base_dir}",
                    overwrite_output_dir=True,
                    evaluation_strategy="epoch",
                    per_device_train_batch_size=batch_size,
                    per_device_eval_batch_size=batch_size,
                    learning_rate=learning_rate,
                    weight_decay=weight_decay,
                    num_train_epochs=N_EPOCHS,
                    lr_scheduler_type="constant",
                    save_strategy="epoch",
                    save_total_limit=1,
                    seed=seed,
                    fp16=True,
                    dataloader_num_workers=4,
                    group_by_length=True,
                    report_to="none",
                    load_best_model_at_end=True,
                    metric_for_best_model="eval_F1-score", ##"eval_mcc",
                    optim="adafactor",
                    predict_with_generate=True,
                )

                trainer = Seq2SeqTrainer(
                    model=model,
                    args=training_args,
                    train_dataset=tokenized_splits["train"],
                    eval_dataset=tokenized_splits["test"],
                    compute_metrics=partial(compute_metrics, tokenizer=tokenizer),
                    tokenizer=tokenizer,
                    data_collator=data_collator,
                )

                train_result = trainer.train()
                print(f"{run_base_dir}_{seed}")
                print("train", train_result.metrics)

                #os.makedirs(f"results/{run_base_dir}_{seed}", exist_ok=True)

                dev_predictions = trainer.predict(
                    test_dataset=tokenized_splits["test"], metric_key_prefix="test", max_length=10
                )
                print("test", dev_predictions.metrics)
                dev_metrics_per_run[seed, i, j, k] = (
                    dev_predictions.metrics["test_accuracy"],
                    dev_predictions.metrics["test_precision"],
                    dev_predictions.metrics["test_recall"],
                    dev_predictions.metrics["test_F1-score"],
                    dev_predictions.metrics["test_AUC"],
                    #dev_predictions.metrics["test_mcc"],
                )

                predictions = trainer.predict(test_dataset=tokenized_splits["test"], max_length=10)

                string_preds = tokenizer.batch_decode(predictions.predictions, skip_special_tokens=True)

                int_preds = [1 if prediction == POS_LABEL else 0 for prediction in string_preds]
                int_preds = np.asarray(int_preds)
                # Calculating metrics
                results['RuT5'] = quality(eval_targets, int_preds)

                #np.save(f"results/{run_base_dir}_{seed}/preds.npy", int_preds)

                #rmtree(f"checkpoints/{run_base_dir}")

#os.makedirs("results_agg", exist_ok=True)
#np.save(f"results_agg/{model_name}_dev.npy", dev_metrics_per_run)

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1-score,Auc
1,No log,0.521325,0.745677,0.745677,1.0,0.854312,0.5
2,No log,0.364138,0.745677,0.745677,1.0,0.854312,0.5
3,No log,0.279618,0.745677,0.745677,1.0,0.854312,0.5
4,No log,0.280039,0.745677,0.745677,1.0,0.854312,0.5


 Accuracy: 0.746
Precision: 0.746
   Recall: 1.0
 F1-score: 0.854
      AUC: 0.5


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


ai-forever/ruT5-large_0.001_0.0001_128_0
train {'train_runtime': 94.9664, 'train_samples_per_second': 331.444, 'train_steps_per_second': 2.611, 'total_flos': 3393982267392000.0, 'train_loss': 0.7644554876512096, 'epoch': 4.0}


 Accuracy: 0.746
Precision: 0.746
   Recall: 1.0
 F1-score: 0.854
      AUC: 0.5
test {'test_loss': 0.5213252902030945, 'test_accuracy': 0.745676500508647, 'test_precision': 0.745676500508647, 'test_recall': 1.0, 'test_F1-score': 0.8543123543123543, 'test_AUC': 0.5, 'test_runtime': 3.366, 'test_samples_per_second': 292.042, 'test_steps_per_second': 2.377}


 Accuracy: 0.746
Precision: 0.746
   Recall: 1.0
 F1-score: 0.854
      AUC: 0.5
 Accuracy: 0.746
Precision: 0.746
   Recall: 1.0
 F1-score: 0.854
      AUC: 0.5
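The metrics above are identical on every epoch, recall is 1.0, and accuracy (0.745677) equals 733/983, the share of acceptable sentences in the dev set. That pattern suggests the model generates the same label for every input. A quick check is to count the distinct decoded predictions. The helper `prediction_summary` is hypothetical, and the toy list stands in for `string_preds` from the training cell.

```python
from collections import Counter


def prediction_summary(string_preds, pos_label="yes"):
    """Count decoded predictions and flag the degenerate single-label case."""
    counts = Counter(string_preds)
    share_pos = counts[pos_label] / len(string_preds)
    is_constant = len(counts) == 1
    return counts, share_pos, is_constant


# Stand-in for the decoded model outputs (made-up data)
toy_preds = ["yes"] * 983
counts, share_pos, is_constant = prediction_summary(toy_preds)
print(counts, share_pos, is_constant)
```

If `is_constant` is true on the real predictions, the numbers in the table reflect the class balance rather than anything the model learned, and the learning rate, number of epochs, or label strings are the first things to revisit.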


## Final comparison of the results

Sort the results by F1-score:

In [111]:
pd.DataFrame(results, index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC']).T.sort_values(by=['F1-score'], ascending=False)

Model,Accuracy,Precision,Recall,F1-score,AUC
ruBERT,0.748,0.747,1.0,0.855,0.504
RuT5-in,0.746,0.746,1.0,0.854,0.5
RuT5,0.746,0.746,1.0,0.854,0.5
Zero-shot,0.732,0.751,0.959,0.842,0.514
Few-shots,0.66,0.738,0.844,0.788,0.482


Remaining tasks
- remove ROC_AUC from the set of metrics;
- restore 5 epochs for ruBERT training;
- write the final conclusions;
- remove debugging/intermediate comments;
- save the Jupyter notebook;
- upload the final version to GitHub;
- submit the homework for review.