# Homework

Feel the power of transformers in action

**Goal**:

Learn to work with transformer models and apply them to various NLP tasks.

**Description / step-by-step instructions for the homework:**

Use the RuCoLA dataset for Russian, https://github.com/RussianNLP/RuCoLA (take in_domain_train.csv as the training set and in_domain_dev.csv as the test set).

Split in_domain_train into train and val.

1. Fine-tune and evaluate RuBERT or RuRoBERTa on this task (any pretrained RuBERT model from Hugging Face will do, for example ruBert-base/large https://huggingface.co/sberbank-ai/ruBert-base / https://huggingface.co/sberbank-ai/ruBert-large or rubert-base-cased https://huggingface.co/DeepPavlov/rubert-base-cased, ruRoberta-large https://huggingface.co/sberbank-ai/ruRoberta-large, xlm-roberta-base https://huggingface.co/xlm-roberta-base).

2. Take RuGPT3 base or large and solve this task with few-/zero-shot methods.

a) try several prompt variants;

b) test different numbers of few-shot examples (0, 1, 2, 4).

3. Train and evaluate the RuT5 model on this task (an example of fine-tuning can be found here: https://github.com/RussianNLP/RuCoLA/blob/main/baselines/finetune_t5.py).

Compare the results.


In [222]:
import gc
import random
import numpy as np
import pandas as pd

import torch
from torch.optim import Adam
from torch.utils.data import DataLoader

from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    T5Tokenizer,
    T5ForConditionalGeneration,
)

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM
from transformers import pipeline, DataCollatorWithPadding
from datasets import Dataset, DatasetDict, load_metric

from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

from tqdm.auto import tqdm, trange

## 1. RuBERT

### 1.1 Loading the RuCoLA dataset

In [6]:
df_train = pd.read_csv('data/in_domain_train.csv', index_col=0)
df_test = pd.read_csv('data/in_domain_dev.csv', index_col=0)

In [7]:
df_train

id,sentence,acceptable,error_type,detailed_source
0,"Вдруг решетка беззвучно поехала в сторону, и н...",1,0,Paducheva2004
1,Этим летом не никуда ездили.,0,Syntax,Rusgram
2,Только Иван выразил какую бы то ни было готовн...,1,0,Paducheva2013
3,"Теперь ты видишь собственными глазами, как тут...",1,0,Paducheva2010
4,На поверку вся теория оказалась полной чепухой.,1,0,Paducheva2010
...,...,...,...,...
7864,Установки не было введено в действие.,0,Semantics,Paducheva2004
7865,"Конечно, против такой системы ценностей решите...",0,Semantics,Paducheva2013
7866,Симптомов болезни не исчезло.,0,Semantics,Paducheva2013
7867,Послезавтра температура у больного снижается д...,0,Semantics,Rusgram


In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7869 entries, 0 to 7868
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   sentence         7869 non-null   object
 1   acceptable       7869 non-null   int64 
 2   error_type       7869 non-null   object
 3   detailed_source  7869 non-null   object
dtypes: int64(1), object(3)
memory usage: 307.4+ KB


In [9]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 983 entries, 0 to 982
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   sentence         983 non-null    object
 1   acceptable       983 non-null    int64 
 2   error_type       983 non-null    object
 3   detailed_source  983 non-null    object
dtypes: int64(1), object(3)
memory usage: 38.4+ KB


In [10]:
df_train["acceptable"].nunique()

2

In [11]:
df_train["acceptable"].unique()

array([1, 0])

In [12]:
df_train.groupby(["acceptable"]).count()

acceptable,sentence,error_type,detailed_source
0,2005,2005,2005
1,5864,5864,5864


In [13]:
df_train["acceptable"].value_counts()

acceptable
1    5864
0    2005
Name: count, dtype: int64
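
The classes are imbalanced, so accuracy is easier to interpret against a majority-class baseline. The sketch below uses only the counts printed above; the helper name `majority_baseline_accuracy` is hypothetical and not part of the notebook.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Same class counts as df_train["acceptable"].value_counts() above
train_labels = [1] * 5864 + [0] * 2005
baseline = majority_baseline_accuracy(train_labels)
print(f"majority-class accuracy: {baseline:.3f}")
```

Any model whose accuracy is close to this value has probably learned little beyond predicting "acceptable" for everything.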

In [14]:
df_train["error_type"].nunique()

4

In [15]:
df_train["error_type"].unique()

array(['0', 'Syntax', 'Semantics', 'Morphology'], dtype=object)

In [16]:
ind = random.randint(0, df_train.shape[0]-1)
ind

3927

In [17]:
df_train.sentence[ind]

'Когда я вернулся, он спал.'

### 1.2 Preparing the dataset for the model

In [19]:
train_ds = Dataset.from_dict({'text':df_train.sentence, 'label':df_train.acceptable}, split='train')
train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 7869
})

In [20]:
train_ds['text'][0]

'Вдруг решетка беззвучно поехала в сторону, и на балконе возникла таинственная фигура, прячущаяся от лунного света, и погрозила Ивану пальцем.'

In [21]:
test_ds = Dataset.from_dict({'text':df_test.sentence, 'label':df_test.acceptable}, split='test')
test_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 983
})

In [22]:
test_ds['text'][982]

'На Марсе есть какие-либо (какие бы то ни было) разумные обитатели.'
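
The assignment asks to split in_domain_train into train and val, which the cells above do not do. A minimal sketch of a stratified split follows, assuming a 10% validation share and a fixed seed. Both values and the helper `stratified_split` are illustrative choices, not taken from the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=42):
    """Return (train_idx, val_idx) so that both parts keep the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels with the same class counts as df_train["acceptable"]
labels = [1] * 5864 + [0] * 2005
train_idx, val_idx = stratified_split(labels, val_fraction=0.1)
print(len(train_idx), len(val_idx))
```

With the real data, the index lists could be applied as `df_train.iloc[train_idx]` and `df_train.iloc[val_idx]`. scikit-learn's `train_test_split` with the `stratify` argument does the same job in one call.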

### 1.3 Loading the RuBERT model from Hugging Face

In [24]:
base_model = 'ai-forever/ruBert-base'

In [25]:
tokenizer = AutoTokenizer.from_pretrained(base_model)

In [26]:
type(tokenizer)

transformers.models.bert.tokenization_bert_fast.BertTokenizerFast

In [27]:
train_ds_tokenized = train_ds.map(lambda x: tokenizer(x['text'], truncation=True, max_length=512), batched=True, remove_columns=['text'])


In [28]:
test_ds_tokenized = test_ds.map(lambda x: tokenizer(x['text'], truncation=True, max_length=512), batched=True, remove_columns=['text'])


In [29]:
test_ds_tokenized[0]

{'label': 1,
 'input_ids': [101, 104691, 379, 5171, 672, 14207, 126, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [30]:
collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [31]:
train_dataloader = DataLoader(train_ds_tokenized, shuffle=True, batch_size=4, collate_fn=collator)

In [32]:
test_dataloader = DataLoader(test_ds_tokenized, shuffle=False, batch_size=4, collate_fn=collator)

In [33]:
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ai-forever/ruBert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [34]:
type(model)

transformers.models.bert.modeling_bert.BertForSequenceClassification

In [35]:
# Making the code device-agnostic
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [36]:
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(120138, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [37]:
optimizer = Adam(model.parameters(), lr=1e-6)  # with tiny batches, LR should be very small as well

In [38]:
gc.collect()
torch.cuda.empty_cache()

In [39]:
# set initial best loss to infinite
best_eval_loss = float('inf')

# empty list to store loss for each epoch
losses = []

for epoch in trange(5):
    pbar = tqdm(train_dataloader)
    model.train()
    for i, batch in enumerate(pbar):
        out = model(**batch.to(model.device))
        out.loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        losses.append(out.loss.item())
        pbar.set_description(f'loss: {np.mean(losses[-100:]):2.2f}')

    model.eval()
    eval_losses = []
    eval_preds = []
    eval_targets = []
    for batch in tqdm(test_dataloader):
        with torch.no_grad():
            out = model(**batch.to(model.device))
        eval_losses.append(out.loss.item())
        eval_preds.extend(out.logits.argmax(1).tolist())
        eval_targets.extend(batch['labels'].tolist())
    print('Epoch:', epoch+1, 'Train Loss:', np.mean(losses[-100:]), 'Eval Loss:', np.mean(eval_losses), 'Accuracy', np.mean(np.array(eval_targets) == eval_preds))
    # Save the weights whenever the eval loss improves
    if np.mean(eval_losses) < best_eval_loss:
        best_eval_loss = np.mean(eval_losses)
        torch.save(model.state_dict(), 'bert_saved_weights.pt')

Epoch: 1 Train Loss: 0.5451523773372173 Eval Loss: 0.5358655233451022 Accuracy 0.7507629704984741
Epoch: 2 Train Loss: 0.50386698782444 Eval Loss: 0.5134998018421778 Accuracy 0.762970498474059
Epoch: 3 Train Loss: 0.5086082202941179 Eval Loss: 0.4913178726243294 Accuracy 0.7812817904374364
Epoch: 4 Train Loss: 0.41421750232577326 Eval Loss: 0.5015357464127909 Accuracy 0.780264496439471
Epoch: 5 Train Loss: 0.34821811709553 Eval Loss: 0.5295565909501619 Accuracy 0.7782299084435402


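After epoch 3 the evaluation loss starts to rise (0.491, then 0.502, then 0.530) while the training loss keeps falling, which means the model is overfitting on our small dataset. Note that the evaluation loss here is computed on `test_dataloader`, i.e. on in_domain_dev.csv. Load the saved best model and compute the metrics: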
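
The loop above keeps the checkpoint with the lowest evaluation loss but still trains for all five epochs. Early stopping could be sketched as below. The class `EarlyStopping` and the `patience=2` setting are assumptions made for illustration, and the losses are the rounded values printed by the run.

```python
class EarlyStopping:
    """Track the best eval loss and signal when it has stopped improving."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience      # epochs without improvement to tolerate
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, eval_loss):
        """Return True if this epoch is the new best (time to save a checkpoint)."""
        if eval_loss < self.best_loss - self.min_delta:
            self.best_loss = eval_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        return False

    @property
    def should_stop(self):
        return self.bad_epochs >= self.patience

# Eval losses from epochs 1-5 of the run above, rounded
eval_losses = [0.5359, 0.5135, 0.4913, 0.5015, 0.5296]
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate(eval_losses, start=1):
    if stopper.step(epoch, loss):
        pass  # in the notebook this is where torch.save(...) would go
    if stopper.should_stop:
        break
print("best epoch:", stopper.best_epoch, "stopped after epoch:", epoch)
```

If the checkpoint is selected on a separate val split rather than on in_domain_dev.csv, the final test metrics are not biased by model selection.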

In [41]:
def quality(true_y, prediction_y, ndig=3):
    """
    Evaluates and returns the following metrics: Accuracy, Precision, Recall, F1-score, AUC
    """
    accuracy = round(accuracy_score(true_y, prediction_y), ndig)
    precision = round(precision_score(true_y, prediction_y), ndig)
    recall = round(recall_score(true_y, prediction_y), ndig)
    f1 = round(f1_score(true_y, prediction_y), ndig)
    auc = round(roc_auc_score(true_y, prediction_y), ndig)
    print(f" Accuracy: {accuracy}")
    print(f"Precision: {precision}")
    print(f"   Recall: {recall}")
    print(f" F1-score: {f1}")
    print(f"      AUC: {auc}")
    return [accuracy, precision, recall, f1, auc]

In [42]:
results = {}

In [43]:
#load weights of best model
path = 'bert_saved_weights.pt'
model.load_state_dict(torch.load(path))


<All keys matched successfully>

In [44]:
model.eval()
eval_losses = []
eval_preds = []
eval_targets = []
for batch in tqdm(test_dataloader):
    with torch.no_grad():
        out = model(**batch.to(model.device))
    eval_losses.append(out.loss.item())
    eval_preds.extend(out.logits.argmax(1).tolist())
    eval_targets.extend(batch['labels'].tolist())
print('recent train loss', np.mean(losses[-100:]), 'eval loss', np.mean(eval_losses), 'accuracy', np.mean(np.array(eval_targets) == eval_preds))

recent train loss 0.34821811709553 eval loss 0.4913178726243294 accuracy 0.7812817904374364


In [45]:
print(classification_report(eval_targets, eval_preds))

              precision    recall  f1-score   support

           0       0.74      0.22      0.33       250
           1       0.78      0.97      0.87       733

    accuracy                           0.78       983
   macro avg       0.76      0.60      0.60       983
weighted avg       0.77      0.78      0.73       983



In [46]:
results['ruBERT'] = quality(eval_targets, eval_preds)

 Accuracy: 0.781
Precision: 0.785
   Recall: 0.974
 F1-score: 0.869
      AUC: 0.595


In [47]:
pd.DataFrame(results, index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC']).T

Model,Accuracy,Precision,Recall,F1-score,AUC
ruBERT,0.781,0.785,0.974,0.869,0.595
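
`quality` passes hard 0/1 predictions to `roc_auc_score`. With a single operating point, the area under the ROC curve equals the balanced accuracy, (TPR + TNR) / 2, which is why AUC stays near 0.6 even though accuracy and F1 look high. The sketch below checks this by hand and also computes the Matthews correlation coefficient, the metric the notebook loads later as `matthews_correlation`. The confusion counts are an assumption: they are reconstructed from the reported accuracy, recall and class supports, not printed by the notebook.

```python
import math

# Confusion counts for ruBERT on in_domain_dev (positive class = 1, "acceptable").
# Assumption: reconstructed from the reported accuracy, recall and class supports.
tp, fn = 714, 19    # 733 acceptable sentences
tn, fp = 54, 196    # 250 unacceptable sentences

tpr = tp / (tp + fn)   # recall on class 1
tnr = tn / (tn + fp)   # recall on class 0
balanced_accuracy = (tpr + tnr) / 2

# Matthews correlation coefficient from the same four counts
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"balanced accuracy: {balanced_accuracy:.3f}")
print(f"MCC: {mcc:.3f}")
```

For a threshold-free AUC, `roc_auc_score` can instead be given class-1 probabilities, for example a softmax over `out.logits`.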


### Zero-shot classification

For zero-shot classification we use the standard Hugging Face pipeline.

In [51]:
# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="ai-forever/rugpt3large_based_on_gpt2", device=device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at ai-forever/rugpt3large_based_on_gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


In [52]:
text = "Установки не было введено в действие."
candidate_labels = ["корректное предложение", "некорректное предложение"]

In [53]:
result = classifier(text, candidate_labels)
print(result)

{'sequence': 'Установки не было введено в действие.', 'labels': ['корректное предложение', 'некорректное предложение'], 'scores': [0.504284143447876, 0.495715856552124]}


In [54]:
text = "Иван вчера не позвонил."
candidate_labels = ["некорректное предложение", "корректное предложение"]

In [55]:
result = classifier(text, candidate_labels)
print(result)

{'sequence': 'Иван вчера не позвонил.', 'labels': ['корректное предложение', 'некорректное предложение'], 'scores': [0.5066623687744141, 0.49333763122558594]}


In [56]:
eval_targets[:10]

[1, 0, 1, 1, 1, 1, 1, 1, 1, 1]

In [57]:
len(test_ds['text'])

983

In [58]:
# Get results in batch mode
zero_shot_out = classifier(test_ds['text'], candidate_labels)

In [59]:
type(zero_shot_out)

list

In [60]:
zero_shot_out[0]['labels']

['корректное предложение', 'некорректное предложение']

In [61]:
# Get the top label for each item using a list comprehension
first_labels = [item['labels'][0] for item in zero_shot_out]

In [62]:
len(first_labels)

983

In [63]:
zs_preds = list(map(lambda x: 1 if x == 'корректное предложение' else 0, first_labels))

In [64]:
zs_preds[:5]

[1, 1, 1, 1, 1]

In [65]:
print(classification_report(eval_targets, zs_preds))

              precision    recall  f1-score   support

           0       0.31      0.06      0.11       250
           1       0.75      0.95      0.84       733

    accuracy                           0.73       983
   macro avg       0.53      0.51      0.47       983
weighted avg       0.64      0.73      0.65       983



In [66]:
results['Zero-shot'] = quality(eval_targets, zs_preds)

 Accuracy: 0.725
Precision: 0.749
   Recall: 0.951
 F1-score: 0.838
      AUC: 0.507


In [67]:
pd.DataFrame(results, index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC']).T

Model,Accuracy,Precision,Recall,F1-score,AUC
ruBERT,0.781,0.785,0.974,0.869,0.595
Zero-shot,0.725,0.749,0.951,0.838,0.507


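The zero-shot results are somewhat worse. Note the warning above: the pipeline loads RuGPT3 as `GPT2ForSequenceClassification` with a newly initialized `score.weight`, and the two scores stay close to 0.5, so these predictions are close to chance (AUC 0.507). Let's try few-shot.

### Few-shot classification

We use a different approach: for each sentence we build two few-shot prompts that differ only in the final answer (`да` or `нет`), compute the language-model loss of each, and predict the answer with the lower loss.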

In [71]:
tokenizer = AutoTokenizer.from_pretrained("ai-forever/rugpt3large_based_on_gpt2")
model = AutoModelForCausalLM.from_pretrained("ai-forever/rugpt3large_based_on_gpt2")

In [72]:
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1536)
    (wpe): Embedding(2048, 1536)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=4608, nx=1536)
          (c_proj): Conv1D(nf=1536, nx=1536)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=6144, nx=1536)
          (c_proj): Conv1D(nf=1536, nx=6144)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1536, out_features=50257, bias=False)
)

In [73]:
text = 'Иван вчера не позвонил.'
few_shots = ['Предложение далее корректное? ' + 'Солнце садилось за горизонт.' + " Ответ: да.",
             'Предложение далее корректное? ' + 'Не стоит сидеть сложить руки.' + " Ответ: нет."]

In [74]:
# Helpers that compute the LM loss of a prompt and turn it into a prediction
def calc_loss(phrase: str, tokenizer, model):
    """Return the mean per-token LM loss of `phrase` under `model`."""
    token_ids = tokenizer.encode(phrase)
    # Append <EOS> if the phrase is only one token long, to avoid an error
    if len(token_ids) == 1:
        token_ids.append(tokenizer.eos_token_id)
    input_ids = torch.tensor(token_ids, dtype=torch.long, device=device).unsqueeze(0)
    with torch.no_grad():
        out = model(input_ids, labels=input_ids)
    return out.loss.item()

def get_loss_num(text):
    loss = calc_loss(phrase=text, model=model, tokenizer=tokenizer)
    return loss

def get_correct_prompt(phrase, few_shots=few_shots):
    return '\n'.join(few_shots) +'\nПредложение далее корректное? ' + phrase + " Ответ: да."

def get_incorrect_prompt(phrase, few_shots=few_shots):
    return '\n'.join(few_shots) + '\nПредложение далее корректное? ' + phrase + " Ответ: нет."

def get_few_shot_pred(text):
    res = {}
    #print(get_correct_prompt(text)) ## Debugging
    correct_loss = calc_loss(phrase=get_correct_prompt(text), model=model, tokenizer=tokenizer)
    #print(f"Correct Loss: {correct_loss}") ## Debugging
    
    #print(get_incorrect_prompt(text)) ## Debugging
    incorrect_loss = calc_loss(phrase=get_incorrect_prompt(text), model=model, tokenizer=tokenizer)
    #print(f"Incorrect Loss: {incorrect_loss}") ## Debugging

    pred_num = 1 if correct_loss < incorrect_loss else 0
    
    res["Correct_Loss"] = correct_loss
    res["Inorrect_Loss"] = incorrect_loss
    res["pred"] = pred_num
    return res    

In [75]:
correct_prompt = get_correct_prompt(text, few_shots)
print(correct_prompt)

Предложение далее корректное? Солнце садилось за горизонт. Ответ: да.
Предложение далее корректное? Не стоит сидеть сложить руки. Ответ: нет.
Предложение далее корректное? Иван вчера не позвонил. Ответ: да.


In [76]:
incorrect_prompt = get_incorrect_prompt(text, few_shots)
print(incorrect_prompt)

Предложение далее корректное? Солнце садилось за горизонт. Ответ: да.
Предложение далее корректное? Не стоит сидеть сложить руки. Ответ: нет.
Предложение далее корректное? Иван вчера не позвонил. Ответ: нет.


In [77]:
out = get_few_shot_pred(text)
out

{'Correct_Loss': 2.8430206775665283,
 'Inorrect_Loss': 2.852612257003784,
 'pred': 1}

In [78]:
len(test_ds['text'])

983

In [79]:
fewshot_preds = []

for text in tqdm(test_ds['text']):
    out = get_few_shot_pred(text)
    fewshot_preds.append(out['pred'])
    #print(out,'\n') ## Debugging

In [80]:
len(fewshot_preds)

983

In [81]:
print(classification_report(eval_targets, fewshot_preds))

              precision    recall  f1-score   support

           0       0.21      0.12      0.15       250
           1       0.74      0.84      0.79       733

    accuracy                           0.66       983
   macro avg       0.47      0.48      0.47       983
weighted avg       0.60      0.66      0.63       983



In [82]:
results['Few-shots'] = quality(eval_targets, fewshot_preds)

 Accuracy: 0.66
Precision: 0.738
   Recall: 0.844
 F1-score: 0.788
      AUC: 0.482


In [83]:
pd.DataFrame(results, index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC']).T

Model,Accuracy,Precision,Recall,F1-score,AUC
ruBERT,0.781,0.785,0.974,0.869,0.595
Zero-shot,0.725,0.749,0.951,0.838,0.507
Few-shots,0.66,0.738,0.844,0.788,0.482


Surprisingly, the few-shot approach performs worse than zero-shot. The examples in the prompt probably need further tuning.
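
The assignment asks for 0, 1, 2 and 4 few-shot examples, while the run above fixes two. A prompt builder parameterised by the number of demonstrations could look like the sketch below. `TEMPLATE` and `build_prompt` are hypothetical names, and the last two demonstrations are an assumption: they are taken from the training rows previewed earlier.

```python
TEMPLATE = "Предложение далее корректное? {sentence} Ответ: {answer}."

def build_prompt(sentence, answer, shots, k):
    """Prefix the query with the first k demonstrations; k=0 gives a zero-shot prompt."""
    lines = [TEMPLATE.format(sentence=s, answer=a) for s, a in shots[:k]]
    lines.append(TEMPLATE.format(sentence=sentence, answer=answer))
    return "\n".join(lines)

shots = [
    ("Солнце садилось за горизонт.", "да"),
    ("Не стоит сидеть сложить руки.", "нет"),
    # Assumption: extra demonstrations borrowed from the training preview
    ("На поверку вся теория оказалась полной чепухой.", "да"),
    ("Этим летом не никуда ездили.", "нет"),
]

query = "Иван вчера не позвонил."
for k in (0, 1, 2, 4):
    yes_prompt = build_prompt(query, "да", shots, k)
    no_prompt = build_prompt(query, "нет", shots, k)
    print(f"k={k}: {len(yes_prompt.splitlines())} lines per prompt")
```

Each pair `yes_prompt` / `no_prompt` can then be scored with `calc_loss` exactly as in `get_few_shot_pred`, giving one row of metrics per value of `k`.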

## RuT5 fine-tuning

Train and evaluate the [RuT5](https://huggingface.co/ai-forever/ruT5-base) model on this task; an example of fine-tuning can be found [here](https://github.com/RussianNLP/RuCoLA/blob/main/baselines/finetune_t5.py).

RuT5 fine-tuning:
```shell
python baselines/finetune_t5.py -m [MODEL_NAME]
```
Afterwards, you can get test set predictions in the format required by the leaderboard for all trained models. To do this, run:
```shell
python baselines/get_csv_predictions.py -m MODEL1 MODEL2 ...
```

In [88]:
ACCURACY = load_metric("accuracy", keep_in_memory=True)
MCC = load_metric("matthews_correlation", keep_in_memory=True)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.



In [89]:
model_t5_name = "ai-forever/ruT5-large"

In [90]:
N_SEEDS = 10
N_EPOCHS = 20
LR_VALUE = (1e-3,)
DECAY_VALUE = (1e-4,)
BATCH_SIZES = (128,)

POS_LABEL = "yes"
NEG_LABEL = "no"

In [91]:
tokenizer = T5Tokenizer.from_pretrained(model_t5_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [194]:
DATA_DIR = "./data"
TRAIN_FILE = DATA_DIR + "/" + "in_domain_train.csv"
IN_DOMAIN_DEV_FILE = DATA_DIR + "/" + "in_domain_dev.csv"
OUT_OF_DOMAIN_DEV_FILE = DATA_DIR + "/" + "out_of_domain_dev.csv"
TEST_FILE = DATA_DIR + "/" + "test.csv"

In [216]:
print(f"            TRAIN_FILE -> {TRAIN_FILE}")
print(f"    IN_DOMAIN_DEV_FILE -> {IN_DOMAIN_DEV_FILE}")
print(f"OUT_OF_DOMAIN_DEV_FILE -> {OUT_OF_DOMAIN_DEV_FILE}")
print(f"             TEST_FILE -> {TEST_FILE}")

            TRAIN_FILE -> ./data/in_domain_train.csv
    IN_DOMAIN_DEV_FILE -> ./data/in_domain_dev.csv
OUT_OF_DOMAIN_DEV_FILE -> ./data/out_of_domain_dev.csv
             TEST_FILE -> ./data/test.csv


In [218]:
def read_splits(*, as_datasets):
    train_df, test_df = map(
        pd.read_csv, (TRAIN_FILE, IN_DOMAIN_DEV_FILE)
    )

    # concatenate datasets to get aggregate metrics
    #dev_df = pd.concat((in_domain_dev_df, out_of_domain_dev_df))

    if as_datasets:
        train, test = map(Dataset.from_pandas, (train_df, test_df))
        return DatasetDict(train=train, test=test)
    else:
        return train_df, test_df

In [224]:
# we need to prepare datasets here
splits = read_splits(as_datasets=True)

In [226]:
splits

DatasetDict({
    train: Dataset({
        features: ['id', 'sentence', 'acceptable', 'error_type', 'detailed_source'],
        num_rows: 7869
    })
    test: Dataset({
        features: ['id', 'sentence', 'acceptable', 'error_type', 'detailed_source'],
        num_rows: 983
    })
})

In [None]:
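# Not run: preprocess_examples is not defined in this notebook,
# and `partial` requires `from functools import partial`.
# tokenized_splits = splits.map(
#     partial(preprocess_examples, tokenizer=tokenizer),
#     batched=True,
#     remove_columns=["sentence"],
# )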
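
The cell above refers to `preprocess_examples`, which this notebook never defines. A minimal sketch of such a function follows, assuming the usual text-to-text setup where the 0/1 label becomes the target string `"no"` / `"yes"` (the `NEG_LABEL` / `POS_LABEL` constants defined earlier). It is not necessarily how the RuCoLA baseline implements it, and `ToyTokenizer` is a hypothetical stand-in so the example runs without downloading a model.

```python
POS_LABEL = "yes"
NEG_LABEL = "no"

def preprocess_examples(examples, tokenizer):
    """Tokenize sentences and encode 0/1 labels as tokenized "no"/"yes" targets."""
    result = tokenizer(examples["sentence"])
    targets = [POS_LABEL if label == 1 else NEG_LABEL for label in examples["acceptable"]]
    result["labels"] = tokenizer(targets)["input_ids"]
    return result

class ToyTokenizer:
    """Hypothetical whitespace tokenizer that mimics the call shape of a HF tokenizer."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, texts):
        input_ids = [
            [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]
            for text in texts
        ]
        return {
            "input_ids": input_ids,
            "attention_mask": [[1] * len(ids) for ids in input_ids],
        }

batch = {
    "sentence": ["Когда я вернулся, он спал.", "Этим летом не никуда ездили."],
    "acceptable": [1, 0],
}
encoded = preprocess_examples(batch, ToyTokenizer())
print(encoded["labels"])
```

With the real `T5Tokenizer`, the same function could be passed to `splits.map(..., batched=True)` through `functools.partial`, as the commented-out cell suggests.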

In [93]:
data_collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8)

## Final comparison of the results

Sort the results by AUC in descending order:

In [96]:
pd.DataFrame(results, index = ['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC']).T.sort_values(by=['AUC'], ascending=False)

Model,Accuracy,Precision,Recall,F1-score,AUC
ruBERT,0.781,0.785,0.974,0.869,0.595
Zero-shot,0.725,0.749,0.951,0.838,0.507
Few-shots,0.66,0.738,0.844,0.788,0.482
