This notebook contains the results of experiments with T5. The model is heavier than RoBERTa and RuBERT and takes longer to converge. Because of resource limitations, the number of training epochs, the batch sizes and some other parameters are not optimal; they were chosen so that all the processes run in a reasonable time. An A100 GPU seems to be the best option for fine-tuning T5. <br>
There are smaller and faster versions of this model, so for future experiments it is probably better to use one of the optimized versions instead.

In [1]:
pip install transformers datasets wget pymorphy2 accelerate tqdm simpletransformers deep_translator pdfminer-six

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting pymorphy2
  Downloading pymorphy2-0.9.1-py3-none-any.whl (55 kB)
Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting simpletransformers
  Downloading simpletransformers-0.70.0-py3-none-any.whl (315 kB)
Collecting deep_translator
  Downloading deep_translator-1.11.

In [1]:
from transformers import AutoTokenizer, AutoModel
from transformers import T5ForQuestionAnswering
import pandas as pd
import numpy as np
import torch
import sklearn
import wget
import accelerate
from tqdm import tqdm
from torch.utils.data import TensorDataset

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments, Trainer
from datasets import load_dataset, Dataset

from sklearn.model_selection import train_test_split

import logging
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

from deep_translator import GoogleTranslator
from pdfminer.high_level import extract_text

In [2]:
torch.cuda.empty_cache() # T5 quickly overflows cuda memory

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 1. Training an LLM for a context-based question answering task

## 1.1. Initial fine-tuning on the dataset

In [4]:
xquad_dataset = load_dataset('xquad', 'xquad.ru') # using the same xquad dataset, russian subset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
batch_size = 16 # 32 or 64 is usually recommended
max_length = 512 # maximum length of the model input
stride = 128 # overlap between consecutive windows of a long context, so text near a window boundary is not lost; 1/4 of the max input length is a reasonable starting point
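
To make the roles of `max_length` and `stride` concrete, here is a minimal sketch of how a long context is cut into overlapping windows. It is a simplification under stated assumptions: the helper `sliding_windows` is hypothetical, it counts context tokens only, and it ignores the question and special tokens that also take up part of `max_length` in the real tokenizer call.

```python
def sliding_windows(n_tokens, window, stride):
    """Return (start, end) token ranges that cover n_tokens.

    Consecutive windows overlap by `stride` tokens, so each new window
    advances by `window - stride` tokens.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must be in [0, window)")
    step = window - stride
    windows = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += step
    return windows


# a 1000-token context with the values used in this notebook
print(sliding_windows(1000, 512, 128))
```

With these values every window shares 128 tokens with the previous one, so an answer that straddles a window boundary is likely to appear whole in at least one window.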

In [6]:
# the preprocessing function is a bit different for this model,
# because T5 and BERT-like models use different special tokens
# (T5 doesn't have a [CLS] token, for example)

def prepare_train_features(examples):
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",  # truncate context, not the question
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        sequence_ids = tokenized_examples.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # find the start of the context in the current input_ids window
        context_start_index = next((i for i, s_id in enumerate(sequence_ids) if s_id == 1), None)

        if context_start_index is None or len(answers["answer_start"]) == 0:
            # if no valid context start found or no answer specified, the first token of the sequence will be used
            tokenized_examples["start_positions"].append(0)
            tokenized_examples["end_positions"].append(0)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])
            token_start_index = context_start_index
            while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                token_start_index += 1
            token_start_index -= 1  # move back to the first token that is inside the answer span

            token_end_index = token_start_index
            while token_end_index < len(offsets) and offsets[token_end_index][1] <= end_char:
                token_end_index += 1
            token_end_index -= 1  # make sure the token_end_index is inside the answer

            tokenized_examples["start_positions"].append(token_start_index)
            tokenized_examples["end_positions"].append(token_end_index)

    return tokenized_examples
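
The two `while` loops above walk the offsets without first checking that the answer lies inside the current window, and padding tokens usually carry `(0, 0)` offsets, so an answer that falls outside the window can end up labelled with a position in the padding. A minimal sketch of a stricter mapping follows. The helper `char_span_to_token_span` is hypothetical, and it assumes `offsets` holds `(start_char, end_char)` pairs and `sequence_ids` marks context tokens with `1`.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token).

    Returns (0, 0) when the answer is not fully inside this window, which
    mirrors the fallback used in prepare_train_features.
    """
    if start_char >= end_char:
        return 0, 0  # empty answer

    # indices of the tokens that belong to the context (sequence id 1)
    context = [i for i, s in enumerate(sequence_ids) if s == 1]
    if not context:
        return 0, 0
    first, last = context[0], context[-1]

    # the answer must start and end within the characters covered by this window
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return 0, 0

    # first context token that ends after the answer start
    token_start = first
    while offsets[token_start][1] <= start_char:
        token_start += 1

    # last context token that starts before the answer end
    token_end = last
    while offsets[token_end][0] >= end_char:
        token_end -= 1

    return token_start, token_end
```

Because the search is limited to context tokens, padding offsets never take part in the comparison.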

In [7]:
def evaluate_instance(instance, device):
    context = instance['context']
    question = instance['question']
    given_answer = instance['answers']['text'][0]
    inputs = tokenizer(question, context, return_tensors='pt', max_length=512, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        output = model(**inputs)
    start_idx = torch.argmax(output.start_logits)
    end_idx = torch.argmax(output.end_logits)
    predicted_answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_idx:end_idx + 1]))
    return predicted_answer.lower() == given_answer.lower()
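
The comparison above counts a prediction as correct only if it equals the reference string after lowercasing, so a stray space or trailing punctuation mark turns a good prediction into a miss. A softer, SQuAD-style comparison can be sketched as follows. The helpers `normalize`, `exact_match` and `token_f1` are hypothetical, and the normalisation (lowercase, drop ASCII punctuation, collapse whitespace) is an assumption, not the official evaluation script.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when both answers are equal after normalisation."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # two empty answers count as a match, otherwise there is no overlap
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Reporting the mean `token_f1` next to the exact-match accuracy would show whether the model is close to the reference answers or missing them entirely.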

In [8]:
# T5 for QA: https://huggingface.co/docs/transformers/v4.40.1/en/model_doc/t5#transformers.T5ForQuestionAnswering
# this one is multilingual (including russian): https://huggingface.co/google/mt5-base

model_name = 'google/mt5-base'

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForQuestionAnswering.from_pretrained(model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using a model of type mt5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.
Some weights of T5ForQuestionAnswering were not initialized from the model checkpoint at google/mt5-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
xquad_dataset = xquad_dataset['validation'].train_test_split(test_size=0.2) # first split into train and test
train_set = xquad_dataset['train']
validation_set = xquad_dataset['test']
validation_split_set = validation_set.train_test_split(test_size=0.5) # now test to validation and test
val_set = validation_split_set['train']
test_set = validation_split_set['test']

In [11]:
tokenized_train = train_set.map(prepare_train_features,
                                batched=True,
                                remove_columns=train_set.column_names)
tokenized_val = val_set.map(prepare_train_features,
                                batched=True,
                                remove_columns=val_set.column_names)

Map:   0%|          | 0/952 [00:00<?, ? examples/s]

Map:   0%|          | 0/119 [00:00<?, ? examples/s]

In [12]:
training_args = TrainingArguments(
    output_dir='./t5-results',
    num_train_epochs=5, # 20 epochs gave better results (and the losses suggest going even higher), but a smaller number is used here to limit compute time
    per_device_train_batch_size=4, # small batch size to avoid out-of-memory errors
    per_device_eval_batch_size=4, # same here (try smaller batch sizes if memory runs out; 16 was used for both previously)
    warmup_steps=0,
    weight_decay=0,
    logging_dir='./logs', # keeping logs for easier debugging
    logging_steps=10,
    fp16=True, # mixed precision
    do_train=True,
    do_eval=True,
    overwrite_output_dir=True,
    evaluation_strategy='epoch',
    learning_rate=5e-04,
    gradient_accumulation_steps = 8 # to deal with 'out of memory' error
)
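
T5-family checkpoints are often reported to be numerically unstable under `fp16`, which can show up as NaN or zero losses, so the mixed-precision setting deserves a second look. The sketch below picks `bf16` where it is available and otherwise falls back to full precision. `precision_flags` is a hypothetical helper, and whether it changes the outcome of this particular run is an assumption that has not been tested here.

```python
def precision_flags(cuda_available, bf16_supported):
    """Pick mixed-precision keyword arguments for TrainingArguments.

    Prefers bf16 where the GPU supports it and otherwise uses full
    precision instead of fp16.
    """
    if cuda_available and bf16_supported:
        return {"bf16": True, "fp16": False}
    return {"bf16": False, "fp16": False}


# bf16 available (for instance on an A100) versus not available
print(precision_flags(True, True))
print(precision_flags(True, False))
```

In the notebook the two booleans could come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, and the resulting dictionary could be unpacked into `TrainingArguments` in place of `fp16=True`.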

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer
)

In [14]:
model.to(device)

T5ForQuestionAnswering(
  (shared): Embedding(250112, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(250112, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): L

In [15]:
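trainer.train() # requires a wandb.ai key
# about 15-20 minutes on T4; faster on A100
# a training loss of exactly 0.0 with no validation loss suggests that training did not work properly
# (possibly a numerical issue with fp16), so poor results are expected on top of the limitations described above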

[34m[1mwandb[0m: Currently logged in as: [33mnpovarova97[0m ([33mnataliyap-test-org[0m). Use [1m`wandb login --relogin`[0m to force relogin



Epoch,Training Loss,Validation Loss
0,0.0,
1,0.0,
2,0.0,
3,0.0,
4,0.0,


TrainOutput(global_step=155, training_loss=0.0, metrics={'train_runtime': 807.1935, 'train_samples_per_second': 6.163, 'train_steps_per_second': 0.192, 'total_flos': 3018018721406976.0, 'train_loss': 0.0, 'epoch': 4.979919678714859})
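
The `global_step=155` above follows from the batch settings. A small sketch of the arithmetic, assuming the Trainer drops the incomplete accumulation group at the end of each epoch (which matches the logged numbers); `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(batches_per_epoch, grad_accum, epochs):
    """Number of optimizer updates for a run with gradient accumulation."""
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


per_device_batch = 4
grad_accum = 8
print(per_device_batch * grad_accum)  # effective batch size on one GPU

# 249 batches per epoch is inferred from the logged epoch value
# (155 * 8 / 249 = 4.9799...), not read from the notebook directly
print(optimizer_steps(249, grad_accum, 5))
```

So each update sees 32 tokenized windows, and five epochs amount to only 155 updates, a small budget for a model of this size.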

In [16]:
correct_count = 0
total_count = test_set.shape[0] # number of rows of a test set

for i in range(total_count):
    correct_count += evaluate_instance(test_set[i], device)

In [17]:
accuracy = correct_count / total_count
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 0.00%


## 1.2. Augmentation

In [19]:
# let's try to improve the results by augmenting the data with the English subset translated into Russian

en_xquad_dataset = load_dataset('xquad', 'xquad.en')

Downloading data:   0%|          | 0.00/212k [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/1190 [00:00<?, ? examples/s]

In [20]:
# reinitializing the russian subset as well

xquad_dataset = load_dataset('xquad', 'xquad.ru')

In [21]:
data_en = pd.DataFrame(en_xquad_dataset['validation'])
data = pd.DataFrame(xquad_dataset['validation'])

In [22]:
tqdm.pandas() # progress bar

In [24]:
def en_ru_translator(text):
    translator = GoogleTranslator(source='en', target='ru')
    return translator.translate(text)
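
In XQuAD several questions share the same context paragraph, so translating each distinct text once and reusing the result can cut the number of translation requests. A minimal sketch with `functools.lru_cache` follows. `make_cached_translator` is a hypothetical helper, and `fake_translate` only stands in for the real translator so that the example runs offline.

```python
from functools import lru_cache


def make_cached_translator(translate):
    """Wrap a text -> text function so repeated inputs are translated once."""

    @lru_cache(maxsize=None)
    def cached(text):
        return translate(text)

    return cached


# stand-in for translator.translate, used here only to count calls
calls = []


def fake_translate(text):
    calls.append(text)
    return text.upper()


cached_translate = make_cached_translator(fake_translate)
results = [cached_translate(t) for t in ["a", "b", "a", "a"]]
print(results, len(calls))
```

In the notebook the wrapper could be built once from `GoogleTranslator(source='en', target='ru').translate` and passed to `progress_apply` in place of `en_ru_translator`.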

In [25]:
data_en.loc[:, 'context'] = data_en.context.progress_apply(en_ru_translator)
data_en.loc[:, 'question'] = data_en.question.progress_apply(en_ru_translator)

100%|██████████| 1190/1190 [04:17<00:00,  4.63it/s]
100%|██████████| 1190/1190 [02:54<00:00,  6.81it/s]


In [26]:
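answers_list = data_en.answers.tolist()
translator = GoogleTranslator(source='en', target='ru')

# translate only the first reference answer of each example
# note: answer_start still holds character offsets into the English context,
# so it may not point at the answer in the translated context
for answer in answers_list:
    txt = answer['text'][0]
    t_txt = translator.translate(txt)
    answer['text'] = [t_txt]

data_en['answers'] = answers_list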
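
Translation changes text lengths, so the `answer_start` offsets copied from the English data generally no longer point at the answer in the translated context. An answer translated on its own may also not appear verbatim in the translated paragraph. A minimal sketch of re-aligning the span follows. `realign_answer` is a hypothetical helper, and taking the first occurrence is an assumption that can pick the wrong position when the answer text repeats.

```python
def realign_answer(context, answer_text):
    """Return an answers dict whose answer_start points into `context`.

    Returns None when the translated answer does not occur verbatim in the
    translated context, so the caller can drop that example.
    """
    start = context.find(answer_text)
    if start == -1:
        return None
    return {"text": [answer_text], "answer_start": [start]}


context = "Москва является столицей России."
print(realign_answer(context, "столицей России"))
print(realign_answer(context, "столица"))
```

Examples for which the helper returns `None` could be filtered out before building `augmented_data`, which trades some data for labels that match the text.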

In [27]:
augmented_data = pd.concat([data, data_en],
                           ignore_index=True,
                           axis=0)

In [28]:
augmented_dataset = Dataset.from_pandas(augmented_data)

In [29]:
X = augmented_dataset.train_test_split(test_size=0.2)
train = X['train']
validation = X['test']

validation_split = validation.train_test_split(test_size=0.5)
val = validation_split['train']
test = validation_split['test']

In [30]:
tokenized_train = train.map(prepare_train_features,
                            batched=True,
                            remove_columns=train.column_names)
tokenized_val = val.map(prepare_train_features,
                        batched=True,
                        remove_columns=val.column_names)

Map:   0%|          | 0/1904 [00:00<?, ? examples/s]

Map:   0%|          | 0/238 [00:00<?, ? examples/s]

In [31]:
torch.cuda.empty_cache()

In [32]:
training_args = TrainingArguments(
    output_dir='./t5-results',
    num_train_epochs=5, # 20 epochs gave better results (and the losses suggest going even higher), but a smaller number is used here to limit compute time
    per_device_train_batch_size=4, # small batch size to avoid out-of-memory errors
    per_device_eval_batch_size=4, # same here (try smaller batch sizes if memory runs out; 16 was used for both previously)
    warmup_steps=0,
    weight_decay=0,
    logging_dir='./logs', # keeping logs for easier debugging
    logging_steps=10,
    fp16=True, # mixed precision
    do_train=True,
    do_eval=True,
    overwrite_output_dir=True,
    evaluation_strategy='epoch',
    learning_rate=5e-04,
    gradient_accumulation_steps = 8 # to deal with 'out of memory' error
)

In [33]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer
)

In [34]:
model.to(device)

T5ForQuestionAnswering(
  (shared): Embedding(250112, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(250112, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo): L

In [35]:
trainer.train() # around 25-30 minutes on T4

Epoch,Training Loss,Validation Loss
0,0.0,
1,0.0,
2,0.0,
3,0.0,
4,0.0,


TrainOutput(global_step=305, training_loss=0.0, metrics={'train_runtime': 1560.2688, 'train_samples_per_second': 6.335, 'train_steps_per_second': 0.195, 'total_flos': 5936167573905408.0, 'train_loss': 0.0, 'epoch': 4.929292929292929})

In [36]:
correct_count = 0
total_count = test.shape[0]

for i in range(total_count):
    correct_count += evaluate_instance(test[i], device)

In [37]:
accuracy = correct_count / total_count
print(f'Accuracy: {accuracy * 100:.2f}%')

Accuracy: 0.00%


Since the accuracy did not improve, it doesn't make sense to proceed with the PDF-reading experiment. For this model specifically, the further work is the following: <br>
* try a smaller / optimized version, which can be trained for more epochs in a reasonable time;
* try more epochs (starting from 25, based on some previous experiments);
* try better data augmentation (check the translations by hand, collect more data, maybe generate some data using a generative model).