# <center> Project-5: Question Answering

**Case Description**

In this notebook we will train a Question Answering model using DeBERTa.

**Task**

Using the data to build a Question Answering model.

**Data**: It is a [SberQuAD](https://arxiv.org/pdf/1912.09723) (Sberbank Question Answering Dataset)

**ML/DL task**: Question Answering

*Training on GPU*

`Attention!!! We use only part of the dataset in order to save time to training and GPU reources and memory. That's why metric's final result could be low`

# 0. Install and Import

In [1]:
%%capture
!pip install transformers # the huggingface library containing the general-purpose architectures for NLP
!pip install datasets # the huggingface library containing datasets and evaluation metrics for NLP
!pip install evaluate
!pip install -U ipywidgets
!pip install optuna
!pip install -U accelerate

In [2]:
import os
import random
import numpy as np

import evaluate
import optuna

# pytorch libraries
import torch

import transformers
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer
from datasets import load_dataset, DatasetDict

import warnings
warnings.filterwarnings("ignore")

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["WANDB_DISABLED"] = "true"

In [3]:
# Fixing RANDOM_SEED to make experiment repetable
RANDOM_SEED = 42

# Set random seeds
def set_seed(seed):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    np.random.seed(seed)
    random.seed(seed)
#     tf.random.set_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
    
set_seed(RANDOM_SEED)

In [4]:
# Fixing package versions to make experiment repetable
!pip freeze > requirements.txt

In [5]:
model_name = "timpal0l/mdeberta-v3-base-squad2" # Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/453 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

In [6]:
# If we have a GPU available, we'll set our device to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# 1. Data Loading: dataset exploration

In [7]:
# Load the SBERQUAD dataset - https://huggingface.co/datasets/kuznetsoffandrey/sberquad
dataset = load_dataset("sberquad")
dataset

Downloading readme:   0%|          | 0.00/5.16k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.43M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.93M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/45328 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5036 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/23936 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 45328
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5036
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 23936
    })
})

There are several important fields here:

* **answers**: the starting location of the answer token and the answer text.
* **context**: background information from which the model needs to extract the answer.
* **question**: the question a model should answer.

In [8]:
"""
Let's look at the number of examples in train set
"""
print("Number of examples in Context: ", len(dataset["train"]["context"]))
print("Number of examples in Question: ", len(dataset["train"]["question"]))
print("Number of examples in Answer: ", len(dataset["train"]["answers"]))

Number of examples in Context:  45328
Number of examples in Question:  45328
Number of examples in Answer:  45328


In [9]:
# Examples of train dataset
print("Context: ", dataset["train"][1]["context"])
print("\nQuestion: ", dataset["train"][1]["question"])
print("\nAnswer: ", dataset["train"][1]["answers"])

Context:  В протерозойских отложениях органические остатки встречаются намного чаще, чем в архейских. Они представлены известковыми выделениями сине-зелёных водорослей, ходами червей, остатками кишечнополостных. Кроме известковых водорослей, к числу древнейших растительных остатков относятся скопления графито-углистого вещества, образовавшегося в результате разложения Corycium enigmaticum. В кремнистых сланцах железорудной формации Канады найдены нитевидные водоросли, грибные нити и формы, близкие современным кокколитофоридам. В железистых кварцитах Северной Америки и Сибири обнаружены железистые продукты жизнедеятельности бактерий.

Question:  что найдено в кремнистых сланцах железорудной формации Канады?

Answer:  {'text': ['нитевидные водоросли, грибные нити'], 'answer_start': [438]}


In [10]:
# Number of examples in validation set
print("Number of examples in Context (validation set): ", len(dataset["validation"]["context"]))
print("Number of examples in Question (validation set): ", len(dataset["validation"]["question"]))
print("Number of examples in Answer (validation set): ", len(dataset["validation"]["answers"]))

Number of examples in Context (validation set):  5036
Number of examples in Question (validation set):  5036
Number of examples in Answer (validation set):  5036


In [11]:
# Example of val set
print("Context (validation set): ", dataset["validation"][1]["context"])
print("\nQuestion (validation set): ", dataset["validation"][1]["question"])
print("\nAnswer (validation set): ", dataset["validation"][1]["answers"])

Context (validation set):  Первые упоминания о строении человеческого тела встречаются в Древнем Египте. В XXVII веке до н. э. египетский врач Имхотеп описал некоторые органы и их функции, в частности головной мозг, деятельность сердца, распространение крови по сосудам. В древнекитайской книге Нейцзин (XI—VII вв. до н. э.) упоминаются сердце, печень, лёгкие и другие органы тела человека. В индийской книге Аюрведа ( Знание жизни , IX-III вв. до н. э.) содержится большой объём анатомических данных о мышцах, нервах, типах телосложения и темперамента, головном и спинном мозге.

Question (validation set):  Когда египетский врач Имхотеп впервые описал некоторые органы и их функции?

Answer (validation set):  {'text': ['В XXVII веке до н. э.'], 'answer_start': [78]}


In [12]:
"""
If we use the whole dataset we'll lose a lot of time at the training stage - around 4 hours!!!
It will be better to use small part of the dataset in order to look how optuna hyperparams optimization works
"""
dataset_part = load_dataset("sberquad", split='train[:5000]+validation[:1250]')
dataset = dataset_part.train_test_split(train_size=5000)
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1250
    })
})

# 2. Steps before model building: data preparation

We want to get data at the next format:

[CLS] question [SEP] context [SEP]

In [13]:
"""
Check results
"""
context = dataset["train"][1]["context"]
question = dataset["train"][1]["question"]

inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])

'[CLS] Когда в состав РСД вошла Международная рабочая партия?[SEP] В марте 2011 года путём объединения Социалистического движения Вперед (российская секция Четвертого интернационала) и организации Социалистическое сопротивление было создано Российское социалистическое движение[22]. В дальнейшем, к нему присоединились другие троцкистские организации. В апреле 2011 года к РСД присоединилось пермское отделение Революционной рабочей партии. В мае 2012 года в состав РСД вошла Международная рабочая партия (российская секция Международной лиги трудящихся — Четвёртого интернационала)[23]. В составе РСД действуют группы сторонников Четвёртого интернационала и Международной лиги трудящихся — Четвёртого интернационала[24][23].[SEP]'

In [14]:
print("cls_token: ", tokenizer.cls_token)
print("sep_token: ", tokenizer.sep_token)
print("eos_token: ", tokenizer.eos_token)
print("pad_token: ", tokenizer.pad_token)

cls_token:  [CLS]
sep_token:  [SEP]
eos_token:  [SEP]
pad_token:  [PAD]


So we see that the model checkpoint we're using uses the [CLS] token to denote the start of the question, then a [SEP] token to mark between the question and the context, and then is ended with another [SEP] token. This is in accordance with how SQUAD is defined.

**Every transformer model**, no matter how powerful, **has a maximum sequence length that it can handle**.

Let's look at the length for the input IDs without any truncation

In [15]:
max_len = []

for i in range(0, len(dataset["train"]["question"])):
    lenght = len(tokenizer(dataset["train"][i]["question"], dataset["train"][i]["context"])["input_ids"])
    max_len.append(lenght)

In [16]:
print(f'Max length without any truncation: {max(max_len)}')
print(f'Index of example with max length: {max_len.index(max(max_len))}')

Max length without any truncation: 994
Index of example with max length: 4391


In [17]:
print(f'Minimum length without any truncation: {min(max_len)}')
print(f'Index of example with min length: {max_len.index(min(max_len))}')

Minimum length without any truncation: 103
Index of example with min length: 474


In [18]:
max_len[:5]

[325, 194, 310, 163, 202]

In [19]:
"""
Let's look at the length for the input IDs with any truncation
"""
max_len_trunc = []

for i in range(0, len(dataset["train"]["question"])):
    lenght_trunc = len(tokenizer(dataset["train"][i]["question"], dataset["train"][i]["context"],
                                 max_length=384, truncation="only_second")["input_ids"])
    max_len_trunc.append(lenght_trunc)

In [20]:
print(f'Max length after truncation: {max(max_len_trunc)}')
print(f'Index of example with max length: {max_len_trunc.index(max(max_len_trunc))}')

Max length after truncation: 384
Index of example with max length: 23


In [21]:
print(f'Minimum length after truncation: {min(max_len_trunc)}')
print(f'Index of example with min length: {max_len_trunc.index(min(max_len_trunc))}')

Minimum length after truncation: 103
Index of example with min length: 474


In fact, unlike RNNs, transformers need a fixed input length to match up to their internals. So even short input will need to be padded.

But how can we handle training examples where the question + context exceeds the max length of the current architecture? We don't just want to cut off the context, as the answer may be contained in it, and we'll lose valuable training data.

The solution is, similar to time-series applications, we'll window the data into smaller chunks. We only window the context, and the question is never modified.

### `There are a few preprocessing steps particular to question answering tasks:`

- Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting *truncation="only_second"*. Note that **we never want to truncate the question**, only the context, else the only_second truncation picked
- Map the start and end positions of the answer to the original context by setting *return_offset_mapping=True*
- Use the *sequence_ids* method to find which part of the offset corresponds to the **question** and which corresponds to the **context**

In the labeled dataset, **answer_start** gives us the correponding location of the answer within the context string. Note that it's relative to the start of the context string, not the question + context. The answer **text** gives us the actual plaintext answer, from which we can easily calculate the answer_end position as just **answer_start plus the length of the answer**.

This format is not sufficient to train from — we'll need labels for both start and end positions.

In [22]:
MAX_LENGTH = 384 #The maximum length of a feature (question and context)
STRIDE = 128 #The authorized overlap between two part of the context when splitting it is needed. 64 - corrected else: PanicException: assertion failed: stride < max_len when using Question-Answering pipeline

We need to use **different functions to preprocess training and validation dataset** by information from [repository](https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py) (fine-tuning models for the QA task).

But during training process I want to check two key metrics used by many question answering datasets, including SQuAD: **exact match (EM)** and **F1 score**. That's why I'll use the same function to preprocess all data.

Next you will see 2 ocode snippets.

In [23]:
"""
Different fuctions to preprocess training and validation examples

preprocess_validation_examples: If there is no need to generate class labels for tokens.
"""
# # Training set
# # Function to truncate and map the start and end tokens of the answer to the context
# def preprocess_training_examples(examples):
#     questions = [q.strip() for q in examples['question']]
#     inputs = tokenizer(
#         questions,
#         examples['context'],
#         max_length=MAX_LENGTH,
#         truncation='only_second',
#         stride=STRIDE,
#         # return_overflowing_tokens=True,
#         return_offsets_mapping=True,
#         padding='max_length',
#         )

#     # sample_map = inputs.pop('overflow_to_sample_mapping')
#     offset_mapping = inputs['offset_mapping']
#     answers = examples['answers']
#     start_positions = []
#     end_positions = []

#     for (i, offset) in enumerate(offset_mapping):
#         # sample_idx = sample_map[i]
#         answer = answers[i]
#         start_char = answer['answer_start'][0]
#         end_char = answer['answer_start'][0] + len(answer['text'][0])
#         sequence_ids = inputs.sequence_ids(i)

#     # Find the start and end of the context

#         idx = 0
#         while sequence_ids[idx] != 1:
#             idx += 1
#         context_start = idx
#         while sequence_ids[idx] == 1:
#             idx += 1
#         context_end = idx - 1

#     # If the answer is not fully inside the context, label is (0, 0)

#         if offset[context_start][0] > end_char \
#             or offset[context_end][1] < start_char:
#             start_positions.append(0)
#             end_positions.append(0)
#         else:

#       # Otherwise it's the start and end token positions

#             idx = context_start
#             while idx <= context_end and offset[idx][0] <= start_char:
#                 idx += 1
#             start_positions.append(idx - 1)

#             idx = context_end
#             while idx >= context_start and offset[idx][1] >= end_char:
#                 idx -= 1
#             end_positions.append(idx + 1)

#     inputs['start_positions'] = start_positions
#     inputs['end_positions'] = end_positions
    
#     return inputs


# # To apply the preprocessing function over the entire dataset
# # We can speed up the map function by setting batched=True to process multiple elements of the dataset at once
# train_dataset = dataset_part["train"].map(
#     preprocess_training_examples,
#     batched=True,
#     remove_columns=dataset["train"].column_names
# )


# # Validation/Test set
# # Function to truncate and map the start and end tokens of the answer to the context
# def preprocess_validation_examples(examples):
#     questions = [q.strip() for q in examples['question']]
#     inputs = tokenizer(
#         questions,
#         examples['context'],
#         max_length=MAX_LENGTH,
#         truncation='only_second',
#         stride=STRIDE,
#         # return_overflowing_tokens=True,
#         return_offsets_mapping=True,
#         padding='max_length',
#         )

#     # sample_map = inputs.pop('overflow_to_sample_mapping')
#     example_ids = []

#     for i in range(len(inputs['input_ids'])):
#         # sample_idx = sample_map[i]
#         example_ids.append(examples['id'][i])

#         sequence_ids = inputs.sequence_ids(i)
#         offset = inputs['offset_mapping'][i]
#         inputs['offset_mapping'][i] = [(o if sequence_ids[k]
#                 == 1 else None) for (k, o) in enumerate(offset)]


#     inputs['example_id'] = example_ids

#     return inputs


# validation_dataset = dataset_part["test"].map(
#     preprocess_validation_examples,
#     batched=True,
#     remove_columns=dataset_part["test"].column_names
# )

'\nDifferent fuctions to preprocess training and validation examples\n\npreprocess_validation_examples: If there is no need to generate class labels for tokens.\n'

In [24]:
"""
The same function for both dataset
"""
def preprocess_examples(examples):
    questions = [q.strip() for q in examples['question']]
    inputs = tokenizer(
        questions,
        examples['context'],
        max_length=MAX_LENGTH,
        truncation='only_second', # Only truncate/window the context, not question!
        stride=STRIDE,
        return_offsets_mapping=True, #To also pass the truncated tokens to the model, we can use this parameter
        padding='max_length', # Added a padding strategy to make all batches same
        )

    offset_mapping = inputs['offset_mapping']
    answers = examples['answers']
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer['answer_start'][0]
        end_char = answer['answer_start'][0] + len(answer['text'][0])
        sequence_ids = inputs.sequence_ids(i)

    # Find the start and end of the context

        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

    # If the answer is not fully inside the context, label is (0, 0)

        if offset[context_start][0] > end_char \
            or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:

      # Otherwise it's the start and end token positions

            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs['start_positions'] = start_positions
    inputs['end_positions'] = end_positions
    
    return inputs

In [25]:
# DATASETS_for_optuna - Reduce the dataset to speed up the process of selecting hyperparameters
part_of_data = 0.1

DATASETS_for_optuna = DatasetDict({
    'train': dataset["train"].map(
        preprocess_examples,
        batched=True).select(
            np.random.choice(range(len(dataset["train"])), int(len(dataset["train"])*part_of_data), replace=False)
        ),
    'validation': dataset["test"].map(
        preprocess_examples,
        batched=True).select(
            np.random.choice(range(len(dataset["test"])), int(len(dataset["test"])*part_of_data), replace=False)
        )
})
DATASETS_for_optuna

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 500
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 125
    })
})

In [26]:
"""
To apply the preprocessing function over the entire dataset - train and validation
We can speed up the map function by setting batched=True to process multiple elements of the dataset at once
"""

DATASETS = DatasetDict({
    'train': dataset["train"].map(
        preprocess_examples,
        batched=True).select(
            np.random.choice(range(len(dataset["train"])), int(len(dataset["train"])), replace=False)
        ),
    'validation': dataset["test"].map(
        preprocess_examples,
        batched=True).select(
            np.random.choice(range(len(dataset["test"])), int(len(dataset["test"])), replace=False)
        )
})
DATASETS

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 1250
    })
})

# 3. Model training


Key steps
* Define metric computation function
* Training model with base hyperparameters
* Getting the best hyperparameters by optuna (automatic hyperparameter optimizations)
* Training model with the best hyperparameters

`Attention!!! We use only part of the dataset in order to save time to training and GPU resources. That's why metric's final result could be low`.

## 3.1 Metric computation function

In [27]:
# Define metric to compute
metric = evaluate.load("squad")

Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]

In [28]:
def compute_metrics_for_optuna(eval_preds):
    y_pred = np.argmax(eval_preds[0], -1).T

    f1_score = 0
    exact_match = 0
    
    for data, pred in zip(DATASETS_for_optuna['validation'], y_pred):
        # Convert answer start and end into characters positions in
        # original text using the offset mapping list
        start_char = data['offset_mapping'][pred[0]][0]
        end_char = data['offset_mapping'][pred[1]][1]
        
        # Create predictions and references dictionaries for metric function
        predictions = [{'prediction_text': data['context'][start_char:end_char],
                        'id': str(data['id'])}]
        references = [{'answers': data['answers'],
                       'id': str(data['id'])}]
        results = metric.compute(predictions=predictions,
                                 references=references)
        
        # Add metric to running sum variable to calculate average after,
        # change outputs from 0-100 range to 0-1 range
        f1_score += results['f1'] / 100
        exact_match += results['exact_match'] / 100
        
    # Calculate the average
    f1_score /= len(DATASETS_for_optuna['validation'])
    exact_match /= len(DATASETS_for_optuna['validation'])
    
    return {'f1': f1_score, 'exact_match': exact_match}

In [29]:
def compute_metrics(eval_preds):
    y_pred = np.argmax(eval_preds[0], -1).T

    f1_score = 0
    exact_match = 0
    
    for data, pred in zip(DATASETS['validation'], y_pred):
        # Convert answer start and end into characters positions in
        # original text using the offset mapping list
        start_char = data['offset_mapping'][pred[0]][0]
        end_char = data['offset_mapping'][pred[1]][1]
        
        # Create predictions and references dictionaries for metric function
        predictions = [{'prediction_text': data['context'][start_char:end_char],
                        'id': str(data['id'])}]
        references = [{'answers': data['answers'],
                       'id': str(data['id'])}]
        results = metric.compute(predictions=predictions,
                                 references=references)
        
        # Add metric to running sum variable to calculate average after,
        # change outputs from 0-100 range to 0-1 range
        f1_score += results['f1'] / 100
        exact_match += results['exact_match'] / 100
        
    # Calculate the average
    f1_score /= len(DATASETS['validation'])
    exact_match /= len(DATASETS['validation'])
    
    return {'f1': f1_score, 'exact_match': exact_match}

## 3.2 Training model with base hyperparams

In [31]:
# TRAINING HYPERPARAMS
BATCH_SIZE = 12
NUM_EPOCHS = 5
LR = 3e-5
WD = 0.01
GRAD_ACC = 8
WARMUP = 0.1

In [32]:
training_args_base = TrainingArguments("mdeberta-squad-base-params",
                                  evaluation_strategy="steps",
                                  eval_steps=50,
                                  logging_steps=50,
                                  save_steps=100,
                                  optim="adamw_torch",
                                  learning_rate=LR,
                                  per_device_train_batch_size=BATCH_SIZE,
                                  per_device_eval_batch_size=BATCH_SIZE,
                                  warmup_steps=50,
                                  lr_scheduler_type='cosine',
                                  weight_decay=WD,
                                  warmup_ratio=WARMUP,
                                  gradient_accumulation_steps=GRAD_ACC,
                                  num_train_epochs=NUM_EPOCHS)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [33]:
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

config.json:   0%|          | 0.00/879 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

In [34]:
trainer_base = Trainer(model,
                  training_args_base,
                  train_dataset=DATASETS['train'],
                  eval_dataset=DATASETS['validation'],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)

In [35]:
torch.cuda.empty_cache()
os.environ["WANDB_DISABLED"] = "true"

# 16/12/2024
trainer_base.train()

Step,Training Loss,Validation Loss,F1,Exact Match
50,2.5374,1.784241,0.844313,0.652
100,1.5956,1.680552,0.834757,0.6472


TrainOutput(global_step=130, training_loss=1.9172415219820462, metrics={'train_runtime': 1610.3232, 'train_samples_per_second': 15.525, 'train_steps_per_second': 0.081, 'total_flos': 4879021147324416.0, 'train_loss': 1.9172415219820462, 'epoch': 4.976076555023924})

In [36]:
"""
Save the model
"""
trainer_base.save_model('./mdeberta-finetuned_base_params-QA_16_12_24')

## Evaluate

In [37]:
dataset = load_dataset("sberquad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 45328
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5036
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 23936
    })
})

In [38]:
test_dataset = dataset['test'].map(
    preprocess_examples,
    batched=True)

Map:   0%|          | 0/23936 [00:00<?, ? examples/s]

In [39]:
# predict + compute metrics on our test set
trainer_base.eval_dataset = test_dataset
trainer_base.evaluate()

{'eval_loss': 7.6507954597473145,
 'eval_f1': 0.037796401565128666,
 'eval_exact_match': 0.0,
 'eval_runtime': 562.6968,
 'eval_samples_per_second': 42.538,
 'eval_steps_per_second': 1.774,
 'epoch': 4.976076555023924}

## 3.3 Getting the best hyperparameters by optuna

`CREATE OPTUNA STUDY`

In [30]:
# hyperparameters - https://python-bloggers.com/2022/08/hyperparameter-tuning-a-transformer-with-optuna/
LR_MIN = 4e-5 # Learning rate minimum and maximum (ceiling) named LR_MIN and LR_CEIL
LR_CEIL = 0.01
WD_MIN = 4e-5 # Weight decay minimum and ceilling named WD_MIN and WD_CEIL
WD_CEIL = 0.01
WR_MIN = 0.01
WR_CEIL = 0.2
MIN_GRAD_ACC = 1
MAX_GRAD_ACC = 5
MIN_EPOCHS = 2 # Minimum and maximum epochs named MIN_EPOCHS and MAX_EPOCHS
MAX_EPOCHS = 5
PER_DEVICE_EVAL_BATCH = 10 # per device evaluation batch sizes for the training and evaluation sets
PER_DEVICE_TRAIN_BATCH = 10
NUM_TRIALS = 3 # number of Optuna trials to implement – incrementing this will perform multiple hyperparameter trials for each individual permutation and setting
SAVE_DIR = 'optuna-test' # SAVE_DIR is the name of the folder to save it to
NAME_OF_MODEL = 'optuna_bp' # NAME_OF_MODEL is what I want to call my serialised and fine tuned transformer network

In [31]:
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

config.json:   0%|          | 0.00/879 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

In [32]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache()
model.to(device)

DebertaV2ForQuestionAnswering(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(251000, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=768, out_features=768, bias=True)
              (key_proj): Linear(in_features=768, out_features=768, bias=True)
              (value_proj): Linear(in_features=768, out_features=768, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True

In [33]:
def objective(trial: optuna.Trial):     
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    
    training_args = TrainingArguments(         
        output_dir=SAVE_DIR, 
        optim="adamw_torch",
        learning_rate=trial.suggest_loguniform('learning_rate', low=LR_MIN, high=LR_CEIL),         
        weight_decay=trial.suggest_loguniform('weight_decay', WD_MIN, WD_CEIL),
        warmup_ratio=trial.suggest_loguniform('warmup_ratio', WR_MIN, WR_CEIL),
        gradient_accumulation_steps=trial.suggest_int('gradient_accumulation_steps', low = MIN_GRAD_ACC,high = MAX_GRAD_ACC),
        num_train_epochs=trial.suggest_int('num_train_epochs', low = MIN_EPOCHS,high = MAX_EPOCHS),         
        per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH,         
        per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
        seed = RANDOM_SEED,
        lr_scheduler_type='cosine')     

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=DATASETS_for_optuna['train'],
        eval_dataset=DATASETS_for_optuna['validation'],
        compute_metrics=compute_metrics_for_optuna)      
    
    result = trainer.train()
    
    return result.training_loss

In [34]:
def print_custom(text):
    print('\n')
    print(text)
    print('-'*100)

In [35]:
print_custom('Triggering Optuna study')
study = optuna.create_study(study_name='hp-search-electra', direction='minimize') 
study.optimize(func=objective, n_trials=NUM_TRIALS)

# This can be used to train the final model. Passed through using kwargs into the model
print_custom('Finding study best parameters')
best_lr = float(study.best_params['learning_rate'])
best_weight_decay = float(study.best_params['weight_decay'])
best_warmup_ratio = float(study.best_params['warmup_ratio'])
best_gradient_accumulation_steps = int(study.best_params['gradient_accumulation_steps'])
best_epoch = int(study.best_params['num_train_epochs'])

print_custom('Extract best study params')
print(f'The best learning rate is: {best_lr}')
print(f'The best weight decay is: {best_weight_decay}')
print(f'The best warmup ratio is: {best_warmup_ratio}')
print(f'The best gradient accumulation step is : {best_gradient_accumulation_steps}')
print(f'The best epoch is : {best_epoch}')

print_custom('Create dictionary of the best hyperparameters')
best_hp_dict = {
    'best_learning_rate': best_lr,
    'best_weight_decay': best_weight_decay,
    'best_warmup_ratio': best_warmup_ratio,
    'best_gradient_accumulation_steps': best_gradient_accumulation_steps,
    'best_epoch': best_epoch
}

[I 2024-12-17 07:38:21,071] A new study created in memory with name: hp-search-electra
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).




Triggering Optuna study
----------------------------------------------------------------------------------------------------


Step,Training Loss


[I 2024-12-17 07:40:58,623] Trial 0 finished with value: 1.7351736450195312 and parameters: {'learning_rate': 8.224316278391225e-05, 'weight_decay': 0.0016061074719990397, 'warmup_ratio': 0.013727255778614706, 'gradient_accumulation_steps': 5, 'num_train_epochs': 5}. Best is trial 0 with value: 1.7351736450195312.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Step,Training Loss


[I 2024-12-17 07:43:19,062] Trial 1 finished with value: 5.449199829101563 and parameters: {'learning_rate': 0.00230422656093221, 'weight_decay': 0.006406776712275002, 'warmup_ratio': 0.017113996598093775, 'gradient_accumulation_steps': 1, 'num_train_epochs': 4}. Best is trial 0 with value: 1.7351736450195312.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Step,Training Loss


[I 2024-12-17 07:45:28,625] Trial 2 finished with value: 1.917277216911316 and parameters: {'learning_rate': 5.2596423028803626e-05, 'weight_decay': 9.040727861254416e-05, 'warmup_ratio': 0.044475576825465414, 'gradient_accumulation_steps': 3, 'num_train_epochs': 4}. Best is trial 0 with value: 1.7351736450195312.




Finding study best parameters
----------------------------------------------------------------------------------------------------


Extract best study params
----------------------------------------------------------------------------------------------------
The best learning rate is: 8.224316278391225e-05
The best weight decay is: 0.0016061074719990397
The best warmup ratio is: 0.013727255778614706
The best gradient accumulation step is : 5
The best epoch is : 5


Create dictionary of the best hyperparameters
----------------------------------------------------------------------------------------------------


## 3.4 Training model with the best hyperparameters

In [36]:
training_args_bp = TrainingArguments("mdeberta-squad-best-params",
                                     evaluation_strategy="steps",
                                     eval_steps=100,
                                     logging_steps=100,
                                     # save_steps=100,
                                     optim="adamw_torch",
                                     learning_rate=best_lr,
                                     per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH,
                                     per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
                                     warmup_steps=50,
                                     seed = RANDOM_SEED,
                                     lr_scheduler_type='cosine',
                                     save_strategy = "no", # to avoid saving anything and save the final model once training is done with .save_model()
                                     fp16=True, #reduce the memory footprint - If you have an error as No space left on device during training&saving results
                                     weight_decay=best_weight_decay,
                                     warmup_ratio=best_warmup_ratio,
                                     gradient_accumulation_steps=best_gradient_accumulation_steps,
                                     num_train_epochs=best_epoch)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [37]:
trainer_bp = Trainer(model,
                  training_args_bp,
                  train_dataset=DATASETS['train'],
                  eval_dataset=DATASETS['validation'],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)

In [38]:
torch.cuda.empty_cache()
os.environ["WANDB_DISABLED"] = "true"

# 17/12/2024
trainer_bp.train()

Step,Training Loss,Validation Loss,F1,Exact Match
100,1.9762,1.641488,0.837407,0.6416
200,1.1279,1.904699,0.75872,0.5616


TrainOutput(global_step=250, training_loss=1.4074424591064454, metrics={'train_runtime': 1636.472, 'train_samples_per_second': 15.277, 'train_steps_per_second': 0.153, 'total_flos': 4899402662400000.0, 'train_loss': 1.4074424591064454, 'epoch': 5.0})

**Some observations**:
* Metrics reduction occurs due to the choice of hyperparameters based on a very small sample of the dataset

In [39]:
"""
Save the model
"""
trainer_bp.save_model('./mdeberta-finetuned-best_params-QA_17_12_24')

# 4. Evaluate

In [40]:
dataset = load_dataset("sberquad")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 45328
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5036
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 23936
    })
})

In [None]:
# dataset_test_part = dataset["test"].train_test_split()
# dataset_test_part['test']

In [41]:
test_dataset = dataset['test'].map(
    preprocess_examples,
    batched=True)

Map:   0%|          | 0/23936 [00:00<?, ? examples/s]

In [42]:
# predict + compute metrics on our test set
trainer_bp.eval_dataset = test_dataset
trainer_bp.evaluate()

{'eval_loss': 9.664545059204102,
 'eval_f1': 0.038647634497779845,
 'eval_exact_match': 0.0,
 'eval_runtime': 557.6996,
 'eval_samples_per_second': 42.919,
 'eval_steps_per_second': 2.146,
 'epoch': 5.0}

**Some observations**:
* As you can see, there is no any rapidly changes between metrics as the result of training model based on the base hyperparams VS hyperparams getting from automatic hyperparameter optimizations.
* Moreover loss and metric values are not stable (and worse than at the first trainig experiments) during training based on the best hyperparams because I got only small part of the dataset for that.
* It will be better to use more data for getting best hyperparams and training model if you have enough resources (time and GPU/CPU memory).

But in this case I want to check only:
* how does optuna work for transformer model
* baseline value of f_1 score and exact_match

And we get it!