# <center> Project-5: Question Answering

**Case Description**

In this notebook we fine-tune a 🤗 Transformers model (DeBERTa) on a Question Answering (QA) task: extracting the answer to a question from a given context. The model selects a span of the input passage as the answer; it does not generate new text.

**Task**

Extract the answer to a question with a QA transformer model.

**Data**: [SberQuAD](https://arxiv.org/pdf/1912.09723) (Sberbank Question Answering Dataset)

**ML/DL task**: NLP task - Question Answering

*Training on GPU*

`Attention! We use only part of the dataset to save training time, GPU resources, and memory.`

# 0. Install and Import

In [1]:
%%capture
!pip install transformers # the huggingface library containing the general-purpose architectures for NLP
!pip install datasets # the huggingface library containing datasets and evaluation metrics for NLP
!pip install evaluate
!pip install -U ipywidgets
!pip install optuna
!pip install -U accelerate

In [2]:
import os
import gc
import random
import numpy as np
import pandas as pd
import collections

from tqdm.auto import tqdm
from IPython.display import display, HTML

import evaluate
import optuna

# pytorch libraries
import torch

import transformers
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, pipeline
from datasets import load_dataset, DatasetDict, ClassLabel, Sequence

import warnings
warnings.filterwarnings("ignore")

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["WANDB_DISABLED"] = "true"

# cache_dir = os.makedirs("cache",exist_ok=True)
# os.environ['TRANSFORMERS_CACHE'] = "cache"
# os.environ['HF_DATASETS_CACHE'] = "cache"

In [3]:
%%capture
os.system('python -m pip install datasets --upgrade')
os.environ['TRANSFORMERS_CACHE'] = "/opt/ml/checkpoints/"
os.environ['HF_DATASETS_CACHE'] = "/opt/ml/checkpoints/"

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 9.1 MB/s eta 0:00:00
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.21.0
    Uninstalling datasets-2.21.0:
      Successfully uninstalled datasets-2.21.0
Successfully installed datasets-3.2.0


In [4]:
# Fixing RANDOM_SEED to make the experiment repeatable
RANDOM_SEED = 42

# Set random seeds
def set_seed(seed):
    """
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).

    Args:
        seed (:obj:`int`): The seed to set.
    """
    np.random.seed(seed)
    random.seed(seed)
#     tf.random.set_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
    
set_seed(RANDOM_SEED)

In [5]:
# Fixing package versions to make the experiment repeatable
!pip freeze > requirements.txt

In [6]:
# Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors.
# Set those two parameters, then the rest of the notebook should run smoothly:
model_checkpoint = "timpal0l/mdeberta-v3-base-squad2" # Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
BATCH_SIZE = 10

In [7]:
# If we have a GPU available, we'll set our device to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# 1. Data Loading: dataset exploration

We will use the 🤗 [Datasets](https://github.com/huggingface/datasets) library to download the data. This is easily done with the `load_dataset` function:

In [8]:
# Load the SBERQUAD dataset - https://huggingface.co/datasets/kuznetsoffandrey/sberquad
dataset_full = load_dataset("sberquad")

Downloading readme:   0%|          | 0.00/5.16k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/3.43M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.93M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/45328 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5036 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/23936 [00:00<?, ? examples/s]

In [9]:
dataset_full

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 45328
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 5036
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 23936
    })
})

The `dataset_full` object is a `DatasetDict`, which contains one key for each of the training, validation, and test sets.

There are several important fields here:
* **context**: background information from which the model needs to extract the answer
* **question**: the question a model should answer
* **answers**: the starting character position of the answer in the context and the answer text

In [10]:
# To access an actual element, you need to select a split first, then give an index:
print("Context: ", dataset_full["train"][1]["context"])
print("\nQuestion: ", dataset_full["train"][1]["question"])
print("\nAnswer: ", dataset_full["train"][1]["answers"])

Context:  В протерозойских отложениях органические остатки встречаются намного чаще, чем в архейских. Они представлены известковыми выделениями сине-зелёных водорослей, ходами червей, остатками кишечнополостных. Кроме известковых водорослей, к числу древнейших растительных остатков относятся скопления графито-углистого вещества, образовавшегося в результате разложения Corycium enigmaticum. В кремнистых сланцах железорудной формации Канады найдены нитевидные водоросли, грибные нити и формы, близкие современным кокколитофоридам. В железистых кварцитах Северной Америки и Сибири обнаружены железистые продукты жизнедеятельности бактерий.

Question:  что найдено в кремнистых сланцах железорудной формации Канады?

Answer:  {'text': ['нитевидные водоросли, грибные нити'], 'answer_start': [438]}


We can see the answers are indicated by their start position in the text (here at character 438) and their full text, which is a substring of the context as we mentioned above.

`To get a sense of what the data looks like, the following function will show some examples picked randomly from the dataset and decoded back to strings:`

In [11]:
def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    
    picks = []
    
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
            
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    
    for column, typ in dataset.features.items():
        
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
            
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(
                lambda x: [typ.feature.names[i] for i in x]
            )
            
    display(HTML(df.to_html()))

In [12]:
show_random_elements(dataset_full["train"])

Unnamed: 0,id,title,context,question,answers
0,36049,SberChallenge,"В начале XX века Тебриз был пристанищем множества радикальных организаций, в силу чего играл важную роль в Конституционной революции каджарского Ирана 1905—1911 годов. Уже в 1908 году из города были изгнаны сторонники Мохаммед-Али шаха, который, будучи принцем, занимал пост губернатора Тебриза. Одним из главных требований протестующих было создание нового меджлиса. В ответ на революцию, английские и русские войска вторглись в Иран, чтобы её подавить. Последние заняли Тебриз, разоружили радикальные революционные группы, но при этом не признавали и отказывали возможности проехать в город шахскому губернатору. В итоге Мохаммед Али-шах вынужден был уехать в Россию, а к власти в 1909 году пришёл новый правитель Султан Ахмад-шах, последний из династии Каджаров[4].",Чьи сторонники были изгнаны в 1908 году во время протестов в Иране в начале XX века?,"{'text': ['Мохаммед-Али шаха'], 'answer_start': [218]}"
1,22571,SberChallenge,"В России первый алмаз был найден 5 июля 1829 года на Урале в Пермской губернии на Крестовоздвиженском золотом прииске четырнадцатилетним крепостным Павлом Поповым, который нашёл алмаз, промывая золото в шлиховом лотке. За полукаратный кристалл Павел получил вольную. Павел привёл учёных, участников экспедиции немецкого учёного Александра Гумбольдта, на то место, где он нашёл первый алмаз (сейчас это место называется Алмазный ключик (по одноимённому источнику) и расположено приблизительно в 1 км от пос. Промысла́ недалеко от старой дороги, связывающей посёлки Промысла́ и Тёплая Гора Горнозаводского района Пермского края), и там было найдено ещё два небольших кристалла. За 28 лет дальнейших поисков был найден только 131 алмаз общим весом в 60 карат.",Что получил крепостной Павел Попов за находку алмаза?,"{'text': ['Вольную.'], 'answer_start': [-1]}"
2,68245,SberChallenge,"В конце XVIII — начале XIX столетий в репертуаре театра появились оперы итальянских композиторов П. Анфосси, П. Гульельми, Д. Чимарозы, Л. Керудини, Дж. Паизиелло, С. Майра. В 1812 году на сцене театра состоялась премьера оперы Дж. Россини Пробный камень . Она положила начало так называемому россиниевскому периоду. Театр Ла Скала первым поставил его оперы Аурельяно в Пальмире (1813), Турок в Италии (1814), Сорока-воровка (1817) и др. Одновременно театр ставил широко известные оперы Россини.",В каком году состоялась премьера оперы Дж. Россини Пробный камень ?,"{'text': ['1812'], 'answer_start': [176]}"
3,71864,SberChallenge,"Бассейн каждого водоёма включает в себя поверхностный и подземный водосборы. Поверхностный водосбор представляет собой участок земной поверхности, с которого поступают воды в данную речную систему или определённую реку. Подземный водосбор образуют толщи рыхлых отложений, из которых вода поступает в речную сеть. В общем случае поверхностный и подземный водосборы не совпадают. Но так как определение границы подземного водосбора практически очень сложно, то за величину речного бассейна принимается только поверхностный водосбор.",Что образует подземный водосбор?,"{'text': ['толщи рыхлых отложений, из которых вода поступает в речную сеть'], 'answer_start': [375]}"
4,35398,SberChallenge,"Институционализации, как показывают П. Бергер и Т. Лукман, предшествует процесс хабитуализации, или опривычивания повседневных действий, приводящий к формированию образцов деятельности, которые в дальнейшем воспринимаются как естественные и нормальные для данного рода занятий или решения типичных в данных ситуациях проблем. Образцы действий выступают, в свою очередь, основой для формирования социальных институтов, которые описываются в виде объективных социальных фактов и воспринимаются наблюдателем как социальная реальность (или социальная структура). Эти тенденции сопровождаются процедурами сигнификации (процесс создания, употребления знаков и фиксации значений и смыслов в них) и формируют систему социальных значений, которые, складываясь в смысловые связи, фиксируются в естественном языке. Сигнификация служит целям легитимации (признание правомочным, общественно признанным, законным) социального порядка, то есть оправдания и обоснования привычных способов преодоления хаоса деструктивных сил, угрожающих подорвать стабильные идеализации повседневной жизни.","Как в дальнейшем воспринимаются образцы деятельности, сформированные в процессе хабитуализации?","{'text': ['Как естественные и нормальные для данного рода занятий.'], 'answer_start': [-1]}"
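
Two of the rows sampled above carry an `answer_start` of -1, and their answer text does not appear verbatim in the context. Rows like these cannot be turned into reliable token-level labels from character offsets. Below is a minimal sketch of a check that could be run before preprocessing; `has_aligned_answer` is a hypothetical helper name and the two records are made-up toy data, not rows from SberQuAD.

```python
def has_aligned_answer(example):
    """Return True if the first answer is a verbatim slice of the context."""
    text = example["answers"]["text"][0]
    start = example["answers"]["answer_start"][0]
    if start < 0:
        # A negative offset means the answer was not located in the context.
        return False
    return example["context"][start:start + len(text)] == text


# Toy records (assumed for illustration only).
context = "The premiere took place in 1812 in Milan."
good = {
    "context": context,
    "answers": {"text": ["1812"], "answer_start": [context.index("1812")]},
}
bad = {
    "context": context,
    "answers": {"text": ["In Milan."], "answer_start": [-1]},
}

print(has_aligned_answer(good), has_aligned_answer(bad))
```

If dropping such rows is acceptable for the experiment, the same function could be passed to the `filter` method of a 🤗 Datasets split; whether to drop or repair them is a judgment call that this notebook does not settle.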


***

In [13]:
"""
Training on the whole dataset takes a long time (around 4 hours),
so we use a small part of it to see how Optuna hyperparameter optimization works.
"""
ds_part = load_dataset("sberquad", split='train[:8000]+validation[:4000]')
ds_train_val = ds_part.train_test_split(train_size=8000, seed=RANDOM_SEED)

ds_devtest = ds_train_val['test'].train_test_split(test_size=0.5, seed=RANDOM_SEED)

ds_final = DatasetDict({
    'train': ds_train_val['train'],
    'validation': ds_devtest['train'],
    'test': ds_devtest['test']
})

print("Part of the dataset for project: \n", ds_final)

Part of the dataset for project: 
 DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 2000
    })
})


# 2. Data preprocessing

## 2.1 Data preprocessing: tokenization
Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers tokenizer, which tokenizes the inputs, puts them in the format the model expects, and generates the other inputs that the model requires.

We want the data in the following format:

[CLS] question [SEP] context [SEP]

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:
- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint

In [14]:
"""
That vocabulary will be cached, so it's not downloaded again the next time we run the cell
"""
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/453 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/16.3M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/23.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

The following assertion ensures that our tokenizer is a fast tokenizer from the 🤗 Tokenizers library.
Fast tokenizers are available for almost all models, and we will need some of their special features for our preprocessing.

`You can check which models have a fast tokenizer available in the` [big table of models](https://huggingface.co/docs/transformers/index#bigtable)

In [15]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [16]:
"""
Call this tokenizer on two sentences (one for the question, one for the context)
"""
context = ds_final["train"][1]["context"]
question = ds_final["train"][1]["question"]

inputs = tokenizer(question, context)
print(inputs)
print("\n")
print(tokenizer.decode(inputs["input_ids"]))

{'input_ids': [1, 260, 15089, 43544, 434, 51646, 260, 53046, 3588, 140065, 148362, 6111, 280, 31476, 11408, 260, 230129, 309, 57683, 106364, 309, 15204, 151261, 42291, 138732, 260, 292, 2, 13574, 17355, 1966, 21785, 834, 260, 230129, 309, 42291, 138732, 43544, 167176, 3753, 140065, 148362, 6111, 280, 260, 1803, 31476, 11408, 261, 5779, 426, 658, 32610, 19529, 3927, 9329, 355, 311, 40966, 325, 69370, 412, 62045, 260, 25073, 13821, 42257, 82714, 13821, 14721, 11457, 19452, 544, 389, 149103, 98313, 1610, 545, 55329, 260, 76030, 818, 261, 260, 32307, 316, 108608, 10187, 13676, 15352, 260, 45251, 66584, 6900, 1840, 125999, 4190, 80933, 260, 412, 260, 230129, 325, 42291, 15501, 325, 262, 1050, 316, 18159, 355, 27993, 280, 260, 1803, 82125, 188845, 31300, 3672, 260, 153246, 947, 7749, 51780, 265, 178852, 355, 918, 1499, 1012, 260, 180553, 36660, 12296, 1012, 14746, 260, 196587, 42291, 15501, 260, 262, 3744, 170095, 311, 84684, 102369, 8491, 833, 260, 1012, 426, 74412, 355, 29271, 325, 260, 18

Depending on the model, the dictionary returned by the cell above will contain different keys. They don't matter much for what we're doing here.

In [17]:
print("cls_token: ", tokenizer.cls_token)
print("sep_token: ", tokenizer.sep_token)
print("eos_token: ", tokenizer.eos_token)
print("pad_token: ", tokenizer.pad_token)

cls_token:  [CLS]
sep_token:  [SEP]
eos_token:  [SEP]
pad_token:  [PAD]


So the checkpoint we're using puts a [CLS] token before the question, a [SEP] token between the question and the context, and another [SEP] token at the end. This matches the input format used for SQuAD-style question answering.

## 2.2 Data preprocessing: sequence length and stride params
There is one preprocessing step particular to question answering tasks: how to deal with very long documents.

**Every transformer model has a maximum sequence length that it can handle**.

In other tasks we usually truncate inputs that are longer than the model's maximum sequence length, but here, removing part of the context might result in losing the answer we are looking for.

To deal with this, we allow one (long) example in our dataset to give several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). In case the answer lies at the point where we split a long context, we allow some overlap between the generated features, controlled by the hyper-parameter `DOC_STRIDE`.

In [18]:
MAX_LENGTH = 384  # The maximum length of a feature (question and context)
DOC_STRIDE = 128  # The allowed overlap between two parts of the context when splitting is performed

In [19]:
"""
Let's find one long example in our dataset:
"""
for i, example in enumerate(ds_final["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > MAX_LENGTH:
        break
        
example = ds_final["train"][i]

In [20]:
"""
Without any truncation, we get the following length for the input IDs:
"""
len(tokenizer(example["question"], example["context"])["input_ids"])

407

In [21]:
"""
Now, if we just truncate, we will lose information (and possibly the answer to our question):
"""
len(
    tokenizer(
        example["question"],
        example["context"],
        max_length=MAX_LENGTH,
        truncation="only_second",
    )["input_ids"]
)

384

To sum up key points:
- Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting `truncation="only_second"`. **We never want to truncate the question**, only the context, hence the `only_second` setting
- The tokenizer can automatically return a list of features capped at a given maximum length, with the overlap described above: pass `return_overflowing_tokens=True` together with the `stride` value
- Map the start and end positions of the answer to the original context by setting `return_offsets_mapping=True`
- Use the `sequence_ids` method to find which part of the offset corresponds to the **question** and which corresponds to the **context**

In the labeled dataset, `answer_start` gives the location of the answer within the context string. Note that it is relative to the start of the context string, not to the question + context. The **answer text** gives the actual plaintext answer, from which we can calculate the `answer_end` position as `answer_start` **plus the length of the answer**.

This format is not sufficient to train from — we'll need labels for both start and end positions.

In [22]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=MAX_LENGTH,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=DOC_STRIDE,
)

In [23]:
"""
Now we don't have one list of input_ids, but several:
"""
[len(x) for x in tokenized_example["input_ids"]]

[384, 171]
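
These two lengths can be reproduced by hand. The sketch below assumes that the question in this example takes 17 tokens (visible in the `sequence_ids` output further down), that the pair format adds 3 special tokens, and that every window after the first re-uses the last `stride` context tokens of the previous one. `window_lengths` is a hypothetical helper written for this illustration, not a tokenizer API.

```python
def window_lengths(n_context, n_question, max_length=384, stride=128, n_special=3):
    """Lengths of the features produced when only the context is truncated."""
    room = max_length - n_question - n_special  # context tokens that fit in one window
    step = room - stride                        # how far each new window advances
    if step <= 0:
        raise ValueError("stride must be smaller than the room left for the context")
    lengths, start = [], 0
    while True:
        chunk = min(room, n_context - start)
        lengths.append(n_question + n_special + chunk)
        if start + room >= n_context:
            break  # this window already reached the end of the context
        start += step
    return lengths


total_tokens = 407                         # untruncated length reported above
n_question = 17                            # assumption read from the sequence_ids output
n_context = total_tokens - n_question - 3  # 387 context tokens
print(window_lengths(n_context, n_question))  # [384, 171]
```

Each window leaves 364 positions for the context, so the second window starts at context token 364 - 128 = 236 and holds the remaining 151 tokens plus the 20 question and special tokens.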

In [24]:
"""
And if we decode them, we can see the overlap:
"""
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] В каком году Мопертюи писал о естественных модификациях?[SEP] Однако в то время были и натуралисты, которые размышляли об эволюционном изменении организмов, происходящем в течение длительного времени. Мопертюи писал в 1751 году о естественных модификациях, происходящих во время воспроизводства, накапливающихся в течение многих поколений и приводящих к формированию новых видов. Бюффон предположил, что виды могут дегенерировать и превращаться в другие организмы. Эразм Дарвин считал, что все теплокровные организмы возможно происходят от одного микроорганизма (или филамента ). Первая полноценная эволюционная концепция была предложена Жаном Батистом Ламарком в 1809 году в труде Философия зоологии. Ламарк считал, что простые организмы (инфузории и черви) постоянно самозарождаются. Затем эти формы изменяются и усложняют своё строение, приспосабливаясь к окружающей среде. Эти приспособления происходят за счёт прямого влияния окружающей среды путём упражнения или неупражнения органов и по

It's going to take some work to properly label the answers here: we need to find in which of those features the answer actually is, and where exactly in that feature.

The models we will use require the **start and end positions of these answers** in the tokens, so we also need to map parts of the original context to tokens. Thankfully, the tokenizer we're using can help with that by returning an `offset_mapping`:

In [25]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=MAX_LENGTH,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=DOC_STRIDE,
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 1), (1, 5), (5, 7), (7, 12), (12, 15), (15, 18), (18, 20), (20, 21), (21, 27), (27, 28), (28, 29), (29, 39), (39, 42), (42, 46), (46, 54), (54, 55), (55, 56), (0, 0), (0, 1), (1, 6), (6, 8), (8, 11), (11, 17), (17, 21), (21, 22), (22, 23), (23, 24), (24, 32), (32, 36), (36, 37), (37, 38), (38, 45), (45, 49), (49, 52), (52, 56), (56, 59), (59, 61), (61, 67), (67, 69), (69, 72), (72, 73), (73, 81), (81, 82), (82, 91), (91, 93), (93, 94), (94, 100), (100, 105), (105, 107), (107, 109), (109, 112), (112, 117), (117, 120), (120, 129), (129, 137), (137, 138), (138, 141), (141, 144), (144, 146), (146, 147), (147, 153), (153, 155), (155, 156), (156, 160), (160, 165), (165, 166), (166, 167), (167, 177), (177, 180), (180, 184), (184, 192), (192, 193), (193, 194), (194, 200), (200, 205), (205, 207), (207, 210), (210, 216), (216, 220), (220, 228), (228, 232), (232, 233), (233, 236), (236, 238), (238, 242), (242, 249), (249, 251), (251, 254), (254, 259), (259, 264), (264, 266), (266, 27

This gives the corresponding start and end character in the original text for each token in our input IDs.

The very first token [CLS] has (0, 0) because it doesn't correspond to any part of the question or context; the second token corresponds to characters 0 to 1 of the question:

In [26]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]

print(
    tokenizer.convert_ids_to_tokens([first_token_id])[0],
    example["question"][offsets[0] : offsets[1]],
)

▁В В


So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which offsets correspond to the question and which correspond to the context; this is where the `sequence_ids` method of our `tokenized_example` is useful:

In [27]:
sequence_ids = tokenized_example.sequence_ids()

print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

It returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sentence passed (the question) or the second (the context).
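
Before applying this to real data, the character-to-token lookup can be tried on a hand-written toy feature. In the sketch below the offsets and sequence ids are made up by hand (they are not tokenizer output), and `char_span_to_token_span` is a hypothetical helper name.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span of the context to (start_token, end_token).

    Returns (0, 0) when the span is not fully inside this feature's context.
    """
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    first, last = context_tokens[0], context_tokens[-1]
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return 0, 0
    # Last context token that starts at or before the answer start.
    start_token = max(i for i in context_tokens if offsets[i][0] <= start_char)
    # First context token that ends at or after the answer end.
    end_token = min(i for i in context_tokens if offsets[i][1] >= end_char)
    return start_token, end_token


# Toy feature: [CLS] Who ? [SEP] Ada wrote code [SEP]
toy_context = "Ada wrote code"
toy_offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 3), (4, 9), (10, 14), (0, 0)]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

answer = "wrote code"
start_char = toy_context.index(answer)
end_char = start_char + len(answer)
print(char_span_to_token_span(toy_offsets, toy_sequence_ids, start_char, end_char))  # (5, 6)
```

Labelling an out-of-window answer as (0, 0) points both positions at the [CLS] token, which is the convention used in the training function of this notebook.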

Now let's put everything together in one function we will apply to our training set.

## 2.3 Data preprocessing: training set

In [28]:
# Training set
def preprocess_training_examples(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    questions = [q.strip() for q in examples["question"]]
    
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=MAX_LENGTH,
        truncation="only_second",
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = inputs.pop("offset_mapping")
    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_map = inputs.pop("overflow_to_sample_mapping")
    
    answers = examples["answers"]
    # Let's label those examples!
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        # One example can give several spans, this is the index of the example containing this span of text.
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        # Start/end character index of the answer in the text.
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [None]:
# # This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:
# features = preprocess_training_examples(ds_final["train"][:5])

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of the dataset object we created earlier. Since our preprocessing changes the number of samples, we need to **remove the old columns** when applying it:

In [29]:
train_ds = ds_final["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=ds_final["train"].column_names,
    keep_in_memory=True
)

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

In [30]:
train_ds

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 8414
})

In [31]:
len(ds_final["train"]), len(train_ds)

(8000, 8414)

As we can see, the preprocessing added 414 features. Our training set is now ready to be used, so let's move on to preprocessing the validation set!
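
The extra rows come from long examples that were split into several features. A small sketch of how the `overflow_to_sample_mapping` values could be used to count them; the `sample_map` list below is a made-up toy value, not output from this notebook.

```python
from collections import Counter

# Toy mapping (assumed): position = feature index, value = index of the source example.
sample_map = [0, 1, 1, 2, 3, 3, 3]

features_per_example = Counter(sample_map)
# Examples that produced more than one feature, with their feature counts.
split_examples = {ex: n for ex, n in features_per_example.items() if n > 1}
# Features beyond the first one for each example.
extra_features = len(sample_map) - len(features_per_example)

print(split_examples)  # {1: 2, 3: 3}
print(extra_features)  # 3
```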

## 2.4 Data preprocessing: validation/testing sets

Preprocessing the validation data will be slightly easier as we don’t need to generate labels. The real joy will be to interpret the predictions of the model into spans of the original context. For this, we will just need to store both the offset mappings and some way to match each created feature to the original example it comes from. Since there is an ID column in the original dataset, we’ll use that ID.

In [32]:
# For validation and testing sets
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=MAX_LENGTH,
        truncation="only_second",
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [33]:
validation_ds = ds_final["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=ds_final["validation"].column_names,
    keep_in_memory=True
)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [34]:
validation_ds

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 2086
})

In [35]:
len(ds_final["validation"]), len(validation_ds)

(2000, 2086)

In this case we’ve only added 86 samples, so it appears the contexts in the validation dataset are a bit shorter.

Now that we have preprocessed all the data, we can get to the training.

# 3. Model fine-tuning

Key steps
* Define metric computation function
* Training model with base hyperparameters
* Getting the best hyperparameters by optuna (automatic hyperparameter optimizations)
* Training model with the best hyperparameters

## 3.1 Metric computation
Since we padded all the samples to the maximum length we set, there is no data collator to define, so this metric computation is really the only thing we have to worry about. The difficult part is post-processing the model predictions into spans of text in the original examples; once we have done that, the metric from the 🤗 Evaluate library will do most of the work for us.

We will use the 🤗 [Evaluate](https://github.com/huggingface/evaluate) library to get the metric we need for evaluation (to compare our model to the benchmark), via the `evaluate.load` function:

In [36]:
# Define metric to compute
metric = evaluate.load("squad")

Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]
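
For intuition about the two scores this metric reports, here is a simplified sketch of exact match and token-level F1 for a single prediction and a single reference. It follows the normalization of the original SQuAD evaluation script (lowercasing, removing ASCII punctuation and English articles, collapsing whitespace). The function names are our own, and the real metric additionally takes the best score over all reference answers and averages over examples on a 0-100 scale.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)

def normalize(text):
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The cat sat", "cat sat"))        # 1.0
print(round(token_f1("the cat sat", "cat sat down"), 2))  # 0.8
```

A prediction that overlaps the reference only partially therefore gets F1 credit but no exact-match credit, which is why F1 is usually the higher of the two numbers.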

The post-processing step follows the usual recipe for extractive QA inference:
- Mask the start and end logits corresponding to tokens outside of the context.
- Convert the start and end logits into probabilities using a softmax.
- Score each (`start_index`, `end_index`) pair by the product of the corresponding two probabilities.
- Pick the pair with the maximum score that yields a valid answer (e.g., a `start_index` not greater than `end_index`).

The `compute_metrics` function below skips the softmax and scores each pair by the sum of its start and end logits, since only the predicted text is needed, not a probability.
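
The selection step from the list above can be sketched on toy logits. All values here are made up for illustration: `in_context` plays the role of the masked offset mapping, and pairs are scored by summed logits as `compute_metrics` does.

```python
import numpy as np

# Toy logits for a 6-token feature; positions 0 and 1 stand for non-context tokens.
start_logit = np.array([0.1, 0.2, 3.0, 0.5, 1.0, 0.3])
end_logit = np.array([0.2, 0.1, 0.4, 2.5, 2.0, 0.6])
in_context = [False, False, True, True, True, True]

n_best, max_answer_length = 3, 2

# Indices of the n_best highest logits, best first.
start_indexes = np.argsort(start_logit)[-1:-n_best - 1:-1].tolist()
end_indexes = np.argsort(end_logit)[-1:-n_best - 1:-1].tolist()

candidates = []
for s in start_indexes:
    for e in end_indexes:
        # Skip spans that touch a non-context token.
        if not (in_context[s] and in_context[e]):
            continue
        # Skip spans with negative length or more than max_answer_length tokens.
        if e < s or e - s + 1 > max_answer_length:
            continue
        candidates.append((s, e, start_logit[s] + end_logit[e]))

best = max(candidates, key=lambda c: c[2])
print(start_indexes, end_indexes)  # [2, 4, 3] [3, 4, 5]
print(best[:2])                    # (2, 3)
```

Restricting the search to the `n_best` start and end positions keeps the number of candidate pairs at most `n_best ** 2` per feature instead of growing quadratically with the sequence length.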

In [37]:
def compute_metrics(start_logits, end_logits, features, examples):
    # We need to find the predicted answer for each example in `examples`.
    # One example may have been split into several features in `features`,
    # so the first step is to map each example ID to the indices of its features.
    example_to_features = collections.defaultdict(list)
    for (idx, feature) in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    n_best = 20
    max_answer_length = 30
    predicted_answers = []
    # We look at the n_best highest start and end logits, excluding positions that give:
    # - An answer that wouldn't be inside the context
    # - An answer with negative length
    # - An answer longer than max_answer_length
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1:-n_best - 1:-1].tolist()
            end_indexes = np.argsort(end_logit)[-1:-n_best - 1:-1].tolist()
            
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0]:offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": str(example_id), "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": str(example_id), "prediction_text": ""})

    # This metric expects the predicted answers in the format:
    # a list of dictionaries with one key for the ID of the example and one key for the predicted text
    # and the theoretical answers in the format: 
    # a list of dictionaries with one key for the ID of the example and one key for the possible answers:
    theoretical_answers = [{"id": str(ex["id"]), "answers": ex["answers"]} for ex in examples]
    
    return metric.compute(predictions=predicted_answers,
                          references=theoretical_answers)

This function needs more than the model outputs: it looks up the offsets in the dataset of features and the original contexts in the dataset of examples. For that reason it can't be used to get regular evaluation results during training. **We will only use it at the end of training to check the results.**

## 3.2 Fine-tuning the model: base hyperparameters

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [38]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

config.json:   0%|          | 0.00/879 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

If 🤗 Transformers prints a warning about newly initialized weights when loading the model, it is telling us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

We set the learning rate, the number of training epochs, and the weight decay. The batch size is not passed explicitly in the cell below, so the `TrainingArguments` default applies.

In [39]:
# TRAINING HYPERPARAMS
LR = 2e-5
NUM_EPOCHS = 5
WD = 0.01

In [40]:
training_args_base = TrainingArguments("bert-squad-base-params",
                                  evaluation_strategy="no",
                                  save_strategy="epoch",
                                  optim="adamw_torch",
                                  learning_rate=LR,
                                  weight_decay=WD,
                                  num_train_epochs=NUM_EPOCHS,
                                  fp16=True,
                                  seed=RANDOM_SEED)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [41]:
# Finally, we just pass everything to the Trainer class and launch the training
trainer_base = Trainer(model,
                  training_args_base,
                  train_dataset=train_ds,
                  eval_dataset=validation_ds,
                  tokenizer=tokenizer)

In [42]:
torch.cuda.empty_cache()
os.environ["WANDB_DISABLED"] = "true"
os.environ['TRANSFORMERS_CACHE'] = "/opt/ml/checkpoints/"
os.environ['HF_DATASETS_CACHE'] = "/opt/ml/checkpoints/"

# 15/01/2025
trainer_base.train()

Step,Training Loss
500,1.7149
1000,1.3868
1500,1.1763
2000,1.032
2500,0.9378


TrainOutput(global_step=2630, training_loss=1.2314208404192906, metrics={'train_runtime': 3409.99, 'train_samples_per_second': 12.337, 'train_steps_per_second': 0.771, 'total_flos': 8244714800286720.0, 'train_loss': 1.2314208404192906, 'epoch': 5.0})
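
The numbers in this `TrainOutput` can be cross-checked with a little arithmetic. The sketch below only uses values copied from the output above. The interpretation of the ratio as an effective batch size is our reading, and the notebook does not say how that batch is split across devices.

```python
# Values copied from the TrainOutput above.
global_step = 2630
num_epochs = 5
samples_per_second = 12.337
steps_per_second = 0.771

# Optimizer steps taken in each epoch.
steps_per_epoch = global_step / num_epochs

# Samples processed per optimizer step, i.e. the effective batch size.
effective_batch = samples_per_second / steps_per_second

# Upper bound on the number of training features (the last batch of an epoch may be partial).
approx_features = steps_per_epoch * round(effective_batch)

print(steps_per_epoch)         # 526.0
print(round(effective_batch))  # 16
print(approx_features)         # 8416.0
```

So each epoch covers at most about 8,416 training features, slightly more than the number of raw training examples because long contexts are split into several features.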

In [43]:
"""
Save the model
"""
trainer_base.save_model('./bert-squad-base-params-QA_15_01_25')

## 3.3 Evaluation

In [44]:
# Tokenize test set
test_ds = ds_final["test"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=ds_final["test"].column_names,
)

len(ds_final["test"]), len(test_ds)

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

(2000, 2084)

In [45]:
# Getting prediction based on the test set
predictions, _, _ = trainer_base.predict(test_ds)
start_logits, end_logits = predictions

compute_metrics(start_logits, end_logits, test_ds, ds_final["test"])

  0%|          | 0/2000 [00:00<?, ?it/s]

{'exact_match': 62.05, 'f1': 81.24410868559568}

In [None]:
# del trainer_base
# gc.collect()
# torch.cuda.empty_cache()

***

## 3.4 Getting the best hyperparameters by Optuna

In [43]:
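# DATASETS_for_optuna: a 10% subsample of each tokenized split, to speed up the hyperparameter search.
# Note: indices are drawn from the number of raw examples, so tokenized features beyond that index are never sampled.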
part_of_data = 0.1

DATASETS_for_optuna = DatasetDict({
    'train': ds_final["train"].map(
        preprocess_training_examples,
        batched=True,
        keep_in_memory=True,
        remove_columns=ds_final["train"].column_names).select(
            np.random.choice(range(len(ds_final["train"])), int(len(ds_final["train"])*part_of_data), replace=False)
        ),
    'validation': ds_final["validation"].map(
        preprocess_validation_examples,
        batched=True,
        keep_in_memory=True,
        remove_columns=ds_final["validation"].column_names).select(
            np.random.choice(range(len(ds_final["validation"])), int(len(ds_final["validation"])*part_of_data), replace=False)
    ),
    'test': ds_final["test"].map(
        preprocess_validation_examples,
        batched=True,
        keep_in_memory=True,
        remove_columns=ds_final["test"].column_names).select(
            np.random.choice(range(len(ds_final["test"])), int(len(ds_final["test"])*part_of_data), replace=False)
    )
})

DATASETS_for_optuna

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 800
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'example_id'],
        num_rows: 200
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'example_id'],
        num_rows: 200
    })
})

In [44]:
# Hyperparameter search space (see https://python-bloggers.com/2022/08/hyperparameter-tuning-a-transformer-with-optuna/)
LR_MIN = 2e-5  # learning rate: lower bound
LR_CEIL = 0.01  # learning rate: upper bound
WD_MIN = 4e-5  # weight decay: lower bound
WD_CEIL = 0.01  # weight decay: upper bound
WR_MIN = 0.01  # warmup ratio: lower bound
WR_CEIL = 0.2  # warmup ratio: upper bound
MIN_GRAD_ACC = 1  # gradient accumulation steps: lower bound
MAX_GRAD_ACC = 5  # gradient accumulation steps: upper bound
MIN_EPOCHS = 2  # training epochs: lower bound
MAX_EPOCHS = 5  # training epochs: upper bound
PER_DEVICE_EVAL_BATCH = BATCH_SIZE  # per-device batch size for evaluation
PER_DEVICE_TRAIN_BATCH = BATCH_SIZE  # per-device batch size for training
NUM_TRIALS = 3  # number of Optuna trials; each trial trains one model with a sampled configuration
SAVE_DIR = 'optuna-test'  # output directory for the trial runs
NAME_OF_MODEL = 'optuna_bp'  # name for the serialized fine-tuned model
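
Since the learning rate is drawn on a log scale between `LR_MIN` and `LR_CEIL` (see the objective function), it is worth checking how much of that range is plausible for fine-tuning. The sketch below computes the share of draws that land above a threshold. The `1e-4` threshold and the helper name are assumptions chosen for illustration, not values from the notebook.

```python
import math

LR_MIN, LR_CEIL = 2e-5, 0.01  # search bounds from the cell above
threshold = 1e-4              # assumed reference point, not taken from the notebook

def loguniform_fraction_above(low, high, x):
    """P(sample > x) when log(sample) is uniform on [log(low), log(high)]."""
    x = min(max(x, low), high)  # clamp the threshold into the search range
    return (math.log(high) - math.log(x)) / (math.log(high) - math.log(low))

frac = loguniform_fraction_above(LR_MIN, LR_CEIL, threshold)
print(f"{frac:.2f}")  # 0.74
```

Under these assumptions roughly three out of four trials would sample a learning rate above `1e-4`, so with only a few trials a lower `LR_CEIL` may be a more economical choice.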

In [45]:
def objective(trial: optuna.Trial):     
    model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
    
    training_args = TrainingArguments(         
        output_dir=SAVE_DIR, 
        optim="adamw_torch",
        # suggest_float(..., log=True) replaces suggest_loguniform, which is deprecated in Optuna 3.x
        learning_rate=trial.suggest_float('learning_rate', LR_MIN, LR_CEIL, log=True),
        weight_decay=trial.suggest_float('weight_decay', WD_MIN, WD_CEIL, log=True),
        warmup_ratio=trial.suggest_float('warmup_ratio', WR_MIN, WR_CEIL, log=True),
        gradient_accumulation_steps=trial.suggest_int('gradient_accumulation_steps', low = MIN_GRAD_ACC,high = MAX_GRAD_ACC),
        num_train_epochs=trial.suggest_int('num_train_epochs', low = MIN_EPOCHS,high = MAX_EPOCHS),         
        per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH,         
        per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
        fp16 = True,  # for saving memory
        # gradient_checkpointing=True, # for saving memory
        seed = RANDOM_SEED,
        lr_scheduler_type='cosine')

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=DATASETS_for_optuna['train'],
        eval_dataset=DATASETS_for_optuna['validation'])      
    
    result = trainer.train()

    del model, trainer
    gc.collect()
    torch.cuda.empty_cache()
    
    # The study minimizes the training loss on the subsample, not a validation metric
    return result.training_loss

In [46]:
def print_custom(text):
    print('\n')
    print(text)
    print('-'*100)

In [47]:
print_custom('Triggering Optuna study')
study = optuna.create_study(study_name='hp-search', direction='minimize') 
study.optimize(func=objective, n_trials=NUM_TRIALS)

# Collect the best hyperparameters; they are used below to train the final model
print_custom('Finding study best parameters')
best_lr = float(study.best_params['learning_rate'])
best_weight_decay = float(study.best_params['weight_decay'])
best_warmup_ratio = float(study.best_params['warmup_ratio'])
best_gradient_accumulation_steps = int(study.best_params['gradient_accumulation_steps'])
best_epoch = int(study.best_params['num_train_epochs'])

print_custom('Extract best study params')
print(f'The best learning rate is: {best_lr}')
print(f'The best weight decay is: {best_weight_decay}')
print(f'The best warmup ratio is: {best_warmup_ratio}')
print(f'The best gradient accumulation step is : {best_gradient_accumulation_steps}')
print(f'The best epoch is : {best_epoch}')

print_custom('Create dictionary of the best hyperparameters')
best_hp_dict = {
    'best_learning_rate': best_lr,
    'best_weight_decay': best_weight_decay,
    'best_warmup_ratio': best_warmup_ratio,
    'best_gradient_accumulation_steps': best_gradient_accumulation_steps,
    'best_epoch': best_epoch
}

[I 2025-01-15 07:51:11,550] A new study created in memory with name: hp-search




Triggering Optuna study
----------------------------------------------------------------------------------------------------


config.json:   0%|          | 0.00/879 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Step,Training Loss


[I 2025-01-15 07:54:00,646] Trial 0 finished with value: 5.4509521484375 and parameters: {'learning_rate': 0.008738763408612403, 'weight_decay': 5.960386473276296e-05, 'warmup_ratio': 0.04452377040926211, 'gradient_accumulation_steps': 1, 'num_train_epochs': 3}. Best is trial 0 with value: 5.4509521484375.


Step,Training Loss


[I 2025-01-15 07:56:52,271] Trial 1 finished with value: 5.418101501464844 and parameters: {'learning_rate': 0.0033315096647181463, 'weight_decay': 4.1193418718520336e-05, 'warmup_ratio': 0.027063898887017348, 'gradient_accumulation_steps': 1, 'num_train_epochs': 3}. Best is trial 1 with value: 5.418101501464844.


Step,Training Loss


[I 2025-01-15 08:00:34,843] Trial 2 finished with value: 1.3733306884765626 and parameters: {'learning_rate': 3.979049360136424e-05, 'weight_decay': 0.0026403445952656534, 'warmup_ratio': 0.010395335112022534, 'gradient_accumulation_steps': 1, 'num_train_epochs': 4}. Best is trial 2 with value: 1.3733306884765626.




Finding study best parameters
----------------------------------------------------------------------------------------------------


Extract best study params
----------------------------------------------------------------------------------------------------
The best learning rate is: 3.979049360136424e-05
The best weight decay is: 0.0026403445952656534
The best warmup ratio is: 0.010395335112022534
The best gradient accumulation step is : 1
The best epoch is : 4


Create dictionary of the best hyperparameters
----------------------------------------------------------------------------------------------------


## 3.5 Model training: the best hyperparameters

In [48]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

In [49]:
training_args_bp = TrainingArguments("mdeberta-squad-best-params",
                                     evaluation_strategy="no",
                                     save_strategy="epoch",
                                     logging_steps=300,
                                     optim="adamw_torch",
                                     learning_rate=best_lr,
                                     seed = RANDOM_SEED,
                                     lr_scheduler_type='cosine',
                                     fp16=True,  # mixed-precision training to reduce GPU memory use
                                     weight_decay=best_weight_decay,
                                     warmup_ratio=best_warmup_ratio,
                                     gradient_accumulation_steps=best_gradient_accumulation_steps,
                                     num_train_epochs=best_epoch)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [50]:
trainer_bp = Trainer(
    model,
    training_args_bp,
    train_dataset=train_ds,
    eval_dataset=validation_ds,
    tokenizer=tokenizer)

In [51]:
# os.environ["WANDB_DISABLED"] = "true"
torch.cuda.empty_cache()

# 15/01/2025
trainer_bp.train()

Step,Training Loss
300,1.7892
600,1.5505
900,1.303
1200,1.1468
1500,0.9816
1800,0.8445
2100,0.8096


TrainOutput(global_step=2104, training_loss=1.2028537315560837, metrics={'train_runtime': 2494.0298, 'train_samples_per_second': 13.495, 'train_steps_per_second': 0.844, 'total_flos': 6595771840229376.0, 'train_loss': 1.2028537315560837, 'epoch': 4.0})

In [54]:
# """
# Save the model
# """
# trainer_bp.save_model('./mdeberta-finetuned-best_params-QA_15_01_25')

## 3.6 Evaluation

In [53]:
predictions, _, _ = trainer_bp.predict(test_ds)
start_logits, end_logits = predictions

compute_metrics(start_logits, end_logits, test_ds, ds_final["test"])

  0%|          | 0/2000 [00:00<?, ?it/s]

{'exact_match': 61.35, 'f1': 80.77243935477296}

***

# 4. Using the fine-tuned model (with base params)

In [46]:
# Replace this with your own checkpoint
model_checkpoint = "./bert-squad-base-params-QA_15_01_25"
question_answerer = pipeline("question-answering", model=model_checkpoint)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [47]:
context = """
В начале XX века Тебриз был пристанищем множества радикальных организаций, в силу чего играл важную роль в Конституционной революции каджарского Ирана 1905—1911 годов.
Уже в 1908 году из города были изгнаны сторонники Мохаммед-Али шаха, который, будучи принцем, занимал пост губернатора Тебриза.
Одним из главных требований протестующих было создание нового меджлиса.
В ответ на революцию, английские и русские войска вторглись в Иран, чтобы её подавить.
Последние заняли Тебриз, разоружили радикальные революционные группы, но при этом не признавали и отказывали возможности проехать в город шахскому губернатору.
В итоге Мохаммед Али-шах вынужден был уехать в Россию, а к власти в 1909 году пришёл новый правитель Султан Ахмад-шах, последний из династии Каджаров[4].
"""
question = "Чьи сторонники были изгнаны в 1908 году во время протестов в Иране в начале XX века?" # Мохаммед-Али шаха
question_answerer(question=question, context=context)

{'score': 0.8873432874679565,
 'start': 218,
 'end': 237,
 'answer': ' Мохаммед-Али шаха,'}

The model finds the expected answer, `Мохаммед-Али шаха`, with a score of about 0.89.
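
The span returned above includes a leading space and a trailing comma. If a clean string is needed downstream, one option is to shrink the span after the fact. The helper below is a hypothetical sketch, not part of the `pipeline` API, and the set of characters to strip (including the guillemets) is an assumption.

```python
import string

# Characters to trim from both ends of a span; the guillemets are an assumption for Russian text.
STRIP_CHARS = set(string.whitespace + string.punctuation + "«»")

def clean_answer(result, context):
    """Shrink the (start, end) span so it neither starts nor ends with whitespace or punctuation."""
    start, end = result["start"], result["end"]
    while start < end and context[start] in STRIP_CHARS:
        start += 1
    while end > start and context[end - 1] in STRIP_CHARS:
        end -= 1
    return {**result, "start": start, "end": end, "answer": context[start:end]}

# Toy usage with a shortened context; `raw` mimics the span returned above.
context = "изгнаны сторонники Мохаммед-Али шаха, который занимал пост губернатора"
raw = " Мохаммед-Али шаха,"
start = context.find(raw)
result = {"score": 0.89, "start": start, "end": start + len(raw), "answer": raw}

cleaned = clean_answer(result, context)
print(repr(cleaned["answer"]))  # 'Мохаммед-Али шаха'
```

Punctuation inside the span, such as the hyphen in the name, is left untouched. Trimming all edge punctuation can also remove characters that belong to the answer (a closing quote or parenthesis, for example), so the character set should be adapted to the data.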

***

**Some observations**:
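* The two runs give nearly identical test metrics: exact match 62.05 and F1 81.24 with the base hyperparameters, versus exact match 61.35 and F1 80.77 with the hyperparameters found by the automatic search. The search brought no improvement here.
* The search itself was small: 3 trials on 10% of the data. Two of the three trials sampled learning rates above 3e-3 and ended with a training loss of about 5.4.

To improve the results, use more data both for the hyperparameter search and for training, if you have enough resources (time and GPU/CPU memory).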