# Fine-tuning Question Answering Model

This exam will guide you through loading, preprocessing, and fine-tuning a pre-trained model for a question-answering task using a dataset. Follow the steps carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `distilbert-base-cased` for both the model and tokenizer.
- **Dataset**: You will be using the `christti/squad-augmented-v2` dataset. Ensure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

In [1]:
!pip install datasets
!pip install transformers

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.3/474.3 kB[0m [31m29.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2K 

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [2]:
from datasets import load_dataset
from transformers import pipeline

In [3]:
checkpoint = 'distilbert-base-cased'
dataset_train = load_dataset('christti/squad-augmented-v2',split="train[:40000]")
dataset_test = load_dataset('christti/squad-augmented-v2',split="validation[:4000]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/1.28k [00:00<?, ?B/s]

train.json:   0%|          | 0.00/156M [00:00<?, ?B/s]

validation.json:   0%|          | 0.00/11.1M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

## Step 2: Load the Pretrained Tokenizer and Model

Use the model and tokenizer for the question-answering task.

In [4]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/465 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/263M [00:00<?, ?B/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 3: Preprocess the Dataset

Define a function to preprocess the dataset by tokenizing both the context and the question. The function will also calculate the start and end positions of the answers. In the tokenizer you might face a problem if you use `truncation=True` so consider using `truncation='only_first'` if needed.

In [5]:
dataset_train

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 40000
})

In [6]:
dataset_test

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 4000
})

In [7]:
# from sklearn.model_selection import train_test_split

# train = dataset['train']
# test = dataset['validation']

train = dataset_train
test = dataset_test

In [8]:
import pandas as pd

pd.DataFrame(train).iloc[1].values

array(['5733be284776f4190066117f', 'University_of_Notre_Dame',
       'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
       'What is in front of the Notre Dame Main Building?',
       {'text': ['a copper statue of Christ'], 'answer_start': [188]}],
      dtype=object)

In [9]:
pd.DataFrame(train)

Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."
1,5733be284776f4190066117f,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"{'text': ['a copper statue of Christ'], 'answe..."
2,5733be284776f41900661180,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"{'text': ['the Main Building'], 'answer_start'..."
3,5733be284776f41900661181,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,{'text': ['a Marian place of prayer and reflec...
4,5733be284776f4190066117e,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,{'text': ['a golden statue of the Virgin Mary'...
...,...,...,...,...,...
39995,5725e90738643c19005ace67,Montevideo,The State Railways Administration of Uruguay (...,When was the General Artigas Central Station a...,"{'text': ['1 March 2003'], 'answer_start': [565]}"
39996,5725ea3089a1e219009ac098,Montevideo,The port on Montevideo Bay is one of the reaso...,What gives natural protection to ships in Mont...,"{'text': ['The port on Montevideo Bay'], 'answ..."
39997,5725ea3089a1e219009ac099,Montevideo,The port on Montevideo Bay is one of the reaso...,Between what years did the main engineering wo...,"{'text': ['1870 and 1930'], 'answer_start': [3..."
39998,5725ea3089a1e219009ac09a,Montevideo,The port on Montevideo Bay is one of the reaso...,What happened in 1923 that required repairs to...,"{'text': ['A major storm'], 'answer_start': [5..."


In [10]:
def pre_process(dataset):
  inputs = tokenizer(
      dataset['question'],
      dataset["context"],
      max_length=512,
      truncation=True,
      return_offsets_mapping=True,
      padding="max_length",
  )


  offset_mapping = inputs.pop("offset_mapping")

  answers = dataset["answers"]
  start_positions = []
  end_positions = []

  for i, offset in enumerate(offset_mapping):
    answer = answers[i]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)


    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    if offset[context_start][0] > end_char or offset[context_end][1] <start_char:
      start_positions.append(0)
      end_positions.append(0)
    else:
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)
        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)


  inputs.update({
      'start_positions':start_positions,
      'end_positions':end_positions
  })


  return inputs


data_train = train.map(pre_process,batched=True,remove_columns=['context','question','answers','id','title'])

Map:   0%|          | 0/40000 [00:00<?, ? examples/s]

In [None]:
# pd.DataFrame(data_data)

## Step 4: Define Training Arguments and Initialize the Trainer

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [11]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./result",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data_train,
)

## Step 5: Fine-tune the Model

In [12]:
trainer.train()

Step,Training Loss
500,3.1411
1000,2.0232
1500,1.7325
2000,1.5016
2500,1.4559
3000,1.3644
3500,1.352
4000,1.3104
4500,1.2399
5000,1.2203


TrainOutput(global_step=5000, training_loss=1.6341199462890625, metrics={'train_runtime': 1720.9746, 'train_samples_per_second': 23.243, 'train_steps_per_second': 2.905, 'total_flos': 3919593000960000.0, 'train_loss': 1.6341199462890625, 'epoch': 1.0})

Run the training process using the initialized trainer.

## Step 6: Inference

Once the model is trained, perform inference by answering a question based on a context. Use the tokenizer to process the input, and then feed it into the model to get the predicted answer.

In [13]:
data_test = test.map(pre_process,batched=True,remove_columns=['context','question','answers','id','title'])

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

In [14]:
# question_answerer = pipeline("question-answering", model="./result")
trainer.evaluate(data_test)

{'eval_loss': 1.3028285503387451,
 'eval_runtime': 54.8969,
 'eval_samples_per_second': 72.864,
 'eval_steps_per_second': 9.108,
 'epoch': 1.0}

In [15]:
pipe = pipeline("question-answering", model=trainer.model, tokenizer=tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [46]:
context = """
This is Tuwaiq academy, they teach Data Science and AI and Machine learning courses. In addition, they have LLM and Web Developing and Game Development courses. The bootcamp that we have right now is T5, and it is considered one of the best bootcamp ever. The name of the instructors are Ali, Saliyah, Hassan and there are helpers like Abdullah khaled and sanad. The location in Riyadh in Nourah University.
"""

#question_1 = "who is the instructors?"
#question_2 = "who is the helpers?"
#question_3 = "where is the location of the bootcamp?"

In [47]:
question_1 = "who is the instructors?"
pipe(question=question_1,context=context)

{'score': 0.4964500665664673,
 'start': 289,
 'end': 309,
 'answer': 'Ali, Saliyah, Hassan'}

In [48]:
question_2 = "who is the helpers?"
pipe(question=question_2,context=context)

{'score': 0.4992832541465759,
 'start': 337,
 'end': 362,
 'answer': 'Abdullah khaled and sanad'}

In [49]:
question_3 = "where is the location of the bootcamp?"
pipe(question=question_3,context=context)

{'score': 0.1990729719400406, 'start': 380, 'end': 386, 'answer': 'Riyadh'}