# Fine-tuning Question Answering Model

This exam will guide you through loading, preprocessing, and fine-tuning a pre-trained model for a question-answering task using a dataset. Follow the steps carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `distilbert-base-cased` for both the model and tokenizer.
- **Dataset**: You will be using the `christti/squad-augmented-v2` dataset. Ensure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [2]:
#!pip install datasets

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.3/474.3 kB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2K

In [1]:
from datasets import load_dataset

ds = load_dataset("christti/squad-augmented-v2", split="train[:1%]")
ds

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 1692
})

In [2]:
#ds = ds.remove_columns(['id','title'])

In [3]:
'''ds_train = ds['train']
ds_val = ds['validation']'''

"ds_train = ds['train']\nds_val = ds['validation']"

In [4]:
'''ds_train = ds_train.train_test_split(test_size=0.2)
ds_train'''

'ds_train = ds_train.train_test_split(test_size=0.2)\nds_train'

## Step 2: Load the Pretrained Tokenizer and Model

In [5]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("question-answering", model="distilbert/distilbert-base-cased-distilled-squad")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [6]:
# Load model directly
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-cased-distilled-squad")
model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-cased-distilled-squad")

Use the model and tokenizer for the question-answering task.

## Step 3: Preprocess the Dataset

Define a function to preprocess the dataset by tokenizing both the context and the question. The function will also calculate the start and end positions of the answers. In the tokenizer you might face a problem if you use `truncation=True` so consider using `truncation='only_first'` if needed.

In [8]:
def preprocess_data(examples):
    inputs = tokenizer(
        examples['context'],
        examples['question'],
        truncation='only_first',
        padding='max_length',
        max_length=400
    )

    # Tokenize the answer separately
    answers = examples['answers']
    start_positions = []
    end_positions = []

    for i, answer in enumerate(answers):
        answer_text = answer['text'][0]
        start_char = answer['answer_start'][0]

        # Tokenize the context
        context = examples['context'][i]
        tokenized_context = tokenizer(context, truncation=True, padding='max_length', max_length=400)

        # Tokenize the answer
        tokenized_answer = tokenizer(answer_text, truncation=True, padding='max_length', max_length=400)

        # Find the token indices corresponding to the start and end of the answer
        start_pos = None
        end_pos = None

        # Loop through the tokenized context and look for the answer tokens
        for idx in range(len(tokenized_context['input_ids']) - len(tokenized_answer['input_ids']) + 1):
            if tokenized_context['input_ids'][idx:idx + len(tokenized_answer['input_ids'])] == tokenized_answer['input_ids']:
                start_pos = idx
                end_pos = idx + len(tokenized_answer['input_ids']) - 1
                break

        if start_pos is None or end_pos is None:
            start_pos = 0
            end_pos = 0

        start_positions.append(start_pos)
        end_positions.append(end_pos)

    inputs.update({
        'start_positions': start_positions,
        'end_positions': end_positions
    })

    return inputs

tokenized_train_ds = ds.map(preprocess_data, batched=True, remove_columns=['id', 'title'])




Map:   0%|          | 0/1692 [00:00<?, ? examples/s]

In [9]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_train_ds, shuffle=True, batch_size=8)
#valid_data = DataLoader(ds_val, batch_size=8)

## Step 4: Define Training Arguments and Initialize the Trainer

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [10]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="no",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01
)



In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_ds
)

## Step 5: Fine-tune the Model

Run the training process using the initialized trainer.

In [14]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=212, training_loss=0.018126299921071756, metrics={'train_runtime': 72.1025, 'train_samples_per_second': 23.467, 'train_steps_per_second': 2.94, 'total_flos': 172707066604800.0, 'train_loss': 0.018126299921071756, 'epoch': 1.0})

## Step 6: Inference

Once the model is trained, perform inference by answering a question based on a context. Use the tokenizer to process the input, and then feed it into the model to get the predicted answer.

In [27]:
question = 'How many programming languages does BLOOM support?'
context = 'BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages.'

#input = tokenizer(question_ex, context_ex, return_tensors='pt')


In [32]:
inputs = tokenizer(question, context, return_tensors='pt').input_ids


#answer = model(question=question, context=context)
answer = pipe(question=question, context=context)
print("Answer:", answer['answer'])

Answer: 13
