# Fine-tuning a Question Answering Model

This exam guides you through loading and preprocessing a dataset, then fine-tuning a pre-trained model on it for a question-answering task. Follow the steps carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `distilbert-base-cased` for both the model and tokenizer.
- **Dataset**: You will be using the `christti/squad-augmented-v2` dataset. Make sure you load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch with Hugging Face Transformers for this task.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.
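
As a rough illustration of what a 20% hold-out means, here is a plain-Python sketch of a shuffled, seeded split over row indices. `split_indices` is a hypothetical helper written for this note, not part of the `datasets` library; in the notebook the library's own `train_test_split` method does this job, and it accepts a `seed` argument if a reproducible split is wanted.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) for a shuffled, reproducible split."""
    indices = list(range(n_rows))
    # A fixed seed gives the same shuffle order on every run.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    # The first n_test shuffled rows form the test set, the rest the train set.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Without a fixed seed the test rows change on every run, which makes before/after comparisons harder to interpret.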

In [1]:
! pip install transformers datasets

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

***before fine-tuning***

In [23]:
from transformers import pipeline

answr = pipeline('question-answering', model='distilbert-base-uncased')

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [24]:
context="Canada is a country in North America. Its ten provinces and three territories extend from the Atlantic Ocean to the Pacific Ocean and northward into the Arctic Ocean, making it the world's second-largest country by total area, with the world's longest coastline. Its border with the United States is the world's longest international land border. The country is characterized by a wide range of both meteorologic and geological regions. It is a sparsely inhabited country of just over 41 million people, the vast majority residing south of the 55th parallel in urban areas. Canada's capital is Ottawa and its three largest metropolitan areas are Toronto, Montreal, and Vancouver.Indigenous peoples have continuously inhabited what is now Canada for thousands of years. Beginning in the 16th century, British and French expeditions explored and later settled along the Atlantic coast. As a consequence of various armed conflicts, France ceded nearly all of its colonies in North America in 1763. In 1867, with the union of three British North American colonies through Confederation, Canada was formed as a federal dominion of four provinces. This began an accretion of provinces and territories and a process of increasing autonomy from the United Kingdom, highlighted by the Statute of Westminster, 1931, and culminating in the Canada Act 1982, which severed the vestiges of legal dependence on the Parliament of the United Kingdom."
question='where is canada located?'
result=answr(question=question,context=context)

In [25]:
print(result['answer'])

territories extend from the Atlantic Ocean to the Pacific Ocean and northward into the Arctic


In [2]:
from datasets import load_dataset

dataset = load_dataset("christti/squad-augmented-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [3]:
split_dataset = dataset['train'].train_test_split(test_size=0.2)


train_dataset = split_dataset['train']
test_dataset = split_dataset['test']


## Step 2: Load the Pretrained Tokenizer and Model

Use the model and tokenizer for the question-answering task.

In [4]:
train_dataset['context'][0]

'As of 1878, there were only three free Slavic states in the world: the Russian Empire, Serbia and Montenegro. Bulgaria was also free but was de jure vassal to the Ottoman Empire until official independence was declared in 1908. In the entire Austro-Hungarian Empire of approximately 50 million people, about 23 million were Slavs. The Slavic peoples who were, for the most part, denied a voice in the affairs of the Austro-Hungarian Empire, were calling for national self-determination. During World War I, representatives of the Czechs, Slovaks, Poles, Serbs, Croats, and Slovenes set up organizations in the Allied countries to gain sympathy and recognition. In 1918, after World War I ended, the Slavs established such independent states as Czechoslovakia, the Second Polish Republic, and the State of Slovenes, Croats and Serbs.'

In [5]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.



Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 3: Preprocess the Dataset

Define a function that preprocesses the dataset by tokenizing both the question and the context, and that calculates the start and end token positions of each answer. If the tokenizer causes a problem with `truncation=True`, consider using `truncation='only_first'` instead.
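
The start/end calculation can be illustrated on a toy example before looking at the full function. The sketch below assumes hand-written character offsets for a six-token context; in the notebook these offsets come from the tokenizer's offset mapping. `char_span_to_token_span` is a hypothetical helper name used only here.

```python
# Hand-written (start, end) character offsets, one pair per context token.
context = "Canada is in North America."
offsets = [(0, 6), (7, 9), (10, 12), (13, 18), (19, 26), (26, 27)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices."""
    # Start token: the last token that begins at or before start_char.
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # End token: the first token that ends at or after end_char.
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

answer = "North America"
start_char = context.index(answer)
end_char = start_char + len(answer)
start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
print(start_token, end_token)  # 3 4
```

Slicing the context from the start of token 3 to the end of token 4 recovers the answer text, which is the property the training labels rely on.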

In [6]:
def preprocess_function(examples):
    # Tokenize each (question, context) pair, padding to max_length.
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_first",
        return_offsets_mapping=True,
        padding="max_length",
    )

    # offset_mapping holds the (start, end) character span of every token.
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        # Character span of the first listed answer within the context.
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the first and last context tokens (sequence id 1).
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer lies entirely outside the tokenized context, label it (0, 0).
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Start token: the last context token that begins at or before start_char.
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            # End token: the first context token that ends at or after end_char.
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [7]:
# Uses the full original train split, not the 80% `train_dataset` created in Step 1.
traindata = dataset['train']

In [8]:
tokenized_squad = traindata.map(preprocess_function, batched=True, remove_columns=['id','title','answers'])

Map:   0%|          | 0/169211 [00:00<?, ? examples/s]

## Step 4: Define Training Arguments and Initialize the Trainer

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [9]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='/content',
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    eval_strategy='no',
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad,
)

## Step 5: Fine-tune the Model

Run the training process using the initialized trainer.

In [10]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 2.8542 |
| 1000 | 1.8266 |
| 1500 | 1.6771 |
| 2000 | 1.5686 |
| 2500 | 1.5009 |
| 3000 | 1.4917 |
| 3500 | 1.4183 |
| 4000 | 1.3915 |
| 4500 | 1.3407 |
| 5000 | 1.3376 |


TrainOutput(global_step=21152, training_loss=1.2370134958160446, metrics={'train_runtime': 3364.2769, 'train_samples_per_second': 50.296, 'train_steps_per_second': 6.287, 'total_flos': 1.6580956282136064e+16, 'train_loss': 1.2370134958160446, 'epoch': 1.0})
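
As a sanity check, the reported `global_step` and `train_samples_per_second` can be reproduced from the dataset size and batch size. This is a small arithmetic sketch that assumes a single device, no gradient accumulation, and that the last partial batch is kept; the variable names are illustrative.

```python
import math

num_examples = 169_211        # rows reported by the Map step
batch_size = 8                # per_device_train_batch_size
num_epochs = 1                # num_train_epochs
train_runtime_s = 3364.2769   # 'train_runtime' from TrainOutput

# One optimizer step per batch; the final batch may be smaller than 8.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime_s

print(total_steps)                    # 21152
print(round(samples_per_second, 3))   # 50.296
```

Both values match the `TrainOutput`, which confirms that the run trained on all 169,211 rows for one full epoch.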

In [12]:
model.save_pretrained('/content/model')
tokenizer.save_pretrained('/content/tokrnizer')

('/content/tokrnizer/tokenizer_config.json',
 '/content/tokrnizer/special_tokens_map.json',
 '/content/tokrnizer/vocab.txt',
 '/content/tokrnizer/added_tokens.json',
 '/content/tokrnizer/tokenizer.json')

## Step 6: Inference

Once the model is trained, perform inference by answering a question based on a context. Use the tokenizer to process the input, and then feed it into the model to get the predicted answer.
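
If the input is processed manually rather than through a pipeline, the model returns one start logit and one end logit per token, and the answer is the span that scores best. The sketch below shows one simple way to pick that span in plain Python. The logits are made up for illustration, `best_span` and its `max_answer_len` parameter are hypothetical names, and the real pipeline does more (for example, handling special tokens and long contexts).

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start_logit + end_logit with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Made-up logits for a 6-token input, for illustration only.
tokens = ["canada", "is", "in", "north", "america", "."]
start_logits = [0.1, -1.0, 0.2, 4.0, 1.0, -2.0]
end_logits = [-0.5, -1.0, 0.0, 1.5, 5.0, 0.3]

start, end, score = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))  # north america
```

Taking the two argmaxes independently can yield an end before the start; searching over valid pairs avoids that case.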

In [13]:
from transformers import pipeline

m = pipeline('question-answering', model='/content/model',tokenizer='/content/tokrnizer')

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [18]:
result = m(question=question, context=context)

In [20]:
result['answer']

'North America'