<a href="https://www.kaggle.com/code/aisuko/question-answering-task-nlp?scriptVersionId=174799850" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Question answering tasks return an answer given a question. If we have ever asked a virtual assistant like Siri what the weather is, then we have used a question answering model before. There are two common types of question answering tasks:

* **Extractive:** extract the answer from the given context, which is usually a set of documents or passages. The model selects the answer from the existing text without altering or rephrasing it.

* **Abstractive:** generate an answer from the context that correctly answers the question in a more human-like manner.

Abstractive answers are often more suitable for conversations where information is convoluted and unstructured, as they can provide more coherent and concise responses.
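To make the distinction concrete, here is a minimal sketch in plain Python. The context, question, and both answers are made up for illustration. The point is only that an extractive answer is a verbatim slice of the context, while an abstractive answer need not be.

```python
# Toy illustration: an extractive answer is a verbatim slice of the context.
context = "The Eiffel Tower is located in Paris and was completed in 1889."
question = "When was the Eiffel Tower completed?"

# An extractive system predicts the boundaries of a span inside the context.
answer_start = context.index("1889")
answer_end = answer_start + len("1889")
extractive_answer = context[answer_start:answer_end]

# An abstractive system may produce wording that never appears in the context.
abstractive_answer = "It was finished in the year 1889."  # hand-written example

print(extractive_answer in context)   # the extracted span is a substring
print(abstractive_answer in context)  # the generated sentence is not
```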


We will fine-tune a pretrained model on a labeled question answering dataset and then use the fine-tuned model for inference.

In [1]:
%%capture
!pip install transformers==4.35.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install evaluate==0.4.1

# Prepare the environment

We are going to use the Transformers `Trainer` class in this notebook, and we also want to save the model to the Hub.

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine tune model distilbert base uncased"
os.environ["WANDB_NAME"] = "ft-distilbert-base-uncased-with-squad"
os.environ["MODEL_NAME"]="distilbert-base-uncased"

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `distilbert-base-uncased` from `transformers`...
config.json: 100%|█████████████████████████████| 483/483 [00:00<00:00, 3.35MB/s]
┌────────────────────────────────────────────────────────────┐
│     Memory Usage for loading `distilbert-base-uncased`     │
├───────┬─────────────┬──────────┬───────────────────────────┤
│ dtype │Largest Layer│Total Size│    Training using Adam    │
├───────┼─────────────┼──────────┼───────────────────────────┤
│float32│   89.42 MB  │253.16 MB │         1012.63 MB        │
│float16│   44.71 MB  │126.58 MB │         506.32 MB         │
│  int8 │   22.35 MB  │ 63.29 MB │         253.16 MB         │
│  int4 │   11.18 MB  │ 31.64 MB │         126.58 MB         │
└───────┴─────────────┴──────────┴───────────────────────────┘
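The "Training using Adam" column in the table above is close to four times the "Total Size" column. A common rule of thumb explains this: training with Adam keeps the weights, the gradients, and two optimizer moment buffers in memory. The sketch below checks that reading against the table. The helper `adam_training_estimate_mb` is a hypothetical name introduced here, not part of `accelerate`, and the estimate ignores activations and batch size.

```python
# Rule-of-thumb check of the table above (an assumption, not accelerate's code):
# weights + gradients + two Adam moment buffers ~= 4 copies of the weights.
total_size_mb = {"float32": 253.16, "float16": 126.58, "int8": 63.29, "int4": 31.64}


def adam_training_estimate_mb(model_size_mb: float) -> float:
    """Approximate the training footprint as four copies of the model weights."""
    return 4 * model_size_mb


for dtype, size in total_size_mb.items():
    print(f"{dtype}: ~{adam_training_estimate_mb(size):.2f} MB")
```

Small differences from the reported values are expected, because the table rounds its numbers.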


# Load SQuAD dataset

Start by loading a smaller subset of the SQuAD dataset.

In [4]:
from datasets import load_dataset

squad=load_dataset("squad", split="train[:5000]")

Downloading readme:   0%|          | 0.00/7.62k [00:00<?, ?B/s]



Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Next, split this subset of the dataset's `train` split into a train set and a test set with the `train_test_split` method:

In [5]:
squad=squad.train_test_split(test_size=0.2)
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1000
    })
})

The output above shows several important fields:

* `answers`: the starting character position of the answer in the context, together with the answer text
* `context`: background information from which the model needs to extract the answer
* `question`: the question a model should answer
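The relationship between these fields can be sketched with a single record. The record below is made up in the SQuAD layout and is not taken from the dataset. It shows that `answer_start` is a character offset into `context`, so slicing the context at that offset recovers the answer text.

```python
# A made-up record in the SQuAD layout (not taken from the dataset).
example = {
    "id": "hypothetical-0001",
    "title": "Example",
    "context": "The Amazon river flows through Brazil, Peru and Colombia.",
    "question": "Which countries does the Amazon river flow through?",
    "answers": {"text": ["Brazil, Peru and Colombia"], "answer_start": [31]},
}

# `answer_start` is a character offset, so slicing the context recovers the text.
start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
end = start + len(text)

recovered = example["context"][start:end]
print(recovered == text)
```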

# Preprocess

Load the pretrained model's tokenizer to process the `question` and `context` fields:

In [6]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(os.getenv('MODEL_NAME'))

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

There are a few preprocessing steps particular to question answering tasks we should be aware of:

1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting `truncation="only_second"`.

2. Next, map the start and end positions of the answer to the original `context` by setting `return_offsets_mapping=True`.

3. With the mapping in hand, we can now find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offset mapping corresponds to the `question` and which corresponds to the `context`.
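The steps above can be sketched without any library. In the sketch below, `toy_encode` is a hypothetical stand-in for a fast tokenizer: it splits on words and punctuation and returns character offsets plus sequence ids, with `None` for special tokens. `char_span_to_token_span` then applies the same search that the preprocessing function in the next cell uses. Real tokenizers split text into subwords, so the indices will differ, but the logic is the same.

```python
import re


def toy_encode(question, context):
    """Hypothetical tokenizer stand-in: returns (offsets, sequence_ids)."""
    offsets, seq_ids = [(0, 0)], [None]  # position 0 plays the role of [CLS]
    for seq_id, text in enumerate((question, context)):
        for match in re.finditer(r"\w+|[^\w\s]", text):
            offsets.append((match.start(), match.end()))
            seq_ids.append(seq_id)
        offsets.append((0, 0))  # plays the role of [SEP]
        seq_ids.append(None)
    return offsets, seq_ids


def char_span_to_token_span(offsets, seq_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token)."""
    # First and last token that belong to the context (sequence id 1).
    context_start = seq_ids.index(1)
    context_end = len(seq_ids) - 1 - seq_ids[::-1].index(1)

    # Answer lies outside the (possibly truncated) context: label it (0, 0).
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0

    # Walk forward to the last token that starts at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


question = "Where is the tower?"
context = "The tower is in Paris, France."
start_char = context.index("Paris")
end_char = start_char + len("Paris")

offsets, seq_ids = toy_encode(question, context)
start_token, end_token = char_span_to_token_span(offsets, seq_ids, start_char, end_char)
print(start_token, end_token, context[offsets[start_token][0]:offsets[end_token][1]])
```

If truncation cuts the context off before the answer, the same function returns `(0, 0)`, which is the label the preprocessing code assigns to such examples.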


In [7]:
def preprocess_function(examples):
    questions=[q.strip() for q in examples["question"]]
    inputs =tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    
    offset_mapping=inputs.pop("offset_mapping")
    answers=examples["answers"]
    start_positions=[]
    end_positions=[]
    
    for i, offset in enumerate(offset_mapping):
        answer=answers[i]
        start_char=answer["answer_start"][0]
        end_char=answer["answer_start"][0]+len(answer["text"][0])
        sequence_ids=inputs.sequence_ids(i)
        
        # Find the start and end of the context
        idx=0
        while sequence_ids[idx]!=1:
            idx+=1
        context_start=idx
        while sequence_ids[idx]==1:
            idx+=1
        context_end=idx-1
        
        # If the answer is not fully inside the context, label it (0,0)
        if offset[context_start][0]>end_char or offset[context_end][1]< start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx=context_start
            while idx<=context_end and offset[idx][0]<=start_char:
                idx+=1
            start_positions.append(idx-1)
            
            idx=context_end
            while idx>=context_start and offset[idx][1]>=end_char:
                idx-=1
            end_positions.append(idx+1)

    inputs["start_positions"]=start_positions
    inputs["end_positions"]=end_positions
    return inputs


Apply the preprocessing function over the entire dataset:

In [8]:
tokenized_squad=squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Next, create batches of examples with `DefaultDataCollator`. Unlike other data collators in Transformers, `DefaultDataCollator` does not apply any additional preprocessing such as padding.

In [9]:
from transformers import DefaultDataCollator

data_collator=DefaultDataCollator()
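No padding is needed at this stage because the preprocessing function already padded every example to `max_length`, so all examples have the same length and can simply be grouped together. The sketch below illustrates that idea in plain Python. `simple_collate` is a hypothetical helper, not the library's implementation, the token ids are illustrative, and the real collator additionally converts the grouped values to tensors.

```python
def simple_collate(features):
    """Group a list of per-example dicts into one dict of lists, key by key."""
    return {key: [feature[key] for feature in features] for key in features[0]}


# Two already-padded examples with illustrative token ids.
features = [
    {"input_ids": [101, 2054, 102, 0], "start_positions": 1, "end_positions": 1},
    {"input_ids": [101, 2073, 102, 0], "start_positions": 0, "end_positions": 0},
]

batch = simple_collate(features)
print(batch["input_ids"])
print(batch["start_positions"], batch["end_positions"])
```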

# Train

In [10]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model=AutoModelForQuestionAnswering.from_pretrained(os.getenv('MODEL_NAME'))

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=5,
    gradient_checkpointing=True,  # Enable gradient checkpointing
    num_train_epochs=2,
    fp16=True,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=20,
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
)

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.6 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.0
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240430_021306-2l2wt59a[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mft-distilbert-base-uncased-with-squad[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models/runs/2l2wt59a[0m


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 5.2755        | 4.372942        |
| 2     | 4.2731        | 3.977272        |




TrainOutput(global_step=50, training_loss=4.619023590087891, metrics={'train_runtime': 390.9317, 'train_samples_per_second': 20.464, 'train_steps_per_second': 0.128, 'total_flos': 783918600192000.0, 'train_loss': 4.619023590087891, 'epoch': 2.0})
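The `global_step=50` in the output can be related to the training arguments with some arithmetic. The sketch below is an approximation of how the number of optimizer updates follows from batch size and gradient accumulation. `optimizer_steps` is a hypothetical helper, and `num_devices=2` is an assumption inferred from the reported step count, since the notebook does not state how many GPUs were used.

```python
import math


def optimizer_steps(num_examples, per_device_batch, num_devices, grad_accum, epochs):
    """Approximate number of optimizer updates for a full training run."""
    # Batches drawn per epoch across all devices.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # Gradients are accumulated over `grad_accum` batches before each update.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# Values from the cells above; num_devices=2 is an assumption, not stated in the notebook.
steps = optimizer_steps(
    num_examples=4000, per_device_batch=16, num_devices=2, grad_accum=5, epochs=2
)
effective_batch = 16 * 2 * 5
print(steps, effective_batch)
```

Under that assumption, each update sees 160 examples, so the model receives only 50 updates in total.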

In [12]:
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(os.getenv("WANDB_NAME"))

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.22k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/aisuko/ft-distilbert-base-uncased-with-squad/commit/2a629906d1c8b8d069a676e452cb5c2bb42b77b4', commit_message='ft-distilbert-base-uncased-with-squad', commit_description='', oid='2a629906d1c8b8d069a676e452cb5c2bb42b77b4', pr_url=None, pr_revision=None, pr_num=None)

# Inference

In [13]:
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

In [14]:
from transformers import pipeline

question_answerer=pipeline("question-answering", model=os.getenv("WANDB_NAME"))
question_answerer(question=question, context=context)

{'score': 0.02168649062514305,
 'start': 10,
 'end': 95,
 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}
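The `start` and `end` values returned by the pipeline are character offsets into the context, so the answer can be recovered by slicing. The sketch below checks this, with the result dictionary copied from the output above rather than recomputed.

```python
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

# Copied from the pipeline output above.
result = {
    "score": 0.02168649062514305,
    "start": 10,
    "end": 95,
    "answer": "176 billion parameters and can generate text in 46 languages natural languages and 13",
}

# `start` and `end` index characters of the context.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```

The predicted span covers most of the sentence instead of isolating `13`, and the score is low. This is plausibly a sign that a short training run on a 4,000-example subset is not enough for the model to localize answers well.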