# Practical Use Case : Valeo
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/2b/Valeo_Logo.svg/2560px-Valeo_Logo.svg.png" alt="Logo" width="200"/>

### Last name: AMROUN
### First name: Abdelkader
### Email: aek.amroun@gmail.com

#### Imports and installations

In [1]:
! pip install torch datasets transformers flask



In [None]:
import random
import collections

import pandas as pd
import numpy as np
import torch
from tqdm.auto import tqdm
from IPython.display import display, HTML

from datasets import load_dataset, load_metric, ClassLabel, Sequence
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    default_data_collator
)
import transformers



# 3. Fine-Tuning DistilBERT for Question Answering

In this notebook, we fine-tune the **DistilBERT** model for an extractive **Question Answering (QA)** task: given a question and a context passage, the model selects the substring of the context that best answers the question.

For this task, we use **SQuAD v1 (Stanford Question Answering Dataset)**, one of the most widely used QA datasets. SQuAD v1 consists of over 100,000 question-answer pairs, where every answer is a contiguous span of text from the context, which makes it well suited to extractive question answering.

To streamline fine-tuning, we use Hugging Face's `Trainer` API, which handles the training and evaluation loops. By the end of this notebook, we will have a fine-tuned DistilBERT model that answers questions by extracting the appropriate text from a given passage.

**Note:** The model does not generate new text; it only identifies the answer span within the provided context.



### Model Checkpoint

In this notebook, we use the pre-trained **DistilBERT** model with the checkpoint `"distilbert-base-uncased"`. DistilBERT is a distilled version of BERT that, according to its authors, is 40% smaller and 60% faster while retaining 97% of BERT's language-understanding performance, which makes it a good choice for resource-efficient fine-tuning.

Since we are working with **SQuAD v1**, which only contains answerable questions, we set `squad_v2 = False`. This indicates that we are not using the SQuAD v2 dataset, which includes unanswerable questions.


In [6]:

squad_v2 = False
model_checkpoint = "distilbert-base-uncased"


### Loading the Dataset

We use the `datasets` library from Hugging Face to easily load and handle our dataset. Depending on whether we are working with **SQuAD v1** or **SQuAD v2**, the dataset is loaded accordingly:

- If `squad_v2 = True`, the **SQuAD v2** dataset is loaded, which includes unanswerable questions.
- Since we've set `squad_v2 = False` for this task, we are loading the **SQuAD v1** dataset, which contains only answerable questions.

The code simplifies dataset management by automatically selecting the appropriate version.


In [8]:
datasets = load_dataset("squad_v2" if squad_v2 else "squad")



The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for each available split, here the training and validation sets.

In [9]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

We can see that the training and validation sets both have a column for the context, the question and the answers to those questions.

To access an actual element, we need to select a split, then give an index:

In [10]:
datasets["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

We can see the answers are indicated by their start position in the text (here at character 515) and their full text, which is a substring of the context as we mentioned above.
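The relationship between `answer_start` and the answer text can be checked with plain string slicing. The sketch below uses only an excerpt of the context shown above, so the offset it computes is not the 515 reported for the full passage; the variable names are illustrative.

```python
# Excerpt of the context above (not the full passage), so the offset differs from 515.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"

# SQuAD stores the character index at which the answer begins in the context.
answer_start = context.find(answer_text)
answer_end = answer_start + len(answer_text)

# Slicing the context with these two indices recovers the answer exactly.
recovered = context[answer_start:answer_end]
print(answer_start, repr(recovered))
```

The same slice, `context[answer_start:answer_start + len(text)]`, should hold for every example in the dataset.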

### Displaying Random Dataset Elements

To better understand the dataset, we use a helper function `show_random_elements` to display random examples from it. The function selects a specified number of random entries (by default 10) and presents them in a clear table format.

Here's how the function works:
- It ensures the number of requested examples does not exceed the size of the dataset.
- It selects random, non-repeating indices from the dataset.
- The data is then converted into a **pandas DataFrame** for easy viewing.



In [11]:
def show_random_elements(dataset, num_examples=10):
    # Ensure we don't attempt to sample more elements than are available in the dataset.
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."

    # Randomly pick 'num_examples' unique indices from the dataset.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        # Ensure we don't pick the same index more than once.
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    # Convert the selected entries into a pandas DataFrame for easy display.
    df = pd.DataFrame(dataset[picks])

    # Transform ClassLabel features into human-readable names for better understanding.
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])

    # Display the DataFrame in HTML format for better visualization.
    display(HTML(df.to_html()))



In [12]:
show_random_elements(datasets["train"], num_examples=3)

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 56f87773a6d7ea1400e1769a | Southampton | The city has undergone many changes to its governance over the centuries and once again became administratively independent from Hampshire County as it was made into a unitary authority in a local government reorganisation on 1 April 1997, a result of the 1992 Local Government Act. The district remains part of the Hampshire ceremonial county. | What ceremonial county does Southampton still belong to? | `{'text': ['Hampshire'], 'answer_start': [316]}` |
| 1 | 5728d1664b864d1900164eb7 | Asthma | For those with severe persistent asthma not controlled by inhaled corticosteroids and LABAs, bronchial thermoplasty may be an option. It involves the delivery of controlled thermal energy to the airway wall during a series of bronchoscopies. While it may increase exacerbation frequency in the first few months it appears to decrease the subsequent rate. Effects beyond one year are unknown. Evidence suggests that sublingual immunotherapy in those with both allergic rhinitis and asthma improve outcomes. | What treatment helps improve those with allergic rhinitis and asthma? | `{'text': ['sublingual immunotherapy'], 'answer_start': [415]}` |
| 2 | 570cef0dfed7b91900d45b09 | Digestion | Teeth (singular tooth) are small whitish structures found in the jaws (or mouths) of many vertebrates that are used to tear, scrape, milk and chew food. Teeth are not made of bone, but rather of tissues of varying density and hardness, such as enamel, dentine and cementum. Human teeth have a blood and nerve supply which enables proprioception. This is the ability of sensation when chewing, for example if we were to bite into something too hard for our teeth, such as a chipped plate mixed in food, our teeth send a message to our brain and we realise that it cannot be chewed, so we stop trying. | What happens when you bite something you cant chew? | `{'text': ['our teeth send a message to our brain and we realise that it cannot be chewed, so we stop trying.'], 'answer_start': [502]}` |


## Preprocessing the training data

Before we can feed those texts to our model, we need to preprocess them. This is done by a Hugging Face Transformers tokenizer, which (as the name indicates) tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [13]:

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



The following assertion ensures that our tokenizer is a fast tokenizer (backed by Rust) from the Hugging Face Tokenizers library. Fast tokenizers are available for almost all models, and we will need some of their special features for our preprocessing.

In [14]:

assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

We can directly call this tokenizer on two sentences (the first one is the question, and the second one is the context):

In [15]:
tokenizer("What is your name?", "My name is Maxime.")

{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
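In the output above, `101` and `102` are the IDs of the `[CLS]` and `[SEP]` special tokens in this checkpoint's vocabulary: the question sits between `[CLS]` and the first `[SEP]`, and the context sits between the two `[SEP]` tokens. A minimal sketch that splits the IDs printed above back into the two segments (the helper `split_pair` is hypothetical, not part of the `transformers` API):

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the uncased BERT/DistilBERT vocabulary

# input_ids copied from the tokenizer output above.
input_ids = [101, 2054, 2003, 2115, 2171, 1029, 102,
             2026, 2171, 2003, 25353, 22144, 2378, 1012, 102]

def split_pair(ids):
    """Split a `[CLS] A [SEP] B [SEP]` sequence into the IDs of A and B."""
    if ids[0] != CLS_ID or ids[-1] != SEP_ID:
        raise ValueError("expected a sequence of the form [CLS] A [SEP] B [SEP]")
    first_sep = ids.index(SEP_ID)
    return ids[1:first_sep], ids[first_sep + 1:-1]

question_ids, context_ids = split_pair(input_ids)
print(len(question_ids), len(context_ids))
```

In the actual preprocessing, the same distinction is obtained from `sequence_ids()`, which works without hard-coding special token IDs.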

In question answering, handling very long documents presents a unique challenge. For other tasks, we usually truncate inputs that exceed the model's maximum sequence length. In QA, however, truncating part of the context could remove the answer we're trying to extract. To address this, we split a long document into multiple smaller chunks (input features), each within the model's maximum length (or the length defined by a hyper-parameter). Additionally, to avoid cutting an answer that falls at the boundary between two chunks, we introduce some overlap between consecutive chunks. This overlap is controlled by the hyper-parameter `doc_stride`.

In [16]:
max_length = 384 # The maximum length of a feature (question and context).
doc_stride = 128 # The allowed overlap between two parts of the context when it has to be split.
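To make the overlap concrete, here is a simplified sketch of the chunking idea on a plain list of tokens. It ignores the question and the special tokens that the real tokenizer also fits into each feature, and the function name `split_with_overlap` is hypothetical; it only illustrates that consecutive chunks share `overlap` tokens.

```python
def split_with_overlap(tokens, max_len, overlap):
    """Split `tokens` into chunks of at most `max_len`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks

# Toy example: 10 "tokens", windows of 4 with an overlap of 2.
tokens = list(range(10))
chunks = split_with_overlap(tokens, max_len=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With the values above (`max_length = 384`, `doc_stride = 128`), the same logic means that an answer shorter than the overlap is always fully contained in at least one chunk.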

Note that we never want to truncate the question, only the context, hence the `only_second` truncation setting. Our tokenizer can automatically return a list of features capped at a certain maximum length, with the overlap discussed above; we just have to pass `return_overflowing_tokens=True` together with the stride.

For this notebook to work with any kind of model, we need to account for the special case where the model expects padding on the left (in which case we switch the order of the question and the context).

### Tokenizing and Preparing Features for Training

In this section, we prepare our training data by tokenizing the input examples and handling long contexts that might get split into multiple chunks. Here's how the `prepare_train_features` function works:

- **Whitespace Cleanup**: We remove leading whitespace from the questions so that it does not use up token space that should go to the context.
  
- **Tokenization with Overflows**: The function tokenizes the question and context, truncating as needed. If a context is too long, it is split into multiple features using a stride, which creates overlapping chunks. This helps prevent losing the answer if it's near the split point between two chunks.

- **Mapping Features to Original Examples**: Since one example can generate multiple features due to splitting, we use the `overflow_to_sample_mapping` to keep track of which feature corresponds to which original example.

- **Offset Mapping**: The `offset_mapping` tells us where each token aligns with the original character position in the context, helping us accurately label the start and end positions of the answers.

- **Labeling the Start and End Positions**: For each tokenized feature, we identify where the answer starts and ends. If the answer is not found in the current span, we mark the CLS token as the answer. If the answer is found, we adjust the token indices to match the character positions of the answer.

This process ensures that our tokenized examples are ready for training, with accurate labels for start and end positions of the answers.
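The labeling step can be illustrated without a real tokenizer. The sketch below assumes a whitespace "tokenizer" whose offsets play the role of `offset_mapping`; the helper names are hypothetical. It returns the first and last token covering the answer, or `None` when the answer is not fully inside the window, which is the case labeled with the CLS index in the real function.

```python
import re

def whitespace_offsets(text):
    """Return (start_char, end_char) for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token), or None if it is not covered."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None  # answer lies (partly) outside this window
    first = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    last = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return first, last

context = "The district remains part of the Hampshire ceremonial county."
answer = "Hampshire"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
full_window = char_span_to_token_span(offsets, start_char, end_char)
# A window that stops before the answer: the span cannot be labeled.
truncated_window = char_span_to_token_span(offsets[:6], start_char, end_char)
print(full_window, truncated_window)
```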


In [29]:
# True when the tokenizer pads on the right: the question then comes first and the context second.
# For tokenizers that pad on the left, the order is swapped (context first, question second).
pad_on_right = tokenizer.padding_side == "right"

def prepare_train_features(examples):
    # Remove unnecessary whitespace from the beginning of the questions.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize the questions and contexts with truncation and padding.
    # If the context is too long, split it into overlapping chunks (controlled by doc_stride).
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],   # First sequence: the question when padding on the right.
        examples["context" if pad_on_right else "question"],   # Second sequence: the context when padding on the right.
        truncation="only_second" if pad_on_right else "only_first",  # Truncate the context only, never the question.
        max_length=max_length,   # Maximum length of each feature (question plus context chunk).
        stride=doc_stride,       # Number of overlapping tokens between consecutive chunks.
        return_overflowing_tokens=True,   # Return the extra chunks of long contexts as additional features.
        return_offsets_mapping=True,      # Return the character offsets of each token.
        padding="max_length",    # Pad every feature to max_length.
    )

    # Create a mapping between features (splits) and their corresponding original examples.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # Offset mappings help us align tokens with their original character positions in the context.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Initialize lists for storing start and end positions of the answers.
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    # Iterate over each feature's offsets to determine the answer positions.
    for i, offsets in enumerate(offset_mapping):
        # Index of the CLS token, used for labeling unanswerable questions.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Get the sequence IDs (0 for question, 1 for context) to help locate the context in the feature.
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Map the current feature back to its original example.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # If the example has no answer, mark it as the CLS token (unanswerable).
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Get the start and end character positions of the answer in the original context.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Find the first token of the context within this feature.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # Find the last token of the context within this feature (skipping padding and special tokens).
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # If the answer is outside the current span, label it with the CLS token.
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Adjust the token start and end indices to exactly match the answer span.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples


### Creating a Smaller Subset and Tokenizing the Dataset

In this section, we create a smaller subset of the training and evaluation datasets for faster experimentation and debugging. This allows us to fine-tune our model without using the full dataset initially. Here's the process:

- **Subset Creation**: We select a smaller portion of the training dataset (the first 5008 examples) and of the validation dataset (the first 160 examples). This reduces training time during experimentation on the setup used here (a Tesla T4 on Google Colab).

- **Tokenization**: We apply the `prepare_train_features` function to both subsets to tokenize the examples. During this process, the original columns are removed, and only the necessary features (input IDs, attention masks, start and end positions) are kept.

This approach is useful for running faster iterations while fine-tuning the model on a smaller portion of the data.


In [31]:
# Create a smaller subset of the dataset for faster experimentation and debugging.
# Here, we select the first 5008 examples from the training set and the first 160 examples from the validation set.
small_train_dataset = datasets["train"].select(range(5008))
small_eval_dataset = datasets["validation"].select(range(160))

# Apply the prepare_train_features function to tokenize the training subset.
# We use the map function to tokenize the dataset in batches.
# The original columns are removed to keep only the necessary tokenized features.
tokenized_train_dataset = small_train_dataset.map(
    prepare_train_features,  # Function used to tokenize and process the data.
    batched=True,            # Apply the function to batches of examples.
    remove_columns=small_train_dataset.column_names  # Remove the original columns (context, question, etc.).
)

# Apply the same tokenization process to the evaluation subset.
tokenized_eval_dataset = small_eval_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=small_eval_dataset.column_names
)

# If you want to use the full dataset for training, the following commented line can be used to tokenize the entire dataset.
# tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)


## Fine-tuning the model

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [32]:


model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Setting Up Training Arguments

We define the training configuration using the `TrainingArguments` class from Hugging Face's `transformers` library. These arguments control how the model is fine-tuned and evaluated. Here's a breakdown of the key settings:

- **Model Name**: We extract the base model name from the `model_checkpoint` and use it for naming the output directory.
  
- **Batch Size**: We set a batch size of 16 for both training and evaluation, which is the number of examples processed at once on each device.

- **Learning Rate**: The learning rate is set to `2e-5`, a common value for fine-tuning transformer models, controlling how fast the model's weights are updated.

- **Evaluation Strategy**: The model will be evaluated at the end of each epoch, allowing us to monitor its performance.

- **Training Epochs**: The model will be trained for 3 full passes through the dataset (epochs).

- **Weight Decay**: A value of 0.01 is used for weight decay, which helps regularize the model and prevent overfitting.


In [33]:
# Extract the model's base name from the model checkpoint path.
model_name = model_checkpoint.split("/")[-1]

# Define the batch size for training and evaluation. 16 examples will be processed at a time per device.
batch_size = 16

# Set up the training arguments using Hugging Face's TrainingArguments class.
args = TrainingArguments(
    f"{model_name}_squad",  # The output directory where the model and checkpoints will be saved.
    evaluation_strategy="epoch",      # Evaluate the model at the end of every epoch.
    learning_rate=2e-5,               # Set the learning rate, controlling how fast the model learns.
    per_device_train_batch_size=batch_size,  # Number of examples per batch during training.
    per_device_eval_batch_size=batch_size,   # Number of examples per batch during evaluation.
    num_train_epochs=3,               # Number of training epochs (full passes through the dataset).
    weight_decay=0.01,                # Apply weight decay (regularization) to prevent overfitting.
)




Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined above and customize the number of epochs for training, as well as the weight decay.

This configuration does not push the model to the [Hub](https://huggingface.co/models). To push it regularly during training, add `push_to_hub=True` to the arguments, which requires being logged in to the Hub. If you want to save your model locally under a name different from the repository it is pushed to, or to push it under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/bert-finetuned-squad"` or `"huggingface/bert-finetuned-squad"`).

Then we need a data collator that batches our processed examples together; here the default one will work:

In [34]:
data_collator = default_data_collator

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [35]:
# Initialize the Trainer with the specified parameters
trainer = Trainer(
    model,  # The model to be trained
    args,  # Training arguments
    # train_dataset=tokenized_datasets["train"],  # Original training dataset (commented out)
    # eval_dataset=tokenized_datasets["validation"],  # Original evaluation dataset (commented out)
    train_dataset=tokenized_train_dataset,  # Tokenized training dataset
    eval_dataset=tokenized_eval_dataset,  # Tokenized evaluation dataset
    data_collator=data_collator,  # Default collator; features are already padded to max_length
    tokenizer=tokenizer,  # Tokenizer used for preprocessing
)

We can now finetune our model by just calling the `train` method:

In [36]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 2.173265 |
| 2 | 2.806700 | 1.675167 |
| 3 | 2.806700 | 1.65103 |


TrainOutput(global_step=957, training_loss=2.1279300701655566, metrics={'train_runtime': 563.223, 'train_samples_per_second': 27.176, 'train_steps_per_second': 1.699, 'total_flos': 1499832261817344.0, 'train_loss': 2.1279300701655566, 'epoch': 3.0})
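The `global_step=957` reported above can be sanity-checked with a little arithmetic. Because long contexts are split into several features, the 5008 selected examples yield somewhat more than 5008 training features. The sketch below estimates that number from the reported throughput and assumes a single device with no gradient accumulation; treat it as an approximation, not an exact count.

```python
import math

# Values copied from the TrainOutput above.
train_runtime = 563.223        # seconds
samples_per_second = 27.176
global_step = 957

batch_size = 16
num_epochs = 3

# Features seen per epoch, estimated from the reported throughput.
features_per_epoch = round(train_runtime * samples_per_second / num_epochs)

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(features_per_epoch / batch_size)
total_steps = steps_per_epoch * num_epochs

print(features_per_epoch, steps_per_epoch, total_steps)
```

The estimate is consistent with the logged step count, which suggests that only a small fraction of the selected contexts were long enough to be split.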

Since this training is particularly long, let's save the model just in case we need to restart.

In [37]:
trainer.save_model("test-squad-trained")

# 4. Model Evaluation (Bonus)

To evaluate our fine-tuned DistilBERT model for the QA task, we need to map the model's predictions back to specific parts of the context. The model outputs logits for both the start and end positions of the predicted answers. Let's take a batch from our validation dataloader and examine the model's output:

In [38]:
# Get a batch from the evaluation dataloader.
# The loop stops after retrieving the first batch.
for batch in trainer.get_eval_dataloader():
    break

# Move the batch data to the device (GPU or CPU) being used by the model.
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}

# Disable gradient calculations since we're in evaluation mode (no need for backpropagation).
with torch.no_grad():
    # Pass the batch through the model and get the output.
    output = trainer.model(**batch)

# Display the keys of the model's output (here: the loss and the start/end logits).
output.keys()


odict_keys(['loss', 'start_logits', 'end_logits'])

The output of the model is a dict-like object that contains the loss (since we provided labels) and the start and end logits. We won't need the loss for our predictions, so let's have a look at the logits:

In [39]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([16, 384]), torch.Size([16, 384]))

We have one logit for each feature and each token. The most obvious way to predict an answer for each feature is to take the index of the maximum start logit as the start position and the index of the maximum end logit as the end position.

In [40]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([ 46,  57,  89,  43, 167, 162,  72, 160, 162, 159,  73,  41,  80,  91,
         156,  35], device='cuda:0'),
 tensor([ 47,  47,  92,  44, 141, 150,  75, 148, 150, 147,  76,  42,  83,  94,
         158,  35], device='cuda:0'))

While this approach works well in many cases, there are situations where the predicted answer might be invalid. For instance, the start position might be greater than the end position, or the model might highlight a span of text within the question itself instead of the answer. In such cases, it's useful to consider the second-best prediction and check whether it provides a valid answer instead.

However, selecting the second-best answer isn't straightforward. Should we choose the second-best start position with the best end position, or vice versa? And if the second-best prediction is also invalid, determining the third-best becomes even more complex.

To resolve this, we will rank candidate answers by a score obtained by summing the start and end logits. Instead of exhaustively ordering all possible answers, we will use a hyperparameter, `n_best_size`, to limit the number of candidates. We'll select the top indices from the start and end logits, generate the corresponding answers, and check their validity. Once validated, we sort them by their score and select the best option.


In [41]:
n_best_size = 20
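A minimal sketch of this n-best search on toy data, using plain Python lists instead of tensors. The logits and offsets below are made up for illustration, and the helper names are hypothetical; an offset of `None` marks a position that does not belong to the context.

```python
def top_indices(scores, n):
    """Indices of the n largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

def best_valid_span(start_logits, end_logits, offsets, n_best_size=20):
    """Return the highest-scoring valid (start, end) candidate, or None if there is none."""
    valid_answers = []
    for s in top_indices(start_logits, n_best_size):
        for e in top_indices(end_logits, n_best_size):
            if offsets[s] is None or offsets[e] is None:
                continue  # span touches the question or a special token
            if e < s:
                continue  # span would end before it starts
            valid_answers.append(
                {"score": start_logits[s] + end_logits[e], "start": s, "end": e}
            )
    return max(valid_answers, key=lambda a: a["score"], default=None)

# Toy feature: positions 0-1 are outside the context, 2-4 are context tokens.
start_logits = [0.1, 5.0, 0.2, 3.0, 0.3]
end_logits = [0.0, 0.5, 4.0, 0.1, 2.0]
offsets = [None, None, (0, 4), (5, 9), (10, 14)]

best = best_valid_span(start_logits, end_logits, offsets)
print(best)
```

Here the plain argmax would pick start position 1, which lies outside the context, and end position 2; the search discards that pair and returns the best remaining combination.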

Once we have a list of valid candidate answers (`valid_answers`), we can sort them by their `score` and retain only the best one. The remaining challenge is ensuring that a predicted span is part of the context (and not the question) and extracting the corresponding text. To achieve this, we need to add two crucial elements to our validation features:

- The ID of the example that generated the feature, as each example can produce multiple features (as explained earlier).
- The offset mapping, which provides a mapping from token indices to character positions within the context.
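The role of the offset mapping can be sketched as follows. The feature layout below is hypothetical: three leading positions stand in for non-context tokens (their offsets are `None`), and whitespace-separated words stand in for the real tokens. The helper `span_to_text` is an illustration, not a library function.

```python
import re

context = "It is a replica of the grotto at Lourdes, France."

# Three non-context positions, followed by one offset pair per context "token".
offsets = [None, None, None] + [
    (m.start(), m.end()) for m in re.finditer(r"\S+", context)
]

def span_to_text(context, offsets, start_token, end_token):
    """Turn a predicted token span into the corresponding substring of the context."""
    if offsets[start_token] is None or offsets[end_token] is None:
        return None  # span is not entirely inside the context
    start_char = offsets[start_token][0]
    end_char = offsets[end_token][1]
    return context[start_char:end_char]

answer = span_to_text(context, offsets, start_token=11, end_token=12)
rejected = span_to_text(context, offsets, start_token=1, end_token=12)
print(repr(answer), rejected)
```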

To accommodate these needs, we will reprocess the validation set using a function that is slightly modified from `prepare_train_features`:


In [43]:
def prepare_validation_features(examples):
    # Some questions may have unnecessary whitespace at the beginning, which isn't useful and can interfere with
    # truncation of the context (by using up token space). So we remove the leading whitespace from the questions.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize the examples, ensuring truncation is applied where necessary and overflows are handled with a stride.
    # If a context is too long, it may produce several features, each with some overlapping context from the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],  # Tokenize either the question or context first based on the padding direction
        examples["context" if pad_on_right else "question"],  # Tokenize the other part (context or question)
        truncation="only_second" if pad_on_right else "only_first",  # Truncate the second (or first) sequence based on padding direction
        max_length=max_length,  # Maximum token length allowed
        stride=doc_stride,  # Stride determines how much overlap between chunks when handling long contexts
        return_overflowing_tokens=True,  # Return any overflowed tokens that resulted from the tokenization
        return_offsets_mapping=True,  # Return the offset mapping to track positions of tokens relative to the original text
        padding="max_length",  # Pad sequences to the max length
    )

    # Since one example might produce several features due to a long context, we need a way to map each feature
    # back to its original example. "overflow_to_sample_mapping" provides this mapping.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We create a list to store the IDs of the examples that generated each feature. We also store offset mappings.
    tokenized_examples["example_id"] = []

    # Iterate through all tokenized features (input_ids) to process each feature individually.
    for i in range(len(tokenized_examples["input_ids"])):
        # Get the sequence IDs for the current feature, which helps distinguish between question and context.
        sequence_ids = tokenized_examples.sequence_ids(i)
        # Determine the index for the context part of the input based on whether we're padding on the right.
        context_index = 1 if pad_on_right else 0

        # Map the current feature to its original example using the sample_mapping.
        sample_index = sample_mapping[i]
        # Store the ID of the example that produced this feature.
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # For each token's offset mapping, set it to None if it belongs to the question (i.e., not part of the context).
        # This helps later in determining which tokens belong to the context.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)  # Keep only context offsets, nullify others
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
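
To make the offset-masking step concrete, here is a minimal sketch on hand-written data. The `sequence_ids` and `offset_mapping` values below are hypothetical stand-ins, not real tokenizer output; they only illustrate how question and special-token positions end up as `None` while context positions keep their character offsets.

```python
# Toy stand-ins for one tokenized feature (hypothetical values, not real tokenizer output).
context = "Paris hosted them"

# None marks special tokens, 0 marks question tokens, 1 marks context tokens.
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
# (start_char, end_char) of each token inside its own source string.
offset_mapping = [(0, 0), (0, 3), (4, 10), (0, 0), (0, 5), (6, 12), (13, 17), (0, 0)]

context_index = 1  # the context is the second sequence when padding on the right

# Same masking rule as in prepare_validation_features: keep context offsets only.
masked_offsets = [
    offset if sequence_ids[k] == context_index else None
    for k, offset in enumerate(offset_mapping)
]
print(masked_offsets)

# A span from token 4 to token 5 maps back to characters 0..12 of the context.
span_text = context[masked_offsets[4][0]:masked_offsets[5][1]]
print(span_text)
```

Any token whose masked offset is `None` can then be rejected as a span boundary with a simple `is None` check.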


As in training, we can apply this function to the validation set with `map`:

In [44]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [45]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [46]:
# Set the format of the validation features to match the current format type.
# We specify the columns that we want to include, which are all the keys in the validation features' schema (features).
validation_features.set_format(
    type=validation_features.format["type"],  # Maintain the same format type (e.g., 'torch', 'numpy', etc.)
    columns=list(validation_features.features.keys())  # Set the columns to include all available feature keys
)


We can now refine the test we had before: since we set `None` in the offset mappings for tokens that belong to the question, it is easy to check whether an answer lies fully inside the context. We also discard very long answers, using a hyperparameter we can tune:

In [47]:
max_answer_length = 30

The `postprocess_qa_predictions` function handles post-processing of raw predictions from our question-answering model. The process involves the following key steps:

1. **Mapping examples to features**: Since multiple features can correspond to a single example, a map is created to track which features belong to which example.

2. **Logits and score calculation**: The function retrieves the start and end logits for each feature and calculates scores for potential answer spans.

3. **Handling null predictions**: For datasets like SQuAD v2, where "no answer" is a valid prediction, the function tracks the null prediction score (based on the `[CLS]` token).

4. **Filtering and validation**: The function filters out invalid answers (e.g., answers where the end index is before the start index or spans that are too long). It uses the `offset_mapping` to map token positions back to the original text.

5. **Selecting the best answer**: For each example, the valid answers are sorted by score, and the highest-scoring answer is selected. If the null prediction score is higher than the best answer score (in SQuAD v2), the model outputs no answer.

The function returns the final predictions for each example.
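
The span-selection logic in steps 2, 4, and 5 can be sketched on a single toy feature. This is a simplified illustration, not the notebook's function: `best_span` is a hypothetical helper, the logits and offsets are invented, and plain Python sorting replaces `np.argsort`.

```python
def best_span(start_logits, end_logits, offsets, context, n_best_size=3, max_answer_length=30):
    """Return the highest-scoring valid span as a dict with 'score' and 'text'."""
    # Indices of the n_best_size largest logits, best first.
    start_indexes = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)[:n_best_size]
    end_indexes = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)[:n_best_size]

    candidates = []
    for s in start_indexes:
        for e in end_indexes:
            # Skip positions outside the context (masked with None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append({
                "score": start_logits[s] + end_logits[e],
                "text": context[offsets[s][0]:offsets[e][1]],
            })

    if not candidates:
        return {"score": 0.0, "text": ""}
    return max(candidates, key=lambda c: c["score"])


# Invented data: index 0 plays the role of [CLS] and has the largest logits.
context = "Paris hosted them"
offsets = [None, None, None, None, (0, 5), (6, 12), (13, 17), None]
start_logits = [5.0, 0.1, 0.2, 0.0, 3.0, 1.0, 0.5, 0.0]
end_logits = [4.0, 0.1, 0.3, 0.0, 2.5, 1.5, 0.2, 0.0]

result = best_span(start_logits, end_logits, offsets, context)
print(result)
```

Even though position 0 has the largest start and end logits, it is discarded because its offset is `None`, so the best remaining span inside the context wins.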


In [51]:


def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []

        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some of the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}

        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

And we can apply our post-processing function to our raw predictions:

In [52]:
# Call the postprocessing function to convert the raw model predictions into final, human-readable answers.
# This function takes the validation examples, tokenized features, and the raw start and end logits from the model.

final_predictions = postprocess_qa_predictions(
    datasets["validation"],      # The original validation dataset with examples and contexts.
    validation_features,         # The processed validation features including tokenized inputs and offset mappings.
    raw_predictions.predictions  # The raw start and end logits predicted by the model for each feature.
)


Post-processing 10570 example predictions split into 10784 features.



In this step, we load the `squad` metric from the datasets library, which is specifically designed for evaluating models on the **SQuAD (Stanford Question Answering Dataset)**. This metric provides two main scores:

1. **Exact Match (EM)**: This measures the percentage of predictions that match the ground truth answer exactly. The predicted answer must be identical to the correct answer once both have been normalized (lowercased, with punctuation, articles, and extra whitespace removed).
   
2. **F1 Score**: This score considers the overlap between the predicted answer and the ground truth in terms of words. The F1 score is a balance between precision (the proportion of predicted words that are relevant) and recall (the proportion of relevant words that are predicted).

### How to interpret the metrics:
- **Exact Match (EM)**: A high EM score indicates that the model is providing answers that are highly accurate in terms of matching the reference answers exactly. A perfect score of 100% means every answer is correct without any differences.
- **F1 Score**: The F1 score allows for partial credit when the model's predicted answer overlaps with the correct answer. It is useful when an exact match is not found, but the model captures most of the relevant information. A higher F1 score (close to 100%) means better performance.

Both metrics are commonly used in QA tasks to assess the model's ability to understand the context and extract accurate answers.


### How SQuAD Metrics Are Calculated

In the context of the SQuAD (Stanford Question Answering Dataset), the two key evaluation metrics, Exact Match (EM) and F1 Score, are calculated as follows:

#### 1. Exact Match (EM)

The Exact Match metric checks if the predicted answer exactly matches the ground truth answer (the correct answer from the dataset).

##### Step-by-step Calculation:

1. The prediction is compared to the ground truth after normalizing both. Normalization includes lowercasing, removing articles (a, an, the), punctuation, and extra whitespace.
2. If the normalized predicted answer is exactly the same as the normalized ground truth answer, the score for that prediction is 1.
3. If not, the score is 0.
4. The Exact Match score for the entire dataset is the average of the EM scores across all examples in the dataset.

##### Formula:

$$ \text{Exact Match (EM)} = \frac{\text{Number of exact matches}}{\text{Total number of questions}} \times 100 $$

##### Example:

- **Ground truth:** "New York City"
  
  **Prediction:** "New York City"
  
  **EM = 1** (Exact Match)

- **Ground truth:** "New York City"
  
  **Prediction:** "NYC"
  
  **EM = 0** (Not an Exact Match)
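
The normalization and comparison steps above can be sketched in a few lines. This is an illustrative reimplementation that follows the steps listed here, not the code of the `squad` metric itself; `normalize_answer` and `exact_match` are names chosen for this example.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))


print(exact_match("New York City", "New York City"))       # 1
print(exact_match("the new york city.", "New York City"))  # 1
print(exact_match("NYC", "New York City"))                 # 0
```

The second call shows why normalization matters: casing, the article, and the trailing period do not count against the prediction.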

#### 2. F1 Score

The F1 Score is a more forgiving metric that measures the word-level overlap between the predicted answer and the ground truth. It is calculated using both precision and recall.

##### Precision:

The fraction of predicted words that are in the ground truth answer.

$$ \text{Precision} = \frac{\text{Number of overlapping words}}{\text{Total number of predicted words}} $$

##### Recall:

The fraction of ground truth words that are in the predicted answer.

$$ \text{Recall} = \frac{\text{Number of overlapping words}}{\text{Total number of ground truth words}} $$

##### F1 Score:

The harmonic mean of precision and recall, giving a score between 0 and 1. A perfect F1 score of 1 indicates a perfect match, and 0 indicates no overlap.

$$ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

##### Step-by-step Calculation:

1. Break the predicted answer and the ground truth into words.
2. Count how many words overlap.
3. Calculate precision and recall based on the overlap.
4. Compute the F1 score from precision and recall.

##### Example:

- **Ground truth:** "New York City"
  
  **Prediction:** "New York"
  
  **Overlap:** "New York" (2 words overlap)
  
  $$ \text{Precision} = \frac{2}{2} = 1.0 $$
  
  $$ \text{Recall} = \frac{2}{3} \approx 0.67 $$
  
  $$ \text{F1} = \frac{2 \times 1.0 \times 0.67}{1.0 + 0.67} \approx 0.80 $$

- **Ground truth:** "New York City"
  
  **Prediction:** "Los Angeles"
  
  **Overlap:** None
  
  $$ \text{Precision} = \frac{0}{2} = 0 $$
  
  $$ \text{Recall} = \frac{0}{3} = 0 $$
  
  $$ \text{F1} = 0 $$
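
The two worked examples can be reproduced with a short sketch. It assumes a simplified normalization (lowercasing and whitespace splitting only), whereas the actual metric also strips punctuation and articles; `token_f1` is a name chosen for this example.

```python
from collections import Counter


def token_f1(prediction, ground_truth):
    """Word-level F1 between a prediction and one ground truth answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()

    # Multiset intersection counts each shared word at most as often as it appears in both.
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0

    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("New York", "New York City"), 2))  # 0.8
print(token_f1("Los Angeles", "New York City"))         # 0.0
```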

##### Key Points:

- EM is stricter: It demands an exact word match between the predicted answer and the correct one.
- F1 is more flexible: It accounts for partial matches, making it a useful measure for cases where the prediction contains part of the correct answer but isn't a perfect match.


In [53]:
metric = load_metric("squad")



Then we can call `compute` on it. We just need to reformat the predictions and labels, since the metric expects lists of dictionaries rather than one big dictionary.

In [54]:
# Create a list of dictionaries where each dictionary contains an "id" and its corresponding "prediction_text".
# This is generated by iterating over the key-value pairs (k, v) of the 'final_predictions' dictionary.
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]

# Create a list of dictionaries where each dictionary contains an "id" and its corresponding "answers".
# This is generated by iterating over the "validation" dataset and extracting the "id" and "answers" for each example (ex).
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]

# Compute the metric by comparing the formatted predictions against the references.
# 'predictions' is the list of formatted predictions created earlier, and 'references' is the ground truth.
metric.compute(predictions=formatted_predictions, references=references)


{'exact_match': 54.34247871333964, 'f1': 64.99205071723347}
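
To illustrate the input shapes and how a dataset-level percentage comes about, here is a hedged sketch on two invented examples. SQuAD validation examples can carry several reference answers, and a prediction is credited with its best match among them. The ids, texts, and `answer_start` values are hypothetical, and `simple_exact` is a deliberately simplified comparison rather than the metric's full normalization.

```python
def simple_exact(prediction, truth):
    """Simplified comparison: ignores case and surrounding whitespace only."""
    return float(prediction.strip().lower() == truth.strip().lower())


def dataset_exact_match(predictions, references):
    """Average, in percent, of the best per-example match over all reference answers."""
    truths_by_id = {ref["id"]: ref["answers"]["text"] for ref in references}
    total = 0.0
    for pred in predictions:
        truths = truths_by_id[pred["id"]]
        total += max(simple_exact(pred["prediction_text"], t) for t in truths)
    return 100.0 * total / len(predictions)


# Hypothetical data in the same list-of-dicts layout used above.
predictions = [
    {"id": "q1", "prediction_text": "Denver Broncos"},
    {"id": "q2", "prediction_text": "1066"},
]
references = [
    {"id": "q1", "answers": {"text": ["Denver Broncos", "the Denver Broncos"], "answer_start": [177, 173]}},
    {"id": "q2", "answers": {"text": ["in 1066"], "answer_start": [10]}},
]

print(dataset_exact_match(predictions, references))  # 50.0
```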

### Comment on Results

The results from the fine-tuning of the DistilBERT model on the SQuAD dataset, achieving an **Exact Match (EM) of 54.34%** and an **F1 score of 64.99%**, are lower than what would typically be expected from models trained on the full dataset. However, this outcome was anticipated due to the limited training data and computational resources used.

- **Training Data**: The model was trained on less than 10% of the SQuAD dataset, which significantly restricts the amount of information the model can learn from. Given the complexity of the dataset, which requires a deep understanding of context to answer questions accurately, the reduced dataset likely resulted in the model not fully capturing all nuances necessary for higher accuracy.

- **Training Setup**: The training was conducted on Google Colab using a **Tesla T4 GPU**, and to manage the available computational power, the model was fine-tuned for only **3 epochs**. These restrictions meant that the model had limited opportunities to optimize and learn the patterns in the data. With more computational resources, such as a more powerful GPU or the ability to train on the full dataset for a longer period, the model's performance would likely improve.

### Conclusion

These scores are reflective of the expected limitations given the constraints of training on a small fraction of the dataset and the available computational resources. Despite the lower scores, the results demonstrate that the model is learning and generalizing to some extent, especially when considering the significant limitations imposed by the setup.


In [None]:
breakpoint() # To stop the execution since the following code serves as an explanation

# 5. API Creation

### API Creation for Question Answering with Flask

In this implementation, I created a simple API using **Flask** to serve a fine-tuned **DistilBERT** model for Question Answering. The following components are covered:

1. **Model Loading**: 
    - The model and tokenizer were loaded from a locally saved directory (`distillbert_squad`) using the `AutoModelForQuestionAnswering` and `DistilBertTokenizerFast` classes from the HuggingFace `transformers` library.

2. **Pipeline Setup**: 
    - A HuggingFace `pipeline` was configured to handle the Question Answering task, which takes input as a question and a context and returns an answer from the context.

3. **API Endpoint**:
    - A `/predict` endpoint was defined using the Flask `@app.route` decorator, which listens for **POST** requests. 
    - The request body must include two fields: a "question" and a "context". 
    - The input is validated, and if either field is missing, an error response with a 400 status code is returned.

4. **Inference**:
    - The model generates a prediction (answer) and confidence score using the `qa_pipeline`. This output is then returned to the client as a JSON object, which includes the original question, the context, the predicted answer, and the model's confidence score.

5. **Deployment**:
    - The Flask app is set to run on `0.0.0.0` to allow external access, using port 5000. For local testing with ngrok, `run_with_ngrok` can be uncommented.

This API allows easy interaction with the fine-tuned DistilBERT model for question answering tasks.


In [4]:
from flask import Flask, request, jsonify
from transformers import DistilBertTokenizerFast, pipeline, AutoModelForQuestionAnswering
# from flask_ngrok import run_with_ngrok  # Uncomment if using ngrok for local deployment
# from optimum.onnxruntime import ORTModelForQuestionAnswering  # Optional: For ONNX runtime

# Initialize Flask app
app = Flask(__name__)

# Load the fine-tuned DistilBERT model and tokenizer from the specified directory
model_dir = "distillbert_squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_dir)  # Load the model for Question Answering
tokenizer = DistilBertTokenizerFast.from_pretrained(model_dir)    # Load the corresponding tokenizer

# Set up a HuggingFace pipeline for Question Answering using the loaded model and tokenizer
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Define an API endpoint at /predict that accepts POST requests
@app.route('/predict', methods=['POST'])
def predict():
    # Parse the JSON request body for the question and context
    data = request.get_json(force=True)

    # Extract the question and context from the request
    question = data.get("question", "")  # Default to empty string if question is not provided
    context = data.get("context", "")    # Default to empty string if context is not provided

    # Validate the input to ensure both question and context are present
    if not question or not context:
        return jsonify({"error": "Please provide both question and context"}), 400  # Return error if missing

    # Use the QA pipeline to generate the answer based on the input question and context
    result = qa_pipeline({
        "question": question,
        "context": context
    })

    # Return the result as a JSON response, including the question, context, answer, and confidence score
    return jsonify({
        "question": question,
        "context": context,
        "answer": result["answer"],  # Extract the predicted answer
        "score": result["score"]     # Extract the confidence score of the prediction
    })

if __name__ == '__main__':
    # Optional: For running the app with ngrok if testing locally
    # run_with_ngrok(app)

    # Start the Flask app on host 0.0.0.0 (all network interfaces) and port 5000
    app.run(host='0.0.0.0', port=5000)



 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.20.10.8:5000
Press CTRL+C to quit
127.0.0.1 - - [05/Sep/2024 17:05:10] "GET / HTTP/1.1" 404 -
127.0.0.1 - - [05/Sep/2024 17:05:10] "GET /favicon.ico HTTP/1.1" 404 -
172.20.10.8 - - [05/Sep/2024 17:05:24] "GET / HTTP/1.1" 404 -
172.20.10.8 - - [05/Sep/2024 17:05:24] "GET /favicon.ico HTTP/1.1" 404 -
127.0.0.1 - - [05/Sep/2024 17:07:29] "POST /predict HTTP/1.1" 200 -
127.0.0.1 - - [05/Sep/2024 17:07:52] "POST /predict HTTP/1.1" 200 -
127.0.0.1 - - [05/Sep/2024 17:08:44] "POST /predict HTTP/1.1" 200 -
127.0.0.1 - - [05/Sep/2024 17:11:47] "POST /predict HTTP/1.1" 400 -
127.0.0.1 - - [05/Sep/2024 17:13:32] "POST /predict HTTP/1.1" 200 -
127.0.0.1 - - [05/Sep/2024 17:15:05] "POST /predict HTTP/1.1" 200 -
127.0.0.1 - - [05/Sep/2024 17:16:12] "POST /predict HTTP/1.1" 200 -
127.0.0.1 - - [05/Sep/2024 17:16:42] "POST /predict HTTP/1.1" 200 -
127.0.0.1 - - [05/Sep/2024 17:17:01] "POST /predict HTTP/
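
For reference, a client could build a request for the `/predict` endpoint as in the sketch below. It uses only the standard library; `build_predict_request` is a hypothetical helper, and the base URL assumes the server is running locally on port 5000 as in the log above. The actual network call is left commented out because it needs the server to be up.

```python
import json
import urllib.request


def build_predict_request(question, context, base_url="http://127.0.0.1:5000"):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps({"question": question, "context": context}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    "What is quantization?",
    "Quantization is a technique to reduce model size and speed up inference.",
)
print(request.get_method(), request.full_url)

# Sending the request requires the Flask server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```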

### API Testing with `unittest`

To ensure the API functions correctly, I wrote unit tests using Python's built-in `unittest` framework. The following describes the test cases implemented:

1. **Setup**: 
   - In the `setUp()` method, I initialized the Flask app's test client using `app.test_client()`. This allows us to simulate HTTP requests to the API without running the server.
   - Testing mode is enabled on the Flask app, so exceptions raised inside a view propagate to the test instead of being converted into a generic 500 response.

2. **Test Cases**:
   - **`test_predict_success()`**: 
     - This test checks if the `/predict` endpoint works as expected when valid input (both question and context) is provided.
     - A POST request is sent with a sample question and context related to "quantization".
     - The test verifies:
       - The response format is JSON.
       - The status code is `200 OK`.
       - The response contains an `answer` key, and the confidence `score` is greater than a low sanity threshold (0.1).
   
   - **`test_predict_missing_data()`**:
     - This test checks the API's behavior when the input is incomplete (missing context).
     - A POST request is sent with only a question, but no context.
     - The test verifies:
       - The API returns a `400 Bad Request` status due to missing data.
       - The response contains an `error` key indicating the issue.

3. **Running the Tests**:
   - The tests are executed using `unittest.main()` when the script is run directly. This will automatically discover and run all test cases in the class.

These tests ensure the core functionality of the API and handle common input validation scenarios.


In [None]:
import unittest
import json
from flask_app import app  # Import the Flask app from the main API module

# Define a test case class using unittest framework to test the Flask API
class FlaskAPITestCase(unittest.TestCase):

    def setUp(self):
        # Enable testing mode on the Flask app itself so exceptions propagate to the test
        app.testing = True
        # Flask provides a test client to simulate API requests without running the server
        self.app = app.test_client()

    def test_predict_success(self):
        # Test case to check the API response when both question and context are provided

        # Sample payload with a valid question and context
        payload = {
            "question": "What is quantization?",
            "context": "Quantization is a technique to reduce model size and speed up inference."
        }

        # Send a POST request to the /predict endpoint with the payload as JSON
        response = self.app.post('/predict', data=json.dumps(payload), content_type='application/json')

        # Ensure the response is in JSON format
        self.assertEqual(response.content_type, 'application/json')

        # Ensure the status code is 200 (OK)
        self.assertEqual(response.status_code, 200)

        # Parse the response data from JSON
        data = json.loads(response.data)

        # Verify that the 'answer' key exists in the response
        self.assertIn("answer", data)

        # Ensure the confidence score ('score') exceeds a low sanity threshold (0.1 in this case)
        self.assertGreater(data["score"], 0.1)

    def test_predict_missing_data(self):
        # Test case to handle scenarios where either the question or context is missing

        # Payload with a missing context field
        payload = {
            "question": "What is quantization?"
        }

        # Send a POST request with the incomplete payload (missing context)
        response = self.app.post('/predict', data=json.dumps(payload), content_type='application/json')

        # Check that the response status is 400 (Bad Request) due to missing data
        self.assertEqual(response.status_code, 400)

        # Parse the response data
        data = json.loads(response.data)

        # Check that the error message is returned in the response
        self.assertIn("error", data)

# If this script is run directly, execute the unit tests
if __name__ == '__main__':
    unittest.main()
