# NLP Project
# Team 2: Acuna, W., Gallo, C., Ostrovsky, S.
# Question Answering System using DistilBERT
# October 21th 2023

# Introduction

Welcome to the "Question Answering System using DistilBERT" notebook. In this project, we will explore the development of a Question Answering (QA)	system that leverages the capabilities of the DistilBERT model. QA systems are essential in various natural language processing applications, including information retrieval, virtual assistants, and chatbots. They enable users to pose questions in natural language and receive precise answers from large volumes of text data. 

Our goal in this project is to fine-tune the DistilBERT model on the Standford Question Answering Dataset (SQuAD), and evaluate its performance in providing accurate answers to various questions. We will walk through the challenges of data preprocessing, model training, and evaluation, and discuss the rationale behind choosing DistilBERT for this task. The journey will include essential functions, such as extacting data from SQuAD, fine-tuning the model, and evaluating its accuracy. Let us dive into the world of Question Answering with DistilBERT!

# Import all the necessary libraries to be used in the following

In [1]:
import json
from pathlib import Path
import torch
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup
import pandas as pd
import matplotlib.pyplot as plt

# Load data and preprocessing

First we loaded the Standford Question Answering Dataset available on Kaggle (Standford University, n.d.). It consists of two different json files, one for training and one for validation, each including a number of paragraphs. For each paragraph, there is then a context, a question, and an answer. We extracted contexts, questions, and answers from both the training and validation sets, by using the read_data function. The latter opens, reads, and parses the received file, before saving the contexts, questions, and answers reported as lists. For each answer, the function considers two pieces of information: the answer text and the character position of the answer start in the text the corresponding question refers to. 

Preprocessing was completed by following the same steps and procedures already proposed by ather authors (Apostolopoulou, n.d.; Fine-tuning with customer datasets, n.d.) with some modifications. 

References: 

Apostolopoulou, A. (n.d.). BERT-based-pretrained-model-using-SQuAD-2.0-dataset/Fine_Tuning_Bert.ipynb, GitHub. https://github.com/alexaapo/BERT-based-pretrained-model-using-SQuAD-2.0-dataset/blob/main/Fine_Tuning_Bert.ipynb

Fine-tuning with custom datasets. (n.d.). Hugging Face. Retrieved October 20, 2023, from https://huggingface.co/transformers/v4.3.3/custom_datasets.html?highlight=fine%20tune  

Standford University. (n.d.). Standford Question Answering Dataset. Kaggle. https://www.kaggle.com/datasets/stanfordu/stanford-question-answering-dataset

In [2]:
# Read SQuAD data

# Function used to extract context, questions, and answers as lists from a given json file
# This function was built as already proposed by other authors (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d.)
def read_data(file):
    
    with open(file) as f: # open the file received as input
        
        dict = json.load(f) # read and parse the file object

    contexts = [] # define the list to save the contexts
    questions = [] # define the list to save the questions
    answers = [] # define the list to save the answers 
    
    for group in dict['data']: # for every group in data
        
        for passage in group['paragraphs']: # for every passage in paragraphs
            
            context = passage['context'] # save the context
            
            for qa in passage['qas']: # for every question/answer in passage
                
                question = qa['question'] # save the question
                
                if qa["answers"]:
                    
                    answer = {
                        "text": qa["answers"][0]["text"],
                        "answer_start": qa["answers"][0]["answer_start"]
                    } # save both the answer text and the character position of the answer start in the corresponding context
                    
                    contexts.append(context) # append the context to the contexts list
                    questions.append(question) # append the question to the questions list
                    answers.append(answer) # append the answer to the answers list
                    
    return contexts, questions, answers # return contexts, questions, and answers

train_contexts, train_questions, train_answers = read_data('train-v1.1.json') # extract contexts, questions, and answers from the training set by calling the read_data function
val_contexts, val_questions, val_answers = read_data('dev-v1.1.json') # extract contexts, questions, and answers from the validation set by calling the read_data function

print('Example of context:', train_contexts[0], '\n') # print an example of context
print('Example of question:', train_questions[0], '\n') # print an example of question
print('Example of answer:', train_answers[0]) # print an example of answer

Example of context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. 

Example of question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? 

Example of answer: {'text': 'Saint Bernadette Soubirous', 'answer_start': 515}


From the results of the following windows, one can see that the SQuAD dataset is pretty big, consisting of 87599 training examples and 10570 validation examples. Thus, running the entire model with the complete dataset takes a great deal of time. For this reason, we used a small portion of the complete dataset while building the model. In particular, we adopted 10000 training samples and 1200 validation samples, thereby maintaining the original proportion (90:10) between the training and validation sets.

In [3]:
# Dataset sizes

# Evaluation of the dataset size
size_train_set = len(train_contexts) # size of the training set
size_val_set = len(val_contexts) # size of the validation set
print('Size of the training set:', size_train_set) # print the size of the training set
print('Size of the validation set', size_val_set, '\n') # print the size of the validation set

# To initailly build/evaluate the model, we used a small portion of the training and validation sets
n_sample_train = 10000 # number of training samples used to build the model
n_sample_val = 1200 # number of validation samples used to evaluate the model
train_contexts, train_questions, train_answers = train_contexts[:n_sample_train], train_questions[:n_sample_train], train_answers[:n_sample_train] # reduced versions of the train contexts, questions, and answers
val_contexts, val_questions, val_answers = val_contexts[:n_sample_val], val_questions[:n_sample_val], val_answers[:n_sample_val] # reduced versions of the validation contexts, questions, and answers

# Evaluation of the new dataset size
size_red_train_set = len(train_contexts) # new size of the reduced training set
size_red_val_set = len(val_contexts) # new size of the reduced validation set
print('Size of the reduced training set:', size_red_train_set) # print the new size of the reduced training set
print('Size of the reduced validation set', size_red_val_set) # print the new size of the reduced validation set


Size of the training set: 87599
Size of the validation set 10570 

Size of the reduced training set: 10000
Size of the reduced validation set 1200


By exploiting the function end_pos, we added the character position of the answer end in the corresponding context to all the answers in the reduced training and validation sets used to build and evaluate the model, respectively (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d.). 

In [4]:
# Add the chracter position of the answer end

# Function to add the character positions of the answer ends in the given contexts to the given answers 
# This function was built as already proposed by other authors (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d.)
def end_pos(answers, contexts):
    
    for answer, context in zip(answers, contexts): # for each answer and context received as inputs
        
        answer_text = answer['text'] # save the answer text
        start_pos_id = answer['answer_start'] # save the position of the answer start
        end_pos_id = start_pos_id + len(answer_text) # define the position of the answer end

        # Add the position of the answer end to each answer
        if context[start_pos_id : end_pos_id] == answer_text: # when the distance between the answer start and end is exactly equal to the answer text
            answer['answer_end'] = end_pos_id 
            
        elif context[start_pos_id -1 : end_pos_id - 1] == answer_text: # when the answer text is off by one character, also the position of the answer start needs to be adjusted
            answer['answer_start'] = start_pos_id - 1
            answer['answer_end'] = end_pos_id - 1  
            
        elif context[start_pos_id -2 : end_pos_id - 2] == answer_text: # when the answer text is off by two characters, also the position of the answer start needs to be adjusted
            answer['answer_start'] = start_pos_id - 2
            answer['answer_end'] = end_pos_id - 2  

# Add the character position of the answer end in the corresponding context to all the answers in the training and validation sets by calling the end_pos function
end_pos(train_answers, train_contexts)
end_pos(val_answers, val_contexts)

At this point, we applied the tokenizer to both the training/validation contexts and questions. Contexts and questions were encoded together as sequence pairs (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d.).

In [5]:
# Initialize the tokenizer and tokenize the data (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d)

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased') # define the tokenizer
train_encodings = tokenizer(train_contexts, train_questions, truncation = True, padding = True) # apply the tokenizer to both the training contexts and questions
val_encodings = tokenizer(val_contexts, val_questions, truncation = True, padding = True) # apply the tokenizer to both the validation contexts and questions

Next, we added the token positions of the answer start and end in the context to the encodings of both the training and validation sets. This was done through the token_pos function, which exploits the char_to_token function of DistilBertTokenizerFast (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d.). The latter function is in fact able to convert a character start/end position to a token start/end position.  

In [6]:
# From chart to token start/end positions in answers

# Function to add start and end token positions of all the given answers to the given encodings
# This function was built as already proposed by other authors (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d.)
def token_pos(encodings, answers):
    
    start_tok_pos = [] # define the list to save the token positions of the answer starts in the contexts
    end_tok_pos = [] # define the list to save the token positions of the answer ends in the contexts
    
    for i in range(len(answers)): # for each answer
        
        start_tok_pos.append(encodings.char_to_token(i, answers[i]['answer_start'])) # evaluate and append the token position of the answer start
        end_tok_pos.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1)) # evaluate and append the token position of the answer end
        
        # Token positions of the answer starts and ends are adjusted with truncation when they are None
        if start_tok_pos[-1] is None:
            
            start_tok_pos[-1] = tokenizer.model_max_length
            
        if end_tok_pos[-1] is None:
            
            end_tok_pos[-1] = tokenizer.model_max_length
            
    encodings.update({'start_positions': start_tok_pos, 'end_positions': end_tok_pos}) # token positions of the answer start and end are added to the encodings received in input

# Add the token positions of the answer start and end to the training and validation encodings by calling the function token_pos
token_pos(train_encodings, train_answers)
token_pos(val_encodings, val_answers)

Finally, we created the training and validation datasets through the SquadDataset class (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d.). 

In [7]:
# Function to define the SquadDataset class associated to the encodings received in input
# This function was built as already proposed by other authors (Apostolopoulou, n.d.; Fine-tuning with custom datasets, n.d.)
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

# Create the training and validation datasets to fine-tune the model through the SquadDataset class
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

# Initialization and fine-tuning of the selected DistilBERT pre-trained model

To fine-tune the model, we selected a batch size of 16 and a number of epoch of 3. For both the training and validation sets, we created Dataloaders to save memory during training. We chose to use the pre-trained DistilBERT model, while, as optimizer, we relied on Adam, which receive both the model parameters and the learning rate as inputs. We imposed a learning rate of 5e-5, which was proved to lead to good model results. Finally, we created a learning rate scheduler through the get_linear_schedule_with_warmup function. The training loop code was developed as proposed by other authors (Ding, 2023; McCormick and Ryan, 2019; Tran, n.d.). 

References:

Ding, S. (2023, January 3). [Fine Tune] Fine Tuning BERT for Sentiment Analysis. Medium. https://medium.com/@xiaohan_63326/fine-tune-fine-tuning-bert-for-sentiment-analysis-f5002b08f10a

McCormick, C., & Ryan, N. (2019, July 22). BERT Fine-Tuning Tutorial with PyTorch. Chris McCormick. https://mccormickml.com/2019/07/22/BERT-fine-tuning/

Tran, C. (n.d.). Tutorial: Fine tuning BERT for Sentiment Analysis. Skim AI. https://skimai.com/fine-tuning-bert-for-sentiment-analysis/

In [8]:
# Impose batch size and number of epochs
bs = 16
n_epochs = 3

# Create the DataLoader for the training set
train_loader = DataLoader(train_dataset, batch_size = bs, shuffle = True)
# Create the DataLoader for the validation set
val_loader = DataLoader(val_dataset, batch_size = bs, shuffle = False) # shuffle is not needed for evaluation

# Load the pre-trained DistilBERT model
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

# Define the optimizer and impose the learning rate (lr)
optimizer = torch.optim.Adam(model.parameters(), lr = 5e-5)

# Create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                           num_warmup_steps = 0, 
                                           num_training_steps = len(train_loader) * n_epochs)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The training loop was developed as proposed by McCormick and Ryan (2019). It contains two sections, one for training and the other one for validation. At the begining of each epoch, the model is put in training mode and training loss is set equal to zero. Then, for each step and batch in train_loader, five different operations are performed in sequence: clearing gradients, calculating model outputs, performing a backward pass, and updating both model parameters and learning rate. Finally, the average training loss is evaluated. After the completion of the training at each epoch, the model is put in evaluation loss and validation loss is reset to zero. Model outputs are now calculated on the validation set batch by batch in val_loader, before determining the average validation loss. Average training and validation losses for all the epochs were saved in a dataset used later for further evaluations, while the trained model was saved at end of each epoch and after all the epochs. Based on the comparison between average training and validation loss, we determined that is the right number of epochs to prevent both underfitting and overfitting. Thus, we used the model saved after two epochs for further evaluations in the following.

In [9]:
# Training loop

# Code developed as done by McCormick and Ryan (2019)

device = torch.device('cpu') # torch uses the CPU
summary_training_loop = [] # list to save average training and validation losses at the end of each epoch

for epoch in range(n_epochs): # for each epoch
    
    print('Epoch {:}/{:}'.format(epoch + 1, n_epochs)) # print the number of the epoch which is being run
    train_loss = 0 # reset the training loss to zero
    model.train() # put the model in training mode
    
    for step, batch in enumerate(train_loader): # for each step and batch in train_Loader

        model.zero_grad() # clear gradients 
        
        input_ids = batch['input_ids'].to(device) # select the input_ids to use
        attention_mask = batch['attention_mask'].to(device) # select the attention_mask to use
        start_positions = batch['start_positions'].to(device) # select the stat_positions to use
        end_positions = batch['end_positions'].to(device) # select the end_positions to use
        
        # Calculate the model outputs associated to the training set
        outputs = model(input_ids, attention_mask = attention_mask, start_positions = start_positions, end_positions = end_positions)
        
        loss = outputs[0] # extract loss from outputs
        train_loss += loss.item() # sum training loss batch by batch
        loss.backward() # perform a backward pass
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # norm of the gradients to one
        optimizer.step() # update parameters
        scheduler.step() # update learning rate
        
    train_loss_avg = train_loss / len(train_loader) # calculate the average training loss considering all the batches
    print('Average training loss:', round(train_loss_avg, 3)) # print the average training loss
    
    val_loss = 0 # for each epoch reset the validation loss to zero
    model.eval() # put the model in evaluation mode
    
    for batch in val_loader: # for each batch in val_loader

        with torch.no_grad(): # without computing graph
            
            input_ids = batch['input_ids'].to(device) # select the input_ids to use
            attention_mask = batch['attention_mask'].to(device) # select the attention_mask to use
            start_positions = batch['start_positions'].to(device) # select the stat_positions to use
            end_positions = batch['end_positions'].to(device) # select the end_positions to use
            
            # Calculate the model outputs associated to the validation set
            outputs = model(input_ids, attention_mask = attention_mask, start_positions = start_positions, end_positions = end_positions)
        
        loss = outputs[0] # extract loss from outputs
        val_loss += loss.item() # sum validation loss batch by batch
        
    val_loss_avg = val_loss / len(val_loader) # calculate the average validation loss considering all the batches
    print('Average validation loss:', round(val_loss_avg, 3), '\n') # print the average validation loss
    summary_training_loop.append({'epoch': epoch + 1, 'avg_training_loss': round(train_loss_avg, 3), 'avg_validation_loss': round(val_loss_avg, 3)}) # save the epoch number, together with the corresponding average training loss and validation loss
    # Save the model after each epoch
    torch.save(model, f"distilbert-epoch-{epoch + 1}.pth")
    
# Save the entire model as a single file
torch.save(model, "distilbert-final.pth")
# Covert the summary_training_loop list to dataframe
df_summary = pd.DataFrame(summary_training_loop) 
df_summary # print df_summary

Epoch 1/3
Average training loss: 2.188
Average validation loss: 1.47 

Epoch 2/3
Average training loss: 1.038
Average validation loss: 1.468 

Epoch 3/3
Average training loss: 0.62
Average validation loss: 1.52 



Unnamed: 0,epoch,avg_training_loss,avg_validation_loss
0,1,2.188,1.47
1,2,1.038,1.468
2,3,0.62,1.52


# Model evaluation

In this section, we used the function evaluate_model to quantify the goodness of the predictions provided by the trained model on the validation set. To this purpose, we adopted two different metrics: exact match and F1-score. The exact match tells us which is the percentage of answers for which the true and predicted start/end positions are exactly the same. The F1-score is instead a combination of both precision and recall, and is different from zero only when the sum of precision and recall is different from zero. 

In [18]:
# Model evaluation

# Function to calculate the exact match and F1-score of the final model on a given dataset
# This function was built similarly to other authors (Apostolopoulou, n.d.)
def evaluate_model(model, dataloader, device):
    
    exact_match = 0 # initialize exact match
    f1_score = 0 # initialize F1_score
    total_samples = 0 # initialize the total number of samples
    model.eval() # put the model in evaluation mode

    for batch in dataloader: # for each batch
        
        with torch.no_grad(): # without computing graph
            
            input_ids = batch['input_ids'].to(device) # select the input_ids to use
            attention_mask = batch['attention_mask'].to(device) # select the attention_mask to use
            start_positions = batch['start_positions'].to(device) # select the stat_positions to use
            end_positions = batch['end_positions'].to(device) # select the end_positions to use
            
            # Calculate the model outputs associated to the validation set
            outputs = model(input_ids, attention_mask = attention_mask)
            
            predicted_start = torch.argmax(outputs.start_logits, dim = 1) # evaluate where the answer starts based on the calculated logits
            predicted_end = torch.argmax(outputs.end_logits, dim = 1) # evaluate where the answer ends based on the calculated logits
            
            for jj in range(len(batch)): # for all the batches
                true_start = start_positions[jj].item() # define the true start position
                true_end = end_positions[jj].item() # define the true end position
                predicted_start_id = predicted_start[jj].item() # define the predicted start position
                predicted_end_id = predicted_end[jj].item() # define the predicted end position
                
                # Evaluate the exact match; if both the true and predicted start/end positions match, update the exact match
                if true_start == predicted_start_id and true_end == predicted_end_id:
                    exact_match += 1
                    
                # Evaluate the F1-score
                common = set(range(predicted_start_id, predicted_end_id +1)).intersection(set(range(true_start, true_end + 1))) # return the intersection of the two sets
                if predicted_end_id != predicted_start_id: # get precision 
                    precision = len(common) / (predicted_end_id - predicted_start_id + 1) # if predicted end and start positions to do not coincide, precision != 0
                else:
                    precision = 0 # else precision = 0
                recall = len(common) / (true_end - true_start + 1) if true_end != true_start else 0 # get recall, which is != 0 if true and and start positions do not coincide
                f1 = (2 * precision * recall) / (precision + recall) if precision + recall != 0 else 0 # calculate f1-score as a combination of precision and recall; F1_score != 0, if precision + recall != 0
                f1_score += f1 
                
            total_samples += len(batch) # update the number of samples

    # Calculate average metrics
    avg_exact_match = (exact_match / total_samples) * 100
    avg_f1_score = (f1_score / total_samples) * 100

    return avg_exact_match, avg_f1_score # return the average exact match and F1-score

# Evaluation
model_name = 'distilbert-epoch-2.pth' # select the model to use for evaluation
model = torch.load(model_name) # load the model
model.to(device) # move the model to the device
avg_exact_match, avg_f1_score = evaluate_model(model, val_loader, device) # evaluate the model exact match and F1-score
# Print the model exact match and F1-score
print('Exact Match:', round(avg_exact_match, 3))
print('F1 Score:', round(avg_f1_score, 3))

Exact Match: 54.0
F1 Score: 44.573


Function get_best_answer allows one to test the capability of the trained model to answer different kind of questions, once the context is provided. 

In [19]:
# Get answers from the model

# Function to get the best answer from model's output
# This function was built similarly to other authors (Apostolopoulou, n.d.)
def get_best_answer(contexts, questions, model, tokenizer, device):
    
    best_answers = [] # list to save the best answers
    model.eval() # put the model in evaluation mode

    for i in range(len(contexts)): # for each context (and question) received in input
        
        input_dict = tokenizer(
            contexts[i],
            questions[i],
            padding = 'max_length',
            max_length = 256,
            truncation = True,
            return_tensors = "pt",
        ) # tokenize both the context and question 

        with torch.no_grad():
            
            # Ensure all inputs are on the same device as the model
            input_dict = {key: value.to(device) for key, value in input_dict.items()}
            outputs = model(** input_dict)

        # Evaluate the start and end positions of the predicted answer in the context
        start_logits = outputs.start_logits 
        end_logits = outputs.end_logits

        start_id = torch.argmax(start_logits, dim = 1).item()
        end_id = torch.argmax(end_logits, dim = 1).item()

        # Define the complete answer, which is decoded starting from the answer tokens
        answer_tokens = input_dict['input_ids'][0][start_id : end_id + 1]
        answer = tokenizer.decode(answer_tokens, skip_special_tokens = True)

        best_answers.append(answer) # save the best answer

    return best_answers # return the best responses for all the received contexts and questions

# Test contexts and questions
test_contexts = [
    "The Eiffel Tower is a famous landmark in Paris, France.",
    "Mount Everest is the tallest mountain in the world.",
    "The Great Wall of China is a historic fortification in China.", 
    "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
    "The Panthers finished the regular season with a 15\u20131 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP). They defeated the Arizona Cardinals 49\u201315 in the NFC Championship Game and advanced to their second Super Bowl appearance since the franchise was founded in 1995. The Broncos finished the regular season with a 12\u20134 record, and denied the New England Patriots a chance to defend their title from Super Bowl XLIX by defeating them 20\u201318 in the AFC Championship Game. They joined the Patriots, Dallas Cowboys, and Pittsburgh Steelers as one of four teams that have made eight appearances in the Super Bowl.",
    "The Broncos took an early lead in Super Bowl 50 and never trailed. Newton was limited by Denver's defense, which sacked him seven times and forced him into three turnovers, including a fumble which they recovered for a touchdown. Denver linebacker Von Miller was named Super Bowl MVP, recording five solo tackles, 2\u00bd sacks, and two forced fumbles."
]

test_questions = [
    "What city is the Eiffel Tower located in?",
    "How tall is Mount Everest?",
    "Where is the Great Wall of China located?", 
    "Which NFL team represented the AFC at Super Bowl 50?",
    "Which Carolina Panthers player was named Most Valuable Player?",
    "Who was the Super Bowl 50 MVP?"
]

# Call the get_best_answer function with the model and tokenizer
best_answers = get_best_answer(test_contexts, test_questions, model, tokenizer, device)
print(best_answers)

['paris, france', 'tallest', 'china', 'denver broncos', 'cam newton', 'von miller']


# Conclusion

In this project, we have successfully developed a Question Answering System using DistilBERT model. We began by overcoming challenges related to data preprocessing, including extracting contexts, questions, and answers from the SQuAD dataset. We implemented functions to ensure that answers align with their respective contexts. Our choice of the DistilBERT model proved to be efficient and effective. DistilBERT balances model size and performance, making ot suitable for various applications. We fine-tuned the model of the SQuAD dataset, enabling it to predict the start and end tokens of answers given a context and a question. 

The evaluation results were promising, with am Exact Match accuracy of 53.33% and an F1-score of 44.41%. These metrics demonstrate the model's ability to provide reasonably accurate answers to diverse questions. 

In the future, several options exist for improving and scaling the system. Increasing the dataset size, implementing augmentation techniques, tuning the hyperparameters, and using of larger models can further enhance the system's performance, Additionally, deploying the model as an API for real-time question-answering in various applications is a practical next step. This project exemplified the potential of transformer-based models in natural language understanding tasks and their relevance in real-world applications. It is an exciting journey into Question Answering with the power of DistilBERT. 

# Model as a service

In this section, we tested the capability of the model to provide users with an answer without a context. This is possible as the asked question is already present among the training/validation questions.

In [20]:
import numpy as np

# Combine the training vand validation contexts
contexts = train_contexts.copy()
contexts.extend(val_contexts)
# Combine the training vand validation questions
questions = train_questions.copy()
questions.extend(val_questions)

# Create a DataFrame and save the DataFrame to a pickle file
df = pd.DataFrame(np.column_stack([questions, contexts]), columns = ['question', 'context'])
df.to_pickle('squad.pkl')

# Function similar to get_best_answer to evaluate the answer to a question
def get_answer(context, question, model, tokenizer, device):
       
    input_dict = tokenizer(
        context,
        question,
        padding = 'max_length',
        max_length = 256,
        truncation = True,
        return_tensors = "pt",
    ) # tokenize both the context and question 

    with torch.no_grad():
            
        # Ensure all inputs are on the same device as the model
        input_dict = {key: value.to(device) for key, value in input_dict.items()}
        outputs = model(** input_dict)

    # Evaluate the start and end positions of the predicted answer in the context
    start_logits = outputs.start_logits 
    end_logits = outputs.end_logits

    start_id = torch.argmax(start_logits, dim = 1).item()
    end_id = torch.argmax(end_logits, dim = 1).item()

    # Define the complete answer, which is decoded starting from the answer tokens
    answer_tokens = input_dict['input_ids'][0][start_id : end_id + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens = True)

    return answer # return the answer

# Test Question
question = 'When did the Scholastic Magazine of Notre dame begin publishing?'
df_selected = df[df['question'].str.contains(question )]
context = ''
if len(df_selected) > 0:
    context = df_selected.iloc[0].context
answer = get_answer(context, question, model, tokenizer, device)
answer

'september 1876'

In [21]:
# Save both the model and tokenizer in a specific format to create an appropriate user interface
model_directory = './my_model_directory'
model.save_pretrained(model_directory)
tokenizer.save_pretrained(model_directory)

('./my_model_directory/tokenizer_config.json',
 './my_model_directory/special_tokens_map.json',
 './my_model_directory/vocab.txt',
 './my_model_directory/added_tokens.json',
 './my_model_directory/tokenizer.json')