# **Building a Question Answering System Using BART Base on the SQuAD Dataset**

In this project, we will build a Question Answering System using the BART model based on the SQuAD (Stanford Question Answering Dataset) dataset. The BART model is a powerful tool for natural language understanding and generation, and we will utilize it to answer questions based on the provided SQuAD dataset.

## **Project Overview**

- **Goal**: Develop a Question Answering System using BART.
- **Dataset**: We will use the SQuAD dataset, a popular question-answering dataset.
- **Methodology**: We will fine-tune the BART model on the SQuAD dataset and deploy it for answering user questions.
- **Tools**: Google Colab, Python, PyTorch, Hugging Face Transformers library.

## **About BART**

The BART model was proposed in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct 2019.

* According to the abstract, BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT).

* The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.

* BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.
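The two noising operations described above can be illustrated with a toy sketch. This is not BART's pretraining code: the function names and the hand-picked spans are assumptions made for illustration (the paper samples span lengths from a Poisson distribution rather than fixing them).

```python
import random

MASK = "<mask>"

def permute_sentences(sentences, rng):
    """Return the sentences in a random order (sentence permutation)."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def infill_spans(tokens, spans):
    """Replace each half-open (start, end) token span with ONE mask token.

    Spans must be non-overlapping and sorted by start index.
    """
    out, cursor = [], 0
    for start, end in spans:
        out.extend(tokens[cursor:start])  # keep tokens before the span
        out.append(MASK)                  # the whole span collapses to one mask
        cursor = end
    out.extend(tokens[cursor:])
    return out

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted = infill_spans(tokens, [(1, 3), (6, 8)])
print(corrupted)
# ['the', '<mask>', 'fox', 'jumps', 'over', '<mask>', 'dog']

print(permute_sentences(["First.", "Second.", "Third."], random.Random(0)))
```

Because a span of any length becomes a single mask token, the model has to predict how many tokens are missing as well as which ones.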

In [1]:
# Install the Transformers library
! pip install transformers datasets evaluate

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

In [2]:
import warnings
warnings.simplefilter("ignore")

## **Load SQuAD Dataset**

* Load dataset from the source: https://rajpurkar.github.io/SQuAD-explorer/

In [3]:
# Load SQuAD Dataset
# We use the 'datasets' library to load the Stanford Question Answering Dataset (SQuAD).
# 'split="train[:50]"' loads only the first 50 examples of the training split.
# This small subset keeps the run fast; it is meant for experimentation rather than for training a strong model.

from datasets import load_dataset

squad = load_dataset("squad", split="train[:50]")

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [4]:
# Splitting the SQuAD Dataset
# We split the loaded SQuAD subset into training and test sets using an 80-20 ratio.
# The 'train_test_split' function with 'test_size=0.2' parameter ensures that 20% of the data is allocated for testing,
# while the remaining 80% is retained for training our Question Answering System.
# The 'seed=2' parameter ensures that the random splitting process is fixed, making the split reproducible.
squad = squad.train_test_split(test_size=0.2, seed=2)
squad["train"][2]

{'id': '5733a7bd4776f41900660f6a',
 'title': 'University_of_Notre_Dame',
 'context': 'The university first offered graduate degrees, in the form of a Master of Arts (MA), in the 1854–1855 academic year. The program expanded to include Master of Laws (LL.M.) and Master of Civil Engineering in its early stages of growth, before a formal graduate school education was developed with a thesis not required to receive the degrees. This changed in 1924 with formal requirements developed for graduate degrees, including offering Doctorate (PhD) degrees. Today each of the five colleges offer graduate education. Most of the departments from the College of Arts and Letters offer PhD programs, while a professional Master of Divinity (M.Div.) program also exists. All of the departments in the College of Science offer PhD programs, except for the Department of Pre-Professional Studies. The School of Architecture offers a Master of Architecture, while each of the departments of the College of Engineeri
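Each SQuAD record stores its answer as text plus a character offset into `context`. The sketch below shows how those two fields define the answer span. The record is a hypothetical one written in the SQuAD field layout, not taken from the dataset, and `answer_char_span` is an illustrative helper name.

```python
# Hypothetical record using the same field names as SQuAD v1.1.
context = "The university first offered graduate degrees in the 1854-1855 academic year."
answer_text = "1854-1855"

example = {
    "question": "When did the university first offer graduate degrees?",
    "context": context,
    "answers": {
        "text": [answer_text],
        # Character index where the answer begins inside the context.
        "answer_start": [context.index(answer_text)],
    },
}

def answer_char_span(example):
    """Return the half-open character span (start, end) of the first answer."""
    start = example["answers"]["answer_start"][0]
    end = start + len(example["answers"]["text"][0])
    return start, end

start_char, end_char = answer_char_span(example)
print(example["context"][start_char:end_char])  # 1854-1855
```

Slicing the context with this span should always reproduce the answer text, which makes it a quick sanity check before tokenization.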

## **Fine-Tuning a Custom Question-Answering BART Model**

* This section walks through fine-tuning a question-answering BART model using the Hugging Face Transformers library.

* Key Steps in This Example:

    * **Data Preparation**: We use a small subset of SQuAD with contexts, questions, and answers.
    * **Model Initialization:** We initialize a BART model and tokenizer.
    * **Data Tokenization:** We tokenize the data and define labels for start and end positions.
    * **Custom Dataset:** We create a custom dataset class to manage the input data.
    * **Fine-Tuning:** We fine-tune the BART model on the custom dataset (simplified).
    * **Saving and Loading:** We save the fine-tuned model and load it for inference.
    * **Testing:** We test the custom model with a sample question and context.

In [5]:
# Importing Required Libraries
# We are importing the essential libraries needed for building and training a Question Answering System.
# - 'AutoTokenizer' is used for tokenizing text data.
# - 'AutoModelForQuestionAnswering' is the model architecture designed for question answering tasks.
# - 'TrainingArguments' is used to configure the training process.
# - 'Trainer' is used for training machine learning models.
# - 'torch' is the PyTorch library for deep learning.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer
import torch


In [6]:
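# Load the BART tokenizer.
# Note: this checkpoint is BART-Large; use "facebook/bart-base" for the smaller
# BART-Base model, which needs far less GPU memory.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

# Load BART with a question-answering head (the head's weights are newly initialized).
model = AutoModelForQuestionAnswering.from_pretrained("facebook/bart-large")

# BART's tokenizer already defines a padding token ("<pad>"), so no override is needed.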

Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-large and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### **Data Preprocessing for BART-Based Question Answering Model**

* This section of the notebook contains a data preprocessing function designed for training a BART-based Question Answering model. The function handles various data preprocessing tasks, including tokenization, offset mapping, and determining answer positions within tokenized sequences. It ensures that the input data is properly formatted and ready for training the model.

* The code is structured to prepare the dataset by tokenizing text, extracting answers, and aligning them with the corresponding token positions. Additionally, it includes progress tracking with print statements to monitor the processing of examples.

* This data preprocessing step is crucial for training a Question Answering model that can effectively respond to user queries.
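The alignment step is the least obvious part of preprocessing, so here is a minimal sketch of it in isolation. It assumes a toy whitespace tokenizer in place of the real BART tokenizer; `whitespace_offsets` and `char_span_to_token_span` are illustrative names, not library functions.

```python
import re

def whitespace_offsets(text):
    """Toy stand-in for a tokenizer's offset mapping: one (start, end) per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices.

    Returns (0, 0) when the span is not fully covered by the tokens,
    for example because the context was truncated.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return 0, 0
    # Last token that starts at or before the answer start.
    start_tok = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the answer end.
    end_tok = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_tok, end_tok

context = "BART was proposed by Lewis et al. in 2019"
offsets = whitespace_offsets(context)
start_char = context.index("Lewis")
end_char = start_char + len("Lewis et al.")

print(char_span_to_token_span(offsets, start_char, end_char))      # (4, 6)
# Simulate truncation: drop the tokens that contain the answer.
print(char_span_to_token_span(offsets[:4], start_char, end_char))  # (0, 0)
```

With a real tokenizer the offsets come from `return_offsets_mapping=True`, and only the tokens belonging to the context (sequence id 1) are searched.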

In [7]:
def preprocess_function(examples):
    # Extract and clean questions
    questions = [q.strip() for q in examples["question"]]

    # Tokenize questions and context
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=256,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Extract offset mappings from the inputs
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    # Add a print statement to display the total number of examples
    print(f"Total examples: {len(offset_mapping)}")

    for i, offset in enumerate(offset_mapping):
        # Add another print statement to show the processing progress
        print(f"Processing example {i + 1}/{len(offset_mapping)}")
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        idx = 0
        while idx < len(sequence_ids) and sequence_ids[idx] != 1:
            idx += 1
        context_start = idx

        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # Add start and end positions to the inputs
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs


In [8]:
# Tokenize and Preprocess the SQuAD Dataset
# We are using the 'map' function from the 'datasets' library to preprocess and tokenize the SQuAD dataset.
# - 'preprocess_function' is a user-defined function that prepares the dataset for training a Question Answering model.
# - 'batched=True' indicates that the mapping should be applied to batches of data for efficiency.
# - 'remove_columns=squad["train"].column_names' removes unnecessary columns in the processed dataset.
# The resulting 'tokenized_squad' dataset will be ready for use in training a BART-based Question Answering model.
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)


Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Total examples: 40
Processing example 1/40
Processing example 2/40
...
Processing example 40/40

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Total examples: 10
Processing example 1/10
Processing example 2/10
Processing example 3/10
Processing example 4/10
Processing example 5/10
Processing example 6/10
Processing example 7/10
Processing example 8/10
Processing example 9/10
Processing example 10/10


#### **Set Hugging Face Variables**

In [9]:
# Setting Hugging Face Environment Variables
# 'HF_HOME' is the root directory where Hugging Face libraries keep their cache and stored credentials.
# It only takes effect if it is set before 'transformers' or 'huggingface_hub' is imported.
import os

os.environ["HF_HOME"] = "/root/.huggingface"

# Do not embed an access token in this path or anywhere else in the notebook.
# If authentication is needed (for example, to push the model to the Hub),
# call 'huggingface_hub.notebook_login()' and paste the token at the prompt.


### **Start Model Training**
* This section marks the beginning of the model training process. The code executed under this heading will train the BART-Base model for question answering.
* During training, the model learns from the provided training data to improve its ability to answer questions accurately. It's a critical step in the development of a question-answering system, and the model's performance will be refined over multiple training epochs.
* The training process involves adjusting the model's weights and parameters to minimize the error in answering questions, ultimately leading to improved accuracy and effectiveness.
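The "error" being minimized can be made concrete. For extractive question answering, the model emits one start logit and one end logit per token, and the loss is the average of the cross-entropy on the start position and on the end position. The sketch below reimplements that calculation in plain Python for a single example; the function names are illustrative, and the real computation runs on batched PyTorch tensors.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

def span_loss(start_logits, end_logits, start_pos, end_pos):
    """Average of the start-position and end-position cross-entropies."""
    return 0.5 * (
        cross_entropy(start_logits, start_pos)
        + cross_entropy(end_logits, end_pos)
    )

# A model that is confident and correct gets a small loss.
good = span_loss([0.0, 6.0, 0.0, 0.0], [0.0, 0.0, 6.0, 0.0], 1, 2)
# A model with no preference (uniform logits) gets log(sequence_length).
uniform = span_loss([0.0] * 4, [0.0] * 4, 1, 2)
print(round(good, 4), round(uniform, 4))
```

As a reference point, uniform logits over a 256-token input give a loss of ln 256, roughly 5.55, so a loss near that value means the model has learned very little.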

In [10]:
from transformers import DefaultDataCollator
# Initializing Data Collator
# We are importing the 'DefaultDataCollator' from the 'transformers' library, which is used to collate and preprocess training data.
# 'DefaultDataCollator' helps prepare input data for the BART-based Question Answering model during training.
data_collator = DefaultDataCollator()

In [13]:
# The Trainer's PyTorch backend requires the 'accelerate' package.
# Restart the runtime after installing so that 'transformers' picks it up.
!pip install accelerate -U
!pip install transformers[torch]

Collecting accelerate
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.24.1


In [14]:
# Training Configuration for BART-Base Model
# We are defining training arguments for training a BART-base model for question answering.
# - 'output_dir' specifies the directory to save model checkpoints and output.
# - 'evaluation_strategy' sets the evaluation frequency to "epoch."
# - 'learning_rate' defines the initial learning rate for the optimizer.
# - 'per_device_train_batch_size' and 'per_device_eval_batch_size' set the batch sizes for training and evaluation.
# - 'num_train_epochs' determines the number of training epochs.
# - 'weight_decay' controls weight decay regularization for the optimizer.
training_args = TrainingArguments(
    output_dir="./custom_qa_model",
    evaluation_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    save_total_limit=2,
    logging_dir="./logs",
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=10,
)


ImportError: ignored
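To see how short this training run is, it helps to work out the number of optimizer steps implied by the configuration. The sketch below is an approximation that assumes a single device; `total_optimizer_steps` is an illustrative helper, and the `grad_accum` parameter stands in for `gradient_accumulation_steps`, which this notebook leaves at its default of 1.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Approximate number of weight updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

# 40 training examples, batch size 8, 3 epochs.
print(total_optimizer_steps(40, 8, 3))  # 15
```

With `logging_steps=10`, a 15-step run reports the training loss only once during training. If GPU memory is the limiting factor, a smaller `per_device_train_batch_size` combined with a larger `gradient_accumulation_steps` keeps the effective batch size while lowering peak memory.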

In [40]:
# Model Training Configuration
# We are initializing a Trainer for training the BART-Base model for question answering.
# - 'model' is the BART-Base model architecture.
# - 'training_args' contains the training configuration defined previously.
# - 'train_dataset' and 'eval_dataset' are the training and evaluation datasets, respectively.
# - 'tokenizer' is used for tokenization.
# - 'data_collator' is the data collator for preprocessing training data.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)


OutOfMemoryError: ignored

In [None]:
# Start Model Training
# This line initiates the training process for the BART-Base model using the configured Trainer.
trainer.train()

OutOfMemoryError: ignored

In [None]:
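# Save the fine-tuned model and its tokenizer so both can be reloaded for inference
model.save_pretrained("custom_qa_model_large_cnn")
tokenizer.save_pretrained("custom_qa_model_large_cnn")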

## **Model Evaluation**

* This section evaluates the trained BART model on the held-out test split. The code sets up a data collator and evaluation arguments, runs `Trainer.evaluate` on the tokenized test set, and prints the results. Because no `compute_metrics` function is passed to the `Trainer`, the output contains only the evaluation loss and runtime statistics, not exact-match or F1 scores. Make sure the model and tokenizer are the fine-tuned ones, and adjust the batch size and output directory as needed.
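Judging answer quality requires turning the model's start and end logits into a predicted span. A common approach, sketched here under simplifying assumptions, is to pick the pair of positions with the highest combined score, subject to the end not preceding the start and the span not exceeding a maximum length. The function name and the `max_answer_len` default are illustrative choices, and the logits are made-up numbers.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) for the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    `max_answer_len` tokens. Indices are inclusive.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits = [0.0, 0.5, 1.5, 3.0]

print(best_span(start_logits, end_logits))                    # ((1, 3), 5.0)
print(best_span(start_logits, end_logits, max_answer_len=2))  # ((1, 2), 3.5)
```

The predicted token span is then mapped back to text, for example through the tokenizer's offset mapping, before it is compared with the reference answer.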

In [None]:
# Define your data collator
data_collator = DefaultDataCollator()

# Define evaluation arguments
evaluation_args = TrainingArguments(
    per_device_eval_batch_size=16,  # Adjust batch size for evaluation if needed
    output_dir="./evaluation_results",  # Specify an output directory for evaluation results
)

# Create a Trainer for evaluation
eval_trainer = Trainer(
    model=model,
    args=evaluation_args,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Evaluate the model on the test dataset
eval_results = eval_trainer.evaluate(tokenized_squad["test"])

# Print the evaluation results
print(eval_results)

{'eval_loss': 4.514838218688965, 'eval_runtime': 41.704, 'eval_samples_per_second': 0.959, 'eval_steps_per_second': 0.072}
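The evaluation loss alone does not say how often the model's answers are right. SQuAD is usually scored with exact match (EM) and token-level F1 after light text normalization. The sketch below is a simplified reimplementation of those two metrics for a single prediction and a single reference; it is not the official scoring script, and the `squad` metric in the `evaluate` library installed earlier is the more complete option.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1.0
print(f1_score("in Paris France", "Paris"))              # 0.5
```

F1 gives partial credit when the predicted span overlaps the reference, which makes it more informative than EM on a small test set like this one.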
