# Question Answering using DistilBERT on Custom SQuAD Data  

This project fine-tunes a **DistilBERT-based Question Answering model** on custom medical-style QA data in SQuAD format.  
The pipeline involves preprocessing JSON files, converting them into SQuAD-style datasets, training the model using **SimpleTransformers**, and evaluating predictions with **SQuAD metrics (Exact Match & F1 score)**.  
The model can answer patient-related questions based on provided contexts.  


In [14]:
!pip install numpy
!pip install pandas



In [None]:
import numpy as np # linear algebra
import pandas as pd

In [2]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.70.5-py3-none-any.whl.metadata (43 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/43.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.3/43.3 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/43.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardx-2.6.4-py3-none-any.whl.metadata (6.2 kB)
Collecting streamlit (from simpletransformers)
  Downloading streamlit-1.49.1-py3-none-any.whl.metadata (9.5 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit->sim

### Library Imports  

- **torch, DataLoader, TensorDataset** → Used for handling tensors and batching data efficiently.  
- **transformers (DistilBertTokenizer)** → Provides the tokenizer to preprocess text for DistilBERT.  
- **pandas** → Used for handling tabular data (questions, contexts, answers).  
- **json** → For reading and parsing the SQuAD-style JSON dataset.  


In [20]:
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import DistilBertTokenizer
import pandas as pd
import json

### Data Extraction from JSON  

- The **test.json** file (SQuAD-style dataset) is loaded using the `json` library.  
- The script extracts three main components:  
  - **Questions (qns)** → from `qa['question']`  
  - **Contexts (context)** → from `paragraph['context']`  
  - **Answers (ans)** → first answer text if available, otherwise an empty string  
- These extracted values are stored in a **Pandas DataFrame** with columns: `question`, `context`, and `answer`, which will later be used for model training and evaluation.  


In [24]:
with open('/content/test.json', 'r') as file:
    data = json.load(file)
qns = []
context = []
ans= []
for document in data['data']:
  for paragraph in document['paragraphs']:
        for qa in paragraph['qas']:
            qns.append(qa['question'])
            context.append(paragraph['context'])
            if qa['answers']:
                ans.append(qa['answers'][0]['text'])
            else:
                ans.append("")

data = pd.DataFrame({
    'question': qns,
    'context' : context,
    'answer' : ans
})

### Converting Data to SQuAD Format  

- A helper function `convert_to_squad_format` is defined to transform each row of the DataFrame into **SQuAD-style JSON**.  
- For each row:  
  - **`id`** → Unique identifier (row index).  
  - **`is_impossible`** → Set to `True` if the answer is missing (`None`), else `False`.  
  - **`question`** → The extracted question text.  
  - **`answers`** → Contains the answer text and its starting character index within the context.  
- The function returns a dictionary with two keys:  
  - **`context`** → The passage from which the answer is derived.  
  - **`qas`** → A list of question-answer pairs following SQuAD format.  

This ensures the dataset is compatible with **transformer-based QA models**.  


In [25]:
def convert_to_squad_format(row):
    qas = [
        {
            "id": row.name,
            "is_impossible": row['answer'] is None,
            "question": row['question'],
            "answers": [{"text": str(row['answer']), "answer_start": str(row['context']).find(str(row['answer']))}]
        }
    ]
    return {"context": str(row['context']), "qas": qas}



In [26]:
train_data = data.apply(convert_to_squad_format, axis=1).tolist()
# test data
with open('/content/test.json', 'r') as file:
    data = json.load(file)
qns2 = []
context2 = []
ans2= []
for document in data['data']:
  for paragraph in document['paragraphs']:
        for qa in paragraph['qas']:
            qns2.append(qa['question'])
            context2.append(paragraph['context'])
            if qa['answers']:
                ans2.append(qa['answers'][0]['text'])
            else:
                ans2.append("")

data2 = pd.DataFrame({
    'question': qns2,
    'context' : context2,
    'answer' : ans2
})
test_data = data2.apply(convert_to_squad_format, axis=1).tolist()

### Preparing Training and Test Data  

- **Training Data**:  
  - The earlier DataFrame (`data`) is converted into **SQuAD-style dictionaries** using the `convert_to_squad_format` function.  
  - The resulting list (`train_data`) is used for model training.  

- **Test Data**:  
  - The `test.json` file is reloaded and parsed.  
  - Questions, contexts, and answers are again extracted into a new DataFrame (`data2`).  
  - This DataFrame is also converted into **SQuAD format** (`test_data`) for evaluation.  

This step ensures that **both training and testing datasets share the same structure**, making them directly compatible with the QA model.  


In [7]:
import pandas as pd
from simpletransformers.question_answering import QuestionAnsweringModel
model_type = "distilbert"
model_name = "distilbert-base-cased-distilled-squad"
model_args = {"train_batch_size": 16, "num_train_epochs": 20, "evaluate_during_training": True}
model = QuestionAnsweringModel(model_type, model_name, args=model_args,use_cuda=False)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [8]:
# training the model on train dataset
model.train_model(train_data, eval_data=test_data, output_dir="output/")

convert squad examples to features: 100%|██████████| 16/16 [00:16<00:00,  1.03s/it]
add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 155344.59it/s]


Epoch:   0%|          | 0/20 [00:00<?, ?it/s]

Running Epoch 1 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 70.80it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 135573.46it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 2 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 17.70it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 107892.06it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 3 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 76.14it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 141579.88it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 4 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 32.75it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 131072.00it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 5 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 61.50it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 128070.35it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 6 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 72.39it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 112788.01it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 7 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 76.61it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 129804.38it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 8 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 72.90it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 67041.82it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 9 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 72.15it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 139230.01it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 10 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 72.76it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 93466.38it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 11 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 42.01it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 154985.83it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 12 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 70.53it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 110923.74it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 13 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 73.20it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 125203.10it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 14 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 73.74it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 109476.12it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 15 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 70.55it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 136956.87it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 16 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 72.34it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 125203.10it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 17 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 69.64it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 107546.26it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 18 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 73.64it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 138084.08it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 19 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 70.07it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 130308.47it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 20 of 20:   0%|          | 0/3 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/16 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 73.94it/s]

add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 136123.46it/s]


Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

(60,
 {'global_step': [3,
   6,
   9,
   12,
   15,
   18,
   21,
   24,
   27,
   30,
   33,
   36,
   39,
   42,
   45,
   48,
   51,
   54,
   57,
   60],
  'correct': [4,
   7,
   9,
   12,
   12,
   13,
   13,
   14,
   14,
   14,
   14,
   14,
   14,
   14,
   14,
   14,
   14,
   14,
   14,
   14],
  'similar': [7, 9, 6, 4, 4, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
  'incorrect': [5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'train_loss': [4.373349189758301,
   0.7357711791992188,
   0.4937332272529602,
   1.0441111326217651,
   0.8098746538162231,
   0.004032487981021404,
   0.006327128037810326,
   0.03318631649017334,
   0.02557097189128399,
   0.007465791888535023,
   0.0007820134051144123,
   0.002122326288372278,
   0.0007244408479891717,
   5.7841385569190606e-05,
   4.526487464318052e-05,
   0.00013446196680888534,
   0.0002502153511159122,
   0.000166363111929968,
   0.0005404535331763327,
   0.0003655635518953204],
  'eval_loss': [-8.1912622451

### Generating Predictions  

- The fine-tuned QA model is used to make predictions on the **test dataset**.  
- `model.predict(test_data)` returns:  
  - **`predictions`** → A list of predicted answers for each question.  
  - **`raw_outputs`** → The raw model outputs (logits) before post-processing.  

These predictions will later be compared against the ground truth answers to evaluate model performance.  


In [9]:
predictions, raw_outputs = model.predict(test_data)

convert squad examples to features: 100%|██████████| 16/16 [00:00<00:00, 67.65it/s]
add example index and unique id: 100%|██████████| 16/16 [00:00<00:00, 111107.39it/s]


Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

### Displaying Predictions  

- The loop iterates over the list of **predictions**.  
- For each prediction:  
  - The corresponding **question** is retrieved from `test_data`.  
  - The **predicted answer** is extracted (choosing the first non-empty answer).  
  - Both the question and predicted answer are printed.  
- Additionally, if the question is *"How old is the patient?"*, a separator line (`=` repeated 200 times) is printed to highlight it.  

This step helps in **visually inspecting** model predictions for specific questions.  


In [19]:
for i, pred in enumerate(predictions):
    question = test_data[i]['qas'][0]['question']
    predicted_answer = next((ans for ans in pred['answer'] if ans != ''), pred['answer'][1])
    print(f"Question {i%4+1}: {question}")
    print(f"Predicted Answer: {predicted_answer}")
    print()
    if(question=="How old is the patient?"):
      print(200*"=")

Question 1: Does the patient have any complaints?
Predicted Answer: Urinary retention

Question 2: What is the reason for this consultation?
Predicted Answer: prostate cancer

Question 3: What other symptoms does the patient have?
Predicted Answer: colon cancer or lung and prostate problems on his father side of the family. He does not know whether his father side of the family had any history of prostate cancer.,HOME MEDICATIONS:,1. Norvasc.,2. Toprol 50 mg.,3. Clonidine 0.2 mg.,4. Hydralazine.,5. Flomax

Question 4: How old is the patient?
Predicted Answer: 66

Question 1: Does the patient have any complaints?
Predicted Answer: allergy

Question 2: What is the reason for this consultation?
Predicted Answer: further allergy evaluation and treatment

Question 3: What other symptoms does the patient have?
Predicted Answer: Dialyvite

Question 4: How old is the patient?
Predicted Answer: 34

Question 1: Does the patient have any complaints?
Predicted Answer: Morbid obesity

Question 2: W

### Installing and Importing Evaluation Library  

- `!pip install evaluate` → Installs the **Hugging Face Evaluate** library, which provides standard NLP evaluation metrics.  
- `import evaluate` → Imports the library into the notebook for use.  

This library is later used to compute **SQuAD metrics** such as **Exact Match (EM)** and **F1 score** to assess the QA model’s performance.  


In [29]:
!pip install evaluate
import evaluate



### Model Evaluation with SQuAD Metric  

- **Metric Loading**:  
  - `evaluate.load("squad")` loads the official **SQuAD evaluation metric**, which calculates **Exact Match (EM)** and **F1 score**.  

- **Formatting Predictions**:  
  - Each predicted answer is stored in a dictionary with fields:  
    - `"id"` → unique index of the example.  
    - `"prediction_text"` → model’s predicted answer text.  

- **Formatting References**:  
  - Ground truth answers are structured with fields:  
    - `"id"` → same index as the prediction.  
    - `"answers"` → includes the true answer text(s) and their starting position in the context.  

- **Computing Scores**:  
  - The formatted predictions and references are passed to `squad_metric.compute()`.  
  - This outputs two key metrics:  
    - **Exact Match (EM)** → % of predictions that exactly match the ground truth answer.  
    - **F1 Score** → harmonic mean of precision and recall based on word overlap.  

This step provides a **quantitative evaluation** of the QA model’s performance.  


In [28]:
# Load the official SQuAD metric
squad_metric = evaluate.load("squad")

# Format predictions
formatted_preds = [
    {"id": str(i), "prediction_text": pred['answer'][0] if pred['answer'] else ""}
    for i, pred in enumerate(predictions)
]

# Format references
references = [
    {"id": str(i), "answers": {"text": [ans], "answer_start": [test_data[i]['qas'][0]['answers'][0]['answer_start']]}}
    for i, ans in enumerate(data2['answer'])
]

# Compute SQuAD-style EM and F1
results = squad_metric.compute(predictions=formatted_preds, references=references)
print(results)


{'exact_match': 87.5, 'f1': 62.5}


### Evaluation Results  

- **Exact Match (EM): 87.5%**  
  - The model’s predicted answer exactly matches the ground truth in **87.5%** of the test cases.  

- **F1 Score: 62.5%**  
  - Measures the word-level overlap between predicted and true answers.  
  - A lower F1 compared to EM suggests that while many answers are exactly correct, some partially correct answers reduce the overall precision/recall balance.  

➡️ These results indicate that the model performs well in producing exact matches but has room for improvement in handling **partial or paraphrased answers**.  
