# Fine-Tuning a Small Language Model on the SQuAD Dataset

## Task Overview
This notebook demonstrates fine-tuning a small language model (DistilBERT, ~66M parameters) on the SQuAD dataset for the question answering task.

### Steps:
1. Install required libraries
2. Load and explore the dataset
3. Preprocess data (tokenization)
4. Load a pre-trained SLM
5. Fine-tune the model
6. Evaluate using EM and F1 metrics

## Step 1: Install Required Libraries

pip install transformers datasets accelerate evaluate

## Step 2: Load Dataset

In [None]:
from datasets import load_dataset

# Load a dataset from Hugging Face
dataset = load_dataset('squad')

# Create smaller train/validation subsets to reduce training time
train_subset_size = 20000  # you can adjust if needed
val_subset_size = 5000

train_dataset_small = dataset["train"].shuffle(seed=42).select(range(train_subset_size))
validation_dataset_small = dataset["validation"].shuffle(seed=42).select(range(val_subset_size))

# Print the dataset structure and subset sizes
print(dataset)
print(f"Using train subset of size: {len(train_dataset_small)}")
print(f"Using validation subset of size: {len(validation_dataset_small)}")

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
Using train subset of size: 20000
Using validation subset of size: 5000


Let's inspect a sample from the training split to see the data format.

In [None]:
# Display a sample from the training split
print(dataset['train'][0])

{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
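The `answer_start` field is a character offset into `context`, so slicing the context at that offset should reproduce the answer text. A minimal check of that relationship is sketched below; the shortened context string and the `extract_answer` helper are assumptions made for illustration, so the offset here differs from the 515 in the real record.

```python
# Shortened, hypothetical record in the same shape as a SQuAD example.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"
answer_start = context.index(answer_text)  # character offset of the answer


def extract_answer(context, answer_start, answer_text):
    """Slice the answer out of the context using its character offset."""
    return context[answer_start:answer_start + len(answer_text)]


recovered = extract_answer(context, answer_start, answer_text)
print(recovered == answer_text)
```

The preprocessing step relies on this character offset when it converts answers into token positions.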


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Model selection: DistilBERT has ~66M parameters (much less than 3B)
model_name = "distilbert-base-uncased"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

print(f"Loaded model: {model_name}")
print(f"Model parameters: {model.num_parameters() / 1e6:.2f} Million")

DistilBertForQuestionAnswering LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_transform.weight  | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
qa_outputs.weight       | MISSING    | 
qa_outputs.bias         | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


Loaded model: distilbert-base-uncased
Model parameters: 66.36 Million


## Step 3: Preprocess Data for Fine-tuning

### Important Note:
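The tokenizer must be loaded before preprocessing, because it produces the input IDs and offset mappings used below. The model itself is only needed at training time, but both were loaded in the previous cell and are ready to use.

For the SQuAD dataset, we tokenize each question together with its context. Since answers are character spans within the context, we map each span to start and end token positions during tokenization. Contexts longer than `max_length` are split into overlapping chunks (`stride` tokens of overlap), and chunks that do not fully contain the answer are labeled `(0, 0)`.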
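To make the answer mapping concrete, here is a minimal sketch of converting a character span into token positions with an offset mapping. It mirrors the logic of the training preprocessing function in this notebook, but the `char_span_to_token_span` helper and the toy offsets are assumptions for illustration, not tokenizer output.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Return (start_token, end_token), or (0, 0) if the answer is not fully in the chunk."""
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return (0, 0)

    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return (start_token, end_token)


# Toy layout: [CLS], two question tokens, [SEP], four context tokens, [SEP].
# Context "Paris is in France": Paris (0,5), is (6,8), in (9,11), France (12,18).
offsets = [(0, 0), (0, 4), (4, 5), (0, 0), (0, 5), (6, 8), (9, 11), (12, 18), (0, 0)]

print(char_span_to_token_span(offsets, 4, 7, 12, 18))  # answer "France"
print(char_span_to_token_span(offsets, 4, 7, 0, 8))    # answer "Paris is"
print(char_span_to_token_span(offsets, 4, 7, 20, 25))  # answer outside this chunk
```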

In [None]:
max_length = 384  # The maximum length of a feature (question and context)
stride = 128  # The overlap between consecutive chunks of a context

def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second", # Truncate only the context if it's too long
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1: # Find where the context starts
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1: # Find where the context ends
            idx += 1
        context_end = idx - 1

        # If the answer is not fully contained in the current context chunk, set it to (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# Apply the preprocessing to the (smaller) training subset
tokenized_train_dataset = train_dataset_small.map(
    preprocess_training_examples,
    batched=True,
    remove_columns=train_dataset_small.column_names,
)

print("Original full training dataset size:", len(dataset["train"]))
print("Training subset size:", len(train_dataset_small))
print("Tokenized training dataset size (subset):", len(tokenized_train_dataset))
print("Sample tokenized training example:", tokenized_train_dataset[0])

Original full training dataset size: 87599
Training subset size: 20000
Tokenized training dataset size (subset): 20196
Sample tokenized training example: {'input_ids': [101, 2054, 7017, 1997, 23437, 26847, 2490, 2331, 6531, 2005, 2216, 2975, 7025, 1029, 102, 1996, 29071, 7057, 2006, 4676, 1004, 2270, 2166, 6938, 5279, 2004, 1996, 3587, 5409, 2406, 1999, 1996, 2088, 2005, 3412, 4071, 1012, 1996, 2142, 2163, 3222, 2006, 2248, 3412, 4071, 1010, 1037, 12170, 26053, 2981, 4034, 1997, 1996, 2149, 2231, 1010, 2038, 2872, 5279, 2006, 2049, 3422, 2862, 1997, 3032, 2008, 5478, 2485, 8822, 2349, 2000, 1996, 3267, 1998, 6698, 1997, 13302, 1997, 3412, 4071, 5117, 1999, 2030, 25775, 2011, 1996, 2231, 1012, 2429, 2000, 1037, 2230, 29071, 3795, 13818, 5002, 1010, 6391, 1003, 1997, 23437, 26847, 3569, 1996, 2331, 6531, 2005, 2216, 2040, 2681, 7025, 1025, 6255, 1003, 3569, 23016, 2015, 1998, 6276, 2125, 1997, 2398, 2005, 11933, 1998, 13742, 1025, 1998, 6445, 1003, 2490, 2358, 13369, 1037, 2711, 2040, 27
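The tokenized subset contains 20,196 features for 20,000 examples because contexts that exceed the length limit are split into several overlapping windows. The sketch below illustrates the idea on a plain list; `sliding_windows` is a hypothetical helper and a simplification, not the tokenizer's actual implementation (which also accounts for the question and special tokens in every window).

```python
def sliding_windows(tokens, window, stride):
    """Split tokens into windows of at most `window` items, overlapping by `stride` items."""
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += window - stride
    return chunks


chunks = sliding_windows(list(range(10)), window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Each window becomes one feature, which is why a single long example can contribute more than one row to the tokenized dataset.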

In [None]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    inputs["example_id"] = [examples["id"][idx] for idx in sample_map]
    # offset_mapping is left in `inputs` so post-processing can map tokens back to characters

    # Add sequence_ids to the features for post-processing
    inputs["sequence_ids"] = [inputs.sequence_ids(i) for i in range(len(inputs["input_ids"]))]

    return inputs

# Apply the preprocessing to the (smaller) validation subset
tokenized_validation_dataset = validation_dataset_small.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=validation_dataset_small.column_names,
)

print("Original full validation dataset size:", len(dataset["validation"]))
print("Validation subset size:", len(validation_dataset_small))
print("Tokenized validation dataset size (subset):", len(tokenized_validation_dataset))

Original full validation dataset size: 10570
Validation subset size: 5000
Tokenized validation dataset size (subset): 5099


## Fine-tune Model

### Subtask:
Fine-tune the selected SLM on the preprocessed text dataset. This will involve setting up training arguments and using the `Trainer` API from the `transformers` library.

We'll define a `DataCollatorWithPadding` to prepare batches and use `TrainingArguments` to configure our training process.
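As a rough illustration of what padding a batch to its longest example means, the sketch below pads `input_ids` and `attention_mask` by hand. The `pad_batch` helper is hypothetical, and the pad token ID of 0 is an assumption matching the `[PAD]` entry of the uncased BERT vocabulary. Because the preprocessing in this notebook already pads every feature to `max_length`, the collator has no extra padding to add here.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every feature in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get attention mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 2054, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2054, 7017, 1997, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```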

In [None]:
import torch
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

# Define a data collator. It pads each batch to the longest example in that batch
# (the features here are already padded to max_length during preprocessing).
data_collator = DataCollatorWithPadding(tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    # Evaluate at the end of every epoch. The validation features carry no
    # start/end labels, so no validation loss is reported.
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),  # Mixed precision needs a CUDA GPU; fall back to fp32 otherwise
    push_to_hub=False,  # Set to True if you want to push your model to Hugging Face Hub
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    data_collator=data_collator,
)

# Start training
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.506732      | No log          |
| 2     | 1.10125       | No log          |
| 3     | 0.795915      | No log          |


TrainOutput(global_step=7575, training_loss=1.3331631107141475, metrics={'train_runtime': 1217.2309, 'train_samples_per_second': 49.775, 'train_steps_per_second': 6.223, 'total_flos': 5937007518554112.0, 'train_loss': 1.3331631107141475, 'epoch': 3.0})
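The step count and throughput in this output can be cross-checked against the dataset size and batch size. The short calculation below assumes a single device and no gradient accumulation, which the reported numbers are consistent with.

```python
import math

num_features = 20196       # tokenized training features reported above
batch_size = 8             # per_device_train_batch_size
num_epochs = 3
train_runtime = 1217.2309  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_features / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_features * num_epochs / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3))
```

The total of 7,575 steps matches `global_step`, and the throughput agrees with `train_samples_per_second`.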

## Evaluate Model Performance

### Subtask:
Evaluate the fine-tuned model using suitable metrics.

For Question Answering on SQuAD, we will calculate Exact Match (EM) and F1 Score. This requires generating predictions, post-processing them to extract answers, and then comparing with the ground truth using the SQuAD evaluation script from the `evaluate` library.
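The core of the post-processing step is choosing the start/end token pair with the highest combined logit among valid pairs. The sketch below shows that selection in isolation. `best_span` is a hypothetical helper that limits span length in tokens, whereas the notebook's function measures length in characters and additionally restricts candidates to context tokens.

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_tokens=30):
    """Return (start, end, score) for the best valid span, or None if there is none."""
    # Indices of the n best start and end candidates, highest logit first.
    top_starts = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)[:n_best_size]
    top_ends = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)[:n_best_size]

    best = None
    for s in top_starts:
        for e in top_ends:
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_tokens:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


span = best_span([0.1, 2.0, 0.3, 0.2], [0.0, 0.5, 3.0, 0.1])
print(span[:2])
```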

In [None]:
import collections
import numpy as np
from tqdm.auto import tqdm

print("Generating predictions on the (smaller) validation subset...")
raw_predictions = trainer.predict(tokenized_validation_dataset)

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_answer_length=30):
    all_start_logits, all_end_logits = raw_predictions

    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionary to fill with the processed predictions
    predictions = collections.OrderedDict()

    # tqdm displays progress bar.
    print("Post-processing predictions...")
    for example_index, example in enumerate(tqdm(examples)):
        feature_indices = features_per_example[example_index]
        min_null_score = None
        valid_answers = []

        context = example["context"]
        for feature_index in feature_indices:
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            offset_mapping = features[feature_index]["offset_mapping"]

            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    if (start_index >= len(offset_mapping) or
                        end_index >= len(offset_mapping) or
                        offset_mapping[start_index] is None or
                        offset_mapping[end_index] is None or
                        offset_mapping[start_index][0] < 0 or
                        offset_mapping[end_index][0] < 0 or
                        start_index > end_index or
                        (features[feature_index]["sequence_ids"][start_index] != 1 or
                         features[feature_index]["sequence_ids"][end_index] != 1)):
                        continue

                    # Note: this length is measured in characters (offset difference), not tokens
                    length = offset_mapping[end_index][1] - offset_mapping[start_index][0]
                    if length > max_answer_length:
                        continue

                    valid_answers.append(
                        {
                            "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]),
                            "score": start_logits[start_index] + end_logits[end_index],
                            "start_logit": start_logits[start_index],
                            "end_logit": end_logits[end_index],
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the event that all valid answers have been filtered out, return an empty string
            best_answer = {"offsets": (0, 0), "score": 0.0, "start_logit": 0.0, "end_logit": 0.0}

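        # If the null answer has a better score than the best non-null answer, then return null answer.
        # Note: SQuAD v1.1 has no unanswerable questions, so this branch matters mainly for SQuAD v2-style data.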
        if min_null_score is not None and min_null_score > best_answer["score"]:
            predictions[example["id"]] = ""
        else:
            predictions[example["id"]] = context[best_answer["offsets"][0] : best_answer["offsets"][1]]

    return predictions

# Call the post-processing function on the validation subset
final_predictions = postprocess_qa_predictions(
    validation_dataset_small, tokenized_validation_dataset, raw_predictions.predictions
)

print("Finished generating and post-processing predictions on the validation subset.")

Generating predictions on the (smaller) validation subset...


Post-processing predictions...


Finished generating and post-processing predictions on the validation subset.


In [None]:
# Compute SQuAD metrics (Exact Match and F1) on the validation subset
# This cell only depends on `final_predictions` and `validation_dataset_small`.

!pip install evaluate  # safe to re-run; does nothing if already installed
import evaluate

print("Computing SQuAD metrics (Exact Match and F1) on validation subset...")

# Load the SQuAD evaluation script
squad_metric = evaluate.load("squad")

# Prepare the references in the format expected by the SQuAD metric
references = [
    {"id": ex["id"], "answers": ex["answers"]}
    for ex in validation_dataset_small
]

# Convert final_predictions (dict id -> text) to the list-of-dicts format expected by the metric
predictions_for_metric = [
    {"id": qid, "prediction_text": pred_text}
    for qid, pred_text in final_predictions.items()
]

# Compute the metrics
eval_results = squad_metric.compute(predictions=predictions_for_metric, references=references)

print("\n" + "="*60)
print("SQuAD Evaluation Results (on validation subset):")
print("="*60)
print(f"Exact Match (EM): {eval_results['exact_match']:.4f}")
print(f"F1 Score: {eval_results['f1']:.4f}")
print("="*60)

Computing SQuAD metrics (Exact Match and F1) on validation subset...

SQuAD Evaluation Results (on validation subset):
Exact Match (EM): 67.5800
F1 Score: 77.6290


## Summary and Observations

### Task Completion:
✓ **Dataset Selected**: SQuAD - Different from standard classification tasks  
✓ **Model Selected**: DistilBERT (~66M parameters) - Much less than 3B limit  
✓ **Task Completed**: Question Answering fine-tuning  
✓ **Evaluation Metrics**: Exact Match (EM) and F1 Score (standard for QA)  
✓ **Results Displayed**: Clear performance metrics shown  

### Key Implementation Details:
1. **DistilBERT Model**: Compact version of BERT (66M vs 110M parameters)
2. **Data Preprocessing**: Tokenization with sliding window approach for long contexts
3. **Training Configuration**: 3 epochs with learning rate 2e-5
4. **Evaluation**: SQuAD metrics using HuggingFace evaluate library
5. **GPU Acceleration**: Mixed precision training (fp16) enabled for faster training

### Results Interpretation:
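- **Exact Match**: Percentage of predictions that match a ground-truth answer exactly after normalization; the model reaches 67.58 on the 5,000-example validation subset
- **F1 Score**: Harmonic mean of token-level precision and recall between prediction and ground truth; the model reaches 77.63 on the same subset
- Both metrics range from 0 to 100, and higher values indicate better performance on the QA task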
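
To make the two metrics concrete, here is a minimal sketch of how they can be computed for a single prediction and a single reference. It follows the usual SQuAD-style normalization (lowercasing, then stripping punctuation, articles, and extra whitespace). The helper names are assumptions, and the actual `squad` metric additionally takes the best score over all reference answers and averages over the dataset.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # Multiset intersection counts the tokens shared by both answers.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


truth = "Saint Bernadette Soubirous"
print(exact_match("Bernadette Soubirous", truth), f1_score("Bernadette Soubirous", truth))
print(exact_match("The Saint Bernadette Soubirous.", truth))
```

A partially correct prediction scores 0 on Exact Match but still earns partial credit on F1, which is why F1 is consistently the higher of the two numbers.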