<a href="https://colab.research.google.com/github/Jai-Kumar786/Full-Fledged-BERT-Question-Answering-Application/blob/main/03_model_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 3: Model Loading & Fine-Tuning
## BERT Question Answering Project

**Objectives:** Load BERT model, configure training, fine-tune on SQuAD
**Expected Training Time:** about 5 minutes on a Tesla T4 for the 3,000-example subset used here (the recorded run took 279 seconds)


In [1]:
# ============================================
# SETUP: Reinstall & Load Previous Work
# ============================================

# Install libraries
!pip install datasets transformers torch accelerate -q

# Imports
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    DefaultDataCollator
)
import torch
import numpy as np

# Check GPU
print(f"✅ GPU Available: {torch.cuda.is_available()}")
print(f"✅ GPU Name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")


✅ GPU Available: True
✅ GPU Name: Tesla T4


# Action: Recreate Day 2 preprocessing

In [2]:
# ============================================
# RELOAD DATASET & PREPROCESSING FUNCTION
# ============================================

print("📥 Reloading SQuAD dataset...")
dataset = load_dataset("squad")
train_dataset = dataset['train'].select(range(3000))
val_dataset = dataset['validation'].select(range(500))

print(f"✅ Training: {len(train_dataset)} examples")
print(f"✅ Validation: {len(val_dataset)} examples")

# Load tokenizer
print("\n🔧 Loading BERT tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(f"✅ Tokenizer loaded: {tokenizer.__class__.__name__}")

# Preprocessing function (from Day 2)
def preprocess_function(examples):
    """
    Tokenize question/context pairs into overlapping windows and label each
    window with the answer's start and end token positions (same function as Day 2).
    """
    tokenized_examples = tokenizer(
        examples['question'],
        examples['context'],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )

    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        sample_index = sample_mapping[i]
        answers = examples['answers'][sample_index]

        if len(answers['answer_start']) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue

        start_char = answers['answer_start'][0]
        end_char = start_char + len(answers['text'][0])

        sequence_ids = tokenized_examples.sequence_ids(i)

        context_start = 0
        while sequence_ids[context_start] != 1:
            context_start += 1

        context_end = len(sequence_ids) - 1
        while sequence_ids[context_end] != 1:
            context_end -= 1

        if not (offsets[context_start][0] <= start_char and
                offsets[context_end][1] >= end_char):
            start_positions.append(0)
            end_positions.append(0)
        else:
            token_start = context_start
            while token_start <= context_end and offsets[token_start][0] <= start_char:
                token_start += 1
            start_positions.append(token_start - 1)

            token_end = context_end
            while token_end >= context_start and offsets[token_end][1] >= end_char:
                token_end -= 1
            end_positions.append(token_end + 1)

    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions

    return tokenized_examples

# Apply preprocessing
print("\n🔄 Tokenizing datasets (this takes ~5 minutes)...")
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names,
    desc="Tokenizing training set"
)

tokenized_val = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=val_dataset.column_names,
    desc="Tokenizing validation set"
)

print(f"\n✅ Tokenized Training: {len(tokenized_train)} features")
print(f"✅ Tokenized Validation: {len(tokenized_val)} features")


📥 Reloading SQuAD dataset...



✅ Training: 3000 examples
✅ Validation: 500 examples

🔧 Loading BERT tokenizer...



✅ Tokenizer loaded: BertTokenizerFast

🔄 Tokenizing datasets (this takes ~5 minutes)...




✅ Tokenized Training: 3074 features
✅ Tokenized Validation: 520 features
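
The labeling step inside `preprocess_function` is easier to follow on a toy input. The sketch below repeats the same search over hand-written offsets for the question "who sat?" and the context "the cat sat". The helper name `char_span_to_token_span` and the toy offsets are illustrative assumptions, not tokenizer output.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map an answer's character span to token indices, or (0, 0) if it is outside the window."""
    # First and last token indices that belong to the context (sequence id 1).
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer not fully inside this window: label it with index 0 ([CLS]).
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk right past every token that starts at or before the answer start.
    token_start = context_start
    while token_start <= context_end and offsets[token_start][0] <= start_char:
        token_start += 1

    # Walk left past every token that ends at or after the answer end.
    token_end = context_end
    while token_end >= context_start and offsets[token_end][1] >= end_char:
        token_end -= 1

    return token_start - 1, token_end + 1


# Hypothetical features: [CLS] who sat ? [SEP] the cat sat [SEP]
toy_offsets = [(0, 0), (0, 3), (4, 7), (7, 8), (0, 0), (0, 3), (4, 7), (8, 11), (0, 0)]
toy_sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, None]

# "cat" occupies characters 4..7 of the context.
inside = char_span_to_token_span(toy_offsets, toy_sequence_ids, 4, 7)
# A span that runs past the end of the window falls back to (0, 0).
outside = char_span_to_token_span(toy_offsets, toy_sequence_ids, 4, 20)
print(inside, outside)
```

Because long contexts are split into overlapping windows, one example can yield several features, which is why 3,000 examples became 3,074 features above.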


# Question 3.1: Load BERT model for Question Answering (2 marks)

In [3]:
# ============================================
# QUESTION 3.1: Load BERT Model (2 marks)
# ============================================

print("🤖 Loading BERT model for Question Answering...")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

print(f"\n✅ Model loaded: {model.__class__.__name__}")
print(f"✅ Model parameters: {model.num_parameters():,}")

# Move model to GPU
if torch.cuda.is_available():
    model = model.cuda()
    print(f"✅ Model moved to GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ No GPU detected - training will be slow!")

# Print model architecture summary
print(f"\n📊 Model Architecture:")
print(f"   Base model: bert-base-uncased")
print(f"   Layers: 12")
print(f"   Hidden size: 768")
print(f"   Attention heads: 12")
print(f"   Total parameters: {model.num_parameters():,}")


🤖 Loading BERT model for Question Answering...



Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



✅ Model loaded: BertForQuestionAnswering
✅ Model parameters: 108,893,186
✅ Model moved to GPU: Tesla T4

📊 Model Architecture:
   Base model: bert-base-uncased
   Layers: 12
   Hidden size: 768
   Attention heads: 12
   Total parameters: 108,893,186


### BERT-base-uncased Model:
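- **Parameters**: 108.9 million (108,893,186, as reported by `model.num_parameters()`)
- **Architecture**: 12 transformer layers
- **Hidden size**: 768 dimensions
- **QA Head**: A single linear layer (`qa_outputs`) that produces a start logit and an end logit for every token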
- **Task**: Extractive Question Answering (SQuAD-style)
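
The reported parameter count can be reproduced by hand. The sketch below assumes the published `bert-base-uncased` configuration (vocabulary of 30,522, 512 positions, 2 token types, feed-forward size 3,072) and assumes the QA model carries no pooler layer. These values come from the model configuration, not from this notebook.

```python
# Assumed bert-base-uncased configuration values.
vocab_size, max_positions, type_vocab = 30522, 512, 2
hidden, intermediate, layers = 768, 3072, 12


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position, and token-type embeddings followed by one LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# Each encoder layer: Q, K, V, and output projections, then a two-layer feed-forward block.
attention = 4 * linear(hidden, hidden) + layer_norm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)

# QA head: one linear layer mapping each hidden state to (start, end) logits.
qa_head = linear(hidden, 2)

total = embeddings + encoder + qa_head
print(f"{total:,}")
```

The total matches the 108,893,186 printed by `model.num_parameters()`, and the QA head accounts for only 1,538 of those parameters.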


# Question 3.2: Define training arguments (5 marks)

In [6]:
# ============================================
# QUESTION 3.2: Define Training Arguments (5 marks)
# ============================================

training_args = TrainingArguments(
    # Output directories
    output_dir="./results",
    logging_dir="./logs",

    # Training hyperparameters
    num_train_epochs=3,               # Train for 3 epochs
    per_device_train_batch_size=16,   # Batch size for training
    per_device_eval_batch_size=16,    # Batch size for evaluation
    learning_rate=3e-5,               # Within the usual 2e-5 to 5e-5 range for BERT (TrainingArguments default: 5e-5)
    weight_decay=0.01,                # Decoupled weight decay applied by AdamW

    # Evaluation strategy
    eval_strategy="epoch",      # Evaluate after each epoch
    save_strategy="epoch",            # Save checkpoint after each epoch
    load_best_model_at_end=True,     # Load best model when training ends
    metric_for_best_model="loss",     # Use validation loss as metric

    # Logging
    logging_steps=100,                # Log every 100 steps
    save_total_limit=2,              # Keep at most 2 checkpoints on disk

    # Performance optimizations
    fp16=torch.cuda.is_available(),   # Use mixed precision if GPU available
    dataloader_num_workers=2,        # Parallel data loading

    # Reproducibility
    seed=42,

    # Other settings
    remove_unused_columns=True,
    push_to_hub=False,               # Don't push to Hugging Face Hub
)

print("✅ Training Arguments configured!")
print(f"\n📊 Training Configuration:")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Weight decay: {training_args.weight_decay}")
print(f"   FP16 (mixed precision): {training_args.fp16}")
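# Note: floor division drops the final partial batch, so this prints 576;
# the Trainer rounds up and runs ceil(3074 / 16) * 3 = 579 optimizer steps.
print(f"   Total training steps: {len(tokenized_train) // training_args.per_device_train_batch_size * training_args.num_train_epochs}")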

✅ Training Arguments configured!

📊 Training Configuration:
   Epochs: 3
   Batch size: 16
   Learning rate: 3e-05
   Weight decay: 0.01
   FP16 (mixed precision): True
   Total training steps: 576


### Training Arguments Justification:

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `num_train_epochs` | 3 | Common choice for BERT fine-tuning; few passes limit overfitting on a small dataset |
| `per_device_train_batch_size` | 16 | Fits in the T4's ~15 GB of GPU memory; balances speed and memory |
| `learning_rate` | 3e-5 | Within the 2e-5 to 5e-5 range commonly used for BERT fine-tuning; lower than the `TrainingArguments` default of 5e-5 |
| `weight_decay` | 0.01 | Regularization to reduce overfitting |
| `fp16` | True | Mixed precision training: typically faster and lighter on memory on Tensor Core GPUs such as the T4 |
| `save_strategy` | "epoch" | Save checkpoints after each epoch for recovery |
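
The step count printed by the configuration cell uses floor division, which ignores the last partial batch of each epoch. A minimal sketch of both calculations follows, assuming a single GPU, no gradient accumulation, and the default behavior of keeping the final partial batch.

```python
import math

num_features = 3074  # tokenized training features
batch_size = 16
num_epochs = 3

# Floor division counts only full batches (what the notebook prints).
floor_steps = (num_features // batch_size) * num_epochs

# Rounding up also counts the final batch of 2 features in each epoch.
actual_steps = math.ceil(num_features / batch_size) * num_epochs

print(floor_steps, actual_steps)
```

The difference is one step per epoch: 3,074 features make 192 full batches plus a remainder of 2.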



# Question 3.3: Initialize Trainer (5 marks)

In [7]:
# ============================================
# QUESTION 3.3: Initialize Trainer (5 marks)
# ============================================

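# Data collator: stacks the already padded features into batch tensors
# (padding to max_length happened during tokenization, so no dynamic padding occurs)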
data_collator = DefaultDataCollator()

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class=tokenizer
    data_collator=data_collator,
)

print("✅ Trainer initialized successfully!")
print(f"\n📊 Trainer Configuration:")
print(f"   Model: {model.__class__.__name__}")
print(f"   Training samples: {len(tokenized_train)}")
print(f"   Validation samples: {len(tokenized_val)}")
print(f"   Data collator: {data_collator.__class__.__name__}")
print(f"   Optimizer: AdamW (default)")
print(f"   Loss function: Cross-entropy on start/end positions")



✅ Trainer initialized successfully!

📊 Trainer Configuration:
   Model: BertForQuestionAnswering
   Training samples: 3074
   Validation samples: 520
   Data collator: DefaultDataCollator
   Optimizer: AdamW (default)
   Loss function: Cross-entropy on start/end positions


### Trainer Components:

1. **Model**: BertForQuestionAnswering
   - Pre-trained BERT with a QA head (one linear layer that outputs a start logit and an end logit per token)

2. **Training Arguments**: Configured hyperparameters
   - Controls learning rate, batch size, epochs, etc.

3. **Datasets**: Tokenized training and validation sets
   - 3,074 training features
   - 520 validation features

4. **Tokenizer**: BERT tokenizer
   - Saved with each checkpoint so the model can be reloaded with matching preprocessing

5. **Data Collator**: DefaultDataCollator
   - Stacks features into batch tensors without adding padding
   - Padding and attention masks come from the tokenizer (`padding="max_length"`)

### Training Process:
1. Load batch from `train_dataset`
2. Forward pass through BERT
3. Compute loss (cross-entropy on start/end positions)
4. Backward pass (compute gradients)
5. Update weights using AdamW optimizer
6. Repeat for all batches (1 epoch)
7. Evaluate on `eval_dataset`
8. Save a checkpoint (every epoch here); the checkpoint with the lowest validation loss is reloaded when training ends
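
Step 3 can be sketched in plain Python. The loss is cross-entropy over token positions, computed once for the start logits and once for the end logits. Averaging the two terms follows what `BertForQuestionAnswering` does in the transformers library; the function names and toy logits are illustrative assumptions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_normalizer - logits[target]


def qa_loss(start_logits, end_logits, start_pos, end_pos):
    """Average of the start-position and end-position cross-entropy terms."""
    start_term = cross_entropy(start_logits, start_pos)
    end_term = cross_entropy(end_logits, end_pos)
    return (start_term + end_term) / 2


# A model with no preference over 4 positions pays log(4) per term.
uniform_loss = qa_loss([0.0] * 4, [0.0] * 4, start_pos=1, end_pos=2)

# A model that strongly favors the correct positions pays almost nothing.
confident_loss = qa_loss([0.0, 10.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0], start_pos=1, end_pos=2)

print(f"{uniform_loss:.4f} {confident_loss:.6f}")
```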


# Question 3.4: Fine-tune and evaluate (3 marks)

In [10]:
# ============================================
# QUESTION 3.4: Fine-Tune Model (3 marks)
# ============================================

print("🚀 Starting training...")
print("⏰ Estimated time: about 5 minutes on a Tesla T4 GPU for this 3,000-example subset")
print("📊 You will see progress bars and loss metrics\n")
print("💡 TIP: Keep this browser tab open - Colab may disconnect idle sessions!")
print("=" * 70)

# Start training
train_result = trainer.train()

print("\n" + "=" * 70)
print("🎉 Training complete!")
print("=" * 70)

# Print training metrics
print(f"\n📊 Final Training Metrics:")
print(f"   Total training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"   Training loss: {train_result.metrics['train_loss']:.4f}")
# Note: floor division drops the final partial batch, so this prints 576;
# the Trainer rounds up and runs ceil(3074 / 16) * 3 = 579 optimizer steps.
print(f"   Training steps: {len(tokenized_train) // training_args.per_device_train_batch_size * training_args.num_train_epochs}")
print(f"   Samples/second: {train_result.metrics['train_samples_per_second']:.2f}")

🚀 Starting training...
⏰ Estimated time: about 5 minutes on a Tesla T4 GPU for this 3,000-example subset
📊 You will see progress bars and loss metrics

💡 TIP: Keep this browser tab open - Colab may disconnect idle sessions!


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 1.1314 | 1.647568 |
| 2 | 0.6137 | 1.805327 |
| 3 | 0.4221 | 1.865861 |



🎉 Training complete!

📊 Final Training Metrics:
   Total training time: 279.17 seconds
   Training loss: 0.6804
   Training steps: 576
   Samples/second: 33.03
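
The per-epoch losses above show training loss falling while validation loss rises, a typical sign of overfitting on a small subset. The sketch below picks the epoch with the lowest validation loss, which mirrors the selection that `load_best_model_at_end=True` with `metric_for_best_model="loss"` is configured to make. The `history` list is copied by hand from the logged values.

```python
# (epoch, training loss, validation loss) copied from the training log.
history = [
    (1, 1.1314, 1.647568),
    (2, 0.6137, 1.805327),
    (3, 0.4221, 1.865861),
]

# Best checkpoint = lowest validation loss.
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])

# A gap between validation and training loss that grows every epoch suggests overfitting.
gaps = [round(val - train, 4) for _, train, val in history]

print(best_epoch, best_val_loss, gaps)
```

Epoch 1 is selected, which is consistent with the validation loss of 1.6476 reported by the evaluation step.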


# STEP 8: Evaluate Results (After Training)

In [11]:
# ============================================
# EVALUATE TRAINED MODEL
# ============================================

print("📊 Evaluating model on validation set...")
eval_results = trainer.evaluate()

print(f"\n✅ Evaluation Complete!")
print(f"\n📊 Validation Metrics:")
print(f"   Validation loss: {eval_results['eval_loss']:.4f}")
print(f"   Evaluation time: {eval_results['eval_runtime']:.2f} seconds")
print(f"   Samples/second: {eval_results['eval_samples_per_second']:.2f}")

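# Interpret results
# Note: these thresholds are rough rules of thumb for this notebook; validation loss
# does not directly measure answer accuracy (SQuAD is normally scored with exact match and F1).
print(f"\n🎯 Model Performance Analysis:")
if eval_results['eval_loss'] < 1.5:
    print("   ✅ EXCELLENT: Loss < 1.5 indicates strong performance!")
elif eval_results['eval_loss'] < 2.0:
    print("   ✅ GOOD: Loss < 2.0 indicates decent performance")
else:
    print("   ⚠️ NEEDS IMPROVEMENT: Loss >= 2.0 suggests more training needed")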


📊 Evaluating model on validation set...



✅ Evaluation Complete!

📊 Validation Metrics:
   Validation loss: 1.6476
   Evaluation time: 2.79 seconds
   Samples/second: 186.27

🎯 Model Performance Analysis:
   ✅ GOOD: Loss < 2.0 indicates decent performance
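
One way to read the validation loss is to convert it back into a probability. Since the loss is an average negative log-probability, `exp(-loss)` is the geometric mean of the probability the model assigns to the correct start and end tokens. The uniform baseline below is a rough assumption: it treats all 384 positions of a window as equally likely.

```python
import math

eval_loss = 1.6476  # validation loss reported above
max_length = 384    # window length used during tokenization

# Geometric-mean probability assigned to the correct start/end positions.
geo_mean_prob = math.exp(-eval_loss)

# Loss of a model that guesses uniformly over every position in the window.
uniform_loss = math.log(max_length)

print(f"{geo_mean_prob:.4f} {uniform_loss:.4f}")
```

Under these assumptions the fine-tuned model is far better than a uniform guess, but the loss alone still does not say how often the predicted span matches the reference answer.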


# STEP 9: Save Model (CRITICAL!)

In [12]:
# ============================================
# SAVE TRAINED MODEL & TOKENIZER
# ============================================

print("💾 Saving fine-tuned model...")

# Save model
model.save_pretrained("./bert-qa-model")
tokenizer.save_pretrained("./bert-qa-model")

print(f"✅ Model saved to './bert-qa-model'")
print(f"✅ Model size: ~440 MB")
print(f"\n📂 Saved files:")
print(f"   - config.json (model configuration)")
print(f"   - model.safetensors (model weights)")
print(f"   - tokenizer_config.json")
print(f"   - vocab.txt (vocabulary)")


💾 Saving fine-tuned model...
✅ Model saved to './bert-qa-model'
✅ Model size: ~440 MB

📂 Saved files:
   - config.json (model configuration)
   - model.safetensors (model weights)
   - tokenizer_config.json
   - vocab.txt (vocabulary)
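
The file list above is hard-coded in the print statements, so it is worth checking what was actually written. A minimal sketch using only the standard library follows; the helper name `list_saved_files` is an assumption for illustration.

```python
from pathlib import Path


def list_saved_files(model_dir):
    """Return {file name: size in MB} for each file directly inside model_dir."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return {}  # nothing saved yet, or wrong path
    return {
        path.name: path.stat().st_size / 1e6
        for path in sorted(directory.iterdir())
        if path.is_file()
    }


# Print the real contents of the save directory, if it exists.
for name, size_mb in list_saved_files("./bert-qa-model").items():
    print(f"{name}: {size_mb:.1f} MB")
```

At 4 bytes per parameter, the 108,893,186 weights should account for roughly 436 MB of the directory.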
