Preprocessing & Tokenization

Goals for Today:

✅ Load BERT tokenizer (Q2.1 - 5 marks)

✅ Create preprocessing function (Q2.2 - 5 marks)

✅ Apply preprocessing to datasets (Q2.3 - 5 marks)

✅ Answer PyTorch tensor question (Q2.4 - 5 marks)

✅ CRITICAL: Validate answer position mapping

# Notebook 2: Preprocessing & Tokenization
## BERT Question Answering Project

**Objectives:** Load tokenizer, preprocess dataset, validate answer mapping


STEP 2: Reload Dataset & Libraries

In [1]:
# ============================================
# SETUP: Reinstall Libraries & Load Dataset
# ============================================

# Install required libraries
!pip install datasets transformers torch -q

# Import libraries
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np
import torch

# Reload SQuAD dataset
print("📥 Loading SQuAD dataset...")
dataset = load_dataset("squad")

# Create subsets (same as Day 1)
train_dataset = dataset['train'].select(range(3000))
val_dataset = dataset['validation'].select(range(500))

print(f"✅ Training: {len(train_dataset)} examples")
print(f"✅ Validation: {len(val_dataset)} examples")


📥 Loading SQuAD dataset...



✅ Training: 3000 examples
✅ Validation: 500 examples


STEP 3: Load BERT Tokenizer
Question 2.1: Load the BERT tokenizer (bert-base-uncased) (5 marks)


In [2]:
# ============================================
# QUESTION 2.1: Load BERT Tokenizer (5 marks)
# ============================================

# Load bert-base-uncased tokenizer
print("🔧 Loading BERT tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Verify tokenizer properties
print(f"\n✅ Tokenizer loaded: {tokenizer.__class__.__name__}")
print(f"✅ Vocabulary size: {tokenizer.vocab_size:,}")
print(f"✅ Max length: {tokenizer.model_max_length}")
print(f"✅ Special tokens:")
print(f"   [CLS] token: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"   [SEP] token: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"   [PAD] token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")


🔧 Loading BERT tokenizer...




✅ Tokenizer loaded: BertTokenizerFast
✅ Vocabulary size: 30,522
✅ Max length: 512
✅ Special tokens:
   [CLS] token: [CLS] (ID: 101)
   [SEP] token: [SEP] (ID: 102)
   [PAD] token: [PAD] (ID: 0)


### Key Tokenizer Properties:
- **Vocabulary**: 30,522 WordPiece tokens
- **Max Length**: 512 tokens (BERT's limit)
- **Special Tokens**:
  - `[CLS]`: Classification token (start of sequence)
  - `[SEP]`: Separator token (between question & context)
  - `[PAD]`: Padding token (for batch consistency)
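
To make the role of the special tokens concrete, the minimal sketch below assembles the `[CLS] question [SEP] context [SEP]` layout by hand from tokens that are already split. `build_pair_layout` is a hypothetical helper written only for illustration; it is not part of `transformers`, and the real tokenizer additionally performs WordPiece splitting, ID lookup, and truncation.

```python
def build_pair_layout(question_tokens, context_tokens, max_length, pad_token="[PAD]"):
    """Arrange tokens as [CLS] question [SEP] context [SEP] plus padding.

    Hypothetical illustration only: assumes the pair already fits in max_length.
    """
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 the rest.
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    attention_mask = [1] * len(tokens)

    pad_count = max_length - len(tokens)
    if pad_count < 0:
        raise ValueError("pair does not fit in max_length")
    # Padding positions get segment 0 and are masked out of attention.
    tokens += [pad_token] * pad_count
    token_type_ids += [0] * pad_count
    attention_mask += [0] * pad_count
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_layout(
    ["what", "is", "the", "capital", "?"], ["paris", "."], max_length=12
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

The three lists line up position by position, which is the same relationship that `input_ids`, `token_type_ids`, and `attention_mask` have in the tokenizer output.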


STEP 4: Test Tokenization

Action: Understand how tokenization works before preprocessing

In [3]:
# ============================================
# TOKENIZATION TESTING (Understanding Phase)
# ============================================

# Test on a simple example
test_question = "What is the capital of France?"
test_context = "Paris is the capital and largest city of France."

# Tokenize
encoded = tokenizer(
    test_question,
    test_context,
    max_length=50,
    truncation="only_second",  # Only truncate context if needed
    padding="max_length",
    return_offsets_mapping=True,  # CRITICAL for answer mapping!
    return_tensors="pt"
)

# Display results
print("🔍 TOKENIZATION TEST")
print("=" * 70)
print(f"\n**Original Question:** {test_question}")
print(f"**Original Context:** {test_context}")

print(f"\n**Tokenized Structure:**")
print(f"Input IDs shape: {encoded['input_ids'].shape}")
print(f"Attention Mask shape: {encoded['attention_mask'].shape}")
print(f"Offset Mapping shape: {encoded['offset_mapping'].shape}")

# Decode to see tokens
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
print(f"\n**Tokens (first 20):**")
for i, token in enumerate(tokens[:20]):
    print(f"  {i}: '{token}'")


🔍 TOKENIZATION TEST

**Original Question:** What is the capital of France?
**Original Context:** Paris is the capital and largest city of France.

**Tokenized Structure:**
Input IDs shape: torch.Size([1, 50])
Attention Mask shape: torch.Size([1, 50])
Offset Mapping shape: torch.Size([1, 50, 2])

**Tokens (first 20):**
  0: '[CLS]'
  1: 'what'
  2: 'is'
  3: 'the'
  4: 'capital'
  5: 'of'
  6: 'france'
  7: '?'
  8: '[SEP]'
  9: 'paris'
  10: 'is'
  11: 'the'
  12: 'capital'
  13: 'and'
  14: 'largest'
  15: 'city'
  16: 'of'
  17: 'france'
  18: '.'
  19: '[SEP]'
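
The `return_offsets_mapping=True` flag matters because every token then carries the `(start, end)` character span it came from in the original text. The rough sketch below uses a whitespace splitter as a stand-in for WordPiece. `simple_offsets` is a hypothetical helper, and real BERT offsets also split punctuation and sub-words, so the spans differ from what the tokenizer returns.

```python
import re


def simple_offsets(text):
    """Return (token, (start, end)) pairs using whitespace splitting.

    Stand-in for the tokenizer's offset mapping; not how WordPiece splits text.
    """
    return [(m.group().lower(), m.span()) for m in re.finditer(r"\S+", text)]


context = "Paris is the capital and largest city of France."
pairs = simple_offsets(context)

for index, (token, (start, end)) in enumerate(pairs):
    # Slicing the original string with the offsets recovers the original casing.
    print(index, token, (start, end), repr(context[start:end]))
```

Because the offsets index into the untouched context string, a character-level `answer_start` from SQuAD can be compared directly against them.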


STEP 5: Create Preprocessing Function

Question 2.2: Preprocess the dataset (5 marks)

In [4]:
# ============================================
# QUESTION 2.2: Create Preprocessing Function (5 marks)
# ============================================

def preprocess_function(examples):
    """
    Tokenize questions + contexts and map answer positions to token indices.

    CRITICAL: This function must correctly map character-level answer positions
    to token-level positions for BERT to learn properly.
    """

    # Tokenize questions and contexts
    tokenized_examples = tokenizer(
        examples['question'],
        examples['context'],
        max_length=384,  # Based on Day 1 analysis: 95%+ contexts fit
        truncation="only_second",  # Never truncate questions
        stride=128,  # Overlap for long contexts
        return_overflowing_tokens=True,
        return_offsets_mapping=True,  # CRITICAL for answer mapping!
        padding="max_length"
    )

    # Map each feature back to the example it was created from
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Initialize lists for start and end positions
    start_positions = []
    end_positions = []

    # Process each tokenized example
    for i, offsets in enumerate(offset_mapping):
        # Get the sample index this feature corresponds to
        sample_index = sample_mapping[i]
        answers = examples['answers'][sample_index]

        # If no answer, set positions to CLS token
        if len(answers['answer_start']) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue

        # Get answer start and end character positions
        start_char = answers['answer_start'][0]
        end_char = start_char + len(answers['text'][0])

        # Find the start and end token index
        # Token indices for context start after question + [SEP]
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Find context start and end
        context_start = 0
        while sequence_ids[context_start] != 1:  # 1 = context
            context_start += 1

        context_end = len(sequence_ids) - 1
        while sequence_ids[context_end] != 1:
            context_end -= 1

        # Check if answer is in this feature (not truncated)
        if not (offsets[context_start][0] <= start_char and
                offsets[context_end][1] >= end_char):
            # Answer is outside this feature, set to CLS
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Find token start position
            token_start = context_start
            while token_start <= context_end and offsets[token_start][0] <= start_char:
                token_start += 1
            start_positions.append(token_start - 1)

            # Find token end position
            token_end = context_end
            while token_end >= context_start and offsets[token_end][1] >= end_char:
                token_end -= 1
            end_positions.append(token_end + 1)

    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions

    return tokenized_examples

print("✅ Preprocessing function created!")


✅ Preprocessing function created!
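
The two `while` loops in the function above can be tried in isolation on a hand-written offset list. This is a minimal sketch, assuming offsets for context tokens only; `char_span_to_token_span` is a hypothetical name, and the offsets below were written by hand rather than produced by the BERT tokenizer.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices.

    offsets: list of (start, end) character spans, one per context token.
    Returns None when the answer is not fully covered by the tokens.
    """
    first, last = 0, len(offsets) - 1
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return None

    # Advance past every token that starts at or before the answer, then step back one.
    token_start = first
    while token_start <= last and offsets[token_start][0] <= start_char:
        token_start += 1

    # Walk backwards past every token that ends at or after the answer, then step forward one.
    token_end = last
    while token_end >= first and offsets[token_end][1] >= end_char:
        token_end -= 1

    return token_start - 1, token_end + 1


context = "Paris is the capital and largest city of France."
# Hand-written offsets for: paris is the capital and largest city of france .
offsets = [(0, 5), (6, 8), (9, 12), (13, 20), (21, 24),
           (25, 32), (33, 37), (38, 40), (41, 47), (47, 48)]

answer = "largest city"
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)
```

Slicing the context from the start offset of the first token to the end offset of the last token should give back the answer text, which is the property the validation step later relies on.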


### Preprocessing Function Key Features:
1. **max_length=384**: Covers 95%+ of contexts (from Day 1 analysis)
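2. **stride=128**: Consecutive windows of a long context share 128 tokens (one third of `max_length`), so an answer near a window boundary still appears whole in at least one feature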
3. **truncation="only_second"**: Preserves full questions
4. **return_offsets_mapping**: Maps character positions → token positions
5. **Answer position logic**: Handles truncated/out-of-bounds answers


STEP 6: Apply Preprocessing

Question 2.3: Apply preprocessing to datasets (5 marks)

In [5]:
# ============================================
# QUESTION 2.3: Apply Preprocessing (5 marks)
# ============================================

print("🔄 Starting tokenization (this may take 5-10 minutes)...")

# Apply preprocessing to training set
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names,
    desc="Tokenizing training set"
)

# Apply preprocessing to validation set
tokenized_val = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=val_dataset.column_names,
    desc="Tokenizing validation set"
)

print(f"\n✅ Tokenization complete!")
print(f"\n📊 Tokenized Training Set:")
print(tokenized_train)
print(f"\n📊 Tokenized Validation Set:")
print(tokenized_val)

# Check number of features created
print(f"\n📈 Dataset Expansion:")
print(f"   Original train: {len(train_dataset)} examples")
print(f"   Tokenized train: {len(tokenized_train)} features")
print(f"   Expansion ratio: {len(tokenized_train)/len(train_dataset):.2f}x")


🔄 Starting tokenization (this may take 5-10 minutes)...



✅ Tokenization complete!

📊 Tokenized Training Set:
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 3074
})

📊 Tokenized Validation Set:
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 520
})

📈 Dataset Expansion:
   Original train: 3000 examples
   Tokenized train: 3074 features
   Expansion ratio: 1.02x
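
An expansion ratio above 1.0 means that some examples were split into more than one feature. The sketch below shows how a feature-to-example mapping can be summarised; the `sample_mapping` values are made up for illustration and are not taken from this run.

```python
from collections import Counter

# Illustrative mapping only: feature i came from example sample_mapping[i].
# In the notebook, the tokenizer returns such a list under the key
# "overflow_to_sample_mapping" inside preprocess_function.
sample_mapping = [0, 1, 1, 2, 3, 3, 3, 4]

features_per_example = Counter(sample_mapping)
multi_feature_examples = {ex: n for ex, n in features_per_example.items() if n > 1}

expansion_ratio = len(sample_mapping) / len(features_per_example)
print(f"Expansion ratio: {expansion_ratio:.2f}x")
print(f"Examples split into several features: {multi_feature_examples}")
```

The practical consequence is that feature index and example index drift apart after the first split example, so any later lookup of the original answer has to go through this mapping.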


STEP 7: Validate Answer Position Mapping

In [6]:
# ============================================
# CRITICAL: ANSWER POSITION VALIDATION
# (Recommended enhancement from PDF)
# ============================================

def validate_answer_mapping(tokenized_dataset, original_dataset, num_samples=15,
                            max_length=384, stride=128):
    """
    Validate that token positions correctly map back to original answers.

    THIS PREVENTS TRAINING ON GARBAGE DATA!
    """
    num_samples = min(num_samples, len(tokenized_dataset))
    print("🔍 VALIDATING ANSWER POSITION MAPPING...")
    print("=" * 70)
    print(f"Testing the first {num_samples} features...\n")

    # Rebuild the feature -> example mapping. Long contexts yield several
    # features, so feature i is not always example i. max_length and stride
    # must match the values used in preprocess_function.
    sample_mapping = tokenizer(
        list(original_dataset['question']),
        list(original_dataset['context']),
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True
    )["overflow_to_sample_mapping"]

    # Track results
    passed = 0
    failed = 0
    errors = []

    for i in range(num_samples):
        try:
            # Get tokenized data
            feature = tokenized_dataset[i]
            input_ids = feature['input_ids']
            start_pos = feature['start_positions']
            end_pos = feature['end_positions']

            # Skip if the answer is not inside this feature (positions at CLS)
            if start_pos == 0 and end_pos == 0:
                print(f"Example {i+1}: ⚠️  SKIPPED (answer truncated)")
                continue

            # Decode the labelled answer span from the tokens
            predicted_tokens = input_ids[start_pos:end_pos+1]
            predicted_answer = tokenizer.decode(predicted_tokens).strip()

            # Look up the original example this feature was created from
            original_idx = sample_mapping[i]
            original_answer = original_dataset[original_idx]['answers']['text'][0].strip()

            # Normalize the original answer through the same tokenizer so that
            # casing, accents and punctuation spacing are comparable
            original_ids = tokenizer(original_answer, add_special_tokens=False)['input_ids']
            original_clean = tokenizer.decode(original_ids).strip()
            predicted_clean = predicted_answer

            # Check if match (the token span may be slightly wider than the answer)
            if predicted_clean == original_clean or original_clean in predicted_clean:
                print(f"Example {i+1}: ✅ PASS")
                print(f"  Original:  '{original_answer}'")
                print(f"  Predicted: '{predicted_answer}'")
                passed += 1
            else:
                print(f"Example {i+1}: ❌ FAIL")
                print(f"  Original:  '{original_answer}'")
                print(f"  Predicted: '{predicted_answer}'")
                print(f"  ⚠️  MISMATCH DETECTED!")
                failed += 1
                errors.append({
                    'example': i+1,
                    'original': original_answer,
                    'predicted': predicted_answer
                })
            print()

        except Exception as e:
            print(f"Example {i+1}: ⚠️  ERROR: {str(e)}\n")
            failed += 1
            errors.append({'example': i+1, 'error': str(e)})

    # Print summary
    print("=" * 70)
    print("📊 VALIDATION SUMMARY")
    print("=" * 70)
    print(f"✅ Passed: {passed}")
    print(f"❌ Failed: {failed}")
    checked = passed + failed
    if checked:
        print(f"📈 Success Rate: {passed / checked * 100:.1f}%")
    else:
        print("📈 Success Rate: n/a (every sampled feature was skipped)")

    if failed > 0:
        print(f"\n⚠️  WARNING: {failed} validations failed!")
        print("⚠️  DO NOT PROCEED TO TRAINING UNTIL THIS IS FIXED!")
        print("\nDebug the preprocessing function before continuing.")
    else:
        print("\n🎉 ALL VALIDATIONS PASSED!")
        print("✅ Safe to proceed to model training!")

    return passed, failed, errors

# RUN VALIDATION
passed, failed, errors = validate_answer_mapping(
    tokenized_train,
    train_dataset,
    num_samples=15
)


🔍 VALIDATING ANSWER POSITION MAPPING...
Testing the first 15 features...

Example 1: ✅ PASS
  Original:  'Saint Bernadette Soubirous'
  Predicted: 'saint bernadette soubirous'

Example 2: ✅ PASS
  Original:  'a copper statue of Christ'
  Predicted: 'a copper statue of christ'

Example 3: ✅ PASS
  Original:  'the Main Building'
  Predicted: 'the main building'

Example 4: ✅ PASS
  Original:  'a Marian place of prayer and reflection'
  Predicted: 'a marian place of prayer and reflection'

Example 5: ✅ PASS
  Original:  'a golden statue of the Virgin Mary'
  Predicted: 'a golden statue of the virgin mary'

Example 6: ✅ PASS
  Original:  'September 1876'
  Predicted: 'september 1876'

Example 7: ✅ PASS
  Original:  'twice'
  Predicted: 'twice'

Example 8: ✅ PASS
  Original:  'The Observer'
  Predicted: 'the observer'

Example 9: ✅ PASS
  Original:  'three'
  Predicted: 'three'

Example 10: ✅ PASS
  Original:  '1987'
  Predicted: '1987'

Example 11: ✅ PASS
  Original:  'Rome'
  Predicted: 'rome'

STEP 8: Answer PyTorch Tensor Question

Question 2.4: PyTorch tensor conversion (5 marks)

## Question 2.4: Is PyTorch Tensor Conversion Necessary? (5 marks)

### Answer: **NO, conversion is NOT required**

### Explanation:

The Hugging Face `Trainer` API **handles tensor conversion internally**. When we pass the tokenized datasets to the `Trainer` in the next step, the following happens automatically:

1. **Batching**: The `Trainer` builds its own `DataLoader` and uses a data collator to assemble each batch
2. **Automatic Conversion**: The collator converts the lists of token IDs and positions in each batch to PyTorch tensors on the fly
3. **Memory Efficiency**: Only the current batch is converted, not the whole dataset

### When Manual Conversion Would Be Needed:
Manual conversion is only necessary when:
- Using custom PyTorch `DataLoader` directly (not using `Trainer`)
- Implementing custom training loops
- Working outside the Hugging Face ecosystem

**Conclusion**: For our project using `Trainer`, manual conversion is unnecessary and would add no benefit.
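
For contrast, the sketch below shows roughly what the manual route looks like when a PyTorch `DataLoader` is used directly instead of `Trainer`. The feature values are made up, and `collate_to_tensors` is a hypothetical collate function written for illustration; it assumes every feature is already padded to the same length, as `padding="max_length"` guarantees in this notebook.

```python
import torch
from torch.utils.data import DataLoader

# Toy features standing in for rows of tokenized_train (values are made up).
features = [
    {"input_ids": [101, 2054, 102, 0], "attention_mask": [1, 1, 1, 0],
     "start_positions": 1, "end_positions": 1},
    {"input_ids": [101, 3000, 2003, 102], "attention_mask": [1, 1, 1, 1],
     "start_positions": 2, "end_positions": 2},
    {"input_ids": [101, 4199, 102, 0], "attention_mask": [1, 1, 1, 0],
     "start_positions": 1, "end_positions": 1},
]


def collate_to_tensors(batch):
    """Stack a list of equally sized feature dicts into one dict of tensors."""
    return {key: torch.tensor([row[key] for row in batch]) for key in batch[0]}


loader = DataLoader(features, batch_size=2, collate_fn=collate_to_tensors)
batch = next(iter(loader))
print({key: tuple(value.shape) for key, value in batch.items()})
```

With a `datasets.Dataset`, calling `set_format("torch")` on the tokenized dataset is another way to get tensors out of it; either way, this is work that `Trainer` makes unnecessary.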


In [7]:
# Verify dataset is ready for training
print(f"✅ Training features: {len(tokenized_train)}")
print(f"✅ Validation features: {len(tokenized_val)}")
print(f"✅ Feature keys: {tokenized_train.column_names}")


✅ Training features: 3074
✅ Validation features: 520
✅ Feature keys: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
