# Vocabulary, Encoding & Padding

## 🎯 Concept Primer
Build a vocabulary, encode words as integers, and pad or truncate each sequence to a fixed length.

**Expected:** Encoded sequences [N, max_len], vocab size

## 📋 Objectives
1. Build vocabulary from training data
2. Encode tokens as integers
3. Pad/truncate to fixed length
4. Create DataLoader

## 🔧 Setup

In [1]:
# TODO 1: Import libraries
import torch
from torch.utils.data import Dataset, DataLoader
from collections import Counter

## 📚 Build Vocabulary

### TODO 2: Create word-to-index mapping

**Expected:** vocab_size, word2idx dict

In [2]:
# TODO 2: Build vocab
import pandas as pd
import re
from collections import Counter

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

df = pd.read_csv('../data/processed/specialty_taxonomy_v1.csv')

df['text_clean'] = df['text'].apply(clean_text)

df['tokens'] = df['text_clean'].apply(lambda x: x.split())
df['token_count'] = df['tokens'].apply(len)

words = [word for tokens in df['tokens'] for word in tokens]
vocab = Counter(words)
word2idx = {
    "<PAD>": 0,
    "<UNK>": 1,
}

VOCAB_SIZE = 15000  # Total entries, including <PAD> and <UNK>; adjust later if needed

# Keep the top VOCAB_SIZE - 2 words (two slots are reserved for <PAD> and <UNK>)
most_common_words = vocab.most_common(VOCAB_SIZE - 2)

for idx, (word, count) in enumerate(most_common_words):
    word2idx[word] = idx + 2  # Start at 2 (after <PAD> and <UNK>)

print(f"Vocabulary size: {len(word2idx)}")
print(f"Total unique words in corpus: {len(vocab)}")
print(f"\nTop 10 most common words:")
print(vocab.most_common(10))
print(f"\nLeast common words in vocab:")
print(most_common_words[-5:])  # Show the last 5 words you kept




Vocabulary size: 15000
Total unique words in corpus: 30644

Top 10 most common words:
[('the', 191081), ('of', 125100), ('and', 81675), ('a', 76784), ('in', 69925), ('to', 57829), ('is', 57530), ('or', 44890), ('are', 30766), ('that', 30339)]

Least common words in vocab:
[('antiquitin', 4), ('vrk', 4), ('correlates', 4), ('clonal', 4), ('upwards', 4)]


## 🔢 Encode Sequences

### TODO 3: Convert tokens to integers

**Expected:** Encoded lists of integers

In [3]:
# TODO 3: Encode
def encode(tokens):
    encoded = []
    for token in tokens:
        idx = word2idx.get(token, word2idx['<UNK>'])
        encoded.append(idx)
    return encoded

df['encoded'] = df['tokens'].apply(encode)

print(df.head())
print("Number of samples containing at least one <UNK> token:")
print(df["encoded"].apply(lambda seq: word2idx['<UNK>'] in seq).sum())


   id                                               text      specialty  \
0   0  Glaucoma is a group of diseases that can damag...  Ophthalmology   
1   1  Nearly 2.7 million people have glaucoma, a lea...  Ophthalmology   
2   2  Symptoms of Glaucoma  Glaucoma can develop in ...  Ophthalmology   
3   3  Although open-angle glaucoma cannot be cured, ...  Ophthalmology   
4   4  Glaucoma is a group of diseases that can damag...  Ophthalmology   

                                          text_clean  \
0  glaucoma is a group of diseases that can damag...   
1  nearly  million people have glaucoma a leading...   
2  symptoms of glaucoma  glaucoma can develop in ...   
3  although openangle glaucoma cannot be cured it...   
4  glaucoma is a group of diseases that can damag...   

                                              tokens  token_count  \
0  [glaucoma, is, a, group, of, diseases, that, c...          319   
1  [nearly, million, people, have, glaucoma, a, l...          192   
2  [s

## 📏 Pad Sequences

### TODO 4: Pad/truncate to fixed length

**Expected:** All sequences same length (here, 512)

In [6]:
# TODO 4: Pad
max_len = 512 # So we can use BERT

def pad_sequence(seq, max_len):
    if len(seq) > max_len:
        return seq[:max_len]
    return seq + [0] * (max_len - len(seq))

df['padded'] = df['encoded'].apply(lambda x: pad_sequence(x, max_len))

#df["specialty"] = df["specialty"].str.lower()

unique_specialities = df['specialty'].unique()
label2idx = {label: idx for idx, label in enumerate(unique_specialities)}

print(label2idx)

df['label_encoded'] = df['specialty'].map(label2idx)

print(f"Any missing encodings? {df['label_encoded'].isna().sum()}")
print(df[['specialty', 'label_encoded']].head(25))

# Test cases:
test_seq_short = [1, 2, 3]            # Too short: gets padded
test_seq_long = list(range(600))      # Too long: gets truncated
test_seq_perfect = list(range(512))   # Exactly max_len: unchanged

print("Short:", len(pad_sequence(test_seq_short, max_len)))      # Should be 512
print("Long:", len(pad_sequence(test_seq_long, max_len)))        # Should be 512
print("Perfect:", len(pad_sequence(test_seq_perfect, max_len)))  # Should be 512

# Check the values
print("Short example:", pad_sequence(test_seq_short, max_len)[:10])  # First 10
print("Short example:", pad_sequence(test_seq_short, max_len)[-10:]) # Last 10 (should be zeros!)

{'Ophthalmology': 0, 'Endocrinology & Diabetes': 1, 'Oncology': 2, 'Infectious Diseases': 3, 'Cardiology & Vascular': 4, 'Nephrology & Urology': 5, 'Obstetrics & Gynecology': 6, 'Neurology & Neurosurgery': 7, 'Molecular Genetics & Mechanisms': 8, 'General Health & Prevention': 9, 'Pediatrics & Congenital Disorders': 10, 'Genetic & Chromosomal Syndromes': 11, 'Rare Genetic Disorders': 12}
Any missing encodings? 0
                   specialty  label_encoded
0              Ophthalmology              0
1              Ophthalmology              0
2              Ophthalmology              0
3              Ophthalmology              0
4              Ophthalmology              0
5              Ophthalmology              0
6              Ophthalmology              0
7              Ophthalmology              0
8              Ophthalmology              0
9              Ophthalmology              0
10             Ophthalmology              0
11             Ophthalmology              0
12          

## 🔄 Create DataLoader

### TODO 5: Setup PyTorch DataLoader

**Expected:** DataLoader with batch_size=32

In [7]:
# TODO 5: DataLoader
from torch.utils.data import Dataset, DataLoader

texts = df['padded'].tolist()
labels = df['label_encoded'].tolist()

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        return torch.tensor(text, dtype=torch.long), torch.tensor(label, dtype=torch.long)


dataset = TextDataset(texts, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_texts, batch_labels in dataloader:
    print(f"Text dtype: {batch_texts.dtype}")      # Should be: torch.int64
    print(f"Label dtype: {batch_labels.dtype}")    # Should be: torch.int64
    print(f"Text shape: {batch_texts.shape}")      # Should be: [batch_size, 512]
    print(f"Label shape: {batch_labels.shape}")    # Should be: [batch_size]
    break




Text dtype: torch.int64
Label dtype: torch.int64
Text shape: torch.Size([32, 512])
Label shape: torch.Size([32])


## 🤔 Reflection
1. Vocab size? Any OOV issues?
2. Max length choice?

### 1. Vocab size? Any OOV issues?

**Vocabulary Size: 15,000 entries (14,998 words plus `<PAD>` and `<UNK>`)**

**Decision Rationale:**
- Total unique words in corpus: 30,644
- Kept the 14,998 most frequent words plus `<PAD>` and `<UNK>` (about 49% of unique words; this counts word types, not token occurrences)
- Excluded 15,646 rare words (about 51% of unique words)

**Why 15,000?**
- **Balance:** Large enough to capture medical terminology, small enough for efficient training
- **Medical domain:** Needs more vocabulary than general text due to specialized terminology
- **Diminishing returns:** Words beyond top 15K appear ≤4 times, offering little signal
- **Model efficiency:** Smaller embedding matrix (15K × 768) vs. full vocab (30K × 768)
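
The embedding-size comparison above can be checked with a quick calculation. This is a minimal sketch: the 768-dimensional width is the figure quoted in the bullet, while float32 storage and the helper name `embedding_params` are assumptions made for illustration.

```python
# Back-of-the-envelope embedding sizes.
# EMBED_DIM comes from the "15K x 768" figure above; float32 storage is an assumption.
EMBED_DIM = 768
BYTES_PER_FLOAT32 = 4


def embedding_params(vocab_size: int, embed_dim: int = EMBED_DIM) -> int:
    """Number of weights in a vocab_size x embed_dim embedding matrix."""
    return vocab_size * embed_dim


for name, size in [("capped vocab", 15_000), ("full vocab", 30_644)]:
    params = embedding_params(size)
    megabytes = params * BYTES_PER_FLOAT32 / 1e6
    print(f"{name}: {params:,} parameters, about {megabytes:.1f} MB in float32")
```

Under these assumptions the capped vocabulary needs roughly half the embedding weights of the full one (11,520,000 vs. 23,534,592).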

**OOV (Out Of Vocabulary) Analysis:**
- **48% of samples (7,905/16,407)** contain at least one <UNK> token
- This is expected and acceptable because:
  - Rare words (appearing at most 4 times) provide little learning signal
  - Medical jargon variations often convey same meaning (e.g., "hyperglycemia" vs "high blood sugar")
  - Model can infer meaning from context even with some <UNK> tokens
  - Typos and formatting artifacts are naturally filtered out

**Trade-offs Considered:**
- Smaller vocab (5K-10K): Would miss important medical terms, higher UNK rate
- Larger vocab (20K-25K): Marginal benefit (rare words), larger model, slower training
- Full vocab (30K): Includes noise (typos, OCR errors), memory-intensive

**Validation Strategy:**
- Monitor model performance to see if 15K is sufficient
- Can increase to 20K if UNK rate impacts accuracy
- Least common words kept: appeared 4 times (e.g., 'antiquitin', 'vrk', 'correlates')
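
The 48% figure above counts samples that contain at least one `<UNK>`. The share of individual tokens mapped to `<UNK>` is a different number and is usually much smaller. A minimal sketch of computing both follows; the helper name `oov_stats` and the toy sequences are hypothetical, and in the notebook the input would presumably be `df['encoded'].tolist()`.

```python
def oov_stats(encoded_sequences, unk_idx=1):
    """Return (share of tokens that are <UNK>, share of sequences containing any <UNK>)."""
    total_tokens = sum(len(seq) for seq in encoded_sequences)
    unk_tokens = sum(seq.count(unk_idx) for seq in encoded_sequences)
    seqs_with_unk = sum(unk_idx in seq for seq in encoded_sequences)
    token_rate = unk_tokens / total_tokens if total_tokens else 0.0
    sample_rate = seqs_with_unk / len(encoded_sequences) if encoded_sequences else 0.0
    return token_rate, sample_rate


# Toy data for illustration only (index 1 stands for <UNK>, as in word2idx).
toy = [[2, 3, 1, 4], [5, 6, 7], [1, 1, 8, 9, 10]]
token_rate, sample_rate = oov_stats(toy)
print(f"Token-level <UNK> rate: {token_rate:.1%}")
print(f"Samples with at least one <UNK>: {sample_rate:.1%}")
```

Reporting both rates makes it easier to judge whether raising the vocabulary size to 20K is worth the larger embedding matrix.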

---

### 2. Max length choice?

**Max Sequence Length: 512 tokens**

**Decision Rationale (from Notebook 02 analysis):**
- **95th percentile:** 499 tokens ← Only 5% of texts are longer
- **BERT's native max:** 512 tokens (no custom configuration needed)
- **Data retention:** 95% of texts captured fully (15,587 samples)
- **Truncation impact:** Only 5% truncated (820 samples) - acceptable loss

**Why 512 is optimal:**
1. ✅ **Evidence-based:** Aligns with our token distribution analysis
2. ✅ **Standard:** BERT and similar transformer models are trained with a maximum length of 512
3. ✅ **Memory efficient:** About 8x fewer positions per sequence than padding to the longest text (4,183 tokens)
4. ✅ **Performance:** Sufficient context for specialty classification
5. ✅ **Practical:** A batch of 32 × 512 = 16,384 token IDs keeps memory usage manageable

**Alternative considered (not chosen):**
- max_len = 256: Too short, would truncate 25% of data, lose information
- max_len = 1024: Only capture 1% more data, double memory usage, slower training
- max_len = 499 (exact 95th): Non-standard, minimal benefit vs. 512
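
The comparison of candidate lengths can be reproduced from the token counts alone. The sketch below is one way to do it, not the analysis from Notebook 02: the helper name `length_report` and the toy lengths are hypothetical, and in the notebook the input would presumably be `df['token_count'].tolist()`.

```python
def length_report(token_counts, candidates=(256, 512, 1024)):
    """For each candidate max_len, return (share of texts truncated, share of padded positions)."""
    n = len(token_counts)
    report = {}
    for max_len in candidates:
        # A text is truncated when it has more tokens than max_len.
        truncated = sum(count > max_len for count in token_counts) / n
        # Padding positions are the unused slots in texts shorter than max_len.
        pad_positions = sum(max(max_len - count, 0) for count in token_counts)
        report[max_len] = (truncated, pad_positions / (max_len * n))
    return report


# Toy lengths for illustration only.
toy_counts = [100, 200, 300, 600]
report = length_report(toy_counts)
for max_len, (truncated, padded) in report.items():
    print(f"max_len={max_len}: {truncated:.0%} truncated, {padded:.0%} of positions are padding")
```

Longer limits truncate fewer texts but spend more of each batch on padding, which is the trade-off weighed in the list above.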

**Impact:**
- **Padding overhead:** About 95% of texts are shorter than 512 tokens and need at least some padding
- **Truncation:** 5% of texts lose information (long medical explanations)
- **Trade-off:** Accepted minor information loss for computational efficiency
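
Whether the truncated 5% is spread evenly across classes can be checked with a per-label breakdown. This is a hedged sketch: the helper name `truncation_rate_by_label` and the toy data are hypothetical, and in the notebook the inputs would presumably be `df['token_count']` and `df['specialty']`.

```python
from collections import defaultdict


def truncation_rate_by_label(token_counts, labels, max_len=512):
    """Return {label: share of that label's texts longer than max_len}."""
    totals = defaultdict(int)
    truncated = defaultdict(int)
    for count, label in zip(token_counts, labels):
        totals[label] += 1
        if count > max_len:
            truncated[label] += 1
    return {label: truncated[label] / totals[label] for label in totals}


# Toy data for illustration only.
toy_counts = [100, 600, 300, 50, 900]
toy_labels = ["Oncology", "Oncology", "Ophthalmology", "Ophthalmology", "Oncology"]
rates = truncation_rate_by_label(toy_counts, toy_labels)
for label, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {rate:.0%} of texts truncated")
```

A specialty with a rate far above the overall 5% would be a candidate for closer inspection before training.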

**Next steps:**
- Monitor if truncated texts come from specific specialties (potential bias)
- Consider attention masks to help model ignore padding tokens
- Evaluate model performance to validate max_len choice
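
The attention-mask idea can be sketched as a small variant of `pad_sequence` that also records which positions hold real tokens. The helper name `pad_with_mask` is hypothetical; the only assumption carried over from the notebook is that index 0 is reserved for `<PAD>`.

```python
def pad_with_mask(seq, max_len, pad_idx=0):
    """Pad or truncate seq to max_len; return (ids, mask) where mask is 1 for real tokens, 0 for padding."""
    ids = seq[:max_len]
    n_real = len(ids)
    ids = ids + [pad_idx] * (max_len - n_real)
    mask = [1] * n_real + [0] * (max_len - n_real)
    return ids, mask


ids, mask = pad_with_mask([7, 8, 9], max_len=5)
print(ids)   # [7, 8, 9, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

Building the mask from the sequence length rather than from `id != 0` keeps it correct even if the padding index changes. In `TextDataset.__getitem__`, the mask could be returned as a third tensor alongside the text and label.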

## 📌 Summary
✅ Vocab built  
✅ Sequences encoded  
✅ Padding applied  
✅ DataLoader ready

**Next:** `04_baseline_classifier.ipynb`