# 📚 Notebook 03: Vocabulary Building & Encoding

## From Words to Numbers

This notebook teaches you how to build a vocabulary dictionary that maps words to integers, then use that vocabulary to encode your tokenized text into numerical representations that neural networks can process.


## 🧠 Concept Primer: Vocabulary and Encoding

### What We're Doing
Building a vocabulary dictionary that maps words to integers, then encoding tokenized text into numerical sequences.

### Why This Step is Critical
**Neural networks require numerical inputs.** We need to:
- **Map words to integers** (vocabulary dictionary)
- **Handle unknown words** (special `<UNK>` token)
- **Manage sequence length** (special `<PAD>` token)
- **Limit vocabulary size** (the 1000 most common words, plus the two special tokens)

### What We'll Build
- **Word frequency counting** using `collections.Counter`
- **Vocabulary dictionary** with special tokens at positions 0 and 1
- **Encoding function** that converts tokens to integers
- **Out-of-vocabulary handling** (unknown words → `<UNK>`)

### Special Tokens Strategy
- `<PAD>` = 0 (padding for fixed-length sequences)
- `<UNK>` = 1 (unknown/rare words)
- Regular words = 2, 3, 4, ... (most frequent first)

### Expected Output Example
```python
encode_text("This movie was great", vocab)
# Returns: [45, 12, 8, 203] (actual numbers depend on your vocab)
```
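
To make the strategy concrete, here is a minimal sketch on a toy corpus. The helper names `build_vocab` and `pad_to_length` and the toy sentences are assumptions made for illustration, not part of the notebook's tasks. The sketch assumes the ID layout that the TODO #2 solution cell uses (`<PAD>` = 0, `<UNK>` = 1).

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(token_lists, max_words):
    """Map the `max_words` most frequent tokens to IDs 2, 3, 4, ..."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, UNK: 1}  # IDs 0 and 1 are reserved for the special tokens
    for idx, (word, _) in enumerate(counts.most_common(max_words)):
        vocab[word] = idx + 2
    return vocab


def pad_to_length(ids, length, pad_id):
    """Truncate or right-pad a list of IDs to exactly `length` items."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


# Toy corpus (hypothetical): "the" appears 3 times, "movie" twice, the rest once.
toy_corpus = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "flopped"],
    ["the", "end"],
]
toy_vocab = build_vocab(toy_corpus, max_words=2)
print(toy_vocab)  # {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'movie': 3}

# Words outside the vocabulary fall back to the <UNK> ID.
encoded = [toy_vocab.get(tok, toy_vocab[UNK]) for tok in ["the", "movie", "was", "awful"]]
print(encoded)  # [2, 3, 1, 1]

padded = pad_to_length(encoded, length=6, pad_id=toy_vocab[PAD])
print(padded)  # [2, 3, 1, 1, 0, 0]
```

Right-padding is only one possible choice. Some models pad on the left instead, so treat the padding side as a design decision.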


## 🔧 TODO #1: Build Word Frequencies

**Task:** Count word frequencies across all tokenized reviews to identify the most common words.

**Hint:** Flatten `tokenized_corpus` into a single list, then use `Counter(combined_corpus)`

**Expected Variables:**
- `combined_corpus` → Flat list of all tokens from all reviews
- `word_freqs` → Counter object with word frequencies

**Expected Output:** You should see the most frequent words when you print `word_freqs.most_common(10)`
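
As a small illustration of the hint, the sketch below flattens a toy list of token lists and counts the result. The toy data and the names `toy_tokenized`, `flat_tokens`, and `toy_freqs` are assumptions. `itertools.chain.from_iterable` is one of several ways to flatten, and a plain loop with `extend` works just as well.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized reviews: one inner list per review.
toy_tokenized = [["good", "film"], ["bad", "film"], ["good", "good", "cast"]]

# Flatten the nested lists into a single list of tokens.
flat_tokens = list(chain.from_iterable(toy_tokenized))

# Count how often each token occurs across the whole corpus.
toy_freqs = Counter(flat_tokens)
print(toy_freqs.most_common(2))  # [('good', 3), ('film', 2)]
```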


In [7]:
# TODO #1: Build word frequencies
from collections import Counter
import re

# Your code here
import pandas as pd

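def tokenize(text):
    # Lowercase, then extract runs of word characters (letters, digits, underscore).
    # Punctuation is dropped, so contractions split: "it's" -> ['it', 's'].
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens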


train_reviews_df = pd.read_csv('../data/imdb_movie_reviews_train.csv')
test_reviews_df = pd.read_csv('../data/imdb_movie_reviews_test.csv')

tokenized_corpus_train = train_reviews_df['review'].apply(tokenize).tolist() 
tokenized_corpus_test = test_reviews_df['review'].apply(tokenize).tolist()

print(tokenized_corpus_train[:1])  # Print first tokenized review from training set

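# Flatten the per-review token lists into one list.
# Only the training split is counted, so the vocabulary never sees test data.
combined_corpus = []

for review_tokens in tokenized_corpus_train:
    combined_corpus.extend(review_tokens)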

print(combined_corpus[:20])  # Print first 20 words from combined corpus

word_freqs = Counter(combined_corpus)
print(word_freqs.most_common(10))  # Print 10 most common words


[['ibiza', 'filming', 'location', 'looks', 'very', 'enchanting']]
['ibiza', 'filming', 'location', 'looks', 'very', 'enchanting', 'randolph', 'scott', 'always', 'played', 'men', 'you', 'could', 'look', 'up', 'to', 'for', 'their', 'sense', 'of']
[('the', 732), ('a', 307), ('and', 306), ('of', 296), ('is', 218), ('to', 213), ('in', 177), ('it', 134), ('s', 109), ('that', 105)]


## 🔧 TODO #2: Create Vocabulary Dictionary

**Task:** Build a vocabulary dictionary from the 1000 most common words plus the two special tokens.

**Hint:** Use `word_freqs.most_common(1000)` to get top words, then create vocab with `{word: idx+2 for idx, (word, _) in enumerate(most_common_words)}`, then add special tokens.

**Expected Variables:**
- `MAX_VOCAB_SIZE = 1000`
- `most_common_words` → List of (word, count) tuples
- `vocab` → Dictionary mapping words to integers

**Expected Output:** `len(vocab) == 1002` (1000 words + 2 special tokens)


In [10]:
# TODO #2: Create vocabulary dictionary
# Your code here
MAX_VOCAB_SIZE = 1000  # Number of regular words; <PAD> and <UNK> are added on top
most_common_words = word_freqs.most_common(MAX_VOCAB_SIZE)
vocab = {'<PAD>': 0, '<UNK>': 1, **{word: idx + 2 for idx, (word, _) in enumerate(most_common_words)}}
print(list(vocab.items())[:10])  # Print first 10 items in vocabulary dictionary
vocab_size = len(vocab)
print(f'Vocabulary size: {vocab_size}')  # Print vocabulary size

[('<PAD>', 0), ('<UNK>', 1), ('the', 2), ('a', 3), ('and', 4), ('of', 5), ('is', 6), ('to', 7), ('in', 8), ('it', 9)]
Vocabulary size: 1002


## 🔧 TODO #3: Implement Encoding Function

**Task:** Create a function that converts text to a list of integers using the vocabulary.

**Hint:** Tokenize first with your existing function, then use `[vocab.get(token, vocab['<UNK>']) for token in tokens]` so that unknown words map to the `<UNK>` ID.

**Expected Function Signature:**
```python
def encode_text(text, vocab):
    # Your implementation here
    return encoded_ids  # List of integers
```

**Expected Output Example:**
```python
encode_text("This movie was great", vocab)
# Returns: [45, 12, 8, 203] (numbers will vary based on your vocab)
```


In [13]:
# TODO #3: Implement encoding function
# Your code here
def encode_text(text, vocab):
    tokens = tokenize(text)
    encoded = [vocab.get(token, vocab['<UNK>']) for token in tokens]
    return encoded

encoded_reviews_train = train_reviews_df['review'].apply(lambda x: encode_text(x, vocab)).tolist()
encoded_reviews_test = test_reviews_df['review'].apply(lambda x: encode_text(x, vocab)).tolist()

num_unk = sum(token == vocab['<UNK>'] for review in encoded_reviews_train for token in review)
print(f'Number of <UNK> tokens in training set: {num_unk}')  # Print number of <UNK> tokens
print(f"Percentage of <UNK> tokens in training set: {num_unk / sum(len(review) for review in encoded_reviews_train) * 100:.2f}%")

print(encoded_reviews_train[:1])  # Print first encoded review from training set
print(encoded_reviews_test[:1])  # Print first encoded review from test set

Number of <UNK> tokens in training set: 2027
Percentage of <UNK> tokens in training set: 18.86%
[[1, 252, 591, 114, 26, 1]]
[[2, 91, 6, 1, 9, 34, 440, 125, 170, 1]]
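
To read encoded output like the lists above, one option is to invert the vocabulary and map IDs back to tokens. The sketch below assumes a hypothetical helper `decode_ids` and a toy vocabulary. It is an inspection aid, not one of the notebook's required tasks.

```python
def decode_ids(ids, vocab):
    """Map integer IDs back to tokens using the inverse of `vocab`."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[i] for i in ids]


# Toy vocabulary (hypothetical) using the same <PAD>/<UNK> layout as `vocab`.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "location": 3, "looks": 4}
print(decode_ids([1, 2, 3, 4, 1, 0], toy_vocab))
# ['<UNK>', 'the', 'location', 'looks', '<UNK>', '<PAD>']
```

Decoding does not recover the original words behind `<UNK>`, which shows that the encoding is lossy for out-of-vocabulary tokens.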


## 📝 Reflection Prompts

### 🤔 Understanding Check
1. **Why start vocab IDs at 2?** What would happen if you started at 0?

2. **What happens if test data has many unknown words?** How does this affect model performance?

3. **Why limit vocabulary to 1000 words?** What's the tradeoff between vocabulary size and model performance?

4. **How does the `<UNK>` token help with generalization?** What would happen without it?

### 🎯 Vocabulary Quality
- Are the most frequent words what you'd expect for movie reviews?
- How many words in your vocabulary do you recognize as movie-related?
- What percentage of tokens will become `<UNK>` in unseen text?

---

**Write your reflections here:**


## 📝 My Reflections

### 🤔 Understanding Check Answers

1. **Why start vocab IDs at 2?**
   - IDs 0 and 1 are reserved for the special tokens `<PAD>` and `<UNK>`
   - If regular words started at 0, the two most frequent words would share IDs with the special tokens, so the model could not tell "the" from padding
   - Starting at 2 keeps regular words from colliding with the special tokens

2. **What happens if test data has many unknown words?**
   - Every unknown word collapses to the same `<UNK>` ID, so the model cannot distinguish between them and its predictions degrade
   - The words that matter most for classification might not be in the vocabulary
   - This is why the training data used to build the vocabulary should be representative of the text seen at inference

3. **Why limit vocabulary to 1000 words?**
   - The 1000 most frequent words already cover about 81% of the training tokens (18.86% become `<UNK>`)
   - Filtering out rare words removes noise from training
   - It balances vocabulary coverage against computational cost
   - It prevents overfitting to very rare words that might not generalize

4. **How does the `<UNK>` token help with generalization?**
   - `<UNK>` is less about generalization and more about handling edge cases
   - Without it, a direct lookup such as `vocab[token]` would raise a `KeyError` for any word outside the vocabulary during inference
   - It provides a consistent fallback ID for words not seen during training
   - Most of the generalization comes from having a diverse, representative vocabulary
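
The vocabulary-size tradeoff from question 3 can be checked numerically. The sketch below is a minimal illustration that assumes a hypothetical helper `coverage` and toy counts. On the training data, `1 - coverage(word_freqs, 1000)` should equal the `<UNK>` rate printed earlier, because every training token outside the top 1000 is encoded as `<UNK>`.

```python
from collections import Counter


def coverage(word_freqs, top_k):
    """Fraction of all token occurrences covered by the top_k most frequent words."""
    total = sum(word_freqs.values())
    covered = sum(count for _, count in word_freqs.most_common(top_k))
    return covered / total


# Toy frequencies (hypothetical) that sum to 100 for easy reading.
toy_freqs = Counter({"the": 50, "movie": 20, "great": 15, "plot": 10, "enchanting": 5})
for k in (1, 3, 5):
    print(k, f"{coverage(toy_freqs, k):.2f}")
# 1 0.50
# 3 0.85
# 5 1.00
```

Running `coverage(word_freqs, k)` for several values of `k` on the real counts shows how quickly the gain from each additional word shrinks.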

### 🎯 Vocabulary Quality Assessment

**Most frequent words analysis:**
- The most frequent words are indeed what I expected: mainly stopwords ("the", "a", "and", "of", "is")
- These are common English words that appear frequently across all types of text
- Stopwords provide grammatical structure but less semantic meaning for classification

**Movie-related vocabulary:**
- I don't have a specific count of movie-related words, but I'd be interested to learn how to analyze this
- Could potentially filter vocabulary for domain-specific terms related to cinematography, characters, story
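
One simple way to start this analysis is to intersect the vocabulary with a hand-picked list of domain terms. In the sketch below, the seed set `MOVIE_TERMS` and the helper `movie_terms_in_vocab` are hypothetical. The result depends entirely on which terms go into the seed list.

```python
# Hypothetical, hand-picked seed list; extend it to suit your own analysis.
MOVIE_TERMS = {"movie", "film", "plot", "actor", "director", "scene", "cast"}


def movie_terms_in_vocab(vocab, seed_terms=MOVIE_TERMS):
    """Return the seed terms present in `vocab`, ordered by ID (most frequent first)."""
    return sorted((word for word in vocab if word in seed_terms), key=vocab.get)


# Toy vocabulary (hypothetical) with the same layout as the notebook's `vocab`.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "film": 3, "was": 4, "plot": 5, "thin": 6}
found = movie_terms_in_vocab(toy_vocab)
print(found)  # ['film', 'plot']
print(f"{len(found)} of {len(MOVIE_TERMS)} seed terms are in the vocabulary")
```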

**Unknown word prediction:**
- The training set itself already has 18.86% `<UNK>` tokens, and unseen text should fare worse; my guess is at least 50% unknown words
- This is because our vocabulary only covers the top 1000 most frequent words from training
- Many domain-specific terms, proper nouns, and less common words will become `<UNK>`
- This limitation will be interesting to compare with TinyBERT's subword approach
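
This estimate can be measured rather than guessed. The sketch below assumes a hypothetical helper `unk_rate` and toy data. In this notebook it could be called as `unk_rate(tokenized_corpus_test, vocab)` to get the out-of-vocabulary rate on the test split.

```python
def unk_rate(token_lists, vocab):
    """Fraction of tokens across `token_lists` that are missing from `vocab`."""
    total = sum(len(tokens) for tokens in token_lists)
    if total == 0:
        return 0.0  # Avoid dividing by zero on an empty corpus
    unknown = sum(1 for tokens in token_lists for tok in tokens if tok not in vocab)
    return unknown / total


# Toy vocabulary and held-out reviews (both hypothetical).
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "movie": 3}
held_out = [["the", "movie", "dazzles"], ["a", "gem"]]
print(f"{unk_rate(held_out, toy_vocab) * 100:.2f}%")  # 60.00%
```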

**Key insight:** The vocabulary limitation highlights why transformer models with subword tokenization often perform better: they can break unknown words into meaningful subword pieces rather than losing all information to `<UNK>` tokens.
