# 📚 Notebook 03: Vocabulary Building & Encoding

## From Words to Numbers

This notebook teaches you how to build a vocabulary dictionary that maps words to integers, then use that vocabulary to encode your tokenized text into numerical representations that neural networks can process.


## 🧠 Concept Primer: Vocabulary and Encoding

### What We're Doing
We build a vocabulary (a dictionary that maps each word to an integer ID), then use it to encode tokenized text as sequences of integers.

### Why This Step is Critical
**Neural networks require numerical inputs.** We need to:
- **Map words to integers** (vocabulary dictionary)
- **Handle unknown words** (special `<unk>` token)
- **Manage sequence length** (special `<pad>` token)
- **Limit vocabulary size** (the 1000 most common words plus the two special tokens)

### What We'll Build
- **Word frequency counting** using `collections.Counter`
- **Vocabulary dictionary** with the special tokens assigned IDs 0 and 1
- **Encoding function** that converts tokens to integers
- **Out-of-vocabulary handling** (unknown words → `<unk>`)

### Special Tokens Strategy
- `<unk>` = 0 (unknown/rare words)
- `<pad>` = 1 (padding for fixed-length sequences)
- Regular words = 2, 3, 4, ... (most frequent first)
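
To make the role of the two special tokens concrete, here is a minimal sketch on a toy vocabulary. The names `toy_vocab` and `pad_sequence` are illustrative assumptions, not part of this notebook's required code, and right-padding is only one possible convention.

```python
# Toy vocabulary following the layout above: special tokens first, then words.
toy_vocab = {"<unk>": 0, "<pad>": 1, "the": 2, "movie": 3, "was": 4, "great": 5}


def pad_sequence(ids, max_len, pad_id):
    """Truncate or right-pad a list of token IDs to exactly max_len entries."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


tokens = ["the", "movie", "was", "surprisingly", "great"]

# "surprisingly" is not in the toy vocabulary, so it falls back to <unk> (0).
ids = [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in tokens]
padded = pad_sequence(ids, max_len=8, pad_id=toy_vocab["<pad>"])

print(ids)     # [2, 3, 4, 0, 5]
print(padded)  # [2, 3, 4, 0, 5, 1, 1, 1]
```

Keeping the special tokens at fixed, low IDs means later code can refer to them without knowing anything about the rest of the vocabulary.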

### Expected Output Example
```python
encode_text("This movie was great", vocab)
# Returns: [45, 12, 8, 203] (actual numbers depend on your vocab)
```


## 🔧 TODO #1: Build Word Frequencies

**Task:** Count word frequencies across all tokenized reviews to identify the most common words.

**Hint:** Flatten `tokenized_corpus` (one list of tokens per review) into a single list, then pass it to `Counter(combined_corpus)`.

**Expected Variables:**
- `combined_corpus` → Flat list of all tokens from all reviews
- `word_freqs` → Counter object with word frequencies

**Expected Output:** You should see the most frequent words when you print `word_freqs.most_common(10)`
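
If the flatten-then-count pattern is unfamiliar, the sketch below shows it on a three-review toy corpus. The `toy_` names are assumptions made for illustration, and `itertools.chain.from_iterable` is just one way to flatten; a nested list comprehension works equally well.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for tokenized_corpus: one list of tokens per review.
toy_tokenized_corpus = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
    ["great", "cast", "great", "score"],
]

# Flatten the list of lists into one long list of tokens.
toy_combined = list(chain.from_iterable(toy_tokenized_corpus))

# Counter maps each distinct token to the number of times it occurs.
toy_freqs = Counter(toy_combined)

print(len(toy_combined))         # 12
print(toy_freqs["great"])        # 3
print(toy_freqs.most_common(1))  # [('great', 3)]
```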


In [None]:
# TODO #1: Build word frequencies
from collections import Counter

# Your code here


## 🔧 TODO #2: Create Vocabulary Dictionary

**Task:** Build a vocabulary dictionary containing the 1000 most common words plus the two special tokens.

**Hint:** Use `word_freqs.most_common(1000)` to get the top words, build the dictionary with `{word: idx+2 for idx, (word, _) in enumerate(most_common_words)}`, then add the special tokens: `<unk>` with ID 0 and `<pad>` with ID 1.

**Expected Variables:**
- `MAX_VOCAB_SIZE = 1000`
- `most_common_words` → List of (word, count) tuples
- `vocab` → Dictionary mapping words to integers

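**Expected Output:** `len(vocab) == 1002` (1000 words + 2 special tokens), provided the corpus contains at least 1000 distinct words. With fewer, the length is the number of distinct words plus 2.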
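
The same construction can be checked by hand on a tiny example. This is a sketch under assumed toy data: `toy_freqs` and `TOY_MAX_VOCAB_SIZE` are hypothetical, and the cap is set to 3 so the cutoff is visible.

```python
from collections import Counter

# Hypothetical word counts; the real ones come from your corpus.
toy_freqs = Counter({"great": 3, "the": 2, "was": 2, "movie": 1, "plot": 1})
TOY_MAX_VOCAB_SIZE = 3  # tiny cap for illustration only

toy_most_common = toy_freqs.most_common(TOY_MAX_VOCAB_SIZE)

# Reserve IDs 0 and 1 for the special tokens, then number words from 2.
toy_vocab = {"<unk>": 0, "<pad>": 1}
for idx, (word, _) in enumerate(toy_most_common):
    toy_vocab[word] = idx + 2

print(toy_vocab["great"])    # 2 (most frequent word gets the first regular ID)
print(len(toy_vocab))        # 5 (3 words + 2 special tokens)
print("movie" in toy_vocab)  # False (fell below the cutoff)
```

Words below the cutoff are simply absent from the dictionary, which is what lets the encoding step route them to `<unk>`.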


In [None]:
# TODO #2: Create vocabulary dictionary
# Your code here


## 🔧 TODO #3: Implement Encoding Function

**Task:** Create a function that converts text to a list of integers using the vocabulary.

**Hint:** Tokenize the text first with your existing tokenizer function, then use `[vocab.get(token, 0) for token in tokens]` so that any token missing from the vocabulary maps to `<unk>` (ID 0).

**Expected Function Signature:**
```python
def encode_text(text, vocab):
    # Your implementation here
    return encoded_ids  # List of integers
```

**Expected Output Example:**
```python
encode_text("This movie was great", vocab)
# Returns: [45, 12, 8, 203] (numbers will vary based on your vocab)
```
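
For comparison, here is one possible shape for such a function, run on a toy vocabulary. The tokenizer `simple_tokenize` is a hypothetical stand-in (lowercase plus a regular expression) and will not match your tokenizer exactly; `encode_text_sketch` is named differently from `encode_text` so it does not clash with your own implementation.

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase, keep letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def encode_text_sketch(text, vocab, tokenize=simple_tokenize):
    """Map each token to its ID, falling back to the <unk> ID for unknown tokens."""
    unk_id = vocab["<unk>"]
    return [vocab.get(token, unk_id) for token in tokenize(text)]


toy_vocab = {"<unk>": 0, "<pad>": 1, "this": 2, "movie": 3, "was": 4, "great": 5}

print(encode_text_sketch("This movie was great", toy_vocab))  # [2, 3, 4, 5]
print(encode_text_sketch("This film was great!", toy_vocab))  # [2, 0, 4, 5]
```

Looking up the `<unk>` ID from the vocabulary instead of hard-coding `0` keeps the function correct even if the special-token IDs are changed later.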


In [None]:
# TODO #3: Implement encoding function
# Your code here


## 📝 Reflection Prompts

### 🤔 Understanding Check
1. **Why do regular word IDs start at 2?** What would break if they started at 0?

2. **What happens if test data has many unknown words?** How does this affect model performance?

3. **Why limit vocabulary to 1000 words?** What's the tradeoff between vocabulary size and model performance?

4. **How does the `<unk>` token help with generalization?** What would happen without it?
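
One way to explore the vocabulary-size question empirically is to measure how much of the corpus the top `k` words cover. The helper `coverage_at_k` below is a hypothetical name and the counts are toy data; with your real `word_freqs` you could try values such as 100, 1000, and 5000.

```python
from collections import Counter


def coverage_at_k(word_freqs, k):
    """Fraction of all token occurrences accounted for by the k most common words."""
    total = sum(word_freqs.values())
    covered = sum(count for _, count in word_freqs.most_common(k))
    return covered / total if total else 0.0


# Toy counts: 10 token occurrences across 6 distinct words.
toy_freqs = Counter({"great": 3, "the": 2, "was": 2, "movie": 1, "plot": 1, "thin": 1})

for k in (1, 3, 6):
    print(k, round(coverage_at_k(toy_freqs, k), 2))
# 1 0.3
# 3 0.7
# 6 1.0
```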

### 🎯 Vocabulary Quality
- Are the most frequent words what you'd expect for movie reviews?
- How many words in your vocabulary do you recognize as movie-related?
- What percentage of tokens will become `<unk>` in unseen text?
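
The last question can be estimated by counting, on held-out tokenized text, how many tokens are missing from the vocabulary. The sketch below uses toy data; `unk_rate` and `held_out` are assumed names, and the held-out reviews are invented for illustration.

```python
def unk_rate(tokenized_texts, vocab):
    """Fraction of tokens that are not in vocab and would be encoded as <unk>."""
    total = 0
    unknown = 0
    for tokens in tokenized_texts:
        total += len(tokens)
        unknown += sum(1 for tok in tokens if tok not in vocab)
    return unknown / total if total else 0.0


toy_vocab = {"<unk>": 0, "<pad>": 1, "the": 2, "movie": 3, "was": 4, "great": 5}

# Invented held-out reviews, already tokenized.
held_out = [
    ["the", "film", "was", "great"],
    ["a", "great", "movie"],
]

rate = unk_rate(held_out, toy_vocab)
print(f"{rate:.1%}")  # 28.6% ("film" and "a" are unknown: 2 of 7 tokens)
```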

---

**Write your reflections here:**
