# 02 — Character Tokenization & Vocabulary
## Build the Character-to-ID Mapping

---


## 🎯 Concept Primer

### What is Tokenization?

**Tokenization** breaks text into units (tokens). For us, each token = one character.

**Example:**
```
Text:    "Hi!"
Tokens:  ['H', 'i', '!']
```

### Why Character-Level Tokens?

- **Small vocabulary**: ~50-80 unique characters (letters, punctuation, spaces)
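- **No unknown words**: Any word can be spelled out of characters, so there are no out-of-vocabulary words (a character that never appears in the training text is still missing from the vocab)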
- **Simplicity**: No need for complex tokenizers like BPE or WordPiece

### Building the Vocabulary

We need **two dictionaries**:

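1. **`c2ix`** (char → index): Encode characters to integers for the model  
   `{'a': 0, 'b': 1, ..., ' ': 26, ...}` (illustrative IDs only; once the characters are sorted, newline and space come before the letters)

2. **`ix2c`** (index → char): Decode integers back to characters for text generation  
   `{0: 'a', 1: 'b', ..., 26: ' ', ...}` (the exact inverse of `c2ix`)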
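
To make the two mappings concrete, here is a minimal sketch on a toy string. The sample text and the names `sample`, `encoded`, and `decoded` are made up for illustration; the notebook's real vocabulary comes from `first_letter_text`.

```python
# Toy example: build both mappings from a tiny string and round-trip it.
sample = "hello world"  # hypothetical text, not the notebook's data

chars = sorted(set(sample))                    # unique characters in a fixed order
c2ix = {ch: i for i, ch in enumerate(chars)}   # char -> index
ix2c = {i: ch for ch, i in c2ix.items()}       # index -> char

encoded = [c2ix[ch] for ch in sample]          # text -> IDs
decoded = "".join(ix2c[i] for i in encoded)    # IDs -> text

print(chars)
print(encoded)
print(decoded == sample)  # the round trip should reproduce the input
```

Encoding followed by decoding should always give back the original text; if it does not, the two dictionaries are out of sync.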

### What Breaks If We Skip This?

- No mapping = can't convert text to numbers
- Inconsistent ordering = non-reproducible results
- Missing characters = `KeyError` when encoding text that contains a character the vocab has never seen (for example, a generation prompt)
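
The missing-character case can be reproduced in a few lines. This is a sketch on a made-up vocabulary (`train_text` and `prompt` are hypothetical), and the filtering workaround at the end is one possible option, not something this notebook requires.

```python
# Toy vocabulary built from a text that contains no digits.
train_text = "abc abc"  # hypothetical sample
c2ix = {ch: i for i, ch in enumerate(sorted(set(train_text)))}

prompt = "abc 1"  # '1' never appeared in train_text

try:
    ids = [c2ix[ch] for ch in prompt]
except KeyError as err:
    ids = None
    print(f"Cannot encode character {err}: it is not in the vocabulary")

# One possible workaround (an assumption, not part of this notebook):
# drop characters the vocabulary has never seen.
safe_ids = [c2ix[ch] for ch in prompt if ch in c2ix]
print(safe_ids)
```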

### Shapes
- **Input**: `first_letter_text` (string, ~6,850 chars)
- **Tokens**: List of 6,850 characters
- **Unique chars**: ~50-80 (sorted for consistency)
- **ID sequence**: List of 6,850 integers
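
These shape relationships can be checked mechanically. The sketch below uses a short stand-in string (hypothetical, in place of `first_letter_text`) and asserts the invariants that should also hold for the real data.

```python
text = "to be, or not to be"  # hypothetical stand-in for first_letter_text

tokens = list(text)                           # one token per character
vocab = sorted(set(tokens))                   # unique characters, sorted
c2ix = {ch: i for i, ch in enumerate(vocab)}
ids = [c2ix[ch] for ch in tokens]             # one ID per token

assert len(tokens) == len(text)               # tokens match the text length
assert len(ids) == len(tokens)                # IDs match the token count
assert len(vocab) <= len(tokens)              # vocab can never exceed the text
assert all(0 <= i < len(vocab) for i in ids)  # every ID is a valid index

print(len(text), len(vocab), len(ids))
```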

---


## ✅ Objectives

By the end of this notebook, you should:

- [ ] Convert `first_letter_text` into a list of characters → `tokenized_text`
- [ ] Extract all unique characters and sort them → `unique_char_tokens`
- [ ] Build `c2ix` dictionary (char → index)
- [ ] Build `ix2c` dictionary (index → char)
- [ ] Calculate `vocab_size`
- [ ] Convert the full text to IDs → `tokenized_id_text`
- [ ] Verify by printing first 100 IDs

---


## 🎓 Acceptance Criteria

**You pass this notebook when:**

✅ `vocab_size` prints (should be ~50-80)  
✅ First 100 IDs of `tokenized_id_text` display correctly  
✅ You can manually check: `c2ix['a']` and `ix2c[0]` work as expected  
✅ You understand why we sort the unique characters

---


## 📝 TODO 0: Load the Data

**Note:** Copy your loading code from Notebook 01, or re-run it here.

**You need:**
- `first_letter_text` variable populated


```python
# TODO: Load first_letter_text from Notebook 01
# (Copy your code from 01, or simply re-run those cells)

with open('../datasets/frankenstein.txt', 'r', encoding='utf-8') as f:
    frankenstein = f.read()

first_letter_text = frankenstein[1380:8230]

print(f"Loaded {len(first_letter_text)} characters")
```


## 📝 TODO 1: Tokenize Into Characters

**Hint:**  
A string in Python is already iterable. Convert it to a list.

**Steps:**
1. Use `list(first_letter_text)` to convert the string into a list of characters
2. Assign to `tokenized_text`

**Example:**
```python
text = "Hi!"
tokens = list(text)  # ['H', 'i', '!']
```


```python
# TODO: Tokenize the text into characters
# tokenized_text = list(first_letter_text)

tokenized_text = None  # Replace this line

# Verify
if tokenized_text:
    print(f"Total tokens: {len(tokenized_text)}")
    print(f"First 50 tokens: {tokenized_text[:50]}")
```


## 📝 TODO 2: Extract Unique Characters (Sorted)

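**Hint:**  
Use `set()` to get unique characters, then `sorted()` to put them in a fixed order. Python sorts characters by Unicode code point, not alphabetically: whitespace and common punctuation come first, then uppercase letters, then lowercase letters.

**Why sort?**  
Sorting ensures consistent IDs across runs. A `set` has no guaranteed order, and for strings the iteration order can change from one Python session to the next, so unsorted IDs would not be reproducible.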

**Steps:**
1. Convert `tokenized_text` to a set to get unique characters
2. Sort it with `sorted()`
3. Assign to `unique_char_tokens`
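
To see what order `sorted()` actually produces, here is a small sketch on a made-up string (`toy_text` is hypothetical). It prints each unique character next to its code point.

```python
toy_text = "Hi, hi!\nBye."  # hypothetical sample with mixed case and punctuation

unique_sorted = sorted(set(toy_text))

# Show each character with the code point that determines its position.
for ch in unique_sorted:
    print(repr(ch), ord(ch))
```

The newline comes first, followed by the space, the punctuation marks, the uppercase letters, and finally the lowercase letters. Running the cell again gives the same order every time, which is the whole point of sorting.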


```python
# TODO: Get unique characters and sort them
# unique_char_tokens = sorted(set(tokenized_text))

unique_char_tokens = None  # Replace this line

# Verify
if unique_char_tokens:
    print(f"Unique characters: {len(unique_char_tokens)}")
    print(f"Vocabulary: {unique_char_tokens}")
```


## 📝 TODO 3: Build c2ix (char → index)

**Hint:**  
Use dictionary comprehension with `enumerate()`.

**Steps:**
1. Enumerate through `unique_char_tokens` to get (index, char) pairs
2. Create a dict mapping char → index
3. Assign to `c2ix`

**Example:**
```python
chars = ['a', 'b', 'c']
c2ix = {char: idx for idx, char in enumerate(chars)}
# Result: {'a': 0, 'b': 1, 'c': 2}
```


```python
# TODO: Build char-to-index dictionary
# c2ix = {char: idx for idx, char in enumerate(unique_char_tokens)}

c2ix = None  # Replace this line

# Verify
if c2ix:
    print("Sample mappings from c2ix:")
    for char in [' ', 'a', 'e', 't', '.']:
        if char in c2ix:
            print(f"  '{char}' → {c2ix[char]}")
```


## 📝 TODO 4: Build ix2c (index → char)

**Hint:**  
Invert the `c2ix` dictionary so that each index points back to its character.

**Steps:**
1. Use dictionary comprehension: swap keys and values
2. Assign to `ix2c`

**Example:**
```python
c2ix = {'a': 0, 'b': 1}
ix2c = {idx: char for char, idx in c2ix.items()}
# Result: {0: 'a', 1: 'b'}
```


```python
# TODO: Build index-to-char dictionary
# ix2c = {idx: char for char, idx in c2ix.items()}

ix2c = None  # Replace this line

# Verify
if ix2c:
    print("Sample mappings from ix2c:")
    for idx in [0, 1, 2, 3, 4]:
        if idx in ix2c:
            print(f"  {idx} → '{ix2c[idx]}'")
```


## 📝 TODO 5: Calculate vocab_size

**Hint:**  
The vocabulary size is simply the number of unique characters.

**Steps:**
1. Use `len(c2ix)` or `len(unique_char_tokens)`
2. Assign to `vocab_size`


```python
# TODO: Calculate vocabulary size
# vocab_size = len(c2ix)

vocab_size = None  # Replace this line

# Verify
if vocab_size:
    print(f"Vocabulary size: {vocab_size}")
```


## 📝 TODO 6: Convert Text to IDs

**Hint:**  
Map each character through the `c2ix` dictionary.

**Steps:**
1. Use a list comprehension to map each char in `tokenized_text` to its ID
2. Look up each char: `c2ix[char]`
3. Assign to `tokenized_id_text`

**Example:**
```python
text = ['h', 'i']
c2ix = {'h': 5, 'i': 8}
ids = [c2ix[char] for char in text]
# Result: [5, 8]
```


```python
# TODO: Convert all characters to IDs
# tokenized_id_text = [c2ix[char] for char in tokenized_text]

tokenized_id_text = None  # Replace this line

# Verify
if tokenized_id_text:
    print(f"Total IDs: {len(tokenized_id_text)}")
    print(f"First 100 IDs:\n{tokenized_id_text[:100]}")
```


## 💭 Reflection Prompts

**Write your observations:**

1. **Non-letter characters**: Which non-letter characters appear in your vocabulary? (spaces, punctuation, newlines?)

2. **Most common characters**: From the first 100 IDs, which IDs (and their corresponding characters) appear most frequently?

3. **Why sorting matters**: What would happen if we didn't sort `unique_char_tokens`?

4. **Encoding vs. Decoding**: Why do we need both `c2ix` and `ix2c`? When do we use each?
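
For the frequency question, counting by hand gets tedious. One possible approach, sketched here on a made-up string (`toy_text` is hypothetical), uses `collections.Counter` from the standard library and decodes each ID with `ix2c` so the result is readable.

```python
from collections import Counter

toy_text = "the tree"  # hypothetical sample
vocab = sorted(set(toy_text))
c2ix = {ch: i for i, ch in enumerate(vocab)}
ix2c = {i: ch for ch, i in c2ix.items()}
ids = [c2ix[ch] for ch in toy_text]

counts = Counter(ids)  # maps each ID to how often it occurs

# Report the three most frequent IDs together with their characters.
for idx, n in counts.most_common(3):
    print(f"ID {idx} ({ix2c[idx]!r}) appears {n} times")
```

In the notebook, the same idea applies to `Counter(tokenized_id_text[:100])` once TODO 6 is complete.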

---


## 🚀 Next Steps

Once you've completed all TODOs and verified your vocab size:

➡️ **Move to Notebook 03**: Creating Dataset & DataLoader with sliding windows

---

## 📌 Key Takeaways

- ✅ Character tokenization = `list(text)`
- ✅ Vocab = sorted unique characters for consistency
- ✅ `c2ix` encodes (text → model), `ix2c` decodes (model → text)
- ✅ IDs are what the model actually processes
- ✅ Every character (including spaces, punctuation, newlines) gets an ID

---

*Next up: We'll create sliding windows of these IDs to form training examples!*
