In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Set the device for computation
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

Using device: mps


In [10]:
from pathlib import Path

text = Path('../../data/tiny-shakespeare.txt').read_text()

In [11]:
print(text[0:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.




---

# Character-Level Tokenizer for Language Models

###### Purpose and Overview

The `CharTokenizer` class implements a character-level tokenization system that converts text into numerical representations for neural language models. Unlike word-based tokenizers, this approach treats each individual character as a token, making it suitable for character-level language modeling tasks.

###### Core Functionality

**Bidirectional Mapping System:**
The tokenizer maintains two dictionaries for efficient conversion:
- `token_id_for_char`: Maps characters to unique integer IDs
- `char_for_token_id`: Maps integer IDs back to characters

**Key Operations:**
- **Encoding**: Text → Tensor of integer IDs
- **Decoding**: Tensor of integer IDs → Text
- **Vocabulary Management**: Builds and maintains character vocabulary

###### Detailed Method Analysis

**Initialization Process:**
```python
tokenizer = CharTokenizer(['a', 'b', 'c', ' ', '!'])
# Creates mappings:
# token_id_for_char = {'a': 0, 'b': 1, 'c': 2, ' ': 3, '!': 4}
# char_for_token_id = {0: 'a', 1: 'b', 2: 'c', 3: ' ', 4: '!'}
```

**Training from Text:**
```python
text = "hello world"
tokenizer = CharTokenizer.train_from_text(text)
# Vocabulary: [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']  # Sorted unique characters
```

**Encoding Example:**
```python
text = "hello"
encoded = tokenizer.encode(text)
# Result: tensor([3, 2, 4, 4, 5])  # Each character mapped to its ID in the "hello world" vocabulary
```

**Decoding Example:**
```python
token_ids = torch.tensor([3, 2, 4, 4, 5])
decoded = tokenizer.decode(token_ids)
# Result: "hello"  # IDs mapped back to characters
```

The IDs are generated automatically by the `enumerate()` function in the `__init__` method. Here's exactly how it works:

###### ID Assignment Process

```python
def __init__(self, vocabulary):
    self.token_id_for_char = {char: token_id for token_id, char in enumerate(vocabulary)}
    self.char_for_token_id = {token_id: char for token_id, char in enumerate(vocabulary)}
```

###### Step-by-Step Breakdown

**Input vocabulary:** `['a', 'b', 'c', ' ', '!']`

**enumerate() function creates pairs:**
```python
list(enumerate(['a', 'b', 'c', ' ', '!']))
# Result: [(0, 'a'), (1, 'b'), (2, 'c'), (3, ' '), (4, '!')]
```

**Dictionary comprehension assigns IDs:**
```python
# For token_id_for_char:
{char: token_id for token_id, char in enumerate(vocabulary)}
# Becomes: {'a': 0, 'b': 1, 'c': 2, ' ': 3, '!': 4}

# For char_for_token_id:
{token_id: char for token_id, char in enumerate(vocabulary)}
# Becomes: {0: 'a', 1: 'b', 2: 'c', 3: ' ', 4: '!'}
```

###### The enumerate() Function

`enumerate(sequence)` yields `(index, item)` pairs, with the index starting at 0 by default:
- Position 0: 'a' gets ID 0
- Position 1: 'b' gets ID 1
- Position 2: 'c' gets ID 2
- Position 3: ' ' gets ID 3
- Position 4: '!' gets ID 4

The IDs are simply the sequential positions of characters in the vocabulary list, starting from 0. This ensures each character has a unique integer identifier that can be used as an index in neural network embedding layers.



###### Practical Application Workflow

**Step 1: Vocabulary Creation**
The tokenizer scans the input text to find all unique characters and sorts them, so the vocabulary order stays the same across runs.

**Step 2: Text Processing**
During encoding, each character in the input text is looked up in the vocabulary and replaced with its corresponding integer ID, creating a sequence of numbers suitable for neural network processing.

**Step 3: Neural Network Integration**
The encoded tensors can be fed directly into embedding layers of neural networks, where each character ID is mapped to a learned vector representation.

**Step 4: Output Decoding**
Model predictions (sequences of token IDs) are converted back to readable text using the reverse mapping.
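
The four steps can be sketched end to end in plain Python. This is a minimal illustration rather than the notebook's `CharTokenizer`: the names `stoi`, `itos`, and `embedding_table` are hypothetical, and a small list of random vectors stands in for a learned embedding layer such as `torch.nn.Embedding`.

```python
import random

corpus = "hello world"

# Step 1: build a sorted vocabulary of unique characters.
vocabulary = sorted(set(corpus))
stoi = {char: token_id for token_id, char in enumerate(vocabulary)}  # hypothetical name
itos = {token_id: char for token_id, char in enumerate(vocabulary)}  # hypothetical name

# Step 2: encode text as a list of integer IDs.
token_ids = [stoi[char] for char in "hello"]

# Step 3: look up one vector per ID. A real model would learn these values;
# here they are random placeholders with an assumed embedding size of 4.
random.seed(0)
embedding_dim = 4
embedding_table = [[random.random() for _ in range(embedding_dim)] for _ in vocabulary]
embedded = [embedding_table[token_id] for token_id in token_ids]

# Step 4: decode IDs back to text.
decoded = "".join(itos[token_id] for token_id in token_ids)

print(token_ids)                        # [3, 2, 4, 4, 5]
print(len(embedded), len(embedded[0]))  # 5 4
print(decoded)                          # hello
```

The lookup in Step 3 is plain row indexing, which is why token IDs must be integers in the range `0` to `len(vocabulary) - 1`.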

###### Use Cases and Applications

**Character-Level Language Models:**
- Text generation at character granularity
- Handling out-of-vocabulary words naturally
- Working with any language or script without preprocessing

**Advantages:**
- No unknown-token issues for any character seen in the training text
- Handles misspellings and novel words
- Language-agnostic approach
- Simple implementation and debugging
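
A minimal sketch of the vocabulary's coverage, assuming a dictionary-based mapping like the one described earlier (the `encode_chars` helper is hypothetical): novel words built from known characters encode without trouble, while a character absent from the training text raises `KeyError`.

```python
def encode_chars(text, token_id_for_char):
    """Hypothetical helper: map each character to its ID via dictionary lookup."""
    return [token_id_for_char[char] for char in text]

vocabulary = sorted(set("hello world"))
token_id_for_char = {char: token_id for token_id, char in enumerate(vocabulary)}

# A novel word made only of known characters encodes without trouble.
novel_word_ids = encode_chars("hollow", token_id_for_char)

# A character that never appeared in the training text is not in the mapping.
try:
    encode_chars("hello!", token_id_for_char)
    unseen_char_failed = False
except KeyError:
    unseen_char_failed = True

print(novel_word_ids)      # [3, 5, 4, 4, 5, 7]
print(unseen_char_failed)  # True
```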

**Limitations:**
- Longer sequences than word-based tokenization
- Long-range dependencies span many more tokens, which makes them harder for a model to capture
- Computationally more intensive for long texts
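
The sequence-length cost is easy to measure. The sketch below compares the number of character tokens with a rough word count for one line of the corpus. Whitespace splitting is only a crude stand-in for a real word-level tokenizer.

```python
sentence = "Before we proceed any further, hear me speak."

char_tokens = list(sentence)    # one token per character
word_tokens = sentence.split()  # crude word-level stand-in (assumption)

print(len(char_tokens))  # 45
print(len(word_tokens))  # 8
```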

###### Technical Implementation Details

**Data Types:**
The tokenizer uses `torch.long` dtype for token IDs, which is standard for indexing operations in PyTorch embeddings and ensures compatibility with neural network layers.

**Lookup Efficiency:**
Each dictionary lookup takes O(1) time on average, so encoding and decoding run in time linear in the length of the text, at the cost of storing two small dictionaries.

**Vocabulary Consistency:**
The sorting of unique characters ensures deterministic vocabulary creation, crucial for model reproducibility across different training runs.
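
A short sketch of why the sort matters, using a hypothetical `build_vocabulary` helper. The iteration order of a `set` of strings is not guaranteed and can differ between Python processes because string hashing is randomized by default. Sorting removes that dependency.

```python
def build_vocabulary(text):
    """Hypothetical helper: sorted list of the unique characters in text."""
    return sorted(set(text))

vocab_a = build_vocabulary("hello world")
vocab_b = build_vocabulary("world hello")  # same characters, different order

print(vocab_a)             # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(vocab_a == vocab_b)  # True
```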

This tokenizer serves as a foundational component for character-level natural language processing tasks, providing the essential interface between raw text and numerical representations required by neural networks.


---

The original `__init__` method (kept as a commented-out block in the class definition) has a **serious flaw**: it doesn't handle duplicate characters in the vocabulary.

###### The Problem with Duplicates

If the vocabulary contains duplicates, the dictionary comprehension will overwrite previous mappings:

```python
# Problematic input with duplicates:
vocabulary = ['a', 'b', 'a', 'c', 'a']

# enumerate() produces:
[(0, 'a'), (1, 'b'), (2, 'a'), (3, 'c'), (4, 'a')]

# Dictionary comprehension overwrites:
token_id_for_char = {'a': 0, 'b': 1, 'a': 2, 'c': 3, 'a': 4}
# Final result: {'a': 4, 'b': 1, 'c': 3}  # 'a' only maps to ID 4!

char_for_token_id = {0: 'a', 1: 'b', 2: 'a', 3: 'c', 4: 'a'}
# Final result: {0: 'a', 1: 'b', 2: 'a', 3: 'c', 4: 'a'}  # Multiple IDs for 'a'
```

###### The Broken Behavior

This creates inconsistent mappings:
- `encode('a')` would return ID 4 (the last occurrence)
- `decode(torch.tensor([0]))` would return `'a'`
- So `encode(decode(torch.tensor([0])))` would yield `tensor([4])` rather than `tensor([0])`, which breaks the round trip
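
The broken round trip can be reproduced with two plain dictionaries. This is a minimal sketch that mirrors the comprehensions from the original `__init__`, using lists instead of tensors for brevity.

```python
vocabulary = ['a', 'b', 'a', 'c', 'a']  # deliberately contains duplicates

token_id_for_char = {char: token_id for token_id, char in enumerate(vocabulary)}
char_for_token_id = {token_id: char for token_id, char in enumerate(vocabulary)}

original_ids = [0, 1, 2]
decoded = "".join(char_for_token_id[token_id] for token_id in original_ids)
re_encoded = [token_id_for_char[char] for char in decoded]

print(decoded)     # aba
print(re_encoded)  # [4, 1, 4]
print(len(token_id_for_char), len(char_for_token_id))  # 3 5
```

The two dictionaries end up with different sizes, which is a quick way to detect the problem.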

###### How the Code Should Handle This

The original code **doesn't** handle duplicates. A robust implementation should do one of the following:

**Option 1: Remove duplicates before processing:**
```python
def __init__(self, vocabulary):
    unique_vocab = list(dict.fromkeys(vocabulary))  # Preserves order, removes duplicates
    self.token_id_for_char = {char: token_id for token_id, char in enumerate(unique_vocab)}
    self.char_for_token_id = {token_id: char for token_id, char in enumerate(unique_vocab)}
```

**Option 2: Raise an error for duplicates:**
```python
def __init__(self, vocabulary):
    if len(set(vocabulary)) != len(vocabulary):
        raise ValueError("Vocabulary contains duplicate characters")
    self.token_id_for_char = {char: token_id for token_id, char in enumerate(vocabulary)}
    self.char_for_token_id = {token_id: char for token_id, char in enumerate(vocabulary)}
```

###### Why `train_from_text()` Works

The `train_from_text()` method avoids this issue by explicitly removing duplicates:
```python
vocabulary = sorted(list(set(text)))  # set() removes duplicates
```

So the class works correctly when built through `train_from_text()`, but the original `__init__` method itself was vulnerable to duplicate inputs. The `CharTokenizer` defined in the next cell adopts Option 1 and removes duplicates inside `__init__`.

---

In [18]:
import torch

class CharTokenizer:
    """
    A simple character-level tokenizer for converting text to and from numerical IDs.

    This tokenizer builds a vocabulary from a given text and provides methods
    to encode strings into integer tensors and decode them back into strings.
    It is a basic but essential component for character-level language models.

    Attributes:
        token_id_for_char (dict): A mapping from each character in the vocabulary
            to its unique integer ID.
        char_for_token_id (dict): A reverse mapping from each integer ID back
            to its corresponding character.
    """
  
    # def __init__(self, vocabulary):
    #     """
    #     Initializes the CharTokenizer with a predefined vocabulary.

    #     Args:
    #         vocabulary (list or str): An ordered list or string of unique
    #             characters that will form the tokenizer's vocabulary.
    #     """
    #     self.token_id_for_char = {char: token_id for token_id, char in enumerate(vocabulary)}
    #     self.char_for_token_id = {token_id: char for token_id, char in enumerate(vocabulary)}

    def __init__(self, vocabulary):
        """
        Initializes the CharTokenizer with a predefined vocabulary.

        Args:
            vocabulary (list or str): An ordered list or string of characters
                that will form the tokenizer's vocabulary. Duplicates are
                dropped, keeping the first occurrence of each character.
        """
        # dict.fromkeys() preserves insertion order and removes duplicates.
        unique_vocab = list(dict.fromkeys(vocabulary))
        self.token_id_for_char = {char: token_id for token_id, char in enumerate(unique_vocab)}
        self.char_for_token_id = {token_id: char for token_id, char in enumerate(unique_vocab)}

    @staticmethod
    def train_from_text(text):
        """
        Creates a new CharTokenizer instance by building a vocabulary from text.

        This static method scans the input text, finds all unique characters,
        sorts them to ensure a consistent vocabulary order, and then creates
        a new tokenizer instance based on this vocabulary.

        Args:
            text (str): The corpus of text from which to build the vocabulary.

        Returns:
            CharTokenizer: A new instance of the tokenizer trained on the text.
        """
        vocabulary = sorted(list(set(text)))
        return CharTokenizer(vocabulary)

    def encode(self, text):
        """
        Encodes a string of text into a tensor of token IDs.

        Each character in the input string is mapped to its corresponding integer
        ID from the vocabulary.

        Args:
            text (str): The string to encode.

        Returns:
            torch.Tensor: A 1D tensor of dtype torch.long containing the sequence
                of token IDs.
        """
        token_ids = []
        for char in text:
            token_ids.append(self.token_id_for_char[char])
        return torch.tensor(token_ids, dtype=torch.long)

    def decode(self, token_ids):
        """
        Decodes a tensor of token IDs back into a string of text.

        Each integer ID in the input tensor is mapped back to its corresponding
        character from the vocabulary.

        Args:
            token_ids (torch.Tensor): A 1D tensor of token IDs to decode.

        Returns:
            str: The decoded string.
        """
        chars = []
        # .tolist() converts the tensor to a standard Python list for iteration.
        for token_id in token_ids.tolist():
            chars.append(self.char_for_token_id[token_id])
        return ''.join(chars)

    def vocabulary_size(self):
        """
        Returns the total number of unique characters in the vocabulary.

        Returns:
            int: The size of the vocabulary.
        """
        return len(self.token_id_for_char)

In [19]:
tokenizer = CharTokenizer.train_from_text(text)

In [21]:
print(tokenizer.encode("Hello world"))

print("\n")

print(tokenizer.encode("Helleoo"))

tensor([20, 43, 50, 50, 53,  1, 61, 53, 56, 50, 42])


tensor([20, 43, 50, 50, 43, 53, 53])


In [16]:
print(tokenizer.decode(tokenizer.encode("Hello world")))

Hello world


In [17]:
tokenizer.vocabulary_size()

65

In [22]:
import pprint

pp = pprint.PrettyPrinter(depth=4)

In [23]:
pp.pprint(tokenizer.char_for_token_id)

{0: '\n',
 1: ' ',
 2: '!',
 3: '$',
 4: '&',
 5: "'",
 6: ',',
 7: '-',
 8: '.',
 9: '3',
 10: ':',
 11: ';',
 12: '?',
 13: 'A',
 14: 'B',
 15: 'C',
 16: 'D',
 17: 'E',
 18: 'F',
 19: 'G',
 20: 'H',
 21: 'I',
 22: 'J',
 23: 'K',
 24: 'L',
 25: 'M',
 26: 'N',
 27: 'O',
 28: 'P',
 29: 'Q',
 30: 'R',
 31: 'S',
 32: 'T',
 33: 'U',
 34: 'V',
 35: 'W',
 36: 'X',
 37: 'Y',
 38: 'Z',
 39: 'a',
 40: 'b',
 41: 'c',
 42: 'd',
 43: 'e',
 44: 'f',
 45: 'g',
 46: 'h',
 47: 'i',
 48: 'j',
 49: 'k',
 50: 'l',
 51: 'm',
 52: 'n',
 53: 'o',
 54: 'p',
 55: 'q',
 56: 'r',
 57: 's',
 58: 't',
 59: 'u',
 60: 'v',
 61: 'w',
 62: 'x',
 63: 'y',
 64: 'z'}


In [24]:
pp.pprint(tokenizer.token_id_for_char)

{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19,
 'H': 20,
 'I': 21,
 'J': 22,
 'K': 23,
 'L': 24,
 'M': 25,
 'N': 26,
 'O': 27,
 'P': 28,
 'Q': 29,
 'R': 30,
 'S': 31,
 'T': 32,
 'U': 33,
 'V': 34,
 'W': 35,
 'X': 36,
 'Y': 37,
 'Z': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64}
