# Tokenization from Scratch

This walkthrough builds a tiny, regex-based tokenizer step by step. We start by loading a raw text sample, experiment with different splitting heuristics, and then evolve the tokenizer to support vocabularies and special tokens. Run the cells in order: later sections reuse variables such as `raw_text` and `preprocessed` that are defined in earlier cells.

---

## 1. Inspect the Raw Text

In [None]:
# Load a short story from the local filesystem
with open('the-verdict.txt', "r", encoding='utf-8') as f:
    raw_text = f.read()

# Quick sanity checks: overall length and a peek at the first 99 characters
print('total number of characters:', len(raw_text))
print(raw_text[:99])

total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### 1.1 Basic Stats

A quick double-check of the character count helps ensure the file loaded correctly. Feel free to delete or skip this cell once you're confident in the setup.


In [None]:
print(len(raw_text))

20479


## 2. Experiment with Regex-Based Splitting

Before touching the real text, it's useful to practice on a tiny example. The next few cells incrementally refine the regular expression so we understand which characters end up as individual tokens.


In [None]:
import re

text = "Hello, world. this, is a test."
# Start simple: split only on whitespace to see how punctuation clings to the words
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'this,', ' ', 'is', ' ', 'a', ' ', 'test.']


In [None]:
# Add commas and periods to the delimiter set; note the empty strings that appear
# wherever two delimiters are adjacent (e.g. ',' followed by ' ') or a delimiter ends the text
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'this', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
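To see where the empty strings come from, it can help to compare the same split with and without a capturing group. The sketch below is illustrative only; `sample`, `no_group`, and `with_group` are names introduced here and are not used elsewhere in the walkthrough.

```python
import re

sample = "Hello, world."

# Without a capturing group, the delimiters themselves are discarded
no_group = re.split(r'[,.]|\s', sample)

# With a capturing group, each matched delimiter is kept as its own item
with_group = re.split(r'([,.]|\s)', sample)

print(no_group)    # ['Hello', '', 'world', '']
print(with_group)  # ['Hello', ',', '', ' ', 'world', '.', '']
```

Empty strings show up in both cases wherever two delimiters touch (the comma and the space after it) or a delimiter ends the string. The capturing group only decides whether the delimiters themselves are kept.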


In [None]:
# Filter out the empty strings (and whitespace-only items) while keeping the punctuation tokens
result = [item for item in result if item.strip()]

print(result)

['Hello', ',', 'world', '.', 'this', ',', 'is', 'a', 'test', '.']


In [None]:
# Expand the delimiter list to include more punctuation as separate tokens
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item for item in result if item.strip()]

print(result)

['Hello', ',', 'world', '.', 'this', ',', 'is', 'a', 'test', '.']
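Since the same split-and-filter pair is reused in later cells, one option is to wrap it in a small helper. This is only a sketch: `SPLIT_PATTERN` and `split_tokens` are hypothetical names that the rest of the walkthrough does not rely on.

```python
import re

# Same pattern as the cell above: punctuation, double dashes, and whitespace
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text: str) -> list[str]:
    """Split text on the pattern and drop empty or whitespace-only pieces."""
    return [item for item in re.split(SPLIT_PATTERN, text) if item.strip()]

print(split_tokens("Hello, world. this, is a test."))
# ['Hello', ',', 'world', '.', 'this', ',', 'is', 'a', 'test', '.']

print(split_tokens('"Wait--what?" she said.'))
# ['"', 'Wait', '--', 'what', '?', '"', 'she', 'said', '.']
```

The second call shows that quotation marks and the double dash come out as separate tokens, which is the behaviour the expanded delimiter list was meant to add.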


## 3. Tokenize the Full Corpus

With the expression behaving nicely on the toy example, we can apply it to the entire story. The first cell previews the first few tokens; the second lets us confirm how many tokens we produced.


In [None]:
# Apply the regex to the full story and drop pure-whitespace fragments
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item for item in preprocessed if item.strip()]

print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [None]:
# Total number of tokens extracted with the current regex
print(len(preprocessed))

4690


## 4. Build a Vocabulary

We can now gather the unique tokens, inspect the vocabulary size, and create lookup tables that map between strings and integers. These tables are the backbone of our first tokenizer implementation.

In [None]:
# Collect unique tokens and inspect the vocabulary size
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1130


In [None]:
# Map token strings to integer ids (string -> id)
encoder = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(encoder.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


In [None]:
# Map integer ids back to their token strings (id -> string)
decoder = {idx: token for idx, token in enumerate(all_words)}
for i, item in enumerate(decoder.items()):
    print(item)
    if i >= 50:
        break

(0, '!')
(1, '"')
(2, "'")
(3, '(')
(4, ')')
(5, ',')
(6, '--')
(7, '.')
(8, ':')
(9, ';')
(10, '?')
(11, 'A')
(12, 'Ah')
(13, 'Among')
(14, 'And')
(15, 'Are')
(16, 'Arrt')
(17, 'As')
(18, 'At')
(19, 'Be')
(20, 'Begin')
(21, 'Burlington')
(22, 'But')
(23, 'By')
(24, 'Carlo')
(25, 'Chicago')
(26, 'Claude')
(27, 'Come')
(28, 'Croft')
(29, 'Destroyed')
(30, 'Devonshire')
(31, 'Don')
(32, 'Dubarry')
(33, 'Emperors')
(34, 'Florence')
(35, 'For')
(36, 'Gallery')
(37, 'Gideon')
(38, 'Gisburn')
(39, 'Gisburns')
(40, 'Grafton')
(41, 'Greek')
(42, 'Grindle')
(43, 'Grindles')
(44, 'HAD')
(45, 'Had')
(46, 'Hang')
(47, 'Has')
(48, 'He')
(49, 'Her')
(50, 'Hermia')
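A quick way to convince yourself that the two tables are consistent is a round-trip check. The sketch below uses a toy token list instead of the story so that it runs on its own; `toy_tokens` and the other `toy_*` names are illustrative, not part of the walkthrough.

```python
# Toy stand-in for the `preprocessed` token list
toy_tokens = ['the', 'cat', ',', 'the', 'dog', '.']

toy_words = sorted(set(toy_tokens))
toy_encoder = {token: idx for idx, token in enumerate(toy_words)}
toy_decoder = {idx: token for idx, token in enumerate(toy_words)}

# Encoding then decoding should return the original tokens
toy_ids = [toy_encoder[t] for t in toy_tokens]
round_trip = [toy_decoder[i] for i in toy_ids]

print(toy_ids)                   # [4, 2, 0, 4, 3, 1]
print(round_trip == toy_tokens)  # True

# The decoder is simply the encoder with keys and values swapped
inverted = {idx: token for token, idx in toy_encoder.items()}
print(inverted == toy_decoder)   # True
```

Because the ids come from `enumerate` over a sorted list, punctuation tends to receive the lowest ids, which matches the printed tables above.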


## 5. Implement `SimpleTokenizerV1`

The first tokenizer assumes the input contains only tokens seen while building the vocabulary; any other token raises a `KeyError` in `encode`. The encode path mirrors the earlier regex pipeline, while decode joins the tokens with spaces and then removes the whitespace in front of punctuation.


In [None]:
class SimpleTokenizerV1:
    """A minimal tokenizer that works only for in-vocabulary tokens."""

    def __init__(self, encoder, decoder):
        self.str_to_int = encoder
        self.int_to_str = decoder

    def encode(self, text: str) -> list[int]:
        """Split text with the earlier regex and map each token to an id."""
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids: list[int]) -> str:
        """Convert ids back to tokens and clean up spacing around punctuation."""
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [None]:
# Instantiate the tokenizer and try encoding a short phrase
tokenizer = SimpleTokenizerV1(encoder, decoder)

text = "It's the last he painted, you know"
ids = tokenizer.encode(text)
print(ids)

[56, 2, 850, 988, 602, 533, 746, 5, 1126, 596]


In [None]:
# Decoding reverses the process, though notice the stray space after the apostrophe:
# decode removes whitespace before punctuation but not after it
text = tokenizer.decode(ids)
print(text)

It' s the last he painted, you know
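If the stray space after the apostrophe is a problem, one possible post-processing step is to rejoin common English contraction suffixes. This is a heuristic sketch and not part of the tokenizer above: `fix_contractions` and `CONTRACTION_SUFFIXES` are hypothetical names, and the suffix list is an assumption that will miss some cases and can misfire in others.

```python
import re

# Assumed list of common English contraction suffixes (not exhaustive)
CONTRACTION_SUFFIXES = r"(s|t|d|m|ll|re|ve)"

def fix_contractions(text: str) -> str:
    """Remove the space between an apostrophe and a following contraction suffix."""
    pattern = r"(\w)' " + CONTRACTION_SUFFIXES + r"\b"
    return re.sub(pattern, r"\1'\2", text)

decoded = "It' s the last he painted, you know"
print(fix_contractions(decoded))  # It's the last he painted, you know
```

A closing single quote followed by a word such as "the" is left alone, because "the" is not one of the listed suffixes. A closing quote followed by a standalone "s" or "t" would still be joined incorrectly.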


## 6. Handle Special Tokens and Unknown Words

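Real-world tokenizers reserve ids for special markers. Here we add two: `<|endoftext|>`, which marks the end of a text sequence, and `<|unk|>`, which stands in for words missing from the vocabulary. The next few cells extend the vocabulary and introduce `<|unk|>` handling.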
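The need for an unknown-word fallback is easy to reproduce with a toy dictionary. In the sketch below, `toy_vocab` and its ids are made up for illustration: a plain lookup fails on an unseen word, while `dict.get` with the `<|unk|>` id as the default does not.

```python
# Toy vocabulary; the ids are arbitrary and for illustration only
toy_vocab = {"Hello": 0, "world": 1, "<|unk|>": 2}
tokens = ["Hello", "tea"]

# A plain lookup raises KeyError for a token that is not in the vocabulary
try:
    ids = [toy_vocab[t] for t in tokens]
except KeyError as err:
    missing = err.args[0]
    print("not in vocabulary:", missing)  # not in vocabulary: tea

# Falling back to the <|unk|> id keeps encoding from failing
unk_id = toy_vocab["<|unk|>"]
safe_ids = [toy_vocab.get(t, unk_id) for t in tokens]
print(safe_ids)  # [0, 2]
```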

In [None]:
# Extend the vocabulary with special tokens commonly used in language models
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token: integer for integer, token in enumerate(all_tokens)}


In [None]:
# Confirm the new vocabulary size after adding special tokens
len(vocab)

1132

In [None]:
# Peek at the tail end of the vocabulary to confirm special tokens are present
for token, idx in list(vocab.items())[-5:]:
    print(f"{token!r} -> {idx}")

'younger' -> 1127
'your' -> 1128
'yourself' -> 1129
'<|endoftext|>' -> 1130
'<|unk|>' -> 1131


In [None]:
# Rebuild encoder/decoder to include the special tokens
encoder = {token: idx for idx, token in enumerate(all_tokens)}
decoder = {idx: token for idx, token in enumerate(all_tokens)}


class SimpleTokenizerV2:
    """Adds `<|unk|>` support by mapping unseen tokens to a fallback id."""

    def __init__(self, encoder, decoder):
        self.str_to_int = encoder
        self.int_to_str = decoder

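    def encode(self, text: str) -> list[int]:
        """Split text, swap out-of-vocabulary tokens for `<|unk|>`, and map tokens to ids."""
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace any token that is missing from the vocabulary with the fallback token
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids: list[int]) -> str:
        """Convert ids back to tokens and remove whitespace before punctuation."""
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text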

In [None]:
# Instantiate the improved tokenizer and encode a string with an unknown token
tokenizer = SimpleTokenizerV2(encoder, decoder)

text = "It's the last he painted, you know ss"
ids = tokenizer.encode(text)
print(ids)

[56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 1131]


In [None]:
# The unknown token now maps back to the `<|unk|>` placeholder during decoding
text = tokenizer.decode(ids)
print(text)

It' s the last he painted, you know <|unk|>
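The vocabulary also contains `<|endoftext|>`, which the cells above never exercise. A common use is to place it between unrelated texts before encoding. The sketch below shows that idea with a toy vocabulary so that it runs on its own; `toy_encode`, `toy_vocab`, `doc1`, and `doc2` are hypothetical names, and the ids differ from those of the story vocabulary.

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def toy_encode(text: str, str_to_int: dict[str, int]) -> list[int]:
    """Split text like the tokenizers above and map unseen tokens to <|unk|>."""
    tokens = [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]
    unk_id = str_to_int["<|unk|>"]
    return [str_to_int.get(t, unk_id) for t in tokens]

# Toy vocabulary: sorted regular tokens first, special tokens appended at the end
toy_tokens = sorted({"Hello", "world", ",", "."})
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {token: idx for idx, token in enumerate(toy_tokens)}

doc1 = "Hello, world."
doc2 = "Hello there."

# Join two independent texts with the end-of-text marker between them
joined = " <|endoftext|> ".join((doc1, doc2))
ids = toy_encode(joined, toy_vocab)

print(joined)  # Hello, world. <|endoftext|> Hello there.
print(ids)     # [2, 0, 3, 1, 4, 2, 5, 1]
```

The marker survives the split as a single token because none of its characters appear in the delimiter set. The word "there" is not in the toy vocabulary, so it maps to the `<|unk|>` id (5).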
