# Tokenization

A simple tokenizer splits text into tokens. It has two main parts, an encoder and a decoder:
- The encoder turns text into token IDs.
- The decoder turns token IDs back into text.

Let's use the La La Land movie script to demonstrate tokenization:

In [1]:
with open("../data/la_la_land.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(raw_text[:30])
print("Total number of characters:", len(raw_text))

LA LA LAND
by
Damien Chazelle

Total number of characters: 102179


The goal is to turn the raw text into a sequence of tokens usable by an LLM.
First, we need to preprocess the text by splitting it into tokens.
I have chosen the following set of delimiters to split on:
- Punctuation: ,.:;?_!"()'
- Double hyphen: --
- Whitespace: \s

Punctuation marks and double hyphens are kept as separate tokens, while whitespace is discarded. Single hyphens are not split on, so compound words such as sun-blasted stay whole.
A lot more characters could be handled, but for now, let's keep it simple.
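To see what the split pattern does before running it on the whole script, here is a minimal sketch on a made-up sentence (the sample string is an assumption for illustration, not taken from the script). The capturing group in the pattern is what makes `re.split` return the delimiters along with the words.

```python
import re

# Same delimiter pattern as in the notebook; the parentheses form a capturing
# group, so re.split keeps the matched delimiters in its result.
pattern = r'([,.:;?_!"()\']|--|\s)'

# Made-up sample sentence (hypothetical, not from the script).
sample = "Hello, world -- it's sun-blasted."

pieces = re.split(pattern, sample)
# Drop empty strings and whitespace-only pieces.
tokens = [p.strip() for p in pieces if p.strip()]
print(tokens)
# ['Hello', ',', 'world', '--', 'it', "'", 's', 'sun-blasted', '.']
```

Note that the straight apostrophe in "it's" is a delimiter, so the word is split into three tokens. The script uses curly apostrophes (’), which are not in the delimiter set, so contractions such as We’re stay whole.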

In [3]:
import re
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['LA', 'LA', 'LAND', 'by', 'Damien', 'Chazelle', 'FADE', 'IN', '.', '.', '.', 'A', 'sun-blasted', 'sky', '.', 'We', 'HEAR', 'radios', '--', 'one', 'piece', 'of', 'music', 'after', 'another', '.', '.', '.', 'We’re', '--']


Total number of tokens:

In [4]:
print(len(preprocessed))

23341


From these tokens we can build a vocabulary of unique tokens.

In [12]:
all_tokens = sorted(set(preprocessed))
print(len(all_tokens))

3836


In [6]:
vocab = {token: i for i, token in enumerate(all_tokens)}


In [7]:
list(vocab.items())[1000:1020]

[('PUSH', 1000),
 ('PUSHED', 1001),
 ('Pantages', 1002),
 ('Paris', 1003),
 ('Parisian', 1004),
 ('Parisian-style', 1005),
 ('Park', 1006),
 ('Parker', 1007),
 ('Pasadena', 1008),
 ('Passes', 1009),
 ('Passing', 1010),
 ('Pasta', 1011),
 ('Peer', 1012),
 ('Peers', 1013),
 ('People', 1014),
 ('Pfeiffer’s', 1015),
 ('Photographer', 1016),
 ('Photographer’s', 1017),
 ('Piano', 1018),
 ('Picks', 1019)]

Now that we have a vocabulary, we can encode the raw text into token IDs. We can build a simple tokenizer class.

In [8]:
class TokenizerV1:
    def __init__(self, vocab):
        self.vocab = vocab  # token -> ID
        self.id_to_token = {i: s for s, i in vocab.items()}  # ID -> token, built once

    def encode(self, text):
        # Split the text on the same delimiters we built the vocab from.
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only items.
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Look up the ID of each token (raises KeyError for unknown tokens).
        return [self.vocab[token] for token in preprocessed]

    def decode(self, ids):
        text = " ".join(self.id_to_token[i] for i in ids)
        # Remove the space before the specified punctuation marks.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [9]:
tokenizer = TokenizerV1(vocab)
text = """Maybe I’m one of those people who’s
always wanted to do it but never had a
chance."""
ids = tokenizer.encode(text)
print(ids)


[804, 679, 2768, 2752, 3525, 2867, 3727, 1431, 3675, 3545, 1913, 2434, 1664, 2713, 2266, 1381, 1707, 10]


In [10]:
tokenizer.decode(ids)

'Maybe I’m one of those people who’s always wanted to do it but never had a chance.'
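Decoding is not an exact inverse of encoding: the line breaks of the input come back as single spaces, and only some punctuation marks are re-attached to the preceding word. The following sketch shows this on a made-up string (an assumption for illustration). It skips the ID lookup, since mapping tokens to IDs and back does not change them.

```python
import re

split_pattern = r'([,.:;?_!"()\']|--|\s)'

# Made-up input containing a colon and a line break (hypothetical example).
text = "MIA: always wanted a\nchance."

# Encoding side: split into tokens, dropping whitespace.
tokens = [t.strip() for t in re.split(split_pattern, text) if t.strip()]

# Decoding side: join with spaces, then remove the space before , . ? ! " ( ) '
rebuilt = re.sub(r'\s+([,.?!"()\'])', r'\1', " ".join(tokens))

print(rebuilt)          # MIA : always wanted a chance.
print(rebuilt == text)  # False
```

The colon keeps its leading space because ":" is in the split pattern but not in the pattern used by decode.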

But let's use a sentence containing a word that is not in the vocab:

In [11]:
text = """But on Instagram it said you liked hip hop more than jazz!"""
ids = tokenizer.encode(text)

KeyError: 'Instagram'

This is a problem, and it's one of the reasons that special tokens are useful.

Inspired by GPT-2, let's add special tokens to our vocab:
- "<|endoftext|>", which GPT-2 uses to mark the end of a text.
- "<|unk|>" to denote unknown tokens.


In [15]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(f"Vocabulary size: {len(vocab)}")


Vocabulary size: 3838


In [16]:
for item in list(vocab.items())[-5:]:
    print(item)

('“working”', 3833)
('”', 3834)
('”Casting”', 3835)
('<|endoftext|>', 3836)
('<|unk|>', 3837)


We can change the tokenizer to replace unknown words with the "<|unk|>" special token:

In [17]:
class TokenizerV2:
    def __init__(self, vocab):
        self.vocab = vocab  # token -> ID
        self.id_to_token = {i: s for s, i in vocab.items()}  # ID -> token, built once

    def encode(self, text):
        # Split the text on the same delimiters we built the vocab from.
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and whitespace-only items.
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace tokens that are not in the vocab with "<|unk|>".
        preprocessed = [item if item in self.vocab else "<|unk|>" for item in preprocessed]
        # Look up the ID of each token.
        return [self.vocab[token] for token in preprocessed]

    def decode(self, ids):
        text = " ".join(self.id_to_token[i] for i in ids)
        # Remove the space before the specified punctuation marks.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

Let's try to tokenize again, now using the new tokenizer:

In [19]:
tokenizer = TokenizerV2(vocab)
text = """But on Instagram it said you liked hip hop more than jazz!"""
ids = tokenizer.encode(text)
ids

[248, 2765, 3837, 2434, 3110, 3778, 2539, 3837, 2354, 2662, 3492, 2444, 0]

In [20]:
tokenizer.decode(ids)

'But on <|unk|> it said you liked <|unk|> hop more than jazz!'
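The "<|endoftext|>" token has not been used so far. One common use is to join independent texts with it so that the boundary between them stays visible in the token stream. Below is a minimal, self-contained sketch of that idea. The tiny corpus, the two texts, and the helper functions `tokenize`, `encode`, and `decode` are assumptions made for illustration; they mirror the tokenizer above but are not part of the notebook.

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def tokenize(text):
    """Split on the delimiters and drop whitespace-only pieces."""
    return [t.strip() for t in re.split(SPLIT_PATTERN, text) if t.strip()]

# Toy corpus standing in for the movie script (hypothetical).
corpus = "Hello, do you like jazz? I like jazz."
all_tokens = sorted(set(tokenize(corpus)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: i for i, token in enumerate(all_tokens)}
id_to_token = {i: token for token, i in vocab.items()}

def encode(text):
    # Unknown tokens fall back to the ID of "<|unk|>".
    return [vocab.get(token, vocab["<|unk|>"]) for token in tokenize(text)]

def decode(ids):
    text = " ".join(id_to_token[i] for i in ids)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Two unrelated texts joined by the end-of-text marker.
text1 = "Hello, do you like tea?"
text2 = "I like jazz."
joined = " <|endoftext|> ".join((text1, text2))

ids = encode(joined)
print(ids)          # [3, 0, 5, 8, 7, 10, 2, 9, 4, 7, 6, 1]
print(decode(ids))  # Hello, do you like <|unk|>? <|endoftext|> I like jazz.
```

The marker survives the split because none of its characters are in the delimiter set, so it is looked up as a single token.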

## BPE (Byte-Pair Encoding)
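Byte-pair encoding (BPE) breaks words that are not in the vocabulary into smaller subword units, sometimes down to a single character at a time, depending on the merges learned during BPE training. Unknown words therefore no longer have to be replaced by "<|unk|>".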
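To make the idea of learned merges concrete, here is a toy, character-level sketch: it repeatedly merges the most frequent adjacent pair of symbols in a small word list, then applies the learned merges to new words. The word list, the function names `merge_pair`, `learn_merges`, and `segment`, and the tie-breaking rule are all assumptions for illustration; this is not how tiktoken is implemented.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of training words."""
    corpus = [list(word) for word in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken alphabetically (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges

def segment(word, merges):
    """Split a word into subwords by replaying the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical training words.
words = ["jazz", "jazz", "jazz", "hop", "hop", "hip"]
merges = learn_merges(words, num_merges=4)
print(merges)                      # [('z', 'z'), ('j', 'a'), ('ja', 'zz'), ('o', 'p')]
print(segment("jazz", merges))     # ['jazz']
print(segment("jazzhop", merges))  # ['jazz', 'h', 'op']
print(segment("hip", merges))      # ['h', 'i', 'p']
```

The unseen word "jazzhop" is split into a learned subword, a single character, and another learned subword instead of being mapped to an unknown token. GPT-2's tokenizer works on bytes rather than characters and pre-splits the text with a regex, so this is only an illustration of the merge idea.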
Let's use OpenAI's open-source tiktoken library to showcase the encoding:

In [21]:
import importlib.metadata  # the metadata submodule must be imported explicitly
import tiktoken
print("tiktoken version:", importlib.metadata.version("tiktoken"))


tiktoken version: 0.7.0


In [22]:
tokenizer = tiktoken.get_encoding("gpt2")


In [25]:
text = """But on Instagram it said you liked hip hop more than jazz! <|endoftext|> They said the next couple of days... But
I’m not expecting to find anything out."""
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)

[1537, 319, 10767, 340, 531, 345, 8288, 10359, 1725, 517, 621, 21274, 0, 220, 50256, 1119, 531, 262, 1306, 3155, 286, 1528, 986, 887, 198, 40, 447, 247, 76, 407, 12451, 284, 1064, 1997, 503, 13]


In [26]:
tokenizer.decode(ids)

'But on Instagram it said you liked hip hop more than jazz! <|endoftext|> They said the next couple of days... But\nI’m not expecting to find anything out.'

In [28]:
tokenizer.n_vocab

50257