Section 2.2 of the [GPT-2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) discusses using byte-level BPE for sub-word tokenization. 

Through this NB, we will dive deeper into the [encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py) file (which is misnamed, since it encodes _and_ decodes). 
<hr>

1. The GPT-2 tokenizer adds a regex split before BPE. Because BPE merges greedily by frequency, a common word ends up in many near-duplicate tokens such as `dog.` , `dog!` , `dog?`, which wastes vocabulary slots and couples punctuation with semantics. However, the leading-space form (e.g. `' dog'`) is so common that a single leading space is allowed to merge with a word, which helps compress the text further. 

2. This is done by enforcing some "merging rules", i.e. letters cannot combine with punctuation (and so on) to form a subword. 

3. The `regex` library is used to enforce these separations. 


## Forced splits using Regex patterns

In [1]:
import regex as re

gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") # picked from gpt2 encoder.py
print(re.findall(gpt2pat, "Hello've world123 how's HOW'S are        you!!!?"))

['Hello', "'ve", ' world', '123', ' how', "'s", ' HOW', "'", 'S', ' are', '       ', ' you', '!!!?']


On a high level: we are trying not to merge across letters, punctuation, numbers, and contractions. 

## Important:

In nb1 we took the entire text and passed it through the `encoder` to get tokens. But practically:

1. We enforce the split using regex => `['Hello', "'ve", ' world', '123', ' how', "'s", ' HOW', "'", 'S', ' are', '       ', ' you', '!!!?']`

2. Apply encode on each element of the list 

3. Then concatenate the results

In this process, each list element is tokenized independently before concatenation, so no merge can cross a chunk boundary and our regex rules are followed!
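
The three steps above can be sketched as follows. This is a minimal sketch, not the GPT-2 code: `SPLIT_PAT` is a stdlib-`re` approximation of the pattern (the stdlib has no `\p{L}`), and `TOY_MERGES` is a hypothetical merge table in the style of base1.ipynb.

```python
import re

# Stdlib approximation of the GPT-2 split pattern (an assumption, not the original):
# [^\W\d_] stands in for \p{L} and \d for \p{N}.
SPLIT_PAT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)

# Hypothetical toy merges: (left_id, right_id) -> new_id, numbered in merge order.
TOY_MERGES = {
    (ord("d"), ord("o")): 256,  # "d" + "o"   -> "do"
    (256, ord("g")): 257,       # "do" + "g"  -> "dog"
    (257, ord(".")): 258,       # "dog" + "." -> "dog." (never reachable after the split)
}

def bpe_encode_chunk(chunk, merges):
    """Plain byte-level BPE on one chunk: repeatedly apply the earliest-learnt merge."""
    ids = list(chunk.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        new_id = merges[pair]
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def encode(text, merges):
    """Step 1: regex split. Step 2: encode each chunk alone. Step 3: concatenate."""
    ids = []
    for chunk in SPLIT_PAT.findall(text):
        ids.extend(bpe_encode_chunk(chunk, merges))
    return ids

print(bpe_encode_chunk("dog.", TOY_MERGES))  # no split: [258]
print(encode("dog.", TOY_MERGES))            # with split: [257, 46]
```

Without the split, the toy table happily builds a `dog.` token; with it, `dog` and `.` are encoded separately and the third merge can never fire.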

<hr>

__Some objections to the regex pattern in GPT-2's `encoder.py`:__

- HOW'S vs how's tokenizes differently (the contraction rules are case sensitive)
- ' vs ’ tokenizes differently (only the ASCII apostrophe is recognised)
- Language (English) is hardcoded in the contraction rules
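
The first two objections can be reproduced with a quick check. A minimal sketch: the pattern is again a stdlib-`re` approximation of the GPT-2 pattern (an assumption, since the stdlib lacks `\p{L}`), and `split_gpt2_like` is a hypothetical helper name.

```python
import re

# Stdlib approximation of the GPT-2 pattern: [^\W\d_] ~ \p{L}, \d ~ \p{N}.
PAT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"
)

def split_gpt2_like(text):
    return PAT.findall(text)

for s in ["how's", "HOW'S", "how\u2019s"]:
    print(repr(s), "->", split_gpt2_like(s))
```

Only the lowercase, ASCII-apostrophe form keeps `'s` together as one chunk; the other two break the apostrophe and the `s` apart.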

In [2]:
example = """
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""
print(re.findall(gpt2pat, example))

['\n', 'for', ' i', ' in', ' range', '(', '1', ',', ' 101', '):', '\n   ', ' if', ' i', ' %', ' 3', ' ==', ' 0', ' and', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'FizzBuzz', '")', '\n   ', ' elif', ' i', ' %', ' 3', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Fizz', '")', '\n   ', ' elif', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Buzz', '")', '\n   ', ' else', ':', '\n       ', ' print', '(', 'i', ')', '\n']


There are some additional rules OpenAI has enforced, such as: runs of spaces are never merged. For example, `"    "` + `"  "` don't get merged. It is not clear how they enforced this, since __`encoder.py` is just the inference code__, _not training code_. 

__tiktoken is the official openai library for tokenization (again, only for inference)__

In [3]:
import tiktoken

# GPT-2 (does not merge spaces)
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("    hello world!!! air"))

# GPT-4 (merges spaces)
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("       hello world!!!"))

[220, 220, 220, 23748, 995, 10185, 1633]
[996, 24748, 1917, 12340]


Curiously, in the GPT-4 tokenizer, runs of 1 space, 2 spaces, 3 spaces, ... each correspond to a different token, while in GPT-2 each space corresponds to `220`. 

Checking `openai_public.py` in the tiktoken repo, [cl100k_base tokenizer for GPT-4](https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py), we see the regex pattern has evolved: <br>
`"pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s"""`

- Some problems with the _gpt-2 regex_ string have been fixed here (case sensitivity of contractions, limiting merged number runs to at most 3 digits, punctuation handling, etc.)



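Further note that for a string such as `"      hello world"` with 6 leading spaces, GPT-4 tokenizes it as `"     " + " hello"`, i.e. the last space is attached to the `hello` token and only the first 5 are grouped into one token. The split itself comes from the `\s+(?!\S)` branch of the regex, which matches a run of whitespace but stops one character short of the next non-whitespace character; grouping those 5 spaces into a single token is a merge learnt during training. See the below token division by gpt-4: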

<img title="a title" alt="Alt text" src="images/gpt-4_tokenizer.png" width = 30%>
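
The effect of the `\s+(?!\S)` lookahead can be checked in isolation. This is a simplified sketch that keeps only a lowercase-letter branch and the whitespace branches (an assumption for brevity, not the full GPT-2 or GPT-4 pattern).

```python
import re

# Simplified pattern: letters with an optional leading space, then the two whitespace branches.
with_lookahead = re.compile(r" ?[a-z]+|\s+(?!\S)|\s+")
# Same pattern without the lookahead branch, for contrast.
without_lookahead = re.compile(r" ?[a-z]+|\s+")

text = "      hello world"  # 6 leading spaces

print(with_lookahead.findall(text))     # ['     ', ' hello', ' world']
print(without_lookahead.findall(text))  # ['      ', 'hello', ' world']
```

With the lookahead, the whitespace run gives up its last space so that `hello` keeps its usual leading-space form.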

<hr>

Now let's check a few more aspects of encoder.py of gpt2:


In [17]:
import urllib.request

urllib.request.urlretrieve(
    "https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe", 
    "vocab.bpe"
)

urllib.request.urlretrieve(
    "https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json", 
    "encoder.json"
)

('encoder.json', <http.client.HTTPMessage at 0x17fe49040e0>)

__Note:__ The BPE token IDs in encoder.json are completely different from UTF-8 byte values!

But [encoder.json](encoder.json) plays the same role as the `vocab` dictionary in the [base1.ipynb](base1.ipynb) notebook: it lets us map efficiently between a token's integer ID and the byte sequence it stands for (stored as a printable string). <br>

While wading through [encoder.json](encoder.json), `Ġ` makes many an appearance. It is the printable stand-in for the space byte, so a token starting with `Ġ` has a leading space. 

Also note that
- Karpathy starts with a base UTF-8 vocabulary (indices 0-255 for all possible bytes) and then adds BPE merges starting from index 256
- whereas OpenAI uses a custom base vocabulary that doesn't follow the simple 0-255 byte mapping (there are still 256 base tokens, one per byte, just in a different order), and adds BPE merges on top
- The core BPE algorithm (finding the most frequent pairs and merging them) is the same in both cases; only the starting vocabulary differs
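
The custom base vocabulary comes from a byte-to-printable-character table. Below is a sketch that follows the `bytes_to_unicode` helper in `encoder.py` as I understand it; treat the details as a reconstruction rather than a verbatim copy.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every other byte (control chars, space, ...) is shifted to code point 256 + n.
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

b2u = bytes_to_unicode()
print(b2u[ord(" ")])          # Ġ  (stand-in for the space byte)
print(b2u[ord("\n")])         # Ċ  (stand-in for newline)
print(list(b2u).index(0x20))  # 220
```

The position of the space byte in this table is 220, which appears to line up with the `220` that GPT-2 emits for a lone space in the tiktoken output earlier.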

[vocab.bpe](/tokenizer/vocab.bpe), by contrast, is the list of merges learnt on the training text, in the order they were made. We depart a bit and maintain merges as a dict in [base1.ipynb](/tokenizer/base1.ipynb).



In [2]:
import json

with open('encoder.json', 'r', encoding="utf-8") as f:
    encoder = json.load(f) # equivalent to 'vocab' of base1 nb

with open('vocab.bpe', 'r', encoding="utf-8") as f:
    bpe_data = f.read()
# [1:-1] skips the header line and the empty string after the final newline
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
# ^---- ~equivalent to our "merges"
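
To see what the `[1:-1]` slice and the resulting list look like without downloading anything, here is a sketch on a toy string. `toy_bpe` only mimics the layout of `vocab.bpe` (a header line, then one space-separated pair per line); its contents are made up for illustration.

```python
# Toy stand-in for the contents of vocab.bpe (hypothetical merges).
toy_bpe = "#version: 0.2\nĠ t\nĠ a\nh e\n"

lines = toy_bpe.split("\n")
print(lines)  # header first, empty string last

merges = [tuple(line.split()) for line in lines[1:-1]]

# Earlier line = earlier merge = lower rank; encoding applies the lowest rank first.
ranks = dict(zip(merges, range(len(merges))))
print(ranks)
```

Turning the ordered list into a rank dict gives the same "which pair was merged first" lookup that our `merges` dict provides in base1.ipynb.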

## Special tokens

In [3]:
len(encoder)  # 256 raw byte tokens + 50,000 merges + 1 special token

50257

In [4]:
encoder

{'!': 0,
 '"': 1,
 '#': 2,
 '$': 3,
 '%': 4,
 '&': 5,
 "'": 6,
 '(': 7,
 ')': 8,
 '*': 9,
 '+': 10,
 ',': 11,
 '-': 12,
 '.': 13,
 '/': 14,
 '0': 15,
 '1': 16,
 '2': 17,
 '3': 18,
 '4': 19,
 '5': 20,
 '6': 21,
 '7': 22,
 '8': 23,
 '9': 24,
 ':': 25,
 ';': 26,
 '<': 27,
 '=': 28,
 '>': 29,
 '?': 30,
 '@': 31,
 'A': 32,
 'B': 33,
 'C': 34,
 'D': 35,
 'E': 36,
 'F': 37,
 'G': 38,
 'H': 39,
 'I': 40,
 'J': 41,
 'K': 42,
 'L': 43,
 'M': 44,
 'N': 45,
 'O': 46,
 'P': 47,
 'Q': 48,
 'R': 49,
 'S': 50,
 'T': 51,
 'U': 52,
 'V': 53,
 'W': 54,
 'X': 55,
 'Y': 56,
 'Z': 57,
 '[': 58,
 '\\': 59,
 ']': 60,
 '^': 61,
 '_': 62,
 '`': 63,
 'a': 64,
 'b': 65,
 'c': 66,
 'd': 67,
 'e': 68,
 'f': 69,
 'g': 70,
 'h': 71,
 'i': 72,
 'j': 73,
 'k': 74,
 'l': 75,
 'm': 76,
 'n': 77,
 'o': 78,
 'p': 79,
 'q': 80,
 'r': 81,
 's': 82,
 't': 83,
 'u': 84,
 'v': 85,
 'w': 86,
 'x': 87,
 'y': 88,
 'z': 89,
 '{': 90,
 '|': 91,
 '}': 92,
 '~': 93,
 '¡': 94,
 '¢': 95,
 '£': 96,
 '¤': 97,
 '¥': 98,
 '¦': 99,
 '§': 100

In [5]:
encoder['<|endoftext|>'] # very last, special token. 

50256

^ used to delimit documents in the training set. We insert this token between documents to signal that one doc has ended and that what follows is unrelated. Of course, the meaning of this signal must be _learnt_ by the LM; we are just giving a hint to improve the quality of the data. 

The special token `<|endoftext|>` is immune to merges: it is matched as a whole string rather than built up by BPE. This is visible in tiktoken/src/lib.rs (implemented in Rust), where exceptions are made for special tokens. 
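
One way to picture this special-casing is to cut the text at special-token strings first and only run the ordinary encoder on the pieces in between. This is a sketch of the idea, not tiktoken's actual implementation; `encode_with_special` and the byte-level stand-in encoder are hypothetical.

```python
import re

SPECIAL_TOKENS = {"<|endoftext|>": 50256}

def split_on_special(text):
    """Cut text at special-token strings, keeping the specials as their own parts."""
    pattern = "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
    return [part for part in re.split(pattern, text) if part]

def encode_with_special(text, encode_ordinary):
    ids = []
    for part in split_on_special(text):
        if part in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[part])    # direct lookup, no BPE involved
        else:
            ids.extend(encode_ordinary(part))   # normal regex split + BPE would go here
    return ids

# Stand-in for the real encoder: raw UTF-8 bytes, just to keep the sketch runnable.
raw_bytes = lambda s: list(s.encode("utf-8"))

print(split_on_special("doc one<|endoftext|>doc two"))
print(encode_with_special("a<|endoftext|>b", raw_bytes))  # [97, 50256, 98]
```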

<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">NOTES:</span>

- What we have dealt with so far are base-model encoders/decoders. In the fine-tuned chat models (such as gpt-3.5-turbo; even 4o and 4o-mini are fine-tuned to some extent), there are many more special tokens to delimit the `system prompt`, `user prompt`, etc.: 

<img title="a title" alt="Alt text" src="images/special_tokens.png" width = 60%>

- `<|im_start|>` , `<|im_end|>` are other special tokens. The 'im' in `<|im_start|>` is said to stand for _imaginary monologue_!

- You can add your own special tokens as well, tiktoken has the provision for that. Scroll down through [readme.md](https://github.com/openai/tiktoken) in the "Extending tiktoken" section.

- As we can see [here](https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py) __gpt2__ has only a single special token `{ENDOFTEXT: 50256}`, whereas __gpt4__ has many: `ENDOFTEXT`, `FIM_PREFIX`, `FIM_MIDDLE`, `FIM_SUFFIX`, `ENDOFPROMPT`   <br>

FIM stands for Fill in the middle and is introduced in [this paper](https://arxiv.org/pdf/2207.14255) by OPENAI.

- Once these special tokens are coined, the _model parameters change_ a bit: the embedding table needs one extra row per new token, and the output layer needs one extra output (logit) per new token. The intermediate transformer layers keep their dimensions. 
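
The parameter change for a new special token can be illustrated with plain Python lists. This is a toy sketch with hypothetical sizes; a real model would do the same on framework tensors.

```python
import random

random.seed(0)
d_model, vocab_size = 4, 6  # toy sizes, purely illustrative

def new_row():
    """A freshly initialised d_model-sized vector (small random values)."""
    return [random.gauss(0.0, 0.02) for _ in range(d_model)]

embedding = [new_row() for _ in range(vocab_size)]  # vocab_size x d_model
lm_head = [new_row() for _ in range(vocab_size)]    # one row of weights per output logit

# Adding one special token: one new row in each vocab-sized matrix.
embedding.append(new_row())
lm_head.append(new_row())

print(len(embedding), len(embedding[0]))  # 7 4
print(len(lm_head), len(lm_head[0]))      # 7 4
```

Only the vocab-sized dimension grows; `d_model`, and hence every layer in between, stays as it was.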

<hr>

## sentencepiece

Commonly used because (unlike tiktoken) it can efficiently both train BPE tokenizers and run inference with them. It is used in both the Llama and Mistral series.

[sentencepiece on Github link](https://github.com/google/sentencepiece)

The big difference: __sentencepiece runs BPE on the Unicode code points directly__! It then has an option `character_coverage` for what to do with very rare code points: it either maps them onto an UNK token or, if `byte_fallback` is turned on, encodes them as UTF-8 and emits the raw bytes as tokens instead.

<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">TLDR:</span>

- tiktoken encodes to utf-8 and then BPEs bytes
- sentencepiece BPEs the code points and optionally falls back to utf-8 bytes for rare code points (rarity is determined by character_coverage hyperparameter), which then get translated to byte tokens.
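
The two starting points in the TLDR differ only in what the initial sequence is made of. A small sketch with plain Python built-ins (no tokenizer library involved):

```python
text = "hi 안"

# sentencepiece-style starting point: one symbol per Unicode code point
code_points = [ord(ch) for ch in text]

# tiktoken-style starting point: one symbol per UTF-8 byte
utf8_bytes = list(text.encode("utf-8"))

print(code_points)  # [104, 105, 32, 50504]
print(utf8_bytes)   # [104, 105, 32, 236, 149, 136]
```

ASCII characters look identical in both views, but the Korean syllable is one symbol as a code point and three symbols as bytes.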


[I have compiled examples on how these both operate from first principles here](https://www.notion.so/Sentencepiece-vs-tiktoken-tokenizer-25d6be0e11f1803ea60cc4745cc65f6f?source=copy_link)

In [1]:
import sentencepiece as spm

In [2]:
# write a toy.txt file with some random text
with open("toy.txt", "w", encoding="utf-8") as f:
  f.write("SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.")

^ serves as the training corpus

In [3]:
# train a sentencepiece model on it
# the settings here are (best effort) those used for training Llama 2
import os

options = dict(
  # input spec
  input="toy.txt",
  input_format="text",
  # output spec
  model_prefix="tok400", # output filename prefix
  # algorithm spec
  # BPE alg
  model_type="bpe",
  vocab_size=400,
  # normalization
  normalization_rule_name="identity", # ew, turn off normalization
  remove_extra_whitespaces=False,
  input_sentence_size=200000000, # max number of training sentences
  max_sentence_length=4192, # max number of bytes per sentence
  seed_sentencepiece_size=1000000,
  shuffle_input_sentence=True,
  # rare word treatment
  character_coverage=0.99995,
  byte_fallback=True,
  # merge rules
  split_digits=True,
  split_by_unicode_script=True,
  split_by_whitespace=True,
  split_by_number=True,
  max_sentencepiece_length=16,
  add_dummy_prefix=True,
  allow_whitespace_only_pieces=True,
  # special tokens
  unk_id=0, # the UNK token MUST exist
  bos_id=1, # the others are optional, set to -1 to turn off
  eos_id=2,
  pad_id=-1,
  # systems
  num_threads=os.cpu_count(), # use ~all system resources
)

spm.SentencePieceTrainer.train(**options)


In [4]:
sp = spm.SentencePieceProcessor()
sp.load('tok400.model')
vocab = [[sp.id_to_piece(idx), idx] for idx in range(sp.get_piece_size())]
vocab

[['<unk>', 0],
 ['<s>', 1],
 ['</s>', 2],
 ['<0x00>', 3],
 ['<0x01>', 4],
 ['<0x02>', 5],
 ['<0x03>', 6],
 ['<0x04>', 7],
 ['<0x05>', 8],
 ['<0x06>', 9],
 ['<0x07>', 10],
 ['<0x08>', 11],
 ['<0x09>', 12],
 ['<0x0A>', 13],
 ['<0x0B>', 14],
 ['<0x0C>', 15],
 ['<0x0D>', 16],
 ['<0x0E>', 17],
 ['<0x0F>', 18],
 ['<0x10>', 19],
 ['<0x11>', 20],
 ['<0x12>', 21],
 ['<0x13>', 22],
 ['<0x14>', 23],
 ['<0x15>', 24],
 ['<0x16>', 25],
 ['<0x17>', 26],
 ['<0x18>', 27],
 ['<0x19>', 28],
 ['<0x1A>', 29],
 ['<0x1B>', 30],
 ['<0x1C>', 31],
 ['<0x1D>', 32],
 ['<0x1E>', 33],
 ['<0x1F>', 34],
 ['<0x20>', 35],
 ['<0x21>', 36],
 ['<0x22>', 37],
 ['<0x23>', 38],
 ['<0x24>', 39],
 ['<0x25>', 40],
 ['<0x26>', 41],
 ['<0x27>', 42],
 ['<0x28>', 43],
 ['<0x29>', 44],
 ['<0x2A>', 45],
 ['<0x2B>', 46],
 ['<0x2C>', 47],
 ['<0x2D>', 48],
 ['<0x2E>', 49],
 ['<0x2F>', 50],
 ['<0x30>', 51],
 ['<0x31>', 52],
 ['<0x32>', 53],
 ['<0x33>', 54],
 ['<0x34>', 55],
 ['<0x35>', 56],
 ['<0x36>', 57],
 ['<0x37>', 58],
 ['<0x38>', 5

In [5]:
ids = sp.encode("hello 안녕하세요")
print(ids)

[362, 378, 361, 372, 358, 362, 239, 152, 139, 238, 136, 152, 240, 152, 155, 239, 135, 187, 239, 157, 151]


We have set the hyperparameter `byte_fallback=True`, which is why `안녕하세요`, which does not occur in the training data, is represented by its UTF-8 bytes instead of the unknown token.

__If you set `byte_fallback=False`:__  <br>
`hello` is still encoded, but `안녕하세요` encodes to 0, the `<unk>` (unknown) token. The vocab also won't contain the 256 byte tokens in that case and will be limited to tokens seen in the training set only. 

In [7]:
# decoding

print([sp.id_to_piece(idx) for idx in ids]) 

['▁', 'h', 'e', 'l', 'lo', '▁', '<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '<0xED>', '<0x95>', '<0x98>', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>']
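
The byte-fallback part of the ids above can be reproduced by hand. A minimal sketch, assuming the layout of the vocab printed earlier, where `<0x00>` has id 3 (right after `<unk>`, `<s>`, `</s>`); `byte_fallback` here is a hypothetical helper, not a sentencepiece function.

```python
BYTE_OFFSET = 3  # assumption read off the vocab dump: '<0x00>' is id 3

def byte_fallback(text):
    """Return the byte pieces and ids an unseen string falls back to."""
    data = text.encode("utf-8")
    pieces = [f"<0x{b:02X}>" for b in data]
    ids = [b + BYTE_OFFSET for b in data]
    return pieces, ids

pieces, ids_fb = byte_fallback("안녕하세요")
print(pieces[:3])  # ['<0xEC>', '<0x95>', '<0x88>']
print(ids_fb)
```

Each Korean syllable costs three tokens here, which is the price of never having seen it in the training text.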


__Decoding quirk:__

As you can see, the piece list starts with '▁'. This is because we set the hyperparameter `add_dummy_prefix=True`, which prepends a dummy space to the input before encoding. This leading space serves the following function: 

- Sentence 1: "`hello` world"
- Sentence 2: "Say` hello` to your uncle"

Ideally you want both hellos to be processed in the same way. Without the prefix, the first would start from `hello` and the second from `▁hello`, which are different pieces; the dummy white space makes both `▁hello`. You can set `add_dummy_prefix=False` if you want 'hello' and ' hello' to be treated as separate tokens. 
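
A simplified picture of what the dummy prefix does to the input before any merging happens. This is a sketch of the idea only; `sp_style_preprocess` is a hypothetical name, and the real library does this inside its normalizer.

```python
def sp_style_preprocess(text, add_dummy_prefix=True):
    """Prepend a dummy space (optional) and show spaces as the ▁ marker (U+2581)."""
    if add_dummy_prefix:
        text = " " + text
    return text.replace(" ", "\u2581")

print(sp_style_preprocess("hello world"))                          # ▁hello▁world
print(sp_style_preprocess("Say hello to your uncle"))              # ▁Say▁hello▁to▁your▁uncle
print(sp_style_preprocess("hello world", add_dummy_prefix=False))  # hello▁world
```

With the prefix on, `hello` shows up as `▁hello` in both sentences; with it off, the sentence-initial one loses the marker.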


<hr>


## Vocab size

- It's mostly an empirical hyperparameter
- Where does vocab size come up really? $\rightarrow$ _Embedding table, final linear layer of the NN_
- If `vocab_size` is too high: 
    - each individual token occurs rarely in the training data, so its embedding may be undertrained
    - on the other hand, each token covers more text, so more text fits in the context window, at the risk of compressing _too much_
- If `vocab_size` is too low: 
    - sequences get longer, so the same context window covers less text
    - common text structures are not compressed efficiently

- In modern SOTA models vocab_size in 100k-150k range is seen (2025)

- __Extending vocab__: You may wish to add custom tokens for your use case; this is fairly commonly done. Typically the base model is frozen, the custom tokens are added, and _only_ their new parameters are trained. 
    - You may wish to extend to [compress prompt](https://arxiv.org/pdf/2304.08467) and other such innovative applications too. 
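
To put numbers on the two places where vocab size shows up, here is a back-of-the-envelope sketch. The width `d_model=768` is GPT-2 small's; reusing it for the larger vocab sizes is an assumption made only for comparison.

```python
def vocab_params(vocab_size, d_model):
    """Parameters in the two vocab-sized matrices (biases ignored)."""
    embedding = vocab_size * d_model  # token embedding table
    lm_head = d_model * vocab_size    # final linear layer producing the logits
    return embedding, lm_head

for v in (50257, 100_000, 150_000):
    emb, head = vocab_params(v, d_model=768)
    print(f"vocab={v:>7,}  embedding={emb:>12,}  lm_head={head:>12,}")
```

Both matrices grow linearly with `vocab_size`, and so does the cost of the final softmax over the vocabulary.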


<hr>

## Addressing anomaly behaviors observed in LMs due to tokenization 

- Just watch the next 20 minutes of [this video](https://youtu.be/zduSFxRajkE?si=Qnnvdbds1x0oA4bM&t=6701) of Andrej breaking LLMs, it's too hilarious!
- [SolidGoldMagikarp](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation)


## Final recommendations 
by Andrej Karpathy

- Don't brush off tokenization. A lot of footguns and sharp edges here. Security issues. Safety issues.
- Eternal glory to anyone who can delete tokenization as a required step in LLMs.
- In your own application:
    - Maybe you can just re-use the GPT-4 tokens and tiktoken?
    - If you're training a vocab, ok to use BPE with sentencepiece. Careful with the million settings.
    - Switch to minbpe once it is as efficient as sentencepiece :)

__Also worth looking at:__<br>

[Huggingface Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer). I didn't cover it in detail in the lecture because the algorithm (to my knowledge) is very similar to sentencepiece, but worth potentially evaluating for use in practice.