Section 2.2 of the [GPT-2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) describes how byte-level BPE is used for sub-word tokenization. 

In this NB, we will dive deeper into the [encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py) file (which is somewhat misnamed, since it both encodes _and_ decodes). 
<hr>

1. The GPT-2 paper adds a regex split in front of BPE. Because BPE is greedy, a common word such as `dog` would otherwise get merged with whatever punctuation follows it, so `dog.`, `dog!` and `dog?` would each take up their own vocabulary slot. Spaces are the exception: a word is allowed to merge with the single space in front of it (e.g. `' dog'`), which is very common and helps compress the text further. The net effect is to decouple punctuation from semantics. 

2. This is done by enforcing some "merging rules", i.e. letters cannot combine with punctuation, digits and so on to form a subword. 

3. The `regex` library (a third-party package that, unlike the built-in `re`, supports Unicode classes such as `\p{L}`) is used to enforce these separations. 


## Forced splits using Regex patterns

In [4]:
import regex as re

gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") # picked from gpt2 encoder.py
print(re.findall(gpt2pat, "Hello've world123 how's HOW'S are        you!!!?"))

['Hello', "'ve", ' world', '123', ' how', "'s", ' HOW', "'", 'S', ' are', '       ', ' you', '!!!?']


On a high level: we are trying not to merge across letters, punctuation, numbers, and contractions. 

## Important:

In nb1 we took the entire text and passed it through the `encoder` to get tokens. But practically:

1. We enforce the split using regex => `['Hello', "'ve", ' world', '123', ' how', "'s", ' are', ' you', '!!!?']`

2. Apply encode on each element of the list 

3. Then concatenate the results

In this process, each list element is tokenized independently before concatenation, so no merge can cross a boundary set by the regex!
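
The three steps above can be sketched end to end. This is a minimal illustration under stated assumptions, not the code from `encoder.py`: `approx_pat` is an ASCII-only approximation of the GPT-2 pattern so that it runs on the standard-library `re`, and `toy_merges`, `bpe_encode` and `encode_with_split` are hypothetical names with a hand-made merge table.

```python
import re  # stdlib; the notebook itself uses the third-party `regex` package

# ASCII-only approximation of the GPT-2 split pattern (assumption: [A-Za-z]
# and [0-9] stand in for \p{L} and \p{N}, which stdlib `re` does not support).
approx_pat = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

# Hypothetical toy merge table: (left_id, right_id) -> new_id, in learned order.
toy_merges = {
    (ord("o"), ord("g")): 256,  # "og"
    (ord("d"), 256): 257,       # "dog"
    (257, ord(".")): 258,       # "dog." (a merge across letter/punctuation)
}

def bpe_encode(text, merges):
    """Greedy BPE: repeatedly apply the earliest-learned merge that is present."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

def encode_with_split(text, merges):
    """Split with the regex, encode each chunk on its own, then concatenate."""
    ids = []
    for chunk in approx_pat.findall(text):
        ids.extend(bpe_encode(chunk, merges))
    return ids

whole = bpe_encode("dog.", toy_merges)         # no split: "dog." collapses to one token
split = encode_with_split("dog.", toy_merges)  # split first: "dog" and "." stay apart
print(whole, split)
```

Even though the toy table contains a `dog.` merge, the split version can never use it, because `dog` and `.` are encoded as separate chunks.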

<hr>

__Some objections to the regex pattern used by GPT-2:__

- `HOW'S` and `how's` tokenize differently (the contraction rules are case sensitive)
- `'` and `’` tokenize differently (only the ASCII apostrophe is recognised)
- The language (English) is hardcoded into the contraction rules

In [5]:
example = """
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""
print(re.findall(gpt2pat, example))

['\n', 'for', ' i', ' in', ' range', '(', '1', ',', ' 101', '):', '\n   ', ' if', ' i', ' %', ' 3', ' ==', ' 0', ' and', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'FizzBuzz', '")', '\n   ', ' elif', ' i', ' %', ' 3', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Fizz', '")', '\n   ', ' elif', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Buzz', '")', '\n   ', ' else', ':', '\n       ', ' print', '(', 'i', ')', '\n']


There are some additional rules OpenAI has enforced, such as: runs of spaces are never merged. For example, `"    "` and `"  "` don't get merged. It is not clear how they enforced this, since __`encoder.py` is just the inference code__, _not training code_. 

__tiktoken is the official OpenAI library for tokenization (again, only for inference)__

In [18]:
import tiktoken

# GPT-2 (does not merge spaces)
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("    hello world!!! air"))

# GPT-4 (merges spaces)
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("       hello world!!!"))

[220, 220, 220, 23748, 995, 10185, 1633]
[996, 24748, 1917, 12340]


Curiously, in the GPT-4 tokenizer, runs of 1 space, 2 spaces, 3 spaces, ... each correspond to a different token, while in GPT-2 every space is encoded separately as `220`. 

Checking `openai_public.py` in the tiktoken repo, where the [`cl100k_base` tokenizer for GPT-4](https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py) is defined, we see the regex pattern has evolved: <br>
`"pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s"""`

- Some problems with the _GPT-2 regex_ string have been fixed here (case sensitivity of contractions, runs of digits limited to 3 per chunk, punctuation handling, etc.)
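
Two of these fixes can be checked in isolation. The sketch below pulls only the contraction alternatives out of the two patterns quoted in this notebook, and uses `[0-9]` as an ASCII stand-in for `\p{N}` (an assumption, so that the standard-library `re` is enough). The variable names are hypothetical.

```python
import re

# Contraction alternatives only, copied from the two patterns quoted above.
gpt2_contractions = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d")
cl100k_contractions = re.compile(r"'(?i:[sdmt]|ll|ve|re)")  # (?i:...) = case-insensitive group

gpt2_lower = gpt2_contractions.findall("how's")      # lowercase contraction is found
gpt2_upper = gpt2_contractions.findall("HOW'S")      # uppercase contraction is missed
cl100k_upper = cl100k_contractions.findall("HOW'S")  # found, thanks to (?i:...)
print(gpt2_lower, gpt2_upper, cl100k_upper)

# Digit runs: ASCII stand-ins for ` ?\p{N}+` (GPT-2) and `\p{N}{1,3}` (cl100k_base).
gpt2_numbers = re.compile(r" ?[0-9]+")
cl100k_numbers = re.compile(r"[0-9]{1,3}")

gpt2_digits = gpt2_numbers.findall("1234567")      # one chunk of any length
cl100k_digits = cl100k_numbers.findall("1234567")  # chunks of at most 3 digits
print(gpt2_digits, cl100k_digits)
```

Capping digit chunks at three means a long number can never become a single token, whatever the merges learned during training.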



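Further note that for a string such as `"      hello world"` with 6 leading spaces, GPT-4 tokenizes it as `"     " + " hello"`, i.e. the last space is attached to the `hello` token rather than all 6 being grouped into one. This follows from the `\s+(?!\S)` alternative in the regex rather than from training: its negative lookahead makes a run of whitespace stop one character short of the next non-whitespace character, leaving that final space for the word that follows. See the token division by GPT-4 below: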

<img title="a title" alt="Alt text" src="images/gpt-4_tokenizer.png" width = 30%>
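
The `"     " + " hello"` split can be reproduced with the `\s+(?!\S)` alternative that appears in both patterns quoted in this notebook. The sketch below keeps only that alternative plus simplified, ASCII-only stand-ins for the word rule and the fallback whitespace rule (an assumption made so that it runs on the standard-library `re`); `mini_pat` is a hypothetical name.

```python
import re

# Simplified stand-in for the full pattern:
#   " ?[A-Za-z]+"  an optional single space followed by letters
#   "\s+(?!\S)"    whitespace NOT followed by a non-whitespace character
#   "\s+"          any remaining whitespace
mini_pat = re.compile(r" ?[A-Za-z]+|\s+(?!\S)|\s+")

pieces = mini_pat.findall("      hello world")  # 6 leading spaces
print(pieces)  # the first piece holds 5 spaces; the 6th travels with "hello"
```

The lookahead forces the whitespace run to back off by one character, so the word rule can pick up the last space as its optional leading space.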

<hr>

Now let's check a few more aspects of GPT-2's `encoder.py`:


In [17]:
import urllib.request

urllib.request.urlretrieve(
    "https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe", 
    "vocab.bpe"
)

urllib.request.urlretrieve(
    "https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json", 
    "encoder.json"
)

('encoder.json', <http.client.HTTPMessage at 0x17fe49040e0>)

__Note:__ The BPE token IDs in encoder.json are completely different from UTF-8 byte values!

But [encoder.json](/tokenizer/encoder.json) plays the same role as the `vocab` dictionary in the [base1.ipynb](/tokenizer/base1.ipynb) notebook. It lets us look up the integer ID of a token and, once inverted, the bytes behind an integer ID. <br>

While wading through [encoder.json](/tokenizer/encoder.json), `Ġ` makes many an appearance. It simply stands for the space byte, most often the leading space of a word. 

Also note that:
- Karpathy starts with a base UTF-8 vocabulary (indices 0-255 for all possible bytes) and then adds BPE merges starting from index 256
- OpenAI, in contrast, uses a custom base vocabulary that doesn't follow the simple 0-255 byte mapping, but still adds BPE merges on top
- The core BPE algorithm (finding the most frequent pairs and merging them) is the same in both cases; only the starting vocabulary differs
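
Where `Ġ` and the different base vocabulary come from can be sketched with the byte-to-character table that `encoder.py` builds in its `bytes_to_unicode` function. The version below is a lightly reformatted adaptation of that function, not a verbatim copy, and `karpathy_style_vocab` is a hypothetical name for the plain 0-255 table described above.

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character (adapted from encoder.py)."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # The remaining bytes (space, newline, control characters, ...) are shifted to 256 and up.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()

# The space byte (32) is not in the printable list, so it is shown as chr(256 + 32).
space_char = byte_encoder[ord(" ")]
visible = "".join(byte_encoder[b] for b in " hello".encode("utf-8"))
print(space_char, visible)

# Position of the space byte in the table's insertion order.
space_position = list(byte_encoder).index(ord(" "))

# Hypothetical stand-in for the notebook-1 style base vocabulary: id == byte value.
karpathy_style_vocab = {i: bytes([i]) for i in range(256)}
print(space_position, karpathy_style_vocab[32])
```

In this ordering the space byte lands at position 220, which lines up with the `220` printed for each space by the GPT-2 tokenizer earlier in this notebook, whereas in the plain byte table a space is simply ID 32.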

[vocab.bpe](/tokenizer/vocab.bpe), on the other hand, is the list of merges learned from the training text, in the order they were learned. We depart a bit and maintain merges as a dict in [base1.ipynb](/tokenizer/base1.ipynb).
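
The layout of `vocab.bpe` can be illustrated with a small sketch: a version header on the first line, then one merge per line written as two space-separated symbols. The sample text below is hand-written in that format as an assumption for illustration, not copied from the real file, and `parse_merges` is a hypothetical helper name.

```python
# Hand-written sample in the vocab.bpe layout (illustrative, not the real file contents).
sample_bpe = "#version: 0.2\nĠ t\nh e\nĠt he\n"

def parse_merges(bpe_text):
    """Return {(left, right): rank}; a lower rank means the merge was learned earlier."""
    lines = bpe_text.split("\n")[1:]  # skip the version header
    pairs = [tuple(line.split()) for line in lines if line.strip()]
    return {pair: rank for rank, pair in enumerate(pairs)}

bpe_ranks = parse_merges(sample_bpe)
print(bpe_ranks)
```

Keying the dict by pair makes "which present pair was learned first?" a cheap lookup at encode time. If the `merges` dict in base1.ipynb maps each pair to its new token ID (an assumption about that notebook), the two structures differ only in what the value means.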

