<a href="https://colab.research.google.com/github/Firojpaudel/Demystifying_Language_Modeling/blob/main/GPT_Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The previous notebook: [BPE Tokenizer](https://github.com/Firojpaudel/Demystifying_Language_Modeling/blob/main/BPE_Tokenization.ipynb)

The [**GPT-2** paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) from OpenAI discusses a drawback of vanilla BPE: common words like `dog` end up in the vocabulary in many variants, such as `dog.`, `dog!`, and `dog?`.

<div align="center">
  <img src="https://i.postimg.cc/vTmDt3ds/Screenshot-2.png" height="375">
  <p><b><i>SS_1:</i></b> <i>Snippet from the paper</i></p>
</div>

To address this, they add a fairly complex regex on top of BPE. The main code is in the OpenAI repo: [encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)

They have a regex defined:

```
re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
```

Concept: *First we break the text down into chunks, then we tokenize*

In [14]:
import regex as re

GPT2pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(re.findall(GPT2pattern, "Hello dog! Whats up dog?"))

['Hello', ' dog', '!', ' Whats', ' up', ' dog', '?']


##### Explaining the regex here:

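- When we encounter `'s`, `'t`, `'re`, `'ve`, `'m`, `'ll`, or `'d`, we split it off into a separate chunk first.

- Next up we have ` ?\p{L}+` and ` ?\p{N}+`: an optional leading space followed by one or more letters, or by one or more numbers. Letters and numbers never share a chunk, so a text like "Hello123" gets split as `['Hello', '123']`.

- Next we have ` ?[^\s\p{L}\p{N}]+`. This matches a run of characters that are **not whitespace, letters, or numbers**, again with an optional leading space. So that covers all special characters like `!@#..`.

- And finally we have `\s+(?!\S)|\s+`. Here, `\s+` matches one or more whitespace characters. The negative lookahead `(?!\S)` asserts that the matched whitespace is not followed by a non-whitespace character (`\S`). In practice, a run of whitespace before a word gives up its last character, and that last space attaches to the word instead (this is why we get `' dog'` above). The plain `\s+` at the end catches whatever whitespace is left, such as a single newline directly before a word.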
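
To check each of these rules in isolation, here is a small sketch that runs the same pattern on a few short strings. It assumes the `regex` package is installed, as in the cell above; if it is not, it falls back to the standard `re` module with rough stand-ins for the `\p{...}` classes. That fallback is an approximation that only behaves the same on plain ASCII text, and the names `LETTER`, `NUMBER`, `OTHER`, and `PATTERN` are just for illustration.

```python
try:
    import regex as re  # third-party package, as used in this notebook
    LETTER, NUMBER, OTHER = r"\p{L}", r"\p{N}", r"[^\s\p{L}\p{N}]"
except ImportError:
    import re  # stdlib fallback (assumption): approximate, ASCII-oriented classes
    LETTER, NUMBER, OTHER = r"[^\W\d_]", r"\d", r"(?:[^\s\w]|_)"

# Same alternatives, in the same order, as the GPT-2 pattern
PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"
    rf"| ?{LETTER}+| ?{NUMBER}+| ?{OTHER}+"
    r"|\s+(?!\S)|\s+"
)

samples = [
    "Hello123",     # letters and numbers are split apart
    "don't",        # contraction suffix becomes its own chunk
    "DON'T",        # the contraction rules are lowercase only
    "hi!!! there",  # punctuation run, then a space-prefixed word
    "a   b",        # the whitespace run leaves one space for the next word
]
for text in samples:
    print(repr(text), "->", PATTERN.findall(text))

# 'Hello123'    -> ['Hello', '123']
# "don't"       -> ['don', "'t"]
# "DON'T"       -> ['DON', "'", 'T']
# 'hi!!! there' -> ['hi', '!!!', ' there']
# 'a   b'       -> ['a', '  ', ' b']
```

Note the `"DON'T"` case: the contraction alternatives are written in lowercase, so an uppercase contraction is split differently from a lowercase one.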

In [11]:
##@ Test...
test1 = """
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""
br = re.findall(GPT2pattern, test1)
print(br)

['\n', 'for', ' i', ' in', ' range', '(', '1', ',', ' 101', '):', '\n   ', ' if', ' i', ' %', ' 3', ' ==', ' 0', ' and', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'FizzBuzz', '")', '\n   ', ' elif', ' i', ' %', ' 3', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Fizz', '")', '\n   ', ' elif', ' i', ' %', ' 5', ' ==', ' 0', ':', '\n       ', ' print', '("', 'Buzz', '")', '\n   ', ' else', ':', '\n       ', ' print', '(', 'i', ')', '\n']


After this, we can pass the chunks into the tokenizer we defined in the previous notebook (BPE_Tokenization).

In [20]:
str_to_id = {}
id_to_str = {}
base_str_to_id = {}
base_id_to_str = {}
vocab = {}

def tokenize(text):
    tokens = re.findall(GPT2pattern, text)
    ids = []
    for token in tokens:
        if token not in base_str_to_id:
            idx = len(base_str_to_id)
            base_str_to_id[token] = idx
            base_id_to_str[idx] = token
        ids.append(base_str_to_id[token])
    return ids
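
To make the chunk-to-id step concrete before the full training loop, here is a minimal sketch on one line of the FizzBuzz test. It hard-codes the chunks the pattern produces for `" i % 3 == 0 and i % 5 == 0"` so that it runs without the `regex` package. The helper names `assign_ids` and `count_pairs` are hypothetical; they mirror what `tokenize` and `get_stats` do in this notebook.

```python
# Chunks as the GPT-2 pattern splits " i % 3 == 0 and i % 5 == 0"
chunks = [" i", " %", " 3", " ==", " 0", " and", " i", " %", " 5", " ==", " 0"]

def assign_ids(chunks):
    """Give each distinct chunk the next free integer id (hypothetical helper)."""
    vocab = {}
    ids = []
    for chunk in chunks:
        if chunk not in vocab:
            vocab[chunk] = len(vocab)
        ids.append(vocab[chunk])
    return ids, vocab

def count_pairs(ids):
    """Count how often each adjacent pair of ids occurs (hypothetical helper)."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

ids, vocab = assign_ids(chunks)
id_to_chunk = {i: chunk for chunk, i in vocab.items()}

pair_counts = count_pairs(ids)
best_pair = max(pair_counts, key=pair_counts.get)  # first pair with the top count
merged = id_to_chunk[best_pair[0]] + id_to_chunk[best_pair[1]]

print(ids)                      # [0, 1, 2, 3, 4, 5, 0, 1, 6, 3, 4]
print(best_pair, repr(merged))  # (0, 1) ' i %'
```

Repeated chunks share an id, and the most frequent adjacent pair becomes the first merge candidate. One thing to keep in mind: in this notebook the merges join whole regex chunks, whereas the linked `encoder.py` applies BPE to the bytes inside each chunk, so its merges never cross a chunk boundary.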

In [24]:
##@ Redefining all that here once again

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if (
            i < len(ids) - 1 and
            ids[i] == pair[0] and
            ids[i+1] == pair[1]
        ):
            # Safety check: ensure we’re not replacing huge stuff repeatedly
            if len(id_to_str[pair[0]] + id_to_str[pair[1]]) > 40:  # or any threshold
                newids.append(ids[i])
                i += 1
                continue

            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


merges = {}
ids = tokenize(test1 := """
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
""")
str_to_id = base_str_to_id.copy()
id_to_str = base_id_to_str.copy()

num_merges = 20
for i in range(num_merges):
    stats = get_stats(ids)
    if not stats:
        break
    pair = max(stats, key=stats.get)
    idx = max(id_to_str) + 1
    merged_token = id_to_str[pair[0]] + id_to_str[pair[1]]

    print(f"merging {pair} → '{merged_token}' into new token {idx}")

    str_to_id[merged_token] = idx
    id_to_str[idx] = merged_token
    ids = merge(ids, pair, idx)
    merges[pair] = idx

def encode(text):
    tokens = re.findall(GPT2pattern, text)
    # Note: raises KeyError for any chunk that was not seen during training
    ids = [base_str_to_id[token] for token in tokens]

    merge_count = 0
    max_merges = 20  #! or same as num_merges from training

    while merge_count < max_merges:
        stats = get_stats(ids)
        if not stats:
            break
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        idx = merges[pair]
        ids = merge(ids, pair, idx)
        merge_count += 1

    return ids

def decode(ids):
    return "".join(id_to_str[idx] for idx in ids)

print("-------")
print("Encoded part:\n")
encoded_ids = encode(test1)
print(encoded_ids)
print("-------")
print("\nDecoded output:\n")
print(decode(encoded_ids))

merging (2, 12) → ' i %' into new token 30
merging (14, 15) → ' == 0' into new token 31
merging (18, 19) → ':
       ' into new token 32
merging (32, 20) → ':
        print' into new token 33
merging (31, 33) → ' == 0:
        print' into new token 34
merging (34, 21) → ' == 0:
        print("' into new token 35
merging (23, 10) → '")
   ' into new token 36
merging (30, 13) → ' i % 3' into new token 37
merging (30, 17) → ' i % 5' into new token 38
merging (38, 35) → ' i % 5 == 0:
        print("' into new token 39
merging (36, 24) → '")
    elif' into new token 40
merging (0, 1) → '
for' into new token 41
merging (41, 2) → '
for i' into new token 42
merging (42, 3) → '
for i in' into new token 43
merging (43, 4) → '
for i in range' into new token 44
merging (44, 5) → '
for i in range(' into new token 45
merging (45, 6) → '
for i in range(1' into new token 46
merging (46, 7) → '
for i in range(1,' into new token 47
merging (47, 8) → '
for i in range(1, 101' into new token 48
merging (48