# Overview

The tokenizer is a completely separate module, independent of the LLM. It has its own training dataset of text, on which the vocabulary is trained with the Byte Pair Encoding (BPE) algorithm. Once trained, it translates back and forth between raw text and sequences of tokens. The LLM only ever sees the tokens and never deals with text directly.
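
Before any merges are learned, the "tokens" are simply the UTF-8 bytes of the text. The sketch below illustrates that starting point. The sample string is an assumption chosen for illustration, not the notebook's data. It shows why the byte sequence can be longer than the string: non-ASCII characters take more than one byte.

```python
# Illustrative only: the sample string is an assumption, not the notebook's data.
sample = "RACE – fast"  # contains an en dash, which is not ASCII

raw = sample.encode("utf-8")  # str -> bytes
byte_ids = list(raw)          # bytes -> list of ints in 0..255

print("characters:", len(sample))
print("bytes:     ", len(byte_ids))
print(byte_ids)

# Decoding the same bytes restores the original string exactly.
print(bytes(byte_ids).decode("utf-8") == sample)
```

The en dash is encoded as the three bytes 226, 128, 147, so the string has 11 characters but 13 bytes.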

In [1]:
text="RMIT University’s AWS Cloud Supercomputing facility, or RACE, opened in July this year for RMIT researchers, who are now using it to power advances into battery technologies, photonics and geospatial science. With its public launch this week, external research partners are now able to use it too. RACE provides fast, secure and private connections – powered by Amazon Web Services (AWS) and AARNet – ideal for workloads that require higher speed and fewer delays than the internet. RACE Director Dr Robert Shen said the increased bandwidth gives researchers, students, and industry partners the ability make discoveries faster and for RMIT to fast-track the time between initial concepts and products going to market. “RACE will enable researchers to test out ideas and solutions up to 80 times faster compared to the existing on-premises servers,” Shen said. “Research typically involves many failures before success: this facility lets researchers fail quickly so they can fine-tune their solutions and improve them.”"

tokens=text.encode('utf-8')
tokens=list(map(int, tokens)) # 0-255

print('---')
print(text)
print('length:', len(text))
print('---')
print(tokens)
print('length:', len(tokens))

---
RMIT University’s AWS Cloud Supercomputing facility, or RACE, opened in July this year for RMIT researchers, who are now using it to power advances into battery technologies, photonics and geospatial science. With its public launch this week, external research partners are now able to use it too. RACE provides fast, secure and private connections – powered by Amazon Web Services (AWS) and AARNet – ideal for workloads that require higher speed and fewer delays than the internet. RACE Director Dr Robert Shen said the increased bandwidth gives researchers, students, and industry partners the ability make discoveries faster and for RMIT to fast-track the time between initial concepts and products going to market. “RACE will enable researchers to test out ideas and solutions up to 80 times faster compared to the existing on-premises servers,” Shen said. “Research typically involves many failures before success: this facility lets researchers fail quickly so they can fine-tune their solu

# Merge the most common pairs

In [2]:
def get_stats(ids):
    counts={}
    for pair in zip(ids, ids[1:]):
        counts[pair]=counts.get(pair,0)+1
    return counts

def merge(ids, pair, idx):
    # replace all consecutive occurrences of pair with the new token idx
    newids=[]
    i=0
    while i<len(ids):
        # if we are not at the very last position and the pair matches, replace it
        if i<len(ids)-1 and ids[i]==pair[0] and ids[i+1]==pair[1]:
            newids.append(idx)
            i+=2
        else:
            newids.append(ids[i])
            i+=1
    return newids

# testing
print(f"An example for merge function {merge([5,6,6,7,9,1],[6,7],99)}")

vocab_size=276 # final vocabulary size
num_merges=vocab_size-256
ids=list(tokens) # copy

merges={} # (int, int) -> int
for i in range(num_merges):
    stats=get_stats(ids)
    pair=max(stats, key=stats.get) # most frequent pair (on ties, the first one encountered)
    idx=256+i
    print(f"merging {pair} into a new token {idx}" )
    ids=merge(ids, pair, idx)
    merges[pair]=idx

An example for merge function [5, 6, 99, 9, 1]
merging (115, 32) into a new token 256
merging (101, 114) into a new token 257
merging (32, 116) into a new token 258
merging (114, 101) into a new token 259
merging (100, 32) into a new token 260
merging (97, 110) into a new token 261
merging (105, 110) into a new token 262
merging (258, 104) into a new token 263
merging (97, 114) into a new token 264
merging (115, 101) into a new token 265
merging (105, 116) into a new token 266
merging (261, 260) into a new token 267
merging (102, 97) into a new token 268
merging (44, 32) into a new token 269
merging (99, 104) into a new token 270
merging (111, 110) into a new token 271
merging (115, 116) into a new token 272
merging (101, 32) into a new token 273
merging (121, 32) into a new token 274
merging (226, 128) into a new token 275
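
To make the merge loop easier to follow, here is a minimal sketch on a toy string, assuming the same counting and replacement logic as the cell above. The names `pair_counts`, `replace_pair`, `toy_ids`, and `toy_merges` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
# Toy walk-through of the BPE merge loop. All names here are hypothetical.

def pair_counts(ids):
    """Count how often each adjacent pair occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def replace_pair(ids, pair, idx):
    """Replace each occurrence of pair with idx, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

toy_ids = list("aaabdaaabac".encode("utf-8"))
toy_merges = {}
for step in range(3):
    counts = pair_counts(toy_ids)
    best = max(counts, key=counts.get)  # on ties, the pair seen first wins
    new_id = 256 + step
    toy_ids = replace_pair(toy_ids, best, new_id)
    toy_merges[best] = new_id
    print(f"step {step}: merged {best} -> {new_id}, ids = {toy_ids}")
```

After three merges the eleven input bytes shrink to five ids, and each new token is built from tokens created in earlier steps.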


In [3]:
print("tokens length:", len(tokens))
print("ids length:", len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")

tokens length: 1034
ids length: 795
compression ratio: 1.30X


# Decoding

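Given a sequence of integers in the range [0, vocab_size), what is the text?

**Note: Not every byte sequence is valid UTF-8. For example, byte 128 on its own is a continuation byte and cannot start a character. We pass `errors="replace"` to `bytes.decode` so that such bytes become the replacement character � instead of raising an error. This means not every token sequence decodes to valid text.**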

In [4]:
vocab={idx:bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx]=vocab[p0]+vocab[p1]

def decode(ids):
    # given ids (list of integers), return Python string
    tokens=b"".join(vocab[idx] for idx in ids)
    text=tokens.decode("utf-8", errors="replace")
    return text

print(decode([128]))

�
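
The � above comes from Python's `errors="replace"` handling. As a small sketch of what happens without it, the snippet below compares the default strict mode with the replace mode on the same single byte. The variable names are illustrative.

```python
lone_continuation = bytes([128])  # 0x80 is a continuation byte, invalid on its own

try:
    lone_continuation.decode("utf-8")  # strict mode is the default
    strict_ok = True
except UnicodeDecodeError as err:
    strict_ok = False
    print("strict decode failed:", err.reason)

# errors="replace" substitutes U+FFFD (the replacement character) instead of raising.
replaced = lone_continuation.decode("utf-8", errors="replace")
print(repr(replaced))
```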


# Encoding

Given a string, what are the tokens?

In [5]:
merges

{(115, 32): 256,
 (101, 114): 257,
 (32, 116): 258,
 (114, 101): 259,
 (100, 32): 260,
 (97, 110): 261,
 (105, 110): 262,
 (258, 104): 263,
 (97, 114): 264,
 (115, 101): 265,
 (105, 116): 266,
 (261, 260): 267,
 (102, 97): 268,
 (44, 32): 269,
 (99, 104): 270,
 (111, 110): 271,
 (115, 116): 272,
 (101, 32): 273,
 (121, 32): 274,
 (226, 128): 275}

In [6]:
def encode(text):
    # given a string return tokens
    tokens=list(text.encode('utf-8'))
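    while len(tokens)>=2: # need at least one pair; avoids calling min() on empty stats
        stats=get_stats(tokens)
        # pick the pair with the lowest merge index, i.e. the earliest learned merge
        pair=min(stats, key=lambda p: merges.get(p, float("inf")))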
        if pair not in merges:
            break
        idx=merges[pair]
        tokens=merge(tokens, pair, idx)
    return tokens

print(encode('hello python'))

[104, 101, 108, 108, 111, 32, 112, 121, 116, 104, 271]
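
The order of merges matters during encoding: in the table above, (258, 104) -> 263 can only apply after (32, 116) -> 258 has produced token 258. The sketch below shows the same dependency with a hypothetical two-entry table (`demo_merges` and `demo_encode` are illustrative names, not part of the notebook). It also handles inputs shorter than two bytes, for which there is no pair to merge.

```python
# Hypothetical merge table for illustration; not the table learned above.
demo_merges = {
    (97, 98): 256,   # "ab"  -> 256
    (256, 99): 257,  # "abc" -> 257, only reachable once 256 exists
}

def demo_encode(text):
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Choose the pair learned earliest; pairs not in the table rank last.
        pair = min(pairs, key=lambda p: demo_merges.get(p, float("inf")))
        if pair not in demo_merges:
            break  # nothing left to merge
        new_id, out, i = demo_merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

print(demo_encode("abc"))  # both merges apply, in learned order
print(demo_encode("a"))    # a single byte has no pairs to merge
print(demo_encode(""))     # empty input yields an empty list
```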


# Testing

In [7]:
print(decode(encode("hello tokenizer")))

hello tokenizer


In [8]:
valtext='rust fails to build with cargo command error'
valtext2=decode(encode(valtext))
print(valtext==valtext2)

True


# Forced splits using regex patterns (GPT series)

In [9]:
import tiktoken # only for tokenizer inference

# GPT-2 (does not merge spaces)
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("    hello world!!!"))

# GPT-4 (merges spaces)
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("    hello world!!!"))

[220, 220, 220, 23748, 995, 10185]
[262, 24748, 1917, 12340]
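
The "forced splits" in this section's title refer to splitting the text into chunks with a regular expression before BPE runs, so that merges never cross a chunk boundary. GPT-2's published pattern relies on the third-party `regex` module and the Unicode classes `\p{L}` and `\p{N}`. The sketch below is only an approximation of it using the standard library `re`, and the name `APPROX_GPT2_SPLIT` is hypothetical. The character classes differ from the original on some Unicode edge cases.

```python
import re

# Simplified stand-in for GPT-2's split pattern (an assumption, not the
# original): \p{L} and \p{N} are approximated with standard-library classes.
APPROX_GPT2_SPLIT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"  # common English contractions
    r"| ?[^\W\d_]+"             # optional space + run of letters
    r"| ?\d+"                   # optional space + run of digits
    r"| ?(?:[^\s\w]|_)+"        # optional space + run of punctuation/symbols
    r"|\s+(?!\S)"               # whitespace run, leaving one space for the next word
    r"|\s+"                     # any remaining whitespace
)

sample = "    hello world!!!"
chunks = APPROX_GPT2_SPLIT.findall(sample)
print(chunks)

# The chunks partition the input: joining them restores the original text.
print("".join(chunks) == sample)
```

With this pattern the four leading spaces split into a run of three spaces plus the space attached to `hello`, which appears consistent with the three separate leading tokens in the GPT-2 output above.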


In [10]:
!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe
!wget https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json

--2025-01-27 07:38:11--  https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/vocab.bpe
Resolving openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)... 57.150.97.129
Connecting to openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)|57.150.97.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [application/octet-stream]
Saving to: ‘vocab.bpe’


2025-01-27 07:38:11 (1.43 MB/s) - ‘vocab.bpe’ saved [456318/456318]

--2025-01-27 07:38:11--  https://openaipublic.blob.core.windows.net/gpt-2/models/1558M/encoder.json
Resolving openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)... 57.150.97.129
Connecting to openaipublic.blob.core.windows.net (openaipublic.blob.core.windows.net)|57.150.97.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [application/json]
Saving to: ‘encoder.json’


2025-01-27 07:38:12 (2.15 MB/s) - ‘encode

In [11]:
# https://github.com/openai/gpt-2/blob/master/src/encoder.py --> the code for inference rather than the training code for tokenizer

import json

with open('encoder.json', 'r') as f:
    vocab = json.load(f)

with open('vocab.bpe', 'r', encoding="utf-8") as f:
    merges = f.read()
bpe_merges = [tuple(merge_str.split()) for merge_str in merges.split('\n')[1:-1]]

In [12]:
len(vocab) # 256 raw byte tokens + 50,000 merges + 1 special token (the only one in GPT-2)
vocab['<|endoftext|>']

50256

# Reference
* https://www.rmit.edu.au/news/all-news/2022/oct/race-hub-launch
* https://youtu.be/zduSFxRajkE?feature=shared