# 2. BPE Tokenizer

## 2.1 The Unicode Standard

In [None]:
ord("牛")

In [None]:
chr(29275)

In [None]:
chr(0)  # null character, U+0000

In [None]:
ord('\x00')

In [None]:
print(chr(0)) # not printable

In [None]:
"this is a test" + chr(0) + "string"

In [None]:
print("this is a test" + chr(0) + "string")

## 2.2 Unicode Encodings

In [None]:
test_string = "hello! こんにちは!"
utf8_encoded = test_string.encode("utf-8")
print(utf8_encoded)
print(type(utf8_encoded))
# Get the byte values for the encoded string (integers from 0 to 255).
list(utf8_encoded)
# One byte does not necessarily correspond to one Unicode character!
print(len(test_string))
print(len(utf8_encoded))
print(utf8_encoded.decode("utf-8"))
print(list(utf8_encoded))
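To make the byte/character mismatch concrete, the following small sketch measures how many UTF-8 bytes each character of a mixed string needs. The sample string and the `widths` name are chosen here for illustration only.

```python
# One character from each UTF-8 width class:
# "a" (ASCII), "é" (U+00E9), "牛" (U+725B), "🚀" (U+1F680).
sample = "a\u00e9\u725b\U0001F680"

# Number of UTF-8 bytes needed for each character.
widths = {ch: len(ch.encode("utf-8")) for ch in sample}
print(widths)

# The per-character widths add up to the length of the encoded string.
total_bytes = len(sample.encode("utf-8"))
print(len(sample), "characters ->", total_bytes, "bytes")
```

UTF-8 uses between one and four bytes per code point, which is why `len(test_string)` and `len(utf8_encoded)` differ above.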

In [None]:
# Test with various strings
test_strings = [
    "hello",                    # Pure ASCII
    "hello! こんにちは!",        # Mixed ASCII and Japanese
    "Hello 世界",                # Mixed ASCII and Chinese
    "🚀 rocket",                 # Emoji and ASCII
    "Café",                      # ASCII with accents
]

for s in test_strings:
    utf8 = s.encode("utf-8")
    utf16 = s.encode("utf-16")
    utf32 = s.encode("utf-32")
    
    print(f"\nString: '{s}'")
    print(f"  Characters: {len(s)}")
    print(f"  UTF-8:  {len(utf8):2d} bytes - {utf8}")
    print(f"  UTF-16: {len(utf16):2d} bytes - {utf16}")
    print(f"  UTF-32: {len(utf32):2d} bytes - {utf32}")

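For every test string above, UTF-8 produces the fewest bytes. Note that Python's `utf-16` and `utf-32` codecs also prepend a byte order mark (BOM), which adds 2 and 4 bytes respectively to the counts.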
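The comparison above depends on the input text. As a quick check (a sketch with variable names chosen here for illustration), the snippet below isolates the BOM overhead of the generic codecs and shows a case where UTF-8 is not the most compact encoding.

```python
text = "a"

# The generic codecs prepend a BOM; the "-le" variants do not.
bom_sizes = {
    "utf-16": len(text.encode("utf-16")) - len(text.encode("utf-16-le")),
    "utf-32": len(text.encode("utf-32")) - len(text.encode("utf-32-le")),
}
print(bom_sizes)

# For CJK-only text, UTF-16 needs fewer bytes than UTF-8.
cjk = "\u4e16\u754c"  # "世界"
cjk_utf8 = len(cjk.encode("utf-8"))
cjk_utf16 = len(cjk.encode("utf-16-le"))
print(cjk_utf8, cjk_utf16)
```

So UTF-8 wins on the mostly-ASCII strings tested here, but text dominated by CJK characters takes 3 bytes per character in UTF-8 versus 2 in UTF-16.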

In [None]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

decode_utf8_bytes_to_str_wrong("hello!".encode("utf-8"))

In [None]:
# Raises UnicodeDecodeError: some characters require multiple bytes,
# so decoding one byte at a time cannot work.
# decode_utf8_bytes_to_str_wrong("hello! こんにちは!".encode("utf-8"))
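For contrast, here is a minimal sketch of two ways to decode correctly. The function names `decode_utf8_bytes_to_str` and `decode_utf8_bytewise` are hypothetical and introduced only for this example. The first decodes the whole byte string at once. The second still feeds one byte at a time, but through the standard library's incremental decoder, which buffers incomplete multi-byte sequences between calls.

```python
import codecs


def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the whole sequence at once so multi-byte characters stay intact.
    return bytestring.decode("utf-8")


def decode_utf8_bytewise(bytestring: bytes) -> str:
    # The incremental decoder keeps partial sequences until they are complete.
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(bytes([b])) for b in bytestring]
    pieces.append(decoder.decode(b"", final=True))  # flush; errors if truncated
    return "".join(pieces)


text = "hello! \u3053\u3093\u306b\u3061\u306f!"  # "hello! こんにちは!"
data = text.encode("utf-8")
print(decode_utf8_bytes_to_str(data) == text)
print(decode_utf8_bytewise(data) == text)

# A lone lead byte is not valid UTF-8 on its own, which is what breaks
# the naive byte-by-byte version.
lone_byte_failed = False
try:
    data[7:8].decode("utf-8")
except UnicodeDecodeError as exc:
    lone_byte_failed = True
    print("UnicodeDecodeError:", exc.reason)
```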

## 2.4 BPE Tokenizer Training

In [None]:
import regex as re

PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Test Case 1: Basic contractions
text1 = "I'm happy, you're sad, they'll go, we've seen, it's fine"
print(re.findall(PAT, text1))

# Test Case 2: Numbers
text2 = "There are 123 cats and 456 dogs in 2024"
print(re.findall(PAT, text2))

- A leading space is attached to the token that follows it: " hello" → [' hello']
- Contractions are split off: "don't" → ['don', "'t"]
- Runs of punctuation stay together: "!!!" → ['!!!']
- Unicode-aware: \p{L} and \p{N} match letters and numbers in any script
- Letters and digits are split apart: "GPT4" → ['GPT', '4']
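Once text is pre-tokenized, BPE training counts adjacent byte pairs inside each pre-token and repeatedly merges the most frequent pair. The sketch below illustrates that loop on a toy corpus. It rests on assumptions that are not taken from the trainer used in this notebook: whitespace splitting stands in for `PAT` to stay dependency-free, ties are broken toward the lexicographically greater pair, and the function name `train_toy_bpe` is hypothetical.

```python
from collections import Counter


def train_toy_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    # Assumption: str.split() replaces the PAT-based pre-tokenizer.
    word_counts = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8")) for word in text.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each pre-token occurs.
        pair_counts = Counter()
        for word, count in word_counts.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Assumption: ties go to the lexicographically greater pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every pre-token with the chosen pair merged.
        new_counts = Counter()
        for word, count in word_counts.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_counts[tuple(out)] += count
        word_counts = new_counts
    return merges


merges = train_toy_bpe("low low low lower lowest", 2)
print(merges)
```

This version recounts every pair on each iteration, which is easy to read but slow. A practical trainer would update counts incrementally for only the pre-tokens touched by a merge.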

## 2.5 Experimenting with BPE Tokenizer Training

- Problem (train_bpe_tinystories): BPE Training on TinyStories

In [None]:
from bpe_trainer import run_train_bpe, save_bpe_model, load_bpe_model

input_path = "data/TinyStoriesV2-GPT4-train.txt"

vocab, merges = run_train_bpe(
    input_path=input_path,
    vocab_size=10000,
    special_tokens=["<|endoftext|>"],
    num_processes=10
)

save_bpe_model(vocab, merges, output_dir="bpe_model")

In [None]:
import subprocess
# Launch snakeviz to visualize profiling results
subprocess.Popen(['snakeviz', 'bpe.prof'])
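The notebook does not show how `bpe.prof` was produced. One way to generate such a file is the standard library's `cProfile`, sketched below on a toy function. `toy_workload` and the `toy.prof` path are hypothetical stand-ins for the real training call and output file.

```python
import cProfile
import pstats
import tempfile
from pathlib import Path


def toy_workload(n: int = 50_000) -> int:
    # Hypothetical stand-in for the training call being profiled.
    return sum(len(str(i)) for i in range(n))


prof_path = Path(tempfile.mkdtemp()) / "toy.prof"

# Record profiling data only around the call of interest.
profiler = cProfile.Profile()
profiler.enable()
toy_workload()
profiler.disable()
profiler.dump_stats(str(prof_path))

# Print the five most expensive entries by cumulative time.
stats = pstats.Stats(str(prof_path))
stats.sort_stats("cumulative").print_stats(5)
```

The resulting file can be opened with `snakeviz` in the same way as `bpe.prof` above, or inspected in text form with `pstats`.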

In [None]:
from bpe_trainer_heap import run_train_bpe, save_bpe_model, load_bpe_model

input_path = "data/owt_train.txt"

vocab, merges = run_train_bpe(
    input_path=input_path,
    vocab_size=32000,
    special_tokens=["<|endoftext|>"],
    num_processes=10
)

save_bpe_model(vocab, merges, output_dir="bpe_model")

Measured on a Mac with 16 GB of RAM and an M4 processor:

| Rank | Function | Time (s) | % of Total | Calls | Notes |
|------|----------|----------|------------|-------|-------|
| 1 | run_train_bpe | 641.3 | 74.6% | 1 | Core BPE algorithm (unavoidable) |
| 2 | posix.read | 111.5 | 13.0% | 76 | File I/O - reading chunks |
| 3 | max() | 55.2 | 6.4% | 9,804 | Finding best pair each iteration |
| 4 | len() | 51.3 | 6.0% | 1.17B | **Excessive calls - can optimize** |
| 5 | lambda | 25.4 | 3.0% | 369M | Lambda overhead in max() |

The `dict.get` and `list.append` optimizations are already applied; they cut the time spent in merging by 40%.
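The `max()` and lambda rows in the table come from scanning every pair on each merge. One common alternative is a heap with lazy invalidation, sketched below. The internals of `bpe_trainer_heap` are not shown in this notebook, so this is a hypothetical illustration rather than that module's implementation, and `PairHeap` and `best_pair_max` are names invented here.

```python
import heapq


def best_pair_max(pair_counts):
    # Linear scan over all pairs, repeated on every merge.
    return max(pair_counts, key=lambda p: (pair_counts[p], p))


class PairHeap:
    """Max-heap over pair counts with lazy invalidation of stale entries."""

    def __init__(self, pair_counts):
        self.counts = dict(pair_counts)
        # heapq is a min-heap, so counts are negated.
        self.heap = [(-c, p) for p, c in self.counts.items()]
        heapq.heapify(self.heap)

    def update(self, pair, new_count):
        # Push a fresh entry; older entries for this pair become stale.
        self.counts[pair] = new_count
        if new_count > 0:
            heapq.heappush(self.heap, (-new_count, pair))

    def pop_best(self):
        # Skip entries whose stored count no longer matches the live count.
        while self.heap:
            neg_count, pair = heapq.heappop(self.heap)
            if self.counts.get(pair, 0) == -neg_count:
                del self.counts[pair]  # a merged pair no longer exists
                return pair
        return None


counts = {(b"l", b"o"): 5, (b"o", b"w"): 4, (b"w", b"e"): 2}
print(best_pair_max(counts))

heap = PairHeap(counts)
heap.update((b"o", b"w"), 7)  # count changed after some merge
order = [heap.pop_best() for _ in range(4)]
print(order)
```

Note that on ties this heap returns the lexicographically smaller pair, while `best_pair_max` returns the greater one. Matching a specific tie-breaking rule would need a different sort key.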

- Problem (train_bpe_expts_owt): BPE Training on OpenWebText

    ```bash
    uv run python train_bpe.py
    ```

    Total time: 14m 7.99s


    
