<a href="https://colab.research.google.com/github/Joan947/mini_LLM/blob/main/compare_bpe_tiktoken.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


- Install the additional package requirements for this bonus notebook by uncommenting and running the following cell:

In [None]:
#pip install -r requirements-extra.txt



# Comparing Various Byte Pair Encoding (BPE) Implementations

<br>
&nbsp;

## Using BPE from `tiktoken`

In [59]:
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))


tiktoken version: 0.11.0


In [60]:
import tiktoken

tik_tokenizer = tiktoken.get_encoding("gpt2")

text = "'Can you understand that, little one? We ate horseflesh."

In [61]:
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[6, 6090, 345, 1833, 326, 11, 1310, 530, 30, 775, 15063, 8223, 69, 29730, 13]


In [62]:
strings = tik_tokenizer.decode(integers)

print(strings)

'Can you understand that, little one? We ate horseflesh.


In [63]:
print(tik_tokenizer.n_vocab)

50257
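- The vocabulary size printed above can be broken down by token type. The short sketch below assumes the standard GPT-2 BPE layout (256 byte-level base tokens, 50,000 learned merges, and one special token); the variable names are illustrative only:

```python
# GPT-2 BPE vocabulary size, broken down by token type.
n_byte_tokens = 256      # one base token per possible byte value
n_merges = 50_000        # learned BPE merge rules, each adding one token
n_special_tokens = 1     # "<|endoftext|>"

n_vocab = n_byte_tokens + n_merges + n_special_tokens
print(n_vocab)
```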


<br>
&nbsp;

## Using the original BPE implementation used in GPT-2

In [64]:
from bpe_openai_gpt2 import get_encoder, download_vocab

In [65]:
download_vocab()

Fetching encoder.json: 1.04Mit [00:00, 3.92Mit/s]                                                   
Fetching vocab.bpe: 457kit [00:00, 3.22Mit/s]                                                       


In [66]:
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")

In [67]:
integers = orig_tokenizer.encode(text)

print(integers)

[6, 6090, 345, 1833, 326, 11, 1310, 530, 30, 775, 15063, 8223, 69, 29730, 13]


In [68]:
strings = orig_tokenizer.decode(integers)

print(strings)

'Can you understand that, little one? We ate horseflesh.


<br>
&nbsp;

## Using BPE via Hugging Face `transformers`

In [None]:
import transformers

transformers.__version__

'4.56.1'

In [69]:
from transformers import GPT2Tokenizer

hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [70]:
hf_tokenizer(text)["input_ids"]

[6, 6090, 345, 1833, 326, 11, 1310, 530, 30, 775, 15063, 8223, 69, 29730, 13]

In [71]:
from transformers import GPT2TokenizerFast

hf_tokenizer_fast = GPT2TokenizerFast.from_pretrained("gpt2")

In [72]:
hf_tokenizer_fast(text)["input_ids"]

[6, 6090, 345, 1833, 326, 11, 1310, 530, 30, 775, 15063, 8223, 69, 29730, 13]

<br>
&nbsp;

## Using my own from-scratch BPE tokenizer

In [74]:
import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        # Look for the notebook in the current working directory
        current_dir = os.getcwd()
        path = os.path.join(current_dir, fullname + ".ipynb")
        path = os.path.normpath(path)

        # Load the notebook
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check for function or class definitions
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
                        break  # Execute each matching cell only once
        return mod

    fullname = "bpe-from-scratch"
    names = ["BPETokenizerSimple"]

    return import_definitions_from_notebook(fullname, names)

In [75]:
imported_module = import_from_notebook()
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)

tokenizer_gpt2 = BPETokenizerSimple()
tokenizer_gpt2.load_vocab_and_merges_from_openai(
    vocab_path=os.path.join("gpt2_model", "encoder.json"),
    bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe")
)

In [76]:
integers = tokenizer_gpt2.encode(text)

print(integers)

[6, 6090, 345, 1833, 326, 11, 1310, 530, 30, 775, 15063, 8223, 69, 29730, 13]
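- All five tokenizers above printed the same token IDs for the sample sentence. A minimal sketch of checking this programmatically follows; the IDs are copied from the cell outputs above as literal data so that the block runs on its own, and the helper name `find_disagreements` and the dictionary keys are hypothetical, not part of any of the libraries:

```python
# Token IDs copied from the cell outputs above (all five were identical).
reference_ids = [6, 6090, 345, 1833, 326, 11, 1310, 530, 30, 775,
                 15063, 8223, 69, 29730, 13]

# Stand-in data; the keys are illustrative labels for each tokenizer.
outputs = {
    "tiktoken": list(reference_ids),
    "openai_original": list(reference_ids),
    "hf_slow": list(reference_ids),
    "hf_fast": list(reference_ids),
    "from_scratch": list(reference_ids),
}


def find_disagreements(outputs, baseline="tiktoken"):
    """Return the names of tokenizers whose IDs differ from the baseline."""
    expected = outputs[baseline]
    return [name for name, ids in outputs.items() if ids != expected]


mismatched = find_disagreements(outputs)
print("Mismatched tokenizers:", mismatched)
```

- In the notebook itself, the literal lists could be replaced with live calls such as `tik_tokenizer.encode(text)` and `tokenizer_gpt2.encode(text)`.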


<br>
&nbsp;

## A quick performance benchmark

In [91]:
import os
import urllib.request

file_path = "siege-of-berlin.txt"

# Download the sample text only if it is not already present
if not os.path.exists(file_path):
    url = ("https://raw.githubusercontent.com/Joan947/"
           "mini_LLM/main/"
           "siege-of-berlin.txt")
    urllib.request.urlretrieve(url, file_path)

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()
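- The `%timeit` magic used in this section is only available in IPython and Jupyter. As a rough sketch of a comparable measurement in a plain Python script, the standard-library `timeit` module can be used; the `benchmark` helper and the stand-in workload below are hypothetical, and the helper uses fixed loop counts rather than choosing them automatically:

```python
import timeit


def benchmark(func, repeat=7, number=100):
    """Time func() over `repeat` runs of `number` calls each.

    Returns (mean, std) of the per-call time in milliseconds,
    where std is the population standard deviation across runs.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call_ms = [t / number * 1000 for t in totals]
    mean = sum(per_call_ms) / len(per_call_ms)
    variance = sum((t - mean) ** 2 for t in per_call_ms) / len(per_call_ms)
    return mean, variance ** 0.5


# Stand-in workload so the sketch runs on its own; in the notebook, pass
# for example `lambda: tik_tokenizer.encode(raw_text)` instead.
sample = "We ate horseflesh. " * 200
mean_ms, std_ms = benchmark(lambda: sample.split(), repeat=3, number=50)
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop")
```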

### Original OpenAI GPT-2 tokenizer

In [82]:
%timeit orig_tokenizer.encode(raw_text)

6.43 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Tiktoken OpenAI GPT-2 tokenizer

In [83]:
%timeit tik_tokenizer.encode(raw_text)

2.52 ms ± 37.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Hugging Face OpenAI GPT-2 tokenizer

In [85]:
%timeit hf_tokenizer(raw_text)["input_ids"]

19.2 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [86]:
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]

19.4 ms ± 5.34 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [87]:
%timeit hf_tokenizer_fast(raw_text)["input_ids"]

Token indices sequence length is longer than the specified maximum sequence length for this model (2406 > 1024). Running this sequence through the model will result in indexing errors


8.56 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [88]:
%timeit hf_tokenizer_fast(raw_text, max_length=5145, truncation=True)["input_ids"]

8.62 ms ± 462 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### My own GPT-2 tokenizer (for educational purposes)

In [92]:
%timeit tokenizer_gpt2.encode(raw_text)

19.7 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
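- To compare the results at a glance, the sketch below normalizes the mean timings reported above (the runs without truncation) by the fastest one. The numbers are copied from this particular run and will differ across machines and library versions; the dictionary labels are illustrative:

```python
# Mean per-call times in milliseconds, copied from the %timeit outputs above.
timings_ms = {
    "tiktoken": 2.52,
    "original OpenAI GPT-2": 6.43,
    "Hugging Face GPT2TokenizerFast": 8.56,
    "Hugging Face GPT2Tokenizer": 19.2,
    "from-scratch BPETokenizerSimple": 19.7,
}

# Express every timing as a multiple of the fastest implementation.
fastest = min(timings_ms, key=timings_ms.get)
relative = {name: t / timings_ms[fastest] for name, t in timings_ms.items()}

for name, factor in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name:<34} {timings_ms[name]:>5.2f} ms  {factor:4.1f}x")
```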
