<a href="https://colab.research.google.com/github/Joan947/mini_LLM/blob/main/compare_bpe_tiktoken.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


- Install the additional package requirements for this bonus notebook by uncommenting and running the following cell:

In [None]:
#pip install -r requirements-extra.txt



# Comparing Various Byte Pair Encoding (BPE) Implementations

<br>
&nbsp;

## Using BPE from `tiktoken`

In [None]:
from importlib.metadata import version

print("tiktoken version:", version("tiktoken"))


tiktoken version: 0.11.0


In [None]:
import tiktoken

tik_tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, world. Is this-- a test?"

In [None]:
integers = tik_tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [None]:
strings = tik_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?


In [None]:
print(tik_tokenizer.n_vocab)

50257
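
- The vocabulary size can be reproduced with simple arithmetic. The sketch below assumes the usual description of the GPT-2 vocabulary (256 byte-level base tokens, 50,000 learned merges, and one special `<|endoftext|>` token); this breakdown is not printed by `tiktoken` itself and is listed here only as an illustration.

```python
# Assumed breakdown of the GPT-2 vocabulary (not reported by the cells above)
n_byte_tokens = 256       # one base token per possible byte value
n_merges = 50_000         # learned BPE merge rules, each adding one token
n_special_tokens = 1      # <|endoftext|>

n_vocab = n_byte_tokens + n_merges + n_special_tokens
print(n_vocab)  # 50257
```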


<br>
&nbsp;

## Using the original BPE implementation used in GPT-2

In [None]:
from bpe_openai_gpt2 import get_encoder, download_vocab

In [None]:
download_vocab()

Fetching encoder.json: 1.04Mit [00:00, 4.35Mit/s]                                                   
Fetching vocab.bpe: 457kit [00:00, 2.73Mit/s]                                                       


In [None]:
orig_tokenizer = get_encoder(model_name="gpt2_model", models_dir=".")

In [None]:
integers = orig_tokenizer.encode(text)

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]


In [None]:
strings = orig_tokenizer.decode(integers)

print(strings)

Hello, world. Is this-- a test?


<br>
&nbsp;

## Using BPE via Hugging Face `transformers`

In [None]:
import transformers

transformers.__version__

'4.56.1'

In [None]:
from transformers import GPT2Tokenizer

hf_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


In [None]:
hf_tokenizer(strings)["input_ids"]

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]

In [None]:
from transformers import GPT2TokenizerFast

hf_tokenizer_fast = GPT2TokenizerFast.from_pretrained("gpt2")

In [None]:
hf_tokenizer_fast(strings)["input_ids"]

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]

<br>
&nbsp;

## Using my own from-scratch BPE tokenizer

In [None]:
import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        current_dir = os.getcwd()
        path = os.path.join(current_dir, "..", "05_bpe-from-scratch", fullname + ".ipynb")
        path = os.path.normpath(path)

        # Load the notebook
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check for function or class definitions
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
        return mod

    fullname = "bpe-from-scratch"
    names = ["BPETokenizerSimple"]

    return import_definitions_from_notebook(fullname, names)
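
- The helper above relies on executing source code inside a freshly created module object. The following is a minimal, standalone sketch of that mechanism; the module name `demo_module` and the `Greeter` class are hypothetical placeholders and do not come from the BPE notebook.

```python
import types

# Source code as it might appear in a single notebook cell (hypothetical example)
cell_source = '''
class Greeter:
    def greet(self, name):
        return f"Hello, {name}"
'''

# Create an empty module and run the cell source in its namespace
mod = types.ModuleType("demo_module")
exec(cell_source, mod.__dict__)

# The class defined in the cell is now an attribute of the module
Greeter = getattr(mod, "Greeter")
print(Greeter().greet("BPE"))  # Hello, BPE
```

- Note that `exec` runs the whole cell, so any top-level statements in a matching cell are executed as well, not only the `def` or `class` block.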

In [None]:
imported_module = import_from_notebook()
BPETokenizerSimple = getattr(imported_module, "BPETokenizerSimple", None)

tokenizer_gpt2 = BPETokenizerSimple()
tokenizer_gpt2.load_vocab_and_merges_from_openai(
    vocab_path=os.path.join("gpt2_model", "encoder.json"),
    bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe")
)

FileNotFoundError: Notebook file not found at: /05_bpe-from-scratch/bpe-from-scratch.ipynb

In [None]:
integers = tokenizer_gpt2.encode(text)

print(integers)

[15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]
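
- Since the point of this comparison is that all implementations produce the same token IDs, the check can also be automated. The sketch below uses a hypothetical helper, `find_mismatches`, and hard-codes the ID lists shown in the outputs above so that it runs on its own.

```python
def find_mismatches(results, reference_name):
    """Return the names of tokenizers whose IDs differ from the reference."""
    reference_ids = results[reference_name]
    return [name for name, ids in results.items() if ids != reference_ids]


# Token IDs copied from the cell outputs above
ids = [15496, 11, 995, 13, 1148, 428, 438, 257, 1332, 30]

results = {
    "tiktoken": list(ids),
    "original": list(ids),
    "hf": list(ids),
    "hf_fast": list(ids),
    "from_scratch": list(ids),
}

mismatches = find_mismatches(results, reference_name="tiktoken")
print(mismatches)  # []
```

- In the notebook itself, the dictionary values could instead be the live results, for example `tik_tokenizer.encode(text)` and `orig_tokenizer.encode(text)`.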


<br>
&nbsp;

## A quick performance benchmark

In [None]:
import os
import urllib.request

# Download the sample text if it is not already present
file_path = "siege-of-berlin.txt"
if not os.path.exists(file_path):
    url = ("https://raw.githubusercontent.com/Joan947/"
           "mini_LLM/main/"
           "siege-of-berlin.txt")
    urllib.request.urlretrieve(url, file_path)

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()
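
- The benchmarks in this section use the IPython `%timeit` magic, which is only available in notebooks and IPython shells. As a rough sketch, a similar measurement can be made in a plain Python script with the standard-library `timeit` module; `toy_encode` is a hypothetical stand-in for a tokenizer's `encode` method, and the fixed loop count is an assumption (`%timeit` picks it automatically).

```python
import timeit


def toy_encode(text):
    # Hypothetical stand-in for a real tokenizer's encode method
    return text.split()


sample = "Hello, world. Is this-- a test? " * 100

n_loops = 100   # loops per run (chosen by hand here)
n_runs = 7      # matches the "7 runs" reported by %timeit

# Each entry in `runs` is the total time in seconds for n_loops calls
runs = timeit.repeat(lambda: toy_encode(sample), repeat=n_runs, number=n_loops)

per_loop_ms = [1000 * t / n_loops for t in runs]
mean_ms = sum(per_loop_ms) / len(per_loop_ms)
print(f"{mean_ms:.4f} ms per loop (mean of {n_runs} runs, {n_loops} loops each)")
```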

### Original OpenAI GPT-2 tokenizer

In [None]:
%timeit orig_tokenizer.encode(raw_text)

3.84 ms ± 9.83 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Tiktoken OpenAI GPT-2 tokenizer

In [None]:
%timeit tik_tokenizer.encode(raw_text)

901 μs ± 6.27 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### Hugging Face OpenAI GPT-2 tokenizer

In [None]:
%timeit hf_tokenizer(raw_text)["input_ids"]

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


11 ms ± 94.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%timeit hf_tokenizer(raw_text, max_length=5145, truncation=True)["input_ids"]

10.8 ms ± 180 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%timeit hf_tokenizer_fast(raw_text)["input_ids"]

Token indices sequence length is longer than the specified maximum sequence length for this model (5145 > 1024). Running this sequence through the model will result in indexing errors


3.66 ms ± 3.67 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%timeit hf_tokenizer_fast(raw_text, max_length=5145, truncation=True)["input_ids"]

3.77 ms ± 49.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### My own GPT-2 tokenizer (for educational purposes)

In [None]:
%timeit tokenizer_gpt2.encode(raw_text)

9.37 ms ± 50.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
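
- To put the measurements side by side, the sketch below computes each tokenizer's runtime relative to the fastest one. The numbers are the mean timings reported above (using the runs without `max_length` and `truncation`); they depend on the hardware and library versions, so treat the ratios as indicative only.

```python
# Mean runtimes in milliseconds, copied from the %timeit outputs above
timings_ms = {
    "Original OpenAI GPT-2": 3.84,
    "tiktoken": 0.901,
    "Hugging Face GPT2Tokenizer": 11.0,
    "Hugging Face GPT2TokenizerFast": 3.66,
    "From-scratch BPETokenizerSimple": 9.37,
}

fastest = min(timings_ms, key=timings_ms.get)

# Runtime of each tokenizer as a multiple of the fastest one
slowdown = {name: ms / timings_ms[fastest] for name, ms in timings_ms.items()}

for name, factor in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{name:<32} {timings_ms[name]:6.3f} ms  {factor:5.1f}x")
```

- On this run, `tiktoken` is about 4 times faster than both the original implementation and the fast Hugging Face tokenizer, and roughly 10 to 12 times faster than the from-scratch tokenizer and the slow Hugging Face tokenizer.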
