This section covers a more sophisticated tokenization scheme based on a concept called byte pair encoding (BPE).
The BPE tokenizer covered in this section was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

In [2]:
# https://github.com/openai/tiktoken

In [None]:
# tiktoken: OpenAI's open-source BPE tokenizer library.
# It implements byte-level BPE and is optimized for speed and memory.
# It ships the encodings used by OpenAI models, e.g. "gpt2" (GPT-2),
# "cl100k_base" (GPT-4), and "o200k_base" (GPT-4o).
# The older GPT-2 / GPT-3 vocabularies remain available alongside the newer ones.


In [3]:
! pip3 install tiktoken



In [5]:
# 1 > Word-Based Tokenizer.
# 2 > Subword-Based Tokenizer.
# 3 > Character-Based Tokenizer.

In [12]:
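# Import the submodule explicitly; a plain `import importlib` does not guarantee it is loaded
import importlib.metadata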
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


This tokenizer is used much like the SimpleTokenizerV2 we implemented previously: it exposes an encode method that turns text into token IDs. See: https://github.com/PrashantTakale369/Transformer-Basics/blob/b0eb0d70b16f09b11f4c5e8bd99e803c8e51771e/Tokanizer/Tokanizer.ipynb

In [16]:
tokenizer = tiktoken.get_encoding("gpt2")

In [8]:
text = (
    "Hello, do you like tea? <|endoftext|> My name is Prashant someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 2011, 1438, 318, 1736, 1077, 415, 617, 34680, 27271, 13]


We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2 earlier:
https://github.com/PrashantTakale369/Transformer-Basics/blob/b0eb0d70b16f09b11f4c5e8bd99e803c8e51771e/Tokanizer/Tokanizer.ipynb

In [9]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> My name is Prashant someunknownPlace.


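We can make two observations from the token IDs and decoded text above. First, the <|endoftext|> token is assigned a relatively large token ID, 50256. In fact, the BPE tokenizer used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID.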

Second, the BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace", correctly.
The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?
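
The short answer is that byte-level BPE starts from the 256 possible byte values and only then adds learned merges, so any string can fall back to smaller pieces, down to single bytes if needed. The following is a toy sketch of that idea in pure Python, not tiktoken's actual implementation. The function names (train_bpe, merge_pair, encode, decode), the tiny training corpus, and the merge count are all assumptions made up for illustration, and the sketch skips the regex pre-splitting that GPT-2 applies before merging.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges; returns {(left_id, right_id): new_id}."""
    ids = list(corpus.encode("utf-8"))  # base tokens are raw bytes: IDs 0..255
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges


def encode(text, merges):
    """Start from bytes, then apply the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand every token ID back into its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Hypothetical toy corpus; note that it contains no uppercase "P"
merges = train_bpe("some place, some unknown place, some known place", num_merges=20)

ids = encode("someunknownPlace", merges)
print(ids)                  # a mix of merged subword IDs (>= 256) and raw byte IDs
print(decode(ids, merges))  # someunknownPlace
```

Because "P" never appears in the toy corpus, no merge involves it, and it is emitted as its plain byte value instead of an <|unk|> token. The same fallback is why a meaningless string still round-trips through encode and decode without loss.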

**Let's try a different, meaningless word**

In [17]:
text = (
    "jjnd difn"
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[41098, 358, 288, 361, 77]


In [18]:
strings = tokenizer.decode(integers)
print(strings)

jjnd difn
