In [2]:
! pip install -q tiktoken==0.12.0 pandas==2.2.2 numpy==2.0.2 torch==2.9.0

In [8]:
import tiktoken
import pandas as pd

# Initialize Encodings
encodings = {
    "GPT-2": tiktoken.get_encoding("gpt2"),
    "GPT-3": tiktoken.get_encoding("p50k_base"),
    "GPT-4": tiktoken.get_encoding("cl100k_base"),
    "GPT-4o": tiktoken.get_encoding("o200k_base"), 
}

text = "Learning never stops | ‡§∏‡•Ä‡§ñ‡§®‡§æ ‡§ï‡§≠‡•Ä ‡§®‡§π‡•Ä‡§Ç ‡§∞‡•Å‡§ï‡§§‡§æ | Â≠¶„Å≥„ÅØÊ≠¢„Åæ„Çâ„Å™„ÅÑ üòä"

print("Input Text:")
print(text)
print()

summary = []
decoded_tokens = {}
max_len = 0

# Encode once per tokenizer:
for model, enc in encodings.items():
    token_ids = enc.encode(text)
    pieces = [repr(enc.decode([tid])) for tid in token_ids]
    # repr(...) so that leading spaces, invisible characters, and special symbols are clearly visible in the output.


    summary.append({
        "Model": model,
        "Vocab size": enc.n_vocab,
        "Token count": len(token_ids),
        "Round-trip OK": enc.decode(token_ids) == text
    })

    # The round-trip check verifies that decoding the full token sequence reconstructs the original text exactly, confirming that tokenization is reversible for this input.
    # Note that this round-trip correctness confirms that differences observed later are purely about segmentation, not loss of information.

    decoded_tokens[model] = pieces
    max_len = max(max_len, len(pieces))

# Print Compact Summary
summary_df = pd.DataFrame(summary)
print("Summary:")
display(summary_df)
# Build aligned token table (token index as rows)
rows = []
for i in range(max_len):
    row = {"Idx": i}
    for model in decoded_tokens:
        row[model] = decoded_tokens[model][i] \
                        if i < len(decoded_tokens[model]) else ""
        rows.append(row)

tokens_df = pd.DataFrame(rows)

print("Token piece comparison:")
display(tokens_df)

Input Text:
Learning never stops | ‡§∏‡•Ä‡§ñ‡§®‡§æ ‡§ï‡§≠‡•Ä ‡§®‡§π‡•Ä‡§Ç ‡§∞‡•Å‡§ï‡§§‡§æ | Â≠¶„Å≥„ÅØÊ≠¢„Åæ„Çâ„Å™„ÅÑ üòä

Summary:


Unnamed: 0,Model,Vocab size,Token count,Round-trip OK
0,GPT-2,50257,51,True
1,GPT-3,50281,51,True
2,GPT-4,100277,38,True
3,GPT-4o,200019,20,True


Token piece comparison:


Unnamed: 0,Idx,GPT-2,GPT-3,GPT-4,GPT-4o
0,0,'Learning','Learning','Learning','Learning'
1,0,'Learning','Learning','Learning','Learning'
2,0,'Learning','Learning','Learning','Learning'
3,0,'Learning','Learning','Learning','Learning'
4,1,' never',' never',' never',' never'
...,...,...,...,...,...
199,49,' ÔøΩ',' ÔøΩ',,
200,50,'ÔøΩ','ÔøΩ',,
201,50,'ÔøΩ','ÔøΩ',,
202,50,'ÔøΩ','ÔøΩ',,


 #### Analysis
 
 - GPT-2 and GPT-3 have almost identical vocabulary sizes (~50k) and produce the same token count (51 tokens) for our input sentence. Whereas GPT-4 has a much larger vocabulary (~100k) and already reduces the token count significantly to 38 tokens, indicating better coverage of non-Latin scripts and more compact subword units.

- GPT-4o has the largest vocabulary (~200k) and produces only 20 tokens for the same sentence, showing a dramatic improvement in token efficiency, especially for multilingual text.

- The key takeaway from this table is that newer tokenizers don‚Äôt just add vocabulary, they materially reduce token counts, which directly affects context usage, latency, and cost.

- The repeated ÔøΩ (replacement character) visible in GPT-2, GPT-3 and GPT-4 columns around the Hindi/Japanese and emoji segments indicates that these tokenizers are effectively operating at a fragmented Unicode level for those scripts.

- This fragmentation happens because these tokenizers were trained with weaker coverage for non-Latin scripts, so they fall back to representing tokens as smaller, less meaningful units.

- In the table, we are decoding one token at a time, which means each token is decoded in isolation, without its neighboring bytes. Many Unicode characters (such as Hindi letters, Japanese characters, and emojis) are represented by multiple bytes (due to weaker coverage for non-Latin scripts), and decoding only a fragment of those bytes produces invalid Unicode, which when decoded on their own, are shown as ÔøΩ. This is why the full round-trip decode works correctly, but several individual token pieces are unable to render when shown separately.

- You can also observe that GPT-4o groups larger semantic chunks together, which is why it needs far fewer rows (tokens) overall.

- Leading spaces and separators appear attached to tokens in several places, which is expected behavior and reflects how tokenizers optimize for natural language statistics rather than word boundaries.

- Now, the most important practical insight from this output is that tokenization quality strongly affects multilingual robustness and efficiency, even before embeddings, attention, or model architecture come into play.

- This is why prompt length, context limits, and cost estimates must always be understood relative to the tokenizer used by the target model, not by counting characters or words. Hence, the token count is practically a very important metric for performance evaluations and cost management.

