# Tokenization Deep Dive: How LLMs Slice Text
**Before an LLM tastes your words, it needs to slice them.**

In our companion post, we likened tokenization to slicing a sandwich. Now, let’s go deeper.

We'll move from whitespace to byte-level slicing and explore how different models tokenize inputs, how they decode them back, and how this all ties into cost and performance.

_Let’s begin at the café counter..._

## Analogy Recap: Your Sentence = A Sandwich
Each token is a slice. Different slicing tools lead to different results:
- **Whitespace Tokenization** → Clean visible cuts
- **Subword Tokenization** → Balanced bites like 'Sand' + '##wich'
- **Character Tokenization** → Micro-crumbs
- **Advanced Models** use subword schemes such as BPE, WordPiece, and SentencePiece

> More slices = more tokens = more cost.
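
To see where subword slices like 'Sand' + 'wich' come from, here is a toy sketch of BPE training: start from single characters and repeatedly merge the most frequent adjacent pair. The corpus and word counts are invented for illustration, and production tokenizers add details (byte-level handling, pre-tokenization) that this sketch leaves out.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)  # ties go to the pair seen first

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: each word split into characters, mapped to its count
words = {tuple("sandwich"): 5, tuple("sand"): 3, tuple("wich"): 2}
merges = []
for _ in range(6):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges[:3])   # [('s', 'a'), ('sa', 'n'), ('san', 'd')]
print(list(words))  # [('sand', 'wich'), ('sand',), ('wich',)]
```

After six merges the frequent fragments have become whole symbols, so "sandwich" costs two slices instead of eight.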

## Let's Start with a Few Sentences

In [14]:
sentences = [
    "I love spicy sandwiches.",
    "Tokenization affects both performance and cost.",
    "Sheldon likes his sandwiches in isosceles triangles.",
    "😂 emojis and multilingual नमस्ते text test subword chops.",
    "Hashtags like #AI and longwords like Donaudampfschifffahrtsgesellschaft are tricky!"
]

## Custom Tokenizers: Whitespace, Regex, Character

In [15]:
import re

def whitespace_tokens(text):
    return text.strip().split()

def regex_tokens(text):
    return re.findall(r'\w+|[^\w\s]', text)

def char_tokens(text):
    return list(text)

for text in sentences:
    print(f"\n📝 Sentence: {text}")
    print("• Whitespace:", whitespace_tokens(text))
    print("• Regex     :", regex_tokens(text))
    print("• Characters:", char_tokens(text))


📝 Sentence: I love spicy sandwiches.
• Whitespace: ['I', 'love', 'spicy', 'sandwiches.']
• Regex     : ['I', 'love', 'spicy', 'sandwiches', '.']
• Characters: ['I', ' ', 'l', 'o', 'v', 'e', ' ', 's', 'p', 'i', 'c', 'y', ' ', 's', 'a', 'n', 'd', 'w', 'i', 'c', 'h', 'e', 's', '.']

📝 Sentence: Tokenization affects both performance and cost.
• Whitespace: ['Tokenization', 'affects', 'both', 'performance', 'and', 'cost.']
• Regex     : ['Tokenization', 'affects', 'both', 'performance', 'and', 'cost', '.']
• Characters: ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'a', 'f', 'f', 'e', 'c', 't', 's', ' ', 'b', 'o', 't', 'h', ' ', 'p', 'e', 'r', 'f', 'o', 'r', 'm', 'a', 'n', 'c', 'e', ' ', 'a', 'n', 'd', ' ', 'c', 'o', 's', 't', '.']

📝 Sentence: Sheldon likes his sandwiches in isosceles triangles.
• Whitespace: ['Sheldon', 'likes', 'his', 'sandwiches', 'in', 'isosceles', 'triangles.']
• Regex     : ['Sheldon', 'likes', 'his', 'sandwiches', 'in', 'isosceles', 'triangles',

## HuggingFace Tokenizers: Model-Level Tokenization

In [16]:
from transformers import AutoTokenizer

model_ids = ["bert-base-uncased", "gpt2", "roberta-base", "xlm-roberta-base"]

for model_id in model_ids:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"\n🔧 Model: {model_id}")
    for s in sentences:
        tokens = tokenizer.tokenize(s)
        print(f"  • \"{s}\" → {len(tokens)} tokens →", tokens)


🔧 Model: bert-base-uncased
  • "I love spicy sandwiches." → 5 tokens → ['i', 'love', 'spicy', 'sandwiches', '.']
  • "Tokenization affects both performance and cost." → 8 tokens → ['token', '##ization', 'affects', 'both', 'performance', 'and', 'cost', '.']
  • "Sheldon likes his sandwiches in isosceles triangles." → 10 tokens → ['sheldon', 'likes', 'his', 'sandwiches', 'in', 'iso', '##sc', '##eles', 'triangles', '.']
  • "😂 emojis and multilingual नमस्ते text test subword chops." → 19 tokens → ['[UNK]', 'em', '##oj', '##is', 'and', 'multi', '##ling', '##ual', 'न', '##म', '##स', '##त', 'text', 'test', 'sub', '##word', 'chop', '##s', '.']
  • "Hashtags like #AI and longwords like Donaudampfschifffahrtsgesellschaft are tricky!" → 25 tokens → ['hash', '##tag', '##s', 'like', '#', 'ai', 'and', 'long', '##words', 'like', 'dona', '##uda', '##mp', '##fs', '##chi', '##ff', '##fa', '##hr', '##ts', '##ges', '##ell', '##schaft', 'are', 'tricky', '!']

🔧 Model: gpt2
  • "I love spicy sandwiches." 
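
The `##` prefix in the BERT output above marks a piece that continues a word. WordPiece finds these pieces with a greedy longest-match-first scan against its vocabulary. Below is a toy sketch with a small hypothetical vocabulary (the real `bert-base-uncased` vocabulary has roughly 30,000 entries); the function name is ours, not the library's.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen to mirror pieces seen in the output above
toy_wordpiece_vocab = {"token", "##ization", "iso", "##sc", "##eles", "sandwiches"}

print(wordpiece_tokenize("tokenization", toy_wordpiece_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("isosceles", toy_wordpiece_vocab))     # ['iso', '##sc', '##eles']
print(wordpiece_tokenize("emoji", toy_wordpiece_vocab))         # ['[UNK]']
```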

## Encoding vs Tokenizing vs Decoding

In [17]:
example = "Tokenization is awesome! 🚀"
tokenizer = AutoTokenizer.from_pretrained("gpt2")

print("Tokenized :", tokenizer.tokenize(example))
print("Encoded IDs:", tokenizer.encode(example))
print("Decoded   :", tokenizer.decode(tokenizer.encode(example)))

Tokenized : ['Token', 'ization', 'Ġis', 'Ġawesome', '!', 'ĠðŁ', 'ļ', 'Ģ']
Encoded IDs: [30642, 1634, 318, 7427, 0, 12520, 248, 222]
Decoded   : Tokenization is awesome! 🚀
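
To make the three steps concrete without downloading a model, here is a toy round trip. The vocabulary and IDs are made up for illustration and are not GPT-2's.

```python
# Hypothetical toy vocabulary: token string -> integer ID (not GPT-2's real IDs)
toy_ids = {"Token": 0, "ization": 1, " is": 2, " awesome": 3, "!": 4}
id_to_token = {idx: tok for tok, idx in toy_ids.items()}

def toy_encode(tokens):
    """Map token strings to integer IDs."""
    return [toy_ids[tok] for tok in tokens]

def toy_decode(ids):
    """Map IDs back to token strings and join them into text."""
    return "".join(id_to_token[i] for i in ids)

pieces = ["Token", "ization", " is", " awesome", "!"]  # tokenizing: text -> pieces
piece_ids = toy_encode(pieces)                          # encoding: pieces -> IDs
print(piece_ids)              # [0, 1, 2, 3, 4]
print(toy_decode(piece_ids))  # Tokenization is awesome!
```

The model only ever sees the integer IDs; the token strings exist for our benefit.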


## Special Tokens and Model Quirks (BERT Example)

In [18]:
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print("Special Tokens:", bert_tok.special_tokens_map)
ids = bert_tok.encode("AI is cool", add_special_tokens=True)
print("Encoded:", ids)
print("Decoded:", bert_tok.decode(ids))

Special Tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
Encoded: [101, 9932, 2003, 4658, 102]
Decoded: [CLS] ai is cool [SEP]
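
The IDs 101 and 102 in the output above are `[CLS]` and `[SEP]`. Here is a rough sketch of what `add_special_tokens=True` does for BERT-style inputs. The helper name is hypothetical; the real tokenizer handles this internally.

```python
CLS_ID, SEP_ID = 101, 102  # IDs taken from the encoded output above

def add_bert_special_tokens(ids_a, ids_b=None):
    """Wrap one sequence as [CLS] A [SEP], or a pair as [CLS] A [SEP] B [SEP]."""
    wrapped = [CLS_ID] + list(ids_a) + [SEP_ID]
    if ids_b is not None:
        wrapped += list(ids_b) + [SEP_ID]
    return wrapped

print(add_bert_special_tokens([9932, 2003, 4658]))
# [101, 9932, 2003, 4658, 102]
```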


## Byte-Level Tokenization (GPT-2 Behavior)

In [19]:
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
print("Without space:", gpt2_tok.tokenize("hello"))
print("With space   :", gpt2_tok.tokenize(" hello"))  # GPT-2 folds the leading space into the token, shown as 'Ġ'

Without space: ['hello']
With space   : ['Ġhello']
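
The `Ġ` is not a typo. GPT-2's byte-level BPE works on UTF-8 bytes and displays each byte as a printable character, with the space byte shown as `Ġ`. The sketch below re-creates that byte-to-character table as we understand it from the GPT-2 reference code; treat it as an illustration, not the library's actual implementation. `show_bytes` is a hypothetical helper.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control bytes, ...) are shifted past U+00FF
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}

BYTE_TABLE = bytes_to_unicode()

def show_bytes(text):
    """Render the UTF-8 bytes of `text` the way GPT-2 token strings display them."""
    return "".join(BYTE_TABLE[b] for b in text.encode("utf-8"))

print(show_bytes(" hello"))  # Ġhello
print(show_bytes("é"))       # Ã©
```

Because every possible byte has an entry, a byte-level tokenizer never needs an `[UNK]` token; unfamiliar text just costs more pieces.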


## Emoji & Multilingual Tokenization Comparison

In [20]:
samples = ["😂", "नमस्ते", "Donaudampfschifffahrtsgesellschaft", "#AI", "Café"]
# Load each tokenizer once instead of on every loop iteration
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
for s in samples:
    print(f"\nInput: {s}")
    print("BERT :", bert_tok.tokenize(s))
    print("GPT-2:", gpt2_tok.tokenize(s))


Input: 😂
BERT : ['[UNK]']
GPT-2: ['ðŁĺ', 'Ĥ']

Input: नमस्ते
BERT : ['न', '##म', '##स', '##त']
GPT-2: ['à¤', '¨', 'à¤', '®', 'à¤', '¸', 'à¥', 'į', 'à¤', '¤', 'à¥', 'ĩ']

Input: Donaudampfschifffahrtsgesellschaft
BERT : ['dona', '##uda', '##mp', '##fs', '##chi', '##ff', '##fa', '##hr', '##ts', '##ges', '##ell', '##schaft']
GPT-2: ['Don', 'aud', 'amp', 'fs', 'ch', 'if', 'ff', 'ah', 'r', 'ts', 'ges', 'ells', 'cha', 'ft']

Input: #AI
BERT : ['#', 'ai']
GPT-2: ['#', 'AI']

Input: Café
BERT : ['cafe']
GPT-2: ['C', 'af', 'Ã©']
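
Two things are going on in these outputs. GPT-2 sees raw UTF-8 bytes, so characters that need several bytes tend to cost several tokens. `bert-base-uncased`, by contrast, lowercases and strips accents before WordPiece runs, which is why `Café` becomes `cafe` and the vowel signs in `नमस्ते` disappear. Here is a minimal sketch of both effects using only the standard library. `normalize_uncased` is our approximation of that normalization step, not the library's code.

```python
import unicodedata

def normalize_uncased(text):
    """Lowercase, decompose (NFD), and drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for sample in ["😂", "नमस्ते", "Café"]:
    n_bytes = len(sample.encode("utf-8"))
    print(
        f"{sample!r}: {len(sample)} code points, {n_bytes} UTF-8 bytes, "
        f"normalized -> {normalize_uncased(sample)!r}"
    )
```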


## Vocabulary Peek: What Tokens Are in the Model?

In [21]:
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
vocab = gpt2_tok.get_vocab()
# get_vocab() returns a dict in no meaningful order, so this is a sample, not a ranking
sample_tokens = list(vocab.items())[:20]
print("Sample of 20 GPT-2 tokens:", sample_tokens)

Sample of 20 GPT-2 tokens: [('Ġfunctional', 10345), ('Ġtrunk', 21427), ("''''", 39115), ('ĠIndeed', 9676), ('Ġvent', 7435), ('Winged', 47418), ('ĠBoss', 15718), ('favorite', 35200), ('idine', 39422), ('kers', 15949), ('Ġlinear', 14174), ('Ġjuvenile', 21904), ('ulty', 10672), ('ĠDee', 29195), ('ĠSwed', 7289), ('ĠLEGO', 29108), ('Tickets', 43254), ('aternal', 14744), ('Ġleaping', 45583), ('Ġmandated', 28853)]
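
The dictionary returned by `get_vocab()` maps token strings to IDs and is not sorted by ID or frequency, so the 20 entries printed above are an arbitrary slice. Here is a minimal sketch of sorting by ID, using a hypothetical toy dictionary built from a few of those entries:

```python
# Hypothetical toy dictionary reusing a few (token, id) pairs from the output above
vocab_sample = {
    "Ġfunctional": 10345,
    "Ġtrunk": 21427,
    "ĠIndeed": 9676,
    "Ġvent": 7435,
    "kers": 15949,
}

# Sort by ID (the dict value) to get a stable order
by_id = sorted(vocab_sample.items(), key=lambda item: item[1])
print(by_id[:3])
# [('Ġvent', 7435), ('ĠIndeed', 9676), ('Ġfunctional', 10345)]
```

The same `sorted(...)` call should work on the full `vocab` dictionary from the cell above.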


## Padding, Truncation & Sequence Management

In [22]:
encoded = bert_tok(["Short sentence.", "This one is a bit longer and needs padding."],
                   padding=True, truncation=True, return_tensors="pt")
print(encoded)

{'input_ids': tensor([[  101,  2460,  6251,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0],
        [  101,  2023,  2028,  2003,  1037,  2978,  2936,  1998,  3791, 11687,
          4667,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
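
What `padding=True` produces can be reproduced by hand. The sketch below right-pads each sequence to the longest one in the batch and builds the matching attention mask. `pad_batch` is a hypothetical helper, and the IDs are copied from the output above.

```python
PAD_ID = 0  # BERT's [PAD] ID, as seen in the output above

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

short = [101, 2460, 6251, 1012, 102]
longer = [101, 2023, 2028, 2003, 1037, 2978, 2936, 1998, 3791, 11687, 4667, 1012, 102]
padded_ids, mask = pad_batch([short, longer])
print(padded_ids[0])  # [101, 2460, 6251, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0]
print(mask[0])        # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

The mask tells the model which positions hold real tokens (1) and which are padding to ignore (0).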


## Token Count = Cost
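Let's estimate how token counts translate into cost for GPT models. The prices below are illustrative and may be outdated, and the GPT-2 tokenizer is only a stand-in: GPT-3.5 and GPT-4 use different tokenizers, so their actual counts can differ.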

In [23]:
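def estimate_cost(token_count, model="gpt-4"):
    # Illustrative USD prices per token; check current pricing before relying on them
    prices = {
        "gpt-3.5": 0.0015 / 1000,
        "gpt-4": 0.03 / 1000,
        "gpt-4o": 0.005 / 1000
    }
    if model not in prices:
        raise ValueError(f"Unknown model: {model}")  # avoid silently reporting $0
    return round(token_count * prices[model], 6)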

text = "Soft kitty, warm kitty!"
tok = AutoTokenizer.from_pretrained("gpt2")
tokens = tok.tokenize(text)
print(f"Tokens ({len(tokens)}): {tokens}")
print("Estimated Cost (gpt-4): $", estimate_cost(len(tokens), model="gpt-4"))

Tokens (8): ['Soft', 'Ġk', 'itty', ',', 'Ġwarm', 'Ġk', 'itty', '!']
Estimated Cost (gpt-4): $ 0.00024
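
The arithmetic behind that number is 8 tokens × ($0.03 / 1000) = $0.00024. Here is the same calculation across all three prices as a small self-contained sketch. The per-1K prices are copied from `estimate_cost` and may not reflect current rates.

```python
# Illustrative USD prices per 1,000 tokens, copied from estimate_cost above
PRICE_PER_1K = {"gpt-3.5": 0.0015, "gpt-4": 0.03, "gpt-4o": 0.005}
TOKEN_COUNT = 8  # token count of "Soft kitty, warm kitty!" from the output above

def cost_usd(token_count, price_per_1k):
    """Cost in USD for `token_count` tokens at a given per-1K-token price."""
    return token_count * price_per_1k / 1000

costs = {model: cost_usd(TOKEN_COUNT, price) for model, price in PRICE_PER_1K.items()}
for model, cost in costs.items():
    print(f"{model:8s} ${cost:.6f}")
# gpt-3.5  $0.000012
# gpt-4    $0.000240
# gpt-4o   $0.000040
```

A single short sentence is nearly free; the numbers only matter once they are multiplied by long prompts and many requests.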


## Wrap-Up: Tokenization = Prep Chef of LLMs
- Before embeddings, before transformers, before predictions, we slice.
- The type of slicing affects cost, performance, and accuracy.
- Choose the right tokenizer for the job and never underestimate the power of prep work.