In [1]:
input_text = "I can't believe unbelievable things happened in ২০২5! 👋🏽"


In [2]:
chars = list(input_text)
print(chars)
print(len(chars))

['I', ' ', 'c', 'a', 'n', "'", 't', ' ', 'b', 'e', 'l', 'i', 'e', 'v', 'e', ' ', 'u', 'n', 'b', 'e', 'l', 'i', 'e', 'v', 'a', 'b', 'l', 'e', ' ', 't', 'h', 'i', 'n', 'g', 's', ' ', 'h', 'a', 'p', 'p', 'e', 'n', 'e', 'd', ' ', 'i', 'n', ' ', '২', '০', '২', '5', '!', ' ', '👋', '🏽']
56


*Models do not understand text.
They understand numbers, and tokenization is the process of turning text into countable units.*

---



## Character-Level Tokenization
This section shows that:

* There are no out-of-vocabulary (OOV) tokens
* Sequence length explodes
* There is no semantic grouping

In [3]:
def char_tokenize(text):
    return list(text)

char_tokens = char_tokenize(input_text)
print(char_tokens)
print("Token count:", len(char_tokens))

['I', ' ', 'c', 'a', 'n', "'", 't', ' ', 'b', 'e', 'l', 'i', 'e', 'v', 'e', ' ', 'u', 'n', 'b', 'e', 'l', 'i', 'e', 'v', 'a', 'b', 'l', 'e', ' ', 't', 'h', 'i', 'n', 'g', 's', ' ', 'h', 'a', 'p', 'p', 'e', 'n', 'e', 'd', ' ', 'i', 'n', ' ', '২', '০', '২', '5', '!', ' ', '👋', '🏽']
Token count: 56


**Observations** *The emoji is split into two code points (the base emoji and its skin-tone modifier), the token count is very high (56 tokens for one short sentence), and no concept of a “word” exists.*
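
To see why the emoji shows up as two entries, it can help to inspect its code points directly. The sketch below uses only the standard library; the variable names are illustrative and not part of the cells above.

```python
import unicodedata

# The waving hand with a skin tone is two code points:
# a base emoji followed by a skin-tone modifier.
emoji = "\U0001F44B\U0001F3FD"

code_points = [f"U+{ord(ch):04X}" for ch in emoji]
names = [unicodedata.name(ch, "<unnamed>") for ch in emoji]

print(len(emoji))    # counts code points, not visible glyphs
print(code_points)
print(names)
```

Because `list(text)` iterates over code points, anything rendered as one glyph but stored as several code points is split the same way.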


> Character-level tokenization solves OOV completely, but destroys efficiency and meaning.
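
The trade-off can be sketched with a toy character vocabulary. This is a minimal illustration that assumes the vocabulary is built from the text itself; `char_vocab` and `char_ids` are hypothetical names, not part of the notebook.

```python
text = "I can't believe unbelievable things happened"

# Toy vocabulary: one integer per distinct character in the text.
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}

# Every character is in the vocabulary, so nothing is ever unknown here.
char_ids = [char_vocab[ch] for ch in text]

print("Vocabulary size:", len(char_vocab))
print("Character tokens:", len(char_ids))
print("Whitespace words:", len(text.split()))
```

The vocabulary stays tiny, but the model has to process one position per character instead of one per word.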


### When character tokenization makes sense

* Educational purposes
* Research
* Symbolic data (DNA, code, math)

It is not practical for most LLMs because of the sequence-length blow-up.




---



## Word-Level Tokenization
* Grouping characters into words feels more natural
* Unseen words cause OOV failures
* Sequence length drops
* Assumptions about language and punctuation appear

In [4]:
def word_tokenize(text):
    return text.lower().split()

word_tokens = word_tokenize(input_text)
print(word_tokens)
print("Token count:", len(word_tokens))

# OOV (out-of-vocabulary) demo
# Word-level tokenization requires a massive vocabulary and still fails on unseen words.
# Assume a tiny vocabulary for illustration:
vocab = {"i", "can't", "believe", "things", "happened"}
[word for word in word_tokens if word not in vocab]



['i', "can't", 'believe', 'unbelievable', 'things', 'happened', 'in', '২০২5!', '👋🏽']
Token count: 9


['unbelievable', 'in', '২০২5!', '👋🏽']

**Observations** *"can't" stays one token (apostrophe included), "২০২5!" keeps the "!" glued to the number, and the emoji becomes its own “word”.*


> Word-level tokenization gives shorter sequences and human-readable tokens, but it breaks on OOV words, vocabulary explosion, punctuation handling, and multilingual robustness.


> Language dependency: Word tokenization assumes whitespace, which many languages do not use.
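
Both the whitespace assumption and the punctuation problem can be shown in a few lines. This is a rough sketch: `punct_aware_tokenize` is a hypothetical helper, and the regular expression is only one simple way to separate punctuation.

```python
import re

def punct_aware_tokenize(text):
    """Split into runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Japanese is written without spaces, so split() finds nothing to split on.
no_space_text = "自然言語処理は楽しい"
print(no_space_text.split())

# Separating punctuation fixes the glued "!" but still assumes spaces exist.
print(punct_aware_tokenize("happened in 2025!"))
```

The regex patch helps with punctuation, yet the whole sentence without spaces still comes back as a single "word".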









---



## Subword Tokenization (BPE concept)
This section shows that:

* Words are decomposable
* OOV is reduced, not eliminated
* Vocabulary becomes manageable



> *Instead of storing whole words, store frequently occurring word pieces.
New or rare words are broken into smaller known parts.*
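
How "frequently occurring word pieces" get discovered can be sketched with a toy version of BPE merge learning. This is a simplified illustration on a made-up word-frequency table, not the training code of any real tokenizer; `count_pairs`, `merge_pair`, and `word_freqs` are hypothetical names.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Made-up frequencies; each word starts as a tuple of characters.
word_freqs = {"believe": 5, "believable": 2, "unbelievable": 1}
corpus = {tuple(word): freq for word, freq in word_freqs.items()}

merges = []
for _ in range(5):  # learn five merges
    pairs = count_pairs(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)
print(list(corpus))
```

Each merge adds one new piece to the vocabulary, so the number of merges is the knob that trades vocabulary size against sequence length.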



In [5]:
# Using a Hugging Face tokenizer (bert-base-uncased uses WordPiece, a subword scheme closely related to BPE)

from transformers import AutoTokenizer

subword_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
subword_tokens = subword_tokenizer.tokenize(input_text)
print(subword_tokens)
print("Token count:", len(subword_tokens))

# The [UNK] (unknown) token is produced when the input contains characters or symbols that are not in the tokenizer's trained vocabulary.


['i', 'can', "'", 't', 'believe', 'unbelievable', 'things', 'happened', 'in', '[UNK]', '!', '[UNK]']
Token count: 12


**Observations** *[UNK] appears because WordPiece reduces OOV by breaking words into pieces,
but it is still limited to the scripts and symbols it was trained on. Here both "২০২5" and the emoji collapse to [UNK].*

**Subword tokenization solves:** *the massive-vocabulary problem, most OOV words, and morphological variation.*

**But** *tokens are less human-readable, tokenizer training matters, and some scripts may still be unsupported.*
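
The way a known vocabulary is applied to a new word, and where [UNK] comes from, can be sketched with a greedy longest-match splitter in the style of WordPiece. This is an illustrative sketch with a made-up vocabulary, not the Hugging Face implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical names.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"un", "##believ", "##able", "thing", "##s"}

print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("things", toy_vocab))
print(wordpiece_tokenize("২০২5", toy_vocab))
```

A single character that is missing from the vocabulary is enough to turn the entire word into `[UNK]`, which is why unsupported scripts hurt so much.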



---



## SentencePiece / Unigram (Language-agnostic)

* Tokenization does not require whitespace
* Multilingual text can be handled natively
* Tokens become less human-readable
* This approach is used in many modern models



> *WordPiece assumes whitespace and is usually trained on a specific language.
SentencePiece treats text as a raw stream of characters, making it suitable for multilingual text and for languages written without spaces.*



In [6]:
from transformers import AutoTokenizer

sp_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
sp_tokens = sp_tokenizer.tokenize(input_text)
print(sp_tokens)
print("Token count:", len(sp_tokens))


['▁I', '▁can', "'", 't', '▁believe', '▁un', 'beli', 'ev', 'able', '▁things', '▁happened', '▁in', '▁২০', '২', '5', '!', '▁', '👋', '🏽']
Token count: 19


In [7]:
sp_tokenizer.tokenize("Iloveunbelievablethings")
# No-whitespace demo: unlike word tokenization, this still produces meaningful pieces.

['▁I', 'love', 'un', 'beli', 'ev', 'able', 'thing', 's']



> **Tokens look strange now:** *SentencePiece prioritizes statistical efficiency, not human readability.
Tokens are optimized for likelihood, not visual clarity. The "▁" symbol marks a preceding space.*
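
Because spaces are stored inside the tokens as "▁" (U+2581), the original text can be rebuilt by simple concatenation. The sketch below is a hand-written illustration of that idea, not the library's decoder; `SPACE_MARK` and `detokenize` are hypothetical names.

```python
SPACE_MARK = "\u2581"  # the "▁" symbol that stands for a space

def detokenize(pieces):
    """Join the pieces and turn the space marker back into real spaces."""
    return "".join(pieces).replace(SPACE_MARK, " ").lstrip(" ")

# Pieces shaped like the outputs above, written with the marker constant.
no_space_pieces = [SPACE_MARK + "I", "love", "un", "beli", "ev", "able", "thing", "s"]
spaced_pieces = [SPACE_MARK + "un", "beli", "ev", "able", SPACE_MARK + "things"]

print(detokenize(no_space_pieces))
print(detokenize(spaced_pieces))
```

No language-specific rules are needed to put the text back together, which is part of what makes the scheme language-agnostic.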

**SentencePiece solves:** *multilingual tokenization, whitespace dependency, and most [UNK] tokens across scripts.*

**But** *tokens are harder to interpret and the pipeline is slightly more complex.*



> SentencePiece makes a single tokenizer usable across many languages.







---



## Byte-Level Tokenization (GPT-style)
This section shows that:

* No OOV, ever
* Everything becomes representable
* Tokens stop looking like words
* Robustness beats readability



> Instead of starting from characters or words, byte-level tokenization starts from raw bytes.
Since every text can be represented as bytes, [UNK] is eliminated entirely.
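
The claim that any text reduces to bytes can be checked with the standard library alone. This is a minimal sketch; `byte_values` is an illustrative name chosen to avoid clashing with the notebook's own variables.

```python
text = "I can't believe unbelievable things happened in ২০২5! 👋🏽"

# UTF-8 turns any string into a sequence of integers between 0 and 255.
byte_values = list(text.encode("utf-8"))

print("Characters:", len(text))
print("Bytes:", len(byte_values))
print("Largest byte value:", max(byte_values))

# The mapping is lossless: decoding the bytes gives the text back.
restored = bytes(byte_values).decode("utf-8")
print(restored == text)
```

A base vocabulary of 256 byte values therefore covers every script and emoji, at the cost of several bytes per non-ASCII character.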



In [8]:
from transformers import GPT2TokenizerFast
byte_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
byte_tokens = byte_tokenizer.tokenize(input_text)
print(byte_tokens)
print("Token count:", len(byte_tokens))



['I', 'Ġcan', "'t", 'Ġbelieve', 'Ġunbelievable', 'Ġthings', 'Ġhappened', 'Ġin', 'Ġ', 'à', '§', '¨', 'à', '§', '¦', 'à', '§', '¨', '5', '!', 'ĠðŁĳ', 'ĭ', 'ðŁ', 'ı', '½']
Token count: 25




> Whitespace matters: spaces are part of the token. The `Ġ` prefix marks a token that begins with a space.



> Tokens look “weird”:
the unusual symbols are printable stand-ins for raw UTF-8 bytes, and tokens are optimized for compression and robustness,
not for human interpretation.
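
Where symbols such as `Ġ` and `à§¨` come from can be sketched by rebuilding the byte-to-character table that GPT-2 style tokenizers use to display raw bytes. This is an illustrative re-implementation, not code imported from `transformers`; `bytes_to_unicode` and `show_bytes` are names chosen here for the example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Everything else (space, control bytes, ...) is shifted past 255.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()

def show_bytes(text):
    """Render the UTF-8 bytes of `text` with the printable stand-ins."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(show_bytes(" can"))    # the leading space becomes a visible marker
print(show_bytes("\u09e8"))  # the Bangla digit two is three bytes
```

The strange-looking tokens are therefore not corruption: each symbol stands for exactly one byte of the original text.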







---



## Summary
| Tokenization   | Handles OOV | Multilingual | Readability | Used by     |
| -------------- | ----------- | ------------ | ----------- | ----------- |
| Character      | ✅           | ✅            | High        | Research    |
| Word           | ❌❌          | ❌            | Very high   | Classic NLP |
| WordPiece      | ⚠️          | ⚠️           | Medium      | BERT        |
| SentencePiece  | ⚠️          | ✅            | Low         | XLM-R, T5   |
| Byte-level BPE | ✅           | ✅            | Very low    | GPT         |




---






## Token IDs (Text → Integers)


> **Text → Tokens → Token IDs → Model Input**

*LLMs do not consume tokens as text.
They consume token IDs, which are integers mapped from the tokenizer’s vocabulary.*
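
The token-to-ID step is a dictionary lookup with a fallback for unknown tokens. The sketch below uses a made-up six-entry vocabulary; `toy_vocab`, `encode`, and `decode` are hypothetical names, and real tokenizers ship vocabularies with tens of thousands of entries.

```python
toy_vocab = {"[UNK]": 0, "i": 1, "can": 2, "'": 3, "t": 4, "believe": 5}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def encode(tokens, vocab, unk_token="[UNK]"):
    """Look up each token; anything missing maps to the unknown ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]

def decode(ids, inverse_vocab):
    """Map IDs back to their token strings."""
    return [inverse_vocab[i] for i in ids]

tokens = ["i", "can", "'", "t", "believe", "২০২5"]
ids = encode(tokens, toy_vocab)

print(ids)
print(decode(ids, id_to_token))
```

Once a token has been replaced by the unknown ID, decoding cannot recover the original text, so the information is lost before the model ever sees it.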



In [9]:
# Token IDs for each tokenizer

# WordPiece (BERT)
wp_ids = subword_tokenizer.convert_tokens_to_ids(subword_tokens)
list(zip(subword_tokens, wp_ids))

# Each token maps to a fixed integer ID
# [UNK] always maps to the same ID (100 here)
# IDs range from 0 to vocab_size - 1



[('i', 1045),
 ('can', 2064),
 ("'", 1005),
 ('t', 1056),
 ('believe', 2903),
 ('unbelievable', 23653),
 ('things', 2477),
 ('happened', 3047),
 ('in', 1999),
 ('[UNK]', 100),
 ('!', 999),
 ('[UNK]', 100)]

**Token IDs are not random, but they also do not carry semantic meaning by themselves.**

In [10]:
# SentencePiece (XLM-R)
sp_ids = sp_tokenizer.convert_tokens_to_ids(sp_tokens)
list(zip(sp_tokens, sp_ids))

# No [UNK] for the Bangla digits
# IDs still come from a fixed-size vocabulary
# The large IDs reflect XLM-R's large multilingual vocabulary


[('▁I', 87),
 ('▁can', 831),
 ("'", 25),
 ('t', 18),
 ('▁believe', 18822),
 ('▁un', 51),
 ('beli', 14473),
 ('ev', 4854),
 ('able', 2886),
 ('▁things', 8966),
 ('▁happened', 73659),
 ('▁in', 23),
 ('▁২০', 32427),
 ('২', 46176),
 ('5', 758),
 ('!', 38),
 ('▁', 6),
 ('👋', 246511),
 ('🏽', 245164)]

In [12]:
# Byte-level
byte_ids = byte_tokenizer.convert_tokens_to_ids(byte_tokens)
list(zip(byte_tokens, byte_ids))


[('I', 40),
 ('Ġcan', 460),
 ("'t", 470),
 ('Ġbelieve', 1975),
 ('Ġunbelievable', 24479),
 ('Ġthings', 1243),
 ('Ġhappened', 3022),
 ('Ġin', 287),
 ('Ġ', 220),
 ('à', 156),
 ('§', 100),
 ('¨', 101),
 ('à', 156),
 ('§', 100),
 ('¦', 99),
 ('à', 156),
 ('§', 100),
 ('¨', 101),
 ('5', 20),
 ('!', 0),
 ('ĠðŁĳ', 50169),
 ('ĭ', 233),
 ('ðŁ', 8582),
 ('ı', 237),
 ('½', 121)]