### TOKENIZATION - INTRODUCTION

#### How Machines Read Text

**Analogy:** Imagine teaching a child to read. You don't start with whole sentences. You start with letters (A, B, C) or sounds (phonics).

1. **Tokenization** works the same way for AI. It breaks text into smaller chunks called tokens.

2. These tokens are then converted into numbers (IDs), because models like GPT-4 or Llama 3 operate on numbers, not on raw text.
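To make the two steps concrete, here is a minimal sketch of the text-to-IDs round trip. The `vocab` dictionary, its IDs, and the `encode`/`decode` helpers are made up for illustration; real tokenizers ship vocabularies with tens of thousands of entries and use smarter splitting rules.

```python
# Toy illustration: text -> tokens -> IDs -> text.
# The vocabulary below is hypothetical; real models use far larger ones.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
id_to_token = {i: tok for tok, i in vocab.items()}

def encode(text):
    """Split on whitespace and map each token to its ID (unknown -> <unk>)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

def decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("The cat sat")
print(ids)          # [0, 1, 2]
print(decode(ids))  # the cat sat
```

Any word missing from the vocabulary collapses to `<unk>` here, which is exactly the weakness the tokenization schemes below try to address.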

#### The Three Types of Tokenization
1. **Word Tokenization:** Splitting on whitespace.

*Problem:* "Apple" and "Apples" are treated as unrelated tokens, so the vocabulary has to be huge.

2. **Character Tokenization:** Splitting by letter.

*Problem:* Sequences become too long (100 words can easily be 500+ characters). "apple" loses its meaning when split into 'a', 'p', 'p', 'l', 'e'.

3. **Subword Tokenization (The Standard):** A mix of both.

*Solution:* Common words ("apple") stay as one token. Rare words ("fingerprinted") split into parts ("finger", "print", "ed").

*Benefit:* Efficient, and handles unknown words well. This is the approach GPT models (byte-pair encoding) and BERT (WordPiece) use.
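The subword idea can be sketched with a greedy longest-match splitter. This is a simplification under stated assumptions: the `SUBWORDS` vocabulary and the `subword_tokenize` helper are hypothetical, and real tokenizers learn their vocabularies from large corpora (BPE and WordPiece differ in how they do so).

```python
# Toy subword tokenizer: greedy longest-match against a fixed vocabulary.
# The vocabulary is hypothetical; real ones are learned from large corpora.
SUBWORDS = {"apple", "finger", "print", "ed", "s"}

def subword_tokenize(word, vocab=SUBWORDS):
    """Split `word` into the longest vocabulary pieces, left to right.
    Falls back to single characters when no piece matches."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matched: emit one character and move on.
            pieces.append(word[start])
            start += 1
    return pieces

print(subword_tokenize("apple"))          # ['apple']
print(subword_tokenize("apples"))         # ['apple', 's']
print(subword_tokenize("fingerprinted"))  # ['finger', 'print', 'ed']
```

Note that "apple" and "apples" now share the piece `apple`, which is the relationship plain word tokenization loses.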

#### Example 1: Space vs. Character Split (The "Old Way")

In [3]:
text = "This is a sample tokenization tutorial"

# 1. Word Tokenization (Naive)
word_tokens = text.split()
print(f"Word Tokens: {word_tokens}")
# Critique: This sample has no punctuation, but in punctuated text the marks
# stay stuck to words (e.g. 'Unbelievably,'), which makes the tokens messy.

# 2. Character Tokenization
char_tokens = list(text)
print(f"Character Tokens: {char_tokens[:10]}...") # Printing just first 10
# Critique: The sequence gets very long, and single letters carry little meaning on their own.

Word Tokens: ['This', 'is', 'a', 'sample', 'tokenization', 'tutorial']
Character Tokens: ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ']...
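The punctuation problem mentioned in the critique can be seen, and partly fixed, with a regular expression. This is only a sketch: the `sample` sentence is made up for illustration, and the pattern is one simple choice among many.

```python
import re

# Hypothetical sample sentence that contains punctuation.
sample = "Unbelievably, this works!"

# Naive split keeps punctuation glued to the words.
print(sample.split())  # ['Unbelievably,', 'this', 'works!']

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sample)
print(tokens)  # ['Unbelievably', ',', 'this', 'works', '!']
```

Separating punctuation helps, but the vocabulary problem of word-level tokens remains.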


#### Example 2: Tokenization with OpenAI's tiktoken

In [None]:
import tiktoken

# Load the tokenizer used by GPT-4 (cl100k_base)
encoder = tiktoken.get_encoding("cl100k_base")

tokens = encoder.encode(text)
print(f"Token IDs: {tokens}")

print("--------------------------------")
print("Mapping IDs back to Text chunks:")
for t in tokens:
    print(f"{t} -> '{encoder.decode([t])}'")

Token IDs: [2028, 374, 264, 6205, 4037, 2065, 22237]
--------------------------------
Mapping IDs back to Text chunks:
2028 -> 'This'
374 -> ' is'
264 -> ' a'
6205 -> ' sample'
4037 -> ' token'
2065 -> 'ization'
22237 -> ' tutorial'
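One detail worth noticing in this output is that the space belongs to the token that follows it. The short check below uses the token strings copied from the output above (no tiktoken call is made) to show that simply concatenating the chunks restores the original text.

```python
original = "This is a sample tokenization tutorial"
# Token strings copied from the tiktoken output above.
chunks = ["This", " is", " a", " sample", " token", "ization", " tutorial"]

# Concatenating the chunks with no separator restores the original string.
reconstructed = "".join(chunks)
print(reconstructed == original)  # True

# Chunks that start a new word carry a leading space; continuations do not.
word_starts = [c for c in chunks if c.startswith(" ")]
print(word_starts)  # [' is', ' a', ' sample', ' token', ' tutorial']
```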


#### Example 3: Tokenization with Hugging Face (BERT)

In [6]:
from transformers import AutoTokenizer

# Load a popular tokenizer (BERT)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is fun!"

# 1. Tokenize (Text to Strings)
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")

# 2. Convert to IDs (Strings to Numbers)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {ids}")

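# Observation:
# The "uncased" BERT model lowercases the text ("Tokenization" -> "token", "##ization").
# The "##" prefix marks a piece that continues the previous token rather than starting a new word.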


Tokens: ['token', '##ization', 'is', 'fun', '!']
Token IDs: [19204, 3989, 2003, 4569, 999]
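To show what the `##` prefix means in practice, the sketch below rejoins WordPiece tokens into words. `merge_wordpieces` is a hypothetical helper written for this tutorial, not part of the `transformers` API, and the input list is copied from the output above.

```python
def merge_wordpieces(tokens):
    """Rejoin WordPiece tokens: a '##' prefix means 'attach to the previous token'."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: strip the marker and append to the last word.
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

pieces = ["token", "##ization", "is", "fun", "!"]  # copied from the output above
print(merge_wordpieces(pieces))  # ['tokenization', 'is', 'fun', '!']
```

The merged result is lowercase because the uncased model discarded the original capitalization before splitting.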


#### Visualizing the "Subword" Magic

In [7]:
# Let's use GPT-4's tokenizer again
text_complex = "Tokenization is unputdownable" 
# 'unputdownable' is a rare word (meaning a book you can't put down)

tokens = encoder.encode(text_complex)

print(f"Original Text: {text_complex}")
print("\nBreakdown:")

for t in tokens:
    part = encoder.decode([t])
    print(f"ID: {t:<8} | Token: '{part}'")
    
# Expected result:
# Even "Tokenization" is split into two pieces: 'Token' + 'ization'.
# The rare word "unputdownable" is likely split into
# ' un', 'put', 'down', 'able' (or similar variations).
# This shows the tokenizer can represent a word it never stored whole by reusing
# known pieces. The splits come from frequency statistics, not grammar rules.

Original Text: Tokenization is unputdownable

Breakdown:
ID: 3404     | Token: 'Token'
ID: 2065     | Token: 'ization'
ID: 374      | Token: ' is'
ID: 653      | Token: ' un'
ID: 631      | Token: 'put'
ID: 2996     | Token: 'down'
ID: 481      | Token: 'able'
