# ✨ Normalization and pre-tokenization
Before subword tokenization (like BPE, WordPiece, or Unigram) takes place, inputs are **normalized** and **pre-tokenized**.  

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 1️⃣ Normalization: Cleaning Text Before Tokenization

Normalization prepares text for tokenization. Depending on the tokenizer, it may include:
- Lowercasing
- Removing accents
- Stripping redundant whitespace
- Applying Unicode normalization (NFC, NFKC, etc.)
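
The Unicode forms named in the last item can be compared with Python's standard `unicodedata` module. This is only a small illustration of how the forms differ, not a description of what any particular tokenizer applies.

```python
import unicodedata

composed = "\u00e9"      # "é" stored as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# NFC composes and NFD decomposes; both keep the character itself intact.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# NFKC also folds "compatibility" characters, such as the "ﬁ" ligature.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(nfc_ligature == ligature)  # True: NFC leaves the ligature alone
print(nfkc_ligature)             # fi
```

Decomposing first (NFD) is what makes accent removal simple: the accents become separate combining characters that can be filtered out.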

In [None]:
from transformers import AutoTokenizer

# Load a typical "uncased" BERT tokenizer, which applies normalization
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Display the type of the backend tokenizer object (from the Hugging Face Tokenizers library)
print(type(tokenizer.backend_tokenizer))

# Call the normalizer's normalize_str method to see normalization directly
example_text = "Héllò hôw are ü?"
print(tokenizer.backend_tokenizer.normalizer.normalize_str(example_text))  # Should be "hello how are u?"

### Try: Normalization with a Cased Tokenizer

Check how normalization works differently when you use cased vs. uncased models.


In [None]:
tokenizer_cased = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer_cased.backend_tokenizer.normalizer.normalize_str(example_text))
# "Héllò hôw are ü?": unchanged, because the cased model does NOT lowercase or remove accents!
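
The cased/uncased difference can be mimicked with a single flag. The sketch below uses only the standard library; `sketch_normalize` and its `uncased` parameter are hypothetical names, and the whitespace collapsing follows the list of normalization steps above rather than any documented BERT behavior.

```python
import unicodedata


def sketch_normalize(text: str, uncased: bool = True) -> str:
    """Rough stand-in for BERT-style normalization (hypothetical helper)."""
    # Collapse runs of whitespace into single spaces (an assumption of this sketch).
    text = " ".join(text.split())
    if not uncased:
        # "Cased" mode: keep case and accents as they are.
        return text
    # "Uncased" mode: lowercase, decompose, then drop combining accent marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


demo_text = "Héllò hôw are ü?"
print(sketch_normalize(demo_text, uncased=True))   # hello how are u?
print(sketch_normalize(demo_text, uncased=False))  # Héllò hôw are ü?
```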


## 2️⃣ Pre-Tokenization: First Splits (Before Subword Model Training or Tokenization)

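A tokenizer first splits text into chunks called "pre-tokens", commonly words and punctuation marks, and sometimes whitespace.  
This step is **essential** during training: the subword model is learned from these pre-tokens, so a learned subword does not span a pre-token boundary.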


In [None]:
# For demonstration, use the same basic sentence for all tokenizers
# (note the space after the comma and the double space before "you")
sample = "Hello, how are  you?"

# BERT: splits into words & punctuation, ignores double spaces
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sample))
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]
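
To see what such a split amounts to, here is a rough standard-library approximation. The function name and the regular expression are assumptions made for illustration, not the library's implementation; for example, this pattern treats `_` as a word character, which the real pre-tokenizer may not.

```python
import re


def sketch_bert_pre_tokenize(text):
    """Split into word runs and single punctuation marks, keeping offsets."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    # Whitespace matches neither branch, so it is silently skipped.
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


demo_text = "Hello, how are  you?"
print(sketch_bert_pre_tokenize(demo_text))
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]
```

The offsets still point into the original string, which is how the double space shows up as a gap between `(11, 14)` and `(16, 19)` even though no pre-token contains it.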

## 3️⃣ Different Tokenizers, Different Pre-tokenization Rules

### GPT-2: Whitespace + punctuation split, but keeps leading spaces! (using special marker Ġ)


In [None]:
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sample))
# [('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]
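
A simplified sketch of this behavior, again with the standard library only. The pattern is a loose approximation that omits GPT-2's handling of English contractions, and `GPT2_LIKE` and `sketch_gpt2_pre_tokenize` are hypothetical names.

```python
import re

# Each branch optionally absorbs ONE leading space, so the space travels
# with the word that follows it. Remaining whitespace is matched by the
# last two branches instead of being dropped.
GPT2_LIKE = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+")


def sketch_gpt2_pre_tokenize(text):
    """Split like the pattern above and display spaces as the marker 'Ġ'."""
    return [(m.group().replace(" ", "Ġ"), m.span()) for m in GPT2_LIKE.finditer(text)]


demo_text = "Hello, how are  you?"
print(sketch_gpt2_pre_tokenize(demo_text))
# [('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]
```

Because every character of the input lands in some pre-token, the spans are contiguous from 0 to the end of the string.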

### T5 (SentencePiece): Splits on whitespace (special "▁") but not punctuation.

Also, note the "▁" added at the start of the sentence, and that the double space between "are" and "you" is ignored!


In [None]:
t5_tok = AutoTokenizer.from_pretrained("t5-small")
print(t5_tok.backend_tokenizer.pre_tokenizer.pre_tokenize_str(sample))
# [('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]
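
The whitespace-only split can be approximated in a few lines. This is a sketch under the assumption that every chunk simply gets a "▁" prefix; `sketch_metaspace_pre_tokenize` is a hypothetical helper, not the library's pre-tokenizer.

```python
import re


def sketch_metaspace_pre_tokenize(text):
    """Split on whitespace only and mark each chunk with a leading '▁'."""
    # \S+ matches a run of non-whitespace, so punctuation stays attached
    # to its word and runs of spaces collapse into a single boundary.
    return [("▁" + m.group(), m.span()) for m in re.finditer(r"\S+", text)]


demo_text = "Hello, how are  you?"
print(sketch_metaspace_pre_tokenize(demo_text))
# [('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]
```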

## 4️⃣ Why Is This Important? (Reversible Example)

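- BERT's pre-tokenizer discards whitespace, including the repeated space, so the original spacing cannot be recovered from the tokens (non-reversible).
- GPT-2 and SentencePiece encode spaces as special characters ("Ġ" and "▁"), so decoding is "reversible": you can reconstruct *normalized* text, including spacing, from tokens. In the examples above, GPT-2 keeps the double space as its own pre-token, while the T5 pre-tokenizer keeps a single "▁" per gap.
- Languages without explicit spaces (like Chinese or Japanese) benefit: SentencePiece treats the input as a raw stream of characters, whitespace included, and does not rely on word-level pre-tokenization.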
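
The points above can be checked by trying to rebuild the sentence from the pre-tokens printed in the earlier cells. The lists are copied from those outputs; the joining rules are the simplest ones that fit and are an assumption of this sketch, not a real decoder.

```python
original = "Hello, how are  you?"

# Pre-tokens as printed by the three pre-tokenizers above (offsets dropped).
bert_pre_tokens = ["Hello", ",", "how", "are", "you", "?"]
gpt2_pre_tokens = ["Hello", ",", "Ġhow", "Ġare", "Ġ", "Ġyou", "?"]
t5_pre_tokens = ["▁Hello,", "▁how", "▁are", "▁you?"]

# GPT-2: concatenate and turn each marker back into a space.
gpt2_restored = "".join(gpt2_pre_tokens).replace("Ġ", " ")

# T5: same idea, then drop the space that came from the leading "▁".
t5_restored = "".join(t5_pre_tokens).replace("▁", " ").lstrip()

# BERT: no spacing information survives, so joining with spaces is a guess.
bert_guess = " ".join(bert_pre_tokens)

print(gpt2_restored == original)  # True
print(t5_restored)                # Hello, how are you?
print(bert_guess)                 # Hello , how are you ?
```

Only the GPT-2 pre-tokens give back the exact input. The T5 pre-tokens recover the sentence with single spaces, and the BERT pre-tokens cannot tell where spaces originally were.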


# ✅ Summary

- **Normalization** = clean, standardize, or lowercase text
- **Pre-tokenization** = split into word/punctuation "pre-tokens" + offsets
- Different models (BERT, GPT-2, T5) apply different steps, yielding different results.
- Understanding these steps is essential before building or customizing tokenizer pipelines for new domains or languages.

Ready to explore how the three principal algorithms work? Onward to BPE, WordPiece, and Unigram!
