# Tokenization

### Why Tokenization Matters
Neural networks can't read text directly: they work with numbers (vectors, tensors).
Tokenization is the bridge between human language and numerical representation.

You're learning to:
- Convert raw text → numeric sequences (input_ids)
- Control how meaning and structure are preserved
- Integrate this process efficiently in TensorFlow pipelines
  
### What Tokenization Actually Does

At its simplest:
```
"I love TensorFlow" 
→ ["I", "love", "TensorFlow"]
→ [101, 202, 303]   # token IDs
```
But the real world isn't so clean.
What about "TensorFlow's", emojis, different scripts, or rare words like "electroencephalogram"?
So we need smarter ways to split text. Tokenization isn't just about splitting; it's about encoding language structure efficiently.
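
To make the problem concrete, here is a minimal sketch in plain Python (no TensorFlow needed) comparing a naive whitespace split with a simple regex split. The sample sentence and the regex pattern are illustrative assumptions, not something a particular library uses.

```python
import re

text = "TensorFlow's tokenizer isn't magic 🙂!"

# Naive approach: split on whitespace only.
# Punctuation and emojis stay glued to neighbouring characters.
naive = text.split()

# Regex approach: runs of word characters, or any single symbol
# that is neither a word character nor whitespace.
pattern = re.compile(r"\w+|[^\w\s]")
regex_tokens = pattern.findall(text)

print(naive)
print(regex_tokens)
```

Neither result is "correct" on its own: the whitespace split keeps `TensorFlow's` as one opaque token, while the regex split breaks contractions into fragments. Subword tokenizers exist largely to handle this trade-off in a learned, consistent way.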

### Types of Tokenization (Conceptually)
### 1️⃣ Word-level
Splits text by words.
- "I love TensorFlow" → ["I", "love", "TensorFlow"]
- Each unique word gets an ID.
- Pros: Simple, intuitive.
- Cons: Fails on unknown words (OOV problem), needs a huge vocabulary.
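
The OOV problem is easy to see in a small sketch. The helper names `build_vocab` and `encode` below are hypothetical, and reserving IDs 0 and 1 for `<PAD>` and `<UNK>` is an assumption made for illustration.

```python
def build_vocab(corpus):
    """Assign an ID to every unique whitespace-separated word."""
    vocab = {"<PAD>": 0, "<UNK>": 1}  # reserved IDs (an assumption for this sketch)
    for sentence in corpus:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map each word to its ID, falling back to <UNK> for unseen words."""
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence.split()]


vocab = build_vocab(["I love TensorFlow", "I love Keras"])
print(encode("I love TensorFlow", vocab))  # [2, 3, 4]
print(encode("I love PyTorch", vocab))     # [2, 3, 1]  ("PyTorch" is OOV)
```

Every word the vocabulary has never seen collapses to the same `<UNK>` ID, so the model cannot tell "PyTorch" from any other unknown word.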
  
### 2️⃣ Subword-level (most common)
Splits words into smaller, reusable parts.
- Example (BPE / SentencePiece / WordPiece):
  - "TensorFlow" → ["Tensor", "Flow"]
  - "Flowing"    → ["Flow", "ing"]
- Pros:
  - Handles new words (compositional)
  - Keeps vocab manageable (e.g., 30k tokens)
- Cons:
  - Slightly more complex training & encoding
  - Tokens no longer perfectly align with words
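
As a rough illustration of how a subword vocabulary gets applied, the sketch below uses greedy longest-match-first splitting, which is the idea behind WordPiece encoding. It is a simplification: real WordPiece marks continuation pieces with a `##` prefix, and BPE applies learned merge rules instead. The tiny vocabulary and the function name `subword_tokenize` are made up for this example.

```python
def subword_tokenize(word, vocab, unk="<UNK>"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


vocab = {"Tensor", "Flow", "ing", "ed"}  # hypothetical subword vocabulary
print(subword_tokenize("TensorFlow", vocab))  # ['Tensor', 'Flow']
print(subword_tokenize("Flowing", vocab))     # ['Flow', 'ing']
print(subword_tokenize("Keras", vocab))       # ['<UNK>']
```

In practice the vocabulary also contains single characters or bytes, so the `<UNK>` fallback is rarely hit.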



### 3️⃣ Character-level
Every character (letter, punctuation, emoji) is a token.
- "cat" → ["c", "a", "t"]
- Pros: Never OOV.
- Cons: Very long sequences → slower training.
  
### 4️⃣ Byte-level / Unicode-aware
- Each byte or Unicode character is a base token, so any input can be encoded without an unknown token.
- Used by GPT-2 and newer tokenizers, which apply BPE merges on top of raw bytes (robust to any input).
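
The difference between character-level and byte-level tokens shows up as soon as the text leaves ASCII. This small sketch uses only built-in Python string and bytes operations; the sample string is arbitrary.

```python
text = "caf\u00e9 \U0001F642"  # "café 🙂" written with escapes to avoid encoding surprises

# Character-level: one token per Unicode code point.
chars = list(text)

# Byte-level: one token per UTF-8 byte, so every ID is in the range 0..255.
byte_ids = list(text.encode("utf-8"))

print(len(chars), chars)        # 6 characters
print(len(byte_ids), byte_ids)  # 10 bytes

# Byte-level encoding is lossless: the original text can always be recovered.
decoded = bytes(byte_ids).decode("utf-8")
print(decoded == text)          # True
```

A byte-level base vocabulary needs only 256 entries and never produces an unknown token, at the cost of longer sequences for non-ASCII text.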

### Vocabulary, IDs, and Special Tokens
The vocabulary is the mapping between tokens and IDs:
```
{"<PAD>": 0, "<UNK>": 1, "I": 2, "love": 3, "Tensor": 4, "Flow": 5}
```
You'll also have special tokens:
- `<PAD>`: padding for shorter sequences
- `<UNK>`: unknown token
- `<CLS>` / `<SEP>`: sentence markers (used in Transformers)
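
A short sketch of how a vocabulary and special tokens work together. The vocabulary below extends the one above with `<CLS>` and `<SEP>`; the specific IDs and the helper names `encode` and `decode` are assumptions for illustration.

```python
# Hypothetical vocabulary: special tokens first, then regular tokens.
vocab = {"<PAD>": 0, "<UNK>": 1, "<CLS>": 2, "<SEP>": 3,
         "I": 4, "love": 5, "Tensor": 6, "Flow": 7}
id_to_token = {idx: tok for tok, idx in vocab.items()}  # inverse mapping


def encode(tokens):
    """Convert tokens to IDs and wrap them in <CLS> ... <SEP>."""
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]
    return [vocab["<CLS>"]] + ids + [vocab["<SEP>"]]


def decode(ids):
    """Convert IDs back to token strings."""
    return [id_to_token[i] for i in ids]


ids = encode(["I", "love", "Tensor", "Flow", "!"])
print(ids)          # [2, 4, 5, 6, 7, 1, 3]
print(decode(ids))  # ['<CLS>', 'I', 'love', 'Tensor', 'Flow', '<UNK>', '<SEP>']
```

Note that decoding is lossy once `<UNK>` appears: the original "!" cannot be recovered from ID 1.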

### Padding and Truncation
Padding refers to adding special tokens so that all sequences in a batch have the same length.

EX: ["I love TF"]  --> ["I love TF <PAD> <PAD>"]
- We do this because models often require inputs of uniform length.

Truncation refers to cutting off sequences that are too long to fit a specified maximum length.

EX: ["I love TensorFlow and KerasNLP"] --> ["I love TensorFlow"]
- We do this to ensure inputs don't exceed model limits.
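
Both operations fit in one small helper. The sketch below works on lists of token IDs in plain Python; the name `pad_or_truncate` and the choice of 0 as the pad ID are assumptions, and it pads on the right.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return a copy of `ids` that is exactly `max_len` long."""
    ids = ids[:max_len]                           # truncation: drop anything past max_len
    return ids + [pad_id] * (max_len - len(ids))  # padding: fill the rest with pad_id


batch = [[2, 3, 4], [2, 5], [2, 3, 4, 5, 6, 7]]
fixed = [pad_or_truncate(seq, max_len=5) for seq in batch]
print(fixed)
# [[2, 3, 4, 0, 0], [2, 5, 0, 0, 0], [2, 3, 4, 5, 6]]
```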

### From Tokens → Numbers → TensorFlow Tensors
Once tokenized and converted to IDs, we can turn sequences into tensors:
``` py
import tensorflow as tf

tokens = ["I", "love", "TensorFlow"]
ids = [2, 3, 4]  # one ID per token, taken from the vocabulary
tensor = tf.constant(ids)
print(tensor)
# tf.Tensor([2 3 4], shape=(3,), dtype=int32)
```
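When batching, we need uniform shapes, so we pad sequences:
``` py
# Example: two sequences padded with 0 (<PAD>) to length 5
batch = tf.constant([[2,3,4,0,0],
                     [2,5,0,0,0]])
```
Now each example has length max_length, and the batch has shape [batch_size, max_length].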



## Tokenization Tools in the TensorFlow Ecosystem
You have three major paths depending on how deep or flexible you want to go.


In [None]:
%pip install keras-nlp
%pip install tensorflow-text
%pip install transformers
In [None]:
"""
A. KerasNLP Tokenizers (newest & cleanest)
KerasNLP provides tokenization as layers, meaning they fit naturally inside models.

Why: it integrates with your tf.data pipeline or even directly in your model.

Example:
"""
import tensorflow as tf
import keras_nlp

# creating your own tokenizer
""" 
➡️ The BytePairTokenizer class doesn't come pre-trained.
➡️ It expects that you already trained a BPE tokenizer on your data and saved the resulting files (a JSON vocabulary file and merges.txt).
So this example demonstrates the pattern:
1. You train a BPE tokenizer on your corpus.
2. It outputs vocab.json and merges.txt.
3. You then load them locally into the KerasNLP tokenizer class for inference or model training.

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary="assets/vocab.json",
    merges="assets/merges.txt",
)
text = tf.constant(["I love TensorFlow"])
tokens = tokenizer.tokenize(text) # Output: tf.RaggedTensor (variable-length token lists).
"""

# using pre-trained tokenizer (GPT2) recommended
tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")
text = tf.constant(["I Love TensorFlow"])
tokens = tokenizer.tokenize(text)
print(tokens)  # Output: [[40, 5896, 309, 22854, 37535]]
# Each ID maps to a subword in the pre-trained GPT-2 vocabulary,
# e.g. 40 = "I" and 5896 = " Love"; the remaining IDs are the pieces of " TensorFlow".
# Decode tokens back to text
decoded_text = tokenizer.detokenize(tokens)
print(decoded_text)  # Output: tf.Tensor([b'I Love TensorFlow'], shape=(1,), dtype=string)

In [None]:
""" 
B. tensorflow_text (lower-level, powerful)
This library provides TensorFlow-native tokenization ops that run inside the TensorFlow graph, so they work in tf.data pipelines that feed GPU/TPU training.

You can later map tokens to IDs using lookup tables.

Example: 
"""
import tensorflow_text as tf_text
tokenizer = tf_text.WhitespaceTokenizer()
text = tf.constant(["I love TensorFlow"])
tokens = tokenizer.tokenize(text)
print(tokens)  # <tf.RaggedTensor [[b'I', b'love', b'TensorFlow']]>

""" 
Why do you need a lookup table?

Most neural models (e.g., embeddings, transformers) don't work directly with string tokens; they expect integer IDs representing each word or subword.
That's where a lookup table comes in.

# You define a mapping like:
word_to_id = {
    "I": 1,
    "love": 2,
    "TensorFlow": 3,
    "<UNK>": 0
}
# Then you create a TensorFlow lookup table:
table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(list(word_to_id.keys())),
        values=tf.constant(list(word_to_id.values()), dtype=tf.int64),
    ),
    num_oov_buckets=1  # unseen tokens get ID 4 (vocab size + bucket index), not the "<UNK>" ID
)
ids = table.lookup(tokens)
print(ids) # would output: <tf.RaggedTensor [[1, 2, 3]]>

or 

import tensorflow as tf

# Example vocab and IDs
vocab = tf.constant(["I", "love", "TensorFlow", "<UNK>"])
ids = tf.range(tf.size(vocab, out_type=tf.int64))

# Create lookup table
table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(vocab, ids),
    num_oov_buckets=1  # unknown tokens map to ID 4 (len(vocab) + bucket index)
)

# Apply it to your tokens
token_ids = table.lookup(tokens)
print(token_ids) # <tf.RaggedTensor [[0, 1, 2]]>

# Now each token string becomes an integer ID that your model can process.
"""

In [None]:
""" 
🧩 C. External Tokenizers (Hugging Face, SentencePiece)
You can use the same tokenizers that big LLMs use, like SentencePiece or BERT tokenizers.
Example (Hugging Face BERT tokenizer): 

⚠️ But these usually run in Python, not in the TensorFlow graph.
So if used inside tf.data, they require tf.py_function wrappers (can slow things down a bit).
"""
from transformers import AutoTokenizer

# Load a pretrained tokenizer (e.g., BERT)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode text into token IDs
encoded = tokenizer.encode("I love TensorFlow", return_tensors=None)
print("Token IDs:", encoded)

# Decode back to text
decoded = tokenizer.decode(encoded)
print("Decoded text:", decoded)

# Token IDs: [101, 1045, 2293, 23435, 12314, 102]
# Decoded text: [CLS] i love tensorflow [SEP]

# These tokenizers usually run in Python, not inside the TensorFlow graph, so if you want to use them in a tf.data pipeline, you'll need to wrap them in tf.py_function, like:
import numpy as np
import tensorflow as tf

def encode_py(text):
    # Return a NumPy array: a plain Python list is treated as a list of
    # separate outputs, so only the first ID (101) would come back.
    ids = tokenizer.encode(text.numpy().decode("utf-8"))
    return np.array(ids, dtype=np.int32)

def encode_tf(text):
    ids = tf.py_function(func=encode_py, inp=[text], Tout=tf.int32)
    ids.set_shape([None])  # py_function drops static shape information
    return ids

ds = tf.data.Dataset.from_tensor_slices(["I love TensorFlow", "This is fun!"])
ds = ds.map(encode_tf)
for d in ds:
    print(d)
    
# Output:
# tf.Tensor([  101  1045  2293 23435 12314   102], shape=(6,), dtype=int32)
# tf.Tensor([ 101 2023 2003 4569  999  102], shape=(6,), dtype=int32)


### Ragged vs Dense tensors (important TensorFlow detail)

In [49]:
# Text sequences vary in length. TensorFlow uses RaggedTensors to handle that efficiently:
tokens = tf.ragged.constant([
    [2, 3, 4],
    [5, 6]
])
print(tokens)
# shape = [2, None]
# Later, you can convert to dense and pad to fixed length:
max_len = 4
dense = tokens.to_tensor(default_value=0, shape=[None, max_len])
# dense is now [[2, 3, 4, 0], [5, 6, 0, 0]] with shape [2, 4]
# This step is crucial when you batch examples for model input.


<tf.RaggedTensor [[2, 3, 4], [5, 6]]>


### What's actually happening under the hood
Let's trace what happens when you run:
`tokens = tokenizer.tokenize(tf.constant(["I love TensorFlow"]))`

1. String tensor goes into the tokenizer layer.
2. The tokenizer splits strings → outputs a RaggedTensor of subwords.
3. (Optional) a vocab lookup converts subwords → numeric IDs.
4. Padding/truncation shapes the data into [batch, max_len].
5. You feed that tensor into an embedding layer.

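You can even combine these in a preprocessing model (a sketch, assuming a Keras version whose functional API supports ragged intermediate tensors; the length 16 is an arbitrary example):
```py 
inputs = tf.keras.Input(shape=(), dtype=tf.string)
x = tokenizer(inputs)                               # strings -> ragged token IDs
x = x.to_tensor(default_value=0, shape=[None, 16])  # pad/truncate to [batch, 16]
preproc = tf.keras.Model(inputs, x)
```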

### Typical Tokenization Flow in TensorFlow
Conceptually:

Raw text (string)

   ↓

Tokenization (split into subwords)

   ↓

Vocab lookup (convert to IDs)

   ↓

Padding/truncation

   ↓

Batching

   ↓

Feed to Embedding layer
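
The flow above can be sketched end to end in plain Python before moving to TensorFlow ops. This is a toy version under stated assumptions: it splits into lowercase words rather than subwords, and the vocabulary `VOCAB` and all function names are hypothetical.

```python
import re

# Hypothetical vocabulary with reserved IDs for padding and unknown tokens.
VOCAB = {"<PAD>": 0, "<UNK>": 1, "i": 2, "love": 3, "tensorflow": 4,
         "this": 5, "is": 6, "fun": 7, "!": 8}


def tokenize(text):
    """Step 1: raw text -> tokens (lowercased words and punctuation marks)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def lookup(tokens):
    """Step 2: tokens -> IDs, using <UNK> for anything outside the vocabulary."""
    return [VOCAB.get(tok, VOCAB["<UNK>"]) for tok in tokens]


def pad_or_truncate(ids, max_len):
    """Step 3: force every sequence to exactly max_len IDs."""
    ids = ids[:max_len]
    return ids + [VOCAB["<PAD>"]] * (max_len - len(ids))


def make_batch(texts, max_len):
    """Step 4: apply the three steps to each text and stack the results."""
    return [pad_or_truncate(lookup(tokenize(t)), max_len) for t in texts]


batch = make_batch(["I love TensorFlow", "This is fun!", "Keras"], max_len=4)
print(batch)
# [[2, 3, 4, 0], [5, 6, 7, 8], [1, 0, 0, 0]]
```

The resulting rectangular list of IDs is what would be converted to a tensor and fed to an embedding layer.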

### Choosing the Right Approach (as an Engineer)

| **Use Case**                 | **Recommended Approach**                                         |
|------------------------------|------------------------------------------------------------------|
| Simple demos, small tasks    | `tensorflow_text.WhitespaceTokenizer()`                          |
| Custom tokenizer/vocab       | `tensorflow_text` + lookup table                                 |
| Production / TPU / Performance | TF-native ops (`tensorflow_text` or `KerasNLP`)                |
| LLM / pretrained models      | Hugging Face or SentencePiece                                    |
| Reproducibility & sharing    | SentencePiece (saves model + vocab)                              |




# Tokenizer selection & integration (advanced points)

You're converting raw text into integer token IDs that a model can process. Tokenization is a huge topic: choices here affect model size, speed, ability to handle new words, and final performance. You're also deciding whether tokenization runs in Python (off-graph) or as TF ops (in-graph).

### Tokenizer families (short overview)
- Rule-based / simple tokenizers: whitespace split, regex. Quick and sometimes fine for simple tasks.
- Subword tokenizers (most common): Byte-Pair Encoding (BPE), WordPiece, SentencePiece (Unigram). They balance vocabulary size and OOV handling.
- Character-level: every char is a token; robust, but sequences get long.
- Library-integrated tokenizers: KerasNLP or the Hugging Face tokenizers library, which package the algorithms above with end-to-end integration.
  
### Common libs:
- SentencePiece: trains a model (BPE or unigram); outputs IDs; implemented in C++ with Python bindings.
- Hugging Face tokenizers (Rust): very fast (but using it in a TF graph requires py_function or pre-tokenizing).
- tensorflow_text: TF ops for tokenization, normalizing, BERT-style WordPiece.
- KerasNLP: high-level tokenizers that integrate as TF layers (if available in your TF/Keras version).

### Two tokenization modes you should support (learning framing)
- Python/tokenizer-mode: call SentencePiece or HuggingFace tokenizers from Python; simplest to implement for prototyping.
- TF/tokenizer-mode: use TF-native tokenizers (tensorflow_text or KerasNLP layers) to tokenize inside the tf.data pipeline (faster, avoids py_function overhead).
  
### Core concepts to implement regardless of tokenizer
- Special tokens: [PAD], [CLS], [SEP], [UNK]. Decide IDs and reserve them in the vocab.
- Max length (max_len): choose a sequence length. Truncate longer sequences, pad shorter ones.
- Attention masks: binary mask (1 for real tokens, 0 for padding) used by many models.
- Padding side: left or right (commonly right-pad for sequence models).
- Token type ids: for tasks like Next Sentence Prediction or pair inputs (optional).
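
These concepts can be tied together in one small sketch that builds the three arrays many Transformer models expect for a sentence pair. The function name `encode_pair` is hypothetical; the default IDs 101 and 102 mirror the `[CLS]` and `[SEP]` IDs seen in the BERT output earlier, and padding is done on the right.

```python
def encode_pair(ids_a, ids_b, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Build input_ids, attention_mask and token_type_ids for a sentence pair."""
    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] + A + first [SEP]; segment 1 covers B + last [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)

    # Naive truncation from the right (this can cut off the final [SEP]).
    input_ids = input_ids[:max_len]
    token_type_ids = token_type_ids[:max_len]

    # 1 marks real tokens; padding positions get 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
    }


features = encode_pair([7, 8], [9], max_len=8)
print(features["input_ids"])       # [101, 7, 8, 102, 9, 102, 0, 0]
print(features["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
print(features["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 0, 0]
```

Production tokenizers usually truncate more carefully, for example by trimming the longer sentence first and always keeping the closing `[SEP]`.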
  
### Example pipeline choices
- Raw text → Python tokenizer → pad/truncate → batch.
- Raw text → tf.py_function wrapping a Python tokenizer inside a tf.data.map.
- Raw text → TF tokenizer (tensorflow_text or KerasNLP) inside tf.data.map (pure graph).
- Pre-tokenized TFRecord → parse input_ids and attention_mask and batch (fastest).

