# Lecture 7: Tokenization Deep Dive
Welcome to Lecture 7 of AI Engineering Essentials (Part 2)! In this notebook, we'll break down how text is transformed into numbers for machine learning models. You'll see how tokenization works, what Byte Pair Encoding (BPE) is, how attention masks help, and even try out some multilingual examples.

## Learning Goals
- Understand how BPE tokenization works
- Explore attention masks and their purpose
- Tokenize a French sentence and compare models

## 1. Setup: Import Libraries and Load a Tokenizer
Let's start by importing the Hugging Face Transformers library and loading a pretrained tokenizer. We'll use `bert-base-uncased` for our first examples; it lowercases its input and splits words with a WordPiece vocabulary.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

## 2. Tokenize a Simple English Sentence
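Let's see how a basic English sentence gets split into tokens and IDs. This is the first step in turning text into something a model can understand. In the output, a `##` prefix marks a piece that continues the previous token rather than starting a new word.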

In [None]:
sentence = "Tokenization is the first step in NLP magic!"

tokens = tokenizer.tokenize(sentence)
print("Tokens:", tokens)

token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

Tokens: ['token', '##ization', 'is', 'the', 'first', 'step', 'in', 'nl', '##p', 'magic', '!']
Token IDs: [19204, 3989, 2003, 1996, 2034, 3357, 1999, 17953, 2361, 3894, 999]


## 3. Byte Pair Encoding (BPE) Explained
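Byte Pair Encoding (BPE) is a clever way to break words into smaller pieces (subwords). This helps the model handle new or rare words by splitting them into known chunks. BPE is the tokenizer behind models such as GPT-2; the BERT tokenizer we loaded uses WordPiece, a closely related method that scores candidate merges differently but produces the same kind of subword vocabulary. Let's see how BPE builds its vocabulary.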

**How BPE works:**
1. Start with all characters as your vocabulary.
2. Find the most frequent pair of adjacent symbols (characters or subwords) in the training corpus.
3. Merge them into a new symbol.
4. Repeat until you reach a set number of merges or can't merge anymore.

This keeps the vocabulary small and lets the model handle words it's never seen before.
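
The four steps above can be sketched in plain Python. This is a minimal toy version, not the implementation used by any particular library: the corpus `word_freqs` and the helper names (`get_pair_counts`, `merge_pair`, `train_bpe`) are made up for illustration, and the sketch leaves out details that real BPE tokenizers usually add, such as end-of-word markers and byte-level fallback.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def merge_pair(corpus, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dictionary."""
    # Step 1: every word starts as a sequence of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break  # Step 4: nothing left to merge.
        # Step 2: most frequent pair (ties go to the pair seen first).
        best = max(counts, key=counts.get)
        # Step 3: merge it into a new symbol everywhere.
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

# Hypothetical toy corpus: word -> number of occurrences.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = train_bpe(word_freqs, num_merges=4)
print("Merges:", merges)
print("Segmented words:", list(corpus))
```

With this toy corpus the frequent suffix `est` and the frequent word `low` become single symbols after four merges, while the rarer `lower` stays split as `low`, `e`, `r`.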

## 4. Visualizing Token Splits for a Complex Word
Let's pick a long word and see whether the tokenizer breaks it down. We'll use "unbelievable" as our example.

In [None]:
word = "unbelievable"
tokens = tokenizer.tokenize(word)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens:", tokens)
print("Token IDs:", token_ids)

Tokens: ['unbelievable']
Token IDs: [23653]


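Notice that the word is not split at all: `unbelievable` is common enough to have its own entry in the `bert-base-uncased` vocabulary (ID 23653). Rarer words do get split, as `tokenization` was in section 2 (`token`, `##ization`). Strictly speaking, BERT's tokenizer uses WordPiece rather than BPE, but the effect is the same: splitting into subwords lets the model process words it never saw as whole units during training.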
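
To see why a word comes back either whole or in `##` pieces, here is a rough sketch of the greedy longest-match-first lookup that WordPiece tokenizers such as BERT's apply to each word. The function name `wordpiece_tokenize` and both toy vocabularies are assumptions for illustration; the real `bert-base-uncased` vocabulary has roughly 30,000 entries and the library adds normalization and punctuation handling that this sketch omits.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word by repeatedly taking the longest vocabulary match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabularies, not the real BERT vocabulary.
vocab_with_word = {"unbelievable", "token", "##ization", "un", "##believ", "##able"}
vocab_without_word = vocab_with_word - {"unbelievable"}

print(wordpiece_tokenize("unbelievable", vocab_with_word))
print(wordpiece_tokenize("unbelievable", vocab_without_word))
print(wordpiece_tokenize("tokenization", vocab_with_word))
```

When the full word is in the vocabulary it wins because it is the longest match; remove it and the same word falls back to smaller known pieces.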

## 5. Attention Masks: What and Why?
When working with batches of sentences, we often need to pad them so they're all the same length. Attention masks tell the model which parts are real data and which are just padding.

In [None]:
sentences = [
    "Tokenization is fun!",
    "Let's see how attention masks work."
]

encoded = tokenizer(
    sentences,
    padding=True,
    return_tensors="pt"
)

print("Input IDs:\n", encoded["input_ids"])
print("Attention Masks:\n", encoded["attention_mask"])

Input IDs:
 tensor([[  101, 19204,  3989,  2003,  4569,   999,   102,     0,     0,     0,
             0],
        [  101,  2292,  1005,  1055,  2156,  2129,  3086, 15806,  2147,  1012,
           102]])
Attention Masks:
 tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


The attention mask is a series of 1s and 0s. 1 means "real token," 0 means "padding." This helps the model ignore the padding during training and inference.
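
What `padding=True` does can be reproduced by hand. The sketch below is a simplified stand-in, not the library's code: `pad_batch` is a hypothetical helper, the ID lists are copied from the output above, and `pad_id=0` matches the padding ID that appears in that output.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad each ID list to the longest length and build matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real IDs, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = padding
    return input_ids, attention_mask

# Token IDs of the two sentences above, including [CLS] (101) and [SEP] (102).
batch = [
    [101, 19204, 3989, 2003, 4569, 999, 102],
    [101, 2292, 1005, 1055, 2156, 2129, 3086, 15806, 2147, 1012, 102],
]

input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids)
    print(mask)
```

The shorter sentence gains four padding IDs and four zeros in its mask, which matches the tensors printed by the tokenizer.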

## 6. Multilingual Example: Tokenizing French
Let's see how the tokenizer handles a French sentence. We'll also compare how different models split the same sentence.

In [None]:
french_sentence = "L'intelligence artificielle change le monde."

french_tokens_bert = tokenizer.tokenize(french_sentence)
print("BERT tokens:", french_tokens_bert)

multi_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
french_tokens_multi = multi_tokenizer.tokenize(french_sentence)
print("Multilingual BERT tokens:", french_tokens_multi)

BERT tokens: ['l', "'", 'intelligence', 'art', '##ific', '##iel', '##le', 'change', 'le', 'monde', '.']


Multilingual BERT tokens: ['L', "'", 'intelligence', 'arti', '##ficie', '##lle', 'change', 'le', 'monde', '.']


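Notice the difference in how the two tokenizers split the French sentence. The uncased English tokenizer lowercases `L` and needs four pieces for `artificielle`, while the cased multilingual tokenizer keeps the capital and needs three. A vocabulary trained on many languages generally covers non-English words with fewer pieces.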
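
One simple way to put a number on the difference is to count pieces. The snippet below reuses the two token lists printed above; `summarize` is a hypothetical helper, and piece counts are only a rough proxy for how well a vocabulary fits a language.

```python
bert_tokens = ['l', "'", 'intelligence', 'art', '##ific', '##iel', '##le',
               'change', 'le', 'monde', '.']
multi_tokens = ['L', "'", 'intelligence', 'arti', '##ficie', '##lle',
                'change', 'le', 'monde', '.']

def summarize(tokens):
    """Count total pieces and how many of them continue a previous piece."""
    continuations = sum(token.startswith("##") for token in tokens)
    return {"pieces": len(tokens), "continuations": continuations}

print("BERT:", summarize(bert_tokens))
print("Multilingual BERT:", summarize(multi_tokens))
```

Fewer pieces for the same sentence means shorter input sequences, which matters once sentences get long or a model's maximum length becomes a constraint.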

## 7. Key Takeaways
- Tokenization is the first step in NLP magic, and understanding it helps you debug and optimize your pipelines.
- Subword methods such as BPE and WordPiece (the one BERT uses) let models handle new words by breaking them into known pieces.
- Attention masks help models ignore padding.
- Multilingual tokenizers usually split non-English text into fewer pieces than English-only ones.

## Call to Action
Try tokenizing a sentence in your native language and share your findings in the course repo!