# Lab: Tokenize Texts into Subword Tokens
## Purpose:
- Apply subword tokenization to address the OOV problem
- Use special tokens

### Topics:
- Subword Tokenization
- Special tokens

### Steps
- Experiment with Gemma's tokenizer to explore subword tokenization.
- Tokenize the made-up word "Clusterophonexia" and inspect its first token.
- Inspect how Gemma handles emojis and the purpose of its special tokens.

Date: 2026-02-20

Source: https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_2/gdm_lab_2_3_tokenize_texts_into_subword_tokens.ipynb

References: https://github.com/google-deepmind/ai-foundations
- Google DeepMind's GitHub repository for AI courses taught at the university and college level.

In [None]:
%%capture

# Install the custom package for this course. This also installs the gemma
# package.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

from gemma import gm # For interacting with the Gemma tokenizer.
# For providing feedback on your implementations.
from ai_foundations.feedback.course_2 import subword_tokens as feedback

### Subword Tokenization
A compromise between character-level and word-level tokenization.
* Frequent words (like "the" or "is") are kept as single, complete tokens.
* Rare or complex words (like "Baobab") are broken down into smaller, meaningful sub-units.
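
The split described above can be sketched with a toy greedy longest-match tokenizer. This is only an illustration: `TOY_VOCAB` and `greedy_subword_split` are made up for this note, and Gemma's real tokenizer learns its vocabulary from data, so it will likely segment words differently.

```python
# Toy vocabulary, invented for illustration; it is NOT Gemma's vocabulary.
TOY_VOCAB = {"the", "is", "Ba", "ob", "ab", "a", "b", "o", "B"}


def greedy_subword_split(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            if candidate in vocab:
                pieces.append(candidate)
                start = end
                break
        else:
            # No piece matched: fall back to the single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(greedy_subword_split("the", TOY_VOCAB))     # ['the']
print(greedy_subword_split("Baobab", TOY_VOCAB))  # ['Ba', 'ob', 'ab']
```

The frequent word stays whole because it is in the vocabulary, while the rare word is covered by several shorter pieces, so nothing ends up out of vocabulary.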

### Load and experiment with the Gemma tokenizer

To gain a better intuition of how the Gemma tokenizer works, run the following cell to load it.

Gemma has a vocabulary of more than 260,000 tokens.

In [None]:
# Load the tokenizer.
gemma_tokenizer = gm.text.Gemma3Tokenizer()

# Inspect the vocabulary size.
print(f"Gemma's vocabulary consists of {gemma_tokenizer.vocab_size:,} tokens.")

#### Encoding and decoding

Use the `encode()` function of the tokenizer to translate arbitrary input text to token IDs Gemma can process.

In [None]:
# Encode a text into token IDs.
text = "The Baobab (genus Adansonia) is one of the most iconic trees."

gemma_tokens = gemma_tokenizer.encode(text)
print(f"Result of tokenizing the text \"{text}\":")
print(gemma_tokens)

In [None]:
# Decode the tokens back to a text.
decoded_text = gemma_tokenizer.decode(gemma_tokens)
print(f"Decoded sentence from tokens: {decoded_text}\n")

# Check whether this results in the same text as the original one.
is_equal = "✅" if text == decoded_text else "❌"
print(
    f"Decoding the tokens results in the same text as the original one:"
    f" {is_equal}\n"
)

# Decode individual tokens.
for token in gemma_tokens:
    decoded_token = gemma_tokenizer.decode(token)
    print(f"Token {token}:\t{decoded_token}")

### Tokenize a made-up word

In [None]:
# Encode the made-up word, then decode its first token back to text.
clusterophonexia = "Clusterophonexia"
clusterophonexia_tokens = gemma_tokenizer.encode(clusterophonexia)
first_token_as_text = gemma_tokenizer.decode(clusterophonexia_tokens[0])
print(first_token_as_text)
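
To see every piece of the word rather than only the first one, the per-token decoding loop used earlier can be wrapped in a small helper. The sketch below is an assumption-laden illustration: `token_pieces` is a hypothetical helper, and `CharTokenizer` is a stand-in (one token per character) so the block runs without the Gemma package.

```python
def token_pieces(tokenizer, text):
    """Return the decoded text of each token ID produced for `text`."""
    return [tokenizer.decode(token_id) for token_id in tokenizer.encode(text)]


class CharTokenizer:
    """Hypothetical stand-in tokenizer: one token per character."""

    def encode(self, text):
        return [ord(ch) for ch in text]

    def decode(self, token_id):
        return chr(token_id)


pieces = token_pieces(CharTokenizer(), "Clusterophonexia")
print(pieces)
print(f"{len(pieces)} pieces")
```

In the notebook, `token_pieces(gemma_tokenizer, clusterophonexia)` should list Gemma's subword pieces, assuming `decode()` accepts a single token ID as it does in the cells above.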

### Tokenizing Unicode characters

In [None]:
gemma_tokens = gemma_tokenizer.encode("I am smiling ☺️!")

for token in gemma_tokens:
    decoded_token = gemma_tokenizer.decode(token)
    print(f"Token {token}:\t{decoded_token}")
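
An emoji that looks like one symbol can be stored as several Unicode code points, which is one reason a tokenizer may emit more than one token for it. The sketch below assumes the emoji is written as U+263A followed by the variation selector U+FE0F; whether the string in the cell above contains exactly this sequence is an assumption.

```python
import unicodedata

# Assumed spelling of the emoji: smiling face plus a variation selector.
emoji = "\u263A\uFE0F"

for ch in emoji:
    name = unicodedata.name(ch, "<unnamed>")
    utf8_bytes = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  {name}  UTF-8: {utf8_bytes.hex(' ')}")

# One visible symbol, but more than one code point and several bytes.
print(f"Code points: {len(emoji)}, UTF-8 bytes: {len(emoji.encode('utf-8'))}")
```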

## Special tokens
* **`<BOS>`** and **`<EOS>`**:

  Mark the start and end of a distinct piece of text. Advantages:

  * Efficient batching: Feed multiple documents to the model in a single batch without extensive padding.

  * Dynamic generation: During text generation, the `<EOS>` token serves as a stop signal. Instead of generating a fixed number of tokens, the model can generate text until it produces an `<EOS>` token, allowing it to decide when a response is complete.

* **`<PAD>`**:

  All sequences in a batch must have the same length so that they can be stacked into a single tensor. Shorter sequences are therefore "padded" with this token until they match the length of the longest sequence in the batch.

* **`<UNK>`**:

  Placeholder for a character or symbol not in the tokenizer's vocabulary.
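
Padding can be sketched in a few lines of plain Python. The helper `pad_batch` and the value `PAD_ID = 0` are assumptions made for illustration, and the token IDs are arbitrary example values. The mask is a common companion to padding that tells the model which positions hold real tokens.

```python
PAD_ID = 0  # Assumed ID for illustration; check the tokenizer's real value.


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad each sequence to the longest length and build a mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch = [[2, 9259, 1902, 236888, 1], [2, 9259, 1]]
padded, mask = pad_batch(batch)
print(padded)  # [[2, 9259, 1902, 236888, 1], [2, 9259, 1, 0, 0]]
print(mask)    # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```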

### Special tokens in Gemma

Gemma's special tokens can be accessed through `gemma_tokenizer.special_tokens`.

**Expected output**
```
<_Gemma3SpecialTokens.BOS: 2>
<_Gemma3SpecialTokens.EOS: 1>
```

In [None]:
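# Beginning of sequence (BOS) token.
# A notebook cell only displays its last expression, so print both values.
print(repr(gemma_tokenizer.special_tokens.BOS))

# End of sequence (EOS) token.
print(repr(gemma_tokenizer.special_tokens.EOS))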

The tokenizer can also add the BOS and EOS tokens to a sequence automatically. This is useful when preparing data for fine-tuning a chatbot on prompts and model answers, because the EOS token at the end of each answer lets the model learn when to stop generating.

**Expected output**
```
[<_Gemma3SpecialTokens.BOS: 2>,
 9259,
 1902,
 236888,
 <_Gemma3SpecialTokens.EOS: 1>]
```

In [None]:
token_ids = gemma_tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
token_ids
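
The role of EOS as a stop signal can be sketched with a toy generation loop. Everything here is hypothetical: `generate` is a minimal illustration, `scripted_model` is a stand-in that replays a fixed answer instead of calling a real model, and the IDs are copied from the expected output above.

```python
BOS_ID = 2  # Values taken from the expected output shown above.
EOS_ID = 1


def generate(next_token_fn, prompt_ids, max_new_tokens=10):
    """Append tokens from `next_token_fn` until EOS or the length cap."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token_fn(ids)
        ids.append(token)
        if token == EOS_ID:
            break  # The model signalled that the response is complete.
    return ids


def scripted_model(ids):
    """Hypothetical stand-in for a model: replays a fixed answer."""
    script = [9259, 1902, 236888, EOS_ID]
    position = len(ids) - 1  # Tokens generated so far (prompt is just BOS).
    return script[position] if position < len(script) else EOS_ID


print(generate(scripted_model, [BOS_ID]))
# [2, 9259, 1902, 236888, 1]
```

The loop ends after four new tokens even though the cap is ten, because the fourth token is EOS. A model that never emits EOS would run until `max_new_tokens` is reached.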