# Lab: Tokenize Texts into Characters and Words
## Purpose:
- Explore the statistical distribution of token frequencies
- Explore tokenization strategies

### Topics:
- Zipf's Law
- Token frequency
- Preprocessing alters token frequency
- Character-level tokenization
- Word-level tokenization

### Steps
1. Preprocess and count tokens in the Africa Galore dataset.
2. Visualize word frequency distributions in relation to Zipf's law.
3. Observe the effects of character-level versus word-level tokenization on sequence length and vocabulary size.

Date: 2026-02-20

Source: https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_2/gdm_lab_2_2_tokenize_texts_into_characters_and_words.ipynb

References: https://github.com/google-deepmind/ai-foundations
- GDM GH repo used in AI training courses at the university & college level.

In [None]:
%%capture
# Install the custom package for this course.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

import re # For defining regular expressions.
import pandas as pd # For loading the dataset.
import textwrap # For making paragraphs more readable.
from collections import Counter # For counting tokens.

from ai_foundations import visualizations # For visualizations.
from ai_foundations.feedback.course_2 import tokenize # For providing feedback.

In [None]:
# Load the Africa Galore dataset.
africa_galore = pd.read_json(
    "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json"
)
dataset = africa_galore["description"].values
print("Loaded dataset with", dataset.shape[0], "paragraphs.\n")
print(f"The first paragraph is:\n{textwrap.fill(dataset[0])}")

### Preprocess Text & Count Tokens
1. The `preprocess_text()` function removes punctuation so that punctuation marks do not interfere with token interpretation.
2. The `preprocess_text()` function converts all text to lowercase.

In [None]:
def preprocess_text(paragraphs: list[str]) -> list[str]:
    """Preprocesses a list of text paragraphs.

    Lowercases & tokenizes text, removing punctuation.
    Use regex for more precise tokenization than a space tokenizer.
    Args:
      paragraphs: A list of strings, where each string represents a paragraph
        of text.
    Returns:
      A list of strings, where each string is a lowercase token extracted from
        the input paragraphs.
    """

    # Convert the text to lower case.
    paragraphs = [text.lower() for text in paragraphs]

    tokens_list = []
    # The regular expression r'\b\w+\b' matches each run of word characters
    # (letters, digits, underscores), so punctuation and whitespace are dropped.
    # This isolates individual words more precisely than `.split()`.
    for paragraph in paragraphs:
        for token in re.findall(r'\b\w+\b', paragraph):
            tokens_list.append(token)
    return tokens_list


# Process all paragraphs in the dataset.
tokens_list = preprocess_text(dataset)
print(tokens_list[:10])
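
The word-boundary pattern has some side effects worth knowing about. The following is a minimal sketch on a made-up sentence (the sentence is an assumption, not dataset text) showing how apostrophes, hyphens, and digit separators are handled by the same regular expression used in `preprocess_text()`.

```python
import re

# Hypothetical sample sentence, not taken from the Africa Galore dataset.
sample = "It's a 2,000-year-old trade route (the so-called 'salt road')."

# Same pattern as in preprocess_text(): runs of word characters only.
tokens = re.findall(r"\b\w+\b", sample.lower())
print(tokens)

# Contractions, hyphenated words, and numbers with separators are split apart.
print("'s' is a separate token:", "s" in tokens)
print("'2' and '000' are separate tokens:", "2" in tokens and "000" in tokens)
```

Whether this splitting is acceptable depends on the task; it keeps the tokenizer simple at the cost of breaking forms such as "it's" into two tokens.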

### Compute token counts

Next, you will compute how many times each token appears in the dataset.

------
>
> Complete the implementation of `get_token_counts()`.
> It should return a [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) object where the keys are tokens and the values are the frequency of each token in `tokens_list`.
>
------

In [None]:
def get_token_counts(tokens_list: list[str]) -> Counter[str]:
    """Calculates the frequency of each token in a list of tokens.
    Args:
      tokens_list: A list of string tokens.

    Returns:
      A Counter where keys are the unique tokens and values are their
        corresponding frequencies.
    """

    # Add your code here.
    token_counts = Counter(tokens_list)

    return token_counts

token_counts = get_token_counts(tokens_list)

# Print the 10 tokens with the highest counts.
print(token_counts.most_common(10))

# Print the 10 least common tokens in the Africa Galore dataset.
# Note that most_common() returns all items in descending order of their count.
print(token_counts.most_common()[-10:])

### Zipf's law

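Zipf's law observes that a token's frequency is roughly inversely proportional to its rank in the frequency table: the second most common token appears about half as often as the most common one, the third about a third as often, and so on.
As a result, a very small number of tokens (like "the", "a", "and") are extremely common, while the vast majority are rare, often appearing only once.
This creates a "long tail" distribution.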
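
The long-tail shape can be checked directly by counting how many unique tokens occur exactly once. This is a small sketch on a made-up sentence; the text and the variable names are assumptions for illustration, and the same counting would apply to the `token_counts` computed for the dataset.

```python
from collections import Counter

# Hypothetical token list, for illustration only.
tokens = "the sun sets over the river and the boats drift on the river".split()
token_counts = Counter(tokens)

# Tokens that occur exactly once make up the "long tail".
singletons = [token for token, count in token_counts.items() if count == 1]
share = len(singletons) / len(token_counts)

print(token_counts.most_common(2))
print(f"{len(singletons)} of {len(token_counts)} unique tokens appear once ({share:.0%})")
```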

**Log-log plot**: Makes the long tail readable when plotting the frequency distribution.
- Plots the logarithm of the frequency against the logarithm of the rank. A distribution that follows Zipf's law appears as a roughly straight, downward-sloping line.
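
To see why a log-log scale is informative, the sketch below fits a straight line to log frequency against log rank using only the standard library. The counts are toy values constructed to follow an inverse-rank pattern exactly (an assumption for illustration, not Africa Galore statistics), so the fitted slope comes out very close to -1. Real data would only approximate this.

```python
import math
from collections import Counter

# Toy counts constructed so that count = 1200 / rank.
# These numbers are illustrative assumptions, not dataset statistics.
toy_counts = Counter({"the": 1200, "of": 600, "and": 400, "in": 300, "to": 240})

# Rank tokens from most to least frequent, starting at rank 1.
ranked = toy_counts.most_common()
log_ranks = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
log_freqs = [math.log10(count) for _, count in ranked]

# Least-squares slope of log(frequency) against log(rank).
mean_x = sum(log_ranks) / len(log_ranks)
mean_y = sum(log_freqs) / len(log_freqs)
slope = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(log_ranks, log_freqs)
) / sum((x - mean_x) ** 2 for x in log_ranks)

print(f"Slope on the log-log scale: {slope:.2f}")
```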

In [None]:
sample_text = dataset[0]
print(textwrap.fill(sample_text))

## Tokenization
Tokenization breaks text into smaller units (tokens), and each token is then mapped to an integer token ID. The IDs are the input to the language model.
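
The step from tokens to token IDs can be sketched as a simple lookup table. This is a minimal illustration, assuming a vocabulary built from the unique tokens in sorted order; the token list and the names `token_to_id` and `id_to_token` are hypothetical and not part of the course package.

```python
# Hypothetical token sequence, e.g. the output of some tokenizer.
tokens = ["the", "river", "meets", "the", "sea"]

# Build a vocabulary: one integer ID per unique token, in sorted order.
vocabulary = sorted(set(tokens))
token_to_id = {token: index for index, token in enumerate(vocabulary)}
id_to_token = {index: token for token, index in token_to_id.items()}

# Encode the token sequence as IDs and decode it back.
token_ids = [token_to_id[token] for token in tokens]
decoded = [id_to_token[index] for index in token_ids]

print(token_ids)
print(decoded)
```

The mapping is arbitrary as long as it is consistent; the model only ever sees the integers.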

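**Character-level tokenization**: Solves the out-of-vocabulary problem because any word can be spelled from a small, fixed set of characters. The trade-off is that a single character carries little meaning on its own.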

In [None]:
def character_tokenize(text: str) -> list[str]:
    """Splits text on characters.
    Args:
      text: The text to split.
    Returns:
      A list of tokens.
    """
    tokens = list(text)
    return tokens

print(character_tokenize(dataset[0])[:10])
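
The out-of-vocabulary behavior of character-level tokens can be illustrated with a small comparison. The training text and the unseen word below are made up for this sketch; the point is only that a word missing from a word-level vocabulary can still be covered character by character.

```python
# Hypothetical training text and an unseen word, for illustration only.
training_text = "the market sells ripe mangoes"
unseen_word = "games"

# Word-level vocabulary: the set of whole words seen in training.
word_vocab = set(training_text.split(" "))

# Character-level vocabulary: the set of characters seen in training.
char_vocab = set(training_text)

# The unseen word is out of vocabulary at the word level...
print(unseen_word in word_vocab)
# ...but every one of its characters is in the character vocabulary.
print(all(char in char_vocab for char in unseen_word))
```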

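**Word-level tokenization**: Splitting on spaces leaves punctuation and capitalization attached to words, so many tokens are variations on the same word (for example, "city" and "city," count as different tokens).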

In [None]:
def space_tokenize(text: str) -> list[str]:
    """Splits text on spaces.

    Args:
      text: The text to split.

    Returns:
      A list of tokens.
    """
    tokens = text.split(" ")
    return tokens

print(space_tokenize(dataset[0])[:10])
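
A short sketch of how splitting on spaces produces several surface forms of one word. The sentence is a made-up example, not dataset text, and the code uses `str.split(" ")` directly so that it runs on its own.

```python
from collections import Counter

# Hypothetical sentence, not taken from the dataset.
text = "Drums echo. The drums, the dancers and the Drums!"

# Splitting on spaces treats each surface form as its own token.
counts = Counter(text.split(" "))
drum_variants = [token for token in counts if "drums" in token.lower()]

print(drum_variants)
print(f"'the' appears {counts['the']} times, 'The' appears {counts['The']} time")
```

Each variant would receive its own entry in a word-level vocabulary, which is one reason the preprocessing step lowercases text and strips punctuation before counting.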

### Tokenization and Sequence Length
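- The tokenization method pulls sequence length and vocabulary size in opposite directions: character-level tokenization yields long sequences over a small vocabulary, while word-level tokenization yields shorter sequences over a larger vocabulary.
- Memory and compute demands grow with sequence length.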

In [None]:
# character-based tokenizer
tokens_char = character_tokenize(dataset[0])
print(tokens_char)
print(f"The length of the sequence is: {len(tokens_char)}")

# word-based tokenizer
tokens_word = space_tokenize(dataset[0])
print(tokens_word)
print(f"The length of the sequence is: {len(tokens_word)}")

### Tokenization & Vocabulary Size
A model can neither process nor generate any token outside its vocabulary. Vocabulary size involves three trade-offs:

1. Larger vocabularies allow more information to be distributed across tokens.

With a **word-level vocabulary**, the distinct meanings of "time" and "the" are captured in separate, specific token representations. Information is clearly distributed.

With a **character-level vocabulary**, each token must contribute to representing every word containing it. For example, "t" is part of "time", "the", and "train." This forces a single token's parameters to hold a vast amount of contextual information, making it difficult for the model to learn representations that capture precise meaning.

2. Larger vocabularies **require more parameters for the model**, directly increasing size and computational cost. Each token needs a unique representation stored in the model's parameters. Bigger vocabularies lead to larger models that demand more memory and processing power, making both training and inference slower and more expensive.

3. Larger vocabularies contain **more tokens that appear very infrequently** in training data. The model cannot learn a reliable representation for these rare tokens because it lacks sufficient examples of their usage. This "data sparsity" problem hinders the model's ability to handle less common words or concepts effectively.
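
The parameter cost in point 2 can be made concrete with simple arithmetic. The sketch below assumes each token is stored as one embedding vector of fixed width; the helper name `embedding_parameter_count`, the vocabulary sizes, and the embedding width are all illustrative assumptions rather than figures from the lab.

```python
def embedding_parameter_count(vocab_size: int, embedding_dim: int) -> int:
    """Returns the number of parameters in a vocab_size x embedding_dim table."""
    # One embedding vector of length embedding_dim per vocabulary entry.
    return vocab_size * embedding_dim


# Illustrative sizes only; these are assumptions, not measurements.
embedding_dim = 256
vocab_sizes = {"character-level": 100, "word-level": 50_000}

for name, vocab_size in vocab_sizes.items():
    params = embedding_parameter_count(vocab_size, embedding_dim)
    print(f"{name}: {params:,} embedding parameters")
```

Under these assumed sizes the word-level table is 500 times larger than the character-level one, before counting any other part of the model.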

In [None]:
# Print the number of unique tokens under character-level and word-level tokenization.
vocab_char = set(tokens_char)
vocab_word = set(tokens_word)
print(f"The character-level vocabulary consists of {len(vocab_char):,} tokens.")
print(f"The word-level vocabulary consists of {len(vocab_word):,} tokens.")