# Lab: Experiment With N-Gram Models
## Purpose:
- Estimate next-word probabilities
- Build a (small) n-gram model on a (tiny) dataset.
- Understand n-gram models & their limitations
### Topics:
- Tokenization
- Probability estimation
- Token prediction

Date: 2026-02-14

Source: https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_1/gdm_lab_1_2_experiment_with_n_gram_models.ipynb#scrollTo=pbtgZxrpjm6j

References: https://github.com/google-deepmind/ai-foundations
- Google DeepMind (GDM) GitHub repo used in AI training courses at the university & college level.

### Understanding the math
**N-gram**: A contiguous sequence of $n$ words.

**Context**: The preceding sequence of $n-1$ words.

**How are n-grams related to the context?** N-gram models use n-grams to estimate the probability of the next word based on the context.
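
As a small illustration of these definitions, the sketch below splits one trigram into its context and its next word. The trigram comes from the example sentence tokenized later in this lab; the variable names are illustrative only.

```python
# A trigram (n = 3) represented as a tuple of tokens.
trigram = ("a", "colorful", "printed")

# The context is the first n-1 tokens; the next word is the last token.
context = trigram[:-1]
next_word = trigram[-1]

print(context)    # ('a', 'colorful')
print(next_word)  # printed
```

An n-gram model asks how likely `next_word` is, given that `context` was just seen.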

**Text Corpus**: A dataset consisting of a collection of texts.

#### Computing the probability of the next word
Given $\mbox{A}$ is the context

Given $\mbox{B}$ is the next word

Compute the probability $P(\mbox{B} \mid \mbox{A})$:

$$P(\mbox{B} \mid \mbox{A}) = \frac{\mbox{Count}(\mbox{A B})}{\mbox{Count}(\mbox{A})}$$

The full n-gram counts, $\mbox{Count}(\mbox{A B})$, and the context n-gram counts, $\mbox{Count}(\mbox{A})$, can be computed by counting n-grams in a dataset (**text corpus**).
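
To make the formula concrete, here is a minimal sketch for the bigram case ($n = 2$), where the context is a single token. The toy corpus and the helper name `estimate_probability` are made up for illustration and are not part of the lab or the `ai_foundations` package.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the lab's dataset).
tokens = "the cat sat on the mat the cat ran".split(" ")

# Count(A): how often each single-token context appears.
unigram_counts = Counter(tokens)
# Count(A B): how often each (context, next word) pair appears.
bigram_counts = Counter(zip(tokens, tokens[1:]))


def estimate_probability(context: str, next_word: str) -> float:
    """Returns Count(context next_word) / Count(context); 0.0 for an unseen context."""
    if unigram_counts[context] == 0:
        return 0.0
    return bigram_counts[(context, next_word)] / unigram_counts[context]


print(estimate_probability("the", "cat"))  # 2 / 3
print(estimate_probability("the", "mat"))  # 1 / 3
```

Here "the" occurs three times and is followed by "cat" twice, so the estimate for $P(\mbox{cat} \mid \mbox{the})$ is $2/3$.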

### Set up the local environment
See `environment_setup.md` for detailed instructions.

Quick setup (if running in Colab or a fresh environment):

In [None]:
# Install the AI Foundations package from GitHub only if it is not already available.
try:
    import ai_foundations
    print("ai_foundations is already installed.")
except ImportError:
    print("Installing ai_foundations...")
    %pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

In [None]:
# Packages used.
import random           # For sampling from probability distributions.
from collections import Counter, defaultdict # For counting n-grams.

import textwrap         # For automatically adding linebreaks to long texts.
import pandas as pd     # For constructing and visualizing tables.

# Custom functions for providing feedback on your solutions.
from ai_foundations.feedback.course_1 import ngrams

### Africa Galore dataset
A specialized dataset, generated by Gemini, containing information on African culture, history, & geography. Generating the dataset with Gemini is intended to keep the data clean, that is, free of noise and inconsistencies.

In [None]:
africa_galore = pd.read_json(
    "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json"
)
dataset = africa_galore["description"]
# `dataset` is a pandas Series. Its `.shape` attribute (not a method) is a
# 1-tuple, so `dataset.shape[0]` is the number of paragraphs.
# `len(dataset)` gives the same value.
print(f"The dataset consists of {dataset.shape[0]} paragraphs.")

In [None]:
# Inspect first 10 paragraphs in dataset
for paragraph in dataset[:10]:
    # textwrap automatically adds linebreaks to make long texts more readable.
    formatted_paragraph = textwrap.fill(paragraph)
    print(f"{formatted_paragraph}\n")

### About Tokenization
Each paragraph above is a single continuous string. The next function splits a string on spaces to produce tokens. Splitting on spaces does not take punctuation into account, so a token is often the same as a word, but not always.

In [None]:
def space_tokenize(text: str) -> list[str]:
    """Splits a string into a list of words (tokens).
    Splits text on single space characters.
    Args:
        text: The input text.
    Returns:
        A list of tokens. An empty string yields [""], and consecutive spaces
        yield empty-string tokens, because `str.split(" ")` is used.
    """
    tokens = text.split(" ")
    return tokens

# Tokenize an example text with the `space_tokenize` function.
space_tokenize("Kanga, a colorful printed cloth is more than just a fabric.")

The output is a list of the tokens in the sentence. Note that punctuation stays attached to the neighboring word:
['Kanga,',
 'a',
 'colorful',
 'printed',
 'cloth',
 'is',
 'more',
 'than',
 'just',
 'a',
 'fabric.']

In [None]:
# Test it on the first paragraph of the dataset.
space_tokenize(dataset[0])
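
The limitations of splitting on spaces show up quickly when tokens are counted. The sketch below redefines the tokenizer with the same `split(" ")` behavior so that it runs on its own; the example sentences are made up for illustration.

```python
from collections import Counter


def space_tokenize(text: str) -> list[str]:
    # Same behavior as the lab's tokenizer: split on single spaces.
    return text.split(" ")


# "fabric" and "fabric." are counted as two different tokens.
counts = Counter(space_tokenize("The fabric is bright. I like the fabric."))
print(counts["fabric"])   # 1
print(counts["fabric."])  # 1

# Consecutive spaces and empty input produce empty-string tokens.
print(space_tokenize("two  spaces"))  # ['two', '', 'spaces']
print(space_tokenize(""))             # ['']
```

Because punctuation and capitalization create distinct tokens ("The" vs. "the", "fabric" vs. "fabric."), counts for what a reader would consider the same word get split across several entries.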

### Coding activity 1
`space_tokenize()` creates a list of tokens. However, the conditional probability of a token $\mbox{B}$ following the preceding context $\mbox{A}$, $P(\mbox{B} \mid \mbox{A})$, depends on how often the relevant **n-grams** and **(n-1)-grams** appear in the dataset, so the tokens must first be grouped into n-grams.

---
- The function `generate_ngrams()` will be called once for each paragraph in the dataset.
    - It takes a paragraph and an integer *n*, and creates n-grams of length *n*.
- Use `space_tokenize()` to tokenize the text, then build a list of n-grams of length *n* from the tokens.
- Represent each n-gram as a tuple using `tuple()`.

In [None]:
all_unigrams = []
all_bigrams = []
all_trigrams = []

def generate_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Generates n-grams from a given text.
    Args:
        text: The input text string.
        n: The size of the n-grams (e.g., 2 for bigrams, 3 for trigrams).
    Returns:
        A list of n-grams, each represented as a tuple of tokens.
    """
    # Tokenize text.
    # My code below
    tokens = space_tokenize(text)

    # Construct the list of n-grams.
    ngrams = []
    num_of_tokens = len(tokens)

    # The last n-gram is tokens[num_of_tokens - n : num_of_tokens].
    # If the text has fewer than n tokens, the range is empty and no n-grams are produced.
    for i in range(0, num_of_tokens - n + 1):
        ngrams.append(tuple(tokens[i:i+n]))

    return ngrams

# Why did they hard-code this instead of creating a function for reusability?
for paragraph in dataset:
    # Calling `generate_ngrams` with n=1 constructs a list of unigrams.
    all_unigrams.extend(generate_ngrams(paragraph, n=1))
    # Calling `generate_ngrams` with n=2 constructs a list of bigrams (2-grams).
    all_bigrams.extend(generate_ngrams(paragraph, n=2))
    # Calling `generate_ngrams` with n=3 constructs a list of trigrams (3-grams).
    all_trigrams.extend(generate_ngrams(paragraph, n=3))

print("First 10 Unigrams:", all_unigrams[:10])
print("First 10 Bigrams:", all_bigrams[:10])
print("First 10 Trigrams:", all_trigrams[:10])

In [None]:
# Check the implementation with the feedback function from the `ngrams` module.
ngrams.test_generate_ngrams(generate_ngrams, space_tokenize)
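
Once n-grams are available as tuples, one possible way to move toward $P(\mbox{B} \mid \mbox{A})$ is to group them by context. The sketch below is an assumption about how that step could look, not the lab's reference solution. `toy_bigrams` is a hypothetical stand-in for a slice of `all_bigrams`, and `next_token_counts` is an illustrative name.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a few entries of `all_bigrams`.
toy_bigrams = [("the", "cat"), ("cat", "sat"), ("the", "mat"), ("the", "cat")]

# Map each context (all tokens but the last) to a Counter of next tokens.
next_token_counts = defaultdict(Counter)
for ngram in toy_bigrams:
    context, next_token = ngram[:-1], ngram[-1]
    next_token_counts[context][next_token] += 1

# Turn the counts for one context into estimated probabilities.
query_context = ("the",)
total = sum(next_token_counts[query_context].values())
probabilities = {
    token: count / total
    for token, count in next_token_counts[query_context].items()
}
print(probabilities)
```

Note that this sketch uses the number of n-grams that start with the context as the denominator. That can differ slightly from counting the (n-1)-grams directly, because a context at the very end of a paragraph has no following token.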