# Lab: Experiment With N-Gram Models
## Purpose:
- Estimate next-word probabilities programmatically
- Build a (small) n-gram model on a (tiny) dataset.
- Understand n-gram models & their limitations
### Topics:
- Tokenization
- Probability estimation
- Token prediction

Date: 2026-02-18

Source: https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_1/gdm_lab_1_2_experiment_with_n_gram_models.ipynb

References: https://github.com/google-deepmind/ai-foundations
- Google DeepMind GitHub repository of course materials used in AI courses at the university and college level.

### Understanding the math
**N-gram**: A continuous sequence of $n$ words.

**Context**: The preceding sequence of $n-1$ words.

**How are n-grams related to the context?** N-gram models use n-grams to estimate the probability of the next word based on the context.

**Text Corpus**: A dataset consisting of a collection of texts

Computing the Probability of the next word
---
Given $\mbox{A}$ is the context

Given $\mbox{B}$ is the next word

Compute the probability $P(\mbox{B} \mid \mbox{A})$:

$$P(\mbox{B} \mid \mbox{A}) = \frac{\mbox{Count}(\mbox{A B})}{\mbox{Count}(\mbox{A})}$$

The full n-gram counts, $\mbox{ Count}(\mbox{A B})$, and the context n-gram counts, $\mbox{ Count}(\mbox{A})$, can be computed by counting n-grams in a dataset (**text corpus**).
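
As a minimal sketch of the formula above, the following computes one bigram probability by hand. The toy corpus and all variable names are assumptions made up for illustration; they are not part of the lab.

```python
from collections import Counter

# Toy corpus (hypothetical), tokenized by splitting on spaces.
tokens = "the cat sat on the mat and the cat slept".split(" ")

# Count(A B): how often each pair of adjacent tokens occurs.
bigram_counts = Counter(zip(tokens, tokens[1:]))
# Count(A): how often each token occurs as a context,
# that is, followed by at least one more token.
context_counts = Counter(tokens[:-1])

context, next_word = "the", "cat"
probability = bigram_counts[(context, next_word)] / context_counts[context]
print(f"P({next_word} | {context}) = {probability:.3f}")
```

Here "the" appears three times as a context and is followed by "cat" twice, so the estimate is 2/3.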

### Set up the local environment
See `environment_setup.md` for detailed instructions.

Quick setup (if running in Colab or a fresh environment):

In [None]:
# Install the AI Foundations package directly from GitHub
# %pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"
# Or use the requirements file if available
try:
    import numpy as np
    import ai_foundations
    print("ai_foundations is already installed.")
except ImportError:
    print("Installing ai_foundations...")
    %pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

In [None]:
# Packages used.
import random           # For sampling from probability distributions.
from collections import Counter, defaultdict # For counting n-grams.

import textwrap         # For automatically adding linebreaks to long texts.
import pandas as pd     # For constructing and visualizing tables.

# Custom functions for providing feedback on your solutions.
# from ai_foundations.feedback.course_1 import ngrams
import ai_foundations
from ai_foundations.feedback.course_1 import ngrams

### Africa Galore dataset
A specialized dataset of paragraphs on African culture, history, and geography, generated with Gemini. Generating the text with Gemini is intended to keep the data clean, with little noise and few inconsistencies.

In [None]:
africa_galore = pd.read_json(
    "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json"
)
dataset = africa_galore["description"]
# `dataset` is a pandas Series. Its `.shape` attribute (not a method) is a
# tuple whose first entry is the row count; len(dataset) gives the same number.
print(f"The dataset consists of {dataset.shape[0]} paragraphs.")

In [None]:
# Inspect first 10 paragraphs in dataset
for paragraph in dataset[:10]:
    # textwrap automatically adds linebreaks to make long texts more readable.
    formatted_paragraph = textwrap.fill(paragraph)
    print(f"{formatted_paragraph}\n")

### About Tokenization
Each of the paragraphs above is a single continuous string. The next function splits such a string on spaces to produce tokens. Splitting on spaces does not take punctuation into account, so a token is often the same as a word, but not always.

In [None]:
def create_tokens(text: str) -> list[str]:
    """Splits a string on space to create list of tokens.
    Args:
        text: The input text.
    Returns:
        List of tokens. Note that `str.split(" ")` returns `['']` for an empty
        string and produces empty-string tokens for consecutive spaces.
    """
    tokens = text.split(" ")
    return tokens

# Tokenize an example text.
create_tokens("Kanga, a colorful printed cloth is more than just a fabric.")

The output is the list of tokens in the sentence. Punctuation stays attached to the neighboring word (`'Kanga,'` and `'fabric.'`).
>```
>['Kanga,',
> 'a',
> 'colorful',
> 'printed',
> 'cloth',
> 'is',
> 'more',
> 'than',
> 'just',
> 'a',
> 'fabric.']

In [None]:
# Test it on the first paragraph of the dataset.
create_tokens(dataset[0])
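
To illustrate the punctuation issue described above, here is a hedged sketch of a tokenizer that separates punctuation from words using a regular expression. The function name `create_tokens_with_punctuation` is hypothetical, and the rest of this lab keeps using the space-splitting `create_tokens`.

```python
import re


def create_tokens_with_punctuation(text: str) -> list[str]:
    """Hypothetical alternative that emits punctuation as separate tokens."""
    # \w+ matches a run of letters, digits, or underscores.
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace, for example a comma or a period.
    return re.findall(r"\w+|[^\w\s]", text)


example_tokens = create_tokens_with_punctuation(
    "Kanga, a colorful printed cloth is more than just a fabric."
)
print(example_tokens)
```

With this variant, `'Kanga,'` becomes the two tokens `'Kanga'` and `','`, so the word is counted the same way whether or not a comma follows it.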

### Coding activity 1
create_tokens() creates a list of tokens; however, the conditional probability of any token $\mbox{B}$ following the preceding context $\mbox{A}$, $P(\mbox{B} \mid \mbox{A})$, relies on how often any **n-grams** and **(n-1)-grams** appear in the dataset.

---
- The function generate_ngrams() will be called once for each paragraph in the dataset.
    - It takes a paragraph and an integer to create n-grams of the length of the integer.
- Use create_tokens() to create a list of n-grams of length *n* for a text.
- Represent each n-gram as a tuple using tuple().

In [None]:
all_unigrams = []
all_bigrams = []
all_trigrams = []

def generate_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Generates n-grams from a given text.
    Args:
        text: The input text string.
        n: The size of the n-grams (e.g., 2 for bigrams, 3 for trigrams).
    Returns:
        A list of n-grams, each represented as a tuple of tokens.
    """
    # Tokenize text.
    # My code below
    tokens = create_tokens(text)

    # Construct the list of n-grams.
    ngrams = []
    num_of_tokens = len(tokens)

    # The last n-gram is tokens[num_of_tokens - n : num_of_tokens].
    for i in range(0, num_of_tokens - n + 1):
        ngrams.append(tuple(tokens[i:i+n]))

    return ngrams

# This is hard-coded as an example for the student to understand the process.
for paragraph in dataset:
    # Calling `generate_ngrams` with n=1 constructs a list of unigrams.
    all_unigrams.extend(generate_ngrams(paragraph, n=1))
    # Calling `generate_ngrams` with n=2 constructs a list of bigrams (2-grams).
    all_bigrams.extend(generate_ngrams(paragraph, n=2))
    # Calling `generate_ngrams` with n=3 constructs a list of trigrams (3-grams).
    all_trigrams.extend(generate_ngrams(paragraph, n=3))

print("First 10 Unigrams:", all_unigrams[:10])
print("First 10 Bigrams:", all_bigrams[:10])
print("First 10 Trigrams:", all_trigrams[:10])

In [None]:
# testing function built into the ngrams library
ngrams.test_generate_ngrams(generate_ngrams, create_tokens)
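
For comparison, the same sliding-window idea can be sketched with `zip` instead of index slicing. This is an alternative formulation, not the lab's reference solution; `generate_ngrams_zip` is a hypothetical name, and it takes a token list rather than raw text so that the block is self-contained.

```python
def generate_ngrams_zip(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Builds n-grams by zipping n shifted copies of the token list."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped position by position.
    # zip stops at the shortest input, so no incomplete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))


example_tokens = "Table Mountain is tall.".split(" ")
zip_bigrams = generate_ngrams_zip(example_tokens, 2)
zip_trigrams = generate_ngrams_zip(example_tokens, 3)
print(zip_bigrams)
print(zip_trigrams)
```

If `n` is larger than the number of tokens, the result is an empty list, which matches the behavior of the slicing loop.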

### Special case of n-gram counts for bigrams & trigrams
These cells use the Python `Counter` data type. See https://docs.python.org/3/library/collections.html#collections.Counter and https://docs.python.org/3/library/collections.html#collections.defaultdict
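
A short sketch of the two data types on made-up data may help before reading the cells that use them. The variable names and values are illustrative only.

```python
from collections import Counter, defaultdict

# Counter counts hashable items and can report the most frequent ones.
word_counts = Counter(["a", "b", "a", "c", "a", "b"])
print(word_counts.most_common(2))
# Looking up a missing key returns 0 instead of raising a KeyError.
print(word_counts["z"])

# defaultdict(Counter) creates an empty Counter the first time a key is used,
# so nested counts can be incremented without any initialization.
next_token_counts = defaultdict(Counter)
next_token_counts["is"]["a"] += 1
next_token_counts["is"]["a"] += 1
next_token_counts["is"]["the"] += 1
print(dict(next_token_counts))
```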

In [None]:
bigram_counts = Counter(all_bigrams)

# Print the ten most common bigrams.
print("Most common bigrams:")
for bigram, count in bigram_counts.most_common(10):
    print(f"  ({bigram}, {count})")

# Use the Python Counter data type for computing the counts of all trigrams.
trigram_counts = Counter(all_trigrams)

# Print the ten most common trigrams.
print("\n\nMost common trigrams:")
for trigram, count in trigram_counts.most_common(10):
    print(f"  ({trigram}, {count})")

Most common bigrams:
>```
>  (('is', 'a'), 144)
>  (('of', 'the'), 100)
>  (('and', 'the'), 69)
>  (('in', 'the'), 61)
>  (('with', 'a'), 60)
>  (('in', 'a'), 55)
>  (('and', 'a'), 50)
>  (('to', 'the'), 42)
>  (('was', 'a'), 39)
>  (('It', 'is'), 33)


Most common trigrams:
>```
>  (('went', 'looking', 'for'), 32)
>  (('a', 'symbol', 'of'), 18)
>  (('was', 'hungry', 'so'), 18)
>  (('The', 'result', 'is'), 17)
>  (('looking', 'for', 'a'), 17)
>  (('she', 'went', 'looking'), 16)
>  (('he', 'went', 'looking'), 16)
>  (('result', 'is', 'a'), 15)
>  (('so', 'he', 'went'), 14)
>  (('so', 'she', 'went'), 14)

### General implementation of n-gram count

In [None]:
def get_ngram_counts(dataset: list[str], n: int) -> dict[str, Counter]:
    """Computes the n-gram counts from a dataset.
    Takes a list of text strings as input,
    constructs n-grams, and creates a dictionary where:
        * Keys represent n-1 token long contexts `context`.
        * Values are a Counter object `counts` such that `counts[next_token]`
          is the count of `next_token` following `context`.
    Args:
        dataset: The list of text strings in the dataset.
        n: The size of the n-grams to generate (e.g., 2 for bigrams, 3 for
            trigrams).
    Returns:
        A dictionary where keys are (n-1)-token contexts and values are Counter
        objects storing the counts of each next token for that context.
    """

    # Define the dictionary as a defaultdict initialized w/ empty Counter obj.
    # This allows you to access and set value of ngram_counts[context][next_token]
    # w/o initializing ngram_counts[context] or ngram_counts[context][next_token] first.

    ngram_counts = defaultdict(Counter)

    # Loop through all paragraphs.
    for paragraph in dataset:
        # Loop through all n-grams for the paragraph.
        for ngram in generate_ngrams(paragraph, n):
            # Extract the context. This will be all but the last token.
            context = " ".join(ngram[:-1])
            # Extract the next token. This will be the last token of the n-gram.
            next_token = ngram[-1]
            # Increment the counter for the context - next_token pair by 1.
            ngram_counts[context][next_token] += 1

    return dict(ngram_counts)

# Example usage of the function.
example_data = [
    "This is an example sentence.",
    "Another example sentence.",
    "Split a sentence."
]
ngram_counts = get_ngram_counts(example_data, 2)

# Print the bigram counts dictionary for the dataset consisting of the
# three example sentences.
print("Bigram counts dictionary:\n")
print("{")
for context, counter in ngram_counts.items():
    print(f"  '{context}': {counter},")
print("}")

Example Output.
Bigram counts dictionary:
>```
>{
>  'This': Counter({'is': 1}),
>  'is': Counter({'an': 1}),
>  'an': Counter({'example': 1}),
>  'example': Counter({'sentence.': 2}),
>  'Another': Counter({'example': 1}),
>  'Split': Counter({'a': 1}),
>  'a': Counter({'sentence.': 1}),
>}

In [None]:
# Test
# @title Run this cell to test your implementation.
ngrams.test_ngram_counts(get_ngram_counts, generate_ngrams)

### Use Pandas to create a table showing all possible n-gram counts in the Africa Galore dataset

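Most of the counts, and therefore most of the probabilities, will be 0: for any context $\mbox{A}$, the probability $P(\mbox{B} \mid \mbox{A})$ is 0 for most tokens $\mbox{B}$.

Sparsity increases as the context gets longer, because the number of possible contexts grows much faster than the amount of data.

In this dataset, 99.95% of the possible bigrams and 99.98% of the possible trigrams never appear.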

In [None]:
# count bigrams
bigram_counts = get_ngram_counts(dataset, n=2)
# There are 5,143 × 5,176 possible (context, next token) combinations;
# 99.95% of them never appear in the dataset.

# Use the pandas library to display the counts in a table.
bigram_counts_matrix = {
    context: dict(counts) for context, counts in bigram_counts.items()
}
bigram_data_frame = pd.DataFrame.from_dict(
    bigram_counts_matrix, orient="index").fillna(0)

display(bigram_data_frame)

zero_count = (bigram_data_frame == 0).sum().sum()
print(
    f"Number of bigrams with a count of 0: {zero_count:,}"
    f" ({zero_count/bigram_data_frame.size * 100:.2f}%)"
)

# Count trigrams
trigram_counts = get_ngram_counts(dataset, n=3)
# There are 13,411 × 5,142 possible (context, next token) combinations;
# 99.98% of them never appear in the dataset.

# Use the pandas library to display the counts in a table.
trigram_counts_matrix = {
    context: dict(counts) for context, counts in trigram_counts.items()
}
trigram_data_frame = pd.DataFrame.from_dict(
    trigram_counts_matrix, orient="index").fillna(0)

display(trigram_data_frame)

zero_count = (trigram_data_frame == 0).sum().sum()
print(
    f"Number of trigrams with a count of 0: {zero_count:,}"
    f" ({zero_count/trigram_data_frame.size * 100:.2f}%)"
)
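
The zero-count share can also be computed straight from the nested counts, without building the dense table. The sketch below assumes the same dictionary shape that `get_ngram_counts` returns and uses the three-sentence example from earlier as toy data; `count_zero_cells` is a hypothetical helper.

```python
from collections import Counter

# Toy bigram counts, copied from the three-sentence example above.
toy_counts = {
    "This": Counter({"is": 1}),
    "is": Counter({"an": 1}),
    "an": Counter({"example": 1}),
    "example": Counter({"sentence.": 2}),
    "Another": Counter({"example": 1}),
    "Split": Counter({"a": 1}),
    "a": Counter({"sentence.": 1}),
}


def count_zero_cells(ngram_counts: dict[str, Counter]) -> tuple[int, int]:
    """Returns (zero cells, total cells) of the context x next-token table."""
    next_tokens = set()
    for counter in ngram_counts.values():
        next_tokens.update(counter)
    # Rows are contexts, columns are all tokens seen as a next token.
    total_cells = len(ngram_counts) * len(next_tokens)
    # Every stored (context, next token) pair has a count of at least 1.
    nonzero_cells = sum(len(counter) for counter in ngram_counts.values())
    return total_cells - nonzero_cells, total_cells


zero_cells, total_cells = count_zero_cells(toy_counts)
print(f"{zero_cells} of {total_cells} cells are 0 ({zero_cells / total_cells:.0%})")
```

Even in this tiny example, 28 of the 35 cells are zero, because each context is followed by only one distinct token.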

## Calculate P(B | A)

Compute the probability of any token $\mbox{B}$ following any context $\mbox{A}$.

$$P(\mbox{B} \mid \mbox{A}) = \frac{\mbox{Count}(\mbox{A B})}{\mbox{Count}(\mbox{A})}$$

Use the `get_ngram_counts` function to compute both $\mbox{Count}(\mbox{A B})$ and $\mbox{Count}(\mbox{A})$. For example, to estimate the probabilities of a trigram model, which uses contexts of length 2, compute the counts in the numerator and the denominator for a `dataset` as:

```python
# Numerator.
trigram_counts = get_ngram_counts(dataset, n=3)
# Denominator.
bigram_counts = get_ngram_counts(dataset, n=2)
```

**Trick:** the bigram count can be computed directly from the trigram counts, without calling `get_ngram_counts()` twice.

To observe how this works, consider the trigram counts for all trigrams that start with the bigram "a staple." You can access these using the dictionary `trigram_counts` that is defined above:

```python
context = "a staple"            # the two-token context to look up
trigram_counts[context]         # Counter of the tokens that follow "a staple", with their counts
```

The counter in the output of the previous cell shows you that the dataset contains the following trigrams starting with "a staple":

* "a staple food" (1 time).
* "a staple in" (6 times).
* "a staple dish" (2 times).
* "a staple throughout" (1 time).
* "a staple of" (1 time).
* "a staple at" (1 time).
* "a staple beverage" (1 time).

The trick to get the bigram count of "a staple" is to sum the number of trigrams that start with "a staple." From the counter above, we can compute this total by summing $1+6+2+1+1+1+1 = 13$.

Do this automatically using the `sum()` function and the `values()` method of the counter.

In [None]:
context = "a staple"
# Compute the bigram count for "a staple" with sum().
bigram_count_a_staple = sum(trigram_counts[context].values())

print(
    'Bigram count of "a staple" computed indirectly from trigram counts: ',
    bigram_count_a_staple,
)

# Extract the bigram count for "a staple" from bigram_counts.
print('Bigram count of "a staple" computed directly: ',
      bigram_counts["a"]["staple"])
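
One caveat is worth checking on a toy example: the indirect count only includes occurrences of the bigram that are followed by another token. If a text ends with the bigram, the two counts differ. The sketch below uses a made-up two-text corpus and hypothetical variable names to show this.

```python
from collections import Counter

# Hypothetical corpus: the second text ends with the bigram "a staple".
texts = ["a staple food", "it is a staple"]

trigram_continuations = Counter()  # Tokens seen right after "a staple".
bigram_occurrences = 0             # Direct count of the bigram "a staple".

for text in texts:
    tokens = text.split(" ")
    for i in range(len(tokens) - 1):
        if tokens[i:i + 2] == ["a", "staple"]:
            bigram_occurrences += 1
            # Only count a continuation if a third token exists.
            if i + 2 < len(tokens):
                trigram_continuations[tokens[i + 2]] += 1

indirect_count = sum(trigram_continuations.values())
print("Indirect count from trigrams:", indirect_count)
print("Direct bigram count:", bigram_occurrences)
```

Using the sum of the continuation counts as the denominator has the advantage that the probabilities for each context add up to exactly 1.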

### Compute n-gram probabilities

The function below returns a dictionary of dictionaries: the keys are contexts of $n-1$ tokens, and each value is a dictionary giving the probability of each next token given that context.

>For example, if the dataset consists of the two sentences "Table Mountain is tall." and "Table Mountain is beautiful." then the function called with `n = 3` should return the following. The periods are part of the tokens because `create_tokens` splits on spaces only.
>```
>{
>   "Table Mountain": {"is": 1.0},
>   "Mountain is": {"tall.": 0.5, "beautiful.": 0.5}
>}


In [None]:
def build_ngram_model(dataset: list[str], n: int) -> dict[str, dict[str, float]]:
    """Builds an n-gram language model.
    Takes a list of text strings,
    generates n-grams from each text using get_ngram_counts,
    and converts them into probabilities.
    The resulting model is a dictionary,
    where keys are (n-1)-token contexts and values are dictionaries mapping
    possible next tokens to their conditional probabilities given the context.

    Args:
        dataset: A list of text strings representing the dataset.
        n: The size of the n-grams (e.g., 2 for a bigram model).

    Returns:
        A dictionary representing the n-gram language model, where keys are
        (n-1)-tokens contexts and values are dictionaries mapping possible next
        tokens to their conditional probabilities.
    """

    # A dictionary to store P(B | A).
    # ngram_model[context][token] should store P(token | context).
    ngram_model = {}

    # Use the ngram_counts as computed by the get_ngram_counts function.
    ngram_counts = get_ngram_counts(dataset, n)

    # Loop through the possible contexts. `context` is a string
    # and `next_tokens` is a dictionary mapping possible next tokens to their
    # counts of following `context`.
    for context, next_tokens in ngram_counts.items():

        # Compute Count(A) and P(B | A ) here.
        context_total_count = sum(next_tokens.values())
        ngram_model[context] = {}
        for token, count in next_tokens.items():
            ngram_model[context][token] = count / context_total_count

    return ngram_model

# Test the function above by building a simple trigram model.
test_dataset = ["Table Mountain is tall.", "Table Mountain is beautiful."]
test_trigram_model = build_ngram_model(test_dataset, n=3)
test_trigram_model

In [None]:
# @title Run this cell to test your implementation.
ngrams.test_build_ngram_model(build_ngram_model, get_ngram_counts)
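
As an additional sanity check, one could verify that the probabilities for every context sum to 1. The helper `is_normalized` below is a hypothetical sketch, not part of the `ai_foundations` package, and the toy model is written out by hand.

```python
import math


def is_normalized(
    ngram_model: dict[str, dict[str, float]], tolerance: float = 1e-9
) -> bool:
    """Checks that each context's next-token probabilities sum to 1."""
    return all(
        math.isclose(sum(distribution.values()), 1.0, abs_tol=tolerance)
        for distribution in ngram_model.values()
    )


# Hand-written toy models for illustration.
valid_model = {
    "Table Mountain": {"is": 1.0},
    "Mountain is": {"tall.": 0.5, "beautiful.": 0.5},
}
broken_model = {"Mountain is": {"tall.": 0.5, "beautiful.": 0.4}}

print(is_normalized(valid_model))
print(is_normalized(broken_model))
```

A tolerance is used because sums of floating-point fractions are not always exactly 1.0.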

## Now construct a trigram model!

In [None]:
trigram_model = build_ngram_model(dataset, n=3)

# To gain an understanding of the patterns that the model learned, inspect a few probability distributions.
print(f"P(B | \"as it\") = {trigram_model['as it']}")

print(f"P(B | \"as they\") = {trigram_model['as they']}")

Expected output
>```
> P(B | "as it") = {'is': 0.6666666666666666, 'receives': 0.3333333333333333}
> P(B | "as they") = {'were': 1.0}

### More possible contexts

In [None]:
context = "The name"
trigram_model[context]
# {'means': 0.6666666666666666, "'Etosha'": 0.3333333333333333}

context = "Their name"
trigram_model[context]
# Raises a KeyError because the context "Their name" never appears in the dataset.
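
To avoid the exception for unseen contexts, a lookup can fall back to an empty distribution. The following is a minimal sketch with a hand-written toy model; `get_distribution` is a hypothetical helper name.

```python
# Toy trigram model containing a single context (values are illustrative).
toy_trigram_model = {"The name": {"means": 2 / 3, "'Etosha'": 1 / 3}}


def get_distribution(
    ngram_model: dict[str, dict[str, float]], context: str
) -> dict[str, float]:
    """Returns the next-token distribution, or {} for an unseen context."""
    return ngram_model.get(context, {})


seen = get_distribution(toy_trigram_model, "The name")
unseen = get_distribution(toy_trigram_model, "Their name")
print(seen)
print(unseen)
```

An empty dictionary signals that the model has no prediction for the context, which the caller can test with a plain `if not unseen:` check.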

## Putting it to use!
Now that we know the probabilities, predict the next token!

In [None]:
# Example
# Manually define a list of tokens & use probability distribution to weight the results.
example_candidate_tokens = ["apple", "banana", "cherry"]

# Define corresponding probabilities for each fruit.
fruit_probabilities = [0.2, 0.5, 0.3]

# Sample one fruit based on the probabilities.
# The 'k=1' parameter instructs the function to return one item.
chosen_fruit = random.choices(
    example_candidate_tokens,
    weights=fruit_probabilities,
    k=1)[0]

print("Chosen fruit:", chosen_fruit)
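
To see that `random.choices` respects the weights, one can draw many samples and compare the observed frequencies with the probabilities. The sketch below uses a seeded `random.Random` instance so the run is repeatable; the sample size and seed are arbitrary choices.

```python
import random
from collections import Counter

rng = random.Random(0)  # Separate, seeded generator for repeatable results.

tokens = ["apple", "banana", "cherry"]
weights = [0.2, 0.5, 0.3]
num_samples = 10_000

# Draw all samples in one call by setting k.
samples = rng.choices(tokens, weights=weights, k=num_samples)
frequencies = {
    token: count / num_samples for token, count in Counter(samples).items()
}

for token, weight in zip(tokens, weights):
    print(f"{token}: weight {weight:.2f}, observed {frequencies[token]:.3f}")
```

The observed frequencies should land close to the weights, and they get closer as the number of samples grows.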

In [None]:
# The real task
context = "looking for"
candidate_tokens = []
candidate_tokens_probabilities = []

# Extract candidate tokens and associated probabilities from `trigram_model`.
for token, prob in trigram_model[context].items():
    candidate_tokens.append(token)
    candidate_tokens_probabilities.append(prob)

print(f"Candidate tokens: {candidate_tokens}")
print(f"Candidate token probabilities: {candidate_tokens_probabilities}")

# Sample from the list of candidate tokens according to the
# associated probabilities.
next_token = random.choices(candidate_tokens,
                            weights=candidate_tokens_probabilities)[0]

print("\n\nSampled next token:", context, next_token)

Sample output:

Candidate tokens: ['the', 'a', 'Banku', 'Tella,', 'Maafe,', 'Umqombothi,', 'sugarcane', 'crispy', 'warm', 'Doro', 'sambusa,', 'dodo,', 'Fura']

Candidate token probabilities: [0.125, 0.53125, 0.03125, 0.03125, 0.03125, 0.03125, 0.03125, 0.03125, 0.03125, 0.03125, 0.03125, 0.03125, 0.03125]


Sampled next token: looking for Banku

### Generate new texts for prompt using n-gram model.

Text generation using an n-gram model is an iterative process where each newly generated token is added to the existing context. This forms the basis for predicting the next token.

Starting with an initial prompt text, the model uses the probability distribution derived from the n-gram counts to sample the next token, so likely tokens are picked more often but not always. This again makes use of the `random.choices` function. Once this token has been generated, it is added to the context and the updated sequence is used to look up the next probability distribution. This chain-like process continues until `num_tokens_to_generate` tokens have been generated or the model reaches a context it has never seen.

The following `generate_next_n_tokens` function implements this iterative generation process:

In [None]:
def generate_next_n_tokens(
    n: int,
    ngram_model: dict[str, dict[str, float]],
    prompt: str,
    num_tokens_to_generate: int,
) -> str:
    """Generates `num_tokens_to_generate` tokens for a prompt using
    an n-gram model.

    Uses the n-gram model to sample the next token for the prompt.
    The generation process continues, appending predicted tokens to prompt
    until the desired number of tokens is generated or a context is
    encountered for which the model has no predictions.

    Args:
        n: The size of the n-grams.
        ngram_model: A dictionary representing the n-gram model.
        prompt: Starting text prompt.
        num_tokens_to_generate: The number of words to generate after prompt.

    Returns:
        A string containing the original prompt followed by the generated
        tokens. If no valid continuation is found for a given context, the
        function will return the text generated up to that point and print a
        message indicating that no continuation could be found.
    """

    # Split prompt into individual tokens.
    generated_words = create_tokens(prompt)

    for _ in range(num_tokens_to_generate):
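        # Get last (n-1) tokens as context. For n=1 the context is empty;
        # a plain `[-(n - 1):]` slice would be `[-0:]`, i.e. the whole list.
        context_tokens = generated_words[-(n - 1):] if n > 1 else []
        context = " ".join(context_tokens)
        if context in ngram_model:
            # Sample next word based on probabilities.
            candidates = ngram_model[context]
            next_word = random.choices(
                list(candidates.keys()),
                weights=list(candidates.values())
            )[0]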

            generated_words.append(next_word)
        else:
            print(
                "⚠️ No valid continuation found. Change the prompt or"
                " try sampling another continuation.\n"
            )
            break

    return " ".join(generated_words)
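
For contrast with sampling, the sketch below always picks the single most probable next token (often called greedy decoding). It is an illustrative variant, not the lab's method: `generate_greedy` and the toy bigram model are hypothetical, and ties are broken in favor of the token stored first.

```python
def generate_greedy(
    n: int,
    ngram_model: dict[str, dict[str, float]],
    prompt: str,
    num_tokens_to_generate: int,
) -> str:
    """Extends the prompt by repeatedly taking the most probable next token."""
    tokens = prompt.split(" ")
    for _ in range(num_tokens_to_generate):
        # Use the last (n-1) tokens as context (empty context for n=1).
        start = max(0, len(tokens) - (n - 1))
        context = " ".join(tokens[start:])
        distribution = ngram_model.get(context)
        if not distribution:
            break  # Unseen context: stop early.
        tokens.append(max(distribution, key=distribution.get))
    return " ".join(tokens)


# Hand-written toy bigram model (probabilities are made up).
toy_bigram_model = {
    "looking": {"for": 1.0},
    "for": {"a": 0.6, "the": 0.4},
    "a": {"meal": 0.7, "snack": 0.3},
}
greedy_text = generate_greedy(2, toy_bigram_model, "went looking", 5)
print(greedy_text)
```

Greedy decoding is deterministic, so the same prompt always yields the same text, whereas sampling with `random.choices` can produce a different continuation on each run.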

### Bigram

Example output (sampling is random, so your result will differ). The text starts out sensibly but quickly loses coherence.
```
Jide was hungry so she went looking for hours, would fill with ground and they pounded, Nana Yaa,
```

In [None]:
prompt = "Jide was hungry so she went looking for"

# Construct a bigram model using the Africa Galore dataset.
bigram_model = build_ngram_model(dataset, n=2)

n = 2  # Bigram.
num_tokens_to_generate = 10  # Number of tokens to generate.
generate_next_n_tokens(
    n=n,
    ngram_model=bigram_model,
    prompt=prompt,
    num_tokens_to_generate=num_tokens_to_generate,
)

### Trigram

Example output (again, your result will differ). With a longer context, the text takes longer to lose coherence.
```
Jide was hungry so she went looking for Umqombothi, a traditional Malian couscous dish made from the Tswana
```

In [None]:
prompt = "Jide was hungry so she went looking for"

n = 3  # Trigram.
num_tokens_to_generate = 10  # Number of tokens to generate.
generate_next_n_tokens(
    n=n,
    ngram_model=trigram_model,
    prompt=prompt,
    num_tokens_to_generate=num_tokens_to_generate,
)

## Much larger N-grams

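As **n** increases, the number of mathematically possible combinations increases exponentially: a vocabulary of $V$ distinct tokens allows $V^n$ different n-grams. The number of n-grams that actually occur grows far more slowly, since a text of $T$ tokens contains at most $T - n + 1$ of them. For example, the 8 distinct words of "I went to the movies with my mom" allow
- $8^2 = 64$ possible bigrams, of which only 7 occur in the sentence
- $8^3 = 512$ possible trigrams, of which only 6 occur in the sentence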

However, many combinations never occur in practice. In real text, the probability of each of the following is effectively **0**:
- "movies the"
- "to mom"
- "to went I"
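
The gap between possible and observed n-grams can be counted directly for the example sentence. This sketch assumes that "possible" means every ordered sequence of the sentence's distinct words with repeats allowed; other counting conventions give different totals.

```python
sentence = "I went to the movies with my mom"
tokens = sentence.split(" ")
vocabulary = set(tokens)

results = {}
for n in (2, 3):
    # Every ordered sequence of n vocabulary words, repeats allowed.
    possible = len(vocabulary) ** n
    # Distinct n-grams that actually occur in the sentence.
    observed = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    results[n] = (possible, len(observed))
    print(f"n={n}: {possible} possible, {len(observed)} observed")
```

Under this assumption, the share of possible n-grams that are actually observed shrinks quickly as `n` grows.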
