# Lab: Distinguish Between Signal and Noise
## Purpose:
- Understand the effect of training a model for too few or too many epochs.
- Gain an intuition for what it means for a model to underfit or overfit the patterns in a dataset.

### Topics:
- Signal
- Noise
- Overtraining

### Steps
* Compare how models trained for 10, 400, and 1,000 epochs continue different prompts.

Date: 2026-02-21

Source: https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_3/gdm_lab_3_1_distinguish_between_signal_and_noise.ipynb

References: https://github.com/google-deepmind/ai-foundations
- GDM GH repo used in AI training courses at the university & college level.

In [None]:
%%capture
# Install the custom package for this course.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

# Packages used.
from urllib import request # For downloading model parameters.
from IPython import display # For improving the output of some cells.

# Configure Keras to use the JAX backend.
import os
os.environ["KERAS_BACKEND"] = "jax"

from ai_foundations import training # For loading pre-trained models.
from ai_foundations import generation # For prompting the model.
from ai_foundations import tokenization # For loading the tokenizer.

BPEWordTokenizer = tokenization.BPEWordTokenizer

### Load tokenizer and model parameters
This section loads a pretrained tokenizer and downloads the parameters for three small language models (SLMs):
- An overtrained SLM
- An undertrained SLM
- A well-trained SLM

In [None]:
# Load the tokenizer.
tokenizer_url = "https://storage.googleapis.com/dm-educational/assets/ai_foundations/bpe_tokenizer_3000_v2.pkl"
tokenizer = BPEWordTokenizer.from_url(tokenizer_url)

# Map each local weights file name to the URL of its parameters.
MODEL_PARAMETER_URLS = {
    "africa_galore_10ep_underfit.weights.h5": "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore_10ep_underfit.weights.h5",
    "africa_galore_400ep_good_fit.weights.h5": "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore_400ep_good_fit.weights.h5",
    "africa_galore_1000ep_overfit.weights.h5": "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore_1000ep_overfit.weights.h5"
}

# Download the model parameters.
for (parameter_file, parameter_url) in MODEL_PARAMETER_URLS.items():
    request.urlretrieve(parameter_url, parameter_file)
print("Loaded model parameters.")

# Define the model. In each of the following prompting cells, the model's parameters
# will be set to one of the three models. The configuration of this model must
# match the configuration of the training run.
model = training.create_model(
    max_length=399,
    vocabulary_size=tokenizer.vocabulary_size,
    learning_rate=1e-4,
    embedding_dim=64,
    mlp_dim=64,
    num_blocks=3
)
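Each prompting section below first sets the model's parameters to one of the three downloaded files. A minimal sketch of that step follows. It assumes `model` is a Keras model, which exposes `load_weights`; the helper name `load_checkpoint` and the `WEIGHT_FILES` lookup are illustrative and not part of the course package.

```python
# File names as downloaded above, keyed by number of training epochs.
WEIGHT_FILES = {
    10: "africa_galore_10ep_underfit.weights.h5",
    400: "africa_galore_400ep_good_fit.weights.h5",
    1000: "africa_galore_1000ep_overfit.weights.h5",
}


def load_checkpoint(model, epochs):
    """Set `model`'s parameters to the checkpoint trained for `epochs` epochs.

    `load_checkpoint` is a hypothetical helper. It only assumes that `model`
    has a Keras-style `load_weights(filepath)` method.
    """
    if epochs not in WEIGHT_FILES:
        raise ValueError(
            f"No checkpoint for {epochs} epochs; choose from {sorted(WEIGHT_FILES)}."
        )
    weights_file = WEIGHT_FILES[epochs]
    model.load_weights(weights_file)
    return weights_file
```

For example, `load_checkpoint(model, 1000)` would be called before prompting the 1,000-epoch model.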

## Prompt the 1,000 epoch model

First, prompt the model that was trained for the most epochs: 1,000 passes through the training data. During training, its predictions were compared to each target token 1,000 times, and its parameters were updated after each comparison. This resulted in a very low loss of 0.37.

In the following cell, prompt the model with the following two prompts:
* "They are serving as a symbol"
* "They are serving as a vibrant symbol"

Since we know that "vibrant symbol" is always followed by "of" in the training data, we can expect this model to continue that pattern.
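One way to see why a model might latch onto this pattern is to count what follows the phrase in the training text. The sketch below does this on a tiny made-up corpus. The sentences are invented for illustration and are not taken from the course dataset, and the code splits on whitespace rather than using the lab's BPE tokenizer.

```python
from collections import Counter


def next_word_counts(corpus, phrase):
    """Count which word follows each occurrence of `phrase` in `corpus`."""
    phrase_words = phrase.lower().split()
    n = len(phrase_words)
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        # Slide a window of n words and record the word right after a match.
        for i in range(len(words) - n):
            if words[i:i + n] == phrase_words:
                counts[words[i + n]] += 1
    return counts


# Hypothetical sentences, invented for this example.
toy_corpus = [
    "The market is a vibrant symbol of the city",
    "The festival is a vibrant symbol of unity",
    "The drum is a symbol of celebration",
    "The mask is a symbol for the harvest",
]

print(next_word_counts(toy_corpus, "vibrant symbol"))
print(next_word_counts(toy_corpus, "symbol"))
```

In this toy corpus the longer phrase has a single continuation, while "symbol" on its own has two, so only the longer phrase gives a memorizing model one fixed answer.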

## Prompt the 10 epoch model

The 10-epoch model had a loss of 7.55 because its parameters were updated far fewer times than those of the 1,000-epoch model.
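To put 7.55 in context, it can be compared with the loss of a model that guesses uniformly over the vocabulary. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats and that the vocabulary has 3,000 tokens. The vocabulary size is inferred from the tokenizer file name (`bpe_tokenizer_3000_v2.pkl`) and is not confirmed by the lab; in the notebook, `tokenizer.vocabulary_size` would be the value to use.

```python
import math

# Assumption: vocabulary size inferred from the tokenizer file name.
assumed_vocabulary_size = 3000
loss_10_epochs = 7.55  # Loss reported for the 10-epoch model.

# A model that assigns probability 1/V to every token has cross-entropy ln(V).
uniform_loss = math.log(assumed_vocabulary_size)

print(f"Uniform-guess loss: {uniform_loss:.2f}")
print(f"10-epoch loss:      {loss_10_epochs:.2f}")
print(f"Improvement over uniform guessing: {uniform_loss - loss_10_epochs:.2f} nats")
```

Under these assumptions the uniform-guess loss is roughly 8.0, so the 10-epoch model is only slightly better than picking tokens at random.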

In the following cell, prompt the model again with the following two prompts:
* "They are serving as a symbol"
* "They are serving as a vibrant symbol"

With a loss this high, I expect this model to produce gibberish.

## Prompt the 400 epoch model

The 400-epoch model had a loss of 2.77.
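A loss value can be easier to interpret as a perplexity: the exponential of the mean cross-entropy, which reads roughly as the number of equally likely tokens the model is choosing between at each step. A small sketch, assuming the three losses reported in this lab are mean per-token cross-entropies in nats:

```python
import math

# Losses reported in this lab, keyed by number of training epochs.
reported_losses = {10: 7.55, 400: 2.77, 1000: 0.37}

# Perplexity = exp(mean cross-entropy in nats).
perplexities = {epochs: math.exp(loss) for epochs, loss in reported_losses.items()}

for epochs, perplexity in perplexities.items():
    loss = reported_losses[epochs]
    print(f"{epochs:>5} epochs: loss {loss:.2f} -> perplexity {perplexity:,.1f}")
```

A perplexity close to 1 means the model is nearly certain of each next token on the data it was evaluated on, which is consistent with memorization when that data is the training set.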

In the following cell, prompt the model again with the following two prompts:
* "They are serving as a symbol"
* "They are serving as a vibrant symbol"

Hopefully, this one uses "of."
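The three prompting sections repeat the same two prompts, so the comparison can be collected in one place. The sketch below is a hypothetical harness: `compare_continuations` and the callables passed to it are illustrative names, and each callable stands in for whatever generation call the course package provides (its exact signature is not shown in these notes).

```python
PROMPTS = [
    "They are serving as a symbol",
    "They are serving as a vibrant symbol",
]


def compare_continuations(generators, prompts=PROMPTS):
    """Collect one continuation per (model label, prompt) pair.

    `generators` maps a label such as "400 epochs" to a callable that takes a
    prompt string and returns the generated continuation as a string. These
    callables are hypothetical stand-ins for the course's generation call.
    """
    results = {}
    for label, generate_fn in generators.items():
        for prompt in prompts:
            continuation = generate_fn(prompt)
            results[(label, prompt)] = continuation
            # Check whether the first generated word is exactly "of".
            words = continuation.split()
            starts_with_of = bool(words) and words[0].lower() == "of"
            print(f"[{label}] {prompt!r} -> {continuation!r} (first word is 'of': {starts_with_of})")
    return results
```

In the notebook, each callable would load the matching weights file into `model` and then generate from the prompt.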