> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations Github README file</a>.

<img src="https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C1-white-bg.png">

# Lab: Train Your Own Small Language Model

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_1/gdm_lab_1_5_train_your_own_small_language_model.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>     

40 minutes

Train a transformer language model on the Africa Galore dataset.


## Overview

In this lab, you will reuse the data pre-processing steps that you implemented in the previous lab and prepare the data for training a transformer model. You will then train your own small language model on the Africa Galore dataset and explore its predictions. The model is referred to as a small language model because it has comparatively few parameters (around 3.5 million instead of the approximately 1 billion of Gemma-1B) and is trained on the small Africa Galore dataset.



### What you will learn:

By the end of this lab, you will know:

* How to prepare a text dataset to be used for training a transformer model with Keras.
* How to train and evaluate a small language model (SLM).


### Tasks

You will use an implementation of the transformer model written using [Keras](https://keras.io/). Keras is an open source deep learning framework that allows you to define neural network architectures and train models using these architectures. You will learn more about how to define models yourself in later courses. For now, you will use existing code to define the model and perform its training.


**In this lab, you will**:
* Load the dataset, tokenize it, and convert it to token IDs.
* Pad the dataset such that all sequences have the same length.
* Shuffle the examples in the dataset and group them into batches.
* Transform the data into model inputs and model targets.
* Train the transformer model.

Note that this is quite a long lab since there are many steps that you have to go through for training a transformer language model and the training itself takes some time. If you are able to, we *highly recommend* running the code in this lab on a Colab instance with a GPU. See the section "How to use Google Colaboratory (Colab)" below for instructions on how to do this.



## How to use Google Colaboratory (Colab)

Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in **cells** that are executed on a remote server.

To run a cell, hover over a cell, and click on the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can also click on a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [2]:
from datetime import datetime
print(f"Today is {datetime.today():%A}.")

Today is Sunday.


Note that the *order in which you run the cells matters*. When you are working through a lab, make sure to always run *all* cells in order. Otherwise, the code might not work. If you take a break while working on a lab, Colab may disconnect you and in that case, you have to execute all cells again before  continuing your work. To make this easier, you can select the cell you are currently working on and then choose __Runtime → Run before__  from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

### Using Colab with a GPU

A **GPU** is a special type of hardware that can significantly speed up some types of computations of machine learning models. Several of the activities in this lab will also run a lot faster if you run them on a GPU.

Follow these steps to run the activities in this lab on a GPU:

1.  In the top menu bar, click on **Runtime**.
2.  Select **Change runtime type** from the dropdown menu.
3.  In the pop-up window under **Hardware Accelerator**, select **GPU** (usually listed as `T4 GPU`).
4.  Click **Save**.

Your Colab session will now restart with GPU access.

Note that access to GPUs is limited and at times, you may not be able to run this lab on a GPU. All activities will still work but they will run slower and you will have to wait longer for some of the cells to finish running.


## Imports



In this lab, you will make use of the Keras package for defining and training the transformer model, the [Pandas](http://pandas.pydata.org) package for reading the dataset, and the [TensorFlow](https://tensorflow.org) package for shuffling the data and grouping it into batches. Run the following cell to import these packages.

In [3]:
%%capture
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

# Packages used.
import os # Used for setting Keras configuration variables.
os.environ["KERAS_BACKEND"] = "jax" # Set a parameter for Keras.
import re # Used for splitting text on whitespace.

import keras # Used for defining and training the model.
import pandas as pd # Used for loading the dataset.
import tensorflow as tf # Used for shuffling the dataset.

# Used for displaying nicer error messages.
from IPython.display import display, HTML
from ai_foundations import training # For training your model.
from ai_foundations import generation # For prompting your model.
from ai_foundations import visualizations # For visualizing probabilities.
from ai_foundations.feedback.course_1 import slm # For providing feedback.

# The following line provides configuration for Keras.
keras.utils.set_random_seed(812)  # For Keras layers.

## Loading and tokenizing the dataset

Load the dataset. As in previous labs, you will use the [Africa Galore](https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json) dataset to train the model.

In [4]:
africa_galore = pd.read_json(
    "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json"
)
dataset = africa_galore["description"].values
print("Loaded dataset with", dataset.shape[0], "paragraphs.")

Loaded dataset with 232 paragraphs.


### Tokenization

The following cell contains the  `SimpleWordTokenizer` class that you have encountered in the previous lab. You will use this class again to tokenize the dataset, prepare the vocabulary, and provide methods for translating tokens into token IDs and vice versa. Note that this version also adds special `<PAD>` and `<UNK>` tokens to the vocabulary. You will learn more about the purpose of these special tokens as part of this lab.

Run the following cell to define the `SimpleWordTokenizer`, tokenize the Africa Galore dataset, and translate its tokens to IDs.

In [5]:
class SimpleWordTokenizer:
    """A simple word tokenizer.

    The tokenizer splits the text sequence based on whitespace, using the
    `encode` method to convert the text into a sequence of indices and the
    `decode` method to convert indices back into text.

    The tokenizer can be initialized with a corpus or with a provided
    vocabulary list.

    Typical usage example:

        corpus = "Hello there!"
        tokenizer = SimpleWordTokenizer(corpus)
        print(tokenizer.encode('Hello'))

    """

    # Define constants.
    UNKNOWN_TOKEN = "<UNK>"
    PAD_TOKEN = "<PAD>"

    def __init__(self, corpus: list[str], vocabulary: list[str] | None = None):
        """Initializes the tokenizer with texts in corpus or with a vocabulary.

        Args:
          corpus: Input text dataset.
          vocabulary: A pre-defined vocabulary. If None,
              the vocabulary is automatically inferred from the texts.
        """

        if vocabulary is None:
            # Build the vocabulary from scratch.
            if isinstance(corpus, str):
                corpus = [corpus]

            # Convert text sequence to tokens.
            tokens = []
            for text in corpus:
                for token in self.space_tokenize(text):
                    tokens.append(token)

            # Create a vocabulary comprising the unique tokens.
            vocabulary = self.build_vocabulary(tokens)

            # Add special unknown and pad tokens to the vocabulary list.
            self.vocabulary = (
                [self.PAD_TOKEN] + vocabulary + [self.UNKNOWN_TOKEN]
            )

        else:
            self.vocabulary = vocabulary

        # Size of vocabulary.
        self.vocabulary_size = len(self.vocabulary)

        # Create token-to-index and index-to-token mappings.
        self.token_to_index = {}
        self.index_to_token = {}
        # Loop through all tokens in the vocabulary. enumerate automatically
        # assigns a unique index to each token.
        for index, token in enumerate(self.vocabulary):
            self.token_to_index[token] = index
            self.index_to_token[index] = token

        # Map the special tokens to their IDs.
        self.pad_token_id = self.token_to_index[self.PAD_TOKEN]
        self.unknown_token_id = self.token_to_index[self.UNKNOWN_TOKEN]

    def space_tokenize(self, text: str) -> list[str]:
        """Splits a given text on whitespace into tokens.

        Args:
            text: Text to split on whitespace.

        Returns:
            List of tokens after splitting `text`.
        """

        # Use re.split such that multiple spaces are treated as a single
        # separator.
        return re.split(" +", text)

    def join_text(self, text_list: list[str]) -> str:
        """Combines a list of tokens into a single string.

        The combined tokens, as a single string, are separated by spaces in the
        string.

        Args:
            text_list: List of tokens to be joined.

        Returns:
            String with all tokens joined with a whitespace.

        """
        return " ".join(text_list)

    def build_vocabulary(self, tokens: list[str]) -> list[str]:
        """Create a vocabulary list from the list of tokens.

        Args:
            tokens: The list of tokens in the dataset.

        Returns:
            List of unique tokens (vocabulary) in the dataset.
        """
        return sorted(list(set(tokens)))

    def encode(self, text: str) -> list[int]:
        """Encodes a text sequence into a list of indices.

        Args:
            text: The input text to be encoded.

        Returns:
            A list of indices corresponding to the tokens in the input text.
        """

        # Convert tokens into indices.
        indices = []
        unk_index = self.token_to_index[self.UNKNOWN_TOKEN]
        for token in self.space_tokenize(text):
            token_index = self.token_to_index.get(token, unk_index)
            indices.append(token_index)

        return indices

    def decode(self, indices: int | list[int]) -> str:
        """Decodes a list (or single index) of integers back into tokens.

        Args:
            indices: A single index or a list of indices to be
                decoded into tokens.

        Returns:
            A string of decoded tokens corresponding to the input indices.
        """

        # If a single integer is passed, convert it into a list.
        if isinstance(indices, int):
            indices = [indices]

        # Map indices to tokens. Indices that are not in the vocabulary are
        # decoded as the unknown token.
        tokens = []
        for index in indices:
            token = self.index_to_token.get(index, self.UNKNOWN_TOKEN)
            tokens.append(token)

        # Join the decoded tokens into a single string.
        return self.join_text(tokens)


# Initialize the tokenizer. This will build the tokenizer's vocabulary with
# all the tokens that appear in the dataset.
tokenizer = SimpleWordTokenizer(dataset)

# Translate all tokens to their corresponding IDs.
encoded_tokens = []
for text in dataset:
    # Split text into tokens and translate the tokens to token IDs.
    token_ids = tokenizer.encode(text)
    encoded_tokens.append(token_ids)

To verify that this process was successful, inspect the first ten token IDs in the first example.

In [6]:
encoded_tokens[0][:10]

[814, 511, 985, 5092, 4802, 5183, 2800, 1363, 4792, 2134]
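To make the role of the `<UNK>` token more concrete, here is a minimal sketch that uses a tiny, made-up vocabulary rather than the real Africa Galore vocabulary. The names `toy_vocabulary` and `toy_encode` are hypothetical and exist only for this illustration. The sketch splits on any whitespace with `str.split`, which is a simplification of the `re.split` call used in `SimpleWordTokenizer`.

```python
# A made-up vocabulary with the special tokens at the start and the end,
# mirroring the layout used by SimpleWordTokenizer.
toy_vocabulary = ["<PAD>", "Lagos", "The", "air", "<UNK>"]
toy_token_to_index = {
    token: index for index, token in enumerate(toy_vocabulary)
}
toy_unknown_id = toy_token_to_index["<UNK>"]


def toy_encode(text):
    """Maps each token to its ID and unseen tokens to the <UNK> ID."""
    return [
        toy_token_to_index.get(token, toy_unknown_id)
        for token in text.split()
    ]


# "sky" is not in the toy vocabulary, so it is mapped to the <UNK> ID.
toy_ids = toy_encode("The Lagos sky")
toy_decoded = " ".join(toy_vocabulary[index] for index in toy_ids)

print(toy_ids)      # [2, 1, 4]
print(toy_decoded)  # The Lagos <UNK>
```

Note that decoding does not recover the original word: once a token has been mapped to `<UNK>`, the information about which word it was is lost.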

## Padding the dataset


------
> **ℹ️ Info: Padding and truncating**
>
>The input to transformer models (or deep learning models more generally) is a **matrix** where each row corresponds to the data for one example in the dataset. In the case of the language model you will be training, each paragraph from the Africa Galore dataset constitutes an example. The input should therefore be a matrix that has the IDs of every token in a paragraph. In this matrix, the first entry of a row should be the ID of the first token, the second entry should be the ID of the second token, the third entry should be the ID of the third token, and so on.
>
>However, the paragraphs in a dataset rarely all have exactly the same length. This causes a problem when you try to combine the data of multiple paragraphs into a matrix, since every row in a matrix must have the same number of entries.
>
>There are two common solutions to this problem:
>1. You can use a special `<PAD>` token to ensure that all sequences have the same length. This way, you can pad shorter paragraphs to match the length of the longest paragraph. This is done by adding `<PAD>` tokens at the beginning or the end of the paragraph. This results in all paragraphs having exactly the same length so that they can be combined in one matrix.
>2. Another option is to truncate paragraphs. That is, you remove tokens at the beginning or the end of each paragraph so that all paragraphs have the length of the shortest paragraph. This, however, may remove a lot of information from the dataset. For example, if the shortest paragraph has only five tokens, then you would shorten every paragraph to five tokens and remove almost all tokens.
>
> It is also possible to combine both of these methods so that you choose a target length. That way, very long paragraphs that exceed this length are truncated and short paragraphs whose length is below the target length are padded.
>
>The combination of truncating and padding is what is usually done in practice. You will implement this in the next activity to prepare the data for training the model.
------
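The combination of truncating and padding described above can be sketched in a few lines of plain Python. This is an illustration only: `pad_or_truncate` is a hypothetical helper, it assumes a padding ID of 0, and it pads and truncates at the end of each sequence. It is not the Keras implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Returns a copy of token_ids with exactly max_length entries."""
    # Truncate: keep at most the first max_length token IDs.
    truncated = token_ids[:max_length]
    # Pad: append pad_id until the sequence has max_length entries.
    return truncated + [pad_id] * (max_length - len(truncated))


short_paragraph = [5, 6, 7]
long_paragraph = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_paragraph, max_length=5))  # [5, 6, 7, 0, 0]
print(pad_or_truncate(long_paragraph, max_length=5))   # [1, 2, 3, 4, 5]
```

After this step, both sequences have five entries and could be stacked as two rows of the same matrix.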


### Coding Activity 1: Compute length statistics

To get a sense of what the dataset looks like and how much padding is needed, compute some statistics of the paragraph lengths in the dataset.

First, look at the length of the first paragraph:



In [7]:
print(f"Length of first paragraph: {len(encoded_tokens[0]):,}")

Length of first paragraph: 118


------
> 💻 **Your task:**
>
> Complete the following cell to compute the length of the shortest paragraph and the length of the longest paragraph.
>
> There are multiple ways you can go about this. For example, you could write a loop that goes through all paragraphs in `encoded_tokens`. You could update variables for the shortest and longest paragraph length whenever you encounter a shorter or longer paragraph than previously seen.
>
> Alternatively, you can use the `min` and `max` functions in combination with the `len` function in Python. For example, if you have a list of lists `list_of_lists`, then
>`min(list_of_lists, key=len)` returns the shortest list in `list_of_lists` (or one of them if several lists share the shortest length).
------

In [15]:
# Add your code to compute the length of the shortest paragraph here.
shortest_paragraph_length = min([len(encoded_para) for encoded_para in encoded_tokens])

# Add your code to compute the length of the longest paragraph here.
longest_paragraph_length = max([len(encoded_para) for encoded_para in encoded_tokens])

print(f"Length of the shortest paragraph is: {shortest_paragraph_length}")
print(f"Length of the longest paragraph is: {longest_paragraph_length}")

Length of the shortest paragraph is: 26
Length of the longest paragraph is: 318


In [16]:
# @title Run this cell to test your code

slm.test_max_min_seqlen(
    shortest_paragraph_length, longest_paragraph_length, encoded_tokens
)

✅ Nice! Your answer looks correct.


You can now use this information to set the target length (`max_length`) for padding and truncating the paragraphs in your dataset. The cell below does this behind the scenes using the [`keras.preprocessing.sequence.pad_sequences`](https://github.com/keras-team/keras/blob/v3.10.0/keras/src/utils/sequence_utils.py#L12) function from the Keras package.

Change the value below to different values, and observe how the list of token IDs for the first paragraph changes. What happens when you set `max_length` to a very small value? What happens when you set it to the length of the longest paragraph?



In [17]:
# @title Set `max_length` for padding and truncating data.

max_length = 300  # @param {type: "number"}

if max_length <= 0:
    display(
        HTML(
            f"<h3>Error:</h3><p>Max length must be greater than 0. Please"
            f" increase <code>max_length</code>.</p><p></p>"
        )
    )

elif max_length > longest_paragraph_length:
    display(
        HTML(
            f"<h3>Error:</h3><p>The padding token <code>"
            f" {tokenizer.pad_token_id}</code> will be added to all"
            f" sequences - you probably don't want that. Please reduce"
            f" <code>max_length</code>.</p><p></p>"
        )
    )

else:
    if max_length < longest_paragraph_length:
        display(
            HTML(
                f"<p><strong>Note:</strong> The longest paragraph has"
                f" {longest_paragraph_length} tokens,"
                f" but <code>max_length</code> is set to {max_length}."
                f" Paragraphs longer than <code>max_length</code> will be"
                " truncated.</p><p></p>"
            )
        )

    padded_sequences = keras.preprocessing.sequence.pad_sequences(
        encoded_tokens,
        maxlen=max_length,
        padding="post",
        truncating="post",
        value=tokenizer.pad_token_id,
    )

    print("New length of first paragraph:", len(padded_sequences[0]), "\n")

    print(
        "Padding makes the length of all sequences the same as the specified"
        " `max_length`."
    )

    print(
        f"Notice the padded token IDs {tokenizer.pad_token_id} appearing at"
        " the end of the sequence.\n"
    )
    print("Padded tokens of first paragraph:\n", padded_sequences[0])

New length of first paragraph: 300 

Padding makes the length of all sequences the same as the specified `max_length`.
Notice the padded token IDs 0 appearing at the end of the sequence.

Padded tokens of first paragraph:
 [ 814  511  985 5092 4802 5183 2800 1363 4792 2134 2856 4792 1584 5092
 2088  814 1134 3043 2922  912 2821  170 2623 4792 2023 3807 3576  912
 1653 3772 4792 2775 1244  912 4409 3280 1030 4792 1158 3049 1992  912
 1868 2486 2437  135 5189 3422  445 3388 2078 4849 4792 3407 2706 1259
 4692 2856 4839 5183 4792 4078  814 3406 4259 4849 2389 4831 2707  912
 3821 1829 3522 2134 1030 2955  185 1076 2707 3683 5143 1849 4343 1030
 1546 1446 4983 2856 4792 2876 4078  814 3406 5092 3366 4788 2968 2151
 2938 5092  912 1450 3522 3101  912 1672 4849 4793 4295 2721  912 5036
 2224 3522 4792 4437 3522  513    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    

## Prepare input and target

Recall that the task of a language model is to predict the next token given a context of previous tokens. In the case of n-gram models, you could "teach" the model to do this by counting n-grams in the corpus and directly computing the probabilities from the n-gram counts.

Transformers have a more involved training procedure. They repeatedly make guesses of what the next token should be. If they get this guess wrong, the training procedure updates the model parameters. That way, the model is more likely to make a correct guess next time.

For this training procedure, you have to prepare the data such that you have a separate *input* and *target* dataset:

* **Input**: The input is a sequence of tokens that is passed into the transformer model. This may be a part of a paragraph, a full paragraph, or even multiple paragraphs, depending on how the data is structured. The input will contain everything but the last token since there is no next token for the last token.
  
* **Target**: The target sequence is what you want the model to predict from the input. The target is the same as the input sequence, but *shifted left* by one token. This means that each position in the target contains the token that should follow the corresponding position in the input. The target sequence contains everything but the first token. That is because the transformer always needs at least one token as input, so its first prediction is the second token of the paragraph.

For example, if your dataset consists of the sentence "Table Mountain is beautiful," the corresponding input and target sequences would look as follows:
* Input: `["Table", "Mountain", "is"]` (last token removed).
* Target: `["Mountain", "is", "beautiful"]` (shifted by one token).

As mentioned above, the input sequence and the target sequence will actually be the corresponding token IDs instead of the raw tokens. The raw tokens are included here to make the example more readable.
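The "Table Mountain" example can be reproduced with list slicing. The sketch below uses raw tokens for readability and hypothetical variable names. The loop at the end assumes the usual setup for next-token prediction, in which the model makes one prediction per input position and only sees the tokens up to that position.

```python
tokens = ["Table", "Mountain", "is", "beautiful"]

# Input: everything but the last token.
input_tokens = tokens[:-1]
# Target: everything but the first token (shifted left by one).
target_tokens = tokens[1:]

print(input_tokens)   # ['Table', 'Mountain', 'is']
print(target_tokens)  # ['Mountain', 'is', 'beautiful']

# Pair each context with the token that should be predicted from it.
context_target_pairs = []
for position, target in enumerate(target_tokens, start=1):
    context = input_tokens[:position]
    context_target_pairs.append((context, target))
    print(context, "->", target)
```

The input and target always have the same length, which is one token shorter than the original sequence.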

Run the following cell to prepare the input and target sequence for training the transformer model.

In [18]:
# Prepare input and target for the transformer model.
# For each example, extract all tokens except the last one.
input_sequences = padded_sequences[:, :-1]
# For each example, extract all tokens except the first one.
target_sequences = padded_sequences[:, 1:]

To check that the input sequence and the target sequence have been properly prepared, print the first ten tokens in the input and the target sequence.

In [19]:
print("First 10 token IDs in first input sequence:", input_sequences[0, :10])
print(
    "First 10 tokens in first input sequence:",
    tokenizer.decode(input_sequences[0, :10]),
)

print("\n")

print("First 10 token IDs in first target sequence:", target_sequences[0, :10])
print(
    "First 10 tokens in target sequence:",
    tokenizer.decode(target_sequences[0, :10])
)

First 10 token IDs in first input sequence: [ 814  511  985 5092 4802 5183 2800 1363 4792 2134]
First 10 tokens in first input sequence: The Lagos air was thick with humidity, but the energy


First 10 token IDs in first target sequence: [ 511  985 5092 4802 5183 2800 1363 4792 2134 2856]
First 10 tokens in target sequence: Lagos air was thick with humidity, but the energy in


You should see in the output above that the target sequence is the input sequence shifted by one token to the left.

When you set the maximum paragraph length `max_length` previously, you were considering all tokens, including the first and the last token of each paragraph. However, in `input_sequences`, you removed the last token from each paragraph, and in `target_sequences`, you removed the first token. So, the maximum length of the data in `input_sequences` and `target_sequences` is now one token shorter.

Run the following cell to update the `max_length` variable. This variable will be used as a parameter of the transformer model and needs to accurately reflect what the length of each (padded) paragraph in your input data is.

In [20]:
max_length = input_sequences.shape[1]

## Shuffle the dataset and specify the batch size


-------
> **ℹ️ Info: The purpose of shuffling and batches**
>
>The final step before you can train your small language model is to split the data into groups of a handful of paragraphs, called **batches**. Furthermore, there is often some order in your data. For example, in the Africa Galore dataset, all examples concerning food appear one after another. When training a model, however, it is generally best to include a very diverse set of paragraphs in one batch. This can be achieved by **shuffling** the data in the dataset so that all paragraphs appear in random order before splitting them up into batches. Note that the order of tokens within a paragraph must remain intact since you would end up with word puzzles otherwise.
>
>For splitting the dataset into batches, you need to define the **batch size**, that is, the number of paragraphs that should be included in one batch. Increasing the batch size usually speeds up training of the model and can also lead to better models. At the same time, however, larger batch sizes require more memory. If you set the batch size too large, you may get "out of memory" errors that indicate that you do not have enough memory available to train the model. You will learn more about methods for reducing memory usage in later courses.
>
>The figure below shows a dataset with seven paragraphs. Each paragraph is padded to `max_length`. In this case, it is set to the length of the longest paragraph. The dataset is then shuffled and split into batches of size 3. Note that the final batch only contains one paragraph, since the total number of paragraphs is 7 and not divisible by 3.
> <img src='https://storage.googleapis.com/dm-educational/assets/ai_foundations/evolve_graphic.png' width='1000'>
------
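The seven-paragraph example from the info box can be sketched in plain Python. This is only an illustration with placeholder strings instead of padded token IDs. The names are hypothetical, and the fixed seed is an assumption made so that the result is repeatable.

```python
import random

# Seven placeholder paragraphs, standing in for padded token ID sequences.
toy_paragraphs = [f"paragraph_{i}" for i in range(1, 8)]

# Shuffle the order of the paragraphs. The content of each paragraph is
# left untouched.
rng = random.Random(42)
toy_shuffled = list(toy_paragraphs)
rng.shuffle(toy_shuffled)

# Split the shuffled paragraphs into batches of size 3.
toy_batch_size = 3
toy_batches = [
    toy_shuffled[start:start + toy_batch_size]
    for start in range(0, len(toy_shuffled), toy_batch_size)
]

for toy_batch in toy_batches:
    print(len(toy_batch), toy_batch)
```

The batch sizes are 3, 3, and 1: the final batch is smaller because 7 is not divisible by 3.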


The cell below implements the shuffling of the dataset and splitting it into batches. The result is a sequence of batches. Each batch is a pair of matrices, referred to as **tensors**: one contains the input token IDs and the other contains the target token IDs for all paragraphs in that batch.

In [21]:
# Create TensorFlow dataset to prepare sequences.
tf_dataset = tf.data.Dataset.from_tensor_slices((input_sequences, target_sequences))

# Randomly shuffle the dataset.
# The buffer_size determines how many examples from the dataset
# are held in memory before shuffling.
# If you are working with a very large dataset,
# reduce the buffer_size as needed.
tf_dataset = tf_dataset.shuffle(buffer_size=len(input_sequences))

# Specify batch size.
batch_size = 32  # @param {type: "number"}

# Create batches.
batches = tf_dataset.batch(batch_size)

for batch in batches.take(1):
    print(batch)

(<tf.Tensor: shape=(32, 299), dtype=int32, numpy=
array([[ 719, 5092, 4815, ...,    0,    0,    0],
       [ 797,  597,  912, ...,    0,    0,    0],
       [ 470, 4084, 2932, ...,    0,    0,    0],
       ...,
       [ 814, 4079, 1171, ...,    0,    0,    0],
       [ 814, 3085, 2932, ...,    0,    0,    0],
       [ 358, 1605, 2935, ...,    0,    0,    0]], dtype=int32)>, <tf.Tensor: shape=(32, 299), dtype=int32, numpy=
array([[5092, 4815, 4403, ...,    0,    0,    0],
       [ 597,  912, 2364, ...,    0,    0,    0],
       [4084, 2932,  912, ...,    0,    0,    0],
       ...,
       [4079, 1171, 3522, ...,    0,    0,    0],
       [3085, 2932, 4792, ...,    0,    0,    0],
       [1605, 2935, 2968, ...,    0,    0,    0]], dtype=int32)>)


Run the following cell to count the total number of batches:

In [22]:
total_batches = 0
for batch in batches:
    total_batches += 1
print("Total number of batches is:", total_batches)

Total number of batches is: 8
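You can double-check this count by hand. The sketch below uses the 232 paragraphs reported when the dataset was loaded and the batch size of 32 set above; the variable names are only for illustration.

```python
import math

num_paragraphs = 232  # Reported when the dataset was loaded.
paragraphs_per_batch = 32  # The batch size set above.

# Every batch is full except possibly the last one, so round up.
expected_batches = math.ceil(num_paragraphs / paragraphs_per_batch)

# The last batch contains whatever is left after the full batches.
last_batch_size = (
    num_paragraphs - (expected_batches - 1) * paragraphs_per_batch
)

print(expected_batches)  # 8
print(last_batch_size)   # 8
```

Seven full batches cover 224 paragraphs, so the eighth batch holds the remaining 8.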


## Train a small language model (SLM)

You have now done all the preparatory work and are ready to train your small language model. As mentioned above, this model has around 3.5 million parameters. It is therefore a lot smaller than so-called large language models that are used in production. For example, the Google Gemini model has billions of parameters and was trained on a much bigger dataset than the Africa Galore dataset.

The size of the transformer model and the amount of training data have a strong impact on its performance. Larger models with more parameters have the capacity to learn more complex patterns and deliver better accuracy. However, they also require more computational resources, memory, and processing power. This can lead to higher costs and longer training times, that is, the time the model needs to keep updating its parameters before it reaches its best performance. You would not be able to train a very large model in a Colab notebook. Therefore, you will be training a much smaller model here. Despite this, the overall process for training a large language model is the same as the process for training a small language model.

------
> **ℹ️ Info: Parameters of a transformer model**
>
> **Parameters** are a set of numbers that guide the model to perform whatever task it was trained to do. In the case of transformer models, individual parameters are hard to interpret. They form a very large collection of numbers that together determine the model's behavior.
>
> The parameters of a transformer model are updated after processing each batch of paragraphs. At the start of the training, the parameters are initialized with random numbers.
>Models are then usually trained by processing the data multiple times. Going through the data once is known as an **iteration** or **epoch**. During each epoch, the parameters are updated so that they lead to better predictions of the next token.
------
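To see how random initialization, per-batch updates, and epochs fit together, here is a toy sketch with a single parameter `w` instead of millions. It uses plain mini-batch gradient descent on a made-up task (learning that `y = 2 * x`). This is not the transformer or the optimizer used in this lab; all names and values are assumptions chosen for illustration.

```python
import random

# Toy data: the targets follow y = 2 * x, so the ideal parameter is 2.0.
toy_data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]]

rng = random.Random(0)
w = rng.uniform(-1.0, 1.0)  # The parameter starts as a random number.
toy_learning_rate = 0.01
toy_batch_size = 3
toy_num_epochs = 50
num_steps = 0

for epoch in range(toy_num_epochs):
    # One epoch: go through all (shuffled) examples once, batch by batch.
    rng.shuffle(toy_data)
    for start in range(0, len(toy_data), toy_batch_size):
        batch = toy_data[start:start + toy_batch_size]
        # Gradient of the mean squared error on this batch w.r.t. w.
        gradient = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        # One step: update the parameter after processing one batch.
        w -= toy_learning_rate * gradient
        num_steps += 1

print("Number of update steps:", num_steps)
print("Learned parameter:", round(w, 4))
```

With 7 examples and a batch size of 3, each epoch consists of 3 steps, so 50 epochs amount to 150 parameter updates.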

### Initialize the model

The `create_model` function used below builds a transformer model. It takes three parameters:

* `max_length`: The maximum length of a paragraph in the dataset (which you set above). The model will only be able to process sequences up to this length.
* `vocabulary_size`: The size of the vocabulary. That is, the number of unique tokens in the vocabulary, including the special `<PAD>` and `<UNK>` tokens. This is used in two ways. Firstly, it is used to determine the number of unique inputs the model should expect. Secondly, it determines how many different tokens the model can predict. You can get this information from the tokenizer that you defined above by using its `vocabulary_size` attribute.
* `learning_rate`: How quickly the parameters should be updated. Setting this to a higher value can speed up training but may result in a worse model. Setting this to a lower value likely improves how the model learns but may slow down training. For now, you do not have to change this value and you will learn more about this setting in later courses.



In [23]:
model = training.create_model(
    max_length=max_length,
    vocabulary_size=tokenizer.vocabulary_size,
    learning_rate=1e-4
)

### Initialize a callback function

Training can take a while. You want to make sure that the model predictions actually get better over time. One way to do this is to define a **callback function** that is used to regularly print what the model would generate for one prompt.

For example, the callback function defined in the following cell will print ten tokens for the prompt "Abeni," after every 10 training iterations.

In [24]:
prompt = "Abeni,"
prompt_ids = tokenizer.encode(prompt)
text_gen_callback = training.TextGenerator(
    max_tokens=10, start_tokens=prompt_ids, tokenizer=tokenizer, print_every=10
)

### Run the training

Run the following cell to train the model. As mentioned above, the training process updates the model parameters after processing each batch. This is known as a step in the training process.

An epoch involves processing all batches in the dataset. Before training the model, you have to set the number of times the training process should process the dataset. This is done by setting the number of epochs (`num_epochs`).

You will likely get the best results if you train the model for at least 200 epochs. But if training is taking a long time, you can reduce the number of epochs. If the model does not perform well after 200 epochs, you can train it for additional epochs by adjusting the number below and re-running the cell. This will continue training your model.

If you want to reset the training, re-run the previous cells before running the cell below.

In [25]:
num_epochs = 200  # @param {type: "number"}
# verbose=2: Instructs the model.fit method to print one line per
# epoch so you see how the loss is decreasing and generated texts improving.
history = model.fit(
    x=batches, verbose=2, epochs=num_epochs, callbacks=[text_gen_callback]
)

Epoch 1/200
8/8 - 12s - 2s/step - loss: 8.5638
Epoch 2/200
8/8 - 3s - 389ms/step - loss: 8.3509
Epoch 3/200
8/8 - 0s - 38ms/step - loss: 8.1503
Epoch 4/200
8/8 - 0s - 40ms/step - loss: 7.9671
Epoch 5/200
8/8 - 0s - 40ms/step - loss: 7.7983
Epoch 6/200
8/8 - 0s - 39ms/step - loss: 7.6483
Epoch 7/200
8/8 - 0s - 38ms/step - loss: 7.5125
Epoch 8/200
8/8 - 0s - 39ms/step - loss: 7.3903
Epoch 9/200
8/8 - 0s - 39ms/step - loss: 7.2843
Epoch 10/200
Generated text:
 Abeni, coat was rituals function. them formed promoting Toubab, salt. individual. 

8/8 - 3s - 329ms/step - loss: 7.1848
Epoch 11/200
8/8 - 0s - 49ms/step - loss: 7.0995
Epoch 12/200
8/8 - 0s - 39ms/step - loss: 7.0234
Epoch 13/200
8/8 - 0s - 38ms/step - loss: 6.9563
Epoch 14/200
8/8 - 0s - 40ms/step - loss: 6.8934
Epoch 15/200
8/8 - 0s - 38ms/step - loss: 6.8387
Epoch 16/200
8/8 - 0s - 39ms/step - loss: 6.7891
Epoch 17/200
8/8 - 0s - 39ms/step - loss: 6.7382
Epoch 18/200
8/8 - 0s - 40ms/step - loss: 6.6864
Epoch 19/200
8/8 - 0s - 3

While the model is training, you can observe how the generated text changes. At the beginning of the training, the generation will likely be a random collection of words. By the end of the training, however, the generation should start to become more coherent.

Apart from observing how the generated text changes, you can also check how the **loss** changes as training progresses. If the model is training properly, the loss should go down as training continues. You may find that the loss temporarily goes up again from one epoch to another. This is nothing to worry about, but the general trend should be that the loss decreases.
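
One way to check the general trend despite temporary increases is to smooth the loss values with a moving average. The sketch below uses the first nine loss values from the training log above. The helper `moving_average` is written for this example. In a live notebook, the full list of per-epoch losses should be available as `history.history["loss"]`.

```python
# First nine loss values copied from the training log above.
losses = [8.5638, 8.3509, 8.1503, 7.9671, 7.7983, 7.6483, 7.5125, 7.3903, 7.2843]


def moving_average(values, window=3):
    """Averages each run of `window` consecutive values."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


smoothed = moving_average(losses)
# The trend is decreasing if every smoothed value is below the previous one.
is_decreasing = all(
    later < earlier for earlier, later in zip(smoothed, smoothed[1:])
)
print("Smoothed loss is decreasing:", is_decreasing)
```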

Once the training process has finished (this can take some time), you can prompt the model as you did in earlier labs.




## Evaluate your small language model

After training a model, researchers have to perform many evaluations to determine whether it is performing well in many scenarios.

As a final activity, you will also evaluate your model. The remainder of this lab guides you through this evaluation process. You will ask the following key questions to evaluate your model's quality:

*   A. How good is your model at predicting the next token for a given prompt based on patterns identified in the training dataset?
*   B. Is the generated text coherent, and does it make sense given the context?
*   C. Is the likely next token what you expect to see when the context is changed slightly?

When evaluating your model, you may find it useful to take some notes. To do this, you can either add cells to this Colab notebook or take notes on [Google Docs](https://docs.google.com/), [NotebookLM](https://notebooklm.google/), a piece of paper, or any other note-taking tool of your choice.

### How good is your model at predicting the next token for a given prompt based on patterns identified in the training dataset?

The following steps provide you with some guidance on how to answer this question.

* Prompt the model using a token or sequence of tokens from the training dataset. For example, you can start with `"Abeni, a bright-eyed"`.
* Visualize the probability distribution of the next token for a given prompt.
* Increase `num_tokens_to_generate` to generate longer texts.
* Inspect the generated text. See how well the model has learned to generate text that reflects the patterns learned during training.

In [29]:
prompt = "Cape town is a" #@param {type: "string"}
num_tokens_to_generate = 10 #@param {type: "number"}
generated_text, probs = generation.generate_text(
    prompt,
    num_tokens_to_generate,
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.pad_token_id,
    sampling_mode="greedy",  # Always pick the most probable next token.
)

print("Generated text:", generated_text)
print("\n")

visualizations.plot_next_token(probs[0], prompt=prompt, tokenizer=tokenizer)

Generated text: Cape town is a dormant volcano in Tanzania, dominates the East African landscape and




### Is the generated text coherent, and does it make sense given the context?
* Prompt the model with a token or a phrase of your choosing.
* Increase `num_tokens_to_generate` to generate longer texts.
* Visualize the probability distribution of the next token for a given prompt.
* Inspect the quality of generated texts.

Note that, above, the generation process always chooses the most probable next token from the set of candidate tokens. In the next cell, the generation process samples a next token according to the probability distribution predicted by the model. This is done by setting the `sampling_mode` parameter to `random`.
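
The difference between the two sampling modes can be illustrated on a toy distribution. This is a sketch under stated assumptions: the probabilities are made up, and the helpers `greedy_choice` and `random_choice` are hypothetical and not part of the lab's `generation` module.

```python
import random

# Made-up next-token probabilities, for illustration only.
next_token_probs = {"food": 0.5, "water": 0.3, "music": 0.2}


def greedy_choice(probs):
    """Returns the token with the highest probability."""
    return max(probs, key=probs.get)


def random_choice(probs, rng):
    """Samples one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # Fixed seed so the run is repeatable.
print("greedy:", greedy_choice(next_token_probs))
samples = [random_choice(next_token_probs, rng) for _ in range(1000)]
print("random:", {token: samples.count(token) for token in next_token_probs})
```

Greedy selection returns the same token every time, whereas random sampling also picks less likely tokens, roughly in proportion to their probabilities.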

------
> **ℹ️ Info: Unseen tokens**
>
> When you are trying different prompts, you may also notice that sometimes tokens get replaced by the special string `<UNK>`. This happens when you prompt the model with tokens that did not appear in the training dataset, so-called **unseen tokens**. For these tokens, the `token_to_index` dictionary of the tokenizer does not have an entry and therefore they cannot be mapped to a token index.
>
> One method of dealing with such tokens is to add a special `<UNK>` token along with its index to the vocabulary of the tokenizer. Then, during **inference**, whenever there is an unseen token, it maps the token to the index of this special `<UNK>` token.
>
> This method is not ideal because all information in the token is lost. In later courses, you will learn more sophisticated methods of dealing with unseen tokens that do not rely on such a catch-all token. For now, you will likely observe that the model is not very good at predicting the next word if there are several unseen words in the prompt.
------
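
The `<UNK>` fallback described in the info box can be sketched with a dictionary lookup that has a default value. The vocabulary and the `encode` function here are hypothetical. The lab's tokenizer builds its own vocabulary from the dataset and may be implemented differently.

```python
# Hypothetical toy vocabulary, for illustration only.
token_to_index = {"<PAD>": 0, "<UNK>": 1, "Jide": 2, "was": 3, "hungry": 4}


def encode(tokens, token_to_index, unk_token="<UNK>"):
    """Maps tokens to indices, using the <UNK> index for unseen tokens."""
    unk_id = token_to_index[unk_token]
    return [token_to_index.get(token, unk_id) for token in tokens]


# "famished" is not in the vocabulary, so it is mapped to the <UNK> index.
print(encode(["Jide", "was", "famished"], token_to_index))  # [2, 3, 1]
```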

In [30]:
prompt = "Jide was hungry so she went looking for" #@param {type: "string"}
num_tokens_to_generate = 10 #@param {type: "number"}
generated_text, probs = generation.generate_text(
    prompt,
    num_tokens_to_generate,
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.pad_token_id,
    sampling_mode="random",
)
print("Generated text:", generated_text)
print("\n")

visualizations.plot_next_token(probs[0], prompt=prompt, tokenizer=tokenizer)

Generated text: Jide was hungry so she went looking for Zimbabwe and soon being traditional riads bean batter is characterized




### Is the likely next token what you expect to see when the context is changed slightly?
* Change the context of the prompt slightly.
* Visualize the probability distribution of the next token for a given prompt.
* Increase `num_tokens_to_generate` to generate longer texts.
* Inspect the quality of generated texts.
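
When comparing two prompts, it can help to look at which tokens rank highest in each distribution and which token gained the most probability. The sketch below uses made-up probabilities for two hypothetical prompts. It does not use the model's actual output, and the helper `top_k` is written for this example.

```python
def top_k(probs, k=3):
    """Returns the k most probable tokens, most probable first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


# Made-up next-token probabilities for two slightly different prompts.
hungry_probs = {"food": 0.4, "water": 0.2, "bread": 0.3, "music": 0.1}
thirsty_probs = {"food": 0.15, "water": 0.55, "bread": 0.1, "music": 0.2}

# Change in probability for each token when the context changes.
shift = {token: thirsty_probs[token] - hungry_probs[token] for token in hungry_probs}
biggest_gain = max(shift, key=shift.get)

print("Top tokens (hungry): ", top_k(hungry_probs))
print("Top tokens (thirsty):", top_k(thirsty_probs))
print("Token with the biggest gain:", biggest_gain)
```

If your model has picked up on the change in context, you would expect a similar shift towards tokens that fit the new prompt.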

In [31]:
prompt = "Jide was thirsty so she went looking for" #@param {type: "string"}
num_tokens_to_generate = 10 #@param {type: "number"}
generated_text, probs = generation.generate_text(
    prompt,
    num_tokens_to_generate,
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.pad_token_id,
    sampling_mode="random",
)

print("Generated text:", generated_text)
print("\n")

visualizations.plot_next_token(probs[0], prompt=prompt, tokenizer=tokenizer)

Generated text: Jide was thirsty so she went looking for change. taste. It has become ice cake" in a scent




## Summary

This is the end of the **Train your own small language model** lab.

In this lab, you trained your first SLM and engaged in the following steps.

- **Tokenized the dataset:** You used the `SimpleWordTokenizer` from the previous lab to tokenize and convert the paragraphs in the dataset to token IDs.

- **Padded the paragraphs:** You ensured all paragraphs had the same length by truncating some of them and padding others with a special `"<PAD>"` token. This is crucial for processing data in neural networks, such as transformer language models.

- **Prepared the input and target data:** You created input-target pairs, where the target is the input sequence shifted by one token. This teaches the model to predict the next token based on the context of previous tokens.

- **Shuffled and batched the data:** You shuffled the dataset to increase the diversity of the data within each batch and grouped the paragraphs into batches for training.

- **Trained the SLM:** You defined and trained a small transformer model, observing how the training loss decreased during training.

- **Prompted the trained model:** You experimented with prompting the model, observing its ability to predict the likely next word, generate coherent text, and adapt to changes in context.
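
As a compact reminder, the padding and input-target steps listed above can be sketched as follows. The function names and the pad token ID of `0` are assumptions made for this example and may differ from the code used earlier in the lab.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id=0):
    """Truncates or pads a list of token IDs to exactly max_length."""
    truncated = token_ids[:max_length]
    return truncated + [pad_token_id] * (max_length - len(truncated))


def make_input_and_target(token_ids):
    """Creates an input-target pair where the target is shifted by one token."""
    return token_ids[:-1], token_ids[1:]


padded = pad_or_truncate([5, 8, 13], max_length=5)
inputs, targets = make_input_and_target(padded)
print("padded: ", padded)   # [5, 8, 13, 0, 0]
print("inputs: ", inputs)   # [5, 8, 13, 0]
print("targets:", targets)  # [8, 13, 0, 0]
```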

As you performed your evaluations, you may have noticed that some of the model predictions are not as good as the ones that you have seen with the Gemma model. This is expected since your model is a lot smaller than the Gemma model. It has been trained on *a lot* less text data. Nevertheless, your model should be able to produce grammatical sentences, even if they do not always make a lot of sense.

In the next section of the course, you will explore model evaluation in a little more depth. You will then move on to think about the kinds of problems you are interested in using language models to address.

## Solutions

The following cells provide reference solutions to the coding activities above. If you really get stuck after trying to solve the activities yourself, you may want to consult these solutions.

However, we recommend that you *only* look at the solutions after you have tried to solve the activities above *multiple times*. The best way to learn challenging concepts in computer science and artificial intelligence is to debug your code piece by piece until it works rather than copying existing solutions.

If you feel stuck, you may want to first try to debug your code, for example, by adding additional print statements to see what your code is doing at every step. This will provide you with a much deeper understanding of the code and the materials. It will also make you practice how to solve challenging coding problems beyond this course.

To view the solutions for an activity, click on the arrow to the left of the activity name. If you consult the solutions, do not copy and paste them into the cells above. Instead, look at them and then type them manually into the cell. This will help you understand where you went wrong.

### Coding Activity 1

In [None]:
# Add this code in the cell for Activity 1 above.
longest_paragraph_length = len(max(encoded_tokens, key=len))
shortest_paragraph_length = len(min(encoded_tokens, key=len))
