> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations Github README file</a>.

<img src="https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C5-white-bg.png">

# Lab: Full-Parameter Fine-Tuning of Gemma

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_5/gdm_lab_5_4_full_parameter_fine_tuning_of_gemma.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

15 minutes

Attempt to perform full-parameter fine-tuning of a Gemma model.

## Overview

In the previous lab, you fine-tuned a small language model. As you observed, that model was often able to produce answers of reasonable quality for flashcards on topics that are in the Africa Galore dataset. However, it generally failed when it was prompted to generate answers for flashcards on topics that are not included in the Africa Galore dataset.

In this lab, you will attempt to apply the full-parameter fine-tuning technique to the pre-trained Gemma model. Since Gemma has been trained on a lot more data, it will likely be able to generate useful answers to many more questions. However, when you try to apply full-parameter fine-tuning, you will quickly encounter a critical real-world constraint: memory limitations. This lab is designed to demonstrate first-hand why full-parameter fine-tuning can be challenging with large models and why more efficient techniques are needed.


### What you will learn
By the end of this lab, you will be able to:

* Load and interact with a Keras implementation of a pre-trained Gemma model.
* Recognize the practical memory limitations of full-parameter fine-tuning when applied to billion-parameter models.

### Tasks

**In this lab, you will**:
* Prepare the flashcard dataset for use with the Gemma tokenizer and Keras.
* Load the pre-trained Gemma model and test its base performance with a few prompts.
* Attempt to perform full-parameter fine-tuning on the model.
* Observe the memory-related errors that prevent the training from succeeding.

**This lab must be run on a GPU. Choose a T4 GPU.** See the section "How to use Google Colaboratory (Colab)" below for instructions on how to do this.



## How to use Google Colaboratory (Colab)

Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in cells that are executed on a remote server.

To run a cell, hover over a cell, and click the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can also click a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [None]:
from datetime import datetime
print(f"Today is {datetime.today():%A}.")

Note that the order in which you run the cells matters. When you are working through a lab, make sure to always run all cells in order. Otherwise, the code might not work. If you take a break while working on a lab, Colab may disconnect you; in that case, you have to execute all cells again before continuing your work. To make this easier, you can select the cell you are currently working on and then choose __Runtime → Run before__ from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

### Using Colab with a GPU

Follow these steps to run the activities in this lab on a GPU:

1.  In the top menu bar, click on **Runtime**.
2.  Select **Change runtime type** from the dropdown menu.
3.  In the pop-up window under **Hardware Accelerator**, select **GPU** (usually listed as `T4 GPU`).
4.  Click **Save**.

Your Colab session will now restart with GPU access.

Note that access to GPUs is limited, and at times you may not be able to run this lab on a GPU. All activities will still work, but they will run more slowly and you will have to wait longer for some of the cells to finish running.
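
Before continuing, it can help to confirm that the runtime actually has a GPU attached. The following is a minimal sketch that is not part of the original lab. It assumes that a GPU runtime exposes the `nvidia-smi` command-line tool, as NVIDIA-based Colab runtimes typically do. The helper name `gpu_is_visible` is hypothetical.

```python
import shutil
import subprocess


def gpu_is_visible() -> bool:
    """Return True if `nvidia-smi` exists and exits successfully."""
    # Assumption: a GPU runtime ships the `nvidia-smi` tool on the PATH.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if gpu_is_visible():
    print("A GPU is visible to this runtime.")
else:
    print("No GPU detected. Check Runtime > Change runtime type.")
```

If the check reports no GPU, repeat the steps above before moving on.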


## Set up a Kaggle account

To run this notebook, you will have to sign up for [Kaggle](https://www.kaggle.com), a platform that hosts datasets and models for machine learning, and sign the agreement for using the Gemma 3 model. This is required so that you can download the weights of the Gemma model for fine-tuning.

### Step 1: Create your Kaggle account

* Go to the Kaggle website: https://www.kaggle.com

* Click the "Register" button in the top-right corner.

* You can sign up using your Google account (recommended for easy Colab integration) or by entering an email and password.

* Follow the on-screen prompts to complete your registration and verify your email.

### Step 2: Sign the Gemma 3 model agreement

* Make sure you are logged into your new Kaggle account.

* Go directly to the Gemma 3 model card page: https://www.kaggle.com/models/keras/gemma3/keras/

* You should see a "Request Access" button.

* Click the button, read through the license agreement, and click "Accept" to gain access to the model. You must do this before the API will let you download the model.

### Step 3: Generate your Kaggle API key

* From any Kaggle page, click on your profile picture or icon in the top-right corner.

* Select "Account" from the drop-down menu.

* Scroll down to the "API" section.

* Click the "Create New API Token" button.

* This will immediately download a file named `kaggle.json` to your computer. This file contains your username and your secret API key. Keep it safe.

### Step 4: Set your API key in Colab

* Click the "key" icon 🔑 in the left-hand sidebar.

* You will see the "Secrets" panel.

* Now, open the `kaggle.json` file you downloaded on your computer. It is a simple text file and will look like this:

   ```json
   {"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_API_KEY"}
   ```
* In the Colab Secrets panel, create two new secrets:

   1. Name: `KAGGLE_USERNAME`

      Value: Copy and paste `YOUR_KAGGLE_USERNAME` from your `kaggle.json` file.

   2. Name: `KAGGLE_KEY`

      Value: Copy and paste `YOUR_KAGGLE_API_KEY` from your `kaggle.json` file.

* For both secrets, make sure the "Notebook access" toggle is switched on.
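
If you want to double-check the two values before pasting them into the Secrets panel, the file contents can be parsed with Python's standard `json` module. This is a minimal sketch: the helper name `read_kaggle_credentials` is hypothetical, and the example string uses the placeholder values shown above rather than real credentials.

```python
import json


def read_kaggle_credentials(text: str) -> tuple[str, str]:
    """Parse the contents of kaggle.json and return (username, key)."""
    credentials = json.loads(text)
    missing = {"username", "key"} - credentials.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return credentials["username"], credentials["key"]


# Placeholder contents, matching the example shown in this section.
example = '{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_API_KEY"}'
username, key = read_kaggle_credentials(example)
print(f"KAGGLE_USERNAME = {username}")
print(f"KAGGLE_KEY = {key}")
```

Avoid printing a real API key in a notebook that you intend to share.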


## Imports

In this lab, you will use the Keras package for loading a Gemma model, as well as the Pandas package for loading the dataset.

Run the following cell to import the required packages.


In [None]:
%%capture
# Install the custom package for this course.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

import os

# For loading the Kaggle username and key.
from google.colab import userdata

# Load the Kaggle username and key.
os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")

# Configure the backend before importing Keras so the settings take effect.
os.environ["KERAS_BACKEND"] = "jax"  # Set the Keras backend to JAX.
# Let JAX preallocate up to 95% of GPU memory to reduce fragmentation.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.95"

import keras  # For training the model.
import keras_nlp  # For loading Gemma 3.
import pandas as pd  # For loading the dataset.
from textwrap import fill  # For making paragraphs more readable.

# For loading the formatting function from the lab
# "Format Text for Turn-Based Dialogue."
from ai_foundations import formatting

keras.utils.set_random_seed(812)  # For making the training reproducible.

## Motivation

As discussed in the previous article, fine-tuning foundation models provides you with an option to leverage the knowledge that comes from pre-training a model on large amounts of data.

To get a better sense of how a pre-trained model, such as Gemma-1B, differs from the small language model (SLM) you worked with in the previous lab, consider the comparisons in the following table:

|  | Gemma-1B | SLM |
|----------|----------|----------|
| Number of parameters | 1,000,000,000 | 523,470 |
| Tokens in vocabulary | 262,144 | 3,086 |
| Tokens in training set | 2,000,000,000,000 | 92,568 |
| Training data | Web documents in 140 languages | Africa Galore dataset |

As this table shows, the Gemma model contains considerably more parameters (~1 billion versus ~500,000) and has been trained on a dataset that is much larger and much more diverse than the Africa Galore dataset. This combination allows the model to learn and store more information than the small language model. As such, if you fine-tune Gemma-1B to produce answers for a flashcard, it will be able to provide useful answers for a much larger range of topics than the SLM.
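
The difference in scale can be made concrete with a little arithmetic. The sketch below only uses the parameter and training-token counts from the table above; the variable names are chosen for illustration.

```python
# Figures copied from the comparison table above.
gemma_parameters = 1_000_000_000
slm_parameters = 523_470
gemma_training_tokens = 2_000_000_000_000
slm_training_tokens = 92_568

parameter_ratio = gemma_parameters / slm_parameters
token_ratio = gemma_training_tokens / slm_training_tokens

print(f"Gemma-1B has about {parameter_ratio:,.0f} times as many parameters.")
print(f"Gemma-1B saw about {token_ratio:,.0f} times as many training tokens.")
```

In other words, Gemma-1B has roughly 1,900 times as many parameters as the SLM and was trained on more than 20 million times as many tokens.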

## Data pre-processing

As with your SLM, you have to make sure that your fine-tuning data uses the tokenizer and the special delimiter tokens that were used during pre-training. In this course, you are using the format that instruction-tuned Gemma models use [1], which was introduced in the lab "Format Text for Turn-Based Dialogue". It uses special tokens to mark the beginning and end of each turn and indicates the role (`user` or `model`).

A single training example will be structured as follows:

------
> `<start_of_turn>`user
>
> What is Jollof rice?`<end_of_turn>`
>
> `<start_of_turn>`model
>
> Category: Food
>
> Jollof rice is a popular and iconic one-pot rice dish that is a staple in many West African countries.`<end_of_turn>`
------
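
To make this structure concrete, the following sketch builds such an example by hand. It is an illustration only: `format_example` is a hypothetical helper, and the `formatting.format_qa` function from the course package may split the text between prompt and response differently.

```python
START_OF_TURN = "<start_of_turn>"
END_OF_TURN = "<end_of_turn>"


def format_example(
    question: str, category: str, answer: str
) -> tuple[str, str]:
    """Return a (prompt, response) pair in the turn-based format."""
    # The prompt holds the user turn and opens the model turn.
    prompt = (
        f"{START_OF_TURN}user\n{question}{END_OF_TURN}\n"
        f"{START_OF_TURN}model\n"
    )
    # The response holds the flashcard content and closes the model turn.
    response = f"Category: {category}\n{answer}{END_OF_TURN}"
    return prompt, response


prompt, response = format_example(
    "What is Jollof rice?",
    "Food",
    "Jollof rice is a popular and iconic one-pot rice dish.",
)
print(prompt + response)
```

Concatenating the prompt and the response yields one complete training example with both turns delimited.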

Run the following cell to load the Africa Galore QA dataset and format each example so that it has this structure.

In [None]:
# Load the question-answer dataset.
africa_galore_qa = pd.read_json(
    "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore_qa_v2.json"
)

questions = []  # List of formatted questions.
answers = []  # List of formatted answers.

for _, row in africa_galore_qa.iterrows():
    # Run the format_qa function from the previous lab to format the question
    # and the answer.
    question, answer = formatting.format_qa(row)
    questions.append(question)
    answers.append(answer)

# Show the first set of inputs and outputs.
print(questions[0])
print(fill(answers[0], replace_whitespace=False))

Recall that in the previous lab, you had to perform additional steps to prepare the data. You had to tokenize each question and each answer, and you had to manually replace some tokens in the target with the special `<PAD>` token to exclude them from the loss computation.

Since fine-tuning a model is a very common task, implementations of models like Gemma often provide a preprocessor that takes care of several of these data preparation steps.

In the case of the Keras implementation of the Gemma model, you only need to specify the prompt and what you would like the model to generate. It then automatically constructs tokenized examples that can be used to fine-tune a Gemma language model. In the background, it concatenates the prompts and responses and excludes the tokens in the prompt from the loss computation. This saves you many common steps. Note, however, that it does not automatically add `<start_of_turn>` or `<end_of_turn>` tokens, so you still have to provide these tokens to the model, as you did in the previous cell.
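
The idea of concatenating prompt and response while excluding the prompt from the loss can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: `build_example` is a hypothetical helper, the token IDs are made up, and details such as tokenization and shifting the labels by one position are left out.

```python
def build_example(prompt_ids, response_ids, sequence_length, pad_id=0):
    """Concatenate prompt and response and build a loss mask.

    The mask is 1 for response tokens and 0 for prompt and padding tokens.
    """
    token_ids = (prompt_ids + response_ids)[:sequence_length]
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    loss_mask = loss_mask[:sequence_length]
    # Pad both lists to the fixed sequence length.
    padding = sequence_length - len(token_ids)
    token_ids = token_ids + [pad_id] * padding
    loss_mask = loss_mask + [0] * padding
    return token_ids, loss_mask


# Made-up token IDs, used purely for illustration.
prompt_ids = [11, 12, 13]
response_ids = [21, 22]
token_ids, loss_mask = build_example(
    prompt_ids, response_ids, sequence_length=8
)
print(token_ids)  # [11, 12, 13, 21, 22, 0, 0, 0]
print(loss_mask)  # [0, 0, 0, 1, 1, 0, 0, 0]
```

Only the positions where the mask is 1 would contribute to the loss, so the model is trained to produce the response rather than to reproduce the prompt.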

### Preparing data for the Keras model

The Keras implementation of Gemma expects the fine-tuning data to be structured as a Python dictionary with two specific keys: `prompts` for the prompts (these are the questions in the case of the flashcard generator) and `responses` for the generations that the model should output (the answers in the case of the flashcard generator).

Run the following cell to initialize this dictionary.

In [None]:
data = {
    "prompts": questions,
    "responses": answers
}
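
Because each prompt is paired with the response at the same position, a quick sanity check on the dictionary can catch mistakes early. The helper below is a hypothetical sketch and not part of Keras; it only verifies that both keys exist and that the two lists have the same length.

```python
def check_finetuning_data(data: dict) -> int:
    """Validate the prompts/responses dictionary and return its size."""
    for key in ("prompts", "responses"):
        if key not in data:
            raise KeyError(f"Missing required key: {key!r}")
    if len(data["prompts"]) != len(data["responses"]):
        raise ValueError("Each prompt needs exactly one response.")
    return len(data["prompts"])


# A toy dictionary with a single example, for illustration.
toy_data = {
    "prompts": [
        "<start_of_turn>user\nWhat is Jollof rice?<end_of_turn>\n"
        "<start_of_turn>model\n"
    ],
    "responses": [
        "Category: Food\nJollof rice is a one-pot rice dish.<end_of_turn>"
    ],
}
print(check_finetuning_data(toy_data))  # 1
```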

## Loading the Gemma model with Keras

The following cell loads the pre-trained [Gemma-1B model](https://ai.google.dev/gemma/docs/core/prompt-structure) using Keras [2]. Like your SLM, this model has been trained on the next-token prediction task and has not been optimized for answering questions. After loading, you can print a summary using the `summary` method to inspect its architecture.

------
> **💭 Reflection: Memory requirements**
>
> Before running the code, consider the following questions:
>
> * How many parameters do you expect the model to have?
> * If you are using a Colab notebook with a T4 GPU, it will have about 15 GB of memory. Will you be able to load the entire model into the memory of your GPU?
> * Would a much larger model, like Gemma-27B, fit on your GPU?
>
------

Run the following cell to load the model and initialize the preprocessor.




In [None]:
# Load the Gemma3-1B Keras model.
model = keras_nlp.models.Gemma3CausalLM.from_preset("gemma3_1b")
model.summary()

# Set the maximum sequence length of the model for padding and batching.
model.preprocessor.sequence_length = 400

### Understanding Gemma's memory footprint

The Gemma-1B model has approximately 1 billion parameters (roughly 700M in the transformer blocks and 300M in the token embeddings). Assuming 4 bytes per parameter (32-bit floats), a model of this size requires around 4 GB of memory, so it fits on a T4 GPU with 15 GB of memory. A 27B model, that is, a model with around 27 billion parameters, would require roughly 100 GB of memory (27 $\times$ ~3.7 GB), which is several times the capacity of a single T4 GPU.
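
The estimate above can be reproduced with a short calculation. This sketch assumes that every parameter is stored as a 32-bit float (4 bytes) and that Gemma-1B has about 1 billion parameters, as listed in the comparison table earlier in this lab. The function name `weight_memory_gib` is hypothetical, and storing weights in a 16-bit format would halve the results.

```python
def weight_memory_gib(
    num_parameters: float, bytes_per_parameter: int = 4
) -> float:
    """Memory needed to store the weights alone, in GiB (2**30 bytes)."""
    return num_parameters * bytes_per_parameter / 2**30


T4_MEMORY_GIB = 15  # Approximate T4 memory, as stated in this lab.

for name, num_parameters in [("Gemma-1B", 1e9), ("Gemma-27B", 27e9)]:
    needed = weight_memory_gib(num_parameters)
    verdict = "fits" if needed < T4_MEMORY_GIB else "does not fit"
    print(f"{name}: about {needed:.1f} GiB of weights, {verdict} on a T4")
```

The calculation only counts the weights themselves, which is why it applies to loading the model but not yet to training it.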



## Prompting the base model

Before you fine-tune the model, observe how the pre-trained base model performs when you prompt it with a question.


In [None]:
prompt = "What is Jollof rice?"
response = model.generate(prompt, max_length=200)
print(fill(response, replace_whitespace=False))

### What did you observe?

For a prompt such as "What is Jollof rice?", you likely observed that the model's response contains repetitions, is not concise and may go off topic. Furthermore, it does not output the response in the desired flashcard format.

These shortcomings are expected. Recall that this model has been trained on large amounts of text data to predict the next token but, unlike instruction-tuned models, it has not been optimized to respond to user queries. Furthermore, its pre-training does not contain any information about the format for the flashcard generator.

By fine-tuning a model you can address both of these issues. The model will learn to produce more concise responses to questions and learn to follow the flashcard format, since both of these qualities are demonstrated in the training examples of the Africa Galore QA dataset.

## Attempting full-parameter fine-tuning

Fine-tuning requires significantly more memory than using a model for inference. For each parameter, the training process must store not only the current value of the parameter itself, but also its gradient and other variables used by the optimizer. Because of this, training a model requires several times more memory than using it to generate text.
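
A rough estimate shows how quickly these extra values add up. The sketch below assumes 32-bit floats and an Adam-style optimizer that keeps two additional values per parameter; this lab does not state which optimizer is used, so treat these as illustrative assumptions. The function name `training_memory_gib` is hypothetical, and the memory needed for activations is ignored entirely.

```python
def training_memory_gib(
    num_parameters: float,
    bytes_per_value: int = 4,
    optimizer_values_per_parameter: int = 2,
) -> float:
    """Rough lower bound on training memory in GiB, ignoring activations."""
    # One value for the weight, one for its gradient, plus optimizer state.
    values_per_parameter = 1 + 1 + optimizer_values_per_parameter
    return num_parameters * bytes_per_value * values_per_parameter / 2**30


num_parameters = 1e9  # Approximate size of Gemma-1B.
inference_gib = num_parameters * 4 / 2**30  # Weights only.
training_gib = training_memory_gib(num_parameters)

print(f"Weights only (inference): about {inference_gib:.1f} GiB")
print(f"Weights, gradients, optimizer state: about {training_gib:.1f} GiB")
```

Under these assumptions, the weights, gradients, and optimizer state alone come close to the roughly 15 GB of a T4 GPU, before any activations for a batch of training examples are stored.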

For this step, run the following code to fine-tune the Gemma-1B model using the same full-parameter approach that you used for fine-tuning your SLM. The code below updates all of Gemma's parameters.

If you encounter an error while running the fine-tuning cell, continue reading for guidance.

In [None]:
model.fit(data, epochs=1, batch_size=32, verbose=1)

### What did you observe?


When running the cell above, you should observe an "out of memory" (OOM) error. This occurs because the training process attempts to allocate more memory than is available on a T4 GPU when it tries to fine-tune all of the roughly 1 billion parameters of the Gemma-1B model. When this happens, the Colab session may crash, in which case you have to re-run all of the code.

This is an intentional step. It highlights that being able to load a model with 1B or more parameters and perform inference on a T4 GPU with 15 GB of memory, as you did earlier in this lab, does not mean that you can also fine-tune all parameters of such a model. As you will learn in more detail in future courses, any form of model training, including fine-tuning, requires significantly more memory, since gradients and optimizer state must also be stored in the GPU's memory.





## Summary

In this lab, you experimented with the pre-trained Gemma-1B model. You attempted to adapt it to a new task using full-parameter fine-tuning and discovered that this method is often impractical for large models due to prohibitive memory requirements.

To solve this, you need a more sophisticated approach. The upcoming labs will introduce you to the concept of parameter-efficient fine-tuning (PEFT) and a technique called LoRA, which will allow you to fine-tune large models, such as Gemma, with limited GPU memory.

## References

[1] Google AI - Gemma Prompt Structure
Google AI. 2025. Gemma formatting and system instructions. Retrieved from https://ai.google.dev/gemma/docs/core/prompt-structure

[2] Keras Hub - Gemma3CausalLM
Keras. 2025. Gemma3CausalLM model. Retrieved from https://keras.io/keras_hub/api/models/gemma3/gemma3_causal_lm/





