<img src="https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C6-white-bg.png">

# Lab: Estimate GPU Memory

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_7/gdm_lab_7_4_estimate_gpu_memory.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

Learn how to estimate the GPU memory required to train a model.

15 minutes

In the lab "Estimate Training FLOPs", you explored the number of computations required to train a model. However, there is another critical resource to consider: GPU memory. Before you can perform a single calculation, all of the necessary data must be loaded into the GPU's memory. This includes the model's parameters, the input data, and temporary values like gradients.

In this lab, you will build a memory calculator to estimate the total GPU memory required to train a model.

### What you will learn:

By the end of this lab, you will be able to:

* Compute the memory required for training a model.
* Describe why training large models is so challenging on resource-constrained hardware.

### Tasks

**In this lab, you will**:
* Write functions to calculate the memory usage for parameters, input data, gradients, optimizer states, and activations.
* Use these functions to determine whether a 4-billion-parameter model can be trained on a standard 16 GB GPU.

## How to use Google Colaboratory (Colab)

Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in **cells** that are executed on a remote server.

To run a cell, hover over it and click on the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can click on a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [None]:
from datetime import datetime
print(f"Today is {datetime.today():%A}.")

Note that the **order in which you run the cells matters**. When you are working through a lab, make sure to always run all cells in order, otherwise the code might not work. If you take a break while working on a lab, Colab may disconnect you; in that case, you have to execute all cells again before continuing your work. To make this easier, you can select the cell you are currently working on and then choose _Runtime → Run before_ from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

## Imports

In this lab, you will use functions from the custom `ai_foundations` package for formatting memory estimates and verifying your solutions.


In [None]:
%%capture
# Install the custom package for this course.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

from ai_foundations.utils import formatting # For formatting memory estimates.
# For checking your solutions.
from ai_foundations.feedback.course_7 import memory as feedback

# Used to format the results of calculations.
from IPython.display import display, HTML

The following cell sets up all the necessary constants for this lab. These constants define a hypothetical training scenario for the Gemma-4B model. This includes the `PARAM_COUNT`, the `BYTES_PER_PARAMETER` (for 32-bit precision), and other details like `BATCH_SIZE` and `MAX_LENGTH` that are needed to estimate the size of the activation memory.

Run this cell to make these constants available for the rest of the lab.

In [None]:
# Model parameters (4 billion).
PARAM_COUNT = 4e9

# Precision (bytes per parameter).
# For 32-bit (FP32), each parameter requires 4 bytes of storage.
BYTES_PER_PARAMETER = 4

# Activation formula constants.
BATCH_SIZE = 8
MAX_LENGTH = 1024 # Maximum sequence length.
NUM_LAYERS = 32
EMBEDDING_DIM = 2560

## Coding Activity 1: Calculate parameter memory

As a first step, implement a function that computes the amount of memory required to store all the parameters of a model. This and all other functions that you implement as part of this lab will then be used to estimate the total memory requirements for training or fine-tuning a model.

------
> 💻 **Your task**:
>
>Complete the function `calculate_param_memory` in the following cell.
>
>The memory required is the total number of parameters multiplied by the number of bytes each parameter occupies.
>
> Once you have implemented this function, run the first cell to define the function and the second cell to test your code.
------

In [None]:
def calculate_param_memory(param_count: float, bytes_per_param: int) -> float:
    """Calculates the memory in GB required to store the model parameters.

    Args:
      param_count: The total number of parameters in the model.
      bytes_per_param: The number of bytes used to store a single parameter.

    Returns:
      The total memory required for parameters, in gigabytes.
    """
    total_bytes = ...  # Add your code here.

    return formatting.bytes_to_gb(total_bytes)

In [None]:
# @title Run this cell to test your implementation
feedback.test_calculate_param_memory(calculate_param_memory)
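
To get a feel for the magnitudes involved, the following sketch applies the same rule (parameters multiplied by bytes per parameter) to a few model sizes. The model sizes and the `to_gb` helper are illustrative assumptions. In particular, `to_gb` uses the decimal convention (1 GB = 10^9 bytes), which may differ from the convention used by `formatting.bytes_to_gb`.

```python
def to_gb(num_bytes: float) -> float:
    """Converts bytes to decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


def param_memory_gb(param_count: float, bytes_per_param: int) -> float:
    """Returns the memory needed to store the parameters, in decimal GB."""
    return to_gb(param_count * bytes_per_param)


# Hypothetical model sizes, each stored in 32-bit precision (4 bytes).
for name, count in [("1B", 1e9), ("4B", 4e9), ("7B", 7e9)]:
    print(f"{name} parameters: {param_memory_gb(count, 4):.1f} GB")
```

Parameter memory grows linearly with the parameter count, so doubling the model size doubles this part of the budget.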

## Coding Activity 2: Calculate input data memory

During training, fine-tuning, or inference, you must also load the entire batch of input data into the GPU's memory. In this activity, compute how much memory is required to store the input for one batch.

------
> 💻 **Your task**:
>
>Complete the function `calculate_input_data_memory` below.
>
> This function should return the number of bytes required to store the input to a model.
>
> Hints:
> * The GPU processes a whole batch of training examples at once. To find the total memory for the entire input batch, you need to determine the number of bytes in a batch.
> * A batch contains `batch_size` examples, each `max_length` tokens long. Recall that every sequence has length `max_length` because it is padded or truncated to that length.
> * One input token requires `bytes_per_token_id` bytes of memory.
>
> Once you have implemented this function, run the following two cells to define the function and test your code.

------

In [None]:
def calculate_input_data_memory(
    batch_size: int, max_length: int, bytes_per_token_id: int
) -> float:
    """Calculates the memory in GB required to store a batch of input token IDs.

    Args:
      batch_size: The number of sequences in a single batch.
      max_length: The length of each sequence in tokens after padding.
      bytes_per_token_id: The number of bytes for each token ID.

    Returns:
      The total memory required for the input data batch, in gigabytes.
    """
    total_bytes = ...  # Add your code here.

    return formatting.bytes_to_gb(total_bytes)

In [None]:
# @title Run this cell to test your implementation.
feedback.test_calculate_input_data_memory(calculate_input_data_memory)
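
As a quick check on scale, the sketch below counts the bytes in one batch of token IDs and compares them with the bytes needed for the parameters. The 4-byte token ID (a 32-bit integer) is an assumption made for illustration. The other values mirror the constants defined earlier in this lab.

```python
# Values mirroring the lab's constants. bytes_per_token_id is an assumption
# (token IDs stored as 32-bit integers).
batch_size = 8
max_length = 1024
bytes_per_token_id = 4

# One token ID per position, for every sequence in the batch.
input_bytes = batch_size * max_length * bytes_per_token_id

# Parameter bytes for a 4-billion-parameter model in 32-bit precision.
param_bytes = int(4e9) * 4

print(f"Input batch: {input_bytes:,} bytes ({input_bytes / 1024:.0f} KiB)")
print(f"Parameters are about {param_bytes / input_bytes:,.0f}x larger.")
```

The input batch is tiny next to the parameters, which is why it only becomes visible when displayed with several decimal places.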

## Coding Activity 3: Calculate gradient memory

Recall that the backward pass computes a gradient that indicates how to update the parameters. This gradient has one value for each model parameter, and since it is computed on the GPU, it also needs to be stored in GPU memory.


------
> 💻 **Your task**:
>
>Complete the function `calculate_gradient_memory` in the following cell.
>
>For each parameter in the model, you have to store one number as part of the gradient. The total memory required is therefore the total number of parameters multiplied by the number of bytes each parameter occupies.
>
> Once you have implemented this function, run the following two cells to define the function and test your code.
------

In [None]:
def calculate_gradient_memory(param_count: float, bytes_per_param: int) -> float:
    """Calculates the memory in GB required to store the gradients.

    Args:
      param_count: The total number of parameters in the model.
      bytes_per_param: The number of bytes used to store a single parameter.

    Returns:
      The total memory required for gradients, in gigabytes.
    """
    total_bytes = ...  # Add your code here.

    return formatting.bytes_to_gb(total_bytes)

In [None]:
# @title Run this cell to test your implementation.
feedback.test_calculate_gradient_memory(calculate_gradient_memory)
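
The claim that the gradient has one value per parameter can be illustrated with a toy example. The sketch below is not how real training computes gradients (that is done with backpropagation). It approximates the gradient of a made-up loss with finite differences, purely to show that the result has the same length as the parameter list. The `loss` function and the parameter values are hypothetical.

```python
def loss(params: list[float]) -> float:
    """A toy loss: the sum of squared parameters."""
    return sum(p * p for p in params)


def numerical_gradient(params: list[float], eps: float = 1e-6) -> list[float]:
    """Approximates d(loss)/d(param) for each parameter by finite differences."""
    grad = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grad.append((loss(bumped) - loss(params)) / eps)
    return grad


params = [0.5, -1.0, 2.0]
grad = numerical_gradient(params)

# The gradient holds exactly one value per parameter.
print(len(params), len(grad))
```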

## Coding Activity 4: Calculate optimizer memory

Modern optimizers such as Adam also keep information that is used to compute learning rates for each individual parameter. This information also needs to be stored on the GPU.


------
> 💻 **Your task**:
>
>Complete the function `calculate_optimizer_memory` in the following cell.
>
>For each parameter in the model, an optimizer such as Adam stores information that is used to compute the learning rate for that specific parameter. The memory required is therefore the total number of parameters multiplied by the number of bytes this information occupies. Adam stores two states per parameter, so storing this information takes 2 times the number of bytes required to store the parameters themselves.
>
> Once you have implemented this function, run the following two cells to define the function and test your code.
------

In [None]:
def calculate_optimizer_memory(param_count: float, bytes_per_param: int) -> float:
    """Calculates the memory in GB for Adam optimizer states.

    Args:
      param_count: The total number of parameters in the model.
      bytes_per_param: The number of bytes used to store a single parameter.

    Returns:
      The total memory required for optimizer states, in gigabytes.
    """
    total_bytes = ...  # Add your code here.

    return formatting.bytes_to_gb(total_bytes)

In [None]:
# @title Run this cell to test your optimizer calculation
feedback.test_calculate_optimizer_memory(calculate_optimizer_memory)
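
To see where the factor of 2 comes from, here is a minimal sketch of a single Adam update on plain Python lists. It is a toy illustration rather than a production optimizer: the function name `adam_step`, the example values, and the hyperparameter defaults (commonly used values for Adam) are assumptions. The two lists `m` and `v` are the two states kept for every parameter.

```python
import math


def adam_step(
    params: list[float],
    grads: list[float],
    m: list[float],
    v: list[float],
    step: int,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> None:
    """Applies one Adam update in place to toy Python lists."""
    for i, g in enumerate(grads):
        # State 1: running average of the gradient (first moment).
        m[i] = beta1 * m[i] + (1 - beta1) * g
        # State 2: running average of the squared gradient (second moment).
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        # Bias-corrected estimates of both moments.
        m_hat = m[i] / (1 - beta1**step)
        v_hat = v[i] / (1 - beta2**step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


params = [0.5, -1.0, 2.0]
grads = [1.0, -2.0, 4.0]
m = [0.0] * len(params)  # One first-moment value per parameter.
v = [0.0] * len(params)  # One second-moment value per parameter.

adam_step(params, grads, m, v, step=1)
print(len(m) + len(v), "state values for", len(params), "parameters")
```

Both state lists have the same length as the parameter list, so together they occupy twice the parameter memory when stored at the same precision.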

## Coding Activity 5: Calculate activation memory

When doing a forward pass, the model needs to store the results of each computation (also known as **activations**) in the GPU.

The activation for a single token after passing through one transformer layer is a vector. The size of this vector is the embedding dimension.
* $\text{memory for one token's activation} = \text{embedding dimension} \times \text{bytes per parameter}$

During the forward pass, this activation vector is stored for every single token in the sequence.
* $\text{memory for one sequence} = \text{maximum sequence length} \times \text{memory for one token's activation}$

During training, the activations from every layer are kept in memory because they are needed for the backward pass.
* $\text{memory for all layers} = \text{number of layers} \times \text{memory for one sequence}$

Finally, this entire process happens for every example in the batch simultaneously. The GPU must hold the activations for all of them at once.
* $\text{total activation memory} = \text{batch size} \times \text{memory for all layers}$

------
> 💻 **Your task**:
>
>Complete the function `calculate_activation_memory` in the following cell.
>
> Use the information about activations provided above to derive the formula for the activation memory, and then implement this computation in the function in the following cell.
>
> Once you have implemented this function, run the following two cells to define the function and test your code.
>
------

In [None]:
def calculate_activation_memory(
    batch_size: int,
    max_length: int,
    num_layers: int,
    embedding_dim: int,
    bytes_per_param: int,
) -> float:
    """Estimates the memory in GB required for activations using a simplified
    formula.

    Args:
      batch_size: The number of sequences in the batch.
      max_length: The length of each sequence in tokens after padding.
      num_layers: The number of transformer layers in the model.
      embedding_dim: The hidden dimension size of the model.
      bytes_per_param: The number of bytes used for the activation values.

    Returns:
      The estimated total memory for activations, in gigabytes.
    """
    total_bytes = ... # Add your code here.

    return formatting.bytes_to_gb(total_bytes)

In [None]:
# @title Run this cell to test your implementation.
feedback.test_calculate_activation_memory(calculate_activation_memory)
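
Unlike parameter memory, activation memory depends on the training settings. The sketch below uses the same simplified estimate to show how it scales with the batch size. The helper name `activation_bytes` and the batch sizes other than 8 are illustrative assumptions. The remaining values mirror this lab's constants, and results are reported in decimal GB (1 GB = 10^9 bytes), which may differ from the convention used by `formatting.bytes_to_gb`.

```python
def activation_bytes(
    batch_size: int,
    max_length: int,
    num_layers: int,
    embedding_dim: int,
    bytes_per_value: int,
) -> int:
    """Simplified estimate: one vector per token, per layer, per example."""
    return (
        batch_size * max_length * num_layers * embedding_dim * bytes_per_value
    )


# Values mirroring the lab's constants, with 4 bytes per activation value.
for batch_size in [1, 8, 32]:
    num_bytes = activation_bytes(batch_size, 1024, 32, 2560, 4)
    print(f"batch_size={batch_size:>2}: {num_bytes / 1e9:.2f} GB")
```

Because the estimate is linear in the batch size, reducing the batch size is one direct way to lower activation memory, while parameter, gradient, and optimizer memory stay unchanged.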

## Computing total memory

Now that you have created the functions to calculate memory for each component, run the next cell to compute the total GPU memory required for Gemma-4B. The cell uses all the individual functions that you implemented in this lab to output a breakdown of the memory requirements of the individual components.

In [None]:
# Calculate memory for each component using your functions.
param_mem = calculate_param_memory(PARAM_COUNT, BYTES_PER_PARAMETER)
input_data_mem = calculate_input_data_memory(
    BATCH_SIZE, MAX_LENGTH, BYTES_PER_PARAMETER
)
grad_mem = calculate_gradient_memory(PARAM_COUNT, BYTES_PER_PARAMETER)
optim_mem = calculate_optimizer_memory(PARAM_COUNT, BYTES_PER_PARAMETER)
activ_mem = calculate_activation_memory(
    BATCH_SIZE, MAX_LENGTH, NUM_LAYERS, EMBEDDING_DIM, BYTES_PER_PARAMETER
)

total_inference_memory = param_mem + input_data_mem + activ_mem

# Sum them.
total_training_memory = total_inference_memory + grad_mem + optim_mem

# Display the results.
display(HTML("<h3>--- GPU memory consumption breakdown ---</h3>"))
display(HTML("<h4>------ During training and inference ------</h4>"))

# Use the existing function for each component.
formatting.display_memory("model parameters", param_mem)
formatting.display_memory("input data batch", input_data_mem, decimal_places=6)
formatting.display_memory("activations", activ_mem)
display(HTML("<h4>------ During training only ------</h4>"))
formatting.display_memory("gradients", grad_mem)
formatting.display_memory("optimizer states (Adam)", optim_mem)

# Display the formatted separator and total.
display(
    HTML(
        f"<h3>Total estimated GPU memory required during inference: "
        f"{total_inference_memory:.2f} GB</h3>"
    )
)
display(
    HTML(
        f"<h3>Total estimated GPU memory required during training or "
        f"fine-tuning: {total_training_memory:.2f} GB</h3>"
    )
)

Look at the total estimated GPU memory from the cell above. The Google Colab environment typically provides a GPU which has slightly less than **16 GB of memory**.

As your calculation shows, both performing inference with the 4-billion-parameter Gemma model and training or fine-tuning it with these settings require more than 16 GB of memory. In its current form, it is impossible to fit all of the required data onto this GPU. As you observed in the previous lab, attempting to load this model, train it with these settings, or perform full-parameter fine-tuning would immediately fail with an "out-of-memory" error.

Fortunately, this memory limitation is a common problem for which a number of solutions have been developed. In the following activities, you will learn about several techniques to improve hardware efficiency. These will allow you to fine-tune larger models using the GPU you have available.

## Reducing memory requirements with bfloat16

In the article "Smart numbers," you learned about the **bfloat16** format that was developed specifically for reducing memory requirements of deep learning models without significantly sacrificing model performance. In the final activity of this lab, compute the memory requirements of the Gemma-4B model when the parameters, gradients, and activations are stored as bfloat16 numbers instead of 32-bit floating point numbers.

In [None]:
# Set the number of bytes per parameter to 2 as bfloat16 uses only two bytes
# to store each parameter, gradient, and activation.
BYTES_PER_PARAMETER = 2

param_mem = calculate_param_memory(PARAM_COUNT, BYTES_PER_PARAMETER)
input_data_mem = calculate_input_data_memory(
    BATCH_SIZE, MAX_LENGTH, BYTES_PER_PARAMETER
)
grad_mem = calculate_gradient_memory(PARAM_COUNT, BYTES_PER_PARAMETER)
optim_mem = calculate_optimizer_memory(PARAM_COUNT, BYTES_PER_PARAMETER)
activ_mem = calculate_activation_memory(
    BATCH_SIZE, MAX_LENGTH, NUM_LAYERS, EMBEDDING_DIM, BYTES_PER_PARAMETER
)

total_inference_memory = param_mem + input_data_mem + activ_mem

# Sum them.
total_training_memory = total_inference_memory + grad_mem + optim_mem

# Display the results.
display(HTML("<h3>--- GPU memory consumption breakdown ---</h3>"))
display(HTML("<h4>------ During training and inference ------</h4>"))

# Use the existing function for each component.
formatting.display_memory("model parameters", param_mem)
formatting.display_memory("input data batch", input_data_mem, decimal_places=6)
formatting.display_memory("activations", activ_mem)
display(HTML("<h4>------ During training only ------</h4>"))
formatting.display_memory("gradients", grad_mem)
formatting.display_memory("optimizer states (Adam)", optim_mem)

# Display the formatted separator and total.
display(
    HTML(
        f"<h3>Total estimated GPU memory required during inference: "
        f"{total_inference_memory:.2f} GB</h3>"
    )
)
display(
    HTML(
        f"<h3>Total estimated GPU memory required during training or "
        f"fine-tuning: {total_training_memory:.2f} GB</h3>"
    )
)

As this computation shows, representing numbers in bfloat16 halves the memory requirements. This makes it possible, for example, to perform inference with a 4-billion-parameter model on a GPU with 16 GB of memory. You will make use of this technique in the next lab, where you will load and fine-tune the Gemma-4B model using this more efficient number representation for storing parameters, gradients, and optimizer states.
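
The comparison between the two precisions can also be sketched in a single self-contained snippet. The helper `estimate_gb` and the constant `GPU_BUDGET_GB` are hypothetical names introduced for illustration. The numbers mirror this lab's constants and simplified formulas, and results are reported in decimal GB (1 GB = 10^9 bytes), which may differ from the convention used by `formatting.bytes_to_gb`.

```python
GPU_BUDGET_GB = 16  # Assumption: nominal memory capacity of the GPU.


def estimate_gb(bytes_per_value: int) -> tuple[float, float]:
    """Returns (inference, training) memory estimates in decimal GB."""
    param_count = 4e9
    params = param_count * bytes_per_value
    # Following the lab's cells, the input batch reuses the same byte width.
    inputs = 8 * 1024 * bytes_per_value
    activations = 8 * 1024 * 32 * 2560 * bytes_per_value
    gradients = param_count * bytes_per_value
    optimizer = 2 * param_count * bytes_per_value  # Two Adam states.

    inference = params + inputs + activations
    training = inference + gradients + optimizer
    return inference / 1e9, training / 1e9


for label, width in [("FP32", 4), ("bfloat16", 2)]:
    inference, training = estimate_gb(width)
    print(
        f"{label}: inference {inference:.2f} GB "
        f"(fits: {inference <= GPU_BUDGET_GB}), "
        f"training {training:.2f} GB (fits: {training <= GPU_BUDGET_GB})"
    )
```

Under these simplified formulas, the inference estimate drops below the 16 GB budget with bfloat16, while the full training estimate remains above it.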

## Summary

This lab provided a practical guide to estimating the memory required to train a language model. The five main components that need to be stored in the GPU's memory are:
 * Model parameters
 * Input data
 * Gradients
 * Optimizer states
 * Activations

A model can fail to train because these components together require more memory than is available on a GPU. In the next activity, you will apply techniques to overcome these memory limitations.

## Solutions

The following cells provide reference solutions to the coding activities in this notebook. If you really get stuck after trying to solve the activities yourself, you may want to consult these solutions.

It is recommended that you *only* look at the solutions after you have tried to solve the activities *multiple times*. The best way to learn challenging concepts in computer science and artificial intelligence is to debug your code piece-by-piece until it works, rather than copying existing solutions.

If you feel stuck, you may want to first try to debug your code. For example, by adding additional print statements to see what your code is doing at every step. This will provide you with a much deeper understanding of the code and the materials. It will also provide you with practice on how to solve challenging coding problems beyond this course.

To view the solutions for an activity, click on the arrow to the left of the activity name. If you consult the solutions, do not copy and paste them into the cells above. Instead, look at them, and type them manually into the cell. This will help you understand where you went wrong.

### Coding Activity 1

In [None]:
def calculate_param_memory(param_count: float, bytes_per_param: int) -> float:
    """Calculates the memory in GB required to store the model parameters.

    Args:
      param_count: The total number of parameters in the model.
      bytes_per_param: The number of bytes used to store a single parameter.

    Returns:
      The total memory required for parameters, in gigabytes.
    """
    total_bytes = param_count * bytes_per_param
    return formatting.bytes_to_gb(total_bytes)

### Coding Activity 2

In [None]:
def calculate_input_data_memory(
    batch_size: int, max_length: int, bytes_per_token_id: int
) -> float:
    """Calculates the memory in GB required to store a batch of input token IDs.

    Args:
      batch_size: The number of sequences in a single batch.
      max_length: The length of each sequence in tokens after padding.
      bytes_per_token_id: The number of bytes for each token ID.

    Returns:
      The total memory required for the input data batch, in gigabytes.
    """
    total_bytes = batch_size * max_length * bytes_per_token_id
    return formatting.bytes_to_gb(total_bytes)

### Coding Activity 3

In [None]:
def calculate_gradient_memory(param_count: float, bytes_per_param: int) -> float:
    """Calculates the memory in GB required to store the gradients.

    Args:
      param_count: The total number of parameters in the model.
      bytes_per_param: The number of bytes used to store a single parameter.

    Returns:
      The total memory required for gradients, in gigabytes.
    """
    total_bytes = param_count * bytes_per_param
    return formatting.bytes_to_gb(total_bytes)

### Coding Activity 4

In [None]:
def calculate_optimizer_memory(param_count: float, bytes_per_param: int) -> float:
    """Calculates the memory in GB for Adam optimizer states.

    Args:
      param_count: The total number of parameters in the model.
      bytes_per_param: The number of bytes used to store a single parameter.

    Returns:
      The total memory required for optimizer states, in gigabytes.
    """
    total_bytes = 2 * param_count * bytes_per_param
    return formatting.bytes_to_gb(total_bytes)

### Coding Activity 5

In [None]:
def calculate_activation_memory(
    batch_size: int,
    max_length: int,
    num_layers: int,
    embedding_dim: int,
    bytes_per_param: int,
) -> float:
    """Estimates the memory in GB required for activations using a simplified formula.

    Args:
      batch_size: The number of sequences in the batch.
      max_length: The length of each sequence in tokens after padding.
      num_layers: The number of transformer layers in the model.
      embedding_dim: The hidden dimension size of the model.
      bytes_per_param: The number of bytes used for the activation values.

    Returns:
      The estimated total memory for activations, in gigabytes.
    """
    total_bytes = (
        batch_size * max_length * num_layers * embedding_dim * bytes_per_param
    )
    return formatting.bytes_to_gb(total_bytes)