# Accelerating Hugging Face Gemma model finetuning with Transformer Engine

In the previous [tutorial](../te_llama/tutorial_accelerate_hf_llama_finetuning_with_te.ipynb), we demonstrated how to accelerate HF Llama models using the Transformer Engine library. We replaced `LlamaDecoderLayer` with `TransformerLayer` from the Transformer Engine, achieving a speedup. Furthermore, we conducted the finetuning in FP8 precision, which yielded an additional speedup.

Now, we will apply the same approach to Google's [Gemma](https://blog.google/technology/developers/gemma-open-models/) model.

## Dependencies for this tutorial

The following files and media are needed to run this tutorial:

1. `te_gemma.py`
    - This file contains the code to load a Hugging Face Gemma checkpoint in Transformer Engine's `TransformerLayer` instead of Hugging Face's `GemmaDecoderLayer`. This is used in the following two sections of the tutorial - "Improvement 1" and "Improvement 2".
2. `utils.py`
    - This file contains the code for data loading, hyperparameters, setting up the model/optimizer/accelerator, model training, and other miscellaneous tasks such as restarting the Jupyter notebook from within a cell.
3. `requirements.txt`
    - This file contains necessary Python packages for this tutorial.
4. `media/`
    - This directory contains the images used in the following tutorial.

In [None]:
%pip install -r requirements.txt

import torch
cudnn_version = torch.backends.cudnn.version()  # returns None if cuDNN is unavailable
assert cudnn_version is not None and cudnn_version >= 90100, "cuDNN version >= 9.1.0 is needed to run this tutorial."

## Differences between Llama and Gemma

Llama and Gemma are very similar models: both are based on the Transformer decoder architecture. The most important architectural differences between them are the following:


| Feature                                      | Llama                              | Gemma                                      |
|----------------------------------------------|------------------------------------|--------------------------------------------|
| **Norm Layer**                               | Standard RMSNorm <br> $y = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \varepsilon}} * \gamma$                   | RMSNorm with zero-centered gamma parameter <br>  $y = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \varepsilon}} * (\textcolor{red}{1 +} \gamma)$   |
| **Embedding Dimension / Total Attention Dimension (heads × head dim)** | 4096/4096                              | 3072/4096                                  |
| **Activation Function**                      | SwiGLU                             | GeGLU                                      |
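
To make the norm and activation rows of the table concrete, here is a minimal pure-Python reference sketch. It is not the fused kernel code that Transformer Engine or Hugging Face actually runs. The function names (`rms_norm`, `glu`, `gelu`, `silu`) are illustrative, and the exact erf-based GELU is an assumption, since real implementations may use a tanh approximation instead.

```python
import math


def rms_norm(x, gamma, eps=1e-6, zero_centered_gamma=False):
    """Reference RMSNorm over a 1-D list of floats.

    With zero_centered_gamma=True the stored weight is an offset from 1,
    so the effective scale is (1 + gamma), as in the Gemma column.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    scale = [(1.0 + g) if zero_centered_gamma else g for g in gamma]
    return [v * inv_rms * s for v, s in zip(x, scale)]


def gelu(v):
    # Exact (erf-based) GELU; an assumption made for this sketch.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def silu(v):
    # SiLU / Swish: v * sigmoid(v).
    return v / (1.0 + math.exp(-v))


def glu(gate, up, act):
    """Gated linear unit: act(gate) * up, elementwise.

    act=silu gives SwiGLU (Llama column), act=gelu gives GeGLU (Gemma column).
    """
    return [act(g) * u for g, u in zip(gate, up)]


x = [1.0, -2.0, 3.0, -4.0]

# A standard weight of 1 and a zero-centered weight of 0 describe the same scale.
llama_style = rms_norm(x, gamma=[1.0] * 4)
gemma_style = rms_norm(x, gamma=[0.0] * 4, zero_centered_gamma=True)

swiglu_out = glu(gate=x, up=[1.0] * 4, act=silu)
geglu_out = glu(gate=x, up=[1.0] * 4, act=gelu)
```

The practical consequence is that a zero-centered gamma of 0 plays the role of a standard gamma of 1, so the two conventions cannot be mixed when loading a checkpoint without also telling the layer which one the weights use.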


## [Baseline] Running HF `GemmaModel` (Precision: `BF16`)

Similarly to the Llama tutorial, we begin the experiments by running baseline Hugging Face Gemma model finetuning in BF16 precision.

<div class="alert alert-info">

<b>Note</b>
    
This tutorial loads and trains a Gemma 7B model, which takes up most of the GPU memory. We therefore need to restart the Jupyter notebook before running each of the following sections. The accompanying `utils.py` file defines a small utility function, `restart_jupyter_notebook`, which restarts the notebook so that GPU memory is flushed before the model is loaded again from the checkpoint. This avoids OOM (Out Of Memory) errors.

If the utility doesn't work, comment out the line `restart_jupyter_notebook()` in the following cell and manually restart the Jupyter notebook before running the cell. Do the same for the other sections in this tutorial.

</div>


In [1]:
# Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
restart_jupyter_notebook()


# Import necessary packages and methods
from utils import *


# Default hyperparams, also defined in `utils.py` in class `Hyperparameters`
## !!! `model_name` attr must point to the location of the model weights !!!
## Weights can be downloaded from: https://huggingface.co/google/gemma-7b
hyperparams.model_name = "../../../../gemma-7b" # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.mixed_precision = "bf16"


# Init the model and accelerator wrapper
model = init_baseline_model(hyperparams).cuda()
accelerator, model, optimizer, train_dataloader, lr_scheduler = wrap_with_accelerator(model, hyperparams)


# Finetune the model
finetune_model(model, hyperparams, accelerator, train_dataloader, optimizer, lr_scheduler)

10 finetuning steps complete!

Average time taken per step: 298 milliseconds


Let's add this result to a table and compare it against a few possible improvements in the following sections:

| Models                                                      | Precision | Step Time (or ms per batch) | Speedup (over baseline) |
|-------------------------------------------------------------|-----------|-----------------------------|-------------------------|
| HF (baseline)                                               | BF16      | 298                         | 1                       |

## [Improvement 1] Replace HF's `GemmaDecoderLayer` with TE's `TransformerLayer` (Precision: `BF16`)

We replace *GemmaDecoderLayer* with the highly tuned *TransformerLayer*, similarly to our approach in the [Llama tutorial](../te_llama/tutorial_accelerate_hf_llama_finetuning_with_te.ipynb). Let's observe the impact this change has on the model's speed.
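
As a rough illustration of what this replacement has to account for, the sketch below maps Gemma-style config fields onto keyword arguments of `transformer_engine.pytorch.TransformerLayer`. This is an assumption-laden sketch, not the contents of `te_gemma.py`. `GemmaLikeConfig` and `te_transformer_layer_kwargs` are hypothetical names, the default values are what we believe the `gemma-7b` config contains (verify against the checkpoint's `config.json`), and the real wrapper may pass additional or different arguments.

```python
from dataclasses import dataclass


@dataclass
class GemmaLikeConfig:
    """Hypothetical stand-in for the relevant fields of a Hugging Face Gemma config."""
    hidden_size: int = 3072
    intermediate_size: int = 24576
    num_attention_heads: int = 16
    num_key_value_heads: int = 16
    head_dim: int = 256
    rms_norm_eps: float = 1e-6


def te_transformer_layer_kwargs(config):
    """Build keyword arguments for TE's TransformerLayer (illustrative only)."""
    return {
        "hidden_size": config.hidden_size,
        "ffn_hidden_size": config.intermediate_size,
        "num_attention_heads": config.num_attention_heads,
        "num_gqa_groups": config.num_key_value_heads,
        # Head size is set explicitly because heads * head_dim != hidden_size.
        "kv_channels": config.head_dim,
        "layernorm_epsilon": config.rms_norm_eps,
        # The Gemma-specific choices from the comparison table:
        "normalization": "RMSNorm",
        "zero_centered_gamma": True,
        "activation": "geglu",
        "bias": False,
        "hidden_dropout": 0.0,
        "attention_dropout": 0.0,
    }


kwargs = te_transformer_layer_kwargs(GemmaLikeConfig())
attention_width = kwargs["num_attention_heads"] * kwargs["kv_channels"]
print(kwargs["hidden_size"], attention_width)

# On a machine with Transformer Engine installed, the layer could then be
# created along the lines of:
#   import transformer_engine.pytorch as te
#   layer = te.TransformerLayer(**kwargs)
```

The printed pair corresponds to the 3072/4096 entry in the comparison table: the attention projections are wider than the embedding dimension, which is why the head size cannot be derived from `hidden_size` alone.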

In [1]:
# Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
restart_jupyter_notebook()


# Import necessary packages and methods
from utils import *


# Default hyperparams, also defined in `utils.py` in class `Hyperparameters`
## !!! `model_name` attr must point to the location of the model weights !!!
## Weights can be downloaded from: https://huggingface.co/google/gemma-7b
hyperparams.model_name = "../../../../gemma-7b"  # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.mixed_precision = "bf16"


# Init the model and accelerator wrapper
model = init_te_gemma_model(hyperparams).cuda()
accelerator, model, optimizer, train_dataloader, lr_scheduler = wrap_with_accelerator(model, hyperparams)


# Finetune the model
finetune_model(model, hyperparams, accelerator, train_dataloader, optimizer, lr_scheduler)

10 finetuning steps complete!

Average time taken per step: 257 milliseconds


Compared to the "baseline" implementation, we see that using Transformer Engine's `TransformerLayer` in place of Hugging Face's `GemmaDecoderLayer` gives a speedup of **16%** even when using only BF16 precision!

| Models                                                      | Precision | Step Time (or ms per batch) | Speedup (over baseline) |
|-------------------------------------------------------------|-----------|-----------------------------|-------------------------|
| HF (baseline)                                               | BF16      | 298                         | 1                       |
| TE (replace `GemmaDecoderLayer` with `TE.TransformerLayer`) | BF16      | 257                         | 1.16                    |

## [Improvement 2] Replace HF's `GemmaDecoderLayer` with TE's `TransformerLayer` (Precision: `FP8`)

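The last improvement enables FP8 precision by setting `hyperparams.mixed_precision = "fp8"`; the model setup is otherwise the same as in Improvement 1. Let's see how it performs.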

In [1]:
# Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
restart_jupyter_notebook()


# Import necessary packages and methods
from utils import *


# Default hyperparams, also defined in `utils.py` in class `Hyperparameters`
## !!! `model_name` attr must point to the location of the model weights !!!
## Weights can be downloaded from: https://huggingface.co/google/gemma-7b
hyperparams.model_name = "../../../../gemma-7b"  # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.mixed_precision = "fp8"


# Init the model and accelerator wrapper
model = init_te_gemma_model(hyperparams).cuda()
accelerator, model, optimizer, train_dataloader, lr_scheduler = wrap_with_accelerator(model, hyperparams)


# Finetune the model
finetune_model(model, hyperparams, accelerator, train_dataloader, optimizer, lr_scheduler)

10 finetuning steps complete!

Average time taken per step: 214 milliseconds


| Models                                                      | Precision | Step Time (or ms per batch) | Speedup (over baseline) |
|-------------------------------------------------------------|-----------|-----------------------------|-------------------------|
| HF (baseline)                                               | BF16      | 298                         | 1                       |
| TE (replace `GemmaDecoderLayer` with `TE.TransformerLayer`) | BF16      | 257                         | 1.16                    |
| TE (replace `GemmaDecoderLayer` with `TE.TransformerLayer`) | FP8       | 214                         | 1.39                    |


After turning on FP8 precision, the speedup over the baseline rises to about **39%**!
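
The speedup column is simply the baseline step time divided by each configuration's step time. The short sketch below reproduces the figures from the measured step times reported in this tutorial; the dictionary keys are illustrative labels, not identifiers from `utils.py`.

```python
# Measured average step times (milliseconds) reported in this tutorial.
step_time_ms = {
    "HF BF16 (baseline)": 298,
    "TE BF16": 257,
    "TE FP8": 214,
}

baseline_ms = step_time_ms["HF BF16 (baseline)"]

# Speedup over baseline = baseline step time / step time of the configuration.
speedups = {name: baseline_ms / t for name, t in step_time_ms.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x ({(speedup - 1) * 100:.0f}% faster than baseline)")
```

Note that these numbers come from only 10 finetuning steps on one system, so they are best read as indicative rather than as precise benchmarks.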

## Conclusion

As in the [Llama tutorial](../te_llama/tutorial_accelerate_hf_llama_finetuning_with_te.ipynb), replacing Hugging Face's `GemmaDecoderLayer` with the `TransformerLayer` module from Transformer Engine speeds up finetuning compared to Hugging Face's native Gemma implementation: by about 16% in BF16 and about 39% with FP8 in the measurements above.

## See more

We have also prepared a [tutorial](./tutorial_generation_gemma_with_te.ipynb) that shows how to speed up Gemma model generation using Transformer Engine.