# Accelerating Generation of the Hugging Face Gemma Model with Transformer Engine

Generative AI has made remarkable strides in recent years, with Large Language Models (LLMs) like ChatGPT at the forefront. These models have revolutionized how we interact with machine-generated content, providing capabilities that range from writing assistance to complex decision support. The core functionality of these models is the generation process, which involves predicting the next token in a sequence based on the preceding text. This task is critical for applications such as automated content creation, translation, and more, emphasizing the importance of efficient implementation.

For a deeper understanding of text generation mechanisms in Transformers, we recommend the [Hugging Face generation tutorial](https://huggingface.co/docs/transformers/llm_tutorial).

In our previous tutorials on [Llama](../te_llama/tutorial_accelerate_hf_llama_with_te.ipynb) and [Gemma](./tutorial_accelerate_hf_gemma_with_te.ipynb), we demonstrated how finetuning can be accelerated using the Transformer Engine's `TransformerLayer`. Building on this foundation, our current objective is to enhance the generation speed of the Gemma model.

This tutorial will introduce and explain several advanced features of the Transformer Engine that contribute to this goal:

##### 1. THD Attention Layout.

Computing attention for sequences of varying lengths is usually handled by padding the sequences and applying an attention mask. Transformer Engine offers a more efficient approach: given the lengths and offsets of the sequences, attention can be computed directly on the unpadded data. Instead of passing a tensor of shape `[b, s, h, d]` together with a mask, one can pass a tensor of shape `[t, h, d]`, where `t` is the total number of tokens in the batch, along with tensors describing the sequence lengths and offsets. This attention layout is referred to as the **THD layout**.

<center>
<img src="./media/bshd_attention_1.png" alt="" width= "400"><br>
Fig. 1. The sequences and the mask for standard attention layout - padding from the end.<br><br>
<img src="./media/bshd_attention_2.png" alt="" width="400"><br>
Fig. 2. The sequences and the mask for standard attention layout - padding from the beginning.<br><br>
<img src="./media/thd_attention.png" alt="" width="400"><br>
Fig. 3. Attention with the THD layout.<br><br>
</center>
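To make the difference between the two layouts concrete, here is a minimal pure-Python sketch of packing a right-padded batch into a THD-style buffer. The helper names `pack_thd` and `unpack_sequence` are hypothetical and not part of Transformer Engine; each integer stands in for one token's `[h, d]` slice.

```python
from itertools import accumulate

PAD = 0  # placeholder value used for padding in this toy example


def pack_thd(padded, lengths):
    """Drop padding from a right-padded batch; return (tokens, cu_seqlens)."""
    tokens = [tok for row, n in zip(padded, lengths) for tok in row[:n]]
    # cu_seqlens[i] is the index of the first token of sequence i in `tokens`.
    cu_seqlens = [0, *accumulate(lengths)]
    return tokens, cu_seqlens


def unpack_sequence(tokens, cu_seqlens, i):
    """Recover sequence i from the packed buffer using the cumulative lengths."""
    return tokens[cu_seqlens[i]:cu_seqlens[i + 1]]


# A batch of b=3 sequences padded to s=4 (12 slots, 8 real tokens).
padded = [
    [11, 12, 13, PAD],
    [21, PAD, PAD, PAD],
    [31, 32, 33, 34],
]
lengths = [3, 1, 4]

tokens, cu_seqlens = pack_thd(padded, lengths)
print(tokens)      # [11, 12, 13, 21, 31, 32, 33, 34]
print(cu_seqlens)  # [0, 3, 4, 8]
print(unpack_sequence(tokens, cu_seqlens, 2))  # [31, 32, 33, 34]
```

In this toy batch the packed buffer holds 8 tokens instead of 12 slots, and no mask is needed because the cumulative lengths already describe where each sequence starts and ends.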

##### 2. CUDA Graphs API.

The speed of GPUs is increasing at a rapid pace. It turns out that sometimes the runtime of kernels is shorter than the time it takes for the CPU to submit them, which can lead to significant overhead. CUDA Graphs were developed to address this issue. When certain kernels are executed repeatedly, this tool allows us to record and replay them without CPU involvement. This becomes particularly useful in applications like text generation, where a `TransformerLayer` is run for every token that needs to be generated.

We recommend reading further about CUDA Graphs [here](https://developer.nvidia.com/blog/cuda-graphs/).

PyTorch exposes graphs via a raw `torch.cuda.CUDAGraph` class and two convenience wrappers, `torch.cuda.graph` and `torch.cuda.make_graphed_callables`. More information about CUDA Graphs in PyTorch can be found [here](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/).

Transformer Engine supports CUDA Graphs from version 1.5.


<center>
<img src="./media/graphs.png" alt=""><br>
Fig. 4. CUDA Graphs speedup.<br><br>
</center>
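The effect of launch overhead can be illustrated with a deliberately simple cost model. Everything in the sketch below is an assumption made for illustration: the function `step_time_us` is hypothetical, and the kernel count and timings are made-up numbers, not measurements from this tutorial.

```python
def step_time_us(num_kernels, kernel_us, launch_us, graph_launch_us=None):
    """Estimate the time of one decoding step in microseconds (toy model)."""
    gpu_time = num_kernels * kernel_us
    if graph_launch_us is None:
        # Eager mode: the CPU submits kernels one by one, so the step cannot
        # finish faster than the CPU is able to submit all of them.
        cpu_time = num_kernels * launch_us
        return max(gpu_time, cpu_time)
    # Graph replay: a single launch submits the whole recorded sequence.
    return graph_launch_us + gpu_time


# Hypothetical numbers: 500 short kernels per step, 3 us of GPU work each,
# 8 us of CPU time to launch each one, 20 us to launch a whole graph.
eager = step_time_us(num_kernels=500, kernel_us=3, launch_us=8)
graphed = step_time_us(num_kernels=500, kernel_us=3, launch_us=8, graph_launch_us=20)

print(eager)    # 4000
print(graphed)  # 1520
```

In this model the eager step is bound by CPU submission time, while the graphed step is bound by the GPU work itself, which is the situation CUDA Graphs are meant to restore.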


##### 3. FP8 Weights Calibration.

Assuming that we have a model trained in FP32/BF16 precision and we wish to execute it in FP8 precision, the process isn't straightforward due to the absence of appropriate FP8 scaling factors. In this scenario, FP8 calibration becomes essential. By conducting several forward passes on sample data, we can compute the FP8 scaling parameters. This calibration allows the model to operate correctly in FP8 precision.

We highly recommend familiarizing yourself with the [tutorial](../../examples/fp8_primer.ipynb) on FP8 precision to understand the importance of proper scaling factors.


<center>
<img src="./media/calibration.png" alt="" ><br>
Fig. 5. The weights calibration.<br><br>
</center>
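As a rough illustration of what calibration produces, the sketch below derives a scaling factor from a history of observed absolute maxima. It is a simplified stand-in, not Transformer Engine's implementation: the function `compute_scale` and the sample history are hypothetical, while 448 and 57344 are the largest finite values of the E4M3 and E5M2 formats.

```python
# Largest finite magnitudes representable in the two FP8 formats.
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


def compute_scale(amax_history, fp8_format="e4m3", margin=0):
    """Return a scale that maps the largest observed value onto the FP8 maximum."""
    amax = max(amax_history)  # same idea as amax_compute_algo="max"
    if amax <= 0.0:
        return 1.0  # nothing observed yet, keep the tensor unscaled
    return FP8_MAX[fp8_format] / amax / (2 ** margin)


# Hypothetical absolute maxima collected over four calibration iterations.
amax_history = [0.5, 1.75, 3.5, 2.0]
scale = compute_scale(amax_history)

print(scale)        # 128.0
print(3.5 * scale)  # 448.0, the largest observed value lands on the E4M3 maximum
```

Without such a factor, a tensor whose values are much smaller or much larger than the FP8 range would lose most of its information when cast, which is why the forward passes on sample data are needed first.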

##### 4. FP8 Model Weights.

The typical approach is to store weights in higher precision and then cast them to FP8 before operations. This is critical during training, as it allows us to store some values in high precision to avoid performance drops. However, for inference, this level of precision is not necessary.

Transformer Engine includes a feature called `fp8_model_init`, which allows for the creation of models that store only the FP8 copy of the weights. This eliminates the need to cast from higher precision to FP8, saving the time spent on this casting process. Additionally, it helps reduce memory consumption, which can be used to increase the batch size, resulting in even greater speedup.


<center>
<img src="./media/fp8_model_init.png" alt="" ><br>
Fig. 6. Saving memory with fp8_model_init().<br><br>
</center>

#### Benchmarking

We'll evaluate the generation time across one benchmark: generation with context phase max sequence length = 128, batch size = 64 and number of generated tokens = 1024 - 128.

<div class="alert alert-info">
<b>Note</b>
    
This tutorial focuses on showcasing the mentioned features of Transformer Engine in the context of generation. It's important to note, however, that NVIDIA provides another library, [TensorRT](https://developer.nvidia.com/tensorrt), which is optimized for inference tasks and should be considered for such use cases.
</div>

## Dependencies for this tutorial

The following files and media are necessary to effectively run this tutorial:

1. `te_gemma.py`
    - This file contains the code to load a Hugging Face Gemma checkpoint in Transformer Engine's `TransformerLayer` instead of Hugging Face's `GemmaDecoderLayer`. It also contains code for generation with THD attention and weight calibration.
2. `te_gemma_loading_weights.py`
    - This file contains the logic for mapping the parameters from `GemmaDecoderLayer` into the `TransformerLayer`.
3. `utils.py`
    - This file contains the code related to dataloading, hyperparameters, setting up model/optimizers/accelerator, model training and other miscellaneous tasks like restarting the Jupyter notebook from within a cell.
4. `media/`
    - This directory contains the images used in the following tutorial.

## [Baseline] Running Hugging Face generation with Gemma model

The Hugging Face Transformers library offers a generation API. We will use Hugging Face generation for the Gemma model as our baseline.

In [2]:
# Import necessary packages and methods
from utils import *

# Default hyperparams, also defined in `utils.py` in class `Hyperparameters`
## !!! `model_name` attr must point to the location of the model weights !!!
## Weights can be downloaded from: https://huggingface.co/google/gemma-7b
hyperparams.model_name = ""  # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.mixed_precision = "bf16"

model = init_baseline_model(hyperparams).cuda()

print_sample_of_generated_texts(model)
benchmark_generation(model)

Tell me something about GPUs:

1. What is the difference between a GPU and a CPU?
2. What is a GPU used for?
3. What is a GPU used for in a computer?
4. What is a GPU used for in a computer game
Tell me something about NVIDIA:

NVIDIA is a global technology company that designs and develops graphics processing units (GPUs) for the gaming, professional visualization, and data center markets. The company was founded in 1993 and is headquartered in Santa Clara, California.


Benchmarking for batch_size = 64 and max total tokens = 1024
Time: 82.04 s.


We put these times into the table for later comparison.

| Models        | Time (s) | Speedup |
|---------------|----------|---------|
| HF (baseline) | 82.04    | 1       |

## [Improvement 1] Speeding up generation by using Transformer Engine with THD attention

Similarly to the Gemma tutorial, we substitute `GemmaDecoderLayer` with `TransformerLayer` from Transformer Engine. 

Input sequences can have various lengths. The most common approach is to use padding and attention masks in such situations. We will use a more straightforward method: the THD attention layout with offsets.

<center>
<span style="display: flex; flex-direction: row; justify-content: center">
<span style="display: flex; flex-direction: column; align-items: center">
Query layer   
<img src="./media/thd_dimensions_1.png" alt="" height="200">
</span>
<span style="display: flex; flex-direction: column; align-items: center">
Key layer and value layer  
<img src="./media/thd_dimensions_2.png" alt="" height="200">
</span>
</span>
cu_seqlens_q = [0, 1, 3, 7, 9, 12] <br>
cu_seqlens_kv = [0, 1, 3, 6, 8, 10] <br>
seq_offsets_q = [0, 5, 10, 15, 20, 25] * h * d <br>
seq_offsets_k = [0, 7, 14, 21, 28, 35] * h * d <br>
seq_offsets_v = [0, 7, 14, 21, 28, 35] * h * d <br>
</center>

The class `transformer_engine.pytorch.DotProductAttention` supports this format. The following arguments need to be passed to its forward method:
- `seq_offsets_q`, `seq_offsets_k`, `seq_offsets_v` – the offsets of the beginnings of the consecutive sequences in the query, key and value buffers,
- `cu_seqlens_q`, `cu_seqlens_kv` – cumulative sums of the sequence lengths for the query and for the keys/values,
- `max_seqlen_q` – maximum sequence length in query layer,
- `max_seqlen_kv` – maximum sequence length in key-value layer.
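The example values shown above can be reproduced with a few lines of Python. This is a sketch under one assumption that the tutorial does not state explicitly: each sequence starts at a fixed stride inside its buffer (5 token slots for the query, 7 for keys and values), which is what the listed offsets suggest. The helper `thd_metadata` and the name `buffer_seq_len` are hypothetical.

```python
from itertools import accumulate


def thd_metadata(seq_lens, buffer_seq_len):
    """Return (cu_seqlens, seq_offsets) for sequences stored at a fixed stride.

    Offsets are expressed in token slots; multiply by h * d to get element offsets.
    """
    cu_seqlens = [0, *accumulate(seq_lens)]
    seq_offsets = [i * buffer_seq_len for i in range(len(seq_lens) + 1)]
    return cu_seqlens, seq_offsets


# Sequence lengths implied by the cumulative sums in the example above.
q_lens = [1, 2, 4, 2, 3]
kv_lens = [1, 2, 3, 2, 2]

cu_seqlens_q, seq_offsets_q = thd_metadata(q_lens, buffer_seq_len=5)
cu_seqlens_kv, seq_offsets_kv = thd_metadata(kv_lens, buffer_seq_len=7)

print(cu_seqlens_q)    # [0, 1, 3, 7, 9, 12]
print(cu_seqlens_kv)   # [0, 1, 3, 6, 8, 10]
print(seq_offsets_q)   # [0, 5, 10, 15, 20, 25]
print(seq_offsets_kv)  # [0, 7, 14, 21, 28, 35]
```

The cumulative lengths say how many tokens each sequence really contains, while the offsets say where each sequence begins in memory, so the two can differ whenever the buffers reserve more room than the sequences use.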

<div class="alert alert-info">

<b>Note</b>
Currently, the THD attention for `TransformerLayer` is supported only for inference.
</div>

Let's look at how using Transformer Engine with THD attention impacts the speed of generation:

In [3]:
# Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
restart_jupyter_notebook()

# Import necessary packages and methods
from utils import *

hyperparams.model_name = ""  # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.qkv_format = "thd"

# Init the model and accelerator wrapper
model = init_te_gemma_model(hyperparams).cuda()

print_sample_of_generated_texts(model)
benchmark_generation(model)

The device memory hasn't been flushed, try manually restarting the Jupyter kernel!


Tell me something about GPUs:

1. What is the difference between a GPU and a CPU?
2. What is the difference between a GPU and a graphics card?
3. What is the difference between a graphics card and a video card?
4. What is the
Tell me something about NVIDIA:

NVIDIA is a global technology company that designs and develops graphics processing units (GPUs) for the gaming and professional markets.

What is the difference between a CPU and a GPU?

A CPU (Central Processing Unit) is a computer chip that is
Benchmarking for batch_size = 64 and max total tokens = 1024
Time: 28.19 s.


By using THD attention we obtained the following speedup:

| Models                | Time (s) | Speedup |
|-----------------------|----------|---------|
| HF (baseline)         | 82.04    | 1       |
| THD attention with TE | 28.19    | 2.91    |

## [Improvement 2] Speeding up generation with CUDA Graphs

TransformerEngine includes a function `transformer_engine.pytorch.make_graphed_callables`, which functions similarly to the corresponding feature in PyTorch. It is capable of recording any modules from the Transformer Engine. Below is a code excerpt from `te_gemma.py` from class `TEGemmaForCausalLMCudaGraphs`:
```
    def __init__(self, config : GemmaConfig):
        (...)

        # Here "the trick" happens. We override methods from TEGemmaForCausalLM
        # with their recorded versions. After each of them is invoked,
        # the captured graph is replayed with minimal CPU usage,
        # which leads to a large speedup.
        (...)
        self._model_context_phase = self.record_graph(self._model_context_phase, self.hidden_states_buffer) # CUDA Graphs recording

        (...)
        self._model_generation_phase = self.record_graph(self._model_generation_phase, self.generation_buffer) # CUDA Graphs recording

    @torch.no_grad()
    def record_graph(self, function, input_tensor):
        # function is invoked on the argument (input_tensor,) and all kernels are recorded.
        # record_graph() returns the captured function, which can be run later with minimal use of the CPU.
        fp8_format = Format.HYBRID
        fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=32, amax_compute_algo="max")
        with autocast(dtype=torch.bfloat16, cache_enabled=False):
            graphed_function = te.pytorch.make_graphed_callables(
                function, 
                (input_tensor,), 
                fp8_enabled=True, 
                fp8_recipe=fp8_recipe, 
                allow_unused_input=True,
                num_warmup_iters=3
            )
        return graphed_function
```

We strongly recommend reviewing the entire code of the class `TEGemmaForCausalLMCudaGraphs`. Let us now proceed to evaluate the performance improvement offered by CUDA Graphs.

In [1]:
#Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
restart_jupyter_notebook()

from utils import *

hyperparams.model_name = ""   # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.qkv_format = "thd"

hyperparams.generation_cuda_graphs = True

# It is necessary to preallocate a static buffer.
# CUDA graphs require static input tensors for every kernel.
# This approach may result in a slight increase in memory consumption;
# however, the substantial speedup achieved makes it worthwhile.
hyperparams.cuda_graphs_static_batch_size = 64
hyperparams.cuda_graphs_static_max_seq_len = 1024
hyperparams.cuda_graphs_static_max_context_len = 128
model = init_te_gemma_model(hyperparams).cuda()

print_sample_of_generated_texts(model)
benchmark_generation(model)

Tell me something about GPUs:

1. What is the difference between a GPU and a CPU?
2. What is the difference between a GPU and a graphics card?
3. What is the difference between a graphics card and a video card?
4. What is the
Tell me something about NVIDIA:

NVIDIA is a global technology company that designs and develops graphics processing units (GPUs) for the gaming and professional markets.

What is the difference between a CPU and a GPU?

A CPU (Central Processing Unit) is a computer chip that is
Benchmarking for batch_size = 64 and max total tokens = 1024
Time: 16.81 s.


We obtained a **4.88x** speedup!

| Models                              | Time (s) | Speedup |
|-------------------------------------|----------|---------|
| HF (baseline)                       | 82.04    | 1       |
| THD attention with TE               | 28.19    | 2.91    |
| THD attention + CUDA Graphs with TE | 16.81    | 4.88    |

Let's look at the screenshots from the *NVIDIA Nsight Systems* profiler to see where this speedup comes from:
<br><br>

<center>
<span style=""> 
<img src="./media/graphs-1.png" alt="" height="200"><br>
    Fig. 7. Without CUDA Graphs. We can see that the GPU (blue) is idle most of the time.
    <br><br><br>
<img src="./media/graphs_2.png" alt="" height="200"><br>
    Fig. 8. With CUDA Graphs. We can see that the GPU (orange) is utilized.
</span>
</center>

## [Improvement 3] Running generation in FP8 of the model trained in higher precision 

We are now preparing to execute FP8 generation using the Gemma model. However, this process is not straightforward. Since the model was originally trained with BF16 precision, the FP8 scaling factors have not been computed. Operating the model at such low precision without the correct scaling could result in significant numerical errors, which in turn would produce incorrect results.

We highly recommend familiarizing yourself with the [tutorial](../../examples/fp8_primer.ipynb) on FP8 precision to understand the necessity of scaling.

##### Weight Calibration

To address the issue outlined above, we will implement weight calibration. This involves running several forward iterations at BF16 precision within the context `te.fp8_autocast(enabled=False, calibrating=True)`. This setup lets the forward pass run in higher precision while collecting `amax_history` and other FP8-related statistics, which are essential for calculating the FP8 scaling factors.

The code below outlines the steps to initialize the BF16 model and conduct several forward iterations within the specified context. After these iterations, we save the model, and these weights will be used in the following sections.

In [1]:
from utils import *
import transformer_engine.pytorch as te

hyperparams.model_name = ""  # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.fuse_qkv_params = True # This is needed by the last improvement.

model = init_te_gemma_model(hyperparams).cuda()

# Calibration
with te.fp8_autocast(enabled=False, calibrating=True), \
    torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    model.train()
    run_forward_pass(model, hyperparams, num_iters=512)

# Compute scale_fwd with enabled fp8 autocast
with te.fp8_autocast(enabled=True), \
    torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    run_forward_pass(model, hyperparams, 10)

# Some parameters point to the same tensors; we do not want to save them twice.
dict_to_save = {k: v for k, v in model.state_dict().items() \
                if ("_context_phase" not in k and "_generation_phase" not in k)}
torch.save(dict_to_save, '<calibrated_weights_path>') 

#### Generation in FP8

Now we are ready to run FP8 inference.

In [1]:
#Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
restart_jupyter_notebook()

from utils import *

hyperparams.model_name = ""   # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.qkv_format = "thd"
hyperparams.fuse_qkv_params = True # This is needed by the last improvement.

hyperparams.fp8 = True 
# We load calibrated fp8 weights directly from the file.
hyperparams.fp8_model_weights_filename = "<calibrated_weights_path>"

hyperparams.generation_cuda_graphs = True
hyperparams.cuda_graphs_static_batch_size = 64
hyperparams.cuda_graphs_static_max_seq_len = 1024
hyperparams.cuda_graphs_static_max_context_len = 128
model = init_te_gemma_model(hyperparams).cuda()

print_sample_of_generated_texts(model)
benchmark_generation(model, measure_memory=True)

Tell me something about GPUs:

* What is a GPU?
* What is a GPU used for?
* What is a GPU used for in machine learning?
* What is a GPU used for in deep learning?
* What is a GPU used for in computer vision
Tell me something about NVIDIA:

NVIDIA Corporation is an American multinational technology company headquartered in Santa Clara, California, that designs graphics processing units (GPUs) for the gaming and professional markets, as well as system on a chip units (SoCs) for the mobile computing and automotive market
Benchmarking for batch_size = 64 and max total tokens = 1024
Time: 19.32 s.
Peak GPU memory usage: 63.82 GB


We can observe that the outputs are coherent; however, the generation time has increased from 16.81 s to 19.32 s. Why is this the case?

Running the model in FP8 does not imply that all weights are stored in FP8. By default, they are stored in higher precision and are cast to FP8, using saved scaling factors, before operations such as GEMMs.

This approach is beneficial during training: one cast serves both the forward and backward passes, leading to speedups. During inference there is no backward pass to amortize the cast, so casting the weights in every forward pass introduces too much overhead to achieve a speedup. We will address this issue in the next section of the tutorial.


## [Improvement 4] Reducing memory usage with fp8_model_init()

By default, Transformer Engine stores parameters in higher precision and only casts them to FP8 for computation. The optimizer state is kept in higher precision as well. This is needed to maintain accuracy during training. However, we can get rid of the high-precision weights when doing inference.

Transformer Engine supports keeping only the FP8 copy of the weights with the `fp8_model_init` context manager. Let's see an example:
```
with te.fp8_model_init(enabled=True):
    linear = te.Linear(1024, 1024) # this module is initialized only with fp8 weights
```

Now we can try to use `fp8_model_init` in our code and look at the memory usage.

In [1]:
#Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
restart_jupyter_notebook()

# Import necessary packages and methods
from utils import *

hyperparams.model_name = "" # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.fuse_qkv_params = True # Needed for fp8_model_init().
hyperparams.qkv_format = "thd"

hyperparams.fp8 = True
hyperparams.fp8_model_init = True # This will result in storing only fp8 weights.
hyperparams.fp8_model_weights_filename = "<calibrated_weights_path>"

hyperparams.generation_cuda_graphs = True
hyperparams.cuda_graphs_static_batch_size = 64
hyperparams.cuda_graphs_static_max_seq_len = 1024
hyperparams.cuda_graphs_static_max_context_len = 128
model = init_te_gemma_model(hyperparams).cuda()

print_sample_of_generated_texts(model)
benchmark_generation(model, measure_memory=True)

Tell me something about GPUs:

* What is a GPU?
* What is a GPU used for?
* What is a GPU used for in machine learning?
* What is a GPU used for in deep learning?
* What is a GPU used for in computer vision
Tell me something about NVIDIA:

NVIDIA Corporation is an American multinational technology company headquartered in Santa Clara, California, that designs graphics processing units (GPUs) for the gaming and professional markets, as well as system on a chip units (SoCs) for the mobile computing and automotive market
Benchmarking for batch_size = 64 and max total tokens = 1024
Time: 12.18 s.
Peak GPU memory usage: 56.60 GB


We finally obtained a **6.74x** speedup.

| Models                                         | Time (s) | Speedup |
|------------------------------------------------|----------|---------|
| HF (baseline)                                  | 82.04    | 1       |
| THD attention with TE                          | 28.19    | 2.91    |
| THD attention + CUDA Graphs with TE            | 16.81    | 4.88    |
| THD attention + FP8 with TE + fp8_model_init() | 12.18    | 6.74    |

Moreover, the memory usage dropped from *63.82 GB* to *56.60 GB*. We could potentially use the freed memory to increase the batch size and obtain an even larger speedup.
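For reference, the speedups and the memory saving quoted in this tutorial follow directly from the measured values printed by the benchmark cells. The short script below recomputes them; the dictionary keys are descriptive labels chosen here, and the FP8 run without `fp8_model_init()` (19.32 s) is included for comparison.

```python
# Generation times in seconds, copied from the benchmark outputs above.
times_s = {
    "HF (baseline)": 82.04,
    "THD attention": 28.19,
    "THD attention + CUDA Graphs": 16.81,
    "FP8 without fp8_model_init()": 19.32,
    "FP8 with fp8_model_init()": 12.18,
}

baseline = times_s["HF (baseline)"]
speedups = {name: round(baseline / t, 2) for name, t in times_s.items()}

for name, speedup in speedups.items():
    print(f"{name:32s} {times_s[name]:6.2f} s  {speedup:.2f}x")

# Peak GPU memory in GB for the two FP8 runs.
mem_fp8_gb = 63.82
mem_fp8_model_init_gb = 56.60
saved_gb = round(mem_fp8_gb - mem_fp8_model_init_gb, 2)
saved_pct = round(100 * saved_gb / mem_fp8_gb, 1)
print(f"Memory saved: {saved_gb} GB ({saved_pct}%)")  # Memory saved: 7.22 GB (11.3%)
```

The FP8 run without `fp8_model_init()` comes out at about 4.25x, below the 4.88x of the BF16 run with CUDA Graphs, which matches the slowdown discussed in Improvement 3.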

## Conclusions

<center>

<img src="./media/speedups.png" alt="">
</center>

In this tutorial, we've explored four features of the Transformer Engine:
1. Support for the THD attention layout,
2. Integration with CUDA Graphs,
3. FP8 weights calibration,
4. Models containing only FP8 version of their parameters.

Each of these features can be applied in various contexts, and here we demonstrated their use for achieving fast inference. It's important to note that the fastest possible inference speeds can be achieved using NVIDIA's inference-optimized [TensorRT](https://developer.nvidia.com/tensorrt) library.