# CUDA Graphs, THD Attention, and FP8 Weight Calibration

In tutorials such as [Llama](../te_llama/tutorial_accelerate_hf_llama_with_te.ipynb) and [Gemma](./tutorial_accelerate_hf_gemma_with_te.ipynb), we've demonstrated how transformer models can be accelerated using the Transformer Engine's `TransformerLayer`. This tutorial introduces a few more advanced features:
1. THD attention layout.
2. FP8 weight calibration - enabling inference in FP8 precision for models originally trained in higher precisions.
3. CUDA Graphs API.

We will explore how these features enhance the performance of the Gemma model during generation tasks.

#### Benchmarking

We'll evaluate the generation time across two benchmarks:
- Long input sequences (up to 256 tokens) with short generation (up to 128 tokens),
- Short input sequences (up to 64 tokens) with long generation (up to 1000 tokens).

All benchmarks are conducted with a batch size of 64 using the dataset "timdettmers/openassistant-guanaco".

<div class="alert alert-info">
<b>Note</b>
    
This tutorial focuses on showcasing the mentioned features of Transformer Engine in the context of generation. It's important to note, however, that NVIDIA provides another library, [TensorRT](https://developer.nvidia.com/tensorrt), which is optimized for inference tasks and should be considered for such use cases.
</div>

## Dependencies for this tutorial

The following files and media are needed to run this tutorial:

1. `te_gemma.py`
    - This file contains the code to load a Hugging Face Gemma checkpoint in Transformer Engine's `TransformerLayer` instead of Hugging Face's `GemmaDecoderLayer`. It also contains the code for generation with THD attention and for weight calibration.
2. `utils.py`
    - This file contains the code related to data loading, hyperparameters, setting up the model/optimizers/accelerator, model training, and other miscellaneous tasks such as restarting the Jupyter notebook from within a cell.
3. `media/`
    - This directory contains the images used in the following tutorial.

## [Baseline] Running Hugging Face generation with Gemma model

The Hugging Face Transformers library offers a generation API. We will treat it as our baseline.

In [None]:
# Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
#restart_jupyter_notebook()

# Import necessary packages and methods
from utils import *

# Default hyperparams, also defined in `utils.py` in class `Hyperparameters`
## !!! `model_name` attr must point to the location of the model weights !!!
## Gemma weights are distributed through the Hugging Face Hub: https://huggingface.co/google
hyperparams.model_name = "../../../../gemma-weights"  # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.mixed_precision = "bf16"

# Init the model and accelerator wrapper
model = init_baseline_model(hyperparams).cuda()
model = model.to(torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(hyperparams.model_name)
inputs = tokenizer(["Some random initial str ", "Another string ... "] * 32, return_tensors="pt", padding=True)

inputs['input_ids'] = inputs['input_ids'].cuda()
inputs['attention_mask'] = inputs['attention_mask'].cuda()

outputs = model.generate(**inputs, max_new_tokens=100)
generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for text in generated_texts:
    print(text)
    print("=" * 100)

benchmark_generation(model)

We put these times into the table for later comparison.

| Models                                                      | max_input_len=64, max_new_tokens=1000 | max_input_len=128, max_new_tokens=128 |  
|-------------------------------------------------------------|---------------------------------------|--------------------------------------|
| HF (baseline)                                               | -      | -                         |  

## [Improvement 1] Speeding up generation by using Transformer Engine with THD attention

Similarly to the Gemma tutorial, we substitute `GemmaDecoderLayer` with `TransformerLayer` from Transformer Engine. 

Input sequences can have different lengths. The most common way to handle this is padding combined with attention masks. We will use a more direct approach: the THD attention layout with offsets.
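To make the difference from padding concrete, here is a minimal sketch in plain Python (no Transformer Engine calls) that turns a padded batch and its attention mask into a packed token list plus cumulative sequence lengths. The helper name `pack_padded_batch` and the token ids are made up for illustration. The fully packed buffer is also a simplification: the layout used in this tutorial keeps a fixed slot per sequence and addresses it with offsets.

```python
def pack_padded_batch(padded_ids, attention_mask):
    """Drop padding tokens and record where each sequence ends.

    Hypothetical helper for illustration only; it is not part of te_gemma.py.
    """
    tokens = []
    cu_seqlens = [0]  # cumulative sequence lengths, starting at 0
    for row, mask in zip(padded_ids, attention_mask):
        # Keep only positions where the mask is 1 (works for left or right padding).
        tokens.extend(tok for tok, keep in zip(row, mask) if keep)
        cu_seqlens.append(len(tokens))
    return tokens, cu_seqlens


# Three sequences of lengths 1, 2 and 3, right-padded to length 3 with token id 0.
padded_ids = [[11, 0, 0], [21, 22, 0], [31, 32, 33]]
attention_mask = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]

tokens, cu_seqlens = pack_padded_batch(padded_ids, attention_mask)
print(tokens)      # [11, 21, 22, 31, 32, 33]
print(cu_seqlens)  # [0, 1, 3, 6]
```

The padded batch stores 9 token slots, while only 6 of them carry real tokens; the cumulative lengths are enough to tell the sequences apart without a mask.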

<center>
<span style="display: flex; flex-direction: row; justify-content: center">
<span style="display: flex; flex-direction: column; align-items: center">
Query layer   
<img src="./media/pic1.png" alt="Query layer in the THD layout" height="200">
</span>
<span style="display: flex; flex-direction: column; align-items: center">
Key layer and value layer  
<img src="./media/pic2.png" alt="Key and value layers in the THD layout" height="200">
</span>
</span>
cu_seqlens_q = [0, 1, 3, 7, 9, 12] <br>
cu_seqlens_kv = [0, 1, 3, 6, 8, 10] <br>
seq_offsets_q = [0, 5, 10, 15, 20, 25] * h * d <br>
seq_offsets_k = [0, 7, 14, 21, 28, 35] * h * d <br>
seq_offsets_v = [0, 7, 14, 21, 28, 35] * h * d <br>
</center>

The class `transformer_engine.pytorch.DotProductAttention` supports this format. Pass the following arguments to its forward method:
- `seq_offsets_q`, `seq_offsets_k`, `seq_offsets_v` - offsets at which each sequence begins in the query, key, and value tensors,
- `cu_seqlens_q`, `cu_seqlens_kv` - cumulative sums of the sequence lengths in the query and key-value layers,
- `max_seqlen_q` - maximum sequence length in query layer,
- `max_seqlen_kv` - maximum sequence length in key-value layer.
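As a rough illustration, the values shown under the figure above can be reproduced from per-sequence lengths and slot sizes with a few lines of plain Python. The helper names `cumulative_seqlens` and `sequence_offsets` are hypothetical, the lengths and slot sizes are read off the figure, and `h` and `d` are set to 1 so that the offsets match the factors printed there.

```python
from itertools import accumulate


def cumulative_seqlens(lengths):
    """Cumulative sum of sequence lengths with a leading 0."""
    return [0] + list(accumulate(lengths))


def sequence_offsets(num_sequences, slots_per_sequence, h, d):
    """Start offset of each fixed-size slot, plus the end of the last slot."""
    return [i * slots_per_sequence * h * d for i in range(num_sequences + 1)]


# Sequence lengths read off the figure (assumption for this sketch).
q_lengths = [1, 2, 4, 2, 3]
kv_lengths = [1, 2, 3, 2, 2]

cu_seqlens_q = cumulative_seqlens(q_lengths)    # [0, 1, 3, 7, 9, 12]
cu_seqlens_kv = cumulative_seqlens(kv_lengths)  # [0, 1, 3, 6, 8, 10]

# With h = d = 1 the offsets equal the factors shown in the figure.
seq_offsets_q = sequence_offsets(5, slots_per_sequence=5, h=1, d=1)  # [0, 5, ..., 25]
seq_offsets_k = sequence_offsets(5, slots_per_sequence=7, h=1, d=1)  # [0, 7, ..., 35]

print(cu_seqlens_q, cu_seqlens_kv)
print(seq_offsets_q, seq_offsets_k)
```

In the tutorial code these values would be device tensors rather than Python lists; the lists here only show how the numbers relate to each other.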

<div class="alert alert-info">

<b>Note</b>
Currently, the THD attention for `TransformerLayer` is supported only for inference.
</div>

Let's see how using Transformer Engine with THD attention affects the speed of generation:

In [None]:
# Import necessary packages and methods
from utils import *

# Default hyperparams, also defined in `utils.py` in class `Hyperparameters`
## !!! `model_name` attr must point to the location of the model weights !!!
## Gemma weights are distributed through the Hugging Face Hub: https://huggingface.co/google
hyperparams.model_name = "../../../../gemma-weights"  # <== Add model weight location here e.g. "/path/to/downloaded/gemma/weights"
hyperparams.mixed_precision = "bf16"
hyperparams.fuse_qkv_params = False

# Init the model and accelerator wrapper
model = init_te_gemma_model(hyperparams).cuda()

model = model.to(torch.bfloat16).cuda()

tokenizer = AutoTokenizer.from_pretrained(hyperparams.model_name)
inputs = tokenizer(["I love when ", "I "] * 32, return_tensors="pt", padding=True)

inputs['input_ids'] = inputs['input_ids'].cuda()
inputs['attention_mask'] = inputs['attention_mask'].cuda()

# The .generate method is overridden in te_gemma.py; see that file for the implementation.
outputs = model.generate(
    **inputs,
    max_new_tokens=40
)
generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for text in generated_texts:
    print(text)
    print("=" * 100)

benchmark_generation(model)

By using THD attention we obtained the following speedups:

| Models                                                      | max_input_len=64, max_new_tokens=1000 | max_input_len=128, max_new_tokens=128 |  
|-------------------------------------------------------------|---------------------------------------|--------------------------------------|
| HF (baseline)                                               | -      | -                         |
| THD attention with TE                                               | -      | -                         |  

## [Improvement 2] Running FP8 generation with a model trained in higher precision

We are now preparing to execute FP8 generation using the Gemma model. However, this process is not straightforward. Since the model was originally trained with BF16 precision, the FP8 scaling factors have not been computed. Operating the model at such low precision without the correct scaling could result in significant numerical errors, which in turn would produce incorrect results.

We highly recommend familiarizing yourself with the [tutorial](../../examples/fp8_primer.ipynb) on FP8 precision to understand the necessity of scaling.

##### Weight Calibration

To address the issue outlined above, we will implement weight calibration. This involves running several forward iterations at BF16 precision within the context `te.fp8_autocast(enabled=False, calibrating=True)`. This setup lets the forward pass run at higher precision while we collect `amax_history` and other FP8-related statistics, which are needed to calculate the FP8 scaling factors.
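To show what the collected statistics are used for, here is a simplified sketch of how a scaling factor can be derived from an amax history when the `"max"` algorithm is used. It is plain Python written for illustration, not Transformer Engine's implementation: the function name `scale_from_amax_history`, the `margin` parameter and the fallback value are assumptions of this sketch. The constant 448 is the largest finite value representable in the E4M3 format.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def scale_from_amax_history(amax_history, fp8_max=E4M3_MAX, margin=0):
    """Pick a scale so that the largest observed value maps to the FP8 maximum.

    Simplified illustration: uses the maximum over the history and an
    optional power-of-two safety margin.
    """
    amax = max(amax_history)
    if amax <= 0.0:
        return 1.0  # assumption of this sketch: fall back to a neutral scale
    return fp8_max / amax / (2 ** margin)


# Absolute maxima recorded for one tensor over three calibration iterations.
history = [0.5, 2.0, 1.0]
scale = scale_from_amax_history(history)

print(scale)                 # 224.0
print(max(history) * scale)  # 448.0, the largest value now fills the FP8 range
```

Without calibration no such history exists, so there is no basis for choosing the scale; this is why a BF16 checkpoint cannot be run in FP8 directly.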

The code below outlines the steps to initialize the BF16 model and run several forward iterations within this context. After these iterations, we save the model; these weights are used in the following sections.

In [None]:
# Import necessary packages and methods
import transformer_engine.pytorch as te
from utils import *
import torch


hyperparams.model_name = "../../../../gemma-weights"
hyperparams.fuse_qkv_params = True
model = init_te_gemma_model(hyperparams, fp8_model_init=False).cuda()
model = model.to(torch.bfloat16)
accelerator = Accelerator(
        log_with="wandb",
        gradient_accumulation_steps=hyperparams.gradient_accumulation_steps,
        mixed_precision=hyperparams.mixed_precision
    )
train_dataloader = get_dataloaders(accelerator, hyperparams)

tokenizer = AutoTokenizer.from_pretrained(hyperparams.model_name)

print("Calibration started")
with te.fp8_autocast(enabled=False, calibrating=True):
    model.train()
    train_dataloader = enumerate(train_dataloader)

    for i in range(100):
        step, batch = next(train_dataloader)
        batch["input_ids"] = batch["input_ids"].cuda()
        outputs = model.generate(
            **batch,
            max_new_tokens=10
        )
        generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print("calibration_finished")

print("scale_fwd computation started")
with te.fp8_autocast(enabled=True):
    for i in range(10):
        step, batch = next(train_dataloader)
        batch["input_ids"] = batch["input_ids"].cuda()
        outputs = model.generate(
            **batch,
            max_new_tokens=1
        )
print("scale_fwd_computation ended")

print("Casting weights...")
model_fp8 = init_te_gemma_model(hyperparams, fp8_model_init=True).cuda()
model_fp8.load_state_dict(model.state_dict())
print("Weights casted")

print("Saving model...")
torch.save(model_fp8.state_dict(), 'model_fp8_state_dict.pth')
print("Model saved!")

#### Generation in FP8

Now we are ready to run FP8 inference.

In [None]:
#Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
#restart_jupyter_notebook()
import transformer_engine.pytorch as te

from torch.cuda.amp import autocast

from utils import *
from transformer_engine.common.recipe import Format, DelayedScaling


hyperparams.model_name = "../../../../gemma-weights"
hyperparams.fuse_qkv_params = True
model = init_te_gemma_model(hyperparams, fp8_model_init=True, qkv_format="thd").cuda()

print("Loading model")
model_state_dict = torch.load('model_fp8_state_dict.pth')
model.load_state_dict(model_state_dict)
print("Model loaded")

tokenizer = AutoTokenizer.from_pretrained(hyperparams.model_name)
inputs = tokenizer(["Some random initial str ", "Another string ... "] * 32, return_tensors="pt", padding=True)

inputs['input_ids'] = inputs['input_ids'].cuda()
inputs['attention_mask'] = inputs['attention_mask'].cuda()

fp8_format = Format.HYBRID
fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=32, amax_compute_algo="max")
torch.manual_seed(1234)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    with autocast(dtype=torch.bfloat16, cache_enabled=False):
        with torch.no_grad():
            model.eval()
            outputs = model.generate(
                **inputs,
                max_new_tokens=40,
                use_cuda_graphs=False
            )


generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for text in generated_texts[:2]:
    print("-" * 50)
    print(text)

benchmark_generation(model)

We add the speedups to the table:

| Models                                                      | max_input_len=64, max_new_tokens=1000 | max_input_len=128, max_new_tokens=128 |  
|-------------------------------------------------------------|---------------------------------------|--------------------------------------|
| HF (baseline)                                               | -      | -                         |
| THD attention with TE                                               | -      | -                         | 
| THD attention + FP8 with TE                                               | -      | -                         |  

## [Improvement 3] Speeding up generation with CUDA Graphs

GPUs are getting faster at a rapid pace. As a result, the runtime of a kernel is sometimes shorter than the time the CPU needs to submit it. This can cause significant overhead, as the two pictures below show.

<center>
<img src="./media/pic2.png" alt="Generation without CUDA Graphs" height="200">
<br>
Generation without CUDA Graphs
<br>

<img src="./media/pic2.png" alt="Generation with CUDA Graphs" height="200">
<br>
Generation with CUDA Graphs
</center>

CUDA Graphs were developed to address this issue. When certain kernels are executed repeatedly, this tool enables us to record and replay them without CPU involvement.
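A back-of-the-envelope model can make the overhead tangible. The sketch below is a toy simulation in plain Python, not a measurement: the kernel and launch times are invented, and it assumes that replaying a graph costs about as much CPU time as submitting one kernel.

```python
def eager_time_us(num_kernels, kernel_us, launch_us):
    """Total time when the CPU submits every kernel one by one."""
    gpu_free_at = 0.0
    for i in range(num_kernels):
        submitted_at = (i + 1) * launch_us  # CPU finishes submitting kernel i
        # The GPU starts kernel i once it is both submitted and the GPU is idle.
        gpu_free_at = max(gpu_free_at, submitted_at) + kernel_us
    return gpu_free_at


def graphed_time_us(num_kernels, kernel_us, launch_us):
    """Total time when all kernels are replayed with a single submission."""
    return launch_us + num_kernels * kernel_us


# Hypothetical numbers: 2 us kernels, 5 us of CPU work per submission.
print(eager_time_us(1000, kernel_us=2, launch_us=5))    # 5002.0
print(graphed_time_us(1000, kernel_us=2, launch_us=5))  # 2005
```

In this model the eager run is bound by the CPU (the GPU idles between kernels), while the graphed run is bound by the kernels themselves. When kernels are longer than the submission time, both variants take roughly the same time.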

We recommend reading further about CUDA Graphs [here](https://developer.nvidia.com/blog/cuda-graphs/).

PyTorch supports CUDA Graphs through the `torch.cuda` API. However, a sequence of tensor operations must meet specific requirements to be captured and replayed correctly. In particular, the captured operations must be static: tensors must keep the same memory addresses and shapes between iterations.
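The static-memory requirement is easiest to see in a small capture-and-replay sketch with plain PyTorch. This example is independent of `te_gemma.py`; the helper name `capture_and_replay` and the toy `Linear` layer are chosen for illustration. New inputs are copied into the buffer that was used during capture, and results are read from the captured output buffer.

```python
import torch


def capture_and_replay(module, example_input, new_inputs):
    """Capture one forward pass in a CUDA graph and replay it for each input."""
    static_input = example_input.clone()  # buffer whose address stays fixed

    # Warm up on a side stream before capture, as recommended by PyTorch.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream), torch.no_grad():
        for _ in range(3):
            module(static_input)
    torch.cuda.current_stream().wait_stream(side_stream)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = module(static_input)  # recorded, not executed eagerly

    results = []
    for x in new_inputs:
        static_input.copy_(x)  # write new data into the captured buffer
        graph.replay()         # rerun the recorded kernels
        results.append(static_output.clone())
    return results


if torch.cuda.is_available():
    layer = torch.nn.Linear(16, 16).cuda()
    inputs = [torch.randn(4, 16, device="cuda") for _ in range(3)]
    outputs = capture_and_replay(layer, inputs[0], inputs)
    with torch.no_grad():
        print((outputs[1] - layer(inputs[1])).abs().max())
```

Passing a freshly allocated tensor to the graph instead of copying into `static_input` would not work: the recorded kernels keep reading from the address seen during capture.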

PyTorch also provides a simpler way to use CUDA Graphs: `torch.cuda.make_graphed_callables`, which makes it easy to record any PyTorch module. Starting from version 1.5, Transformer Engine also supports the `make_graphed_callables` API. Below is an excerpt of the `generate` method from `te_gemma.py` that creates the graphed part:

```
graphed_generator = TeGraphed(...)
(...)
    if use_cuda_graphs:
        fp8_format = Format.HYBRID
        fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=32, amax_compute_algo="max")
        graphed_layers = te.pytorch.make_graphed_callables(
                graphed_generator, 
                args, 
                fp8_enabled=True, 
                fp8_recipe=fp8_recipe, 
                allow_unused_input=True,
                num_warmup_iters=3
            )
            
    for i in range(max_new_tokens):
        next_tokens = graphed_layers(*args) if use_cuda_graphs else graphed_generator(*args)
        output_tokens.append(next_tokens.clone())
```
If you want to use CUDA Graphs with the Transformer Engine (TE), we recommend looking into the `TeGraphed` class. This class is similar to `TEGemmaDecoderLayer`, but it includes specific functionalities required to make CUDA Graphs work effectively.

Now, let's proceed to measure the speedup provided by CUDA Graphs:

In [None]:
#Restart the notebook (to flush the GPU memory)
from utils import restart_jupyter_notebook
#restart_jupyter_notebook()

import transformer_engine.pytorch as te
from torch.cuda.amp import autocast
from utils import *
from transformer_engine.common.recipe import Format, DelayedScaling


hyperparams.model_name = "../../../../gemma-weights"
hyperparams.fuse_qkv_params = True
model = init_te_gemma_model(hyperparams, fp8_model_init=True, qkv_format="thd").cuda()

print("Loading model")
model_state_dict = torch.load('model_fp8_state_dict.pth')
model.load_state_dict(model_state_dict)
print("Model loaded")

tokenizer = AutoTokenizer.from_pretrained(hyperparams.model_name)
inputs = tokenizer(["Some random initial str ", "Another string ... "] * 32, return_tensors="pt", padding=True)

inputs['input_ids'] = inputs['input_ids'].cuda()
inputs['attention_mask'] = inputs['attention_mask'].cuda()

fp8_format = Format.HYBRID
fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=32, amax_compute_algo="max")
torch.manual_seed(1234)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    with autocast(dtype=torch.bfloat16, cache_enabled=False):
        with torch.no_grad():
            model.eval()
            outputs = model.generate(
                **inputs,
                max_new_tokens=10,
                use_cuda_graphs=True
            )
generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for text in generated_texts[:12]:
    print("-" * 50)
    print(text)

benchmark_generation(model)


We finally obtained the **??%** speedup.

| Models                                                      | max_input_len=64, max_new_tokens=1000 | max_input_len=128, max_new_tokens=128 |  
|-------------------------------------------------------------|---------------------------------------|--------------------------------------|
| HF (baseline)                                               | -      | -                         |
| THD attention with TE                                               | -      | -                         | 
| THD attention + FP8 with TE                                               | -      | -                         |  
| THD attention + FP8 + CUDA Graphs with TE                                               | -      | -                         |  

## Conclusions

In this tutorial, we've explored three features of the Transformer Engine:
1. Support for the THD attention layout,
2. FP8 weight calibration,
3. Integration with CUDA Graphs.

Each of these features can be applied in various contexts, and here we demonstrated their use for achieving fast inference. However, it's important to note that this isn't the fastest possible method for performing inference. For achieving optimal speed, we recommend exploring NVIDIA's [TensorRT](https://developer.nvidia.com/tensorrt) library.