# LayerNorm operation using cuDNN frontend
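This notebook shows how to run a LayerNorm forward pass in inference mode with the cuDNN frontend Python API. LayerNorm normalizes each input row using that row's mean $\mu$ and variance $\sigma^2$, then applies a scale $\gamma$ and a bias $\beta$; $\epsilon$ is a small constant that keeps the denominator away from zero: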

$$\text{LayerNorm}(x) = \frac{x-\mu}{\sqrt{\sigma^2 + \epsilon}}\cdot\gamma+\beta$$
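To make the formula concrete, here is a minimal pure-Python sketch that applies it to a single row. The helper name `layer_norm_row` and the sample values are illustrative assumptions and are not part of cuDNN or of this notebook; the variance is the biased (population) variance, as in `torch.nn.functional.layer_norm`.

```python
import math


def layer_norm_row(row, gamma, beta, eps=1e-3):
    """Apply the LayerNorm formula to one row (a list of floats)."""
    n = len(row)
    mean = sum(row) / n
    # Biased (population) variance: divide by n, not n - 1.
    var = sum((v - mean) ** 2 for v in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Normalize, then apply the per-element scale (gamma) and bias (beta).
    return [(v - mean) * inv_std * g + b for v, g, b in zip(row, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm_row(row, gamma=[1.0] * 4, beta=[0.0] * 4)
print(normalized)
```

With unit scale and zero bias, the result has mean close to zero and variance slightly below one, because $\epsilon$ enlarges the denominator.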

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/01_matmul_bias.ipynb)

## Prerequisites and Setup
This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [1]:
# get_ipython().system('nvidia-smi')

If running on Colab, you will need to install the cuDNN Python packages.

In [2]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('pip install nvidia-cudnn-frontend')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121')

#### General Setup
Create a cuDNN handle. A handle is created per device and initializes the cuDNN context used by the graph.

In [3]:
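import cudnn
import torch

# cuDNN needs a CUDA device, so check for one before creating the handle.
assert torch.cuda.is_available()

torch.manual_seed(1)  # make the random inputs reproducible
handle = cudnn.create_handle()

print("Running with cudnn backend version:", cudnn.backend_version())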

### LayerNorm Inference
Problem Sizes:
- Batch Size: 4
- Sequence Size: 1024
- Embedding Dimension: 768

In [4]:
batch, seq_size, embedding_dim = 4, 1024, 768

input_type = torch.float16

# Epsilon is a small constant added to the variance to prevent division by zero.
epsilon_value = 1e-3
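The role of epsilon is easiest to see on a constant row, where the variance is exactly zero. The following is a small pure-Python sketch with made-up values, not something the notebook itself runs:

```python
import math

eps = 1e-3
row = [5.0, 5.0, 5.0, 5.0]  # constant row, so the variance is exactly 0
mean = sum(row) / len(row)
var = sum((v - mean) ** 2 for v in row) / len(row)

# Without eps the denominator would be sqrt(0) and the division would fail.
denominator = math.sqrt(var + eps)
normalized = [(v - mean) / denominator for v in row]
print(var, normalized)
```

Every element of a constant row normalizes to zero, and the denominator stays strictly positive because of `eps`.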

Create input tensor GPU buffers. We use PyTorch to allocate GPU tensors so we can reuse them to calculate a reference value.

In [5]:
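# Allocate the input tensors on the GPU and initialize them with random numbers.
# The batch and sequence dimensions are flattened into one, giving a 4D (N, C, 1, 1) shape.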
x_gpu = torch.randn(
    batch * seq_size,
    embedding_dim,
    1,
    1,
    dtype=input_type,
    requires_grad=True,
    device="cuda",
).to(memory_format=torch.channels_last)
scale_gpu = torch.randn(
    1, embedding_dim, 1, 1, dtype=input_type, requires_grad=True, device="cuda"
).to(memory_format=torch.channels_last)
bias_gpu = torch.randn(
    1, embedding_dim, 1, 1, dtype=input_type, requires_grad=True, device="cuda"
).to(memory_format=torch.channels_last)

# Epsilon must be a scalar tensor that lives on the CPU.
epsilon_cpu = torch.full(
    (1, 1, 1, 1), epsilon_value, dtype=torch.float32, requires_grad=False, device="cpu"
)

Compute the reference output with PyTorch and allocate the output tensor GPU buffer.

In [6]:
# Create the reference computation outputs here so we can use .empty_like() to create our output buffers
out_expected = torch.nn.functional.layer_norm(
    x_gpu,
    [embedding_dim, 1, 1],
    weight=scale_gpu.squeeze(0),
    bias=bias_gpu.squeeze(0),
    eps=epsilon_value,
)


# Allocate output tensor memory using PyTorch
out_gpu = torch.empty_like(out_expected)

#### Create cuDNN graph and tensors

In [8]:
# Create the cuDNN graph
graph = cudnn.pygraph(
    handle=handle,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

# Create tensor handles with the graph API
x = graph.tensor_like(x_gpu.detach()).set_name("X")
scale = graph.tensor_like(scale_gpu.detach()).set_name("scale")
bias = graph.tensor_like(bias_gpu.detach()).set_name("bias")
epsilon = graph.tensor_like(epsilon_cpu).set_name("epsilon")

(out, mean, inv_var) = graph.layernorm(
    name="layernorm",
    input=x,
    norm_forward_phase=cudnn.norm_forward_phase.INFERENCE,  # Note INFERENCE and not TRAINING
    scale=scale,
    bias=bias,
    epsilon=epsilon,
)

# Outputs are disabled by default, so enable only the one we need.
out.set_name("output").set_output(True).set_data_type(out_expected.dtype)

# Because we have set the norm_forward_phase to INFERENCE, these outputs will be None.
assert mean is None
assert inv_var is None

Build the graph

In [9]:
# Build the graph
graph.build([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])

# To run this block more than once, we need to re-run the previous block to get a new graph.
# The same instance of a graph should not be built twice.
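To our understanding, `graph.build(...)` is a convenience wrapper around several separate build stages. The sketch below runs those stages one at a time, which can help locate the stage that fails. It assumes the `pygraph` methods `validate`, `build_operation_graph`, `create_execution_plans`, `check_support`, and `build_plans` are available in the installed cudnn-frontend version; the helper name `build_stepwise` is hypothetical.

```python
def build_stepwise(graph, heur_modes):
    """Hypothetical helper: run the graph build stages one at a time.

    `graph` is expected to be a cudnn.pygraph that has not been built yet.
    """
    graph.validate()                          # check the graph and its tensor attributes
    graph.build_operation_graph()             # create the backend operation graph
    graph.create_execution_plans(heur_modes)  # query the heuristics for candidate plans
    graph.check_support()                     # confirm a candidate plan is supported
    graph.build_plans()                       # build the plan(s) used by execute()
```

Under these assumptions, calling `build_stepwise(graph, [cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])` on a fresh graph would take the place of the single `graph.build(...)` call.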

Execute the graph

In [10]:
# Map each graph tensor handle to the memory buffer that backs it
variant_pack = {
    x: x_gpu.detach(),
    scale: scale_gpu.detach(),
    bias: bias_gpu.detach(),
    epsilon: epsilon_cpu,
    out: out_gpu,
}

# Allocate the scratch workspace the graph reports it needs for execution
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute(variant_pack, workspace)
torch.cuda.synchronize()

Test cuDNN's output against PyTorch's and check correctness

In [None]:
# Compare the cuDNN output against the PyTorch reference output
torch.testing.assert_close(out_gpu, out_expected, rtol=5e-3, atol=5e-3)
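For reference, `torch.testing.assert_close` treats two elements as close when `|actual - expected| <= atol + rtol * |expected|`. The scalar sketch below restates that criterion with the tolerances used above; the function name `is_close` and the sample values are illustrative only.

```python
def is_close(actual, expected, rtol=5e-3, atol=5e-3):
    """Elementwise closeness criterion used by torch.testing.assert_close."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Difference of about 0.009 against an allowance of 0.005 + 0.005 * 1.0 = 0.010.
print(is_close(1.009, 1.0))
# Difference of 0.012 against an allowance of 0.005 (the relative term vanishes at 0).
print(is_close(0.012, 0.0))
```

The absolute term dominates near zero and the relative term dominates for large magnitudes, which suits half-precision outputs whose rounding error grows with magnitude.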

Perform Cleanup

In [None]:
cudnn.destroy_handle(handle)