This notebook shows how to compute the forward (training) and backward passes of a LayerNorm with zero-centered gamma using cuDNN.

$$\text{LayerNorm\_Zero\_Centered\_Gamma}(x) = \frac{x-\mu}{\sqrt{\sigma^2 + \epsilon}}\cdot(1+\gamma)+\beta$$

where $\mu = E[x]$ and $\sigma^2 = Var[x]$ are computed separately for each sample over the normalized (embedding) dimension.
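
To make the formula concrete, here is a minimal pure-Python sketch that applies it to a single vector of values. The function name `layernorm_zero_centered_gamma` and the sample data are illustrative assumptions and are not part of cuDNN or PyTorch.

```python
import math


def layernorm_zero_centered_gamma(x, gamma, beta, eps=1e-3):
    """Apply (x - mean) / sqrt(var + eps) * (1 + gamma) + beta to one vector."""
    n = len(x)
    mean = sum(x) / n
    # Biased (population) variance over the vector.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * (1.0 + g) + b for v, g, b in zip(x, gamma, beta)]


x = [1.0, 2.0, 3.0, 4.0]
zeros = [0.0] * len(x)

# With gamma = 0 and beta = 0 the result is the plain normalized vector.
identity_out = layernorm_zero_centered_gamma(x, zeros, zeros)

# With gamma = 1 the effective scale is 1 + gamma = 2, so the values double.
doubled_out = layernorm_zero_centered_gamma(x, [1.0] * len(x), zeros)
```

The point of the zero-centered parameterization is visible here: a gamma of all zeros leaves the normalized values unscaled, whereas a conventional LayerNorm needs a scale of all ones for the same effect.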

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/01_matmul_bias.ipynb)

## Prerequisites and Setup
This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [1]:
#get_ipython().system('nvidia-smi')

If running on Colab, you will need to install the cuDNN Python interface.

In [2]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('pip install nvidia-cudnn-frontend')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128')

#### General Setup
Create a cuDNN handle, a per-device handle used to initialize the cuDNN context.

In [3]:
import cudnn
import torch
import torch.nn as nn
import sys

torch.manual_seed(1)
handle = cudnn.create_handle()

# print("Running with cudnn backend version:", cudnn.backend_version())

assert torch.cuda.is_available()

assert (
    cudnn.backend_version() >= 91000
), "LayerNorm Zero Centered Gamma operation is only supported in cuDNN version 9.10.0 or above"
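
The integer `91000` in the check above corresponds to version 9.10.0. As a small sketch of how such a value can be decoded, the hypothetical helper below assumes the cuDNN 9 encoding `major * 10000 + minor * 100 + patch`. Earlier major versions used a different encoding, so treat this as illustrative only.

```python
def decode_cudnn_version(version: int) -> tuple:
    """Split a cuDNN 9.x version integer into (major, minor, patch).

    Assumes the cuDNN 9 encoding: major * 10000 + minor * 100 + patch.
    """
    major, remainder = divmod(version, 10000)
    minor, patch = divmod(remainder, 100)
    return major, minor, patch


# The minimum version required by this notebook.
min_supported = decode_cudnn_version(91000)
```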

### LayerNorm Zero Centered Gamma Training
Problem Sizes:
- Batch Size: 4
- Sequence Size: 1024
- Embedding Dimension: 768

In [4]:
batch, seq_size, embedding_dim = 4, 1024, 768

input_type = torch.float16

# Epsilon is a small number to prevent division by 0.
epsilon_value = 1e-3

In [5]:
# Define the LayerNormZeroCenteredGamma class
class LayerNormZeroCenteredGamma(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super(LayerNormZeroCenteredGamma, self).__init__()
        self.layer_norm = nn.LayerNorm(
            normalized_shape, eps=eps, elementwise_affine=False
        )
        self.normalized_shape = normalized_shape

    def forward(self, x, gamma, beta):
        # Apply LayerNorm
        normalized_x = self.layer_norm(x)
        # Apply scaling using zero centered gamma and apply shifting
        return (1 + gamma) * normalized_x + beta

Create input tensor GPU buffers. We use PyTorch to allocate GPU tensors so we can reuse them easily when we calculate reference outputs.

In [6]:
# allocate input tensor memory, initialize them to random numbers
x_gpu = torch.randn(
    batch * seq_size,
    embedding_dim,
    1,
    1,
    dtype=input_type,
    requires_grad=True,
    device="cuda",
).to(memory_format=torch.channels_last)

zero_centered_gamma_gpu = torch.randn(
    1, embedding_dim, 1, 1, dtype=input_type, requires_grad=True, device="cuda"
).to(memory_format=torch.channels_last)

bias_gpu = torch.randn(
    1, embedding_dim, 1, 1, dtype=input_type, requires_grad=True, device="cuda"
).to(memory_format=torch.channels_last)

# One must be a scalar value on the cpu
one_cpu = torch.full(
    (1, 1, 1, 1), 1.0, dtype=torch.float32, requires_grad=False, device="cpu"
)
# Epsilon must be a scalar value on the cpu.
epsilon_cpu = torch.full(
    (1, 1, 1, 1), epsilon_value, dtype=torch.float32, requires_grad=False, device="cpu"
)

Compute reference outputs and allocate output tensor GPU buffers.

In [7]:
# Create the reference computation outputs here before the cuDNN computation, in order to use .empty_like() to create our output buffers
layer_norm_zero_centered_gamma = LayerNormZeroCenteredGamma(
    [embedding_dim, 1, 1], eps=epsilon_value
)
out_expected = layer_norm_zero_centered_gamma(
    x_gpu, zero_centered_gamma_gpu.squeeze(0), bias_gpu.squeeze(0)
)

mean_expected = x_gpu.to(torch.float32).mean(dim=(1, 2, 3), keepdim=True)

inv_var_expected = torch.rsqrt(
    # LayerNorm uses the biased (population) variance, hence correction=0.
    torch.var(x_gpu.to(torch.float32), dim=(1, 2, 3), correction=0, keepdim=True)
    + epsilon_value
)

# Allocate output tensor memory using PyTorch
# PyTorch has calculated their shapes already, so we can simply use .empty_like()
out_gpu = torch.empty_like(out_expected)
mean_gpu = torch.empty_like(mean_expected)
inv_var_gpu = torch.empty_like(inv_var_expected)
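
The reference statistics above are computed after casting `x_gpu` to float32, and the cuDNN graph later in this notebook also uses `FLOAT` for intermediate and compute types. One likely reason is that reductions accumulated in float16 lose precision quickly. The sketch below illustrates this with the standard library only; `to_half` and `sum_in_half` are hypothetical helper names, and the example simulates float16 rounding rather than running on the GPU.

```python
import struct


def to_half(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def sum_in_half(values):
    """Accumulate with the running total rounded to float16 after every add."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total


ones = [1.0] * 4096
# Above 2048 the spacing between float16 values is 2, so adding 1 no longer
# changes the running total.
half_total = sum_in_half(ones)
# Python floats (float64) keep the exact count.
full_total = sum(ones)
```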

#### Create cuDNN graph and tensors

Here we assign UIDs to tensors. A UID is a unique identifier that maps a tensor created through the cuDNN graph API, for example with `graph.tensor_like()`, to the underlying memory that stores it. Virtual tensors don't require explicitly allocated memory, but non-virtual tensors such as inputs and outputs need UIDs.

Alternatively, one can use tensor handles directly in the mapping; however, UIDs can be more convenient for caching cuDNN graphs.

For each of our inputs {X, Scale, Bias, One, Epsilon} and our outputs {Out, Mean, Inverse Variance}, we allocate a UID.

In [8]:
from enum import Enum


class UID(Enum):
    SCALE0 = 1
    X = 2
    BIAS = 3
    OUT = 5
    MEAN = 6
    INV_VAR = 7
    ONE = 8
    EPSILON = 9
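
Because every non-virtual tensor must eventually be mapped from its UID to a buffer, it can help to check the mapping before executing a graph. The following is a minimal sketch of such a check. `missing_uids`, `DemoUID`, and `demo_pack` are hypothetical names used only for illustration and are not part of the cuDNN API.

```python
from enum import Enum


def missing_uids(uid_enum, variant_pack):
    """Return the enum members whose value is not a key of variant_pack."""
    return [member for member in uid_enum if member.value not in variant_pack]


# Hypothetical stand-in enum and buffers, used only to exercise the helper.
class DemoUID(Enum):
    X = 1
    OUT = 2


demo_pack = {DemoUID.X.value: "x_buffer"}
missing = missing_uids(DemoUID, demo_pack)  # OUT has no buffer assigned yet
```

The same helper could be called with the `UID` enum and the UID-to-memory dictionary used when the graph is executed.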

In [9]:
# Create the cuDNN graph.
graph = cudnn.pygraph(
    handle=handle,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

# Create tensor handles with the graph API, assign UIDs.
x = graph.tensor_like(x_gpu.detach()).set_name("X").set_uid(UID.X.value)
gamma = (
    graph.tensor_like(zero_centered_gamma_gpu.detach())
    .set_name("scale0")
    .set_uid(UID.SCALE0.value)
)
one = graph.tensor_like(one_cpu).set_name("one").set_uid(UID.ONE.value)
bias = graph.tensor_like(bias_gpu.detach()).set_name("bias").set_uid(UID.BIAS.value)
epsilon = graph.tensor_like(epsilon_cpu).set_name("epsilon").set_uid(UID.EPSILON.value)

# Add a pointwise add operation for zero centered gamma + 1
scale = graph.add(name="gamma_plus_one", a=gamma, b=one)

# Add a layernorm operation
(out, mean, inv_var) = graph.layernorm(
    name="layernorm",
    input=x,
    norm_forward_phase=cudnn.norm_forward_phase.TRAINING,
    scale=scale,
    bias=bias,
    epsilon=epsilon,
)

# Enable all outputs, by default outputs are disabled
out.set_name("output").set_output(True).set_data_type(out_expected.dtype).set_uid(
    UID.OUT.value
)
mean.set_name("mean").set_output(True).set_data_type(mean_expected.dtype).set_uid(
    UID.MEAN.value
)
inv_var.set_name("inv_var").set_output(True).set_data_type(
    inv_var_expected.dtype
).set_uid(UID.INV_VAR.value)

# print(graph)

[{"data_type":"FLOAT","dim":[],"is_pass_by_value":false,"is_virtual":false,"name":"inv_var","pass_by_value":null,"reordering_type":"NONE","stride":[],"uid":7,"uid_assigned":true}]

#### Build the graph

In [10]:
# Build the graph
graph.build([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])

# To run this block more than once, we need to re-run the previous block to get a new graph.
# The same instance of a graph can not be built twice.
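
Since a graph instance cannot be built twice, one common pattern is to wrap graph construction in a function and cache the built result per problem configuration. The sketch below shows only the caching logic. `GraphCache` and `fake_build` are hypothetical, and in real use the build callable would be assumed to create and build a new `cudnn.pygraph` for the given key.

```python
class GraphCache:
    """Cache built graphs keyed by a description of the problem."""

    def __init__(self, build_fn):
        # build_fn is a hypothetical callable: key -> freshly built graph.
        self._build_fn = build_fn
        self._graphs = {}

    def get(self, shape, dtype_name):
        key = (tuple(shape), dtype_name)
        if key not in self._graphs:
            # Build once per distinct key; later calls reuse the result.
            self._graphs[key] = self._build_fn(key)
        return self._graphs[key]


build_calls = []


def fake_build(key):
    """Stand-in for graph construction that records each invocation."""
    build_calls.append(key)
    return {"built_for": key}


cache = GraphCache(fake_build)
first = cache.get((4096, 768, 1, 1), "float16")
second = cache.get((4096, 768, 1, 1), "float16")  # served from the cache
```

Fixed UIDs fit this pattern well because the same UID-to-memory mapping can be reused with a cached graph, whereas tensor handles belong to one specific graph instance.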

#### Execute the graph

After validating and building a cuDNN graph, we can execute it. To do so, we provide input and output buffers, using the previously assigned UIDs to associate the tensor handles generated by the graph API with their underlying memory.

The desired input values need to be stored in these buffers before the `graph.execute` call. Because we have done a reference computation, we can simply reuse the buffers we have allocated via PyTorch.

Note that the EPSILON and ONE UIDs expect CPU buffers.

In [11]:
# Mapping of (UIDs -> memory)
variant_pack = {
    UID.X.value: x_gpu,
    UID.SCALE0.value: zero_centered_gamma_gpu,
    UID.BIAS.value: bias_gpu,
    UID.EPSILON.value: epsilon_cpu,
    UID.OUT.value: out_gpu,
    UID.MEAN.value: mean_gpu,
    UID.INV_VAR.value: inv_var_gpu,
    UID.ONE.value: one_cpu,
}

workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute(variant_pack, workspace)

Test cuDNN's output against PyTorch's and check correctness

In [12]:
torch.cuda.synchronize()
# compare to reference output
torch.testing.assert_close(out_gpu, out_expected, rtol=5e-3, atol=5e-3)
torch.testing.assert_close(inv_var_gpu, inv_var_expected, rtol=5e-3, atol=5e-3)
torch.testing.assert_close(mean_gpu, mean_expected, rtol=5e-3, atol=5e-3)
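
For reference, `torch.testing.assert_close` treats two values as close when `|actual - expected| <= atol + rtol * |expected|`. The sketch below restates that criterion for scalars. `within_tolerance` is a hypothetical helper, not a PyTorch function.

```python
def within_tolerance(actual, expected, rtol, atol):
    """Scalar form of the criterion |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# With rtol = atol = 5e-3 and expected = 1.0, the allowed error is 0.010.
passes = within_tolerance(1.009, 1.0, rtol=5e-3, atol=5e-3)
fails = within_tolerance(1.011, 1.0, rtol=5e-3, atol=5e-3)
```

The relative term dominates for large values and the absolute term for values near zero, so the check uses both.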


#### LayerNorm Zero Centered Gamma Backward Pass

Compute reference values for the backward graph.

In [13]:
# Reference backward operation using PyTorch
target = torch.randn_like(out_expected)
criterion = torch.nn.MSELoss()
loss = criterion(out_expected, target)

out_expected.retain_grad()
x_gpu.retain_grad()
zero_centered_gamma_gpu.retain_grad()
bias_gpu.retain_grad()

loss.backward()
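
After `loss.backward()`, `out_expected.grad` holds the gradient of the loss with respect to the LayerNorm output. For `MSELoss` with its default mean reduction over $N$ elements, that gradient is $2(\text{out} - \text{target})/N$. The small pure-Python sketch below checks this against a finite difference; the function names and data are illustrative assumptions.

```python
def mse_loss(out, target):
    """Mean squared error with mean reduction over all elements."""
    return sum((o - t) ** 2 for o, t in zip(out, target)) / len(out)


def mse_grad(out, target):
    """Analytic gradient of mse_loss with respect to each element of out."""
    n = len(out)
    return [2.0 * (o - t) / n for o, t in zip(out, target)]


out = [0.5, -1.0, 2.0]
target = [0.0, 1.0, 2.0]
d_out = mse_grad(out, target)

# Forward-difference estimate of the same gradient, for comparison.
step = 1e-6
numeric = []
for i in range(len(out)):
    bumped = list(out)
    bumped[i] += step
    numeric.append((mse_loss(bumped, target) - mse_loss(out, target)) / step)
```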

Build backward graph

In [14]:
bwd_graph = cudnn.pygraph(
    handle=handle,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

# Create tensors associated with the backwards graph. DO NOT reuse tensor handles from the forward graph.
d_out = bwd_graph.tensor(
    name="d_out", dim=x_gpu.size(), stride=x_gpu.stride(), data_type=x_gpu.dtype
)

x_bwd = bwd_graph.tensor_like(x, name="x")
gamma_bwd = bwd_graph.tensor_like(gamma, name="gamma")
one_bwd = bwd_graph.tensor_like(one_cpu).set_name("one")
mean_bwd = bwd_graph.tensor_like(mean, name="mean")
inv_var_bwd = bwd_graph.tensor_like(inv_var, name="inv_var")

# Add a pointwise add operation for zero centered gamma + 1
scale_bwd = bwd_graph.add(name="gamma_bwd_plus_one", a=gamma_bwd, b=one_bwd)

# Add the layernorm backward operation
(d_x, d_scale, d_bias) = bwd_graph.layernorm_backward(
    name="DLN",
    grad=d_out,
    input=x_bwd,
    scale=scale_bwd,
    mean=mean_bwd,
    inv_variance=inv_var_bwd,
)

# Enable outputs.
d_x.set_output(True).set_data_type(x_gpu.dtype)
d_scale.set_output(True).set_data_type(x_gpu.dtype)
d_bias.set_output(True).set_data_type(x_gpu.dtype)

# print(bwd_graph)

[{"data_type":"HALF","dim":[],"is_pass_by_value":false,"is_virtual":false,"name":"DLN::DBIAS","pass_by_value":null,"reordering_type":"NONE","stride":[],"uid":0,"uid_assigned":false}]

In [15]:
# Build the bwd_graph
bwd_graph.build([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])

Execute the graph and check correctness against PyTorch

In [16]:
# Create output buffers for gradients
d_x_gpu = torch.empty_like(x_gpu)
d_scale_gpu = torch.empty_like(zero_centered_gamma_gpu)
d_bias_gpu = torch.empty_like(bias_gpu)

workspace = torch.empty(
    bwd_graph.get_workspace_size(), device="cuda", dtype=torch.uint8
)

# The backward graph's inputs x_bwd, gamma_bwd, mean_bwd, and inv_var_bwd reuse the inputs and outputs of the forward graph. For d_out we use the gradient that PyTorch's autograd stored in out_expected.grad.
bwd_graph.execute(
    {
        x_bwd: x_gpu.detach(),
        gamma_bwd: zero_centered_gamma_gpu.detach(),
        d_out: out_expected.grad,
        mean_bwd: mean_gpu.detach(),
        inv_var_bwd: inv_var_gpu.detach(),
        d_x: d_x_gpu,
        d_scale: d_scale_gpu,
        d_bias: d_bias_gpu,
        one_bwd: one_cpu,
    },
    workspace,
    handle=handle,
)

Compare results and check correctness

In [17]:
torch.cuda.synchronize()

# compare to reference output
torch.testing.assert_close(x_gpu.grad, d_x_gpu, atol=2e-4, rtol=2e-4)
torch.testing.assert_close(
    zero_centered_gamma_gpu.grad, d_scale_gpu, atol=2e-4, rtol=2e-4
)
torch.testing.assert_close(bias_gpu.grad, d_bias_gpu, atol=2e-4, rtol=2e-4)
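
The gradients compared above can also be written out explicitly. The sketch below implements the standard layer normalization gradient formulas in pure Python for a small list of rows. It is an illustration of the math and not cuDNN's implementation; `layernorm_backward` and the sample data are assumptions. Because the effective scale is `1 + gamma` and the added 1 is constant, the gradient with respect to gamma equals the gradient with respect to the scale.

```python
import math


def layernorm_backward(x_rows, dy_rows, gamma, eps=1e-3):
    """Gradients of y = x_hat * (1 + gamma) + beta for a list of rows.

    Returns (d_x, d_gamma, d_beta). d_gamma and d_beta are reduced over rows.
    """
    n = len(gamma)
    d_x = []
    d_gamma = [0.0] * n
    d_beta = [0.0] * n
    for x, dy in zip(x_rows, dy_rows):
        mean = sum(x) / n
        var = sum((v - mean) ** 2 for v in x) / n  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        x_hat = [(v - mean) * inv_std for v in x]
        # Gradient flowing into x_hat, after the (1 + gamma) scaling.
        g = [d * (1.0 + gm) for d, gm in zip(dy, gamma)]
        g_mean = sum(g) / n
        gx_mean = sum(a * h for a, h in zip(g, x_hat)) / n
        d_x.append(
            [inv_std * (a - g_mean - h * gx_mean) for a, h in zip(g, x_hat)]
        )
        for i in range(n):
            d_gamma[i] += dy[i] * x_hat[i]
            d_beta[i] += dy[i]
    return d_x, d_gamma, d_beta


x_rows = [[1.0, 2.0, 4.0], [0.5, -1.0, 3.0]]
dy_rows = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
gamma = [0.0, 0.5, -0.25]
d_x, d_gamma, d_beta = layernorm_backward(x_rows, dy_rows, gamma)
```

Note that the backward formulas need only the input, the scale, and the saved mean and inverse standard deviation, which matches the inputs passed to the backward graph in this notebook.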

Perform Cleanup

In [18]:
cudnn.destroy_handle(handle)