This notebook shows how to compute the forward pass of layer normalization (training mode) followed by a clamped ReLU with bitmask generation, and then the equivalent backward pass (dReLU + DLN) using that bitmask.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/01_matmul_bias.ipynb)

## Prerequisites and Setup
This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [None]:
#get_ipython().system('nvidia-smi')

If running on Colab, you will need to install the cuDNN Python interface.

In [None]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('pip install nvidia-cudnn-frontend')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128')

#### General Setup
Create a cuDNN handle, which is a per-device handle used to initialize the cuDNN context.

In [None]:
import cudnn
import torch
import torch.nn as nn
import sys

torch.manual_seed(1)
handle = cudnn.create_handle()

print("Running with cudnn backend version:", cudnn.backend_version())

assert torch.cuda.is_available()

assert (
    cudnn.backend_version() >= 91100
), "LayerNorm with ReLU bitmask generation is only supported on cuDNN version 9.11.0 or above"

### LayerNorm Relu Bitmask Training
Problem Sizes:
- Batch Size: 4
- Sequence Size: 1024
- Embedding Dimension: 768

In [None]:
batch, seq_size, embedding_dim = 4, 1024, 768

input_type = torch.float32

# Epsilon is a small number to prevent division by 0.
epsilon_value = 1e-3

# Set clamped ReLU limits
lower_clip_val = 0
upper_clip_val = 6
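
As a minimal sketch of what these limits mean, the pure-Python snippet below clamps a few scalar values into `[0, 6]` and marks which outputs lie strictly between the limits. The helper names `clamped_relu` and `passes_gradient` are hypothetical and used only for illustration; they are not part of cuDNN or PyTorch.

```python
def clamped_relu(value, lower=0.0, upper=6.0):
    """Clamp a scalar into the closed interval [lower, upper]."""
    return min(max(value, lower), upper)


def passes_gradient(value, lower=0.0, upper=6.0):
    """True when the clamped output lies strictly inside (lower, upper)."""
    clamped = clamped_relu(value, lower, upper)
    return lower < clamped < upper


samples = [-1.5, 0.0, 2.0, 6.0, 7.25]
outputs = [clamped_relu(v) for v in samples]
mask = [passes_gradient(v) for v in samples]

print(outputs)  # [0.0, 0.0, 2.0, 6.0, 6.0]
print(mask)     # [False, False, True, False, False]
```

Only the entries marked `True` let a gradient through in the backward pass, which is the information the bitmask records.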

Create input tensor GPU buffers. We use PyTorch to allocate GPU tensors so we can reuse them easily when we calculate reference outputs.

In [None]:
# Allocate input tensor memory, initialize them to random numbers
x_gpu = torch.randn(
    batch * seq_size,
    embedding_dim,
    1,
    1,
    dtype=input_type,
    requires_grad=True,
    device="cuda",
).to(memory_format=torch.channels_last)

scale_gpu = torch.randn(
    1, embedding_dim, 1, 1, dtype=input_type, requires_grad=True, device="cuda"
).to(memory_format=torch.channels_last)

bias_gpu = torch.randn(
    1, embedding_dim, 1, 1, dtype=input_type, requires_grad=True, device="cuda"
).to(memory_format=torch.channels_last)

# Allocate output tensor memory.
out_gpu = torch.empty_like(x_gpu)
mean_gpu = torch.empty(batch * seq_size, dtype=torch.float32, device="cuda")
inv_var_gpu = torch.empty(batch * seq_size, dtype=torch.float32, device="cuda")

# cuDNN stores boolean bitmask values bit-packed, eight per byte, so the
# channel dimension of the buffer is embedding_dim // 8 (allocated here as uint8).
mask_gpu = torch.empty(
    ((batch * seq_size), embedding_dim // 8, 1, 1), dtype=torch.uint8, device="cuda"
)
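
To illustrate the packing, here is a small pure-Python sketch that packs booleans into bytes and unpacks them again. It assumes a least-significant-bit-first layout, which is the layout the `unpack_cudnn_bitmask` helper later in this notebook decodes; `pack_bits` and `unpack_bits` are hypothetical helper names.

```python
def pack_bits(flags):
    """Pack booleans (length divisible by 8) into bytes, least significant bit first."""
    if len(flags) % 8 != 0:
        raise ValueError("number of flags must be a multiple of 8")
    packed = []
    for start in range(0, len(flags), 8):
        byte = 0
        for bit in range(8):
            if flags[start + bit]:
                byte |= 1 << bit
        packed.append(byte)
    return packed


def unpack_bits(packed):
    """Inverse of pack_bits: expand each byte into 8 booleans."""
    return [bool((byte >> bit) & 1) for byte in packed for bit in range(8)]


# 16 flags -> 2 bytes. Flags 0 and 7 set bits 0 and 7 of the first byte,
# flag 9 sets bit 1 of the second byte.
flags = [True] + [False] * 6 + [True] + [False, True] + [False] * 6
packed = pack_bits(flags)

print(packed)  # [129, 2]
print(unpack_bits(packed) == flags)  # True
```

With `embedding_dim = 768`, each row of the mask therefore occupies `768 // 8 = 96` bytes.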

# Epsilon, lower clip, and upper clip must be scalar tensors on the CPU.
epsilon_cpu = torch.full(
    (1, 1, 1, 1), epsilon_value, dtype=torch.float32, requires_grad=False, device="cpu"
)
lower_clip_cpu = torch.full(
    (1, 1, 1, 1), lower_clip_val, dtype=torch.float32, requires_grad=False, device="cpu"
)
upper_clip_cpu = torch.full(
    (1, 1, 1, 1), upper_clip_val, dtype=torch.float32, requires_grad=False, device="cpu"
)

Compute reference outputs.

In [None]:
# Create the reference computation outputs here before the cuDNN computation, in order to use .empty_like() to create our output buffers
x_ref = x_gpu.clone().float()

normalized_x = torch.nn.functional.layer_norm(
    x_ref,
    [embedding_dim, 1, 1],
    weight=scale_gpu.squeeze(0),
    bias=bias_gpu.squeeze(0),
    eps=epsilon_value,
)

out_expected = torch.clamp(normalized_x, min=lower_clip_val, max=upper_clip_val)

mask_expected = (lower_clip_val < out_expected) & (out_expected < upper_clip_val)

mean_expected = x_gpu.to(torch.float32).mean(dim=(1, 2, 3))

# LayerNorm uses the biased (population) variance, so disable Bessel's correction.
inv_var_expected = torch.rsqrt(
    torch.var(x_gpu.to(torch.float32), dim=(1, 2, 3), correction=0) + epsilon_value
)
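
The reference computation above can be sketched for a single row in pure Python. This is an illustration only, assuming the biased (divide-by-n) variance that layer normalization conventionally uses; `layer_norm_row` is a hypothetical helper, and the value the notebook calls `inv_var` corresponds to `inv_std` here, that is `1 / sqrt(var + eps)`.

```python
import math


def layer_norm_row(row, scale, bias, eps):
    """Normalize one row and return (output, mean, inverse standard deviation)."""
    n = len(row)
    mean = sum(row) / n
    var = sum((v - mean) ** 2 for v in row) / n  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    out = [(v - mean) * inv_std * g + b for v, g, b in zip(row, scale, bias)]
    return out, mean, inv_std


row = [1.0, 2.0, 3.0, 4.0]
out, mean, inv_std = layer_norm_row(row, scale=[1.0] * 4, bias=[0.0] * 4, eps=1e-3)

print(mean)     # 2.5
print(inv_std)  # 1 / sqrt(1.25 + 0.001)
print(out)      # symmetric around zero because scale is 1 and bias is 0
```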

#### Create cuDNN graph and tensors

cuDNN needs a mapping from the tensors created through graph API calls, such as `graph.tensor_like()`, to the underlying device memory that stores them. One way to express that mapping is to assign each tensor a UID, a unique identifier. Virtual tensors don't require explicit memory, but non-virtual tensors such as inputs and outputs must appear in the mapping.

Alternatively, one can use the tensor handles directly in the mapping, which is what this notebook does; UIDs can be more convenient when caching cuDNN graphs.

Our inputs are {X, Scale, Bias, Epsilon, Lower Clip, Upper Clip} and our outputs are {Out, Mean, Inverse Variance, Bitmask}.

In [None]:
# Create the cuDNN graph.
graph = cudnn.pygraph(
    handle=handle,
    io_data_type=cudnn.data_type.FLOAT,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

# Create tensor handles with the graph API.
x = graph.tensor_like(x_gpu.detach()).set_name("X")
scale = graph.tensor_like(scale_gpu.detach()).set_name("scale")
bias = graph.tensor_like(bias_gpu.detach()).set_name("bias")
epsilon = graph.tensor_like(epsilon_cpu).set_name("epsilon")
lower_clip = graph.tensor_like(lower_clip_cpu).set_name("lower_clip")
upper_clip = graph.tensor_like(upper_clip_cpu).set_name("upper_clip")

# Add a layernorm operation
(norm_out, mean, inv_var) = graph.layernorm(
    name="layernorm",
    input=x,
    norm_forward_phase=cudnn.norm_forward_phase.TRAINING,
    scale=scale,
    bias=bias,
    epsilon=epsilon,
)

# Add a relu operation
out = graph.relu(
    name="relu", input=norm_out, lower_clip=lower_clip_val, upper_clip=upper_clip_val
)

# Add logical operations for generating the bitmask:
# a bit is set only where lower_clip < out < upper_clip.
lower_clip_mask = graph.cmp_gt(
    name="cmp_gt_lower_clip", input=out, comparison=lower_clip
)
lower_clip_mask.set_name("lower_clip_mask").set_data_type(cudnn.data_type.BOOLEAN)
upper_clip_mask = graph.cmp_lt(
    name="cmp_lt_upper_clip", input=out, comparison=upper_clip
)
upper_clip_mask.set_name("upper_clip_mask").set_data_type(cudnn.data_type.BOOLEAN)
bitmask = graph.logical_and(name="and_bitmask", a=lower_clip_mask, b=upper_clip_mask)
bitmask.set_name("relu_bitmask").set_data_type(cudnn.data_type.BOOLEAN)

# Enable all outputs; by default, outputs are disabled.
out.set_name("output").set_output(True)
mean.set_name("mean").set_output(True).set_data_type(cudnn.data_type.FLOAT)
inv_var.set_name("inv_var").set_output(True).set_data_type(cudnn.data_type.FLOAT)
bitmask.set_name("relu_bitmask").set_output(True)

print(graph)

#### Build the graph

In [None]:
# Build the graph
graph.build([cudnn.heur_mode.A])

# To run this block more than once, re-run the previous block first to get a new graph.
# The same graph instance cannot be built twice.

#### Execute the graph

After validating and building a cuDNN graph, we can execute it. To do this, we provide input and output buffers by associating the tensor handles generated from the graph API with their underlying memory.

The desired input values need to be stored in these buffers before the `graph.execute` call. Because we have done a reference computation, we can simply reuse the buffers we have allocated via PyTorch.

Note that the epsilon, lower clip, and upper clip tensors expect CPU buffers.

In [None]:
variant_pack = {
    x: x_gpu,
    scale: scale_gpu,
    bias: bias_gpu,
    epsilon: epsilon_cpu,
    out: out_gpu,
    mean: mean_gpu,
    inv_var: inv_var_gpu,
    lower_clip: lower_clip_cpu,
    upper_clip: upper_clip_cpu,
    bitmask: mask_gpu,
}
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute(variant_pack, workspace)

#### Test cuDNN's output against PyTorch's and check correctness

In [None]:
def unpack_cudnn_bitmask(bitmask_tensor, N, C, H=1, W=1):
    """
    Helper function to unpack a cuDNN bitmask tensor of shape [N, C//8, H, W] and dtype=torch.uint8 (stored as packed bits)
    into a boolean tensor of shape [N, C, H, W] for assert testing.
    """
    bitmask_flat = bitmask_tensor.view(N, C // 8, H * W)
    unpacked = torch.zeros(
        (N, C, H * W), dtype=torch.bool, device=bitmask_tensor.device
    )

    for bit in range(8):
        bit_values = (bitmask_flat >> bit) & 1
        unpacked[:, bit::8, :] = bit_values

    unpacked = unpacked.view(N, C, H, W)
    return unpacked

In [None]:
torch.cuda.synchronize()

# compare to reference output
torch.testing.assert_close(out_gpu, out_expected, rtol=5e-3, atol=5e-3)
torch.testing.assert_close(inv_var_gpu, inv_var_expected, rtol=5e-3, atol=5e-3)
torch.testing.assert_close(mean_gpu, mean_expected, rtol=5e-3, atol=5e-3)

# Unpack the bitmask tensor and compare to reference output
unpacked_mask = unpack_cudnn_bitmask(mask_gpu, batch * seq_size, embedding_dim, 1, 1)
torch.testing.assert_close(unpacked_mask, mask_expected, atol=1e-3, rtol=1e-3)
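
For reference, `torch.testing.assert_close` treats two values as close when `|actual - expected| <= atol + rtol * |expected|`. The scalar sketch below restates that rule in pure Python; `is_close` is a hypothetical helper for illustration, not a PyTorch function.

```python
def is_close(actual, expected, rtol, atol):
    """Scalar version of the closeness rule: |a - e| <= atol + rtol * |e|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# With rtol = atol = 5e-3 and expected = 1.0, the allowed error is 0.01.
print(is_close(1.004, 1.0, rtol=5e-3, atol=5e-3))  # True
print(is_close(1.02, 1.0, rtol=5e-3, atol=5e-3))   # False
```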


### LayerNorm Relu Bitmask Backward Pass

Compute reference values for the backward graph.

In [None]:
# Reference backward operation using PyTorch
target = torch.randn_like(out_expected)
criterion = torch.nn.MSELoss()
loss = criterion(out_expected, target)

out_expected.retain_grad()
x_gpu.retain_grad()
scale_gpu.retain_grad()
bias_gpu.retain_grad()

loss.backward()
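
The gradient that autograd stores in `out_expected.grad` is the derivative of the mean squared error with respect to each output element. Assuming the default `reduction="mean"` of `MSELoss`, that derivative is `2 * (out - target) / N`. The pure-Python sketch below shows it on four values; `mse_loss` and `mse_grad` are hypothetical helper names.

```python
def mse_loss(outputs, targets):
    """Mean of squared differences."""
    n = len(outputs)
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / n


def mse_grad(outputs, targets):
    """Derivative of mse_loss with respect to each output element."""
    n = len(outputs)
    return [2.0 * (o - t) / n for o, t in zip(outputs, targets)]


outputs = [0.0, 2.0, 6.0, 1.5]
targets = [0.5, 1.0, 4.0, 1.5]
grad = mse_grad(outputs, targets)

print(grad)  # [-0.25, 0.5, 1.0, 0.0]
```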

#### Create cuDNN graph and tensors

In [None]:
bwd_graph = cudnn.pygraph(
    handle=handle,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

# Create tensors associated with the backwards graph. DO NOT reuse tensor handles from the forward graph.
d_out = bwd_graph.tensor(
    name="d_out", dim=x_gpu.size(), stride=x_gpu.stride(), data_type=x_gpu.dtype
)

x_bwd = bwd_graph.tensor_like(x, name="x")
scale_bwd = bwd_graph.tensor_like(scale, name="scale")
mean_bwd = bwd_graph.tensor_like(mean, name="mean")
inv_var_bwd = bwd_graph.tensor_like(inv_var, name="inv_var")
bitmask_bwd = bwd_graph.tensor(
    name="bitmask",
    dim=(batch * seq_size, embedding_dim, 1, 1),
    stride=(embedding_dim, 1, 1, 1),
    data_type=cudnn.data_type.BOOLEAN,
)

# Add a pointwise mul operation for dRelu using the bitmask
drelu_dY = bwd_graph.mul(name="drelu_bitmask_mul", a=d_out, b=bitmask_bwd)
drelu_dY.set_name("dRelu(dY)")
print("drelu_x_bwd:", drelu_dY.get_dim())

# Add the layernorm backward operation
(d_x, d_scale, d_bias) = bwd_graph.layernorm_backward(
    name="DLN",
    grad=drelu_dY,
    input=x_bwd,
    scale=scale_bwd,
    mean=mean_bwd,
    inv_variance=inv_var_bwd,
)

# Enable outputs.
d_x.set_output(True).set_data_type(x_gpu.dtype)
d_scale.set_output(True).set_data_type(x_gpu.dtype)
d_bias.set_output(True).set_data_type(x_gpu.dtype)

print(bwd_graph)
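
The two operations in this graph can be sketched for a single row in pure Python. This is an illustration of the standard layer normalization backward formulas, not cuDNN's implementation; `drelu` and `layer_norm_backward_row` are hypothetical helpers. The per-row `d_scale` and `d_bias` values returned here would be summed over all rows to obtain the full parameter gradients.

```python
import math


def drelu(d_out, mask):
    """Pass the upstream gradient only where the mask bit is set."""
    return [g if keep else 0.0 for g, keep in zip(d_out, mask)]


def layer_norm_backward_row(d_norm, row, scale, mean, inv_std):
    """Return (d_x, per-row d_scale, per-row d_bias) for one normalized row."""
    n = len(row)
    x_hat = [(v - mean) * inv_std for v in row]
    d_x_hat = [g * s for g, s in zip(d_norm, scale)]
    mean_d = sum(d_x_hat) / n
    mean_d_xhat = sum(d * h for d, h in zip(d_x_hat, x_hat)) / n
    d_x = [inv_std * (d - mean_d - h * mean_d_xhat) for d, h in zip(d_x_hat, x_hat)]
    d_scale = [g * h for g, h in zip(d_norm, x_hat)]
    d_bias = list(d_norm)
    return d_x, d_scale, d_bias


row = [1.0, 2.0, 3.0, 4.0]
scale = [0.5, 1.0, 1.5, 2.0]
eps = 1e-3
mean = sum(row) / len(row)
var = sum((v - mean) ** 2 for v in row) / len(row)  # biased variance
inv_std = 1.0 / math.sqrt(var + eps)

d_out = [0.1, -0.2, 0.3, 0.4]
mask = [False, True, True, False]  # True where the ReLU output was strictly inside the clip range

d_norm = drelu(d_out, mask)
d_x, d_scale, d_bias = layer_norm_backward_row(d_norm, row, scale, mean, inv_std)

print(d_norm)  # [0.0, -0.2, 0.3, 0.0]
print(d_x)     # sums to (approximately) zero, since the output ignores a shift of the row
```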

#### Build the graph

In [None]:
# Build the bwd_graph
bwd_graph.build([cudnn.heur_mode.A])

#### Execute the graph 

In [None]:
# Create output buffers for gradients
d_x_gpu = torch.empty_like(x_gpu)
d_scale_gpu = torch.empty_like(scale_gpu)
d_bias_gpu = torch.empty_like(bias_gpu)

workspace = torch.empty(
    bwd_graph.get_workspace_size(), device="cuda", dtype=torch.uint8
)

# The inputs of the backward graph (x_bwd, scale_bwd, mean_bwd, inv_var_bwd, bitmask_bwd) reuse the inputs and outputs of the forward graph.
# For d_out we use the gradient computed by PyTorch's autograd (.grad).
variant_pack = {
    x_bwd: x_gpu.detach(),
    scale_bwd: scale_gpu.detach(),
    d_out: out_expected.grad,
    mean_bwd: mean_gpu.detach(),
    inv_var_bwd: inv_var_gpu.detach(),
    d_x: d_x_gpu,
    d_scale: d_scale_gpu,
    d_bias: d_bias_gpu,
    bitmask_bwd: mask_gpu.detach(),
}
bwd_graph.execute(variant_pack, workspace)

#### Test cuDNN's output against PyTorch's and check correctness

In [None]:
torch.cuda.synchronize()

# compare to reference output
torch.testing.assert_close(x_gpu.grad, d_x_gpu, atol=2e-4, rtol=2e-4)
torch.testing.assert_close(scale_gpu.grad, d_scale_gpu, atol=2e-4, rtol=2e-4)
torch.testing.assert_close(bias_gpu.grad, d_bias_gpu, atol=2e-4, rtol=2e-4)

Perform Cleanup

In [None]:
cudnn.destroy_handle(handle)