# Adaptive LayerNorm: Inference

This notebook shows how to compute an adaptive layernorm forward operation using cuDNN.

$$\text{Adaptive\_LayerNorm}(x) = \frac{x-\mu}{\sqrt{\sigma^2 + \epsilon}}\cdot\gamma+\beta$$

where $\mu = E[x]$ and $\sigma^2 = Var[x]$ are computed over the embedding dimension of each token, and $\gamma$ and $\beta$ are learnable parameters that vary for each input in a batch. This is in contrast to standard layer norm, where $\gamma$ and $\beta$ are shared across all inputs in a batch.
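
To make the formula concrete, here is a minimal, dependency-free sketch in plain Python. It assumes the statistics are taken over the embedding dimension of each token (matching the PyTorch reference used later in this notebook) and that $\gamma$ and $\beta$ hold one vector per batch sample. The function name `adaptive_layernorm` and the toy values are illustrative only; this is not how cuDNN implements the operation.

```python
import math
from statistics import fmean, pvariance


def adaptive_layernorm(x, gamma, beta, eps=1e-3):
    """Reference adaptive layer norm on nested lists.

    x:     [batch][seq][embed] input values
    gamma: [batch][embed] per-sample scale
    beta:  [batch][embed] per-sample bias
    """
    out = []
    for sample, g, b in zip(x, gamma, beta):
        rows = []
        for token in sample:
            mu = fmean(token)            # mean over the embedding dimension
            var = pvariance(token, mu)   # biased (population) variance
            inv_std = 1.0 / math.sqrt(var + eps)
            rows.append(
                [(v - mu) * inv_std * gi + bi for v, gi, bi in zip(token, g, b)]
            )
        out.append(rows)
    return out


# Two samples with identical inputs but different per-sample scale and bias.
x = [[[1.0, 2.0, 3.0, 4.0]], [[1.0, 2.0, 3.0, 4.0]]]
gamma = [[1.0] * 4, [2.0] * 4]
beta = [[0.0] * 4, [1.0] * 4]
y = adaptive_layernorm(x, gamma, beta, eps=0.0)
print(y[0][0])  # normalized values, unit scale and zero bias
print(y[1][0])  # same normalized values, scaled by 2 and shifted by 1
```

With standard layer norm both samples would share one $\gamma$ and one $\beta$, so identical inputs would produce identical outputs.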

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/27_adaptive_layernorm_inference.ipynb)

## Prerequisites and Setup
This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [None]:
# get_ipython().system('nvidia-smi')
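
If you prefer to check for a GPU from Python rather than reading the `nvidia-smi` output by eye, something like the following sketch could be used. The helper name `gpu_visible` is hypothetical, and the check only confirms that the `nvidia-smi` executable is found and exits successfully; it says nothing about the driver or CUDA version.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```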

If running on Colab, you will need to install the cuDNN Python interface.

In [None]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('pip install nvidia-cudnn-frontend')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128')

## Overview

In this notebook, we apply adaptive layer norm to a tensor with the following shape:

- Batch Size: 4
- Sequence Size: 1024
- Embedding Dimension: 768

Let's define these dimensions as constants:

In [None]:
import cudnn
import torch

torch.manual_seed(1)
print("Running with cudnn backend version:", cudnn.backend_version())

# Check for a GPU before creating the cuDNN handle.
assert torch.cuda.is_available()

handle = cudnn.create_handle()

batch, seq_size, embedding_dim = 4, 1024, 768
# Epsilon is a small number to prevent division by 0.
epsilon_value = 1e-3
dtype = torch.float16

## Using Wrapper

This is how to run inference with adaptive layernorm. It is mostly the same as the forward pass in training, except that the mean and inverse variance are not returned.

In [None]:
# input tensors
x_gpu = torch.randn(batch, seq_size, embedding_dim, device="cuda", dtype=dtype)
scale_gpu = torch.randn(batch, 1, embedding_dim, device="cuda", dtype=dtype)
bias_gpu = torch.randn(batch, 1, embedding_dim, device="cuda", dtype=dtype)
eps_cpu = torch.full((1, 1, 1), epsilon_value, dtype=torch.float32, device="cpu")

# forward pass of adaptive layernorm using cuDNN graph
with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["adaln::input", "adaln::scale", "adaln::bias", "adaln::epsilon"],
    outputs=["adaln::Y"],
) as fwd_graph:
    out, mean, inv_var = fwd_graph.adalayernorm(
        name="adaln",
        norm_forward_phase=cudnn.norm_forward_phase.INFERENCE,
        input=x_gpu,
        scale=scale_gpu,
        bias=bias_gpu,
        epsilon=eps_cpu,
    )
    assert mean is None, "mean should be None in inference mode"
    assert inv_var is None, "inv_var should be None in inference mode"
    out.set_name("output").set_output(True).set_data_type(dtype)

out_gpu = fwd_graph(x_gpu, scale_gpu, bias_gpu, eps_cpu, handle=handle)

You can verify the output by comparing it to the result from PyTorch:

In [None]:
# PyTorch reference output
out_ref = torch.nn.functional.layer_norm(x_gpu, (embedding_dim,), eps=epsilon_value)
out_ref = out_ref * scale_gpu + bias_gpu

torch.testing.assert_close(out_gpu, out_ref, atol=5e-3, rtol=3e-3)
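
For reference, `torch.testing.assert_close` treats two values as close when $|\text{actual} - \text{expected}| \le \text{atol} + \text{rtol} \cdot |\text{expected}|$. The scalar sketch below restates that rule with the tolerances used above; `is_close` is a hypothetical helper for illustration, not a PyTorch function, and it ignores details such as NaN handling and dtype checks.

```python
def is_close(actual: float, expected: float, atol: float = 5e-3, rtol: float = 3e-3) -> bool:
    """Scalar version of the absolute-plus-relative tolerance rule."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Around 1.0 the allowed error is 5e-3 + 3e-3 * 1.0 = 8e-3.
print(is_close(1.004, 1.0))
print(is_close(1.01, 1.0))
# Around 0.0 only the absolute tolerance applies.
print(is_close(0.004, 0.0))
```

The absolute term matters for outputs near zero, where a purely relative check would be too strict for half-precision results.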

## Using Python Binding APIs

Create input tensor GPU buffers. We use PyTorch to allocate GPU tensors so we can reuse them easily when we calculate reference outputs.

In [None]:
# allocate input tensor memory, initialize them to random numbers
x_gpu = torch.randn(batch, seq_size, embedding_dim, device="cuda", dtype=dtype)
scale_gpu = torch.randn(batch, 1, embedding_dim, device="cuda", dtype=dtype)
bias_gpu = torch.randn(batch, 1, embedding_dim, device="cuda", dtype=dtype)
eps_cpu = torch.full((1, 1, 1), epsilon_value, dtype=torch.float32, device="cpu")

Then define UIDs for the tensors and create the graph:

In [None]:
from enum import Enum


class UID(Enum):
    X = 0
    SCALE = 1
    BIAS = 2
    EPSILON = 3
    OUT = 4

In [None]:
# Create the cuDNN graph.
graph = cudnn.pygraph(
    handle=handle,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

# Create tensor handles with the graph API, assign UIDs.
x = graph.tensor_like(x_gpu.detach()).set_name("X").set_uid(UID.X.value)
scale = graph.tensor_like(scale_gpu.detach()).set_name("scale").set_uid(UID.SCALE.value)
bias = graph.tensor_like(bias_gpu.detach()).set_name("bias").set_uid(UID.BIAS.value)
epsilon = graph.tensor_like(eps_cpu).set_name("epsilon").set_uid(UID.EPSILON.value)

# Add an adaptive layernorm operation
out, _, _ = graph.adalayernorm(
    name="ADALN",
    input=x,
    norm_forward_phase=cudnn.norm_forward_phase.INFERENCE,
    scale=scale,
    bias=bias,
    epsilon=epsilon,
)

# Enable the output; outputs are disabled by default
out.set_name("output").set_output(True).set_data_type(dtype).set_uid(UID.OUT.value)
# print(graph)

# Build the graph
graph.build([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])

Here we assign UIDs to tensors. A UID is a unique identifier that lets us map tensors created by cuDNN graph API calls, such as `graph.tensor_like()`, to the underlying device memory that will store them. Virtual tensors don't require explicit memory allocated for them, but non-virtual tensors like inputs or outputs need to have UIDs assigned to them.

Alternatively, one can use tensor handles directly in the mapping; however, using UIDs can be more convenient for caching cuDNN graphs.
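
One way to picture the caching use case is the sketch below: graphs are stored in a dictionary keyed by the problem description, and each cached graph is later executed with a fresh UID-to-buffer mapping. The helper `get_or_build`, the key layout, and the stand-in `fake_build` are hypothetical names for illustration; they are not part of the cuDNN API, and a real builder would construct and build a `cudnn.pygraph` as shown above.

```python
from typing import Any, Callable, Dict, Hashable, List

# Hypothetical cache: problem description -> built graph object.
_graph_cache: Dict[Hashable, Any] = {}


def get_or_build(key: Hashable, build_fn: Callable[[], Any]) -> Any:
    """Return the cached graph for `key`, calling `build_fn` only on a miss."""
    if key not in _graph_cache:
        _graph_cache[key] = build_fn()
    return _graph_cache[key]


build_calls: List[int] = []  # records each time the builder actually runs


def fake_build() -> str:
    """Stand-in for the expensive graph construction and build step."""
    build_calls.append(1)
    return "built-graph"


key = ("adaln_inference", (4, 1024, 768), "float16")
g1 = get_or_build(key, fake_build)
g2 = get_or_build(key, fake_build)  # cache hit: fake_build is not called again
print(len(build_calls), g1 is g2)
```

Because the UIDs are plain integers fixed when the graph is constructed, the mapping for a cached graph can be assembled from new buffers without holding on to the original tensor handles.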

For each of our inputs {X, Scale, Bias, Epsilon} and our output Out, we assign a UID.

After validating and building a cuDNN graph, we can execute it. To do this, we have to provide input and output buffers. We use the previously assigned UIDs to associate the tensor handles generated by the graph API with their underlying memory.

The desired input values need to be stored in these buffers before the `graph.execute` call. Because the inputs were allocated with PyTorch, we can simply reuse those buffers.

Note that the EPSILON UID expects a CPU buffer.

In [None]:
# Allocate output tensor memory using PyTorch
out_gpu = torch.empty_like(x_gpu)

# Mapping of (UIDs -> memory)
variant_pack = {
    UID.X.value: x_gpu,
    UID.SCALE.value: scale_gpu,
    UID.BIAS.value: bias_gpu,
    UID.EPSILON.value: eps_cpu,
    UID.OUT.value: out_gpu,
}

workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute(variant_pack, workspace)
torch.cuda.synchronize()

Test cuDNN's output against PyTorch's to check correctness:

In [None]:
# PyTorch reference output
out_ref = torch.nn.functional.layer_norm(x_gpu, (embedding_dim,), eps=epsilon_value)
out_ref = out_ref * scale_gpu + bias_gpu

torch.testing.assert_close(out_gpu, out_ref, atol=5e-3, rtol=3e-3)

Finally, clean up by destroying the cuDNN handle:

In [None]:
cudnn.destroy_handle(handle)