# Overview of the Lower Level Python Binding API

This notebook explains the features and capabilities of the Python binding for cuDNN.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/02_low_level_api.ipynb)

## Prerequisites for running on Colab

This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [None]:
# get_ipython().system('nvidia-smi')

If you are running on Colab, you need to install the cuDNN Python interface.

In [None]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('pip install nvidia-cudnn-frontend')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128')

## Plain Syntax

In the [introduction](00_introduction.ipynb), you saw how to create a graph and execute it with a Python binding directly. Let's see a different example of the same workflow:

In [None]:
import cudnn
import torch

In [None]:
device = torch.device("cuda")
dtype = torch.float16
B, M, N, K = 16, 128, 128, 512

# input tensors
a_gpu = torch.randn(B, M, K, device=device, dtype=dtype)
b_gpu = torch.randn(B, K, N, device=device, dtype=dtype)
d_gpu = torch.randn(1, M, N, device=device, dtype=dtype)

# placeholder for cuDNN output
c_gpu = torch.empty(B, M, N, device=device, dtype=dtype)

# reference output
c_ref = torch.matmul(a_gpu, b_gpu) + d_gpu

In [None]:
# Create a handle and construct the graph
handle = cudnn.create_handle()
graph = cudnn.pygraph(handle=handle, compute_data_type=cudnn.data_type.FLOAT)

a_cudnn = graph.tensor_like(a_gpu)
b_cudnn = graph.tensor_like(b_gpu)
d_cudnn = graph.tensor_like(d_gpu)

ab = graph.matmul(name="mm", A=a_cudnn, B=b_cudnn)
ab.set_data_type(cudnn.data_type.HALF)
c_cudnn = graph.bias(name="bias", input=ab, bias=d_cudnn)
c_cudnn.set_output(True).set_data_type(cudnn.data_type.HALF)

# execute the graph
graph.build([cudnn.heur_mode.A])
workspace = torch.empty(graph.get_workspace_size(), device=device, dtype=torch.uint8)
variant_pack = {
    a_cudnn: a_gpu,  # input
    b_cudnn: b_gpu,  # input
    d_cudnn: d_gpu,  # input
    c_cudnn: c_gpu,  # output
}
graph.execute(variant_pack, workspace, handle=handle)

# verify the result
torch.testing.assert_close(c_gpu, c_ref, atol=5e-3, rtol=3e-3)

This computes a batched matrix multiplication with bias: $C = A \times B + D$. Tensor $A$ is a batch of $M\times K$ matrices and tensor $B$ is a batch of $K\times N$ matrices. The batch size is 16 (the variable `B` in the code, not to be confused with the tensor $B$). A single bias matrix $D$ of size $M\times N$ is broadcast across the batch and added to every $M\times N$ matrix of the product. The result $C$ is a batch of $M\times N$ matrices.
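
To make the broadcast of $D$ concrete, here is a minimal pure-Python reference written with nested lists. The function name `batched_matmul_bias` and the tiny inputs are illustrative only and are not part of cuDNN or PyTorch. It is a sketch of the arithmetic, not of how cuDNN computes it.

```python
def batched_matmul_bias(a, b, d):
    """Reference C[n] = A[n] @ B[n] + D using nested lists.

    a: batch of M x K matrices, b: batch of K x N matrices,
    d: one M x N bias matrix shared by every batch entry.
    """
    out = []
    for a_n, b_n in zip(a, b):
        rows = []
        for i, a_row in enumerate(a_n):
            row = []
            for j in range(len(d[0])):
                acc = sum(a_row[k] * b_n[k][j] for k in range(len(a_row)))
                row.append(acc + d[i][j])  # the same D[i][j] for every batch entry
            rows.append(row)
        out.append(rows)
    return out


# Batch of 2 with M = 1, K = 2, N = 2
a = [[[1, 2]], [[3, 4]]]
b = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
d = [[10, 20]]
c = batched_matmul_bias(a, b, d)
```

Both batch entries receive the same bias row `[10, 20]`, which mirrors how `d_gpu` with a leading dimension of 1 is applied to all 16 matrices in the notebook.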

You created two nodes in the graph for this computation: one for the matrix multiplication and one for the bias addition. The variable `ab` is an intermediate (virtual) tensor that is produced by the matrix multiplication and consumed by the bias addition. You need to set its data type explicitly because the matrix multiplication node can produce different output data types.

The graph output is `c_cudnn`. You mark it as a non-virtual tensor by calling `set_output(True)`. You execute the graph by passing `variant_pack`, a dictionary that maps the cuDNN tensors used in the graph to the tensors you allocated with PyTorch. The output tensors are updated in place when the graph executes.

You can drop the `set_data_type()` calls on the intermediate tensor `ab` and the output tensor `c_cudnn` if you set default data types on the `pygraph` object:


In [None]:
# Create a handle and construct the graph
handle = cudnn.create_handle()
graph = cudnn.pygraph(
    handle=handle,
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

a_cudnn = graph.tensor_like(a_gpu)
b_cudnn = graph.tensor_like(b_gpu)
d_cudnn = graph.tensor_like(d_gpu)

ab = graph.matmul(name="mm", A=a_cudnn, B=b_cudnn)
c_cudnn = graph.bias(name="bias", input=ab, bias=d_cudnn)
c_cudnn.set_output(True)

# execute the graph
graph.build([cudnn.heur_mode.A])
workspace = torch.empty(graph.get_workspace_size(), device=device, dtype=torch.uint8)
variant_pack = {
    a_cudnn: a_gpu,  # input
    b_cudnn: b_gpu,  # input
    d_cudnn: d_gpu,  # input
    c_cudnn: c_gpu,  # output
}
graph.execute(variant_pack, workspace, handle=handle)
torch.cuda.synchronize()

# verify the result
torch.testing.assert_close(c_gpu, c_ref, atol=5e-3, rtol=3e-3)

## Using decorators

You can also build graphs with cuDNN's decorators. The main benefit is that the build step becomes implicit. See below for an example:

In [None]:
def matmul_cache_key(handle, a, b, bias):
    """Custom key function for matmul + bias"""
    return (
        tuple(a.shape),
        tuple(b.shape),
        tuple(a.stride()),
        tuple(b.stride()),
        a.dtype,
        b.dtype,
    )


@cudnn.jit(heur_modes=[cudnn.heur_mode.A, cudnn.heur_mode.B])
@cudnn.graph_cache(key_fn=matmul_cache_key)
def create_matmul_bias_graph(handle, a, b, bias):
    with cudnn.graph(handle) as (g, _):
        a_cudnn = g.tensor_like(a)
        b_cudnn = g.tensor_like(b)
        bias_cudnn = g.tensor_like(bias)
        c_cudnn = g.matmul(name="matmul", A=a_cudnn, B=b_cudnn)
        out = g.bias(name="bias", input=c_cudnn, bias=bias_cudnn)
        out.set_output(True).set_data_type(cudnn.data_type.HALF)

    return g, [a_cudnn, b_cudnn, bias_cudnn, out]  # Return raw graph and tensors


g, uids = create_matmul_bias_graph(handle, a_gpu, b_gpu, d_gpu)
a_uid, b_uid, bias_uid, out_uid = uids

variant_pack = {
    a_uid: a_gpu,
    b_uid: b_gpu,
    bias_uid: d_gpu,
    out_uid: c_gpu,
}
workspace = torch.empty(g.get_workspace_size(), device="cuda", dtype=torch.uint8)
g.execute(variant_pack, workspace)
torch.cuda.synchronize()

Notice that you did not call `graph.build()` in this example. Instead, you defined a function `create_matmul_bias_graph()` that returns a graph object and a list of cuDNN tensors. This part is the same as in the previous example. The difference is that you decorated the function with `@cudnn.jit`, which specifies the heuristic modes used to build the graph.

You also decorated the function with `@cudnn.graph_cache` to specify a custom key function for the graph cache. The custom key function `matmul_cache_key()` depends on the shapes, strides, and data types of the input tensors `a` and `b`, but not on the handle or any other attribute. This way, you can run the following line multiple times without rebuilding the graph:

```python
g, uids = create_matmul_bias_graph(handle, a_gpu, b_gpu, d_gpu)
```

Note that when you call `create_matmul_bias_graph()`, you pass in a handle and several PyTorch tensors. The data held by the tensors does not matter: as long as the data types and layouts are the same, the cache returns the same graph. This is convenient because, although the operation is logically always a matrix multiplication with bias, the cuDNN graph to use differs depending on the floating-point precision and the dimensions of the input tensors. The graph cache keeps track of which graph to use for you.
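
The caching idea can be sketched in plain Python. The decorator `graph_cache_sketch`, the key function `shape_key`, and the dictionary standing in for a built graph are all hypothetical. This is not how `cudnn.graph_cache` is implemented, only an illustration of looking up a result by a user-supplied key.

```python
import functools


def graph_cache_sketch(key_fn):
    """Hypothetical stand-in for a keyed graph cache; not cuDNN's implementation."""

    def decorator(build_fn):
        cache = {}

        @functools.wraps(build_fn)
        def wrapper(*args, **kwargs):
            key = key_fn(*args, **kwargs)
            if key not in cache:
                # Expensive path: runs once per distinct key.
                cache[key] = build_fn(*args, **kwargs)
            return cache[key]

        wrapper.cache = cache
        return wrapper

    return decorator


build_calls = []


def shape_key(handle, a_shape, b_shape):
    # The handle is deliberately left out of the key.
    return (tuple(a_shape), tuple(b_shape))


@graph_cache_sketch(key_fn=shape_key)
def build_graph(handle, a_shape, b_shape):
    build_calls.append((a_shape, b_shape))
    return {"a": tuple(a_shape), "b": tuple(b_shape)}  # placeholder for a built graph


g1 = build_graph("handle-0", (16, 128, 512), (16, 512, 128))
g2 = build_graph("handle-1", (16, 128, 512), (16, 512, 128))  # same key: cache hit
g3 = build_graph("handle-0", (8, 128, 512), (8, 512, 128))  # new shapes: rebuilt
```

Because the key ignores the handle, the second call returns the object built by the first call even though a different handle is passed. Anything left out of the key is assumed not to affect the graph, so the key function needs to cover every property that does.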


## Building the graph

In the first example above, you prepared a graph for execution by calling `graph.build()` with a list of heuristic modes. This is a convenience function that combines several steps into one. The example below breaks it down into the individual steps:

In [None]:
# Create a handle and construct the graph
handle = cudnn.create_handle()
graph = cudnn.pygraph(
    handle=handle,
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

a_cudnn = graph.tensor_like(a_gpu)
b_cudnn = graph.tensor_like(b_gpu)
d_cudnn = graph.tensor_like(d_gpu)

ab = graph.matmul(name="mm", A=a_cudnn, B=b_cudnn)
c_cudnn = graph.bias(name="bias", input=ab, bias=d_cudnn)
c_cudnn.set_output(True)

# build and validate the graph
graph.validate()
graph.build_operation_graph()
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
graph.check_support()
graph.build_plans()

# execute the graph
workspace = torch.empty(graph.get_workspace_size(), device=device, dtype=torch.uint8)
variant_pack = {
    a_cudnn: a_gpu,  # input
    b_cudnn: b_gpu,  # input
    d_cudnn: d_gpu,  # input
    c_cudnn: c_gpu,  # output
}
graph.execute(variant_pack, workspace, handle=handle)
torch.cuda.synchronize()

# verify the result
torch.testing.assert_close(c_gpu, c_ref, atol=5e-3, rtol=3e-3)

In this example, you broke the build process down into individual steps. This gives you more control over how the execution plan is created and built.
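
One practical use of the step-by-step form is reporting which stage rejected a graph. The sketch below wraps the five method calls from the example in a helper. The helper name `build_stepwise` and the stub class `RecordingGraph` are hypothetical. The stub only stands in for `cudnn.pygraph` so that the sketch runs without a GPU, and the strings passed as heuristic modes are placeholders for `cudnn.heur_mode` values.

```python
def build_stepwise(graph, heur_modes):
    """Run the individual build steps in order and report the failing step.

    `graph` is any object exposing the five methods used in this notebook.
    """
    steps = [
        ("validate", lambda: graph.validate()),
        ("build_operation_graph", lambda: graph.build_operation_graph()),
        ("create_execution_plans", lambda: graph.create_execution_plans(heur_modes)),
        ("check_support", lambda: graph.check_support()),
        ("build_plans", lambda: graph.build_plans()),
    ]
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            raise RuntimeError(f"graph build failed at step '{name}'") from exc
        completed.append(name)
    return completed


class RecordingGraph:
    """Hypothetical stub that records method calls instead of building anything."""

    def __init__(self, fail_at=None):
        self.calls = []
        self.fail_at = fail_at

    def __getattr__(self, name):
        # Only reached for names that are not real attributes, i.e. the build steps.
        def method(*args):
            if name == self.fail_at:
                raise ValueError("unsupported")
            self.calls.append(name)

        return method


stub = RecordingGraph()
completed = build_stepwise(stub, ["A", "FALLBACK"])
```

With a real graph you would pass the `pygraph` object and a list such as `[cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK]` instead of the stub and the placeholder strings.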

## Serialization and Deserialization

A built graph can be serialized and deserialized, which saves the overhead of rebuilding it from scratch. Below is an example of how to serialize a graph:

In [None]:
# Create a handle and construct the graph
handle = cudnn.create_handle()
graph = cudnn.pygraph(
    handle=handle,
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

a_cudnn = graph.tensor_like(a_gpu)
b_cudnn = graph.tensor_like(b_gpu)
d_cudnn = graph.tensor_like(d_gpu)
a_cudnn.set_uid(0)
b_cudnn.set_uid(1)
d_cudnn.set_uid(2)

ab = graph.matmul(name="mm", A=a_cudnn, B=b_cudnn)
c_cudnn = graph.bias(name="bias", input=ab, bias=d_cudnn)
c_cudnn.set_output(True).set_uid(3)

# validate the graph and serialize it
graph.validate()
graph.build_operation_graph()
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
graph.check_support()
graph.build_plans()
serialized_data = graph.serialize()

In the above, you created a graph, built the execution plans, and serialized it. The serialized data is a list of integers representing a byte stream. You can save the serialized data and reuse it later. Here is how you can deserialize the graph and execute it:

In [None]:
# create new graph from serialized data
newgraph = cudnn.pygraph()
newgraph.deserialize(serialized_data)

# execute the graph
workspace = torch.empty(newgraph.get_workspace_size(), device=device, dtype=torch.uint8)
variant_pack = {
    0: a_gpu,  # input
    1: b_gpu,  # input
    2: d_gpu,  # input
    3: c_gpu,  # output
}
newgraph.execute(variant_pack, workspace, handle=handle)
torch.cuda.synchronize()

# verify the result
c_ref = torch.matmul(a_gpu, b_gpu) + d_gpu
torch.testing.assert_close(c_gpu, c_ref, atol=5e-3, rtol=3e-3)

In the above, you created a new `pygraph` object and populated it with the serialized data. You can then execute the graph immediately, without validating it or building execution plans.
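
Since the serialized data is described as a list of integers representing a byte stream, one way to keep it between sessions is to write it to a binary file. The sketch below assumes every integer is a byte value in the range 0 to 255. The helper names `save_serialized` and `load_serialized`, the file name, and the stand-in payload are hypothetical.

```python
import os
import tempfile


def save_serialized(serialized_data, path):
    """Write a list of byte values (ints in 0..255) to a binary file."""
    with open(path, "wb") as f:
        f.write(bytes(serialized_data))


def load_serialized(path):
    """Read the file back into a list of ints, matching the original form."""
    with open(path, "rb") as f:
        return list(f.read())


# Stand-in payload; in the notebook this would be the result of graph.serialize().
fake_serialized = [0, 17, 255, 42]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matmul_bias.graph")
    save_serialized(fake_serialized, path)
    restored = load_serialized(path)
```

The list returned by `load_serialized()` has the same form as the one passed to `deserialize()` in the example above. Whether a saved graph remains valid on a different GPU or cuDNN version is not covered here, so it is safest to be ready to rebuild the graph if deserialization fails.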

This example has one major difference from the previous ones: each cuDNN tensor involved is given an explicit UID. Every tensor is assigned a UID when its graph is built, but you can set the UID manually with `set_uid()` as long as the UIDs are unique. You need to set the UIDs in this example because the original `graph` object and the new `newgraph` object are distinct. The `variant_pack` for the new graph must refer to tensors in that graph, but deserialization gives you no tensor objects to use as keys. Therefore, you key the `variant_pack` by the tensors' UIDs instead.
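
When the UIDs are the only link between the code that serializes a graph and the code that executes it, it can help to keep them in one named table. The sketch below builds a UID-keyed `variant_pack` from such a table and checks that the UIDs are unique. The helper `make_uid_variant_pack`, the `UIDS` table, and the strings standing in for device tensors are hypothetical.

```python
def make_uid_variant_pack(uids, buffers):
    """Build a {uid: buffer} mapping from two name-keyed dictionaries.

    uids:    name -> integer UID assigned with set_uid() before serialization
    buffers: name -> device tensor (any object in this sketch)
    """
    if len(set(uids.values())) != len(uids):
        raise ValueError("tensor UIDs must be unique")
    missing = set(uids) - set(buffers)
    if missing:
        raise KeyError(f"no buffer supplied for: {sorted(missing)}")
    return {uid: buffers[name] for name, uid in uids.items()}


UIDS = {"a": 0, "b": 1, "bias": 2, "out": 3}
pack = make_uid_variant_pack(
    UIDS, {"a": "a_gpu", "b": "b_gpu", "bias": "d_gpu", "out": "c_gpu"}
)
```

The same `UIDS` table could be used for the `set_uid()` calls when the graph is constructed, so the numbers are written in only one place.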

Let's see another example: a graph with a single scaled dot-product attention (SDPA) operation:

In [None]:
b = 2  # batch size
s_q = 1024  # query sequence length
s_kv = 1024  # key+value sequence length
h = 6  # query heads
d = 64  # query+key embedding dimension per head
attn_scale = 1 / d**0.5
dtype = torch.bfloat16

shape_q = (b, h, s_q, d)
shape_k = shape_v = (b, h, s_kv, d)
shape_o = (b, h, s_q, d)

stride_q = stride_o = (s_q * h * d, d, h * d, 1)
stride_k = stride_v = (s_kv * h * d, d, h * d, 1)

# allocate PyTorch tensors as input and output
q_gpu = torch.randn(b * h * s_q * d, dtype=dtype, device="cuda").as_strided(
    shape_q, stride_q
)
k_gpu = torch.randn(b * h * s_kv * d, dtype=dtype, device="cuda").as_strided(
    shape_k, stride_k
)
v_gpu = torch.randn(b * h * s_kv * d, dtype=dtype, device="cuda").as_strided(
    shape_v, stride_v
)
o_gpu = torch.empty(b * h * s_q * d, dtype=dtype, device="cuda").as_strided(
    shape_o, stride_o
)
stats_gpu = torch.empty(b, h, s_q, 1, dtype=dtype, device="cuda")

# define a graph
graph = cudnn.pygraph(
    io_data_type=cudnn.data_type.BFLOAT16,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    handle=handle,
)

q = graph.tensor_like(q_gpu)
k = graph.tensor_like(k_gpu)
v = graph.tensor_like(v_gpu)

o, stats = graph.sdpa(
    name="sdpa",
    q=q,
    k=k,
    v=v,
    generate_stats=True,
    attn_scale=attn_scale,
    use_causal_mask=True,
)
q.set_uid(0)
k.set_uid(1)
v.set_uid(2)
o.set_uid(3)
stats.set_uid(4)
o.set_output(True).set_dim(shape_o).set_stride(stride_o)
stats.set_output(True).set_data_type(cudnn.data_type.BFLOAT16)

# serialize the graph
graph.validate()
graph.build_operation_graph()
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
graph.check_support()
graph.build_plans()
serialized_data = graph.serialize()

# deserialize the graph
deserialized_graph = cudnn.pygraph()
deserialized_graph.deserialize(serialized_data)

# execute the graph
workspace = torch.empty(
    deserialized_graph.get_workspace_size(), device="cuda", dtype=torch.uint8
)
variant_pack = {
    0: q_gpu,
    1: k_gpu,
    2: v_gpu,
    3: o_gpu,
    4: stats_gpu,
}
deserialized_graph.execute(variant_pack, workspace)
torch.cuda.synchronize()

# verify the results
o_ref = torch.nn.functional.scaled_dot_product_attention(
    q_gpu, k_gpu, v_gpu, is_causal=True, scale=attn_scale
)
torch.testing.assert_close(o_ref, o_gpu, atol=5e-3, rtol=3e-3)
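
The strides in the SDPA example describe tensors whose logical shape is `(b, h, s, d)` but whose memory is laid out in `(b, s, h, d)` order, which is why the stride of the head dimension is `d` and the stride of the sequence dimension is `h * d`. The sketch below derives those strides from the memory order. The helper `strides_for_layout` is hypothetical and is not part of cuDNN or PyTorch.

```python
def strides_for_layout(shape, physical_order):
    """Return element strides for `shape` when memory follows `physical_order`.

    physical_order lists dimension indices from outermost to innermost in memory.
    """
    strides = [0] * len(shape)
    running = 1
    for dim in reversed(physical_order):
        strides[dim] = running
        running *= shape[dim]
    return tuple(strides)


b, h, s_q, d = 2, 6, 1024, 64
# Logical dims (b, h, s, d) have indices (0, 1, 2, 3); memory order is b, s, h, d.
stride_q = strides_for_layout((b, h, s_q, d), physical_order=(0, 2, 1, 3))
```

The result equals `(s_q * h * d, d, h * d, 1)`, the value assigned to `stride_q` in the example above.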