# Introduction to cuDNN Frontend Python API
This notebook introduces the cuDNN frontend (FE) graph Python API and shows how to perform a single forward-propagation (fprop) convolution.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/00_introduction.ipynb)

## Prerequisites for running on Colab

This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [None]:
# get_ipython().system('nvidia-smi')

If you are running on Colab, you will need to install the cuDNN Python interface.

In [None]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('pip install nvidia-cudnn-frontend')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128')

## Using cuDNN Wrapper

The simplest way to use cuDNN through the frontend API is to use the `Graph` wrapper object. Below is an example of how to perform a single fprop convolution.

In [None]:
import cudnn

print(cudnn.backend_version())

`backend_version()` returns an integer representing the cuDNN backend version, e.g. 90000. You can use it to check whether the installed cuDNN backend supports the operations you need.
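As a rough illustration of such a check, the sketch below decodes the version integer into its components. The encoding is an assumption based on cuDNN's version macro (`major * 10000 + minor * 100 + patch` from cuDNN 9 onwards, `major * 1000 + minor * 100 + patch` before that), and the helper names are hypothetical, not part of the `cudnn` package.

```python
from typing import Tuple


def decode_cudnn_version(version: int) -> Tuple[int, int, int]:
    """Split a cuDNN version integer into (major, minor, patch).

    Assumption: cuDNN 9+ encodes major * 10000 + minor * 100 + patch,
    while earlier releases encode major * 1000 + minor * 100 + patch.
    """
    if version >= 90000:
        major, rest = divmod(version, 10000)
    else:
        major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def backend_at_least(version: int, minimum: Tuple[int, int, int]) -> bool:
    """Return True if the decoded version is at least `minimum`."""
    return decode_cudnn_version(version) >= minimum


print(decode_cudnn_version(90000))
print(backend_at_least(90000, (9, 0, 0)))
```

In the notebook you would pass `cudnn.backend_version()` as the `version` argument.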

In the following, we will use PyTorch to hold random tensors on the GPU and then use cuDNN to perform a convolution operation. Let's create the tensors in PyTorch:

In [None]:
import torch

torch.manual_seed(42)
assert torch.cuda.is_available()
device = torch.device("cuda")

handle = cudnn.create_handle()

# Create tensor in NHWC format then permute to NCHW
X_gpu = torch.randn(8, 56, 56, 64, device=device, dtype=torch.float16).permute(
    0, 3, 1, 2
)
W_gpu = torch.randn(32, 3, 3, 64, device=device, dtype=torch.float16).permute(
    0, 3, 1, 2
)

This creates two PyTorch tensors on the GPU, with [physical layout in NHWC](https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/) but logical layout in NCHW. This is the format [expected by cuDNN](https://docs.nvidia.com/deeplearning/cudnn/frontend/latest/developer/core-concepts.html#tensor-descriptor).
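To make the difference between physical and logical layout concrete, here is a minimal pure-Python sketch that mimics what `permute` does to the shape and strides of `X_gpu`. The helper names are hypothetical and no GPU is needed. It only does the bookkeeping: the data stays in NHWC order, while the dimensions are reported in NCHW order.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = []
    running = 1
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return tuple(reversed(strides))


def permute_view(shape, strides, order):
    """Reorder dimensions without moving data, as a permuted view does."""
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)


physical_shape = (8, 56, 56, 64)  # N, H, W, C as laid out in memory
physical_strides = contiguous_strides(physical_shape)

# Same permutation as in the cell above: NHWC -> NCHW
logical_shape, logical_strides = permute_view(
    physical_shape, physical_strides, (0, 3, 1, 2)
)

print(logical_shape)    # dimensions in N, C, H, W order
print(logical_strides)  # the channel stride is 1, so C is innermost in memory
```

A channel stride of 1 is what identifies the tensor as channels-last even though its logical shape is NCHW.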

`handle` is a pointer to an opaque structure holding the cuDNN library context. The cuDNN library context must be created using `create_handle()` and the returned handle must be passed to all subsequent library function calls as needed. The context should be destroyed at the end using `destroy_handle()`. 

The context is associated with only one GPU device, the current device at the time of the call to `create_handle()`. However, multiple contexts can be created on the same GPU device.

`handle` is used when the graph is executed, and it determines which GPU the kernel is launched on.
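Because a handle has to be destroyed once it is no longer needed, one option is to tie its lifetime to a `with` block. The sketch below is a generic pattern rather than anything the `cudnn` package provides: `managed_handle` is a hypothetical helper that takes the create and destroy functions as arguments.

```python
from contextlib import contextmanager


@contextmanager
def managed_handle(create, destroy):
    """Create a handle, yield it, and always destroy it afterwards."""
    handle = create()
    try:
        yield handle
    finally:
        # Runs even if the body of the `with` block raises
        destroy(handle)
```

With cuDNN this would be used as `with managed_handle(cudnn.create_handle, cudnn.destroy_handle) as handle: ...`. The rest of this notebook keeps the explicit `create_handle()` / `destroy_handle()` calls instead.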

The next step is to create a cuDNN graph for the convolution:

In [None]:
with cudnn.Graph(
    inputs=["conv2d::image", "conv2d::weight"],
    outputs=["conv_out"],
) as graph:
    Y = graph.conv_fprop(
        image=X_gpu,  # referencing tensor layout and type
        weight=W_gpu,  # referencing tensor layout and type
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
        compute_data_type=cudnn.data_type.FLOAT,
        name="conv2d",
    )
    # either set io_data_type in Graph or set_data_type on the output tensor
    Y.set_output(True).set_data_type(cudnn.data_type.HALF).set_name("conv_out")

This code creates a cuDNN graph with a single convolution node. The node is assigned the name `conv2d`. It consumes two input tensors, `image` and `weight`. The output of this node, held by the variable `Y`, is referenced by the name `conv_out` and is regarded as the output of the graph.

The graph is created with a context manager, which automatically finalizes the graph when the block exits. The graph has been assigned the input and output lists `["conv2d::image", "conv2d::weight"]` and `["conv_out"]` respectively. This means the first input is the `image` argument of the node named `conv2d` and the second input is the `weight` argument of the same node. Because the output tensor was given the name `conv_out`, it can be referenced by that name instead.

There are several features shown above:

- `Graph` is the main class for creating a cuDNN graph using the context manager.
- Once a graph is initialized, you can build it by creating nodes one by one.
- Even though you provided the PyTorch tensor `X_gpu` as the argument `image` to the node `conv2d`, it is only used as a placeholder to define the properties of the input tensor (e.g. data type, physical layout, logical layout). The actual data is not used at this stage.
- Output tensors from a node (cuDNN tensor objects) should be marked as output using `set_output(True)` to indicate that the graph should hold a reference to the tensor and return it to the user.
- A cuDNN graph is flexible, so where there is an ambiguity you need to provide additional information to set up the graph, such as the data type of a node's output.

Once you finish defining the graph, you can execute it:

In [None]:
# Execute the graph on the actual tensor data
Y_gpu = graph(X_gpu, W_gpu, handle=handle)

The graph is called like a function. The input arguments are the actual tensors to be used, and their order follows the `inputs` list given when the graph was created. The output from the graph is a PyTorch tensor created dynamically.

Note that we used `X_gpu` and `W_gpu` when we defined the graph and reused them here. Reusing them is not required, but the tensors you invoke the graph with must be compatible (in data type and layout) with the tensors you used to define it.
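One way to guard against a mismatch is to compare the properties the graph was defined with against those of the tensors passed at call time. The sketch below assumes that matching shape, strides and data type is a sufficient (and deliberately conservative) test. `describe` and `is_compatible` are hypothetical helpers that work on any object exposing `shape`, `stride()` and `dtype`, such as a PyTorch tensor.

```python
def describe(tensor):
    """Collect the properties a graph definition depends on."""
    return (tuple(tensor.shape), tuple(tensor.stride()), tensor.dtype)


def is_compatible(defined_with, called_with):
    """True if both tensors share shape, strides (layout) and data type."""
    return describe(defined_with) == describe(called_with)
```

For example, `assert is_compatible(X_gpu, X_new)` before calling `graph(X_new, W_gpu, handle=handle)` would catch a tensor that is NCHW-contiguous instead of channels-last.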

To verify that the graph is working correctly, you can check the result with that generated by PyTorch:

In [None]:
Y_ref = torch.nn.functional.conv2d(X_gpu, W_gpu, padding=1)
torch.testing.assert_close(Y_gpu, Y_ref, atol=5e-3, rtol=3e-3)

With the same input tensors `X_gpu` and `W_gpu`, you can tell that the result from cuDNN `Y_gpu` and the result from PyTorch `Y_ref` are numerically close.
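For reference, "numerically close" here means the element-wise criterion documented for `torch.testing.assert_close`: `|actual - expected| <= atol + rtol * |expected|`. The scalar sketch below applies that rule with the tolerances used above. `is_close` is a hypothetical helper for illustration only.

```python
def is_close(actual, expected, atol=5e-3, rtol=3e-3):
    """Scalar version of the closeness rule: absolute plus relative slack."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Near zero only the absolute tolerance matters
print(is_close(0.004, 0.0))
# For larger values the relative term widens the allowed error
print(is_close(10.03, 10.0))
```

The combined tolerance leaves room for the rounding differences expected when half-precision results are produced by two different kernels.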

## Using cuDNN Frontend Python Bindings

Create a `pygraph` object so that a graph can be created:

In [None]:
graph = cudnn.pygraph(
    handle=handle,
    name="cudnn_graph_0",
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
)

A `pygraph` object is the subgraph that is provided to cuDNN for execution.

The arguments you provided to create the `pygraph` object are optional:

- You can assign a `name` to the graph for future reference.
- The `io_data_type` provides the default data type of the input and output tensors of the graph. It can be overridden by the data type set on an individual tensor.
- The `compute_data_type` provides the default data type in which computation happens. It can be overridden by the compute data type set on an individual operation.

With the `pygraph` object created, you can define nodes in this graph. All tensors used by a node should be specified as a cuDNN `tensor` object. For the convolution operation in the previous example, we need to create two tensors, `X` and `W`, respectively:

In [None]:
X = graph.tensor(
    name="X",
    dim=[8, 64, 56, 56],
    stride=[56 * 56 * 64, 1, 56 * 64, 64],
    data_type=cudnn.data_type.HALF,
)
W = graph.tensor(name="W", dim=[32, 64, 3, 3], stride=[3 * 3 * 64, 1, 3 * 64, 64])

`graph.tensor` creates an entry edge to the graph. The main attributes of the tensor class are `dim`, `stride` and `data_type`. Other attributes include `is_virtual` (mainly used for interior tensors in the graph) and `is_pass_by_value` (for scalar tensors). Assigning a `name` to the tensor is optional.

Note that the `W` tensor above was created without `data_type`. Its data type is deduced from the `io_data_type` of the `pygraph` object that was used to create the tensor.

Next is the convolution node, which uses the cuDNN tensors `X` and `W` as input:

In [None]:
Y = graph.conv_fprop(
    X,
    W,
    padding=[1, 1],
    stride=[1, 1],
    dilation=[1, 1],
    compute_data_type=cudnn.data_type.FLOAT,
)

This performs a *convolution forward* operation with padding `[1, 1]` on the input tensor `X`. You can run `help(cudnn.pygraph.conv_fprop)` to see an explanation of the other parameters: `compute_data_type`, `stride` and `dilation`.

Note that when you use the Python binding directly, you must provide `X` and `W` as cuDNN tensors. Using other tensors, such as PyTorch tensors, is not allowed.

The output of the convolution node above is a cuDNN tensor. Since you want to read from it, you should mark it as output using `set_output(True)`:

In [None]:
Y.set_output(True)

By default, the output of any operation is *virtual* (it has no device pointer associated with it). This is because the output can be fed as input to the next operation in the graph. To terminate the graph, or to mark a tensor as *non-virtual*, you need to set it as an output. Multiple tensors can be marked as output.

At this point, the graph is defined. You should finalize it:

In [None]:
graph.build([cudnn.heur_mode.A])
# print(graph)

The following things happen in the above call:

- Validation of inputs and outputs, and deduction of output shapes.
- Lowering pass into the cuDNN dialect.
- Heuristics query to determine which execution plan to run.
- Runtime compilation of the plan, if needed.

This function can be split into its constituents to give you finer control over each phase.
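A sketch of what that split might look like is shown below. The method names (`validate`, `build_operation_graph`, `create_execution_plans`, `check_support`, `build_plans`) are the ones used in the cuDNN frontend samples as far as I am aware. Treat them as an assumption and confirm them with `help(cudnn.pygraph)` for your installed version. `build_in_phases` itself is a hypothetical wrapper.

```python
def build_in_phases(graph, heur_modes):
    """Run the build steps one at a time instead of a single build() call.

    Assumption: `graph` exposes the five methods below, as the cuDNN
    frontend `pygraph` object is believed to.
    """
    graph.validate()                          # check inputs/outputs, deduce shapes
    graph.build_operation_graph()             # lower into the cuDNN dialect
    graph.create_execution_plans(heur_modes)  # query heuristics for candidate plans
    graph.check_support()                     # keep only plans supported on this GPU
    graph.build_plans()                       # compile the plan(s) if needed
```

In this notebook the call would be `build_in_phases(graph, [cudnn.heur_mode.A])` in place of `graph.build([cudnn.heur_mode.A])`. Splitting the steps lets you time each phase or handle a failure in one phase separately.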

Once the graph is built, you can use `print()` to inspect the graph after shape and data type deduction. For example, the shape and data type of the tensor `Y` above, the output of the convolution, are known only after the graph is built.
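The shape deduced for `Y` can be cross-checked by hand with the standard convolution output-size formula. The sketch below is a plain Python calculation, not a cuDNN call, and both function names are hypothetical.

```python
def conv_out_extent(size, kernel, padding, stride, dilation):
    """Output length along one spatial axis of a convolution."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def conv_fprop_shape(x_dim, w_dim, padding, stride, dilation):
    """Output dims [N, K, ...] for input [N, C, ...] and filter [K, C, ...]."""
    n, _, *spatial = x_dim
    k, _, *kernel = w_dim
    out = [
        conv_out_extent(s, f, p, st, d)
        for s, f, p, st, d in zip(spatial, kernel, padding, stride, dilation)
    ]
    return [n, k, *out]


# Same parameters as the conv_fprop node defined above
print(conv_fprop_shape([8, 64, 56, 56], [32, 64, 3, 3], [1, 1], [1, 1], [1, 1]))
```

The result is the shape that the output buffer passed to the graph must have.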

You need to provide the actual data to execute the graph. Let's use PyTorch to create a few random tensors:


In [None]:
X_gpu = torch.randn(
    8, 64, 56, 56, requires_grad=False, device="cuda", dtype=torch.float16
).to(memory_format=torch.channels_last)
W_gpu = torch.randn(
    32, 64, 3, 3, requires_grad=False, device="cuda", dtype=torch.float16
).to(memory_format=torch.channels_last)
Y_gpu = torch.zeros(
    8, 32, 56, 56, requires_grad=False, device="cuda", dtype=torch.float16
).to(memory_format=torch.channels_last)

These tensors reside on the GPU. They use the "channels last" physical layout (required for cuDNN convolution operations). You are not required to use PyTorch; cuDNN also supports other DLPack-compatible tensors on the GPU.

To execute the graph, you need to create a workspace and provide a mapping between the cuDNN tensors and the actual allocated tensors (the "variant pack"):

In [None]:
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
variant_pack = {X: X_gpu, W: W_gpu, Y: Y_gpu}
graph.execute(variant_pack, workspace, handle=handle)
torch.cuda.synchronize()

The workspace is a buffer that the graph can use during execution. You should allocate a sufficiently large block of memory on the GPU. The required size can be obtained from the graph using `get_workspace_size()`.

The execute call launches the kernel on the GPU device. The output, which `Y` is mapped to, is written to `Y_gpu`.

You can verify the result with PyTorch:

In [None]:
Y_ref = torch.nn.functional.conv2d(X_gpu, W_gpu, padding=1)
torch.testing.assert_close(Y_gpu, Y_ref, atol=5e-3, rtol=3e-3)

Finally, you can destroy the handle when you no longer need it:

In [None]:
cudnn.destroy_handle(handle)