# Introduction to the cuDNN frontend Python API
This notebook introduces the cuDNN frontend (FE) graph Python API and shows how to run a single forward (fprop) convolution.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/00_introduction.ipynb)

## Prerequisites for running on Colab
This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [None]:
# get_ipython().system('nvidia-smi')

If you are running on Colab, you will need to install the cuDNN Python interface.

In [None]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('CUDNN_PATH=`pip show nvidia-cudnn-cu12  | grep Location | cut -d":" -f2 | xargs`/nvidia/cudnn pip install git+https://github.com/NVIDIA/cudnn-frontend.git')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121')

#### General Setup

In [None]:
import cudnn

In [None]:
print(cudnn.backend_version())

handle = cudnn.create_handle()

`backend_version()` returns the cuDNN backend version as an integer, e.g. 90000.
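
The integer packs the major, minor and patch numbers together. As a rough sketch, and assuming the cuDNN 9 packing of `major * 10000 + minor * 100 + patch` (cuDNN 8 and earlier used `major * 1000 + minor * 100 + patch`), a hypothetical helper `decode_cudnn_version`, which is not part of the `cudnn` package, can split it apart:

```python
from typing import Tuple


def decode_cudnn_version(version: int) -> Tuple[int, int, int]:
    """Split a packed cuDNN version integer into (major, minor, patch).

    Hypothetical helper, not part of the cudnn package. Assumes the
    cuDNN 9+ packing (major * 10000 + minor * 100 + patch) for values of
    90000 and above, and the older packing (major * 1000 + ...) below that.
    """
    major_scale = 10000 if version >= 90000 else 1000
    major, rest = divmod(version, major_scale)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_cudnn_version(90000))  # (9, 0, 0)
```

In the notebook this could be called as `decode_cudnn_version(cudnn.backend_version())`.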

`handle` is a pointer to an opaque structure holding the cuDNN library context. The cuDNN library context must be created using `create_handle()` and the returned handle must be passed to all subsequent library function calls as needed. The context should be destroyed at the end using `destroy_handle()`. 

The context is associated with only one GPU device, the current device at the time of the call to `create_handle()`. However, multiple contexts can be created on the same GPU device.
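
Because the handle has to be destroyed explicitly, it can be convenient to tie its lifetime to a `with` block. The following is only a sketch of that pattern: `managed_handle` is a hypothetical helper, and the create and destroy functions are passed in as arguments so that the example runs without a GPU. It assumes the destroy function takes the handle as its single argument.

```python
from contextlib import contextmanager


@contextmanager
def managed_handle(create, destroy):
    """Create a handle, yield it, and always destroy it afterwards.

    `create` and `destroy` are passed in so the pattern can be tried
    without a GPU; with cuDNN they would be cudnn.create_handle and
    cudnn.destroy_handle (assumed to take the handle as its argument).
    """
    handle = create()
    try:
        yield handle
    finally:
        # Runs even if the body of the `with` block raises.
        destroy(handle)


# Stand-in functions that only record what happened (illustrative only).
events = []
with managed_handle(lambda: "fake-handle", events.append) as h:
    events.append("using " + h)

print(events)  # ['using fake-handle', 'fake-handle']
```

With cuDNN installed, the same helper would be used as `with managed_handle(cudnn.create_handle, cudnn.destroy_handle) as handle:`.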

In [None]:
graph = cudnn.pygraph(handle = handle, name = "cudnn_graph_0", io_data_type = cudnn.data_type.HALF, compute_data_type = cudnn.data_type.FLOAT)

`pygraph` is the subgraph that is handed to cuDNN for execution.

Each component in the graph can take a `name` for future reference. (Optional)

The `io_data_type` sets the default data type of the input and output tensors of the graph. It can be overridden by the data type set on an individual tensor. (Optional)

The `compute_data_type` sets the default data type in which computation happens. It can be overridden by the compute data type set on an individual operation. (Optional)

In [None]:
X = graph.tensor(name="X", dim=[8, 64, 56, 56], stride=[56 * 56 * 64, 1, 56 * 64, 64], data_type=cudnn.data_type.HALF)

In [None]:
W = graph.tensor(name="W", dim=[32, 64, 3, 3], stride=[3 * 3 * 64, 1, 3 * 64, 64])

`graph.tensor` creates an entry edge to the graph. The main attributes of the tensor class are `dim`, `stride` and `data_type`.
Other attributes include `is_virtual` (mainly used for interior nodes in the graph) and `is_pass_by_value` (for scalar tensors).

Note that the "W" tensor above does not specify `data_type`. Its data type is deduced from the `io_data_type` that was passed to `pygraph` above.
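
The `dim` lists above are in [N, C, H, W] order, while the `stride` values describe a channels-last (NHWC) memory layout, where the channel index moves fastest. As a small sketch, a hypothetical helper `channels_last_strides` (not part of `cudnn`) reproduces the strides used for "X" and "W":

```python
def channels_last_strides(dim):
    """Return NHWC strides for a tensor whose dims are listed as [N, C, H, W].

    Hypothetical helper for illustration; not part of the cudnn package.
    """
    n, c, h, w = dim
    # Stepping one image skips h*w*c elements, one row skips w*c,
    # one column skips c, and channels are adjacent in memory.
    return [h * w * c, 1, w * c, c]


print(channels_last_strides([8, 64, 56, 56]))  # [200704, 1, 3584, 64]
print(channels_last_strides([32, 64, 3, 3]))   # [576, 1, 192, 64]
```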

In [None]:
Y = graph.conv_fprop(X, W, padding = [1,1], stride = [1,1], dilation = [1,1], compute_data_type = cudnn.data_type.FLOAT)

`conv_fprop` adds a forward convolution of the input tensor X with the filter W, using a padding of [1, 1].

Other parameters are `compute_data_type`, `stride` and `dilation`. See `help(cudnn.pygraph.conv_fprop)`.
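
The shape of Y follows from these parameters. As a sketch using the standard convolution output-size formula (the helper name `conv_output_size` is hypothetical), the spatial size for the convolution above can be computed as:

```python
def conv_output_size(size, kernel, padding, stride, dilation):
    """Output length of one spatial dimension of a convolution."""
    # A dilated kernel spans dilation * (kernel - 1) + 1 input elements.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


h_out = conv_output_size(56, kernel=3, padding=1, stride=1, dilation=1)
w_out = conv_output_size(56, kernel=3, padding=1, stride=1, dilation=1)

# N comes from X (8) and the channel count from the number of filters in W (32).
print([8, 32, h_out, w_out])  # [8, 32, 56, 56]
```

With a 3x3 filter, padding 1 and stride 1, the spatial size is preserved, so the output has 56x56 spatial dimensions.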

In [None]:
Y.set_output(True)

By default, the output of any operation is virtual (it has no device pointer associated with it). This is because the output can be fed as input to the next operation in the graph. To terminate the graph, or to mark a tensor as non-virtual, we need to mark it as an output with `set_output(True)`.

In [None]:
graph.build([cudnn.heur_mode.A])
# print(graph)

The following things happen in the above call:
- Validation of inputs and outputs, and output shape deduction.
- Lowering pass into the cuDNN dialect.
- Heuristics query to determine which execution plan to run.
- Runtime compilation of the plan, if needed.

In the following notebooks, we will see this function split into its constituent calls to give finer control over each phase.

Use the `print` function to inspect the graph after shape and data type deduction.
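
The phases listed above can be sketched as separate calls. This assumes the method names `validate`, `build_operation_graph`, `create_execution_plans`, `check_support` and `build_plans` as used in the cudnn frontend Python samples; check them against your installed version. `build_in_phases` and `RecordingGraph` are hypothetical names, and the stand-in graph only records calls so that the sketch runs without a GPU.

```python
def build_in_phases(graph, heur_modes):
    """Run the phases of graph.build() one call at a time.

    Assumes the graph object exposes the five methods below, as in the
    cudnn frontend Python samples.
    """
    graph.validate()                          # check inputs/outputs, deduce shapes
    graph.build_operation_graph()             # lower into the cuDNN backend graph
    graph.create_execution_plans(heur_modes)  # query heuristics for candidate plans
    graph.check_support()                     # confirm a plan is supported here
    graph.build_plans()                       # compile the plan(s) if needed


class RecordingGraph:
    """Stand-in that records method names instead of calling cuDNN."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the phase methods.
        def record(*args):
            self.calls.append(name)
        return record


fake = RecordingGraph()
build_in_phases(fake, ["heur_mode.A"])
print(fake.calls)
# ['validate', 'build_operation_graph', 'create_execution_plans', 'check_support', 'build_plans']
```

With a real graph, the call would be `build_in_phases(graph, [cudnn.heur_mode.A])`.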

In [None]:
import torch


X_gpu = torch.randn(8, 64, 56, 56, requires_grad=False, device="cuda", dtype=torch.float16).to(memory_format=torch.channels_last)
W_gpu = torch.randn(32, 64, 3, 3, requires_grad=False, device="cuda", dtype=torch.float16).to(memory_format=torch.channels_last)
# With padding 1, stride 1 and a 3x3 filter, the output keeps the 56x56 spatial size.
Y_gpu = torch.zeros(8, 32, 56, 56, requires_grad=False, device="cuda", dtype=torch.float16).to(memory_format=torch.channels_last)
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)

Here we use torch to create the GPU tensors. Note that cudnn FE accepts any tensor object that supports the DLPack interface.

In [None]:
graph.execute({X: X_gpu, W: W_gpu, Y: Y_gpu}, workspace, handle=handle)

The `execute` call launches the kernel on the GPU device. The dictionary maps each graph tensor to the device buffer that backs it.