In [1]:
import cudnn
print(cudnn.__version__)
import torch

torch.manual_seed(42)
assert torch.cuda.is_available()

1.14.1


In [2]:
device: torch.cuda.device = torch.device("cuda")
print(f"cudnn.backend_version(): {cudnn.backend_version()}")
print(f"torch.version.cuda: {torch.version.cuda}")

cudnn.backend_version(): 91301
torch.version.cuda: 12.8


#### Initialise CUDNN Handle

Initialises the library's context, it acts as an identifier for the current session with cuDNN. In the back-end, under underlying handle is explicitly passed to every subsequent library function that operates on GPU data. This provides the user with a means to explicitly control the library's functioning across multiple host threads, GPUs and CUDA streams. 
e.g. using cudaSetDevice can associate different physical GPUs with different host threads. With a different handle initialised in each host thread, the work from each different host thread will automatically run on different GPU devices

The handle is used to determine on which GPU the kernel will be launched. The context is only associated wtih a single physical GPU device, however multiple handles can be initialised for a single physical GPU device.

Front-end docs: https://docs.nvidia.com/deeplearning/cudnn/frontend/latest/developer/core-concepts.html#cudnn-handle \
Back-end docs: https://docs.nvidia.com/deeplearning/cudnn/backend/latest/developer/core-concepts.html#cudnn-handle

In [3]:
handle = cudnn.create_handle()

#### Initialise CUDNN Graph

CUDNN provides a eclarative programming model, computation is defined via a graph of operations (on tensors), the CUDNN back-end handles how it is executed on the GPU device. Graphs are comprised of three main concepts, Operations, Execution Engines and Heuristics.

*Operations* are a mathematical specification of the operations being executed.\
*Execution Engines*: Different engines support the execution of different operations, from the documentation:

>pre-compiled single operation engines, generic runtime fusion engines, specialized runtime fusion engines, and specialized pre-compiled fusion engines. The specialized engines, whether they use runtime compilation or pre-compilation, are targeted to a set of important use cases, and thus have a fairly limited set of patterns they currently support. Over time, we expect to support more of those use cases with the generic runtime fusion engines, whenever practical.

*Heuristics*: Depending on the graph there might be zero, one or multiple engines that can execute a graph. Heuristics give a way of sorting viable execution engines from most to least performant, on a specific operation graph.

> 1. Heuristics Mode A - intended to be fast and be able to handle most operation graph patterns. It returns a list of engine configs ranked by the expected performance.
> 2. Heuristics Mode B - intended to be more generally accurate than mode A, but with the tradeoff of higher CPU latency to return the list of engine configs. The underlying implementation may fall back to the mode A heuristic in cases where we know mode A can do better.
> 3. Fallback Heuristics Mode - intended to be fast and provide functional fallbacks without expectation of optimal performance.\

The recommended workflow is to query either mode A or B and check for support. The first engine config with support is expected to have the best performance. \
Different engines have different support surfaces (in v1.14.1 these are 90, 80, and 70), different operations are supported by different surfaces, generally for the best performance it is recommended to target the highest indexed support surface possible and fall back to lower ones if needed.

Front-end docs: https://docs.nvidia.com/deeplearning/cudnn/frontend/v1.14.1/developer/graph-api.html#graphs


### Defining the Graph

Data are represented by edges in the graph, and operations by nodes. 

In [4]:
# Defining a graph we can specify the handle (GPU device to execute on), input & output precision and intermediate data precision.
graph = cudnn.pygraph(
    handle=handle,
    name="cudnn_graph_0",
    io_data_type=cudnn.data_type.HALF, # (Optional)
    compute_data_type=cudnn.data_type.FLOAT, # (Optional)
)

### Defining Input Tensors

`graph.tensor` creates an entry edge to the graph. 

The main attributes of the tensor class are `dim`, `stride` and `data_type`. Some other attributes are `is_virtual` (mainly used for interior nodes in graph), `is_pass_by_value` for scalar tensors. 

In [5]:
X = graph.tensor(
    name="X",
    dim=[8, 64, 56, 56],
    stride=[56 * 56 * 64, 1, 56 * 64, 64],
    data_type=cudnn.data_type.HALF,
)
W = graph.tensor(name="W", dim=[32, 64, 3, 3], stride=[3 * 3 * 64, 1, 3 * 64, 64])

### Defining Output Tensors

Perform a `convolution forward` operation with padding as [1,1] on the input X tensor.

Other parameters are `compute_data_type`, `stride`, `dilation`. See `help (cudnn.pygraph.conv_fprop)`

In [6]:
Y = graph.conv_fprop(
    X,
    W,
    padding=[1, 1],
    stride=[1, 1],
    dilation=[1, 1],
    compute_data_type=cudnn.data_type.FLOAT,
)


By default the output of any operation is virtual (does not have device pointer associated). This is because the output can be fed as input to the next operation in graph. In order to terminate the graph, or to mark the tensor non-virtual we need to set the output. 

In [7]:
Y.set_output(True)

[{"data_type":null,"dim":[],"is_pass_by_value":false,"is_virtual":false,"name":"0::Y","pass_by_value":null,"reordering_type":"NONE","stride":[],"uid":0,"uid_assigned":false}]

### Build the Graph

Building the graph does the following:

- validation of inputs, outputs and output shape deduction.
- Lowering pass into the cudnn dialect.
- Heuristics query to determine which execution plan to run.
- Runtime compilation of the plan if needed

In following notebooks, we will see that this function gets split into its constituents to have a better control over each phase.

Use the `print` function to inspect the graph after the shape and datatype deduction.

In [8]:
graph.build([cudnn.heur_mode.A])
print(graph)

{
    "context": {
        "compute_data_type": "FLOAT",
        "intermediate_data_type": null,
        "io_data_type": "HALF",
        "name": "",
        "sm_count": -1
    },
    "cudnn_backend_version": "9.13.1",
    "cudnn_frontend_version": 11401,
    "json_version": "1.0",
    "nodes": [
        {
            "compute_data_type": "FLOAT",
            "dilation": [1,1],
            "inputs": {
                "W": "W",
                "X": "X"
            },
            "math_mode": "CROSS_CORRELATION",
            "name": "0",
            "outputs": {
                "Y": "0::Y"
            },
            "post_padding": [1,1],
            "pre_padding": [1,1],
            "stride": [1,1],
            "tag": "CONV_FPROP"
        }
    ],
    "tensors": {
        "0::Y": {
            "data_type": "HALF",
            "dim": [8,32,56,56],
            "is_pass_by_value": false,
            "is_virtual": false,
            "name": "0::Y",
            "pass_by_value": null,
        

### Execute the Graph

In [9]:
import torch

X_gpu = torch.randn(
    8, 64, 56, 56, requires_grad=False, device="cuda", dtype=torch.float16
).to(memory_format=torch.channels_last)
W_gpu = torch.randn(
    32, 64, 3, 3, requires_grad=False, device="cuda", dtype=torch.float16
).to(memory_format=torch.channels_last)
Y_gpu = torch.zeros(
    8, 32, 3, 3, requires_grad=False, device="cuda", dtype=torch.float16
).to(memory_format=torch.channels_last)
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)

In [10]:
# Executes the graph on the GPU device
graph.execute({X: X_gpu, W: W_gpu, Y: Y_gpu}, workspace, handle=handle)

In [11]:
print(Y_gpu)

tensor([[[[ 2.0922e+01, -5.9062e+00,  4.3188e+01],
          [-2.9688e+01, -1.3633e+01, -9.3203e+00],
          [ 2.1547e+01,  9.0312e+00, -2.0977e+00]],

         [[ 9.8047e+00,  3.7656e+01, -2.9980e-01],
          [-2.2953e+01,  2.2688e+01,  1.7266e+01],
          [-2.6438e+01, -1.0719e+01, -2.4781e+01]],

         [[-5.3750e+00,  3.0781e+00, -1.1133e+01],
          [-6.0117e+00,  1.4852e+01,  4.5977e+00],
          [-1.2219e+01, -4.9180e+00,  2.1547e+01]],

         ...,

         [[ 2.0656e+01,  8.1328e+00,  3.2719e+01],
          [-7.8008e+00, -2.4961e+00, -2.1922e+01],
          [ 8.1875e+00,  4.7469e+01,  1.9000e+01]],

         [[-9.6328e+00,  3.8887e+00,  3.2031e+00],
          [ 1.0141e+01,  3.9121e+00, -1.7783e+00],
          [-1.6078e+01, -7.2938e+01,  3.7000e+01]],

         [[ 1.7000e+01,  2.7984e+01,  3.1426e+00],
          [-1.0094e+01,  9.1250e+00,  3.4375e+01],
          [-3.1078e+01, -8.6250e+00,  1.5180e+01]]],


        [[[-1.8234e+01,  2.6077e-02,  9.1406e+00],
  

#### Free GPU device Handle

In [12]:
cudnn.destroy_handle(handle)