# Overview of the cuDNN Wrapper

This notebook explains the features and capabilities of the cuDNN wrapper.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/01_graph_building.ipynb)

## Prerequisites for running on Colab

This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [None]:
# get_ipython().system('nvidia-smi')

If you are running on Colab, install the cuDNN Python interface first.

In [None]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('pip install nvidia-cudnn-frontend')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128')

## Creating the wrapper object

The `Graph` wrapper is an object that provides a context manager around a cuDNN `pygraph` object. It can accept all arguments that are accepted by the `pygraph` object, for example:

In [None]:
import cudnn
import torch

In [None]:
device = torch.device("cuda")

handle = cudnn.create_handle()

# Create NCHW tensor but physical layout is NHWC
X_gpu = torch.randn(8, 64, 56, 56, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
W_gpu = torch.randn(32, 64, 3, 3, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)

with cudnn.Graph(
    name="graph_0",
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
) as graph:
    Y = graph.conv_fprop(
        image=X_gpu,  # referencing tensor layout and type only
        weight=W_gpu,  # referencing tensor layout and type only
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
        name="conv2d",
    )
    Y.set_output(True).set_name("conv_out")

graph.set_io_tuples(["conv2d::image", "conv2d::weight"], ["conv_out"])
Y_gpu = graph(
    X_gpu, W_gpu, handle=handle
)  # reading the tensor values to execute the graph

When you create a `Graph` object, you can pass through the arguments to the `pygraph` object. All arguments to the `Graph` object must be keyword arguments instead of positional arguments. Besides the arguments that are accepted by the `pygraph` object, some other arguments are also accepted:

- `handle`: The `pygraph` object also accepts a handle as an argument. If you provide one, the wrapper reuses it for graph execution by default.
- `heuristics`: A list of cuDNN heuristics to use to build the graph. The default is `[cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK]`.
- `inputs` and `outputs`: Lists of input and output specifications. They are used in place of calling `set_io_tuples()` explicitly after the graph is created.

You can rewrite the previous example with the `inputs` and `outputs` arguments, as follows:

In [None]:
with cudnn.Graph(
    name="graph_0",
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["conv2d::image", "conv2d::weight"],
    outputs=["conv_out"],
) as graph:
    Y = graph.conv_fprop(
        image=X_gpu,  # referencing tensor layout and type only
        weight=W_gpu,  # referencing tensor layout and type only
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
        name="conv2d",
    )
    Y.set_output(True).set_name("conv_out")

Y_gpu = graph(
    X_gpu, W_gpu, handle=handle
)  # reading the tensor values to execute the graph

The call to `set_io_tuples()` is gone. Instead, you specify `inputs` and `outputs` as arguments to the `Graph` object so that you can call the graph with positional arguments.

In the examples above, you created the `Graph` object with the `io_data_type` and `compute_data_type` arguments. They are necessary in this particular example because otherwise the node `conv2d` will not know what precision it should use for the internal computations and the output tensor `Y` will not know what precision it should have.

The `io_data_type` and `compute_data_type` arguments are optional: they only set the defaults for tensors and nodes that do not specify their own. You can omit them if you specify the data types on every node and tensor that needs them, as follows:

In [None]:
with cudnn.Graph(
    inputs=["conv2d::image", "conv2d::weight"],
    outputs=["conv_out"],
) as graph:
    Y = graph.conv_fprop(
        image=X_gpu,  # referencing tensor layout and type only
        weight=W_gpu,  # referencing tensor layout and type only
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
        compute_data_type=cudnn.data_type.FLOAT,
        name="conv2d",
    )
    Y.set_data_type(cudnn.data_type.HALF).set_output(True).set_name("conv_out")

Y_gpu = graph(
    X_gpu, W_gpu, handle=handle
)  # reading the tensor values to execute the graph

This is equivalent to the previous example, but the `Graph` object no longer has default data types for I/O and computation. You must therefore specify them on every tensor and node that needs them.

## Tensors to use in a wrapper

One major difference between the `Graph` wrapper and the underlying `pygraph` object from the Python bindings is that the `Graph` wrapper accepts DLPack tensors transparently, without the need to convert them to cuDNN tensors.

Consider the following example:

In [None]:
with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["conv2d::image", "conv2d::weight"],
    outputs=["conv_out"],
) as graph:
    X = graph.tensor(
        name="X",
        dim=[8, 64, 56, 56],
        stride=[56 * 56 * 64, 1, 56 * 64, 64],
        data_type=cudnn.data_type.HALF,
    )
    W = graph.tensor(
        name="W",
        dim=[32, 64, 3, 3],
        stride=[64 * 3 * 3, 1, 64 * 3, 64],
    )
    Y = graph.conv_fprop(
        image=X,  # using a cuDNN tensor object for layout and type
        weight=W,  # using a cuDNN tensor object for layout and type
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
        name="conv2d",
    )
    Y.set_output(True).set_name("conv_out")

device = torch.device("cuda")
X_gpu = torch.randn(8, 64, 56, 56, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
W_gpu = torch.randn(32, 64, 3, 3, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
Y_gpu = graph(
    X_gpu, W_gpu, handle=handle
)  # using dlpack tensors for values to execute the graph

This is the same as the examples above, except that you used the cuDNN tensors `X` and `W` to create the graph. This is not necessary, but you can do so if you want to. The tensors `X` and `W` are just placeholders that specify tensor attributes such as the memory layout. No GPU memory is allocated for them.

The actual GPU memory allocation happens later, when you define `X_gpu` and `W_gpu`. The resulting graph is the same whether you define it with cuDNN tensors or with DLPack tensors.

PyTorch tensors are not the only option. You can also use other DLPack tensors on the GPU, such as CuPy arrays:
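
The `stride` values passed to `graph.tensor()` above follow from the layout: the dimensions are given in logical NCHW order while the data is stored physically as NHWC. As a minimal sketch, the helper below (a hypothetical name, not part of the cuDNN wrapper) computes such strides in plain Python:

```python
def channels_last_strides(dim):
    """Strides (in elements) for a logical NCHW shape stored physically as NHWC.

    Hypothetical helper for illustration only; it is not part of cuDNN.
    """
    n, c, h, w = dim
    # Physical order is N, H, W, C, so the channel axis is the fastest-varying one.
    return [h * w * c, 1, w * c, c]


x_stride = channels_last_strides([8, 64, 56, 56])
w_stride = channels_last_strides([32, 64, 3, 3])
print(x_stride)  # [200704, 1, 3584, 64]
print(w_stride)  # [576, 1, 192, 64]
```

These match the `stride` lists written out by hand for `X` and `W` in the example above.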

In [None]:
import cupy as cp

# Create tensors with physical layout NHWC and logical layout NCHW
X_cupy = cp.random.randn(8, 56, 56, 64).astype(cp.float16).transpose(0, 3, 1, 2)
W_cupy = cp.random.randn(32, 3, 3, 64).astype(cp.float16).transpose(0, 3, 1, 2)

with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["conv2d::image", "conv2d::weight"],
    outputs=["conv_out"],
) as graph:
    Y = graph.conv_fprop(
        image=X_cupy,  # using a CuPy array for layout and type
        weight=W_cupy,  # using a CuPy array for layout and type
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
        name="conv2d",
    )
    Y.set_output(True).set_name("conv_out")

Y_gpu = graph(
    X_cupy, W_cupy, handle=handle
)  # using CuPy arrays for values to execute the graph

Note that even with CuPy arrays, you must still comply with the data type and layout requirements of the particular operation.

You can confirm that the code above works by checking its result against PyTorch:

In [None]:
X_ref = torch.tensor(X_cupy)
W_ref = torch.tensor(W_cupy)
Y_ref = torch.nn.functional.conv2d(X_ref, W_ref, padding=1)

torch.testing.assert_close(Y_gpu, Y_ref, atol=5e-3, rtol=3e-3)
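
The tolerances passed to `torch.testing.assert_close()` allow for the rounding differences expected with half-precision I/O. As a rough sketch of the element-wise rule it applies, `|actual - expected| <= atol + rtol * |expected|`, here is a scalar version in plain Python (the function name `is_close` is made up for this illustration):

```python
def is_close(actual, expected, atol=5e-3, rtol=3e-3):
    """Scalar form of the closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# With expected = 1.0 the allowed deviation is 0.005 + 0.003 * 1.0 = 0.008.
print(is_close(1.004, 1.0))  # True
print(is_close(1.010, 1.0))  # False
```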

## Executing the graph from wrapper

From the examples above, you created a `Graph` object, defined the graph, and set the inputs and outputs as a tuple. Then you can execute the graph as if it were a function.

This is not the only way to execute the graph. You can execute the graph with a dictionary mapping the inputs and outputs to allocated tensors:

In [None]:
X_gpu = torch.randn(8, 56, 56, 64, device=device, dtype=torch.float16).permute(
    0, 3, 1, 2
)
W_gpu = torch.randn(32, 3, 3, 64, device=device, dtype=torch.float16).permute(
    0, 3, 1, 2
)

with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF, compute_data_type=cudnn.data_type.FLOAT
) as graph:
    Y = graph.conv_fprop(
        image=X_gpu,
        weight=W_gpu,
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
        name="conv2d",
    )
    Y.set_output(True)

# allocate the output tensor
Y_gpu = torch.zeros(8, 56, 56, 32, device=device, dtype=torch.float16).permute(
    0, 3, 1, 2
)
# execute the graph with a dictionary, output tensors will be updated in place
output = graph(
    {"conv2d::image": X_gpu, "conv2d::weight": W_gpu, "conv2d::Y": Y_gpu}, handle=handle
)

The example above created a graph as before, but you did not set `inputs` and `outputs` in the `Graph` object, nor did you call `set_io_tuples()` after the graph was created. In that case, you execute the graph by passing in a Python dictionary. The keys of the dictionary are the names of the inputs and outputs, and the values are allocated tensors. The input tensors are where the values are read from, and the output tensors are where the values are written to.

In the graph above, there is only one node, `conv2d`, created by `conv_fprop()`. Its output tensor is named `conv2d::Y` by default. The `conv2d` part is the name of the node and the `Y` part is the name of the output tensor of that node. You can look up the tensor names of any node in the cuDNN documentation. This is how you refer to the output tensor of the graph above, since you have not used `set_name()` to assign a new name to it.

You created `Y_gpu` as a tensor of all zeros. You can confirm that the result has been written to `Y_gpu` by comparing its value with the reference result from PyTorch:

In [None]:
Y_ref = torch.nn.functional.conv2d(X_gpu, W_gpu, padding=1)
torch.testing.assert_close(Y_gpu, Y_ref, atol=5e-3, rtol=3e-3)
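
The shape of the preallocated `Y_gpu` above has to match the shape of the convolution output. The sketch below uses the standard convolution output-size formula to derive it; the helper names are hypothetical and not part of the cuDNN wrapper:

```python
def conv_out_size(size, kernel, padding=0, stride=1, dilation=1):
    """Output length along one spatial axis of a convolution."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def conv_out_dim(x_dim, w_dim, padding, stride, dilation):
    """Logical NCHW output dims for an NCHW image and a KCRS filter."""
    n, _, h, w = x_dim
    k, _, r, s = w_dim
    return [
        n,
        k,
        conv_out_size(h, r, padding[0], stride[0], dilation[0]),
        conv_out_size(w, s, padding[1], stride[1], dilation[1]),
    ]


print(conv_out_dim([8, 64, 56, 56], [32, 64, 3, 3], [1, 1], [1, 1], [1, 1]))
# [8, 32, 56, 56]
```

With a 3x3 filter, padding 1, and stride 1, the spatial size is preserved, which is why `Y_gpu` has logical dims `[8, 32, 56, 56]`.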

The `output` returned from executing the graph is the same dictionary as the one you passed as the argument to the graph execution. Returning the dictionary lets you skip creating the output tensors yourself, as follows:

In [None]:
X2_gpu = torch.randn(8, 56, 56, 64, device=device, dtype=torch.float16).permute(
    0, 3, 1, 2
)
W2_gpu = torch.randn(32, 3, 3, 64, device=device, dtype=torch.float16).permute(
    0, 3, 1, 2
)

# Same graph as before will be reused
# with cudnn.Graph(
#     io_data_type=cudnn.data_type.HALF,
#     compute_data_type=cudnn.data_type.FLOAT
# ) as graph:
#    Y = graph.conv_fprop(
#        image=X_gpu,
#        weight=W_gpu,
#        padding=[1, 1],
#        stride=[1, 1],
#        dilation=[1, 1],
#        name="conv2d",
#    )
#    Y.set_output(True)

output = graph({"conv2d::image": X2_gpu, "conv2d::weight": W2_gpu}, handle=handle)
Y2_gpu = output["conv2d::Y"]  # new key is created in dict for the outputs

When you execute the graph, the dictionary used as the argument does not have the key `conv2d::Y`. When the call returns, that key has been created and its value is a new tensor allocated by the wrapper. You can retrieve that output tensor and verify its result with PyTorch:

In [None]:
Y2_ref = torch.nn.functional.conv2d(X2_gpu, W2_gpu, padding=1)
torch.testing.assert_close(Y2_gpu, Y2_ref, atol=5e-3, rtol=3e-3)

In the examples above, you saw that you can execute the graph using positional arguments or a dictionary. The former requires you to call `set_io_tuples()`, either explicitly or implicitly. Either way, you need to reference the input and output tensors. There are multiple ways to do this:

- Use a string in the format of `"node_name::tensor_name"`
- Use a string of the name assigned to the cuDNN tensor
- Use an integer of the uid assigned to the cuDNN tensor
- Use the cuDNN tensor object directly

Let's see an example:

In [None]:
X_gpu = torch.randn(8, 64, 56, 56, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
W_gpu = torch.randn(32, 64, 3, 3, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)

with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
) as graph:
    Y = graph.conv_fprop(
        X_gpu,  # tensor in a positional argument
        weight=W_gpu,  # tensor in a keyword argument
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
        name="conv2d",
    )
    Y.set_output(True)

# specify the input and output tensors and then execute the graph
graph.set_io_tuples(["conv2d::0", "conv2d::weight"], [Y])
Y_gpu = graph(X_gpu, W_gpu, handle=handle)

Y_ref = torch.nn.functional.conv2d(X_gpu, W_gpu, padding=1)
torch.testing.assert_close(Y_gpu, Y_ref, atol=5e-3, rtol=3e-3)

This example differs from the earlier ones in two ways. First, `conv_fprop()` uses the positional argument `X_gpu` instead of the `image` keyword argument. Second, `set_io_tuples()` is called differently.

Since `X_gpu` (i.e., `image`) is passed as a positional argument, it is referenced as `conv2d::0`, where `0` indicates it is the first positional argument. The output tensor `Y` is passed directly as a cuDNN tensor object, rather than being referenced as `conv2d::Y`, though both are equivalent.

If you did not name the convolution node, reference its input and output tensors using the syntax `node_op.number::tensor_name`:

In [None]:
X_gpu = torch.randn(8, 64, 56, 56, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
W_gpu = torch.randn(32, 64, 3, 3, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)

with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
) as graph:
    Y = graph.conv_fprop(
        X_gpu,
        weight=W_gpu,
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
    )
    Y.set_output(True)

graph.set_io_tuples(["conv_fprop.0::0", "conv_fprop.0::weight"], ["conv_fprop.0::Y"])
Y_gpu = graph(X_gpu, W_gpu, handle=handle)

Y_ref = torch.nn.functional.conv2d(X_gpu, W_gpu, padding=1)
torch.testing.assert_close(Y_gpu, Y_ref, atol=5e-3, rtol=3e-3)

Because you removed the `name` argument from the `conv_fprop()` call, you now need to reference the input tensors as `conv_fprop.0::0` and `conv_fprop.0::weight`. In `conv_fprop.0::0`, the part `conv_fprop.0` means the first node in the graph that uses the `conv_fprop` operation. The `::0` part means the first positional argument. Similarly, `conv_fprop.0::weight` means the keyword argument `weight` of that node.

You may find that referencing the input and output tensors by name is not very convenient. You can avoid names entirely, as follows:
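
To make the reference syntax concrete, the sketch below splits a reference string into its node part and its tensor part, and tells positional slots (such as `::0`) from keyword slots (such as `::weight`). It only illustrates the naming convention described above; `parse_tensor_ref` is a hypothetical helper and does not reflect how the wrapper resolves names internally.

```python
def parse_tensor_ref(ref):
    """Split a reference string into (node, slot).

    Hypothetical helper for illustration. A string without "::" is treated as a
    plain tensor name, so node is None. A numeric slot is returned as an int
    (positional argument); any other slot is returned as a str.
    """
    node, sep, slot = ref.partition("::")
    if not sep:
        return (None, ref)
    return (node, int(slot) if slot.isdigit() else slot)


print(parse_tensor_ref("conv2d::image"))    # ('conv2d', 'image')
print(parse_tensor_ref("conv_fprop.0::0"))  # ('conv_fprop.0', 0)
print(parse_tensor_ref("conv_out"))         # (None, 'conv_out')
```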

In [None]:
X_gpu = torch.randn(8, 64, 56, 56, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
W_gpu = torch.randn(32, 64, 3, 3, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)

with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
) as graph:
    Y = graph.conv_fprop(
        X_gpu,
        weight=W_gpu,
        padding=[1, 1],
        stride=[1, 1],
        dilation=[1, 1],
    )
    Y.set_output(True)

# specify the input and output tensors using tensor objects
graph.set_io_tuples([W_gpu, X_gpu], [Y])
Y_gpu = graph(W_gpu, X_gpu, handle=handle)

Y_ref = torch.nn.functional.conv2d(X_gpu, W_gpu, padding=1)
torch.testing.assert_close(Y_gpu, Y_ref, atol=5e-3, rtol=3e-3)

In the `set_io_tuples()` call, you set the input tensors to be `W_gpu` and `X_gpu`. These two tensors are what you used to create the convolution node in the graph. The wrapper remembers them, and you can reference them using the exact objects. Note that the wrapper identifies each object by its Python object ID. Therefore, you must use the same objects that you used to create the graph.

Also note that the order of the input tensors in `set_io_tuples()` above is different from the previous examples. It is to show that the order can be arbitrary in `set_io_tuples()`, but once you set it, you must follow the same order when you execute the graph.
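
One way to picture the ordering rule is that a positional call pairs each argument with the reference at the same position in the input list. The sketch below shows that pairing in plain Python with placeholder strings instead of real tensors. `bind_positional` is a hypothetical helper and the wrapper's actual implementation may differ.

```python
def bind_positional(input_order, args):
    """Pair positional arguments with the input references, position by position."""
    if len(args) != len(input_order):
        raise ValueError(f"expected {len(input_order)} inputs, got {len(args)}")
    return dict(zip(input_order, args))


# Placeholder strings stand in for the actual tensors.
order = ["conv2d::weight", "conv2d::image"]
print(bind_positional(order, ["W values", "X values"]))
# {'conv2d::weight': 'W values', 'conv2d::image': 'X values'}
```

Passing the arguments in a different order than the one declared would silently bind the wrong data to each input, which is why the call order must match.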

Finally, it is worth noting that once you have created the graph, you can execute it multiple times with different inputs. It is not necessary to create a new graph object for new inputs as long as the operation is the same (not only the graph topology, but also the size, layout, and data type of the input and output tensors).

## Using wrapper in larger codebases

From the examples above, you can see that the `Graph` object is the centerpiece of your interaction with cuDNN. It is a Python object, and you can make use of it in a larger codebase. Below is an example of how to create a factory function for graphs with a caching facility:

In [None]:
cache = {}  # global cache for the decorator


def cached_graph_factory(func):
    """Decorator for the factory function to cache Graph objects."""

    def wrapper(*args):
        key = [func.__name__]
        for arg in args:
            key.extend([arg.shape, arg.stride(), arg.dtype])
        key = tuple(key)
        if key not in cache:
            print("Creating new graph:", key)
            cache[key] = func(*args)
        return cache[key]

    return wrapper


@cached_graph_factory
def conv(x, w):
    with cudnn.Graph(
        io_data_type=cudnn.data_type.HALF,
        compute_data_type=cudnn.data_type.FLOAT,
    ) as graph:
        Y = graph.conv_fprop(
            image=x,
            weight=w,
            padding=[1, 1],
            stride=[1, 1],
            dilation=[1, 1],
            name="conv",
        )
        Y.set_output(True)

    graph.set_io_tuples(["conv::image", "conv::weight"], ["conv::Y"])
    return graph


def call_conv(x, w):
    g = conv(x, w)
    return g(x, w, handle=handle)


# Call it the first time
print("First call")
x1 = torch.randn(8, 64, 56, 56, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
w1 = torch.randn(32, 64, 3, 3, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
y1 = call_conv(x1, w1)

# Call it the second time with arguments of the same shape
print("Second call")
x2 = torch.randn(8, 64, 56, 56, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
w2 = torch.randn(32, 64, 3, 3, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
y2 = call_conv(x2, w2)

# Call it the third time with arguments of a different shape
print("Third call")
x3 = torch.randn(8, 20, 60, 60, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
w3 = torch.randn(30, 20, 3, 3, device=device, dtype=torch.float16).to(
    memory_format=torch.channels_last
)
y3 = call_conv(x3, w3)

When you run the code above, the printed messages show that the first and third calls each created a new graph. The second call did not create a graph. It returned the cached graph object instead, because its cache key matched an existing entry. Remember that a cuDNN graph can be reused as long as the inputs and outputs have the same size, layout, and data type. Therefore, you do not need a new graph for the second call. The cache key lets you check whether a graph can be reused.

This is just one example of how you can use the `Graph` object in a more sophisticated codebase.
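
To see why the key includes the strides and not only the shape, the sketch below builds the same kind of key from stand-in objects that need no GPU. `FakeTensor` and `make_key` are hypothetical names for this illustration. The stand-in exposes only the `shape`, `stride()`, and `dtype` members that the decorator above reads.

```python
class FakeTensor:
    """Hypothetical stand-in exposing only what the cache key needs."""

    def __init__(self, shape, strides, dtype):
        self.shape = tuple(shape)
        self._strides = tuple(strides)
        self.dtype = dtype

    def stride(self):
        return self._strides


def make_key(name, *tensors):
    """Build a hashable key from the function name and tensor metadata."""
    key = [name]
    for t in tensors:
        key.extend([tuple(t.shape), tuple(t.stride()), str(t.dtype)])
    return tuple(key)


a = FakeTensor([8, 64, 56, 56], [200704, 1, 3584, 64], "float16")  # NHWC layout
b = FakeTensor([8, 64, 56, 56], [200704, 1, 3584, 64], "float16")  # same metadata, new object
c = FakeTensor([8, 64, 56, 56], [200704, 3136, 56, 1], "float16")  # contiguous NCHW layout

print(make_key("conv", a) == make_key("conv", b))  # True
print(make_key("conv", a) == make_key("conv", c))  # False
```

Two distinct tensors with identical metadata map to the same key, so they share one graph. The same shape in a different memory layout produces a different key and therefore a separate graph.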

In [None]:
cudnn.destroy_handle(handle)