# Matrix Multiplication

This notebook demonstrates how to use cuDNN's matrix multiplication API.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/10_matrix_multiplication.ipynb)

## Prerequisites for running on Colab

This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected.

In [None]:
# get_ipython().system('nvidia-smi')

If running on Colab, you will need to install the cuDNN Python interface.

In [None]:
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('pip install nvidia-cudnn-frontend')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128')

## Matrix Multiplication with Bias

This example also appears in [a previous notebook](02_binding.ipynb). It is repeated here: a batched matrix multiplication followed by a bias add.

In [None]:
import cudnn
import torch

In [None]:
device = torch.device("cuda")
handle = cudnn.create_handle()

dtype = torch.float16
B, M, N, K = 16, 128, 128, 512

# input tensors
a_gpu = torch.randn(B, M, K, device=device, dtype=dtype)
b_gpu = torch.randn(B, K, N, device=device, dtype=dtype)
d_gpu = torch.randn(1, M, N, device=device, dtype=dtype)

# reference output
c_ref = torch.matmul(a_gpu, b_gpu) + d_gpu
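
The reference above relies on broadcasting: the bias has shape `(1, M, N)` and is added to every one of the `B` batch results. As a rough, dependency-free sketch of that arithmetic (the helper `batched_matmul_bias` and the tiny shapes are hypothetical, chosen only for illustration, and nested lists stand in for tensors):

```python
def batched_matmul_bias(a, b, d):
    """Compute C[i][m][n] = sum_k a[i][m][k] * b[i][k][n] + d[0][m][n].

    a has shape (B, M, K), b has shape (B, K, N), d has shape (1, M, N).
    """
    batch, rows, inner, cols = len(a), len(a[0]), len(b[0]), len(b[0][0])
    return [
        [
            [
                # dot product over K, then add the bias shared by all batches
                sum(a[i][m][k] * b[i][k][n] for k in range(inner)) + d[0][m][n]
                for n in range(cols)
            ]
            for m in range(rows)
        ]
        for i in range(batch)
    ]


# Tiny illustrative shapes: B=2, M=2, K=3, N=2
a = [[[1, 2, 3], [4, 5, 6]], [[1, 0, 0], [0, 1, 0]]]
b = [[[1, 0], [0, 1], [1, 1]], [[7, 8], [9, 10], [11, 12]]]
d = [[[10, 20], [30, 40]]]  # shape (1, M, N): one bias matrix for both batches

c = batched_matmul_bias(a, b, d)
print(c)
```

Note that the same `d[0]` is indexed for every batch entry, which is what broadcasting a leading dimension of size 1 amounts to.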

### Using Wrapper

This version uses the `Graph` wrapper to build the matrix multiplication graph. Once the input and output order is specified, the graph can be called like a function.

In [None]:
with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["mm::A", "mm::B", "bias::bias"],
    outputs=["bias::OUT_0"],
) as graph:
    AB = graph.matmul(
        name="mm",
        A=a_gpu,
        B=b_gpu,
    )
    C = graph.bias(name="bias", input=AB, bias=d_gpu)
    C.set_output(True)

c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)

# verify the result
torch.testing.assert_close(c_gpu, c_ref, atol=5e-3, rtol=3e-3)
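
For context on the tolerances: `torch.testing.assert_close` treats two values as close when `|actual - expected| <= atol + rtol * |expected|`. The following is a scalar-only sketch of that criterion in plain Python; `within_tolerance` is a hypothetical helper, not part of PyTorch, and it ignores NaN handling and dtype checks.

```python
def within_tolerance(actual, expected, atol=5e-3, rtol=3e-3):
    """Scalar version of the closeness test: |a - e| <= atol + rtol * |e|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Near zero, the absolute tolerance dominates.
small_ok = within_tolerance(0.004, 0.0)
small_bad = within_tolerance(0.006, 0.0)

# For larger magnitudes, the relative term widens the allowed error:
# at expected=100 the bound is 0.005 + 0.003 * 100 = 0.305.
large_ok = within_tolerance(100.3, 100.0)
large_bad = within_tolerance(100.31, 100.0)

print(small_ok, small_bad, large_ok, large_bad)
```

The relative term matters here because each output element sums `K = 512` products, so its magnitude can be well above 1.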

### Using Python binding

Below is the equivalent using the Python binding APIs. It is more verbose, because the graph is validated, built, and executed explicitly.

In [None]:
# construct the graph, reusing the handle created earlier
graph = cudnn.pygraph(
    handle=handle,
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

a_cudnn = graph.tensor_like(a_gpu)
b_cudnn = graph.tensor_like(b_gpu)
d_cudnn = graph.tensor_like(d_gpu)

ab = graph.matmul(name="mm", A=a_cudnn, B=b_cudnn)
c_cudnn = graph.bias(name="bias", input=ab, bias=d_cudnn)
c_cudnn.set_output(True)

# build and validate the graph
graph.validate()
graph.build_operation_graph()
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
graph.check_support()
graph.build_plans()

# placeholder for cuDNN output
c_gpu = torch.empty(B, M, N, device=device, dtype=dtype)

# execute the graph
workspace = torch.empty(graph.get_workspace_size(), device=device, dtype=torch.uint8)
variant_pack = {
    a_cudnn: a_gpu,  # input
    b_cudnn: b_gpu,  # input
    d_cudnn: d_gpu,  # input
    c_cudnn: c_gpu,  # output
}
graph.execute(variant_pack, workspace, handle=handle)
torch.cuda.synchronize()

# verify the result
torch.testing.assert_close(c_gpu, c_ref, atol=5e-3, rtol=3e-3)

## Matrix Multiplication with Mixed Precision

A matrix multiplication expects both operands to have the same data type. To multiply two matrices of different data types, cast one of them to the other's type first.

In [None]:
B, M, N, K = 16, 128, 128, 512

a_gpu = torch.randint(4, (B, M, K), device=device, dtype=torch.int8)
b_gpu = torch.randn(B, K, N, device=device, dtype=torch.bfloat16)

c_ref = torch.matmul(a_gpu.to(torch.bfloat16), b_gpu)
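
Casting `int8` to `bfloat16` loses no information: bfloat16 keeps 8 significant bits, which is enough for every integer from -128 to 127. The sketch below checks this without a GPU by emulating bfloat16 as the upper 16 bits of a float32, with round-to-nearest-even. The helper `round_to_bfloat16` is hypothetical and handles finite values only.

```python
import struct


def round_to_bfloat16(x: float) -> float:
    """Round a finite value to the nearest bfloat16, returned as a Python float."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that bfloat16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Every int8 value survives the cast unchanged.
int8_exact = all(round_to_bfloat16(float(i)) == float(i) for i in range(-128, 128))

# Wider integers may not: 257 needs 9 significant bits and rounds to 256.
rounded_257 = round_to_bfloat16(257.0)

print(int8_exact, rounded_257)
```

So the cast of `a_gpu` itself does not contribute to the error; any difference from the reference comes from the multiplication and the rounding of the output.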

### Using Wrapper

Let's see how to create a graph for matrix multiplication with mixed precision. It involves two nodes: one for precision casting and another for actual matrix multiplication.

In [None]:
with cudnn.Graph(
    io_data_type=cudnn.data_type.HALF,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["iden::input", "mm::B"],
    outputs=["mm::C"],
) as graph:
    # cast "A" to same data type as "B"
    a_casted = graph.identity(
        name="iden", input=a_gpu, compute_data_type=cudnn.data_type.FLOAT
    )
    a_casted.set_data_type(torch.bfloat16)
    # matmul with two tensors of same data type
    c = graph.matmul(
        name="mm", A=a_casted, B=b_gpu, compute_data_type=cudnn.data_type.FLOAT
    )
    c.set_output(True).set_data_type(torch.bfloat16)

c_gpu = graph(a_gpu, b_gpu, handle=handle)

# verify the result
torch.testing.assert_close(c_gpu, c_ref, atol=5e-3, rtol=3e-3)

### Using Python binding

The equivalent code using the Python binding APIs is as follows:

In [None]:
# create a graph
graph = cudnn.pygraph()

a_cudnn = graph.tensor_like(a_gpu)
b_cudnn = graph.tensor_like(b_gpu)

a_casted = graph.identity(
    input=a_cudnn,
    compute_data_type=cudnn.data_type.FLOAT,
)
a_casted.set_data_type(cudnn.data_type.BFLOAT16)
c_cudnn = graph.matmul(
    name="matmul",
    A=a_casted,
    B=b_cudnn,
    compute_data_type=cudnn.data_type.FLOAT,
)
c_cudnn.set_output(True).set_data_type(cudnn.data_type.BFLOAT16)

# validate and build
graph.validate()
graph.build_operation_graph()
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
graph.check_support()
graph.build_plans()

# execute the graph
c_gpu = torch.empty(B, M, N, device=device, dtype=torch.bfloat16)  # placeholder for cuDNN output
variant_pack = {
    a_cudnn: a_gpu,
    b_cudnn: b_gpu,
    c_cudnn: c_gpu,
}
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute(variant_pack, workspace)
torch.cuda.synchronize()

# verify the result
torch.testing.assert_close(c_gpu, c_ref, atol=5e-3, rtol=3e-3)