# Matrix multiplication with fused bias using the cudnn frontend (FE)
This notebook shows how to run a matmul with a fused bias add as a single cudnn graph, using the cudnn frontend Python API.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/cudnn-frontend/blob/main/samples/python/01_matmul_bias.ipynb)

## Prerequisites for running on Colab
This notebook requires an NVIDIA H100 GPU or newer. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm that a GPU is selected.

In [None]:
# get_ipython().system('nvidia-smi')

If running on Colab, you will need to install the cudnn python interface.

In [1]:
# get_ipython().system('export CUDA_VERSION="12.3"')
# get_ipython().system('pip install nvidia-cudnn-cu12')
# get_ipython().system('CUDNN_PATH=`pip show nvidia-cudnn-cu12  | grep Location | cut -d":" -f2 | xargs`/nvidia/cudnn pip install git+https://github.com/NVIDIA/cudnn-frontend.git')
# get_ipython().system('pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121')

#### General Setup
This example passes torch tensors to cudnn. In general, any tensor that supports DLPack should work.
The cudnn handle is a per-device handle used to initialize the cudnn context.


In [8]:
import cudnn
import torch
import sys

handle = cudnn.create_handle()

#### Create input tensors and calculate reference

In [2]:
batch, m, n, k = 16, 128, 128, 512

input_type = torch.float16

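# input tensors: a is (batch, m, k), b is (batch, k, n)
a = torch.randn(batch, m, k, dtype=input_type, device='cuda')
b = torch.randn(batch, k, n, dtype=input_type, device='cuda')
# bias is (1, m, n), so it broadcasts across the batch dimension
B = torch.randn(1, m, n, dtype=input_type, device='cuda')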

# reference output
c_ref = torch.matmul(a, b) + B

# placeholder for the cudnn output (overwritten when the graph executes)
c = torch.randn_like(c_ref, device='cuda')
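
To make the reference computation concrete, here is a minimal pure-Python sketch of the same arithmetic on tiny shapes: a batched matmul followed by a bias of shape `(1, m, n)` that is shared by every batch entry. The helper name `matmul_bias` and the sample values are illustrative assumptions, not part of the notebook or of cudnn.

```python
# Hypothetical helper: computes c[i] = a[i] @ b[i] + bias[0] for each batch entry i.
def matmul_bias(a, b, bias):
    """a: [batch][m][k], b: [batch][k][n], bias: [1][m][n] -> [batch][m][n]."""
    batch, m, k = len(a), len(a[0]), len(a[0][0])
    n = len(b[0][0])
    out = []
    for i in range(batch):
        out.append([
            # dot product of row r of a[i] with column c of b[i], plus the shared bias
            [sum(a[i][r][x] * b[i][x][c] for x in range(k)) + bias[0][r][c]
             for c in range(n)]
            for r in range(m)
        ])
    return out

# Tiny illustrative shapes: batch=2, m=2, k=3, n=2
a = [[[1, 2, 3], [4, 5, 6]], [[1, 0, 0], [0, 1, 0]]]
b = [[[1, 0], [0, 1], [1, 1]], [[7, 8], [9, 10], [11, 12]]]
bias = [[[10, 20], [30, 40]]]

c = matmul_bias(a, b, bias)
print(c)
```

The same `bias[0]` is added to both batch entries, which is the broadcast that `torch.matmul(a, b) + B` performs in the reference above.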

#### Create cudnn graph and tensors

In [None]:
graph = cudnn.pygraph(intermediate_data_type = cudnn.data_type.FLOAT, compute_data_type = cudnn.data_type.FLOAT)

a_cudnn_tensor = graph.tensor_like(a)
b_cudnn_tensor = graph.tensor_like(b)
bias_cudnn_tensor = graph.tensor_like(B)

c_intermediate = graph.matmul(name = "matmul", A = a_cudnn_tensor, B = b_cudnn_tensor)

c_cudnn_tensor = graph.bias(name = "bias", input = c_intermediate, bias = bias_cudnn_tensor)

# mark the bias result as the graph output and store it in FP16
c_cudnn_tensor.set_name("c").set_output(True).set_data_type(cudnn.data_type.HALF)

#### Build the graph

In [4]:
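graph.validate()                 # check that the graph description is consistent
graph.build_operation_graph()    # lower the graph to a cudnn backend operation graph
graph.create_execution_plans([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])  # query heuristics for candidate plans
graph.check_support()            # confirm that at least one candidate plan is supported
graph.build_plans()              # build the supported plan(s) for execution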

#### Execute the graph

In [5]:
variant_pack = {
    a_cudnn_tensor: a,
    b_cudnn_tensor: b,
    c_cudnn_tensor: c,
    bias_cudnn_tensor: B,
}

workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute(variant_pack, workspace)
torch.cuda.synchronize()

In [6]:
torch.testing.assert_close(c, c_ref, rtol = 5e-3, atol = 5e-3)
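
The comparison above uses tolerances instead of exact equality because the output is stored in FP16 and the two implementations may round differently. `torch.testing.assert_close` accepts an element when `|actual - expected| <= atol + rtol * |expected|`. The sketch below restates that rule for a single scalar in plain Python; the helper name `is_close` and the sample values are illustrative assumptions.

```python
# Hypothetical scalar version of the closeness rule used by torch.testing.assert_close.
def is_close(actual, expected, rtol=5e-3, atol=5e-3):
    # allowed error grows with the magnitude of the expected value
    return abs(actual - expected) <= atol + rtol * abs(expected)

# For a large expected value, the relative term dominates the allowed error.
large_ok = is_close(100.4, 100.0)    # 0.4 <= 0.005 + 0.5
large_bad = is_close(101.0, 100.0)   # 1.0 >  0.005 + 0.5

# Near zero, only the absolute term matters.
small_ok = is_close(0.004, 0.0)      # 0.004 <= 0.005
small_bad = is_close(0.02, 0.0)      # 0.02  >  0.005

print(large_ok, large_bad, small_ok, small_bad)
```

With `k = 512` and standard normal inputs, the outputs are typically much larger than 1 in magnitude, so the relative term `rtol` does most of the work in this notebook's check.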