In this notebook, I'll work through [this guide by Simon Boehm on fast matmul](https://siboehm.com/articles/22/CUDA-MMM).

For simplicity, I'll use CUDA via numba. According to [its documentation](https://numba.readthedocs.io/en/stable/cuda/overview.html#missing-cuda-features), numba lacks some CUDA features, but none of them are required for this guide:
- [dynamic parallelism](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism): launching kernels from within kernels
- [texture memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#texture-and-surface-memory): a special memory for graphics applications

Let's start!

----

### Prep 1 - Use numba cuda

In [8]:
import os
import numpy as np
import matplotlib.pyplot as plt
from fastcore.basics import tuplify

#os.environ['NUMBA_ENABLE_CUDASIM']='1'  # enables simulator 
os.environ['CUDA_LAUNCH_BLOCKING']='1'

from numba import cuda
from util import to_d, to_h, array_like, cdiv

np.set_printoptions(precision=2, linewidth=200)

In [9]:
@cuda.jit
def f(a, b, c):
    tid = cuda.grid(1) # like threadIdx.x + (blockIdx.x * blockDim.x)
    if tid >= len(c): return
    c[tid] = a[tid] + b[tid]

In [10]:
n = 30
a = to_d(np.ones(n))
b = to_d(np.ones(n))
c = array_like(a)

In [11]:
nthreads = 8
nblocks = (len(a) // nthreads) + 1
f[nblocks, nthreads](a, b, c)
c = to_h(c)



In [12]:
c

array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])
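
To make the launch configuration above concrete, here is a CPU-only sketch (plain Python, no GPU needed) that enumerates the global indices `cuda.grid(1)` would produce for `n = 30` and `nthreads = 8`. The helper name `global_thread_ids` is made up for illustration and is not part of numba.

```python
def global_thread_ids(nblocks, nthreads):
    # Mimics cuda.grid(1): threadIdx.x + blockIdx.x * blockDim.x for every launched thread
    return [tx + bx * nthreads for bx in range(nblocks) for tx in range(nthreads)]

n, nthreads = 30, 8
nblocks = (n // nthreads) + 1          # same formula as in the cell above -> 4 blocks
tids = global_thread_ids(nblocks, nthreads)

active = [t for t in tids if t < n]    # threads that pass the bounds check
idle = [t for t in tids if t >= n]     # threads that return early

print(len(tids), len(active), len(idle))  # 32 30 2
```

Note that `(n // nthreads) + 1` launches one fully idle block whenever `n` is an exact multiple of `nthreads`. A ceiling division would avoid that, and the bounds check in the kernel keeps the result correct either way.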

### Prep 2 - Profiling numba cuda

There are 2 profiling options:
1. **Simple:** This measures the CUDA runtime of a kernel
2. **Detailed:** This performs a detailed analysis of the kernel run, and gives hints on how the kernel could be improved

I can do **simple** profiling in a notebook, which I'll do.

For **detailed** profiling, I can use Nsight Compute (a program by NVIDIA; short name `ncu`), which requires the code to be in a separate file, so it can't be run directly in Jupyter.

I can then use `ncu --set full --import-source yes -f -o <output_location> --page details python3 <python_filename>` for profiling. Then I can either look at the console output for a quick analysis, or load the output into the Nsight Compute UI program (which I've downloaded) for more details.

_Note: I could do detailed profiling in Jupyter with `%%writefile <py_filename>`, `%run  <py_filename>` and `!ncu ...`, but I don't see the benefit._

In [21]:
from torch.profiler import profile, record_function, ProfilerActivity
from torch.profiler import schedule as profiler_schedule
from torch import allclose, tensor

In [23]:
def cuda_mean_runtime(prof_log, kernel_name, do_print=False):
    # extract the mean cuda runtime (in ms) of one kernel from a torch.profiler log
    kernels = [o for o in prof_log.key_averages() if kernel_name in o.key]
    names = [k.key for k in kernels]
    if len(names)==0: raise RuntimeError(f"Profiling logs have no kernel with '{kernel_name}' in its name")
    if len(names)>1: raise RuntimeError(f"Profiling logs have multiple kernels with '{kernel_name}' in their names: {names}. Please be more precise.")
    mean_runtime = kernels[0].cuda_time/1e3 # use ms instead of µs
    if do_print: print(f'{mean_runtime/1e3:.3f}s') # print in s
    return mean_runtime

def assert_matmul_correct(a,b,c): assert allclose(tensor(to_h(c)), tensor(a)@tensor(b)), 'c != a@b -- matmul kernel seems to be incorrect, as its output differs from torch'

def measure_runtime(kernel, m=4092,n=4092,k=4092, bs=32, do_print=True, runs=5):
    print(f'Measuring runtime of kernel {kernel.__name__} for m,n,k = {m},{n},{k}, averaging over {runs} runs.')
    a = to_d(np.ones((m,k)))
    b = to_d(np.ones((k,n)))
    # schedule: step 0 is warm-up, steps 1-2 are recorded; with repeat=1 any later steps are not recorded
    with profile(activities=[ProfilerActivity.CUDA], schedule=profiler_schedule(wait=0, warmup=1, active=2, repeat=1)) as prof:
        for _ in range(runs):
            c = to_d(np.empty((m,n)))
            nthreads = (bs,bs)
            nblocks = cdiv(c.shape, nthreads)
            kernel[nblocks, nthreads](a,b,c)
            prof.step() # tell the profiler that one run has finished
    assert_matmul_correct(a,b,c)
    return cuda_mean_runtime(prof, kernel.__name__, do_print=do_print)

_Note: I've tested this code in simple_profily.ipynb_

### Kernel 1: Naive Implementation

In [24]:
m,n,k = 2,4,3

a = to_d(np.ones((m,k)))
b = to_d(np.ones((k,n)))
def fresh_c(): return to_d(np.empty((m,n)))

In [25]:
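@cuda.jit()
def matmul_1(a,b,c):
    tid_x, tid_y = cuda.grid(2) # global 2D thread index: one thread per element of c
    m,n = c.shape
    k = a.shape[1]
    if tid_x >= m or tid_y >= n: return # threads outside of c have nothing to do
    tmp = 0
    for i in range(k): # dot product of row tid_x of a and column tid_y of b
        tmp += a[tid_x, i] * b[i, tid_y]
    c[tid_x, tid_y] = tmp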

In [26]:
c = fresh_c()

In [27]:
nthreads = (32,32)
nblocks = cdiv(c.shape, nthreads)

nthreads, nblocks

((32, 32), (1, 1))
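
`cdiv` is imported from `util`, which isn't shown in this notebook. Below is a minimal sketch of what such a helper could look like, assuming it does ceiling division and accepts either ints or tuples. The name `cdiv_sketch` and its behaviour are assumptions inferred from how `cdiv` is called above, not the actual `util` code.

```python
def cdiv_sketch(a, b):
    # Ceiling division; for tuples it is applied element-wise (assumed behaviour)
    if isinstance(a, int) and isinstance(b, int):
        return (a + b - 1) // b
    return tuple(cdiv_sketch(x, y) for x, y in zip(a, b))

print(cdiv_sketch((2, 4), (32, 32)))        # (1, 1)
print(cdiv_sketch((4092, 4092), (32, 32)))  # (128, 128)
```

So the tiny 2x4 output fits into a single 32x32 block, while the 4092x4092 benchmark case would need a 128x128 grid of blocks.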

In [28]:
matmul_1[nblocks, nthreads](a,b,c)
to_h(c)



array([[3., 3., 3., 3.],
       [3., 3., 3., 3.]])
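
As a cross-check that needs no GPU, the kernel's per-thread logic can be replayed on the CPU. This sketch loops over every `(tid_x, tid_y)` pair that a single 32x32 block would launch and applies the same bounds check and dot product. `matmul_1_cpu` is a hypothetical name, and the sketch only covers outputs that fit into one block.

```python
def matmul_1_cpu(a, b, block=(32, 32)):
    # Replays matmul_1 sequentially: one loop iteration per launched thread
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[None] * n for _ in range(m)]
    for tid_x in range(block[0]):
        for tid_y in range(block[1]):
            if tid_x >= m or tid_y >= n:
                continue  # the kernel returns early for out-of-range threads
            tmp = 0.0
            for i in range(k):
                tmp += a[tid_x][i] * b[i][tid_y]
            c[tid_x][tid_y] = tmp
    return c

a_h = [[1.0] * 3 for _ in range(2)]  # 2x3 matrix of ones, like a above
b_h = [[1.0] * 4 for _ in range(3)]  # 3x4 matrix of ones, like b above
c_h = matmul_1_cpu(a_h, b_h)
print(c_h)
```

Of the 1024 threads in the block, only 8 pass the bounds check here; the rest exit immediately.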

In [29]:
measure_runtime(matmul_1);

Measuring runtime of kernel matmul_1 for m,n,k = 4092,4092,4092, averaging over 5 runs.


STAGE:2024-05-01 16:06:47 3256:3256 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-05-01 16:06:51 3256:3256 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-05-01 16:06:51 3256:3256 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


2.042s
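
To put the measured time in perspective, a matmul of these shapes performs roughly 2·m·n·k floating-point operations (one multiply and one add per inner-loop step). Here is a quick back-of-the-envelope calculation, assuming the printed 2.042 s is the mean time of one kernel launch:

```python
m = n = k = 4092
runtime_s = 2.042                 # value printed above

flop = 2 * m * n * k              # one multiply + one add per inner-loop iteration
gflops = flop / runtime_s / 1e9   # achieved throughput in GFLOP/s

print(f'{flop:.3e} FLOP, {gflops:.1f} GFLOP/s')
```

That works out to roughly 67 GFLOP/s. The inputs are float64 (NumPy's default for `np.ones`), so this is double-precision throughput.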


### Kernel 2: Global Memory Coalescing

### Kernel 3: Shared Memory Cache-Blocking

### Kernel 4: 1D Blocktiling for Calculating Multiple Results per Thread

### Kernel 5: Increasing Arithmetic Intensity via 2D Blocktiling

### Kernel 6: Vectorize SMEM and GMEM Accesses

### Where are kernels 7 & 8??

### Kernel 9: Autotuning

### Kernel 10: Warptiling

### wip - Kernel 11