# Matrix Multiplication Fundamentals

A GEMM (General Matrix Multiply) operation takes the form $\mathbf{C} = \alpha \mathbf{A}\mathbf{B} + \beta\mathbf{C}$ where $\alpha, \beta$ are scalars, $\mathbf{A}$ is an $m \times k$ matrix, $\mathbf{B}$ is a $k \times n$ matrix, and $\mathbf{C}$ is an $m \times n$ matrix.

The element at row $i$ and column $j$ of matrix $\mathbf{C}$ is calculated as the scaled and biased dot product of row $i$ of $\mathbf{A}$ and column $j$ of $\mathbf{B}$ as follows:

$$
\mathbf{C}_{i, j} = \alpha \left(\sum_{l=0}^{k-1} \mathbf{A}_{i, l} \mathbf{B}_{l, j} \right) + \beta \mathbf{C}_{i, j}
$$

In practice, the operation above is usually split into two parts:
1. **Matrix Multiplication** itself, computing $\mathbf{D}_{i, j} = \sum_{l=0}^{k-1} \mathbf{A}_{i, l} \mathbf{B}_{l, j}$
2. **Epilogue**, computing $\mathbf{C}_{i, j} = \alpha \cdot \mathbf{D}_{i, j} + \beta \cdot \mathbf{C}_{i, j}$
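
As a point of reference, the two-stage split can be sketched on the host with NumPy. This is only an illustrative reference, not one of the tutorial kernels; the helper name `gemm_reference` and the small problem size are assumptions made for this example.

```python
import numpy as np

def gemm_reference(alpha, a, b, beta, c):
    """Host-side reference GEMM, split into the same two stages as the text."""
    # Stage 1: matrix multiplication, D = A @ B
    d = a @ b
    # Stage 2: epilogue, C = alpha * D + beta * C
    return alpha * d + beta * c

rng = np.random.default_rng(0)
m, n, k = 4, 3, 5
a = rng.standard_normal((m, k))
b = rng.standard_normal((k, n))
c = rng.standard_normal((m, n))
alpha, beta = 0.9, 1.1

result = gemm_reference(alpha, a, b, beta, c)

# Check one element against the per-element formula (l runs from 0 to k - 1)
i, j = 1, 2
expected = alpha * sum(a[i, l] * b[l, j] for l in range(k)) + beta * c[i, j]
print(np.isclose(result[i, j], expected))
```

A reference like this is handy for checking the output of a hand-written kernel on small inputs before benchmarking it.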

## Exercise Setup

### C++ - CMake configuration

In [None]:
import sys, os
sys.path.append(os.sep.join(["..", "utilities", "python"]))
from common_cuda import setup_cmake_project

# A python cmake wrapper to determine the GPU architecture and compile for only that
setup_cmake_project()

### Python - Imports

In [None]:
import sys
import os

import numpy as np
import cupy as cp
import nvmath

from nvmath.device import Matmul
from nvmath.device.cublasdx import DevicePipeline, SharedStorageCalc, MAX_ALIGNMENT
from nvmath.device.cublasdx_numba import pipeline_extensions
from nvmath.device.common import axpby, clear, copy, copy_fragment, copy_wait, make_tensor
from numba import cuda

sys.path.append(os.sep.join(["..", "utilities", "python"]))

from benchmark import *

## Exercise 2.1: Naive DGEMM Kernel

In this exercise, we will implement a naive GEMM algorithm by having each CUDA thread calculate one element in our C matrix:

<img src="./images/naive_gemm.png" width="1600" height="auto"/>

This diagram shows that we compute element (0, 0) of the C matrix as the dot product of row 0 of the A matrix and column 0 of the B matrix.  This can be done by iterating along the K dimension and, at each step $l$, multiplying the $l$-th element of the row of A by the $l$-th element of the column of B and accumulating the result.
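
Because each thread owns exactly one element of C, the launch grid has to cover the whole $m \times n$ output. The sketch below shows one way to size such a grid. The helper `launch_grid` is hypothetical (the benchmark utilities may configure the launch differently), and the 2050-row problem is an invented example of a size that does not divide evenly by the block.

```python
import math

def launch_grid(m, n, block_x=16, block_y=16):
    """Blocks needed in x (rows of C) and y (columns of C) to cover an m x n output."""
    return math.ceil(m / block_x), math.ceil(n / block_y)

# The benchmark problem divides evenly by the 16x16 block
grid = launch_grid(2048, 2048)

# A hypothetical problem that does not divide evenly in the row dimension
grid_ragged = launch_grid(2050, 2048)
rows_launched = grid_ragged[0] * 16
idle_rows = rows_launched - 2050

print(grid, grid_ragged, idle_rows)
```

When the problem size is not a multiple of the block size, the last block in that dimension launches threads that have no element of C to compute, which is worth keeping in mind when writing the kernel.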

### C++

In [None]:
%%writefile cpp/1a/parameters.hpp.inc

    // (gemm_m, gemm_n, gemm_k, alpha, beta)
    std::vector<tutorial::gemm_problem_t> problems = {
        {2048, 2048, 2048, 0.9, 1.1}
    };

In [None]:
%%writefile cpp/1a/kernel.hpp.inc

template<int BlockSize, class TensorA, class TensorB, class TensorC>
__launch_bounds__(BlockSize, 1) __global__ void kernel_1a_simple_dgemm(double        alpha,
                                                                       TensorA const tensor_a,
                                                                       TensorB const tensor_b,
                                                                       double        beta,
                                                                       TensorC       tensor_c) {
    int const thread_row_idx = threadIdx.x + blockIdx.x * blockDim.x;
    int const thread_col_idx = threadIdx.y + blockIdx.y * blockDim.y;

    auto [size_m, size_n] = tensor_c.shape();
    auto size_k = tutorial::size<1>(tensor_a);

    // EXERCISE --> What are the conditions for early exit?
    if (...) {
        return;
    }

    double accumulator = 0.0;

    // EXERCISE --> Complete the following implementation to compute the dot product between row 'thread_row_idx' of matrix A 
    // and the column of 'thread_col_idx' of matrix B.
    for (...) {
        accumulator += ...;
    }

    // We can use the tensor object to do 2D indexing as follows:
    double c_elem = tensor_c(thread_row_idx, thread_col_idx);

    // HINT: Remember that we are doing a GEMM operation (see above)
    tensor_c(thread_row_idx, thread_col_idx) = ...;
}

In [None]:
%%writefile cpp/1a/kernel_parameters.hpp.inc
    // If you have time, try a few different block dimensions and see how the performance changes.

    // Setup kernel configuration
    int const block_dim_x = 16;
    int const block_dim_y = 16;

In [None]:
!cmake --build build/ -t 1a_simple_dgemm_tensor

In [None]:
!./build/1a_simple_dgemm_tensor

#### Solution

We will now overwrite the kernel with the solution and recompile. If you want to restart the exercise, make sure to write the exercise kernel back and recompile it.

In [None]:
%%writefile cpp/1a/kernel.hpp.inc

template<int BlockSize, class TensorA, class TensorB, class TensorC>
__launch_bounds__(BlockSize, 1) __global__ void kernel_1a_simple_dgemm(double        alpha,
                                                                       TensorA const tensor_a,
                                                                       TensorB const tensor_b,
                                                                       double        beta,
                                                                       TensorC       tensor_c) {
    int const thread_row_idx = threadIdx.x + blockIdx.x * blockDim.x;
    int const thread_col_idx = threadIdx.y + blockIdx.y * blockDim.y;

    auto [size_m, size_n] = tensor_c.shape();
    auto size_k = tutorial::size<1>(tensor_a);

    if (thread_row_idx >= size_m || thread_col_idx >= size_n) {
        return;
    }

    double accumulator = 0.0;

    // EXERCISE --> Complete the following implementation to compute the dot product between row 'thread_row_idx' of matrix A 
    // and the column of 'thread_col_idx' of matrix B.
    for (int i = 0; i < size_k; i++) {
        accumulator += tensor_a(thread_row_idx, i) * tensor_b(i, thread_col_idx);
    }

    // We can use the tensor object to do 2D indexing as follows:
    double c_elem = tensor_c(thread_row_idx, thread_col_idx);

    // HINT: Remember that we are doing a GEMM operation (see above)
    tensor_c(thread_row_idx, thread_col_idx) = alpha * accumulator + beta * c_elem;
}

In [None]:
!cmake --build build/ -t 1a_simple_dgemm_tensor

In [None]:
!./build/1a_simple_dgemm_tensor

### Python

In [None]:
# The problems that we will benchmark and run accuracy tests on. Each tuple should be formed as:
# (GEMM_M, GEMM_N, GEMM_K, ALPHA, BETA)
problems = [
  (2048, 2048, 2048, 0.9, 1.1),
]

In [None]:
def get_2_1_dgemm_kernel(block_x = 16, block_y = 16):
    block_size = block_x * block_y
    
    @cuda.jit(launch_bounds=(block_size, 1))
    def dgemm_kernel(alpha, tensor_a, tensor_b, beta, tensor_c):
        m, n = tensor_c.shape
        _, k = tensor_a.shape

        thread_row_idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        thread_col_idx = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y

        # EXERCISE --> What are the conditions for early exit?
        #if :
        #    return

        accumulator = 0.0

        # EXERCISE --> Complete the following implementation to compute the dot product between row 'thread_row_idx' of matrix A 
        # and the column of 'thread_col_idx' of matrix B.
        #for ...:
        #    accumulator += ...
    
        # We can use the tensor object to do 2D indexing as follows:
        c_elem = tensor_c[thread_row_idx, thread_col_idx]
    
        # HINT: Remember that we are doing a GEMM operation (see above)
        #tensor_c[thread_row_idx, thread_col_idx] = ...;
    
    return dgemm_kernel

In [None]:
def choose_kernel_params_2_1(m, n, k, alpha, beta):
    # EXERCISE --> Try a few different block dimensions.  A few questions to think about:
    #              How does performance change if they are not powers of 2?
    #              How does performance change with rectangular shapes?  What if you change the gemm problem?
    return 16, 16

In [None]:
benchmark_dgemm_2_1(problems, get_2_1_dgemm_kernel, choose_kernel_params_2_1)

#### Solution

In [None]:
def get_2_1_dgemm_kernel_solution(block_x = 16, block_y = 16):
    block_size = block_x * block_y
    
    @cuda.jit(launch_bounds=(block_size, 1))
    def dgemm_kernel(alpha, tensor_a, tensor_b, beta, tensor_c):
        m, n = tensor_c.shape
        _, k = tensor_a.shape

        thread_row_idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        thread_col_idx = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y

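        # Early exit for threads that fall outside of C
        if thread_row_idx >= m or thread_col_idx >= n:
            return

        accumulator = 0.0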

        # EXERCISE --> Complete the following implementation to compute the dot product between row 'thread_row_idx' of matrix A 
        # and the column of 'thread_col_idx' of matrix B.
        for i in range(k):
            accumulator += tensor_a[thread_row_idx, i] * tensor_b[i, thread_col_idx]
    
        # We can use the tensor object to do 2D indexing as follows:
        c_elem = tensor_c[thread_row_idx, thread_col_idx]
    
        # HINT: Remember that we are doing a GEMM operation (see above)
        tensor_c[thread_row_idx, thread_col_idx] = alpha * accumulator + beta * c_elem
    
    return dgemm_kernel

In [None]:
benchmark_dgemm_2_1(problems, get_2_1_dgemm_kernel_solution, choose_kernel_params_2_1)

### Analyzing Naive GEMM Performance

As expected, our kernel's performance is only okay, but how can we tell whether this is an algorithmic or an implementation limitation?  Let's briefly analyze our implementation using a [roofline model](https://en.wikipedia.org/wiki/Roofline_model):

In [None]:
import numpy as np

# TFLOPS, MEMORY BANDWIDTH (GB/s)
GPU_SPECS = {
    "L40S": (1.43, 864),
    "B200": (37, 6200)
}

def roofline_prediction_2_1(m, n, k):
    FP64_TFLOPS, MEMORY_BANDWIDTH_GBS = GPU_SPECS["L40S"]

    # By design since each thread is computing one output element
    threads = m * n

    # Each dot product consists of k multiplications and k adds 
    flops_per_thread = 2 * k

    fp64_size = np.dtype(np.float64).itemsize

    # We load a row of matrix A, a column of matrix B, and read from / write to matrix C
    memory_per_thread = (2 * k + 2) * fp64_size

    total_memory_gb = threads * memory_per_thread * 1e-9
    total_tflop = threads * flops_per_thread * 1e-12

    return total_tflop / FP64_TFLOPS, total_memory_gb / MEMORY_BANDWIDTH_GBS

time_flops, time_membw = roofline_prediction_2_1(2048, 2048, 2048)

print(f"The runtime from the math operations is {time_flops * 1e3} ms and the runtime from memory is {time_membw * 1e3} ms")

# We will either be bottlenecked by FLOPS or Memory Bandwidth, so we take the maximum
print(f"Therefore, the estimated best case runtime is {max(time_flops, time_membw) * 1e3} ms")

We can see that our kernel performs roughly as the model predicts, with some small improvements that are probably due to the hardware caching inputs it has already seen and to GPU latency hiding.  Now that we know memory is the main bottleneck, we will optimize memory movement further in the next exercise.
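
Another way to read the same model is through arithmetic intensity (FLOPs per byte moved) compared with the machine balance (peak FLOPs per byte of bandwidth). The sketch below reuses the per-thread counts and the L40S figures from the cell above; the variable names are chosen for this example only.

```python
import numpy as np

# Peak FP64 throughput (TFLOPS) and memory bandwidth (GB/s), same L40S values as above
fp64_tflops, bandwidth_gbs = 1.43, 864

k = 2048
fp64_size = np.dtype(np.float64).itemsize

# Per-thread work and traffic of the naive kernel, as counted in the roofline model
flops_per_thread = 2 * k
bytes_per_thread = (2 * k + 2) * fp64_size

# FLOPs performed per byte moved by the kernel
arithmetic_intensity = flops_per_thread / bytes_per_thread

# FLOPs the GPU can perform in the time it takes to move one byte
machine_balance = (fp64_tflops * 1e12) / (bandwidth_gbs * 1e9)

memory_bound = arithmetic_intensity < machine_balance
print(f"intensity = {arithmetic_intensity:.3f} FLOP/B, balance = {machine_balance:.3f} FLOP/B")
print(f"memory bound: {memory_bound}")
```

Under this simplified model, a kernel whose intensity sits below the machine balance is limited by memory bandwidth, so raising the number of FLOPs performed per byte loaded is the natural next step.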

Information about the capabilities of specific GPUs can be found in their official datasheets. Some examples:
- [NVIDIA A100](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf)
- [NVIDIA L40](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/support-guide/NVIDIA-L40-Datasheet-January-2023.pdf)
- [NVIDIA L40S](https://images.nvidia.com/content/Solutions/data-center/vgpu-L40-datasheet.pdf)
- [NVIDIA Blackwell RTX Pro](https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/rtx-pro-6000-blackwell-workstation-edition/workstation-blackwell-rtx-pro-6000-workstation-edition-nvidia-us-3519208-web.pdf)
- [NVIDIA B200](https://resources.nvidia.com/en-us-blackwell-architecture)

## Exercise 2.2: Improving DGEMM with Shared Memory Tiling

In this exercise, we will discuss the memory subsystem of GPUs and leverage these hardware properties for further optimization.

The memory subsystem of a GPU has many components, including global memory, the L2 cache, the L1 cache, and registers. The relative access speed of each depends heavily on the GPU; however, a common analogy describes how they compare:

 - Accessing registers is similar to already having the part in your hand
 - Accessing L1 is like having the part in your pocket
 - Accessing L2 is similar to having the part on your workbench
 - Accessing global memory is like having the part in your toolbox in another room

The further a level is from the compute units (the higher it sits in this hierarchy), the slower it is to access.  However, those levels typically offer more space.  Keeping with the analogy, since we can't keep that many parts in our hands at once, we need to strategically decide where to store our parts in order to increase our efficiency.

In **Exercise 2.1**, we effectively read from global memory on every access, which is like running to the other room every time we need a new part.  In this exercise, we will strategically store data in L1 (the pockets in our analogy) and periodically fetch new data (or parts).  CUDA lets us explicitly read from and write to this on-chip storage through shared memory, which on most recent GPUs shares its physical storage with the L1 cache. The exercise below has us read from global memory in "tiles" and then compute on tiles of the A and B matrices by reading from shared memory, as shown below:

<img src="images/tiling.png" width="1400" height="auto"/>
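
To make the tiling scheme concrete before moving to CUDA, here is a host-side NumPy sketch of the same idea. It is not the kernel solution: the function `tiled_matmul` is written for this illustration, the explicit `.copy()` calls merely stand in for staging a tile in shared memory, and it assumes all dimensions are multiples of the tile size.

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """Compute a @ b one output tile at a time, accumulating over tiles along K."""
    m, k = a.shape
    _, n = b.shape
    # Assumption for simplicity: every dimension is a multiple of the tile size
    assert m % tile == 0 and n % tile == 0 and k % tile == 0

    d = np.zeros((m, n))
    for row in range(0, m, tile):          # one "thread block" per output tile
        for col in range(0, n, tile):
            acc = np.zeros((tile, tile))   # per-tile accumulator
            for k_start in range(0, k, tile):
                # Stage one tile of A and one tile of B (stand-in for shared memory)
                a_tile = a[row:row + tile, k_start:k_start + tile].copy()
                b_tile = b[k_start:k_start + tile, col:col + tile].copy()
                # Every staged element is reused `tile` times in this product
                acc += a_tile @ b_tile
            d[row:row + tile, col:col + tile] = acc
    return d

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 48))
b = rng.standard_normal((48, 32))
matches = np.allclose(tiled_matmul(a, b), a @ b)
print(matches)
```

The point of the sketch is the reuse: each staged tile is read from slow memory once and then used for many multiply-accumulates, which is what the shared memory version exploits on the GPU.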

### C++

In [None]:
%%writefile cpp/1b/parameters.hpp.inc
    // (gemm_m, gemm_n, gemm_k, alpha, beta)
    std::vector<tutorial::gemm_problem_t> problems = {
        {2048, 2048, 2048, 0.9, 1.1}
    };

In [None]:
%%writefile cpp/1b/kernel.hpp.inc

template <int BlockM, int BlockN, int BlockK>
struct tile_config {
    static constexpr int m = BlockM;
    static constexpr int n = BlockN;
    static constexpr int k = BlockK;

    static constexpr int num_elems_a = m * k;
    static constexpr int num_elems_b = k * n;

    static constexpr int max_threads_per_block = BlockM * BlockN;

    static_assert(m == n && m == k, "This constraint is for simplicity, feel free to challenge yourself and complicate the config");
};

template<class TileConfig, class TensorA, class TensorB, class TensorC>
__launch_bounds__(TileConfig::max_threads_per_block, 1) __global__
    void kernel_1b_simple_dgemm_shared(double        alpha,
                                       TensorA const tensor_a,
                                       TensorB const tensor_b,
                                       double        beta,
                                       TensorC const tensor_c) {
    extern __shared__ __align__(sizeof(double)) unsigned char smem[];

    double* smem_a_data = reinterpret_cast<double*>(smem);
    auto smem_a_tensor  = tutorial::make_smem_tensor<arrangement::col_major, TileConfig::m, TileConfig::k>(smem_a_data);

    double* smem_b_data = tutorial::raw_pointer_cast(smem_a_tensor.data()) + TileConfig::num_elems_a;
    auto smem_b_tensor  = tutorial::make_smem_tensor<arrangement::row_major, TileConfig::k, TileConfig::n>(smem_b_data);

    // Assert that for A: mxk and B: kxn both Ks are the same size
    auto const global_k = tutorial::size<1>(tensor_a);

    // Define accumulator storage
    double accumulator = 0.0;

    int const idx_x = threadIdx.x;
    int const idx_y = threadIdx.y;

    int const thread_row_idx = threadIdx.x + blockDim.x * blockIdx.x;
    int const thread_col_idx = threadIdx.y + blockDim.y * blockIdx.y;

    // EXERCISE -> Iterate in tiles along the K dimension, store tiles of the A and B matrices into shared memory,
    //             and read from shared memory buffers when accumulating.
    //
    // Hints:
    //  - When do we need to synchronize (with __syncthreads)?
}

In [None]:
%%writefile cpp/1b/kernel_parameters.hpp.inc

    constexpr int tile_m      = 16;
    constexpr int tile_n      = 16;
    constexpr int tile_k      = 16;
    using tile                = tile_config<tile_m, tile_n, tile_k>;

In [None]:
!cmake --build build/ -t 1b_simple_dgemm_shared

In [None]:
!./build/1b_simple_dgemm_shared

#### Solution

We will now overwrite the kernel with the solution and recompile. If you want to restart the exercise, make sure to write the exercise kernel back and recompile it.

In [None]:
%%writefile cpp/1b/kernel.hpp.inc

template <int BlockM, int BlockN, int BlockK>
struct tile_config {
    static constexpr int m = BlockM;
    static constexpr int n = BlockN;
    static constexpr int k = BlockK;

    static constexpr int num_elems_a = m * k;
    static constexpr int num_elems_b = k * n;

    static constexpr int max_threads_per_block = BlockM * BlockN;

    static_assert(m == n && m == k, "This constraint is for simplicity, feel free to challenge yourself and complicate the config");
};

template<class TileConfig, class TensorA, class TensorB, class TensorC>
__launch_bounds__(TileConfig::max_threads_per_block, 1) __global__
    void kernel_1b_simple_dgemm_shared(double        alpha,
                                       TensorA const tensor_a,
                                       TensorB const tensor_b,
                                       double        beta,
                                       TensorC const tensor_c) {
    extern __shared__ __align__(sizeof(double)) unsigned char smem[];

    double* smem_a_data = reinterpret_cast<double*>(smem);
    auto smem_a_tensor  = tutorial::make_smem_tensor<arrangement::col_major, TileConfig::m, TileConfig::k>(smem_a_data);

    double* smem_b_data = tutorial::raw_pointer_cast(smem_a_tensor.data()) + TileConfig::num_elems_a;
    auto smem_b_tensor  = tutorial::make_smem_tensor<arrangement::row_major, TileConfig::k, TileConfig::n>(smem_b_data);

    // Assert that for A: mxk and B: kxn both Ks are the same size
    auto const global_k = tutorial::size<1>(tensor_a);

    // Define accumulator storage
    double accumulator = 0.0;

    int const idx_x = threadIdx.x;
    int const idx_y = threadIdx.y;

    int const thread_row_idx = threadIdx.x + blockDim.x * blockIdx.x;
    int const thread_col_idx = threadIdx.y + blockDim.y * blockIdx.y;

    // EXERCISE -> Iterate in tiles along the K dimension, store tiles of the A and B matrices into shared memory,
    //             and read from shared memory buffers when accumulating.
    //
    // Hints:
    //  - When do we need to synchronize (with __syncthreads)?
    for (int tile_iter = 0; tile_iter < (global_k / TileConfig::k); ++tile_iter) {

        // Load current tile into shared memory
        auto current_global_tile_a = cublasdx::get_tile(tensor_a, smem_a_tensor.shape(), blockIdx.x, tile_iter);
        auto current_global_tile_b = cublasdx::get_tile(tensor_b, smem_b_tensor.shape(), tile_iter, blockIdx.y);

        __syncthreads();

        smem_a_tensor(idx_x, idx_y) = current_global_tile_a(idx_x, idx_y);
        smem_b_tensor(idx_x, idx_y) = current_global_tile_b(idx_x, idx_y);

        __syncthreads();

        #pragma unroll
        for (int i = 0; i < TileConfig::k; i++) {
            accumulator += smem_a_tensor(idx_x, i) * smem_b_tensor(i, idx_y);
        }
    }

    double const c_elem = tensor_c(thread_row_idx, thread_col_idx);
    double const result = alpha * accumulator + beta * c_elem;

    // Store results
    tensor_c(thread_row_idx, thread_col_idx) = result;
}

In [None]:
!cmake --build build/ -t 1b_simple_dgemm_shared

In [None]:
!./build/1b_simple_dgemm_shared

### Python

In [None]:
# The problems that we will benchmark and run accuracy tests on. Each tuple should be formed as:
# (GEMM_M, GEMM_N, GEMM_K, ALPHA, BETA)
problems = [
  (2048, 2048, 2048, 1.0, 1.0),
]

In [None]:
def get_2_2_dgemm_kernel():
    # For this kernel, we simplify to only 16x16x16 tile size.
    # While it's possible to change tile sizes, it significantly complicates code
    TILE_M = 16
    TILE_N = 16
    TILE_K = 16

    BLOCK_SIZE = 16 * 16
    
    @cuda.jit(launch_bounds=(BLOCK_SIZE, 1))
    def dgemm_kernel(alpha, tensor_a, tensor_b, beta, tensor_c):
        m, n = tensor_c.shape
        _, k = tensor_a.shape

        smem_a_tensor = cuda.shared.array(shape=(TILE_M, TILE_K), dtype=np.float64)
        smem_b_tensor = cuda.shared.array(shape=(TILE_K, TILE_N), dtype=np.float64)

        idx_x = cuda.threadIdx.x
        idx_y = cuda.threadIdx.y
        
        thread_row_idx = idx_x + cuda.blockIdx.x * cuda.blockDim.x
        thread_col_idx = idx_y + cuda.blockIdx.y * cuda.blockDim.y

        # EXERCISE -> Iterate in tiles along the K dimension, store tiles of the A and B matrices into shared memory,
        #             and read from shared memory buffers when accumulating.
        #
        # Hints:
        #  - When do we need to synchronize (with cuda.syncthreads())?
    
    return dgemm_kernel

In [None]:
benchmark_dgemm_2_2(problems, get_2_2_dgemm_kernel)

#### Solution

In [None]:
def get_2_2_dgemm_kernel_solution():
    # For this kernel, we simplify to only 16x16x16 tile size.
    # While it's possible to change tile sizes, it significantly complicates code
    TILE_M = 16
    TILE_N = 16
    TILE_K = 16

    BLOCK_SIZE = 16 * 16
    
    @cuda.jit(launch_bounds=(BLOCK_SIZE, 1))
    def dgemm_kernel(alpha, tensor_a, tensor_b, beta, tensor_c):
        m, n = tensor_c.shape
        _, k = tensor_a.shape

        smem_a_tensor = cuda.shared.array(shape=(TILE_M, TILE_K), dtype=np.float64)
        smem_b_tensor = cuda.shared.array(shape=(TILE_K, TILE_N), dtype=np.float64)

        idx_x = cuda.threadIdx.x
        idx_y = cuda.threadIdx.y
        
        thread_row_idx = idx_x + cuda.blockIdx.x * cuda.blockDim.x
        thread_col_idx = idx_y + cuda.blockIdx.y * cuda.blockDim.y

        accumulator = 0.0

        # EXERCISE -> Iterate in tiles along the K dimension, store tiles of the A and B matrices into shared memory,
        #             and read from shared memory buffers when accumulating.
        #
        # Hints:
        #  - When do we need to synchronize (with cuda.syncthreads())?
        for tile_k_start in range(0, k, TILE_K):
            smem_a_tensor[idx_x, idx_y] = tensor_a[thread_row_idx, tile_k_start + idx_y]
            smem_b_tensor[idx_x, idx_y] = tensor_b[tile_k_start + idx_x, thread_col_idx]
    
            cuda.syncthreads()
    
            for i in range(0, TILE_K):
                accumulator += smem_a_tensor[idx_x, i] * smem_b_tensor[i, idx_y]

            cuda.syncthreads()

        c_elem = tensor_c[thread_row_idx, thread_col_idx]
        
        tensor_c[thread_row_idx, thread_col_idx] = alpha * accumulator + beta * c_elem
    
    return dgemm_kernel

In [None]:
benchmark_dgemm_2_2(problems, get_2_2_dgemm_kernel_solution)

### Analyzing Tiled GEMM Performance

Let's modify our roofline model and consider the optimizations we've made.
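
As a quick sanity check, we can estimate how much global memory traffic tiling removes. The sketch below uses the same per-thread and per-tile traffic counts as the two roofline models in this notebook; the function names are invented for this example, and it assumes the problem dimensions divide evenly by the tile size.

```python
def naive_bytes(m, n, k, elem_size=8):
    """Traffic of the naive kernel: each thread reads a row of A, a column of B, and touches C twice."""
    return m * n * (2 * k + 2) * elem_size

def tiled_bytes(m, n, k, tile_m=16, tile_n=16, elem_size=8):
    """Traffic of the tiled kernel: each tile reads TILE_M rows of A, TILE_N columns of B, and its C tile twice."""
    tiles = (m // tile_m) * (n // tile_n)
    return tiles * (tile_m * k + tile_n * k + 2 * tile_m * tile_n) * elem_size

m = n = k = 2048
reduction = naive_bytes(m, n, k) / tiled_bytes(m, n, k)
print(f"Tiling reduces modeled global memory traffic by about {reduction:.1f}x")
```

For square 16x16 tiles the modeled reduction is close to the tile dimension, since every element loaded into shared memory is reused by 16 threads instead of being fetched by each of them.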

In [None]:
import numpy as np
import math

# TFLOPS, MEMORY BANDWIDTH (GB/s)
GPU_SPECS = {
    "L40S": (1.43, 864),
    "B200": (37, 6200)
}

def roofline_prediction_2_2(m, n, k, TILE_M=16, TILE_N=16, TILE_K=16):
    FP64_TFLOPS, MEMORY_BANDWIDTH_GBS = GPU_SPECS["L40S"]

    # Let's instead count the work per output tile

    # By design, each thread block computes one TILE_M x TILE_N tile of C
    tiles = math.ceil(m / TILE_M) * math.ceil(n / TILE_N)

    # Each tile does TILE_M * TILE_N dot products which each have k multiplications and k additions
    flops_per_tile = 2 * TILE_M * TILE_N * k

    fp64_size = np.dtype(np.float64).itemsize

    # We load TILE_M rows of matrix A, TILE_N columns of matrix B, and read from and write to TILE_M * TILE_N elements of matrix C
    memory_per_tile = (TILE_M * k + TILE_N * k + 2 * TILE_M * TILE_N) * fp64_size

    total_memory_gb = tiles * memory_per_tile * 1e-9
    total_tflop = tiles * flops_per_tile * 1e-12

    return total_tflop / FP64_TFLOPS, total_memory_gb / MEMORY_BANDWIDTH_GBS

time_flops, time_membw = roofline_prediction_2_2(2048, 2048, 2048)

print(f"The runtime from the math operations is {time_flops * 1e3} ms and the runtime from memory is {time_membw * 1e3} ms")

# We will either be bottlenecked by FLOPS or Memory Bandwidth, so we take the maximum
print(f"Therefore, the estimated best case runtime is {max(time_flops, time_membw) * 1e3} ms")

Information about the capabilities of specific GPUs can be found in their official datasheets. Some examples:
- [NVIDIA A100](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf)
- [NVIDIA H200](https://resources.nvidia.com/en-us-hopper-architecture)
- [NVIDIA L40](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/support-guide/NVIDIA-L40-Datasheet-January-2023.pdf)
- [NVIDIA L40S](https://images.nvidia.com/content/Solutions/data-center/vgpu-L40-datasheet.pdf)
- [NVIDIA RTX PRO 6000 Blackwell Server Edition](https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/rtx-pro-6000-blackwell-workstation-edition/workstation-blackwell-rtx-pro-6000-workstation-edition-nvidia-us-3519208-web.pdf)
- [NVIDIA B200](https://resources.nvidia.com/en-us-blackwell-architecture)

## Exercise 2.3: Improving DGEMM with the cuBLASDx Pipeline API

The roofline models we used, despite being an over-simplification, give us a lot of insight.  If we are close to the roofline, we are as optimized as we can be without fundamentally changing the algorithm (as we did when we introduced tiling).  If we are far away from the roofline, then we know that looking for further optimizations could be worth it.

Going forward, further optimizations follow a similar structure: we would apply techniques to increase the tile size and implement similar tiling schemes at lower levels of the memory hierarchy.  The algorithms grow more complicated, and the source code grows at least as fast.  Rather than continuing with these techniques, for the remaining exercises we will take advantage of these advanced optimizations without implementing them ourselves, by calling the cuBLASDx and nvmath-python libraries.

### What is GEMM pipelining

MathDx operations are typically performed at the shared memory tile level, just as we implemented in Exercise 2.2.  The typical procedure is:

1. Load data into shared memory
2. Perform computations using shared memory
3. Store results

Kernels that follow this procedure are potentially compute-bound, but only if the compute units are never left waiting for data. While one stage is being computed, the next can already be in flight into shared memory; this is what we call the **pipelined approach**. Without such overlap, CUDA's hardware latency hiding may not be enough to maximize memory bandwidth. With it, the compute units (Tensor Cores) can stay close to fully utilized instead of stalling while they wait for the next stage. An advanced extension of pipelining is the producer-consumer model, in which some threads are dedicated to loading data into buffers while others compute on the data that has already been loaded.

The **cuBLASDx Pipeline API** exposes this pipelined logic through a simple interface while still allowing fusion at all levels of the hierarchy, either before or after the computation.

![Pipeline](images/pipeline.png)
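
The benefit of overlapping loads with compute can be illustrated with a toy timing model. This is an idealized sketch, not a description of how cuBLASDx schedules work: the two functions and the stage costs (in arbitrary time units) are assumptions made for this example, and the model assumes that the load of the next tile overlaps perfectly with the compute of the current one.

```python
def serial_time(num_tiles, load, compute):
    """No overlap: every tile is loaded, then computed, one after another."""
    return num_tiles * (load + compute)

def pipelined_time(num_tiles, load, compute):
    """Ideal overlap: while tile i is computed, tile i + 1 is being loaded."""
    # The first load cannot be hidden, each overlapped step costs the slower
    # of the two stages, and the final compute has no load left to overlap with.
    return load + (num_tiles - 1) * max(load, compute) + compute

# Hypothetical costs: 64 tiles along K (2048 / 32), arbitrary time units
num_tiles, load, compute = 64, 3, 2

t_serial = serial_time(num_tiles, load, compute)
t_pipelined = pipelined_time(num_tiles, load, compute)
print(t_serial, t_pipelined)
```

In this model the pipelined runtime approaches `num_tiles * max(load, compute)`, so the faster of the two stages is almost entirely hidden behind the slower one.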

**How to find optimal bytes-in-flight (tile size and pipeline depth)?**

**Little’s law** in queuing theory states that the average number of items $L$ in a stable system equals the product of the arrival (or service) throughput $\lambda$ and the average time $W$ that an item spends in the system, i.e., $L = \lambda W$.

In the context of GPU/CUDA memory systems, the “items” can be interpreted as data units, such as bytes or cache lines moving through the memory hierarchy. The throughput $\lambda$ is then the sustained memory bandwidth in bytes per second, and the time $W$ corresponds to the effective memory access latency in seconds.

Combining these interpretations yields the following approximation:
$$
\text{bytes in flight} \approx \text{bandwidth} \times \text{latency}.
$$

This expresses that the amount of data concurrently outstanding in the memory system must be large enough to match the product of the achievable bandwidth and the latency.

Practically, this means that when global-memory latency is high, a CUDA kernel must generate many independent memory requests so that enough bytes are in flight to hide latency and keep DRAM bandwidth saturated.

More information about this can be found in the [CUDA Techniques to maximize memory bandwidth and hide latency](https://www.nvidia.com/en-us/on-demand/session/gtc25-s72683/) GTC presentation.
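
As a rough worked example of the approximation above, the sketch below estimates the bytes in flight needed to saturate memory bandwidth and compares that with what one pipelined thread block can have outstanding. The bandwidth is the L40S figure used in the roofline cells; the latency value is purely an illustrative assumption (not a measured or documented number), and the tile shape and pipeline depth mirror the values used in the C++ configuration of this exercise.

```python
# L40S memory bandwidth from the roofline cells, converted to bytes per second
bandwidth_bytes_per_s = 864e9

# ASSUMPTION: illustrative global memory latency, not a measured value
latency_s = 600e-9

# Little's law: bytes in flight ~= bandwidth * latency
bytes_in_flight = bandwidth_bytes_per_s * latency_s

# One pipeline stage holds a tile of A (tile_m x tile_k) and a tile of B (tile_k x tile_n)
tile_m, tile_n, tile_k = 64, 64, 32
pipeline_depth = 2
fp64_size = 8

bytes_per_stage = (tile_m * tile_k + tile_k * tile_n) * fp64_size
# Upper bound on what a single block can have outstanding across its stages
bytes_per_block = pipeline_depth * bytes_per_stage

blocks_needed = bytes_in_flight / bytes_per_block
print(f"{bytes_in_flight:.0f} bytes in flight, about {blocks_needed:.1f} concurrent blocks")
```

Under these assumptions, larger tiles or a deeper pipeline raise the bytes each block keeps in flight, at the cost of more shared memory per block; that is the trade-off to explore when tuning tile size and pipeline depth.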

### MathDx and cuBLASDx

The [cuBLAS Device Extension](https://docs.nvidia.com/cuda/cublasdx/) library (**cuBLASDx**) gives kernel developers the flexibility to define GEMM operations in terms of shared memory tiles and compose these operations into their kernels. cuBLASDx is a part of MathDx, a **Device Extension** library suite, also containing:
- cuSolverDx, for numerical solvers
- cuFFTDx, for thread and block FFTs
- cuRANDDx, for random number generation
- nvCOMPDx, for data compression

MathDx exposes functionality on all CUDA memory levels, ranging from global memory pipelines, through shared memory tile computations to per-thread in-register algorithms.

### GEMM kernel with cuBLASDx Pipeline API

Some CUDA instructions require significant orchestration around them to function properly. This is especially true of recent NVIDIA architectures (Hopper, Blackwell) and their respective capabilities (`TMA`, `WGMMA`, `UTCMMA`).

The kernel patterns used underneath to allow for greater overlap have changed as well, and now include producer-consumer warpgroups, barrier-based multi-stage pipelines, and decoupled epilogues.

Over time, the complexity of this surrounding logic and orchestration grew to be as large as that of cuBLASDx's per-tile `copy` and `execute` operations. The library's core goal is to keep external operations fusable while still executing math primitives at the best possible performance. The **cuBLASDx Pipeline API** was created for this reason: it exposes the entire GEMM pipeline through a one-line call that returns the GEMM result in registers and still allows pre-processing operations to be fused.

cuBLASDx documentation offers a [short guide on using Pipeline API](https://docs.nvidia.com/cuda/cublasdx/using_pipelines.html) for GEMM computations.

### C++

In [None]:
%%writefile cpp/1c/parameters.hpp.inc

    // (gemm_m, gemm_n, gemm_k, alpha, beta)
    std::vector<tutorial::gemm_problem_t> problems = {
        {2048, 2048, 2048, 0.9, 1.1}
    };

In [None]:
%%writefile cpp/1c/cublasdx_config.hpp.inc

constexpr int tile_m = 64;
constexpr int tile_n = 64;
constexpr int tile_k = 32;

constexpr int block_dim = 256;

// The first step is to define the tile-level GEMM operation to be performed. 
// This is accomplished by combining cuBLASDx operators to create a GEMM description.

using BLAS =
    decltype(cublasdx::Size<tile_m, tile_n, tile_k>() +       // Description: Shared Memory GEMM Size
        cublasdx::Precision<double, double, double>() +       // Description: Input Precisions
        cublasdx::Type<cublasdx::type::real>() +              // Description: Input number type (real / complex)
        cublasdx::Function<cublasdx::function::MM>() +        // Description: BLAS function (MM - Matrix Multiplication)
        cublasdx::Arrangement<arr_a, arr_b, arr_c>() +        // Description: Global Memory arrangement (row- or column-major)
        cublasdx::Block() +                                   // Execution: per-tile operation level (CUDA threadblock)
        cublasdx::BlockDim<block_dim>() +                     // Execution: CUDA threadblock size (1D, 2D or 3D) 
        cublasdx::StaticBlockDim() +                          // Performance: this kernel will not use more threads than specified
        cublasdx::MaxAlignment() +                            // Performance: global and shared memory alignment is >= 16 bytes
        cublasdx::EnableInputStreaming() +                    // Performance: no per-element preprocessing needs to be used
        cublasdx::SM<SM_VALUE, SM_MODIFIER_VALUE>() +         // Execution: run on SM (e.g. 89) with modifier (e.g. 89a)
        cublasdx::WithPipeline());                            // Execution: this per-tile descriptor will be only used with pipeline

In [None]:
%%writefile cpp/1c/pipeline_config.hpp.inc

    // IMPORTANT: The pipeline description needs to be defined on host,
    // because possible TMA initialization must happen through a driver call

    // Pipeline depth discussed in a section above
    constexpr int pipeline_depth      = 2;
    // cuBLASDx returns a std::optional<device_pipeline>, which is empty if the arguments are invalid
    auto          opt_device_pipeline = cublasdx::suggest_device_pipeline<pipeline_depth, BLAS>(tensor_a, tensor_b);

    if (not opt_device_pipeline) {
        std::cout << "Incorrect pipeline configuration, please ensure global tensors are divisible by tile"
                  << std::endl;
        exit(1);
    }
    // The pipeline can be retrieved now
    auto device_pipeline = opt_device_pipeline.value();

In [None]:
%%writefile cpp/1c/kernel.hpp.inc

template<class BLAS, class TensorC, class DevicePipeline>
__launch_bounds__(DevicePipeline::max_threads_per_block, 1) __global__
    void kernel_1c_simple_pipelined_dgemm(double                                 alpha,
                                          double                                 beta,
                                          TensorC const                          tensor_c,
                                          // IMPORTANT --> grid constant
                                          __grid_constant__ DevicePipeline const device_pipeline) {
    extern __shared__ __align__(device_pipeline.buffer_alignment()) char smem[];

    auto tile_pipeline = device_pipeline.get_tile(smem, blockIdx.x, blockIdx.y);
    auto tile_gmem_c   = cublasdx::get_tile(EXERCISE);

    auto epilogue_functor = [&](auto& accumulator) {
       // EXERCISE --> implement GEMM epilogue (C = alpha * D + beta * C)
       // Possible approaches:
       // - manually (axpby for loop)
       // - cublasdx::axpby(alpha, fragment, beta, fragment);
       // - accumulator.axpby(alpha, beta, gmem_tile)

       // The following calls may or may not be necessary depending on chosen implementation
       // auto register_result_tensor = accumulator.get_results();
       // auto c_register_fragment = accumulator.make_partition_and_copy(tile_gmem_c);
       // accumulator.partition_and_copy(TODO, tile_gmem_c);
    };

    tile_pipeline.execute(epilogue_functor);
}

In [None]:
%%writefile cpp/1c/kernel_config.hpp.inc

    auto kernel = kernel_1c_simple_pipelined_dgemm<BLAS, CTensor, decltype(device_pipeline)>;
    // Pipeline exposes pre-computed shared memory requirement that includes its own cache size
    auto shared_memory_size = device_pipeline.buffer_size();
    CUDA_CHECK_AND_EXIT(cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));
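
The `cudaFuncSetAttribute` call matters because CUDA kernels must opt in before using more than 48 KiB of dynamic shared memory. The sketch below is a rough lower-bound estimate only, assuming each pipeline stage holds one A tile and one B tile and ignoring any padding or synchronization state; `device_pipeline.buffer_size()` remains the authoritative value. The helper `staged_tile_bytes` is hypothetical.

```python
def staged_tile_bytes(tile_m, tile_n, tile_k, depth, itemsize=8):
    """Bytes needed to keep `depth` stages of A and B tiles resident (lower bound)."""
    a_tile = tile_m * tile_k  # elements in one A tile
    b_tile = tile_k * tile_n  # elements in one B tile
    return depth * (a_tile + b_tile) * itemsize


# Kernels exceeding this much dynamic shared memory need the explicit opt-in.
DEFAULT_DYNAMIC_SMEM_LIMIT = 48 * 1024

estimate = staged_tile_bytes(64, 64, 32, depth=2)  # tile sizes and depth from this exercise
print(estimate, estimate > DEFAULT_DYNAMIC_SMEM_LIMIT)  # 65536 True
```

Even this lower bound (64 KiB for double precision with a pipeline depth of 2) is above the default limit, so the kernel could not launch without raising the attribute.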

In [None]:
! cmake --build ./build -t 1c_simple_pipelined_dgemm

In [None]:
! ./build/1c_simple_pipelined_dgemm

#### Solution

We will now overwrite the kernel with the solution and recompile. If you want to restart the exercise, re-run the exercise kernel cell above to restore the original kernel, then recompile.

In [None]:
%%writefile cpp/1c/kernel.hpp.inc

template<class BLAS, class TensorC, class DevicePipeline>
__launch_bounds__(DevicePipeline::max_threads_per_block, 1) __global__
    void kernel_1c_simple_pipelined_dgemm(double                                 alpha,
                                          double                                 beta,
                                          TensorC const                          tensor_c,
                                          // IMPORTANT --> grid constant
                                          __grid_constant__ DevicePipeline const device_pipeline) {
    extern __shared__ __align__(device_pipeline.buffer_alignment()) char smem[];

    auto tile_pipeline = device_pipeline.get_tile(smem, blockIdx.x, blockIdx.y);
    auto tile_gmem_c   = cublasdx::get_tile(tensor_c, BLAS::c_shape, blockIdx.x, blockIdx.y);

    auto epilogue_functor = [&](auto& accumulator) {
       accumulator.axpby(alpha, beta, tile_gmem_c);
    };

    tile_pipeline.execute(epilogue_functor);
}

In [None]:
! cmake --build ./build -t 1c_simple_pipelined_dgemm

In [None]:
! ./build/1c_simple_pipelined_dgemm
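
To make the intent of the solution concrete, here is a CPU-only NumPy model of what each thread block computes: it accumulates its tile of $\mathbf{D}$ over the K dimension and then applies the epilogue $\mathbf{C} = \alpha \mathbf{D} + \beta \mathbf{C}$ to that same tile. This is only a reference sketch under the assumption that all dimensions are divisible by the tile sizes. The functions `reference_block` and `reference_gemm` are hypothetical names and say nothing about how cuBLASDx distributes work across threads.

```python
import numpy as np


def reference_block(alpha, a, b, beta, c, block_m, block_n,
                    tile_m=64, tile_n=64, tile_k=32):
    """Model one thread block: accumulate over K, then fuse the epilogue."""
    rows = slice(block_m * tile_m, (block_m + 1) * tile_m)
    cols = slice(block_n * tile_n, (block_n + 1) * tile_n)

    # Matrix multiplication part: D tile accumulated slice by slice along K.
    acc = np.zeros((tile_m, tile_n), dtype=c.dtype)
    for l in range(0, a.shape[1], tile_k):
        acc += a[rows, l:l + tile_k] @ b[l:l + tile_k, cols]

    # Epilogue part: C tile = alpha * D tile + beta * C tile.
    c[rows, cols] = alpha * acc + beta * c[rows, cols]


def reference_gemm(alpha, a, b, beta, c, tile_m=64, tile_n=64):
    """Run the block model once per output tile, like a 2D grid launch."""
    for block_m in range(c.shape[0] // tile_m):
        for block_n in range(c.shape[1] // tile_n):
            reference_block(alpha, a, b, beta, c, block_m, block_n, tile_m, tile_n)
    return c


rng = np.random.default_rng(0)
a = rng.standard_normal((128, 96))
b = rng.standard_normal((96, 192))
c = rng.standard_normal((128, 192))
expected = 0.9 * (a @ b) + 1.1 * c

reference_gemm(0.9, a, b, 1.1, c)
print(np.allclose(c, expected))
```

Because the epilogue runs on the accumulator before anything is written back, the intermediate matrix $\mathbf{D}$ never has to be stored in global memory.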

### Python

In [None]:
# The problems that we will benchmark and run accuracy tests on.
# Each tuple should be formed as: (GEMM_M, GEMM_N, GEMM_K, ALPHA, BETA)
problems = [
  (2048, 2048, 2048, 1.0, 1.0),
]

In [None]:
def choose_kernel_params_2_3(m, n, k, alpha, beta):
    TILE_M = 64
    TILE_N = 64
    TILE_K = 32

    BLOCK_SIZE = 256
    
    # The first step is to define the tile-level GEMM operation to be performed. 
    # This is accomplished by combining cuBLASDx operators to create a GEMM description.

    return Matmul(             
        size=(TILE_M, TILE_N, TILE_K),                        # Description: Shared Memory GEMM Size
        precision=(np.float64, np.float64, np.float64),       # Description: Input Precisions
        data_type="real",                                     # Description: Input number type (real / complex)
        alignment=MAX_ALIGNMENT,                              # Performance: global and shared memory alignment
        arrangement=("row_major", "col_major", "col_major"),  # Description: Global Memory arrangement (row- or column-major)
        execution="Block",                                    # Execution: per-tile operation level (CUDA threadblock)
        block_size=BLOCK_SIZE,                                # Execution: CUDA threadblock size (1D, 2D or 3D) 
        with_pipeline=True,                                   # Execution: this per-tile descriptor will only be used with a pipeline
        enable_input_streaming=True,                          # Performance: no per-element preprocessing needs to be used
        static_block_dim=True,                                # Performance: this kernel will not use more threads than specified
    )

In [None]:
def get_kernel_args_2_3(BLAS, alpha, tensor_a, tensor_b, beta, tensor_c):
    # IMPORTANT: The pipeline description needs to be defined on host,
    # because possible TMA initialization must happen through a driver call

    # Pipeline depth discussed in a section above
    PIPELINE_DEPTH = 2

    TILE_K = BLAS.a_dim[1]
    _, k = tensor_a.shape

    assert k >= PIPELINE_DEPTH * TILE_K, "The user provided value for K is too small for the pipeline depth"
    
    device_pipeline = BLAS.suggest_device_pipeline(PIPELINE_DEPTH, tensor_a, tensor_b)
    return alpha, beta, tensor_c, device_pipeline

def get_shared_memory_size_2_3(BLAS, kernel_args):
    device_pipeline = kernel_args[-1]
    # Pipeline exposes pre-computed shared memory requirement that includes its own cache size
    return device_pipeline.buffer_size
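
The assertion on `k` above follows from how a multi-stage pipeline is typically filled: all stages are loaded before the first compute step, so there must be at least `PIPELINE_DEPTH` slices of K. The sketch below models a generic software-pipelining schedule with a ring of shared memory stages. It is an illustration of the idea only and is not a description of cuBLASDx internals; `pipeline_schedule` is a hypothetical helper.

```python
def pipeline_schedule(k, tile_k, depth):
    """Return a list of (action, k_step, stage) events for a ring of `depth` stages."""
    k_steps = k // tile_k
    if k_steps < depth:
        raise ValueError("K is too small for the pipeline depth")

    # Prologue: fill every stage before any computation starts.
    events = [("load", step, step % depth) for step in range(depth)]

    # Main loop: consume one stage, then refill it with a future K slice.
    for step in range(k_steps):
        stage = step % depth
        events.append(("compute", step, stage))
        next_step = step + depth
        if next_step < k_steps:
            events.append(("load", next_step, stage))
    return events


for event in pipeline_schedule(k=128, tile_k=32, depth=2):
    print(event)
```

In this model the load for slice `step + depth` is issued while later compute steps are still pending, which is what lets memory transfers overlap with math.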

In [None]:
def get_dgemm_kernel_2_3(BLAS):

    assert BLAS.a_value_type == BLAS.b_value_type, "Invalid BLAS configuration"

    tile_m, tile_n = BLAS.c_dim
    
    @cuda.jit(extensions=pipeline_extensions, launch_bounds=(BLAS.block_size, 1))
    def dgemm_kernel(alpha, beta, tensor_c, device_pipeline: DevicePipeline):
        m, n = tensor_c.shape

        ldc = max(tensor_c.strides) // tensor_c.itemsize

        block_m = cuda.blockIdx.x
        block_n = cuda.blockIdx.y

        smem = cuda.shared.array(shape=(0,), dtype=BLAS.a_value_type, alignment=device_pipeline.buffer_alignment)

        block_start_m = block_m * tile_m
        block_end_m = (block_m + 1) * tile_m

        block_start_n = block_n * tile_n
        block_end_n = (block_n + 1) * tile_n

        if block_start_m >= m or block_start_n >= n:
            return
        
        c_view = tensor_c[
            block_start_m : block_end_m,
            block_start_n : block_end_n,
        ]

        gmem_c = make_tensor(c_view, BLAS.get_layout_gmem_c(ldc))
        
        tile_pipeline = device_pipeline.get_tile(smem, block_m, block_n)
        
        accumulator = BLAS.suggest_accumulator()
        tile_pipeline.execute(accumulator)

        #if accumulator.is_thread_active():
            # EXERCISE --> implement GEMM epilogue (C = alpha * D + beta * C)
            # Possible approaches:
            # - manually (axpby for loop)
            # - axpby(alpha, fragment, beta, fragment)

            # The following calls may or may not be necessary depending on chosen implementation
            # register_result_tensor = accumulator.get_results()
            # c_register_fragment = accumulator.make_partition_and_copy(gmem_c)
            # accumulator.partition_and_copy(TODO, gmem_c)
            
        tile_pipeline._del()

    return dgemm_kernel
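
The kernel derives the leading dimension of C from the array strides via `max(tensor_c.strides) // tensor_c.itemsize`. The NumPy sketch below shows what that expression evaluates to for contiguous column-major and row-major 2D arrays, and that a tile view keeps the strides of its parent. It runs on the CPU and assumes contiguous 2D inputs; `leading_dimension` is a hypothetical helper name.

```python
import numpy as np


def leading_dimension(array):
    """Largest stride expressed in elements rather than bytes."""
    return max(array.strides) // array.itemsize


c_col = np.zeros((2048, 1024), dtype=np.float64, order="F")  # column-major
c_row = np.zeros((2048, 1024), dtype=np.float64, order="C")  # row-major

print(leading_dimension(c_col), leading_dimension(c_row))  # 2048 1024

# Slicing out a 64 x 64 tile does not change the strides, so the tile
# still has to be addressed with the leading dimension of the full matrix.
tile = c_col[0:64, 0:64]
print(leading_dimension(tile))  # 2048
```

This is why the layout for the tile is built from `ldc` of the full tensor rather than from the tile shape.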

In [None]:
benchmark_dgemm_2_3(problems, get_dgemm_kernel_2_3, choose_kernel_params_2_3, get_shared_memory_size_2_3, get_kernel_args_2_3)

#### Solution

In [None]:
def get_dgemm_kernel_2_3_solution(BLAS):

    assert BLAS.a_value_type == BLAS.b_value_type, "Invalid BLAS configuration"

    tile_m, tile_n = BLAS.c_dim
    
    @cuda.jit(extensions=pipeline_extensions, launch_bounds=(BLAS.block_size, 1))
    def dgemm_kernel(alpha, beta, tensor_c, device_pipeline: DevicePipeline):
        m, n = tensor_c.shape

        ldc = max(tensor_c.strides) // tensor_c.itemsize

        block_m = cuda.blockIdx.x
        block_n = cuda.blockIdx.y

        smem = cuda.shared.array(shape=(0,), dtype=BLAS.a_value_type, alignment=device_pipeline.buffer_alignment)

        block_start_m = block_m * tile_m
        block_end_m = (block_m + 1) * tile_m

        block_start_n = block_n * tile_n
        block_end_n = (block_n + 1) * tile_n

        if block_start_m >= m or block_start_n >= n:
            return
        
        c_view = tensor_c[
            block_start_m : block_end_m,
            block_start_n : block_end_n,
        ]

        gmem_c = make_tensor(c_view, BLAS.get_layout_gmem_c(ldc))
        
        tile_pipeline = device_pipeline.get_tile(smem, block_m, block_n)
        
        accumulator = BLAS.suggest_accumulator()
        tile_pipeline.execute(accumulator)

        if accumulator.is_thread_active():
            c_frag = accumulator.make_partition_and_copy(gmem_c)
            axpby(alpha, accumulator.get_results(), beta, c_frag)
            accumulator.partition_and_copy(c_frag, gmem_c)

        tile_pipeline._del()

    return dgemm_kernel

In [None]:
benchmark_dgemm_2_3(problems, get_dgemm_kernel_2_3_solution, choose_kernel_params_2_3, get_shared_memory_size_2_3, get_kernel_args_2_3)

### Conclusion

In this notebook, we have learned:

1. What a GEMM is and how to implement a naive GEMM kernel
2. How to analyze our implementations and make intelligent choices about what to optimize next
3. What shared memory is and why it is critical for performance

and then progressed to offloading all of these aspects onto the cuBLASDx pipeline API, understanding:

1. What pipelining is and when it is necessary
2. How to define a cuBLASDx per-tile description and build a pipeline from it
3. How to do in-kernel epilogue fusion with pipelines