# Matrix Multiplication Fundamentals

A GEMM (General Matrix Multiply) operation takes the form $\mathbf{C} = \alpha \mathbf{A}\mathbf{B} + \beta\mathbf{C}$, where $\alpha, \beta$ are scalars, $\mathbf{A}$ is an $m \times k$ matrix, $\mathbf{B}$ is a $k \times n$ matrix, and $\mathbf{C}$ is an $m \times n$ matrix.

The element at row $i$ and column $j$ of matrix $\mathbf{C}$ is calculated as the scaled and biased dot product of row $i$ of $\mathbf{A}$ and column $j$ of $\mathbf{B}$ as follows:

$$
\mathbf{C}_{i, j} = \alpha \left(\sum_{l=0}^{k-1} \mathbf{A}_{i, l} \mathbf{B}_{l, j} \right) + \beta \mathbf{C}_{i, j}
$$

In practice, the operation above is usually split into two parts:
1. Matrix Multiplication itself, computing $ \mathbf{D}_{i, j} = \sum_{l=0}^{k-1} \mathbf{A}_{i, l} \mathbf{B}_{l, j} $
2. Epilogue, computing $ \mathbf{C}_{i, j} = \alpha \cdot \mathbf{D}_{i, j} + \beta \cdot \mathbf{C}_{i, j} $
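
As a point of reference, the two-step split can be sketched on the host with NumPy. This is only an illustration: `gemm_reference` and the small sizes are hypothetical names and values chosen here, not part of the exercise code.

```python
import numpy as np

def gemm_reference(alpha, a, b, beta, c):
    """Two-step GEMM: matrix multiplication followed by the epilogue."""
    d = a @ b                      # step 1: D = A B
    return alpha * d + beta * c    # step 2: C = alpha * D + beta * C

rng = np.random.default_rng(0)
m, n, k = 4, 5, 3
a = rng.standard_normal((m, k))
b = rng.standard_normal((k, n))
c = rng.standard_normal((m, n))
result = gemm_reference(0.9, a, b, 1.1, c)

# Check one element against the explicit dot product of row i and column j
i, j = 2, 3
explicit = 0.9 * sum(a[i, l] * b[l, j] for l in range(k)) + 1.1 * c[i, j]
assert np.isclose(result[i, j], explicit)
```

Keeping the epilogue separate means the multiplication loop only ever accumulates products, and the scaling by $\alpha$ and $\beta$ happens once per output element.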

### Exercise Setup

#### C++ CMake setup

In [None]:
import sys, os
sys.path.append(os.sep.join(["..", "utilities", "python"]))
from common_cuda import setup_cmake_project
setup_cmake_project()

#### Python Imports

In [None]:
import sys
import os

import numpy as np
import cupy as cp
import nvmath

from nvmath.device import Matmul
from nvmath.device.cublasdx import DevicePipeline, SharedStorageCalc, MAX_ALIGNMENT
from nvmath.device.cublasdx_numba import pipeline_extensions
from nvmath.device.common import axpby, clear, copy, copy_fragment, copy_wait, make_tensor
from numba import cuda

sys.path.append(os.sep.join(["..", "utilities", "python"]))

from benchmark import *

## Challenge Exercise 2.4: cuBLASDx

In this exercise, we will convert the previous custom tiled GEMM logic to cuBLASDx for both BLAS computations and data movement.

![cublasdx](images/cublasdx.png)

### cuBLASDx shared memory tile kernel

The following kernel reimplements the one from the previous notebook using `cuBLASDx` for data movement and computation. How does this help us improve the kernel?

While we were able to increase data reuse at the shared memory tile level, our code still lacked several key features of a proper GEMM kernel:
1. Use of MMA instructions, allowing for higher FLOP/s as well as better data reuse (on warp, warpgroup, block and cluster level)
2. Use of async data copies to overlap computation with background data loading, on global, shared and register level
3. Data load vectorization, operating on multiple elements at the same time
4. Shared memory use without size increase or bank conflicts

All of these are provided by `cuBLASDx`, together with multiple utilities that keep the code performant, portable and small.

The cuBLASDx documentation offers a [short guide](https://docs.nvidia.com/cuda/cublasdx/using_cublasdx.html) on using the library's functionality.

![device_gemm](images/device_gemm.svg)

#### cuBLASDx C++ Guides

It is best to consult these guides as they become necessary in the exercise, rather than trying to remember all the details at once.

##### cuBLASDx guide: Data Layouts

The way data is laid out in shared memory is very important for GEMM performance: it enables or limits vectorization and determines whether shared memory bank conflicts occur. This arrangement of elements is called a `layout` in `cuBLASDx` and is a first-class element computed for you by the library. It can be accessed with:

```
auto default_layout_a = BLAS::get_layout_smem_a();
```

for a regular `row-` or `column-major` layout (as described with the `cublasdx::Arrangement<...>()` operator). More information regarding default memory layouts can be found [here](https://docs.nvidia.com/cuda/cublasdx/api/other_methods.html#get-memory-layout).

Alternatively, use:

```
auto optimal_layout_a = BLAS::suggest_layout_smem_a();
```

which computes a layout swizzled for maximum vectorization, removal of shared memory bank conflicts, and enablement of instructions such as `ldmatrix` (`LDSM`). You can find out more about suggested layouts [here](https://docs.nvidia.com/cuda/cublasdx/api/other_methods.html#suggested-shared-memory-layout).

Such layouts can be used to later create tensors from them by combining with a data pointer:

```
auto tensor_a = cublasdx::make_tensor(data_pointer, BLAS::suggest_layout_smem_a());

// elements are accessed with the parentheses operator
auto elem_0_0 = tensor_a(row_index, col_index);
```
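
To make the idea of "pointer plus layout" concrete, here is a small host-side sketch in Python. It assumes a layout is simply a function from `(row, col)` to a linear offset; the `Tensor` class and the `row_major` / `col_major` helpers are hypothetical stand-ins written for this illustration, and they do not model the swizzled layouts that cuBLASDx may suggest.

```python
import numpy as np

def row_major(rows, cols):
    """Layout: (row, col) -> linear offset, with each row stored contiguously."""
    return lambda r, c: r * cols + c

def col_major(rows, cols):
    """Layout: (row, col) -> linear offset, with each column stored contiguously."""
    return lambda r, c: c * rows + r

class Tensor:
    """Hypothetical stand-in for a (data pointer, layout) pair."""
    def __init__(self, data, layout):
        self.data = data
        self.layout = layout

    def __call__(self, r, c):
        # Element access goes through the layout, like tensor_a(row, col)
        return self.data[self.layout(r, c)]

buffer = np.arange(6.0)                      # flat storage: 0, 1, 2, 3, 4, 5
as_rows = Tensor(buffer, row_major(2, 3))    # same memory, viewed as 2 x 3 row-major
as_cols = Tensor(buffer, col_major(2, 3))    # same memory, viewed as 2 x 3 column-major

assert as_rows(1, 0) == 3.0   # offset 1 * 3 + 0
assert as_cols(1, 0) == 1.0   # offset 0 * 2 + 1
```

The same buffer yields different elements for the same coordinates, which is why the layout has to travel together with the pointer.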

##### cuBLASDx guide: Slicing 

Manually moving pointers is tedious and error-prone, so cuBLASDx exposes a pointer slicing API that does it for you.

Slicing can be performed to get pointers:

```
auto [ptr_a, ptr_b, ptr_c] = cublasdx::slice_into_pointer<type_a, type_b, type_c>(start_pointer, 
                                                                                  alignment_a, layout_a,
                                                                                  alignment_b, layout_b,
                                                                                  alignment_c, num_elems_c);
```

or to get pointers and tensors:

```
auto [tensor_a, tensor_b, ptr_c] = cublasdx::shared_memory::slice<type_a, type_b, type_c>(start_pointer,
                                                                                    alignment_a, layout_a,
                                                                                    alignment_b, layout_b,
                                                                                    alignment_c, num_elems_c);
```

A detailed reference of the slicing functions and their use can be found in the [documentation](https://docs.nvidia.com/cuda/cublasdx/api/other_shared.html#shared-memory-slicing).
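
The bookkeeping that slicing automates is essentially aligned offset computation over one raw buffer. The sketch below shows that arithmetic in plain Python. `align_up` and `slice_offsets` are hypothetical helpers, and the example assumes dense tiles with no padding, which a real suggested layout is not guaranteed to have.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (both in bytes)."""
    return (offset + alignment - 1) // alignment * alignment

def slice_offsets(requests):
    """requests: list of (alignment_bytes, element_size_bytes, num_elements).

    Returns (byte offset of each sub-buffer, total bytes required).
    """
    offsets = []
    cursor = 0
    for alignment, elem_size, count in requests:
        cursor = align_up(cursor, alignment)   # pad so the sub-buffer is aligned
        offsets.append(cursor)
        cursor += elem_size * count            # reserve the sub-buffer itself
    return offsets, cursor

# A 64 x 32 tile of A and a 32 x 64 tile of B in float64, both 16-byte aligned
tile_offsets, tile_total = slice_offsets([(16, 8, 64 * 32), (16, 8, 32 * 64)])
assert tile_offsets == [0, 16384] and tile_total == 32768

# Three float32 values followed by ten 16-byte-aligned float64 values:
# the second buffer starts at 16, not 12, because of the alignment padding
mixed_offsets, mixed_total = slice_offsets([(4, 4, 3), (16, 8, 10)])
assert mixed_offsets == [0, 16] and mixed_total == 96
```

The second case is the one that is easy to get wrong by hand, since the padding depends on everything placed before it.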

##### cuBLASDx guide: Shared Memory Copying 

Moving data between global and shared memory is an important and complicated topic. The layout of tile data in shared memory must take into account:
1. Pattern for best global memory read
2. Pattern for best shared memory store
3. Pattern for best compute load
4. Pattern for best compute store

All of these concerns can be reduced to two elements:
- A data memory layout (as described in `Data Layouts`)
- A heuristic for combining the `source layout` and `destination layout` into an algorithm that maximizes achieved bandwidth.

The latter part is provided by `cublasdx::copy`:

```
// Copy from global to shared using BLAS BlockDim config
using alignment = cublasdx::alignment_of<BLAS>;
cublasdx::copy<BLAS, alignment::a>(gmem_tensor_a, smem_tensor_a);
cublasdx::copy<BLAS, alignment::b>(gmem_tensor_b, smem_tensor_b);
cublasdx::copy_wait();

// Copy from shared to global using 128 threads
cublasdx::copy<128, alignment::a>(smem_tensor_a, gmem_tensor_a);
cublasdx::copy_wait();
```

Copies are asynchronous by default, so they can be overlapped with other operations. A synchronization point is forced with `cublasdx::copy_wait`, which waits for all previous copies issued by the entire thread block.

A detailed reference of all copying overloads and functions can be found in the [documentation](https://docs.nvidia.com/cuda/cublasdx/api/other_tensors.html#cooperative-global-shared-copying).
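
To illustrate what "cooperative" means here, the following host-side sketch emulates how a block of threads might divide a tile copy among themselves. The thread-strided assignment is an assumption made for illustration: the library chooses its own access pattern, vector width and asynchronous mechanism, none of which are modeled. `cooperative_copy` is a hypothetical name.

```python
import numpy as np

def cooperative_copy(src, dst, num_threads):
    """Emulate a block-wide copy; returns how many elements each thread moved."""
    assert src.shape == dst.shape
    moved = [0] * num_threads
    for thread_id in range(num_threads):      # on a GPU these run concurrently
        # Thread t handles flat elements t, t + num_threads, t + 2 * num_threads, ...
        for idx in range(thread_id, src.size, num_threads):
            dst.flat[idx] = src.flat[idx]
            moved[thread_id] += 1
    return moved

gmem_tile = np.arange(64 * 32, dtype=np.float64).reshape(64, 32)
smem_tile = np.zeros_like(gmem_tile)
moved = cooperative_copy(gmem_tile, smem_tile, num_threads=256)

assert np.array_equal(smem_tile, gmem_tile)
assert moved == [8] * 256     # 64 * 32 elements / 256 threads
```

No single thread sees the whole tile after its own part of the copy, which is why a block-wide wait is needed before any thread reads the destination.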

##### cuBLASDx guide: Accumulators and results

cuBLASDx `execute(...)` exposes several APIs for the user to choose from:

```
// Arguments shown in square brackets are optional.

// #1 Shared memory API with optional pre- and postprocessing lambdas
BLAS().execute(alpha, tensor_a, tensor_b, beta, tensor_c, 
               [a_load_functor, b_load_functor, c_load_functor, c_store_functor]);

// #2 Register API without accumulation with optional preprocessing lambdas
auto accumulator = BLAS().execute(tensor_a, tensor_b,
                                  [a_load_functor, b_load_functor]);

// #3 Register API with accumulation with optional preprocessing lambdas
BLAS().execute(tensor_a, tensor_b, accumulator,
               [a_load_functor, b_load_functor]);
```

An accumulator is a collection of per-thread C elements with associated execution properties. It exposes APIs such as:

```
// Retrieve register tensor with results
auto results = accumulator.get_results();

// Does this thread own some elements of C
bool is_active = accumulator.is_thread_active();

// Are there extra zero-elements owned by some threads
bool is_predicated = accumulator.is_predicated();

// shared_tensor_c = alpha * accumulator + beta * shared_tensor_c
accumulator.axpby(alpha, beta, shared_tensor_c);
```

[cuBLASDx accumulator documentation](https://docs.nvidia.com/cuda/cublasdx/api/other_tensors.html#accumulator-and-register-fragment-tensors) provides a detailed description of general accumulator functionality as well as [copying functions](https://docs.nvidia.com/cuda/cublasdx/api/other_tensors.html#copying-registers-tensors).

##### Data partitioning and mapping

In GEMMs we often decompose bigger problems into smaller subproblems, and offsetting pointers manually is error-prone. Rich tensor types allow doing this automatically on multiple levels.

`cublasdx::slice` is a placeholder index value that keeps an entire dimension in the resulting view.

1. Dividing global memory tensor view into tiles and choosing entire row of tiles:
```
auto global_tile_row_a = cublasdx::get_tile_row(tensor, BLAS::a_shape, tile_row_index);

// How to access:
auto single_tile = global_tile_row_a(cublasdx::slice, cublasdx::slice, tile_col_index);
```

2. Dividing global memory tensor view into tiles and choosing entire column of tiles:
```
auto global_tile_col_b = cublasdx::get_tile_col(tensor, BLAS::b_shape, tile_col_index);

// How to access:
auto single_tile = global_tile_col_b(cublasdx::slice, cublasdx::slice, tile_row_index);
```

3. Dividing global memory tensor view into tiles and choosing a single one:
```
auto global_tile_c = cublasdx::get_tile(tensor, BLAS::c_shape, tile_row_index, tile_col_index);
```
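
The three tiling operations above have a close analogue in NumPy slicing, which may help when reasoning about index order. The functions below reuse the names `get_tile_row`, `get_tile_col` and `get_tile` purely for readability. They are NumPy stand-ins written for this sketch, not the cuBLASDx functions, and they assume the matrix dimensions are exact multiples of the tile shape.

```python
import numpy as np

def get_tile_row(tensor, tile_shape, tile_row_index):
    """All tiles in one row of tiles, indexed as [row_in_tile, col_in_tile, tile]."""
    tile_m, tile_k = tile_shape
    rows = tensor[tile_row_index * tile_m:(tile_row_index + 1) * tile_m, :]
    return rows.reshape(tile_m, -1, tile_k).transpose(0, 2, 1)

def get_tile_col(tensor, tile_shape, tile_col_index):
    """All tiles in one column of tiles, indexed as [row_in_tile, col_in_tile, tile]."""
    tile_k, tile_n = tile_shape
    cols = tensor[:, tile_col_index * tile_n:(tile_col_index + 1) * tile_n]
    return cols.reshape(-1, tile_k, tile_n).transpose(1, 2, 0)

def get_tile(tensor, tile_shape, tile_row_index, tile_col_index):
    """A single tile selected by its row and column tile indices."""
    tile_m, tile_n = tile_shape
    return tensor[tile_row_index * tile_m:(tile_row_index + 1) * tile_m,
                  tile_col_index * tile_n:(tile_col_index + 1) * tile_n]

a = np.arange(8 * 6).reshape(8, 6)       # m = 8, k = 6
b = np.arange(6 * 4).reshape(6, 4)       # k = 6, n = 4

tile_row_a = get_tile_row(a, (4, 2), 1)  # second row of 4 x 2 tiles
tile_col_b = get_tile_col(b, (2, 2), 0)  # first column of 2 x 2 tiles

# Selecting tile 2 with full slices in the first two modes, like
# global_tile_row_a(cublasdx::slice, cublasdx::slice, 2)
assert np.array_equal(tile_row_a[:, :, 2], a[4:8, 4:6])
assert np.array_equal(tile_col_b[:, :, 2], b[4:6, 0:2])
assert np.array_equal(get_tile(a, (4, 2), 1, 2), a[4:8, 4:6])
```

In this analogy the last index walks along the K dimension, which is exactly what the computation loop iterates over.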

Apart from choosing tiles, it's important to map each thread's register result values to their appropriate locations inside the tile. The `accumulator` APIs make this possible:

```
// If this thread takes part in GEMM
if(accumulator.is_thread_active()) {
   // For each element of register fragment
   for(int i = 0; i < cublasdx::size(d_register_fragment); ++i) {
      auto [tile_index_x, tile_index_y] = accumulator.map_fragment_index(i);
      if((not accumulator.is_predicated()) or accumulator.is_index_in_bounds(i)) {
         // Copy respective global element into it
         d_register_fragment(i) = load_op(c_global_tensor(tile_index_x, tile_index_y));
      }
   }
}
```

A per-thread view of a tile tensor can also be created:
```
auto global_c_thread_view = accumulator.partition_like_C(global_tile_c);
```
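
The idea of mapping fragment indices to tile coordinates can be emulated on the host. In the sketch below, the ownership rule inside `map_fragment_index` is entirely hypothetical: in the library the real mapping follows the chosen MMA instruction and configuration. Only the load, compute, store pattern is meant to carry over.

```python
import numpy as np

TILE_M, TILE_N, NUM_THREADS = 8, 8, 16
FRAGMENT_SIZE = TILE_M * TILE_N // NUM_THREADS   # 4 elements per thread

def map_fragment_index(thread_id, i):
    """Hypothetical ownership rule: (thread, fragment slot) -> (row, col) in the tile."""
    linear = thread_id + i * NUM_THREADS
    return linear % TILE_M, linear // TILE_M

tile_c = np.arange(TILE_M * TILE_N, dtype=np.float64).reshape(TILE_M, TILE_N)
tile_out = np.zeros_like(tile_c)
owners = np.full((TILE_M, TILE_N), -1)

for thread_id in range(NUM_THREADS):             # on a GPU these run concurrently
    fragment = np.empty(FRAGMENT_SIZE)           # this thread's "registers"
    for i in range(FRAGMENT_SIZE):               # load: tile -> fragment
        row, col = map_fragment_index(thread_id, i)
        fragment[i] = tile_c[row, col]
        owners[row, col] = thread_id
    fragment *= 2.0                              # some elementwise per-thread work
    for i in range(FRAGMENT_SIZE):               # store: fragment -> tile
        row, col = map_fragment_index(thread_id, i)
        tile_out[row, col] = fragment[i]

assert (owners >= 0).all()                       # every element has exactly one owner
assert np.array_equal(tile_out, 2.0 * tile_c)
```

Because each thread only touches the elements it owns, elementwise epilogues need no synchronization between the load and the store.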

Multiple functionalities have been combined with partitioning to allow for terse and simple code:

```
// Create empty fragment, partition tensor and load appropriate elements safely
auto loaded_c_register_fragment = accumulator.make_partition_and_copy(tile_global_c);
// Partition tensor and store appropriate result elements safely
accumulator.partition_and_store(tile_global_c);
// Partition tensor and perform axpby on appropriate elements with results
accumulator.axpby(alpha, beta, tile_global_c);
// Store values from some_fragment to partitioned tensor
accumulator.partition_and_copy(some_fragment, tile_global_c);
// Load values from partitioned tensor to some_fragment
accumulator.partition_and_copy(tile_global_c, some_fragment);
```

More examples concerning slicing and partitioning can be found in [pipeline documentation](https://docs.nvidia.com/cuda/cublasdx/using_pipelines.html#executing-pipelined-gemm) as well as [accumulator documentation](https://docs.nvidia.com/cuda/cublasdx/api/other_tensors.html#accumulator-and-register-fragment-tensors).

### C++

In [None]:
%%writefile cpp/1d/parameters.hpp.inc

    // (gemm_m, gemm_n, gemm_k, alpha, beta)
    std::vector<tutorial::gemm_problem_t> problems = {
        {2048, 2048, 2048, 0.9, 1.1}
    };

In [None]:
%%writefile cpp/1d/cublasdx_config.hpp.inc
    // 2. Define cuBLASDx description
    constexpr int tile_m = 64;
    constexpr int tile_n = 64;
    constexpr int tile_k = 32;

    constexpr int block_dim = 256;

    using BLAS = decltype(cublasdx::Size<tile_m, tile_n, tile_k>() + // size of shared memory tile
                          cublasdx::Precision<double, double, double>() + // precision of data (e.g. __nv_fp8_e5m2, __half, float)
                          cublasdx::Type<cublasdx::type::real>() +  // choice between `real` and `complex` number type
                          cublasdx::Function<cublasdx::function::MM>() + //BLAS operation, `MM` stands for Matrix Multiplication
                          cublasdx::Arrangement<arr_a, arr_b, arr_c>() + //Expected global memory data ordering (row or column major)
                          cublasdx::Block() + // Execution of operation
                          cublasdx::BlockDim<block_dim>() + // block to be used, can be 1D, 2D or 3D
                          cublasdx::MaxAlignment() + // will force max alignment on tensor pointers in shared memory
                          cublasdx::SM<SM_VALUE, SM_MODIFIER_VALUE>() + // Which architecture is this code targeting
                          cublasdx::StaticBlockDim());

In [None]:
%%writefile cpp/1d/kernel.hpp.inc

template<class BLAS, class TensorA, class TensorB, class TensorC>
__launch_bounds__(BLAS::max_threads_per_block, 1) __global__
    void kernel_1b_simple_dgemm_shared_cublasdx(double        alpha,
                                                TensorA const tensor_a,
                                                TensorB const tensor_b,
                                                double        beta,
                                                TensorC const tensor_c) {
    extern __shared__ __align__(16) unsigned char smem[];

    using alignment = cublasdx::alignment_of<BLAS>;

    // EXERCISE --> use slicing guide to prepare shared memory tensors

    auto const global_k = tutorial::size<1>(tensor_a);

    // Define accumulator storage
    // EXERCISE --> use accumulator guide to prepare the accumulator
    

    // EXERCISE --> Use partitioning guide to retrieve tile row from A, tile col from B and tile from C

    // Computation loop --> dynamic, cannot unroll
    auto const max_tile_iters = // EXERCISE

    for (int tile_iter = 0; tile_iter < max_tile_iters; ++tile_iter) {

        // EXERCISE --> Load current tiles into shared memory tensors, use slicing and copying guides
        // EXERCISE --> use BLAS.execute()
        // EXERCISE --> figure out where to sync around BLAS
    }

    // EXERCISE --> implement epilogue using either:
    // 1. single index manual for-loop
    // 2. retrieving global indices and using them for global store
    // 3. separate partitioning and copying
}

In [None]:
!cmake --build build/ -t 1d_simple_dgemm_cublasdx

In [None]:
!./build/1d_simple_dgemm_cublasdx

#### Solution

We will now overwrite the kernel with the solution and recompile. If you want to restart the exercise, make sure you write the exercise kernel back and recompile it.
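
As a rough host-side picture of the loop structure the solution kernel follows, here is a NumPy emulation: one accumulator per output tile, a loop over K tiles, then the epilogue. `tiled_gemm` and its variable names are hypothetical and chosen to echo the kernel. The sketch assumes all dimensions are exact multiples of the tile sizes and says nothing about performance.

```python
import numpy as np

def tiled_gemm(alpha, a, b, beta, c, tile_m, tile_n, tile_k):
    m, k = a.shape
    _, n = b.shape
    assert m % tile_m == 0 and n % tile_n == 0 and k % tile_k == 0
    out = np.empty_like(c)
    for block_x in range(m // tile_m):               # plays the role of blockIdx.x
        for block_y in range(n // tile_n):           # plays the role of blockIdx.y
            rows = slice(block_x * tile_m, (block_x + 1) * tile_m)
            cols = slice(block_y * tile_n, (block_y + 1) * tile_n)
            accumulator = np.zeros((tile_m, tile_n), dtype=c.dtype)
            for tile_iter in range(k // tile_k):
                ks = slice(tile_iter * tile_k, (tile_iter + 1) * tile_k)
                smem_a = a[rows, ks].copy()          # "load tile of A into shared memory"
                smem_b = b[ks, cols].copy()          # "load tile of B into shared memory"
                accumulator += smem_a @ smem_b       # accumulate the partial product
            # Epilogue: scale the accumulated product and add the scaled input C
            out[rows, cols] = alpha * accumulator + beta * c[rows, cols]
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((128, 64))
b = rng.standard_normal((64, 128))
c = rng.standard_normal((128, 128))
tiled = tiled_gemm(0.9, a, b, 1.1, c, tile_m=64, tile_n=64, tile_k=32)

assert np.allclose(tiled, 0.9 * (a @ b) + 1.1 * c)
```

With the notebook's problem size of 2048 in every dimension and tiles of 64 x 64 x 32, the same structure gives a 32 x 32 grid of blocks and 64 iterations of the K loop per block.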

In [None]:
%%writefile cpp/1d/kernel.hpp.inc

template<class BLAS, class TensorA, class TensorB, class TensorC>
__launch_bounds__(BLAS::max_threads_per_block, 1) __global__
    void kernel_1c_dgemm_shared_cublasdx(double        alpha,
                                                TensorA const tensor_a,
                                                TensorB const tensor_b,
                                                double        beta,
                                                TensorC const tensor_c) {
    extern __shared__ __align__(16) unsigned char smem[];

    using alignment = cublasdx::alignment_of<BLAS>;

    auto [smem_tensor_a, smem_tensor_b] =
        cublasdx::shared_memory::slice<double, double>(smem,
                                                       cublasdx::alignment_of_v_a<BLAS>,
                                                       BLAS::suggest_layout_smem_a(),
                                                       cublasdx::alignment_of_v_b<BLAS>,
                                                       BLAS::suggest_layout_smem_b());

    // K extent of the problem, read from A (A is m x k and B is k x n, so both share the same K)
    auto const global_k = tutorial::size<1>(tensor_a);

    // Define accumulator storage
    auto accumulator = BLAS::suggest_accumulator();

    auto global_tile_row_a = cublasdx::get_tile_row(tensor_a, BLAS::a_shape, blockIdx.x);
    auto global_tile_col_b = cublasdx::get_tile_col(tensor_b, BLAS::b_shape, blockIdx.y);

    auto global_tile_c   = cublasdx::get_tile(tensor_c, BLAS::c_shape, blockIdx.x, blockIdx.y);
    auto global_tile_out = cublasdx::get_tile(tensor_c, BLAS::c_shape, blockIdx.x, blockIdx.y);

    // Computation loop --> dynamic, cannot unroll
    for (int tile_iter = 0; tile_iter < (global_k / cublasdx::size_of_v_k<BLAS>); ++tile_iter) {

        // Load current tile into shared memory
        auto current_global_tile_a = global_tile_row_a(cublasdx::slice, cublasdx::slice, tile_iter);
        auto current_global_tile_b = global_tile_col_b(cublasdx::slice, cublasdx::slice, tile_iter);

        cublasdx::copy<BLAS, alignment::a>(current_global_tile_a, smem_tensor_a);
        cublasdx::copy<BLAS, alignment::b>(current_global_tile_b, smem_tensor_b);
        cublasdx::copy_wait();

        BLAS().execute(smem_tensor_a, smem_tensor_b, accumulator);
        __syncthreads();
    }

    auto d_fragment = accumulator.make_partition_and_copy(global_tile_c);
    cublasdx::axpby(alpha, accumulator.get_results(), beta, d_fragment);
    accumulator.partition_and_copy(d_fragment, global_tile_out);
}

In [None]:
!cmake --build build/ -t 1d_simple_dgemm_cublasdx

In [None]:
!./build/1d_simple_dgemm_cublasdx

### Python
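
The Python kernels in this section derive the leading dimensions `lda`, `ldb` and `ldc` from array strides with `max(strides) // itemsize`. The small NumPy check below shows what that expression yields for row-major and column-major arrays, and for a sliced view. `leading_dimension` is a hypothetical helper name used only here.

```python
import numpy as np

def leading_dimension(array):
    """Number of elements between consecutive rows (row-major) or columns (column-major)."""
    return max(array.strides) // array.itemsize

a_row_major = np.zeros((6, 4), dtype=np.float64)             # C order, strides (32, 8)
b_col_major = np.zeros((3, 5), dtype=np.float64, order="F")  # Fortran order, strides (8, 24)

assert leading_dimension(a_row_major) == 4   # row length of a 6 x 4 row-major array
assert leading_dimension(b_col_major) == 3   # column length of a 3 x 5 column-major array

# A sliced view keeps the parent's strides, so its leading dimension is unchanged
view = a_row_major[2:4, :2]
assert view.shape == (2, 2)
assert leading_dimension(view) == 4
```

This is why the kernels can slice tiles out of the full matrices and still describe them with the leading dimension of the parent array.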

In [None]:
# The problems to benchmark and test for accuracy. Each tuple should be formed as:
# (GEMM_M, GEMM_N, GEMM_K, ALPHA, BETA)
problems = [
  (2048, 2048, 2048, 0.9, 1.1),
]

In [None]:
def choose_kernel_params_2_4(m, n, k, alpha, beta):
    tile_m = 64
    tile_n = 64
    tile_k = 32
    
    block_size = 256
    
    return Matmul(
        size=(tile_m, tile_n, tile_k),
        precision=(np.float64, np.float64, np.float64),
        data_type="real",
        arrangement=("row_major", "col_major", "col_major"), # Do not change
        execution="Block",
        block_size=block_size,
        alignment=MAX_ALIGNMENT,
        static_block_dim=True
    )

def get_shared_memory_size_2_4(BLAS):
    smem_calc = SharedStorageCalc()
    smem_calc.add(BLAS.alignment.a, np.dtype(BLAS.precision[0]).itemsize, BLAS.suggest_layout_smem_a())
    smem_calc.add(BLAS.alignment.b, np.dtype(BLAS.precision[1]).itemsize, BLAS.suggest_layout_smem_b())
    return smem_calc.get()

In [None]:
def get_dgemm_kernel_2_4(BLAS):

    assert BLAS.a_value_type == BLAS.b_value_type, "Invalid BLAS configuration"

    c_size = BLAS.suggest_layout_rmem_c().cosize

    tile_m, tile_n = BLAS.c_dim
    tile_k = BLAS.a_dim[1]
    alignment_a, alignment_b, alignment_c = BLAS.alignment
    
    @cuda.jit(launch_bounds=(BLAS.block_size, 1))
    def dgemm_kernel(alpha, tensor_a, tensor_b, beta, tensor_c):
        m, n = tensor_c.shape
        _, k = tensor_a.shape

        lda = max(tensor_a.strides) // tensor_a.itemsize
        ldb = max(tensor_b.strides) // tensor_b.itemsize
        ldc = max(tensor_c.strides) // tensor_c.itemsize

        block_m = cuda.blockIdx.x
        block_n = cuda.blockIdx.y

        smem = cuda.shared.array(shape=(0,), dtype=BLAS.a_value_type, alignment=16)
        smem_a_buff, smem = smem[0:BLAS.a_size], smem[BLAS.a_size:]
        smem_b_buff, smem = smem[0:BLAS.b_size], smem[BLAS.b_size:]

        block_start_m = block_m * tile_m
        block_end_m = (block_m + 1) * tile_m

        block_start_n = block_n * tile_n
        block_end_n = (block_n + 1) * tile_n

        if block_start_m >= m or block_start_n >= n:
            return

        a_view = tensor_a[block_start_m : block_end_m, :]
        b_view = tensor_b[:, block_start_n : block_end_n]
        c_view = tensor_c[
            block_start_m : block_end_m,
            block_start_n : block_end_n,
        ]

        smem_a = make_tensor(smem_a_buff, BLAS.suggest_layout_smem_a())
        smem_b = make_tensor(smem_b_buff, BLAS.suggest_layout_smem_b())
        gmem_c = make_tensor(c_view, BLAS.get_layout_gmem_c(ldc))

        accumulator = BLAS.suggest_accumulator()

        stages = k // tile_k

        for stage in range(0, stages):
            stage_start_k = stage * tile_k
            stage_end_k = (stage + 1) * tile_k
            
            stage_a = a_view[:, stage_start_k : stage_end_k]
            stage_b = b_view[stage_start_k : stage_end_k, :]

            gmem_a = make_tensor(stage_a, BLAS.get_layout_gmem_a(lda))
            gmem_b = make_tensor(stage_b, BLAS.get_layout_gmem_b(ldb))

            # EXERCISE --> Load current tiles into shared memory tensors, use slicing and copying guides
            #              Use copy and copy_wait instead of cublasdx::copy and cublasdx::copy_wait
            #              Alignment is directly passed to copy
            # EXERCISE --> use BLAS.execute()
            # EXERCISE --> figure out where to sync around BLAS

        # EXERCISE --> implement epilogue using either:
        # 1. single index manual for-loop
        # 2. retrieving global indices and using them for global store
        # 3. separate partitioning and copying

    return dgemm_kernel

In [None]:
benchmark_dgemm_2_4(problems, get_dgemm_kernel_2_4, choose_kernel_params_2_4, get_shared_memory_size_2_4)

#### Solution

In [None]:
def get_dgemm_kernel_2_4_solution(BLAS):

    assert BLAS.a_value_type == BLAS.b_value_type, "Invalid BLAS configuration"

    c_size = BLAS.suggest_layout_rmem_c().cosize

    tile_m, tile_n = BLAS.c_dim
    tile_k = BLAS.a_dim[1]
    alignment_a, alignment_b, alignment_c = BLAS.alignment
    
    @cuda.jit(launch_bounds=(BLAS.block_size, 1))
    def dgemm_kernel(alpha, tensor_a, tensor_b, beta, tensor_c):
        m, n = tensor_c.shape
        _, k = tensor_a.shape

        lda = max(tensor_a.strides) // tensor_a.itemsize
        ldb = max(tensor_b.strides) // tensor_b.itemsize
        ldc = max(tensor_c.strides) // tensor_c.itemsize

        block_m = cuda.blockIdx.x
        block_n = cuda.blockIdx.y

        smem = cuda.shared.array(shape=(0,), dtype=BLAS.a_value_type, alignment=16)
        smem_a_buff, smem = smem[0:BLAS.a_size], smem[BLAS.a_size:]
        smem_b_buff, smem = smem[0:BLAS.b_size], smem[BLAS.b_size:]

        block_start_m = block_m * tile_m
        block_end_m = (block_m + 1) * tile_m

        block_start_n = block_n * tile_n
        block_end_n = (block_n + 1) * tile_n

        if block_start_m >= m or block_start_n >= n:
            return

        a_view = tensor_a[block_start_m : block_end_m, :]
        b_view = tensor_b[:, block_start_n : block_end_n]
        c_view = tensor_c[
            block_start_m : block_end_m,
            block_start_n : block_end_n,
        ]

        smem_a = make_tensor(smem_a_buff, BLAS.suggest_layout_smem_a())
        smem_b = make_tensor(smem_b_buff, BLAS.suggest_layout_smem_b())
        gmem_c = make_tensor(c_view, BLAS.get_layout_gmem_c(ldc))

        accumulator = BLAS.suggest_accumulator()

        stages = k // tile_k

        for stage in range(0, stages):
            stage_start_k = stage * tile_k
            stage_end_k = (stage + 1) * tile_k
            
            stage_a = a_view[:, stage_start_k : stage_end_k]
            stage_b = b_view[stage_start_k : stage_end_k, :]

            gmem_a = make_tensor(stage_a, BLAS.get_layout_gmem_a(lda))
            gmem_b = make_tensor(stage_b, BLAS.get_layout_gmem_b(ldb))

            copy(gmem_a, smem_a, alignment=alignment_a)
            copy(gmem_b, smem_b, alignment=alignment_b)
            copy_wait()

            BLAS.execute(smem_a, smem_b, accumulator)

            cuda.syncthreads()

        d_fragment = accumulator.make_partition_and_copy(gmem_c)
        axpby(alpha, accumulator.get_results(), beta, d_fragment)
        accumulator.partition_and_copy(d_fragment, gmem_c)

    return dgemm_kernel

In [None]:
benchmark_dgemm_2_4(problems, get_dgemm_kernel_2_4_solution, choose_kernel_params_2_4, get_shared_memory_size_2_4)

## Conclusion

This notebook presented approaches to using cuBLASDx in device code, the tile approach and the pipelined approach, demonstrating how to use the library efficiently.