# Getting Started With Emulation

## Content

 - Introduction to the Ozaki-I Scheme
 - Exercise 1: IGEMM-based Ozaki-I Scheme 
 - Exercise 2: Optimizing the Ozaki-I Scheme with kernel fusion
 - Exercise 3: Fully fused Ozaki-I Scheme

## The Ozaki-I Scheme

The Ozaki Scheme was introduced as a general framework for emulating high-precision data types as a combination of lower-precision data types.  This was later applied to integer matrix multiplication in the paper "[DGEMM on Integer Matrix Multiplication Unit](https://arxiv.org/abs/2306.11975)" by Hiroyuki Ootomo, Katsuhisa Ozaki, and Rio Yokota.  When people refer to the Ozaki Scheme, they usually mean the Ozaki-I scheme with integers.  The remainder of this tutorial follows the same convention.

This algorithm consists of three parts:

1. Slicing - A method to transform inputs into lower-precision datatypes, referred to as slices
2. Slice Multiplication - Multiplying the corresponding A slices by the corresponding B slices
3. Error-free transformation - This takes the output of multiplication and transforms it into the higher precision datatype.

For this tutorial, we will give high-level background on (1) and focus on steps (2) and (3).  If you are interested in learning more about (1), the code is present and documented within this tutorial.  We would also encourage you to review the [cuBLASDx emulation sample](https://github.com/NVIDIA/CUDALibrarySamples/tree/main/MathDx/cuBLASDx/16_dgemm_emulation).

A diagram of the high level workflow can be found below:

<img src="Images/Ozaki-I-Flowchart.png" width="600" height="auto"/>

To perform slicing, we first need to know the maximum values for the rows of A and columns of B.  These are needed to logically align the exponents for elements in the same row (for matrix A) or column (for matrix B).  Once the exponents are aligned, we can read the mantissa bits into INT8 slices.  In practice, this is done with logical operations rather than FP64 arithmetic for a performance advantage.
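The slicing idea can be sketched on the host with NumPy. This is a minimal illustration, not the tutorial's slicing code: it uses floating-point arithmetic instead of bitwise operations, and it assumes 7 payload bits per slice with round-to-nearest digits so that every digit fits in INT8 without carry handling. The names `slice_rows`, `reconstruct`, and `SLICE_BITS` are hypothetical.

```python
import numpy as np

SLICE_BITS = 7  # payload bits per slice; an assumption of this sketch


def slice_rows(matrix, num_slices):
    """Split each row of `matrix` into int8 slices that share one exponent per row."""
    # frexp gives row_max = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    _, exponent = np.frexp(np.max(np.abs(matrix), axis=1))
    # Scaling by 2**shift aligns the row so that every |value| is below 64.
    shift = (SLICE_BITS - 1) - exponent
    remainder = np.ldexp(matrix, shift[:, None])

    slices = np.empty(matrix.shape + (num_slices,), dtype=np.int8)
    for i in range(num_slices):
        digit = np.rint(remainder)  # signed digit in [-64, 64]
        slices[..., i] = digit
        # Remove the extracted digit and expose the next SLICE_BITS bits.
        remainder = np.ldexp(remainder - digit, SLICE_BITS)
    return slices, shift


def reconstruct(slices, shift):
    """Rebuild FP64 values: 2**-shift * sum_i slices[..., i] * 2**(-SLICE_BITS * i)."""
    powers = np.arange(slices.shape[-1], dtype=np.int32)
    weights = np.ldexp(1.0, -SLICE_BITS * powers)
    return np.ldexp(slices.astype(np.float64) @ weights, -shift[:, None])


rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))
slices, shift = slice_rows(a, num_slices=4)
max_error = np.max(np.abs(reconstruct(slices, shift) - a))
print(slices.dtype, slices.shape, max_error)
```

Each additional slice shrinks the reconstruction error by a factor of `2**SLICE_BITS`, which is why more slices give higher precision at the cost of more products.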

The next phase is slice multiplication, where each slice of A can, in principle, be multiplied with each slice of B.  As an optimization with minimal accuracy impact, we only compute a subset of the slice products, as seen below.  Note that in production libraries like cuBLAS, the ADP framework (see [Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme](https://arxiv.org/abs/2511.13778)) can detect when the lower-order products are necessary and will use more integer slices to maintain FP64-level accuracy.

<img src="Images/Ozaki-I-Multiplications.png" width="1000" height="auto"/>
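To make the size of that subset concrete, here is a small sketch. It assumes slice 0 is the most significant slice, so the product of slice `i` of A and slice `j` of B lies on anti-diagonal `i + j`, and only pairs with `i + j < num_slices` are kept. The helper name `significant_pairs` is hypothetical.

```python
def significant_pairs(num_slices):
    """Return the (slice of A, slice of B) index pairs kept by the truncation."""
    return [
        (i, j)
        for i in range(num_slices)
        for j in range(num_slices)
        if i + j < num_slices
    ]


num_slices = 7
pairs = significant_pairs(num_slices)
# Prints: 28 of 49 products are computed
print(len(pairs), "of", num_slices ** 2, "products are computed")
```

The count equals `num_slices * (num_slices + 1) // 2`, which is the `num_products` value used by the code later in this notebook.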

The last phase of this algorithm, shown at the bottom of the image above, is the error free transformation back into FP64.  The steps involved are:

1. Accumulate products along anti-diagonals
2. Convert the resulting accumulators into FP64
3. Scale each accumulator according to the anti-diagonal it represents
4. Scale the result once more to undo the exponent normalization from slicing
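The four steps above can be sketched end to end on a tiny problem with NumPy. This is an illustrative host-side reference under stated assumptions, not the tutorial's kernels: it uses 7 payload bits per slice with rounded signed digits (the notebook's epilogue scales by the 8-bit slice width instead), and the names `slice_along_rows`, `emulated_gemm`, and `BITS` are hypothetical.

```python
import numpy as np

BITS = 7  # payload bits per slice in this sketch (an assumption)


def slice_along_rows(x, num_slices):
    """Split each row of x into int8 digits sharing one exponent per row."""
    _, exponent = np.frexp(np.max(np.abs(x), axis=1))
    shift = (BITS - 1) - exponent
    remainder = np.ldexp(x, shift[:, None])
    digits = np.empty((num_slices,) + x.shape, dtype=np.int8)
    for i in range(num_slices):
        digits[i] = np.rint(remainder)
        remainder = np.ldexp(remainder - digits[i], BITS)
    return digits.astype(np.int32), shift


def emulated_gemm(a, b, num_slices):
    a_digits, shift_a = slice_along_rows(a, num_slices)      # (slices, m, k)
    b_digits, shift_b = slice_along_rows(b.T, num_slices)    # (slices, n, k)
    result = np.zeros((a.shape[0], b.shape[1]))
    # Walk from the least significant anti-diagonal to the most significant one.
    for diag in range(num_slices - 1, -1, -1):
        # Step 1: accumulate the integer products on this anti-diagonal.
        diag_acc = np.zeros(result.shape, dtype=np.int32)
        for term in range(diag + 1):
            diag_acc += a_digits[term] @ b_digits[diag - term].T
        # Steps 2 and 3: convert to FP64 and apply the anti-diagonal weight.
        result += np.ldexp(diag_acc.astype(np.float64), -BITS * diag)
    # Step 4: undo the per-row and per-column exponent normalization.
    return np.ldexp(result, -(shift_a[:, None] + shift_b[None, :]))


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16))
b = rng.standard_normal((16, 8))
c_emulated = emulated_gemm(a, b, num_slices=7)
print("max abs error vs FP64:", np.max(np.abs(c_emulated - a @ b)))
```

Summing the anti-diagonals starting from the least significant one adds the smallest contributions first, which is a common way to limit rounding error in the final FP64 accumulation.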

## Exercise 3.1: IGEMM-based Ozaki-I Scheme

The goal of this exercise is to build a high-level understanding of the Ozaki-I Scheme that we can build on and optimize in later exercises.

In this exercise, we will orchestrate the right slice products and build an epilogue kernel which implements the error free transformation.

### C++ CMake Configuration

In [None]:
import sys, os
sys.path.append(os.sep.join(["..", "utilities", "python"]))
from common_cuda import setup_cmake_project
setup_cmake_project()

### Python Imports

In [None]:
import sys
import os
import math

import numpy as np
import cupy as cp
import nvmath

from nvmath.device import Matmul
from nvmath.device.cublasdx import DevicePipeline, SharedStorageCalc, MAX_ALIGNMENT
from nvmath.device.cublasdx_numba import pipeline_extensions
from nvmath.device.common import axpby, clear, copy, copy_fragment, copy_wait, make_tensor
from numba import cuda

sys.path.append(os.sep.join(["..", "utilities", "python"]))

from benchmark import *
from emulation_utils import get_width, epilogue_ldexp

### C++

In [None]:
%%writefile cpp/2a_unfused_emulation/parameters.hpp.inc

    // ===================================
    // Problem configuration
    // ===================================

    // (gemm_m, gemm_n, gemm_k, alpha, beta)
    std::vector<tutorial::gemm_problem_t> problems = {
        {2048, 2048, 2048, 0.9, 1.1}
    };

    // ===================================
    // Global GEMM configuration
    // ===================================

    // The number of slices used in emulation algorithm
    // More slices = higher precision but more computation
    constexpr unsigned slices = 7;

In [None]:
%%writefile cpp/2a_unfused_emulation/cublasdx_config.hpp.inc

    using slice_value_type       = int8_t;  // Precision for individual slices
    using accumulator_value_type = int32_t; // Precision for accumulation

    // The shape of data tile processed by a single CTA block
    constexpr int tile_m = 128;
    constexpr int tile_n = 128;
    constexpr int tile_k = 128;

    // The shape of CTA block (number of threads)
    constexpr int cta_shape_x = 128;
    constexpr int cta_shape_y = 1;
    constexpr int cta_shape_z = 1;

    using BLAS = decltype(cublasdx::Size<tile_m, tile_n, tile_k>() +
                          cublasdx::Precision<slice_value_type, slice_value_type, accumulator_value_type>() +
                          cublasdx::Type<cublasdx::type::real>() + cublasdx::Function<cublasdx::function::MM>() +
                          cublasdx::Arrangement<arrangement_a, arrangement_b, arrangement_c>() + cublasdx::Block() +
                          cublasdx::BlockDim<cta_shape_x, cta_shape_y, cta_shape_z>() + cublasdx::StaticBlockDim() +
                          cublasdx::WithPipeline() + cublasdx::MaxAlignment() + cublasdx::EnableInputStreaming() +
                          cublasdx::SM<SM_VALUE, SM_MODIFIER_VALUE>());

In [None]:
%%writefile cpp/2a_unfused_emulation/igemm_kernel.hpp.inc

template<class BLAS, class DevicePipeline, class OutTensor>
__launch_bounds__(DevicePipeline::max_threads_per_block, 1) __global__
    void igemm_kernel(__grid_constant__ DevicePipeline const device_pipeline,
                      OutTensor                              out_tensor) {
    extern __shared__ __align__(device_pipeline.buffer_alignment()) char smem[];
#ifdef __CUDA_ARCH__
    if constexpr (cublasdx::sm_of_v<BLAS> == __CUDA_ARCH__) {
        auto tile_pipeline = device_pipeline.get_tile(smem, blockIdx.x, blockIdx.y);
        auto tile_gmem_out = cublasdx::get_tile(out_tensor, BLAS::c_shape, blockIdx.x, blockIdx.y);

        auto accumulator = tile_pipeline.get_accumulator();

        tile_pipeline.execute(accumulator); 
        if (accumulator.is_thread_active()) {
            accumulator.partition_and_store(tile_gmem_out);
        }
    }
#endif
}

In [None]:
%%writefile cpp/2a_unfused_emulation/slice_coordination.hpp.inc

            int product_idx = 0;
            int num_products = (Slices * (Slices + 1)) / 2;

            /*
             * EXERCISE --> Coordinate the products between slices of matrix A and slices of matrix B
             *              NOTE that tensor_slice_a is shaped like (m, k, nslices) and
             *                        tensor_slice_b is shaped like (k, n, nslices)
             *              with strides being (k, 1, m * k) and (1, k, k * n) respectively
             *  
             * Compute only the most significant products shown in the diagram above and store them into
             * tensor_products, which has shape (m, n, num_products)
             */
            constexpr auto initial_diag = ;
            constexpr auto initial_term = ;
                 
            for (auto diag = initial_diag; /* EXERCISE --> loop over diagonals */) {
                for (auto term = initial_term; /* EXERCISE --> loop along the diagonal */) {
                    // EXERCISE --> Determine which slice of A and slice of B to multiply
                    int slice_a_index = ;
                    int slice_b_index = ;

                    // Prepare our view of the tensors, in this case we are getting the int8 submatrix for slice 'slice_a_index'.
                    auto slice_a_view = tensor_slice_a(cublasdx::slice, cublasdx::slice, slice_a_index);
                    auto slice_b_view = tensor_slice_b(cublasdx::slice, cublasdx::slice, slice_b_index);
                    auto product_view = tensor_products(cublasdx::slice, cublasdx::slice, product_idx++);

                    // Configure the device pipelines
                    constexpr int pipeline_depth = 3;
                    auto const device_pipeline =
                        cublasdx::suggest_device_pipeline<pipeline_depth, BLAS, cublasdx::external_accumulation>(slice_a_view, slice_b_view).value();
                    auto const shared_memory_size = device_pipeline.buffer_size();
                    dim3 const grid_dim(shape_a_rows / static_tile_m(), shape_b_cols / static_tile_n());

                    // Get the kernel and allow the kernel to use the shared memory required for the device pipeline
                    auto kernel = igemm_kernel<BLAS, decltype(device_pipeline), decltype(product_view)>;
                    CUDA_CHECK_AND_EXIT(
                        cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

                    kernel<<<grid_dim, device_pipeline.get_block_dim(), shared_memory_size, str>>>(device_pipeline,
                                                                                                   product_view);
                    CUDA_CHECK_AND_EXIT(cudaGetLastError());
                }
            }

In [None]:
%%writefile cpp/2a_unfused_emulation/epilogue_config.hpp.inc
        // Sets the block dimensions for the epilogue kernel
        constexpr int epilogue_kernel_tile_m = 16;
        constexpr int epilogue_kernel_tile_n = 16;

In [None]:
%%writefile cpp/2a_unfused_emulation/epilogue_kernel.hpp.inc

template<int BlockSize, int Slices, class ProductTensor, class ShiftTensorA, class ShiftTensorB, class OutTensor>
__launch_bounds__(BlockSize, 1) __global__ void epilogue_kernel(double        alpha,
                                                                double        beta,
                                                                ProductTensor product_tensor,
                                                                ShiftTensorA  shift_tensor_a,
                                                                ShiftTensorB  shift_tensor_b,
                                                                OutTensor     out_tensor) {
    using product_datatype = tutorial::tensor_value_type_t<ProductTensor>;
    using shift_datatype   = tutorial::tensor_value_type_t<ShiftTensorA>;
    using out_datatype     = tutorial::tensor_value_type_t<OutTensor>;

    const auto tid_m = threadIdx.x + blockIdx.x * blockDim.x;
    const auto tid_n = threadIdx.y + blockIdx.y * blockDim.y;

    int shift_a = shift_tensor_a(tid_m);
    int shift_b = shift_tensor_b(tid_n);

    auto product_view = product_tensor(tid_m, tid_n, cublasdx::slice);

    int    product_id = 0;
    double accumulator = 0.0;

    /*
     * EXERCISE --> Complete the implementation of the epilogue kernel.  This kernel:
     *   1. Accumulates along the anti-diagonals into diag_acc
     *   2. Calls nth_slice_to_fp64 to convert back to fp64 and scale the exponent
     *   3. Implements a typical GEMM epilogue (alpha * accumulator + beta * C)
     */

    constexpr auto initial_diag = ;
    constexpr auto initial_term = ;
                 
    for (auto diag = initial_diag; /* EXERCISE --> loop over diagonals */) {
        product_datatype diag_acc = 0;
        for (auto term = initial_term; /* EXERCISE --> loop along the diagonal */) {
            diag_acc += product_view(product_id++);
        }

        // HINT: Be careful here, the most significant diagonal is the 0th diagonal
        accumulator += nth_slice_to_fp64<int32_t, int8_t>(diag, diag_acc, shift_a + shift_b);
    }

    out_tensor(tid_m, tid_n) = alpha * accumulator + beta * out_tensor(tid_m, tid_n);
}

In [None]:
!cmake --build ./build -t 2a_unfused_emulation

In [None]:
!./build/2a_unfused_emulation

#### Solution

We will now overwrite the kernel with the solution and recompile it. If you want to restart the exercise, make sure you write the original kernel back and recompile it.

In [None]:
%%writefile cpp/2a_unfused_emulation/slice_coordination.hpp.inc

            int product = 0;

            /*
             * EXERCISE --> Coordinate the products between slices of matrix A and slices of matrix B
             *              NOTE that tensor_slice_a is shaped like (m, k, nslices) and
             *                        tensor_slice_b is shaped like (k, n, nslices)
             *              with strides being (k, 1, m * k) and (1, k, k * n) respectively
             *  
             * Compute only the most significant products shown in the diagram above and store them into
             * tensor_products, which has shape (m, n, num_products)
             */
            constexpr auto initial_diag = Slices - 1;
            constexpr auto initial_term = 0;

            for (auto diag = initial_diag; diag >= 0; --diag) {
                for (auto term = initial_term; term <= diag; ++term) {
                    auto slice_a_view = tensor_slice_a(cublasdx::slice, cublasdx::slice, term);
                    auto slice_b_view = tensor_slice_b(cublasdx::slice, cublasdx::slice, diag - term);
                    auto product_view = tensor_products(cublasdx::slice, cublasdx::slice, product++);

                    constexpr int pipeline_depth = 3;
                    auto const device_pipeline =
                        cublasdx::suggest_device_pipeline<pipeline_depth, BLAS, cublasdx::external_accumulation>(slice_a_view, slice_b_view).value();
                    auto const shared_memory_size = device_pipeline.buffer_size();
                    dim3 const grid_dim(shape_a_rows / static_tile_m(), shape_b_cols / static_tile_n());

                    auto kernel = igemm_kernel<BLAS, decltype(device_pipeline), decltype(product_view)>;
                    CUDA_CHECK_AND_EXIT(
                        cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

                    kernel<<<grid_dim, device_pipeline.get_block_dim(), shared_memory_size, str>>>(device_pipeline,
                                                                                                   product_view);
                    CUDA_CHECK_AND_EXIT(cudaGetLastError());
                }
            }

In [None]:
%%writefile cpp/2a_unfused_emulation/epilogue_kernel.hpp.inc

template<int BlockSize, int Slices, class ProductTensor, class ShiftTensorA, class ShiftTensorB, class OutTensor>
__launch_bounds__(BlockSize, 1) __global__ void epilogue_kernel(double        alpha,
                                                                double        beta,
                                                                ProductTensor product_tensor,
                                                                ShiftTensorA  shift_tensor_a,
                                                                ShiftTensorB  shift_tensor_b,
                                                                OutTensor     out_tensor) {
    using product_datatype = tutorial::tensor_value_type_t<ProductTensor>;
    using shift_datatype   = tutorial::tensor_value_type_t<ShiftTensorA>;
    using out_datatype     = tutorial::tensor_value_type_t<OutTensor>;

    const auto tid_m = threadIdx.x + blockIdx.x * blockDim.x;
    const auto tid_n = threadIdx.y + blockIdx.y * blockDim.y;

    int shift_a = shift_tensor_a(tid_m);
    int shift_b = shift_tensor_b(tid_n);

    auto product_view = product_tensor(tid_m, tid_n, cublasdx::slice);

    int    product_id = 0;
    double result     = 0.0;

    constexpr auto initial_diag = Slices - 1;
    constexpr auto initial_term = 0;

    for (auto diag = initial_diag; diag >= 0; --diag) {
        product_datatype diag_acc = 0;
        for (auto term = initial_term; term <= diag; ++term) {
            diag_acc += product_view(product_id++);
        }

        result += nth_slice_to_fp64<int32_t, int8_t>(diag, diag_acc, shift_a + shift_b);
    }

    out_tensor(tid_m, tid_n) = alpha * result + beta * out_tensor(tid_m, tid_n);
}

In [None]:
!cmake --build ./build -t 2a_unfused_emulation

In [None]:
!./build/2a_unfused_emulation

### Python

In [None]:
problems = [
  (2048, 2048, 2048, 0.9, 1.1),
]

In [None]:
def get_igemm_kernel(BLAS):

    assert BLAS.a_value_type == BLAS.b_value_type, "Invalid BLAS configuration"

    A_SIZE = BLAS.suggest_layout_smem_a().cosize
    B_SIZE = BLAS.suggest_layout_smem_b().cosize
    C_SIZE = BLAS.suggest_layout_rmem_c().cosize

    TILE_M, TILE_N = BLAS.c_dim
    TILE_K = BLAS.a_dim[1]
    BLOCK_SIZE = BLAS.block_size
    ALIGNMENT = min(BLAS.alignment.a, min(BLAS.alignment.b, BLAS.alignment.c))
    
    @cuda.jit(extensions=pipeline_extensions, launch_bounds=(BLOCK_SIZE, 1))
    def igemm_kernel(tensor_c, device_pipeline: DevicePipeline):
        m, n = tensor_c.shape

        ldc = max(tensor_c.strides) // tensor_c.itemsize

        block_m = cuda.blockIdx.x
        block_n = cuda.blockIdx.y

        smem = cuda.shared.array(shape=(0,), dtype=BLAS.a_value_type, alignment=ALIGNMENT)

        block_start_m = block_m * TILE_M
        block_end_m = (block_m + 1) * TILE_M

        block_start_n = block_n * TILE_N
        block_end_n = (block_n + 1) * TILE_N

        if block_start_m >= m or block_start_n >= n:
            return
        
        c_view = tensor_c[
            block_start_m : block_end_m,
            block_start_n : block_end_n,
        ]

        gmem_c = make_tensor(c_view, BLAS.get_layout_gmem_c(ldc))
        
        tile_pipeline = device_pipeline.get_tile(smem, block_m, block_n)
        
        accumulator = BLAS.suggest_accumulator()
        tile_pipeline.execute(accumulator)

        if accumulator.is_thread_active():
            accumulator.partition_and_copy(accumulator.get_results(), gmem_c)

        tile_pipeline._del()

    return igemm_kernel

In [None]:
def unfused_igemm_ozaki(tensor_slicedA, tensor_slicedB, tensor_product, context, warmup=True):
    BLAS = context["BLAS"]
    pipeline_depth = context["PIPELINE_DEPTH"]
    igemm_kernel = context["gemm_kernel"]
    grid = context["gemm_grid"]
    block = context["gemm_block"]

    _, _, slices = tensor_slicedA.shape

    product_index = 0
    
    """
    EXERCISE --> Coordinate the products between slices of matrix A and slices of matrix B
                 NOTE that tensor_slice_a is shaped like (m, k, nslices) and
                           tensor_slice_b is shaped like (k, n, nslices)
                 with strides being (k, 1, m * k) and (1, k, k * n) respectively  
    Compute only the most significant products and store them into tensor_products
    with shape (m, n, num_products)
    """

    initial_diag = -1
    initial_term = -1
    
    for diag in range(-1): # EXERCISE --> loop over diagonals
        for term in range(-1): # EXERCISE --> loop along the diagonal
            # EXERCISE --> Determine which slice of A and slice of B to multiply
            slice_a_index = -1
            slice_b_index = -1
            
            # Convert from a cupy array to numba arrays
            #  - cupy arrays are needed to setup 3D strides for the pipeline API
            slice_a_view = cuda.as_cuda_array(tensor_slicedA[:, :, slice_a_index])
            slice_b_view = cuda.as_cuda_array(tensor_slicedB[:, :, slice_b_index])
            product_view = cuda.as_cuda_array(tensor_product[:, :, product_index])

            product_index += 1

            device_pipeline = BLAS.suggest_device_pipeline(pipeline_depth, slice_a_view, slice_b_view)

            if warmup and diag == initial_diag and term == initial_term:
                set_max_dynamic_shared_size_bytes(igemm_kernel, device_pipeline.buffer_size,
                                                    product_view, device_pipeline)

            igemm_kernel[grid, block, 0, device_pipeline.buffer_size](product_view, device_pipeline)

In [None]:
def get_epilogue_kernel(block_size=64):
    uint8_width = get_width(np.uint8)

    @cuda.jit(device=True, forceinline=True)
    def nth_slice_to_fp64(nth, nth_slice, exponent_shift):
        ko = math.pow(2.0, -nth * uint8_width)

        value = ko * np.float64(nth_slice)
        return epilogue_ldexp(value, -exponent_shift)

    @cuda.jit(launch_bounds=(block_size, 1))
    def epilogue_kernel(slices, tensor_product, tensor_shift_a, tensor_shift_b, tensor_out, alpha, beta):
        tid_m = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        tid_n = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y

        if tid_m >= tensor_out.shape[0] or tid_n >= tensor_out.shape[1]:
            return

        shift_a = tensor_shift_a[tid_m]
        shift_b = tensor_shift_b[tid_n]

        product_view = tensor_product[tid_m, tid_n, :]

        product_id = 0
        accumulator = 0.0

        """
        EXERCISE --> Complete the implementation of the epilogue kernel.  This kernel:
          1. Accumulates along the anti-diagonals into diag_acc
          2. Calls nth_slice_to_fp64 to convert back to fp64 and scale the exponent
          3. Implements a typical GEMM epilogue (alpha * accumulator + beta * C)
        """

        initial_diag = -1
        initial_term = -1
    
        for diag in range(-1): # EXERCISE --> loop over diagonals
            diag_acc = 0
            for term in range(-1): # EXERCISE --> loop along the diagonal
                diag_acc += product_view[product_id]
                product_id += 1

            # HINT: Be careful here, the most significant diagonal is the 0th diagonal
            accumulator += nth_slice_to_fp64(diag, diag_acc, shift_a + shift_b)
    
        tensor_out[tid_m, tid_n] = alpha * accumulator + beta * tensor_out[tid_m, tid_n]

    return epilogue_kernel

def epilogue(slices, tensor_products, tensor_shift_a, tensor_shift_b, tensor_c, alpha, beta, context):
    epilogue_kernel = context["epilogue_kernel"]
    grid = context["epilogue_grid"]
    block = context["epilogue_block"]
    
    epilogue_kernel[grid, block](slices, tensor_products, tensor_shift_a, tensor_shift_b, tensor_c, alpha, beta)

In [None]:
def setup_func(m, n, k):
    tile_m = 128
    tile_n = 128
    tile_k = 128
    pipeline_depth = 3
    block_size = 128

    epilogue_tile_m = 16
    epilogue_tile_n = 16

    assert m % tile_m == 0, "Unsupported dimension m for TILE_M"
    assert n % tile_n == 0, "Unsupported dimension n for TILE_N"
    assert k % tile_k == 0, "Unsupported dimension k for TILE_K"
    assert k >= (tile_k * pipeline_depth), "Unsupported pipeline depth for k"

    assert m % epilogue_tile_m == 0, "Unsupported dimension for EPILOGUE_TILE_M"
    assert n % epilogue_tile_n == 0, "Unsupported dimension for EPILOGUE_TILE_N"
    
    BLAS = Matmul(size=(tile_m, tile_n, tile_k),
                  precision=(np.int8, np.int8, np.int32),
                  data_type="real",
                  alignment=MAX_ALIGNMENT,
                  arrangement=("row_major", "col_major", "col_major"), # Do not change
                  execution="Block",
                  block_size=block_size,
                  with_pipeline=True,
                  enable_input_streaming=True,
                  static_block_dim=True)

    gemm_grid = (m // tile_m, n // tile_n)
    gemm_block = BLAS.block_dim

    epilogue_grid = (m // epilogue_tile_m, n // epilogue_tile_n)
    epilogue_block = (epilogue_tile_m, epilogue_tile_n)

    return {
        "BLAS": BLAS,
        "PIPELINE_DEPTH": pipeline_depth,
        "gemm_kernel" : get_igemm_kernel(BLAS),
        "gemm_grid": gemm_grid,
        "gemm_block": gemm_block,
        "epilogue_kernel": get_epilogue_kernel(math.prod(epilogue_block)),
        "epilogue_grid": epilogue_grid,
        "epilogue_block": epilogue_block
    }

In [None]:
benchmark_unfused_emulated_dgemm(problems, setup_func, unfused_igemm_ozaki, epilogue)

#### Solution

In [None]:
def unfused_igemm_ozaki_solution(tensor_slicedA, tensor_slicedB, tensor_product, context, warmup=True):
    BLAS = context["BLAS"]
    pipeline_depth = context["PIPELINE_DEPTH"]
    igemm_kernel = context["gemm_kernel"]
    grid = context["gemm_grid"]
    block = context["gemm_block"]

    _, _, slices = tensor_slicedA.shape

    product_id = 0

    """
     EXERCISE --> Coordinate the products between slices of matrix A and slices of matrix B
                  NOTE that tensor_slice_a is shaped like (m, k, nslices) and
                            tensor_slice_b is shaped like (k, n, nslices)
                  with strides being (k, 1, m * k) and (1, k, k * n) respectively
      
     Compute only the most significant products shown in the diagram above and store them into
     tensor_products, which has shape (m, n, num_products)
    """
    initial_diag = slices - 1
    initial_term = 0
    
    for diag in range(initial_diag, -1, -1):
        for term in range(diag + 1):
            slice_a = term
            slice_b = diag - term

            # Convert from a cupy array to numba arrays
            #  - cupy arrays are needed to setup 3D strides for the pipeline API
            slice_a_view = cuda.as_cuda_array(tensor_slicedA[:, :, slice_a])
            slice_b_view = cuda.as_cuda_array(tensor_slicedB[:, :, slice_b])
            product_view = cuda.as_cuda_array(tensor_product[:, :, product_id])

            device_pipeline = BLAS.suggest_device_pipeline(pipeline_depth, slice_a_view, slice_b_view)

            if warmup and diag == initial_diag and term == initial_term:
                set_max_dynamic_shared_size_bytes(igemm_kernel, device_pipeline.buffer_size,
                                                    product_view, device_pipeline)

            igemm_kernel[grid, block, 0, device_pipeline.buffer_size](product_view, device_pipeline)
             
            product_id += 1

In [None]:
def get_epilogue_kernel_solution(block_size=64):
    uint8_width = get_width(np.uint8)

    @cuda.jit(device=True, forceinline=True)
    def nth_slice_to_fp64(nth, nth_slice, exponent_shift):
        ko = math.pow(2.0, -nth * uint8_width)

        value = ko * np.float64(nth_slice)
        return epilogue_ldexp(value, -exponent_shift)

    @cuda.jit(launch_bounds=(block_size, 1))
    def epilogue_kernel(slices, tensor_product, tensor_shift_a, tensor_shift_b, tensor_out, alpha, beta):
        tid_m = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        tid_n = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y

        if tid_m >= tensor_out.shape[0] or tid_n >= tensor_out.shape[1]:
            return

        shift_a = tensor_shift_a[tid_m]
        shift_b = tensor_shift_b[tid_n]

        product_view = tensor_product[tid_m, tid_n, :]

        product_id = 0
        result = 0.0

        initial_diag = slices - 1
        initial_term = 0
    
        for diag in range(initial_diag, -1, -1):
            diag_acc = 0
            for term in range(diag + 1):
                diag_acc += product_view[product_id]
                product_id += 1

            result += nth_slice_to_fp64(diag, diag_acc, shift_a + shift_b)

        if beta != 0:
            result = alpha * result + beta * tensor_out[tid_m, tid_n]
        else:
            result = alpha * result

        tensor_out[tid_m, tid_n] = result

    return epilogue_kernel

In [None]:
def setup_func_solution(m, n, k):
    ctx = setup_func(m, n, k)
    epilogue_block = ctx["epilogue_block"]
    ctx["epilogue_kernel"] = get_epilogue_kernel_solution(math.prod(epilogue_block))
    
    return ctx

In [None]:
benchmark_unfused_emulated_dgemm(problems, setup_func_solution, unfused_igemm_ozaki_solution, epilogue)

### Performance Model

In [None]:
import numpy as np
import math

# INT8 TOPS, MEMORY BANDWIDTH (GB/s)
GPU_SPECS = {
    "L40S": (733, 864),
    "B200": (4500, 8000)
}

# NOTE: This model is very simplistic and does not take quantization or other overheads like slicing and FP64 operations into account
def roofline_prediction_3_1(m, n, k, slices=7, TILE_M=128, TILE_N=128, TILE_K=128):
    INT8_TOPS, MEMORY_BANDWIDTH_GBS = GPU_SPECS["L40S"]

    num_products = (slices * (slices + 1)) // 2

    # Each CTA computes one TILE_M x TILE_N output tile, so the tile count equals the GEMM grid size
    tiles = math.ceil(m / TILE_M) * math.ceil(n / TILE_N)

    # Each tile does TILE_M * TILE_N dot products which each have k multiplications and k additions for every product
    flops_per_tile = 2 * TILE_M * TILE_N * k * num_products

    fp64_size = np.dtype(np.float64).itemsize
    int32_size = np.dtype(np.int32).itemsize
    int8_size = np.dtype(np.int8).itemsize

    # We load TILE_M rows of matrix A and TILE_N columns of matrix B, and write TILE_M * TILE_N elements of matrix C
    # This needs to happen for each product
    memory_per_tile = ((TILE_M * k + TILE_N * k) * int8_size + TILE_M * TILE_N * int32_size) * num_products

    # In the epilogue kernel, we load the products and read from and write to the output
    memory_per_tile += (TILE_M * TILE_N) * (num_products * int32_size + 2 * fp64_size)

    total_memory_gb = tiles * memory_per_tile * 1e-9
    total_tflop = tiles * flops_per_tile * 1e-12

    return total_tflop / INT8_TOPS, total_memory_gb / MEMORY_BANDWIDTH_GBS

time_flops, time_membw = roofline_prediction_3_1(2048, 2048, 2048)

print(f"The runtime from the math operations is {time_flops * 1e3} ms and the runtime from memory is {time_membw * 1e3} ms")

# We will either be bottlenecked by FLOPS or Memory Bandwidth, so we take the maximum
print(f"Therefore, the estimated best case runtime is {max(time_flops, time_membw) * 1e3} ms")
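One term of this model is worth isolating: the INT32 slice products that the unfused scheme writes to global memory in the GEMM kernels and reads back in the epilogue. The sketch below is a back-of-the-envelope calculation under the same assumptions as the model above (7 slices, one INT32 value per output element per product); the helper name `intermediate_product_bytes` is hypothetical.

```python
import numpy as np


def intermediate_product_bytes(m, n, slices=7):
    """Bytes occupied by the int32 slice products stored by the unfused scheme."""
    num_products = (slices * (slices + 1)) // 2
    return m * n * num_products * np.dtype(np.int32).itemsize


m = n = 2048
product_bytes = intermediate_product_bytes(m, n)
output_bytes = m * n * np.dtype(np.float64).itemsize

# The products are written once by the GEMM kernels and read once by the epilogue.
print(f"Intermediate products: {product_bytes / 1e9:.2f} GB each way")
print(f"Ratio to the FP64 output matrix: {product_bytes // output_bytes}x")
```

For the 2048 x 2048 problem the intermediate products are 14 times larger than the FP64 output itself, which gives a sense of the memory overhead that kernel fusion aims to reduce.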

## Conclusion

In this notebook we've learned the fundamentals of the Ozaki-I Scheme and built an implementation that we will optimize in the next exercises.

We then analyzed why the Ozaki scheme makes sense by building a simplistic model for the product GEMMs and epilogue kernel.  In the next exercise, we will implement kernel fusion to reduce memory overhead and speed up the process.