# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Introduction

### Prerequisites

To get the most out of this lab you should already be able to:
 - Declare variables, write loops, and use if / else statements in C++
 - Use pointers and iterators to access arrays of data
 - Use lambda expressions (unnamed functions)
 - Write basic CUDA kernels and understand core concepts like threads, blocks, and grids
   - If you are new to CUDA, we recommend the [CUDA C++ Tutorial](https://github.com/NVIDIA/accelerated-computing-hub/tree/main/tutorials/cuda-cpp)


On the infrastructure side, this lab assumes that you:
 - Have the latest (13.1+) [CUDA toolkit](https://developer.nvidia.com/cuda-downloads) installed
 - Have the latest MathDx package (25.12.1) downloaded onto your system (preferably installed to `/opt/nvidia/mathdx/25.12`)
 - Have an NVIDIA GPU installed in your system
 - Are running on a Linux system

If you are using Windows, try WSL (Windows Subsystem for Linux) or file an issue requesting Windows support.

___

### Objectives

By the time you complete this lab, you will be able to:

 - Understand key performance characteristics and parameters in GEMM kernels
 - Write performant and portable CUDA kernels that leverage Tensor Cores and TMA
 - Understand kernel fusion and directly see how it can improve performance
 - Understand how to emulate FP64 matrix multiplication with the Ozaki-I Scheme
 - Understand how to simplify kernel development and achieve higher performance with a tile-based programming model

___

### Content

 - Matrix Multiplication Fundamentals
   - **Exercise 2.1:** A simple DGEMM Kernel
   - **Exercise 2.2:** Improving DGEMM Performance with Shared Memory Tiling
   - **Exercise 2.3:** Improving DGEMM Performance with Pipelining
 - Matrix Multiplication with cuBLASDx
   - **Challenge Exercise 2.4:** Implementing Pipelining with cuBLASDx
 - Ozaki-I Emulation
   - **Exercise 3.1:** IGEMM Based Ozaki-I Scheme
   - **Exercise 3.2:** Optimizing Ozaki-I With Fusion
   - **Exercise 3.3:** A Fully Fused Ozaki-I Implementation
 - Challenge Exercise
   - **Challenge Exercise 4.1:** DSYRK using Ozaki-I

___

### Why Device Extension Libraries?

In short, programming with Tensor Cores is difficult. Each Tensor Core generation changes calling conventions, requires different synchronization patterns and memory layouts, and even reads from and writes to different memory subsystems. Device extension libraries provide an interface that is stable across GPU generations and gives performant access to Tensor Cores. In this lab, we will learn how to leverage device extension libraries, specifically cuBLASDx, to implement the Ozaki-I scheme. The kernels we write will work on Volta and newer GPUs without code modification and can reach Speed of Light (SOL) performance on all of them.

The cuBLASDx library provides a tile-based programming model that makes writing performant Tensor Core kernels as easy as it has ever been. A tile-based programming model lets users define the computation at a higher level and leave low-level details like thread mapping, memory hierarchies, synchronization, and TMA to the library developers.
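
To make the tile-based idea concrete, here is a minimal host-side sketch in plain C++. It is not cuBLASDx and not GPU code; the names `TILE`, `tile_mma`, and `tiled_gemm` are hypothetical and chosen only for illustration. The user-level function reasons purely in terms of tiles, while the helper hides the element-level work, which is roughly the division of labor a tile-based library aims for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge length, chosen only for this illustration.
constexpr std::size_t TILE = 4;

// "Library side": multiply one TILE x TILE tile of A by one TILE x TILE tile
// of B and accumulate into a tile of C. All matrices are row-major and
// lda/ldb/ldc are the row strides of the full matrices.
void tile_mma(const double* a, std::size_t lda,
              const double* b, std::size_t ldb,
              double* c, std::size_t ldc) {
    for (std::size_t i = 0; i < TILE; ++i) {
        for (std::size_t j = 0; j < TILE; ++j) {
            double acc = c[i * ldc + j];
            for (std::size_t k = 0; k < TILE; ++k) {
                acc += a[i * lda + k] * b[k * ldb + j];
            }
            c[i * ldc + j] = acc;
        }
    }
}

// "User side": C (m x n) += A (m x k) * B (k x n), written as loops over
// tiles only. This sketch assumes every dimension is a multiple of TILE.
void tiled_gemm(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C,
                std::size_t m, std::size_t n, std::size_t k) {
    assert(m % TILE == 0 && n % TILE == 0 && k % TILE == 0);
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t ti = 0; ti < m; ti += TILE) {
        for (std::size_t tj = 0; tj < n; tj += TILE) {
            for (std::size_t tk = 0; tk < k; tk += TILE) {
                tile_mma(&A[ti * k + tk], k,
                         &B[tk * n + tj], n,
                         &C[ti * n + tj], n);
            }
        }
    }
}
```

On a GPU, the body of `tile_mma` is where thread mapping, shared memory staging, and Tensor Core instructions would live; the outer tile loops would stay largely the same.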

In this lab, you will learn how to use cuBLASDx to write performant kernels for complex algorithms, with FP64 emulation via the Ozaki-I scheme serving as the testbed.
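
For orientation, the following host-side C++ sketch illustrates the general idea behind slice-based FP64 emulation on a single dot product. It is a simplified illustration in the spirit of the Ozaki-I scheme, not the implementation developed in this lab: it assumes one shared power-of-two scale per vector, and the names `SLICE_BITS`, `NUM_SLICES`, `make_slices`, and `emulated_dot` are hypothetical. Each double is cut into small signed integer slices, slice pairs are multiplied and accumulated exactly in integer arithmetic, and the partial results are rescaled and summed in FP64.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameters: 6-bit digits fit comfortably in int8.
constexpr int SLICE_BITS = 6;
constexpr int NUM_SLICES = 4;

// Split every x[i] into NUM_SLICES signed digits so that
//   x[i] ~= 2^exp * sum_s slices[s][i] * 2^(-SLICE_BITS * (s + 1)).
// Assumes |x[i]| < 2^exp. Bits beyond the last slice are dropped.
std::vector<std::vector<std::int8_t>> make_slices(const std::vector<double>& x,
                                                  int exp) {
    std::vector<std::vector<std::int8_t>> slices(
        NUM_SLICES, std::vector<std::int8_t>(x.size()));
    for (std::size_t i = 0; i < x.size(); ++i) {
        double r = std::ldexp(x[i], -exp);      // scale into (-1, 1)
        assert(std::fabs(r) < 1.0);
        for (int s = 0; s < NUM_SLICES; ++s) {
            r = std::ldexp(r, SLICE_BITS);      // expose the next digit
            double digit = std::trunc(r);       // |digit| <= 63
            slices[s][i] = static_cast<std::int8_t>(digit);
            r -= digit;                         // exact remainder, |r| < 1
        }
    }
    return slices;
}

// Dot product of x and y computed from integer slice products.
double emulated_dot(const std::vector<double>& x, int exp_x,
                    const std::vector<double>& y, int exp_y) {
    assert(x.size() == y.size());
    const auto xs = make_slices(x, exp_x);
    const auto ys = make_slices(y, exp_y);
    double result = 0.0;
    for (int p = 0; p < NUM_SLICES; ++p) {
        for (int q = 0; q < NUM_SLICES; ++q) {
            std::int32_t acc = 0;               // exact integer accumulation
            for (std::size_t i = 0; i < x.size(); ++i) {
                acc += static_cast<std::int32_t>(xs[p][i]) *
                       static_cast<std::int32_t>(ys[q][i]);
            }
            // Undo the scaling applied to slice p of x and slice q of y.
            result += std::ldexp(static_cast<double>(acc),
                                 exp_x + exp_y - SLICE_BITS * (p + q + 2));
        }
    }
    return result;
}
```

In a matrix setting, the inner integer loop over `i` becomes an integer GEMM (IGEMM) per slice pair, which is the part that can run on Tensor Cores. The number of slices controls how many bits of each input survive, so it trades accuracy against the number of integer products.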