# Chained matmul with tiling

This tutorial shows how to perform a `matmul` with tiling.
Tiling is a technique based on matrix partitioning: each block of the partition is called a tile.

Tiling brings two benefits to `matmul`:
* the computation can be performed in parallel, a domain where GPUs excel;
* global memory (GM) accesses are reduced, GM access being the GPU bottleneck (compared to computation).

From this simple example, we will expand the approach to chaining several `matmul`.

## GEMM introduction

Below we define the problem size and initialize the matrices.
In `GEMM`, a problem is defined by 3 numbers: `M`, `N` and `K`.

`D = α * A * B + β * C` with:
* `D` shape is `MxN`
* `A` shape is `MxK`
* `B` shape is `KxN`
* `C` shape is `MxN`
* `α` and `β` are 2 constants

> For readability, below `α` is implicitly set to 1 and `β` to 0, so `C` does not appear.
> Re-introducing them is straightforward.
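
To make the formula concrete, here is a minimal NumPy sketch of the full GEMM, including `α` and `β`. The helper name `gemm` and the small demo shapes are assumptions made for this illustration only; they are not part of the tutorial's code.

```python
import numpy as np


def gemm(A, B, C, alpha=1.0, beta=0.0):
    """Reference GEMM: returns alpha * (A @ B) + beta * C."""
    return alpha * np.matmul(A, B) + beta * C


# Hypothetical small shapes, chosen only for this illustration.
rng = np.random.default_rng(0)
A_demo = rng.random((4, 5))  # M x K
B_demo = rng.random((5, 3))  # K x N
C_demo = rng.random((4, 3))  # M x N

D_demo = gemm(A_demo, B_demo, C_demo, alpha=2.0, beta=0.5)
assert D_demo.shape == (4, 3)  # D is M x N

# With the defaults (alpha = 1, beta = 0), GEMM is a plain matmul.
assert np.allclose(gemm(A_demo, B_demo, C_demo), np.matmul(A_demo, B_demo))
```

With the default arguments, `C` has no influence on the result, which is why it can be dropped in the rest of the tutorial.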


In [1]:
import numpy as np

M, N0, K0 = 15, 9, 12

A0 = np.random.random((M, K0))
B0 = np.random.random((K0, N0))

## Simple matmul with tiling

This example shows how to perform a `matmul` through tiling.

A basic introduction to the subject can be found here:

* https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
* https://penny-xu.github.io/blog/tiled-matrix-multiplication

Parallelization can be applied at the level of both the `M` and `N` for loops.
However, making the best use of global memory access requires a smarter approach.
Check our dedicated explanation in tutorials.

The values used below are arbitrary and small enough to be printed if needed.
The rule of thumb for choosing a tile shape is:
* a large tile size increases data reuse but decreases thread-level parallelism;
* a small tile size increases thread-level parallelism but reduces data reuse.
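
The trade-off can be made tangible with a rough count. The sketch below uses a deliberately simplified model, which is an assumption of this example: each output tile reads its full strip of `A` and of `B` exactly once, and nothing is shared or cached between tiles. The function name `tiling_tradeoff` and the problem size are hypothetical.

```python
def tiling_tradeoff(M, N, K, block_M, block_N):
    """Count independent output tiles and elements read from A and B.

    Simplified model: each output tile reads a block_M x K strip of A
    and a K x block_N strip of B once, with no sharing between tiles.
    """
    n_tiles = (M // block_M) * (N // block_N)
    loads_per_tile = block_M * K + K * block_N
    return n_tiles, n_tiles * loads_per_tile


# Hypothetical square problem; only the trend matters.
for block in (16, 32, 64, 128):
    n_tiles, loads = tiling_tradeoff(1024, 1024, 1024, block, block)
    print(f"tile {block:>3}x{block:<3}: {n_tiles:>5} tiles, {loads:>12,} elements read")
```

Under this model, growing the tile lowers the number of elements read (more reuse) but also lowers the number of tiles that can be processed independently (less parallelism).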

In [2]:
# for simplicity, matrix shapes are all multiples of tile shapes
# otherwise we would need to check matrix bounds and mask out-of-bounds values with 0s in tiles
block_M, block_N0, block_K0 = M // 3, N0 // 3, K0 // 3

accumulator0 = np.zeros((M, N0))
for index_M in range(0, M, block_M):
    start_M = index_M
    end_M = index_M + block_M

    for index_N0 in range(0, N0, block_N0):
        start_N0 = index_N0
        end_N0 = index_N0 + block_N0

        for index_K0 in range(0, K0, block_K0):
            start_K0 = index_K0
            end_K0 = index_K0 + block_K0

            tile_A0 = A0[start_M:end_M, start_K0:end_K0]
            tile_B0 = B0[start_K0:end_K0, start_N0:end_N0]
            accumulator0[start_M:end_M, start_N0:end_N0] += np.matmul(tile_A0, tile_B0)

assert np.allclose(accumulator0, np.matmul(A0, B0))
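
The code comment above mentions that shapes which are not multiples of the tile shapes require bounds checks and zero masking. Below is a minimal sketch of what that could look like in NumPy. The helpers `load_tile` and `tiled_matmul_masked` are hypothetical names introduced for this illustration; a real GPU kernel would use masked loads and stores rather than explicit copies.

```python
import numpy as np


def load_tile(matrix, row, col, tile_rows, tile_cols):
    """Copy a tile starting at (row, col); out-of-bounds entries stay 0."""
    tile = np.zeros((tile_rows, tile_cols), dtype=matrix.dtype)
    valid = matrix[row:row + tile_rows, col:col + tile_cols]
    tile[:valid.shape[0], :valid.shape[1]] = valid
    return tile


def tiled_matmul_masked(A, B, tile_M, tile_N, tile_K):
    """Tiled matmul for shapes that are not multiples of the tile shapes."""
    M_, K_ = A.shape
    N_ = B.shape[1]
    out = np.zeros((M_, N_))
    for m in range(0, M_, tile_M):
        for n in range(0, N_, tile_N):
            acc = np.zeros((tile_M, tile_N))
            for k in range(0, K_, tile_K):
                # Zero padding along K adds nothing to the accumulated sum.
                acc += load_tile(A, m, k, tile_M, tile_K) @ load_tile(B, k, n, tile_K, tile_N)
            # Masked store: write back only the in-bounds part of the tile.
            rows, cols = min(tile_M, M_ - m), min(tile_N, N_ - n)
            out[m:m + rows, n:n + cols] = acc[:rows, :cols]
    return out


rng = np.random.default_rng(0)
A_odd = rng.random((15, 13))  # neither dimension is a multiple of 4
B_odd = rng.random((13, 10))
assert np.allclose(tiled_matmul_masked(A_odd, B_odd, 4, 4, 4), A_odd @ B_odd)
```

Every tile keeps the same fixed shape, which is the property a GPU kernel relies on; only the loads and the final store need to know about the matrix bounds.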

## Back to back fused matmul with tiling

Fusing matmuls reduces memory accesses because the intermediate outputs are never materialized (written to global memory).

Introduction to the approach can be found here:
https://github.com/NVIDIA/cutlass/tree/master/examples/13_two_tensor_op_fusion

The main trick is to make each tile of matrix `Bx` span the whole `N` axis of that matrix, i.e. `block_Nx == Nx`.
Therefore, there is no need to iterate over the `N` axis.

It introduces a constraint on `Nx`, which needs to be small enough for the tile to be kept in shared memory / registers.
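
As a rough illustration of this constraint, the sketch below computes the size of one intermediate tile and compares it with an on-chip budget. Both the helper name `intermediate_tile_bytes` and the 48 KiB budget are assumptions made for this example; real limits depend on the GPU and on how the kernel is configured.

```python
import numpy as np


def intermediate_tile_bytes(block_M, N, dtype=np.float16):
    """Bytes needed to hold one block_M x N intermediate tile on chip."""
    return block_M * N * np.dtype(dtype).itemsize


# Hypothetical on-chip budget, for illustration only.
budget_bytes = 48 * 1024

for n_width in (128, 512, 2048):
    needed = intermediate_tile_bytes(block_M=64, N=n_width)
    print(f"N = {n_width:>4}: {needed:>7} bytes, fits: {needed <= budget_bytes}")
```

Because the tile spans the whole `N` axis, its footprint grows linearly with `Nx`; the only remaining knob is `block_M`.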

In the example below, we chain 2 `matmul`:

* A1 = A0 * B0
* A2 = A1 * B1

> our goal is to never materialize A1
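
Skipping the materialization of `A1` works because `matmul` is linear: a partial sum of `A1`, computed from a single block along `K0`, can be multiplied by `B1` right away, and the partial products add up to the full result. The sketch below checks this numerically and shows that the shortcut breaks as soon as a non-linear function sits between the two matmuls. The names `X`, `W0`, `W1` and the shapes are hypothetical, and `np.square` merely stands in for any non-linearity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical small shapes, independent of the tutorial's variables.
X = rng.random((6, 8))   # plays the role of A0 (M x K0)
W0 = rng.random((8, 4))  # plays the role of B0 (K0 x N0)
W1 = rng.random((4, 5))  # plays the role of B1 (K1 x N1), with K1 == N0

step = 2  # block size along K0
fused = np.zeros((6, 5))
for k in range(0, 8, step):
    partial_A1 = X[:, k:k + step] @ W0[k:k + step, :]  # partial sum of A1
    fused += partial_A1 @ W1                            # consumed immediately

assert np.allclose(fused, (X @ W0) @ W1)

# With a non-linearity in between, partial sums can no longer be consumed early.
wrong = np.zeros((6, 5))
for k in range(0, 8, step):
    wrong += np.square(X[:, k:k + step] @ W0[k:k + step, :]) @ W1

assert not np.allclose(wrong, np.square(X @ W0) @ W1)
```

In the non-linear case, the accumulation over `K0` would have to be completed for the tile before moving on to the second matmul.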

In [3]:
# block_Nx is always equal to Nx
# for simplicity, matrix shapes are all multiples of tile shapes
block_M, block_N0, block_K0 = M // 3, N0, K0 // 3

# by definition K1 == N0: A1 is multiplied by B1, and the N axis of A1 is the N axis of B0
N1, K1 = 12, N0

# the second matmul iterates over K1 (== N0), so K1 must be a multiple of block_K1 to avoid masking
block_N1, block_K1 = N1, K1 // 3

# initialize B1 matrix
B1 = np.random.random((K1, N1))

Some important shapes:

* shape of `A1 = matmul(A0, B0)` is `MxN0`, iterate over `K0`
* shape of `B1` is `K1xN1` with `K1 == N0`
* shape of `A2 = matmul(A1, B1)` is `MxN1`, iterate over `K1`
  * because `K1 == N0`, during the second matmul we iterate over `N0`

So we will set the following tile shapes:
* `block_N0 = N0`
* `block_N1 = N1`

> In the code, `:block_N0` and `:block_N1` are used instead of `:` for readability.

In [4]:
accumulator2 = np.zeros((M, N1))

for index_M in range(0, M, block_M):
    start_M = index_M
    end_M = index_M + block_M
    for index_K0 in range(0, K0, block_K0):
        start_K0 = index_K0
        end_K0 = index_K0 + block_K0

        tile_A0 = A0[start_M:end_M, start_K0:end_K0]
        tile_B0 = B0[start_K0:end_K0, :block_N0]
        # partial sum of the A1 tile (block_M x N0) for this K0 block; it is consumed right away
        tile_A1 = np.matmul(tile_A0, tile_B0)
        for index_K1 in range(0, K1, block_K1):
            start_K1 = index_K1
            end_K1 = index_K1 + block_K1

            tile_tile_A1 = tile_A1[:, start_K1:end_K1]
            tile_B1 = B1[start_K1:end_K1, :block_N1]

            accumulator2[start_M:end_M, :block_N1] += np.matmul(tile_tile_A1, tile_B1)

assert np.allclose(accumulator2, np.matmul(np.matmul(A0, B0), B1))
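
The same idea can be extended to a longer chain. The sketch below is one possible generalization, not the only one: for each block of rows it completes the intermediate tile over `K` before feeding it to the next matmul, which is a simpler ordering than the loop above (where partial sums are consumed immediately). The function name `chained_tiled_matmul` and the demo shapes are hypothetical.

```python
import numpy as np


def chained_tiled_matmul(A, Bs, block_M, block_K):
    """Compute A @ Bs[0] @ Bs[1] @ ... one block of rows at a time.

    Intermediate results exist only as block_M-row tiles that span the
    full N axis; no full-size intermediate matrix is allocated.
    """
    M_ = A.shape[0]
    out = np.zeros((M_, Bs[-1].shape[1]))
    for m in range(0, M_, block_M):
        tile = A[m:m + block_M, :]
        for B in Bs:
            K_ = B.shape[0]
            acc = np.zeros((tile.shape[0], B.shape[1]))
            for k in range(0, K_, block_K):
                acc += tile[:, k:k + block_K] @ B[k:k + block_K, :]
            tile = acc  # becomes the left operand of the next matmul
        out[m:m + block_M, :] = tile
    return out


rng = np.random.default_rng(0)
A_in = rng.random((15, 12))
weights = [rng.random(shape) for shape in [(12, 9), (9, 12), (12, 6)]]
expected = A_in @ weights[0] @ weights[1] @ weights[2]
assert np.allclose(chained_tiled_matmul(A_in, weights, block_M=5, block_K=3), expected)
```

The constraint discussed earlier applies to every link of the chain: each intermediate tile of shape `block_M x Nx` has to fit in on-chip memory.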