## 🧮 GEMM Example

In this example, we perform a General Matrix Multiplication (GEMM) on input matrices `A` and `B` to compute the result matrix `C`. The algorithm follows the standard GEMM formula:

```
C[i, j] += A[i, k] * B[k, j]
``` 
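As a point of reference, the formula can be written as a plain triple loop in NumPy. This is only an illustrative sketch, not part of the ARIES flow; the helper name `gemm_reference` and the small matrix shapes are assumptions chosen for the example.

```python
import numpy as np

def gemm_reference(A, B, C):
    """Accumulate A @ B into C with explicit loops: C[i, j] += A[i, k] * B[k, j]."""
    I, K = A.shape
    K2, J = B.shape
    assert K == K2 and C.shape == (I, J)
    for i in range(I):
        for j in range(J):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

# Small hypothetical inputs, just to compare against NumPy's matmul
rng = np.random.default_rng(0)
A_small = rng.random((4, 3), dtype=np.float32)
B_small = rng.random((3, 5), dtype=np.float32)
C_small = gemm_reference(A_small, B_small, np.zeros((4, 5), dtype=np.float32))
print(np.allclose(C_small, A_small @ B_small, rtol=1e-4))
```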

Point to the ARIES frontend directory and import the necessary packages

In [None]:
import os
import sys
cur_dir = os.getcwd()
aries_path = cur_dir + "/../../../../"
sys.path.append(aries_path)
from frontend import *
from IPython import get_ipython

Set the entire workload and specify the tile size

In [None]:
# GEMM: C[i0, j0] += A[i0, k0] * B[k0, j0]
I, J, K = 256, 256, 256
TI, TJ, TK = 32, 32, 32
grid = (I // TI, J // TJ, K // TK)  # grid must be a tuple
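Because `//` truncates, the grid above silently drops any remainder when a dimension is not a multiple of its tile size. The sketch below shows one way to guard against that; `make_grid` is a hypothetical helper, not an ARIES function, and whether ARIES supports partial tiles is not stated here.

```python
def make_grid(sizes, tiles):
    """Return the number of tiles per dimension, rejecting uneven splits."""
    for size, tile in zip(sizes, tiles):
        if size % tile != 0:
            raise ValueError(f"size {size} is not a multiple of tile {tile}")
    return tuple(size // tile for size, tile in zip(sizes, tiles))

# Same workload and tile sizes as in the cell above
grid_example = make_grid((256, 256, 256), (32, 32, 32))
print(grid_example)  # (8, 8, 8)
```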

Define the GEMM operation

In [None]:
@task_kernel()
def kernel_gemm(TileA: float32[TI, TK], 
                TileB: float32[TK, TJ], 
                TileC: float32[TI, TJ]):
    for i0 in range(0, TI):
        for j0 in range(0, TJ):
            TileC[i0, j0] = float32(0)
            for k0 in range(0, TK):
                TileC[i0, j0] += TileA[i0, k0] * TileB[k0, j0]

### Tiled Programming Model in ARIES

In ARIES, computation is structured using a **tiled programming model**. The channels between L3 and L1 memory are unidirectional: each one supports either loading or storing, but not both. To reduce the number of channels, it is recommended to perform constant initialization (such as zeroing `TileC` in `kernel_gemm`) within a single kernel whenever possible.

<img src="../images/gemm.png" alt="GEMM" width="600"/>

Memory arguments defined in `@task_tile` represent L3 memory. This decorator primarily handles local memory allocation and describes data movement between L3 and L1 memory.

In [None]:
@task_tile()
def gemm(A: float32[I, K], B: float32[K, J], 
         C: float32[I, J], **kwargs):
    i, j, k = aries.tile_ranks(**kwargs)
    
    L1_A = aries.buffer((TI, TK), "float32")
    L1_B = aries.buffer((TK, TJ), "float32")
    L1_C = aries.buffer((TI, TJ), "float32")
    
    
    ############## Please fill the logic of ti and tk #################
    
    # Slice data using aries.arange(start, stop)
    ti =
    tk =
    
    ####################################################################
    tj = aries.arange(j*TJ, (j+1)*TJ)  # J tile range
    
    
    L1_A = aries.load(A, (ti, tk))
    L1_B = aries.load(B, (tk, tj))
    kernel_gemm(L1_A, L1_B, L1_C)
    aries.accstore(L1_C, C, (ti, tj))

<details>
<summary>Answer (click to expand)</summary>

```python
ti = aries.arange(i*TI, (i+1)*TI)  # I tile range  
tk = aries.arange(k*TK, (k+1)*TK)  # K tile range
```

</details>
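The tile slicing can be checked outside ARIES with a plain NumPy emulation. This is a minimal sketch under two assumptions: that `aries.accstore` adds the L1 tile into `C` (as its name and the zero initialization in the kernel suggest), and that each grid point `(i, j, k)` handles exactly one tile. The names `tile_kernel` and `tiled_gemm` are hypothetical, and the matrices are smaller than the notebook's to keep the loops short.

```python
import numpy as np

I, J, K = 64, 64, 64      # smaller than the notebook's 256 to keep the example quick
TI, TJ, TK = 32, 32, 32

def tile_kernel(tile_a, tile_b):
    """Stand-in for kernel_gemm: product of one A tile and one B tile."""
    return tile_a @ tile_b

def tiled_gemm(A, B, C):
    for i in range(I // TI):
        for j in range(J // TJ):
            for k in range(K // TK):
                ti = slice(i * TI, (i + 1) * TI)  # I tile range
                tj = slice(j * TJ, (j + 1) * TJ)  # J tile range
                tk = slice(k * TK, (k + 1) * TK)  # K tile range
                # Assumed accstore semantics: partial sums over k are added into C
                C[ti, tj] += tile_kernel(A[ti, tk], B[tk, tj])
    return C

rng = np.random.default_rng(0)
A_ref = rng.random((I, K), dtype=np.float32)
B_ref = rng.random((K, J), dtype=np.float32)
C_ref = tiled_gemm(A_ref, B_ref, np.zeros((I, J), dtype=np.float32))
print(np.allclose(C_ref, A_ref @ B_ref, rtol=1e-4))
```

Each `(ti, tj)` block of `C` receives `K // TK` partial products, which is why the store must accumulate rather than overwrite.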

Define top function

In [None]:
@task_top()
def top(A: float32[I, K], B: float32[K, J], C: float32[I, J]):
    gemm_task = gemm[grid](A, B, C)
    return gemm_task

In [None]:
# Get the input cells that contain the decorators
cell_codes = get_ipython().user_ns["In"][2:6]
# Join them into one string, with a newline between each cell
all_code = "\n".join(cell_codes)

Verify correctness of the ARIES program in the frontend

In [None]:
import numpy as np  # explicit import; may already be provided by `from frontend import *`

# Initialize the buffers
np.random.seed(0)
A = np.random.rand(I, K).astype(np.float32)
B = np.random.rand(K, J).astype(np.float32)
C = np.zeros((I, J)).astype(np.float32)

# Execute on CPU
gemm_task = top(A, B, C)
golden_C = np.matmul(A, B)

# Compare the ARIES output with the golden reference
print("ARIES gemm output matches golden reference:", np.allclose(C, golden_C))

# Generate files for on-board test
aries.gen_sim([A, B, golden_C])

Apply scheduling using **ARIES primitives**. In this example, the primitives are passed to the **AIE auto-vectorizer** provided by the **MLIR-AIE/IRON** project.

In [None]:
# Apply schedulings
sch = Schedule(gemm_task)
sch.to("VCK190")

# These primitives are used by the MLIR-AIE auto-vectorizer for single-AIE optimization
sch.aieUnroll(gemm_task, factor=8) # Unroll the innermost loop by a factor of 8 to improve pipeline efficiency and guarantee memory alignment
sch.aieVector(gemm_task, factor=8) # Automatically detect a suitable loop (here j) and vectorize it

In [None]:
# Set the project dir and template dir
prj_dir = cur_dir + '/project_gemm'
temp_dir = aries_path + '/templates'
# Generate Initial MLIR and ARIES Opts
sch.build(all_code, prj_dir, temp_dir)
sch.compile(aries_path, prj_dir, target="report")

<details>
<summary> AIE Design Diagram (click to expand)</summary>

The AIE graph can be found at `project_gemm/project/Work/reports/adf_graph_mapped_post.png`.


<div style="display: flex; gap: 100px;">
  <img src="../images/adf_graph_mapped_post.png" alt="ADF Graph" width="250"/>
  <img src="../images/gemm_aie.png" alt="GEMM AIE" width="250"/>
</div>

</details>

<details>
<summary> AIE Profile Result - Vendor Tools (click to expand)</summary>

**Theoretical Cycle Count for Kernel**: (32 × 32 × 32) / 8 = **4096 cycles**  
**Measured Efficiency**: 4096 / 12756 ≈ **32.1%**
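The two figures above can be reproduced with a few lines of arithmetic. The sketch assumes the divisor 8 is the number of FP32 MACs issued per cycle (matching the vectorization factor used in the schedule); the measured cycle count is taken from the profile result.

```python
TI, TJ, TK = 32, 32, 32
macs_per_cycle = 8        # assumption: one 8-lane FP32 MAC per cycle
measured_cycles = 12756   # from the vendor profiling report

total_macs = TI * TJ * TK
theoretical_cycles = total_macs // macs_per_cycle
efficiency = theoretical_cycles / measured_cycles

print(theoretical_cycles)    # 4096
print(f"{efficiency:.1%}")   # 32.1%
```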

<img src="../images/gemm0_cycle.png" alt="GEMM0 Cycle" width="600"/>
<img src="../images/gemm0_cycle_name.png" alt="GEMM0 Func Name" width="600"/>

</details>


<details>
<summary>Pipeline of Auto-Vectorized Single AIE (click to expand)</summary>

The diagram below illustrates the pipeline of a single AIE (with size 2 x 4 x 8) generated by the **MLIR-AIE automatic vectorizer**, showing how operations are arranged for **VLIW** execution. On AIE-V1, the FP32 MAC operation has a minimum latency of two cycles, which prevents accumulation into the same register on every cycle. As a result, the computation efficiency stays below 50%.

<img src="../images/gemm_kernel.png" alt="GEMM" width="600"/>

</details>
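The effect of the two-cycle FP32 MAC latency on accumulation can be illustrated with a toy issue model. This is a deliberately simplified sketch, not a model of the actual AIE scheduler: it assumes one MAC can issue per cycle and that a MAC must wait until the previous MAC on the same accumulator has completed. The function `completion_cycles` and its parameters are hypothetical.

```python
def completion_cycles(n_macs, latency, n_accumulators=1):
    """Cycle at which the last MAC completes, with round-robin accumulators."""
    ready = [0] * n_accumulators  # cycle at which each accumulator is free again
    cycle = 0                     # next available issue slot
    for m in range(n_macs):
        acc = m % n_accumulators
        cycle = max(cycle, ready[acc])  # stall until the accumulator is free
        ready[acc] = cycle + latency    # result is available after `latency` cycles
        cycle += 1                      # one issue slot per cycle
    return max(ready)

n = 4096
single = completion_cycles(n, latency=2, n_accumulators=1)
double = completion_cycles(n, latency=2, n_accumulators=2)
print(single, n / single)  # one accumulator: half of the issue slots are used
print(double, n / double)  # two independent accumulators hide the latency
```

In this model, accumulating into a single register caps utilization at 50%, while alternating between two independent accumulators nearly removes the stall.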