# Dynamic Reflexive Tiling (DRT)

This notebook reproduces the salient characteristics of [Dynamic Reflexive Tiling](https://dl.acm.org/doi/10.1145/3582016.3582064) on accelerators of different dataflows.

## Imports

Import the necessary modules.

In [None]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "..")

from src import utils

## Dynamic Reflexive Tiling
Dynamic Reflexive Tiling (DRT) is an approach to build dynamic-nonuniform-coordinate (D-N-C) tiles at different levels of the memory hierarchy. Since sparse tensor algebra applications are often memory bound, DRT increases reuse by maximizing tile size based on the amount of memory available.

Since TeAAL does not currently support DRT, the HiFiber code in this section defines the functions needed to describe it. Although DRT supports tiling tensors with more than two dimensions, the functions below support only 2-tensors (matrices) for simplicity. Note also that, while the implementation mimics the behavior of DRT, it may differ from the actual DRT implementation because of simplifications and assumptions.

The tiling process starts by pre-processing the input tensors into micro tiles. Coarsening the tensors into micro tiles ahead of time lets the process build macro tiles dynamically and more efficiently.
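
As a rough illustration of this pre-processing step, the following standalone sketch groups the nonzero coordinates of a matrix into square micro tiles. It does not use HiFiber; the helper name `partition_into_micro_tiles` and the choice to identify each tile by the coordinate of its top-left corner are assumptions made for this example, chosen to mirror how the code below compares upper-rank coordinates against multiples of the tile side.

```python
from collections import defaultdict
from math import isqrt


def partition_into_micro_tiles(points, micro_tile_size):
    """Group (i, j) coordinates of nonzeros into square micro tiles.

    Hypothetical helper for illustration only; not part of HiFiber.
    """
    span = isqrt(micro_tile_size)  # side length of a square micro tile
    tiles = defaultdict(list)
    for i, j in points:
        # Identify the tile by the coordinate of its top-left corner
        corner = ((i // span) * span, (j // span) * span)
        tiles[corner].append((i, j))
    return dict(tiles)


nonzeros = [(0, 0), (1, 1), (0, 3), (3, 2)]
print(partition_into_micro_tiles(nonzeros, micro_tile_size=4))
```

With a micro tile size of 4, each tile covers a 2 x 2 block of coordinates, so nonzeros that share a block end up in the same tile.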

In [None]:
from math import sqrt
def generateMicroTiles(self, micro_tile_size):
    """Returns a tensor partitioned into micro tiles of a specific size
    
    Parameters
    ----------
    self: tensor
        The input tensor
    micro_tile_size: integer
        Specifies the tile size of a micro tile

    Returns
    -------
    partitioned_tensor: tensor
        A tensor partitioned into static-uniform-coordinate micro tiles of size MICRO_TILE_SIZE.
    """
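    number_of_dimensions = self.getDepth()
    # Micro tiles are square: each side spans sqrt(micro_tile_size) coordinates
    scope = int(sqrt(micro_tile_size))
    # Split every rank uniformly, starting from the innermost one
    for i in range(number_of_dimensions-1, -1, -1):
        self = self.splitUniform(scope, depth=i)
    # Move the upper (".1") ranks above the lower (".0") ranks, e.g. K.1, M.1, K.0, M.0
    partitioned_tensor = self.swizzleRanks(rank_ids=sorted(self.getRankIds(), key=lambda x: x[1:], reverse=True))
    return partitioned_tensor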
Tensor.generateMicroTiles = generateMicroTiles

After partitioning a tensor into micro tiles, we also need to compute the footprint of each micro tile. Precomputing the footprint lets DRT build macro tiles later without inspecting each micro tile's metadata again. The function below computes the micro tile footprints, assuming that tensors are stored in a compressed format with coordinate and segment arrays.

We assume that data values are 64 bits wide, while elements of the segment and coordinate arrays are 32 bits wide. We also assume that the data, segment, and coordinate arrays are the only structures that consume memory.
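
The arithmetic behind these assumptions can be sketched on its own, independent of HiFiber. The helper name `micro_tile_footprint_bits` is hypothetical; it takes the number of nonzeros in a micro tile and the number of rows the tile covers (fewer than the full side length for tiles at the edge of the tensor).

```python
DATA_BITS = 64      # assumed width of one data value
METADATA_BITS = 32  # assumed width of one coordinate or segment entry


def micro_tile_footprint_bits(num_nonzeros, num_rows):
    """Footprint of one micro tile under the stated storage assumptions.

    Hypothetical helper for illustration only.
    """
    # Each nonzero stores one data value and one coordinate
    values_and_coords = num_nonzeros * (DATA_BITS + METADATA_BITS)
    # The segment array holds one entry per row, plus one
    segments = METADATA_BITS * (num_rows + 1)
    return values_and_coords + segments


# A full 2 x 2 micro tile with 3 nonzeros
print(micro_tile_footprint_bits(num_nonzeros=3, num_rows=2))  # 384
# An edge tile covering a single row with 2 nonzeros
print(micro_tile_footprint_bits(num_nonzeros=2, num_rows=1))  # 256
```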

In [None]:
def calculateMicroTileFootprint(self, micro_tile_size):
    """Returns a tensor (fiber-tree representation) containing the micro tile footprints of the input tensor
    Parameters
    ----------
    self: tensor
        The input tensor
    micro_tile_size: integer
        Specifies the tile size of a micro tile

    Returns
    -------
    footprint: tensor
        A tensor containing the data footprint of each micro tile in the input tensor.
    """
    span = int(sqrt(micro_tile_size))
    footprint = Tensor(rank_ids=(self.getRankIds()[:2]), name="footprint")
    footprint_i1 = footprint.getRoot()
    self_i1 = self.getRoot()
    for i1, self_j1 in self_i1:
        for j1, self_k0 in self_j1:
            footprint_ref = footprint_i1.getPayloadRef(i1, j1)
            memory_req = self_k0.countValues() * (64 + 32) # bits required to store all data + the coordinate array of a micro tile
            
            # bits required to store the segment array of a micro tile
            if i1 + span > self.getShape()[0]:
                memory_req += 32 * (self.getShape()[0] - i1 + 1)
            else:
                memory_req += 32 * (span + 1)
            
            footprint_ref += memory_req # records the data footprint of a micro tile
    return footprint
Tensor.calculateMicroTileFootprint = calculateMicroTileFootprint

Next, we implement the `dims` function, which later functions rely on to find the ranks that index micro tiles.

In [None]:
def dims(self):
    """Returns rank ids of the micro tiles in a tensor
    Parameters
    ----------
    self: a tensor
        The input tensor
        
    Returns
    -------
    rank_ids: list of strings
        All rank ids required to specify a micro tile in a tensor
    """
    rank_ids = [rank for rank in self.getRankIds() if rank[-1] == '1'] # we only want the rank ids of the micro tiles
    return rank_ids
Tensor.dims = dims

In the DRT paper, the authors implement DRT by growing tensors in order of their stationarity, which prioritizes reuse for the more stationary tensors. To model this behavior, the `sortByStationarity` function orders tensors by their stationarity for a given loop order.
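
The ordering rule can be previewed with plain Python lists before looking at the HiFiber version. In this sketch, each tensor is represented only by its rank ids, and its score is the sum of the loop-order positions of those ranks; a lower score is treated as more stationary. The function name `sort_by_stationarity` and the dictionary representation are assumptions made for this example.

```python
def sort_by_stationarity(tensor_ranks, loop_order):
    """Return tensor names ordered from most to least stationary.

    tensor_ranks maps a tensor name to its list of rank ids.
    Hypothetical helper for illustration only.
    """
    def score(rank_ids):
        # Ranks iterated in outer loops contribute smaller positions
        return sum(loop_order.index(r) for r in rank_ids if r in loop_order)

    return sorted(tensor_ranks, key=lambda name: score(tensor_ranks[name]))


tensor_ranks = {
    "A": ["K.1", "M.1", "K.0", "M.0"],
    "B": ["K.1", "N.1", "K.0", "N.0"],
}
print(sort_by_stationarity(tensor_ranks, ["N.1", "M.1", "K.1"]))
print(sort_by_stationarity(tensor_ranks, ["K.1", "M.1", "N.1"]))
```

With the loop order `N.1, M.1, K.1`, tensor `B` scores 2 and tensor `A` scores 3, so `B` is processed first; with `K.1, M.1, N.1` the order flips.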

In [None]:
def sortByStationarity(tensors, loop_order):
    """Returns the tensors ordered by stationarity
    Parameters
    ----------
    tensors: list of tensors
        A list of input tensors

    loop_order: list of strings
        Gives the loop order at which each rank is processed.

    Returns
    -------
    tensors: list of tensors
        A sorted list of tensors based on their stationarity
    """
    rank_ids = []
    for tensor in tensors:
        rank_ids.append(tensor.getRankIds())
    order = []
    for ids in rank_ids:
        order.append(sum([loop_order.index(x) for x in ids if x in loop_order]))
    return list(sorted(tensors, key=lambda x: order[tensors.index(x)]))

Tensor.sortByStationarity = sortByStationarity

After choosing the order in which the tensors are processed, we also need to select the dimension along which a tile will grow. Section 3.2 of the DRT paper states that DRT grows a macro tile along contracted dimensions before moving on to uncontracted dimensions, which is designed to maximize output locality. To mimic this behavior, we implement the `orderDims` and `selectDimToGrow` functions.
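
The selection rule can be sketched without HiFiber by treating a dimension as contracted when it appears in more than one tensor. The names `order_dims` and `select_dim_to_grow`, and the dictionary of rank ids, are assumptions made for this standalone example.

```python
def order_dims(tensor_ranks, dims):
    """Sort dims so that those shared by more tensors come first.

    Hypothetical helper for illustration only.
    """
    counts = {d: sum(d in ranks for ranks in tensor_ranks.values()) for d in dims}
    # sorted() is stable, so ties keep their original relative order
    return sorted(dims, key=lambda d: counts[d], reverse=True)


def select_dim_to_grow(name, tensor_ranks, constraints):
    """Pick the highest-priority unconstrained dimension of one tensor."""
    for d in order_dims(tensor_ranks, list(constraints)):
        if d in tensor_ranks[name] and constraints[d] is None:
            return d
    return None  # every dimension of this tensor is constrained


tensor_ranks = {"A": ["K.1", "M.1"], "B": ["K.1", "N.1"]}
constraints = {"M.1": None, "K.1": None, "N.1": None}
print(select_dim_to_grow("A", tensor_ranks, constraints))  # contracted K.1 first
constraints["K.1"] = [0, 2]
print(select_dim_to_grow("A", tensor_ranks, constraints))  # then M.1
```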

In [None]:
def orderDims(tensors, dims):
    """Orders the dimensions based on the order they will be processed by DRT
    Parameters
    ----------
    tensors: list of tensors
        All of the tensors involved in the computation
    dims: list of strings
        All rank ids to be considered
        
    Returns
    -------
    dims: list of strings
        A sorted list of rank ids according to their priorities (contracted dimensions are prioritized)
    """
    count = []
    for dimension in dims:
        track = 0
        for tensor in tensors:
            if dimension in tensor.getRankIds():
                track += 1
        count.append(track)
    dims = list(sorted(dims, key=lambda x: count[dims.index(x)], reverse = True)) # sort the ranks by their priority
    return dims
Tensor.orderDims = orderDims

def selectDimToGrow(self, tensors, constraints):
    """Returns the dimension to grow a macro tile
    Parameters
    ----------
    self: a tensor
        The tensor of interest
    tensors: list of tensors
        All of the tensors involved in the computation
    constraints: a dictionary
        Specifies if a dimension is constrained
        
    Returns
    -------
    dimension: a string
        The rank id that specifies the rank along which growDims will be applied
    """
    for dimension in Tensor.orderDims(tensors, list(constraints.keys())):
        if dimension in self.getRankIds() and constraints[dimension] is None: # check if a dimension is constrained
            return dimension
    return None # all dimensions are constrained
Tensor.selectDimToGrow = selectDimToGrow

Now we can move on to the functions responsible for growing a macro tile. We start with a simple function, `grow`, which only updates the dictionary `dim_span` that tracks the macro tile's shape.

In [None]:
def grow(dim, n, dim_span):
    """ Grows the input tensor by N micro tiles along DIM by updating DIM_SPAN
    Parameters
    ----------
    dim: a string
        The dimension along which a tile grows
    n: integer
        The amount by which a tile will grow along DIM
    dim_span: dictionary
        Specifies the shape of the current macro tile
    
    Returns
    -------
    None

    Side effects
    -------
    dim_span: dictionary
        Function modifies DIM_SPAN
    """
    dim_span[dim][1] += n
Tensor.grow = grow

Before growing a macro tile with `grow`, we need to check whether the macro tile can grow along a given dimension in the first place. The function `canGrow` checks the memory constraints of growing a macro tile. Recall that we assume all metadata are 32 bits wide while the data values are 64 bits wide, and that tensors are stored using coordinate and segment arrays.
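
The memory accounting behind this check can be sketched in isolation. The version below is a simplification under stated assumptions: micro tiles are addressed by tile index rather than by coordinate, the row dimension is taken as the one that owns the segment array, and the names `macro_tile_bits` and `can_grow` are hypothetical. Each selected micro tile is charged its own footprint plus three 32-bit metadata entries, matching the assumptions above.

```python
METADATA_BITS = 32  # assumed width of every metadata entry


def macro_tile_bits(footprints, row_span, col_span):
    """Bits needed to hold the micro tiles inside the given spans.

    footprints maps (row_index, col_index) of a micro tile to its footprint.
    Spans are half-open [start, end) ranges of micro tile indices.
    """
    bits = METADATA_BITS  # first element of the segment array
    for (row, col), tile_bits in footprints.items():
        if row_span[0] <= row < row_span[1] and col_span[0] <= col < col_span[1]:
            # micro tile + coordinate + size metadata + pointer
            bits += tile_bits + 3 * METADATA_BITS
    # one further segment entry per row of micro tiles in the span
    bits += (row_span[1] - row_span[0]) * METADATA_BITS
    return bits


def can_grow(footprints, row_span, col_span, budget_bits):
    """True if the macro tile described by the spans fits in the budget."""
    return macro_tile_bits(footprints, row_span, col_span) <= budget_bits


footprints = {(0, 0): 384, (0, 1): 256, (1, 0): 384}
print(macro_tile_bits(footprints, (0, 1), (0, 2)))
print(can_grow(footprints, (0, 1), (0, 2), budget_bits=3000))
```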

In [None]:
def canGrow(self, dim, n, dim_span, micro_tile_size, micro_tile_loop_order, allocated_buffer_size, outer_dim):
    """Checks if it is possible for the input tensor to grow by N micro tiles along DIM
    Parameters
    ----------
    self: a tensor
        The tensor of interest
    dim: a string
        The dimension along which a tile grows
    n: integer
        The amount by which a tile will grow along DIM
    dim_span: dictionary
        Specifies the shape of the current macro tile
    micro_tile_size: integer
        Specifies the tile size of a micro tile
    micro_tile_loop_order: list of strings
        Specifies the loop order of which a micro tile is processed
    allocated_buffer_size: integer
        The amount of memory allocated to the tensor of interest at the next level of the memory hierarchy (in bits)
    outer_dim: string
        The dimension that will be used for the segment array
    
    Returns
    -------
    can_grow: boolean
        Indicates whether or not growing N micro tiles along DIM is possible
    """
    memory_req = 32 # number of bits required to store the first element of the segment array
    difference = int(sqrt(micro_tile_size))
    shape = self.getShape()[self.getRankIds().index(dim)]
    if (dim_span[dim][1] + n - 2) * difference >= shape: # cannot access non-existent micro tiles
        return False
    dim_span[dim][1] += n

    # calculate memory required
    microtile_footprint = self.getMicroTileFootprint_LoopOrder(micro_tile_size, micro_tile_loop_order)
    microtile_footprint_i = microtile_footprint.getRoot()
    for i, microtile_footprint_j in microtile_footprint_i:
        if i >= dim_span[self.getRankIds()[0]][1] * difference: # checks if we are selecting micro tiles out of scope
            break
        if i >= dim_span[self.getRankIds()[0]][0] * difference:
            for j, val in microtile_footprint_j:
                if j >= dim_span[self.getRankIds()[1]][1] * difference: # checks if we are selecting micro tiles out of scope
                    break
                if j >= dim_span[self.getRankIds()[1]][0] * difference:
                    memory_req += val # micro tile
                    memory_req += 3 * 32 # coordinate array + micro tile size metadata + micro tile pointer
    segment_span = dim_span[outer_dim]
    memory_req += (segment_span[1] - segment_span[0]) * 32
    # revert the trial change to dim_span and return can_grow
    # print("Can grow?: Macro tile size: " + str(dim_span) + ". Required memory: " + str(memory_req) + "." + "Memory given: " + str(allocated_buffer_size))
    dim_span[dim][1] -= n
    can_grow = memory_req <= allocated_buffer_size
    return can_grow
Tensor.canGrow = canGrow

def getMicroTileFootprint_LoopOrder(self, micro_tile_size, micro_tile_loop_order):
    """Calculates the micro tile footprint after choosing, from the micro tile loop order, which rank is used for the segment and coordinate arrays."""
    order = dict()
    originalRankIds = list(self.getRankIds())
    for ID in originalRankIds[:2]:
        order[ID] = micro_tile_loop_order.index(ID[0] + ".0")
    changedRankIds = list(sorted(originalRankIds[:2], key=lambda x: order[x])) + originalRankIds[2:]
    res = self.swizzleRanks(rank_ids=changedRankIds).calculateMicroTileFootprint(micro_tile_size)
    res = res.swizzleRanks(rank_ids=originalRankIds[:2])
    return res
Tensor.getMicroTileFootprint_LoopOrder = getMicroTileFootprint_LoopOrder

With `selectDimToGrow`, `grow`, and `canGrow`, we can now implement `growDims`, which grows a macro tile until all dimensions of a tensor are constrained.
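
The control flow of this growth loop can be sketched with plain dictionaries. In the sketch below, `grow_dims` and `toy_can_grow` are hypothetical names, the dimensions are assumed to be given already in priority order, and the capacity test is a stand-in (a cap on the number of micro tiles) rather than the bit-level check used in this notebook.

```python
def grow_dims(dims, dim_span, constraints, can_grow, n=1):
    """Grow along each dimension in priority order until none can grow."""
    def next_dim():
        return next((d for d in dims if constraints[d] is None), None)

    dim = next_dim()
    while dim is not None:
        if can_grow(dim, n, dim_span):
            dim_span[dim][1] += n  # extend the macro tile by n micro tiles
        else:
            constraints[dim] = list(dim_span[dim])  # freeze this dimension
            dim = next_dim()


EXTENTS = {"K.1": 2, "M.1": 5}  # toy number of micro tiles per dimension
MAX_TILES = 6                   # toy capacity, counted in micro tiles


def toy_can_grow(dim, n, dim_span):
    """Stand-in capacity check: stay inside the tensor and under MAX_TILES."""
    trial = {d: list(s) for d, s in dim_span.items()}
    trial[dim][1] += n
    if trial[dim][1] > EXTENTS[dim]:
        return False
    tiles = 1
    for lo, hi in trial.values():
        tiles *= hi - lo
    return tiles <= MAX_TILES


dim_span = {"K.1": [0, 0], "M.1": [0, 1]}
constraints = {"K.1": None, "M.1": None}
grow_dims(["K.1", "M.1"], dim_span, constraints, toy_can_grow)
print(constraints)
```

Here the contracted dimension `K.1` grows first until it reaches the edge of the tensor, and `M.1` then grows until the toy capacity is exhausted.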

In [None]:
def growDims(self, tensors, n, dim_span, constraints, micro_tile_size, micro_tile_loop_order, allocated_buffer_size, outer_dim):
    """Grows a macro tile until all of a tensor's dimensions are constrained
    Parameters
    ----------
    self: tensor
        The tensor of interest
    tensors: list of tensors
        All tensors considered for the computation
    n: integer
        The step size used every time DRT grows a macro tile
    dim_span: dictionary
        Specifies the shape of the current macro tile
    constraints: dictionary
        Specifies if a dimension is constrained
    micro_tile_size: integer
        Specifies the tile size of a micro tile
    micro_tile_loop_order: list of strings
        Specifies the loop order of which a micro tile is processed
    allocated_buffer_size: integer
        The amount of memory allocated to the tensor of interest at the next level of the memory hierarchy
    outer_dim: string
        The dimension that will be used for the segment array

    Returns
    -------
    None

    Side effects
    -------
    dim_span: dictionary
        Modifies dim_span to reflect the current shape of the macro tile
    constraints: dictionary
        Modifies constraints, a dictionary that tracks the constraints for each dimension of a tensor
    """
    dim = self.selectDimToGrow(tensors, constraints)
    while dim is not None:
        if self.canGrow(dim, n, dim_span, micro_tile_size, micro_tile_loop_order, allocated_buffer_size, outer_dim):
            Tensor.grow(dim, n, dim_span)
        else:
            constraints[dim] = dim_span[dim]
            dim = self.selectDimToGrow(tensors, constraints)
Tensor.growDims = growDims

In [None]:
def print_macro_tile(self, constraints, micro_tile_size):
    """Prints a macro tile given a set of constraints"""
    dims = self.getRankIds()
    Z = Tensor(rank_ids=(self.getRankIds()))
    Z.setName("Macro tile " + self.getName()[0])
    z_i1 = Z.getRoot()
    self_i = self.getRoot()
    difference = int(sqrt(micro_tile_size))
    for i, self_j in self_i:
        if i >= constraints[self.getRankIds()[0]][1] * difference: # checks if we are selecting micro tiles out of scope
            break
        if i >= constraints[self.getRankIds()[0]][0] * difference:
            for j, self_k in self_j:
                if j >= constraints[self.getRankIds()[1]][1] * difference: # checks if we are selecting micro tiles out of scope
                    break
                if j >= constraints[self.getRankIds()[1]][0] * difference:
                    z_k = z_i1.getPayloadRef(i, j)
                    z_k <<= self_k
    displayTensor(Z)
Tensor.print_macro_tile = print_macro_tile

Finally, the function responsible for the entire DRT process is implemented below.

In [None]:
def DRT(tensors, loop_order, init_tile_size, micro_tile_size, micro_tile_loop_order, buffer_size):
    """Builds and prints the macro tiles of the input tensors
    Parameters
    ----------
    tensors: list of tensors
        Includes all tensors processed
    loop_order: list of strings
        Specifies the loop order of the kernel
    init_tile_size: dictionary
        Gives the initial macro tile size
    micro_tile_size: integer
        Specifies the tile size of a micro tile
    micro_tile_loop_order: list of strings
        Specifies the loop order of which a micro tile is processed
    buffer_size: integer
        The size of the buffer DRT is building a macro tile for in bits

    Returns
    -------
    None
    """
    all_dims = []

    # Prepare all tensors for processing
    for tensor in tensors:
        tensor = tensor.swizzleRanks(rank_ids = Tensor.orderDims(tensors, tensor.getRankIds()[0:2]) + Tensor.orderDims(tensors, tensor.getRankIds()[2:]))
        all_dims.append(set(tensor.dims()))

    # Find all dimensions involved
    all_dims = all_dims[0].union(all_dims[1])

    # Set up constraints and dim_span. Both are needed for function calls later
    constraints = {dim: None for dim in all_dims}
    dim_span =  {dim: init_tile_size[dim].copy() for dim in all_dims}

    n = 1
    tensors = Tensor.sortByStationarity(tensors, loop_order)
    tensor_A = tensors[0]
    tensor_B = tensors[1]
    buffer_allocation = buffer_size // 3
    
    contracted_dim = Tensor.orderDims(tensors, tensor_A.getRankIds()[0:2])[0]
    A_uncontracted_dim = Tensor.orderDims(tensors, tensor_A.getRankIds()[0:2])[1]
    B_uncontracted_dim = Tensor.orderDims(tensors, tensor_B.getRankIds()[0:2])[1]

    A_constrain_contracted_dim = True
    if loop_order.index(contracted_dim) > loop_order.index(A_uncontracted_dim):
        A_constrain_contracted_dim = False
    B_constrain_contracted_dim = True
    if loop_order.index(contracted_dim) > loop_order.index(B_uncontracted_dim):
        B_constrain_contracted_dim = False

    if A_constrain_contracted_dim:
        A_dim_one = contracted_dim
        A_dim_two = A_uncontracted_dim
    else:
        A_dim_one = A_uncontracted_dim
        A_dim_two = contracted_dim
    
    # grow along contracted_dim
    dim_span[A_dim_one] = [0, 0]
    while tensor_A.canGrow(A_dim_one, n, dim_span, micro_tile_size, micro_tile_loop_order, buffer_allocation, A_dim_one):
        dim_span[A_dim_one][1] += init_tile_size[A_dim_one][1]
        # grow along A_dim_two
        constraints[A_dim_two] = None
        dim_span[A_dim_two] = init_tile_size[A_dim_two].copy()
        while tensor_A.canGrow(A_dim_two, n, dim_span, micro_tile_size, micro_tile_loop_order, buffer_allocation, A_dim_one):
            tensor_A.growDims(tensors, n, dim_span, constraints, micro_tile_size, micro_tile_loop_order, buffer_allocation, A_dim_one)
            print(f"Macro tile with shape: {constraints}")
            tensor_A.print_macro_tile(constraints, micro_tile_size)
            print("\n")
            
            # grow along B_uncontracted_dim while constrained to A
            constraints[B_uncontracted_dim] = None
            dim_span[B_uncontracted_dim] = init_tile_size[B_uncontracted_dim].copy()
            while tensor_B.canGrow(B_uncontracted_dim, n, dim_span, micro_tile_size, micro_tile_loop_order, buffer_allocation, contracted_dim):
                tensor_B.growDims(tensors, n, dim_span, constraints, micro_tile_size, micro_tile_loop_order, buffer_allocation, contracted_dim)
                print(f"Macro tile with shape: {constraints}")
                tensor_B.print_macro_tile(constraints, micro_tile_size)
                print("\n")
                val = constraints[B_uncontracted_dim][1]
                constraints[B_uncontracted_dim] = None
                dim_span[B_uncontracted_dim] = [val, val]
            val = constraints[A_dim_two][1]
            constraints[A_dim_two] = None
            dim_span[A_dim_two] = [val, val]

        val = constraints[A_dim_one][1]
        dim_span[A_dim_one] = [val, val]
        constraints[A_dim_one] = None
Tensor.DRT = DRT

Let's test the DRT function!\
**Note**: The `buffer_size` variable below is given in bits. It specifies the amount of memory available. For our implementation of DRT, we assume that the two input tensors and the output tensor share the available buffer space equally.

In [None]:
micro_tile_size = 4

K = 15
M = 15
N = 15

density = [0.8,0.8]
seed = 0

A_KM = Tensor.fromRandom(rank_ids=["K", "M"], shape=[K, M], seed=seed, density=density, name="A")
B_KN = Tensor.fromRandom(rank_ids=["K", "N"], shape=[K, N], seed=seed + 1, density=density, name="B")
A_K1M1K0M0 = A_KM.generateMicroTiles(micro_tile_size)
B_K1N1K0N0 = B_KN.generateMicroTiles(micro_tile_size)
tensors = [A_K1M1K0M0, B_K1N1K0N0]
loop_order = ["N.1", "M.1", "K.1"]
micro_tile_loop_order = ["K.0", "M.0", "N.0"]
init_tile_size = {"K.1": [0,0], "M.1": [0,0], "N.1": [0,2]}
buffer_size = 9000

print("Tensor A")
displayTensor(A_KM)
print("\nTensor B")
displayTensor(B_KN)

print("\nTensor A pre-processed into micro tiles")
displayTensor(A_K1M1K0M0)
print("\nTensor B pre-processed into micro tiles")
displayTensor(B_K1N1K0N0)

print("\nTensor A micro tile footprint")
displayTensor(A_K1M1K0M0.getMicroTileFootprint_LoopOrder(micro_tile_size, micro_tile_loop_order))
print("\nTensor B micro tile footprint")
displayTensor(B_K1N1K0N0.getMicroTileFootprint_LoopOrder(micro_tile_size, micro_tile_loop_order))
print("\n")

Tensor.DRT(tensors, loop_order, init_tile_size, micro_tile_size, micro_tile_loop_order, buffer_size)

In [None]:
micro_tile_size = 4

K = 10
M = 10
N = 10

density = [0.8,0.8]
seed = 0

A_KM = Tensor.fromRandom(rank_ids=["K", "M"], shape=[K, M], seed=seed, density=density, name="A")
B_KN = Tensor.fromRandom(rank_ids=["K", "N"], shape=[K, N], seed=seed + 1, density=density, name="B")
A_K1M1K0M0 = A_KM.generateMicroTiles(micro_tile_size)
B_K1N1K0N0 = B_KN.generateMicroTiles(micro_tile_size)
tensors = [A_K1M1K0M0, B_K1N1K0N0]
loop_order = ["K.1", "M.1", "N.1"]
micro_tile_loop_order = ["K.0", "M.0", "N.0"]
init_tile_size = {"K.1": [0,0], "M.1": [0,3], "N.1": [0,0]}
buffer_size = 9000

print("Tensor A")
displayTensor(A_KM)
print("\nTensor B")
displayTensor(B_KN)

print("\nTensor A pre-processed into micro tiles")
displayTensor(A_K1M1K0M0)
print("\nTensor B pre-processed into micro tiles")
displayTensor(B_K1N1K0N0)

print("\nTensor A micro tile footprint")
displayTensor(A_K1M1K0M0.getMicroTileFootprint_LoopOrder(micro_tile_size, micro_tile_loop_order))
print("\nTensor B micro tile footprint")
displayTensor(B_K1N1K0N0.getMicroTileFootprint_LoopOrder(micro_tile_size, micro_tile_loop_order))
print("\n")

Tensor.DRT(tensors, loop_order, init_tile_size, micro_tile_size, micro_tile_loop_order, buffer_size)

## Matrix Multiplication

After DRT produces the tiled tensors, we still need functions that perform the matrix multiplication itself.

Since the `DRT` function above prints the tiled tensors rather than returning them, we need a modified version. The `getDRT` function below returns two lists: `res_A` is a list of macro tiles, and `res_B[i]` is the list of macro tiles of the other tensor generated for `res_A[i]`.

In [None]:
def getDRT(tensors, loop_order, init_tile_size, micro_tile_size, micro_tile_loop_order, buffer_size):
    """Builds and returns the macro tiles of the input tensors
    Parameters
    ----------
    tensors: list of tensors
        Includes all tensors processed
    loop_order: list of strings
        Specifies the loop order of the kernel
    init_tile_size: dictionary
        Gives the initial macro tile size
    micro_tile_size: integer
        Specifies the tile size of a micro tile
    micro_tile_loop_order: list of strings
        Specifies the loop order of which a micro tile is processed
    buffer_size: integer
        The size of the buffer DRT is building a macro tile for in bits

    Returns
    -------
    res_A: list of tensors
        List contains macro tiles of A
    res_B: list of tensors
        List contains macro tiles of B
    """
    
    all_dims = []

    # Prepare all tensors for processing
    for tensor in tensors:
        tensor = tensor.swizzleRanks(rank_ids = Tensor.orderDims(tensors, tensor.getRankIds()[0:2]) + Tensor.orderDims(tensors, tensor.getRankIds()[2:]))
        all_dims.append(set(tensor.dims()))

    # Find all dimensions involved
    all_dims = all_dims[0].union(all_dims[1])

    # Set up constraints and dim_span. Both are needed for function calls later
    constraints = {dim: None for dim in all_dims}
    dim_span =  {dim: init_tile_size[dim].copy() for dim in all_dims}

    n = 1
    tensors = Tensor.sortByStationarity(tensors, loop_order)
    tensor_A = tensors[0]
    tensor_B = tensors[1]
    buffer_allocation = buffer_size // 3
    
    contracted_dim = Tensor.orderDims(tensors, tensor_A.getRankIds()[0:2])[0]
    A_uncontracted_dim = Tensor.orderDims(tensors, tensor_A.getRankIds()[0:2])[1]
    B_uncontracted_dim = Tensor.orderDims(tensors, tensor_B.getRankIds()[0:2])[1]

    A_constrain_contracted_dim = True
    if loop_order.index(contracted_dim) > loop_order.index(A_uncontracted_dim):
        A_constrain_contracted_dim = False
    B_constrain_contracted_dim = True
    if loop_order.index(contracted_dim) > loop_order.index(B_uncontracted_dim):
        B_constrain_contracted_dim = False

    if A_constrain_contracted_dim:
        A_dim_one = contracted_dim
        A_dim_two = A_uncontracted_dim
    else:
        A_dim_one = A_uncontracted_dim
        A_dim_two = contracted_dim

    res_A = []
    res_B = []
    
    # grow along contracted_dim
    dim_span[A_dim_one] = [0, 0]
    while tensor_A.canGrow(A_dim_one, n, dim_span, micro_tile_size, micro_tile_loop_order, buffer_allocation, A_dim_one):
        dim_span[A_dim_one][1] += init_tile_size[A_dim_one][1]
        # grow along A_dim_two
        constraints[A_dim_two] = None
        dim_span[A_dim_two] = init_tile_size[A_dim_two].copy()
        while tensor_A.canGrow(A_dim_two, n, dim_span, micro_tile_size, micro_tile_loop_order, buffer_allocation, A_dim_one):
            tensor_A.growDims(tensors, n, dim_span, constraints, micro_tile_size, micro_tile_loop_order, buffer_allocation, A_dim_one)
            res_A.append(tensor_A.return_macro_tile(constraints, micro_tile_size))
            res_B.append([])
            
            # grow along B_uncontracted_dim while constrained to A
            constraints[B_uncontracted_dim] = None
            dim_span[B_uncontracted_dim] = init_tile_size[B_uncontracted_dim].copy()
            while tensor_B.canGrow(B_uncontracted_dim, n, dim_span, micro_tile_size, micro_tile_loop_order, buffer_allocation, contracted_dim):
                tensor_B.growDims(tensors, n, dim_span, constraints, micro_tile_size, micro_tile_loop_order, buffer_allocation, contracted_dim)
                res_B[-1].append(tensor_B.return_macro_tile(constraints, micro_tile_size))
                val = constraints[B_uncontracted_dim][1]
                constraints[B_uncontracted_dim] = None
                dim_span[B_uncontracted_dim] = [val, val]
            val = constraints[A_dim_two][1]
            constraints[A_dim_two] = None
            dim_span[A_dim_two] = [val, val]

        val = constraints[A_dim_one][1]
        dim_span[A_dim_one] = [val, val]
        constraints[A_dim_one] = None
    return res_A, res_B
Tensor.getDRT = getDRT




def return_macro_tile(self, constraints, micro_tile_size):
    """Returns a macro tile given a set of constraints"""
    dims = self.getRankIds()
    Z = Tensor(rank_ids=(self.getRankIds()))
    difference = int(sqrt(micro_tile_size))
    Z.setName("Macro tile " + self.getName()[0] + 
              "; Base point: " + 
              str(constraints[self.getRankIds()[0]][0] * difference) + "," + 
              str(constraints[self.getRankIds()[1]][0] * difference))
    z_i1 = Z.getRoot()
    self_i = self.getRoot()
    for i, self_j in self_i:
        if i >= constraints[self.getRankIds()[0]][1] * difference: # checks if we are selecting micro tiles out of scope
            break
        if i >= constraints[self.getRankIds()[0]][0] * difference:
            for j, self_k in self_j:
                if j >= constraints[self.getRankIds()[1]][1] * difference: # checks if we are selecting micro tiles out of scope
                    break
                if j >= constraints[self.getRankIds()[1]][0] * difference:
                    z_k = z_i1.getPayloadRef(i, j)
                    z_k <<= self_k
    return Z
Tensor.return_macro_tile = return_macro_tile

Next, we implement the multiplication of micro tiles under three dataflows: inner product, outer product, and Gustavson's.
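
The three dataflows differ only in the order of the loops over `m`, `n`, and `k`. The dense, standalone sketch below makes this explicit; it assumes, as in this notebook, that `A` is indexed `[k][m]` and `B` is indexed `[k][n]`. The function `matmul_with_order` is a hypothetical helper and ignores sparsity entirely.

```python
from itertools import product


def matmul_with_order(A, B, order):
    """Compute Z[m][n] = sum_k A[k][m] * B[k][n] with a chosen loop order.

    order is a string such as "mnk" giving the loops from outermost to
    innermost. Hypothetical helper for illustration only.
    """
    extents = {"k": len(A), "m": len(A[0]), "n": len(B[0])}
    Z = [[0] * extents["n"] for _ in range(extents["m"])]
    for index in product(*(range(extents[r]) for r in order)):
        c = dict(zip(order, index))
        Z[c["m"]][c["n"]] += A[c["k"]][c["m"]] * B[c["k"]][c["n"]]
    return Z


A = [[1, 2], [3, 4]]  # K x M
B = [[5, 6], [7, 8]]  # K x N

inner = matmul_with_order(A, B, "mnk")      # inner product: k innermost
outer = matmul_with_order(A, B, "kmn")      # outer product: k outermost
gustavson = matmul_with_order(A, B, "mkn")  # Gustavson's: k in the middle
print(inner, inner == outer == gustavson)
```

All three orders produce the same result; what changes is which operand stays in place while the others stream past it.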

In [None]:
def microTile_innerProduct(A, B, base_point_A, base_point_B):
    """Handles micro tile multiplication using inner product dataflow"""
    A = A.swizzleRanks(rank_ids=["M.0", "K.0"])
    B = B.swizzleRanks(rank_ids=["N.0", "K.0"])
    A_M0K0 = Tensor(rank_ids=["M.0", "K.0"], name="A_M0K0")
    a_m0 = A.getRoot()
    for m0, a_k0 in a_m0:
        for k0, a_val in a_k0:
            A_M0K0_ref = A_M0K0.getPayloadRef(m0 - base_point_A["M.0"], k0 - base_point_A["K.0"])
            A_M0K0_ref += a_val
    B_N0K0 = Tensor(rank_ids=["N.0", "K.0"], name="B_N0K0")
    b_n0 = B.getRoot()
    for n0, b_k0 in b_n0:
        for k0, b_val in b_k0:
            B_N0K0_ref = B_N0K0.getPayloadRef(n0 - base_point_B["N.0"], k0 - base_point_B["K.0"])
            B_N0K0_ref += b_val

    Z_MN = Tensor(rank_ids=["M", "N"], name="Z")
    z_m = Z_MN.getRoot()
    a_m0 = A_M0K0.getRoot()
    b_n0 = B_N0K0.getRoot()
    for m_pos, (m, (z_n, a_k0)) in enumerate(z_m << a_m0):
        for n_pos, (n, (z_ref, b_k0)) in enumerate(z_n << b_n0):
            for k_pos, (k, (a_val, b_val)) in enumerate(a_k0 & b_k0):
                z_ref += a_val * b_val
    return Z_MN

def microTile_outerProduct(A, B, base_point_A, base_point_B):
    """Handles micro tile multiplication using outer product dataflow"""
    A = A.swizzleRanks(rank_ids=["K.0", "M.0"])
    B = B.swizzleRanks(rank_ids=["K.0", "N.0"])
    A_K0M0 = Tensor(rank_ids=["K.0", "M.0"], name="A_K0M0")
    a_k0 = A.getRoot()
    for k0, a_m0 in a_k0:
        for m0, a_val in a_m0:
            A_K0M0_ref = A_K0M0.getPayloadRef(k0 - base_point_A["K.0"], m0 - base_point_A["M.0"])
            A_K0M0_ref += a_val
    B_K0N0 = Tensor(rank_ids=["K.0", "N.0"], name="B_K0N0")
    b_k0 = B.getRoot()
    for k0, b_n0 in b_k0:
        for n0, b_val in b_n0:
            B_K0N0_ref = B_K0N0.getPayloadRef(k0 - base_point_B["K.0"], n0 - base_point_B["N.0"])
            B_K0N0_ref += b_val
    
    Z_MN = Tensor(rank_ids=["M", "N"], name="Z")
    z_m = Z_MN.getRoot()
    a_k0 = A_K0M0.getRoot()
    b_k0 = B_K0N0.getRoot()
    for k_pos, (k, (a_m0, b_n0)) in enumerate(a_k0 & b_k0):
        for m_pos, (m, (z_n, a_val)) in enumerate(z_m << a_m0):
            for n_pos, (n, (z_ref, b_val)) in enumerate(z_n << b_n0):
                z_ref += a_val * b_val
    return Z_MN

def microTile_Gustavson(A, B, base_point_A, base_point_B):
    """Handles micro tile multiplication using Gustavson's dataflow"""
    A = A.swizzleRanks(rank_ids=["M.0", "K.0"])
    B = B.swizzleRanks(rank_ids=["K.0", "N.0"])
    A_M0K0 = Tensor(rank_ids=["M.0", "K.0"], name="A_M0K0")
    a_m0 = A.getRoot()
    for m0, a_k0 in a_m0:
        for k0, a_val in a_k0:
            A_M0K0_ref = A_M0K0.getPayloadRef(m0 - base_point_A["M.0"], k0 - base_point_A["K.0"])
            A_M0K0_ref += a_val
    B_K0N0 = Tensor(rank_ids=["K.0", "N.0"], name="B_K0N0")
    b_k0 = B.getRoot()
    for k0, b_n0 in b_k0:
        for n0, b_val in b_n0:
            B_K0N0_ref = B_K0N0.getPayloadRef(k0 - base_point_B["K.0"], n0 - base_point_B["N.0"])
            B_K0N0_ref += b_val
            
    Z_MN = Tensor(rank_ids=["M", "N"], name="Z")
    z_m = Z_MN.getRoot()
    a_m0 = A_M0K0.getRoot()
    b_k0 = B_K0N0.getRoot()
    for m_pos, (m, (z_n, a_k0)) in enumerate(z_m << a_m0):
        for k_pos, (k, (a_val, b_n0)) in enumerate(a_k0 & b_k0):
            for n_pos, (n, (z_ref, b_val)) in enumerate(z_n << b_n0):
                z_ref += a_val * b_val
    return Z_MN

After multiplying micro tiles, we also need a function that accumulates each micro tile product into the final result of the matrix multiplication.
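
The accumulation step amounts to shifting each entry of a micro tile product by the tile's base point and adding it to the running result. A minimal sketch with plain dictionaries follows; the name `update_matrix` and the `(m, n)` tuple keys are assumptions made for this example.

```python
from collections import defaultdict


def update_matrix(final_result, product, base_point):
    """Add a micro tile product into final_result at an offset.

    product maps tile-local (m, n) to a value; base_point is the (m, n)
    coordinate of the tile's origin. Hypothetical helper for illustration.
    """
    base_m, base_n = base_point
    for (m, n), value in product.items():
        final_result[(m + base_m, n + base_n)] += value


final_result = defaultdict(int)
update_matrix(final_result, {(0, 0): 26, (1, 1): 44}, base_point=(2, 4))
# A second partial product for the same tile accumulates in place
update_matrix(final_result, {(0, 0): 4}, base_point=(2, 4))
print(dict(final_result))
```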

In [None]:
def updateMatrix(finalResult, product, base_point):
    """Updates the FINALRESULT tensor given PRODUCT and BASE_POINT"""
    product_m = product.getRoot()
    for m, product_n in product_m:
        for n, product_val in product_n:
            finalResult_ref = finalResult.getPayloadRef(m + base_point["M.0"], n + base_point["N.0"])
            finalResult_ref += product_val
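
The micro tile functions subtract a base point to work in tile-local coordinates, and `updateMatrix` adds the base point back when accumulating. A minimal sketch of that accumulation on plain dictionaries follows. `accumulate_tile` is a hypothetical name, and the `{m: {n: value}}` layout is an assumption.

```python
def accumulate_tile(result, product, base_m, base_n):
    """Add a tile-local product into the global result, shifted by the tile's base point."""
    for m, row in product.items():
        for n, val in row.items():
            out_row = result.setdefault(m + base_m, {})
            out_row[n + base_n] = out_row.get(n + base_n, 0) + val

result = {}
# Two partial products that land on the same output tile (base point M=2, N=4).
accumulate_tile(result, {0: {0: 1, 1: 2}}, base_m=2, base_n=4)
accumulate_tile(result, {0: {1: 5}}, base_m=2, base_n=4)
```

Because the update is an accumulation rather than an overwrite, partial products from different `K` tiles that map to the same output coordinate are summed.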

Next, we will define functions that operate at the PE tile level.

In [None]:
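def PETile_outerProduct(finalResult, A, B, func):
    """Multiplies the micro tiles of a PE tile pair in outer-product order (K.1 outermost), accumulating into FINALRESULT"""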
    A = A.swizzleRanks(rank_ids=["K.1", "M.1", "K.0", "M.0"])
    B = B.swizzleRanks(rank_ids=["K.1", "N.1", "K.0", "N.0"])
    a_k1 = A.getRoot()
    b_k1 = B.getRoot()
    for k1_pos, (k1, (a_m1, b_n1)) in enumerate(a_k1 & b_k1):
        for m1, a_k0 in a_m1:
            in1 = Tensor.fromFiber(rank_ids=["K.0", "M.0"], fiber=a_k0)
            for n1, b_k0 in b_n1:
                in2 = Tensor.fromFiber(rank_ids=["K.0", "N.0"], fiber=b_k0)
                product = func(in1, in2, {"K.0": k1_pos, "M.0": m1}, {"K.0": k1_pos, "N.0": n1})
                updateMatrix(finalResult, product, {"M.0": m1, "N.0": n1})

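def PETile_Gustavson(finalResult, A, B, func):
    """Multiplies the micro tiles of a PE tile pair in Gustavson order (M.1 outermost), accumulating into FINALRESULT"""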
    A = A.swizzleRanks(rank_ids=["M.1", "K.1", "M.0", "K.0"])
    B = B.swizzleRanks(rank_ids=["K.1", "N.1", "K.0", "N.0"])
    a_m1 = A.getRoot()
    b_k1 = B.getRoot()
    for m1, a_k1 in a_m1:
        for k1_pos, (k1, (a_m0, b_n1)) in enumerate(a_k1 & b_k1):
            in1 = Tensor.fromFiber(rank_ids=["M.0", "K.0"], fiber=a_m0)
            for n1, b_k0 in b_n1:
                in2 = Tensor.fromFiber(rank_ids=["K.0", "N.0"], fiber=b_k0)
                product = func(in1, in2, {"K.0": k1_pos, "M.0": m1}, {"K.0": k1_pos, "N.0": n1})
                updateMatrix(finalResult, product, {"M.0": m1, "N.0": n1})
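
The two PE tile functions visit the same pairs of micro tiles in different orders. The sketch below lists those visits for a small set of non-empty micro tiles. It assumes micro tiles are identified by `(k1, m1)` for A and `(k1, n1)` for B, and the function names are hypothetical.

```python
def pairs_outer_product(a_tiles, b_tiles):
    """Yield (k1, m1, n1) with K.1 outermost."""
    shared_k1 = sorted({k for k, _ in a_tiles} & {k for k, _ in b_tiles})
    for k1 in shared_k1:
        for m1 in sorted(m for k, m in a_tiles if k == k1):
            for n1 in sorted(n for k, n in b_tiles if k == k1):
                yield (k1, m1, n1)

def pairs_gustavson(a_tiles, b_tiles):
    """Yield (k1, m1, n1) with M.1 outermost."""
    b_k1 = {k for k, _ in b_tiles}
    for m1 in sorted({m for _, m in a_tiles}):
        for k1 in sorted(k for k, m in a_tiles if m == m1 and k in b_k1):
            for n1 in sorted(n for k, n in b_tiles if k == k1):
                yield (k1, m1, n1)

# Coordinates of non-empty micro tiles (illustrative values).
a_tiles = {(0, 0), (0, 2), (2, 0)}
b_tiles = {(0, 0), (2, 0), (2, 2)}

order_op = list(pairs_outer_product(a_tiles, b_tiles))
order_gu = list(pairs_gustavson(a_tiles, b_tiles))
```

The two schedules contain the same multiplications. Only the order changes, which in turn changes which operand stays resident between consecutive micro tile products.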

Similarly, we need to define a function that operates at the global tile level.

In [None]:
def matrixMultiplication(finalResult, A, B, PE_tile_func, micro_tile_func, loop_order, init_tile_size, micro_tile_size, micro_tile_loop_order, buffer_size):
    """Builds DRT tiles from A and B, then runs PE_TILE_FUNC on every pair of matching tiles"""
    tensors = [A, B]
    tiles_A, tiles_B = Tensor.getDRT(tensors, loop_order, init_tile_size, micro_tile_size, micro_tile_loop_order, buffer_size)
    for i in range(len(tiles_A)):
        tile_A = tiles_A[i]
        for tile_B in tiles_B[i]:
            PE_tile_func(finalResult, tile_A, tile_B, micro_tile_func)

Finally, we need to define a function that handles the tiling and distribution of global tiles (this is only used for ExTensor-OP-DRT).

In [None]:
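def matrixMultiplication_ExTensor_OP_DRT(finalResult, A, B, PE_tile_func, micro_tile_func, loop_order_1, loop_order_2, init_tile_size_1, init_tile_size_2, micro_tile_size, micro_tile_loop_order, buffer_size_global, buffer_size_local):
    """Applies DRT at the global buffer level, then runs matrixMultiplication (a second round of DRT) on each pair of non-empty global tiles"""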
    tensors = [A, B]
    tiles_A, tiles_B = Tensor.getDRT(tensors, loop_order_1, init_tile_size_1, micro_tile_size, micro_tile_loop_order, buffer_size_global)
    for i in range(len(tiles_A)):
        tile_A = tiles_A[i]
        if tile_A.getRoot().countValues() == 0:
            continue
        for tile_B in tiles_B[i]:
            if tile_B.getRoot().countValues() == 0:
                continue
            matrixMultiplication(finalResult, tile_B, tile_A, PE_tile_func, micro_tile_func, loop_order_2, init_tile_size_2, micro_tile_size, micro_tile_loop_order, buffer_size_local)

## Initialization

Initialize the input tensors. Tensor shapes and densities can be modified below.

Note the following:
1. In the paper, the authors specified in section 5.2.4 that "input matrices are pre-processed into micro tiles of shape 32 × 32 for all workloads." However, for visualization and demonstration purposes, we will implement dynamic reflexive tiling based on micro tiles of shape 2 × 2 instead.
2. Similarly, for visualization and demonstration purposes, the global buffer size and local buffer size are also reduced.
3. The paper's focus is on the DRT algorithm and implementation. Strategies to determine the starting tile size and memory allocation for the input and output tensors are not the focus. Thus, for simplicity, our DRT implementation assumes that the input matrices A and B, as well as the output matrix Z, each occupy one-third of the total memory.
4. The fallback path for algorithm 1 mentioned in the DRT paper is not implemented. Thus, changing the input tensors, the initial tile size, or buffer sizes might lead to incorrect results as the tiling would be incomplete. This is especially true for ExTensor-OP-DRT because it runs through two rounds of DRT.
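
The arithmetic behind notes 1 to 3 can be written out as follows. This is a sketch under assumptions: it treats the buffer size as a total that is split evenly, as note 3 describes, and the variable names `edge`, `tiles_per_rank`, and `per_tensor_budget` are hypothetical.

```python
from math import ceil, sqrt

micro_tile_size = 4                      # elements per micro tile in this notebook
edge = int(sqrt(micro_tile_size))        # side length of a square micro tile
tiles_per_rank = ceil(15 / edge)         # micro tiles along a rank of shape 15

buffer_size = 9000                       # reduced buffer size used for demonstration
per_tensor_budget = buffer_size // 3     # equal shares for A, B, and the output
```

With these values, each micro tile is 2 × 2, a rank of shape 15 is covered by 8 micro tiles, and each tensor receives a budget of 3000.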

In [None]:
micro_tile_size = 4

K = 15
M = 15
N = 15

density = [0.8,0.8]
seed = 0

# Initialize the tensors
A_KM = Tensor.fromRandom(rank_ids=["K", "M"], shape=[K, M], seed=seed, density=density, name="A")
B_KN = Tensor.fromRandom(rank_ids=["K", "N"], shape=[K, N], seed=seed + 1, density=density, name="B")

# Preprocess A_KM and B_KN into tensors of micro tile granularity
A_K1M1K0M0 = A_KM.generateMicroTiles(micro_tile_size)
B_K1N1K0N0 = B_KN.generateMicroTiles(micro_tile_size)
tensors = [A_K1M1K0M0, B_K1N1K0N0]
buffer_size_allocation = 9000
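
The pre-processing step splits each rank into an upper rank that selects a micro tile and a lower rank that selects an element inside it. The sketch below shows one such split on a `{(k, m): value}` dictionary. It assumes the upper coordinate is the base coordinate of the enclosing micro tile and the lower coordinate keeps its original value, which is why the micro tile functions subtract a base point. `generateMicroTiles` may label coordinates differently, and `split_coordinate` and `to_micro_tiles` are hypothetical names.

```python
def split_coordinate(coord, scope):
    """Return (upper, lower): the micro tile's base coordinate and the original coordinate."""
    return (coord // scope) * scope, coord

def to_micro_tiles(points, scope):
    """Group {(k, m): value} into {(k1, m1): {(k0, m0): value}}."""
    tiles = {}
    for (k, m), val in points.items():
        k1, k0 = split_coordinate(k, scope)
        m1, m0 = split_coordinate(m, scope)
        tiles.setdefault((k1, m1), {})[(k0, m0)] = val
    return tiles

points = {(0, 1): 5, (1, 0): 7, (3, 2): 9}
tiles = to_micro_tiles(points, scope=2)
```

Only micro tiles that contain at least one nonzero appear in `tiles`, so empty regions of a sparse matrix cost nothing at this level.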

Execute the following cell if you wish to visualize tensors `A_KM` and `B_KN`.

In [None]:
print("Tensor A")
displayTensor(A_KM)
print("\nTensor B")
displayTensor(B_KN)

Execute the following cell if you wish to visualize tensors `A_K1M1K0M0` and `B_K1N1K0N0`.

In [None]:
print("Tensor A pre-processed into micro tiles")
displayTensor(A_K1M1K0M0)
print("\nTensor B pre-processed into micro tiles")
displayTensor(B_K1N1K0N0)

## ExTensor-OP
The authors of the paper propose a modification to the original design of the [ExTensor accelerator](https://dl.acm.org/doi/10.1145/3352460.3358275). To remove a performance bottleneck, the dataflow between the global and local buffers is changed from an inner-product to an outer-product dataflow. Note that partial sums are reduced locally in the output tile in this case. Because of the change in dataflow, a parallelized variant of ExTensor's intersection unit is also used.

In the line below, we set the uniform partitioning of ExTensor-OP to generate uniform tiles of shape 2 × 2.

In [None]:
K1 = M1 = N1 = K0 = M0 = N0 = 2

To demonstrate DRT in the next three sections, the tensors initialized above are fairly large. Thus, after compiling the YAML specification for ExTensor-OP, please comment out any line containing the keyword `canvas` to disable the video generator. The results can still be verified with the video generator disabled.

In [None]:
yaml = """
einsum:
    declaration:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    expressions:
        - Z[m,n] = A[k,m] * B[k,n]
mapping:
    rank-order:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    partitioning:
        Z:
            K: [uniform_shape(K1), uniform_shape(K0)]
            M: [uniform_shape(M1), uniform_shape(M0)]
            N: [uniform_shape(N1), uniform_shape(N0)]
    loop-order:
        Z: [N2, K2, M2, K1, M1, N1, M0, N0, K0]
    spacetime:
        Z:
            space: [N1]
            time: [N2, K2, M2, K1, M1, M0, N0, K0]
"""

utils.compile(yaml)

### Check Results

Check that the code above (generated or provided) computes the correct result.

**Note**: Should be used after executing the cell above.

In [None]:
utils.check_matmul(A_KM, B_KN, Z_MN)
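
The check compares the computed `Z_MN` against a reference product for `Z[m,n] = A[k,m] * B[k,n]`. The sketch below shows one way such a comparison could work on plain `{(row, col): value}` dictionaries. `reference_matmul` and `same_result` are hypothetical names, and the sketch does not describe the actual implementation of `utils.check_matmul`.

```python
def reference_matmul(A_km, B_kn):
    """Z[m, n] = sum over k of A[k, m] * B[k, n]."""
    Z = {}
    for (k_a, m), a_val in A_km.items():
        for (k_b, n), b_val in B_kn.items():
            if k_a == k_b:
                Z[(m, n)] = Z.get((m, n), 0) + a_val * b_val
    return Z

def drop_zeros(tensor):
    """Remove explicit zeros so that they do not affect the comparison."""
    return {point: val for point, val in tensor.items() if val != 0}

def same_result(Z, Z_ref):
    """Compare two sparse results by their nonzero entries."""
    return drop_zeros(Z) == drop_zeros(Z_ref)

A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
B = {(0, 0): 4, (1, 0): 5, (1, 1): 6}
Z_ref = reference_matmul(A, B)
```

Ignoring explicit zeros matters for sparse outputs, because two dataflows may differ in whether they materialize an entry whose partial sums cancel.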

## ExTensor-OP-DRT

Since the ExTensor accelerator uses uniform shape partitioning, its partitions must be sized for the worst-case dense tensor. As a result, ExTensor uses memory inefficiently when processing sparse tensors. DRT solves this problem by building dynamic, non-uniform tiles at runtime to reduce DRAM traffic.
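
To illustrate the difference, the sketch below groups a one-dimensional sequence of micro tiles in two ways: a fixed number of micro tiles per tile, and greedy growth until the nonzeros no longer fit a budget. This is a deliberately simplified illustration and not Algorithm 1 of the DRT paper. The function names, the occupancy values, and the budget are all assumptions.

```python
def uniform_tiles(occupancies, tiles_per_group):
    """Static grouping: a fixed number of micro tiles per tile, regardless of content."""
    return [occupancies[i:i + tiles_per_group]
            for i in range(0, len(occupancies), tiles_per_group)]

def grow_tiles(occupancies, budget):
    """Dynamic grouping: keep adding micro tiles while their nonzeros fit the budget."""
    tiles, current, used = [], [], 0
    for nnz in occupancies:
        # Close the current tile when the next micro tile would overflow the budget.
        if current and used + nnz > budget:
            tiles.append(current)
            current, used = [], 0
        current.append(nnz)
        used += nnz
    if current:
        tiles.append(current)
    return tiles

# Nonzeros per micro tile; a dense 2 x 2 micro tile holds 4.
occupancies = [4, 0, 1, 0, 0, 3, 4, 1]

static = uniform_tiles(occupancies, tiles_per_group=1)   # sized for the dense worst case
dynamic = grow_tiles(occupancies, budget=4)
```

With a budget of one dense micro tile, the static grouping needs 8 tiles, while the occupancy-driven grouping covers the same micro tiles with 4 larger tiles.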

**The YAML specification below is purely symbolic; running it will result in an error because DRT is not implemented in the current TeAAL compiler. Please run the code in the cell after the one below instead.**

In the notation below:\
(N12, K12, M12) specifies the global buffer tile\
(K11, M11, N11) specifies the PE tile\
(K10, M10, N10) specifies the micro tile\
(M0, N0, K0) specifies the element in a micro tile

In [None]:
yaml = """
einsum:
    declaration:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    expressions:
        - Z[m,n] = A[k,m] * B[k,n]
mapping:
    rank-order:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    partitioning:
        Z:
            K: [uniform_shape(32)]
            M: [uniform_shape(32)]
            N: [uniform_shape(32)]
            (K1, M1, N1): [DRT(B, A), DRT(A, B)]
    loop-order:
        Z: [N12, K12, M12, K11, M11, N11, K10, M10, N10, M0, N0, K0]
    spacetime:
        Z:
            space: [N11]
            time: [N12, K12, M12, K11, M11, K10, M10, N10, M0, N0, K0]
"""

utils.compile(yaml)

Since DRT is not implemented in the TeAAL compiler, the HiFiber code is given below.\
**Note**: Because the fallback path in algorithm 1 of the DRT paper is not implemented, changing the parameters below may generate an incorrect result or an error.

In [None]:
global_buffer_size = 18000
loop_order_1 = ["N.1", "K.1", "M.1"]
local_buffer_size = 9000
loop_order_2 = ["K.1", "M.1", "N.1"]
micro_tile_loop_order = ["M.0", "N.0", "K.0"]
init_tile_size_outer = {"K.1": [0,0], "M.1": [0,0], "N.1": [0,4]}
init_tile_size_inner = {"K.1": [0,0], "M.1": [0,2], "N.1": [0,0]}
Z_MN = Tensor(rank_ids=["M", "N"], name = "Z")

matrixMultiplication_ExTensor_OP_DRT(Z_MN, A_K1M1K0M0, B_K1N1K0N0, PETile_outerProduct, microTile_innerProduct, loop_order_1, loop_order_2, init_tile_size_outer, init_tile_size_inner, micro_tile_size, micro_tile_loop_order, global_buffer_size, local_buffer_size)
displayTensor(Z_MN)

### Check Results

Check that the code above computes the correct result.

**Note**: Should be used after executing the cell above.

In [None]:
utils.check_matmul(A_KM, B_KN, Z_MN)

## OuterSPACE-Like-DRT

As with ExTensor-OP, DRT can also be applied to OuterSPACE-like accelerators, which use an outer-product dataflow.

According to the paper: “The untiled baseline (original OuterSPACE proposal) distributes columns of A and rows of B, giving A and B perfect reuse, but Z poor reuse. Tiling of A and B reduces the working set size of output partial products, allowing them to be partially reduced on-chip, which reduces memory traffic. Additionally, tiling enables partial reuse across all three tensors.”

The paper also mentions that the authors “idealize these accelerators’ on-chip implementations, assuming they can reach their DRAM-bound performance.” Thus, DRT is applied to the global buffer level only.

**The YAML specification below is purely symbolic; running it will result in an error because DRT is not implemented in the current TeAAL compiler. Please run the code in the cell after the one below instead.**

In the notation below:\
(K11, M11, N11) specifies the global buffer tile\
(K10, M10, N10) specifies the micro tile\
(K0, M0, N0) specifies the element in a micro tile

In [None]:
yaml = """
einsum:
    declaration:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    expressions:
        - Z[m,n] = A[k,m] * B[k,n]
mapping:
    rank-order:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    partitioning:
        Z:
            K: [uniform_shape(32)]
            M: [uniform_shape(32)]
            N: [uniform_shape(32)]
            (K1, M1, N1): [DRT(A, B)]
    loop-order:
        Z: [K11, M11, N11, K10, M10, N10, K0, M0, N0]
    spacetime:
        Z:
            space: []
            time: [K11, M11, N11, K10, M10, N10, K0, M0, N0]
"""

utils.compile(yaml)

Since DRT is not implemented in the TeAAL compiler, the HiFiber code is given below.\
**Note**: Because the fallback path in algorithm 1 of the DRT paper is not implemented, changing the parameters below may generate an incorrect result or an error.

In [None]:
micro_tile_loop_order = ["K.0", "M.0", "N.0"]
init_tile_size = {"K.1": [0,0], "M.1": [0,3], "N.1": [0,0]}
loop_order = ["K.1", "M.1", "N.1"]
Z_MN = Tensor(rank_ids=["M", "N"], name = "Z")
matrixMultiplication(Z_MN, A_K1M1K0M0, B_K1N1K0N0, PETile_outerProduct, microTile_outerProduct, loop_order, init_tile_size, micro_tile_size, micro_tile_loop_order, buffer_size_allocation)
displayTensor(Z_MN)

### Check Results

Check that the code above (generated or provided) computes the correct result.

**Note**: Should be used after executing the cell above.

In [None]:
utils.check_matmul(A_KM, B_KN, Z_MN)

## MatRaptor-Like-DRT

Finally, DRT can also be applied to MatRaptor-like accelerators, which use Gustavson's dataflow.

Since the paper mentions that the authors “idealize these accelerators’ on-chip implementations, assuming they can reach their DRAM-bound performance,” DRT is applied to the global buffer level only.

**The YAML specification below is purely symbolic; running it will result in an error because DRT is not implemented in the current TeAAL compiler. Please run the code in the cell after the one below instead.**

In the notation below:\
(M11, K11, N11) specifies the global buffer tile\
(M10, K10, N10) specifies the micro tile\
(M0, K0, N0) specifies the element in a micro tile

In [None]:
yaml = """
einsum:
    declaration:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    expressions:
        - Z[m,n] = A[k,m] * B[k,n]
mapping:
    rank-order:
        A: [K, M]
        B: [K, N]
        Z: [M, N]
    partitioning:
        Z:
            K: [uniform_shape(32)]
            M: [uniform_shape(32)]
            N: [uniform_shape(32)]
            (K1, M1, N1): [DRT(A, B)]
    loop-order:
        Z: [M11, K11, N11, M10, K10, N10, M0, K0, N0]
    spacetime:
        Z:
            space: []
            time: [M11, K11, N11, M10, K10, N10, M0, K0, N0]
"""

utils.compile(yaml)

Since DRT is not implemented in the TeAAL compiler, the HiFiber code is given below.\
**Note**: Because the fallback path in algorithm 1 of the DRT paper is not implemented, changing the parameters below may generate an incorrect result or an error.

In [None]:
micro_tile_loop_order = ["M.0", "K.0", "N.0"]
init_tile_size = {"K.1": [0,0], "M.1": [0,3], "N.1": [0,0]}
loop_order = ["M.1", "K.1", "N.1"]
finalResult = Tensor(rank_ids=["M", "N"], name = "Z")
matrixMultiplication(finalResult, A_K1M1K0M0, B_K1N1K0N0, PETile_Gustavson, microTile_Gustavson, loop_order, init_tile_size, micro_tile_size, micro_tile_loop_order, buffer_size_allocation)
displayTensor(finalResult)

### Check Results

Check that the code above (generated or provided) computes the correct result.

**Note**: Should be used after executing the cell above.

In [None]:
utils.check_matmul(A_KM, B_KN, finalResult)