# Eyeriss V2

This notebook reproduces the salient characteristics of the [Eyeriss V2](https://eems.mit.edu/wp-content/uploads/2019/04/2019_jetcas_eyerissv2.pdf) accelerator.

## Imports

Import the necessary modules.

In [None]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "..")

from src import utils

## Initialization

Initialize the input tensors. Tensor shapes and densities can be modified below.

**Warning:** Large tensors will overwhelm the video generation. Either:
1. Use small tensors; as a rule of thumb, fewer than 60 computes (e.g., multiplications) should be required.
2. Do not generate a video, by setting `generate_video = False` when compiling.
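
To gauge the rule of thumb before compiling, the number of multiplications can be estimated from the tensor shapes. The sketch below assumes dense tensors and no padding; the helper name `conv_multiplies` is hypothetical and is not part of the notebook's `utils` module.

```python
def conv_multiplies(n, m, c, r, s, h, w, stride):
    """Upper bound on multiplications for a dense strided convolution (no padding)."""
    e = (h - r) // stride + 1  # output height
    f = (w - s) // stride + 1  # output width
    # One multiplication per (n, m, e, f, c, r, s) point.
    return n * m * e * f * c * r * s


# A shape small enough for video generation: 2 output channels, 1 input channel.
print(conv_multiplies(n=1, m=2, c=1, r=2, s=2, h=4, w=4, stride=2))  # 32
```

Sparse tensors (density below 1.0) would need fewer multiplications than this bound.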

**Note:** Unlike Eyeriss v1, Eyeriss v2 uses a batch size of 1 throughout the paper.

In [None]:
# Filter
M = 8
C = 4
R = 2
S = 2

# Input
N = 1
H = 4
W = 4

# Stride
Stride = 2

# Output
E = (H - R) // Stride + 1
F = (W - S) // Stride + 1

# Partition parameters
C1 = C
C0 = 2
M2 = M
M1 = 4
M0 = 2

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

## Compile and Run

Below is the TeAAL specification for Eyeriss v2. To simulate the accelerator:
1. Compile it to HiFiber by running the cell, which inserts a new cell
2. Run the new cell, which will
    - Execute the kernel, convolving the input and filter tensors defined above
    - Generate visualizations of the actions of the kernel

Our reasoning for the Einsums and mappings in the TeAAL specification of Eyeriss v2 is as follows:
1. Partition
   - C rank is partitioned (uniform shape) once (C1, C0)
     
     "Recall that for the row-stationary dataflow, multiple 1-D rows of weights and iact are mapped to a given PE and processed in a sliding window fashion; here, the C0 × M1 rows of weights with width S are assigned to the PE, and the weights belong to M1 output channels and C0 input channels"
     
   - M rank is partitioned (uniform shape) twice (M2, M1, M0)

     M rank is partitioned twice because the PE has SIMD support: "The psum SPad also has to have two read and two write ports for updating two psums per cycle".

     As a result, M is first split into chunks of M1 output channels, and each chunk is further split into chunks of M0 = 2.
     
2. Space and Time
   - M rank
     - M0 is space because the SIMD support in the PE computes two psums per cycle.
     - M1 is time because the M1 rows of weights are assigned to the same PE.
     - M2 is space because the PE array contains many PEs, each processing C0 × M1 rows of the filter.
   - C rank
     - C0 is time and C1 is space, for the same reasons as M1 and M2, respectively.
   - Other ranks follow the Eyeriss paper.
     
3. Loop order
   - N: the paper does not explicitly state that the batch rank is the outermost loop, so we follow the original Eyeriss paper.
   - M2 and C1 (no strict order between them)
   - M1 and C0 (no strict order between them)
   - M0
   - F and then E: F comes before E because the RS dataflow streams in the input activations in a window-based manner.
   - R and S (no strict order between them)
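
The partitioning and loop order above can be sketched as a plain Python loop nest. This is only an illustration under stated assumptions, not the HiFiber code that the compiler generates: the function name `conv_partitioned` and its parameters are hypothetical, tensors are dense nested lists, and every loop runs sequentially, so the space/time assignment appears only in the comments.

```python
def conv_partitioned(inputs, filters, stride, m1, m0, c0):
    """Strided convolution over nested lists, iterated in the order
    [N, M2, C1, M1, C0, M0, F, E, R, S]. Hypothetical helper for illustration."""
    N, C = len(inputs), len(inputs[0])
    H, W = len(inputs[0][0]), len(inputs[0][0][0])
    M = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    E = (H - R) // stride + 1
    F = (W - S) // stride + 1
    out = [[[[0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]

    for n in range(N):
        for m2 in range(0, M, m1):                # M2 (space): chunk of m1 output channels
            for c1 in range(0, C, c0):            # C1 (space): chunk of c0 input channels
                for m1_start in range(m2, min(m2 + m1, M), m0):  # M1 (time)
                    for c in range(c1, min(c1 + c0, C)):         # C0 (time)
                        for m in range(m1_start, min(m1_start + m0, M)):  # M0 (space, SIMD)
                            for f in range(F):
                                for e in range(E):
                                    for r in range(R):
                                        for s in range(S):
                                            out[n][m][e][f] += (
                                                inputs[n][c][stride * e + r][stride * f + s]
                                                * filters[m][c][r][s]
                                            )
    return out


# All-ones tensors with the notebook's default shapes: each output sums C*R*S = 16 products.
ones_in = [[[[1] * 4 for _ in range(4)] for _ in range(4)]]                   # N=1, C=4, H=4, W=4
ones_f = [[[[1] * 2 for _ in range(2)] for _ in range(4)] for _ in range(8)]  # M=8, C=4, R=2, S=2
out = conv_partitioned(ones_in, ones_f, stride=2, m1=4, m0=2, c0=2)
print(out[0][0])  # [[16, 16], [16, 16]]
```

Because the partitioned loops only reorder the same set of multiply-accumulate operations, the result matches an unpartitioned convolution; the partitioning affects which PE performs each operation and when, not the output values.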

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, 2*e+r, 2*f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(M1), uniform_shape(M0)]
      C: [uniform_shape(C0)]
  loop-order:
    O: [N, M2, C1, M1, C0, M0, F, E, R, S]
  spacetime:
    O:
      space: [M2, C1, M0, R]
      time: [N, M1, C0, F, E, S]
"""
utils.compile(yaml, generate_video = True)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF, stride=Stride)