# Eyeriss

This notebook reproduces the salient characteristics of the [Eyeriss](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7738524) accelerator.

## Imports

Import the necessary modules.

In [None]:
# HiFiber boilerplate

from fibertree_bootstrap import *

fibertree_bootstrap(style="tree", animation='movie')

# Compilation boilerplate

import os
import sys
sys.path.insert(0, "..")

from src import utils

## Initialization

Initialize the input tensors. Tensor shapes and densities can be modified below.

**Warning:** Large tensors will overwhelm the video generation. Either:
1. Use small tensors; as a rule of thumb, fewer than 60 computes (e.g., multiplications) should be required.
2. Do not generate a video; remove the `spacetime` specification from the `mapping` before compiling.
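
One way to gauge whether a configuration stays within this rule of thumb is to count the multiplications up front. The sketch below assumes dense tensors (density 1.0) and an unpadded convolution; `conv_multiplies` is a hypothetical helper written for illustration and is not part of HiFiber or TeAAL.

```python
def conv_multiplies(N, M, C, H, W, R, S, stride=1):
    """Count the multiplications in a dense, unpadded convolution."""
    E = (H - R + stride) // stride  # ofmap height
    F = (W - S + stride) // stride  # ofmap width
    # One multiply per (n, m, c, e, f, r, s) point of the iteration space.
    return N * M * C * E * F * R * S


# Hypothetical small shapes, chosen only for illustration.
print(conv_multiplies(N=1, M=1, C=2, H=3, W=3, R=2, S=2))  # 32
```

Sparse tensors (density below 1.0) would need fewer multiplications than this count, so it serves as an upper bound.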

In [None]:
# Filter
M = 2
C = 2
R = 2
S = 2

# Input
N = 2
H = 3
W = 3

# Stride
Stride = 1

# Output
E = (H - R + Stride) // Stride
F = (W - S + Stride) // Stride

# Partition parameters
N1 = 2
N0 = 1
C2 = 2
C1 = 2
C0 = 2
M2 = 2
M1 = 1
M0 = 1
E2 = 2
E1 = 2
E0 = 2

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

## Compile and Run

Below is the TeAAL specification for Eyeriss. To simulate the accelerator:
1. Run the cell to compile the specification to HiFiber; the compiled code is inserted as a new cell.
2. Run the new cell, which will
    - Execute the kernel, convolving the input and filter tensors defined above
    - Generate visualizations of the actions of the kernel

Here is our reasoning for the Einsum and mapping in the TeAAL specification of Eyeriss:
1. Partition
    - M rank is partitioned (uniform shape) twice (M2, M1, M0), and C rank is partitioned (uniform shape) twice (C2, C1, C0)

      M is the rank that represents the number of filters, and C is the rank that represents the number of channels.
      A PE array is a spatial array of 168 PEs organized as a 12 × 14 rectangle.
      The dimensions of a PE set are a function of the shape of a layer and are independent of the physical dimensions of the PE array.
      A PE set can handle multiple 2D convolutions.
      Efficiently mapping multiple (M × C) 2-D convolutions onto the 2-D PE array (the hardware) is a nontrivial task.

      The Eyeriss paper describes the partitioning in the following sentences:
      "The PE array can run multiple 2-D convolutions from up to (q × r) channels of (p × t) filters simultaneously."
      "The PE array fits r × t PE sets in parallel that run r different channels of t different filters simultaneously."
      "Each PE runs p × q primitives simultaneously from q different channels of p different filters."      
      
      The M and C ranks are first partitioned into chunks of (t × p) filters and (q × r) channels, which are mapped onto the 2-D PE array.
      Within the PE array, a single PE handles q channels and p filters.

      As a result, M is partitioned as: M2 = M, M1 = t × p, M0 = p.
      C is partitioned as: C2 = C, C1 = q × r, C0 = q.
      
    - N rank is partitioned (uniform shape) once: N1, N0
  
      N is the rank that represents the batch size.
      According to Figure 7 of the Eyeriss paper, a single scheduling pass does not go over all the ifmaps:
      "A pass is assumed to process three channels (q × r) and four filters (p × t). Also, the number of ifmaps that a pass processes, denoted as n, is assumed to be 2."

      As a result, Eyeriss partitions N into (N1, N0), where in Figure 7, N1 = N, N0 = 2.
      
    - E rank is partitioned (uniform shape) twice: E2, E1, E0

      E is the rank that represents the ofmap height.
      "the height and the width of the PE set are equal to the number of filter rows (R) and ofmap rows (E)."

      The hardware (PE array) has fixed dimensions (12 × 14).
      When the PE set requires more than 12 × 14 PEs, Eyeriss handles this by "strip mining the 2-D convolution, i.e., the PE set only processes e rows of ofmap at a time, where e ≤ E. The dimensions of the strip-mined PE set then becomes R × e and can fit into the PE array."
      On top of this, the PE array can contain multiple (e1) PE sets: R x e x e1.

      As a result, Eyeriss partitions E into (E2, E1, E0), where E2 = E, E1 = e1 x e (e1 = 2 for AlexNet CONV layer 1), E0 = e (e = 7 for AlexNet CONV layer 1).
       
2. Space and Time
   - M and C ranks

     M0 and C0 are time because the computation is serial: a PE set can run p × q 2-D convolutions in a time-interleaved fashion (details can be found in the subsection "Multiple 2-D Convolutions in a PE Set").

     M1 and C1 are space because "the PE array can fit more than one PE set if the set is small enough", which implies that the r and t chunks are processed in parallel.

     M2 and C2 are time because the PE array has a fixed capacity, so the remaining chunks must be processed sequentially.
     
   - N rank

     Throughout the paper, a PE's local spad only stores data of width S (the width of the filter).
     When filter reuse is exploited (Figure 6(a)), different ifmaps are processed sequentially, which implies that N0 is time.

     N1 is also time, as implied by Figure 7.
     
   - E rank

     E1 and E0 are space because, after strip mining the 2-D convolution, the PE array can contain e1 PE sets of dimension R × e. The PE array handles E1 and E0 in parallel.

     E2 is time because the PE array has a fixed capacity, so the remaining strips must be processed sequentially.

   - R rank

     R is space because all rows of a filter are broadcast onto the PE array.
     "Eyeriss currently does not support the mapping of a PE set that is taller than the height of the PE array. Therefore, the maximum natively supported filter height is 12."
     This sentence implies that there is no partitioning along the R rank.

   - S rank

     S is time because each PE can only compute one element per cycle (Figure 3), so a single filter row takes S cycles to finish.

   - F rank

     F is time because the PE's local spad is of width S, so there is no capacity to store an entire row of the ifmap.
     
3. Loop order
   
   According to Figure 7:
   - N1 and N0 are the outermost.
   - C2 follows
   - M2 follows C2

   Next come C1 and M1, because the entire PE array handles (q × r) channels and (p × t) filters.

   Next come C0 and M0, because each PE handles p × q 2-D convolutions in a time-interleaved fashion.

   For each 2-D convolution:
   - The outermost loop rank is F, because a PE set computes the 2-D convolution from left to right.
   - Next come E2, E1, E0, because each column of the ofmap is partitioned if we run out of local spad capacity.
   - Finally come R (filter height) and S (filter width); there is no strict order between them.
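
The partitioning rules in step 1 can be sketched as a small helper that turns the Eyeriss mapping parameters into the `uniform_shape` sizes used in the specifications. `eyeriss_partition_shapes` is a hypothetical helper written for illustration, not part of TeAAL. For the AlexNet CONV1 example, e = 7 and e1 = 2 come from the text above, p = 16 and t = 2 are inferred from the M partition shapes (32 and 16) used in the CONV1 specification in this notebook, and q = r = n = 1 are assumptions.

```python
def eyeriss_partition_shapes(p, q, r, t, e, e1, n):
    """Map Eyeriss mapping parameters to per-rank partition shapes."""
    return {
        "M": [t * p, p],   # M1 = t x p, M0 = p
        "C": [q * r, q],   # C1 = q x r, C0 = q
        "E": [e1 * e, e],  # E1 = e1 x e, E0 = e
        "N": [n],          # N0 = n
    }


# AlexNet CONV1; q, r, and n are assumed to be 1 for this illustration.
shapes = eyeriss_partition_shapes(p=16, q=1, r=1, t=2, e=7, e1=2, n=1)
print(shapes["M"], shapes["E"])  # [32, 16] [14, 7]
```

A shape of 1 at every level means the rank is effectively unpartitioned, which is why a specification may omit that rank from its `partitioning` section.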
   

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, e+r, f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(M1), uniform_shape(M0)]
      N: [uniform_shape(N0)]
      C: [uniform_shape(C1), uniform_shape(C0)]
      E: [uniform_shape(E1), uniform_shape(E0)]
  loop-order:
    O: [N1, N0, C2, M2, C1, M1, C0, M0, F, E2, E1, E0, R, S]
  spacetime:
    O:
      space: [C1, M1, E1, E0, R]
      time: [N1, N0, C2, M2, C0, M0, F, E2, S]
"""
utils.compile(yaml, generate_video = True)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)

## Extra Specifications for Eyeriss Targeting the AlexNet Convolutional Layers

The original AlexNet assumes a batch size of 128.
Eyeriss uses a batch size of 4.
We follow the Eyeriss parameters here.

Because of the large parameters involved in AlexNet, our specifications do not generate videos (`generate_video = False`), but they still check correctness.

![Eyeriss AlexNet detail](./images/alexnet-eyeriss.png)

#### Warning

The cells below marked Original Parameters might take a long time to run. You can use the smaller values under Reduced Parameters to speed up the process.
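
Several of the cells below hard-code E and F, so when the parameters are reduced further, the output sizes have to stay consistent with H, W, R, S, and Stride. The following is a minimal sketch of such a check, assuming an unpadded convolution; `output_dim` is a hypothetical helper, and the requirement that the stride tiles the input evenly is an added assumption rather than something the notebook enforces.

```python
def output_dim(in_dim, filter_dim, stride):
    """Output size along one axis of an unpadded, strided convolution."""
    # Added assumption: reject shapes where the stride leaves leftover rows.
    assert (in_dim - filter_dim) % stride == 0, "stride does not tile the input evenly"
    return (in_dim - filter_dim) // stride + 1


# Layer 1, original parameters: H = W = 227, R = S = 11, Stride = 4.
assert output_dim(227, 11, 4) == 55

# Layer 1, reduced parameters: H = 32, W = 12, R = S = 4, Stride = 4.
assert output_dim(32, 4, 4) == 8  # E
assert output_dim(12, 4, 4) == 3  # F
```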

### AlexNet Convolutional Layer 1

#### Original Parameters

In [None]:
# Filter
M = 96
C = 3
R = 11
S = 11

# Input
N = 4
H = 227
W = 227

# Stride
Stride = 4

# Output
E = 55
F = 55

# Partition parameters
M2 = M
M1 = 32
M0 = 16
E2 = E 
E1 = 14
E0 = 7

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, 4*e+r, 4*f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(32), uniform_shape(16)]
      E: [uniform_shape(14), uniform_shape(7)]
  loop-order:
    O: [N, C, M2, M1, M0, F, E2, E1, E0, R, S]
  spacetime:
    O:
      space: [M1, E1, E0, R]
      time: [N, C, M2, M0, F, E2, S]
"""

utils.compile(yaml, generate_video = False)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)

#### Reduced Parameters

In [None]:
# Filter
M = 16
C = 3
R = 4
S = 4

# Input
N = 4
H = 32
W = 12

# Stride
Stride = 4

# Output
E = 8 
F = 3

# Partition parameters
M2 = M
M1 = 8
M0 = 4
E2 = E 
E1 = 4
E0 = 2

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, 4*e+r, 4*f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(8), uniform_shape(4)]
      E: [uniform_shape(4), uniform_shape(2)]
  loop-order:
    O: [N, C, M2, M1, M0, F, E2, E1, E0, R, S]
  spacetime:
    O:
      space: [M1, E1, E0, R]
      time: [N, C, M2, M0, F, E2, S]
"""

utils.compile(yaml, generate_video = False)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)

### AlexNet Convolutional Layer 2

#### Original Parameters

In [None]:
# Filter
M = 256
C = 48
R = 5
S = 5

# Input
N = 4
H = 31
W = 31

# Stride
Stride = 1

# Output
E = (H - R + Stride) // Stride
F = (W - S + Stride) // Stride

# Partition parameters
C1 = C
C0 = 2
M1 = M
M0 = 16

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, e+r, f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(16)]
      C: [uniform_shape(2)]
  loop-order:
    O: [N, C1, M1, C0, M0, F, E, R, S]
  spacetime:
    O:
      space: [E, R]
      time: [N, C1, M1, C0, M0, F, S]
"""

utils.compile(yaml, generate_video = False)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)

#### Reduced Parameters

In [None]:
# Filter
M = 8
C = 6
R = 5
S = 5

# Input
N = 4
H = 8
W = 8

# Stride
Stride = 1

# Output
E = (H - R + Stride) // Stride
F = (W - S + Stride) // Stride

# Partition parameters
C1 = C
C0 = 2
M1 = M
M0 = 4

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, e+r, f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(4)]
      C: [uniform_shape(2)]
  loop-order:
    O: [N, C1, M1, C0, M0, F, E, R, S]
  spacetime:
    O:
      space: [E, R]
      time: [N, C1, M1, C0, M0, F, S]
"""

utils.compile(yaml, generate_video = False)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)

### AlexNet Convolutional Layer 3


#### Original Parameters

In [None]:
# Filter
M = 384
C = 256
R = 3
S = 3

# Input
N = 4
H = 15
W = 15

# Stride
Stride = 1

# Output
E = 13
F = 13

# Partition parameters
C1 = C
C0 = 4
M2 = M
M1 = 64
M0 = 16

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, e+r, f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(64), uniform_shape(16)]
      C: [uniform_shape(4)]
  loop-order:
    O: [N, C1, M2, M1, C0, M0, F, E, R, S]
  spacetime:
    O:
      space: [M1, E, R]
      time: [N, C1, M2, C0, M0, F, S]
"""

utils.compile(yaml, generate_video = False)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)

#### Reduced Parameters

In [None]:
# Filter
M = 8
C = 16
R = 3
S = 3

# Input
N = 4
H = 15
W = 15

# Stride
Stride = 1

# Output
E = 13
F = 13

# Partition parameters
C1 = C
C0 = 4
M2 = M
M1 = 4
M0 = 2

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, e+r, f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(4), uniform_shape(2)]
      C: [uniform_shape(4)]
  loop-order:
    O: [N, C1, M2, M1, C0, M0, F, E, R, S]
  spacetime:
    O:
      space: [M1, E, R]
      time: [N, C1, M2, C0, M0, F, S]
"""

utils.compile(yaml, generate_video = False)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)

### AlexNet Convolutional Layers 4 & 5


#### Original Parameters

In [None]:
# Filter
M = 256
C = 192
R = 3
S = 3

# Input
N = 4
H = 15
W = 15

# Stride
Stride = 1

# Output
E = 13
F = 13

# Partition parameters
C2 = C
C1 = 6
C0 = 3
M2 = M
M1 = 32
M0 = 16

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, e+r, f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(32), uniform_shape(16)]
      C: [uniform_shape(6), uniform_shape(3)]
  loop-order:
    O: [N, C2, M2, C1, M1, C0, M0, F, E, R, S]
  spacetime:
    O:
      space: [C1, M1, E, R]
      time: [N, C2, M2, C0, M0, F, S]
"""

utils.compile(yaml, generate_video = False)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)

#### Reduced Parameters

In [None]:
# Filter
M = 6
C = 18
R = 3
S = 3

# Input
N = 4
H = 15
W = 15

# Stride
Stride = 1

# Output
E = 13
F = 13

# Partition parameters
C2 = C
C1 = 6
C0 = 3
M2 = M
M1 = 4
M0 = 2

# Random Input Tensors
I_NCHW = Tensor.fromRandom(rank_ids=["N", "C", "H", "W"], density=1.0, shape=[N, C, H, W])
F_MCRS = Tensor.fromRandom(rank_ids=["M", "C", "R", "S"], density=1.0, shape=[M, C, R, S])

In [None]:
yaml = """
einsum:
  declaration:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  expressions:
    - O[n, m, e, f] = I[n, c, e+r, f+s]*F[m, c, r, s]
mapping:
  rank-order:
    I: [N, C, H, W]
    F: [M, C, R, S]
    O: [N, M, E, F]
  partitioning:
    O:
      M: [uniform_shape(4), uniform_shape(2)]
      C: [uniform_shape(6), uniform_shape(3)]
  loop-order:
    O: [N, C2, M2, C1, M1, C0, M0, F, E, R, S]
  spacetime:
    O:
      space: [C1, M1, E, R]
      time: [N, C2, M2, C0, M0, F, S]
"""

utils.compile(yaml, generate_video = False)

In [None]:
utils.check_conv(I_NCHW, F_MCRS, O_NMEF)