## Part 2: Spatial Architecture Modeling
Now that you are familiar with the basic single-PE setup, let's look at a full system, shown in the figure below. The design has two levels of on-chip storage: the global buffer and the local scratchpad in each PE, as described in Part 1. Each datatype is sent over a network from the global buffer to the PE array, and inter-PE networks can send the various datatypes within the array. The figure below also gives the loop nest of a matmul on this design.

<br>
<div class="row">
  <div class="column">
    <img align="left" src="designs/system/figures/arch.png" alt="Full System  Architecture Diagram" style="margin:50px 0px 0px 50px; width:40%">
  </div>
  <div class="column">
    <img  align="left"  src="designs/system/figures/loopnest.png" alt="System Loopnest" style="width:50%">
  </div>
</div>

### Question 1
You are provided with a PE array that has 16 PEs. Assume you can design a different architecture and associated mapping for every layer shape (i.e., both `architecture.yaml` and `mapping.yaml` can change across layer shapes).

Specifically, you can select the height and width of the PE array as long as the total number of PEs equals 16, while keeping all other architectural attributes the same.
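
As a quick illustration of the design space this constraint leaves open, the sketch below enumerates every array shape whose width times height is 16. The helper name `pe_array_shapes` is hypothetical and is not part of the lab's `loaders` module.

```python
def pe_array_shapes(total_pes: int) -> list[tuple[int, int]]:
    """Return every (meshX, meshY) pair whose product is total_pes."""
    return [
        (x, total_pes // x)
        for x in range(1, total_pes + 1)
        if total_pes % x == 0  # keep only widths that divide the PE count evenly
    ]


shapes = pe_array_shapes(16)
print(shapes)  # [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
```

Each pair is one candidate assignment for the two spatial mesh dimensions of the PE container.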

In [1]:
import pandas as pd
import numpy as np
from loaders import *

show_config('designs/system/arch.yaml')

# Please do not modify this file. If there are double-curly-brace-enclosed
# statements, they are placeholders that should be set from the notebooks.
architecture:
  version: 0.4
  nodes:
  - !Container
    name: system_arch
    attributes:
      # Top-level attributes inherited by all components unless overridden
      technology: "45nm"
      global_cycle_seconds: 1e-9
      datawidth: 16

  - !Component
    name: DRAM                 # offchip DRAM is the source of all datatypes
    class: DRAM                # assume DRAM is large enough to store all the data, so no depth specification needed
    attributes:
      width: 64                # width in bits
      datawidth: datawidth

  - !Container
    name: chip
    
  - !Component
    name: global_buffer
    class: SRAM
    attributes:
      width: 128
      depth: 2048
      datawidth: datawidth
      n_banks: 1
      n_rdwr_ports: 2

  - !Container
    name: PE
    spatial: {meshX: {{pe_meshX}}, meshY: {{pe_meshY}}}

  - !Compone

In [2]:
answer(
    question='2.1',
    subquestion='What variable names change the number of PEs in the X and Y dimensions? Please give the name of the double-curly-brace-enclosed variables. Case sensitive.',
    answer= ['pe_meshX', 'pe_meshY'], # [First variable in curly braces, second variable in curly braces]
    required_type=[str, str]
)

2.1: What variable names change the number of PEs in the X and Y dimensions? Please give the name of the double-curly-brace-enclosed variables. Case sensitive.
	['pe_meshX', 'pe_meshY']


### Question 2

With this spatial architecture, we will explore how the PE array shape impacts two metrics: utilization, which impacts throughput, and spatial data reuse, which impacts energy.

We start with the workload and mapping below. The mapping has placeholder variables in double curly brackets that we will replace with numeric values later.
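
To make the placeholder idea concrete, here is a minimal sketch of how double-curly-brace variables could be filled in with plain string substitution. This is an assumption for illustration only: `fill_placeholders` is a hypothetical helper, and the lab's `run_timeloop_model` may do the substitution differently. The mesh values 4 and 4 are example values.

```python
import re


def fill_placeholders(template: str, values: dict) -> str:
    """Replace each {{name}} with str(values[name]); raise if a name is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value provided for placeholder {name!r}")
        return str(values[name])

    # Match exactly two opening braces, a word, and two closing braces.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Line taken from designs/system/arch.yaml; the numeric values are examples.
line = "spatial: {meshX: {{pe_meshX}}, meshY: {{pe_meshY}}}"
filled = fill_placeholders(line, {"pe_meshX": 4, "pe_meshY": 4})
print(filled)  # spatial: {meshX: 4, meshY: 4}
```

Raising on a missing name, rather than leaving the braces in place, surfaces a forgotten variable before the file reaches a YAML parser.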

In [3]:
show_config('layer_shapes/conv2.yaml')

problem:
  version: 0.4
  shape:
    name: "CNN_Layer"
    dimensions: [ C, M, R, S, N, P, Q ]
    coefficients:
    - name: Wstride
      default: 1
    - name: Hstride
      default: 1
    - name: Wdilation
      default: 1
    - name: Hdilation
      default: 1

    data_spaces:
    - name: Weights
      projection:
      - [ [C] ]
      - [ [M] ]
      - [ [R] ]
      - [ [S] ]
    - name: Inputs
      projection:
      - [ [N] ]
      - [ [C] ]
      - [ [R, Wdilation], [P, Wstride] ] # SOP form: R*Wdilation + P*Wstride
      - [ [S, Hdilation], [Q, Hstride] ] # SOP form: S*Hdilation + Q*Hstride
    - name: Outputs
      projection:
      - [ [N] ]
      - [ [M] ]
      - [ [Q] ]
      - [ [P] ]
      read_write: True

  instance:
    C: 4  # inchn
    M: 8  # outchn
    R: 5   # filter height
    S: 5   # filter width
    P: 28  # ofmap height
    Q: 28  # ofmap width
    N: 50   # batch size
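
As a sanity check on this workload, the sketch below computes the tensor sizes and the total number of multiply-accumulates (MACs) implied by the projections and instance values above. It assumes the default stride and dilation of 1, so each input spatial extent is filter size plus output size minus 1.

```python
# Instance values copied from layer_shapes/conv2.yaml.
# Strides and dilations are assumed to keep their defaults of 1.
C, M, R, S, P, Q, N = 4, 8, 5, 5, 28, 28, 50

weights = C * M * R * S                     # projected onto C, M, R, S
inputs = N * C * (R + P - 1) * (S + Q - 1)  # sliding-window extents overlap
outputs = N * M * P * Q                     # projected onto N, M, Q, P
macs = C * M * R * S * P * Q * N            # one MAC per point in the iteration space

print(weights, inputs, outputs, macs)  # 800 204800 313600 31360000
```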



In [4]:
show_config('designs/system/map.yaml')

# Please do not modify this file. If there are double-curly-brace-enclosed
# statements, they are placeholders that should be set from the notebooks.
mapping:
- target: DRAM
  type: temporal
  factors: 
  - P=1
  - Q=1
  - R=1
  - S=1
  - N={{DRAM_factor_N}}
  - M={{DRAM_factor_M}}
  - C={{DRAM_factor_C}}
  permutation: [S, R, Q, P, C, M, N] # don't change this

- target: global_buffer
  type: temporal
  factors: 
  - P=1
  - Q=1
  - R=1
  - S=1
  - N={{global_buffer_factor_N}}
  - M={{global_buffer_factor_M}}
  - C={{global_buffer_factor_C}}
  permutation: [S, R, Q, P, C, M, N] # don't change this

- target: PE
  type: spatial  # spatial constraint specification
  factors: 
  - P=1
  - Q=1
  - R=1
  - S=1
  - N=1
  - M={{PE_spatial_factor_M}}
  - C={{PE_spatial_factor_C}}
  permutation: [C, M, R, S, P, Q, N]
  # tells at which index should the dimensions be mapped to Y (PE cols),
  # the dimensions before that index all should map to X (PE rows)
  split: 1

- target: scratchpad
  type

In [None]:
answer(
    question='2.2',
    subquestion=f'Which rank (e.g., C, M, or P) is mapped to the X dimension of the PE array? Case sensitive.',
    answer= 'C',
    required_type=('C', 'M', 'N', 'R', 'S', 'P', 'Q')
)
answer(
    question='2.2',
    subquestion=f'Which rank (e.g., C, M, or P) is mapped to the Y dimension of the PE array? Case sensitive.',
    answer= 'M',
    required_type=('C', 'M', 'N', 'R', 'S', 'P', 'Q')
)
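
The `split` comment in the mapping can be read as a list slice: ranks before the split index go to X, and the rest go to Y. The sketch below is only an illustration of that reading, with a hypothetical helper name, and is not Timeloop's implementation.

```python
def split_spatial_dims(permutation: list[str], split: int) -> tuple[list[str], list[str]]:
    """Partition a spatial permutation into X ranks and Y ranks at index `split`."""
    x_dims = permutation[:split]   # ranks before the split index map to X
    y_dims = permutation[split:]   # ranks from the split index onward map to Y
    return x_dims, y_dims


# Permutation and split copied from the PE level of designs/system/map.yaml.
x_dims, y_dims = split_spatial_dims(["C", "M", "R", "S", "P", "Q", "N"], split=1)
print(x_dims)  # ['C']
print(y_dims)  # ['M', 'R', 'S', 'P', 'Q', 'N']
```

Because every spatial factor other than M and C is fixed to 1 in this mapping, M is the only rank on the Y side that can actually occupy more than one PE.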

**For the rest of Part 2 (to the end of this notebook), we will assume a 1x16 PE array (the array shape is 1 in the X-dimension and 16 in the Y-dimension).**

In [5]:
ARCH_CONFIG = {'pe_meshX': 1, 'pe_meshY': 16}

### Question 3

We will look at the impact of PE utilization on latency, and how PE utilization depends on the mapping. Inspect the following mapping.

In [6]:
config_example = dict( # Do not change this configuration!
    DRAM_factor_N=50,
    DRAM_factor_M=2,
    DRAM_factor_C=4,
    global_buffer_factor_N=1,
    global_buffer_factor_M=4,
    global_buffer_factor_C=1,
    PE_spatial_factor_M=1,
    PE_spatial_factor_C=1,
    scratchpad_factor_N=1,
)

full_config = {
    **config_example,
    **ARCH_CONFIG
}

result = run_timeloop_model(
    full_config,
    architecture='designs/system/arch.yaml',
    mapping='designs/system/map.yaml',
    problem='layer_shapes/conv2.yaml'
)
stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
mapping = result.mapping
one_pe_latency = result.cycles
print(mapping)

[INFO] 2025-04-02 20:57:49,394 - pytimeloop.accelergy_interface - Running Accelergy with command: accelergy /home/workspace/lab3/output_dir/parsed-processed-input.yaml -o ./output_dir/ -v


DRAM [ Weights:800 (800) Inputs:204800 (204800) Outputs:313600 (313600) ] 
-------------------------------------------------------------------------
| for N in [0:50)
|   for M in [0:2)
|     for C in [0:4)

global_buffer [ Weights:100 (100) Inputs:1024 (1024) Outputs:3136 (3136) ] 
--------------------------------------------------------------------------
|       for M in [0:4)

inter_PE_spatial [ ] 
scratchpad [ Weights:25 (25) ] 
------------------------------
|         for R in [0:5)
|           for S in [0:5)
|             for P in [0:28)
|               for Q in [0:28)

weight_reg [ Weights:1 (1) ] 
input_activation_reg [ Inputs:1 (1) ] 
output_activation_reg [ Outputs:1 (1) ] 
---------------------------------------
|                 << Compute >>
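
One way to check a mapping like this is to confirm that, for each rank, the loop bounds across all levels multiply to the rank's size in the workload. The sketch below does this for the loop nest printed above. The loop bounds are transcribed by hand, and the variable names are illustrative.

```python
from collections import defaultdict

# (rank, bound) pairs transcribed from the printed loop nest, outermost first.
loops = [
    ("N", 50), ("M", 2), ("C", 4),            # DRAM level
    ("M", 4),                                  # global_buffer level
    ("R", 5), ("S", 5), ("P", 28), ("Q", 28),  # scratchpad level
]

# Rank sizes from layer_shapes/conv2.yaml.
shape = {"C": 4, "M": 8, "R": 5, "S": 5, "P": 28, "Q": 28, "N": 50}

# Multiply together every bound that iterates over the same rank.
covered = defaultdict(lambda: 1)
for rank, bound in loops:
    covered[rank] *= bound
covered = dict(covered)

print(covered == shape)  # True
```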



In [16]:
answer(
    question='2.3',
    subquestion=f'What is the PE utilization (number of utilized PEs divided by total number of PEs)?.',
    answer=0.0625,
    required_type=Number
)

2.3: What is the PE utilization (number of utilized PEs divided by total number of PEs)?.
	0.0625


As a result of this utilization, the mapping achieves the following latency.

In [17]:
one_pe_latency

31360000

In the following mapping, we map more of the M rank to the PE array.

In [18]:
config_example = dict( # Do not change this configuration!
    DRAM_factor_N=50,
    DRAM_factor_M=2,
    DRAM_factor_C=4,
    global_buffer_factor_N=1,
    global_buffer_factor_M=1,
    global_buffer_factor_C=1,
    PE_spatial_factor_M=4,
    PE_spatial_factor_C=1,
    scratchpad_factor_N=1,
)

full_config = {
    **config_example,
    **ARCH_CONFIG
}

result = run_timeloop_model(
    full_config,
    architecture='designs/system/arch.yaml',
    mapping='designs/system/map.yaml',
    problem='layer_shapes/conv2.yaml'
)
stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
mapping = result.mapping
four_pe_latency = result.cycles
print(mapping)

[INFO] 2025-04-02 21:36:06,433 - pytimeloop.accelergy_interface - Running Accelergy with command: accelergy /home/workspace/lab3/output_dir/parsed-processed-input.yaml -o ./output_dir/ -v


DRAM [ Weights:800 (800) Inputs:204800 (204800) Outputs:313600 (313600) ] 
-------------------------------------------------------------------------
| for N in [0:50)
|   for M in [0:2)
|     for C in [0:4)

global_buffer [ Weights:100 (100) Inputs:1024 (1024) Outputs:3136 (3136) ] 
inter_PE_spatial [ ] 
--------------------
|       for M in [0:4) (Spatial-Y)

scratchpad [ Weights:25 (25) ] 
------------------------------
|         for R in [0:5)
|           for S in [0:5)
|             for P in [0:28)
|               for Q in [0:28)

weight_reg [ Weights:1 (1) ] 
input_activation_reg [ Inputs:1 (1) ] 
output_activation_reg [ Outputs:1 (1) ] 
---------------------------------------
|                 << Compute >>



In [19]:
answer(
    question='2.3',
    subquestion=f'What is the PE utilization (number of utilized PEs divided by total number of PEs)?.',
    answer= 0.25,
    required_type=Number
)

2.3: What is the PE utilization (number of utilized PEs divided by total number of PEs)?.
	0.25
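
The utilization for this mapping can be reproduced from the spatial factors and the array shape. The sketch below assumes that the number of utilized PEs is the product of the spatial factors at the PE level. The function name is hypothetical.

```python
def pe_utilization(spatial_factor_m: int, spatial_factor_c: int,
                   mesh_x: int, mesh_y: int) -> float:
    """Fraction of PEs that receive work under the given spatial factors."""
    used = spatial_factor_m * spatial_factor_c  # PEs occupied by the spatial loops
    total = mesh_x * mesh_y                     # PEs available in the array
    if used > total:
        raise ValueError("spatial factors exceed the number of PEs")
    return used / total


# PE_spatial_factor_M=4 and PE_spatial_factor_C=1 on the 1x16 array.
print(pe_utilization(4, 1, mesh_x=1, mesh_y=16))  # 0.25
```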


In [20]:
four_pe_latency

7840000

In [21]:
print(f'As expected, this latency is {one_pe_latency/four_pe_latency} times lower.')

As expected, this latency is 4.0 times lower.
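
Both latencies can be reproduced by multiplying the temporal loop bounds in each printed loop nest. This is a simplified model that assumes one MAC per PE per cycle and no stalls. The spatial loop over M in the second mapping runs in parallel, so it is left out of the product.

```python
from math import prod

# Temporal loop bounds transcribed from the two printed loop nests.
temporal_one_pe = [50, 2, 4, 4, 5, 5, 28, 28]  # M in [0:4) is temporal here
temporal_four_pe = [50, 2, 4, 5, 5, 28, 28]    # M in [0:4) is spatial, so omitted

cycles_one_pe = prod(temporal_one_pe)
cycles_four_pe = prod(temporal_four_pe)

print(cycles_one_pe)                   # 31360000
print(cycles_four_pe)                  # 7840000
print(cycles_one_pe / cycles_four_pe)  # 4.0
```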


However, note that simply increasing the factor of the spatially mapped rank is not always possible.

In [None]:
answer(
    question='2.3',
    subquestion=f'What is the maximum factor of the spatially mapped C rank based on the PE array shape?',
    answer= 'FILL ME',
    required_type=int
)

Moreover, even if we can increase the factor of the spatially mapped rank, it does not always result in higher utilization.

In [None]:
answer(
    question='2.3',
    subquestion=f'Assuming a larger factor of the spatial loop is possible given the PE array shape, increasing the factor will not increase PE utilization if the workload were _____-bound.',
    answer= 'FILL ME',
    required_type=('computation', 'memory bandwidth')
)

### Question 4

Now, we look at how spatial mapping affects spatial data reuse. Again, we use the mappings from before.

In [None]:
config_example = dict( # Do not change this configuration!
    DRAM_factor_N=50,
    DRAM_factor_M=2,
    DRAM_factor_C=4,
    global_buffer_factor_N=1,
    global_buffer_factor_M=4,
    global_buffer_factor_C=1,
    PE_spatial_factor_M=1,
    PE_spatial_factor_C=1,
    scratchpad_factor_N=1,
)

full_config = {
    **config_example,
    **ARCH_CONFIG
}

result = run_timeloop_model(
    full_config,
    architecture='designs/system/arch.yaml',
    mapping='designs/system/map.yaml',
    problem='layer_shapes/conv2.yaml'
)
stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
mapping = result.mapping
print(mapping)
print(stats)

We will now find the number of reads from the `global_buffer` for each tensor under the above mapping by looking through the stats file printed above.

In [None]:
# Answer by setting these variables
one_pe_input_reads = # YOUR ANSWER HERE
one_pe_weight_reads = # YOUR ANSWER HERE
one_pe_output_reads = # YOUR ANSWER HERE
answer(
    question='2.4',
    subquestion=f'How many input reads does each utilized PE fetch from the global_buffer?',
    answer=one_pe_input_reads,
    required_type=Number
)
answer(
    question='2.4',
    subquestion=f'How many weight reads does each utilized PE fetch from the global_buffer?',
    answer=one_pe_weight_reads,
    required_type=Number
)
answer(
    question='2.4',
    subquestion=f'How many output reads does each utilized PE fetch from the global_buffer?',
    answer=one_pe_output_reads,
    required_type=Number
)

manual_one_pe_global_buffer_reads = one_pe_input_reads + one_pe_weight_reads + one_pe_output_reads
print('If you answered correctly, the following equality should hold.')
print(f'Is {manual_one_pe_global_buffer_reads} == {62446400.0}? {manual_one_pe_global_buffer_reads == 62446400.0}.')

As before, we increase the factor of M that is spatially mapped.

In [None]:
config_example = dict( # Do not change this configuration!
    DRAM_factor_N=50,
    DRAM_factor_M=2,
    DRAM_factor_C=4,
    global_buffer_factor_N=1,
    global_buffer_factor_M=1,
    global_buffer_factor_C=1,
    PE_spatial_factor_M=4,
    PE_spatial_factor_C=1,
    scratchpad_factor_N=1,
)

full_config = {
    **config_example,
    **ARCH_CONFIG
}

result = run_timeloop_model(
    full_config,
    architecture='designs/system/arch.yaml',
    mapping='designs/system/map.yaml',
    problem='layer_shapes/conv2.yaml'
)
stats = open('./output_dir/timeloop-model.stats.txt', 'r').read()
mapping = result.mapping
print(mapping)
print(stats)

In [None]:
# Answer by setting these variables
four_pe_input_reads = # YOUR ANSWER HERE
four_pe_weight_reads = # YOUR ANSWER HERE
four_pe_output_reads = # YOUR ANSWER HERE
answer(
    question='2.4',
    subquestion=f'How many input reads does each utilized PE fetch from the global_buffer now?',
    answer=four_pe_input_reads,
    required_type=Number
)
answer(
    question='2.4',
    subquestion=f'How many weight reads does each utilized PE fetch from the global_buffer now?',
    answer=four_pe_weight_reads,
    required_type=Number
)
answer(
    question='2.4',
    subquestion=f'How many output reads does each utilized PE fetch from the global_buffer now?',
    answer=four_pe_output_reads,
    required_type=Number
)

manual_four_pe_global_buffer_reads = four_pe_input_reads + four_pe_weight_reads + four_pe_output_reads

print('If you answered correctly, the following equality should hold.')
print(f'Is {manual_four_pe_global_buffer_reads} == {38926400.0}? {manual_four_pe_global_buffer_reads == 38926400.0}.')

As you can see, the number of input reads has decreased by

In [None]:
print(f'{one_pe_input_reads/four_pe_input_reads} times')

However, because input reads are only a subset of the total reads, the total reduction is only

In [None]:
print(f'{manual_one_pe_global_buffer_reads/manual_four_pe_global_buffer_reads} times')

This is because mapping the `M` rank spatially means that, instead of reading the
same input four times temporally, the input is read once from the global buffer
and multicast to the four PEs that use it.
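
The multicast effect can be cross-checked using only the two total read counts printed by the check cells above. The sketch below assumes that only input reads change between the two mappings, and that they shrink by exactly the multicast fan-out of 4. It is a consistency check under those assumptions, not a substitute for reading the stats file.

```python
total_reads_one_pe = 62_446_400   # check value from the first mapping's cell
total_reads_four_pe = 38_926_400  # check value from the second mapping's cell
fanout = 4                        # PE_spatial_factor_M in the second mapping

# Reads saved by multicasting inputs instead of re-reading them.
saved = total_reads_one_pe - total_reads_four_pe

# If inputs drop from I to I / fanout, then I - I / fanout == saved,
# so I == saved * fanout / (fanout - 1).
implied_input_reads_one_pe = saved * fanout // (fanout - 1)
implied_input_reads_four_pe = implied_input_reads_one_pe // fanout

overall_reduction = total_reads_one_pe / total_reads_four_pe

print(implied_input_reads_one_pe)   # 31360000
print(implied_input_reads_four_pe)  # 7840000
print(round(overall_reduction, 3))  # 1.604
```

Under these assumptions, input reads fall by a factor of 4 while total reads fall by only about 1.6, because the weight and output reads do not benefit from this particular multicast.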