# Dot Product

The following sequence of cells illustrates hardware processing of a dot product computation:

$$ Z_{m} = A_{k} \times B_{k} $$

<img align="center" src="figures/01.2.1.DUDU_setup.png" alt="DUDU_setup" style="width:70%">
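
As a software point of reference for the computation above, the following is a minimal sketch of a dot product written as one multiply-accumulate per index `k`. The function name `dot_product` and the example vectors are hypothetical, chosen only for illustration; they are not taken from the exercise's problem specification.

```python
def dot_product(a, b):
    """Accumulate a[k] * b[k] into a single output, one MAC per step."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    z = 0
    for k in range(len(a)):  # one multiply-accumulate per index k
        z += a[k] * b[k]
    return z


# Hypothetical example vectors (not from the exercise files).
A = [1, 0, 0, 2, 0, 0, 3, 0]
B = [4, 5, 6, 7, 8, 9, 1, 2]
print(dot_product(A, B))  # 1*4 + 2*7 + 3*1 = 21
```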

## Exercise 01.2.1 Uncompressed Dense Vectors 

### Understanding the inputs: Problem Specification

In [None]:
%%bash
cd ../designs/01.2.1-DUDU-dot-product/
cat prob/dot-product.prob.yaml

### Understanding the inputs: Architecture Specification

In [None]:
%%bash
cd ../designs/01.2.1-DUDU-dot-product/
cat arch/*.yaml

### Understanding the inputs: Mapping Specification

In [None]:
%%bash
cd ../designs/01.2.1-DUDU-dot-product/
cat map/*.yaml

### Run Example

In [None]:
%%bash
cd ../designs/01.2.1-DUDU-dot-product/
timeloop-model arch/*.yaml map/*.yaml prob/*.yaml -o output/

### Examine Important Stats: Storage Capacity & Accesses
Storage accesses fall into three types:
  - fills: initial writes of data into a storage level (zero for single-level storage, so not shown)
  - reads: data streamed out of the storage level
  - updates: writebacks of updated data
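
The read and update categories above can be sketched with a simple counting model for a dense dot product held in a single storage level. This is an illustration under stated assumptions, not Timeloop's accounting: it assumes every step reads `A[k]`, `B[k]`, and the running partial sum `Z`, then writes the partial sum back. The function name `count_dense_accesses` is hypothetical, and the tool's reported numbers may differ (for example, in how the first partial-sum read is treated).

```python
from collections import Counter


def count_dense_accesses(a, b):
    """Simplified read/update counts for a dense dot product, one storage level.

    Assumption: each step reads A[k], B[k] and the partial sum Z, then
    writes the new partial sum back as an update. Fills are not modeled.
    """
    reads = Counter()
    updates = Counter()
    z = 0
    for k in range(len(a)):
        reads["A"] += 1
        reads["B"] += 1
        reads["Z"] += 1    # read the running partial sum
        z += a[k] * b[k]
        updates["Z"] += 1  # write the updated partial sum back
    return z, dict(reads), dict(updates)


z, reads, updates = count_dense_accesses([1, 2, 3, 4], [1, 1, 1, 1])
print(z, reads, updates)
```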

In [None]:
%%bash
chmod 755 ../scripts/01.2.1-DUDU-dot-product-buffer-stats.sh
cd ../designs/01.2.1-DUDU-dot-product/output
../../../scripts/01.2.1-DUDU-dot-product-buffer-stats.sh

### Examine Important Stats: Computes

In [None]:
%%bash
chmod 755 ../scripts/01.2.1-DUDU-dot-product-compute-stats.sh
cd ../designs/01.2.1-DUDU-dot-product/output
../../../scripts/01.2.1-DUDU-dot-product-compute-stats.sh

### Examine Important Stats: Summary

In [None]:
%%bash
chmod 755 ../scripts/01.2.1-DUDU-dot-product-summary-stats.sh
cd ../designs/01.2.1-DUDU-dot-product/output
../../../scripts/01.2.1-DUDU-dot-product-summary-stats.sh

## Sparse Vector A

<img align="center" src="figures/01.2.2.SUDU_baseline_setup.png" alt="SU_C_DU_setup" style="width:70%">


## Exercise 01.2.2 Sparse Uncompressed A and Uncompressed Dense B
### Updated Inputs: Problem Specification Now Reflects Sparsity

In [None]:
%%bash
cd ../designs/01.2.2-SUDU-dot-product
grep "instance:" prob/*.yaml -A 10

### Run Example With All Other Inputs Unchanged

In [None]:
%%bash
cd ../designs/01.2.2-SUDU-dot-product
timeloop-model arch/*.yaml map/*.yaml prob/*.yaml -o output/no-optimization

### Examine Important Stats

You should see **no changes** in the runtime statistics. Although the sparsity of the A vector introduces potential savings, they cannot be exploited when A is stored in the default uncompressed representation and no sparse optimization is applied. The baseline architecture therefore behaves exactly as it did in the dense case.
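
A minimal sketch of why the statistics stay the same: without any sparse optimization, the baseline processes every position of the uncompressed A, so zeros cost as much as nonzeros. The function name `baseline_work` and the example vector are hypothetical and are not derived from the exercise's problem specification.

```python
def baseline_work(a):
    """Work done by an unoptimized baseline on an uncompressed vector.

    Returns (computes, ineffectual): every position costs one compute,
    and 'ineffectual' counts the computes whose A operand is zero.
    """
    computes = len(a)  # zeros are processed like any other value
    ineffectual = sum(1 for x in a if x == 0)
    return computes, ineffectual


# Hypothetical 8-element vector with 2 nonzero values.
computes, ineffectual = baseline_work([1, 0, 0, 0, 2, 0, 0, 0])
print(computes, ineffectual)  # 8 computes, 6 of them with a zero operand
```

The ineffectual computes are the potential savings; the baseline performs them anyway.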

In [None]:
%%bash
chmod 755 ../scripts/01.2.2-SUDU-dot-product-aggregated-stats.sh
cd ../designs/01.2.2-SUDU-dot-product/output/no-optimization
../../../../scripts/01.2.2-SUDU-dot-product-aggregated-stats.sh


### Additional Input: Sparse Optimization Feature Specification To Enable Gating
- Goal: Perform gating on B and Z based on the payloads in A

<img align="center" src="figures/01.2.2.SUDU_gating_setup.png" alt="figures/01.2.2.SUDU_gating_setup.png" style="width:80%">


In [None]:
%%bash
# sparse tensor instance
cd ../designs/01.2.2-SUDU-dot-product
cat sparse-opt/*.yaml 

### Run Example With Gating Applied
Note the additional `sparse-opt/*.yaml` input in the command.

In [None]:
%%bash
cd ../designs/01.2.2-SUDU-dot-product
timeloop-model arch/*.yaml map/*.yaml prob/*.yaml sparse-opt/*.yaml -o output/gating

### Examine Important Stats

- `A`: no change, since the A tensor is still stored in an uncompressed format and requires the same storage capacity.
- `B`: fewer actual reads; **gated reads** counts appear.
- `Z`: fewer actual reads and actual updates; **gated reads** and **gated updates** counts appear.
- `MAC`: fewer actual computes. Because of `gating` on `B`, no operands are sent to the `MAC` unit when `A` is empty (i.e., `NOT_EXIST` for both operands), so the number of computes is reduced by 75%.
- Utilization: lower. With fewer computes, the `MAC` unit stays idle for 75% of the cycles, during which `A` is still read from the `Buffer` (but `B` and `Z` are not accessed).
- Total number of cycles: no change.
- Total energy: reduced. Fewer accesses to `B` and `Z` and fewer computes both save energy.
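
The gating behavior described above can be sketched with a simplified counting model. It is an illustration under stated assumptions, not Timeloop's accounting: every position of A still costs one cycle and one read of A, while a zero in A gates the B read, the Z read and update, and the MAC for that cycle. The function name `gated_dot_product`, the statistic names, and the example vectors (2 nonzeros out of 8 in A) are hypothetical.

```python
def gated_dot_product(a, b):
    """Dot product with gating on B and Z, driven by the values of A."""
    stats = {
        "cycles": 0, "A_reads": 0,
        "B_reads": 0, "B_gated_reads": 0,
        "Z_reads": 0, "Z_gated_reads": 0,
        "Z_updates": 0, "Z_gated_updates": 0,
        "computes": 0,
    }
    z = 0
    for k in range(len(a)):
        stats["cycles"] += 1   # gating still spends the cycle
        stats["A_reads"] += 1  # A is uncompressed, so it is always read
        if a[k] == 0:
            # Zero operand: B and Z accesses are gated and the MAC idles.
            stats["B_gated_reads"] += 1
            stats["Z_gated_reads"] += 1
            stats["Z_gated_updates"] += 1
            continue
        stats["B_reads"] += 1
        stats["Z_reads"] += 1
        z += a[k] * b[k]
        stats["computes"] += 1
        stats["Z_updates"] += 1
    cycles = stats["cycles"]
    stats["utilization"] = stats["computes"] / cycles if cycles else 0.0
    return z, stats


z, stats = gated_dot_product([1, 0, 0, 0, 2, 0, 0, 0], [1, 2, 3, 4, 5, 6, 7, 8])
print(z, stats["cycles"], stats["computes"], stats["utilization"])
```

In this model the cycle count equals the length of A regardless of its contents, which is why gating saves energy but not time.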

In [None]:
%%bash
chmod 755 ../scripts/01.2.2-SUDU-dot-product-gating-aggregated-stats.sh
cd ../designs/01.2.2-SUDU-dot-product/output/gating
../../../../scripts/01.2.2-SUDU-dot-product-gating-aggregated-stats.sh

## Exercise 01.2.3 Sparse Compressed A and Dense Uncompressed B


<img align="center" src="figures/01.2.3.SCDU_skipping_setup.png" alt="figures/01.2.3.SCDU_skipping_setup.png" style="width:80%">



### Updated Input: Sparse Optimization Specification
- Represent sparse A in a compressed format: coordinate payload
- Perform skipping based on the compressed A
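
As a rough illustration of the coordinate-payload idea, the sketch below stores only the nonzero values of a vector (the payloads) together with their positions (the coordinates). The function name `compress_coordinate_payload` and the example vector are hypothetical, and the layout shown is a simplification rather than the exact format used by the tool.

```python
def compress_coordinate_payload(dense):
    """Return (coordinates, payloads) for the nonzero entries of a vector."""
    coordinates = [k for k, x in enumerate(dense) if x != 0]
    payloads = [dense[k] for k in coordinates]
    return coordinates, payloads


# Hypothetical 8-element vector with 2 nonzero values.
coords, payloads = compress_coordinate_payload([3, 0, 0, 0, 7, 0, 0, 0])
print(coords, payloads)  # [0, 4] [3, 7]
```

Only two payloads and two coordinates are stored here instead of eight values, at the cost of keeping the coordinates as metadata.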

In [None]:
%%bash
# sparse tensor instance
cd ../designs/01.2.3-SCDU-dot-product
cat sparse-opt/*.yaml

### Updated Input: Architecture Specification
Extra metadata storage is needed to hold the *coordinates* used by the compressed representation format.

In [None]:
%%bash
cd ../designs/01.2.3-SCDU-dot-product
cat arch/*.yaml

### Run Example with Updated Sparse Optimization Specification and Architecture Specification

In [None]:
%%bash

# Run example
cd ../designs/01.2.3-SCDU-dot-product
timeloop-model arch/*.yaml components/* map/*.yaml prob/*.yaml sparse-opt/*.yaml -o output/

### Examine Important Stats

- `A` related:
  - Reduced capacity requirement and fewer accesses, due to the compressed data representation format.
  - Extra metadata storage overhead and extra metadata access overhead. Note that the number of metadata units equals the number of nonzero values in A.
- `B`: fewer actual reads; **skipped reads** counts appear due to the explicit skipping optimization.
- `Z`: fewer actual reads and actual updates; **skipped reads** and **skipped updates** counts appear due to the explicit skipping optimization.
- `MAC`: fewer actual computes. Because of `skipping` on `B`, no operands are sent to the `MAC` unit when `A` is empty (i.e., `NOT_EXIST` for both operands), so the number of computes is reduced by 75%.
- Full utilization at 1.0. Since both computes and storage accesses are skipped, the `MAC` unit never sits idle, and the total number of cycles is reduced as well.
- Total number of cycles reduced.
- Total energy reduced.
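
The skipping behavior described above can be sketched with a simplified counting model that iterates only over the stored nonzeros of A. It is an illustration under stated assumptions, not Timeloop's accounting: each nonzero costs one cycle, one metadata (coordinate) read, one payload read, one B read, and one Z read and update, and positions of B and Z that are never visited are counted as skipped. The function name `skipping_dot_product`, the statistic names, and the example data are hypothetical.

```python
def skipping_dot_product(coordinates, payloads, b):
    """Dot product that visits only the nonzero entries of a compressed A."""
    stats = {
        "cycles": 0, "A_metadata_reads": 0, "A_reads": 0,
        "B_reads": 0, "Z_reads": 0, "Z_updates": 0, "computes": 0,
    }
    z = 0
    for k, a_val in zip(coordinates, payloads):
        stats["cycles"] += 1            # only nonzeros consume a cycle
        stats["A_metadata_reads"] += 1  # coordinate tells us which B[k] to fetch
        stats["A_reads"] += 1
        stats["B_reads"] += 1
        stats["Z_reads"] += 1
        z += a_val * b[k]
        stats["computes"] += 1
        stats["Z_updates"] += 1
    # Positions never visited count as skipped accesses.
    skipped = len(b) - stats["B_reads"]
    stats["B_skipped_reads"] = skipped
    stats["Z_skipped_reads"] = skipped
    stats["Z_skipped_updates"] = skipped
    cycles = stats["cycles"]
    stats["utilization"] = stats["computes"] / cycles if cycles else 0.0
    return z, stats


# Hypothetical compressed A: nonzeros 3 and 7 at coordinates 0 and 4.
z, stats = skipping_dot_product([0, 4], [3, 7], [1, 2, 3, 4, 5, 6, 7, 8])
print(z, stats["cycles"], stats["A_metadata_reads"], stats["utilization"])
```

Unlike gating, the cycle count here tracks the number of nonzeros rather than the vector length, so the compute unit is busy in every cycle it spends.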

In [None]:
%%bash
chmod 755 ../scripts/01.2.3-SCDU-dot-product-aggregated-stats.sh
cd ../designs/01.2.3-SCDU-dot-product/output/
../../../scripts/01.2.3-SCDU-dot-product-aggregated-stats.sh