AaronJing/Gemmini-SEA

Gemmini-SEA

Gemmini-SEA is a weight-stationary systolic array implementation based on Sign-Separated Accumulation (SEA). The design targets the resource efficiency of floating-point (FP) addition and accumulation, which are crucial operations in DNN accelerators. Unlike traditional accumulation, Gemmini-SEA accumulates same-signed terms separately using efficient same-signed FP adders, then performs a final addition of the oppositely signed sub-accumulations. This substantially improves overall resource efficiency while maintaining accuracy, since no approximation is introduced.
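As a rough software illustration of the idea (not the hardware implementation), the sketch below keeps one accumulator for non-negative terms and one for negative terms, and combines them with a single mixed-sign addition at the end. The function name `sign_separated_sum` is hypothetical and chosen only for this example.

```python
import math


def sign_separated_sum(terms):
    """Accumulate non-negative and negative terms separately, then combine once."""
    pos_acc = 0.0    # only ever receives terms with a clear sign bit
    neg_acc = -0.0   # only ever receives terms with a set sign bit
    for t in terms:
        if math.copysign(1.0, t) > 0:
            pos_acc += t   # same-signed addition
        else:
            neg_acc += t   # same-signed addition
    # The only addition of oppositely signed operands happens here.
    return pos_acc + neg_acc


terms = [1.5, -2.25, 0.75, -0.5, 3.0]
print(sign_separated_sum(terms))  # prints 2.5
```

Every addition inside the loop has operands of the same sign, which is the property the same-signed FP adders rely on.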

Design Overview

Original Weight Stationary Systolic Array

Alt text for image 1 Alt text for image 2

In the systolic array, we perform matrix multiplication $C = A*B + D$, where $A$, $B$, and $D$ represent activations, weights, and partial sums, respectively. In a weight stationary systolic array, the weights are preloaded into the array before the actual computation begins.

On the right side of the array, each processing element (PE) starts by loading its weight, denoted as $b$, into a specific register. Once all the weights are loaded, each PE then begins to process input activations, referred to as $a$. These activations are either passed from the adjacent left PE or come directly from the primary input for PEs at the left edge. Simultaneously, each PE receives partial sums, labeled as $d$, from the PE above or directly from the main input for the top row of PEs.

As the computation progresses, each PE transmits its received $a$ and $d$ values to the neighboring PE on its right and below, respectively. Concurrently, it calculates the product of its current input activation $a$ and the pre-stored weight $b$. This product is then added to the received partial sum $d$ to generate a new partial sum, also labeled as $d$. This newly computed $d$ is then sent down to the PE located directly below.

Note that the blue FP adder in each PE is a generic FP adder.
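The dataflow described above can be sketched as a small behavioral model. This is a functional sketch only, assuming plain Python floats and ignoring pipeline timing; the names `pe` and `ws_matmul` are hypothetical and do not come from the Gemmini sources.

```python
def pe(a, b, d):
    """One PE step: forward a to the right, send a * b + d downwards."""
    return a, a * b + d


def ws_matmul(A, B, D):
    """Compute C = A*B + D the way a weight-stationary array would.

    The PE at (k, j) holds weight B[k][j]. For every row i of A, the
    partial sum D[i][j] enters column j at the top and flows down
    through the PEs k = 0 .. inner-1.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            d = D[i][j]                      # partial sum entering the top row
            for k in range(inner):
                _, d = pe(A[i][k], B[k][j], d)
            C[i][j] = d                      # value leaving the bottom row
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
D = [[1.0, 1.0], [1.0, 1.0]]
print(ws_matmul(A, B, D))  # prints [[20.0, 23.0], [44.0, 51.0]]
```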

SEA-based Weight Stationary Systolic Array

Alt text for image 1 Alt text for image 2

In our SEA-based systolic array, quantities with the same sign are accumulated separately. Each PE therefore handles and forwards two kinds of partial sums, labeled $d$ and $d'$. These partial sums have opposite signs and flow from top to bottom in the array. As shown in our diagram, every PE has two partial-sum inputs, $d$ and $d'$, and two pipeline registers that pass them along. At the very bottom of the array, a generic FP adder adds $d$ and $d'$ together to produce the final result. An initialization stage at the top of the systolic array makes sure the first row of PEs starts with two inputs.

Alt text for image 3

Each PE takes in two distinct partial sums ($d$ and $d'$) with opposite signs. Inside each PE, you'll find a multiplier, a register holding $b$, and a same-signed FP adder. A key part of this design is the swapping mechanism, made up of two multiplexers and an XOR gate. It makes sure that the partial sum fed to the same-signed adder matches the sign of the product of the input activation and weight ($a \times b$). The other partial sum doesn't change. So, the updated $d$ is either $a \times b + d$ or $a \times b + d'$, depending on which partial sum has the same sign as the product. The unused partial sum takes a bypass path and is sent as the new $d'$ to the next PE down the line. This way, the $d'$ that gets passed on always has the opposite sign to the $a \times b$ being processed.
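The swap-and-bypass behavior can be modelled in a few lines. This is a behavioral sketch under stated assumptions, not the Chisel implementation: the helper names (`sign_bit`, `init_partial_sums`, `sea_pe`, `sea_column`) are hypothetical, and the initialization is assumed to pair the incoming partial sum with an oppositely signed zero, which the text above does not specify.

```python
import math


def sign_bit(x):
    """Return the FP sign bit (1 for negative values and -0.0, else 0)."""
    return 1 if math.copysign(1.0, x) < 0 else 0


def init_partial_sums(d_in):
    """Assumed initialization: pair d_in with a zero of the opposite sign."""
    return d_in, math.copysign(0.0, -d_in)


def sea_pe(a, b, d, d_prime):
    """One SEA PE step. d and d_prime must carry opposite sign bits."""
    product = a * b
    # XOR of the sign bits: 1 means d has the wrong sign, so the muxes swap.
    swap = sign_bit(product) ^ sign_bit(d)
    same, other = (d_prime, d) if swap else (d, d_prime)
    assert sign_bit(same) == sign_bit(product)  # same-signed adder precondition
    # New d comes from the same-signed adder; new d' is the bypassed value.
    return product + same, other


def sea_column(activations, weights, d_in):
    """Run one column of PEs, then apply the generic FP adder at the bottom."""
    d, d_prime = init_partial_sums(d_in)
    for a, b in zip(activations, weights):
        d, d_prime = sea_pe(a, b, d, d_prime)
    return d + d_prime


print(sea_column([1.0, -2.0, 3.0, -0.5], [2.0, 1.5, 1.0, 4.0], 0.25))  # prints 0.25
```

In this model the outgoing $d$ always has the sign of the latest product, so the two outputs keep opposite sign bits from one PE to the next.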

This design does not introduce any approximation; the only small differences come from FP operations not being associative. Even though it looks like we've increased the number of logic components and FP adders, our design actually leads to lower ADP and energy. Please check our paper for details.

Getting started

Dependencies

Our implementation is based on Gemmini V0.6.4.

Installation

git clone git@github.com:AaronJing/Chipyard-SEA.git
cd Chipyard-SEA
git checkout sea
./scripts/init-submodules-no-riscv-tools.sh
./scripts/build-toolchains.sh esp-tools
source env.sh

cd generators/gemmini
git fetch && git checkout sea
git submodule update

cd -
cd toolchains/esp-tools/riscv-isa-sim/build
git fetch && git checkout sea
make && make install

Verify Installation

cd Chipyard-SEA/generators/gemmini
./scripts/setup-paths.sh
./scripts/build-verilator.sh
cd software/gemmini-rocc-tests
./build.sh
cd -
./scripts/run-verilator.sh template

The run should produce output without any errors.

Running Baremetal test using Verilator

You can generate the SEA-based implementation by modifying configs/GemminiCustomConfigs.scala:

sea = true,
samesigned = true,

Or generate the original implementation:

sea = false,
samesigned = false,

Then, run the baremetal test matmul_ws_sea. Note that the inputType, spatialArrayOutputType, and accType of matmul_ws_sea are BF16, FP32, and FP32, respectively. If you generate Gemmini with other data types, this test cannot be run.

matmul_ws_sea contains 100 GEMM tests; each test multiplies two BF16 4-by-4 matrices and outputs one 4-by-4 matrix.

./scripts/build-verilator.sh
cd software/gemmini-rocc-tests
./build.sh
cd -
./scripts/run-verilator.sh matmul_ws_sea
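For orientation, the shape of one such test case can be modelled in software. The sketch below is not the repository's test code: `to_bf16`, `to_fp32`, and `gemm_4x4` are hypothetical helpers, inputs are assumed to be rounded to BF16 with round-to-nearest-even, and products and accumulations are assumed to be rounded to FP32. It accumulates sequentially, so a SEA-based array may differ in the last bits because FP addition is not associative.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to the nearest BF16 value (ties to even). NaN/inf are not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)        # round-to-nearest-even bias
    bits = (bits + bias) & 0xFFFF0000         # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gemm_4x4(A, B):
    """4-by-4 GEMM with BF16 inputs and sequential FP32 accumulation."""
    n = 4
    A = [[to_bf16(v) for v in row] for row in A]
    B = [[to_bf16(v) for v in row] for row in B]
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc = to_fp32(acc + to_fp32(A[i][k] * B[k][j]))
            C[i][j] = acc
    return C


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
M = [[0.5 * (i + j) for j in range(4)] for i in range(4)]
print(gemm_4x4(identity, M) == M)  # prints True
```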

Cite us if it helps your research :)

@INPROCEEDINGS{gong2024,
  author={Gong, Jing and Saadat, Hassaan and Javaid, Haris and Gamaarachchi, Hasindu and Taubman, David and Parameswaran, Sri},
  booktitle={To appear: 2024 Design Automation and Test in Europe (DATE)}, 
  title={SEA: Sign-Separated Accumulation Scheme for Resource-Efficient DNN Accelerators}, 
  year={2024}}
