This project implements a high-performance systolic array accelerator for matrix multiplication, designed specifically for Small Language Model (SLM) inference on RISC-V edge devices. The accelerator provides hardware acceleration for GEMM (General Matrix Multiply) operations, significantly reducing inference latency on resource-constrained platforms.
- INT8 Systolic Array Accelerators: 16×16 and 32×32 configurations
- Tiling Architecture: Support for large matrices using hardware tiles (e.g., 64×64 matrices on 16×16 hardware)
- AXI4 Interface: Full AXI4-Stream and AXI4-Lite integration for SoC deployment
- Scalable Designs: Parameterized implementations supporting INT4, INT8, and FP16 datatypes
- Bare-Metal Application: Reference C implementation for RISC-V processors
- Comprehensive Testbenches: SystemVerilog testbenches with full verification
Figure 1: Complete system architecture showing RISC-V processor integration with systolic accelerator
Figure 2: Inference engine dataflow and accelerator offload mechanism
Figure 3: Accelerator IP block diagram with AXI interfaces
The core architecture uses a 2D grid of Processing Elements (PEs) arranged in a systolic configuration:
```
┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
│ PE  │ → │ PE  │ → │ PE  │ → │ PE  │ →
└──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
   ↓         ↓         ↓         ↓
┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
│ PE  │ → │ PE  │ → │ PE  │ → │ PE  │ →
└──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
   ↓         ↓         ↓         ↓
```
Figure 4: Detailed systolic array architecture showing PE interconnections and data flow
Each Processing Element:
- Performs signed INT8×INT8 multiplication
- Accumulates partial sums in 32-bit registers
- Forwards data to neighboring PEs
- Operates in a fully pipelined manner
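In software terms, a single PE's per-cycle behavior can be sketched as follows (an illustrative C model of the datapath, not the RTL; the names `pe_t` and `pe_step` are invented for this sketch):

```c
#include <stdint.h>

/* Software model of one Processing Element. The RTL registers all of
 * these values; here they are plain struct fields. */
typedef struct {
    int32_t acc;    /* 32-bit partial-sum accumulator          */
    int8_t  a_reg;  /* activation forwarded west -> east       */
    int8_t  b_reg;  /* weight forwarded north -> south         */
} pe_t;

/* One clock tick: multiply the incoming operands, accumulate, and
 * latch both operands so neighboring PEs see them next cycle. */
static void pe_step(pe_t *pe, int8_t a_in, int8_t b_in) {
    pe->acc  += (int32_t)a_in * (int32_t)b_in; /* signed INT8×INT8 MAC */
    pe->a_reg = a_in;  /* east neighbor's input on the next cycle  */
    pe->b_reg = b_in;  /* south neighbor's input on the next cycle */
}
```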
For matrices larger than the hardware array size, the accelerator implements a tiling subsystem that:
- Decomposes large matrices into smaller tiles (e.g., 64×64 → 16×16 tiles)
- Schedules tile processing using an FSM-based controller
- Generates tile addresses automatically for A, B, and C matrices
- Accumulates partial results across tiles to produce final output
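The schedule above can be sketched in C as a loop nest mirroring the FSM's tile order (an illustrative model only, assuming square 64×64 matrices on 16×16 tiles; `tiled_gemm` is a name invented for this sketch):

```c
#include <stdint.h>

#define N_FULL 64  /* full matrix dimension (square, for the sketch) */
#define T      16  /* hardware tile dimension                        */

/* For each (ii, ji) output tile, sweep ki and accumulate partial
 * products. The first k-tile (ki == 0) overwrites C; later k-tiles
 * accumulate, matching the FSM's accumulate-vs-overwrite control. */
static void tiled_gemm(const int8_t A[N_FULL][N_FULL],
                       const int8_t B[N_FULL][N_FULL],
                       int32_t C[N_FULL][N_FULL]) {
    for (int ii = 0; ii < N_FULL; ii += T)
        for (int ji = 0; ji < N_FULL; ji += T)
            for (int ki = 0; ki < N_FULL; ki += T)
                for (int i = ii; i < ii + T; i++)
                    for (int j = ji; j < ji + T; j++) {
                        int32_t acc = (ki == 0) ? 0 : C[i][j];
                        for (int k = ki; k < ki + T; k++)
                            acc += (int32_t)A[i][k] * (int32_t)B[k][j];
                        C[i][j] = acc;
                    }
}
```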
Figure 5: Tiling architecture demonstrating how 32×32 matrices are processed using 16×16 hardware tiles
See docs/TILING_GUIDE.md for detailed tiling subsystem documentation.
```
SLM_Accelerator/
├── Accelerator_IP/
│   ├── Gemma_Accelerator_IP/
│   │   ├── INT8_16x16/
│   │   │   ├── src/
│   │   │   └── tb/
│   │   ├── INT8_32x32/
│   │   │   ├── src/
│   │   │   └── tb/
│   │   └── Tiling/
│   └── Systolic_array_IP/
│       ├── not_scalable_systolic_matmul/
│       ├── scalable_systolic_matmul/
│       │   ├── with_buffers/
│       │   └── without_buffers/
│       └── scalable_systolic_matmul_axi/
│           ├── INT4_INT4/
│           └── f16_INT4/
├── Application/
│   └── INT8_16x16/
├── Results/
├── docs/
├── .gitignore
└── README.md
```
Top-level accelerator module with:
- AXI4-Lite control interface
- AXI4 memory master interface
- Tiling control logic
- Double buffering for input/output
Systolic array grid instantiations:
- Grid of PE modules
- Data flow orchestration (north → south, west → east)
- Pipelined output collection
Processing Element implementation:
- INT8×INT8 signed multiplication
- 32-bit accumulation with saturation protection
- Registered data forwarding
Input/output buffering:
- A, B matrix buffers
- Result accumulation buffer
- Burst-friendly memory interface
FSM-based tile iteration controller:
- Computes tile loops for K, I, J dimensions
- Generates tile coordinates
- Controls accumulation vs. overwrite for result tiles
- Asserts a done signal when all tiles have completed
Combinational address calculator:
- Converts tile coordinates (ki, ii, ji) to memory addresses
- Handles row-major matrix layouts
- Supports configurable tile and matrix sizes
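A C model of the address math might look like this (a sketch; `tile_addrs` and its argument order are assumptions, and the results are element offsets in row-major order, so byte addresses would additionally scale by element size):

```c
#include <stdint.h>

typedef struct { uint32_t a, b, c; } tile_addr_t;

/* Convert tile coordinates (ii, ji, ki) into element offsets for
 * T×T tiles of row-major A (M×K), B (K×N), and C (M×N). */
static tile_addr_t tile_addrs(uint32_t ii, uint32_t ji, uint32_t ki,
                              uint32_t T, uint32_t N, uint32_t K,
                              uint32_t a_base, uint32_t b_base,
                              uint32_t c_base) {
    tile_addr_t out;
    out.a = a_base + (ii * T) * K + ki * T; /* A tile at (row ii*T, col ki*T) */
    out.b = b_base + (ki * T) * N + ji * T; /* B tile at (row ki*T, col ji*T) */
    out.c = c_base + (ii * T) * N + ji * T; /* C tile at (row ii*T, col ji*T) */
    return out;
}
```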
```shell
cd Accelerator_IP/Gemma_Accelerator_IP/INT8_16x16/tb

# Using Vivado
vivado -mode batch -source sim_script.tcl

# Using ModelSim/QuestaSim
vsim -do "do compile.do; do simulate.do"
```

```shell
cd Accelerator_IP/Gemma_Accelerator_IP/Tiling

# Compile and simulate with your preferred simulator
iverilog -g2012 -o sim *.v gemm_tb.sv
vvp sim
```

The accelerator is designed to integrate with RISC-V SoCs via AXI interfaces:
- Instantiate the IP in your SoC design
- Connect AXI4-Lite to the CPU's control bus
- Connect AXI4 Master to system memory interconnect
- Assign memory-mapped registers for accelerator control
Control Registers (via AXI4-Lite):
| Offset | Register                          |
|--------|-----------------------------------|
| 0x00   | Control (start, reset)            |
| 0x04   | Status (done, busy)               |
| 0x08   | Matrix A base address             |
| 0x0C   | Matrix B base address             |
| 0x10   | Matrix C base address             |
| 0x14   | Matrix dimensions (M, N, K)       |
| 0x18   | Tile configuration (tiling mode)  |
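Driving the register map above from bare-metal C might look like the following sketch (the register names, the `ACC_CTRL_START` bit, and the 10-bit M/N/K packing are assumptions for illustration; consult the RTL for the real encoding):

```c
#include <stdint.h>

/* Offsets from the register map (names assumed here). */
enum {
    ACC_CTRL   = 0x00,  /* start, reset */
    ACC_STATUS = 0x04,  /* done, busy   */
    ACC_A_BASE = 0x08,
    ACC_B_BASE = 0x0C,
    ACC_C_BASE = 0x10,
    ACC_DIMS   = 0x14,  /* M, N, K packed (assumed 10 bits each) */
    ACC_TILE   = 0x18
};

#define ACC_CTRL_START 0x1u  /* assumed start bit */

static inline void reg_write(volatile uint32_t *base, uint32_t off,
                             uint32_t v) {
    base[off / 4] = v;  /* word-addressed MMIO write */
}

/* Program one GEMM job, then pulse the start bit. */
static void acc_start_job(volatile uint32_t *base,
                          uint32_t a, uint32_t b, uint32_t c,
                          uint32_t M, uint32_t N, uint32_t K) {
    reg_write(base, ACC_A_BASE, a);
    reg_write(base, ACC_B_BASE, b);
    reg_write(base, ACC_C_BASE, c);
    reg_write(base, ACC_DIMS, (M << 20) | (N << 10) | K);
    reg_write(base, ACC_CTRL, ACC_CTRL_START);
}
```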
```c
// Initialize accelerator
void gemma_acc_init(uint32_t base_addr);

// Offload matrix multiplication
void gemma_matmul(int8_t *A, int8_t *B, int32_t *C,
                  uint32_t M, uint32_t N, uint32_t K);

// Check completion
bool gemma_acc_done();
```

See Application/INT8_16x16/ for the full implementation.
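A host-side golden model is a convenient way to check results returned by `gemma_matmul` (a minimal sketch; `gemm_ref` is not part of the shipped application):

```c
#include <stdint.h>

/* Golden-model GEMM: C = A×B with INT8 inputs, INT32 accumulation,
 * and row-major layouts, matching the accelerator's output format. */
static void gemm_ref(const int8_t *A, const int8_t *B, int32_t *C,
                     uint32_t M, uint32_t N, uint32_t K) {
    for (uint32_t i = 0; i < M; i++)
        for (uint32_t j = 0; j < N; j++) {
            int32_t acc = 0;
            for (uint32_t k = 0; k < K; k++)
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            C[i * N + j] = acc;
        }
}
```

Comparing `C` from the accelerator against `gemm_ref` element-by-element catches both datapath and addressing bugs.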
- High throughput: All PEs compute in parallel
- Efficient data reuse: Each data element used multiple times
- Simple control: Regular, predictable dataflow
- Scalable: Easy to increase array size for more performance
- Large matrix support: Process matrices larger than hardware array
- Reduced memory bandwidth: Tile-level data reuse
- Flexible deployment: Same hardware supports multiple problem sizes
- Energy efficiency: Minimizes DRAM accesses
- Memory savings: 2× vs. FP16, 4× vs. FP32
- Higher compute density: More operations per cycle
- Acceptable accuracy for many neural network workloads
- Simpler arithmetic: Lower area and power
- Throughput: 256 INT8 MAC operations per cycle on the 16×16 array (1,024 on the 32×32 array), peak
- Latency: 16 cycles (16×16) or 32 cycles (32×32) of array fill time, plus computation cycles
- Frequency: Targets 100-200 MHz on modern FPGAs
- 16×16 array: ~15K LUTs, ~8K FFs, 512 DSP48 slices
- 32×32 array: ~60K LUTs, ~32K FFs, 2048 DSP48 slices
All modules include comprehensive testbenches:
- Unit tests: Individual PE verification
- Integration tests: Full systolic array testing
- System tests: End-to-end matrix multiplication
- Tiling tests: Large matrix decomposition and accumulation
Test cases cover:
- Square and non-square matrices
- Various data patterns (zeros, ones, random, checkerboard)
- Edge cases (1×1, 2×2, misaligned sizes)
- Performance benchmarks
Figure 6: Simulation waveform showing data propagation through the systolic array
Figure 7: Simulation waveform showing output results and accumulation
Figure 8: Accelerator integrated with RISC-V SoC on FPGA
Figure 9: Performance comparison showing CPU vs. accelerator execution times
Figure 10: Execution log demonstrating tiled matrix multiplication (32×32 using 16×16 tiles)
- FP16 systolic array implementation
- Mixed-precision support (FP16 activations × INT8 weights)
- Sparse matrix acceleration
- Multi-array configurations for higher throughput
- Integration with full SLM inference pipeline
- Power optimization and clock gating
- Support for non-square tile sizes
- Google's TPU systolic array architecture
- Xilinx Deep Learning Processor (DPU) design principles
- RISC-V Vector Extension specifications
- Quantization techniques for neural networks
This project is provided for educational and research purposes.
- Bavan2002 - Tiling subsystem implementation and integration
- Friend - Systolic array core development and application framework
Special thanks to the RISC-V community and open-source hardware initiatives that made this project possible.