This project implements a high-performance systolic array accelerator for matrix multiplication, designed specifically for Small Language Model (SLM) inference on RISC-V edge devices. The accelerator provides hardware acceleration for GEMM (General Matrix Multiply) operations, significantly reducing inference latency on resource-constrained platforms.
- INT8 Systolic Array Accelerators: 16×16 and 32×32 configurations
- Tiling Architecture: Support for large matrices using hardware tiles (e.g., 64×64 matrices on 16×16 hardware)
- AXI4 Interface: Full AXI4-Stream and AXI4-Lite integration for SoC deployment
- Scalable Designs: Parameterized implementations supporting INT4, INT8, and FP16 datatypes
- Bare-Metal Application: Reference C implementation for RISC-V processors
- Comprehensive Testbenches: SystemVerilog testbenches with full verification
Figure 1: Complete system architecture showing RISC-V processor integration with systolic accelerator
Figure 2: Inference engine dataflow and accelerator offload mechanism
Figure 3: Accelerator IP block diagram with AXI interfaces
The core architecture uses a 2D grid of Processing Elements (PEs) arranged in a systolic configuration:
```
┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
│ PE  │ → │ PE  │ → │ PE  │ → │ PE  │ →
└──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
   ↓         ↓         ↓         ↓
┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
│ PE  │ → │ PE  │ → │ PE  │ → │ PE  │ →
└──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
   ↓         ↓         ↓         ↓
```
Figure 4: Detailed systolic array architecture showing PE interconnections and data flow
Each Processing Element:
- Performs signed INT8×INT8 multiplication
- Accumulates partial sums in 32-bit registers
- Forwards data to neighboring PEs
- Operates in a fully pipelined manner
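In software terms, a single PE's per-cycle behavior can be sketched as follows (an illustrative C model of the datapath, not the RTL; the names `pe_t` and `pe_step` are invented for this sketch):

```c
#include <stdint.h>

/* Software model of one Processing Element. The RTL registers all of
 * these values; here they are plain struct fields. */
typedef struct {
    int32_t acc;    /* 32-bit partial-sum accumulator          */
    int8_t  a_reg;  /* activation forwarded west -> east       */
    int8_t  b_reg;  /* weight forwarded north -> south         */
} pe_t;

/* One clock tick: multiply the incoming operands, accumulate, and
 * latch both operands so neighboring PEs see them next cycle. */
static void pe_step(pe_t *pe, int8_t a_in, int8_t b_in) {
    pe->acc  += (int32_t)a_in * (int32_t)b_in; /* signed INT8×INT8 MAC */
    pe->a_reg = a_in;  /* east neighbor's input on the next cycle  */
    pe->b_reg = b_in;  /* south neighbor's input on the next cycle */
}
```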
For matrices larger than the hardware array size, the accelerator implements a tiling subsystem that:
- Decomposes large matrices into smaller tiles (e.g., 64×64 → 16×16 tiles)
- Schedules tile processing using an FSM-based controller
- Generates tile addresses automatically for A, B, and C matrices
- Accumulates partial results across tiles to produce final output
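The schedule above can be sketched in C as a loop nest mirroring the FSM's tile order (an illustrative model only, assuming square 64×64 matrices on 16×16 tiles; `tiled_gemm` is a name invented for this sketch):

```c
#include <stdint.h>

#define N_FULL 64  /* full matrix dimension (square, for the sketch) */
#define T      16  /* hardware tile dimension                        */

/* For each (ii, ji) output tile, sweep ki and accumulate partial
 * products. The first k-tile (ki == 0) overwrites C; later k-tiles
 * accumulate, matching the FSM's accumulate-vs-overwrite control. */
static void tiled_gemm(const int8_t A[N_FULL][N_FULL],
                       const int8_t B[N_FULL][N_FULL],
                       int32_t C[N_FULL][N_FULL]) {
    for (int ii = 0; ii < N_FULL; ii += T)
        for (int ji = 0; ji < N_FULL; ji += T)
            for (int ki = 0; ki < N_FULL; ki += T)
                for (int i = ii; i < ii + T; i++)
                    for (int j = ji; j < ji + T; j++) {
                        int32_t acc = (ki == 0) ? 0 : C[i][j];
                        for (int k = ki; k < ki + T; k++)
                            acc += (int32_t)A[i][k] * (int32_t)B[k][j];
                        C[i][j] = acc;
                    }
}
```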
Figure 5: Tiling architecture demonstrating how 32×32 matrices are processed using 16×16 hardware tiles
See docs/TILING_GUIDE.md for detailed tiling subsystem documentation.
```
SLM_Accelerator/
├── Accelerator_IP/
│   ├── Gemma_Accelerator_IP/
│   │   ├── INT8_16x16/
│   │   │   ├── src/
│   │   │   └── tb/
│   │   ├── INT8_32x32/
│   │   │   ├── src/
│   │   │   └── tb/
│   │   └── Tiling/
│   └── Systolic_array_IP/
│       ├── not_scalable_systolic_matmul/
│       ├── scalable_systolic_matmul/
│       │   ├── with_buffers/
│       │   └── without_buffers/
│       └── scalable_systolic_matmul_axi/
│           ├── INT4_INT4/
│           └── f16_INT4/
├── Application/
│   └── INT8_16x16/
├── Results/
├── docs/
├── .gitignore
└── README.md
```
Top-level accelerator module with:
- AXI4-Lite control interface
- AXI4 memory master interface
- Tiling control logic
- Double buffering for input/output
Systolic array grid instantiations:
- Grid of PE modules
- Data flow orchestration (north → south, west → east)
- Pipelined output collection
Processing Element implementation:
- INT8×INT8 signed multiplication
- 32-bit accumulation with saturation protection
- Registered data forwarding
Input/output buffering:
- A, B matrix buffers
- Result accumulation buffer
- Burst-friendly memory interface
FSM-based tile iteration controller:
- Computes tile loops for K, I, J dimensions
- Generates tile coordinates
- Controls accumulation vs. overwrite for result tiles
- Asserts a done signal when all tiles have completed
Combinational address calculator:
- Converts tile coordinates (ki, ii, ji) to memory addresses
- Handles row-major matrix layouts
- Supports configurable tile and matrix sizes
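A C model of the address math might look like this (a sketch; `tile_addrs` and its argument order are assumptions, and the results are element offsets in row-major order, so byte addresses would additionally scale by element size):

```c
#include <stdint.h>

typedef struct { uint32_t a, b, c; } tile_addr_t;

/* Convert tile coordinates (ii, ji, ki) into element offsets for
 * T×T tiles of row-major A (M×K), B (K×N), and C (M×N). */
static tile_addr_t tile_addrs(uint32_t ii, uint32_t ji, uint32_t ki,
                              uint32_t T, uint32_t N, uint32_t K,
                              uint32_t a_base, uint32_t b_base,
                              uint32_t c_base) {
    tile_addr_t out;
    out.a = a_base + (ii * T) * K + ki * T; /* A tile at (row ii*T, col ki*T) */
    out.b = b_base + (ki * T) * N + ji * T; /* B tile at (row ki*T, col ji*T) */
    out.c = c_base + (ii * T) * N + ji * T; /* C tile at (row ii*T, col ji*T) */
    return out;
}
```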
```shell
cd Accelerator_IP/Gemma_Accelerator_IP/INT8_16x16/tb

# Using Vivado
vivado -mode batch -source sim_script.tcl

# Using ModelSim/QuestaSim
vsim -do "do compile.do; do simulate.do"
```

```shell
cd Accelerator_IP/Gemma_Accelerator_IP/Tiling

# Compile and simulate with your preferred simulator
iverilog -g2012 -o sim *.v gemm_tb.sv
vvp sim
```

The accelerator is designed to integrate with RISC-V SoCs via AXI interfaces:
- Instantiate the IP in your SoC design
- Connect AXI4-Lite to the CPU's control bus
- Connect AXI4 Master to system memory interconnect
- Assign memory-mapped registers for accelerator control
Control Registers (via AXI4-Lite):
| Offset | Register                          |
|--------|-----------------------------------|
| 0x00   | Control (start, reset)            |
| 0x04   | Status (done, busy)               |
| 0x08   | Matrix A base address             |
| 0x0C   | Matrix B base address             |
| 0x10   | Matrix C base address             |
| 0x14   | Matrix dimensions (M, N, K)       |
| 0x18   | Tile configuration (tiling mode)  |
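Driving the register map above from bare-metal C might look like the following sketch (the register names, the `ACC_CTRL_START` bit, and the 10-bit M/N/K packing are assumptions for illustration; consult the RTL for the real encoding):

```c
#include <stdint.h>

/* Offsets from the register map (names assumed here). */
enum {
    ACC_CTRL   = 0x00,  /* start, reset */
    ACC_STATUS = 0x04,  /* done, busy   */
    ACC_A_BASE = 0x08,
    ACC_B_BASE = 0x0C,
    ACC_C_BASE = 0x10,
    ACC_DIMS   = 0x14,  /* M, N, K packed (assumed 10 bits each) */
    ACC_TILE   = 0x18
};

#define ACC_CTRL_START 0x1u  /* assumed start bit */

static inline void reg_write(volatile uint32_t *base, uint32_t off,
                             uint32_t v) {
    base[off / 4] = v;  /* word-addressed MMIO write */
}

/* Program one GEMM job, then pulse the start bit. */
static void acc_start_job(volatile uint32_t *base,
                          uint32_t a, uint32_t b, uint32_t c,
                          uint32_t M, uint32_t N, uint32_t K) {
    reg_write(base, ACC_A_BASE, a);
    reg_write(base, ACC_B_BASE, b);
    reg_write(base, ACC_C_BASE, c);
    reg_write(base, ACC_DIMS, (M << 20) | (N << 10) | K);
    reg_write(base, ACC_CTRL, ACC_CTRL_START);
}
```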
```c
// Initialize accelerator
void gemma_acc_init(uint32_t base_addr);

// Offload matrix multiplication
void gemma_matmul(int8_t *A, int8_t *B, int32_t *C,
                  uint32_t M, uint32_t N, uint32_t K);

// Check completion
bool gemma_acc_done();
```

See Application/INT8_16x16/ for the full implementation.
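A host-side golden model is a convenient way to check results returned by `gemma_matmul` (a minimal sketch; `gemm_ref` is not part of the shipped application):

```c
#include <stdint.h>

/* Golden-model GEMM: C = A×B with INT8 inputs, INT32 accumulation,
 * and row-major layouts, matching the accelerator's output format. */
static void gemm_ref(const int8_t *A, const int8_t *B, int32_t *C,
                     uint32_t M, uint32_t N, uint32_t K) {
    for (uint32_t i = 0; i < M; i++)
        for (uint32_t j = 0; j < N; j++) {
            int32_t acc = 0;
            for (uint32_t k = 0; k < K; k++)
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            C[i * N + j] = acc;
        }
}
```

Comparing `C` from the accelerator against `gemm_ref` element-by-element catches both datapath and addressing bugs.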
- High throughput: All PEs compute in parallel
- Efficient data reuse: Each data element used multiple times
- Simple control: Regular, predictable dataflow
- Scalable: Easy to increase array size for more performance
- Large matrix support: Process matrices larger than hardware array
- Reduced memory bandwidth: Tile-level data reuse
- Flexible deployment: Same hardware supports multiple problem sizes
- Energy efficiency: Minimizes DRAM accesses
- Memory savings: 2× vs. FP16, 4× vs. FP32
- Higher compute density: More operations per cycle
- Acceptable accuracy for many neural network workloads
- Simpler arithmetic: Lower area and power
- Throughput: 256 INT8 MAC operations per cycle on the 16×16 array (1,024 on the 32×32 array), peak
- Latency: 16 cycles (16×16) or 32 cycles (32×32) of array fill time, plus computation cycles
- Frequency: Targets 100-200 MHz on modern FPGAs
- 16×16 array: ~15K LUTs, ~8K FFs, 512 DSP48 slices
- 32×32 array: ~60K LUTs, ~32K FFs, 2048 DSP48 slices
All modules include comprehensive testbenches:
- Unit tests: Individual PE verification
- Integration tests: Full systolic array testing
- System tests: End-to-end matrix multiplication
- Tiling tests: Large matrix decomposition and accumulation
Test cases cover:
- Square and non-square matrices
- Various data patterns (zeros, ones, random, checkerboard)
- Edge cases (1×1, 2×2, misaligned sizes)
- Performance benchmarks
Figure 6: Simulation waveform showing data propagation through the systolic array
Figure 7: Simulation waveform showing output results and accumulation
Figure 8: Accelerator integrated with RISC-V SoC on FPGA
Figure 9: Performance comparison showing CPU vs. accelerator execution times
Figure 10: Execution log demonstrating tiled matrix multiplication (32×32 using 16×16 tiles)
- FP16 systolic array implementation
- Mixed-precision support (FP16 activations × INT8 weights)
- Sparse matrix acceleration
- Multi-array configurations for higher throughput
- Integration with full SLM inference pipeline
- Power optimization and clock gating
- Support for non-square tile sizes
- Google's TPU systolic array architecture
- Xilinx Deep Learning Processor (DPU) design principles
- RISC-V Vector Extension specifications
- Quantization techniques for neural networks
This project is provided for educational and research purposes.
- Bavan2002 - Tiling subsystem implementation and integration
- Friend - Systolic array core development and application framework
Special thanks to the RISC-V community and open-source hardware initiatives that made this project possible.