SLM Accelerator: Systolic Array-Based Matrix Multiplication IP

Overview

This project implements a high-performance systolic-array accelerator for matrix multiplication, designed for Small Language Model (SLM) inference on RISC-V edge devices. The accelerator offloads GEMM (General Matrix Multiply) operations from the CPU, significantly reducing inference latency on resource-constrained platforms.

Key Features

  • INT8 Systolic Array Accelerators: 16×16 and 32×32 configurations
  • Tiling Architecture: Support for large matrices using hardware tiles (e.g., 64×64 matrices on 16×16 hardware)
  • AXI4 Interface: Full AXI4-Stream and AXI4-Lite integration for SoC deployment
  • Scalable Designs: Parameterized implementations supporting INT4, INT8, and FP16 datatypes
  • Bare-Metal Application: Reference C implementation for RISC-V processors
  • Comprehensive Testbenches: SystemVerilog testbenches covering unit, integration, and system-level verification

Architecture

System Overview

Figure 1: Complete system architecture showing RISC-V processor integration with the systolic accelerator

Figure 2: Inference engine dataflow and accelerator offload mechanism

Figure 3: Accelerator IP block diagram with AXI interfaces

Systolic Array Design

The core architecture uses a 2D grid of Processing Elements (PEs) arranged in a systolic configuration:

┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
│ PE  │ → │ PE  │ → │ PE  │ → │ PE  │ →
└──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
   ↓         ↓         ↓         ↓
┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
│ PE  │ → │ PE  │ → │ PE  │ → │ PE  │ →
└──┬──┘   └──┬──┘   └──┬──┘   └──┬──┘
   ↓         ↓         ↓         ↓

Figure 4: Detailed systolic array architecture showing PE interconnections and data flow

Each Processing Element:

  • Performs signed INT8×INT8 multiplication
  • Accumulates partial sums in 32-bit registers
  • Forwards data to neighboring PEs
  • Operates in a fully pipelined manner
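The PE behavior described above can be sketched as a small host-side C model (an illustrative model, not the RTL in pe_int8.v; the struct and function names are hypothetical):

```c
#include <stdint.h>
#include <assert.h>

/* Behavioral model of one Processing Element: signed INT8 x INT8 multiply,
   32-bit accumulation, and registered forwarding of both operands to the
   neighboring PEs (west->east for A, north->south for B). */
typedef struct {
    int32_t acc;    /* 32-bit partial-sum register */
    int8_t  a_out;  /* A operand forwarded east */
    int8_t  b_out;  /* B operand forwarded south */
} pe_t;

void pe_step(pe_t *pe, int8_t a_in, int8_t b_in) {
    pe->acc  += (int32_t)a_in * (int32_t)b_in;  /* multiply-accumulate */
    pe->a_out = a_in;                           /* forward to east neighbor */
    pe->b_out = b_in;                           /* forward to south neighbor */
}
```

One call corresponds to one pipeline cycle; over K cycles a PE at position (i, j) accumulates the full dot product for output element C[i][j].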

Tiling Implementation

For matrices larger than the hardware array size, the accelerator implements a tiling subsystem that:

  1. Decomposes large matrices into smaller tiles (e.g., 64×64 → 16×16 tiles)
  2. Schedules tile processing using an FSM-based controller
  3. Generates tile addresses automatically for A, B, and C matrices
  4. Accumulates partial results across tiles to produce final output
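The four steps above amount to the classic blocked-GEMM loop nest. A minimal software reference (a sketch of the scheme, not the hardware tiling FSM; TILE stands in for the 16×16 array and sizes are assumed to be multiples of it):

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define TILE 4  /* stands in for the 16x16 hardware array size */

/* C[MxN] = A[MxK] * B[KxN], processed TILE x TILE tile by tile, with
   partial results accumulated into C across the K tile loop (step 4). */
void tiled_gemm(const int8_t *A, const int8_t *B, int32_t *C,
                int M, int N, int K) {
    memset(C, 0, (size_t)M * N * sizeof(int32_t));
    for (int ii = 0; ii < M; ii += TILE)        /* row tiles of C */
      for (int ji = 0; ji < N; ji += TILE)      /* column tiles of C */
        for (int ki = 0; ki < K; ki += TILE)    /* accumulate over K tiles */
          for (int i = ii; i < ii + TILE; i++)
            for (int j = ji; j < ji + TILE; j++) {
              int32_t acc = 0;
              for (int k = ki; k < ki + TILE; k++)
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
              C[i * N + j] += acc;              /* cross-tile accumulation */
            }
}
```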

Figure 5: Tiling architecture demonstrating how 32×32 matrices are processed using 16×16 hardware tiles

See docs/TILING_GUIDE.md for detailed tiling subsystem documentation.

Repository Structure

SLM_Accelerator/
├── Accelerator_IP/
│   ├── Gemma_Accelerator_IP/
│   │   ├── INT8_16x16/
│   │   │   ├── src/
│   │   │   └── tb/
│   │   ├── INT8_32x32/
│   │   │   ├── src/
│   │   │   └── tb/
│   │   └── Tiling/
│   └── Systolic_array_IP/
│       ├── not_scalable_systolic_matmul/
│       ├── scalable_systolic_matmul/
│       │   ├── with_buffers/
│       │   └── without_buffers/
│       └── scalable_systolic_matmul_axi/
│           ├── INT4_INT4/
│           └── f16_INT4/
├── Application/
│   └── INT8_16x16/
├── Results/
├── docs/
├── .gitignore
└── README.md

RTL Modules

Core Components

gemma_accelerator.v

Top-level accelerator module with:

  • AXI4-Lite control interface
  • AXI4 memory master interface
  • Tiling control logic
  • Double buffering for input/output

systolic_array_16x16.v / systolic_array_32x32.v

Systolic array grid instantiations:

  • Grid of PE modules
  • Data flow orchestration (north → south, west → east)
  • Pipelined output collection

pe_int8.v

Processing Element implementation:

  • INT8×INT8 signed multiplication
  • 32-bit accumulation with saturation protection
  • Registered data forwarding

accelerator_buffer.v

Input/output buffering:

  • A, B matrix buffers
  • Result accumulation buffer
  • Burst-friendly memory interface

Tiling Subsystem

tile_scheduler.v

FSM-based tile iteration controller:

  • Computes tile loops for K, I, J dimensions
  • Generates tile coordinates
  • Controls accumulation vs. overwrite for result tiles
  • Issues done signal when all tiles complete
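The scheduler's iteration pattern can be sketched as follows (the loop ordering and the accumulate-vs-overwrite rule here are assumptions for illustration, not taken from tile_scheduler.v):

```c
#include <assert.h>

/* One tile operation as the scheduler would issue it: tile coordinates
   plus a flag telling the datapath whether to accumulate into the C tile
   (ki > 0) or overwrite it (first K tile). */
typedef struct { int ki, ii, ji; int accumulate; } tile_cmd_t;

/* Emit the full tile sequence for kt x it x jt tiles into out[];
   returns the number of tile operations issued. */
int schedule_tiles(tile_cmd_t *out, int kt, int it, int jt) {
    int n = 0;
    for (int ii = 0; ii < it; ii++)
        for (int ji = 0; ji < jt; ji++)
            for (int ki = 0; ki < kt; ki++)
                out[n++] = (tile_cmd_t){ ki, ii, ji, ki != 0 };
    return n;  /* done when n == it * jt * kt tiles have completed */
}
```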

tile_address_generator.v

Combinational address calculator:

  • Converts tile coordinates (ki, ii, ji) to memory addresses
  • Handles row-major matrix layouts
  • Supports configurable tile and matrix sizes
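The row-major address calculation is straightforward; a C equivalent of what the generator computes (function name and byte-granularity addressing are illustrative assumptions):

```c
#include <stdint.h>
#include <assert.h>

/* Base address of tile (ti, tj) in a row-major matrix with `cols` columns:
   the tile's first element is (ti*tile, tj*tile), and element (row, col)
   lives at base + (row*cols + col) * elem_bytes. */
uint32_t tile_base_addr(uint32_t base, uint32_t ti, uint32_t tj,
                        uint32_t cols, uint32_t tile, uint32_t elem_bytes) {
    uint32_t row = ti * tile;
    uint32_t col = tj * tile;
    return base + (row * cols + col) * elem_bytes;
}
```

The same formula serves A, B, and C by plugging in each matrix's base address, column count, and element width (1 byte for INT8 inputs, 4 bytes for INT32 results).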

Usage

Simulation

INT8 16×16 Accelerator

cd Accelerator_IP/Gemma_Accelerator_IP/INT8_16x16/tb
# Using Vivado
vivado -mode batch -source sim_script.tcl

# Using ModelSim/QuestaSim
vsim -do "do compile.do; do simulate.do"

Tiling Implementation

cd Accelerator_IP/Gemma_Accelerator_IP/Tiling
# Compile and simulate with your preferred simulator
iverilog -g2012 -o sim *.v gemm_tb.sv
vvp sim

Hardware Integration

The accelerator is designed to integrate with RISC-V SoCs via AXI interfaces:

  1. Instantiate the IP in your SoC design
  2. Connect AXI4-Lite to the CPU's control bus
  3. Connect AXI4 Master to system memory interconnect
  4. Assign memory-mapped registers for accelerator control

Control Registers (via AXI4-Lite):

  • 0x00: Control (start, reset)
  • 0x04: Status (done, busy)
  • 0x08: Matrix A base address
  • 0x0C: Matrix B base address
  • 0x10: Matrix C base address
  • 0x14: Matrix dimensions (M, N, K)
  • 0x18: Tile configuration (for tiling mode)
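The register table above translates directly into bare-metal C accessors. A minimal sketch (the offsets follow the table; the bit positions within CTRL and STATUS are assumptions; check the RTL for the actual encoding):

```c
#include <stdint.h>
#include <assert.h>

/* Register offsets from the AXI4-Lite map above. */
#define ACC_REG_CTRL   0x00u  /* bit 0: start, bit 1: reset (assumed bits) */
#define ACC_REG_STATUS 0x04u  /* bit 0: done,  bit 1: busy  (assumed bits) */
#define ACC_REG_A_BASE 0x08u
#define ACC_REG_B_BASE 0x0Cu
#define ACC_REG_C_BASE 0x10u
#define ACC_REG_DIMS   0x14u
#define ACC_REG_TILE   0x18u

/* Volatile MMIO accessors for the memory-mapped register block. */
static inline void acc_write(uintptr_t base, uint32_t off, uint32_t v) {
    *(volatile uint32_t *)(base + off) = v;
}
static inline uint32_t acc_read(uintptr_t base, uint32_t off) {
    return *(volatile uint32_t *)(base + off);
}
```

A typical offload then writes the A/B/C base addresses and dimensions, sets the start bit in CTRL, and polls STATUS until the done bit is set.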

Software API

// Initialize accelerator
void gemma_acc_init(uint32_t base_addr);

// Offload matrix multiplication
void gemma_matmul(int8_t *A, int8_t *B, int32_t *C, 
                  uint32_t M, uint32_t N, uint32_t K);

// Check completion
bool gemma_acc_done(void);

See Application/INT8_16x16/ for full implementation.
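When bringing up the accelerator, it is useful to validate its output against a host-side reference. A sketch of such a checker (not part of the API above; `gemma_check` is a hypothetical testing aid that recomputes what gemma_matmul should produce):

```c
#include <stdint.h>
#include <assert.h>

/* Recompute C = A * B in software (row-major, INT8 inputs, INT32 output)
   and compare against the accelerator's result. Returns 1 on match. */
int gemma_check(const int8_t *A, const int8_t *B, const int32_t *C,
                uint32_t M, uint32_t N, uint32_t K) {
    for (uint32_t i = 0; i < M; i++)
        for (uint32_t j = 0; j < N; j++) {
            int32_t acc = 0;
            for (uint32_t k = 0; k < K; k++)
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            if (C[i * N + j] != acc) return 0;  /* mismatch at (i, j) */
        }
    return 1;  /* all M*N entries match */
}
```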

Design Decisions

Why Systolic Arrays?

  • High throughput: All PEs compute in parallel
  • Efficient data reuse: Each data element used multiple times
  • Simple control: Regular, predictable dataflow
  • Scalable: Easy to increase array size for more performance

Why Tiling?

  • Large matrix support: Process matrices larger than hardware array
  • Reduced memory bandwidth: Tile-level data reuse
  • Flexible deployment: Same hardware supports multiple problem sizes
  • Energy efficiency: Minimizes DRAM accesses

INT8 Quantization

  • Memory savings: 2× vs. FP16, 4× vs. FP32
  • Higher compute density: More operations per cycle
  • Acceptable accuracy for many neural network workloads
  • Simpler arithmetic: Lower area and power

Performance

INT8 16×16 Accelerator

  • Throughput: 256 INT8 MAC operations per cycle (peak)
  • Latency: 16-32 cycles (array fill time) + computation cycles
  • Frequency: Targets 100-200 MHz on modern FPGAs
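These figures combine into peak throughput as follows (a back-of-the-envelope helper; counting each MAC as two operations, a common convention that is an assumption here):

```c
#include <assert.h>

/* Peak throughput of an n x n systolic array: n*n MACs per cycle,
   2 ops per MAC, at freq_mhz megahertz -> GOPS. */
double peak_gops(int n, double freq_mhz) {
    return 2.0 * n * n * freq_mhz * 1e6 / 1e9;
}
```

For example, the 16×16 array at 200 MHz peaks at 2 × 256 × 200 MHz = 102.4 GOPS, sustained only once the array fill latency is amortized over a long enough computation.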

Resource Utilization (Xilinx Artix-7)

  • 16×16 array: ~15K LUTs, ~8K FFs, 512 DSP48 slices
  • 32×32 array: ~60K LUTs, ~32K FFs, 2048 DSP48 slices

Verification

All modules include comprehensive testbenches:

  • Unit tests: Individual PE verification
  • Integration tests: Full systolic array testing
  • System tests: End-to-end matrix multiplication
  • Tiling tests: Large matrix decomposition and accumulation

Test cases cover:

  • Square and non-square matrices
  • Various data patterns (zeros, ones, random, checkerboard)
  • Edge cases (1×1, 2×2, misaligned sizes)
  • Performance benchmarks

Simulation Results

INT8 16×16 Accelerator - Data Flow

Figure 6: Simulation waveform showing data propagation through the systolic array

INT8 16×16 Accelerator - Results

Figure 7: Simulation waveform showing output results and accumulation

Hardware Implementation Results

SoC Integration

Figure 8: Accelerator integrated with RISC-V SoC on FPGA

Hardware Offload Benchmark

Figure 9: Performance comparison showing CPU vs. accelerator execution times

Tiling Implementation Results

Figure 10: Execution log demonstrating tiled matrix multiplication (32×32 using 16×16 tiles)

Future Work

  • FP16 systolic array implementation
  • Mixed-precision support (FP16 activations × INT8 weights)
  • Sparse matrix acceleration
  • Multi-array configurations for higher throughput
  • Integration with full SLM inference pipeline
  • Power optimization and clock gating
  • Support for non-square tile sizes

References

  • Google's TPU systolic array architecture
  • Xilinx Deep Learning Processor (DPU) design principles
  • RISC-V Vector Extension specifications
  • Quantization techniques for neural networks

License

This project is provided for educational and research purposes.

Contributors

  • Bavan2002 - Tiling subsystem implementation and integration
  • Friend - Systolic array core development and application framework

Acknowledgments

Special thanks to the RISC-V community and open-source hardware initiatives that made this project possible.
