# SparseFlow: MLIR-based compiler for structured sparsity optimization
SparseFlow is an MLIR compiler infrastructure that detects and exploits structured sparsity in tensor operations. Our SPA (Sparsity Propagation Analysis) pass performs static analysis at compile time to identify computation on all-zero rows and columns, and exports metadata that drives an optimized runtime.
- ✅ ~4× CPU speedup on structured sparse matmuls (proven and reproducible)
- ✅ Static analysis at compile time (no runtime overhead)
- ✅ 2D sparsity tracking (rows + columns)
- ✅ Production-ready OpenMP runtime
- ✅ Cross-platform verified (WSL + GitHub Codespaces)
Last Updated: December 2024
- MLIR SPA Pass: 2D sparsity analysis for `linalg.matmul` (row + column masks)
- JSON Export: `spa_sparsity.json` with runtime-ready metadata
- Python Demos: Reference implementations for validation
- C++ OpenMP Runtime: Production kernel achieving ~4× CPU speedup
- Cross-Platform: Verified on WSL (Ubuntu 22.04) and GitHub Codespaces (Ubuntu 24.04)
- Health Check: One-command verification (`./quick_check.sh`)
- Documentation: Technical overview, pitch deck, benchmarks
- GPU Kernels: No CUDA/ROCm support yet (CPU-only)
- MLIR Integration: No automatic lowering to runtime calls
- Framework Integration: No PyTorch / ONNX / TensorRT support
- Dynamic Sparsity: Only static analysis (no runtime profiling)
"SparseFlow SPA v0.6 provides static 2D sparsity analysis for MLIR that detects ~75% removable FLOPs on structured patterns, exports JSON metadata, and drives an OpenMP runtime achieving ~4× CPU speedup on benchmarks from 128×128 to 1024×1024. Verified on WSL and GitHub Codespaces."
```bash
# Open this repo in Codespaces, then:
# 1) Health check (builds everything + runs tests)
./quick_check.sh

# 2) See the speedup
cd runtime/build && ./benchmark_sparse
```

Expected Result: ~4× speedup on CPU with ~75% sparsity detection
```bash
# Prerequisites
sudo apt install -y llvm-19-dev mlir-19-tools libmlir-19-dev libomp-dev

# Clone and build
git clone https://github.com/MapleSilicon/SparseFlow.git
cd SparseFlow

# Build compiler passes
cd compiler/build
cmake -DCMAKE_PREFIX_PATH=/usr/lib/llvm-19 .. && make -j4

# Build runtime
cd ../../runtime/build
cmake .. && make -j4

# Run demo
cd ../../
./run_spa_v06_demo.sh
```

| Matrix Size | Dense Time | Sparse Time | Speedup |
|---|---|---|---|
| 256×256 | 22.3 ms | 5.2 ms | 4.3× |
| 512×512 | 336 ms | 101 ms | 3.3× |
| 768×768 | 745 ms | 156 ms | 4.8× |
| 1024×1024 | 4073 ms | 945 ms | 4.3× |
Average: 4.2× speedup (consistent with 75% FLOP reduction)
Pattern: 50% row + 50% column sparsity = 75% total sparsity
See BENCHMARKS.md for detailed methodology and cross-environment results.
- 3-Minute Demo - Prove it works in 3 commands
- Technical Overview - Architecture and examples
- Pitch Deck - Investor presentation (7 slides)
- Benchmarks - Detailed performance analysis
- Health Check - One-command verification
```
┌──────────────┐
│ MLIR Source  │  Standard linalg.matmul
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   SPA Pass   │  Detects: rowmask=[T,F,T,F]
│    (v0.6)    │           colmask=[T,T,F,F]
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ JSON Export  │  spa_sparsity.json
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ C++ Runtime  │  OpenMP masked matmul
│   (OpenMP)   │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ ~4× Speedup  │  🔥
└──────────────┘
```
Input MLIR:

```mlir
linalg.matmul ins(%A, %B : tensor<512x512xf32>)
```

After SPA Analysis:

```mlir
linalg.matmul {
  sparseflow.spa_rowmask = [true, false, true, false, ...],
  sparseflow.spa_colmask = [true, true, false, false, ...]
} ins(%A, %B : tensor<512x512xf32>)
```

JSON Export:

```json
{
  "name": "linalg.matmul",
  "row_sparsity_pct": 50,
  "col_sparsity_pct": 50,
  "total_rows": 512,
  "total_cols": 512
}
```

Runtime: Uses the masks to skip 75% of the computation → 3.3× faster
```
SparseFlow/
├── compiler/passes/          # MLIR analysis passes
│   └── spa/                  # SPA v0.6 implementation
│       ├── SPAExportPass.cpp # JSON export
│       └── ...
├── runtime/                  # C++ OpenMP runtime
│   ├── masked_matmul.cpp     # Optimized sparse kernel
│   └── benchmark_sparse.cpp
├── docs/
│   ├── SPA_OVERVIEW.md       # Technical deep-dive
│   └── pitch/SLIDES.md       # Investor deck
├── tests/                    # Test cases
├── quick_check.sh            # Health check script
├── run_spa_v06_demo.sh       # Complete demo
└── BENCHMARKS.md             # Performance results
```
- 2D Sparsity: Tracks zero rows AND columns (not just 1D)
- Static Analysis: Compile-time detection (no runtime overhead)
- Structured Patterns: N:M, block, and custom sparsity
- Propagation: Tracks sparsity through arithmetic operations
- ✅ `linalg.matmul` (fully supported)
- ✅ `arith.addf`, `arith.subf` (union semantics)
- ✅ `arith.mulf`, `arith.divf` (intersection semantics)
- ✅ `arith.maximumf` (ReLU detection)
- ✅ `linalg.transpose` (swaps rows ↔ cols)
- ✅ `linalg.reduce` (preserves the non-reduced dimension)
- ✅ `tensor.expand_shape` (broadcasts the pattern)
- Language: C++ with OpenMP
- Parallelization: `#pragma omp parallel for`
- Mask Type: `std::vector<uint8_t>` (SIMD-friendly)
- Algorithm: Extract active block → compute → scatter back
- 2D sparsity tracking
- JSON export
- CPU runtime
- Cross-platform verification
- CUDA masked matmul kernel
- 10–50× speedup potential
- cuSPARSE comparison
- PyTorch plugin
- ONNX Runtime backend
- TensorRT integration
- Dynamic sparsity profiling
- Automatic pattern detection
- Multi-dimensional tensors
Contributions welcome! Areas of interest:
- GPU kernel development (CUDA/ROCm)
- MLIR dialect integration
- Framework plugins (PyTorch/ONNX)
- Benchmark suite expansion
Gourav Kumar - Founder, MapleSilicon
GitHub: @MapleSilicon
Project: https://github.com/MapleSilicon/SparseFlow
Apache 2.0 - See LICENSE for details
Built with LLVM/MLIR 19. Tested on WSL and GitHub Codespaces.
Star this repo ⭐ if you find it useful!
SparseFlow includes a minimal C++ runtime that consumes the SPA masks and accelerates matmuls on CPU:
- Uses row/column masks from SPA to skip zero rows/cols
- Implements a blocked, OpenMP-parallel matmul kernel
- Achieves ~3–4× speedup on large matmuls (512–1024) when SPA detects 75% sparsity
Run the full demo:
```bash
./spa-runner.sh
```

This will run:
- MLIR → SPA → `spa_sparsity.json`
- C++ runtime benchmark with dense vs sparse timings
On CPU with 50% row + 50% col sparsity (75% FLOP reduction):
- 512×512: ~3.4× speedup
- 1024×1024: ~4.9× speedup
- Theoretical maximum from FLOP reduction alone: 4.0×
Performance varies with cache effects and OpenMP overhead. Production deployments should target workloads β₯512Γ512 for consistent speedup.
From the repo root:
```bash
cd sparseflow_package
pip install -e .
```
## MLIR Driver (`sparseflow-opt.sh`)
SparseFlow provides a small convenience wrapper around `mlir-opt` to run SPA:
```bash
./sparseflow-opt.sh tests/test_spa_v6_full_2d.mlir > /tmp/out.mlir
cat spa_sparsity.json
```