# ssBlast

FP8 per-tile scaled linear solver for consumer NVIDIA GPUs.
ssBlast solves dense linear systems of the form Ax = b using FP8-precision matrix kernels
with per-tile dynamic scaling, followed by iterative refinement to recover full FP64 accuracy.
It is designed to run on any NVIDIA GPU from the GTX 10-series through the RTX 40-series,
with no proprietary solvers or data-center hardware required.
Version 0.1.3 · MIT License · PyPI: https://pypi.org/project/ssblast · Repository: https://github.com/Sharveswar007/SSBLAST
## Contents

- Overview
- Requirements
- Installation
- Quick Start
- Full API Reference
- Platform Notes
- GPU Compatibility
- Benchmark Results
- How It Works
- Accuracy
- Limitations
- Running the Tests
- Technical Notes
- References
- Author
- License
## Overview

ssBlast is the first open-source FP8 linear solver designed for consumer NVIDIA GPUs.
Standard solvers such as cuSOLVER and MAGMA operate at FP64 precision throughout. ssBlast performs the expensive dense matrix-matrix multiply (GEMM) in FP8 using a Triton kernel where each 32x32 tile independently computes and applies its own scale factor. This avoids the global clipping that would otherwise destroy precision at FP8 range. The result is then corrected to full FP64 accuracy through iterative refinement using GPU-cached LU factorization.
Measured performance on an RTX 4050 Laptop GPU (Ada Lovelace) under WSL2 with Triton 3.6.0 shows 2x to 3x speedup over CuPy FP64 for matrix sizes of 2000 or larger, with maximum residual error below 1e-8.
## Requirements

Required:

- Python 3.10 or later
- NVIDIA GPU with CUDA compute capability 7.0 or greater (see the GPU Compatibility table below)
- CUDA Toolkit 12.x
- `cupy-cuda12x >= 13.0`
- `scipy >= 1.11`
- `numpy >= 1.24`

Optional, for the FP8 Triton kernel:

- Linux or WSL2 (Windows Subsystem for Linux 2)
- `triton >= 3.0.0`
- `torch >= 2.0`
- RTX 40-series GPU (compute capability 8.9) for maximum performance
Note: Triton is not supported on native Windows. On native Windows the solver automatically falls back to the FP16 CuPy cuBLAS path, which still outperforms SciPy CPU for large matrices.
## Installation

### Standard install

```bash
pip install ssblast
```

This installs the core package with GPU support via CuPy and a CPU fallback via NumPy. No Triton is required. The FP8 kernel path will not be active, but the FP16 and FP32 GPU paths remain available.

### With the FP8 Triton kernel

```bash
pip install "ssblast[triton]"
```

This additionally installs Triton and PyTorch, enabling the FP8 Triton kernel on Linux and WSL2 with a supported NVIDIA GPU.

### From source

```bash
git clone https://github.com/Sharveswar007/SSBLAST.git
cd SSBLAST
pip install .
```

To include the Triton dependencies when installing from source:

```bash
pip install ".[triton]"
```

## Quick Start

Check the installation and the available backends:

```python
import ssblast

print(ssblast.__version__)       # 0.1.3
print(ssblast.CUPY_AVAILABLE)    # True if CuPy is installed and a GPU is detected
print(ssblast.TRITON_AVAILABLE)  # True if Triton is installed (Linux/WSL2 only)
```

Solve a dense system Ax = b:

```python
from ssblast import solve
import numpy as np
# Create a random dense linear system
n = 4000
A = np.random.randn(n, n)
b = np.random.randn(n)
# Solve Ax = b
x = solve(A, b)
# Verify accuracy
residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
print(f"Relative residual: {residual:.2e}") # typically < 1e-8You can also pass CuPy arrays directly, which avoids the CPU-to-GPU transfer:
from ssblast import solve
import cupy as cp
A = cp.random.randn(4000, 4000)
b = cp.random.randn(4000)
x = solve(A, b)
```

The return value `x` is a CuPy array when a GPU is used, or a NumPy array on CPU-only systems.
## Full API Reference

### `ssblast.solve(A, b)`

Solve the dense linear system Ax = b.

**Parameters**

| Parameter | Type | Description |
|---|---|---|
| `A` | `numpy.ndarray` or `cupy.ndarray` | Square coefficient matrix of shape `(n, n)`. Must be non-singular. |
| `b` | `numpy.ndarray` or `cupy.ndarray` | Right-hand side vector of shape `(n,)`. |

**Returns**

| Name | Type | Description |
|---|---|---|
| `x` | `numpy.ndarray` or `cupy.ndarray` | Solution vector of shape `(n,)`. FP64-accurate. |

**Raises**

| Exception | Condition |
|---|---|
| `ValueError` | If `A` or `b` is `None`. |
| `ValueError` | If `A` is not square. |
| `ValueError` | If `A.shape[0]` does not match `b.shape[0]`. |
| `MemoryError` | If the matrix does not fit within available VRAM. |
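As a quick illustration of the validation behavior (a sketch; the exact error messages are the library's own, and only the `ValueError` conditions listed above are assumed):

```python
import numpy as np
from ssblast import solve

A = np.random.randn(100, 80)   # not square: should raise ValueError
b = np.random.randn(100)

try:
    x = solve(A, b)
except ValueError as err:
    print(f"rejected: {err}")
```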
### Precision Path Selection
The solver automatically selects the best available path based on the detected GPU:
| GPU Generation | Path Selected |
|---|---|
| RTX 40-series (CC 8.9) with Triton on Linux/WSL2 | FP8 Triton kernel + iterative refinement |
| RTX 30-series, RTX 20-series | FP16 CuPy cuBLAS + iterative refinement |
| GTX 10-series | FP32 CuPy cuBLAS + iterative refinement |
| No GPU / CuPy unavailable | NumPy CPU solver |
No configuration is required. The selection is fully automatic.
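The dispatch can be pictured with a small sketch. Everything below is illustrative: `select_tier` is not part of the ssBlast API and the exact thresholds live inside the library; only the compute capability query is standard CuPy.

```python
import platform
import cupy as cp

def select_tier(cc: int, triton_ok: bool, on_linux: bool) -> str:
    # Illustrative mirror of the selection table above, not ssBlast internals.
    # cc is the compute capability times 10 (e.g. 89 for CC 8.9).
    if cc >= 89 and triton_ok and on_linux:
        return "FP8 Triton kernel + iterative refinement"
    if cc >= 75:
        return "FP16 cuBLAS + iterative refinement"   # RTX 20/30-series
    if cc >= 61:
        return "FP32 cuBLAS + iterative refinement"   # GTX 10-series
    return "NumPy CPU solver"

cc = int(cp.cuda.Device(0).compute_capability)        # e.g. 89 for CC 8.9
on_linux = platform.system() == "Linux"               # WSL2 also reports Linux
print(select_tier(cc, triton_ok=True, on_linux=on_linux))
```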
### Module Attributes

| Attribute | Type | Description |
|---|---|---|
| `ssblast.__version__` | `str` | Package version string. |
| `ssblast.CUPY_AVAILABLE` | `bool` | `True` if CuPy is installed and importable. |
| `ssblast.TRITON_AVAILABLE` | `bool` | `True` if Triton is installed and importable. |
## Platform Notes

### Linux and WSL2

The FP8 Triton kernel is available on Linux and WSL2. This is the primary target platform for peak performance on RTX 40-series GPUs.
To verify that the FP8 path is active:

```python
from ssblast import TRITON_AVAILABLE

print(TRITON_AVAILABLE)  # True on Linux/WSL2 with Triton installed
```

### Native Windows

Triton does not support native Windows. On a native Windows installation, ssBlast automatically uses the FP16 CuPy cuBLAS path for RTX 20-series and newer GPUs. This path still outperforms SciPy on the CPU for large matrices but does not reach the peak speedup of the FP8 kernel.
No code changes are needed. The fallback is transparent to the caller.
### CPU Fallback

If no NVIDIA GPU or CuPy installation is detected, ssBlast falls back to `numpy.linalg.solve`. The same `solve(A, b)` interface is used, and performance is equivalent to NumPy's default linear solver.
## GPU Compatibility

| GPU Series | Example Models | Compute Capability | Precision Path | Status |
|---|---|---|---|---|
| RTX 40-series | RTX 4090, 4080, 4070, 4060, 4050 | 8.9 | FP8 Triton kernel | Optimized |
| RTX 30-series | RTX 3090, 3080, 3070, 3060 | 8.6 | FP16 cuBLAS | Working |
| RTX 20-series | RTX 2080, 2070, 2060 | 7.5 | FP16 cuBLAS | Working |
| GTX 10-series | GTX 1080, 1070, 1060 | 6.1 | FP32 cuBLAS | Working |
| CPU fallback | Any system | N/A | NumPy | Available |
**Compute capability threshold:** CUDA compute capability 7.0 is the minimum for the CuPy GPU path. The GTX 1060 has CC 6.1 and will use the CPU fallback unless a compatible GPU is also present.
## Benchmark Results

Benchmarks were run on an NVIDIA RTX 4050 Laptop GPU (Ada Lovelace, CC 8.9, 6.4 GB VRAM) under WSL2 Ubuntu with CUDA 12.8 and Triton 3.6.0. Each measurement uses 3 timed runs with 1 warmup run and explicit CUDA stream synchronization barriers.
| Matrix Size | SciPy CPU FP64 | CuPy GPU FP64 | ssBlast FP8 | vs SciPy CPU | vs CuPy FP64 |
|---|---|---|---|---|---|
| 128 x 128 | 0.0002 s | 0.0008 s | 0.0063 s | 0.0x | 0.1x |
| 256 x 256 | 0.0007 s | 0.0024 s | 0.0079 s | 0.1x | 0.3x |
| 500 x 500 | n/a | 0.0073 s | 0.0097 s | n/a | 0.7x |
| 1000 x 1000 | 0.025 s | 0.026 s | 0.020 s | 1.2x | 1.3x |
| 2000 x 2000 | 0.128 s | 0.121 s | 0.050 s | 2.6x | 2.4x |
| 3000 x 3000 | 0.357 s | 0.293 s | 0.103 s | 3.5x | 2.8x |
| 4000 x 4000 | 0.713 s | 0.542 s | 0.188 s | 3.8x | 2.9x |
| 6000 x 6000 | 1.774 s | 1.104 s | 0.546 s | 3.3x | 2.0x |
| 8000 x 8000 | 4.041 s | 2.066 s | 1.021 s | 4.0x | 2.0x |
| 10000 x 10000 | 6.701 s | 4.026 s | 1.920 s | 3.5x | 2.1x |
- ssBlast outperforms CuPy FP64 for all matrix sizes of n >= 1000.
- Peak speedup of approximately 3x vs CuPy FP64 occurs at n = 3000 to 4000.
- For n < 1000, iterative refinement overhead is not yet amortized by the GEMM savings.
- For n >= 2000, ssBlast is consistently 2x to 3x faster than CuPy FP64.
- All results are FP64-accurate regardless of the FP8 internal representation.
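The table above comes from the project's benchmark script at the commit noted under Technical Notes. A minimal harness in the same spirit (warmup run, timed runs, explicit synchronization) might look like this; `time_solve` and the choice of reporting the best run are assumptions, not the project's script:

```python
import time
import cupy as cp
from ssblast import solve

def time_solve(n: int, warmup: int = 1, runs: int = 3) -> float:
    # Hypothetical harness mirroring the methodology above: warmup run(s),
    # then timed runs bracketed by explicit device synchronization.
    A = cp.random.randn(n, n)
    b = cp.random.randn(n)
    for _ in range(warmup):
        solve(A, b)
    cp.cuda.Device().synchronize()
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        solve(A, b)
        cp.cuda.Device().synchronize()  # wait for all GPU work before stopping the clock
        best = min(best, time.perf_counter() - t0)
    return best

print(f"n=4000: {time_solve(4000):.3f} s")
```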
## How It Works

The solver is structured as a six-layer pipeline:
**Layer 0: Entry Validation**

- Accept `A` (n x n) and `b` (n,); validate types and shapes.
- Route to the GPU path if CuPy is available, otherwise the CPU path.

**Layer 1: GPU Detection**

- Query NVIDIA GPU compute capability, VRAM, and driver version.
- Assign a precision tier: FP8, FP16, FP32, or CPU.

**Layer 2: Precision Plan Selection**

- Select the compute plan based on the tier and available libraries.
- Check for Triton availability (Linux/WSL2 required for FP8).

**Layer 3: Dispatcher**

- Transfer `A` and `b` to the GPU if not already on-device.
- Invoke the correct compute backend.

**Layer 4: FP8 Triton GEMM Kernel** (RTX 40-series, Linux/WSL2)

- Tile `A` into configurable blocks (32x32 to 128x128, selected by the autotuner).
- For each tile, compute a local scale factor to keep values within FP8 range.
- Perform the GEMM using FP16 arithmetic with per-tile FP8-range scaling.
- Rescale output tiles and accumulate the result in FP32 accumulators.

**Layer 5: Iterative Refinement**

- Compute the residual `r = b - A @ x_rough`.
- Solve a correction equation using a cached GPU LU factorization.
- Apply the correction to converge to FP64-accurate `x`.
- Loop until the relative residual is below the tolerance threshold.
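Layer 5 is textbook iterative refinement (Higham 2002, ch. 12). A minimal CuPy sketch of the loop, assuming the LU factors are cached on the GPU; `refine` and its tolerance are illustrative, not ssBlast's internal code:

```python
import cupy as cp
from cupyx.scipy.linalg import lu_factor, lu_solve

def refine(A, b, x, tol=1e-10, max_iter=5):
    # Illustrative Layer 5: correct the rough FP8 solution x toward FP64 accuracy.
    lu, piv = lu_factor(A)               # factor once, reuse every iteration
    b_norm = cp.linalg.norm(b)
    for _ in range(max_iter):
        r = b - A @ x                    # residual of the current solution
        if cp.linalg.norm(r) / b_norm < tol:
            break                        # relative residual below tolerance
        x = x + lu_solve((lu, piv), r)   # solve the correction equation
    return x
```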
### Per-Tile Scaling

The key innovation is in the FP8 kernel (`ssblast/kernels/ssblast_kernel.py`).
Standard FP8 implementations apply a single global scale factor to the entire
matrix. When the matrix has significant dynamic range variation, many values
are clipped, which destroys precision.
ssBlast computes a separate scale factor for each 32x32 tile, independent of all other tiles. Each tile uses the full FP8 value range without clipping. The scale factors are computed inside the Triton kernel with no CPU round-trip.
This approach was inspired by the GotoBLAS/BLIS per-block memory access strategy described in the FLAME How-to-Optimize-GEMM guide, adapted for FP8 dynamic range management.
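To make the idea concrete, here is a NumPy emulation of the per-tile scale computation. It is for exposition only: `per_tile_scales` is not the library's API, the e4m3 maximum of 448 is an assumed FP8 format choice, and the real scales are computed inside the Triton kernel with no CPU round-trip.

```python
import numpy as np

TILE = 32        # per-tile granularity described above
FP8_MAX = 448.0  # largest finite value of the FP8 e4m3 format (assumed here)

def per_tile_scales(A: np.ndarray) -> np.ndarray:
    # Emulation only; assumes n is a multiple of TILE
    # (the real kernel pads ragged boundary tiles).
    n = A.shape[0]
    scales = np.empty((n // TILE, n // TILE))
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            amax = np.abs(A[i:i+TILE, j:j+TILE]).max()
            # Map each tile's largest magnitude to FP8_MAX so the tile
            # uses the full FP8 range without clipping.
            scales[i // TILE, j // TILE] = FP8_MAX / max(amax, 1e-30)
    return scales
```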
The FP8 kernel is used only for the initial rough solve. The output of
ssblast.solve() is always FP64-accurate.
## Accuracy

Accuracy is guaranteed by the iterative refinement stage, which corrects the FP8-approximate solution using standard FP64 LU factorization. The refinement converges in one to two iterations for well-conditioned matrices.
Maximum error vs SciPy reference across all tested matrix sizes: < 1e-8.
ssBlast handles the following cases correctly:
- Matrices with large dynamic range variation between entries
- Ill-conditioned matrices (condition number up to 1e12, tested)
- Matrices of non-power-of-two sizes
- Matrices that do not align exactly to tile block boundaries (boundary padding applied)
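A quick self-check of the dynamic-range case; the test matrix construction here is an assumption of this example, not taken from the project's test suite:

```python
import numpy as np
from ssblast import solve

# Rows spanning ~12 orders of magnitude: the case that defeats a
# single global FP8 scale factor.
n = 2000
A = np.random.randn(n, n) * np.logspace(-6, 6, n)[:, None]
b = np.random.randn(n)

x = solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # < 1e-8 per the claims above
```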
## Limitations

- The FP8 Triton kernel requires Linux or WSL2. Native Windows uses the FP16 path.
- The FP8 path requires an RTX 40-series GPU. Older GPUs use the FP16 or FP32 path.
- Performance advantage only applies for n >= 1000. For small systems, SciPy CPU is faster.
- The entire matrix must fit in VRAM. A 10000 x 10000 FP64 matrix requires approximately 800 MB. Consumer GPUs with 6 GB VRAM can handle matrices up to approximately n = 12000.
- Sparse matrices are not supported. Only dense matrices are handled.
- Batch solving (multiple right-hand sides) is not yet implemented. Call `solve` in a loop (see the sketch below).
- The iterative refinement tolerance is fixed internally. There is no public parameter to adjust it.
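A loop over right-hand sides, as mentioned in the list above (a sketch; note that each call re-factors `A`, so this costs k full solves):

```python
import cupy as cp
from ssblast import solve

# Until batch solving is implemented, solve one right-hand side at a time.
n, k = 4000, 8
A = cp.random.randn(n, n)
B = cp.random.randn(n, k)

X = cp.stack([solve(A, B[:, j]) for j in range(k)], axis=1)  # shape (n, k)
```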
## Running the Tests

The test suite requires a CUDA-capable GPU with CuPy installed.
```bash
# Install test dependencies
pip install pytest

# Run the full test suite
pytest tests/ -v
```

Expected output (exact pass counts vary by GPU and platform):

```text
tests/test_layer0.py 17 passed
tests/test_layer1.py 8 passed
tests/test_layer2.py 5 passed
tests/test_layer3.py 5 passed
tests/test_layer4.py 5 passed (skipped if Triton not installed)
tests/test_layer5.py 5 passed
tests/test_end_to_end.py 3 passed
tests/test_final_checks.py 14 passed (2 skipped if Triton not installed)
```
| Test Module | Area Tested |
|---|---|
| `test_layer0` | Input validation: None, NaN, Inf, non-square, shape mismatch, 1D/3D inputs, empty matrix |
| `test_layer1` | GPU detection, tier assignment, fallback logic |
| `test_layer2` | Precision plan selection for each GPU tier |
| `test_layer3` | Dispatcher routing, CuPy/NumPy array transfer |
| `test_layer4` | FP8 Triton kernel, per-tile scaling, tile alignment (skipped without Triton) |
| `test_layer5` | Iterative refinement convergence, accuracy |
| `test_end_to_end` | Full pipeline accuracy, NumPy input handling, large matrix workload |
| `test_final_checks` | NaN/Inf rejection, FP8 path, accuracy over 10 runs, VRAM limit, ill-conditioned matrix |
## Technical Notes

How ssBlast compares with existing dense solvers:

| Solver | FP8 Support | Consumer GPU | Open Source | Typical Speedup vs FP64 |
|---|---|---|---|---|
| cuSOLVER | No | Limited (data center focus) | No | 1x baseline |
| MAGMA | No | Limited | Yes | ~1x |
| ssBlast | Yes | Yes (GTX 1060 and newer) | Yes | 2x to 3x |
Mixed-precision LAPACK (MPGESV) uses FP32 internally for the main factorization and falls back to FP64 if convergence is not reached. It does not use FP8, which NVIDIA hardware has supported only since the Hopper and Ada Lovelace generations (2022). ssBlast targets FP8 specifically to take advantage of the higher FP8 Tensor Core throughput on RTX 40-series GPUs.
The benchmark results in this document correspond to commit 743e1e0 and were
produced on an RTX 4050 Laptop GPU under WSL2 Ubuntu with CUDA 12.8 and
Triton 3.6.0. Results may vary on different hardware, driver versions, or
operating system environments.
## References

- FLAME Project, UT Austin. *How to Optimize GEMM*. https://github.com/flame/how-to-optimize-gemm/wiki. The GotoBLAS/BLIS memory blocking strategy inspired the per-tile design.
- Higham, N. J. (2002). *Accuracy and Stability of Numerical Algorithms*, 2nd ed. SIAM. The iterative refinement algorithm is described in Chapter 12.
- Triton: open-source GPU programming language from OpenAI. https://github.com/openai/triton
- NVIDIA. *FP8 Formats for Deep Learning*. https://arxiv.org/abs/2209.05433
## Author

**Sharveswar Madasamy**
B.Tech Computer Science and Engineering, SRM Institute of Science and Technology, Kattankulathur
Contact: msharveswar220@gmail.com
## License

MIT License. See the LICENSE file in the repository root for the full license text.