Memory-bounded compressed sparse attention via streaming top-k. Triton implementation of the DeepSeek-V4 lightning indexer that runs at sequence lengths where the reference materialize-then-topk path OOMs.
*Figure: V4-Flash indexer scaling on a single NVIDIA H200, log-log axes. Materialize (red) OOMs at S = 65,536; StreamIndex (blue) scales linearly to S = 1,048,576 at 6.21 GB peak HBM, a 32× regime extension. The same pattern holds at V4-Pro dimensions.*
DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a
lightning indexer scores compressed keys, the top-k are selected per
query, and a sparse attention kernel reads only those keys. Public CSA
implementations (V3.2-Exp reference, TileLang reference) materialize a
[B, S, H_I, T] FP32 score tensor before the top-k reduction. With
V4-Flash dimensions (H_I = 64, m = 4), that intermediate is 256 GB
at S = 65,536, exceeding the HBM of any single current GPU.
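For concreteness, here is a minimal PyTorch sketch of that materialize path as we read it (illustrative, not the reference code; `materialize_indexer_topk` is a hypothetical name, and the per-head weighted ReLU-score form follows the usage example below):

```python
import torch

def materialize_indexer_topk(q, k_idx, weights, top_k):
    # q: [B, S, H_I, D_I] queries, k_idx: [B, T, D_I] compressed keys,
    # weights: [B, S, H_I] per-head indexer weights.
    scores = torch.einsum("bshd,btd->bsht", q.float(), k_idx.float())  # [B, S, H_I, T] FP32 -- the tensor that OOMs
    pooled = (torch.relu(scores) * weights.unsqueeze(-1)).sum(dim=2)   # head-pooled scores, [B, S, T]
    return pooled.topk(top_k, dim=-1)                                  # (scores, indices), each [B, S, top_k]

# Size of the FP32 intermediate at V4-Flash dims and S = 65,536:
# 1 * 65,536 * 64 * 16,384 * 4 bytes = 256 GiB.
```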
StreamIndex is a Triton implementation of the CSA pipeline. The core
component is a chunked partition-merge top-k driver that never
materializes the full intermediate. On a single NVIDIA H200, the
materialize path OOMs at S = 65,536 with V4-Flash dimensions; StreamIndex
runs the same indexer to S = 1,048,576 with 6.21 GB peak HBM, a
32× regime extension on a single GPU.
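The driver's strategy, reduced to a pure-PyTorch sketch (illustrative only; the shipped driver also chunks the query axis to bound the per-iteration score tile, and replaces the einsum with the fused Triton score kernel):

```python
import torch

def partition_merge_topk(q, k_idx, weights, top_k, chunk_t):
    # q: [B, S, H, D], k_idx: [B, T, D], weights: [B, S, H].
    B, S, H, D = q.shape
    T = k_idx.shape[1]
    run_vals = torch.full((B, S, top_k), float("-inf"), device=q.device)
    run_idx = torch.zeros((B, S, top_k), dtype=torch.int64, device=q.device)
    for t0 in range(0, T, chunk_t):
        kc = k_idx[:, t0 : t0 + chunk_t].float()                 # one key chunk, [B, Tc, D]
        s = torch.einsum("bshd,btd->bsht", q.float(), kc)        # scores for this chunk only
        s = (torch.relu(s) * weights.unsqueeze(-1)).sum(dim=2)   # [B, S, Tc]
        cv, ci = s.topk(min(top_k, s.shape[-1]), dim=-1)         # chunk-local winners
        vals = torch.cat([run_vals, cv], dim=-1)                 # merge candidates, [B, S, <= 2*top_k]
        idxs = torch.cat([run_idx, ci + t0], dim=-1)             # offset to global key ids
        run_vals, pick = vals.topk(top_k, dim=-1)                # keep the best top_k overall
        run_idx = idxs.gather(-1, pick)
    return run_vals, run_idx
```

Only the per-chunk score tile and two `[B, S, 2k]` merge buffers are ever live, which is why peak memory is bounded regardless of T.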
V4-Flash indexer scaling, single H200 SXM, BF16:
| Sequence length S | Materialize | StreamIndex (chunked) | Speedup |
|---|---|---|---|
| 32,768 | 317 ms / 129 GB | 31 ms / 0.40 GB | 10.3× |
| 65,536 | OOM | 122 ms / 0.59 GB | ∞ |
| 131,072 | OOM | 485 ms / 0.96 GB | ∞ |
| 262,144 | OOM | 1,935 ms / 1.71 GB | ∞ |
| 524,288 | OOM | 7,730 ms / 3.21 GB | ∞ |
| 1,048,576 | OOM | 30,900 ms / 6.21 GB | ∞ |
Set-overlap recall against the materialize ground truth is 1.0 (bit-exact index sets) at small S, where both paths fit. Raw logs are in `benchmarks/results/`.
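That check can be phrased as a short helper over the two index tensors (illustrative; the actual assertion lives in `eval/test_v4_indexer_parity.py`):

```python
import torch

def set_overlap_recall(pred_idx: torch.Tensor, ref_idx: torch.Tensor) -> float:
    # pred_idx, ref_idx: [B, S, k] int64 top-k indices per query.
    # Fraction of reference indices recovered by the chunked path; 1.0 == exact.
    hits = (pred_idx.unsqueeze(-1) == ref_idx.unsqueeze(-2)).any(dim=-2)  # [B, S, k]
    return hits.float().mean().item()
```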
Editable install (recommended for development):

```bash
git clone https://github.com/RightNow-AI/StreamIndex
cd StreamIndex
pip install -e .
```

Requirements: PyTorch ≥ 2.10, Triton ≥ 3.5, CUDA 12.x or 13.x, Python ≥ 3.11.
Tested on H100 / H200 (Hopper, sm_90a).
```python
import torch
from flash_sparse.triton.chunked_indexer import chunked_indexer_topk

B, S, H_I, D_I = 1, 262144, 64, 128
T = S // 4  # m = 4 compression
k = 512

q = torch.randn(B, S, H_I, D_I, dtype=torch.bfloat16, device="cuda") * (D_I**-0.5)
k_idx = torch.randn(B, T, D_I, dtype=torch.bfloat16, device="cuda") * (D_I**-0.5)
weights = torch.randn(B, S, H_I, dtype=torch.float32, device="cuda")

# Returns top-k indices [B, S, k] int64 + scores [B, S, k] FP32.
top_idx, _ = chunked_indexer_topk(
    q, k_idx, weights, top_k=k,
    causal_ratio=4,  # V4 causal mask: t < (s + 1) // ratio
    chunk_s=2048, chunk_t=8192,
)
```

The `causal_ratio` argument is what enables long-context scaling: it computes the per-tile causal mask without materializing the global `[B, S, T]` bool tensor, which is itself 256 GB at S = 1M, T = 256K.
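A sketch of the per-tile mask construction (illustrative PyTorch; inside the Triton kernel this is a handful of integer ops on block offsets, and tiles that are entirely visible or entirely masked skip it altogether):

```python
import torch

def tile_causal_mask(s0: int, t0: int, tile_s: int, tile_t: int,
                     ratio: int, device: str = "cuda") -> torch.Tensor:
    # Rows s0..s0+tile_s (queries) and cols t0..t0+tile_t (compressed keys)
    # of the global mask t < (s + 1) // ratio, built per tile on the fly.
    s = s0 + torch.arange(tile_s, device=device)
    t = t0 + torch.arange(tile_t, device=device)
    return t.unsqueeze(0) < ((s.unsqueeze(1) + 1) // ratio)  # [tile_s, tile_t] bool

# The global bool mask it replaces would be 2**20 * 2**18 bytes = 256 GiB
# at S = 1M, T = 256K; a 2048 x 8192 tile is 16 MB and is never stored in HBM.
```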
```
flash_sparse/
  csa.py                      End-to-end CSA forward (compressor + indexer + sparse-attn).
  hca.py                      Heavily-compressed-attention helpers.
  reference.py                PyTorch reference implementations (correctness ground truth).
  triton/
    chunked_indexer.py        The chunked partition-merge top-k driver (151 LOC).
    indexer_score.py          Fused Triton indexer score kernel (Q.K^T, ReLU, weighted sum).
    sparse_attn_fwd.py        Triton sparse attention forward (single-D MQA layout).
    sparse_attn_bwd.py        Triton sparse attention backward (atomic-scatter dKV).
    sparse_attn_bwd_v2.py     Backward variant using inverted-topk (research artifact).
    indexer_streaming_topk.py In-kernel streaming top-k attempt (parked, see file).
    inverted_topk.py          Inverted-index builder used by the v2 backward.
  utils/                      Compression / quantization stubs.
tests/                        Correctness tests (pytest). All require CUDA.
benchmarks/                   Benchmark scripts and raw result snapshots.
  results/2026-04-25/         Kernel-level bench snapshot.
  results/2026-04-27/         V4-Flash layer-level snapshot (32× regime extension).
  results/2026-05-01/         Design-space sweep and ablations.
eval/                         Reproducibility scripts: parity test, scaling bench, sweep.
docs/                         Math doc, IR analysis, fusion analysis, integration plan.
```
```bash
# Parity check (small S, materialize fits as ground truth):
python eval/test_v4_indexer_parity.py

# Layer-level S-scaling (the 32× regime extension):
python eval/bench_v4_indexer_scaling.py

# Design-space sweep (chunk_S, chunk_T, top_k):
python eval/sweep_design_space.py

# Three ablations (no-merge, skip-saturated, FP16 accumulation):
python eval/ablation_chunked_indexer.py
```

Each script is a single Python file with no required arguments. Logs are written to stdout. Reference numbers live under `benchmarks/results/`.
This is a kernel and a Python driver, not an end-to-end system.

- Tested on synthetic-but-realistic distributions, not loaded V4 checkpoint weights. The 270 GB FP8 V4-Flash weights do not fit on a single H200; a multi-GPU follow-up is required for end-to-end perplexity, retrieval, QA, or code benchmarks.
- Total HBM I/O is unchanged from the materialize path; only peak memory is reduced. At small S, where materialize fits, materialize is faster. The chunked path is the only viable option at long context.
- The auxiliary Triton attention forward in `flash_sparse/triton/` reaches about 8% of H200 BF16 peak at V4-Pro dimensions (d_h = 512). For production attention throughput, compose the chunked indexer with TileLang's pipelined `sparse_mla_fwd` (we benchmark this in `benchmarks/bench_pipeline_composition.py`).
```bibtex
@article{jaber2026streamindex,
  title  = {{StreamIndex}: Memory-Bounded Compressed Sparse Attention via Streaming Top-k},
  author = {Jaber, Jaber and Jaber, Osama},
  year   = {2026},
  note   = {arXiv preprint},
}
```

MIT. See LICENSE.
StreamIndex builds on the V3.2-Exp / V4 reference modelling code released
by DeepSeek-AI, the TileLang sparse_mla_fwd_pipelined attention kernel,
the Triton compiler, and the FlashAttention line of work.

