This repository provides a patch for SGLang that enables IndexCache inference acceleration for models using DeepSeek Sparse Attention (DSA), including DeepSeek-V3.2 and GLM-5.
TL;DR: IndexCache eliminates up to 75% of indexer computations in DSA through cross-layer index reuse, achieving up to 1.82× prefill speedup and 1.48× decode speedup with negligible quality degradation. One if/else branch, zero extra GPU memory.
In DSA, the lightning indexer selects the top-k most relevant tokens at each layer to make attention sparse. Although the indexer is cheap per FLOP, it runs independently at every layer with O(L²) complexity in the context length L. At long context lengths, it becomes the dominant bottleneck:
At 200K context, the indexer consumes 81% of prefill time.
We measured pairwise top-k index overlap across all 47 DSA layers and found that adjacent layers share 70–100% of their selected tokens:
Cross-layer top-k overlap heatmap. Most indexer computations are redundant.
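To make the overlap measurement concrete, here is a minimal sketch. It assumes overlap is the fraction of one layer's top-k token positions that another layer also selects for the same query; the exact metric behind the heatmap is not stated here, and `topk_overlap` is a hypothetical helper name.

```python
def topk_overlap(indices_a, indices_b):
    """Fraction of layer A's selected token positions that layer B also selected."""
    set_a, set_b = set(indices_a), set(indices_b)
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)


# Toy example: two layers each select k = 4 token positions for one query.
layer_i = [3, 17, 42, 99]
layer_j = [3, 17, 42, 100]
print(topk_overlap(layer_i, layer_j))  # 0.75
```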
IndexCache partitions layers into Full (F) layers that retain their indexer and Shared (S) layers that reuse the nearest F layer's cached indices:
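The F/S split can be sketched as a single branch inside the layer loop. This is an illustration, not the patch itself: it assumes "nearest" means the most recent preceding Full layer, and `run_layers` and `fake_indexer` are hypothetical names standing in for the real attention code and the lightning indexer.

```python
def run_layers(pattern, compute_indices):
    """Return the top-k indices each layer would attend with, given an F/S pattern.

    compute_indices(layer_id) stands in for the lightning indexer.
    """
    if pattern and pattern[0] != "F":
        raise ValueError("the first layer must be Full: there is nothing to reuse yet")
    cached = None
    used = []
    for layer_id, kind in enumerate(pattern):
        if kind == "F":
            cached = compute_indices(layer_id)  # run the indexer and refresh the cache
        elif kind != "S":
            raise ValueError(f"unexpected pattern character {kind!r}")
        used.append(cached)  # Shared layers skip the indexer and reuse the cache
    return used


calls = []

def fake_indexer(layer_id):
    calls.append(layer_id)
    return [layer_id, layer_id + 1]  # placeholder "top-k" indices


used = run_layers("FSSF", fake_indexer)
print(calls)  # [0, 3]
print(used)   # [[0, 1], [0, 1], [0, 1], [3, 4]]
```

Because Shared layers only hold a reference to indices that a Full layer already produced, the sketch allocates nothing extra for them.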
We propose two complementary approaches:
| Approach | Description | Requires Training? |
|---|---|---|
| Training-free | Greedy search selects which indexers to remove based on LM loss on a calibration set | ✗ |
| Training-aware | Multi-layer distillation trains each retained indexer to serve all layers it covers | ✓ |
Both retain only 1/4 of indexers with negligible quality degradation.
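The training-free search is not spelled out here, so the following is only one plausible reading of "greedy search based on LM loss": starting from all-Full, repeatedly convert the layer whose removal hurts calibration loss the least. `greedy_select`, `eval_loss`, and the choice to always keep layer 0 Full are assumptions for illustration.

```python
def greedy_select(num_layers, num_to_remove, eval_loss):
    """Greedily turn Full layers into Shared ones, one per round.

    eval_loss(pattern) is assumed to return LM loss on a calibration set
    when the model runs with the given F/S pattern string.
    """
    pattern = ["F"] * num_layers
    for _ in range(num_to_remove):
        best_layer, best_loss = None, float("inf")
        for i in range(1, num_layers):  # layer 0 stays Full (assumption)
            if pattern[i] == "S":
                continue
            trial = pattern.copy()
            trial[i] = "S"
            loss = eval_loss("".join(trial))
            if loss < best_loss:
                best_layer, best_loss = i, loss
        if best_layer is None:  # no Full layer left to convert
            break
        pattern[best_layer] = "S"
    return "".join(pattern)


# Toy stand-in for calibration loss: each removed indexer adds a fixed penalty.
penalties = [9.0, 0.1, 0.5, 0.2]

def toy_loss(pattern):
    return sum(p for p, kind in zip(penalties, pattern) if kind == "S")


print(greedy_select(4, 2, toy_loss))  # FSFS
```

In a real run each `eval_loss` call is a forward pass over the calibration set, so the search costs roughly `num_to_remove × num_layers` evaluations.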
| | Baseline | IndexCache (1/4) | Speedup |
|---|---|---|---|
| Prefill (200K) | 19.5s | 10.7s | 1.82× |
| Decode (200K) | 58 tok/s | 86 tok/s | 1.48× |
9 benchmarks virtually unchanged ✅
~1.2× E2E speedup with negligible degradation across 10 benchmarks (long-context + reasoning).
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout b638b25b

This patch is built and tested against commit b638b25b. It may apply cleanly to newer versions, but if you encounter conflicts, use this specific commit.

git apply /path/to/indexcache.patch

Configure via --json-model-override-args. Two options:
Every N-th layer keeps its indexer:
python -m sglang.launch_server \
--model-path zai-org/GLM-5-FP8 \
--json-model-override-args '{"index_topk_freq": 2}' \
... # your other args (tp, dp, etc.)

index_topk_freq=2 → every 2nd layer is Full, the rest are Shared (50% of indexers removed).
Specify per-layer F/S assignment:
python -m sglang.launch_server \
--model-path zai-org/GLM-5-FP8 \
--json-model-override-args '{"index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSFFFSFSSSFSFFSFFSSS"}' \
... # your other args

Each character maps to one DSA layer: F = Full (runs indexer), S = Shared (reuses cached indices).
Default behavior: When neither parameter is set, all layers run their indexer — identical to standard DSA.
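When launching from a script, the override value can be built with `json.dumps` so that shell quoting is not hand-written. This is a small sketch using only the flags shown above; `build_command` is a hypothetical helper, and any further arguments (tp, dp, etc.) still need to be appended.

```python
import json
import shlex


def build_command(model_path, overrides):
    """Assemble the launch command with a JSON-encoded override argument."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--json-model-override-args", json.dumps(overrides),
    ]


cmd = build_command("zai-org/GLM-5-FP8", {"index_topk_freq": 4})
print(shlex.join(cmd))  # shell-safe string; pass `cmd` itself to subprocess.run
```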
| Parameter | Type | Default | Description |
|---|---|---|---|
| index_topk_freq | int | 1 | Keep indexer every N layers. 1 = disabled, 4 = keep 1/4 |
| index_topk_pattern | string | null | Per-layer F/S pattern. Overrides index_topk_freq if set |
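How the two parameters combine can be sketched as follows. The precedence (pattern overrides frequency) and the default (all Full) come from the table above; which layers count as "every N-th" is an assumption here (0-based layer indices that are multiples of N), and `resolve_pattern` is a hypothetical helper, not a function from the patch.

```python
def resolve_pattern(num_layers, index_topk_freq=1, index_topk_pattern=None):
    """Return the effective per-layer F/S string for a model with num_layers DSA layers."""
    if index_topk_pattern is not None:
        # An explicit pattern wins over index_topk_freq.
        if len(index_topk_pattern) != num_layers or set(index_topk_pattern) - {"F", "S"}:
            raise ValueError("pattern must contain one 'F' or 'S' per DSA layer")
        return index_topk_pattern
    if index_topk_freq < 1:
        raise ValueError("index_topk_freq must be >= 1")
    # Assumption: layers 0, N, 2N, ... keep their indexer.
    return "".join("F" if i % index_topk_freq == 0 else "S" for i in range(num_layers))


print(resolve_pattern(8))                      # FFFFFFFF (standard DSA)
print(resolve_pattern(8, index_topk_freq=2))   # FSFSFSFS
print(resolve_pattern(8, index_topk_freq=4))   # FSSSFSSS
print(resolve_pattern(4, 4, "FFSS"))           # FFSS (pattern overrides freq)
```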
Which to use?
- index_topk_freq = 4: Simple, good default. Best with training-aware models.
- Custom pattern: Optimal for training-free deployment. The example above is the greedy-searched pattern for the GLM-5 model.
| Model | Architecture | Supported |
|---|---|---|
| DeepSeek-V3.2 | DeepseekV32ForCausalLM | ✅ |
| GLM-5 (744B) | GlmMoeDsaForCausalLM | ✅ |
Any model that uses the DSA indexer through SGLang's DeepseekV2AttentionMLA can benefit from this patch.
@article{bai2025indexcache,
title={IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse},
author={Bai, Yushi and Dong, Qian and Jiang, Ting and Lv, Xin and Du, Zhengxiao and Zeng, Aohan and Tang, Jie and Li, Juanzi},
journal={arXiv preprint arXiv:2603.12201},
year={2025}
}

This patch is released under the Apache 2.0 License, consistent with SGLang.


