Conversation

@RissyRan (Collaborator) commented Jan 13, 2026

Description

Main author: @shuningjin

Background

DeepSeek V3.2 differs from DeepSeek V3 solely in the attention mechanism, aiming for efficiency in long-context scenarios. While DeepSeek V3 uses Multi-head Latent Attention (MLA), DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA). DSA augments MLA with two components:

  • Indexer: parametric; a qk product produces index scores
  • Top-k token selection: non-parametric; selects the top-k keys/values for each query, introducing sparsity into qkv attention (see the sketch after this list)
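
A minimal sketch of this mechanism, for orientation only: index scores from a qk product, top-k selection over key positions, and the resulting mask applied to the attention logits. Shapes and function names are illustrative assumptions, not the MaxText implementation, and the paper's per-head weighting/activation of the index scores is omitted for brevity.

import jax
import jax.numpy as jnp

def dsa_index_mask(q_idx, k_idx, topk):
  """q_idx: [b, t, h, d] indexer queries; k_idx: [b, s, d] indexer keys.
  Returns a boolean mask [b, t, s], True at the top-k key positions per query."""
  # qk product: index scores, summed over indexer heads -> [b, t, s]
  scores = jnp.einsum("bthd,bsd->bths", q_idx, k_idx).sum(axis=2)
  # top-k key positions for each query token -> [b, t, topk]
  _, topk_indices = jax.lax.top_k(scores, topk)
  # boolean mask over key positions -> [b, t, s]
  s = k_idx.shape[1]
  return (jnp.arange(s) == topk_indices[..., None]).any(axis=-2)

def apply_index_mask(attn_logits, index_mask):
  """attn_logits: [b, n_heads, t, s]; mask out non-selected key positions before softmax."""
  neg_inf = jnp.finfo(attn_logits.dtype).min
  return jnp.where(index_mask[:, None, :, :], attn_logits, neg_inf)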

What this PR does

1. Naive implementation of DeepSeek Sparse Attention (DSA)

  • Indexer:

    • qk product: currently implemented with a dot product to get index scores. To be optimized.
    • (minor) RoPE: the indexer applies partial RoPE to q and k based on the YaRN extension. It uses the same YaRN frequencies as MLA, but with a concatenated layout rather than an interleaved layout (see the layout sketch after this list).
    • Based on the index scores, get the top-k indices and the index mask.
  • Top-k selection for qkv attention:

    • Currently implemented inside dot-product attention by adding the index mask to the regular attention mask. To be optimized.
  • Training only (no prefill / decode).

  • See changes in attention_mla.py and attention_op.py.
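
For the RoPE bullet above, a generic illustration of interleaved vs. concatenated (split-half) rotation layouts, assuming cos/sin of shape [..., d/2] precomputed from the same YaRN frequencies; these helpers are illustrative sketches, not the PR's functions.

import jax.numpy as jnp

def rope_interleaved(x, cos, sin):
  # Rotated pairs are (x[..., 2i], x[..., 2i+1]).
  x1, x2 = x[..., 0::2], x[..., 1::2]
  r1 = x1 * cos - x2 * sin
  r2 = x1 * sin + x2 * cos
  return jnp.stack([r1, r2], axis=-1).reshape(x.shape)

def rope_concatenated(x, cos, sin):
  # Rotated pairs are (x[..., i], x[..., i + d/2]): same frequencies, different layout.
  half = x.shape[-1] // 2
  x1, x2 = x[..., :half], x[..., half:]
  r1 = x1 * cos - x2 * sin
  r2 = x1 * sin + x2 * cos
  return jnp.concatenate([r1, r2], axis=-1)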

2. Onboard deepseek3.2-671b config

  • deepseek3.2-671b.yml
  • DeepSeek V3.2 vs. V3 HF config diff: additional config keys for the indexer:
"index_head_dim": 128, "index_n_heads": 64, "index_topk": 2048,
  • Number of parameters: (1) As with V3, the HF safetensors of V3.2 contain an extra layer for MTP, which we omit. (2) Note that the indexer contains extra parameters. (3) By counting, V3 has 671026419200 (671.03B) parameters and V3.2 has 671877944064 (671.88B) parameters (see the rough breakdown after this list).

3. Unit test: ahead-of-time train compile for deepseek3.2-671b

4. Unit test: compare output against torch code for Indexer and MLA
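
For the parameter counts in item 2, a rough back-of-the-envelope check. The per-layer indexer shapes below (wq_b, wk, k_norm, weights_proj) are assumptions taken from the DeepSeek reference implementation, not from this PR; with those shapes, the extra indexer parameters account exactly for the gap between the two totals above.

# Assumed dims: hidden=7168, q_lora_rank=1536, 61 transformer layers (MTP layer omitted).
hidden, q_lora_rank, n_layers = 7168, 1536, 61
index_n_heads, index_head_dim = 64, 128

indexer_per_layer = (
    q_lora_rank * index_n_heads * index_head_dim  # wq_b: q_lora -> n_heads * head_dim
    + hidden * index_head_dim                     # wk: hidden -> head_dim
    + hidden * index_n_heads                      # weights_proj: hidden -> n_heads
    + 2 * index_head_dim                          # k_norm (scale + bias)
)
extra = indexer_per_layer * n_layers
print(indexer_per_layer)     # 13959424
print(extra)                 # 851524864
print(671026419200 + extra)  # 671877944064, the v3.2 total above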

Reference

Future work

  • verify end-to-end training logits for deepseek3.2
  • more efficient implementation of DSA

Tests

Unit test against torch code (adapted from the reference): Indexer, MLA

python3 -m pytest -v --pyargs tests.check_deepseek32_vs_reference -rP -s

Unit test for train compile

python3 -m pytest -v --pyargs tests.train_compile_test -rP -s -k "test_deepseek32"

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 51.25000% with 39 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/MaxText/layers/attention_mla.py | 53.94% | 31 Missing and 4 partials ⚠️
src/MaxText/layers/attention_op.py | 0.00% | 2 Missing and 2 partials ⚠️


@RissyRan (Collaborator, Author) left a comment

Thanks for the change! I took a look at the indexer part, and overall it looks good functionally. There is also an indexer logits kernel for performance; I will take a look at that.

I will take a look at the MLA part shortly.

@shuningjin changed the title [DO NO MERGE] Draft for sparse DeepSeek3.2: Onboard sparse attention Jan 17, 2026
@shuningjin self-assigned this Jan 17, 2026
@RissyRan (Collaborator, Author) left a comment

Thanks for the change! Great work! A few comments.

@@ -0,0 +1,59 @@
# Copyright 2023–2025 Google LLC

Collaborator Author: Nit: update the copyright year to 2026.

v_head_dim: NonNegativeInt = Field(128, description="Dimension of V heads in MLA.")


class AttentionIndexer(BaseModel):

Collaborator Author: Could you add these 4 configs to the MoE config readme doc? This could be a follow-up PR.

"""Configuration for DeepSeek Sparse Attention (DSA): DeepSeek3.2-style MLA with indexer."""

use_sparse_indexer: bool = Field(False, description="If True, enables sparse indexer.")
index_head_dim: NonNegativeInt = Field(128, description="head dim for indexer")

Collaborator Author: Nit: capitalize the first letter of the description to align with the others. Similar comment for the following fields.

class Indexer(nnx.Module):
"""
Indexer for DeepSeek Sparse Attention (DSA).
Introduced by DeepSeek V3.2: https://arxiv.org/pdf/2512.02556.

Collaborator Author: Do you think we could also attach the reference implementation here along with the paper?

out_features_shape=(self.n_heads, self.head_dim),
axis=-1,
kernel_init=self.kernel_init,
# TODO(shuningjin): double check kernel axes

Collaborator Author: Do you have any concerns? I see it is aligned with MLA:

kernel_axes=("q_lora", "q_heads", "kv"),

We could start with this.

# Indexer Logic
index_mask = None
if self.use_sparse_indexer:
if self.q_lora_rank == 0:

Collaborator Author: Can we shift this logic earlier? I think validations are in the type.py file. A similar comment applies if you validate model_mode.

length = query.shape[-3]
target_hardware = self.mesh.devices[(0,) * self.mesh.devices.ndim].platform

if index_mask is not None and self.attention_kernel != "dot_product":

Collaborator Author: Similar comment for the case where sparse attention is enabled with another attention type.

"megablox=True",
"per_device_batch_size=1",
"max_target_length=1024",
"attention=dot_product", # only support dot product now

Collaborator Author: You could write a TODO: update to flash attention when it's available.

index_topk = 4


SEQ_LEN = 8

Collaborator Author: Could we also try SEQ_LEN < index_topk to make sure it works end to end?


print("torch out", pt_out)
print("jax out", jax_out)
# np.testing.assert_allclose(to_jax(pt_out / pt_out.sum()), jax_out / jax_out.sum(), rtol=1e-3, atol=1e-2)

Collaborator Author: It seems this is a normalization? I suggest removing it if it is not needed, or keeping it in the code (not as a comment).

@RissyRan (Collaborator, Author) commented

Also, don't forget to squash commits :)

Comment on lines +189 to +191
# 2. Broadcast compare against [b, t, k] to get [b, t, k, s]
# 3. Use .any() to see if a s-index is present in any of the k slots
is_topk = (jnp.arange(s) == topk_indices[..., None]).any(axis=-2)

Collaborator: Isn't this a really large tensor [b, t, k, s] that could cause OOM?

I think you can use jnp.put_along_axis (or jax.lax.scatter) to construct the mask directly without materializing the [b, t, k, s] tensor.
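
A minimal sketch of that suggestion, assuming topk_indices has shape [b, t, k] and key length s: scatter True directly into a [b, t, s] mask instead of materializing the [b, t, k, s] comparison tensor. Names are illustrative.

import jax.numpy as jnp

def topk_mask_scatter(topk_indices, s):
  """topk_indices: [b, t, k] integer key positions; returns a bool mask of shape [b, t, s]."""
  b, t, _ = topk_indices.shape
  mask = jnp.zeros((b, t, s), dtype=bool)
  # Scatter True at the selected key positions along the last axis.
  # inplace=False is required because JAX arrays are immutable.
  return jnp.put_along_axis(mask, topk_indices, True, axis=-1, inplace=False)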
