[fix][acc][sgl-atom] fix accuracy of fp8 attn weights model using ptpc quant recipe#747

Merged
zhuyuhua-v merged 3 commits into
mainfrom
yuhua/sgl-fp4-mtp-acc-fix
May 11, 2026
Collaborator

@zhuyuhua-v zhuyuhua-v commented May 11, 2026

Motivation

Align with #670
The Quark-quantized model amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 has FP8-weight linear layers in attention that use the PTPC quant recipe. However, the current ATOM code forces block-scale quant in _fuse_rmsnorm_quant. This PR fixes that.

Technical Details

_fuse_rmsnorm_quant should select the correct quant type based on the quant config/recipe. For per-token quant, a new kernel, fused_qk_rmsnorm_per_token_quant, is added in aiter.
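To make the difference between the two scale layouts concrete, here is a minimal pure-Python sketch. It is not the aiter kernel or ATOM's `_fuse_rmsnorm_quant`; the names `QuantType`, `compute_scales`, and `fused_rmsnorm_quant`, the group size of 128, and the FP8 maximum of 448 (e4m3fn) are all assumptions for illustration. Values are only rescaled, not actually cast to FP8.

```python
from enum import Enum
from typing import List, Tuple

FP8_MAX = 448.0  # assumption: max finite value of float8 e4m3fn; other FP8 variants differ


class QuantType(Enum):  # hypothetical stand-in for the real quant-type enum
    PER_TOKEN = "per_token"
    PER_BLOCK = "per_block"


def rmsnorm(row: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Plain RMSNorm over one token (one row of hidden values)."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = (mean_sq + eps) ** -0.5
    return [v * inv_rms * w for v, w in zip(row, weight)]


def compute_scales(row: List[float], quant_type: QuantType, group_size: int) -> List[float]:
    """One scale for the whole token (per-token) or one per group (block scale)."""
    if quant_type is QuantType.PER_TOKEN:
        groups = [row]
    else:
        groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    # Clamp the max so an all-zero group does not produce a zero scale.
    return [max(max(abs(v) for v in g), 1e-12) / FP8_MAX for g in groups]


def fused_rmsnorm_quant(
    x: List[List[float]],
    weight: List[float],
    quant_type: QuantType,
    group_size: int = 128,  # assumed block width
) -> List[Tuple[List[float], List[float]]]:
    """Normalize each token, then rescale it with the selected scale layout."""
    out = []
    for row in x:
        normed = rmsnorm(row, weight)
        scales = compute_scales(normed, quant_type, group_size)
        span = len(normed) if quant_type is QuantType.PER_TOKEN else group_size
        scaled = [v / scales[i // span] for i, v in enumerate(normed)]
        out.append((scaled, scales))
    return out
```

The point of the sketch is that the scale tensor has a different shape in each mode (one value per token versus one per group), so a downstream GEMM that expects per-token scales would misread block scales, which is consistent with the accuracy gap reported below.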

Test Plan

gsm8k accuracy was validated with this PR on amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 running under sglang-ATOM.

Test Result

Main branch:

amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 TP4

SGLang-ATOM:

local-completions ({'model': '/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4', 'base_url': 'http://localhost:8013/v1/completions', 'num_concurrent': 65, 'max_retries': 1, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.8544|±  |0.0097|
|     |       |strict-match    |     3|exact_match||0.8400|±  |0.0101|

This PR:

amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 TP4

SGLang-ATOM:

local-completions ({'model': '/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4', 'base_url': 'http://localhost:8013/v1/completions', 'num_concurrent': 65, 'max_retries': 1, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match||0.9340|±  |0.0068|
|     |       |strict-match    |     3|exact_match||0.9303|±  |0.0070|

For amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4, gsm8k accuracy increases from about 0.85 to 0.93 (flexible-extract: 0.8544 to 0.9340; strict-match: 0.8400 to 0.9303).

@zhuyuhua-v zhuyuhua-v marked this pull request as ready for review May 11, 2026 08:40
Copilot AI review requested due to automatic review settings May 11, 2026 08:40
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes an accuracy regression for SGLang-ATOM when running DeepSeek MLA attention with FP8-weight attention projections that use a PTPC (per-token) quantization recipe, by ensuring the fused RMSNorm+quant path honors the projection layer’s configured quantization type instead of forcing a block-scale scheme.

Changes:

  • Route SGLang MLA’s FP8 activation quantization through ATOM’s DeepSeek-V2 _fuse_rmsnorm_quant helper instead of fused_rms_fp8_group_quant.
  • Derive and pass quant_type from the corresponding FP8 projection modules (q_b_proj, kv_b_proj) so per-token quant recipes are respected.
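The second change can be pictured with a small sketch. Everything here is hypothetical: `FakeProj`, the `quant_type` attribute name, the string values, and the block-scale fallback are assumptions for illustration and are not taken from the ATOM or SGLang code.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_QUANT_TYPE = "per_block"  # assumed fallback when a layer carries no recipe


@dataclass
class FakeProj:
    """Hypothetical stand-in for an FP8 projection module such as q_b_proj."""
    name: str
    quant_type: Optional[str] = None


def resolve_quant_type(proj: object) -> str:
    """Read the quant type from the projection layer instead of hard-coding it."""
    quant_type = getattr(proj, "quant_type", None)
    return quant_type if quant_type is not None else DEFAULT_QUANT_TYPE


# The activation quant for each projection then follows that layer's own recipe.
q_b_proj = FakeProj("q_b_proj", quant_type="per_token")
kv_b_proj = FakeProj("kv_b_proj")
print(resolve_quant_type(q_b_proj), resolve_quant_type(kv_b_proj))
```

Reading the setting from the layer keeps the activation quantization and the weight quantization in agreement without the caller needing to know which recipe the checkpoint used.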

@zhuyuhua-v zhuyuhua-v merged commit 7934d5e into main May 11, 2026
38 of 44 checks passed
@zhuyuhua-v zhuyuhua-v deleted the yuhua/sgl-fp4-mtp-acc-fix branch May 11, 2026 08:52
@zhuyuhua-v
Collaborator Author

Aligned with Hattie: merging this PR because the sglang CI case always fails at the moment; the CI failure will be fixed in #614.