@fxyfxy777 fxyfxy777 commented Jan 26, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

from fastdeploy.model_executor.ops.gpu import group_swiglu_with_masked
from fastdeploy.model_executor.ops.gpu import masked_per_token_quant
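Fused the two operators above into fused_mask_swiglu_fp8_quant.
Dropped fp16 support; no call site currently needs it.
Dropped support for int64 inputs; likewise, there is no demand for it.
Added support for the ue8m0 scenario.
Accuracy: commit bd7b915 tests that the fused operator matches the pre-fusion operators bit for bit.
Removed the mask_per_token_quant operator. The mask_swiglu operator is still called in other files (custom_ops/gpu_ops/moe/moe_ffn.cu, custom_ops/gpu_ops/moe/moe_expert_ffn_wint2.cu), so it is not removed for now.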
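
To make the fusion concrete, here is a minimal NumPy sketch of what a masked SwiGLU followed by per-token, per-block FP8 quantization might compute. It is not the kernel from this PR: the function name, argument names, the gate-then-up input layout, the round-up-to-power-of-two treatment of ue8m0 scales, and the use of 448 as the e4m3 maximum are all assumptions made for illustration, and the result is kept in float32 rather than rounded to real fp8 values.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed target format: float8 e4m3, largest finite value


def ref_masked_swiglu_quant(x, token_nums, block_size=128, use_ue8m0=False):
    """Hypothetical reference, not the FastDeploy API.

    x:          [group_num, group_size, 2 * hidden_dim], gate half then up half (assumed)
    token_nums: [group_num], number of valid leading rows in each group
    Returns (q, scale). q is float32, divided by the scale and clipped to the
    e4m3 range (no rounding to the fp8 grid); scale is per token per block.
    Rows beyond token_nums[g] are skipped and stay zero.
    """
    group_num, group_size, two_h = x.shape
    hidden = two_h // 2
    assert hidden % block_size == 0
    n_blocks = hidden // block_size

    q = np.zeros((group_num, group_size, hidden), dtype=np.float32)
    scale = np.zeros((group_num, group_size, n_blocks), dtype=np.float32)

    for g in range(group_num):
        n = int(token_nums[g])
        if n == 0:
            continue
        gate = x[g, :n, :hidden].astype(np.float32)
        up = x[g, :n, hidden:].astype(np.float32)
        y = gate / (1.0 + np.exp(-gate)) * up  # SwiGLU: silu(gate) * up

        blocks = y.reshape(n, n_blocks, block_size)
        amax = np.maximum(np.abs(blocks).max(axis=-1), 1e-10)
        s = (amax / FP8_E4M3_MAX).astype(np.float32)
        if use_ue8m0:
            # Assumed ue8m0 handling: round each scale up to a power of two.
            exponent = np.ceil(np.log2(s)).astype(np.int32)
            s = np.ldexp(np.float32(1.0), exponent)
        scaled = np.clip(blocks / s[..., None], -FP8_E4M3_MAX, FP8_E4M3_MAX)
        q[g, :n] = scaled.reshape(n, hidden)
        scale[g, :n] = s
    return q, scale


# Tiny usage example: 2 groups of 4 rows, hidden_dim = 256, second group empty.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 2 * 256)).astype(np.float32)
q, scale = ref_masked_swiglu_quant(x, token_nums=[3, 0], use_ue8m0=True)
```

Doing both steps in one pass means the SwiGLU output never has to be written out and read back before quantization, which is the usual motivation for fusing operators like these.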

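Performance: test configuration:
self.group_num = 10
self.group_size = 2048
self.hidden_dim = 7168
self.block_size = 128
Each rank holds 10 experts, and the number of valid tokens ranges from 0 to 512.
Speedup from the replacement on H cards: about 1.6x
[Figure: fused_vs_separate_performance]
Speedup from the replacement on B cards: about 2x
[Figure: fused_vs_separate_performance (1)]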
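
As a rough illustration of what this configuration implies, the sketch below derives two numbers from it: how many quantization scales each token carries, and what share of the padded rows is actually valid. Drawing the valid-token count uniformly per expert is an assumption; the PR does not say how the counts were chosen.

```python
import numpy as np

# Values taken from the test configuration above.
group_num, group_size, hidden_dim, block_size = 10, 2048, 7168, 128

# Assumption: one valid-token count per expert, drawn uniformly from 0..512.
rng = np.random.default_rng(0)
token_nums = rng.integers(0, 513, size=group_num)

# One scale per block of 128 channels in each token's hidden vector.
scales_per_token = hidden_dim // block_size

# Share of the group_num * group_size padded rows that hold real tokens.
valid_fraction = token_nums.sum() / (group_num * group_size)

print(f"scales per token: {scales_per_token}")
print(f"valid rows: {valid_fraction:.1%} of the padded buffer")
```

With at most 512 valid rows out of 2048 per expert, no more than a quarter of the buffer holds real tokens, so skipping masked rows matters as much as the fusion itself.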

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot bot commented Jan 26, 2026

Thanks for your contribution!

@fxyfxy777 fxyfxy777 changed the title from "optimize mask_quant op speed up 1.5" to "[Optimize] optimize mask_quant & swiglu" Jan 26, 2026

codecov-commenter commented Jan 26, 2026

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6c685c9). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
..._executor/layers/moe/fused_moe_deepgemm_backend.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6222   +/-   ##
==========================================
  Coverage           ?   67.00%           
==========================================
  Files              ?      385           
  Lines              ?    51283           
  Branches           ?     7998           
==========================================
  Hits               ?    34362           
  Misses             ?    14430           
  Partials           ?     2491           
Flag | Coverage Δ
GPU | 67.00% <0.00%> (?)

@fxyfxy777
Contributor Author

/re-run all-failed

K11OntheBoat
K11OntheBoat previously approved these changes Jan 29, 2026
Collaborator

@K11OntheBoat K11OntheBoat left a comment

LGTM

qingqing01
qingqing01 previously approved these changes Jan 30, 2026
yongqiangma
yongqiangma previously approved these changes Jan 31, 2026
Collaborator

@yongqiangma yongqiangma left a comment

LGTM

@fxyfxy777
Contributor Author

/re-run all-failed

qingqing01
qingqing01 previously approved these changes Feb 2, 2026
@K11OntheBoat K11OntheBoat merged commit 2ada119 into PaddlePaddle:develop Feb 2, 2026
31 of 36 checks passed
fxyfxy777 added a commit to fxyfxy777/FastDeploy that referenced this pull request Feb 2, 2026
K11OntheBoat pushed a commit that referenced this pull request Feb 3, 2026
* optimize mask_quant op speed up 1.5

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant

* Revert "[Optimize] optimize mask_quant & swiglu (#6222)"

This reverts commit 2ada119.

* add block_size

* pre-commit

6 participants