[Cherry-Pick] [BugFix] fix all reduce fusion accurate issue (#7923)#7922
Conversation
|
Thanks for your contribution! |
|
root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
CI报告基于以下代码生成(30分钟更新一次): 1 任务总览所有 required 任务已通过,建议通过;当前仍有 2 个 optional 任务失败,不阻塞合并,仅供参考。
2 任务状态汇总2.1 Required任务 : 10/10 通过
2.2 可选任务 — 24/26 通过
3 失败详情(仅 required)无 required 失败任务。 |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## release/2.6 #7922 +/- ##
==============================================
Coverage ? 71.95%
==============================================
Files ? 385
Lines ? 54732
Branches ? 8568
==============================================
Hits ? 39381
Misses ? 12573
Partials ? 2778
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 Paddle-CI-Agent | pr_review |
2026-05-28 10:53:35
📋 Review 摘要
PR 概述:修复同时启用 FD_USE_PHI_RMSNORM=1 和 --enable-flashinfer-allreduce-fusion 时 RMSNorm proxy_rmsnorm 路径缺少 allreduce fusion 导致的精度问题,并修复 GLM4-MoE shared expert 的 enable_all_reduce_fusion 逻辑错误。
变更范围:model_executor/layers/normalization.py、model_executor/models/glm4_moe.py、tests/layers/
影响面 Tag:[OP] [Models]
问题
| 级别 | 文件 | 概述 |
|---|---|---|
normalization.py:289 |
proxy_rmsnorm 新路径继续使用 assert 做运行时校验 |
|
| 🟡 建议 | tests/layers/trtllm_allreduce_rms_fusion.py:588 |
TestRMSNormProxyAllreduceFused 全部 3 个测试方法均 skipTest(tp_size<2),单卡 CI 下完全跳过 |
历史 Findings 修复情况
| Finding | 问题 | 状态 |
|---|---|---|
| F1 | assert 用于运行时内核状态校验 |
assert) |
📝 PR 规范检查
标题格式与 Cherry-Pick 规范存在偏差:[Cherry-Pick] 与 [BugFix] 之间多了空格,且 "accurate" 拼写应为 "accuracy",PR 号前多了空格。
标题建议(可直接复制):
[Cherry-Pick][BugFix] fix all reduce fusion accuracy issue (#7923)
描述结构完整,所有必填段落均已填写,Checklist 勾选符合实际,无需修改。
总体评价
本次修复逻辑清晰,核心 bug(proxy_rmsnorm 路径漏做 allreduce fusion 及 GLM4-MoE shared expert 的 reduce_results 条件判断错误)定位准确,精度测试也随之更新为更贴近真实场景的 bf16。历史遗留 assert 问题仍未解决,且新增的 proxy 路径又引入一处同类用法;新测试类在单卡 CI 下全部被 skip,建议补充不依赖多卡的轻量级覆盖。
| input_tensor=x, | ||
| residual=residual_input, | ||
| weight=self.weight, | ||
| eps=self.eps, |
There was a problem hiding this comment.
assert 用于运行时内核状态校验
assert 在 Python 以 -O 优化模式运行时会被完全跳过,用于校验内核执行状态不可靠。
建议修复方式:
if norm_out[0] is None:
raise RuntimeError("Trtllm-all-reduce fusion failed: norm_out is None")| mock_manager.cleanup.assert_called_once() | ||
|
|
||
|
|
||
| class TestRMSNormProxyAllreduceFused(unittest.TestCase): |
There was a problem hiding this comment.
🟡 建议 TestRMSNormProxyAllreduceFused 全部 3 个测试均要求 tp_size >= 2,单卡 CI 下完全跳过
所有测试方法开头均有 self.skipTest("Requires tp_size >= 2"),在常见单卡 CI 环境中这 3 个测试永远不会执行,等同于无守护覆盖。
建议:
- 将不需要真正 all-reduce 通信的逻辑拆成
tp_size=1可运行的 mock 单测(例如对use_allreduce_fused布尔推导条件的验证) - 或在 CI 配置中增加多卡分布式 job 保证这些测试能运行
Motivation
修复同时开启 export FD_USE_PHI_RMSNORM=1 和 --enable-flashinfer-allreduce-fusion 引起的精度问题
Modifications
开启 export FD_USE_PHI_RMSNORM=1 环境变量,RMSNorm其中一个分支适配 all reduce fusion
Usage or Command
NA
Accuracy Tests
done
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.