[Others] fix allreduce fusion accurate issue in ep + tp mode#7947
Thanks for your contribution!
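CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
No Required tasks are currently failing, running, or waiting; for this PR, look first at the 2 failed Optional tasks (they do not block merging).
Main test tasks
2 Task status summary
Log column note: failed tasks use the Job link provided by CI directly; for running tasks the Job link is assembled manually.
2.1 Required tasks: 2/10 passed
2.2 Optional tasks: 21/32 passed
3 Failure details (Required only)
No Required tasks failed; by policy, failed Optional tasks are not analyzed in depth.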
b94cc7d to 7c4a99a
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## develop #7947 +/- ##
==========================================
Coverage ? 67.52%
==========================================
Files ? 467
Lines ? 65180
Branches ? 10007
==========================================
Hits ? 44011
Misses ? 18348
Partials ? 2821
Flags with carried forward coverage won't be shown.
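The headline percentage can be cross-checked against the counts in the Coverage Diff. This is a minimal sketch, assuming line coverage is computed as hits divided by total lines with partially covered lines counted as not hit; the variable names are illustrative only.

```python
# Counts copied from the Coverage Diff in the Codecov comment.
hits, misses, partials, lines = 44011, 18348, 2821, 65180

# Every tracked line should fall into exactly one of the three buckets.
counts_are_consistent = hits + misses + partials == lines

# Assumption: coverage = hits / lines, reported to two decimal places.
coverage_pct = round(100 * hits / lines, 2)

print(counts_are_consistent, coverage_pct)
```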
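CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
Failed Required tasks: 0. Required tasks awaiting handling: 0.
Main test tasks
2 Task status summary
Log column note: failed tasks directly use
2.1 Required tasks: 10/12 passed
2.2 Optional tasks: 28/33 passed
3 Failure details (Required only)
No Required tasks failed, so there are no failure details to expand.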
46faa8b to 9e3664b
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-29 10:00:11
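📋 Review summary
PR overview: fixes the accuracy problem caused by allreduce fusion in EP + TP mode
Scope of changes: normalization layer, glm4_moe model, unit tests
Impact tags: [Models] [OP]
Issues
No blocking issues found. The code logic is correct and the fix is reasonable.
Status of earlier findings
| Finding | Issue | Status |
|---|---|---|
| F1 | `post_attention_layernorm` still enables allreduce fusion in EP mode |
Analysis of F1:
`post_attention_layernorm` receives the output of the attention layer's `o_proj`. Attention is always tensor-parallel (even when MoE uses EP), and `o_proj` unconditionally sets `enable_all_reduce_fusion=True` to defer its allreduce. `post_attention_layernorm` therefore still has to perform the fused allreduce in EP mode; the current behavior is correct and needs no change.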
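The F1 reasoning can be illustrated with a toy simulation. This is only a sketch in plain Python, not FastDeploy code: `rms_norm`, `all_reduce_sum`, and `fused_allreduce_norm` are hypothetical helpers, the norm has no learned weight, and the per-rank values are made up. It shows that when the producer defers its allreduce, the following norm has to reduce across TP ranks first, otherwise each rank normalizes only a partial sum.

```python
import math

def rms_norm(x, eps=1e-6):
    """Plain RMSNorm without a learned weight (illustrative stand-in)."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]

def all_reduce_sum(per_rank):
    """Element-wise sum across ranks, standing in for a TP allreduce."""
    return [sum(vals) for vals in zip(*per_rank)]

def fused_allreduce_norm(per_rank):
    """Reduce the deferred partial sums first, then normalize."""
    return rms_norm(all_reduce_sum(per_rank))

# Hypothetical partial o_proj outputs on two TP ranks (allreduce deferred).
o_proj_partials = [
    [1.0, 2.0, 3.0, 4.0],   # rank 0
    [0.5, -1.0, 2.0, 0.0],  # rank 1
]

with_fusion = fused_allreduce_norm(o_proj_partials)
# Without the fused allreduce, rank 0 would normalize its partial sum only.
without_fusion = rms_norm(o_proj_partials[0])

max_gap = max(abs(a - b) for a, b in zip(with_fusion, without_fusion))
print(max_gap > 1e-3)
```

In this toy setup the two results differ noticeably, which is the kind of mismatch that would surface as an accuracy problem if the norm skipped a reduction its producer had deferred.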
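📝 PR convention check
The following problems were found:
- Inappropriate title tag: `[Others]` is vague; this PR fixes an accuracy bug and should use `[BugFix]`
- Missing `## Usage or Command` section (required by §D2)
- `## Accuracy Tests` is empty, yet "Provide accuracy results" is ticked in the Checklist; this is inconsistent, so either add the accuracy comparison or state N/A
- "Add unit tests" is ticked in the Checklist, but the PR description says nothing about unit tests; add a note or untick the box
Suggested title (ready to copy):
[BugFix] fix allreduce fusion accuracy issue in ep + tp mode
Suggested PR description (ready to copy)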
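## Motivation
Fix the accuracy problem that appears when allreduce fusion is enabled in ep + tp mode
## Modifications
Change the condition that triggers allreduce fusion (please add the specific files and logic that were changed)
## Usage or Command
N/A
## Accuracy Tests
N/A (if accuracy comparison data is available, add it here)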
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The fix is logically correct: `input_layernorm` disables allreduce fusion in EP mode (the MoE layer has already performed its own reduce), and the fusion flag of `Glm4MoeMLP.down_proj` is correctly tied to `not reduce_results`. The unit tests cover the key combinations, and the quality is good.
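The layer-norm part of this assessment can be written down as a small decision sketch. It is an assumption-laden illustration rather than the PR's code: `ParallelSetup`, its fields, and both helper functions are hypothetical names, and the `tp_size > 1` guard is assumed. The `down_proj` flag is left out because the review states only its binding, not the surrounding logic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelSetup:
    """Hypothetical config holder; field names are not from the codebase."""
    tp_size: int
    use_ep: bool
    fusion_enabled: bool  # assumed global switch for allreduce fusion

def input_layernorm_fuses(cfg: ParallelSetup) -> bool:
    # Its input comes from the previous MoE block, which under EP has
    # already reduced its own output, so no fused allreduce is needed.
    return cfg.fusion_enabled and cfg.tp_size > 1 and not cfg.use_ep

def post_attention_layernorm_fuses(cfg: ParallelSetup) -> bool:
    # Its input comes from o_proj, which defers its allreduce whether or
    # not MoE uses EP, so the fused allreduce is still required.
    return cfg.fusion_enabled and cfg.tp_size > 1

for use_ep in (False, True):
    cfg = ParallelSetup(tp_size=4, use_ep=use_ep, fusion_enabled=True)
    print(use_ep, input_layernorm_fuses(cfg), post_attention_layernorm_fuses(cfg))
```

Keeping the two decisions in separate functions mirrors the point of the review: the two norms sit behind different producers, so EP affects only one of them.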
/re-run all-failed
Motivation
Fix the accuracy problem that appears when allreduce fusion is enabled in ep + tp mode
Modifications
Change the condition that triggers allreduce fusion
Accuracy Tests
Checklist
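- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.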