[Others] fix allreduce fusion accurate issue in ep + tp mode #7947

Merged
K11OntheBoat merged 5 commits into PaddlePaddle:develop from BingooYang:fix_allreduce_acc_dev
May 29, 2026

@BingooYang
Contributor

Motivation

Fix the accuracy issue that occurs when allreduce fusion is enabled in ep + tp mode

Modifications

Change the condition that triggers allreduce fusion

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 28, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-28 11:42:37

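This CI report was generated from the following code (refreshed every 30 minutes):


1 Task overview

No Required task is currently failing, running, or waiting. For this PR, the 2 failed Optional tasks are worth a look first (they do not block merging). The main test task, Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage, is currently skipped.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42 (0) | 42 | 23 | 2 | 0 | 0 | 17 |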

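2 Task status summary

Note on the log column: failed tasks use the Job link provided by CI directly; for running tasks the Job link is assembled manually.

2.1 Required tasks: 2/10 passed

Required tasks block merging. No failed, running, or waiting Required tasks were found.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
|  | Remaining 2 Required tasks passed | - | - | - | - | - |
| ⏭️ | Remaining 8 Required tasks not run/skipped | - | skipped/neutral | No action needed | - | - |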

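2.2 Optional tasks: 21/32 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
|  | Check PR Template | 19s | Job | - |
|  | Trigger Jenkins for PR | 16s | Job | - |
|  | Remaining 21 Optional tasks passed | - | - | - |
| ⏭️ | Remaining 9 Optional tasks skipped | - | - | - |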

3 Failure details (Required only)

No Required tasks failed; by rule, failed Optional tasks are not analyzed in depth.

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@60e6223). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7947   +/-   ##
==========================================
  Coverage           ?   67.52%           
==========================================
  Files              ?      467           
  Lines              ?    65180           
  Branches           ?    10007           
==========================================
  Hits               ?    44011           
  Misses             ?    18348           
  Partials           ?     2821           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 77.76% <100.00%> (?) |
| XPU | 7.08% <0.00%> (?) |
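
The totals in the coverage diff above can be cross-checked with a little arithmetic. The sketch below assumes that the reported percentage is hits divided by total lines, with partially covered lines not counted as hits; the variable names are illustrative only, and the figures are copied from the report.

```python
# Figures copied from the Codecov report above.
hits = 44011
misses = 18348
partials = 2821
reported_lines = 65180

# Assumption: every tracked line is a hit, a miss, or a partial.
lines = hits + misses + partials

# Assumption: coverage is hits / lines, with partials not counted as hits.
coverage_pct = round(100 * hits / lines, 2)

print(lines == reported_lines)  # the three buckets add up to the line total
print(coverage_pct)             # matches the 67.52% shown in the report
```

Under these assumptions the three buckets sum to the reported line count and the ratio reproduces the reported 67.52%.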

@PaddlePaddle-bot

PaddlePaddle-bot commented May 28, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-29 13:17:23

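This CI report was generated from the following code (refreshed every 30 minutes):


1 Task overview

Failed Required tasks: 0. Required tasks awaiting action: 0. The main test task, Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage, has passed, and no Required CI failure currently blocks merging. 3 Optional tasks failed; these are for reference only.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 84 (39) | 45 | 38 | 3 | 0 | 0 | 4 |

Note: action_required workflows are not counted in the table above; there are currently 0 action_required workflows.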


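2 Task status summary

Note on the log column: failed tasks use the log_links_markdown field directly; for running tasks the link is assembled manually as [Job]({html_url})

2.1 Required tasks: 10/12 passed

Required tasks block merging, and failures must be handled first. No Required task is currently failing, running, or waiting. 2 Required tasks are not counted as passed, failed, running, or waiting (given skipped=4 in the overview, these are usually skipped or not-triggered states); treat the final GitHub Checks display as authoritative.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
|  | Remaining 10 Required tasks passed | - | - | - | - | - |
| ⏭️ | Remaining 2 Required tasks not counted as passed | - | No failures | No in-depth analysis needed; confirm in Checks | - | - |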

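2.2 Optional tasks: 28/33 passed

Optional tasks do not block merging; failures are for reference only. Per the Skill rules, no in-depth log analysis is performed.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
|  | Check PR Template | 27s | Job | 🔄×1 |
|  | CI_HPU | 1h37m | Job | 🔄×1 |
|  | Trigger Jenkins for PR | 19s | Job | 🔄×1 |
|  | Remaining 28 Optional tasks passed | - | - | - |
| ⏭️ | Remaining 2 Optional tasks skipped/not triggered | - | - | - |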

3 Failure details (Required only)

No Required tasks failed, so there are no failure details to expand.

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-29 10:00:11

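📋 Review summary

PR overview: Fixes the accuracy issue caused by allreduce fusion in EP + TP mode
Scope of changes: normalization layer, glm4_moe model, unit tests
Impact tags: [Models] [OP]

Issues

No blocking issues found. The code logic is correct and the fix is reasonable.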

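Status of previous findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | post_attention_layernorm still enables allreduce fusion in EP mode | ⚠️ Still present (in-depth analysis: the behavior is correct) |

Analysis of F1: post_attention_layernorm receives the output of the attention layer's o_proj. Attention is always TP-parallel (even when MoE uses EP), and o_proj unconditionally sets enable_all_reduce_fusion=True to defer the allreduce. post_attention_layernorm therefore still has to perform the fused allreduce in EP mode; the current behavior is correct and needs no change.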
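
To make the F1 reasoning concrete, here is a toy sketch of a deferred allreduce, not FastDeploy code. It assumes a 2-rank TP setup simulated with plain lists, an RMSNorm without a learned weight, and hypothetical helper names (`all_reduce_sum`, `rms_norm`, `norm_layer`). It only illustrates why a norm that follows a projection with a deferred allreduce has to perform the reduce itself.

```python
import math

def all_reduce_sum(per_rank):
    """Element-wise sum across ranks; stands in for a real allreduce."""
    return [sum(vals) for vals in zip(*per_rank)]

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned weight, for illustration only."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]

def norm_layer(per_rank, residual, fuse_all_reduce):
    """Hypothetical norm: optionally reduce, then residual add, then normalize."""
    if fuse_all_reduce:
        hidden = all_reduce_sum(per_rank)  # the deferred reduce happens here
    else:
        hidden = per_rank[0]               # rank 0's tensor, used as is
    return rms_norm([h + r for h, r in zip(hidden, residual)])

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for x, y in zip(a, b))

# o_proj with the allreduce deferred: each rank holds only a partial sum.
partials = [[1.0, 2.0], [3.0, 4.0]]
residual = [1.0, 1.0]

# Reference: reduce right after o_proj, then add the residual and normalize.
reference = rms_norm([4.0 + 1.0, 6.0 + 1.0])

fused_ok = close(norm_layer(partials, residual, True), reference)
skipped_reduce_ok = close(norm_layer(partials, residual, False), reference)
print(fused_ok, skipped_reduce_ok)
```

In this toy setup the fused path reproduces the reference, while a norm that skips the reduce normalizes a partial sum and diverges from it.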

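📝 PR convention check

The following issues were found:

  1. Inappropriate title tag: [Others] is vague. This PR fixes an accuracy bug and should use [BugFix]
  2. Missing ## Usage or Command section (a required section under §D2)
  3. ## Accuracy Tests is empty, but "Provide accuracy results" is checked in the Checklist. This is inconsistent: add the accuracy comparison results or state N/A
  4. "Add unit tests" is checked in the Checklist, but the PR description says nothing about unit tests. Add an explanation or uncheck the item

Suggested title (can be copied as is):

  • [BugFix] fix allreduce fusion accuracy issue in ep + tp mode
Suggested PR description (can be copied as is):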
## Motivation

Fix the accuracy issue that occurs when allreduce fusion is enabled in ep + tp mode

## Modifications

Change the condition that triggers allreduce fusion (please add details on the files and logic changed)

## Usage or Command

N/A

## Accuracy Tests

N/A (if accuracy comparison data is available, add it here)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

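Overall assessment

The fix logic is correct: input_layernorm disables allreduce fusion in EP mode (the MoE layer has already completed the reduce itself), and the fusion flag of Glm4MoeMLP.down_proj is correctly bound to not reduce_results. The unit tests cover the key combinations and are of good quality.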
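
The flag rules stated in this review can be summarized as a small decision function. This is a sketch and not the code from the PR: the function name `all_reduce_fusion_flags` and its parameters are hypothetical, and it assumes that allreduce fusion is switched on globally and that TP is in use. It encodes only the statements made in the review, without the reasoning behind them.

```python
def all_reduce_fusion_flags(use_ep: bool, reduce_results: bool) -> dict:
    """Hypothetical summary of the per-layer fusion flags described in the review.

    Assumes allreduce fusion is enabled globally and TP is in use.
    """
    return {
        # Per the review, o_proj always defers its allreduce.
        "o_proj": True,
        # So the norm that consumes o_proj's output always runs the fused allreduce.
        "post_attention_layernorm": True,
        # Per the review, disabled in EP mode because the MoE layer has
        # already completed the reduce itself.
        "input_layernorm": not use_ep,
        # Per the review, bound to `not reduce_results`.
        "down_proj": not reduce_results,
    }

ep_tp = all_reduce_fusion_flags(use_ep=True, reduce_results=True)
tp_only = all_reduce_fusion_flags(use_ep=False, reduce_results=False)
print(ep_tp["input_layernorm"], ep_tp["post_attention_layernorm"])
print(tp_only["input_layernorm"], tp_only["down_proj"])
```

In the EP + TP case only the attention path keeps fusion on under these assumptions, which matches the F1 conclusion that post_attention_layernorm should stay fused.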

@BingooYang
Contributor Author

/re-run all-failed

Collaborator

@K11OntheBoat K11OntheBoat left a comment

LGTM

@K11OntheBoat K11OntheBoat merged commit 92fdcf7 into PaddlePaddle:develop May 29, 2026
75 of 85 checks passed
4 participants