[Cherry-Pick][CI] Sync dev optimizations to 2.6 (#7602) #7610

Merged
EmmonsCurse merged 2 commits into PaddlePaddle:release/2.6 from EmmonsCurse:ci_optimize_26_0424 on Apr 24, 2026

@EmmonsCurse
Collaborator

Motivation

  1. Some tests may intermittently fail due to OOM or process kill issues, especially under constrained CI resources. Previously, a high-risk OOM test list was maintained to mitigate this, but it increases maintenance overhead. Introducing a retry mechanism provides a more robust and scalable solution to handle transient failures without excluding tests.
  2. Enhance CI debugging capability by collecting detailed pytest failure logs, reducing troubleshooting time for flaky or failed cases.
  3. Using git diff upstream/$BRANCH directly may include unrelated changes when the local branch is not strictly rebased, leading to incorrect detection results in CI checks.

Modifications

Cherry-pick of #7602 #7601 #7405 and to release/2.6.

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!

@EmmonsCurse
Collaborator Author

/skip-ci ci_iluvatar
/skip-ci ci_hpu
/skip-ci build_xpu

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@5508979). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7610   +/-   ##
==============================================
  Coverage               ?   73.06%           
==============================================
  Files                  ?      376           
  Lines                  ?    53401           
  Branches               ?     8352           
==============================================
  Hits                   ?    39019           
  Misses                 ?    11631           
  Partials               ?     2751           
Flag Coverage Δ
GPU 73.06% <ø> (?)



@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-24 17:44:34

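📋 Review Summary

PR overview: Cherry-picks CI optimizations to release/2.6, including the git diff fix, the OOM retry mechanism, pytest failure log collection, and noise reduction for the error grep
Scope of changes: CI scripts (scripts/), test configuration (tests/conftest.py)
Impact tag: [CI]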


Issues

Level        File                   Summary
❓ Question  tests/conftest.py:152  pytest_runtest_makereport only captures failures in the call phase; failures in the setup/teardown phases are not written to the error log

Change-by-change notes

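scripts/check_approval.sh: Changes git diff upstream/$BRANCH to git diff --merge-base upstream/$BRANCH, which is equivalent to diffing against $(git merge-base HEAD upstream/$BRANCH). This avoids flagging unrelated changes when the local branch is not strictly rebased. The fix is correct, and all 4 call sites are updated consistently ✓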
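As an illustration of why the merge base matters, here is a toy Python model of a commit history. It is not git itself and not code from this PR: the `Commit` class, the helper functions, and the file names `a.py` and `b.py` are all hypothetical. It only shows that diffing against the upstream tip also reports files that changed upstream, while diffing against the merge base reports only what the branch touched.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Commit:
    """Toy commit: a full file snapshot plus a link to its parent."""
    name: str
    files: Dict[str, str]
    parent: Optional["Commit"] = None


def ancestors(commit: Optional[Commit]) -> List[Commit]:
    """Return the commit and its ancestors, nearest first."""
    chain = []
    while commit is not None:
        chain.append(commit)
        commit = commit.parent
    return chain


def merge_base(a: Commit, b: Commit) -> Commit:
    """Nearest common ancestor (enough for histories without merge commits)."""
    names_b = {c.name for c in ancestors(b)}
    for c in ancestors(a):
        if c.name in names_b:
            return c
    raise ValueError("no common ancestor")


def changed_files(old: Commit, new: Commit) -> Set[str]:
    """Paths whose content differs between two snapshots."""
    paths = set(old.files) | set(new.files)
    return {p for p in paths if old.files.get(p) != new.files.get(p)}


# Hypothetical history: upstream moved on after the branch was created,
# and the local branch was not rebased.
base = Commit("base", {"a.py": "1", "b.py": "1"})
upstream_tip = Commit("upstream_tip", {"a.py": "1", "b.py": "2"}, parent=base)
head = Commit("head", {"a.py": "2", "b.py": "1"}, parent=base)

# Analogue of: git diff upstream/$BRANCH
direct = changed_files(upstream_tip, head)

# Analogue of: git diff --merge-base upstream/$BRANCH
from_base = changed_files(merge_base(head, upstream_tip), head)
```

In this model `direct` also contains `b.py`, which only upstream changed, whereas `from_base` contains just the file the branch modified.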

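scripts/coverage_run.sh: Adds a retry mechanism for exit code 137 (SIGKILL/OOM Killer), with at most 3 retries (at most 4 executions in total). Non-137 errors and timeouts (exit 124) do not trigger a retry. The logic is sound ✓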
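The retry policy described above can be sketched in Python as follows. This is a minimal sketch, not the contents of coverage_run.sh: `run_with_oom_retry`, `scripted`, and the `run_once` callable are hypothetical names, and the callable is assumed to return the exit code as a shell would report it.

```python
from typing import Callable, Iterable, List

# 128 + 9 (SIGKILL), as reported by a shell. Note that Python's subprocess
# module reports a directly spawned child killed by SIGKILL as -9 instead.
OOM_EXIT_CODE = 137


def run_with_oom_retry(run_once: Callable[[], int], max_retries: int = 3) -> List[int]:
    """Run once, then retry only while the last exit code is 137.

    Returns every exit code observed, one entry per execution, so the
    caller can see how many attempts were made.
    """
    codes = [run_once()]
    while codes[-1] == OOM_EXIT_CODE and len(codes) <= max_retries:
        codes.append(run_once())
    return codes


def scripted(codes: Iterable[int]) -> Callable[[], int]:
    """Hypothetical stand-in for a test run: yields preset exit codes."""
    it = iter(codes)
    return lambda: next(it)


always_killed = run_with_oom_retry(scripted([137] * 10))
recovers = run_with_oom_retry(scripted([137, 0]))
timed_out = run_with_oom_retry(scripted([124, 0]))
plain_failure = run_with_oom_retry(scripted([1, 0]))
```

With the default of 3 retries, a command that is killed every time runs 4 times in total, while exit codes 124 and 1 end the loop after the first execution.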

scripts/run_golang_router.sh / run_gpu_4cards.sh / run_pre_ce.sh: Adds an --exclude="pytest_*_error.log" filter to the grep -Rni error command, so the newly added pytest error log files in the CI logs are not scanned a second time and do not add noise ✓
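For readers less familiar with grep's `--exclude` glob, the filtering can be approximated in Python. This is a rough analogue under stated assumptions, not the scripts' actual code: `grep_error` is a hypothetical helper, and the file names `server.log` and `pytest_test_demo_error.log` are made up for the demonstration.

```python
import fnmatch
import os
import re
import tempfile
from typing import List, Tuple


def grep_error(root: str, exclude: str = "pytest_*_error.log") -> List[Tuple[str, int, str]]:
    """Rough analogue of: grep -Rni error --exclude=<glob> <root>.

    Returns (file name, line number, line) for each case-insensitive match,
    skipping files whose base name matches the exclude glob.
    """
    pattern = re.compile("error", re.IGNORECASE)
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in sorted(filenames):
            if fnmatch.fnmatchcase(name, exclude):
                continue  # e.g. the per-test pytest error logs
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if pattern.search(line):
                        hits.append((name, lineno, line.rstrip("\n")))
    return hits


with tempfile.TemporaryDirectory() as log_dir:
    with open(os.path.join(log_dir, "server.log"), "w", encoding="utf-8") as fh:
        fh.write("startup ok\nRuntime ERROR: worker exited\n")
    with open(os.path.join(log_dir, "pytest_test_demo_error.log"), "w", encoding="utf-8") as fh:
        fh.write("AssertionError: expected 1\n")

    filtered = grep_error(log_dir)                # pytest error logs skipped
    unfiltered = grep_error(log_dir, exclude="")  # nothing skipped
```

Without the exclude glob, the traceback already saved by pytest is matched again; with it, only the service log line remains.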

tests/conftest.py: Code refactoring (imports moved to the top of the file, docstrings added) plus a new pytest_runtest_makereport hook that writes the full traceback of each failed test to $FD_LOG_DIR/pytest_<case_name>_error.log, which helps later CI debugging ✓
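The log-writing step might look roughly like the sketch below. The PR page does not show how `<case_name>` is derived, so the sanitising rule here is an assumption, and `error_log_path` and `write_failure_log` are hypothetical helpers. In the real conftest.py such a helper would be called from the `pytest_runtest_makereport` hook with the report's traceback text and the directory taken from `$FD_LOG_DIR`.

```python
import os
import re
import tempfile


def error_log_path(log_dir: str, case_name: str) -> str:
    """Build pytest_<case_name>_error.log under log_dir.

    Assumption: characters that are awkward in file names (such as "/" and
    "::" in a pytest node id) are collapsed to a single underscore.
    """
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", case_name)
    return os.path.join(log_dir, f"pytest_{safe}_error.log")


def write_failure_log(log_dir: str, case_name: str, traceback_text: str) -> str:
    """Write the traceback text to the per-test error log and return its path."""
    os.makedirs(log_dir, exist_ok=True)
    path = error_log_path(log_dir, case_name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(traceback_text)
    return path


# A temporary directory stands in for $FD_LOG_DIR in this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    path = write_failure_log(
        tmp,
        "tests/test_demo.py::test_add",  # hypothetical node id
        "Traceback (most recent call last):\n  ...\nAssertionError\n",
    )
    written_name = os.path.basename(path)
    with open(path, encoding="utf-8") as fh:
        written_text = fh.read()
```

The resulting file name matches the `pytest_*_error.log` glob that the updated grep commands exclude.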


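Overall assessment

The changes are clear and reasonable, and the three optimizations work together: error log writing → grep filtering → OOM retry. Overall CI stability and debuggability both improve. No blocking issues.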

Comment thread tests/conftest.py
outcome = yield
report = outcome.get_result()

if report.when == "call" and report.failed:

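❓ Question: Only failures in the call (test execution) phase are captured here; failures in the setup and teardown phases are not written to the error log file. If fixture initialization fails, no error log is saved. Please confirm whether this is the intended behavior, or whether the setup/teardown phases should be covered as well.

To cover all phases, this could be changed to:

if report.failed:  # covers all phases: setup/call/teardown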
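The difference between the two conditions can be checked without running pytest. In the sketch below, `SimpleNamespace` objects stand in for pytest's report objects, using only the `when` and `failed` attributes that the hook reads; `should_log` and `ALL_PHASES` are hypothetical names introduced for illustration.

```python
from types import SimpleNamespace
from typing import Tuple

ALL_PHASES = ("setup", "call", "teardown")


def should_log(report, phases: Tuple[str, ...] = ("call",)) -> bool:
    """Decide whether a failure log should be written for this report.

    The default mirrors the merged condition
    `report.when == "call" and report.failed`; passing ALL_PHASES mirrors
    the suggested `report.failed`.
    """
    return bool(report.failed) and report.when in phases


# Stand-ins for reports from a failing fixture, a failing test body,
# and a passing test body.
setup_failure = SimpleNamespace(when="setup", failed=True)
call_failure = SimpleNamespace(when="call", failed=True)
call_pass = SimpleNamespace(when="call", failed=False)
reports = (setup_failure, call_failure, call_pass)

merged_behavior = [should_log(r) for r in reports]
suggested_behavior = [should_log(r, ALL_PHASES) for r in reports]
```

Under the merged condition the fixture failure is skipped; under the suggested one it is logged as well.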

@EmmonsCurse EmmonsCurse merged commit c8a59a3 into PaddlePaddle:release/2.6 Apr 24, 2026
36 of 37 checks passed
@EmmonsCurse EmmonsCurse deleted the ci_optimize_26_0424 branch April 24, 2026 11:45

4 participants