[CI] Add pytest failure log collection and persistence#7405

Merged
EmmonsCurse merged 4 commits into PaddlePaddle:develop from EmmonsCurse:ci_optimize_dev_0415
Apr 16, 2026

Collaborator

@EmmonsCurse EmmonsCurse commented Apr 14, 2026

Motivation

Enhance CI debugging capability by collecting detailed pytest failure logs, reducing troubleshooting time for flaky or failed cases.

Modifications

  • Introduce pytest hook to capture and persist failure logs into FD_LOG_DIR
  • Enable per-test error log generation for better traceability
  • Update test_*.py for debugging and validation
  • Disable test_Qwen3_30b_tp4.py due to unstable execution
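
As a rough illustration of the first two bullets, a failure-capturing hook in a conftest.py could be sketched as below. This is not the PR's actual code: the helper names `sanitize_nodeid` and `persist_failure`, the `log` fallback directory, and the exact file-name scheme are assumptions made for this sketch; only the `pytest_runtest_makereport` hook and the `FD_LOG_DIR` variable come from this PR.

```python
import os
import re
from pathlib import Path

import pytest


def sanitize_nodeid(nodeid: str) -> str:
    """Turn a pytest node id into a string that is safe to use in a file name."""
    return re.sub(r"[^A-Za-z0-9_.-]+", "_", nodeid).strip("_")


def persist_failure(nodeid: str, phase: str, details: str, log_dir: str) -> Path:
    """Append one failure report to <log_dir>/pytest_<sanitized id>_error.log."""
    target_dir = Path(log_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    log_path = target_dir / f"pytest_{sanitize_nodeid(nodeid)}_error.log"
    # Append so that a call failure and a teardown failure of the same test both survive.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(f"[{phase}] {nodeid}\n{details}\n")
    return log_path


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    outcome = yield  # let pytest build the report for this phase first
    report = outcome.get_result()
    if report.failed:
        # Falling back to "log" when FD_LOG_DIR is unset is an assumption of this sketch.
        log_dir = os.environ.get("FD_LOG_DIR", "log")
        persist_failure(report.nodeid, report.when, report.longreprtext, log_dir)
```

The hook runs once per phase (setup, call, teardown), so each test gets its own file and a failure in any phase is recorded with the phase name.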

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Collaborator Author

EmmonsCurse commented Apr 14, 2026

/skip-ci ci_iluvatar
/skip-ci ci_hpu
/skip-ci build_xpu
/skip-ci pre_ce_test
/skip-ci stable_test
/skip-ci base_test
/skip-ci logprob_test


paddle-bot bot commented Apr 14, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.


codecov-commenter commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@e952720). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7405   +/-   ##
==========================================
  Coverage           ?   73.29%           
==========================================
  Files              ?      398           
  Lines              ?    54950           
  Branches           ?     8606           
==========================================
  Hits               ?    40276           
  Misses             ?    11983           
  Partials           ?     2691           
Flag Coverage Δ
GPU 73.29% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.



PaddlePaddle-bot commented Apr 15, 2026

🤖 AI CI Agent | ci_status_monitor | {{TIMESTAMP}}

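CI report generated from the following code (updated every 15 minutes):


Task overview

| Unique tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | 🔄 Reruns |
|---|---|---|---|---|---|
| 36 | 25 | 1 | 0 | 0 | 0 |

Task summary

| Status | Task | Duration | Reruns | Failure summary | Log |
|---|---|---|---|---|---|
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h15m | - | Test failure: 6 cases failed because assertions were deliberately inverted | View |
| ⏭️ | 10 tasks skipped | - | - | bypass condition not met | - |
| ✅ | 25 tasks passed | - | - | - | - |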

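Failure analysis

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - test assertions deliberately inverted

Category: test failure | Confidence: high

Root cause analysis:

The core of this PR is a new pytest_runtest_makereport hook in tests/conftest.py that captures failed cases and persists their error logs to FD_LOG_DIR. To validate the hook, the PR deliberately inverted assertion conditions in several test files (== → !=, or a changed expected value), so 6 test cases are guaranteed to fail:

| Failing case | Change |
|---|---|
| tests/batch_invariant/test_batch_invariance_op_addmm.py | assert diff.item() == 0 → != 0 |
| tests/distributed/test_communication.py | expected argument 8 * 1024 * 1024 → 64 * 1024 * 1024 |
| tests/e2e/test_EB_VL_Lite_serving.py | baseline file name ernie-4_5-vl-base-tp2-dev-0311 → ernie-4_5-vl-base-tp2-dev |
| tests/e2e/test_ernie_21b_mtp.py | assert cached_tokens == ... → != ... |
| tests/scheduler/test_workers.py | assertEqual(len(workers.results), 0) → 1 |
| tests/utils/test_utils.py | assert retrive_model_from_server(...) == ... → != ... |

Note: tests/v1/test_schedule_output.py was also inverted (== 3329 → != 3329), but that case was skipped in this run.

Key log: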

Unit tests failed (exit code 8)
Failed test cases:
tests/distributed/test_communication.py
tests/e2e/test_EB_VL_Lite_serving.py
tests/e2e/test_ernie_21b_mtp.py
tests/batch_invariant/test_batch_invariance_op_addmm.py
tests/scheduler/test_workers.py
tests/utils/test_utils.py

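Suggested fixes:

  1. ⚠️ All inverted assertions must be restored before merging: these changes exist only to validate the feature and should not land on the main branch, otherwise CI will keep failing.
  2. Consider splitting the log collection in conftest.py and the deliberately failing test changes into separate PRs, and merging only the conftest.py change.
  3. If a validation case needs to be kept, add a dedicated test_log_collection.py that verifies log collection with pytest.raises or a similar mechanism, instead of modifying existing assertions.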

PaddlePaddle-bot

This comment was marked as outdated.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 AI Code Review | 2026-04-16 20:11 CST

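📋 Review summary

PR overview: adds pytest failure log collection and persistence to CI to speed up debugging
Scope of changes: tests/conftest.py (new pytest hook), scripts/coverage_run.sh (grep exclusion rule), test file disabled
Impact tag: CI

Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | scripts/coverage_run.sh:111 | The --exclude option comes after the positional arguments and may not take effect when POSIXLY_CORRECT is set |
| ❓ Question | tests/conftest.py:22 | The top-level import of serving_utils may affect collection of non-e2e tests |

Overall assessment

The PR is in good shape overall: the pytest hook is implemented correctly, the log file naming and sanitize logic are reasonable, and it fits together with the --exclude rule in coverage_run.sh. Two minor suggestions are offered for reference; no blocking issues were found.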

Comment thread scripts/coverage_run.sh

  echo ">>> grep error in ${isolated_log_dir}"
- grep -Rni --color=auto "error" "${isolated_log_dir}" || true
+ grep -Rni --color=auto "error" "${isolated_log_dir}" --exclude="pytest_*_error.log" || true

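🟡 Suggestion: place the --exclude option before the positional argument (the directory path)

--exclude currently comes after "${isolated_log_dir}". GNU grep accepts this by default, but if POSIXLY_CORRECT is set in the environment, option parsing stops at the first positional argument, so a trailing --exclude is no longer parsed as an option. The pytest error logs would then not be excluded and would add noise to the CI output.

Suggested change: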

grep -Rni --color=auto --exclude="pytest_*_error.log" "error" "${isolated_log_dir}" || true

Comment thread tests/conftest.py
from typing import Any, Union

import pytest
from e2e.utils.serving_utils import ( # noqa: E402

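❓ Question: is there any risk in promoting from e2e.utils.serving_utils import ... to a module-level import at the top of the file?

The import used to sit in the middle of the file (before the FDRunner class definition) and has now moved to the very top. As the root conftest of the tests/ directory, conftest.py is loaded by every test, including unit tests such as tests/logger/ and tests/scheduler/. If e2e.utils.serving_utils cannot be imported in the runtime environment (for example because a dependency is missing), the collection phase fails for all tests.

Please confirm that this import is available in every CI environment, or consider keeping a lazy import strategy, for example importing clean_ports and related symbols only inside FDRunner.__init__.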
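
The lazy import strategy mentioned above could be sketched as follows. The class here is a simplified stand-in rather than the real FDRunner, and the `utils_module` parameter is a hypothetical addition that only exists to make the sketch self-contained; the module path and the `clean_ports` symbol are the ones named in this comment.

```python
import importlib


class FDRunner:
    """Simplified stand-in for the real FDRunner; only the import pattern matters here."""

    def __init__(self, utils_module: str = "e2e.utils.serving_utils"):
        # Deferred import: resolved when a runner is constructed,
        # not when conftest.py itself is loaded during collection.
        utils = importlib.import_module(utils_module)
        self.clean_ports = utils.clean_ports
```

With this shape, a missing dependency only surfaces as an ImportError in tests that actually construct a runner, while unrelated unit tests still collect normally.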

@EmmonsCurse EmmonsCurse merged commit 91b8bf2 into PaddlePaddle:develop Apr 16, 2026
38 checks passed
@EmmonsCurse EmmonsCurse deleted the ci_optimize_dev_0415 branch April 16, 2026 14:56
4 participants