[CI] Fix nightly test error and add container cleanup in build_rl #7335

Merged
EmmonsCurse merged 2 commits into PaddlePaddle:develop from EmmonsCurse:ci_optimize_dev_0411
Apr 11, 2026

@EmmonsCurse EmmonsCurse commented Apr 11, 2026

Motivation

  • Fix failures observed in nightly tests, primarily related to container runtime and system-level limits.
  • Address issues caused by insufficient kernel parameters (e.g., shared memory / IPC limits), which can lead to unstable behavior or test crashes.
  • Improve CI security and stability by removing unnecessary --privileged usage.

Modifications

  • Fix nightly test errors by stabilizing the container runtime environment.
  • Add required sysctl parameters to ensure sufficient system resource limits for model inference and multiprocessing workloads.
  • Remove --privileged from Docker execution:
    • Reduce security risks by avoiding full host privilege exposure.
    • Replace with minimal required capabilities / configurations.
    • Ensure compatibility with CI environments where --privileged may be restricted.
  • Add container cleanup logic in build_rl to avoid leftover containers affecting subsequent jobs.
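The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: the container name, the `--name` flag, and the sysctl values are hypothetical placeholders, and the two IPC message-queue keys (`kernel.msgmax`, `kernel.msgmnb`) are assumed examples of the kind of limit being raised. Only `--rm`, `--net=host`, `--cap-add=SYS_PTRACE`, and `--shm-size=64G` come from the diff in this PR.

```shell
# Sketch only: names and sysctl values are hypothetical placeholders.

# Remove a leftover container with the given name, if one exists,
# so that it cannot interfere with the next job.
cleanup_container() {
  if docker ps -a --format '{{.Names}}' | grep -qxF -- "$1"; then
    docker rm -f "$1"
  fi
}

# Print the docker run flags, one per line, without --privileged.
# $1 is the (hypothetical) container name.
build_run_flags() {
  printf '%s\n' \
    --rm \
    --net=host \
    "--name=$1" \
    --cap-add=SYS_PTRACE \
    --shm-size=64G \
    --sysctl=kernel.msgmax=1048576 \
    --sysctl=kernel.msgmnb=268435456
}

# Example usage (commented out so that sourcing this file has no side effects):
#   cleanup_container "my_build_container"
#   docker run $(build_run_flags "my_build_container") "$IMAGE" bash -c '...'
```

The flags are left unquoted in the usage line on purpose so that the shell splits them into separate arguments; this works only because none of them contains whitespace.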

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

EmmonsCurse commented Apr 11, 2026

/skip-ci all

paddle-bot bot commented Apr 11, 2026

Thanks for your contribution!

fastdeploy-bot

This comment was marked as outdated.

@fastdeploy-bot fastdeploy-bot left a comment

🤖 AI Code Review | 2026-04-11

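📋 Review Summary

PR overview: Fixes nightly test errors, removes --privileged to improve security, and adds container cleanup logic.

Scope of changes: CI workflow configuration (.github/workflows/

Impact tag: [CI]

PR Convention Check

The PR title and description follow the conventions:

  • ✅ The title contains a valid tag, [CI]
  • ✅ The description fully completes the Motivation and Modifications sections

Issues Found

Severity | File | Summary
🟡 Suggestion | .github/workflows/_build_linux_rl.yml:142 | The case of --shm-size=64G is inconsistent with the 64g used in _accuracy_test.yml

Overall Assessment

The PR changes are reasonable overall:

  1. Removing --privileged reduces security risk, while the necessary --cap-add=SYS_PTRACE is kept
  2. The new --sysctl parameters increase IPC message queue capacity, consistent with the test workflows
  3. The new container cleanup logic is consistent with existing practice in the project

The only minor issue is the inconsistent case of the --shm-size unit (64G vs 64g); keeping it consistent is recommended.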

  echo "PARENT_DIR:$PARENT_DIR"
  docker run --rm --net=host \
- --cap-add=SYS_PTRACE --privileged --shm-size=64G \
+ --cap-add=SYS_PTRACE --shm-size=64G \

🟡 Suggestion: --shm-size=64G differs in case from the 64g used in _accuracy_test.yml in this PR.

Consider using the lowercase unit (64g) throughout for consistency.
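A consistency check of this kind could be automated with a small helper along the following lines. This is only a sketch: the function names are made up for illustration, and it assumes the flag is always written in the `--shm-size=<number><unit>` form shown in the diff.

```shell
# Sketch only: helper names are hypothetical.

# Read text on stdin and print each distinct --shm-size spelling, one per line.
list_shm_sizes() {
  grep -o -e '--shm-size=[0-9][0-9]*[A-Za-z]*' | LC_ALL=C sort -u
}

# Read text on stdin; succeed only if at most one distinct spelling is used.
shm_sizes_consistent() {
  count="$(list_shm_sizes | wc -l | tr -d ' ')"
  [ "$count" -le 1 ]
}

# Example usage (commented out so that sourcing this file has no side effects):
#   cat .github/workflows/*.yml | shm_sizes_consistent \
#     || echo "mixed --shm-size spellings found"
```

The comparison is deliberately case-sensitive, so 64G and 64g count as two different spellings even though they denote the same size.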

@EmmonsCurse EmmonsCurse merged commit fcf8b13 into PaddlePaddle:develop Apr 11, 2026
38 checks passed
@EmmonsCurse EmmonsCurse deleted the ci_optimize_dev_0411 branch April 11, 2026 04:14
EmmonsCurse added a commit to EmmonsCurse/FastDeploy that referenced this pull request Apr 11, 2026
EmmonsCurse added a commit to EmmonsCurse/FastDeploy that referenced this pull request Apr 11, 2026
EmmonsCurse added a commit to EmmonsCurse/FastDeploy that referenced this pull request Apr 12, 2026
EmmonsCurse added a commit that referenced this pull request Apr 12, 2026
* [Cherry-Pick][CI] Sync dev optimizations to 2.4(#7335)