[CI] increase shm-size to 128G in _unit_test_coverage.yml #7227
EmmonsCurse merged 2 commits into PaddlePaddle:develop
Thanks for your contribution!
/skip-ci ci_iluvatar
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #7227 +/- ##
==========================================
Coverage ? 73.45%
==========================================
Files ? 377
Lines ? 53229
Branches ? 8312
==========================================
Hits ? 39098
Misses ? 11392
Partials ? 2739
fastdeploy-bot left a comment
🤖 AI Code Review | 2026-04-08
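📋 Review Summary
PR overview: Increase the shared memory size and ulimit limits of the CI test container
Scope of changes: .github/workflows/_unit_test_coverage.yml
Impact tag: [CI]
📝 PR Convention Check
The PR title contains a valid tag, [CI], and the description includes Motivation and Modifications sections, so it meets the convention.
However, the Modifications section is incomplete: it mentions only the shm-size increase and omits the two newly added --ulimit settings.
Suggested update: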
## Modifications
- Increase Docker shared memory size from `64GB` to `128GB` in `_unit_test_coverage.yml` to improve stability under parallel workloads
- Add `--ulimit nofile=65536:65536` to increase maximum open file descriptors
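- Add `--ulimit nproc=8192:8192` to increase maximum user processes
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | PR description | Modifications does not cover all changes |
Overall Assessment
The CI configuration change is reasonable and should help stabilize parallel tests. Consider updating the PR description so that it records all of the modifications.
Also note that several other CI files in the project use shm-size=64G (for example _base_test.yml and _gpu_4cards_case_test.yml). If those tests also run short of shared memory, they may need the same adjustment. If only _unit_test_coverage.yml is affected, the current scope of the change is reasonable.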
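One way to see which workflow files still declare the smaller value is to scan them for `--shm-size` flags. The following is a minimal sketch and not part of this PR: the helper names `find_shm_sizes` and `scan_workflows` are hypothetical, and it assumes the flag is written as `--shm-size=<number><unit>` or `--shm-size <number><unit>`.

```python
import re
from pathlib import Path

# Matches e.g. "--shm-size=64G" or "--shm-size 128g" (assumed spelling of the flag).
SHM_RE = re.compile(r"--shm-size[=\s]+(\d+)\s*([kmgKMG])[bB]?")


def find_shm_sizes(text):
    """Return a list of (value, unit) pairs, one per --shm-size flag found in text."""
    return [(int(num), unit.upper()) for num, unit in SHM_RE.findall(text)]


def scan_workflows(root=".github/workflows"):
    """Map each *.yml file directly under root to the shm sizes it declares."""
    results = {}
    for path in sorted(Path(root).glob("*.yml")):
        sizes = find_shm_sizes(path.read_text(encoding="utf-8"))
        if sizes:
            results[path.name] = sizes
    return results


if __name__ == "__main__":
    # Run from the repository root to list every workflow that sets --shm-size.
    for name, sizes in scan_workflows().items():
        print(name, sizes)
```

Files that report a different value from `_unit_test_coverage.yml` would be the candidates for the same adjustment, if they turn out to hit the same shared-memory limit.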
${RDMA_DEVICES} \
--device=/dev/infiniband/rdma_cm \
--ulimit memlock=-1:-1 \
--ulimit nofile=65536:65536 \
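🟡 Suggestion: The Modifications section of the PR description does not cover this change.
The PR description mentions only the shm-size increase and omits the two newly added --ulimit settings. Consider explaining in the PR description what these ulimit limits do, so that future maintainers can understand the background of the change.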
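For background, `--ulimit nofile` sets the soft and hard caps on open file descriptors per process, `--ulimit nproc` caps the number of processes the user may create, and `--shm-size` sets the size of the `/dev/shm` mount. A test job could log the values it actually sees at startup. The sketch below is illustrative only and assumes a Linux container with Python available; the function name `report_limits` is hypothetical.

```python
import os
import resource
import shutil


def report_limits(shm_path="/dev/shm"):
    """Collect the resource limits and shared-memory size visible to this process.

    Each rlimit is a (soft, hard) tuple; resource.RLIM_INFINITY means unlimited.
    """
    info = {
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
        "nproc": resource.getrlimit(resource.RLIMIT_NPROC),
    }
    if os.path.isdir(shm_path):
        # Total size of the filesystem mounted at shm_path, in GiB.
        info["shm_total_gib"] = shutil.disk_usage(shm_path).total / 2**30
    else:
        info["shm_total_gib"] = None
    return info


if __name__ == "__main__":
    for key, value in report_limits().items():
        print(f"{key}: {value}")
```

Printing these values at the start of a run makes it easier to confirm that the container flags took effect when investigating an intermittent `Killed` failure.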
Motivation
In parallel test execution, multiple processes may concurrently utilize shared memory (/dev/shm), which can lead to instability under the current 64GB limit. Specifically:
- … /dev/shm simultaneously
This may result in intermittent failures (e.g., Killed) due to insufficient shared memory.
Modifications
- Increase Docker shared memory size from 64GB to 128GB in _unit_test_coverage.yml to improve stability under parallel workloads
- Add --ulimit nofile=65536:65536 to increase maximum open file descriptors
- Add --ulimit nproc=8192:8192 to increase maximum user processes
Usage or Command
N/A
Accuracy Tests
N/A
Checklist
- … [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- … pre-commit before commit.
- … release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.