[CI] Modify 4-card container startup config and move test case #7363

Merged
yuanlehome merged 1 commit into PaddlePaddle:develop from EmmonsCurse:ci_optimize_dev_0413 on Apr 13, 2026
@EmmonsCurse
Collaborator

Motivation

  • Improve stability and compatibility of 4-card CI jobs with RDMA-enabled environments
  • Ensure proper device mounting and resource limits for multi-card inference tests

Modifications

  • Update 4-card container startup configuration:
    • Add RDMA device auto-detection and injection (RDMA_DEVICES)
    • Mount /dev/infiniband/rdma_cm for RDMA communication
    • Add --ulimit memlock=-1:-1 to support RDMA memory locking
    • Add required capabilities (SYS_PTRACE, IPC_LOCK)
    • Adjust shared memory configuration (--shm-size=64g)
  • Add new environment variables for router, connector, and RDMA ports
  • Enable CUDA cleanup flag (CLEAN_CUDA=1)
  • Reorganize e2e test case:
    • Move test_ernie_03b_pd_router_v1_rdma_tp2.py into 4cards_cases directory for clearer test grouping
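The container startup changes listed above can be sketched roughly as follows. This is an illustration under assumptions, not the workflow's actual script: the helper name `detect_rdma_devices`, the choice of `uverbs*` nodes as the devices to detect, and the `DOCKER_OPTS` variable are hypothetical. Only the flags themselves (`--ulimit memlock=-1:-1`, `--cap-add`, `--shm-size=64g`, the `/dev/infiniband/rdma_cm` mount) and the `RDMA_DEVICES` name come from the PR description.

```shell
#!/usr/bin/env bash
# Sketch only: helper and variable names other than RDMA_DEVICES are hypothetical.

# Build --device flags for every uverbs* node under the given directory
# (default /dev/infiniband). Prints an empty line if nothing is found.
detect_rdma_devices() {
  local dir="${1:-/dev/infiniband}"
  local flags=""
  local dev
  for dev in "$dir"/uverbs*; do
    [ -e "$dev" ] || continue   # the glob matched nothing
    flags="$flags --device=$dev:$dev"
  done
  # rdma_cm is the node the PR mounts explicitly for RDMA communication
  if [ -e "$dir/rdma_cm" ]; then
    flags="$flags --device=$dir/rdma_cm:$dir/rdma_cm"
  fi
  echo "${flags# }"             # drop the leading space
}

RDMA_DEVICES="$(detect_rdma_devices)"

# Options named in the PR description, followed by any detected devices.
DOCKER_OPTS="--ulimit memlock=-1:-1 --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size=64g ${RDMA_DEVICES}"

# Print the command instead of running it, so the sketch is safe to execute.
echo "docker run ${DOCKER_OPTS} <image> <command>"
```

On a host without `/dev/infiniband`, `RDMA_DEVICES` stays empty and the remaining options are unchanged, which is one way to keep the same job definition usable on runners both with and without RDMA hardware.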

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 13, 2026

Thanks for your contribution!

@EmmonsCurse
Collaborator Author

/skip-ci ci_iluvatar
/skip-ci ci_hpu
/skip-ci build_xpu
/skip-ci coverage
/skip-ci stable_test
/skip-ci base_test
/skip-ci pre_ce_test
/skip-ci logprob_test

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-13

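📋 Review Summary

PR overview: Optimizes the 4-card CI container startup configuration to support RDMA environments

Scope of changes: .github/workflows/_gpu_4cards_case_test.yml, tests/e2e/4cards_cases/

Impact tag: [CI]

📝 PR Convention Check

The PR title and description both follow the conventions, and the Motivation and Modifications sections are clear.

Issues

No blocking issues found.

Overall Assessment

This PR sensibly adds RDMA device mounting, the required Docker capabilities (SYS_PTRACE, IPC_LOCK), the shared memory configuration (64G), and the ulimit setting to support 4-card tests in RDMA environments. The corrections to the test file's import path and GPU device assignment are also correct.

💡 Suggestions for Improvement (non-blocking)

Although it is outside the scope of this PR's changes, note that the PORTS array at .github/workflows/_gpu_4cards_case_test.yml:158 does not include the newly added FD_ROUTER_PORT, FD_CONNECTOR_PORT, and FD_RDMA_PORT. Port cleanup logic may need to be added in a follow-up.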
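
One possible shape for that follow-up is sketched below. It is an assumption-laden illustration, not the workflow's existing cleanup code: the helpers `add_ports` and `pids_on_port` are hypothetical, the real contents of `PORTS` are not shown in this PR, and the loop only prints what it would stop rather than killing anything.

```shell
#!/usr/bin/env bash
# Sketch only: add_ports and pids_on_port are hypothetical helpers.

# Append every non-empty port argument to the global PORTS array.
add_ports() {
  local p
  for p in "$@"; do
    if [ -n "$p" ]; then
      PORTS+=("$p")
    fi
  done
}

# Print the PIDs using a port; prints nothing if lsof is unavailable.
pids_on_port() {
  command -v lsof >/dev/null 2>&1 || return 0
  lsof -t -i ":$1" 2>/dev/null || true
}

PORTS=()   # in the real workflow this array already holds other ports
add_ports "${FD_ROUTER_PORT:-}" "${FD_CONNECTOR_PORT:-}" "${FD_RDMA_PORT:-}"

if [ "${#PORTS[@]}" -gt 0 ]; then
  for port in "${PORTS[@]}"; do
    for pid in $(pids_on_port "$port"); do
      # Dry run: report only. A real cleanup step would terminate the PID here.
      echo "would stop PID $pid on port $port"
    done
  done
fi
```

Skipping unset variables keeps the cleanup step harmless on jobs that never export the router, connector, or RDMA ports.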

@yuanlehome yuanlehome merged commit 1e08ee7 into PaddlePaddle:develop Apr 13, 2026
37 checks passed
@EmmonsCurse EmmonsCurse deleted the ci_optimize_dev_0413 branch April 13, 2026 12:28