[XPU][CI] lock xvllm version for fix bug #7264
plusNew001 merged 2 commits into PaddlePaddle:develop
Pull request overview
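This PR targets the XPU CI environment. It changes how the XPU EP environment variables are set, and it pins the xvllm version (develop branch) in the XPU custom-op dependency download script from the floating latest to a specific dated build, to make builds and CI more reproducible and stable.
Changes:
- Remove the hard-coded BKCL_RDMA_NICS setting from the XPU CI EP environment (it is now set later by dynamic detection).
- Pin the develop-branch xvllm version in custom_ops/xpu_ops/download_dependencies.sh from latest to 20260407.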
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
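| tests/xpu_ci/conftest.py | Adjusts the EP environment variable setup: drops the hard-coded RDMA NIC list and relies on dynamic detection. |
| custom_ops/xpu_ops/download_dependencies.sh | Pins the xvllm download version for the develop branch to stabilize the CI dependency. |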
| env_vars = { | ||
| "BKCL_ENABLE_XDR": "1", | ||
| - "BKCL_RDMA_NICS": "eth1,eth1,eth2,eth2", | ||
| "BKCL_TRACE_TOPO": "1", | ||
| "BKCL_PCIE_RING": "1", | ||
| "XSHMEM_MODE": "1", |
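After BKCL_RDMA_NICS was removed from env_vars in setup_ep_env(), the function still sets BKCL_RDMA_NICS later from get_rdma_nics(), but original_values no longer records the variable's previous value. As a result, a later restore_env(original_env) call cannot restore or clear BKCL_RDMA_NICS, which may leak environment state between test cases. Before setting BKCL_RDMA_NICS, record os.environ.get("BKCL_RDMA_NICS") in original_values, and have restore_env() restore or delete the variable depending on whether the saved value is None.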
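A minimal sketch of the save-then-restore pattern this comment suggests. It is not the actual conftest.py code: the function bodies are assumptions that only reuse the names from the comment, the env_vars dict is shortened, and get_rdma_nics() is a hypothetical stub with a made-up return value.

```python
import os


def get_rdma_nics():
    # Hypothetical stub: the real helper probes the host for RDMA NICs.
    return "eth1,eth2"


def setup_ep_env():
    """Set EP variables and return their previous values for later restore."""
    env_vars = {
        "BKCL_ENABLE_XDR": "1",
        "BKCL_TRACE_TOPO": "1",
    }
    original_values = {}
    for key, value in env_vars.items():
        original_values[key] = os.environ.get(key)  # None if the variable was unset
        os.environ[key] = value

    # Record the old value before the dynamic assignment as well,
    # so restore_env() knows about BKCL_RDMA_NICS.
    original_values["BKCL_RDMA_NICS"] = os.environ.get("BKCL_RDMA_NICS")
    os.environ["BKCL_RDMA_NICS"] = get_rdma_nics()
    return original_values


def restore_env(original_values):
    """Restore saved values; delete variables that were originally unset."""
    for key, value in original_values.items():
        if value is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = value
```

With this shape, a test that calls setup_ep_env() and later restore_env() leaves BKCL_RDMA_NICS exactly as it found it, whether the variable was set beforehand or not.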
| if [ "$1" == "stable" ]; then | ||
| version_xvllm="20251017" | ||
| version_xtdk="3.4.0.1" | ||
| else | ||
| - version_xvllm="latest" | ||
| + version_xvllm="20260407" | ||
| version_xtdk="latest" | ||
| fi |
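The PR description is currently empty (Motivation, Modifications, Usage, Accuracy Tests, and so on are still blank), while this change pins the develop-branch xvllm from "latest" to "20260407", which can affect CI and build reproducibility. Please add to the PR description: the specific bug symptom or link that requires the pin, why 20260407 was chosen, and how it was verified (for example, the relevant XPU CI job or a reproduction command).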
|
/skip-ci ci_iluvatar |
fastdeploy-bot left a comment
🤖 AI Code Review | 2026-04-09
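📋 Review summary
PR overview: pins the xvllm version to 20260407 to fix a bug, and removes the hard-coded BKCL_RDMA_NICS value in favor of dynamic detection
Scope of changes: custom_ops/xpu_ops/, tests/xpu_ci/
Impact tags: [CI] [XPU]
📝 PR convention check
The PR description does not fill in Motivation or Modifications.
Suggested title (follows the convention):
[CI][XPU] lock xvllm version and use dynamic RDMA NIC configuration
Description template (suggested addition):
## Motivation
1. Pin the xvllm version to 20260407 to fix the [xxx] bug
2. Remove the hard-coded BKCL_RDMA_NICS value and obtain it dynamically via get_rdma_nics() to fit different test environments
## Modifications
- custom_ops/xpu_ops/download_dependencies.sh: pin the xvllm version
- tests/xpu_ci/conftest.py: remove the hard-coded BKCL_RDMA_NICS value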
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | PR description | Motivation and Modifications are not filled in |
| 🟡 Suggestion | tests/xpu_ci/conftest.py | Environment variable restore logic is incomplete |
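Overall assessment
The change is logically sound, and replacing the hard-coded value with dynamic configuration is the right direction. However, the PR description is too brief and does not explain why the change is needed, and the environment variable restore logic has a minor flaw that should be fixed.
Detailed suggestion: setup_ep_env() currently calls get_rdma_nics() to obtain the RDMA NIC configuration. If the lookup fails and returns an empty string, the environment variable is still set, but restore_env() can only restore original values and cannot clear the variable. Check whether the return value is empty before setting it.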
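The empty-value check suggested here could look roughly like the sketch below. The helper name apply_rdma_nics is hypothetical and does not exist in the repository; it only illustrates skipping the assignment when the detected NIC string is empty, and recording the previous value when it is not.

```python
import os


def apply_rdma_nics(nics, original_values):
    """Set BKCL_RDMA_NICS only when `nics` is a non-empty string.

    Hypothetical helper. Returns True if the variable was set.
    """
    if not nics or not nics.strip():
        # Detection failed or found nothing: leave the environment untouched.
        return False
    # Save the previous value once so a later restore can undo the change.
    original_values.setdefault("BKCL_RDMA_NICS", os.environ.get("BKCL_RDMA_NICS"))
    os.environ["BKCL_RDMA_NICS"] = nics
    return True
```

A caller would pass the result of get_rdma_nics() as nics and the dict it later hands to restore_env() as original_values.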
* Remove duplicate NICs from environment variables * Update version for xvllm in download_dependencies.sh
✅ Cherry-pick successful! Created PR: #7265
✅ Cherry-pick successful! Created PR: #7266
ℹ️ Cherry-pick PR already exists: #7265
ℹ️ Cherry-pick PR already exists: #7266
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.