
[XPU][CI] lock xvllm version for fix bug #7264

Merged
plusNew001 merged 2 commits into PaddlePaddle:develop from plusNew001:ci-update-0408 on Apr 9, 2026


@plusNew001
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings April 9, 2026 03:42
@paddle-bot

paddle-bot bot commented Apr 9, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

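This PR targets the XPU CI environment. It changes how the XPU EP environment variables are set, and it pins the xvllm version used for the develop branch in the XPU custom-op dependency download script from the floating latest to a specific dated build, to make builds and CI runs more reproducible and stable.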

Changes:

  • Remove the hard-coded BKCL_RDMA_NICS setting from the XPU CI EP environment (the value is set later through dynamic detection).
  • In custom_ops/xpu_ops/download_dependencies.sh, pin the develop-branch xvllm version from latest to 20260407.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

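| File | Description |
| --- | --- |
| tests/xpu_ci/conftest.py | Adjusts the EP environment variable setup: drops the hard-coded RDMA NIC list and relies on dynamic detection. |
| custom_ops/xpu_ops/download_dependencies.sh | Pins the xvllm download version for the develop branch to stabilize CI dependencies. |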

Comment on lines 292 to 296
  env_vars = {
      "BKCL_ENABLE_XDR": "1",
-     "BKCL_RDMA_NICS": "eth1,eth1,eth2,eth2",
      "BKCL_TRACE_TOPO": "1",
      "BKCL_PCIE_RING": "1",
      "XSHMEM_MODE": "1",

Copilot AI Apr 9, 2026


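After BKCL_RDMA_NICS was removed from env_vars in setup_ep_env(), the function still sets BKCL_RDMA_NICS later from get_rdma_nics(), but original_values no longer records the variable's previous value. As a result, when the caller later runs restore_env(original_env), BKCL_RDMA_NICS cannot be restored or cleaned up, which may leak environment state between test cases. Suggestion: before setting BKCL_RDMA_NICS, record os.environ.get("BKCL_RDMA_NICS") in original_values, and have restore_env() restore the value when it is not None or delete the variable when it is None.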
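The save-then-restore pattern suggested above can be sketched as follows. This is a minimal illustration, not the code in tests/xpu_ci/conftest.py: the function bodies, the env_vars parameter, and the stubbed get_rdma_nics() return value are all assumptions made for the example.

```python
import os


def get_rdma_nics():
    """Hypothetical stub; the real helper detects NICs on the CI host."""
    return "eth1,eth2"


def setup_ep_env(env_vars):
    """Apply env_vars plus the detected BKCL_RDMA_NICS.

    Returns the previous value of every variable touched, where None
    means the variable was not set before.
    """
    updates = dict(env_vars)
    updates["BKCL_RDMA_NICS"] = get_rdma_nics()

    original_values = {}
    for key, value in updates.items():
        # Record the old value BEFORE overwriting it.
        original_values[key] = os.environ.get(key)
        os.environ[key] = value
    return original_values


def restore_env(original_values):
    """Undo setup_ep_env(): restore old values, delete variables that were unset."""
    for key, old_value in original_values.items():
        if old_value is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = old_value
```

Because the dynamically detected variable goes through the same loop as the static ones, a later restore_env() call cleans it up without any special casing.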

Copilot uses AI. Check for mistakes.
Comment on lines 14 to 20
  if [ "$1" == "stable" ]; then
      version_xvllm="20251017"
      version_xtdk="3.4.0.1"
  else
-     version_xvllm="latest"
+     version_xvllm="20260407"
      version_xtdk="latest"
  fi

Copilot AI Apr 9, 2026


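The PR description is still unfilled (Motivation, Modifications, Usage, Accuracy Tests, and so on are empty), while this change pins the develop-branch xvllm from "latest" to "20260407", which can affect CI and build reproducibility. Please add to the PR description: the specific bug symptom or a link that motivates the pin, why 20260407 was chosen, and how to verify it (for example, the relevant XPU CI job or a reproduction command).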

@plusNew001
Collaborator Author

/skip-ci ci_iluvatar
/skip-ci ci_hpu
/skip-ci build_gpu


@fastdeploy-bot fastdeploy-bot left a comment


🤖 AI Code Review | 2026-04-09

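📋 Review Summary

PR overview: pins the xvllm version to 20260407 to fix a bug, and removes the hard-coded BKCL_RDMA_NICS value in favor of dynamic detection.

Scope of changes: custom_ops/xpu_ops/, tests/xpu_ci/

Impact tags: [CI] [XPU]

📝 PR Convention Check

The PR description does not fill in Motivation and Modifications.

Suggested title (follows the convention):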

  • [CI][XPU] lock xvllm version and use dynamic RDMA NIC configuration

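Description template (suggested additions):

## Motivation

1. Pin the xvllm version to 20260407 to fix the [xxx] bug
2. Remove the hard-coded BKCL_RDMA_NICS value and obtain it dynamically via get_rdma_nics() to fit different test environments

## Modifications

- custom_ops/xpu_ops/download_dependencies.sh: pin the xvllm version
- tests/xpu_ci/conftest.py: remove the hard-coded BKCL_RDMA_NICS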

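Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | PR description | Motivation and Modifications are not filled in |
| 🟡 Suggestion | tests/xpu_ci/conftest.py | Environment variable restore logic is incomplete |

Overall assessment

The changes are reasonable, and replacing the hard-coded value with dynamic configuration is the right direction. However, the PR description is too brief and does not explain why the changes were made, and the environment variable restore logic has a minor flaw that should be tightened.

Detailed suggestion: the current code calls get_rdma_nics() in setup_ep_env() to obtain the RDMA NIC configuration. If detection fails and returns an empty string, the environment variable is still set, but restore_env() can only restore original values and cannot clear the variable. Consider checking whether the return value is empty before setting it.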
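The empty-value guard suggested above could look roughly like this. It is a sketch under assumptions: apply_rdma_nics() is a hypothetical helper name that does not appear in the PR, and the environ parameter exists only so the example can run against a plain dict.

```python
import os


def apply_rdma_nics(nics, environ=os.environ):
    """Set BKCL_RDMA_NICS only when detection produced a non-empty value.

    nics is the string returned by NIC detection (possibly empty or None).
    Returns (previous_value, was_set) so the caller can restore later.
    """
    previous_value = environ.get("BKCL_RDMA_NICS")
    cleaned = (nics or "").strip()
    if cleaned:
        environ["BKCL_RDMA_NICS"] = cleaned
    # When detection fails, the variable is left exactly as it was.
    return previous_value, bool(cleaned)
```

With this guard, a failed detection leaves the environment untouched, so there is nothing for restore_env() to clean up in that case.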

@plusNew001 plusNew001 merged commit 80d5d9f into PaddlePaddle:develop Apr 9, 2026
38 of 41 checks passed
EmmonsCurse pushed a commit to EmmonsCurse/FastDeploy that referenced this pull request Apr 9, 2026
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh
@EmmonsCurse
Collaborator

✅ Cherry-pick successful! Created PR: #7265

EmmonsCurse pushed a commit to EmmonsCurse/FastDeploy that referenced this pull request Apr 9, 2026
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh
@EmmonsCurse
Collaborator

✅ Cherry-pick successful! Created PR: #7266

EmmonsCurse pushed a commit to EmmonsCurse/FastDeploy that referenced this pull request Apr 9, 2026
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh
@EmmonsCurse
Collaborator

ℹ️ Cherry-pick PR already exists: #7265

@EmmonsCurse
Collaborator

ℹ️ Cherry-pick PR already exists: #7266

EmmonsCurse pushed a commit to EmmonsCurse/FastDeploy that referenced this pull request Apr 9, 2026
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh
plusNew001 added a commit that referenced this pull request Apr 9, 2026
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
plusNew001 added a commit that referenced this pull request Apr 9, 2026
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>