@specture724
Collaborator

"Please set the 'NCCL_IB_HCA' or 'PS_P2P_STORE_RDMA_DEVICES' environment variable to choose a proper number of RDMA devices. The number of RDMA devices should be less than or equal to the GPU count, and the GPU count should be divisible by the number of RDMA devices. The acceptable values for NCCL_IB_HCA are documented at 'https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#id8'."
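
The two constraints in this message can be sketched as a small check. This is only an illustration and not the code changed by this PR: the function name `gpus_per_rdma_device` and its parameters are hypothetical, and it assumes the device and GPU counts have already been determined elsewhere.

```python
def gpus_per_rdma_device(num_rdma_devices: int, gpu_count: int) -> int:
    """Return how many GPUs share each RDMA device, or raise ValueError.

    Hypothetical helper: it only encodes the two rules stated in the
    message above and does not read any environment variable.
    """
    if num_rdma_devices <= 0:
        raise ValueError("at least one RDMA device is required")
    # Rule 1: the number of RDMA devices must not exceed the GPU count.
    if num_rdma_devices > gpu_count:
        raise ValueError(
            f"{num_rdma_devices} RDMA devices exceed GPU count {gpu_count}"
        )
    # Rule 2: the GPU count must be divisible by the number of RDMA devices.
    if gpu_count % num_rdma_devices != 0:
        raise ValueError(
            f"GPU count {gpu_count} is not divisible by "
            f"{num_rdma_devices} RDMA devices"
        )
    return gpu_count // num_rdma_devices
```

For example, 8 GPUs with 4 RDMA devices passes both rules, while 8 GPUs with 3 devices fails the divisibility rule and 8 GPUs with 16 devices fails the upper bound.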

@specture724 requested review from Copilot and weixiao-huang, and removed the request for Copilot, on November 12, 2025 05:19
@specture724 force-pushed the fix/choose_rdma_devices branch from 98a32b4 to 7ef6034 on November 12, 2025 05:20
@weixiao-huang merged commit e2b1e1b into MoonshotAI:main on Nov 12, 2025
2 checks passed