[Bug] Invalid NICs detected in _get_rdma_devices() #15

@alogfans

Description

https://github.com/MoonshotAI/checkpoint-engine/blob/05827aadf5a5e0fa6203b7ad41eef393a5f4d07a/checkpoint_engine/ps.py#L272C1-L312C19

Currently, the code scans a specific directory to discover NIC devices. However, in my environment that directory includes a device that appears to be disabled (I'm not sure, as I'm not the system maintainer, but the device does not appear in the output of ibv_devices). I suggest using ibv_get_device_list (or an equivalent API) to retrieve the list of valid RDMA NIC devices. This would improve portability and correctness, and keep device discovery aligned with the official API.
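
To illustrate the suggestion, here is a minimal sketch of what device discovery through libibverbs could look like from Python using ctypes. It is not the project's code: the function name `list_rdma_devices` is hypothetical, and the fallback of returning an empty list when libibverbs cannot be loaded is an assumption made for illustration. The only library calls used are `ibv_get_device_list`, `ibv_get_device_name`, and `ibv_free_device_list`.

```python
import ctypes
import ctypes.util
from typing import List


def list_rdma_devices() -> List[str]:
    """Return the RDMA device names reported by libibverbs.

    Hypothetical helper: returns [] if libibverbs is not installed
    or reports no devices (assumed fallback, not project behavior).
    """
    lib_path = ctypes.util.find_library("ibverbs")
    if lib_path is None:
        return []
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return []

    # struct ibv_device **ibv_get_device_list(int *num_devices);
    lib.ibv_get_device_list.restype = ctypes.POINTER(ctypes.c_void_p)
    lib.ibv_get_device_list.argtypes = [ctypes.POINTER(ctypes.c_int)]
    # const char *ibv_get_device_name(struct ibv_device *device);
    lib.ibv_get_device_name.restype = ctypes.c_char_p
    lib.ibv_get_device_name.argtypes = [ctypes.c_void_p]
    # void ibv_free_device_list(struct ibv_device **list);
    lib.ibv_free_device_list.restype = None
    lib.ibv_free_device_list.argtypes = [ctypes.POINTER(ctypes.c_void_p)]

    count = ctypes.c_int(0)
    dev_list = lib.ibv_get_device_list(ctypes.byref(count))
    if not dev_list:  # NULL pointer: the call failed
        return []
    try:
        names = []
        for i in range(count.value):
            name = lib.ibv_get_device_name(dev_list[i])
            if name is not None:
                names.append(name.decode())
        return names
    finally:
        # The list is owned by libibverbs and must be released.
        lib.ibv_free_device_list(dev_list)


if __name__ == "__main__":
    print(list_rdma_devices())
```

The result should match the devices shown by ibv_devices, since both go through the same library call.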
