Mellanox IB Device Missing on NC24r & NC24rs_v2 #28
NCr_v3 is SR-IOV enabled, while NCr and NCr_v2 are not. Some details on the bifurcation here. In summary, the ND driver path needs to be enabled for the non-SR-IOV NCr and NCr_v2 VM sizes.
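For readers hitting this, a minimal sketch of checking whether the ND (NetworkDirect) driver path is usable on a non-SR-IOV size. The exact module name can carry a version suffix depending on the kernel, so treat the names below as assumptions to verify on your VM:

```bash
# Check whether the vmbus RDMA module for the ND path exists for the running kernel.
# (On some Azure kernels the module name carries a version suffix, e.g. hv_network_direct_XXX_X.)
find /lib/modules/$(uname -r) -name 'hv_network_direct*'

# Load whatever module the command above reports, then check that the verbs stack sees a device.
sudo modprobe hv_network_direct   # adjust the name to match the find output
ibv_devinfo | head
```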
I've also updated the known issues section with this feedback; the update should go live in a few hours.
@vermagit Thanks for the quick response! Unfortunately this still doesn't work for me when following the instructions for non-SR-IOV machine types on the standard ubuntu18.04 image.
It appears that support for the ND driver stack (the vmbus RDMA driver required in non-SR-IOV VMs) was dropped in the 5.3 kernel shipped in the latest Ubuntu 18.04-LTS image in the Marketplace. This will be taken up with Canonical. An older image with kernel 5.0 (say Canonical UbuntuServer 18.04-LTS 18.04.202004080) has the missing "hv_network_direct" module and should work. Thank you for reporting this issue. Please let us know here if the above workarounds work for you.
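If it helps, a hedged sketch of pinning that older Marketplace image with the Azure CLI; the resource group and VM name below are placeholders:

```bash
# List available UbuntuServer 18.04-LTS image versions in the Marketplace.
az vm image list --publisher Canonical --offer UbuntuServer --sku 18.04-LTS --all --output table

# Deploy the older version that still ships kernel 5.0 with the hv_network_direct module.
# Resource group and VM name are placeholders.
az vm create \
  --resource-group my-rg \
  --name nc24r-test \
  --size Standard_NC24r \
  --image Canonical:UbuntuServer:18.04-LTS:18.04.202004080
```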
Works with the Ubuntu 18.04.202004080 image, which has the 5.0.0-1036-azure kernel.
The Ubuntu 18.04.202004080 image with the 5.0.0-1036-azure kernel is not working for me. It generates:
Not sure what I'm missing again, but using CentOS-HPC 7.8 Gen2 there's no Mellanox InfiniBand adapter available on a Standard NC24rs_v3:
The latest MOFED that supports CX3-Pro is MOFED 5.0. Please see the notes in this document, and use a different version of the CentOS HPC image.
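For anyone checking their own VM against that cutoff, a small sketch of how to see which MOFED release and which HCA generation are actually present (assumes MOFED is already installed, so `ofed_info` exists):

```bash
# Which MOFED release is installed (only present if MOFED was installed).
ofed_info -s

# Which Mellanox HCA generation the VM actually exposes.
lspci | grep -i mellanox
ibv_devinfo | grep -E 'hca_id|board_id|fw_ver'
```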
Well, that's just not really expected, as it's the "only" InfiniBand-enabled V100 GPU SKU. So I'd appreciate a proper OS compatibility matrix (CentOS 7.9, CentOS 8.2, 8.3).
@tbugfinder: Thanks for your feedback. Indeed, there are two issues.
Could we reopen this issue? I ran into similar issues on NC24r. With Ubuntu, the first problem I encountered was the one @abagshaw mentioned above. After downgrading the kernel, the RDMA NIC shows up, but besides the timeout, the link layer and rate are different from what the document says:

Then I tried to reimage to CentOS-HPC 7.4, and also downgraded that image to an older version. Any help? Thanks!
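For reference, the link layer and rate mentioned above are usually read like this (a sketch, assuming the RDMA userspace tools are installed):

```bash
# Port state, link layer and rate as reported by the IB management tools.
ibstat

# The same information via the verbs interface.
ibv_devinfo -v | grep -E 'state|link_layer|active_width|active_speed'
```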
I've used the scripts to install onto an `ubuntu18.04` image, but when launching an NC24r or NC24rs_v2 instance with the image I cannot see any Mellanox InfiniBand device in `lspci` (and RDMA tests like `ib_write_bw` fail to find the device). The Mellanox InfiniBand device shows up properly in `lspci` and works just fine on NC24rs_v3 using this same image. Is there something I need to do to get the InfiniBand devices working on NC24r and NC24rs_v2 instances with `ubuntu18.04`?

Thanks!
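For anyone reproducing the report, a minimal sketch of the checks being described; the server address in the `ib_write_bw` run is a placeholder:

```bash
# Is the Mellanox IB device exposed to the guest at all?
lspci | grep -i mellanox

# Basic RDMA bandwidth test between two VMs.
# On the first VM (server side):
ib_write_bw
# On the second VM (client side), pointing at the server:
ib_write_bw <server-ip-or-hostname>
```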