mnnvl guard#3013
Conversation
Greptile SummaryReplaces
Confidence Score: 5/5Safe to merge; the one-line guard change is correct and the fallback behavior is preserved on all header configurations. The fix correctly switches from a version macro that can originate from a different package than the NVML headers to a symbol that is defined directly in nvml.h, making the guard self-consistent. The false branch still returns false and the true branch is unchanged, so no existing behavior is altered on a standard single-version CUDA installation. The only open item is a stale debug string that was already flagged in a prior review pass. No files require special attention beyond the already-flagged debug message on line 99. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["has_mnnvl_fabric(device_id)"] --> B{"nvmlGpuFabricInfo_v2 defined?\n(compile-time check)"}
B -- "Not defined" --> C["Log debug message\nReturn false"]
B -- "Defined\n(NVML headers >= 12.4)" --> D{"cudart_version() >= 12040?\n(run-time check)"}
D -- "No" --> E["Log debug message\nReturn false"]
D -- "Yes" --> F{"fabric handle\nsupported on device?"}
F -- "No" --> G["Return false"]
F -- "Yes" --> H["Query nvmlGpuFabricInfoV_t\nvia nvmlDeviceGetGpuFabricInfoV"]
H --> I{"fabric state COMPLETED\n& non-zero clusterUuid?"}
I -- "No" --> J["Return false"]
I -- "Yes" --> K["Return true\n(MNNVL supported)"]
Reviews (4): Last reviewed commit: "reverting NVIDIA_TF32_OVERRIDE=0" | Re-trigger Greptile |
5f1a6ca to
5180ff4
Compare
|
Thank you for the PR. I am a little hesitant with the TF32 override though. Could you split this PR into 2 - we could merge the MNNVL guard right away and then look into the TF32 changes? |
|
/te-ci |
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
b435798 to
e07592a
Compare
I am trying to compile TE on the CINECA Leonardo cluster and encountered a compilation issue that required a small fix to work around. Since this may also affect other environments with mixed CUDA header/toolkit versions, I am submitting this PR in case it is useful more broadly.
I should also mention that some TE tests are currently failing in my environment. From what I can tell so far, these failures appear unrelated to this change and are more likely tied to the system configuration. I still need to investigate them further, but any feedback or insight on that side would be appreciated.
Thank you for your work on TE and your time with this PR!
Description
This fixes a build failure when compiling TE in environments where the CUDA headers provided by pip-installed nvidia-cuda-runtime-cu12 are newer than the system CUDA toolkit.
I encountered this on the CINECA Leonardo cluster (RHEL 8, A100 SXM4, driver 535, system CUDA 12.2, pip nvidia-cuda-runtime-cu12 12.6), but the issue is not cluster-specific and can occur on any system with mixed CUDA header versions.
The failure looks like:
Root cause
The build currently relies on CUDA_VERSION to determine whether the newer NVML fabric APIs are available:
#if CUDA_VERSION < 12040However, in mixed environments there are effectively two independent version sources:
If the pip CUDA headers are newer than the system toolkit:
which causes compilation to fail.
This can happen on HPC clusters, shared cloud nodes, or developer systems where Python CUDA packages are updated independently from the system CUDA installation.
Fix
Replace the version-based guard with a capability-based guard:
nvmlGpuFabricInfo_v2 is introduced in CUDA 12.4 and is the actual API feature required by the code below. Checking for the symbol directly avoids assuming that all CUDA-related headers come from the same toolkit version.
This makes the logic resilient to mismatched header installations while preserving existing behavior.
Behaviour after this change
build succeeds
This fallback is also semantically correct, since MNNVL support is only relevant on newer H100/GH200-class systems and should return false on A100-era systems regardless.
Testing
Verified on CINECA Leonardo:
Unrelated but I had to add
NVIDIA_TF32_OVERRIDE=0totest_numerics.pyotherwise I would get test failing for small numerical mismatch with layer norms. This has also been done fortest_mhc.py.