mnnvl guard #3013

Open
francesco-bertolotti wants to merge 3 commits into NVIDIA:main from francesco-bertolotti:f14-mnnvl-guard

Conversation

@francesco-bertolotti

I am trying to compile TE on the CINECA Leonardo cluster and encountered a compilation issue that required a small fix to work around. Since this may also affect other environments with mixed CUDA header/toolkit versions, I am submitting this PR in case it is useful more broadly.

I should also mention that some TE tests are currently failing in my environment. From what I can tell so far, these failures appear unrelated to this change and are more likely tied to the system configuration. I still need to investigate them further, but any feedback or insight on that side would be appreciated.

Thank you for your work on TE and your time with this PR!


Description

This fixes a build failure when compiling TE in environments where the CUDA headers provided by pip-installed nvidia-cuda-runtime-cu12 are newer than the system CUDA toolkit.

I encountered this on the CINECA Leonardo cluster (RHEL 8, A100 SXM4, driver 535, system CUDA 12.2, pip nvidia-cuda-runtime-cu12 12.6), but the issue is not cluster-specific and can occur on any system with mixed CUDA header versions.

The failure looks like:

transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp:123:5:
error: 'nvmlGpuFabricInfoV_t' was not declared in this scope
    nvmlGpuFabricInfoV_t fabricInfo = {};

Root cause

The build currently relies on CUDA_VERSION to determine whether the newer NVML fabric APIs are available:

#if CUDA_VERSION < 12040

However, in mixed environments there are effectively two independent version sources:

| Source | Controlled by | Example version |
| --- | --- | --- |
| CUDA_VERSION | pip nvidia-cuda-runtime-cu12 headers | ≥ 12.4 |
| nvml.h | system CUDA toolkit | 12.2 |

If the pip CUDA headers are newer than the system toolkit:

  • CUDA_VERSION reports ≥ 12.4
  • the #else branch is enabled
  • but the system nvml.h does not define:
    • nvmlGpuFabricInfoV_t
    • nvmlGpuFabricInfo_v2

which causes compilation to fail.

This can happen on HPC clusters, shared cloud nodes, or developer systems where Python CUDA packages are updated independently from the system CUDA installation.
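
To make the mismatch concrete, here is a minimal sketch that reproduces the two disagreeing signals without needing CUDA installed. The macros `DEMO_CUDA_VERSION` and `DEMO_NVML_FABRIC_INFO_V2` and the three helper functions are hypothetical stand-ins I made up for illustration; they are not part of TE, CUDA, or NVML.

```cpp
#include <cassert>

// Stand-in for CUDA_VERSION as reported by the pip-installed headers (12.6).
// Hypothetical name, used only for this illustration.
#define DEMO_CUDA_VERSION 12060

// Stand-in for nvmlGpuFabricInfo_v2. It is deliberately left undefined here,
// mimicking a system nvml.h that comes from a CUDA 12.2 toolkit.
// #define DEMO_NVML_FABRIC_INFO_V2 1

// What the version-based guard concludes.
constexpr bool version_guard_selects_new_path() {
#if DEMO_CUDA_VERSION < 12040
  return false;
#else
  return true;
#endif
}

// What the header that actually provides the symbol can offer.
constexpr bool fabric_symbol_available() {
#if defined(DEMO_NVML_FABRIC_INFO_V2)
  return true;
#else
  return false;
#endif
}

// True when the version guard would compile code that uses a symbol
// the NVML header does not provide, i.e. the failing configuration.
constexpr bool guard_mismatch() {
  return version_guard_selects_new_path() && !fabric_symbol_available();
}
```

With the values above, `guard_mismatch()` evaluates to true, which corresponds to the configuration that fails to compile on Leonardo.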

Fix

Replace the version-based guard with a capability-based guard:

- #if CUDA_VERSION < 12040
+ #if !defined(nvmlGpuFabricInfo_v2)

nvmlGpuFabricInfo_v2 is a preprocessor macro that the NVML headers define starting with CUDA 12.4, and it is the actual API feature required by the code below. Because it is a macro, defined() can test for it directly, which avoids assuming that all CUDA-related headers come from the same toolkit version.

This makes the logic resilient to mismatched header installations while preserving existing behavior.
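
For readers unfamiliar with the pattern, the following is a rough sketch of a capability-based guard around a version-tagged struct. Everything in it is a stand-in: `DemoFabricInfoV2`, `DEMO_FABRIC_INFO_V2`, and `demo_has_fabric` are hypothetical names, and the macro layout (struct size combined with a version number in the top byte) is my assumption about the general convention, not a copy of nvml.h.

```cpp
#include <cassert>

// Hypothetical versioned struct, standing in for the real NVML type.
struct DemoFabricInfoV2 {
  unsigned int version;
  unsigned char clusterUuid[16];
  unsigned char state;
};

// Hypothetical version tag: struct size in the low bits, version 2 in the top byte.
// Defining it simulates headers that ship the newer API.
#define DEMO_FABRIC_INFO_V2 (unsigned int)(sizeof(DemoFabricInfoV2) | (2U << 24U))

// Capability-based guard: the new path is compiled only if the macro exists.
bool demo_has_fabric(bool device_reports_fabric) {
#if !defined(DEMO_FABRIC_INFO_V2)
  (void)device_reports_fabric;
  return false;  // older headers: fall back, as the version-based guard did
#else
  DemoFabricInfoV2 info = {};
  info.version = DEMO_FABRIC_INFO_V2;
  // Real code would hand `info` to an NVML query here. This sketch takes the
  // device answer from the caller and only checks the version tag.
  return device_reports_fabric && (info.version >> 24U) == 2U;
#endif
}
```

Removing the `#define` line switches the function to the fallback branch without touching any version number, which is the property the PR relies on.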

Behaviour after this change

  • CUDA ≥ 12.4 toolkit:
    • nvmlGpuFabricInfo_v2 is defined
    • existing MNNVL detection path is compiled
  • CUDA < 12.4 toolkit:
    • nvmlGpuFabricInfo_v2 is not defined
    • function returns false as before
  • Mixed-header environments:
    • CUDA_VERSION may report ≥ 12.4
    • but nvmlGpuFabricInfo_v2 is absent
    • code correctly falls back to return false and the build succeeds
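
The three cases above can also be written down as a small model that compares the old and new guards. This is only an illustration of the reasoning: `HeaderEnv`, `Outcome`, `old_guard`, and `new_guard` are hypothetical names, and the "build error" outcome is modeled as a value rather than produced by the compiler.

```cpp
#include <cassert>

// Hypothetical description of a build environment.
struct HeaderEnv {
  int cuda_version;         // value CUDA_VERSION would report
  bool nvml_has_fabric_v2;  // whether nvml.h provides nvmlGpuFabricInfo_v2
};

enum class Outcome { NewPathCompiled, FallbackReturnsFalse, BuildError };

// Old guard: decides on CUDA_VERSION alone, so it can select code
// that the NVML header cannot support.
inline Outcome old_guard(const HeaderEnv &e) {
  if (e.cuda_version < 12040) return Outcome::FallbackReturnsFalse;
  return e.nvml_has_fabric_v2 ? Outcome::NewPathCompiled : Outcome::BuildError;
}

// New guard: decides on the presence of the symbol itself.
inline Outcome new_guard(const HeaderEnv &e) {
  return e.nvml_has_fabric_v2 ? Outcome::NewPathCompiled
                              : Outcome::FallbackReturnsFalse;
}

// The three configurations listed in the description.
const HeaderEnv kNewToolkit = {12040, true};
const HeaderEnv kOldToolkit = {12020, false};
const HeaderEnv kMixedHeaders = {12060, false};
```

In this model the two guards agree on the first two environments and differ only on `kMixedHeaders`, where the old guard yields `BuildError` and the new one yields `FallbackReturnsFalse`.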

This fallback is also semantically correct, since MNNVL support is only relevant on newer H100/GH200-class systems and should return false on A100-era systems regardless.

Testing

Verified on CINECA Leonardo:

  • A100 SXM4
  • driver 535
  • system CUDA 12.2
  • pip nvidia-cuda-runtime-cu12 12.6

Unrelated to this fix: I had to set NVIDIA_TF32_OVERRIDE=0 in test_numerics.py, because otherwise some tests fail with small numerical mismatches in the layer norms. The same override is already applied in test_mhc.py.

@greptile-apps
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

Replaces #if CUDA_VERSION < 12040 with #if !defined(nvmlGpuFabricInfo_v2) in has_mnnvl_fabric so that the MNNVL-detection path is gated on whether the symbol is actually present in the installed NVML headers rather than on a version macro that may come from a different (pip-installed) source.

  • The one-line guard change is logically correct: nvmlGpuFabricInfo_v2 is a macro introduced in the CUDA 12.4 NVML headers, so defined() reliably detects its availability at compile time and survives mixed-header environments.
  • A pre-existing debug message in the false branch ("since it was not built with CUDA version >= 12.4") no longer accurately describes the new guard condition; an earlier review comment flagged this and a corrected string was proposed there.

Confidence Score: 5/5

Safe to merge; the one-line guard change is correct and the fallback behavior is preserved on all header configurations.

The fix correctly switches from a version macro that can originate from a different package than the NVML headers to a symbol that is defined directly in nvml.h, making the guard self-consistent. The false branch still returns false and the true branch is unchanged, so no existing behavior is altered on a standard single-version CUDA installation. The only open item is a stale debug string that was already flagged in a prior review pass.

No files require special attention beyond the already-flagged debug message on line 99.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp | Replaces CUDA_VERSION-based guard with capability-based !defined(nvmlGpuFabricInfo_v2) guard in has_mnnvl_fabric; the debug message in the false branch still references the old CUDA-version framing (flagged in a prior review comment). |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["has_mnnvl_fabric(device_id)"] --> B{"nvmlGpuFabricInfo_v2 defined?\n(compile-time check)"}
    B -- "Not defined" --> C["Log debug message\nReturn false"]
    B -- "Defined\n(NVML headers >= 12.4)" --> D{"cudart_version() >= 12040?\n(run-time check)"}
    D -- "No" --> E["Log debug message\nReturn false"]
    D -- "Yes" --> F{"fabric handle\nsupported on device?"}
    F -- "No" --> G["Return false"]
    F -- "Yes" --> H["Query nvmlGpuFabricInfoV_t\nvia nvmlDeviceGetGpuFabricInfoV"]
    H --> I{"fabric state COMPLETED\n& non-zero clusterUuid?"}
    I -- "No" --> J["Return false"]
    I -- "Yes" --> K["Return true\n(MNNVL supported)"]
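
The decision chain in the flowchart can be approximated as a plain function over its inputs. This is a sketch of the control flow only, not the code in userbuffers-host.cpp: `FabricProbe` and `has_mnnvl_fabric_model` are hypothetical names, the compile-time header check is represented as an ordinary boolean, and no NVML calls are made.

```cpp
#include <cassert>

// Hypothetical bundle of the facts the flowchart branches on.
struct FabricProbe {
  bool header_has_fabric_v2;     // compile-time check in the real code
  int cudart_version;            // run-time CUDA runtime version
  bool fabric_handle_supported;  // device attribute query
  bool fabric_state_completed;   // fabric registration finished
  bool cluster_uuid_nonzero;     // device belongs to a fabric cluster
};

// Mirrors the flowchart top to bottom: every failed check returns false,
// and only a device that passes all of them reports MNNVL support.
inline bool has_mnnvl_fabric_model(const FabricProbe &p) {
  if (!p.header_has_fabric_v2) return false;
  if (p.cudart_version < 12040) return false;
  if (!p.fabric_handle_supported) return false;
  return p.fabric_state_completed && p.cluster_uuid_nonzero;
}

// Example inputs: a mixed-header A100 build and a fully capable system.
const FabricProbe kMixedHeaderA100 = {false, 12060, false, false, false};
const FabricProbe kFabricSystem = {true, 12040, true, true, true};
```

Under this model the mixed-header A100 case returns false at the first check, and the fully capable case returns true.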

Reviews (4): Last reviewed commit: "reverting NVIDIA_TF32_OVERRIDE=0"

@ptrendx added the community-contribution label May 19, 2026
@ptrendx
Member

ptrendx commented May 19, 2026

Thank you for the PR. I am a little hesitant with the TF32 override though. Could you split this PR into 2 - we could merge the MNNVL guard right away and then look into the TF32 changes?

@francesco-bertolotti
Author

Hi @ptrendx, thank you for the quick reply!

I have reverted the test modifications and moved them to a separate PR, #3014.

@ptrendx
Member

ptrendx commented May 20, 2026

/te-ci

Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>