Build fixes #1640

Merged
ktangsali merged 3 commits into NVIDIA:2.1.0-rc from ktangsali:build-fixes
May 13, 2026
Conversation

@ktangsali
Collaborator

@ktangsali ktangsali commented May 12, 2026

PhysicsNeMo Pull Request

Description

This PR includes fixes for two issues found in the latest builds:

  1. Adds a small compat shim (physicsnemo/nn/module/_nvfuser_compat.py) so PhysicsNeMo's fused SiLU path works with both the legacy nvfuser package and the newer nvfuser_direct package (the one in the 26.04 PyTorch container). It also makes the import resilient: orphan .dist-info directories or partial installs now fall back to non-fused SiLU instead of crashing every GNN model on import. fused_silu.py and gnn_layers/mesh_graph_mlp.py now import from the shim; behavior is unchanged where legacy nvfuser already works.

  2. natten dispatches na{1,2,3}d through torch.nn.attention.flex_attention, which raises NotImplementedError on CPU when any of q/k/v has requires_grad=True. The test_backward cases in test/nn/functional/test_natten.py therefore fail under the shared device=["cpu", "cuda:0"] fixture. Adds a small _skip_if_cpu_backward(device) helper and calls it at the top of the three backward tests so the CPU rows skip with an accurate reason while CUDA coverage is unchanged. No production code touched.

Full build logs: https://gitlab-master.nvidia.com/modulus/modulus-release-build-guide/-/jobs/316606693

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ktangsali ktangsali requested a review from pzharrington May 12, 2026 23:50
@pzharrington
Collaborator

@ktangsali the natten issue should be taken care of by #1634

@greptile-apps
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR contains two targeted build fixes: an nvfuser compatibility shim (_nvfuser_compat.py) that makes the fused SiLU path work with both the legacy nvfuser and newer nvfuser_direct packages (and gracefully falls back when neither is importable), and a pytest helper that skips natten backward tests on CPU where flex_attention raises NotImplementedError.

  • nvfuser shim (_nvfuser_compat.py): Adds find_spec-guarded import logic, reimplements compute_contiguity, and provides unified FusionDefinition/DataType exports; fused_silu.py and mesh_graph_mlp.py are updated to import from the shim and replace define_constant calls with define_scalar.
  • natten test fix (test_natten.py): Adds _skip_if_cpu_backward(device) called at the top of the three backward test methods; CUDA coverage is unchanged.
  • Version bump: physicsnemo/__init__.py and CHANGELOG.md updated from 2.1.0a0 to 2.1.0.
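
The find_spec-guarded import described in the first bullet can be sketched as follows. This is a minimal illustration, not the contents of _nvfuser_compat.py: the function name `import_first_available` and the order in which the two backends are tried are assumptions.

```python
import importlib
import importlib.util
import logging

logger = logging.getLogger(__name__)


def import_first_available(candidates):
    """Return (module, name) for the first importable candidate, else (None, None)."""
    for name in candidates:
        # find_spec returns None when a top-level package cannot be found,
        # so a missing backend is skipped without raising.
        if importlib.util.find_spec(name) is None:
            continue
        try:
            return importlib.import_module(name), name
        except ImportError as e:
            # The package is on sys.path but unusable; try the next one.
            logger.warning(
                "Found %s on sys.path but failed to import (%s); "
                "trying next backend.",
                name,
                e,
            )
    return None, None


# Assumed ordering: newer backend first, legacy backend second.
_backend, _backend_name = import_first_available(["nvfuser_direct", "nvfuser"])
NV_FUSER_AVAILABLE = _backend is not None
```

Callers can then branch on a single flag such as `NV_FUSER_AVAILABLE` instead of repeating try/except import blocks in every module.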

Important Files Changed

| Filename | Overview |
| --- | --- |
| physicsnemo/nn/module/_nvfuser_compat.py | New compat shim that unifies legacy nvfuser and nvfuser_direct behind one import surface; module docstring incorrectly claims define_constant is reimplemented. |
| physicsnemo/nn/module/fused_silu.py | Migrates nvfuser imports to compat shim; replaces define_constant with define_scalar; clean refactor with no logic changes. |
| physicsnemo/nn/module/gnn_layers/mesh_graph_mlp.py | Removes inline nvfuser import boilerplate in favour of compat shim; no logic changes. |
| test/nn/functional/test_natten.py | Adds _skip_if_cpu_backward helper; correctly skips 1D/2D/3D backward tests on CPU where FlexAttention raises NotImplementedError. |
| physicsnemo/__init__.py | Version bump from 2.1.0a0 to 2.1.0. |
| CHANGELOG.md | Version and date stamped to 2.1.0 / 2026-05-26; empty Removed and Security sections cleaned up. |

Reviews (1): Last reviewed commit: "fix natten tests"

Comment on lines +17 to +25
"""Compatibility shim for the legacy ``nvfuser`` package and the newer
``nvfuser_direct`` package.

The nvFuser Python frontend is split into two distributions: the legacy
``nvfuser`` package (older PyTorch containers) and ``nvfuser_direct`` (newer
containers). This module hides the difference behind a single import surface
and reimplements the two helpers that exist only in the legacy package
(``compute_contiguity`` and ``define_constant``) so the rest of PhysicsNeMo
can target either backend without conditionals.
Contributor


P2 Inaccurate module docstring for define_constant

The docstring states this shim "reimplements the two helpers that exist only in the legacy package (compute_contiguity and define_constant)," but define_constant is never reimplemented here. The callers in fused_silu.py are instead updated to call fd.define_scalar() directly. Readers looking for a define_constant wrapper will be misled; the docstring should describe the actual migration strategy used.

Comment on lines +53 to +61
        try:
            return importlib.import_module(name), name
        except ImportError as e:
            logger.warning(
                "Found %s on sys.path but failed to import (%s); "
                "trying next backend.",
                name,
                e,
            )
Contributor


P2 Narrow exception catch may miss non-ImportError failures

The except ImportError block covers the orphan-dist-info scenario described in the PR, but some container environments raise RuntimeError (or OSError) when a CUDA-compiled extension fails to load due to a driver/library mismatch. Those exceptions would propagate uncaught and crash every downstream import (fused_silu, mesh_graph_mlp), defeating the resilience goal. Consider broadening the catch to except (ImportError, RuntimeError) and logging a warning, since the fallback behaviour (NV_FUSER_AVAILABLE = False) is safe.
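
A minimal sketch of the broadened catch suggested above, assuming a hypothetical helper `try_import_backend` and a throwaway module `broken_backend_demo` that stands in for a CUDA extension failing at load time. Neither name appears in the PR, and including OSError alongside RuntimeError follows the review comment's parenthetical rather than anything confirmed in the code.

```python
import importlib
import logging
import sys
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)

# ImportError is what the PR already catches; RuntimeError and OSError are
# the additions suggested in the review comment.
_IMPORT_FAILURES = (ImportError, RuntimeError, OSError)


def try_import_backend(name):
    """Return the imported module, or None if loading it fails."""
    try:
        return importlib.import_module(name)
    except _IMPORT_FAILURES as e:
        logger.warning("Failed to import %s (%s); falling back.", name, e)
        return None


# Demonstration: a hypothetical module that raises RuntimeError at import
# time, the way a mismatched driver or library might.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "broken_backend_demo.py").write_text(
        'raise RuntimeError("driver mismatch")\n'
    )
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        backend = try_import_backend("broken_backend_demo")
    finally:
        sys.path.remove(tmp)

# The failure is logged and the flag ends up False instead of the import crashing.
NV_FUSER_AVAILABLE = backend is not None
```

The trade-off is that a broader catch can also hide genuine bugs inside the backend package, which is why the failure is logged as a warning rather than silently ignored.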

@ktangsali ktangsali changed the base branch from main to 2.1.0-rc May 12, 2026 23:53
…dError

Cherry-picked test/nn/functional/test_natten.py from upstream commit
7f2451a ("Ci deps group (NVIDIA#1634)"). The previous device == "cpu"
early-skip was too broad; this wraps the forward call and only skips on
the specific NotImplementedError raised by FlexAttention's CPU-backward
guard. If natten picks a different backend (or FlexAttention ever
supports CPU backward), the test will run.
@ktangsali
Collaborator Author

/blossom-ci

Collaborator

@mnabian mnabian left a comment


LGTM

@ktangsali ktangsali merged commit 23b6848 into NVIDIA:2.1.0-rc May 13, 2026
1 check passed
@ktangsali ktangsali deleted the build-fixes branch May 13, 2026 05:21
ktangsali added a commit that referenced this pull request May 27, 2026
* add fixes for the nvfuser bug

* test(natten): narrow CPU-backward skip to FlexAttention NotImplementedError

Cherry-picked test/nn/functional/test_natten.py from upstream commit
7f2451a ("Ci deps group (#1634)"). The previous device == "cpu"
early-skip was too broad; this wraps the forward call and only skips on
the specific NotImplementedError raised by FlexAttention's CPU-backward
guard. If natten picks a different backend (or FlexAttention ever
supports CPU backward), the test will run.

* black formatting

---------

Co-authored-by: Corey adams <6619961+coreyjadams@users.noreply.github.com>
4 participants