Build fixes#1640
Conversation
|
@ktangsali the natten issue should be taken care of by #1634 |
Greptile SummaryThis PR contains two targeted build fixes: an nvfuser compatibility shim (
Important Files Changed
Reviews (1): Last reviewed commit: "fix natten tests" | Re-trigger Greptile |
| """Compatibility shim for the legacy ``nvfuser`` package and the newer | ||
| ``nvfuser_direct`` package. | ||
|
|
||
| The nvFuser Python frontend is split into two distributions: the legacy | ||
| ``nvfuser`` package (older PyTorch containers) and ``nvfuser_direct`` (newer | ||
| containers). This module hides the difference behind a single import surface | ||
| and reimplements the two helpers that exist only in the legacy package | ||
| (``compute_contiguity`` and ``define_constant``) so the rest of PhysicsNeMo | ||
| can target either backend without conditionals. |
There was a problem hiding this comment.
Inaccurate module docstring for
define_constant
The docstring states this shim "reimplements the two helpers that exist only in the legacy package (compute_contiguity and define_constant)," but define_constant is never reimplemented here. The callers in fused_silu.py are instead updated to call fd.define_scalar() directly. Readers looking for a define_constant wrapper will be misled; the docstring should describe the actual migration strategy used.
| try: | ||
| return importlib.import_module(name), name | ||
| except ImportError as e: | ||
| logger.warning( | ||
| "Found %s on sys.path but failed to import (%s); " | ||
| "trying next backend.", | ||
| name, | ||
| e, | ||
| ) |
There was a problem hiding this comment.
Narrow exception catch may miss non-ImportError failures
The except ImportError block covers the orphan-dist-info scenario described in the PR, but some container environments raise RuntimeError (or OSError) when a CUDA-compiled extension fails to load due to a driver/library mismatch. Those exceptions would propagate uncaught and crash every downstream import (fused_silu, mesh_graph_mlp), defeating the resilience goal. Consider broadening the catch to except (ImportError, RuntimeError) and logging a warning, since the fallback behaviour (NV_FUSER_AVAILABLE = False) is safe.
…dError Cherry-picked test/nn/functional/test_natten.py from upstream commit 7f2451a ("Ci deps group (NVIDIA#1634)"). The previous device == "cpu" early-skip was too broad; this wraps the forward call and only skips on the specific NotImplementedError raised by FlexAttention's CPU-backward guard. If natten picks a different backend (or FlexAttention ever supports CPU backward), the test will run.
|
/blossom-ci |
* add fixes for the nvfuser bug * test(natten): narrow CPU-backward skip to FlexAttention NotImplementedError Cherry-picked test/nn/functional/test_natten.py from upstream commit 7f2451a ("Ci deps group (#1634)"). The previous device == "cpu" early-skip was too broad; this wraps the forward call and only skips on the specific NotImplementedError raised by FlexAttention's CPU-backward guard. If natten picks a different backend (or FlexAttention ever supports CPU backward), the test will run. * black formatting --------- Co-authored-by: Corey adams <6619961+coreyjadams@users.noreply.github.com>
* add fixes for the nvfuser bug * test(natten): narrow CPU-backward skip to FlexAttention NotImplementedError Cherry-picked test/nn/functional/test_natten.py from upstream commit 7f2451a ("Ci deps group (#1634)"). The previous device == "cpu" early-skip was too broad; this wraps the forward call and only skips on the specific NotImplementedError raised by FlexAttention's CPU-backward guard. If natten picks a different backend (or FlexAttention ever supports CPU backward), the test will run. * black formatting --------- Co-authored-by: Corey adams <6619961+coreyjadams@users.noreply.github.com>
PhysicsNeMo Pull Request
Description
This PR includes fixes for two issues found in the latest builds:
Adds a small compat shim (physicsnemo/nn/module/_nvfuser_compat.py) so PhysicsNeMo's fused SiLU path works with both legacy nvfuser and the newer nvfuser_direct package (one in 26.04 PyTorch container). Also makes the import resilient - orphan .dist-info or partial installs now fall back to non-fused SiLU instead of crashing every GNN model on import. fused_silu.py and gnn_layers/mesh_graph_mlp.py now import from the shim; behavior is unchanged where legacy nvfuser already works.
natten dispatches na{1,2,3}d through torch.nn.attention.flex_attention, which raises NotImplementedError on CPU when any of q/k/v has requires_grad=True. The test_backward cases in test/nn/functional/test_natten.py therefore fail under the shared device=["cpu", "cuda:0"] fixture. Adds a small _skip_if_cpu_backward(device) helper and calls it at the top of the three backward tests so the CPU rows skip with an accurate reason while CUDA coverage is unchanged. No production code touched.
Full build logs: https://gitlab-master.nvidia.com/modulus/modulus-release-build-guide/-/jobs/316606693
Checklist
Dependencies
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.