Checkpoint Refactor#65
Conversation
|
This should be ready for review now. I had to mess around with my fork to convert it into a public PR, so hopefully nothing weird happened. |
|
Hey @akshaysubr, @NickGeneva and I talked, and I will refactor some of the registry and entrypoints code to be more in line with fsspec. You can look at other parts of the code, but I wouldn't review too deeply until I make these changes. |
|
Linking the Modulus launch PR that should be merged at the same time: https://github.com/NVIDIA/modulus-launch/pull/42 |
|
Hey @akshaysubr and @NickGeneva, this is ready for another round of revisions. I think everything here is in a pretty good place, but there is one key design decision that I will explain below.

For the entrypoint exposure I followed the example here to the letter: https://amir.rachum.com/python-entry-points/. In particular, I follow their final revision and expose all our models through entrypoints, treating internal and external models the same. You can see where I get the list of models here: https://github.com/loliverhennigh/modulus/blob/fea-checkpoint_refactor/modulus/models/registry.py#L20.

This goes somewhat counter to the way fsspec does things. Instead of exposing their protocols through entrypoints and getting them from there, they manually write them in a dict here: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/registry.py#L62C2-L62C2. Functionally this leads to the same design and interface for users, and they have good documentation on how to add protocols through entrypoints. The advantage of their approach is that they can easily add protocols from outside packages, and you can see a few examples of this in the dict. If we did the same thing we could, for example, bring in Hugging Face models. I am open to setting all this up following the fsspec example, but at this point I don't think there is any benefit to doing so. I would say we go with the design in this PR, and if we want we can switch to the fsspec setup at any time. That should be easy given how I structured the code and would basically just mean adding a dict.

Also, to make it a bit more concrete how a user can expose a model to Modulus via entrypoints, I included a little example here: https://github.com/loliverhennigh/modulus/blob/fea-checkpoint_refactor/examples/ExternalPackage/example/load_model.py. I'll delete this afterwards, but it was really helpful for wrapping my head around this stuff. |
akshaysubr
left a comment
@loliverhennigh This is great work! Left some comments about suggested changes. Would also be good to change the PR title and description to something more verbose and descriptive.
I think it is good to keep this example. We just need to make sure it gets run every so often so that it doesn't go stale. |
|
Hey @akshaysubr and @NickGeneva, ready for another round of reviews! I think I addressed everything. Could I get another look when you have time? Thanks! |
|
@loliverhennigh I tested this now and it works great! Really liking how this turned out! |
|
/blossom-ci |
|
* fixed grid effect
* blew up commit history
* fixed commit
* added modulus model version and checkpoint
* refactored model registry
* example registry
* updated model registry
* modified activations
* save git hash changed to verbose
* Fixed most issues
* ~90% functionality implemented
* added docstring about json input
* added map location here
* updated init method
* black formatted
* removed example external package
* Update test_from_torch.py
* from torch model black

---------

Co-authored-by: oliver <ohennigh@nvidia.com>
Addresses Peter Sharpe's CHANGES_REQUESTED review on PR NVIDIA#1576 in full, and subsumes the 3D-UFNO portion of the planned xFNO PR into this one. Net: -139 lines despite gaining trunkless mode, time-axis-extend, multi-channel output, coord features, multi-layer lift, and 2D/3D genericity.

Theme 1 — Dimensional unification (Peter NVIDIA#37, NVIDIA#47, NVIDIA#52, NVIDIA#53):

- ``DeepONet`` (formerly DeepONet + DeepONet3D) takes ``dimension: int`` (2 or 3) and dispatches via ``_DIM_DEFAULTS`` and per-dim conv/spectral primitives, mirroring the ``FNO`` pattern.
- ``SpatialBranch`` (formerly SpatialBranch + SpatialBranch3D) takes ``dimension`` and uses an ``_DIM_LAYERS`` lookup for ``SpectralConv``/``Conv``/``BatchNorm``/``AdaptiveAvgPool``/``UNetAdapter`` and the permute helpers.
- ``Conv{2,3}dFCLayer`` is selected via a one-line lookup (Peter NVIDIA#45).

Theme 2 — Wrappers folded into DeepONet (Peter NVIDIA#54, NVIDIA#64, NVIDIA#65):

- ``wrappers.py`` deleted (``DeepONetWrapper`` and ``DeepONet3DWrapper`` removed). Padding behaviour is now a constructor flag, ``auto_pad: bool = False``, and the model dispatches to ``_forward_packed`` / ``_forward_packed_trunkless`` accordingly.
- The 6-cell call matrix (trunked/trunkless × packed/core × spatial/mlp) is documented in the class docstring.
- The previous private ``_temporal_projection`` attribute is exposed as a public ``has_temporal_projection`` property (Peter NVIDIA#55).

Theme 3 — Deduplication (Peter NVIDIA#43, NVIDIA#44, NVIDIA#50, NVIDIA#51, NVIDIA#40, related Greptile):

- ``TrunkNet`` and ``MLPBranch`` deleted — both duplicated ``physicsnemo.models.mlp.FullyConnected``; users now pass any ``nn.Module`` for the trunk / branches (DI-first API).
- ``_SinActivation`` deleted; the activation is registered as ``"sin"`` in ``physicsnemo.nn.module.activations.ACT2FN`` (previous commit). All ``if activation_fn.lower() == "sin"`` special-cases removed.
- ``DeepONet.from_config`` and the dict-config schema removed entirely; Hydra-style ``_target_`` instantiation supersedes it.
- ``count_params`` collapsed from 4 duplicate copies to 1.

Theme 4 — xFNO fold-in:

- ``trunk: nn.Module | None = None`` enables trunkless mode (the 3D-UFNO use case from the planned xFNO PR).
- ``out_channels: int = 1`` adds multi-channel output to every path.
- ``time_modes: int | None`` enables xFNO-style time-axis-extend in trunkless packed mode: replicate-pads the last spatial axis to fit ``2 * time_modes`` and crops to the requested ``target_times``.
- ``coord_features`` and ``lift_layers``/``lift_hidden_width`` parameters on ``SpatialBranch`` replace the deleted dict-driven "conv encoder" option.

Theme 5 — Housekeeping (Peter NVIDIA#33, NVIDIA#34, NVIDIA#38, NVIDIA#41, NVIDIA#48, NVIDIA#57, NVIDIA#58, NVIDIA#59, Charlelie NVIDIA#26, Greptile NVIDIA#5, NVIDIA#6):

- ``padding.py`` renamed to private ``_padding.py``; all functions carry ``jaxtyping.Shaped`` annotations.
- All public forward methods carry ``jaxtyping.Float`` annotations and ``torch.compiler.is_compiling`` shape-validation guards.
- ``Literal`` type aliases for ``decoder_type`` and other enums; case-insensitive validation against ``get_args`` (Greptile NVIDIA#15).
- Modern type hints throughout (``dict[str, Any] | None``, no ``Dict``/``Optional``).
- All public docstrings use ``r"""`` raw-string prefix, LaTeX math for tensor shapes, double backticks for inline code, and Examples sections.
- ``Notes`` block in ``branches.py`` documents the ``num_unet_layers`` 8x memory/compute penalty (Peter NVIDIA#49).

Theme 6 — Tests (Peter NVIDIA#60, NVIDIA#61, NVIDIA#62, NVIDIA#63, Charlelie NVIDIA#29):

- ``_FIXTURE_REGISTRY`` drives all non-regression tests across 9 scenarios: u_deeponet 2D/3D, fourier_deeponet, mionet, temporal_projection, multi-channel packed 2D, xfno trunkless 3D (with and without time-axis-extend), and core 2D MLP-branch.
- New 3D gradient-flow test and 3D ``torch.compile`` test.
- ``fullgraph=True`` probe tests for 2D and 3D marked ``@pytest.mark.xfail(strict=False)`` to empirically answer Peter NVIDIA#63.
- ``_load_golden`` uses ``pytest.skip`` for missing fixtures so CI passes pending cluster-side golden regeneration.
- Test class structure mirrors MOD-008a/b/c: ``TestDeepONetConstructor``, ``TestDeepONetNonRegression``, ``TestDeepONetCheckpoint``, ``TestDeepONetGradientFlow``, ``TestDeepONetCompile``, ``TestDeepONetTimeAxisExtend``.

CHANGELOG bullet rewritten to describe the actual shipped API (was stale, still described the old config-driven 8-variant model).
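The per-dimension lookup pattern from Theme 1 can be sketched as below. The ``_DIM_LAYERS`` table, class name, and constructor signature here are illustrative assumptions for the example, not the shipped physicsnemo code.

```python
# Minimal sketch of the dimension-dispatch pattern: one class covers 2D and
# 3D by selecting conv/norm/pool primitives from a lookup table, instead of
# maintaining parallel 2D and 3D class hierarchies.
# The names below are hypothetical, not the real physicsnemo implementation.
import torch
import torch.nn as nn

_DIM_LAYERS = {
    2: {"conv": nn.Conv2d, "norm": nn.BatchNorm2d, "pool": nn.AdaptiveAvgPool2d},
    3: {"conv": nn.Conv3d, "norm": nn.BatchNorm3d, "pool": nn.AdaptiveAvgPool3d},
}


class SpatialBranchSketch(nn.Module):
    """One branch class covering 2D and 3D via a dimension lookup."""

    def __init__(self, dimension: int, in_channels: int, out_channels: int):
        super().__init__()
        if dimension not in _DIM_LAYERS:
            raise ValueError(f"dimension must be 2 or 3, got {dimension}")
        layers = _DIM_LAYERS[dimension]
        self.conv = layers["conv"](in_channels, out_channels, kernel_size=3, padding=1)
        self.norm = layers["norm"](out_channels)
        self.pool = layers["pool"](1)  # global average pool to a single cell

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C_in, *spatial) -> (B, C_out): conv, normalize, pool, flatten.
        return self.pool(self.norm(self.conv(x))).flatten(1)
```

The benefit is that every fix or feature (e.g. the ``auto_pad`` flag or multi-channel output) lands once instead of being mirrored across 2D and 3D copies.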
Description
This feature adds better saving and loading of models by storing constructor arguments in a JSON file. This allows loading a model without needing to keep track of its parameters separately. Along with this merge there are several related side features: ``from_torch`` lets users bring in models from torch seamlessly, assuming the models have JSON-serializable inputs. We have also exposed the models through entrypoints, following discussions about expanding the usability of Modulus.
Here is a list of high-level features: