Fix SongUNet with ShardTensor when using zero embedding#1432

Merged
pzharrington merged 12 commits into NVIDIA:2.0.0-rc from jleinonen:unet-shardtensor
Feb 19, 2026
Conversation

@jleinonen
Collaborator

@jleinonen jleinonen commented Feb 19, 2026

PhysicsNeMo Pull Request

When training regression models (i.e., with no time step embedding, embedding_type == "zero") using ShardTensor, SongUNet raised this error:

  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/models/diffusion_unets/song_unet.py", line 663, in forward
    x = block(x, emb)
        ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/nn/module/unet_layers.py", line 242, in forward
    params = self.affine(emb).unsqueeze(2).unsqueeze(3)
             ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/physicsnemo/nn/module/fully_connected_layers.py", line 527, in forward
    x = x @ weight.t()

caused by the emb tensor being a plain torch.Tensor:

else:
    emb = torch.zeros(
        (noise_labels.shape[0], self.emb_channels),
        device=x.device,
        dtype=x.dtype,
    )

I added a conversion of emb to a ShardTensor when x is a ShardTensor. With this fix, it is possible to train StormCastUNet-style regression models with ShardTensor.
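As a rough sketch of the pattern (using stand-in classes, not the actual PhysicsNeMo code — the real ShardTensor lives in physicsnemo.domain_parallel.shard_tensor and carries a device mesh), the conversion looks like:

```python
# Stand-in for torch.Tensor, for illustration only.
class Tensor:
    def __init__(self, data):
        self.data = data


# Stand-in for PhysicsNeMo's ShardTensor: a tensor plus a device mesh.
class ShardTensor(Tensor):
    def __init__(self, data, mesh):
        super().__init__(data)
        self.mesh = mesh

    @classmethod
    def from_local(cls, local, device_mesh):
        # Wrap a plain tensor so it participates in sharded ops.
        return cls(local.data, device_mesh)


def make_zero_embedding(x, emb_channels):
    # embedding_type == "zero": build a plain zero embedding...
    emb = Tensor([0.0] * emb_channels)
    # ...and, as in the fix, promote it to a ShardTensor whenever the
    # network input is sharded, so the later `x @ weight.t()` inside
    # the Linear layers sees matching tensor types.
    if isinstance(x, ShardTensor):
        emb = ShardTensor.from_local(emb, device_mesh=x.mesh)
    return emb
```

The key point is that the branch keys off the type of x: plain inputs keep a plain zero embedding, while sharded inputs get an embedding wrapped on the same device mesh.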

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@jleinonen jleinonen self-assigned this Feb 19, 2026
@greptile-apps
Contributor

greptile-apps bot commented Feb 19, 2026

Greptile Summary

This PR fixes a tensor type mismatch error that occurred when training regression models with SongUNet using embedding_type == "zero" and ShardTensor. The error was caused by the emb tensor being a plain torch.Tensor while the input x was a ShardTensor, causing incompatibility in the matrix multiplication operation in the Linear layer (x @ weight.t()).

The fix adds a conditional check: when x is a ShardTensor, the emb tensor (created as zeros) is converted to a ShardTensor using ShardTensor.from_local() with the same device mesh as x. This ensures type consistency between tensors in subsequent operations.

  • Adds ShardTensor import from physicsnemo.domain_parallel.shard_tensor
  • Adds conversion logic after creating zero embedding tensor (lines 634-635)
  • The fix is minimal and targeted, only affecting the zero embedding path
  • The non-zero embedding path doesn't need this fix since the mapping layers (map_layer0, map_layer1) are Linear modules that preserve the tensor type through operations

Important Files Changed

Filename Overview
physicsnemo/models/diffusion_unets/song_unet.py Adds ShardTensor conversion for zero embedding case to fix tensor type mismatch error

Last reviewed commit: 8db65bf


@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

@pzharrington
Copy link
Collaborator

/blossom-ci

@pzharrington pzharrington changed the base branch from main to 2.0.0-rc February 19, 2026 19:46
@pzharrington pzharrington merged commit c3a8248 into NVIDIA:2.0.0-rc Feb 19, 2026
4 checks passed
ktangsali pushed a commit that referenced this pull request Feb 25, 2026
* Bug fixes for ShardTensor+SongUNet

* Handle dtensor spec in sharded view

* Fix SongUNet with ShardTensor when using zero embedding

* Use buffer for zero embed

---------

Co-authored-by: Peter Harrington <48932392+pzharrington@users.noreply.github.com>
Co-authored-by: Peter Harrington <pharrington@nvidia.com>
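One of the commits above is "Use buffer for zero embed". A hedged, torch-free sketch of that idea (stand-in class, illustrative names only — the real code would register the zeros as a module buffer so they follow the model's device and dtype) is:

```python
# Illustrative sketch of caching the zero embedding instead of
# allocating a fresh zeros tensor on every forward pass.
class ZeroEmbedding:
    def __init__(self, emb_channels):
        self.emb_channels = emb_channels
        self._zero = None  # lazily-built cached zeros

    def forward(self, batch_size):
        # Reuse the cached zeros when the batch size matches;
        # rebuild only when it changes.
        if self._zero is None or len(self._zero) != batch_size:
            self._zero = [
                [0.0] * self.emb_channels for _ in range(batch_size)
            ]
        return self._zero
```

In the real model a registered buffer additionally avoids the device/dtype bookkeeping, since buffers move with the module.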
