
Adds fix for sporadic CI bug in Barnes-Hut test#1580

Merged
peterdsharpe merged 7 commits into NVIDIA:main from peterdsharpe:psharpe/globe_bh_test_hotfix
Apr 23, 2026

Conversation

@peterdsharpe
Collaborator

@peterdsharpe peterdsharpe commented Apr 22, 2026

PhysicsNeMo Pull Request

Description

In CI, seeing a sporadic bug reported by @ktangsali (example failing run):

CI trace report
______________________ test_bh_nested_source_data_keys[2] ______________________

n_dims = 2

    @dims_params
    def test_bh_nested_source_data_keys(n_dims: int):
        """Convergence with nested TensorDict keys matching GLOBE's production format.
    
        GLOBE passes source_data structured like:
            {"physical": {"velocity": ...}, "latent": {"scalars": {"0": ...},
             "vectors": {"0": ...}}, "normals": ...}
    
        The aggregation, split_by_leaf_rank, and TensorDict.cat operations must
        handle this nesting correctly.
        """
        torch.manual_seed(DEFAULT_SEED)
        n_src, n_tgt = 30, 15
    
        source_data_ranks = {
            "physical": {"pressure": 0},
            "latent": {"scalars": {"0": 0, "1": 0}, "vectors": {"0": 1}},
            "normals": 1,
        }
        output_field_ranks = {"p": 0, "u": 1}
    
        common_kwargs = dict(
            n_spatial_dims=n_dims,
            output_field_ranks={
                k: (0 if v == "scalar" else 1) for k, v in output_field_ranks.items()
            },
            source_data_ranks=source_data_ranks,
            hidden_layer_sizes=[16],
        )
    
        bh_kernel = BarnesHutKernel(**common_kwargs, leaf_size=DEFAULT_LEAF_SIZE)
        exact_kernel = Kernel(**common_kwargs)
        exact_kernel.load_state_dict(bh_kernel.state_dict(), strict=False)
        bh_kernel.eval()
        exact_kernel.eval()
    
        torch.manual_seed(DEFAULT_SEED + 1)
        source_data = TensorDict(
            {
                "physical": TensorDict(
                    {"pressure": torch.randn(n_src)},
                    batch_size=[n_src],
                ),
                "latent": TensorDict(
                    {
                        "scalars": TensorDict(
                            {"0": torch.randn(n_src), "1": torch.randn(n_src)},
                            batch_size=[n_src],
                        ),
                        "vectors": TensorDict(
                            {"0": F.normalize(torch.randn(n_src, n_dims), dim=-1)},
                            batch_size=[n_src],
                        ),
                    },
                    batch_size=[n_src],
                ),
                "normals": F.normalize(torch.randn(n_src, n_dims), dim=-1),
            },
            batch_size=[n_src],
        )
    
        data = {
            "source_points": torch.randn(n_src, n_dims),
            "target_points": torch.randn(n_tgt, n_dims) * 5,
            "source_strengths": torch.rand(n_src) + 0.1,
            "reference_length": torch.ones(()),
            "source_data": source_data,
            "global_data": TensorDict({}, batch_size=[]),
        }
    
        exact_result = exact_kernel(**data)
        bh_result = bh_kernel(**data, theta=0.01)
    
        for field_name in output_field_ranks:
>           torch.testing.assert_close(
                bh_result[field_name],
                exact_result[field_name],
                atol=1e-3,
                rtol=1e-2,
                msg=f"Nested keys: {field_name!r} not close to exact at theta=0.01",
            )
E           AssertionError: Nested keys: 'p' not close to exact at theta=0.01

test/models/globe/test_barnes_hut_kernel.py:960: AssertionError

This was possibly introduced by #1494, which recently modified these files. However, CI passed on that PR, and this bug seems to occur only sporadically.

On local machines, these tests seem to pass consistently.

For now, this PR adds diagnostic code to try to reproduce the failure on CI and gather more debugging details.
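A minimal sketch of the kind of diagnostic pattern later described in the review summary (monkey-patch an internal method to capture pre-aggregation intermediates); the class and method names here are placeholders, not the actual GLOBE API:

```python
# Hypothetical sketch: wrap an internal method, record its intermediate
# inputs for CI-only diagnostics, and restore the original afterwards.
captured = []


class Kernel:  # stand-in for the real kernel class
    def _aggregate(self, values):
        return sum(values)


original = Kernel._aggregate


def capturing_aggregate(self, values):
    captured.append(list(values))  # record pre-aggregation inputs
    return original(self, values)


Kernel._aggregate = capturing_aggregate  # apply the patch
try:
    result = Kernel()._aggregate([1, 2, 3])
finally:
    Kernel._aggregate = original  # always restore, even on failure
```

This lets a test assert invariants on `captured` without modifying the code under test.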

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@peterdsharpe peterdsharpe added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Apr 22, 2026
… consistent TensorDict structure. Update test tolerances for output comparisons to enhance robustness against numerical discrepancies.
@peterdsharpe
Collaborator Author

peterdsharpe commented Apr 23, 2026

Root cause: tensordict shipped a new release on April 20 that changed the ordering of keys when flattening TensorDicts, and we didn't guard against this. Our CI picked up the new version, which caused a previously-passing test to fail. Testing a fix.
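The failure mode can be illustrated without tensordict itself: if column layout depends on the library's leaf iteration order, a version bump that reorders keys silently permutes the columns. Sorting by the stringified key path gives a canonical order. This is a simplified sketch with plain nested dicts, not the actual `concatenate_leaves` implementation:

```python
def flatten_leaves(d, prefix=()):
    """Yield (key_path, value) pairs for every leaf of a nested dict."""
    for k, v in d.items():
        path = prefix + (k,)
        if isinstance(v, dict):
            yield from flatten_leaves(v, path)
        else:
            yield path, v


def canonical_leaf_order(d):
    """Return leaves sorted by str(key_path): stable across orderings."""
    return sorted(flatten_leaves(d), key=lambda kv: str(kv[0]))


# Same content, different insertion order:
a = {"physical": {"pressure": 1.0}, "latent": {"scalars": {"0": 2.0}}, "normals": 3.0}
b = {"normals": 3.0, "latent": {"scalars": {"0": 2.0}}, "physical": {"pressure": 1.0}}
```

With a naive `list(flatten_leaves(...))`, `a` and `b` would yield different leaf orders; after sorting, both produce the same canonical layout.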

@peterdsharpe peterdsharpe marked this pull request as ready for review April 23, 2026 17:13
@greptile-apps
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR fixes a sporadic CI failure in the Barnes-Hut kernel caused by TensorDict returning leaves in a non-deterministic iteration order across library versions. concatenate_leaves now sorts items by str(key) for a stable column layout, and all split-back enumerations plus the vectors.keys() loop in _evaluate_interactions are aligned to that same order. The test gains per-pair pre-aggregation invariant checks via a monkey-patch capture pattern, tighter tolerances, and a correction to the output_field_ranks setup that was silently misconfiguring the test.

Important Files Changed

- `physicsnemo/experimental/models/globe/utilities/tensordict_utils.py` — Core fix: `concatenate_leaves` now sorts items by `str(key)` before concatenation, producing a canonical column ordering independent of TensorDict construction order and library version.
- `physicsnemo/experimental/models/globe/field_kernel.py` — Two aligned fixes: scalar split-back now reconstructs a nested TensorDict matching the exact kernel's leaf structure; both scalar and vector key enumerations are sorted by `str` to match `concatenate_leaves`'s canonical ordering. The `vectors.keys()` loop is also sorted for consistency.
- `test/models/globe/test_barnes_hut_kernel.py` — Adds diagnostic invariants: state_dict completeness, per-pair pre-aggregation bit-identity via monkey-patch + `inv_perm` reindex, and tighter final tolerances. Also fixes a silent test bug where the dict-comprehension always returned rank=1 for integer-valued `output_field_ranks`.
- `CHANGELOG.md` — Adds a changelog entry accurately describing the TensorDict key-ordering fix.
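The silent test bug mentioned above is visible in the CI trace: `output_field_ranks` already holds integer ranks, yet the comprehension compares each value against the string `"scalar"`, which an integer never equals. A minimal demonstration (values taken from the trace):

```python
# output_field_ranks already holds integer ranks, as in the failing test:
output_field_ranks = {"p": 0, "u": 1}

# The original comprehension compared against the string "scalar", which an
# integer never equals, so every field was silently mapped to rank 1:
buggy = {k: (0 if v == "scalar" else 1) for k, v in output_field_ranks.items()}

# Since the values are already integer ranks, they can be passed through:
fixed = dict(output_field_ranks)
```

So `"p"` was being tested as a rank-1 (vector) field instead of a rank-0 (scalar) field, which is the misconfiguration the PR corrects.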

Reviews (1): Last reviewed commit: "Revises fix"

@peterdsharpe
Collaborator Author

/blossom-ci

@peterdsharpe peterdsharpe removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Apr 23, 2026
@peterdsharpe peterdsharpe self-assigned this Apr 23, 2026
@peterdsharpe peterdsharpe enabled auto-merge April 23, 2026 17:50
@peterdsharpe peterdsharpe added this pull request to the merge queue Apr 23, 2026
Merged via the queue into NVIDIA:main with commit def51ba Apr 23, 2026
4 checks passed
@peterdsharpe peterdsharpe deleted the psharpe/globe_bh_test_hotfix branch April 23, 2026 19:19
