
Conversation

@wujingyue (Collaborator)

Before the PR, this test failed with `mpirun -np 1`.
wujingyue requested a review from Priya2698 on January 12, 2026 at 01:21
@wujingyue (Collaborator, Author)

!test

github-actions bot commented Jan 12, 2026

Review updated until commit def54ac

Description

  • Reduce the tensor dimension size from 12288 to 768 (`h * 8`) to minimize numerical differences

  • Remove explicit tolerance requirements from the `allclose` assertion

  • Make the unroll factor check more flexible (`EXPECT_GT` instead of `EXPECT_EQ`)

  • Fix the test failure that occurred with `mpirun -np 1`

Changes walkthrough

Relevant files
Tests
tests/cpp/test_multidevice_sharding.cpp: Fix multi-device test by reducing dimension size and tolerances (+4/-3)

  • Changed the `e` dimension from a fixed 12288 to `h * 8` (768) for vectorization
  • Removed explicit tolerance parameters from the `allclose` assertion
  • Changed the unroll factor check from an exact match to greater than 1

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Numerical Precision Fix

    The PR changes tensor dimension `e` from 12288 to `h * 8` (768) and removes explicit tolerance parameters from `at::allclose`. This should improve numerical stability, but the reviewer should verify that the test now passes consistently with `mpirun -np 1` and that the reduced dimension still adequately exercises the vectorization path.

    ```cpp
    int64_t b = 1, s = 2048, h = 96, e = h * 8;
    ```
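    The tolerance half of this change is not shown above. Here is a before/after sketch based on the summaries later in this thread; the `EXPECT_TRUE` wrapper is an assumption, while the `at::allclose` calls themselves are quoted from those summaries.

    ```cpp
    // Before: explicit loose tolerances (rtol = atol = 1e-3).
    EXPECT_TRUE(at::allclose(nvf_out, ref_out, /*rtol=*/1e-3, /*atol=*/1e-3));
    // After: ATen defaults (rtol = 1e-5, atol = 1e-8).
    EXPECT_TRUE(at::allclose(nvf_out, ref_out));
    ```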
    Test Robustness

    The unroll factor expectation was changed from the exact value 8 to anything greater than 1. This makes the test more flexible but less specific. The reviewer should confirm the change doesn't mask underlying issues in the scheduler heuristics.

    ```cpp
    EXPECT_GT(rparams->unroll_factor_iter_dom, 1);
    ```
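    For contrast, the assertion this replaced, as quoted in the summaries below:

    ```cpp
    EXPECT_EQ(rparams->unroll_factor_iter_dom, 8);
    ```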

    greptile-apps bot (Contributor) commented Jan 12, 2026

    Greptile Overview

    Greptile Summary

    This PR fixes a numerical precision issue in the `OuterReductionShardedInnerDimension` test that was causing failures when run with `mpirun -np 1`.

    Changes Made

    The PR makes three key changes to improve test stability:

    1. Reduced dimension size: Changed `e` from a hardcoded value of 12288 to `h * 8` (which equals 768), reducing the dimension by a factor of 16. This significantly reduces numerical error accumulation during BFloat16 operations, particularly in the sum reduction over large tensors. The comment notes "e = h * 8 to exercise vectorization", establishing the relationship needed for vectorization to work properly.

    2. Removed explicit tolerance parameters: Changed from `at::allclose(nvf_out, ref_out, 1e-3, 1e-3)` to `at::allclose(nvf_out, ref_out)`, now using the default tolerances (rtol=1e-5, atol=1e-8). The stricter defaults are appropriate given the reduced numerical error from smaller dimensions.

    3. Relaxed unroll factor assertion: Changed from `EXPECT_EQ(rparams->unroll_factor_iter_dom, 8)` to `EXPECT_GT(rparams->unroll_factor_iter_dom, 1)`. With the reduced dimension size, the scheduler may choose a different unroll factor based on its heuristics, but the test still verifies that vectorization is enabled and that some unrolling occurs.

    Why This Works

    The test performs BFloat16 operations including concatenation, reshape, casting to Float, and a sum reduction. With the original dimension `e = 12288`, the large tensor sizes led to significant numerical error accumulation in BFloat16 (which stores only 7 mantissa bits). By reducing to `e = 768`, the errors stay within acceptable bounds even under the stricter tolerance checks.
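    As a rough standalone illustration (not part of the PR), here is a minimal ATen snippet that mimics the fusion's rounding pattern: round Float data to BFloat16, reduce in Float, and round the result back. The reduction length matches the test's `b * s = 2048`; everything else is illustrative.

    ```cpp
    #include <ATen/ATen.h>
    #include <cstdint>
    #include <iostream>

    int main() {
      const int64_t n = 2048;  // matches the test's reduction length b * s
      at::Tensor x = at::randn({n});            // Float32 source data
      at::Tensor x_bf16 = x.to(at::kBFloat16);  // round inputs to BFloat16

      // Reduce in Float, as the fusion does after its cast to Float, then
      // round the result back to BFloat16 like the fusion's final cast.
      at::Tensor ref = x.sum();
      at::Tensor out =
          x_bf16.to(at::kFloat).sum().to(at::kBFloat16).to(at::kFloat);

      // The difference comes from BFloat16 rounding of the inputs and the
      // output; it tends to grow with the reduction length.
      std::cout << "abs diff: " << (ref - out).abs().item<float>() << "\n";
      return 0;
    }
    ```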

    The test still exercises the intended functionality: outer reduction with sharded inner dimensions, vectorization, and multi-device coordination.

    Confidence Score: 4/5

    • This PR is safe to merge with minimal risk - it only modifies test code to address numerical precision issues.
    • The changes are well-justified and appropriate for fixing the numerical precision issue. The dimension reduction (16x smaller) significantly reduces BFloat16 error accumulation while still testing the same functionality. However, the relaxed unroll factor check (from exactly 8 to just > 1) is quite permissive and could potentially hide future regressions in the scheduler's unrolling heuristics. The change from looser (1e-3) to stricter default tolerances is counterintuitive at first but makes sense given the reduced error accumulation.
    • No files require special attention - this is a straightforward test fix with appropriate changes to dimensions and validation criteria.

    Important Files Changed

    File Analysis

    Filename: tests/cpp/test_multidevice_sharding.cpp
    Score: 4/5
    Overview: Reduces dimension `e` from 12288 to 768 (16x smaller) to fix numerical precision issues in BFloat16 operations. Removes explicit tolerance parameters to use the stricter defaults, and relaxes the unroll factor check from exactly 8 to greater than 1. The changes are appropriate for addressing the test failure with smaller data.

    Sequence Diagram

    ```mermaid
    sequenceDiagram
        participant Test as OuterReductionShardedInnerDimension Test
        participant Fusion as Fusion Definition
        participant Inputs as Input Tensors (BFloat16)
        participant Shard as Sharding Operation
        participant Exec as FusionExecutorCache
        participant Ref as Reference Calculation
        participant Valid as Validation

        Note over Test: Setup: b=1, s=2048, h=96<br/>OLD: e=12288<br/>NEW: e=h*8=768

        Test->>Fusion: Define fusion with 3 inputs shape {b,s,h,e/h}
        Fusion->>Fusion: cat(tv0, tv1, tv2, dim=-1) → {b,s,h,3*e/h}
        Fusion->>Fusion: reshape → {b,s,3*e}
        Fusion->>Fusion: cast(BFloat16→Float)
        Fusion->>Fusion: sum(dims={0,1}) → {3*e}
        Fusion->>Fusion: cast(Float→BFloat16)

        Test->>Inputs: Create 3 random BFloat16 tensors {b,s,h,e/h}
        Inputs->>Shard: Shard on dimension 2 (h→h/d)
        Note over Shard: Each device gets {b,s,h/d,e/h}

        Shard->>Exec: Run fusion with sharded inputs
        Exec->>Exec: Execute fusion on device
        Exec-->>Test: Return nvf_out (BFloat16)

        Test->>Ref: Compute reference output
        Ref->>Ref: cat(sharded_inputs, dim=-1)
        Ref->>Ref: view({b,s,3*e/d})
        Ref->>Ref: sum(dims={0,1}) → {3*e/d}
        Ref-->>Test: Return ref_out

        Test->>Valid: Compare nvf_out vs ref_out
        Note over Valid: OLD: allclose(rtol=1e-3, atol=1e-3)<br/>NEW: allclose(default tolerances)
        Valid-->>Test: Assertion result

        Test->>Test: Check vectorization enabled
        Test->>Test: Check unroll_factor > 1
        Note over Test: OLD: Expected exactly 8<br/>NEW: Expect > 1
    ```
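    For readers who prefer code to the diagram, here is a rough nvFuser C++ sketch of the fusion described above. It is reconstructed from the diagram, not quoted from tests/cpp/test_multidevice_sharding.cpp; helpers such as `makeContigTensor` follow nvFuser test-suite conventions and are assumptions here, as is the sharding note at the end.

    ```cpp
    // Sketch only; assumes the nvfuser namespace and its test helpers.
    Fusion fusion;
    FusionGuard fg(&fusion);

    const int64_t b = 1, s = 2048, h = 96, e = h * 8;

    // Three BFloat16 inputs of shape {b, s, h, e/h}.
    TensorView* tv0 = makeContigTensor(4, DataType::BFloat16);
    TensorView* tv1 = makeContigTensor(4, DataType::BFloat16);
    TensorView* tv2 = makeContigTensor(4, DataType::BFloat16);
    fusion.addInput(tv0);
    fusion.addInput(tv1);
    fusion.addInput(tv2);

    // cat on the last dim -> {b, s, h, 3*e/h}, then flatten -> {b, s, 3*e}.
    TensorView* catted = cat({tv0, tv1, tv2}, /*dim=*/-1);
    TensorView* flat = reshape(catted, {b, s, h, 3 * e / h}, {b, s, 3 * e});

    // Upcast, reduce the outer dims {b, s}, and cast back.
    TensorView* as_float = castOp(DataType::Float, flat);
    TensorView* reduced = sum(as_float, {0, 1});  // -> {3*e}
    TensorView* out = castOp(DataType::BFloat16, reduced);
    fusion.addOutput(out);

    // Per the diagram, the inputs are sharded on dimension 2 (h -> h/d),
    // e.g. by assigning a DeviceMesh and parallelizing axis 2 with
    // ParallelType::DIDx before running through FusionExecutorCache.
    ```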

    greptile-apps bot (Contributor) left a review comment

    No files reviewed, no comments

    @wujingyue (Collaborator, Author)

    !test

    greptile-apps bot (Contributor) left a review comment

    No files reviewed, no comments

    wujingyue merged commit 99a167f into main on Jan 12, 2026
    61 of 62 checks passed
    wujingyue deleted the wjy/numeric branch on January 12, 2026 at 23:45