Conversation

@t-vi (Collaborator) commented Oct 10, 2025

Before submitting
  • [n/a] Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • [n/a] Did you make sure to update the docs?
  • [n/a] Did you write any new necessary tests?

What does this PR do?

Fixes the compilation of the Triton CE kernel (but note that the results are not checked in CI...).
Loosens a failing tensor parallel test.
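
The diff itself isn't shown in this thread, but for illustration, "loosening" such a test typically means passing explicit tolerances to torch.testing.assert_close. A minimal sketch, with hypothetical tensors and tolerance values:

```python
import torch
from torch.testing import assert_close

# Two bf16 tensors with a small relative drift, similar in spirit to the
# CI mismatch discussed below (values here are made up for illustration).
expected = torch.randn(16, 16, dtype=torch.bfloat16)
actual = expected * 1.03  # roughly 3% relative difference

# torch's default bf16 tolerance is rtol=1.6e-2, so the plain comparison
#   assert_close(actual, expected)
# would fail here. Passing explicit, looser tolerances accepts the drift:
assert_close(actual, expected, rtol=5e-2, atol=1e-4)
```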

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@t-vi (Collaborator, Author) commented Oct 10, 2025

@kshitij12345 any idea how to fix this (in the failing TE (v2) run)?


=================================== FAILURES ===================================
___ test_te_activation_checkpointing_correctness[ThunderFX-delayed_scaling] ____

fp8_recipe = recipe_type=DelayedScaling, margin=0, format=HYBRID, amax_history_len=1024, reduce_amax=True, fp8_dpa=False, fp8_mha=False
compile_path = 'ThunderFX'

    @requiresCUDA
    @pytest.mark.parametrize("fp8_recipe", recipes, ids=recipe_ids)
    @pytest.mark.parametrize("compile_path", ["jit", "ThunderFX"])
    @pytest.mark.filterwarnings("ignore::FutureWarning")  # Coming from TE v2.3
    @skip_on_sm120_and_sm121
    def test_te_activation_checkpointing_correctness(fp8_recipe: recipe.Recipe, compile_path: str):
        if not fp8_recipe:
            pytest.skip(
                "When recipe is None a new recipe is created for each iteration. This makes the results not numerically comparable."
            )
    
        if fp8_recipe and not (fp8_recipe.delayed() or is_mxfp8_supported):
            pytest.skip(msg_mxfp8)
    
        dtype = torch.bfloat16
        device = "cuda"
        iterations = 6
    
        from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
    
        # Before starting, reset the state manager.
        FP8GlobalStateManager.reset()
    
        checkpoint_fn = partial(torch.utils.checkpoint.checkpoint, use_reentrant=False)
    
        input_shape = (768, 4096)

....


        train_model(thunder_model, thunder_sgd_optimizer, thunder_loss_hist)
    
        for loss, te_loss in zip(thunder_loss_hist, te_loss_hist):
            assert_close(loss, te_loss)
    
>       assert_close(w1, te_linear1.weight)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 1 / 16777216 (0.0%)
E       Greatest absolute difference: 3.0517578125e-05 at index (1913, 1805) (up to 1e-05 allowed)
E       Greatest relative difference: 0.0299072265625 at index (1913, 1805) (up to 0.016 allowed)

thunder/tests/test_transformer_engine_executor.py:622: AssertionError

@kshitij12345 (Collaborator)

@riccardofelluga do you have any idea? IIRC, you were investigating this.

@riccardofelluga (Collaborator)

@kshitij12345 thanks for the cc, I've been looking at it on and off lately. What I've seen in the TE library is this:

https://github.com/NVIDIA/TransformerEngine/blob/dd9433e7ad28c12f27da9770be54c9c584e85fa0/tests/pytorch/test_numerics.py#L1232-L1240

While in their case it is used for comparing torch and TE, I guess the dtype-based tolerances for the inputs could also apply to our tests here. Will update #2444 and open a PR.
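
For reference, a minimal sketch of what dtype-dependent tolerances could look like in our tests, loosely modeled on the TE pattern linked above. The tolerance values here are hypothetical placeholders; the real thresholds would be settled in #2444:

```python
import torch
from torch.testing import assert_close

# Hypothetical per-dtype tolerances, loosely following the idea in TE's
# test_numerics.py linked above; actual values would be tuned in #2444.
DTYPE_TOLS = {
    torch.float32: {"rtol": 1.3e-6, "atol": 1e-5},
    torch.float16: {"rtol": 1e-3, "atol": 1e-5},
    torch.bfloat16: {"rtol": 5e-2, "atol": 1e-4},  # wider than torch's 1.6e-2 default
}

def assert_close_by_dtype(actual: torch.Tensor, expected: torch.Tensor) -> None:
    """Compare two tensors using tolerances picked from their dtype."""
    tols = DTYPE_TOLS[actual.dtype]
    assert_close(actual, expected, **tols)

# The failing comparison above would then read:
#   assert_close_by_dtype(w1, te_linear1.weight)
```

A bf16 rtol around 5e-2 would cover the observed relative difference of ~0.03 that tripped the 0.016 default.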

t-vi merged commit 6698f0c into main on Oct 13, 2025
49 of 51 checks passed
t-vi deleted the tom/tests-fix branch on October 13, 2025 at 10:03