Conversation

@t-vi (Collaborator) commented Oct 10, 2025

Before submitting
  • [n/a] Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • [n/a] Did you make sure to update the docs?
  • [n/a] Did you write any new necessary tests?

What does this PR do?

Fixes the compilation of the Triton CE kernel (but note that the results are not checked in CI...).
Loosens a failing tensor parallel test.
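
The diff itself isn't shown in this thread, but for illustration, "loosening" such a test typically means passing explicit tolerances to torch.testing.assert_close. A minimal sketch, with hypothetical tensors and tolerance values:

```python
import torch
from torch.testing import assert_close

# Two bf16 tensors with a small relative drift, similar in spirit to the
# CI mismatch discussed below (values here are made up for illustration).
expected = torch.randn(16, 16, dtype=torch.bfloat16)
actual = expected * 1.03  # roughly 3% relative difference

# torch's default bf16 tolerance is rtol=1.6e-2, so the plain comparison
#   assert_close(actual, expected)
# would fail here. Passing explicit, looser tolerances accepts the drift:
assert_close(actual, expected, rtol=5e-2, atol=1e-4)
```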

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@t-vi (Collaborator, Author) commented Oct 10, 2025

@kshitij12345 any idea how to fix this (in the failing TE (v2) run)?


=================================== FAILURES ===================================
___ test_te_activation_checkpointing_correctness[ThunderFX-delayed_scaling] ____

fp8_recipe = recipe_type=DelayedScaling, margin=0, format=HYBRID, amax_history_len=1024, reduce_amax=True, fp8_dpa=False, fp8_mha=False
compile_path = 'ThunderFX'

    @requiresCUDA
    @pytest.mark.parametrize("fp8_recipe", recipes, ids=recipe_ids)
    @pytest.mark.parametrize("compile_path", ["jit", "ThunderFX"])
    @pytest.mark.filterwarnings("ignore::FutureWarning")  # Coming from TE v2.3
    @skip_on_sm120_and_sm121
    def test_te_activation_checkpointing_correctness(fp8_recipe: recipe.Recipe, compile_path: str):
        if not fp8_recipe:
            pytest.skip(
                "When recipe is None a new recipe is created for each iteration. This makes the results not numerically comparable."
            )
    
        if fp8_recipe and not (fp8_recipe.delayed() or is_mxfp8_supported):
            pytest.skip(msg_mxfp8)
    
        dtype = torch.bfloat16
        device = "cuda"
        iterations = 6
    
        from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
    
        # Before starting, reset the state manager.
        FP8GlobalStateManager.reset()
    
        checkpoint_fn = partial(torch.utils.checkpoint.checkpoint, use_reentrant=False)
    
        input_shape = (768, 4096)

....


        train_model(thunder_model, thunder_sgd_optimizer, thunder_loss_hist)
    
        for loss, te_loss in zip(thunder_loss_hist, te_loss_hist):
            assert_close(loss, te_loss)
    
>       assert_close(w1, te_linear1.weight)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 1 / 16777216 (0.0%)
E       Greatest absolute difference: 3.0517578125e-05 at index (1913, 1805) (up to 1e-05 allowed)
E       Greatest relative difference: 0.0299072265625 at index (1913, 1805) (up to 0.016 allowed)

thunder/tests/test_transformer_engine_executor.py:622: AssertionError

@kshitij12345 (Collaborator)

@riccardofelluga do you have any idea? IIRC, you were investigating this.

@riccardofelluga (Collaborator)

@kshitij12345 thanks for the cc, I've been looking at it on and off lately. What I've seen in the TE library is this:

https://github.com/NVIDIA/TransformerEngine/blob/dd9433e7ad28c12f27da9770be54c9c584e85fa0/tests/pytorch/test_numerics.py#L1232-L1240

While in their case it is used for comparing torch and TE, I guess the dtype-based tolerances for the inputs could also apply to our tests here. Will update #2444 and open a PR.
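
For reference, a minimal sketch of what dtype-dependent tolerances could look like in our tests, loosely modeled on the TE pattern linked above. The tolerance values here are hypothetical placeholders; the real thresholds would be settled in #2444:

```python
import torch
from torch.testing import assert_close

# Hypothetical per-dtype tolerances, loosely following the idea in TE's
# test_numerics.py linked above; actual values would be tuned in #2444.
DTYPE_TOLS = {
    torch.float32: {"rtol": 1.3e-6, "atol": 1e-5},
    torch.float16: {"rtol": 1e-3, "atol": 1e-5},
    torch.bfloat16: {"rtol": 5e-2, "atol": 1e-4},  # wider than torch's 1.6e-2 default
}

def assert_close_by_dtype(actual: torch.Tensor, expected: torch.Tensor) -> None:
    """Compare two tensors using tolerances picked from their dtype."""
    tols = DTYPE_TOLS[actual.dtype]
    assert_close(actual, expected, **tols)

# The failing comparison above would then read:
#   assert_close_by_dtype(w1, te_linear1.weight)
```

A bf16 rtol around 5e-2 would cover the observed relative difference of ~0.03 that tripped the 0.016 default.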

t-vi merged commit 6698f0c into main on Oct 13, 2025
49 of 51 checks passed
t-vi deleted the tom/tests-fix branch on October 13, 2025 at 10:03