[PyTorch] Disable Flash Attention backend in Userbuffers tests#2399

Merged
ksivaman merged 2 commits into NVIDIA:main from timmoon10:tmoon/debug-ub-on-ampere on Nov 19, 2025

Conversation

@timmoon10
Collaborator

Description

We have experienced some Userbuffers test failures on A100s, apparently because the Flash Attention backward pass introduces numerical errors. These tests are primarily intended to exercise the linear layers rather than attention, so as a quick fix I've disabled the Flash Attention backend.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Disable Flash Attention backend in Userbuffers tests

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 added the bug (Something isn't working) and 2.10.0 labels on Nov 18, 2025
Member

@ksivaman ksivaman left a comment


LGTM as a fix for green CI. @cyanguwa, we should document this and try to root-cause it.

@greptile-apps
Contributor

greptile-apps Bot commented Nov 18, 2025

Greptile Summary

  • Disables Flash Attention backend in Userbuffers layer tests by setting NVTE_FLASH_ATTN=0 to prevent numerical errors on A100s
  • The fix is appropriately scoped to _run_layer_with_overlap function only, affecting tests for Linear, LayerNormLinear, LayerNormMLP, MultiheadAttention, and TransformerLayer

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change is minimal and surgical: it only adds an environment variable to disable Flash Attention in the specific tests that were experiencing numerical errors. The fix properly sets and unsets the environment variable, maintaining clean test isolation. The scope is appropriately limited to layer tests (not low-level GEMM tests), which aligns with the PR description stating this is a quick fix for Userbuffers tests that are primarily intended to test linear layers, not attention.
  • No files require special attention

Important Files Changed

  • tests/pytorch/distributed/test_comm_gemm_overlap.py: Added the NVTE_FLASH_ATTN=0 environment variable to disable the Flash Attention backend in layer overlap tests and avoid numerical errors

Sequence Diagram

sequenceDiagram
    participant Test as test_layers_with_overlap_*
    participant Helper as _run_layer_with_overlap
    participant Env as Environment
    participant Subprocess as run_layer_with_overlap.py
    
    Test->>Helper: Call with layer parameters
    Helper->>Env: Set "NVTE_FLASH_ATTN=0"
    Helper->>Env: Set "PYTORCH_JIT=0"
    Helper->>Env: Set "NVTE_TORCH_COMPILE=0"
    Helper->>Env: Set "NVTE_ALLOW_NONDETERMINISTIC_ALGO=0"
    Helper->>Subprocess: Run test with modified environment
    Subprocess-->>Helper: Return test result
    Helper->>Env: Unset "NVTE_FLASH_ATTN"
    Helper->>Env: Unset "PYTORCH_JIT"
    Helper->>Env: Unset "NVTE_TORCH_COMPILE"
    Helper->>Env: Unset "NVTE_ALLOW_NONDETERMINISTIC_ALGO"
    Helper-->>Test: Return success/failure
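The set-and-unset flow in the diagram above can be sketched as a small Python helper. This is an illustrative sketch, not the actual test code: the function name `run_layer_test` and the exact variable handling are assumptions, while the real helper is `_run_layer_with_overlap` in tests/pytorch/distributed/test_comm_gemm_overlap.py.

```python
import os
import subprocess
import sys

# Environment variables the diagram shows being set before the subprocess
# runs. NVTE_FLASH_ATTN=0 (the fix in this PR) disables the Flash Attention
# backend; the others force deterministic, non-compiled execution.
_TEST_ENV = {
    "NVTE_FLASH_ATTN": "0",
    "PYTORCH_JIT": "0",
    "NVTE_TORCH_COMPILE": "0",
    "NVTE_ALLOW_NONDETERMINISTIC_ALGO": "0",
}

def run_layer_test(cmd):
    """Run a test subprocess with the variables above set, then restore them.

    Hypothetical stand-in for _run_layer_with_overlap; returns True on success.
    """
    saved = {key: os.environ.get(key) for key in _TEST_ENV}
    os.environ.update(_TEST_ENV)
    try:
        result = subprocess.run(cmd, env=os.environ.copy())
        return result.returncode == 0
    finally:
        # Unset (or restore) each variable to keep later tests isolated.
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old

if __name__ == "__main__":
    # The child process sees NVTE_FLASH_ATTN=0; the parent environment is
    # restored afterwards.
    ok = run_layer_test(
        [sys.executable, "-c",
         "import os; assert os.environ['NVTE_FLASH_ATTN'] == '0'"]
    )
    print("passed" if ok else "failed")
```

Restoring rather than simply deleting the variables in the `finally` block is slightly more defensive than a plain unset, since it preserves any value a user had already exported.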

Contributor

@greptile-apps greptile-apps Bot left a comment


1 file reviewed, no comments


@timmoon10
Collaborator Author

/te-ci pytorch L1

Contributor

@greptile-apps greptile-apps Bot left a comment


1 file reviewed, no comments


Collaborator

@KshitijLakhani KshitijLakhani left a comment


LGTM
Agree with @ksivaman on documenting it. On the JAX side, we've mostly been sprinkling TODOs for such things. I wonder if @cyanguwa would prefer something like that or a different approach?

@ksivaman ksivaman merged commit e6da012 into NVIDIA:main Nov 19, 2025
24 of 31 checks passed
@timmoon10 timmoon10 deleted the tmoon/debug-ub-on-ampere branch November 20, 2025 05:01
KshitijLakhani pushed a commit that referenced this pull request Nov 20, 2025
Disable Flash attention in Userbuffers tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>