Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FMHA T5x test #442

Merged
merged 68 commits into from
Mar 8, 2024
Merged

Add FMHA T5x test #442

merged 68 commits into from
Mar 8, 2024

Conversation

hmonishN
Copy link
Contributor

@hmonishN hmonishN commented Dec 29, 2023

Adding the JAX T5x FMHA E2E system test to check for fmha lowering support. Following are the steps implemented in the test:

FMHA lowering flag is enabled by default now, enabled the dumping of hlo to track fmha forward and backward instructions.
Added the test as part of _ci.yaml file and also added a nightly workflow file for it. We will add this test as part of performance benchmarking later and add hlo to baseline.
Also added changes for correction of seq length of decoder (should be a multiple of 64)

The test was failing with following error related to CUDNN_STATUS_BAD_PARAM. The fix for this is added in the [PR] (openxla/xla#6872) in upstream which is now merged and the test passes. Bug for this error.

run for these changes: workflow run link

@hmonishN hmonishN requested a review from yhtang December 29, 2023 07:54
@hmonishN hmonishN self-assigned this Dec 29, 2023
@yhtang yhtang mentioned this pull request Jan 22, 2024
@hmonishN
Copy link
Contributor Author

Closed the old PR #327 as this one is the latest one.

@terrykong terrykong marked this pull request as draft January 28, 2024 17:46
.github/container/test-t5x.sh Show resolved Hide resolved
.github/container/test-t5x-fmha.sh Outdated Show resolved Hide resolved
.github/container/test-t5x-fmha.sh Outdated Show resolved Hide resolved
@terrykong
Copy link
Contributor

Moved to draft since there still seems to be active development on this PR.

BTW, it may be better to add this to _test_t5x.yaml since we were already running pretty close to the maximum number of reusable workflows allowed in github.

@hmonishN hmonishN marked this pull request as ready for review February 14, 2024 23:17
.github/container/test-t5x.sh Outdated Show resolved Hide resolved
.github/container/test-t5x.sh Outdated Show resolved Hide resolved
.github/workflows/_test_upstream_t5x.yaml Outdated Show resolved Hide resolved
Incorporated review comments
fixed indentation
hmonishN and others added 5 commits March 7, 2024 14:35
updating the new name for _test_t5x

Co-authored-by: Terry Kong <terryk@nvidia.com>
reduced number of jobs for fmha
Copy link
Contributor

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests that were added passed; metrics fail due to issue already resolved in main

@terrykong terrykong merged commit 8e8320f into main Mar 8, 2024
150 of 154 checks passed
@terrykong terrykong deleted the add_t5x_fmha_test branch March 8, 2024 17:44
@hmonishN hmonishN mentioned this pull request May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants