
Fix the failing test cases in the CI #1806

Merged
ptrendx merged 6 commits into NVIDIA:main from ptrendx:pr_tests_2.4
May 23, 2025
Conversation

ptrendx (Member) commented May 21, 2025

Description

  • distributed test: the FP32 test should use a higher rtol/atol because the GEMM uses TF32, so the true FP32 epsilon does not apply there
  • cast mxfp8 dgelu: modified the test so that it no longer hits a rare occurrence where amax * e4m3_max_rcp after dgelu has a zero mantissa on the CPU and a nonzero mantissa on the GPU (resulting in different scales)
  • Fixed an issue where the number of cores on a particular machine would affect the CI result
  • Fixed an issue where the gamma_in_weight_dtype setting was ignored when caching the Normalization plans.
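To illustrate the first item, here is a minimal sketch of why an FP32-epsilon tolerance is too strict when the GEMM inputs pass through TF32. It relies only on the fact that TF32 keeps 10 explicit mantissa bits versus 23 for FP32. The helper `round_to_tf32` is hypothetical and only approximates the hardware rounding; it is not the code used in the test.

```python
import math
import struct

# Relative spacing implied by the number of explicit mantissa bits.
FP32_EPS = 2.0 ** -23  # FP32: 23 explicit mantissa bits
TF32_EPS = 2.0 ** -10  # TF32: 10 explicit mantissa bits


def round_to_tf32(x: float) -> float:
    """Hypothetical helper: emulate TF32 precision for an FP32 value.

    Adds half of the dropped range and clears the 13 low mantissa bits
    (round to nearest, ties away from zero). Real hardware rounding may
    differ; this is only an approximation for illustration.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A value that FP32 represents exactly but TF32 cannot.
x = 1.0 + 2.0 ** -12
y = round_to_tf32(x)

# The same pair of values fails an FP32-epsilon check
# but passes a TF32-epsilon check.
passes_fp32_tol = math.isclose(x, y, rel_tol=FP32_EPS)
passes_tf32_tol = math.isclose(x, y, rel_tol=TF32_EPS)
```

The rounding error of a single input is already about 2^-12 here, several orders of magnitude above the FP32 epsilon, which is why the comparison tolerance has to follow the precision the GEMM actually uses.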


Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring


Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
ptrendx (Member, Author) commented May 21, 2025

/te-ci

timmoon10 (Collaborator) previously approved these changes May 21, 2025

timmoon10 left a comment

These don't fix the root issues in the C++ tests, but are fine as quick expedients. Getting back to green will be very nice.

Comment thread tests/cpp/operator/test_cast_transpose_dbias_dgelu.cu
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
ptrendx added 2 commits May 22, 2025 16:49
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
ptrendx (Member, Author) commented May 22, 2025

/te-ci

timmoon10 previously approved these changes May 23, 2025
Comment thread tests/cpp/test_common.cu Outdated
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx merged commit cd37379 into NVIDIA:main May 23, 2025
11 checks passed
KshitijLakhani pushed a commit that referenced this pull request May 23, 2025
* Modify the test cases

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Make the tests reproducible on different machines

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixed the cache of the gamma_in_weight_dtype setting

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Reinstate the tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* More verbose code and comments

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
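
The commit "Fixed the cache of the gamma_in_weight_dtype setting" addresses a common caching pitfall: if a setting that changes the plan is left out of the cache key, a lookup can return a plan built for a different configuration. The sketch below illustrates the idea only. `PlanKey`, `PlanCache`, and every field other than `gamma_in_weight_dtype` are hypothetical names, not the actual Transformer Engine code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanKey:
    """Hypothetical cache key; every setting that affects the plan is a field."""
    norm_type: str
    hidden_size: int
    gamma_in_weight_dtype: bool  # leaving this out would alias distinct plans


class PlanCache:
    """Hypothetical plan cache that builds a plan once per distinct key."""

    def __init__(self) -> None:
        self._plans: dict[PlanKey, str] = {}
        self.builds = 0

    def get(self, key: PlanKey) -> str:
        if key not in self._plans:
            self.builds += 1
            # Stand-in for an expensive plan construction.
            self._plans[key] = f"plan#{self.builds}"
        return self._plans[key]


cache = PlanCache()
plan_a = cache.get(PlanKey("LayerNorm", 1024, gamma_in_weight_dtype=False))
plan_b = cache.get(PlanKey("LayerNorm", 1024, gamma_in_weight_dtype=True))
plan_a_again = cache.get(PlanKey("LayerNorm", 1024, gamma_in_weight_dtype=False))
```

With the setting in the key, the two configurations get separate plans, and repeating the first lookup reuses the cached plan instead of building a new one.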
phu0ngng pushed a commit to phu0ngng/TransformerEngine that referenced this pull request May 29, 2025