Fix nondeterministic RNG in test_fused_mxfp4_quant #2562
Merged
brunomazzottiamd merged 2 commits into ROCm:main on Apr 1, 2026
Summary:
Tests in test_fused_mxfp4_quant.py were failing in CI, particularly when run as part of shard 3. The failures could not be reproduced by running the test in isolation. Thanks to Bruno for providing the command line.

Root cause:
The random seed was set once at module top level, just after the imports, via torch.manual_seed(). That call runs only when the module is imported, so by the time a test executes, the global RNG state has already been advanced by whichever tests ran earlier in the same shard. This explains why the test passed in isolation but failed in the shard: its outcome was order-dependent and therefore nondeterministic.

Fix:
- Removed torch.manual_seed() from the top level of the module.
- Added deterministic seeding inside the affected test case to make its behaviour order-independent.

Validation:
- Reproduced the failure locally using the CI shard 3 command.
- Verified that the failures occur in op_tests/triton_tests/quant/test_fused_mxfp4_quant.py::test_fused_rms_quant.
- After the fix:
  - All tests pass in shard 3 with TRITON_HIP_USE_ASYNC_COPY=0.
  - test_fused_rms_quant also passes with ASYNC_COPY enabled (command-line runs both within shard 3 and in isolation).
  - Tests pass consistently in isolation and across repeated runs.

Additional Notes:
- The remaining failures with TRITON_HIP_USE_ASYNC_COPY=1 come from known MoE and GEMM issues with async copy enabled. They are unrelated to this change and can be addressed separately.
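The order dependence described in the root cause can be illustrated with a minimal sketch. It is not the code from this PR: it uses Python's standard `random` module as a stand-in for torch's global generator so that it runs without torch or a GPU, and the names `SEED`, `other_test`, `draw_with_module_seed`, and `draw_with_local_seed` are hypothetical. The same pattern should apply when `random.seed()` is replaced by `torch.manual_seed()`.

```python
import random

SEED = 0  # hypothetical value; the PR does not state which seed is used

# Old pattern: seed once at import time, at module top level.
random.seed(SEED)


def other_test():
    # Stand-in for any earlier test in the shard: drawing a random
    # number advances the shared global RNG state.
    random.random()


def draw_with_module_seed():
    # Relies on the import-time seed, so the value returned depends on
    # how many draws happened before this call.
    return random.random()


def draw_with_local_seed():
    # New pattern: reseed inside the test body, so the value returned
    # is the same no matter what ran earlier.
    random.seed(SEED)
    return random.random()


# Module-level seeding: same function, different result once another
# test has consumed random numbers in between.
isolated = draw_with_module_seed()
other_test()
after_other = draw_with_module_seed()

# Per-test seeding: the result is stable across orderings.
local_first = draw_with_local_seed()
other_test()
local_second = draw_with_local_seed()
```

Here `isolated` and `after_other` differ because the global state moved between the two calls, while `local_first` and `local_second` match because each call reseeds first. This mirrors why the test passed in isolation but failed inside shard 3.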
…Moved the seeds after skip condition, and used black to format file
082947e to 46bbfd3