Fix nondeterministic RNG in test_fused_mxfp4_quant #2562

Merged
brunomazzottiamd merged 2 commits into ROCm:main from nidal567:golden-tot-kernels
Apr 1, 2026

Conversation

@nidal567
Contributor

Summary:

Tests in test_fused_mxfp4_quant.py were failing in CI, particularly when executed as part of shard 3. The failures were not reproducible when running the failing test in isolation. Thanks to Bruno for providing the command line.

Root cause:

The random seed was previously set at the top level of the module, just after the imports, via torch.manual_seed(). Module-level code runs only once, when the module is imported, so by the time the test ran, the global RNG state had already been changed by tests executed earlier in the same shard. This explains why the test passed in isolation but failed in the shard: its outcome was order-dependent and therefore non-deterministic.
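
To illustrate the mechanism, here is a minimal sketch. It uses Python's stdlib random module as a stand-in for the torch global RNG so that it runs without torch or a GPU; all helper names and the seed value are hypothetical and not taken from the test file.

```python
import random

MODULE_SEED = 0  # hypothetical value; the PR does not state the actual seed


def simulate_module_import():
    """Stand-in for a module-level seed call, which runs once at import time."""
    random.seed(MODULE_SEED)


def earlier_test_in_shard():
    """Any earlier test that draws from the global RNG advances its state."""
    random.random()


def draw_test_inputs(n=4):
    """Stand-in for the affected test generating its random inputs."""
    return [random.random() for _ in range(n)]


def inputs_in_isolation():
    # Import, then run the test straight away: nothing touches the RNG between.
    simulate_module_import()
    return draw_test_inputs()


def inputs_in_shard():
    # Import, run another test first, then run the affected test.
    simulate_module_import()
    earlier_test_in_shard()
    return draw_test_inputs()


if __name__ == "__main__":
    # Same seed, but the inputs differ depending on what ran before the test.
    print(inputs_in_isolation() == inputs_in_shard())
```

The same seed yields different inputs in the two scenarios, which is the order dependence described above.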

Fix:

  • Removed torch.manual_seed() from the top level of the module
  • Moved the seeding into the affected test case so that its behaviour is deterministic and order-independent
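
A minimal sketch of the per-test seeding pattern, assuming a CPU-only setup. The seed value, tensor shape, and function names are hypothetical; this is not the actual body of test_fused_rms_quant.

```python
import torch

SEED = 0  # hypothetical value; the PR does not state the actual seed


def make_inputs(rows=4, cols=32):
    """Hypothetical input builder that seeds right before drawing.

    Seeding here, rather than at module import, makes the values independent
    of whatever other tests did to the global RNG beforehand.
    """
    torch.manual_seed(SEED)
    return torch.randn(rows, cols)


def test_inputs_are_order_independent():
    baseline = make_inputs()

    # Simulate earlier tests in the shard: advance and reseed the global RNG.
    torch.randn(1000)
    torch.manual_seed(1234)

    # The locally seeded inputs are unchanged.
    assert torch.equal(baseline, make_inputs())
```

An autouse pytest fixture that seeds before every test would be another way to get the same effect; the PR seeds inside the affected test case instead.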

Validation:

  • Reproduced failure using CI shard 3 command locally
  • Verified that the failures occur in op_tests/triton_tests/quant/test_fused_mxfp4_quant.py::test_fused_rms_quant
  • After fix:
    • All tests pass in shard 3 with TRITON_HIP_USE_ASYNC_COPY=0
    • test_fused_rms_quant also passes with TRITON_HIP_USE_ASYNC_COPY=1 (both in the shard 3 command-line run and in isolation)
    • Tests pass consistently in isolation and across repeated runs

Additional Notes:

  • The remaining failures with TRITON_HIP_USE_ASYNC_COPY=1 are due to known MoE and GEMM issues with async copy enabled. They are unrelated to this change and can be addressed separately

@nidal567 nidal567 requested review from a team and brunomazzottiamd March 31, 2026 19:29
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-355 | Run Triton tests on MI355 in addition to MI325 |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2562 --add-label <label>

…Moved the seeds after skip condition, and used black to format file
@nidal567 nidal567 force-pushed the golden-tot-kernels branch from 082947e to 46bbfd3 April 1, 2026 00:08
Contributor

@brunomazzottiamd brunomazzottiamd left a comment

LGTM!

@nidal567
Contributor Author

nidal567 commented Apr 1, 2026

(edited by @brunomazzottiamd)

@brunomazzottiamd brunomazzottiamd merged commit 381129b into ROCm:main Apr 1, 2026
111 of 132 checks passed
@nidal567 nidal567 mentioned this pull request Apr 1, 2026
11 tasks
@nidal567 nidal567 deleted the golden-tot-kernels branch April 2, 2026 18:40
Labels

bug (Something isn't working), ci:triton-355, triton

2 participants