
[ROCm] Fix FP8 collective ops E2E tests for MI300X (NANOO FP8)#776

Open
phambinhfin wants to merge 3 commits into main from phambinh/fix-collective-ops-fp8-tests

Conversation


@phambinhfin phambinhfin commented Apr 5, 2026

The 5 FP8 tests in collective_ops_e2e_test use f8e4m3fn (OCP/IEEE FP8), but MI300X (gfx942) only supports f8e4m3fnuz (NANOO FP8). The GEMM rewriter correctly rejects the FP8 rewrite for mismatched types, causing the tests to produce __cublas$lt$matmul instead of __cublas$lt$matmul$f8.

Add a ReplaceFp8Types() helper that substitutes f8e4m3fn -> f8e4m3fnuz and f8e5m2 -> f8e5m2fnuz when running on NANOO-only hardware (MI300X), following the same pattern used in gemm_rewriter_fp8_test.cc.
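A self-contained sketch of what such a helper could look like. In the real code the NANOO-only check queries `Capability().rocm_compute_capability()` and the substitution uses `absl::StrReplaceAll`; here the hardware check is reduced to a bool parameter and the substitution to a plain replace loop so the sketch stands alone:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the ReplaceFp8Types() helper described above.
// is_nanoo_only models "Capability().IsRocm() && has_nanoo_fp8_support() &&
// !has_ocp_fp8_support()", i.e. MI300X (gfx942).
std::string ReplaceFp8Types(std::string hlo_text, bool is_nanoo_only) {
  if (!is_nanoo_only) return hlo_text;  // CUDA / OCP-capable ROCm: unchanged.
  const std::vector<std::pair<std::string, std::string>> subs = {
      {"f8e4m3fn", "f8e4m3fnuz"}, {"f8e5m2", "f8e5m2fnuz"}};
  for (const auto& [from, to] : subs) {
    std::size_t pos = 0;
    while ((pos = hlo_text.find(from, pos)) != std::string::npos) {
      hlo_text.replace(pos, from.size(), to);
      pos += to.size();  // Skip past the replacement to avoid re-matching.
    }
  }
  return hlo_text;
}
```

Note this relies on the input HLO never already containing FNUZ type names, since "f8e4m3fn" is a prefix of "f8e4m3fnuz".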

Test command and failing output (before the fix)

cd /workspace/rocm-jax/xla && bazel test //xla/backends/gpu/tests:collective_ops_e2e_test_amdgpu_any \
  --config=rocm \
  --test_env=TF_TESTS_PER_GPU=1 \
  --test_env=TF_GPU_COUNT=8 \
  --action_env=NCCL_MAX_NCHANNELS=1 \
  --test_filter='CollectiveOpsTestE2E.CollectivePipelinerF8:CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8:CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8:CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8:CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8' \
  --test_output=all \
  --test_timeout=300 \
  2>&1 | tail -100
I0000 00:00:1775356947.890825 3915356 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 5 for CollectiveBFCAllocator.
I0000 00:00:1775356947.890905 3915356 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 6 for BFCAllocator.
I0000 00:00:1775356947.925936 3915356 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 6 for CollectiveBFCAllocator.
I0000 00:00:1775356947.926012 3915356 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 7 for BFCAllocator.
I0000 00:00:1775356947.961116 3915356 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 7 for CollectiveBFCAllocator.
I0000 00:00:1775356947.984081 3915356 tfrt_gpu_client.cc:239] TfrtGpuClient created with 8 / 8 addressable devices.
W0000 00:00:1775356947.985491 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
W0000 00:00:1775356948.178974 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
W0000 00:00:1775356948.258954 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
xla/backends/gpu/tests/collective_ops_e2e_test.cc:117: Failure
Expected equality of these values:
  gemm_op->custom_call_target()
    Which is: "__cublas$lt$matmul"
  "__cublas$lt$matmul$f8"

I0000 00:00:1775356948.580927 3914400 tfrt_gpu_client.cc:263] TfrtGpuClient destroyed.
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8 (1033 ms)
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8
I0000 00:00:1775356948.712805 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 0 for BFCAllocator.
I0000 00:00:1775356948.747897 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1775356948.747981 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 1 for BFCAllocator.
I0000 00:00:1775356948.783173 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 1 for CollectiveBFCAllocator.
I0000 00:00:1775356948.783235 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 2 for BFCAllocator.
I0000 00:00:1775356948.818462 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 2 for CollectiveBFCAllocator.
I0000 00:00:1775356948.818532 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 3 for BFCAllocator.
I0000 00:00:1775356948.853658 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 3 for CollectiveBFCAllocator.
I0000 00:00:1775356948.853739 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 4 for BFCAllocator.
I0000 00:00:1775356948.888880 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 4 for CollectiveBFCAllocator.
I0000 00:00:1775356948.888960 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 5 for BFCAllocator.
I0000 00:00:1775356948.923984 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 5 for CollectiveBFCAllocator.
I0000 00:00:1775356948.924064 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 6 for BFCAllocator.
I0000 00:00:1775356948.959110 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 6 for CollectiveBFCAllocator.
I0000 00:00:1775356948.959188 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 7 for BFCAllocator.
I0000 00:00:1775356948.994160 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 7 for CollectiveBFCAllocator.
I0000 00:00:1775356949.017215 3915782 tfrt_gpu_client.cc:239] TfrtGpuClient created with 8 / 8 addressable devices.
W0000 00:00:1775356949.018960 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
W0000 00:00:1775356949.091653 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
W0000 00:00:1775356949.171884 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
xla/backends/gpu/tests/collective_ops_e2e_test.cc:117: Failure
Expected equality of these values:
  gemm_op->custom_call_target()
    Which is: "__cublas$lt$matmul"
  "__cublas$lt$matmul$f8"

I0000 00:00:1775356949.474531 3914400 tfrt_gpu_client.cc:263] TfrtGpuClient destroyed.
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8 (893 ms)
[----------] 4 tests from CollectiveOpsTestE2EWindowedNonWindowed (19696 ms total)

[----------] 1 test from CollectiveOpsTestE2E
[ RUN      ] CollectiveOpsTestE2E.CollectivePipelinerF8
I0000 00:00:1775356949.604425 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 0 for BFCAllocator.
I0000 00:00:1775356949.605668 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1775356949.605696 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 1 for BFCAllocator.
I0000 00:00:1775356949.606835 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 1 for CollectiveBFCAllocator.
I0000 00:00:1775356949.606855 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 2 for BFCAllocator.
I0000 00:00:1775356949.607992 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 2 for CollectiveBFCAllocator.
I0000 00:00:1775356949.608009 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 3 for BFCAllocator.
I0000 00:00:1775356949.609147 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 3 for CollectiveBFCAllocator.
I0000 00:00:1775356949.609166 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 4 for BFCAllocator.
I0000 00:00:1775356949.610301 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 4 for CollectiveBFCAllocator.
I0000 00:00:1775356949.610316 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 5 for BFCAllocator.
I0000 00:00:1775356949.611551 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 5 for CollectiveBFCAllocator.
I0000 00:00:1775356949.611576 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 6 for BFCAllocator.
I0000 00:00:1775356949.612698 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 6 for CollectiveBFCAllocator.
I0000 00:00:1775356949.612732 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 7 for BFCAllocator.
I0000 00:00:1775356949.613861 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 7 for CollectiveBFCAllocator.
I0000 00:00:1775356949.636857 3916204 tfrt_gpu_client.cc:239] TfrtGpuClient created with 8 / 8 addressable devices.
W0000 00:00:1775356949.638884 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
xla/backends/gpu/tests/collective_ops_e2e_test.cc:117: Failure
Expected equality of these values:
  gemm_op->custom_call_target()
    Which is: "__cublas$lt$matmul"
  "__cublas$lt$matmul$f8"

I0000 00:00:1775356949.912597 3914400 tfrt_gpu_client.cc:263] TfrtGpuClient destroyed.
[  FAILED  ] CollectiveOpsTestE2E.CollectivePipelinerF8 (438 ms)
[----------] 1 test from CollectiveOpsTestE2E (438 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 2 test suites ran. (20134 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 5 tests, listed below:
[  FAILED  ] CollectiveOpsTestE2E.CollectivePipelinerF8
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8

 5 FAILED TESTS
================================================================================
INFO: Found 1 test target...

Tests fixed:

  • CollectiveOpsTestE2E.CollectivePipelinerF8
  • CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8
  • CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8
  • CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8
  • CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8
Output after the fix:

[==========] Running 5 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from CollectiveOpsTestE2E
[ RUN      ] CollectiveOpsTestE2E.CollectivePipelinerF8
[       OK ] CollectiveOpsTestE2E.CollectivePipelinerF8 (2714 ms)
[----------] 1 test from CollectiveOpsTestE2E (2714 ms total)
[----------] 4 tests from CollectiveOpsTestE2EWindowedNonWindowed
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8
[       OK ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8 (12528 ms)
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8
[       OK ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8 (1231 ms)
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8
[       OK ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8 (1006 ms)
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8
[       OK ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8 (896 ms)
[----------] 4 tests from CollectiveOpsTestE2EWindowedNonWindowed (15662 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 2 test suites ran. (18377 ms total)
[  PASSED  ] 5 tests.
//xla/backends/gpu/tests:collective_ops_e2e_test_amdgpu_any              PASSED in 35.6s
Executed 1 out of 1 test: 1 test passes.

@phambinhfin phambinhfin requested a review from i-chaochen April 5, 2026 02:57
@phambinhfin phambinhfin added the claude-review Request a Claude AI code review for this PR label Apr 5, 2026
Capability().rocm_compute_capability()->has_nanoo_fp8_support() &&
!Capability().rocm_compute_capability()->has_ocp_fp8_support()) {
return absl::StrReplaceAll(
hlo_text, {{"f8e4m3fn", "f8e4m3fnuz"}, {"f8e5m2", "f8e5m2fnuz"}});

nit: The {"f8e4m3fn", "f8e4m3fnuz"} replacement pair has a subtle prefix-match risk: if an HLO string ever contained a pre-existing f8e4m3fnuz token, absl::StrReplaceAll would match the f8e4m3fn prefix first and produce f8e4m3fnuzuz.

This is safe today because no HLO strings in this file contain fnuz types, and the replacement is gated on NANOO-only hardware. However, the gemm_rewriter_fp8_test.cc reference implementation avoids this by using placeholder tokens (<<F8E4M3>>, <<F8E5M2>>) instead of substring replacement.

Consider adding a brief comment noting this constraint, or switching to the placeholder approach if more FP8 tests are added to this file.
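To make the hazard concrete, here is a minimal self-contained illustration. A plain left-to-right replace loop stands in for `absl::StrReplaceAll`, and the placeholder expansion mirrors the `<<F8E4M3>>` / `<<F8E5M2>>` approach the comment cites from gemm_rewriter_fp8_test.cc (function names here are illustrative):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sequential substring replacement, standing in for absl::StrReplaceAll.
std::string ReplaceAll(
    std::string s,
    const std::vector<std::pair<std::string, std::string>>& subs) {
  for (const auto& [from, to] : subs) {
    std::size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
      s.replace(pos, from.size(), to);
      pos += to.size();
    }
  }
  return s;
}

// Hazard: "f8e4m3fn" is a prefix of "f8e4m3fnuz", so a pre-existing FNUZ
// token gets mangled into "f8e4m3fnuzuz" by the substitution above.

// Placeholder approach: write HLO with tokens that cannot collide with any
// real type name, then expand them per platform.
std::string ExpandPlaceholders(std::string hlo, bool is_nanoo_only) {
  return ReplaceAll(
      std::move(hlo),
      {{"<<F8E4M3>>", is_nanoo_only ? "f8e4m3fnuz" : "f8e4m3fn"},
       {"<<F8E5M2>>", is_nanoo_only ? "f8e5m2fnuz" : "f8e5m2"}});
}
```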


Resolved — addressed in this revision. A comment noting the prefix-match constraint has been added.

Comment on lines +95 to +98
std::string ReplaceFp8Types(absl::string_view hlo_text) {
if (Capability().IsRocm() &&
Capability().rocm_compute_capability()->has_nanoo_fp8_support() &&
!Capability().rocm_compute_capability()->has_ocp_fp8_support()) {

ultra-nit: Capability() is called three times in succession. A local reference would read slightly cleaner:

Suggested change:

  std::string ReplaceFp8Types(absl::string_view hlo_text) {
-   if (Capability().IsRocm() &&
-       Capability().rocm_compute_capability()->has_nanoo_fp8_support() &&
-       !Capability().rocm_compute_capability()->has_ocp_fp8_support()) {
+   const auto& cap = Capability();
+   if (cap.IsRocm() &&
+       cap.rocm_compute_capability()->has_nanoo_fp8_support() &&
+       !cap.rocm_compute_capability()->has_ocp_fp8_support()) {

Negligible perf-wise since this runs once per test, just a readability suggestion.


Resolved — addressed in this revision. Capability() is now cached in a local reference.


claude bot commented Apr 5, 2026

Review Summary

Solid fix. The ReplaceFp8Types() helper correctly targets MI300X (NANOO-only) hardware and is inserted at exactly the two entry points all 5 failing FP8 tests flow through. The HasFp8Support() guard remains correct since it's the union of OCP and NANOO support. CUDA and MI350+ paths are unaffected.

Two minor inline suggestions posted (prefix-match risk note, repeated Capability() calls). Neither is blocking.

🤖 Generated with Claude Code

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 5, 2026
- Store Capability() in a local const reference to avoid repeated calls
- Add comment noting the substring replacement constraint (input HLO
  must not already contain FNUZ type strings)
@phambinhfin phambinhfin added the claude-review Request a Claude AI code review for this PR label Apr 5, 2026

claude bot commented Apr 5, 2026

Re-review Summary

Previous review feedback has been addressed — both issues are now resolved:

  • Prefix-match risk: Warning comment added to document the absl::StrReplaceAll ordering constraint.
  • Capability() caching: Now cached in a const auto& local reference.

No new issues found. The fix is clean, well-scoped, and follows the established pattern from gemm_rewriter_fp8_test.cc. LGTM.

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 5, 2026
Collaborator

@i-chaochen i-chaochen left a comment


IIUC, this test is checking ocp-fp8

lhs = f8e4m3fn[2,16,48]{2,1,0} parameter(0), sharding={devices=[1,4,1]<=[4]}
rhs = f8e4m3fn[48,192]{1,0} parameter(1), sharding={devices=[1,4]<=[4]}

and

lhs = f8e4m3fn[2,16,48]{2,1,0} parameter(0), sharding={devices=[1,4,1]<=[4]}
rhs = f8e4m3fn[2,24,192]{2,1,0} parameter(1), sharding={devices=[1,1,4]<=[4]}

and

lhs = f8e4m3fn[2,16,192]{2,1,0} parameter(0), sharding={devices=[1,1,4]<=[4]}
rhs = f8e4m3fn[192,48]{1,0} parameter(1), sharding={devices=[4,1]<=[4]}

So I don't think it's right to just modify the test cases to NANOO FP8 to get a PASS. The proper way would be to let gfx942 skip these OCP FP8 tests (as OCP FP8 is not supported there, with no impact on gfx950) and to write corresponding NANOO FP8 HLO for gfx942.

// strings are written with OCP types; this replaces them with NANOO types
// when running on MI300X so the GEMM rewriter can produce FP8 custom calls.
// Note: the input HLO must not already contain FNUZ type strings, as the
// substring replacement of "f8e4m3fn" would also match inside "f8e4m3fnuz".
Collaborator


If you want to make a statement like this, please cite the docs as the reference: https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html#id4

Collaborator

@i-chaochen i-chaochen left a comment


Thanks! It's OK to me; please upstream it. Although I was expecting you to make some slight changes to the input HLO so it covers both OCP and NANOO FP8, as that would mean less code change. But let's see whether upstream is OK with this.

@phambinhfin phambinhfin force-pushed the phambinh/fix-collective-ops-fp8-tests branch 4 times, most recently from e0b99d7 to f390a1d Compare April 7, 2026 14:03
Instead of silently replacing OCP FP8 types (f8e4m3fn) with FNUZ types
(f8e4m3fnuz) at runtime via string substitution, properly separate the
tests by FP8 variant:

- Skip OCP FP8 tests on NANOO-only platforms since they do not support
  OCP FP8
  (ref: https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html#id4)
- Add dedicated FNUZ FP8 test cases with native f8e4m3fnuz HLO for
  NANOO FP8 platforms
- Remove ReplaceFp8Types() helper that changed test semantics
- Add IsNanooFp8Only() helper (ROCm-only, no CUDA changes)

New FNUZ FP8 tests:
- CollectivePipelinerFnuzF8
- WindowedEinsumE2EAllGatherFnuzF8
- WindowedEinsumE2EAllGatherReshapeFnuzF8
- WindowedEinsumE2EAllGatherMultiConsumerFnuzF8
- WindowedEinsumE2EReduceScatterFnuzF8
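Under the approach described in the commit message, the guard might look roughly like the sketch below. The capability struct and its field names are hypothetical stand-ins for the real ROCm capability accessors (`has_nanoo_fp8_support()` / `has_ocp_fp8_support()` quoted earlier in the thread):

```cpp
// Hypothetical stand-in for the device's ROCm compute capability; the real
// code queries Capability().rocm_compute_capability().
struct RocmComputeCapability {
  bool has_nanoo_fp8_support;
  bool has_ocp_fp8_support;
};

// True only on NANOO-only platforms such as MI300X (gfx942), which support
// NANOO FP8 but not OCP FP8. MI350+ (gfx950) supports OCP and returns false.
bool IsNanooFp8Only(const RocmComputeCapability& cc) {
  return cc.has_nanoo_fp8_support && !cc.has_ocp_fp8_support;
}

// In the OCP FP8 tests this guard would be used roughly as:
//   if (IsNanooFp8Only(cc)) {
//     GTEST_SKIP() << "OCP FP8 is not supported on NANOO-only hardware";
//   }
// while the dedicated *FnuzF8 tests run only when IsNanooFp8Only() is true
// and use HLO written natively with f8e4m3fnuz.
```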
@phambinhfin phambinhfin force-pushed the phambinh/fix-collective-ops-fp8-tests branch from f390a1d to 5cc99ea Compare April 7, 2026 14:05
@phambinhfin phambinhfin added the claude-review Request a Claude AI code review for this PR label Apr 7, 2026

claude bot commented Apr 7, 2026

Re-review Summary

Previous Claude comments (prefix-match risk with StrReplaceAll, repeated Capability() calls) are both resolved — the ReplaceFp8Types() helper has been removed entirely in favor of separate FNUZ test functions with a simple IsNanooFp8Only() guard.

The new approach is straightforward and correct: all five FNUZ HLO modules are faithful type substitutions (f8e4m3fn -> f8e4m3fnuz) of their OCP counterparts, with shapes, layouts, sharding, and scale-parameter semantics preserved. No new issues found.

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 7, 2026
@phambinhfin
Author

@i-chaochen
Upstream: openxla#40490
