
[ROCm] Fix FP8 collective ops E2E tests for MI300X (NANOO FP8)#776

Open
phambinhfin wants to merge 3 commits into main from phambinh/fix-collective-ops-fp8-tests

Conversation


@phambinhfin phambinhfin commented Apr 5, 2026

The 5 FP8 tests in collective_ops_e2e_test use f8e4m3fn (OCP/IEEE FP8), but MI300X (gfx942) only supports f8e4m3fnuz (NANOO FP8). The GEMM rewriter correctly rejects the FP8 rewrite for mismatched types, causing the tests to produce __cublas$lt$matmul instead of __cublas$lt$matmul$f8.

Add a ReplaceFp8Types() helper that substitutes f8e4m3fn -> f8e4m3fnuz and f8e5m2 -> f8e5m2fnuz when running on NANOO-only hardware (MI300X), following the same pattern used in gemm_rewriter_fp8_test.cc.
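A self-contained sketch of what such a helper could look like. In the real code the NANOO-only check queries `Capability().rocm_compute_capability()` and the substitution uses `absl::StrReplaceAll`; here the hardware check is reduced to a bool parameter and the substitution to a plain replace loop so the sketch stands alone:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the ReplaceFp8Types() helper described above.
// is_nanoo_only models "Capability().IsRocm() && has_nanoo_fp8_support() &&
// !has_ocp_fp8_support()", i.e. MI300X (gfx942).
std::string ReplaceFp8Types(std::string hlo_text, bool is_nanoo_only) {
  if (!is_nanoo_only) return hlo_text;  // CUDA / OCP-capable ROCm: unchanged.
  const std::vector<std::pair<std::string, std::string>> subs = {
      {"f8e4m3fn", "f8e4m3fnuz"}, {"f8e5m2", "f8e5m2fnuz"}};
  for (const auto& [from, to] : subs) {
    std::size_t pos = 0;
    while ((pos = hlo_text.find(from, pos)) != std::string::npos) {
      hlo_text.replace(pos, from.size(), to);
      pos += to.size();  // Skip past the replacement to avoid re-matching.
    }
  }
  return hlo_text;
}
```

Note this relies on the input HLO never already containing FNUZ type names, since "f8e4m3fn" is a prefix of "f8e4m3fnuz".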

Test command and failing output (before the fix)

cd /workspace/rocm-jax/xla && bazel test //xla/backends/gpu/tests:collective_ops_e2e_test_amdgpu_any \
  --config=rocm \
  --test_env=TF_TESTS_PER_GPU=1 \
  --test_env=TF_GPU_COUNT=8 \
  --action_env=NCCL_MAX_NCHANNELS=1 \
  --test_filter='CollectiveOpsTestE2E.CollectivePipelinerF8:CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8:CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8:CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8:CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8' \
  --test_output=all \
  --test_timeout=300 \
  2>&1 | tail -100
I0000 00:00:1775356947.890825 3915356 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 5 for CollectiveBFCAllocator.
I0000 00:00:1775356947.890905 3915356 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 6 for BFCAllocator.
I0000 00:00:1775356947.925936 3915356 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 6 for CollectiveBFCAllocator.
I0000 00:00:1775356947.926012 3915356 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 7 for BFCAllocator.
I0000 00:00:1775356947.961116 3915356 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 7 for CollectiveBFCAllocator.
I0000 00:00:1775356947.984081 3915356 tfrt_gpu_client.cc:239] TfrtGpuClient created with 8 / 8 addressable devices.
W0000 00:00:1775356947.985491 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
W0000 00:00:1775356948.178974 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
W0000 00:00:1775356948.258954 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
xla/backends/gpu/tests/collective_ops_e2e_test.cc:117: Failure
Expected equality of these values:
  gemm_op->custom_call_target()
    Which is: "__cublas$lt$matmul"
  "__cublas$lt$matmul$f8"

I0000 00:00:1775356948.580927 3914400 tfrt_gpu_client.cc:263] TfrtGpuClient destroyed.
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8 (1033 ms)
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8
I0000 00:00:1775356948.712805 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 0 for BFCAllocator.
I0000 00:00:1775356948.747897 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1775356948.747981 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 1 for BFCAllocator.
I0000 00:00:1775356948.783173 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 1 for CollectiveBFCAllocator.
I0000 00:00:1775356948.783235 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 2 for BFCAllocator.
I0000 00:00:1775356948.818462 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 2 for CollectiveBFCAllocator.
I0000 00:00:1775356948.818532 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 3 for BFCAllocator.
I0000 00:00:1775356948.853658 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 3 for CollectiveBFCAllocator.
I0000 00:00:1775356948.853739 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 4 for BFCAllocator.
I0000 00:00:1775356948.888880 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 4 for CollectiveBFCAllocator.
I0000 00:00:1775356948.888960 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 5 for BFCAllocator.
I0000 00:00:1775356948.923984 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 5 for CollectiveBFCAllocator.
I0000 00:00:1775356948.924064 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 6 for BFCAllocator.
I0000 00:00:1775356948.959110 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 6 for CollectiveBFCAllocator.
I0000 00:00:1775356948.959188 3915782 gpu_helpers.cc:141] XLA backend allocating 4294967296 bytes on device 7 for BFCAllocator.
I0000 00:00:1775356948.994160 3915782 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 7 for CollectiveBFCAllocator.
I0000 00:00:1775356949.017215 3915782 tfrt_gpu_client.cc:239] TfrtGpuClient created with 8 / 8 addressable devices.
W0000 00:00:1775356949.018960 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
W0000 00:00:1775356949.091653 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
W0000 00:00:1775356949.171884 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
xla/backends/gpu/tests/collective_ops_e2e_test.cc:117: Failure
Expected equality of these values:
  gemm_op->custom_call_target()
    Which is: "__cublas$lt$matmul"
  "__cublas$lt$matmul$f8"

I0000 00:00:1775356949.474531 3914400 tfrt_gpu_client.cc:263] TfrtGpuClient destroyed.
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8 (893 ms)
[----------] 4 tests from CollectiveOpsTestE2EWindowedNonWindowed (19696 ms total)

[----------] 1 test from CollectiveOpsTestE2E
[ RUN      ] CollectiveOpsTestE2E.CollectivePipelinerF8
I0000 00:00:1775356949.604425 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 0 for BFCAllocator.
I0000 00:00:1775356949.605668 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1775356949.605696 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 1 for BFCAllocator.
I0000 00:00:1775356949.606835 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 1 for CollectiveBFCAllocator.
I0000 00:00:1775356949.606855 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 2 for BFCAllocator.
I0000 00:00:1775356949.607992 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 2 for CollectiveBFCAllocator.
I0000 00:00:1775356949.608009 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 3 for BFCAllocator.
I0000 00:00:1775356949.609147 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 3 for CollectiveBFCAllocator.
I0000 00:00:1775356949.609166 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 4 for BFCAllocator.
I0000 00:00:1775356949.610301 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 4 for CollectiveBFCAllocator.
I0000 00:00:1775356949.610316 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 5 for BFCAllocator.
I0000 00:00:1775356949.611551 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 5 for CollectiveBFCAllocator.
I0000 00:00:1775356949.611576 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 6 for BFCAllocator.
I0000 00:00:1775356949.612698 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 6 for CollectiveBFCAllocator.
I0000 00:00:1775356949.612732 3916204 gpu_helpers.cc:141] XLA backend allocating 134217728 bytes on device 7 for BFCAllocator.
I0000 00:00:1775356949.613861 3916204 gpu_helpers.cc:183] XLA backend will use up to 51535413248 bytes on device 7 for CollectiveBFCAllocator.
I0000 00:00:1775356949.636857 3916204 tfrt_gpu_client.cc:239] TfrtGpuClient created with 8 / 8 addressable devices.
W0000 00:00:1775356949.638884 3914400 sharding_propagation.cc:3130] GSPMD sharding propagation is going to be deprecated and not supported in the future. Please consider migrating to Shardy (https://openxla.org/shardy). For reference, Shardy is already the default partitioner in JAX.
xla/backends/gpu/tests/collective_ops_e2e_test.cc:117: Failure
Expected equality of these values:
  gemm_op->custom_call_target()
    Which is: "__cublas$lt$matmul"
  "__cublas$lt$matmul$f8"

I0000 00:00:1775356949.912597 3914400 tfrt_gpu_client.cc:263] TfrtGpuClient destroyed.
[  FAILED  ] CollectiveOpsTestE2E.CollectivePipelinerF8 (438 ms)
[----------] 1 test from CollectiveOpsTestE2E (438 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 2 test suites ran. (20134 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 5 tests, listed below:
[  FAILED  ] CollectiveOpsTestE2E.CollectivePipelinerF8
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8
[  FAILED  ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8

 5 FAILED TESTS
================================================================================
INFO: Found 1 test target...

Tests fixed:

  • CollectiveOpsTestE2E.CollectivePipelinerF8
  • CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8
  • CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8
  • CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8
  • CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8
Output after the fix:

[==========] Running 5 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from CollectiveOpsTestE2E
[ RUN      ] CollectiveOpsTestE2E.CollectivePipelinerF8
[       OK ] CollectiveOpsTestE2E.CollectivePipelinerF8 (2714 ms)
[----------] 1 test from CollectiveOpsTestE2E (2714 ms total)
[----------] 4 tests from CollectiveOpsTestE2EWindowedNonWindowed
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8
[       OK ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EReduceScatterF8 (12528 ms)
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8
[       OK ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherMultiConsumerF8 (1231 ms)
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8
[       OK ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherReshapeF8 (1006 ms)
[ RUN      ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8
[       OK ] CollectiveOpsTestE2EWindowedNonWindowed.WindowedEinsumE2EAllGatherF8 (896 ms)
[----------] 4 tests from CollectiveOpsTestE2EWindowedNonWindowed (15662 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 2 test suites ran. (18377 ms total)
[  PASSED  ] 5 tests.
//xla/backends/gpu/tests:collective_ops_e2e_test_amdgpu_any              PASSED in 35.6s
Executed 1 out of 1 test: 1 test passes.

@phambinhfin phambinhfin requested a review from i-chaochen April 5, 2026 02:57
@phambinhfin phambinhfin added the claude-review Request a Claude AI code review for this PR label Apr 5, 2026
Capability().rocm_compute_capability()->has_nanoo_fp8_support() &&
!Capability().rocm_compute_capability()->has_ocp_fp8_support()) {
return absl::StrReplaceAll(
hlo_text, {{"f8e4m3fn", "f8e4m3fnuz"}, {"f8e5m2", "f8e5m2fnuz"}});

nit: The {"f8e4m3fn", "f8e4m3fnuz"} replacement pair has a subtle prefix-match risk: if an HLO string ever contained a pre-existing f8e4m3fnuz token, absl::StrReplaceAll would match the f8e4m3fn prefix first and produce f8e4m3fnuzuz.

This is safe today because no HLO strings in this file contain fnuz types, and the replacement is gated on NANOO-only hardware. However, the gemm_rewriter_fp8_test.cc reference implementation avoids this by using placeholder tokens (<<F8E4M3>>, <<F8E5M2>>) instead of substring replacement.

Consider adding a brief comment noting this constraint, or switching to the placeholder approach if more FP8 tests are added to this file.
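To make the hazard concrete, here is a minimal self-contained illustration. A plain left-to-right replace loop stands in for `absl::StrReplaceAll`, and the placeholder expansion mirrors the `<<F8E4M3>>` / `<<F8E5M2>>` approach the comment cites from gemm_rewriter_fp8_test.cc (function names here are illustrative):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sequential substring replacement, standing in for absl::StrReplaceAll.
std::string ReplaceAll(
    std::string s,
    const std::vector<std::pair<std::string, std::string>>& subs) {
  for (const auto& [from, to] : subs) {
    std::size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
      s.replace(pos, from.size(), to);
      pos += to.size();
    }
  }
  return s;
}

// Hazard: "f8e4m3fn" is a prefix of "f8e4m3fnuz", so a pre-existing FNUZ
// token gets mangled into "f8e4m3fnuzuz" by the substitution above.

// Placeholder approach: write HLO with tokens that cannot collide with any
// real type name, then expand them per platform.
std::string ExpandPlaceholders(std::string hlo, bool is_nanoo_only) {
  return ReplaceAll(
      std::move(hlo),
      {{"<<F8E4M3>>", is_nanoo_only ? "f8e4m3fnuz" : "f8e4m3fn"},
       {"<<F8E5M2>>", is_nanoo_only ? "f8e5m2fnuz" : "f8e5m2"}});
}
```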


Resolved — addressed in this revision. A comment noting the prefix-match constraint has been added.

Comment on lines +95 to +98
std::string ReplaceFp8Types(absl::string_view hlo_text) {
if (Capability().IsRocm() &&
Capability().rocm_compute_capability()->has_nanoo_fp8_support() &&
!Capability().rocm_compute_capability()->has_ocp_fp8_support()) {

ultra-nit: Capability() is called three times in succession. A local reference would read slightly cleaner:

Suggested change:

  std::string ReplaceFp8Types(absl::string_view hlo_text) {
-   if (Capability().IsRocm() &&
-       Capability().rocm_compute_capability()->has_nanoo_fp8_support() &&
-       !Capability().rocm_compute_capability()->has_ocp_fp8_support()) {
+   const auto& cap = Capability();
+   if (cap.IsRocm() &&
+       cap.rocm_compute_capability()->has_nanoo_fp8_support() &&
+       !cap.rocm_compute_capability()->has_ocp_fp8_support()) {

Negligible perf-wise since this runs once per test, just a readability suggestion.


Resolved — addressed in this revision. Capability() is now cached in a local reference.


claude bot commented Apr 5, 2026

Review Summary

Solid fix. The ReplaceFp8Types() helper correctly targets MI300X (NANOO-only) hardware and is inserted at exactly the two entry points all 5 failing FP8 tests flow through. The HasFp8Support() guard remains correct since it's the union of OCP and NANOO support. CUDA and MI350+ paths are unaffected.

Two minor inline suggestions posted (prefix-match risk note, repeated Capability() calls). Neither is blocking.

🤖 Generated with Claude Code

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 5, 2026
- Store Capability() in a local const reference to avoid repeated calls
- Add comment noting the substring replacement constraint (input HLO
  must not already contain FNUZ type strings)
@phambinhfin phambinhfin added the claude-review Request a Claude AI code review for this PR label Apr 5, 2026

claude bot commented Apr 5, 2026

Re-review Summary

Previous review feedback has been addressed — both issues are now resolved:

  • Prefix-match risk: Warning comment added to document the absl::StrReplaceAll ordering constraint.
  • Capability() caching: Now cached in a const auto& local reference.

No new issues found. The fix is clean, well-scoped, and follows the established pattern from gemm_rewriter_fp8_test.cc. LGTM.

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 5, 2026
Collaborator

@i-chaochen i-chaochen left a comment


IIUC, this test is checking ocp-fp8

lhs = f8e4m3fn[2,16,48]{2,1,0} parameter(0), sharding={devices=[1,4,1]<=[4]}
rhs = f8e4m3fn[48,192]{1,0} parameter(1), sharding={devices=[1,4]<=[4]}

and

lhs = f8e4m3fn[2,16,48]{2,1,0} parameter(0), sharding={devices=[1,4,1]<=[4]}
rhs = f8e4m3fn[2,24,192]{2,1,0} parameter(1), sharding={devices=[1,1,4]<=[4]}

and

lhs = f8e4m3fn[2,16,192]{2,1,0} parameter(0), sharding={devices=[1,1,4]<=[4]}
rhs = f8e4m3fn[192,48]{1,0} parameter(1), sharding={devices=[4,1]<=[4]}

So I don't think it's right to just modify the test cases to NANOO FP8 to get a PASS. The proper way would be to let gfx942 skip these OCP FP8 tests (as OCP FP8 is not supported there, with no impact on gfx950) and to write corresponding NANOO FP8 HLO for gfx942.

// strings are written with OCP types; this replaces them with NANOO types
// when running on MI300X so the GEMM rewriter can produce FP8 custom calls.
// Note: the input HLO must not already contain FNUZ type strings, as the
// substring replacement of "f8e4m3fn" would also match inside "f8e4m3fnuz".
Collaborator


If you want to make a statement like this, please cite the docs as the reference: https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html#id4

Collaborator

@i-chaochen i-chaochen left a comment


Thanks! It's OK to me; please upstream it. Although I was expecting you to make some slight changes to the input HLO so it covers both OCP and NANOO FP8, as that would mean less code change. But let's see whether upstream is OK with this.

@phambinhfin phambinhfin force-pushed the phambinh/fix-collective-ops-fp8-tests branch 4 times, most recently from e0b99d7 to f390a1d Compare April 7, 2026 14:03
Instead of silently replacing OCP FP8 types (f8e4m3fn) with FNUZ types
(f8e4m3fnuz) at runtime via string substitution, properly separate the
tests by FP8 variant:

- Skip OCP FP8 tests on NANOO-only platforms since they do not support
  OCP FP8
  (ref: https://rocm.docs.amd.com/projects/HIP/en/latest/reference/low_fp_types.html#id4)
- Add dedicated FNUZ FP8 test cases with native f8e4m3fnuz HLO for
  NANOO FP8 platforms
- Remove ReplaceFp8Types() helper that changed test semantics
- Add IsNanooFp8Only() helper (ROCm-only, no CUDA changes)

New FNUZ FP8 tests:
- CollectivePipelinerFnuzF8
- WindowedEinsumE2EAllGatherFnuzF8
- WindowedEinsumE2EAllGatherReshapeFnuzF8
- WindowedEinsumE2EAllGatherMultiConsumerFnuzF8
- WindowedEinsumE2EReduceScatterFnuzF8
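Under the approach described in the commit message, the guard might look roughly like the sketch below. The capability struct and its field names are hypothetical stand-ins for the real ROCm capability accessors (`has_nanoo_fp8_support()` / `has_ocp_fp8_support()` quoted earlier in the thread):

```cpp
// Hypothetical stand-in for the device's ROCm compute capability; the real
// code queries Capability().rocm_compute_capability().
struct RocmComputeCapability {
  bool has_nanoo_fp8_support;
  bool has_ocp_fp8_support;
};

// True only on NANOO-only platforms such as MI300X (gfx942), which support
// NANOO FP8 but not OCP FP8. MI350+ (gfx950) supports OCP and returns false.
bool IsNanooFp8Only(const RocmComputeCapability& cc) {
  return cc.has_nanoo_fp8_support && !cc.has_ocp_fp8_support;
}

// In the OCP FP8 tests this guard would be used roughly as:
//   if (IsNanooFp8Only(cc)) {
//     GTEST_SKIP() << "OCP FP8 is not supported on NANOO-only hardware";
//   }
// while the dedicated *FnuzF8 tests run only when IsNanooFp8Only() is true
// and use HLO written natively with f8e4m3fnuz.
```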
@phambinhfin phambinhfin force-pushed the phambinh/fix-collective-ops-fp8-tests branch from f390a1d to 5cc99ea Compare April 7, 2026 14:05
@phambinhfin phambinhfin added the claude-review Request a Claude AI code review for this PR label Apr 7, 2026

claude bot commented Apr 7, 2026

Re-review Summary

Previous Claude comments (prefix-match risk with StrReplaceAll, repeated Capability() calls) are both resolved — the ReplaceFp8Types() helper has been removed entirely in favor of separate FNUZ test functions with a simple IsNanooFp8Only() guard.

The new approach is straightforward and correct: all five FNUZ HLO modules are faithful type substitutions (f8e4m3fn -> f8e4m3fnuz) of their OCP counterparts, with shapes, layouts, sharding, and scale-parameter semantics preserved. No new issues found.

@github-actions github-actions bot removed the claude-review Request a Claude AI code review for this PR label Apr 7, 2026
@phambinhfin
Author

@i-chaochen
Upstream: openxla#40490
