
Conversation

@alextmagro
Contributor

Description

Added a combined check for scale and values.
If a scale difference is detected, the values are checked for an off-by-one boundary condition.
A mismatch tolerance allows a small fraction of scale differences (1% by default) before an error is thrown.

Additionally, issues with dgelu CPU-side jitter have been resolved with data generation fixes.
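The tolerance logic described above can be sketched roughly as follows. This is a minimal illustration: `check_scales` and its parameters are hypothetical names, and the real test compares MXFP8 `scale_inv` tensors rather than plain `int` vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <tuple>
#include <vector>

// Sketch of the combined check: an off-by-one scale difference is
// recorded as a candidate mismatch instead of failing immediately; the
// check fails only if the mismatch ratio exceeds tol (1% by default in
// the PR). Larger scale differences always fail.
bool check_scales(const std::vector<int> &test_scales,
                  const std::vector<int> &ref_scales,
                  double tol,
                  std::vector<std::tuple<size_t, int>> &mismatch_idx) {
  for (size_t i = 0; i < test_scales.size(); ++i) {
    const int diff = ref_scales[i] - test_scales[i];
    if (diff == 0) continue;
    if (std::abs(diff) == 1) {
      // Boundary case: remember the index so the values can be re-checked.
      mismatch_idx.emplace_back(i, diff);
    } else {
      return false;  // |diff| > 1 is always an error
    }
  }
  return mismatch_idx.size() <= tol * test_scales.size();
}
```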

@alextmagro
Contributor Author

I have added a refactor in the 2nd commit -- I am unsure if it is cleaner this way around, but both versions solve the issue. @wenchenvincent and @ipanfilo, please have a look at both and let me know which one looks better.

const size_t row_blocks, const size_t col_blocks, const size_t stride,
double tol, bool rowwise, std::vector<std::tuple<size_t, size_t, int>> &mismatch_idx) {
constexpr bool on_gpus = true;
if (on_gpus) output.to_cpu();
Collaborator

Getting cpu_scale_inv_ptr() below already performs to_cpu(), so the explicit call here is redundant.

if (std::abs(t_scale - r_scale) == 1) {
mismatch_idx.emplace_back(i, j, r_scale-t_scale);
} else {
ASSERT_FALSE(1) << "Error in " << name << std::endl
Collaborator

You can use GTEST_FAIL() instead of ASSERT_FALSE(1)

for (; ii_min < ii_max; ii_min++) {
size_t jj_min = j * row_blocks;
const size_t jj_max = std::min(jj_min + row_blocks, cols);
for (; jj_min < jj_max; jj_min++) {
Collaborator

Why do we have ii and jj nested loops here? One scale value refers to 32 items either in a row or in a column, but not in both.

Contributor Author

Either row_blocks or col_blocks is always 1, so we are doing the logic in one direction or the other.

Collaborator

So one of the loops is guaranteed to run exactly once, which means either col_blocks or row_blocks is always 1, right?

Contributor Author

That's right.
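The traversal being discussed can be sketched as below. This assumes the ii bounds mirror the jj bounds shown in the snippet (they are not visible in the quote), and `block_indices` is a hypothetical helper, not code from the PR: because exactly one of col_blocks and row_blocks is 1, the nested loops always walk a 1-D segment of the tensor, never a 2-D tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Collect the flat indices covered by one scale value. One of
// row_blocks/col_blocks is 32 and the other is 1, so this yields either
// a horizontal run (rowwise) or a vertical run (columnwise).
std::vector<size_t> block_indices(size_t i, size_t j,
                                  size_t row_blocks, size_t col_blocks,
                                  size_t rows, size_t cols) {
  std::vector<size_t> idx;
  size_t ii_min = i * col_blocks;
  const size_t ii_max = std::min(ii_min + col_blocks, rows);
  for (; ii_min < ii_max; ii_min++) {
    size_t jj_min = j * row_blocks;
    const size_t jj_max = std::min(jj_min + row_blocks, cols);
    for (; jj_min < jj_max; jj_min++) {
      idx.push_back(ii_min * cols + jj_min);
    }
  }
  return idx;
}
```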

Collaborator

@wangye805 left a comment

LGTM

Please merge after getting an LGTM from Ilya as well.

const size_t jj_max = std::min(jj_min + row_blocks, cols);
for (; jj_min < jj_max; jj_min++) {
const size_t data_idx = ii_min * cols + jj_min;
if (scale_diff == 1) {
Collaborator

Maybe move the if out of the loops by making float scale_value 2.0 or 0.5?
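The suggestion above, sketched on a hypothetical adjust_ref that works on plain floats (the real code casts through static_cast&lt;T&gt;): resolve scale_diff to a single multiplier once, before the loops, instead of branching per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hoist the scale_diff branch out of the element loop: a +1 scale
// difference doubles the reference data, a -1 difference halves it.
// scale_diff is +/-1 here; the caller rejects anything else.
void adjust_ref(std::vector<float> &ref_data, int scale_diff) {
  const float scale_value = (scale_diff == 1) ? 2.0f : 0.5f;
  for (size_t i = 0; i < ref_data.size(); ++i) {
    ref_data[i] *= scale_value;
  }
}
```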

} else if (scale_diff == -1) {
ref_data[data_idx] = static_cast<T>(static_cast<float>(ref_data[data_idx])/2);
} else { // Shouldn't ever reach this
ASSERT_FALSE(1) << "Error in adjust_ref, |scale_diff| > 1";
Collaborator

GTEST_FAIL() too?

Collaborator

@wenchenvincent left a comment

LGTM. Let's wait for CI before merging.

@alextmagro alextmagro merged commit b092058 into dev Oct 31, 2025
6 checks passed
@alextmagro alextmagro deleted the mxfp8_cast_test_fix branch October 31, 2025 15:43
ipanfilo pushed a commit that referenced this pull request Nov 8, 2025
* MXFP8 test scale off by 1 fix