-
Notifications
You must be signed in to change notification settings - Fork 23
MXFP8 test scale off by 1 fix #338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I have added a refactor in the 2nd commit -- I am unsure if it is cleaner this way around, but both solve the issue. @wenchenvincent and @ipanfilo , please have a look at both versions and let me know which one looks better |
tests/cpp/test_common.cu
Outdated
| const size_t row_blocks, const size_t col_blocks, const size_t stride, | ||
| double tol, bool rowwise, std::vector<std::tuple<size_t, size_t, int>> &mismatch_idx) { | ||
| constexpr bool on_gpus = true; | ||
| if (on_gpus) output.to_cpu(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getting cpu_scale_inv_ptr() below performs to_cpu()
tests/cpp/test_common.cu
Outdated
| if (std::abs(t_scale - r_scale) == 1) { | ||
| mismatch_idx.emplace_back(i, j, r_scale-t_scale); | ||
| } else { | ||
| ASSERT_FALSE(1) << "Error in " << name << std::endl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use GTEST_FAIL() instead of ASSERT_FALSE(1)
tests/cpp/test_common.h
Outdated
| for (; ii_min < ii_max; ii_min++) { | ||
| size_t jj_min = j * row_blocks; | ||
| const size_t jj_max = std::min(jj_min + row_blocks, cols); | ||
| for (; jj_min < jj_max; jj_min++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have ii and jj nested loops here? One scale value refers to 32 items either in row or in col but not in both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either row_blocks or col_blocks is always 1, so we are doing the logic in one direction or the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So one of loops is guaranteed to be 1, which means either of col_blocks or row_blocks is always 1, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
please merge after getting LGTM from Ilya as well
tests/cpp/test_common.h
Outdated
| for (; ii_min < ii_max; ii_min++) { | ||
| size_t jj_min = j * row_blocks; | ||
| const size_t jj_max = std::min(jj_min + row_blocks, cols); | ||
| for (; jj_min < jj_max; jj_min++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So one of loops is guaranteed to be 1, which means either of col_blocks or row_blocks is always 1, right?
tests/cpp/test_common.h
Outdated
| const size_t jj_max = std::min(jj_min + row_blocks, cols); | ||
| for (; jj_min < jj_max; jj_min++) { | ||
| const size_t data_idx = ii_min * cols + jj_min; | ||
| if (scale_diff == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be move IF out of the loops by making float scale_value 2.0 or 0.5
tests/cpp/test_common.h
Outdated
| } else if (scale_diff == -1) { | ||
| ref_data[data_idx] = static_cast<T>(static_cast<float>(ref_data[data_idx])/2); | ||
| } else { // Shouldn't ever reach this | ||
| ASSERT_FALSE(1) << "Error in adjust_ref, |scale_diff| > 1"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GTEST_FAIL() too?
wenchenvincent
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's wait for CI before merging.
* MXFP8 test scale off by 1 fix
Description
Added a combined check for scale and values
If an scale difference is detected, values are checked for an off by one boundary condition
mismatch tolerance allows for a few scale differences (1% by default), before throwing an error.
Additionally, issues with dgelu CPU-side jitter have been resolved with data generation fixes.