[Common] Fix incorrect amax initialization in non-RHT NVFP4 C++ tests #2943

Merged
timmoon10 merged 8 commits into NVIDIA:main from Oleg-Goncharov:pr_nvfp4_cpp_tests_fix
May 1, 2026

Conversation

@Oleg-Goncharov
Collaborator

Description

This PR fixes the C++ test infrastructure for the non-RHT NVFP4 path.

Previously, the test flow populated the unused scale field instead of amax, which caused incorrect CPU/GPU comparisons in the non-RHT coverage. This change updates the test setup to initialize amax correctly.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Fix incorrect amax initialization
  • No kernel changes

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Oleg-Goncharov and others added 5 commits April 30, 2026 00:02
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR fixes the NVFP4 non-RHT C++ test path where the output tensor's scale field was incorrectly populated instead of amax, causing CPU/GPU reference comparisons to fail. The fix realigns the NVTE_NVFP4_1D_SCALING path in test_common to allocate and sync rowwise and columnwise amax buffers (rather than a scale buffer), and updates performTest to set a stable golden amax (448 × 6 × 8 = 21504) via the new setters. As a bonus, the destructor now frees the previously-leaked amax, columnwise_amax, and scale GPU allocations.
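
The golden amax arithmetic can be checked with a short sketch. The factors 448 and 6 match the largest finite FP8 E4M3 and FP4 E2M1 values. The helper `global_encode_scale` and its formula (`fp8_max * fp4_max / amax`) are assumptions made here for illustration and are not taken from this PR; under that assumption, the golden amax would give a global scale of exactly 1/8, a power of two, which may be why the value is described as stable.

```cpp
#include <cassert>

// Largest finite FP8 E4M3 value and largest FP4 E2M1 value.
constexpr float kFp8E4M3Max = 448.0f;
constexpr float kFp4E2M1Max = 6.0f;

// Golden amax as stated in the summary: 448 * 6 * 8.
constexpr float kGoldenAmax = kFp8E4M3Max * kFp4E2M1Max * 8.0f;
static_assert(kGoldenAmax == 21504.0f, "golden amax should be 21504");

// ASSUMPTION: the global encode scale is (fp8_max * fp4_max) / amax.
// Hypothetical helper for illustration only, not the PR's code.
inline float global_encode_scale(float amax) {
  assert(amax > 0.0f);
  return (kFp8E4M3Max * kFp4E2M1Max) / amax;
}
```

With these definitions, `global_encode_scale(kGoldenAmax)` evaluates to `0.125f`, which is exactly representable in single precision.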

Confidence Score: 5/5

Safe to merge — targeted test-infrastructure fix with no kernel changes and correct memory lifecycle.

All three files contain narrow, well-scoped changes. The allocation/free symmetry in the new constructor + destructor is correct (NVTE_DELAYED_TENSOR_SCALING gets amax + scale freed; NVTE_NVFP4_1D_SCALING gets amax + columnwise_amax freed; scale is null for NVFP4 so no double-free). The from_cpu/to_cpu paths for NVFP4 correctly gate on rowwise_/columnwise_ flags. No production code is touched.
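
The allocation/free symmetry described above can be modelled on the host as a rough sketch. Everything here is hypothetical: `AmaxBuffers`, `ScalingMode`, and the member names are illustrative stand-ins, and `std::malloc`/`std::free` replace `cudaMalloc`/`cudaFree` so the example runs without a GPU. It is not the test suite's actual `Tensor` class.

```cpp
#include <cassert>
#include <cstdlib>

// Stand-in for the two scaling modes named in the review.
enum class ScalingMode { DelayedTensorScaling, Nvfp4_1dScaling };

// Hypothetical host-only model of the buffer lifecycle.
// std::malloc/std::free stand in for cudaMalloc/cudaFree.
struct AmaxBuffers {
  float* amax = nullptr;
  float* columnwise_amax = nullptr;
  float* scale = nullptr;

  explicit AmaxBuffers(ScalingMode mode) {
    amax = alloc_one();
    if (mode == ScalingMode::DelayedTensorScaling) {
      scale = alloc_one();            // delayed scaling: amax + scale
    } else {
      columnwise_amax = alloc_one();  // NVFP4: amax + columnwise amax, no scale
    }
  }

  // Free every buffer; pointers that were never allocated stay null,
  // and freeing a null pointer is a no-op, so nothing is freed twice.
  ~AmaxBuffers() {
    std::free(amax);
    std::free(columnwise_amax);
    std::free(scale);
  }

  // Owning raw pointers: forbid copies to avoid double frees.
  AmaxBuffers(const AmaxBuffers&) = delete;
  AmaxBuffers& operator=(const AmaxBuffers&) = delete;

 private:
  static float* alloc_one() {
    float* p = static_cast<float*>(std::malloc(sizeof(float)));
    assert(p != nullptr);
    *p = 0.0f;
    return p;
  }
};
```

In this model an NVFP4 instance holds `amax` and `columnwise_amax` with `scale` left null, while a delayed-scaling instance holds `amax` and `scale`, mirroring the pairing the review describes.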

No files require special attention.

Important Files Changed

Filename: Overview
  • tests/cpp/operator/test_cast_nvfp4_transpose.cu: Replaces dynamic amax computation + set_scale with a hardcoded golden amax and set_tensor_amax/set_tensor_amax_columnwise, fixing the root cause of the CPU/GPU comparison mismatch.
  • tests/cpp/test_common.cu: Switches NVTE_NVFP4_1D_SCALING from allocating a scale buffer to allocating rowwise and columnwise amax buffers, and adds proper CPU↔GPU sync paths for them in to_cpu()/from_cpu().
  • tests/cpp/test_common.h: Adds amax_cpu_data_columnwise_ member, amax_columnwise() accessor, set_tensor_amax/set_tensor_amax_columnwise setters, and extends the destructor to free amax, columnwise_amax, and scale GPU allocations (fixing a pre-existing memory leak).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Tensor constructor\nNVTE_NVFP4_1D_SCALING] --> B[cudaMalloc amax\ncudaMalloc amax_columnwise]
    B --> C[tensor_.set_amax\ntensor_.set_columnwise_amax]
    C --> D[set_tensor_amax\nset_tensor_amax_columnwise]
    D --> E[from_cpu\ncopies amax to GPU]
    E --> F[GPU kernel runs\nreads amax, writes scale_inv & data]
    F --> G[to_cpu\ncopies amax, scale_inv & data back]
    G --> H[compare_nvfp4_tensors\nCPU vs GPU reference check]
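
The amax part of this flow can be sketched with a host-only mock. `MockNvfp4Tensor` and its fields are hypothetical; the `*_dev` members stand in for GPU buffers and `std::memcpy` stands in for `cudaMemcpy`. The setter names follow the ones listed in the file overview, but their real signatures are not shown in this PR page, and whether the real setters trigger the copy themselves is not stated, so the mock keeps the copy as an explicit `from_cpu()` call.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical host-only model of the amax sync path.
struct MockNvfp4Tensor {
  bool rowwise;
  bool columnwise;
  float amax_cpu = 0.0f;
  float amax_cpu_columnwise = 0.0f;
  float amax_dev = 0.0f;             // stand-in for the device amax buffer
  float amax_dev_columnwise = 0.0f;  // stand-in for the device columnwise amax buffer

  MockNvfp4Tensor(bool rowwise_, bool columnwise_)
      : rowwise(rowwise_), columnwise(columnwise_) {}

  // Setters update the host copies only.
  void set_tensor_amax(float v) { amax_cpu = v; }
  void set_tensor_amax_columnwise(float v) { amax_cpu_columnwise = v; }

  // Host -> "device", gated on which usages the tensor actually has.
  void from_cpu() {
    if (rowwise) std::memcpy(&amax_dev, &amax_cpu, sizeof(float));
    if (columnwise) std::memcpy(&amax_dev_columnwise, &amax_cpu_columnwise, sizeof(float));
  }

  // "Device" -> host, with the same gating.
  void to_cpu() {
    if (rowwise) std::memcpy(&amax_cpu, &amax_dev, sizeof(float));
    if (columnwise) std::memcpy(&amax_cpu_columnwise, &amax_dev_columnwise, sizeof(float));
  }
};
```

For a rowwise-only mock, setting both amax values to 21504 and calling `from_cpu()` updates only `amax_dev`; the columnwise "device" value is left untouched because the copy is gated on the flag.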

Reviews (4): Last reviewed commit: "Fixed memory leakage"

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@Oleg-Goncharov requested review from ptrendx and timmoon10, then removed the request for ptrendx, April 30, 2026 17:18
timmoon10 previously approved these changes Apr 30, 2026
Collaborator

@timmoon10 left a comment

This is fine as a quick fix. I think the C++ testing infrastructure has gotten to the point where it is too messy to really trust. The recipe-specific logic is complicated and has many unhandled edge cases.

Comment thread tests/cpp/test_common.cu
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@Oleg-Goncharov
Collaborator Author

/te-ci

@timmoon10 merged commit 4fafdf2 into NVIDIA:main on May 1, 2026
11 of 14 checks passed