Modifying the NVFP4 block quantization kernel to handle the single le…#5680

Merged
jjsjann123 merged 3 commits into main from pbasu/nvfp4_onelevel
Dec 18, 2025

Conversation

@protonu
Collaborator

@protonu protonu commented Dec 15, 2025

The quantization math for NVFP4 needs a minor correction when there's no global scale.
This PR also updates the existing C++ test to reflect that.
We can't test this against TE yet, since TE has no implementation without the global scale.

@protonu protonu requested a review from jjsjann123 December 15, 2025 17:36
@github-actions

github-actions bot commented Dec 15, 2025

Review updated until commit c9b7cf1

Description

  • Fix NVFP4 block quantization kernel math for single-level quantization (no global scale)

  • Add reciprocal operation when global scale is not used in both kernel and test

  • Correct quantization scaling factor calculation for standalone block quantization

  • Update C++ test to validate the fixed single-level quantization behavior

Changes walkthrough

Relevant files

Bug fix

runtime/block_quantization_kernels.cu: Fix NVFP4 kernel quantization math for single-level quantization

  • Add else branch to handle single-level quantization by taking
    reciprocal of scaled_max
  • Fix quantization math when USE_GLOBAL_SCALE is false
  • Ensure proper scaling factor calculation without global scale
  • +2/-0

tests/cpp/test_low_precision_recipe.cpp: Update NVFP4 test for single-level quantization fix

  • Add else branch to handle single-level quantization by taking
    reciprocal of block scale
  • Update test to validate fixed quantization behavior without global
    scale
  • Ensure test covers both global scale and single-level quantization
    cases
  • +2/-0
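The shape of the two-line fix described above can be sketched as follows. This is a host-side illustration, not the PR's actual kernel code; the function name `final_scale` and its surrounding structure are assumptions, with only the `else` branch's reciprocal being what the PR adds:

```cpp
#include <cassert>

// Sketch of the branch the PR fixes: computing the final scaling factor
// from scaled_max (the block max already quantized through FP8 E4M3).
// Template/function names are illustrative, not from the actual file.
template <bool USE_GLOBAL_SCALE>
float final_scale(float scaled_max, float global_scale) {
  if constexpr (USE_GLOBAL_SCALE) {
    return global_scale / scaled_max;  // two-level path (pre-existing)
  } else {
    return 1.0f / scaled_max;          // single-level path (the fix)
  }
}
```

Before the fix, the single-level path left `scaled_max` as-is, so data was multiplied by the block maximum instead of its reciprocal.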

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Numerical Stability

The reciprocal operation 1.0f / scaled_max could cause numerical issues if scaled_max is zero or very close to zero. Consider adding bounds checking or using a safe reciprocal function to prevent division by zero or overflow.

    scaled_max = 1.0f / scaled_max;
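One way to implement the guard the reviewer suggests, as a sketch: the helper name `safe_reciprocal` and the epsilon threshold below are assumptions, not code from the PR.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical guard: avoid dividing by a (near-)zero block scale.
// The threshold is illustrative; a real kernel would pick one based on
// the FP8 E4M3 dynamic range of the quantized scale.
inline float safe_reciprocal(float scaled_max) {
  const float kMinDenom = 1e-10f;  // assumed lower bound, not from the PR
  if (std::fabs(scaled_max) < kMinDenom) {
    return 0.0f;  // treat an all-zero block as scale 0 instead of inf
  }
  return 1.0f / scaled_max;
}
```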
Test Coverage

While the test is updated to handle the no-global-scale case, verify that the test actually exercises this code path by setting use_global_scale to false in the test cases. Consider adding specific test cases that validate the mathematical correctness of the reciprocal operation.

    } else {
      tv_block_scale_fp32 = reciprocal(tv_block_scale_fp32);
    }

Test failures

• (Medium, 1) CUDA OOM in nvFuser TmaPointwiseTestF.SplitGridDim2D (runner dlcluster_h100)

  Test Name: TmaPointwiseTestF.SplitGridDim2D | Runner: H100 | Source: Link

@greptile-apps
Contributor

greptile-apps bot commented Dec 15, 2025

Greptile Summary

This PR fixes a mathematical bug in the NVFP4 block quantization kernel for single-level quantization (when no global scale is provided).

Key Changes:

• Added the missing reciprocal calculation (scaled_max = 1.0f / scaled_max) in the single-level quantization path in runtime/block_quantization_kernels.cu:133-135
• Updated the C++ test fusion in tests/cpp/test_low_precision_recipe.cpp:177-179 to match the corrected kernel logic using the reciprocal() operation
• The fix ensures mathematical consistency between the single-level and two-level quantization paths

Technical Context:
The quantization process computes a block scaling factor by:

1. Finding the max value in a block
2. Scaling it down by 1/6 (since the FP4 max is 6.0)
3. Converting to FP8 and back to quantize the scale itself
4. Computing the reciprocal to get the final scaling factor for data quantization

The bug was that step 4 was missing for the single-level case, while it was correctly implemented for the two-level case (line 132). This PR adds the symmetrical logic for single-level quantization.
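The four steps above can be mimicked on the host as a sketch. The `__float2e4m3`/`__e4m32float` round trip is a device intrinsic; the crude mantissa-rounding stand-in below, and both function names, are illustrative assumptions rather than the PR's actual code:

```cpp
#include <cassert>
#include <cmath>

// Crude stand-in for the FP8 E4M3 round trip: keep ~3 mantissa bits.
// (Real E4M3 also clamps the exponent range; omitted for brevity.)
float round_to_e4m3_approx(float x) {
  if (x == 0.0f) return 0.0f;
  int e;
  float m = std::frexp(x, &e);        // x = m * 2^e, 0.5 <= |m| < 1
  m = std::round(m * 16.0f) / 16.0f;  // round mantissa to 3 bits
  return std::ldexp(m, e);
}

// Block scaling factor per the four steps: scale the block max by 1/6
// (FP4 max magnitude is 6.0), quantize the scale via FP8, then take
// the reciprocal (the step the PR adds for the single-level case).
float block_scaling_factor(float block_max, bool use_global_scale,
                           float global_scale = 1.0f) {
  float scaled_max = use_global_scale
      ? block_max * global_scale * (1.0f / 6.0f)
      : block_max * (1.0f / 6.0f);
  scaled_max = round_to_e4m3_approx(scaled_max);  // quantize the scale itself
  if (use_global_scale) {
    scaled_max = global_scale / scaled_max;  // two-level path (pre-existing)
  } else {
    scaled_max = 1.0f / scaled_max;          // single-level path (the fix)
  }
  return scaled_max;
}
```

For a block whose max is exactly 6.0 and no global scale, the block scale quantizes to 1.0 and the data scaling factor is 1.0, as expected.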

Confidence Score: 5/5

• This PR is safe to merge with minimal risk
• The change fixes a clear mathematical bug in the single-level NVFP4 quantization path by adding the missing reciprocal operation. The fix mirrors the existing two-level quantization logic and is correctly reflected in both the CUDA kernel and the C++ test code. The change is minimal, focused, and mathematically sound.
• No files require special attention

Important Files Changed

runtime/block_quantization_kernels.cu: Added missing reciprocal calculation for single-level NVFP4 quantization when no global scale is used
tests/cpp/test_low_precision_recipe.cpp: Updated test to match the corrected runtime kernel by adding a reciprocal operation for single-level quantization

Sequence Diagram

    sequenceDiagram
        participant Input as Input Data (FP32/BF16)
        participant Kernel as block_quantize_to_nvfp4
        participant MaxCalc as Max Calculation
        participant FP8Conv as FP8 Conversion
        participant Scale as Scale Computation
        participant Quant as Quantization
    
        Input->>Kernel: Array of values (ITEMS_PER_THREAD)
        Kernel->>MaxCalc: Convert to float & compute local max
        MaxCalc->>MaxCalc: Reduce across threads (16/ITEMS_PER_THREAD)
        MaxCalc-->>Kernel: block_max
        
        alt USE_GLOBAL_SCALE = true
            Kernel->>FP8Conv: scaled_max = block_max * global_scale * (1/6)
        else USE_GLOBAL_SCALE = false
            Kernel->>FP8Conv: scaled_max = block_max * (1/6)
        end
        
        FP8Conv->>FP8Conv: clamped_max_fp8 = __float2e4m3(scaled_max)
        FP8Conv->>FP8Conv: scaled_max = __e4m32float(clamped_max_fp8)
        FP8Conv-->>Scale: quantized block scale
        
        alt USE_GLOBAL_SCALE = true
            Scale->>Scale: scaled_max = global_scale / scaled_max
        else USE_GLOBAL_SCALE = false (NEW)
            Scale->>Scale: scaled_max = 1.0 / scaled_max
        end
        
        Scale-->>Quant: scaling factor (scaled_max)
        Quant->>Quant: Write block_scales to global memory
        Quant->>Quant: Scale input values by scaled_max
        Quant->>Quant: Convert to FP4: __float2e2m1
        Quant-->>Kernel: Quantized FP4 output
    

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@protonu
Collaborator Author

protonu commented Dec 15, 2025

!test

@protonu
Collaborator Author

protonu commented Dec 16, 2025

!test

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@protonu
Collaborator Author

protonu commented Dec 17, 2025

!test

@jjsjann123 jjsjann123 merged commit 250709a into main Dec 18, 2025
57 of 61 checks passed
@jjsjann123 jjsjann123 deleted the pbasu/nvfp4_onelevel branch December 18, 2025 22:59