skip test cutlass mxfp8_gemm on unsupported arches #5810

Merged

liqiangxl merged 2 commits into main from llu/skip_unsupported_arches_mxfp8_gemm on Jan 13, 2026
Conversation

@liqiangxl (Collaborator) commented Jan 13, 2026

Error message on unsupported hardware: Exception raised from runGemm at /opt/pytorch/nvfuser/cutlass/mxfp8_scaled_mm.cu:262

@github-actions Bot commented

Description

  • Replace manual compute capability check with microarchitecture_is utility function

  • Add import for microarchitecture_is from python.direct_utils module

  • Maintain same test skipping behavior for unsupported architectures (compute capability < 10.0)

  • Simplify architecture detection logic using centralized utility
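
For reference, here is a minimal sketch of what the microarchitecture_is helper does, as inferred from the sequence diagrams in the reviews below (the actual implementation lives in python/direct_utils; this is an illustration, not the library source):

    import torch

    def microarchitecture_is(major: int, minor: int) -> bool:
        # Read the current device's properties and compare the compute
        # capability exactly (major == X and minor == Y).
        props = torch.cuda.get_device_properties(torch.cuda.current_device())
        return props.major == major and props.minor == minor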

Changes walkthrough

Relevant files

Enhancement: tests/python/direct/test_cutlass_mxfp8_gemm.py (+2/-2)
Replace compute capability check with utility function

  • Replace manual compute capability check with the microarchitecture_is
    utility function
  • Add import statement for microarchitecture_is from python.direct_utils
  • Maintain the same test-skipping behavior for unsupported architectures
    (compute capability < 10.0)
  • Simplify architecture detection logic using the centralized utility

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Logic Equivalence

    Verify whether microarchitecture_is(10, 0) is equivalent to the original condition compute_cap < (10, 0) or compute_cap >= (12, 0). To be equivalent, the new condition would have to allow compute capabilities 10.x and 11.x, excluding < 10.0 and >= 12.0.

        if not microarchitecture_is(10, 0):

    @liqiangxl requested a review from jacobhinkle on January 13, 2026 14:41

    @greptile-apps Bot (Contributor) commented Jan 13, 2026

    Greptile Overview

    Greptile Summary

    Changes

    This PR refactors the architecture compatibility check in test_cutlass_mxfp8_gemm.py to use the microarchitecture_is utility function from direct_utils instead of directly calling torch.cuda.get_device_capability().

    Before:

    compute_cap = torch.cuda.get_device_capability()
    if compute_cap < (10, 0) or compute_cap >= (12, 0):
        pytest.skip(...)

    After:

    if not microarchitecture_is(10, 0):
        pytest.skip(...)

    Key Observations

    1. Logic Change: The original check allowed a range of compute capabilities [10.0, 12.0), which included architectures like 10.1, 10.2, 10.3, and 11.x. The new check using microarchitecture_is(10, 0) only allows exactly compute capability 10.0 (major==10 AND minor==0), making it more restrictive.

    2. Consistency with Codebase: The pattern matches test_cutlass_gemm.py which also uses microarchitecture_is(10, 0) to restrict tests to exactly Blackwell 10.0.

    3. Import Added: The change adds an import for microarchitecture_is from python.direct_utils, consistent with other test files in the repository.

    Impact

    If MxFP8 GEMM operations were intended to work on architectures 10.1-10.3 (e.g., B300/GB300 at 10.3) or 11.x, this change will skip those tests. If the restriction to exactly 10.0 is intentional for safety/testing reasons, this aligns with the conservative approach used in test_cutlass_gemm.py.

    Confidence Score: 3/5

    • Safe to merge with awareness of narrowed architecture support compared to original logic
    • The code change is syntactically correct and follows established patterns in the codebase. However, there's a semantic difference: the original logic supported a range [10.0, 12.0) while the new logic only supports exactly 10.0. This could be intentional (conservative approach matching test_cutlass_gemm.py) or unintentional (overlooking that the original allowed 10.x and 11.x). The change won't break existing tests on 10.0 hardware, but may skip tests on 10.1+ and 11.x that previously ran.
    • tests/python/direct/test_cutlass_mxfp8_gemm.py - Verify the narrowed architecture support (only 10.0) is intentional vs the original range [10.0, 12.0)

    Important Files Changed

    File Analysis

    File: tests/python/direct/test_cutlass_mxfp8_gemm.py (score: 3/5)
    Overview: Changes the architecture check from the range [10.0, 12.0) to exactly 10.0 only, which is more restrictive than the original logic and may skip tests on valid architectures like 10.1, 10.2, 10.3, and 11.x.

    Sequence Diagram

    sequenceDiagram
        participant Test as test_cutlass_mxfp8_gemm.py
        participant DirectUtils as python.direct_utils
        participant CUDA as torch.cuda
        participant Pytest as pytest
    
        Note over Test: Module load time
        Test->>DirectUtils: import microarchitecture_is
        Test->>DirectUtils: microarchitecture_is(10, 0)
        DirectUtils->>CUDA: get_device_properties(current_device())
        CUDA-->>DirectUtils: device properties (major, minor)
        DirectUtils->>DirectUtils: Check: major == 10 AND minor == 0
        alt Architecture is NOT exactly 10.0
            DirectUtils-->>Test: False
            Test->>Pytest: pytest.skip(allow_module_level=True)
            Note over Test,Pytest: Module skipped - tests won't run
        else Architecture is exactly 10.0
            DirectUtils-->>Test: True
            Note over Test: Continue loading module
            Test->>DirectUtils: import linear_to_swizzled_128_4, swizzled_to_linear_128_4
            Note over Test: Tests will run
        end
    

    @greptile-apps Bot (Contributor) left a comment

    1 file reviewed, 1 comment


    - compute_cap = torch.cuda.get_device_capability()
    - if compute_cap < (10, 0) or compute_cap >= (12, 0):
    + if not microarchitecture_is(10, 0):

    [P1] Logic change narrows supported architectures more than original

    The original condition compute_cap < (10, 0) or compute_cap >= (12, 0) would skip tests outside the range [10.0, 12.0), meaning it would allow compute capabilities like 10.1, 10.2, 10.3, 11.0, 11.5, etc.

    The new condition not microarchitecture_is(10, 0) only allows exactly 10.0 (major==10 AND minor==0), and skips everything else including 10.1, 10.2, 10.3, and all 11.x architectures.

    If the intent was to maintain the original behavior while using the utility function, consider using a range check or adding support for additional architectures that were previously supported (10.1-10.3, 11.x). If restricting to exactly 10.0 is intentional, this is fine but represents a significant behavior change from the original code.
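
    For illustration, a hedged sketch of a module-level skip that would preserve the original [10.0, 12.0) range, should that turn out to be the intent (the skip-reason wording here is an assumption):

        import pytest
        import torch

        compute_cap = torch.cuda.get_device_capability()
        # Skip unless the capability falls in the previously supported range [10.0, 12.0).
        if not ((10, 0) <= compute_cap < (12, 0)):
            pytest.skip(
                reason="MxFp8 requires compute capability in [10.0, 12.0).",
                allow_module_level=True,
            )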

    @jacobhinkle (Collaborator) left a comment

    LGTM

    @greptile-apps Bot (Contributor) commented Jan 13, 2026

    Greptile Overview

    Greptile Summary

    This PR updates the architecture compatibility check for MxFp8 GEMM tests, restricting execution to compute capability 10.0 only. This follows the pattern established in PR #5757 for test_cutlass_gemm.py.

    Key Changes

    • Replaced direct torch.cuda.get_device_capability() check with microarchitecture_is(10, 0) helper function from python.direct_utils
    • Changed from range-based check (10.0 ≤ cc < 12.0) to exact match (cc == 10.0)
    • This restricts the test to only Blackwell 10.0 architectures (B200, GB200), excluding untested variants like 10.3 (B300, GB300) and 12.x

    Behavioral Change

    Before: Test ran on any compute capability from 10.0 up to (but excluding) 12.0
    After: Test runs only on exactly compute capability 10.0

    This is a conservative approach that prevents potential failures on untested architectures, consistent with the test_cutlass_gemm.py pattern.

    Observations

    1. The error message could be more specific about requiring exactly 10.0 rather than just "10"
    2. Missing explanatory comments about tested/untested architectures (present in test_cutlass_gemm.py)
    3. test_cutlass_nvfp4_gemm.py still uses the old pattern and may benefit from similar updates for consistency

    Confidence Score: 4/5

    • This PR is safe to merge with minor documentation improvements recommended
    • The change follows the established pattern from PR #5757 (Skip un-supported arches test_grouped_mm) and correctly implements the architecture restriction. The logic is sound and the helper function is properly imported. The score of 4 (rather than 5) reflects that the error message could be clearer and documentation comments would improve maintainability, but these are style/clarity issues rather than functional problems.
    • No files require special attention - the single changed file is straightforward and follows established patterns

    Important Files Changed

    File Analysis

    File: tests/python/direct/test_cutlass_mxfp8_gemm.py (score: 4/5)
    Overview: Updated the architecture check to restrict the test to compute capability 10.0 only, following the pattern from test_cutlass_gemm.py. The error message could be more specific about the exact requirement.

    Sequence Diagram

    sequenceDiagram
        participant Test as test_mxfp8_gemm
        participant Utils as python.direct_utils
        participant PyTorch as torch.cuda
        participant Pytest as pytest
        
        Note over Test: Module Load Time
        Test->>Utils: import microarchitecture_is
        Test->>Utils: microarchitecture_is(10, 0)
        Utils->>PyTorch: get_device_properties(current_device())
        PyTorch-->>Utils: device_properties
        Utils->>Utils: Check if major == 10 and minor == 0
        alt Compute capability is exactly 10.0
            Utils-->>Test: True
            Note over Test: Test execution proceeds
        else Compute capability is NOT 10.0
            Utils-->>Test: False
            Test->>Pytest: pytest.skip(allow_module_level=True)
            Note over Test,Pytest: Entire test module skipped<br/>with reason message
        end
        
        Note over Test: Previous Behavior (before PR)
        Note over Test: Used torch.cuda.get_device_capability()<br/>directly and checked range [10.0, 12.0)
    

    @greptile-apps Bot (Contributor) left a comment

    3 files reviewed, 3 comments

    - if compute_cap < (10, 0) or compute_cap >= (12, 0):
    + if not microarchitecture_is(10, 0):
          pytest.skip(
              reason="MxFp8 Requires compute capability 10.",

    The error message "MxFp8 Requires compute capability 10." is misleading. The new logic using microarchitecture_is(10, 0) only runs on exactly compute capability 10.0, not all 10.x architectures.

    Previous behavior: Allowed 10.0, 10.1, 10.2, 10.3, 11.x (anything from 10.0 up to but excluding 12.0)
    New behavior: Only allows 10.0

    Consider updating the message to be more specific, e.g., "MxFp8 requires compute capability 10.0. Other architectures have not been tested." This matches the pattern used in test_cutlass_gemm.py which has detailed comments about tested vs untested architectures.

    Suggested change
    - reason="MxFp8 Requires compute capability 10.",
    + reason="MxFp8 requires compute capability 10.0. Other architectures have not been tested.",


    - compute_cap = torch.cuda.get_device_capability()
    - if compute_cap < (10, 0) or compute_cap >= (12, 0):
    + if not microarchitecture_is(10, 0):

    Consider adding explanatory comments about which architectures are supported/tested, similar to test_cutlass_gemm.py (lines 12-16). This would help future maintainers understand why the test is restricted to compute capability 10.0 only.

    Example from test_cutlass_gemm.py:

    # GPU Compute Capability: https://developer.nvidia.com/cuda/gpus
    # tested on blackwell compute 10.0 (B200 and GB200)
    # doesn't support 12.0 (RTX PRO 6000 and RTX 50XX)
    # Not tested on 10.3 (B300 and GB300)
    # Not tested on 12.1 (DGX Spark)
    if not microarchitecture_is(10, 0):

    This documentation helps clarify the intentional restriction and provides context for when the restriction might be relaxed in the future.


      import pytest
      import torch
      from nvfuser_direct import nvf_cutlass
    + from python.direct_utils import microarchitecture_is

    Note: test_cutlass_nvfp4_gemm.py still uses the old pattern with torch.cuda.get_device_capability() and allows all architectures in the range [10.0, 12.0). For consistency, consider updating that test file as well if NVFP4 should also be restricted to only tested architectures.

    Current state in test_cutlass_nvfp4_gemm.py:

    compute_cap = torch.cuda.get_device_capability()
    if compute_cap < (10, 0) or compute_cap >= (12, 0):
        pytest.skip(
            reason="Nvfp4 Requires compute capability 10.",
            allow_module_level=True,
        )

    If NVFP4 has the same testing limitations as MxFp8 and grouped_mm, it should follow the same pattern for maintainability and clarity.
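
    For concreteness, a sketch of what the analogous update to test_cutlass_nvfp4_gemm.py could look like (hypothetical and not part of this PR; the skip-reason wording is an assumption):

        import pytest
        from python.direct_utils import microarchitecture_is

        # Mirror the MxFp8 change: restrict the module to exactly compute capability 10.0.
        if not microarchitecture_is(10, 0):
            pytest.skip(
                reason="Nvfp4 requires compute capability 10.0. Other architectures have not been tested.",
                allow_module_level=True,
            )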


    @liqiangxl (Collaborator, Author) commented:

    !build

    @liqiangxl merged commit 3011961 into main on Jan 13, 2026
    20 checks passed
    @liqiangxl deleted the llu/skip_unsupported_arches_mxfp8_gemm branch on January 13, 2026 15:46

    github-actions Bot pushed a commit that referenced this pull request on Jan 13, 2026:
    Same as #5810. Error message: `Exception raised from runGemm at /opt/pytorch/nvfuser/cutlass/nvfp4_scaled_mm.cu:255`

    github-actions Bot pushed a commit that referenced this pull request on Jan 15, 2026:
    Same as #5810. Error message: `Exception raised from run_nvfp4_scaled_group_mm at /opt/pytorch/nvfuser/cutlass/nvfp4_scaled_group_mm.cu:518`

    liqiangxl added a commit that referenced this pull request on Jan 20, 2026:
    Same as #5810. Skip tests in `test_narrow_precision` that use scaled/grouped mm. Error message: `Exception raised from runGemm at /opt/pytorch/nvfuser/cutlass/nvfp4_scaled_mm.cu:255`