2 changes: 1 addition & 1 deletion .claude/skills/improve-cutile-kernel-perf/SKILL.md
@@ -61,7 +61,7 @@ Every experiment iteration applies ONE optimization to the target kernel, verifi

#### Correctness test:
```bash
-python -m pytest tests/suites/.../test_<kernel_name>.py -k "test_op and cutile" -v
+python -m pytest tests/suites/.../test_<kernel_name>.py -k "test_ and cutile and not test_perf" -v
```
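
The broadened `-k` expression picks up every cuTile correctness test (not just those named `test_op*`) while keeping perf benchmarks out of the run. A rough sketch of the selection logic, illustrative only (pytest's real `-k` matcher also walks class names and parametrization ids, and the test names below are hypothetical):

```python
# Mirrors -k "test_ and cutile and not test_perf" as a plain predicate.
def selected(test_name: str) -> bool:
    return "test_" in test_name and "cutile" in test_name and "test_perf" not in test_name


assert selected("test_op_cutile")        # matched before and after the change
assert selected("test_backward_cutile")  # the old "test_op and cutile" filter missed this
assert not selected("test_perf_cutile")  # perf benchmarks stay excluded
assert not selected("test_op_triton")    # other backends stay excluded
```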

#### Performance benchmark:
2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ This repository aims to provide helpful kernel tutorials and examples for tile-b

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Note that Ampere performance is still being actively optimized. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** (e.g., A100) is also supported with **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).

- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
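A quick self-check against this support matrix (a sketch: it assumes the Ampere-vs-Blackwell rule quoted above and parses `torch.version.cuda` loosely):

```python
import torch

major, _ = torch.cuda.get_device_capability()
cuda_ver = tuple(int(p) for p in (torch.version.cuda or "0.0").split(".")[:2])
required = (13, 2) if major < 9 else (13, 1)  # Ampere needs 13.2+, Blackwell 13.1+
if cuda_ver >= required:
    print(f"OK: sm_{major}x GPU with CUDA {cuda_ver[0]}.{cuda_ver[1]}")
else:
    print(f"Upgrade CUDA: need {required[0]}.{required[1]}+, found {cuda_ver[0]}.{cuda_ver[1]}")
```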
2 changes: 1 addition & 1 deletion README_chs.md
@@ -31,7 +31,7 @@ TileGym is a CUDA Tile kernel library offering a rich set of tile-based GPU

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Note that Ampere performance is still being actively optimized. Download CUDA from the [NVIDIA CUDA downloads page](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** (e.g., A100) is also supported but requires **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from the [NVIDIA CUDA downloads page](https://developer.nvidia.com/cuda-downloads).

- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
2 changes: 1 addition & 1 deletion README_cht.md
@@ -31,7 +31,7 @@ TileGym is a CUDA Tile kernel library offering a rich set of tile-based

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Note that Ampere performance is still being actively optimized. Download CUDA from the [NVIDIA CUDA downloads page](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** (e.g., A100) is also supported but requires **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from the [NVIDIA CUDA downloads page](https://developer.nvidia.com/cuda-downloads).

- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
2 changes: 1 addition & 1 deletion README_fr.md
@@ -31,7 +31,7 @@ This repository aims to provide helpful kernel tutorials and examples for

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell** GPU (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Ampere performance is still under active optimization. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** GPUs (e.g., A100) are also supported with **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).
- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
2 changes: 1 addition & 1 deletion README_ja.md
@@ -31,7 +31,7 @@ TileGym provides a rich set of tile-based GPU programming resources

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Ampere performance is still being actively optimized. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** (e.g., A100) is also supported with **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).

- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
5 changes: 3 additions & 2 deletions src/tilegym/ops/cutile/gemma_attention.py
@@ -396,8 +396,9 @@ def forward(
has_soft_cap,
)

-        BLOCK_M = 128
-        BLOCK_N = 128
+        _gemma_cap = torch.cuda.get_device_capability()
+        BLOCK_M = 64 if _gemma_cap[0] < 9 else 128
+        BLOCK_N = 64 if _gemma_cap[0] < 9 else 128
EVEN_K = (S_kv % BLOCK_N) == 0
grid = ((S_qo + BLOCK_M - 1) // BLOCK_M, B * H, 1)

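Halving the tile on pre-SM90 parts doubles the number of query blocks for a given sequence length, since the grid's first dimension is `ceil(S_qo / BLOCK_M)`. A sketch with made-up shapes (grid formula copied from the surrounding code):

```python
S_qo, B, H = 4096, 8, 32  # made-up shapes for illustration
for arch, block_m in [("sm_80", 64), ("sm_100", 128)]:
    grid_x = (S_qo + block_m - 1) // block_m
    print(f"{arch}: BLOCK_M={block_m} -> grid=({grid_x}, {B * H}, 1)")
```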
42 changes: 24 additions & 18 deletions src/tilegym/ops/cutile/layer_norm_legacy.py
@@ -162,26 +162,17 @@ def _persistent_layer_norm_autotune_configs():
- BLOCK_N: [2, 4, 8, 16, 32] - number of rows per block
- num_ctas: [1] - single CTA for this kernel
"""
-    # BLOCK_N options
-    block_n_options = [2, 4, 8, 16, 32]
-
-    # num_ctas options
-    num_ctas_options = [1]
-
-    for block_n in block_n_options:
-        for num_ctas in num_ctas_options:
-            yield SimpleNamespace(
-                BLOCK_N=block_n,
-                num_ctas=num_ctas,
-            )
+    for block_n in [2, 4, 8, 16, 32]:
+        yield SimpleNamespace(BLOCK_N=block_n, num_ctas=1)


def _get_default_persistent_layer_norm_configs():
"""GPU-specific defaults when autotune is disabled."""
-    return {
-        "BLOCK_N": 8,
-        "num_ctas": 1,
-    }
+    gpu_capability = torch.cuda.get_device_capability()
+    if gpu_capability[0] < 9:
+        # Smaller BLOCK_N reduces shared memory pressure on pre-SM90 GPUs.
+        return {"BLOCK_N": 4, "num_ctas": 1}
+    return {"BLOCK_N": 8, "num_ctas": 1}


def _persistent_layer_norm_early_config_prune(configs, N, D, BLOCK_D):
@@ -390,8 +390,23 @@ def cutile_persistent_layer_norm_fwd(
# Calculate block sizes
BLOCK_D = next_power_of_2(D)

-    # Check if autotune is enabled (default: enabled)
-    enable_autotune = os.environ.get("DISABLE_CUTILE_TUNE", "0") != "1"
+    # BLOCK_D is a ct.Constant; tileiras compile time grows super-linearly with it.
+    # This kernel cannot use a smaller tile because LayerNorm statistics are global
+    # over all columns. Fall back to torch.nn.functional for large D on pre-SM90.
+    gpu_capability = torch.cuda.get_device_capability(x.device)
+    _sm80_max_block_d = 1024
+    if gpu_capability[0] < 9 and BLOCK_D > _sm80_max_block_d:
+        import torch.nn.functional as F
+
+        y_out = F.layer_norm(x, (D,), weight, bias, eps)
+        if compute_mean_and_rstd:
+            x_fp32 = x.float()
+            mean.copy_(x_fp32.mean(dim=-1))
+            rstd.copy_(1.0 / (x_fp32.var(dim=-1, unbiased=False) + eps).sqrt())
+        return y_out, mean, rstd, BLOCK_D, 8
+
+    # Autotune is disabled on pre-SM90: compiling each candidate config is too slow.
+    enable_autotune = os.environ.get("DISABLE_CUTILE_TUNE", "0") != "1" and gpu_capability[0] >= 9

if enable_autotune:
_persistent_layer_norm_autotune_base(
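The fallback recomputes the saved statistics in fp32, matching the usual LayerNorm definition in which `rstd = 1/sqrt(var + eps)` uses the biased variance. A minimal sanity check of that identity against `F.layer_norm` (CPU, fp32, made-up shapes):

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 2048)
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)
rstd = 1.0 / (x.var(dim=-1, unbiased=False, keepdim=True) + eps).sqrt()
y_ref = (x - mean) * rstd  # normalized output, no affine transform
y = F.layer_norm(x, (2048,), None, None, eps)
torch.testing.assert_close(y, y_ref, rtol=1e-5, atol=1e-5)
```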
18 changes: 12 additions & 6 deletions src/tilegym/ops/cutile/matmul.py
@@ -195,12 +195,18 @@ def _matmul_autotune_configs():
yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=64, TILE_SIZE_K=64, num_ctas=1, occupancy=1)
yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=64, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
elif gpu_capability[0] < 9:
-        # sm80 (A100)
-        yield SimpleNamespace(TILE_SIZE_M=64, TILE_SIZE_N=64, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
-        yield SimpleNamespace(TILE_SIZE_M=64, TILE_SIZE_N=128, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
-        yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=64, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
-        yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=128, TILE_SIZE_K=32, num_ctas=1, occupancy=1)
-        yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=128, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
+        # Pre-SM90: num_ctas=1 (CGA unsupported); sweep TILE_K in [32, 64, 128]
+        for TILE_M in [64, 128]:
+            for TILE_N in [64, 128]:
+                for TILE_K in [32, 64, 128]:
+                    for occ in [1, 2]:
+                        yield SimpleNamespace(
+                            TILE_SIZE_M=TILE_M,
+                            TILE_SIZE_N=TILE_N,
+                            TILE_SIZE_K=TILE_K,
+                            num_ctas=1,
+                            occupancy=occ,
+                        )
else:
# sm100+ (Blackwell)
yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=128, TILE_SIZE_K=32, num_ctas=1, occupancy=1)
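The five hand-picked configs become a full sweep: 2 TILE_M x 2 TILE_N x 3 TILE_K x 2 occupancy = 24 candidates per pre-SM90 tune. A quick count mirroring the loops above:

```python
from itertools import product

space = list(product([64, 128], [64, 128], [32, 64, 128], [1, 2]))
assert len(space) == 24  # vs. 5 hand-picked configs before this change
```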
85 changes: 61 additions & 24 deletions src/tilegym/ops/cutile/mla.py
@@ -3,13 +3,30 @@
# SPDX-License-Identifier: MIT

import math
+import os
+from types import SimpleNamespace

import cuda.tile as ct
import torch
from cuda.tile._numeric_semantics import RoundingMode as RMd

from tilegym.backend import register_impl


+def _mla_sm80_autotune_configs():
+    """Pre-SM90 autotune search space for MLA prefill — num_ctas=1 only."""
+    for tm in [64, 128]:
+        for tn in [64, 128]:
+            yield SimpleNamespace(TILE_M=tm, TILE_N=tn, num_ctas=1, occupancy=2)
+
+
+def _mla_sm90_autotune_configs():
+    """SM90+ autotune search space for MLA prefill."""
+    for tm in [64, 128, 256]:
+        for tn in [64, 128]:
+            yield SimpleNamespace(TILE_M=tm, TILE_N=tn, num_ctas=1, occupancy=1)


# Type aliases for constants
ConstInt = ct.Constant[int]

@@ -149,30 +166,49 @@ def forward(ctx, q, qpe, k, kpe, v, sm_scale, IS_CAUSAL, kernel_configs):
else:
assert H % num_head_kv == 0
query_group_size = int(H / num_head_kv)
-        # Launch fmha fwd kernel
-        grid = (math.ceil(S_qo / kernel_configs.get("TILE_M", 256)), B * H, 1)
-        TILE_M = kernel_configs.get("TILE_M", 256)
-        TILE_N = kernel_configs.get("TILE_N", 128)
-        ct.launch(
-            torch.cuda.current_stream(),
-            grid,
-            prefill_mla,
-            (
-                q,
-                qpe,
-                k,
-                kpe,
-                v,
-                o,
-                sm_scale,
-                TILE_D,
-                TILE_KPE,
-                H,
-                TILE_M,
-                TILE_N,
-                query_group_size,
-            ),
-        )
+        # Launch fmha fwd kernel.
+        # Autotune runs when ENABLE_CUTILE_TUNE=1 AND the caller did not supply explicit
+        # kernel_configs. Explicit kernel_configs always bypass autotune so callers
+        # can pin a fixed config for controlled A/B comparisons.
+        _gpu_cap = torch.cuda.get_device_capability(q.device)
+        _use_autotune = os.environ.get("ENABLE_CUTILE_TUNE", "0") == "1" and not kernel_configs.get(
+            "_user_explicit", False
+        )
+        if _use_autotune:
+            import cuda.tile_experimental as ct_experimental  # lazy — may not be installed
+
+            ct_experimental.autotune_launch(
+                torch.cuda.current_stream(),
+                grid_fn=lambda cfg: (math.ceil(S_qo / cfg.TILE_M), B * H, 1),
+                kernel=prefill_mla,
+                args_fn=lambda cfg: (
+                    q,
+                    qpe,
+                    k,
+                    kpe,
+                    v,
+                    o,
+                    sm_scale,
+                    TILE_D,
+                    TILE_KPE,
+                    H,
+                    cfg.TILE_M,
+                    cfg.TILE_N,
+                    query_group_size,
+                ),
+                hints_fn=lambda cfg: {"num_ctas": cfg.num_ctas, "occupancy": cfg.occupancy},
+                search_space=list(_mla_sm80_autotune_configs() if _gpu_cap[0] < 9 else _mla_sm90_autotune_configs()),
+            )
+        else:
+            TILE_M = kernel_configs.get("TILE_M", 64 if _gpu_cap[0] < 9 else 256)
+            TILE_N = kernel_configs.get("TILE_N", 64 if _gpu_cap[0] < 9 else 128)
+            grid = (math.ceil(S_qo / TILE_M), B * H, 1)
+            ct.launch(
+                torch.cuda.current_stream(),
+                grid,
+                prefill_mla,
+                (q, qpe, k, kpe, v, o, sm_scale, TILE_D, TILE_KPE, H, TILE_M, TILE_N, query_group_size),
+            )
ctx.save_for_backward(q, k, v, o)
ctx.sm_scale = sm_scale
ctx.shapes = (B, H, S_qo, S_kv)
@@ -215,7 +251,8 @@ def tile_mla(q, k, v, qpe, kpe, is_causal, scaling, **kwargs):
if user_cfg is None:
kernel_configs = defaults
else:
-        kernel_configs = {**defaults, **user_cfg}
+        # Tag so forward() knows to bypass autotune and use the explicit config.
+        kernel_configs = {**defaults, **user_cfg, "_user_explicit": True}
attention = Attention(is_causal, kernel_configs)
o = attention(q, k, v, scaling, qpe, kpe)
return o
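The `_user_explicit` tag is the only signal `forward()` consults when deciding between autotuning and a pinned config. A standalone sketch of the merge logic (same shape as above; the default values here are hypothetical):

```python
defaults = {"TILE_M": 256, "TILE_N": 128}  # hypothetical defaults for illustration

def merge(user_cfg):
    if user_cfg is None:
        return defaults  # untagged: forward() may still autotune
    # Tag so forward() bypasses autotune and honors the explicit config.
    return {**defaults, **user_cfg, "_user_explicit": True}

assert "_user_explicit" not in merge(None)
assert merge({"TILE_M": 128})["TILE_M"] == 128   # user value wins the merge
assert merge({"TILE_M": 128})["_user_explicit"]  # and pins the config
```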
17 changes: 17 additions & 0 deletions src/tilegym/ops/cutile/rms_norm.py
@@ -313,6 +313,23 @@ def ceil_div(a, b):
TILE_SIZE_M = 4 # Default value, could be made configurable
TILE_SIZE_N = next_power_of_2(N)

+    # Pre-SM90: TILE_SIZE_N as a ct.Constant causes per-N recompilation.
+    # Gather kernel avoids this by treating N as a runtime variable.
+    if torch.cuda.get_device_capability(x.device)[0] < 9:
+        MAX_FUSED_SIZE = 4096 // x.element_size()
+        _tile = min(MAX_FUSED_SIZE, next_power_of_2(N))
+        ct.launch(
+            torch.cuda.current_stream(),
+            (M,),
+            rms_norm_kernel_gather,
+            (x_arg, weight, y, rstd, N, eps, offset, _tile),
+        )
+        ctx.save_for_backward(x, weight, rstd)
+        ctx.TILE_SIZE = _tile
+        ctx.eps = eps
+        ctx.offset = offset
+        return y.view(*x.shape)
+
# Other block sizes are more optimal when other dimension is too large/too small
if TILE_SIZE_N <= 1024:
TILE_SIZE_M = 16
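The gather path trades per-N specialization for one compiled kernel per dtype: the tile is capped at 4096 bytes of elements, and N stays a runtime argument. A sketch of the resulting tile sizes (assumes `next_power_of_2` rounds up, as its name suggests):

```python
import torch

def gather_tile_size(n: int, dtype: torch.dtype) -> int:
    max_fused = 4096 // torch.empty((), dtype=dtype).element_size()
    next_pow2 = 1 << (n - 1).bit_length()  # next power of 2 >= n
    return min(max_fused, next_pow2)

assert gather_tile_size(1000, torch.float16) == 1024  # small N: next_pow2(N)
assert gather_tile_size(8192, torch.float16) == 2048  # large N: capped at 4096 B / 2 B
```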
11 changes: 11 additions & 0 deletions src/tilegym/ops/cutile/softmax.py
@@ -284,6 +284,17 @@ def forward(
use_chunked=False,
):
assert not (use_tma and use_chunked), "Cannot use both TMA and chunked softmax at the same time"
+        # TMA may be emulated on this arch; redirect to non-TMA path with a warning.
+        if use_tma and torch.cuda.get_device_capability(x.device)[0] < 9:
+            import warnings
+
+            warnings.warn(
+                "softmax: use_tma=True has no effect on this GPU (TMA is emulated). "
+                "Falling back to use_tma=False. Pass use_tma=False to suppress this.",
+                UserWarning,
+                stacklevel=3,
+            )
+            use_tma = False
n_rows, n_cols = x.shape
TILE_SIZE = next_power_of_2(n_cols)
MAX_TILE_SIZE = 8192
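A standalone restatement of the warn-and-fall-back gate, checkable without a GPU:

```python
import warnings

def resolve_use_tma(use_tma: bool, cc_major: int) -> bool:
    """Mirror of the gate above: pre-SM90 GPUs emulate TMA, so warn and disable."""
    if use_tma and cc_major < 9:
        warnings.warn("use_tma=True has no effect on this GPU; falling back.", UserWarning)
        return False
    return use_tma

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert resolve_use_tma(True, 8) is False   # A100: warns and falls back
    assert resolve_use_tma(True, 10) is True   # B200: TMA request honored
    assert len(caught) == 1
```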
9 changes: 5 additions & 4 deletions src/tilegym/ops/cutile/splitk_reduce.py
@@ -16,7 +16,7 @@
ConstBool = ct.Constant[bool]


-@ct.kernel(occupancy=4)
+@ct.kernel(occupancy=ct.ByTarget(sm_80=2, default=4))
def splitk_reduce_kernel(
attn_splitk_out,
lse_splitk_out,
@@ -112,13 +112,14 @@ def splitk_reduce(attn_splitk_out, lse_splitk_out, attn_out, S_kv, **kwargs):
TILE_D = min(128, next_power_of_2(head_dim))
NUM_KV_SPLITS_POW2 = next_power_of_2(NUM_KV_SPLITS)

-    # Determine if we should use dot product based on conditions
-    USE_DOT = NUM_KV_SPLITS_POW2 >= 16
+    # MMA is efficient once NUM_KV_SPLITS is large enough to amortize launch overhead.
+    _split_cap = torch.cuda.get_device_capability()
+    _dot_threshold = 4 if _split_cap[0] < 9 else 16
+    USE_DOT = NUM_KV_SPLITS_POW2 >= _dot_threshold

# Calculate grid dimensions
grid = (B, num_heads, (head_dim + TILE_D - 1) // TILE_D)

-    # Launch kernel
ct.launch(
torch.cuda.current_stream(),
grid,
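The `USE_DOT` threshold is now per-architecture: 4 splits on pre-SM90 versus 16 on newer parts. A standalone restatement of the gate:

```python
def use_dot(num_kv_splits_pow2: int, cc_major: int) -> bool:
    # Mirrors the decision above: MMA pays off at fewer splits on pre-SM90.
    threshold = 4 if cc_major < 9 else 16
    return num_kv_splits_pow2 >= threshold

assert use_dot(8, 8)       # sm_80: 8 splits >= 4, take the MMA path
assert not use_dot(8, 10)  # sm_100: 8 splits < 16, skip the dot path
```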
4 changes: 3 additions & 1 deletion tests/benchmark/bench_fused_attention.py
@@ -74,7 +74,9 @@ def create_benchmark_config(datatype, HEAD_DIM, mode, causal):
)


-_dtypes = [torch.float16, torch.float8_e5m2]
+# FP8 requires SM90+ (Hopper/Blackwell); skip on SM80 (A100)
+_gpu_cap = torch.cuda.get_device_capability()
+_dtypes = [torch.float16] if _gpu_cap[0] < 9 else [torch.float16, torch.float8_e5m2]


@triton.testing.perf_report(
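The same FP8 gate now appears in four benchmark files; a shared helper (hypothetical, not part of this PR) would keep them in sync:

```python
import torch

def bench_dtypes() -> list:
    """FP8 (e5m2) needs SM90+; fall back to fp16-only on older GPUs."""
    major, _ = torch.cuda.get_device_capability()
    return [torch.float16] if major < 9 else [torch.float16, torch.float8_e5m2]
```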
7 changes: 6 additions & 1 deletion tests/benchmark/bench_group_gemm.py
@@ -69,10 +69,15 @@ def create_benchmark_config(datatype, num_groups, transpose_b):
)


+# FP8 requires SM90+; skip on pre-SM90 (capability checked once at module import)
+_gpu_cap = torch.cuda.get_device_capability()
+_dtypes = [torch.float16] if _gpu_cap[0] < 9 else [torch.float16, torch.float8_e5m2]


@triton.testing.perf_report(
[
create_benchmark_config(datatype, num_groups, transpose_b)
-        for datatype in [torch.float16, torch.float8_e5m2]
+        for datatype in _dtypes
for num_groups in [4, 16]
for transpose_b in [False, True]
]
7 changes: 6 additions & 1 deletion tests/benchmark/bench_matrix_multiplication.py
@@ -69,7 +69,12 @@ def create_benchmark_config(datatype):
)


-@triton.testing.perf_report([create_benchmark_config(datatype) for datatype in [torch.float16, torch.float8_e5m2]])
+# FP8 requires SM90+ (Hopper/Blackwell); skip on SM80 (A100)
+_gpu_cap = torch.cuda.get_device_capability()
+_dtypes = [torch.float16] if _gpu_cap[0] < 9 else [torch.float16, torch.float8_e5m2]
+
+
+@triton.testing.perf_report([create_benchmark_config(datatype) for datatype in _dtypes])
def benchmark(M, N, K, backend, datatype):
if datatype == torch.float8_e5m2:
a = torch.randn((M, K), device=DEVICE, dtype=torch.float16).to(torch.float8_e5m2)
9 changes: 4 additions & 5 deletions tests/benchmark/bench_persistent_matmul.py
@@ -64,11 +64,10 @@ def create_benchmark_config(datatype, static_persistent=True):
)


-# Generate configs for different datatypes
-configs = [create_benchmark_config(torch.float16)]
-# Add FP8 if available
-if hasattr(torch, "float8_e5m2"):
-    configs.append(create_benchmark_config(torch.float8_e5m2))
+# FP8 requires SM90+ (Hopper/Blackwell); skip on SM80 (A100)
+_gpu_cap = torch.cuda.get_device_capability()
+_dtypes = [torch.float16] if _gpu_cap[0] < 9 else [torch.float16, torch.float8_e5m2]
+configs = [create_benchmark_config(dt) for dt in _dtypes]


@triton.testing.perf_report(configs)