2 changes: 1 addition & 1 deletion .claude/skills/improve-cutile-kernel-perf/SKILL.md
@@ -61,7 +61,7 @@ Every experiment iteration applies ONE optimization to the target kernel, verifi

#### Correctness test:
```bash
-python -m pytest tests/suites/.../test_<kernel_name>.py -k "test_op and cutile" -v
+python -m pytest tests/suites/.../test_<kernel_name>.py -k "test_ and cutile and not test_perf" -v
```
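
The broadened `-k` expression picks up every cuTile correctness test (not just those named `test_op*`) while keeping perf benchmarks out of the run. A rough sketch of the selection logic, illustrative only (pytest's real `-k` matcher also walks class names and parametrization ids, and the test names below are hypothetical):

```python
# Mirrors -k "test_ and cutile and not test_perf" as a plain predicate.
def selected(test_name: str) -> bool:
    return "test_" in test_name and "cutile" in test_name and "test_perf" not in test_name


assert selected("test_op_cutile")        # matched before and after the change
assert selected("test_backward_cutile")  # the old "test_op and cutile" filter missed this
assert not selected("test_perf_cutile")  # perf benchmarks stay excluded
assert not selected("test_op_triton")    # other backends stay excluded
```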

#### Performance benchmark:
2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ This repository aims to provide helpful kernel tutorials and examples for tile-b

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Note that Ampere performance is still being actively optimized. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** (e.g., A100) is also supported with **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).

- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
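A quick self-check against this support matrix (a sketch: it assumes the Ampere-vs-Blackwell rule quoted above and parses `torch.version.cuda` loosely):

```python
import torch

major, _ = torch.cuda.get_device_capability()
cuda_ver = tuple(int(p) for p in (torch.version.cuda or "0.0").split(".")[:2])
required = (13, 2) if major < 9 else (13, 1)  # Ampere needs 13.2+, Blackwell 13.1+
if cuda_ver >= required:
    print(f"OK: sm_{major}x GPU with CUDA {cuda_ver[0]}.{cuda_ver[1]}")
else:
    print(f"Upgrade CUDA: need {required[0]}.{required[1]}+, found {cuda_ver[0]}.{cuda_ver[1]}")
```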
2 changes: 1 addition & 1 deletion README_chs.md
@@ -31,7 +31,7 @@ TileGym is a CUDA Tile kernel library offering a rich set of tile-based GPU

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Note that Ampere performance is still being actively optimized. Download CUDA from the [NVIDIA CUDA downloads page](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** (e.g., A100) is also supported but requires **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from the [NVIDIA CUDA downloads page](https://developer.nvidia.com/cuda-downloads).

- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
2 changes: 1 addition & 1 deletion README_cht.md
@@ -31,7 +31,7 @@ TileGym is a CUDA Tile kernel library offering a rich set of tile-based

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Note that Ampere performance is still being actively optimized. Download CUDA from the [NVIDIA CUDA downloads page](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** (e.g., A100) is also supported but requires **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from the [NVIDIA CUDA downloads page](https://developer.nvidia.com/cuda-downloads).

- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
2 changes: 1 addition & 1 deletion README_fr.md
@@ -31,7 +31,7 @@ This repository aims to provide helpful kernel tutorials and examples for

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell** GPU (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Ampere performance is still under active optimization. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** GPUs (e.g., A100) are also supported with **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).
- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
2 changes: 1 addition & 1 deletion README_ja.md
@@ -31,7 +31,7 @@ TileGym provides a rich set of tile-based GPU programming resources

### Prerequisites

-> **GPU Support**: TileGym requires **CUDA 13.1+** and an **NVIDIA Ampere** (e.g., A100) or **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). All released cuTile kernels are validated on both architectures. Ampere performance is still being actively optimized. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).
+> **GPU Support**: TileGym requires **CUDA 13.1+** and a **Blackwell GPU** (e.g., B200, RTX 5080, RTX 5090). **NVIDIA Ampere** (e.g., A100) is also supported with **CUDA 13.2+**. All released cuTile kernels are validated on both architectures. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).

- PyTorch (version 2.9.1 or compatible)
- **[CUDA 13.1+](https://developer.nvidia.com/cuda-downloads)** (Required - TileGym is built and tested exclusively on CUDA 13.1+)
5 changes: 3 additions & 2 deletions src/tilegym/ops/cutile/gemma_attention.py
@@ -396,8 +396,9 @@ def forward(
has_soft_cap,
)

-        BLOCK_M = 128
-        BLOCK_N = 128
+        _gemma_cap = torch.cuda.get_device_capability()
+        BLOCK_M = 64 if _gemma_cap[0] < 9 else 128
+        BLOCK_N = 64 if _gemma_cap[0] < 9 else 128
EVEN_K = (S_kv % BLOCK_N) == 0
grid = ((S_qo + BLOCK_M - 1) // BLOCK_M, B * H, 1)

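Halving the tile on pre-SM90 parts doubles the number of query blocks for a given sequence length, since the grid's first dimension is `ceil(S_qo / BLOCK_M)`. A sketch with made-up shapes (grid formula copied from the surrounding code):

```python
S_qo, B, H = 4096, 8, 32  # made-up shapes for illustration
for arch, block_m in [("sm_80", 64), ("sm_100", 128)]:
    grid_x = (S_qo + block_m - 1) // block_m
    print(f"{arch}: BLOCK_M={block_m} -> grid=({grid_x}, {B * H}, 1)")
```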
42 changes: 24 additions & 18 deletions src/tilegym/ops/cutile/layer_norm_legacy.py
@@ -162,26 +162,17 @@ def _persistent_layer_norm_autotune_configs():
- BLOCK_N: [2, 4, 8, 16, 32] - number of rows per block
- num_ctas: [1] - single CTA for this kernel
"""
-    # BLOCK_N options
-    block_n_options = [2, 4, 8, 16, 32]
-
-    # num_ctas options
-    num_ctas_options = [1]
-
-    for block_n in block_n_options:
-        for num_ctas in num_ctas_options:
-            yield SimpleNamespace(
-                BLOCK_N=block_n,
-                num_ctas=num_ctas,
-            )
+    for block_n in [2, 4, 8, 16, 32]:
+        yield SimpleNamespace(BLOCK_N=block_n, num_ctas=1)


def _get_default_persistent_layer_norm_configs():
"""GPU-specific defaults when autotune is disabled."""
-    return {
-        "BLOCK_N": 8,
-        "num_ctas": 1,
-    }
+    gpu_capability = torch.cuda.get_device_capability()
+    if gpu_capability[0] < 9:
+        # Smaller BLOCK_N reduces shared memory pressure on pre-SM90 GPUs.
+        return {"BLOCK_N": 4, "num_ctas": 1}
+    return {"BLOCK_N": 8, "num_ctas": 1}


def _persistent_layer_norm_early_config_prune(configs, N, D, BLOCK_D):
@@ -390,8 +390,23 @@ def cutile_persistent_layer_norm_fwd(
# Calculate block sizes
BLOCK_D = next_power_of_2(D)

-    # Check if autotune is enabled (default: enabled)
-    enable_autotune = os.environ.get("DISABLE_CUTILE_TUNE", "0") != "1"
+    # BLOCK_D is a ct.Constant; tileiras compile time grows super-linearly with it.
+    # This kernel cannot use a smaller tile because LayerNorm statistics are global
+    # over all columns. Fall back to torch.nn.functional for large D on pre-SM90.
+    gpu_capability = torch.cuda.get_device_capability(x.device)
+    _sm80_max_block_d = 1024
+    if gpu_capability[0] < 9 and BLOCK_D > _sm80_max_block_d:
+        import torch.nn.functional as F
+
+        y_out = F.layer_norm(x, (D,), weight, bias, eps)
+        if compute_mean_and_rstd:
+            x_fp32 = x.float()
+            mean.copy_(x_fp32.mean(dim=-1))
+            rstd.copy_(1.0 / (x_fp32.var(dim=-1, unbiased=False) + eps).sqrt())
+        return y_out, mean, rstd, BLOCK_D, 8
+
+    # Autotune is disabled on pre-SM90: compiling each candidate config is too slow.
+    enable_autotune = os.environ.get("DISABLE_CUTILE_TUNE", "0") != "1" and gpu_capability[0] >= 9

if enable_autotune:
_persistent_layer_norm_autotune_base(
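The fallback recomputes the saved statistics in fp32, matching the usual LayerNorm definition in which `rstd = 1/sqrt(var + eps)` uses the biased variance. A minimal sanity check of that identity against `F.layer_norm` (CPU, fp32, made-up shapes):

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 2048)
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)
rstd = 1.0 / (x.var(dim=-1, unbiased=False, keepdim=True) + eps).sqrt()
y_ref = (x - mean) * rstd  # normalized output, no affine transform
y = F.layer_norm(x, (2048,), None, None, eps)
torch.testing.assert_close(y, y_ref, rtol=1e-5, atol=1e-5)
```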
18 changes: 12 additions & 6 deletions src/tilegym/ops/cutile/matmul.py
@@ -195,12 +195,18 @@ def _matmul_autotune_configs():
yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=64, TILE_SIZE_K=64, num_ctas=1, occupancy=1)
yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=64, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
elif gpu_capability[0] < 9:
-        # sm80 (A100)
-        yield SimpleNamespace(TILE_SIZE_M=64, TILE_SIZE_N=64, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
-        yield SimpleNamespace(TILE_SIZE_M=64, TILE_SIZE_N=128, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
-        yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=64, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
-        yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=128, TILE_SIZE_K=32, num_ctas=1, occupancy=1)
-        yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=128, TILE_SIZE_K=32, num_ctas=1, occupancy=2)
+        # Pre-SM90: num_ctas=1 (CGA unsupported); sweep TILE_K in [32, 64, 128]
+        for TILE_M in [64, 128]:
+            for TILE_N in [64, 128]:
+                for TILE_K in [32, 64, 128]:
+                    for occ in [1, 2]:
+                        yield SimpleNamespace(
+                            TILE_SIZE_M=TILE_M,
+                            TILE_SIZE_N=TILE_N,
+                            TILE_SIZE_K=TILE_K,
+                            num_ctas=1,
+                            occupancy=occ,
+                        )
else:
# sm100+ (Blackwell)
yield SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=128, TILE_SIZE_K=32, num_ctas=1, occupancy=1)
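The five hand-picked configs become a full sweep: 2 TILE_M x 2 TILE_N x 3 TILE_K x 2 occupancy = 24 candidates per pre-SM90 tune. A quick count mirroring the loops above:

```python
from itertools import product

space = list(product([64, 128], [64, 128], [32, 64, 128], [1, 2]))
assert len(space) == 24  # vs. 5 hand-picked configs before this change
```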
85 changes: 61 additions & 24 deletions src/tilegym/ops/cutile/mla.py
@@ -3,13 +3,30 @@
# SPDX-License-Identifier: MIT

import math
+import os
+from types import SimpleNamespace

import cuda.tile as ct
import torch
from cuda.tile._numeric_semantics import RoundingMode as RMd

from tilegym.backend import register_impl


+def _mla_sm80_autotune_configs():
+    """Pre-SM90 autotune search space for MLA prefill — num_ctas=1 only."""
+    for tm in [64, 128]:
+        for tn in [64, 128]:
+            yield SimpleNamespace(TILE_M=tm, TILE_N=tn, num_ctas=1, occupancy=2)
+
+
+def _mla_sm90_autotune_configs():
+    """SM90+ autotune search space for MLA prefill."""
+    for tm in [64, 128, 256]:
+        for tn in [64, 128]:
+            yield SimpleNamespace(TILE_M=tm, TILE_N=tn, num_ctas=1, occupancy=1)


# Type aliases for constants
ConstInt = ct.Constant[int]

@@ -149,30 +166,49 @@ def forward(ctx, q, qpe, k, kpe, v, sm_scale, IS_CAUSAL, kernel_configs):
else:
assert H % num_head_kv == 0
query_group_size = int(H / num_head_kv)
-        # Launch fmha fwd kernel
-        grid = (math.ceil(S_qo / kernel_configs.get("TILE_M", 256)), B * H, 1)
-        TILE_M = kernel_configs.get("TILE_M", 256)
-        TILE_N = kernel_configs.get("TILE_N", 128)
-        ct.launch(
-            torch.cuda.current_stream(),
-            grid,
-            prefill_mla,
-            (
-                q,
-                qpe,
-                k,
-                kpe,
-                v,
-                o,
-                sm_scale,
-                TILE_D,
-                TILE_KPE,
-                H,
-                TILE_M,
-                TILE_N,
-                query_group_size,
-            ),
-        )
+        # Launch fmha fwd kernel.
+        # Autotune runs when ENABLE_CUTILE_TUNE=1 AND the caller did not supply explicit
+        # kernel_configs. Explicit kernel_configs always bypass autotune so callers
+        # can pin a fixed config for controlled A/B comparisons.
+        _gpu_cap = torch.cuda.get_device_capability(q.device)
+        _use_autotune = os.environ.get("ENABLE_CUTILE_TUNE", "0") == "1" and not kernel_configs.get(
+            "_user_explicit", False
+        )
+        if _use_autotune:
+            import cuda.tile_experimental as ct_experimental  # lazy — may not be installed
+
+            ct_experimental.autotune_launch(
+                torch.cuda.current_stream(),
+                grid_fn=lambda cfg: (math.ceil(S_qo / cfg.TILE_M), B * H, 1),
+                kernel=prefill_mla,
+                args_fn=lambda cfg: (
+                    q,
+                    qpe,
+                    k,
+                    kpe,
+                    v,
+                    o,
+                    sm_scale,
+                    TILE_D,
+                    TILE_KPE,
+                    H,
+                    cfg.TILE_M,
+                    cfg.TILE_N,
+                    query_group_size,
+                ),
+                hints_fn=lambda cfg: {"num_ctas": cfg.num_ctas, "occupancy": cfg.occupancy},
+                search_space=list(_mla_sm80_autotune_configs() if _gpu_cap[0] < 9 else _mla_sm90_autotune_configs()),
+            )
+        else:
+            TILE_M = kernel_configs.get("TILE_M", 64 if _gpu_cap[0] < 9 else 256)
+            TILE_N = kernel_configs.get("TILE_N", 64 if _gpu_cap[0] < 9 else 128)
+            grid = (math.ceil(S_qo / TILE_M), B * H, 1)
+            ct.launch(
+                torch.cuda.current_stream(),
+                grid,
+                prefill_mla,
+                (q, qpe, k, kpe, v, o, sm_scale, TILE_D, TILE_KPE, H, TILE_M, TILE_N, query_group_size),
+            )
ctx.save_for_backward(q, k, v, o)
ctx.sm_scale = sm_scale
ctx.shapes = (B, H, S_qo, S_kv)
@@ -215,7 +251,8 @@ def tile_mla(q, k, v, qpe, kpe, is_causal, scaling, **kwargs):
if user_cfg is None:
kernel_configs = defaults
else:
-        kernel_configs = {**defaults, **user_cfg}
+        # Tag so forward() knows to bypass autotune and use the explicit config.
+        kernel_configs = {**defaults, **user_cfg, "_user_explicit": True}
attention = Attention(is_causal, kernel_configs)
o = attention(q, k, v, scaling, qpe, kpe)
return o
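The `_user_explicit` tag is the only signal `forward()` consults when deciding between autotuning and a pinned config. A standalone sketch of the merge logic (same shape as above; the default values here are hypothetical):

```python
defaults = {"TILE_M": 256, "TILE_N": 128}  # hypothetical defaults for illustration

def merge(user_cfg):
    if user_cfg is None:
        return defaults  # untagged: forward() may still autotune
    # Tag so forward() bypasses autotune and honors the explicit config.
    return {**defaults, **user_cfg, "_user_explicit": True}

assert "_user_explicit" not in merge(None)
assert merge({"TILE_M": 128})["TILE_M"] == 128   # user value wins the merge
assert merge({"TILE_M": 128})["_user_explicit"]  # and pins the config
```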
17 changes: 17 additions & 0 deletions src/tilegym/ops/cutile/rms_norm.py
@@ -313,6 +313,23 @@ def ceil_div(a, b):
TILE_SIZE_M = 4 # Default value, could be made configurable
TILE_SIZE_N = next_power_of_2(N)

+    # Pre-SM90: TILE_SIZE_N as a ct.Constant causes per-N recompilation.
+    # Gather kernel avoids this by treating N as a runtime variable.
+    if torch.cuda.get_device_capability(x.device)[0] < 9:
+        MAX_FUSED_SIZE = 4096 // x.element_size()
+        _tile = min(MAX_FUSED_SIZE, next_power_of_2(N))
+        ct.launch(
+            torch.cuda.current_stream(),
+            (M,),
+            rms_norm_kernel_gather,
+            (x_arg, weight, y, rstd, N, eps, offset, _tile),
+        )
+        ctx.save_for_backward(x, weight, rstd)
+        ctx.TILE_SIZE = _tile
+        ctx.eps = eps
+        ctx.offset = offset
+        return y.view(*x.shape)
+
# Other block sizes are more optimal when other dimension is too large/too small
if TILE_SIZE_N <= 1024:
TILE_SIZE_M = 16
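The gather path trades per-N specialization for one compiled kernel per dtype: the tile is capped at 4096 bytes of elements, and N stays a runtime argument. A sketch of the resulting tile sizes (assumes `next_power_of_2` rounds up, as its name suggests):

```python
import torch

def gather_tile_size(n: int, dtype: torch.dtype) -> int:
    max_fused = 4096 // torch.empty((), dtype=dtype).element_size()
    next_pow2 = 1 << (n - 1).bit_length()  # next power of 2 >= n
    return min(max_fused, next_pow2)

assert gather_tile_size(1000, torch.float16) == 1024  # small N: next_pow2(N)
assert gather_tile_size(8192, torch.float16) == 2048  # large N: capped at 4096 B / 2 B
```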
11 changes: 11 additions & 0 deletions src/tilegym/ops/cutile/softmax.py
@@ -284,6 +284,17 @@ def forward(
use_chunked=False,
):
assert not (use_tma and use_chunked), "Cannot use both TMA and chunked softmax at the same time"
+        # TMA may be emulated on this arch; redirect to non-TMA path with a warning.
+        if use_tma and torch.cuda.get_device_capability(x.device)[0] < 9:
+            import warnings
+
+            warnings.warn(
+                "softmax: use_tma=True has no effect on this GPU (TMA is emulated). "
+                "Falling back to use_tma=False. Pass use_tma=False to suppress this.",
+                UserWarning,
+                stacklevel=3,
+            )
+            use_tma = False
n_rows, n_cols = x.shape
TILE_SIZE = next_power_of_2(n_cols)
MAX_TILE_SIZE = 8192
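A standalone restatement of the warn-and-fall-back gate, checkable without a GPU:

```python
import warnings

def resolve_use_tma(use_tma: bool, cc_major: int) -> bool:
    """Mirror of the gate above: pre-SM90 GPUs emulate TMA, so warn and disable."""
    if use_tma and cc_major < 9:
        warnings.warn("use_tma=True has no effect on this GPU; falling back.", UserWarning)
        return False
    return use_tma

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert resolve_use_tma(True, 8) is False   # A100: warns and falls back
    assert resolve_use_tma(True, 10) is True   # B200: TMA request honored
    assert len(caught) == 1
```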
9 changes: 5 additions & 4 deletions src/tilegym/ops/cutile/splitk_reduce.py
@@ -16,7 +16,7 @@
ConstBool = ct.Constant[bool]


-@ct.kernel(occupancy=4)
+@ct.kernel(occupancy=ct.ByTarget(sm_80=2, default=4))
def splitk_reduce_kernel(
attn_splitk_out,
lse_splitk_out,
@@ -112,13 +112,14 @@ def splitk_reduce(attn_splitk_out, lse_splitk_out, attn_out, S_kv, **kwargs):
TILE_D = min(128, next_power_of_2(head_dim))
NUM_KV_SPLITS_POW2 = next_power_of_2(NUM_KV_SPLITS)

-    # Determine if we should use dot product based on conditions
-    USE_DOT = NUM_KV_SPLITS_POW2 >= 16
+    # MMA is efficient once NUM_KV_SPLITS is large enough to amortize launch overhead.
+    _split_cap = torch.cuda.get_device_capability()
+    _dot_threshold = 4 if _split_cap[0] < 9 else 16
+    USE_DOT = NUM_KV_SPLITS_POW2 >= _dot_threshold

# Calculate grid dimensions
grid = (B, num_heads, (head_dim + TILE_D - 1) // TILE_D)

-    # Launch kernel
ct.launch(
torch.cuda.current_stream(),
grid,
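The `USE_DOT` threshold is now per-architecture: 4 splits on pre-SM90 versus 16 on newer parts. A standalone restatement of the gate:

```python
def use_dot(num_kv_splits_pow2: int, cc_major: int) -> bool:
    # Mirrors the decision above: MMA pays off at fewer splits on pre-SM90.
    threshold = 4 if cc_major < 9 else 16
    return num_kv_splits_pow2 >= threshold

assert use_dot(8, 8)       # sm_80: 8 splits >= 4, take the MMA path
assert not use_dot(8, 10)  # sm_100: 8 splits < 16, skip the dot path
```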
4 changes: 3 additions & 1 deletion tests/benchmark/bench_fused_attention.py
@@ -74,7 +74,9 @@ def create_benchmark_config(datatype, HEAD_DIM, mode, causal):
)


-_dtypes = [torch.float16, torch.float8_e5m2]
+# FP8 requires SM90+ (Hopper/Blackwell); skip on SM80 (A100)
+_gpu_cap = torch.cuda.get_device_capability()
+_dtypes = [torch.float16] if _gpu_cap[0] < 9 else [torch.float16, torch.float8_e5m2]


@triton.testing.perf_report(
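The same FP8 gate now appears in four benchmark files; a shared helper (hypothetical, not part of this PR) would keep them in sync:

```python
import torch

def bench_dtypes() -> list:
    """FP8 (e5m2) needs SM90+; fall back to fp16-only on older GPUs."""
    major, _ = torch.cuda.get_device_capability()
    return [torch.float16] if major < 9 else [torch.float16, torch.float8_e5m2]
```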
7 changes: 6 additions & 1 deletion tests/benchmark/bench_group_gemm.py
@@ -69,10 +69,15 @@ def create_benchmark_config(datatype, num_groups, transpose_b):
)


+# FP8 requires SM90+; skip on pre-SM90 (capability checked once at module import)
+_gpu_cap = torch.cuda.get_device_capability()
+_dtypes = [torch.float16] if _gpu_cap[0] < 9 else [torch.float16, torch.float8_e5m2]


@triton.testing.perf_report(
[
create_benchmark_config(datatype, num_groups, transpose_b)
-        for datatype in [torch.float16, torch.float8_e5m2]
+        for datatype in _dtypes
for num_groups in [4, 16]
for transpose_b in [False, True]
]
7 changes: 6 additions & 1 deletion tests/benchmark/bench_matrix_multiplication.py
@@ -69,7 +69,12 @@ def create_benchmark_config(datatype):
)


-@triton.testing.perf_report([create_benchmark_config(datatype) for datatype in [torch.float16, torch.float8_e5m2]])
+# FP8 requires SM90+ (Hopper/Blackwell); skip on SM80 (A100)
+_gpu_cap = torch.cuda.get_device_capability()
+_dtypes = [torch.float16] if _gpu_cap[0] < 9 else [torch.float16, torch.float8_e5m2]
+
+
+@triton.testing.perf_report([create_benchmark_config(datatype) for datatype in _dtypes])
def benchmark(M, N, K, backend, datatype):
if datatype == torch.float8_e5m2:
a = torch.randn((M, K), device=DEVICE, dtype=torch.float16).to(torch.float8_e5m2)
9 changes: 4 additions & 5 deletions tests/benchmark/bench_persistent_matmul.py
@@ -64,11 +64,10 @@ def create_benchmark_config(datatype, static_persistent=True):
)


-# Generate configs for different datatypes
-configs = [create_benchmark_config(torch.float16)]
-# Add FP8 if available
-if hasattr(torch, "float8_e5m2"):
-    configs.append(create_benchmark_config(torch.float8_e5m2))
+# FP8 requires SM90+ (Hopper/Blackwell); skip on SM80 (A100)
+_gpu_cap = torch.cuda.get_device_capability()
+_dtypes = [torch.float16] if _gpu_cap[0] < 9 else [torch.float16, torch.float8_e5m2]
+configs = [create_benchmark_config(dt) for dt in _dtypes]


@triton.testing.perf_report(configs)