add `multi_wave_cached` rms norm by liqiangxl · Pull Request #113 · NVIDIA/TileGym

liqiangxl · 2026-04-23T02:06:42Z

Add rms_norm_kernel_multi_wave_cached, a single-tile RMSNorm kernel that caches inputs in registers to avoid reloading from memory.
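To illustrate the difference between reloading and caching, here is a minimal NumPy sketch of one RMSNorm row, not the actual TileGym kernels. `CountingRow`, `rms_norm_row_reload`, and `rms_norm_row_cached` are hypothetical names used only for illustration. The sketch assumes the reload variant makes two tiled passes over the row, while the cached variant loads the whole row once as a single tile and reuses it.

```python
import numpy as np


class CountingRow:
    """Hypothetical stand-in for global memory: counts elements read from one row."""

    def __init__(self, row):
        self.row = row
        self.elements_loaded = 0

    def load(self, start, stop):
        self.elements_loaded += stop - start
        return self.row[start:stop].astype(np.float32)


def rms_norm_row_reload(mem, weight, tile, eps=1e-6):
    """Two passes: accumulate the sum of squares, then reload each tile to normalize."""
    n = mem.row.shape[0]
    sum_sq = 0.0
    for s in range(0, n, tile):
        x = mem.load(s, min(s + tile, n))
        sum_sq += float(np.sum(x * x))
    inv_rms = np.float32(1.0 / np.sqrt(sum_sq / n + eps))
    out = np.empty(n, dtype=np.float32)
    for s in range(0, n, tile):
        e = min(s + tile, n)
        x = mem.load(s, e)  # second read of the same data
        out[s:e] = x * inv_rms * weight[s:e]
    return out


def rms_norm_row_cached(mem, weight, eps=1e-6):
    """Single tile: load the row once and keep it (in registers, on a GPU)."""
    n = mem.row.shape[0]
    x = mem.load(0, n)
    inv_rms = 1.0 / np.sqrt(np.mean(x * x) + eps)
    return x * inv_rms * weight
```

In this sketch the reload variant reads every input element twice and the cached variant reads it once, which is the memory traffic the new kernel is meant to avoid.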

Replace the boolean static_persistent parameter with a mode parameter for explicit kernel selection:

- None: heuristic selection based on tensor shape (default)
- "static_persistent": rms_norm_kernel_static_persistent
- "multi_wave_reload": rms_norm_kernel_multi_wave_reload
- "multi_wave_cached": rms_norm_kernel_multi_wave_cached
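The mode-to-kernel mapping above could be dispatched roughly as follows. This is a sketch under assumptions, not the code in this PR: `select_kernel_name`, `_placeholder_heuristic`, and `HYPOTHETICAL_MAX_CACHED_N` are invented for illustration, and the real shape heuristic is not described here, so the placeholder threshold is arbitrary.

```python
from typing import Optional

# Mode strings and kernel names as listed in the PR description.
KERNEL_BY_MODE = {
    "static_persistent": "rms_norm_kernel_static_persistent",
    "multi_wave_reload": "rms_norm_kernel_multi_wave_reload",
    "multi_wave_cached": "rms_norm_kernel_multi_wave_cached",
}

# Hypothetical threshold, chosen only so the sketch runs; not taken from the PR.
HYPOTHETICAL_MAX_CACHED_N = 16384


def _placeholder_heuristic(m: int, n: int) -> str:
    """Stand-in for the real shape-based heuristic, which is not shown here."""
    if n <= HYPOTHETICAL_MAX_CACHED_N:
        return KERNEL_BY_MODE["multi_wave_cached"]
    return KERNEL_BY_MODE["multi_wave_reload"]


def select_kernel_name(mode: Optional[str], m: int, n: int) -> str:
    """Return the kernel name for an explicit mode, or fall back to the heuristic."""
    if mode is None:
        return _placeholder_heuristic(m, n)
    if mode not in KERNEL_BY_MODE:
        valid = ", ".join(sorted(KERNEL_BY_MODE))
        raise ValueError(f"unknown mode {mode!r}; expected None or one of: {valid}")
    return KERNEL_BY_MODE[mode]
```

Compared with a boolean flag, a string mode rejects unknown values explicitly and leaves room for further kernels without another signature change.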

Rename kernels for consistency:

- rms_norm_kernel_gather -> rms_norm_kernel_multi_wave_reload
- rms_norm_kernel_gather_regs_cached -> rms_norm_kernel_multi_wave_cached

Update benchmark to compare all kernel modes side-by-side per dtype.

Performance on GB200 with M = 4096

The current do_bench_cudagraph-based measurements are not reliable (see #82), so I double-checked with torch.profiler (see 3c8de12).

| dtype | N | Reload, do_bench_cudagraph (GB/s) | Cached, do_bench_cudagraph (GB/s) | Speedup, do_bench_cudagraph | Reload, profiler (GB/s) | Cached, profiler (GB/s) | Speedup, profiler |
|---|---|---|---|---|---|---|---|
| float16 | 1024 | 4314.1 | 4428.8 | 1.03x | 3084.4 | 3139.8 | 1.02x |
| float16 | 2048 | 6253.6 | 6252.2 | 1.00x | 4128.8 | 4128.8 | 1.00x |
| float16 | 4096 | 8152.3 | 8241.5 | 1.01x | 5128.1 | 5204.1 | 1.01x |
| float16 | 8192 | 5916.0 | 6317.6 | 1.07x | 5645.5 | 5949.8 | 1.05x |
| float16 | 16384 | 6197.5 | 6404.7 | 1.03x | 5967.0 | 6365.4 | 1.07x |
| bfloat16 | 1024 | 4364.3 | 4374.6 | 1.00x | 3066.4 | 3102.7 | 1.01x |
| bfloat16 | 2048 | 6319.6 | 6328.8 | 1.00x | 4049.5 | 4033.5 | 1.00x |
| bfloat16 | 4096 | 8027.3 | 7560.4 | 0.94x | 4958.8 | 5103.2 | 1.03x |
| bfloat16 | 8192 | 5984.2 | 6400.0 | 1.07x | 5563.6 | 5908.2 | 1.06x |
| bfloat16 | 16384 | 6017.0 | 6385.3 | 1.06x | 5742.3 | 6341.7 | 1.10x |
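For readers who want to reproduce the derived columns, the sketch below shows one common way to turn a measured kernel time into GB/s and how the speedup column follows from the two bandwidth columns. The byte count (read the input, read the weight, write the output) is an assumption; the benchmark's exact accounting is not stated in this PR, and the function names are hypothetical.

```python
# Element sizes in bytes for the dtypes in the table.
DTYPE_BYTES = {"float16": 2, "bfloat16": 2}


def rms_norm_bytes_moved(m: int, n: int, dtype: str) -> int:
    """Assumed traffic: read x (M*N), read weight (N), write y (M*N)."""
    return (2 * m * n + n) * DTYPE_BYTES[dtype]


def effective_bandwidth_gbps(m: int, n: int, dtype: str, seconds: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one kernel launch."""
    return rms_norm_bytes_moved(m, n, dtype) / seconds / 1e9


def speedup(reload_gbps: float, cached_gbps: float) -> float:
    """Cached over reload; values above 1.0 mean the cached kernel is faster."""
    return cached_gbps / reload_gbps


# Profiler columns for bfloat16, N = 16384, taken from the table above.
print(f"{speedup(5742.3, 6341.7):.2f}x")
```

Because both kernels move the same number of bytes for a given shape, the ratio of bandwidths equals the ratio of run times, so the speedup does not depend on the byte-count assumption.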

Description

CI Configuration

config:
  build: true
  # valid options are "ops", "benchmark", and "sanity"
  test: ["ops", "benchmark"]

Checklist

Code formatted and imports sorted via repo specifications (./format.sh)
Documentation updated (if needed)
CI configuration reviewed

copy-pr-bot · 2026-04-23T02:06:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

xjmxyt · 2026-04-23T02:19:10Z

 DEVICE = triton.runtime.driver.active.get_active_torch_device()


 def reference_rms_norm(


Should keep the same interface as src/tilegym/ops/ops.py

xjmxyt · 2026-04-29T15:38:39Z

/ok to test 99c6f79

xjmxyt reviewed Apr 23, 2026

liqiangxl and others added 2 commits April 23, 2026 06:01

ensure same signature

757a591

Merge branch 'main' into rmsnorm-mode-parameter

58c72f6

liqiangxl requested a review from xjmxyt April 27, 2026 13:12

Merge branch 'main' into rmsnorm-mode-parameter

99c6f79

xjmxyt approved these changes Apr 29, 2026

xjmxyt merged commit d9bf003 into NVIDIA:main Apr 29, 2026
18 checks passed

xjmxyt merged 4 commits into NVIDIA:main from liqiangxl:rmsnorm-mode-parameter
