feat: RMSNorm backward pass kernels #29
Conversation
why is the bf16 version so much slower than fp16?
I changed the design and removed the second kernel altogether, since it was a huge bottleneck and there was an easier way. I honestly have no clue why bf16 was so much slower than fp16, since I deleted the kernel before investigating it.
We use bfloat16 for everything, so this would be a nonstarter.
@vgoklani Yes, I'm aware; that's why I changed the algorithm (it was slow regardless of BF16 vs FP16, though I'm not sure why BF16 was so much worse). Anyway, with my new implementation, BF16 and FP16 performance is now pretty similar.
@hannahli-nv could I get a review + CI on this PR?
/ok to test 24e65b8
/ok to test 077e8fd
Oops, don't know how that # made it in there. Thanks for fixing it.
hannahli-nv left a comment
Overall LGTM, thanks for the contribution!
Description
Adds RMSNorm backward pass to TileGym - the first backward kernel implementation.
Implementation:
- The kernel (`rms_norm_backward_kernel_dx`) computes dx row-parallel and stores the intermediate values (dy * x * rstd) into a float32 `temp_buffer`
- dw is then computed as `temp_buffer.sum(dim=0)` using PyTorch's optimized reduction (avoids a second kernel with a different access pattern)
- A Torch reference implementation (`rms_norm_backward_torch`) is included for testing/benchmarking (a pure-PyTorch sketch of the same formulas follows this list)
- Tests and benchmarks live in `test_rmsnorm_backward.py` and `bench_rmsnorm_backward.py`
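For readers who want the formulas behind that dx/dw split, here is a minimal pure-PyTorch sketch of an RMSNorm backward that accumulates in float32. The function name, argument order, and eps default are assumptions for illustration; this is not the actual `rms_norm_backward_torch` from the PR, just the standard gradients that the dx kernel and the `temp_buffer` reduction realize.

```python
import torch

def rms_norm_backward_sketch(dy, x, weight, eps=1e-6):
    """Illustrative RMSNorm backward (not the PR's rms_norm_backward_torch).

    Assumes x and dy of shape (M, N) and weight of shape (N,).
    """
    # Upcast so every reduction accumulates in float32, regardless of input dtype.
    x32, dy32, w32 = x.float(), dy.float(), weight.float()

    # Forward quantities: rstd = 1 / sqrt(mean(x^2) + eps), x_hat = x * rstd.
    rstd = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    x_hat = x32 * rstd

    # Weight gradient: the per-element contributions (dy * x * rstd) summed over
    # the row dimension -- the same quantity the kernel stages in temp_buffer.
    dw = (dy32 * x_hat).sum(dim=0)

    # Input gradient: rstd * (dy*w - x_hat * mean(dy*w*x_hat)), computed per row.
    g = dy32 * w32
    dx = rstd * (g - x_hat * (g * x_hat).mean(dim=-1, keepdim=True))

    return dx.to(x.dtype), dw.to(weight.dtype)

# Quick self-check against autograd.
if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(64, 128, requires_grad=True)
    w = torch.randn(128, requires_grad=True)
    eps = 1e-6
    y = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * w
    dy = torch.randn_like(y)
    y.backward(dy)
    dx, dw = rms_norm_backward_sketch(dy, x.detach(), w.detach(), eps)
    print(torch.allclose(dx, x.grad, atol=1e-5), torch.allclose(dw, w.grad, atol=1e-5))
```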
Performance
2-5x faster than PyTorch across all dtypes.
Both the Torch reference and the cuTILE kernel still accumulate in FP32.
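As a quick, hypothetical illustration (not code from the PR) of why the accumulation dtype matters: adding many small values directly into a bfloat16 accumulator stalls once the running sum is large relative to bf16's short mantissa, while a float32 accumulator keeps the low-order contributions.

```python
import torch

# Illustration only: summing small increments in bf16 vs. accumulating in fp32.
torch.manual_seed(0)
vals = torch.rand(4096) * 1e-3          # small positive float32 increments

acc_bf16 = torch.zeros((), dtype=torch.bfloat16)
for v in vals.to(torch.bfloat16):
    acc_bf16 = acc_bf16 + v             # each add rounds to bf16 precision

acc_fp32 = vals.sum()                   # float32 accumulation, cast once at the end if needed
print(f"bf16 running sum: {acc_bf16.item():.4f}, fp32 sum: {acc_fp32.item():.4f}")
```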
[Benchmark plots: bfloat16, float16, float32]
CI Configuration
Checklist
Code is formatted (./format.sh)