feat: chunked fused linear cross-entropy kernel forward #65
hannahli-nv merged 10 commits into NVIDIA:main
Conversation
/ok to test 4ca2ca5

Hi @aghilann, in order to make CI monitor your case correctly, your benchmark should follow: if compute-bound report GBps, if math-bound report TFLOPS.
Could you please move this file to src/tilegym/ops/cutile/experimental folder since it's a newly added kernel?
And move this benchmark file to tests/benchmark/experimental.
And move this test file to tests/ops/experimental.
Addressed the PR comments; ready for re-review.
Hey, a bit confused by the comment. Did you mean: if memory-bound → report GB/s? Or I might be misunderstanding something. @xjmxyt
/ok to test fcdea3c
Yes, that's correct. The plot_name needs to end with "-GBps" or "-TFLOPS" for it to be recognized by the CI summary page.
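The convention above can be sketched as a tiny check. This is a minimal illustration of the naming rule only; the function and constant names below are hypothetical, not part of TileGym's or the CI's actual API.

```python
# Minimal sketch of the CI naming convention described above.
# Assumption: the CI summary page keys purely on the plot_name suffix,
# "-GBps" for memory-bound benchmarks and "-TFLOPS" for math-bound ones.
VALID_SUFFIXES = ("-GBps", "-TFLOPS")  # hypothetical constant for illustration

def is_ci_recognized(plot_name: str) -> bool:
    """Return True if a benchmark with this plot_name would be picked up."""
    return plot_name.endswith(VALID_SUFFIXES)

print(is_ci_recognized("fused-linear-cross-entropy-GBps"))  # True
print(is_ci_recognized("fused-linear-cross-entropy-ms"))    # False
```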
@hannahli-nv Had an import error in my code which caused a CI failure; can you re-run? Thanks!
/ok to test 356d135

/ok to test da84607
Summary
Adds a cuTile chunked fused linear cross-entropy op to TileGym, plus tests and benchmark coverage.
Added
src/tilegym/ops/cutile/fused_linear_cross_entropy.py
tests/ops/test_fused_linear_cross_entropy.py
tests/benchmark/bench_fused_linear_cross_entropy.py
Motivation
For LLM training, the LM head + cross-entropy step is often the main source of OOMs as context length scales, since it materializes the full [tokens, vocab] logits tensor. This kernel is currently slower than dense PyTorch CE in latency. Still, the key benefit is significantly lower peak memory at larger batch sizes and context lengths, which is often the practical bottleneck for training.
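The memory saving comes from chunking: instead of materializing all logits at once, rows are processed in chunks so peak memory scales with chunk_size × vocab rather than tokens × vocab. A reference sketch of that idea in plain PyTorch, assuming a standard LM head layout; this is illustrative only, not the cuTile kernel in this PR:

```python
# Chunked linear + cross-entropy reference (illustrative, not the cuTile kernel).
# x: [BT, H] hidden states, weight: [V, H] LM head, targets: [BT] class indices.
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(x, weight, targets, chunk_size=1024):
    total_loss = x.new_zeros(())
    n = x.shape[0]
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # Only a [chunk, V] logits slice lives at a time; it is freed each
        # iteration, so peak memory is chunk_size * V instead of BT * V.
        logits = x[start:end] @ weight.t()
        total_loss = total_loss + F.cross_entropy(
            logits, targets[start:end], reduction="sum"
        )
    return total_loss / n  # mean over all BT tokens
```

Numerically this matches the dense `F.cross_entropy(x @ weight.t(), targets)` result; the fused kernel additionally avoids the per-chunk matmul/loss round trips through global memory.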
Benchmark Results (RTX 5070 Ti)
Command:
uv run python tests/benchmark/bench_fused_linear_cross_entropy.py
Latency (ms): slightly higher than dense PyTorch CE (not ideal)
Peak memory (MB): substantially lower
Checklist
- [x] Ran ./format.sh