feat: chunked fused linear cross-entropy kernel forward #65
hannahli-nv merged 10 commits into NVIDIA:main
Conversation
/ok to test 4ca2ca5

Hi @aghilann, in order to make CI monitor your case correctly, your benchmark should follow: if compute-bound report GBps, if math-bound report TFLOPS.
Could you please move this file to src/tilegym/ops/cutile/experimental folder since it's a newly added kernel?
And move this benchmark file to tests/benchmark/experimental.
And move this test file to tests/ops/experimental.
Addressed the PR comments; ready for re-review.
Hey, a bit confused by the comment. Did you mean: if memory-bound → report GB/s? Or I might be misunderstanding something. @xjmxyt
/ok to test fcdea3c
Yes, that's correct. The plot_name needs to end with "-GBps" or "-TFLOPS" for it to be recognized by the CI summary page.
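The convention above can be sketched as a tiny check. This is a minimal illustration of the naming rule only; the function and constant names below are hypothetical, not part of TileGym's or the CI's actual API.

```python
# Minimal sketch of the CI naming convention described above.
# Assumption: the CI summary page keys purely on the plot_name suffix,
# "-GBps" for memory-bound benchmarks and "-TFLOPS" for math-bound ones.
VALID_SUFFIXES = ("-GBps", "-TFLOPS")  # hypothetical constant for illustration

def is_ci_recognized(plot_name: str) -> bool:
    """Return True if a benchmark with this plot_name would be picked up."""
    return plot_name.endswith(VALID_SUFFIXES)

print(is_ci_recognized("fused-linear-cross-entropy-GBps"))  # True
print(is_ci_recognized("fused-linear-cross-entropy-ms"))    # False
```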
@hannahli-nv Had an import error in my code which caused a CI failure; can you re-run? Thanks!
/ok to test 356d135

/ok to test da84607
Summary
Adds a cuTile chunked fused linear cross-entropy op to TileGym, plus tests and benchmark coverage.
Added
src/tilegym/ops/cutile/fused_linear_cross_entropy.py
tests/ops/test_fused_linear_cross_entropy.py
tests/benchmark/bench_fused_linear_cross_entropy.py
Motivation
For LLM training, the LM head + cross-entropy step is often the main source of OOMs as context length scales, since it materializes the full [tokens, vocab] logits tensor. This kernel is currently slower than dense PyTorch CE in latency. Still, the key benefit is significantly lower peak memory at larger batch sizes and context lengths, which is often the practical bottleneck for training.
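The memory saving comes from chunking: instead of materializing all logits at once, rows are processed in chunks so peak memory scales with chunk_size × vocab rather than tokens × vocab. A reference sketch of that idea in plain PyTorch, assuming a standard LM head layout; this is illustrative only, not the cuTile kernel in this PR:

```python
# Chunked linear + cross-entropy reference (illustrative, not the cuTile kernel).
# x: [BT, H] hidden states, weight: [V, H] LM head, targets: [BT] class indices.
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(x, weight, targets, chunk_size=1024):
    total_loss = x.new_zeros(())
    n = x.shape[0]
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # Only a [chunk, V] logits slice lives at a time; it is freed each
        # iteration, so peak memory is chunk_size * V instead of BT * V.
        logits = x[start:end] @ weight.t()
        total_loss = total_loss + F.cross_entropy(
            logits, targets[start:end], reduction="sum"
        )
    return total_loss / n  # mean over all BT tokens
```

Numerically this matches the dense `F.cross_entropy(x @ weight.t(), targets)` result; the fused kernel additionally avoids the per-chunk matmul/loss round trips through global memory.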
Benchmark Results (RTX 5070 Ti)
Command:
uv run python tests/benchmark/bench_fused_linear_cross_entropy.py
Latency (ms): slightly higher than dense PyTorch CE (not ideal)
Peak memory (MB): substantially lower
Checklist
- [x] Ran ./format.sh