[PyTorch|common] Optimize unpadding kernel for FP8 #1866

xiaoxi-wangfj · 2025-06-11T06:58:46Z

Add multi-tensor unpadding kernel
Replace split+cat with unpadding kernel in Fp8Padding and Fp8Unpadding
Add unpadding with padding unit tests

Description

This PR introduces a high-performance CUDA kernel implementation for tensor unpadding, replacing the previous inefficient torch.split + torch.cat approach. Key improvements include:

- 2x faster unpadding kernel performance (measured with microbenchmarks).
- 0.3% MFU uplift in my end-to-end 1000B parameter model training.
- Added comprehensive unit tests for edge cases (e.g., partial padding units).
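
For readers unfamiliar with the operation, here is a minimal, framework-free sketch of the row arithmetic behind unpadding. It is not the kernel from this PR. The helper names (`round_up`, `unpad_row_ranges`) are hypothetical, and the padding unit of 16 rows is an assumption made for illustration. Each chunk of `m` real rows sits in a padded buffer at an offset that is a running sum of padded sizes, and is copied to an output offset that is a running sum of the real sizes. A split+cat approach expresses the same mapping through intermediate views and a concatenation, while a multi-tensor kernel can perform all of the copies in one launch.

```python
from itertools import accumulate


def round_up(m: int, unit: int) -> int:
    """Round m up to the next multiple of unit."""
    return (m + unit - 1) // unit * unit


def unpad_row_ranges(m_splits, unit=16):
    """Return (src_start, dst_start, num_rows) for each chunk.

    src_start indexes into the padded buffer, dst_start into the
    unpadded output. `unit` is an assumed padding granularity.
    """
    padded = [round_up(m, unit) for m in m_splits]
    # Exclusive prefix sums: where each chunk begins in each buffer.
    src_starts = [0, *accumulate(padded)][:-1]
    dst_starts = [0, *accumulate(m_splits)][:-1]
    return list(zip(src_starts, dst_starts, m_splits))


# Three chunks: one partial unit, one exact unit, one spilling into a second unit.
ranges = unpad_row_ranges([5, 16, 20], unit=16)
print(ranges)  # [(0, 0, 5), (16, 5, 16), (32, 21, 20)]
```

The second chunk shows the case where no rows are dropped, and the third shows a chunk that occupies two padding units but keeps only 20 of its 32 padded rows.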

Type of change

[ ] Documentation change (change only to the documentation, either a fix or a new content)
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Infra/Build change
[ ] Code refactoring

Changes

Replace the existing torch.split + torch.cat unpadding implementation in the Fp8Padding backward pass and the Fp8Unpadding forward pass with the new kernel
Add the unpadding implementation in common
Add unpadding unit tests in tests
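
The same operation appears in both places because the backward pass of padding has to drop the gradient rows that correspond to padded positions, which is exactly an unpadding. A round-trip check (pad, then unpad, and compare with the input) is one property a unit test for this change could assert. The sketch below illustrates that property with nested lists standing in for tensors. The function names and the padding unit of 16 are assumptions for illustration and are not the API or the tests added by this PR.

```python
def round_up(m, unit):
    """Round m up to the next multiple of unit."""
    return (m + unit - 1) // unit * unit


def pad_rows(rows, m_splits, unit=16, fill=0.0):
    """Append fill rows so every chunk has a multiple of `unit` rows."""
    out, start = [], 0
    width = len(rows[0]) if rows else 0
    for m in m_splits:
        out.extend(rows[start:start + m])
        out.extend([[fill] * width for _ in range(round_up(m, unit) - m)])
        start += m
    return out


def unpad_rows(padded_rows, m_splits, unit=16):
    """Keep only the first m rows of each padded chunk."""
    out, start = [], 0
    for m in m_splits:
        out.extend(padded_rows[start:start + m])
        start += round_up(m, unit)  # skip over the padding rows
    return out


# Chunks chosen to cover a partial unit, an exact unit, and one row past a unit.
m_splits = [3, 16, 17]
rows = [[float(i), float(i) + 0.5] for i in range(sum(m_splits))]
padded = pad_rows(rows, m_splits)
restored = unpad_rows(padded, m_splits)
print(len(rows), len(padded), restored == rows)  # 36 64 True
```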

Checklist:

[x] I have read and followed the contributing guidelines
[x] The functionality is complete
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] My changes generate no new warnings
[x] I have added tests that prove my fix is effective or that my feature works
[ ] New and existing unit tests pass locally with my changes

yaox12 · 2025-06-24T05:26:17Z

/te-ci

yaox12 · 2025-06-25T05:47:26Z

/te-ci

xiaoxi-wangfj and others added 3 commits June 11, 2025 06:29

[PyTorch|common] Implement unpadding kernel for FP8

001b2b1

1. Add multi-tensor unpadding kernel
2. Replace split+cat with unpadding kernel in Fp8Padding and Fp8Unpadding
3. Add unpadding with padding unit tests

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

1c3017b

for more information, see https://pre-commit.ci

Merge branch 'main' into main

7cafdd4

xiaoxi-wangfj changed the title ~~[PyTorch|common] Implement unpadding kernel for FP8~~ [PyTorch|common] Optimize unpadding kernel for FP8 Jun 11, 2025

xiaoxi-wangfj and others added 4 commits June 12, 2025 15:06

Merge branch 'main' into main

2a0afb1

Merge branch 'main' into main

27f4713

add license

feb99d7

Signed-off-by: Xin Yao <xiny@nvidia.com>

Update padding.cu

acd52a6

Signed-off-by: Xin Yao <xiny@nvidia.com>

Merge branch 'main' into main

b530dcb
