
RoPE enhancements #1478

Merged
sudhakarsingh27 merged 28 commits into NVIDIA:main from sudhakarsingh27:rope_enhancement
Apr 22, 2025

Conversation

@sudhakarsingh27 (Collaborator) commented Feb 11, 2025

Description

TL;DR: Enable staggered application of RoPE embeddings to different sequences within the same batch.

During generation, different sequences in a batch may have different start positions (technically different end positions as well, but those are bounded by the maximum sequence length in the batch, so we can afford to ignore them for now). This change modifies the RoPE kernel to apply the embeddings at a per-sequence offset, controlled by a new start_positions argument.

(The start_positions and related changes are directly adapted from #829 which was authored by @pggPL)

  1. start_positions is only intended for generation/inference and works with the sbhd/bshd/thd input tensor formats.
  2. start_positions is not intended for context-parallel (CP) use cases, as CP is not used during inference/generation. Supporting CP should be possible, but it is out of scope for this PR.
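The staggered application described above can be sketched as a plain NumPy reference implementation (hypothetical, for illustration only; the PR's actual change is a fused CUDA kernel, and this sketch assumes a bshd layout and the half-split rotation convention, which may differ from TE's internals):

```python
import numpy as np

def rope_staggered(x, start_positions, base=10000.0):
    """Apply RoPE to x of shape [b, s, h, d], offsetting each
    sequence's positions by its entry in start_positions."""
    b, s, h, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    out = np.empty_like(x)
    for i in range(b):
        # Each sequence starts at its own offset instead of position 0.
        pos = start_positions[i] + np.arange(s)
        ang = np.outer(pos, inv_freq)           # [s, half]
        cos = np.cos(ang)[:, None, :]           # broadcast over heads
        sin = np.sin(ang)[:, None, :]
        x1, x2 = x[i, ..., :half], x[i, ..., half:]
        out[i, ..., :half] = x1 * cos - x2 * sin
        out[i, ..., half:] = x1 * sin + x2 * cos
    return out
```

The key property is that applying RoPE with start position p to a suffix of a sequence matches applying it to the full sequence from position 0, which is exactly what incremental generation needs.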

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Non-breaking change to the apply_rotary_pos_emb function, since start_positions is added as a keyword argument with a default value.
  • Breaking changes to FusedRoPEFunc and all the extensions/kernels that it calls internally.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…make staggered rope application faster

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 changed the title RoPE functionality enhancements RoPE enhancements Feb 11, 2025
@sudhakarsingh27 sudhakarsingh27 self-assigned this Feb 11, 2025
@cyanguwa cyanguwa added the 2.3.0 label Mar 12, 2025
@cyanguwa cyanguwa requested a review from yaox12 March 31, 2025 18:35
@yaox12 (Member) left a comment

Generally LGTM. With #1626, the fused and unfused versions have the same support matrix for RoPE options, so it would be better to merge the unfused implementation into apply_rotary_pos_emb in rope.py.

@cyanguwa (Collaborator) commented Apr 11, 2025

I agree with @yaox12's comments. I think we need to add documentation about our support matrix for start_positions and how to use it (None, or freqs of shape [s, b, 1, d]). We should also expand support to all three qkv_formats and to both CP and non-CP cases.
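One way to read the [s, b, 1, d] freqs shape mentioned above is as a per-sequence angle table: because the batch dimension is explicit, each batch element can carry its own position offset baked into its frequencies. A hypothetical NumPy construction of such a tensor (TE's actual freqs convention may differ; the concatenated-halves layout here is an assumption):

```python
import numpy as np

def build_staggered_freqs(seq_len, start_positions, dim, base=10000.0):
    """Build a freqs tensor of shape [s, b, 1, d] where each batch
    element's angles are offset by its start position."""
    half = dim // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    b = len(start_positions)
    freqs = np.empty((seq_len, b, 1, dim))
    for i, p0 in enumerate(start_positions):
        pos = p0 + np.arange(seq_len)
        ang = np.outer(pos, inv_freq)  # [s, half]
        # Duplicate the half-dim angles to fill the full head dim.
        freqs[:, i, 0, :] = np.concatenate([ang, ang], axis=-1)
    return freqs
```

With this layout, row t of sequence i holds the angles for absolute position start_positions[i] + t, so a kernel consuming it needs no separate offset argument.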

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
pre-commit-ci bot and others added 7 commits April 16, 2025 20:41
@sudhakarsingh27 sudhakarsingh27 requested a review from yaox12 April 18, 2025 07:36
@sudhakarsingh27 (Collaborator, Author) commented:

/te-ci pytorch

sudhakarsingh27 and others added 7 commits April 20, 2025 16:36
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
sudhakarsingh27 and others added 4 commits April 21, 2025 13:14
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 (Collaborator, Author) commented:

/te-ci pytorch

yaox12 previously approved these changes Apr 22, 2025

@yaox12 (Member) left a comment

LGTM.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 merged commit 94bff09 into NVIDIA:main Apr 22, 2025
11 checks passed
KshitijLakhani pushed a commit that referenced this pull request Apr 23, 2025
* add support for `sb1d` freqs tensor in Fused RoPE

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add `start_positions` variable to `apply_rotary_pos_emb` function to make staggered rope application faster

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add pytorch path for `start_positions` and corresponding tests

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add tests for start_positions with thd

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes from feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove start_positions from backward pass

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* from feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make notes shorter

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>