Merged
Conversation
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…make staggered rope application faster Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
yaox12 reviewed Apr 2, 2025
yaox12 reviewed Apr 2, 2025
yaox12 reviewed Apr 2, 2025
yaox12 reviewed Apr 2, 2025
Collaborator
I agree with @yaox12's comments. I think we need to add some documentation about our support matrix for
…to rope_enhancement
c48b9ac to ccc6e27 (Compare)
for more information, see https://pre-commit.ci
…rmerEngine into rope_enhancement
Collaborator (Author)
/te-ci pytorch
cyanguwa reviewed Apr 18, 2025
…rmerEngine into rope_enhancement
yaox12 reviewed Apr 21, 2025
cyanguwa reviewed Apr 21, 2025
Collaborator (Author)
/te-ci pytorch
cyanguwa reviewed Apr 22, 2025
cyanguwa reviewed Apr 22, 2025
cyanguwa reviewed Apr 22, 2025
cyanguwa reviewed Apr 22, 2025
…rmerEngine into rope_enhancement
cyanguwa approved these changes Apr 22, 2025
KshitijLakhani pushed a commit that referenced this pull request on Apr 23, 2025
* add support for `sb1d` freqs tensor in Fused RoPE
* add `start_positions` variable to `apply_rotary_pos_emb` function to make staggered rope application faster
* add pytorch path for `start_positions` and corresponding tests
* add tests for start_positions with thd
* fixes from feedback
* remove start_positions from backward pass
* from feedback
* make notes shorter
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Description
TL;DR
Enable application of staggered RoPE embeddings to different sequences within the same batch.
During generation tasks, different sequences in a batch might have different start positions (technically different end positions as well, but those are bounded by the max sequence length in the batch, so we can afford to ignore them for now). This change modifies the RoPE kernel to apply the embeddings in a staggered manner to different sequences in the batch, controlled by an argument `start_positions`.
(The `start_positions` argument and related changes are directly adapted from #829, which was authored by @pggPL.)
`start_positions` is only intended to be used in generation/inference mode and works with the `sbhd`/`bshd`/`thd` input tensor formats.
`start_positions` is not intended for Context Parallelism use cases, since CP is not used during inference/generation. It should be possible to support that as well, but it is outside the scope of this PR.
Fixes # (issue)
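To make the staggering concrete, here is a minimal NumPy sketch of the idea (not the TransformerEngine kernel, which is fused and supports multiple tensor formats): each sequence in the batch is rotated as if its tokens occupy absolute positions starting at its own offset. The function names `rope_angles` and `apply_staggered_rope` are illustrative, not part of the PR.

```python
# Illustrative sketch only: per-sequence staggered RoPE in NumPy.
# Assumes a dense (batch, seq, dim) layout (i.e. bshd with heads folded in).
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles for the given absolute positions; dim must be even."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    return np.outer(positions, inv_freq)                     # (seq, dim/2)

def apply_staggered_rope(x, start_positions, base=10000.0):
    """x: (batch, seq, dim); start_positions: per-sequence start offsets.

    Sequence b is rotated as if its tokens sit at absolute positions
    start_positions[b] .. start_positions[b] + seq - 1.
    """
    batch, seq, dim = x.shape
    out = np.empty_like(x)
    for b in range(batch):
        pos = np.arange(seq) + start_positions[b]
        ang = rope_angles(pos, dim, base)
        cos, sin = np.cos(ang), np.sin(ang)
        # Rotate each even/odd pair of channels by its angle.
        x1, x2 = x[b, :, 0::2], x[b, :, 1::2]
        out[b, :, 0::2] = x1 * cos - x2 * sin
        out[b, :, 1::2] = x1 * sin + x2 * cos
    return out
```

A useful sanity check of the semantics: applying RoPE with `start_positions=[s]` to a suffix of a sequence gives the same result as applying it from position 0 to the whole sequence and then taking the suffix, which is exactly what incremental generation needs.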
Type of change
Changes
* Added a `start_positions` argument to the `apply_rotary_pos_emb` function; this is non-breaking since `start_positions` is a default kwarg here.
* Updated `FusedRoPEFunc` and all the extensions/kernels that are called internally by this function.
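The non-breaking claim rests on `start_positions` defaulting to a no-op value. A toy sketch of that pattern (the function name matches the PR, but the body and other parameters are purely illustrative; the real `apply_rotary_pos_emb` takes tensors and format flags):

```python
# Illustrative only: a new kwarg with a None default keeps existing call
# sites working, while new inference call sites can stagger per-sequence
# starts. The position bookkeeping here stands in for the actual kernel.
from typing import Optional, Sequence

def apply_rotary_pos_emb(t, freqs, start_positions: Optional[Sequence[int]] = None):
    # None means "all sequences start at position 0" (the old behavior).
    offsets = [0] * len(t) if start_positions is None else list(start_positions)
    # Stand-in for the kernel: tag each token with its absolute position.
    return [[(tok, pos + off) for pos, tok in enumerate(seq)]
            for seq, off in zip(t, offsets)]

# Existing call sites keep working unchanged:
old_style = apply_rotary_pos_emb([["a", "b"]], freqs=None)
# New generation/inference call sites can pass per-sequence starts:
new_style = apply_rotary_pos_emb([["a", "b"]], freqs=None, start_positions=[5])
```

Because the kwarg defaults to `None`, every pre-existing caller compiles and behaves exactly as before, which is what makes the API change backward compatible.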