
[GPTOSS] Support sequence parallelism with attention sinks#558

Closed
SumanthRH wants to merge 13 commits into NovaSky-AI:main from SumanthRH:gptoss-flex-seq-parallel

Conversation

SumanthRH (Member) commented Oct 22, 2025

What does this PR do?

Follow-up PR to #515. Adds support for sequence parallelism in the custom flex attention implementation, to scale to longer context lengths.

Summary

  1. Support sequence parallelism for attention sinks: Currently, we use Unsloth's flex attention implementation, which has a score-mod function sink_score_mod. This function uses per-attention-head sink weights to modify the attention score. With Ulysses sequence parallelism, each rank initially holds query states of shape (bsz, seq_len // sp_size, num_heads, hidden_dim); after the All2All this becomes (bsz, seq_len, num_heads // sp_size, hidden_dim). Different SP ranks therefore compute attention for different heads, so we need to index into the sink weights appropriately for each SP rank.

  2. Support custom attention functions for Ulysses sequence parallelism: The current Ulysses sequence parallel implementation supports only flash attention. This PR adds a simple wrapper to support custom attention functions.

  3. Transpose dimensions for Q, K, V states in GPT-OSS: There's also an issue with the current GPT-OSS attention forward implementation: it receives query states in the format (bsz, num_heads, seq_len, hidden_dim) instead of the standard (bsz, seq_len, num_heads, hidden_dim). This PR adds an additional transpose before calling the flex attention function, making it compatible with the Ulysses sequence parallelism implementation.
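To make the head-partitioning concrete, here is a minimal pure-Python sketch of the shape bookkeeping described in items 1 and 3. The helper names (sink_slice_for_rank, shard_shapes) are illustrative, not the actual functions in this PR; the real implementation operates on tensors and slices the sink-weight tensor inside the score-mod.

```python
# Illustrative sketch of Ulysses SP head partitioning for attention sinks.
# Helper names are hypothetical; the PR's real code works on torch tensors.

def sink_slice_for_rank(num_heads: int, sp_size: int, sp_rank: int) -> slice:
    """After the All2All, rank sp_rank holds heads
    [sp_rank * heads_per_rank, (sp_rank + 1) * heads_per_rank), so its
    score-mod must use the matching slice of the per-head sink weights."""
    assert num_heads % sp_size == 0, "num_heads must be divisible by sp_size"
    heads_per_rank = num_heads // sp_size
    return slice(sp_rank * heads_per_rank, (sp_rank + 1) * heads_per_rank)

def shard_shapes(bsz, seq_len, num_heads, hidden_dim, sp_size):
    """Query-state shapes before and after the All2All: sequence-sharded
    with all heads, then head-sharded with the full sequence."""
    before = (bsz, seq_len // sp_size, num_heads, hidden_dim)
    after = (bsz, seq_len, num_heads // sp_size, hidden_dim)
    return before, after
```

For example, with 64 heads and sp_size=4, rank 1 handles heads 16..31 and must index sink weights with slice(16, 32). Item 3's transpose is what puts the states into the (bsz, seq_len, num_heads, hidden_dim) layout that this sharding assumes.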

Benchmarking

Using our full context training scripts, I tested the maximum context lengths I could use: 4K, 8K, 16K, 32K, and 64K, for single-node training on one 8xH100 node. Without sequence parallelism I can scale up to a 32K context length; with sequence parallelism I can go beyond 32K.

Correctness checks

I modified our existing test_model_wrapper.py file to use GPT-OSS to validate that logprobs with the new implementation are correct. Logprobs match but differ in the second decimal place (the error is higher because of the flex attention implementation).
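A check of this kind amounts to comparing logprobs under a loose absolute tolerance, since agreement is only expected to roughly the second decimal place. A minimal sketch, with illustrative names (not the actual test code):

```python
# Sketch of a logprob correctness check with a loose absolute tolerance,
# matching "differ in the second decimal place". Names are illustrative.
import math

def logprobs_close(ref, new, atol=1e-2):
    """True if every pair of logprobs agrees within atol."""
    return all(math.isclose(a, b, abs_tol=atol) for a, b in zip(ref, new))
```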

E2E Validation

