[Examples] Add an example for step-wise training #436
SumanthRH merged 44 commits into NovaSky-AI:main
Conversation
/gemini review
Code Review
This pull request introduces a valuable example for step-wise training, which is a great addition. The implementation of custom components like the StepWiseGenerator and StepWiseTrainer is well-structured. I've identified a few areas for improvement, primarily concerning correctness in the example's run script and padding logic, as well as some opportunities to enhance performance and code readability. Addressing these points will make the example more robust and easier to follow.
```python
additional_dims = tuple(tensor.shape[1:]) if len(tensor.shape) > 1 else ()
...
if key == "is_last_step":
    padding_tensor = torch.ones(pad_size, *additional_dims, dtype=tensor.dtype, device=tensor.device)
```
I didn't follow why this should be ones instead of zeros -- can you explain?
`is_last_step` needs to have at least one non-zero entry for each trajectory - each trajectory has at least one last step. If you pad with all zeros, that means the padding trajectories have no last step at all.
Ideally `pad_batch` is very generic and we can use it in other examples as well, but `is_last_step` is special and I'd rather do the padding correctly here.
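Putting this thread together, here is a minimal sketch of what a generic `pad_batch` with the `is_last_step` special case might look like. The function name comes from the discussion above, but its exact signature and the dict-of-tensors batch layout are assumptions for illustration:

```python
import torch


def pad_batch(batch: dict, pad_size: int) -> dict:
    """Pad every tensor in `batch` with `pad_size` extra rows (illustrative sketch).

    Most keys are zero-padded, but `is_last_step` is padded with ones so that
    every padding "trajectory" still has a last step; all-zero padding would
    produce trajectories with no last step at all.
    """
    padded = {}
    for key, tensor in batch.items():
        # Preserve any trailing dimensions beyond the batch dimension.
        additional_dims = tuple(tensor.shape[1:]) if len(tensor.shape) > 1 else ()
        if key == "is_last_step":
            padding = torch.ones(pad_size, *additional_dims, dtype=tensor.dtype, device=tensor.device)
        else:
            padding = torch.zeros(pad_size, *additional_dims, dtype=tensor.dtype, device=tensor.device)
        padded[key] = torch.cat([tensor, padding], dim=0)
    return padded
```

The design choice debated above is visible in the branch: padding with ones keeps the invariant "one last step per trajectory" even for filler rows.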
/gemini review
Code Review
This pull request introduces an example for step-wise training, where each turn in a conversation is treated as a separate sample. This is a significant feature addition, implemented through a new StepWiseGenerator and StepWiseTrainer that customize the data generation and training loop. The changes also include a custom advantage estimation function tailored for outcome rewards in a multi-turn setting and a corresponding evaluation function.
My review focuses on the clarity and maintainability of the new implementation. The core logic for step-wise processing appears correct. I've suggested minor improvements to the method signatures in StepWiseTrainer to enhance code clarity by explicitly marking unused parameters inherited from the base class. The modifications to existing utility functions to support this new training paradigm are well-designed for extensibility.
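The suggestion about explicitly marking unused parameters inherited from the base class can be illustrated with a generic sketch; the class and method names below are hypothetical, not the actual `StepWiseTrainer` API:

```python
class BaseTrainer:
    """Hypothetical base class whose method signature subclasses must match."""

    def train_step(self, batch, extra_info=None):
        raise NotImplementedError


class StepWiseTrainer(BaseTrainer):
    def train_step(self, batch, extra_info=None):
        # `extra_info` is kept for signature compatibility with the base
        # class but is not used in the step-wise path; deleting it up front
        # makes that explicit to readers and linters.
        del extra_info
        return sum(batch)
```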
```python
if generator_output["rollout_metrics"] is not None:
    self.all_metrics.update(generator_output["rollout_metrics"])
...
# don't validate - will error out
```
Could you add just a little more detail on why it will error out, just for posterity :)
# What does this PR do?

Adds an example for step-wise training where each turn is represented as an individual sample in the batch. Currently, the example still assumes outcome rewards.

Implements:
- A custom generator for providing inputs and outputs at each step as an individual sample
- A custom trainer that can handle step-wise generator output
- A custom advantage estimation function that will compute advantages for the last step and broadcast it to the other steps in that trajectory
- A custom evaluation function to calculate metrics correctly

Currently this uses TITO with multi-turn chat templating for a simple demonstration. We simply append responses and observations to a running list of input ids. The generator is not yet compatible with qwen3 or gpt-oss like chat templating, where think tokens are removed; this will be added as a follow-up.

There are many bits that can be cleaned up (for example, the padding logic is brittle at the moment given the special handling for the tensor `is_last_step`), but it works as an initial example. I've tested convergence with SkyRL2SQL for the first 20 steps and it seems to match the original wandb curve.

<img width="1045" height="595" alt="Screenshot 2025-10-08 at 4 39 13 PM" src="https://github.com/user-attachments/assets/78b1f135-cb0a-4553-afc9-d032bc1459a7" />
<img width="1037" height="636" alt="Screenshot 2025-10-08 at 4 39 25 PM" src="https://github.com/user-attachments/assets/5d44fcf6-9d21-4e3f-8f02-c12843fbaecd" />

Original curve for reference: https://wandb.ai/sky-posttraining-uc-berkeley/skyrl-sql/reports/SkyRL-SQL---VmlldzoxMzM0MTAyMw?accessToken=vrqncoa32qcobvvpuo672yji4gweguk6tjxvaflk1zh73fn70j6l5rj8j619uvry

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
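The custom advantage estimation described in the PR summary (compute an advantage at each trajectory's last step, then broadcast it to the other steps of that trajectory) could be sketched roughly as follows. Function and tensor names are illustrative, and any group normalization (e.g. GRPO-style mean/std) is omitted for brevity:

```python
import torch


def broadcast_last_step_advantages(
    rewards: torch.Tensor,         # (num_steps,) outcome reward, set at last steps
    is_last_step: torch.Tensor,    # (num_steps,) 1 at the final step of each trajectory
    trajectory_ids: torch.Tensor,  # (num_steps,) trajectory each step belongs to
) -> torch.Tensor:
    """Sketch: per-trajectory outcome advantage broadcast to all steps."""
    advantages = torch.zeros_like(rewards, dtype=torch.float32)
    for traj in trajectory_ids.unique():
        mask = trajectory_ids == traj
        # Advantage derived from the outcome reward at the trajectory's last step...
        last_adv = rewards[mask & (is_last_step == 1)]
        # ...copied to every step in the same trajectory.
        advantages[mask] = last_adv
    return advantages
```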