[skyrl-train][step-wise] 1/N - Support step-wise training with step_wise_training flag
#694
Conversation
trajectory_ids: Optional[List[TrajectoryID]]
# Applicable only for step-wise training
is_last_step: Optional[List[bool]]
Both fields are optional right now, since it's not a hard requirement for all generators to send them over; they're only required for step-wise training.
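For context, a minimal sketch of how these optional fields might sit in the generator output schema (assuming a TypedDict-style `GeneratorOutput`; `TrajectoryID` is a toy stand-in here):

```python
from typing import List, Optional, TypedDict

TrajectoryID = str  # toy stand-in for the real TrajectoryID type


class GeneratorOutput(TypedDict, total=False):
    # ...existing fields (response_ids, rewards, loss_masks, ...) omitted...
    trajectory_ids: Optional[List[TrajectoryID]]
    # Applicable only for step-wise training
    is_last_step: Optional[List[bool]]
```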
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

/gemini review
Code Review
This PR is a good first step towards integrating step-wise training natively into skyrl-train. It successfully refactors the logic from the examples/step_wise directory into the core library, controlled by a new step_wise_training flag. The changes are well-contained and the new flag provides a clear way to enable the feature. The removal of custom entrypoints and trainers for the example and moving the logic to the base classes is a great improvement for maintainability. I have one major concern regarding the padding logic for is_last_step which might lead to incorrect advantage calculations. Please see my detailed comment.
additional_dims = tuple(tensor.shape[1:]) if len(tensor.shape) > 1 else ()

if key == "is_last_step":
    padding_tensor = torch.ones(pad_size, *additional_dims, dtype=tensor.dtype, device=tensor.device)
The padding for is_last_step should be False (i.e., torch.zeros) instead of True (torch.ones). When is_last_step is True for padded rows, they are incorrectly included in the advantage calculation for step-wise training. This can lead to incorrect advantages being computed, as rewards from padded rows (which are cloned from other valid rows) are used as if they are from a final step of a trajectory. Although these padded rows are masked out from the loss calculation, the incorrect advantage values could still affect metrics and potentially other parts of the training logic in the future.
Suggested change:
- padding_tensor = torch.ones(pad_size, *additional_dims, dtype=tensor.dtype, device=tensor.device)
+ padding_tensor = torch.zeros(pad_size, *additional_dims, dtype=tensor.dtype, device=tensor.device)
This is incorrect
CharlieFRuan left a comment
Thank you so much! Only one comment. We could add some unit tests as follow-ups.
response_ids=response_ids,
reward=step_reward,
loss_mask=copy.deepcopy(loss_mask),
prompt_ids=copy.deepcopy(input_ids[:current_prompt_length]),
This comment is for the line `response_ids = copy.deepcopy(input_ids[current_prompt_length:])`.
This `input_ids` is after we added the observation tokens and the next turn's generation prompt, right? Shouldn't the response IDs just be `output_ids`?
Hmm, aren't both equivalent ways to do step-wise training? i.e., you could treat (assistant response + obs, reward) as a step vs. just count (assistant response, reward) as a step?
True, I was just wondering if this adds additional computation. Indeed, I don't think the current way affects correctness.
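To make the two segmentations concrete, here is a toy sketch with made-up token ids (assuming `output_ids` holds only the assistant's newly generated tokens for the turn):

```python
import copy

input_ids = [1, 2, 3, 10, 11, 12, 20, 21]  # step prompt + assistant output + observation/next-turn prompt
current_prompt_length = 3                  # length of this step's prompt
output_ids = [10, 11, 12]                  # only the assistant's generated tokens

# Segmentation used in the PR: the step's response is everything after the step's
# prompt, i.e. the assistant response plus the appended observation tokens.
response_with_obs = copy.deepcopy(input_ids[current_prompt_length:])  # [10, 11, 12, 20, 21]

# Alternative raised in review: the step's response is just the assistant output.
response_only = output_ids                                            # [10, 11, 12]
```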
) / len(response_ids)

logger.info(f"Number of sequences before padding: {len(training_input['sequences'])}")
training_input = self.pad_batch(training_input)
Just wanted to make sure the PR doesn't break the existing flow. Would this be a no-op if we're not doing step-wise training?
That's correct! The `pad_batch` logic is actually very similar to the initial `_remove_tail_data` logic: if the batch is already divisible by the DP dimensions, then there's no need for padding. The `pad_batch` call is also inserted in `convert_to_training_input`, after generation has fully finished. Without step-wise training, the existing flow has two branches: with and without dynamic sampling. In both cases, the batch size should be `train_batch_size * num_prompts` (with tail data trimmed), and the padding should be zero.
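A rough sketch of the behaviour described above (assuming a dict-of-tensors batch keyed by `sequences` and `is_last_step`, and a `dp_size` divisibility target; this is not the PR's exact implementation):

```python
import torch


def pad_batch(batch: dict, dp_size: int) -> dict:
    """Pad the batch dimension up to a multiple of dp_size (illustrative sketch).

    Assumes pad_size <= num_samples so padded rows can be cloned from valid rows.
    """
    num_samples = batch["sequences"].shape[0]
    pad_size = (-num_samples) % dp_size
    if pad_size == 0:
        return batch  # already divisible by the DP dimension: no-op

    padded = {}
    for key, tensor in batch.items():
        additional_dims = tuple(tensor.shape[1:]) if len(tensor.shape) > 1 else ()
        if key == "is_last_step":
            # padded rows get an explicit flag value (ones, in the PR as written)
            pad = torch.ones(pad_size, *additional_dims, dtype=tensor.dtype, device=tensor.device)
        else:
            # padded rows are clones of existing valid rows
            pad = tensor[:pad_size].clone()
        padded[key] = torch.cat([tensor, pad], dim=0)
    return padded
```

The early return is what makes the non-step-wise flow a no-op, matching the explanation above.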
got it, thanks for the explanation!
if self.cfg.trainer.step_wise_training:
    avg_rewards: float = return_sums[data["is_last_step"][: num_samples - pad_size]].mean().item()
else:
    avg_rewards: float = return_sums.mean().item()
Would the changes here be a no-op if we're not doing step-wise training?
Yes! pad_size is 0
Got it!
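For reference, a toy illustration of the step-wise branch above (made-up numbers): only return sums from rows flagged as a trajectory's final step enter the average.

```python
import torch

return_sums = torch.tensor([0.2, 0.5, 1.0, 0.0, 0.0, 0.0])
is_last_step = torch.tensor([False, False, True, False, False, True])
pad_size = 0                      # no padded rows in this toy batch
num_samples = len(return_sums)

# Boolean indexing keeps only the last-step rows before averaging.
avg_rewards = return_sums[is_last_step[: num_samples - pad_size]].mean().item()
print(avg_rewards)  # 0.5 -- mean over the two last-step rows (1.0 and 0.0)
```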
)
training_input.metadata = {
    "uids": uids,
    "trajectory_ids": [trajectory_id.to_string() for trajectory_id in generator_output["trajectory_ids"]],
Just realized this is breaking: this makes `trajectory_ids` a required field in the generator output.
…h `step_wise_training` flag" (#706) Reverts #694 See #694 (comment) The PR expects `trajectory_ids` to always be in the generator output, which currently is not enforced and is breaking. `run_gsm8k.sh` fails with https://gist.github.com/CharlieFRuan/cbbef69fde60a20d483d03efb13d60bb
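For what it's worth, here is a hedged sketch of one possible guard (not the PR's code, and not necessarily the fix that was adopted): only stringify trajectory ids when the generator actually provides them. `TrajectoryID` below is a toy stand-in.

```python
from typing import List, Optional


class TrajectoryID:
    """Toy stand-in for the real TrajectoryID type."""

    def __init__(self, value: str):
        self.value = value

    def to_string(self) -> str:
        return self.value


generator_output = {"response_ids": [[1, 2, 3]]}  # no trajectory_ids key provided
trajectory_ids: Optional[List[TrajectoryID]] = generator_output.get("trajectory_ids")

metadata = {
    "trajectory_ids": [t.to_string() for t in trajectory_ids] if trajectory_ids is not None else None,
}
print(metadata)  # {'trajectory_ids': None} -- no crash when the key is absent
```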
What does this PR do?
Supports step-wise training natively with the `step_wise_training` flag.

Currently, step-wise training introduces a new generator output format and some custom book-keeping in the agent loop. For the first integration, we add this functionality as a separate generator.
I plan to have a follow-up PR where we simplify this and have the logic in the base generator.
TODO:
- Update the `step_wise` example to use the same flag

E2E Run:
Step-wise training for SkyRL-SQL:
Reference run: