
[Docs] Small update on docs #1348

Merged
CharlieFRuan merged 1 commit into main from trivial
Mar 19, 2026

Conversation

Member

@CharlieFRuan CharlieFRuan commented Mar 19, 2026

@CharlieFRuan CharlieFRuan merged commit 72f8d86 into main Mar 19, 2026
1 check was pending
@CharlieFRuan CharlieFRuan deleted the trivial branch March 19, 2026 06:27
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a helpful clarification to the step-wise-training.mdx documentation, explaining how the structure of GeneratorOutput changes for step-wise training. The change correctly states that each element in the output lists corresponds to a single step rather than a full trajectory. I've included one suggestion to make the wording slightly more precise for improved clarity.


## GeneratorOutput Format

Normally, each element in `GeneratorOutput` (i.e. `response_ids[i]`, `prompt_token_ids[i]`, `rewards[i]`, etc.) represents a single trajectory. With step-wise training, each element instead represents a single **step** (one LLM turn within a trajectory). A trajectory with 3 turns produces 3 elements rather than 1.
Contributor

medium

This explanation is very helpful. To make it even more precise and avoid potential confusion, you could clarify that this per-step/per-trajectory structure applies specifically to the list-based fields in GeneratorOutput. The GeneratorOutput TypedDict also contains non-list fields like rollout_metrics, which are aggregated for the entire batch and don't follow this pattern. Specifying this distinction will make the documentation more robust.

Normally, for the list-based fields in `GeneratorOutput` (e.g., `response_ids`, `prompt_token_ids`, `rewards`), each element represents a single trajectory. With step-wise training, each element instead represents a single **step** (one LLM turn within a trajectory). A trajectory with 3 turns produces 3 elements rather than 1.
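
To illustrate the distinction, here is a minimal sketch of how the list-based fields grow to one element per step while `rollout_metrics` stays a single batch-level value. `GeneratorOutputSketch`, `Step`, `build_stepwise_output`, the scalar reward type, and the metric keys are all hypothetical stand-ins for illustration, not the actual `GeneratorOutput` definition or generator code.

```python
from typing import Any, Dict, List, TypedDict


class Step(TypedDict):
    # Hypothetical per-turn record: one LLM turn within a trajectory.
    prompt_token_ids: List[int]
    response_ids: List[int]
    reward: float  # assumed scalar reward per step


class GeneratorOutputSketch(TypedDict):
    # Simplified stand-in for GeneratorOutput; only fields named in this PR.
    prompt_token_ids: List[List[int]]
    response_ids: List[List[int]]
    rewards: List[float]
    rollout_metrics: Dict[str, Any]  # non-list field, aggregated over the batch


def build_stepwise_output(trajectories: List[List[Step]]) -> GeneratorOutputSketch:
    """Flatten trajectories so each list element corresponds to one step."""
    output: GeneratorOutputSketch = {
        "prompt_token_ids": [],
        "response_ids": [],
        "rewards": [],
        "rollout_metrics": {},
    }
    for trajectory in trajectories:
        for step in trajectory:
            output["prompt_token_ids"].append(step["prompt_token_ids"])
            output["response_ids"].append(step["response_ids"])
            output["rewards"].append(step["reward"])
    # Metrics describe the whole batch, so they do not get one entry per step.
    output["rollout_metrics"] = {
        "num_trajectories": len(trajectories),
        "num_steps": len(output["rewards"]),
    }
    return output


# One trajectory with 3 turns and one trajectory with 1 turn.
three_turn: List[Step] = [
    {"prompt_token_ids": [1, 2], "response_ids": [3], "reward": 0.0},
    {"prompt_token_ids": [1, 2, 3, 4], "response_ids": [5], "reward": 0.0},
    {"prompt_token_ids": [1, 2, 3, 4, 5, 6], "response_ids": [7], "reward": 1.0},
]
one_turn: List[Step] = [
    {"prompt_token_ids": [8, 9], "response_ids": [10], "reward": 0.5},
]
output = build_stepwise_output([three_turn, one_turn])
```

Here the two trajectories yield four elements in each list-based field (3 + 1 steps), while `rollout_metrics` remains a single dictionary for the batch.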

devpatelio pushed a commit that referenced this pull request Mar 20, 2026