Code Review
This pull request adds a helpful clarification to the step-wise-training.mdx documentation, explaining how the structure of GeneratorOutput changes for step-wise training. The change correctly states that each element in the output lists corresponds to a single step rather than a full trajectory. I've included one suggestion to make the wording slightly more precise for improved clarity.
## GeneratorOutput Format

Normally, each element in `GeneratorOutput` (i.e. `response_ids[i]`, `prompt_token_ids[i]`, `rewards[i]`, etc.) represents a single trajectory. With step-wise training, each element instead represents a single **step** (one LLM turn within a trajectory). A trajectory with 3 turns produces 3 elements rather than 1.
This explanation is very helpful. To make it even more precise and avoid potential confusion, you could clarify that this per-step/per-trajectory structure applies specifically to the list-based fields in GeneratorOutput. The GeneratorOutput TypedDict also contains non-list fields like rollout_metrics, which are aggregated for the entire batch and don't follow this pattern. Specifying this distinction will make the documentation more robust.
Normally, for the list-based fields in `GeneratorOutput` (e.g., `response_ids`, `prompt_token_ids`, `rewards`), each element represents a single trajectory. With step-wise training, each element instead represents a single **step** (one LLM turn within a trajectory). A trajectory with 3 turns produces 3 elements rather than 1.
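To illustrate the distinction, here is a minimal sketch using plain dicts rather than the real `GeneratorOutput` TypedDict. The token IDs, rewards, the `num_turns` metric key, and the way the per-trajectory case flattens turns are all made-up assumptions for illustration; only the field names `response_ids`, `prompt_token_ids`, `rewards`, and `rollout_metrics` come from the discussion above.

```python
from typing import Any, Dict, List

# Hypothetical rollout: one trajectory with 3 LLM turns (values are invented).
turns: List[Dict[str, Any]] = [
    {"prompt": [1, 2], "response": [3, 4], "reward": 0.0},
    {"prompt": [1, 2, 3, 4, 5], "response": [6], "reward": 0.0},
    {"prompt": [1, 2, 3, 4, 5, 6, 7], "response": [8, 9], "reward": 1.0},
]


def per_trajectory_output(turns: List[Dict[str, Any]]) -> Dict[str, Any]:
    """One list element per trajectory (assumed flattening, for illustration only)."""
    first_prompt = turns[0]["prompt"]
    last = turns[-1]
    # Everything after the initial prompt is treated as the response here.
    response = last["prompt"][len(first_prompt):] + last["response"]
    return {
        "prompt_token_ids": [first_prompt],
        "response_ids": [response],
        "rewards": [sum(t["reward"] for t in turns)],
        # Non-list field: aggregated for the batch, not indexed per element.
        "rollout_metrics": {"num_turns": len(turns)},
    }


def step_wise_output(turns: List[Dict[str, Any]]) -> Dict[str, Any]:
    """One list element per step (one LLM turn within the trajectory)."""
    return {
        "prompt_token_ids": [t["prompt"] for t in turns],
        "response_ids": [t["response"] for t in turns],
        "rewards": [t["reward"] for t in turns],
        # Still a single aggregated dict, regardless of the number of steps.
        "rollout_metrics": {"num_turns": len(turns)},
    }


per_traj = per_trajectory_output(turns)
step_wise = step_wise_output(turns)

print(len(per_traj["response_ids"]), len(step_wise["response_ids"]))
```

In this sketch the list-based fields grow from 1 element to 3 under step-wise training, while `rollout_metrics` stays a single dict in both cases, which is the distinction the suggested wording is meant to capture.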