
[GPT-OSS 1/N] Initial GPT-OSS support for single turn training#390

Merged
SumanthRH merged 22 commits into NovaSky-AI:main from SumanthRH:sumanthrh/gptoss on Oct 7, 2025

Conversation

@SumanthRH (Member)

What does this PR do?

Initial GPT-OSS support for single-turn training.

Training is supported in mixed precision (BF16); inference is supported in half precision only at the moment.

Given some quirks in chat templating, the current SkyRLGymGenerator is not compatible with GPT-OSS for multi-turn tasks.

This PR further adds overrides for chat templating via `chat_template_kwargs`, to be used in the agent loop in `SkyRLGymGenerator`.

For GPT-OSS, we can provide the reasoning effort in the system prompt, so this feature is important.
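As a rough illustration of what such an override enables, the generator can forward user-configured kwargs straight into the tokenizer's chat template. This is a minimal sketch: `build_prompt` and `FakeGptOssTokenizer` are hypothetical stand-ins, not the actual `SkyRLGymGenerator` code.

```python
def build_prompt(tokenizer, messages, chat_template_kwargs=None):
    # Forward any user-configured kwargs (e.g. reasoning_effort for GPT-OSS)
    # into the tokenizer's chat template.
    kwargs = dict(chat_template_kwargs or {})
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False, **kwargs
    )


class FakeGptOssTokenizer:
    """Stand-in for a GPT-OSS tokenizer whose chat template reads reasoning_effort."""

    def apply_chat_template(self, messages, add_generation_prompt, tokenize, **kwargs):
        effort = kwargs.get("reasoning_effort", "medium")
        body = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return f"Reasoning: {effort}\n{body}"


prompt = build_prompt(
    FakeGptOssTokenizer(),
    [{"role": "user", "content": "What is 220 * 1.15?"}],
    chat_template_kwargs={"reasoning_effort": "low"},
)
print(prompt.splitlines()[0])  # → Reasoning: low
```

The real GPT-OSS chat template reads `reasoning_effort` when rendering the system message, which is why plumbing these kwargs through the generator matters.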

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces initial support for single-turn training with GPT-OSS models. This is achieved by adding a `chat_template_kwargs` configuration to pass specific arguments to the tokenizer's chat template, which is necessary for GPT-OSS. The changes include a new example script for running GSM8K with GPT-OSS, updates to the base configuration, and modifications to `SkyRLGymGenerator` to utilize the new kwargs.

My review focuses on improving script robustness and code maintainability. I've suggested making the new shell script more robust by adding error handling flags. I've also pointed out an opportunity to refactor the repeated use of the new configuration in SkyRLGymGenerator to improve code clarity and make future changes easier.

SumanthRH and others added 3 commits October 4, 2025 16:15
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH (Member, Author)

Receipts:

Reward curve for gsm8k training:

[image: reward curve for gsm8k training]

Comparing rewards and generation lengths for reasoning effort high and low:

[screenshots: reward and generation-length curves for high vs. low reasoning effort]

Example sequence from the `sequences` array after generation (obtained with `dump_data_batch=true`) with `reasoning_effort=low` (the default is `medium`):

```
<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-04

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>The selling price of a bicycle that had sold for $220 last year was increased by 15%. What is the new price? Let's think step by step and output the final answer after "####".<|end|><|start|>assistant<|channel|>analysis<|message|>Compute 15% of 220 = 33. So new price = 220+33=253.<|end|><|start|>assistant<|channel|>final<|message|>The price increased by 15% means it is multiplied by \(1 + 0.15 = 1.15\).

\[
220 \times 1.15 = 220 \times \left(\frac{115}{100}\right)
               = 220 \times 1.15
               = 220 + 33
               = 253
\]

The new selling price is **$253**.

####<|return|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
```
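For reference, the `final`-channel answer can be pulled out of a dump like the one above with a small regex over the harmony-style tags. This is just an illustrative sketch (`extract_final` is hypothetical, not a parser from the repo):

```python
import re


def extract_final(harmony_text: str) -> str:
    # Grab the assistant's final-channel message, which ends at <|end|> or <|return|>.
    m = re.search(
        r"<\|start\|>assistant<\|channel\|>final<\|message\|>(.*?)<\|(?:end|return)\|>",
        harmony_text,
        re.DOTALL,
    )
    return m.group(1).strip() if m else ""


sample = (
    "<|start|>assistant<|channel|>analysis<|message|>220*1.15=253<|end|>"
    "<|start|>assistant<|channel|>final<|message|>The answer is 253.<|return|>"
)
print(extract_final(sample))  # → The answer is 253.
```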

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
```bash
trainer.run_name="gsm8k_test_gptoss_low" \
trainer.resume_mode=latest \
trainer.ckpt_path="$HOME/ckpts/gsm8k_1.5B_ckpt_gptoss" \
+generator.chat_template_kwargs={reasoning_effort:'low'} \
```
Member

Do you need + since this config param exists?

Member Author

Yeah....without the plus I get:

```bash
Could not override 'generator.chat_template_kwargs'.
To append to your config use +generator.chat_template_kwargs={reasoning_effort:low}
Key 'reasoning_effort' is not in struct
    full_key: generator.chat_template_kwargs.reasoning_effort
    object_type=dict
```

Member

oh funny, got it

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH requested a review from tyler-griggs October 7, 2025 04:08
- ``generator.max_turns``: Maximum number of turns for generation with multi-turn RL.
- ``generator.use_conversation_multi_turn``: Whether to use conversation format for multi-turn generation. If set to ``true`` then observations are appended to the chat history as a new turn. If set to ``false`` then observations are appended as-is to the assistant response in token space and generation is continued (after removing any EOS token in the response). We've observed some cases where model can be sensitive to chat history format (ex: in SkyRL-SQL), and thus ``false`` can be used for full control over the exact tokens added after environment interaction.
- ``generator.engine_init_kwargs``: Inference engine arguments passed directly to the vLLM or SGLang engine. To specify an engine arg as a CLI override, use the format ``+generator.engine_init_kwargs.[arg_name]=value``. If duplicate kwargs are passed or kwargs clash with existing generator arguments (e.g., ``tensor_parallel_size``), an error is raised.
- ``generator.chat_template``: Custom chat template configuration if needed.
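Putting a few of these options together, a CLI invocation might look like the following. This is a hypothetical sketch: the entry point and the chosen values are placeholders, not taken from the repo's scripts.

```bash
# Hypothetical invocation; entry point and values are illustrative placeholders.
python -m your_entrypoint \
  generator.max_turns=1 \
  generator.use_conversation_multi_turn=false \
  +generator.engine_init_kwargs.enable_prefix_caching=true \
  +generator.chat_template_kwargs={reasoning_effort:'low'}
```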
Member

thanks :)

@SumanthRH SumanthRH merged commit ae65f56 into NovaSky-AI:main Oct 7, 2025
3 checks passed
SumanthRH added a commit that referenced this pull request Oct 8, 2025
The config parameter `reasoning_effort` needs a `+`, otherwise you'll see:

```bash
Could not override 'generator.chat_template_kwargs'.
To append to your config use +generator.chat_template_kwargs={reasoning_effort:low}
Key 'reasoning_effort' is not in struct
    full_key: generator.chat_template_kwargs.reasoning_effort
    object_type=dict
```

This was added by mistake in #390 while I was testing the script.

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025

3 participants