
[GPT-OSS 1/N] Initial GPT-OSS support for single turn training#390

Merged
SumanthRH merged 22 commits into NovaSky-AI:main from SumanthRH:sumanthrh/gptoss on Oct 7, 2025

Conversation

@SumanthRH (Member)

What does this PR do?

Initial GPT-OSS support for single-turn training.

Training is supported in mixed precision (BF16); inference is supported in half precision only at the moment.

Given some quirks in chat templating, the current SkyRLGymGenerator is not compatible with GPT-OSS for multi-turn tasks.

This PR further adds overrides for chat templating via `chat_template_kwargs`, to be used in the agent loop in `SkyRLGymGenerator`.

For GPT-OSS, we can provide the reasoning effort in the system prompt, so this feature is important.
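As a rough illustration of what such an override enables, the generator can forward user-configured kwargs straight into the tokenizer's chat template. This is a minimal sketch: `build_prompt` and `FakeGptOssTokenizer` are hypothetical stand-ins, not the actual `SkyRLGymGenerator` code.

```python
def build_prompt(tokenizer, messages, chat_template_kwargs=None):
    # Forward any user-configured kwargs (e.g. reasoning_effort for GPT-OSS)
    # into the tokenizer's chat template.
    kwargs = dict(chat_template_kwargs or {})
    return tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False, **kwargs
    )


class FakeGptOssTokenizer:
    """Stand-in for a GPT-OSS tokenizer whose chat template reads reasoning_effort."""

    def apply_chat_template(self, messages, add_generation_prompt, tokenize, **kwargs):
        effort = kwargs.get("reasoning_effort", "medium")
        body = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return f"Reasoning: {effort}\n{body}"


prompt = build_prompt(
    FakeGptOssTokenizer(),
    [{"role": "user", "content": "What is 220 * 1.15?"}],
    chat_template_kwargs={"reasoning_effort": "low"},
)
print(prompt.splitlines()[0])  # → Reasoning: low
```

The real GPT-OSS chat template reads `reasoning_effort` when rendering the system message, which is why plumbing these kwargs through the generator matters.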

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces initial support for single-turn training with GPT-OSS models. This is achieved by adding a `chat_template_kwargs` configuration to pass specific arguments to the tokenizer's chat template, which is necessary for GPT-OSS. The changes include a new example script for running GSM8K with GPT-OSS, updates to the base configuration, and modifications to `SkyRLGymGenerator` to utilize the new kwargs.

My review focuses on improving script robustness and code maintainability. I've suggested making the new shell script more robust by adding error handling flags. I've also pointed out an opportunity to refactor the repeated use of the new configuration in SkyRLGymGenerator to improve code clarity and make future changes easier.

SumanthRH and others added 3 commits October 4, 2025 16:15
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH (Member, Author)

Receipts:

Reward curve for gsm8k training:

[image: reward curve for gsm8k training]

Comparing rewards and generation lengths for reasoning effort high and low:

[screenshots: reward and generation-length curves for high vs. low reasoning effort]

Example sequence from the `sequences` array after generation (obtained with `dump_data_batch=true`) with `reasoning_effort=low` (the default is `medium`):

```
<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-04

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>The selling price of a bicycle that had sold for $220 last year was increased by 15%. What is the new price? Let's think step by step and output the final answer after "####".<|end|><|start|>assistant<|channel|>analysis<|message|>Compute 15% of 220 = 33. So new price = 220+33=253.<|end|><|start|>assistant<|channel|>final<|message|>The price increased by 15% means it is multiplied by \(1 + 0.15 = 1.15\).

\[
220 \times 1.15 = 220 \times \left(\frac{115}{100}\right)
               = 220 \times 1.15
               = 220 + 33
               = 253
\]

The new selling price is **$253**.

####<|return|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
```
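For reference, the `final`-channel answer can be pulled out of a dump like the one above with a small regex over the harmony-style tags. This is just an illustrative sketch (`extract_final` is hypothetical, not a parser from the repo):

```python
import re


def extract_final(harmony_text: str) -> str:
    # Grab the assistant's final-channel message, which ends at <|end|> or <|return|>.
    m = re.search(
        r"<\|start\|>assistant<\|channel\|>final<\|message\|>(.*?)<\|(?:end|return)\|>",
        harmony_text,
        re.DOTALL,
    )
    return m.group(1).strip() if m else ""


sample = (
    "<|start|>assistant<|channel|>analysis<|message|>220*1.15=253<|end|>"
    "<|start|>assistant<|channel|>final<|message|>The answer is 253.<|return|>"
)
print(extract_final(sample))  # → The answer is 253.
```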

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
```bash
trainer.run_name="gsm8k_test_gptoss_low" \
trainer.resume_mode=latest \
trainer.ckpt_path="$HOME/ckpts/gsm8k_1.5B_ckpt_gptoss" \
+generator.chat_template_kwargs={reasoning_effort:'low'} \
```
Member

Do you need + since this config param exists?

Member Author

Yeah....without the plus I get:

```bash
Could not override 'generator.chat_template_kwargs'.
To append to your config use +generator.chat_template_kwargs={reasoning_effort:low}
Key 'reasoning_effort' is not in struct
    full_key: generator.chat_template_kwargs.reasoning_effort
    object_type=dict
```

Member

oh funny, got it

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH requested a review from tyler-griggs October 7, 2025 04:08
- ``generator.max_turns``: Maximum number of turns for generation with multi-turn RL.
- ``generator.use_conversation_multi_turn``: Whether to use conversation format for multi-turn generation. If set to ``true`` then observations are appended to the chat history as a new turn. If set to ``false`` then observations are appended as-is to the assistant response in token space and generation is continued (after removing any EOS token in the response). We've observed some cases where model can be sensitive to chat history format (ex: in SkyRL-SQL), and thus ``false`` can be used for full control over the exact tokens added after environment interaction.
- ``generator.engine_init_kwargs``: Inference engine arguments passed directly to the vLLM or SGLang engine. To specify an engine arg as a CLI override, use the format ``+generator.engine_init_kwargs.[arg_name]=value``. If duplicate kwargs are passed or kwargs clash with existing generator arguments (e.g., ``tensor_parallel_size``), an error is raised.
- ``generator.chat_template``: Custom chat template configuration if needed.
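Putting a few of these options together, a CLI invocation might look like the following. This is a hypothetical sketch: the entry point and the chosen values are placeholders, not taken from the repo's scripts.

```bash
# Hypothetical invocation; entry point and values are illustrative placeholders.
python -m your_entrypoint \
  generator.max_turns=1 \
  generator.use_conversation_multi_turn=false \
  +generator.engine_init_kwargs.enable_prefix_caching=true \
  +generator.chat_template_kwargs={reasoning_effort:'low'}
```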
Member

thanks :)

@SumanthRH SumanthRH merged commit ae65f56 into NovaSky-AI:main Oct 7, 2025
3 checks passed
SumanthRH added a commit that referenced this pull request Oct 8, 2025
The config parameter `reasoning_effort` needs a `+`, otherwise you'll see:

```bash
Could not override 'generator.chat_template_kwargs'.
To append to your config use +generator.chat_template_kwargs={reasoning_effort:low}
Key 'reasoning_effort' is not in struct
    full_key: generator.chat_template_kwargs.reasoning_effort
    object_type=dict
```

This was added by mistake in #390 while I was testing the script.

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025

3 participants