# OpenAI OSS fine-tuning by Trelis
Advanced scripts available at [Trelis.com](https://Trelis.com/ADVANCED-fine-tuning)

*Based on the [OpenAI cookbook notebook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers).*


Large reasoning models like **OpenAI o3** generate a *chain‑of‑thought* to improve the accuracy and quality of their responses.  
However, most of these models reason in English, even when a question is asked in another language.

In this notebook, we show how the open‑weight reasoning model **`openai/gpt-oss-20b`** can be fine‑tuned to reason effectively in multiple languages.  
We'll add a new **“reasoning language”** option to the model’s system prompt and apply supervised fine‑tuning with Hugging Face’s **TRL** library on a multilingual reasoning dataset.

**Outline**

1. **Setup** – install libraries  
2. **Prepare the dataset** – download & format  
3. **Prepare the model** – load, quantize & LoRA‑wrap  
4. **Fine‑tuning** – train with multilingual reasoning data  
5. **Inference** – generate reasoning responses in different languages  

When we're done, you’ll have a multilingual reasoning model that can:  

* reason in **English, Spanish, French, Italian, or German**,  
* even mix languages – e.g. ask in Spanish, reason in German, answer in Spanish.

> **Example**

```
User:
    ¿Cuál es el capital de Australia?
Assistant reasoning:
    Okay, der Benutzer fragt nach der Hauptstadt Australiens. [...]
Assistant response:
    La capital de Australia es **Canberra**. [...]
```


## 1&nbsp;&nbsp;Setup

In [1]:
# Upgrade pip and install uv for faster installs
!python -m pip install --upgrade pip
!pip install uv -qU

# !pip show torch

# Install PyTorch (CUDA 12.8 build)
!uv pip install torch --index-url https://download.pytorch.org/whl/cu128 --system -q

# Install remaining dependencies
!uv pip install hf_transfer "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0" trackio --system -q

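import os
# Enable accelerated Hub downloads via hf_transfer (set before anything is downloaded)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"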

In [2]:
# # Log in to the Hugging Face Hub
# from huggingface_hub import notebook_login
# notebook_login()

## 2&nbsp;&nbsp;Prepare the dataset

In [2]:
from datasets import load_dataset

# Multilingual chain‑of‑thought dataset
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
dataset

Dataset({
    features: ['reasoning_language', 'developer', 'user', 'analysis', 'final', 'messages'],
    num_rows: 1000
})

In [3]:
# Look at the first training example
dataset[0]

{'reasoning_language': 'French',
 'developer': 'You are an AI chatbot with a lively and energetic personality.',
 'user': 'Can you show me the latest trends on Twitter right now?',
 'analysis': "D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.\n\nJe devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque région. Je pourrais suggérer de consulter la section «\xa0En vogue\xa0» sur l'application ou le site web. Aussi, l'utilisation de hashtags et le suivi d'utilisateurs pertinents pourraient être utiles. Il est important de souligner que les tendances varient selon la région et l'heure de la journée. Je devrais garder un ton amical et 
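
The flat columns (`reasoning_language`, `developer`, `user`, `analysis`, `final`) and the `messages` column carry the same information. As a rough sketch of that relationship, the hypothetical helper `to_messages` below rebuilds a chat-style list from the flat fields. The `thinking` key on the assistant turn and the exact layout of the system text are assumptions inferred from the rendered conversation shown in this notebook, and the `final` value is a placeholder, so compare against `dataset[0]["messages"]` before relying on it.

```python
def to_messages(row: dict) -> list[dict]:
    """Rebuild a chat-style message list from the flat dataset columns (illustrative only)."""
    # Assumed layout: the reasoning language is prepended to the developer instructions.
    system_text = f"reasoning language: {row['reasoning_language']}\n\n{row['developer']}"
    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": row["user"]},
        # Assumed keys: chain-of-thought under "thinking", final answer under "content".
        {"role": "assistant", "thinking": row["analysis"], "content": row["final"]},
    ]


# Shortened stand-in for dataset[0]; "final" is a placeholder, not the real answer.
row = {
    "reasoning_language": "French",
    "developer": "You are an AI chatbot with a lively and energetic personality.",
    "user": "Can you show me the latest trends on Twitter right now?",
    "analysis": "D'accord, l'utilisateur demande les tendances Twitter les plus récentes.",
    "final": "<final answer placeholder>",
}

messages = to_messages(row)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```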

The **gpt‑oss** models use the *Harmony* response format to structure conversations. Each message has a role, and assistant messages are additionally tagged with a channel:

| role or channel        | purpose                                                         |
|------------------------|-----------------------------------------------------------------|
| `developer` (role)     | custom system instructions                                      |
| `user` (role)          | user input                                                      |
| `assistant` (role)     | tool calls or responses                                         |
| `analysis` (channel)   | chain‑of‑thought                                                |
| `final` (channel)      | final answer for the end‑user                                   |

We convert these messages with `tokenizer.apply_chat_template()` so the model understands them.

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

messages = dataset[0]["messages"]
conversation = tokenizer.apply_chat_template(messages, tokenize=False)
print(conversation)

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are an AI chatbot with a lively and energetic personality.<|end|><|start|>user<|message|>Can you show me the latest trends on Twitter right now?<|end|><|start|>assistant<|channel|>analysis<|message|>D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.

Je devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque ré
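
To make the token layout in the rendered conversation easier to read, here is a deliberately simplified sketch of how role and channel markers wrap each message. `render_harmony` is a hypothetical helper, not the real chat template: the actual template also injects the system header, adds the `# Instructions` heading to the developer message, and may close the last assistant turn with a different token. The `thinking` key is an assumption as well.

```python
def render_harmony(messages: list[dict]) -> str:
    """Simplified Harmony-style rendering (illustrative, not the real chat template)."""
    parts = []
    for m in messages:
        if m["role"] == "assistant":
            # Chain-of-thought goes to the analysis channel (assumed "thinking" key).
            if m.get("thinking"):
                parts.append(
                    f"<|start|>assistant<|channel|>analysis<|message|>{m['thinking']}<|end|>"
                )
            # The user-facing answer goes to the final channel.
            parts.append(
                f"<|start|>assistant<|channel|>final<|message|>{m['content']}<|end|>"
            )
        else:
            # developer and user messages carry a role but no channel.
            parts.append(f"<|start|>{m['role']}<|message|>{m['content']}<|end|>")
    return "".join(parts)


example = [
    {"role": "developer", "content": "reasoning language: French"},
    {"role": "user", "content": "Bonjour !"},
    {"role": "assistant", "thinking": "L'utilisateur dit bonjour.", "content": "Bonjour !"},
]
rendered = render_harmony(example)
print(rendered)
```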

## 3&nbsp;&nbsp;Prepare the model

In [5]:
import torch
from transformers import AutoModelForCausalLM, Mxfp4Config

# Dequantize the MXFP4 checkpoint to the dtype below for training
quantization_config = Mxfp4Config(dequantize=True)
model_kwargs = dict(
    attn_implementation="eager",
    # bfloat16 for Ampere, Hopper or later; float16 on Colab (a T4 will still run out of memory)
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    use_cache=False,  # the KV cache is not needed during training
    device_map="auto",
)

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", **model_kwargs)

In [6]:
!nvidia-smi

Wed Aug  6 10:53:16 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:85:00.0 Off |                    0 |
| N/A   31C    P0            132W /  700W |   44331MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
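
As a sanity check on the memory figure above, the weights alone can be estimated from the parameter count. The sketch below is back-of-the-envelope arithmetic, not a measurement: it takes the total reported by `print_trainable_parameters()` later in this notebook, subtracts the LoRA parameters, and assumes 2 bytes per parameter for bfloat16. The gap between this estimate and the 44,331 MiB shown by `nvidia-smi` is presumably the CUDA context and other buffers.

```python
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}


def weight_memory_mib(num_params: int, dtype: str = "bfloat16") -> float:
    """Memory needed for the weights alone, in MiB (ignores activations and CUDA overhead)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


# Counts taken from the print_trainable_parameters() output in this notebook.
total_with_lora = 20_929_797_696
lora_params = 15_040_512
base_params = total_with_lora - lora_params

weights_mib = weight_memory_mib(base_params)
reported_mib = 44_331  # "Memory-Usage" from nvidia-smi above

print(f"estimated weights: {weights_mib:,.0f} MiB")
print(f"unaccounted for:   {reported_mib - weights_mib:,.0f} MiB")
```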
                                                

In [7]:
messages = [{"role": "user", "content": "¿Cuál es el capital de Australia?"}]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


systemYou are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.user¿Cuál es el capital de Australia?assistantanalysisThis is Spanish asking: "What is the capital of Australia?" The answer: Canberra. Provide answer in Spanish. Probably short. Ensure correct.assistantfinalLa capital de Australia es Canberra.
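
With `skip_special_tokens=True` the role and channel markers collapse into plain text such as `assistantanalysis` and `assistantfinal`, which makes the reasoning hard to separate from the answer. One option is to decode with `skip_special_tokens=False` and split on the channel markers. The sketch below works on a hand-written sample string; `split_channels` is a hypothetical helper, and treating `<|return|>` as a possible end marker is an assumption.

```python
import re

# Capture the channel name and its text up to the next end marker (or end of string).
CHANNEL_RE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)


def split_channels(decoded: str) -> dict[str, str]:
    """Map each channel name (e.g. 'analysis', 'final') to its message text."""
    return {channel: text.strip() for channel, text in CHANNEL_RE.findall(decoded)}


# Hand-written sample in the shape of a decoded generation with special tokens kept.
decoded = (
    "<|start|>user<|message|>¿Cuál es el capital de Australia?<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>"
    "This is Spanish asking: \"What is the capital of Australia?\" The answer: Canberra.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>La capital de Australia es Canberra.<|return|>"
)

parts = split_channels(decoded)
print("reasoning:", parts["analysis"])
print("answer:   ", parts["final"])
```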


### LoRA configuration

In [8]:
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    target_parameters=[
        "7.mlp.experts.gate_up_proj",
        "7.mlp.experts.down_proj",
        "15.mlp.experts.gate_up_proj",
        "15.mlp.experts.down_proj",
        "23.mlp.experts.gate_up_proj",
        "23.mlp.experts.down_proj",
    ],
)
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

trainable params: 15,040,512 || all params: 20,929,797,696 || trainable%: 0.0719
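
To see where a figure like 0.0719% comes from, recall that LoRA adds two small matrices per adapted weight: `A` with shape `(r, d_in)` and `B` with shape `(d_out, r)`. The sketch below counts those extra parameters for a single layer and recomputes the trainable percentage from the numbers printed above. The 1024 x 1024 layer is a hypothetical example, not a dimension taken from gpt‑oss.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one d_out x d_in weight: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


# Hypothetical layer size, for illustration only.
example = lora_added_params(d_in=1024, d_out=1024, r=8)
full = 1024 * 1024
print(f"LoRA adds {example:,} params to a layer with {full:,} frozen params")

# Numbers reported by print_trainable_parameters() above.
trainable, total = 15_040_512, 20_929_797_696
pct = 100 * trainable / total
print(f"trainable%: {pct:.4f}")

# With r=8 and lora_alpha=16, the adapter update is scaled by alpha / r.
scaling = 16 / 8
```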




## 4&nbsp;&nbsp;Fine‑tuning

In [9]:
from trl import SFTConfig

training_args = SFTConfig(
    learning_rate=2e-4,
    gradient_checkpointing=True,
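    # Demo run: stop after 2 optimizer steps.
    # For a full pass over the data, remove max_steps and use num_train_epochs=1 instead.
    # num_train_epochs=1,
    max_steps=2,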
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=2048,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    output_dir="outputs/gpt-oss-20b-multilingual-reasoner",
    report_to="trackio",
    push_to_hub=True,
)
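
The scheduler settings above mean the learning rate warms up, then follows a cosine curve down to a floor of `min_lr_rate * learning_rate` (here 0.1 × 2e-4 = 2e-5) rather than to zero. The sketch below is a plain-Python approximation of that shape, not the Transformers implementation. The 63 total steps and 2 warmup steps are assumptions for a hypothetical one-epoch run (1000 examples at an effective batch size of 16, 3% warmup rounded up).

```python
import math


def cosine_with_min_lr(step, total_steps, warmup_steps, base_lr=2e-4, min_lr_rate=0.1):
    """Linear warmup to base_lr, then cosine decay to base_lr * min_lr_rate."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return base_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


# Assumed values for a hypothetical full-epoch run.
total_steps, warmup_steps = 63, 2
schedule = [cosine_with_min_lr(s, total_steps, warmup_steps) for s in range(total_steps + 1)]

for s in (0, 1, 2, 32, 63):
    print(f"step {s:>2}: lr = {schedule[s]:.2e}")
```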

In [10]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()

* Trackio project initialized: huggingface
* Trackio metrics logged to: /root/.cache/huggingface/trackio
* View dashboard by running in your terminal:
trackio show --project "huggingface"
* or by running in Python: trackio.show(project="huggingface")


| Step | Training Loss |
|------|---------------|
| 1    | 1.9633        |
| 2    | 2.0314        |


TrainOutput(global_step=2, training_loss=1.9973854422569275, metrics={'train_runtime': 42.6598, 'train_samples_per_second': 0.75, 'train_steps_per_second': 0.047, 'total_flos': 6429019046441472.0, 'train_loss': 1.9973854422569275})
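
The throughput numbers in `TrainOutput` follow from the batch settings. As a quick check, the sketch below recomputes them, assuming a single GPU (which matches the `nvidia-smi` output earlier) and the dataset size of 1000 rows. It also shows how many optimizer steps a full epoch would take at this batch size.

```python
import math

per_device_batch, grad_accum = 4, 4
num_gpus = 1  # assumption: single-GPU run
dataset_size, max_steps, train_runtime = 1000, 2, 42.6598

effective_batch = per_device_batch * grad_accum * num_gpus  # examples per optimizer step
samples_seen = effective_batch * max_steps                  # examples used in this demo run
steps_per_epoch = math.ceil(dataset_size / effective_batch)

samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(f"effective batch size: {effective_batch}")
print(f"samples seen:         {samples_seen} of {dataset_size}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"samples/s: {samples_per_second:.2f}, steps/s: {steps_per_second:.3f}")
```

With only 32 of the 1000 examples seen, this run is a smoke test of the pipeline rather than a real fine‑tune.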

In [None]:
!trackio show --project "huggingface"

# Requires port 7860 to be exposed in the runpod template
# Then view with: https://{pod-id}-7860.proxy.runpod.net/?project=huggingface

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


* Trackio UI launched at: http://127.0.0.1:7860/?project=huggingface


In [11]:
trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name="HuggingFaceH4/Multilingual-Thinking")

No files have been modified since last commit. Skipping to prevent empty commit.


## 5&nbsp;&nbsp;Inference
**Note:** you may need to restart the kernel here to free GPU memory before loading the model for inference.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

model_kwargs = dict(attn_implementation="eager", torch_dtype="auto", use_cache=True, device_map="auto")
# device_map="auto" already places the weights on the GPU, so no extra .cuda() call is needed
base_model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", **model_kwargs)

peft_model_id = "outputs/gpt-oss-20b-multilingual-reasoner"
model = PeftModel.from_pretrained(base_model, peft_model_id)
model = model.merge_and_unload()

REASONING_LANGUAGE = "German"
SYSTEM_PROMPT = f"reasoning language: {REASONING_LANGUAGE}"
USER_PROMPT = "¿Cuál es el capital de Australia?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

gen_kwargs = {"max_new_tokens": 512, "do_sample": True, "temperature": 0.6}
output_ids = model.generate(input_ids, **gen_kwargs)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


systemYou are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.developer# Instructions

reasoning language: Germanuser¿Cuál es el capital de Australia?assistantanalysisThe user asks in Spanish: "¿Cuál es el capital de Australia?" They want the capital of Australia. The answer: Canberra. The instructions: "reasoning language: German". That might mean that the assistant's reasoning step should be in German. But the user asked in Spanish. The answer should be in Spanish? The instructions say: "reasoning language: German". That likely means the assistant should do its internal reasoning in German, but the output to the user can be in the user's language. So the assistant should answer in Spanish: "La capital de Australia es Canberra." The reasoning should be in German. So I should produce a brief answer in Spanish. The reasoning 
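
The `merge_and_unload()` call used above folds the adapter into the base weights so inference no longer needs the separate LoRA path. For a single linear layer the idea is `W' = W + (alpha / r) * B @ A`. The toy sketch below, in plain Python with tiny hand-picked matrices, checks that the merged weight gives the same output as base plus adapter. It illustrates the arithmetic only and is not how PEFT implements the merge.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A for a d_out x d_in weight W."""
    scaling = alpha / r
    delta = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy values: d_in = d_out = 2, rank r = 1, scaling alpha / r = 2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # shape (r, d_in)
B = [[3.0], [4.0]]      # shape (d_out, r)
alpha, r = 2, 1
x = [[1.0], [1.0]]      # column vector input

merged = merge_lora(W, A, B, alpha, r)
y_merged = matmul(merged, x)

# Same computation with the adapter kept separate: W x + scaling * B (A x).
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_adapter = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged, y_merged, y_adapter)
```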

In [2]:
# You need to train more for this to work in Chinese.

REASONING_LANGUAGE = "French"
SYSTEM_PROMPT = f"reasoning language: {REASONING_LANGUAGE}"
USER_PROMPT = "C'est quoi la langue nationale du Canada?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, **gen_kwargs)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

systemYou are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-06

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.developer# Instructions

reasoning language: FrenchuserC'est quoi la langue nationale du Canada?assistantanalysisThe user asks: "C'est quoi la langue nationale du Canada?" In French. The user is asking about the "langue nationale du Canada". The official languages of Canada are English and French. So the answer: Canada has two langues nationales: l'anglais et le français. So respond in French, mention that Canada has two official languages: English and French. Possibly mention that Canada has no single national language, but both are official. Also mention that the constitution recognizes both. Provide concise answer.assistantfinalLe Canada ne possède pas une seule langue nationale ; il a **deux langues officielles** : l’anglais et le français. Ces deux langues so

## 6&nbsp;&nbsp;Conclusion

You fine‑tuned **`openai/gpt-oss-20b`** to reason in multiple languages using **TRL** + **LoRA** and the *Multilingual‑Thinking* dataset.  
Adapt these steps to your own data and build models that think in any language you need!

For more advanced scripts, check out [YouTube.com/@TrelisResearch](https://www.youtube.com/@TrelisResearch).

---

*Notebook generated by Trelis.*