# 🚀 Customize and Deploy `gpt-oss` model using Amazon SageMaker AI

---
This notebook provides a solution for working with either of the recently released GPT-OSS models:
* [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
* [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b)


This notebook demonstrates a complete workflow using Amazon SageMaker for fine-tuning and deploying OpenAI’s newly released **GPT‑OSS** open‑weight models. You’ll learn how to leverage **gpt‑oss‑20b** and **gpt‑oss‑120b**, two high-performance, open-source GPT variants, to build customizable, transparent, and deployable LLMs.

**What are GPT‑OSS Models?**

OpenAI released **gpt‑oss‑120b** and **gpt‑oss‑20b** on **August 5, 2025**—its first open‑weight language models since GPT‑2. These models are provided under the **Apache 2.0 license**, enabling both commercial and non-commercial use with full access to the model weights.

- **gpt‑oss‑120b**  
  - ~117 billion parameters, but only ~5.1 billion active per token via Mixture‑of‑Experts (MoE) routing  
  - 36 layers, 128 experts total, with 4 active per token  
  - Supports up to **128 k context length** using dense + sparse attention, grouped multi‑query attention, and RoPE

- **gpt‑oss‑20b**  
  - ~21 billion parameters, ~3.6 billion active per token  
  - 24 layers, 32 total experts, with 4 active per token  
  - Same efficient attention and context‑length capabilities as the large variant 
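
The parameter counts above imply that only a small share of each model is exercised per token. The arithmetic can be sketched as follows, using the approximate figures quoted in this section (`MODELS` and `active_fraction` are illustrative names, not part of any library):

```python
# Approximate parameter counts (in billions) as quoted in the list above.
MODELS = {
    "gpt-oss-120b": {"total_b": 117.0, "active_b": 5.1},
    "gpt-oss-20b": {"total_b": 21.0, "active_b": 3.6},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used to process a single token."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
# gpt-oss-120b: 4.4% of parameters active per token
# gpt-oss-20b: 17.1% of parameters active per token
```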

These models support **chain‑of‑thought (CoT) reasoning**, structured outputs, and are compatible with the OpenAI Responses API. You can adjust reasoning effort (low/medium/high) with a simple system message—balancing latency against performance.
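
As a rough illustration of the reasoning-effort switch, the sketch below builds a chat request whose system message carries the effort level. The `Reasoning: <level>` wording is an assumption based on the convention commonly shown for these models, and `build_messages` is a hypothetical helper; check the model card for the exact phrasing your serving stack expects.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_messages(user_prompt: str, effort: str = "medium") -> list:
    """Return a chat-style message list with the reasoning effort in the system turn."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return [
        # Assumed convention: the effort level is stated in the system message.
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("What is 17 * 24?", effort="high")
print(messages[0]["content"])  # Reasoning: high
```

Lower effort generally shortens the reasoning trace and therefore the latency; higher effort spends more tokens on reasoning.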

- **gpt‑oss‑120b** matches or exceeds the performance of OpenAI’s proprietary **o4‑mini** model on benchmarks such as Codeforces (coding), MMLU and HLE (general reasoning), HealthBench (health), and AIME (competition math).
- **gpt‑oss‑20b**, despite its smaller size, matches or exceeds **o3‑mini** on the same benchmarks, and outperforms it on competition mathematics and health.
---
**Using This Notebook**

Here, you'll:
1. **Fine-tune a GPT‑OSS model on SageMaker AI** (20B or 120B): pull the model from Hugging Face and launch a SageMaker AI training job on your custom dataset using SageMaker’s HuggingFace or PyTorch estimator.
2. **Deploy** the fine-tuned model as a SageMaker endpoint for interactive inference.
---


In [None]:
%pip install -Uq sagemaker datasets==4.0.0

In [None]:
import boto3
import sagemaker

In [None]:
sess = sagemaker.Session()

sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

In [None]:
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Data Prep for HuggingFace Training

**Preparing Your Dataset in `messages` Format for GPT‑OSS Fine‑Tuning on SageMaker AI**

This section walks you through creating a conversation-style dataset—the required `messages` format—for directly training GPT‑OSS models (`gpt‑oss‑20b` or `gpt‑oss‑120b`) using SageMaker AI.

**What Is the `messages` Format?**

The `messages` format structures instances as chat-like exchanges, wrapping each conversation turn into a role-labeled JSON array. It is widely used by frameworks like TRL (Transformer Reinforcement Learning), and aligns with both OpenAI and SageMaker JumpStart chat APIs.
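
A lightweight structural check can catch malformed records before a training job is launched. The sketch below is illustrative only: `validate_record` and the `ALLOWED_ROLES` set are hypothetical and reflect an assumption about which roles appear in your data, not a rule enforced by TRL or SageMaker.

```python
# Assumed set of roles; extend it if your dataset uses others.
ALLOWED_ROLES = {"system", "developer", "user", "assistant"}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one `messages` record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string content")
    return problems


record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How do I bake sourdough?"},
        {"role": "assistant", "content": "First, you need to create a starter by..."},
    ]
}
print(validate_record(record))  # []
```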

Example entry:

```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "How do I bake sourdough?" },
    { "role": "assistant", "content": "First, you need to create a starter by..." }
  ]
}
```


In [None]:
import os
import json
import pprint
from tqdm import tqdm
from datasets import load_dataset

In [None]:
dataset_parent_path = os.path.join(os.getcwd(), "local_dataset")
os.makedirs(dataset_parent_path, exist_ok=True)

#### Load Dataset from HuggingFace

**Dataset: HuggingFaceH4/Multilingual‑Thinking**

- A **multilingual reasoning dataset** that includes reasoning chains (chain‑of‑thought) translated into multiple languages such as French, Spanish, German, and more.  
- Small-scale dataset with approximately **1,000 examples** in total, sufficient for fine‑tuning large pretrained models like GPT‑OSS.

**Format & Structure**  
- Available in **Parquet** format, compatible with Hugging Face `datasets` and pandas.  
- Each example contains these key fields:
  - `reasoning_language` – the language used in the reasoning.
  - `developer` – system or instruction-level prompt.
  - `user` – the user's input/query.
  - `analysis` – the step-by-step reasoning (chain-of-thought).
  - `final` – the assistant's final response.
  - `messages` – a chat-style structure combining all of the above into a conversation.

This dataset is well suited to fine-tuning GPT‑OSS models for multilingual reasoning: a model can learn to think step by step in the user’s language, which improves interpretability and usability across languages.

In [None]:
dataset_name = 'HuggingFaceH4/Multilingual-Thinking'
dataset = load_dataset(dataset_name, split="train")

In [None]:
pprint.pp(dataset[0])

In [None]:
print(f"total number of fine-tunable samples: {len(dataset)}")

In [None]:
dataset = dataset.remove_columns(
    [col for col in dataset.column_names if col != "messages"]
)

In [None]:
dataset_filename = os.path.join(dataset_parent_path, f"{dataset_name.replace('/', '--').replace('.', '-')}.jsonl")
dataset.to_json(dataset_filename, lines=True)
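
Before uploading, it can be worth reading the JSONL file back to confirm that every line parses and carries a `messages` key. The following is a minimal sketch using only the standard library; `count_jsonl_records` is a hypothetical helper, and the demo runs on a throwaway file rather than the real dataset.

```python
import json
import os
import tempfile


def count_jsonl_records(path: str, required_key: str = "messages") -> int:
    """Count JSONL records, raising if a line lacks the required key."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if required_key not in record:
                raise ValueError(f"line {line_no} has no '{required_key}' key")
            count += 1
    return count


# Demo on a temporary two-record file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for text in ("hello", "bonjour"):
            f.write(json.dumps({"messages": [{"role": "user", "content": text}]}) + "\n")
    n_records = count_jsonl_records(demo_path)

print(n_records)  # 2
```

In the notebook, the same function could be called as `count_jsonl_records(dataset_filename)` and compared against `len(dataset)`.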

#### Upload file to S3

In [None]:
from sagemaker.s3 import S3Uploader

In [None]:
data_s3_uri = f"s3://{sess.default_bucket()}/dataset"

uploaded_s3_uri = S3Uploader.upload(
    local_path=dataset_filename,
    desired_s3_uri=data_s3_uri
)
print(f"Uploaded {dataset_filename} to > {uploaded_s3_uri}")

## Train a HuggingFace Model using Trainer

---

In this notebook, we demonstrate how to fine-tune open-weight GPT‑OSS models such as `gpt-oss-20b` using Amazon SageMaker training jobs. The training pipeline is built on the Hugging Face Transformers and TRL (Transformer Reinforcement Learning) libraries, integrated with Accelerate and DeepSpeed (ZeRO Stage 3) for memory-efficient optimization of large models. The dataset is prepared in the `messages` format, enabling structured chat-style learning for multilingual reasoning tasks, and is loaded from Amazon S3 into the SageMaker training environment.

To speed up training, the model is configured to use Flash Attention v3 through the `kernels-community/vllm-flash-attn3` kernel, which accelerates attention computation for the 2048-token sequences used here. The setup also uses PEFT (Parameter-Efficient Fine-Tuning) with LoRA adapters, which train a small number of additional weights instead of the full model and so reduce GPU memory usage. Running this on SageMaker training jobs makes it straightforward to iterate on model size, instance type, and hyperparameters.


In [None]:
import time
from sagemaker.pytorch import PyTorch
from sagemaker.huggingface import HuggingFace  # used by the Stage 2 estimator
from getpass import getpass
import yaml
from jinja2 import Template

In [None]:
hf_token = getpass()

### Training Params and Hyperparams

---
The YAML file (below) defines the core configuration for fine-tuning GPT‑OSS models on SageMaker. Below are the key sections and their purpose:

- **Model Setup**  
  - `model_name_or_path`: Specifies the pre-trained GPT‑OSS model to fine-tune (e.g., `openai/gpt-oss-20b`).
  - `attn_implementation`: Uses `kernels-community/vllm-flash-attn3` for faster attention via Flash Attention v3.
  - `torch_dtype`: Loads the weights in bfloat16 (`bfloat16`), which halves memory use relative to float32, usually with little effect on fine-tuning quality.

- **Dataset**  
  - `dataset_id_or_path`: Path to the training dataset in `messages` format.
  - `max_seq_length`: Sets the token limit for each input sequence.
  - `packing`: Enables efficient token packing for better throughput during training.

- **LoRA (PEFT)**  
  - `use_peft`: Enables parameter-efficient fine-tuning.
  - `lora_target_modules` and `lora_modules_to_save`: Define which model layers/adapters are modified and saved.

- **Training**  
  - `num_train_epochs`, `per_device_train_batch_size`, and `gradient_accumulation_steps`: Control training duration and batch sizing.
  - `gradient_checkpointing`: Reduces memory usage during training by trading off compute.
  - `optim`: Uses `adamw_torch_fused` for fused optimizer support.

- **Logging & Saving**  
  - Logs metrics to MLflow and saves a checkpoint at the end of each epoch (`save_strategy: epoch`).
  - A fixed `seed` is pre-configured so that runs are reproducible.

This file is passed as input to the training script via Accelerate + DeepSpeed, allowing users to modify training behavior without changing code.
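
The batch-size settings interact with the number of GPUs. As a worked example, the sketch below computes the effective batch size per optimizer step from the values this notebook uses (batch size 6, accumulation 3, 8 GPUs, sequence length 2048); `effective_batch_size` is an illustrative helper, not a library function.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return per_device * grad_accum * num_gpus


per_device_train_batch_size = 6  # from the recipe
gradient_accumulation_steps = 3  # from the recipe
num_gpus = 8                     # GPUs on the single training instance
max_seq_length = 2048            # from the recipe

sequences_per_step = effective_batch_size(
    per_device_train_batch_size, gradient_accumulation_steps, num_gpus
)
# With packing enabled, each sequence holds at most max_seq_length tokens.
max_tokens_per_step = sequences_per_step * max_seq_length

print(sequences_per_step)   # 144
print(max_tokens_per_step)  # 294912
```

If you change the instance type or GPU count, adjusting `gradient_accumulation_steps` is one way to keep the effective batch size constant.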

In [None]:
# User-defined variables
model_name = "openai/gpt-oss-20b"
tokenizer_name = "openai/gpt-oss-20b"

# dataset path inside a sagemaker container
dataset_path = "/opt/ml/input/data/training/HuggingFaceH4--Multilingual-Thinking.jsonl"
output_path = "/opt/ml/model/openai-gpt-oss-20b-HuggingFaceH4-Multilingual-Thinking/"

bf16_flag = "true"  # bf16 requires Ampere or newer GPUs (e.g., A100, H100/H200, Blackwell)
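
The `dataset_path` above follows from how SageMaker exposes input channels in File mode: objects from the channel's S3 location are copied under `/opt/ml/input/data/<channel_name>/`. A small sketch of that mapping, assuming File input mode and a channel named `training` (`channel_file_path` is a hypothetical helper, and the bucket name is a placeholder):

```python
import posixpath


def channel_file_path(s3_uri: str, channel: str = "training") -> str:
    """Map an uploaded S3 object URI to its path inside the training container."""
    filename = posixpath.basename(s3_uri)
    return posixpath.join("/opt/ml/input/data", channel, filename)


# "my-bucket" is a placeholder bucket name.
example_uri = "s3://my-bucket/dataset/HuggingFaceH4--Multilingual-Thinking.jsonl"
print(channel_file_path(example_uri))
# /opt/ml/input/data/training/HuggingFaceH4--Multilingual-Thinking.jsonl
```

The channel name must match the key passed to `fit`, for example `{"training": uploaded_s3_uri}`.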

In [None]:
yaml_template = """# Model arguments
model_name_or_path: {{ model_name }}
tokenizer_name_or_path: {{ tokenizer_name }}
model_revision: main
torch_dtype: bfloat16
attn_implementation: kernels-community/vllm-flash-attn3
output_dir: {{ output_dir }}

# Dataset arguments
dataset_id_or_path: {{ dataset_path }}
max_seq_length: 2048
packing: true
packing_strategy: wrapped

# LoRA arguments
use_peft: true
lora_target_modules: "all-linear"
lora_modules_to_save: ["7.mlp.experts.gate_up_proj", "7.mlp.experts.down_proj", "15.mlp.experts.gate_up_proj", "15.mlp.experts.down_proj", "23.mlp.experts.gate_up_proj", "23.mlp.experts.down_proj"]
lora_r: 8
lora_alpha: 16

# Training arguments
num_train_epochs: 1
per_device_train_batch_size: 6
per_device_eval_batch_size: 6
gradient_accumulation_steps: 3
gradient_checkpointing: true
optim: adamw_torch_fused
gradient_checkpointing_kwargs:
  use_reentrant: true
learning_rate: 1.0e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1
max_grad_norm: 0.3
bf16: {{ bf16_flag }}
bf16_full_eval: {{ bf16_flag }}
tf32: false

# Logging arguments
logging_strategy: steps
logging_steps: 2
report_to:
  - mlflow
save_strategy: "epoch"
seed: 42
"""

config_filename = "openai-gpt-oss-20b-qlora.yaml"
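
To give a feel for the LoRA arguments in the recipe, the sketch below computes the standard LoRA scaling factor (`lora_alpha / lora_r`) and the number of trainable weights a rank-`r` adapter adds to one linear layer. The 4096×4096 layer shape is a hypothetical example chosen for round numbers, not the actual GPT‑OSS layer size.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights in the two adapter matrices (r x d_in and d_out x r)."""
    return r * (d_in + d_out)


scaling = lora_scaling(lora_alpha=16, lora_r=8)
print(scaling)  # 2.0

d_in = d_out = 4096  # hypothetical layer shape, for illustration only
added = lora_added_params(d_in, d_out, r=8)
frozen = d_in * d_out
print(added)                     # 65536
print(f"{added / frozen:.2%}")   # 0.39%
```

Raising `lora_r` increases adapter capacity and memory use linearly; keeping `lora_alpha / lora_r` fixed preserves the update scale.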

In [None]:
# Render template and plug dynamic values
rendered_yaml = Template(yaml_template).render(
    dataset_path=dataset_path,
    output_dir=output_path,
    model_name=model_name,
    tokenizer_name=tokenizer_name,
    bf16_flag=bf16_flag
)

# Print to verify
print(rendered_yaml)
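
When a recipe is assembled from several sections, it is easy to define the same key twice; YAML loaders typically keep only the last value without warning. The following is a deliberately naive sketch that flags repeated top-level keys with a regular expression. `duplicate_top_level_keys` is a hypothetical helper and does not handle nested mappings or quoted keys.

```python
import re
from collections import Counter

# Matches an unindented "key:" at the start of a line; comments and nested keys are skipped.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w.-]*):")


def duplicate_top_level_keys(yaml_text: str) -> list:
    """Return the sorted top-level keys that occur more than once."""
    keys = [
        match.group(1)
        for line in yaml_text.splitlines()
        if (match := TOP_LEVEL_KEY.match(line))
    ]
    return sorted(key for key, n in Counter(keys).items() if n > 1)


sample = """\
# Model arguments
bf16: true
output_dir: /tmp/out
# Training arguments
bf16: true
gradient_checkpointing_kwargs:
  use_reentrant: true
"""
print(duplicate_top_level_keys(sample))  # ['bf16']
```

In the notebook, this could be run as `duplicate_top_level_keys(rendered_yaml)` before the file is written.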

In [None]:
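# Save the rendered recipe where the training job expects it (code/recipes/)
recipes_dir = os.path.join("code", "recipes")
os.makedirs(recipes_dir, exist_ok=True)
with open(os.path.join(recipes_dir, config_filename), "w") as f:
    f.write(rendered_yaml)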

The following model-to-instance pairings have been tested for the `gpt-oss` models:

| GPT‑OSS Model | SageMaker Instance | GPU Specs |
|---|---|---|
| [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | `ml.p5en.48xlarge` | 8× NVIDIA H200 GPUs, 141 GB HBM3e each |
| [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) | `ml.p4de.24xlarge` | 8× NVIDIA A100 GPUs, 80 GB HBM2e each |
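
A back-of-the-envelope estimate helps explain why multi-GPU instances are used. The sketch below computes only the memory needed to hold the weights in bfloat16 (2 bytes per parameter), using the approximate parameter counts quoted earlier. It is a lower bound under that assumption: activations, LoRA gradients, optimizer state, and framework overhead come on top.

```python
BYTES_PER_PARAM_BF16 = 2


def bf16_weights_gb(params_billion: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM_BF16 / 1e9


weights_20b = bf16_weights_gb(21.0)
weights_120b = bf16_weights_gb(117.0)
print(weights_20b)   # 42.0
print(weights_120b)  # 234.0

# Aggregate GPU memory of ml.p4de.24xlarge: 8 GPUs x 80 GB each.
p4de_total_gb = 8 * 80
print(p4de_total_gb)  # 640
```

With ZeRO Stage 3 the weights are sharded across the GPUs, so the per-GPU share of the 20B weights is roughly one eighth of the total.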


In [None]:
NUM_GPUS = 8
job_name = 'gpt-oss-finetuning'
training_instance_type = "ml.p4de.24xlarge"

---
**Training Options**

This notebook provides two options for training GPT‑OSS models on Amazon SageMaker:

**Stage 1: Use PyTorch SageMaker Container**
Leverages the official PyTorch SageMaker container to run a custom training script using the Accelerate and DeepSpeed libraries. This option is ideal for users who want full control over the training pipeline and dependencies.

**Stage 2: Use HuggingFace SageMaker Container ✅**  
Uses the Hugging Face SageMaker container with built-in support for Transformers and TRL (Transformer Reinforcement Learning). This approach simplifies setup and is optimized for fine-tuning open-weight language models like GPT‑OSS.

> **We recommend using _Stage 2_ for faster setup and better integration with Hugging Face's ecosystem.**

---

### Stage 1: Use PyTorch SageMaker Container

In [None]:
%%writefile code/requirements.txt
transformers>=4.55.0
kernels>=0.9.0
datasets==4.0.0
bitsandbytes==0.46.1
trl>=0.20.0
peft>=0.17.0
lighteval==0.10.0
hf-transfer==0.1.8
hf_xet
tensorboard 
liger-kernel==0.6.1
deepspeed==0.17.4
lm-eval[api]==0.4.9
Pillow
mlflow
sagemaker-mlflow==0.1.0
triton
git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

In [None]:
pytorch_estimator = PyTorch(
    # Container image URIs are region-specific; change the region if you are not running in us-west-2
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker",
    entry_point="accelerate_sagemaker_train.sh", # Adapted bash script to train using accelerate on SageMaker - Multi-GPU
    source_dir="code",
    instance_type=training_instance_type,
    instance_count=1,
    base_job_name=f"{job_name}-pytorch",
    role=role,
    volume_size=300,
    py_version="py312",
    keep_alive_period_in_seconds=3600,
    environment={
        "MLFLOW_EXPERIMENT_NAME": "openai-gpt-oss-20b-pytorch",
        "MLFLOW_TAGS": '{"source.job": "sm-training-jobs", "source.type": "sft", "source.framework": "pytorch"}',
        "HF_TOKEN": hf_token,
        # Replace with the ARN of your own SageMaker MLflow tracking server
        "MLFLOW_TRACKING_URI": "arn:aws:sagemaker:us-west-2:122610505991:mlflow-tracking-server/llm-experimentation",
    },
    hyperparameters={
        "num_process": NUM_GPUS,
        "config": f"recipes/{config_filename}",
    }
)
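
The image URI in the estimator is tied to one region. As a sketch of how it could be derived from the session region instead, the helper below assembles the same ECR URI from its parts. `dlc_image_uri` is a hypothetical function, and the default registry account is taken from the URI used in this notebook; some AWS regions use a different account, so verify the value for your region.

```python
def dlc_image_uri(region: str, repository: str, tag: str,
                  account: str = "763104351884") -> str:
    """Assemble an ECR image URI of the form used by the estimators in this notebook."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


uri = dlc_image_uri(
    region="us-west-2",
    repository="pytorch-training",
    tag="2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker",
)
print(uri)
# 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker
```

In the notebook, `region` could be taken from `sess.boto_region_name`.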


In [None]:
# Launch the training job; wait=False returns immediately instead of blocking the notebook
pytorch_estimator.fit({"training": uploaded_s3_uri}, wait=False)

In [None]:
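# Run this cell only after the training job has completed; the model artifact does not exist in S3 until then
s3_model_data_uri = pytorch_estimator.model_data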

### Stage 2: Use HuggingFace SageMaker Container

In [None]:
%%writefile code/requirements.txt
torch==2.7.1
torchvision==0.22.1 
torchaudio==2.7.1
transformers>=4.55.0
kernels>=0.9.0
datasets==4.0.0
bitsandbytes==0.46.1
trl>=0.20.0
peft>=0.17.0
lighteval==0.10.0
hf-transfer==0.1.8
hf_xet
tensorboard 
liger-kernel==0.6.1
deepspeed==0.17.4
lm-eval[api]==0.4.9
Pillow
mlflow
sagemaker-mlflow==0.1.0
triton
git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

In [None]:
huggingface_estimator = HuggingFace(
    # Container image URIs are region-specific; change the region if you are not running in us-west-2
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.5.1-transformers4.49.0-gpu-py311-cu124-ubuntu22.04",
    entry_point="accelerate_sagemaker_train.sh", # Adapted bash script to train using accelerate on SageMaker - Multi-GPU
    source_dir="code",
    instance_type=training_instance_type,
    instance_count=1,
    base_job_name=f"{job_name}-huggingface",
    role=role,
    volume_size=300,
    py_version="py311",
    keep_alive_period_in_seconds=3600,
    environment={
        "MLFLOW_EXPERIMENT_NAME": "openai-gpt-oss-20b-huggingface",
        "MLFLOW_TAGS": '{"source.job": "sm-training-jobs", "source.type": "sft", "source.framework": "huggingface"}',
        "HF_TOKEN": hf_token,
        # Replace with the ARN of your own SageMaker MLflow tracking server
        "MLFLOW_TRACKING_URI": "arn:aws:sagemaker:us-west-2:122610505991:mlflow-tracking-server/llm-experimentation",
    },
    hyperparameters={
        "num_process": NUM_GPUS,
        "config": f"recipes/{config_filename}",
    },
    sagemaker_session=sess
)


In [None]:
# Launch the training job; wait=False returns immediately instead of blocking the notebook
huggingface_estimator.fit({"training": uploaded_s3_uri}, wait=False)

In [None]:
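# Run this cell only after the training job has completed; the model artifact does not exist in S3 until then
s3_model_data_uri = huggingface_estimator.model_data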
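
Because the job is started with `wait=False`, the model artifact is only available once training finishes. One way to block until then is to poll the job status, sketched below. `wait_for_model_artifacts` is a hypothetical helper; the `describe` callable is injected so the logic can be exercised without AWS access, and it is assumed to behave like the `describe_training_job` method of a boto3 SageMaker client.

```python
import time
from typing import Callable

FAILED_STATES = {"Failed", "Stopped"}


def wait_for_model_artifacts(
    describe: Callable[..., dict],
    job_name: str,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a training job until it completes and return its model artifact S3 URI."""
    while True:
        description = describe(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        if status == "Completed":
            return description["ModelArtifacts"]["S3ModelArtifacts"]
        if status in FAILED_STATES:
            reason = description.get("FailureReason", "no reason given")
            raise RuntimeError(f"Training job {job_name} ended as {status}: {reason}")
        sleep(poll_seconds)  # still InProgress or Stopping


# Offline demo with a fake describe function and placeholder values.
_responses = iter([
    {"TrainingJobStatus": "InProgress"},
    {"TrainingJobStatus": "Completed",
     "ModelArtifacts": {"S3ModelArtifacts": "s3://my-bucket/job/output/model.tar.gz"}},
])
artifact_uri = wait_for_model_artifacts(
    lambda **kwargs: next(_responses), "demo-job", sleep=lambda seconds: None
)
print(artifact_uri)  # s3://my-bucket/job/output/model.tar.gz
```

In the notebook, the real call might look like `wait_for_model_artifacts(boto3.client("sagemaker").describe_training_job, huggingface_estimator.latest_training_job.name)`.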

#### Download Model from S3

In [None]:
import tarfile
from sagemaker.s3 import S3Downloader

In [None]:
local_model_path = "./fine_tuned_model"
os.makedirs(local_model_path, exist_ok=True)

In [None]:
S3Downloader.download(
    s3_uri=s3_model_data_uri,
    local_path=local_model_path
)
print(f"Downloaded model file to {local_model_path}")

In [None]:
def untar_file(tar_path: str, destination: str) -> None:
    """Extract a gzip-compressed tar archive into `destination`, creating it if needed."""
    if not os.path.isfile(tar_path):
        raise FileNotFoundError(f"The file '{tar_path}' does not exist.")

    os.makedirs(destination, exist_ok=True)

    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=destination)
        print(f"Extracted '{tar_path}' to '{destination}'.")


# untar model file
untar_file(
    tar_path=os.path.join(local_model_path, os.path.basename(s3_model_data_uri)), 
    destination=local_model_path
)
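
`tarfile.extractall` trusts the member paths stored in the archive. For archives from sources you do not fully control, a stricter variant can reject entries that would land outside the destination. The sketch below is one conservative approach, not the notebook's own method; `safe_extract` is a hypothetical helper, and it also refuses symbolic and hard links.

```python
import io
import os
import tarfile
import tempfile


def safe_extract(tar_path: str, destination: str) -> list:
    """Extract an archive after checking that every member stays inside `destination`."""
    dest_root = os.path.realpath(destination)
    with tarfile.open(tar_path, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            if member.issym() or member.islnk():
                raise ValueError(f"refusing link member: {member.name}")
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"member escapes destination: {member.name}")
        tar.extractall(path=dest_root, members=members)
    return [member.name for member in members]


# Demo on a small archive built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    archive = os.path.join(tmp_dir, "model.tar.gz")
    payload = b"weights"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo("model/adapter.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    out_dir = os.path.join(tmp_dir, "out")
    os.makedirs(out_dir)
    extracted = safe_extract(archive, out_dir)
    extracted_ok = os.path.isfile(os.path.join(out_dir, "model", "adapter.bin"))

print(extracted)     # ['model/adapter.bin']
print(extracted_ok)  # True
```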