# Fine-tuning gpt-oss with NeMo Framework

This notebook demonstrates LoRA fine-tuning of **gpt-oss-20b** using the [multilingual-customer-support-tickets](https://www.kaggle.com/datasets/tobiasbueck/multilingual-customer-support-tickets) dataset. Each entry in the dataset includes the subject and body of a customer email, its priority level, the queue it was assigned to, and the agent's response.

In multi-agent customer care systems, routing customer queries is a crucial task. It involves evaluating a query and directing it to the appropriate sub-agent for resolution. In this example, we will fine-tune the model to perform agent ticket routing, which involves determining the correct queue for a ticket based on its email subject and body.

## Pre-requisites

> **NOTE:** Run this notebook inside the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo), tag `25.07.gpt_oss`, which includes all required dependencies. See the tutorial README for instructions on downloading the container.

The following cell installs dependencies to visualize the run configurations.

In [None]:
%%capture

!apt-get update && apt-get install -y graphviz
!pip install ipywidgets

---
## Part I: Prepare the Dataset

In [None]:
import os
import json
import random
import pandas as pd
random.seed(42)

The following cell inspects the dataset and drops rows with missing data.

In [None]:
DATA_DIR = "/nemo-experiments/data/customer-ticket-routing"

# Load the customer support data
df = pd.read_csv(os.path.join(DATA_DIR, "aa_dataset-tickets-multi-lang-5-2-50-version.csv"))

# Remove rows with missing values
df = df.dropna(subset=['subject', 'body', 'queue', 'type'])
df.head()
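Before splitting, it can help to check how tickets are distributed across queues, since an imbalanced label set affects how the evaluation metric should be read later. The sketch below is a minimal illustration that uses a toy list in place of the real DataFrame (the queue names are illustrative only); in the notebook itself, `df['queue'].value_counts()` gives the same information.

```python
from collections import Counter

# Toy stand-in for the `queue` column; the values are illustrative only.
queues = ["Technical Support", "Billing", "Technical Support", "Product Support"]

counts = Counter(queues)
total = sum(counts.values())

# Print each queue with its share of the dataset, most frequent first.
for queue, count in counts.most_common():
    print(f"{queue}: {count} ({count / total:.0%})")
```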

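Configure the train, validation, and test split ratios. The test split receives whatever remains after the train and validation splits are taken.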

In [None]:
# Set your split ratios
TRAIN_RATIO = 0.9
VAL_RATIO = 0.09
TEST_RATIO = 0.01

PREPARED_DATA_DIR = os.path.join(DATA_DIR, "prepared-data")
os.makedirs(PREPARED_DATA_DIR, exist_ok=True)

Transform the data by defining the task in the prompt:

In [None]:
# This list will hold all of our transformed data points.
transformed_data = []

def create_prompt(subject, body):
    """
    Creates a standardized prompt for the language model.
    """
    return f"A customer has submitted a support ticket. Please route it to the correct department.\n\nSubject: {subject}\n\nBody: {body}\n\nDepartment:"


# Iterate over each row of the DataFrame to create the prompt-completion pairs.
for index, row in df.iterrows():
    prompt = create_prompt(row['subject'], row['body'])
    # completion = row['type'] + ", " + row['queue']
    completion = row['queue']
    
    transformed_data.append({
        "input": prompt,
        "output": f"{completion}"
    })


random.shuffle(transformed_data)
n = len(transformed_data)

# Calculate split indices
train_end = int(n * TRAIN_RATIO)
val_end = train_end + int(n * VAL_RATIO)

train_data = transformed_data[:train_end]
val_data = transformed_data[train_end:val_end]
test_data = transformed_data[val_end:]

# Write each split as JSON Lines (one JSON object per line)


def save_jsonl(data, filename):
    with open(filename, 'w') as f:
        for entry in data:
            json.dump(entry, f)
            f.write('\n')

# Save each split
save_jsonl(train_data, os.path.join(PREPARED_DATA_DIR, "training.jsonl"))
save_jsonl(val_data, os.path.join(PREPARED_DATA_DIR, "validation.jsonl"))
save_jsonl(test_data, os.path.join(PREPARED_DATA_DIR, "test.jsonl"))

print(f"Total records: {n}")
print(f"Train: {len(train_data)}, Val: {len(val_data)}, Test: {len(test_data)}")
print(f"Saved to {PREPARED_DATA_DIR}")
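The split arithmetic used above can be checked in isolation. The sketch below is a hedged illustration built around a hypothetical helper, `split_sizes`, which is not part of the notebook. Because the train and validation sizes are truncated with `int()`, the test split absorbs the remainder and can end up slightly larger than `n * TEST_RATIO`.

```python
def split_sizes(n, train_ratio=0.9, val_ratio=0.09):
    """Return (train, val, test) sizes using the same truncation as the cell above.

    Hypothetical helper for illustration; the test split absorbs the remainder.
    """
    train = int(n * train_ratio)
    val = int(n * val_ratio)
    test = n - train - val
    return train, val, test


for n in (1000, 1234):
    train, val, test = split_sizes(n)
    print(f"n={n}: train={train}, val={val}, test={test}")
```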

In [None]:
# Inspect the prepared data
!ls {PREPARED_DATA_DIR}

---
## Part II: Finetune with NeMo Framework

In [None]:
from pathlib import Path

import nemo_run as run
from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed

In [None]:
# Define directories for intermediate artifacts
NEMO_MODELS_CACHE = "/nemo-experiments/models-cache"
NEMO_DATASETS_CACHE = "/nemo-experiments/data-cache"

os.environ["NEMO_DATASETS_CACHE"] = NEMO_DATASETS_CACHE
os.environ["NEMO_MODELS_CACHE"] = NEMO_MODELS_CACHE


# Configure the number of GPUs to use
NUM_GPU_DEVICES = 1

(Required) Configure your Hugging Face token

In [None]:
from getpass import getpass
from huggingface_hub import login

login(token=getpass("Input your HF Access Token"))

(Optional) Configure your [WandB](https://wandb.ai/) token for experiment tracking.

Leave empty and press "Enter" / skip this step if you don't wish to track with WandB.

In [None]:
import wandb

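WANDB_API_KEY = getpass("Your WandB API Key:")

# Only log in when a key was provided; an empty key means WandB tracking is skipped.
if WANDB_API_KEY:
    wandb.login(key=WANDB_API_KEY)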

### Step 1. Import the Hugging Face Checkpoint
The following code uses the `llm.import_ckpt` API to download the specified model using the `hf://<huggingface_model_id>` URL format. It will then convert the model into NeMo 2.0 format.


```python
llm.import_ckpt(model=llm.GPTOSSModel(llm.GPTOSSConfig20B()), source="hf:///nemo-experiments/models/gpt-oss-20b")
```
Below, we wrap this call with `run.Partial` to configure it and then execute it. The `run.*` primitives are part of [NeMo-Run](https://github.com/NVIDIA-NeMo/Run), which can be used to configure, launch, and manage experiments at scale, whether locally, on SLURM clusters, or in cloud environments, all from a Jupyter notebook.

In [None]:
# You can just as easily swap out the model with the 120B variant, or execute this on a remote cluster.

def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=run.Config(llm.GPTOSSModel, llm.GPTOSSConfig20B),
        source="hf:///nemo-experiments/models/gpt-oss-20b",
        overwrite=False,
    )
    
# Run your experiment locally
run.run(configure_checkpoint_conversion(), executor=run.LocalExecutor())


The step above downloads the checkpoint from Hugging Face, converts it to NeMo format, and saves it to the directory specified by the `NEMO_MODELS_CACHE` environment variable.

In [None]:
!ls $NEMO_MODELS_CACHE/gpt-oss-20b

---
### Step 2. Configure the Fine-tuning Run

NeMo Framework provides recipes for finetuning and pretraining of supported models. Below, we instantiate the finetuning recipe for `gpt-oss-20b`.

> **NOTE**: Below, we specify LoRA, but full supervised fine-tuning (SFT) can also be done by setting `peft_scheme='none'`.

In [None]:
recipe = llm.gpt_oss_20b.finetune_recipe(
    name="gpt_oss_20b_finetuning",
    dir="/nemo-experiments/",
    num_nodes=1,
    num_gpus_per_node=NUM_GPU_DEVICES,
    peft_scheme='lora',  # 'lora', 'none' (for SFT)
)



#### 2.1: Configure the Dataloader

Since we already have the data in input/output format, we can use the [`FineTuningDataModule`](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/data/fine_tuning.py) directly. 

You can also subclass this module (for example, [DollyDataModule](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/data/dolly.py) for the Dolly dataset) to define your own data preparation format and logic.

In [None]:
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule

dataloader = run.Config(
        FineTuningDataModule,
        dataset_root=PREPARED_DATA_DIR,
        seq_length=2048,
        micro_batch_size=4,
        global_batch_size=64
    )

# Configure the recipe
recipe.data = dataloader

# Visualize the dataloader
dataloader
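The ratio between `global_batch_size` and `micro_batch_size` determines how many micro-batches are accumulated before each optimizer step. The sketch below assumes the usual Megatron-style relation (global batch = micro batch × data-parallel size × accumulation steps) and that, without model parallelism, the data-parallel size equals the number of GPUs. `grad_accumulation_steps` is a hypothetical helper for illustration, not a NeMo API.

```python
def grad_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches accumulated per optimizer step on each data-parallel rank."""
    per_step = micro_batch_size * data_parallel_size
    if global_batch_size % per_step != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * data_parallel_size"
        )
    return global_batch_size // per_step


# Values from the dataloader above, on a single GPU.
print(grad_accumulation_steps(64, 4, 1))
```

Under these assumptions, adding GPUs reduces the accumulation steps per rank while keeping the effective batch size fixed.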




#### 2.2: Configure the Logger


The following example demonstrates how to set up the logger with a specific WandB project and run name. Additional configurations, such as checkpointing details, can also be specified.

In [None]:
from lightning.pytorch.loggers import WandbLogger

LOG_DIR = "/nemo-experiments/results"
LOG_NAME = "nemo2_gpt_oss_sft_customer_ticket_routing"

def logger() -> run.Config[nl.NeMoLogger]:
    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=200,
        monitor="reduced_train_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )

    # WandB is optional: enable it only if a key was entered (also handles a skipped WandB cell)
    if globals().get("WANDB_API_KEY"):
        wandb_config = run.Config(
            WandbLogger, project="NeMo_LoRA_Customer_Ticket_Routing", name="Customer_Ticket_Routing"
        )
    else:
        wandb_config = None

    return run.Config(
        nl.NeMoLogger,
        name=LOG_NAME,
        log_dir=LOG_DIR,
        use_datetime_version=False,
        ckpt=ckpt,
        wandb=wandb_config,
    )

recipe.log = logger()

logger()



#### 2.3: Configure AutoResume

In [None]:
def resume() -> run.Config[nl.AutoResume]:
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(
            nl.RestoreConfig, path=f"nemo:///{NEMO_MODELS_CACHE}/gpt-oss-20b"
        ),
        resume_if_exists=True,
    )
    
recipe.resume = resume()



#### 2.4: Trainer Configurations

You can also set individual training parameters as needed. For example:

In [None]:
recipe.trainer.max_steps = 100
recipe.trainer.val_check_interval = 25
recipe.trainer.limit_val_batches = 2
recipe.optim.config.lr = 2e-4

# Let's visualize the recipe
recipe



Several other parameters, such as the optimizer and LoRA settings, can also be adjusted. For example:

```python
# You may also configure the learning rate, optimizer, etc.
recipe.optim.config.lr = 1e-4

# Or tweak the LoRA parameters
recipe.peft.dim = 8
recipe.peft.alpha = 32
recipe.peft.dropout = 0.1
recipe.peft.target_modules = ['linear_qkv', 'linear_proj']

```
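To get a feel for what `dim` and `alpha` control, recall the standard LoRA formulation: each adapted weight matrix of shape `d_out × d_in` gains two low-rank factors with `dim * (d_in + d_out)` trainable parameters in total, and the learned update is scaled by `alpha / dim`. The sketch below is a rough illustration under that assumption. The layer size is made up, and `lora_stats` is a hypothetical helper, not part of NeMo.

```python
def lora_stats(d_in, d_out, dim, alpha):
    """Return (trainable adapter params, fraction of the full matrix, scaling factor)."""
    adapter_params = dim * (d_in + d_out)
    full_params = d_in * d_out
    return adapter_params, adapter_params / full_params, alpha / dim


# Illustrative 4096 x 4096 projection with the example values dim=8, alpha=32.
params, fraction, scale = lora_stats(4096, 4096, dim=8, alpha=32)
print(f"adapter params: {params}, fraction of full: {fraction:.4%}, scale: {scale}")
```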

---
### Step 3. Execute Finetuning

The following cell executes the configured fine-tuning recipe locally.


> **NOTE**: You can replace `run.LocalExecutor` with `run.SlurmExecutor` for SLURM cluster execution or `run.SkypilotExecutor` for cloud-based execution. For additional options and detailed guidance, please consult the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemorun/guides/execution.html).

In [None]:
run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=NUM_GPU_DEVICES))

---
### Step 4. Run In-Framework Generation

For a sanity check, we use the `llm.generate` API in NeMo 2.0 to generate samples from the trained checkpoint. First, find the last saved checkpoint in your experiment `results` directory:

In [None]:
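# Locate the "-last" checkpoint directory written by the training run
ckpt_dir = Path(LOG_DIR) / LOG_NAME / "checkpoints"
last_ckpt = next(
    (d for d in ckpt_dir.iterdir() if d.is_dir() and d.name.endswith("-last")),
    None,
)
# Fail early instead of silently producing the string "None"
if last_ckpt is None:
    raise FileNotFoundError(f"No '-last' checkpoint found in {ckpt_dir}")
peft_ckpt_path = str(last_ckpt)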
print("We will load the PEFT checkpoint from:", peft_ckpt_path)

In [None]:
# You should see weights and context directories
!ls -ltr {peft_ckpt_path}

When using the `llm.generate` API, you can provide the dataloader configured earlier, for example `input_dataset=dataloader`. This uses the test set from the specified data module to generate predictions. In the example below, the generated predictions are saved to the file given by `OUTPUT_FILE` (`ctr-peft_prediction.jsonl`).

Generating predictions needs only 1 GPU (`tensor_model_parallel_size=1`). However, using multiple GPU devices can speed up inference.

> **Note:** The execution of the following cell may take up to 10 minutes to complete, based on tests conducted using a single H100-80GB GPU. This in-framework inference is intended primarily for validation or sanity checks. For optimized inference, consider using solutions like NVIDIA NIM.

In [None]:
RESULTS_DIR = "/nemo-experiments/results/"
os.makedirs(RESULTS_DIR, exist_ok=True)


OUTPUT_FILE = os.path.join(RESULTS_DIR, "ctr-peft_prediction.jsonl")

In [None]:
from megatron.core.inference.common_inference_params import CommonInferenceParams


def trainer() -> run.Config[nl.Trainer]:
    strategy = run.Config(
        nl.MegatronStrategy,
        tensor_model_parallel_size=1,
    )
    trainer = run.Config(
        nl.Trainer,
        accelerator="gpu",
        devices=NUM_GPU_DEVICES,
        num_nodes=1,
        strategy=strategy,
        plugins=bf16_mixed(),
    )
    return trainer


def configure_inference():
    return run.Partial(
        llm.generate,
        path=str(peft_ckpt_path),
        trainer=trainer(),
        input_dataset=dataloader,
        inference_params=CommonInferenceParams(num_tokens_to_generate=50, top_k=1, return_log_probs=False, top_n_logprobs=0),
        output_path=OUTPUT_FILE,
        enable_flash_decode=False
    )


if __name__ == "__main__":
    run.run(
        configure_inference(), executor=run.LocalExecutor(ntasks_per_node=NUM_GPU_DEVICES)
    )

After inference completes, inspect the first two predictions:

In [None]:
!head -n 2 {OUTPUT_FILE} | jq

You should see output similar to the following:
```json
{
  "input": "A customer has submitted a support ticket. Please assign the type of the ticket and route it to the correct department queue.\n\nSubject: Support for ClickUp\n\nBody: have encountered recurring crashes when using ClickUp with Microsoft SQL Server 2019. The problem could be related to compatibility issues between software versions.\n\nDepartment:",
  "label": "Technical Support",
  "prediction": " Technical Support"
}

```

---
### Step 5. Calculate Evaluation Metric

We can evaluate the model's predictions by calculating the micro-averaged F1 score over the predicted queues.

In [None]:

import json
from sklearn.metrics import f1_score

labels = []
predictions = []

# Read the jsonl file and extract labels and predictions
with open(OUTPUT_FILE, "r") as f:
    for line in f:
        item = json.loads(line)
        labels.append(item["label"])
        predictions.append(item["prediction"])


# Clean up whitespace for fair comparison
clean_labels = [label.strip().lower() for label in labels]
clean_preds = [pred.strip().lower() for pred in predictions]

f1 = f1_score(clean_labels, clean_preds, average='micro')

print(f"F1 score (micro): {f1:.4f}")
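Because each ticket has exactly one label and one prediction, micro-averaged F1 reduces to plain accuracy, so the score above can be read as the fraction of tickets routed to the correct queue. If some queues are rare, a macro-averaged F1 (`average='macro'` in `f1_score`) weights every queue equally and may be more informative. The pure-Python sketch below illustrates both on toy labels; the helper names are hypothetical and the data is made up.

```python
def confusion_counts(labels, preds, cls):
    """True positives, false positives, and false negatives for one class."""
    pairs = list(zip(labels, preds))
    tp = sum(l == cls and p == cls for l, p in pairs)
    fp = sum(l != cls and p == cls for l, p in pairs)
    fn = sum(l == cls and p != cls for l, p in pairs)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in terms of counts; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(labels, preds):
    classes = sorted(set(labels) | set(preds))
    counts = [confusion_counts(labels, preds, c) for c in classes]
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / len(classes)
    return micro, macro


# Toy example: the rare class is always misrouted to the frequent one.
labels = ["billing", "billing", "billing", "technical support"]
preds = ["billing", "billing", "billing", "billing"]

micro, macro = micro_macro_f1(labels, preds)
accuracy = sum(l == p for l, p in zip(labels, preds)) / len(labels)
print(f"micro F1: {micro:.3f}, accuracy: {accuracy:.3f}, macro F1: {macro:.3f}")
```

In this toy case the micro score matches accuracy, while the macro score is lower because the rare queue is never predicted.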

> **NOTE**: If you inspect the dataset, you will notice that some ground-truth labels are ambiguous even to a human annotator, as is the case with many real-world datasets. For example, the distinction between "Technical Support" and "Product Support" is sometimes hard to draw.

---
### Step 6. Export to Hugging Face Format

The next step is to export the model to Hugging Face `.safetensors` format. This format can be ingested by NVIDIA NIM or vLLM for deployment.

Before exporting, let's merge the LoRA weights into the base model weights to have a unified finetuned checkpoint.

In [None]:
# Merge LoRA weights with base model weights


def merge_lora_with_base_model():
    return run.Partial(
        llm.peft.merge_lora,
        lora_checkpoint_path=peft_ckpt_path,
        output_path=peft_ckpt_path + "_merged",
    )


local_executor = run.LocalExecutor()
run.run(merge_lora_with_base_model(), executor=local_executor)
print(f"Merged LoRA weights with base model weights to: {peft_ckpt_path + '_merged'}")
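Conceptually, merging folds the low-rank update into the frozen weight so that inference no longer needs a separate adapter. In the standard LoRA formulation, `W_merged = W + (alpha / dim) * B @ A`. The pure-Python sketch below demonstrates this on tiny matrices and checks that the merged weight produces the same output as the base weight plus the adapter path. It illustrates the general technique only and makes no claim about how `llm.peft.merge_lora` is implemented internally; `merge_lora_weights` is a hypothetical helper.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora_weights(w, a, b, alpha, dim):
    """Return W + (alpha / dim) * B @ A for W (out x in), B (out x dim), A (dim x in)."""
    scale = alpha / dim
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter (dim=1); all values are illustrative.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, -0.5]]
merged = merge_lora_weights(w, a, b, alpha=2, dim=1)

# The merged weight matches the base weight plus the scaled adapter path.
x = [[3.0], [1.0]]
base_plus_adapter = [
    [wx[0] + 2.0 * bax[0]]
    for wx, bax in zip(matmul(w, x), matmul(b, matmul(a, x)))
]
assert matmul(merged, x) == base_plus_adapter
print(merged)
```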

The following cell configures the `llm.export_ckpt` API with NeMo-Run's `run.Partial` primitive and then executes it.

Note that `target="hf"` exports a full-weights checkpoint in Hugging Face format.

In [None]:
# Configure the export directory
EXPORT_DIR = "/nemo-experiments/models/gpt-oss-ctr-finetuned"

def configure_export_ckpt():
    return run.Partial(
        llm.export_ckpt,
        path=peft_ckpt_path + "_merged", # Use the merged checkpoint path
        target="hf",
        output_path=EXPORT_DIR,
        overwrite=True
    )


local_executor = run.LocalExecutor()
run.run(configure_export_ckpt(), executor=local_executor)
print(f"Exported Hugging Face model to: {EXPORT_DIR}")

In [None]:
!ls {EXPORT_DIR}

At this point, we have a finetuned `gpt-oss-20b` checkpoint ready to deploy!