# Train your Reasoning Model using NeMo 2.0

This tutorial shows how to fine-tune Meta's Llama 3 8B Instruct model using NVIDIA NeMo and supervised fine-tuning (SFT). You'll train the model on complex instruction-following and reasoning tasks using the [Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset).

### ✅ What You'll Learn
1. Load and preprocess a reasoning-focused instruction dataset.
2. Apply SFT with NeMo 2.0.
3. Train using NeMo's distributed, mixed-precision trainer.
4. Save a fine-tuned checkpoint ready for evaluation or deployment.

### 🚀 Ideal For
1. Multi-turn reasoning (e.g., chain-of-thought)
2. Domain-specific instruction following
3. Question answering, dialogue systems, and agentic behaviors



## Step 1. Convert HuggingFace Checkpoint to NeMo Format

Before training, we need to convert the HuggingFace Llama 3 8B Instruct checkpoint into NeMo format. NeMo provides a built-in utility, ```llm.import_ckpt()```, to handle this conversion.

### ⚠️ This step only needs to be run once per model.
After conversion, the model can be loaded and fine-tuned using NeMo APIs directly.

In [11]:
import nemo_run as run
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig

import torch
import pytorch_lightning as pl
from pathlib import Path
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed
from nemo.lightning.pytorch.optim import CosineAnnealingScheduler, MegatronOptimizerModule, PytorchOptimizerModule
from datetime import datetime

# Configure the import from HuggingFace format to NeMo format
def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),  # Predefined LLaMA 3 8B model structure
        source="hf:///workspace/Meta-Llama-3-8B-Instruct",  # Path to HF checkpoint (local or HF hub)
        overwrite=False,  # Set to True if you want to overwrite an existing NeMo checkpoint
    )

# Create the configured import task
import_ckpt = configure_checkpoint_conversion()

# Define the local executor (single-node)
local_executor = run.LocalExecutor()

# Execute the checkpoint conversion
run.run(import_ckpt, executor=local_executor)

Log directory is: /root/.nemo_run/experiments/nemo.collections.llm.api.import_ckpt/nemo.collections.llm.api.import_ckpt_1747804885/nemo.collections.llm.api.import_ckpt


Launched app: local_persistent://nemo_run/nemo.collections.llm.api.import_ckpt-kjwcwsrkxgj37


Waiting for job nemo.collections.llm.api.import_ckpt-kjwcwsrkxgj37 to finish [log=True]...


mport_ckpt/0  $NEMO_MODELS_CACHE=/root/.cache/nemo/models
mport_ckpt/0 ✓ Checkpoint imported to /root/.cache/nemo/models/Meta-Llama-3-8B-Instruct


Job nemo.collections.llm.api.import_ckpt-kjwcwsrkxgj37 finished: SUCCEEDED


✓ Checkpoint imported to /root/.cache/nemo/models/Meta-Llama-3-8B
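
Because the conversion only needs to run once, it can be handy to check whether a converted checkpoint is already in the cache before launching the import again. The sketch below uses only the standard library. It assumes the cache layout shown in the log above (`$NEMO_MODELS_CACHE`, falling back to `~/.cache/nemo/models`); the helper names `nemo_models_cache` and `is_imported` are hypothetical and not part of NeMo. The demo runs against a temporary directory rather than a real cache.

```python
import os
import tempfile
from pathlib import Path


def nemo_models_cache() -> Path:
    """Return $NEMO_MODELS_CACHE if set, else ~/.cache/nemo/models (assumed default)."""
    override = os.environ.get("NEMO_MODELS_CACHE")
    return Path(override) if override else Path.home() / ".cache" / "nemo" / "models"


def is_imported(model_name: str, cache_root: Path) -> bool:
    """True if a non-empty directory for the model exists under the cache root."""
    ckpt_dir = cache_root / model_name
    return ckpt_dir.is_dir() and any(ckpt_dir.iterdir())


# Demo against a temporary directory standing in for the real cache.
fake_cache = Path(tempfile.mkdtemp())
before = is_imported("Meta-Llama-3-8B-Instruct", fake_cache)

# Simulate a finished import by creating a directory with one placeholder file.
fake_ckpt = fake_cache / "Meta-Llama-3-8B-Instruct"
fake_ckpt.mkdir()
(fake_ckpt / "placeholder.txt").write_text("stand-in for checkpoint files")
after = is_imported("Meta-Llama-3-8B-Instruct", fake_cache)

print(before, after)
```

In a real session you would call `is_imported("Meta-Llama-3-8B-Instruct", nemo_models_cache())` and skip the import cell when it returns `True`.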

## Step 2. Prepare Data

In this section, we define the configuration for loading and preprocessing an instruction-tuning dataset using NeMo’s FineTuningDataModule. The dataset is expected to be in a structured format (e.g. JSONL), stored locally as ```training.jsonl```.

Parameters such as batch size, number of workers, and memory mapping can be adjusted based on the model size, dataset size, and available compute resources.
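
As a rough illustration of the expected file, the sketch below writes a tiny `training.jsonl` with one JSON object per line. It assumes the data module's default `input`/`output` record keys; check this against your NeMo version. The raw field names `prompt` and `response`, the helper names, and the sample records are all hypothetical, and a temporary directory stands in for `/workspace`.

```python
import json
import tempfile
from pathlib import Path


def to_sft_record(example: dict) -> dict:
    """Map a raw example to the assumed {"input", "output"} layout."""
    return {
        "input": example["prompt"].strip(),
        "output": example["response"].strip(),
    }


def write_jsonl(records, path: Path) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical raw examples; real data would come from the downloaded dataset.
raw_examples = [
    {"prompt": "What is 12 * 7? Think step by step.", "response": "12 * 7 = 84."},
    {"prompt": " Name a prime number above 10. ", "response": "11 is prime."},
]

dataset_root = Path(tempfile.mkdtemp())  # stand-in for /workspace
train_path = dataset_root / "training.jsonl"
n_written = write_jsonl((to_sft_record(e) for e in raw_examples), train_path)

# Read the file back to confirm every line parses as a standalone JSON object.
with train_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(n_written, loaded[0])
```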

In [12]:
import json
import shutil
from pathlib import Path
from typing import TYPE_CHECKING, Any, Dict, List, Optional

from datasets import Dataset, DatasetDict, load_dataset

from nemo.collections.llm.gpt.data.core import get_dataset_root
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule
from nemo.core.config import hydra_runner
from nemo.collections import llm
from nemo.lightning.io.mixin import IOMixin
from nemo.utils import logging

N_DEVICES = 4
timestamp = datetime.now().strftime("%Y%m%d-%H%M")
experiment_name = "baseline-8GPUs-all-data-cleaned-shuffle-no-distrib-sampler-500k-2-workers"

# Define fine-tuning dataset configuration
finetune_config = run.Config(
    llm.FineTuningDataModule,
    dataset_root="/workspace",       # Path to your preprocessed dataset (JSONL, etc.)
    seq_length=8192,                 # Max sequence length for input tokens
    micro_batch_size=1,              # Per-device batch size
    global_batch_size=256,           # Total batch size across all devices
    seed=1234,                       # Seed for reproducibility
    memmap_workers=1,                # Worker processes for the memory-mapped dataset
    num_workers=8,                   # DataLoader worker processes
    pin_memory=True,                 # Optimize data transfer to GPU
)

## Step 3. Configure SFT with the NeMo 2.0 API

In this step, we'll use the modular NeMo 2.0 API to configure:

* The distributed trainer

* Logging and checkpointing

* Optimizer with cosine annealing scheduler

* Model definition and resume behavior

* Final recipe assembly for fine-tuning

### ⚙️ 3.1 Configure the Trainer
We use Megatron's distributed training strategy (`nl.MegatronStrategy`) with tensor model parallelism across all 4 GPUs and enable bf16 mixed precision.

In [13]:
def trainer() -> run.Config[nl.Trainer]:
    strategy = run.Config(
        nl.MegatronStrategy,
        tensor_model_parallel_size=4,
        optimizer_cpu_offload=True
    )
    trainer = run.Config(
        nl.Trainer,
        devices=4,
        num_nodes=1,
        max_steps=100,
        accelerator="gpu",
        strategy=strategy,
        plugins=bf16_mixed(),
        log_every_n_steps=50,
        limit_val_batches=0,
        val_check_interval=0,
        num_sanity_val_steps=0,
        use_distributed_sampler=False,
    )
    return trainer    
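
To see how the trainer settings interact with the batch sizes from Step 2, the sketch below works through the usual Megatron-style arithmetic: the GPUs left over after model parallelism form the data-parallel group, and the global batch is reached through gradient accumulation. The helper names are hypothetical, and context parallelism is ignored for simplicity.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int = 1) -> int:
    """Number of model replicas once GPUs are grouped for model parallelism."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by the model-parallel size")
    return world_size // model_parallel


def grad_accumulation_steps(global_batch: int, micro_batch: int, dp_size: int) -> int:
    """Micro-batches each replica processes before one optimizer step."""
    samples_per_pass = micro_batch * dp_size
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * dp_size")
    return global_batch // samples_per_pass


# Values from this tutorial: 4 GPUs on 1 node, tensor parallel size 4,
# micro_batch_size=1, global_batch_size=256.
dp = data_parallel_size(world_size=4, tensor_parallel=4)
accum = grad_accumulation_steps(global_batch=256, micro_batch=1, dp_size=dp)
print(dp, accum)  # 1 256
```

With these settings all four GPUs hold shards of a single model replica, so each optimizer step accumulates gradients over 256 micro-batches.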

### 📝 3.2 Configure Logging and Checkpointing
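This cell configures metric logging and a checkpoint callback that saves every 10 training steps, keeps the best checkpoint by `reduced_train_loss`, and always keeps the most recent one.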

In [14]:
def logger() -> run.Config[nl.NeMoLogger]:
    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=10,
        monitor="reduced_train_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )

    return run.Config(
        nl.NeMoLogger,
        name="trained-model-checkpoints",
        log_dir=f"./results-{timestamp}-{N_DEVICES}-devices-{experiment_name}",
        use_datetime_version=True,
        ckpt=ckpt,
        wandb=None
    )

### 📈 3.3 Configure Optimizer with Cosine Annealing
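This cell configures the Adam optimizer with gradient clipping, distributed optimizer support, and a cosine annealing learning rate schedule. Note that `warmup_steps=100` matches the trainer's `max_steps=100`, so this short demo run spends essentially all of its steps in warmup; raise `max_steps` for a real training run.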

In [15]:
from megatron.core.optimizer import OptimizerConfig

def lr_scheduler():
    return run.Config(
        CosineAnnealingScheduler,
        warmup_steps=100,        
        constant_steps=1000,
        min_lr=1e-6,
    )
    
def adam_with_cosine_annealing() -> run.Config[nl.OptimizerModule]:
    opt_cfg = run.Config(
        OptimizerConfig,
        optimizer="adam",
        lr=1e-4,
        weight_decay=0.001,
        use_distributed_optimizer=True,
        clip_grad=1.0,
        bf16=True,
    )
    
    return run.Config(
        nl.MegatronOptimizerModule,
        config=opt_cfg,
        lr_scheduler=lr_scheduler(), 
    )
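
To make the shape of such a schedule concrete, here is a minimal, framework-free sketch of linear warmup followed by cosine decay and a final hold at the minimum rate. It is a simplified illustration, not NeMo's `CosineAnnealingScheduler` implementation: the function name `lr_at_step`, the phase ordering, and the exact step offsets are assumptions made for this example.

```python
import math


def lr_at_step(step: int, *, max_lr: float, min_lr: float,
               warmup_steps: int, constant_steps: int, max_steps: int) -> float:
    """Learning rate for a warmup -> cosine decay -> hold-at-min_lr schedule."""
    # Phase 1: linear warmup from 0 up to max_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * step / warmup_steps
    # Phase 3: hold at min_lr for the final `constant_steps` steps.
    decay_end = max_steps - constant_steps
    if step >= decay_end:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (decay_end - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative long run (these step counts are made up for the example).
long_run = dict(max_lr=1e-4, min_lr=1e-6, warmup_steps=100, constant_steps=100, max_steps=1000)
peak = lr_at_step(100, **long_run)   # end of warmup: full learning rate
mid = lr_at_step(500, **long_run)    # halfway through the cosine decay
tail = lr_at_step(950, **long_run)   # inside the final hold

# The tutorial's values: training stops at step 100, while still warming up.
demo_run = dict(max_lr=1e-4, min_lr=1e-6, warmup_steps=100, constant_steps=1000, max_steps=100)
halfway = lr_at_step(50, **demo_run)

print(peak, mid, tail, halfway)
```

Under these assumptions, the demo configuration reaches only about half of the peak learning rate at step 50, which is why longer runs are needed to see the decay phase at all.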

### 🧠 3.4 Define the Base Model and Resume Logic
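We use the built-in Llama 3 8B config from NeMo. The `restore_config` points at the converted base checkpoint from Step 1, and `resume_if_exists=True` lets a restarted run continue from a checkpoint already saved in the log directory.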

In [16]:
def llama3_8b() -> run.Config[pl.LightningModule]:
    return run.Config(llm.LlamaModel, config=run.Config(llm.Llama3Config8B))

def resume() -> run.Config[nl.AutoResume]:
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(
            nl.RestoreConfig,
            path="nemo://Meta-Llama-3-8B-Instruct",  # Change to local path if needed
        ),
        resume_if_exists=True,
    )
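
The decision behind this resume configuration can be sketched in plain Python: continue from the newest checkpoint of an earlier run if one exists, otherwise start from the converted base weights. This is a conceptual sketch only and does not reproduce NeMo's `AutoResume` internals; the `step-<N>` directory naming and both helper names are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

STEP_PATTERN = re.compile(r"step-(\d+)$")  # hypothetical checkpoint naming


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, if any."""
    if not ckpt_dir.is_dir():
        return None
    found = []
    for path in ckpt_dir.iterdir():
        match = STEP_PATTERN.search(path.name)
        if path.is_dir() and match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0])[1] if found else None


def choose_start_point(ckpt_dir: Path, base_weights: str, resume_if_exists: bool = True) -> str:
    """Prefer an existing run checkpoint; fall back to the base weights."""
    latest = latest_checkpoint(ckpt_dir) if resume_if_exists else None
    return str(latest) if latest is not None else base_weights


base = "nemo://Meta-Llama-3-8B-Instruct"
run_dir = Path(tempfile.mkdtemp())

# Fresh run: nothing saved yet, so training starts from the base weights.
fresh = choose_start_point(run_dir, base)

# Restarted run: pick the highest step, comparing numerically rather than by name.
for step in (10, 100, 20):
    (run_dir / f"step-{step}").mkdir()
resumed = choose_start_point(run_dir, base)

print(fresh)
print(resumed)
```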


### 📦 3.5 Assemble the Fine-Tuning Recipe
This ties together the model, trainer, dataset config, optimizer, logger, and resume settings into a single training recipe using `run.Partial` from nemo-run.

In [17]:
def configure_finetuning_recipe():
    return run.Partial(
        llm.finetune,
        model=llama3_8b(),
        trainer=trainer(),
        data=finetune_config,  # From earlier step
        log=logger(),
        optim=adam_with_cosine_annealing(),
        resume=resume(),
    )

## ▶️ Step 4. Run Supervised Fine-Tuning (SFT) with NeMo 2.0 and nemo-run
Now that everything is configured (model, trainer, optimizer, logging, and data), it's time to launch the training job using nemo-run's LocalExecutor.

This will:

* Use torchrun to launch a multi-GPU job

* Set environment variables for optimized NCCL behavior

* Kick off the training loop with your full configuration

In [18]:
def local_executor_torchrun(nodes: int = 1, devices: int = 4) -> run.LocalExecutor:
    # Environment variables to optimize distributed training
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
    }

    return run.LocalExecutor(
        ntasks_per_node=devices,
        launcher="torchrun",
        env_vars=env_vars,
    )

# Execute the training run
if __name__ == '__main__':
    run.run(
        configure_finetuning_recipe(),
        executor=local_executor_torchrun()
    )

Log directory is: /root/.nemo_run/experiments/nemo.collections.llm.api.finetune/nemo.collections.llm.api.finetune_1747804913/nemo.collections.llm.api.finetune


Launched app: local_persistent://nemo_run/nemo.collections.llm.api.finetune-ks9cdpj3k4r5kc


Waiting for job nemo.collections.llm.api.finetune-ks9cdpj3k4r5kc to finish [log=True]...


i.finetune/0 I0521 05:21:55.683000 6205 torch/distributed/run.py:675] Using nproc_per_node=4.
i.finetune/0 W0521 05:21:55.684000 6205 torch/distributed/run.py:792] 
i.finetune/0 W0521 05:21:55.684000 6205 torch/distributed/run.py:792] *****************************************
i.finetune/0 W0521 05:21:55.684000 6205 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
i.finetune/0 W0521 05:21:55.684000 6205 torch/distributed/run.py:792] *****************************************
i.finetune/0 I0521 05:21:55.685000 6205 torch/distributed/launcher/api.py:194] Starting elastic_operator with launch configs:
i.finetune/0 I0521 05:21:55.685000 6205 torch/distributed/launcher/api.py:194]   entrypoint       : nemo_run.core.runners.fdl_runner
i.finetune/0 I0521 05:21:55.685000 6205 torch/distributed/launcher/api.p

Job nemo.collections.llm.api.finetune-ks9cdpj3k4r5kc finished: SUCCEEDED


In [19]:
!nvidia-smi

Wed May 21 07:34:20 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:47:00.0 Off |                    0 |
| N/A   31C    P0             63W /  400W |       4MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00

## 🎉 Tada! You Just Trained Your First Reasoning Model!
Congratulations! You've successfully fine-tuned Llama 3 8B Instruct into a domain-adapted reasoning model using NVIDIA NeMo 2.0.

Your model is now ready to:

* Answer questions more effectively
* Follow domain-specific instructions
* Support chain-of-thought reasoning in real-world applications

### 🚀 Next Steps
* 🧪 Evaluate your model on reasoning benchmarks (e.g., MMLU, GSM8K)
* 🪄 Add LoRA or QLoRA for even more efficient adaptation
* ☁️ Package the model for deployment or inference with Triton or vLLM
* 📤 Optionally, upload it to HuggingFace or NGC to share with the world