# 🚀 NVIDIA Nemotron-3-Nano LoRA Fine-Tuning Guide with Megatron Bridge

This notebook walks you through **fine-tuning** the NVIDIA Nemotron-3-Nano-30B model from start to finish, using **LoRA** (Low-Rank Adaptation) 
so you train only a small set of parameters. 
You will train with [Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), part of the **NeMo** framework.

[![Click here to deploy.](https://brev-assets.s3.us-west-1.amazonaws.com/nv-lb-dark.svg)](https://brev.nvidia.com/launchable/deploy?launchableID=env-39PnUMhHmxbMcHKO61iQ8O5F7ZL)

---

## 📋 What You're Working With

| | |
|:--:|:--|
| 🤖 **Model** | `NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` |
| 🛠️ **Framework** | NeMo with Megatron-Bridge |
| 📐 **Method** | LoRA (Parameter-Efficient Fine-Tuning) |

---

## ✅ Prerequisites

### 💻 Hardware
- **8× GPUs**: NVIDIA H100 or A100
- **250 GB** free storage (minimum)

### 📦 Software
- **OS:** Ubuntu 22.04
- **GPU driver:** 580 or newer
- **CUDA:** 12.8 or newer
- **NVIDIA Container Toolkit** (for Docker + GPU)
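
The version minimums above can be checked with a few lines of Python. The sketch below is only an illustration: `parse_version` and `meets_minimum` are hypothetical helpers, not part of any NVIDIA tool, and the version strings are example values that you would replace with what your machine reports (for instance in the `nvidia-smi` banner). Comparing versions as numbers rather than as strings matters because `"12.10"` sorts before `"12.8"` as text.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '580.65.06' into (580, 65, 6)."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True when `installed` is at least `minimum`, compared numerically."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so '12.8' and '12.8.0' compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Example values only; substitute the versions reported on your machine.
print("driver ok:", meets_minimum("580.65.06", "580"))
print("CUDA ok:", meets_minimum("12.6", "12.8"))
```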

---

## 🗺️ Workflow at a Glance

| Step | What you'll do |
|:--:|:--|
| **1** | 🐳 Set up the Docker environment |
| **2** | 🔄 Convert HuggingFace model → Megatron format |
| **3** | 🎯 Fine-tune with LoRA |
| **4** | 🔗 Merge LoRA weights into the base model |
| **5** | 📤 Export back to HuggingFace format |
| **6** | 🌐 Deploy your fine-tuned model |

Follow the steps below in order. Let's go! 👇

---

## Step 1: Download the NeMo Docker Container

Let's begin by obtaining the official NVIDIA NeMo container, which comes preloaded with everything needed for training.
Before proceeding, make sure to set your NGC_API_KEY.

#### Set Up NGC_API_KEY
NGC hosts a wide variety of public images, models, and datasets. To pull from it, you need to generate an API key and authenticate with NGC.

To create your API key, visit: https://org.ngc.nvidia.com/setup/api-keys

When generating your NGC key, make sure to enable the "NGC Catalog" under "Services Included".

In [None]:
# Put your NGC API key here
NGC_API_KEY="<ENTER_YOUR_NGC_API_KEY_HERE>"

In [None]:
import subprocess

# Use the NGC_API_KEY set in the previous cell (run that cell first!)
try:
    api_key = NGC_API_KEY
except NameError:
    raise RuntimeError(
        "NGC_API_KEY is not set. Please run the cell above to set your NGC API key."
    ) from None
if not (api_key and api_key.strip()):
    raise ValueError("NGC_API_KEY is empty. Please set it in the cell above.")

# Log in to NGC container registry
result = subprocess.run(
    ["docker", "login", "nvcr.io", "-u", "$oauthtoken", "--password-stdin"],
    input=api_key.encode(),
    capture_output=True,
)

print(result.stdout.decode())
if result.returncode != 0:
    print("Error:", result.stderr.decode())

Now download the NVIDIA Nemotron-3 Nano container from NGC:

In [None]:
%%bash
docker pull nvcr.io/nvidia/nemo:25.11.nemotron_3_nano

## Step 2: Launch the Docker Container

Launch the Docker container with GPU capabilities and mount your current directory as the workspace.
 
**Key options explained:**
- `--gpus all`: Grants the container access to all available GPUs.
- `--ipc=host`: Shares the host's IPC namespace for improved multi-GPU support.
- `--network host`: Uses the host machine's network stack.
- `-v $(pwd):/workspace`: Mounts your current directory at `/workspace` inside the container.
- `-p 8080:8080 -p 8088:8088`: Publishes ports for monitoring and service access. Docker discards published ports when `--network host` is set, so with the command below these ports are reachable directly on the host either way.
 
**Note:** 
Run this command in your terminal (not in Jupyter). The command will start an interactive shell session inside the container.

In [None]:
# Run this in your terminal:
# docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.11.nemotron_3_nano

---

#### Note: The following steps should be executed inside the Docker container.

---

## Step 3: Set Environment Variables

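Set the HuggingFace model ID and specify the destination for the Megatron checkpoint.

Each `%%bash` cell runs in its own shell, so a variable exported in one cell is not visible in later cells. The `%env` magic sets the variable in the notebook kernel instead, and every later `%%bash` cell inherits it. If you are working in a terminal rather than a notebook, use `export NAME=value`.

In [None]:
%env HF_MODEL_ID=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
%env MEGATRON_MODEL_PATH=/workspace/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-Mbridge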

## Step 4: Convert HuggingFace Model to Megatron Format

NeMo uses Megatron-LM format for training. We need to convert the HuggingFace checkpoint to Megatron format.

This step:
- Downloads the model from HuggingFace Hub
- Converts model weights to Megatron-compatible format
- Saves to the specified output path

**⏱️ Note:** 
The first run downloads ~60GB+ and can take **15–60+ minutes** depending on your connection. The progress bar may stay at 0% for a while before moving; this is normal. Do not interrupt the cell.
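
Because the download and the converted checkpoint are large, it can help to confirm free disk space before starting. The following is a minimal sketch using only the standard library; `free_gib` and `has_room` are hypothetical helper names, `"."` stands in for whichever directory will hold your checkpoints, and the 250 figure is the storage minimum from the prerequisites (treated here as GiB, which is slightly stricter than GB).

```python
import shutil


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_room(path: str, required_gib: float) -> bool:
    """True when at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib


# "." is a placeholder; point this at the directory mounted as /workspace.
required = 250
print(f"Free: {free_gib('.'):.1f} GiB; enough for {required} GiB: {has_room('.', required)}")
```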

In [None]:
%%bash
cd /opt/Megatron-Bridge

python examples/conversion/convert_checkpoints.py import \
  --hf-model $HF_MODEL_ID \
  --megatron-path $MEGATRON_MODEL_PATH \
  --trust-remote-code

## Step 5: Fine-tune with LoRA

In this step, you will fine-tune the model using LoRA (Low-Rank Adaptation), which is an efficient technique for adapting large models with fewer trainable parameters. 
We'll use the [SQuAD dataset](https://huggingface.co/datasets/rajpurkar/squad) for this example. The SQuAD dataset is a popular benchmark for question answering, containing over 100,000 question-and-answer pairs across more than 500 diverse articles.

**Key training parameters:**
- `--peft lora`: Activates LoRA for efficient fine-tuning.
- `train.global_batch_size=128`: Sets the total batch size across all GPUs.
- `train.train_iters=50`: Determines the number of training iterations.
- `scheduler.lr_warmup_iters=10`: Number of iterations to gradually increase (warm up) the learning rate at the start of training.
- `checkpoint.pretrained_checkpoint`: Specifies the path to the pre-converted Megatron checkpoint to start from.

**Hardware note:** The sample command uses 8 GPUs (`--nproc-per-node=8`). If you have a different number of GPUs, adjust this parameter as needed.

**What to expect:** 
After training, you will see output similar to:
`validation loss at iteration 50 on validation set | lm loss value: 1.261660E-01 | lm loss PPL: 1.134470E+00 |`
This indicates the loss and perplexity on the validation set after 50 iterations. For this setup, the training should complete in about 15 minutes.
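
The two numbers in that log line are related: perplexity is the exponential of the loss. The sketch below parses the sample line quoted above and checks the relationship. The regular expression is an assumption based only on that one sample line, so it may need adjusting if your log format differs, and `parse_validation_line` is a hypothetical helper.

```python
import math
import re

LOG_LINE = (
    "validation loss at iteration 50 on validation set | "
    "lm loss value: 1.261660E-01 | lm loss PPL: 1.134470E+00 |"
)


def parse_validation_line(line: str) -> dict:
    """Extract iteration, loss, and perplexity from a validation log line."""
    pattern = (
        r"iteration\s+(\d+).*?"
        r"lm loss value:\s*([0-9.eE+-]+).*?"
        r"lm loss PPL:\s*([0-9.eE+-]+)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("line does not look like a validation log line")
    return {
        "iteration": int(match.group(1)),
        "loss": float(match.group(2)),
        "ppl": float(match.group(3)),
    }


metrics = parse_validation_line(LOG_LINE)
# exp(loss) should agree with the logged perplexity up to rounding.
print(metrics["iteration"], metrics["loss"], metrics["ppl"], math.exp(metrics["loss"]))
```

A falling loss (and a perplexity approaching 1) across validation steps is the basic signal that the adapter is fitting the data.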

In [None]:
%%bash
cd /opt/Megatron-Bridge

export ALLOW_NVLINK_FOR_NORMAL_MODE=0
export NCCL_P2P_DISABLE=1 

torchrun --nproc-per-node=8 examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
  --peft lora \
  train.global_batch_size=128 \
  train.train_iters=50 \
  scheduler.lr_warmup_iters=10 \
  checkpoint.pretrained_checkpoint=$MEGATRON_MODEL_PATH \
  model.moe_enable_deepep=False \
  model.moe_token_dispatcher_type=alltoall


---

## Optional Step 5a: Fine-Tune with a Custom Training Script and Dataset

To use your own dataset rather than the default SQuAD dataset, you can write a custom Python training script. \
This approach lets you tailor the training setup to your specific needs. \
Run the steps below directly on the host machine.

### Prepare Your Dataset

This step uses the **BIRD SQL** dataset (a text-to-SQL benchmark with schema, question, evidence, and SQL pairs) and some helper scripts.
These scripts apply the Nemotron chat template, filter by sequence length, and write `training.jsonl` into a dataset directory.

Run the cells below to download/prepare the dataset and save it to `dataset/training.jsonl`.

In [None]:
# Install into this kernel's Python (avoids user-local / system pip mismatch)
import sys
import subprocess
# Ensure pip exists in the venv (some venvs are created without it)
subprocess.check_call([sys.executable, "-m", "ensurepip", "--upgrade"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "datasets", "transformers", "jinja2"])

In [None]:
! pip install datasets transformers jinja2

In [None]:
import os
import sys
from datasets import disable_caching

disable_caching()

# Ensure bird_sql is importable (repo root = /workspace in Docker, or cwd when run locally)
workspace_root = os.environ.get("WORKSPACE", os.getcwd())
if workspace_root not in sys.path:
    sys.path.insert(0, workspace_root)
from bird_sql.dataset_bird import DatasetBIRD

DATASET_DIR = os.environ.get("DATASET_DIR", os.path.join(os.getcwd(), "dataset"))  # Defaults to ./dataset under the current directory
training_jsonl = os.path.join(DATASET_DIR, "training.jsonl")
model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
max_seq_len = 4096
num_workers = 8

os.makedirs(DATASET_DIR, exist_ok=True)

print("Preparing BIRD training dataset...")
dataset = DatasetBIRD(
    model_id_to_prep_for=model_id,
    max_seq_len=max_seq_len,
    num_workers=num_workers,
).make_dataset()
dataset = dataset.sort("length")

dataset.to_json(training_jsonl, orient="records", lines=True, force_ascii=True)
print(f"Saved {len(dataset)} samples to {training_jsonl}")
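
Before training, it is worth checking that every row in `training.jsonl` has the fields the training script expects. The training script later in this guide combines fields with the template `{input} {output}` and uses `output` as the label key, so the sketch below assumes each row is a JSON object with non-empty `input` and `output` fields. Whether the BIRD helper scripts emit exactly these names is an assumption here, and `find_bad_rows` is a hypothetical helper. The demo runs on a throwaway file.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("input", "output")  # Assumed field names, taken from the prompt template.


def find_bad_rows(jsonl_path: str, required=REQUIRED_KEYS) -> list[int]:
    """Return 1-based line numbers of rows that fail to parse or lack a required field."""
    bad = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            if not isinstance(row, dict) or any(not row.get(k) for k in required):
                bad.append(lineno)
    return bad


# Demo on a temporary file; point find_bad_rows at dataset/training.jsonl instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "training.jsonl")
    with open(demo, "w", encoding="utf-8") as f:
        f.write(json.dumps({"input": "List all users.", "output": "SELECT * FROM users;"}) + "\n")
        f.write(json.dumps({"input": "This row has no answer."}) + "\n")
    bad_rows = find_bad_rows(demo)

print("Rows needing attention:", bad_rows)
```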

### Create Custom Training Script

Create a Python script that configures training with your custom dataset. \
Run this cell on the host machine as well. The paths inside the script are the container mount points, because the script itself runs in the container. \
You do not need to change the paths in the script.

### Key Configuration Options

You can customize training by setting environment variables:

**Paths:**
- `BASE_MODEL_PATH`: Path to converted Megatron checkpoint
- `DATASET_DIR`: Directory containing `training.jsonl`
- `CHECKPOINT_DIR`: Where to save training checkpoints

**LoRA Parameters:**
- `LORA_RANK`: LoRA rank (default: 16, higher = more parameters)
- `LORA_ALPHA`: LoRA alpha scaling (default: 32)
- `LORA_DROPOUT`: LoRA dropout rate (default: 0.05)
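
To get a feel for what rank and alpha control, the sketch below works through the standard LoRA arithmetic: an adapter on a `d_out × d_in` weight adds `rank × (d_in + d_out)` trainable parameters, and the low-rank update is scaled by `alpha / rank`. The 4096 × 4096 layer shape is a made-up example, since this guide does not state the model's layer sizes, and the sketch assumes the framework follows the usual `alpha / rank` convention.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / rank


# Hypothetical 4096 x 4096 projection, used only for illustration.
d_in = d_out = 4096
added = lora_param_count(d_in, d_out, rank=16)
full = d_in * d_out
print(f"adapter params: {added:,} of {full:,} ({added / full:.2%})")
print("update scale:", lora_scale(alpha=32, rank=16))
```

With the defaults above (rank 16, alpha 32), doubling the rank doubles the adapter size and, unless alpha is raised too, halves the scale applied to the update.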

In [None]:
%%writefile custom_finetune.py
#!/usr/bin/env python3
"""Custom fine-tuning script for Nemotron-3-Nano with custom dataset."""

import os
import math
import torch

from megatron.bridge.recipes.nemotronh.nemotron_3_nano import (
    nemotron_3_nano_finetune_config,
)
from megatron.bridge.training.config import FinetuningDatasetConfig
from megatron.bridge.training.finetune import finetune
from megatron.bridge.training.gpt_step import forward_step

# Ensure script is launched with torchrun
if "LOCAL_RANK" not in os.environ and "RANK" not in os.environ:
    raise RuntimeError(
        "This script must be launched with torchrun. "
        "Example: torchrun --nproc-per-node=8 custom_finetune.py"
    )

# ===========================
# CONFIGURATION PARAMETERS
# ===========================

# Paths
BASE_MODEL_PATH = os.environ.get(
    "BASE_MODEL_PATH",
    "/workspace/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-Mbridge"
)
DATASET_DIR = os.environ.get(
    "DATASET_DIR",
    "/workspace/dataset"
)
CHECKPOINT_DIR = os.environ.get(
    "CHECKPOINT_DIR",
    "/opt/Megatron-Bridge/nemo_experiments/custom_run"
)

# Training hyperparameters
N_DEVICES = int(os.environ.get("N_DEVICES", "8"))  # Number of GPUs
MAX_SEQ_LEN = int(os.environ.get("MAX_SEQ_LEN", "4096"))
GLOBAL_BATCH_SIZE = int(os.environ.get("GLOBAL_BS", "128"))
PER_DEVICE_BATCH_SIZE = int(os.environ.get("PER_DEVICE_BS", "1"))
LEARNING_RATE = float(os.environ.get("LR", "5e-5"))
MIN_LR = float(os.environ.get("MIN_LR", "1e-6"))
WEIGHT_DECAY = float(os.environ.get("WEIGHT_DECAY", "0.001"))
CLIP_GRAD = float(os.environ.get("CLIP_GRAD", "1.0"))
WARMUP_RATIO = float(os.environ.get("WARMUP_RATIO", "0.03"))
EPOCHS = int(os.environ.get("EPOCHS", "3"))

# LoRA parameters
LORA_RANK = int(os.environ.get("LORA_RANK", "16"))
LORA_ALPHA = int(os.environ.get("LORA_ALPHA", "32"))
LORA_DROPOUT = float(os.environ.get("LORA_DROPOUT", "0.05"))


def count_jsonl_rows(filepath: str) -> int:
    """Count number of lines in JSONL file."""
    with open(filepath, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def main():
    print("="*80)
    print("Custom Nemotron-3-Nano Fine-tuning")
    print("="*80)
    
    # Validate paths
    if not os.path.isdir(BASE_MODEL_PATH):
        raise FileNotFoundError(
            f"Base model path not found: {BASE_MODEL_PATH}\n"
            "Please convert the HuggingFace model to Megatron format first."
        )
    
    training_file = os.path.join(DATASET_DIR, "training.jsonl")
    if not os.path.exists(training_file):
        raise FileNotFoundError(
            f"Training data not found: {training_file}\n"
            f"Please create a training.jsonl file in {DATASET_DIR}"
        )
    
    # Calculate training steps
    n_examples = count_jsonl_rows(training_file)
    steps_per_epoch = math.ceil(n_examples / GLOBAL_BATCH_SIZE)
    total_steps = EPOCHS * steps_per_epoch
    warmup_steps = int(WARMUP_RATIO * total_steps)
    save_interval = max(1, total_steps // 5)  # Save 5 checkpoints
    
    print(f"\n📊 Training Configuration:")
    print(f"   Base model: {BASE_MODEL_PATH}")
    print(f"   Dataset: {DATASET_DIR}")
    print(f"   Training examples: {n_examples}")
    print(f"   Epochs: {EPOCHS}")
    print(f"   Total steps: {total_steps}")
    print(f"   Steps per epoch: {steps_per_epoch}")
    print(f"   Global batch size: {GLOBAL_BATCH_SIZE}")
    print(f"   Per-device batch size: {PER_DEVICE_BATCH_SIZE}")
    print(f"   Learning rate: {LEARNING_RATE}")
    print(f"   LoRA rank: {LORA_RANK}")
    print(f"   Checkpoints will be saved to: {CHECKPOINT_DIR}")
    print()
    
    # Create base configuration
    config = nemotron_3_nano_finetune_config(
        seq_length=MAX_SEQ_LEN,
        peft="lora",
        packed_sequence=False,
        expert_model_parallelism=N_DEVICES,
        global_batch_size=GLOBAL_BATCH_SIZE,
        micro_batch_size=PER_DEVICE_BATCH_SIZE,
        finetune_lr=LEARNING_RATE,
        min_lr=MIN_LR,
        lr_warmup_iters=warmup_steps,
        train_iters=total_steps,
    )
    
    # Configure custom dataset
    config.dataset = FinetuningDatasetConfig(
        dataset_root=DATASET_DIR,
        seq_length=MAX_SEQ_LEN,
        seed=1234,
        num_workers=8,
        pin_memory=True,
        do_validation=False,  # Set to True if you have validation.jsonl
        do_test=False,
        dataset_kwargs={
            "label_key": "output",
            "answer_only_loss": True,
            "prompt_template": "{input} {output}",
            "truncation_field": "input",
        },
    )
    print(f"\n📋 Dataset Configuration (verify these match your data):")
    print(f"   label_key: 'output' (must match the target field in training.jsonl)")
    print(f"   answer_only_loss: True (loss computed only on output tokens, not input)")
    print(f"   prompt_template: '{{input}} {{output}}' (how fields are combined)")
    print(f"   truncation_field: 'input' (input gets truncated if sequence too long)")
    print()


    # Configure model and training
    config.model.seq_length = MAX_SEQ_LEN
    config.model.calculate_per_token_loss = True
    
    # Checkpoint configuration
    config.checkpoint.pretrained_checkpoint = BASE_MODEL_PATH
    config.checkpoint.save_interval = save_interval
    config.checkpoint.checkpoints_path = CHECKPOINT_DIR
    
    # Optimizer settings
    config.optimizer.clip_grad = CLIP_GRAD
    config.optimizer.weight_decay = WEIGHT_DECAY
    
    # LoRA configuration
    config.peft.lora_rank = LORA_RANK
    config.peft.lora_alpha = LORA_ALPHA
    config.peft.lora_dropout = LORA_DROPOUT
    
    # Logging
    config.logger.log_interval = 1
    config.logger.tensorboard_dir = os.path.join(CHECKPOINT_DIR, "tensorboard")
    
    # MoE dispatcher settings (for portability)
    config.model.moe_token_dispatcher_type = "alltoall"
    config.model.moe_enable_deepep = False
    
    print("🚀 Starting fine-tuning...\n")
    
    # Start training
    finetune(config=config, forward_step_func=forward_step)
    
    print("\n✅ Training completed successfully!")
    print(f"📁 Checkpoints saved to: {CHECKPOINT_DIR}")
    
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
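
The script derives its schedule from the dataset size, so it can be useful to preview the numbers before launching. The sketch below repeats the same arithmetic as `custom_finetune.py` in a standalone function (`training_schedule` is a hypothetical name). The 9,000-example dataset size is invented for illustration.

```python
import math


def training_schedule(n_examples, global_batch_size, epochs, warmup_ratio, n_checkpoints=5):
    """Mirror the step arithmetic used in custom_finetune.py."""
    steps_per_epoch = math.ceil(n_examples / global_batch_size)
    total_steps = epochs * steps_per_epoch
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": int(warmup_ratio * total_steps),
        "save_interval": max(1, total_steps // n_checkpoints),
    }


# 9,000 examples is a made-up dataset size.
schedule = training_schedule(9_000, global_batch_size=128, epochs=3, warmup_ratio=0.03)
print(schedule)
```

Note that the last batch of each epoch is rounded up, so a small dataset with a large global batch size can yield very few steps and a warmup of zero steps.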

### Launch Custom Training

Now run your custom training script with `torchrun`.

**Training Parameters:**
- `N_DEVICES`: Number of GPUs (must match `--nproc-per-node`)
- `GLOBAL_BS`: Global batch size across all GPUs
- `PER_DEVICE_BS`: Batch size per GPU
- `EPOCHS`: Number of training epochs
- `LR`: Learning rate (default: 5e-5)
- `WARMUP_RATIO`: Fraction of steps for learning rate warmup (default: 0.03)
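
The global and per-device batch sizes are linked through gradient accumulation: the global batch must be a whole multiple of the per-device batch times the number of data-parallel replicas. The sketch below computes the number of accumulation steps under the simplifying assumption that every GPU is one data-parallel replica (`data_parallel_size == N_DEVICES`); other parallelism settings change that number. `grad_accumulation_steps` is a hypothetical helper.

```python
def grad_accumulation_steps(global_bs: int, per_device_bs: int, data_parallel_size: int) -> int:
    """Micro-batches each replica processes before one optimizer step."""
    samples_per_pass = per_device_bs * data_parallel_size
    if global_bs % samples_per_pass != 0:
        raise ValueError(
            f"global batch size {global_bs} is not divisible by "
            f"per-device batch size x replicas = {samples_per_pass}"
        )
    return global_bs // samples_per_pass


# Defaults from this guide: GLOBAL_BS=128, PER_DEVICE_BS=1, 8 GPUs.
steps = grad_accumulation_steps(128, 1, 8)
print("gradient accumulation steps:", steps)
```

If you change the GPU count, pick a `GLOBAL_BS` that still divides evenly, for example 128 with 4 GPUs gives 32 accumulation steps.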

**Run the steps below from inside the container. Your data should already be at `/workspace/dataset`**

In [None]:
%%bash
cd /opt/Megatron-Bridge

# Set environment variables (optional - script has defaults)
export BASE_MODEL_PATH=$MEGATRON_MODEL_PATH
export DATASET_DIR=/workspace/dataset
export N_DEVICES=8
export GLOBAL_BS=128
export EPOCHS=3
export LR=5e-5
export LORA_RANK=16

# Launch training
torchrun --nproc-per-node=8 /workspace/custom_finetune.py

---

## Step 6: Check Training Outputs

Once training is finished, check that the checkpoints have been saved successfully.

In [None]:
%%bash
ls /opt/Megatron-Bridge/nemo_experiments/default/checkpoints

**Expected output:**
```
iter_0000050
latest_checkpointed_iteration.txt
latest_train_state.pt
```
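If you used custom data (Step 5a), look under the script's `CHECKPOINT_DIR` instead (default: `/opt/Megatron-Bridge/nemo_experiments/custom_run`). You will see more `iter_*` folders, and `latest_checkpointed_iteration.txt` tells you which one is the most recent. \
Note the name of the checkpoint folder for the next step.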
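
Finding the latest checkpoint folder can also be scripted. This sketch assumes that `latest_checkpointed_iteration.txt` contains just the iteration number as plain text, and that folder names are zero-padded to seven digits as in `iter_0000050` above. `latest_checkpoint_dir` is a hypothetical helper, and the demo builds a throwaway directory rather than reading a real run.

```python
import os
import tempfile


def latest_checkpoint_dir(checkpoints_dir: str) -> str:
    """Resolve the folder named by latest_checkpointed_iteration.txt."""
    tracker = os.path.join(checkpoints_dir, "latest_checkpointed_iteration.txt")
    with open(tracker, "r", encoding="utf-8") as f:
        iteration = int(f.read().strip())
    # Folder names are zero-padded to seven digits, e.g. iter_0000050.
    path = os.path.join(checkpoints_dir, f"iter_{iteration:07d}")
    if not os.path.isdir(path):
        raise FileNotFoundError(f"expected checkpoint folder is missing: {path}")
    return path


# Demo layout that imitates the expected output shown above.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "iter_0000050"))
    with open(os.path.join(tmp, "latest_checkpointed_iteration.txt"), "w", encoding="utf-8") as f:
        f.write("50\n")
    latest_name = os.path.basename(latest_checkpoint_dir(tmp))

print(latest_name)
```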

## Step 7: Merge LoRA Weights

To obtain a standalone fine-tuned model, merge the LoRA adapters into the base model.

In this step:
- The base model weights are combined with the LoRA adapter weights
- The result is a single, merged checkpoint

Again, change `--nproc_per_node=8` to match your GPU count.

If you trained on your own data, point `--lora-checkpoint` at the checkpoint folder you want to merge, typically the latest one.

You do not need to change the `--output` directory, but if you do, use the same path in the remaining steps.
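
Conceptually, merging folds the low-rank update into the base weight: `W_merged = W + (alpha / rank) * (B @ A)`. The toy example below illustrates that formula with plain Python lists and made-up numbers. It is a sketch of the standard LoRA merge, not the implementation used by `merge_lora.py`, and `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Tiny made-up example: a 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank x d_in
B = [[0.5], [-0.5]]   # d_out x rank
merged = merge_lora(W, A, B, alpha=2, rank=1)

# The merged weight gives the same output as base output plus scaled adapter output.
x = [[1.0], [1.0]]
adapter_out = matmul(B, matmul(A, x))
separate = [[base[0] + 2.0 * extra[0]] for base, extra in zip(matmul(W, x), adapter_out)]
print(merged, matmul(merged, x), separate)
```

Because the merged weight reproduces the adapter's effect exactly, the merged checkpoint needs no LoRA support at inference time.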

In [None]:
%%bash
cd /opt/Megatron-Bridge

torchrun --standalone --nproc_per_node=8 examples/peft/merge_lora.py \
  --hf-model-path $HF_MODEL_ID \
  --lora-checkpoint /opt/Megatron-Bridge/nemo_experiments/default/checkpoints/iter_0000050 \
  --output /workspace/models/merged_0050

## Step 8: Export to HuggingFace Format

Next, convert the merged Megatron checkpoint back to HuggingFace format for easy deployment and inference.

This creates a standard HuggingFace model that can be loaded with the `transformers` library.

In [None]:
%%bash
cd /opt/Megatron-Bridge

python examples/conversion/convert_checkpoints.py export \
  --hf-model $HF_MODEL_ID \
  --megatron-path /workspace/models/merged_0050 \
  --hf-path /workspace/models/merged_0050-hf

**Expected output:**
```
Converting to HuggingFace ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% (6231/6231)

Success: All tensors from the original checkpoint were written.
✅ Successfully exported model to: /workspace/models/merged_0050-hf
```

## Step 9: Deploy the Fine-tuned Model with Docker Compose

Now that you have your fine-tuned model, you can deploy it for inference via NVIDIA NIM (NVIDIA Inference Microservices) or vLLM. 

**You can exit the NeMo container that you ran commands in.**

### Deployment Options

The docker-compose configuration below provides two deployment options:

1. **NVIDIA NIM** 
   - Uses your merged weights
   - Supports tensor parallelism
   - Optimized for production inference

2. **vLLM - Open-source inference**
   - Fast and memory-efficient
   - Supports LoRA adapters
   - PagedAttention for throughput

### Prerequisites

Before deploying, ensure:
- Your fine-tuned model is accessible on the host machine
- You have an NGC API key (required for the NIM service)
- Docker Compose is installed
- NVIDIA Docker runtime is configured

The prerequisites listed above pertain to your local environment. 
The following commands should be executed from this notebook, outside the NeMo container, directly on your host machine.

### Create docker-compose.yml

Specify the HOST_MODELS_DIR where your models are stored on your host machine.

In [None]:
# Base directory where models are stored on your host machine. 
# merged_0050, merged_0050-hf, and NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-Mbridge should be in this directory.
# CHANGE THIS TO YOUR PATH
HOST_MODELS_DIR = "YOUR_MODEL_DIRECTORY_ON_YOUR_HOST_MACHINE"

# Sanity check: Your output should contain these 3 folders:
# - merged_0050
# - merged_0050-hf
# - NVIDIA-Nemotron-3-Nano-30B-A3B-BF16-Mbridge
!ls -la {HOST_MODELS_DIR}


Create a `docker-compose.yml` file with the following configuration:

In [None]:
# Write docker-compose.yml with populated variables
docker_compose_content = f"""version: '3.8'
services:
  # NVIDIA NIM - Deploy your fine-tuned Nemotron-3-Nano model
  nim-nano3:
    image: nvcr.io/nim/nvidia/nemotron-3-nano:1
    container_name: customized-nim-nano3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0','1','2','3','4','5','6','7']  # Use all 8 GPUs
              capabilities: [gpu]
    shm_size: 128GB  # Large shared memory for multi-GPU inference
    environment:
      # Point to your fine-tuned model
      - NGC_API_KEY={NGC_API_KEY}  # Required for NIM
      - NIM_MODEL_NAME=/.cache/merged_0050-hf/  # Path inside the container (models directory is mounted at /.cache)
      - NIM_SERVED_MODEL_NAME=nemotron-nano-3  # Name for API requests
      - NIM_TENSOR_PARALLEL_SIZE=8  # Split model across 8 GPUs
      - OMPI_ALLOW_RUN_AS_ROOT=1
      - OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
    volumes:
      - "{HOST_MODELS_DIR}:/.cache"  # Mount models directory
    ports:
      - "8007:8000"  # API endpoint at http://localhost:8007
    user: root

  # vLLM - Open-source high-throughput inference
  vllm:
    image: vllm/vllm-openai:v0.13.0
    container_name: vllm-nano8b
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']  # Use GPU 0 only (matches --tensor-parallel-size 1)
              capabilities: [gpu]
    shm_size: 128GB
    environment:
      - OMPI_ALLOW_RUN_AS_ROOT=1
      - OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
    volumes:
      - "{HOST_MODELS_DIR}:/root/.cache"
    ports:
      - "8006:8000"  # API endpoint at http://localhost:8006
    user: root
    command: [
      "--trust-remote-code",
      "--served-model-name", "nemotron-nano",
      "--tensor-parallel-size", "1",
      "--model", "/root/.cache/merged_0050-hf" 
    ]
"""

# Write the file
with open('docker-compose.yml', 'w') as f:
    f.write(docker_compose_content)

print("✅ docker-compose.yml written successfully!")
print(f"   Using HOST_MODELS_DIR: {HOST_MODELS_DIR}")

After you run this cell, `docker-compose.yml` is created in the current directory.

### Start the Inference Service

Choose your deployment backend by commenting or uncommenting the relevant lines below: start either the NIM or the vLLM service.\
Reminder: NVIDIA NIM requires a license. See the documentation on how to obtain a free license [here](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#nim-container-access). 

In [None]:
%%bash
# Start only the NIM service with your fine-tuned model
docker compose up -d nim-nano3

# To start the vLLM service, uncomment the following line
# docker compose up -d vllm

# Check service status
docker compose ps
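
Loading a model of this size can take a while, so the container may show as running before the API answers. The sketch below polls an endpoint until it returns HTTP 200. It uses only the standard library; `wait_until_ready` is a hypothetical helper, and the `/v1/models` route is an assumption based on what OpenAI-compatible servers usually expose. Port 8007 is the NIM mapping from `docker-compose.yml` (use 8006 for vLLM).

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Not accepting connections or not ready yet; try again.
        time.sleep(interval_s)
    return False


# Short timeout for demonstration; raise it (e.g. 600 seconds) for a real deployment.
ready = wait_until_ready("http://localhost:8007/v1/models", timeout_s=2, interval_s=1)
print("service ready:", ready)
```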

### Deployment Configuration Tips

**GPU Allocation:**
- Adjust `device_ids` based on available GPUs
- Use `nvidia-smi` to check GPU availability
- Avoid overlapping GPU assignments between services

**Tensor Parallelism:**
- `NIM_TENSOR_PARALLEL_SIZE=8` splits the model across 8 GPUs
- Adjust based on model size and GPU memory
- Higher TP = more GPUs, better for large models

**Memory Settings:**
- `shm_size: 128GB` provides shared memory for IPC
- Increase if you encounter "Bus error" or shared memory issues
- Must be large enough for model weights and KV cache

**LoRA Configuration:**
- `NIM_MAX_LORA_RANK=64` sets maximum adapter rank
- `NIM_PEFT_REFRESH_INTERVAL` controls adapter reload frequency
- Place adapters in `NIM_PEFT_SOURCE` directory

**Performance Tuning:**
- Monitor with `docker stats` or `nvidia-smi`
- Adjust batch sizes via environment variables
- Enable `NIM_ENABLE_KV_CACHE_REUSE` for better throughput

### Test the Deployed Model

Once the service is running, you can test it using the OpenAI-compatible API:

In [None]:
! pip install requests

Send a request to **NVIDIA NIM - Nemotron-3-Nano** (Custom fine-tuned model) which is running on port 8007.

In [None]:
import requests
import json

# API endpoint
url = "http://localhost:8007/v1/chat/completions"

# Request payload
payload = {
    "model": "nemotron-nano-3",
    "messages": [
        {"role": "user", "content": "Write a SQL query to list the top 5 customers by total spend. Tables: customers(id,name), orders(id,customer_id,total_amount)"}
    ],
    "temperature": 0.7,
    "max_tokens": 2000
}

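# Make the request and fail early on HTTP errors (e.g. while the model is still loading)
response = requests.post(url, json=payload, timeout=300)
response.raise_for_status()
result = response.json()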

# Print response
print("Model Response:")
print(result["choices"][0]["message"]["content"])

### Alternative: Test with cURL

You can also test using curl from the command line to the **NVIDIA NIM - Nemotron-3-Nano** customized model.

In [None]:
%%bash
curl -X POST http://localhost:8007/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-nano-3",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

If you want to test the vLLM service, uncomment the following cell and run it.

In [None]:
# import requests
# import json

# # API endpoint
# url = "http://localhost:8006/v1/chat/completions"

# # Request payload
# payload = {
#     "model": "nemotron-nano",
#     "messages": [
#         {"role": "user", "content": "Hello! How can you help me today?"}
#     ],
#     "temperature": 0.7,
#     "max_tokens": 256
# }

# # Make request
# response = requests.post(url, json=payload)
# result = response.json()

# # Print response
# print("Model Response:")
# print(result["choices"][0]["message"]["content"])

### Monitor and Manage Services

In [None]:
%%bash
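# View recent logs (in a terminal, add -f to follow them live; in a notebook cell -f blocks until interrupted)
docker compose logs --tail 100 nim-nano3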

# Stop services
# docker compose down

# Restart a service
# docker compose restart nim-nano3

---

## Export Structure

The exported HuggingFace model contains:

```
📁 /workspace/models/merged_0050-hf/
   📄 config.json                         # Model configuration
   📄 generation_config.json             # Generation parameters
   📄 tokenizer.json                     # Tokenizer vocabulary
   📄 tokenizer_config.json              # Tokenizer configuration
   📄 special_tokens_map.json            # Special tokens mapping
   📄 chat_template.jinja                # Chat template
   📄 model.safetensors.index.json       # Model sharding index
   📄 model-00001-of-00013.safetensors   # Model weights (sharded)
   📄 model-00002-of-00013.safetensors
   ... (13 shard files total)
   📄 modeling_nemotron_h.py             # Custom model code
   📄 configuration_nemotron_h.py        # Custom config code
```
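
A quick way to confirm the export is complete is to compare the shard files on disk against the sharding index. In a sharded safetensors checkpoint, `model.safetensors.index.json` holds a `weight_map` from tensor name to shard filename. The sketch below (with a hypothetical `missing_shards` helper) lists any shard the index names that is absent, and demonstrates on a throwaway directory with invented tensor names.

```python
import json
import os
import tempfile


def missing_shards(model_dir: str) -> list[str]:
    """List shard files named in the index that are absent from `model_dir`."""
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path, "r", encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]  # tensor name -> shard filename
    shards = sorted(set(weight_map.values()))
    return [name for name in shards if not os.path.isfile(os.path.join(model_dir, name))]


# Demo: an index that names two shards, with only the first one present.
with tempfile.TemporaryDirectory() as tmp:
    index = {
        "weight_map": {
            "layer_a.weight": "model-00001-of-00002.safetensors",
            "layer_b.weight": "model-00002-of-00002.safetensors",
        }
    }
    with open(os.path.join(tmp, "model.safetensors.index.json"), "w", encoding="utf-8") as f:
        json.dump(index, f)
    open(os.path.join(tmp, "model-00001-of-00002.safetensors"), "wb").close()
    missing = missing_shards(tmp)

print("missing shards:", missing)
```

For the real export, call `missing_shards("/workspace/models/merged_0050-hf")`; an empty list means every shard the index refers to is present.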

---

## Additional Resources

- **Model Collection:** [NVIDIA Nemotron V3 on HuggingFace](https://huggingface.co/collections/nvidia/nvidia-nemotron-v3)
- **Base Model:** [NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16)
- **NeMo Framework:** [NVIDIA NeMo Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/)

---

## Tips and Best Practices

### Training Configuration
- **Batch Size:** Adjust `train.global_batch_size` based on GPU memory
- **Iterations:** Increase `train.train_iters` for better convergence
- **Learning Rate:** Tune via `optimizer.lr` and `scheduler.lr_warmup_iters`

### LoRA Parameters
- Default LoRA rank is typically 8-16 (configured in recipe)
- Lower rank = fewer trainable parameters, faster training
- Higher rank = more expressivity, potentially better results

### GPU Requirements
- This example uses 8 GPUs
- For fewer GPUs: adjust `--nproc-per-node` and reduce batch size
- Monitor GPU memory with `nvidia-smi`

### Checkpoint Management
- Checkpoints are saved to `/opt/Megatron-Bridge/nemo_experiments/`
- Use `checkpoint.save_interval` to control checkpoint frequency
- Keep at least 2-3 checkpoints for rollback

---

## Troubleshooting

### Out of Memory (OOM)
- Reduce `train.global_batch_size`
- Enable gradient checkpointing
- Use fewer GPUs with tensor parallelism

### Slow Training
- Check GPU utilization with `nvidia-smi`
- Verify data loading isn't a bottleneck
- Ensure `--ipc=host` flag is used

### Conversion Errors
- Verify HuggingFace model ID is correct
- Check disk space for checkpoint storage
- Ensure `--trust-remote-code` is set for custom models

---

## Conclusion

You now have a complete end-to-end workflow for fine-tuning and deploying NVIDIA Nemotron-3-Nano models!

### What You've Accomplished:

- ✅ **Fine-tuning:** Trained a custom model using LoRA for efficient adaptation
- ✅ **Model Export:** Converted to HuggingFace format at `/workspace/models/merged_0050-hf`
- ✅ **Deployment:** Set up production-ready inference with NVIDIA NIM or vLLM

### Next Steps:

Your fine-tuned model can now be:

- **Deployed:** Already configured with docker-compose for immediate use
- **Integrated:** OpenAI-compatible API for easy integration
- **Shared:** Upload to HuggingFace Hub for team collaboration
- **Improved:** Further fine-tune with additional domain-specific data
- **Scaled:** Deploy across multiple nodes for high-throughput serving

### API Endpoints:

Once deployed, your services are available at:
- **NIM Nemotron-3-Nano:** `http://localhost:8007/v1/chat/completions`
- **vLLM:** `http://localhost:8006/v1/chat/completions`

Happy training and deploying! 🚀