<div class="alert alert-block alert-warning">
<b>WARNING:</b> This notebook requires at least 512 GB of SHM to run. While launching the dev pod on DGX Cloud Lepton, enter <code>512</code> in the <b>Advanced Configuration > Shared Memory</b> field.
</div>

# SFT with NeMo Framework
This notebook demonstrates how to use NeMo Framework to run supervised fine-tuning (SFT) to add reasoning capabilities to an LLM. The example first pre-processes the [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) dataset, which contains millions of sample math problems with chain-of-thought (CoT) sequences. It then downloads the Qwen2.5-Math-1.5B base model and fine-tunes it on the prepared dataset to teach it to follow the chain-of-thought pattern, resulting in more accurate responses.

### Requirements
The notebook requires the following to run successfully:
* 8x GPUs with at least 80 GB of GPU memory each
* 512 GB of SHM attached to the container
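
The shared-memory requirement can be checked before running the rest of the notebook. The following is a minimal sketch, assuming shared memory is mounted at `/dev/shm` (the usual Linux location) and that the 512 GB figure is measured in GiB; the helper name `has_enough_shm` is hypothetical and not part of NeMo Framework.

```python
import os
import shutil

REQUIRED_SHM_GB = 512  # Requirement listed above


def has_enough_shm(total_bytes: int, required_gb: int = REQUIRED_SHM_GB) -> bool:
    """Return True if total_bytes covers required_gb (assumption: 1 GB = 2**30 bytes)."""
    return total_bytes >= required_gb * 1024**3


shm_path = "/dev/shm"  # Assumed mount point for shared memory
if os.path.exists(shm_path):
    total = shutil.disk_usage(shm_path).total
    print(f"{shm_path}: {total / 1024**3:.0f} GiB, sufficient: {has_enough_shm(total)}")
else:
    print(f"{shm_path} not found; skipping the shared-memory check")
```

If the check reports an insufficient size, relaunch the dev pod with a larger value in the Shared Memory field.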

### Fine-Tuning Job Settings
Let's specify the high-level settings that will be used for fine-tuning the model. The values can be changed if needed, but the notebook is designed to run with the default values shown below.

In [None]:
BASE_MODEL = "Qwen/Qwen2.5-Math-1.5B"  # Base model we will fine-tune
LOG_DIR = "/workspace/qwen-sft"  # Directory to store the logs and checkpoints
JOB_NAME = "qwen-open-math-reasoning"  # Name of the job
DATASET_NAME = "nvidia/OpenMathReasoning"  # Math reasoning dataset to fine-tune against
DATA_DIRECTORY = "/workspace/data/openmath_reasoning"  # Directory to store the dataset
DATASET_SPLIT = "cot"  # Which dataset split to use
GPUS_PER_NODE = 8  # Number of GPUs per node to use for fine-tuning
TENSOR_PARALLELISM = 2  # Tensor parallelism level to split the model across GPUs
CONTEXT_PARALLELISM = 1  # Amount to split the sequence context across GPUs
SEQ_LENGTH = 32768  # Maximum number of input+output tokens per sequence
GLOBAL_BATCH_SIZE = 4  # Total number of sequences to process in each step
MICRO_BATCH_SIZE = 1  # Number of sequences each GPU processes per forward/backward pass
MAX_STEPS = 8000  # Maximum number of steps to fine-tune for
LR = 3e-4  # Maximum learning rate during fine-tuning

### Weights and Biases Integration
NeMo Framework supports Weights and Biases (W&B) for tracking model training metrics. To track your training with W&B, add your API key to the `WANDB_API_KEY` variable below and specify the project and job names. Leave the `WANDB_API_KEY` variable blank to skip W&B tracking.

In [None]:
WANDB_API_KEY = ""  # Optionally add your W&B API key here if desired. If kept blank, W&B will not be used.
WANDB_PROJECT = "nemo-sft"  # Project to save the job under in W&B
WANDB_JOB_NAME = "openmath-reasoning"  # Name of the specific job in W&B
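
To avoid saving an API key inside the notebook, the key can instead be read from the environment. This is a minimal sketch, assuming the key was exported as the `WANDB_API_KEY` environment variable before the notebook was started; the helper `resolve_wandb_key` is a hypothetical name used only for illustration.

```python
import os


def resolve_wandb_key(explicit_key: str = "", env=os.environ) -> str:
    """Prefer an explicitly set key; otherwise fall back to the WANDB_API_KEY environment variable."""
    return explicit_key or env.get("WANDB_API_KEY", "")


# An empty result keeps W&B tracking disabled, matching the behavior described above.
WANDB_API_KEY = resolve_wandb_key("")
print("W&B tracking enabled" if WANDB_API_KEY else "W&B tracking disabled")
```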

### Download Base Model
First, we need to download the base model from Hugging Face which we will be fine-tuning. This step downloads the Qwen2.5-Math-1.5B model from Hugging Face and converts it to the NeMo format which allows it to be fine-tuned directly with NeMo Framework.

This cell uses [NeMo-Run](https://github.com/nvidia-nemo/run) to launch a job inside the container in a background process. It will take a couple of minutes to download and convert the model depending on your network connection. The cell will output the following when successful:

```
Converted Qwen model to Nemo, model saved to /root/.cache/nemo/models/Qwen/Qwen2.5-Math-1.5B
```

To use a different model as a base, change the `BASE_MODEL` setting above and replace `llm.qwen25_1p5b.model()` in the `model=` argument with the corresponding model from NeMo Framework. See [the recipes directory](https://github.com/NVIDIA-NeMo/NeMo/tree/main/nemo/collections/llm/recipes) for the list of models supported by NeMo Framework and their corresponding names.

In [None]:
import nemo_run as run
from nemo.collections import llm

def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.qwen25_1p5b.model(),
        source=f"hf://{BASE_MODEL}",
        overwrite=True,
    )

import_ckpt = configure_checkpoint_conversion()
local_executor = run.LocalExecutor()

run.run(import_ckpt, executor=local_executor)
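
After the conversion finishes, it can be useful to confirm that the converted model is on disk. The sketch below assumes the cache root `/root/.cache/nemo/models` taken from the success message shown above; the root may differ if NeMo is configured with another cache location, and `converted_model_dir` is a hypothetical helper.

```python
from pathlib import Path

BASE_MODEL = "Qwen/Qwen2.5-Math-1.5B"  # Same value as in the settings cell


def converted_model_dir(base_model: str, cache_root: str = "/root/.cache/nemo/models") -> Path:
    """Build the expected location of the converted model (cache_root is an assumption)."""
    return Path(cache_root) / base_model


model_dir = converted_model_dir(BASE_MODEL)
print(f"{model_dir} exists: {model_dir.is_dir()}")
```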

### Dataset Preparation
We need to download and prepare a dataset to feed the model during fine-tuning. NeMo Framework expects a training and validation file in JSONL format. The default expected keys are `input` for the input prompts, and `output` for the generated responses.

The following cell first loads the nvidia/OpenMathReasoning dataset from Hugging Face, iterates through the `cot` dataset split, rewrites the `problem` column to `input`, and the `generated_solution` column to `output`, and saves it to JSONL files in the container.

To use a different dataset, replace the `DATASET_NAME` field with the repo ID of a Hugging Face dataset, and modify the `input` and `output` columns as necessary to match what's used in your dataset.

By default, the dataset will be saved to the `DATA_DIRECTORY` of `/workspace/data/openmath_reasoning`.

This step takes approximately 15 minutes to complete depending on your network speed.

In [None]:
import os
import json
from datasets import load_dataset

def prepare_dataset():
    train_file_path = os.path.join(DATA_DIRECTORY, "training.jsonl")
    val_file_path = os.path.join(DATA_DIRECTORY, "validation.jsonl")

    # Check if the dataset has already been prepared
    if os.path.exists(train_file_path) and os.path.exists(val_file_path):
        print("Dataset already prepared. Skipping...")
        return

    # Load the dataset only when the files still need to be written
    dataset = load_dataset(DATASET_NAME)

    with open(train_file_path, "w") as train_file, open(val_file_path, "w") as val_file:
        for train_counter, line in enumerate(dataset[DATASET_SPLIT]):
            # Rename the columns to match the expected keys
            desired_data = {
                "input": line["problem"],
                "output": line["generated_solution"],
            }

            # Include 100 validation examples
            if train_counter < 100:
                json.dump(desired_data, val_file)
                val_file.write("\n")
            # Use the following rows, up to row 1,000,000, as training examples (999,900 in total)
            elif train_counter < 1000000:
                json.dump(desired_data, train_file)
                train_file.write("\n")
            else:
                break

os.makedirs(DATA_DIRECTORY, exist_ok=True)
prepare_dataset()

### Validate Dataset
After preparing the dataset, check the first line of each file to see the output format and ensure it matches the expected format. The output should look similar to the following (truncated for readability).

training.jsonl:
```
{"input": "A balance scale sits on a teacher's table, currently tipped to the right. <truncated>", "output": "<think>\nOkay, let's try to tackle this problem. Hmm, so we have a balance scale that's tipped to the right initially. <truncated> **Final Answer:**\n   - The sum of \\( f(S) \\) for all possible non-empty subsets \\( S \\) of pupils is:\n     \\[\n     \\boxed{2^{n-1} (R - L)}\n     \\]\n\nThis is the final answer, where \\( R \\) is the total weight on the right side initially, and \\( L \\) is the total weight on the left side initially."}
```

validation.jsonl:
```
{"input": "Given a group of \\( N \\) balls consisting of \\( C \\) colors, where <truncated>", "output": "<think>\nOkay, so I need to find the probability that when I pick A balls out of N, where there are C different colors, the number of each color I pick is exactly a1, a2, ..., aC. Hmm, let's think about how to approach this.\n\nFirst, probability problems often involve combinations. The general formula for probability is the number of favorable outcomes divided by the total number of possible outcomes. <truncated> Final Solution:\nThe probability that when \\( A \\) balls are randomly picked from \\( N \\) balls, the picked balls consist of \\( a_1, a_2, \\ldots, a_C \\) balls of each color is given by:\n\\[\n\\boxed{\\frac{\\prod_{i=1}^{C} \\binom{n_i}{a_i}}{\\binom{N}{A}}}\n\\]\n\nThis solution is derived from the multivariate hypergeometric distribution, where the combinations \\( \\binom{n_i}{a_i} \\) account for the ways to choose \\( a_i \\) balls from \\( n_i \\) balls of color \\( i \\), and the denominator \\( \\binom{N}{A} \\) accounts for the total ways to choose \\( A \\) balls from \\( N \\) balls."}
```

In [None]:
!head -n 1 {DATA_DIRECTORY}/*.jsonl
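
Beyond inspecting the first line by eye, the expected keys can be checked programmatically. This is a minimal sketch that reads the first few records of a JSONL file and confirms each has string `input` and `output` fields; `check_jsonl` is a hypothetical helper, not a NeMo Framework function.

```python
import json


def check_jsonl(path: str, max_lines: int = 100) -> int:
    """Validate up to max_lines records and return how many were checked."""
    checked = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if line_no > max_lines:
                break
            record = json.loads(line)
            # Each record needs both keys, and both values must be strings
            for key in ("input", "output"):
                if not isinstance(record.get(key), str):
                    raise ValueError(f"{path}:{line_no}: missing or non-string '{key}'")
            checked += 1
    return checked
```

For example, `check_jsonl("/workspace/data/openmath_reasoning/validation.jsonl")` should return `100` for the validation file written by the preparation cell.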

### Configure Data Module
Now that the dataset has been prepared, we need to tell NeMo Framework how to load it and configure some hyperparameters including the tokenizer, maximum sequence length, and global and micro batch sizes. Let's look a bit deeper at some of these settings.

* seq_length: The maximum number of input+output tokens the model processes per sequence during fine-tuning. Any dataset line whose combined input and output exceed `seq_length` tokens will be truncated. Longer sequence lengths allow for longer inputs and responses, but require additional memory during fine-tuning and deployment.
* global_batch_size: The number of sequences processed across all GPUs during every step. In general, higher batch sizes allow greater throughput at the cost of potentially running out of GPU memory. If you experience CUDA OOM errors, try lowering the `GLOBAL_BATCH_SIZE`. It needs to be a power-of-2 value (e.g., 2, 4, 8, ...).
* micro_batch_size: The number of sequences each GPU processes in a single forward and backward pass. In general, higher batch sizes allow greater throughput at the cost of potentially running out of GPU memory. If you experience CUDA OOM errors, try lowering the `MICRO_BATCH_SIZE`. It needs to be less than or equal to `GLOBAL_BATCH_SIZE`.
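
The relationship between the two batch sizes and the parallelism settings can be sketched with a little arithmetic. The helper below is hypothetical and assumes the usual Megatron-style layout: GPUs not used for tensor, context, or pipeline parallelism form data-parallel replicas, and the global batch is split into micro-batches across those replicas. Pipeline parallelism is not set in this notebook, so it is assumed to be 1.

```python
def batch_layout(num_gpus: int, tensor_parallel: int, context_parallel: int,
                 global_batch_size: int, micro_batch_size: int,
                 pipeline_parallel: int = 1) -> tuple:
    """Return (data_parallel_size, micro_batches_per_replica_per_step)."""
    model_parallel = tensor_parallel * context_parallel * pipeline_parallel
    if num_gpus % model_parallel != 0:
        raise ValueError("GPU count must be divisible by the product of the parallelism sizes")
    data_parallel = num_gpus // model_parallel

    # Sequences processed by all replicas in one forward/backward pass
    sequences_per_pass = micro_batch_size * data_parallel
    if global_batch_size % sequences_per_pass != 0:
        raise ValueError("global batch size must be divisible by micro batch size x data-parallel size")
    return data_parallel, global_batch_size // sequences_per_pass


# Default notebook settings: 8 GPUs, TP=2, CP=1, global batch 4, micro batch 1
dp, accumulation = batch_layout(8, 2, 1, 4, 1)
print(f"data-parallel replicas: {dp}, micro-batches per replica per step: {accumulation}")
```

With the defaults, the 8 GPUs form 4 data-parallel replicas, and each replica processes one micro-batch per step. Raising a parallelism size without adjusting the batch sizes can make the combination invalid, which this check surfaces early.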

In [None]:
from nemo.collections.common.tokenizers import AutoTokenizer
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule

def data_module() -> run.Config:
    tokenizer = run.Config(AutoTokenizer, pretrained_model_name=BASE_MODEL)

    return run.Config(
        FineTuningDataModule,
        dataset_root=DATA_DIRECTORY,
        seq_length=SEQ_LENGTH,
        global_batch_size=GLOBAL_BATCH_SIZE,
        micro_batch_size=MICRO_BATCH_SIZE,
        tokenizer=tokenizer,
    )

### Configure Checkpoint Resumption
We need to specify the model checkpoint to fine-tune. This indicates the base model that should be updated based on the fine-tuning dataset. This cell tells NeMo Framework to load the model that was downloaded from Hugging Face and converted to the NeMo format earlier in the notebook.

In [None]:
from nemo import lightning as nl

def checkpoint_resumption() -> run.Config:
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(nl.RestoreConfig,
            path=f"nemo://{BASE_MODEL}"
        ),
        resume_if_exists=True,
    )

### Configure Logger
Next, the logger needs to be configured to indicate where checkpoints and logs should be saved. All checkpoints will be saved in the `{LOG_DIR}/{JOB_NAME}/checkpoints` directory, such as `/workspace/qwen-sft/qwen-open-math-reasoning/checkpoints`.

If a W&B key was added at the beginning of the notebook, the W&B logger will also be added and metrics will automatically be uploaded to W&B servers during fine-tuning.

In [None]:
from nemo.collections.llm.recipes.log.default import default_log, wandb_logger

def logger(log_dir: str, name: str):
    if WANDB_API_KEY != "":
        wandb = wandb_logger(
            project=WANDB_PROJECT,
            name=WANDB_JOB_NAME,
        )
        return default_log(dir=log_dir, name=name, wandb_logger=wandb)
    else:
        return default_log(dir=log_dir, name=name)

### Configure Trainer
We now specify the high-level settings for the trainer, such as the number of nodes and GPUs to fine-tune with. We also specify the `CONTEXT_PARALLELISM` and `TENSOR_PARALLELISM` sizes, which spread the sequence context and model weights across GPUs to reduce memory usage. In general, if training throws CUDA OOM errors, try doubling one of the parallelism sizes to reduce the memory requirements for each GPU.

In [None]:
from nemo.collections.llm.recipes.qwen2 import qwen2_trainer

def trainer(nodes: int = 1, gpus_per_node: int = 8):
    return qwen2_trainer(
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=MAX_STEPS,
        context_parallelism=CONTEXT_PARALLELISM,
        tensor_parallelism=TENSOR_PARALLELISM,
    )

### Configure Optimizer
We now set up the optimizer for fine-tuning. We will use the standard Adam optimizer with a cosine annealing learning rate schedule. The cell sets the maximum learning rate and the number of warmup steps over which the scheduler ramps up to that rate before decaying. Note that this workflow hasn't been compared against different learning rates, so it is possible different values could yield higher quality responses.

In [None]:
from nemo.collections.llm.recipes.optim.adam import distributed_fused_adam_with_cosine_annealing

def optimizer():
    return distributed_fused_adam_with_cosine_annealing(max_lr=LR, warmup_steps=10)
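
To build intuition for how the learning rate changes over time, the schedule can be approximated in plain Python. This is a sketch, not NeMo's implementation: the linear warmup form `max_lr * (step + 1) / (warmup_steps + 1)` is inferred from the `lr` values in the sample training log in the Launch Training section, and the decay is the generic cosine annealing formula. The `min_lr` default of `0.0` is an assumption and may differ from the value NeMo uses.

```python
import math


def lr_at(step: int, max_lr: float = 3e-4, warmup_steps: int = 10,
          max_steps: int = 8000, min_lr: float = 0.0) -> float:
    """Approximate learning rate at a given step (linear warmup, then cosine decay)."""
    if step < warmup_steps:
        # Linear warmup toward max_lr
        return max_lr * (step + 1) / (warmup_steps + 1)
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 1, 4, 10, 4000, 8000):
    print(f"step {step:>4}: lr = {lr_at(step):.4g}")
```

The first warmup values reproduce the `lr` figures in the sample log (for example, about `2.727e-05` at step 0), which suggests the warmup form is a reasonable approximation.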

### Setup the Fine-Tuning Recipe
Finally, we pull everything together to define the recipe for fine-tuning the model. This uses all of the settings we specified in the previous cells and creates an object that describes the complete setup for the fine-tuning job.

In [None]:
def configure_recipe(nodes: int = 1, gpus_per_node: int = 8, log_dir=None, name="nemo"):
    recipe = run.Partial(
        llm.finetune,
        model=llm.qwen25_1p5b.model(),
        data=data_module(),
        trainer=trainer(nodes, gpus_per_node),
        log=logger(log_dir, name),
        optim=optimizer(),
    )

    recipe.model.config.calculate_per_token_loss = True
    recipe.trainer.strategy.ckpt_load_strictness = False
    recipe.trainer.val_check_interval = 100

    recipe.resume = checkpoint_resumption()

    return recipe

### Launch Training
With the recipe defined, we can now launch the fine-tuning job. Similar to the model conversion cell above, this step uses NeMo-Run to launch the fine-tuning job with NeMo Framework inside the container. By default, this runs with 8 GPUs on a single node.

Depending on the settings used, this step will take several hours to run on 8x H100 GPUs. There will be a lot of output during the process, including some warnings that can be safely ignored. As the fine-tuning process begins, you will see several lines that will look similar to the following:

```
i.finetune/0 [default0]:Training epoch 0, iteration 0/7999 | lr: 2.727e-05 | global_batch_size: 4 | global_step: 0 | reduced_train_loss: 1.561
i.finetune/0 [default0]:Training epoch 0, iteration 1/7999 | lr: 5.455e-05 | global_batch_size: 4 | global_step: 1 | reduced_train_loss: 1.621 | consumed_samples: 8
i.finetune/0 [default0]:Training epoch 0, iteration 2/7999 | lr: 8.182e-05 | global_batch_size: 4 | global_step: 2 | reduced_train_loss: 0.9908 | consumed_samples: 12
i.finetune/0 [default0]:Training epoch 0, iteration 3/7999 | lr: 0.0001091 | global_batch_size: 4 | global_step: 3 | reduced_train_loss: 1.041 | consumed_samples: 16
i.finetune/0 [default0]:Training epoch 0, iteration 4/7999 | lr: 0.0001364 | global_batch_size: 4 | global_step: 4 | reduced_train_loss: 0.7781 | consumed_samples: 20
...
```

This output displays the following information in order:
* epoch: The current epoch in the fine-tuning pass. This will very likely be `0` for the entire process.
* iteration: The current step, followed by the index of the final step (`MAX_STEPS - 1`).
* lr: The current learning rate for the indicated step. This will start off small, quickly jump up, then decay following the cosine annealing learning rate scheduler.
* global_batch_size: A static number indicating the batch size used for training.
* global_step: The current step that has been completed.
* reduced_train_loss: The current `train_loss` value that was calculated after the last step.
* consumed_samples: The number of sequences that have been processed so far. For an upper bound on the number of tokens processed, multiply this value by `SEQ_LENGTH`; the true count is lower because most sequences are shorter than the maximum length.
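
These fields can be extracted from the log for plotting or quick checks. The sketch below parses a line in the format shown above with a regular expression and computes the token upper bound; `parse_log_line` is a hypothetical helper, and the pattern only assumes the field order seen in the sample output.

```python
import re

SEQ_LENGTH = 32768  # Same value as in the settings cell

LOG_PATTERN = re.compile(
    r"iteration (?P<iteration>\d+)/(?P<last>\d+) \| lr: (?P<lr>[\d.e+-]+) \|"
    r".*?reduced_train_loss: (?P<loss>[\d.e+-]+)"
    r"(?: \| consumed_samples: (?P<samples>\d+))?"
)


def parse_log_line(line: str):
    """Return the parsed fields as a dict, or None if the line is not a training log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    samples = int(match["samples"]) if match["samples"] else None
    return {
        "iteration": int(match["iteration"]),
        "last_iteration": int(match["last"]),
        "lr": float(match["lr"]),
        "loss": float(match["loss"]),
        "consumed_samples": samples,
        # Upper bound only: assumes every sequence is SEQ_LENGTH tokens long
        "max_tokens": samples * SEQ_LENGTH if samples is not None else None,
    }


sample = ("i.finetune/0 [default0]:Training epoch 0, iteration 1/7999 | lr: 5.455e-05 | "
          "global_batch_size: 4 | global_step: 1 | reduced_train_loss: 1.621 | consumed_samples: 8")
print(parse_log_line(sample))
```

With the default settings, 8000 steps at a global batch size of 4 consume 32,000 sequences, far fewer than the roughly one million training examples, which is why the epoch is expected to stay at `0`.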

In [None]:
def local_executor_torchrun(devices: int = 8) -> run.LocalExecutor:
    env_vars = {}

    if WANDB_API_KEY != "":
        env_vars["WANDB_API_KEY"] = WANDB_API_KEY

    return run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

if __name__ == '__main__':
    recipe = configure_recipe(
        nodes=1,
        gpus_per_node=GPUS_PER_NODE,
        log_dir=LOG_DIR,
        name=JOB_NAME
    )
    run.run(recipe, executor=local_executor_torchrun(devices=GPUS_PER_NODE))

### Final Checkpoint
Once training finishes, your final checkpoint will be saved in the `{LOG_DIR}/{JOB_NAME}/checkpoints` directory, which resolves to `/workspace/qwen-sft/qwen-open-math-reasoning/checkpoints` with the default settings. The final checkpoint directory name ends with `-last`. The cell below lists the final checkpoint, which can be used for downstream tasks like evaluation and inference.

In [None]:
!find {LOG_DIR} -name "*-last"
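
The same lookup can be done in Python when the checkpoint path is needed as a variable for later cells. This is a minimal sketch using `pathlib`; `find_last_checkpoints` is a hypothetical helper, and it only assumes that final checkpoints are directories whose names end with `-last`, as described above.

```python
from pathlib import Path


def find_last_checkpoints(root: str) -> list:
    """Return all directories under root whose names end with '-last', sorted by path."""
    return sorted(p for p in Path(root).rglob("*-last") if p.is_dir())


LOG_DIR = "/workspace/qwen-sft"  # Same value as in the settings cell
for checkpoint in find_last_checkpoints(LOG_DIR):
    print(checkpoint)
```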

### Next Steps
Congratulations, you have successfully fine-tuned an LLM using NeMo Framework! The fine-tuned model will have additional math reasoning capabilities which allow it to think through challenging math problems and should provide higher accuracy scores in math benchmarks.

From here, you can deploy the final checkpoint as an Endpoint on DGX Cloud Lepton, allowing inference requests to be sent directly to the model. For more information on deploying models as Endpoints, [read here](https://docs.nvidia.com/dgx-cloud/lepton/features/endpoint/create-from-container-image/).

Additionally, you can evaluate the fine-tuned model using various evaluation harnesses including [EleutherAI's lm-evaluation-harness](https://github.com/eleutherAI/lm-evaluation-harness/).