# Supervised Fine-Tuning for Instruction Following

[Gemma](https://ai.google.dev/gemma/docs/model_card) is an open model from Google, built from the same research and technology as the Gemini models. It is compact enough to run locally on NVIDIA RTX GPUs and is available in two sizes: 2B and 7B parameters. With NVIDIA NeMo, you can customize Gemma to fit your use case and deploy an optimized model on your NVIDIA GPU.

In this tutorial, we'll go over a specific kind of customization -- full-parameter supervised fine-tuning for instruction following (also known as SFT). To learn how to perform Low-Rank Adaptation (LoRA) tuning to follow a specific output format, see the [companion notebook](./lora.ipynb). For SFT, we'll show how you can kick off a multi-GPU training job with an example script so that you can train on 8 GPUs. The exact number of GPUs needed will depend on which model you use and what kind of GPUs you use, but we recommend using 8 A100-80GB GPUs.

We'll also learn how to export your custom model to TensorRT-LLM, an open-source library that accelerates and optimizes inference performance of the latest LLMs on the NVIDIA AI platform.

## Introduction

Supervised Fine-Tuning (SFT) is the process of fine-tuning all of a model's parameters on supervised data of inputs and outputs. It teaches the model how to follow user-specified instructions and is typically done after model pre-training. This notebook describes the steps involved in fine-tuning Gemma for instruction following. Gemma was released with a checkpoint already fine-tuned for instruction following, but here we'll learn how to tune our own model, starting from the pre-trained checkpoint, to achieve a similar outcome.

## Download the base model

For all of our customization and deployment processes, we'll need to start off with a pre-trained version of Gemma in the `.nemo` format. You can download the base model in `.nemo` format from the NVIDIA GPU Cloud, or convert checkpoints from another framework into a `.nemo` file. You can choose to use the 2B parameter or 7B parameter Gemma models for this notebook -- the 2B model will be faster to customize, but the 7B model will be more capable. 

You can download either model from the NVIDIA NGC Catalog, using the NGC CLI. The instructions to install and configure the NGC CLI can be found [here](https://ngc.nvidia.com/setup/installers/cli).

To download the model, execute one of the following commands, based on which model you want to use:

```bash
ngc registry model download-version "nvidia/nemo/gemma_2b_base:1.1"
```

or

```bash
ngc registry model download-version "nvidia/nemo/gemma_7b_base:1.1"
```

## Getting NeMo Framework

NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models.

If you haven't already, you can pull a container that includes the version of NeMo Framework and all dependencies needed for this notebook with the following:

```bash
docker pull nvcr.io/nvidia/nemo:24.01.gemma
```

The best way to run this notebook is from within the container. You can do that by launching the container with the following command:

```bash
docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.01.gemma
```

Then, from within the container, start the Jupyter server with:

```bash
jupyter lab --no-browser --port=8080 --allow-root --ip 0.0.0.0
```

## SFT Data Formatting

To begin, we'll need to prepare a dataset to tune our model on.

This notebook uses the [Dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k) as an example to demonstrate how to format your SFT data. This dataset consists of 15,000 instruction-context-response triples.

First, download the data with the following command:

In [None]:
!wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl


The downloaded data, stored at `databricks-dolly-15k.jsonl`, is a `JSONL` file with each line formatted like this:



In [None]:
{
    "instruction": "When did Virgin Australia start operating?",
    "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.[3] It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.[4]",
    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",
    "category": "closed_qa"
}

As this example shows, the data has no `input` and `output` fields, which NeMo requires for SFT. To remedy this, we can do some data pre-processing. The following cell maps the `instruction`, `context`, and `response` fields to `input` and `output`. When a record has a non-empty `context`, the cell joins `instruction` and `context` with a `\n\n` separator, in random order, to form the `input`; otherwise the `instruction` alone is used. The result is written to a new `JSONL` file called `databricks-dolly-15k-output.jsonl`.

In [None]:
import json
import numpy as np

path_to_data = "databricks-dolly-15k.jsonl"
output_path = f"{path_to_data.split('.')[0]}-output.jsonl"
with open(path_to_data, "r", encoding="utf-8") as f, open(output_path, "w", encoding="utf-8") as g:
    for line in f:

        # Read JSONL line in original format
        record = json.loads(line)
        context = record["context"].strip()

        if context != "":
            instruction = record["instruction"].strip()
            assert instruction != ""

            # Randomize context and instruction order.
            context_first = np.random.randint(0, 2) == 0
            if context_first:
                input_text = f"{context}\n\n{instruction}"
            else:
                input_text = f"{instruction}\n\n{context}"
        else:
            # No context: the instruction alone is the input.
            input_text = record["instruction"]
        output_text = record["response"]

        # Write JSONL line in new format
        g.write(
            json.dumps(
                {"input": input_text, "output": output_text, "category": record["category"]}
            )
            + "\n"
        )

Now, the dataset is a `JSONL` file with each line formatted like this: 

In [None]:
{
  "input": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.[3] It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.[4]\n\nWhen did Virgin Australia start operating?",
  "output": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",
  "category": "closed_qa"
}
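Before training, it can be worth confirming that every converted record really has the two fields described above. The following is a minimal sketch, not part of NeMo: the helper name `check_sft_records` is hypothetical, and the only assumption is that a usable record is a JSON object with non-empty string `input` and `output` fields.

```python
import json


def check_sft_records(lines):
    """Return (num_ok, problems) for an iterable of JSONL lines.

    A record counts as OK when it parses as a JSON object whose "input"
    and "output" fields are non-empty strings. `problems` is a list of
    (1-based line number, description) tuples.
    """
    num_ok = 0
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as err:
            problems.append((lineno, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        bad = [
            key
            for key in ("input", "output")
            if not isinstance(record.get(key), str) or not record[key].strip()
        ]
        if bad:
            problems.append((lineno, f"missing or empty: {', '.join(bad)}"))
        else:
            num_ok += 1
    return num_ok, problems


if __name__ == "__main__":
    sample = [
        '{"input": "What is NeMo?", "output": "A framework.", "category": "open_qa"}',
        '{"input": "", "output": "orphan answer"}',
    ]
    print(check_sft_records(sample))
    # → (1, [(2, 'missing or empty: input')])
```

To check the real file, pass an open file handle instead of the sample list, for example `check_sft_records(open("databricks-dolly-15k-output.jsonl", encoding="utf-8"))`.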

## SFT Training

To perform the SFT training, we'll use NVIDIA NeMo-Aligner, a scalable toolkit for efficient model alignment built on the [NeMo Toolkit](https://github.com/NVIDIA/NeMo). It allows training to scale up to thousands of GPUs using tensor, data, and pipeline parallelism for all components of alignment. Users can do end-to-end model alignment on a wide range of model sizes and take advantage of these parallelism techniques to keep alignment performant and resource-efficient.

To install NeMo-Aligner, we can clone the repository and install it using `pip`:

In [None]:
%%bash

git clone https://github.com/NVIDIA/NeMo-Aligner.git -b dev
cd NeMo-Aligner
pip install -e .

If you want to track and visualize your SFT training experiments, you can log in to Weights and Biases. If you don't want to use wandb, make sure to set the argument `exp_manager.create_wandb_logger=False` when launching your job.

In [None]:
import wandb
wandb.login()

To run SFT locally on a single node, you can use the following command. Note the `trainer.num_nodes` and `trainer.devices` arguments, which define how many nodes and how many GPUs per node you want to use for training. Make sure the source model, output model, and dataset paths all match your local setup.

If you'd like to perform multi-node fine-tuning -- for example on a Slurm cluster -- you can find more information in the [NeMo-Aligner user guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/rlhf.html#instruction-following-taught-by-supervised-fine-tuning-sft).
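The node, GPU, parallelism, and batch-size arguments in this section's command are not independent. As a rough sketch of how they relate under the usual Megatron-style convention (NeMo works these values out internally; the function `parallel_layout` below is hypothetical and only illustrates the arithmetic), the data-parallel size is the total GPU count divided by the product of the tensor- and pipeline-parallel sizes, and the global batch is filled by accumulating micro-batches across the data-parallel replicas:

```python
def parallel_layout(num_nodes, devices_per_node, tensor_parallel,
                    pipeline_parallel, micro_batch_size, global_batch_size):
    """Derive data-parallel size and gradient-accumulation steps.

    Illustrative arithmetic only; raises ValueError when the settings
    do not divide evenly.
    """
    world_size = num_nodes * devices_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split into model-parallel "
            f"groups of {model_parallel}"
        )
    data_parallel = world_size // model_parallel

    # Samples processed per optimizer micro-step across all replicas.
    samples_per_micro_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not a multiple of "
            f"micro_batch_size * data_parallel = {samples_per_micro_step}"
        )
    return {
        "world_size": world_size,
        "data_parallel_size": data_parallel,
        "gradient_accumulation_steps": global_batch_size // samples_per_micro_step,
    }


if __name__ == "__main__":
    # Values taken from the command in this section:
    # 1 node x 8 GPUs, TP=4, PP=1, micro batch 1, global batch 128.
    print(parallel_layout(1, 8, 4, 1, 1, 128))
    # → {'world_size': 8, 'data_parallel_size': 2, 'gradient_accumulation_steps': 64}
```

If you train on fewer GPUs, re-running this check with your own values is a quick way to spot a combination that cannot divide evenly before launching the job.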

In [None]:

%%bash

cd NeMo-Aligner

python examples/nlp/gpt/train_gpt_sft.py \
   name=gemma_dolly_finetuned \
   trainer.precision=bf16 \
   trainer.num_nodes=1 \
   trainer.devices=8 \
   trainer.sft.max_steps=-1 \
   trainer.sft.limit_val_batches=40 \
   trainer.sft.val_check_interval=1000 \
   model.tensor_model_parallel_size=4 \
   model.pipeline_model_parallel_size=1 \
   model.megatron_amp_O2=True \
   model.restore_from_path=../gemma_7b_pt.nemo \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   ++model.bias_activation_fusion=true \
   model.data.num_workers=0 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=128 \
   model.data.train_ds.file_path=../databricks-dolly-15k-output.jsonl \
   model.data.train_ds.add_bos=True \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=128 \
   model.data.validation_ds.drop_last=True \
   model.data.validation_ds.file_path=../databricks-dolly-15k-output.jsonl \
   exp_manager.create_wandb_logger=True \
   exp_manager.explicit_log_dir=../results \
   exp_manager.wandb_logger_kwargs.project=sft_run \
   exp_manager.wandb_logger_kwargs.name=dolly_sft_run \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=validation_loss

When training is finished, you should see a file called `results/checkpoints/gemma_dolly_finetuned.nemo` that contains the weights of your new, instruction-tuned model.
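The command above points both `model.data.train_ds.file_path` and `model.data.validation_ds.file_path` at the same file, so the monitored `validation_loss` is measured on data the model has already trained on. If you would rather validate on held-out records in a later run, one option is to split the converted file first. This is a minimal sketch: the function names and the output file names in the usage note are hypothetical, and the 10% validation fraction is an arbitrary choice.

```python
import random


def split_records(records, val_fraction=0.1, seed=0):
    """Shuffle records deterministically and return (train, validation)."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)
    if len(shuffled) < 2:
        raise ValueError("need at least two records to split")
    random.Random(seed).shuffle(shuffled)
    # Keep at least one record on each side of the split.
    num_val = min(len(shuffled) - 1, max(1, int(len(shuffled) * val_fraction)))
    return shuffled[num_val:], shuffled[:num_val]


def split_jsonl(src_path, train_path, val_path, val_fraction=0.1, seed=0):
    """Split a JSONL file into train and validation files.

    Returns (number of train records, number of validation records).
    """
    with open(src_path, encoding="utf-8") as src:
        # Normalize line endings so a missing final newline cannot
        # glue two records together after shuffling.
        records = [line.rstrip("\n") + "\n" for line in src if line.strip()]
    train, val = split_records(records, val_fraction, seed)
    for path, rows in ((train_path, train), (val_path, val)):
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(rows)
    return len(train), len(val)
```

For example, `split_jsonl("databricks-dolly-15k-output.jsonl", "dolly-train.jsonl", "dolly-val.jsonl")` writes the two files, which you can then pass as the training and validation `file_path` arguments.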

## Exporting to TensorRT-LLM

TensorRT-LLM is an open-source library for optimizing inference performance to achieve state-of-the-art speed on NVIDIA GPUs. The NeMo framework offers an easy way to compile `.nemo` models into optimized TensorRT-LLM engines, which you can run locally embedded in another application or serve to other applications using a server like Triton Inference Server.

To start with, let's create a folder where our exported model will land:

In [None]:
!mkdir -p gemma_dolly_finetuned_trt_llm

To export the model, we just need to create an instance of the `TensorRTLLM` class and call the `TensorRTLLM.export()` function -- pointing the `nemo_checkpoint_path` argument to the newly fine-tuned model we trained above.

This creates several files in the folder we created: an `engine` file that holds the weights and the compiled execution graph of the model, a `tokenizer.model` file that holds the tokenizer information, and a `config.json` file that holds some metadata about the model. It also writes `model.cache`, which caches some operations and makes it faster to re-compile the model in the future.

In [None]:
from nemo.export import TensorRTLLM
trt_llm_exporter = TensorRTLLM(model_dir="gemma_dolly_finetuned_trt_llm")
trt_llm_exporter.export(nemo_checkpoint_path="results/checkpoints/gemma_dolly_finetuned.nemo", model_type="gemma", n_gpus=1)
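Once the export completes, listing the output directory is a quick way to confirm that the files were written. The sketch below uses only the standard library; the helper name `list_exported_files` is hypothetical, the directory name is taken from the `model_dir` argument above, and the exact file names you see may differ between NeMo versions.

```python
from pathlib import Path


def list_exported_files(model_dir):
    """Return a sorted list of (relative path, size in bytes) under model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    target = Path("gemma_dolly_finetuned_trt_llm")
    if target.is_dir():
        for name, size in list_exported_files(target):
            print(f"{size:>14,d}  {name}")
    else:
        print(f"{target} not found; run the export first")
```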


With the model exported to TensorRT-LLM, we can perform very fast inference:

In [None]:
trt_llm_exporter.forward(["NVIDIA and Google are"])

There's also a convenient function to deploy the model as a service, backed by Triton Inference Server:

In [None]:
from nemo.deploy import DeployPyTriton

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="gemma")
nm.deploy()
nm.serve()