## CodeGemma Parameter Efficient Fine-Tuning with LoRA using NeMo Framework

CodeGemma is a groundbreaking new open model from Google, built on the Gemma family of models. CodeGemma is just as powerful as previous models but compact enough to run locally on NVIDIA RTX GPUs. CodeGemma is available in two sizes: 2B and 7B parameters. With NVIDIA NeMo, you can customize CodeGemma to fit your use case and deploy an optimized model on your NVIDIA GPU.

In this tutorial, we'll go over a specific kind of customization: low-rank adaptation (LoRA) tuning, used here to teach the model to follow a specific output format. To learn how to perform full-parameter supervised fine-tuning (SFT) for instruction following, see the [SFT notebook on Gemma Base Model](https://github.com/NVIDIA/GenerativeAIExamples/blob/main/models/Gemma/sft.ipynb). For LoRA, we'll perform all operations within the notebook on a single GPU. The compute resources needed for training depend on which CodeGemma model you use: the 7 billion parameter variant needs a GPU with 80GB of memory, while 40GB is enough for the 2 billion parameter model.

We'll also learn how to export your custom model to TensorRT-LLM, an open-source library that accelerates and optimizes inference performance of the latest LLMs on the NVIDIA AI platform.

## Introduction

[LoRA tuning](https://arxiv.org/abs/2106.09685) is a parameter efficient method for fine-tuning models, where we freeze the base model parameters and update an auxiliary "adapter" with many fewer weights. At inference time, the adapter weights are combined with the base model weights to produce a new model, customized for a particular use case or dataset. Because this adapter is so much smaller than the base model, it can be trained with far fewer resources than it would take to fine-tune the entire model. In this notebook, we'll show you how to LoRA-tune small models like the CodeGemma models on a single A100 GPU.
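
To make the size difference concrete, here is a minimal sketch that counts the weights in one linear layer and in a low-rank adapter attached to it. The layer dimensions and rank are hypothetical values chosen for illustration; they are not CodeGemma's actual configuration, and the function name is made up for this example.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (base, adapter) parameter counts for one linear layer.

    The base weight is d_out x d_in. A LoRA adapter adds two small
    factors: A (rank x d_in) and B (d_out x rank).
    """
    base = d_in * d_out
    adapter = rank * d_in + d_out * rank
    return base, adapter


# Hypothetical layer sizes, for illustration only.
base, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"base weights:    {base:,}")
print(f"adapter weights: {adapter:,}")
print(f"adapter / base:  {adapter / base:.2%}")
```

Under these assumed sizes the adapter holds well under 1% of the layer's weights, which is why only a small set of parameters needs gradients and optimizer state during training.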

For this example, we're going to tune our CodeGemma model on the [Alpaca Python Code Instructions Dataset](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca) to enhance its instruction-following ability for generating Python code.

## Download the Pretrained CodeGemma Model

For all of our customization and deployment processes, we'll need to start off with a pre-trained version of CodeGemma in the `.nemo` format. You can download the base model in `.nemo` format from the NVIDIA GPU Cloud, or convert checkpoints from another framework into a `.nemo` file. You can choose to use the 2B parameter or 7B parameter CodeGemma models for this notebook -- the 2B model will be faster to customize, but the 7B model will be more capable.

You can download either model from the NVIDIA NGC Catalog, using the NGC CLI. The instructions to install and configure the NGC CLI can be found [here](https://ngc.nvidia.com/setup/installers/cli).

To download the model, execute one of the following commands, based on which model you want to use:

ngc registry model download-version "nvidia/nemo/codegemma_2b_base:1.0"

or

ngc registry model download-version "nvidia/nemo/codegemma_7b_base:1.0"

## Getting NeMo Framework

NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models.

If you haven't already, you can pull a container that includes the version of NeMo Framework and all dependencies needed for this notebook with the following:

docker pull nvcr.io/nvidia/nemo:24.03.codegemma

The best way to run this notebook is from within the container. You can do that by launching the container with the following command:

docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.03.codegemma

Then, from within the container, start the Jupyter server with:

jupyter lab --no-browser --port=5000 --allow-root --ip 0.0.0.0

## Dataset Preparation

Let's download the Alpaca Python Code Instructions dataset from Hugging Face:

In [None]:
!git lfs install
!git clone https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca

Next, the following code snippets convert the dataset into the JSONL format that NeMo expects by default for PEFT. Along the way, we reformat the data into a list of (prompt, completion) pairs that our model can handle. Refer to the printout for the original code instruction data format.

In [None]:
import pandas as pd
import glob
from random import seed, shuffle
from huggingface_hub import login

login(token='your_huggingface_access_token')
# Collect every parquet shard of the dataset and load them into one DataFrame
parquet_file_paths = sorted(glob.glob('./python_code_instructions_18k_alpaca/data/*.parquet'))
df = pd.concat((pd.read_parquet(p) for p in parquet_file_paths), ignore_index=True)
instruct2code_list = df.to_dict('records')

seed(2)  # fix the shuffle order so the splits are reproducible
val_percent = 5
test_percent = 5
num_train = int(len(instruct2code_list) * (100 - val_percent - test_percent) / 100)
num_val = int(len(instruct2code_list) * val_percent / 100)
shuffle(instruct2code_list)  # shuffle in place before slicing into splits

instruct2code_list_train = instruct2code_list[:num_train]
instruct2code_list_val = instruct2code_list[num_train : num_train + num_val]
instruct2code_list_test = instruct2code_list[num_train + num_val:]
print(f"=== Input prompt example from the training split:\n{instruct2code_list_train[5]['prompt']}\n") 
print(f"=== Output completion example from the validation split:\n{instruct2code_list_val[5]['output']}")

In [None]:

import json
def write_jsonl(fname, json_objs):
    with open(fname, 'wt') as f:
        for o in json_objs:
            f.write(json.dumps(o)+"\n")
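def form_instruction(pair):
    # Keep only the text before the '### Output' marker; if the marker is
    # missing, return the prompt unchanged instead of dropping its last character
    output_loc = pair.find('### Output')
    return pair if output_loc == -1 else pair[:output_loc]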
def convert_to_jsonl(instruct2code_list, output_path):
    json_objs = []
    for pair in instruct2code_list:
        prompt = form_instruction(pair['prompt'])
        completion = pair['output']
        json_objs.append({"input": prompt, "output": completion})
    write_jsonl(output_path, json_objs)
    return json_objs

print(len(instruct2code_list_train))
train_json_objs = convert_to_jsonl(instruct2code_list_train, "alpaca_python_train.jsonl")
val_json_objs= convert_to_jsonl(instruct2code_list_val, "alpaca_python_val.jsonl")
test_json_objs = convert_to_jsonl(instruct2code_list_test, "alpaca_python_test.jsonl")

Here's an example of what the data looks like after reformatting:

In [None]:
train_json_objs[0]
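
To illustrate the conversion step in isolation, the sketch below splits one prompt at the `### Output` marker and round-trips the result through a JSONL file. The sample record is made up to mimic the Alpaca-style layout, `split_prompt` is a hypothetical helper rather than part of the notebook, and the file is written to a temporary directory so nothing in the working folder is touched.

```python
import json
import os
import tempfile

# Hypothetical record mimicking the Alpaca-style prompt layout; the text is made up.
sample = {
    "prompt": (
        "Below is an instruction that describes a task.\n\n"
        "### Instruction:\nAdd two numbers.\n\n"
        "### Input:\n\n\n"
        "### Output:\ndef add(a, b):\n    return a + b"
    ),
    "output": "def add(a, b):\n    return a + b",
}


def split_prompt(prompt, marker="### Output"):
    """Keep everything before the marker; return the prompt unchanged if it is absent."""
    head, _, _ = prompt.partition(marker)
    return head


record = {"input": split_prompt(sample["prompt"]), "output": sample["output"]}

# Round-trip through a JSONL file: one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "wt") as f:
    f.write(json.dumps(record) + "\n")

with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["input"])
```

Reading the file back this way is also a quick sanity check that every line parses as JSON and carries the `input` and `output` keys before training starts.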

## LoRA Configuration and Training

NeMo Framework drives both configuration and training through a single script, which you'll find at `/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py`. The script uses config parameters to control many of its operations, and an example config file lets you quickly see which options you can change to carry out different experiments. We can start by downloading the example config file, `megatron_gpt_finetuning_config.yaml`, from GitHub. This file configures PEFT training jobs in NeMo, including LoRA tuning of language models.



In [None]:
!wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml

To see all of the different configuration options available, you can take a look at the file we downloaded. For this example, we're going to update a couple of settings to point to our newly-prepared datasets and to make sure the LoRA tuning runs on our A100. Feel free to experiment with these different options -- you can swap in your own datasets and change the training settings depending on what GPU you're using.

For our data configuration, we'll point to the `jsonl` files we wrote out earlier. `concat_sampling_probabilities` determines what fraction of the fine-tuning data should come from each file. In our example we only have one training file, so we choose `[1.0]`.
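
As a rough illustration of what per-file sampling probabilities express, the sketch below draws examples from two files with weights 0.75 and 0.25. The second file name and the weights are hypothetical, and this is not NeMo's blending implementation; it only shows the proportions that such a setting describes, assuming the probabilities sum to 1.

```python
import random
from collections import Counter

# Hypothetical file names and weights; the tutorial itself uses one file with [1.0].
file_names = ["alpaca_python_train.jsonl", "other_train.jsonl"]
concat_sampling_probabilities = [0.75, 0.25]

# This sketch assumes the probabilities sum to 1.
if abs(sum(concat_sampling_probabilities) - 1.0) > 1e-9:
    raise ValueError("sampling probabilities must sum to 1")

rng = random.Random(0)  # seeded so the draw is repeatable
draws = rng.choices(file_names, weights=concat_sampling_probabilities, k=10_000)
counts = Counter(draws)

for name in file_names:
    print(f"{name}: {counts[name] / len(draws):.1%}")
```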

For our model settings, we don't have much to change since we're reading in a pretrained model and can inherit the values that were already set. We need to point to our existing `.nemo` file, specify that we want to use LoRA as our scheme for finetuning, and choose our parallelism and batch size values. The values below should be appropriate for a single A100 GPU.

Make sure to set `restore_from_path` to the path of your `.nemo` checkpoint!

Finally, we set some options for the `Trainer`. We'll be training on 1 GPU on a single node, at bfloat16 precision. For this example we'll train for 2000 steps, with a validation check after every 200 steps.
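
The batch-size and step settings interact, and a little arithmetic shows how. The sketch below assumes the usual relationship in which the global batch equals the micro batch times the data-parallel size times the number of gradient-accumulation steps. The function name is hypothetical, and the example values are the ones in effect in this notebook's training cell (micro batch 1, global batch 128, one device, no tensor or pipeline parallelism).

```python
def grad_accumulation_steps(global_batch, micro_batch, devices, tp=1, pp=1):
    """Micro-batches each data-parallel rank processes per optimizer step."""
    data_parallel = devices // (tp * pp)
    per_pass = micro_batch * data_parallel
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * data_parallel")
    return global_batch // per_pass


# Values in effect for the training command in this notebook.
accum = grad_accumulation_steps(global_batch=128, micro_batch=1, devices=1)
max_steps = 2000
samples_seen = max_steps * 128

print(f"gradient accumulation steps per optimizer step: {accum}")
print(f"training samples consumed over {max_steps} steps: {samples_seen:,}")
```

If you lower the global batch size to shorten each step, the number of samples consumed over the same `max_steps` shrinks in proportion.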

In addition to the `Trainer` settings that control the training loop, we configure the experiment manager (`exp_manager`), which handles checkpointing and logging. The base model is then loaded from disk into memory.

Now, let's see how to add the LoRA adapter to our model and train it. We select LoRA by setting `model.peft.peft_scheme` to `lora`; the `model.peft` section of the config stores the applicable adapter types and the hyperparameters required to initialize the adapter module.

We're now ready to start training! As the training loop runs, you'll see the validation loss drop significantly -- even with this short demonstration.

In [None]:
%%bash

PEFT_SCHEME='lora'
MODEL_SIZE=7b
MBS=1
TP=1
PP=1
NUM_DEVICES=1
SEQ_LEN=4096

EXTRA_ARGS="
        +model.fp8=False \
        +model.fp8_e4m3=False \
        +model.fp8_hybrid=True \
        +model.fp8_margin=0 \
        +model.fp8_interval=1 \
        +model.fp8_amax_history_len=128 \
        +model.fp8_amax_compute_algo=max "

TRAIN_DS=[alpaca_python_train.jsonl]
VALID_DS=[alpaca_python_val.jsonl]
GBS=128
PACKED=False
MODEL=codegemma-7b_fromhf.nemo
EXP_DIR=nemo_experiments
    
torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.devices=${NUM_DEVICES} \
        trainer.num_nodes=1 \
        trainer.val_check_interval=200 \
        trainer.max_steps=2000 \
        +trainer.num_sanity_val_steps=0 \
        +trainer.limit_val_batches=3 \
        model.megatron_amp_O2=True \
        exp_manager.resume_if_exists=False \
        exp_manager.exp_dir="${EXP_DIR}" \
        exp_manager.checkpoint_callback_params.save_top_k=0 \
        model.tensor_model_parallel_size=${TP} \
        model.pipeline_model_parallel_size=${PP} \
        model.micro_batch_size=${MBS} \
        model.global_batch_size=${GBS} \
        model.restore_from_path=${MODEL} \
        model.data.train_ds.num_workers=0 \
        model.data.validation_ds.num_workers=0 \
        +model.data.train_ds.packed_sequence=${PACKED} \
        ++model.sequence_parallel=False \
        +model.log_token_counts=True \
        model.data.train_ds.file_names=${TRAIN_DS} \
        model.data.train_ds.concat_sampling_probabilities=[1.0] \
        model.data.validation_ds.file_names=${VALID_DS} \
        model.peft.peft_scheme=${PEFT_SCHEME} \
        model.peft.lora_tuning.target_modules=[attention_qkv] \
        model.data.train_ds.max_seq_length=${SEQ_LEN} \
        model.data.validation_ds.max_seq_length=${SEQ_LEN} \
        +model.apply_rope_fusion=True \
        ${EXTRA_ARGS} \
        trainer.precision=bf16 \
        model.answer_only_loss=True

Once training is completed, you should see a saved `.nemo` file in the `nemo_experiments` folder. This checkpoint will only contain the trained adapter weights, and not the frozen base model weights.

Next, we'll need to merge the weights of the base model and the weights of the adapter. If you're using the `NeMo Framework` container, you'll find a script for this at `/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py`. Otherwise, you can download the standalone script from GitHub at https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/nlp_language_modeling/merge_lora_weights/merge.py

To merge weights using the merge script, you'll need the path to the base model and trained adapter, as well as a path to save the merged model to.

In [None]:
%%bash
python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
    trainer.accelerator=gpu \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    gpt_model_file=codegemma-7b_fromhf.nemo \
    lora_model_path=megatron_gpt_peft_lora_tuning.nemo \
    merged_model_path=gemma_lora_alpaca_python_merged.nemo
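
Conceptually, merging folds the low-rank update into the base weight so that inference needs only one dense matrix per layer. The NumPy sketch below checks that equivalence on a toy layer. The shapes, the `alpha / rank` scaling, and the variable names are illustrative assumptions, and this is not the code that `merge.py` runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes and scaling; hypothetical values, unrelated to CodeGemma's real layers.
d_in, d_out, rank, alpha = 16, 12, 4, 8
W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((rank, d_in))   # adapter factor (rank x d_in)
B = rng.standard_normal((d_out, rank))  # adapter factor (d_out x rank)
scale = alpha / rank

# Fold the adapter into the base weight to get a single dense matrix.
W_merged = W + scale * (B @ A)

x = rng.standard_normal((3, d_in))
y_adapter = x @ W.T + scale * (x @ A.T) @ B.T  # base path plus adapter path
y_merged = x @ W_merged.T                      # one matmul after merging

print(np.allclose(y_adapter, y_merged))
```

Because the two outputs agree, the merged checkpoint can be evaluated and exported like any ordinary model, with no adapter-specific code at inference time.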

With our merged model weights, we can run evaluation on the test dataset using `megatron_gpt_peft_eval.py`. Set the appropriate model checkpoint path, test file path, batch sizes, number of tokens to generate, and so on, then run evaluation on the test file.

In [None]:
%%bash

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_eval.py \
    model.restore_from_path=gemma_lora_alpaca_python_merged.nemo \
    trainer.devices=1 \
    model.global_batch_size=8 \
    model.data.test_ds.file_names=["alpaca_python_test.jsonl"] \
    model.data.test_ds.names=["alpaca_python_test_set"] \
    model.data.test_ds.global_batch_size=8 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=20 \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=1 \
    inference.greedy=True \
    model.data.test_ds.output_file_path_prefix=/results \
    model.data.test_ds.write_predictions_to_file=True

Check the output from the result file:

In [None]:
!tail -n 4 /results_test_alpaca_python_test_set_inputs_preds_labels.jsonl
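
For a slightly more structured look than `tail`, the results can be loaded in Python. The sketch below assumes each line is a JSON object with `input`, `pred`, and `label` keys, which is an assumption suggested by the file name rather than something confirmed here, so inspect your own file first. It runs on synthetic rows written to a temporary file; the helper names are hypothetical, and exact match is only a crude signal for generated code.

```python
import json
import os
import tempfile


def load_predictions(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def exact_match_rate(rows, pred_key="pred", label_key="label"):
    """Fraction of rows whose prediction equals the label after stripping whitespace."""
    if not rows:
        return 0.0
    hits = sum(r[pred_key].strip() == r[label_key].strip() for r in rows)
    return hits / len(rows)


# Synthetic rows standing in for the real results file.
demo_rows = [
    {"input": "Add two numbers", "pred": "def add(a, b): return a + b",
     "label": "def add(a, b): return a + b"},
    {"input": "Negate a number", "pred": "def neg(x): return x",
     "label": "def neg(x): return -x"},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_preds.jsonl")
with open(demo_path, "wt") as f:
    for row in demo_rows:
        f.write(json.dumps(row) + "\n")

rows = load_predictions(demo_path)
print(f"{len(rows)} rows, exact match = {exact_match_rate(rows):.2f}")
```

To try it on the real output, pass the path of the results file to `load_predictions` and adjust the key names if your file uses different ones.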

Note: this is only a sample output (based on a toy LoRA example), and your output may vary. Performance can be further improved by fine-tuning the model for more steps.

Finally, let's continue on to the "Exporting to TensorRT-LLM" section, to learn how to export our new model for optimized inference using TensorRT-LLM! 

## Exporting to TensorRT-LLM

TensorRT-LLM is an open-source library for optimizing inference performance to achieve state-of-the-art speed on NVIDIA GPUs. The NeMo Framework offers an easy way to compile `.nemo` models into optimized TensorRT-LLM engines, which you can run locally embedded in another application or serve to other applications using a server like Triton Inference Server.

To start with, let's create a folder where our exported model will land:

In [None]:
!mkdir -p gemma_alpaca_python_merged_trt_llm

With our merged model weights, we just need to create an instance of the TensorRTLLM class and call the TensorRTLLM.export() function -- pointing the nemo_checkpoint_path argument to the newly merged model from above.

This creates several files in the folder we created: an engine file that holds the weights and the compiled execution graph of the model, a `tokenizer.model` file that holds the tokenizer information, and `config.json`, which holds some metadata about the model. It also writes `model.cache`, which caches some operations and makes it faster to re-compile the model in the future.

In [None]:
from nemo.export import TensorRTLLM
trt_llm_exporter = TensorRTLLM(model_dir="gemma_alpaca_python_merged_trt_llm")
trt_llm_exporter.export(nemo_checkpoint_path="gemma_lora_alpaca_python_merged.nemo", model_type="gemma", n_gpus=1)

With the model exported to TensorRT-LLM, we can perform very fast inference:

In [None]:
trt_llm_exporter.forward(["Implement Fibonacci sequence in Python"])

There's also a convenient function to deploy the model as a service, backed by Triton Inference Server:

In [None]:
from nemo.deploy import DeployPyTriton

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="gemma")
nm.deploy()
nm.serve()