# Learning Goals

## Parameter-Efficient Fine-Tuning (PEFT)

This notebook aims to demonstrate how to adapt or customize foundation models to improve performance on specific tasks using NeMo 2.0.

This optimization process is known as fine-tuning, which involves adjusting the weights of a pre-trained foundation model with custom data.

Because foundation models can be very large, a family of fine-tuning variants known as Parameter-Efficient Fine-Tuning (PEFT) has gained traction. PEFT methods update only a small number of added or selected parameters while keeping the base model frozen. They include P-Tuning, LoRA, Adapters, and IA3. NeMo 2.0 currently supports the Low-Rank Adaptation (LoRA) method.

This playbook applies LoRA to the Llama 3 8B model using NeMo 2.0.
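
To make the idea concrete, the following is a minimal pure-Python sketch of a LoRA-adapted linear layer. It is not NeMo's implementation. It assumes the standard formulation from the LoRA paper: the frozen weight `W` is left untouched and a trainable low-rank product `B A`, scaled by `alpha / r`, is added to its output. The helper names (`matvec`, `lora_forward`) and the toy sizes are illustrative only.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 4

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen, d_out x d_in
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero, the adapted layer starts out identical to the base layer.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_out * d_in        # weights updated by full fine-tuning
lora_params = r * (d_in + d_out)  # weights updated by LoRA
print(f"full fine-tuning: {full_params} weights, LoRA: {lora_params} weights")
```

At these toy sizes the saving is modest, but because the LoRA count grows with `d_in + d_out` rather than `d_in * d_out`, the gap widens quickly for the multi-thousand-dimensional layers of a real LLM.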

## NeMo 2.0

In NeMo 1.0, experiments are configured mainly through YAML files. This approach allows for a declarative way to set up experiments, but it has limitations in terms of flexibility and programmatic control. NeMo 2.0 is an update to the NeMo Framework that introduces several significant improvements over its predecessor, NeMo 1.0, enhancing flexibility, performance, and scalability.

- Python-Based Configuration - NeMo 2.0 transitions from YAML files to a Python-based configuration, providing more flexibility and control. This shift makes it easier to extend and customize configurations programmatically.

- Modular Abstractions - By adopting PyTorch Lightning’s modular abstractions, NeMo 2.0 simplifies adaptation and experimentation. This modular approach allows developers to more easily modify and experiment with different components of their models.

- Scalability - NeMo 2.0 scales large experiments across thousands of GPUs using NeMo-Run, a tool designed to streamline the configuration, execution, and management of machine learning experiments across computing environments.

For an overview of the new features in NeMo 2.0 and a step-by-step migration guide from NeMo 1.0, see the NeMo 2.0 Overview listed under Educational Resources below.
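
As a rough illustration of why Python-based configuration is easier to customize than a static YAML file, the sketch below uses plain dataclasses. It is not NeMo's configuration API: `TrainerSettings`, `ExperimentConfig`, and `sweep_learning_rates` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class TrainerSettings:
    devices: int = 1
    max_steps: int = 40


@dataclass(frozen=True)
class ExperimentConfig:
    name: str
    lr: float = 1e-4
    trainer: TrainerSettings = field(default_factory=TrainerSettings)


def sweep_learning_rates(base, learning_rates):
    """Derive one config per learning rate from a shared base config."""
    return [replace(base, name=f"{base.name}_lr{lr:g}", lr=lr) for lr in learning_rates]


base = ExperimentConfig(name="peft_demo")
variants = sweep_learning_rates(base, [1e-4, 5e-5])
for cfg in variants:
    print(cfg.name, cfg.lr, cfg.trainer.max_steps)
```

Because the configuration is ordinary Python, variants can be generated with loops and functions, and mistakes such as a misspelled field name surface as errors when the object is constructed.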


# NeMo Tools and Resources
1. [NeMo GitHub repo](https://github.com/NVIDIA/NeMo)

2. NeMo Framework Training container: `nvcr.io/nvidia/nemo:dev`  #TODO: FIX CONTAINER

# Educational Resources
1. Blog: [Mastering LLM Techniques: Customization](https://developer.nvidia.com/blog/selecting-large-language-model-customization-techniques/)

2. Whitepaper: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)

3. [NeMo 2.0 Overview](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/index.html)

4. Blog: [Tune and Deploy LoRA LLMs with NVIDIA TensorRT-LLM](https://developer.nvidia.com/blog/tune-and-deploy-lora-llms-with-nvidia-tensorrt-llm/)


## Software Requirements

1. Use the latest [NeMo Framework Training container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). Note that you must be logged in to the container registry to view this page.

2. This notebook uses the container: `nvcr.io/nvidia/nemo:dev`  #TODO: FIX CONTAINER  


## Hardware Requirements
Llama3 8B: minimum 1xA100 80G


## Data
This notebook uses the SQuAD dataset. For more details about the data, refer to [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250).



# Step 0: Enter the Docker container

Here is a demo of starting and entering the container on DGX Cloud.

Otherwise, you can start and enter the dev container with:  #TODO: FIX CONTAINER
```
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:dev bash
```


# Step 1: Import HuggingFace checkpoint
First, request download permission from Meta and Hugging Face. Then log in through `huggingface-cli` with your Hugging Face token before importing Llama 3 models.

```
$ huggingface-cli login
```

Once you are logged in, NeMo 2.0 will automatically import the Hugging Face model and start training. There is no need to manually convert it to the NeMo checkpoint format.

Let's first import the Python modules we need:

In [None]:
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig
import torch
import pytorch_lightning as pl

## Step 2: Prepare data

We will be using SQuAD for this notebook. NeMo 2.0 already provides a `SquadDataModule`. Example usage:

In [None]:

def squad() -> pl.LightningDataModule:
    return llm.SquadDataModule(seq_length=2048, micro_batch_size=2, global_batch_size=8, num_workers=0)

To create a custom `DataModule` from your own data for PEFT, refer to the [NeMo 2.0 SFT notebook](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/nemo2-sft.ipynb).
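
The batch-size arguments passed to `SquadDataModule` above interact with the number of GPUs. The sketch below assumes the usual Megatron convention that the global batch is split across data-parallel replicas and then into micro-batches, so the number of gradient-accumulation steps is `global_batch_size / (micro_batch_size * data_parallel_size)`. The helper `accumulation_steps` is hypothetical and written only for this illustration.

```python
def accumulation_steps(global_batch_size, micro_batch_size, devices,
                       tensor_parallel=1, pipeline_parallel=1):
    """Micro-batches each data-parallel replica processes per optimizer step."""
    model_parallel = tensor_parallel * pipeline_parallel
    if devices % model_parallel != 0:
        raise ValueError("devices must be divisible by the model-parallel size")
    data_parallel = devices // model_parallel
    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * data_parallel"
        )
    return global_batch_size // samples_per_pass


# Values used in this notebook: one GPU, micro batch 2, global batch 8.
print(accumulation_steps(global_batch_size=8, micro_batch_size=2, devices=1))  # 4
# Two GPUs with pure data parallelism halve the accumulation steps.
print(accumulation_steps(global_batch_size=8, micro_batch_size=2, devices=2))  # 2
```

Under this assumption, increasing `micro_batch_size` trades GPU memory for fewer accumulation passes, while `global_batch_size` fixes how many samples contribute to each optimizer step.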

## Step 3: Run PEFT with NeMo 2.0 API 

The following Python script uses the NeMo 2.0 API to perform PEFT. The script configures the components described below for training. These components are largely the same for SFT and PEFT, since both use the `llm.finetune` API. To switch from SFT to PEFT, you only need to pass a LoRA adapter through the `peft` parameter.

### Trainer
The NeMo 2.0 Trainer works similarly to the PyTorch Lightning Trainer. You can specify `MegatronStrategy` as the model-parallel strategy to use NVIDIA's Megatron-LM framework, and pass in configurations as shown below:



In [None]:

def trainer(devices=1) -> nl.Trainer:
    strategy = nl.MegatronStrategy(
        tensor_model_parallel_size=1,
    )

    return nl.Trainer(
        devices=devices,
        max_steps=40,
        accelerator="gpu",
        strategy=strategy,
        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
        log_every_n_steps=1,
        limit_val_batches=2,
        val_check_interval=2,
        num_sanity_val_steps=0,
    )


### Logger
Configure checkpointing, output directories, and logging through `NeMoLogger`. In the following example, the experiment output is saved to `./results/nemo2_peft`.



In [None]:
def logger() -> nl.NeMoLogger:
    ckpt = nl.ModelCheckpoint(
        save_last=True,
        every_n_train_steps=10,
        monitor="reduced_train_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )

    return nl.NeMoLogger(
        name="nemo2_peft",
        log_dir="./results",
        use_datetime_version=False,
        ckpt=ckpt,
        wandb=None
    )
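
Taken together with the trainer settings, the checkpoint callback above determines when validation and checkpointing happen during the 40-step run. The following sketch simply enumerates those steps. It assumes PyTorch Lightning-style semantics, where `val_check_interval=2` validates every 2 training steps and `every_n_train_steps=10` writes a checkpoint every 10 steps. The helper `steps_at_interval` is hypothetical.

```python
def steps_at_interval(max_steps, interval):
    """1-based training steps at which an every-`interval`-steps event fires."""
    return list(range(interval, max_steps + 1, interval))


max_steps = 40
validation_steps = steps_at_interval(max_steps, 2)
checkpoint_steps = steps_at_interval(max_steps, 10)

print(len(validation_steps), "validation runs")  # 20 validation runs
print("checkpoints at steps", checkpoint_steps)  # checkpoints at steps [10, 20, 30, 40]
```

Under the same assumed semantics, `save_top_k=1` keeps only the best checkpoint according to `reduced_train_loss`, and `save_last=True` additionally keeps the most recent one, so not every checkpoint written during the run remains on disk.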



### Optimizer
In the following example, we use the distributed Adam optimizer and pass in the optimizer configuration through `OptimizerConfig`:




In [None]:
def adam_with_cosine_annealing() -> nl.OptimizerModule:
    return nl.MegatronOptimizerModule(
        config=OptimizerConfig(
            optimizer="adam",
            lr=0.0001,
            adam_beta2=0.98,
            use_distributed_optimizer=True,
            clip_grad=1.0,
            bf16=True,
        ),
    )
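
The function name refers to cosine annealing, but the configuration above only sets the base learning rate and does not attach a scheduler. For readers unfamiliar with the term, here is a minimal sketch of the standard cosine-annealing formula, `lr_t = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))`. The `min_lr=0.0` default and the helper `cosine_lr` are assumptions made for illustration and are not part of this notebook's setup.

```python
import math


def cosine_lr(step, max_steps, max_lr, min_lr=0.0):
    """Cosine-annealed learning rate at `step`, decaying from max_lr to min_lr."""
    progress = min(max(step / max_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule over the 40 training steps used in this notebook.
for step in (0, 10, 20, 30, 40):
    print(step, f"{cosine_lr(step, max_steps=40, max_lr=1e-4):.2e}")
```

The schedule starts at `max_lr`, passes through the midpoint of the two rates halfway through training, and reaches `min_lr` at the final step.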

### LoRA Adapter
To perform LoRA fine-tuning, we pass a LoRA adapter to the fine-tuning API. The adapter can be configured as shown below. The supported target modules are `linear_qkv`, `linear_proj`, `linear_fc1`, and `linear_fc2`. The final script uses the default LoRA configuration (`llm.peft.LoRA()`), which applies adapters to all four modules with `dim=32`.

In [None]:
def lora() -> nl.pytorch.callbacks.PEFT:
    return llm.peft.LoRA(
        target_modules=['linear_qkv', 'linear_proj'], # full list:['linear_qkv', 'linear_proj', 'linear_fc1', 'linear_fc2']
        dim=32,
    )
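
To get a feel for how the choice of `target_modules` affects adapter size, the sketch below estimates the number of trainable LoRA weights. It assumes the standard LoRA factorization (`dim * (in_features + out_features)` weights per adapted linear layer) and the published Llama 3 8B shapes: hidden size 4096, 32 layers, 32 attention heads with 8 key/value heads, FFN size 14336, and a gated MLP whose gate and up projections are fused in `linear_fc1`. The table `LLAMA3_8B_SHAPES` and the helper `lora_param_count` are hypothetical, and NeMo's actual adapter layout may differ in detail.

```python
HIDDEN, FFN, LAYERS = 4096, 14336, 32
HEADS, KV_HEADS = 32, 8
HEAD_DIM = HIDDEN // HEADS  # 128

# (in_features, out_features) per Megatron-style module name; assumed shapes.
LLAMA3_8B_SHAPES = {
    "linear_qkv": (HIDDEN, HEAD_DIM * (HEADS + 2 * KV_HEADS)),  # fused Q, K, V
    "linear_proj": (HIDDEN, HIDDEN),
    "linear_fc1": (HIDDEN, 2 * FFN),  # fused gate and up projections
    "linear_fc2": (FFN, HIDDEN),
}


def lora_param_count(target_modules, dim, shapes=LLAMA3_8B_SHAPES, layers=LAYERS):
    """Total LoRA weights: dim * (in + out) per adapted module, summed over layers."""
    per_layer = sum(dim * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * layers


attention_only = lora_param_count(["linear_qkv", "linear_proj"], dim=32)
all_modules = lora_param_count(list(LLAMA3_8B_SHAPES), dim=32)
print(f"attention only:   {attention_only:,}")
print(f"all four modules: {all_modules:,}")
```

Under these assumptions, even adapting all four modules trains well under 1% of the roughly 8 billion base-model weights.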

### Base Model
We will perform PEFT on top of Llama3-8b, so we create a `LlamaModel` to pass to the finetune API.

In [None]:
def llama3_8b() -> pl.LightningModule:
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    return llm.LlamaModel(llm.Llama3Config8B(), tokenizer=tokenizer)

### AutoResume
In NeMo 2.0, we can pass Llama3-8b's Hugging Face ID directly to start PEFT, without manually converting the model to the NeMo checkpoint format as in NeMo 1.0.

In [None]:

def resume() -> nl.AutoResume:
    return nl.AutoResume(
        restore_config=nl.RestoreConfig(
            path="hf://meta-llama/Meta-Llama-3-8B"
        ),
        resume_if_exists=True,
    )


### NeMo 2.0 finetune API
Using all the components we created above, we can call the NeMo 2.0 finetune API:
```
llm.finetune(
    model=llama3_8b(),
    data=squad(),
    trainer=trainer(),
    peft=lora(),
    log=logger(),
    optim=adam_with_cosine_annealing(),
    resume=resume(),
)
```
Below is a Python script that you can save to a file, e.g. `nemo2-peft.py`, to run PEFT training using all the components created above and the NeMo 2.0 finetune API. The script cannot be executed directly in an interactive environment such as a notebook. Run it with `python nemo2-peft.py` on a single GPU, or with `torchrun --nproc_per_node=<NUM_GPU> nemo2-peft.py` on multiple GPUs.

In [None]:
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig
import torch
import pytorch_lightning as pl


def trainer(devices=1) -> nl.Trainer:
    strategy = nl.MegatronStrategy(
        tensor_model_parallel_size=1,
    )

    return nl.Trainer(
        devices=devices,
        max_steps=40,
        accelerator="gpu",
        strategy=strategy,
        plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
        log_every_n_steps=1,
        limit_val_batches=2,
        val_check_interval=2,
        num_sanity_val_steps=0,
    )


def logger() -> nl.NeMoLogger:
    ckpt = nl.ModelCheckpoint(
        save_last=True,
        every_n_train_steps=10,
        monitor="reduced_train_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )

    return nl.NeMoLogger(
        name="nemo2_peft",
        log_dir="./results",
        use_datetime_version=False,
        ckpt=ckpt,
        wandb=None
    )


def adam_with_cosine_annealing() -> nl.OptimizerModule:
    return nl.MegatronOptimizerModule(
        config=OptimizerConfig(
            optimizer="adam",
            lr=0.0001,
            adam_beta2=0.98,
            use_distributed_optimizer=True,
            clip_grad=1.0,
            bf16=True,
        ),
    )

def lora() -> nl.pytorch.callbacks.PEFT:
    return llm.peft.LoRA()



def squad() -> pl.LightningDataModule:
    return llm.SquadDataModule(seq_length=2048, micro_batch_size=2, global_batch_size=8, num_workers=0)



def llama3_8b() -> pl.LightningModule:
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    return llm.LlamaModel(llm.Llama3Config8B(), tokenizer=tokenizer)

def resume() -> nl.AutoResume:
    return nl.AutoResume(
        restore_config=nl.RestoreConfig(
            path="hf://meta-llama/Meta-Llama-3-8B"
        ),
        resume_if_exists=True,
    )

if __name__ == '__main__':
    llm.finetune(
        model=llama3_8b(),
        data=squad(),
        trainer=trainer(),
        peft=lora(),
        log=logger(),
        optim=adam_with_cosine_annealing(),
        resume=resume(),
    )

## Step 4 Evaluation ##TODO: depending on NeMo 2.0 llm generation API

## Optional: Launch with [NeMo-Run](https://github.com/NVIDIA/NeMo-Run)
Alternatively, we can launch PEFT jobs with NeMo-Run using existing [recipes](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/recipes). A recipe in NeMo is a Python file that defines a complete configuration for training or fine-tuning an LLM. Each recipe typically includes:
1. Model configuration: Defines the architecture and hyperparameters of the LLM.
2. Training configuration: Specifies settings for the PyTorch Lightning Trainer, including distributed training strategies.
3. Data configuration: Sets up the data pipeline, including batch sizes and sequence lengths.
4. Optimization configuration: Defines the optimizer and learning rate schedule.
5. Logging and checkpointing configuration: Specifies how to save model checkpoints and log training progress.

Recipes are designed to be modular and extensible, allowing users to easily customize settings for their specific use cases.


NeMo-Run is a tool designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments. Here is an example of launching a recipe with NeMo-Run using the local executor.

In [None]:
## TODO: Pretrain with tp1pp1cp2 doesn't work. Pretrain with tp4pp1cp2 works. Finetuning recipe doesn't work
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.finetune_recipe(name="llama3-8b-pretrain", dir="exp/nemorun_ft", num_nodes=1, num_gpus_per_node=2)
env_vars = {
    "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
    "NCCL_NVLS_ENABLE": "0",
    "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
    "NVTE_ASYNC_AMAX_REDUCTION": "1",
    "NVTE_FUSED_ATTN": "0",
}
# One torchrun task per GPU, matching num_gpus_per_node in the recipe above
local_executor = run.LocalExecutor(ntasks_per_node=2, launcher="torchrun", env_vars=env_vars)
run.run(recipe, executor=local_executor)