## Optimizing Supervised Fine-Tuning (SFT) with FP8 Precision
This tutorial shows how to integrate FP8 training into Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) to make it more efficient.

FP8 is an 8-bit floating-point format. Compared with traditional mixed-precision training (BF16 and FP32), it offers faster computation and reduced memory consumption on supported hardware, typically without a notable loss in model accuracy.

The core of FP8 training lies in managing the wide dynamic range of values present in transformer architectures. To achieve this, specialized scaling strategies are employed, such as per-tensor and per-block scaling. Per-tensor scaling applies a single scaling factor to each whole tensor, while per-block scaling assigns separate factors to small blocks within a tensor; this finer granularity can improve accuracy and is supported on newer hardware such as NVIDIA Blackwell GPUs. These strategies are crucial for maintaining numerical stability and keeping the training process reliable.

The NVIDIA NeMo Framework provides high-level configurations for these FP8 recipes, making it easier to integrate them into your SFT workflow.
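
The difference between per-tensor and per-block scaling can be illustrated with a small pure-Python simulation. This is a minimal sketch, not how NeMo or Transformer Engine implement FP8: `fake_e4m3` and `quantize_dequantize` are hypothetical helpers, the rounding only approximates the E4M3 grid (3 mantissa bits, maximum magnitude 448), and real block-scaling recipes such as MXFP8 use fixed block sizes and restricted scale formats that are ignored here.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def fake_e4m3(x):
    """Round x to the nearest value on a simulated E4M3 grid (illustrative only)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)  # saturate instead of overflowing
    exp = max(math.floor(math.log2(mag)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)  # spacing between neighbours with 3 mantissa bits
    return math.copysign(round(mag / step) * step, x)


def quantize_dequantize(values, block_size=None):
    """Scale each block so its amax maps to E4M3_MAX, round, then undo the scale.

    block_size=None treats the whole list as one block (per-tensor scaling).
    """
    block_size = block_size or len(values)
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = E4M3_MAX / amax if amax > 0 else 1.0
        out.extend(fake_e4m3(v * scale) / scale for v in block)
    return out


values = [1000.0, 0.5, 0.001, 0.002]  # one outlier next to very small values
per_tensor = quantize_dequantize(values)
per_block = quantize_dequantize(values, block_size=2)
print("per-tensor:", per_tensor)
print("per-block: ", per_block)
```

In this toy example the outlier forces a small per-tensor scale, so the two smallest values underflow to zero, whereas the per-block variant gives them their own scale and recovers them almost exactly.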


### Import Modules

In [1]:
import os
from typing import List, Optional

import fiddle as fdl
import torch
from lightning.pytorch.loggers import TensorBoardLogger, WandbLogger

from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm import import_ckpt
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule
from nemo.collections.llm.gpt.model.llama import Llama31Config8B, LlamaModel
from nemo.collections.llm.recipes.optim.adam import distributed_fused_adam_with_cosine_annealing
from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
from nemo.lightning.io.mixin import IOMixin
from nemo.lightning.pytorch.callbacks import ModelCheckpoint

### Converting Hugging Face Checkpoint to NeMo Format
Before we can perform Supervised Fine-Tuning with FP8, we need to convert our desired model from a Hugging Face checkpoint into the NeMo format. 

This conversion is a crucial first step that allows the NeMo framework to manage the model's architecture and weights, making it compatible with NeMo's advanced training features, including FP8 quantization and scaling recipes.

To perform this conversion, you will use a Python script, for example, named `01_convert_to_nemo.py`. You will need to specify the path to your Hugging Face model (`hf_model_path`) and the desired output path for the NeMo model (`nemo_model_path`) within the script. The script should then be executed from your terminal.

Below is an example of such a script for converting the `Llama-3.1-8B-Instruct` model:

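```
from nemo.collections import llm
from nemo.collections.llm import import_ckpt

# Path to the downloaded Hugging Face checkpoint (input).
hf_model_path = "/workspace/nemo/models/Llama-3.1-8B-Instruct/"
# Directory where the converted NeMo checkpoint is written (output).
nemo_model_path = "/workspace/nemo/models/Llama-3.1-8B-Instruct-Nemo"

if __name__ == "__main__":
    # Build the Llama 3.1 8B architecture and load the Hugging Face weights into it.
    import_ckpt(
        model=llm.LlamaModel(llm.Llama31Config8B()),
        source=f"hf://{hf_model_path}",
        output_path=nemo_model_path,
    )
```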

On success, the script prints the checkpoint location followed by the layout of the output directory:

```
Converted Llama model to Nemo, model saved to /workspace/nemo/models/Llama-3.1-8B-Instruct-Nemo in torch.bfloat16.
✓ Checkpoint imported to /workspace/nemo/models/Llama-3.1-8B-Instruct-Nemo
Imported Checkpoint
├── context/
│   ├── artifacts/
│   │   └── generation_config.json
│   ├── nemo_tokenizer/
│   │   ├── chat_template.jinja
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer.json
│   │   └── tokenizer_config.json
│   ├── io.json
│   └── model.yaml
└── weights/
    ├── .metadata
    ├── __0_0.distcp
    ├── __0_1.distcp
    └── common.pt
```
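
Before starting a long training job, it can be worth checking that the conversion produced the layout shown above. The following is a minimal sketch: `looks_like_nemo_checkpoint` is a hypothetical helper, not part of NeMo, and it only checks for two of the files that appear in the listing, so it may need adjusting for other NeMo versions.

```python
from pathlib import Path


def looks_like_nemo_checkpoint(path):
    """Return True if `path` contains the context/ and weights/ files listed above."""
    root = Path(path)
    required = [
        root / "context" / "io.json",    # serialized model and tokenizer configuration
        root / "weights" / ".metadata",  # metadata of the distributed checkpoint
    ]
    return all(p.is_file() for p in required)


if __name__ == "__main__":
    nemo_model_path = "/workspace/nemo/models/Llama-3.1-8B-Instruct-Nemo"
    print(looks_like_nemo_checkpoint(nemo_model_path))
```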

### Configuring the Training Parameters

Now that the model has been converted to the `nemo` format, we can proceed to configure the training parameters for our Supervised Fine-Tuning (SFT) job. 

This involves setting key values related to model parallelism, sequence length, and other performance-related options.

Configuration Parameters:
- `sequence_length=8192` : Sets the maximum length of the input sequence that the model can handle during training. A larger value allows the model to process more context, which is beneficial for tasks that require a deep understanding of long documents or conversations.
- `tensor_parallel_size=1` : Controls how the weight matrices inside each layer are sharded across GPUs. A value of 2, for instance, would split each layer's weights across two GPUs. A value of 1 disables tensor parallelism, so each GPU holds every layer in full.
- `pipeline_parallel_size=1` : Controls how the model's layers are distributed across GPUs as pipeline stages. A value of 2 would split the model into two stages, each running on a separate GPU. A value of 1 disables pipeline parallelism, so the entire model, from input to output, runs on each GPU.
- `context_parallel_size=1` : Controls the sharding of input sequences across GPUs. A value of 2 would split each input sequence into two parts that are processed on different GPUs. A value of 1 disables context parallelism.
- `sequence_parallel=False` : Determines whether to enable sequence parallelism, which shards activations along the sequence dimension across the tensor-parallel GPUs. It only has an effect when tensor parallelism is enabled, so it is set to False here.
- `hf_tokenizer_path='/workspace/nemo/models/Llama-3.1-8B-Instruct/'` : Specifies the file path to the tokenizer associated with the original Hugging Face model. The tokenizer is essential for converting text data into numerical tokens that the model can understand.
- `micro_batch_size=8` : The number of samples each GPU processes in one forward and backward pass.
- `global_batch_size=256` : The total effective batch size per optimizer step across all GPUs. It must be a multiple of `micro_batch_size` times the number of data-parallel replicas; the remaining factor is the number of micro-batches whose gradients are accumulated before each optimizer step. A larger `global_batch_size` often makes training more stable.

With tensor, pipeline, and context parallelism all set to 1, every GPU holds a complete copy of the model, so the model must fit in a single GPU's memory and no model-parallel communication overhead is incurred. When the job runs on several GPUs (the trainer later in this tutorial uses `devices=8`), the GPUs act as data-parallel replicas.
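
The relationship between the batch sizes and the parallelism settings can be written out as a small calculation. This is a sketch of the usual Megatron-style arithmetic, with `gradient_accumulation_steps` as a hypothetical helper name; the GPU count of 8 is taken from the `devices=8` setting used later in this tutorial.

```python
def gradient_accumulation_steps(
    global_batch_size,
    micro_batch_size,
    num_gpus,
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    context_parallel_size=1,
):
    """Number of micro-batches each data-parallel replica accumulates per optimizer step."""
    model_parallel = tensor_parallel_size * pipeline_parallel_size * context_parallel_size
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by TP * PP * CP")
    # GPUs not used for model parallelism act as data-parallel replicas.
    data_parallel_size = num_gpus // model_parallel
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * data_parallel_size")
    return global_batch_size // samples_per_pass


# Settings from this tutorial on 8 GPUs: 8 replicas x 8 samples = 64 samples per pass.
print(gradient_accumulation_steps(global_batch_size=256, micro_batch_size=8, num_gpus=8))
```

With the tutorial's values, each optimizer step therefore accumulates gradients over 256 / 64 = 4 micro-batches per GPU.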

In [None]:
sequence_length = 8192
tensor_parallel_size = 1
pipeline_parallel_size = 1
virtual_pipeline_parallel_size = 0
context_parallel_size = 1
sequence_parallel = False

In [None]:
hf_tokenizer_path = '/workspace/nemo/models/Llama-3.1-8B-Instruct/'

micro_batch_size = 8
global_batch_size = 256
load_optimizer = False

### Setting up the Data Module
Next, we will configure the data loader for our training job. The `llm.SquadDataModule` class is a data module provided by NeMo that is designed to handle the SQuAD dataset format. It automatically tokenizes the data and prepares it for training, adhering to the configurations we have already set.

We will instantiate this class, passing in our previously defined parameters:

- `seq_length` : The maximum sequence length for the model.
- `tokenizer` : The path to the Hugging Face tokenizer.
- `micro_batch_size` : The batch size per GPU.
- `global_batch_size` : The total effective batch size.

This ensures that our data preparation is consistent with the model's architecture and training settings.
```
train_dl = llm.SquadDataModule(
    seq_length=sequence_length,
    tokenizer=hf_tokenizer_path,
    micro_batch_size=micro_batch_size,
    global_batch_size=global_batch_size
)
```

In [None]:
train_dl = llm.SquadDataModule(
    seq_length=sequence_length,
    tokenizer=hf_tokenizer_path,
    micro_batch_size=micro_batch_size,
    global_batch_size=global_batch_size,
)

### Weights & Biases (WandB) Logging
To track our training progress, we will integrate Weights & Biases (WandB) logging. By using the `WandbLogger`, we can monitor key metrics like loss, learning rate, and other relevant information throughout the training process.

We will define an `experiment_name` and `wandb_project_name` to organize our runs within the WandB dashboard. Then, we will instantiate the `WandbLogger` with these names.

```
from lightning.pytorch.loggers import WandbLogger

experiment_name="sft-llama-3.1-nemo2-mcore-fp8"
wandb_project_name='nemo2-sft-tutorial'

wandb = WandbLogger(
    project=wandb_project_name,
    name=experiment_name
)
```

In [None]:
experiment_name="sft-llama-3.1-nemo2-mcore-fp8"
wandb_project_name='nemo2-sft-tutorial'

wandb = WandbLogger(
    project=wandb_project_name,
    name=experiment_name)

### Optimizer Configuration
In this step, we will configure the optimizer and learning rate scheduler for our SFT training. 

We will use the `distributed_fused_adam_with_cosine_annealing` function, which provides a high-performance Adam-based optimizer and a cosine annealing schedule. 

Configuration Parameters:

- `learning_rate=5e-6` : This is the maximum learning rate (`max_lr`) that the scheduler will use. A smaller learning rate is often used for fine-tuning to prevent the model from deviating too far from its pre-trained state.
- `warmup_steps=50` : This defines the number of steps during which the learning rate will gradually increase from a small value to the `max_lr`. Warm-up helps to stabilize training at the beginning.
- `min_lr=5e-7` : This is the minimum learning rate that the scheduler will anneal down to.

In [None]:
learning_rate = 5e-6
warmup_steps = 50
min_lr = 5e-7

optim_config = distributed_fused_adam_with_cosine_annealing(
    max_lr=learning_rate,
    min_lr=min_lr,
    warmup_steps=warmup_steps,
    adam_beta2=0.98,
)
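
To get a feel for how these three values shape the learning rate over a run, the schedule can be approximated in a few lines. This is a sketch of a generic linear-warmup plus cosine-decay formula, not NeMo's exact scheduler implementation, whose warmup start value and end-of-schedule handling may differ; `lr_at_step` is a hypothetical helper, and `max_steps=10000` is taken from the trainer settings used later in this tutorial.

```python
import math


def lr_at_step(step, max_lr=5e-6, min_lr=5e-7, warmup_steps=50, max_steps=10000):
    """Learning rate at a given step for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 25, 50, 5025, 10000):
    print(step, lr_at_step(step))
```

The rate peaks at `max_lr` once warmup ends and falls to `min_lr` by the final step.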

### Model and Tokenizer Initialization
Now that we have all the configurations in place, we can instantiate the tokenizer and the Llama model itself.

We will use the `get_nmt_tokenizer` function to load the tokenizer from our specified path. Then, we will create the model instance using the `LlamaModel` class, passing in the model's configuration and the tokenizer we just created. This prepares the model for the training process.

```
tokenizer = get_nmt_tokenizer(library='huggingface', model_name=hf_tokenizer_path)
config = Llama31Config8B()
model = LlamaModel(config=config, tokenizer=tokenizer)
```

In [None]:
tokenizer = get_nmt_tokenizer(library='huggingface', model_name=hf_tokenizer_path)
config = Llama31Config8B()
model = LlamaModel(config=config, tokenizer=tokenizer)

### FP8 Recipes: Scaling Strategies for Mixed Precision Training
This section introduces the different FP8 recipes, which are crucial for managing numerical stability and performance during mixed-precision training. These recipes are configured using the `MegatronMixedPrecision` plugin, which enables the use of FP8 for accelerated training on compatible hardware.

The choice of recipe depends on the desired balance between performance and training stability.

`bf16` : This recipe represents the baseline. It performs training using `bfloat16` mixed precision without any FP8 integration. This serves as a control for comparing the performance and accuracy of the FP8 recipes.

```
if recipe == "bf16":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
    )
```

`delayed` : This is an FP8 recipe that derives the scaling factors from a history of maximum absolute values (amax) recorded in previous iterations, rather than from the current tensor. This approach is generally stable and is often a good starting point for FP8 training.

- `fp8_recipe="delayed"` : Specifies the use of the delayed scaling strategy.
- `fp8_amax_history_len=1024` : Sets the length of the amax history used to compute the scaling factor.
- `fp8_amax_compute_algo="max"` : Determines the algorithm for computing the amax value from the history.

```
if recipe == "delayed":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        fp8="hybrid",
        fp8_recipe="delayed",
        fp8_margin=0,
        fp8_amax_history_len=1024,
        fp8_amax_compute_algo="max",
        fp8_param_gather=True,
    )
```
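
The idea behind delayed scaling can be sketched in plain Python. This is an illustration of the concept under simplifying assumptions, not the Transformer Engine implementation: `DelayedScaler` is a hypothetical class, the constant 448 is the largest finite E4M3 value, and the `margin` handling (dividing the scale by `2 ** margin`) is an assumption about how the margin backs the scale off.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


class DelayedScaler:
    """Keeps a rolling amax history and derives the scale from past iterations."""

    def __init__(self, history_len=1024, margin=0):
        self.history = deque(maxlen=history_len)  # oldest entries drop out automatically
        self.margin = margin

    def scale(self):
        """Scale for the current step, computed only from previously recorded amax values."""
        if not self.history:
            return 1.0
        amax = max(self.history)  # corresponds to the "max" compute algorithm
        if amax == 0.0:
            return 1.0
        return E4M3_MAX / amax / (2 ** self.margin)

    def record(self, tensor_values):
        """Store the amax of the tensor seen in this step for use in later steps."""
        self.history.append(max(abs(v) for v in tensor_values))


scaler = DelayedScaler(history_len=2)
for step_values in ([1.0, -2.0], [4.0], [0.5], [0.25]):
    print("scale used:", scaler.scale())
    scaler.record(step_values)
```

Because the scale lags behind the data, a longer history reacts more slowly to shrinking values but is less likely to overflow after an occasional spike.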

`tensorwise_fp8` : This recipe applies per-tensor scaling but offers additional control over which layers remain in bfloat16 precision for improved stability.

- `first_last_layers_bf16=True` : Keeps the first and last layers of the model in bf16 precision.
- `num_layers_at_start_in_bf16=1` : Specifies the number of layers at the beginning to keep in bf16.
- `num_layers_at_end_in_bf16=1` : Specifies the number of layers at the end to keep in bf16.

```
if recipe == "tensorwise_fp8":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        fp8="hybrid",
        fp8_recipe="tensorwise",
        first_last_layers_bf16=True,
        num_layers_at_start_in_bf16=1,
        num_layers_at_end_in_bf16=1,
        fp8_param_gather=True,
    )
```
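
The effect of the three layer-selection parameters can be pictured with a short helper. This is a sketch of the semantics described above, not Megatron's own logic: `layer_precisions` is a hypothetical function, and the layer count of 32 is the number of transformer layers in Llama 3.1 8B.

```python
def layer_precisions(
    num_layers,
    first_last_layers_bf16=True,
    num_layers_at_start_in_bf16=1,
    num_layers_at_end_in_bf16=1,
):
    """Return the compute precision ("bf16" or "fp8") assumed for each transformer layer."""
    precisions = []
    for index in range(num_layers):
        in_start = index < num_layers_at_start_in_bf16
        in_end = index >= num_layers - num_layers_at_end_in_bf16
        keep_bf16 = first_last_layers_bf16 and (in_start or in_end)
        precisions.append("bf16" if keep_bf16 else "fp8")
    return precisions


precisions = layer_precisions(num_layers=32)
print(precisions[0], precisions[1], precisions[-1], precisions.count("fp8"))
```

With the settings from the recipe, only the first and the last of the 32 layers stay in bf16, and the remaining 30 layers run in FP8.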

`mxfp8` : This recipe is a block scaling strategy supported on newer hardware such as NVIDIA Blackwell GPUs. It can provide better accuracy and stability by accommodating local variations in magnitude within a single tensor.

```
if recipe == "mxfp8":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        fp8="hybrid",
        fp8_recipe="mxfp8",
        fp8_param_gather=True,
    )
```

`blockwise_fp8` : This is an advanced per-block scaling strategy, offering a more granular approach to quantization by assigning a dedicated scaling factor to small, contiguous blocks within a tensor. 

```
if recipe == "blockwise_fp8":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        fp8="hybrid",
        fp8_recipe="blockwise",
        fp8_param_gather=True,
    )
```

In [None]:
recipe = "bf16"

In [None]:
if recipe == "bf16":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
    )
elif recipe == "delayed":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        fp8="hybrid",
        fp8_recipe="delayed",
        fp8_margin=0,
        fp8_amax_history_len=1024,
        fp8_amax_compute_algo="max",
        fp8_param_gather=True,
    )
elif recipe == "mxfp8":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        fp8="hybrid",
        fp8_recipe="mxfp8",
        fp8_param_gather=True,
    )
elif recipe == "tensorwise_fp8":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        fp8="hybrid",
        fp8_recipe="tensorwise",
        first_last_layers_bf16=True,
        num_layers_at_start_in_bf16=1,
        num_layers_at_end_in_bf16=1,
        fp8_param_gather=True,
    )
elif recipe == "blockwise_fp8":
    plugins = nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        fp8="hybrid",
        fp8_recipe="blockwise",
        fp8_param_gather=True,
    )
else:
    # Fail early instead of leaving `plugins` undefined for the Trainer.
    raise ValueError(f"Unknown recipe: {recipe!r}")

### Setting up the Trainer and Strategy

Finally, we will configure the `Trainer` and the `MegatronStrategy`. The `Trainer` handles the core training loop, while the `MegatronStrategy` is a specialized strategy for distributed training of large models using the Megatron-LM framework.

Trainer Parameters:
- `num_nodes=1` : The number of compute nodes to use.
- `devices=8` : The number of GPUs to use per node. In this example, we're using 8 GPUs on a single node.
- `max_steps=10000` : The total number of steps to run.
- `log_every_n_steps=10` : The frequency (in steps) to log training metrics.
- `val_check_interval=200` : The frequency (in steps) at which validation runs.
- `limit_val_batches=8` : The number of validation batches evaluated in each validation run.
- `accelerator="gpu"` : Specifies that training should run on GPUs.
- `strategy=strategy` : The distributed training strategy to use.
- `plugins=plugins` : The mixed precision plugin, which includes our FP8 recipe.
- `logger=wandb` : The logger to use for experiment tracking.

In [None]:
nodes = 1
gpu_devices = 8
max_steps = 10000
log_every_n_steps = 10
val_check_interval = 200
limit_val_batches = 8


strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=tensor_parallel_size,
    pipeline_model_parallel_size=pipeline_parallel_size,
    pipeline_dtype=torch.bfloat16,
    virtual_pipeline_parallel_size=virtual_pipeline_parallel_size,
    sequence_parallel=sequence_parallel,
    context_parallel_size=context_parallel_size,
    ckpt_load_optimizer=load_optimizer,
    ckpt_load_strictness="log_all",
)

trainer = nl.Trainer(
    num_nodes=nodes,
    devices=gpu_devices,
    max_steps=max_steps,
    log_every_n_steps=log_every_n_steps,
    val_check_interval=val_check_interval,
    limit_val_batches=limit_val_batches,
    accelerator="gpu",
    strategy=strategy,
    plugins=plugins,
    logger=wandb,
)

### Running the Fine-Tuning Job

With all the components configured, the final step is to start the fine-tuning process by calling the `llm.finetune` function. This function ties together the model, data, trainer, optimizer, and logging configurations to launch the training job.

Note: Since this tutorial is configured for multi-GPU training (`devices=8`), it cannot be run directly in a Jupyter Notebook environment. Jupyter Notebooks have limitations with multi-process execution, which is required for distributed training. 

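```
# `logger` and `resume` are not defined earlier in this tutorial. The two lines
# below are an assumed, typical NeMo 2.0 setup; "./results" is an example path.
logger = nl.NeMoLogger(log_dir="./results", name=experiment_name)
# Start from the converted checkpoint produced by 01_convert_to_nemo.py.
resume = nl.AutoResume(
    restore_config=nl.RestoreConfig(path="/workspace/nemo/models/Llama-3.1-8B-Instruct-Nemo"),
)

llm.finetune(
    model=model,
    data=train_dl,
    trainer=trainer,
    optim=optim_config,  # the optimizer config created earlier
    log=logger,
    resume=resume,
)
```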

Instead of running the cells in a notebook, save all of the code in a single Python file, for example `02_main.py`, and launch it from your terminal with `torchrun`:

`torchrun --nnodes 1 --nproc-per-node 8 02_main.py`