# Supervised Fine-Tuning (SFT)

**GOAL**

Given a pretrained foundation model such as Llama2-7b, or a domain-adaptive pretrained (DAPT) model, we can further customize it with curated, high-quality supervised training data to align its behavior with a specific task or with human preferences.

In this tutorial, we use an open-source Verilog code dataset in which a natural-language description of the code is the input and the Verilog code itself is the output. We demonstrate that a model fine-tuned on this dataset can generate domain-specific code from an input prompt, which is useful when building coding copilot applications for a specific domain.

**NeMo Tools and Resources**
* [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)


**Software Requirements**

1. Access to the latest NeMo Framework NGC containers.
2. This playbook has been tested on `nvcr.io/nvidia/nemo:dev`. It is expected to work similarly in other environments.

In your terminal, launch the NeMo Framework container:

In [None]:
docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:dev

In your terminal, launch Jupyter Notebook as follows:

In [None]:
jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''

**Hardware Requirements**

This playbook has been tested on 2x A100 80GB GPUs, but it can be scaled to more GPUs and to multiple nodes by modifying the corresponding configuration.

**Data**

This tutorial uses the open-source MG-Verilog dataset hosted on Hugging Face: [link](https://huggingface.co/datasets/GaTech-EIC/MG-Verilog)
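
As a rough illustration of the target format, the preprocessing in Step2 writes one JSON object per line, with an `input` field (the natural-language description) and an `output` field (the Verilog code). The sketch below builds one such line from a made-up record; the description and code are invented for illustration and are not taken from the dataset.

```python
import json

# Hypothetical record, invented for illustration only (not from MG-Verilog).
example = {
    "input": "Implement a 2-to-1 multiplexer that selects between inputs a and b.",
    "output": "assign y = sel ? b : a;\nendmodule",
}

# JSONL stores one JSON object per line; json.dumps escapes the newline
# inside the code string, so the record still fits on a single line.
line = json.dumps(example)

# Reading the line back recovers the original input/output pair.
record = json.loads(line)
print(record["input"])
print(record["output"])
```

Keeping each pair on a single line means the files can be read record by record without loading the whole split into memory.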

**Notebook Outline**
* Step1: download the Llama2-7b model from Hugging Face and convert it to .nemo format (you could also use the model checkpoint from a previous DAPT step)
* Step2: implement your customized DataModule class
* Step3: define the configs needed to put together an SFT recipe
* Step4: define a run task and execute the SFT step
* Step5: define an inference run task and execute the inference step
* Step6: evaluate the SFT model with the ROUGE score and compare its performance with the base model
  

**Step1**

Download the Llama-2-7b model from Hugging Face and convert it to .nemo format, then remove the original download once the conversion is complete.

In [None]:
!git lfs install
!git clone https://huggingface.co/meta-llama/Llama-2-7b-hf

In [None]:
# each "!" command runs in its own subshell, so chain cd and the conversion in one command
!cd Llama-2-7b-hf && python3 ../convert.py
!rm -rf Llama-2-7b-hf/

**Step2**: implement customized verilog DataModule class



To run a distributed training job on a customized dataset with NeMo 2.0, we need to create a customized data class that inherits from the base class *FineTuningDataModule*. In the code below, we define the constructor and specify the variables unique to the customized Verilog dataset.

In [None]:
import json
import re
import shutil
from typing import TYPE_CHECKING, Any, Dict, List, Optional

import numpy as np
from datasets import load_dataset

from nemo.collections.llm.gpt.data.core import get_dataset_root
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule
from nemo.lightning.io.mixin import IOMixin
from nemo.utils import logging

if TYPE_CHECKING:
    from nemo.collections.common.tokenizers import TokenizerSpec
    from nemo.collections.llm.gpt.data.packed_sequence import PackedSequenceSpecs

BLOCK_COMMON = "You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions.\n    <</SYS>>\n\n    Implement the Verilog module based on the following block level summaries. Assume that signals are positive clock/clk edge triggered unless otherwise stated.\nHere are block level summaries:\n\nblock_0:"
DETAILED_COMMON = "You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions.\n    <</SYS>>\n\n    Implement the Verilog module based on the following description. Assume that signals are positive clock/clk edge triggered unless otherwise stated."
HIGH_LEVEL_COMMON = "You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions.\n    <</SYS>>\n\n    Implement the Verilog module based on the following description. Assume that signals are positive clock/clk edge triggered unless otherwise stated."


## subclass the finetuning data module to create our own verilog data
class VerilogDataModule(FineTuningDataModule, IOMixin):
    def __init__(
        self,
        seq_length: int = 1024,
        tokenizer: Optional["TokenizerSpec"] = None,
        micro_batch_size: int = 2,
        global_batch_size: int = 8,
        rampup_batch_size: Optional[List[int]] = None,
        force_redownload: bool = False,
        delete_raw: bool = True,
        seed: int = 12,
        memmap_workers: int = 1,
        num_workers: int = 8,
        pin_memory: bool = True,
        persistent_workers: bool = False,
        packed_sequence_specs: Optional["PackedSequenceSpecs"] = None,
        dataset_kwargs: Optional[Dict[str, Any]] = None,
    ):
        self.force_redownload = force_redownload
        self.delete_raw = delete_raw

        super().__init__(
            # where the training, validation, and test .jsonl files are saved in NeMo format
            dataset_root=get_dataset_root("/workspace/data/verilog"),
            seq_length=seq_length,
            tokenizer=tokenizer,
            micro_batch_size=micro_batch_size,
            global_batch_size=global_batch_size,
            rampup_batch_size=rampup_batch_size,
            seed=seed,
            memmap_workers=memmap_workers,
            num_workers=num_workers,
            pin_memory=pin_memory,
            persistent_workers=persistent_workers,
            packed_sequence_specs=packed_sequence_specs,
        )

*prepare_data()* is the hook NeMo 2.0 calls by default, so a customized dataset class has to implement it. It should contain the logic to preprocess the data and split it into training, validation, and test sets. In the code below, *find_common_substrings()* is a helper for finding the text shared by all data pairs, which is boilerplate that carries no information and can be stripped during preprocessing. We also implement *_download_data()* to download the dataset from its source and *_preprocess_and_split_data()* to write the splits as JSONL files.

In [None]:
    @staticmethod
    def find_common_substrings(strings):
        # collect the words that appear in every string of the list
        common_substrings = set(re.findall(r'\w+', strings[0]))
        for string in strings[1:]:
            common_substrings &= set(re.findall(r'\w+', string))
        return common_substrings

    # override the base class function for data handling logic
    def prepare_data(self) -> None:
        # if train file is specified, no need to do anything
        if not self.train_path.exists() or self.force_redownload:
            dset = self._download_data()
            self._preprocess_and_split_data(dset)
        super().prepare_data()

    def _download_data(self):
        logging.info(f"Downloading {self.__class__.__name__}...")
        return load_dataset(
            "GaTech-EIC/MG-Verilog",
            cache_dir=str(self.dataset_root),
            download_mode="force_redownload" if self.force_redownload else None,
        )

    def _preprocess_and_split_data(self, dset, train_ratio: float = 0.80, val_ratio: float = 0.15):
        logging.info(f"Preprocessing {self.__class__.__name__} to jsonl format and splitting...")

        test_ratio = 1 - train_ratio - val_ratio
        save_splits = {}
        dataset = dset.get('train')
        split_dataset = dataset.train_test_split(test_size=val_ratio + test_ratio, seed=self.seed)
        split_dataset2 = split_dataset['test'].train_test_split(
            test_size=test_ratio / (val_ratio + test_ratio), seed=self.seed
        )
        save_splits['training'] = split_dataset['train']
        save_splits['validation'] = split_dataset2['train']
        save_splits['test'] = split_dataset2['test']

        for split_name, dataset in save_splits.items():
            output_file_high_level = self.dataset_root / f"{split_name}.jsonl"
            with output_file_high_level.open("w", encoding="utf-8") as f:
                for example in dataset:
                    code = example["code"].strip()
                    description = example["description"]
                    high_level_global_summary = description['high_level_global_summary']
                    high_level_global_summary = high_level_global_summary.replace(HIGH_LEVEL_COMMON, "")
                    f.write(json.dumps({"input": high_level_global_summary, "output": code}) + "\n")
            logging.info(f"{split_name} split saved to {output_file_high_level}")

        if self.delete_raw:
            for p in self.dataset_root.iterdir():
                if p.is_dir():
                    shutil.rmtree(p)
                elif '.jsonl' not in str(p.name):
                    p.unlink()
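
To see how the two-stage split above arrives at an 80/15/5 partition, the arithmetic can be sketched in plain Python. This is only an illustration: `two_stage_split_sizes` is a hypothetical helper, the dataset size of 1000 is made up, and the exact rounding used by `train_test_split` may differ by a sample.

```python
def two_stage_split_sizes(n, train_ratio=0.80, val_ratio=0.15):
    """Approximate (train, val, test) counts for a two-stage split.

    Hypothetical helper mirroring the ratio arithmetic used in
    _preprocess_and_split_data; it does not touch any real data.
    """
    test_ratio = 1 - train_ratio - val_ratio
    # Stage 1: hold out validation + test together.
    holdout_ratio = val_ratio + test_ratio
    # Stage 2: carve the test share out of the held-out part.
    second_test_size = test_ratio / holdout_ratio
    n_holdout = round(n * holdout_ratio)
    n_test = round(n_holdout * second_test_size)
    n_val = n_holdout - n_test
    n_train = n - n_holdout
    return n_train, n_val, n_test


sizes = two_stage_split_sizes(1000)
print(sizes)
```

The second split uses `test_ratio / (val_ratio + test_ratio)` because it operates only on the held-out 20%, so a quarter of that part corresponds to 5% of the full dataset.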

**Step3**: define configs needed to put together a SFT recipe

NeMo 2.0 wraps training elements into configs to simplify and modularize the training process. In the code snippet below, we create the following configs:
* verilog(): wraps the customized DataModule class and the logic implemented in it
* trainer(): specifies the training strategy, tensor parallel size, number of devices, number of training steps, validation interval, and so on
* logger(): specifies how intermediate results are logged and checkpointed; here we monitor the validation loss and save a checkpoint every 40 steps
* adam_with_cosine_annealing(): configures the Adam optimizer with prespecified parameters
* llama2_7b(): the base model to run SFT on
* resume(): points to the converted base checkpoint to start from; in case of a training interruption, it can also be used to continue from an intermediate checkpoint rather than training from the beginning

In [None]:
from pathlib import Path
from typing import List, Optional

import nemo_run as run
import pytorch_lightning as pl
import torch
import wandb
from lightning.pytorch.loggers import WandbLogger
from megatron.core.inference.common_inference_params import CommonInferenceParams
from megatron.core.optimizer import OptimizerConfig
from verilog_data_module import VerilogDataModule

from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm import Llama2Config7B
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed
from nemo.lightning.io.mixin import IOMixin


# configure custom dataset
def verilog() -> run.Config[pl.LightningDataModule]:
    return run.Config(VerilogDataModule, seq_length=1024, micro_batch_size=2, global_batch_size=8, num_workers=8)


# configure trainer class similar to pytorch lightning trainer
def trainer() -> run.Config[nl.Trainer]:
    strategy = run.Config(nl.MegatronStrategy, tensor_model_parallel_size=2)
    trainer = run.Config(
        nl.Trainer,
        devices=2,
        max_steps=200,
        accelerator="gpu",
        strategy=strategy,
        plugins=bf16_mixed(),
        log_every_n_steps=40,
        limit_val_batches=2,
        val_check_interval=20,
        num_sanity_val_steps=0,
    )
    return trainer


# configure the logger
def logger() -> run.Config[nl.NeMoLogger]:
    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=40,
        monitor="val_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )

    ## this is where the logs and checkpoints are written
    return run.Config(
        nl.NeMoLogger,
        name="sft_log",
        log_dir="/workspace",
        use_datetime_version=False,
        ckpt=ckpt,
        wandb=None,
    )


# configure the optimizer, adam with cosine annealing
def adam_with_cosine_annealing() -> run.Config[nl.OptimizerModule]:
    opt_cfg = run.Config(
        OptimizerConfig,
        optimizer="adam",
        lr=5e-5,
        adam_beta2=0.98,
        use_distributed_optimizer=True,
        clip_grad=1.0,
        bf16=True,
    )
    return run.Config(nl.MegatronOptimizerModule, config=opt_cfg)


# configure the base model
def llama2_7b() -> run.Config[pl.LightningModule]:
    return run.Config(llm.LlamaModel, config=run.Config(llm.Llama2Config7B))


# configure auto resume
def resume() -> run.Config[nl.AutoResume]:
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(
            nl.RestoreConfig,
            ## default path to save converted hf model
            path="/root/.cache/nemo/models/Llama-2-7b-hf",
        ),
        # requires completely saved checkpoint to resume from
        resume_if_exists=False,
    )
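
As a sanity check on the numbers in these configs, the sketch below works out how the batch settings relate to each other. It assumes the usual Megatron-style relationship in which the data-parallel size is the number of devices divided by the tensor-parallel size (with pipeline parallelism left at 1), and `batch_plan` is a hypothetical helper, not part of NeMo.

```python
def batch_plan(devices, tensor_parallel, micro_batch_size, global_batch_size, max_steps):
    """Derive data-parallel size, accumulation steps, and samples seen.

    Hypothetical helper; assumes pipeline parallel size is 1.
    """
    if devices % tensor_parallel != 0:
        raise ValueError("devices must be divisible by the tensor parallel size")
    data_parallel = devices // tensor_parallel
    # Samples processed by one forward/backward pass across all replicas.
    samples_per_micro_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError("global batch size must be a multiple of micro batch size x data parallel size")
    return {
        "data_parallel": data_parallel,
        "grad_accumulation_steps": global_batch_size // samples_per_micro_step,
        "samples_seen": global_batch_size * max_steps,
    }


# Values taken from verilog() and trainer() above.
plan = batch_plan(devices=2, tensor_parallel=2, micro_batch_size=2, global_batch_size=8, max_steps=200)
print(plan)
```

Under these assumptions the two GPUs hold a single model replica, so each optimizer step accumulates several micro-batches to reach the global batch size.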

**Step4**: define a run task and actually execute SFT

With the above configs defined, we are ready to put them together into a fine-tuning recipe, as shown in the code below.
To actually run the job, we use [NeMo-Run](https://github.com/NVIDIA/NeMo-Run/tree/main), a Pythonic and modular way to execute a predefined run. Here we use *LocalExecutor* to conduct the run; you can also choose other executors such as SlurmExecutor or SkypilotExecutor.
Finally, we put all the components together in the main function to download, preprocess, and split the data, and then run SFT.

**Note: this process will take a while depending on your hardware, so please be patient.**

In [None]:
# with all above components created, call NeMo2.0 finetune API
def configure_finetuning_recipe():
    return run.Partial(
        llm.finetune,
        model=llama2_7b(),
        trainer=trainer(),
        data=verilog(),
        log=logger(),
        optim=adam_with_cosine_annealing(),
        resume=resume(),
    )


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)
    return executor


def main():
    print("preprocess data!")
    verilog = VerilogDataModule()
    verilog_data = verilog._download_data()
    verilog._preprocess_and_split_data(verilog_data)
    print("running supervised fine tuning!")
    run.run(configure_finetuning_recipe(), executor=local_executor_torchrun())


if __name__ == "__main__":
    main()

**Step5**: define inference run task and do inference on the test data

We reuse the configs defined in the training step so that the settings are consistent between training and inference; otherwise you might run into errors. We only need to configure the inference step, which uses *llm.generate* instead of *llm.finetune*.
The test data path, the model checkpoint paths, and the output prediction paths need to be specified for both the base model and the SFT model.

Similar to SFT, we use NeMo-Run to execute the inference pipeline.

In [None]:
import os
from pathlib import Path
from typing import List, Optional

import nemo_run as run
import pytorch_lightning as pl
import torch
from megatron.core.inference.common_inference_params import CommonInferenceParams
from megatron.core.optimizer import OptimizerConfig
from run_sft import local_executor_torchrun, trainer

from nemo import lightning as nl
from nemo.collections import llm
from nemo.collections.llm import Llama2Config7B
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed
from nemo.lightning.io.mixin import IOMixin

input_data = "/workspace/data/verilog/test.jsonl"
base_llama_path = "/root/.cache/nemo/models/Llama-2-7b-hf"
sft_ckpt_path = str(
    next(
        (d for d in Path("/workspace/sft_log/checkpoints").iterdir() if d.is_dir() and d.name.endswith("-last")), None
    )
)

os.makedirs("/workspace/inference", exist_ok=True)
output_path_base = "/workspace/inference/base_llama_prediction.jsonl"
output_path_sft = "/workspace/inference/sft_prediction.jsonl"


# Configure inference to predict on base model checkpoint
def configure_inference_base():
    return run.Partial(
        llm.generate,
        path=str(base_llama_path),
        trainer=trainer(),
        input_dataset=input_data,
        inference_params=CommonInferenceParams(num_tokens_to_generate=50, top_k=1),
        output_path=output_path_base,
    )


# Configure inference to predict on trained SFT checkpoint
def configure_inference_sft():
    return run.Partial(
        llm.generate,
        path=str(sft_ckpt_path),
        trainer=trainer(),
        input_dataset=input_data,
        inference_params=CommonInferenceParams(num_tokens_to_generate=50, top_k=1),
        output_path=output_path_sft,
    )


if __name__ == '__main__':
    print("running inference on base model")
    run.run(configure_inference_base(), executor=local_executor_torchrun())
    print("running inference on supervised fine-tuned model")
    run.run(configure_inference_sft(), executor=local_executor_torchrun())
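
One thing to watch in the cell above: if no directory ending in `-last` exists, `str(next(..., None))` yields the string `"None"`, and the failure only shows up later during inference. A more defensive lookup could look like the sketch below. `find_last_checkpoint` is a hypothetical helper, the directory names in the demo are made up, and returning the lexicographically last match is an arbitrary choice for the case where several directories match.

```python
import tempfile
from pathlib import Path
from typing import Union


def find_last_checkpoint(ckpt_dir: Union[str, Path]) -> Path:
    """Return the checkpoint directory whose name ends with '-last'.

    Hypothetical helper: raises instead of silently returning None.
    """
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        raise FileNotFoundError(f"checkpoint directory not found: {ckpt_dir}")
    candidates = sorted(d for d in ckpt_dir.iterdir() if d.is_dir() and d.name.endswith("-last"))
    if not candidates:
        raise FileNotFoundError(f"no '*-last' checkpoint found in {ckpt_dir}")
    return candidates[-1]


# Demo on a throwaway directory with made-up checkpoint names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "hypothetical-step=40").mkdir()
    (root / "hypothetical-step=200-last").mkdir()
    found_name = find_last_checkpoint(root).name

print(found_name)
```

In the inference script, the result of such a helper could replace the `sft_ckpt_path` expression so that a missing checkpoint fails early with a clear message.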

**Step6**: evaluate the SFT model with ROUGE score and compare performance with base model

Now both base_llama_prediction.jsonl and sft_prediction.jsonl should have been generated, containing the two models' predictions for the test inputs. The test file also provides the ground truth, and we use the ROUGE score as the metric, which measures the n-gram overlap between each prediction and its ground truth, averaged over the test set. Specify which field you want to evaluate on; in this case it is "output".
You should expect a low score for the base model and a much higher score for the SFT model.

In [None]:
!python3 /opt/NeMo/scripts/metric_calculation/compute_rouge.py --ground-truth /workspace/data/verilog/test.jsonl --preds /workspace/inference/base_llama_prediction.jsonl --answer-field "output" 
!python3 /opt/NeMo/scripts/metric_calculation/compute_rouge.py --ground-truth /workspace/data/verilog/test.jsonl --preds /workspace/inference/sft_prediction.jsonl --answer-field "output"
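
To build intuition for what an n-gram overlap score measures, here is a minimal ROUGE-N sketch in plain Python. It is not the implementation used by `compute_rouge.py`: it assumes whitespace tokenization and no stemming, `rouge_n` is a hypothetical helper, and the two Verilog snippets are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap as precision, recall, and F1."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: the prediction differs from the reference in one token.
reference = "assign y = a & b ; endmodule"
prediction = "assign y = a | b ; endmodule"
scores = rouge_n(prediction, reference, n=1)
print(scores)
```

A prediction that shares most tokens with the reference scores close to 1, while unrelated text scores close to 0, which is why the fine-tuned model is expected to score higher than the base model on this dataset.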