# P-Tuning Nemotron-3 With A Custom Dataset
Nemotron-3 is a family of Large Language Models that provides compelling responses on a wide range of tasks. While the 8B-parameter base model serves as a strong baseline for many downstream tasks, it can lack domain-specific knowledge and has no access to proprietary or otherwise sensitive information. Fine-tuning is often used to adapt a model to one or more specific tasks so that it responds better to domain-specific prompts. This notebook walks through downloading the Nemotron-3 8B model from Hugging Face, preparing a custom dataset, and p-tuning the base model against the dataset.

Before we begin, feel free to try our own Nemotron-3 8B model fine-tuned to this QA task [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/nemo-8b-qa). By the end of this workflow, you will see how we achieved those results by fine-tuning the base model on extractive QA data, and you will have created your own version of it!

## Model Preparation
The model needs to be downloaded prior to being fine-tuned. The following blocks walk through this process.

### Downloading the model
First, the 8B-base variant of Nemotron-3 needs to be downloaded to your machine. To download the model, follow the instructions [here](https://huggingface.co/nvidia/nemotron-3-8b-base-4k) to accept the NVIDIA AI Foundation Models Community License Agreement for access to the models in the Nemotron family. Please note that your Hugging Face account email address MUST match the email you provide on NVIDIA's developer site, or your request will not be approved.

Once approved, use your Hugging Face username and API key to download Nemotron-3 8B (the base, non-chat version) to the workstation where you will be fine-tuning the model. To pull the model files, navigate on your local machine to the folder you specified as the mount for ```/project/models``` and run ```git lfs clone https://huggingface.co/<namespace>/<repo-name>``` against [NVIDIA's HF repository](https://huggingface.co/nvidia/nemotron-3-8b-base-4k/tree/main).

Once the repository is cloned locally, confirm that the model is in the correct location: ```/models/nemotron-3-8b-base-4k``` should appear in the left-hand file browser panel of this JupyterLab instance. Ensure the ```.nemo``` file for the model is present inside the directory.
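
The same check can be done programmatically. The following is a minimal sketch using only the standard library; `find_nemo_files` is a hypothetical helper name, and the path in the usage comment assumes the `/project/models` mount described above.

```python
from pathlib import Path


def find_nemo_files(model_dir):
    """Return the sorted list of .nemo files directly inside model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        # A missing directory simply yields no checkpoints
        return []
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix == ".nemo")


# Example call (path assumed from the mount described above):
# find_nemo_files("/project/models/nemotron-3-8b-base-4k")
```

An empty list means either the directory is not where it is expected or the `.nemo` file did not finish downloading.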

## Preparing The Dataset
The dataset being used for fine-tuning needs to be converted to a .jsonl file and follow a specific format. In general, question and answer datasets are easiest to work with by providing context (if applicable), a question, and the expected answer, though different downstream tasks work as well.

### Downloading the dataset
This notebook will use the [Dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k) - available on Hugging Face - as an example. However, it can be replaced with other datasets available on Hugging Face as well. If using a dataset not available on Hugging Face, manually upload or download the dataset into the `data` directory of the project and follow the steps below.

As the output of the cell below shows, the dataset is currently in the form of a Hugging Face `DatasetDict` with 15,015 entries, each including an `instruction`, `context`, `response`, and `category` field. For custom datasets, ensure the data has a similar structure, with one unique item per row and the same fields in every row. The fields do not need to match those of the Dolly dataset, but they should be consistent across rows.
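
For custom data, the "same fields in every row" requirement can be checked before conversion. The following is a minimal sketch on plain Python dictionaries; `inconsistent_rows` and the sample rows are hypothetical and are not part of the `datasets` library.

```python
def inconsistent_rows(rows):
    """Return indices of rows whose field names differ from the first row's."""
    rows = list(rows)
    if not rows:
        return []
    expected = set(rows[0].keys())
    return [i for i, row in enumerate(rows) if set(row.keys()) != expected]


# Made-up rows for illustration; the last one is missing two fields
sample = [
    {"instruction": "q1", "context": "", "response": "a1", "category": "open_qa"},
    {"instruction": "q2", "context": "c2", "response": "a2", "category": "closed_qa"},
    {"instruction": "q3", "response": "a3"},
]
print(inconsistent_rows(sample))  # [2]
```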

In [None]:
from datasets import load_dataset
from omegaconf import OmegaConf
import os
os.environ['OPENBLAS_NUM_THREADS'] = '8'

dataset = load_dataset("aisquared/databricks-dolly-15k")
dataset

### Preprocessing the dataset
Some datasets contain rows or fields that are not needed. For this example with the Dolly dataset, we keep only the rows whose `category` starts with `closed_qa`, since those pair each question with a context passage, and drop the rest. Additionally, the `instruction`, `response`, and `category` fields are renamed to `question`, `answer`, and `taskname`, respectively. NeMo Toolkit requires p-tuning datasets to include `taskname` to specify which task an element belongs to in the case of multiple tasks. Here, we set the taskname to `genqa` for all items. In general, these fields are the most common to work with. Other field names can be used, but the configuration step later in this notebook may then need to be updated to reflect them.

In [None]:
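def set_column(example):
    # Overwrite the task name so every row belongs to the same task
    example["taskname"] = 'genqa'
    return example

# Keep only the rows whose category starts with "closed_qa"
dataset = dataset.filter(lambda example: example["category"].startswith("closed_qa"))
# Rename the fields to the names used in the prompt template later on
dataset = dataset.rename_column("instruction", "question")
dataset = dataset.rename_column("response", "answer")
dataset = dataset.rename_column("category", "taskname")
dataset = dataset.map(set_column)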

After filtering the dataset above, we can see the first item in the dataset has a `question`, `context`, `answer`, and `taskname` field. Note that the following cell reads from the `train` split, so it might not work for custom datasets whose subsets are named differently.

In [None]:
dataset['train'][0]

### Splitting the dataset into train and validation files

The prompt learning dataset loader accepts a list of JSON/dictionary objects or a list of JSON file names, where each file contains a collection of JSON objects. Each JSON object must include the field `taskname`, a string identifier for the task the data example corresponds to. Each object should also include one or more fields corresponding to different sections of the discrete text prompt. The input data looks like:
```
[
    {"taskname": "genqa", "context": [CONTEXT_PARAGRAPH_TEXT1], "question": [QUESTION_TEXT1], "answer": [ANSWER_TEXT1]},
    {"taskname": "genqa", "context": [CONTEXT_PARAGRAPH_TEXT2], "question": [QUESTION_TEXT2], "answer": [ANSWER_TEXT2]},
]
```
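
Before training, it can help to confirm that every line of a `.jsonl` file parses and carries the expected fields. The following is a minimal sketch, assuming the four field names from the example above; `validate_jsonl_lines` and `REQUIRED_FIELDS` are illustrative names, not NeMo APIs.

```python
import json

# Field names assumed from the example above; adjust for custom datasets
REQUIRED_FIELDS = ("taskname", "context", "question", "answer")


def validate_jsonl_lines(lines, required=REQUIRED_FIELDS):
    """Yield (line_number, problem) for each line that fails a check."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            yield number, f"invalid JSON: {err.msg}"
            continue
        if not isinstance(record, dict):
            yield number, "not a JSON object"
            continue
        missing = [field for field in required if field not in record]
        if missing:
            yield number, "missing fields: " + ", ".join(missing)
        elif not isinstance(record.get("taskname"), str):
            yield number, "taskname is not a string"
```

It could be applied to a file with `list(validate_jsonl_lines(open(path)))`; an empty list means no problems were found.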

To create the files, we split the dataset into train and validation files: the first 90% of the rows go into a `*train.jsonl` file and the remainder into a `*val.jsonl` file. The split percentage can be changed, and rows can instead be randomly sampled from the dataset if desired.
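
If the rows are ordered in some meaningful way, a shuffled split may be preferable to taking the first 90%. The following is a minimal sketch using only the standard library; `shuffled_split` and its `seed` parameter are hypothetical names, and the integers stand in for dataset rows.

```python
import random


def shuffled_split(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of items reproducibly and split it into (train, val)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # a fixed seed keeps the split repeatable
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]


train_rows, val_rows = shuffled_split(range(10), train_fraction=0.9, seed=0)
print(len(train_rows), len(val_rows))  # 9 1
```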

In [None]:
DATA_DIR = "/project/data"
os.makedirs(DATA_DIR, exist_ok=True)

In [None]:
import json

# Fraction of rows written to the training file; the rest go to validation
train_val_split = 0.9
rows = dataset['train']
split_index = int(train_val_split * len(rows))

# Write the first 90% of the rows to the training file
with open(os.path.join(DATA_DIR, "dolly_train.jsonl"), 'w') as f:
    for index, item in enumerate(rows):
        if index < split_index:
            f.write(json.dumps(item) + "\n")

# Write the remaining rows to the validation file
with open(os.path.join(DATA_DIR, "dolly_val.jsonl"), 'w') as f:
    for index, item in enumerate(rows):
        if index >= split_index:
            f.write(json.dumps(item) + "\n")

Let's take a look at a row in the training dataset file. As in the previous dataset output, it has the fields `question`, `context`, `answer`, and `taskname`. The `train` and `val` files share this format, and both are now ready to be used for p-tuning the model.

In [None]:
!head -1 $DATA_DIR/dolly_train.jsonl

## Configuring the job
With the dataset preparation finished, we need to update the default configuration for our fine-tuning job. The sample config file provided by NeMo is a good template to base our changes on. Let's load the file as an object that we can edit.

In [None]:
config = OmegaConf.load("/opt/NeMo/examples/nlp/language_modeling/tuning/conf/megatron_gpt_peft_tuning_config.yaml")

With the config loaded, we can override certain settings for our environment. Many of the default values shown here would work but some key points are called out below:

* `config.trainer.precision="32"` - This is the precision that will be used during p-tuning. The model might be more accurate with higher values but it also uses more memory than lower precisions. If the p-tuning process runs out of memory, try reducing the precision here.
* `config.trainer.devices=1` - This is the number of devices that will be used. If running on a multi-GPU system, increase this number as appropriate.
* `config.model.restore_from_path="/project/models/nemotron-3-8b-base-4k/Nemotron-3-8B-Base-4k.nemo"` - This is the path to the `.nemo` checkpoint downloaded at the beginning of the notebook. If the path changed, update it here.
* `config.model.data.train_ds.file_names` and `config.model.data.validation_ds.file_names` - If a different filename or path was used for the dataset files created earlier, specify the new values here.
* `config.model.global_batch_size` - If using a higher GPU count or if additional GPU memory allows, this value can be increased for higher performance. Note that higher batch sizes use more GPU memory.
* `config.model.data.train_ds.prompt_template` - If different field names were used during the dataset creation earlier, update them here with the intended field names. This is what NeMo Toolkit will look for in each dataset element.
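
To see roughly what a training example looks like once the template is filled in, the template string can be rendered with plain `str.format`. This only approximates what NeMo Toolkit does (it also tokenizes, truncates, and treats the label field separately), and `render_prompt` and the example record are hypothetical.

```python
# Same template string as in the configuration cell
PROMPT_TEMPLATE = "Context: {context}\n\nQuestion: {question}\n\nAnswer:{answer}"


def render_prompt(record, template=PROMPT_TEMPLATE):
    """Fill the template from a record; raises KeyError if a field is missing."""
    # Extra keys such as "taskname" are ignored by str.format
    return template.format(**record)


example = {
    "taskname": "genqa",
    "context": "The Eco Lawn Mower has deployable solar panel flaps.",
    "question": "Is the lawn mower solar powered?",
    "answer": "Yes.",
}
print(render_prompt(example))
```

A `KeyError` here is a quick signal that the field names in the dataset and in the template have drifted apart.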

In [None]:
config.trainer.precision="32"
config.trainer.devices=1
config.trainer.num_nodes=1
config.trainer.max_epochs=3 
config.model.restore_from_path="/project/models/nemotron-3-8b-base-4k/Nemotron-3-8B-Base-4k.nemo"
config.model.peft.peft_scheme="ptuning"
config.model.data.train_ds.file_names=["/project/data/dolly_train.jsonl"] 
config.model.data.validation_ds.file_names=["/project/data/dolly_val.jsonl"]
config.model.global_batch_size=4
config.model.micro_batch_size=1 
config.model.optim.lr=0.0001
config.model.data.train_ds.concat_sampling_probabilities=[1.0] 
config.model.data.train_ds.prompt_template="Context: {context}\n\nQuestion: {question}\n\nAnswer:{answer}" 
config.model.peft.p_tuning.virtual_tokens=15 
config.model.data.train_ds.label_key="answer"
config.model.data.train_ds.truncation_field="context"
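
The batch-size settings above are linked: in Megatron-style training, each global batch is assembled from micro-batches across data-parallel ranks and gradient-accumulation steps. The helper below is a hypothetical sketch of that arithmetic (it is not a NeMo function) and assumes tensor and pipeline parallel sizes of 1 unless stated otherwise.

```python
def grad_accumulation_steps(global_batch_size, micro_batch_size, devices,
                            num_nodes=1, tensor_parallel=1, pipeline_parallel=1):
    """Return how many micro-batches each rank accumulates per optimizer step."""
    world_size = devices * num_nodes
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel:
        raise ValueError("world size must be divisible by the model-parallel size")
    data_parallel = world_size // model_parallel
    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass:
        raise ValueError("global batch size must be divisible by "
                         "micro batch size * data-parallel size")
    return global_batch_size // samples_per_pass


# Values from the configuration cell: global=4, micro=1, one GPU on one node
print(grad_accumulation_steps(4, 1, devices=1))  # 4
```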

With the config settings updated, save the config as a `.yaml` file and move it to the p-tuning configuration directory, where NeMo Toolkit can read it during p-tuning.

In [None]:
# OmegaConf.save can also accept a `str` or `pathlib.Path` instance:
OmegaConf.save(config, "nemotron-config.yaml")

In [None]:
!mv nemotron-config.yaml /opt/NeMo/examples/nlp/language_modeling/tuning/conf/

## Launching the job
With the model downloaded, the dataset prepped, and the config set, it is now time to launch the p-tuning job! The following block launches the job on the specified number of GPUs. Depending on the size of the dataset and the GPU used, this could take anywhere from a few minutes to several hours to finish. As the model is tuned, checkpoints are saved in the `nemo_experiments` directory inside the container. These checkpoints contain the learned prompt embeddings, which are supplied alongside inference requests to a deployed model so that it responds as expected for the tuned task.

In [None]:
!python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \
    --config-name=nemotron-config.yaml
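
Once the job finishes, the saved checkpoints can be located programmatically. The following is a minimal sketch that searches a directory tree for the most recently modified `.nemo` file; `newest_nemo_checkpoint` is a hypothetical helper, and the exact layout under `nemo_experiments` may differ between NeMo versions.

```python
from pathlib import Path


def newest_nemo_checkpoint(experiments_dir):
    """Return the most recently modified .nemo file under experiments_dir, or None."""
    root = Path(experiments_dir)
    if not root.is_dir():
        return None
    candidates = [p for p in root.rglob("*.nemo") if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Example call (directory name taken from the text above):
# newest_nemo_checkpoint("nemo_experiments")
```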

### Configuring the evaluation script
The configuration file for the evaluation job needs to be updated based on the provided template to reflect the changes for the experiment here. Load the template config and update some of the settings to match the local environment. Note the following settings may differ for custom datasets:
* `config.model.data.test_ds.file_names` - List any prediction files that should be used to evaluate the model. In general, it is recommended that these differ from the training and validation files used during p-tuning. For simplicity's sake, we generate a single example here. You may add more if you wish.
* `config.model.data.test_ds.names` - Set the name of the test dataset here. The name appears in the predictions filename (`dolly` in this example).

In [None]:
# Let's create our test file. 

import json
test = [{"question": "Is the lawn mower product solar powered?", 
         "context": "The Auto Chef Master is a personal kitchen robot that effortlessly turns raw ingredients into gourmet meals with the precision of a Michelin-star chef. The Eco Lawn Mower is a high-tech lawn mower with deployable solar panel flaps that provides an eco-friendly and efficient way to maintain your lawn.", 
         "answer": "Yes, the Eco Lawn Mower is solar powered.", 
         "taskname": "genqa"}]
        
with open('/project/data/dolly_test.jsonl', 'w+') as outfile:
    for entry in test:
        json.dump(entry, outfile)
        outfile.write('\n')


In [None]:
# Load the template config file
config = OmegaConf.load("/opt/NeMo/examples/nlp/language_modeling/tuning/conf/megatron_gpt_peft_eval_config.yaml")

# Override required settings
config.model.peft.peft_scheme="ptuning"
config.model.restore_from_path="/project/models/nemotron-3-8b-base-4k/Nemotron-3-8B-Base-4k.nemo"
config.model.peft.restore_from_path="/project/code/nemo_experiments/megatron_gpt_peft_tuning/checkpoints/megatron_gpt_peft_tuning.nemo"
config.model.data.test_ds.file_names=["/project/data/dolly_test.jsonl"]
config.model.data.test_ds.names="dolly"
config.model.data.test_ds.global_batch_size=2
config.model.data.test_ds.micro_batch_size=1
config.model.data.test_ds.write_predictions_to_file=True
config.model.data.test_ds.output_file_path_prefix="/project/code/predictions"
# Use the same template as in training so the evaluation prompts match
config.model.data.test_ds.prompt_template="Context: {context}\n\nQuestion: {question}\n\nAnswer:{answer}"

# Save the new config file
OmegaConf.save(config, "nemotron-eval-config.yaml")

Once the config is saved, evaluation can be launched below. Depending on the hardware and the number of inference examples, this may take a few minutes to complete. Results will be saved to `code/predictions_test_dolly_inputs_preds_labels.jsonl`.

In [None]:
!mv nemotron-eval-config.yaml /opt/NeMo/examples/nlp/language_modeling/tuning/conf/
!python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_eval.py --config-name=nemotron-eval-config.yaml

Note that the evaluation script may take a while to run, depending on the hardware and the number of examples, because it runs in a training container that is not optimized for inference. When the fine-tuned model is ready for deployment, it can be moved to an optimized inference framework such as Triton and/or TensorRT-LLM.

After the evaluation script completes, view the results. Keep in mind that the quality of the results may vary for a variety of reasons. Further hyperparameter tuning and output post-processing may lead to higher-quality responses. The point is that fine-tuning the out-of-the-box model to a general QA task is straightforward with this workflow!

In [None]:
!head -n 1 /project/code/predictions_test_dolly_inputs_preds_labels.jsonl
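
For a larger test file, a single number such as an exact-match rate is easier to read than individual rows. The sketch below assumes each line of the predictions file is a JSON object with `pred` and `label` keys; that key naming is an assumption inferred from the filename, so check it against the `head` output first. `normalize` and `exact_match_rate` are hypothetical helpers, and the sample rows are made up.

```python
import json


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_match_rate(lines, pred_key="pred", label_key="label"):
    """Return the fraction of rows whose normalized prediction equals the label."""
    total = 0
    hits = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        row = json.loads(line)
        total += 1
        if normalize(row[pred_key]) == normalize(row[label_key]):
            hits += 1
    return hits / total if total else 0.0


# Made-up rows using the assumed key names
sample = [
    json.dumps({"pred": "Yes, it is.", "label": "yes, it  is."}),
    json.dumps({"pred": "No.", "label": "Yes."}),
]
print(exact_match_rate(sample))  # 0.5
```

Exact match is a strict metric for generative answers, so a low score does not necessarily mean the responses are wrong; reading a few rows by hand remains useful.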