# Learning Goals

## Supervised Finetuning (SFT)

Often we want to adapt or customize foundation models to perform better on a specific task. Fine-tuning refers to modifying the weights of a pre-trained foundation model using additional custom data. Supervised fine-tuning (SFT) refers to unfreezing all the weights and layers in the model and training on a newly labeled set of examples. We can fine-tune to incorporate new, domain-specific knowledge or to teach the foundation model what type of response to provide. One specific type of SFT is “instruction tuning”, where SFT is used to teach a model to follow instructions better. In this playbook, we will demonstrate how to perform SFT with Llama3-8b using NeMo 2.0.

## NeMo 2.0

In NeMo 1.0, the main interface for configuring experiments is through YAML files. This approach allows for a declarative way to set up experiments, but it has limitations in terms of flexibility and programmatic control. NeMo 2.0 is an update on the NeMo Framework which introduces several significant improvements over its predecessor, NeMo 1.0, enhancing flexibility, performance, and scalability.

- Python-Based Configuration - NeMo 2.0 transitions from YAML files to a Python-based configuration, providing more flexibility and control. This shift makes it easier to extend and customize configurations programmatically.

- Modular Abstractions - By adopting PyTorch Lightning’s modular abstractions, NeMo 2.0 simplifies adaptation and experimentation. This modular approach allows developers to more easily modify and experiment with different components of their models.

- Scalability - NeMo 2.0 seamlessly scales large-scale experiments across thousands of GPUs using NeMo-Run, a powerful tool designed to streamline the configuration, execution, and management of machine learning experiments across computing environments.

By adopting PyTorch Lightning’s modular abstractions, NeMo 2.0 makes it easy for users to adapt the framework to their specific use cases and experiment with various configurations. This section offers an overview of the new features in NeMo 2.0 and includes a migration guide with step-by-step instructions for transitioning your models from NeMo 1.0 to NeMo 2.0.


## Software Requirements

1. Use the latest [NeMo Framework Training container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). Note that you must be logged in to the container registry to view this page.

2. This notebook uses the container: `nvcr.io/nvidia/nemo:dev`  


## Hardware Requirements

- Minimum 8xA100 80G (1 node) for SFT on 7B and 13B

- SFT can be run on all (7B/13B/70B) model sizes on multiple nodes


## Data
Databricks-dolly-15k is an open-source dataset created by the collaborative efforts of Databricks employees. It consists of high-quality human-generated prompt/response pairs specifically designed for instruction tuning LLMs. These pairs cover a diverse range of behaviors, from brainstorming and content generation to information extraction and summarization. 

For more information, refer to [databricks-dolly-15k | Hugging Face](https://huggingface.co/datasets/databricks/databricks-dolly-15k)

## Step 0: Go inside the Docker container

You can start and enter the dev container by running:
```
docker run --gpus device=1 --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:dev bash

```


## Step 1: Import Hugging Face checkpoint
First, request download permission from Meta and Hugging Face. Then log in through `huggingface-cli` using your Hugging Face token before importing Llama 3 models.

```
$ huggingface-cli login
```

Once logged in, you can use the following script to import a Hugging Face model. Based on the provided model configuration (`Llama3-8b` in the example below), the `llm.import_ckpt` API will download the specified model using the "hf://<huggingface_model_id>" URL format. It will then convert the model into NeMo 2.0 format. 


In [1]:
import nemo_run as run
from nemo import lightning as nl
from nemo.collections import llm
from megatron.core.optimizer import OptimizerConfig
import torch
import pytorch_lightning as pl
from pathlib import Path
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed


# llm.import_ckpt is the nemo2 API for converting Hugging Face checkpoint to NeMo format
# example usage:
# llm.import_ckpt(model=llm.llama3_8b.model(), source="hf://meta-llama/Meta-Llama-3-8B")
#
# We use run.Partial to configure this function
def configure_checkpoint_conversion():
    return run.Partial(
        llm.import_ckpt,
        model=llm.llama3_8b.model(),
        source="hf://meta-llama/Meta-Llama-3-8B",
        overwrite=False,
    )

# configure your function
import_ckpt = configure_checkpoint_conversion()
# define your executor
local_executor = run.LocalExecutor()

# run your experiment
run.run(import_ckpt, executor=local_executor)


Log directory is: /root/.nemo_run/experiments/nemo.collections.llm.api.import_ckpt/nemo.collections.llm.api.import_ckpt_1731693470/nemo.collections.llm.api.import_ckpt
Launched app: local_persistent://nemo_run/nemo.collections.llm.api.import_ckpt-f0rwwn6vt74ckc
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/nemo.collections.llm.api.import_ckpt/nemo.collections.llm.api.import_ckpt_1731693470/nemo.collections.llm.api.import_ckpt/nemo_run/nemo.collections.llm.api.import_ckpt-f0rwwn6vt74ckc
    


Waiting for job nemo.collections.llm.api.import_ckpt-f0rwwn6vt74ckc to finish [log=True]...


mport_ckpt/0 Downloading shards: 100%|██████████| 4/4 [00:00<00:00, 4853.11it/s]
mport_ckpt/0 Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  3.24it/s]
mport_ckpt/0 [NeMo I 2024-11-15 09:57:59 megatron_strategy:310] Fixing mis-match between ddp-config & mcore-optimizer config
mport_ckpt/0 [NeMo I 2024-11-15 09:57:59 megatron_init:396] Rank 0 has data parallel group : [0]
mport_ckpt/0 [NeMo I 2024-11-15 09:57:59 megatron_init:402] Rank 0 has combined group of data parallel and context parallel : [0]
mport_ckpt/0 [NeMo I 2024-11-15 09:57:59 megatron_init:407] All data parallel group ranks with context parallel combined: [[0]]
mport_ckpt/0 [NeMo I 2024-11-15 09:57:59 megatron_init:410] Ranks 0 has data parallel rank: 0
mport_ckpt/0 [NeMo I 2024-11-15 09:57:59 megatron_init:418] Rank 0 has context parallel group: [0]
mport_ckpt/0 [NeMo I 2024-11-15 09:57:59 megatron_init:421] All context parallel group ranks: [[0]]

Job nemo.collections.llm.api.import_ckpt-f0rwwn6vt74ckc finished: SUCCEEDED
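
The deferred-call pattern used in the cell above can be illustrated without NeMo. The sketch below uses Python's standard `functools.partial` as a rough analogy for `run.Partial`: arguments are bound first, and the call happens later, when something decides to run it. `fake_import_ckpt` is a hypothetical stand-in and not the real `llm.import_ckpt`; NeMo-Run's `run.Partial` does more than this analogy shows.

```python
from functools import partial


def fake_import_ckpt(model: str, source: str, overwrite: bool = False) -> str:
    """Hypothetical stand-in for a checkpoint import function."""
    return f"import {source} as {model} (overwrite={overwrite})"


# Bind the arguments now; nothing is executed yet.
configured = partial(
    fake_import_ckpt,
    model="llama3_8b",
    source="hf://meta-llama/Meta-Llama-3-8B",
    overwrite=False,
)

# The call happens only when the configured object is invoked,
# which is the role the executor plays in the notebook.
result = configured()
print(result)
```

Separating configuration from execution is what allows the same configured task to be handed to different executors.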


## Step 2: Prepare data and customize DataModule

We will be using Databricks-dolly-15k for this notebook. NeMo 2.0 already provides a `DollyDataModule`. Example usage:

In [2]:
def dolly() -> run.Config[pl.LightningDataModule]:
    return run.Config(llm.DollyDataModule, seq_length=2048, micro_batch_size=1, global_batch_size=8, num_workers=0)

To use your own data, you will need to create a custom `DataModule`. This involves extending the base class `FineTuningDataModule`, so that you have access to existing data handling logic such as packed sequence. Here we walk you through the process step by step using the already existing [`DollyDataModule`](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/data/dolly.py) as an example. 

### 1. Subclass the FineTuningDataModule
To fine-tune NeMo models on your own data, extend `FineTuningDataModule`. This provides access to existing data handling logic, such as packed sequences. The `dataset_root` parameter is the directory where the generated `training.jsonl`, `validation.jsonl`, and `test.jsonl` files are stored in NeMo format. Below is how `DollyDataModule` does it:

In [3]:
from datasets import load_dataset
from typing import TYPE_CHECKING, List, Optional
from nemo.collections.common.tokenizers import TokenizerSpec
from nemo.lightning.io.mixin import IOMixin
from nemo.collections.llm.gpt.data.core import get_dataset_root  # used in __init__ below
from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule
import json
from nemo.utils import logging
import numpy as np
import shutil

class DollyDataModule(FineTuningDataModule, IOMixin):
    def __init__(
        self,
        seq_length: int = 2048,
        tokenizer: Optional["TokenizerSpec"] = None,
        micro_batch_size: int = 4,
        global_batch_size: int = 8,
        rampup_batch_size: Optional[List[int]] = None,
        force_redownload: bool = False,
        delete_raw: bool = True,
        seed: int = 1234,
        memmap_workers: int = 1,
        num_workers: int = 8,
        pin_memory: bool = True,
        persistent_workers: bool = False,
        pad_to_max_length: bool = False,
        packed_sequence_size: int = -1,
    ):
        self.force_redownload = force_redownload
        self.delete_raw = delete_raw

        super().__init__(
            dataset_root=get_dataset_root("dolly"),
            seq_length=seq_length,
            tokenizer=tokenizer,
            micro_batch_size=micro_batch_size,
            global_batch_size=global_batch_size,
            rampup_batch_size=rampup_batch_size,
            seed=seed,
            memmap_workers=memmap_workers,
            num_workers=num_workers,
            pin_memory=pin_memory,
            persistent_workers=persistent_workers,
            pad_to_max_length=pad_to_max_length,
            packed_sequence_size=packed_sequence_size,
        )

### 2. Override the `prepare_data` Method

The `prepare_data` method is responsible for downloading and preprocessing data if needed. If the dataset is already downloaded, you can skip this step.



In [4]:
def prepare_data(self) -> None:
    # if train file is specified, no need to do anything
    if not self.train_path.exists() or self.force_redownload:
        dset = self._download_data()
        self._preprocess_and_split_data(dset)
    super().prepare_data()
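
The condition in `prepare_data` can be tried in isolation. The following is a minimal sketch, assuming only that the decision depends on whether the training file exists and on the `force_redownload` flag; `needs_preparation` is a hypothetical helper name, not part of NeMo.

```python
import tempfile
from pathlib import Path


def needs_preparation(train_path: Path, force_redownload: bool) -> bool:
    """Return True when the dataset must be downloaded and preprocessed."""
    return (not train_path.exists()) or force_redownload


with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "training.jsonl"

    # No training file yet: preparation is required.
    missing = needs_preparation(train_path, force_redownload=False)

    # Once the file exists, preparation is skipped unless it is forced.
    train_path.write_text('{"input": "hi", "output": "hello"}\n', encoding="utf-8")
    cached = needs_preparation(train_path, force_redownload=False)
    forced = needs_preparation(train_path, force_redownload=True)

print(missing, cached, forced)
```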

### 3. Implement Data Download and Preprocessing Logic

If your dataset requires downloading or preprocessing, implement this logic within helper methods. Skip the download part if it's not needed.

In [5]:
def _download_data(self):
    logging.info(f"Downloading {self.__class__.__name__}...")
    return load_dataset(
        "databricks/databricks-dolly-15k",
        cache_dir=str(self.dataset_root),
        download_mode="force_redownload" if self.force_redownload else None,
    )

def _preprocess_and_split_data(self, dset, train_ratio: float = 0.80, val_ratio: float = 0.15):
    logging.info(f"Preprocessing {self.__class__.__name__} to jsonl format and splitting...")

    test_ratio = 1 - train_ratio - val_ratio
    save_splits = {}
    dataset = dset.get('train')
    split_dataset = dataset.train_test_split(test_size=val_ratio + test_ratio, seed=self.seed)
    split_dataset2 = split_dataset['test'].train_test_split(
        test_size=test_ratio / (val_ratio + test_ratio), seed=self.seed
    )
    save_splits['training'] = split_dataset['train']
    save_splits['validation'] = split_dataset2['train']
    save_splits['test'] = split_dataset2['test']

    for split_name, dataset in save_splits.items():
        output_file = self.dataset_root / f"{split_name}.jsonl"
        with output_file.open("w", encoding="utf-8") as f:
            for example in dataset:
                context = example["context"].strip()
                if context != "":
                    # Randomize context and instruction order.
                    context_first = np.random.randint(0, 2) == 0
                    if context_first:
                        instruction = example["instruction"].strip()
                        assert instruction != ""
                        _input = f"{context}\n\n{instruction}"
                        _output = example["response"]
                    else:
                        instruction = example["instruction"].strip()
                        assert instruction != ""
                        _input = f"{instruction}\n\n{context}"
                        _output = example["response"]
                else:
                    _input = example["instruction"]
                    _output = example["response"]

                f.write(json.dumps({"input": _input, "output": _output, "category": example["category"]}) + "\n")

        logging.info(f"{split_name} split saved to {output_file}")

    if self.delete_raw:
        for p in self.dataset_root.iterdir():
            if p.is_dir():
                shutil.rmtree(p)
            elif '.jsonl' not in str(p.name):
                p.unlink()
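
The two-stage split in `_preprocess_and_split_data` can be checked with a little arithmetic. This sketch only reproduces the ratio calculation with the default `train_ratio` and `val_ratio`; it does not call `train_test_split`, so the exact per-split record counts (which depend on rounding) are not shown.

```python
train_ratio, val_ratio = 0.80, 0.15
test_ratio = 1 - train_ratio - val_ratio

# First split: hold out validation and test together.
first_holdout = val_ratio + test_ratio
# Second split: fraction of the held-out part that becomes the test set.
second_test_fraction = test_ratio / (val_ratio + test_ratio)

# Effective fractions of the full dataset after both splits.
effective = {
    "training": 1 - first_holdout,
    "validation": first_holdout * (1 - second_test_fraction),
    "test": first_holdout * second_test_fraction,
}

for name, fraction in effective.items():
    print(f"{name}: {fraction:.2f}")
```

The second `test_size` is expressed relative to the held-out portion, which is why it is 0.25 rather than 0.05.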

An original example in the Dolly dataset looks like:
```
{'instruction': 'Extract all the movies from this passage and the year they were released out. Write each movie as a separate sentence', 'context': "The genre has existed since the early years of silent cinema, when Georges Melies' A Trip to the Moon (1902) employed trick photography effects. The next major example (first in feature length in the genre) was the film Metropolis (1927). From the 1930s to the 1950s, the genre consisted mainly of low-budget B movies. After Stanley Kubrick's landmark 2001: A Space Odyssey (1968), the science fiction film genre was taken more seriously. In the late 1970s, big-budget science fiction films filled with special effects became popular with audiences after the success of Star Wars (1977) and paved the way for the blockbuster hits of subsequent decades.", 'response': 'A Trip to the Moon was released in 1902. Metropolis came out in 1927. 2001: A Space Odyssey was released in 1968. Star Wars came out in 1977.', 'category': 'information_extraction'}
```
After the preprocessing logic, the data examples are transformed into NeMo format, as below:
```
{'input': "Extract all the movies from this passage and the year they were released out. Write each movie as a separate sentence\n\nThe genre has existed since the early years of silent cinema, when Georges Melies' A Trip to the Moon (1902) employed trick photography effects. The next major example (first in feature length in the genre) was the film Metropolis (1927). From the 1930s to the 1950s, the genre consisted mainly of low-budget B movies. After Stanley Kubrick's landmark 2001: A Space Odyssey (1968), the science fiction film genre was taken more seriously. In the late 1970s, big-budget science fiction films filled with special effects became popular with audiences after the success of Star Wars (1977) and paved the way for the blockbuster hits of subsequent decades.", 'output': 'A Trip to the Moon was released in 1902. Metropolis came out in 1927. 2001: A Space Odyssey was released in 1968. Star Wars came out in 1977.', 'category': 'information_extraction'}
```
Each data example is saved as a JSON string on one line of the `training.jsonl`, `validation.jsonl`, or `test.jsonl` file, under the `dataset_root` directory specified earlier.
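
For a custom dataset, the same `input`/`output` layout can be produced with the standard library alone. The sketch below is illustrative: the two records are made up, `to_nemo_format` is a hypothetical helper, and the instruction is always placed before the context, whereas the Dolly preprocessing above randomizes the order.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw records using Dolly-style field names.
raw_records = [
    {"instruction": "Name a primary color.", "context": "",
     "response": "Red.", "category": "open_qa"},
    {"instruction": "Summarize the text.", "context": "NeMo is a framework.",
     "response": "It is a framework.", "category": "summarization"},
]


def to_nemo_format(example: dict) -> dict:
    """Merge instruction and context into `input`; keep the response as `output`."""
    instruction = example["instruction"].strip()
    context = example["context"].strip()
    _input = f"{instruction}\n\n{context}" if context else instruction
    return {"input": _input, "output": example["response"], "category": example["category"]}


with tempfile.TemporaryDirectory() as tmp:
    output_file = Path(tmp) / "training.jsonl"

    # Write one JSON object per line.
    with output_file.open("w", encoding="utf-8") as f:
        for example in raw_records:
            f.write(json.dumps(to_nemo_format(example)) + "\n")

    # Read the file back to confirm that every line parses on its own.
    with output_file.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]

print(loaded[1]["input"])
```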

## Step 3: Run SFT with NeMo 2.0 API 

The following Python script uses the NeMo 2.0 API to perform SFT. The script configures the components below for training. These components are similar for SFT and PEFT, since both use the `llm.finetune` API. To switch from PEFT to SFT, you only need to remove the `peft` parameter.

### Trainer
The NeMo 2.0 Trainer works similarly to the PyTorch Lightning trainer. You can specify `MegatronStrategy` as your model parallel strategy to use NVIDIA's Megatron-LM framework and pass in configurations as below:



In [6]:
def trainer() -> run.Config[nl.Trainer]:
    strategy = run.Config(
        nl.MegatronStrategy,
        tensor_model_parallel_size=2
    )
    trainer = run.Config(
        nl.Trainer,
        devices=2,
        max_steps=20,
        accelerator="gpu",
        strategy=strategy,
        plugins=bf16_mixed(),
        log_every_n_steps=1,
        limit_val_batches=2,
        val_check_interval=2,
        num_sanity_val_steps=0,
    )
    return trainer
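
The batch sizes configured in `dolly()` earlier interact with the parallelism settings of this trainer. As a rough sketch of the arithmetic, assuming the usual Megatron-style relationship in which the global batch is divided across data-parallel replicas and the remainder is covered by gradient accumulation (the helper names are illustrative and not NeMo APIs):

```python
def data_parallel_size(num_gpus: int, tensor_parallel: int,
                       pipeline_parallel: int = 1, context_parallel: int = 1) -> int:
    """GPUs left for data parallelism after model parallelism is accounted for."""
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by the model-parallel size")
    return num_gpus // model_parallel


def accumulation_steps(global_batch_size: int, micro_batch_size: int, dp_size: int) -> int:
    """Micro-batches each replica processes per optimizer step."""
    samples_per_pass = micro_batch_size * dp_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * dp_size")
    return global_batch_size // samples_per_pass


# Values taken from the notebook: devices=2, tensor_model_parallel_size=2,
# micro_batch_size=1, global_batch_size=8, max_steps=20.
dp = data_parallel_size(num_gpus=2, tensor_parallel=2)
accum = accumulation_steps(global_batch_size=8, micro_batch_size=1, dp_size=dp)
samples_seen = 20 * 8

print(dp, accum, samples_seen)
```

With both GPUs used for tensor parallelism there is a single data-parallel replica, so each optimizer step accumulates gradients over eight micro-batches.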


### Logger
Configure your training steps, output directories and logging through `NeMoLogger`. In the following example, the experiment output will be saved at `./results/nemo2_sft`.



In [7]:
def logger() -> run.Config[nl.NeMoLogger]:
    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=10,
        monitor="reduced_train_loss",
        save_top_k=1,
        save_on_train_epoch_end=True,
        save_optim_on_train_end=True,
    )

    return run.Config(
        nl.NeMoLogger,
        name="nemo2_sft",
        log_dir="./results",
        use_datetime_version=False,
        ckpt=ckpt,
        wandb=None
    )



### Optimizer
In the following example, we use the distributed Adam optimizer and pass in the optimizer configuration through `OptimizerConfig`:




In [8]:
def adam_with_cosine_annealing() -> run.Config[nl.OptimizerModule]:
    opt_cfg = run.Config(
        OptimizerConfig,
        optimizer="adam",
        lr=5e-6,
        adam_beta2=0.98,
        use_distributed_optimizer=True,
        clip_grad=1.0,
        bf16=True,
    )
    return run.Config(
        nl.MegatronOptimizerModule,
        config=opt_cfg
    )


### Base Model
We will perform SFT on top of Llama3-8b so we create a `LlamaModel` to pass to finetune API.

In [9]:
def llama3_8b() -> run.Config[pl.LightningModule]:
    return run.Config(llm.LlamaModel, config=run.Config(llm.Llama3Config8B))

### AutoResume
In NeMo 2.0 we can directly pass in Llama3-8b's Hugging Face ID to start SFT without manually converting it into NeMo checkpoint format like in NeMo 1.0.

In [10]:
def resume() -> run.Config[nl.AutoResume]:
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(nl.RestoreConfig,
            path="nemo://meta-llama/Meta-Llama-3-8B"
        ),
        resume_if_exists=True,
    )


### NeMo 2.0 finetune API
Using all the components we created above, we can call the NeMo 2.0 finetune API:
```
llm.finetune(
    model=llama3_8b(),
    data=dolly(),
    trainer=trainer(),
    log=logger(),
    optim=adam_with_cosine_annealing(),
    resume=resume(),
)
```
Below is a Python script that you can save as a file, for example `nemo2-sft.py`, to run SFT training using all the components created above and the NeMo 2.0 finetune API. The script cannot be executed directly in an interactive environment such as a notebook. When using multiple GPUs, run it with `torchrun --nproc_per_node=<NUM_GPU> nemo2-sft.py`.

In [11]:
def configure_finetuning_recipe():
    return run.Partial(
        llm.finetune,
        model=llama3_8b(),
        trainer=trainer(),
        data=dolly(),
        log=logger(),
        optim=adam_with_cosine_annealing(),
        resume=resume(),
    )


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
        "NVTE_FUSED_ATTN": "0",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

if __name__ == '__main__':
    run.run(configure_finetuning_recipe(), executor=local_executor_torchrun())

Log directory is: /root/.nemo_run/experiments/nemo.collections.llm.api.finetune/nemo.collections.llm.api.finetune_1731693538/nemo.collections.llm.api.finetune
Launched app: local_persistent://nemo_run/nemo.collections.llm.api.finetune-bsqgzflc7xzftd
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/nemo.collections.llm.api.finetune/nemo.collections.llm.api.finetune_1731693538/nemo.collections.llm.api.finetune/nemo_run/nemo.collections.llm.api.finetune-bsqgzflc7xzftd
    


Waiting for job nemo.collections.llm.api.finetune-bsqgzflc7xzftd to finish [log=True]...


i.finetune/0 W1115 09:58:59.485000 140737350272832 torch/distributed/run.py:778] 
i.finetune/0 W1115 09:58:59.485000 140737350272832 torch/distributed/run.py:778] *****************************************
i.finetune/0 W1115 09:58:59.485000 140737350272832 torch/distributed/run.py:778] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
i.finetune/0 W1115 09:58:59.485000 140737350272832 torch/distributed/run.py:778] *****************************************
i.finetune/0 I1115 09:58:59.485000 140737350272832 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
i.finetune/0 I1115 09:58:59.485000 140737350272832 torch/distributed/launcher/api.py:188]   entrypoint       : nemo_run.core.runners.fdl_runner
i.finetune/0 I1115 09:58:59.485000 140737350272832 torch/distributed/launcher/api.py:188]   min_node

Job nemo.collections.llm.api.finetune-bsqgzflc7xzftd finished: SUCCEEDED


## Step 4: Evaluation

We use the `llm.generate` API in NeMo 2.0 to generate results from the trained SFT checkpoint. Find your last saved checkpoint from your experiment dir: `results/nemo2_sft/checkpoints`. 

In [12]:
sft_ckpt_path=str(next((d for d in Path("./results/nemo2_sft/checkpoints/").iterdir() if d.is_dir() and d.name.endswith("-last")), None))
print("We will load SFT checkpoint from:", sft_ckpt_path)

We will load SFT checkpoint from: results/nemo2_sft/checkpoints/nemo2_sft--reduced_train_loss=1.5063-epoch=0-last
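
Note that the one-liner above yields the string `"None"` when no matching directory exists, because `str(None)` is applied to the fallback value. A more defensive variant is sketched below; `find_last_checkpoint` is a hypothetical helper, and the directory names in the demonstration are created in a temporary folder for illustration only.

```python
import tempfile
from pathlib import Path


def find_last_checkpoint(ckpt_dir: Path) -> Path:
    """Return the checkpoint directory whose name ends with '-last'."""
    candidates = sorted(
        d for d in ckpt_dir.iterdir() if d.is_dir() and d.name.endswith("-last")
    )
    if not candidates:
        raise FileNotFoundError(f"No '-last' checkpoint found in {ckpt_dir}")
    return candidates[-1]


# Demonstrate on a temporary directory with illustrative checkpoint names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nemo2_sft--reduced_train_loss=1.7000-epoch=0").mkdir()
    (root / "nemo2_sft--reduced_train_loss=1.5063-epoch=0-last").mkdir()
    found_name = find_last_checkpoint(root).name

print(found_name)
```

Failing early with a clear error avoids passing an invalid path on to `llm.generate`.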


When using the `llm.generate` API, you can pass a data module such as dolly: `input_dataset=dolly()`. This uses the test set from the specified data module to generate predictions. In the following example, the generated predictions are saved to the `sft_prediction.jsonl` file. Note that while fine-tuning above used `tensor_model_parallel_size=2` and therefore at least 2 GPUs, generating predictions only requires `tensor_model_parallel_size=1`. However, using multiple GPUs can speed up inference.

In [13]:
from megatron.core.inference.common_inference_params import CommonInferenceParams


def trainer() -> run.Config[nl.Trainer]:
    strategy = run.Config(
        nl.MegatronStrategy,
        tensor_model_parallel_size=1,
        pipeline_model_parallel_size=1,
        context_parallel_size=1,
        sequence_parallel=False,
        setup_optimizers=False,
        store_optimizer_states=False,
    )
    trainer = run.Config(
        nl.Trainer,
        accelerator="gpu",
        devices=1,
        num_nodes=1,
        strategy=strategy,
        plugins=bf16_mixed(),
    )
    return trainer

def configure_inference():
    return run.Partial(
        llm.generate,
        path=str(sft_ckpt_path),
        trainer=trainer(),
        input_dataset=dolly(),
        inference_params=CommonInferenceParams(num_tokens_to_generate=20, top_k=1),
        output_path="sft_prediction.jsonl",
    )


def local_executor_torchrun(nodes: int = 1, devices: int = 1) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
        "NVTE_FUSED_ATTN": "0",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

if __name__ == '__main__':
    run.run(configure_inference(), executor=local_executor_torchrun())


Log directory is: /root/.nemo_run/experiments/nemo.collections.llm.api.generate/nemo.collections.llm.api.generate_1731693822/nemo.collections.llm.api.generate
Launched app: local_persistent://nemo_run/nemo.collections.llm.api.generate-lzdnjbxr7thbv
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/nemo.collections.llm.api.generate/nemo.collections.llm.api.generate_1731693822/nemo.collections.llm.api.generate/nemo_run/nemo.collections.llm.api.generate-lzdnjbxr7thbv
    


Waiting for job nemo.collections.llm.api.generate-lzdnjbxr7thbv to finish [log=True]...


i.generate/0 I1115 10:03:43.883000 140737350272832 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
i.generate/0 I1115 10:03:43.883000 140737350272832 torch/distributed/launcher/api.py:188]   entrypoint       : nemo_run.core.runners.fdl_runner
i.generate/0 I1115 10:03:43.883000 140737350272832 torch/distributed/launcher/api.py:188]   min_nodes        : 1
i.generate/0 I1115 10:03:43.883000 140737350272832 torch/distributed/launcher/api.py:188]   max_nodes        : 1
i.generate/0 I1115 10:03:43.883000 140737350272832 torch/distributed/launcher/api.py:188]   nproc_per_node   : 1
i.generate/0 I1115 10:03:43.883000 140737350272832 torch/distributed/launcher/api.py:188]   run_id           : 159
i.generate/0 I1115 10:03:43.883000 140737350272832 torch/distributed/launcher/api.py:188]   rdzv_backend     : c10d
i.generate/0 I1115 10:03:43.883000 140737350272832 torch/distributed/launcher/api.py:188]   rdzv_endpoint    : localhost:0
i.generate/0 I1115 10:03:4

Job nemo.collections.llm.api.generate-lzdnjbxr7thbv finished: SUCCEEDED


After the inference is complete, you will see results similar to the following:

In [14]:
%%bash
head -n 3 sft_prediction.jsonl

{"input": "What is best creator's platform", "category": "brainstorming", "label": "Youtube. Youtube should be best creator platform", "prediction": " for video content creators. YouTube is best creator's platform for video content creators."}
{"input": "When was the last time the Raiders won the Super Bowl?", "category": "open_qa", "label": "The Raiders have won three Super Bowl championships (1977, 1981, and 1984), one American Football League (AFL) championship (1967), and four American Football Conference (AFC) titles. The most recent Super Bowl ring was won in 1984 against the Washington Redskins of the NFC.", "prediction": " 2003"}
{"input": "Muckle Water is a long, narrow fresh water loch on Ward Hill on Rousay, Orkney, Scotland. It is the biggest loch on the island and is popular for fishing. It can be reached by a track from the roadside. The Suso Burn on the north eastern shore drains the loch into the Sound of Rousay.\n\nWhere is Muckle Water?", "category": "closed_qa", "lab