In [None]:
# Copyright 2024 NVIDIA Corporation. All Rights Reserved.

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width:60px; float:right"><br>
# <font color="#76b900">**Finetuning LLM for Triplet Prediction**<br/>with NVIDIA NIM microservice</font>

**Welcome To Your Cloud Environment!** This interactive web application, which you're currently using to run Python code, is more than just a simple interface. When you access this Jupyter Notebook, an instance on a cloud platform is allocated to you by the [**NVIDIA Deep Learning Institute (DLI)**](https://www.nvidia.com/en-us/training/). This forms your base cloud environment, essentially a blank canvas for further setup, and includes:

- A dedicated CPU, and possibly a GPU, for processing.
- A pre-installed base operating system.
- A pre-installation of packages necessary to run the lab.

### Learning Objectives 

### **Fine-Tuning a Smaller LLM for Accurate Triplet Predictions**  

In this tutorial, we will **fine-tune a smaller Large Language Model (LLM) for more accurate triplet predictions** using [**NVIDIA NeMo**](https://www.nvidia.com/en-in/ai-data-science/products/nemo/) and [**NVIDIA Inference Microservices (NIM)**](https://www.nvidia.com/en-in/ai/).  

#### **Introduction to NVIDIA NeMo and NIM**  
[NVIDIA NeMo](https://www.nvidia.com/en-in/ai-data-science/products/nemo/) is a **scalable, cloud-native generative AI framework** designed for researchers and developers working with Large Language Models, Multimodal AI, and Speech AI (e.g., Automatic Speech Recognition and Text-to-Speech). It allows users to efficiently create, customize, and deploy generative AI models by leveraging existing code and pre-trained model checkpoints.  

[NVIDIA Inference Microservices (NIM)](https://www.nvidia.com/en-in/ai/) is a **suite of microservices** that enables fast and seamless deployment of AI models. NIM can be used on-premises or in **DGX Cloud**, allowing users to transition models to self-managed hosting with minimal code changes. These microservices are designed to scale dynamically based on load and run efficiently on GPUs.  

#### **Why Fine-Tune a Smaller LLM?**  
Large Language Models (LLMs) are trained for a wide range of tasks. However, for this specific use case, we only need the model to predict **triplets** from given text. Instead of deploying a large LLM, we use **LLM distillation** to train a smaller, more efficient model that retains the accuracy of a larger model while consuming fewer computational resources.  

**LLM Distillation** is a process where a **large LLM (teacher model)** is used to train a **smaller LLM (student model)**. The smaller model learns by replicating the teacher’s output, achieving similar accuracy with reduced computational overhead.  

While teacher models provide high accuracy, they are resource-intensive. Deploying them for a single task is often inefficient. Instead, a **fine-tuned student model** offers significantly better throughput while meeting business-related performance KPIs.  

#### **Tutorial Overview**  
In this tutorial, we will fine-tune the **Llama 3 8B** model using NVIDIA NeMo and deploy it with NVIDIA NIM. We will cover the following:  

- **Dataset Preparation**: How to collect and preprocess data for LLM distillation  
- **Fine-Tuning Llama 3 8B**: Setting up and fine-tuning the model using NVIDIA NeMo (the model is pre-downloaded, and the necessary Python scripts are provided)  
- **Deploying the Fine-Tuned Model**: Using NVIDIA NIM for efficient model deployment  
- **Querying the Deployed Model**: Interacting with the model to make predictions  
- **Enhancing Accuracy**: Additional techniques to improve model performance  

The complete process of fine-tuning and deployment is summarized in the image below:  

![](assets/e2e-lora-train-and-deploy.png)  


### Importing necessary modules

In [None]:
import os
import json
import re
import random
from pprint import pprint
import requests
import urllib.request


**This step defines the directories and the output file path used in the data processing pipeline:**

1. `TRIPLES_DIR`: Path to the JSON files containing the triples for the corresponding files in `RAW_JSON_DIR`.
2. `RAW_JSON_DIR`: Path to the raw JSON files that contain unprocessed SEC data.
3. `OUTPUT_JSONL`: Path where the processed data is saved in JSON Lines (JSONL) format, where each line is a separate JSON object.

In [None]:
# Define directories
TRIPLES_DIR = "/workspace/data/triples_10k"  # Directory containing triples JSON files
RAW_JSON_DIR = "/workspace/data"             # Directory containing raw JSON files
OUTPUT_JSONL = "/workspace/data/training_data/output.jsonl"      # Output file

<br/>

## <font color="#76b900">**1. Dataset for Distillation**</font>


The code below defines three helper functions:

1. **`clean_text` function**: Cleans the input text by removing extra spaces, tabs, newlines, and non-printable characters.

2. **`read_raw_json_item_1` function**: Reads the `item_1` key from a specified raw SEC JSON file. It returns an empty string if the file is not found or cannot be decoded.

3. **`process_triples_and_raw_json` function**:
   - It processes files in the `triples_dir` directory (which should contain JSON files with triples data).
   - For each triple file, it:
     - Reads the corresponding raw JSON file (based on the `filename` field in the triples JSON).
     - Cleans the text in the `item_1` field of the raw JSON.
     - Cleans and processes the `item_1a` field from the triples file (treated as triples).
     - Creates a JSONL entry with `input` as the cleaned `item_1` text and `output` as the cleaned triples.
     - Writes each JSONL entry to the `output_file`.

In [None]:
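def clean_text(text):
    """
    Clean text by collapsing newlines, tabs, and repeated spaces, and removing non-printable characters.
    """
    if text is None:
        return ""
    text = re.sub(r'\s+', ' ', text)          # Collapse whitespace first so words split by \n or \t are not glued together
    text = re.sub(r'[^\x20-\x7E]', '', text)  # Remove non-printable (and non-ASCII) characters
    return re.sub(r' +', ' ', text).strip()   # Collapse spaces left behind by removed characters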

def read_raw_json_item_1(raw_json_dir, filename):
    """
    Read the 'item_1' key from the specified raw JSON file.
    """
    raw_file_path = os.path.join(raw_json_dir, filename)
    if not os.path.exists(raw_file_path):
        print(f"Raw JSON file not found: {raw_file_path}")
        return ""

    with open(raw_file_path, 'r', encoding='utf-8') as f:
        try:
            data = json.load(f)
            return clean_text(data.get("item_1", ""))
        except json.JSONDecodeError:
            print(f"Error decoding JSON: {raw_file_path}")
            return ""

def process_triples_and_raw_json(triples_dir, raw_json_dir, output_file):
    """
    Process all triple files and corresponding raw JSON files.
    Generate JSONL entries with 'input' (item_1 text) and 'output' (cleaned triples).
    """
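    # Make sure the output directory exists before opening the file for writing
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
    with open(output_file, 'w', encoding='utf-8') as out_f: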
        # Iterate through all files in the triples directory
        for file_name in os.listdir(triples_dir):
            file_path = os.path.join(triples_dir, file_name)

            # Process only .txt files (their contents are expected to be JSON)
            if not file_name.endswith(".txt"):
                continue

            with open(file_path, 'r', encoding='utf-8') as triple_f:
                try:
                    triple_data = json.load(triple_f)
                    
                    # Extract raw JSON filename
                    raw_json_filename = triple_data.get("filename")
                    if not raw_json_filename:
                        print(f"No 'filename' field in {file_name}")
                        continue

                    # Read 'item_1' from raw JSON file
                    item_1_text = read_raw_json_item_1(raw_json_dir, raw_json_filename)
                    # Process triples
                    triples = triple_data.get("item_1a", [])
                    
                    output_triples = clean_text(str(triples))
                    # Create JSONL entry
                    jsonl_entry = {
                        "input": item_1_text,
                        "output": output_triples
                    }
                    out_f.write(json.dumps(jsonl_entry, ensure_ascii=False) + "\n")
                    out_f.flush()
                
                except json.JSONDecodeError:
                    print(f"Error decoding JSON file: {file_path}")
                    continue

# Run the process
process_triples_and_raw_json(TRIPLES_DIR, RAW_JSON_DIR, OUTPUT_JSONL)
print(f"Processing complete. Output saved to {OUTPUT_JSONL}.")



The code below defines a function that splits a JSONL file into training, validation, and test datasets and writes each split to a separate file:

In [None]:
# Input/Output File Paths
TRAIN_FILE = "../data/training_data/sec_train.jsonl"         # Train dataset
VALID_FILE = "../data/training_data/sec_val.jsonl"           # Validation dataset
TEST_FILE = "../data/training_data/sec_test.jsonl"           # Test dataset

In [None]:
def split_jsonl(input_file, train_file, valid_file, test_file, train_ratio=0.8, valid_ratio=0.1, test_ratio=0.1):
    """
    Splits the input JSONL file into train, validation, and test datasets.
    """
    # Read all lines from the input JSONL file
    with open(input_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    # Shuffle the lines randomly to ensure data distribution
    random.shuffle(lines)

    # Calculate the split indices
    total_lines = len(lines)
    train_split = int(total_lines * train_ratio)
    valid_split = int(total_lines * valid_ratio)

    # Split the data
    train_data = lines[:train_split]
    valid_data = lines[train_split:train_split + valid_split]
    test_data = lines[train_split + valid_split:]

    # Write the train dataset
    with open(train_file, 'w', encoding='utf-8') as train_f:
        train_f.writelines(train_data)
    print(f"Train dataset created with {len(train_data)} records: {train_file}")

    # Write the validation dataset
    with open(valid_file, 'w', encoding='utf-8') as valid_f:
        valid_f.writelines(valid_data)
    print(f"Validation dataset created with {len(valid_data)} records: {valid_file}")

    # Write the test dataset
    with open(test_file, 'w', encoding='utf-8') as test_f:
        test_f.writelines(test_data)
    print(f"Test dataset created with {len(test_data)} records: {test_file}")


# Ensure reproducibility
random.seed(42)

# Split the JSONL file
split_jsonl(OUTPUT_JSONL, TRAIN_FILE, VALID_FILE, TEST_FILE)
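
Because the split indices are computed with `int()` truncation, the test split receives whatever remains after the train and validation slices. The sketch below mirrors that arithmetic with a hypothetical helper, `split_sizes`, and an illustrative record count of 103 (an assumption, not the actual dataset size).

```python
def split_sizes(total_lines, train_ratio=0.8, valid_ratio=0.1):
    """Mirror the index arithmetic of split_jsonl: truncate train and validation, give the rest to test."""
    train_n = int(total_lines * train_ratio)
    valid_n = int(total_lines * valid_ratio)
    test_n = total_lines - train_n - valid_n  # the test split absorbs the rounding remainder
    return train_n, valid_n, test_n


# 103 is an illustrative count, not the real dataset size
sizes = split_sizes(103)
print(sizes)  # (82, 10, 11)
```

With very small datasets the validation slice can truncate to zero records, so it is worth checking the printed record counts before training.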


Apply an additional cleaning step, because malformed JSON lines in a JSONL file often halt training midway.

In [None]:
def sanitize_jsonl(input_file, output_file):
    """Sanitizes a JSONL file by re-serializing valid lines and skipping malformed ones."""
    with open(input_file, 'r', encoding='utf-8') as infile, open(output_file, 'w', encoding='utf-8') as outfile:
        for line_number, line in enumerate(infile, 1):
            try:
                # Try parsing JSON line
                json_data = json.loads(line)
                # Write cleaned JSON
                outfile.write(json.dumps(json_data, ensure_ascii=False) + '\n')
            except json.JSONDecodeError:
                print(f"Skipping malformed line {line_number}: {line.strip()}")

# Example usage
sanitize_jsonl("/workspace/data/training_data/sec_val.jsonl", "/workspace/data/training_data/sec_val_clean.jsonl")
sanitize_jsonl("/workspace/data/training_data/sec_test.jsonl", "/workspace/data/training_data/sec_test_clean.jsonl")
sanitize_jsonl("/workspace/data/training_data/sec_train.jsonl", "/workspace/data/training_data/sec_train_clean.jsonl")
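
A line can parse as valid JSON and still be unusable for training. For example, `read_raw_json_item_1` returns an empty string when a raw file is missing, and that record is still written. The sketch below shows one possible extra check; `find_bad_records` is a hypothetical helper, not part of NeMo, and the sample lines are made up for illustration.

```python
import json


def find_bad_records(lines):
    """Return (line_number, reason) pairs for lines that are not usable training records."""
    problems = []
    for line_number, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((line_number, "invalid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "not a JSON object"))
            continue
        # Both keys must be present and hold non-empty strings
        for key in ("input", "output"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((line_number, f"missing or empty '{key}'"))
    return problems


# Made-up sample lines: one good record, one with an empty input, one truncated
sample_lines = [
    '{"input": "Some filing text", "output": "[[...]]"}',
    '{"input": "", "output": "[[...]]"}',
    '{"input": "Truncated record',
]
bad_records = find_bad_records(sample_lines)
print(bad_records)
```

To apply it to a real file, pass the open file object, for example `find_bad_records(open(path, encoding="utf-8"))`.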

In knowledge distillation, models are fine-tuned using labeled datasets to improve their performance on specific tasks. Task-specific fine-tuning enhances response quality and helps overcome the limitations of the student model. During this process, the model is trained over multiple iterations on labeled data to refine its predictions.

For fine-tuning with NVIDIA NeMo, labeled data must be provided in JSON Lines (JSONL) format. JSONL is a convenient format for storing structured data, allowing records to be processed efficiently one at a time.

The following format is typically used for fine-tuning with NeMo:

```json
{"input": "Sample input text", "output": "Expected model response"}
{"input": "Another example input", "output": "Corresponding expected output"}
```
For fine-tuning a model for triplet extraction, the labeled data looks like this:

```json
{"input": "ITEM 1. BUSINESS ImageWare Systems, Inc., a Delaware corporation, has its principal place of business at 11440 West Bernardo Court, Suite 300, San Diego, California 92127. We maintain a corporate website at www.iwsinc.com. Our common stock, par value $0.01 per share (\u201cCommon Stock\u201d), is currently listed for quotation on the OTCQB marketplace under the symbol \u201cIWSY\u201d. As used in this Annual Report, \u201cwe\u201d, \u201cus\u201d, \u201cour\u201d, \u201cImageWare\u201d, \u201cImageWare Systems\u201d or the \u201cCompany\u201d refers to ImageWare Systems, Inc. and all of its subsidiaries. Overview ImageWare Systems, Inc. (\u201cImageWare,\u201d the \u201cCompany,\u201d \u201cwe,\u201d \u201cour\u201d) provides defense-grade biometric identification and authentication solutions to safeguard your data, products, services or facilities. We are experts in biometric authentication and considered a preeminent patent holder of multimodal biometrics IP, having many of the most-cited patents in the industry. Our patented IWS Biometric Engine\u00ae is one of the most accurate and fastest biometrics matching engines in the industry, capable of our patented biometrics fusion. Part of our heritage is in law enforcement, having built the first statewide digital booking platform for United States local law enforcement in the late 1990\u2019s - and having more than three decades of experience in the challenging government sector creating biometric smart cards and logical access for millions of individuals. We are a \u201cbiometrics first\u201d company, leveraging unique human characteristics to provide unparalleled accuracy for identification while protecting your identity. The Company\u2019s products also provide law enforcement and public safety sector with integrated biographic, mugshot, SMT, and fingerprint capture for booking, in addition to investigative capabilities. 
The Company also provides comprehensive authentication security software using biometrics to secure physical and logical access to facilities, computer networks or Internet sites. Biometric technology is now an integral part of all markets that the Company addresses, and every product leverages our patented IWS Biometric Engine\u00ae. The IWS Biometric Engine\u00ae is a patented biometric identity and authentication database built for multi-biometric enrollment, management and authentication. It is hardware agnostic and can utilize different types of biometric algorithms. It allows different types of biometrics to be operated at the same time on a seamlessly integrated platform. It is also offered as a Software Development Kit (\u201cSDK\u201d), enabling developers and system integrators to implement biometric solutions or integrate biometric capabilities, into existing applications. Our secure credential solutions empower customers to design and create smart digital identification wristbands and badges for access control systems. We develop, sell and support software and design systems that utilize digital imaging and biometrics for photo identification cards, credentials and identification systems. Our products in this market consist of IWS EPI Suite and IWS EPI Builder. These products allow for production of digital identification badges and related databases and records and can be used by, among others, schools, airports, hospitals, corporations and governments. 
.....................", "output": "[['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Insufficient Cash Resources', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Need', 'Additional Capital', 'FIN_INSTRUMENT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Operate_In', 'Identity Management Solutions Industry', 'SECTOR'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Competition', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Fluctuating Operating Results', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Depends_Upon', 'Large System Sales', 'PRODUCT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Lengthy Sales Cycle', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Negative Working Capital', 'ECON_INDICATOR'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Sell', 'Products to Government Agencies', 'GPE'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Rely_On', 'Systems Integrators', 'ORG'], ['Systems Integrators', 'ORG', 'Perform', 'Adequately', 'VERB'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Accumulated Deficit', 'ECON_INDICATOR'], ['IMAGEWARE SYSTEMS INC', 'COMP', ' experience', 'Fluctuations in Operating Results', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Subject_To', 'Penny Stock Regulations', 'FIN_INSTRUMENT'], ['Penny Stock Regulations', 'FIN_INSTRUMENT', 'Impose', 'Additional Sales Practice Requirements', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Foreign Operations', 'COMP'], ['Foreign Operations', 'COMP', 'Expose', 'Foreign Political Risks', 'CONCEPT'], ['Foreign Operations', 'COMP', 'Expose', 'Foreign Economic Risks', 'CONCEPT'], ['Foreign Operations', 'COMP', 'Expose', 'Foreign Legal Risks', 'CONCEPT'], ['Foreign Operations', 'COMP', 'Expose', 'Foreign Currency Exchange Rates', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Foreign Operations', 'COMP'], ['Foreign Operations', 'COMP', 'Affect', 'Results', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Subject_To', 'Income Taxes', 'CONCEPT'], ['Income Taxes', 'CONCEPT', 'Requires', 'Significant Judgments', 
'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Subject_To', 'Income Taxes', 'CONCEPT'], ['Income Taxes', 'CONCEPT', 'Subject_To', 'Examinations', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Exposed_To', 'Foreign Political Risks', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Exposed_To', 'Foreign Economic Risks', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Exposed_To', 'Foreign Legal Risks', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Exposed_To', 'Foreign Currency Exchange Rates', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Foreign Operations', 'COMP'], ['Foreign Operations', 'COMP', 'Affect', 'Results', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Penny Stock Rules', 'FIN_INSTRUMENT'], ['Penny Stock Rules', 'FIN_INSTRUMENT', 'Affect', 'Market Liquidity', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Volatility', 'CONCEPT'], ['Volatility', 'CONCEPT', 'Affect', 'Investment Value', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Fluctuations', 'CONCEPT'], ['Fluctuations', 'CONCEPT', 'Cause', 'Decline in Value', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Common Stock', 'FIN_INSTRUMENT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Specific Factors', 'CONCEPT'], ['Specific Factors', 'CONCEPT', 'Affect', 'Market Price', 'CONCEPT'], ,,,,,,,,,,]"}

```

In our case, the `"input"` key holds the text given as input, and the `"output"` key holds the triplets predicted by the teacher model. 
To construct this dataset, we used Mixtral 8x7B as the teacher on the SEC-10 dataset.
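
Since the `"output"` value is a stringified Python list rather than nested JSON, it has to be parsed before the triplets can be used downstream. Here is a minimal sketch, assuming the string is a valid Python literal; `parse_triples` is a hypothetical helper, and the sample string reuses two triplets from the record shown above.

```python
import ast


def parse_triples(output_text):
    """Parse the stringified list of triples; return only well-formed 5-element entries."""
    try:
        parsed = ast.literal_eval(output_text)
    except (ValueError, SyntaxError):
        # Truncated or otherwise malformed output cannot be parsed as a literal
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    return [tuple(t) for t in parsed if isinstance(t, (list, tuple)) and len(t) == 5]


sample_output = (
    "[['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Competition', 'CONCEPT'], "
    "['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Lengthy Sales Cycle', 'CONCEPT']]"
)
triples = parse_triples(sample_output)
for head, head_type, relation, tail, tail_type in triples:
    print(f"{head} ({head_type}) -{relation}-> {tail} ({tail_type})")
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for model-generated text.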

<br/>

## <font color="#76b900">**2. Finetuning**</font>


[Llama 3](https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/) is an open-source large language model by Meta that delivers state-of-the-art performance on popular industry benchmarks. It has been pretrained on over 15 trillion tokens, and supports an 8K token context length. It is available in two sizes, 8B and 70B, and each size has two variants---base pretrained and instruction tuned.

[Low-Rank Adaptation (LoRA)](https://arxiv.org/pdf/2106.09685) has emerged as a popular Parameter-Efficient Fine-Tuning (PEFT) technique that tunes a very small number of additional parameters as compared to full fine-tuning, thereby reducing the compute required.
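
To make the saving concrete: LoRA keeps a weight matrix frozen and learns a low-rank update `B @ A` instead of a full update of the same shape. The sketch below compares the two parameter counts for a single matrix. The 4096 x 4096 shape and rank 8 are illustrative assumptions, not values read from this notebook's configuration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable parameters: a full weight update vs. a rank-r LoRA update B @ A."""
    full = d_out * d_in            # every entry of the weight matrix is trainable
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


# Illustrative sizes only (assumed): one square projection matrix and a small rank
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters")
print(f"LoRA / full: {lora / full:.2%}")
```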

[NVIDIA NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html) provides tools to perform LoRA on Llama 3 to fit your use case, which can then be deployed using [NVIDIA NIM](https://www.nvidia.com/en-us/ai/) for optimized inference on NVIDIA GPUs.

This notebook shows how to perform LoRA PEFT on Llama 3 8B Instruct using SEC-10 with NeMo Framework.

## Download the base model 
The first set of commands creates a directory to store the Llama-3-8B-Instruct model file if it doesn’t already exist. The second set downloads the model file (`8b_instruct_nemo_bf16.nemo`) from NVIDIA's NGC server using `urllib.request`, skipping the download if the file is already present, and saves it in that directory. The third set verifies the download by listing the contents of the directory.

In [None]:
directory = "../model/llama-3-8b-instruct-nemo_v1.0"
file_path = os.path.join(directory, "8b_instruct_nemo_bf16.nemo")
url = "https://api.ngc.nvidia.com/v2/models/org/nvidia/team/nemo/llama-3-8b-instruct-nemo/1.0/files?redirect=true&path=8b_instruct_nemo_bf16.nemo"
# Create directory if not exists
os.makedirs(directory, exist_ok=True)

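def download_progress(block_num, block_size, total_size):
    downloaded = block_num * block_size
    if total_size > 0:
        percent = min(100, downloaded * 100 / total_size)
        print(f"\rDownloading: {percent:.2f}% ({min(downloaded, total_size)}/{total_size} bytes)", end="")
    else:
        # The server did not report a size, so only the byte count is known
        print(f"\rDownloading: {downloaded} bytes", end="")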

# Check if file exists
if not os.path.exists(file_path):
    print("File not found. Downloading...")
    urllib.request.urlretrieve(url, file_path, reporthook=download_progress)
    print("\nDownload complete.")
else:
    print("File already exists. Skipping download.")


# List directory contents
print("Directory contents:", os.listdir(directory))

## Check GPU availability for training 
The command `docker exec containerB nvidia-smi` runs the `nvidia-smi` tool inside the `containerB` container to display GPU status. The container must have GPU access (`--gpus all`) and the NVIDIA drivers installed.

In [None]:
%%bash 
docker exec containerB nvidia-smi

The NeMo Framework (the current environment) includes a high-level Python script for fine-tuning, `megatron_gpt_finetuning.py`, that abstracts away some of the lower-level API calls. Once the model is downloaded and the dataset is ready, LoRA fine-tuning with NeMo essentially amounts to running this script.

For this demonstration, the training run is capped at 20 steps (`trainer.max_steps=20`), with validation every 5 steps (`trainer.val_check_interval=5`). The cap keeps the run short, so treat it as a learning example: practical fine-tuning typically runs on a setup such as 8x H100 GPUs and often requires 10,000+ steps.

The run creates a LoRA adapter, a file named `megatron_gpt_peft_lora_tuning.nemo`, in the `checkpoints` subfolder of `/workspace/model/Meta-Llama-3-8B-Instruct-Sec-LoRA`. We'll use it later.

The `peft.peft_scheme` parameter determines the technique being used. This run uses LoRA, but NeMo Framework supports other techniques as well, such as P-tuning, Adapters, and IA3. For more information, refer to the [PEFT support matrix](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/peft/landing_page.html). For example, for P-tuning, simply set `model.peft.peft_scheme="ptuning"  # instead of "lora"`
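
The batch-related overrides in the training command interact with each other. The sketch below assumes the usual Megatron-style relationship, global batch = micro batch x data-parallel size x gradient-accumulation steps, and uses the values from this notebook's command. `batch_plan` is a hypothetical helper for the arithmetic, not a NeMo API.

```python
def batch_plan(global_batch_size, micro_batch_size, devices, num_nodes, tp_size, pp_size, max_steps):
    """Derive data-parallel size, gradient-accumulation steps, and samples consumed."""
    world_size = devices * num_nodes
    data_parallel = world_size // (tp_size * pp_size)  # GPUs left over after model parallelism
    per_step = micro_batch_size * data_parallel        # samples per forward/backward pass
    if global_batch_size % per_step != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * data-parallel size")
    return {
        "data_parallel_size": data_parallel,
        "grad_accum_steps": global_batch_size // per_step,
        "samples_seen": global_batch_size * max_steps,
    }


# Values taken from the overrides used in this notebook's training command
plan = batch_plan(global_batch_size=8, micro_batch_size=1, devices=1, num_nodes=1,
                  tp_size=1, pp_size=1, max_steps=20)
print(plan)  # {'data_parallel_size': 1, 'grad_accum_steps': 8, 'samples_seen': 160}
```

Under that assumption, the 20-step demonstration run consumes 160 training samples in total, which is one reason it should be treated as a learning example and not a converged model.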

In [None]:
%%bash
docker exec containerB bash -c "
    MODEL='/workspace/model/llama-3-8b-instruct-nemo_v1.0/8b_instruct_nemo_bf16.nemo'
    TRAIN_DS='/workspace/data/training_data/sec_train_clean.jsonl'
    VALID_DS='/workspace/data/training_data/sec_val_clean.jsonl'
    TEST_DS='/workspace/data/training_data/sec_test_clean.jsonl'
    TEST_NAMES='[sec]'
    SCHEME='lora'
    TP_SIZE=1
    PP_SIZE=1
    OUTPUT_DIR='/workspace/model/Meta-Llama-3-8B-Instruct-Sec-LoRA'
    rm -rf \${OUTPUT_DIR}
    
    torchrun --nproc_per_node=1 /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
        exp_manager.exp_dir=\${OUTPUT_DIR} \
        exp_manager.explicit_log_dir=\${OUTPUT_DIR} \
        trainer.devices=1 \
        trainer.num_nodes=1 \
        trainer.precision=bf16-mixed \
        trainer.val_check_interval=5 \
        trainer.max_steps=20 \
        model.megatron_amp_O2=True \
        ++model.mcore_gpt=True \
        model.tensor_model_parallel_size=\${TP_SIZE} \
        model.pipeline_model_parallel_size=\${PP_SIZE} \
        model.micro_batch_size=1 \
        model.global_batch_size=8 \
        model.restore_from_path=\${MODEL} \
        model.data.train_ds.num_workers=0 \
        model.data.validation_ds.num_workers=0 \
        model.data.train_ds.file_names=[\${TRAIN_DS}] \
        model.data.train_ds.concat_sampling_probabilities=[1.0] \
        model.data.validation_ds.file_names=[\${VALID_DS}] \
        model.peft.peft_scheme=\${SCHEME}
    "

Copy the fine-tuned LoRA adapter to the directory from which NIM loads adapters, making the model available for inference.

In [None]:
!mkdir -p ../model/loras/Meta-Llama-3-8B-Instruct-Sec-LoRA 
!cp ../model/Meta-Llama-3-8B-Instruct-Sec-LoRA/checkpoints/megatron_gpt_peft_lora_tuning.nemo ../model/loras/Meta-Llama-3-8B-Instruct-Sec-LoRA 

<br/>

## <font color="#76b900">**3. Deploy LoRA Inference Adapters with NVIDIA NIM**</font>

### Run the container with the following environment variables 
___
The steps below are for your information only and do not need to be executed now, because the environment has already been set up for you.
___

### Details on how this container was run 

1.  Download the example LoRA adapters.

The following steps assume that you have authenticated with NGC and downloaded the CLI tool, as listed in the Requirements section.

```source-shell
# Set path to your LoRA model store
export LOCAL_PEFT_DIRECTORY="$(pwd)/loras"
```

```source-shell
mkdir -p $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY

# downloading NeMo-format loras
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"

popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
```

1.  Prepare the LoRA model store.

After training is complete, that LoRA model checkpoint will be created at `./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo`, assuming default paths in the first notebook weren't modified.

To ensure the model store is organized as expected, create a folder named `llama3-8b-pubmed-qa`, and move your `.nemo` checkpoint there.

```source-shell
mkdir -p $LOCAL_PEFT_DIRECTORY/llama3-8b-pubmed-qa

# Ensure the source path is correct
cp ./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo $LOCAL_PEFT_DIRECTORY/llama3-8b-pubmed-qa
```

Ensure that the LoRA model store directory follows this structure: the model name(s) should be sub-folder(s) containing the `.nemo` file(s).

```
<$LOCAL_PEFT_DIRECTORY>
├── llama3-8b-instruct-lora_vnemo-math-v1
│   └── llama3_8b_math.nemo
├── llama3-8b-instruct-lora_vnemo-squad-v1
│   └── llama3_8b_squad.nemo
└── llama3-8b-pubmed-qa
    └── megatron_gpt_peft_lora_tuning.nemo
```

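The last folder holds the custom adapter, which in this example was trained on the PubMedQA dataset in a previous notebook. In this lab, the corresponding folder is `Meta-Llama-3-8B-Instruct-Sec-LoRA`, which holds the adapter copied at the end of Section 2.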

1.  Set-up NIM.

From your host OS environment, start the NIM docker container while mounting the LoRA model store, as follows:

```source-shell
# Set these configurations
export NGC_API_KEY=<YOUR_NGC_API_KEY>
export NIM_PEFT_REFRESH_INTERVAL=3600  # (in seconds) will check NIM_PEFT_SOURCE for newly added models in this interval
export NIM_CACHE_PATH=</path/to/NIM-model-store-cache>  # Model artifacts (in container) are cached in this directory
```

```source-shell
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH

export NIM_PEFT_SOURCE=/home/nvs/loras # Path to LoRA models internal to the container
export CONTAINER_NAME=meta-llama3-8b-instruct

docker run -it --rm --name=$CONTAINER_NAME\
    --runtime=nvidia\
    --gpus all\
    --shm-size=16GB\
    -e NGC_API_KEY\
    -e NIM_PEFT_SOURCE\
    -e NIM_PEFT_REFRESH_INTERVAL\
    -v $NIM_CACHE_PATH:/opt/nim/.cache\
    -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE\
    -p 8000:8000\
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```

The first time you run the command, it will download the model and cache it in `$NIM_CACHE_PATH` so subsequent deployments are even faster. There are several options to configure NIM other than the ones listed above. You can find a full list in the [NIM configuration](https://docs.nvidia.com/nim/large-language-models/latest/configuration.html) documentation.

To help interface with this framework, the [**langchain-nvidia-ai-endpoints package**](https://github.com/langchain-ai/langchain-nvidia) provides connectors such as [**`ChatNVIDIA`** ](https://github.com/langchain-ai/langchain-nvidia/blob/main/libs/ai-endpoints/langchain_nvidia_ai_endpoints/chat_models.py) and [**`NVIDIAEmbeddings`** ](https://github.com/langchain-ai/langchain-nvidia/blob/main/libs/ai-endpoints/langchain_nvidia_ai_endpoints/embeddings.py) that wrap the raw endpoints. These will be used throughout the course to power our RAG pipeline!

<br/>

## <font color="#76b900">**4. Querying LoRA for Inference**</font>

### Check available LoRA models
Once the NIM server is up and running, check the available models as follows:

In [None]:
!docker restart containerC
!sleep 60

In [None]:
url = 'http://containerC:8000/v1/models'  # Use containerC's name as the hostname

response = requests.get(url)
data = response.json()

print(json.dumps(data, indent=4))
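
The full JSON response can be verbose. The sketch below pulls out only the model names, assuming the endpoint returns an OpenAI-style body with a top-level `"data"` list whose entries carry an `"id"` field. The sample dictionary is a trimmed, illustrative stand-in for the real response, and `list_model_ids` is a hypothetical helper.

```python
def list_model_ids(models_response):
    """Extract model names from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]


# Illustrative response shape (assumed and trimmed); the real response carries more fields
sample_response = {
    "object": "list",
    "data": [
        {"id": "meta/llama3-8b-instruct", "object": "model"},
        {"id": "Meta-Llama-3-8B-Instruct-Sec-LoRA", "object": "model"},
    ],
}
model_ids = list_model_ids(sample_response)
print(model_ids)
```

In the notebook, calling `list_model_ids(data)` on the parsed response should include the adapter's folder name if NIM has picked it up.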

### Query the LoRA

Create a prompt template. Ideally, this should match the template used during training.

In [None]:
# Build the triplet-extraction prompt for a passage of text
def get_prompt(news_prompt):    
        master_prompt = f"""
                
                    Note that the entities should not be generic, numerical or temporal (like dates or percentages).  Entities must be classified into the following categories:
                    ORG: Organizations other than government or regulatory bodies
                    ORG/GOV: Government bodies (e.g., "United States Government")
                    ORG/REG: Regulatory bodies (e.g., "Federal Reserve")
                    PERSON: Individuals (e.g., "Elon Musk")
                    GPE: Geopolitical entities such as countries, cities, etc. (e.g., "Germany")
                    COMP: Companies (e.g., "Google")
                    PRODUCT: Products or services (e.g., "iPhone")
                    EVENT: Specific and Material Events (e.g., "Olympic Games", "Covid-19")
                    SECTOR: Company sectors or industries (e.g., "Technology sector")
                    ECON_INDICATOR: Economic indicators (e.g., "Inflation rate"), numerical value like "10%" is not a ECON_INDICATOR;
                    FIN_INSTRUMENT: Financial and market instruments (e.g., "Stocks", "Global Markets")
                    CONCEPT: Abstract ideas or notions or themes (e.g., "Inflation", "AI", "Climate Change")
                    The relationships 'r' between these entities must be represented by one of the following relation verbs set: Has, Announce, Operate_In, Introduce, Produce, Control, Participates_In, Impact, Positive_Impact_On, Negative_Impact_On, Relate_To, Is_Member_Of, Invests_In, Raise, Decrease.
                    Remember to conduct entity disambiguation, consolidating different phrases or acronyms that refer to the same entity (for instance,  "UK Central Bank", "BOE" and "Bank of England" should be unified as "Bank of England"). Simplify each entity of the triplet to be less than four words.  
                    
                    From this text, your output must be a Python list of tuples, with each tuple made up of ['h', 'type', 'r', 'o', 'type'], where each element of the tuple is a string and the relationship 'r' must be in the given relation verbs set above. Only output the list. 
                    As an Example, consider the following news excerpt: 
                        Input :'Apple Inc. is set to introduce the new iPhone 14 in the technology sector this month. The product's release is likely to positively impact Apple's stock value.'
                        OUTPUT : ```
                            [('Apple Inc.', 'COMP', 'Introduce', 'iPhone 14', 'PRODUCT'),
                            ('Apple Inc.', 'COMP', 'Operate_In', 'Technology Sector', 'SECTOR'),
                            ('iPhone 14', 'PRODUCT', 'Positive_Impact_On', 'Apple's Stock Value', 'FIN_INSTRUMENT')]
                        ```
                        The output structure must not be anything apart from above OUTPUT structure.
                
                    INPUT_TEXT:
                    """ + news_prompt.replace("\n",".")[:4192] +"[/INST]"
        return master_prompt 

In [None]:

url = 'http://containerC:8000/v1/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

# Read the first record of the cleaned test set
with open("../data/training_data/sec_test_clean.jsonl", "r", encoding="utf-8") as f:
    test_line = f.readline()
prompt = json.loads(test_line)
input_ = prompt['input']
output_ = prompt['output']



data = {
    "model": "Meta-Llama-3-8B-Instruct-Sec-LoRA",
    "prompt": get_prompt(input_),
    "max_tokens": 256
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()  # Fail early on HTTP errors instead of on a missing "choices" key
response_data = response.json()
print(response_data)
print("Predicted output \n ++++++++++++++++++++++++++++++++++++++++ \n" + response_data["choices"][0]["text"])

In [None]:
print("Actual output \n ++++++++++++++++++++++++++++++++++++++++ \n" + prompt['output'])
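
Comparing the two printed outputs by eye does not scale. One possible way to quantify the match is exact-match overlap between the predicted and reference triplets, sketched below. This is an illustrative metric, not the evaluation method used by the notebook's authors. `triple_overlap` is a hypothetical helper, and the sample triplets are adapted from the training record shown earlier.

```python
def triple_overlap(predicted, reference):
    """Exact-match precision, recall, and F1 between two collections of 5-element triples."""
    def normalize(triple):
        # Ignore case and surrounding whitespace so trivial differences do not count as misses
        return tuple(str(part).strip().lower() for part in triple)

    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in reference}
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative triples: one prediction matches the reference, one does not
reference_triples = [
    ("IMAGEWARE SYSTEMS INC", "COMP", "Face", "Competition", "CONCEPT"),
    ("IMAGEWARE SYSTEMS INC", "COMP", "Have", "Lengthy Sales Cycle", "CONCEPT"),
    ("IMAGEWARE SYSTEMS INC", "COMP", "Need", "Additional Capital", "FIN_INSTRUMENT"),
    ("IMAGEWARE SYSTEMS INC", "COMP", "Have", "Accumulated Deficit", "ECON_INDICATOR"),
]
predicted_triples = [
    ("ImageWare Systems Inc", "COMP", "Face", "Competition", "CONCEPT"),
    ("ImageWare Systems Inc", "COMP", "Operate_In", "Biometrics Industry", "SECTOR"),
]
scores = triple_overlap(predicted_triples, reference_triples)
print(scores)
```

Exact matching is strict: a paraphrased entity or a different relation verb counts as a miss, so these scores are best read as a lower bound on agreement with the teacher. With `max_tokens` set to 256, the predicted list may also be cut off before it closes.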