# Retriever Customization - Fine-Tuning & Evaluation (2/2)

Authors - Aditya Malte, Vinay Raman, Ali Taghibakhshi, Dora Li

## Overview
This is part two of a two-part series. 
1. `synthetic_data_generation_nemo.ipynb`:
    - Use an LLM from build.nvidia.com (or deploy your own using NIM!) to create training examples containing generated queries and positive chunks. By default the notebook will use nfcorpus, but you can easily swap in your own data.
    - Save results to a `.jsonl` file 


2. `retriever_customization.ipynb` **(this notebook)**:
    - Implement hard negative mining to find challenging negative examples
    - Use the generated training data in the `.jsonl` file to fine-tune a retriever model using NeMo Framework
    - Evaluate the results of your fine-tuned embedding model against the original using BeIR Benchmark
    
A GPU is required to run this notebook. 

## Setup Instructions

#### NeMo Framework Docker container
This notebook requires the NeMo Framework Docker container. From inside the `synthetic-data-retriever-customization` directory, pull the Docker image and start the container using this command:

`docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.07`

This notebook was tested on a setup comprising 1x L40S GPU with CUDA installed.


#### NVIDIA AI Endpoints
As in Notebook 1, you'll use another API endpoint from [www.build.nvidia.com](https://www.build.nvidia.com) in Notebook 2, this time for generating embeddings with the text embedding model [NV-EmbedQA-E5-V5](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5). You can reuse the same API Key as before, or generate a new one by clicking the link to the model. 


#### Download NV-Embed-QA-4 model weights from NGC
Use the command `ngc registry model download-version "ohlfw0olaadg/ea-participants/nv-embed-qa:4"` to download the NeMo Retriever model. It must be downloaded to the directory `files/models`. This notebook uses that NeMo Retriever model as its example. If you do not have NVAIE access, you can instead download and convert a Hugging Face (HF) embedding model such as `intfloat/e5-large-unsupervised` as follows:
```
/NeMo/scripts/nlp_language_modeling/convert_bert_hf_to_nemo.py \
       --input_name_or_path "intfloat/e5-large-unsupervised" \
       --output_path /workspace/files/models/my_model.nemo
```

For the purposes of this notebook, we have used the NeMo Retriever model. If you use another model, or convert an HF model, ensure that the model path is updated accordingly.

In [None]:
!pip install ipywidgets
!pip install beir

## Import libraries and set configuration

In [None]:
import numpy as np
import json
import pandas as pd
from collections import OrderedDict
import os
import torch
from openai import AsyncOpenAI
import asyncio
import nest_asyncio
import math

nest_asyncio.apply()

In [None]:
# This should be the synthetic dataset generated in Part 1, consisting of queries and their positive chunks
QA_PAIRS_PATH = "/workspace/files/data/qa_pairs_nvidia-nemotron-4-340b-instruct_num_queries_300_BeIR_nfcorpus.csv"

# Specify the path where the fine-tuning dataset will be saved
OUTPUT_DATA_PATH = "/tmp/data/output_data.jsonl"
output_dir_path = os.path.dirname(OUTPUT_DATA_PATH)
# Create the output directory (and any missing parents) if it does not exist yet
os.makedirs(output_dir_path, exist_ok=True)

#### Parameters for Fine-Tuning

In [None]:
NUM_DEVICES=1 # number of gpus available for fine-tuning

# Use the default config for BERT Embedding Model
CONFIG_PATH="/opt/NeMo/examples/nlp/information_retrieval/conf/"
CONFIG_NAME="megatron_bert_embedding_config"

PATH_TO_NEMO_MODEL= "/workspace/files/models/NV-Embed-QA-4.nemo" # Path to the .nemo model; update this if you converted a different HF model
DATASET_PATH= OUTPUT_DATA_PATH # Path to jsonl dataset
SAVE_DIR= "/tmp/trained_model/" # where the checkpoint and logs are saved

#### Read QA Pairs CSV file

In [None]:
qa_pairs = pd.read_csv(QA_PAIRS_PATH).sample(frac=1).reset_index(drop=True)
qa_pairs

## Mining Hard Negatives

Hard negative mining selects negative examples that are difficult for the model to tell apart from the positive. Rather than sampling negatives at random, which mostly yields easy negatives, we mine for passages that score highly against the query but are not its positive.

Because these negatives are not obvious to the model during training, they provide a more useful training signal.

However, hard negative mining is more likely to produce false negatives, that is, passages that are actually relevant to the query. To reduce this risk, we set a safety `margin`: a passage is only eligible as a negative if its similarity score is at most `margin` times the query's positive score. The margin is a hyperparameter that you can tune if too many false negatives are being generated. For instance, a larger corpus is more likely to contain additional relevant passages than a smaller one, so a lower `margin` value may be more helpful there.
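
To make the effect of `margin` concrete, here is a minimal pure-Python sketch. The function name `filter_candidates` and the similarity scores are hypothetical and chosen only for illustration; the mining function later in this notebook applies the same idea to whole batches of embeddings.

```python
def filter_candidates(scores, positive_idx, margin):
    """Return indices of passages that remain eligible as negatives.

    A passage is eligible if it is not the positive and its score is at
    most margin * (score of the positive passage).
    """
    threshold = scores[positive_idx] * margin
    return [
        idx for idx, score in enumerate(scores)
        if idx != positive_idx and score <= threshold
    ]

# Hypothetical query-passage similarity scores; index 0 is the labelled positive
scores = [0.82, 0.80, 0.70, 0.35]

loose = filter_candidates(scores, positive_idx=0, margin=0.99)
strict = filter_candidates(scores, positive_idx=0, margin=0.90)
print(loose)   # passage 1 (score 0.80) is still eligible
print(strict)  # passage 1 is now excluded as a likely false negative
```

With the lower margin, the passage that scores almost as high as the positive is dropped, at the cost of leaving fewer (and easier) candidates.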

#### NV-EmbedQA-E5-V5
To do hard negative mining, we'll need to create embeddings for all of our text chunks using the [NV-EmbedQA-E5-V5](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5) model from www.build.nvidia.com. You can reuse the same NVIDIA_API_KEY as before. 

Since the NV-EmbedQA-E5-V5 model is quite small, you can also easily host it as a self-deployed NIM Docker container by following the instructions [here](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5?snippet_tab=Docker). If you have already downloaded the weights of a `.nemo` format embedding model in preparation for fine-tuning, you can also restore the model using NeMo Framework. To do that, copy the `encode_text()` method from the evaluation section of this notebook and use it here.

#### BeIR
BeIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark [source](https://github.com/beir-cellar/beir). First we'll do some basic processing so that our synthetic dataset matches the BeIR format. 

In [None]:
passages = OrderedDict()
queries = []
positive_passage_ids = []
for _, row in qa_pairs.iterrows():
    queries.append(row["query"])
    positive_passage_str = row["positive_chunk"]
    if(positive_passage_str in passages):
        positive_passage_id = passages[positive_passage_str]
        positive_passage_ids.append(positive_passage_id)
    else:
        positive_passage_id = len(passages)
        passages[positive_passage_str] = positive_passage_id
        positive_passage_ids.append(positive_passage_id)

In [None]:
queries[12], positive_passage_ids[12]

### Generate Embeddings for all Queries and Positive Passages

In [None]:
embedding_client = AsyncOpenAI(
    base_url = "https://integrate.api.nvidia.com/v1",
    api_key = os.environ["NVIDIA_API_KEY"]
)

In [None]:
async def encode_text(client, text, input_type):
    # Request one embedding; raise on failure so an error is never mistaken for an embedding
    try:
        response = await client.embeddings.create(
            input=[text],
            model="nvidia/nv-embedqa-e5-v5",
            encoding_format="float",
            extra_body={"input_type": input_type, "truncate": "END"}
        )
    except Exception as e:
        raise RuntimeError(f"Embedding request failed: {e}") from e

    if not getattr(response, "data", None):
        raise RuntimeError("Embedding response contained no data")
    return response.data[0].embedding
    

async def batch_encode_text(client, all_texts, input_type):
    tasks = [encode_text(client, text, input_type) for text in all_texts]
    results_list = await asyncio.gather(*tasks)
    return results_list

In [None]:
query_embeddings = await batch_encode_text(embedding_client, [("query: "+query) for query in queries], "query")
passage_embeddings = await batch_encode_text(embedding_client, [("passage: "+passage) for passage in list(passages)], "passage")

NV-EmbedQA-E5-V5 uses the keys "query" and "passage", but this may differ between models. Ensure you are using the correct keys for your model, otherwise you'll hit an error during fine-tuning.
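
The `batch_encode_text` helper launches one request per text all at once, which may run into rate limits on a hosted endpoint for larger datasets. The sketch below shows one possible way to cap the number of in-flight requests with `asyncio.Semaphore`. The names `gather_with_limit` and `fake_encode` are hypothetical; `fake_encode` merely stands in for a real embedding call so that the example runs offline.

```python
import asyncio

async def gather_with_limit(coroutine_fn, items, max_concurrency=8):
    """Run coroutine_fn(item) for every item, at most max_concurrency at a time.

    Results are returned in the same order as items.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            return await coroutine_fn(item)

    return await asyncio.gather(*(run_one(item) for item in items))

async def fake_encode(text):
    # Stand-in for a real embedding request (assumption for this demo)
    await asyncio.sleep(0)
    return [float(len(text))]

demo_embeddings = asyncio.run(
    gather_with_limit(fake_encode, ["query: a", "query: bb"], max_concurrency=2)
)
print(demo_embeddings)
```

Inside this notebook you could instead `await` the helper directly, for example `await gather_with_limit(lambda t: encode_text(embedding_client, t, "query"), texts)`, where `texts` is your list of prefixed strings.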

### Find Hard Negatives Using Similarity Score

In [None]:
def hard_negative_mining(
        query_embeddings,
        passage_embeddings,
        batch_size,
        margin, 
        num_negs,
        query_positive_paragraph_idxs
):
    hard_negative_idxs = []
    num_batches = int(math.ceil(query_embeddings.shape[0] / batch_size))
    # Split the query embeddings into batches of given batch size
    for current_batch_idx in range(num_batches):
        start = (current_batch_idx)*batch_size
        end = (current_batch_idx+1)*(batch_size)
        batch_query_embeddings = query_embeddings[start:end]
        batch_query_positive_paragraph_idxs = query_positive_paragraph_idxs[start:end]
        
        # Find minimum query-positive_chunk similarity score for each query in a batch
        query_passage_pos_scores = np.matmul(batch_query_embeddings, passage_embeddings.T)

        min_pos_scores = []
        for query_id, row in enumerate(query_passage_pos_scores):
            min_value = float("inf")
            for query_positive_paragraph_idx in query_positive_paragraph_idxs[query_id+start]:
                min_value = min(min_value, row[query_positive_paragraph_idx])
            min_pos_scores.append(min_value)
        min_pos_scores = np.array(min_pos_scores)
            
        # For each query, set the mining threshold to margin * (minimum positive score)
        mining_thresholds = min_pos_scores*margin
        
        # Filter out all chunks belonging to the same paragraph as positive passage OR those manually labelled as positives
        for query_idx, positive_paragraph_idxs in enumerate(batch_query_positive_paragraph_idxs):
            batch_query_idx = query_idx%batch_size
            query_passage_pos_scores[batch_query_idx][positive_paragraph_idxs] = -float("inf")
        
        # Filter out all chunks with score>mining_threshold
        for row_idx in range(query_passage_pos_scores.shape[0]):
            row = query_passage_pos_scores[row_idx]
            row[row>mining_thresholds[row_idx]] = -float("inf")
            
        # For each query get top_k hard negatives from all that remains
        for row in query_passage_pos_scores:
            top_k_hard_negative_idxs = np.argpartition(row, -num_negs)[-num_negs:]
            hard_negative_idxs.append(list(top_k_hard_negative_idxs))
            
    return hard_negative_idxs

In [None]:
# Here we set a margin of 0.95 to prevent false negatives and we mine 5 negative docs (num_negs)
query_embeddings = torch.tensor(query_embeddings).numpy()
passage_embeddings = torch.tensor(passage_embeddings).numpy()

positive_passage_ids_list = [[element] for element in positive_passage_ids]
hard_negative_idxs = hard_negative_mining(query_embeddings=query_embeddings, passage_embeddings=passage_embeddings, query_positive_paragraph_idxs=positive_passage_ids_list,
                    batch_size=32, num_negs=5, margin=0.95)

Use similarity score with the `margin` variable to generate hard negatives. For this example we generate 5 hard negatives, but you can change this number. Ultimately the data will be stored in the following format: 

```
[
    {
        "query": "Query",
        "pos_doc": "Positive",
        "neg_doc": ["Negative_1", "Negative_2", ..., "Negative_n"]
    },
    {
        // Next data instance
    },
    ...,
    {
        // Subsequent data instance
    }
]
```

In [None]:
# Passage ids were assigned in insertion order, so a list gives a direct id -> text lookup
passage_texts = list(passages)

data = []
for query_id, query in enumerate(queries):
    datapoint = {
        "query" : query,
        "pos_doc" : passage_texts[positive_passage_ids[query_id]],
        "neg_doc" : [passage_texts[idx] for idx in hard_negative_idxs[query_id]]
    }
    data.append(datapoint)

In [None]:
print(len(data))
print('query: ', data[0]['query'])
print('pos_doc: ', data[0]['pos_doc'])
print('neg_doc: ', data[0]['neg_doc'])

In [None]:
# Save data to JSONL file
print(f"Saving data to: {OUTPUT_DATA_PATH}")

with open(OUTPUT_DATA_PATH, "w") as f:
    for entry in data:
        f.write(json.dumps(entry) + '\n')
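
Before training, it can be worth reading the file back and checking each record. The following is a minimal sketch, assuming `pos_doc` is a single string and `neg_doc` is a list, as written by the cell above. The function name `validate_training_file` is hypothetical. The check that the positive does not reappear among the negatives is included because a query with fewer eligible candidates than `num_negs` could otherwise end up with filtered-out passages in its negative list.

```python
import json
import os
import tempfile

def validate_training_file(path, expected_num_negs):
    """Return the number of records; raise ValueError on the first malformed one."""
    count = 0
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            if not isinstance(record.get("query"), str) or not isinstance(record.get("pos_doc"), str):
                raise ValueError(f"line {line_number}: 'query' and 'pos_doc' must be strings")
            negs = record.get("neg_doc")
            if not isinstance(negs, list) or len(negs) != expected_num_negs:
                raise ValueError(f"line {line_number}: expected {expected_num_negs} negatives")
            if record["pos_doc"] in negs:
                raise ValueError(f"line {line_number}: positive passage also appears as a negative")
            count += 1
    return count

# Tiny self-contained demonstration on a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w") as f:
    f.write(json.dumps({"query": "q", "pos_doc": "p", "neg_doc": ["n1", "n2"]}) + "\n")

num_records = validate_training_file(demo_path, expected_num_negs=2)
print(num_records)
```

In the notebook itself you would call it as `validate_training_file(OUTPUT_DATA_PATH, expected_num_negs=5)`.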

## Training

Run the `megatron_bert_embedding_finetuning.py` script. This script sets up and trains a Megatron-BERT model using NVIDIA NeMo Framework, with configurations managed by Hydra. It loads the pre-trained `.nemo` model from a checkpoint, adjusts settings such as batch size, and sets up parallel processing for multi-GPU training. Finally, it initializes the trainer and starts the training process with the NeMo Framework Megatron Trainer.

Note that `model.global_batch_size = model.micro_batch_size * trainer.devices`, where `trainer.devices` is the number of GPUs. Please keep `micro_batch_size=4` and set the other parameters accordingly.

`model.data.hard_negatives_to_train` should be set to the number of neg_docs corresponding to each query in your synthetic dataset. 
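
As a small illustration of how these values depend on each other, the sketch below derives them from the number of GPUs and the number of mined negatives. The helper name `finetuning_overrides` is hypothetical, and the keys are shown without any Hydra prefixes such as `++`; keep whatever prefixes the training command uses.

```python
def finetuning_overrides(num_devices, micro_batch_size, num_negs):
    """Derive mutually consistent values for the batch-size and negatives settings."""
    return {
        "trainer.devices": num_devices,
        "model.micro_batch_size": micro_batch_size,
        # global batch size = micro batch size * number of GPUs
        "model.global_batch_size": micro_batch_size * num_devices,
        # must match the length of neg_doc in each training record
        "model.data.hard_negatives_to_train": num_negs,
    }

overrides = finetuning_overrides(num_devices=2, micro_batch_size=4, num_negs=5)
for key, value in overrides.items():
    print(f"{key}={value}")
```

With one GPU, as in the command below, the global batch size equals the micro batch size of 4.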

In [None]:
COMMAND = f"python /opt/NeMo/examples/nlp/information_retrieval/megatron_bert_embedding_finetuning.py \
--config-path={CONFIG_PATH} \
--config-name={CONFIG_NAME} \
restore_from_path={PATH_TO_NEMO_MODEL} \
trainer.devices={NUM_DEVICES} \
trainer.val_check_interval=10 \
trainer.max_epochs=1 \
+trainer.num_sanity_val_steps=0 \
trainer.max_steps=100000 \
model.global_batch_size=4 \
model.micro_batch_size=4 \
model.mcore_bert=False \
model.tokenizer.library=huggingface \
model.tokenizer.type=intfloat/e5-large-unsupervised \
model.megatron_legacy=True \
++model.data.data_prefix={DATASET_PATH} \
++model.tokenizer.do_lower_case=False \
++model.data.evaluation_sample_size=50 \
++model.data.hard_negatives_to_train=5 \
++model.data.evaluation_steps=100 \
++model.data.data_train={DATASET_PATH} \
++model.data.num_workers=7 \
exp_manager.explicit_log_dir={SAVE_DIR} \
exp_manager.create_wandb_logger=False \
++exp_manager.checkpoint_callback_params.save_best_model=True \
exp_manager.resume_if_exists=False"

print(COMMAND)

In [None]:
!{COMMAND}

If your training completed, you should see a `megatron_bert.nemo` file in the `checkpoints` subdirectory of your `SAVE_DIR` directory.

If training failed due to memmap-related errors, delete any `output_data.jsonl.idx*` (index) files that have been generated in the directory containing `OUTPUT_DATA_PATH`, where `output_data.jsonl` is located. To save time and memory, NeMo Framework doesn't rebuild index files if they already exist, so changing the data itself or any data-related parameters leaves stale index files behind, which causes these errors.
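
One possible way to clear those index files from Python is sketched below. The helper name `remove_stale_index_files` is hypothetical, and the index file names created in the demonstration are placeholders that merely match the `idx*` pattern described above.

```python
import glob
import os
import tempfile

def remove_stale_index_files(jsonl_path):
    """Delete files named '<jsonl_path>.idx*' and return the removed paths."""
    removed = []
    for index_path in glob.glob(glob.escape(jsonl_path) + ".idx*"):
        os.remove(index_path)
        removed.append(index_path)
    return sorted(removed)

# Demonstration in a temporary directory with placeholder files
demo_dir = tempfile.mkdtemp()
demo_jsonl = os.path.join(demo_dir, "output_data.jsonl")
for name in ("output_data.jsonl", "output_data.jsonl.idx.npy", "output_data.jsonl.idx.info"):
    open(os.path.join(demo_dir, name), "w").close()

removed = remove_stale_index_files(demo_jsonl)
print(len(removed), os.path.exists(demo_jsonl))
```

In the notebook you would call `remove_stale_index_files(OUTPUT_DATA_PATH)` before re-running training; the dataset file itself is left untouched.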

## Model Evaluation

For this tutorial, we'll use the nfcorpus dataset from BeIR to compare the retrieval accuracy between the original model and the fine-tuned model. For a true apples-to-apples comparison, you should create your own domain-specific evaluation dataset that matches the domain of the synthetic fine-tuning dataset. This evaluation dataset should comprise a corpus, queries, and qrel (query relevance) scores.
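
If you build your own evaluation set, the three pieces can be represented as plain dictionaries in the shape that BeIR's `GenericDataLoader` returns. The sketch below is a hypothetical helper (`build_beir_style_dataset`) that converts (query, relevant chunk) pairs into that shape; the `q0`/`doc0` id scheme and the sample pairs are made up for illustration.

```python
def build_beir_style_dataset(pairs):
    """Convert (query, relevant_chunk) pairs into corpus, queries and qrels dicts."""
    corpus, queries, qrels = {}, {}, {}
    doc_ids = {}  # chunk text -> document id, so repeated chunks share one id
    for i, (query, chunk) in enumerate(pairs):
        if chunk not in doc_ids:
            doc_id = f"doc{len(doc_ids)}"
            doc_ids[chunk] = doc_id
            corpus[doc_id] = {"title": "", "text": chunk}
        query_id = f"q{i}"
        queries[query_id] = query
        qrels[query_id] = {doc_ids[chunk]: 1}  # relevance score 1 = relevant
    return corpus, queries, qrels

pairs = [
    ("what lowers cholesterol?", "Oats contain soluble fibre ..."),
    ("is fibre good for the heart?", "Oats contain soluble fibre ..."),
    ("how much vitamin D per day?", "Vitamin D intake guidance ..."),
]
corpus_demo, queries_demo, qrels_demo = build_beir_style_dataset(pairs)
print(len(corpus_demo), len(queries_demo), qrels_demo["q1"])
```

Held-out pairs that were not used for fine-tuning are the natural input here, so that the evaluation does not reward memorisation of the training data.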

We will use NeMo Framework to restore both the original and fine-tuned models from their respective checkpoints and BeIR libraries to easily evaluate the retrieval accuracy. 

Finally, we'll evaluate the model with NDCG@k, MAP@k, Recall@k and Precision@k scores. These metrics assess different aspects of retrieval performance. NDCG and MAP focus on the quality of the ranking, with higher values indicating that relevant documents are ranked closer to the top. Recall measures the fraction of relevant documents retrieved within the top k, and can only stay the same or improve as k increases. Precision measures the fraction of the top k documents that are relevant, with higher precision indicating more relevant results at the top.
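
To build some intuition for these numbers, here is a simplified sketch of three of the metrics for a single query with binary relevance. This is not the implementation BeIR uses; the function names and the toy ranking are assumptions made for illustration.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # toy retrieval result, best first
relevant = {"d1", "d2"}             # toy ground truth

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, relevant, 3)
print(round(p3, 4), round(r3, 4), round(n3, 4))
```

In this toy ranking only one of the two relevant documents appears in the top 3, and it sits at rank 2 rather than rank 1, which is why NDCG@3 is well below 1.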

In [None]:
from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

import torch
import math
from tqdm import tqdm
import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

#### Download nfcorpus.zip dataset and unzip the dataset
dataset = "nfcorpus"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join("/tmp", "datasets")
data_path = util.download_and_unzip(url, out_dir)

#### Provide the data_path where nfcorpus has been downloaded and unzipped
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")


#### Create a wrapper NeMo model for retrieval evaluation on this dataset

In [None]:
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from nemo.collections.nlp.models.information_retrieval.megatron_bert_embedding_model import MegatronBertEmbeddingModel
from pytorch_lightning.trainer.trainer import Trainer
from typing import List, Dict
import numpy as np

class NeMoModel:
    def __init__(self, model_path=None, override_configs=None, **kwargs):
        cfg = MegatronBertEmbeddingModel.restore_from(model_path, return_config=True)
        if override_configs is not None:
            for k in override_configs:
                cfg[k] = override_configs[k]
        self.model = MegatronBertEmbeddingModel.restore_from(
            model_path,
            trainer=Trainer(),
            override_config_path=cfg)
        self.model = self.model.to("cuda:0").half()
    
    def encode_text(self, texts, batch_size=1, device="cuda:0"):
        with torch.no_grad():
            tokenized_texts = self.model.tokenizer.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
            
            input_ids = tokenized_texts["input_ids"].to(device)
            attention_mask = tokenized_texts["attention_mask"].to(device)
            token_type_ids = tokenized_texts["token_type_ids"].to(device)

            num_batches = int(math.ceil(len(texts)/batch_size))

            embeddings = []
            for batch_id in tqdm(range(num_batches)):
                start = batch_size * batch_id
                end = batch_size * (batch_id+1)

                batch_embeddings = self.model(input_ids[start:end, :], attention_mask[start:end, :], token_type_ids[start:end, :])
                embeddings.append(batch_embeddings)
            return torch.cat(embeddings, dim=1).swapaxes(0,1)

    # Write your own encoding query function (Returns: Query embeddings as numpy array)
    def encode_queries(self, queries: List[str], batch_size: int, **kwargs) -> np.ndarray:
        queries = [f"query: {query}" for query in queries]
        embeddings = self.encode_text(texts=queries, batch_size=batch_size)
        return embeddings
    
    # Write your own encoding corpus function (Returns: Document embeddings as numpy array)  
    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int, **kwargs) -> np.ndarray:
        corpus = [f"passage: {passage}" for passage in corpus]
        embeddings = self.encode_text(texts=corpus, batch_size=batch_size)
        return embeddings

#### Evaluate the Fine-tuned model:

NOTE: there may be a bug in NeMo 24.07 where certain global variables are set by default and must match the config variables that are passed in. One example is `global_batch_size=8`. So even though we set `global_batch_size=4` during fine-tuning, we need to manually override it here to successfully restore the model. This does not impact the model performance.

In [None]:
new_model = DRES(NeMoModel(model_path="/tmp/trained_model/checkpoints/megatron_bert.nemo", override_configs={'global_batch_size': 8}), batch_size=1)
retriever = EvaluateRetrieval(new_model, score_function="dot") # or "cos_sim" for cosine similarity
results = retriever.retrieve(corpus, queries)

#### Evaluate your model with NDCG@k, MAP@K, Recall@K and Precision@K  where k = [1,3,5,10,100,1000] 
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, _map, recall, precision)

The output should look like this:
```
{'NDCG@1': 0.43808, 'NDCG@3': 0.4094, 'NDCG@5': 0.39159, 'NDCG@10': 0.35777, 'NDCG@100': 0.33154, 'NDCG@1000': 0.41858} {'MAP@1': 0.05692, 'MAP@3': 0.09939, 'MAP@5': 0.11412, 'MAP@10': 0.13414, 'MAP@100': 0.17271, 'MAP@1000': 0.18817} {'Recall@1': 0.05692, 'Recall@3': 0.11421, 'Recall@5': 0.13637, 'Recall@10': 0.17648, 'Recall@100': 0.33741, 'Recall@1000': 0.64782} {'P@1': 0.45511, 'P@3': 0.38803, 'P@5': 0.34365, 'P@10': 0.26656, 'P@100': 0.08508, 'P@1000': 0.02163}
```

#### Evaluate the original model: 

In [None]:
# The original model
old_model = DRES(NeMoModel(model_path=PATH_TO_NEMO_MODEL), batch_size=1)
retriever = EvaluateRetrieval(old_model, score_function="dot") # or "cos_sim" for cosine similarity
results = retriever.retrieve(corpus, queries)

#### Evaluate your model with NDCG@k, MAP@K, Recall@K and Precision@K  where k = [1,3,5,10,100,1000] 
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, _map, recall, precision)

Comparing the two sets of scores, you should see some improvement from fine-tuning. Fine-tuning on a larger amount of proprietary, domain-specific data is likely to make the improvement much more significant. In some initial testing with proprietary corporate data, we've seen around 5-10% accuracy improvement. Your results may vary depending on your other configuration settings.
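
To quantify the difference, one option is to keep the metric dictionaries of both runs under separate names (the cells above reuse `ndcg`, `_map`, `recall` and `precision`, so the second run overwrites the first) and compute the relative change per metric. The helper `relative_change` is hypothetical, and the numbers below are made-up placeholders, not results from this notebook.

```python
def relative_change(before, after):
    """Percentage change for every metric key present in both dicts."""
    return {
        key: round(100.0 * (after[key] - before[key]) / before[key], 2)
        for key in after
        if key in before and before[key] != 0
    }

# Hypothetical numbers for illustration only
original_ndcg = {"NDCG@10": 0.30, "NDCG@100": 0.25}
finetuned_ndcg = {"NDCG@10": 0.33, "NDCG@100": 0.26}

change = relative_change(original_ndcg, finetuned_ndcg)
print(change)
```

A positive value means the fine-tuned model scored higher on that metric than the original model.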

**Congratulations!** You've officially created synthetic data and fine-tuned a text embedding model using NeMo Framework!