# NeMo Evaluator microservice: Retriever and RAG Evaluation

In the following notebook, we'll be exploring how to use [NeMo Evaluator microservice](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/overview.html) to evaluate [Retriever Models](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/models/models_retriever.html) as well as [Retrieval Augmented Generation (RAG) Models](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/models/models_rag.html)!

We'll look at the following examples: 

- Retriever Model Evaluation on FiQA
- Retriever + Reranking Evaluation on FiQA
- Retrieval Augmented Generation (RAG) Evaluation on FiQA with Ragas Metrics
- Retrieval Augmented Generation (RAG) Evaluation on Synthetically Generated Data with Ragas Metrics

In order to get started, we'll need to make sure our Evaluation Microservice is running, alongside our Retriever, Re-Rank, and LLM NIMs.

## Initial Set-up and Notebook Dependencies

In order to run this notebook, the following will need to be up and running: 

- Evaluator Microservice, which can be conveniently deployed through the [Deploying with Helm](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/deploy-helm.html) guide
- NVIDIA NIM Text Embedding, `nvidia/nv-embedqa-e5-v5`, which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/getting-started.html) guide
- NVIDIA NIM Text Reranking, `nvidia/nv-rerankqa-mistral-4b-v3`, which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/getting-started.html) guide
- NVIDIA NIM for LLM, `meta/llama-3.1-8b-instruct`, which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) guide

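Once all of our services are up and running, we can install the Python `requests` library, which we will use to communicate with the Evaluator API, along with `huggingface_hub`, which we will use later to upload a dataset to the NeMo Datastore microservice.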

In [1]:
!pip install -qU requests huggingface_hub==0.26.2

We'll need to provide the Evaluation API URL in the cell below.

> NOTE: Your evaluation URL will be provided as part of your deployment. 

In [4]:
EVAL_URL = "<< YOUR EVALUATOR URL HERE >>"

We'll also need to provide the URL and model name for each NIM, both of which are set up as part of that NIM's deployment.

Below is an example of the default value for the embedding NIM:

- Embedding: 
  - EMBEDDING_URL: `http://localhost:8000/v1/embeddings`
  - EMBEDDING_MODEL_NAME: `nvidia/nv-embedqa-e5-v5`

In [2]:
# embedding
EMBEDDING_URL = "<< YOUR EMBEDDING MODEL NIM URL >>"
EMBEDDING_MODEL_NAME = "<< YOUR EMBEDDING MODEL NAME >>"

# reranker
RERANKER_URL = "<< YOUR RERANKER MODEL NIM URL >>"
RERANKER_MODEL_NAME = "<< YOUR RERANKER MODEL NAME >>"

# llm
LLM_URL = "<< YOUR LLM MODEL NIM URL >>"
LLM_MODEL_NAME = "<< YOUR LLM MODEL NAME >>"

Now we can verify our Evaluation API is up and running with the built-in health check!

In [5]:
import requests

endpoint = f"{EVAL_URL}/health"
response = requests.get(endpoint).json()
print(response)

{'status': 'healthy'}
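
The health check above can also be wrapped in a small helper that adds a timeout and fails loudly on HTTP errors. This is a minimal sketch: the helper name `evaluator_is_healthy`, the injected `http_get` parameter, and the 10-second timeout are illustrative assumptions, not part of the Evaluator API. Only the `/health` path and the `{'status': 'healthy'}` payload come from the cell above.

```python
from typing import Callable


def evaluator_is_healthy(base_url: str, http_get: Callable, timeout: float = 10.0) -> bool:
    """Return True if GET {base_url}/health reports status 'healthy'.

    `http_get` is injected (for example `requests.get`) so the helper
    can be exercised without a running service. Hypothetical helper.
    """
    response = http_get(f"{base_url.rstrip('/')}/health", timeout=timeout)
    # Raise on 4xx/5xx instead of trying to parse an error page as JSON.
    response.raise_for_status()
    return response.json().get("status") == "healthy"
```

In this notebook it could be called as `evaluator_is_healthy(EVAL_URL, requests.get)`.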


## Retriever Model Evaluation on FiQA

For our first evaluation, we're going to evaluate our Retrieval Model (`nvidia/nv-embedqa-e5-v5`) on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

The core pieces we need to provide are: 

- `top_k`, how many documents to retrieve through our retriever model.
- `query_embedding_model`, whose `api_endpoint` holds the `url` of your currently running NIM and its `model_id`, which will be `nvidia/nv-embedqa-e5-v5` if you're following the notebook exactly.
- `index_embedding_model`, whose `api_endpoint` mirrors that of `query_embedding_model`, assuming that you're using the same NIM deployment for both query embedding and index embedding.

> NOTE: While it's possible to use different NIM *deployments* for Query/Index Embedding - you will need to ensure the underlying model is the same between both.

We'll also want to ensure we've set-up our evaluations correctly by following the available [documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations/evaluations_retriever.html) for Retriever evaluations.


In [6]:
retriever_target_config = {
 "type": "retriever",
 "retriever": {
   "pipeline": {
     "query_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME
       }
     },
     "index_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME
       }
     },
     "top_k": 10
   }
 }
}


We'll want to point our request at the `v1/evaluation/targets` endpoint to create the target.

In [7]:
target_endpoint = f"{EVAL_URL}/v1/evaluation/targets"

Then we are clear to fire off the request!

In [42]:
retriever_response = requests.post(
    target_endpoint,
    json=retriever_target_config,
    headers={'accept': 'application/json'}
).json()

We'll capture our target name and namespace for the coming steps - but with this step we have created our target and are ready to create an evaluation configuration!

In [43]:
retriever_target_name = retriever_response["name"]
retriever_target_namespace = retriever_response["namespace"]
print(f"Target Name: {retriever_target_name}, Target Namespace: {retriever_target_namespace}")

Target Name: eval-target-J7SB71Jji73LrPEg4LxnV7, Target Namespace: default


Now we can grab our evaluation configuration.

In [44]:
retriever_eval_config = {
 "type": "retriever",
 "tasks": [
   {
     "type": "beir",
     "dataset": {
       "format": "beir",
       "files_url": "fiqa"
     },
     "metrics": [
       {
         "name": "recall_5",
       },
       {
         "name": "ndcg_cut_5",
       },
       {
         "name": "recall_10",
       },
       {
         "name": "ndcg_cut_10",
       }
     ]
   }
 ]
}
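
The metric list above follows a regular pattern (`recall_k` and `ndcg_cut_k` for each cutoff `k`), so it can be generated rather than typed out. The sketch below rebuilds the same payload; the function names and the `cutoffs` parameter are hypothetical, and whether the service accepts cutoffs other than 5 and 10 is an assumption this notebook does not confirm.

```python
def beir_metrics(cutoffs=(5, 10)) -> list:
    """Build the metrics list: recall_k and ndcg_cut_k for each cutoff k."""
    metrics = []
    for k in cutoffs:
        metrics.append({"name": f"recall_{k}"})
        metrics.append({"name": f"ndcg_cut_{k}"})
    return metrics


def build_retriever_eval_config(files_url: str = "fiqa", cutoffs=(5, 10)) -> dict:
    """Assemble a retriever evaluation config in the shape used above."""
    return {
        "type": "retriever",
        "tasks": [
            {
                "type": "beir",
                "dataset": {"format": "beir", "files_url": files_url},
                "metrics": beir_metrics(cutoffs),
            }
        ],
    }
```

With the defaults, `build_retriever_eval_config()` produces the same dictionary as `retriever_eval_config`.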

Now that we have our payload - we can send it to our NeMo Evaluator endpoint.

We'll set up our Evaluator endpoint URL...

In [45]:
eval_config_endpoint = f"{EVAL_URL}/v1/evaluation/configs"
retriever_eval_response = requests.post(
    eval_config_endpoint,
    json=retriever_eval_config,
    headers={'accept': 'application/json'}
).json()

Let's again capture our evaluation config for use later.

In [46]:
retriever_config_name = retriever_eval_response["name"]
retriever_config_namespace = retriever_eval_response["namespace"]
print(f"Config Name: {retriever_config_name}, Config Namespace: {retriever_config_namespace}")

Config Name: eval-config-EksLVqPuNX8xpLYVTefhvW, Config Namespace: default


### Running an Evaluation Job

Now that we have the name and namespace of both our target and our config - we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [56]:
job_config = {
    "target": retriever_target_namespace + "/" + retriever_target_name,
    "config": retriever_config_namespace + "/" + retriever_config_name,
    "tags": [
        "embedding-fiqa"
    ]
}
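
The `namespace/name` references in the payload above can also be assembled by a small helper that checks the creation responses first, which gives a clearer error if a request failed and returned something unexpected. This is a sketch; `resource_ref` and `build_job_payload` are hypothetical names, and only the `target`, `config`, and `tags` fields are taken from the cell above.

```python
def resource_ref(resource: dict) -> str:
    """Return the 'namespace/name' reference for a created target or config."""
    missing = [key for key in ("namespace", "name") if key not in resource]
    if missing:
        # A failed creation request typically will not contain these keys.
        raise KeyError(f"response is missing {missing}; got keys {sorted(resource)}")
    return f"{resource['namespace']}/{resource['name']}"


def build_job_payload(target: dict, config: dict, tags=()) -> dict:
    """Build a job payload pointing at an existing target and config."""
    return {
        "target": resource_ref(target),
        "config": resource_ref(config),
        "tags": list(tags),
    }
```

For the retriever job this would be `build_job_payload(retriever_response, retriever_eval_response, ["embedding-fiqa"])`.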

Next, let's set the evaluation jobs endpoint.

In [57]:
job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

All that's left to do is fire off our job!

In [63]:
retriever_job_response = requests.post(
    job_endpoint,
    json=job_config,
    headers={'accept': 'application/json'}
).json()

In [64]:
retriever_job_id = retriever_job_response["id"]
print(f"Job ID: {retriever_job_id}")

Job ID: eval-AGd3e5Dz5Rr2xN8kpcPQbj


#### Monitoring

We can monitor the status of our job through the following endpoint.

In [65]:
retriever_monitoring_endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{retriever_job_id}"

In [66]:
retriever_monitoring_response = requests.get(
    retriever_monitoring_endpoint,
).json()

We can check on the status of our evaluation in the cell below. 

> NOTE: When the evaluation `status` becomes `succeeded`, the `evaluation_results` field will become populated.

In [None]:
status = retriever_monitoring_response["status"]  # plain string or nested {"status": ...}
print(status["status"] if isinstance(status, dict) else status)

Once it's done - let's look at the full results!

In [None]:
print(retriever_monitoring_response)
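
Rather than re-running the monitoring cell by hand, the job can be polled until it finishes. The sketch below makes a few assumptions that are not confirmed by this notebook: that `failed` is a terminal status alongside `succeeded`, that a 30-second polling interval is reasonable, and that the `status` field may arrive either as a plain string or as a nested object (the helper accepts both). `job_status` and `wait_for_job` are hypothetical names.

```python
import time
from typing import Callable

# "succeeded" comes from the note above; "failed" is an assumed terminal state.
TERMINAL_STATUSES = {"succeeded", "failed"}


def job_status(job: dict) -> str:
    """Extract the status string, whether it is flat or nested."""
    status = job.get("status")
    if isinstance(status, dict):
        status = status.get("status")
    return str(status)


def wait_for_job(
    fetch_job: Callable[[], dict],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `fetch_job` until the job reaches a terminal status or polls run out."""
    for attempt in range(max_polls):
        job = fetch_job()
        if job_status(job) in TERMINAL_STATUSES:
            return job
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")
```

Here it could be used as `wait_for_job(lambda: requests.get(retriever_monitoring_endpoint).json())`.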

The `evaluation_results` field will contain our `metrics` along with their name, and their score.

In [38]:
retriever_monitoring_response["evaluation_results"][0]["metrics"]

[{'name': 'ndcg_cut_5',
  'value': 0.43179850619730425,
  'metadata': {'name': 'beir'}},
 {'name': 'recall_10',
  'value': 0.5212761004427672,
  'metadata': {'name': 'beir'}},
 {'name': 'ndcg_cut_10',
  'value': 0.455153721565557,
  'metadata': {'name': 'beir'}},
 {'name': 'recall_5',
  'value': 0.4460219435913878,
  'metadata': {'name': 'beir'}}]
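
The metrics come back as a list in no particular order, so it can help to index them by name and print them sorted. A minimal sketch follows, using the values shown above as sample input; `metrics_by_name` and `format_metrics` are hypothetical helper names.

```python
def metrics_by_name(metrics: list) -> dict:
    """Map each metric name to its value."""
    return {metric["name"]: metric["value"] for metric in metrics}


def format_metrics(metrics: list, digits: int = 4) -> str:
    """Render metrics as aligned 'name  value' lines, sorted by name."""
    scores = metrics_by_name(metrics)
    width = max(len(name) for name in scores)
    return "\n".join(
        f"{name:<{width}}  {scores[name]:.{digits}f}" for name in sorted(scores)
    )


# Sample input copied from the output above.
sample_metrics = [
    {"name": "ndcg_cut_5", "value": 0.43179850619730425, "metadata": {"name": "beir"}},
    {"name": "recall_10", "value": 0.5212761004427672, "metadata": {"name": "beir"}},
    {"name": "ndcg_cut_10", "value": 0.455153721565557, "metadata": {"name": "beir"}},
    {"name": "recall_5", "value": 0.4460219435913878, "metadata": {"name": "beir"}},
]
print(format_metrics(sample_metrics))
```

On a live response, pass `retriever_monitoring_response["evaluation_results"][0]["metrics"]` instead of `sample_metrics`.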

## Retriever + Reranking Evaluation on FiQA

For our second evaluation, we're going to evaluate our Retrieval Model (`nvidia/nv-embedqa-e5-v5`) on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

Instead of simply using a Retriever model, however, this example will also leverage a Reranking model (`nvidia/nv-rerankqa-mistral-4b-v3`) to rerank the retrieved results.

We'll rerun the same evaluation configuration as we did above - with one extra block in our `retriever` target configuration:

- `reranker_model`, whose `api_endpoint` holds the `url` of our reranking model and its `model_id`, the name of our reranking model

In [70]:
reranker_target_config = {
 "type": "retriever",
 "retriever": {
   "pipeline": {
     "query_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME
       }
     },
     "index_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME
       }
     },
     "reranker_model": {
       "api_endpoint": {
         "url": RERANKER_URL,
         "model_id": RERANKER_MODEL_NAME
       }
     },
     "top_k": 10
   }
 }
}
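
Since this target differs from the first one only by the `reranker_model` block, both payloads can be produced by one builder. This is a sketch under that observation; `api_endpoint` and `build_retriever_target` are hypothetical helper names, and the payload shape is copied from the two target configurations in this notebook.

```python
from typing import Optional


def api_endpoint(url: str, model_id: str) -> dict:
    """Wrap a URL and model name in the api_endpoint structure used above."""
    return {"api_endpoint": {"url": url, "model_id": model_id}}


def build_retriever_target(
    embedding_url: str,
    embedding_model: str,
    top_k: int = 10,
    reranker_url: Optional[str] = None,
    reranker_model: Optional[str] = None,
) -> dict:
    """Build a retriever target, adding a reranker only when one is given."""
    if (reranker_url is None) != (reranker_model is None):
        raise ValueError("provide both reranker_url and reranker_model, or neither")
    pipeline = {
        # The same deployment serves both query and index embeddings.
        "query_embedding_model": api_endpoint(embedding_url, embedding_model),
        "index_embedding_model": api_endpoint(embedding_url, embedding_model),
    }
    if reranker_url is not None:
        pipeline["reranker_model"] = api_endpoint(reranker_url, reranker_model)
    pipeline["top_k"] = top_k
    return {"type": "retriever", "retriever": {"pipeline": pipeline}}
```

Calling it with only `EMBEDDING_URL` and `EMBEDDING_MODEL_NAME` gives the first target; adding `reranker_url=RERANKER_URL, reranker_model=RERANKER_MODEL_NAME` gives the one above.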

We'll want to point our request at the `v1/evaluation/targets` endpoint to create the target.

In [71]:
target_endpoint = f"{EVAL_URL}/v1/evaluation/targets"

Then we are clear to fire off the request!

In [72]:
reranker_response = requests.post(
    target_endpoint,
    json=reranker_target_config,
    headers={'accept': 'application/json'}
).json()

We'll capture our target name and namespace for the coming steps - but with this step we have created our target and are ready to run an evaluation!

In [73]:
reranker_target_name = reranker_response["name"]
reranker_target_namespace = reranker_response["namespace"]
print(f"Target Name: {reranker_target_name}, Target Namespace: {reranker_target_namespace}")

Target Name: eval-target-26E1Gq39aVLL1mSvSxANhN, Target Namespace: default


With our target created, we're ready to send a job to our NeMo Evaluator endpoint.

> NOTE: Notice how we don't have to re-create our evaluation configuration since we already created it for the Embedding model evaluation!

### Running an Evaluation Job

Now that we have the name and namespace of both our target and our config - we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [74]:
reranker_job_config = {
    "target": reranker_target_namespace + "/" + reranker_target_name,
    "config": retriever_config_namespace + "/" + retriever_config_name,
    "tags": [
        "embedding-rerank-fiqa"
    ]
}

Next, let's set the evaluation jobs endpoint.

In [75]:
job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

All that's left to do is fire off our job!

In [76]:
reranker_job_response = requests.post(
    job_endpoint,
    json=reranker_job_config,
    headers={'accept': 'application/json'}
).json()

In [78]:
reranker_job_id = reranker_job_response["id"]
print(f"Job ID: {reranker_job_id}")

Job ID: eval-XsxKZPjeRmzpga8GPPkbxJ


#### Monitoring

We can monitor the status of our job through the following endpoint.

In [79]:
reranker_monitoring_endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{reranker_job_id}"

In [80]:
reranker_monitoring_response = requests.get(
    reranker_monitoring_endpoint,
).json()

We can check on the status of our evaluation in the cell below. 

> NOTE: When the evaluation `status` becomes `succeeded`, the `evaluation_results` field will become populated.

In [None]:
status = reranker_monitoring_response["status"]  # plain string or nested {"status": ...}
print(status["status"] if isinstance(status, dict) else status)

Once it's done - let's look at the full results!

In [None]:
print(reranker_monitoring_response)

The `evaluation_results` field will contain our `metrics` along with their name, and their score.

In [None]:
reranker_monitoring_response["evaluation_results"][0]["metrics"]
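
Once both jobs have succeeded, the effect of reranking can be read off by subtracting the retriever-only scores from the reranked ones, metric by metric. The sketch below assumes both responses use the metrics shape shown in the first evaluation; `metric_deltas` is a hypothetical helper name, and no reranker results are shown here because this notebook does not include them.

```python
def metric_deltas(baseline: list, candidate: list) -> dict:
    """Return candidate minus baseline for every metric present in both lists."""
    base = {metric["name"]: metric["value"] for metric in baseline}
    cand = {metric["name"]: metric["value"] for metric in candidate}
    shared = sorted(base.keys() & cand.keys())
    return {name: cand[name] - base[name] for name in shared}
```

A positive delta means the reranked pipeline scored higher. It could be called as `metric_deltas(retriever_monitoring_response["evaluation_results"][0]["metrics"], reranker_monitoring_response["evaluation_results"][0]["metrics"])`.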

## Retrieval Augmented Generation (RAG) Evaluation on FiQA with Ragas Metrics

With the most recent release of NeMo Evaluator microservice, not only can we evaluate Retrievers and Rerankers - we can also evaluate RAG pipelines!

Once again, we're going to evaluate on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

We're also going to evaluate our RAG pipeline on the [Ragas](https://docs.ragas.io/en/stable/howtos/index.html) metric ["Faithfulness"](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html). This can be done by extending our evaluation configuration in the following ways:

1. We can create a target of type `rag`, and provide the `retriever` configuration we used in the first evaluation.
2. We need to provide a `context_ordering` parameter, in this case we'll use `desc` which will order our context in descending score.
3. We need to provide a "generator" (LLM) that can be used to generate responses based on the retrieved context!

We'll also need to add in a number of `judge_` parameters to help calculate the Faithfulness metric.

Let's look at an example target configuration below:

In [126]:
rag_target_config = {
 "type": "rag",
 "rag": {
   "pipeline": {
     "retriever": {
       "pipeline": {
         "query_embedding_model": {
           "api_endpoint": {
             "url": EMBEDDING_URL,
             "model_id": EMBEDDING_MODEL_NAME
           }
         },
         "index_embedding_model": {
           "api_endpoint": {
             "url": EMBEDDING_URL,
             "model_id": EMBEDDING_MODEL_NAME
           }
         }
       }
     },
     "model": {
       "api_endpoint": {
         "url": LLM_URL,
         "model_id": LLM_MODEL_NAME
       }
     }
   }
 }
}

We'll want to point our request at the `v1/evaluation/targets` endpoint to create the target.

In [127]:
target_endpoint = f"{EVAL_URL}/v1/evaluation/targets"

Then we are clear to fire off the request!

In [128]:
rag_response = requests.post(
    target_endpoint,
    json=rag_target_config,
    headers={'accept': 'application/json'}
).json()

We'll capture our target name and namespace for the coming steps - but with this step we have created our target and are ready to create an evaluation configuration!

In [129]:
rag_target_name = rag_response["name"]
rag_target_namespace = rag_response["namespace"]
print(f"Target Name: {rag_target_name}, Target Namespace: {rag_target_namespace}")

Target Name: eval-target-N1yeAWk2HU9aypHubpvdYW, Target Namespace: default


Now we can grab our evaluation configuration.

In [130]:
rag_eval_config = {
 "type": "rag",
 "tasks": [
   {
     "type": "beir",
     "params": {
       "judge_llm": {
         "api_endpoint": {
           "url": LLM_URL,
           "model_id": LLM_MODEL_NAME
         }
       },
       "judge_embeddings": {
         "api_endpoint": {
           "url": EMBEDDING_URL,
           "model_id": EMBEDDING_MODEL_NAME
         }
       },
       "judge_timeout": 300,
       "judge_max_retries": 5,
       "judge_max_workers": 16
     },
     "dataset": {
       "files_url": "fiqa",
       "format": "beir"
     },
     "metrics": [
       {
         "name": "recall_5"
       },
       {
         "name": "ndcg_cut_5"
       },
       {
         "name": "recall_10"
       },
       {
         "name": "ndcg_cut_10"
       },
       {
         "name": "faithfulness"
       }
     ]
   }
 ]
}


Now that we have our payload - we can send it to our NeMo Evaluator endpoint.

We'll set up our Evaluator endpoint URL...

In [132]:
eval_config_endpoint = f"{EVAL_URL}/v1/evaluation/configs"
rag_eval_response = requests.post(
    eval_config_endpoint,
    json=rag_eval_config,
    headers={'accept': 'application/json'}
).json()

Let's again capture our evaluation config for use later.

In [133]:
rag_config_name = rag_eval_response["name"]
rag_config_namespace = rag_eval_response["namespace"]
print(f"Config Name: {rag_config_name}, Config Namespace: {rag_config_namespace}")

Config Name: eval-config-F9xnHtPdwPYFMLjuTdcRpR, Config Namespace: default


### Running an Evaluation Job

Now that we have the name and namespace of both our target and our config - we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [134]:
rag_job_config = {
    "target": rag_target_namespace + "/" + rag_target_name,
    "config": rag_config_namespace + "/" + rag_config_name,
    "tags": [
        "rag-eval"
    ]
}

Next, let's set the evaluation jobs endpoint.

In [135]:
job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

All that's left to do is fire off our job!

In [136]:
rag_job_response = requests.post(
    job_endpoint,
    json=rag_job_config,
    headers={'accept': 'application/json'}
).json()

In [138]:
rag_job_id = rag_job_response["id"]
print(f"Job ID: {rag_job_id}")

Job ID: eval-SfNDFM4Ei28bp5GrYzzNy5


#### Monitoring

We can monitor the status of our job through the following endpoint.

In [139]:
rag_monitoring_endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{rag_job_id}"

In [140]:
rag_monitoring_response = requests.get(
    rag_monitoring_endpoint,
).json()

In [None]:
rag_monitoring_response

We can check on the status of our evaluation in the cell below. 

> NOTE: When the evaluation `status` becomes `succeeded`, the `evaluation_results` field will become populated.

In [142]:
print(rag_monitoring_response["status"])

initializing


Once it's done - let's look at the full results!

In [None]:
print(rag_monitoring_response)

The `evaluation_results` field will contain our `metrics` along with their name, and their score.

In [None]:
rag_monitoring_response["evaluation_results"][0]["metrics"]

## Retrieval Augmented Generation (RAG) Evaluation on Synthetically Generated Data with Ragas Metrics

For our final evaluation, we're going to be leveraging work done in [this](https://github.com/NVIDIA/GenerativeAIExamples/blob/main/nemo/retriever-synthetic-data-generation/notebooks/quickstart.ipynb) notebook to build a BeIR-format dataset with Synthetic Data Generation.

The output from the above notebook should be a dataset with the following items which can be found in the `outputs/sample_synthetic_data/beir/filtered/synthetic` directory after running the notebook:

- `corpus.jsonl`
- `qrels/test.tsv`
- `queries.jsonl`

This notebook assumes you've run the above notebook and have moved the `outputs/sample_synthetic_data/beir/filtered/synthetic` directory into the root folder of this notebook.
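
Before uploading, it can be worth confirming that the folder really contains the three files listed above. The following is a minimal sketch of such a check; `missing_beir_files` is a hypothetical helper, and the `./synthetic` path in the usage note assumes the directory was moved as described.

```python
from pathlib import Path

# File layout taken from the list above.
REQUIRED_BEIR_FILES = ("corpus.jsonl", "queries.jsonl", "qrels/test.tsv")


def missing_beir_files(folder) -> list:
    """Return the required BeIR files that are absent from `folder`."""
    root = Path(folder)
    return [rel for rel in REQUIRED_BEIR_FILES if not (root / rel).is_file()]
```

An empty list from `missing_beir_files("./synthetic")` means the folder has the expected layout; anything else names the files still to be copied over.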

We'll use the following utility function to upload the folder contents to the NeMo Datastore microservice under the name "SDG_BEIR_DATASET".

In [None]:
import huggingface_hub as hh
import requests

DATASTORE_URL = "<< YOUR DATASTORE URL >>"

# This token is not used by the NeMo Datastore (NDS), so it can be any value.
token = "mock"

repo_name = "nvidias/sdg_beir"
repo_type = "dataset"
dir_path = "./synthetic"

hf_api = hh.HfApi(endpoint=DATASTORE_URL, token=token)

# create repo
hf_api.create_repo(
    repo_id=repo_name,
    repo_type=repo_type,
)

# upload dir
path_in_repo = "."
result = hf_api.upload_folder(repo_id=repo_name, folder_path=dir_path, path_in_repo=path_in_repo, repo_type=repo_type)

print(f"Dataset folder uploaded to: {result}")

We can once again create our RAG evaluation configuration while making one small change, which is to simply point at the newly uploaded dataset.

Also, since we already have our target created - we do not need to reinitialize it - we can simply create a new evaluation configuration for this target!

Also - we can use OpenAI API compatible models as our judge, like OpenAI's `gpt-4` model. However, we will need to provide our OpenAI API key. Let's do that below.

In [105]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Please provide your OpenAI API key!")

Now we can create our new evaluation configuration!

In [143]:
rag_gpt4_eval_config = {
 "type": "rag",
 "tasks": [
   {
     "type": "beir",
     "params": {
       "judge_llm": {
         "api_endpoint": {
           "url": "https://api.openai.com/v1/chat/completions",
           "model_id": "gpt-4",
           "api_key": os.environ["OPENAI_API_KEY"]
         }
       },
       "judge_embeddings": {
         "api_endpoint": {
           "url": "https://api.openai.com/v1/embeddings",
           "model_id": "text-embedding-3-small",
           "api_key": os.environ["OPENAI_API_KEY"]
         }
       },
       "judge_timeout": 300,
       "judge_max_retries": 5,
       "judge_max_workers": 16
     },
     "dataset": {
       "files_url": "nds:SDG_BEIR_DATASET",
       "format": "beir"
     },
     "metrics": [
       {
         "name": "recall_5"
       },
       {
         "name": "ndcg_cut_5"
       },
       {
         "name": "recall_10"
       },
       {
         "name": "ndcg_cut_10"
       },
       {
         "name": "faithfulness"
       }
     ]
   }
 ]
}
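
Because this configuration embeds the OpenAI API key, printing or logging it as-is would expose the key. The sketch below returns a masked copy that is safe to display; `redact_secrets` is a hypothetical helper, and treating `api_key` as the only sensitive field is an assumption based on the payload above.

```python
def redact_secrets(value, secret_keys=("api_key",), mask="***"):
    """Return a copy of a nested dict/list with secret fields masked."""
    if isinstance(value, dict):
        return {
            key: mask if key in secret_keys else redact_secrets(item, secret_keys, mask)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_secrets(item, secret_keys, mask) for item in value]
    # Scalars (strings, numbers, None) are returned unchanged.
    return value
```

For example, `print(redact_secrets(rag_gpt4_eval_config))` shows the structure without the key, while the original dictionary is left untouched for the request.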

In [144]:
eval_config_endpoint = f"{EVAL_URL}/v1/evaluation/configs"
rag_gpt4_eval_response = requests.post(
    eval_config_endpoint,
    json=rag_gpt4_eval_config,
    headers={'accept': 'application/json'}
).json()
rag_gpt4_config_name = rag_gpt4_eval_response["name"]
rag_gpt4_config_namespace = rag_gpt4_eval_response["namespace"]
print(f"Config Name: {rag_gpt4_config_name}, Config Namespace: {rag_gpt4_config_namespace}")

Config Name: eval-config-AUCsyQzmD5ZMqamx1hQ3ZE, Config Namespace: default


### Running an Evaluation Job

Now that we have the name and namespace of both our target and our config - we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [145]:
rag_gpt4_job_config = {
    "name": "rag-gpt4-eval",
    "target": rag_target_namespace + "/" + rag_target_name,
    "config": rag_gpt4_config_namespace + "/" + rag_gpt4_config_name,
    "tags": [
        "rag-gpt4-eval"
    ]
}

Next, let's set the evaluation jobs endpoint.

In [146]:
rag_gpt4_job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

All that's left to do is fire off our job!

In [147]:
rag_gpt4_job_response = requests.post(
    rag_gpt4_job_endpoint,
    json=rag_gpt4_job_config,
    headers={'accept': 'application/json'}
).json()

In [148]:
rag_gpt4_job_id = rag_gpt4_job_response["id"]
print(f"Job ID: {rag_gpt4_job_id}")

Job ID: eval-EHm3XUkrN2egwsos345rrE


#### Monitoring

We can monitor the status of our job through the following endpoint.

In [150]:
rag_gpt4_monitoring_endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{rag_gpt4_job_id}"

In [151]:
rag_gpt4_monitoring_response = requests.get(
    rag_gpt4_monitoring_endpoint,
).json()

We can check on the status of our evaluation in the cell below. 

> NOTE: When the evaluation `status` becomes `succeeded`, the `evaluation_results` field will become populated.

In [152]:
print(rag_gpt4_monitoring_response["status"])

running


Once it's done - let's look at the full results!

In [None]:
print(rag_gpt4_monitoring_response)