# NeMo Evaluator Microservice Tutorial

## About this tutorial

In this tutorial, we will cover the following evaluation types using NeMo Evaluator:

- **Agentic Evaluation**
- **LLM Evaluation on Academic Benchmarks**
- **Custom Evaluations**
  - **Similarity Metrics Evaluation**
  - **LLM-as-Judge Evaluation**
  - **Tool Calling Evaluation**
- **Retriever Pipeline Evaluation**
- **RAG Pipeline Evaluation**

## 1. Prerequisites

### 1.1 Install NeMo Microservices

While this tutorial focuses on NeMo Evaluator, we recommend installing the NeMo Microservices platform using the [NeMo Microservices Platform Helm Chart](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo-microservices/helm-charts/nemo-microservices-helm-chart) to avoid managing dependencies manually. For a step-by-step installation guide, refer to the [Demo Cluster Setup Guide](https://docs.nvidia.com/nemo/microservices/latest/get-started/setup/index.html). This tutorial was run on a single-node cluster with two A100 GPUs.

Check the pods to make sure all required microservices are running before proceeding.

In [1]:
! kubectl get pods

NAME                                                          READY   STATUS      RESTARTS      AGE
model-downloader-meta-llama-3-1-8b-instruct-2-0-28trx         0/1     Completed   0             10m
model-downloader-meta-llama-3-2-1b-instruct-2-0-b6scn         0/1     Completed   0             10m
modeldeployment-meta-llama-3-1-8b-instruct-6b64d56fdc-slctn   1/1     Running     0             9m26s
nemo-argo-workflows-server-655f8d755-svgn2                    1/1     Running     0             12m
nemo-argo-workflows-workflow-controller-8f8877cd4-8t2tf       1/1     Running     0             12m
nemo-customizer-5d8554fcf6-rhwfp                              1/1     Running     2 (11m ago)   12m
nemo-customizerdb-0                                           1/1     Running     0             12m
nemo-data-store-795ccbb97b-nwcf2                              1/1     Running     0             12m
nemo-deployment-management-646cc67c-l67lq                     1/1     Running     0             12
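
Scanning the pod list by eye gets tedious on larger clusters. The following is a minimal sketch, not part of the NeMo tooling, that parses `kubectl get pods` text and reports pods that are neither fully ready nor completed. The function name and the sample pod names are illustrative assumptions.

```python
def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return names of pods that are neither fully ready nor completed."""
    problems = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, ready, status = fields[0], fields[1], fields[2]
        if status == "Completed":
            continue  # finished one-shot pods such as model downloaders
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


# Hypothetical sample output; the pod names are made up for illustration.
sample = """\
NAME                          READY   STATUS             RESTARTS      AGE
model-downloader-28trx        0/1     Completed          0             10m
nemo-customizer-rhwfp         1/1     Running            2 (11m ago)   12m
nemo-evaluator-abc12          0/1     CrashLoopBackOff   5             12m
"""
print(unhealthy_pods(sample))
```

To run it against a live cluster, pass in the text returned by `subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True).stdout`.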

Install `huggingface_hub`, which is required to interact with NeMo Data Store:

```bash
pip install -U "huggingface_hub[cli]"
```


In [2]:
import requests
import json
import os
from pprint import pp
from huggingface_hub import HfApi



Specify the Namespace and API endpoints:

In [4]:
NDS_URL = "http://data-store.test" # Data Store
NEMO_URL = "http://nemo.test" # Customizer, Entity Store, Evaluator
NIM_URL = "http://nim.test" # NIM Proxy
NMS_NAMESPACE = "nemo-eval-tutorial"

target_url = f"{NEMO_URL}/v1/evaluation/targets"
config_url = f"{NEMO_URL}/v1/evaluation/configs"
job_url = f"{NEMO_URL}/v1/evaluation/jobs"
llm_chat_completion_url = f"{NIM_URL}/v1/chat/completions"

### 1.2 Deploy NIM for LLMs

This tutorial uses the `Llama-3.1-8b-instruct` model as the LLM to be evaluated. You can either deploy a `Llama-3.1-8b-instruct` NIM locally or use a remotely hosted NIM. As part of the NeMo Microservices Platform, the NeMo Deployment Management service provides an API to deploy NIMs on a Kubernetes cluster and manage them through the NIM Operator microservice. The cells below show how to deploy the `Llama-3.1-8b-instruct` NIM and run inference through the NIM Proxy service.

**Note**: If you see a pod named `modeldeployment-meta-llama-3-1-8b-instruct-xxx` in the list of pods above, a Llama-3.1-8b-instruct NIM has already been deployed and you can skip this step.

In [None]:
deployment_url = f"{NEMO_URL}/v1/deployment/model-deployments"

payload = {
    "name": "llama-3.1-8b-instruct",
    "namespace": "meta",
    "config": {
        "model": "meta/llama-3.1-8b-instruct",
        "nim_deployment": {
            "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
            "image_tag": "1.8.3",
            "pvc_size": "25Gi",
            "gpu": 1,
            "additional_envs": {
                "NIM_GUIDED_DECODING_BACKEND": "outlines"
            }
        }
    }
}

headers = {
    "Content-Type": "application/json"
}

resp = requests.post(deployment_url, json=payload, headers=headers)
pp(resp.json())

Check the deployment status and make sure the status is 'ready' before proceeding.

In [None]:
# A GET request needs no JSON body, so the deployment payload is not sent here.
resp = requests.get(f"{NEMO_URL}/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct")
pp(resp.json())

We can now test LLM inference against the NIM endpoint:

In [5]:
payload = {
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {
            "role": "user",
            "content": "What is the purpose of LLM token log probabilities? Answer with a single sentence."
        }
    ],
    "stream": False,
    "temperature": 0.0
}

headers = {
    "Content-Type": "application/json",
}

resp = requests.post(llm_chat_completion_url, json=payload, headers=headers)
pp(resp.json()['choices'][0]['message'])

{'role': 'assistant',
 'content': 'LLM (Large Language Model) token log probabilities are used to '
            "represent the model's confidence in its predictions, with lower "
            'probabilities indicating less likely or less confident '
            'predictions.'}


### 1.3 Set Up API Key and Access Token

This tutorial uses a remote LLM hosted on [build.nvidia.com](https://build.nvidia.com/) for evaluations that require an LLM as a judge. To access these models, we need to set up an API key. You can generate an NVIDIA API key at [Manage API Keys](https://build.nvidia.com/settings/api-keys).

In [6]:
from getpass import getpass

os.environ['NVIDIA_API_KEY'] = getpass("Enter your NVIDIA API Key")

Enter your NVIDIA API Key ········


Next, we need to set up a Hugging Face access token. The token must have access to Meta's Llama-3.1-8B-Instruct model.

In [7]:
os.environ['HF_Token'] = getpass("Enter your Hugging Face Token")

Enter your Hugging Face Token ········


### 1.4 Running Evaluation Jobs with NeMo Evaluator

Before running evaluations, it is important to understand the typical NeMo Evaluator workflow:

1.	(Optional) Upload your custom dataset to the NeMo Data Store if you’re not using a built-in dataset.
2.	Create an evaluation configuration in NeMo Evaluator.
3.	Define an evaluation target (the model to evaluate).
4.	Submit an evaluation job to NeMo Evaluator. The following steps occur automatically:

	a. NeMo Evaluator retrieves any required custom data from the NeMo Data Store.

	b. It runs inference using NIM, supporting LLMs, embeddings, and reranking tasks.

	c. Results, including generations, logs, and metrics, are written to the NeMo Data Store.

	d. The results are returned to the user.

5.	Review evaluation results.
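
Steps 3 and 4 amount to posting small JSON payloads. The sketch below builds two of them as plain dictionaries, with field names that mirror the payloads used later in this tutorial. The helper names are illustrative assumptions and not part of any NeMo API.

```python
def cached_outputs_target(namespace: str, name: str, dataset: str, filename: str) -> dict:
    """Build a target payload that points at pre-generated outputs in Data Store."""
    return {
        "type": "cached_outputs",
        "name": name,
        "namespace": namespace,
        "cached_outputs": {
            "files_url": f"hf://datasets/{namespace}/{dataset}/{filename}",
        },
    }


def job_payload(namespace: str, target_name: str, config_name: str) -> dict:
    """Build a job payload that references a target and a config by namespace/name."""
    return {
        "namespace": namespace,
        "target": f"{namespace}/{target_name}",
        "config": f"{namespace}/{config_name}",
    }


print(job_payload("nemo-eval-tutorial", "agent-goal-target", "agentic-goal-accuracy"))
```

Keeping the payloads in small builder functions makes it harder to mistype the `namespace/name` references when several targets and configs are in play.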

## 2. Agentic Evaluation

Agentic evaluation uses RAGAS metrics to score agent outputs. RAGAS is a library for evaluating retrieval-augmented generation and agentic workflows using standardized, research-backed metrics.

Each task in an agentic evaluation computes the metric selected in the job configuration, such as topic adherence, tool call accuracy, agent goal accuracy, or answer accuracy.

### 2.1 Upload Custom Data to NeMo Data Store

Before uploading the data, we first need to create a namespace in NeMo Data Store and Entity Store.

In [8]:
from helpers import create_namespaces, setup_dataset_repo

DATASET_NAME = "agent_eval"
create_namespaces(NEMO_URL, NDS_URL, NMS_NAMESPACE)
HF_API = HfApi(endpoint=f"{NDS_URL}/v1/hf", token="")
repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)

Follow this [dataset format](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-types/agentic.html#options) to prepare the dataset for agentic evaluation. Example datasets are provided at `./eval_dataset/agent_data`. Next, we will upload these example datasets to Data Store.
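
Before uploading, it can help to confirm that a `.jsonl` file is well formed. The following is a minimal sketch that assumes each non-empty line should be a single JSON object. It does not check the fields required by the agentic dataset format, which are described in the linked documentation. The function name is an illustrative assumption.

```python
import json


def check_jsonl(text: str) -> list[str]:
    """Return a list of problems found in JSONL text (an empty list means OK)."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(
                f"line {lineno}: expected a JSON object, got {type(record).__name__}"
            )
    return problems


print(check_jsonl('{"a": 1}\nnot json\n[1, 2]\n'))
```

For the example data, the text could be read with `pathlib.Path("./eval_dataset/agent_data/agent_goal_data.jsonl").read_text()`.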

In [14]:
HF_API.upload_file(path_or_fileobj=os.path.join("./eval_dataset/agent_data", "agent_goal_data.jsonl"),
    path_in_repo="agent_goal_data.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

HF_API.upload_file(path_or_fileobj=os.path.join("./eval_dataset/agent_data", "agent_tool_call_data.jsonl"),
    path_in_repo="agent_tool_call_data.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

agent_tool_call_data.jsonl: 100%|██████████| 3.20k/3.20k [00:00<00:00, 709kB/s]


CommitInfo(commit_url='', commit_message='Upload agent_tool_call_data.jsonl with huggingface_hub', commit_description='', oid='659c7516a96482859db4f4707a32b4a8c5b78ba9', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

### 2.2 Create Evaluation Targets

Currently, agentic evaluation only works with `cached_outputs` targets, which point to files stored in NeMo Data Store that contain pre-generated answers.

We will create two evaluation targets for agentic evaluations: one for agent goal accuracy and one for tool calling accuracy.

In [15]:
payload = {
    "type": "cached_outputs",
    "name": "agent-goal-target",
    "namespace": NMS_NAMESPACE,
    "cached_outputs": {
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/agent_goal_data.jsonl",
        }
}

headers = {
    "Content-Type": "application/json",
}


resp = requests.post(target_url, json=payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-08T19:18:37.063092',
 'updated_at': '2025-07-08T19:18:37.063095',
 'name': 'agent-goal-target',
 'namespace': 'nemo-eval-tutorial',
 'type': 'cached_outputs',
 'cached_outputs': {'files_url': 'hf://datasets/nemo-eval-tutorial/agent_eval/agent_goal_data.jsonl'},
 'id': 'eval-target-GDJ771rB6W3xaowbpE4Bsh',
 'custom_fields': {}}


In [16]:
payload = {
    "type": "cached_outputs",
    "name": "agent-tool-call-target",
    "namespace": NMS_NAMESPACE,
    "cached_outputs": {
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/agent_tool_call_data.jsonl",
        }
}

resp = requests.post(target_url, json=payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-08T19:18:38.848137',
 'updated_at': '2025-07-08T19:18:38.848149',
 'name': 'agent-tool-call-target',
 'namespace': 'nemo-eval-tutorial',
 'type': 'cached_outputs',
 'cached_outputs': {'files_url': 'hf://datasets/nemo-eval-tutorial/agent_eval/agent_tool_call_data.jsonl'},
 'id': 'eval-target-HBLMJtD4MhBzV6S3AGor2g',
 'custom_fields': {}}


### 2.3 Create Evaluation Configs

Similarly, we will create two evaluation configs: one for agent goal accuracy and one for tool calling accuracy.

In [17]:
payload = {
  "type": "agentic",
  "name": "agentic-goal-accuracy",
  "namespace": NMS_NAMESPACE,
  "tasks": {
    "goal-accuracy": {
      "type": "agent_goal_accuracy_with_reference",
      "params": {
        "judge": {
          "model": {
            "url": "https://integrate.api.nvidia.com/v1",
            "model_id": "meta/llama-3.3-70b-instruct",
            "api_key": os.environ["NVIDIA_API_KEY"]
          },
          "inference_params": {
            "max_new_tokens": 4024,
            "max_retries": 10,
            "request_timeout": 10,
            "temperature": 0.1
          }
        }
      }
    }
  }
}
headers = {
    "accept": "application/json",
    "Content-Type": "application/json"
}

resp = requests.post(config_url, json=payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-08T19:19:13.374904',
 'updated_at': '2025-07-08T19:19:13.374906',
 'name': 'agentic-goal-accuracy',
 'namespace': 'nemo-eval-tutorial',
 'type': 'agentic',
 'tasks': {'goal-accuracy': {'type': 'agent_goal_accuracy_with_reference',
                             'params': {'judge': {'model': {'url': 'https://integrate.api.nvidia.com/v1',
                                                            'model_id': 'meta/llama-3.3-70b-instruct',
                                                            'api_key': '******'},
                                                  'inference_params': {'max_new_tokens': 4024,
                                                                       'max_retries': 10,
                                                                       'request_timeout': 10,
                                                                       'temperature': 0.1}}}}},
 'id': 'eval-config-MCeuvELVrXYuZQLZWE2tkV',
 'custom_fields': {}}


In [18]:
payload = {
  "type": "agentic",
  "name": "agentic-tool-call-accuracy",
  "namespace": NMS_NAMESPACE,
  "tasks": {
    "tool-call-accuracy": {
      "type": "tool_call_accuracy",
    }
  }
}

resp = requests.post(config_url, json=payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-08T19:19:30.170735',
 'updated_at': '2025-07-08T19:19:30.170738',
 'name': 'agentic-tool-call-accuracy',
 'namespace': 'nemo-eval-tutorial',
 'type': 'agentic',
 'tasks': {'tool-call-accuracy': {'type': 'tool_call_accuracy'}},
 'id': 'eval-config-Jr4uE5uUonn3YyfwUHKKSa',
 'custom_fields': {}}


### 2.4 Submit Evaluation Job

To launch the evaluation job, we simply send a request with the previously created evaluation targets and configs to the `/jobs` API endpoint.

In [28]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/agent-goal-target",
    "config": f"{NMS_NAMESPACE}/agentic-goal-accuracy"
}

resp = requests.post(job_url, json=payload, headers=headers)
agent_goal_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-08T19:28:46.640661',
 'updated_at': '2025-07-08T19:28:46.640663',
 'id': 'eval-PLUWikEJpyH7WkcYj6sWZb',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-GDJ771rB6W3xaowbpE4Bsh',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-08T19:18:37.063092',
            'updated_at': '2025-07-08T19:18:37.063095',
            'custom_fields': {},
            'ownership': None,
            'name': 'agent-goal-target',
            'type': 'cached_outputs',
            'cached_outputs': {'files_url': 'hf://datasets/nemo-eval-tutorial/agent_eval/agent_goal_data.jsonl'},
            'model': None,
            'retriever': None,
            'rag': None,
            'rows': None,
            'dataset': None},
 'config': {'schema_version': '1.0',
            'id': 'eval-c

In [24]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/agent-tool-call-target",
    "config": f"{NMS_NAMESPACE}/agentic-tool-call-accuracy"
}

resp = requests.post(job_url, json=payload, headers=headers)
agent_tool_call_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-08T19:27:35.584648',
 'updated_at': '2025-07-08T19:27:35.584650',
 'id': 'eval-N9HsrJbRXjWKefR6ZSi3oo',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-HBLMJtD4MhBzV6S3AGor2g',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-08T19:18:38.848137',
            'updated_at': '2025-07-08T19:18:38.848149',
            'custom_fields': {},
            'ownership': None,
            'name': 'agent-tool-call-target',
            'type': 'cached_outputs',
            'cached_outputs': {'files_url': 'hf://datasets/nemo-eval-tutorial/agent_eval/agent_tool_call_data.jsonl'},
            'model': None,
            'retriever': None,
            'rag': None,
            'rows': None,
            'dataset': None},
 'config': {'schema_version': '1.0',
            'id
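
The cells above read `resp.json()["id"]` directly, which raises an unhelpful `KeyError` if the request was rejected. A minimal sketch of a more defensive variant follows. The helper name is an illustrative assumption, and the shape of an error response body is not specified in this tutorial, so the sketch only echoes it back.

```python
def extract_job_id(status_code: int, body: dict) -> str:
    """Return the job ID from a job-creation response, or raise with context."""
    if status_code >= 400:
        raise RuntimeError(f"job submission failed with HTTP {status_code}: {body}")
    job_id = body.get("id")
    if not job_id:
        raise RuntimeError(f"response has no 'id' field: {body}")
    return job_id


print(extract_job_id(200, {"id": "eval-PLUWikEJpyH7WkcYj6sWZb"}))
```

In the notebook this could be used as `agent_goal_eval_job_id = extract_job_id(resp.status_code, resp.json())`.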

### 2.5 Monitor Job Status and Get Evaluation Results

We can monitor the job status and retrieve the evaluation results by sending requests with the job ID to the `/status` and `/results` endpoints, respectively.

In [33]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{agent_goal_eval_job_id}/status")
pp(resp.json())

{'message': None, 'task_status': {}, 'progress': 100.0}


In [34]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{agent_goal_eval_job_id}/results")
pp(resp.json()['tasks'])

{'goal-accuracy': {'metrics': {'agent_goal_accuracy': {'scores': {'agent_goal_accuracy': {'value': 1.0}}}}}}


In [25]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{agent_tool_call_eval_job_id}/status")
pp(resp.json())

{'message': None, 'task_status': {}, 'progress': 100.0}


In [27]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{agent_tool_call_eval_job_id}/results")
pp(resp.json()['tasks'])

{'tool-call-accuracy': {'metrics': {'tool_call_accuracy': {'scores': {'tool_call_accuracy': {'value': 1.0}}}}}}
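
Rather than re-running the status cell by hand, the job can be polled in a loop. The sketch below assumes, based on the status responses shown above, that a finished job reports `progress` of 100. It does not detect failed jobs and relies on the timeout for those. The function name and default intervals are illustrative assumptions.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
) -> dict:
    """Call fetch_status repeatedly until progress reaches 100 or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        # progress may be missing or None early in a job's life
        if (status.get("progress") or 0) >= 100:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {timeout_seconds} s: {status}")
        time.sleep(poll_seconds)


# Demonstration with canned responses instead of real HTTP calls.
fake_responses = iter([{"progress": 0.0}, {"progress": 50.0}, {"progress": 100.0}])
print(wait_for_job(lambda: next(fake_responses), poll_seconds=0))
```

Against the real service, `fetch_status` could be `lambda: requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{agent_goal_eval_job_id}/status").json()`.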


**Download Evaluation Results**: downloads a directory that contains the configuration files, logs, and evaluation results for a specific evaluation job.

In [None]:
!curl -X "GET" "{NEMO_URL}/v1/evaluation/jobs/{agent_goal_eval_job_id}/download-results" \
-H 'accept: application/json' \
-o result.zip

## 3. LLM Evaluation on Academic Benchmarks

**Create Evaluation Config**

In [36]:
payload = {
    "type": "gsm8k",
    "name": "gsm8k-chat-config",
    "namespace": NMS_NAMESPACE,
    "params": {
        "temperature": 0.00001,
        "top_p": 0.00001,
        "max_tokens": 256,
        "stop": ["<|eot|>"],
        "extra": {
            "num_fewshot": 8,
            "batch_size": 16,
            "bootstrap_iters": 100000,
            "dataset_seed": 42,
            "use_greedy": True,
            "top_k": 1,
            "hf_token": os.environ['HF_Token'],
            "tokenizer_backend": "hf",
            "tokenizer": "meta-llama/llama-3.1-8B-Instruct",
            "apply_chat_template": True,
            "fewshot_as_multiturn": True
        }
    }
}
headers = {
    "accept": "application/json",
    "Content-Type": "application/json"
}

resp = requests.post(config_url, json=payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-08T19:39:55.734565',
 'updated_at': '2025-07-08T19:39:55.734568',
 'name': 'gsm8k-chat-config',
 'namespace': 'nemo-eval-tutorial',
 'type': 'gsm8k',
 'params': {'max_tokens': 256,
            'temperature': 1e-05,
            'top_p': 1e-05,
            'stop': ['<|eot|>'],
            'extra': {'num_fewshot': 8,
                      'batch_size': 16,
                      'bootstrap_iters': 100000,
                      'dataset_seed': 42,
                      'use_greedy': True,
                      'top_k': 1,
                      'hf_token': '******',
                      'tokenizer_backend': 'hf',
                      'tokenizer': 'meta-llama/llama-3.1-8B-Instruct',
                      'apply_chat_template': True,
                      'fewshot_as_multiturn': True}},
 'id': 'eval-config-6pTREGQfYHFQzq7oWHATgt',
 'custom_fields': {}}


**Create Evaluation Target**

In [37]:
payload = {
    "type": "model",
    "name": "llama-chat-target",
    "namespace": NMS_NAMESPACE,
    "model": {
        "api_endpoint": {
            "url": llm_chat_completion_url ,
            "model_id": "meta/llama-3.1-8b-instruct",
            "format": "openai"
        }
    }
}

resp = requests.post(target_url, json=payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-08T19:40:13.061228',
 'updated_at': '2025-07-08T19:40:13.061229',
 'name': 'llama-chat-target',
 'namespace': 'nemo-eval-tutorial',
 'type': 'model',
 'model': {'schema_version': '1.0',
           'id': 'model-EXZQbF8ZXbjoogTvF9UFKr',
           'type_prefix': 'model',
           'namespace': 'default',
           'created_at': '2025-07-08T19:40:13.061012',
           'updated_at': '2025-07-08T19:40:13.061015',
           'custom_fields': {},
           'name': 'model-EXZQbF8ZXbjoogTvF9UFKr',
           'version_id': 'main',
           'version_tags': [],
           'api_endpoint': {'url': 'http://nim.test/v1/chat/completions',
                            'model_id': 'meta/llama-3.1-8b-instruct',
                            'format': 'openai'}},
 'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',
 'custom_fields': {}}


**Submit Evaluation Job**

In [38]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/llama-chat-target",
    "config": f"{NMS_NAMESPACE}/gsm8k-chat-config"
}

resp = requests.post(job_url, json=payload, headers=headers)
gsm8k_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-08T19:40:19.984771',
 'updated_at': '2025-07-08T19:40:19.984774',
 'id': 'eval-WF5xprkaQpjTNZLpsNb1Y1',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-08T19:40:13.061228',
            'updated_at': '2025-07-08T19:40:13.061229',
            'custom_fields': {},
            'ownership': None,
            'name': 'llama-chat-target',
            'type': 'model',
            'cached_outputs': None,
            'model': {'schema_version': '1.0',
                      'id': 'model-EXZQbF8ZXbjoogTvF9UFKr',
                      'description': None,
                      'type_prefix': 'model',
                      'namespace': 'default',
                      'project': None,
       

We can check the status of the job using the status API. **Note that the status in the API is regularly updated only for custom evaluations. For other evaluation types, as long as the status says the job is running, it is still running.**

In [41]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{gsm8k_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully', 'task_status': {}, 'progress': 100.0}


Once the job is completed, we can check the evaluation results using the results endpoint.

In [42]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{gsm8k_eval_job_id}/results")
pp(resp.json()['tasks']['exact_match'])
pp(resp.json()['tasks']['exact_match_stderr'])

{'metrics': {'exact_match': {'scores': {'gsm8k-metric_ranking-1': {'value': 0.7664897649734648},
                                        'gsm8k-metric_ranking-3': {'value': 0.821076573161486}}}}}
{'metrics': {'exact_match_stderr': {'scores': {'gsm8k-metric_ranking-2': {'value': 0.011653286808791036},
                                               'gsm8k-metric_ranking-4': {'value': 0.010557661392901296}}}}}
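
The standard errors can be turned into an approximate confidence interval. The sketch below uses the normal approximation (value ± 1.96 × stderr for a 95% interval). It assumes that the `gsm8k-metric_ranking-1` score pairs with the `gsm8k-metric_ranking-2` standard error, which the output above does not confirm.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Return the (low, high) bounds of a normal-approximation confidence interval."""
    margin = z * stderr
    return value - margin, value + margin


# Values copied from the results above; the score/stderr pairing is an assumption.
low, high = confidence_interval(0.7664897649734648, 0.011653286808791036)
print(f"exact_match 95% CI: [{low:.3f}, {high:.3f}]")
```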


## 4. Custom Evaluations

### 4.1 Similarity Metrics Evaluation

**Upload Custom Data to NeMo Data Store**

In [43]:
# set up dataset repo
DATASET_NAME = "similarity_eval"
HF_API = HfApi(endpoint=f"{NDS_URL}/v1/hf", token="")
repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)
# upload dataset
HF_API.upload_file(path_or_fileobj=os.path.join("./eval_dataset/similarity_metrics_data", "inputs.jsonl"),
    path_in_repo="similarity_metrics/inputs.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

inputs.jsonl: 100%|██████████| 149k/149k [00:00<00:00, 28.9MB/s]


CommitInfo(commit_url='', commit_message='Upload similarity_metrics/inputs.jsonl with huggingface_hub', commit_description='', oid='6220905767521b90b36e77360dc7213c3dbbb99e', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

**Create Evaluation Config**

In [44]:
payload = {
    "type": "similarity_metrics",
    "name": "similarity-configuration",
    "namespace": NMS_NAMESPACE,
    "params": {
        "max_tokens": 200,
        "temperature": 0.7,
        "extra": {
            "top_k": 20
        }
    },
    "tasks": {
        "my-similarity-metrics-task": {
            "type": "default",
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/similarity_metrics/inputs.jsonl",
            },
            "metrics": {
                "accuracy": {"type": "accuracy"},
                "bleu": {"type": "bleu"},
                "rouge": {"type": "rouge"},
                "em": {"type": "em"},
                "f1": {"type": "f1"}
            }
        }
    }
}

resp = requests.post(config_url, json=payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-08T20:31:57.416755',
 'updated_at': '2025-07-08T20:31:57.416756',
 'name': 'similarity-configuration',
 'namespace': 'nemo-eval-tutorial',
 'type': 'similarity_metrics',
 'params': {'max_tokens': 200, 'temperature': 0.7, 'extra': {'top_k': 20}},
 'tasks': {'my-similarity-metrics-task': {'type': 'default',
                                          'metrics': {'accuracy': {'type': 'accuracy'},
                                                      'bleu': {'type': 'bleu'},
                                                      'rouge': {'type': 'rouge'},
                                                      'em': {'type': 'em'},
                                                      'f1': {'type': 'f1'}},
                                          'dataset': {'schema_version': '1.0',
                                                      'id': 'dataset-DALY8efJ78UgkXHcrfACGZ',
                                                      'namespace': 'default',
               

**Launch Evaluation Job**

In [45]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/llama-chat-target",
    "config": f"{NMS_NAMESPACE}/similarity-configuration"
}
headers = {
    "accept": "application/json",
    "Content-Type": "application/json"
}

resp = requests.post(job_url, json=payload, headers=headers)
similarity_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-08T20:32:03.445393',
 'updated_at': '2025-07-08T20:32:03.445396',
 'id': 'eval-TVYBnUrD4XhjDV4UCdXJ6y',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-08T19:40:13.061228',
            'updated_at': '2025-07-08T19:40:13.061229',
            'custom_fields': {},
            'ownership': None,
            'name': 'llama-chat-target',
            'type': 'model',
            'cached_outputs': None,
            'model': {'schema_version': '1.0',
                      'id': 'model-EXZQbF8ZXbjoogTvF9UFKr',
                      'description': None,
                      'type_prefix': 'model',
                      'namespace': 'default',
                      'project': None,
       

**Monitor Job Status and Results**

In [50]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{similarity_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully',
 'task_status': {'my-similarity-metrics-task': 'completed'},
 'progress': 100.0}


In [51]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{similarity_eval_job_id}/results")
pp(resp.json()['tasks']['my-similarity-metrics-task'])

{'metrics': {'accuracy': {'scores': {'accuracy': {'value': 0.0}}},
             'bleu': {'scores': {'bleu_score': {'value': 0.015511131876432806}}},
             'em': {'scores': {'em': {'value': 0.0}}},
             'f1': {'scores': {'f1': {'value': 0.10128911130270025}}},
             'rouge': {'scores': {'rouge_1_score': {'value': 0.1166731565559731},
                                  'rouge_2_score': {'value': 0.03311328362331498},
                                  'rouge_3_score': {'value': 0.01193015710048473},
                                  'rouge_L_score': {'value': 0.09285834070191781}}}}}
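
The results are nested as task, then metric, then score. For comparing runs, a flat mapping is often easier to work with. A minimal sketch follows, assuming the nesting shown in the output above. The function name and the `/`-joined key format are illustrative choices.

```python
def flatten_scores(tasks: dict) -> dict[str, float]:
    """Flatten {task: {metrics: {metric: {scores: {score: {value}}}}}} to {path: value}."""
    flat = {}
    for task_name, task in tasks.items():
        for metric_name, metric in task.get("metrics", {}).items():
            for score_name, score in metric.get("scores", {}).items():
                flat[f"{task_name}/{metric_name}/{score_name}"] = score["value"]
    return flat


# A subset of the results shown above.
sample_tasks = {
    "my-similarity-metrics-task": {
        "metrics": {
            "bleu": {"scores": {"bleu_score": {"value": 0.015511131876432806}}},
            "em": {"scores": {"em": {"value": 0.0}}},
        }
    }
}
for key, value in sorted(flatten_scores(sample_tasks).items()):
    print(f"{key}: {value:.4f}")
```

With a live job, pass `resp.json()['tasks']` instead of `sample_tasks`.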


### 4.2 LLM-as-Judge Evaluation

**Upload Custom Dataset to Data Store**

In [52]:
DATASET_NAME = "llm_as_judge_data"
repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)
HF_API.upload_file(
    path_or_fileobj='./eval_dataset/llm_judge_data/math_dataset.csv',
    path_in_repo="llm_as_judge/math_dataset.csv",
    repo_id=repo_id,
    repo_type='dataset',
)

math_dataset.csv: 100%|██████████| 449/449 [00:00<00:00, 108kB/s]


CommitInfo(commit_url='', commit_message='Upload llm_as_judge/math_dataset.csv with huggingface_hub', commit_description='', oid='32e22ec06ea372c6cfddf52cb733e86fd1b135ed', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

The prompt and metric templates in a custom evaluation config can reference two objects:

- **Item**: represents the current item from the dataset.
- **Sample**: contains data related to the output from the model. The `sample.output_text` property holds the completion text for completion models and the content of the first message for chat models.

The properties on the `item` object are derived from the dataset's column names (for CSV files) or keys (for JSON files).
The following rules apply to these properties:

- All non-alphanumeric characters are replaced with underscores.
- Column names are converted to lowercase.
- In case of conflicts, suffixes (`_1`, `_2`, etc.) are appended to the property names.
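
The naming rules above can be sketched in a few lines. This is an illustration of the rules as stated, not the service's actual implementation. It assumes that "alphanumeric" means ASCII letters and digits, and that the first occurrence of a name stays unchanged while later duplicates receive `_1`, `_2`, and so on.

```python
import re


def item_property_names(columns: list[str]) -> list[str]:
    """Map dataset column names to item property names following the stated rules."""
    seen: dict[str, int] = {}
    names = []
    for column in columns:
        # Replace non-alphanumeric characters with underscores, then lowercase.
        base = re.sub(r"[^0-9a-zA-Z]", "_", column).lower()
        count = seen.get(base, 0)
        seen[base] = count + 1
        # Assumed conflict handling: first occurrence keeps the plain name.
        names.append(base if count == 0 else f"{base}_{count}")
    return names


print(item_property_names(["Question", "Reference Answer", "reference-answer"]))
```

Under these assumptions, a CSV column named `Reference Answer` would be available in templates as `{{item.reference_answer}}`.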


**Templates for Chat Models**

Prompt templates are used to structure tasks for evaluating the performance of models, specifically following the NIM/OpenAI format for chat-completion tasks. Templates use the Jinja2 templating syntax. Variables are represented using double-curly brackets, for example, `{{item.review}}`.

**Create Eval Config**

We will use `llama-3.3-70b-instruct` hosted on build.nvidia.com as the judge model.

In [53]:
payload = {
  "type": "custom",
  "namespace": NMS_NAMESPACE,
  "name": "custom_llm_as_judge_config",
  "tasks": {
    "qa": {
      "type": "completion",
      "params": {
        "template": {
          "messages": [{
            "role": "system",
            "content": "You are a helpful, respectful and honest assistant. \nAnswers the following question as briefly as you can.\n."
            }, 
            { 
            "role": "user",
            "content": "Answer very briefly (no explanation) this question: {{item.question}}"
            }]
        }
      },
      "metrics": {
        "accuracy": {
          "type": "string-check",
          "params": {
            "check": [
              "{{sample.output_text}}",
              "contains",
              "{{item.answer}}"
            ]
          }
        },
        "bleu": {
            "type": "bleu",
            "params": {
                "references": [
                "{{item.reference_answer}}"
                ]
             }
        },
        "accuracy-llm-judge": {
          "type": "llm-judge",
          "params": {
            "model": {
              "api_endpoint": {
                "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                "model_id": "meta/llama-3.3-70b-instruct",
                "api_key": os.environ["NVIDIA_API_KEY"]
              }
            },
            "template": {
              "messages": [
                {
                  "role": "system",
                  "content": "Your task is to evaluate the semantic similarity between two responses."
                },
                {
                    "role": "user",
                    "content": (
                        "Respond in the following format SIMILARITY: 4. "
                        "The similarity should be a score between 0 and 10.\n\n"
                        "RESPONSE 1: {{item.reference_answer}}\n\n"
                        "RESPONSE 2: {{sample.output_text}}.\n\n"
                    )
                }
              ]
            },
            "scores": {
              "similarity": {
                "type": "int",
                "parser": {
                  "type": "regex",
                  "pattern": "SIMILARITY: (\\d+)"
                }
              }
            }
          }
        }
      },
      "dataset": {
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/llm_as_judge/math_dataset.csv"
      }
    }
  }
}
headers = {
    "accept": "application/json",
    "Content-Type": "application/json"
}

resp = requests.post(config_url, json=payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-08T20:37:58.674883',
 'updated_at': '2025-07-08T20:37:58.674884',
 'name': 'custom_llm_as_judge_config',
 'namespace': 'nemo-eval-tutorial',
 'type': 'custom',
 'tasks': {'qa': {'type': 'completion',
                  'params': {'template': {'messages': [{'role': 'system',
                                                        'content': 'You are a '
                                                                   'helpful, '
                                                                   'respectful '
                                                                   'and honest '
                                                                   'assistant. \n'
                                                                   'Answers '
                                                                   'the '
                                                                   'following '
                                                                  

**Launch Eval Job**

In [54]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/llama-chat-target",
    "config": f"{NMS_NAMESPACE}/custom_llm_as_judge_config"
}
headers = {
    "accept": "application/json",
    "Content-Type": "application/json"
}

resp = requests.post(job_url, json=payload, headers=headers)
llm_judge_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-08T20:38:04.820220',
 'updated_at': '2025-07-08T20:38:04.820223',
 'id': 'eval-MoXUyarWP3jBR8h6LqyhWU',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-08T19:40:13.061228',
            'updated_at': '2025-07-08T19:40:13.061229',
            'custom_fields': {},
            'ownership': None,
            'name': 'llama-chat-target',
            'type': 'model',
            'cached_outputs': None,
            'model': {'schema_version': '1.0',
                      'id': 'model-EXZQbF8ZXbjoogTvF9UFKr',
                      'description': None,
                      'type_prefix': 'model',
                      'namespace': 'default',
                      'project': None,
       

**Monitoring job status and results**

In [61]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{llm_judge_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully.',
 'task_status': {'qa': 'completed'},
 'progress': 100.0}
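
Evaluation jobs run asynchronously, so the status call above may need to be repeated until the job finishes. Below is a minimal polling sketch. The helper name `wait_for_job` and its parameters are hypothetical, `"completed"` is the task state shown in the output above, and `"failed"` is an assumed terminal state rather than something confirmed by this tutorial.

```python
import time
from typing import Callable, Dict


def wait_for_job(fetch_status: Callable[[], Dict],
                 poll_seconds: float = 10.0,
                 timeout_seconds: float = 3600.0) -> Dict:
    """Call fetch_status repeatedly until every task reports a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        task_states = set(status.get("task_status", {}).values())
        # "completed" is taken from the tutorial output; "failed" is an assumption.
        if task_states and task_states <= {"completed", "failed"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job not finished after {timeout_seconds} s: {status}")
        time.sleep(poll_seconds)
```

In the notebook, it could be called as `wait_for_job(lambda: requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{llm_judge_eval_job_id}/status").json())`. Passing the fetch step in as a callable keeps the loop independent of the HTTP client.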


In [62]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{llm_judge_eval_job_id}/results")
pp(resp.json()['tasks']['qa'])

{'metrics': {'accuracy': {'scores': {'string-check': {'value': 1.0,
                                                      'stats': {'count': 10,
                                                                'sum': 10.0,
                                                                'mean': 1.0}}}},
             'bleu': {'scores': {'sentence': {'value': 9.135501080023044,
                                              'stats': {'count': 10,
                                                        'sum': 91.35501080023045,
                                                        'mean': 9.135501080023044}},
                                 'corpus': {'value': 0.0}}},
             'accuracy-llm-judge': {'scores': {'similarity': {'value': 1.7,
                                                              'stats': {'count': 10,
                                                                        'sum': 17.0,
                                                                        'mean': 
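
The `similarity` score is extracted from the judge's reply by a regex parser. The sketch below reproduces that kind of extraction locally with Python's `re` module, assuming the first capture group is converted to an integer. The judge replies are made up, and `parse_similarity` is a hypothetical helper, not part of NeMo Evaluator. It also shows why the width of the capture group matters for a 0 to 10 scale: a single-digit group reads `10` as `1`.

```python
import re
from typing import Optional


def parse_similarity(judge_reply: str,
                     pattern: str = r"SIMILARITY: (\d+)") -> Optional[int]:
    """Return the first captured integer score, or None if nothing matches."""
    match = re.search(pattern, judge_reply)
    return int(match.group(1)) if match else None


# Made-up judge replies, for illustration only.
print(parse_similarity("SIMILARITY: 4"))                         # 4
print(parse_similarity("SIMILARITY: 10"))                        # 10
print(parse_similarity("SIMILARITY: 10", r"SIMILARITY: (\d)"))   # 1
print(parse_similarity("The two responses are similar."))        # None
```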

### 4.3 Tool Calling Evaluation

Required dataset format for a custom tool calling evaluation:

```json
[
    {
        "messages": [
            {
                "role": "user",
                "content": "Find the area of a triangle with a base of 10 units and height of 5 units."
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "calculate_triangle_area",
                    "description": "Calculate the area of a triangle given its base and height.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "base": {
                                "type": "integer",
                                "description": "The base of the triangle."
                            },
                            "height": {
                                "type": "integer",
                                "description": "The height of the triangle."
                            },
                            "unit": {
                                "type": "string",
                                "description": "The unit of measure (defaults to \"units\" if not specified)"
                            }
                        },
                        "required": [
                            "base",
                            "height"
                        ]
                    }
                }
            }
        ],
        "tool_calls": [
            {
                "function": {
                    "name": "calculate_triangle_area",
                    "arguments": {
                        "base": 10,
                        "height": 5,
                        "unit": "units"
                    }
                }
            }
        ]
    }
]
```
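
Before uploading, it can help to sanity-check each record against this format. The sketch below is a hypothetical local check (the function name and the rules it enforces are assumptions, not an official validator): it verifies that the three top-level fields are lists, that every ground-truth call names a declared tool, and that the required arguments are present.

```python
from typing import Dict, List


def check_tool_call_record(record: Dict) -> List[str]:
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = []
    for key in ("messages", "tools", "tool_calls"):
        if not isinstance(record.get(key), list):
            problems.append(f"missing or non-list field: {key}")
    if problems:
        return problems

    # Index the declared tools by function name.
    declared = {tool["function"]["name"]: tool["function"] for tool in record["tools"]}
    for call in record["tool_calls"]:
        name = call["function"]["name"]
        if name not in declared:
            problems.append(f"ground-truth call to undeclared tool: {name}")
            continue
        required = declared[name].get("parameters", {}).get("required", [])
        missing = [arg for arg in required if arg not in call["function"]["arguments"]]
        if missing:
            problems.append(f"{name} is missing required arguments: {missing}")
    return problems


record = {
    "messages": [{"role": "user", "content": "Find the area of a triangle."}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "calculate_triangle_area",
            "parameters": {
                "type": "object",
                "properties": {"base": {"type": "integer"}, "height": {"type": "integer"}},
                "required": ["base", "height"],
            },
        },
    }],
    "tool_calls": [{"function": {"name": "calculate_triangle_area",
                                 "arguments": {"base": 10, "height": 5}}}],
}
print(check_tool_call_record(record))  # []
```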


**Upload Custom Tool Calling Dataset to Data Store**

In [95]:
DATASET_NAME = "tool_call_data"
repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)
HF_API.upload_file(
    path_or_fileobj='./eval_dataset/tool_call_data/aiva_tool_call.jsonl',
    path_in_repo="aiva_tool_call.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

CommitInfo(commit_url='', commit_message='Upload aiva_tool_call.jsonl with huggingface_hub', commit_description='', oid='8ba99b4e0b4b2224b059c6cacda3b86d724c406e', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

**Create Eval Config**

In [96]:
config_payload = {
    "type": "custom",
    "namespace": NMS_NAMESPACE,
    "name": "tool-call-eval-config",
    "tasks": {
        "custom-tool-calling": {
            "type": "chat-completion",
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/aiva_tool_call.jsonl",
            },
            "params": {
                "template": {
                    "messages": "{{ item.messages | tojson}}",
                    "tools": "{{ item.tools | tojson }}",
                    "tool_choice": "auto"
                }
            },
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}
                }
            }
        }
    }
}
resp = requests.post(config_url, json=config_payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-09T02:18:55.955723',
 'updated_at': '2025-07-09T02:18:55.955724',
 'name': 'tool-call-eval-config',
 'namespace': 'nemo-eval-tutorial',
 'type': 'custom',
 'tasks': {'custom-tool-calling': {'type': 'chat-completion',
                                   'params': {'template': {'messages': '{{ '
                                                                       'item.messages '
                                                                       '| '
                                                                       'tojson}}',
                                                           'tools': '{{ '
                                                                    'item.tools '
                                                                    '| tojson '
                                                                    '}}',
                                                           'tool_choice': 'auto'}},
                                   'metri

**Launch Eval Job**

In [97]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/llama-chat-target",
    "config": f"{NMS_NAMESPACE}/tool-call-eval-config"
}
headers = {
    "accept": "application/json",
    "Content-Type": "application/json"
}

resp = requests.post(job_url, json=payload, headers=headers)
tool_call_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-09T02:19:10.833300',
 'updated_at': '2025-07-09T02:19:10.833303',
 'id': 'eval-P6pSKQ2MhjQGFdT85yBBqm',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-08T19:40:13.061228',
            'updated_at': '2025-07-08T19:40:13.061229',
            'custom_fields': {},
            'ownership': None,
            'name': 'llama-chat-target',
            'type': 'model',
            'cached_outputs': None,
            'model': {'schema_version': '1.0',
                      'id': 'model-EXZQbF8ZXbjoogTvF9UFKr',
                      'description': None,
                      'type_prefix': 'model',
                      'namespace': 'default',
                      'project': None,
       

**Monitoring job status and results**

In [99]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{tool_call_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully.',
 'task_status': {'custom-tool-calling': 'completed'},
 'progress': 100.0}


In [100]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{tool_call_eval_job_id}/results")
pp(resp.json()['tasks']['custom-tool-calling']['metrics'])

{'tool-calling-accuracy': {'scores': {'function_name_accuracy': {'value': 0.9,
                                                                 'stats': {'count': 10,
                                                                           'sum': 9.0,
                                                                           'mean': 0.9}},
                                      'function_name_and_args_accuracy': {'value': 0.0,
                                                                          'stats': {'count': 10,
                                                                                    'sum': 0.0,
                                                                                    'mean': 0.0}}}}}
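
The gap between the two scores suggests that the model usually picks the right function but rarely reproduces the ground-truth arguments exactly. The sketch below illustrates how that can happen under strict matching. It is not the evaluator's implementation: exact equality of names and argument dictionaries is an assumption, and the predicted call is made up.

```python
from typing import Dict, List, Tuple


def call_names(calls: List[Dict]) -> List[str]:
    """Function names of the calls, in order."""
    return [call["function"]["name"] for call in calls]


def call_pairs(calls: List[Dict]) -> List[Tuple[str, Dict]]:
    """(name, arguments) pairs of the calls, in order."""
    return [(call["function"]["name"], call["function"]["arguments"]) for call in calls]


def name_match(predicted: List[Dict], expected: List[Dict]) -> bool:
    return call_names(predicted) == call_names(expected)


def name_and_args_match(predicted: List[Dict], expected: List[Dict]) -> bool:
    return call_pairs(predicted) == call_pairs(expected)


expected = [{"function": {"name": "calculate_triangle_area",
                          "arguments": {"base": 10, "height": 5, "unit": "units"}}}]
# Hypothetical model output that omits the optional "unit" argument.
predicted = [{"function": {"name": "calculate_triangle_area",
                           "arguments": {"base": 10, "height": 5}}}]

print(name_match(predicted, expected))           # True
print(name_and_args_match(predicted, expected))  # False
```

If the ground truth includes optional arguments with default values, as in the triangle example, a strict comparison of this kind would penalize a model that leaves them out.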


## 5. Retriever Pipeline Evaluation

### 5.1 Deploy Retriever Models

To evaluate retriever pipelines, the retriever models must be deployed and reachable from the cluster. For this tutorial, we will create a retriever pipeline with both an embedding model and a reranking model. Specifically, we will use Docker to deploy two retriever NIMs: `llama-3.2-nv-embedqa-1b-v2` for embedding and `llama-3.2-nv-rerankqa-1b-v2` for reranking.

First, let's identify a free GPU to deploy the retriever models.

In [9]:
!nvidia-smi

Wed Jul  9 06:27:12 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          On  | 00000002:00:01.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000002:00:02.0 Off |  

Then, we can deploy the `llama-3.2-nv-embedqa-1b-v2` embedding NIM (replace `<your-api-key>` with your NGC API Key below):

In [None]:
%%bash
export NGC_API_KEY=<your-api-key>
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
export NIM_MODEL_NAME=nvidia/llama-3.2-nv-embedqa-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.5.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -d --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Next, we will deploy the `llama-3.2-nv-rerankqa-1b-v2` reranking NIM:

In [None]:
%%bash
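# Each %%bash cell runs in its own shell, so set the key again unless it is already in the notebook's environment
export NGC_API_KEY=<your-api-key>
export NIM_MODEL_NAME=nvidia/llama-3.2-nv-rerankqa-1b-v2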
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.3.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the NIM
docker run -d --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8001:8000 \
  $IMG_NAME

Get the host IP address and specify the API endpoints for the embedding and reranking NIMs:

In [12]:
! ip route get 1 | grep src

1.0.0.0 via 172.27.16.1 dev ens3 src 172.27.20.120 uid 1000 


In [20]:
embed_url = "http://172.27.20.120:8000/v1/embeddings"
embed_model_name = "nvidia/llama-3.2-nv-embedqa-1b-v2"
rerank_url = "http://172.27.20.120:8001/v1/ranking"
rerank_model_name = "nvidia/llama-3.2-nv-rerankqa-1b-v2"
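
Before wiring these endpoints into an evaluation target, a quick smoke test can confirm that both NIMs respond. The sketch below builds request bodies and posts them with the standard library. The payload shapes (`input_type` for embeddings, `query` and `passages` objects for ranking) are assumptions based on the public NIM API documentation and should be checked against the deployed versions; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List


def embedding_request(model: str, texts: List[str], input_type: str = "query") -> Dict:
    """Body for POST /v1/embeddings (shape is an assumption, see lead-in)."""
    return {"model": model, "input": texts, "input_type": input_type}


def ranking_request(model: str, query: str, passages: List[str]) -> Dict:
    """Body for POST /v1/ranking (shape is an assumption, see lead-in)."""
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": passage} for passage in passages],
    }


def post_json(url: str, payload: Dict, timeout: float = 30.0) -> Dict:
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `post_json(embed_url, embedding_request(embed_model_name, ["What is a bond?"]))` should return a JSON object if the embedding NIM is up.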

### 5.2 Set up Milvus Vector Database

To run retriever or RAG evaluations, first enable the Milvus document store by setting `milvus.enabled: true` under `evaluator` in `values.yaml`, then upgrade your Helm deployment to apply the change:

```yaml
evaluator:
  enabled: true
  milvus:
    enabled: true
```

After the upgrade, you should see a Milvus pod running:

In [21]:
! kubectl get pod

NAME                                                          READY   STATUS      RESTARTS      AGE
model-downloader-meta-llama-3-1-8b-instruct-2-0-28trx         0/1     Completed   0             29m
model-downloader-meta-llama-3-2-1b-instruct-2-0-b6scn         0/1     Completed   0             29m
modeldeployment-meta-llama-3-1-8b-instruct-6b64d56fdc-slctn   1/1     Running     0             28m
nemo-argo-workflows-server-655f8d755-svgn2                    1/1     Running     0             30m
nemo-argo-workflows-workflow-controller-8f8877cd4-8t2tf       1/1     Running     0             30m
nemo-customizer-5d8554fcf6-rhwfp                              1/1     Running     2 (29m ago)   30m
nemo-customizerdb-0                                           1/1     Running     0             30m
nemo-data-store-795ccbb97b-nwcf2                              1/1     Running     0             30m
nemo-deployment-management-646cc67c-l67lq                     1/1     Running     0             30m


### 5.3 Evaluate Embedding Pipeline on FIQA Dataset

In [26]:
target_payload = {
    "type": "retriever",
    "name": "embed-target",
    "namespace": NMS_NAMESPACE,
    "retriever": {
        "pipeline": {
            "query_embedding_model": {
                "api_endpoint": {
                    "url": embed_url,
                    "model_id": embed_model_name,
                }
            },
            "index_embedding_model": {
                "api_endpoint": {
                    "url": embed_url,
                    "model_id": embed_model_name,
                }
            },
            "top_k": 10
        }
    }
}

config_payload = {
    "type": "retriever",
    "name": "fiqa-config",
    "namespace": NMS_NAMESPACE,
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://fiqa/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}

resp1 = requests.post(target_url, json=target_payload, headers=headers)
resp2 = requests.post(config_url, json=config_payload, headers=headers)

In [27]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/embed-target",
    "config": f"{NMS_NAMESPACE}/fiqa-config"
}
resp = requests.post(job_url, json=payload, headers=headers)
embed_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-09T06:45:27.786401',
 'updated_at': '2025-07-09T06:45:27.786403',
 'id': 'eval-5icjZHn7q6YERCtUCA19bW',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-5Upwjn3JL3NPV8sGacQyWW',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-09T06:44:21.108017',
            'updated_at': '2025-07-09T06:44:21.108018',
            'custom_fields': {},
            'ownership': None,
            'name': 'embed-target',
            'type': 'retriever',
            'cached_outputs': None,
            'model': None,
            'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',
                                                                 'id': 'model-JsfNnspcixm9UusYGSD1iU',
                                                                 'des

In [33]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{embed_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully',
 'task_status': {'my-beir-task': 'completed'},
 'progress': 100.0}


In [34]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{embed_eval_job_id}/results")
pp(resp.json()['groups']['evaluation']['metrics'])

{'evaluation': {'scores': {'recall_10': {'value': 0.5984280594234299},
                           'ndcg_cut_10': {'value': 0.5280203494315917},
                           'ndcg_cut_5': {'value': 0.5054691655963462},
                           'recall_5': {'value': 0.5225860130952724}}}}
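
The metric names follow the trec_eval convention: `recall_k` is the fraction of a query's relevant documents found in the top `k` results, and `ndcg_cut_k` is the normalized discounted cumulative gain at cutoff `k`. The sketch below computes both for a single toy query. It uses the common linear-gain, log2-discount formulation and made-up document IDs; it is meant to convey what the numbers measure, not to reproduce the evaluator's code, which reports averages over all queries.

```python
import math
from typing import Dict, List


def recall_at_k(ranked_ids: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, int], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking, both cut at k."""
    dcg = sum(qrels.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d1", "d2", "d3"]   # system ranking for one query
qrels = {"d1": 1, "d3": 1}    # relevance judgments for that query

print(recall_at_k(ranked, qrels, k=2))          # 0.5
print(round(ndcg_at_k(ranked, qrels, k=3), 4))  # 0.9197
```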


### 5.4 Evaluate Embedding + Reranking Pipeline on FIQA Dataset

First, let's create an evaluation target for the embedding + reranking pipeline:

In [35]:
target_payload = {
    "type": "retriever",
    "name": "embed-rerank-target",
    "namespace": NMS_NAMESPACE,
    "retriever": {
        "pipeline": {
            "query_embedding_model": {
                "api_endpoint": {
                    "url": embed_url,
                    "model_id": embed_model_name,
                }
            },
            "index_embedding_model": {
                "api_endpoint": {
                    "url": embed_url,
                    "model_id": embed_model_name,
                }
            },
            "reranker_model": {
                "api_endpoint": {
                    "url": rerank_url,
                    "model_id": rerank_model_name,
                }
            },
            "top_k": 10
        }
    }
}
resp = requests.post(target_url, json=target_payload, headers=headers)

Then we can launch the eval job for the embed + rerank pipeline on FIQA data:

In [36]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/embed-rerank-target",
    "config": f"{NMS_NAMESPACE}/fiqa-config"
}
resp = requests.post(job_url, json=payload, headers=headers)
embed_rerank_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-09T07:06:34.280669',
 'updated_at': '2025-07-09T07:06:34.280671',
 'id': 'eval-2FJVarertHH2SNA3mYaqmD',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-T4fCAFx1yhgGP3kBfEBzxb',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-09T07:06:31.817357',
            'updated_at': '2025-07-09T07:06:31.817358',
            'custom_fields': {},
            'ownership': None,
            'name': 'embed-rerank-target',
            'type': 'retriever',
            'cached_outputs': None,
            'model': None,
            'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',
                                                                 'id': 'model-AEyhZfkQZdQUw938n7Gw4q',
                                                              

In [38]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{embed_rerank_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully',
 'task_status': {'my-beir-task': 'completed'},
 'progress': 100.0}


In [39]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{embed_rerank_eval_job_id}/results")
pp(resp.json()['groups']['evaluation']['metrics'])

{'evaluation': {'scores': {'recall_10': {'value': 0.5678767905619758},
                           'ndcg_cut_10': {'value': 0.5134364317202811},
                           'recall_5': {'value': 0.5151632575243686},
                           'ndcg_cut_5': {'value': 0.5011430756367373}}}}
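
To compare this run with the embedding-only run from section 5.3, the two score sets can be put side by side. The sketch below copies the values printed in this tutorial and computes the per-metric difference; `score_deltas` is a hypothetical helper.

```python
from typing import Dict

# Values copied from the two result outputs in this tutorial.
embed_only = {
    "recall_5": 0.5225860130952724,
    "recall_10": 0.5984280594234299,
    "ndcg_cut_5": 0.5054691655963462,
    "ndcg_cut_10": 0.5280203494315917,
}
embed_rerank = {
    "recall_5": 0.5151632575243686,
    "recall_10": 0.5678767905619758,
    "ndcg_cut_5": 0.5011430756367373,
    "ndcg_cut_10": 0.5134364317202811,
}


def score_deltas(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Candidate minus baseline for every metric present in both."""
    return {name: candidate[name] - baseline[name]
            for name in sorted(baseline) if name in candidate}


for name, delta in score_deltas(embed_only, embed_rerank).items():
    print(f"{name:12s} {delta:+.4f}")
```

In this particular run, all four differences are negative, so adding the reranker did not improve the FIQA scores; results may differ with other settings or datasets.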


### 5.5 Evaluate Embedding + Reranking Pipeline on Custom Dataset

Upload the custom dataset:

In [40]:
DATASET_NAME = "rag_custom_data"
repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)

In [41]:
HF_API.upload_file(
    path_or_fileobj="./eval_dataset/retriever_and_rag/queries.jsonl",
    path_in_repo="rag_data/queries.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

HF_API.upload_file(
    path_or_fileobj="./eval_dataset/retriever_and_rag/corpus.jsonl",
    path_in_repo="rag_data/corpus.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

HF_API.upload_file(
    path_or_fileobj="./eval_dataset/retriever_and_rag/qrels/test.tsv",
    path_in_repo="rag_data/qrels/test.tsv",
    repo_id=repo_id,
    repo_type='dataset',
)

queries.jsonl: 100%|██████████| 16.7k/16.7k [00:00<00:00, 3.90MB/s]
corpus.jsonl: 100%|██████████| 11.0k/11.0k [00:00<00:00, 3.04MB/s]
test.tsv: 100%|██████████| 7.28k/7.28k [00:00<00:00, 1.79MB/s]


CommitInfo(commit_url='', commit_message='Upload rag_data/qrels/test.tsv with huggingface_hub', commit_description='', oid='c901d2bb70290aa2d898d3f8102d732f774baaec', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)
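
The three uploaded files mirror the BEIR directory layout: a corpus, a set of queries, and relevance judgments (qrels). The sketch below writes a tiny dataset in that layout. The field names (`_id`, `title`, `text`) and the qrels header (`query-id`, `corpus-id`, `score`) come from the BEIR convention and are assumed to match what the evaluator expects; the helper name and the sample content are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def write_beir_dataset(root: str,
                       corpus: Dict[str, Dict[str, str]],
                       queries: Dict[str, str],
                       qrels: List[Tuple[str, str, int]]) -> Path:
    """Write corpus.jsonl, queries.jsonl and qrels/test.tsv under root."""
    root_path = Path(root)
    (root_path / "qrels").mkdir(parents=True, exist_ok=True)
    with open(root_path / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            row = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(row) + "\n")
    with open(root_path / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}) + "\n")
    with open(root_path / "qrels" / "test.tsv", "w", encoding="utf-8") as f:
        f.write("query-id\tcorpus-id\tscore\n")
        for query_id, doc_id, score in qrels:
            f.write(f"{query_id}\t{doc_id}\t{score}\n")
    return root_path


with tempfile.TemporaryDirectory() as tmp:
    root = write_beir_dataset(
        tmp,
        corpus={"doc1": {"title": "Example", "text": "Example passage text."}},
        queries={"q1": "What does the example passage say?"},
        qrels=[("q1", "doc1", 1)],
    )
    print(sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()))
    # ['corpus.jsonl', 'qrels/test.tsv', 'queries.jsonl']
```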

Create a retriever evaluation config for the custom dataset:

In [42]:
config_payload = {
    "type": "retriever",
    "name": "custom-retriever-config",
    "namespace": NMS_NAMESPACE,
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/rag_data"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}
resp = requests.post(config_url, json=config_payload, headers=headers)

Launch the eval job on the embed + rerank pipeline:

In [43]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/embed-rerank-target",
    "config": f"{NMS_NAMESPACE}/custom-retriever-config"
}
resp = requests.post(job_url, json=payload, headers=headers)
embed_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-09T16:16:07.734774',
 'updated_at': '2025-07-09T16:16:07.734776',
 'id': 'eval-PeDSNf6jrxTyJnovR9ybd7',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-T4fCAFx1yhgGP3kBfEBzxb',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-09T07:06:31.817357',
            'updated_at': '2025-07-09T07:06:31.817358',
            'custom_fields': {},
            'ownership': None,
            'name': 'embed-rerank-target',
            'type': 'retriever',
            'cached_outputs': None,
            'model': None,
            'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',
                                                                 'id': 'model-AEyhZfkQZdQUw938n7Gw4q',
                                                              

In [46]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{embed_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully',
 'task_status': {'my-beir-task': 'completed'},
 'progress': 100.0}


In [47]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{embed_eval_job_id}/results")
pp(resp.json()['groups']['evaluation']['metrics'])

{'evaluation': {'scores': {'ndcg_cut_5': {'value': 1.0},
                           'recall_10': {'value': 1.0},
                           'recall_5': {'value': 1.0},
                           'ndcg_cut_10': {'value': 1.0}}}}


## 6. RAG Pipeline Evaluation

### 6.1 Evaluate RAG Pipeline on NFCorpus Dataset

In [48]:
target_payload = {
    "type": "rag",
    "name": "rag-target",
    "namespace": NMS_NAMESPACE,
    "rag": {
        "pipeline": {
            "retriever": {
                "pipeline": {
                    "query_embedding_model": {
                        "api_endpoint": {
                            "url": embed_url,
                            "model_id": embed_model_name
                        }
                    },
                    "index_embedding_model": {
                        "api_endpoint": {
                            "url": embed_url,
                            "model_id": embed_model_name
                        }
                    },
                    "reranker_model": {
                        "api_endpoint": {
                            "url": rerank_url,
                            "model_id": rerank_model_name,
                        }
                    },
                    "top_k": 3
                }
            },
            "model": {
                "api_endpoint": {
                    "url": llm_chat_completion_url,
                    "model_id": "meta/llama-3.1-8b-instruct"
                }
            }
        }
    }
}
resp = requests.post(target_url, json=target_payload, headers=headers)

In [49]:
config_payload = {
    "type": "rag",
    "name": "rag-nfcorpus-config",
    "namespace": NMS_NAMESPACE,
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://nfcorpus/"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                        "model_id": "meta/llama-3.3-70b-instruct",
                        "api_key": os.environ['NVIDIA_API_KEY'],
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "https://integrate.api.nvidia.com/v1/embeddings",
                        "model_id": "nvidia/nv-embedqa-e5-v5",
                        "api_key": os.environ['NVIDIA_API_KEY'],
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}
resp = requests.post(config_url, json=config_payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-09T16:30:31.105191',
 'updated_at': '2025-07-09T16:30:31.105192',
 'name': 'rag-nfcorpus-config',
 'namespace': 'nemo-eval-tutorial',
 'type': 'rag',
 'tasks': {'my-beir-task': {'type': 'beir',
                            'params': {'judge_llm': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',
                                                                      'model_id': 'meta/llama-3.3-70b-instruct',
                                                                      'api_key': '******'}},
                                       'judge_embeddings': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',
                                                                             'model_id': 'nvidia/nv-embedqa-e5-v5',
                                                                             'api_key': '******'}},
                                       'judge_timeout': 300,
                                

In [50]:
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/rag-target",
    "config": f"{NMS_NAMESPACE}/rag-nfcorpus-config"
}
resp = requests.post(job_url, json=payload, headers=headers)
rag_nfcorpus_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-09T16:30:43.855021',
 'updated_at': '2025-07-09T16:30:43.855024',
 'id': 'eval-VrFzxRNEP1hhjrnUtP8Ug1',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-VTApJbSpoJBj8xQhgzxB4',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-09T16:29:23.468547',
            'updated_at': '2025-07-09T16:29:23.468547',
            'custom_fields': {},
            'ownership': None,
            'name': 'rag-target',
            'type': 'rag',
            'cached_outputs': None,
            'model': None,
            'retriever': None,
            'rag': {'pipeline': {'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',
                                                                                      'id': 'model-FbFHZ6sEf7pBhmSZj2eTa5',
     

In [54]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{rag_nfcorpus_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully',
 'task_status': {'my-beir-task': 'completed'},
 'progress': 100.0}


In [55]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{rag_nfcorpus_eval_job_id}/results")
pp(resp.json()['groups']['evaluation']['metrics'])

{'evaluation': {'scores': {'ndcg_cut_10': {'value': 0.2679049189207276},
                           'recall_5': {'value': 0.11898776243879489},
                           'recall_10': {'value': 0.11898776243879489},
                           'ndcg_cut_5': {'value': 0.35399013412606123},
                           'faithfulness': {'value': 0.798750415118147},
                           'answer_relevancy': {'value': 0.38691755209414236}}}}
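
Note that `recall_5` and `recall_10` are identical in this output. One plausible explanation is that the RAG target sets `top_k` to 3, so at most three documents are returned per query and both cutoffs see the same list. The toy example below, with made-up document IDs, illustrates the effect; it is an interpretation of the numbers, not a statement about the evaluator's internals.

```python
from typing import List


def recall_at_k(ranked_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


retrieved = ["d7", "d2", "d9"]        # only top_k = 3 documents come back
relevant = ["d2", "d4", "d5", "d9"]   # four documents are judged relevant

print(recall_at_k(retrieved, relevant, 5))   # 0.5
print(recall_at_k(retrieved, relevant, 10))  # 0.5
```

With four relevant documents and only three retrieved, recall cannot exceed 0.75 at any cutoff, which may also contribute to the low recall values on a corpus with many relevant documents per query.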


### 6.2 Evaluate RAG Pipeline on Custom Dataset

Create an evaluation config for RAG on the custom dataset. We will use a remote judge LLM and a remote judge embedding model.

In [56]:
config_payload = {
    "type": "rag",
    "name": "custom-rag-config",
    "namespace": NMS_NAMESPACE,
    "tasks": {
        "rag-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/rag_data"
            },
            "params": {
                "judge_llm": {
                    "api_endpoint": {
                        "url": "https://integrate.api.nvidia.com/v1/chat/completions",
                        "model_id": "meta/llama-3.3-70b-instruct",
                        "api_key": os.environ['NVIDIA_API_KEY'],
                    }
                },
                "judge_embeddings": {
                    "api_endpoint": {
                        "url": "https://integrate.api.nvidia.com/v1/embeddings",
                        "model_id": "nvidia/nv-embedqa-e5-v5",
                        "api_key": os.environ['NVIDIA_API_KEY'],
                    }
                },
                "judge_timeout": 300,
                "judge_max_retries": 5,
                "judge_max_workers": 16
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"},
                "faithfulness": {"type": "faithfulness"},
                "answer_relevancy": {"type": "answer_relevancy"}
            }
        }
    }
}

resp = requests.post(config_url, json=config_payload, headers=headers)
pp(resp.json())

{'created_at': '2025-07-09T17:35:11.033333',
 'updated_at': '2025-07-09T17:35:11.033334',
 'name': 'custom-rag-config',
 'namespace': 'nemo-eval-tutorial',
 'type': 'rag',
 'tasks': {'rag-beir-task': {'type': 'beir',
                             'params': {'judge_llm': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',
                                                                       'model_id': 'meta/llama-3.3-70b-instruct',
                                                                       'api_key': '******'}},
                                        'judge_embeddings': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',
                                                                              'model_id': 'nvidia/nv-embedqa-e5-v5',
                                                                              'api_key': '******'}},
                                        'judge_timeout': 300,
                          

In [57]:
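# Submit an evaluation job that pairs the existing RAG target
# with the custom RAG config created above.
payload = {
    "namespace": NMS_NAMESPACE,
    "target": f"{NMS_NAMESPACE}/rag-target",
    "config": f"{NMS_NAMESPACE}/custom-rag-config"
}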
resp = requests.post(job_url, json=payload, headers=headers)
custom_rag_eval_job_id = resp.json()["id"]
pp(resp.json())

{'created_at': '2025-07-09T17:35:43.792368',
 'updated_at': '2025-07-09T17:35:43.792370',
 'id': 'eval-NYudxKxqnNzbRk5k7CpJNr',
 'namespace': 'nemo-eval-tutorial',
 'description': None,
 'target': {'schema_version': '1.0',
            'id': 'eval-target-VTApJbSpoJBj8xQhgzxB4',
            'description': None,
            'type_prefix': 'eval-target',
            'namespace': 'nemo-eval-tutorial',
            'project': None,
            'created_at': '2025-07-09T16:29:23.468547',
            'updated_at': '2025-07-09T16:29:23.468547',
            'custom_fields': {},
            'ownership': None,
            'name': 'rag-target',
            'type': 'rag',
            'cached_outputs': None,
            'model': None,
            'retriever': None,
            'rag': {'pipeline': {'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',
                                                                                      'id': 'model-FbFHZ6sEf7pBhmSZj2eTa5',
     

In [60]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{custom_rag_eval_job_id}/status")
pp(resp.json())

{'message': 'Job completed successfully',
 'task_status': {'rag-beir-task': 'completed'},
 'progress': 100.0}


In [61]:
resp = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{custom_rag_eval_job_id}/results")
pp(resp.json()['groups']['evaluation']['metrics'])

{'evaluation': {'scores': {'ndcg_cut_10': {'value': 1.0},
                           'recall_10': {'value': 1.0},
                           'ndcg_cut_5': {'value': 1.0},
                           'recall_5': {'value': 1.0},
                           'faithfulness': {'value': 0.804586038961039},
                           'answer_relevancy': {'value': 0.5430273571543547}}}}
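
To view the two RAG runs side by side, the scores can be lined up in a small table. This is only a sketch: `score_deltas` is a hypothetical helper, and the values are copied from the two result outputs in this section. The runs use different datasets, so the differences say as much about the data as about the pipeline.

```python
from typing import Dict

# Scores copied from the nfcorpus run and the custom-dataset run above.
nfcorpus_scores = {
    "ndcg_cut_10": 0.2679049189207276,
    "recall_5": 0.11898776243879489,
    "recall_10": 0.11898776243879489,
    "ndcg_cut_5": 0.35399013412606123,
    "faithfulness": 0.798750415118147,
    "answer_relevancy": 0.38691755209414236,
}
custom_scores = {
    "ndcg_cut_10": 1.0,
    "recall_10": 1.0,
    "ndcg_cut_5": 1.0,
    "recall_5": 1.0,
    "faithfulness": 0.804586038961039,
    "answer_relevancy": 0.5430273571543547,
}


def score_deltas(baseline: Dict[str, float],
                 candidate: Dict[str, float]) -> Dict[str, float]:
    """Return candidate minus baseline for every metric present in both."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


print(f"{'metric':<18}{'nfcorpus':>10}{'custom':>10}{'delta':>10}")
for name, delta in score_deltas(nfcorpus_scores, custom_scores).items():
    print(f"{name:<18}{nfcorpus_scores[name]:>10.4f}"
          f"{custom_scores[name]:>10.4f}{delta:>+10.4f}")
```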
