# Part III: Model Evaluation Using NeMo Evaluator

This notebook covers the following:

* [Prerequisites: Configurations and Health Checks](#step-0)
* [Evaluate the Custom Model on the SciDocs (BEIR) zero-shot benchmark](#step-1)

In [1]:
from time import sleep, time
from nemo_microservices import NeMoMicroservices

---
<a id="step-0"></a>
## Prerequisites: Configurations and Health Checks

Before you proceed, make sure that you completed the previous notebooks on data preparation and model fine-tuning to obtain the assets required to follow along.

### Configure NeMo Microservices Endpoints

The following code imports the necessary configuration values (the endpoints for the NeMo Data Store, Entity Store, Customizer, Evaluator, and NIM, as well as the namespace and base model) and initializes the NeMo Microservices SDK client.

In [2]:
from config import *

# Initialize NeMo Microservices SDK client
nemo_client = NeMoMicroservices(
    base_url=NEMO_URL,
    inference_base_url=NIM_URL,
)

Paste the embedding model name from your previous notebook, then verify that the model is being served:

In [3]:
EMBEDDING_MODEL_NAME = f"{NMS_NAMESPACE}/{OUTPUT_MODEL_NAME_EMBEDDING}" # update this if you used a different name

# Check if the embedding model is hosted by NVIDIA NIM
models = nemo_client.inference.models.list()
model_names = [model.id for model in models.data]

assert EMBEDDING_MODEL_NAME in model_names, \
    f"Model {EMBEDDING_MODEL_NAME} not found"

---
<a id="step-1"></a>
## Evaluate the Custom Model

For the purposes of showcasing zero-shot generalization, we will run the `SciDocs` benchmark from the [Benchmarking Information Retrieval (BEIR)](https://github.com/beir-cellar/beir) benchmark suite.

We chose the `SciDocs` benchmark because its core purpose is to assess a model's ability to find and retrieve a scientific paper that should be cited by another given paper. Although this benchmark data differs from the `SPECTER` dataset that we trained on (for example, in passage length), it falls roughly within the same scientific domain.

### Create a Target Configuration

The first step in evaluation is to create a target configuration. This specifies parameters for the target endpoint being evaluated.

In [4]:
EMBEDDING_URL = f"{NIM_URL}/v1/embeddings"

print("Embedding URL: ", EMBEDDING_URL)
print("Embedding Model Name: ", EMBEDDING_MODEL_NAME)

retriever_target_config = {
 "type": "retriever",
 "retriever": {
   "pipeline": {
     "query_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME,
         "api_key": ""
       }
     },
     "index_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME,
         "api_key": ""
       }
     },
     "top_k": 10
   }
 }
}

Embedding URL:  http://nim.test/v1/embeddings
Embedding Model Name:  embed-sft-ns/fullweight_sft_embedding


### Create an Evaluation Configuration

NeMo Evaluator supports the `BEIR` format for evaluation, and can also run the `BEIR` benchmark itself, of which SciDocs is a part.

You may find more information about retriever model evaluation, including supported metrics, in the [documentation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-types/retriever.html).
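For readers unfamiliar with the `BEIR` format, the sketch below writes a toy dataset in the layout used by the public BEIR datasets: a `corpus.jsonl` file, a `queries.jsonl` file, and a tab-separated relevance file under `qrels/`. The documents, IDs, and the `write_tiny_beir_dataset` helper are made up for illustration. Check the documentation linked above for the exact layout that NeMo Evaluator expects for custom data.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_tiny_beir_dataset(root: Path) -> Path:
    """Write a toy dataset in the BEIR layout (hypothetical contents)."""
    root.mkdir(parents=True, exist_ok=True)

    # Documents to retrieve from: one JSON object per line.
    corpus = [
        {"_id": "doc1", "title": "Graph neural networks", "text": "A survey of GNN architectures."},
        {"_id": "doc2", "title": "Protein folding", "text": "Predicting structure from sequence."},
    ]
    # Queries: one JSON object per line.
    queries = [{"_id": "q1", "text": "message passing on graphs"}]
    # Relevance judgments: (query id, document id, relevance grade).
    qrels = [("q1", "doc1", 1)]

    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc in corpus:
            f.write(json.dumps(doc) + "\n")

    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query in queries:
            f.write(json.dumps(query) + "\n")

    qrels_dir = root / "qrels"
    qrels_dir.mkdir(exist_ok=True)
    with open(qrels_dir / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        writer.writerows(qrels)

    return root


dataset_dir = write_tiny_beir_dataset(Path(tempfile.mkdtemp()) / "toy-beir")
for path in sorted(p for p in dataset_dir.rglob("*") if p.is_file()):
    print(path.relative_to(dataset_dir).as_posix())
```

The SciDocs data referenced in the next cell follows the same three-part structure, only at a much larger scale.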

In [5]:
retriever_eval_config_scidocs = {
    "type": "retriever",
    "namespace": NMS_NAMESPACE,
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://scidocs/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}


`NOTE`: The configuration above calculates the recall and NDCG metrics at cutoffs (k) of 5 and 10.

* `recall@k`:  This metric measures the fraction of all existing relevant items that are successfully found within the top k results returned by a system.

* `NDCG@k`: This metric evaluates how well a system ranks items by relevance within the top k positions, assigning more value to placing highly relevant items at the very top of the list.
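To make the two definitions concrete, here is a minimal sketch that computes both metrics for a single query. It assumes the common formulation with linear gain and a log2 rank discount. The function names and toy data are illustrative, and the Evaluator's own implementation may differ in details such as tie handling.

```python
import math


def recall_at_k(ranked_ids, relevance, k):
    """Fraction of all relevant documents that appear in the top k results."""
    relevant = {doc_id for doc_id, grade in relevance.items() if grade > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), rank starting at 1."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: three relevant documents exist, two of them are retrieved in the top 5.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevance = {"d1": 1, "d2": 1, "d4": 1}

print(f"recall@5 = {recall_at_k(ranked, relevance, 5):.3f}")
print(f"ndcg@5   = {ndcg_at_k(ranked, relevance, 5):.3f}")
```

In this example recall@5 is 2/3, because two of the three relevant documents are retrieved. NDCG@5 is lower than that, since the relevant documents sit at ranks 2 and 4 rather than at the top and one is missing entirely. A benchmark score is the average of these per-query values over all queries.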

### Create an Evaluation Job

The following code cell creates an evaluation job using the target and eval configurations defined above.

In [6]:
# Create evaluation job for the custom (fine-tuned) model
eval_job = nemo_client.evaluation.jobs.create(
    config=retriever_eval_config_scidocs,
    target=retriever_target_config
)

print("Evaluation job created: ", eval_job.id)

Evaluation job created:  eval-UyoRebr4tEEZnK61MgGs1W


### Monitor the Evaluation Job

The following code cell defines a helper function to poll on the job status using the `nemo_client.evaluation.jobs.retrieve` method.

**The evaluation will take about 10-12 minutes to complete.**

In [7]:
from time import sleep, time

def wait_eval_job(nemo_client, job_id: str, polling_interval: int = 10, timeout: int = 6000):
    """Poll an evaluation job until it leaves the pending/created/running states, then return it."""
    start_time = time()
    job = nemo_client.evaluation.jobs.retrieve(job_id=job_id)
    status = job.status

    while (status in ["pending", "created", "running"]):
        # Check for timeout
        if time() - start_time > timeout:
            raise RuntimeError(f"Took more than {timeout} seconds.")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        job = nemo_client.evaluation.jobs.retrieve(job_id=job_id)
        status = job.status

        # Progress details (only fetch if status is "running")
        progress = 0
        if status == "running" and job.status_details:
            progress = job.status_details.progress or 0
        elif status == "completed":
            progress = 100

        print(f"Job status: {status} after {time() - start_time:.2f} seconds. Progress: {progress}%")

    return job

In [8]:
job = wait_eval_job(nemo_client, eval_job.id, polling_interval=5, timeout=5000)

Job status: running after 5.04 seconds. Progress: 0%
Job status: running after 10.05 seconds. Progress: 0%
Job status: running after 15.07 seconds. Progress: 0%
Job status: running after 20.08 seconds. Progress: 0%
Job status: running after 25.10 seconds. Progress: 0%
Job status: running after 30.11 seconds. Progress: 0%
Job status: running after 35.13 seconds. Progress: 0%
Job status: running after 40.14 seconds. Progress: 0%
Job status: running after 45.15 seconds. Progress: 0%
Job status: running after 50.17 seconds. Progress: 0%
Job status: running after 55.18 seconds. Progress: 0%
Job status: running after 60.19 seconds. Progress: 0%
Job status: running after 65.21 seconds. Progress: 0%
Job status: running after 70.22 seconds. Progress: 0%
Job status: running after 75.24 seconds. Progress: 0%
Job status: running after 80.25 seconds. Progress: 0%
Job status: running after 85.27 seconds. Progress: 0%
Job status: running after 90.28 seconds. Progress: 0%
Job status: running after 95.

### Review the results

In [9]:
results = nemo_client.evaluation.jobs.results(job_id=eval_job.id)
print(results.model_dump_json(indent=2, exclude_unset=True))

{
  "job": "eval-UyoRebr4tEEZnK61MgGs1W",
  "id": "evaluation_result-AjnxBhHBxyXEJzUFDjJfV6",
  "created_at": "2025-07-29T17:59:19.601878",
  "custom_fields": {},
  "files_url": "hf://datasets/evaluation-results/eval-UyoRebr4tEEZnK61MgGs1W",
  "namespace": "default",
  "tasks": {
    "my-beir-task": {
      "metrics": {
        "retriever.ndcg_cut_10": {
          "scores": {
            "ndcg_cut_10": {
              "value": 0.2392602974177994,
              "stats": {}
            }
          }
        },
        "retriever.ndcg_cut_5": {
          "scores": {
            "ndcg_cut_5": {
              "value": 0.1973427260255094,
              "stats": {}
            }
          }
        },
        "retriever.recall_10": {
          "scores": {
            "recall_10": {
              "value": 0.25328333333333275,
              "stats": {}
            }
          }
        },
        "retriever.recall_5": {
          "scores": {
            "recall_5": {
              "value": 0.17

After a single short fine-tuning run, we see a `recall@5` score of around `0.176`. Note that the `SciDocs` task is considered a challenging zero-shot benchmark, with the SOTA for `recall@5` close to the `0.2` mark.
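The nested result structure can be awkward to read at a glance. The sketch below flattens a dictionary of that shape into `task/metric` pairs. The `flatten_metrics` helper is hypothetical, and `sample` is an abridged copy of the output above that contains only the values shown in full there.

```python
def flatten_metrics(results: dict) -> dict:
    """Map '<task name>/<score name>' to its numeric value."""
    flat = {}
    for task_name, task in results.get("tasks", {}).items():
        for metric in task.get("metrics", {}).values():
            for score_name, score in metric.get("scores", {}).items():
                flat[f"{task_name}/{score_name}"] = score["value"]
    return flat


# Abridged copy of the results printed above.
sample = {
    "tasks": {
        "my-beir-task": {
            "metrics": {
                "retriever.ndcg_cut_10": {
                    "scores": {"ndcg_cut_10": {"value": 0.2392602974177994, "stats": {}}}
                },
                "retriever.ndcg_cut_5": {
                    "scores": {"ndcg_cut_5": {"value": 0.1973427260255094, "stats": {}}}
                },
                "retriever.recall_10": {
                    "scores": {"recall_10": {"value": 0.25328333333333275, "stats": {}}}
                },
            }
        }
    }
}

for name, value in sorted(flatten_metrics(sample).items()):
    print(f"{name}: {value:.4f}")
```

With a live job, the same helper should work on `results.model_dump()`, assuming the results object is a Pydantic model (as the `model_dump_json` call above suggests).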

For comparison, the baseline model we used (`nvidia/llama-3_2-nv-embedqa-1b-v2`) had a `recall@5` score of around `0.159` on this task. If you are interested in scoring the base model used for fine-tuning, you can either deploy the NIM yourself or point to the managed endpoints at [build.nvidia.com](https://build.nvidia.com/nvidia/llama-3_2-nv-embedqa-1b-v2) in your target configuration.

With a quick fine-tuning run, we were able to boost the score beyond that of `nvidia/llama-3_2-nv-embedqa-1b-v2`, which is already an excellent starting point.
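As a quick back-of-the-envelope check, the following sketch computes the size of that improvement from the two approximate `recall@5` scores quoted above. Both inputs are rounded, so the result is only indicative.

```python
# Approximate recall@5 scores quoted in the text above.
baseline_recall_at_5 = 0.159   # nvidia/llama-3_2-nv-embedqa-1b-v2
finetuned_recall_at_5 = 0.176  # custom model after one short fine-tuning run

absolute_gain = finetuned_recall_at_5 - baseline_recall_at_5
relative_gain = absolute_gain / baseline_recall_at_5

print(f"Absolute gain: {absolute_gain:.3f}")
print(f"Relative gain: {relative_gain:.1%}")
```

This works out to an absolute gain of about 0.017, or roughly an 11% relative improvement over the baseline.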