# Part III: Model Evaluation Using NeMo Evaluator

This notebook covers the following:

* [Prerequisites: Configurations and Health Checks](#step-0)
* [Evaluate the Custom Model on the SciDocs (BEIR) Zero-Shot Benchmark](#step-1)

In [None]:
from time import sleep, time
from nemo_microservices import NeMoMicroservices
from config import *

---
<a id="step-0"></a>
## Prerequisites: Configurations and Health Checks

Before you proceed, make sure that you have completed the previous notebooks on data preparation and model fine-tuning, which produce the assets this notebook relies on.

### Configure NeMo Microservices Endpoints

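The configurations for the NeMo Data Store, Entity Store, Customizer, Evaluator, and NIM, as well as the namespace and base model, are imported from the `config` module at the top of this notebook. The following code uses them to initialize the NeMo Microservices SDK client.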

In [None]:
# Initialize NeMo Microservices SDK client
nemo_client = NeMoMicroservices(
    base_url=NEMO_URL,
    inference_base_url=NIM_URL,
)

Paste the embedding model name from the previous notebook, then check that the model is available for inference.

In [None]:
EMBEDDING_MODEL_NAME = f"{NMS_NAMESPACE}/{OUTPUT_MODEL_NAME_EMBEDDING}" # update this if you used a different name

# Check if the embedding model is running locally as an NVIDIA NIM (pod in your cluster)
models = nemo_client.inference.models.list()
model_names = [model.id for model in models.data]

assert EMBEDDING_MODEL_NAME in model_names, \
    f"Model {EMBEDDING_MODEL_NAME} not found"

---
<a id="step-1"></a>
## Evaluate the Custom Model

To showcase zero-shot generalization, we run the `SciDocs` benchmark from the [Benchmarking Information Retrieval (BEIR)](https://github.com/beir-cellar/beir) suite.

We choose `SciDocs` because its core purpose is to assess a model's ability to find and retrieve a scientific paper that should be cited by another given paper. Although the benchmark data differs from the `SPECTER` dataset we used for training (for example, in the length of the passages), it remains within the scientific domain and serves as a good test of the model's generalization capabilities.

### Create a Target Configuration

The first step in evaluation is to create a target configuration. This specifies parameters for the target endpoint being evaluated.

In [None]:
EMBEDDING_URL = f"{NIM_URL}/v1/embeddings"

print("Embedding URL: ", EMBEDDING_URL)
print("Embedding Model Name: ", EMBEDDING_MODEL_NAME)

retriever_target_config = {
 "type": "retriever",
 "retriever": {
   "pipeline": {
     "query_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME,
         "api_key": ""
       }
     },
     "index_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME,
         "api_key": ""
       }
     },
     "top_k": 10
   }
 }
}

### Create an Evaluation Configuration

NeMo Evaluator accepts evaluation datasets in the `BEIR` format and can also run the `BEIR` benchmark itself, of which SciDocs is a part.

For more information about retriever model evaluation, including supported metrics, see the [documentation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-types/retriever.html).
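
As a rough illustration of what "BEIR format" means, the sketch below writes a toy dataset in the layout used by the BEIR repository's data loader: a `corpus.jsonl` file, a `queries.jsonl` file, and a `qrels/test.tsv` file of relevance judgments. The helper name `write_beir_dataset` and the toy records are hypothetical and exist only for this example. How a custom dataset is uploaded and referenced through `files_url` is not covered here.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_beir_dataset(root, corpus, queries, qrels):
    """Write a toy dataset in the BEIR file layout (hypothetical helper)."""
    root = Path(root)
    (root / "qrels").mkdir(parents=True, exist_ok=True)

    # corpus.jsonl: one JSON object per document with _id, title, and text
    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            record = {"_id": doc_id, "title": doc["title"], "text": doc["text"]}
            f.write(json.dumps(record) + "\n")

    # queries.jsonl: one JSON object per query with _id and text
    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}) + "\n")

    # qrels/test.tsv: tab-separated relevance judgments with a header row
    with open(root / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])

    return root


# Toy records, invented for illustration only
toy_corpus = {
    "d1": {"title": "Paper A", "text": "Abstract of paper A."},
    "d2": {"title": "Paper B", "text": "Abstract of paper B."},
}
toy_queries = {"q1": "Title of a paper that should cite paper A"}
toy_qrels = {"q1": {"d1": 1}}

dataset_dir = write_beir_dataset(tempfile.mkdtemp(), toy_corpus, toy_queries, toy_qrels)
written_files = sorted(
    p.relative_to(dataset_dir).as_posix() for p in dataset_dir.rglob("*") if p.is_file()
)
print(written_files)
```

The SciDocs run in this notebook does not need any of this, because the benchmark is referenced directly in the evaluation configuration.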

In [None]:
retriever_eval_config_scidocs = {
    "type": "retriever",
    "namespace": NMS_NAMESPACE,
    "tasks": {
        "my-beir-task": {
            "type": "beir",
            "dataset": {
                "files_url": "file://scidocs/"
            },
            "metrics": {
                "recall_5": {"type": "recall_5"},
                "ndcg_cut_5": {"type": "ndcg_cut_5"},
                "recall_10": {"type": "recall_10"},
                "ndcg_cut_10": {"type": "ndcg_cut_10"}
            }
        }
    }
}


`NOTE`: The configuration above computes the recall and NDCG metrics at cutoffs k = 5 and k = 10.

* `recall@k`: This metric measures the fraction of all existing relevant items that are successfully found within the top k results returned by a system.

* `NDCG@k`: This metric evaluates how well a system ranks items by relevance within the top k positions, assigning more value to placing highly relevant items at the very top of the list.
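
To make the two definitions concrete, here is a minimal sketch that computes both metrics for a single query in plain Python. The function names, document IDs, and relevance labels are made up for illustration, and the NDCG variant shown (linear gain with a log2 discount) is one common formulation; the Evaluator's own implementation may differ in details.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with linear gain and a log2 position discount."""
    # DCG: gain of each retrieved item, discounted by its (1-based) rank
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal DCG: the same sum for the best possible ordering of the judged items
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking returned for one query, best match first
ranked = ["d3", "d1", "d7", "d2", "d9"]
# Hypothetical relevance judgments: three relevant documents, one never retrieved
relevance = {"d1": 1, "d2": 1, "d4": 1}

recall_5 = recall_at_k(ranked, set(relevance), k=5)
ndcg_5 = ndcg_at_k(ranked, relevance, k=5)
print(f"recall@5 = {recall_5:.3f}")
print(f"NDCG@5   = {ndcg_5:.3f}")
```

In this toy case two of the three relevant documents are retrieved, so recall@5 is 2/3. NDCG@5 is lower than that because the relevant documents sit at ranks 2 and 4 rather than at the top. Benchmark scores are these per-query values averaged over all queries.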

### Create an Evaluation Job

The following code cell creates an evaluation job using the target and eval configurations defined above.

In [None]:
# Create an evaluation job for the fine-tuned (custom) embedding model
eval_job = nemo_client.evaluation.jobs.create(
    config=retriever_eval_config_scidocs,
    target=retriever_target_config
)

print("Evaluation job created: ", eval_job.id)

### Monitor the Evaluation Job

The following code cell defines a helper function that polls the job status using the `nemo_client.evaluation.jobs.retrieve` method until the job leaves the `pending`, `created`, or `running` state or the timeout is exceeded.

**The evaluation takes about 10 to 12 minutes to complete.**

In [None]:
def wait_eval_job(nemo_client, job_id: str, polling_interval: int = 10, timeout: int = 6000):
    """Poll an evaluation job until it finishes or the timeout (in seconds) is exceeded."""
    start_time = time()
    job = nemo_client.evaluation.jobs.retrieve(job_id=job_id)
    status = job.status

    while (status in ["pending", "created", "running"]):
        # Check for timeout
        if time() - start_time > timeout:
            raise RuntimeError(f"Took more than {timeout} seconds.")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        job = nemo_client.evaluation.jobs.retrieve(job_id=job_id)
        status = job.status

        # Progress details (only fetch if status is "running")
        progress = 0
        if status == "running" and job.status_details:
            progress = job.status_details.progress or 0
        elif status == "completed":
            progress = 100

        print(f"Job status: {status} after {time() - start_time:.2f} seconds. Progress: {progress}%")

    return job

In [None]:
job = wait_eval_job(nemo_client, eval_job.id, polling_interval=5, timeout=5000)

### Review the Results

In [None]:
results = nemo_client.evaluation.jobs.results(job_id=eval_job.id)
print(results.model_dump_json(indent=2, exclude_unset=True))

---

## Next Steps

✅ **Completed in this notebook:**
- Configured evaluation target for the fine-tuned embedding model
- Created evaluation configuration for the BEIR SciDocs benchmark
- Ran evaluation job measuring recall@5, recall@10, NDCG@5, and NDCG@10 metrics
- Analyzed results: recall@5 improved from approximately 0.159 (baseline `nvidia/llama-3.2-nv-embedqa-1b-v2`) to 0.176
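
To put the recall@5 change in perspective, the short calculation below derives the absolute and relative gain from the two figures quoted above. The variable names are illustrative, and the numbers are the approximate values from this tutorial; your own run may differ.

```python
# Approximate recall@5 values quoted in this notebook
baseline_recall_at_5 = 0.159   # base model
finetuned_recall_at_5 = 0.176  # fine-tuned model

absolute_gain = finetuned_recall_at_5 - baseline_recall_at_5
relative_gain = absolute_gain / baseline_recall_at_5

print(f"Absolute gain: {absolute_gain:.3f}")   # prints "Absolute gain: 0.017"
print(f"Relative gain: {relative_gain:.1%}")   # prints "Relative gain: 10.7%"
```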

**What you've achieved:**

Through this three-part tutorial series, you've completed the full embedding fine-tuning workflow: prepared domain-specific training data, fine-tuned `nvidia/llama-3.2-nv-embedqa-1b-v2` for improved scientific retrieval, deployed your custom model as a NIM, and evaluated performance on the challenging SciDocs zero-shot benchmark.

**Next:**
- Explore other [NeMo Microservices tutorials](../../../README.md) for LLM customization, RAG evaluation, and guardrails
- Visit the [NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) to learn more about advanced features
- Apply these techniques to your own domain-specific datasets for even better retrieval quality
