# Part III: Model Evaluation Using NeMo Evaluator

This notebook covers the following:

0. [Prerequisites: Configurations and Health Checks](#step-0)
1. [Establish a baseline accuracy benchmark](#step-1) using the off-the-shelf `llama-3.2-1b-instruct` model
2. [Evaluate the LoRA customized model](#step-2)

In [1]:
import os
import json
import requests
from time import sleep, time

from openai import OpenAI

---
<a id="step-0"></a>
## Prerequisites: Configurations and Health Checks

Before you proceed, make sure that you completed the previous notebooks on data preparation and model fine-tuning to obtain the assets required to follow along.

### Configure NeMo Microservices Endpoints

The following code imports necessary configurations and prints the endpoints for the NeMo Data Store, Entity Store, Customizer, Evaluator, and NIM, as well as the namespace and base model.

In [2]:
from config import *

print(f"Data Store endpoint: {NDS_URL}")
print(f"Entity Store, Customizer, Evaluator endpoint: {NEMO_URL}")
print(f"NIM endpoint: {NIM_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Base Model: {BASE_MODEL}")

Data Store endpoint: http://data-store.test
Entity Store, Customizer, Evaluator endpoint: http://nemo.test
NIM endpoint: http://nim.test
Namespace: xlam-tutorial-ns
Base Model: meta/llama-3.2-1b-instruct


### Check Available Models

Assign the customized model name that you obtained in the previous notebook to the following variable.

In [3]:
CUSTOMIZED_MODEL = " "

The following code checks if the NIM endpoint hosts the model properly.

In [5]:
resp = requests.get(f"{NIM_URL}/v1/models")

models = resp.json().get("data", [])
model_names = [model["id"] for model in models]

assert CUSTOMIZED_MODEL in model_names, \
    f"Model {CUSTOMIZED_MODEL} not found"
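
A bare assertion only reports that the model is missing. As a minimal sketch, the same check can be factored into a pure function that returns the missing names, which makes a typo in `CUSTOMIZED_MODEL` easier to spot. The helper name `missing_models` and the sample payload are illustrative assumptions; only the `data` / `id` response shape is taken from the cell above.

```python
def missing_models(models_payload: dict, required: list[str]) -> list[str]:
    """Return the required model names absent from a /v1/models-style payload."""
    served = {model["id"] for model in models_payload.get("data", [])}
    return [name for name in required if name not in served]


# Hypothetical payload, shaped like the response parsed in the cell above.
sample_payload = {"data": [{"id": "meta/llama-3.2-1b-instruct"}]}

print(missing_models(sample_payload, ["meta/llama-3.2-1b-instruct"]))
print(missing_models(sample_payload, ["my-ns/my-custom-model"]))
```

Against a live endpoint, the payload would be `requests.get(f"{NIM_URL}/v1/models").json()`, and an empty list means every required model is being served.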

### Verify the Availability of the Datasets

In the previous notebook, we uploaded the test dataset along with the train and validation sets at `/v1/datasets/{NMS_NAMESPACE}/{DATASET_NAME}`. 
The following code performs a sanity check to validate the dataset by sending a GET request to the specified URL.
It asserts that the status code of the response is either 200 or 201, indicating a successful fetch.
If the assertion passes, it prints the URL of the files in the dataset.


In [6]:
# Sanity check to validate dataset
res = requests.get(url=f"{NEMO_URL}/v1/datasets/{NMS_NAMESPACE}/{DATASET_NAME}")
assert res.status_code in (200, 201), f"Status Code {res.status_code} Failed to fetch dataset {res.text}"

print("Files URL:", res.json()["files_url"])

Files URL: hf://datasets/xlam-tutorial-ns/xlam-ft-dataset
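
The reported `files_url` should carry the same namespace and dataset name that the evaluation config later interpolates into its own `files_url`. The following sketch splits such a URL so the two parts can be compared against `NMS_NAMESPACE` and `DATASET_NAME`. The function name `split_files_url` is a hypothetical helper, and the `hf://datasets/<namespace>/<name>` layout is inferred from the output above rather than from a specification.

```python
def split_files_url(files_url: str) -> tuple[str, str]:
    """Split 'hf://datasets/<namespace>/<name>' into (namespace, name)."""
    prefix = "hf://datasets/"
    if not files_url.startswith(prefix):
        raise ValueError(f"Unexpected files_url: {files_url!r}")
    namespace, _, name = files_url[len(prefix):].partition("/")
    if not namespace or not name:
        raise ValueError(f"Missing namespace or dataset name in {files_url!r}")
    return namespace, name


# URL copied from the output above.
namespace, name = split_files_url("hf://datasets/xlam-tutorial-ns/xlam-ft-dataset")
print(namespace, name)
```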


---
<a id="step-1"></a>
## Step 1: Establish Baseline Accuracy Benchmark

First, we’ll assess the accuracy of the 'off-the-shelf' base model—pristine, untouched, and blissfully unaware of the transformative magic that is fine-tuning. 

### 1.1 Create an Evaluation Config Object
Create an evaluation configuration object for NeMo Evaluator. For more information on various parameters, refer to the [NeMo Evaluator configuration](https://developer.nvidia.com/docs/nemo-microservices/evaluate/evaluation-configs.html) in the NeMo microservices documentation.


* `tasks.custom-tool-calling.dataset.files_url` indicates which test file to use. The file must be uploaded to the NeMo Data Store and registered with the Entity Store before use.
* `tasks.custom-tool-calling.dataset.limit` specifies how many test records to run the evaluation on.
* The evaluation metric `tasks.custom-tool-calling.metrics.tool-calling-accuracy` reports two scores: `function_name_accuracy`, the fraction of samples where the predicted function name matches the ground truth, and `function_name_and_args_accuracy`, the fraction where both the function name and its arguments match.
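
To make the two scores concrete, here is a simplified sketch of how such metrics could be computed. It is an illustration only, not NeMo Evaluator's implementation: the matching rules (one tool call per item, exact equality on name and arguments) are assumptions, and the sample records are made up.

```python
def score_tool_calls(predictions: list[dict], references: list[dict]) -> dict:
    """Compare one predicted call per item against one reference call per item."""
    if not references:
        raise ValueError("references must not be empty")
    name_hits = 0
    name_and_args_hits = 0
    for predicted, expected in zip(predictions, references):
        name_ok = predicted.get("name") == expected["name"]
        # Dict equality ignores key order, so argument order does not matter here.
        args_ok = predicted.get("arguments") == expected["arguments"]
        name_hits += name_ok
        name_and_args_hits += name_ok and args_ok
    total = len(references)
    return {
        "function_name_accuracy": name_hits / total,
        "function_name_and_args_accuracy": name_and_args_hits / total,
    }


# Made-up examples: one exact match, one wrong argument, one wrong name,
# and one match whose arguments appear in a different order.
references = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_time", "arguments": {"tz": "UTC"}},
    {"name": "search", "arguments": {"q": "nim"}},
    {"name": "add", "arguments": {"a": 1, "b": 2}},
]
predictions = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_time", "arguments": {"tz": "PST"}},
    {"name": "lookup", "arguments": {"q": "nim"}},
    {"name": "add", "arguments": {"b": 2, "a": 1}},
]

print(score_tool_calls(predictions, references))
```

In this sketch the second score can never exceed the first, because an item counts toward it only when the function name is already correct.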

In [7]:
simple_tool_calling_eval_config = {
    "type": "custom",
    "tasks": {
        "custom-tool-calling": {
            "type": "chat-completion",
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/testing/xlam-test-single.jsonl",
                "limit": 50
            },
            "params": {
                "template": {
                    "messages": "{{ item.messages | tojson}}",
                    "tools": "{{ item.tools | tojson }}",
                    "tool_choice": "auto"
                }
            },
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}
                }
            }
        }
    }
}

### 1.2 Launch Evaluation Job

The following code sends a POST request to the NeMo Evaluator API to launch an evaluation job. It uses the evaluation configuration defined in the previous cell and targets the base model.


In [None]:
res = requests.post(
    f"{NEMO_URL}/v1/evaluation/jobs",
    json={
        "config": simple_tool_calling_eval_config,
        "target": {"type": "model", "model": BASE_MODEL}
    }
)

base_eval_job_id = res.json()["id"]

res.json()
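
Because Step 2 submits the same request with a different target model, the request body can be built by a small function. This is a minimal sketch: `build_eval_job_payload` is a hypothetical helper name, the payload shape is copied from the cell above, and the config passed in the example is a stub used only to show the structure.

```python
def build_eval_job_payload(config: dict, model: str) -> dict:
    """Build the JSON body this notebook posts to /v1/evaluation/jobs."""
    return {
        "config": config,
        "target": {"type": "model", "model": model},
    }


# Stub config for illustration; in the notebook you would pass
# simple_tool_calling_eval_config instead.
payload = build_eval_job_payload({"type": "custom", "tasks": {}}, "meta/llama-3.2-1b-instruct")
print(payload["target"])
```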

The following code defines a helper function to poll on job status until it finishes:

In [9]:
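def wait_eval_job(job_url: str, polling_interval: int = 10, timeout: int = 6000):
    """Poll an evaluation job until it leaves the pending/created/running states."""
    start_time = time()
    res = requests.get(job_url)
    status = res.json()["status"]
    # Initialize so the progress print below never hits an unbound variable
    progress = 0.0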

    while (status in ["pending", "created", "running"]):
        # Check for timeout
        if time() - start_time > timeout:
            raise RuntimeError(f"Took more than {timeout} seconds.")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        res = requests.get(job_url)
        status = res.json()["status"]

        # Progress details (only fetch if status is "running")
        if status == "running":
            progress = res.json().get("status_details", {}).get("progress", 0)
        elif status == "completed":
            progress = 100

        print(f"Job status: {status} after {time() - start_time:.2f} seconds. Progress: {progress}%")

    return res

Run the helper function:

In [10]:
# Poll
res = wait_eval_job(f"{NEMO_URL}/v1/evaluation/jobs/{base_eval_job_id}", polling_interval=5, timeout=600)

Job status: running after 5.03 seconds. Progress: 0.0%
Job status: running after 10.05 seconds. Progress: 0.0%
Job status: running after 15.06 seconds. Progress: 0.0%
Job status: running after 20.08 seconds. Progress: 0.0%
Job status: running after 25.09 seconds. Progress: 8.0%
Job status: running after 30.11 seconds. Progress: 12.0%
Job status: running after 35.12 seconds. Progress: 16.0%
Job status: running after 40.13 seconds. Progress: 26.0%
Job status: running after 45.15 seconds. Progress: 26.0%
Job status: running after 50.16 seconds. Progress: 26.0%
Job status: running after 55.18 seconds. Progress: 26.0%
Job status: running after 60.20 seconds. Progress: 26.0%
Job status: running after 65.21 seconds. Progress: 26.0%
Job status: running after 70.23 seconds. Progress: 32.0%
Job status: running after 75.24 seconds. Progress: 32.0%
Job status: running after 80.26 seconds. Progress: 38.0%
Job status: running after 85.28 seconds. Progress: 38.0%
Job status: running after 90.29 secon
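
The polling pattern can also be exercised without a live Evaluator service by injecting the function that fetches the job state. The sketch below is an assumption-laden variant of `wait_eval_job`, not a replacement for it: `poll_until_done`, `ACTIVE_STATES`, and the fake job payloads are hypothetical names and data, and only the three in-progress states are taken from the helper above.

```python
from time import monotonic, sleep
from typing import Callable

# States treated as "still in progress", mirroring wait_eval_job.
ACTIVE_STATES = {"pending", "created", "running"}


def poll_until_done(
    fetch_job: Callable[[], dict],
    polling_interval: float = 0.0,
    timeout: float = 5.0,
) -> dict:
    """Call fetch_job repeatedly until the job leaves the active states."""
    start = monotonic()
    job = fetch_job()
    while job["status"] in ACTIVE_STATES:
        if monotonic() - start > timeout:
            raise TimeoutError(f"Job still {job['status']} after {timeout} seconds.")
        sleep(polling_interval)
        job = fetch_job()
    return job


# Hypothetical sequence of job payloads standing in for HTTP responses.
fake_jobs = iter([
    {"status": "created"},
    {"status": "running", "status_details": {"progress": 50.0}},
    {"status": "completed"},
])
final_job = poll_until_done(lambda: next(fake_jobs))
print(final_job["status"])
```

With a real job, `fetch_job` could be `lambda: requests.get(job_url).json()`. Checking the returned status afterwards is worthwhile, since the loop also exits for any state other than the three active ones.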

### 1.3 Review Evaluation Metrics

The following code sends a GET request to retrieve the evaluation results for the base evaluation job. 

In [11]:
res = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{base_eval_job_id}/results")
res.json()

{'created_at': '2025-04-02T19:08:07.819322',
 'updated_at': '2025-04-02T19:08:07.819325',
 'id': 'evaluation_result-UgxzDz9JYWA6KMv5f9dWFR',
 'job': 'eval-ddAsfX2u7htc8RjjrTV3G',
 'tasks': {'custom-tool-calling': {'metrics': {'tool-calling-accuracy': {'scores': {'function_name_accuracy': {'value': 0.12,
       'stats': {'count': 50, 'sum': 6.0, 'mean': 0.12}},
      'function_name_and_args_accuracy': {'value': 0.08,
       'stats': {'count': 50, 'sum': 4.0, 'mean': 0.08}}}}}}},
 'groups': {},
 'namespace': 'default',
 'custom_fields': {}}

The following code extracts and prints the accuracy scores for the base model.

In [12]:
# Extract function name accuracy score
base_function_name_accuracy_score = res.json()["tasks"]["custom-tool-calling"]["metrics"]["tool-calling-accuracy"]["scores"]["function_name_accuracy"]["value"]
base_function_name_and_args_accuracy = res.json()["tasks"]["custom-tool-calling"]["metrics"]["tool-calling-accuracy"]["scores"]["function_name_and_args_accuracy"]["value"]

print(f"Base model: function_name_accuracy: {base_function_name_accuracy_score}")
print(f"Base model: function_name_and_args_accuracy: {base_function_name_and_args_accuracy}")

Base model: function_name_accuracy: 0.12
Base model: function_name_and_args_accuracy: 0.08


Without any fine-tuning, the `meta/llama-3.2-1b-instruct` model should score roughly 12% in `function_name_accuracy` and 8% in `function_name_and_args_accuracy`.
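
The long chain of dictionary lookups used to read each score can be wrapped in one function that returns every score for a task and metric pair. This is a minimal sketch; `extract_scores` is a hypothetical helper, and the sample payload is a trimmed copy of the results shown above.

```python
def extract_scores(
    results: dict,
    task: str = "custom-tool-calling",
    metric: str = "tool-calling-accuracy",
) -> dict:
    """Return {score_name: value} for one task/metric pair in a results payload."""
    scores = results["tasks"][task]["metrics"][metric]["scores"]
    return {name: details["value"] for name, details in scores.items()}


# Trimmed copy of the base-model results payload shown earlier.
sample_results = {
    "tasks": {
        "custom-tool-calling": {
            "metrics": {
                "tool-calling-accuracy": {
                    "scores": {
                        "function_name_accuracy": {
                            "value": 0.12,
                            "stats": {"count": 50, "sum": 6.0, "mean": 0.12},
                        },
                        "function_name_and_args_accuracy": {
                            "value": 0.08,
                            "stats": {"count": 50, "sum": 4.0, "mean": 0.08},
                        },
                    }
                }
            }
        }
    }
}

print(extract_scores(sample_results))
```

The same call works unchanged on the fine-tuned model's results in Step 2, because both jobs share one evaluation config.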

### (Optional) 1.4 Download and Inspect Results

To take a deeper look into the model's generated outputs, you can download and review the results.

In [13]:
def download_evaluation_results(eval_url, eval_job_id, output_file):
    """Downloads evaluation results for a given job ID from the NeMo server."""
    
    download_response = requests.get(f"{eval_url}/v1/evaluation/jobs/{eval_job_id}/download-results")
    
    # Check the response status
    if download_response.status_code == 200:
        # Save the results to a file
        with open(output_file, "wb") as file:
            file.write(download_response.content)
        print(f"Evaluation results for job {eval_job_id} downloaded successfully to {output_file}.")
        return True
    else:
        print(f"Failed to download evaluation results. Status code: {download_response.status_code}")
        print('Response:', download_response.text)
        return False

In [14]:
output_file = f"{base_eval_job_id}.json"

# The assertion fails if the download fails
assert download_evaluation_results(eval_url=NEMO_URL, eval_job_id=base_eval_job_id, output_file=output_file)

Evaluation results for job eval-ddAsfX2u7htc8RjjrTV3G downloaded successfully to eval-ddAsfX2u7htc8RjjrTV3G.json.


You can inspect the downloaded results file to see where the base model makes errors. Without any fine-tuning, some models not only return inaccurate function names and arguments but may also fail to adhere to a consistent, predictable output schema. This makes their outputs difficult to parse automatically, which hinders integration with external systems.

---
<a id="step-2"></a>
## Step 2: Evaluate the LoRA Customized Model

### 2.1 Launch Evaluation Job

Run another evaluation job with the same evaluation config but with the customized model.

In [15]:
res = requests.post(
    f"{NEMO_URL}/v1/evaluation/jobs",
    json={
        "config": simple_tool_calling_eval_config,
        "target": {"type": "model", "model": CUSTOMIZED_MODEL},
    },
)

ft_eval_job_id = res.json()["id"]

res.json()

{'created_at': '2025-04-02T19:12:03.849375',
 'updated_at': '2025-04-02T19:12:03.849376',
 'id': 'eval-RMbUCrxKuzuE5cJdhwh3Uo',
 'namespace': 'default',
 'description': None,
 'target': {'schema_version': '1.0',
  'id': 'eval-target-DjryeDuurpvztuwb3MKpVT',
  'description': None,
  'type_prefix': 'eval-target',
  'namespace': 'default',
  'project': None,
  'created_at': '2025-04-02T19:12:03.848747',
  'updated_at': '2025-04-02T19:12:03.848748',
  'custom_fields': {},
  'ownership': None,
  'name': 'eval-target-DjryeDuurpvztuwb3MKpVT',
  'type': 'model',
  'cached_outputs': None,
  'model': 'xlam-tutorial-ns/llama-3.2-1b-xlam-run1@cust-4rZxaBqeqGtVUkZ3MdoMXT',
  'retriever': None,
  'rag': None},
 'config': {'schema_version': '1.0',
  'id': 'eval-config-Cp7srSQAmkGQ3QZcqwL4Jo',
  'description': None,
  'type_prefix': 'eval-config',
  'namespace': 'default',
  'project': None,
  'created_at': '2025-04-02T19:12:03.848538',
  'updated_at': '2025-04-02T19:12:03.848542',
  'custom_fields': 

In [16]:
# Poll
res = wait_eval_job(f"{NEMO_URL}/v1/evaluation/jobs/{ft_eval_job_id}", polling_interval=5, timeout=600)

Job status: running after 5.03 seconds. Progress: 0.0%
Job status: running after 10.04 seconds. Progress: 0.0%
Job status: running after 15.06 seconds. Progress: 0.0%
Job status: running after 20.08 seconds. Progress: 0.0%
Job status: running after 25.09 seconds. Progress: 8.0%
Job status: running after 30.11 seconds. Progress: 12.0%
Job status: running after 35.13 seconds. Progress: 18.0%
Job status: running after 40.14 seconds. Progress: 26.0%
Job status: running after 45.16 seconds. Progress: 26.0%
Job status: running after 50.18 seconds. Progress: 26.0%
Job status: running after 55.19 seconds. Progress: 26.0%
Job status: running after 60.21 seconds. Progress: 26.0%
Job status: running after 65.22 seconds. Progress: 28.0%
Job status: running after 70.24 seconds. Progress: 32.0%
Job status: running after 75.26 seconds. Progress: 32.0%
Job status: running after 80.27 seconds. Progress: 38.0%
Job status: running after 85.29 seconds. Progress: 38.0%
Job status: running after 90.30 secon

### 2.2 Review Evaluation Metrics
The following code sends a GET request to retrieve the evaluation results for the fine-tuned model evaluation job.

In [17]:
res = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{ft_eval_job_id}/results")
res.json()

{'created_at': '2025-04-02T19:12:03.887985',
 'updated_at': '2025-04-02T19:12:03.887986',
 'id': 'evaluation_result-RmJs94jfgu1J5ePZbm23qF',
 'job': 'eval-RMbUCrxKuzuE5cJdhwh3Uo',
 'tasks': {'custom-tool-calling': {'metrics': {'tool-calling-accuracy': {'scores': {'function_name_accuracy': {'value': 0.92,
       'stats': {'count': 50, 'sum': 46.0, 'mean': 0.92}},
      'function_name_and_args_accuracy': {'value': 0.72,
       'stats': {'count': 50, 'sum': 36.0, 'mean': 0.72}}}}}}},
 'groups': {},
 'namespace': 'default',
 'custom_fields': {}}

In [18]:
# Extract function name accuracy score
ft_function_name_accuracy_score = res.json()["tasks"]["custom-tool-calling"]["metrics"]["tool-calling-accuracy"]["scores"]["function_name_accuracy"]["value"]
ft_function_name_and_args_accuracy = res.json()["tasks"]["custom-tool-calling"]["metrics"]["tool-calling-accuracy"]["scores"]["function_name_and_args_accuracy"]["value"]

print(f"Custom model: function_name_accuracy: {ft_function_name_accuracy_score}")
print(f"Custom model: function_name_and_args_accuracy: {ft_function_name_and_args_accuracy}")

Custom model: function_name_accuracy: 0.92
Custom model: function_name_and_args_accuracy: 0.72


A successfully fine-tuned `meta/llama-3.2-1b-instruct` shows a significant increase in tool-calling accuracy.

In this case, you should observe roughly the following improvements:
* function_name_accuracy: 12% to 92%
* function_name_and_args_accuracy: 8% to 72%
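
The size of these improvements can be expressed both in percentage points and as a ratio to the base score. The sketch below works through that arithmetic with the scores reported in this notebook; `accuracy_gain` is a hypothetical helper name.

```python
def accuracy_gain(before: float, after: float) -> tuple[float, float]:
    """Return (gain in percentage points, after/before ratio), rounded to 1 decimal."""
    points = round((after - before) * 100, 1)
    ratio = round(after / before, 1)
    return points, ratio


# Scores reported by the two evaluation jobs in this notebook.
base = {"function_name_accuracy": 0.12, "function_name_and_args_accuracy": 0.08}
tuned = {"function_name_accuracy": 0.92, "function_name_and_args_accuracy": 0.72}

for metric in base:
    points, ratio = accuracy_gain(base[metric], tuned[metric])
    print(f"{metric}: +{points} points ({ratio}x the base score)")
```

On the 50-sample subset, this corresponds to 46 correct function names instead of 6, and 36 fully correct calls instead of 4.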

Because this evaluation ran on a limited number of samples for demonstration purposes, you may choose to increase `tasks.custom-tool-calling.dataset.limit` in your evaluation config `simple_tool_calling_eval_config`.
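
One way to change the dataset `limit` without mutating the original config is to work on a deep copy, so earlier results remain tied to the config that produced them. This is a sketch under assumptions: `with_limit` is a hypothetical helper, and `demo_config` is a cut-down stand-in for `simple_tool_calling_eval_config` with a placeholder `files_url`.

```python
import copy


def with_limit(config: dict, limit: int, task: str = "custom-tool-calling") -> dict:
    """Return a deep copy of an evaluation config with a different dataset limit."""
    updated = copy.deepcopy(config)
    updated["tasks"][task]["dataset"]["limit"] = limit
    return updated


# Cut-down stand-in for the notebook's config; the files_url is a placeholder.
demo_config = {
    "type": "custom",
    "tasks": {
        "custom-tool-calling": {
            "dataset": {
                "files_url": "hf://datasets/<namespace>/<dataset>/testing/xlam-test-single.jsonl",
                "limit": 50,
            }
        }
    },
}

larger_config = with_limit(demo_config, 200)
print(larger_config["tasks"]["custom-tool-calling"]["dataset"]["limit"])
print(demo_config["tasks"]["custom-tool-calling"]["dataset"]["limit"])
```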

## (Optional) Next Steps



* You may also run the same evaluation on a base `meta/llama-3.1-70b-instruct` model for comparison.
To do so, first deploy the corresponding NIM by following the [deployment instructions](https://build.nvidia.com/meta/llama-3_1-70b-instruct/deploy). After your NIM is deployed, set that endpoint as your evaluation target:

``` python
# Create an evaluation target
NIM_URL = "http://0.0.0.0:8000"
EVAL_TARGET = {
    "type": "model", 
    "model": {
       "api_endpoint": {
         "url": f"{NIM_URL}/v1/completions",
         "model_id": "meta/llama-3.1-70b-instruct",
        }
    }
}

# Start eval job
res = requests.post(
    f"{NEMO_URL}/v1/evaluation/jobs",
    json={
        "config": simple_tool_calling_eval_config,
        "target": EVAL_TARGET
    }
)
```

Running the evaluation with the default config in this notebook, you should observe `meta/llama-3.1-70b-instruct` performance similar to the following:
* function_name_accuracy: 98%
* function_name_and_args_accuracy: 66%

Remarkably, a LoRA-tuned `meta/llama-3.2-1b-instruct` achieves a `function_name_accuracy` close to that of a model 70 times its size (92% versus 98%) and even outperforms it on the combined `function_name_and_args_accuracy` score (72% versus 66%).

You can now follow the same process to fine-tune other NIM for LLMs and compare the accuracy of each base model with its fine-tuned counterpart. Doing so helps you produce more accurate models for your use case.