# Part III: Model Evaluation Using NeMo Evaluator

This notebook covers the following:

0. [Prerequisites: Configurations and Health Checks](#step-0)
1. [Establish a baseline accuracy benchmark](#step-1) using the off-the-shelf `llama-3.2-1b-instruct` model
2. [Evaluate the LoRA customized model](#step-2)

In [1]:
from time import sleep, time
from nemo_microservices import NeMoMicroservices

---
<a id="step-0"></a>
## Prerequisites: Configurations and Health Checks

Before you proceed, make sure that you have completed the previous notebooks on data preparation and model fine-tuning, which produce the assets this notebook requires.

### Configure NeMo Microservices Endpoints

The following code imports necessary configurations and prints the endpoints for the NeMo Data Store, Entity Store, Customizer, Evaluator, and NIM, as well as the namespace and base model.

In [3]:
from config import *

# Initialize NeMo Microservices SDK client
nemo_client = NeMoMicroservices(
    base_url=NEMO_URL,
    inference_base_url=NIM_URL,
)

In [4]:
print(f"Data Store endpoint: {NDS_URL}")
print(f"Entity Store, Customizer, Evaluator endpoint: {NEMO_URL}")
print(f"NIM endpoint: {NIM_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Base Model: {BASE_MODEL}")

Data Store endpoint: http://data-store.test
Entity Store, Customizer, Evaluator endpoint: http://nemo.test
NIM endpoint: http://nim.test
Namespace: xlam-tutorial-ns
Base Model: meta/llama-3.2-1b-instruct


### Check Available Models

Set the following variable to the name of the customized model that you obtained in the previous notebook.

In [5]:
CUSTOMIZED_MODEL = CUSTOM_MODEL # paste from the previous notebook

The following code checks that the NIM endpoint is serving the customized model.

In [6]:
# Check if the custom LoRA model is hosted by NVIDIA NIM
models = nemo_client.inference.models.list()
model_names = [model.id for model in models.data]

assert CUSTOMIZED_MODEL in model_names, \
    f"Model {CUSTOMIZED_MODEL} not found"

### Verify the Availability of the Datasets

In the previous notebook, we uploaded the test dataset along with the train and validation sets using `nemo_client.datasets.create(name=DATASET_NAME, namespace=NMS_NAMESPACE, ...)`.
The following code performs a sanity check to validate the dataset by retrieving it and printing the URL of the files in the dataset.

In [7]:
# Sanity check to validate dataset
dataset = nemo_client.datasets.retrieve(namespace=NMS_NAMESPACE, dataset_name=DATASET_NAME)
print("Files URL:", dataset.files_url)

Files URL: hf://datasets/xlam-tutorial-ns/xlam-ft-dataset


---
<a id="step-1"></a>
## Step 1: Establish Baseline Accuracy Benchmark

First, we'll assess the accuracy of the off-the-shelf base model, before any fine-tuning has been applied.

### 1.1: Create an Evaluation Config Object
Create an evaluation configuration object for NeMo Evaluator. For more information on various parameters, refer to the [NeMo Evaluator configuration](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-configs.html) in the NeMo microservices documentation.


* `tasks.custom-tool-calling.dataset.files_url` indicates which test file to use. The file must be uploaded to the NeMo Data Store and registered with the Entity Store before it can be used.
* `tasks.custom-tool-calling.dataset.limit` sets how many records of the test data the evaluation runs on.
* The `tool-calling-accuracy` metric under `tasks.custom-tool-calling.metrics` reports two scores: `function_name_accuracy`, the fraction of records where the predicted function name matches the ground truth, and `function_name_and_args_accuracy`, the fraction where both the function name and its arguments match.

In [8]:
simple_tool_calling_eval_config = {
    "type": "custom",
    "tasks": {
        "custom-tool-calling": {
            "type": "chat-completion",
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/testing/xlam-test-single.jsonl",
                "limit": 50
            },
            "params": {
                "template": {
                    "messages": "{{ item.messages | tojson}}",
                    "tools": "{{ item.tools | tojson }}",
                    "tool_choice": "auto"
                }
            },
            "metrics": {
                "tool-calling-accuracy": {
                    "type": "tool-calling",
                    "params": {"tool_calls_ground_truth": "{{ item.tool_calls | tojson }}"}
                }
            }
        }
    }
}
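To make the two scores concrete, the following is a minimal sketch of how a single test record could be scored. It is an illustration only and not NeMo Evaluator's actual implementation: the helper names (`_normalize`, `_canonical`, `score_item`), the OpenAI-style tool-call layout, and the order-insensitive exact-match comparison are all assumptions.

```python
import json


def _normalize(call: dict) -> tuple:
    """Return (function name, arguments as a dict) for one tool call."""
    fn = call.get("function", call)
    args = fn.get("arguments", {})
    # Arguments are sometimes serialized as a JSON string; decode them if so.
    if isinstance(args, str):
        args = json.loads(args)
    return fn["name"], args


def _canonical(calls: list) -> list:
    """Order-insensitive, comparable form of a list of (name, args) pairs."""
    return sorted(json.dumps([name, args], sort_keys=True) for name, args in calls)


def score_item(predicted: list, expected: list) -> dict:
    """Score one record: 1.0 on a match, 0.0 otherwise (hypothetical scoring rule)."""
    pred = [_normalize(c) for c in predicted]
    gold = [_normalize(c) for c in expected]
    names_match = sorted(n for n, _ in pred) == sorted(n for n, _ in gold)
    full_match = _canonical(pred) == _canonical(gold)
    return {
        "function_name_accuracy": float(names_match),
        "function_name_and_args_accuracy": float(full_match),
    }


# Made-up example: the right function is called with a wrong argument value.
expected = [{"function": {"name": "get_weather", "arguments": {"city": "Paris"}}}]
predicted = [{"function": {"name": "get_weather", "arguments": '{"city": "Rome"}'}}]
print(score_item(predicted, expected))
# {'function_name_accuracy': 1.0, 'function_name_and_args_accuracy': 0.0}
```

Averaging these per-record scores over the evaluated subset gives numbers of the same kind as the two reported accuracies.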

### 1.2: Launch Evaluation Job 

The following code calls the `nemo_client.evaluation.jobs.create()` method to launch an evaluation job in the NeMo Evaluator.
It uses the evaluation configuration defined in the previous cell and targets the base model.

In [9]:
# Create evaluation job for the base model
eval_job = nemo_client.evaluation.jobs.create(
    config=simple_tool_calling_eval_config,
    target={"type": "model", "model": BASE_MODEL}
)

base_eval_job_id = eval_job.id
print(f"Created evaluation job: {base_eval_job_id}")
eval_job

Created evaluation job: eval-6QisJKdfmRquYXdFBNBAD7


EvaluationJob(config=Config(type='custom', id='eval-config-UEphFyZQvSb5rFAkPuRYuu', created_at=datetime.datetime(2025, 6, 20, 16, 36, 27, 493465), custom_fields={}, description=None, groups=None, name='eval-config-UEphFyZQvSb5rFAkPuRYuu', namespace='default', ownership=None, params=None, project=None, schema_version='1.0', tasks={'custom-tool-calling': ConfigTasks(type='chat-completion', dataset=ConfigTasksDatasetDatasetEv(files_url='hf://datasets/xlam-tutorial-ns/xlam-ft-dataset/testing/xlam-test-single.jsonl', id='dataset-UAEo4C8C4p6uEqTFTn2v8B', created_at=datetime.datetime(2025, 6, 20, 16, 36, 27, 493606), custom_fields={}, description=None, format=None, hf_endpoint=None, limit=50, name='dataset-UAEo4C8C4p6uEqTFTn2v8B', namespace='default', ownership=None, project=None, schema_version='1.0', split=None, type_prefix=None, updated_at=datetime.datetime(2025, 6, 20, 16, 36, 27, 493607), version_id='main', version_tags=[]), metrics={'tool-calling-accuracy': ConfigTasksMetrics(type='tool

The following code defines a helper function that polls the job status until the job finishes:

In [10]:
def wait_eval_job(nemo_client, job_id: str, polling_interval: int = 10, timeout: int = 6000):
    """Poll an evaluation job until it is no longer pending, created, or running."""
    start_time = time()
    job = nemo_client.evaluation.jobs.retrieve(job_id=job_id)
    status = job.status

    while (status in ["pending", "created", "running"]):
        # Check for timeout
        if time() - start_time > timeout:
            raise RuntimeError(f"Took more than {timeout} seconds.")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        job = nemo_client.evaluation.jobs.retrieve(job_id=job_id)
        status = job.status

        # Progress details (only fetch if status is "running")
        progress = 0
        if status == "running" and job.status_details:
            progress = job.status_details.progress or 0
        elif status == "completed":
            progress = 100

        print(f"Job status: {status} after {time() - start_time:.2f} seconds. Progress: {progress}%")

    return job
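The helper above returns the job object whatever its final status, so it can be worth checking that status before reading results. The following is a minimal sketch of such a guard. The function name `require_completed` is hypothetical, and it assumes that `"completed"` (the status string the helper already tests for) is the only successful terminal state.

```python
def require_completed(status: str, job_id: str) -> None:
    """Raise if an evaluation job ended in any state other than "completed"."""
    if status != "completed":
        raise RuntimeError(f"Evaluation job {job_id} ended with status {status!r}.")
```

After `wait_eval_job` returns, it could be called as `require_completed(job.status, job_id)` with the job ID you polled.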

Run the helper function:

In [11]:
# Poll
job = wait_eval_job(nemo_client, base_eval_job_id, polling_interval=5, timeout=600)

Job status: running after 5.03 seconds. Progress: 0.0%
Job status: running after 10.05 seconds. Progress: 0.0%
Job status: running after 15.06 seconds. Progress: 0.0%
Job status: running after 20.08 seconds. Progress: 0.0%
Job status: running after 25.09 seconds. Progress: 8.0%
Job status: running after 30.11 seconds. Progress: 12.0%
Job status: running after 35.12 seconds. Progress: 16.0%
Job status: running after 40.13 seconds. Progress: 26.0%
Job status: running after 45.15 seconds. Progress: 26.0%
Job status: running after 50.16 seconds. Progress: 26.0%
Job status: running after 55.18 seconds. Progress: 26.0%
Job status: running after 60.20 seconds. Progress: 26.0%
Job status: running after 65.21 seconds. Progress: 26.0%
Job status: running after 70.23 seconds. Progress: 32.0%
Job status: running after 75.24 seconds. Progress: 32.0%
Job status: running after 80.26 seconds. Progress: 38.0%
Job status: running after 85.28 seconds. Progress: 38.0%
Job status: running after 90.29 secon

### 1.3 Review Evaluation Metrics

The following code retrieves the evaluation results for the base evaluation job.

In [13]:
results = nemo_client.evaluation.jobs.results(job_id=base_eval_job_id)
print(results.model_dump_json(indent=2, exclude_unset=True))

{
  "job": "eval-6QisJKdfmRquYXdFBNBAD7",
  "id": "evaluation_result-B8PqbKeE9KrzMUxGRCcpF2",
  "created_at": "2025-06-20T16:36:27.527337",
  "custom_fields": {},
  "groups": {},
  "namespace": "default",
  "tasks": {
    "custom-tool-calling": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "function_name_accuracy": {
              "value": 0.12,
              "stats": {
                "count": 50,
                "mean": 0.16,
                "sum": 8.0
              }
            },
            "function_name_and_args_accuracy": {
              "value": 0.08,
              "stats": {
                "count": 50,
                "mean": 0.12,
                "sum": 6.0
              }
            }
          }
        }
      }
    }
  },
  "updated_at": "2025-06-20T16:36:27.527338"
}


The following code extracts and prints the accuracy scores for the base model.

In [14]:
# Extract function name accuracy score
base_scores = results.tasks["custom-tool-calling"].metrics["tool-calling-accuracy"].scores
base_function_name_accuracy_score = base_scores["function_name_accuracy"].value
base_function_name_and_args_accuracy = base_scores["function_name_and_args_accuracy"].value

print(f"Base model: function_name_accuracy: {base_function_name_accuracy_score}")
print(f"Base model: function_name_and_args_accuracy: {base_function_name_and_args_accuracy}")

Base model: function_name_accuracy: 0.12
Base model: function_name_and_args_accuracy: 0.08


Without any fine-tuning, the `meta/llama-3.2-1b-instruct` model should score about 12% on `function_name_accuracy` and about 8% on `function_name_and_args_accuracy`. Scores can vary by roughly ±4 percentage points between runs because LLM output is non-deterministic.

### (Optional) 1.4 Download and Inspect Results

To take a deeper look into the model's generated outputs, you can download and review the results.

In [32]:
def download_evaluation_results(nemo_client, eval_job_id, output_file):
    """Downloads evaluation results for a given job ID."""
    
    try:
        # Get download results
        results = nemo_client.evaluation.jobs.download_results(job_id=eval_job_id)
        
        # Save the results to a file
        results.write_to_file(output_file)

        print(f"Evaluation results for job {eval_job_id} downloaded successfully to {output_file}.")
        return True
    except Exception as e:
        print(f"Failed to download evaluation results: {e}")
        return False

In [33]:
output_file = f"{base_eval_job_id}.zip"

# The assertion fails if the download fails
assert download_evaluation_results(nemo_client=nemo_client, eval_job_id=base_eval_job_id, output_file=output_file)

Evaluation results for job eval-6QisJKdfmRquYXdFBNBAD7 downloaded successfully to eval-6QisJKdfmRquYXdFBNBAD7.zip.


You can inspect the downloaded results file to see where the base model makes errors. Without any fine-tuning, some models not only return inaccurate function names and arguments, but may also fail to adhere to a consistent, predictable output schema. This makes the outputs difficult to parse automatically and hinders integration with external systems.
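As a starting point for that inspection, the following sketch lists the files inside a downloaded archive using Python's standard `zipfile` module. The helper name `list_archive_members` is hypothetical, and the sketch makes no assumption about how the Evaluator lays out files inside the archive; it only reports what is there.

```python
import zipfile


def list_archive_members(zip_path: str) -> list:
    """Return (file name, size in bytes) for every file in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()
        ]
```

For example, `list_archive_members(output_file)` would show the contents of the archive saved above, after which individual files can be read with `zipfile.ZipFile.open`.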

---
<a id="step-2"></a>
## Step 2: Evaluate the LoRA Customized Model

### 2.1 Launch Evaluation Job

Run another evaluation job with the same evaluation config but with the customized model.

In [34]:
# Create evaluation job for customized model
ft_eval_job = nemo_client.evaluation.jobs.create(
    config=simple_tool_calling_eval_config,
    target={"type": "model", "model": CUSTOMIZED_MODEL}
)

ft_eval_job_id = ft_eval_job.id
print(f"Created evaluation job for customized model: {ft_eval_job_id}")
ft_eval_job

Created evaluation job for customized model: eval-QbibneF16tyDiCJWc9oxgZ


EvaluationJob(config=Config(type='custom', id='eval-config-QZNjbAHSWjNr5FvAVhuzD', created_at=datetime.datetime(2025, 6, 20, 16, 55, 38, 581764), custom_fields={}, description=None, groups=None, name='eval-config-QZNjbAHSWjNr5FvAVhuzD', namespace='default', ownership=None, params=None, project=None, schema_version='1.0', tasks={'custom-tool-calling': ConfigTasks(type='chat-completion', dataset=ConfigTasksDatasetDatasetEv(files_url='hf://datasets/xlam-tutorial-ns/xlam-ft-dataset/testing/xlam-test-single.jsonl', id='dataset-DRno5c1nwQnfmKJQAd5KQA', created_at=datetime.datetime(2025, 6, 20, 16, 55, 38, 581838), custom_fields={}, description=None, format=None, hf_endpoint=None, limit=50, name='dataset-DRno5c1nwQnfmKJQAd5KQA', namespace='default', ownership=None, project=None, schema_version='1.0', split=None, type_prefix=None, updated_at=datetime.datetime(2025, 6, 20, 16, 55, 38, 581839), version_id='main', version_tags=[]), metrics={'tool-calling-accuracy': ConfigTasksMetrics(type='tool-c

In [35]:
# Poll
job = wait_eval_job(nemo_client, ft_eval_job_id, polling_interval=5, timeout=600)

Job status: running after 5.03 seconds. Progress: 0.0%
Job status: running after 10.04 seconds. Progress: 0.0%
Job status: running after 15.06 seconds. Progress: 0.0%
Job status: running after 20.08 seconds. Progress: 0.0%
Job status: running after 25.09 seconds. Progress: 8.0%
Job status: running after 30.11 seconds. Progress: 12.0%
Job status: running after 35.13 seconds. Progress: 18.0%
Job status: running after 40.14 seconds. Progress: 26.0%
Job status: running after 45.16 seconds. Progress: 26.0%
Job status: running after 50.18 seconds. Progress: 26.0%
Job status: running after 55.19 seconds. Progress: 26.0%
Job status: running after 60.21 seconds. Progress: 26.0%
Job status: running after 65.22 seconds. Progress: 28.0%
Job status: running after 70.24 seconds. Progress: 32.0%
Job status: running after 75.26 seconds. Progress: 32.0%
Job status: running after 80.27 seconds. Progress: 38.0%
Job status: running after 85.29 seconds. Progress: 38.0%
Job status: running after 90.30 secon

### 2.2 Review Evaluation Metrics
The following code retrieves the evaluation results for the fine-tuned model evaluation job.

In [36]:
# Get evaluation results for customized model
ft_results = nemo_client.evaluation.jobs.results(job_id=ft_eval_job_id)
print(ft_results.model_dump_json(indent=2))

{
  "job": "eval-QbibneF16tyDiCJWc9oxgZ",
  "id": "evaluation_result-B2uHJZANeXGz92ME8McJj5",
  "created_at": "2025-06-20T16:55:38.609088",
  "custom_fields": {},
  "description": null,
  "files_url": null,
  "groups": {},
  "namespace": "default",
  "ownership": null,
  "project": null,
  "tasks": {
    "custom-tool-calling": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "function_name_accuracy": {
              "value": 0.92,
              "stats": {
                "count": 50,
                "max": null,
                "mean": 0.92,
                "min": null,
                "stddev": null,
                "stderr": null,
                "sum": 46.0,
                "sum_squared": null,
                "variance": null
              }
            },
            "function_name_and_args_accuracy": {
              "value": 0.72,
              "stats": {
                "count": 50,
                "max": null,
                "mean": 0.7

In [38]:
# Extract function name accuracy score
ft_scores = ft_results.tasks["custom-tool-calling"].metrics["tool-calling-accuracy"].scores
ft_function_name_accuracy_score = ft_scores["function_name_accuracy"].value
ft_function_name_and_args_accuracy = ft_scores["function_name_and_args_accuracy"].value

print(f"Custom model: function_name_accuracy: {ft_function_name_accuracy_score}")
print(f"Custom model: function_name_and_args_accuracy: {ft_function_name_and_args_accuracy}")

Custom model: function_name_accuracy: 0.92
Custom model: function_name_and_args_accuracy: 0.72


A successfully fine-tuned `meta/llama-3.2-1b-instruct` model shows a significant increase in tool-calling accuracy. In this case, you should observe roughly the following improvements:
* function_name_accuracy: 12% to 92%
* function_name_and_args_accuracy: 8% to 72%
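To quantify the change, the following small sketch computes the absolute gain in percentage points and the ratio between two accuracy values. The helper name `improvement` is hypothetical; the inputs are the scores reported in this notebook's outputs.

```python
def improvement(base: float, tuned: float) -> dict:
    """Absolute gain in percentage points and the tuned-to-base ratio."""
    return {
        "gain_points": round((tuned - base) * 100, 1),
        "ratio": round(tuned / base, 1) if base else None,
    }


print(improvement(0.12, 0.92))  # {'gain_points': 80.0, 'ratio': 7.7}
print(improvement(0.08, 0.72))  # {'gain_points': 64.0, 'ratio': 9.0}
```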

Because this evaluation ran on a limited number of samples for demonstration purposes, you may choose to increase `tasks.custom-tool-calling.dataset.limit` in your evaluation config `simple_tool_calling_eval_config` to get a more reliable estimate.
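As a rough guide to how much the sample size matters, the following sketch estimates the standard error of an accuracy score. It assumes each test record is an independent pass/fail trial (a binomial approximation), and it covers sampling error only, not run-to-run non-determinism. The function name is hypothetical.

```python
import math


def proportion_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy measured on n_samples records."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


for n in (50, 500):
    se = proportion_standard_error(0.12, n)
    print(f"n={n}: standard error is about {se * 100:.1f} percentage points")
```

Under this approximation, an accuracy of 0.12 measured on 50 records carries a standard error of about 4.6 percentage points, and evaluating on 500 records would reduce it to about 1.5.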

## (Optional) Next Steps

* You can also run the same evaluation on a base `meta/llama-3.1-70b-instruct` model for comparison.
First, deploy the corresponding NIM by following the [deployment instructions](https://build.nvidia.com/meta/llama-3_1-70b-instruct/deploy). After your NIM is deployed, set its endpoint as your evaluation target:

``` python
# Create an evaluation target
NIM_URL = "http://0.0.0.0:8000"
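EVAL_TARGET = {
    "type": "model",
    "model": {
        "api_endpoint": {
            # The chat-completion task sends `messages` and `tools`,
            # so point the target at the chat completions endpoint.
            "url": f"{NIM_URL}/v1/chat/completions",
            "model_id": "meta/llama-3.1-70b-instruct",
        }
    }
}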

# Start eval job
eval_job = nemo_client.evaluation.jobs.create(
    config=simple_tool_calling_eval_config,
    target=EVAL_TARGET
)
```

When you run the evaluation with the default config in this notebook, you should observe `meta/llama-3.1-70b-instruct` performance similar to the following:
* function_name_accuracy: 98%
* function_name_and_args_accuracy: 66%

Remarkably, a LoRA-tuned `meta/llama-3.2-1b-instruct` achieves accuracy close to that of a model 70 times its size, and even outperforms it on the combined `function_name_and_args_accuracy` score.

You can now follow the same process to fine-tune other NIM for LLMs models and compare the accuracy of each base model with that of its fine-tuned counterpart. Doing so helps you produce more accurate models for your use case.