# Custom LLM-as-a-Judge Implementation

In this notebook, we'll walk through an example of how you can use Custom LLM-as-a-Judge through the NeMo Evaluator Microservice.

Full documentation is available [here](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-custom.html#evaluation-with-llm-as-a-judge)!

In our example - we'll be looking at the following scenario:

*We have a JSONL file with medical consultation information (synthetically generated). We will use a [`build.nvidia.com`](https://build.nvidia.com/) endpoint model to generate summaries of those consultations - and then use OpenAI to judge the summaries on metrics we define ahead of time - in this case: Correctness and Completeness.*

We'll note different places you could change this example to adjust to your desired workflow along the way, as Custom LLM-as-a-Judge is a flexible evaluation!

## Necessary Configurations

You'll need to have set up the NeMo Microservices including: 

- NeMo Evaluator
- NeMo Data Store and Entity Store

If you wish to evaluate a NIM for LLMs, or use a NIM for LLMs as a judge, you will also need to provide the respective NIM for LLMs URL.

In [8]:
# (Required) NeMo Microservices URLs
NDS_URL = ""
NEMO_URL = ""
# (Optional based on use case) NeMo Microservices URLs
NIM_URL = ""

# If you have set a token for NeMo Data Store, provide it here
NDS_TOKEN = "token"

# Configure to your liking!
NMS_NAMESPACE = "custom-llm-as-a-judge-eval"
DATASET_NAME = "custom-llm-as-a-judge-eval-data"

In [5]:
print(f"Data Store endpoint: {NDS_URL}")
print(f"Entity Store, Customizer, Evaluator endpoint: {NEMO_URL}")
print(f"NIM endpoint: {NIM_URL}")
print(f"Namespace: {NMS_NAMESPACE}")

Data Store endpoint: https://nmp.int.aire.nvidia.com
Entity Store, Customizer, Evaluator endpoint: https://datastore.int.aire.nvidia.com
NIM endpoint: https://nim.int.aire.nvidia.com
Namespace: custom-llm-as-a-judge-eval
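Since the URL variables above start out as empty strings, it can help to confirm they are filled in before making any requests. The following is a minimal sketch; `missing_settings` is a hypothetical helper (not part of any NeMo client), and the URLs are placeholder values.

```python
def missing_settings(settings: dict) -> list:
    """Return the names of settings that are empty or whitespace-only."""
    return [name for name, value in settings.items() if not str(value).strip()]


# Placeholder values for illustration only; substitute your own deployment URLs.
example_settings = {
    "NDS_URL": "https://datastore.example.com",
    "NEMO_URL": "",
}
print(missing_settings(example_settings))  # ['NEMO_URL']
```

In the notebook itself, you could call `missing_settings({"NDS_URL": NDS_URL, "NEMO_URL": NEMO_URL})` and stop early if the returned list is not empty.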


## Setting Up NeMo Data Store and Entity Store

We'll first need to ensure that our namespace is created and is available both in our NeMo Entity Store and Data Store.

In [9]:
import requests

def create_namespaces(entity_host, ds_host, namespace):
    # Create namespace in Entity Store
    entity_store_url = f"{entity_host}/v1/namespaces"
    resp = requests.post(entity_store_url, json={"id": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store during namespace creation: {resp.status_code}"
    print(resp)

    # Create namespace in Data Store
    nds_url = f"{ds_host}/v1/datastore/namespaces"
    resp = requests.post(nds_url, data={"namespace": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Data Store during namespace creation: {resp.status_code}"
    print(resp)

create_namespaces(entity_host=NEMO_URL, ds_host=NDS_URL, namespace=NMS_NAMESPACE)

<Response [200]>
<Response [201]>


Now we can do a simple verification.

In [10]:
# Verify Namespace in Data Store
response = requests.get(f"{NDS_URL}/v1/datastore/namespaces/{NMS_NAMESPACE}")
print(f"Status Code: {response.status_code}\nResponse JSON: {response.json()}")

# Verify Namespace in Entity Store
response = requests.get(f"{NEMO_URL}/v1/namespaces/{NMS_NAMESPACE}")
print(f"Status Code: {response.status_code}\nResponse JSON: {response.json()}")

Status Code: 201
Response JSON: {'namespace': 'custom-llm-as-a-judge-eval-v1', 'created_at': '2025-05-08T17:19:19Z', 'updated_at': '2025-05-08T17:19:19Z'}
Status Code: 200
Response JSON: {'id': 'custom-llm-as-a-judge-eval-v1', 'created_at': '2025-05-08T17:19:19.316626', 'updated_at': '2025-05-08T17:19:19.316630', 'description': None, 'project': None, 'custom_fields': {}, 'ownership': None}


## Creating a Repository for our Data

Next, we'll want to create a repository on our NeMo Data Store!

We'll start by defining our repository ID.

In [11]:
repo_id = f"{NMS_NAMESPACE}/{DATASET_NAME}"

Next, we can use the Hugging Face Hub API to create the repository.

In [12]:
from huggingface_hub import HfApi

hf_api = HfApi(endpoint=f"{NDS_URL}/v1/hf", token="")

# Create repo
hf_api.create_repo(
    repo_id=repo_id,
    repo_type='dataset',
)



RepoUrl('datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1', endpoint='https://datastore.int.aire.nvidia.com/v1/hf', repo_type='dataset', repo_id='custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1')

We're going to upload our data to the NeMo Data Store, but before we do - let's take a look at it!

Here's an example of a row of data:

```python
{
    "ID": "C012", 
    "content": "Date: 2025-04-12\nChief Complaint (CC): ...", 
    "summary": "New Clinical Problem: ..."
}
```

As you can see, we have a `content` field with a synthetically generated medical consultation, as well as a `summary` field with an AI-generated summary.

> NOTE: In this example we won't be directly leveraging the `summary` field - but we'll cover how you would be able to leverage extra fields if they were necessary!
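Before uploading, it may be worth checking that every line of the JSONL file parses and carries the fields the evaluation will reference. The following is a minimal sketch using only the standard library; `find_bad_rows` and `REQUIRED_FIELDS` are hypothetical names, and the sample rows are made up for illustration.

```python
import json

# Assumption: these are the fields this walkthrough relies on.
REQUIRED_FIELDS = ("ID", "content")


def find_bad_rows(lines, required=REQUIRED_FIELDS):
    """Return (line_number, problem) tuples for rows that fail basic checks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [field for field in required if field not in row]
        if missing:
            problems.append((number, f"missing fields: {missing}"))
    return problems


# Made-up rows: the second one lacks the `content` field.
sample_lines = [
    '{"ID": "C012", "content": "Date: 2025-04-12 ...", "summary": "..."}',
    '{"ID": "C013"}',
]
print(find_bad_rows(sample_lines))
```

To check the real file, you could pass an open file handle instead of the sample list, for example `find_bad_rows(open("./doctor_consults_with_summaries.jsonl"))`.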

Next, let's upload our file directly to our newly created repository using the following code cell!

In [13]:
hf_api.upload_file(
    path_or_fileobj="./doctor_consults_with_summaries.jsonl",
    path_in_repo="doctor_consults_with_summaries.jsonl",
    repo_id=repo_id,
    repo_type='dataset',
)

CommitInfo(commit_url='', commit_message='Upload doctor_consults_with_summaries.jsonl with huggingface_hub', commit_description='', oid='6f73f13bc78005f7edc7306437aa61a758e0c560', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)

Next, we'll register the dataset with NeMo Entity Store!

This will allow us to leverage this dataset for evaluation jobs through the `/v1/datasets/` endpoint, which lets us refer to the dataset by its namespace and name.

In [14]:
resp = requests.post(
    url=f"{NEMO_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NMS_NAMESPACE,
        "description": "LLM As a Judge Test",
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
        "project": "custom-llm-as-a-judge-test",
    },
)
assert resp.status_code in (200, 201), f"Status Code {resp.status_code} Failed to create dataset {resp.text}"
resp.json()

{'created_at': '2025-05-08T17:19:35.030601',
 'updated_at': '2025-05-08T17:19:35.030605',
 'name': 'custom-llm-as-a-judge-eval-data-v1',
 'namespace': 'custom-llm-as-a-judge-eval-v1',
 'description': 'LLM As a Judge Test',
 'format': None,
 'files_url': 'hf://datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1',
 'hf_endpoint': None,
 'split': None,
 'limit': None,
 'id': 'dataset-CU15CaykeHJJPTKrZDJw68',
 'project': 'custom-llm-as-a-judge-test',
 'custom_fields': {}}

Now, let's verify it landed.

In [15]:
res = requests.get(url=f"{NEMO_URL}/v1/datasets/{NMS_NAMESPACE}/{DATASET_NAME}")
assert res.status_code in (200, 201), f"Status Code {res.status_code} Failed to fetch dataset {res.text}"
dataset_obj = res.json()

print("Files URL:", dataset_obj["files_url"])
assert dataset_obj["files_url"] == f"hf://datasets/{repo_id}"

Files URL: hf://datasets/custom-llm-as-a-judge-eval-v1/custom-llm-as-a-judge-eval-data-v1


## NeMo Evaluator Set-Up

In the following steps, we'll make a few assumptions:

1. You will be using an OpenAI model as the Judge LLM
2. You will be using a [`build.nvidia.com`](https://build.nvidia.com/) model to generate responses. 

Each of these models can be changed to accommodate NIM for LLMs, or any OpenAI API compatible models.

### Evaluation Configuration Set Up

In order to use both the OpenAI model, and the [`build.nvidia.com`](https://build.nvidia.com/) model, we'll need to provide our API keys for both!

> NOTE: You can find the API key on [`build.nvidia.com`](https://build.nvidia.com/) by clicking the green "Get API Key" button!

In [16]:
import getpass

OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")

In [17]:
NVDEV_API_KEY = getpass.getpass("NVDEV API Key:")

#### Judge LLM Configuration

In the following cell, we're going to set our Judge LLM configuration. The example provided is for an OpenAI model, but you could change this to point at any Judge LLM you'd like that is compatible with NeMo Evaluator.

This includes, but is not limited to:

Completion Endpoints
```python 
"api_endpoint": {
    "url": "<my-nim-deployment-base-url>/chat/completions",
    "model_id": "<my-model>"
}
```

External Endpoint
```python 
"api_endpoint": {
    "url": "<external-openai-compatible-base-url>/chat/completions",
    "model_id": "<external-model>",
    "api_key": "<my-api-key>",
    "format": "openai"        
}
```

You can check out more examples on this page of the [documentation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-targets.html).
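The two shapes above differ only in their optional keys, so one way to switch between them is a small builder that leaves out whatever is not needed. This is a sketch only: `make_api_endpoint` is a hypothetical helper, and the placeholder strings stand in for your own values.

```python
def make_api_endpoint(url, model_id, api_key=None, api_format=None):
    """Build an `api_endpoint` dict, including optional keys only when given."""
    endpoint = {"url": url, "model_id": model_id}
    if api_key:
        endpoint["api_key"] = api_key
    if api_format:
        endpoint["format"] = api_format
    return endpoint


# Endpoint that needs no key or format hint (placeholder values).
print(make_api_endpoint("<my-nim-deployment-base-url>/chat/completions", "<my-model>"))

# External OpenAI-compatible endpoint (placeholder values).
print(make_api_endpoint(
    "<external-openai-compatible-base-url>/chat/completions",
    "<external-model>",
    api_key="<my-api-key>",
    api_format="openai",
))
```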

In [18]:
judge_model_config = {
    "api_endpoint": {
        "url": "https://api.openai.com/v1/chat/completions",
        "model_id": "gpt-4.1",
        "api_key": OPENAI_API_KEY,
    }
}

Let's build a few prompt templates we can use for our Judge LLM to judge the produced summary on a few different metrics!

In [19]:
completeness_system_prompt = """
You are a judge. Rate how complete the summary is 
on a scale from 1 to 5:
1 = missing critical information … 5 = fully complete
Please respond with RATING: <number>
"""

correctness_system_prompt = """
You are a judge. Rate the summary's correctness 
(no false info) on a scale 1-5:
1 = many inaccuracies … 5 = completely accurate
Please respond with RATING: <number>
"""

Let's also set up our user prompt, which we'll use across both metrics. 

Notice that we can reference fields of each dataset row through templates such as `{{ item.content }}`. If we wanted to reference our existing summaries, we could instead use `{{ item.summary }}`!

Also notice that we can reference the generation from our target LLM with `{{ sample.output_text }}`.
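To make the placeholder behavior concrete, here is a small stand-in renderer. It is an illustration only and not the template engine NeMo Evaluator uses: `render_placeholders` is a hypothetical helper that handles simple dotted lookups, and the context values are placeholders.

```python
import re


def render_placeholders(template: str, context: dict) -> str:
    """Replace `{{ a.b }}` placeholders with values looked up in nested dicts."""
    def lookup(match):
        value = context
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


example_prompt = """
Full Consult: {{ item.content }}
Summary: {{ sample.output_text }}
"""

# `item` stands for one dataset row, `sample` for the target model's generation.
context = {
    "item": {"content": "<consultation text from the dataset row>"},
    "sample": {"output_text": "<summary generated by the target model>"},
}
print(render_placeholders(example_prompt, context))
```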

In [20]:
user_prompt = """
Full Consult: {{ item.content }}
Summary: {{ sample.output_text }}
"""

We will also need to pass a `regex` parser so we can collect the numeric scores from the Judge LLM's response. For this reason, it's important that the system prompt asks for an easily identifiable score format.

In the example system prompt above, you'll notice we used:

```python
"Please respond with RATING: <number>"
```

This allows us to use the following parser to collect our scores.

```python
"scores": { 
    "completeness": { 
        "type": "int",
        "parser": {
            "type": "regex",
            "pattern": r"RATING:\s*(\d+)"
        }
    },
}
```
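To see what this pattern captures, you can try it locally with Python's `re` module. This is a sketch under the assumption that the parser takes the first match anywhere in the response; how NeMo Evaluator applies the pattern internally is not shown here, and the judge responses below are invented.

```python
import re

RATING_PATTERN = re.compile(r"RATING:\s*(\d+)")


def extract_rating(judge_response: str):
    """Return the first rating found in the response as an int, or None."""
    match = RATING_PATTERN.search(judge_response)
    return int(match.group(1)) if match else None


print(extract_rating("The summary covers all key findings. RATING: 5"))  # 5
print(extract_rating("RATING:3"))  # 3
print(extract_rating("I would give this a four out of five."))  # None
```

The last case shows why the system prompt asks for a fixed format: a response that spells the score out in words gives the pattern nothing to capture.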

Now that we've got the atomic parts of our Custom LLM-as-a-Judge evaluation configuration in place - let's build the whole thing!

> NOTE: We're using two metrics here `correctness` and `completeness` - but you can define more (or a single metric) as you see fit!

In [21]:
llm_as_a_judge_config = {
    "type": "custom",
    "name": "doctor_consult_summary_eval",
    "tasks": {
        "consult_summary_eval": {
            "type": "chat-completion",
            "params": {
                "template": {
                    # This is where we define the prompt template that will be sent to the target LLM we are evaluating. 
                    # Notice that we can reference items in our dataset through the `{{ item.content }}` template in this prompt as well.
                    "messages": [
                        {
                            "role": "system",
                            "content": (
                                "Given a full medical consultation, please provide a 50 word summary of the consultation."
                            )
                        },
                        {
                            "role": "user",
                            "content": (
                                "Full Consult: {{ item.content }}"
                            )
                        }
                    ],
                    "max_tokens": 200
                }
            },
            "dataset": {
                "files_url": (
                    f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/"
                ),
                # This is where we can limit the number of samples we want to use for evaluation.
                "limit" : 25
            },
            "metrics": {
                "completeness": {
                    "type": "llm-judge",
                    "params": {
                        "model": judge_model_config,
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": completeness_system_prompt
                                },
                                {
                                    "role": "user",
                                    "content": user_prompt
                                }
                            ]
                        },
                        "scores": { 
                            "completeness": { 
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING:\s*(\d+)"
                                }
                            },
                        }
                    }
                },
                "correctness": {
                    "type": "llm-judge",
                    "params": {
                        "model": judge_model_config,
                        "template": {
                            "messages": [
                                {
                                    "role": "system",
                                    "content": correctness_system_prompt
                                },
                                {
                                    "role": "user",
                                    "content": user_prompt
                                }
                            ]
                        },
                        "scores": { 
                            "correctness": { 
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING:\s*(\d+)"
                                }
                            },
                        }
                    }
                }
            }
        }
    }
}
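The two metric entries above differ only in the score name and the system prompt, so if you plan to add more metrics, a small builder can cut down on repetition. This is a sketch: `make_judge_metric` is a hypothetical helper that reproduces the dictionary shape used above, and the prompts and judge settings below are placeholders.

```python
def make_judge_metric(score_name, system_prompt, user_prompt, judge_model,
                      pattern=r"RATING:\s*(\d+)"):
    """Build one `llm-judge` metric entry in the shape used in the config above."""
    return {
        "type": "llm-judge",
        "params": {
            "model": judge_model,
            "template": {
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ]
            },
            "scores": {
                score_name: {
                    "type": "int",
                    "parser": {"type": "regex", "pattern": pattern},
                }
            },
        },
    }


# Placeholder judge settings and prompts, for illustration only.
example_judge = {"api_endpoint": {"url": "<judge-url>", "model_id": "<judge-model>"}}
example_metrics = {
    name: make_judge_metric(
        name,
        f"Rate the summary's {name} from 1 to 5. Respond with RATING: <number>",
        "Summary: {{ sample.output_text }}",
        example_judge,
    )
    for name in ("completeness", "correctness")
}
print(list(example_metrics))  # ['completeness', 'correctness']
```

In the notebook, the resulting dictionary could be assigned to the `"metrics"` key of the task, passing `judge_model_config`, the system prompts, and `user_prompt` defined earlier.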

### Target Configuration

Just as with the Judge LLM, you can point the target at any compatible model. Please see the [Target documentation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-targets.html) for more examples!

We're going to be using Llama 3.1 70B as our model to be tested in this example.

In [27]:
llama_3_1_70b_target = {
    "type" : "model",
    "model" : {
        "api_endpoint": {
            "url": "https://integrate.api.nvidia.com/v1/chat/completions",
            "model_id": "meta/llama-3.1-70b-instruct",
            "api_key": NVDEV_API_KEY
        }
    }
}

## Evaluation Job and Status

At this point, we've prepared both our Evaluation Configuration and our Target configuration, so we're ready to kick off our Evaluation Job!

In [28]:
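res = requests.post(
    f"{NEMO_URL}/v1/evaluation/jobs",
    json={
        "config": llm_as_a_judge_config,
        "target": llama_3_1_70b_target
    }
)
# Fail early with the server's message if the job was not created
assert res.status_code in (200, 201), f"Status Code {res.status_code} Failed to create evaluation job {res.text}"

base_eval_job_id = res.json()["id"]

res.json()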

{'created_at': '2025-05-08T17:21:24.210032',
 'updated_at': '2025-05-08T17:21:24.210033',
 'id': 'eval-CjUYLQdriBtAA5X9KPDLAU',
 'namespace': 'default',
 'description': None,
 'target': {'schema_version': '1.0',
  'id': 'eval-target-V5VgZ54rJBeoD8w1WmzSHn',
  'description': None,
  'type_prefix': 'eval-target',
  'namespace': 'default',
  'project': None,
  'created_at': '2025-05-08T17:21:24.209441',
  'updated_at': '2025-05-08T17:21:24.209442',
  'custom_fields': {},
  'ownership': None,
  'name': 'eval-target-V5VgZ54rJBeoD8w1WmzSHn',
  'type': 'model',
  'cached_outputs': None,
  'model': {'schema_version': '1.0',
   'id': 'model-R1gn9w1tPwdfukCDoCBF2F',
   'description': None,
   'type_prefix': 'model',
   'namespace': 'default',
   'project': None,
   'created_at': '2025-05-08T17:21:24.209465',
   'updated_at': '2025-05-08T17:21:24.209465',
   'custom_fields': {},
   'ownership': None,
   'name': 'model-R1gn9w1tPwdfukCDoCBF2F',
   'version_id': 'main',
   'version_tags': [],
   'sp

We'll use the following helper function to wait for our job to be completed.

In [29]:
from time import sleep, time

def wait_eval_job(job_url: str, polling_interval: int = 10, timeout: int = 6000):
    """Poll an evaluation job until it leaves the pending/created/running states."""
    start_time = time()
    res = requests.get(job_url)
    status = res.json()["status"]
    # Initialize so the print below works even if the job never reports "running"
    progress = 0

    while status in ("pending", "created", "running"):
        # Check for timeout
        if time() - start_time > timeout:
            raise RuntimeError(f"Took more than {timeout} seconds.")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        res = requests.get(job_url)
        status = res.json()["status"]

        # Progress details are only read while the job is running
        if status == "running":
            progress = (res.json().get("status_details") or {}).get("progress", 0)
        elif status == "completed":
            progress = 100

        print(f"Job status: {status} after {time() - start_time:.2f} seconds. Progress: {progress}%")

    return res

The job itself may take ~250-300s to complete, depending on hardware, models used, and other factors. 

In [30]:
res = wait_eval_job(f"{NEMO_URL}/v1/evaluation/jobs/{base_eval_job_id}", polling_interval=5, timeout=600)

Job status: running after 6.03 seconds. Progress: 8.0%
Job status: running after 14.26 seconds. Progress: 16.0%
Job status: running after 21.37 seconds. Progress: 20.0%
Job status: running after 26.88 seconds. Progress: 28.0%
Job status: running after 32.41 seconds. Progress: 32.0%
Job status: running after 37.93 seconds. Progress: 40.0%
Job status: running after 43.44 seconds. Progress: 44.0%
Job status: running after 48.95 seconds. Progress: 48.0%
Job status: running after 56.06 seconds. Progress: 56.0%
Job status: running after 61.57 seconds. Progress: 64.0%
Job status: running after 67.08 seconds. Progress: 72.0%
Job status: running after 72.59 seconds. Progress: 72.0%
Job status: running after 78.12 seconds. Progress: 80.0%
Job status: running after 83.64 seconds. Progress: 84.0%
Job status: running after 90.75 seconds. Progress: 88.0%
Job status: running after 96.26 seconds. Progress: 96.0%
Job status: completed after 101.78 seconds. Progress: 100%


Now we can verify our job is complete!

In [31]:
print(res.json())

{'created_at': '2025-05-08T17:21:24.210032', 'updated_at': '2025-05-08T17:23:09.228092', 'id': 'eval-CjUYLQdriBtAA5X9KPDLAU', 'namespace': 'default', 'description': None, 'target': {'schema_version': '1.0', 'id': 'eval-target-V5VgZ54rJBeoD8w1WmzSHn', 'description': None, 'type_prefix': 'eval-target', 'namespace': 'default', 'project': None, 'created_at': '2025-05-08T17:21:24.209441', 'updated_at': '2025-05-08T17:21:24.209442', 'custom_fields': {}, 'ownership': None, 'name': 'eval-target-V5VgZ54rJBeoD8w1WmzSHn', 'type': 'model', 'cached_outputs': None, 'model': {'schema_version': '1.0', 'id': 'model-R1gn9w1tPwdfukCDoCBF2F', 'description': None, 'type_prefix': 'model', 'namespace': 'default', 'project': None, 'created_at': '2025-05-08T17:21:24.209465', 'updated_at': '2025-05-08T17:21:24.209465', 'custom_fields': {}, 'ownership': None, 'name': 'model-R1gn9w1tPwdfukCDoCBF2F', 'version_id': 'main', 'version_tags': [], 'spec': None, 'artifact': None, 'base_model': None, 'api_endpoint': {'url

Now that it's complete - we can look at the scores the Custom LLM-as-a-Judge evaluation produced!

In [32]:
res = requests.get(f"{NEMO_URL}/v1/evaluation/jobs/{base_eval_job_id}/results")
res.json()

{'created_at': '2025-05-08T17:21:24.281058',
 'updated_at': '2025-05-08T17:21:24.281059',
 'id': 'evaluation_result-QWzZVGtKY1Z7hdty63hLfP',
 'job': 'eval-CjUYLQdriBtAA5X9KPDLAU',
 'tasks': {'consult_summary_eval': {'metrics': {'completeness': {'scores': {'completeness': {'value': 4.92,
       'stats': {'count': 25, 'sum': 123.0, 'mean': 4.92}}}},
    'correctness': {'scores': {'correctness': {'value': 4.92,
       'stats': {'count': 25, 'sum': 123.0, 'mean': 4.92}}}}}}},
 'groups': {},
 'namespace': 'default',
 'custom_fields': {}}
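The results payload is nested by task, metric, and score, so a small helper can flatten it for quick comparison across runs. This is a sketch: `summarize_scores` is a hypothetical helper, and `example_results` is a trimmed copy of the structure shown above rather than a live response.

```python
def summarize_scores(results: dict) -> dict:
    """Flatten a results payload into {"task/metric/score": value}."""
    summary = {}
    for task_name, task in results.get("tasks", {}).items():
        for metric_name, metric in task.get("metrics", {}).items():
            for score_name, score in metric.get("scores", {}).items():
                summary[f"{task_name}/{metric_name}/{score_name}"] = score.get("value")
    return summary


# Trimmed copy of the output above, keeping only the fields this helper reads.
example_results = {
    "tasks": {
        "consult_summary_eval": {
            "metrics": {
                "completeness": {"scores": {"completeness": {"value": 4.92}}},
                "correctness": {"scores": {"correctness": {"value": 4.92}}},
            }
        }
    }
}
print(summarize_scores(example_results))
```

In the notebook, `summarize_scores(res.json())` would apply the same flattening to the live response.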