# Custom LLM-as-a-Judge Implementation

This notebook shows how to run a custom LLM-as-a-Judge evaluation with the NeMo Evaluator microservice.

Full documentation: [NeMo Evaluator Custom Evaluation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-custom.html#evaluation-with-llm-as-a-judge)

## Overview

In this example, we'll evaluate medical consultation summaries using:
- **Target Model**: A model that generates summaries (default: your deployed NIM, no API key needed)
- **Judge Model**: A model that evaluates the summaries (default: your deployed NIM, no API key needed)
  - Evaluates on two metrics:
    - **Completeness**: How well the summary captures all critical information (1-5 scale)
    - **Correctness**: How accurate the summary is without false information (1-5 scale)

**No API keys required!** The notebook is pre-configured to use your deployed NIM endpoint for both models.


## Prerequisites

- NeMo Evaluator service deployed
- NeMo Data Store service deployed
- NeMo Entity Store service deployed
- **Judge Model**: Choose one:
  - Option A: OpenAI API key (for GPT-4.1 judge) - **Optional**
  - Option B: NVIDIA API key (for NVIDIA models as judge)
  - Option C: Your deployed NIM endpoint (no API key needed)
- **Target Model**: NVIDIA API key OR your deployed NIM endpoint


In [1]:
# Set RUN_LOCALLY BEFORE importing config (for port-forward mode)
import os
if "RUN_LOCALLY" not in os.environ:
    os.environ["RUN_LOCALLY"] = "true"
    print("✅ Set RUN_LOCALLY=true (using localhost with port-forwards)")

# Install llama-stack-client from GitHub main (same as llamastack demo)
# This ensures compatibility with the latest server version
%pip install --upgrade git+https://github.com/meta-llama/llama-stack-client-python.git@main

# Install required packages
%pip install requests huggingface-hub datasets jupyterlab python-dotenv openai llama-stack-client


✅ Set RUN_LOCALLY=true (using localhost with port-forwards)
Collecting git+https://github.com/meta-llama/llama-stack-client-python.git@main
  Cloning https://github.com/meta-llama/llama-stack-client-python.git (to revision main) to /private/var/folders/54/0nyyn56s1bsd1kbwqv8fdwxr0000gn/T/pip-req-build-r6o1bsk6
  Running command git clone --filter=blob:none --quiet https://github.com/meta-llama/llama-stack-client-python.git /private/var/folders/54/0nyyn56s1bsd1kbwqv8fdwxr0000gn/T/pip-req-build-r6o1bsk6
  Resolved https://github.com/meta-llama/llama-stack-client-python.git to commit f8eb65140836de310042c914be5ec8c26e87554a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Load configuration
from config import (
    NDS_URL, ENTITY_STORE_URL, EVALUATOR_URL, NEMO_URL, LLAMASTACK_URL,
    NMS_NAMESPACE, DATASET_NAME, NDS_TOKEN,
    OPENAI_API_KEY, NVIDIA_API_KEY, RUN_LOCALLY
)

print(f"✅ Configuration loaded")
print(f"Mode: {'Local (port-forward)' if RUN_LOCALLY else 'Cluster'}")
print(f"Data Store: {NDS_URL}")
print(f"Entity Store: {ENTITY_STORE_URL}")
print(f"Evaluator: {EVALUATOR_URL}")
print(f"LlamaStack: {LLAMASTACK_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Dataset: {DATASET_NAME}")

# Quick connectivity test
import requests
try:
    r = requests.get(f"{NDS_URL}/v1/datastore/namespaces", timeout=2)
    print(f"✅ Data Store connectivity: OK")
except Exception as e:
    print(f"⚠️  Data Store connectivity: FAILED - {e}")
    if RUN_LOCALLY:
        print(f"\n📡 Port-forward setup required for local mode:")
        print(f"   Run this in a terminal:")
        print(f"   ./port-forward.sh")
        print(f"\n   Or manually:")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemodatastore-sample 8001:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemoentitystore-sample 8002:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/nemoevaluator-sample 8004:8000 &")
        print(f"   oc port-forward -n {NMS_NAMESPACE} svc/llamastack 8321:8321 &")
    else:
        print(f"   If running from outside cluster, set RUN_LOCALLY=true environment variable")
        print(f"   Or ensure you're running this notebook from within the cluster")

# Initialize LlamaStack client
try:
    from llama_stack_client import LlamaStackClient
    client = LlamaStackClient(base_url=LLAMASTACK_URL)
    # Test connectivity
    try:
        server_info = client._client.get("/")
        print(f"✅ LlamaStack connectivity: OK")
        try:
            client_version = client._client._version
            print(f"   LlamaStack client version: {client_version}")
        except Exception:
            pass  # Version lookup is best-effort
    except Exception as e:
        print(f"⚠️  LlamaStack connectivity: FAILED - {e}")
        print(f"   Make sure LlamaStack is deployed and port-forward is active: oc port-forward -n {NMS_NAMESPACE} svc/llamastack 8321:8321")
        client = None
except ImportError:
    print("⚠️  LlamaStack client not available - install with: %pip install --upgrade git+https://github.com/meta-llama/llama-stack-client-python.git@main")
    print("   Continuing without LlamaStack integration...")
    client = None
except Exception as e:
    print(f"⚠️  LlamaStack initialization failed: {e}")
    print("   Continuing without LlamaStack integration...")
    client = None


✅ Configuration loaded
Mode: Local (port-forward)
Data Store: http://localhost:8001
Entity Store: http://localhost:8002
Evaluator: http://localhost:8004
LlamaStack: http://localhost:8321
Namespace: anemo-rhoai
Dataset: custom-llm-as-a-judge-eval-data
✅ Data Store connectivity: OK


INFO:httpx:HTTP Request: GET http://localhost:8321/ "HTTP/1.1 404 Not Found"


✅ LlamaStack connectivity: OK


## Step 1: Set Up Namespaces

Create namespaces in both Entity Store and Data Store.


In [3]:
import requests

def create_namespaces(entity_host, ds_host, namespace):
    """Create namespace in both Entity Store and Data Store."""
    # Create namespace in Entity Store
    entity_store_url = f"{entity_host}/v1/namespaces"
    resp = requests.post(entity_store_url, json={"id": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store: {resp.status_code} - {resp.text}"
    print(f"✅ Entity Store namespace created/verified: {namespace}")

    # Create namespace in Data Store
    nds_url = f"{ds_host}/v1/datastore/namespaces"
    resp = requests.post(nds_url, data={"namespace": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Data Store: {resp.status_code} - {resp.text}"
    print(f"✅ Data Store namespace created/verified: {namespace}")

create_namespaces(entity_host=ENTITY_STORE_URL, ds_host=NDS_URL, namespace=NMS_NAMESPACE)


✅ Entity Store namespace created/verified: anemo-rhoai
✅ Data Store namespace created/verified: anemo-rhoai


## Step 2: Upload Dataset to Data Store

Upload the medical consultation data to the Data Store.
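
Before uploading, it can be worth confirming that every line of the JSONL file parses and carries the `content` field that the prompt templates later reference as `{{ item.content }}`. The sketch below is an optional local check, not part of the NeMo workflow; the helper name `validate_jsonl` and the sample records are hypothetical.

```python
import json
import os
import tempfile

def validate_jsonl(path, required_field="content"):
    """Return (line_number, problem) tuples; an empty list means the file looks usable."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict) or required_field not in record:
                problems.append((line_no, f"missing field '{required_field}'"))
    return problems

# Demonstration on a small temporary file with made-up records
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"content": "Patient reports a mild headache."}\n')
    tmp.write('{"text": "wrong field name"}\n')
    tmp.write('not json\n')
    sample_path = tmp.name

sample_problems = validate_jsonl(sample_path)
os.remove(sample_path)
print(sample_problems)
```

For this notebook's dataset the call would be `validate_jsonl("./data/doctor_consults_with_summaries.jsonl")`; an empty list suggests the file is ready to upload.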


In [4]:
from huggingface_hub import HfApi

repo_id = f"{NMS_NAMESPACE}/{DATASET_NAME}"
print(f"Repository ID: {repo_id}")

# Create HfApi client pointing to NeMo Data Store
hf_api = HfApi(endpoint=f"{NDS_URL}/v1/hf", token=NDS_TOKEN if NDS_TOKEN != "token" else None)

# IMPORTANT: Ensure namespace exists in Gitea before creating repository
# Data Store's Gitea backend needs the namespace directory to exist,
# otherwise it defaults to "default" namespace. Creating a temporary
# repository first ensures the namespace is created in Gitea.
temp_repo_id = f"{NMS_NAMESPACE}/.namespace-init"
try:
    # Create temporary repo to ensure namespace exists in Gitea
    hf_api.create_repo(repo_id=temp_repo_id, repo_type='dataset', exist_ok=True)
    # Delete temporary repo (namespace directory will remain)
    try:
        hf_api.delete_repo(repo_id=temp_repo_id, repo_type='dataset')
    except Exception:
        pass  # Ignore if deletion fails
    print(f"✅ Namespace '{NMS_NAMESPACE}' initialized in Gitea")
except Exception as e:
    # If temp repo creation fails, namespace might already exist - continue
    print(f"ℹ️  Namespace check: {e}")

# Create repository (now namespace should exist in Gitea)
try:
    hf_api.create_repo(repo_id=repo_id, repo_type='dataset', exist_ok=True)
    print(f"✅ Repository created: {repo_id}")
except Exception as e:
    print(f"⚠️  Repository may already exist: {e}")



Repository ID: anemo-rhoai/custom-llm-as-a-judge-eval-data
ℹ️  Namespace check: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'anemo-rhoai/.namespace-init'.
✅ Repository created: anemo-rhoai/custom-llm-as-a-judge-eval-data


In [5]:
# Upload data file
data_file = "./data/doctor_consults_with_summaries.jsonl"

try:
    hf_api.upload_file(
        path_or_fileobj=data_file,
        path_in_repo="doctor_consults_with_summaries.jsonl",
        repo_id=repo_id,
        repo_type='dataset',
    )
    print(f"✅ Data uploaded to {repo_id}")
except Exception as e:
    if "already exists" in str(e).lower() or "409" in str(e):
        print(f"ℹ️  File already exists in repository (this is OK)")
    else:
        print(f"⚠️  Upload warning: {e}")
        # Try to continue anyway - file might already be there
        print(f"   Continuing...")


✅ Data uploaded to anemo-rhoai/custom-llm-as-a-judge-eval-data


## Step 3: Register Dataset in Entity Store

Register the dataset so it can be used in evaluation jobs.


In [6]:
# Register dataset in Entity Store
resp = requests.post(
    url=f"{ENTITY_STORE_URL}/v1/datasets",
    json={
        "name": DATASET_NAME,
        "namespace": NMS_NAMESPACE,
        "description": "Medical consultation summaries for LLM-as-a-Judge evaluation",
        "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}",
        "project": "custom-llm-as-a-judge-test",
    },
)

# Handle response - 409 means dataset already exists (OK for re-running notebook)
if resp.status_code in (200, 201):
    print(f"✅ Dataset registered: {DATASET_NAME}")
elif resp.status_code == 409:
    print(f"ℹ️  Dataset already exists: {DATASET_NAME} (this is OK)")
else:
    raise Exception(f"Status Code {resp.status_code} Failed to create dataset: {resp.text}")

# Verify dataset exists
res = requests.get(url=f"{ENTITY_STORE_URL}/v1/datasets/{NMS_NAMESPACE}/{DATASET_NAME}")
if res.status_code not in (200, 201):
    raise Exception(f"Status Code {res.status_code} Failed to fetch dataset: {res.text}")

dataset_obj = res.json()
print(f"✅ Dataset verified. Files URL: {dataset_obj['files_url']}")


ℹ️  Dataset already exists: custom-llm-as-a-judge-eval-data (this is OK)
✅ Dataset verified. Files URL: hf://datasets/anemo-rhoai/custom-llm-as-a-judge-eval-data


## Step 4: Configure Judge LLM and Target Model

Set up the judge model and the target model (the model that generates summaries). Both default to your deployed NIM endpoint.


In [7]:
# Judge LLM Configuration
# Option 1: Use your deployed NIM as judge (RECOMMENDED - no API key needed)
# This uses the standard NIM service (meta-llama3-1b-instruct) on port 8000.
# NIM_URL_CLUSTER comes from config (matches the e2e-notebook pattern);
# config is reloaded so that edits to it are picked up without restarting the kernel.
import importlib
import config
importlib.reload(config)
from config import NIM_URL_CLUSTER

judge_model_config = {
    "api_endpoint": {
        "url": f"{NIM_URL_CLUSTER}/v1/chat/completions",  # Full path with revision service
        "model_id": "meta/llama-3.2-1b-instruct",  # Adjust to your deployed model
        "format": "openai"  # Specify format - may help with URL handling
    }
}

print("✅ Judge model configured: Your NIM (meta/llama-3.2-1b-instruct)")
print(f"ℹ️  Creating evaluation target with cluster URL: {NIM_URL_CLUSTER}/v1/chat/completions")
print("   (Service mesh handles authentication - no token needed)")
print("   (Evaluation jobs run inside cluster and need cluster service URL, not localhost)")
print("   If job fails, this is a known Evaluator limitation")


✅ Judge model configured: Your NIM (meta/llama-3.2-1b-instruct)
ℹ️  Creating evaluation target with cluster URL: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions
   (Service mesh handles authentication - no token needed)
   (Evaluation jobs run inside cluster and need cluster service URL, not localhost)
   If job fails, this is a known Evaluator limitation


In [8]:
# Target Model Configuration
# Option 1: Use your deployed NIM as target (RECOMMENDED - no API key needed)
# This uses the standard NIM service (meta-llama3-1b-instruct) on port 8000.
# If needed, inspect the service with: oc get svc <name>-<revision> -n <namespace>
# NIM_URL_CLUSTER comes from config (matches the e2e-notebook pattern).
import importlib
import config
importlib.reload(config)
from config import NIM_URL_CLUSTER, NMS_NAMESPACE

target_model_config = {
    "type": "model",
    "model": {
        "api_endpoint": {
            "url": f"{NIM_URL_CLUSTER}/v1/chat/completions",  # Full path with revision service
            "model_id": "meta/llama-3.2-1b-instruct",  # Adjust to your deployed model
            "format": "openai"  # Specify format - may help with URL handling
        }
    }
}

print("✅ Target model configured: Your NIM (meta/llama-3.2-1b-instruct)")
print(f"ℹ️  Creating evaluation target with cluster URL: {NIM_URL_CLUSTER}/v1/chat/completions")
print("   (Service mesh handles authentication - no token needed)")
print("   (Evaluation jobs run inside cluster and need cluster service URL, not localhost)")
print("   If job fails, try creating evaluation target first (see troubleshooting)")


✅ Target model configured: Your NIM (meta/llama-3.2-1b-instruct)
ℹ️  Creating evaluation target with cluster URL: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions
   (Service mesh handles authentication - no token needed)
   (Evaluation jobs run inside cluster and need cluster service URL, not localhost)
   If job fails, try creating evaluation target first (see troubleshooting)


## Step 5: Define Evaluation Prompts

Create prompts for the judge to evaluate completeness and correctness.


In [9]:
# System prompts for judge evaluation
completeness_system_prompt = """
You are a judge. Rate how complete the summary is 
on a scale from 1 to 5:
1 = missing critical information … 5 = fully complete
Please respond with RATING: <number>
"""

correctness_system_prompt = """
You are a judge. Rate the summary's correctness 
(no false info) on a scale 1-5:
1 = many inaccuracies … 5 = completely accurate
Please respond with RATING: <number>
"""

# User prompt template (references dataset item and model output)
user_prompt = """
Full Consult: {{ item.content }}
Summary: {{ sample.output_text }}
"""

print("✅ Evaluation prompts defined")


✅ Evaluation prompts defined


## Step 6: Create Evaluation Configuration

Build the custom LLM-as-a-Judge evaluation configuration.


In [10]:
llm_as_a_judge_config = {
    "type": "custom",
    "name": "doctor_consult_summary_eval",
    "tasks": {
        "consult_summary_eval": {
            "type": "chat-completion",
            "params": {
                "template": {
                    # Prompt sent to target LLM to generate summary
                    "messages": [
                        {
                            "role": "system",
                            "content": "Given a full medical consultation, please provide a 50 word summary of the consultation."
                        },
                        {
                            "role": "user",
                            "content": "Full Consult: {{ item.content }}"
                        }
                    ],
                    "max_tokens": 200
                }
            },
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/",
                "limit": 5  # Reduced for quick test - increase for full evaluation
            },
            "metrics": {
                "completeness": {
                    "type": "llm-judge",
                    "params": {
                        "model": judge_model_config,
                        "template": {
                            "messages": [
                                {"role": "system", "content": completeness_system_prompt},
                                {"role": "user", "content": user_prompt}
                            ]
                        },
                        "scores": {
                            "completeness": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING:\s*(\d+)"
                                }
                            }
                        }
                    }
                },
                "correctness": {
                    "type": "llm-judge",
                    "params": {
                        "model": judge_model_config,
                        "template": {
                            "messages": [
                                {"role": "system", "content": correctness_system_prompt},
                                {"role": "user", "content": user_prompt}
                            ]
                        },
                        "scores": {
                            "correctness": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING:\s*(\d+)"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

print("✅ Evaluation configuration created")
print(f"   - Type: custom")
print(f"   - Metrics: completeness, correctness")
print(f"   - Sample limit: 5 (for quick test)")


✅ Evaluation configuration created
   - Type: custom
   - Metrics: completeness, correctness
   - Sample limit: 5 (for quick test)
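
The `regex` score parser in the configuration above captures the integer that follows `RATING:` in the judge's reply. How the Evaluator applies the pattern internally is not shown in this notebook; the sketch below only exercises the same pattern with Python's `re` module to illustrate which judge replies would yield a score. The helper `parse_rating` and its 1-5 range check are illustrative assumptions.

```python
import re

RATING_PATTERN = re.compile(r"RATING:\s*(\d+)")  # same pattern as in the config

def parse_rating(judge_reply, low=1, high=5):
    """Return the integer rating, or None if the reply has no in-range rating."""
    match = RATING_PATTERN.search(judge_reply)
    if match is None:
        return None
    value = int(match.group(1))
    return value if low <= value <= high else None

replies = [
    "RATING: 4",
    "The summary omits the medication list.\nRATING:2",
    "I would rate this a 5 out of 5.",  # no 'RATING:' prefix, so no match
    "RATING: 9",                         # matches the pattern but is off the 1-5 scale
]
parsed = [parse_rating(reply) for reply in replies]
print(parsed)  # [4, 2, None, None]
```

Small judge models do not always follow the requested output format, so it may be worth spot-checking raw judge replies if scores come back empty.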


## Step 6.5: Verify NIM Connectivity (Optional Diagnostic)

Before submitting the job, you can verify the NIM endpoint is correctly configured.
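
One way to check the endpoint from a pod inside the cluster is to send a minimal OpenAI-style chat completion request. The sketch below uses only the standard library; the helper names are hypothetical, and the request shape assumes the NIM exposes the OpenAI-compatible `/v1/chat/completions` route configured earlier. The network call is wrapped in a function and is not invoked here, because the cluster URL is not reachable from a local machine.

```python
import json
import urllib.request

def build_chat_request(model_id, prompt, max_tokens=16):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(url, body, timeout=10):
    """POST the body as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

smoke_test_body = build_chat_request("meta/llama-3.2-1b-instruct", "Reply with the word OK.")
print(json.dumps(smoke_test_body, indent=2))

# Inside the cluster (assumption: the URL from judge_model_config is reachable):
# reply = send_chat_request(judge_model_config["api_endpoint"]["url"], smoke_test_body)
```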


In [11]:
# Diagnostic: Print the URLs that will be used
from config import STANDARD_NIM_SERVICE, NMS_NAMESPACE, NIM_URL_CLUSTER

print("📋 Configuration Summary:")
print(f"   Target Model URL: {target_model_config['model']['api_endpoint']['url']}")
print(f"   Judge Model URL: {judge_model_config['api_endpoint']['url']}")
print(f"\n💡 Note: These URLs must be accessible from within the cluster.")
print(f"   If the job fails with connection errors, verify:")
print(f"   1. InferenceService exists: oc get inferenceservice {STANDARD_NIM_SERVICE} -n {NMS_NAMESPACE}")
print(f"   2. Service name matches: {STANDARD_NIM_SERVICE}")
print(f"   3. Using HTTP endpoint: {NIM_URL_CLUSTER}")
print(f"   4. Service account token is set (check .env file)")
print(f"   5. Test connectivity: oc run test --image=curlimages/curl --rm -i --restart=Never -n {NMS_NAMESPACE} -- curl -k {NIM_URL_CLUSTER}/v1/models")


📋 Configuration Summary:
   Target Model URL: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions
   Judge Model URL: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions

💡 Note: These URLs must be accessible from within the cluster.
   If the job fails with connection errors, verify:
   1. InferenceService exists: oc get inferenceservice meta-llama3-1b-instruct -n anemo-rhoai
   2. Service name matches: meta-llama3-1b-instruct
   3. Using HTTP endpoint: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000
   4. Service account token is set (check .env file)
   5. Test connectivity: oc run test --image=curlimages/curl --rm -i --restart=Never -n anemo-rhoai -- curl -k http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/models


## Step 6.6: Create Evaluation Targets

**NOTE**: Creating evaluation targets first is a best practice and follows the e2e notebook pattern. However, there is a **known limitation** in NeMo Evaluator v25.06 where it strips `/chat/completions` from URLs during job execution, even when targets are created correctly.

**Current Status**: 
- ✅ Target creation works (URLs stored correctly)
- ✅ Job submission works
- ❌ Job execution fails (URL stripped to `/v1` instead of `/v1/chat/completions`)

**This is an Evaluator limitation** - the notebook is configured correctly, but the Evaluator's internal URL handling needs to be fixed in a future version.


In [12]:
# Create evaluation target for judge model
# A named target lets later requests refer to the model by name instead of repeating the endpoint config
import requests
from config import EVALUATOR_URL, NMS_NAMESPACE

headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

# Delete existing target if it exists (for clean re-runs)
judge_target_name = "meta-llama3-1b-instruct-judge"
try:
    res = requests.delete(f"{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/{judge_target_name}")
    if res.status_code in (200, 404):
        print(f"✅ Cleaned up existing judge target (if any)")
except Exception:
    pass  # Best-effort cleanup; ignore failures

# Create judge model target
judge_target_data = {
    "type": "model",
    "name": judge_target_name,
    "namespace": NMS_NAMESPACE,
    "model": {
        "api_endpoint": {
            "url": f"{NIM_URL_CLUSTER}/v1/chat/completions",
            "model_id": "meta/llama-3.2-1b-instruct",
            "format": "openai"  # Specify format
        }
    }
}

print(f"Creating judge model evaluation target: {judge_target_name}")
res = requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", headers=headers, json=judge_target_data)

if res.status_code not in (200, 201):
    print(f"⚠️  Warning: Could not create judge target: {res.status_code}")
    print(f"Response: {res.text[:200]}")
    # Continue anyway - might already exist
else:
    judge_target_response = res.json()
    print(f"✅ Judge target created: {judge_target_response.get('name')}")

# Store the judge target reference (namespace/name).
# judge_model_config itself is not modified here.
judge_target_ref = f"{NMS_NAMESPACE}/{judge_target_name}"
print(f"\n💡 Judge target reference: {judge_target_ref}")
print("   (Will use this in evaluation config)")


✅ Cleaned up existing judge target (if any)
Creating judge model evaluation target: meta-llama3-1b-instruct-judge
✅ Judge target created: meta-llama3-1b-instruct-judge

💡 Judge target reference: anemo-rhoai/meta-llama3-1b-instruct-judge
   (Will use this in evaluation config)


In [13]:
# Create evaluation target for target model
# Delete existing target if it exists (for clean re-runs)
target_target_name = "meta-llama3-1b-instruct-target"
try:
    res = requests.delete(f"{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/{target_target_name}")
    if res.status_code in (200, 404):
        print(f"✅ Cleaned up existing target model target (if any)")
except Exception:
    pass  # Best-effort cleanup; ignore failures

# Create target model evaluation target
target_target_data = {
    "type": "model",
    "name": target_target_name,
    "namespace": NMS_NAMESPACE,
    "model": {
        "api_endpoint": {
            "url": f"{NIM_URL_CLUSTER}/v1/chat/completions",
            "model_id": "meta/llama-3.2-1b-instruct",
            "format": "openai"  # Specify format
        }
    }
}

print(f"Creating target model evaluation target: {target_target_name}")
res = requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", headers=headers, json=target_target_data)

if res.status_code not in (200, 201):
    print(f"⚠️  Warning: Could not create target model target: {res.status_code}")
    print(f"Response: {res.text[:200]}")
    # Continue anyway - might already exist
else:
    target_target_response = res.json()
    print(f"✅ Target model target created: {target_target_response.get('name')}")

# Store target reference (namespace/name) as an alternative to the inline config used in Step 7
target_model_ref = f"{NMS_NAMESPACE}/{target_target_name}"
print(f"\n💡 Target model reference: {target_model_ref}")
print("   (Will use this in job submission)")


✅ Cleaned up existing target model target (if any)
Creating target model evaluation target: meta-llama3-1b-instruct-target
✅ Target model target created: meta-llama3-1b-instruct-target

💡 Target model reference: anemo-rhoai/meta-llama3-1b-instruct-target
   (Will use this in job submission)


## Step 7: Submit Evaluation Job

Submit the evaluation job to NeMo Evaluator.


In [14]:
# Submit evaluation job
# IMPORTANT: Use inline target config (like original notebook) to avoid Data Store validation issues
# Using inline config bypasses evaluation target lookup which triggers Data Store dataset validation
try:
    # Use inline target config (matches original notebook approach)
    job_payload = {
        "config": llm_as_a_judge_config,
        "target": target_model_config  # Use inline config object, not target reference
    }
    
    print("📤 Submitting evaluation job with inline target config...")
    print(f"   Target: {target_model_config['model']['api_endpoint']['url']}")
    
    res = requests.post(
        f"{EVALUATOR_URL}/v1/evaluation/jobs",
        json=job_payload,
        timeout=30
    )

    if res.status_code not in (200, 201):
        print(f"❌ Failed to submit job: {res.status_code}")
        print(f"Response: {res.text}")
        raise Exception(f"Job submission failed: {res.status_code} - {res.text}")

    job_data = res.json()
    base_eval_job_id = job_data["id"]
    print(f"✅ Evaluation job submitted")
    print(f"   Job ID: {base_eval_job_id}")
    print(f"   Status: {job_data.get('status', 'unknown')}")
    
except requests.exceptions.RequestException as e:
    print(f"❌ Network error submitting job: {e}")
    print(f"   Check that Evaluator is accessible at: {EVALUATOR_URL}")
    raise
except Exception as e:
    print(f"❌ Error submitting job: {e}")
    raise


📤 Submitting evaluation job with inline target config...
   Target: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions
✅ Evaluation job submitted
   Job ID: eval-VgyjqL9ciYXZ9XCVGGLmfe
   Status: created


In [15]:
# Troubleshooting: Verify NIM service accessibility
print("🔍 Troubleshooting NIM Connection:")
print(f"   1. Service exists: oc get svc {STANDARD_NIM_SERVICE} -n {NMS_NAMESPACE}")
print(f"   2. Service has endpoints: oc get endpoints {STANDARD_NIM_SERVICE} -n {NMS_NAMESPACE}")
print(f"   3. Pod is running: oc get pod -n {NMS_NAMESPACE} | grep {STANDARD_NIM_SERVICE}")
print(f"   4. Test from evaluator pod:")
print(f"      oc exec -n {NMS_NAMESPACE} <evaluator-pod> -- curl -s {NIM_URL_CLUSTER}/v1/models")
print(f"\n   Expected URL format: {NIM_URL_CLUSTER}/v1/chat/completions")
print(f"   (HTTPS uses port 443 by default - no need to specify port)")
print(f"   ⚠️  Service account token is NOT needed (service mesh handles auth)")


🔍 Troubleshooting NIM Connection:
   1. Service exists: oc get svc meta-llama3-1b-instruct -n anemo-rhoai
   2. Service has endpoints: oc get endpoints meta-llama3-1b-instruct -n anemo-rhoai
   3. Pod is running: oc get pod -n anemo-rhoai | grep meta-llama3-1b-instruct
   4. Test from evaluator pod:
      oc exec -n anemo-rhoai <evaluator-pod> -- curl -s http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/models

   Expected URL format: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions
   (HTTPS uses port 443 by default - no need to specify port)
   ⚠️  Service account token is NOT needed (service mesh handles auth)


## Step 8: Wait for Job Completion

Monitor the evaluation job until it completes.


In [16]:
from time import sleep, time

def wait_eval_job(job_url: str, polling_interval: int = 10, timeout: int = 600):
    """Poll an evaluation job until it reaches a terminal state, with error handling."""
    start_time = time()
    
    try:
        res = requests.get(job_url, timeout=10)
        if res.status_code != 200:
            raise Exception(f"Failed to get job status: {res.status_code} - {res.text}")
    except requests.exceptions.RequestException as e:
        raise Exception(f"Network error getting job status: {e}")
    
    job_data = res.json()
    status = job_data["status"]
    print(f"Initial status: {status}")
    
    # Check for immediate terminal states
    if status == "failed":
        print(f"❌ Job failed immediately!")
        status_details = job_data.get('status_details', {})
        error_msg = status_details.get('message', 'Unknown error')
        print(f"Error: {error_msg}")
        return res
    elif status == "completed":
        print(f"✅ Job completed immediately!")
        return res

    # Poll for status updates
    while status in ["pending", "created", "running"]:
        # Check for timeout
        elapsed = time() - start_time
        if elapsed > timeout:
            raise RuntimeError(f"Job took more than {timeout} seconds (timed out).")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        try:
            res = requests.get(job_url, timeout=10)
            if res.status_code != 200:
                print(f"⚠️  Failed to get status: {res.status_code} - {res.text}")
                sleep(polling_interval)  # Wait before retrying
                continue
        except requests.exceptions.RequestException as e:
            print(f"⚠️  Network error getting status: {e} - retrying...")
            sleep(polling_interval)
            continue
            
        job_data = res.json()
        status = job_data["status"]
        elapsed = time() - start_time

        # Handle terminal states immediately
        if status == "failed":
            print(f"\n❌ Job failed after {elapsed:.1f}s")
            status_details = job_data.get('status_details', {})
            error_msg = status_details.get('message', 'Unknown error')
            print(f"Error: {error_msg}")
            
            # Print task status if available
            task_status = status_details.get('task_status', {})
            if task_status:
                print(f"\nTask status details:")
                for task_name, task_info in task_status.items():
                    print(f"  - {task_name}: {task_info}")
            return res
        elif status == "completed":
            progress = 100
            print(f"✅ Status: {status} | Progress: {progress}% | Elapsed: {elapsed:.1f}s")
            return res
        elif status == "running":
            progress = job_data.get("status_details", {}).get("progress", 0)
            print(f"⏳ Status: {status} | Progress: {progress}% | Elapsed: {elapsed:.1f}s")
        else:
            # Unknown status - log and continue
            print(f"⚠️  Status: {status} | Elapsed: {elapsed:.1f}s")

    # If we exit the loop, status should be terminal, but check anyway
    if status not in ["completed", "failed"]:
        print(f"⚠️  Unexpected final status: {status}")
        print(f"   Full job data: {job_data}")

    return res

print("⏳ Waiting for evaluation job to complete...")
try:
    res = wait_eval_job(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}", polling_interval=5, timeout=600)
except Exception as e:
    print(f"❌ Error waiting for job: {e}")
    raise


⏳ Waiting for evaluation job to complete...
Initial status: running
⏳ Status: running | Progress: 60.0% | Elapsed: 5.2s
✅ Status: completed | Progress: 100% | Elapsed: 10.3s


In [17]:
# Check final status (this cell provides additional details if needed)
try:
    job_data = res.json()
    final_status = job_data["status"]
    
    if final_status == "completed":
        print(f"✅ Job completed successfully!")
        print(f"   You can now view results in the next cell.")
    elif final_status == "failed":
        print(f"\n❌ Job failed - Summary:")
        status_details = job_data.get('status_details', {})
        error_msg = status_details.get('message', 'Unknown error')
        
        # Extract key error information
        if "Error connecting to inference server" in error_msg:
            print(f"   Issue: Cannot connect to NIM endpoint")
            print(f"   Check: Is the NIM service running and accessible from cluster?")
            print(f"   URL used: Check the target/judge model configuration")
        
        print(f"\n   Full error message:")
        print(f"   {error_msg[:500]}{'...' if len(error_msg) > 500 else ''}")  # Truncate very long errors; add "..." only when truncated
        
        # Print task status if available
        task_status = status_details.get('task_status', {})
        if task_status:
            print(f"\n   Task status details:")
            for task_name, task_info in task_status.items():
                print(f"     - {task_name}: {task_info}")
    else:
        print(f"⚠️  Job status: {final_status}")
        print(f"   Full response: {job_data}")
except Exception as e:
    print(f"⚠️  Error parsing job status: {e}")
    print(f"   Raw response: {res.text if hasattr(res, 'text') else res}")


✅ Job completed successfully!
   You can now view results in the next cell.
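
The status handling in the two cells above could also be factored into a pure function, which makes it possible to check the branching without a running Evaluator. The following is a minimal sketch, assuming the payload shape used in this notebook (`status`, `status_details.progress`, `status_details.message`). `describe_job` and `TERMINAL_STATES` are hypothetical names introduced here for illustration; they are not part of the NeMo Evaluator API.

```python
from typing import Any, Dict, Tuple

# Assumption: these are the only terminal states, as handled in the cells above.
TERMINAL_STATES = {"completed", "failed"}


def describe_job(job_data: Dict[str, Any]) -> Tuple[bool, str]:
    """Return (is_terminal, one-line summary) for a job status payload.

    Hypothetical helper; assumes the payload shape used in this notebook.
    """
    status = job_data.get("status", "unknown")
    details = job_data.get("status_details") or {}

    if status == "failed":
        summary = f"failed: {details.get('message', 'Unknown error')}"
    elif status == "completed":
        summary = "completed: 100%"
    elif status == "running":
        summary = f"running: {details.get('progress', 0)}%"
    else:
        # Any other status is reported as-is and treated as non-terminal.
        summary = f"unrecognized status: {status}"

    return status in TERMINAL_STATES, summary


# Usage with a hand-built payload (no network call involved).
print(describe_job({"status": "running", "status_details": {"progress": 60.0}}))
```

A polling loop can then stop as soon as the first element of the returned tuple is `True`, and print the second element on each iteration.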


## Step 9: View Results

Retrieve and display the evaluation results.


In [18]:
# Get results
try:
    res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}/results", timeout=30)
    
    if res.status_code == 200:
        results = res.json()
        
        # Extract metrics
        tasks = results.get("tasks", {})
        if not tasks:
            print("⚠️  No tasks found in results")
            print(f"   Full response: {results}")
        else:
            for task_name, task_data in tasks.items():
                print(f"\n📊 Task: {task_name}")
                metrics = task_data.get("metrics", {})
                if not metrics:
                    print(f"   ⚠️  No metrics found for this task")
                else:
                    for metric_name, metric_data in metrics.items():
                        scores = metric_data.get("scores", {})
                        if not scores:
                            print(f"   ⚠️  No scores found for metric: {metric_name}")
                        else:
                            for score_name, score_data in scores.items():
                                value = score_data.get("value", "N/A")
                                stats = score_data.get("stats", {})
                                mean = stats.get("mean", "N/A")
                                count = stats.get("count", "N/A")
                                print(f"   {score_name}: {value} (mean: {mean}, count: {count})")
        
        print(f"\n✅ Results retrieved successfully!")
    elif res.status_code == 404:
        print(f"⚠️  Results not yet available (404)")
        print(f"   Job may still be processing. Wait a moment and try again.")
    else:
        print(f"❌ Failed to get results: {res.status_code}")
        print(f"   Response: {res.text}")
        
except requests.exceptions.RequestException as e:
    print(f"❌ Network error getting results: {e}")
    print(f"   Check that Evaluator is accessible at: {EVALUATOR_URL}")
except Exception as e:
    print(f"❌ Error getting results: {e}")
    raise



📊 Task: consult_summary_eval
   completeness: 4.0 (mean: 4.0, count: 5)
   correctness: 1.6 (mean: 1.6, count: 5)

✅ Results retrieved successfully!
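
For further analysis it can be convenient to flatten the nested results into one row per score. The sketch below assumes the same `tasks` → `metrics` → `scores` nesting that the cell above traverses. `flatten_scores` is a hypothetical helper, and the sample payload is hand-built to mirror the printed output; the metric key `judge_metric` is a placeholder, not a name returned by the service.

```python
from typing import Any, Dict, List


def flatten_scores(results: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten tasks -> metrics -> scores into a list of flat rows.

    Hypothetical helper; missing levels simply yield no rows.
    """
    rows: List[Dict[str, Any]] = []
    for task_name, task_data in results.get("tasks", {}).items():
        for metric_name, metric_data in task_data.get("metrics", {}).items():
            for score_name, score_data in metric_data.get("scores", {}).items():
                stats = score_data.get("stats", {})
                rows.append({
                    "task": task_name,
                    "metric": metric_name,
                    "score": score_name,
                    "value": score_data.get("value"),
                    "mean": stats.get("mean"),
                    "count": stats.get("count"),
                })
    return rows


# Hand-built payload mirroring the output above; "judge_metric" is a placeholder key.
sample = {
    "tasks": {
        "consult_summary_eval": {
            "metrics": {
                "judge_metric": {
                    "scores": {
                        "completeness": {"value": 4.0, "stats": {"mean": 4.0, "count": 5}},
                        "correctness": {"value": 1.6, "stats": {"mean": 1.6, "count": 5}},
                    }
                }
            }
        }
    }
}

for row in flatten_scores(sample):
    print(row)
```

In the notebook, `flatten_scores(results)` could be called on the parsed response from the results endpoint, and the rows passed to a table library of your choice.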
