# Custom LLM-as-a-Judge Implementation

This notebook demonstrates how to leverage Custom LLM-as-a-Judge through NeMo Evaluator Microservice.

Full documentation: [NeMo Evaluator Custom Evaluation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-custom.html#evaluation-with-llm-as-a-judge)

## üîí Security Setup (REQUIRED FIRST STEP)

**IMPORTANT**: This notebook uses `env.donotcommit` file for sensitive configuration (tokens, API keys). 

**Before running this notebook:**
1. Copy the template: `cp env.donotcommit.example env.donotcommit`
2. Edit `env.donotcommit` and add your `NIM_SERVICE_ACCOUNT_TOKEN`
3. The `env.donotcommit` file is git-ignored and will NOT be committed to version control

**Get your service account token:**
```bash
oc get secret <service-account-name> -n <namespace> -o jsonpath='{.data.token}' | base64 -d
```

## Overview

In this example, we'll evaluate medical consultation summaries using:
- **Target Model**: NIM Model Serving (meta/llama-3.2-1b-instruct) - generates summaries
- **Judge Model**: NIM Model Serving (meta/llama-3.2-1b-instruct) - evaluates the summaries
  - Evaluates on two metrics:
    - **Completeness**: How well the summary captures all critical information (1-5 scale)
    - **Correctness**: How accurate the summary is without false information (1-5 scale)

**Configuration**: The notebook is pre-configured to use NIM Model Serving (your configured InferenceService) for both models. All configuration is loaded from the `env.donotcommit` file (see Security Setup above).

**‚ö†Ô∏è WORKAROUND**: This notebook uses NIM Model Serving (Knative/KServe) with an **external URL** to work around a known Evaluator bug. Evaluator v25.06/v25.08 strips `/chat/completions` from cluster-internal Knative service URLs, but using an external URL (HTTPS) may bypass this issue. Configure your external URL in `env.donotcommit` via `NIM_MODEL_SERVING_URL_EXTERNAL`.


In [None]:
# ============================================================================
# CONFIGURATION: Load Environment Variables from env.donotcommit file
# ============================================================================
# üîí SECURITY: Never hardcode secrets in notebooks!
# All sensitive values (tokens, API keys) should be in env.donotcommit file
# 
# SETUP INSTRUCTIONS:
# 1. Copy env.donotcommit.example to env.donotcommit: cp env.donotcommit.example env.donotcommit
# 2. Edit env.donotcommit and fill in your values (especially NIM_SERVICE_ACCOUNT_TOKEN)
# 3. env.donotcommit is git-ignored and will NOT be committed to version control
#
# IMPORTANT: Run this cell FIRST before importing config!
# If you get connection errors, restart the kernel and run cells in order.
import os
import sys
from pathlib import Path

# Load env.donotcommit file from the notebook directory
try:
    from dotenv import load_dotenv
    # Find env.donotcommit file in the same directory as this notebook
    notebook_dir = Path().resolve()  # Current working directory (where notebook is run from)
    env_file = notebook_dir / "env.donotcommit"
    
    if env_file.exists():
        load_dotenv(env_file, override=False)  # override=False: don't overwrite existing env vars
        print(f"‚úÖ Loaded env.donotcommit file from: {env_file}")
    else:
        print(f"‚ö†Ô∏è  env.donotcommit file not found at: {env_file}")
        print(f"   Looking for env.donotcommit.example template...")
        # Check if env.donotcommit.example exists
        env_example = notebook_dir / "env.donotcommit.example"
        if env_example.exists():
            print(f"   ‚ÑπÔ∏è  env.donotcommit.example exists at: {env_example}")
            print(f"   üìù Please copy it to env.donotcommit and fill in your values:")
            print(f"      cp env.donotcommit.example env.donotcommit")
            print(f"      # Then edit env.donotcommit and add your NIM_SERVICE_ACCOUNT_TOKEN")
        else:
            print(f"   ‚ö†Ô∏è  env.donotcommit.example not found - creating template...")
            env_example_content = """# NeMo Microservices Configuration
# Copy this file to env.donotcommit and fill in your values
# env.donotcommit is git-ignored and will NOT be committed

# REQUIRED: Namespace for cluster services
# Replace with your actual OpenShift namespace/project name
# Find your namespace: oc projects
NMS_NAMESPACE=your-namespace



# REQUIRED: NIM Model Serving Configuration
# Replace with your actual InferenceService name
# Find your service: oc get inferenceservice -n <your-namespace>
NIM_MODEL_SERVING_SERVICE=your-inferenceservice-name
NIM_MODEL_SERVING_MODEL=meta/llama-3.2-1b-instruct

# REQUIRED: External URL for NIM Model Serving (HTTPS)
# This is the external URL from the InferenceService status
# Find your URL: oc get inferenceservice <name> -n <namespace> -o jsonpath='{.status.url}'
# Format: https://<service-name>-<namespace>.apps.<cluster-domain>
NIM_MODEL_SERVING_URL_EXTERNAL=https://your-service-name-your-namespace.apps.your-cluster-domain.com

# Configuration flags
USE_NIM_MODEL_SERVING=true
USE_EXTERNAL_URL=true

# OPTIONAL: Dataset name for evaluation data
DATASET_NAME=custom-llm-as-a-judge-eval-data

# REQUIRED: NIM Service Account Token (for LlamaStack and fallback)
# Kubernetes service account token (JWT) for authenticating with KServe InferenceService
# Get your token: oc create token <service-account-name> -n <your-namespace> --duration=8760h
# Example: oc create token my-model-sa -n my-namespace --duration=8760h
# The service account name is typically: <inferenceservice-name>-sa
# Find your service account: oc get sa -n <your-namespace> | grep model
NIM_SERVICE_ACCOUNT_TOKEN=

# OPTIONAL: API Keys (only needed if using external APIs)
# OPENAI_API_KEY=
# NVIDIA_API_KEY=
# HF_TOKEN=
"""
            env_example.write_text(env_example_content)
            print(f"   ‚úÖ Created env.donotcommit.example template at: {env_example}")
            print(f"   üìù Please copy it to env.donotcommit and fill in your values:")
            print(f"      cp env.donotcommit.example env.donotcommit")
            print(f"      # Then edit env.donotcommit and add your NIM_SERVICE_ACCOUNT_TOKEN")
except ImportError:
    print("‚ö†Ô∏è  python-dotenv not installed - install with: pip install python-dotenv")
    print("   Will use system environment variables only (not recommended)")

# Clear any cached config module to force reload
if 'config' in sys.modules:
    del sys.modules['config']
    print("‚ö†Ô∏è  Cleared cached config module - will reload with new env vars")

# Set defaults (will be overridden by env.donotcommit file if present)
# These are fallback values - prefer setting them in env.donotcommit file
# Note: These defaults are examples - users should set their own values in env.donotcommit
os.environ.setdefault("NMS_NAMESPACE", "anemo-rhoai")
os.environ.setdefault("NIM_MODEL_SERVING_SERVICE", "anemo-rhoai-model")
os.environ.setdefault("NIM_MODEL_SERVING_MODEL", "meta/llama-3.2-1b-instruct")
os.environ.setdefault("NIM_MODEL_SERVING_URL_EXTERNAL", "https://anemo-rhoai-model-anemo-rhoai.apps.ai-dev05.kni.syseng.devcluster.openshift.com")
os.environ.setdefault("USE_NIM_MODEL_SERVING", "true")
os.environ.setdefault("USE_EXTERNAL_URL", "true")
os.environ.setdefault("DATASET_NAME", "custom-llm-as-a-judge-eval-data")
# NIM_SERVICE_ACCOUNT_TOKEN should come from env.donotcommit file, not hardcoded here

# Validate required token
if not os.environ.get("NIM_SERVICE_ACCOUNT_TOKEN"):
    print("\n‚ùå ERROR: NIM_SERVICE_ACCOUNT_TOKEN is not set!")
    print("   Please set it in your env.donotcommit file:")
    print("   1. Copy env.donotcommit.example to env.donotcommit: cp env.donotcommit.example env.donotcommit")
    print("   2. Edit env.donotcommit and add: NIM_SERVICE_ACCOUNT_TOKEN=your-token-here")
    print("   3. Get token from service account:")
    print("      oc get secret <service-account-name> -n <namespace> -o jsonpath='{.data.token}' | base64 -d")
    raise ValueError("NIM_SERVICE_ACCOUNT_TOKEN is required but not set in env.donotcommit file!")

print("\n‚úÖ Environment variables loaded")
print(f"   NMS_NAMESPACE: {os.environ.get('NMS_NAMESPACE')}")
print(f"   Mode: Cluster (Workbench/Notebook)")
print(f"   NIM Model Serving: {os.environ.get('NIM_MODEL_SERVING_SERVICE')}")
print(f"   External URL: {os.environ.get('NIM_MODEL_SERVING_URL_EXTERNAL')}")
print(f"   Using external URL: {os.environ.get('USE_EXTERNAL_URL')} (workaround for Evaluator bug)")
token_set = "‚úÖ Set" if os.environ.get('NIM_SERVICE_ACCOUNT_TOKEN') else "‚ùå Not set"
print(f"   Service Account Token: {token_set}")
print(f"\nüí° If you see connection errors, restart the kernel and run cells in order!")

# ============================================================================
# Install Required Packages
# ============================================================================
# Install llama-stack-client from GitHub main (same as llamastack demo)
# This ensures compatibility with the latest server version
%pip install --upgrade git+https://github.com/meta-llama/llama-stack-client-python.git@main

# Install required packages
# Note: Dependency conflicts with feast package are expected and can be ignored
# The notebook doesn't use feast, so the conflicts won't affect functionality
%pip install requests huggingface-hub datasets jupyterlab python-dotenv openai llama-stack-client

# Suppress dependency conflict warnings (these are from feast, which we don't use)
import warnings
warnings.filterwarnings('ignore', message='.*dependency conflicts.*')
print("‚úÖ Packages installed (dependency warnings from feast can be ignored)")


‚úÖ Set RUN_LOCALLY=true (using localhost with port-forwards)
Collecting git+https://github.com/meta-llama/llama-stack-client-python.git@main
  Cloning https://github.com/meta-llama/llama-stack-client-python.git (to revision main) to /private/var/folders/54/0nyyn56s1bsd1kbwqv8fdwxr0000gn/T/pip-req-build-r6o1bsk6
  Running command git clone --filter=blob:none --quiet https://github.com/meta-llama/llama-stack-client-python.git /private/var/folders/54/0nyyn56s1bsd1kbwqv8fdwxr0000gn/T/pip-req-build-r6o1bsk6
  Resolved https://github.com/meta-llama/llama-stack-client-python.git to commit f8eb65140836de310042c914be5ec8c26e87554a
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Load configuration
# IMPORTANT: Reload config module to pick up environment variables set in previous cell
import importlib
import sys

# Remove config from cache if it was already imported
if 'config' in sys.modules:
    importlib.reload(sys.modules['config'])

from config import (
    NDS_URL, ENTITY_STORE_URL, EVALUATOR_URL, NEMO_URL, LLAMASTACK_URL,
    NMS_NAMESPACE, DATASET_NAME, NDS_TOKEN,
    OPENAI_API_KEY, NVIDIA_API_KEY,
    ACTIVE_NIM_SERVICE, ACTIVE_NIM_MODEL, NIM_URL_CLUSTER
)

print(f"‚úÖ Configuration loaded")
print(f"Mode: Cluster (Workbench/Notebook)")
print(f"Data Store: {NDS_URL}")
print(f"Entity Store: {ENTITY_STORE_URL}")
print(f"Evaluator: {EVALUATOR_URL}")
print(f"LlamaStack: {LLAMASTACK_URL}")
print(f"Namespace: {NMS_NAMESPACE}")
print(f"Dataset: {DATASET_NAME}")
print(f"NIM Model Serving: {ACTIVE_NIM_SERVICE} ({ACTIVE_NIM_MODEL})")
print(f"NIM URL: {NIM_URL_CLUSTER}")

# Quick connectivity test
import requests
try:
    r = requests.get(f"{NDS_URL}/v1/datastore/namespaces", timeout=5)
    print(f"‚úÖ Data Store connectivity: OK")
except Exception as e:
    print(f"‚ö†Ô∏è  Data Store connectivity: FAILED - {e}")
    print(f"   Ensure you're running this notebook from within a Workbench/Notebook in the cluster")
    print(f"   Verify services are running:")
    print(f"   oc get pods -n {NMS_NAMESPACE} | grep -E '(datastore|entitystore|evaluator)'")
    print(f"   Check service endpoints:")
    print(f"   oc get svc -n {NMS_NAMESPACE} | grep -E '(datastore|entitystore|evaluator)'")

# Initialize LlamaStack client
try:
    from llama_stack_client import LlamaStackClient
    import logging
    
    # Suppress httpx INFO logs (404 on root endpoint is expected)
    logging.getLogger("httpx").setLevel(logging.WARNING)
    
    client = LlamaStackClient(base_url=LLAMASTACK_URL)
    # Test connectivity
    # Note: 404 on root endpoint is expected - it just means the service is reachable
    try:
        server_info = client._client.get("/")
        print(f"‚úÖ LlamaStack connectivity: OK")
        try:
            client_version = client._client._version
            print(f"   LlamaStack client version: {client_version}")
        except:
            pass
    except Exception as e:
        # 404 is OK - it means service is reachable but root endpoint doesn't exist
        if "404" in str(e) or "Not Found" in str(e):
            print(f"‚úÖ LlamaStack connectivity: OK (service reachable)")
        else:
            print(f"‚ö†Ô∏è  LlamaStack connectivity: FAILED - {e}")
            print(f"   Make sure LlamaStack is deployed: oc get pods -n {NMS_NAMESPACE} | grep llamastack")
            client = None
except ImportError:
    print("‚ö†Ô∏è  LlamaStack client not available - install with: %pip install --upgrade git+https://github.com/meta-llama/llama-stack-client-python.git@main")
    print("   Continuing without LlamaStack integration...")
    client = None
except Exception as e:
    print(f"‚ö†Ô∏è  LlamaStack initialization failed: {e}")
    print("   Continuing without LlamaStack integration...")
    client = None


‚úÖ Configuration loaded
Mode: Local (port-forward)
Data Store: http://localhost:8001
Entity Store: http://localhost:8002
Evaluator: http://localhost:8004
LlamaStack: http://localhost:8321
Namespace: anemo-rhoai
Dataset: custom-llm-as-a-judge-eval-data
‚úÖ Data Store connectivity: OK


INFO:httpx:HTTP Request: GET http://localhost:8321/ "HTTP/1.1 404 Not Found"


‚úÖ LlamaStack connectivity: OK


## Step 1: Set Up Namespaces

Create namespaces in both Entity Store and Data Store.


In [3]:
import requests

def create_namespaces(entity_host, ds_host, namespace):
    """Create namespace in both Entity Store and Data Store."""
    # Create namespace in Entity Store
    entity_store_url = f"{entity_host}/v1/namespaces"
    resp = requests.post(entity_store_url, json={"id": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store: {resp.status_code} - {resp.text}"
    print(f"‚úÖ Entity Store namespace created/verified: {namespace}")

    # Create namespace in Data Store
    nds_url = f"{ds_host}/v1/datastore/namespaces"
    resp = requests.post(nds_url, data={"namespace": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Data Store: {resp.status_code} - {resp.text}"
    print(f"‚úÖ Data Store namespace created/verified: {namespace}")

create_namespaces(entity_host=ENTITY_STORE_URL, ds_host=NDS_URL, namespace=NMS_NAMESPACE)


‚úÖ Entity Store namespace created/verified: anemo-rhoai
‚úÖ Data Store namespace created/verified: anemo-rhoai


## Step 2: Upload Dataset to Data Store

Upload the medical consultation data to the Data Store.


In [4]:
from huggingface_hub import HfApi

repo_id = f"{NMS_NAMESPACE}/{DATASET_NAME}"
print(f"Repository ID: {repo_id}")

# Create HfApi client pointing to NeMo Data Store
hf_api = HfApi(endpoint=f"{NDS_URL}/v1/hf", token=NDS_TOKEN if NDS_TOKEN != "token" else None)

# IMPORTANT: Ensure namespace exists in Gitea before creating repository
# Data Store's Gitea backend needs the namespace directory to exist,
# otherwise it defaults to "default" namespace. Creating a temporary
# repository first ensures the namespace is created in Gitea.
temp_repo_id = f"{NMS_NAMESPACE}/.namespace-init"
try:
    # Create temporary repo to ensure namespace exists in Gitea
    hf_api.create_repo(repo_id=temp_repo_id, repo_type='dataset', exist_ok=True)
    # Delete temporary repo (namespace directory will remain)
    try:
        hf_api.delete_repo(repo_id=temp_repo_id, repo_type='dataset')
    except:
        pass  # Ignore if deletion fails
    print(f"‚úÖ Namespace '{NMS_NAMESPACE}' initialized in Gitea")
except Exception as e:
    # If temp repo creation fails, namespace might already exist - continue
    print(f"‚ÑπÔ∏è  Namespace check: {e}")

# Create repository (now namespace should exist in Gitea)
try:
    hf_api.create_repo(repo_id=repo_id, repo_type='dataset', exist_ok=True)
    print(f"‚úÖ Repository created: {repo_id}")
except Exception as e:
    print(f"‚ö†Ô∏è  Repository may already exist: {e}")


  from .autonotebook import tqdm as notebook_tqdm


Repository ID: anemo-rhoai/custom-llm-as-a-judge-eval-data
‚ÑπÔ∏è  Namespace check: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'anemo-rhoai/.namespace-init'.
‚úÖ Repository created: anemo-rhoai/custom-llm-as-a-judge-eval-data


In [None]:
# Upload data file
import os
data_file = "./data/doctor_consults_with_summaries.jsonl"

# Verify file exists
if not os.path.exists(data_file):
    raise FileNotFoundError(f"‚ùå Data file not found: {data_file}\n"
                           f"   Make sure the file exists in the data/ directory.\n"
                           f"   Current working directory: {os.getcwd()}")

print(f"üìÅ Uploading data file: {data_file}")
print(f"   File size: {os.path.getsize(data_file)} bytes")

try:
    hf_api.upload_file(
        path_or_fileobj=data_file,
        path_in_repo="doctor_consults_with_summaries.jsonl",
        repo_id=repo_id,
        repo_type='dataset',
    )
    print(f"‚úÖ Data uploaded to {repo_id}")
    
    # Verify upload by listing repository files
    try:
        repo_files = hf_api.list_repo_files(repo_id=repo_id, repo_type='dataset')
        print(f"‚úÖ Verified: Repository contains {len(repo_files)} file(s)")
        for file in repo_files:
            print(f"   - {file}")
        if not repo_files:
            raise Exception("Repository is empty after upload!")
    except Exception as verify_error:
        print(f"‚ùå Upload verification failed: {verify_error}")
        print(f"   The evaluation job will fail if the file is not in the repository!")
        raise
        
except Exception as e:
    if "already exists" in str(e).lower() or "409" in str(e):
        print(f"‚ÑπÔ∏è  File already exists in repository (this is OK)")
        # Still verify it's there
        try:
            repo_files = hf_api.list_repo_files(repo_id=repo_id, repo_type='dataset')
            print(f"‚úÖ Verified: Repository contains {len(repo_files)} file(s)")
            for file in repo_files:
                print(f"   - {file}")
            if not repo_files:
                print(f"‚ö†Ô∏è  WARNING: Repository appears empty even though file exists!")
                print(f"   This may cause the evaluation job to fail.")
                print(f"   Try deleting and re-uploading the file.")
        except Exception as verify_error:
            print(f"‚ö†Ô∏è  Could not verify existing file: {verify_error}")
    else:
        print(f"‚ùå Upload failed: {e}")
        print(f"   This will cause the evaluation job to fail!")
        raise


‚úÖ Data uploaded to anemo-rhoai/custom-llm-as-a-judge-eval-data


## Step 3: Register Dataset with LlamaStack

Register the dataset with LlamaStack so it can be used for training/fine-tuning workflows.


## Step 4: Register Dataset in Entity Store

Register the dataset in Entity Store so it can be used in evaluation jobs.

In [None]:
# Register dataset in Entity Store (required for Evaluator service)
# IMPORTANT: Make sure Step 2 (upload) completed successfully before running this cell
if repo_id:
    print(f"Step 4: Registering dataset in Entity Store...")
    print(f"   Dataset: {repo_id}")
    print(f"   Make sure Step 2 (upload) completed successfully!")
    
    try:
        files_url = f"hf://datasets/{repo_id}"
        print(f"   Files URL: {files_url}")
        
        resp = requests.post(
            url=f"{ENTITY_STORE_URL}/v1/datasets",
            json={
                "name": DATASET_NAME,
                "namespace": NMS_NAMESPACE,
                "description": "Medical consultation summaries for LLM-as-a-Judge evaluation",
                "files_url": files_url,
                "project": "custom-llm-as-a-judge-test",
            },
            timeout=30
        )
        
        # Handle response - 409 means dataset already exists (OK for re-running notebook)
        if resp.status_code in (200, 201):
            print(f"‚úÖ Dataset registered in Entity Store: {DATASET_NAME}")
            dataset_obj = resp.json()
            if 'files_url' in dataset_obj:
                print(f"   Files URL: {dataset_obj['files_url']}")
        elif resp.status_code == 409:
            print(f"‚ÑπÔ∏è  Dataset {DATASET_NAME} already exists in Entity Store (this is OK)")
            print(f"   The dataset is ready to use")
        else:
            print(f"‚ö†Ô∏è  Failed to register in Entity Store: {resp.status_code}")
            print(f"   Response: {resp.text}")
            print(f"\nüí° Troubleshooting:")
            print(f"   1. Make sure Step 2 (upload) cell completed successfully")
            print(f"   2. Verify files were uploaded: Check the output of Step 2 cell")
            print(f"   3. Wait a few seconds and try running Step 2 again, then this cell")
            print(f"   4. Check Entity Store: oc get pods -n {NMS_NAMESPACE} | grep entitystore")
            print(f"   5. Check Data Store: oc get pods -n {NMS_NAMESPACE} | grep datastore")
            print(f"   6. Verify Entity Store URL is correct: {ENTITY_STORE_URL}")
            print(f"   7. If dataset already exists, you can skip this step")
            raise Exception(f"Status Code {resp.status_code} Failed to create dataset: {resp.text}")
        
        # Verify dataset exists
        res = requests.get(url=f"{ENTITY_STORE_URL}/v1/datasets/{NMS_NAMESPACE}/{DATASET_NAME}", timeout=30)
        if res.status_code not in (200, 201):
            raise Exception(f"Status Code {res.status_code} Failed to fetch dataset: {res.text}")
        
        dataset_obj = res.json()
        print(f"‚úÖ Dataset verified. Files URL: {dataset_obj['files_url']}")
        
    except requests.exceptions.Timeout:
        print(f"‚ö†Ô∏è  Timeout connecting to Entity Store")
        print(f"   Check Entity Store is running: oc get pods -n {NMS_NAMESPACE} | grep entitystore")
        raise
    except requests.exceptions.ConnectionError as e:
        print(f"‚ö†Ô∏è  Connection error to Entity Store: {e}")
        print(f"   Check Entity Store URL: {ENTITY_STORE_URL}")
        print(f"   Check Entity Store is running: oc get pods -n {NMS_NAMESPACE} | grep entitystore")
        raise
    except Exception as e:
        print(f"‚ö†Ô∏è  Error registering dataset in Entity Store: {e}")
        print(f"\nüí° Troubleshooting:")
        print(f"   1. Make sure Step 2 (upload) cell completed successfully")
        print(f"   2. Check Entity Store: oc get pods -n {NMS_NAMESPACE} | grep entitystore")
        print(f"   3. Check Entity Store logs: oc logs -n {NMS_NAMESPACE} deployment/nemoentitystore-sample --tail=50")
        raise
else:
    print(f"\n‚ö†Ô∏è  Skipping Entity Store registration (files not uploaded to Data Store)")
    print(f"   Run Step 2 (upload) cell first to upload files to Data Store")

In [6]:
# Register dataset with LlamaStack (only if files were uploaded and client is available)
# IMPORTANT: Make sure Step 2 (upload) completed successfully before running this cell
if repo_id and client is not None:
    print(f"Step 3: Registering dataset with LlamaStack...")
    print(f"   Dataset: {repo_id}")
    print(f"   Make sure Step 2 (upload) completed successfully!")
    
    try:
        dataset_uri = f"hf://datasets/{repo_id}"
        print(f"   Dataset URI: {dataset_uri}")
        
        response = client.beta.datasets.register(
            purpose="post-training/messages",
            dataset_id=DATASET_NAME,
            source={
                "type": "uri",
                "uri": dataset_uri
            },
            metadata={
                "format": "jsonl",
                "description": "Medical consultation summaries for LLM-as-a-Judge evaluation",
                "provider_id": "nvidia",
            }
        )
        print(f"‚úÖ Dataset registered with LlamaStack: {DATASET_NAME}")
        if hasattr(response, 'dataset_id'):
            print(f"   Dataset ID: {response.dataset_id}")
    except Exception as e:
        error_msg = str(e)
        if "already exists" in error_msg.lower() or "409" in error_msg:
            print(f"‚ÑπÔ∏è  Dataset {DATASET_NAME} already exists in LlamaStack (this is OK)")
        else:
            print(f"‚ö†Ô∏è  Error registering dataset with LlamaStack: {error_msg}")
            print(f"\nüí° Troubleshooting:")
            print(f"   1. Make sure Step 2 (upload) cell completed successfully")
            print(f"   2. Verify files were uploaded: Check the output of Step 2 cell")
            print(f"   3. Wait a few seconds and try running Step 2 again, then this cell")
            print(f"   4. Check Data Store: oc get pods -n {NMS_NAMESPACE} | grep datastore")
            print(f"   5. Check LlamaStack: oc get pods -n {NMS_NAMESPACE} | grep llamastack")
            print(f"   6. If dataset already exists, you can skip this step")
            print(f"\n   Continuing without LlamaStack registration...")
elif not repo_id:
    print(f"\n‚ö†Ô∏è  Skipping LlamaStack registration (files not uploaded to Data Store)")
    print(f"   Run Step 2 (upload) cell first to upload files to Data Store")
elif client is None:
    print(f"\n‚ö†Ô∏è  Skipping LlamaStack registration (LlamaStack client not available)")
    print(f"   Make sure the LlamaStack initialization cell ran successfully")
else:
    print(f"\n‚ö†Ô∏è  Skipping LlamaStack registration")


‚ÑπÔ∏è  Dataset already exists: custom-llm-as-a-judge-eval-data (this is OK)
‚úÖ Dataset verified. Files URL: hf://datasets/anemo-rhoai/custom-llm-as-a-judge-eval-data


## Step 5: Configure Judge LLM and Target Model

Set up the judge model (OpenAI) and target model (for generating summaries).


In [None]:
# Judge LLM Configuration
# IMPORTANT: Use NIM Model Serving (anemo-rhoai) for meta-llama-2-7b-chat
# This uses the NIM Model Serving InferenceService (KServe/Knative) on port 80
#
# ‚ö†Ô∏è  WORKAROUND for Evaluator v25.06 URL stripping bug:
# Evaluator strips /chat/completions from Knative URLs, so we configure the base URL
# and rely on the model_id to indicate it's a chat model. The Evaluator should handle
# the endpoint routing based on the model type.

import importlib
import config
importlib.reload(config)
from config import NIM_URL_CLUSTER, ACTIVE_NIM_MODEL, ACTIVE_NIM_SERVICE

# Use external URL to work around Evaluator URL stripping bug
# External URLs may not have the same URL stripping issue as cluster-internal Knative URLs
# External URLs require authentication - use service account token from environment
base_url = NIM_URL_CLUSTER.rstrip('/')
# Get service account token from environment (set in cell 1)
SERVICE_ACCOUNT_TOKEN = os.environ.get("NIM_SERVICE_ACCOUNT_TOKEN", "")
if not SERVICE_ACCOUNT_TOKEN:
    raise ValueError("NIM_SERVICE_ACCOUNT_TOKEN not set! Please set it in cell 1 (environment configuration).")

judge_model_config = {
    "api_endpoint": {
        "url": f"{base_url}/v1/chat/completions",  # Full path with external URL
        "model_id": ACTIVE_NIM_MODEL,  # meta/llama-3.2-1b-instruct (from NIM Model Serving)
        "format": "openai",  # Specify format
        "api_key": SERVICE_ACCOUNT_TOKEN  # Service account token for external URL authentication
    }
}

print(f"‚úÖ Judge model configured: NIM Model Serving ({ACTIVE_NIM_MODEL})")
print(f"   Service: {ACTIVE_NIM_SERVICE}")
print(f"   URL: {base_url}/v1/chat/completions")
print(f"   ‚ÑπÔ∏è  Using external URL with authentication token")
print(f"   (External URL bypasses Evaluator URL stripping bug)")


‚úÖ Judge model configured: Your NIM (meta/llama-3.2-1b-instruct)
‚ÑπÔ∏è  Creating evaluation target with cluster URL: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions
   (Service mesh handles authentication - no token needed)
   (Evaluation jobs run inside cluster and need cluster service URL, not localhost)
   If job fails, this is a known Evaluator limitation


In [None]:
# Target Model Configuration
# IMPORTANT: Use NIM Model Serving (anemo-rhoai) for meta-llama-2-7b-chat
# This uses the NIM Model Serving InferenceService (KServe/Knative) on port 80
#
# ‚ö†Ô∏è  WORKAROUND for Evaluator v25.06 URL stripping bug:
# Evaluator strips /chat/completions from Knative URLs, so we configure the base URL
# and rely on the model_id to indicate it's a chat model.

from config import NMS_NAMESPACE

# This uses the NIM Model Serving service (anemo-rhoai) deployed via Helm chart
# Use NIM_URL_CLUSTER from config (automatically uses NIM Model Serving if enabled)
import importlib
import config
importlib.reload(config)
from config import NIM_URL_CLUSTER, NMS_NAMESPACE, ACTIVE_NIM_MODEL, ACTIVE_NIM_SERVICE

# Use external URL to work around Evaluator URL stripping bug
# External URLs may not have the same URL stripping issue as cluster-internal Knative URLs
# External URLs require authentication - use service account token from environment
base_url = NIM_URL_CLUSTER.rstrip('/')
# Get service account token from environment (set in cell 1)
SERVICE_ACCOUNT_TOKEN = os.environ.get("NIM_SERVICE_ACCOUNT_TOKEN", "")
if not SERVICE_ACCOUNT_TOKEN:
    raise ValueError("NIM_SERVICE_ACCOUNT_TOKEN not set! Please set it in cell 1 (environment configuration).")

target_model_config = {
    "type": "model",
    "model": {
        "api_endpoint": {
            "url": f"{base_url}/v1/chat/completions",  # Full path with external URL
            "model_id": ACTIVE_NIM_MODEL,  # meta/llama-3.2-1b-instruct (from NIM Model Serving)
            "format": "openai",  # Specify format
            "api_key": SERVICE_ACCOUNT_TOKEN  # Service account token for external URL authentication
        }
    }
}

print(f"‚úÖ Target model configured: NIM Model Serving ({ACTIVE_NIM_MODEL})")
print(f"   Service: {ACTIVE_NIM_SERVICE}")
print(f"   URL: {base_url}/v1/chat/completions")
print(f"   ‚ÑπÔ∏è  Using external URL with authentication token")
print(f"   (External URL bypasses Evaluator URL stripping bug)")


‚úÖ Target model configured: Your NIM (meta/llama-3.2-1b-instruct)
‚ÑπÔ∏è  Creating evaluation target with cluster URL: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions
   (Service mesh handles authentication - no token needed)
   (Evaluation jobs run inside cluster and need cluster service URL, not localhost)
   If job fails, try creating evaluation target first (see troubleshooting)


## Step 5: Define Evaluation Prompts

Create prompts for the judge to evaluate completeness and correctness.


In [9]:
# System prompts for judge evaluation
completeness_system_prompt = """
You are a judge. Rate how complete the summary is 
on a scale from 1 to 5:
1 = missing critical information ‚Ä¶ 5 = fully complete
Please respond with RATING: <number>
"""

correctness_system_prompt = """
You are a judge. Rate the summary's correctness 
(no false info) on a scale 1-5:
1 = many inaccuracies ‚Ä¶ 5 = completely accurate
Please respond with RATING: <number>
"""

# User prompt template (references dataset item and model output)
user_prompt = """
Full Consult: {{ item.content }}
Summary: {{ sample.output_text }}
"""

print("‚úÖ Evaluation prompts defined")


‚úÖ Evaluation prompts defined


## Step 6: Create Evaluation Configuration

Build the custom LLM-as-a-Judge evaluation configuration.


In [10]:
llm_as_a_judge_config = {
    "type": "custom",
    "name": "doctor_consult_summary_eval",
    "tasks": {
        "consult_summary_eval": {
            "type": "chat-completion",
            "params": {
                "template": {
                    # Prompt sent to target LLM to generate summary
                    "messages": [
                        {
                            "role": "system",
                            "content": "Given a full medical consultation, please provide a 50 word summary of the consultation."
                        },
                        {
                            "role": "user",
                            "content": "Full Consult: {{ item.content }}"
                        }
                    ],
                    "max_tokens": 200
                }
            },
            "dataset": {
                "files_url": f"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/",
                "limit": 5  # Reduced for quick test - increase for full evaluation
            },
            "metrics": {
                "completeness": {
                    "type": "llm-judge",
                    "params": {
                        "model": judge_model_config,
                        "template": {
                            "messages": [
                                {"role": "system", "content": completeness_system_prompt},
                                {"role": "user", "content": user_prompt}
                            ]
                        },
                        "scores": {
                            "completeness": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING: *([0-9]+)"
                                }
                            }
                        }
                    }
                },
                "correctness": {
                    "type": "llm-judge",
                    "params": {
                        "model": judge_model_config,
                        "template": {
                            "messages": [
                                {"role": "system", "content": correctness_system_prompt},
                                {"role": "user", "content": user_prompt}
                            ]
                        },
                        "scores": {
                            "correctness": {
                                "type": "int",
                                "parser": {
                                    "type": "regex",
                                    "pattern": r"RATING: *([0-9]+)"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

print("‚úÖ Evaluation configuration created")
print(f"   - Type: custom")
print(f"   - Metrics: completeness, correctness")
print(f"   - Sample limit: 5 (for quick test)")


‚úÖ Evaluation configuration created
   - Type: custom
   - Metrics: completeness, correctness
   - Sample limit: 5 (for quick test)


## Step 6.5: Create Evaluation Targets

**‚ö†Ô∏è CRITICAL: Known Evaluator v25.06 Bug with NIM Model Serving (Knative)**

There is a **known bug in NeMo Evaluator v25.06** that strips `/chat/completions` from Knative service URLs during job execution. This is a **fundamental Evaluator bug** that cannot be worked around in the notebook configuration.

**What Happens:**
1. ‚úÖ **Target creation works**: URLs are stored correctly with `/v1/chat/completions`
2. ‚úÖ **Job submission works**: Jobs are accepted and created successfully  
3. ‚ùå **Job execution fails**: Evaluator strips `/chat/completions` from the URL, resulting in:
   ```
   Error connecting to inference server at http://anemo-rhoai-predictor.../v1
   ```
   Instead of the correct: `http://anemo-rhoai-predictor.../v1/chat/completions`

**Why This Happens:**
- NIM Model Serving uses Knative/KServe InferenceServices
- Evaluator v25.06 has a bug in its URL handling for Knative services
- The bug strips path components from Knative URLs during job execution
- NIM services require `/v1/chat/completions` endpoint - `/v1` alone doesn't work

**Solutions:**
1. **Use Standard NIM Service** instead - standard NIM services (not Knative) work correctly with Evaluator
2. **Wait for Evaluator fix** - this needs to be fixed in the Evaluator codebase (bug persists in v25.08)
3. **Use external API endpoints** - if your NIM service is accessible via external URL, that might work

**Current Status**: 
- ‚úÖ Evaluator upgraded to v25.08
- ‚ùå **Bug persists in v25.08** - URL stripping still occurs
- This notebook is configured to use NIM Model Serving as requested, but **it will fail due to the Evaluator bug**
- The configuration is correct, but Evaluator cannot work with Knative InferenceServices for LLM-as-a-Judge evaluation jobs


In [None]:
# Create evaluation target for judge model
# Use target reference (created in previous cells) for cleaner configuration
import requests
from config import EVALUATOR_URL, NMS_NAMESPACE

headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

# Delete existing target if it exists (for clean re-runs)
from config import ACTIVE_NIM_MODEL
# Get service account token from environment (set in cell 1)
SERVICE_ACCOUNT_TOKEN = os.environ.get("NIM_SERVICE_ACCOUNT_TOKEN", "")
if not SERVICE_ACCOUNT_TOKEN:
    raise ValueError("NIM_SERVICE_ACCOUNT_TOKEN not set! Please set it in cell 1 (environment configuration).")

judge_target_name = "anemo-rhoai-llama-3.2-1b-judge"
try:
    res = requests.delete(f"{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/{judge_target_name}")
    if res.status_code in (200, 404):
        print(f"‚úÖ Cleaned up existing judge target (if any)")
except:
    pass

# Create judge model target
judge_target_data = {
    "type": "model",
    "name": judge_target_name,
    "namespace": NMS_NAMESPACE,
    "model": {
        "api_endpoint": {
            # Use external URL with full path - may work around Evaluator URL stripping
            "url": f"{NIM_URL_CLUSTER.rstrip('/')}/v1/chat/completions",
            "model_id": ACTIVE_NIM_MODEL,  # meta/llama-3.2-1b-instruct
            "format": "openai",  # Specify format
            "api_key": SERVICE_ACCOUNT_TOKEN  # Service account token for authentication
        }
    }
}

print(f"Creating judge model evaluation target: {judge_target_name}")
res = requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", headers=headers, json=judge_target_data)

if res.status_code not in (200, 201):
    print(f"‚ö†Ô∏è  Warning: Could not create judge target: {res.status_code}")
    print(f"Response: {res.text[:200]}")
    # Continue anyway - might already exist
else:
    judge_target_response = res.json()
    print(f"‚úÖ Judge target created: {judge_target_response.get('name')}")

# Update judge_model_config to use target name for job submission
# The config will reference this target by name
judge_target_ref = f"{NMS_NAMESPACE}/{judge_target_name}"
print(f"\nüí° Judge target reference: {judge_target_ref}")
print("   (Will use this in evaluation config)")


‚úÖ Cleaned up existing judge target (if any)
Creating judge model evaluation target: meta-llama3-1b-instruct-judge
‚úÖ Judge target created: meta-llama3-1b-instruct-judge

üí° Judge target reference: anemo-rhoai/meta-llama3-1b-instruct-judge
   (Will use this in evaluation config)


In [None]:
# Create evaluation target for target model
# Delete existing target if it exists (for clean re-runs)
from config import ACTIVE_NIM_MODEL
# Get service account token from environment (set in cell 1)
SERVICE_ACCOUNT_TOKEN = os.environ.get("NIM_SERVICE_ACCOUNT_TOKEN", "")
if not SERVICE_ACCOUNT_TOKEN:
    raise ValueError("NIM_SERVICE_ACCOUNT_TOKEN not set! Please set it in cell 1 (environment configuration).")

target_target_name = "anemo-rhoai-llama-3.2-1b-target"
try:
    res = requests.delete(f"{EVALUATOR_URL}/v1/evaluation/targets/{NMS_NAMESPACE}/{target_target_name}")
    if res.status_code in (200, 404):
        print(f"‚úÖ Cleaned up existing target model target (if any)")
except:
    pass

# Create target model evaluation target
target_target_data = {
    "type": "model",
    "name": target_target_name,
    "namespace": NMS_NAMESPACE,
    "model": {
        "api_endpoint": {
            # Use external URL with full path - may work around Evaluator URL stripping
            "url": f"{NIM_URL_CLUSTER.rstrip('/')}/v1/chat/completions",
            "model_id": ACTIVE_NIM_MODEL,  # meta/llama-3.2-1b-instruct
            "format": "openai",  # Specify format
            "api_key": SERVICE_ACCOUNT_TOKEN  # Service account token for authentication
        }
    }
}

print(f"Creating target model evaluation target: {target_target_name}")
res = requests.post(f"{EVALUATOR_URL}/v1/evaluation/targets", headers=headers, json=target_target_data)

if res.status_code not in (200, 201):
    print(f"‚ö†Ô∏è  Warning: Could not create target model target: {res.status_code}")
    print(f"Response: {res.text[:200]}")
    # Continue anyway - might already exist
else:
    target_target_response = res.json()
    print(f"‚úÖ Target model target created: {target_target_response.get('name')}")

# Store target reference for job submission
target_model_ref = f"{NMS_NAMESPACE}/{target_target_name}"
print(f"\nüí° Target model reference: {target_model_ref}")
print("   (Will use this in job submission)")


‚úÖ Cleaned up existing target model target (if any)
Creating target model evaluation target: meta-llama3-1b-instruct-target
‚úÖ Target model target created: meta-llama3-1b-instruct-target

üí° Target model reference: anemo-rhoai/meta-llama3-1b-instruct-target
   (Will use this in job submission)


## Step 7: Submit Evaluation Job

Submit the evaluation job to NeMo Evaluator.


In [14]:
# Submit evaluation job
# IMPORTANT: Use inline target config (like original notebook) to avoid Data Store validation issues
# Using inline config bypasses evaluation target lookup which triggers Data Store dataset validation
try:
    # Use inline target config (matches original notebook approach)
    job_payload = {
        "config": llm_as_a_judge_config,
        "target": target_model_config  # Use inline config object, not target reference
    }
    
    print("üì§ Submitting evaluation job with inline target config...")
    print(f"   Target: {target_model_config['model']['api_endpoint']['url']}")
    
    res = requests.post(
        f"{EVALUATOR_URL}/v1/evaluation/jobs",
        json=job_payload,
        timeout=30
    )

    if res.status_code not in (200, 201):
        print(f"‚ùå Failed to submit job: {res.status_code}")
        print(f"Response: {res.text}")
        raise Exception(f"Job submission failed: {res.status_code} - {res.text}")

    job_data = res.json()
    base_eval_job_id = job_data["id"]
    print(f"‚úÖ Evaluation job submitted")
    print(f"   Job ID: {base_eval_job_id}")
    print(f"   Status: {job_data.get('status', 'unknown')}")
    
except requests.exceptions.RequestException as e:
    print(f"‚ùå Network error submitting job: {e}")
    print(f"   Check that Evaluator is accessible at: {EVALUATOR_URL}")
    raise
except Exception as e:
    print(f"‚ùå Error submitting job: {e}")
    raise


üì§ Submitting evaluation job with inline target config...
   Target: http://meta-llama3-1b-instruct.anemo-rhoai.svc.cluster.local:8000/v1/chat/completions
‚úÖ Evaluation job submitted
   Job ID: eval-VgyjqL9ciYXZ9XCVGGLmfe
   Status: created


## Step 8: Wait for Job Completion

Monitor the evaluation job until it completes.


In [16]:
from time import sleep, time

def wait_eval_job(job_url: str, polling_interval: int = 10, timeout: int = 600):
    """Helper for waiting an eval job with error handling."""
    start_time = time()
    
    try:
        res = requests.get(job_url, timeout=10)
        if res.status_code != 200:
            raise Exception(f"Failed to get job status: {res.status_code} - {res.text}")
    except requests.exceptions.RequestException as e:
        raise Exception(f"Network error getting job status: {e}")
    
    job_data = res.json()
    status = job_data["status"]
    print(f"Initial status: {status}")
    
    # Check for immediate terminal states
    if status == "failed":
        print(f"‚ùå Job failed immediately!")
        status_details = job_data.get('status_details', {})
        error_msg = status_details.get('message', 'Unknown error')
        print(f"Error: {error_msg}")
        return res
    elif status == "completed":
        print(f"‚úÖ Job completed immediately!")
        return res

    # Poll for status updates
    while status in ["pending", "created", "running"]:
        # Check for timeout
        elapsed = time() - start_time
        if elapsed > timeout:
            raise RuntimeError(f"Job took more than {timeout} seconds (timed out).")

        # Sleep before polling again
        sleep(polling_interval)

        # Fetch updated status and progress
        try:
            res = requests.get(job_url, timeout=10)
            if res.status_code != 200:
                print(f"‚ö†Ô∏è  Failed to get status: {res.status_code} - {res.text}")
                sleep(polling_interval)  # Wait before retrying
                continue
        except requests.exceptions.RequestException as e:
            print(f"‚ö†Ô∏è  Network error getting status: {e} - retrying...")
            sleep(polling_interval)
            continue
            
        job_data = res.json()
        status = job_data["status"]
        elapsed = time() - start_time

        # Handle terminal states immediately
        if status == "failed":
            print(f"\n‚ùå Job failed after {elapsed:.1f}s")
            status_details = job_data.get('status_details', {})
            error_msg = status_details.get('message', 'Unknown error')
            print(f"Error: {error_msg}")
            
            # Print task status if available
            task_status = status_details.get('task_status', {})
            if task_status:
                print(f"\nTask status details:")
                for task_name, task_info in task_status.items():
                    print(f"  - {task_name}: {task_info}")
            return res
        elif status == "completed":
            progress = 100
            print(f"‚úÖ Status: {status} | Progress: {progress}% | Elapsed: {elapsed:.1f}s")
            return res
        elif status == "running":
            progress = job_data.get("status_details", {}).get("progress", 0)
            print(f"‚è≥ Status: {status} | Progress: {progress}% | Elapsed: {elapsed:.1f}s")
        else:
            # Unknown status - log and continue
            print(f"‚ö†Ô∏è  Status: {status} | Elapsed: {elapsed:.1f}s")

    # If we exit the loop, status should be terminal, but check anyway
    if status not in ["completed", "failed"]:
        print(f"‚ö†Ô∏è  Unexpected final status: {status}")
        print(f"   Full job data: {job_data}")

    return res

print("‚è≥ Waiting for evaluation job to complete...")
try:
    res = wait_eval_job(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}", polling_interval=5, timeout=600)
except Exception as e:
    print(f"‚ùå Error waiting for job: {e}")
    raise


‚è≥ Waiting for evaluation job to complete...
Initial status: running
‚è≥ Status: running | Progress: 60.0% | Elapsed: 5.2s
‚úÖ Status: completed | Progress: 100% | Elapsed: 10.3s


In [17]:
# Check final status (this cell provides additional details if needed)
try:
    job_data = res.json()
    final_status = job_data["status"]
    
    if final_status == "completed":
        print(f"‚úÖ Job completed successfully!")
        print(f"   You can now view results in the next cell.")
    elif final_status == "failed":
        print(f"\n‚ùå Job failed - Summary:")
        status_details = job_data.get('status_details', {})
        error_msg = status_details.get('message', 'Unknown error')
        
        # Extract key error information
        if "Error connecting to inference server" in error_msg:
            print(f"   Issue: Cannot connect to NIM endpoint")
            print(f"   Check: Is the NIM service running and accessible from cluster?")
            print(f"   URL used: Check the target/judge model configuration")
        
        print(f"\n   Full error message:")
        print(f"   {error_msg[:500]}...")  # Truncate very long errors
        
        # Print task status if available
        task_status = status_details.get('task_status', {})
        if task_status:
            print(f"\n   Task status details:")
            for task_name, task_info in task_status.items():
                print(f"     - {task_name}: {task_info}")
    else:
        print(f"‚ö†Ô∏è  Job status: {final_status}")
        print(f"   Full response: {job_data}")
except Exception as e:
    print(f"‚ö†Ô∏è  Error parsing job status: {e}")
    print(f"   Raw response: {res.text if hasattr(res, 'text') else res}")


‚úÖ Job completed successfully!
   You can now view results in the next cell.


## Step 9: View Results

Retrieve and display the evaluation results.


In [18]:
# Get results
try:
    res = requests.get(f"{EVALUATOR_URL}/v1/evaluation/jobs/{base_eval_job_id}/results", timeout=30)
    
    if res.status_code == 200:
        results = res.json()
        
        # Extract metrics
        tasks = results.get("tasks", {})
        if not tasks:
            print("‚ö†Ô∏è  No tasks found in results")
            print(f"   Full response: {results}")
        else:
            for task_name, task_data in tasks.items():
                print(f"\nüìä Task: {task_name}")
                metrics = task_data.get("metrics", {})
                if not metrics:
                    print(f"   ‚ö†Ô∏è  No metrics found for this task")
                else:
                    for metric_name, metric_data in metrics.items():
                        scores = metric_data.get("scores", {})
                        if not scores:
                            print(f"   ‚ö†Ô∏è  No scores found for metric: {metric_name}")
                        else:
                            for score_name, score_data in scores.items():
                                value = score_data.get("value", "N/A")
                                stats = score_data.get("stats", {})
                                mean = stats.get("mean", "N/A")
                                count = stats.get("count", "N/A")
                                print(f"   {score_name}: {value} (mean: {mean}, count: {count})")
        
        print(f"\n‚úÖ Results retrieved successfully!")
    elif res.status_code == 404:
        print(f"‚ö†Ô∏è  Results not yet available (404)")
        print(f"   Job may still be processing. Wait a moment and try again.")
    else:
        print(f"‚ùå Failed to get results: {res.status_code}")
        print(f"   Response: {res.text}")
        
except requests.exceptions.RequestException as e:
    print(f"‚ùå Network error getting results: {e}")
    print(f"   Check that Evaluator is accessible at: {EVALUATOR_URL}")
except Exception as e:
    print(f"‚ùå Error getting results: {e}")
    raise



üìä Task: consult_summary_eval
   completeness: 4.0 (mean: 4.0, count: 5)
   correctness: 1.6 (mean: 1.6, count: 5)

‚úÖ Results retrieved successfully!
