# NeMo Evaluator Microservice: Prompt Optimization with MIPROv2

In this notebook, we'll demonstrate how to use NVIDIA NeMo Evaluator Microservice for prompt optimization using MIPROv2 (Multiprompt Instruction PRoposal Optimizer Version 2). This approach uses Bayesian Optimization to improve LLM-as-a-Judge prompts and evaluate their effectiveness.

The Judge model we'll be improving today is the NVIDIA Nemotron Nano 9B V2 model, which is a Large Language Model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks.

The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers, making it an effecient and fast model - well suited to this task.

We'll walk through the required steps of:

1. Setting up the environment and data
2. Creating and Submitting the optimization job
3. Analyzing results and comparing baseline vs optimized prompts

Let's get started!

> **NOTE**: You will need access to a deployed instance of NeMo Evaluator and NVIDIA NeMo Data Store Microservice. You can find details [here](https://docs.nvidia.com/nemo/microservices/25.9.0/get-started/setup/minikube/index.html#nemo-ms-get-started-prerequisites) on how to do that!


## Setup and Installation


In order to use this notebook, you'll want to set-up the virtual environment with `uv`. 

1. Get `uv` - you can start [here](https://docs.astral.sh/uv/getting-started/installation/)
2. Run `uv sync` to create the virtual environment.
3. Select the newly created virtual environment to use as the kernel in this Jupyter Notebook. 


## Configure Endpoints

Set up your Evaluator and Data Store endpoints:


In [None]:
import requests
import json
import pandas as pd
from typing import Dict, Any

# Configure your endpoints
EVAL_URL = "<< YOUR NEMO MICROSERVICE ENDPOINTS HERE >>"
DATASTORE_URL = "<< YOUR NEMO DATA STORE ENDPOINTS HERE >>"

## Health Check

Verify connectivity to the Evaluator service:


## Examine Dataset Format

Let's examine the HelpSteer2 dataset format to understand the structure for prompt optimization:


In [251]:
# Read and examine the dataset
import json

dataset_path = "./data/hs2.jsonl"

# Load first few examples to understand the format
examples = []
with open(dataset_path, 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:  # Just show first 5 examples
            break
        examples.append(json.loads(line))

print("Dataset Structure:")
for i, example in enumerate(examples):
    print(f"\nExample {i+1}:")
    print(f"Keys: {list(example.keys())}")
    print(f"Prompt: {example['prompt'][:100]}...")
    print(f"Response: {example['response'][:100]}...")
    print(f"Reference Helpfulness: {example['reference_helpfulness']}")
    break  # Show detailed view of first example only

Dataset Structure:

Example 1:
Keys: ['prompt', 'response', 'reference_helpfulness']
Prompt: c#...
Response: C# is a high-level, object-oriented programming language developed by Microsoft as part of its .NET ...
Reference Helpfulness: 3


## Upload Dataset to NeMo Data Store

Upload the dataset to the NeMo Data Store for use in prompt optimization:

In [252]:
import os
from huggingface_hub import HfApi

HF_ENDPOINT = f"{DATASTORE_URL}/v1/hf"
NAMESPACE = "llm-judge"
DATASET_NAME = "hs2-short"

hf_api = HfApi(endpoint=HF_ENDPOINT, token="mock")
repo_id = f"{NAMESPACE}/{DATASET_NAME}"

# Create the dataset repo if it doesn't exist
hf_api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

# Upload the file
dataset_url = hf_api.upload_file(
    path_or_fileobj="./data/hs2.jsonl",
    path_in_repo="hs2.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
    revision="main",
    commit_message=f"Eval dataset in {repo_id}"
)

print(f"Dataset uploaded: {dataset_url}")

hs2.jsonl:   0%|          | 0.00/199k [00:00<?, ?B/s]

Dataset uploaded: https://datastore.aire.nvidia.com/v1/hf/datasets/llm-judge/hs2-short/blob/main/hs2.jsonl


## Configure Prompt Optimization with MIPROv2 through an Inline Job!

Now we'll set up the prompt optimization configuration using MIPROv2. This includes:

- **Initial instruction**: The baseline prompt to optimize
- **Signature**: Defines the input/output structure matching our dataset
- **Metrics**: How to evaluate prompt performance
- **Optimization parameters**: Control the optimization process


### Target Configuration

Our target configuration tells NeMo Evaluator Microservice what model is the target for our evaluation. 

Let's break down the key components:

- **API Endpoint**: `model_id`, `url`, and `api_key` in this example point at a remote hosted model (in this case, hosted on [OpenRouter](https://openrouter.ai/nvidia/nemotron-nano-9b-v2)). You can substitute any OpenAI API compatible endpoint here - including, of course, the [NVIDIA Nemotron Nano 9B V2 NIM](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2/deploy)!

> NOTE: You can find your OpenRouter API Key through [this process](https://openrouter.ai/docs/api-reference/authentication)!

In [1]:
import os
import getpass

os.environ["OPENROUTER_API_KEY"]  = getpass.getpass("Enter your OpenRouter API Key: ")

### Configuration Explanation

Let's break down the key components:

- **Signature**: `"prompt, response, reference_helpfulness -> helpfulness"` matches our dataset structure
- **Initial instruction**: A baseline prompt for evaluating helpfulness
- **MIPROv2 parameters**:
  - `auto: "light"` - Light optimization intensity
  - `max_bootstrapped_demos: 2` - Generate up to 2 examples
  - `max_labeled_demos: 2` - Use up to 2 examples from training set
- **Metric**: Number-check with epsilon=1 allows scores within 1 point to be considered correct


## Submit Prompt Optimization Job

Now we'll create and submit the optimization job in a single config!

In [271]:
# Create job configuration
job_config = {
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {
        "model_id": "nvidia/nemotron-nano-9b-v2",
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "api_key": os.environ["OPENROUTER_API_KEY"]
      }
    }
  },
  "config": {
    "type": "custom",
    "tasks": {
      "helpfulness-prompt-optimization": {
        "type": "prompt-optimization",
        "params": {
          "optimizer": {
            "type": "miprov2",
            "instruction": "Your task is to evaluate the helpfulness of a response to a given prompt on a scale of 0-4. Output ONLY a single digit (0, 1, 2, 3, or 4) with no additional text.",
            "signature": "prompt, response -> reference_helpfulness: int",
            "auto": None,
            "num_trials": 1,
            "num_candidates": 1,
            "max_bootstrapped_demos": 0,
            "max_labeled_demos": 0,
            "minibatch_size": 2
          }
        },
        "metrics": {
          "number-check": {
            "type": "number-check",
            "params": {
              "check": [
                "absolute difference",
                "{{item.reference_helpfulness | trim}}",
                "{{reference_helpfulness | trim}}",
                "epsilon",
                1
              ]
            }
          }
        },
        "dataset": {
          "files_url": f"hf://datasets/{NAMESPACE}/{DATASET_NAME}"
        }
      }
    }
  }
}


Next, we can submit the job to the `v1/evaluation/jobs` endpoint!

In [261]:
# Submit the job
job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

response = requests.post(
    job_endpoint,
    json=job_config,
    headers={'accept': 'application/json'}
).json()

print(response)

job_id = response["id"]
print(f"Prompt Optimization Job Submitted!")
print(f"Job ID: {job_id}")
print(f"\nJob Details:")
print(json.dumps(response, indent=2))

{'created_at': '2025-09-17T21:27:03.479720', 'updated_at': '2025-09-17T21:27:03.479721', 'id': 'eval-GTtMA2M9SBYGSouHtgv2Cs', 'namespace': 'default', 'description': None, 'target': {'schema_version': '1.0', 'id': 'eval-target-FPZPXByTCnzhneCCTjwNSf', 'description': None, 'type_prefix': 'eval-target', 'namespace': 'default', 'project': None, 'created_at': '2025-09-17T21:27:03.479166', 'updated_at': '2025-09-17T21:27:03.479166', 'custom_fields': {}, 'ownership': None, 'name': 'eval-target-FPZPXByTCnzhneCCTjwNSf', 'type': 'model', 'cached_outputs': None, 'model': {'schema_version': '1.0', 'id': 'model-XLdAWScdr44np3Z7UFygpg', 'description': None, 'type_prefix': 'model', 'namespace': 'default', 'project': None, 'created_at': '2025-09-17T21:27:03.479192', 'updated_at': '2025-09-17T21:27:03.479193', 'custom_fields': {}, 'ownership': None, 'name': 'model-XLdAWScdr44np3Z7UFygpg', 'version_id': 'main', 'version_tags': [], 'spec': None, 'artifact': None, 'base_model': None, 'api_endpoint': {'url

## Monitor Job Progress

Let's monitor the optimization job status. Prompt optimization can take some time as it involves multiple optimization trials:


In [None]:
import time

def check_job_status(job_id: str) -> Dict[str, Any]:
    """Check the status of an evaluation job."""
    monitoring_endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{job_id}"
    response = requests.get(monitoring_endpoint).json()
    return response

def wait_for_completion(job_ids: list, check_interval: int = 30) -> Dict[str, Any]:
    """Wait for job completion with periodic status updates."""
    print(f"Monitoring jobs {job_ids}...")
    
    while True:
        status_response = check_job_status(job_id)
        status = status_response["status"]
    
        print(f"\nJob Status {job_id}: {status}")
    
        if status in ["completed", "failed", "cancelled"]:
            print(f"\nJob completed with status: {status}")
            return status_response
        
        print(f"Waiting {check_interval} seconds before next check...")
        time.sleep(check_interval)

# Check current status
current_status = check_job_status(job_id)
print(f"Current Status: {current_status['status']}")
if current_status['status'] == 'failed':
    print(f"Job failed with status: {current_status['status']}")
    print(f"Job details: {current_status}")

Current Status: created


> NOTE: At this time - `progress` is not captured during the running job

In [None]:
wait_for_completion([job_id])
final_status = check_job_status(job_id)
print(final_status)

Monitoring jobs ['eval-GTtMA2M9SBYGSouHtgv2Cs', 'eval-YbisDgYUhnCHY9BFsHfm5b']...

Job Status eval-GTtMA2M9SBYGSouHtgv2Cs: running

Job Status eval-YbisDgYUhnCHY9BFsHfm5b: completed

Job completed with status: completed
{'created_at': '2025-09-17T20:42:07.698905', 'updated_at': '2025-09-17T21:26:53.154123', 'id': 'eval-YbisDgYUhnCHY9BFsHfm5b', 'namespace': 'default', 'description': None, 'target': {'schema_version': '1.0', 'id': 'eval-target-9ipuYnMsyznDb7zUiZ7muC', 'description': None, 'type_prefix': 'eval-target', 'namespace': 'default', 'project': None, 'created_at': '2025-09-17T20:42:07.698478', 'updated_at': '2025-09-17T20:42:07.698478', 'custom_fields': {}, 'ownership': None, 'name': 'eval-target-9ipuYnMsyznDb7zUiZ7muC', 'type': 'model', 'cached_outputs': None, 'model': {'schema_version': '1.0', 'id': 'model-KrjaH39DaptdzyusEdA2cu', 'description': None, 'type_prefix': 'model', 'namespace': 'default', 'project': None, 'created_at': '2025-09-17T20:42:07.698498', 'updated_at': '2025

## Analyze Optimization Results

Once the job completes, let's examine the results to see how the prompt was optimized:


In [269]:
import requests

endpoint = f"{EVAL_URL}/v1/evaluation/jobs/{job_id_1}/results"
final_results = requests.get(endpoint).json()
final_results

{'created_at': '2025-09-17T20:42:17.192537',
 'updated_at': '2025-09-17T21:26:53.110771',
 'id': 'evaluation_result-WPdqDPDuyLdsqz1oZPrdB6',
 'job': 'eval-YbisDgYUhnCHY9BFsHfm5b',
 'files_url': 'hf://datasets/evaluation-results/eval-YbisDgYUhnCHY9BFsHfm5b',
 'tasks': {'helpfulness-prompt-optimization': {'metrics': {'number-check': {'scores': {'baseline': {'value': 0.8118000000000001,
       'stats': {'count': 85}},
      'optimized': {'value': 0.8941, 'stats': {'count': 85}}}}},
   'data': {'baseline_prompt': 'Your task is to evaluate the helpfulness of a response to a given prompt on a scale of 0-4. Output ONLY a single digit (0, 1, 2, 3, or 4) with no additional text.',
    'optimized_prompt': 'Evaluate the helpfulness of the response to the given prompt by systematically analyzing its relevance, clarity, completeness, and alignment with the prompt\'s requirements. Assign a score from 0-4 based on these criteria, ensuring the output is strictly a single digit (0, 1, 2, 3, or 4) with 

In [270]:
def analyze_optimization_results(job_response: Dict[str, Any]):
    """Analyze and display prompt optimization results."""
    
    print("=" * 80)
    print("PROMPT OPTIMIZATION RESULTS")
    print("=" * 80)
    
    # Navigate to the helpfulness task results
    tasks = job_response.get("tasks", {})
    helpfulness_task = tasks.get("helpfulness-prompt-optimization", {})
    
    # Display metrics comparison
    metrics = helpfulness_task.get("metrics", {})
    if metrics and "number-check" in metrics:
        scores = metrics["number-check"].get("scores", {})
        
        print("\n📊 PERFORMANCE METRICS:")
        print("-" * 40)
        
        baseline_score = None
        optimized_score = None
        
        if "baseline" in scores:
            baseline_data = scores["baseline"]
            baseline_score = baseline_data.get("value", 0)
            baseline_count = baseline_data.get("stats", {}).get("count", 0)
            print(f"Baseline Accuracy:  {baseline_score:.4f} (n={baseline_count})")
        
        if "optimized" in scores:
            optimized_data = scores["optimized"]
            optimized_score = optimized_data.get("value", 0)
            optimized_count = optimized_data.get("stats", {}).get("count", 0)
            print(f"Optimized Accuracy: {optimized_score:.4f} (n={optimized_count})")
        
        if baseline_score is not None and optimized_score is not None:
            improvement = optimized_score - baseline_score
            improvement_pct = (improvement / baseline_score) * 100 if baseline_score > 0 else 0
            print(f"Improvement:        {improvement:+.4f} ({improvement_pct:+.2f}%)")
    
    # Display prompts
    data = helpfulness_task.get("data", {})
    if data:
        print("\n📝 PROMPT COMPARISON:")
        print("-" * 40)
        
        if "baseline_prompt" in data:
            baseline_prompt = data["baseline_prompt"]
            print("\n🔸 BASELINE PROMPT:")
            print(f'"{baseline_prompt}"')
        
        if "optimized_prompt" in data:
            optimized_prompt = data["optimized_prompt"]
            print("\n🔹 OPTIMIZED PROMPT:")
            print(f'"{optimized_prompt}"')
            
            # Check if prompts are identical
            if baseline_prompt == optimized_prompt:
                print("\n⚠️  Note: The optimized prompt is identical to the baseline prompt.")
                print("   This suggests the optimization process found the original prompt")
                print("   was already optimal for the given task.")
    
    # Display additional metadata
    print("\n📋 JOB METADATA:")
    print("-" * 40)
    print(f"Job ID:        {job_response.get('job', 'N/A')}")
    print(f"Created:       {job_response.get('created_at', 'N/A')}")
    print(f"Updated:       {job_response.get('updated_at', 'N/A')}")
    print(f"Files URL:     {job_response.get('files_url', 'N/A')}")
    
    print("\n" + "=" * 80)
    
    return helpfulness_task

# Analyze the results
if final_status["status"] == "completed":
    optimization_results = analyze_optimization_results(final_results)
else:
    print(f"Job status: {final_status['status']}")
    print("Please wait for job completion before analyzing results.")

PROMPT OPTIMIZATION RESULTS

📊 PERFORMANCE METRICS:
----------------------------------------
Baseline Accuracy:  0.8118 (n=85)
Optimized Accuracy: 0.8941 (n=85)
Improvement:        +0.0823 (+10.14%)

📝 PROMPT COMPARISON:
----------------------------------------

🔸 BASELINE PROMPT:
"Your task is to evaluate the helpfulness of a response to a given prompt on a scale of 0-4. Output ONLY a single digit (0, 1, 2, 3, or 4) with no additional text."

🔹 OPTIMIZED PROMPT:
"Evaluate the helpfulness of the response to the given prompt by systematically analyzing its relevance, clarity, completeness, and alignment with the prompt's requirements. Assign a score from 0-4 based on these criteria, ensuring the output is strictly a single digit (0, 1, 2, 3, or 4) with no additional text. Prioritize technical precision and user-centricity in your assessment.
{"augmented": true, "prompt": "c#", "response": "C# is a high-level, object-oriented programming language developed by Microsoft as part of its .NE

## Understanding the Results

The optimization results provide several key insights:

### Metrics
- **Baseline Accuracy**: Performance of the original prompt
- **Optimized Accuracy**: Performance of the MIPROv2-optimized prompt
- **Improvement**: Quantified improvement in evaluation accuracy

### Prompts
- **Baseline Prompt**: Your original instruction
- **Optimized Prompt**: The improved prompt generated by MIPROv2, which may include:
  - Refined instructions
  - Few-shot examples
  - Better task framing


## Summary

In this notebook, we've demonstrated how to:

1. **Set up prompt optimization** with MIPROv2 using NeMo Evaluator
2. **Configure the optimization task** with proper signature and metrics
3. **Submit and monitor** optimization jobs
4. **Analyze results** to understand prompt improvements

### Key Takeaways:

- **MIPROv2** uses Bayesian Optimization to systematically improve prompts
- **Signature definition** must match your dataset structure exactly
- **Metric configuration** determines how optimization success is measured
- **Optimization intensity** (`auto`: light/medium/heavy) controls compute vs. quality tradeoff
- **Results provide both quantitative metrics and the actual optimized prompts**
