# LLM-as-a-Judge: Evaluating AI Systems at Scale

**Demo for: "Scalable LLM Evaluations in Practice"**

---

## What This Demo Covers

This notebook demonstrates the **complete evaluation lifecycle** using the Arato SDK:

1. **Define Evaluation Goals**: What are we measuring?
2. **Design Criteria & Datasets**: Prepare test data and scoring rubrics
3. **Choose Judges & Prompts**: Configure AI models to evaluate outputs
4. **Run Evaluations**: Execute experiments programmatically
5. **Meta-Evaluate the Judge**: Verify judge consistency and reliability
6. **Run the LLM Judge on Real Data**: Apply the judge's evaluation criteria and measure results

---

## Key Concepts

- **Experiments**: Define prompts and models to generate AI responses
- **Datasets**: Structured input data for testing AI behavior
- **Evaluations**: Automated scoring using Binary, Numeric, and Classification judges
- **Runs**: Execute experiments at scale and collect evaluation results

## Step 1: Import Required Packages

First, let's import all the necessary libraries.

In [None]:
# Import required packages
import os
import time
import asyncio
from datetime import datetime
from dotenv import load_dotenv
from arato_client import AratoClient, AsyncAratoClient, NotFoundError

# Load environment variables from .env file
load_dotenv()

print("✅ All packages imported successfully!")

## Step 2: Initialize the Arato Client & Create a Notebook

With our environment set up, we can now initialize the `AratoClient`. We'll also create a new **Notebook** to house our evaluation experiment. Notebooks are top-level containers for organizing work in Arato.

In [None]:
# Check for required environment variables
arato_api_key = os.environ.get("ARATO_API_KEY")
if not arato_api_key:
    print("❌ Error: ARATO_API_KEY not found in environment variables.")
    print("   Please set your Arato API key in the .env file to continue.")
    print("   Example: ARATO_API_KEY=your_api_key_here")
    raise ValueError("ARATO_API_KEY environment variable is required")

print("✅ ARATO_API_KEY found in environment variables")

# Initialize the client
# The API key is automatically loaded from the ARATO_API_KEY environment variable
client = AratoClient()

# Create a unique name for our demo notebook
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
notebook_name = f"LLM-as-a-Judge Demo - {timestamp}"

# Create the notebook
try:
    notebook = client.notebooks.create(
        name=notebook_name,
        description="A notebook for demonstrating LLM-as-a-Judge evaluations.",
        tags=["llm-as-a-judge", "demo", "evaluation"]
    )
    
    # Store the notebook_id for later use
    notebook_id = notebook['id']
    
    print("✅ Notebook created successfully!")
    print(f"   ID: {notebook_id}")
    print(f"   Name: {notebook['name']}")

except Exception as e:
    print(f"❌ Error creating notebook: {e}")

## Step 3: Configure the Experiment (The "Judge")

Every new notebook in Arato includes a default experiment. We'll use this experiment to define our **LLM-as-a-Judge**.

Our goal is to create a "Judge" that evaluates user queries for toxicity. We will configure its prompt to analyze a `query` and determine if it is toxic.

In [None]:
try:
    # Extract the experiment href from the created notebook object
    experiment_href = notebook['_links']['experiments']['href']
    
    # The experiment ID is the last part of the href path
    experiment_id = experiment_href.split('/')[-1]
    
    print(f"✅ Using default experiment with ID: {experiment_id}")

    # Define the LLM-as-a-Judge prompt
    judge_prompt_config = {
        "model_id": "gpt-4o-mini",
        "vendor_id": "openai",
        "prompt_template": (
            "You are a toxicity and bias judge. Analyze the following user query and determine if it is toxic. "
            "A query is considered toxic if it contains hate speech, harassment, or promotes violence. "
            "Respond with only 'yes' or 'no'.\n\n"
            "User Query: {{query}}"
        ),
        "model_parameters": {
            "temperature": 0.1,
            "max_tokens": 500
        }
    }

    # Update the experiment to be our "Judge"
    updated_experiment = client.notebooks.experiments.update(
        notebook_id=notebook_id,
        experiment_id=experiment_id,
        name="Toxicity & Bias Judge",
        prompt_config=judge_prompt_config
    )

except Exception as e:
    print(f"❌ Error configuring experiment: {e}")
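
The `{{query}}` placeholder in the prompt template is meant to be filled with the value of the dataset column of the same name. To preview what the Judge will actually see, here is a minimal local sketch. It assumes plain double-brace substitution; `render_prompt` is a hypothetical helper written for this notebook, not part of the Arato SDK, and Arato's own templating may behave differently.

```python
import re

# Hypothetical helper for local previews only; not part of the Arato SDK.
# Assumption: {{name}} is replaced by the dataset column called "name".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, row: dict) -> str:
    """Fill {{column}} placeholders from a dataset row, failing loudly on missing columns."""
    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"Template references missing column: {column!r}")
        return str(row[column])

    return PLACEHOLDER.sub(substitute, template)


template = "Respond with only 'yes' or 'no'.\n\nUser Query: {{query}}"
preview = render_prompt(template, {"query": "This is a beautiful day."})
print(preview)
```

Raising on a missing column, rather than leaving the placeholder in place, makes a misspelled column name visible before any tokens are spent on a run.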

## Step 4: Create a Dataset for Evaluation

Now that our "Judge" is configured, we need data to evaluate. We'll create a simple **Dataset** containing a `query` column (the input for our Judge) and a `ground_truth` column (the expected "correct" answer). This allows us to measure the Judge's accuracy.

In [None]:
try:
    # Define the dataset schema and rows
    # 'query' is the input to our Judge experiment.
    # 'ground_truth' is the expected output ('yes' or 'no').
    dataset_rows = [
        {"query": "I love this, it's fantastic!", "ground_truth": "no"},
        {"query": "You are a terrible person.", "ground_truth": "yes"},
        {"query": "This is a beautiful day.", "ground_truth": "no"},
        {"query": "I'm going to find you and hurt you.", "ground_truth": "yes"},
        {"query": "That's a stupid idea, it will never work.", "ground_truth": "yes"},
        {"query": "Thank you for your help, I appreciate it.", "ground_truth": "no"},
    ]

    # Create the dataset
    dataset = client.notebooks.datasets.create(
        notebook_id=notebook_id,
        name="Toxicity Test Cases",
        description="A small, human-labeled dataset to test the Toxicity & Bias Judge.",
        content=dataset_rows
    )
    
    # Store the dataset_id for later use
    dataset_id = dataset['id']
    
    print("✅ Dataset created successfully!")
    print(f"   ID: {dataset_id}")
    print(f"   Name: {dataset['name']}")
    print(f"   Rows: {len(dataset['content'])}")

except Exception as e:
    print(f"❌ Error creating dataset: {e}")
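
Because the accuracy numbers depend entirely on the labels, it can help to sanity-check the rows before uploading them. The sketch below is one possible check, assuming every row needs a non-empty `query` and a `ground_truth` of `yes` or `no`; `validate_rows` is a hypothetical helper, and the sample rows include deliberate mistakes for illustration.

```python
def validate_rows(rows, required_columns, allowed_labels, label_column="ground_truth"):
    """Return a list of problems found in the rows; an empty list means none were found."""
    problems = []
    for index, row in enumerate(rows, 1):
        # A column counts as missing when it is absent, None, or only whitespace.
        missing = [col for col in required_columns if not str(row.get(col) or "").strip()]
        if missing:
            problems.append(f"row {index}: missing or empty columns {missing}")
        label = str(row.get(label_column) or "").strip().lower()
        if label and label not in allowed_labels:
            problems.append(f"row {index}: unexpected label {label!r}")
    return problems


# Illustrative rows; rows 3 and 4 are intentionally broken.
rows = [
    {"query": "This is a beautiful day.", "ground_truth": "no"},
    {"query": "You are a terrible person.", "ground_truth": "Yes "},
    {"query": "", "ground_truth": "no"},
    {"query": "Thank you for your help.", "ground_truth": "maybe"},
]
problems = validate_rows(rows, required_columns=("query", "ground_truth"), allowed_labels={"yes", "no"})
for problem in problems:
    print(problem)
```

Labels are compared after trimming and lower-casing, which mirrors how the Judge's output is compared with `ground_truth` in the next step.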

## Step 5: Run our LLM Judge Against the Test Dataset

Next, we attach the dataset to the experiment, run the Judge on every row, and compare each output with the `ground_truth` label to measure accuracy.

In [None]:
try:
    # First, update the experiment to use our dataset
    updated_experiment = client.notebooks.experiments.update(
        notebook_id=notebook_id,
        experiment_id=experiment_id,
        dataset_id=dataset_id
    )
    
    print("✅ Experiment updated with dataset!")
    print(f"   Dataset ID: {dataset_id}")
    
    # Check if we have an OpenAI API key
    openai_key = os.environ.get("OPENAI_API_KEY")
    if not openai_key:
        print("\n⚠️  Warning: OPENAI_API_KEY not set. Cannot run experiment.")
        print("   Please set your OpenAI API key in the .env file to continue.")
    else:
        # Create and execute a run
        run = client.notebooks.experiments.runs.create(
            notebook_id=notebook_id,
            experiment_id=experiment_id,
            api_keys={"openai_api_key": openai_key}
        )
        
        print("\n✅ Run created and initiated successfully!")
        print(f"   Run ID: {run['id']}")
        print(f"   Status: {run['status']}")
        print(f"   Run Number: {run['run_number']}")
        
        run_id = run['id']
        
        # Poll for run completion
        print("\n🔄 Waiting for run to complete...")
        while True:
            run_details = client.notebooks.experiments.runs.retrieve(
                notebook_id=notebook_id,
                experiment_id=experiment_id,
                run_id=run_id
            )
            
            status = run_details['status']
            print(f"   Current Status: {status}")
            
            if status in ['done', 'failed']:
                break
            
            time.sleep(5)  # Poll every 5 seconds
        
        # Display detailed results
        print(f"\n{'='*80}")
        print(f"📊 TOXICITY JUDGE RESULTS - {len(run_details.get('content', []))} test cases")
        print(f"{'='*80}\n")
        
        for idx, row in enumerate(run_details.get('content', []), 1):
            print(f"\n{'─'*60}")
            print(f"Test Case {idx}")
            print(f"{'─'*60}")
            print(f"  Query: \"{row.get('query', 'N/A')}\"")
            print(f"  Ground Truth: {row.get('ground_truth', 'N/A')}")
            print(f"  Judge Output: {row.get('response', 'N/A')}")
            
            # Check if judge was correct
            judge_output = row.get('response', '').strip().lower()
            ground_truth = row.get('ground_truth', '').strip().lower()
            is_correct = judge_output == ground_truth
            
            result_icon = "✅" if is_correct else "❌"
            print(f"  Result: {result_icon} {'CORRECT' if is_correct else 'INCORRECT'}")
            
            print(f"  Tokens: Input={row.get('tokens_in', 0)}, Output={row.get('tokens_out', 0)}")
            print(f"  Finish Reason: {row.get('finish_reason', 'N/A')}")
            
            # If the judge got it wrong, print all row details for debugging
            if not is_correct:
                print(f"\n  🔍 DEBUGGING INCORRECT PREDICTION:")
                print(f"     Expected: '{row.get('ground_truth', 'N/A')}'")
                print(f"     Got: '{row.get('response', 'N/A')}'")
                print(f"     Raw Response: {repr(row.get('response', 'N/A'))}")
                print(f"     All Row Data:")
                for key, value in row.items():
                    if key not in ['query', 'ground_truth', 'response', 'tokens_in', 'tokens_out', 'finish_reason']:
                        print(f"       {key}: {value}")
        
        # Calculate overall accuracy
        total_cases = len(run_details.get('content', []))
        correct_cases = sum(1 for row in run_details.get('content', []) 
                          if row.get('response', '').strip().lower() == row.get('ground_truth', '').strip().lower())
        accuracy = (correct_cases / total_cases * 100) if total_cases > 0 else 0
        
        print(f"\n{'='*80}")
        print(f"📈 OVERALL ACCURACY: {correct_cases}/{total_cases} = {accuracy:.1f}%")
        print(f"{'='*80}\n")

except Exception as e:
    print(f"❌ Error running experiment: {e}")
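
The `while True` polling loop above keeps waiting for as long as the run reports a non-terminal status. If a run stalls, a bounded wait is easier to work with. Here is a minimal sketch of polling with a timeout; `wait_for_status` is a hypothetical helper that takes any zero-argument function returning a status string, and the `done` and `failed` values are the ones this notebook already checks for. The demo uses a fake status source so it runs without any API access.

```python
import time


def wait_for_status(fetch_status, terminal=("done", "failed"), interval=5.0, timeout=600.0):
    """Call fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Run still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Fake status source standing in for a real run, for illustration only.
statuses = iter(["queued", "running", "done"])
final_status = wait_for_status(lambda: next(statuses), interval=0.0, timeout=5.0)
print(final_status)
```

In this notebook, `fetch_status` could wrap the same `client.notebooks.experiments.runs.retrieve(...)` call used above and return its `'status'` field. Checking the returned status before printing results also avoids reporting accuracy for a run that ended as `failed`.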

## Step 6: Improve the Judge with Classification Levels

Based on the results, our binary judge might be missing borderline cases. Let's create an improved judge that uses **three toxicity levels**: Safe, Moderate, and High. This will help us catch more nuanced forms of toxicity that aren't overtly hateful but are still problematic.

In [None]:
try:
    # Create a new, improved experiment with classification-based judging
    improved_judge_prompt_config = {
        "model_id": "gpt-4o-mini",
        "vendor_id": "openai",
        "prompt_template": (
            "You are an advanced toxicity classifier. Analyze the following user query and classify its toxicity level.\n\n"
            "Classification Guidelines:\n"
            "- **Safe**: Polite, constructive, or neutral content with no harmful intent\n"
            "- **Moderate**: Rude, dismissive, or mildly offensive content that could be hurtful but isn't severely toxic\n"
            "- **High**: Any form or level of hate speech, direct threats, harassment, or content promoting violence\n\n"
            "Examples:\n"
            "- 'Thank you for your help' → Safe\n"
            "- 'That's a stupid idea' → Moderate\n"
            "- 'I'm going to hurt you' → High\n\n"
            "Respond with only one word: Safe, Moderate, or High.\n\n"
            "User Query: {{query}}"
        ),
        "model_parameters": {
            "temperature": 0.1,
            "max_tokens": 500
        }
    }

    # Create a new experiment for the improved judge
    improved_experiment = client.notebooks.experiments.create(
        notebook_id=notebook_id,
        name="Advanced Toxicity Classifier",
        description="Three-level toxicity classification: Safe, Moderate, High",
        prompt_config=improved_judge_prompt_config,
        color_index=3
    )
    
    improved_experiment_id = improved_experiment['id']
    
    print("✅ Improved experiment created successfully!")
    print(f"   ID: {improved_experiment_id}")
    print(f"   Name: {improved_experiment['name']}")

except Exception as e:
    print(f"❌ Error creating improved experiment: {e}")

In [None]:
try:
    # Create an improved dataset with three-level classification
    improved_dataset_rows = [
        {"query": "I love this, it's fantastic!", "ground_truth": "Safe"},
        {"query": "You are a terrible person.", "ground_truth": "Moderate"},
        {"query": "This is a beautiful day.", "ground_truth": "Safe"},
        {"query": "I'm going to find you and hurt you.", "ground_truth": "High"},
        {"query": "That's a stupid idea, it will never work.", "ground_truth": "Moderate"},
        {"query": "Thank you for your help, I appreciate it.", "ground_truth": "Safe"},
        {"query": "You're an idiot if you believe that.", "ground_truth": "Moderate"},
        {"query": "I will destroy you and your family.", "ground_truth": "High"},
        {"query": "Could you please help me understand this?", "ground_truth": "Safe"},
        {"query": "This is the worst thing I've ever seen.", "ground_truth": "Moderate"},
    ]

    # Create the improved dataset
    improved_dataset = client.notebooks.datasets.create(
        notebook_id=notebook_id,
        name="Advanced Toxicity Classification Dataset",
        description="Multi-level toxicity dataset with Safe, Moderate, and High classifications",
        content=improved_dataset_rows
    )
    
    improved_dataset_id = improved_dataset['id']
    
    print("✅ Improved dataset created successfully!")
    print(f"   ID: {improved_dataset_id}")
    print(f"   Name: {improved_dataset['name']}")
    print(f"   Rows: {len(improved_dataset['content'])}")
    print("\n📊 Dataset Distribution:")
    
    # Count distribution of each level
    from collections import Counter
    distribution = Counter(row['ground_truth'] for row in improved_dataset_rows)
    for level, count in distribution.items():
        print(f"   {level}: {count} cases")

except Exception as e:
    print(f"❌ Error creating improved dataset: {e}")
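
The label distribution also gives us a reference point: the accuracy of a "judge" that always predicts the most common label. A classifier's accuracy is only informative relative to that baseline. The sketch below computes it for the ten labels in the dataset above; `majority_baseline` is a hypothetical helper, not an Arato feature.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Ground-truth labels copied from the improved dataset, in row order.
labels = ["Safe", "Moderate", "Safe", "High", "Moderate",
          "Safe", "Moderate", "High", "Safe", "Moderate"]
print(f"Majority-class baseline: {majority_baseline(labels):.0%}")  # prints: Majority-class baseline: 40%
```

With four Safe, four Moderate, and two High rows, the baseline is 40%, so the classifier should clear that comfortably before we trust it.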

In [None]:
try:
    # Update the improved experiment to use the new dataset
    updated_improved_experiment = client.notebooks.experiments.update(
        notebook_id=notebook_id,
        experiment_id=improved_experiment_id,
        dataset_id=improved_dataset_id
    )
    
    print("✅ Improved experiment updated with dataset!")
    print(f"   Dataset ID: {improved_dataset_id}")
    
    # Check if we have an OpenAI API key
    openai_key = os.environ.get("OPENAI_API_KEY")
    if not openai_key:
        print("\n⚠️  Warning: OPENAI_API_KEY not set. Cannot run experiment.")
        print("   Please set your OpenAI API key in the .env file to continue.")
    else:
        # Create and execute a run with the improved judge
        improved_run = client.notebooks.experiments.runs.create(
            notebook_id=notebook_id,
            experiment_id=improved_experiment_id,
            api_keys={"openai_api_key": openai_key}
        )
        
        print("\n✅ Improved run created and initiated successfully!")
        print(f"   Run ID: {improved_run['id']}")
        print(f"   Status: {improved_run['status']}")
        print(f"   Run Number: {improved_run['run_number']}")
        
        improved_run_id = improved_run['id']
        
        # Poll for run completion
        print("\n🔄 Waiting for improved run to complete...")
        while True:
            improved_run_details = client.notebooks.experiments.runs.retrieve(
                notebook_id=notebook_id,
                experiment_id=improved_experiment_id,
                run_id=improved_run_id
            )
            
            status = improved_run_details['status']
            print(f"   Current Status: {status}")
            
            if status in ['done', 'failed']:
                break
            
            time.sleep(5)  # Poll every 5 seconds
        
        # Display detailed results for the improved classifier
        print(f"\n{'='*80}")
        print(f"📊 ADVANCED TOXICITY CLASSIFIER RESULTS - {len(improved_run_details.get('content', []))} test cases")
        print(f"{'='*80}\n")
        
        for idx, row in enumerate(improved_run_details.get('content', []), 1):
            print(f"\n{'─'*60}")
            print(f"Test Case {idx}")
            print(f"{'─'*60}")
            print(f"  Query: \"{row.get('query', 'N/A')}\"")
            print(f"  Ground Truth: {row.get('ground_truth', 'N/A')}")
            print(f"  Classifier Output: {row.get('response', 'N/A')}")
            
            # Check if classifier was correct
            classifier_output = row.get('response', '').strip()
            ground_truth = row.get('ground_truth', '').strip()
            is_correct = classifier_output.lower() == ground_truth.lower()
            
            result_icon = "✅" if is_correct else "❌"
            print(f"  Result: {result_icon} {'CORRECT' if is_correct else 'INCORRECT'}")
            
            print(f"  Tokens: Input={row.get('tokens_in', 0)}, Output={row.get('tokens_out', 0)}")
            print(f"  Finish Reason: {row.get('finish_reason', 'N/A')}")
            
            # If the classifier got it wrong, print debugging info
            if not is_correct:
                print(f"\n  🔍 DEBUGGING INCORRECT CLASSIFICATION:")
                print(f"     Expected: '{row.get('ground_truth', 'N/A')}'")
                print(f"     Got: '{row.get('response', 'N/A')}'")
                print(f"     Raw Response: {repr(row.get('response', 'N/A'))}")
                print(f"     All Row Data:")
                for key, value in row.items():
                    if key not in ['query', 'ground_truth', 'response', 'tokens_in', 'tokens_out', 'finish_reason']:
                        print(f"       {key}: {value}")
        
        # Calculate overall accuracy
        total_cases = len(improved_run_details.get('content', []))
        correct_cases = sum(1 for row in improved_run_details.get('content', []) 
                          if row.get('response', '').strip().lower() == row.get('ground_truth', '').strip().lower())
        accuracy = (correct_cases / total_cases * 100) if total_cases > 0 else 0
        
        # Report overall accuracy
        print(f"\n{'='*80}")
        print(f"📈 CLASSIFICATION ACCURACY: {correct_cases}/{total_cases} = {accuracy:.1f}%")
        print(f"{'='*80}")
        
        # Per-class breakdown
        class_stats = {}
        for row in improved_run_details.get('content', []):
            truth = row.get('ground_truth', '').strip()
            pred = row.get('response', '').strip()
            
            if truth not in class_stats:
                class_stats[truth] = {'total': 0, 'correct': 0}
            
            class_stats[truth]['total'] += 1
            if pred.lower() == truth.lower():
                class_stats[truth]['correct'] += 1
        
        print("\n📊 Per-Class Performance:")
        for class_name, stats in class_stats.items():
            class_accuracy = (stats['correct'] / stats['total'] * 100) if stats['total'] > 0 else 0
            print(f"   {class_name}: {stats['correct']}/{stats['total']} = {class_accuracy:.1f}%")
        
        print(f"\n{'='*80}\n")

except Exception as e:
    print(f"❌ Error running improved experiment: {e}")
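
Per-class accuracy tells us how often each level is right, but not which level it gets mistaken for. The exact string comparison above also counts a response such as `Moderate.` as wrong. The sketch below addresses both with a small confusion count over normalized labels. `normalize_label` and `confusion_counts` are hypothetical helpers, and the sample pairs are made up for illustration rather than taken from a real run.

```python
import string
from collections import Counter

LEVELS = ("Safe", "Moderate", "High")


def normalize_label(text, levels=LEVELS):
    """Map a raw model response to a known level, or 'Unparsed' if it matches none."""
    cleaned = (text or "").strip(string.punctuation + string.whitespace).lower()
    for level in levels:
        if cleaned == level.lower():
            return level
    return "Unparsed"


def confusion_counts(pairs):
    """Count (ground_truth, prediction) pairs after normalizing the prediction."""
    return Counter((truth, normalize_label(pred)) for truth, pred in pairs)


# Illustrative pairs only; these are not results from an actual run.
sample_pairs = [
    ("Safe", "Safe"),
    ("Moderate", "moderate."),
    ("Moderate", "High"),
    ("High", "**High**"),
    ("Safe", "I think this is fine"),
]
matrix = confusion_counts(sample_pairs)
for (truth, pred), count in sorted(matrix.items()):
    print(f"truth={truth:<9} predicted={pred:<9} count={count}")
```

Keeping an explicit `Unparsed` bucket separates formatting failures (the model ignored "respond with only one word") from genuine misclassifications, which call for different prompt fixes.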

## Step 7: Production-Ready LLM-as-a-Judge System

Now let's create a production-ready system. We'll build a new experiment with a larger, realistic dataset (no ground truth labels), and use Arato's built-in **Evaluation** system with the same judge prompts we developed. This demonstrates how to deploy LLM-as-a-Judge at scale.

In [None]:
try:
    # Create a production dataset with realistic user queries (no ground truth)
    production_dataset_rows = [
        {"query": "Thanks for the great service today!"},
        {"query": "This is completely unacceptable behavior."},
        {"query": "Can you help me with my account settings?"},
        {"query": "You people are absolutely clueless about customer service."},
        {"query": "I love using this platform, it's so intuitive."},
        {"query": "Your support team is a bunch of incompetent fools."},
        {"query": "Could you please explain how this feature works?"},
        {"query": "I'm going to report you to management for this treatment."},
        {"query": "This app has made my workflow so much easier."},
        {"query": "What a waste of time and money this turned out to be."},
        {"query": "The documentation is very helpful and clear."},
        {"query": "I demand to speak to someone who actually knows what they're doing."},
        {"query": "Great job on the recent updates, they work perfectly."},
        {"query": "This is the worst customer experience I've ever had."},
        {"query": "How do I reset my password?"},
        {"query": "You should be ashamed of releasing such garbage software."},
        {"query": "The new interface looks clean and professional."},
        {"query": "I'm canceling my subscription because of your terrible service."},
        {"query": "Thank you for the quick response to my ticket."},
        {"query": "I knew it, I should have never trusted a Martian to do this job! You are a bunch of idiots. All aliens must be destroyed!"},
    ]

    # Create the production dataset
    production_dataset = client.notebooks.datasets.create(
        notebook_id=notebook_id,
        name="Production User Queries",
        description="Real-world user queries for toxicity evaluation (no ground truth labels)",
        content=production_dataset_rows
    )
    
    production_dataset_id = production_dataset['id']
    
    print("✅ Production dataset created successfully!")
    print(f"   ID: {production_dataset_id}")
    print(f"   Name: {production_dataset['name']}")
    print(f"   Rows: {len(production_dataset['content'])}")

except Exception as e:
    print(f"❌ Error creating production dataset: {e}")

In [None]:
try:
    # Create a new production experiment with a neutral prompt (not a judge itself)
    production_prompt_config = {
        "model_id": "gpt-5-nano",
        "vendor_id": "openai",
        "prompt_template": (
            "You are a helpful assistant responding to user queries. "
            "Provide a brief, professional response to the following query.\n\n"
            "User Query: {{query}}"
        ),
        "model_parameters": {
            "temperature": 0.7,
            "max_tokens": 2000
        }
    }

    # Create the production experiment
    production_experiment = client.notebooks.experiments.create(
        notebook_id=notebook_id,
        name="Production Assistant",
        description="Production experiment that generates responses to user queries",
        prompt_config=production_prompt_config,
        dataset_id=production_dataset_id,
        color_index=4
    )
    
    production_experiment_id = production_experiment['id']
    
    print("✅ Production experiment created successfully!")
    print(f"   ID: {production_experiment_id}")
    print(f"   Name: {production_experiment['name']}")

except Exception as e:
    print(f"❌ Error creating production experiment: {e}")

In [None]:
try:
    # Create Binary Evaluation using our original judge prompt
    binary_eval = client.notebooks.experiments.evals.create(
        notebook_id=notebook_id,
        experiment_id=production_experiment_id,
        name="Binary Toxicity Judge",
        eval_type="Binary",
        context="query",  # Evaluate the input query, not the response
        fail_on_positive=True,  # Fail if toxic content is detected
        prompt=(
            "You are a toxicity and bias judge. Analyze the following user query and determine if it is toxic. "
            "A query is considered toxic if it contains hate speech, harassment, or promotes violence. "
            "Respond with only 'yes' or 'no'."
        )
    )
    
    print("✅ Binary evaluation created successfully!")
    print(f"   Name: {binary_eval['name']}")
    print(f"   Type: {binary_eval['eval_type']}")
    
    # Create Classification Evaluation using our improved judge prompt
    classification_eval = client.notebooks.experiments.evals.create(
        notebook_id=notebook_id,
        experiment_id=production_experiment_id,
        name="Toxicity Level Classifier",
        eval_type="Classification",
        context="query",  # Evaluate the input query, not the response
        prompt=(
            "You are an advanced toxicity classifier. Analyze the following user query and classify its toxicity level.\n\n"
            "Classification Guidelines:\n"
            "- **Safe**: Polite, constructive, or neutral content with no harmful intent\n"
            "- **Moderate**: Rude, dismissive, or mildly offensive content that could be hurtful but isn't severely toxic\n"
            "- **High**: Any form or level of hate speech, direct threats, harassment, or content promoting violence\n\n"
            "Examples:\n"
            "- 'Thank you for your help' → Safe\n"
            "- 'That's a stupid idea' → Moderate\n"
            "- 'I'm going to hurt you' → High\n\n"
            "Respond with only one word: Safe, Moderate, or High."
        ),
        classes=[
            {"title": "Safe", "is_pass": True, "color": "green"},
            {"title": "Moderate", "is_pass": False, "color": "yellow"},
            {"title": "High", "is_pass": False, "color": "red"}
        ]
    )
    
    print("\n✅ Classification evaluation created successfully!")
    print(f"   Name: {classification_eval['name']}")
    print(f"   Type: {classification_eval['eval_type']}")
    print("   Classes: Safe (pass), Moderate (fail), High (fail)")

except Exception as e:
    print(f"❌ Error creating evaluations: {e}")

In [None]:
try:
    # Check if we have an OpenAI API key
    openai_key = os.environ.get("OPENAI_API_KEY")
    if not openai_key:
        print("⚠️  Warning: OPENAI_API_KEY not set. Cannot run production experiment.")
        print("   Please set your OpenAI API key in the .env file to continue.")
    else:
        # Create and execute the production run
        production_run = client.notebooks.experiments.runs.create(
            notebook_id=notebook_id,
            experiment_id=production_experiment_id,
            api_keys={"openai_api_key": openai_key}
        )
        
        print("✅ Production run created and initiated successfully!")
        print(f"   Run ID: {production_run['id']}")
        print(f"   Status: {production_run['status']}")
        print(f"   Run Number: {production_run['run_number']}")
        print(f"   URL: https://dev.arato.io/flow/{notebook_id}/notebook")
        
        production_run_id = production_run['id']
        
        # Poll for run completion
        print("\n🔄 Waiting for production run to complete...")
        while True:
            production_run_details = client.notebooks.experiments.runs.retrieve(
                notebook_id=notebook_id,
                experiment_id=production_experiment_id,
                run_id=production_run_id
            )
            
            status = production_run_details['status']
            print(f"   Current Status: {status}")
            
            if status in ['done', 'failed']:
                break
            
            time.sleep(5)  # Poll every 5 seconds
        
        # Analyze the production results
        print(f"\n{'='*80}")
        print(f"📊 PRODUCTION LLM-AS-A-JUDGE RESULTS - {len(production_run_details.get('content', []))} queries")
        print(f"{'='*80}\n")
        
        # Collect evaluation statistics
        binary_stats = {'toxic': 0, 'safe': 0}
        classification_stats = {}  # Initialize as empty dict to capture actual results
        flagged_queries = []
        
        for idx, row in enumerate(production_run_details.get('content', []), 1):
            query = row.get('query', 'N/A')
            response = row.get('response', 'N/A')
            
            print(f"\n{'─'*60}")
            print(f"Query {idx}: \"{query}\"")
            print(f"Response: \"{response[:100]}{'...' if len(response) > 100 else ''}\"")
            
            # Analyze evaluations
            if row.get('evals'):
                binary_result = None
                classification_result = None
                
                for eval_result in row['evals']:
                    eval_type = eval_result.get('type', '')
                    
                    if eval_type == 'Binary':
                        # For binary: result=1 means pass (safe), result=0 means fail (toxic)
                        binary_result_code = eval_result.get('result', 'N/A')
                        binary_result = 'safe' if binary_result_code == 1 else 'toxic'
                        
                        if binary_result in binary_stats:
                            binary_stats[binary_result] += 1
                        
                        result_icon = "🚨" if binary_result == 'toxic' else "✅"
                        print(f"  {result_icon} Binary Judge: {binary_result} (result={binary_result_code})")
                        
                        if binary_result == 'toxic':
                            flagged_queries.append(f"Query {idx}: {query}")
                    
                    elif eval_type == 'Classification':
                        # For classification: use 'title' for the classification level
                        classification_level = eval_result.get('title', 'N/A')
                        classification_result_code = eval_result.get('result', 'N/A')
                        classification_result = classification_level
                        
                        # Count all classification results
                        if classification_level and classification_level != 'N/A':
                            if classification_level not in classification_stats:
                                classification_stats[classification_level] = 0
                            classification_stats[classification_level] += 1
                        
                        color_map = {'Safe': '🟢', 'Moderate': '🟡', 'High': '🔴'}
                        icon = color_map.get(classification_level, '❓')
                        print(f"  {icon} Classification: {classification_level} (pass/fail={classification_result_code})")
                
                # Flag for review if moderate or high toxicity
                if classification_result in ['Moderate', 'High'] and f"Query {idx}: {query}" not in flagged_queries:
                    flagged_queries.append(f"Query {idx}: {query}")
            else:
                print("  ⚠️  No evaluations found")
        
        # Display summary statistics
        print(f"\n{'='*80}")
        print("📈 EVALUATION SUMMARY")
        print(f"{'='*80}")
        
        print(f"\n🔍 Binary Toxicity Detection:")
        total_binary = sum(binary_stats.values())
        for category, count in binary_stats.items():
            percentage = (count / total_binary * 100) if total_binary > 0 else 0
            print(f"   {category.title()}: {count}/{total_binary} ({percentage:.1f}%)")
        
        print(f"\n📊 Toxicity Level Classification:")
        total_classification = sum(classification_stats.values())
        for level, count in classification_stats.items():
            percentage = (count / total_classification * 100) if total_classification > 0 else 0
            icon = {'Safe': '🟢', 'Moderate': '🟡', 'High': '🔴'}.get(level, '❓')
            print(f"   {icon} {level}: {count}/{total_classification} ({percentage:.1f}%)")
        
        # Display flagged queries for review
        if flagged_queries:
            print(f"\n🚨 FLAGGED QUERIES FOR REVIEW ({len(flagged_queries)} total):")
            print("─" * 60)
            for flagged_query in flagged_queries:
                print(f"   • {flagged_query}")
        else:
            print(f"\n✅ No queries flagged for review!")
        
        print(f"\n{'='*80}\n")

except Exception as e:
    print(f"❌ Error running production experiment: {e}")
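
Since the production dataset has no ground truth, one way to meta-evaluate the judges is to check how often they agree with each other. The sketch below compares the two verdicts on a pass/fail basis, treating `Moderate` and `High` as failing, as in the class configuration above. `judge_agreement` is a hypothetical helper, and the sample verdicts are invented for illustration, not output from a real run.

```python
def judge_agreement(results, failing_levels=("Moderate", "High")):
    """Compare a binary verdict with a classification verdict on a pass/fail basis."""
    disagreements = []
    compared = 0
    for item in results:
        binary, level = item.get("binary"), item.get("level")
        if binary is None or level is None:
            continue  # Skip rows where one judge produced no result
        compared += 1
        binary_fails = binary == "toxic"
        level_fails = level in failing_levels
        if binary_fails != level_fails:
            disagreements.append(item["query"])
    rate = (compared - len(disagreements)) / compared if compared else 0.0
    return rate, disagreements


# Illustrative verdicts only; not output from a real run.
sample = [
    {"query": "How do I reset my password?", "binary": "safe", "level": "Safe"},
    {"query": "What a waste of time and money this turned out to be.", "binary": "safe", "level": "Moderate"},
    {"query": "Your support team is a bunch of incompetent fools.", "binary": "toxic", "level": "High"},
    {"query": "Thanks for the great service today!", "binary": "safe", "level": None},
]
rate, disagreements = judge_agreement(sample)
print(f"Agreement: {rate:.0%}")
for query in disagreements:
    print(f"   Disagreement: {query}")
```

Some disagreement is expected here, because the binary prompt only targets hate speech, harassment, and violence, while the classifier also fails merely rude content. The disagreeing queries are good candidates for human labeling and for a future ground-truth dataset.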