# LLM-as-a-Judge: Evaluating AI Systems at Scale

**Demo for: "Scalable LLM Evaluations in Practice"**

---

## What This Demo Covers

This notebook demonstrates the **complete evaluation lifecycle** using the Arato SDK:

1. **Define Evaluation Goals**: What are we measuring?
2. **Design Criteria & Datasets**: Prepare test data and scoring rubrics
3. **Choose Judges & Prompts**: Configure AI models to evaluate outputs
4. **Run Evaluations**: Execute experiments programmatically
5. **Meta-Evaluate the Judge**: Verify judge consistency and reliability
6. **Run the LLM Judge on Real Data**: Apply the judge's evaluation criteria to real queries and measure the results
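
Item 5 in the list above, judge consistency, can be estimated by running the judge several times on the same inputs and measuring how often its verdicts agree. The following is a minimal sketch, assuming the repeated verdicts have already been collected. `agreement_rate` is a hypothetical helper and the verdicts are invented for illustration.

```python
from collections import Counter


def agreement_rate(verdicts):
    """Fraction of repeated verdicts that match the most common verdict for one input."""
    if not verdicts:
        return 0.0
    normalized = Counter(v.strip().lower() for v in verdicts)
    most_common_count = normalized.most_common(1)[0][1]
    return most_common_count / len(verdicts)


# Invented example: three repeated judge verdicts per query
repeated = {
    "query A": ["no", "no", "no"],
    "query B": ["yes", "no", "yes"],
}
for query, verdicts in repeated.items():
    print(query, round(agreement_rate(verdicts), 2))
```

Inputs with a low agreement rate are good candidates for manual review or for tightening the judge prompt.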

---

## Key Concepts

- **Experiments**: Define prompts and models to generate AI responses
- **Datasets**: Structured input data for testing AI behavior
- **Evaluations**: Automated scoring using Binary, Numeric, and Classification judges
- **Runs**: Execute experiments at scale and collect evaluation results

## Step 1: Import Required Packages

First, let's import all the necessary libraries.

In [None]:
# Import required packages
import os
import time
import asyncio
import pandas as pd
from collections import Counter
from datetime import datetime
from dotenv import load_dotenv
from arato_client import AratoClient, AsyncAratoClient, NotFoundError

# Load environment variables from .env file
load_dotenv()

print("✅ All packages imported successfully!")


## Step 2: Initialize the Arato Client & Create a Notebook

With our environment set up, we can now initialize the `AratoClient`. We'll also create a new **Notebook** to house our evaluation experiment. Notebooks are top-level containers for organizing work in Arato.
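
Before constructing the client, it can help to fail fast when configuration is missing. The sketch below collects every missing variable in one pass instead of stopping at the first. `require_env` is a hypothetical helper, not part of the Arato SDK, and the `DEMO_KEY_*` names are placeholders.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested environment variables, or raise listing every missing one.

    Hypothetical helper for illustration; not part of the Arato SDK.
    """
    values = {name: os.environ.get(name, "") for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return values


# Throwaway variable names so the snippet runs anywhere
os.environ["DEMO_KEY_A"] = "abc"
os.environ["DEMO_KEY_B"] = "xyz"
print(require_env("DEMO_KEY_A", "DEMO_KEY_B"))
```

In this notebook the same check would be called with `ARATO_API_KEY` and `OPENAI_API_KEY`, the two variables the cells below read.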

In [None]:
# Check for required environment variables
arato_api_key = os.environ.get("ARATO_API_KEY")
if not arato_api_key:
    print("❌ Error: ARATO_API_KEY not found in environment variables.")
    print("   Please set your Arato API key in the .env file to continue.")
    print("   Example: ARATO_API_KEY=your_api_key_here")
    raise ValueError("ARATO_API_KEY environment variable is required")

print("✅ ARATO_API_KEY found in environment variables")

# Initialize the client
# The API key is automatically loaded from the ARATO_API_KEY environment variable
client = AratoClient()

# Create a unique name for our demo notebook
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
notebook_name = f"LLM-as-a-Judge Demo - {timestamp}"

# Create the notebook
try:
    notebook = client.notebooks.create(
        name=notebook_name,
        description="A notebook for demonstrating LLM-as-a-Judge evaluations.",
        tags=["llm-as-a-judge", "demo", "evaluation"]
    )
    
    # Store the notebook_id for later use
    notebook_id = notebook['id']
    
    print("✅ Notebook created successfully!")
    print(f"   ID: {notebook_id}")
    print(f"   Name: {notebook['name']}")

except Exception as e:
    print(f"❌ Error creating notebook: {e}")
    raise  # Later cells depend on notebook_id, so stop here on failure

## Step 3: Configure the Experiment (The "Judge")

Every new notebook in Arato includes a default experiment. We'll use this experiment to define our **LLM-as-a-Judge**.

Our goal is to create a "Judge" that evaluates user queries for toxicity. We will configure its prompt to analyze a `query` and determine if it is toxic.
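
The `{{query}}` placeholder in the prompt template is presumably filled from each dataset row's `query` column when a run executes. The sketch below only mimics that substitution locally so a prompt can be previewed before it is uploaded. `render_prompt` is a hypothetical helper, and the platform's exact substitution rules are an assumption here.

```python
import re


def render_prompt(template: str, row: dict) -> str:
    """Replace {{name}} placeholders with values from row (local preview only)."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in row:
            raise KeyError(f"Template needs '{key}' but the row has no such column")
        return str(row[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Respond with only 'yes' or 'no'.\n\nUser Query: {{query}}"
print(render_prompt(template, {"query": "How do I reset my password?"}))
```

Raising on an unknown placeholder catches a mismatch between template variables and dataset columns before any tokens are spent.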

In [None]:
try:
    # Extract the experiment href from the created notebook object
    experiment_href = notebook['_links']['experiments']['href']
    
    # The experiment ID is the last part of the href path
    experiment_id = experiment_href.split('/')[-1]
    
    print(f"✅ Using default experiment with ID: {experiment_id}")

    # Define the LLM-as-a-Judge prompt
    judge_prompt_config = {
        "model_id": "gpt-4o-mini",
        "vendor_id": "openai",
        "prompt_template": (
            "You are a toxicity and bias judge. Analyze the following user query and determine if it is toxic. "
            "A query is considered toxic if it contains hate speech, harassment, or promotes violence. "
            "Respond with only 'yes' or 'no'.\n\n"
            "User Query: {{query}}"
        ),
        "model_parameters": {
            "temperature": 0.1,
            "max_tokens": 500
        }
    }

    # Update the experiment to be our "Judge"
    updated_experiment = client.notebooks.experiments.update(
        notebook_id=notebook_id,
        experiment_id=experiment_id,
        name="Toxicity & Bias Judge",
        prompt_config=judge_prompt_config
    )

except Exception as e:
    print(f"❌ Error configuring experiment: {e}")

## Step 4: Create a Dataset for Evaluation

Now that our "Judge" is configured, we need data to evaluate. We'll create a simple **Dataset** containing a `query` column (the input for our Judge) and a `ground_truth` column (the expected "correct" answer). This allows us to measure the Judge's accuracy.
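
Because accuracy is computed by string comparison against `ground_truth`, a mislabeled or misnamed column silently skews the result. The sketch below validates a CSV before upload using only the standard library. The `yes`/`no` label set is an assumption based on the judge's output format, `load_labeled_rows` is a hypothetical helper, and the sample rows are invented.

```python
import csv
import io

REQUIRED_COLUMNS = {"query", "ground_truth"}
ALLOWED_LABELS = {"yes", "no"}  # assumption: labels mirror the judge's 'yes'/'no' output


def load_labeled_rows(csv_text: str) -> list[dict]:
    """Parse CSV text and check the columns and labels the judge comparison relies on."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        label = row["ground_truth"].strip().lower()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"Line {line_no}: unexpected label {row['ground_truth']!r}")
        rows.append({"query": row["query"].strip(), "ground_truth": label})
    return rows


# Invented sample data for illustration
sample = "query,ground_truth\nHow do I reset my password?,no\nYou people are worthless,yes\n"
print(load_labeled_rows(sample))
```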

In [None]:
try:
    # Load the dataset from CSV file
    df = pd.read_csv('toxicity_test_cases.csv')
    # Convert DataFrame to list of dicts for Arato
    dataset_rows = df.to_dict('records')

    # Create the dataset in Arato
    dataset = client.notebooks.datasets.create(
        notebook_id=notebook_id,
        name="Toxicity Test Cases",
        description="A small, human-labeled dataset to test the Toxicity & Bias Judge.",
        content=dataset_rows
    )
    
    # Store the dataset_id for later use
    dataset_id = dataset['id']
    
    print(f"\n✅ Dataset created successfully in Arato!")
    print(f"   ID: {dataset_id}")
    print(f"   Name: {dataset['name']}")
    print(f"   Rows: {len(dataset['content'])}")
    display(df.head())

except FileNotFoundError:
    print("❌ Error: toxicity_test_cases.csv not found")
    print("   Please make sure the CSV file is in the same directory as this notebook")
except Exception as e:
    print(f"❌ Error creating dataset: {e}")


## Step 5: Run our LLM Judge Against the Test Dataset

Next, we attach the dataset to the experiment, run the Judge on every row, and compare each output to its `ground_truth` label to measure accuracy.
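
Accuracy alone can hide which kind of mistake the judge makes. The sketch below adds precision and recall with `yes` (toxic) as the positive class. It assumes `yes`/`no` labels, and it treats any reply other than `yes` as a negative prediction, which is a simplifying choice. `normalize_label` and `binary_metrics` are hypothetical helpers and the pairs are invented.

```python
def normalize_label(text) -> str:
    """Lowercase and strip whitespace, quotes, and trailing punctuation from a judge reply."""
    return str(text).strip().strip("'\".!").lower()


def binary_metrics(pairs):
    """Compute accuracy, precision, and recall from (expected, predicted) pairs."""
    tp = fp = fn = tn = 0
    for expected, predicted in pairs:
        gold, pred = normalize_label(expected), normalize_label(predicted)
        if pred == "yes" and gold == "yes":
            tp += 1
        elif pred == "yes":
            fp += 1
        elif gold == "yes":
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented (expected, predicted) pairs: one of each outcome
pairs = [("yes", "Yes."), ("no", "no"), ("no", "yes"), ("yes", "No")]
print(binary_metrics(pairs))
```

For a toxicity filter, low recall (missed toxic queries) is usually the costlier failure, so it is worth tracking separately from overall accuracy.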

In [None]:
try:
    # First, update the experiment to use our dataset
    updated_experiment = client.notebooks.experiments.update(
        notebook_id=notebook_id,
        experiment_id=experiment_id,
        dataset_id=dataset_id
    )
    
    print("✅ Experiment updated with dataset!")
    print(f"   Dataset ID: {dataset_id}")
    
    # Check if we have an OpenAI API key
    openai_key = os.environ.get("OPENAI_API_KEY")
    if not openai_key:
        print("\n⚠️  Warning: OPENAI_API_KEY not set. Cannot run experiment.")
        print("   Please set your OpenAI API key in the .env file to continue.")
    else:
        # Create and execute a run
        run = client.notebooks.experiments.runs.create(
            notebook_id=notebook_id,
            experiment_id=experiment_id,
            api_keys={"openai_api_key": openai_key}
        )
        
        print("\n✅ Run created and initiated successfully!")
        print(f"   Run ID: {run['id']}")
        print(f"   Status: {run['status']}")
        print(f"   Run Number: {run['run_number']}")
        
        run_id = run['id']
        
        # Poll for run completion
        print("\n🔄 Waiting for run to complete...")
        while True:
            run_details = client.notebooks.experiments.runs.retrieve(
                notebook_id=notebook_id,
                experiment_id=experiment_id,
                run_id=run_id
            )
            
            status = run_details['status']
            print(f"   Current Status: {status}")
            
            if status in ['done', 'failed']:
                break
            
            time.sleep(5)  # Poll every 5 seconds
        
        # Prepare results for table display
        results_data = []
        correct_cases = 0
        total_cases = len(run_details.get('content', []))
        
        for idx, row in enumerate(run_details.get('content', []), 1):
            query = row.get('query', 'N/A')
            ground_truth = row.get('ground_truth', 'N/A')
            judge_output = row.get('response', 'N/A')
            
            # Check if judge was correct
            is_correct = judge_output.strip().lower() == ground_truth.strip().lower()
            if is_correct:
                correct_cases += 1
            
            result_icon = "✅" if is_correct else "❌"
            result_text = f"{result_icon} {'CORRECT' if is_correct else 'INCORRECT'}"
            
            results_data.append({
                'Test Case': idx,
                'Query': query[:50] + '...' if len(query) > 50 else query,
                'Ground Truth': ground_truth,
                'Judge Output': judge_output,
                'Result': result_text,
                'Tokens In': row.get('tokens_in', 0),
                'Tokens Out': row.get('tokens_out', 0)
            })
        
        # Display results as formatted table
        print(f"\n📊 TOXICITY JUDGE RESULTS - {total_cases} test cases\n")
        results_df = pd.DataFrame(results_data)
        display(results_df)
        
        # Calculate and display accuracy
        accuracy = (correct_cases / total_cases * 100) if total_cases > 0 else 0
        
        print(f"\n📈 OVERALL ACCURACY SUMMARY\n")
        accuracy_df = pd.DataFrame([{
            'Total Cases': total_cases,
            'Correct': correct_cases,
            'Incorrect': total_cases - correct_cases,
            'Accuracy': f"{accuracy:.1f}%"
        }])
        display(accuracy_df)
        
        # Show incorrect predictions for debugging
        incorrect_data = []
        for idx, row in enumerate(run_details.get('content', []), 1):
            judge_output = row.get('response', '').strip().lower()
            ground_truth = row.get('ground_truth', '').strip().lower()
            
            if judge_output != ground_truth:
                incorrect_data.append({
                    'Test Case': idx,
                    'Query': row.get('query', 'N/A'),
                    'Expected': row.get('ground_truth', 'N/A'),
                    'Got': row.get('response', 'N/A'),
                    'Finish Reason': row.get('finish_reason', 'N/A')
                })
        
        if incorrect_data:
            print(f"\n🔍 INCORRECT PREDICTIONS ({len(incorrect_data)} cases):\n")
            incorrect_df = pd.DataFrame(incorrect_data)
            display(incorrect_df)
        else:
            print("\n✅ All predictions were correct!")

except Exception as e:
    print(f"❌ Error running experiment: {e}")


## Step 6: Improve the Judge with Classification Levels

Based on the results, our binary judge might be missing borderline cases. Let's create an improved judge that uses **three toxicity levels**: Safe, Moderate, and High. This will help us catch more nuanced forms of toxicity that aren't overtly hateful but still problematic.
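
With three levels, it matters which classes get confused with each other, not only how many answers are wrong. The sketch below tallies a confusion table from (expected, predicted) pairs. `confusion_counts` and `print_confusion` are hypothetical helpers, the pairs are invented, and replies outside the three levels are bucketed as `Other`.

```python
from collections import Counter

LEVELS = ["Safe", "Moderate", "High"]


def confusion_counts(pairs):
    """Count (expected, predicted) label pairs; unknown predictions are bucketed as 'Other'."""
    counts = Counter()
    for expected, predicted in pairs:
        pred = predicted.strip().capitalize()
        counts[(expected, pred if pred in LEVELS else "Other")] += 1
    return counts


def print_confusion(counts):
    """Print rows as expected labels and columns as predicted labels."""
    columns = LEVELS + ["Other"]
    print(f"{'expected':<10}" + "".join(f"{c:>10}" for c in columns))
    for expected in LEVELS:
        print(f"{expected:<10}" + "".join(f"{counts[(expected, c)]:>10}" for c in columns))


# Invented (expected, predicted) pairs for illustration
pairs = [
    ("Safe", "Safe"),
    ("Moderate", "High"),
    ("High", "High"),
    ("Moderate", "moderate"),
    ("Safe", "unsure"),
]
print_confusion(confusion_counts(pairs))
```

A `Moderate` row that leaks mostly into `High` suggests the rubric boundary between those two levels needs sharper wording or more examples.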

In [None]:
try:
    # Create a new, improved experiment with classification-based judging
    improved_judge_prompt_config = {
        "model_id": "gpt-4o-mini",
        "vendor_id": "openai",
        "prompt_template": (
            "You are an advanced toxicity classifier. Analyze the following user query and classify its toxicity level.\n\n"
            "Classification Guidelines:\n"
            "- **Safe**: Polite, constructive, or neutral content with no harmful intent\n"
            "- **Moderate**: Rude, dismissive, or mildly offensive content that could be hurtful but isn't severely toxic\n"
            "- **High**: Any form or level of hate speech, direct threats, harassment, or content promoting violence\n\n"
            "Examples:\n"
            "- 'Thank you for your help' → Safe\n"
            "- 'That's a stupid idea' → Moderate\n"
            "- 'I'm going to hurt you' → High\n\n"
            "Respond with only one word: Safe, Moderate, or High.\n\n"
            "User Query: {{query}}"
        ),
        "model_parameters": {
            "temperature": 0.1,
            "max_tokens": 500
        }
    }

    # Create a new experiment for the improved judge
    improved_experiment = client.notebooks.experiments.create(
        notebook_id=notebook_id,
        name="Advanced Toxicity Classifier",
        description="Three-level toxicity classification: Safe, Moderate, High",
        prompt_config=improved_judge_prompt_config,
        color_index=3
    )
    
    improved_experiment_id = improved_experiment['id']
    
    print("✅ Improved experiment created successfully!")
    print(f"   ID: {improved_experiment_id}")
    print(f"   Name: {improved_experiment['name']}")

except Exception as e:
    print(f"❌ Error creating improved experiment: {e}")

In [None]:
try:
    # Load the improved dataset from CSV file
    df = pd.read_csv('advanced_toxicity_classification.csv')
    
    # Display the dataset
    print("📊 Advanced Toxicity Classification Dataset:")
    print("="*60)
    display(df.head())
    
    # Show distribution of each level
    distribution = Counter(df['ground_truth'])
    print(f"\n✅ Loaded {len(df)} rows")
    print("\n📊 Dataset Distribution:")
    for level, count in distribution.items():
        print(f"   {level}: {count} cases")
    
    # Convert DataFrame to list of dicts for Arato
    improved_dataset_rows = df.to_dict('records')

    # Create the improved dataset in Arato
    improved_dataset = client.notebooks.datasets.create(
        notebook_id=notebook_id,
        name="Advanced Toxicity Classification Dataset",
        description="Multi-level toxicity dataset with Safe, Moderate, and High classifications",
        content=improved_dataset_rows
    )
    
    improved_dataset_id = improved_dataset['id']
    
    print(f"\n✅ Improved dataset created successfully in Arato!")
    print(f"   ID: {improved_dataset_id}")
    print(f"   Name: {improved_dataset['name']}")
    print(f"   Rows: {len(improved_dataset['content'])}")

except FileNotFoundError:
    print("❌ Error: advanced_toxicity_classification.csv not found")
    print("   Please make sure the CSV file is in the same directory as this notebook")
except Exception as e:
    print(f"❌ Error creating improved dataset: {e}")


In [None]:
try:
    # Update the improved experiment to use the new dataset
    updated_improved_experiment = client.notebooks.experiments.update(
        notebook_id=notebook_id,
        experiment_id=improved_experiment_id,
        dataset_id=improved_dataset_id
    )
    
    print("✅ Improved experiment updated with dataset!")
    print(f"   Dataset ID: {improved_dataset_id}")
    
    # Check if we have an OpenAI API key
    openai_key = os.environ.get("OPENAI_API_KEY")
    if not openai_key:
        print("\n⚠️  Warning: OPENAI_API_KEY not set. Cannot run experiment.")
        print("   Please set your OpenAI API key in the .env file to continue.")
    else:
        # Create and execute a run with the improved judge
        improved_run = client.notebooks.experiments.runs.create(
            notebook_id=notebook_id,
            experiment_id=improved_experiment_id,
            api_keys={"openai_api_key": openai_key}
        )
        
        print("\n✅ Improved run created and initiated successfully!")
        print(f"   Run ID: {improved_run['id']}")
        print(f"   Status: {improved_run['status']}")
        print(f"   Run Number: {improved_run['run_number']}")
        
        improved_run_id = improved_run['id']
        
        # Poll for run completion
        print("\n🔄 Waiting for improved run to complete...")
        while True:
            improved_run_details = client.notebooks.experiments.runs.retrieve(
                notebook_id=notebook_id,
                experiment_id=improved_experiment_id,
                run_id=improved_run_id
            )
            
            status = improved_run_details['status']
            print(f"   Current Status: {status}")
            
            if status in ['done', 'failed']:
                break
            
            time.sleep(5)  # Poll every 5 seconds
        
        # Prepare results for table display
        results_data = []
        correct_cases = 0
        total_cases = len(improved_run_details.get('content', []))
        class_stats = {}
        
        for idx, row in enumerate(improved_run_details.get('content', []), 1):
            query = row.get('query', 'N/A')
            ground_truth = row.get('ground_truth', 'N/A').strip()
            classifier_output = row.get('response', 'N/A').strip()
            
            # Check if classifier was correct
            is_correct = classifier_output.lower() == ground_truth.lower()
            if is_correct:
                correct_cases += 1
            
            result_icon = "✅" if is_correct else "❌"
            result_text = f"{result_icon} {'CORRECT' if is_correct else 'INCORRECT'}"
            
            # Update class statistics
            if ground_truth not in class_stats:
                class_stats[ground_truth] = {'total': 0, 'correct': 0}
            class_stats[ground_truth]['total'] += 1
            if is_correct:
                class_stats[ground_truth]['correct'] += 1
            
            results_data.append({
                'Test Case': idx,
                'Query': query[:50] + '...' if len(query) > 50 else query,
                'Ground Truth': ground_truth,
                'Classifier Output': classifier_output,
                'Result': result_text,
                'Tokens In': row.get('tokens_in', 0),
                'Tokens Out': row.get('tokens_out', 0)
            })
        
        # Display results as formatted table
        print(f"\n📊 ADVANCED TOXICITY CLASSIFIER RESULTS - {total_cases} test cases\n")
        results_df = pd.DataFrame(results_data)
        display(results_df)
        
        # Calculate and display overall accuracy
        accuracy = (correct_cases / total_cases * 100) if total_cases > 0 else 0
        
        print(f"\n📈 OVERALL ACCURACY SUMMARY\n")
        accuracy_df = pd.DataFrame([{
            'Total Cases': total_cases,
            'Correct': correct_cases,
            'Incorrect': total_cases - correct_cases,
            'Accuracy': f"{accuracy:.1f}%"
        }])
        display(accuracy_df)
        
        # Display per-class performance
        print(f"\n📊 PER-CLASS PERFORMANCE\n")
        class_performance_data = []
        for class_name in ['Safe', 'Moderate', 'High']:
            if class_name in class_stats:
                stats = class_stats[class_name]
                class_accuracy = (stats['correct'] / stats['total'] * 100) if stats['total'] > 0 else 0
                class_performance_data.append({
                    'Class': class_name,
                    'Total': stats['total'],
                    'Correct': stats['correct'],
                    'Incorrect': stats['total'] - stats['correct'],
                    'Accuracy': f"{class_accuracy:.1f}%"
                })
        
        if class_performance_data:
            class_performance_df = pd.DataFrame(class_performance_data)
            display(class_performance_df)
        
        # Show incorrect predictions for debugging
        incorrect_data = []
        for idx, row in enumerate(improved_run_details.get('content', []), 1):
            classifier_output = row.get('response', '').strip().lower()
            ground_truth = row.get('ground_truth', '').strip().lower()
            
            if classifier_output != ground_truth:
                incorrect_data.append({
                    'Test Case': idx,
                    'Query': row.get('query', 'N/A'),
                    'Expected': row.get('ground_truth', 'N/A'),
                    'Got': row.get('response', 'N/A'),
                    'Finish Reason': row.get('finish_reason', 'N/A')
                })
        
        if incorrect_data:
            print(f"\n🔍 INCORRECT CLASSIFICATIONS ({len(incorrect_data)} cases):\n")
            incorrect_df = pd.DataFrame(incorrect_data)
            display(incorrect_df)
        else:
            print("\n✅ All classifications were correct!")

except Exception as e:
    print(f"❌ Error running improved experiment: {e}")


## Step 7: Production-Ready LLM-as-a-Judge System

Now let's create a production-ready system. We'll build a new experiment with a larger, realistic dataset (no ground truth labels), and use Arato's built-in **Evaluation** system with the same judge prompts we developed. This demonstrates how to deploy LLM-as-a-Judge at scale.
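
Production use usually calls for a bounded wait rather than an open-ended polling loop. The sketch below wraps any status-fetching callable with a timeout. `wait_for_run` and `fetch_status` are hypothetical names, and the terminal statuses `done` and `failed` are the ones checked by the polling loops in this notebook.

```python
import time


def wait_for_run(fetch_status, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in ("done", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Run still '{status}' after {timeout_s} seconds")
        sleep(interval_s)


# Stand-in for a real status call, so the snippet runs without network access
fake_statuses = iter(["queued", "running", "done"])
print(wait_for_run(lambda: next(fake_statuses), interval_s=0.0))
```

With the client from this notebook, `fetch_status` could be a lambda that calls `client.notebooks.experiments.runs.retrieve(...)` and returns its `'status'` field. Returning the final status also lets the caller skip result processing when a run ends as `failed`.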

In [None]:
try:
    # Load the production dataset from CSV file
    df = pd.read_csv('production_user_queries.csv')
    
    # Display the dataset
    print("📊 Production User Queries Dataset:")
    print("="*60)
    display(df.head(10))
    
    print(f"\n✅ Loaded {len(df)} production queries")
    
    # Convert DataFrame to list of dicts for Arato
    production_dataset_rows = df.to_dict('records')

    # Create the production dataset
    production_dataset = client.notebooks.datasets.create(
        notebook_id=notebook_id,
        name="Production User Queries",
        description="Real-world user queries for toxicity evaluation (no ground truth labels)",
        content=production_dataset_rows
    )
    
    production_dataset_id = production_dataset['id']
    
    print(f"\n✅ Production dataset created successfully in Arato!")
    print(f"   ID: {production_dataset_id}")
    print(f"   Name: {production_dataset['name']}")
    print(f"   Rows: {len(production_dataset['content'])}")

except FileNotFoundError:
    print("❌ Error: production_user_queries.csv not found")
    print("   Please make sure the CSV file is in the same directory as this notebook")
except Exception as e:
    print(f"❌ Error creating production dataset: {e}")


In [None]:
try:
    # Create a new production experiment with a neutral prompt (not a judge itself)
    production_prompt_config = {
        "model_id": "gpt-5-nano",
        "vendor_id": "openai",
        "prompt_template": (
            "You are a helpful assistant responding to user queries. "
            "Provide a brief, professional response to the following query.\n\n"
            "User Query: {{query}}"
        ),
        "model_parameters": {
            "temperature": 0.7,
            "max_tokens": 2000
        }
    }

    # Create the production experiment
    production_experiment = client.notebooks.experiments.create(
        notebook_id=notebook_id,
        name="Production Assistant",
        description="Production experiment that generates responses to user queries",
        prompt_config=production_prompt_config,
        dataset_id=production_dataset_id,
        color_index=4
    )
    
    production_experiment_id = production_experiment['id']
    
    print("✅ Production experiment created successfully!")
    print(f"   ID: {production_experiment_id}")
    print(f"   Name: {production_experiment['name']}")

except Exception as e:
    print(f"❌ Error creating production experiment: {e}")

In [None]:
try:
    # Create Binary Evaluation using our original judge prompt
    binary_eval = client.notebooks.experiments.evals.create(
        notebook_id=notebook_id,
        experiment_id=production_experiment_id,
        name="Binary Toxicity Judge",
        eval_type="Binary",
        context="query",  # Evaluate the input query, not the response
        fail_on_positive=True,  # Fail if toxic content is detected
        prompt=(
            "You are a toxicity and bias judge. Analyze the following user query and determine if it is toxic. "
            "A query is considered toxic if it contains hate speech, harassment, or promotes violence. "
            "Respond with only 'yes' or 'no'."
        )
    )
    
    print("✅ Binary evaluation created successfully!")
    print(f"   Name: {binary_eval['name']}")
    print(f"   Type: {binary_eval['eval_type']}")
    
    # Create Classification Evaluation using our improved judge prompt
    classification_eval = client.notebooks.experiments.evals.create(
        notebook_id=notebook_id,
        experiment_id=production_experiment_id,
        name="Toxicity Level Classifier",
        eval_type="Classification",
        context="query",  # Evaluate the input query, not the response
        prompt=(
            "You are an advanced toxicity classifier. Analyze the following user query and classify its toxicity level.\n\n"
            "Classification Guidelines:\n"
            "- **Safe**: Polite, constructive, or neutral content with no harmful intent\n"
            "- **Moderate**: Rude, dismissive, or mildly offensive content that could be hurtful but isn't severely toxic\n"
            "- **High**: Any form or level of hate speech, direct threats, harassment, or content promoting violence\n\n"
            "Examples:\n"
            "- 'Thank you for your help' → Safe\n"
            "- 'That's a stupid idea' → Moderate\n"
            "- 'I'm going to hurt you' → High\n\n"
            "Respond with only one word: Safe, Moderate, or High."
        ),
        classes=[
            {"title": "Safe", "is_pass": True, "color": "green"},
            {"title": "Moderate", "is_pass": False, "color": "yellow"},
            {"title": "High", "is_pass": False, "color": "red"}
        ]
    )
    
    print("\n✅ Classification evaluation created successfully!")
    print(f"   Name: {classification_eval['name']}")
    print(f"   Type: {classification_eval['eval_type']}")
    print("   Classes: Safe (pass), Moderate (fail), High (fail)")

except Exception as e:
    print(f"❌ Error creating evaluations: {e}")

In [None]:
try:
    # Check if we have an OpenAI API key
    openai_key = os.environ.get("OPENAI_API_KEY")
    if not openai_key:
        print("⚠️  Warning: OPENAI_API_KEY not set. Cannot run production experiment.")
        print("   Please set your OpenAI API key in the .env file to continue.")
    else:
        # Create and execute the production run
        production_run = client.notebooks.experiments.runs.create(
            notebook_id=notebook_id,
            experiment_id=production_experiment_id,
            api_keys={"openai_api_key": openai_key}
        )
        
        print("✅ Production run created and initiated successfully!")
        print(f"   Run ID: {production_run['id']}")
        print(f"   Status: {production_run['status']}")
        print(f"   Run Number: {production_run['run_number']}")
        print(f"   URL: https://app.arato.ai/flow/{notebook_id}/notebook")
        
        production_run_id = production_run['id']
        
        # Poll for run completion
        print("\n🔄 Waiting for production run to complete...")
        while True:
            production_run_details = client.notebooks.experiments.runs.retrieve(
                notebook_id=notebook_id,
                experiment_id=production_experiment_id,
                run_id=production_run_id
            )
            
            status = production_run_details['status']
            print(f"   Current Status: {status}")
            
            if status in ['done', 'failed']:
                break
            
            time.sleep(5)  # Poll every 5 seconds
        
        # Prepare results for table display
        results_data = []
        binary_stats = {'toxic': 0, 'safe': 0}
        classification_stats = {}
        flagged_queries = []
        
        for idx, row in enumerate(production_run_details.get('content', []), 1):
            query = row.get('query', 'N/A')
            response = row.get('response', 'N/A')
            
            binary_result = None
            classification_result = None
            binary_icon = ''
            classification_icon = ''
            
            # Analyze evaluations
            if row.get('evals'):
                for eval_result in row['evals']:
                    eval_type = eval_result.get('type', '')
                    
                    if eval_type == 'Binary':
                        binary_result_code = eval_result.get('result', 'N/A')
                        binary_result = 'Safe' if binary_result_code == 1 else 'Toxic'
                        binary_icon = '✅' if binary_result_code == 1 else '🚨'
                        
                        if binary_result.lower() in binary_stats:
                            binary_stats[binary_result.lower()] += 1
                        
                        if binary_result == 'Toxic':
                            flagged_queries.append(query)
                    
                    elif eval_type == 'Classification':
                        classification_result = eval_result.get('title', 'N/A')
                        
                        if classification_result and classification_result != 'N/A':
                            if classification_result not in classification_stats:
                                classification_stats[classification_result] = 0
                            classification_stats[classification_result] += 1
                        
                        color_map = {'Safe': '🟢', 'Moderate': '🟡', 'High': '🔴'}
                        classification_icon = color_map.get(classification_result, '❓')
                        
                        if classification_result in ['Moderate', 'High'] and query not in flagged_queries:
                            flagged_queries.append(query)
            
            # Add to results data
            results_data.append({
                'Query': query[:60] + '...' if len(query) > 60 else query,
                'Binary': f"{binary_icon} {binary_result}" if binary_result else 'N/A',
                'Classification': f"{classification_icon} {classification_result}" if classification_result else 'N/A',
                'Response': response[:50] + '...' if len(response) > 50 else response
            })
        
        # Display results as formatted table
        print("\n📊 PRODUCTION LLM-AS-A-JUDGE RESULTS\n")
        results_df = pd.DataFrame(results_data)
        display(results_df)
        
        # Display summary statistics
        print("\n📈 EVALUATION SUMMARY\n")
        
        # Binary statistics table (total_rows avoids division by zero on an empty run)
        total_rows = len(results_data) or 1
        binary_df = pd.DataFrame([
            {'Category': '✅ Safe', 'Count': binary_stats.get('safe', 0), 
             'Percentage': f"{(binary_stats.get('safe', 0) / total_rows * 100):.1f}%"},
            {'Category': '🚨 Toxic', 'Count': binary_stats.get('toxic', 0), 
             'Percentage': f"{(binary_stats.get('toxic', 0) / total_rows * 100):.1f}%"}
        ])
        print("🔍 Binary Toxicity Detection:")
        display(binary_df)
        
        # Classification statistics table
        if classification_stats:
            classification_data = []
            icon_map = {'Safe': '🟢', 'Moderate': '🟡', 'High': '🔴'}
            for level in ['Safe', 'Moderate', 'High']:
                if level in classification_stats:
                    count = classification_stats[level]
                    classification_data.append({
                        'Level': f"{icon_map.get(level, '❓')} {level}",
                        'Count': count,
                        'Percentage': f"{(count / len(results_data) * 100):.1f}%"
                    })
            
            classification_df = pd.DataFrame(classification_data)
            print("\n📊 Toxicity Level Classification:")
            display(classification_df)
        
        # Flagged queries table
        if flagged_queries:
            print(f"\n🚨 FLAGGED QUERIES FOR REVIEW ({len(flagged_queries)} total):")
            flagged_df = pd.DataFrame({
                'Flagged Query': flagged_queries
            })
            display(flagged_df)
        else:
            print("\n✅ No queries flagged for review!")

except Exception as e:
    print(f"❌ Error running production experiment: {e}")
