# Metrics and Evaluation

This notebook focuses on evaluating prompt optimization performance through:
- Performance metrics analysis
- Cost tracking and optimization
- Quality assessment
- Comparative analysis

## Prerequisites

Make sure you have completed the basic usage notebook first.

---

## Setup and Imports

In [None]:
import os
import sys
import json
import time
import matplotlib.pyplot as plt
import pandas as pd
from dotenv import load_dotenv

# Add the src directory to Python path
sys.path.append('../src')

# Import Promptomatix functions
from promptomatix.main import process_input, generate_feedback, optimize_with_feedback

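# Load environment variables
load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    # Fail early: every optimization call below needs the key
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file or environment")

# Set up plotting
plt.style.use('default')
%matplotlib inline

print("✅ Setup complete")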

## 1. Performance Metrics Analysis

Let's analyze the performance metrics from prompt optimization.

In [None]:
# Run multiple optimizations to collect metrics
test_cases = [
    {"input": "Classify sentiment as positive or negative", "task": "classification"},
    {"input": "Summarize text in 2 sentences", "task": "summarization"},
    {"input": "Extract key information", "task": "extraction"},
    {"input": "Generate creative content", "task": "generation"},
    {"input": "Answer questions accurately", "task": "qa"}
]

metrics_data = []

for i, case in enumerate(test_cases, 1):
    print(f"\n🔄 Running optimization {i}/{len(test_cases)}: {case['task']}")
    
    config = {
        "raw_input": case['input'],
        "model_name": "gpt-3.5-turbo",
        "model_api_key": api_key,
        "model_provider": "openai",
        "backend": "simple_meta_prompt",
        "synthetic_data_size": 3,
        "task_type": case['task']
    }
    
    try:
        start_time = time.time()
        result = process_input(**config)
        end_time = time.time()
        
        metrics = {
            'task_type': case['task'],
            'input': case['input'],
            'cost': result['metrics']['cost'],
            'time_taken': result['metrics']['time_taken'],  # as reported by the library
            'wall_time': end_time - start_time,  # measured locally around the call
            'synthetic_data_count': len(result['synthetic_data']),
            'prompt_length': len(result['result']),
            'session_id': result['session_id']
        }
        
        metrics_data.append(metrics)
        
        print(f"✅ Completed - Cost: ${metrics['cost']:.4f}, Time: {metrics['time_taken']:.2f}s")
        
    except Exception as e:
        print(f"❌ Failed: {str(e)}")
        metrics_data.append({
            'task_type': case['task'],
            'input': case['input'],
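            # NaN (not 0) so failed runs do not skew sums, means, or min/max lookups
            'cost': float('nan'),
            'time_taken': float('nan'),
            'synthetic_data_count': 0,
            'prompt_length': float('nan'),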
            'session_id': None,
            'error': str(e)
        })

In [None]:
# Create a DataFrame for analysis
df = pd.DataFrame(metrics_data)
print("📊 Metrics Summary:")
print(df)

# Calculate summary statistics
print("\n📈 Summary Statistics:")
print(f"Total Cost: ${df['cost'].sum():.4f}")
print(f"Total Time: {df['time_taken'].sum():.2f} seconds")
print(f"Average Cost per Task: ${df['cost'].mean():.4f}")
print(f"Average Time per Task: {df['time_taken'].mean():.2f} seconds")
print(f"Average Prompt Length: {df['prompt_length'].mean():.0f} characters")

## 2. Cost Tracking and Optimization

Let's analyze costs across different configurations.

In [None]:
# Test cost variations with different synthetic data sizes
cost_test_cases = [
    {"size": 1, "description": "Minimal data"},
    {"size": 3, "description": "Standard data"},
    {"size": 5, "description": "Extended data"},
    {"size": 10, "description": "Comprehensive data"}
]

cost_data = []

for case in cost_test_cases:
    print(f"\n🔄 Testing with {case['size']} synthetic examples...")
    
    config = {
        "raw_input": "Classify text sentiment",
        "model_name": "gpt-3.5-turbo",
        "model_api_key": api_key,
        "model_provider": "openai",
        "backend": "simple_meta_prompt",
        "synthetic_data_size": case['size'],
        "task_type": "classification"
    }
    
    try:
        result = process_input(**config)
        
        cost_data.append({
            'synthetic_data_size': case['size'],
            'description': case['description'],
            'cost': result['metrics']['cost'],
            'time_taken': result['metrics']['time_taken'],
            'cost_per_example': result['metrics']['cost'] / case['size']
        })
        
        print(f"✅ Cost: ${result['metrics']['cost']:.4f}, Cost per example: ${result['metrics']['cost'] / case['size']:.4f}")
        
    except Exception as e:
        print(f"❌ Failed: {str(e)}")
        cost_data.append({
            'synthetic_data_size': case['size'],
            'description': case['description'],
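            # NaN (not 0) so failed runs are left out of the plots and summary
            'cost': float('nan'),
            'time_taken': float('nan'),
            'cost_per_example': float('nan'),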
            'error': str(e)
        })

In [None]:
# Visualize cost analysis
cost_df = pd.DataFrame(cost_data)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Total cost vs synthetic data size
ax1.plot(cost_df['synthetic_data_size'], cost_df['cost'], 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Synthetic Data Size')
ax1.set_ylabel('Total Cost ($)')
ax1.set_title('Total Cost vs Synthetic Data Size')
ax1.grid(True, alpha=0.3)

# Cost per example vs synthetic data size
ax2.plot(cost_df['synthetic_data_size'], cost_df['cost_per_example'], 'ro-', linewidth=2, markersize=8)
ax2.set_xlabel('Synthetic Data Size')
ax2.set_ylabel('Cost per Example ($)')
ax2.set_title('Cost per Example vs Synthetic Data Size')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Cost Analysis Summary:")
print(cost_df[['synthetic_data_size', 'cost', 'cost_per_example', 'time_taken']])
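To put a number on how cost scales with synthetic data size, one option is to fit a least-squares line through the measurements. The sketch below is plain Python and runs on made-up placeholder numbers, not results from this notebook. The helper name `fit_cost_line` is hypothetical; in practice you would pass `cost_df['synthetic_data_size']` and `cost_df['cost']`.

```python
import math

def fit_cost_line(sizes, costs):
    """Fit cost = slope * size + intercept; return (slope, intercept, r_squared)."""
    # Skip failed runs, which may be recorded as NaN
    pairs = [(float(s), float(c)) for s, c in zip(sizes, costs)
             if not (math.isnan(float(s)) or math.isnan(float(c)))]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two successful runs to fit a line")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0:
        raise ValueError("all sizes are identical; a slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 of a simple linear fit equals the squared correlation coefficient
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r_squared

# Placeholder numbers for illustration only
example_sizes = [1, 3, 5, 10]
example_costs = [0.002, 0.004, 0.006, 0.011]
slope, intercept, r2 = fit_cost_line(example_sizes, example_costs)
print(f"~${slope:.4f} per extra example, fixed overhead ~${intercept:.4f}, R^2 = {r2:.3f}")
```

An R² close to 1 suggests cost grows roughly linearly over the tested range. With only four data points, treat the fit as a rough guide rather than a forecast.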

## 3. Quality Assessment

Let's assess the quality of optimized prompts and synthetic data.

In [None]:
# Run a comprehensive optimization with feedback
print("🔄 Running comprehensive quality assessment...")

quality_config = {
    "raw_input": "Analyze customer feedback and extract key insights",
    "model_name": "gpt-3.5-turbo",
    "model_api_key": api_key,
    "model_provider": "openai",
    "backend": "simple_meta_prompt",
    "synthetic_data_size": 5,
    "task_type": "extraction"
}

try:
    # Initial optimization
    initial_result = process_input(**quality_config)
    
    print("✅ Initial optimization completed")
    print(f"📝 Optimized prompt: {initial_result['result']}")
    print(f"📊 Synthetic data count: {len(initial_result['synthetic_data'])}")
    
    # Generate feedback
    print("\n🔄 Generating feedback...")
    feedback_result = generate_feedback(
        optimized_prompt=initial_result['result'],
        input_fields=initial_result['input_fields'],
        output_fields=initial_result['output_fields'],
        model_name="gpt-3.5-turbo",
        model_api_key=api_key,
        synthetic_data=initial_result['synthetic_data'],
        session_id=initial_result['session_id']
    )
    
    print("✅ Feedback generated")
    print(f"📋 Comprehensive feedback: {feedback_result['comprehensive_feedback'][:200]}...")
    
    # Optimize with feedback
    print("\n🔄 Optimizing with feedback...")
    improved_result = optimize_with_feedback(initial_result['session_id'])
    
    print("✅ Optimization with feedback completed")
    print(f"📝 Improved prompt: {improved_result['result']}")
    
    # Quality metrics
    quality_metrics = {
        'initial_prompt_length': len(initial_result['result']),
        'improved_prompt_length': len(improved_result['result']),
        'initial_cost': initial_result['metrics']['cost'],
        'feedback_cost': 0,  # Placeholder: not tracked here, so total_cost excludes it
        'improvement_cost': improved_result['metrics']['cost'],
        'total_cost': initial_result['metrics']['cost'] + improved_result['metrics']['cost'],
        'synthetic_data_count': len(initial_result['synthetic_data']),  # a count, not a quality score
        'feedback_count': len(feedback_result['individual_feedbacks'])
    }
    
    print(f"\n📊 Quality Metrics:")
    for key, value in quality_metrics.items():
        if 'cost' in key:
            print(f"  {key}: ${value:.4f}")
        else:
            print(f"  {key}: {value}")
    
except Exception as e:
    print(f"❌ Quality assessment failed: {str(e)}")
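Prompt length alone says little about what the feedback step actually changed. As a minimal sketch, the standard library's `difflib` can report how similar two prompts are and which lines were added or removed. The helper `compare_prompts` and the two example prompts are invented for illustration and are not part of Promptomatix; in the notebook you would pass `initial_result['result']` and `improved_result['result']`.

```python
import difflib

def compare_prompts(initial, improved):
    """Return a similarity ratio, the length change, and the changed lines."""
    ratio = difflib.SequenceMatcher(None, initial, improved).ratio()
    diff_lines = list(difflib.unified_diff(
        initial.splitlines(), improved.splitlines(),
        fromfile="initial", tofile="improved", lineterm="",
    ))
    # The first two lines are the '---'/'+++' file headers; skip them,
    # then keep only added ('+') and removed ('-') lines
    changed = [line for line in diff_lines[2:] if line.startswith(("+", "-"))]
    return {
        "similarity": ratio,
        "length_delta": len(improved) - len(initial),
        "changed_lines": changed,
    }

# Invented prompts for illustration only
initial_prompt = "Extract key insights from the feedback."
improved_prompt = "Extract key insights from the feedback.\nReturn them as a bulleted list."

report = compare_prompts(initial_prompt, improved_prompt)
print(f"Similarity: {report['similarity']:.2f}, length change: {report['length_delta']:+d}")
for line in report["changed_lines"]:
    print(line)
```

A similarity near 1.0 with few changed lines means the feedback pass made only small edits. This measures how much the text changed, not whether the new prompt performs better.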

## 4. Comparative Analysis

Let's compare different approaches and configurations.

In [None]:
# Compare different task types performance
print("📊 Task Type Performance Comparison:")
print("-" * 80)
print(f"{'Task Type':<15} {'Cost ($)':<10} {'Time (s)':<10} {'Prompt Length':<15} {'Data Count':<12}")
print("-" * 80)

for _, row in df.iterrows():
    print(f"{row['task_type']:<15} {row['cost']:<10.4f} {row['time_taken']:<10.2f} {row['prompt_length']:<15} {row['synthetic_data_count']:<12}")

# Calculate efficiency metrics
df['cost_efficiency'] = df['cost'] / df['synthetic_data_count']
df['time_efficiency'] = df['time_taken'] / df['synthetic_data_count']

print(f"\n📈 Efficiency Analysis:")
print(f"Most Cost Efficient: {df.loc[df['cost_efficiency'].idxmin(), 'task_type']}")
print(f"Most Time Efficient: {df.loc[df['time_efficiency'].idxmin(), 'task_type']}")
print(f"Longest Prompts: {df.loc[df['prompt_length'].idxmax(), 'task_type']}")
print(f"Shortest Prompts: {df.loc[df['prompt_length'].idxmin(), 'task_type']}")

In [None]:
# Visualize comparative analysis
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# Cost by task type
task_costs = df.groupby('task_type')['cost'].mean()
ax1.bar(task_costs.index, task_costs.values, color='skyblue')
ax1.set_title('Average Cost by Task Type')
ax1.set_ylabel('Cost ($)')
ax1.tick_params(axis='x', rotation=45)

# Time by task type
task_times = df.groupby('task_type')['time_taken'].mean()
ax2.bar(task_times.index, task_times.values, color='lightcoral')
ax2.set_title('Average Time by Task Type')
ax2.set_ylabel('Time (seconds)')
ax2.tick_params(axis='x', rotation=45)

# Prompt length by task type
task_lengths = df.groupby('task_type')['prompt_length'].mean()
ax3.bar(task_lengths.index, task_lengths.values, color='lightgreen')
ax3.set_title('Average Prompt Length by Task Type')
ax3.set_ylabel('Characters')
ax3.tick_params(axis='x', rotation=45)

# Cost vs Time scatter
ax4.scatter(df['time_taken'], df['cost'], c=df['synthetic_data_count'], cmap='viridis', s=100)
ax4.set_xlabel('Time (seconds)')
ax4.set_ylabel('Cost ($)')
ax4.set_title('Cost vs Time (colored by data count)')
plt.colorbar(ax4.collections[0], ax=ax4, label='Synthetic Data Count')

plt.tight_layout()
plt.show()

## 5. Recommendations and Best Practices

Based on our analysis, here are recommendations for optimal usage.

In [None]:
print("💡 Recommendations Based on Metrics Analysis:")
print("\n1. Cost Optimization:")
print(f"   - Most cost-efficient task: {df.loc[df['cost_efficiency'].idxmin(), 'task_type']}")
print(f"   - Average cost per task: ${df['cost'].mean():.4f}")
print("   - Use 3-5 synthetic examples for testing")
print("   - Scale up to 10+ examples for production")

print("\n2. Time Optimization:")
print(f"   - Most time-efficient task: {df.loc[df['time_efficiency'].idxmin(), 'task_type']}")
print(f"   - Average time per task: {df['time_taken'].mean():.2f} seconds")
print("   - Use simple_meta_prompt backend for speed")
print("   - Consider parallel processing for batch tasks")

print("\n3. Quality Optimization:")
print(f"   - Average prompt length: {df['prompt_length'].mean():.0f} characters")
print("   - Longer prompts generally provide more detail")
print("   - Use feedback generation for quality improvement")
print("   - Balance prompt length with clarity")

print("\n4. Task-Specific Recommendations:")
for task_type in df['task_type'].unique():
    task_data = df[df['task_type'] == task_type].iloc[0]
    print(f"   - {task_type}: ${task_data['cost']:.4f} cost, {task_data['time_taken']:.2f}s time")

print("\n5. Monitoring and Tracking:")
print("   - Track costs per task type")
print("   - Monitor time efficiency")
print("   - Evaluate prompt quality with feedback")
print("   - Use session IDs for result tracking")
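The parallel-processing suggestion could look roughly like the sketch below, which uses a thread pool from the standard library. It assumes the optimization call is I/O-bound and safe to run concurrently, which this notebook does not confirm for `process_input`. To stay runnable without API calls, the example uses a hypothetical stand-in, `fake_optimize`; `run_batch` is likewise an illustrative helper.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def run_batch(configs, worker, max_workers=3):
    """Call worker(**config) for each config in a thread pool.

    Returns one outcome dict per config, in the same order as the input.
    """
    results = [None] * len(configs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, **cfg): i for i, cfg in enumerate(configs)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                results[i] = {"ok": True, "value": future.result()}
            except Exception as exc:
                # Record the failure instead of aborting the whole batch
                results[i] = {"ok": False, "error": str(exc)}
    return results

def fake_optimize(raw_input, task_type):
    """Hypothetical stand-in for process_input so the sketch runs offline."""
    if not raw_input:
        raise ValueError("empty input")
    time.sleep(0.05)  # simulate network latency
    return {"result": f"[{task_type}] {raw_input}"}

batch = [
    {"raw_input": "Classify sentiment as positive or negative", "task_type": "classification"},
    {"raw_input": "", "task_type": "summarization"},
    {"raw_input": "Answer questions accurately", "task_type": "qa"},
]

outcomes = run_batch(batch, fake_optimize)
for cfg, outcome in zip(batch, outcomes):
    status = "ok" if outcome["ok"] else f"failed: {outcome['error']}"
    print(f"{cfg['task_type']}: {status}")
```

Keep `max_workers` small when calling a real API, since provider rate limits usually cap how many requests can run at once.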

## Summary

In this notebook, we conducted comprehensive metrics and evaluation analysis:

- ✅ **Performance Metrics**: Analyzed cost, time, and efficiency across task types
- ✅ **Cost Tracking**: Explored cost variations with different configurations
- ✅ **Quality Assessment**: Evaluated prompt quality and improvement processes
- ✅ **Comparative Analysis**: Compared different approaches and configurations
- ✅ **Recommendations**: Provided data-driven best practices

### Key Findings:

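- **Cost Efficiency**: Can vary by task type; compare the `cost_efficiency` values from Section 4 rather than assuming one figure for all tasks
- **Time Efficiency**: Check `time_efficiency` per task; with a single run per task type, small differences may be noise
- **Quality Improvement**: Feedback-based optimization produces a revised prompt at extra cost; verify the improvement against your own examples
- **Scalability**: Total cost is expected to grow with synthetic data size; the Section 2 plot shows whether that growth is close to linear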

### Next Steps:

- Explore advanced features in `04_advanced_features.ipynb`
- Implement custom metrics tracking
- Set up automated monitoring systems
- Optimize for your specific use cases
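As a starting point for custom metrics tracking, one simple approach is to append one JSON line per run to a log file and reload it later for analysis. The sketch below uses only the standard library. The function names, the record fields, and the temporary file path are illustrative choices, and the logged values are placeholders rather than measured results.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def log_run(path, task_type, cost, time_taken, session_id=None):
    """Append one run's metrics to a JSON Lines file and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task_type": task_type,
        "cost": cost,
        "time_taken": time_taken,
        "session_id": session_id,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

def load_runs(path):
    """Read every logged run back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Placeholder values for illustration only
log_path = os.path.join(tempfile.mkdtemp(), "metrics_log.jsonl")
log_run(log_path, "classification", 0.0012, 4.2, session_id="demo-1")
log_run(log_path, "qa", 0.0020, 5.1, session_id="demo-2")

runs = load_runs(log_path)
total_cost = sum(r["cost"] for r in runs)
print(f"{len(runs)} runs logged, total cost ${total_cost:.4f}")
```

Because each record is a flat dict, `pd.DataFrame(load_runs(log_path))` gives a table that can be analyzed the same way as `df` earlier in this notebook.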

---

**Ready for advanced features?** Check out the advanced features notebook to learn about custom configurations and integrations!