# Prompt Optimization with Real LLM APIs

This notebook demonstrates how to integrate the prompt optimization workflow with real LLM APIs, including:

1. Connecting to OpenAI and Anthropic APIs
2. Creating real performance measurements for prompt templates
3. Running optimization cycles with real LLM responses
4. Analyzing cost/performance tradeoffs in prompt optimization

In [3]:
# Import necessary libraries
import os
import json
import time
import asyncio
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from typing import Dict, Any, List, Optional
from pprint import pprint
from dotenv import load_dotenv

# Import AILF framework components
from ailf.cognition.prompt_library import PromptLibrary
from ailf.schemas.prompt_engineering import PromptTemplateV1, PromptLibraryConfig
from ailf.feedback.performance_analyzer import PerformanceAnalyzer
from ailf.feedback.adaptive_learning_manager import AdaptiveLearningManager
from ailf.ai.engine import AIEngine

# Load environment variables for API keys
load_dotenv()

# Check if API keys are available
api_keys_available = {
    "OPENAI_API_KEY": bool(os.getenv("OPENAI_API_KEY")),
    "ANTHROPIC_API_KEY": bool(os.getenv("ANTHROPIC_API_KEY"))
}

print("API Keys Available:")
for api, available in api_keys_available.items():
    print(f"- {api}: {'✓' if available else '✗'}")

# If no API keys are available, provide instructions
if not any(api_keys_available.values()):
    print("\nNo API keys found. To run this notebook with real APIs:")
    print("1. Create a .env file in the project root")
    print("2. Add your API keys in the following format:")
    print("   OPENAI_API_KEY=your_key_here")
    print("   ANTHROPIC_API_KEY=your_key_here")
    print("3. Restart the kernel and run the notebook again")


## 1. Setting up the AI Engine with Real API Providers

First, let's set up the AI Engine that will connect to the LLM APIs. We'll support both OpenAI and Anthropic, with graceful fallback if one isn't available.

In [None]:
# Create a configured AI Engine
async def create_ai_engine():
    """Create and initialize an AI Engine with available providers."""
    # Default configuration
    config = {
        "default_provider": None,
        "log_prompts": True,
        "log_responses": True,
        "providers": {}
    }
    
    # Configure OpenAI if available
    openai_api_key = os.getenv("OPENAI_API_KEY")
    if openai_api_key:
        config["providers"]["openai"] = {
            "api_key": openai_api_key,
            "default_model": "gpt-4o-mini",
            "default_temperature": 0.2,
            "timeout": 30,
            "enabled": True
        }
        if not config["default_provider"]:
            config["default_provider"] = "openai"
    
    # Configure Anthropic if available
    anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
    if anthropic_api_key:
        config["providers"]["anthropic"] = {
            "api_key": anthropic_api_key,
            "default_model": "claude-3-haiku-20240307",
            "default_temperature": 0.2,
            "timeout": 30,
            "enabled": True
        }
        if not config["default_provider"]:
            config["default_provider"] = "anthropic"
    
    # Create engine
    engine = AIEngine(config)
    await engine.initialize()
    
    return engine

# Initialize the engine if we have at least one API key
ai_engine = None
if any(api_keys_available.values()):
    ai_engine = await create_ai_engine()
    print(f"\nAI Engine initialized with {ai_engine.config['default_provider']} as default provider.")
    print("Available models:")
    for provider, provider_config in ai_engine.config["providers"].items():
        if provider_config["enabled"]:
            print(f"- {provider}: {provider_config['default_model']}")
else:
    print("\nNo API keys available. Running in demo mode with simulated responses.")
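
The fallback rule used above (first provider with a key becomes the default) can be isolated into a small pure function, which makes it easy to check without touching the network. This is a minimal sketch: `PROVIDER_ENV_VARS` and `pick_default_provider` are hypothetical names introduced for illustration and are not part of AILF.

```python
from typing import Mapping, Optional

# Hypothetical preference order, mirroring the notebook: OpenAI first, then Anthropic.
PROVIDER_ENV_VARS = (
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
)


def pick_default_provider(env: Mapping[str, str]) -> Optional[str]:
    """Return the first provider whose API key is set and non-empty, else None."""
    for provider, var_name in PROVIDER_ENV_VARS:
        if env.get(var_name):
            return provider
    return None


# Only an Anthropic key is present, so Anthropic becomes the default.
print(pick_default_provider({"ANTHROPIC_API_KEY": "placeholder-key"}))
```

Passing `os.environ` as `env` would apply the same rule to the real environment.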

## 2. Creating a Prompt Library for Real-World Testing

Now, let's create a PromptLibrary with templates for real-world tasks that we'll optimize.

In [None]:
# Create a temporary directory for our prompt templates
import tempfile
import copy
library_path = tempfile.mkdtemp()
print(f"Created temporary directory for templates: {library_path}")

# Create a configuration for our PromptLibrary
config = PromptLibraryConfig(
    library_path=library_path,
    default_prompt_id="sentiment_analysis",
    auto_save=True
)

# Define real-world prompt templates for testing
templates = [
    {
        "template_id": "sentiment_analysis",
        "version": 1,
        "description": "Analyze sentiment of text (positive, negative, neutral)",
        "system_prompt": "You are an assistant that analyzes the sentiment of text.",
        "user_prompt_template": "Determine if the sentiment of this text is positive, negative, or neutral.\n\nText: {input_text}\n\nSentiment:",
        "placeholders": ["input_text"],
        "tags": ["sentiment_analysis", "classification"],
        "created_at": time.time()
    },
    {
        "template_id": "question_answering",
        "version": 1,
        "description": "Answer questions based on provided context",
        "system_prompt": "You are a helpful assistant that answers questions based on the provided context.",
        "user_prompt_template": "Context: {context}\n\nQuestion: {question}\n\nPlease provide a concise answer based only on the context provided.",
        "placeholders": ["context", "question"],
        "tags": ["qa", "information_retrieval"],
        "created_at": time.time()
    },
    {
        "template_id": "code_generation",
        "version": 1,
        "description": "Generate code based on requirements",
        "system_prompt": "You are a helpful coding assistant.",
        "user_prompt_template": "Write code in {language} that accomplishes the following:\n\n{requirements}\n\nProvide only the code without explanations.",
        "placeholders": ["language", "requirements"],
        "tags": ["code", "generation"],
        "created_at": time.time()
    }
]

# Save templates to JSON files
for template_data in templates:
    filename = f"{template_data['template_id']}_v{template_data['version']}.json"
    filepath = os.path.join(library_path, filename)
    with open(filepath, 'w') as f:
        json.dump(template_data, f, indent=2)

# Initialize our PromptLibrary
prompt_library = PromptLibrary(config)

# List available templates
print("\nAvailable templates for testing:")
for template_id in prompt_library.list_template_ids():
    template = prompt_library.get_template(template_id)
    print(f"- {template_id} (v{template.version}): {template.description}")
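
Before saving templates, it can help to confirm that the declared `placeholders` match the fields that actually appear in `user_prompt_template`. The sketch below uses the standard library's `string.Formatter` to do that; `placeholder_mismatch` is a hypothetical helper, and it assumes the template text contains no stray literal braces (those would make `Formatter().parse` raise `ValueError`).

```python
from string import Formatter


def placeholder_mismatch(template_text, declared):
    """Return (undeclared, unused) placeholder names for a template string."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    found = {name for _, name, _, _ in Formatter().parse(template_text) if name}
    declared = set(declared)
    return sorted(found - declared), sorted(declared - found)


# "question" appears in the text but was not declared.
print(placeholder_mismatch("Context: {context}\n\nQuestion: {question}", ["context"]))
```

Two empty lists mean the declaration and the template text agree.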

## 3. Creating Test Data for Real-World Evaluation

Let's create some test data for each template to evaluate their performance.

In [None]:
# Create test data for each template
test_data = {
    "sentiment_analysis": [
        {"input_text": "I absolutely loved the product! It exceeded all my expectations.", "expected": "positive"},
        {"input_text": "The service was okay, nothing special but it got the job done.", "expected": "neutral"},
        {"input_text": "This was a complete waste of money. I'm very disappointed.", "expected": "negative"},
        {"input_text": "While there were some issues with delivery, the product quality is amazing.", "expected": "mixed"},
        {"input_text": "I can't believe how terrible their customer service is. Never shopping here again.", "expected": "negative"},
        {"input_text": "The restaurant was fine, food was average, prices were reasonable.", "expected": "neutral"},
        {"input_text": "Best purchase I've made all year! Highly recommend to everyone.", "expected": "positive"},
        {"input_text": "It's an interesting concept but the execution leaves much to be desired.", "expected": "mixed"}
    ],
    
    "question_answering": [
        {
            "context": "The first Olympic Games were held in 776 BC in Olympia, Greece. They were held every four years in honor of Zeus, the king of the Greek gods. The modern Olympic Games began in 1896 in Athens, Greece.",
            "question": "When did the modern Olympic Games begin?",
            "expected": "1896"
        },
        {
            "context": "The Great Barrier Reef is the world's largest coral reef system, stretching for over 2,300 kilometers along the northeast coast of Australia. It consists of over 2,900 individual reefs and 900 islands.",
            "question": "Where is the Great Barrier Reef located?",
            "expected": "Australia"
        },
        {
            "context": "Marie Curie was a physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize, the first person to win Nobel Prizes in two different scientific fields, and the first woman to become a professor at the University of Paris.",
            "question": "How many Nobel Prizes did Marie Curie win?",
            "expected": "two"
        },
        {
            "context": "The Python programming language was created by Guido van Rossum and first released in 1991. It emphasizes code readability with its notable use of significant whitespace.",
            "question": "Who created the Python programming language?",
            "expected": "Guido van Rossum"
        }
    ],
    
    "code_generation": [
        {
            "language": "Python",
            "requirements": "Create a function to calculate the Fibonacci sequence up to n terms",
            "expected_contains": ["def", "fibonacci", "return"]
        },
        {
            "language": "JavaScript",
            "requirements": "Create a function that filters an array to only include even numbers",
            "expected_contains": ["function", "filter", "even", "return"]
        },
        {
            "language": "Python",
            "requirements": "Write a function that checks if a string is a palindrome",
            "expected_contains": ["def", "palindrome", "return"]
        },
        {
            "language": "SQL",
            "requirements": "Write a query to find the top 5 customers by total purchase amount",
            "expected_contains": ["SELECT", "FROM", "ORDER BY", "LIMIT"]
        }
    ]
}

# Show the test data for each template
for template_id, dataset in test_data.items():
    print(f"\nTest Data for {template_id} ({len(dataset)} samples):")
    
    for i, example in enumerate(dataset[:2]):  # Only show first 2 examples
        print(f"- Example {i+1}:")
        for key, value in example.items():
            if key == "context" and len(value) > 100:
                print(f"  {key}: {value[:100]}...")  # Truncate long context
            else:
                print(f"  {key}: {value}")
    
    if len(dataset) > 2:
        print(f"  ... and {len(dataset) - 2} more examples")
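
The sentiment prompt offers three labels (positive, negative, neutral), while two of the sentiment test cases above expect "mixed". Those cases can only pass if the model volunteers a label it was not offered. A quick sanity check along these lines surfaces such cases before any API call is made; `ALLOWED_SENTIMENT_LABELS` and `out_of_vocabulary_cases` are hypothetical names used for illustration.

```python
# Labels taken from the wording of the sentiment_analysis prompt.
ALLOWED_SENTIMENT_LABELS = {"positive", "negative", "neutral"}


def out_of_vocabulary_cases(cases, allowed):
    """Return the test cases whose expected label the prompt never offers."""
    return [case for case in cases if case["expected"].lower() not in allowed]


sample_cases = [
    {"input_text": "Best purchase I've made all year!", "expected": "positive"},
    {"input_text": "Interesting concept, weak execution.", "expected": "mixed"},
]

for case in out_of_vocabulary_cases(sample_cases, ALLOWED_SENTIMENT_LABELS):
    print(f"Label {case['expected']!r} is not offered by the prompt: {case['input_text']}")
```

Whether to extend the prompt with a "mixed" option or relabel the cases is a design decision; the check only makes the mismatch visible.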

## 4. Testing Prompt Templates with Real LLM APIs

Now, let's evaluate our prompt templates using the real LLM APIs and collect performance data.

In [None]:
# Create a function to test templates with real LLM APIs
async def test_template(engine, template_id, test_cases, performance_analyzer):
    """Test a template with real LLM responses and record performance data."""
    template = prompt_library.get_template(template_id)
    results = []
    
    print(f"Testing template: {template_id}")
    
    for i, test_case in enumerate(test_cases):
        print(f"  Running test case {i+1}/{len(test_cases)}...", end="")
        
        # Fill in the template with test data
        filled_template = template.user_prompt_template
        for placeholder in template.placeholders:
            if placeholder in test_case:
                filled_template = filled_template.replace(f"{{{placeholder}}}", str(test_case[placeholder]))
        
        start_time = time.time()
        error = None
        
        try:
            # Make the actual API call
            response = await engine.generate(
                system=template.system_prompt,
                user=filled_template,
                provider=engine.config["default_provider"]
            )
            
            result = response.get("content", "")
            
            # Evaluate success based on expected output
            successful = False
            if "expected" in test_case:
                expected = test_case["expected"].lower()
                successful = expected in result.lower()
            elif "expected_contains" in test_case:
                successful = all(item.lower() in result.lower() for item in test_case["expected_contains"])
            
            print(f" {'✓' if successful else '✗'}")
            
        except Exception as e:
            error = str(e)
            result = None
            successful = False
            print(f" ERROR: {error}")
        
        latency = time.time() - start_time
        
        # Record performance data
        interaction_data = {
            "template_id": template_id,
            "successful": successful,
            "error": bool(error),
            "error_message": error,
            "latency": latency,
            "timestamp": time.time(),
            "tokens": {
                "input": len(filled_template) // 4,  # Rough estimate
                "output": len(result) // 4 if result else 0  # Rough estimate
            }
        }
        
        performance_analyzer.add_interaction(interaction_data)
        
        results.append({
            "test_case": test_case,
            "result": result,
            "successful": successful,
            "latency": latency,
            "error": error
        })
    
    # Calculate overall performance
    success_count = sum(1 for r in results if r["successful"])
    success_rate = success_count / len(results) if results else 0
    error_count = sum(1 for r in results if r["error"])
    error_rate = error_count / len(results) if results else 0
    avg_latency = sum(r["latency"] for r in results) / len(results) if results else 0
    
    print(f"  Results: {success_count}/{len(results)} successful ({success_rate:.2%}), " 
          f"{error_count} errors, avg latency: {avg_latency:.2f}s")
    
    return {
        "template_id": template_id,
        "success_rate": success_rate,
        "error_rate": error_rate,
        "avg_latency": avg_latency,
        "results": results
    }

# Initialize the performance analyzer
real_performance_analyzer = PerformanceAnalyzer()

# Run tests if AI engine is available
test_results = {}
if ai_engine:
    for template_id, test_cases in test_data.items():
        test_results[template_id] = await test_template(
            ai_engine, template_id, test_cases, real_performance_analyzer
        )
    
    # Convert results to DataFrame for visualization
    df = pd.DataFrame({
        template_id: {
            "success_rate": results["success_rate"],
            "error_rate": results["error_rate"],
            "avg_latency": results["avg_latency"]
        }
        for template_id, results in test_results.items()
    }).T
    
    # Display the metrics
    print("\nTemplate Performance with Real LLM API:")
    display(df)
    
    # Visualize the performance metrics
    plt.figure(figsize=(12, 6))
    
    # Plot success and error rates
    plt.subplot(1, 2, 1)
    df[["success_rate", "error_rate"]].plot(kind="bar", ax=plt.gca())
    plt.title("Success and Error Rates by Template")
    plt.ylabel("Rate")
    plt.ylim(0, 1)
    
    # Plot average latency
    plt.subplot(1, 2, 2)
    df["avg_latency"].plot(kind="bar", ax=plt.gca(), color="green")
    plt.title("Average Latency by Template")
    plt.ylabel("Seconds")
    
    plt.tight_layout()
    plt.show()
else:
    print("\nSkipping real API tests since no API keys are available.")
    # Create simulated test results for demo purposes
    test_results = {
        "sentiment_analysis": {
            "template_id": "sentiment_analysis",
            "success_rate": 0.65,
            "error_rate": 0.1,
            "avg_latency": 1.2
        },
        "question_answering": {
            "template_id": "question_answering",
            "success_rate": 0.8,
            "error_rate": 0.05,
            "avg_latency": 1.8
        },
        "code_generation": {
            "template_id": "code_generation",
            "success_rate": 0.5,
            "error_rate": 0.2,
            "avg_latency": 2.3
        }
    }
    
    # Add simulated interaction data to the performance analyzer
    for template_id, metrics in test_results.items():
        for i in range(10):  # 10 simulated interactions per template
            real_performance_analyzer.add_interaction({
                "template_id": template_id,
                "successful": i < (10 * metrics["success_rate"]),
                # Place simulated errors at the tail so no interaction is both successful and an error
                "error": i >= 10 * (1 - metrics["error_rate"]),
                "latency": metrics["avg_latency"] + (i % 5) * 0.2,
                "timestamp": time.time() - (i * 60),
                "tokens": {
                    "input": 50 + (i % 10) * 5,
                    "output": 100 + (i % 20) * 10
                }
            })
    
    # Print simulated results
    print("\nSimulated Template Performance:")
    df = pd.DataFrame({
        template_id: {
            "success_rate": metrics["success_rate"],
            "error_rate": metrics["error_rate"],
            "avg_latency": metrics["avg_latency"]
        }
        for template_id, metrics in test_results.items()
    }).T
    display(df)
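
`test_template` scores classification cases with a substring match, so a response such as "neither positive nor negative" would count as a success for either label. A stricter scorer, sketched here under the assumption that exactly one whole-word label should appear in the response, avoids that; `extract_label` is a hypothetical helper and not part of AILF.

```python
import re


def extract_label(response, labels):
    """Return the single label mentioned as a whole word, or None if zero or several appear."""
    text = response.lower()
    hits = [label for label in labels if re.search(rf"\b{re.escape(label)}\b", text)]
    return hits[0] if len(hits) == 1 else None


labels = ["positive", "negative", "neutral"]
print(extract_label("Sentiment: Positive", labels))
print(extract_label("Neither positive nor negative.", labels))
```

A test case would then pass only when `extract_label(result, labels) == expected`.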

## 5. Running the Optimization Workflow with Real Performance Data

Now that we have performance data (real, or simulated when no API key is set), let's run the optimization workflow to improve our templates.

In [None]:
# Set up the AdaptiveLearningManager with our real performance data
learning_manager = AdaptiveLearningManager(
    performance_analyzer=real_performance_analyzer,
    prompt_library=prompt_library,
    config={
        "success_rate_threshold": 0.7,
        "error_rate_threshold": 0.15,
        "min_sample_size": 4,
        "auto_optimize_prompts": True,
        "optimization_strategy": "rule_based"  # Use rule-based since it doesn't require additional API calls
    },
    ai_engine=ai_engine
)

# Snapshot the initial templates (deep copies, so later in-place edits cannot alter the "before" state)
initial_templates = {}
for template_id in prompt_library.list_template_ids():
    initial_templates[template_id] = copy.deepcopy(prompt_library.get_template(template_id))

# Identify underperforming prompts
underperforming = learning_manager.identify_underperforming_prompts()

print("Underperforming templates identified:")
for template_id, metrics in underperforming.items():
    print(f"- {template_id}:")
    for key, value in metrics.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.2f}")
    
    template = prompt_library.get_template(template_id)
    print(f"  Current template: {template.user_prompt_template[:50]}...")

# Run the learning cycle with auto-optimization
if underperforming:
    print("\nRunning optimization cycle...")
    cycle_results = await learning_manager.run_learning_cycle(auto_optimize=True)
    
    print(f"\nOptimization cycle completed (Cycle ID: {cycle_results.get('cycle_id')}):")
    print(f"- Templates analyzed: {len(cycle_results.get('analyzed_prompts', []))}")
    print(f"- Underperforming templates: {len(cycle_results.get('underperforming_prompts', []))}")
    print(f"- Templates optimized: {len(cycle_results.get('optimized_prompts', []))}")
    
    # Show before/after comparison
    print("\nOptimized Templates (Before/After):")
    for template_id in cycle_results.get('optimized_prompts', []):
        initial = initial_templates.get(template_id)
        current = prompt_library.get_template(template_id)
        
        print(f"\n{template_id} (v{initial.version} → v{current.version}):")
        print(f"BEFORE: {initial.user_prompt_template}")
        print(f"AFTER:  {current.user_prompt_template}")
        
        # Show optimization metadata if available
        if hasattr(current, 'optimization_source') and current.optimization_source:
            print(f"OPTIMIZATION SOURCE: {current.optimization_source}")
        
        if hasattr(current, 'version_notes') and current.version_notes:
            print(f"VERSION NOTES: {current.version_notes}")
else:
    print("No underperforming templates identified for optimization.")
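
To make the thresholds in the configuration above concrete, here is one plausible reading of how they could combine. This is an assumption for illustration only, not AILF's actual `identify_underperforming_prompts` logic; the `stats` field names (`total`, `successes`, `errors`) are hypothetical.

```python
def is_underperforming(stats, success_rate_threshold=0.7,
                       error_rate_threshold=0.15, min_sample_size=4):
    """Flag a template when it has enough samples and misses either threshold."""
    if stats["total"] < min_sample_size:
        return False  # too little data to judge
    success_rate = stats["successes"] / stats["total"]
    error_rate = stats["errors"] / stats["total"]
    return success_rate < success_rate_threshold or error_rate > error_rate_threshold


# 5 of 8 successes is 62.5%, below the 70% threshold.
print(is_underperforming({"total": 8, "successes": 5, "errors": 0}))
```

The `min_sample_size` guard matters here because the smallest test sets in this notebook have only four cases.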

## 6. Testing Optimized Templates (A/B Comparison)

If we've optimized templates, let's test them against the original versions to measure improvement.

In [None]:
# Test optimized templates if we have any and if the AI Engine is available
optimized_templates = []
if 'cycle_results' in locals() and ai_engine:
    optimized_templates = cycle_results.get('optimized_prompts', [])
elif 'cycle_results' not in locals() and ai_engine:
    print("No templates were optimized in the previous step.")
elif not ai_engine:
    print("Skipping A/B testing since no AI engine is available.")
    # Simulate optimized templates for demo purposes
    optimized_templates = ["sentiment_analysis", "code_generation"]

if optimized_templates and ai_engine:
    print(f"Running A/B tests for {len(optimized_templates)} optimized templates...")
    
    # Create a new performance analyzer for the optimized templates
    optimized_performance_analyzer = PerformanceAnalyzer()
    
    # Test each optimized template
    optimized_results = {}
    for template_id in optimized_templates:
        optimized_results[template_id] = await test_template(
            ai_engine, template_id, test_data[template_id], optimized_performance_analyzer
        )
    
    # Compare original vs optimized results
    print("\nA/B Test Results (Original vs. Optimized):")
    for template_id in optimized_templates:
        original = test_results.get(template_id, {})
        optimized = optimized_results.get(template_id, {})
        
        success_change = optimized.get('success_rate', 0) - original.get('success_rate', 0)
        error_change = optimized.get('error_rate', 0) - original.get('error_rate', 0)
        latency_change = optimized.get('avg_latency', 0) - original.get('avg_latency', 0)
        
        print(f"\n{template_id}:")
        print(f"- Success Rate: {original.get('success_rate', 0):.2f} → {optimized.get('success_rate', 0):.2f} ({success_change:+.2f})")
        print(f"- Error Rate: {original.get('error_rate', 0):.2f} → {optimized.get('error_rate', 0):.2f} ({error_change:+.2f})")
        print(f"- Avg Latency: {original.get('avg_latency', 0):.2f}s → {optimized.get('avg_latency', 0):.2f}s ({latency_change:+.2f}s)")
    
    # Visualize the A/B comparison
    comparison_data = {}
    for template_id in optimized_templates:
        original = test_results.get(template_id, {})
        optimized = optimized_results.get(template_id, {})
        
        comparison_data[template_id] = {
            "Original Success Rate": original.get('success_rate', 0),
            "Optimized Success Rate": optimized.get('success_rate', 0),
            "Original Error Rate": original.get('error_rate', 0),
            "Optimized Error Rate": optimized.get('error_rate', 0),
            "Original Latency": original.get('avg_latency', 0),
            "Optimized Latency": optimized.get('avg_latency', 0)
        }
    
    if comparison_data:
        # Convert to DataFrame and plot
        comparison_df = pd.DataFrame(comparison_data).T
        
        plt.figure(figsize=(15, 6))
        
        # Plot success rate comparison
        plt.subplot(1, 3, 1)
        comparison_df[["Original Success Rate", "Optimized Success Rate"]].plot(kind="bar", ax=plt.gca())
        plt.title("Success Rate Comparison")
        plt.ylabel("Rate")
        plt.ylim(0, 1)
        
        # Plot error rate comparison
        plt.subplot(1, 3, 2)
        comparison_df[["Original Error Rate", "Optimized Error Rate"]].plot(kind="bar", ax=plt.gca())
        plt.title("Error Rate Comparison")
        plt.ylabel("Rate")
        plt.ylim(0, max(0.5, comparison_df[["Original Error Rate", "Optimized Error Rate"]].max().max() * 1.2))
        
        # Plot latency comparison
        plt.subplot(1, 3, 3)
        comparison_df[["Original Latency", "Optimized Latency"]].plot(kind="bar", ax=plt.gca())
        plt.title("Latency Comparison")
        plt.ylabel("Seconds")
        
        plt.tight_layout()
        plt.show()
elif optimized_templates:
    # Create simulated A/B test results for demo purposes
    print("\nSimulated A/B Test Results (Original vs. Optimized):")
    
    for template_id in optimized_templates:
        original = test_results.get(template_id, {})
        # Simulate improvements
        optimized_success_rate = min(1.0, original.get('success_rate', 0.5) * 1.2)  # 20% improvement
        optimized_error_rate = max(0, original.get('error_rate', 0.2) * 0.8)  # 20% reduction
        
        success_change = optimized_success_rate - original.get('success_rate', 0)
        error_change = optimized_error_rate - original.get('error_rate', 0)
        
        print(f"\n{template_id}:")
        print(f"- Success Rate: {original.get('success_rate', 0):.2f} → {optimized_success_rate:.2f} ({success_change:+.2f})")
        print(f"- Error Rate: {original.get('error_rate', 0):.2f} → {optimized_error_rate:.2f} ({error_change:+.2f})")
    
    # Create simulated visualization data
    comparison_data = {}
    for template_id in optimized_templates:
        original = test_results.get(template_id, {})
        comparison_data[template_id] = {
            "Original Success Rate": original.get('success_rate', 0.5),
            "Optimized Success Rate": min(1.0, original.get('success_rate', 0.5) * 1.2),
            "Original Error Rate": original.get('error_rate', 0.2),
            "Optimized Error Rate": max(0, original.get('error_rate', 0.2) * 0.8)
        }
    
    # Plot simulated results
    comparison_df = pd.DataFrame(comparison_data).T
    
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    comparison_df[["Original Success Rate", "Optimized Success Rate"]].plot(kind="bar", ax=plt.gca())
    plt.title("Simulated Success Rate Comparison")
    plt.ylabel("Rate")
    plt.ylim(0, 1)
    
    plt.subplot(1, 2, 2)
    comparison_df[["Original Error Rate", "Optimized Error Rate"]].plot(kind="bar", ax=plt.gca())
    plt.title("Simulated Error Rate Comparison")
    plt.ylabel("Rate")
    plt.ylim(0, max(0.5, comparison_df[["Original Error Rate", "Optimized Error Rate"]].max().max() * 1.2))
    
    plt.tight_layout()
    plt.show()
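
With only four to eight test cases per template, a change in success rate can easily be noise. One way to gauge this is a Wilson score interval around each measured rate; the sketch below uses only the standard library, and `wilson_interval` is a hypothetical helper rather than something the notebook or AILF provides.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


before = wilson_interval(5, 8)  # e.g. 5 of 8 cases passed before optimization
after = wilson_interval(7, 8)   # e.g. 7 of 8 passed afterwards
print(f"before: {before[0]:.2f} to {before[1]:.2f}")
print(f"after:  {after[0]:.2f} to {after[1]:.2f}")
```

In this illustrative case the two intervals overlap, so a jump from 5/8 to 7/8 alone would not be strong evidence of improvement; a larger test set is needed to separate the two versions.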

## 7. Cost-Performance Analysis

Let's analyze the cost-performance tradeoffs of our prompt optimization strategy.

In [None]:
# Define token cost estimates for different models (USD per 1K tokens).
# List prices change over time; check each provider's pricing page before relying on them.
token_costs = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "claude-3-haiku-20240307": {"input": 0.00025, "output": 0.00125},
    "claude-3-sonnet-20240229": {"input": 0.003, "output": 0.015}
}

# Function to estimate costs
def estimate_cost(model, input_tokens, output_tokens):
    """Estimate API cost for tokens."""
    if model not in token_costs:
        return 0
        
    input_cost = (input_tokens / 1000) * token_costs[model]["input"]
    output_cost = (output_tokens / 1000) * token_costs[model]["output"]
    return input_cost + output_cost

# Calculate costs for our test cases
if ai_engine:
    current_model = ai_engine.config["providers"][ai_engine.config["default_provider"]]["default_model"]
else:
    # Use a default model for simulation
    current_model = "gpt-4o-mini" if api_keys_available.get("OPENAI_API_KEY") else "claude-3-haiku-20240307"

# Calculate average tokens per template
if 'optimized_results' not in locals():
    # Create simulated data for demonstration
    optimized_results = {k: {"avg_tokens": {"input": 200, "output": 400}} for k in test_results.keys()}

template_tokens = {}
for template_id in test_results.keys():
    # Note: In a real scenario, you would get these from actual API responses
    # For demo purposes, we'll estimate based on template complexity
    template = prompt_library.get_template(template_id)
    avg_input_tokens = len(template.user_prompt_template) // 4
    
    if template_id == "sentiment_analysis":
        avg_output_tokens = 20  # Short responses
    elif template_id == "question_answering":
        avg_output_tokens = 100  # Medium responses
    else:  # code_generation
        avg_output_tokens = 300  # Long responses
        
    template_tokens[template_id] = {
        "input": avg_input_tokens,
        "output": avg_output_tokens
    }

# Calculate the cost of obtaining 1000 successful API calls
cost_analysis = {}
for template_id, tokens in template_tokens.items():
    # Cost before optimization
    original_success_rate = test_results.get(template_id, {}).get('success_rate', 0.7)
    original_cost_per_call = estimate_cost(current_model, tokens["input"], tokens["output"])
    original_calls_needed = 1000 / original_success_rate if original_success_rate > 0 else float('inf')
    original_total_cost = original_cost_per_call * original_calls_needed
    
    # Cost after optimization (if available)
    if template_id in optimized_templates:
        # Use the measured success rate when available; otherwise simulate a 20% improvement
        optimized_success_rate = optimized_results.get(template_id, {}).get(
            'success_rate', min(1.0, original_success_rate * 1.2)
        )
        # Tokens might be slightly higher for optimized prompts
        optimized_tokens = {
            "input": tokens["input"] * 1.1,  # Assume 10% more tokens in optimized prompts
            "output": tokens["output"]
        }
        optimized_cost_per_call = estimate_cost(current_model, optimized_tokens["input"], optimized_tokens["output"])
        optimized_calls_needed = 1000 / optimized_success_rate if optimized_success_rate > 0 else float('inf')
        optimized_total_cost = optimized_cost_per_call * optimized_calls_needed
    else:
        optimized_success_rate = original_success_rate
        optimized_cost_per_call = original_cost_per_call
        optimized_calls_needed = original_calls_needed
        optimized_total_cost = original_total_cost
    
    # Calculate the return on investment (ROI)
    cost_savings = original_total_cost - optimized_total_cost
    optimization_cost = estimate_cost(current_model, 500, 1000)  # Rough estimate for running optimization
    roi = (cost_savings / optimization_cost) if optimization_cost > 0 else 0
    
    cost_analysis[template_id] = {
        "original_success_rate": original_success_rate,
        "optimized_success_rate": optimized_success_rate,
        "original_cost_per_call": original_cost_per_call,
        "optimized_cost_per_call": optimized_cost_per_call,
        "original_calls_needed": original_calls_needed,
        "optimized_calls_needed": optimized_calls_needed,
        "original_total_cost": original_total_cost,
        "optimized_total_cost": optimized_total_cost,
        "cost_savings": cost_savings,
        "optimization_cost": optimization_cost,
        "roi": roi
    }

# Display the cost analysis
print(f"Cost Analysis for {current_model}:")
print(f"Estimated costs for 1000 successful API calls:")

cost_df = pd.DataFrame({
    template_id: {
        "Original Success Rate": f"{data['original_success_rate']:.2%}",
        "Optimized Success Rate": f"{data['optimized_success_rate']:.2%}",
        "Original Cost": f"${data['original_total_cost']:.2f}",
        "Optimized Cost": f"${data['optimized_total_cost']:.2f}",
        "Cost Savings": f"${data['cost_savings']:.2f}",
        "ROI": f"{data['roi']:.1f}x"
    }
    for template_id, data in cost_analysis.items()
}).T

display(cost_df)

# Create visualization of cost savings
plt.figure(figsize=(14, 6))

# Plot costs
plt.subplot(1, 2, 1)
cost_comparison = pd.DataFrame({
    template_id: {
        "Original": data["original_total_cost"],
        "Optimized": data["optimized_total_cost"]
    }
    for template_id, data in cost_analysis.items()
}).T
cost_comparison.plot(kind="bar", ax=plt.gca())
plt.title(f"Cost for 1000 Successful Calls ({current_model})")
plt.ylabel("Cost ($)")

# Plot ROI
plt.subplot(1, 2, 2)
roi_data = [data["roi"] for data in cost_analysis.values()]
plt.bar(list(cost_analysis.keys()), roi_data)
plt.axhline(y=1.0, color='r', linestyle='--', label="Break-even (ROI = 1)")
plt.title("Return on Investment (ROI)")
plt.ylabel("ROI (x)")
plt.legend()  # Needed for the break-even label to appear

plt.tight_layout()
plt.show()
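
The analysis above divides 1000 by the success rate, which implicitly models failed calls as independent retries with the same success probability. Under that same assumption, the expected spend per successful call is `cost_per_call / success_rate`, and a break-even point follows directly. The helper names below are hypothetical and the numbers are illustrative, not measured.

```python
import math


def cost_per_success(cost_per_call, success_rate):
    """Expected spend per successful call when failures are retried until one succeeds."""
    return cost_per_call / success_rate if success_rate > 0 else math.inf


def break_even_successes(cost_before, rate_before, cost_after, rate_after, optimization_cost):
    """Successful calls needed before the optimization pays for itself (inf if it never does)."""
    saving = cost_per_success(cost_before, rate_before) - cost_per_success(cost_after, rate_after)
    return optimization_cost / saving if saving > 0 else math.inf


# Illustrative numbers: a 10% pricier prompt that lifts the success rate from 50% to 80%.
print(break_even_successes(0.001, 0.5, 0.0011, 0.8, optimization_cost=0.01))
```

With these inputs the saving is about $0.000625 per successful call, so the one-off optimization cost is recovered after roughly 16 successful calls. If the optimized prompt is not cheaper per success, the function returns infinity.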

## 8. Summary and Best Practices

In this notebook, we've demonstrated how to integrate the prompt optimization workflow with real LLM APIs. We've seen:

1. **Real API Integration**: How to connect to OpenAI and Anthropic APIs for testing
2. **Performance Testing**: How to evaluate prompt templates with real-world test cases
3. **Optimization Workflow**: Running the optimization cycle with real performance data
4. **A/B Testing**: Comparing original and optimized templates to measure improvement
5. **Cost Analysis**: Analyzing the cost-performance tradeoffs of prompt optimization

### Best Practices for Prompt Optimization with Real APIs:

1. **Maintain a Test Suite**: Keep a diverse set of test cases that represent real-world usage
2. **Track Multiple Metrics**: Don't just track success rate; also consider error rate, latency, and cost
3. **Cost-Aware Optimization**: Balance performance improvements against token usage increases
4. **Incremental Optimization**: Make small, targeted improvements rather than complete rewrites
5. **Regular Re-evaluation**: Continuously test optimized prompts against new test cases
6. **Version Control**: Maintain clear version history for all templates
7. **Context Preservation**: Ensure optimizations maintain the original intent and functionality

By following these practices, you can build a robust prompt optimization pipeline that continuously improves your LLM applications while managing costs effectively.