# Lab 2: Scaling LLM Evaluations
**Duration**: 60 minutes

Welcome to Lab 2! Building on the foundational concepts from Lab 1, you'll now learn how to scale your evaluation processes with Azure AI Foundry integration, create comprehensive datasets, and systematically compare models for production deployment decisions.

## 🎯 Learning Objectives
By the end of this lab, you will:
- Create and manage large-scale evaluation datasets in Azure AI Foundry
- Implement batch evaluation workflows that integrate with the AI Foundry portal
- Compare multiple models and prompts systematically
- Generate synthetic evaluation data using Azure OpenAI
- Understand cost vs. quality trade-offs for production decisions

## 📋 Prerequisites
- Completed Lab 1: Evaluation Fundamentals
- Azure AI Foundry project configured (from Lab 1)
- Understanding of basic evaluation metrics
- Azure OpenAI access with sufficient quota for dataset generation

## Part 1: Advanced AI Foundry Integration Setup (10 min)

### Building on Lab 1's Foundation

In Lab 1, we established basic AI Foundry integration. In Lab 2, we'll leverage this foundation to:
- Create larger, more diverse datasets
- Implement systematic model comparisons
- Monitor evaluation trends over time
- Generate comprehensive evaluation reports

In [None]:
# Import dependencies with enhanced AI Foundry capabilities
import sys
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
import asyncio
from typing import List, Dict, Any

# Add the project root (the parent of the working directory) to the import path
sys.path.append(os.path.dirname(os.getcwd()))

# Import our enhanced AI Foundry integration from Lab 1
from shared_utils.azure_clients import azure_manager
from shared_utils.foundry_evaluation import foundry_runner
from shared_utils.evaluation_helpers import load_evaluation_data, save_evaluation_results

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

print("✅ Enhanced dependencies imported successfully!")

# Check AI Foundry integration status from Lab 1
foundry_status = foundry_runner.get_status_info()
print(f"\n🏢 AI Foundry Integration Status: {'✅ ENABLED' if foundry_status['ai_foundry_available'] else '❌ DISABLED'}")

if foundry_status['ai_foundry_available']:
    print("   📊 Lab 2 evaluations will upload to AI Foundry portal")
    print("   🔗 Portal: https://ai.azure.com")
    
    # Count datasets already registered in the project (including any from Lab 1)
    try:
        client = azure_manager.get_ai_foundry_client()
        existing_datasets = list(client.datasets.list())
        print(f"   📁 Found {len(existing_datasets)} existing datasets in this project")
    except Exception as e:
        print(f"   ⚠️ Could not list existing datasets: {e}")
else:
    print("   💡 Configure AI Foundry integration for enhanced portal visibility")
    print("   📚 See Lab 1 setup instructions for configuration details")

## Part 2: Synthetic Dataset Generation Strategy (15 min)

### Understanding Dataset Requirements for Production

Production LLM systems require comprehensive evaluation datasets that cover:
- **Multiple domains**: Healthcare, legal, technical, customer service
- **Varying complexity**: Simple factual questions to complex reasoning
- **Edge cases**: Ambiguous queries, out-of-scope requests, adversarial inputs
- **Scale**: Hundreds to thousands of examples, so that score differences between models reflect real behavior rather than sampling noise
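
To get a feel for why scale matters, the sketch below estimates how uncertain a measured pass rate is at different dataset sizes. It assumes a simple normal-approximation 95% confidence interval for a proportion; `margin_of_error` is a hypothetical helper, not part of the lab's `shared_utils`.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion.

    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


# How tightly can we pin down an observed 80% pass rate?
for n in (20, 100, 500, 2000):
    print(f"n={n:>5}: 80% pass rate ± {margin_of_error(0.8, n):.1%}")
```

With 20 examples the interval spans roughly 17 to 18 percentage points in each direction, which can easily hide the gap between two models. With 2,000 examples it shrinks to under 2 points.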

### Azure OpenAI-Powered Data Generation

In [None]:
class AdvancedDatasetGenerator:
    """Enhanced dataset generator with AI Foundry integration."""
    
    def __init__(self, azure_client, deployment_name):
        self.client = azure_client
        self.deployment = deployment_name
        self.generation_history = []
        
    def generate_domain_dataset(self, domain: str, num_pairs: int = 20, 
                              complexity_levels: List[str] = None) -> List[Dict[str, Any]]:
        """Generate domain-specific Q&A pairs with varying complexity."""
        
        if complexity_levels is None:
            complexity_levels = ['basic', 'intermediate', 'advanced']
            
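        dataset = []
        # Spread num_pairs across the levels; earlier levels absorb any remainder
        # so the requested total equals num_pairs (e.g. 10 pairs -> 4, 3, 3).
        base, extra = divmod(num_pairs, len(complexity_levels))
        level_counts = [base + (1 if i < extra else 0) for i in range(len(complexity_levels))]
        
        print(f"🔄 Generating {num_pairs} Q&A pairs for domain: {domain}")
        print(f"📊 Complexity levels: {complexity_levels}")
        
        for complexity, pairs_per_level in zip(complexity_levels, level_counts):
            print(f"   📝 Creating {pairs_per_level} {complexity} level questions...")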
            
            generation_prompt = f"""
Generate {pairs_per_level} high-quality question-answer pairs for the {domain} domain.
Complexity level: {complexity}

Requirements:
- Questions should be realistic and practical
- Answers should be accurate and comprehensive
- Include relevant context/background information
- Vary question types (factual, analytical, procedural)

Format each pair as JSON:
{{
    "query": "Your question here",
    "response": "Detailed answer here",
    "context": "Background information here",
    "domain": "{domain}",
    "complexity": "{complexity}",
    "question_type": "factual/analytical/procedural"
}}

Generate {pairs_per_level} such pairs:
"""
            
            try:
                response = self.client.chat.completions.create(
                    model=self.deployment,
                    messages=[
                        {"role": "system", "content": "You are an expert at creating high-quality evaluation datasets for LLM systems. Always respond with valid JSON."},
                        {"role": "user", "content": generation_prompt}
                    ],
                    temperature=0.7,
                    max_tokens=2000
                )
                
                generated_text = response.choices[0].message.content
                
                # Extract JSON pairs from response
                pairs = self._extract_json_pairs(generated_text)
                dataset.extend(pairs)
                
                print(f"   ✅ Generated {len(pairs)} pairs for {complexity} level")
                
            except Exception as e:
                print(f"   ❌ Failed to generate {complexity} level pairs: {e}")
                # Add fallback sample data
                fallback_pair = {
                    "query": f"Sample {complexity} question for {domain}",
                    "response": f"Sample answer for {complexity} {domain} question",
                    "context": f"Context for {domain} domain",
                    "domain": domain,
                    "complexity": complexity,
                    "question_type": "factual"
                }
                dataset.append(fallback_pair)
        
        # Record generation metadata
        generation_record = {
            "timestamp": datetime.now().isoformat(),
            "domain": domain,
            "requested_pairs": num_pairs,
            "generated_pairs": len(dataset),
            "complexity_levels": complexity_levels
        }
        self.generation_history.append(generation_record)
        
        print(f"✅ Dataset generation complete: {len(dataset)} pairs created")
        return dataset
    
    def _extract_json_pairs(self, text: str) -> List[Dict[str, Any]]:
        """Extract JSON pairs from generated text."""
        pairs = []
        
        # Match flat JSON objects only: the pattern cannot span nested braces,
        # so a pair that contains a nested object is skipped.
        import re
        json_pattern = r'\{[^{}]*\}'
        matches = re.findall(json_pattern, text)
        
        for match in matches:
            try:
                pair = json.loads(match)
                if 'query' in pair and 'response' in pair:
                    pairs.append(pair)
            except json.JSONDecodeError:
                continue
                
        return pairs
    
    def create_multi_domain_dataset(self, domains: List[str], 
                                   pairs_per_domain: int = 15) -> List[Dict[str, Any]]:
        """Create a comprehensive multi-domain dataset."""
        
        print(f"🚀 Creating multi-domain dataset")
        print(f"📊 Domains: {domains}")
        print(f"🔢 Pairs per domain: {pairs_per_domain}")
        
        all_data = []
        
        for domain in domains:
            domain_data = self.generate_domain_dataset(domain, pairs_per_domain)
            all_data.extend(domain_data)
            print(f"✅ Completed {domain}: {len(domain_data)} pairs")
        
        print(f"\n🎉 Multi-domain dataset complete: {len(all_data)} total pairs")
        return all_data

# Initialize the advanced generator
client = azure_manager.get_openai_client()
deployment_name = os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME')

generator = AdvancedDatasetGenerator(client, deployment_name)

print("🔧 Advanced Dataset Generator initialized")
print(f"📡 Using deployment: {deployment_name}")
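
The regex in `_extract_json_pairs` only matches objects without nested braces, so a pair that carries a nested field is lost. One possible alternative, sketched below, scans the text with the standard library's `json.JSONDecoder.raw_decode`, which parses a complete JSON value starting at a given position. `extract_json_objects` is a hypothetical helper name and the sample text is made up for illustration.

```python
import json
from typing import Any, Dict, List


def extract_json_objects(text: str) -> List[Dict[str, Any]]:
    """Decode every JSON object found in text, including objects with nested fields."""
    decoder = json.JSONDecoder()
    objects: List[Dict[str, Any]] = []
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode returns the parsed value and the index where it ended
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON from this brace; keep scanning
            continue
        if isinstance(obj, dict):
            objects.append(obj)
        pos = end
    return objects


# Made-up model output: prose followed by a JSON array of pairs
sample = (
    'Here are the pairs:\n'
    '[{"query": "Q1", "response": "A1", "meta": {"source": "kb"}},\n'
    ' {"query": "Q2", "response": "A2"}]'
)
pairs = [o for o in extract_json_objects(sample) if "query" in o and "response" in o]
print(len(pairs))
```

Here both pairs are recovered, whereas the flat-object regex would keep only the second one. Note that if an outer object is malformed, this sketch may still pick up valid objects nested inside it, so the `query`/`response` filter remains useful.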

### Generate Production-Scale Dataset

In [None]:
# Define domains relevant to common LLM applications
production_domains = [
    "customer_support",
    "technical_documentation", 
    "business_analysis",
    "healthcare_information"
]

print("🏭 PRODUCTION-SCALE DATASET GENERATION")
print("=" * 45)

try:
    # Generate comprehensive dataset
    production_dataset = generator.create_multi_domain_dataset(
        domains=production_domains,
        pairs_per_domain=10  # Reduced for workshop time/cost management
    )
    
    print(f"\n📊 DATASET SUMMARY")
    print(f"Total pairs: {len(production_dataset)}")
    
    # Analyze dataset composition
    domain_counts = {}
    complexity_counts = {}
    type_counts = {}
    
    for item in production_dataset:
        # Count by domain
        domain = item.get('domain', 'unknown')
        domain_counts[domain] = domain_counts.get(domain, 0) + 1
        
        # Count by complexity
        complexity = item.get('complexity', 'unknown')
        complexity_counts[complexity] = complexity_counts.get(complexity, 0) + 1
        
        # Count by question type
        q_type = item.get('question_type', 'unknown')
        type_counts[q_type] = type_counts.get(q_type, 0) + 1
    
    print("\n📈 Domain Distribution:")
    for domain, count in domain_counts.items():
        print(f"   {domain}: {count} pairs")
    
    print("\n🎯 Complexity Distribution:")
    for complexity, count in complexity_counts.items():
        print(f"   {complexity}: {count} pairs")
    
    print("\n❓ Question Type Distribution:")
    for q_type, count in type_counts.items():
        print(f"   {q_type}: {count} pairs")
    
    # Show sample entries
    print("\n🔍 Sample Entries:")
    for i, sample in enumerate(production_dataset[:2]):
        print(f"\nSample {i+1}:")
        print(f"   Domain: {sample.get('domain', 'N/A')}")
        print(f"   Complexity: {sample.get('complexity', 'N/A')}")
        print(f"   Query: {sample['query'][:80]}...")
        print(f"   Response: {sample['response'][:80]}...")
    
except Exception as e:
    print(f"❌ Dataset generation failed: {e}")
    print("Using fallback sample dataset...")
    
    # Create fallback dataset for demonstration
    production_dataset = [
        {
            "query": "How do I reset my password?",
            "response": "To reset your password, go to the login page and click 'Forgot Password'. Enter your email address and follow the instructions sent to your inbox.",
            "context": "Customer support for password management",
            "domain": "customer_support",
            "complexity": "basic",
            "question_type": "procedural"
        },
        {
            "query": "What are the implications of microservice architecture on system scalability?",
            "response": "Microservice architecture enhances scalability by allowing independent scaling of services, distributing load more effectively, and enabling horizontal scaling of specific components based on demand.",
            "context": "Technical architecture and system design",
            "domain": "technical_documentation",
            "complexity": "advanced",
            "question_type": "analytical"
        }
    ]
    print(f"📋 Using {len(production_dataset)} fallback examples")
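
Synthetic data can contain empty fields or repeated questions, so it may be worth screening the dataset before evaluating on it. The following is a minimal sketch under that assumption; `clean_dataset` and `REQUIRED_FIELDS` are hypothetical names, and the duplicate check only normalizes case and whitespace.

```python
from typing import Any, Dict, List, Tuple

REQUIRED_FIELDS = ("query", "response")


def clean_dataset(
    rows: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Tuple[Dict[str, Any], str]]]:
    """Split rows into kept rows and (row, reason) rejections."""
    seen_queries = set()
    kept, rejected = [], []
    for row in rows:
        # A field counts as missing if it is absent, not a string, or blank
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(row.get(field), str) or not row[field].strip()
        ]
        if missing:
            rejected.append((row, f"missing: {', '.join(missing)}"))
            continue
        # Normalize case and whitespace so near-identical queries collide
        key = " ".join(row["query"].lower().split())
        if key in seen_queries:
            rejected.append((row, "duplicate query"))
            continue
        seen_queries.add(key)
        kept.append(row)
    return kept, rejected


rows = [
    {"query": "How do I reset my password?", "response": "Use 'Forgot Password'."},
    {"query": "how do I  reset my password?", "response": "Click the reset link."},
    {"query": "What is the refund policy?", "response": ""},
]
kept, rejected = clean_dataset(rows)
print(f"kept={len(kept)}, rejected={[reason for _, reason in rejected]}")
```

Running it on `production_dataset` before the upload step would keep placeholder or repeated rows out of the evaluation runs.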

## Part 3: AI Foundry Dataset Integration (10 min)

### Upload Generated Dataset to AI Foundry

Now we'll upload our production-scale dataset to Azure AI Foundry, making it available for:
- Portal-based evaluation management
- Team collaboration and sharing
- Version control and dataset lineage
- Integration with future evaluation APIs

In [None]:
# Enhanced dataset upload with AI Foundry integration
print("📤 UPLOADING DATASET TO AI FOUNDRY")
print("=" * 35)

if foundry_status['ai_foundry_available']:
    try:
        # Create dataset name with timestamp
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        dataset_name = f"lab2_production_dataset_{timestamp}"
        
        print(f"📁 Dataset name: {dataset_name}")
        print(f"📊 Size: {len(production_dataset)} pairs")
        
        # Write the dataset to a temporary JSONL file
        import tempfile
        with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False) as temp_file:
            for item in production_dataset:
                temp_file.write(json.dumps(item) + '\n')
        
        try:
            # Upload to AI Foundry
            foundry_client = azure_manager.get_ai_foundry_client()
            uploaded_dataset = foundry_client.datasets.upload_file(
                name=dataset_name,
                version="1.0",
                file_path=temp_file.name
            )
        finally:
            # Remove the temp file whether or not the upload succeeded
            os.unlink(temp_file.name)
        
        print(f"✅ Dataset uploaded successfully!")
        print(f"🆔 Dataset ID: {uploaded_dataset.id}")
        print(f"🏷️ Type: {uploaded_dataset.type}")
        print(f"🔗 Portal: https://ai.azure.com")
        
        # Store dataset info for later use
        lab2_dataset_info = {
            "name": dataset_name,
            "id": uploaded_dataset.id,
            "size": len(production_dataset),
            "domains": list(set(item.get('domain', 'unknown') for item in production_dataset)),
            "upload_timestamp": timestamp
        }
        
        print(f"\n📊 Dataset registered for Lab 2 evaluations")
        print(f"   Domains covered: {lab2_dataset_info['domains']}")
        
    except Exception as e:
        print(f"❌ Dataset upload failed: {e}")
        print("   Will proceed with local dataset for evaluations")
        lab2_dataset_info = None
        
else:
    print("ℹ️ AI Foundry not configured - dataset stored locally")
    print("   Configure AI Foundry for enhanced portal integration")
    lab2_dataset_info = None

# Save dataset locally as well
local_dataset_path = "data/lab2_production_dataset.jsonl"
os.makedirs("data", exist_ok=True)

with open(local_dataset_path, 'w') as f:
    for item in production_dataset:
        f.write(json.dumps(item) + '\n')

print(f"\n💾 Dataset also saved locally: {local_dataset_path}")
print(f"📈 Ready for batch evaluation workflows")
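
Before a batch run, it can help to confirm that the saved JSONL file loads back cleanly. The sketch below shows one way to do that with the standard library only; `write_jsonl` and `read_jsonl` are hypothetical helpers (the lab's own `load_evaluation_data` may behave differently), and the sample row is invented.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def write_jsonl(path: str, rows: List[Dict[str, Any]]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSONL file, reporting the line number of any malformed row."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
    return rows


rows = [{"query": "Where is the café?", "response": "Next to the lobby.", "domain": "customer_support"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    write_jsonl(path, rows)
    loaded = read_jsonl(path)
print(loaded == rows)
```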

## Part 4: Advanced Model Comparison Framework (20 min)

### Multi-Model Evaluation Strategy

Production deployments require systematic comparison of:
- **Different model tiers**: GPT-3.5 vs. GPT-4 vs. GPT-4o
- **Cost implications**: Token usage and pricing analysis
- **Performance metrics**: Quality scores across domains
- **Latency considerations**: Response time measurements
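
The comparison framework in this lab does not measure latency itself. As a rough illustration of what a response-time measurement could look like, the sketch below times repeated calls and reports the median and a nearest-rank 95th percentile. `measure_latency` is a hypothetical helper, and `fake_model_call` stands in for a real request to a deployment.

```python
import math
import statistics
import time
from typing import Callable, Dict


def measure_latency(call: Callable[[], object], runs: int = 20) -> Dict[str, float]:
    """Time repeated calls and summarize the wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    samples.sort()
    # Nearest-rank percentile: the smallest sample with at least 95% at or below it
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return {
        "mean_s": statistics.mean(samples),
        "p50_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "max_s": samples[-1],
    }


def fake_model_call() -> None:
    """Placeholder for a real chat completion request."""
    time.sleep(0.01)


stats = measure_latency(fake_model_call, runs=10)
print({name: round(value, 4) for name, value in stats.items()})
```

Tail percentiles are usually more informative than the mean for SLA discussions, because a few slow responses dominate the user experience.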

In [None]:
# Enhanced Model Comparison Pipeline with AI Foundry Integration
class ProductionModelComparison:
    """Advanced model comparison with AI Foundry integration."""
    
    def __init__(self, foundry_runner):
        self.foundry_runner = foundry_runner
        self.comparison_results = {}
        
    def setup_model_configurations(self):
        """Define model configurations for comparison."""
        
        base_config = {
            "azure_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
            "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
            "api_version": os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")
        }
        
        # A real comparison needs one entry per deployment. The workshop
        # configures a single deployment, so only one model is evaluated here.
        model_configs = {
            "primary_model": {
                **base_config,
                "azure_deployment": os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
                "display_name": "Primary GPT Model",
                # Illustrative USD prices per 1K tokens; check current Azure OpenAI pricing
                "cost_per_1k_input": 0.03,
                "cost_per_1k_output": 0.06
            }
            # Additional models would be configured here in production
        }
        
        return model_configs
    
    def create_evaluation_subset(self, dataset: List[Dict], sample_size: int = 5) -> List[Dict]:
        """Create a representative subset for model comparison."""
        
        print(f"🎯 Creating evaluation subset: {sample_size} items")
        
        # Ensure we have representation from each domain
        domains = list(set(item.get('domain', 'unknown') for item in dataset))
        items_per_domain = max(1, sample_size // len(domains))
        
        subset = []
        for domain in domains:
            domain_items = [item for item in dataset if item.get('domain') == domain]
            subset.extend(domain_items[:items_per_domain])
        
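        # Fill remaining slots in a single pass. Items are tracked by identity,
        # so duplicate rows in the dataset cannot cause an endless loop.
        used_ids = {id(item) for item in subset}
        for item in dataset:
            if len(subset) >= sample_size:
                break
            if id(item) not in used_ids:
                subset.append(item)
                used_ids.add(id(item))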
        
        print(f"✅ Subset created with {len(subset)} items")
        print(f"   Domains included: {list(set(item.get('domain', 'unknown') for item in subset))}")
        
        return subset
    
    def run_comprehensive_comparison(self, dataset: List[Dict], 
                                   evaluators: Dict) -> Dict[str, Any]:
        """Run comprehensive model comparison with AI Foundry integration."""
        
        print("🏁 COMPREHENSIVE MODEL COMPARISON")
        print("=" * 35)
        
        model_configs = self.setup_model_configurations()
        comparison_subset = self.create_evaluation_subset(dataset, sample_size=5)
        
        results_summary = {
            "timestamp": datetime.now().isoformat(),
            "dataset_size": len(comparison_subset),
            "models_compared": list(model_configs.keys()),
            "evaluators_used": list(evaluators.keys()),
            "individual_results": {},
            "summary_metrics": {}
        }
        
        for model_name, config in model_configs.items():
            print(f"\n📊 Evaluating model: {config['display_name']}")
            
            try:
                # Run evaluation using our enhanced foundry runner
                model_results = self.foundry_runner.run_evaluation(
                    data=comparison_subset,
                    evaluators=evaluators,
                    run_name=f"Lab2 {config['display_name']} Comparison",
                    description=f"Model comparison evaluation for {model_name}"
                )
                
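                # Calculate cost estimates. Rough heuristic: English text averages
                # about 4 characters per token; use a tokenizer for exact counts.
                total_chars = sum(len(item['query']) + len(item['response']) for item in comparison_subset)
                estimated_tokens = total_chars / 4
                estimated_cost = (estimated_tokens / 1000) * config['cost_per_1k_input']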
                
                # Enhance results with cost analysis
                enhanced_results = {
                    **model_results,
                    "cost_analysis": {
                        "estimated_tokens": estimated_tokens,
                        "estimated_cost_usd": round(estimated_cost, 4),
                        "cost_per_1k_tokens": config['cost_per_1k_input']
                    },
                    "model_config": config
                }
                
                results_summary["individual_results"][model_name] = enhanced_results
                
                print(f"   ✅ Evaluation completed")
                print(f"   💰 Estimated cost: ${estimated_cost:.4f}")
                
                if 'metrics' in model_results:
                    print(f"   📈 Quality metrics calculated")
                    
            except Exception as e:
                print(f"   ❌ Evaluation failed: {e}")
                results_summary["individual_results"][model_name] = {"error": str(e)}
        
        # Create summary analysis
        results_summary["summary_metrics"] = self._create_comparison_summary(results_summary["individual_results"])
        
        self.comparison_results = results_summary
        return results_summary
    
    def _create_comparison_summary(self, individual_results: Dict) -> Dict:
        """Create summary comparison metrics."""
        
        summary = {
            "models_evaluated": len(individual_results),
            "successful_evaluations": 0,
            "total_estimated_cost": 0.0,
            "quality_comparison": {}
        }
        
        for model_name, results in individual_results.items():
            if "error" not in results:
                summary["successful_evaluations"] += 1
                
                if "cost_analysis" in results:
                    summary["total_estimated_cost"] += results["cost_analysis"]["estimated_cost_usd"]
                    
                if "metrics" in results:
                    summary["quality_comparison"][model_name] = results["metrics"]
        
        return summary
    
    def generate_comparison_report(self) -> str:
        """Generate a comprehensive comparison report."""
        
        if not self.comparison_results:
            return "No comparison results available. Run comparison first."
        
        report = []
        report.append("📊 MODEL COMPARISON REPORT")
        report.append("=" * 30)
        report.append(f"Timestamp: {self.comparison_results['timestamp']}")
        report.append(f"Dataset size: {self.comparison_results['dataset_size']} items")
        report.append(f"Models evaluated: {self.comparison_results['summary_metrics']['successful_evaluations']}")
        report.append(f"Total estimated cost: ${self.comparison_results['summary_metrics']['total_estimated_cost']:.4f}")
        
        report.append("\n💰 COST ANALYSIS:")
        for model_name, results in self.comparison_results["individual_results"].items():
            if "cost_analysis" in results:
                cost_info = results["cost_analysis"]
                report.append(f"   {model_name}: ${cost_info['estimated_cost_usd']:.4f} (${cost_info['cost_per_1k_tokens']:.3f}/1K tokens)")
        
        report.append("\n🎯 QUALITY METRICS:")
        quality_data = self.comparison_results['summary_metrics']['quality_comparison']
        for model_name, metrics in quality_data.items():
            report.append(f"   {model_name}:")
            for metric_name, score in metrics.items():
                if isinstance(score, (int, float)):
                    report.append(f"      {metric_name}: {score:.3f}")
        
        return "\n".join(report)

# Initialize the comparison framework
comparison_framework = ProductionModelComparison(foundry_runner)

print("🔧 Production Model Comparison Framework initialized")
print("📊 Ready for systematic model evaluation")
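
The cost estimate in `run_comprehensive_comparison` prices every token at the input rate. When a model both reads prompts and generates answers, the two rates usually differ, so a slightly finer estimate could split them. This is a sketch only: `estimate_cost_usd` is a hypothetical helper, the 4-characters-per-token ratio is a rough rule of thumb for English text, and the prices are the illustrative values from the configuration above.

```python
def estimate_cost_usd(
    input_chars: int,
    output_chars: int,
    cost_per_1k_input: float,
    cost_per_1k_output: float,
    chars_per_token: float = 4.0,
) -> float:
    """Estimate request cost from character counts, pricing input and output separately."""
    input_tokens = input_chars / chars_per_token
    output_tokens = output_chars / chars_per_token
    return (input_tokens / 1000) * cost_per_1k_input + (output_tokens / 1000) * cost_per_1k_output


# 8,000 prompt characters (~2,000 tokens) and 4,000 generated characters (~1,000 tokens)
cost = estimate_cost_usd(8000, 4000, cost_per_1k_input=0.03, cost_per_1k_output=0.06)
print(f"${cost:.4f}")
```

For billing-grade numbers, the `usage` field returned with each completion is a better source than any character-based heuristic.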

### Execute Comprehensive Model Comparison

In [None]:
# Set up evaluators for model comparison
from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator

print("🔧 Setting up evaluators for model comparison...")

model_config = azure_manager.get_model_config()

# Create evaluators (using fewer for workshop efficiency)
comparison_evaluators = {}

try:
    comparison_evaluators["relevance"] = RelevanceEvaluator(model_config=model_config)
    print("✅ Relevance evaluator created")
except Exception as e:
    print(f"⚠️ Relevance evaluator failed: {e}")

try:
    comparison_evaluators["coherence"] = CoherenceEvaluator(model_config=model_config)
    print("✅ Coherence evaluator created")
except Exception as e:
    print(f"⚠️ Coherence evaluator failed: {e}")

print(f"\n📊 Created {len(comparison_evaluators)} evaluators for comparison")

if comparison_evaluators:
    # Run comprehensive comparison
    print("\n🚀 Starting comprehensive model comparison...")
    print("   This will create separate evaluation runs in AI Foundry")
    
    comparison_results = comparison_framework.run_comprehensive_comparison(
        dataset=production_dataset,
        evaluators=comparison_evaluators
    )
    
    # Display comparison report
    print("\n" + comparison_framework.generate_comparison_report())
    
    # Show AI Foundry integration status
    execution_method = comparison_results.get('individual_results', {}).get('primary_model', {}).get('_execution_method', 'unknown')
    
    if execution_method == 'azure_ai_foundry_hybrid':
        print("\n🏢 MODEL COMPARISON RESULTS IN AI FOUNDRY:")
        print("   📊 Multiple evaluation datasets uploaded")
        print("   🔍 Each model comparison has its own dataset")
        print("   👀 View all results at: https://ai.azure.com")
        print("   📈 Compare quality metrics across models")
    
else:
    print("❌ No evaluators available for model comparison")
    print("   This is common in workshop environments")
    print("   The framework is ready for production use when evaluators are available")

## Part 5: Production Decision Analysis (10 min)

### Cost vs. Quality Trade-off Analysis

Real production decisions require balancing multiple factors:
- **Quality requirements**: Minimum acceptable scores per domain
- **Cost constraints**: Budget limitations and scaling economics
- **Latency requirements**: Response time SLAs
- **Throughput needs**: Requests per second capacity
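
The analyzer in this part applies a single quality threshold to the overall average. A per-domain check, as the first bullet suggests, might look like the sketch below. It assumes each evaluated row carries a `domain` and a numeric `score` on a 1 to 5 scale; `domain_quality_gate`, the thresholds, and the scores are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def domain_quality_gate(
    rows: List[Dict[str, Any]],
    thresholds: Dict[str, float],
    default_threshold: float = 3.5,
) -> Dict[str, Dict[str, Any]]:
    """Compare each domain's mean score with its own minimum acceptable score."""
    scores_by_domain = defaultdict(list)
    for row in rows:
        scores_by_domain[row["domain"]].append(row["score"])

    report = {}
    for domain, scores in scores_by_domain.items():
        mean_score = sum(scores) / len(scores)
        minimum = thresholds.get(domain, default_threshold)
        report[domain] = {
            "mean_score": round(mean_score, 2),
            "threshold": minimum,
            "passes": mean_score >= minimum,
        }
    return report


# Invented scores: healthcare is held to a stricter bar than customer support
scored_rows = [
    {"domain": "healthcare_information", "score": 4.0},
    {"domain": "healthcare_information", "score": 3.0},
    {"domain": "customer_support", "score": 4.0},
    {"domain": "customer_support", "score": 5.0},
]
gate = domain_quality_gate(scored_rows, thresholds={"healthcare_information": 4.0})
for domain, result in gate.items():
    print(domain, result)
```

In this example the overall average of 4.0 would clear a 3.5 threshold, yet the healthcare domain on its own falls short of its stricter minimum.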

In [None]:
# Advanced Production Analysis
class ProductionDecisionAnalyzer:
    """Analyze evaluation results for production deployment decisions."""
    
    def __init__(self, comparison_results: Dict):
        self.results = comparison_results
        
    def analyze_cost_quality_tradeoffs(self, 
                                     quality_threshold: float = 3.5,
                                     cost_budget_per_1k: float = 0.05) -> Dict:
        """Analyze cost vs quality trade-offs for production deployment."""
        
        print("💰 COST vs QUALITY ANALYSIS")
        print("=" * 30)
        print(f"Quality threshold: {quality_threshold}/5.0")
        print(f"Cost budget: ${cost_budget_per_1k:.3f} per 1K tokens")
        
        analysis = {
            "criteria": {
                "min_quality_score": quality_threshold,
                "max_cost_per_1k": cost_budget_per_1k
            },
            "model_analysis": {},
            "recommendations": []
        }
        
        for model_name, results in self.results.get("individual_results", {}).items():
            if "error" in results:
                continue
                
            model_analysis = {
                "meets_quality_threshold": False,
                "within_cost_budget": False,
                "average_quality_score": 0.0,
                "cost_per_1k_tokens": 0.0,
                "recommendation": "Not evaluated"
            }
            
            # Calculate average quality score
            if "metrics" in results:
                quality_scores = []
                for metric_name, score in results["metrics"].items():
                    if isinstance(score, (int, float)) and 'threshold' not in metric_name.lower():
                        quality_scores.append(score)
                
                if quality_scores:
                    avg_quality = sum(quality_scores) / len(quality_scores)
                    model_analysis["average_quality_score"] = avg_quality
                    model_analysis["meets_quality_threshold"] = avg_quality >= quality_threshold
            
            # Check cost constraints
            if "cost_analysis" in results:
                cost_per_1k = results["cost_analysis"]["cost_per_1k_tokens"]
                model_analysis["cost_per_1k_tokens"] = cost_per_1k
                model_analysis["within_cost_budget"] = cost_per_1k <= cost_budget_per_1k
            
            # Generate recommendation
            if model_analysis["meets_quality_threshold"] and model_analysis["within_cost_budget"]:
                model_analysis["recommendation"] = "✅ RECOMMENDED - Meets quality and cost requirements"
            elif model_analysis["meets_quality_threshold"]:
                model_analysis["recommendation"] = "⚠️ QUALITY GOOD - Exceeds cost budget"
            elif model_analysis["within_cost_budget"]:
                model_analysis["recommendation"] = "⚠️ COST EFFECTIVE - Below quality threshold"
            else:
                model_analysis["recommendation"] = "❌ NOT RECOMMENDED - Fails quality and cost requirements"
            
            analysis["model_analysis"][model_name] = model_analysis
        
        # Generate overall recommendations
        recommended_models = [name for name, analysis_data in analysis["model_analysis"].items() 
                            if "✅ RECOMMENDED" in analysis_data["recommendation"]]
        
        if recommended_models:
            analysis["recommendations"].append(f"Deploy: {', '.join(recommended_models)}")
        else:
            analysis["recommendations"].append("Consider adjusting quality thresholds or cost budgets")
            analysis["recommendations"].append("Evaluate additional model options")
        
        return analysis
    
    def generate_production_readiness_report(self) -> str:
        """Generate comprehensive production readiness report."""
        
        # Run analysis with production-typical thresholds
        analysis = self.analyze_cost_quality_tradeoffs(
            quality_threshold=3.5,  # Typical production minimum
            cost_budget_per_1k=0.05  # Example budget constraint
        )
        
        report = []
        report.append("🚀 PRODUCTION READINESS REPORT")
        report.append("=" * 35)
        
        report.append(f"\n📊 EVALUATION CRITERIA:")
        report.append(f"   Minimum quality score: {analysis['criteria']['min_quality_score']}/5.0")
        report.append(f"   Maximum cost per 1K tokens: ${analysis['criteria']['max_cost_per_1k']:.3f}")
        
        report.append(f"\n🎯 MODEL ANALYSIS:")
        for model_name, model_data in analysis["model_analysis"].items():
            report.append(f"   {model_name.upper()}:")
            report.append(f"      Quality Score: {model_data['average_quality_score']:.2f}/5.0")
            report.append(f"      Cost per 1K: ${model_data['cost_per_1k_tokens']:.4f}")
            report.append(f"      Status: {model_data['recommendation']}")
        
        report.append(f"\n💡 RECOMMENDATIONS:")
        for recommendation in analysis["recommendations"]:
            report.append(f"   • {recommendation}")
        
        report.append(f"\n🔄 NEXT STEPS:")
        report.append(f"   1. Validate results with larger dataset")
        report.append(f"   2. Test latency and throughput requirements")
        report.append(f"   3. Run A/B tests with recommended models")
        report.append(f"   4. Set up continuous evaluation monitoring")
        
        return "\n".join(report)

# Analyze results if available
if 'comparison_results' in locals() and comparison_results:
    analyzer = ProductionDecisionAnalyzer(comparison_results)
    
    print(analyzer.generate_production_readiness_report())
    
else:
    print("📊 PRODUCTION ANALYSIS FRAMEWORK")
    print("=" * 35)
    print("✅ Analysis framework ready")
    print("💡 Run model comparison first to generate analysis")
    print("\n🎯 Framework capabilities:")
    print("   • Cost vs Quality trade-off analysis")
    print("   • Production readiness report")
    print("   • Deployment recommendations")
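
Once several models have been evaluated, fixed thresholds are not the only way to read a cost versus quality trade-off. Another option is to keep only the models that no other model beats on both axes (a Pareto frontier). The sketch below illustrates the idea; `pareto_frontier`, the model names, and all numbers are hypothetical.

```python
from typing import Dict, List


def pareto_frontier(models: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the models that are not dominated on cost and quality.

    A model is dominated when another model is at least as cheap and at least
    as good, and strictly better on one of the two.
    """
    frontier = []
    for name, m in models.items():
        dominated = any(
            other["cost_per_1k"] <= m["cost_per_1k"]
            and other["quality"] >= m["quality"]
            and (other["cost_per_1k"] < m["cost_per_1k"] or other["quality"] > m["quality"])
            for other_name, other in models.items()
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Invented results for three candidate deployments
candidates = {
    "model_a": {"quality": 4.5, "cost_per_1k": 0.03},
    "model_b": {"quality": 3.9, "cost_per_1k": 0.002},
    "model_c": {"quality": 3.7, "cost_per_1k": 0.01},
}
frontier = pareto_frontier(candidates)
print(frontier)
```

Here `model_c` drops out because `model_b` is both cheaper and better, leaving a choice between the highest-quality and the cheapest option.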

## Part 6: AI Foundry Integration Summary (5 min)

### What You've Accomplished with AI Foundry

In this lab, you've built a production-ready evaluation system that leverages Azure AI Foundry for:

1. **Dataset Management**: Uploaded production-scale datasets to AI Foundry portal
2. **Evaluation Tracking**: Created traceable evaluation runs with metadata
3. **Model Comparison**: Systematic comparison with cost analysis
4. **Production Readiness**: Decision frameworks for deployment

In [None]:
# Final Lab 2 Summary and AI Foundry Integration Status
print("🎉 LAB 2 COMPLETION SUMMARY")
print("=" * 30)

# Count achievements
achievements = []

if 'production_dataset' in locals():
    achievements.append(f"Generated {len(production_dataset)} production-quality evaluation pairs")

if 'lab2_dataset_info' in locals() and lab2_dataset_info:
    achievements.append(f"Uploaded dataset '{lab2_dataset_info['name']}' to AI Foundry portal")
    achievements.append(f"Covered domains: {', '.join(lab2_dataset_info['domains'])}")

if 'comparison_results' in locals() and comparison_results:
    achievements.append("Completed systematic model comparison analysis")
    achievements.append("Generated production readiness recommendations")

achievements.append("Built scalable evaluation framework with AI Foundry integration")

for i, achievement in enumerate(achievements, 1):
    print(f"✅ {i}. {achievement}")

# AI Foundry Integration Status
print("\n🏢 AI FOUNDRY INTEGRATION STATUS:")
if foundry_status['ai_foundry_available']:
    print("   ✅ Active and functional")
    print("   📊 Evaluation datasets uploaded to portal")
    print("   🔗 View at: https://ai.azure.com")
    
    # List datasets created in this lab
    try:
        current_datasets = list(azure_manager.get_ai_foundry_client().datasets.list())
        lab2_datasets = [d for d in current_datasets if 'lab2' in d.name.lower()]
        print(f"   📁 Lab 2 datasets in portal: {len(lab2_datasets)}")
    except Exception:
        print("   📁 Dataset count unavailable")
else:
    print("   ℹ️ Not configured (evaluations ran locally)")
    print("   💡 Configure for enhanced portal integration")

print("\n🚀 NEXT STEPS:")
print("   📈 Ready for Lab 3: Custom Evaluators & Domain-Specific Evaluation")
print("   🔧 Built scalable foundation for advanced evaluation patterns")
print("   🏭 Production-ready evaluation framework established")

print("\n🎯 KEY SKILLS MASTERED:")
skills = [
    "Synthetic dataset generation at production scale",
    "AI Foundry integration for evaluation management",
    "Systematic model comparison methodologies", 
    "Cost vs quality trade-off analysis",
    "Production deployment decision frameworks"
]

for skill in skills:
    print(f"   ✅ {skill}")

print("\n💡 You now have a foundation for production-scale evaluation!")

## Troubleshooting & Next Steps

### Common Issues in Lab 2:

1. **Dataset Generation Failures**:
   - Rate limits: Reduce batch sizes and increase delays
   - Quota issues: Check your Azure OpenAI usage limits
   - Content filtering: Adjust domain topics if needed

2. **AI Foundry Upload Issues**:
   - Verify your AI Foundry project configuration
   - Check authentication and permissions
   - Ensure project endpoint is correct

3. **Model Comparison Problems**:
   - Start with smaller evaluation subsets
   - Monitor token usage and costs
   - Use appropriate delays between evaluations
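
For the rate-limit advice above, a common pattern is to retry with exponentially growing delays. The sketch below is one possible shape for such a wrapper, not the lab's own implementation: `call_with_backoff` and `flaky_generation` are hypothetical, and the demo passes a no-op `sleep` so it runs instantly.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() with exponential backoff plus a little random jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:  # in real code, catch only rate-limit errors
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay * 0.1)  # jitter avoids synchronized retries
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


attempts = {"count": 0}


def flaky_generation() -> str:
    """Stand-in for a generation request that is throttled twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"


result = call_with_backoff(flaky_generation, sleep=lambda seconds: None)
print(result, attempts["count"])
```

In the lab's generator, the wrapped call would be the `chat.completions.create` request, and the `except` clause would be narrowed to the client's rate-limit exception (for example `openai.RateLimitError` in the v1 `openai` package).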

### Ready for Lab 3:
In Lab 3, you'll learn to build **custom evaluators** that leverage the AI Foundry integration you've established, focusing on domain-specific evaluation logic and advanced evaluation patterns.

### Production Deployment:
Your Lab 2 evaluation framework provides the foundation for:
- Continuous evaluation monitoring
- A/B testing different models
- Cost optimization strategies
- Quality assurance workflows