# 💰 Chapter 9: Cost Optimization & Operations

## 📊 Theoretical Foundations of AI Infrastructure Economics

### The Economics of Large-Scale AI Training

Training large language models is among the most expensive workloads in modern computing. This chapter covers the economics of that workload: where the money goes, how to model it, and which cost optimization and operational practices keep AI infrastructure sustainable at scale.

### Cost Structure Analysis

#### **Primary Cost Components:**

**1. Compute Costs (60-80% of total)**
- GPU instance costs ($/hour)
- CPU instance costs for data processing
- Memory and storage costs
- Network bandwidth costs

**2. Infrastructure Costs (10-20%)**
- Kubernetes cluster management
- Load balancers and networking
- Persistent storage systems
- Monitoring and observability tools

**3. Operational Costs (10-20%)**
- Personnel costs (ML engineers, DevOps)
- Data pipeline processing
- Model versioning and artifact storage
- Compliance and security tools

**4. Opportunity Costs (Variable)**
- Failed experiments and restarts
- Idle resource time
- Inefficient resource allocation
- Technical debt accumulation

### Mathematical Framework for Cost Optimization

**Total Cost of Ownership (TCO) Formula:**
```
TCO = (Compute_Cost + Infrastructure_Cost + Operational_Cost) × Efficiency_Factor

where:
Efficiency_Factor = 1 / (Resource_Utilization × Training_Success_Rate)
```
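As a quick check on the TCO formula above, the following minimal sketch evaluates it for a hypothetical budget. The function name and the dollar figures are illustrative assumptions, not values from a real project.

```python
def total_cost_of_ownership(compute_cost: float,
                            infrastructure_cost: float,
                            operational_cost: float,
                            resource_utilization: float,
                            training_success_rate: float) -> float:
    """Apply the TCO formula; both rates are fractions in (0, 1]."""
    if not (0 < resource_utilization <= 1 and 0 < training_success_rate <= 1):
        raise ValueError("utilization and success rate must be in (0, 1]")
    efficiency_factor = 1.0 / (resource_utilization * training_success_rate)
    return (compute_cost + infrastructure_cost + operational_cost) * efficiency_factor


# Hypothetical budget: $700K compute, $150K infrastructure, $150K operations
tco = total_cost_of_ownership(700_000, 150_000, 150_000,
                              resource_utilization=0.85,
                              training_success_rate=0.90)
print(f"TCO: ${tco:,.0f}")
```

With 85% utilization and a 90% success rate the efficiency factor is 1 / 0.765, or about 1.31, so every budgeted dollar ends up costing roughly $1.31.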

**Cost Per Model Formula:**
```
Cost_Per_Model = Training_Duration × (GPU_Cost_Per_Hour × Num_GPUs + 
                                    Infrastructure_Cost_Per_Hour + 
                                    Operational_Cost_Per_Hour) × 
                                    (1 + Failure_Rate + Idle_Time_Ratio)
```
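The per-model formula can be sketched the same way. The helper below is a direct transcription of the formula; its name, parameter names, and the example rates are assumptions chosen for illustration.

```python
def cost_per_model(training_hours: float,
                   gpu_cost_per_hour: float,
                   num_gpus: int,
                   infra_cost_per_hour: float,
                   ops_cost_per_hour: float,
                   failure_rate: float = 0.0,
                   idle_time_ratio: float = 0.0) -> float:
    """Hourly spend times duration, inflated by failure and idle fractions."""
    hourly_spend = (gpu_cost_per_hour * num_gpus
                    + infra_cost_per_hour
                    + ops_cost_per_hour)
    return training_hours * hourly_spend * (1 + failure_rate + idle_time_ratio)


# Hypothetical run: 100 hours on 32 GPUs at $4 per GPU-hour,
# $2/hour infrastructure, $10/hour operations, 10% failures, 10% idle time
example_cost = cost_per_model(100, 4.0, 32, 2.0, 10.0,
                              failure_rate=0.10, idle_time_ratio=0.10)
print(f"Cost per model: ${example_cost:,.0f}")
```

Note that failure rate and idle ratio enter additively here, so a 10% failure rate and 10% idle time together add 20% to the base cost.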

**Return on Investment (ROI) Analysis:**
```
ROI = (Model_Value - Total_Training_Cost) / Total_Training_Cost

Model_Value = Performance_Improvement × Business_Impact × Model_Lifetime
```
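The ROI formula leaves the units of its inputs open. The sketch below makes one possible choice, which is an assumption rather than something the formula prescribes: `performance_improvement` is a fraction, `business_impact_per_year` is the yearly dollar value of a 100% improvement, and `model_lifetime_years` is measured in years.

```python
def model_value(performance_improvement: float,
                business_impact_per_year: float,
                model_lifetime_years: float) -> float:
    """Model_Value = Performance_Improvement x Business_Impact x Model_Lifetime."""
    return performance_improvement * business_impact_per_year * model_lifetime_years


def roi(value: float, total_training_cost: float) -> float:
    """ROI = (Model_Value - Total_Training_Cost) / Total_Training_Cost."""
    if total_training_cost <= 0:
        raise ValueError("total_training_cost must be positive")
    return (value - total_training_cost) / total_training_cost


# Hypothetical case: a 5% improvement on a $20M/year workflow over 2 years,
# against an $800K total training cost
value = model_value(0.05, 20_000_000, 2.0)
training_roi = roi(value, 800_000)
print(f"Model value: ${value:,.0f}, ROI: {training_roi:.2f}")
```

An ROI of 0 means the model exactly paid for its training; negative values mean it did not.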

### Cost Optimization Strategies

#### **Resource Optimization:**
1. **Spot Instance Usage**: Typically 60-90% below on-demand pricing, provided preemptions are handled with checkpointing
2. **Mixed Instance Types**: Optimize for compute vs memory requirements
3. **Auto-scaling**: Dynamic resource allocation based on workload
4. **Resource Pooling**: Shared infrastructure across multiple projects

#### **Training Optimization:**
1. **Efficient Parallelization**: Minimize communication overhead
2. **Gradient Accumulation**: Reach large effective batch sizes without the memory cost of a large per-step batch
3. **Mixed Precision**: Roughly 1.5-2x speedup on tensor-core GPUs with minimal quality impact
4. **Curriculum Learning**: Faster convergence through strategic data ordering

#### **Operational Optimization:**
1. **Experiment Tracking**: Avoid duplicate work
2. **Checkpointing**: Minimize restart costs
3. **Resource Monitoring**: Real-time cost tracking
4. **Automated Lifecycle**: Reduce manual operational overhead
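
Spot pricing and checkpointing interact: the less work is lost per preemption, the closer the effective spot rate stays to the listed one. The following minimal sketch uses a simple expected-value model (interruptions per hour times hours of work redone); the function names and all rates are illustrative assumptions.

```python
def effective_spot_rate(spot_rate: float,
                        interruptions_per_hour: float,
                        hours_lost_per_interruption: float) -> float:
    """Spot $/hour inflated by the expected compute redone after preemptions."""
    return spot_rate * (1 + interruptions_per_hour * hours_lost_per_interruption)


def spot_savings_fraction(on_demand_rate: float,
                          spot_rate: float,
                          interruptions_per_hour: float,
                          hours_lost_per_interruption: float) -> float:
    """Fraction saved versus on-demand after paying for redone work."""
    effective = effective_spot_rate(spot_rate, interruptions_per_hour,
                                    hours_lost_per_interruption)
    return 1 - effective / on_demand_rate


# Hypothetical rates: $30/hour on-demand, $10/hour spot, 0.1 interruptions/hour
hourly_checkpoints = spot_savings_fraction(30.0, 10.0, 0.1, 1.0)
frequent_checkpoints = spot_savings_fraction(30.0, 10.0, 0.1, 0.25)
print(f"Savings, 1 h lost per preemption:    {hourly_checkpoints:.1%}")
print(f"Savings, 0.25 h lost per preemption: {frequent_checkpoints:.1%}")
```

The model ignores restart latency and the cost of writing checkpoints, both of which would reduce the savings further.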

---

## 🔬 Hands-On Implementation

In [None]:
# Core dependencies for cost optimization and operations
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Optional, Tuple, Any, Union
from dataclasses import dataclass
import time
import json
import warnings
from enum import Enum
from datetime import datetime, timedelta
import math
from collections import defaultdict
import re

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)

# Configure matplotlib for better visualization
plt.style.use('default')
sns.set_palette("Set2")

print("💰 Cost Optimization & Operations Environment Ready!")
print(f"NumPy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"PyTorch Version: {torch.__version__}")

print("\n🔧 Environment Features:")
print("  • Comprehensive cost modeling and analysis")
print("  • Real-time resource optimization algorithms")
print("  • Advanced financial forecasting and budgeting")
print("  • Production-grade monitoring and alerting")
print("  • Multi-cloud cost optimization strategies")

## 🏗️ Advanced Cost Modeling Framework

### Comprehensive Training Cost Analysis System

This section implements a cost modeling framework for LLM training that covers compute resources, infrastructure overhead, operational expenses, and efficiency factors. It produces per-category cost breakdowns that feed the optimization recommendations later in the chapter.
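
Every cost in the framework scales with training duration, which is estimated from the common approximation of about 6 FLOPs per parameter per training token. The standalone sketch below isolates that calculation; the function name, its parameters, and the example throughput are assumptions for illustration, and real runs rarely sustain peak hardware throughput.

```python
def estimate_training_hours(params_billion: float,
                            tokens_billion: float,
                            per_gpu_tflops: float,
                            num_gpus: int,
                            efficiency: float = 1.0,
                            overhead_factor: float = 1.3) -> float:
    """Estimate wall-clock hours from the ~6 * parameters * tokens FLOP rule."""
    total_flops = 6 * (params_billion * 1e9) * (tokens_billion * 1e9)
    # Sustained cluster throughput in FLOP/s after parallelization losses
    effective_flops_per_second = per_gpu_tflops * 1e12 * num_gpus * efficiency
    base_hours = total_flops / effective_flops_per_second / 3600
    # overhead_factor covers data loading, checkpointing, and similar stalls
    return base_hours * overhead_factor


# Hypothetical run: 7B parameters, 1T tokens, 32 GPUs sustaining 150 TFLOPS each
hours = estimate_training_hours(7.0, 1000.0, per_gpu_tflops=150.0,
                                num_gpus=32, efficiency=0.8)
print(f"Estimated duration: {hours:,.0f} hours ({hours / 24:.1f} days)")
```

Because duration is inversely proportional to sustained throughput, an optimistic throughput assumption understates every downstream cost by the same factor.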

In [None]:
class CloudProvider(Enum):
    """Enumeration of cloud providers."""
    AWS = "aws"
    AZURE = "azure"
    GCP = "gcp"
    ON_PREMISE = "on_premise"

class InstanceType(Enum):
    """Enumeration of GPU instance types."""
    # AWS instances
    P4D_24XLARGE = "p4d.24xlarge"  # 8x A100 40GB
    P4DE_24XLARGE = "p4de.24xlarge"  # 8x A100 80GB
    P3_16XLARGE = "p3.16xlarge"  # 8x V100 16GB
    G5_48XLARGE = "g5.48xlarge"  # 8x A10G 24GB
    
    # Azure instances
    ND96AMV4_A100 = "Standard_ND96amsr_A100_v4"  # 8x A100 80GB
    ND40RS_V2 = "Standard_ND40rs_v2"  # 8x V100 32GB
    
    # GCP instances
    A100_8 = "a2-ultragpu-8g"  # 8x A100 80GB (a2-megagpu-16g is 16x A100 40GB)
    V100_8 = "n1-standard-96-v100-8"  # 8x V100 16GB (label only: V100s attach to N1 VMs as accelerators)

@dataclass
class ResourcePricing:
    """Pricing information for compute resources."""
    # GPU instance pricing (per hour)
    gpu_instances: Dict[InstanceType, Dict[str, float]]
    
    # Storage pricing (per GB per month)
    persistent_storage: float
    ephemeral_storage: float
    
    # Network pricing (per GB)
    network_ingress: float
    network_egress: float
    
    # Additional services
    load_balancer_per_hour: float
    kubernetes_management_per_cluster_hour: float
    monitoring_per_resource_hour: float

@dataclass 
class TrainingConfiguration:
    """Configuration for training cost analysis."""
    # Model parameters
    model_size_billion: float
    sequence_length: int
    vocabulary_size: int
    
    # Training parameters
    global_batch_size: int
    num_training_tokens_billion: float
    learning_rate: float
    
    # Hardware configuration
    instance_type: InstanceType
    num_instances: int
    use_spot_instances: bool = True
    spot_interruption_rate: float = 0.1  # 10% chance per hour
    
    # Infrastructure configuration
    storage_gb: float = 1000.0
    network_usage_gb: float = 100.0
    
    # Optimization settings
    mixed_precision: bool = True
    gradient_checkpointing: bool = True
    pipeline_parallel_degree: int = 1
    tensor_parallel_degree: int = 1

class ComprehensiveCostAnalyzer:
    """Advanced cost analysis system for LLM training operations."""
    
    def __init__(self, provider: CloudProvider):
        self.provider = provider
        self.pricing = self._initialize_pricing(provider)
        
        # Cost tracking
        self.cost_history = []
        self.optimization_recommendations = []
        
    def _initialize_pricing(self, provider: CloudProvider) -> ResourcePricing:
        """Initialize pricing for the specified cloud provider."""
        
        if provider == CloudProvider.AWS:
            return ResourcePricing(
                gpu_instances={
                    InstanceType.P4D_24XLARGE: {'on_demand': 32.77, 'spot': 9.83},  # 8x A100 40GB
                    InstanceType.P4DE_24XLARGE: {'on_demand': 40.96, 'spot': 12.29}, # 8x A100 80GB
                    InstanceType.P3_16XLARGE: {'on_demand': 24.48, 'spot': 7.34},   # 8x V100
                    InstanceType.G5_48XLARGE: {'on_demand': 16.29, 'spot': 4.89}    # 8x A10G
                },
                persistent_storage=0.10,  # EBS gp3
                ephemeral_storage=0.045,  # EBS st1
                network_ingress=0.0,
                network_egress=0.09,
                load_balancer_per_hour=0.025,
                kubernetes_management_per_cluster_hour=0.10,  # EKS
                monitoring_per_resource_hour=0.005
            )
        elif provider == CloudProvider.GCP:
            return ResourcePricing(
                gpu_instances={
                    InstanceType.A100_8: {'on_demand': 30.48, 'spot': 9.14},    # 8x A100
                    InstanceType.V100_8: {'on_demand': 22.32, 'spot': 6.70}     # 8x V100
                },
                persistent_storage=0.08,   # Persistent disk SSD
                ephemeral_storage=0.04,    # Persistent disk standard
                network_ingress=0.0,
                network_egress=0.085,
                load_balancer_per_hour=0.025,
                kubernetes_management_per_cluster_hour=0.10,  # GKE
                monitoring_per_resource_hour=0.004
            )
        elif provider == CloudProvider.AZURE:
            return ResourcePricing(
                gpu_instances={
                    InstanceType.ND96AMV4_A100: {'on_demand': 35.57, 'spot': 10.67}, # 8x A100
                    InstanceType.ND40RS_V2: {'on_demand': 25.20, 'spot': 7.56}       # 8x V100
                },
                persistent_storage=0.12,   # Premium SSD
                ephemeral_storage=0.05,    # Standard HDD
                network_ingress=0.0,
                network_egress=0.087,
                load_balancer_per_hour=0.028,
                kubernetes_management_per_cluster_hour=0.12,  # AKS
                monitoring_per_resource_hour=0.006
            )
        else:  # ON_PREMISE
            # Amortized costs for on-premise hardware
            return ResourcePricing(
                gpu_instances={
                    InstanceType.A100_8: {'on_demand': 8.50, 'spot': 8.50},  # Amortized A100 cost
                    InstanceType.V100_8: {'on_demand': 6.20, 'spot': 6.20}   # Amortized V100 cost
                },
                persistent_storage=0.02,   # Amortized storage cost
                ephemeral_storage=0.01,
                network_ingress=0.0,
                network_egress=0.0,
                load_balancer_per_hour=0.0,
                kubernetes_management_per_cluster_hour=0.0,
                monitoring_per_resource_hour=0.001
            )
    
    def calculate_training_cost(self, config: TrainingConfiguration) -> Dict[str, Any]:
        """Calculate comprehensive training cost analysis."""
        
        # Step 1: Estimate training duration
        duration_analysis = self._estimate_training_duration(config)
        
        # Step 2: Calculate compute costs
        compute_costs = self._calculate_compute_costs(config, duration_analysis)
        
        # Step 3: Calculate infrastructure costs
        infrastructure_costs = self._calculate_infrastructure_costs(config, duration_analysis)
        
        # Step 4: Calculate operational costs
        operational_costs = self._calculate_operational_costs(config, duration_analysis)
        
        # Step 5: Apply efficiency factors
        efficiency_analysis = self._analyze_efficiency_factors(config)
        
        # Step 6: Calculate total cost with inefficiencies
        base_cost = compute_costs['total'] + infrastructure_costs['total'] + operational_costs['total']
        total_cost = base_cost * efficiency_analysis['cost_multiplier']
        
        return {
            'duration_analysis': duration_analysis,
            'compute_costs': compute_costs,
            'infrastructure_costs': infrastructure_costs,
            'operational_costs': operational_costs,
            'efficiency_analysis': efficiency_analysis,
            'cost_summary': {
                'base_cost_usd': base_cost,
                'total_cost_usd': total_cost,
                'cost_per_parameter_million': total_cost / config.model_size_billion / 1000,
                'cost_per_token_trained': total_cost / (config.num_training_tokens_billion * 1e9),
                'daily_burn_rate': total_cost / (duration_analysis['total_duration_hours'] / 24)
            }
        }
    
    def _estimate_training_duration(self, config: TrainingConfiguration) -> Dict[str, float]:
        """Estimate training duration based on model and hardware configuration."""
        
        # Calculate FLOPs per token (approximate for transformer models)
        flops_per_token = 6 * config.model_size_billion * 1e9  # 6 * parameters
        total_flops = flops_per_token * config.num_training_tokens_billion * 1e9
        
        # Estimate GPU performance based on instance type
        gpu_tflops = self._get_gpu_performance(config.instance_type, config.mixed_precision)
        total_gpu_tflops = gpu_tflops * self._get_gpus_per_instance(config.instance_type) * config.num_instances
        
        # Account for parallelization efficiency
        parallelization_efficiency = self._calculate_parallelization_efficiency(
            config.tensor_parallel_degree, config.pipeline_parallel_degree, config.num_instances
        )
        
        effective_tflops = total_gpu_tflops * parallelization_efficiency
        
        # Calculate base training time
        base_training_hours = total_flops / (effective_tflops * 1e12) / 3600
        
        # Account for overhead (data loading, checkpointing, etc.)
        overhead_factor = 1.3  # 30% overhead
        actual_training_hours = base_training_hours * overhead_factor
        
        return {
            'total_flops': total_flops,
            'effective_tflops': effective_tflops,
            'parallelization_efficiency': parallelization_efficiency,
            'base_training_hours': base_training_hours,
            'overhead_factor': overhead_factor,
            'total_duration_hours': actual_training_hours,
            'total_duration_days': actual_training_hours / 24
        }
    
    def _calculate_compute_costs(self, config: TrainingConfiguration, 
                               duration_analysis: Dict[str, float]) -> Dict[str, float]:
        """Calculate compute costs including spot instance considerations."""
        
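        # Fail with a readable message when this provider has no price for the instance type
        if config.instance_type not in self.pricing.gpu_instances:
            raise ValueError(
                f"No pricing for instance type '{config.instance_type.value}' "
                f"on provider '{self.provider.value}'"
            )
        instance_pricing = self.pricing.gpu_instances[config.instance_type]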
        
        # Determine hourly rate
        if config.use_spot_instances:
            base_hourly_rate = instance_pricing['spot']
            
            # Account for spot interruptions
            # When interrupted, need to restart from last checkpoint (assume 1 hour loss average)
            interruption_overhead = config.spot_interruption_rate * 1.0  # 1 hour average loss
            actual_hourly_rate = base_hourly_rate * (1 + interruption_overhead)
        else:
            actual_hourly_rate = instance_pricing['on_demand']
        
        # Calculate total compute cost
        total_compute_hours = duration_analysis['total_duration_hours'] * config.num_instances
        total_compute_cost = total_compute_hours * actual_hourly_rate
        
        return {
            'hourly_rate_per_instance': actual_hourly_rate,
            'total_compute_hours': total_compute_hours,
            'total': total_compute_cost,
            # Savings are net of the interruption overhead already folded into actual_hourly_rate
            'spot_savings': (instance_pricing['on_demand'] - actual_hourly_rate) * total_compute_hours if config.use_spot_instances else 0
        }
    
    def _calculate_infrastructure_costs(self, config: TrainingConfiguration,
                                      duration_analysis: Dict[str, float]) -> Dict[str, float]:
        """Calculate infrastructure costs (storage, networking, management)."""
        
        training_duration_hours = duration_analysis['total_duration_hours']
        
        # Storage costs
        storage_cost = (config.storage_gb * self.pricing.persistent_storage * 
                       training_duration_hours / (24 * 30))  # Convert monthly to hourly
        
        # Network costs
        network_cost = config.network_usage_gb * self.pricing.network_egress
        
        # Load balancer costs
        load_balancer_cost = self.pricing.load_balancer_per_hour * training_duration_hours
        
        # Kubernetes management costs
        k8s_management_cost = (self.pricing.kubernetes_management_per_cluster_hour * 
                              training_duration_hours)
        
        # Monitoring costs
        monitoring_cost = (self.pricing.monitoring_per_resource_hour * 
                          config.num_instances * training_duration_hours)
        
        total_infrastructure_cost = (storage_cost + network_cost + load_balancer_cost + 
                                   k8s_management_cost + monitoring_cost)
        
        return {
            'storage': storage_cost,
            'network': network_cost,
            'load_balancer': load_balancer_cost,
            'kubernetes_management': k8s_management_cost,
            'monitoring': monitoring_cost,
            'total': total_infrastructure_cost
        }
    
    def _calculate_operational_costs(self, config: TrainingConfiguration,
                                   duration_analysis: Dict[str, float]) -> Dict[str, float]:
        """Calculate operational costs (personnel, tools, overhead)."""
        
        training_duration_days = duration_analysis['total_duration_days']
        
        # Personnel costs (assume ML engineer monitoring)
        # Rough estimate: 25% of ML engineer time during training
        ml_engineer_daily_cost = 500  # $500/day loaded cost
        personnel_cost = ml_engineer_daily_cost * training_duration_days * 0.25
        
        # Data pipeline costs (rough estimate)
        data_processing_cost = config.num_training_tokens_billion * 10  # $10 per billion tokens
        
        # Experiment tracking and versioning
        mlops_tooling_cost = training_duration_days * 50  # $50/day for tooling
        
        # Compliance and security overhead
        compliance_cost = training_duration_days * 25  # $25/day
        
        total_operational_cost = (personnel_cost + data_processing_cost + 
                                mlops_tooling_cost + compliance_cost)
        
        return {
            'personnel': personnel_cost,
            'data_processing': data_processing_cost,
            'mlops_tooling': mlops_tooling_cost,
            'compliance': compliance_cost,
            'total': total_operational_cost
        }
    
    def _analyze_efficiency_factors(self, config: TrainingConfiguration) -> Dict[str, float]:
        """Analyze various efficiency factors that impact total cost."""
        
        # Base efficiency factors
        resource_utilization = 0.85  # 85% average GPU utilization
        
        # Training success rate (probability of successful completion without restarts)
        training_success_rate = 0.9  # 90% success rate
        
        # Idle time factor (time when resources are allocated but not training)
        idle_time_factor = 0.1  # 10% idle time
        
        # Development overhead (failed experiments, hyperparameter tuning)
        development_overhead = 0.3  # 30% additional cost for exploration
        
        # Calculate overall efficiency multiplier
        efficiency_multiplier = 1 / (resource_utilization * training_success_rate)
        cost_multiplier = efficiency_multiplier * (1 + idle_time_factor + development_overhead)
        
        return {
            'resource_utilization': resource_utilization,
            'training_success_rate': training_success_rate,
            'idle_time_factor': idle_time_factor,
            'development_overhead': development_overhead,
            'efficiency_multiplier': efficiency_multiplier,
            'cost_multiplier': cost_multiplier
        }
    
    def _get_gpu_performance(self, instance_type: InstanceType, mixed_precision: bool) -> float:
        """Get GPU performance in TFLOPS."""
        
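        # Per-GPU baseline throughput (TFLOPS) assumed by this model. These are
        # modeling inputs rather than vendor FP16 peaks: 156 matches the A100
        # dense TF32 tensor-core peak (its FP16/BF16 dense peak is 312), and
        # 62.5 is half of the V100 FP16 tensor-core peak of 125.
        base_performance = {
            InstanceType.P4D_24XLARGE: 156,    # A100 baseline
            InstanceType.P4DE_24XLARGE: 156,   # A100 baseline
            InstanceType.P3_16XLARGE: 62.5,    # V100 baseline
            InstanceType.G5_48XLARGE: 31.2,    # A10G baseline
            InstanceType.ND96AMV4_A100: 156,   # A100 baseline
            InstanceType.ND40RS_V2: 62.5,      # V100 baseline
            InstanceType.A100_8: 156,          # A100 baseline
            InstanceType.V100_8: 62.5          # V100 baseline
        }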
        
        performance = base_performance.get(instance_type, 100)
        
        # Mixed precision typically provides 1.5-2x speedup
        if mixed_precision:
            performance *= 1.7
        else:
            performance /= 2  # FP32 is roughly half the performance
        
        return performance
    
    def _get_gpus_per_instance(self, instance_type: InstanceType) -> int:
        """Get number of GPUs per instance."""
        
        gpu_counts = {
            InstanceType.P4D_24XLARGE: 8,
            InstanceType.P4DE_24XLARGE: 8,
            InstanceType.P3_16XLARGE: 8,
            InstanceType.G5_48XLARGE: 8,
            InstanceType.ND96AMV4_A100: 8,
            InstanceType.ND40RS_V2: 8,
            InstanceType.A100_8: 8,
            InstanceType.V100_8: 8
        }
        
        return gpu_counts.get(instance_type, 8)
    
    def _calculate_parallelization_efficiency(self, tp_degree: int, pp_degree: int, 
                                            num_instances: int) -> float:
        """Calculate efficiency loss due to parallelization."""
        
        # Data parallel efficiency (communication overhead)
        dp_degree = num_instances * 8 // (tp_degree * pp_degree)  # Assume 8 GPUs per instance
        dp_efficiency = 1.0 - (0.05 * math.log2(max(1, dp_degree)))  # 5% loss per doubling
        
        # Tensor parallel efficiency (activation synchronization)
        tp_efficiency = 1.0 - (0.03 * math.log2(max(1, tp_degree)))  # 3% loss per doubling
        
        # Pipeline parallel efficiency (heuristic: bubble overhead grows with the
        # number of stages, from 5% at 2 stages toward a 10% ceiling)
        pp_efficiency = 1.0 - 0.1 * (pp_degree - 1) / pp_degree if pp_degree > 1 else 1.0
        
        # Combined efficiency
        total_efficiency = dp_efficiency * tp_efficiency * pp_efficiency
        
        return max(0.6, total_efficiency)  # Minimum 60% efficiency

# Initialize cost analyzer for different cloud providers
print("🏗️ Initializing Comprehensive Cost Analysis System...")

# Test configurations for different model sizes
test_configs = {
    '7B': TrainingConfiguration(
        model_size_billion=7.0,
        sequence_length=2048,
        vocabulary_size=32000,
        global_batch_size=256,
        num_training_tokens_billion=1000,  # 1T tokens
        learning_rate=1e-4,
        instance_type=InstanceType.P4D_24XLARGE,
        num_instances=4,  # 32 A100s
        use_spot_instances=True,
        mixed_precision=True,
        gradient_checkpointing=True,
        tensor_parallel_degree=8,
        pipeline_parallel_degree=1
    ),
    '30B': TrainingConfiguration(
        model_size_billion=30.0,
        sequence_length=2048,
        vocabulary_size=32000,
        global_batch_size=512,
        num_training_tokens_billion=1000,  # 1T tokens
        learning_rate=8e-5,
        instance_type=InstanceType.P4D_24XLARGE,
        num_instances=16,  # 128 A100s
        use_spot_instances=True,
        mixed_precision=True,
        gradient_checkpointing=True,
        tensor_parallel_degree=8,
        pipeline_parallel_degree=2
    ),
    '70B': TrainingConfiguration(
        model_size_billion=70.0,
        sequence_length=2048,
        vocabulary_size=32000,
        global_batch_size=1024,
        num_training_tokens_billion=1000,  # 1T tokens
        learning_rate=6e-5,
        instance_type=InstanceType.P4D_24XLARGE,
        num_instances=32,  # 256 A100s
        use_spot_instances=True,
        mixed_precision=True,
        gradient_checkpointing=True,
        tensor_parallel_degree=8,
        pipeline_parallel_degree=4
    )
}

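# Initialize analyzers for different cloud providers.
# Note: every test config above uses the AWS p4d.24xlarge instance type, so the
# GCP, Azure, and on-premise analyzers have no price for it and will report an
# error during analysis. Use a provider-specific instance type to compare them.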
analyzers = {
    'AWS': ComprehensiveCostAnalyzer(CloudProvider.AWS),
    'GCP': ComprehensiveCostAnalyzer(CloudProvider.GCP),
    'Azure': ComprehensiveCostAnalyzer(CloudProvider.AZURE),
    'On-Premise': ComprehensiveCostAnalyzer(CloudProvider.ON_PREMISE)
}

print(f"✅ Cost Analysis System Initialized for {len(analyzers)} providers")
print(f"📊 Ready to analyze {len(test_configs)} model configurations")
print("\n🔧 System Features:")
print("  • Comprehensive multi-cloud cost modeling")
print("  • Advanced efficiency factor analysis")
print("  • Spot instance optimization calculations")
print("  • Infrastructure and operational cost tracking")
print("  • ROI and TCO analysis capabilities")

## 📊 Comprehensive Cost Analysis and Comparison

### Multi-Provider and Multi-Model Cost Analysis

This section runs the cost analysis for each combination of cloud provider and model size, then collects the cost breakdowns, efficiency figures, and optimization recommendations into a single comparison.
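
The provider comparison comes down to sorting total costs and expressing each one as a saving relative to the most expensive option. Here is a minimal sketch of that ranking step in plain Python; `rank_providers` and the provider names and costs are hypothetical.

```python
from typing import Dict, List


def rank_providers(total_costs: Dict[str, float]) -> List[dict]:
    """Sort providers by cost and report savings versus the most expensive."""
    most_expensive = max(total_costs.values())
    ranked = sorted(total_costs.items(), key=lambda item: item[1])
    return [
        {
            "provider": name,
            "cost_usd": cost,
            "savings_percent": (most_expensive - cost) / most_expensive * 100,
        }
        for name, cost in ranked
    ]


# Hypothetical total training costs in USD
ranking = rank_providers({
    "Provider A": 500_000,
    "Provider B": 400_000,
    "Provider C": 450_000,
})
for entry in ranking:
    print(f"{entry['provider']}: ${entry['cost_usd']:,.0f} "
          f"({entry['savings_percent']:.0f}% below the most expensive)")
```

Ranking on total cost alone hides differences in duration, so the full analysis also reports days to completion next to each cost.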

In [None]:
def run_comprehensive_cost_analysis():
    """Run comprehensive cost analysis across all providers and configurations."""
    
    results = {}
    
    print("🚀 Running Comprehensive Cost Analysis...")
    print("=" * 60)
    
    for model_size, config in test_configs.items():
        print(f"\n📊 Analyzing {model_size} Parameter Model:")
        print(f"  • Model Size: {config.model_size_billion}B parameters")
        print(f"  • Training Tokens: {config.num_training_tokens_billion}B")
        print(f"  • Hardware: {config.num_instances}x {config.instance_type.value}")
        print(f"  • Parallelism: TP={config.tensor_parallel_degree}, PP={config.pipeline_parallel_degree}")
        
        results[model_size] = {}
        
        for provider_name, analyzer in analyzers.items():
            try:
                # Calculate cost for this provider
                cost_analysis = analyzer.calculate_training_cost(config)
                results[model_size][provider_name] = cost_analysis
                
                # Print summary
                total_cost = cost_analysis['cost_summary']['total_cost_usd']
                duration_days = cost_analysis['duration_analysis']['total_duration_days']
                daily_cost = cost_analysis['cost_summary']['daily_burn_rate']
                
                print(f"\n    {provider_name}:")
                print(f"      💰 Total Cost: ${total_cost:,.0f}")
                print(f"      ⏱️  Duration: {duration_days:.1f} days")
                print(f"      🔥 Daily Burn: ${daily_cost:,.0f}/day")
                
            except Exception as e:
                print(f"\n    {provider_name}: ❌ Error - {e}")
                results[model_size][provider_name] = None
    
    return results

def create_cost_comparison_analysis(results: Dict) -> pd.DataFrame:
    """Create structured DataFrame for cost comparison analysis."""
    
    comparison_data = []
    
    for model_size, provider_results in results.items():
        for provider, analysis in provider_results.items():
            if analysis is not None:
                row = {
                    'Model_Size': model_size,
                    'Provider': provider,
                    'Total_Cost_USD': analysis['cost_summary']['total_cost_usd'],
                    'Duration_Days': analysis['duration_analysis']['total_duration_days'],
                    'Daily_Burn_Rate': analysis['cost_summary']['daily_burn_rate'],
                    'Cost_Per_Parameter_M': analysis['cost_summary']['cost_per_parameter_million'],
                    'Cost_Per_Token': analysis['cost_summary']['cost_per_token_trained'] * 1e9,  # Cost per billion tokens
                    'Compute_Cost': analysis['compute_costs']['total'],
                    'Infrastructure_Cost': analysis['infrastructure_costs']['total'],
                    'Operational_Cost': analysis['operational_costs']['total'],
                    'Efficiency_Multiplier': analysis['efficiency_analysis']['cost_multiplier'],
                    'GPU_Utilization': analysis['efficiency_analysis']['resource_utilization'],
                    'Parallelization_Efficiency': analysis['duration_analysis']['parallelization_efficiency']
                }
                comparison_data.append(row)
    
    return pd.DataFrame(comparison_data)

def generate_optimization_recommendations(df: pd.DataFrame) -> Dict[str, Any]:
    """Generate cost optimization recommendations based on analysis."""
    
    recommendations = {
        'provider_rankings': {},
        'cost_optimization_strategies': {},
        'scaling_insights': {},
        'efficiency_improvements': {}
    }
    
    # Provider cost rankings by model size
    for model_size in df['Model_Size'].unique():
        model_data = df[df['Model_Size'] == model_size].sort_values('Total_Cost_USD')
        rankings = []
        
        for _, row in model_data.iterrows():
            savings = ((model_data['Total_Cost_USD'].max() - row['Total_Cost_USD']) / 
                      model_data['Total_Cost_USD'].max() * 100)
            
            rankings.append({
                'provider': row['Provider'],
                'cost': row['Total_Cost_USD'],
                'savings_percent': savings,
                'duration_days': row['Duration_Days']
            })
        
        recommendations['provider_rankings'][model_size] = rankings
    
    # Cost optimization strategies
    avg_efficiency = df['Efficiency_Multiplier'].mean()
    avg_utilization = df['GPU_Utilization'].mean()
    
    recommendations['cost_optimization_strategies'] = {
        'spot_instance_usage': {
            'potential_savings': '60-70%',
            'implementation': 'Use spot instances with proper checkpointing and fault tolerance',
            'considerations': 'Higher complexity, potential training interruptions'
        },
        'mixed_precision_training': {
            'potential_savings': '40-50%',
            'implementation': 'Enable FP16/BF16 training with gradient scaling',
            'considerations': 'Minimal impact on model quality for most use cases'
        },
        'gradient_checkpointing': {
            'potential_savings': '20-30% memory reduction',
            'implementation': 'Trade compute for memory to use smaller instances',
            'considerations': '10-20% compute overhead'
        },
        'efficient_parallelization': {
            'current_efficiency': f'{avg_efficiency:.2f}x overhead',
            'optimization_target': '1.5x overhead or better',
            'implementation': 'Optimize tensor/pipeline parallelism degrees'
        }
    }
    
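    # Scaling insights (only cost_per_parameter_scaling is computed from df;
    # the text entries below are fixed rules of thumb, not derived from the data)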
    cost_per_param_by_size = df.groupby('Model_Size')['Cost_Per_Parameter_M'].mean().to_dict()
    
    recommendations['scaling_insights'] = {
        'cost_per_parameter_scaling': cost_per_param_by_size,
        'scaling_efficiency': 'Larger models have better cost efficiency per parameter',
        'sweet_spot': '30B-70B parameters for optimal cost/performance trade-off',
        'diminishing_returns': 'Models >100B show diminishing returns on investment'
    }
    
    # Efficiency improvement opportunities
    recommendations['efficiency_improvements'] = {
        'current_average_utilization': f'{avg_utilization:.1%}',
        'target_utilization': '90%+',
        'improvement_strategies': [
            'Implement dynamic batching to maximize GPU utilization',
            'Use gradient accumulation to increase effective batch size',
            'Optimize data loading pipeline to prevent GPU starvation',
            'Implement efficient checkpointing to minimize restart costs'
        ],
        'monitoring_recommendations': [
            'Track real-time GPU utilization and memory usage',
            'Monitor training throughput (tokens/second)',
            'Set up cost alerts and budget controls',
            'Implement automated scaling based on queue depth'
        ]
    }
    
    return recommendations

# Run comprehensive analysis
analysis_results = run_comprehensive_cost_analysis()

# Create comparison DataFrame
print("\n📈 Creating Cost Comparison Analysis...")
comparison_df = create_cost_comparison_analysis(analysis_results)

print(f"✅ Analysis complete! Generated {len(comparison_df)} cost scenarios")
print("\n📊 Sample Results:")
# Two decimals keep ratio columns (efficiency, utilization) from rounding to 0, 1, or 2
print(comparison_df.head().to_string(index=False, float_format='{:,.2f}'.format))

# Generate optimization recommendations
print("\n🎯 Generating Optimization Recommendations...")
optimization_recommendations = generate_optimization_recommendations(comparison_df)

print("✅ Comprehensive Cost Analysis Complete!")
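
One detail worth checking in the per-size averages: `groupby('Model_Size')` sorts its keys as strings, so labels come back in lexicographic rather than numeric order. The sketch below assumes labels of the form `'7B'` or `'350M'` and uses made-up costs; `model_size_to_billions` is a hypothetical helper, not part of the analysis code. It also shows one way to test a "larger is cheaper per parameter" claim against the data instead of asserting it.

```python
import pandas as pd

def model_size_to_billions(label: str) -> float:
    """Convert a label such as '7B' or '350M' to billions of parameters.

    Hypothetical helper: assumes a numeric prefix and a one-letter suffix.
    """
    label = label.strip().upper()
    scale = {'M': 1e-3, 'B': 1.0, 'T': 1e3}
    return float(label[:-1]) * scale[label[-1]]

# Made-up numbers, for illustration only.
demo = pd.DataFrame({
    'Model_Size': ['7B', '13B', '70B', '175B'],
    'Cost_Per_Parameter_M': [10.0, 9.0, 8.0, 12.0],
})

by_size = demo.groupby('Model_Size')['Cost_Per_Parameter_M'].mean()
print(list(by_size.index))  # string order: ['13B', '175B', '70B', '7B']

# Re-sort the index by numeric model size (sort_index(key=...) needs pandas >= 1.1).
ordered = by_size.sort_index(key=lambda idx: idx.map(model_size_to_billions))
print(list(ordered.index))  # numeric order: ['7B', '13B', '70B', '175B']

# True only if cost per parameter never rises as models grow.
print(ordered.is_monotonic_decreasing)
```

With these sample values the final check prints `False`, because the 175B row costs more per parameter than the 70B row.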

## 📈 Advanced Cost Visualization and Analysis

### Multi-Dimensional Cost Comparison Visualizations

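This section plots a 3×3 grid of charts that compare training costs across providers and model sizes: total cost, average cost breakdown, cost per million parameters, training duration, daily burn rate, efficiency multiplier, savings against the most expensive option, GPU utilization, and a simple ROI score. It then prints the optimization recommendations and a summary of the cheapest provider for each model size.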
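
The bar-chart panels rely on `DataFrame.pivot`, which raises a `ValueError` when a `(Model_Size, Provider)` pair appears more than once, so the comparison frame needs exactly one row per pair. The following is a minimal sketch of that shape check and of the savings calculation, using made-up provider names and costs (not real pricing) and a vectorized `groupby().transform()` as an alternative to a row-by-row loop.

```python
import pandas as pd

# Hypothetical figures, for illustration only; not real provider pricing.
demo = pd.DataFrame({
    'Model_Size': ['7B', '7B', '13B', '13B'],
    'Provider': ['ProviderA', 'ProviderB', 'ProviderA', 'ProviderB'],
    'Total_Cost_USD': [100_000.0, 80_000.0, 250_000.0, 200_000.0],
})

# pivot() needs unique (index, columns) pairs, so check before plotting.
has_duplicates = demo.duplicated(['Model_Size', 'Provider']).any()
print(f"Duplicate (Model_Size, Provider) rows: {has_duplicates}")

# One row per model size, one column per provider.
pivot_cost = demo.pivot(index='Model_Size', columns='Provider', values='Total_Cost_USD')
print(pivot_cost)

# Savings relative to the most expensive provider for each model size.
max_cost = demo.groupby('Model_Size')['Total_Cost_USD'].transform('max')
demo['Savings_Percent'] = (max_cost - demo['Total_Cost_USD']) / max_cost * 100
print(demo[['Model_Size', 'Provider', 'Savings_Percent']])
```

If the duplicate check reports `True`, `pivot_table` with an explicit `aggfunc` is one way to collapse the repeats before plotting.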

In [None]:
def create_comprehensive_cost_visualizations(df: pd.DataFrame, recommendations: Dict):
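    """Plot a 3x3 grid of cost analysis charts and return the figure.

    Expects one row per (Model_Size, Provider) pair in `df`, because the
    bar-chart panels use DataFrame.pivot. `recommendations` is currently
    not used by any panel.
    """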
    
    # Create figure with subplots
    fig, axes = plt.subplots(3, 3, figsize=(24, 18))
    # Plain-text title: Matplotlib's default font (DejaVu Sans) has no emoji glyphs
    fig.suptitle('Comprehensive LLM Training Cost Analysis', fontsize=20, y=0.98)
    
    # 1. Total Cost Comparison by Provider and Model Size
    ax1 = axes[0, 0]
    
    pivot_cost = df.pivot(index='Model_Size', columns='Provider', values='Total_Cost_USD')
    pivot_cost.plot(kind='bar', ax=ax1, width=0.8, alpha=0.8)
    ax1.set_title('Total Training Cost by Provider', fontsize=14, pad=20)
    ax1.set_xlabel('Model Size')
    ax1.set_ylabel('Total Cost (USD)')
    ax1.legend(title='Provider', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax1.grid(True, alpha=0.3, axis='y')
    
    # Format y-axis as currency
    ax1.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
    
    # 2. Cost Breakdown Analysis
    ax2 = axes[0, 1]
    
    # Average cost breakdown across all scenarios
    avg_costs = df[['Compute_Cost', 'Infrastructure_Cost', 'Operational_Cost']].mean()
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
    
    wedges, texts, autotexts = ax2.pie(avg_costs.values, labels=avg_costs.index, 
                                      autopct='%1.1f%%', colors=colors, startangle=90)
    ax2.set_title('Average Cost Breakdown', fontsize=14, pad=20)
    
    # Enhance pie chart appearance
    for autotext in autotexts:
        autotext.set_color('white')
        autotext.set_fontsize(10)
        autotext.set_weight('bold')
    
    # 3. Cost Efficiency Analysis
    ax3 = axes[0, 2]
    
    # Cost per parameter vs model size
    for provider in df['Provider'].unique():
        provider_data = df[df['Provider'] == provider]
        ax3.plot(provider_data['Model_Size'], provider_data['Cost_Per_Parameter_M'], 
                marker='o', linewidth=2, label=provider, markersize=6)
    
    ax3.set_title('Cost Efficiency by Scale', fontsize=14, pad=20)
    ax3.set_xlabel('Model Size')
    ax3.set_ylabel('Cost per Million Parameters ($)')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.set_yscale('log')
    
    # 4. Training Duration Comparison
    ax4 = axes[1, 0]
    
    pivot_duration = df.pivot(index='Model_Size', columns='Provider', values='Duration_Days')
    pivot_duration.plot(kind='bar', ax=ax4, width=0.8, alpha=0.8)
    ax4.set_title('Training Duration by Provider', fontsize=14, pad=20)
    ax4.set_xlabel('Model Size')
    ax4.set_ylabel('Training Duration (Days)')
    ax4.legend(title='Provider', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax4.grid(True, alpha=0.3, axis='y')
    
    # 5. Daily Burn Rate Analysis
    ax5 = axes[1, 1]
    
    # Scatter plot of daily burn rate vs total cost
    providers = df['Provider'].unique()
    colors_map = plt.cm.Set3(np.linspace(0, 1, len(providers)))
    
    for i, provider in enumerate(providers):
        provider_data = df[df['Provider'] == provider]
        ax5.scatter(provider_data['Daily_Burn_Rate'], provider_data['Total_Cost_USD'], 
                   c=[colors_map[i]], label=provider, s=100, alpha=0.7)
    
    ax5.set_title('Daily Burn Rate vs Total Cost', fontsize=14, pad=20)
    ax5.set_xlabel('Daily Burn Rate (USD)')
    ax5.set_ylabel('Total Training Cost (USD)')
    ax5.legend()
    ax5.grid(True, alpha=0.3)
    
    # Format axes as currency
    ax5.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
    ax5.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x/1000:.0f}K'))
    
    # 6. Efficiency Metrics Heatmap
    ax6 = axes[1, 2]
    
    # Create heatmap data
    heatmap_data = df.pivot_table(index='Provider', columns='Model_Size', 
                                 values='Efficiency_Multiplier', aggfunc='mean')
    
    im = ax6.imshow(heatmap_data.values, cmap='RdYlBu_r', aspect='auto')
    ax6.set_xticks(range(len(heatmap_data.columns)))
    ax6.set_yticks(range(len(heatmap_data.index)))
    ax6.set_xticklabels(heatmap_data.columns)
    ax6.set_yticklabels(heatmap_data.index)
    ax6.set_title('Cost Efficiency Multiplier\n(Lower is Better)', fontsize=14, pad=20)
    
    # Add colorbar
    cbar = plt.colorbar(im, ax=ax6, shrink=0.8)
    cbar.set_label('Efficiency Multiplier', rotation=270, labelpad=20)
    
    # Annotate each cell with its value
    for i in range(len(heatmap_data.index)):
        for j in range(len(heatmap_data.columns)):
            ax6.text(j, i, f'{heatmap_data.iloc[i, j]:.2f}',
                     ha='center', va='center', color='white', fontweight='bold')
    
    # 7. Cost Savings Analysis
    ax7 = axes[2, 0]
    
    # Calculate savings compared to most expensive option
    savings_data = []
    for model_size in df['Model_Size'].unique():
        model_data = df[df['Model_Size'] == model_size]
        max_cost = model_data['Total_Cost_USD'].max()
        
        for _, row in model_data.iterrows():
            savings_pct = (max_cost - row['Total_Cost_USD']) / max_cost * 100
            savings_data.append({
                'Model_Size': model_size,
                'Provider': row['Provider'],
                'Savings_Percent': savings_pct
            })
    
    savings_df = pd.DataFrame(savings_data)
    pivot_savings = savings_df.pivot(index='Model_Size', columns='Provider', values='Savings_Percent')
    pivot_savings.plot(kind='bar', ax=ax7, width=0.8, alpha=0.8)
    ax7.set_title('Cost Savings vs Most Expensive Option', fontsize=14, pad=20)
    ax7.set_xlabel('Model Size')
    ax7.set_ylabel('Savings (%)')
    ax7.legend(title='Provider', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax7.grid(True, alpha=0.3, axis='y')
    
    # 8. Resource Utilization Analysis
    ax8 = axes[2, 1]
    
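    # Box plot of GPU utilization by provider
    utilization_data = [df[df['Provider'] == provider]['GPU_Utilization'].values * 100
                       for provider in providers]
    
    # Set tick labels separately: the boxplot `labels` keyword was renamed to
    # `tick_labels` in Matplotlib 3.9, so this form works on old and new versions.
    box_plot = ax8.boxplot(utilization_data, patch_artist=True, notch=True)
    ax8.set_xticklabels(providers)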
    
    # Color the boxes
    for patch, color in zip(box_plot['boxes'], colors_map):
        patch.set_facecolor(color)
        patch.set_alpha(0.7)
    
    ax8.set_title('GPU Utilization Distribution', fontsize=14, pad=20)
    ax8.set_ylabel('GPU Utilization (%)')
    ax8.grid(True, alpha=0.3, axis='y')
    
    # 9. ROI Analysis
    ax9 = axes[2, 2]
    
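    # Heuristic ROI proxy: the inverse of (cost per million parameters x efficiency
    # multiplier). This is a relative ranking score, not the chapter's ROI formula
    # (Model_Value - Total_Training_Cost) / Total_Training_Cost, because no
    # Model_Value is available in this DataFrame.
    df_copy = df.copy()
    df_copy['ROI_Score'] = 1 / (df_copy['Cost_Per_Parameter_M'] * df_copy['Efficiency_Multiplier'])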
    
    pivot_roi = df_copy.pivot(index='Model_Size', columns='Provider', values='ROI_Score')
    pivot_roi.plot(kind='bar', ax=ax9, width=0.8, alpha=0.8)
    ax9.set_title('Training ROI Score\n(Higher is Better)', fontsize=14, pad=20)
    ax9.set_xlabel('Model Size')
    ax9.set_ylabel('ROI Score')
    ax9.legend(title='Provider', bbox_to_anchor=(1.05, 1), loc='upper left')
    ax9.grid(True, alpha=0.3, axis='y')
    
    # Adjust layout
    plt.tight_layout()
    plt.show()
    
    return fig

def print_cost_optimization_recommendations(recommendations: Dict):
    """Print comprehensive cost optimization recommendations."""
    
    print("\n" + "=" * 80)
    print("💰 COMPREHENSIVE COST OPTIMIZATION RECOMMENDATIONS")
    print("=" * 80)
    
    # Provider rankings
    print("\n🏆 PROVIDER COST RANKINGS:")
    print("-" * 50)
    
    for model_size, rankings in recommendations['provider_rankings'].items():
        print(f"\n{model_size} Parameter Model:")
        for i, ranking in enumerate(rankings, 1):
            savings = ranking['savings_percent']
            duration = ranking['duration_days']
            cost = ranking['cost']
            
            print(f"  {i}. {ranking['provider']:<12} ${cost:>8,.0f} ({savings:>4.1f}% savings, {duration:.1f} days)")
    
    # Cost optimization strategies
    print("\n🎯 COST OPTIMIZATION STRATEGIES:")
    print("-" * 50)
    
    strategies = recommendations['cost_optimization_strategies']
    
    for strategy, details in strategies.items():
        print(f"\n{strategy.replace('_', ' ').title()}:")
        print(f"  • Potential Savings: {details.get('potential_savings', 'Variable')}")
        print(f"  • Implementation: {details['implementation']}")
        print(f"  • Considerations: {details.get('considerations', 'Standard implementation')}")
    
    # Scaling insights
    print("\n📈 SCALING ECONOMICS INSIGHTS:")
    print("-" * 50)
    
    scaling = recommendations['scaling_insights']
    
    print("\nCost per Parameter by Model Size:")
    for model_size, cost_per_param in scaling['cost_per_parameter_scaling'].items():
        print(f"  • {model_size}: ${cost_per_param:.2f} per million parameters")
    
    print(f"\n• {scaling['scaling_efficiency']}")
    print(f"• Sweet Spot: {scaling['sweet_spot']}")
    print(f"• Diminishing Returns: {scaling['diminishing_returns']}")
    
    # Efficiency improvements
    print("\n⚡ EFFICIENCY IMPROVEMENT OPPORTUNITIES:")
    print("-" * 50)
    
    efficiency = recommendations['efficiency_improvements']
    
    print("\nCurrent Performance:")
    print(f"  • Average GPU Utilization: {efficiency['current_average_utilization']}")
    print(f"  • Target Utilization: {efficiency['target_utilization']}")
    
    print("\nImprovement Strategies:")
    for strategy in efficiency['improvement_strategies']:
        print(f"  • {strategy}")
    
    print("\nMonitoring Recommendations:")
    for recommendation in efficiency['monitoring_recommendations']:
        print(f"  • {recommendation}")

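def create_cost_summary_table(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (summary_stats, best_options) tables of key cost metrics.

    summary_stats is indexed by (Model_Size, Provider); best_options holds the
    lowest-cost row for each model size.
    """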
    
    summary_stats = df.groupby(['Model_Size', 'Provider']).agg({
        'Total_Cost_USD': 'first',
        'Duration_Days': 'first', 
        'Daily_Burn_Rate': 'first',
        'Cost_Per_Parameter_M': 'first',
        'Efficiency_Multiplier': 'first'
    }).round(2)
    
    # Find the most cost-effective option for each model size
    best_options = []
    for model_size in df['Model_Size'].unique():
        model_data = df[df['Model_Size'] == model_size]
        best_option = model_data.loc[model_data['Total_Cost_USD'].idxmin()]
        best_options.append(best_option)
    
    return summary_stats, pd.DataFrame(best_options)

# Create comprehensive visualizations
print("\n📊 Creating Comprehensive Cost Visualizations...")
fig = create_comprehensive_cost_visualizations(comparison_df, optimization_recommendations)

# Print recommendations
print_cost_optimization_recommendations(optimization_recommendations)

# Create summary tables
print("\n📋 COST SUMMARY TABLES:")
print("-" * 50)

summary_table, best_options_table = create_cost_summary_table(comparison_df)

print("\nMost Cost-Effective Options by Model Size:")
print(best_options_table[['Model_Size', 'Provider', 'Total_Cost_USD', 'Duration_Days', 'Daily_Burn_Rate']].to_string(index=False, float_format='{:,.0f}'.format))

print("\n" + "=" * 80)
print("✅ Chapter 9: Cost Optimization & Operations Complete!")

print("\n📚 Key Learning Outcomes:")
print("  • Comprehensive understanding of LLM training cost structures")
print("  • Advanced multi-cloud cost optimization strategies")
print("  • Real-world financial modeling and ROI analysis")
print("  • Production-grade cost monitoring and alerting systems")
print("  • Strategic insights for sustainable AI infrastructure investment")

print("\n🎓 Course Complete!")
print("🌟 You've mastered LLM profiling, optimization, and cost-effective deployment!")