# Module 08: Cloud Cost Optimization Strategies

**Difficulty**: ⭐⭐
**Estimated Time**: 75 minutes
**Prerequisites**: 
- [Module 00: Introduction to Cloud ML Services](00_introduction_to_cloud_ml_services.ipynb)
- [Module 01: AWS SageMaker Basics](01_aws_sagemaker_basics.ipynb)
- [Module 07: Serverless ML](07_serverless_ml.ipynb)
- Basic understanding of cloud pricing models

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand cloud ML pricing models across AWS, Azure, and GCP
2. Maximize free tier benefits for learning and small projects
3. Use Spot/Preemptible instances for training to reduce costs by 70-90%
4. Implement auto-scaling and right-sizing strategies
5. Optimize storage costs with lifecycle policies
6. Set up cost monitoring, alerts, and budgets
7. Use cost allocation tags for tracking expenses
8. Make informed decisions between on-demand, reserved, and spot instances

## Why Cost Optimization Matters

Cloud costs can spiral out of control quickly:
- 🚨 **Forgotten resources**: Leaving an ml.p3.2xlarge instance running overnight (12 hours at about $4.28/hr) ≈ $51
- 🚨 **Over-provisioning**: Using ml.m5.xlarge when ml.t3.medium suffices ≈ 4x unnecessary cost
- 🚨 **Inefficient storage**: Keeping rarely accessed data in S3 Standard instead of Glacier Deep Archive ≈ 23x the storage cost
- 🚨 **Missed free tier**: Not using Lambda's monthly free allowance (1M requests plus 400,000 GB-seconds) ≈ $80/year wasted

**Good news**: With proper strategies, you can run ML workloads for < $10/month or even FREE for learning!
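
The arithmetic behind the first two bullets can be sketched as follows. The hourly rates are the approximate us-east-1 figures used in this notebook's pricing table, and `idle_cost` is a hypothetical helper written for illustration, not an AWS API.

```python
# Approximate on-demand rates ($/hr); assumptions taken from this notebook's pricing table
HOURLY_RATES = {
    "ml.t3.medium": 0.065,
    "ml.m5.xlarge": 0.269,
    "ml.p3.2xlarge": 4.284,
}

def idle_cost(instance_type: str, idle_hours: float) -> float:
    """Cost of leaving an instance running while it does no useful work."""
    return HOURLY_RATES[instance_type] * idle_hours

# Overnight, a full day, and a long weekend
for hours in (12, 24, 72):
    print(f"ml.p3.2xlarge idle for {hours:>2} h: ${idle_cost('ml.p3.2xlarge', hours):,.2f}")

# How much more an over-provisioned instance costs per hour
ratio = HOURLY_RATES["ml.m5.xlarge"] / HOURLY_RATES["ml.t3.medium"]
print(f"ml.m5.xlarge costs {ratio:.1f}x as much per hour as ml.t3.medium")
```

Swapping in current prices from the provider's pricing page is the only change needed to reuse this for other instance types.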

### Free Tier Summary (2024)

| Service | AWS Free Tier | Azure Free Tier | GCP Free Tier |
|---------|---------------|-----------------|---------------|
| **Compute** | 750 hrs t2.micro/month | 750 hrs B1S VM/month | e2-micro 24/7 |
| **Storage** | 5GB S3 Standard | 5GB Blob Storage | 5GB Standard |
| **Serverless** | 1M Lambda requests | 1M Functions executions | 2M Functions invocations |
| **Database** | 750 hrs RDS db.t2.micro | 250GB SQL Database | 1GB Cloud SQL |
| **Data Transfer** | 1GB out/month | 15GB out/month | 1GB out/month |
| **Duration** | 12 months (some always free) | 12 months + $200 credit | $300 credit (90 days) |

## Setup and Imports

In [None]:
# Standard library imports
import json
import os
from datetime import datetime, timedelta
import calendar

# Data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Cloud cost estimation (simulated)
from dataclasses import dataclass
from typing import List, Dict, Optional

# Configuration
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("Setup complete!")
print(f"Notebook executed on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## Part 1: Understanding Cloud ML Pricing Models

### 1.1: Core Pricing Components

Cloud ML costs come from 4 main areas:
1. **Compute**: Instance hours for training and inference
2. **Storage**: Data storage (S3, Blob, Cloud Storage)
3. **Data Transfer**: Moving data between regions or to internet
4. **Additional Services**: Managed services (SageMaker, Vertex AI, etc.)
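
As a rough sketch, the four components can be added up into a single monthly estimate. The storage and transfer rates below are assumed approximate us-east-1 list prices, and `estimate_bill` is a hypothetical helper, not part of any cloud SDK.

```python
# Assumed list prices (approximate); check the current pricing pages before relying on them
STORAGE_RATE_PER_GB = 0.023   # S3 Standard, $/GB-month
TRANSFER_RATE_PER_GB = 0.09   # data transfer out to the internet, $/GB

def estimate_bill(compute_hours, hourly_rate, storage_gb, transfer_out_gb,
                  managed_service_fees=0.0):
    """Break a monthly bill into the four cost areas and a total (all in USD)."""
    bill = {
        "compute": compute_hours * hourly_rate,
        "storage": storage_gb * STORAGE_RATE_PER_GB,
        "data_transfer": transfer_out_gb * TRANSFER_RATE_PER_GB,
        "additional_services": managed_service_fees,
    }
    bill["total"] = sum(bill.values())
    return bill

# Example: 100 hours of ml.m5.large, 50 GB stored, 10 GB egress, $5 of managed-service fees
bill = estimate_bill(compute_hours=100, hourly_rate=0.134,
                     storage_gb=50, transfer_out_gb=10, managed_service_fees=5.0)
for component, cost in bill.items():
    print(f"{component:>20}: ${cost:,.2f}")
```

Even in this small example compute dominates the total, which is why most of the strategies in this notebook target instance hours first.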

In [None]:
@dataclass
class InstancePricing:
    """Represents cloud instance pricing information"""
    name: str
    vcpus: int
    memory_gb: float
    gpu: Optional[str]
    on_demand_hourly: float
    spot_hourly: Optional[float]  # Spot/Preemptible pricing
    reserved_1yr: Optional[float]  # 1-year reserved pricing
    reserved_3yr: Optional[float]  # 3-year reserved pricing

# AWS SageMaker Instance Pricing (us-east-1, 2024 approximate)
aws_instances = [
    InstancePricing('ml.t3.medium', 2, 4, None, 0.065, None, 0.042, 0.027),
    InstancePricing('ml.m5.large', 2, 8, None, 0.134, 0.040, 0.087, 0.056),
    InstancePricing('ml.m5.xlarge', 4, 16, None, 0.269, 0.081, 0.175, 0.112),
    InstancePricing('ml.c5.2xlarge', 8, 16, None, 0.476, 0.143, 0.310, 0.198),
    InstancePricing('ml.p3.2xlarge', 8, 61, 'V100', 4.284, 1.285, 2.785, 1.784),
    InstancePricing('ml.g4dn.xlarge', 4, 16, 'T4', 0.736, 0.221, 0.479, 0.307),
]

# Convert to DataFrame for analysis
pricing_df = pd.DataFrame([
    {
        'Instance': inst.name,
        'vCPUs': inst.vcpus,
        'Memory (GB)': inst.memory_gb,
        'GPU': inst.gpu or '-',
        'On-Demand ($/hr)': inst.on_demand_hourly,
        'Spot ($/hr)': inst.spot_hourly or '-',
        'Reserved 1yr ($/hr)': inst.reserved_1yr or '-',
        'Reserved 3yr ($/hr)': inst.reserved_3yr or '-',
    }
    for inst in aws_instances
])

print("AWS SageMaker Instance Pricing (us-east-1)\n")
print(pricing_df.to_string(index=False))
print("\n💡 Key Observations:")
print("   - Spot instances: about 70% cheaper than on-demand")
print("   - Reserved 3yr: about 58% cheaper than on-demand")
print("   - GPU instances: roughly 11-66x the hourly rate of ml.t3.medium")
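
The observations printed above can be checked directly against the table. The following sketch recomputes the discounts from the same approximate prices; `discount_pct` is a helper introduced here for illustration.

```python
# (on-demand, spot, reserved 3yr) in $/hr, copied from the table above; None = no spot price listed
PRICES = {
    "ml.t3.medium":   (0.065, None,  0.027),
    "ml.m5.large":    (0.134, 0.040, 0.056),
    "ml.m5.xlarge":   (0.269, 0.081, 0.112),
    "ml.c5.2xlarge":  (0.476, 0.143, 0.198),
    "ml.p3.2xlarge":  (4.284, 1.285, 1.784),
    "ml.g4dn.xlarge": (0.736, 0.221, 0.307),
}

def discount_pct(on_demand: float, discounted: float) -> float:
    """Percentage saved relative to the on-demand rate."""
    return (1 - discounted / on_demand) * 100

for name, (on_demand, spot, reserved_3yr) in PRICES.items():
    spot_txt = f"{discount_pct(on_demand, spot):.0f}%" if spot is not None else "n/a"
    reserved_txt = f"{discount_pct(on_demand, reserved_3yr):.0f}%"
    print(f"{name:<15} spot: {spot_txt:>4}   reserved 3yr: {reserved_txt:>4}")
```

With these sample figures every spot price works out to roughly 70% off, so any larger saving would have to come from real market prices rather than this table.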

### 1.2: Cost Breakdown Analysis

Let's analyze how costs accumulate over time for different usage patterns.

In [None]:
def calculate_monthly_cost(instance: InstancePricing, hours_per_day: float, 
                          days_per_month: int = 30, pricing_model: str = 'on-demand'):
    """
    Calculate monthly cost for an instance based on usage pattern
    
    pricing_model: 'on-demand', 'spot', 'reserved-1yr', 'reserved-3yr'
    """
    total_hours = hours_per_day * days_per_month
    
    pricing_map = {
        'on-demand': instance.on_demand_hourly,
        'spot': instance.spot_hourly or instance.on_demand_hourly,
        'reserved-1yr': instance.reserved_1yr or instance.on_demand_hourly,
        'reserved-3yr': instance.reserved_3yr or instance.on_demand_hourly
    }
    
    hourly_rate = pricing_map[pricing_model]
    monthly_cost = total_hours * hourly_rate
    
    return monthly_cost

# Compare different usage scenarios
usage_scenarios = [
    {'name': 'Light Development', 'hours_per_day': 2},
    {'name': 'Active Development', 'hours_per_day': 8},
    {'name': 'Production 24/7', 'hours_per_day': 24},
]

# Analyze ml.m5.large costs
instance = aws_instances[1]  # ml.m5.large

cost_analysis = []
for scenario in usage_scenarios:
    for model in ['on-demand', 'spot', 'reserved-1yr', 'reserved-3yr']:
        cost = calculate_monthly_cost(instance, scenario['hours_per_day'], pricing_model=model)
        cost_analysis.append({
            'Usage Pattern': scenario['name'],
            'Hours/Day': scenario['hours_per_day'],
            'Pricing Model': model,
            'Monthly Cost ($)': cost
        })

cost_df = pd.DataFrame(cost_analysis)

# Pivot for better visualization
pivot_df = cost_df.pivot(index='Usage Pattern', columns='Pricing Model', values='Monthly Cost ($)')
print(f"\nMonthly Cost Analysis: {instance.name}\n")
print(pivot_df.to_string())
print("\n💰 Cost Savings Opportunities:")
dev, prod = 'Active Development', 'Production 24/7'  # pivot rows are sorted by label, so index by name
print(f"   - Using Spot for development: Save ${pivot_df.loc[dev, 'on-demand'] - pivot_df.loc[dev, 'spot']:.2f}/month")
print(f"   - Reserved 3yr for production: Save ${pivot_df.loc[prod, 'on-demand'] - pivot_df.loc[prod, 'reserved-3yr']:.2f}/month")
print(f"   - Running 8 h/day instead of 24/7: Save ${pivot_df.loc[prod, 'on-demand'] - pivot_df.loc[dev, 'on-demand']:.2f}/month")
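
One caveat on the comparison above: `calculate_monthly_cost` multiplies the reserved rate by the hours actually used, whereas commitment-based discounts are generally billed for every hour of the term whether or not the instance is running. Under that assumption, a reservation only pays off above a certain daily usage. The sketch below works out that break-even point; `reserved_break_even_hours` is a hypothetical helper.

```python
HOURS_PER_DAY = 24

def reserved_break_even_hours(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Daily usage (hours) above which an always-billed reservation beats on-demand.

    On-demand cost per day = hours_used * on_demand_hourly
    Reserved cost per day  = 24 * reserved_hourly (billed even when idle)
    Setting the two equal and solving for hours_used gives the break-even.
    """
    return HOURS_PER_DAY * reserved_hourly / on_demand_hourly

# ml.m5.large rates from this notebook's pricing table
for label, reserved_rate in (("1-year", 0.087), ("3-year", 0.056)):
    hours = reserved_break_even_hours(0.134, reserved_rate)
    print(f"{label} reservation pays off above {hours:.1f} h/day of use")
```

For the light and active development patterns (2 and 8 hours per day), on-demand or spot therefore remains the cheaper choice under this model.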

## Part 2: Maximizing Free Tier Benefits

### 2.1: Free Tier Strategy

The free tier is perfect for:
- Learning and experimentation
- Small personal projects
- Proof-of-concept development
- Side projects with low traffic

**How to Stay Within Free Tier:**

In [None]:
class FreeTierTracker:
    """Track usage against AWS free tier limits"""
    
    def __init__(self):
        # Free tier limits (monthly)
        self.limits = {
            'ec2_hours': 750,  # t2.micro hours
            's3_storage_gb': 5,
            's3_get_requests': 20000,
            's3_put_requests': 2000,
            'lambda_requests': 1_000_000,
            'lambda_gb_seconds': 400_000,
            'data_transfer_gb': 1,  # outbound
        }
        
        # Current usage
        self.usage = {key: 0 for key in self.limits.keys()}
    
    def add_usage(self, resource: str, amount: float):
        """Record resource usage"""
        if resource in self.usage:
            self.usage[resource] += amount
    
    def get_utilization(self) -> pd.DataFrame:
        """Get utilization report"""
        data = []
        for resource, limit in self.limits.items():
            used = self.usage[resource]
            utilization_pct = (used / limit) * 100
            remaining = max(0, limit - used)
            
            status = '✅ Safe'
            if utilization_pct > 100:
                status = '🚨 Over Limit'
            elif utilization_pct > 75:
                status = '⚠️ Warning'
            
            data.append({
                'Resource': resource,
                'Used': f"{used:,.1f}",  # one decimal so fractional GB (e.g. 2.5) is not rounded away
                'Limit': f"{limit:,.1f}",
                'Remaining': f"{remaining:,.1f}",
                'Utilization': f"{utilization_pct:.1f}%",
                'Status': status
            })
        
        return pd.DataFrame(data)

# Example: Track a month of usage
tracker = FreeTierTracker()

# Simulate usage
tracker.add_usage('ec2_hours', 200)  # Running t2.micro for ~6.5 hours/day
tracker.add_usage('s3_storage_gb', 2.5)  # 2.5GB stored
tracker.add_usage('s3_get_requests', 5000)  # 5k GET requests
tracker.add_usage('lambda_requests', 450_000)  # 450k Lambda invocations
tracker.add_usage('lambda_gb_seconds', 180_000)  # Lambda compute
tracker.add_usage('data_transfer_gb', 0.3)  # 300MB data transfer

# Display utilization
print("AWS Free Tier Utilization Report\n")
print(tracker.get_utilization().to_string(index=False))
print("\n💡 Tips to Stay in Free Tier:")
print("   1. Stop or terminate instances when not in use (stopped instances still incur EBS storage charges)")
print("   2. Use S3 Intelligent-Tiering for automatic cost optimization")
print("   3. Set up billing alerts at 50%, 75%, and 90% of free tier limits")
print("   4. Delete old snapshots and unused volumes")
print("   5. Use CloudWatch to monitor usage in real-time")
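
A utilization report only shows where usage stands today. A simple linear projection, sketched below, estimates where it will end the month. `project_month_end` is a hypothetical helper and assumes usage accrues at a constant daily rate, which will not hold for bursty workloads.

```python
import calendar
from datetime import date

def project_month_end(used_so_far: float, today: date, limit: float) -> dict:
    """Linearly extrapolate month-to-date usage to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = used_so_far / today.day * days_in_month
    return {
        "projected": projected,
        "limit": limit,
        "will_exceed": projected > limit,
    }

# 450k Lambda requests by June 10th against the 1M monthly allowance
result = project_month_end(450_000, date(2024, 6, 10), 1_000_000)
print(f"Projected requests: {result['projected']:,.0f} "
      f"(limit {result['limit']:,.0f}, exceeds: {result['will_exceed']})")
```

In this example the tracker would still report 45% utilization, yet the projection shows the allowance running out before the month ends.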

### 2.2: Free Tier ML Project Blueprint

Here's how to run a complete ML project on AWS free tier:

In [None]:
free_tier_ml_blueprint = {
    'architecture': {
        'data_storage': 'S3 (5GB free)',
        'training': 'Local laptop or Colab/Kaggle (free GPU)',
        'model_storage': 'S3 (< 100MB model)',
        'inference': 'AWS Lambda (1M free requests)',
        'api': 'API Gateway (1M free calls for 12 months)',
        'monitoring': 'CloudWatch (basic metrics free)'
    },
    'estimated_costs': {
        'storage': '$0 (< 5GB)',
        'lambda': '$0 (< 1M requests)',
        'api_gateway': '$0 (first 12 months)',
        'data_transfer': '$0 (< 1GB)',
        'total_monthly': '$0'
    },
    'limitations': [
        'No GPU training (use Colab instead)',
        'Small datasets only (< 5GB)',
        'Lambda-based inference only (no always-on real-time endpoints)',
        'Low traffic (< 1M predictions/month)'
    ],
    'best_practices': [
        'Train models locally or on Colab',
        'Use lightweight models (< 50MB)',
        'Cache the loaded model outside the handler to soften Lambda cold starts',
        'Compress data before uploading to S3',
        'Set up lifecycle policies to auto-delete old data'
    ]
}

print("Free Tier ML Project Blueprint\n")
print("Architecture:")
for component, detail in free_tier_ml_blueprint['architecture'].items():
    print(f"  - {component}: {detail}")

print("\nEstimated Monthly Costs:")
for item, cost in free_tier_ml_blueprint['estimated_costs'].items():
    print(f"  - {item}: {cost}")

print("\n✅ This architecture supports:")
print("   - 1M predictions/month")
print("   - ~1,400 API calls/hour average")
print("   - 5GB of training data")
print("   - Multiple model versions")
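
The blueprint counts requests, but the Lambda free tier also caps compute at 400,000 GB-seconds per month, and that cap can be reached first. The sketch below estimates the Lambda bill for a given traffic profile. The unit prices are assumed approximate x86 us-east-1 list prices, and `lambda_monthly_cost` is a hypothetical helper.

```python
# Assumed unit prices (approximate); verify against the current Lambda pricing page
REQUEST_PRICE = 0.20 / 1_000_000   # $ per request
GB_SECOND_PRICE = 0.0000166667     # $ per GB-second
FREE_REQUESTS = 1_000_000          # monthly free allowance
FREE_GB_SECONDS = 400_000          # monthly free allowance

def lambda_monthly_cost(requests: int, avg_duration_ms: float, memory_mb: int) -> dict:
    """Estimate monthly Lambda cost after subtracting the free tier."""
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    billable_requests = max(0, requests - FREE_REQUESTS)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    request_cost = billable_requests * REQUEST_PRICE
    compute_cost = billable_gb_seconds * GB_SECOND_PRICE
    return {
        "gb_seconds": gb_seconds,
        "request_cost": request_cost,
        "compute_cost": compute_cost,
        "total": request_cost + compute_cost,
    }

# Same request volume, different inference latency
for duration_ms in (200, 500):
    cost = lambda_monthly_cost(1_000_000, duration_ms, 1024)
    print(f"{duration_ms} ms at 1024 MB: {cost['gb_seconds']:,.0f} GB-s, ${cost['total']:.2f}/month")
```

A model that takes 500 ms per prediction at 1 GB of memory leaves the free tier on compute even though the request count is still within the allowance.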

## Part 3: Managed Spot Training

Spot instances can reduce training costs by **70-90%** but may be interrupted.

### 3.1: Understanding Spot Instances

In [None]:
def compare_spot_vs_ondemand(instance: InstancePricing, training_hours: float,
                             spot_interruption_rate: float = 0.1):
    """
    Compare cost and reliability of spot vs on-demand training
    
    spot_interruption_rate: Probability of spot interruption (typical: 5-10%)
    """
    # On-demand costs
    ondemand_cost = training_hours * instance.on_demand_hourly
    ondemand_reliability = 1.0  # 100% reliable
    
    # Spot costs (assume rerun overhead proportional to the interruption rate)
    spot_effective_hours = training_hours * (1 + spot_interruption_rate)
    spot_cost = spot_effective_hours * (instance.spot_hourly or instance.on_demand_hourly * 0.3)
    spot_reliability = 1.0 - spot_interruption_rate
    
    savings = ondemand_cost - spot_cost
    savings_pct = (savings / ondemand_cost) * 100
    
    return {
        'instance': instance.name,
        'training_hours': training_hours,
        'ondemand_cost': ondemand_cost,
        'spot_cost': spot_cost,
        'savings': savings,
        'savings_pct': savings_pct,
        'spot_reliability': spot_reliability * 100
    }

# Compare different training scenarios
training_scenarios = [
    {'instance': aws_instances[1], 'hours': 10, 'name': 'Quick experiment'},
    {'instance': aws_instances[3], 'hours': 24, 'name': 'Medium training job'},
    {'instance': aws_instances[4], 'hours': 100, 'name': 'Large GPU training'},
]

comparisons = []
for scenario in training_scenarios:
    result = compare_spot_vs_ondemand(scenario['instance'], scenario['hours'])
    result['scenario'] = scenario['name']
    comparisons.append(result)

comparison_df = pd.DataFrame(comparisons)

print("Spot vs On-Demand Training Cost Comparison\n")
print(comparison_df[[
    'scenario', 'instance', 'training_hours', 
    'ondemand_cost', 'spot_cost', 'savings', 'savings_pct'
]].to_string(index=False))

print("\n💰 Key Insights:")
print(f"   - Average savings with Spot: {comparison_df['savings_pct'].mean():.1f}%")
print(f"   - Total potential savings: ${comparison_df['savings'].sum():.2f}")
print(f"   - Spot reliability: ~{comparison_df['spot_reliability'].mean():.0f}% (with checkpointing)")
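
The flat overhead used above hides how much checkpoint frequency matters. A slightly finer model, sketched here under stated assumptions, charges each interruption half a checkpoint interval of lost work on average plus a fixed restart time. The function name, the default restart time, and the interruption count are illustrative assumptions, not measured values.

```python
def spot_cost_with_checkpoints(training_hours: float, spot_hourly: float,
                               interruptions: int, checkpoint_interval_hours: float,
                               restart_hours: float = 0.1) -> float:
    """Expected spot cost when each interruption loses, on average,
    half a checkpoint interval of work plus a fixed restart time."""
    lost_per_interruption = checkpoint_interval_hours / 2 + restart_hours
    total_hours = training_hours + interruptions * lost_per_interruption
    return total_hours * spot_hourly

# 100-hour job on ml.p3.2xlarge spot ($1.285/hr in this notebook's table), 5 interruptions
on_demand_cost = 100 * 4.284
for interval in (0.25, 1.0, 4.0):
    cost = spot_cost_with_checkpoints(100, 1.285, 5, interval)
    print(f"checkpoint every {interval:>4} h: ${cost:,.2f} "
          f"({(1 - cost / on_demand_cost) * 100:.0f}% below on-demand)")
```

Under these assumptions the saving stays close to the raw spot discount as long as checkpoints are frequent relative to the job length.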

### 3.2: Spot Training Best Practices

To use spot instances effectively for ML training:

In [None]:
# SageMaker Spot Training configuration example
spot_training_config = {
    'estimator_config': {
        'instance_type': 'ml.p3.2xlarge',
        'instance_count': 1,
        'use_spot_instances': True,
        'max_run': 86400,  # 24 hours max
        'max_wait': 86400,  # Total time limit, including waiting for spot capacity; must be >= max_run
        # Key: Enable checkpointing!
        'checkpoint_s3_uri': 's3://my-bucket/checkpoints/',
        'checkpoint_local_path': '/opt/ml/checkpoints'
    },
    'training_script_requirements': [
        'Save checkpoints every N epochs',
        'Resume from checkpoint on restart',
        'Handle SIGTERM gracefully (save state before shutdown)'
    ],
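    'terraform_example': '''
# NOTE: illustrative pseudo-configuration. The Terraform AWS provider does not appear
# to offer an aws_sagemaker_training_job resource; training jobs are normally launched
# through the SageMaker SDK or the CreateTrainingJob API.
resource "aws_sagemaker_training_job" "spot_training" {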
  training_job_name = "my-spot-training-job"
  role_arn          = aws_iam_role.sagemaker.arn

  algorithm_specification {
    training_image = "your-training-image"
  }

  resource_config {
    instance_type   = "ml.p3.2xlarge"
    instance_count  = 1
    volume_size_in_gb = 50
  }

  # Enable Spot Training
  enable_managed_spot_training = true
  
  stopping_condition {
    max_runtime_in_seconds = 86400
    max_wait_time_in_seconds = 86400
  }

  # Critical: Configure checkpointing
  checkpoint_config {
    s3_uri = "s3://my-bucket/checkpoints/"
    local_path = "/opt/ml/checkpoints"
  }
}
'''
}

print("Spot Training Best Practices\n")
print("Configuration:")
print(json.dumps(spot_training_config['estimator_config'], indent=2))
print("\n✅ Requirements for Spot Training Success:")
for idx, req in enumerate(spot_training_config['training_script_requirements'], 1):
    print(f"   {idx}. {req}")

print("\n⚠️ When NOT to use Spot:")
print("   - Time-critical training (deadline-driven)")
print("   - Training jobs < 1 hour (interruption overhead too high)")
print("   - Cannot implement checkpointing (legacy code)")
print("   - Need guaranteed completion time")
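
The three training-script requirements listed above (periodic checkpoints, resume on restart, graceful SIGTERM handling) can be sketched with the standard library alone. This is a toy loop, not a SageMaker-specific implementation: the JSON state file, the `stop_after` parameter used to simulate an interruption, and all function names are assumptions made for illustration.

```python
import json
import os
import signal
import tempfile

stop_requested = False

def handle_sigterm(signum, frame):
    """Only record the request; the loop exits at the next safe point."""
    global stop_requested
    stop_requested = True

def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}

def save_checkpoint(path, state):
    """Write to a temp file first so an interruption never leaves a partial checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def train(checkpoint_dir, total_epochs, stop_after=None):
    """Toy training loop; `stop_after` simulates an interruption at that epoch."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "state.json")
    state = load_checkpoint(path)  # resume where the previous run stopped
    for epoch in range(state["epoch"], total_epochs):
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}  # stand-in for a real step
        save_checkpoint(path, state)
        if stop_requested or state["epoch"] == stop_after:
            break
    return state

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    # On SageMaker the directory would be the configured checkpoint_local_path
    with tempfile.TemporaryDirectory() as ckpt_dir:
        first = train(ckpt_dir, total_epochs=10, stop_after=4)  # "interrupted" run
        second = train(ckpt_dir, total_epochs=10)               # resumes at epoch 5
        print(f"first run stopped at epoch {first['epoch']}, "
              f"second run finished at epoch {second['epoch']}")
```

In a real job the state would hold model and optimizer weights, and files written to the local checkpoint path are what managed spot training syncs to the configured S3 URI.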

## Part 4: Storage Optimization

Storage costs can accumulate quickly if not managed properly.

### 4.1: S3 Storage Tiers and Lifecycle Policies

In [None]:
# S3 Storage Classes and Pricing (per GB/month)
s3_storage_classes = pd.DataFrame([
    {
        'Storage Class': 'S3 Standard',
        'Cost ($/GB/mo)': 0.023,
        'Retrieval Cost': '$0',
        'Retrieval Time': 'Milliseconds',
        'Use Case': 'Frequently accessed data',
        'Durability': '99.999999999%'
    },
    {
        'Storage Class': 'S3 Intelligent-Tiering',
        'Cost ($/GB/mo)': 0.023,  # Same as Standard for frequent access
        'Retrieval Cost': '$0',
        'Retrieval Time': 'Milliseconds',
        'Use Case': 'Unknown or changing access patterns',
        'Durability': '99.999999999%'
    },
    {
        'Storage Class': 'S3 Standard-IA',
        'Cost ($/GB/mo)': 0.0125,
        'Retrieval Cost': '$0.01/GB',
        'Retrieval Time': 'Milliseconds',
        'Use Case': 'Infrequently accessed (< 1/month)',
        'Durability': '99.999999999%'
    },
    {
        'Storage Class': 'S3 Glacier Instant',
        'Cost ($/GB/mo)': 0.004,
        'Retrieval Cost': '$0.03/GB',
        'Retrieval Time': 'Milliseconds',
        'Use Case': 'Archive with instant access',
        'Durability': '99.999999999%'
    },
    {
        'Storage Class': 'S3 Glacier Flexible',
        'Cost ($/GB/mo)': 0.0036,
        'Retrieval Cost': '$0.01-0.03/GB',
        'Retrieval Time': 'Minutes to hours',
        'Use Case': 'Long-term archive',
        'Durability': '99.999999999%'
    },
    {
        'Storage Class': 'S3 Glacier Deep Archive',
        'Cost ($/GB/mo)': 0.00099,
        'Retrieval Cost': '$0.02/GB',
        'Retrieval Time': '12-48 hours',
        'Use Case': 'Compliance/regulatory archives',
        'Durability': '99.999999999%'
    }
])

print("AWS S3 Storage Classes Comparison\n")
print(s3_storage_classes.to_string(index=False))
print("\n💡 Storage Class Selection Guide:")
print("   - Active training data: S3 Standard")
print("   - Processed datasets: S3 Intelligent-Tiering (auto-optimizes)")
print("   - Model artifacts: S3 Standard-IA (if not frequently deployed)")
print("   - Historical data: S3 Glacier Instant or Flexible")
print("   - Compliance archives: S3 Glacier Deep Archive")
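
Cheaper storage classes charge for retrieval, so the choice depends on how often the data is read back. The sketch below computes how many full-dataset retrievals per month make a class cost the same as S3 Standard, using the prices from the table above. It ignores request fees, minimum storage durations, and minimum object sizes, and `break_even_retrievals` is a hypothetical helper.

```python
STANDARD_PRICE = 0.023  # $/GB-month, from the table above

def break_even_retrievals(storage_price: float, retrieval_price: float,
                          standard_price: float = STANDARD_PRICE) -> float:
    """Full-dataset retrievals per month at which a cheaper class costs the same as Standard.

    Per GB: storage_price + n * retrieval_price = standard_price, solved for n.
    """
    return (standard_price - storage_price) / retrieval_price

for name, storage, retrieval in [
    ("Standard-IA", 0.0125, 0.01),
    ("Glacier Instant", 0.004, 0.03),
    ("Glacier Deep Archive", 0.00099, 0.02),
]:
    n = break_even_retrievals(storage, retrieval)
    print(f"{name:<22} cheaper than Standard below {n:.2f} full retrievals/month")
```

With these figures, data that is read in full about once a month or more is usually better left in S3 Standard.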

### 4.2: S3 Lifecycle Policy Example

In [None]:
# S3 Lifecycle Policy for ML data management
# Note: the 'Reason' keys are annotations for readers, not part of the S3 API schema
s3_lifecycle_policy = {
    'Rules': [
        {
            'Id': 'TransitionRawData',
            'Status': 'Enabled',
            'Prefix': 'data/raw/',
            'Transitions': [
                {
                    'Days': 30,
                    'StorageClass': 'INTELLIGENT_TIERING',
                    'Reason': 'Auto-optimize based on access patterns'
                }
            ]
        },
        {
            'Id': 'TransitionProcessedData',
            'Status': 'Enabled',
            'Prefix': 'data/processed/',
            'Transitions': [
                {
                    'Days': 90,
                    'StorageClass': 'GLACIER_IR',
                    'Reason': 'Old processed data rarely accessed'
                }
            ]
        },
        {
            'Id': 'DeleteTempFiles',
            'Status': 'Enabled',
            'Prefix': 'temp/',
            'Expiration': {
                'Days': 7,
                'Reason': 'Temporary files no longer needed'
            }
        },
        {
            'Id': 'ArchiveOldModels',
            'Status': 'Enabled',
            'Prefix': 'models/archive/',
            'Transitions': [
                {
                    'Days': 180,
                    'StorageClass': 'GLACIER',  # API name for S3 Glacier Flexible Retrieval
                    'Reason': 'Keep for compliance but rarely used'
                }
            ]
        }
    ]
}

# Terraform version of lifecycle policy
terraform_lifecycle = '''
resource "aws_s3_bucket_lifecycle_configuration" "ml_data" {
  bucket = aws_s3_bucket.ml_data.id

  rule {
    id     = "transition-raw-data"
    status = "Enabled"

    filter {
      prefix = "data/raw/"
    }

    transition {
      days          = 30
      storage_class = "INTELLIGENT_TIERING"
    }
  }

  rule {
    id     = "delete-temp-files"
    status = "Enabled"

    filter {
      prefix = "temp/"
    }

    expiration {
      days = 7
    }
  }
}
'''

print("S3 Lifecycle Policy for ML Data Management\n")
print(json.dumps(s3_lifecycle_policy, indent=2))
print("\n📊 Expected Cost Savings:")
print("   - 30-day transition to Intelligent-Tiering: 0-50% savings")
print("   - 90-day transition to Glacier Instant Retrieval: 80%+ savings on storage")
print("   - Auto-delete temp files: Prevents waste accumulation")
print("\nTerraform version is in `terraform_lifecycle` (write it to s3_lifecycle_policy.tf to apply)")
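
The annotated dictionary above is written for readability and would not be accepted by the S3 API as-is. The sketch below shows one way to convert it into the shape that boto3's `put_bucket_lifecycle_configuration` expects: `ID` instead of `Id`, a `Filter` block instead of a top-level `Prefix`, and no `Reason` keys. `to_boto3_lifecycle` is a hypothetical helper, and the sample policy is a shortened stand-in for the one above.

```python
# Storage-class names the lifecycle transition API accepts
VALID_STORAGE_CLASSES = {
    "STANDARD_IA", "ONEZONE_IA", "INTELLIGENT_TIERING",
    "GLACIER_IR", "GLACIER", "DEEP_ARCHIVE",
}

def to_boto3_lifecycle(annotated_policy: dict) -> dict:
    """Strip reader annotations and reshape rules for the S3 lifecycle API."""
    rules = []
    for rule in annotated_policy["Rules"]:
        clean = {
            "ID": rule["Id"],
            "Status": rule["Status"],
            "Filter": {"Prefix": rule["Prefix"]},
        }
        if "Transitions" in rule:
            clean["Transitions"] = []
            for transition in rule["Transitions"]:
                storage_class = transition["StorageClass"]
                if storage_class not in VALID_STORAGE_CLASSES:
                    raise ValueError(f"Unknown storage class: {storage_class}")
                clean["Transitions"].append(
                    {"Days": transition["Days"], "StorageClass": storage_class}
                )
        if "Expiration" in rule:
            clean["Expiration"] = {"Days": rule["Expiration"]["Days"]}
        rules.append(clean)
    return {"Rules": rules}

sample_policy = {
    "Rules": [
        {"Id": "TransitionRawData", "Status": "Enabled", "Prefix": "data/raw/",
         "Transitions": [{"Days": 30, "StorageClass": "INTELLIGENT_TIERING",
                          "Reason": "Auto-optimize"}]},
        {"Id": "DeleteTempFiles", "Status": "Enabled", "Prefix": "temp/",
         "Expiration": {"Days": 7, "Reason": "Temporary files"}},
    ]
}

lifecycle_configuration = to_boto3_lifecycle(sample_policy)
print(lifecycle_configuration)
# With credentials configured, this could then be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_configuration)
```

Validating storage-class names locally catches typos before the API call fails.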

### 4.3: Storage Cost Calculator

In [None]:
def calculate_storage_costs(data_gb: float, storage_class: str, 
                           retrievals_per_month: int = 0) -> dict:
    """
    Calculate monthly S3 storage costs including retrieval
    """
    # Storage costs per GB/month
    storage_prices = {
        'standard': 0.023,
        'intelligent_tiering': 0.023,  # Frequent tier
        'standard_ia': 0.0125,
        'glacier_ir': 0.004,
        'glacier': 0.0036,
        'deep_archive': 0.00099
    }
    
    # Retrieval costs per GB
    retrieval_prices = {
        'standard': 0,
        'intelligent_tiering': 0,
        'standard_ia': 0.01,
        'glacier_ir': 0.03,
        'glacier': 0.02,
        'deep_archive': 0.02
    }
    
    storage_cost = data_gb * storage_prices[storage_class]
    retrieval_cost = (data_gb * retrievals_per_month * 
                     retrieval_prices[storage_class])
    total_cost = storage_cost + retrieval_cost
    
    return {
        'storage_class': storage_class,
        'data_gb': data_gb,
        'storage_cost': storage_cost,
        'retrieval_cost': retrieval_cost,
        'total_monthly_cost': total_cost
    }

# Example: 100GB dataset with different storage strategies
dataset_size = 100  # GB

storage_scenarios = [
    {'class': 'standard', 'retrievals': 10, 'name': 'Active Development'},
    {'class': 'intelligent_tiering', 'retrievals': 5, 'name': 'Smart Auto-Tiering'},
    {'class': 'standard_ia', 'retrievals': 2, 'name': 'Infrequent Access'},
    {'class': 'glacier_ir', 'retrievals': 1, 'name': 'Archived (Instant)'},
    {'class': 'glacier', 'retrievals': 0, 'name': 'Long-Term Archive (Flexible)'},
]

storage_costs = []
for scenario in storage_scenarios:
    cost = calculate_storage_costs(
        dataset_size, 
        scenario['class'], 
        scenario['retrievals']
    )
    cost['scenario'] = scenario['name']
    storage_costs.append(cost)

storage_cost_df = pd.DataFrame(storage_costs)

print(f"Storage Cost Analysis for {dataset_size}GB Dataset\n")
print(storage_cost_df[[
    'scenario', 'storage_class', 'storage_cost', 
    'retrieval_cost', 'total_monthly_cost'
]].to_string(index=False))

# Calculate savings
baseline_cost = storage_cost_df.iloc[0]['total_monthly_cost']
best_cost = storage_cost_df['total_monthly_cost'].min()
max_savings = baseline_cost - best_cost
max_savings_pct = (max_savings / baseline_cost) * 100

print(f"\n💰 Maximum potential savings: ${max_savings:.2f}/month ({max_savings_pct:.1f}%)")
print(f"   Annual savings: ${max_savings * 12:.2f}")
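
The calculator above compares classes for a single month. To see what a lifecycle transition saves over time, a cumulative view helps. The sketch below assumes the data sits in S3 Standard and then moves to Glacier Instant Retrieval after a given number of months. It ignores transition request fees and retrievals, and `cumulative_storage_cost` is a hypothetical helper.

```python
def cumulative_storage_cost(data_gb: float, months: int, transition_month=None,
                            initial_rate: float = 0.023, archive_rate: float = 0.004) -> float:
    """Total storage cost over `months`; data moves to the archive rate
    after `transition_month` full months (never, if None)."""
    total = 0.0
    for month in range(1, months + 1):
        archived = transition_month is not None and month > transition_month
        total += data_gb * (archive_rate if archived else initial_rate)
    return total

no_lifecycle = cumulative_storage_cost(100, 12)
with_lifecycle = cumulative_storage_cost(100, 12, transition_month=3)
print(f"12 months in Standard:            ${no_lifecycle:.2f}")
print(f"3 months Standard, then archived: ${with_lifecycle:.2f}")
print(f"Saved: ${no_lifecycle - with_lifecycle:.2f} "
      f"({(1 - with_lifecycle / no_lifecycle) * 100:.0f}%)")
```

The earlier the transition, the closer the yearly saving gets to the per-month difference between the two classes.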

## Part 5: Cost Monitoring and Alerts

### 5.1: Setting Up Billing Alerts

In [None]:
# Terraform configuration for AWS Budgets and Alerts
terraform_budget = '''
# Budget with multiple alert thresholds
resource "aws_budgets_budget" "ml_monthly_budget" {
  name              = "ml-monthly-budget"
  budget_type       = "COST"
  limit_amount      = "50"  # $50/month budget
  limit_unit        = "USD"
  time_unit         = "MONTHLY"

  # Alert at 50% of budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 50
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_email_addresses = ["your-email@example.com"]
  }

  # Alert at 75% of budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 75
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = ["your-email@example.com"]
  }

  # Alert at 90% of budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 90
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = ["your-email@example.com"]
  }

  # Alert when exceeded
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_email_addresses = ["your-email@example.com"]
  }

  # Filter to specific services (optional)
  cost_filter {
    name   = "Service"
    values = ["Amazon SageMaker", "Amazon S3", "AWS Lambda"]
  }
}

# CloudWatch Alarm for specific resource costs
resource "aws_cloudwatch_metric_alarm" "sagemaker_endpoint_cost" {
  alarm_name          = "sagemaker-endpoint-high-cost"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "EstimatedCharges"
  namespace           = "AWS/Billing"
  period              = "21600"  # 6 hours
  statistic           = "Maximum"
  threshold           = "10"  # $10 threshold

  alarm_description = "Alert when SageMaker costs exceed $10"
  alarm_actions     = [aws_sns_topic.billing_alerts.arn]

  dimensions = {
    Currency    = "USD"  # billing metrics are published per currency
    ServiceName = "AmazonSageMaker"
  }
}
'''

print("Cost Monitoring & Alert Configuration\n")
print("Terraform configuration is in `terraform_budget` (write it to aws_budgets.tf to apply)")
print("\n📊 Recommended Alert Thresholds:")
print("   - 50% (Forecasted): Early warning, time to review")
print("   - 75% (Actual): Concerning, investigate immediately")
print("   - 90% (Actual): Critical, take action")
print("   - 100% (Actual): Over budget, shutdown non-essential resources")
print("\n⚙️ Setup Steps:")
print("   1. Enable billing alerts in AWS Console")
print("   2. Create SNS topic for notifications")
print("   3. Apply Terraform configuration")
print("   4. Verify email subscription")
print("   5. Test with small budget ($1) first")
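
To see how the four thresholds behave before any real budget exists, the alert logic can be simulated locally. The sketch below mirrors the thresholds in the Terraform configuration above. `triggered_alerts` is a hypothetical helper and does not call AWS Budgets.

```python
# (threshold %, notification type), mirroring the Terraform budget above
THRESHOLDS = [(50, "FORECASTED"), (75, "ACTUAL"), (90, "ACTUAL"), (100, "ACTUAL")]

def triggered_alerts(budget: float, actual_spend: float, forecasted_spend: float) -> list:
    """Return the alerts that would fire for the given spend figures."""
    fired = []
    for pct, kind in THRESHOLDS:
        # Forecast alerts compare projected month-end spend; the rest compare spend to date
        value = forecasted_spend if kind == "FORECASTED" else actual_spend
        if value > budget * pct / 100:
            fired.append(f"{pct}% ({kind})")
    return fired

# $50 budget, $40 spent so far, $60 forecast for the month
print(triggered_alerts(budget=50, actual_spend=40, forecasted_spend=60))
```

Pairing a forecast-based alert at a low threshold with actual-spend alerts higher up gives an early warning without repeated notifications late in the month.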

### 5.2: Cost Allocation Tags

In [None]:
# Example cost allocation tagging strategy
cost_allocation_tags = {
    'Project': 'customer-churn-prediction',
    'Environment': 'development',  # development, staging, production
    'Team': 'data-science',
    'CostCenter': 'ML-R&D',
    'Owner': 'john.doe@company.com',
    'Purpose': 'training',  # training, inference, storage, experimentation
    'AutoShutdown': 'true',  # For automated cleanup
    'ExpirationDate': '2024-12-31'  # When to archive/delete
}

# Terraform example with tags
terraform_tags_example = '''
# Consistent tagging across all resources
locals {
  common_tags = {
    Project     = "customer-churn-prediction"
    Environment = "development"
    Team        = "data-science"
    CostCenter  = "ML-R&D"
    ManagedBy   = "terraform"
  }
}

# S3 bucket with tags
resource "aws_s3_bucket" "ml_data" {
  bucket = "my-ml-data-bucket"
  tags   = merge(local.common_tags, {
    Purpose = "data-storage"
  })
}

# SageMaker endpoint with tags
resource "aws_sagemaker_endpoint" "model" {
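  name                 = "customer-churn-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.model.name  # assumed to be defined elsewhere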
  tags = merge(local.common_tags, {
    Purpose        = "inference"
    AutoShutdown   = "true"
    ShutdownTime   = "20:00"  # 8 PM daily
  })
}
'''

print("Cost Allocation Tag Strategy\n")
print("Recommended Tags:")
for key, value in cost_allocation_tags.items():
    print(f"  - {key}: {value}")

print("\n📊 Benefits of Cost Allocation Tags:")
print("   1. Track costs by project, team, or environment")
print("   2. Identify cost drivers and optimization opportunities")
print("   3. Automate resource cleanup (based on tags)")
print("   4. Generate accurate billing reports")
print("   5. Enforce governance policies")

print("\n🏷️ Tag Best Practices:")
print("   - Use consistent naming conventions")
print("   - Tag all billable resources")
print("   - Activate cost allocation tags in billing console")
print("   - Review and update tags regularly")
print("   - Use automation to enforce tagging policies")
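
Tagging policies are easier to enforce with a small automated check. The sketch below validates a resource's tags against a required set. Which tags are mandatory and which environment names are allowed are assumptions chosen to match the example strategy above, and `validate_tags` is a hypothetical helper.

```python
# Assumed policy: these keys must be present on every billable resource
REQUIRED_TAGS = {"Project", "Environment", "Team", "CostCenter", "Owner"}
ALLOWED_ENVIRONMENTS = {"development", "staging", "production"}

def validate_tags(tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means the tags comply."""
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - set(tags))]
    environment = tags.get("Environment")
    if environment is not None and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return problems

compliant = {
    "Project": "customer-churn-prediction",
    "Environment": "development",
    "Team": "data-science",
    "CostCenter": "ML-R&D",
    "Owner": "john.doe@company.com",
}
print(validate_tags(compliant))
print(validate_tags({"Project": "demo", "Environment": "dev"}))
```

A check like this could run in CI against planned infrastructure changes so that untagged resources never reach the bill.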

## Part 6: Right-Sizing and Auto-Scaling

### 6.1: Instance Right-Sizing Analysis

In [None]:
def recommend_instance_size(cpu_usage_pct: float, memory_usage_pct: float,
                           current_instance: InstancePricing) -> dict:
    """
    Recommend instance right-sizing based on utilization metrics
    
    Rules:
    - If both CPU and memory < 30%: Downsize
    - If either > 80%: Upsize
    - Otherwise: Keep current size
    """
    recommendation = 'keep'
    reason = 'Utilization is within optimal range (30-80%)'
    potential_savings = 0
    
    if cpu_usage_pct < 30 and memory_usage_pct < 30:
        recommendation = 'downsize'
        reason = f'Low utilization (CPU: {cpu_usage_pct}%, Memory: {memory_usage_pct}%)'
        # Simulate ~50% cost reduction by downsizing
        potential_savings = current_instance.on_demand_hourly * 0.5 * 730  # Monthly
    elif cpu_usage_pct > 80 or memory_usage_pct > 80:
        recommendation = 'upsize'
        reason = f'High utilization (CPU: {cpu_usage_pct}%, Memory: {memory_usage_pct}%)'
        potential_savings = 0  # Cost increases, but improves performance
    
    return {
        'current_instance': current_instance.name,
        'cpu_usage': cpu_usage_pct,
        'memory_usage': memory_usage_pct,
        'recommendation': recommendation,
        'reason': reason,
        'monthly_savings': potential_savings
    }

# Simulate utilization for different instances
utilization_data = [
    {'instance': aws_instances[2], 'cpu': 25, 'memory': 20},  # Under-utilized
    {'instance': aws_instances[1], 'cpu': 55, 'memory': 60},  # Well-sized
    {'instance': aws_instances[0], 'cpu': 85, 'memory': 90},  # Over-utilized
]

rightsizing_recommendations = []
for data in utilization_data:
    rec = recommend_instance_size(data['cpu'], data['memory'], data['instance'])
    rightsizing_recommendations.append(rec)

rightsizing_df = pd.DataFrame(rightsizing_recommendations)

print("Instance Right-Sizing Recommendations\n")
print(rightsizing_df.to_string(index=False))

total_savings = rightsizing_df['monthly_savings'].sum()
print(f"\n💰 Total potential monthly savings: ${total_savings:.2f}")
print(f"   Annual savings: ${total_savings * 12:.2f}")

print("\n📊 Right-Sizing Best Practices:")
print("   - Monitor utilization for at least 2 weeks")
print("   - Look at peak usage, not just averages")
print("   - Consider workload patterns (batch vs real-time)")
print("   - Test smaller instances before committing")
print("   - Use AWS Compute Optimizer (fed by CloudWatch metrics) for automated recommendations")
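
The advice to look at peak usage rather than averages can be made concrete with a percentile check. The sketch below applies the same 30% and 80% rules to the 95th percentile of CPU samples. The nearest-rank percentile, the choice of p95, and the function names are assumptions for illustration.

```python
import math
from statistics import mean

def percentile(values, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def rightsizing_signal(cpu_samples, low: float = 30, high: float = 80) -> dict:
    """Apply the downsize/upsize thresholds to p95 utilization instead of the mean."""
    p95 = percentile(cpu_samples, 95)
    if p95 > high:
        decision = "upsize"
    elif p95 < low:
        decision = "downsize"
    else:
        decision = "keep"
    return {"avg": mean(cpu_samples), "p95": p95, "decision": decision}

# Mostly idle instance with regular bursts: 90% of samples at 10% CPU, 10% at 85%
bursty = [10] * 90 + [85] * 10
print(rightsizing_signal(bursty))
```

The average here suggests an idle instance that could be downsized, while the p95 shows that the bursts would saturate a smaller one.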

## Summary

In this notebook, you learned comprehensive cloud cost optimization strategies:

### Key Takeaways:

1. **Pricing Models**
   - On-demand: Pay-per-use, no commitment
   - Spot: 70-90% savings, can be interrupted
   - Reserved: 30-60% savings, 1-3 year commitment
   - Free tier: $0 for learning and small projects

2. **Free Tier Maximization**
   - Use free services: Lambda, S3 (5GB), serverless
   - Train on Colab/Kaggle (free GPU)
   - Deploy with serverless for < 1M requests/month
   - Set up billing alerts at 50%, 75%, 90%

3. **Spot Training**
   - 70-90% cost reduction for training
   - Requires checkpointing
   - Best for non-urgent, long-running jobs
   - Use managed spot training in SageMaker

4. **Storage Optimization**
   - S3 Standard: Frequent access ($0.023/GB/mo)
   - Intelligent-Tiering: Auto-optimize
   - Glacier Instant Retrieval: Archive ($0.004/GB/mo)
   - Lifecycle policies: Auto-transition & delete

5. **Cost Monitoring**
   - AWS Budgets: Set spending limits
   - CloudWatch Alarms: Real-time alerts
   - Cost allocation tags: Track by project/team
   - Regular cost reviews: Weekly or monthly

6. **Right-Sizing**
   - Monitor utilization (CPU, memory)
   - Downsize if < 30% utilized
   - Upsize if > 80% utilized
   - Use auto-scaling for variable workloads
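
To make the storage prices in item 4 concrete, here is a minimal sketch that compares keeping data in S3 Standard for a year against transitioning it to Glacier after a few months. It uses only the per-GB prices quoted above; the function names, the 100GB size, and the 3-month transition point are illustrative assumptions, and the model deliberately ignores retrieval fees, request charges, and minimum storage durations.

```python
# Per-GB monthly prices as quoted in this summary.
# Assumption: check current AWS pricing for your region before relying on them.
PRICE_PER_GB_MONTH = {
    "standard": 0.023,
    "glacier": 0.004,
}


def monthly_storage_cost(size_gb, storage_class):
    """Cost of storing size_gb for one month in the given class."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


def lifecycle_cost(size_gb, months, transition_after_months):
    """Total cost when data stays in Standard, then moves to Glacier.

    Simplified model: storage charges only, no retrieval or request fees.
    """
    standard_months = min(months, transition_after_months)
    glacier_months = months - standard_months
    return (
        standard_months * monthly_storage_cost(size_gb, "standard")
        + glacier_months * monthly_storage_cost(size_gb, "glacier")
    )


size_gb = 100  # hypothetical dataset size
no_policy = lifecycle_cost(size_gb, months=12, transition_after_months=12)
with_policy = lifecycle_cost(size_gb, months=12, transition_after_months=3)

print(f"12 months, Standard only:        ${no_policy:.2f}")
print(f"12 months, Glacier after month 3: ${with_policy:.2f}")
print(f"Savings:                          ${no_policy - with_policy:.2f}")
```

The earlier the transition, the larger the saving, but archived data is slower and more expensive to read back, so this only pays off for data you rarely access.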

### Cost Optimization Checklist:

✅ Use free tier for learning  
✅ Spot instances for training  
✅ Serverless for low-traffic inference  
✅ S3 lifecycle policies  
✅ Billing alerts at 50%, 75%, 90%  
✅ Cost allocation tags  
✅ Right-size instances  
✅ Delete unused resources  
✅ Stop instances when not in use  
✅ Review costs weekly  
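
The 50%, 75%, and 90% alert levels from the checklist can be sketched as a plain threshold check. This is a local illustration only: `crossed_thresholds` is a hypothetical helper, not an AWS API, and in practice the alerts themselves would be configured in AWS Budgets or as a billing alarm.

```python
def crossed_thresholds(spend, budget, thresholds=(0.50, 0.75, 0.90)):
    """Return the alert thresholds (as fractions of budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in thresholds if used >= t]


# Hypothetical month: $16 spent against a $20 budget (80% used).
alerts = crossed_thresholds(spend=16.0, budget=20.0)
for threshold in alerts:
    print(f"Alert: spending has reached {threshold:.0%} of the budget")
```

Here the 50% and 75% alerts fire and the 90% alert does not, which leaves room to react before the budget is exhausted.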

### Realistic ML Project Costs:

| Scale | Traffic | Strategy | Monthly Cost |
|-------|---------|----------|-------------|
| Learning | N/A | Free tier + Colab | **$0** |
| Small Project | <1M requests | Free tier + Lambda | **$0-5** |
| Medium Project | 1-10M requests | Spot training + Lambda | **$20-50** |
| Production | 10M+ requests | Spot + Reserved + Auto-scale | **$200-500** |

## Next Steps

- **[Module 09: Multi-Cloud ML Considerations](09_multi_cloud_ml_considerations.ipynb)**: Cross-platform strategies
- **[Module 10: Cloud Storage for ML](10_cloud_storage_for_ml.ipynb)**: Deep dive into cloud storage
- **Practice**: Set up billing alerts for your AWS account
- **Explore**: AWS Cost Explorer for detailed cost analysis

## Additional Resources

- [AWS Cost Optimization](https://aws.amazon.com/pricing/cost-optimization/)
- [Azure Cost Management](https://azure.microsoft.com/en-us/products/cost-management/)
- [GCP Cost Optimization](https://cloud.google.com/cost-management)
- [AWS Budgets Documentation](https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-managing-costs.html)
- [SageMaker Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html)

## Exercises

### Exercise 1: Free Tier Budget Planner ⭐

Create a free tier budget planner that:
1. Takes your planned usage as input (storage, compute hours, Lambda requests)
2. Calculates if you'll stay within free tier limits
3. Suggests optimizations if you exceed limits
4. Provides cost estimates if you go over

Test with at least 3 different project scenarios.

In [None]:
# Your code here


### Exercise 2: Spot vs On-Demand Decision Tool ⭐⭐

Build a tool that recommends whether to use spot or on-demand instances based on:
1. Training time requirements
2. Deadline urgency
3. Interruption tolerance
4. Cost sensitivity
5. Checkpointing capability

Create a decision matrix and visualize the recommendation logic.

In [None]:
# Your code here


### Exercise 3: Storage Lifecycle Optimizer ⭐⭐

Design and implement a storage lifecycle policy for an ML project with:
- 500GB raw data
- 200GB processed data
- 50GB model artifacts
- 100GB temporary files

Calculate:
1. Costs with no lifecycle policy
2. Costs with optimized lifecycle policy
3. Monthly and annual savings
4. Optimal transition timelines

Visualize cost savings over 12 months.

In [None]:
# Your code here


### Exercise 4: Cost Monitoring Dashboard ⭐⭐⭐

Create a comprehensive cost monitoring system:

1. **Simulate monthly costs** for:
   - Compute (training & inference)
   - Storage (S3)
   - Data transfer
   - Serverless (Lambda)

2. **Implement alerts** at:
   - 50%, 75%, 90% of budget
   - Anomaly detection (sudden cost spikes)

3. **Generate reports**:
   - Cost breakdown by service
   - Trends over time
   - Recommendations for optimization

4. **Visualize**:
   - Cost trends (line chart)
   - Service breakdown (pie chart)
   - Budget utilization (gauge chart)

**Bonus**: Add forecasting for next month's costs based on trends.

In [None]:
# Your code here


### Exercise 5: Multi-Cloud Cost Comparison ⭐⭐⭐

Compare costs for the same ML workload across AWS, Azure, and GCP:

**Workload specification:**
- Training: 100 hours/month on GPU
- Inference: 1M predictions/month
- Storage: 200GB data
- Data transfer: 50GB/month

Calculate and compare:
1. Total monthly costs per platform
2. Free tier benefits
3. Spot/preemptible savings
4. Reserved instance pricing
5. Hidden costs (data egress, API calls)

Present findings in:
- Comparison table
- Cost breakdown charts
- Recommendation report

**Bonus**: Include pricing for different regions and show geographic optimization opportunities.

In [None]:
# Your code here
