# Module 10: Cloud Storage for ML

**Difficulty**: ⭐⭐
**Estimated Time**: 70 minutes
**Prerequisites**: 
- [Module 00: Introduction to Cloud ML Services](00_introduction_to_cloud_ml_services.ipynb)
- [Module 08: Cost Optimization Strategies](08_cost_optimization_strategies.ipynb)
- Basic understanding of object storage

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand differences between AWS S3, Azure Blob, and Google Cloud Storage
2. Implement efficient data lake architectures for ML
3. Use versioning and encryption for data security
4. Configure access control with IAM and SAS tokens
5. Optimize data transfer and costs
6. Choose appropriate storage tiers for ML workloads
7. Implement lifecycle policies for automated cost optimization
8. Design scalable storage strategies for large-scale ML

## Why Cloud Storage Matters for ML

ML projects involve massive amounts of data:
- **Raw data**: Original datasets (TBs to PBs)
- **Processed data**: Cleaned, transformed features
- **Model artifacts**: Trained models, checkpoints
- **Experiment results**: Logs, metrics, visualizations
- **Production data**: Inference requests/responses

Cloud storage provides:
- ✅ **Scalability**: Store petabytes without infrastructure
- ✅ **Durability**: 99.999999999% (11 nines) data durability
- ✅ **Accessibility**: Access from anywhere, any compute
- ✅ **Cost-effective**: Pay only for what you store
- ✅ **Integration**: Works seamlessly with ML services

### Storage Cost Distribution (Typical ML Project)
- 60-70%: Raw and processed datasets
- 15-20%: Model artifacts and checkpoints
- 10-15%: Logs and experiment tracking
- 5-10%: Data transfer costs

## Setup and Imports

In [None]:
# Standard library imports
import json
import os
from datetime import datetime, timedelta
from pathlib import Path
import hashlib

# Data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# File handling
import pickle
import joblib

# Cloud storage simulation
from dataclasses import dataclass
from typing import List, Dict, Optional
from enum import Enum

# Configuration
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")
np.random.seed(42)

print("Setup complete!")
print(f"Notebook executed on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## Part 1: Cloud Storage Services Comparison

### 1.1: Service Overview

In [None]:
# Comprehensive comparison of cloud storage services
storage_comparison = pd.DataFrame([
    {
        'Feature': 'Service Name',
        'AWS': 'S3 (Simple Storage Service)',
        'Azure': 'Blob Storage',
        'GCP': 'Cloud Storage',
        'Notes': 'All offer object storage'
    },
    {
        'Feature': 'Storage Unit',
        'AWS': 'Bucket',
        'Azure': 'Container',
        'GCP': 'Bucket',
        'Notes': 'Top-level organization'
    },
    {
        'Feature': 'Object Naming',
        'AWS': 'Key (with prefixes)',
        'Azure': 'Blob name',
        'GCP': 'Object name',
        'Notes': 'Flat namespace, not true folders'
    },
    {
        'Feature': 'Max Object Size',
        'AWS': '5TB',
        'Azure': '190.7TB (block blob)',
        'GCP': '5TB',
        'Notes': 'Use multipart upload for large files'
    },
    {
        'Feature': 'Durability',
        'AWS': '99.999999999% (11 nines)',
        'Azure': '99.999999999% (11 nines)',
        'GCP': '99.999999999% (11 nines)',
        'Notes': 'Standard across all providers'
    },
    {
        'Feature': 'Availability',
        'AWS': '99.99% (Standard)',
        'Azure': '99.9% (LRS), 99.99% (GRS)',
        'GCP': '99.95% (Standard)',
        'Notes': 'Varies by storage class'
    },
    {
        'Feature': 'Storage Classes',
        'AWS': '7 classes',
        'Azure': '4 tiers',
        'GCP': '4 classes',
        'Notes': 'Hot to archive tiers'
    },
    {
        'Feature': 'Standard Pricing',
        'AWS': '$0.023/GB/month',
        'Azure': '$0.018/GB/month',
        'GCP': '$0.020/GB/month',
        'Notes': 'First 5GB free (varies by tier)'
    },
    {
        'Feature': 'Data Transfer IN',
        'AWS': 'Free',
        'Azure': 'Free',
        'GCP': 'Free',
        'Notes': 'Upload is always free'
    },
    {
        'Feature': 'Data Transfer OUT',
        'AWS': '$0.09/GB (first 10TB)',
        'Azure': '$0.087/GB (first 10TB)',
        'GCP': '$0.12/GB (first 1TB)',
        'Notes': 'Download costs (internet egress)'
    },
    {
        'Feature': 'Free Tier',
        'AWS': '5GB Standard, 20k GET, 2k PUT',
        'Azure': '5GB LRS, 50k ops',
        'GCP': '5GB Standard, 5k Class A ops',
        'Notes': '12 months for AWS/Azure, always for GCP'
    }
])

print("Cloud Storage Services Comparison\n")
print(storage_comparison.to_string(index=False))
print("\n💡 Key Insights:")
print("   - All three are very similar in capabilities")
print("   - Azure Blob is slightly cheaper for standard storage")
print("   - GCP has highest egress costs")
print("   - S3 has most mature ecosystem and integrations")
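
Storage price alone can be misleading once egress is included. The sketch below combines the table's Standard storage rate and internet egress rate into a rough monthly estimate per provider. The function name `estimate_monthly_bill` and the example workload (1,000 GB stored, 100 GB downloaded) are illustrative assumptions, and request fees and free tiers are ignored.

```python
# Rates copied from the comparison table above (USD, approximate list prices).
RATES = {
    "AWS": {"storage_per_gb": 0.023, "egress_per_gb": 0.09},
    "Azure": {"storage_per_gb": 0.018, "egress_per_gb": 0.087},
    "GCP": {"storage_per_gb": 0.020, "egress_per_gb": 0.12},
}

def estimate_monthly_bill(stored_gb, egress_gb):
    """Return {provider: storage cost + internet egress cost} for one month."""
    return {
        provider: round(
            stored_gb * r["storage_per_gb"] + egress_gb * r["egress_per_gb"], 2
        )
        for provider, r in RATES.items()
    }

# Hypothetical workload: 1,000 GB stored, 100 GB downloaded to the internet
bill = estimate_monthly_bill(stored_gb=1000, egress_gb=100)
for provider, cost in sorted(bill.items(), key=lambda kv: kv[1]):
    print(f"{provider}: ${cost:.2f}/month")
```

Under these rates, a lower storage price can be offset by a higher egress price, so both terms are worth estimating before picking a provider.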

### 1.2: Storage Tiers Deep Dive

In [None]:
class StorageTier:
    """Represents a storage tier with pricing and characteristics"""
    def __init__(self, name, storage_cost, retrieval_cost, 
                 retrieval_time, min_days=0, best_for=""):
        self.name = name
        self.storage_cost = storage_cost  # $/GB/month
        self.retrieval_cost = retrieval_cost  # $/GB
        self.retrieval_time = retrieval_time
        self.min_days = min_days  # Minimum storage duration
        self.best_for = best_for

# AWS S3 Storage Classes
aws_s3_tiers = [
    StorageTier("S3 Standard", 0.023, 0, "Milliseconds", 0, "Frequently accessed"),
    StorageTier("S3 Intelligent-Tiering", 0.023, 0, "Milliseconds", 0, "Unknown patterns"),
    StorageTier("S3 Standard-IA", 0.0125, 0.01, "Milliseconds", 30, "Monthly access"),
    StorageTier("S3 One Zone-IA", 0.01, 0.01, "Milliseconds", 30, "Reproducible data"),
    StorageTier("S3 Glacier Instant", 0.004, 0.03, "Milliseconds", 90, "Quarterly access"),
    StorageTier("S3 Glacier Flexible", 0.0036, 0.02, "Minutes-Hours", 90, "Yearly access"),
    StorageTier("S3 Glacier Deep", 0.00099, 0.02, "12-48 hours", 180, "Compliance archives"),
]

# Azure Blob Storage Tiers
azure_blob_tiers = [
    StorageTier("Hot (LRS)", 0.018, 0, "Milliseconds", 0, "Active data"),
    StorageTier("Cool (LRS)", 0.01, 0.01, "Milliseconds", 30, "Monthly access"),
    StorageTier("Cold (LRS)", 0.0045, 0.03, "Milliseconds", 90, "Quarterly access"),
    StorageTier("Archive (LRS)", 0.00099, 0.022, "Hours", 180, "Compliance archives"),
]

# GCP Cloud Storage Classes
gcp_storage_classes = [
    StorageTier("Standard", 0.020, 0, "Milliseconds", 0, "Frequently accessed"),
    StorageTier("Nearline", 0.010, 0.01, "Milliseconds", 30, "Monthly access"),
    StorageTier("Coldline", 0.004, 0.02, "Milliseconds", 90, "Quarterly access"),
    StorageTier("Archive", 0.0012, 0.05, "Milliseconds", 365, "Yearly access"),
]

def compare_storage_tiers(data_gb=100, accesses_per_month=10):
    """
    Compare total costs across different storage tiers
    """
    results = []
    
    for tier in aws_s3_tiers + azure_blob_tiers + gcp_storage_classes:
        storage_cost = data_gb * tier.storage_cost
        retrieval_cost = data_gb * accesses_per_month * tier.retrieval_cost
        total_cost = storage_cost + retrieval_cost
        
        provider = 'AWS' if 'S3' in tier.name else ('Azure' if 'LRS' in tier.name else 'GCP')
        
        results.append({
            'Provider': provider,
            'Tier': tier.name,
            'Storage ($/mo)': storage_cost,
            'Retrieval ($/mo)': retrieval_cost,
            'Total ($/mo)': total_cost,
            'Retrieval Time': tier.retrieval_time,
            'Best For': tier.best_for
        })
    
    return pd.DataFrame(results)

# Compare costs for 100GB with 10 accesses/month
cost_comparison = compare_storage_tiers(100, 10)

print(f"Cost Comparison: 100GB Storage, 10 Full Retrievals/Month\n")
print(cost_comparison.to_string(index=False))

# Find cheapest option
cheapest = cost_comparison.loc[cost_comparison['Total ($/mo)'].idxmin()]
print(f"\n💰 Cheapest Option: {cheapest['Provider']} {cheapest['Tier']}")
print(f"   Total Cost: ${cheapest['Total ($/mo)']:.2f}/month")
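
The comparison above fixes the access count at 10 retrievals per month. A complementary question is how many full retrievals per month a colder tier can absorb before it stops being cheaper. Setting `hot_storage = cold_storage + n * cold_retrieval` and solving for `n` gives the break-even point. The helper below is a minimal sketch under the same simplifications as `compare_storage_tiers` (no request fees, no minimum-duration charges); the name `break_even_retrievals` is an assumption for illustration.

```python
def break_even_retrievals(hot_storage, cold_storage, cold_retrieval):
    """Full-dataset retrievals per month at which the colder tier stops being cheaper.

    All rates are in $/GB, so the dataset size cancels out of the comparison.
    """
    if cold_retrieval <= 0:
        raise ValueError("cold tier must have a positive retrieval fee")
    return (hot_storage - cold_storage) / cold_retrieval

# S3 Standard vs. S3 Standard-IA, using the rates from aws_s3_tiers
n = break_even_retrievals(hot_storage=0.023, cold_storage=0.0125, cold_retrieval=0.01)
print(f"Standard-IA is cheaper below {n:.2f} full retrievals per month")
```

With these rates the break-even is roughly one full read of the dataset per month, which is consistent with the "Monthly access" guidance for Standard-IA.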

## Part 2: Data Lake Architecture for ML

A data lake is a centralized repository that stores all structured and unstructured data at any scale.

### 2.1: Recommended Folder Structure

In [None]:
# Recommended data lake structure for ML projects
data_lake_structure = {
    'raw/': {
        'description': 'Original, immutable data',
        'storage_class': 'Standard → Intelligent-Tiering',
        'retention': 'Permanent (or 1-2 years)',
        'example': 'raw/2024/11/19/data_source_001.parquet',
        'characteristics': [
            'Never modify after upload',
            'Partition by date for efficient querying',
            'Use lifecycle policies to transition to cheaper tiers'
        ]
    },
    'processed/': {
        'description': 'Cleaned, transformed data',
        'storage_class': 'Standard',
        'retention': '6-12 months',
        'example': 'processed/customer_features_v2.parquet',
        'characteristics': [
            'Result of ETL pipelines',
            'Frequently accessed for training',
            'Version controlled (v1, v2, etc.)'
        ]
    },
    'features/': {
        'description': 'Feature store data',
        'storage_class': 'Standard',
        'retention': '3-6 months active + archive',
        'example': 'features/user_behavior/2024_11.parquet',
        'characteristics': [
            'Engineered features ready for ML',
            'Organized by feature group',
            'Includes metadata and lineage'
        ]
    },
    'models/': {
        'description': 'Trained model artifacts',
        'storage_class': 'Standard for active, IA for old versions',
        'retention': 'Current + last 5 versions',
        'example': 'models/customer_churn/v1.2.3/model.pkl',
        'characteristics': [
            'Semantic versioning',
            'Include metadata.json with metrics',
            'Archive old versions to Glacier'
        ]
    },
    'experiments/': {
        'description': 'Experiment tracking data',
        'storage_class': 'Standard → IA after 30 days',
        'retention': '3 months active + archive',
        'example': 'experiments/exp_20241119_001/metrics.json',
        'characteristics': [
            'Hyperparameters, metrics, plots',
            'Organized by experiment ID',
            'Use MLflow or similar for tracking'
        ]
    },
    'predictions/': {
        'description': 'Batch prediction outputs',
        'storage_class': 'Standard → IA after 30 days',  # S3 requires 30 days in Standard before moving to IA
        'retention': '30-90 days',
        'example': 'predictions/2024/11/19/batch_001.parquet',
        'characteristics': [
            'Input features + predictions',
            'For debugging and monitoring',
            'Lifecycle policy to auto-delete'
        ]
    },
    'logs/': {
        'description': 'Application and training logs',
        'storage_class': 'Standard',  # deleted at 30 days, so an IA transition would not pay off
        'retention': '30 days',
        'example': 'logs/training/2024_11_19.log',
        'characteristics': [
            'Training logs, error logs',
            'Auto-delete after 30 days',
            'Consider CloudWatch/Stackdriver instead'
        ]
    },
    'temp/': {
        'description': 'Temporary intermediate files',
        'storage_class': 'Standard',
        'retention': '7 days (auto-delete)',
        'example': 'temp/job_12345/intermediate.parquet',
        'characteristics': [
            'Short-lived processing artifacts',
            'Aggressive lifecycle policy',
            'Clean up immediately after job'
        ]
    }
}

print("Recommended Data Lake Structure for ML\n")
for folder, details in data_lake_structure.items():
    print(f"📁 {folder}")
    print(f"   {details['description']}")
    print(f"   Storage: {details['storage_class']}")
    print(f"   Retention: {details['retention']}")
    print(f"   Example: {details['example']}")
    print()

print("🎯 Best Practices:")
print("   - Partition by date for time-series data")
print("   - Use meaningful prefixes (folders) for organization")
print("   - Implement lifecycle policies for automated tiering")
print("   - Version all datasets and models")
print("   - Never modify raw data (immutable)")
print("   - Use metadata files (JSON/YAML) alongside data")
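
Date partitioning is easiest to keep consistent when keys are generated by a helper rather than typed by hand. The sketch below builds keys in the `prefix/YYYY/MM/DD/filename` layout used by the `raw/` and `predictions/` examples; the function name `partitioned_key` is an assumption for illustration.

```python
from datetime import date

def partitioned_key(prefix, day, filename):
    """Build an object key of the form <prefix>/YYYY/MM/DD/<filename>."""
    return f"{prefix.strip('/')}/{day:%Y/%m/%d}/{filename}"

key = partitioned_key("raw", date(2024, 11, 19), "data_source_001.parquet")
print(key)  # raw/2024/11/19/data_source_001.parquet

# Zero-padded components keep keys in chronological order when sorted as strings
keys = sorted(partitioned_key("raw", date(2024, m, 1), "part.parquet") for m in (11, 2, 9))
print(keys)
```

Because object stores list keys by prefix, this layout lets a job read one day (`raw/2024/11/19/`) or one month (`raw/2024/11/`) without scanning the rest of the bucket.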

### 2.2: Data Lake Cost Simulation

In [None]:
def simulate_data_lake_costs(months=12):
    """
    Simulate data lake growth and costs over time
    """
    # Initial data sizes (GB)
    data = {
        'raw': 100,
        'processed': 50,
        'features': 20,
        'models': 2,
        'experiments': 5,
        'predictions': 10,
        'logs': 5,
        'temp': 3
    }
    
    # Monthly growth rates
    growth_rates = {
        'raw': 1.10,  # 10% growth/month
        'processed': 1.08,
        'features': 1.05,
        'models': 1.15,  # Accumulates more models
        'experiments': 1.20,  # Many experiments
        'predictions': 1.05,
        'logs': 1.0,  # Deleted monthly
        'temp': 1.0  # Deleted weekly
    }
    
    # Storage costs ($/GB/month)
    storage_costs = {
        'raw': 0.015,  # Intelligent-Tiering average
        'processed': 0.023,  # Standard
        'features': 0.023,  # Standard
        'models': 0.0125,  # Standard-IA for old versions
        'experiments': 0.015,  # Mix of Standard and IA
        'predictions': 0.0125,  # Standard-IA
        'logs': 0.023,  # Standard
        'temp': 0.023  # Standard
    }
    
    results = []
    
    for month in range(1, months + 1):
        month_cost = 0
        month_data = {}
        
        for category in data.keys():
            # Apply growth
            data[category] *= growth_rates[category]
            
            # Calculate cost
            cost = data[category] * storage_costs[category]
            month_cost += cost
            month_data[category] = data[category]
        
        total_storage = sum(data.values())
        
        results.append({
            'Month': month,
            'Total Storage (GB)': total_storage,
            'Monthly Cost ($)': month_cost,
            **{f'{k} (GB)': v for k, v in month_data.items()}
        })
    
    return pd.DataFrame(results)

# Simulate 12 months
cost_projection = simulate_data_lake_costs(12)

print("Data Lake Cost Projection (12 months)\n")
print(cost_projection[['Month', 'Total Storage (GB)', 'Monthly Cost ($)']].to_string(index=False))

# Summary statistics
total_cost = cost_projection['Monthly Cost ($)'].sum()
avg_monthly_cost = cost_projection['Monthly Cost ($)'].mean()
final_storage = cost_projection.iloc[-1]['Total Storage (GB)']

print(f"\n📊 Summary:")
print(f"   Total 12-month cost: ${total_cost:.2f}")
print(f"   Average monthly cost: ${avg_monthly_cost:.2f}")
print(f"   Final storage size: {final_storage:.0f} GB")
print(f"\n💡 Cost Optimization Opportunities:")
print(f"   - Implement lifecycle policies: Save ~30%")
print(f"   - Delete old experiments: Save ~15%")
print(f"   - Compress data: Save ~20-40%")
print(f"   - Archive old models: Save ~10%")
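
To make the lifecycle-policy suggestion concrete, the sketch below builds a rule set as a plain dictionary in the shape the S3 lifecycle configuration API accepts (for example, as the `LifecycleConfiguration` argument to boto3's `put_bucket_lifecycle_configuration`). Nothing is sent to AWS here. The helper name `lifecycle_rule` and the rule IDs are assumptions for illustration, and the retention values are taken from the data lake structure in section 2.1.

```python
import json

def lifecycle_rule(rule_id, prefix, transition=None, expire_days=None):
    """Build one S3 lifecycle rule dict for objects under `prefix`.

    transition: optional (days, storage_class) tuple, e.g. (30, "STANDARD_IA")
    expire_days: optional object age in days after which objects are deleted
    """
    rule = {"ID": rule_id, "Filter": {"Prefix": prefix}, "Status": "Enabled"}
    if transition is not None:
        days, storage_class = transition
        rule["Transitions"] = [{"Days": days, "StorageClass": storage_class}]
    if expire_days is not None:
        rule["Expiration"] = {"Days": expire_days}
    return rule

lifecycle_config = {
    "Rules": [
        lifecycle_rule("experiments-to-ia", "experiments/", transition=(30, "STANDARD_IA")),
        lifecycle_rule("logs-expire", "logs/", expire_days=30),
        lifecycle_rule("temp-expire", "temp/", expire_days=7),
    ]
}
print(json.dumps(lifecycle_config, indent=2))
```

Keeping the rules in code makes them reviewable and lets the same retention table drive both the cost simulation and the bucket configuration.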

## Part 3: Versioning and Encryption

### 3.1: S3 Versioning

In [None]:
# S3 Versioning example (simulated)
class S3ObjectVersion:
    """Simulates S3 object versioning"""
    def __init__(self, key, version_id, size_bytes, last_modified):
        self.key = key
        self.version_id = version_id
        self.size_bytes = size_bytes
        self.last_modified = last_modified
        self.is_latest = False
        self.is_delete_marker = False

# Simulate versioning for a model file
model_versions = [
    S3ObjectVersion('models/customer_churn/model.pkl', 'v1', 10_000_000, 
                    datetime.now() - timedelta(days=30)),
    S3ObjectVersion('models/customer_churn/model.pkl', 'v2', 12_000_000, 
                    datetime.now() - timedelta(days=15)),
    S3ObjectVersion('models/customer_churn/model.pkl', 'v3', 11_500_000, 
                    datetime.now() - timedelta(days=7)),
    S3ObjectVersion('models/customer_churn/model.pkl', 'v4', 11_800_000, 
                    datetime.now()),
]
model_versions[-1].is_latest = True

# Display version history
version_data = [{
    'Version ID': v.version_id,
    'Size (MB)': v.size_bytes / 1_000_000,
    'Last Modified': v.last_modified.strftime('%Y-%m-%d'),
    'Latest': '✅' if v.is_latest else '',
} for v in model_versions]

version_df = pd.DataFrame(version_data)
print("S3 Object Versioning Example\n")
print(f"Key: {model_versions[0].key}\n")
print(version_df.to_string(index=False))

# Calculate storage cost with versioning
total_storage_mb = sum(v.size_bytes for v in model_versions) / 1_000_000
monthly_cost = (total_storage_mb / 1000) * 0.023  # decimal MB -> GB, then Standard S3 pricing

print(f"\n💰 Versioning Storage Cost:")
print(f"   Total storage (all versions): {total_storage_mb:.2f} MB")
print(f"   Monthly cost: ${monthly_cost:.4f}")
print(f"\n⚠️ Versioning Considerations:")
print("   - Every overwrite creates a new version")
print("   - All versions count toward storage costs")
print("   - Use lifecycle policies to delete old versions")
print("   - Can recover from accidental deletions")
print("\n✅ Best Practices:")
print("   - Enable versioning for critical data (models, features)")
print("   - Delete old versions after N days (e.g., 90 days)")
print("   - Use version IDs for reproducibility")
print("   - Consider versioning in application (not S3) for temp data")
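
The retention advice above (keep the newest versions, delete the rest after N days) can be expressed as a small selection function. The sketch below is one possible reading that combines both conditions: a version is deleted only if it falls outside the newest `keep_latest` versions and is older than `max_age_days`. The function name and the tuple-based input are assumptions for illustration; in practice S3 lifecycle rules for noncurrent versions can enforce a similar policy server-side.

```python
from datetime import datetime, timedelta

def versions_to_delete(versions, keep_latest=5, max_age_days=90, now=None):
    """Return version IDs that are outside the newest `keep_latest`
    versions and also older than `max_age_days`.

    versions: list of (version_id, last_modified) tuples
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    newest_first = sorted(versions, key=lambda v: v[1], reverse=True)
    return [vid for vid, modified in newest_first[keep_latest:] if modified < cutoff]

# Eight synthetic versions, 30 days apart (v8 is the newest)
now = datetime(2024, 11, 19)
history = [(f"v{i}", now - timedelta(days=30 * (8 - i))) for i in range(1, 9)]
stale = versions_to_delete(history, keep_latest=5, max_age_days=90, now=now)
print(stale)  # ['v3', 'v2', 'v1']
```

Passing `now` explicitly keeps the function deterministic, which makes the retention rule easy to unit test.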

### 3.2: Encryption Options

In [None]:
# Encryption options comparison
encryption_options = pd.DataFrame([
    {
        'Method': 'SSE-S3 (AWS Managed)',
        'Provider': 'AWS',
        'Key Management': 'AWS manages keys',
        'Cost': 'Free',
        'Ease': 'Easiest',
        'Control': 'Low',
        'Use Case': 'Default encryption'
    },
    {
        'Method': 'SSE-KMS (AWS KMS)',
        'Provider': 'AWS',
        'Key Management': 'AWS KMS (you control)',
        'Cost': '$1/month/key + API calls',
        'Ease': 'Medium',
        'Control': 'High',
        'Use Case': 'Compliance, audit trail'
    },
    {
        'Method': 'SSE-C (Customer Provided)',
        'Provider': 'AWS',
        'Key Management': 'You manage keys',
        'Cost': 'Free',
        'Ease': 'Hard',
        'Control': 'Full',
        'Use Case': 'Strict data sovereignty'
    },
    {
        'Method': 'Client-Side Encryption',
        'Provider': 'All',
        'Key Management': 'You manage everything',
        'Cost': 'Free (storage)',
        'Ease': 'Hardest',
        'Control': 'Full',
        'Use Case': 'Zero-trust, maximum security'
    },
    {
        'Method': 'Azure Storage Service Encryption',
        'Provider': 'Azure',
        'Key Management': 'Microsoft manages',
        'Cost': 'Free',
        'Ease': 'Easiest',
        'Control': 'Low',
        'Use Case': 'Default encryption'
    },
    {
        'Method': 'GCP Default Encryption',
        'Provider': 'GCP',
        'Key Management': 'Google manages',
        'Cost': 'Free',
        'Ease': 'Easiest',
        'Control': 'Low',
        'Use Case': 'Default encryption'
    }
])

print("Cloud Storage Encryption Options\n")
print(encryption_options.to_string(index=False))
print("\n🔒 Encryption Best Practices:")
print("   - Always use encryption (at minimum SSE-S3/default)")
print("   - Use SSE-KMS for compliance requirements")
print("   - Enable encryption in transit (HTTPS)")
print("   - Rotate keys regularly (automated with KMS)")
print("   - Consider client-side encryption for PII data")
print("\n💡 For ML Projects:")
print("   - Default encryption (SSE-S3): Sufficient for most cases")
print("   - KMS encryption: Healthcare, finance, regulated industries")
print("   - Client-side: Sensitive customer data, PII")
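
When uploading with boto3, the server-side encryption mode is selected per request through the `ServerSideEncryption` and `SSEKMSKeyId` arguments (passed, for example, via `ExtraArgs` to `upload_file`). The sketch below only builds that argument dictionary so it runs without AWS credentials. The helper name `sse_extra_args` and the key ARN are hypothetical.

```python
def sse_extra_args(mode="AES256", kms_key_id=None):
    """Build server-side-encryption arguments for an S3 upload.

    mode: "AES256" selects SSE-S3, "aws:kms" selects SSE-KMS
    kms_key_id: KMS key ID or ARN, only valid together with "aws:kms"
    """
    if mode not in ("AES256", "aws:kms"):
        raise ValueError(f"unsupported SSE mode: {mode}")
    args = {"ServerSideEncryption": mode}
    if kms_key_id is not None:
        if mode != "aws:kms":
            raise ValueError("kms_key_id requires mode='aws:kms'")
        args["SSEKMSKeyId"] = kms_key_id
    return args

default_args = sse_extra_args()
# Hypothetical key ARN, for illustration only
kms_args = sse_extra_args(
    "aws:kms",
    kms_key_id="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(default_args)
print(kms_args)
```

A typical call would then look like `s3_client.upload_file(path, bucket, key, ExtraArgs=kms_args)`, assuming `s3_client` is a boto3 S3 client with permission to use the key.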

## Part 4: Access Control

### 4.1: IAM Policies for S3

In [None]:
# Example IAM policies for ML data access
iam_policies = {
    'data_scientist_readonly': {
        'Description': 'Read-only access to processed data and models',
        'Policy': {
            'Version': '2012-10-17',
            'Statement': [
                {
                    'Effect': 'Allow',
                    'Action': [
                        's3:GetObject',
                        's3:ListBucket'
                    ],
                    'Resource': [
                        'arn:aws:s3:::ml-data-bucket/processed/*',
                        'arn:aws:s3:::ml-data-bucket/models/*',
                        'arn:aws:s3:::ml-data-bucket'
                    ]
                }
            ]
        }
    },
    'ml_engineer_full': {
        'Description': 'Full access to all ML data',
        'Policy': {
            'Version': '2012-10-17',
            'Statement': [
                {
                    'Effect': 'Allow',
                    'Action': 's3:*',
                    'Resource': [
                        'arn:aws:s3:::ml-data-bucket/*',
                        'arn:aws:s3:::ml-data-bucket'
                    ]
                }
            ]
        }
    },
    'training_job_role': {
        'Description': 'SageMaker training job access',
        'Policy': {
            'Version': '2012-10-17',
            'Statement': [
                {
                    'Effect': 'Allow',
                    'Action': [
                        's3:GetObject',
                        's3:PutObject'
                    ],
                    'Resource': [
                        'arn:aws:s3:::ml-data-bucket/processed/*',
                        'arn:aws:s3:::ml-data-bucket/models/*',
                        'arn:aws:s3:::ml-data-bucket/experiments/*'
                    ]
                },
                {
                    'Effect': 'Deny',
                    'Action': 's3:*',
                    'Resource': 'arn:aws:s3:::ml-data-bucket/raw/*'
                }
            ]
        }
    }
}

print("IAM Policy Examples for ML Data Access\n")
for policy_name, details in iam_policies.items():
    print(f"📋 {policy_name}")
    print(f"   {details['Description']}")
    print(f"   Policy: {json.dumps(details['Policy'], indent=6)}")
    print()

print("🔐 Access Control Best Practices:")
print("   - Principle of least privilege")
print("   - Use IAM roles, not access keys")
print("   - Separate read and write permissions")
print("   - Protect raw data (read-only for most users)")
print("   - Use bucket policies + IAM policies together")
print("   - Enable MFA delete for critical data")
print("   - Regular access audits with CloudTrail")
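
To see how the `training_job_role` statements interact, it helps to evaluate a few requests by hand. The sketch below is a deliberately simplified evaluator covering only the core rule that an explicit Deny overrides any Allow and that a request with no matching Allow is denied by default. It is not the real IAM engine: conditions, principals, `NotAction`, bucket policies, and case-insensitive action matching are all ignored, and the helper names are assumptions.

```python
from fnmatch import fnmatchcase

def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]

def _matches(statement, action, resource):
    """True if both the action and the resource match one of the statement's patterns."""
    return (any(fnmatchcase(action, p) for p in _as_list(statement["Action"]))
            and any(fnmatchcase(resource, p) for p in _as_list(statement["Resource"])))

def is_allowed(policy, action, resource):
    """Simplified evaluation: explicit Deny wins, otherwise an Allow is required."""
    matching = [s for s in policy["Statement"] if _matches(s, action, resource)]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)

bucket = "arn:aws:s3:::ml-data-bucket"
training_job_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:PutObject"],
         "Resource": [f"{bucket}/processed/*", f"{bucket}/models/*", f"{bucket}/experiments/*"]},
        {"Effect": "Deny", "Action": "s3:*", "Resource": f"{bucket}/raw/*"},
    ],
}

checks = [
    ("s3:GetObject", f"{bucket}/processed/train.parquet"),  # allowed
    ("s3:GetObject", f"{bucket}/raw/source.csv"),           # explicit deny
    ("s3:DeleteObject", f"{bucket}/models/v1/model.pkl"),   # no matching allow
]
results = [is_allowed(training_job_policy, a, r) for a, r in checks]
print(results)  # [True, False, False]
```

The explicit Deny on `raw/*` matters mostly as a guard: even if another attached policy later grants broad `s3:*` access, the training job still cannot touch raw data.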

## Part 5: Data Transfer Optimization

### 5.1: Data Transfer Costs

In [None]:
def calculate_transfer_costs(transfer_gb, source, destination):
    """
    Calculate data transfer costs between different locations
    
    Pricing (AWS example, approximate):
    - Within same region: FREE
    - Between regions: $0.02/GB
    - To internet: $0.09/GB (first 10TB)
    - From S3 to CloudFront: FREE (CloudFront-to-internet egress is billed separately)
    """
    transfer_types = {
        ('s3', 'ec2_same_region'): 0,
        ('s3', 's3_same_region'): 0,
        ('s3', 'ec2_different_region'): 0.02,
        ('s3', 's3_different_region'): 0.02,
        ('s3', 'internet'): 0.09,
        ('s3', 'cloudfront'): 0,
        ('s3', 'sagemaker_same_region'): 0,
    }
    
    key = (source, destination)
    rate = transfer_types.get(key, 0.09)  # Default to internet pricing
    cost = transfer_gb * rate
    
    return {
        'transfer_gb': transfer_gb,
        'source': source,
        'destination': destination,
        'rate_per_gb': rate,
        'total_cost': cost
    }

# Example transfer scenarios
transfer_scenarios = [
    ('s3', 'sagemaker_same_region', 100, 'Training data to SageMaker'),
    ('s3', 'ec2_different_region', 50, 'Inference data cross-region'),
    ('s3', 'internet', 10, 'Download model for local testing'),
    ('s3', 's3_different_region', 200, 'Backup to different region'),
]

transfer_results = []
for source, dest, gb, description in transfer_scenarios:
    result = calculate_transfer_costs(gb, source, dest)
    result['description'] = description
    transfer_results.append(result)

transfer_df = pd.DataFrame(transfer_results)

print("Data Transfer Cost Analysis\n")
print(transfer_df[[
    'description', 'transfer_gb', 'rate_per_gb', 'total_cost'
]].to_string(index=False))

total_transfer_cost = transfer_df['total_cost'].sum()
print(f"\n💰 Total transfer cost: ${total_transfer_cost:.2f}")

print("\n📊 Transfer Cost Optimization Tips:")
print("   1. Train in same region as data (FREE transfer)")
print("   2. Use CloudFront for downloads (S3-to-CloudFront transfer is FREE; CloudFront egress is still billed)")
print("   3. Compress data before transfer (30-80% reduction)")
print("   4. Use S3 Transfer Acceleration for large uploads")
print("   5. Minimize cross-region transfers")
print("   6. Download only what you need (filter, sample)")
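
The compression tip is easy to check on a sample before committing to a format. The sketch below gzips a synthetic CSV payload with the standard library and compares the internet egress cost before and after, using the $0.09/GB rate from the table above. The data is artificial and highly repetitive, so real datasets will usually compress less well; `egress_cost` is a hypothetical helper.

```python
import gzip

def egress_cost(size_bytes, rate_per_gb=0.09):
    """Internet egress cost in USD for a payload, using decimal GB."""
    return size_bytes / 1e9 * rate_per_gb

# Synthetic, repetitive CSV rows (illustrative only)
rows = [f"{i},user_{i % 100},{(i * 7) % 1000}\n" for i in range(50_000)]
raw = "".join(rows).encode("utf-8")
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)
print(f"raw: {len(raw):,} bytes, gzip: {len(compressed):,} bytes ({ratio:.0%} of original)")
print(f"egress per 1,000 downloads: ${egress_cost(len(raw)) * 1000:.4f} -> "
      f"${egress_cost(len(compressed)) * 1000:.4f}")
```

Running the same measurement on a representative slice of real data gives a more trustworthy estimate than the generic 30-80% range.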

## Summary

In this notebook, you learned comprehensive cloud storage strategies for ML:

### Key Takeaways:

1. **Storage Services Comparison**
   - AWS S3, Azure Blob, GCP Storage are functionally equivalent
   - Azure Blob slightly cheaper for standard storage
   - All offer 11 nines (99.999999999%) durability
   - Free tier: 5GB (12 months on AWS/Azure, always free on GCP)

2. **Storage Tiers**
   - Standard: Frequent access ($0.02-0.023/GB/month)
   - Infrequent Access: Monthly access ($0.01-0.0125/GB/month)
   - Archive: Rare access ($0.001-0.004/GB/month)
   - Use lifecycle policies for automatic tiering

3. **Data Lake Architecture**
   - Organized folder structure: raw/, processed/, features/, models/
   - Raw data is immutable
   - Version control for models and features
   - Lifecycle policies for cost optimization

4. **Versioning**
   - Enable for critical data (models, features)
   - All versions count toward storage costs
   - Use lifecycle policies to delete old versions
   - Enables rollback and reproducibility

5. **Encryption**
   - Default encryption (SSE-S3): Free and sufficient for most cases
   - KMS encryption: Compliance, audit trail
   - Client-side: Maximum security, full control
   - Always use HTTPS for transfer

6. **Access Control**
   - Principle of least privilege
   - Use IAM roles, not access keys
   - Protect raw data (read-only)
   - Regular audits with CloudTrail

7. **Data Transfer**
   - Within region: FREE
   - Cross-region: $0.02/GB
   - To internet: $0.09/GB
   - Optimize: compress, filter, same-region training

### Cost Optimization Checklist:

✅ Implement lifecycle policies  
✅ Use appropriate storage tiers  
✅ Compress data (Parquet, gzip)  
✅ Delete temporary files  
✅ Archive old experiments  
✅ Minimize cross-region transfers  
✅ Delete old model versions  
✅ Use Intelligent-Tiering for unknown patterns  
✅ Monitor storage usage regularly  
✅ Clean up failed jobs  

### Typical ML Project Storage Costs:

| Project Size | Monthly Data | Storage Cost | Transfer Cost | Total |
|--------------|--------------|--------------|---------------|-------|
| Small | 50GB | $1 | $0.50 | **$1.50** |
| Medium | 500GB | $10 | $2 | **$12** |
| Large | 5TB | $100 | $10 | **$110** |
| Enterprise | 50TB | $1000 | $50 | **$1050** |

## Next Steps

- **[Module 11: Final Project - Deploy Model on Cloud](11_final_project_deploy_model_on_cloud.ipynb)**: Capstone project
- **Practice**: Set up an S3 bucket with lifecycle policies
- **Explore**: AWS DataSync for large-scale migrations
- **Implement**: Data lake architecture for your project

## Additional Resources

- [AWS S3 Documentation](https://docs.aws.amazon.com/s3/)
- [Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/)
- [GCP Cloud Storage](https://cloud.google.com/storage/docs)
- [Data Lake Best Practices](https://aws.amazon.com/big-data/datalakes-and-analytics/)
- [S3 Intelligent-Tiering](https://aws.amazon.com/s3/storage-classes/intelligent-tiering/)

## Exercises

### Exercise 1: Storage Tier Optimizer ⭐

Create a storage tier recommendation tool:

1. Input: Data size, access frequency (daily, weekly, monthly, yearly)
2. Calculate costs for all storage tiers
3. Include retrieval costs based on access frequency
4. Recommend optimal tier
5. Show cost savings vs standard tier

Test with at least 5 different scenarios.

In [None]:
# Your code here


### Exercise 2: Lifecycle Policy Generator ⭐⭐

Build a lifecycle policy generator for ML projects:

1. **Define data categories** with retention requirements:
   - Raw data: 1 year → archive
   - Processed: 6 months → IA → delete
   - Models: Keep latest 5 versions
   - Experiments: 90 days → delete
   - Temp: 7 days → delete

2. **Generate**:
   - AWS S3 lifecycle policy (JSON)
   - Azure Blob lifecycle policy
   - Cost savings estimate

3. **Visualize** storage costs over time with and without policies.

In [None]:
# Your code here


### Exercise 3: Data Lake Cost Analyzer ⭐⭐

Analyze your existing or planned data lake:

1. **Inventory**: List all data categories and sizes
2. **Current costs**: Calculate with current storage classes
3. **Optimized costs**: Apply appropriate tiers and lifecycle policies
4. **Savings**: Show monthly and annual savings
5. **Growth projection**: Model 12-month growth
6. **Visualization**:
   - Storage breakdown by category
   - Cost trends over time
   - Savings opportunities

Present findings as a management report.

In [None]:
# Your code here


### Exercise 4: Access Control Policy Designer ⭐⭐⭐

Design a complete access control system:

**Roles**:
- Data Scientists (read-only)
- ML Engineers (read-write processed, models)
- Data Engineers (full access)
- Training Jobs (specific paths only)
- Inference Services (models only)

**Tasks**:
1. Create IAM policies for each role
2. Implement bucket policies for defense in depth
3. Document permission matrix
4. Identify potential security gaps
5. Generate Terraform code for policies

**Bonus**: Add MFA requirements for sensitive data.

In [None]:
# Your code here


### Exercise 5: Multi-Cloud Storage Migration Plan ⭐⭐⭐

Plan a migration from AWS S3 to Azure Blob (or vice versa):

**Scenario**: 10TB ML data lake on AWS → Azure

**Tasks**:
1. **Inventory**: List all objects, sizes, access patterns
2. **Mapping**: Map S3 buckets to Azure containers
3. **Cost analysis**:
   - Transfer costs
   - Downtime costs
   - Tool/service costs
4. **Migration strategy**:
   - Tools (AWS DataSync, AzCopy, rclone)
   - Phased approach vs big bang
   - Validation and testing
5. **Timeline**: Week-by-week plan
6. **Risks**: Identify risks and mitigation strategies
7. **Rollback plan**: How to revert if needed

**Deliverable**: Complete migration runbook with costs, timeline, and risks.

**Bonus**: Implement sync script using boto3 and Azure SDK (simulated).

In [None]:
# Your code here
