# Module 09: Multi-Cloud ML Considerations

**Difficulty**: ⭐⭐⭐
**Estimated Time**: 80 minutes
**Prerequisites**: 
- [Module 00: Introduction to Cloud ML Services](00_introduction_to_cloud_ml_services.ipynb)
- [Module 07: Serverless ML](07_serverless_ml.ipynb)
- [Module 08: Cost Optimization Strategies](08_cost_optimization_strategies.ipynb)
- Understanding of cloud fundamentals

## Learning Objectives
By the end of this notebook, you will be able to:
1. Decide when a multi-cloud strategy is preferable to a single-cloud one
2. Map equivalent services across AWS, Azure, and GCP
3. Use MLflow for multi-cloud experiment tracking
4. Implement model portability with ONNX
5. Deploy ML applications on Kubernetes for cloud-agnostic architecture
6. Use Terraform for infrastructure as code across clouds
7. Mitigate vendor lock-in risks
8. Design portable data formats and pipelines

## Multi-Cloud vs Single Cloud: The Decision

### When to Use Single Cloud ✅
- **Simplicity**: Easier to learn, manage, and optimize
- **Cost**: Better volume discounts and reserved pricing
- **Integration**: Native services work seamlessly together
- **Support**: Single vendor relationship
- **Performance**: Lower latency within same cloud

**Best for**: Startups, small teams, cost-sensitive projects

### When to Use Multi-Cloud ✅
- **Avoid vendor lock-in**: Negotiate better pricing, reduce dependency
- **Best-of-breed**: Use best ML services from each provider
- **Compliance**: Data residency requirements across regions/countries
- **Disaster recovery**: True redundancy across providers
- **Customer requirements**: Existing enterprise contracts

**Best for**: Large enterprises, regulated industries, high-availability systems

### Multi-Cloud Challenges ⚠️
- **Complexity**: Managing multiple platforms, tools, billing
- **Cost**: Higher operational overhead, potential waste
- **Data transfer**: Expensive cross-cloud egress fees
- **Skills**: Team needs expertise in multiple clouds
- **Testing**: Harder to ensure consistency across platforms
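
To make the data-transfer point concrete, here is a rough back-of-the-envelope sketch. The per-GB rate, the dataset size, and the sync frequency are all illustrative assumptions rather than quoted prices, so check each provider's current pricing page before relying on the numbers.

```python
def monthly_egress_cost(gb_per_sync: float, syncs_per_month: int, rate_per_gb: float) -> float:
    """Estimate the monthly cross-cloud egress cost in USD."""
    if gb_per_sync < 0 or syncs_per_month < 0 or rate_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    return gb_per_sync * syncs_per_month * rate_per_gb


# Hypothetical workload: copy a 500 GB training set to a second cloud four times a month
ASSUMED_RATE_PER_GB = 0.09  # assumption for illustration only, not a quoted price
cost = monthly_egress_cost(gb_per_sync=500, syncs_per_month=4, rate_per_gb=ASSUMED_RATE_PER_GB)
print(f"Estimated egress: ${cost:,.2f}/month")
```

Even a modest sync schedule adds a recurring line item that a single-cloud setup avoids entirely, which is why data gravity often decides where training runs.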

## Setup and Imports

In [None]:
# Standard library imports
import json
import os
from datetime import datetime
from pathlib import Path

# Data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Multi-cloud tools
# pip install mlflow onnx onnxruntime skl2onnx
import mlflow
import mlflow.sklearn

# Model serialization
import joblib
import pickle

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("Setup complete!")
print(f"Notebook executed on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"MLflow version: {mlflow.__version__}")

## Part 1: Service Equivalency Mapping

Knowing which services are equivalent across clouds is the starting point for any multi-cloud strategy.

### 1.1: ML Service Comparison

In [None]:
# Comprehensive ML service mapping across clouds
ml_service_mapping = pd.DataFrame([
    {
        'Category': 'Managed ML Platform',
        'AWS': 'SageMaker',
        'Azure': 'Azure Machine Learning',
        'GCP': 'Vertex AI',
        'Open Source Alternative': 'MLflow + Kubeflow'
    },
    {
        'Category': 'Serverless Functions',
        'AWS': 'Lambda',
        'Azure': 'Azure Functions',
        'GCP': 'Cloud Functions',
        'Open Source Alternative': 'OpenFaaS, Knative'
    },
    {
        'Category': 'Container Orchestration',
        'AWS': 'EKS (Kubernetes)',
        'Azure': 'AKS (Kubernetes)',
        'GCP': 'GKE (Kubernetes)',
        'Open Source Alternative': 'Kubernetes (self-managed)'
    },
    {
        'Category': 'Object Storage',
        'AWS': 'S3',
        'Azure': 'Blob Storage',
        'GCP': 'Cloud Storage',
        'Open Source Alternative': 'MinIO'
    },
    {
        'Category': 'Managed Notebooks',
        'AWS': 'SageMaker Notebooks',
        'Azure': 'Azure ML Notebooks',
        'GCP': 'Vertex AI Workbench',
        'Open Source Alternative': 'JupyterHub'
    },
    {
        'Category': 'AutoML',
        'AWS': 'SageMaker Autopilot',
        'Azure': 'Azure AutoML',
        'GCP': 'Vertex AI AutoML',
        'Open Source Alternative': 'H2O AutoML, Auto-sklearn'
    },
    {
        'Category': 'Model Registry',
        'AWS': 'SageMaker Model Registry',
        'Azure': 'Azure ML Model Registry',
        'GCP': 'Vertex AI Model Registry',
        'Open Source Alternative': 'MLflow Model Registry'
    },
    {
        'Category': 'Experiment Tracking',
        'AWS': 'SageMaker Experiments',
        'Azure': 'Azure ML Experiments',
        'GCP': 'Vertex AI Experiments',
        'Open Source Alternative': 'MLflow, Weights & Biases'
    },
    {
        'Category': 'Feature Store',
        'AWS': 'SageMaker Feature Store',
        'Azure': 'Azure ML Feature Store',
        'GCP': 'Vertex AI Feature Store',
        'Open Source Alternative': 'Feast'
    },
    {
        'Category': 'Model Monitoring',
        'AWS': 'SageMaker Model Monitor',
        'Azure': 'Azure ML Model Monitoring',
        'GCP': 'Vertex AI Model Monitoring',
        'Open Source Alternative': 'Evidently AI, Seldon Core'
    },
    {
        'Category': 'Workflow Orchestration',
        'AWS': 'SageMaker Pipelines, Step Functions',
        'Azure': 'Azure ML Pipelines',
        'GCP': 'Vertex AI Pipelines',
        'Open Source Alternative': 'Airflow, Kubeflow Pipelines'
    }
])

print("ML Service Equivalency Mapping\n")
print(ml_service_mapping.to_string(index=False))
print("\n💡 Key Insights:")
print("   - Kubernetes is the common denominator for container orchestration")
print("   - Open source alternatives exist for most managed services")
print("   - MLflow provides cloud-agnostic experiment tracking")
print("   - All major clouds support similar ML workflows with different APIs")
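
A mapping like this becomes more useful once it can be queried. The sketch below copies three rows from the table into a plain dictionary and looks up the equivalent service on another cloud. `SERVICE_MAP` and `translate_service` are hypothetical names introduced here for illustration.

```python
# A small subset of the mapping table, keyed by category
SERVICE_MAP = {
    "Managed ML Platform": {"AWS": "SageMaker", "Azure": "Azure Machine Learning", "GCP": "Vertex AI"},
    "Serverless Functions": {"AWS": "Lambda", "Azure": "Azure Functions", "GCP": "Cloud Functions"},
    "Object Storage": {"AWS": "S3", "Azure": "Blob Storage", "GCP": "Cloud Storage"},
}


def translate_service(service: str, source: str, target: str) -> str:
    """Return the target-cloud equivalent of a source-cloud service name."""
    for providers in SERVICE_MAP.values():
        # Case-insensitive match on the service name within the source cloud
        if providers.get(source, "").lower() == service.lower():
            return providers[target]
    raise KeyError(f"No mapping for {service!r} on {source}")


print(translate_service("S3", "AWS", "GCP"))
print(translate_service("Vertex AI", "GCP", "Azure"))
```

The lookup only says which service plays the same role; it says nothing about feature parity, which still has to be checked per workload.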

### 1.2: Compute Instance Equivalency

In [None]:
# Equivalent compute instances across clouds
compute_equivalents = pd.DataFrame([
    {
        'Use Case': 'General Purpose (Small)',
        'AWS': 't3.medium (2 vCPU, 4GB)',
        'Azure': 'B2s (2 vCPU, 4GB)',
        'GCP': 'e2-medium (2 vCPU, 4GB)',
        'Approx Cost ($/hr)': '0.04-0.05'
    },
    {
        'Use Case': 'General Purpose (Medium)',
        'AWS': 'm5.xlarge (4 vCPU, 16GB)',
        'Azure': 'D4s v4 (4 vCPU, 16GB)',
        'GCP': 'n2-standard-4 (4 vCPU, 16GB)',
        'Approx Cost ($/hr)': '0.19-0.23'
    },
    {
        'Use Case': 'Compute Optimized',
        'AWS': 'c5.2xlarge (8 vCPU, 16GB)',
        'Azure': 'F8s v2 (8 vCPU, 16GB)',
        'GCP': 'c2-standard-8 (8 vCPU, 32GB)',
        'Approx Cost ($/hr)': '0.34-0.40'
    },
    {
        'Use Case': 'Memory Optimized',
        'AWS': 'r5.xlarge (4 vCPU, 32GB)',
        'Azure': 'E4s v4 (4 vCPU, 32GB)',
        'GCP': 'n2-highmem-4 (4 vCPU, 32GB)',
        'Approx Cost ($/hr)': '0.25-0.30'
    },
    {
        'Use Case': 'GPU (Entry Level)',
        'AWS': 'g4dn.xlarge (4 vCPU, 16GB, T4)',
        'Azure': 'NC4as T4 v3 (4 vCPU, 28GB, T4)',
        'GCP': 'n1-standard-4 + T4 (4 vCPU, 15GB, T4)',
        'Approx Cost ($/hr)': '0.52-0.73'
    },
    {
        'Use Case': 'GPU (High Performance)',
        'AWS': 'p3.2xlarge (8 vCPU, 61GB, V100)',
        'Azure': 'NC6s v3 (6 vCPU, 112GB, V100)',
        'GCP': 'n1-standard-8 + V100 (8 vCPU, 30GB, V100)',
        'Approx Cost ($/hr)': '3.06-3.80'
    }
])

print("Compute Instance Equivalency Guide\n")
print(compute_equivalents.to_string(index=False))
print("\n⚠️ Important Notes:")
print("   - Exact equivalents rarely exist; choose closest match")
print("   - Pricing varies by region (us-east-1 / East US / us-central1 shown)")
print("   - Performance can differ even with same specs")
print("   - Test your specific workload on each platform")
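
The hourly ranges in the table are easier to reason about as a monthly bill. The following sketch parses a range string in the table's format and scales it by usage hours. The helper names are hypothetical, and 730 hours per month is a common billing approximation rather than a provider-specific figure.

```python
HOURS_PER_MONTH = 730  # approximation: 365 days * 24 hours / 12 months


def parse_hourly_range(cost_range: str) -> tuple[float, float]:
    """Split a string such as '0.52-0.73' into (low, high) floats."""
    low, high = (float(part) for part in cost_range.split("-"))
    return low, high


def monthly_cost_range(cost_range: str, hours: float = HOURS_PER_MONTH) -> tuple[float, float]:
    """Scale an hourly price range to a total for the given number of hours."""
    low, high = parse_hourly_range(cost_range)
    return round(low * hours, 2), round(high * hours, 2)


# Entry-level GPU row from the table, used for 100 training hours in a month
print(monthly_cost_range("0.52-0.73", hours=100))
# Small general-purpose row, running continuously
print(monthly_cost_range("0.04-0.05"))
```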

## Part 2: MLflow for Multi-Cloud Tracking

MLflow is an open-source platform for managing the ML lifecycle. It's cloud-agnostic and can track experiments across any platform.

### 2.1: Setting Up MLflow Tracking

In [None]:
# Set up local MLflow tracking (can be hosted on any cloud)
mlflow_dir = Path('mlruns')
mlflow_dir.mkdir(exist_ok=True)

# Set tracking URI (local for demo, can be a remote MLflow server)
# as_uri() yields a valid file:// URI on every OS, including Windows
mlflow.set_tracking_uri(mlflow_dir.absolute().as_uri())

# Create experiment
experiment_name = 'multi-cloud-iris-classification'
mlflow.set_experiment(experiment_name)

print("✅ MLflow tracking initialized")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment: {experiment_name}")
print("\n💡 In production, host MLflow on:")
print("   - AWS: EC2 with S3 backend")
print("   - Azure: VM with Blob Storage backend")
print("   - GCP: Compute Engine with Cloud Storage backend")
print("   - Kubernetes: MLflow server on any cloud")
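
Switching between a local store and a hosted server is usually a configuration concern rather than a code change. The sketch below resolves the tracking URI from the `MLFLOW_TRACKING_URI` environment variable (which MLflow itself also reads) and falls back to a local directory. `resolve_tracking_uri` is a hypothetical helper, not part of MLflow.

```python
import os
from pathlib import Path


def resolve_tracking_uri(local_dir: str = "mlruns") -> str:
    """Prefer a remote server from the environment, else use a local file store."""
    remote = os.environ.get("MLFLOW_TRACKING_URI", "").strip()
    if remote:
        return remote
    # Path.as_uri() needs an absolute path, so resolve() first
    return Path(local_dir).resolve().as_uri()


uri = resolve_tracking_uri()
print(f"Tracking URI: {uri}")
# In the notebook you would then call: mlflow.set_tracking_uri(uri)
```

Making the fallback explicit keeps the same notebook runnable on a laptop and on any cloud VM where the variable points at a shared server.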

### 2.2: Training and Logging Models with MLflow

In [None]:
# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Simulate training on different clouds
cloud_platforms = ['AWS', 'Azure', 'GCP']
model_configs = [
    {'n_estimators': 50, 'max_depth': 5},
    {'n_estimators': 100, 'max_depth': 10},
    {'n_estimators': 150, 'max_depth': 15},
]

results = []

for platform, config in zip(cloud_platforms, model_configs):
    # Start MLflow run
    with mlflow.start_run(run_name=f'{platform}_training'):
        # Log platform information
        mlflow.set_tag('cloud_platform', platform)
        mlflow.set_tag('region', f'{platform.lower()}-region-1')
        
        # Log parameters
        mlflow.log_params(config)
        
        # Train model
        model = RandomForestClassifier(**config, random_state=42)
        model.fit(X_train, y_train)
        
        # Evaluate
        train_acc = model.score(X_train, y_train)
        test_acc = model.score(X_test, y_test)
        
        # Log metrics
        mlflow.log_metric('train_accuracy', train_acc)
        mlflow.log_metric('test_accuracy', test_acc)
        
        # Log model (cloud-agnostic format)
        mlflow.sklearn.log_model(
            model,
            'model',
            registered_model_name=f'iris_classifier_{platform.lower()}'
        )
        
        results.append({
            'Platform': platform,
            'Estimators': config['n_estimators'],
            'Max Depth': config['max_depth'],
            'Train Accuracy': train_acc,
            'Test Accuracy': test_acc
        })
        
        print(f"✅ Logged {platform} training run")

# Display results
results_df = pd.DataFrame(results)
print("\nMulti-Cloud Training Results:\n")
print(results_df.to_string(index=False))
print("\n💡 All experiments tracked in unified MLflow interface")
print("   Run 'mlflow ui' to view dashboard")
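
Once runs from several platforms sit in one place, picking a candidate for promotion can be automated. The sketch below uses a list shaped like the `results` list built in this section; the accuracy values are made up for illustration, and the tie-breaking rule (prefer fewer trees) is one possible policy, not a recommendation from MLflow.

```python
# Hypothetical results in the same shape as the `results` list in this section
example_results = [
    {"Platform": "AWS", "Estimators": 50, "Test Accuracy": 0.97},
    {"Platform": "Azure", "Estimators": 100, "Test Accuracy": 1.00},
    {"Platform": "GCP", "Estimators": 150, "Test Accuracy": 1.00},
]


def select_best_run(runs: list[dict]) -> dict:
    """Highest test accuracy wins; ties go to the smaller (cheaper to serve) model."""
    return max(runs, key=lambda r: (r["Test Accuracy"], -r["Estimators"]))


best = select_best_run(example_results)
print(f"Promote: {best['Platform']} ({best['Estimators']} trees)")
```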

### 2.3: Loading Models from MLflow Registry

In [None]:
# Load model from MLflow (works regardless of training platform)
def load_model_from_registry(model_name: str, version: str = 'latest'):
    """
    Load model from MLflow registry - cloud-agnostic
    """
    try:
        if version == 'latest':
            model_uri = f"models:/{model_name}/latest"
        else:
            model_uri = f"models:/{model_name}/{version}"
        
        model = mlflow.sklearn.load_model(model_uri)
        print(f"✅ Loaded model: {model_name} (version: {version})")
        return model
    except Exception as e:
        print(f"❌ Error loading model: {e}")
        print("   Note: the model must exist in the registry of the active tracking URI")
        print("   Register it with registered_model_name= in log_model() or mlflow.register_model()")
        return None
        return None

# Example: Load AWS model and use for inference on GCP
print("Example: Cross-cloud model deployment")
print("Training on AWS → Deploy on GCP (same model artifact)\n")

# Load the model registered by the AWS run in section 2.2
registry_model = load_model_from_registry('iris_classifier_aws', 'latest')

# Fall back to the last model trained in section 2.2 if the registry lookup fails
inference_model = registry_model if registry_model is not None else model

sample_input = X_test[:5]
predictions = inference_model.predict(sample_input)
print(f"Predictions for 5 test samples: {predictions}")
print("\n💡 Benefits of MLflow Model Registry:")
print("   - Single source of truth for models")
print("   - Version control and lineage tracking")
print("   - Deploy same model to AWS, Azure, or GCP")
print("   - Stage transitions: None → Staging → Production → Archived")
print("     (recent MLflow releases deprecate stages in favor of model version aliases)")
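
Registry lookups all go through a `models:/` URI, so it helps to build that string in one place. The sketch below covers the three forms MLflow accepts: a pinned version, the `latest` shortcut, and an alias (`models:/<name>@<alias>`). `build_model_uri` is a hypothetical helper, and `champion` is only an example alias name.

```python
from typing import Optional, Union


def build_model_uri(
    name: str,
    version: Optional[Union[int, str]] = None,
    alias: Optional[str] = None,
) -> str:
    """Build a `models:/` URI for the MLflow Model Registry."""
    if not name or "/" in name:
        raise ValueError("model name must be non-empty and must not contain '/'")
    if version is not None and alias is not None:
        raise ValueError("pass either version or alias, not both")
    if alias is not None:
        return f"models:/{name}@{alias}"
    if version is None:
        return f"models:/{name}/latest"
    return f"models:/{name}/{version}"


print(build_model_uri("iris_classifier_aws"))
print(build_model_uri("iris_classifier_aws", version=2))
print(build_model_uri("iris_classifier_aws", alias="champion"))  # hypothetical alias name
```

Pinning a version or alias in deployment code is generally safer than `latest`, because a new registration then cannot silently change what production serves.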

## Part 3: Model Portability with ONNX

ONNX (Open Neural Network Exchange) is an open format for ML models that enables interoperability across frameworks and platforms.

### 3.1: Converting Models to ONNX

In [None]:
# ONNX conversion example (requires skl2onnx)
# pip install skl2onnx onnxruntime

try:
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType
    import onnxruntime as rt
    
    # Train a simple model
    model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    
    # Define input type for ONNX conversion
    initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
    
    # Convert to ONNX
    onnx_model = convert_sklearn(model, initial_types=initial_type)
    
    # Save ONNX model
    onnx_path = 'iris_model.onnx'
    with open(onnx_path, 'wb') as f:
        f.write(onnx_model.SerializeToString())
    
    print("✅ Model converted to ONNX format")
    print(f"ONNX model saved: {onnx_path}")
    print(f"File size: {os.path.getsize(onnx_path) / 1024:.2f} KB")
    
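    # Load and run ONNX model (an explicit provider list also works on GPU builds of onnxruntime)
    sess = rt.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])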
    input_name = sess.get_inputs()[0].name
    label_name = sess.get_outputs()[0].name
    
    # Make predictions
    test_sample = X_test[:5].astype(np.float32)
    pred_onx = sess.run([label_name], {input_name: test_sample})[0]
    
    print("\nONNX Inference Results:")
    print(f"Predictions: {pred_onx}")
    print("\n💡 ONNX Benefits:")
    print("   - Framework agnostic (scikit-learn, PyTorch, TensorFlow)")
    print("   - Platform agnostic (Windows, Linux, macOS, mobile)")
    print("   - Cloud agnostic (AWS, Azure, GCP)")
    print("   - Optimized for inference performance")
    print("   - Smaller model size (sometimes)")
    
except ImportError:
    print("⚠️ skl2onnx not installed")
    print("Install with: pip install skl2onnx onnxruntime")
    print("\nONNX enables model portability across:")
    print("   - Different ML frameworks (scikit-learn, PyTorch, TensorFlow → ONNX)")
    print("   - Different cloud platforms (AWS → Azure)")
    print("   - Different devices (server → mobile → edge)")
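
A converted model should be checked against the original before it is deployed anywhere. The sketch below compares predicted labels exactly and probabilities within a tolerance, since ONNX Runtime computes in float32 while scikit-learn uses float64. `predictions_match` is a hypothetical helper and the sample values are invented for the demonstration.

```python
import math


def predictions_match(native_labels, onnx_labels, native_probs=None, onnx_probs=None, tol=1e-5):
    """Return True if labels are identical and probabilities agree within tol."""
    if list(native_labels) != list(onnx_labels):
        return False
    if native_probs is None or onnx_probs is None:
        return True  # label-only comparison
    return all(
        math.isclose(a, b, abs_tol=tol)
        for row_a, row_b in zip(native_probs, onnx_probs)
        for a, b in zip(row_a, row_b)
    )


# Invented values standing in for model.predict(...) and the ONNX session output
print(predictions_match([0, 1, 2], [0, 1, 2]))
print(predictions_match([0, 1], [0, 1], [[0.9, 0.1], [0.2, 0.8]], [[0.9000001, 0.0999999], [0.2, 0.8]]))
print(predictions_match([0, 1, 2], [0, 2, 2]))
```

In the notebook, the first two arguments would be `model.predict(X_test)` and the label output of the ONNX session for the same rows.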

### 3.2: ONNX Deployment Scenarios

In [None]:
# ONNX deployment options across clouds
onnx_deployment = pd.DataFrame([
    {
        'Platform': 'AWS',
        'Service': 'SageMaker',
        'ONNX Support': 'Yes (ONNX Runtime)',
        'Deployment Method': 'Custom container with ONNX Runtime'
    },
    {
        'Platform': 'Azure',
        'Service': 'Azure ML',
        'ONNX Support': 'Native',
        'Deployment Method': 'Direct ONNX model deployment'
    },
    {
        'Platform': 'GCP',
        'Service': 'Vertex AI',
        'ONNX Support': 'Yes (via container)',
        'Deployment Method': 'Custom container with ONNX Runtime'
    },
    {
        'Platform': 'Edge',
        'Service': 'ONNX Runtime',
        'ONNX Support': 'Native',
        'Deployment Method': 'Direct ONNX inference on device'
    },
    {
        'Platform': 'Mobile',
        'Service': 'ONNX Runtime Mobile',
        'ONNX Support': 'Native',
        'Deployment Method': 'iOS/Android app integration'
    }
])

print("ONNX Deployment Options Across Platforms\n")
print(onnx_deployment.to_string(index=False))
print("\n🎯 Use ONNX when:")
print("   - Need to deploy same model to multiple platforms")
print("   - Want to avoid platform lock-in")
print("   - Deploying to edge/mobile devices")
print("   - Optimizing inference performance")
print("   - Switching between ML frameworks")

## Part 4: Kubernetes for Cloud-Agnostic Deployment

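Kubernetes (K8s) is the de facto standard for container orchestration. Its core API is consistent across all major clouds, although storage classes, load balancers, and identity integration remain provider-specific.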

### 4.1: Kubernetes ML Deployment

In [None]:
# Kubernetes deployment manifest for ML model
k8s_deployment = '''
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-classifier
  labels:
    app: iris-classifier
spec:
  replicas: 3  # High availability
  selector:
    matchLabels:
      app: iris-classifier
  template:
    metadata:
      labels:
        app: iris-classifier
    spec:
      containers:
      - name: model-server
        image: your-registry/iris-classifier:v1.0
        ports:
        - containerPort: 8080
        env:
        - name: MODEL_PATH
          value: "/models/model.onnx"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: iris-classifier-service
spec:
  type: LoadBalancer  # Or ClusterIP, NodePort
  selector:
    app: iris-classifier
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-classifier-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iris-classifier
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
'''

# Save Kubernetes manifest
with open('k8s_ml_deployment.yaml', 'w') as f:
    f.write(k8s_deployment)

print("✅ Kubernetes deployment manifest created")
print("\nThis exact manifest works on:")
print("   - AWS: EKS (Elastic Kubernetes Service)")
print("   - Azure: AKS (Azure Kubernetes Service)")
print("   - GCP: GKE (Google Kubernetes Engine)")
print("   - On-premises: Self-managed Kubernetes")
print("\nDeploy with: kubectl apply -f k8s_ml_deployment.yaml")
print("\n🎯 Kubernetes Benefits for Multi-Cloud:")
print("   - Write once, deploy anywhere")
print("   - Consistent API across clouds")
print("   - Avoid vendor lock-in")
print("   - Rich ecosystem (Istio, Prometheus, etc.)")
print("   - Easy migration between clouds")
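
The manifest's liveness and readiness probes assume the container answers `GET /health` and `GET /ready`. The sketch below shows one minimal way a model server could expose those paths using only the standard library. The handler, the `MODEL_READY` event, and the `probe` helper are hypothetical; a real server would listen on port 8080 and set the event after loading the model, whereas this demo binds to a free local port so it can run anywhere.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_READY = threading.Event()  # set once the model has finished loading


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers the /health and /ready paths used by the manifest's probes."""

    def do_GET(self):
        if self.path == "/health":
            self._reply(200, {"status": "alive"})
        elif self.path == "/ready":
            ready = MODEL_READY.is_set()
            self._reply(200 if ready else 503, {"ready": ready})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging from frequent probes


def probe(port, path):
    """Issue a GET request against the local server and return the status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


# Demo only: port 0 asks the OS for any free port
server = HTTPServer(("127.0.0.1", 0), ProbeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

health_code = probe(port, "/health")
not_ready_code = probe(port, "/ready")  # model not loaded yet
MODEL_READY.set()                       # stands in for "model loaded"
ready_code = probe(port, "/ready")

server.shutdown()
server.server_close()
print(health_code, not_ready_code, ready_code)
```

Keeping the two probes separate matters: a failing liveness probe restarts the container, while a failing readiness probe only removes the pod from the Service until the model is loaded.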

### 4.2: Managed Kubernetes Comparison

In [None]:
# Managed Kubernetes service comparison
k8s_comparison = pd.DataFrame([
    {
        'Feature': 'Service Name',
        'AWS': 'EKS',
        'Azure': 'AKS',
        'GCP': 'GKE'
    },
    {
        'Feature': 'Control Plane Cost',
        'AWS': '$0.10/hour ($73/month)',
        'Azure': 'Free',
        'GCP': '$0.10/hour ($73/month)'
    },
    {
        'Feature': 'Free Tier',
        'AWS': 'None',
        'Azure': 'Yes (free control plane)',
        'GCP': '$300 credit (90 days)'
    },
    {
        'Feature': 'Auto-scaling',
        'AWS': 'Cluster Autoscaler',
        'Azure': 'Cluster Autoscaler',
        'GCP': 'Node Auto-provisioning'
    },
    {
        'Feature': 'GPU Support',
        'AWS': 'Yes',
        'Azure': 'Yes',
        'GCP': 'Yes'
    },
    {
        'Feature': 'Serverless Pods',
        'AWS': 'Fargate',
        'Azure': 'Virtual Nodes',
        'GCP': 'Autopilot'
    },
    {
        'Feature': 'Integration',
        'AWS': 'Deep AWS integration',
        'Azure': 'Deep Azure integration',
        'GCP': 'Deep GCP integration'
    },
    {
        'Feature': 'Best For',
        'AWS': 'AWS-heavy workloads',
        'Azure': 'Cost-conscious, Azure users',
        'GCP': 'GCP ecosystem, Autopilot'
    }
])

print("Managed Kubernetes Service Comparison\n")
print(k8s_comparison.to_string(index=False))
print("\n💡 Cost Consideration:")
print("   - Azure AKS: FREE control plane (best for cost)")
print("   - AWS EKS: $73/month per cluster")
print("   - GCP GKE: $73/month per cluster (Autopilot mode varies)")
print("   - All: Pay for worker nodes (same pricing as VMs)")
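
The $73/month figure follows directly from the hourly rate, and it multiplies quickly when every environment gets its own cluster. The sketch below reproduces that arithmetic using the rates from the comparison table; the three-cluster, twelve-month scenario is an assumption chosen for illustration.

```python
HOURS_PER_MONTH = 730  # approximation: 365 days * 24 hours / 12 months

# Hourly control-plane rates taken from the comparison table in this section
CONTROL_PLANE_RATE = {"EKS": 0.10, "AKS": 0.00, "GKE": 0.10}


def control_plane_cost(service: str, clusters: int = 1, months: int = 1) -> float:
    """Control-plane fee only; worker nodes are billed separately."""
    return round(CONTROL_PLANE_RATE[service] * HOURS_PER_MONTH * clusters * months, 2)


# Assumed scenario: three clusters (dev, staging, prod) kept running for one year
for service in CONTROL_PLANE_RATE:
    yearly = control_plane_cost(service, clusters=3, months=12)
    print(f"{service}: ${yearly:,.2f}/year")
```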

## Part 5: Terraform for Multi-Cloud Infrastructure

Terraform enables Infrastructure as Code (IaC) across multiple cloud providers.

### 5.1: Multi-Cloud Terraform Configuration

In [None]:
# Terraform configuration for multi-cloud deployment
terraform_multicloud = '''
# main.tf - Multi-cloud ML infrastructure

terraform {
  required_version = ">= 1.0"
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}

# Variables for multi-cloud deployment
variable "deploy_to_aws" {
  type    = bool
  default = true
}

variable "deploy_to_azure" {
  type    = bool
  default = true
}

variable "deploy_to_gcp" {
  type    = bool
  default = false
}

# AWS Provider
provider "aws" {
  region = "us-east-1"
}

# Azure Provider
provider "azurerm" {
  features {}
}

# GCP Provider
provider "google" {
  project = "your-gcp-project"
  region  = "us-central1"
}

# AWS S3 Bucket for ML Data
resource "aws_s3_bucket" "ml_data" {
  count  = var.deploy_to_aws ? 1 : 0
  bucket = "ml-data-${random_id.suffix.hex}"
  
  tags = {
    Environment = "multi-cloud"
    Purpose     = "ml-data-storage"
  }
}

# Azure Storage Account
resource "azurerm_storage_account" "ml_data" {
  count                    = var.deploy_to_azure ? 1 : 0
  name                     = "mldata${random_id.suffix.hex}"
  resource_group_name      = azurerm_resource_group.ml_rg[0].name
  location                 = azurerm_resource_group.ml_rg[0].location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  
  tags = {
    Environment = "multi-cloud"
    Purpose     = "ml-data-storage"
  }
}

# Azure Resource Group
resource "azurerm_resource_group" "ml_rg" {
  count    = var.deploy_to_azure ? 1 : 0
  name     = "ml-resources-rg"
  location = "East US"
}

# GCP Storage Bucket
resource "google_storage_bucket" "ml_data" {
  count    = var.deploy_to_gcp ? 1 : 0
  name     = "ml-data-${random_id.suffix.hex}"
  location = "US"
  
  labels = {
    environment = "multi-cloud"
    purpose     = "ml-data-storage"
  }
}

# Random suffix for unique naming
resource "random_id" "suffix" {
  byte_length = 4
}

# Outputs
output "aws_bucket" {
  value = var.deploy_to_aws ? aws_s3_bucket.ml_data[0].id : "Not deployed"
}

output "azure_storage" {
  value = var.deploy_to_azure ? azurerm_storage_account.ml_data[0].name : "Not deployed"
}

output "gcp_bucket" {
  value = var.deploy_to_gcp ? google_storage_bucket.ml_data[0].name : "Not deployed"
}
'''

# Save Terraform configuration
with open('multi_cloud_terraform.tf', 'w') as f:
    f.write(terraform_multicloud)

print("✅ Multi-cloud Terraform configuration created")
print("\nUsage:")
print("  terraform init")
print("  terraform plan")
print("  terraform apply -var='deploy_to_aws=true' -var='deploy_to_azure=true'")
print("\n💡 Terraform Benefits:")
print("   - Single language for all clouds (HCL)")
print("   - Version control your infrastructure")
print("   - Preview changes before applying")
print("   - Easy to switch clouds or go multi-cloud")
print("   - Reusable modules across projects")
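
Passing several `-var` flags by hand gets error-prone as the number of toggles grows. One alternative, sketched below, is to generate a `terraform.tfvars.json` file, which Terraform loads automatically from the working directory. `write_tfvars` is a hypothetical helper, and the demo writes to a temporary directory so that it leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def write_tfvars(targets: set[str], directory: str = ".") -> Path:
    """Write the deploy_to_* flags for the chosen clouds to terraform.tfvars.json."""
    known = {"aws", "azure", "gcp"}
    unknown = targets - known
    if unknown:
        raise ValueError(f"unknown clouds: {sorted(unknown)}")
    # One boolean per variable declared in the Terraform configuration
    flags = {f"deploy_to_{cloud}": cloud in targets for cloud in sorted(known)}
    path = Path(directory) / "terraform.tfvars.json"
    path.write_text(json.dumps(flags, indent=2) + "\n")
    return path


# Demo: enable AWS and Azure, leave GCP off
with tempfile.TemporaryDirectory() as tmp:
    tfvars_path = write_tfvars({"aws", "azure"}, directory=tmp)
    written = json.loads(tfvars_path.read_text())

print(written)
```

With the file placed next to the `.tf` file, a plain `terraform plan` picks up the flags without any command-line variables.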

## Part 6: Data Portability and Vendor Lock-in Mitigation

### 6.1: Portable Data Formats

In [None]:
# Portable data format recommendations
data_formats = pd.DataFrame([
    {
        'Data Type': 'Tabular Data',
        'Recommended Format': 'Parquet, CSV',
        'Avoid': 'Proprietary binary formats',
        'Why': 'Open standard, efficient, cloud-agnostic'
    },
    {
        'Data Type': 'Model Artifacts',
        'Recommended Format': 'ONNX, SavedModel',
        'Avoid': 'Platform-specific formats, raw pickle files',
        'Why': 'Framework/platform independent'
    },
    {
        'Data Type': 'Images',
        'Recommended Format': 'JPEG, PNG, WebP',
        'Avoid': 'Rare proprietary formats',
        'Why': 'Universal support'
    },
    {
        'Data Type': 'Text/Documents',
        'Recommended Format': 'JSON, XML, TXT',
        'Avoid': 'Proprietary document formats',
        'Why': 'Human readable, parseable anywhere'
    },
    {
        'Data Type': 'Large Datasets',
        'Recommended Format': 'Parquet, ORC, Avro',
        'Avoid': 'Non-splittable formats',
        'Why': 'Columnar, compressed, distributed processing'
    },
    {
        'Data Type': 'Time Series',
        'Recommended Format': 'Parquet with timestamp index',
        'Avoid': 'Custom binary formats',
        'Why': 'Efficient querying, compression'
    },
    {
        'Data Type': 'Experiment Metadata',
        'Recommended Format': 'MLflow format, JSON',
        'Avoid': 'Platform-specific tracking',
        'Why': 'Portable across clouds'
    }
])

print("Portable Data Format Guidelines\n")
print(data_formats.to_string(index=False))
print("\n🎯 Data Portability Best Practices:")
print("   - Use open formats (Parquet > proprietary)")
print("   - Avoid cloud-specific APIs in data pipelines")
print("   - Abstract storage layer (use fsspec, S3-compatible APIs)")
print("   - Document data schemas and versioning")
print("   - Test data migration between clouds regularly")
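
Abstracting the storage layer starts with treating every dataset location as a URI instead of a provider-specific client call. The sketch below parses such URIs with the standard library and reports which backend they point to. The scheme names follow the conventions commonly used by fsspec-style libraries (`s3`, `gs`, `abfs`, `az`); the bucket and path names are invented, and `describe_location` is a hypothetical helper.

```python
from urllib.parse import urlparse

# URI scheme -> storage backend (schemes as commonly used by fsspec-style libraries)
SCHEME_TO_PROVIDER = {
    "s3": "AWS S3",
    "gs": "GCP Cloud Storage",
    "abfs": "Azure Blob Storage",
    "abfss": "Azure Blob Storage",
    "az": "Azure Blob Storage",
    "file": "Local filesystem",
    "": "Local filesystem",  # plain relative or absolute POSIX paths
}


def describe_location(uri: str) -> dict:
    """Split a dataset URI into provider, container (bucket), and object key."""
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    if scheme not in SCHEME_TO_PROVIDER:
        raise ValueError(f"unsupported storage scheme: {scheme!r}")
    return {
        "provider": SCHEME_TO_PROVIDER[scheme],
        "container": parsed.netloc,
        "key": parsed.path.lstrip("/"),
    }


# Invented bucket names; only the scheme changes when the data moves clouds
for location in ("s3://ml-data/iris/train.parquet", "gs://ml-data/iris/train.parquet"):
    print(describe_location(location))
```

If pipeline code only ever sees the URI, moving a dataset between clouds becomes a configuration change rather than a rewrite.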

### 6.2: Vendor Lock-in Mitigation Checklist

In [None]:
# Vendor lock-in mitigation strategies
lock_in_mitigation = [
    {
        'Category': 'Compute',
        'Lock-in Risk': 'Platform-specific APIs',
        'Mitigation': 'Use Kubernetes, Docker, open frameworks',
        'Effort': 'Medium',
        'Impact': 'High'
    },
    {
        'Category': 'Storage',
        'Lock-in Risk': 'Proprietary storage services',
        'Mitigation': 'S3-compatible APIs, abstract with libraries',
        'Effort': 'Low',
        'Impact': 'High'
    },
    {
        'Category': 'ML Platform',
        'Lock-in Risk': 'SageMaker/Azure ML specific code',
        'Mitigation': 'Use MLflow, Kubeflow, open frameworks',
        'Effort': 'High',
        'Impact': 'High'
    },
    {
        'Category': 'Model Format',
        'Lock-in Risk': 'Framework-specific serialization',
        'Mitigation': 'Export to ONNX, SavedModel formats',
        'Effort': 'Low',
        'Impact': 'Medium'
    },
    {
        'Category': 'Data Pipeline',
        'Lock-in Risk': 'Cloud-specific orchestration',
        'Mitigation': 'Use Airflow, Prefect, Dagster',
        'Effort': 'Medium',
        'Impact': 'High'
    },
    {
        'Category': 'Monitoring',
        'Lock-in Risk': 'CloudWatch/Azure Monitor only',
        'Mitigation': 'Prometheus, Grafana, ELK stack',
        'Effort': 'Medium',
        'Impact': 'Medium'
    },
    {
        'Category': 'Database',
        'Lock-in Risk': 'DynamoDB/CosmosDB specific',
        'Mitigation': 'PostgreSQL, MongoDB (cloud-agnostic)',
        'Effort': 'Low',
        'Impact': 'High'
    },
    {
        'Category': 'Infrastructure',
        'Lock-in Risk': 'CloudFormation/ARM templates',
        'Mitigation': 'Use Terraform, Pulumi',
        'Effort': 'Low',
        'Impact': 'High'
    }
]

lock_in_df = pd.DataFrame(lock_in_mitigation)

print("Vendor Lock-in Mitigation Strategies\n")
print(lock_in_df.to_string(index=False))
print("\n📊 Priority Recommendations:")
print("   1. High Impact + Low Effort: Storage abstraction, Terraform")
print("   2. High Impact + Medium Effort: Kubernetes, Airflow")
print("   3. High Impact + High Effort: Open ML platforms (MLflow, Kubeflow)")
print("\n💡 Start Small:")
print("   - Begin with storage and infrastructure (Terraform)")
print("   - Add container orchestration (Kubernetes)")
print("   - Finally migrate to open ML platforms")
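
The priority ordering can be derived from the Effort and Impact columns instead of being written by hand. The sketch below ranks a subset of the rows by impact first and effort second. The numeric weights assigned to Low, Medium, and High are an assumption made for sorting, not part of the original table.

```python
# Assumed ordinal weights for the qualitative levels in the table
LEVEL = {"Low": 1, "Medium": 2, "High": 3}

# Subset of the mitigation table, with only the columns needed for ranking
mitigations = [
    {"Category": "ML Platform", "Effort": "High", "Impact": "High"},
    {"Category": "Storage", "Effort": "Low", "Impact": "High"},
    {"Category": "Monitoring", "Effort": "Medium", "Impact": "Medium"},
    {"Category": "Compute", "Effort": "Medium", "Impact": "High"},
    {"Category": "Model Format", "Effort": "Low", "Impact": "Medium"},
]


def prioritize(items: list[dict]) -> list[dict]:
    """Sort by impact (highest first), then by effort (lowest first)."""
    return sorted(items, key=lambda m: (-LEVEL[m["Impact"]], LEVEL[m["Effort"]]))


ranked = prioritize(mitigations)
for rank, item in enumerate(ranked, start=1):
    print(f"{rank}. {item['Category']} (impact: {item['Impact']}, effort: {item['Effort']})")
```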

## Summary

In this notebook, you learned how to plan and implement multi-cloud ML strategies:

### Key Takeaways:

1. **When to Go Multi-Cloud**
   - Avoid vendor lock-in and negotiate better pricing
   - Use best-of-breed services from each provider
   - Meet compliance and data residency requirements
   - Achieve true disaster recovery
   - BUT: Adds complexity and cost

2. **Service Equivalency**
   - All major clouds offer similar ML services
   - APIs differ, but concepts are the same
   - Open source alternatives exist for most services
   - Kubernetes is the common orchestration layer

3. **MLflow for Multi-Cloud**
   - Cloud-agnostic experiment tracking
   - Unified model registry across platforms
   - Train on AWS, deploy on Azure seamlessly
   - Open source and extensible

4. **ONNX for Model Portability**
   - Framework-agnostic model format
   - Deploy same model to any platform
   - Optimized for inference performance
   - Supports edge and mobile deployment

5. **Kubernetes Deployment**
   - Write once, deploy anywhere
   - Consistent API across EKS, AKS, GKE
   - Rich ecosystem for ML (KServe, Seldon)
   - Azure AKS: Free control plane

6. **Terraform for IaC**
   - Single language for all clouds (HCL)
   - Version control infrastructure
   - Easy migration between clouds
   - Reusable modules

7. **Avoiding Vendor Lock-in**
   - Use open formats: Parquet, ONNX, JSON
   - Abstract storage with S3-compatible APIs
   - Open source tools: MLflow, Airflow, Kubernetes
   - Infrastructure as Code: Terraform

### Multi-Cloud Decision Matrix:

| Scenario | Recommendation | Reason |
|----------|----------------|--------|
| Startup/Learning | **Single Cloud** | Simplicity, cost, faster iteration |
| Enterprise with compliance | **Multi-Cloud** | Data residency, risk mitigation |
| High-availability critical | **Multi-Cloud** | True redundancy |
| Cost-conscious SMB | **Single Cloud** | Better discounts, lower overhead |
| Best-of-breed strategy | **Multi-Cloud** | Use best services from each |

### Recommended Multi-Cloud Stack:

- **Container Orchestration**: Kubernetes (EKS/AKS/GKE)
- **Experiment Tracking**: MLflow
- **Model Format**: ONNX (when possible)
- **Infrastructure**: Terraform
- **Data Format**: Parquet, JSON
- **Workflow**: Airflow or Kubeflow Pipelines
- **Monitoring**: Prometheus + Grafana
- **Storage**: S3-compatible APIs

## Next Steps

- **[Module 10: Cloud Storage for ML](10_cloud_storage_for_ml.ipynb)**: Deep dive into cloud storage
- **[Module 11: Final Project - Deploy Model on Cloud](11_final_project_deploy_model_on_cloud.ipynb)**: Capstone project
- **Practice**: Set up MLflow tracking server
- **Explore**: Kubeflow for end-to-end ML on Kubernetes

## Additional Resources

- [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
- [ONNX Documentation](https://onnx.ai/)
- [Kubernetes ML Operators](https://github.com/kubeflow/kubeflow)
- [Terraform AWS Provider (source and examples)](https://github.com/hashicorp/terraform-provider-aws)
- [KServe (Model Serving on K8s)](https://kserve.github.io/website/)
- [Feast (Feature Store)](https://feast.dev/)

## Exercises

### Exercise 1: Service Mapping Analysis ⭐

Create a comprehensive service mapping for your specific ML project:

1. List all services you need (storage, compute, ML platform, etc.)
2. Find equivalents in AWS, Azure, and GCP
3. Compare pricing for each option
4. Identify open source alternatives
5. Recommend optimal provider for each service

Present findings in a decision matrix with justification.

In [None]:
# Your code here


### Exercise 2: MLflow Multi-Cloud Experiment ⭐⭐

Set up MLflow and log experiments with:

1. Train 3 different models (RandomForest, SVM, Logistic Regression)
2. Tag each with a different "cloud" (simulated AWS, Azure, GCP)
3. Log different hyperparameters for each
4. Compare results in MLflow UI
5. Register the best model to MLflow Model Registry
6. Load and deploy the best model

**Bonus**: Set up remote MLflow tracking server on a cloud VM.

In [None]:
# Your code here


### Exercise 3: ONNX Model Conversion Pipeline ⭐⭐

Create a complete model conversion pipeline:

1. Train models in multiple frameworks:
   - scikit-learn RandomForest
   - PyTorch neural network (if installed)
   - XGBoost model
2. Convert all to ONNX format
3. Compare model sizes before and after
4. Benchmark inference time: native vs ONNX
5. Verify predictions match between formats

Document conversion process and performance differences.

In [None]:
# Your code here


### Exercise 4: Kubernetes ML Deployment Simulation ⭐⭐⭐

Design a complete Kubernetes deployment for ML:

1. **Create Kubernetes manifests** for:
   - Model serving deployment
   - Horizontal Pod Autoscaler
   - Service (LoadBalancer)
   - ConfigMap for configuration
   - Secret for API keys

2. **Simulate deployment** on different clouds:
   - Calculate costs on EKS vs AKS vs GKE
   - Compare setup complexity
   - List cloud-specific integrations

3. **Design migration plan**:
   - Steps to move from AWS to Azure
   - Potential pitfalls
   - Estimated downtime

**Bonus**: If you have minikube, deploy locally and test.

In [None]:
# Your code here


### Exercise 5: Multi-Cloud TCO Analysis ⭐⭐⭐

Conduct a Total Cost of Ownership analysis for:

**Scenario**: ML application with:
- 500GB data storage
- 100 hours/month training (GPU)
- 10M predictions/month
- 3 environments (dev, staging, prod)

**Compare**:
1. **Single Cloud** (AWS SageMaker)
2. **Multi-Cloud** (training on cheapest, inference on Azure)
3. **Hybrid** (Kubernetes on any cloud)

**Calculate**:
- Direct costs (compute, storage, data transfer)
- Hidden costs (engineering time, tooling, training)
- Total 12-month TCO

**Present**:
- Cost breakdown charts
- Break-even analysis
- Risk assessment
- Final recommendation

**Bonus**: Include vendor lock-in risk quantification.

In [None]:
# Your code here
