# Module 02: AWS SageMaker Model Deployment

**Difficulty**: ⭐⭐⭐
**Estimated Time**: 60 minutes
**Prerequisites**: 
- [Module 01: AWS SageMaker Basics](01_aws_sagemaker_basics.ipynb)
- Understanding of ML model deployment concepts
- Basic knowledge of AWS services

## Learning Objectives

By the end of this notebook, you will be able to:
1. Deploy ML models to SageMaker real-time endpoints with auto-scaling
2. Use Batch Transform for large-scale offline inference
3. Configure multi-model endpoints to reduce costs
4. Implement A/B testing and traffic splitting strategies
5. Set up model monitoring and CloudWatch logging
6. Apply cost optimization techniques for SageMaker endpoints

## 1. Setup and Introduction

### SageMaker Deployment Options

AWS SageMaker provides several deployment patterns:

```
┌─────────────────────────────────────────────────────────┐
│           SageMaker Deployment Options                  │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. Real-time Endpoints                                 │
│     ├─ Low latency (ms)                                 │
│     ├─ Auto-scaling                                     │
│     └─ 24/7 availability                                │
│                                                         │
│  2. Batch Transform                                     │
│     ├─ Large datasets                                   │
│     ├─ No real-time requirement                         │
│     └─ Cost-effective                                   │
│                                                         │
│  3. Serverless Inference                                │
│     ├─ Auto-scales to zero                              │
│     ├─ Intermittent traffic                             │
│     └─ Pay per request                                  │
│                                                         │
│  4. Asynchronous Inference                              │
│     ├─ Long processing times                            │
│     ├─ Queue-based                                      │
│     └─ S3 input/output                                  │
└─────────────────────────────────────────────────────────┘
```

**Cost Warning**: Real-time endpoints run continuously and incur hourly charges. Always delete endpoints after testing!

In [None]:
# Setup and imports
import numpy as np
import pandas as pd
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Mock SageMaker SDK classes for demonstration
# In production, use: import boto3, sagemaker

class MockSageMakerClient:
    """Simulates AWS SageMaker client for educational purposes"""
    
    def __init__(self):
        self.endpoints = {}
        self.models = {}
        self.endpoint_configs = {}
    
    def create_model(self, **kwargs):
        model_name = kwargs['ModelName']
        self.models[model_name] = kwargs
        # Real ARNs include a region and account ID; these are placeholders
        return {'ModelArn': f'arn:aws:sagemaker:us-east-1:123456789012:model/{model_name}'}

## 2. Real-time Endpoint Deployment

Real-time endpoints provide low-latency predictions for online applications.

### Deployment Architecture

```
┌──────────────┐     ┌──────────────────────────────────┐
│   Client     │────▶│   Load Balancer                  │
│ Application  │     └──────────────────────────────────┘
└──────────────┘                    │
                                    ▼
                    ┌───────────────────────────────────┐
                    │      SageMaker Endpoint           │
                    ├───────────────────────────────────┤
                    │  ┌─────────┐  ┌─────────┐        │
                    │  │Instance │  │Instance │        │
                    │  │  (min)  │  │ (scaled)│        │
                    │  └─────────┘  └─────────┘        │
                    │                                   │
                    │  Auto-scaling based on:           │
                    │  - InvocationsPerInstance         │
                    │  - CPUUtilization                 │
                    │  - Custom metrics                 │
                    └───────────────────────────────────┘
```

In [None]:
# Example: Real-time endpoint configuration

def create_endpoint_config_dict():
    """Define endpoint configuration with production variants"""
    config = {
        'EndpointConfigName': 'my-model-config-v1',
        'ProductionVariants': [
            {
                'VariantName': 'AllTraffic',
                'ModelName': 'my-trained-model',
                'InstanceType': 'ml.t2.medium',  # Small, low-cost instance for testing
                'InitialInstanceCount': 1,
                'InitialVariantWeight': 1.0
            }
        ]
    }
    return config

endpoint_config = create_endpoint_config_dict()
print("Endpoint Configuration:")
print(json.dumps(endpoint_config, indent=2))
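
An endpoint configuration only takes effect once an endpoint is created from it. The sketch below shows the usual order of calls, assuming `sm_client` behaves like `boto3.client('sagemaker')`. The helper name `deploy_endpoint` and its arguments are illustrative and not part of this notebook. `model_kwargs` would normally carry `ModelName`, `PrimaryContainer`, and `ExecutionRoleArn`.

```python
def deploy_endpoint(sm_client, model_kwargs, endpoint_config, endpoint_name):
    """Create a model, an endpoint config, and an endpoint, then wait for it.

    sm_client is assumed to expose the same methods as boto3.client('sagemaker').
    """
    sm_client.create_model(**model_kwargs)
    sm_client.create_endpoint_config(**endpoint_config)
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config['EndpointConfigName'],
    )

    # Endpoint creation is asynchronous, so block until it reports InService
    waiter = sm_client.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=endpoint_name)

    status = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return status['EndpointStatus']
```

Billing starts once the endpoint is running, so pair any real call to a helper like this with a cleanup step.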

In [None]:
# Example: Auto-scaling configuration

def create_autoscaling_config():
    """Configure auto-scaling for endpoint based on traffic"""
    autoscaling_config = {
        'ResourceId': 'endpoint/my-endpoint/variant/AllTraffic',
        'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
        'MinCapacity': 1,  # Minimum instances
        'MaxCapacity': 5,  # Maximum instances
        'TargetTrackingScalingPolicyConfiguration': {
            'TargetValue': 70.0,  # Target 70 invocations per minute per instance
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
            }
        }
    }
    return autoscaling_config

autoscaling = create_autoscaling_config()
print("Auto-scaling Configuration:")
print(json.dumps(autoscaling, indent=2))
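
The dictionary above bundles settings that Application Auto Scaling expects in two separate requests: one to register the variant as a scalable target and one to attach the target-tracking policy. The following sketch splits them accordingly. The helper names, the policy name, and `example_config` are illustrative assumptions. The boto3 import is deferred so the request builder can be run without AWS access.

```python
def build_autoscaling_requests(config, policy_name='invocations-target-tracking'):
    """Split a combined auto-scaling config into the two request payloads."""
    common = {
        'ServiceNamespace': 'sagemaker',
        'ResourceId': config['ResourceId'],
        'ScalableDimension': config['ScalableDimension'],
    }
    register_kwargs = dict(
        common,
        MinCapacity=config['MinCapacity'],
        MaxCapacity=config['MaxCapacity'],
    )
    policy_kwargs = dict(
        common,
        PolicyName=policy_name,  # hypothetical name
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration=config[
            'TargetTrackingScalingPolicyConfiguration'
        ],
    )
    return register_kwargs, policy_kwargs


def apply_autoscaling(config):
    """Send both requests to AWS (requires credentials; not called here)."""
    import boto3  # deferred so the builder above works offline
    client = boto3.client('application-autoscaling')
    register_kwargs, policy_kwargs = build_autoscaling_requests(config)
    client.register_scalable_target(**register_kwargs)
    client.put_scaling_policy(**policy_kwargs)


# Example values mirroring the configuration shown earlier
example_config = {
    'ResourceId': 'endpoint/my-endpoint/variant/AllTraffic',
    'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
    'MinCapacity': 1,
    'MaxCapacity': 5,
    'TargetTrackingScalingPolicyConfiguration': {
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
    },
}
register_kwargs, policy_kwargs = build_autoscaling_requests(example_config)
print(sorted(register_kwargs))
print(sorted(policy_kwargs))
```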

### Making Predictions

Once deployed, invoke the endpoint with your data:

In [None]:
# Simulated endpoint invocation

class MockEndpoint:
    """Simulates a SageMaker endpoint for predictions"""
    
    def __init__(self, model_type='classifier'):
        self.model_type = model_type
        self.invocation_count = 0
        self.latencies = []
    
    def invoke(self, data):
        """Simulate model prediction with latency tracking"""
        import time
        # perf_counter is a monotonic, high-resolution clock suited to timing
        start_time = time.perf_counter()
        
        # Simulate prediction (converted to plain Python types for clean output)
        if self.model_type == 'classifier':
            prediction = int(np.random.choice([0, 1], p=[0.7, 0.3]))
        else:
            prediction = float(np.random.randn() * 10 + 50)
        
        # Covers only the simulated work above, not a network round trip
        latency = (time.perf_counter() - start_time) * 1000  # ms
        self.latencies.append(latency)
        self.invocation_count += 1
        
        return {'prediction': prediction, 'latency_ms': latency}

In [None]:
# Test endpoint invocation
endpoint = MockEndpoint(model_type='classifier')

# Make sample predictions
test_data = np.random.randn(5, 10)
results = []

for i, sample in enumerate(test_data):
    result = endpoint.invoke(sample)
    results.append(result)
    print(f"Request {i+1}: Prediction={result['prediction']}, "
          f"Latency={result['latency_ms']:.2f}ms")

print(f"\nAverage latency: {np.mean(endpoint.latencies):.2f}ms")
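
For comparison with the mock, a real invocation goes through the `sagemaker-runtime` client. This is a minimal sketch, assuming a deployed endpoint whose container accepts `text/csv`. The helper names are illustrative, and the response format depends entirely on the model container.

```python
def to_csv_payload(rows):
    """Serialize rows of numbers to CSV text, one row per line."""
    return '\n'.join(','.join(str(float(v)) for v in row) for row in rows)


def invoke_real_endpoint(endpoint_name, rows):
    """Send rows to a deployed endpoint (requires AWS credentials; not called here)."""
    import boto3  # deferred so to_csv_payload can be used offline
    runtime = boto3.client('sagemaker-runtime')
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Body=to_csv_payload(rows),
    )
    # Body is a streaming object; how to parse it depends on the container
    return response['Body'].read().decode('utf-8')


print(to_csv_payload([[1, 2.5], [3, 4]]))
```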

## 3. Batch Transform for Large-Scale Inference

Batch Transform is a good fit when:
- You need to process large datasets offline
- There is no real-time latency requirement
- You want to pay only while the job is running

### Batch Transform Architecture

```
┌─────────────────────────────────────────────────────────┐
│                  Batch Transform                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  S3 Input Bucket        SageMaker           S3 Output  │
│  ┌──────────┐         ┌──────────┐         ┌────────┐ │
│  │ data.csv │───────▶ │ Transform│────────▶│results │ │
│  │ (large)  │         │   Job    │         │  .csv  │ │
│  └──────────┘         └──────────┘         └────────┘ │
│                            │                           │
│                       Fixed number of                  │
│                       instances set by                 │
│                       InstanceCount                    │
│                       (no auto-scaling)                │
└─────────────────────────────────────────────────────────┘
```

In [None]:
# Example: Batch transform configuration

def create_batch_transform_job():
    """Configure batch transform for large-scale inference"""
    job_config = {
        'TransformJobName': 'batch-inference-job-2024-01',
        'ModelName': 'my-trained-model',
        'TransformInput': {
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': 's3://my-bucket/input-data/'
                }
            },
            'ContentType': 'text/csv',
            'SplitType': 'Line'  # Process line-by-line
        },
        'TransformOutput': {
            'S3OutputPath': 's3://my-bucket/predictions/',
            'AssembleWith': 'Line'
        },
        'TransformResources': {
            'InstanceType': 'ml.m5.xlarge',
            'InstanceCount': 2  # Parallel processing
        }
    }
    return job_config

batch_config = create_batch_transform_job()
print("Batch Transform Configuration:")
print(json.dumps(batch_config, indent=2))

In [None]:
# Simulate batch transform processing

class BatchTransformSimulator:
    """Simulates batch transform job execution"""
    
    def __init__(self, instance_count=2):
        self.instance_count = instance_count
        self.processed_records = 0
    
    def process_batch(self, data_size):
        """Simulate splitting a dataset evenly across instances"""
        # Spread any remainder over the first instances so no record is dropped
        base, remainder = divmod(data_size, self.instance_count)
        
        print(f"Processing {data_size} records...")
        print(f"Using {self.instance_count} instances")
        print(f"Base share per instance: {base}\n")
        
        start = 0
        for i in range(self.instance_count):
            share = base + (1 if i < remainder else 0)
            end = start + share  # exclusive upper bound
            print(f"Instance {i+1}: Processing records [{start}, {end})")
            self.processed_records += share
            start = end
        
        return self.processed_records

# Simulate batch processing
batch_job = BatchTransformSimulator(instance_count=2)
total_records = 100000
processed = batch_job.process_batch(total_records)
print(f"\nTotal processed: {processed:,} records")
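
The simulator splits work by record count, which is a simplification. As far as the AWS documentation describes it, Batch Transform hands out whole S3 objects to instances, so a single large input file keeps only one instance busy. The sketch below illustrates that effect. The round-robin assignment and the file names are assumptions made for illustration, not the service's documented scheduling.

```python
def assign_files_to_instances(s3_keys, instance_count):
    """Assign whole input files to instances (round-robin is an assumption)."""
    assignments = {i: [] for i in range(instance_count)}
    for idx, key in enumerate(sorted(s3_keys)):
        assignments[idx % instance_count].append(key)
    return assignments


# Five input shards spread across two instances
keys = [f'input-data/part-{n:03d}.csv' for n in range(5)]
plan = assign_files_to_instances(keys, instance_count=2)
for instance, files in plan.items():
    print(f"Instance {instance + 1}: {files}")

# One big file: the second instance has nothing to do
single = assign_files_to_instances(['input-data/all.csv'], instance_count=2)
print(single)
```

The practical consequence is to shard large inputs into several files before raising `InstanceCount`.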

## 4. Multi-Model Endpoints

Multi-model endpoints host many models behind a single endpoint and load them on demand, which can cut costs substantially when individual models see light or uneven traffic.

### Multi-Model Architecture

```
┌─────────────────────────────────────────────────────────┐
│          Multi-Model Endpoint                            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  S3 Model Store         Endpoint Container             │
│  ┌──────────────┐      ┌─────────────────────┐        │
│  │ model-a.tar  │      │  Memory Cache       │        │
│  │ model-b.tar  │◀────▶│  ┌──────┐ ┌──────┐ │        │
│  │ model-c.tar  │      │  │Model │ │Model │ │        │
│  │     ...      │      │  │  A   │ │  B   │ │        │
│  └──────────────┘      │  └──────┘ └──────┘ │        │
│                        │                     │        │
│                        │  Lazy loading       │        │
│                        │  LRU eviction       │        │
│                        └─────────────────────┘        │
│                                                         │
│  Benefits:                                              │
│  - Share compute across models                         │
│  - Cost-effective for many models                      │
│  - Dynamic loading based on demand                     │
└─────────────────────────────────────────────────────────┘
```

In [None]:
# Multi-model endpoint configuration

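def create_multi_model_config():
    """Configure a model and endpoint config for multi-model hosting"""
    config = {
        # Multi-model settings belong to the model's container definition
        # (create_model), not to the production variant
        'Model': {
            'ModelName': 'multi-model-container',
            'PrimaryContainer': {
                'Image': '<inference-image-uri>',  # placeholder, not a real URI
                # S3 prefix that holds all model archives
                'ModelDataUrl': 's3://my-bucket/models/',
                'Mode': 'MultiModel'
            }
        },
        'EndpointConfig': {
            'EndpointConfigName': 'multi-model-endpoint-config',
            'ProductionVariants': [
                {
                    'VariantName': 'AllModels',
                    'ModelName': 'multi-model-container',
                    'InstanceType': 'ml.m5.xlarge',
                    'InitialInstanceCount': 1
                }
            ]
        }
    }
    return config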

multi_model_config = create_multi_model_config()
print("Multi-Model Endpoint Configuration:")
print(json.dumps(multi_model_config, indent=2))

In [None]:
# Simulate multi-model endpoint invocation

class MultiModelEndpoint:
    """Simulates multi-model endpoint with caching"""
    
    def __init__(self, cache_size=2):
        # Dicts keep insertion order (Python 3.7+), so the first key is
        # always the least recently used model
        self.cache = {}
        self.cache_size = cache_size
        self.load_count = {}
    
    def invoke(self, model_name, data):
        """Invoke specific model from multi-model endpoint"""
        if model_name in self.cache:
            # Cache hit: remove and re-insert below to mark as most recently used
            self.cache.pop(model_name)
        else:
            # Simulate loading model from S3
            print(f"Loading {model_name} from S3...")
            if len(self.cache) >= self.cache_size:
                # Evict least recently used model
                evicted = next(iter(self.cache))
                print(f"Evicting {evicted} from cache")
                del self.cache[evicted]
            self.load_count[model_name] = self.load_count.get(model_name, 0) + 1
        
        self.cache[model_name] = True
        
        # Make prediction
        prediction = np.random.randn()
        return {'model': model_name, 'prediction': prediction}

In [None]:
# Test multi-model endpoint
mme = MultiModelEndpoint(cache_size=2)

# Invoke different models
models = ['model-a', 'model-b', 'model-c', 'model-a']
test_data = np.random.randn(10)

for model in models:
    result = mme.invoke(model, test_data)
    print(f"Prediction from {model}: {result['prediction']:.3f}")

print(f"\nModel load statistics: {mme.load_count}")
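
On a real multi-model endpoint, the model is chosen per request through the `TargetModel` parameter of `invoke_endpoint`, given as the archive's path relative to the S3 prefix configured on the model. A minimal sketch of such a request follows. The helper names, endpoint name, and archive name are illustrative assumptions.

```python
def build_mme_request(endpoint_name, model_archive, csv_body):
    """Build invoke_endpoint arguments that target one model on the endpoint."""
    return {
        'EndpointName': endpoint_name,
        'TargetModel': model_archive,  # relative to the model's S3 prefix
        'ContentType': 'text/csv',
        'Body': csv_body,
    }


def invoke_mme(endpoint_name, model_archive, csv_body):
    """Send the request (requires AWS credentials; not called here)."""
    import boto3  # deferred so the builder above works offline
    runtime = boto3.client('sagemaker-runtime')
    response = runtime.invoke_endpoint(
        **build_mme_request(endpoint_name, model_archive, csv_body)
    )
    return response['Body'].read().decode('utf-8')


request = build_mme_request('my-mme-endpoint', 'model-a.tar.gz', '0.1,0.2,0.3')
print(request['TargetModel'])
```

The first request for a given model pays the load cost, much like the "Loading ... from S3" message in the simulation.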

## 5. A/B Testing and Traffic Splitting

Production variants allow testing multiple model versions with controlled traffic distribution.

### A/B Testing Architecture

```
┌─────────────────────────────────────────────────────────┐
│             A/B Testing Endpoint                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│         Incoming Traffic (100%)                         │
│                   │                                     │
│                   ▼                                     │
│         ┌─────────────────────┐                        │
│         │  Load Balancer      │                        │
│         └─────────────────────┘                        │
│              │           │                              │
│         70%  │           │  30%                         │
│              ▼           ▼                              │
│    ┌──────────────┐  ┌──────────────┐                 │
│    │  Variant A   │  │  Variant B   │                 │
│    │ (Champion)   │  │ (Challenger) │                 │
│    │  Model v1.0  │  │  Model v2.0  │                 │
│    └──────────────┘  └──────────────┘                 │
│                                                         │
│    Compare metrics:                                     │
│    - Accuracy                                           │
│    - Latency                                            │
│    - Cost                                               │
└─────────────────────────────────────────────────────────┘
```

In [None]:
# A/B testing endpoint configuration

def create_ab_test_config():
    """Configure endpoint with multiple model variants for A/B testing"""
    config = {
        'EndpointConfigName': 'ab-test-config-v1',
        'ProductionVariants': [
            {
                'VariantName': 'VariantA-Champion',
                'ModelName': 'model-v1-stable',
                'InstanceType': 'ml.t2.medium',
                'InitialInstanceCount': 2,
                'InitialVariantWeight': 0.7  # 70% traffic
            },
            {
                'VariantName': 'VariantB-Challenger',
                'ModelName': 'model-v2-experimental',
                'InstanceType': 'ml.t2.medium',
                'InitialInstanceCount': 1,
                'InitialVariantWeight': 0.3  # 30% traffic
            }
        ]
    }
    return config

ab_config = create_ab_test_config()
print("A/B Testing Configuration:")
print(json.dumps(ab_config, indent=2))

In [None]:
# Simulate A/B testing traffic distribution

class ABTestEndpoint:
    """Simulates A/B testing with traffic splitting"""
    
    def __init__(self, variant_weights):
        self.variant_weights = variant_weights
        self.variant_metrics = {v: {'count': 0, 'latencies': [], 'predictions': []} 
                               for v in variant_weights.keys()}
    
    def invoke(self, data):
        """Route request to variant based on weights"""
        # Randomly select variant based on weights
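        variants = list(self.variant_weights.keys())
        weights = np.array(list(self.variant_weights.values()), dtype=float)
        # Variant weights are relative, so normalize them into probabilities
        selected_variant = str(np.random.choice(variants, p=weights / weights.sum()))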
        
        # Simulate prediction with different characteristics
        if selected_variant == 'VariantA':
            latency = np.random.gamma(2, 50)  # ms
            prediction = np.random.randn() * 10 + 50
        else:
            latency = np.random.gamma(1.5, 40)  # Faster but experimental
            prediction = np.random.randn() * 12 + 52
        
        # Track metrics
        self.variant_metrics[selected_variant]['count'] += 1
        self.variant_metrics[selected_variant]['latencies'].append(latency)
        self.variant_metrics[selected_variant]['predictions'].append(prediction)
        
        return {'variant': selected_variant, 'prediction': prediction}

In [None]:
# Run A/B test simulation
ab_endpoint = ABTestEndpoint({
    'VariantA': 0.7,
    'VariantB': 0.3
})

# Simulate 1000 requests
num_requests = 1000
for _ in range(num_requests):
    test_data = np.random.randn(10)
    ab_endpoint.invoke(test_data)

# Analyze results
print("A/B Test Results:\n")
for variant, metrics in ab_endpoint.variant_metrics.items():
    print(f"{variant}:")
    print(f"  Requests: {metrics['count']} ({metrics['count']/num_requests*100:.1f}%)")
    print(f"  Avg Latency: {np.mean(metrics['latencies']):.2f}ms")
    print(f"  Avg Prediction: {np.mean(metrics['predictions']):.2f}")
    print()
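
Variant weights are relative: each variant receives its weight divided by the sum of all weights, so `7` and `3` split traffic the same way as `0.7` and `0.3`. The sketch below works through that arithmetic and builds the payload for `update_endpoint_weights_and_capacities`, which shifts traffic on a running endpoint. The helper names and the endpoint name are illustrative assumptions.

```python
def traffic_shares(weights):
    """Convert relative variant weights into fractions of total traffic."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("At least one variant needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


def build_weight_update(endpoint_name, weights):
    """Build arguments for update_endpoint_weights_and_capacities."""
    return {
        'EndpointName': endpoint_name,
        'DesiredWeightsAndCapacities': [
            {'VariantName': name, 'DesiredWeight': float(weight)}
            for name, weight in weights.items()
        ],
    }


# Gradually promote the challenger: 70/30 -> 50/50 -> 0/100
for step in ({'VariantA': 7, 'VariantB': 3},
             {'VariantA': 1, 'VariantB': 1},
             {'VariantA': 0, 'VariantB': 1}):
    print(traffic_shares(step))

update = build_weight_update('my-endpoint', {'VariantA': 1, 'VariantB': 1})
# With credentials: boto3.client('sagemaker').update_endpoint_weights_and_capacities(**update)
```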

## 6. Model Monitoring and CloudWatch Integration

Monitor endpoint performance and model quality in production.

### Monitoring Architecture

```
┌─────────────────────────────────────────────────────────┐
│           SageMaker Model Monitoring                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Endpoint         CloudWatch         Alerts             │
│  ┌────────┐      ┌──────────┐      ┌────────┐         │
│  │Metrics │─────▶│ Dashboards│─────▶│  SNS   │         │
│  │- Latency│     │ - Graphs  │      │ Email  │         │
│  │- Errors │     │ - Logs    │      │ Lambda │         │
│  │- Invocs │     └──────────┘      └────────┘         │
│  └────────┘                                             │
│                                                         │
│  Model Monitor   Data Quality       Data Drift         │
│  ┌───────────┐   ┌────────────┐    ┌────────────┐    │
│  │Capture    │──▶│ Violations │───▶│  Alerts    │    │
│  │Input/Output│   │ Detected   │    │  Triggered │    │
│  └───────────┘   └────────────┘    └────────────┘    │
└─────────────────────────────────────────────────────────┘
```

In [None]:
# Model monitoring configuration

def create_monitoring_schedule():
    """Configure model monitoring for data quality and drift detection"""
    schedule_config = {
        'MonitoringScheduleName': 'model-quality-monitor',
        'MonitoringScheduleConfig': {
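            # Abbreviated for readability: a real request also needs
            # MonitoringAppSpecification, MonitoringResources, and RoleArn here
            'MonitoringJobDefinition': {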
                'MonitoringInputs': [
                    {
                        'EndpointInput': {
                            'EndpointName': 'my-endpoint',
                            'LocalPath': '/opt/ml/processing/input'
                        }
                    }
                ],
                'MonitoringOutputConfig': {
                    'MonitoringOutputs': [
                        {
                            'S3Output': {
                                'S3Uri': 's3://my-bucket/monitoring-reports/',
                                'LocalPath': '/opt/ml/processing/output'
                            }
                        }
                    ]
                }
            },
            'ScheduleConfig': {
                'ScheduleExpression': 'cron(0 * ? * * *)'  # Hourly
            }
        }
    }
    return schedule_config

monitor_config = create_monitoring_schedule()
print("Model Monitoring Schedule:")
print(json.dumps(monitor_config, indent=2))
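
A monitoring schedule analyzes traffic that the endpoint has captured to S3, so data capture has to be enabled on the endpoint configuration first. The sketch below builds a `DataCaptureConfig` block that would sit alongside `ProductionVariants` in `create_endpoint_config`. The helper name, bucket path, and 20% sampling rate are illustrative assumptions.

```python
def create_data_capture_config(s3_uri, sampling_percentage=20):
    """Build a DataCaptureConfig block for an endpoint configuration."""
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 0 and 100")
    return {
        'EnableCapture': True,
        # Capturing a sample instead of every request keeps storage costs down
        'InitialSamplingPercentage': sampling_percentage,
        'DestinationS3Uri': s3_uri,
        'CaptureOptions': [
            {'CaptureMode': 'Input'},   # request payloads
            {'CaptureMode': 'Output'},  # model responses
        ],
    }


capture = create_data_capture_config('s3://my-bucket/data-capture/')
endpoint_config_with_capture = {
    'EndpointConfigName': 'my-model-config-with-capture',  # hypothetical name
    'ProductionVariants': [],  # variants omitted for brevity
    'DataCaptureConfig': capture,
}
print(capture['InitialSamplingPercentage'])
```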

In [None]:
# Simulate CloudWatch metrics collection

class EndpointMonitor:
    """Simulates CloudWatch metrics for endpoint monitoring"""
    
    def __init__(self):
        self.metrics = []
    
    def record_invocation(self, latency, success=True):
        """Record metrics for endpoint invocation"""
        metric = {
            'timestamp': datetime.now(),
            'latency': latency,
            'success': success,
            'invocations': 1,
            'errors': 0 if success else 1
        }
        self.metrics.append(metric)
    
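    def get_summary_stats(self):
        """Calculate summary statistics from metrics"""
        if not self.metrics:
            # Avoid dividing by zero before any invocation is recorded
            raise ValueError("No invocations recorded yet")
        latencies = [m['latency'] for m in self.metrics]
        total_invocations = len(self.metrics)
        total_errors = sum(m['errors'] for m in self.metrics)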
        
        return {
            'total_invocations': total_invocations,
            'error_rate': total_errors / total_invocations,
            'avg_latency': np.mean(latencies),
            'p50_latency': np.percentile(latencies, 50),
            'p95_latency': np.percentile(latencies, 95),
            'p99_latency': np.percentile(latencies, 99)
        }

In [None]:
# Simulate endpoint monitoring
monitor = EndpointMonitor()

# Simulate 100 requests with varying latencies
for _ in range(100):
    # Most requests are fast
    if np.random.rand() < 0.95:
        latency = np.random.gamma(2, 30)  # Fast requests
        success = True
    else:
        # Occasional slow or failed request
        latency = np.random.gamma(5, 100)  # Slow request
        success = np.random.rand() > 0.1
    
    monitor.record_invocation(latency, success)

# Display monitoring summary
stats = monitor.get_summary_stats()
print("Endpoint Monitoring Summary:\n")
print(f"Total Invocations: {stats['total_invocations']}")
print(f"Error Rate: {stats['error_rate']*100:.2f}%")
print(f"\nLatency Statistics (ms):")
print(f"  Average: {stats['avg_latency']:.2f}")
print(f"  P50: {stats['p50_latency']:.2f}")
print(f"  P95: {stats['p95_latency']:.2f}")
print(f"  P99: {stats['p99_latency']:.2f}")
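
In a real deployment these statistics come from the `AWS/SageMaker` CloudWatch namespace, and an alarm on the latency percentile replaces manual inspection. The sketch below builds arguments for CloudWatch's `put_metric_alarm`. The helper name, alarm name, periods, and threshold are illustrative assumptions. Note that the `ModelLatency` metric is reported in microseconds, so a threshold in milliseconds has to be converted.

```python
def build_latency_alarm(endpoint_name, variant_name, threshold_ms, sns_topic_arn=None):
    """Build put_metric_alarm arguments for a p95 model-latency alarm."""
    alarm = {
        'AlarmName': f'{endpoint_name}-{variant_name}-high-latency',
        'Namespace': 'AWS/SageMaker',
        'MetricName': 'ModelLatency',
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name},
        ],
        'ExtendedStatistic': 'p95',
        'Period': 60,            # seconds per datapoint
        'EvaluationPeriods': 5,  # consecutive datapoints that must breach
        'Threshold': threshold_ms * 1000.0,  # ModelLatency is in microseconds
        'ComparisonOperator': 'GreaterThanThreshold',
        'TreatMissingData': 'notBreaching',  # no traffic should not raise the alarm
    }
    if sns_topic_arn:
        alarm['AlarmActions'] = [sns_topic_arn]
    return alarm


alarm = build_latency_alarm('my-endpoint', 'AllTraffic', threshold_ms=200)
print(alarm['AlarmName'], alarm['Threshold'])
# With credentials: boto3.client('cloudwatch').put_metric_alarm(**alarm)
```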

## 7. Cost Optimization Strategies

SageMaker endpoints can be expensive. Here are strategies to optimize costs:

### Cost Optimization Techniques

```
┌─────────────────────────────────────────────────────────┐
│         SageMaker Cost Optimization                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  1. Right-sizing Instances                              │
│     - Start small (ml.t2.medium)                        │
│     - Monitor CPU/Memory usage                          │
│     - Upgrade only when needed                          │
│                                                         │
│  2. Auto-scaling                                        │
│     - Scale down during low traffic                     │
│     - Set appropriate min/max instances                 │
│     - Use target tracking policies                      │
│                                                         │
│  3. Deployment Patterns                                 │
│     - Multi-model endpoints (many models)               │
│     - Serverless inference (intermittent)               │
│     - Batch transform (no real-time needed)             │
│                                                         │
│  4. Lifecycle Management                                │
│     - Delete unused endpoints                           │
│     - Delete idle non-production endpoints              │
│     - Use scheduled scaling                             │
│                                                         │
│  5. Data Capture                                        │
│     - Sample requests (not 100%)                        │
│     - Compress captured data                            │
│     - Set S3 lifecycle policies                         │
└─────────────────────────────────────────────────────────┘
```

In [None]:
# Cost estimation calculator

def estimate_endpoint_cost(instance_type, instance_count, hours_per_month):
    """Estimate monthly cost for SageMaker endpoint"""
    
    # Illustrative hourly rates in USD, not current AWS prices
    # (actual prices vary by region; check the pricing page)
    hourly_rates = {
        'ml.t2.medium': 0.065,
        'ml.m5.large': 0.134,
        'ml.m5.xlarge': 0.269,
        'ml.c5.xlarge': 0.238,
    }
    
    hourly_rate = hourly_rates.get(instance_type, 0.10)
    monthly_cost = hourly_rate * instance_count * hours_per_month
    
    return {
        'instance_type': instance_type,
        'instance_count': instance_count,
        'hours_per_month': hours_per_month,
        'hourly_rate': hourly_rate,
        'monthly_cost': monthly_cost
    }

# Compare different deployment scenarios
scenarios = [
    ('Always-on (24/7)', 'ml.t2.medium', 1, 730),
    ('Business hours only', 'ml.t2.medium', 1, 200),
    ('Auto-scaled (avg)', 'ml.t2.medium', 2, 500),
    ('Multi-model endpoint', 'ml.m5.large', 1, 730),
]

print("Cost Comparison for Different Deployment Strategies:\n")
for name, instance, count, hours in scenarios:
    cost = estimate_endpoint_cost(instance, count, hours)
    print(f"{name}:")
    print(f"  Configuration: {count}x {instance}")
    print(f"  Hours/month: {hours}")
    print(f"  Estimated cost: ${cost['monthly_cost']:.2f}/month\n")
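
Monthly totals are easier to compare across deployment patterns when expressed per request. The sketch below does that arithmetic. The function name and the traffic figure of 10,000 requests per day are assumptions, and the $0.065 hourly rate is the same illustrative number used in the calculator above, not a quoted AWS price.

```python
def cost_per_thousand_requests(hourly_rate, instance_count, hours_per_month,
                               requests_per_month):
    """Return the instance cost attributable to each 1,000 requests."""
    if requests_per_month <= 0:
        raise ValueError("requests_per_month must be positive")
    monthly_cost = hourly_rate * instance_count * hours_per_month
    return monthly_cost / requests_per_month * 1000


# One always-on instance serving 10,000 requests/day for 30 days
busy = cost_per_thousand_requests(0.065, 1, 730, 10_000 * 30)
# The same instance serving only 100 requests/day
quiet = cost_per_thousand_requests(0.065, 1, 730, 100 * 30)
print(f"Busy endpoint:  ${busy:.3f} per 1,000 requests")
print(f"Quiet endpoint: ${quiet:.2f} per 1,000 requests")
```

The hundredfold gap between the two cases is why low-traffic models are candidates for serverless or multi-model hosting.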

In [None]:
# Serverless inference configuration (cost-effective for low traffic)

def create_serverless_config():
    """Configure serverless inference endpoint (scales to zero)"""
    config = {
        'EndpointConfigName': 'serverless-endpoint-config',
        'ProductionVariants': [
            {
                'VariantName': 'ServerlessVariant',
                'ModelName': 'my-model',
                'ServerlessConfig': {
                    'MemorySizeInMB': 2048,  # 2GB
                    'MaxConcurrency': 10  # Max concurrent invocations
                    # ProvisionedConcurrency is omitted (its minimum is 1)
                    # so the endpoint can scale to zero when idle
                }
            }
        ]
    }
    
    print("Serverless Inference Benefits:")
    print("- No charge when not in use")
    print("- Automatic scaling")
    print("- Pay per request (after free tier)")
    print("- Ideal for intermittent traffic\n")
    
    return config

serverless_config = create_serverless_config()
print(json.dumps(serverless_config, indent=2))
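
Serverless settings are constrained, and catching a bad value locally is cheaper than a failed API call. The sketch below checks a `ServerlessConfig` block against the limits as I understand them from the API reference at the time of writing: memory in 1 GB steps from 1024 to 6144 MB, `MaxConcurrency` from 1 to 200, and `ProvisionedConcurrency`, when present, at least 1 and no greater than `MaxConcurrency`. Treat these limits as assumptions to verify against current documentation. The helper name is illustrative.

```python
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)  # assumed limits


def validate_serverless_config(serverless_config):
    """Return a list of problems found in a ServerlessConfig dict."""
    problems = []

    memory = serverless_config.get('MemorySizeInMB')
    if memory not in ALLOWED_MEMORY_MB:
        problems.append(f"MemorySizeInMB must be one of {ALLOWED_MEMORY_MB}")

    max_concurrency = serverless_config.get('MaxConcurrency')
    concurrency_ok = isinstance(max_concurrency, int) and 1 <= max_concurrency <= 200
    if not concurrency_ok:
        problems.append("MaxConcurrency must be an integer from 1 to 200")

    provisioned = serverless_config.get('ProvisionedConcurrency')
    if provisioned is not None:
        if not isinstance(provisioned, int) or provisioned < 1:
            problems.append("ProvisionedConcurrency must be at least 1; omit it to scale to zero")
        elif concurrency_ok and provisioned > max_concurrency:
            problems.append("ProvisionedConcurrency cannot exceed MaxConcurrency")

    return problems


print(validate_serverless_config({'MemorySizeInMB': 2048, 'MaxConcurrency': 10}))
print(validate_serverless_config(
    {'MemorySizeInMB': 1500, 'MaxConcurrency': 10, 'ProvisionedConcurrency': 0}
))
```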

## 8. Cleanup and Best Practices

**CRITICAL**: Always delete endpoints when not in use to avoid charges!

In [None]:
# Endpoint cleanup helper

def cleanup_endpoint(endpoint_name, delete_config=True, delete_model=True):
    """
    Clean up SageMaker resources to avoid charges
    
    Steps:
    1. Delete endpoint (stops billing immediately)
    2. Delete endpoint configuration
    3. Delete model
    """
    cleanup_plan = {
        'endpoint_name': endpoint_name,
        'steps': [
            f"1. Delete endpoint: {endpoint_name}",
            f"2. Delete endpoint config: {endpoint_name}-config" if delete_config else "Skip",
            f"3. Delete model: {endpoint_name}-model" if delete_model else "Skip"
        ],
        'warning': 'BILLING STOPS AFTER STEP 1!'
    }
    return cleanup_plan

# Example cleanup
cleanup = cleanup_endpoint('my-production-endpoint')
print("Cleanup Plan:")
for step in cleanup['steps']:
    if step != "Skip":
        print(f"  {step}")
print(f"\n⚠️  {cleanup['warning']}")
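
The plan above guesses resource names from the endpoint name, which only works if that naming convention was followed. A more robust approach looks up the attached configuration and models before deleting anything. The following is a minimal sketch, assuming `sm_client` behaves like `boto3.client('sagemaker')`. The helper name is illustrative, and error handling is omitted.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, then its endpoint config and the models it referenced."""
    # Look everything up first: the names are unavailable once the endpoint is gone
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint['EndpointConfigName']
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)

    # Variants may share a model, so de-duplicate while preserving order
    model_names = list(dict.fromkeys(
        variant['ModelName']
        for variant in config['ProductionVariants']
        if variant.get('ModelName')
    ))

    sm_client.delete_endpoint(EndpointName=endpoint_name)  # instance billing ends here
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {'endpoint': endpoint_name, 'config': config_name, 'models': model_names}
```

Deleting the configuration and models does not affect billing, but it keeps the account free of stale resources.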

### Best Practices Summary

1. **Development**: Use smaller instances (ml.t2.medium) for testing
2. **Production**: Right-size based on load testing
3. **Monitoring**: Set up CloudWatch alarms for errors and latency
4. **Cost**: Use serverless or multi-model endpoints when possible
5. **Testing**: Use A/B testing before full rollout
6. **Cleanup**: Always delete unused endpoints
7. **Security**: Use VPC endpoints and encryption
8. **Logging**: Enable data capture for model monitoring

## Exercises

### Exercise 1: Design Deployment Strategy

You have three models to deploy:
- Model A: Used by 10,000 users daily (9am-5pm)
- Model B: Used by 100 users sporadically
- Model C: Batch processing 1M records nightly

Design the optimal deployment strategy for each model. Consider:
- Endpoint type (real-time, serverless, batch)
- Instance type and count
- Auto-scaling configuration
- Estimated monthly cost

In [None]:
# Your solution here
def design_deployment_strategy():
    """
    Design deployment strategy for three different use cases
    
    Consider:
    - Traffic patterns
    - Latency requirements
    - Cost optimization
    - Scalability needs
    """
    strategies = {
        'model_a': {
            # TODO: Fill in deployment strategy
        },
        'model_b': {
            # TODO: Fill in deployment strategy
        },
        'model_c': {
            # TODO: Fill in deployment strategy
        }
    }
    return strategies

# Test your design
# strategies = design_deployment_strategy()
# print(json.dumps(strategies, indent=2))

### Exercise 2: A/B Test Analysis

You're running an A/B test with:
- Variant A: Current model (70% traffic)
- Variant B: New model (30% traffic)

After 1 week:
- Variant A: 95% accuracy, 80ms average latency
- Variant B: 97% accuracy, 120ms average latency

Should you switch to Variant B? Write code to analyze the trade-offs.

In [None]:
# Your solution here
def analyze_ab_test(variant_a_metrics, variant_b_metrics, sla_latency=100):
    """
    Analyze A/B test results and make deployment recommendation
    
    Args:
        variant_a_metrics: dict with accuracy, latency
        variant_b_metrics: dict with accuracy, latency
        sla_latency: maximum acceptable latency in ms
    
    Returns:
        dict with recommendation and reasoning
    """
    # TODO: Implement analysis logic
    pass

# Test your analysis
# variant_a = {'accuracy': 0.95, 'latency': 80}
# variant_b = {'accuracy': 0.97, 'latency': 120}
# recommendation = analyze_ab_test(variant_a, variant_b)
# print(recommendation)

### Exercise 3: Cost Optimization Calculator

Create a function that compares costs between:
1. Single model endpoint (always-on)
2. Multi-model endpoint (10 models)
3. Serverless inference

Assume:
- Each model receives 1000 requests/day
- Average processing time: 50ms
- SLA: 99.9% availability

In [None]:
# Your solution here
def compare_deployment_costs(num_models, requests_per_day, avg_process_time_ms):
    """
    Compare monthly costs for different deployment patterns
    
    Calculate costs for:
    - Individual endpoints (num_models separate endpoints)
    - Multi-model endpoint (1 endpoint hosting all models)
    - Serverless inference
    
    Returns:
        dict with cost breakdown for each option
    """
    # TODO: Implement cost comparison
    pass

# Test your calculator
# costs = compare_deployment_costs(
#     num_models=10,
#     requests_per_day=1000,
#     avg_process_time_ms=50
# )
# print(json.dumps(costs, indent=2))

### Exercise 4: Monitoring Alert System

Design a monitoring system that:
1. Tracks endpoint latency, errors, and invocations
2. Detects anomalies (latency > 2x normal, error rate > 1%)
3. Generates alerts with severity levels
4. Suggests remediation actions

In [None]:
# Your solution here
class EndpointAlertSystem:
    """Monitor endpoint metrics and generate alerts"""
    
    def __init__(self, baseline_latency, baseline_error_rate):
        self.baseline_latency = baseline_latency
        self.baseline_error_rate = baseline_error_rate
        self.alerts = []
    
    def check_metrics(self, current_metrics):
        """
        Check current metrics against baselines
        Generate alerts for anomalies
        
        Args:
            current_metrics: dict with latency, error_rate, invocations
        
        Returns:
            list of alerts with severity and remediation
        """
        # TODO: Implement anomaly detection
        pass

# Test your alert system
# alert_system = EndpointAlertSystem(baseline_latency=50, baseline_error_rate=0.001)
# current = {'latency': 150, 'error_rate': 0.05, 'invocations': 1000}
# alerts = alert_system.check_metrics(current)
# for alert in alerts:
#     print(f"{alert['severity']}: {alert['message']}")
#     print(f"Remediation: {alert['remediation']}\n")

## Summary

In this notebook, you learned:

1. **Real-time Endpoints**: Deploy models with auto-scaling for production traffic
2. **Batch Transform**: Cost-effective offline inference for large datasets
3. **Multi-Model Endpoints**: Host multiple models on a single endpoint to reduce costs
4. **A/B Testing**: Safely test new model versions with traffic splitting
5. **Monitoring**: Track endpoint performance with CloudWatch metrics
6. **Cost Optimization**: Strategies to minimize SageMaker deployment costs

### Key Takeaways

- Choose deployment pattern based on traffic patterns and latency requirements
- Always monitor endpoint performance and set up alerts
- Use A/B testing to validate new models before full rollout
- Implement auto-scaling to handle variable traffic
- Delete unused endpoints to avoid unnecessary charges
- Consider serverless or multi-model endpoints for cost savings

### What's Next?

- [Module 03: Azure ML Studio Introduction](03_azure_ml_studio_introduction.ipynb)
- Practice deploying models with different instance types
- Experiment with CloudWatch dashboards and alarms
- Explore SageMaker Pipelines for automated deployments

### Additional Resources

- [SageMaker Deployment Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/best-practices.html)
- [SageMaker Pricing Calculator](https://aws.amazon.com/sagemaker/pricing/)
- [Auto-scaling Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html)
- [Model Monitoring](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html)