# Module 07: Serverless ML Deployment

**Difficulty**: ⭐⭐⭐
**Estimated Time**: 90 minutes
**Prerequisites**: 
- [Module 00: Introduction to Cloud ML Services](00_introduction_to_cloud_ml_services.ipynb)
- [Module 01: AWS SageMaker Basics](01_aws_sagemaker_basics.ipynb)
- Basic understanding of REST APIs
- Familiarity with Docker (recommended)

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand when serverless is appropriate for ML inference
2. Deploy ML models using AWS Lambda, Azure Functions, and Google Cloud Functions
3. Optimize cold start times and manage dependencies using Lambda layers
4. Integrate API Gateway for production-ready endpoints
5. Use container-based serverless for larger ML models
6. Compare costs between serverless and dedicated endpoints
7. Implement best practices for serverless ML deployment

## What is Serverless ML?

Serverless computing allows you to run code without managing servers. For ML inference, this means:

**Advantages:**
- ✅ **Pay-per-use**: Only charged when functions execute
- ✅ **Auto-scaling**: Handles 1 or 10,000 requests automatically
- ✅ **Zero maintenance**: No server management required
- ✅ **Cost-effective**: Ideal for sporadic or low-volume predictions

**Limitations:**
- ⚠️ **Cold starts**: Initial latency when function hasn't run recently (100ms - 10s)
- ⚠️ **Execution limits**: Timeouts (AWS Lambda: 15min max, typically use 30s-3min)
- ⚠️ **Memory constraints**: Limited RAM (AWS Lambda: 128MB - 10GB)
- ⚠️ **Package size limits**: Deployment package restrictions (50MB zipped, 250MB unzipped for Lambda)

**When to Use Serverless for ML:**
- Inference frequency: < 1000 requests/day or highly variable traffic
- Model size: < 250MB (or use containers for up to 10GB)
- Latency tolerance: Can accept 100ms - 1s additional cold start latency
- Budget-conscious: Want to minimize costs for low-volume use cases
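
The rules of thumb above can be written down as a small decision helper. This is a minimal sketch, not an official sizing guide: the `Workload` fields and the `recommend_deployment` name are hypothetical, and the thresholds are simply the ones quoted in this section.

```python
from dataclasses import dataclass

# Thresholds are this section's rules of thumb, not hard platform limits
# (apart from the 250 MB zip and 10 GB container packaging caps).
ZIP_LIMIT_MB = 250
CONTAINER_LIMIT_MB = 10 * 1024


@dataclass
class Workload:
    requests_per_day: int
    model_size_mb: float
    max_added_latency_ms: float  # cold-start latency the caller can tolerate
    bursty: bool = False         # True if traffic is highly variable


def recommend_deployment(w: Workload) -> str:
    """Return 'lambda-zip', 'lambda-container', or 'dedicated-endpoint'."""
    if w.model_size_mb > CONTAINER_LIMIT_MB:
        return "dedicated-endpoint"      # too large for any Lambda packaging
    if w.max_added_latency_ms < 100:
        return "dedicated-endpoint"      # cannot absorb a cold start
    if w.requests_per_day >= 1000 and not w.bursty:
        return "dedicated-endpoint"      # steady, higher volume
    if w.model_size_mb > ZIP_LIMIT_MB:
        return "lambda-container"
    return "lambda-zip"


print(recommend_deployment(Workload(requests_per_day=200, model_size_mb=12,
                                    max_added_latency_ms=500)))
```

Treat the result as a starting point; the cost comparison later in the notebook shows that the volume threshold depends heavily on request duration and memory size.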

## Setup and Imports

In [None]:
# Standard library imports
import json
import os
import base64
import time
from datetime import datetime
import zipfile
from io import BytesIO

# Data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import joblib

# Cloud SDKs (install only what you need for your chosen platform)
# pip install boto3  # For AWS
# pip install azure-functions  # For Azure
# pip install functions-framework  # For GCP (provides the functions_framework module)

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("Setup complete!")
print(f"Notebook executed on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## Part 1: Training a Lightweight Model for Serverless Deployment

For serverless deployment, we want models that are:
- **Small**: < 50MB ideally (< 250MB maximum for Lambda)
- **Fast**: Inference in < 100ms
- **Simple**: Minimal dependencies

Let's train a simple Random Forest classifier that meets these criteria.

In [None]:
# Load sample dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a lightweight model
# Using fewer trees and limited depth for smaller model size
model = RandomForestClassifier(
    n_estimators=10,  # Fewer trees = smaller model
    max_depth=5,      # Limited depth = faster inference
    random_state=42
)

model.fit(X_train, y_train)

# Evaluate model
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"Model Training Complete")
print(f"Train Accuracy: {train_score:.4f}")
print(f"Test Accuracy: {test_score:.4f}")

# Save model to file
model_path = 'iris_model.joblib'
joblib.dump(model, model_path)

# Check model size (important for serverless)
model_size_mb = os.path.getsize(model_path) / (1024 * 1024)
print(f"\nModel Size: {model_size_mb:.2f} MB")

if model_size_mb < 50:
    print("✅ Model size is optimal for serverless deployment")
elif model_size_mb < 250:
    print("⚠️ Model size is acceptable but may need Lambda layers")
else:
    print("❌ Model too large for standard Lambda, consider container-based deployment")
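
The size check covers the "Small" criterion; the "Fast" criterion (inference in under 100 ms) can be checked with a small timing harness. The sketch below is one way to do it, using a stand-in `fake_predict` function so it runs without a trained model; `measure_latency_ms` is a hypothetical helper name.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=5, runs=50):
    """Time predict(sample) and return median and p95 latency in milliseconds."""
    for _ in range(warmup):              # discard first calls (caches, lazy init)
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {"median_ms": statistics.median(timings), "p95_ms": timings[p95_index]}


# Stand-in for model.predict so the sketch runs on its own.
def fake_predict(rows):
    return [sum(row) > 10 for row in rows]


stats = measure_latency_ms(fake_predict, [[5.1, 3.5, 1.4, 0.2]])
print(stats)
```

With the model trained in this notebook, the call would be `measure_latency_ms(model.predict, X_test[:1])`. Numbers measured on a laptop are only indicative, since Lambda CPU allocation scales with the configured memory.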

## Part 2: AWS Lambda for ML Inference

AWS Lambda is a widely used serverless platform. Let's create a Lambda function for ML inference.

### 2.1: Lambda Function Handler

The Lambda handler is the entry point for your function. Here's a handler with model caching, input validation, and error handling:

In [None]:
# This code shows what goes inside lambda_function.py
# In actual deployment, this would be a separate file

lambda_function_code = '''
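import json
import logging
import os

import joblib
import numpy as np

# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# The model ships in the deployment package, which Lambda extracts to
# LAMBDA_TASK_ROOT (/var/task). Set MODEL_PATH to override this, e.g. when
# the model is delivered through a layer under /opt.
MODEL_PATH = os.environ.get(
    "MODEL_PATH",
    os.path.join(os.environ.get("LAMBDA_TASK_ROOT", "."), "model.joblib"),
)

# Global variable to cache model (persists between warm starts)
model = None

def load_model():
    """Load model once and cache for subsequent invocations (warm starts)"""
    global model
    if model is None:
        logger.info("Loading model (cold start)...")
        model = joblib.load(MODEL_PATH)
        logger.info("Model loaded successfully")
    return model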

def lambda_handler(event, context):
    """
    AWS Lambda handler for ML inference
    
    Expected input format:
    {
        "features": [[5.1, 3.5, 1.4, 0.2]]
    }
    """
    try:
        # Load model (cached after first call)
        clf = load_model()
        
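        # Parse input. Direct invocations pass the payload as the event;
        # API Gateway proxy integration passes it as a JSON string in "body".
        if isinstance(event, str):
            event = json.loads(event)
        if isinstance(event.get('body'), str):
            event = json.loads(event['body'])
        
        features = event.get('features')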
        if features is None:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'Missing features field'})
            }
        
        # Convert to numpy array
        X = np.array(features)
        
        # Make prediction
        predictions = clf.predict(X)
        probabilities = clf.predict_proba(X)
        
        # Format response
        response = {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*'  # For CORS
            },
            'body': json.dumps({
                'predictions': predictions.tolist(),
                'probabilities': probabilities.tolist()
            })
        }
        
        logger.info(f"Prediction successful: {predictions}")
        return response
        
    except Exception as e:
        logger.error(f"Error during prediction: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }
'''

# Save to file for reference
with open('lambda_function.py', 'w') as f:
    f.write(lambda_function_code)

print("✅ Lambda function code created: lambda_function.py")
print("\nKey features:")
print("- Model caching to avoid reloading on warm starts")
print("- Proper error handling and logging")
print("- CORS headers for web applications")
print("- Input validation")
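
The handler only checks that `features` is present. A stricter check on shape and type keeps malformed requests from reaching the model and surfacing as 500 errors. The following is a sketch of such a check under the assumption of four numeric Iris features per row; `validate_features` and the `max_rows` cap are hypothetical and not part of the handler above.

```python
import math
from numbers import Real

N_FEATURES = 4  # Iris: sepal length/width, petal length/width


def validate_features(features, n_features=N_FEATURES, max_rows=1000):
    """Return an error message, or None if `features` is a valid batch."""
    if not isinstance(features, list) or not features:
        return "features must be a non-empty list of rows"
    if len(features) > max_rows:
        return f"too many rows (max {max_rows})"
    for i, row in enumerate(features):
        if not isinstance(row, list) or len(row) != n_features:
            return f"row {i} must contain exactly {n_features} values"
        for value in row:
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, Real):
                return f"row {i} contains a non-numeric value"
            if not math.isfinite(value):
                return f"row {i} contains NaN or infinity"
    return None


print(validate_features([[5.1, 3.5, 1.4, 0.2]]))
print(validate_features([[5.1, 3.5, 1.4]]))
```

Inside a handler, a non-`None` result would map to a 400 response with the message in the body.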

### 2.2: Creating a Lambda Deployment Package

Lambda requires all code and dependencies in a zip file. Let's create a deployment package.

In [None]:
def create_lambda_package(function_file, model_file, output_zip='lambda_deployment.zip'):
    """
    Create a Lambda deployment package with function code and model
    
    Note: For production, dependencies should be in a Lambda Layer
    to keep deployment package small
    """
    with zipfile.ZipFile(output_zip, 'w', zipfile.ZIP_DEFLATED) as zipf:
        # Add Lambda function
        zipf.write(function_file, arcname='lambda_function.py')
        print(f"Added {function_file} to package")
        
        # Add model file
        zipf.write(model_file, arcname='model.joblib')
        print(f"Added {model_file} to package")
        
        # In production, you'd also add:
        # - requirements.txt dependencies (in a separate layer)
        # - Any helper modules
    
    package_size = os.path.getsize(output_zip) / (1024 * 1024)
    print(f"\n✅ Deployment package created: {output_zip}")
    print(f"Package size: {package_size:.2f} MB")
    
    if package_size < 50:
        print("✅ Package size is optimal for direct upload")
    else:
        print("⚠️ Package is large, consider using S3 for upload")
    
    return output_zip

# Create deployment package
package_path = create_lambda_package('lambda_function.py', 'iris_model.joblib')
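
Before uploading, it can help to confirm what actually went into the archive: Lambda looks for the handler module at the zip root, and the 250MB limit applies to the unzipped size. The sketch below inspects a zip with the standard library; `inspect_package` is a hypothetical helper, and the demo archive is a throwaway built in a temporary directory.

```python
import os
import tempfile
import zipfile

DIRECT_UPLOAD_LIMIT_MB = 50   # zipped, direct upload
UNZIPPED_LIMIT_MB = 250       # unzipped


def inspect_package(zip_path):
    """Summarise a deployment zip and flag Lambda size-limit problems."""
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
    names = [info.filename for info in infos]
    zipped_mb = os.path.getsize(zip_path) / (1024 * 1024)
    unzipped_mb = sum(info.file_size for info in infos) / (1024 * 1024)
    return {
        "files": names,
        "zipped_mb": zipped_mb,
        "unzipped_mb": unzipped_mb,
        "handler_at_root": "lambda_function.py" in names,
        "fits_direct_upload": zipped_mb < DIRECT_UPLOAD_LIMIT_MB,
        "fits_unzipped_limit": unzipped_mb < UNZIPPED_LIMIT_MB,
    }


# Demo on a small throwaway archive
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("lambda_function.py", "def lambda_handler(event, context):\n    return {}\n")
        zf.writestr("model.joblib", b"\x00" * 1024)
    report = inspect_package(demo_zip)

print(report["files"], report["handler_at_root"])
```

For the package built above, the call would be `inspect_package(package_path)`.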

### 2.3: Lambda Layers for Dependencies

Lambda Layers allow you to separate dependencies from your function code. This has several benefits:
- Faster deployments (don't re-upload dependencies each time)
- Stay under the 50MB direct upload limit
- Share dependencies across multiple functions

**Creating a Lambda Layer (in production):**

```bash
# Structure for Python Lambda Layer
mkdir -p layer/python/lib/python3.9/site-packages
cd layer/python/lib/python3.9/site-packages

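# Install dependencies (run on Linux, or request Linux wheels explicitly,
# so that compiled packages such as numpy match the Lambda runtime)
pip install numpy scikit-learn joblib -t . \
    --platform manylinux2014_x86_64 --implementation cp \
    --python-version 3.9 --only-binary=:all: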

# Create layer zip
cd ../../../..
zip -r sklearn-layer.zip python/

# Upload to AWS (using AWS CLI)
aws lambda publish-layer-version \
    --layer-name sklearn-numpy \
    --zip-file fileb://sklearn-layer.zip \
    --compatible-runtimes python3.9
```

In [None]:
# Simulated Lambda Layer structure
layer_structure = {
    'layer_name': 'sklearn-numpy-layer',
    'compatible_runtimes': ['python3.9'],  # compiled wheels are tied to one Python version
    'size_mb': 45.3,
    'packages': ['numpy', 'scikit-learn', 'joblib'],
    'arn': 'arn:aws:lambda:us-east-1:123456789012:layer:sklearn-numpy-layer:1'
}

print("Lambda Layer Configuration:")
print(json.dumps(layer_structure, indent=2))
print("\n💡 Tip: Community-maintained public layers exist for popular libraries")
print("   Check: https://github.com/keithrozario/Klayers for ready-made layers")

### 2.4: Simulated Lambda Deployment with boto3

Here's how you would deploy to AWS Lambda using boto3 (simulated since we don't have AWS credentials):

In [None]:
# Simulated boto3 Lambda deployment
# In production, you'd use actual boto3 client

class SimulatedLambdaClient:
    """Simulates AWS Lambda API calls for educational purposes"""
    
    def __init__(self):
        self.functions = {}
    
    def create_function(self, FunctionName, Runtime, Role, Handler, Code, 
                       Timeout=30, MemorySize=512, Environment=None, Layers=None):
        """Simulate Lambda function creation"""
        function_config = {
            'FunctionName': FunctionName,
            'FunctionArn': f'arn:aws:lambda:us-east-1:123456789012:function:{FunctionName}',
            'Runtime': Runtime,
            'Role': Role,
            'Handler': Handler,
            'CodeSize': len(Code.get('ZipFile', b'')),
            'Timeout': Timeout,
            'MemorySize': MemorySize,
            'LastModified': datetime.now().isoformat(),
            'State': 'Active',
            'Layers': Layers or []
        }
        self.functions[FunctionName] = function_config
        return function_config
    
    def invoke(self, FunctionName, Payload):
        """Simulate Lambda invocation"""
        if FunctionName not in self.functions:
            raise Exception(f"Function {FunctionName} not found")
        
        # Time the inference (cold starts are not modelled in this simulation)
        start_time = time.time()
        
        # Simulate inference (using our actual model)
        event = json.loads(Payload)
        features = np.array(event['features'])
        predictions = model.predict(features)
        probabilities = model.predict_proba(features)
        
        response_payload = {
            'statusCode': 200,
            'body': json.dumps({
                'predictions': predictions.tolist(),
                'probabilities': probabilities.tolist()
            })
        }
        
        execution_time = (time.time() - start_time) * 1000  # ms
        
        return {
            'StatusCode': 200,
            'ExecutedVersion': '$LATEST',
            'Payload': BytesIO(json.dumps(response_payload).encode()),
            'ExecutionTime': execution_time
        }

# Initialize simulated client
lambda_client = SimulatedLambdaClient()

# Create Lambda function
with open(package_path, 'rb') as f:
    zip_content = f.read()

function_response = lambda_client.create_function(
    FunctionName='iris-classifier',
    Runtime='python3.9',
    Role='arn:aws:iam::123456789012:role/lambda-execution-role',
    Handler='lambda_function.lambda_handler',
    Code={'ZipFile': zip_content},
    Timeout=30,
    MemorySize=512,
    Layers=[
        'arn:aws:lambda:us-east-1:123456789012:layer:sklearn-numpy-layer:1'
    ]
)

print("✅ Lambda Function Created (Simulated)")
print(f"Function Name: {function_response['FunctionName']}")
print(f"Function ARN: {function_response['FunctionArn']}")
print(f"Runtime: {function_response['Runtime']}")
print(f"Memory: {function_response['MemorySize']} MB")
print(f"Timeout: {function_response['Timeout']} seconds")
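
With real credentials, the same arguments go to boto3's `create_function`. The sketch below separates building the arguments (testable offline) from the call itself; `build_create_function_kwargs` and `deploy` are hypothetical helper names, and the role and layer ARNs are the placeholder values used in this notebook.

```python
def build_create_function_kwargs(zip_bytes, role_arn, layer_arns=()):
    """Assemble keyword arguments for boto3's Lambda create_function call."""
    return {
        "FunctionName": "iris-classifier",
        "Runtime": "python3.9",
        "Role": role_arn,
        "Handler": "lambda_function.lambda_handler",
        "Code": {"ZipFile": zip_bytes},
        "Timeout": 30,
        "MemorySize": 512,
        "Layers": list(layer_arns),
    }


def deploy(zip_path, role_arn, layer_arns=(), region="us-east-1"):
    """Create the function for real. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    with open(zip_path, "rb") as f:
        kwargs = build_create_function_kwargs(f.read(), role_arn, layer_arns)
    client = boto3.client("lambda", region_name=region)
    response = client.create_function(**kwargs)
    # Wait until the function leaves the Pending state before invoking it.
    client.get_waiter("function_active").wait(FunctionName=kwargs["FunctionName"])
    return response["FunctionArn"]


example = build_create_function_kwargs(
    b"placeholder-zip-bytes",
    "arn:aws:iam::123456789012:role/lambda-execution-role",
)
print(sorted(example))
```

`deploy` is not called here because it would create billable resources; packages above the direct upload limit would pass an S3 location in `Code` instead of `ZipFile`.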

### 2.5: Testing Lambda Invocation

In [None]:
# Test Lambda invocation
test_payload = {
    'features': [[5.1, 3.5, 1.4, 0.2]]  # Sample iris flower
}

response = lambda_client.invoke(
    FunctionName='iris-classifier',
    Payload=json.dumps(test_payload)
)

# Parse response
result = json.loads(response['Payload'].read())
body = json.loads(result['body'])

print("Lambda Invocation Result:")
print(f"Status Code: {response['StatusCode']}")
print(f"Execution Time: {response['ExecutionTime']:.2f} ms")
print(f"\nPrediction: {body['predictions']}")
print(f"Probabilities: {np.array(body['probabilities'])[0]}")
print(f"\nPredicted Class: {iris.target_names[body['predictions'][0]]}")

## Part 3: API Gateway Integration

API Gateway creates a REST API endpoint for your Lambda function, enabling:
- HTTPS endpoints with custom domains
- Request/response transformation
- Authentication and authorization
- Rate limiting and throttling
- API keys and usage plans

### 3.1: Terraform Configuration for API Gateway + Lambda

In [None]:
# Terraform configuration for complete serverless ML API
terraform_config = '''
# main.tf - Complete serverless ML deployment

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

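variable "aws_region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-1"
}

variable "sklearn_layer_arn" {
  description = "ARN of the Lambda layer that provides numpy, scikit-learn, and joblib"
  type        = string
}

provider "aws" {
  region = var.aws_region
}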

# Lambda Function
resource "aws_lambda_function" "ml_inference" {
  filename         = "lambda_deployment.zip"
  function_name    = "iris-classifier"
  role            = aws_iam_role.lambda_execution.arn
  handler         = "lambda_function.lambda_handler"
  runtime         = "python3.9"
  timeout         = 30
  memory_size     = 512
  
  # Use Lambda layers for dependencies
  layers = [var.sklearn_layer_arn]
  
  environment {
    variables = {
      # The model ships inside lambda_deployment.zip, which Lambda extracts to /var/task
      MODEL_PATH = "/var/task/model.joblib"
    }
  }
  
  # A function URL (separate aws_lambda_function_url resource) is a
  # simpler alternative to API Gateway
}

# IAM Role for Lambda
resource "aws_iam_role" "lambda_execution" {
  name = "iris-classifier-lambda-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "lambda.amazonaws.com"
      }
    }]
  })
}

# Attach basic Lambda execution policy
resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

# API Gateway REST API
resource "aws_api_gateway_rest_api" "ml_api" {
  name        = "iris-classifier-api"
  description = "ML Inference API for Iris Classification"
}

# API Gateway Resource (/predict)
resource "aws_api_gateway_resource" "predict" {
  rest_api_id = aws_api_gateway_rest_api.ml_api.id
  parent_id   = aws_api_gateway_rest_api.ml_api.root_resource_id
  path_part   = "predict"
}

# POST method
resource "aws_api_gateway_method" "predict_post" {
  rest_api_id   = aws_api_gateway_rest_api.ml_api.id
  resource_id   = aws_api_gateway_resource.predict.id
  http_method   = "POST"
  authorization = "NONE"  # Consider using API keys or Cognito in production
}

# Lambda integration
resource "aws_api_gateway_integration" "lambda" {
  rest_api_id = aws_api_gateway_rest_api.ml_api.id
  resource_id = aws_api_gateway_resource.predict.id
  http_method = aws_api_gateway_method.predict_post.http_method

  integration_http_method = "POST"
  type                    = "AWS_PROXY"  # Lambda proxy integration
  uri                     = aws_lambda_function.ml_inference.invoke_arn
}

# Lambda permission for API Gateway
resource "aws_lambda_permission" "api_gateway" {
  statement_id  = "AllowAPIGatewayInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ml_inference.function_name
  principal     = "apigateway.amazonaws.com"
  source_arn    = "${aws_api_gateway_rest_api.ml_api.execution_arn}/*/*"
}

# API Gateway Deployment
resource "aws_api_gateway_deployment" "prod" {
  depends_on = [
    aws_api_gateway_integration.lambda
  ]

  rest_api_id = aws_api_gateway_rest_api.ml_api.id
  stage_name  = "prod"
}

# Outputs
output "api_endpoint" {
  value = "${aws_api_gateway_deployment.prod.invoke_url}/predict"
  description = "API Gateway endpoint URL"
}

output "lambda_arn" {
  value = aws_lambda_function.ml_inference.arn
}
'''

# Save Terraform configuration
with open('lambda_api_gateway.tf', 'w') as f:
    f.write(terraform_config)

print("✅ Terraform configuration saved: lambda_api_gateway.tf")
print("\nTo deploy:")
print("  terraform init")
print("  terraform plan")
print("  terraform apply")
print("\n⚠️ FREE TIER: AWS offers 1M free Lambda requests/month")
print("   API Gateway: 1M free API calls/month (12 months free tier)")
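
With the `AWS_PROXY` integration used in this configuration, the function does not receive the client's JSON directly: API Gateway wraps the request, and the body arrives as a string under `body`. The sketch below builds a minimal event of that shape for local testing; `build_proxy_event` is a hypothetical helper, and a real event carries many more fields.

```python
import json


def build_proxy_event(payload, path="/predict"):
    """Build a minimal event shaped like an API Gateway proxy request."""
    return {
        "resource": path,
        "path": path,
        "httpMethod": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),   # the body arrives as a string
        "isBase64Encoded": False,
    }


event = build_proxy_event({"features": [[5.1, 3.5, 1.4, 0.2]]})

# The payload is not at the top level of the event ...
print(event.get("features"))
# ... it has to be decoded from the body first.
print(json.loads(event["body"])["features"])
```

A handler that serves both direct invocations and API Gateway therefore needs to check for `body` before reading `features`.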

## Part 4: Container-Based Lambda for Larger Models

For models > 250MB, use container-based Lambda:
- Package size up to 10GB
- Full control over runtime environment
- Use any base image (must implement Lambda Runtime API)

### 4.1: Dockerfile for Lambda Container

In [None]:
# Dockerfile for containerized Lambda
dockerfile_content = '''
# Use AWS Lambda Python base image
FROM public.ecr.aws/lambda/python:3.9

# Install system dependencies if needed
RUN yum install -y gcc-c++ && yum clean all

# Copy requirements and install Python dependencies
COPY requirements.txt ${LAMBDA_TASK_ROOT}/
RUN pip install --no-cache-dir -r ${LAMBDA_TASK_ROOT}/requirements.txt

# Copy model and function code (the notebook saves the model as iris_model.joblib)
COPY iris_model.joblib ${LAMBDA_TASK_ROOT}/model.joblib
COPY lambda_function.py ${LAMBDA_TASK_ROOT}/

# Set the CMD to your handler
CMD ["lambda_function.lambda_handler"]
'''

# Save Dockerfile
with open('Dockerfile.lambda', 'w') as f:
    f.write(dockerfile_content)

print("✅ Dockerfile created: Dockerfile.lambda")
print("\nTo build and deploy:")
print("  1. Build: docker build -t iris-classifier -f Dockerfile.lambda .")
print("  2. Tag: docker tag iris-classifier:latest <account-id>.dkr.ecr.<region>.amazonaws.com/iris-classifier:latest")
print("  3. Push to ECR: docker push <account-id>.dkr.ecr.<region>.amazonaws.com/iris-classifier:latest")
print("  4. Create Lambda from container image in AWS Console or Terraform")
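
AWS base images bundle the Runtime Interface Emulator, so the image can be exercised locally before it is pushed to ECR: start it with `docker run -p 9000:8080 iris-classifier` and POST an event to the emulator's invocation path. The sketch below builds that request with the standard library; the helper names are hypothetical, and `invoke_local` only works while the container is running.

```python
import json
import urllib.request

# Invocation path exposed by the Runtime Interface Emulator in AWS base images.
RIE_PATH = "/2015-03-31/functions/function/invocations"


def build_local_invoke_request(payload, host="localhost", port=9000):
    """Build a POST request for a container started with -p 9000:8080."""
    return urllib.request.Request(
        url=f"http://{host}:{port}{RIE_PATH}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(payload, timeout=30):
    """Send the request; requires the container to be running locally."""
    request = build_local_invoke_request(payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


request = build_local_invoke_request({"features": [[5.1, 3.5, 1.4, 0.2]]})
print(request.get_method(), request.full_url)
```

The emulator passes the posted JSON to the handler as the event, so this exercises the direct-invocation path rather than the API Gateway one.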

## Part 5: Azure Functions for ML

Azure Functions is Microsoft's serverless platform. It works much like Lambda and integrates with the Azure ecosystem.

### 5.1: Azure Function Handler

In [None]:
# Azure Functions code (in __init__.py)
azure_function_code = '''
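import azure.functions as func
import json
import joblib
import numpy as np
import logging
import os

# Load model once at startup. Resolve the path relative to this file,
# since the working directory may not be the function's folder.
model = joblib.load(os.path.join(os.path.dirname(__file__), 'model.joblib'))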

def main(req: func.HttpRequest) -> func.HttpResponse:
    """
    Azure Function for ML inference
    """
    logging.info('Python HTTP trigger function processed a request.')
    
    try:
        # Parse request body
        req_body = req.get_json()
        features = req_body.get('features')
        
        if features is None:
            return func.HttpResponse(
                json.dumps({'error': 'Missing features field'}),
                status_code=400
            )
        
        # Make prediction
        X = np.array(features)
        predictions = model.predict(X)
        probabilities = model.predict_proba(X)
        
        # Return response
        response = {
            'predictions': predictions.tolist(),
            'probabilities': probabilities.tolist()
        }
        
        return func.HttpResponse(
            json.dumps(response),
            mimetype='application/json',
            status_code=200
        )
        
    except Exception as e:
        logging.error(f'Error: {str(e)}')
        return func.HttpResponse(
            json.dumps({'error': str(e)}),
            status_code=500
        )
'''

# Azure function.json configuration
azure_function_config = {
    "scriptFile": "__init__.py",
    "bindings": [
        {
            "authLevel": "function",
            "type": "httpTrigger",
            "direction": "in",
            "name": "req",
            "methods": ["post"]
        },
        {
            "type": "http",
            "direction": "out",
            "name": "$return"
        }
    ]
}

print("Azure Functions Configuration:")
print(json.dumps(azure_function_config, indent=2))
print("\n⚠️ FREE TIER: Azure offers 1M free executions/month")
print("   Plus 400,000 GB-seconds of compute")

## Part 6: Google Cloud Functions

Google Cloud Functions (2nd gen) uses Cloud Run underneath and supports up to 16GB memory.

### 6.1: Google Cloud Function Handler

In [None]:
# Google Cloud Functions code (main.py)
gcp_function_code = '''
import functions_framework
import json
import joblib
import numpy as np

# Load model at startup
model = joblib.load('model.joblib')

@functions_framework.http
def predict(request):
    """
    HTTP Cloud Function for ML inference
    """
    # Set CORS headers for web applications
    if request.method == 'OPTIONS':
        headers = {
            'Access-Control-Allow-Origin': '*',
            'Access-Control-Allow-Methods': 'POST',
            'Access-Control-Allow-Headers': 'Content-Type',
        }
        return ('', 204, headers)
    
    headers = {'Access-Control-Allow-Origin': '*'}
    
    try:
        request_json = request.get_json(silent=True)
        
        if request_json is None or 'features' not in request_json:
            return (json.dumps({'error': 'Missing features field'}), 400, headers)
        
        features = request_json['features']
        X = np.array(features)
        
        predictions = model.predict(X)
        probabilities = model.predict_proba(X)
        
        response = {
            'predictions': predictions.tolist(),
            'probabilities': probabilities.tolist()
        }
        
        return (json.dumps(response), 200, headers)
        
    except Exception as e:
        return (json.dumps({'error': str(e)}), 500, headers)
'''

print("Google Cloud Functions Code:")
print(gcp_function_code[:500] + "...")
print("\n⚠️ FREE TIER: GCP offers 2M invocations/month")
print("   Plus 400,000 GB-seconds, 200,000 GHz-seconds compute")

## Part 7: Cold Start Optimization

Cold starts are the biggest challenge with serverless ML. Here are optimization strategies:

### 7.1: Measuring Cold Starts

In [None]:
def simulate_cold_starts(num_invocations=10, cold_start_probability=0.3):
    """
    Simulate Lambda invocations with cold starts
    
    Cold starts occur when:
    - Function has been idle for a while (often several minutes; the exact window is not documented)
    - Scaling up requires new container instances
    """
    results = []
    
    for i in range(num_invocations):
        is_cold_start = np.random.random() < cold_start_probability
        
        if is_cold_start:
            # Cold start: container init + model loading + inference
            container_init = np.random.uniform(100, 500)  # ms
            model_loading = np.random.uniform(200, 1000)  # ms
            inference = np.random.uniform(10, 50)  # ms
            total = container_init + model_loading + inference
        else:
            # Warm start: only inference time
            inference = np.random.uniform(10, 50)  # ms
            total = inference
        
        results.append({
            'invocation': i + 1,
            'cold_start': is_cold_start,
            'latency_ms': total
        })
    
    return pd.DataFrame(results)

# Simulate invocations
latency_data = simulate_cold_starts(num_invocations=100)

# Analyze results
cold_starts = latency_data[latency_data['cold_start']]
warm_starts = latency_data[~latency_data['cold_start']]

print("Cold Start Analysis:")
print(f"Total invocations: {len(latency_data)}")
print(f"Cold starts: {len(cold_starts)} ({len(cold_starts)/len(latency_data)*100:.1f}%)")
print(f"Warm starts: {len(warm_starts)} ({len(warm_starts)/len(latency_data)*100:.1f}%)")
print(f"\nLatency Statistics:")
print(f"Cold start latency: {cold_starts['latency_ms'].mean():.1f} ms (avg)")
print(f"Warm start latency: {warm_starts['latency_ms'].mean():.1f} ms (avg)")
print(f"P95 cold start: {cold_starts['latency_ms'].quantile(0.95):.1f} ms")
print(f"P95 warm start: {warm_starts['latency_ms'].quantile(0.95):.1f} ms")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Latency over time
axes[0].scatter(latency_data[latency_data['cold_start']]['invocation'],
                latency_data[latency_data['cold_start']]['latency_ms'],
                color='red', label='Cold Start', alpha=0.6, s=50)
axes[0].scatter(latency_data[~latency_data['cold_start']]['invocation'],
                latency_data[~latency_data['cold_start']]['latency_ms'],
                color='green', label='Warm Start', alpha=0.6, s=50)
axes[0].set_xlabel('Invocation Number')
axes[0].set_ylabel('Latency (ms)')
axes[0].set_title('Lambda Invocation Latency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Latency distribution
axes[1].hist(cold_starts['latency_ms'], bins=20, alpha=0.6, color='red', label='Cold Start')
axes[1].hist(warm_starts['latency_ms'], bins=20, alpha=0.6, color='green', label='Warm Start')
axes[1].set_xlabel('Latency (ms)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Latency Distribution')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
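
The simulation above draws cold starts at random. In a deployed function, one common way to observe them is a module-level counter: module code runs once per execution environment, so the first invocation in each environment is the cold one. The sketch below assumes that approach; `instrumented_handler` is a hypothetical name, and the empty events are placeholders.

```python
import json

_invocation_count = 0   # reset only when a new execution environment starts


def instrumented_handler(event, context):
    """Tag each invocation as cold or warm and log the result."""
    global _invocation_count
    _invocation_count += 1
    record = {
        "cold_start": _invocation_count == 1,
        "invocation_in_environment": _invocation_count,
    }
    print(json.dumps(record))   # in Lambda, stdout lands in CloudWatch Logs
    return record


first = instrumented_handler({}, None)
second = instrumented_handler({}, None)
```

Lambda's own `REPORT` log line also includes an `Init Duration` field on cold starts, which gives the initialization time without any code changes.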

### 7.2: Cold Start Optimization Techniques

In [None]:
cold_start_optimizations = [
    {
        'technique': 'Provisioned Concurrency',
        'description': 'Keep N instances warm at all times',
        'cost_impact': 'High - pay for idle instances',
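        'latency_improvement': 'Removes cold starts for requests served by provisioned instances',
        'use_case': 'Production APIs with consistent traffic',
        'aws_example': '''
# Provisioned concurrency applies to a published version or alias
aws lambda put-provisioned-concurrency-config \
    --function-name iris-classifier \
    --qualifier <version-or-alias> \
    --provisioned-concurrent-executions 2
'''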
    },
    {
        'technique': 'Reduce Package Size',
        'description': 'Use Lambda layers, remove unused dependencies',
        'cost_impact': 'None',
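        'latency_improvement': 'Faster cold starts (less code to download and import)',
        'use_case': 'All serverless deployments',
        'example': '''
# Importing a submodule does not shrink the package: the whole library is
# still shipped. Trim what gets installed instead, then verify the function
# still imports, for example:
pip install --no-cache-dir -t python/ scikit-learn joblib
find python/ -type d -name "__pycache__" -exec rm -rf {} +
find python/ -type d -name "tests" -exec rm -rf {} +
'''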
    },
    {
        'technique': 'Lazy Loading',
        'description': 'Load model only when needed, cache in global scope',
        'cost_impact': 'None',
        'latency_improvement': 'Faster cold starts, same warm performance',
        'use_case': 'Functions with multiple code paths',
        'example': '''
model = None  # Global variable

def load_model():
    global model
    if model is None:  # Load only once
        model = joblib.load('model.joblib')
    return model
'''
    },
    {
        'technique': 'Use Smaller Models',
        'description': 'Quantize, prune, or use simpler architectures',
        'cost_impact': 'None',
        'latency_improvement': 'Faster loading and inference',
        'use_case': 'When model complexity isn\'t critical',
        'example': '''
# Use fewer trees/parameters
model = RandomForestClassifier(n_estimators=10, max_depth=5)
# vs
model = RandomForestClassifier(n_estimators=100, max_depth=None)
'''
    },
    {
        'technique': 'Warm-up Pings',
        'description': 'Scheduled EventBridge (CloudWatch Events) rule to keep function warm',
        'cost_impact': 'Low - minimal invocation costs',
        'latency_improvement': 'Reduces cold start frequency',
        'use_case': 'Predictable traffic patterns',
        'aws_example': '''
# EventBridge rule (every 5 minutes). The rule alone invokes nothing: attach
# the function with `aws events put-targets` and allow events.amazonaws.com
# to invoke it.
aws events put-rule --schedule-expression "rate(5 minutes)" \
    --name WarmupLambda
'''
    },
    {
        'technique': 'Increase Memory',
        'description': 'More memory = more CPU = faster initialization',
        'cost_impact': 'Medium - higher per-invocation cost',
        'latency_improvement': 'Often faster for compute-heavy initialization (measure for your model)',
        'use_case': 'When initialization is CPU-bound',
        'example': 'Set Lambda memory to 1024MB or 2048MB instead of 512MB'
    }
]

# Display as DataFrame
optimization_df = pd.DataFrame(cold_start_optimizations)
print("Cold Start Optimization Strategies:\n")
for idx, row in optimization_df.iterrows():
    print(f"\n{idx+1}. {row['technique']}")
    print(f"   Description: {row['description']}")
    print(f"   Cost Impact: {row['cost_impact']}")
    print(f"   Improvement: {row['latency_improvement']}")
    print(f"   Best For: {row['use_case']}")
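
Warm-up pings only help if the handler recognizes them and returns before running inference. EventBridge scheduled events carry `"source": "aws.events"` and `"detail-type": "Scheduled Event"`, which is enough to detect them. The sketch below shows one way to short-circuit; the `warmup` key and the `handler_with_warmup` wrapper are hypothetical conventions, not part of any AWS API.

```python
def is_warmup_event(event):
    """True for EventBridge scheduled events or an explicit warm-up marker."""
    if not isinstance(event, dict):
        return False
    scheduled = (
        event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )
    # "warmup" is a hypothetical marker a custom pinger could send
    return scheduled or event.get("warmup") is True


def handler_with_warmup(event, context, predict):
    """Return early on warm-up pings; otherwise delegate to `predict`."""
    if is_warmup_event(event):
        return {"statusCode": 200, "body": "warm"}
    return predict(event)


ping = {"source": "aws.events", "detail-type": "Scheduled Event"}
print(handler_with_warmup(ping, None, predict=lambda e: {"statusCode": 200}))
```

A single scheduled ping keeps roughly one execution environment warm; it does not prevent cold starts when traffic scales beyond that.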

## Part 8: Cost Comparison - Serverless vs Dedicated Endpoints

When should you use serverless vs dedicated endpoints? Let's compare costs.

In [None]:
def calculate_serverless_cost(requests_per_month, avg_duration_ms=500, memory_mb=512):
    """
    Calculate AWS Lambda costs
    
    Pricing (as of 2024):
    - $0.20 per 1M requests
    - $0.0000166667 per GB-second
    - Free tier: 1M requests + 400,000 GB-seconds/month
    """
    # Request charges
    free_requests = 1_000_000
    billable_requests = max(0, requests_per_month - free_requests)
    request_cost = (billable_requests / 1_000_000) * 0.20
    
    # Compute charges
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * requests_per_month
    free_gb_seconds = 400_000
    billable_gb_seconds = max(0, gb_seconds - free_gb_seconds)
    compute_cost = billable_gb_seconds * 0.0000166667
    
    total_cost = request_cost + compute_cost
    
    # Cost without the free tier, used to report how much the free tier saves
    gross_cost = (requests_per_month / 1_000_000) * 0.20 + gb_seconds * 0.0000166667
    
    return {
        'request_cost': request_cost,
        'compute_cost': compute_cost,
        'total_cost': total_cost,
        'free_tier_savings': gross_cost - total_cost
    }

def calculate_endpoint_cost(instance_type='ml.t3.medium', hours_per_month=730):
    """
    Calculate SageMaker endpoint costs
    
    Pricing examples (illustrative; rates vary by region and change over time):
    - ml.t3.medium: $0.065/hour
    - ml.m5.large: $0.134/hour
    - ml.c5.xlarge: $0.238/hour
    """
    pricing = {
        'ml.t3.medium': 0.065,
        'ml.m5.large': 0.134,
        'ml.c5.xlarge': 0.238
    }
    
    hourly_rate = pricing.get(instance_type, 0.065)
    total_cost = hourly_rate * hours_per_month
    
    return total_cost

# Compare different traffic levels
traffic_scenarios = [
    {'name': 'Low Traffic', 'requests_per_month': 10_000},
    {'name': 'Medium Traffic', 'requests_per_month': 500_000},
    {'name': 'High Traffic', 'requests_per_month': 5_000_000},
    {'name': 'Very High Traffic', 'requests_per_month': 50_000_000}
]

comparison_results = []

for scenario in traffic_scenarios:
    serverless = calculate_serverless_cost(scenario['requests_per_month'])
    endpoint = calculate_endpoint_cost('ml.t3.medium')
    
    comparison_results.append({
        'Scenario': scenario['name'],
        'Requests/Month': f"{scenario['requests_per_month']:,}",
        'Lambda Cost': f"${serverless['total_cost']:.2f}",
        'SageMaker Endpoint': f"${endpoint:.2f}",
        'Cheaper Option': 'Lambda' if serverless['total_cost'] < endpoint else 'Endpoint',
        'Savings': f"${abs(serverless['total_cost'] - endpoint):.2f}"
    })

comparison_df = pd.DataFrame(comparison_results)
print("Cost Comparison: Serverless vs Dedicated Endpoint\n")
print(comparison_df.to_string(index=False))
print("\n💡 Key Insights:")
print("   - Serverless is cheaper for < 1M requests/month")
print("   - Dedicated endpoints become cost-effective at high volumes")
print("   - Consider latency requirements and traffic predictability")
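
The table above compares fixed traffic levels; the crossover point itself can also be estimated directly. The following is a minimal sketch, not part of the original calculator: it reuses the Lambda rates and the ml.t3.medium price quoted in this notebook, and it assumes a hypothetical 512MB function with a 200ms average duration. The helper names `lambda_monthly_cost` and `break_even_requests` are illustrative.

```python
def lambda_monthly_cost(requests, memory_mb=512, avg_duration_ms=200):
    """Monthly Lambda cost in USD using the rates quoted in this notebook.

    memory_mb and avg_duration_ms defaults are assumptions for illustration.
    """
    request_cost = max(0, requests - 1_000_000) / 1_000_000 * 0.20
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * requests
    compute_cost = max(0, gb_seconds - 400_000) * 0.0000166667
    return request_cost + compute_cost


def break_even_requests(endpoint_monthly_cost, memory_mb=512, avg_duration_ms=200):
    """Smallest request volume at which Lambda costs at least as much as the endpoint."""
    lo, hi = 0, 1
    # Grow the upper bound until Lambda is at least as expensive as the endpoint
    while lambda_monthly_cost(hi, memory_mb, avg_duration_ms) < endpoint_monthly_cost:
        hi *= 2
    # Cost never decreases as requests grow, so bisection is valid
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if lambda_monthly_cost(mid, memory_mb, avg_duration_ms) < endpoint_monthly_cost:
            lo = mid
        else:
            hi = mid
    return hi


endpoint_cost = 0.065 * 730  # ml.t3.medium running all month
threshold = break_even_requests(endpoint_cost)
print(f"Break-even at roughly {threshold:,} requests/month")
```

The break-even volume moves substantially with memory size and duration, so it is worth rerunning the search with values measured from your own function.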

## Summary

In this notebook, you learned:

1. **Serverless ML Fundamentals**
   - When to use serverless for ML inference
   - Advantages and limitations
   - Cost-effectiveness for low-volume traffic

2. **AWS Lambda Deployment**
   - Creating Lambda functions with ML models
   - Using Lambda layers for dependencies
   - Deployment packages and size limits
   - Container-based Lambda for larger models

3. **API Gateway Integration**
   - Creating production-ready REST APIs
   - Terraform infrastructure as code
   - Authentication and rate limiting

4. **Multi-Cloud Serverless**
   - Azure Functions for ML
   - Google Cloud Functions
   - Platform comparison

5. **Cold Start Optimization**
   - Measuring and analyzing cold starts
   - Provisioned concurrency
   - Package size reduction
   - Lazy loading and warm-up strategies

6. **Cost Analysis**
   - Serverless vs dedicated endpoint pricing
   - Free tier maximization
   - Traffic-based decision making

### When to Use Serverless for ML:
✅ Low or variable traffic (< 1M requests/month)  
✅ Cost optimization is critical  
✅ Can tolerate cold start latency  
✅ Model size < 10GB  
✅ Simple inference (no complex preprocessing)  

### When to Use Dedicated Endpoints:
✅ High, consistent traffic  
✅ Strict latency requirements (< 100ms)  
✅ Large models requiring GPU  
✅ Complex preprocessing pipelines  
✅ Need for auto-scaling with no cold starts  
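
The two checklists can be turned into a rough first-pass filter. This is a sketch under the rule-of-thumb thresholds stated above (1M requests/month, 10GB model size, 100ms latency); the function name `recommend_deployment` and its parameters are hypothetical, and real decisions should also weigh measured costs and traffic patterns.

```python
def recommend_deployment(requests_per_month, model_size_gb, max_latency_ms, needs_gpu=False):
    """Return (recommendation, reasons) based on this notebook's rules of thumb."""
    reasons = []
    if needs_gpu:
        reasons.append("model requires a GPU")
    if model_size_gb >= 10:
        reasons.append("model is at or above the 10GB size limit")
    if max_latency_ms < 100:
        reasons.append("latency budget under 100ms leaves no room for cold starts")
    if requests_per_month >= 1_000_000:
        reasons.append("traffic is at or above 1M requests/month")

    if reasons:
        return "dedicated endpoint", reasons
    return "serverless", ["low traffic, small model, and tolerant of cold starts"]


choice, why = recommend_deployment(
    requests_per_month=50_000, model_size_gb=0.2, max_latency_ms=500
)
print(choice, "-", "; ".join(why))
```

Any single disqualifying condition pushes the recommendation to a dedicated endpoint, which mirrors how the lists read: serverless fits only when every item in its checklist holds.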

## Next Steps

- **[Module 08: Cost Optimization Strategies](08_cost_optimization_strategies.ipynb)**: Deep dive into cloud cost management
- **[Module 09: Multi-Cloud ML Considerations](09_multi_cloud_ml_considerations.ipynb)**: Cross-platform ML deployment
- **Practice**: Deploy your own model to Lambda using the free tier
- **Explore**: Provisioned concurrency and edge deployment (Lambda@Edge)

## Additional Resources

- [AWS Lambda Documentation](https://docs.aws.amazon.com/lambda/)
- [Serverless Framework](https://www.serverless.com/) - Multi-cloud deployment tool
- [AWS Lambda Powertools](https://awslabs.github.io/aws-lambda-powertools-python/) - Best practices utilities
- [Azure Functions Documentation](https://learn.microsoft.com/en-us/azure/azure-functions/)
- [Google Cloud Functions Python](https://cloud.google.com/functions/docs/create-deploy-http-python)

## Exercises

### Exercise 1: Model Size Optimization ⭐

Train two versions of a model:
1. Full model with max performance
2. Lightweight model optimized for serverless (< 10MB)

Compare:
- Model sizes
- Accuracy differences
- Loading times

**Hint**: Use fewer estimators, lower max_depth, or try a simpler algorithm.

In [None]:
# Your code here


### Exercise 2: Lambda Handler with Validation ⭐⭐

Enhance the Lambda handler to include:
1. Input validation (check feature count, data types)
2. Response time logging
3. Error handling for malformed requests
4. Return confidence scores only if above threshold

Test with various inputs including edge cases.

In [None]:
# Your code here


### Exercise 3: Cost Analysis for Your Use Case ⭐⭐

Create a cost calculator that:
1. Takes your expected monthly requests as input
2. Calculates costs for Lambda, Azure Functions, and Google Cloud Functions
3. Calculates equivalent SageMaker/Azure ML/Vertex AI endpoint costs
4. Recommends the most cost-effective option
5. Shows cost at different traffic levels (plot)

Consider:
- Free tier benefits
- Traffic variability
- Geographic region pricing differences

In [None]:
# Your code here


### Exercise 4: Cold Start Mitigation Strategy ⭐⭐⭐

Design and implement a complete cold start mitigation strategy:

1. **Measure baseline**: Simulate cold vs warm starts
2. **Apply optimizations**:
   - Reduce package size
   - Implement lazy loading
   - Use global variable caching
3. **Implement warm-up logic**: Scheduled pings to keep function warm
4. **Compare before/after**: Plot latency distributions
5. **Calculate ROI**: Cost of optimizations vs latency improvement

**Bonus**: Implement provisioned concurrency logic and calculate when it's worth the cost.

In [None]:
# Your code here


### Exercise 5: Multi-Cloud Deployment Comparison ⭐⭐⭐

Create a comprehensive comparison of deploying the same model to:
- AWS Lambda
- Azure Functions
- Google Cloud Functions

Compare:
1. Deployment complexity (steps required)
2. Package size limits
3. Memory and timeout limits
4. Cold start characteristics
5. Pricing at different traffic levels
6. Integration with other services
7. Free tier benefits

Present findings in a decision matrix to help choose the best platform for different scenarios.

In [None]:
# Your code here
