# Lab 4.4.3: Deploying Models to AWS SageMaker

**Module:** 4.4 - Containerization & Cloud Deployment  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand AWS SageMaker deployment concepts
- [ ] Package models for SageMaker deployment
- [ ] Deploy HuggingFace models to real-time endpoints
- [ ] Configure auto-scaling for production
- [ ] Monitor endpoint performance
- [ ] Calculate and optimize costs

---

## Prerequisites

- AWS Account with SageMaker permissions
- AWS CLI configured (`aws configure`)
- Completed: Labs 4.4.1-4.4.2

**Note:** This lab can be completed in simulation mode without an AWS account.

---

## Real-World Context

**When to use SageMaker vs. self-hosted?**

| Use Case | SageMaker | Self-Hosted (DGX Spark) |
|----------|-----------|------------------------|
| Variable traffic | Best (auto-scaling) | Harder |
| Cost at scale | $$$ | $ (if GPU already owned) |
| Latency sensitive | Good (multi-region) | Best (no network hop) |
| Compliance | AWS certifications | Full control |
| Experimentation | More setup | Faster iteration |

**SageMaker shines when:**
- You need global availability
- Traffic is unpredictable
- You want managed infrastructure

---

## ELI5: What is SageMaker?

> **Imagine running a lemonade stand...**
>
> You could build your own stand, buy ingredients, make lemonade, and sell it yourself.
>
> **OR** you could use a food truck service that:
> - Provides the truck (infrastructure)
> - Handles permits (security)
> - Sends more trucks when busy (auto-scaling)
> - Tracks your sales (monitoring)
>
> **SageMaker is that food truck service for ML models.** You bring the recipe (your model), they handle everything else.
>
> **The tradeoff?** You pay rent for the trucks instead of owning them.

---

## Part 1: SageMaker Architecture

### Key Components

```
┌─────────────────────────────────────────────────────────────┐
│                    AWS SageMaker                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │    Model     │    │   Endpoint   │    │   Endpoint   │  │
│  │   Artifact   │───>│    Config    │───>│  (Runtime)   │  │
│  │   (S3)       │    │              │    │              │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                 │           │
│                                                 ▼           │
│                                          ┌──────────────┐  │
│                                          │  Auto-Scale  │  │
│                                          │   Variants   │  │
│                                          └──────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```

| Component | Description |
|-----------|-------------|
| **Model Artifact** | Your model files (model.tar.gz in S3) |
| **Endpoint Config** | Instance type, count, model settings |
| **Endpoint** | The actual running inference service |
| **Variant** | Traffic-split for A/B testing |
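
The resources in the table correspond to three SageMaker API calls made in order: create a model, create an endpoint config, create an endpoint. The sketch below only builds the request payloads, so it runs without AWS access. The helper name `build_endpoint_requests`, the image URI, the bucket, and the role ARN are placeholders for illustration, not values from this lab.

```python
def build_endpoint_requests(name, image_uri, model_data_url, role_arn,
                            instance_type="ml.g5.xlarge", instance_count=1):
    """Return keyword arguments for the three boto3 SageMaker calls (hypothetical helper)."""
    model = {
        "ModelName": f"{name}-model",
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }
    endpoint_config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model["ModelName"],
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
            "InitialVariantWeight": 1.0,
        }],
    }
    endpoint = {
        "EndpointName": name,
        "EndpointConfigName": endpoint_config["EndpointConfigName"],
    }
    return model, endpoint_config, endpoint


model_req, config_req, endpoint_req = build_endpoint_requests(
    name="demo-llm",
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-inference:latest",
    model_data_url="s3://my-bucket/models/model.tar.gz",
    role_arn="arn:aws:iam::123456789012:role/MySageMakerRole",
)

# With real credentials, the payloads would be sent in this order:
#   sm = boto3.client("sagemaker", region_name="us-west-2")
#   sm.create_model(**model_req)
#   sm.create_endpoint_config(**config_req)
#   sm.create_endpoint(**endpoint_req)
print(config_req["ProductionVariants"][0])
```

Keeping the endpoint config separate from the endpoint is what allows a later update to swap instance types or add a second variant without renaming the endpoint.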

In [None]:
# Check for AWS credentials and dependencies
import os
import subprocess

print("AWS Environment Check")
print("=" * 60)

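# Check AWS CLI (subprocess.run raises FileNotFoundError if the binary is missing)
try:
    result = subprocess.run(["aws", "--version"], capture_output=True, text=True)
    cli_ok = result.returncode == 0
except FileNotFoundError:
    cli_ok = False

if cli_ok:
    print(f"AWS CLI: {result.stdout.strip().split()[0]}")
else:
    print("AWS CLI not installed")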

# Check credentials
creds_configured = False
if os.path.exists(os.path.expanduser("~/.aws/credentials")):
    print("AWS Credentials: Configured")
    creds_configured = True
elif os.environ.get("AWS_ACCESS_KEY_ID"):
    print("AWS Credentials: From environment")
    creds_configured = True
else:
    print("AWS Credentials: Not configured")
    print("   Run: aws configure")

# Check boto3
try:
    import boto3
    print(f"boto3: {boto3.__version__}")
except ImportError:
    print("boto3 not installed. Run: pip install boto3")

# Check sagemaker SDK
try:
    import sagemaker
    print(f"sagemaker SDK: {sagemaker.__version__}")
except ImportError:
    print("sagemaker not installed. Run: pip install sagemaker")

print("\n" + "=" * 60)
if not creds_configured:
    print("\nThis lab will run in SIMULATION mode.")
    print("All examples will work without real AWS access.")

In [None]:
# Import our cloud utilities
import sys
sys.path.insert(0, '..')

from scripts.cloud_utils import (
    SageMakerDeployer,
    estimate_cloud_costs,
    compare_platforms,
    create_deployment_checklist,
)

print("Cloud utilities loaded!")

---

## Part 2: Understanding SageMaker Instance Types

### GPU Instances for ML Inference

| Instance | GPU | VRAM | Cost/hr | Best For |
|----------|-----|------|---------|----------|
| ml.g5.xlarge | 1x A10G | 24GB | ~$1.00 | 7B models |
| ml.g5.2xlarge | 1x A10G | 24GB | ~$1.50 | 7B + more CPU |
| ml.g5.4xlarge | 1x A10G | 24GB | ~$2.50 | 7B + large batch |
| ml.g5.12xlarge | 4x A10G | 96GB | ~$7.60 | 30B models |
| ml.p4d.24xlarge | 8x A100 | 320GB | ~$32.77 | 70B+ models |

### Choosing the Right Instance

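**Rule of thumb:** FP16 stores 2 bytes per parameter, so GPU memory in GB ≈ 2 × the parameter count in billions. Leave headroom on top of that for the KV cache and activations.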

| Model Size | FP16 Memory | Recommended Instance |
|------------|-------------|----------------------|
| 7B params | ~14GB | ml.g5.xlarge |
| 13B params | ~26GB | ml.g5.12xlarge |
| 70B params | ~140GB | ml.p4d.24xlarge |
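
The rule of thumb can be turned into a small sizing helper. This is a rough sketch: the VRAM figures come from the table in this part, while the helper names and the 1.2 overhead factor are assumptions chosen for illustration, not measured values.

```python
# GPU memory per instance, taken from the instance table in this part.
INSTANCE_VRAM_GB = {
    "ml.g5.xlarge": 24,
    "ml.g5.12xlarge": 96,
    "ml.p4d.24xlarge": 320,
}

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_memory_gb(params_billion, dtype="fp16", overhead=1.2):
    """Weight footprint times an assumed overhead factor (KV cache, activations)."""
    return params_billion * BYTES_PER_PARAM[dtype] * overhead


def pick_instance(params_billion, dtype="fp16", overhead=1.2):
    """Return the smallest listed instance whose VRAM covers the estimate, else None."""
    needed = estimate_memory_gb(params_billion, dtype, overhead)
    for name, vram in sorted(INSTANCE_VRAM_GB.items(), key=lambda kv: kv[1]):
        if vram >= needed:
            return name
    return None


for size in (7, 13, 70):
    print(f"{size}B fp16 -> ~{estimate_memory_gb(size):.0f} GB -> {pick_instance(size)}")
```

Passing `dtype="int4"` shows how quantization moves a model down an instance tier, which is usually the largest single cost lever.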

In [None]:
# Let's look at SageMaker instance pricing
print("SageMaker GPU Instance Pricing (us-west-2)")
print("=" * 60)

for instance, price in SageMakerDeployer.INSTANCE_PRICING.items():
    if 'g5' in instance or 'p4' in instance or 'p5' in instance:
        monthly = price * 24 * 30
        print(f"{instance:20} ${price:>8.3f}/hr  (${monthly:>8,.0f}/month)")

print("\nTip: Spot pricing applies to SageMaker training jobs, not real-time endpoints. For inference, scale in when idle or consider a SageMaker Savings Plan.")

---

## Part 3: Deploying a HuggingFace Model

SageMaker has built-in support for HuggingFace models through their Deep Learning Containers.
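
For reference, here is roughly what a deployment through the HuggingFace LLM container looks like with the SageMaker Python SDK, independent of this lab's `SageMakerDeployer` wrapper. Treat it as a sketch under assumptions: `build_tgi_env` and `deploy_llm` are hypothetical helpers, the role ARN is supplied by you, and `deploy_llm` is defined but not called because it needs real AWS credentials. Gated models such as Llama 2 would also need a Hub access token in the environment.

```python
def build_tgi_env(model_id, num_gpus=1, max_input_length=2048,
                  max_total_tokens=4096, quantize=None):
    """Build container environment variables (all values must be strings)."""
    if max_total_tokens <= max_input_length:
        raise ValueError("max_total_tokens must exceed max_input_length")
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": str(num_gpus),
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
    }
    if quantize:
        env["HF_MODEL_QUANTIZE"] = quantize
    return env


def deploy_llm(role_arn, env, instance_type="ml.g5.xlarge"):
    """Deploy via the SageMaker Python SDK. Needs AWS credentials; not called here."""
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface"),
        env=env,
        role=role_arn,
    )
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        container_startup_health_check_timeout=600,  # Allow time to load weights
    )


env = build_tgi_env("meta-llama/Llama-2-7b-chat-hf", quantize="bitsandbytes-nf4")
print(env)
```

Validating the token limits before deployment is cheaper than discovering a misconfiguration after a ten-minute container start.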

In [None]:
# Create a SageMaker deployer
deployer = SageMakerDeployer(region="us-west-2")

print("SageMaker Deployer initialized")
print(f"Region: {deployer.region}")
print(f"AWS SDK available: {deployer._sagemaker_available}")

In [None]:
# Deploy a HuggingFace model (simulated if no AWS access)
print("Deploying model to SageMaker...")
print("=" * 60)

# Configuration
model_config = {
    "model_id": "meta-llama/Llama-2-7b-chat-hf",
    "instance_type": "ml.g5.xlarge",
    "instance_count": 1,
    "quantization": "bitsandbytes4",  # 4-bit quantization for memory efficiency
    "environment": {
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
        "MAX_BATCH_SIZE": "8",
    }
}

print(f"Model: {model_config['model_id']}")
print(f"Instance: {model_config['instance_type']}")
print(f"Quantization: {model_config['quantization']}")
print()

# Deploy (will simulate if no AWS access)
endpoint = deployer.deploy_huggingface_model(**model_config)

print("\nDeployment Result:")
print(f"  Endpoint Name: {endpoint.name}")
print(f"  Status: {endpoint.status}")
print(f"  Instance Type: {endpoint.instance_type}")
print(f"  Cost: ${endpoint.cost_per_hour:.2f}/hour")
print(f"  Monthly Cost: ${endpoint.cost_per_hour * 24 * 30:.2f}/month")

In [None]:
# Show the deployment details
import json

print("Endpoint Details:")
print("=" * 60)
print(json.dumps(endpoint.to_dict(), indent=2, default=str))

---

## Part 4: Invoking the Endpoint

Once deployed, you can invoke the endpoint via the SageMaker Runtime API.

In [None]:
# Example: Invoke the endpoint
print("Invoking SageMaker Endpoint")
print("=" * 60)

# Request payload
payload = {
    "inputs": "What is the capital of France?",
    "parameters": {
        "max_new_tokens": 100,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
    }
}

print(f"Request: {payload['inputs']}")
print()

# Invoke (simulated response if no AWS access)
response = deployer.invoke_endpoint(endpoint.name, payload)

print("Response:")
print(json.dumps(response, indent=2))

In [None]:
# Example Python code for real invocation (copy this for your applications)
example_code = '''
# Real SageMaker Invocation Code
# ================================

import boto3
import json

# Create runtime client
runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")

# Prepare request
payload = {
    "inputs": "What is the capital of France?",
    "parameters": {
        "max_new_tokens": 100,
        "temperature": 0.7,
    }
}

# Invoke endpoint
response = runtime.invoke_endpoint(
    EndpointName="my-llm-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Parse response
result = json.loads(response["Body"].read().decode())
print(result[0]["generated_text"])
'''

print(example_code)

---

## Part 5: Auto-Scaling Configuration

### ELI5: Auto-Scaling

> **Imagine a restaurant during lunch rush...**
>
> When it's busy, you need more servers (waiters, not computers!).
> When it's quiet, you send some home to save money.
>
> **Auto-scaling does this automatically:**
> - Watches metrics (requests/second, latency)
> - Adds instances when busy
> - Removes instances when quiet
> - Keeps costs proportional to usage
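
The arithmetic behind that behavior can be sketched in a few lines. This is a simplified model of target tracking, assuming the scaler picks just enough instances to bring per-instance load back to the target; the real service also applies cooldowns and alarm evaluation periods, and `desired_instances` is a hypothetical helper.

```python
import math


def desired_instances(total_invocations_per_min, target_per_instance,
                      min_capacity, max_capacity):
    """Instances needed to bring per-instance load back to the target, clamped."""
    needed = math.ceil(total_invocations_per_min / target_per_instance)
    return max(min_capacity, min(max_capacity, needed))


# Total invocations per minute across the whole endpoint at different times of day.
for load in (30, 141, 350, 1000):
    n = desired_instances(load, target_per_instance=70, min_capacity=1, max_capacity=5)
    print(f"{load:>5} inv/min -> {n} instance(s)")
```

At 1000 invocations per minute the fleet is capped at the maximum, so each instance runs well above target; that is the signal to raise `max_capacity` or move to a larger instance type.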

In [None]:
# Configure auto-scaling
print("Configuring Auto-Scaling")
print("=" * 60)

autoscaling_config = deployer.configure_autoscaling(
    endpoint_name=endpoint.name,
    min_capacity=1,          # Never go below 1 instance
    max_capacity=5,          # Scale up to 5 instances
    target_invocations=70,   # Target 70 invocations per instance per minute
)

print(f"Auto-scaling configured:")
print(f"  Min instances: {autoscaling_config['min_capacity']}")
print(f"  Max instances: {autoscaling_config['max_capacity']}")
print(f"  Target: {autoscaling_config.get('target_invocations', 70)} invocations/instance/min")
print()
print("Scaling behavior:")
print("  - Scale OUT when avg > 70 invocations/min (5 min cooldown)")
print("  - Scale IN when avg < 70 invocations/min (10 min cooldown)")

In [None]:
# Example auto-scaling policy (for reference)
autoscaling_policy = '''
# AWS Auto-Scaling Policy Configuration
# ======================================

# Using boto3 directly:

import boto3

endpoint_name = "my-llm-endpoint"  # Endpoint to scale
autoscaling = boto3.client("application-autoscaling", region_name="us-west-2")

# Register scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=5,
)

# Create scaling policy
autoscaling.put_scaling_policy(
    PolicyName=f"{endpoint_name}-scaling-policy",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70,  # Target invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 600,   # 10 minutes
        "ScaleOutCooldown": 300,  # 5 minutes
    },
)
'''

print(autoscaling_policy)

---

## Part 6: Cost Analysis

Let's analyze the costs for different deployment scenarios.
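
Before calling the lab helper, it helps to see the underlying arithmetic for a single always-on instance. The sketch below assumes a flat hourly price with no auto-scaling and requests served one at a time; `always_on_costs` is a hypothetical helper, and the $1.00/hr figure is the approximate ml.g5.xlarge rate from the Part 2 table.

```python
HOURS_PER_MONTH = 24 * 30
SECONDS_PER_DAY = 24 * 60 * 60


def always_on_costs(hourly_price, requests_per_day, avg_latency_ms, instances=1):
    """Monthly cost, cost per 1K requests, and utilization for a fixed fleet."""
    monthly = hourly_price * HOURS_PER_MONTH * instances
    monthly_requests = requests_per_day * 30
    per_1k = monthly / (monthly_requests / 1000)
    # Fraction of the day spent serving requests (sequential, no batching).
    busy_seconds = requests_per_day * avg_latency_ms / 1000
    utilization = busy_seconds / (SECONDS_PER_DAY * instances)
    return monthly, per_1k, utilization


monthly, per_1k, utilization = always_on_costs(
    hourly_price=1.00,
    requests_per_day=10_000,
    avg_latency_ms=150,
)
print(f"Monthly: ${monthly:,.0f}")
print(f"Per 1K requests: ${per_1k:.2f}")
print(f"Utilization: {utilization:.1%}")
```

Low utilization at this traffic level means most of the bill pays for idle time, which is why the per-request cost falls sharply as volume grows on the same instance.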

In [None]:
# Cost estimation for different scenarios
print("Cost Analysis")
print("=" * 60)

# Estimate costs for a 7B model
estimates = estimate_cloud_costs(
    model_size_gb=14.0,          # 7B model at FP16
    expected_requests_per_day=10000,
    avg_latency_ms=150,          # 150ms per request
)

print("\nCost Estimates for 7B Model (10K requests/day):")
print("-" * 60)
for est in estimates:
    print(f"\n{est.platform}:")
    print(f"  Instance: {est.instance_type}")
    print(f"  Hourly: ${est.hourly_cost:.2f}")
    print(f"  Monthly: ${est.monthly_cost:.0f}")
    print(f"  Per 1K requests: ${est.cost_per_1k_requests:.4f}")
    if est.notes:
        print(f"  Note: {est.notes}")

In [None]:
# Compare with DGX Spark
print("\nDGX Spark vs. Cloud Comparison")
print("=" * 60)

dgx_spark_cost = 4999  # Approximate purchase price
power_cost_monthly = 50  # ~500W * 24h * 30d * $0.15/kWh

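# Break-even analysis
cloud_monthly = estimates[0].monthly_cost  # SageMaker
monthly_savings = cloud_monthly - power_cost_monthly
# Break-even exists only if the cloud costs more per month than local power
months_to_breakeven = dgx_spark_cost / monthly_savings if monthly_savings > 0 else float("inf")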

print(f"\nDGX Spark:")
print(f"  Purchase: ${dgx_spark_cost:,}")
print(f"  Monthly power: ~${power_cost_monthly}")
print(f"  Capability: Can run up to 50B models locally!")

print(f"\nAWS SageMaker (ml.g5.xlarge):")
print(f"  Monthly: ${cloud_monthly:.0f}")
print(f"  Capability: 7B models with 4-bit quantization")

print(f"\nBreak-even: {months_to_breakeven:.1f} months")
print(f"   If you run 24/7 for more than {months_to_breakeven:.1f} months,")
print(f"   DGX Spark is more cost-effective!")

---

## Part 7: Deployment Checklist

In [None]:
# Get deployment checklist
checklist = create_deployment_checklist()

print("Production Deployment Checklist")
print("=" * 60)

for category in checklist:
    print(f"\n{category['category']}")
    print("-" * 40)
    for item in category['items']:
        print(f"  [ ] {item}")

---

## Part 8: Custom Container Deployment

For more control, you can deploy your own container to SageMaker.
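
A custom container has to follow SageMaker's hosting contract: answer `GET /ping` for health checks and `POST /invocations` for inference, listening on port 8080. The sketch below shows that contract with only the standard library. The `predict` function is a placeholder that echoes its input, and a production container would more likely use a proper web framework and load a real model at startup.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload):
    """Placeholder for real model inference (hypothetical)."""
    inputs = payload.get("inputs", "") if isinstance(payload, dict) else ""
    return {"generated_text": f"echo: {inputs}"}


class InferenceHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, body):
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        # SageMaker polls /ping; a 200 response marks the container healthy.
        if self.path == "/ping":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        # Inference requests arrive at /invocations.
        if self.path != "/invocations":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self._send_json(400, {"error": "invalid JSON"})
            return
        self._send_json(200, predict(payload))

    def log_message(self, fmt, *args):
        pass  # Keep the example quiet


def make_server(port=8080):
    """Create the server; SageMaker sends traffic to port 8080 in the container."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)


# In the container entrypoint you would run:
#   make_server().serve_forever()
```

Running this locally and sending requests to `/ping` and `/invocations` is a quick way to verify the contract before pushing an image to ECR.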

In [None]:
# Example: Deploy custom container
custom_deployment_code = r'''
# Custom Container Deployment to SageMaker
# =========================================

# 1. Build and push container to ECR
# -----------------------------------

# Build image
docker build -t my-inference:latest .

# Authenticate Docker to ECR (replace 123456789012 with your 12-digit account ID)
aws ecr get-login-password --region us-west-2 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-west-2.amazonaws.com

# Tag and push
docker tag my-inference:latest 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-inference:latest
docker push 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-inference:latest

# 2. Package model artifacts
# --------------------------

# Create model.tar.gz with files at the archive root
# (SageMaker extracts the archive into /opt/ml/model)
tar -czvf model.tar.gz -C model .

# Upload to S3
aws s3 cp model.tar.gz s3://my-bucket/models/

# 3. Deploy using Python SDK
# --------------------------

from sagemaker.model import Model
from sagemaker import get_execution_role

model = Model(
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-inference:latest",
    model_data="s3://my-bucket/models/model.tar.gz",
    # get_execution_role() works inside SageMaker notebooks; elsewhere pass an IAM role ARN
    role=get_execution_role(),
    env={
        "MODEL_PATH": "/opt/ml/model",
        "CUDA_VISIBLE_DEVICES": "0",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="my-custom-endpoint",
)
'''

print(custom_deployment_code)

---

## Common Mistakes

### Mistake 1: Wrong Instance Size

```python
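# BAD - Model too large for instance
model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")  # 24GB VRAM for 70B model

# GOOD - Match model size to instance
model.deploy(initial_instance_count=1, instance_type="ml.p4d.24xlarge")  # 320GB VRAM for 70B model

# BETTER - Use quantization to fit a smaller instance.
# A 70B model at 4 bits still needs ~35GB for weights, so use ml.g5.12xlarge (96GB).
# Pass env when constructing the model; deploy() takes no env argument:
#   env={"HF_MODEL_QUANTIZE": "bitsandbytes-nf4"}  # 4-bit quantization
model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")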
```

---

### Mistake 2: No Health Check Timeout

```python
# BAD - Default timeout too short for LLM loading
model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")

# GOOD - Increase container startup timeout
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    container_startup_health_check_timeout=600,  # 10 minutes
)
```

---

### Mistake 3: Forgetting to Delete Endpoints

```python
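# SageMaker bills for every hour an endpoint is running, even with no traffic!
# Always clean up when done testing

predictor.delete_model()     # Remove the SageMaker model resource (the S3 artifact stays)
predictor.delete_endpoint()  # Stop billing (also removes the endpoint config by default)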
```
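
When no `predictor` object is available, for example in a fresh session, the same cleanup can be done through boto3. The sketch below looks up the endpoint config and model names before deleting anything, so nothing is orphaned. `cleanup_endpoint` is a hypothetical helper, and the endpoint name in the usage comment is a placeholder.

```python
def cleanup_endpoint(sm_client, endpoint_name):
    """Delete an endpoint plus the endpoint config and models behind it."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)  # Billing stops here
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return config_name, model_names


# Usage with real credentials (assumes boto3 is installed and configured):
#   import boto3
#   sm = boto3.client("sagemaker", region_name="us-west-2")
#   cleanup_endpoint(sm, "my-llm-endpoint")
```

Model artifacts in S3 are not touched by any of these calls, so remove them separately if they are no longer needed.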

In [None]:
# Cleanup - Delete the endpoint
print("Cleanup")
print("=" * 60)

if endpoint.status == "Simulated":
    print("Simulated endpoint - no cleanup needed.")
else:
    print(f"To delete the real endpoint, run:")
    print(f"  deployer.delete_endpoint('{endpoint.name}')")

print("\nRemember: Endpoints charge by the hour!")

---

## Try It Yourself

### Exercise 1: Cost Calculator

Create a function that calculates the monthly cost for a given:
- Number of requests per day
- Average latency
- Model size

<details>
<summary>Hint</summary>
Use the `estimate_cloud_costs` function and consider auto-scaling.
</details>

In [None]:
# TODO: Implement cost calculator
def calculate_monthly_cost(
    requests_per_day: int,
    avg_latency_ms: float,
    model_size_gb: float,
) -> float:
    """
    Calculate estimated monthly cost on SageMaker.
    
    Args:
        requests_per_day: Expected daily request volume
        avg_latency_ms: Average request latency
        model_size_gb: Model size in GB
    
    Returns:
        Estimated monthly cost in USD
    """
    # TODO: Your implementation here
    pass

# Test your function:
# cost = calculate_monthly_cost(10000, 150, 14.0)
# print(f"Estimated monthly cost: ${cost:.2f}")

---

## Checkpoint

You've learned:
- SageMaker architecture and components
- How to deploy HuggingFace models
- Instance type selection for different model sizes
- Auto-scaling configuration
- Cost analysis and optimization
- Custom container deployment

---

## Challenge (Optional)

Create a SageMaker deployment pipeline that:
1. Packages a fine-tuned model from your DGX Spark
2. Uploads to S3 with versioning
3. Deploys to SageMaker with A/B testing (80/20 traffic split)
4. Sets up CloudWatch alarms for latency and errors
5. Implements automatic rollback on high error rate

---

## Further Reading

- [SageMaker Documentation](https://docs.aws.amazon.com/sagemaker/)
- [HuggingFace SageMaker Integration](https://huggingface.co/docs/sagemaker/)
- [SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/)
- [SageMaker Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/best-practices.html)

---

## Cleanup

In [None]:
# List all endpoints (if using real AWS)
print("To list all your SageMaker endpoints:")
print("  aws sagemaker list-endpoints")
print("\nTo delete an endpoint:")
print("  aws sagemaker delete-endpoint --endpoint-name <name>")
print("  aws sagemaker delete-endpoint-config --endpoint-config-name <name>")
print("  aws sagemaker delete-model --model-name <name>")