# 🧠 SageMaker Serverless Exploration - Complete Summary
**Total Cost:** $0.00 🎉  


## 📋 Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [IAM Role Setup](#iam-role-setup)
3. [SageMaker SDK Setup](#sagemaker-sdk-setup)
4. [Deploying a Serverless Endpoint](#deploying-serverless-endpoint)
5. [Testing the Endpoint](#testing-the-endpoint)
6. [Cleanup & Cost Management](#cleanup-cost-management)
7. [Production Workflows](#production-workflows)
8. [High-Performance Options](#high-performance-options)
9. [CPU vs GPU Selection](#cpu-vs-gpu-selection)
10. [Quick Reference](#quick-reference)

---
## 🏗️ Architecture Overview <a name="architecture-overview"></a>

### What We Built

```
IAM Role → HuggingFace Model → Serverless Endpoint → Inference Call → Cleanup
```

### AWS Services Used

| Service | Purpose | Cost |
|---------|---------|------|
| **IAM** | Permission management | Free |
| **SageMaker** | ML model hosting | Pay per inference |
| **HuggingFace Hub** | Pre-trained model source | Free |

> 💡 **Sticky Analogy: Food Truck Service**
>
> Think of SageMaker Serverless as ordering a **food truck on-demand**:
> - **IAM Role** = Your ID badge proving you're allowed to order
> - **HuggingFaceModel** = The menu item you're ordering
> - **ServerlessInferenceConfig** = Delivery preferences (memory, concurrency)
> - **model.deploy()** = Actually placing the order
> - **Endpoint** = The food truck arrives and flips the "OPEN" sign

---
## 🔐 IAM Role Setup <a name="iam-role-setup"></a>

### Why We Need It

SageMaker needs permission to access S3, ECR, and other AWS services on your behalf.

> 💡 **Analogy:** Like giving a delivery driver your house key to drop off packages while you're away.

### What We Created

- **Role Name:** `SageMakerExecutionRole`
- **ARN:** `arn:aws:iam::609662024349:role/SageMakerExecutionRole`
- **Trust Policy:** Allows `sagemaker.amazonaws.com` to assume the role
- **Permission Policy:** `AmazonSageMakerFullAccess`

### Trust Policy JSON

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```
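
If you would rather create the role from code than from the console, the steps under "What We Created" can be sketched with boto3 roughly like this. It is a minimal sketch, not the exact procedure used here: the helper name `create_execution_role` is hypothetical, and the function takes an IAM client as an argument so that nothing touches AWS until you call it with real credentials.

```python
import json

# Same trust policy as the JSON shown above.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS-managed policy named in "What We Created".
MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"


def create_execution_role(iam_client, role_name="SageMakerExecutionRole"):
    """Create the role, attach the managed policy, and return the role ARN.

    Hypothetical helper: `iam_client` is expected to be boto3.client("iam").
    """
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=MANAGED_POLICY_ARN)
    return response["Role"]["Arn"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(create_execution_role(boto3.client("iam")))
```

`AmazonSageMakerFullAccess` is broad; it is convenient for learning, but a narrower policy is worth considering before production.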

---
## 📦 SageMaker SDK Setup <a name="sagemaker-sdk-setup"></a>

### Version Compatibility Issue

**Problem:** `ModuleNotFoundError: No module named 'sagemaker.huggingface'`

**Root Cause:** SageMaker SDK v3.x restructured its modules, and the HuggingFace integration was removed from (or moved out of) `sagemaker.huggingface`.

> 💡 **Analogy: App Store Update**
>
> Like buying a new iPhone and finding your favorite app hasn't been updated for the new iOS yet. Rolling back to v2 is like using the "classic" version that still has everything built-in.

### Solution

```bash
pip3 install "sagemaker>=2.0,<3.0"
```

### Version Comparison

| Version | HuggingFaceModel | Notes |
|---------|------------------|-------|
| **v3.x** | ❌ Not bundled | Modular architecture |
| **v2.x** | ✅ Included | Use this for HuggingFace |
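
Before running the deployment script, it can help to confirm which SDK series is actually installed. The sketch below uses only the standard library; the helper names `is_v2_series` and `installed_sagemaker_version` are hypothetical, and the check simply encodes the table above (2.x is the series that worked for us).

```python
from importlib import metadata


def is_v2_series(version: str) -> bool:
    """Return True when a version string such as '2.200.0' is in the 2.x series."""
    return int(version.split(".")[0]) == 2


def installed_sagemaker_version():
    """Return the installed sagemaker version string, or None if it is missing."""
    try:
        return metadata.version("sagemaker")
    except metadata.PackageNotFoundError:
        return None


version = installed_sagemaker_version()
if version is None:
    print("sagemaker is not installed")
elif is_v2_series(version):
    print(f"sagemaker {version}: sagemaker.huggingface should be importable")
else:
    print(f'sagemaker {version}: consider pip3 install "sagemaker>=2.0,<3.0"')
```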

---
## 🚀 Deploying a Serverless Endpoint <a name="deploying-serverless-endpoint"></a>

### Initial Approach (Failed)

Using S3 path directly:

```python
model = HuggingFaceModel(
    model_data="s3://huggingface-sagemaker-models/...",  # ❌ Access denied
    ...
)
```

**Error:** `ValidationException: Could not access model data at s3://...`

> 💡 **Analogy: Supplier vs Warehouse**
>
> Instead of giving the delivery truck a specific warehouse address that might be outdated, tell them "order directly from the supplier" (HuggingFace Hub) - always fresh and accessible!

### Working Solution

Using HuggingFace Hub directly via environment variable:

```python
# sagemaker-test.py - Working deployment script
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

role = "arn:aws:iam::609662024349:role/SageMakerExecutionRole"

# Use HuggingFace Hub directly instead of S3
model = HuggingFaceModel(
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    role=role,
    env={"HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english"}
)

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=1
)

predictor = model.deploy(serverless_inference_config=serverless_config)
print(f"Endpoint name: {predictor.endpoint_name}")
```

### Deployment Progress

```
----!
Endpoint name: huggingface-pytorch-inference-2025-12-23-13-07-31-668
```

**What the symbols mean:**
- Each `-` = One status poll while the endpoint is still being created
- `!` = Endpoint is ready!
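
The same wait can be reproduced outside the SDK by polling the endpoint status yourself. This is a sketch of the idea, not the SDK's own implementation: `wait_for_endpoint` is a hypothetical helper, the 30-second interval and 40-poll limit are arbitrary assumptions, and `sm_client` is expected to be `boto3.client("sagemaker")`.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=40):
    """Poll describe_endpoint until the endpoint is InService or Failed.

    Prints '-' for every poll that finds the endpoint still in progress and
    '!' once it is ready, mirroring the progress line shown above.
    """
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            print("!")
            return status
        if status == "Failed":
            print()
            return status
        print("-", end="", flush=True)
        time.sleep(poll_seconds)
    raise TimeoutError(f"{endpoint_name} was not ready after {max_polls} polls")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   wait_for_endpoint(boto3.client("sagemaker"), "<endpoint-name>")
```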

> 💡 **Analogy:** The food truck is driving to the location, setting up the kitchen, firing up the grill, and flipping the "OPEN" sign.

---
## 🧪 Testing the Endpoint <a name="testing-the-endpoint"></a>

### Shell Quoting Lesson Learned

**Problem:** Inline Python via SSH causes quote escaping nightmares.

> 💡 **Analogy: Noisy Drive-Through**
>
> Instead of shouting a complicated order through a noisy speaker (nested shell quotes), write it on paper first (file), then hand it through the window!

### Solution: File Approach

```python
# test-endpoint.py - Inference script
import boto3
import json

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

response = runtime.invoke_endpoint(
    EndpointName="huggingface-pytorch-inference-2025-12-23-13-07-31-668",
    ContentType="application/json",
    Body=json.dumps({"inputs": "I love learning AWS!"})
)

print(json.loads(response["Body"].read().decode()))
```

### Test Results

| Input | Label | Score |
|-------|-------|-------|
| "I love learning AWS!" | POSITIVE | 99.95% |

```json
[{"label": "POSITIVE", "score": 0.9995132684707642}]
```
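
To turn the raw response into the label and percentage shown in the table, a small formatter is enough. This is a minimal sketch; `format_prediction` is a hypothetical helper and it assumes the response is a list of `{"label", "score"}` dictionaries like the one above.

```python
import json


def format_prediction(predictions):
    """Return the highest-scoring label with its score as a percentage."""
    top = max(predictions, key=lambda p: p["score"])
    return f'{top["label"]} ({top["score"]:.2%})'


raw = '[{"label": "POSITIVE", "score": 0.9995132684707642}]'
print(format_prediction(json.loads(raw)))  # POSITIVE (99.95%)
```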

> 💡 **The model is like a mood detector** - it reads the emotional tone of text and tells you whether it's positive or negative, with a confidence percentage.

---
## 🧹 Cleanup & Cost Management <a name="cleanup-cost-management"></a>

### Why Cleanup Matters

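> 💡 **Analogy: Closing the Food Truck**
>
> The endpoint is like a food truck parked with the "OPEN" sign on. A serverless endpoint bills only when it is invoked, so an idle one costs nothing, but a forgotten endpoint can still be called (and billed) later, and a real-time endpoint charges for every hour it stays up. Deleting = packing up and leaving!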

### Cleanup Commands

```bash
# 1. Delete endpoint (stops billing)
aws sagemaker delete-endpoint --endpoint-name <endpoint-name>

# 2. Delete endpoint config (free, but keeps things clean)
aws sagemaker delete-endpoint-config --endpoint-config-name <config-name>

# 3. Delete model (free, but keeps things clean)
aws sagemaker delete-model --model-name <model-name>

# 4. Verify everything is gone
aws sagemaker list-endpoints           # Should be empty
aws sagemaker list-endpoint-configs    # Should be empty
aws sagemaker list-models              # Should be empty
```
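
The CLI commands need the config and model names, which are easy to lose track of. One way to avoid that is to look them up from the endpoint itself before deleting anything. The sketch below assumes `sm_client` is `boto3.client("sagemaker")`; `cleanup_endpoint` is a hypothetical helper, not part of boto3 or the SageMaker SDK.

```python
def cleanup_endpoint(sm_client, endpoint_name):
    """Delete an endpoint plus the endpoint config and model(s) behind it.

    The names are looked up first, because they can no longer be read from
    the endpoint once it has been deleted.
    """
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(cleanup_endpoint(boto3.client("sagemaker"), "<endpoint-name>"))
```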

### What Are These Resources?

| Resource | What It Is | Cost | Analogy |
|----------|------------|------|--------|
| **Endpoint** | Running inference service | 💰 Billed per invocation (serverless) or per instance-hour (real-time) | The food truck serving |
| **Endpoint Config** | Blueprint for endpoint setup | Free | Recipe card |
| **Model** | Registration record pointing to model | Free | Catalog entry |

### AWS Billing Note

AWS billing has a **6-24 hour delay**. Charges may not appear immediately, but for a few test calls, expect **fractions of a penny**.

---
## 🏭 Production Workflows <a name="production-workflows"></a>

### Learning vs Production

| Stage | Source | Code |
|-------|--------|------|
| **Learning** | HuggingFace Hub | `env={"HF_MODEL_ID": "..."}` |
| **Production** | Your S3 bucket | `model_data="s3://your-bucket/model.tar.gz"` |

### Production Flow

```
Train model locally/SageMaker
        ↓
Save/export model (model.tar.gz)
        ↓
Upload to YOUR S3 bucket
        ↓
Deploy from S3
```
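
The "save/export" and "upload" steps can be sketched with the standard library plus boto3. This is an illustration under assumptions: `package_model` and `upload_model` are hypothetical helpers, the model files are assumed to already sit in a local directory, and they are written at the root of the archive, which is the layout SageMaker expects for `model.tar.gz`. Write the archive outside the model directory so it does not try to include itself.

```python
import tarfile
from pathlib import Path


def package_model(model_dir, archive_path="model.tar.gz"):
    """Pack every file under model_dir into a gzip tarball, rooted at the archive top."""
    model_dir = Path(model_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                # arcname drops the local directory prefix from the stored path.
                tar.add(str(path), arcname=path.relative_to(model_dir).as_posix())
    return Path(archive_path)


def upload_model(s3_client, archive_path, bucket, key="model.tar.gz"):
    """Upload the archive and return the s3:// URI to pass as model_data."""
    s3_client.upload_file(str(archive_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (requires boto3 and AWS credentials; bucket name is a placeholder):
#   import boto3
#   archive = package_model("./my-model")
#   model_data = upload_model(boto3.client("s3"), archive, "your-bucket")
```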

> 💡 **Analogy: Restaurant vs Home Cooking**
>
> - **Today:** Ordered pre-made dish from restaurant (HuggingFace Hub)
> - **Production:** Cook your own recipe, package it, store in your pantry (S3), serve from there

---
## 🚀 High-Performance Options <a name="high-performance-options"></a>

### Endpoint Type Comparison

| Type | Behavior | Best For |
|------|----------|----------|
| **Serverless** | Spins up on-demand, scales to zero | Low traffic, cost-sensitive, dev/test |
| **Real-time** | Instance runs 24/7 | High throughput, low latency, production |
| **Async** | Queue-based, for long jobs | Large payloads, batch processing |

### Serverless vs Real-time Trade-offs

| | Serverless | Real-time |
|--|-----------|----------|
| **Cold start** | 10-30 sec first call | None (always warm) |
| **Latency** | Higher | Lower (~ms) |
| **Cost when idle** | $0 | Paying 24/7 |
| **High traffic** | ❌ | ✅ |

> 💡 **Analogy:**
> - **Serverless** = Food truck that parks only when you call (cheap but slow to arrive)
> - **Real-time** = Restaurant that's always open (instant service but paying rent 24/7)
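
A simple way to see the cold start for yourself is to time a few back-to-back calls: on a serverless endpoint that has been idle, the first call is typically the slow one. The sketch below is a rough measurement aid, not a benchmark; `time_invocations` is a hypothetical helper and `runtime_client` is assumed to be `boto3.client("sagemaker-runtime")`.

```python
import json
import time


def time_invocations(runtime_client, endpoint_name, text, calls=3):
    """Invoke the endpoint `calls` times and return each wall-clock latency in seconds."""
    latencies = []
    for _ in range(calls):
        start = time.perf_counter()
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": text}),
        )
        response["Body"].read()  # Include the time to read the full response.
        latencies.append(time.perf_counter() - start)
    return latencies


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
#   for i, seconds in enumerate(time_invocations(runtime, "<endpoint-name>", "I love learning AWS!")):
#       print(f"call {i + 1}: {seconds:.2f}s")
```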

### High Throughput + Low Latency Solution

**Real-time Endpoints with Auto-Scaling**

> 💡 **Analogy: Fleet of Food Trucks**
>
> Instead of one truck that shows up when called (serverless), you have a **fleet** that automatically dispatches more trucks during lunch rush and sends them home when quiet.

```python
import boto3

# 1. Deploy real-time (not serverless)
# `model` is the HuggingFaceModel created in the deployment script above.
predictor = model.deploy(
    initial_instance_count=2,      # Start with 2 instances
    instance_type="ml.m5.large"    # Always-on instance type
)

# The scaling calls below address the endpoint by name and variant.
# "AllTraffic" is the variant name the SDK assigns by default.
endpoint_name = predictor.endpoint_name
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# 2. Add auto-scaling
client = boto3.client("application-autoscaling")

# Register scalable target
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10
)

# Add scaling policy (scale based on invocations)
client.put_scaling_policy(
    PolicyName="scale-on-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # Target invocations per minute, per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,   # Seconds to wait after adding instances
        "ScaleInCooldown": 300    # Seconds to wait after removing instances
    }
)
```
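
To get a feel for what a target of 70 means, here is some back-of-the-envelope arithmetic. `SageMakerVariantInvocationsPerInstance` is measured per minute per instance, so the fleet size that target tracking steers toward is roughly total invocations per minute divided by the target, clamped to the min and max capacity. This is an approximation for intuition only; `estimate_instances` is a hypothetical helper, and the real service also reacts to cooldowns and metric delays.

```python
import math


def estimate_instances(invocations_per_minute, target_per_instance=70.0,
                       min_capacity=2, max_capacity=10):
    """Approximate the instance count that target tracking would settle on."""
    needed = math.ceil(invocations_per_minute / target_per_instance)
    return max(min_capacity, min(max_capacity, needed))


print(estimate_instances(100))   # 2  (ceil(100/70) = 2)
print(estimate_instances(350))   # 5  (350/70 = 5)
print(estimate_instances(1000))  # 10 (15 needed, capped at MaxCapacity)
```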

### High-Performance Checklist 📋

When asked "How do you handle throughput & latency in SageMaker?", know these:

| Concept | Remember It As... |
|---------|-------------------|
| **Endpoint Type** | How eager is your service? (Always ready / Wake on call / Queue it) |
| **Instance Selection** | Brains (CPU) vs Muscle (GPU) - match worker to job |
| **Scaling Strategy** | When to hire/fire more workers |
| **Cooldown Periods** | Don't panic-hire or panic-fire |
| **Model Optimization** | Make the model faster, not just more hardware |

> 💡 **Analogy: Restaurant Staffing**
>
> Running a high-performance ML service is like managing a restaurant - decide if you're 24/7 or pop-up (endpoint), hire cooks vs dishwashers (instance), know when to call in extra staff (scaling), don't overreact to one busy hour (cooldown), and train your staff to work faster (optimization).

---
## 🧠 CPU vs GPU Selection <a name="cpu-vs-gpu-selection"></a>

### The Confusion

"Why do we need GPU for inference? I thought GPU was only for training."

### The Answer

It depends on **model size and throughput**, not just training vs inference.

| Scenario | CPU | GPU |
|----------|-----|-----|
| Training (deep learning) | ❌ (usually too slow) | ✅ |
| Inference - Small model | ✅ | Overkill |
| Inference - Large model (BERT, GPT) | ❌ (often too slow) | ✅ |
| Inference - High batch volume | ❌ | ✅ |

### Quick Decision Guide

| Use CPU | Use GPU |
|---------|---------|
| Traditional ML (XGBoost, RF) | Deep Learning (Transformers, CNNs) |
| Small models | Large models (100M+ params) |
| Low inference volume | High batch throughput |
| Cost-sensitive | Latency-critical |

**Simple Rule:** If it's a neural network AND (large OR needs to be fast) → GPU
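
The simple rule is just a boolean expression, so it can be written down directly. This is a toy sketch of the heuristic, not a sizing tool: `suggest_hardware` is a hypothetical helper, the 100M-parameter threshold comes from the decision guide above, and the parameter counts in the usage lines are approximate.

```python
def suggest_hardware(is_neural_network, param_count, needs_speed,
                     large_model_threshold=100_000_000):
    """Apply the rule: neural network AND (large OR needs to be fast) -> GPU."""
    is_large = param_count >= large_model_threshold
    if is_neural_network and (is_large or needs_speed):
        return "GPU"
    return "CPU"


# Traditional ML model such as XGBoost: not a neural network.
print(suggest_hardware(False, 1_000_000, needs_speed=True))    # CPU
# Small transformer (roughly 66M parameters), low traffic.
print(suggest_hardware(True, 66_000_000, needs_speed=False))   # CPU
# Same small transformer, but latency-critical.
print(suggest_hardware(True, 66_000_000, needs_speed=True))    # GPU
# Large transformer, regardless of traffic.
print(suggest_hardware(True, 340_000_000, needs_speed=False))  # GPU
```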

> 💡 **Analogy: Pizza Kitchen**
>
> Even after you've **learned** to cook (training), making **100 pizzas at once** (inference) still needs industrial ovens (GPU). But making **one sandwich**? A regular kitchen (CPU) works fine!

---
## 📚 Quick Reference <a name="quick-reference"></a>

### Complete Command Sequence

```bash
# 1. Install SDK (use v2 for HuggingFace)
pip3 install "sagemaker>=2.0,<3.0"

# 2. Deploy (see Python script above)
python3 sagemaker-test.py

# 3. Test
python3 test-endpoint.py

# 4. Cleanup
aws sagemaker delete-endpoint --endpoint-name <name>
aws sagemaker delete-endpoint-config --endpoint-config-name <name>
aws sagemaker delete-model --model-name <name>

# 5. Verify
aws sagemaker list-endpoints
aws sagemaker list-endpoint-configs
aws sagemaker list-models
```

### All Sticky Analogies

| Concept | Analogy |
|---------|--------|
| SageMaker Serverless | Food truck that arrives on-demand |
| IAM Role | ID badge / house key for delivery driver |
| HF Hub vs S3 | Ordering from supplier vs specific warehouse |
| SDK v3 vs v2 | New iPhone missing your favorite app |
| Shell quoting | Noisy drive-through vs written order |
| Endpoint deletion | Closing the food truck |
| Real-time + Auto-scaling | Fleet of food trucks |
| CPU vs GPU | Brains vs Muscle / Regular vs Industrial kitchen |
| Cooldown periods | Don't panic-hire or panic-fire |

### Key Takeaways

1. **Always use SageMaker SDK v2.x** for HuggingFace models
2. **Use `env={"HF_MODEL_ID": ...}`** instead of S3 paths for learning
3. **Always clean up endpoints** after testing
4. **Serverless = cheap but slow** / **Real-time = fast but expensive**
5. **Use a GPU for inference** only for large models or high throughput

---

## ✅ What We Accomplished

| Step | Status |
|------|--------|
| Created IAM Role (`SageMakerExecutionRole`) | ✅ |
| Installed SageMaker SDK (v2) | ✅ |
| Deployed serverless HuggingFace model | ✅ |
| Tested sentiment analysis | ✅ |
| Cleaned up all resources | ✅ |
| **Total cost** | **$0.00** 🎉 |

---

