# Lab 4.4.4: Deploying Models to GCP Vertex AI

**Module:** 4.4 - Containerization & Cloud Deployment  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand GCP Vertex AI architecture
- [ ] Deploy custom containers to Vertex AI
- [ ] Configure GPU-accelerated endpoints
- [ ] Compare Vertex AI with AWS SageMaker
- [ ] Optimize for cost and performance

---

## Prerequisites

- GCP Account with Vertex AI enabled
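- gcloud CLI configured (`gcloud auth login`; the Python SDK also needs Application Default Credentials from `gcloud auth application-default login`)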
- Completed: Lab 4.4.3 (SageMaker for comparison)

**Note:** This lab can be completed in simulation mode without GCP access.

---

## Real-World Context

**Why consider Vertex AI?**

| Feature | Vertex AI | SageMaker |
|---------|-----------|----------|
| BigQuery integration | Native | Via S3/Athena |
| Pricing simplicity | Simpler | Complex |
| GPU options | Good | Excellent |
| AutoML | Strong | Good |
| Ecosystem | GCP-centric | AWS-centric |

**Choose Vertex AI when:**
- Your data is in BigQuery
- You're already on GCP
- You need simpler pricing

---

## ELI5: Vertex AI vs SageMaker

> **It's like choosing between pizza delivery services...**
>
> **SageMaker** is like a big chain (Domino's). Huge menu, lots of options, sometimes confusing pricing, works everywhere.
>
> **Vertex AI** is like a local pizzeria that knows your neighborhood. Fewer options, but simpler menu, and if you already shop at nearby stores (BigQuery, GCS), they know your preferences.
>
> **Both deliver great pizza (ML models)** - the choice depends on where you live (your cloud ecosystem)!

In [None]:
# Check GCP environment
import shutil
import subprocess
import os

print("GCP Environment Check")
print("=" * 60)

# Check gcloud CLI. shutil.which avoids the FileNotFoundError that
# subprocess.run raises when the executable is missing.
gcloud_path = shutil.which("gcloud")
if gcloud_path:
    result = subprocess.run([gcloud_path, "--version"], capture_output=True, text=True)
    version = result.stdout.split('\n')[0]
    print(f"gcloud CLI: {version}")

    # Check authentication
    result = subprocess.run(
        [gcloud_path, "auth", "list", "--format=value(account)"],
        capture_output=True, text=True
    )
    if result.stdout.strip():
        print(f"Authenticated as: {result.stdout.strip().split()[0]}")
    else:
        print("Not authenticated. Run: gcloud auth login")
else:
    print("gcloud CLI not installed")

# Check google-cloud-aiplatform SDK
try:
    from google.cloud import aiplatform
    print("google-cloud-aiplatform: Installed")
except ImportError:
    print("google-cloud-aiplatform not installed")
    print("   Run: pip install google-cloud-aiplatform")

print("\n" + "=" * 60)

In [None]:
# Import our cloud utilities
import sys
sys.path.insert(0, '..')

from scripts.cloud_utils import (
    VertexAIDeployer,
    compare_platforms,
    estimate_cloud_costs,
)

print("Cloud utilities loaded!")

---

## Part 1: Vertex AI Architecture

### Key Components

```
┌─────────────────────────────────────────────────────────────┐
│                    GCP Vertex AI                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   Model      │    │   Model      │    │   Endpoint   │  │
│  │   Artifact   │───>│   Registry   │───>│              │  │
│  │   (GCS)      │    │              │    │              │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│                                                 │           │
│                            ┌────────────────────┴─────┐    │
│                            │      Deployed Model       │    │
│                            │  ┌─────┐ ┌─────┐ ┌─────┐ │    │
│                            │  │ v1  │ │ v2  │ │ v3  │ │    │
│                            │  │ 50% │ │ 30% │ │ 20% │ │    │
│                            │  └─────┘ └─────┘ └─────┘ │    │
│                            └──────────────────────────┘    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```

| Component | Description |
|-----------|-------------|
| **Model Artifact** | Model files in Google Cloud Storage |
| **Model Registry** | Versioned model management |
| **Endpoint** | The deployed prediction service |
| **Deployed Model** | A model version with traffic allocation |
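
The traffic percentages in the diagram can be thought of as a mapping from deployed-model ID to an integer share, where the shares must add up to 100. The following is a minimal sketch in plain Python with no SDK calls; the IDs `v1`, `v2`, `v3` and both function names are hypothetical and only illustrate how a weighted split routes requests.

```python
import random
from typing import Dict


def validate_traffic_split(split: Dict[str, int]) -> None:
    """Raise ValueError unless shares are non-negative integers summing to 100."""
    if any(not isinstance(share, int) or share < 0 for share in split.values()):
        raise ValueError("shares must be non-negative integers")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"traffic split sums to {total}, expected 100")


def pick_deployed_model(split: Dict[str, int], rng: random.Random) -> str:
    """Choose a deployed-model ID with probability proportional to its share."""
    validate_traffic_split(split)
    ids = list(split)
    return rng.choices(ids, weights=[split[i] for i in ids], k=1)[0]


# Hypothetical IDs mirroring the 50/30/20 split in the diagram.
split = {"v1": 50, "v2": 30, "v3": 20}
rng = random.Random(0)
counts = {model_id: 0 for model_id in split}
for _ in range(10_000):
    counts[pick_deployed_model(split, rng)] += 1
print(counts)
```

Over many requests the counts approach the configured shares, which is why a small share is a common way to canary a new model version.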

---

## Part 2: Vertex AI Machine Types

### GPU Options

In [None]:
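# Approximate Vertex AI GPU prices in USD/hour; prices change, so confirm
# current figures on the Vertex AI pricing page before budgeting.
print("Vertex AI GPU Pricing (us-central1, approximate)")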
print("=" * 60)

vertex_pricing = {
    "NVIDIA_TESLA_T4": {"vram": "16GB", "cost": 0.35, "best_for": "7B models"},
    "NVIDIA_L4": {"vram": "24GB", "cost": 0.70, "best_for": "7-13B models"},
    "NVIDIA_TESLA_A100": {"vram": "40GB", "cost": 2.93, "best_for": "30B models"},
    "NVIDIA_A100_80GB": {"vram": "80GB", "cost": 3.67, "best_for": "70B models"},
    "NVIDIA_H100_80GB": {"vram": "80GB", "cost": 10.00, "best_for": "High performance"},
}

for gpu, info in vertex_pricing.items():
    monthly = info['cost'] * 24 * 30
    print(f"{gpu:25} {info['vram']:>6} ${info['cost']:>6.2f}/hr (${monthly:>7,.0f}/mo) - {info['best_for']}")

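print("\nNote: Add the host machine cost (~$0.19-0.76/hr for n1-standard-4 to n1-standard-16).")
print("N1 hosts pair with the T4; L4, A100 and H100 GPUs require G2, A2 and A3 machine types.")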
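
Because the GPU rate and the host machine rate are billed together, the monthly figure for an always-on endpoint is the sum of both, multiplied by the replica count. The sketch below works through that arithmetic; `monthly_cost` is a hypothetical helper, and the rates passed in are the illustrative numbers from this lab, not quoted prices.

```python
HOURS_PER_MONTH = 24 * 30  # the same 30-day month used in the pricing table


def monthly_cost(gpu_hourly: float, machine_hourly: float,
                 gpu_count: int = 1, replicas: int = 1) -> float:
    """Monthly USD cost of always-on replicas (GPUs plus host machine)."""
    per_replica = gpu_hourly * gpu_count + machine_hourly
    return per_replica * replicas * HOURS_PER_MONTH


# Illustrative inputs: the T4 rate from the table and the low end of the
# machine-cost range from the note. Real prices vary by region and over time.
print(f"1 replica:  ${monthly_cost(0.35, 0.19):,.2f}/month")
print(f"3 replicas: ${monthly_cost(0.35, 0.19, replicas=3):,.2f}/month")
```

Since `min_replica_count` replicas stay up even when idle, the one-replica figure is the floor for a deployed endpoint.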

---

## Part 3: Deploying to Vertex AI

### Custom Container Deployment

In [None]:
# Create Vertex AI deployer
deployer = VertexAIDeployer(
    project="my-gcp-project",  # Replace with your project
    region="us-central1",
)

print("Vertex AI Deployer initialized")
print(f"Project: {deployer.project}")
print(f"Region: {deployer.region}")
print(f"SDK available: {deployer._aiplatform_available}")

In [None]:
# Deploy model to Vertex AI
print("Deploying model to Vertex AI...")
print("=" * 60)

# Configuration
deploy_config = {
    "model_path": "gs://my-bucket/models/llama-7b",
    "serving_container_image_uri": "us-docker.pkg.dev/my-project/inference/llm-server:latest",
    "machine_type": "g2-standard-8",  # L4 GPUs attach only to G2 machine types
    "accelerator_type": "NVIDIA_L4",
    "accelerator_count": 1,
    "min_replica_count": 1,
    "max_replica_count": 3,
}

print(f"Configuration:")
print(f"  Model: {deploy_config['model_path']}")
print(f"  Machine: {deploy_config['machine_type']}")
print(f"  GPU: {deploy_config['accelerator_type']}")
print(f"  Replicas: {deploy_config['min_replica_count']}-{deploy_config['max_replica_count']}")
print()

# Deploy (simulated if no GCP access)
endpoint = deployer.deploy_model(**deploy_config)

print("\nDeployment Result:")
print(f"  Endpoint Name: {endpoint.name}")
print(f"  Status: {endpoint.status}")
print(f"  Instance Type: {endpoint.instance_type}")
print(f"  Cost: ${endpoint.cost_per_hour:.2f}/hour")
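
Each accelerator type is tied to a specific machine family, and a mismatched pair fails at deploy time. A quick pre-flight check can catch this before any resources are created. The sketch below encodes the pairings as I understand them at the time of writing (T4 on N1, L4 on G2, A100 on A2, H100 on A3); the table and the helper name are assumptions for illustration, so verify them against the current Vertex AI compute documentation.

```python
# Assumed accelerator -> required machine-type prefix (check the current docs).
REQUIRED_MACHINE_PREFIX = {
    "NVIDIA_TESLA_T4": "n1-",
    "NVIDIA_L4": "g2-",
    "NVIDIA_TESLA_A100": "a2-highgpu-",
    "NVIDIA_A100_80GB": "a2-ultragpu-",
    "NVIDIA_H100_80GB": "a3-highgpu-",
}


def is_compatible(machine_type: str, accelerator_type: str) -> bool:
    """Return True if the machine family can host the given accelerator."""
    prefix = REQUIRED_MACHINE_PREFIX.get(accelerator_type)
    if prefix is None:
        raise KeyError(f"unknown accelerator type: {accelerator_type}")
    return machine_type.startswith(prefix)


for machine, gpu in [("g2-standard-8", "NVIDIA_L4"),
                     ("n1-standard-8", "NVIDIA_L4"),
                     ("n1-standard-8", "NVIDIA_TESLA_T4")]:
    verdict = "ok" if is_compatible(machine, gpu) else "incompatible"
    print(f"{machine:16} + {gpu:16} -> {verdict}")
```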

---

## Part 4: Platform Comparison

In [None]:
# Compare platforms
import json

comparison = compare_platforms("Qwen/Qwen3-8B-Instruct")

print("Platform Comparison")
print("=" * 60)

for platform, info in comparison["platforms"].items():
    print(f"\n{platform.upper()}")
    print("-" * 40)
    print(f"  Instance: {info['instance']}")
    print(f"  Hourly cost: ${info['hourly_cost']:.2f}")
    print(f"  Setup: {info['setup_complexity']}")
    print(f"  Auto-scaling: {info['auto_scaling']}")
    print(f"  Cold start: {info['cold_start']}")
    print(f"  \n  Pros:")
    for pro in info['pros'][:2]:
        print(f"    + {pro}")
    print(f"  Cons:")
    for con in info['cons'][:2]:
        print(f"    - {con}")

print("\nRecommendation:")
print(comparison['recommendation'])

In [None]:
# Side-by-side cost comparison
print("\nCost Comparison for 7B Model")
print("=" * 60)

scenarios = [
    {"name": "Light (1K req/day)", "requests": 1000},
    {"name": "Medium (10K req/day)", "requests": 10000},
    {"name": "Heavy (100K req/day)", "requests": 100000},
]

print(f"{'Scenario':<25} {'SageMaker':>12} {'Vertex AI':>12} {'Winner':>10}")
print("-" * 60)

for scenario in scenarios:
    estimates = estimate_cloud_costs(
        model_size_gb=14.0,
        expected_requests_per_day=scenario['requests'],
        avg_latency_ms=150,
    )
    
    sm_cost = estimates[0].monthly_cost
    vertex_cost = estimates[1].monthly_cost
    winner = "SageMaker" if sm_cost < vertex_cost else "Vertex AI"
    
    print(f"{scenario['name']:<25} ${sm_cost:>10,.0f} ${vertex_cost:>10,.0f} {winner:>10}")
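
To see why the three scenarios can cost the same on an always-on endpoint, it helps to convert request volume into the fraction of the day a single replica is actually busy. The sketch below does that arithmetic under a simplifying assumption (requests are served one at a time, with no batching); `replica_utilization` is a hypothetical helper and does not reproduce `estimate_cloud_costs`, whose internals are not shown in this lab.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def replica_utilization(requests_per_day: int, avg_latency_ms: float) -> float:
    """Fraction of the day one replica is busy, serving requests sequentially."""
    busy_seconds = requests_per_day * avg_latency_ms / 1000
    return busy_seconds / SECONDS_PER_DAY


# Same request volumes and 150 ms latency as the scenarios above.
for requests in (1_000, 10_000, 100_000):
    share = replica_utilization(requests, avg_latency_ms=150)
    print(f"{requests:>7,} req/day -> {share:.1%} of one replica")
```

Even the heavy scenario keeps one replica busy for well under a quarter of the day, so for these volumes the bill is dominated by the hourly rate of the minimum replica count rather than by traffic.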

---

## Part 5: Custom Container Requirements

Vertex AI has specific requirements for custom containers.

In [None]:
# Vertex AI container requirements
vertex_requirements = '''
# Vertex AI Custom Container Requirements
# ========================================

## Required Endpoints

1. Health Check: GET on the configured health route (this lab uses /health)
   - Must return 200 when ready
   
2. Prediction: POST on the configured predict route (this lab uses /predict)
   - Input: {"instances": [{...}, {...}]}
   - Output: {"predictions": [{...}, {...}]}

## Environment Variables (Auto-set)

- AIP_STORAGE_URI: GCS path to model artifacts
- AIP_HTTP_PORT: Port to listen on (default: 8080)
- AIP_HEALTH_ROUTE: Health check route
- AIP_PREDICT_ROUTE: Prediction route

## Dockerfile Template

FROM nvcr.io/nvidia/pytorch:24.12-py3

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app/ /app/

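# Vertex AI sets the AIP_* variables at runtime (port 8080 unless overridden);
# the defaults below only matter when you run the image locally.
ENV AIP_HTTP_PORT=8080
ENV AIP_HEALTH_ROUTE=/health
ENV AIP_PREDICT_ROUTE=/predict

EXPOSE 8080

# Run through a shell so ${AIP_HTTP_PORT} is expanded when the container starts
CMD ["sh", "-c", "python -m uvicorn main:app --host 0.0.0.0 --port ${AIP_HTTP_PORT}"]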
'''

print(vertex_requirements)
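
Inside the container, the server can read the `AIP_*` variables listed above instead of hardcoding values. The sketch below collects them into one object with local-development fallbacks; `ServingConfig` and `load_serving_config` are hypothetical names, and the fallback values simply mirror the defaults used in this lab.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServingConfig:
    port: int
    health_route: str
    predict_route: str
    storage_uri: Optional[str]


def load_serving_config(env: Optional[Mapping[str, str]] = None) -> ServingConfig:
    """Read the AIP_* variables, falling back to this lab's local defaults."""
    env = os.environ if env is None else env
    return ServingConfig(
        port=int(env.get("AIP_HTTP_PORT", "8080")),
        health_route=env.get("AIP_HEALTH_ROUTE", "/health"),
        predict_route=env.get("AIP_PREDICT_ROUTE", "/predict"),
        storage_uri=env.get("AIP_STORAGE_URI"),  # None when running locally
    )


# Passing a dict instead of os.environ makes the function easy to test.
config = load_serving_config({"AIP_HTTP_PORT": "9000"})
print(config)
```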

In [None]:
# Example FastAPI app for Vertex AI
vertex_app_code = '''
# FastAPI Application for Vertex AI
# ==================================

import os
from typing import List, Dict, Any
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Model loading (happens once on startup)
@app.on_event("startup")
async def load_model():
    global model, tokenizer
    
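    # Vertex AI sets AIP_STORAGE_URI to a gs:// URI, which from_pretrained()
    # cannot read directly: copy the artifacts to a local directory first and
    # load from there. The "/models" fallback is for local runs.
    model_path = os.environ.get("AIP_STORAGE_URI", "/models")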
    
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
    )

# Health check (required)
@app.get("/health")
@app.get("/ping")  # Alternative route
async def health():
    return {"status": "healthy"}

# Prediction request format
class PredictionRequest(BaseModel):
    instances: List[Dict[str, Any]]

class PredictionResponse(BaseModel):
    predictions: List[Dict[str, Any]]

# Prediction endpoint (required)
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    predictions = []
    
    for instance in request.instances:
        prompt = instance.get("prompt", "")
        max_tokens = instance.get("max_tokens", 100)
        
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_tokens)
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        predictions.append({"generated_text": text})
    
    return PredictionResponse(predictions=predictions)
'''

print(vertex_app_code)
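
One detail the app glosses over: on Vertex AI, `AIP_STORAGE_URI` is a `gs://` URI rather than a local directory, so the artifacts have to be copied into the container before `from_pretrained` can load them. The sketch below shows one way to do that with the `google-cloud-storage` client. The helper names and the `/tmp/model` destination are assumptions for illustration, and the download function needs real credentials and a real bucket to run.

```python
from pathlib import Path
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket, prefix.strip("/")


def download_artifacts(uri: str, dest: str = "/tmp/model") -> str:
    """Copy every object under the URI into dest and return dest."""
    from google.cloud import storage  # pip install google-cloud-storage

    bucket_name, prefix = split_gcs_uri(uri)
    list_prefix = f"{prefix}/" if prefix else ""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=list_prefix):
        relative = blob.name[len(list_prefix):]
        if not relative or relative.endswith("/"):
            continue  # skip directory placeholder objects
        target = Path(dest) / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))
    return dest


print(split_gcs_uri("gs://my-bucket/models/llama-7b"))
```

In the startup hook, the local path returned by `download_artifacts` would then be passed to `from_pretrained` whenever the variable starts with `gs://`.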

---

## Part 6: Deployment Workflow

Complete workflow for deploying to Vertex AI.

In [None]:
# Complete Vertex AI deployment workflow
deployment_workflow = '''
# Vertex AI Deployment Workflow
# ==============================

# 1. Build and push container to Artifact Registry
# -------------------------------------------------

# Enable APIs
gcloud services enable artifactregistry.googleapis.com
gcloud services enable aiplatform.googleapis.com

# Create repository
gcloud artifacts repositories create ml-models \\
    --repository-format=docker \\
    --location=us-central1

# Configure Docker
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build and push
docker build -t us-central1-docker.pkg.dev/PROJECT/ml-models/inference:v1 .
docker push us-central1-docker.pkg.dev/PROJECT/ml-models/inference:v1

# 2. Upload model to GCS
# ----------------------

# Create bucket
gsutil mb -l us-central1 gs://my-models-bucket

# Upload model files
gsutil -m cp -r ./model/* gs://my-models-bucket/llama-7b/

# 3. Deploy using Python SDK
# --------------------------

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Upload model to registry
model = aiplatform.Model.upload(
    display_name="llama-7b",
    artifact_uri="gs://my-models-bucket/llama-7b",
    serving_container_image_uri="us-central1-docker.pkg.dev/PROJECT/ml-models/inference:v1",
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)

# Deploy to endpoint
endpoint = model.deploy(
    machine_type="g2-standard-8",  # L4 GPUs attach only to G2 machine types
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=3,
)

# 4. Test the endpoint
# --------------------

response = endpoint.predict(
    instances=[{"prompt": "Hello, how are you?", "max_tokens": 50}]
)
print(response.predictions)
'''

print(deployment_workflow)

---

## Common Mistakes

### Mistake 1: Wrong Port Configuration

```dockerfile
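# BAD - listens on port 8000 and only on localhost; Vertex AI sends traffic
# to AIP_HTTP_PORT (8080 by default)
EXPOSE 8000
CMD ["uvicorn", "main:app", "--port", "8000"]

# GOOD - bind to all interfaces and read the port from AIP_HTTP_PORT
ENV AIP_HTTP_PORT=8080
EXPOSE 8080
CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port ${AIP_HTTP_PORT}"]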
```

---

### Mistake 2: Wrong Request/Response Format

```python
# BAD - Vertex AI expects specific format
@app.post("/predict")
def predict(prompt: str):  # Wrong input format
    return {"text": "..."}  # Wrong output format

# GOOD - Use instances/predictions format
@app.post("/predict")
def predict(request: dict):  # {"instances": [...]}
    return {"predictions": [...]}  # Required format
```
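
The instances/predictions contract is easy to check without a web framework. The sketch below wraps any per-instance function so that it accepts the `{"instances": [...]}` body and returns one prediction per instance, in order. `handle_predict` and the `echo_upper` stand-in model are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, List

Instance = Dict[str, Any]


def handle_predict(body: Dict[str, Any],
                   predict_one: Callable[[Instance], Dict[str, Any]]
                   ) -> Dict[str, List[Dict[str, Any]]]:
    """Validate the request body and return one prediction per instance."""
    instances = body.get("instances")
    if not isinstance(instances, list) or not instances:
        raise ValueError('request body must look like {"instances": [...]}')
    return {"predictions": [predict_one(instance) for instance in instances]}


def echo_upper(instance: Instance) -> Dict[str, Any]:
    """Stand-in for a real model: upper-cases the prompt."""
    return {"generated_text": instance.get("prompt", "").upper()}


print(handle_predict({"instances": [{"prompt": "hi"}, {"prompt": "there"}]}, echo_upper))
```

Keeping the output list the same length and order as the input is what lets a client match each prediction back to its instance.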

---

### Mistake 3: Forgetting Model Artifacts Path

```python
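# BAD - Hardcoded path
model = load_model("/models/llama-7b")

# GOOD - Read the location from AIP_STORAGE_URI. On Vertex AI this is a
# gs:// URI, so download the artifacts locally before loading them.
model_path = os.environ.get("AIP_STORAGE_URI", "/models")
model = load_model(model_path)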
```

---

## Checkpoint

You've learned:
- Vertex AI architecture and components
- Custom container requirements
- GPU machine types and pricing
- Platform comparison with SageMaker
- Complete deployment workflow

---

## Challenge (Optional)

Create a multi-region deployment:
1. Deploy to us-central1 (primary)
2. Deploy to europe-west1 (DR)
3. Implement traffic splitting
4. Create monitoring dashboard

---

## Further Reading

- [Vertex AI Documentation](https://cloud.google.com/vertex-ai/docs)
- [Custom Containers Guide](https://cloud.google.com/vertex-ai/docs/predictions/use-custom-container)
- [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)

---

## Cleanup

In [None]:
# Cleanup commands
print("Vertex AI Cleanup")
print("=" * 60)
print("\n# Delete endpoint (undeploy model first)")
print("endpoint.undeploy_all()")
print("endpoint.delete()")
print("\n# Delete model from registry")
print("model.delete()")
print("\n# Via gcloud:")
print("gcloud ai endpoints delete ENDPOINT_ID --region=us-central1")
print("gcloud ai models delete MODEL_ID --region=us-central1")