# 🤖 LLM GPU Deployment with Seldon Core 2
**Production-Ready Large Language Model Serving with GPU Acceleration**

## 🎯 Overview

This notebook demonstrates how to deploy and serve Large Language Models (LLMs) using Seldon Core 2 with GPU acceleration. We'll cover:

- 🚀 **GPU Cluster Management**: Efficiently manage costly GPU resources
- 🧠 **LLM Deployment Options**: API-based, local GPU, and optimized prompt runtime
- ⚡ **Performance Optimization**: Model caching and prompt runtime for faster inference
- 💰 **Cost Management**: Scaling GPU nodes down to zero when not in use
- 🔧 **Production Patterns**: HPA + native server autoscaling

## ⚠️ Prerequisites

- `gcloud` CLI configured with access to the `dev-sherif` project
- `kubectl` and `kubectx` installed
- Access to GPU cluster: `llm-demos-alex`
- Seldon Core 2 installed on the cluster

## 💸 Important: GPU Cost Management

**GPU nodes are expensive!** Always:
1. Scale up nodes only when needed
2. Scale down immediately after use
3. Monitor costs in GCP console
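
One way to make rule 2 hard to forget is to wrap GPU work in a context manager so that the scale-down step runs even when a cell fails partway through. The sketch below is illustrative only: `gpu_session` is a hypothetical helper, and the two callables stand in for whatever functions resize the node pools.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def gpu_session(scale_up: Callable[[], None],
                scale_down: Callable[[], None]) -> Iterator[None]:
    """Run a block of work with GPU nodes up and always scale down afterwards.

    `gpu_session` is a hypothetical helper, not part of any library.
    """
    try:
        # Inside the try block so a partially failed scale-up is still cleaned up
        scale_up()
        yield
    finally:
        # Runs on success, on exceptions, and on KeyboardInterrupt
        scale_down()
```

With the functions defined later in this notebook, usage could look like `with gpu_session(scale_gpu_up, scale_gpu_down): ...`, with the deployment and test calls placed inside the `with` block.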

## 🔧 Cluster Management Scripts

First, let's set up the cluster management functions:

In [None]:
import subprocess
import time
import os
import json
import requests
from IPython.display import display, Markdown, Code
from dataclasses import dataclass
from typing import Optional, Dict, List
import warnings
warnings.filterwarnings('ignore')

@dataclass
class GPUClusterConfig:
    """GPU cluster configuration"""
    cluster_name: str = "llm-demos-alex"
    region: str = "europe-west4"
    project: str = "dev-sherif"
    context: str = "gke_dev-sherif_europe-west4_llm-demos-alex"
    
    # Node pool configurations
    pool_1: str = "pool-1"  # Standard nodes
    pool_4: str = "pool-4"  # GPU nodes (optional)
    pool_7: str = "pool-7"  # GPU nodes (primary)
    
    # Sizes when scaled up
    pool_1_size_up: int = 6
    pool_4_size_up: int = 0  # Currently not used
    pool_7_size_up: int = 1  # 1 GPU node
    
    # All pools scale to 0 when down
    pool_size_down: int = 0

config = GPUClusterConfig()

def run_command(cmd: str, check: bool = True) -> subprocess.CompletedProcess:
    """Run command with proper error handling"""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if check and result.returncode != 0:
        print(f"❌ Command failed: {cmd}")
        print(f"Error: {result.stderr}")
    return result

def log(msg: str, level: str = "INFO"):
    """Pretty logging"""
    icons = {"INFO": "ℹ️", "SUCCESS": "✅", "WARNING": "⚠️", "ERROR": "❌"}
    colors = {"INFO": "blue", "SUCCESS": "green", "WARNING": "orange", "ERROR": "red"}
    icon = icons.get(level, "📝")
    color = colors.get(level, "black")
    display(Markdown(f"<span style='color: {color}'>{icon} **{msg}**</span>"))
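
The helper above passes a single string to the shell. As a possible alternative, the sketch below builds the `gcloud container clusters resize` call as an argument list, which avoids shell quoting issues and makes the command easy to inspect before it runs. `build_resize_command` and `run_argv` are hypothetical names, and `--region` is used on the assumption that the cluster is regional, as the `get-credentials` call in this notebook suggests.

```python
import subprocess
from typing import List


def build_resize_command(cluster: str, pool: str, num_nodes: int,
                         region: str, project: str) -> List[str]:
    """Return the argv for resizing one node pool (hypothetical helper)."""
    if num_nodes < 0:
        raise ValueError("num_nodes must be zero or positive")
    return [
        "gcloud", "container", "clusters", "resize", cluster,
        "--node-pool", pool,
        "--num-nodes", str(num_nodes),
        "--region", region,  # assumption: regional cluster
        "--project", project,
        "--quiet",
    ]


def run_argv(argv: List[str]) -> subprocess.CompletedProcess:
    """Run a command without a shell; raises CalledProcessError on failure."""
    return subprocess.run(argv, capture_output=True, text=True, check=True)
```

Because `run_argv` raises on a non-zero exit code, a failed resize stops the cell instead of being logged and silently ignored.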

## 🚀 GPU Cluster Management Functions

In [None]:
def connect_to_cluster():
    """Connect to the GPU cluster"""
    log("Connecting to GPU cluster...", "INFO")
    
    # Get cluster credentials
    cmd = f"gcloud container clusters get-credentials {config.cluster_name} --region {config.region} --project {config.project}"
    result = run_command(cmd)
    
    if result.returncode == 0:
        log("Successfully connected to cluster", "SUCCESS")
        
        # Switch context
        result = run_command(f"kubectl config use-context {config.context}")
        if result.returncode == 0:
            log(f"Switched to context: {config.context}", "SUCCESS")
        else:
            log("Failed to switch context", "ERROR")
    else:
        log("Failed to connect to cluster", "ERROR")
        log("Make sure you have access to the dev-sherif project", "WARNING")

def scale_gpu_up():
    """Scale up GPU nodes - COSTLY OPERATION!"""
    log("⚠️ SCALING UP GPU NODES - THIS WILL INCUR COSTS!", "WARNING")
    
    # Confirm context
    current_context = run_command("kubectl config current-context", check=False)
    if config.context not in current_context.stdout:
        log(f"Wrong context! Expected {config.context}", "ERROR")
        return
    
    # Scale up pools
    pools = [
        (config.pool_1, config.pool_1_size_up, "Standard nodes"),
        (config.pool_7, config.pool_7_size_up, "GPU nodes")
    ]
    
    for pool_name, size, desc in pools:
        if size > 0:
            log(f"Scaling {desc} ({pool_name}) to {size} nodes...", "INFO")
            # europe-west4 is a region, so pass it with --region (as in connect_to_cluster)
            cmd = f"gcloud container clusters resize {config.cluster_name} --node-pool {pool_name} --num-nodes {size} --region {config.region} --project {config.project} --quiet"
            result = run_command(cmd)
            if result.returncode == 0:
                log(f"{desc} scaled to {size}", "SUCCESS")
            else:
                log(f"Failed to scale {desc}", "ERROR")
    
    log("Waiting for nodes to be ready...", "INFO")
    time.sleep(60)
    
    # Check node status
    # -w keeps "NotReady" nodes out of the count; --no-headers drops the header row
    result = run_command("kubectl get nodes --no-headers | grep -w Ready | wc -l")
    ready_nodes = int(result.stdout.strip()) if result.stdout.strip().isdigit() else 0
    log(f"Ready nodes: {ready_nodes}", "INFO")
    
    # Check GPU availability
    result = run_command("kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable[\"nvidia.com/gpu\"] != null) | .metadata.name'")
    if result.stdout.strip():
        log(f"GPU nodes available: {result.stdout.strip()}", "SUCCESS")
    else:
        log("No GPU nodes found!", "WARNING")

def scale_gpu_down():
    """Scale down ALL nodes to save costs"""
    log("💰 SCALING DOWN ALL NODES TO SAVE COSTS", "WARNING")
    
    # Confirm context
    current_context = run_command("kubectl config current-context", check=False)
    if config.context not in current_context.stdout:
        log(f"Wrong context! Expected {config.context}", "ERROR")
        return
    
    # Scale down all pools
    pools = [config.pool_1, config.pool_4, config.pool_7]
    
    for pool_name in pools:
        log(f"Scaling down {pool_name} to 0 nodes...", "INFO")
        # europe-west4 is a region, so pass it with --region (as in connect_to_cluster)
        cmd = f"gcloud container clusters resize {config.cluster_name} --node-pool {pool_name} --num-nodes 0 --region {config.region} --project {config.project} --quiet"
        result = run_command(cmd)
        if result.returncode == 0:
            log(f"{pool_name} scaled to 0", "SUCCESS")
        else:
            log(f"Failed to scale down {pool_name}", "ERROR")
    
    log("All nodes scaled down - cluster is now cost-effective", "SUCCESS")

def check_cluster_status():
    """Check current cluster status"""
    log("Checking cluster status...", "INFO")
    
    # Check nodes
    result = run_command("kubectl get nodes")
    if result.returncode == 0:
        display(Code(result.stdout, language='text'))
    
    # Check GPU resources
    result = run_command("kubectl describe nodes | grep -E 'nvidia.com/gpu|Allocatable:' -A 5 | grep nvidia")
    if result.stdout:
        log("GPU resources found:", "SUCCESS")
        display(Code(result.stdout, language='text'))
    else:
        log("No GPU resources available", "WARNING")

# Create cluster management interface
display(Markdown("""
### 🎮 Cluster Management Commands

Run these cells in order:
1. **Connect**: `connect_to_cluster()`
2. **Scale Up**: `scale_gpu_up()` - ⚠️ INCURS COSTS
3. **Check Status**: `check_cluster_status()`
4. **Scale Down**: `scale_gpu_down()` - 💰 SAVES COSTS

**Remember**: Always scale down when finished!
"""))

## 🔗 Step 1: Connect to GPU Cluster

First, let's connect to the cluster:

In [None]:
# Connect to the GPU cluster
connect_to_cluster()

## ⚡ Step 2: Scale Up GPU Nodes (When Needed)

**⚠️ WARNING**: This will incur GPU costs! Only run when you need to deploy LLMs.

In [None]:
# UNCOMMENT TO SCALE UP - THIS COSTS MONEY!
# scale_gpu_up()

In [None]:
# Check cluster status
check_cluster_status()
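
The status checks in this notebook rely on `grep` and `jq` over `kubectl` text output. A more robust option is to parse the output of `kubectl get nodes -o json` directly in Python. The sketch below assumes the standard Kubernetes node list structure (`status.conditions`, `status.allocatable`); the function names are hypothetical.

```python
from typing import Any, Dict, List


def ready_node_names(node_list: Dict[str, Any]) -> List[str]:
    """Names of nodes whose Ready condition has status "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            names.append(node["metadata"]["name"])
    return names


def gpu_node_names(node_list: Dict[str, Any]) -> List[str]:
    """Names of nodes that report at least one allocatable nvidia.com/gpu."""
    names = []
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resource quantities as strings, e.g. "1"
        if int(allocatable.get("nvidia.com/gpu", "0")) > 0:
            names.append(node["metadata"]["name"])
    return names
```

The input would come from something like `json.loads(run_command("kubectl get nodes -o json").stdout)`, and polling these functions in a loop could replace the fixed 60-second sleep after scaling up.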

## 🤖 LLM Deployment Options

We'll demonstrate three approaches:
1. **API-based Model** - No GPU required, uses external API
2. **Local GPU Model** - Runs on GPU nodes
3. **Optimized Prompt Runtime** - Cached model with prompt runtime

In [None]:
# LLM deployment configuration
@dataclass
class LLMConfig:
    namespace: str = "llm-demo"
    model_name: str = "llama2-7b"
    api_model: str = "gpt-3.5-turbo"  # For API-based deployment
    
llm_config = LLMConfig()

# Create namespace
run_command(f"kubectl create namespace {llm_config.namespace} --dry-run=client -o yaml | kubectl apply -f -")
run_command(f"kubectl label namespace {llm_config.namespace} istio-injection=enabled --overwrite")

### Option 1: API-Based LLM (No GPU Required)

In [None]:
# Deploy API-based LLM server
api_server_yaml = f"""
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: llm-api-server
  namespace: {llm_config.namespace}
spec:
  replicas: 2
  serverConfig: mlserver
  extraEnv:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: openai-secret
        key: api-key
"""

# Deploy API-based model
api_model_yaml = f"""
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: loan-approval-api
  namespace: {llm_config.namespace}
spec:
  storageUri: gs://seldon-models/llm/loan-approval-api
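  # Seldon Core 2 matches `requirements` against Server capabilities,
  # so a Server must advertise these names for the model to be scheduled.
  requirements:
  - openai
  - langchain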
  memory: 2Gi
  env:
  - name: MODEL_TYPE
    value: "api"
  - name: API_MODEL
    value: "{llm_config.api_model}"
"""

log("Deploying API-based LLM (no GPU required)...", "INFO")

# Save and apply configurations
with open('/tmp/llm-api-server.yaml', 'w') as f:
    f.write(api_server_yaml)
with open('/tmp/llm-api-model.yaml', 'w') as f:
    f.write(api_model_yaml)

# Note: You need to create the secret with your API key
display(Markdown("""
### 🔑 Create API Key Secret

Before deploying, create a secret with your OpenAI API key:

```bash
kubectl create secret generic openai-secret \
  --from-literal=api-key=YOUR_OPENAI_API_KEY \
  -n llm-demo
```
"""))

### Option 2: Local GPU LLM Deployment

In [None]:
# Deploy GPU-based LLM server
gpu_server_yaml = f"""
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: llm-gpu-server
  namespace: {llm_config.namespace}
spec:
  replicas: 1
  serverConfig: triton
  resources:
    requests:
      nvidia.com/gpu: 1
      memory: 16Gi
      cpu: 4
    limits:
      nvidia.com/gpu: 1
      memory: 32Gi
      cpu: 8
  nodeSelector:
    cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
"""

# Deploy local GPU model
gpu_model_yaml = f"""
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama2-7b-gpu
  namespace: {llm_config.namespace}
spec:
  storageUri: gs://seldon-models/llm/llama2-7b-chat
  requirements:
  - transformers
  - torch
  - accelerate
  memory: 16Gi
  env:
  - name: MODEL_TYPE
    value: "local"
  - name: LOAD_IN_8BIT
    value: "true"
  - name: DEVICE_MAP
    value: "auto"
"""

log("Deploying GPU-based LLM (requires GPU nodes)...", "INFO")

# Save configurations
with open('/tmp/llm-gpu-server.yaml', 'w') as f:
    f.write(gpu_server_yaml)
with open('/tmp/llm-gpu-model.yaml', 'w') as f:
    f.write(gpu_model_yaml)

### Option 3: Optimized Prompt Runtime (Best Performance)

In [None]:
# Deploy optimized prompt runtime
prompt_runtime_yaml = f"""
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama2-prompt-runtime
  namespace: {llm_config.namespace}
spec:
  storageUri: gs://seldon-models/llm/llama2-7b-chat
  requirements:
  - transformers
  - torch
  - accelerate
  memory: 16Gi
  runtime: prompt-runtime  # Special runtime for optimized prompt handling
  env:
  - name: MODEL_TYPE
    value: "prompt-optimized"
  - name: LOAD_IN_8BIT
    value: "true"
  - name: CACHE_MODEL
    value: "true"
  - name: MAX_BATCH_SIZE
    value: "8"
  - name: MAX_SEQUENCE_LENGTH
    value: "2048"
"""

# Deploy loan approval pipeline with prompt runtime
pipeline_yaml = f"""
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: loan-approval-pipeline
  namespace: {llm_config.namespace}
spec:
  steps:
  - name: prompt-builder
    implementation: PROMPT_BUILDER
    parameters:
      template: |
        You are a loan approval assistant. Based on the following application details, 
        provide a decision (APPROVED/DENIED) and explanation.
        
        Application Details:
        {{application_details}}
        
        Decision:
  - name: llama2-prompt-runtime
    inputs: [prompt-builder.outputs]
  - name: response-parser
    implementation: RESPONSE_PARSER
    inputs: [llama2-prompt-runtime.outputs]
    parameters:
      extract_fields:
      - decision
      - explanation
  output:
    steps: [response-parser]
"""

log("Deploying optimized prompt runtime pipeline...", "INFO")

with open('/tmp/prompt-runtime.yaml', 'w') as f:
    f.write(prompt_runtime_yaml)
with open('/tmp/loan-pipeline.yaml', 'w') as f:
    f.write(pipeline_yaml)

## 🚀 Deploy Selected Configuration

In [None]:
# Choose deployment option
deployment_option = "api"  # Change to "gpu" or "prompt-runtime" as needed

if deployment_option == "api":
    log("Deploying API-based solution (no GPU required)", "INFO")
    run_command("kubectl apply -f /tmp/llm-api-server.yaml")
    run_command("kubectl apply -f /tmp/llm-api-model.yaml")
    
elif deployment_option == "gpu":
    log("Deploying GPU-based solution", "INFO")
    run_command("kubectl apply -f /tmp/llm-gpu-server.yaml")
    run_command("kubectl apply -f /tmp/llm-gpu-model.yaml")
    
elif deployment_option == "prompt-runtime":
    log("Deploying prompt runtime solution", "INFO")
    run_command("kubectl apply -f /tmp/llm-gpu-server.yaml")  # Still needs GPU server
    run_command("kubectl apply -f /tmp/prompt-runtime.yaml")
    run_command("kubectl apply -f /tmp/loan-pipeline.yaml")

# Wait for deployment
log("Waiting for deployment to be ready...", "INFO")
time.sleep(60)

# Check deployment status
run_command(f"kubectl get all -n {llm_config.namespace}")
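
The `if`/`elif` chain above silently does nothing when `deployment_option` is misspelled. A lookup table makes the mapping explicit and fails loudly on unknown values. This is a sketch only: `MANIFESTS`, `manifests_for`, and `apply_commands` are hypothetical names, and the file paths are the ones written earlier in this notebook.

```python
from typing import Dict, List

# Manifest files per deployment option, in the order they should be applied
MANIFESTS: Dict[str, List[str]] = {
    "api": ["/tmp/llm-api-server.yaml", "/tmp/llm-api-model.yaml"],
    "gpu": ["/tmp/llm-gpu-server.yaml", "/tmp/llm-gpu-model.yaml"],
    "prompt-runtime": [
        "/tmp/llm-gpu-server.yaml",  # the prompt runtime still needs the GPU server
        "/tmp/prompt-runtime.yaml",
        "/tmp/loan-pipeline.yaml",
    ],
}


def manifests_for(option: str) -> List[str]:
    """Return the manifest paths for an option, or raise on an unknown one."""
    try:
        return MANIFESTS[option]
    except KeyError:
        valid = ", ".join(sorted(MANIFESTS))
        raise ValueError(f"Unknown option {option!r}; expected one of: {valid}") from None


def apply_commands(option: str) -> List[str]:
    """Build the kubectl apply commands for an option without running them."""
    return [f"kubectl apply -f {path}" for path in manifests_for(option)]
```

Each command returned by `apply_commands(deployment_option)` can then be passed to `run_command`.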

## 🧪 Test LLM Inference

In [None]:
def test_loan_approval(application_data: dict, model_name: str = "loan-approval-api"):
    """Test loan approval inference"""
    
    # Get gateway endpoint
    result = run_command("kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'")
    gateway_ip = result.stdout.strip() or "localhost"
    
    url = f"http://{gateway_ip}/v2/models/{model_name}/infer"
    
    # Prepare request
    payload = {
        "inputs": [{
            "name": "application",
            "shape": [1],
            "datatype": "BYTES",
            "data": [json.dumps(application_data)]
        }]
    }
    
    headers = {
        "Content-Type": "application/json",
        "Seldon-Model": model_name
    }
    
    log(f"Testing loan approval with {model_name}...", "INFO")
    
    try:
        start_time = time.time()
        response = requests.post(url, json=payload, headers=headers, timeout=30)
        latency = (time.time() - start_time) * 1000
        
        if response.status_code == 200:
            result = response.json()
            log(f"Inference successful! Latency: {latency:.0f}ms", "SUCCESS")
            
            # Extract decision
            try:
                outputs = result.get("outputs", [{}])[0]
                decision_data = outputs.get("data", [{}])[0]
                display(Markdown(f"""
### 📋 Loan Decision

**Application**: {application_data.get('applicant_name', 'Unknown')}
**Decision**: {decision_data.get('decision', 'PENDING')}
**Explanation**: {decision_data.get('explanation', 'No explanation provided')}
**Processing Time**: {latency:.0f}ms
"""))
            except Exception:
                # Output was not in the expected shape; show the raw response instead
                display(Code(json.dumps(result, indent=2), language='json'))
        else:
            log(f"Inference failed: {response.status_code}", "ERROR")
            print(response.text)
            
    except Exception as e:
        log(f"Error during inference: {str(e)}", "ERROR")

# Test application
test_application = {
    "applicant_name": "John Doe",
    "annual_income": 75000,
    "credit_score": 720,
    "loan_amount": 250000,
    "loan_purpose": "home_purchase",
    "employment_years": 5,
    "debt_to_income_ratio": 0.35
}

# Test the deployment
test_loan_approval(test_application)
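
In the V2 inference protocol, a `BYTES` output arrives as a list of strings, so the decision usually needs to be JSON-decoded before its fields can be read. The sketch below shows one way to do that defensively. `extract_decision` is a hypothetical helper, and it assumes the model returns a JSON object with `decision` and `explanation` keys as its first output element.

```python
import json
from typing import Any, Dict


def extract_decision(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull a decision dict out of a V2 inference response body."""
    outputs = response.get("outputs") or []
    if not outputs or not outputs[0].get("data"):
        return {"decision": "PENDING", "explanation": "No output returned"}

    raw = outputs[0]["data"][0]
    if isinstance(raw, dict):
        # Already decoded by the client or server
        return raw
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON: treat the raw text as the explanation
        return {"decision": "PENDING", "explanation": str(raw)}
    if isinstance(parsed, dict):
        return parsed
    return {"decision": "PENDING", "explanation": str(parsed)}
```

Calling `extract_decision(response.json())` inside the success branch would let the Markdown summary read `decision` and `explanation` without relying on a broad exception handler.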

## 📊 Production Patterns: Model HPA + Server Native Autoscaling

The recommended approach for production LLM deployments:

In [None]:
# Production autoscaling configuration
autoscaling_yaml = f"""
# Model-level HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-model-hpa
  namespace: {llm_config.namespace}
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: llama2-7b-gpu
  minReplicas: 1
  maxReplicas: 3
  metrics:
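  # HPA Resource metrics cover only cpu and memory, so GPU utilization must be
  # a custom metric. Assumes a metrics adapter exposes DCGM_FI_DEV_GPU_UTIL per pod.
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"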
  - type: Pods
    pods:
      metric:
        name: inference_queue_size
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
---
# Server native autoscaling
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: llm-autoscaling-server
  namespace: {llm_config.namespace}
spec:
  replicas: 2
  serverConfig: triton
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 5
    metrics:
    - type: gpu
      targetUtilization: 80
    - type: memory
      targetUtilization: 70
  resources:
    requests:
      nvidia.com/gpu: 1
      memory: 16Gi
    limits:
      nvidia.com/gpu: 1
      memory: 32Gi
"""

display(Markdown("""
### 🚀 Production Autoscaling Strategy

**Recommended Pattern**: Model HPA + Server Native Autoscaling

1. **Model-level HPA**:
   - Scales based on GPU utilization and queue size
   - Fast scale-up (50% increase per minute)
   - Conservative scale-down (1 pod every 2 minutes)

2. **Server Native Autoscaling**:
   - Built-in server scaling based on resource metrics
   - Handles infrastructure-level scaling
   - Works in tandem with model HPA

3. **Benefits**:
   - Cost-efficient GPU utilization
   - Responsive to load changes
   - Prevents resource starvation
   - Smooth scaling behavior
"""))

# Save autoscaling config
with open('/tmp/llm-autoscaling.yaml', 'w') as f:
    f.write(autoscaling_yaml)
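
To get a feel for how the HPA reacts to the queue-size target, the core rule from the Kubernetes documentation is `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`, clamped to the configured minimum and maximum. The sketch below implements only that rule. It deliberately ignores the tolerance band, stabilization windows, and scaling policies, and `desired_replicas` is a hypothetical helper name.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Core HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# Two replicas, average queue size 25 against a target of 10:
# the raw result is 5, capped at maxReplicas=3
print(desired_replicas(2, 25, 10, min_replicas=1, max_replicas=3))
```

In the configuration above the `behavior` section further limits how fast the replica count moves toward this value: at most 50% more pods per minute on the way up and one pod every two minutes on the way down.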

## 🔍 Monitoring LLM Performance

In [None]:
# LLM-specific monitoring queries
monitoring_queries = {
    "GPU Utilization": f'nvidia_gpu_utilization{{namespace="{llm_config.namespace}"}}',
    "GPU Memory Usage": f'nvidia_gpu_memory_used_bytes{{namespace="{llm_config.namespace}"}} / nvidia_gpu_memory_total_bytes{{namespace="{llm_config.namespace}"}} * 100',
    "Model Inference Latency P95": f'histogram_quantile(0.95, rate(seldon_model_infer_duration_seconds_bucket{{namespace="{llm_config.namespace}"}}[5m]))',
    "Token Generation Rate": f'rate(llm_tokens_generated_total{{namespace="{llm_config.namespace}"}}[5m])',
    "Queue Size": f'inference_queue_size{{namespace="{llm_config.namespace}"}}',
    "Active Requests": f'llm_active_requests{{namespace="{llm_config.namespace}"}}'
}

display(Markdown("### 📊 LLM Monitoring Queries"))
for name, query in monitoring_queries.items():
    display(Markdown(f"**{name}**:"))
    display(Code(query, language='promql'))

# Check current metrics
def check_llm_metrics():
    """Check current LLM metrics"""
    log("Checking LLM metrics...", "INFO")
    
    # GPU metrics
    result = run_command("kubectl top nodes | grep gpu")
    if result.stdout:
        display(Markdown("### GPU Node Resources"))
        display(Code(result.stdout, language='text'))
    
    # Pod metrics
    result = run_command(f"kubectl top pods -n {llm_config.namespace}")
    if result.stdout:
        display(Markdown("### Pod Resources"))
        display(Code(result.stdout, language='text'))

check_llm_metrics()
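
The PromQL strings above are only displayed, not executed. If a Prometheus server is reachable, they can be sent to its instant-query endpoint, `/api/v1/query`. The sketch below builds the request URL and parses the standard response envelope. The base URL is an assumption (no Prometheus address appears in this notebook), and both function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """URL for a Prometheus instant query; base_url is an assumed address."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: Dict[str, Any]) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    if body.get("status") != "success":
        raise ValueError(f"Query failed: {body.get('error', 'unknown error')}")
    data = body.get("data", {})
    if data.get("resultType") != "vector":
        raise ValueError(f"Expected a vector, got {data.get('resultType')!r}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}
    return [(sample.get("metric", {}), float(sample["value"][1]))
            for sample in data.get("result", [])]
```

A call such as `requests.get(build_query_url(prometheus_url, query), timeout=10).json()` would produce the `body` argument, where `prometheus_url` is whatever address your Prometheus instance is exposed on.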

## 💰 Step 3: Scale Down GPU Nodes (IMPORTANT!)

**⚠️ CRITICAL**: Always scale down GPU nodes when finished to avoid unnecessary costs!

In [None]:
# ALWAYS RUN THIS WHEN FINISHED!
scale_gpu_down()

display(Markdown("""
### ✅ Cluster Scaled Down

GPU nodes have been scaled to 0 to save costs.

**Next time you need GPUs**:
1. Run `connect_to_cluster()`
2. Run `scale_gpu_up()`
3. Deploy your models
4. Run `scale_gpu_down()` when finished

**Cost Tracking**: Check costs in [GCP Console](https://console.cloud.google.com/kubernetes/clusters/details/europe-west4/llm-demos-alex/nodes?project=dev-sherif)
"""))

## 📚 Additional Resources

### Demo Repository
Full loan approval demo with all three approaches:
https://github.com/SeldonIO/customer-success/tree/master/tutorials/llm-module/demos/loan-approval-decision-system

### Best Practices
1. **Always use API models for development** - No GPU costs
2. **Test with small batches** before scaling up
3. **Monitor GPU memory** - LLMs can OOM easily
4. **Use quantization** (8-bit) to fit larger models
5. **Implement request queuing** for burst handling
6. **Set up alerts** for GPU utilization and costs
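
As a rough illustration of points 3 and 4, the memory needed just to hold model weights is the parameter count times the bytes per parameter. The sketch below applies that arithmetic to a 7-billion-parameter model. It covers weights only: activations, the KV cache, and framework overhead come on top, so treat the numbers as a lower bound. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


params_7b = 7e9
for bits in (32, 16, 8):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(params_7b, bits):.1f} GiB")
```

At 16 bits the weights of a 7B model take about 13 GiB, which leaves little headroom on a 16 GB T4 (the accelerator named in the GPU server's node selector), while 8-bit quantization roughly halves that to about 6.5 GiB.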

### Troubleshooting
- **GPU not available**: Check node pool status and tolerations
- **OOM errors**: Reduce batch size or enable 8-bit quantization
- **Slow inference**: Check GPU utilization and queue depth
- **High costs**: Ensure nodes are scaled down when not in use