# Lab 3.3.7: Production API with FastAPI

**Module:** 3.3 - Model Deployment & Inference Engines  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Build a production-ready LLM API with FastAPI
- [ ] Implement Server-Sent Events (SSE) for streaming responses
- [ ] Add health checks, rate limiting, and monitoring
- [ ] Deploy with proper error handling and logging

---

## 📚 Prerequisites

- Completed: Labs 3.3.1-3.3.6
- Knowledge of: Python async programming, REST APIs, HTTP
- Having: At least one inference engine running (Ollama, vLLM, SGLang)

---

## 🌍 Real-World Context

**The Problem:** You have an inference engine running, but:
- It's not exposed securely to the internet
- No rate limiting = one user can DoS your system
- No monitoring = you don't know when it's failing
- No authentication = anyone can use your expensive GPU

**The Solution:** A production API layer that:
- Handles authentication and rate limiting
- Provides streaming for real-time responses
- Monitors health and performance
- Gracefully handles errors

**Real Impact:**
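- Major LLM providers such as OpenAI and Anthropic put a similar API layer in front of their models
- The building blocks covered here (authentication, rate limiting, streaming, monitoring) are the same ones production LLM APIs rely on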

---

## 🧒 ELI5: What is a Production API?

> **Imagine you built an amazing lemonade stand...**
>
> Your lemonade (LLM) is great! But right now:
> - Anyone can walk up and take infinite lemonade (no rate limiting)
> - You don't know how much lemonade you've served (no monitoring)
> - If you run out of lemons, you just stare blankly (no error handling)
> - Anyone can claim to be a paying customer (no authentication)
>
> A **production API** is like hiring a professional manager:
> - They check customers' membership cards (authentication)
> - They limit each customer to 5 cups per hour (rate limiting)
> - They track sales and inventory (monitoring)
> - They politely explain when you're out of lemons (error handling)
> - They pour lemonade into cups as it's ready (streaming)
>
> **In AI terms:** FastAPI helps us build a professional "manager" that sits between
> users and our inference engine, handling all the production concerns.

---

## Part 1: Setting Up FastAPI

In [None]:
# Install required packages (run once)
# !pip install fastapi uvicorn sse-starlette python-multipart aiohttp requests

# Standard imports
import asyncio
import json
import os
import sys
import time
import logging
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional, AsyncGenerator
from dataclasses import dataclass, field
import warnings
warnings.filterwarnings('ignore')

# Third-party imports
import requests
import aiohttp

print("✅ Imports successful!")

In [None]:
# Production API architecture
print("""
📊 PRODUCTION LLM API ARCHITECTURE
==================================================

┌──────────────────────────────────────────────────────────────────┐
│                         CLIENTS                                  │
│    (Web Apps, Mobile Apps, CLI Tools, Other Services)            │
└──────────────────────────┬───────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                    LOAD BALANCER (Optional)                      │
│              (nginx, HAProxy, or cloud LB)                       │
└──────────────────────────┬───────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                     FastAPI APPLICATION                          │
├──────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐               │
│  │   Auth      │  │ Rate Limit  │  │  Logging    │               │
│  │ Middleware  │  │ Middleware  │  │ Middleware  │               │
│  └─────────────┘  └─────────────┘  └─────────────┘               │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                    API ENDPOINTS                           │  │
│  │  POST /v1/chat/completions    (OpenAI compatible)          │  │
│  │  GET  /health                 (Health check)               │  │
│  │  GET  /metrics                (Prometheus metrics)         │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────┬───────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────┐
│                   INFERENCE ENGINE                               │
│          (vLLM, SGLang, TensorRT-LLM, Ollama)                    │
└──────────────────────────────────────────────────────────────────┘
""")

---
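The layering in the Part 1 architecture diagram can be sketched in plain Python as a chain of checks that run before the endpoint logic. This is a framework-free illustration only: `handle_request`, `VALID_KEYS`, and `MAX_REQUESTS` are hypothetical names and values chosen for the example, and the rate-limit step has no time window.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

VALID_KEYS = {"demo-key"}   # hypothetical key store for the example
MAX_REQUESTS = 2            # hypothetical per-client budget
_request_counts: Dict[str, int] = defaultdict(int)


def handle_request(api_key: Optional[str], client_id: str, prompt: str) -> Tuple[int, str]:
    """Run the checks in the same order as the diagram: auth, rate limit, endpoint."""
    # 1. Auth layer: reject unknown keys before doing any other work.
    if api_key not in VALID_KEYS:
        return 401, "Invalid API key"

    # 2. Rate-limit layer: count requests per client (no time window in this toy version).
    if _request_counts[client_id] >= MAX_REQUESTS:
        return 429, "Rate limit exceeded"
    _request_counts[client_id] += 1

    # 3. Endpoint: a stand-in for forwarding the prompt to the inference engine.
    return 200, f"echo: {prompt}"
```

Rejected requests never reach the later layers, which is why authentication sits in front of rate limiting: an unauthenticated caller does not consume anyone's budget.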

## Part 2: Building the API Server

Let's build a complete production API. We'll create the code here, then save it to a file.

In [None]:
# API server code
api_server_code = '''
"""
Production LLM API Server

A production-ready FastAPI server for LLM inference with:
- OpenAI-compatible API endpoints
- Server-Sent Events (SSE) streaming
- Rate limiting
- Health checks
- Request logging
- Error handling

Usage:
    uvicorn api_server:app --host 0.0.0.0 --port 8080 --workers 1

Environment Variables:
    BACKEND_URL: URL of the inference backend (default: http://localhost:8000)
    API_KEY: API key that clients must present (optional; authentication is disabled if unset)
    RATE_LIMIT: Requests per minute per client (default: 60)
"""

import asyncio
import json
import logging
import os
import time
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, AsyncGenerator, Dict, List, Optional

import aiohttp
from fastapi import FastAPI, HTTPException, Request, Depends
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from pydantic import BaseModel, Field
from sse_starlette.sse import EventSourceResponse

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Configuration
BACKEND_URL = os.getenv("BACKEND_URL", "http://localhost:8000")
API_KEY = os.getenv("API_KEY", None)
RATE_LIMIT = int(os.getenv("RATE_LIMIT", "60"))


# ============================================================================
# Pydantic Models (OpenAI-compatible)
# ============================================================================

class Message(BaseModel):
    role: str = Field(..., description="Role: system, user, or assistant")
    content: str = Field(..., description="Message content")


class ChatCompletionRequest(BaseModel):
    model: str = Field(default="default", description="Model to use")
    messages: List[Message] = Field(..., description="Conversation messages")
    max_tokens: int = Field(default=512, ge=1, le=4096, description="Max tokens to generate")
    temperature: float = Field(default=0.7, ge=0.0, le=2.0, description="Sampling temperature")
    top_p: float = Field(default=0.9, ge=0.0, le=1.0, description="Nucleus sampling")
    stream: bool = Field(default=False, description="Enable streaming")
    stop: Optional[List[str]] = Field(default=None, description="Stop sequences")


class HealthResponse(BaseModel):
    status: str
    backend_status: str
    uptime_seconds: float
    total_requests: int
    active_requests: int


# ============================================================================
# Metrics & Rate Limiting
# ============================================================================

@dataclass
class ServerMetrics:
    """Track server metrics."""
    start_time: float = field(default_factory=time.time)
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    active_requests: int = 0
    total_tokens_generated: int = 0
    total_latency_ms: float = 0.0
    
    @property
    def uptime_seconds(self) -> float:
        return time.time() - self.start_time
    
    @property
    def avg_latency_ms(self) -> float:
        if self.successful_requests == 0:
            return 0.0
        return self.total_latency_ms / self.successful_requests


class RateLimiter:
    """Simple in-memory rate limiter."""
    
    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.requests: Dict[str, List[float]] = defaultdict(list)
    
    def is_allowed(self, client_id: str) -> bool:
        """Check if client is within rate limit."""
        now = time.time()
        minute_ago = now - 60
        
        # Clean old requests
        self.requests[client_id] = [
            t for t in self.requests[client_id] if t > minute_ago
        ]
        
        if len(self.requests[client_id]) >= self.requests_per_minute:
            return False
        
        self.requests[client_id].append(now)
        return True
    
    def get_remaining(self, client_id: str) -> int:
        """Get remaining requests for client."""
        now = time.time()
        minute_ago = now - 60
        current = len([t for t in self.requests[client_id] if t > minute_ago])
        return max(0, self.requests_per_minute - current)


# Global instances
metrics = ServerMetrics()
rate_limiter = RateLimiter(RATE_LIMIT)


# ============================================================================
# FastAPI App
# ============================================================================

app = FastAPI(
    title="LLM Inference API",
    description="Production-ready API for LLM inference",
    version="1.0.0",
    docs_url="/docs",
    redoc_url="/redoc",
)

# CORS middleware (wide open for local development; restrict allow_origins in production)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


# ============================================================================
# Dependencies
# ============================================================================

async def verify_api_key(request: Request) -> str:
    """Verify API key if configured."""
    if API_KEY is None:
        return "anonymous"
    
    auth_header = request.headers.get("Authorization", "")
    if not auth_header.startswith("Bearer "):
        raise HTTPException(
            status_code=401,
            detail="Missing or invalid Authorization header"
        )
    
    token = auth_header[7:]
    if token != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    
    return token[:8]  # Return truncated for logging


async def check_rate_limit(request: Request):
    """Check rate limit for client (keyed by client IP)."""
    client_id = request.client.host if request.client else "unknown"

    if not rate_limiter.is_allowed(client_id):
        # The window is 60 seconds, so the client should wait up to a minute.
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded. Try again later.",
            headers={"Retry-After": "60"}
        )


# ============================================================================
# Backend Communication
# ============================================================================

async def check_backend_health() -> bool:
    """Check if backend is healthy."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                f"{BACKEND_URL}/v1/models",
                timeout=aiohttp.ClientTimeout(total=5)
            ) as response:
                return response.status == 200
    except Exception:
        return False


async def stream_from_backend(
    request_data: Dict[str, Any]
) -> AsyncGenerator[str, None]:
    """Stream SSE payloads from the backend.

    Yields the payload of each backend "data:" line without the prefix,
    because EventSourceResponse adds its own "data: " framing.
    """
    try:
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{BACKEND_URL}/v1/chat/completions",
                json=request_data,
                timeout=aiohttp.ClientTimeout(total=120)
            ) as response:
                if response.status != 200:
                    error_text = await response.text()
                    # json.dumps escapes quotes and newlines in the error text
                    yield json.dumps({"error": error_text})
                    return

                async for line in response.content:
                    text = line.decode("utf-8").strip()
                    if text.startswith("data:"):
                        yield text[5:].strip()

    except asyncio.TimeoutError:
        yield json.dumps({"error": "Request timeout"})
    except Exception as e:
        yield json.dumps({"error": str(e)})


async def forward_to_backend(request_data: Dict[str, Any]) -> Dict[str, Any]:
    """Forward non-streaming request to backend."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{BACKEND_URL}/v1/chat/completions",
                json=request_data,
                timeout=aiohttp.ClientTimeout(total=120)
            ) as response:
                if response.status != 200:
                    error_text = await response.text()
                    raise HTTPException(
                        status_code=response.status,
                        detail=f"Backend error: {error_text}"
                    )
                return await response.json()
                
    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail="Backend timeout")
    except aiohttp.ClientError as e:
        raise HTTPException(status_code=502, detail=f"Backend error: {str(e)}")


# ============================================================================
# API Endpoints
# ============================================================================

@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint."""
    backend_healthy = await check_backend_health()
    
    return HealthResponse(
        status="healthy" if backend_healthy else "degraded",
        backend_status="connected" if backend_healthy else "disconnected",
        uptime_seconds=metrics.uptime_seconds,
        total_requests=metrics.total_requests,
        active_requests=metrics.active_requests
    )


@app.get("/metrics")
async def get_metrics():
    """Get server metrics."""
    return {
        "uptime_seconds": metrics.uptime_seconds,
        "total_requests": metrics.total_requests,
        "successful_requests": metrics.successful_requests,
        "failed_requests": metrics.failed_requests,
        "active_requests": metrics.active_requests,
        "total_tokens_generated": metrics.total_tokens_generated,
        "avg_latency_ms": metrics.avg_latency_ms
    }


@app.get("/v1/models")
async def list_models():
    """List available models (OpenAI compatible)."""
    return {
        "object": "list",
        "data": [
            {
                "id": "default",
                "object": "model",
                "created": int(metrics.start_time),
                "owned_by": "local"
            }
        ]
    }


@app.post("/v1/chat/completions")
async def chat_completions(
    request: ChatCompletionRequest,
    raw_request: Request,
    _: str = Depends(verify_api_key),
    __: None = Depends(check_rate_limit)
):
    """Chat completions endpoint (OpenAI compatible)."""
    request_id = str(uuid.uuid4())[:8]
    start_time = time.time()
    
    # Update metrics
    metrics.total_requests += 1
    metrics.active_requests += 1
    
    client_ip = raw_request.client.host if raw_request.client else "unknown"
    logger.info(f"[{request_id}] Request from {client_ip}: {len(request.messages)} messages")
    
    try:
        # Prepare request for backend
        backend_request = {
            "model": request.model,
            "messages": [m.dict() for m in request.messages],
            "max_tokens": request.max_tokens,
            "temperature": request.temperature,
            "top_p": request.top_p,
            "stream": request.stream
        }
        
        if request.stop:
            backend_request["stop"] = request.stop
        
        if request.stream:
            # Streaming response
            async def event_generator():
                try:
                    async for chunk in stream_from_backend(backend_request):
                        yield chunk
                finally:
                    metrics.active_requests -= 1
                    metrics.successful_requests += 1
                    latency = (time.time() - start_time) * 1000
                    metrics.total_latency_ms += latency
                    logger.info(f"[{request_id}] Streaming completed in {latency:.0f}ms")
            
            return EventSourceResponse(event_generator(), media_type="text/event-stream")
        
        else:
            # Non-streaming response
            response = await forward_to_backend(backend_request)
            
            # Update metrics
            latency = (time.time() - start_time) * 1000
            metrics.successful_requests += 1
            metrics.total_latency_ms += latency
            
            usage = response.get("usage", {})
            metrics.total_tokens_generated += usage.get('completion_tokens', 0)
            
            completion_tokens = usage.get('completion_tokens', 0)
            logger.info(f"[{request_id}] Completed in {latency:.0f}ms, {completion_tokens} tokens")
            
            return response
            
    except HTTPException:
        metrics.failed_requests += 1
        raise
    except Exception as e:
        metrics.failed_requests += 1
        logger.error(f"[{request_id}] Error: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e))
    finally:
        if not request.stream:
            metrics.active_requests -= 1


@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
    """Global exception handler."""
    logger.error(f"Unhandled exception: {str(exc)}")
    return JSONResponse(
        status_code=500,
        content={"error": {"message": "Internal server error", "type": "server_error"}}
    )


# ============================================================================
# Startup/Shutdown
# ============================================================================

@app.on_event("startup")
async def startup_event():
    """Run on startup."""
    logger.info("Starting LLM API server...")
    logger.info(f"Backend URL: {BACKEND_URL}")
    logger.info(f"Rate limit: {RATE_LIMIT} requests/minute")
    logger.info(f"API key required: {API_KEY is not None}")
    
    # Check backend
    backend_healthy = await check_backend_health()
    if backend_healthy:
        logger.info("Backend is healthy")
    else:
        logger.warning("Backend is not responding - will retry on requests")


@app.on_event("shutdown")
async def shutdown_event():
    """Run on shutdown."""
    logger.info(f"Shutting down... Total requests served: {metrics.total_requests}")


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
'''

# Save to file
api_path = Path("../api/api_server.py")
api_path.parent.mkdir(exist_ok=True)
api_path.write_text(api_server_code)

print(f"✅ API server code saved to: {api_path.resolve()}")
print(f"\n📝 File size: {len(api_server_code)} bytes")
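
The `RateLimiter` in the server code is keyed by client IP, so every user behind the same NAT or proxy shares one budget. Below is a minimal sketch of the same sliding-window idea keyed by API key instead. `KeyedRateLimiter` and `client_id_from_key` are hypothetical names introduced for this example, and the injectable `clock` parameter is an assumption added so the window can be tested without sleeping.

```python
import hashlib
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class KeyedRateLimiter:
    """Sliding-window limiter keyed by an arbitrary client identifier."""

    def __init__(self, limit: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def is_allowed(self, client_id: str) -> bool:
        now = self.clock()
        hits = self.hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def client_id_from_key(api_key: str) -> str:
    """Derive a stable identifier so the raw key is never stored or logged."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:16]
```

In the server, `check_rate_limit` could pass the hashed bearer token instead of `request.client.host`; that wiring is not shown here.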

In [None]:
# Explain key components
print("""
📊 API SERVER KEY COMPONENTS
==================================================

1. PYDANTIC MODELS
──────────────────
   - ChatCompletionRequest: Validates incoming requests
   - HealthResponse: Health check response structure
   - Automatic OpenAPI documentation

2. RATE LIMITING
────────────────
   - In-memory sliding window counter
   - Per-client (by IP) rate limiting
   - Configurable via RATE_LIMIT env var

3. AUTHENTICATION
─────────────────
   - Bearer token authentication
   - Optional (disabled if API_KEY not set)
   - Standard "Authorization: Bearer <token>" header

4. STREAMING (SSE)
──────────────────
   - Server-Sent Events for real-time responses
   - Compatible with OpenAI client libraries
   - Proper cleanup on disconnect

5. METRICS
──────────
   - Request counts (total, success, failed)
   - Latency tracking
   - Token counting
   - Exposed via /metrics endpoint

6. ERROR HANDLING
─────────────────
   - Structured error responses
   - Proper HTTP status codes
   - Global exception handler
""")

---
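The `/metrics` endpoint in the server returns JSON, while a Prometheus scraper expects a plain-text format of `# HELP`, `# TYPE`, and `name value` lines. Here is a minimal sketch of rendering counters in that format. The metric names in `METRIC_INFO` and the `render_prometheus` helper are hypothetical, chosen only for illustration.

```python
from typing import Dict, Tuple

# metric name -> (type, help text); these names are illustrative assumptions.
METRIC_INFO: Dict[str, Tuple[str, str]] = {
    "llm_api_requests_total": ("counter", "Total requests received"),
    "llm_api_requests_failed_total": ("counter", "Requests that ended in an error"),
    "llm_api_active_requests": ("gauge", "Requests currently in flight"),
}


def render_prometheus(values: Dict[str, float]) -> str:
    """Render metric values in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text) in METRIC_INFO.items():
        if name not in values:
            continue  # only emit metrics we actually have a value for
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {values[name]}")
    return "\n".join(lines) + "\n"


print(render_prometheus({"llm_api_requests_total": 3, "llm_api_active_requests": 1}))
```

In a FastAPI app the resulting string would be returned with a `text/plain` content type rather than as JSON.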

## Part 3: Running and Testing the API

In [None]:
# How to run the API server
print("""
📝 RUNNING THE API SERVER
==================================================

1. Start your inference backend (choose one):
   
   # Ollama (listens on port 11434 by default, so use
   # BACKEND_URL=http://localhost:11434 in step 2)
   ollama serve
   
   # vLLM
   python -m vllm.entrypoints.openai.api_server \\
       --model Qwen/Qwen3-8B-Instruct \\
       --port 8000
   
   # SGLang
   python -m sglang.launch_server \\
       --model-path Qwen/Qwen3-8B-Instruct \\
       --port 8000

2. Start the API server:
   
   cd api/
   
   # Basic (no auth)
   BACKEND_URL=http://localhost:8000 uvicorn api_server:app --port 8080
   
   # With authentication
   BACKEND_URL=http://localhost:8000 \\
   API_KEY=your-secret-key \\
   RATE_LIMIT=30 \\
   uvicorn api_server:app --port 8080

3. Access the API:
   
   # Health check
   curl http://localhost:8080/health
   
   # API docs
   open http://localhost:8080/docs
   
   # Chat completion
   curl http://localhost:8080/v1/chat/completions \\
       -H "Content-Type: application/json" \\
       -H "Authorization: Bearer your-secret-key" \\
       -d '{
           "messages": [{"role": "user", "content": "Hello!"}],
           "max_tokens": 100
       }'
""")

In [None]:
# Test function for the API
def test_api_endpoint(url: str = "http://localhost:8080", api_key: Optional[str] = None):
    """
    Test the production API endpoints.
    """
    print(f"\n🧪 Testing API at {url}...")
    print("="*50)
    
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    
    # Test 1: Health check
    print("\n1. Health check...")
    try:
        response = requests.get(f"{url}/health", timeout=5)
        if response.status_code == 200:
            data = response.json()
            print(f"   ✅ Status: {data['status']}")
            print(f"   Backend: {data['backend_status']}")
            print(f"   Uptime: {data['uptime_seconds']:.0f}s")
        else:
            print(f"   ❌ Status code: {response.status_code}")
    except Exception as e:
        print(f"   ❌ Error: {e}")
        return
    
    # Test 2: List models
    print("\n2. List models...")
    try:
        response = requests.get(f"{url}/v1/models", timeout=5)
        if response.status_code == 200:
            data = response.json()
            print(f"   ✅ Models: {[m['id'] for m in data['data']]}")
        else:
            print(f"   ❌ Status code: {response.status_code}")
    except Exception as e:
        print(f"   ❌ Error: {e}")
    
    # Test 3: Chat completion (non-streaming)
    print("\n3. Chat completion (non-streaming)...")
    try:
        response = requests.post(
            f"{url}/v1/chat/completions",
            headers=headers,
            json={
                "messages": [{"role": "user", "content": "Say 'Hello, World!' and nothing else."}],
                "max_tokens": 20,
                "temperature": 0.1
            },
            timeout=30
        )
        if response.status_code == 200:
            data = response.json()
            content = data['choices'][0]['message']['content']
            print(f"   ✅ Response: {content[:50]}...")
        else:
            print(f"   ❌ Status code: {response.status_code}")
            print(f"   Response: {response.text[:100]}")
    except Exception as e:
        print(f"   ❌ Error: {e}")
    
    # Test 4: Metrics
    print("\n4. Metrics...")
    try:
        response = requests.get(f"{url}/metrics", timeout=5)
        if response.status_code == 200:
            data = response.json()
            print(f"   ✅ Total requests: {data['total_requests']}")
            print(f"   Successful: {data['successful_requests']}")
            print(f"   Avg latency: {data['avg_latency_ms']:.0f}ms")
        else:
            print(f"   ❌ Status code: {response.status_code}")
    except Exception as e:
        print(f"   ❌ Error: {e}")
    
    print("\n" + "="*50)
    print("Tests complete!")


# Uncomment to test (requires API server running)
# test_api_endpoint("http://localhost:8080")

In [None]:
# Python client example
client_code = '''
"""
Example Python client for the LLM API.

This client is compatible with the OpenAI Python library,
so you can also use:

    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-key")
"""

import requests
from typing import Iterator, Optional


class LLMClient:
    """Simple client for the LLM API."""
    
    def __init__(self, base_url: str = "http://localhost:8080", api_key: Optional[str] = None):
        self.base_url = base_url.rstrip("/")
        self.headers = {"Content-Type": "application/json"}
        if api_key:
            self.headers["Authorization"] = f"Bearer {api_key}"
    
    def chat(self, message: str, max_tokens: int = 256, temperature: float = 0.7) -> str:
        """Send a chat message and get a response."""
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            headers=self.headers,
            json={
                "messages": [{"role": "user", "content": message}],
                "max_tokens": max_tokens,
                "temperature": temperature,
                "stream": False
            }
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    
    def stream_chat(self, message: str, max_tokens: int = 256) -> Iterator[str]:
        """Stream a chat response token by token."""
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            headers=self.headers,
            json={
                "messages": [{"role": "user", "content": message}],
                "max_tokens": max_tokens,
                "stream": True
            },
            stream=True
        )
        response.raise_for_status()
        
        for line in response.iter_lines():
            if line:
                line = line.decode("utf-8")
                if line.startswith("data: "):
                    data = line[6:]
                    if data == "[DONE]":
                        break
                    try:
                        import json
                        chunk = json.loads(data)
                        content = chunk.get("choices", [{}])[0].get("delta", {}).get("content", "")
                        if content:
                            yield content
                    except json.JSONDecodeError:
                        continue
    
    def health(self) -> dict:
        """Check API health."""
        response = requests.get(f"{self.base_url}/health")
        response.raise_for_status()
        return response.json()


if __name__ == "__main__":
    # Example usage
    client = LLMClient(api_key="your-api-key")
    
    # Health check
    print("Health:", client.health())
    
    # Simple chat
    response = client.chat("What is the capital of France?")
    print(f"Response: {response}")
    
    # Streaming chat
    print("Streaming: ", end="")
    for chunk in client.stream_chat("Tell me a short joke."):
        print(chunk, end="", flush=True)
    print()
'''

# Save client code
client_path = Path("../api/client.py")
client_path.write_text(client_code)
print(f"✅ Client code saved to: {client_path.resolve()}")
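
The parsing step inside `stream_chat` can be exercised without a running server by feeding it canned lines. The sketch below pulls that logic into a standalone function; `iter_sse_content` and the `sample` lines are hypothetical, and the chunk layout (`choices[0].delta.content`) is assumed to follow the OpenAI-style streaming format the client above already expects.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text deltas from OpenAI-style SSE lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments (": ping"), and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore malformed chunks rather than aborting the stream
        choices = chunk.get("choices") or [{}]
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b": ping",
    b'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    b'data: {"choices": [{"delta": {"content": ", world"}}]}',
    b"data: [DONE]",
    b'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
print("".join(iter_sse_content(sample)))  # prints "Hello, world"
```

Keeping the parser separate from the HTTP call makes it easy to unit test edge cases such as keep-alive comments and chunks that carry no content.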

---

## Part 4: Production Deployment Considerations

In [None]:
print("""
📊 PRODUCTION DEPLOYMENT CHECKLIST
==================================================

□ SECURITY
  ├─ □ Enable HTTPS (use nginx/caddy as reverse proxy)
  ├─ □ Set strong API keys
  ├─ □ Rate limit per API key, not just per IP
  ├─ □ Input validation (max prompt length)
  └─ □ Output sanitization if needed

□ RELIABILITY
  ├─ □ Health checks with proper timeouts
  ├─ □ Graceful shutdown handling
  ├─ □ Request timeout configuration
  ├─ □ Retry logic for transient failures
  └─ □ Circuit breaker for backend failures

□ OBSERVABILITY
  ├─ □ Structured logging (JSON format)
  ├─ □ Request tracing (correlation IDs)
  ├─ □ Prometheus metrics export
  ├─ □ Alerting on error rates
  └─ □ Dashboard for monitoring

□ PERFORMANCE
  ├─ □ Connection pooling
  ├─ □ Async request handling
  ├─ □ Proper worker configuration
  ├─ □ Load testing before launch
  └─ □ Caching for repeated requests (optional)

□ DEPLOYMENT
  ├─ □ Docker containerization
  ├─ □ Environment variable configuration
  ├─ □ Health check in container spec
  ├─ □ Resource limits (memory, CPU)
  └─ □ Horizontal scaling plan
""")

In [None]:
# Docker deployment example
dockerfile_content = '''
# Dockerfile for LLM API Server
FROM python:3.11-slim

WORKDIR /app

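# Install curl for the HEALTHCHECK below (slim images do not include it)
RUN apt-get update \\
    && apt-get install -y --no-install-recommends curl \\
    && rm -rf /var/lib/apt/lists/*

# Install dependencies
RUN pip install --no-cache-dir \\
    fastapi \\
    uvicorn \\
    sse-starlette \\
    aiohttp \\
    python-multipart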

# Copy application
COPY api_server.py .

# Expose port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \\
    CMD curl -f http://localhost:8080/health || exit 1

# Run
CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8080"]
'''

docker_compose_content = '''
# docker-compose.yml for complete LLM stack
version: "3.8"

services:
  # Inference backend (choose one)
  vllm:
    image: nvcr.io/nvidia/pytorch:25.11-py3
    command: >
      bash -c "pip install vllm &&
      python -m vllm.entrypoints.openai.api_server
      --model Qwen/Qwen3-8B
      --port 8000
      --enforce-eager
      --dtype bfloat16"
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
  
  # API server
  api:
    build: .
    ports:
      - "8080:8080"
    environment:
      - BACKEND_URL=http://vllm:8000
      - API_KEY=${API_KEY}
      - RATE_LIMIT=60
    depends_on:
      - vllm
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
'''

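# Save Docker files (create ../api/ if an earlier cell has not already done so)
from pathlib import Path

api_dir = Path("../api")
api_dir.mkdir(parents=True, exist_ok=True)

dockerfile_path = api_dir / "Dockerfile"
dockerfile_path.write_text(dockerfile_content)

compose_path = api_dir / "docker-compose.yml"
compose_path.write_text(docker_compose_content)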

print(f"✅ Dockerfile saved to: {dockerfile_path.resolve()}")
print(f"✅ docker-compose.yml saved to: {compose_path.resolve()}")

print("\n📝 To deploy with Docker Compose:")
print("   cd api/")
print("   export HF_TOKEN=your-huggingface-token")
print("   export API_KEY=your-api-key")
print("   docker-compose up -d")
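
The compose file passes `BACKEND_URL`, `API_KEY`, and `RATE_LIMIT` to the API container as environment variables. One way the server could read them at startup is sketched below. The `Settings` dataclass, the `load_settings` helper, and the default values are assumptions for illustration, and may differ from what `api_server.py` actually does.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    backend_url: str
    api_key: str
    rate_limit: int


def load_settings(env=None) -> Settings:
    """Build Settings from environment variables, failing fast if API_KEY is missing."""
    env = os.environ if env is None else env  # accept a plain dict for testing
    api_key = env.get("API_KEY", "")
    if not api_key:
        # Refuse to start without a key rather than silently running unauthenticated
        raise RuntimeError("API_KEY must be set")
    return Settings(
        # Hypothetical default; the trailing slash is stripped so paths can be appended
        backend_url=env.get("BACKEND_URL", "http://localhost:8000").rstrip("/"),
        api_key=api_key,
        rate_limit=int(env.get("RATE_LIMIT", "60")),
    )
```

Validating configuration once at startup means a missing or malformed variable stops the container immediately, which the container health check then surfaces, instead of failing on the first request.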

---

## ⚠️ Common Mistakes

### Mistake 1: Not Using Async Properly

```python
# ❌ Wrong - A blocking call inside an async handler stalls the event loop
# (a plain `def` handler would instead run in FastAPI's threadpool)
@app.post("/chat")
async def chat(request: ChatRequest):
    response = requests.post(backend_url, json=request.dict())  # Blocking!
    return response.json()

# ✅ Right - Async all the way
@app.post("/chat")
async def chat(request: ChatRequest):
    async with aiohttp.ClientSession() as session:
        async with session.post(backend_url, json=request.dict()) as resp:
            return await resp.json()
```
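
When a blocking client library cannot be replaced with an async one, the call can be moved off the event loop instead. The sketch below uses `asyncio.to_thread` from the standard library; `blocking_call` is a hypothetical stand-in for something like `requests.post(...)`, and `handler` and `main` are illustrative names.

```python
import asyncio
import time


def blocking_call(seconds: float) -> str:
    """Stand-in for a blocking library call such as requests.post(...)."""
    time.sleep(seconds)
    return "done"


async def handler(seconds: float) -> str:
    # Run the blocking function in a worker thread so the event loop stays free
    return await asyncio.to_thread(blocking_call, seconds)


async def main() -> float:
    """Run three handlers concurrently and return the elapsed wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(0.3) for _ in range(3)))
    return time.perf_counter() - start
```

Calling `asyncio.run(main())` should take roughly as long as one call rather than three, because the three sleeps overlap in worker threads. Calling `blocking_call` directly inside `handler` would serialize them.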

### Mistake 2: No Request Timeouts

```python
# ❌ Wrong - Request hangs forever if backend is slow
async with session.post(url, json=data) as resp:
    return await resp.json()

# ✅ Right - Always set timeouts
timeout = aiohttp.ClientTimeout(total=120, connect=10)
async with session.post(url, json=data, timeout=timeout) as resp:
    return await resp.json()
```
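
A timeout turns a hung request into an error, and transient errors can then be retried. The helper below is a minimal sketch of retry with exponential backoff using only `asyncio`. The name `retry_async`, its parameters, and the default exception tuple are assumptions for this example.

```python
import asyncio
import random


async def retry_async(make_call, attempts: int = 3, base_delay: float = 0.5,
                      retry_on=(asyncio.TimeoutError, ConnectionError)):
    """Await make_call() up to `attempts` times, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff (base_delay, 2x, 4x, ...) plus random jitter
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

`make_call` is a zero-argument function that returns a fresh coroutine on each attempt, for example a small `async def` that wraps the `session.post(...)` call. With aiohttp, connection failures are raised as `aiohttp.ClientConnectionError`, so that class would need to be added to `retry_on`. Retrying only makes sense before any tokens have been streamed to the client.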

### Mistake 3: Exposing Internal Errors

```python
# ❌ Wrong - Leaks internal details
@app.exception_handler(Exception)
async def handle_error(request, exc):
    return JSONResponse({"error": str(exc)})  # Exposes the internal error message, with a 200 status

# ✅ Right - Generic error for clients, log details
@app.exception_handler(Exception)
async def handle_error(request, exc):
    logger.error("Internal error: %s", exc, exc_info=exc)  # Log the full traceback server-side
    return JSONResponse(
        status_code=500,
        content={"error": "Internal server error"}  # Generic for client
    )
```
```
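
Server-side logs are easier to search when each entry is structured and carries a request identifier. The sketch below shows one possible JSON formatter built on the standard `logging` module. `JsonFormatter`, `new_request_id`, and the `request_id` field name are assumptions for illustration, not part of the lab's server code.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.error(..., extra={"request_id": ...}); None if absent
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            # Keep the traceback in the server log only, never in the HTTP response
            payload["traceback"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def new_request_id() -> str:
    """Generate a correlation ID to attach to every log line for one request."""
    return uuid.uuid4().hex
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())` and pass the ID on each call, for example `logger.error("Backend call failed", extra={"request_id": rid}, exc_info=True)`.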

---

## ✋ Try It Yourself

### Exercise 1: Add Request Caching

Implement caching for repeated identical requests.

In [None]:
# Exercise 1: Your code here
# Add a simple in-memory cache for non-streaming requests
# Key: hash of (messages, temperature)
# Value: response + timestamp
# TTL: 5 minutes

# TODO: Implement the cache
# Hints:
# - Use hashlib to hash the request
# - Store (response, timestamp) tuples
# - Check TTL before returning cached response

### Exercise 2: Add Prometheus Metrics

Expose metrics in Prometheus format for monitoring.

In [None]:
# Exercise 2: Your code here
# Add a /metrics endpoint that returns Prometheus format:
# 
# # HELP llm_requests_total Total requests
# # TYPE llm_requests_total counter
# llm_requests_total{status="success"} 100
# llm_requests_total{status="error"} 5
# 
# # HELP llm_request_duration_seconds Request duration
# # TYPE llm_request_duration_seconds histogram
# llm_request_duration_seconds_bucket{le="0.1"} 50
# llm_request_duration_seconds_bucket{le="0.5"} 80
# ...

# TODO: Implement Prometheus metrics

---

## 🎉 Checkpoint

You've learned:
- ✅ How to build a production-ready LLM API with FastAPI
- ✅ How to implement SSE streaming for real-time responses
- ✅ How to add rate limiting, authentication, and monitoring
- ✅ How to deploy with Docker and handle errors gracefully

---

## 📖 Further Reading

- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [Server-Sent Events (MDN)](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)
- [Uvicorn Deployment](https://www.uvicorn.org/deployment/)

---

## 🧹 Cleanup

In [None]:
# Cleanup
import gc

# Clear Python garbage
gc.collect()

# Clear GPU memory cache if torch is available
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        print("✅ GPU memory cache cleared!")
except ImportError:
    pass

print("✅ Cleanup complete!")
print("\n📁 Files created in ../api/:")
print("   - api_server.py  (Main API server)")
print("   - client.py      (Python client)")
print("   - Dockerfile     (Container image)")
print("   - docker-compose.yml (Full stack deployment)")