# Task 12.5: Production API with FastAPI

**Module:** 12 - Model Deployment & Inference Engines  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Build an OpenAI-compatible API with FastAPI
- [ ] Implement streaming responses (SSE)
- [ ] Add production features: rate limiting, monitoring, error handling
- [ ] Deploy and test under load

---

## 📚 Prerequisites

- Completed: Tasks 12.1-12.4
- Knowledge of: REST APIs, async Python
- Running: At least one inference engine (Ollama recommended)

---

## 🌍 Real-World Context

**Why build a custom API layer?**

While inference engines like Ollama and vLLM provide APIs, production deployments often need:
- **Rate limiting**: Prevent abuse and ensure fair usage
- **Authentication**: Secure access with API keys
- **Monitoring**: Track latency, errors, and usage
- **Load balancing**: Distribute requests across multiple engines
- **Request transformation**: Adapt to specific frontend needs

A FastAPI wrapper gives you full control over these features while maintaining OpenAI API compatibility.

---

## 🧒 ELI5: What is an API?

> **Imagine a restaurant...**
>
> You (the client) don't walk into the kitchen to make your food.
> Instead, you talk to the waiter (the API) who:
> 1. Takes your order (receives your request)
> 2. Brings it to the kitchen (forwards to the model)
> 3. Checks if it's ready (handles streaming)
> 4. Brings you your food (returns the response)
>
> **Streaming is like a sushi conveyor belt:**
> Instead of waiting for all dishes at once, pieces arrive as they're ready.
> You can start eating immediately!
>
> **In AI terms:**
> - Client sends a message via HTTP request
> - API validates, rate-limits, logs the request
> - Forwards to inference engine
> - Streams tokens back as they're generated
> - Logs completion and metrics

---

## Part 1: Basic FastAPI Setup

Let's start with a simple API that wraps an inference engine.

In [None]:
# Install dependencies if needed
# !pip install fastapi uvicorn aiohttp pydantic

import asyncio
import json
import time
import uuid
from datetime import datetime
from typing import List, Optional, AsyncIterator
from dataclasses import dataclass, field

# Check imports
try:
    from fastapi import FastAPI, HTTPException, Request
    from fastapi.responses import StreamingResponse, JSONResponse
    from pydantic import BaseModel, Field
    import aiohttp
    print("✅ All dependencies installed!")
except ImportError as e:
    print(f"❌ Missing dependency: {e}")
    print("   Install with: pip install fastapi uvicorn aiohttp pydantic")

In [None]:
# Define OpenAI-compatible request/response models

class ChatMessage(BaseModel):
    """A single chat message."""
    role: str = Field(..., description="Role: system, user, or assistant")
    content: str = Field(..., description="Message content")

class ChatCompletionRequest(BaseModel):
    """OpenAI-compatible chat completion request."""
    model: str = Field(default="default")
    messages: List[ChatMessage]
    max_tokens: Optional[int] = Field(default=512)
    temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
    top_p: Optional[float] = Field(default=0.9, ge=0.0, le=1.0)
    stream: Optional[bool] = Field(default=False)
    stop: Optional[List[str]] = None
    user: Optional[str] = None

class ChatCompletionChoice(BaseModel):
    """A single completion choice."""
    index: int = 0
    message: ChatMessage
    finish_reason: str = "stop"

class UsageInfo(BaseModel):
    """Token usage information."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class ChatCompletionResponse(BaseModel):
    """OpenAI-compatible chat completion response."""
    id: str = Field(default_factory=lambda: f"chatcmpl-{uuid.uuid4().hex[:8]}")
    object: str = "chat.completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[ChatCompletionChoice]
    usage: UsageInfo

print("✅ Pydantic models defined!")

In [None]:
# Simple inference client that wraps Ollama
import requests

class OllamaBackend:
    """
    Backend client for Ollama inference.
    """
    
    def __init__(self, base_url: str = "http://localhost:11434", model: str = "llama3.1:8b"):
        self.base_url = base_url
        self.model = model
    
    def is_healthy(self) -> bool:
        """Check if Ollama is running."""
        try:
            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False
    
    def chat(self, messages: List[dict], max_tokens: int = 512, temperature: float = 0.7) -> str:
        """Generate a non-streaming response."""
        response = requests.post(
            f"{self.base_url}/api/chat",
            json={
                "model": self.model,
                "messages": messages,
                "options": {
                    "num_predict": max_tokens,
                    "temperature": temperature
                },
                "stream": False
            },
            timeout=120
        )
        response.raise_for_status()
        return response.json().get("message", {}).get("content", "")
    
    async def stream_chat(
        self, 
        messages: List[dict], 
        max_tokens: int = 512, 
        temperature: float = 0.7
    ) -> AsyncIterator[str]:
        """Generate a streaming response."""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.base_url}/api/chat",
                json={
                    "model": self.model,
                    "messages": messages,
                    "options": {
                        "num_predict": max_tokens,
                        "temperature": temperature
                    },
                    "stream": True
                },
                timeout=aiohttp.ClientTimeout(total=120)
            ) as response:
                async for line in response.content:
                    if line:
                        try:
                            chunk = json.loads(line)
                            content = chunk.get("message", {}).get("content", "")
                            if content:
                                yield content
                        except json.JSONDecodeError:
                            pass

# Test connection
backend = OllamaBackend()
if backend.is_healthy():
    print("✅ Connected to Ollama!")
else:
    print("❌ Ollama not running. Start with: ollama serve")

---

## Part 2: Building the FastAPI Server

Now let's create the full API server with streaming support.

In [None]:
# Complete FastAPI server code
# We'll write this as an EXAMPLE file (the module includes a more complete api_server.py)

api_server_code = '''
"""
Production LLM API Server with FastAPI (Learning Example)

This is a simplified learning example. For production use, see:
    ../api/api_server.py (full-featured version with inference_client integration)

Features:
- OpenAI-compatible API
- Streaming responses (SSE)
- Rate limiting
- Request logging
- Health checks
- Error handling

Usage:
    uvicorn simple_api_example:app --host 0.0.0.0 --port 8080 --reload
"""

import asyncio
import json
import logging
import time
import uuid
from collections import deque
from datetime import datetime
from typing import List, Optional, AsyncIterator, Dict, Any

import aiohttp
from fastapi import FastAPI, HTTPException, Request, status
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse, JSONResponse
from pydantic import BaseModel, Field

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger("api")

# ============================================================
# Configuration
# ============================================================

OLLAMA_URL = "http://localhost:11434"
DEFAULT_MODEL = "llama3.1:8b"
RATE_LIMIT_RPM = 60  # Requests per minute

# ============================================================
# Pydantic Models
# ============================================================

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = DEFAULT_MODEL
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.7
    stream: Optional[bool] = False
    user: Optional[str] = None

class ChatCompletionResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    model: str
    choices: List[dict]
    usage: dict

# ============================================================
# Rate Limiter
# ============================================================

class RateLimiter:
    """Sliding-window rate limiter per client IP (keeps a log of request timestamps)."""
    
    def __init__(self, requests_per_minute: int = 60):
        self.rpm = requests_per_minute
        self.requests: Dict[str, deque] = {}
    
    def is_allowed(self, client_ip: str) -> bool:
        now = time.time()
        if client_ip not in self.requests:
            self.requests[client_ip] = deque()
        
        # Remove old requests
        while self.requests[client_ip] and now - self.requests[client_ip][0] > 60:
            self.requests[client_ip].popleft()
        
        if len(self.requests[client_ip]) >= self.rpm:
            return False
        
        self.requests[client_ip].append(now)
        return True

# ============================================================
# Metrics Tracker
# ============================================================

class Metrics:
    """Simple metrics tracker."""
    
    def __init__(self):
        self.start_time = time.time()
        self.total_requests = 0
        self.successful_requests = 0
        self.failed_requests = 0
        self.latencies: deque = deque(maxlen=1000)
    
    def record_request(self, latency_ms: float, success: bool):
        self.total_requests += 1
        self.latencies.append(latency_ms)
        if success:
            self.successful_requests += 1
        else:
            self.failed_requests += 1
    
    def get_stats(self) -> dict:
        latencies = list(self.latencies)
        return {
            "uptime_seconds": time.time() - self.start_time,
            "total_requests": self.total_requests,
            "successful_requests": self.successful_requests,
            "failed_requests": self.failed_requests,
            "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
            "p90_latency_ms": sorted(latencies)[int(len(latencies) * 0.9)] if latencies else 0
        }

# ============================================================
# FastAPI App
# ============================================================

app = FastAPI(
    title="LLM Inference API",
    description="OpenAI-compatible API for local LLM inference",
    version="1.0.0"
)

# CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"]
)

# Global state
rate_limiter = RateLimiter(RATE_LIMIT_RPM)
metrics = Metrics()

# ============================================================
# Endpoints
# ============================================================

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                f"{OLLAMA_URL}/api/tags",
                timeout=aiohttp.ClientTimeout(total=5)
            ) as resp:
                backend_healthy = resp.status == 200
    except (aiohttp.ClientError, asyncio.TimeoutError):
        backend_healthy = False
    
    return {
        "status": "healthy" if backend_healthy else "degraded",
        "backend": "ollama",
        "backend_url": OLLAMA_URL,
        "backend_healthy": backend_healthy,
        **metrics.get_stats()
    }

@app.get("/v1/models")
async def list_models():
    """List available models."""
    return {
        "object": "list",
        "data": [{"id": DEFAULT_MODEL, "object": "model", "owned_by": "local"}]
    }

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest, req: Request):
    """OpenAI-compatible chat completion endpoint."""
    start_time = time.time()
    client_ip = req.client.host if req.client else "unknown"
    
    # Rate limiting
    if not rate_limiter.is_allowed(client_ip):
        raise HTTPException(
            status_code=status.HTTP_429_TOO_MANY_REQUESTS,
            detail="Rate limit exceeded"
        )
    
    logger.info(f"Request from {client_ip}: {len(request.messages)} messages, stream={request.stream}")
    
    # Convert messages
    messages = [{"role": m.role, "content": m.content} for m in request.messages]
    
    if request.stream:
        return await stream_response(messages, request, start_time)
    else:
        return await non_stream_response(messages, request, start_time)

async def non_stream_response(messages: List[dict], request: ChatCompletionRequest, start_time: float):
    """Handle non-streaming response."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{OLLAMA_URL}/api/chat",
                json={
                    "model": request.model,
                    "messages": messages,
                    "options": {
                        "num_predict": request.max_tokens,
                        "temperature": request.temperature
                    },
                    "stream": False
                },
                timeout=aiohttp.ClientTimeout(total=120)
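            ) as resp:
                # Surface backend errors (e.g. unknown model) instead of returning empty content
                resp.raise_for_status()
                data = await resp.json()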
        
        content = data.get("message", {}).get("content", "")
        latency_ms = (time.time() - start_time) * 1000
        metrics.record_request(latency_ms, True)
        
        return ChatCompletionResponse(
            id=f"chatcmpl-{uuid.uuid4().hex[:8]}",
            created=int(time.time()),
            model=request.model,
            choices=[{
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": "stop"
            }],
            usage={
                "prompt_tokens": data.get("prompt_eval_count", 0),
                "completion_tokens": data.get("eval_count", 0),
                "total_tokens": data.get("prompt_eval_count", 0) + data.get("eval_count", 0)
            }
        )
    except Exception as e:
        latency_ms = (time.time() - start_time) * 1000
        metrics.record_request(latency_ms, False)
        logger.error(f"Error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

async def stream_response(messages: List[dict], request: ChatCompletionRequest, start_time: float):
    """Handle streaming response."""
    completion_id = f"chatcmpl-{uuid.uuid4().hex[:8]}"
    
    async def generate():
        try:
            # Initial chunk with role
            initial = {
                "id": completion_id,
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": request.model,
                "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]
            }
            yield f"data: {json.dumps(initial)}\\n\\n"
            
            async with aiohttp.ClientSession() as session:
                async with session.post(
                    f"{OLLAMA_URL}/api/chat",
                    json={
                        "model": request.model,
                        "messages": messages,
                        "options": {
                            "num_predict": request.max_tokens,
                            "temperature": request.temperature
                        },
                        "stream": True
                    },
                    timeout=aiohttp.ClientTimeout(total=120)
                ) as resp:
                    # Fail fast on backend errors so the except block below reports them
                    resp.raise_for_status()
                    async for line in resp.content:
                        if line:
                            try:
                                chunk = json.loads(line)
                                content = chunk.get("message", {}).get("content", "")
                                if content:
                                    data = {
                                        "id": completion_id,
                                        "object": "chat.completion.chunk",
                                        "created": int(time.time()),
                                        "model": request.model,
                                        "choices": [{"index": 0, "delta": {"content": content}, "finish_reason": None}]
                                    }
                                    yield f"data: {json.dumps(data)}\\n\\n"
                            except json.JSONDecodeError:
                                pass
            
            # Final chunk
            final = {
                "id": completion_id,
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": request.model,
                "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]
            }
            yield f"data: {json.dumps(final)}\\n\\n"
            yield "data: [DONE]\\n\\n"
            
            latency_ms = (time.time() - start_time) * 1000
            metrics.record_request(latency_ms, True)
            
        except Exception as e:
            latency_ms = (time.time() - start_time) * 1000
            metrics.record_request(latency_ms, False)
            logger.error(f"Streaming error: {e}")
            # json.dumps escapes quotes and newlines in the error message
            error_payload = json.dumps({"error": str(e)})
            yield f"data: {error_payload}\\n\\n"
    
    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "Connection": "keep-alive"}
    )

@app.get("/metrics")
async def get_metrics():
    """Get server metrics."""
    return metrics.get_stats()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
'''

# Write to file - use a unique name to avoid overwriting the full-featured api_server.py
from pathlib import Path

# Get the notebook's directory reliably
# When running in Jupyter, the cwd is typically the notebook's directory
notebook_dir = Path.cwd()

# Navigate to expected api directory location
api_dir = (notebook_dir / "../api").resolve()

# Verify we're in the right directory structure
scripts_dir = (notebook_dir / "../scripts").resolve()
if not scripts_dir.exists():
    print("⚠️  Warning: Cannot verify correct directory structure")
    print(f"   Expected 'scripts/' folder at: {scripts_dir}")
    print(f"   Current working directory: {notebook_dir}")
    print("\n   If you're running from a different directory, the file may be")
    print("   written to an unexpected location.")
    user_confirm = input("   Continue anyway? (y/n): ").strip().lower()
    if user_confirm != 'y':
        print("   Aborted. Please run this notebook from the notebooks/ directory.")
        raise SystemExit(0)

# Create the api directory if it doesn't exist
api_dir.mkdir(parents=True, exist_ok=True)

# Write as an example/learning file
output_file = api_dir / "simple_api_example.py"
with open(output_file, "w") as f:
    f.write(api_server_code)

print(f"✅ Example API server code written to {output_file}")
print("\n💡 Note: The module also includes a full-featured server at ../api/api_server.py")
print("   that integrates with the inference_client for multi-engine support.")
print("\n🚀 To start the example server, run:")
print("   cd ../api && uvicorn simple_api_example:app --host 0.0.0.0 --port 8080 --reload")
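
The `RateLimiter` in the server keeps a deque of request timestamps per client and rejects a request once the last 60 seconds are full. The sketch below isolates that idea so it can be tested without waiting a minute. The class name `SlidingWindowLimiter`, the injectable `clock` parameter, and the `retry_after` helper are assumptions added for illustration; they are not part of the server code above.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window_s` seconds."""

    def __init__(
        self,
        limit: int,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.hits: Dict[str, Deque[float]] = {}

    def _evict(self, hits: Deque[float], now: float) -> None:
        # Drop timestamps that have aged out of the window
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()

    def is_allowed(self, client_id: str) -> bool:
        now = self.clock()
        hits = self.hits.setdefault(client_id, deque())
        self._evict(hits, now)
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

    def retry_after(self, client_id: str) -> float:
        """Seconds until the client's oldest request leaves the window (0 if not limited)."""
        now = self.clock()
        hits = self.hits.get(client_id, deque())
        self._evict(hits, now)
        if len(hits) < self.limit:
            return 0.0
        return self.window_s - (now - hits[0])


# Usage with a fake clock so the window can be advanced instantly
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=2, window_s=60.0, clock=lambda: fake_now[0])
first, second, third = (limiter.is_allowed("10.0.0.1") for _ in range(3))
```

A value like `retry_after` could be sent in a `Retry-After` header alongside the 429 response so clients know when to try again.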

---

## Part 3: Testing the API

Let's create test functions to verify our API works correctly.

In [None]:
import requests
import json
import time

API_URL = "http://localhost:8080"

def test_health():
    """Test health endpoint."""
    try:
        response = requests.get(f"{API_URL}/health", timeout=5)
        if response.status_code == 200:
            data = response.json()
            print(f"✅ Health check passed: {data['status']}")
            print(f"   Backend: {data.get('backend', 'unknown')}")
            print(f"   Uptime: {data.get('uptime_seconds', 0):.1f}s")
            return True
        print(f"❌ Health check returned status {response.status_code}")
    except requests.exceptions.ConnectionError:
        print("❌ Cannot connect to API server")
        print("   Start it with: uvicorn simple_api_example:app --port 8080")
    except Exception as e:
        print(f"❌ Health check failed: {e}")
    return False

def test_non_streaming():
    """Test non-streaming completion."""
    try:
        start = time.time()
        response = requests.post(
            f"{API_URL}/v1/chat/completions",
            json={
                "model": "llama3.1:8b",
                "messages": [{"role": "user", "content": "Say hello!"}],
                "max_tokens": 50,
                "stream": False
            },
            timeout=60
        )
        latency = (time.time() - start) * 1000
        
        if response.status_code == 200:
            data = response.json()
            content = data["choices"][0]["message"]["content"]
            print(f"✅ Non-streaming works! ({latency:.0f}ms)")
            print(f"   Response: {content[:100]}...")
            return True
        else:
            print(f"❌ Error: {response.status_code} - {response.text}")
    except Exception as e:
        print(f"❌ Non-streaming test failed: {e}")
    return False

def test_streaming():
    """Test streaming completion."""
    try:
        start = time.time()
        first_token = None
        chunks = []
        
        response = requests.post(
            f"{API_URL}/v1/chat/completions",
            json={
                "model": "llama3.1:8b",
                "messages": [{"role": "user", "content": "Count from 1 to 5."}],
                "max_tokens": 50,
                "stream": True
            },
            stream=True,
            timeout=60
        )
        
        if response.status_code != 200:
            print(f"❌ Error: {response.status_code} - {response.text}")
            return False
        
        for line in response.iter_lines():
            if line:
                line_str = line.decode()
                if line_str.startswith("data: "):
                    data_str = line_str[6:]
                    if data_str == "[DONE]":
                        break
                    try:
                        chunk = json.loads(data_str)
                        content = chunk.get("choices", [{}])[0].get("delta", {}).get("content", "")
                        if content:
                            if first_token is None:
                                first_token = time.time()
                            chunks.append(content)
                    except json.JSONDecodeError:
                        pass  # Skip malformed chunks
        
        if not chunks:
            print("❌ Streaming returned no content")
            return False
        
        total_time = (time.time() - start) * 1000
        ttft = ((first_token - start) * 1000) if first_token else 0
        
        print(f"✅ Streaming works! (TTFT: {ttft:.0f}ms, Total: {total_time:.0f}ms)")
        print(f"   Response: {''.join(chunks)}")
        return True
        
    except Exception as e:
        print(f"❌ Streaming test failed: {e}")
    return False

# Run tests
print("🧪 Running API Tests")
print("=" * 50)

if test_health():
    print("")
    test_non_streaming()
    print("")
    test_streaming()

---

## Part 4: Load Testing

Let's test how the API handles concurrent requests.

In [None]:
import asyncio
import aiohttp
import time
from typing import List, Tuple

async def send_request(session: aiohttp.ClientSession, request_id: int) -> Tuple[int, float, bool]:
    """Send a single request and return (id, latency, success)."""
    start = time.time()
    try:
        async with session.post(
            f"{API_URL}/v1/chat/completions",
            json={
                "model": "llama3.1:8b",
                "messages": [{"role": "user", "content": f"Request {request_id}: Say 'hello'"}],
                "max_tokens": 20,
                "stream": False
            },
            timeout=aiohttp.ClientTimeout(total=60)
        ) as response:
            await response.json()
            latency = (time.time() - start) * 1000
            return (request_id, latency, response.status == 200)
    except Exception as e:
        latency = (time.time() - start) * 1000
        return (request_id, latency, False)

async def load_test(num_requests: int, concurrency: int) -> dict:
    """Run load test with specified concurrency."""
    semaphore = asyncio.Semaphore(concurrency)
    
    async def limited_request(session, req_id):
        async with semaphore:
            return await send_request(session, req_id)
    
    start_time = time.time()
    
    async with aiohttp.ClientSession() as session:
        tasks = [limited_request(session, i) for i in range(num_requests)]
        results = await asyncio.gather(*tasks)
    
    total_time = time.time() - start_time
    
    # Analyze results
    latencies = [r[1] for r in results if r[2]]
    successful = sum(1 for r in results if r[2])
    
    return {
        "total_requests": num_requests,
        "concurrency": concurrency,
        "successful": successful,
        "failed": num_requests - successful,
        "total_time_s": total_time,
        "throughput_rps": successful / total_time if total_time > 0 else 0,
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
        "p90_latency_ms": sorted(latencies)[int(len(latencies) * 0.9)] if latencies else 0,
    }

# Check if API is running
try:
    requests.get(f"{API_URL}/health", timeout=2)
    api_running = True
except requests.RequestException:
    api_running = False

if api_running:
    print("🔥 Running Load Test")
    print("=" * 50)
    
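    # Test at different concurrency levels.
    # Jupyter already runs an event loop, so asyncio.run() would raise a
    # RuntimeError here; use top-level await instead. In a plain script,
    # call asyncio.run(load_test(...)).
    for concurrency in [1, 2, 4]:
        print(f"\nConcurrency: {concurrency}")
        result = await load_test(num_requests=10, concurrency=concurrency)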
        print(f"   Throughput: {result['throughput_rps']:.2f} req/s")
        print(f"   Avg Latency: {result['avg_latency_ms']:.0f}ms")
        print(f"   P90 Latency: {result['p90_latency_ms']:.0f}ms")
        print(f"   Success Rate: {result['successful']}/{result['total_requests']}")
else:
    print("⚠️ API server not running. Start it first to run load tests.")
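
The P90 figure above comes from indexing into the sorted latencies at `int(len(latencies) * 0.9)`. With only 10 samples that index is 9, which is the slowest request rather than the 90th percentile. One alternative is the nearest-rank method, sketched below; `percentile` is a hypothetical helper name, not something the load test defines.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Rank is 1-based; multiply before dividing to limit float rounding error
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
p90 = percentile(latencies_ms, 90)
index_based = sorted(latencies_ms)[int(len(latencies_ms) * 0.9)]
```

For larger sample counts the two approaches converge, so the difference matters mostly for small runs like the 10-request test above.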

---

## ⚠️ Common Mistakes

### Mistake 1: Not Handling Streaming Correctly

```python
# ❌ Wrong - Response buffered, defeats purpose of streaming
return JSONResponse(content=full_response)  # After collecting all chunks

# ✅ Right - True streaming
return StreamingResponse(
    generate(),
    media_type="text/event-stream"
)
```
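
To see what "true streaming" means on the wire, the sketch below builds SSE frames the way the server does (`data: <json>` followed by a blank line) and parses them the way the streaming test does. The helper names `make_chunk`, `sse_event`, `iter_sse_payloads`, and `collect_text` are assumptions made for this illustration.

```python
import json
from typing import Iterable, Iterator


def make_chunk(text: str) -> dict:
    """Build a minimal OpenAI-style streaming chunk carrying one piece of content."""
    return {"choices": [{"index": 0, "delta": {"content": text}, "finish_reason": None}]}


def sse_event(payload: dict) -> str:
    """Frame one JSON payload as an SSE event: a data line plus a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded payloads until the [DONE] sentinel; ignore blank separator lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        if body == "[DONE]":
            return
        yield json.loads(body)


def collect_text(lines: Iterable[str]) -> str:
    """Join the delta content of every chunk into the full response text."""
    return "".join(
        payload["choices"][0]["delta"].get("content", "")
        for payload in iter_sse_payloads(lines)
    )


# Simulate what a client would receive for a two-token response
wire = sse_event(make_chunk("Hel")) + sse_event(make_chunk("lo")) + "data: [DONE]\n\n"
text = collect_text(wire.splitlines())
```

Each event is usable by the client as soon as it arrives, which is what buffering the full response into a single `JSONResponse` gives up.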

### Mistake 2: Blocking the Event Loop

```python
# ❌ Wrong - Blocks async event loop
response = requests.post(url, json=data)  # Synchronous!

# ✅ Right - Non-blocking async
async with aiohttp.ClientSession() as session:
    async with session.post(url, json=data) as response:
        result = await response.json()  # Separate name so the request payload `data` is not overwritten
```
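
The effect of a blocking call can be measured directly. The sketch below runs a heartbeat task next to a 0.2-second blocking call, once called directly and once offloaded with `asyncio.to_thread` (Python 3.9+). The names `slow_sync_call`, `heartbeat`, and `count_ticks` are hypothetical stand-ins; `slow_sync_call` plays the role of `requests.post()`.

```python
import asyncio
import time


def slow_sync_call(duration_s: float = 0.2) -> str:
    """Stand-in for a blocking call such as requests.post()."""
    time.sleep(duration_s)
    return "done"


async def heartbeat(ticks: list, interval_s: float = 0.02) -> None:
    """Increment ticks[0] every interval, but only while the event loop is free."""
    while True:
        await asyncio.sleep(interval_s)
        ticks[0] += 1


async def count_ticks(offload: bool) -> int:
    """Return how many heartbeats ran while the slow call was in progress."""
    ticks = [0]
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # Give the heartbeat a chance to start
    if offload:
        await asyncio.to_thread(slow_sync_call)  # Runs in a worker thread
    else:
        slow_sync_call()  # Blocks the event loop for the whole call
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return ticks[0]


async def main() -> dict:
    return {
        "blocking": await count_ticks(offload=False),
        "offloaded": await count_ticks(offload=True),
    }


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a notebook cell, use `await main()` instead of `asyncio.run(main())`. The blocking run should report almost no heartbeats, which is what every other request on the server experiences while a synchronous call is in flight.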

### Mistake 3: Missing Error Handling in Streaming

```python
# ❌ Wrong - An exception mid-stream drops the connection without telling the client
async def generate():
    async for chunk in backend.stream():
        yield f"data: {chunk}\n\n"

# ✅ Right - Handle errors gracefully
async def generate():
    try:
        async for chunk in backend.stream():
            yield f"data: {chunk}\n\n"
    except Exception as e:
        # json.dumps escapes quotes and newlines in the error message
        yield f"data: {json.dumps({'error': str(e)})}\n\n"
```
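
When formatting the error event, building the JSON by hand is fragile: an error message that contains double quotes produces an event the client cannot parse. The sketch below compares hand-built JSON with `json.dumps`; the function names and the sample message are made up for the demonstration.

```python
import json


def error_event_by_hand(message: str) -> str:
    """Fragile: the message is pasted into the JSON text without escaping."""
    return f'data: {{"error": "{message}"}}\n\n'


def error_event(message: str) -> str:
    """Safer: json.dumps escapes quotes, backslashes, and newlines."""
    payload = json.dumps({"error": message})
    return f"data: {payload}\n\n"


def client_can_parse(event: str) -> bool:
    """Mimic a client: strip the SSE prefix and try to decode the JSON body."""
    body = event[len("data: "):].strip()
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False


message = 'model "llama3.1:8b" not found'  # hypothetical error text containing quotes
by_hand_ok = client_can_parse(error_event_by_hand(message))
dumps_ok = client_can_parse(error_event(message))
```

Here `by_hand_ok` is `False` and `dumps_ok` is `True`, because only the second event contains valid JSON.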

---

## ✋ Try It Yourself

### Exercise 1: Add API Key Authentication

Add simple API key authentication to protect your endpoint.

In [None]:
# Exercise 1: Implement API key auth
# TODO: Add a middleware or dependency that:
#   1. Checks for Authorization header: "Bearer sk-xxx"
#   2. Validates against a list of valid keys
#   3. Returns 401 if invalid

# Hint: Use FastAPI's Depends and HTTPBearer
# from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials


### Exercise 2: Add Request Logging to File

Log all requests to a file for later analysis.

In [None]:
# Exercise 2: Add file logging
# TODO: Create a middleware that logs:
#   - Timestamp
#   - Client IP
#   - Request path
#   - Latency
#   - Status code
# to a JSON Lines file (one JSON object per line)


---

## 🎉 Checkpoint

You've learned:
- ✅ How to build an OpenAI-compatible API with FastAPI
- ✅ Implementing streaming responses with SSE
- ✅ Adding rate limiting and metrics
- ✅ Load testing your API

---

## 🚀 Challenge (Optional)

**Build a Multi-Backend Load Balancer**

Create an API that:
1. Routes requests to multiple inference backends
2. Implements health checking and failover
3. Load balances based on current queue depth
4. Provides a unified API across different engines

---

## 📖 Further Reading

- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [Server-Sent Events (SSE) Spec](https://html.spec.whatwg.org/multipage/server-sent-events.html)
- [OpenAI API Reference](https://platform.openai.com/docs/api-reference)
- [Uvicorn Production Deployment](https://www.uvicorn.org/deployment/)

---

## 🧹 Cleanup

In [None]:
import gc
gc.collect()

print("✅ Cleanup complete!")
print("\n💡 To stop the API server: Ctrl+C in the terminal")