**Purpose of Day 7:** Build a **production-ready LLM API** with FastAPI, including streaming, caching, monitoring, and a client.

It's about turning your model into a **real web service** that others can use.

* Model Serving: vLLM, TensorRT-LLM, Triton
* API Layer: FastAPI, Flask, gRPC
* Caching: Redis for prompt/response caching
* Monitoring: Logging, metrics, tracing
* Safety: Content filtering, rate limiting

- **Model Serving:** Engine to run models efficiently.
- **API Layer:** Expose model as web service.
- **Caching:** Store responses to avoid recomputation.
- **Monitoring:** Track performance and errors.
- **Safety:** Protect against abuse/bad content.

**Purpose:** Store generated responses so identical prompts don't need re-generation.

**Usefulness:**
1. **Speed:** Cache hits are instant (ms) vs generation (seconds)
2. **Cost:** Reduces compute/API costs
3. **Scalability:** Handles more users with same resources
4. **Consistency:** Same prompt → same response

**Example:** If 100 users ask "What is AI?", generate once, cache it, serve from cache 99 times.
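The 100-user example above can be sketched with an in-memory dictionary standing in for Redis. This is a minimal illustration, not the notebook's implementation: `ResponseCache`, `make_key`, and `fake_generate` are hypothetical names, and a real deployment would likely swap the dictionary for Redis calls and add an expiry time.

```python
import hashlib
import json


class ResponseCache:
    """In-memory stand-in for a Redis-style prompt/response cache (illustrative)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(prompt, **params):
        # Include generation parameters so different settings do not collide
        payload = json.dumps({"prompt": prompt, **params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_generate(self, prompt, generate_fn, **params):
        key = self.make_key(prompt, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: run the (expensive) generation once and remember it
        self.misses += 1
        result = generate_fn(prompt, **params)
        self._store[key] = result
        return result


def fake_generate(prompt, **params):
    # Placeholder for the real model call (assumption for this sketch)
    return f"echo: {prompt}"


cache = ResponseCache()
for _ in range(100):
    answer = cache.get_or_generate("What is AI?", fake_generate, max_tokens=50)

print(cache.misses, cache.hits)  # 1 99
```

Hashing the prompt together with the generation parameters keeps a request with `max_tokens=50` from being served a response cached for `max_tokens=500`.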

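Yes, generation can be sped up in several ways:

1. **Larger batch size:** Process several requests in one forward pass (raises throughput rather than the speed of a single request)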
2. **Speculative decoding:** Small model drafts, large model verifies
3. **Optimized kernels:** Use FlashAttention, custom CUDA kernels
4. **KV caching:** Already using (`use_cache=True`)
5. **Quantization:** 4-bit/8-bit models (already using)
6. **Hardware:** GPU instead of CPU

Token-by-token streaming can still feel slow next to text that appears all at once, but the **total time** is the same. Streaming only changes when the text becomes visible, so the difference is one of perception.
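The point about total time versus perception can be made concrete with a little arithmetic. The sketch below is a simplification under stated assumptions: every token takes the same time, prompt processing is ignored, and `display_timeline` is a hypothetical helper, not part of the notebook.

```python
def display_timeline(n_tokens, seconds_per_token):
    """Compare when text first appears and when it is complete.

    Simplification: each token takes the same time and prompt
    processing time is ignored.
    """
    total = n_tokens * seconds_per_token
    return {
        # Batch display: nothing is shown until the last token is ready
        "batch": {"first_text_at": total, "complete_at": total},
        # Streaming: the first token is shown as soon as it is produced
        "stream": {"first_text_at": seconds_per_token, "complete_at": total},
    }


timeline = display_timeline(n_tokens=100, seconds_per_token=0.05)
for mode, times in timeline.items():
    print(mode, times)
```

Both modes finish at the same moment; only the time at which the first text appears differs.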

**Purpose:** Create a **client** to call your FastAPI LLM server from Python.

**Usefulness:**
1. **Programmatic access:** Use your API in other Python scripts
2. **Testing:** Verify server works correctly
3. **Integration:** Connect to your LLM from other applications
4. **Streaming support:** Handle both regular and streaming responses

**Why:** Your API isn't useful without clients. This provides a ready-to-use Python client.

#### 0

In [1]:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import asyncio
import uuid
import time
from typing import List, Optional

These are **imports** - bringing in all the tools needed:

1. **FastAPI, HTTPException** - for building the web server
2. **BaseModel** - for defining data structures (request/response format)
3. **torch** - PyTorch, the AI framework
4. **AutoModelForCausalLM, AutoTokenizer** - Hugging Face tools to load AI models
5. **asyncio** - for async/await (handling multiple requests)
6. **uuid** - to generate unique IDs for each request
7. **time** - to measure how long generation takes
8. **List, Optional** - type hints for better code clarity

Without these imports, the code won't run because it doesn't know where these tools come from.

Yes, you need to know some FastAPI, but only the **basics**.

For FastAPI, you just need to know:
- `@app.get("/path")` → handles GET requests
- `@app.post("/path")` → handles POST requests
- `@app.on_event("startup")` → runs code when the server starts (deprecated in newer FastAPI versions in favor of lifespan handlers)

That's it for now. Learn more as you need it.

In [5]:
class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.9
    stream: bool = False

In [6]:
class GenerationResponse(BaseModel):
    generated_text: str
    request_id: str
    tokens_generated: int
    inference_time: float
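The request model accepts any integer or float for its parameters. For a public API it may help to bound them. The sketch below assumes Pydantic's `Field` constraints; the class name `BoundedGenerationRequest` and the specific limits (for example 1024 tokens) are illustrative choices, not values from the notebook.

```python
from pydantic import BaseModel, Field, ValidationError


class BoundedGenerationRequest(BaseModel):  # hypothetical name for this sketch
    prompt: str = Field(..., min_length=1)                   # required, non-empty
    max_tokens: int = Field(default=100, ge=1, le=1024)      # assumed upper bound
    temperature: float = Field(default=0.7, gt=0.0, le=2.0)  # assumed range
    top_p: float = Field(default=0.9, gt=0.0, le=1.0)
    stream: bool = False


# Omitted fields fall back to their defaults
ok = BoundedGenerationRequest(prompt="Hello")
print(ok.max_tokens, ok.temperature)  # 100 0.7

# Out-of-range values are rejected before they reach the model
try:
    BoundedGenerationRequest(prompt="Hello", max_tokens=0)
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # True
```

When such a model is used as a FastAPI request body, an invalid request is answered with a validation error response instead of reaching the generation code.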

In [7]:
# Line-by-line explanation
# Request/Response models
class GenerationRequest(BaseModel):
    prompt: str                    # User's input text (required)
    max_tokens: int = 100          # How many new tokens to generate (default 100)
    temperature: float = 0.7       # Randomness control (default 0.7)
    top_p: float = 0.9             # Token selection limit (default 0.9)
    stream: bool = False           # Whether to stream tokens (default False)

class GenerationResponse(BaseModel):
    generated_text: str            # The AI-generated text
    request_id: str                # Unique ID for this request
    tokens_generated: int          # How many tokens were made
    inference_time: float          # How long generation took (seconds)

# Initialize FastAPI app - creates the web server
app = FastAPI(title="LLM Inference API", version="1.0")

# Global model and tokenizer - will be loaded once and reused
MODEL = None
TOKENIZER = None

@app.on_event("startup")  # Runs when server starts
async def load_model():
    """Load model on startup"""
    global MODEL, TOKENIZER        # Use the global variables
    print("Loading model...")
    
    # Load the tokenizer (text ↔ tokens converter)
    TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
    TOKENIZER.pad_token = TOKENIZER.eos_token  # Set padding token
    
    # Load the actual AI model
    MODEL = AutoModelForCausalLM.from_pretrained(
        "gpt2",
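        # Half precision saves memory on GPU; fall back to float32 on CPU
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto"           # Put on GPU if available (requires `accelerate`)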
    )
    print("Model loaded successfully!")

@app.post("/generate", response_model=GenerationResponse)  # POST endpoint at /generate
async def generate_text(request: GenerationRequest):
    """Generate text endpoint"""
    if MODEL is None:  # Check if model is loaded
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    start_time = time.time()        # Start timer
    request_id = str(uuid.uuid4())  # Create unique ID for this request
    
    try:
        # Convert user's text to tokens
        inputs = TOKENIZER(request.prompt, return_tensors="pt").to(MODEL.device)
        
        # Generate new tokens
        with torch.no_grad():  # Disable gradient calculation (faster)
            outputs = MODEL.generate(
                **inputs,                     # Pass the tokenized input
                max_new_tokens=request.max_tokens,    # How many tokens to make
                temperature=request.temperature,      # Randomness level
                top_p=request.top_p,                  # Token selection
                do_sample=True,                       # Use sampling (not greedy)
                pad_token_id=TOKENIZER.eos_token_id   # Padding token ID
            )
        
        # Convert tokens back to text (outputs[0] includes the prompt tokens)
        generated_text = TOKENIZER.decode(outputs[0], skip_special_tokens=True)
        # Count how many new tokens were generated
        tokens_generated = outputs.shape[1] - inputs.input_ids.shape[1]
        # Calculate how long it took
        inference_time = time.time() - start_time
        
        # Return the response in the expected format
        return GenerationResponse(
            generated_text=generated_text,
            request_id=request_id,
            tokens_generated=tokens_generated,
            inference_time=inference_time
        )
        
    except Exception as e:  # If anything goes wrong
        raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")

@app.get("/health")  # GET endpoint at /health
async def health_check():
    """Health check endpoint - for monitoring"""
    return {
        "status": "healthy", 
        "model_loaded": MODEL is not None,  # Is model loaded?
        "device": str(MODEL.device) if MODEL else "none"  # Where's model running?
    }
# Run with: uvicorn script_name:app --host 0.0.0.0 --port 8000 --reload

**Note:** Running this cell prints a deprecation warning: `on_event` is deprecated, and FastAPI recommends lifespan event handlers instead (see the [FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/)). The second version of the server, further down, uses a lifespan handler.

**In simple terms:** This creates a web server that loads an AI model once, then lets users send text and get AI-generated responses through an API, with proper error handling and monitoring.

In [3]:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from contextlib import asynccontextmanager
import uuid
import time

In [9]:
# Request/Response models
class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.9
    stream: bool = False

class GenerationResponse(BaseModel):
    generated_text: str
    request_id: str
    tokens_generated: int
    inference_time: float

# Lifespan manager
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    print("Loading model...")
    global TOKENIZER, MODEL
    
    TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
    TOKENIZER.pad_token = TOKENIZER.eos_token
    
    MODEL = AutoModelForCausalLM.from_pretrained(
        "gpt2",
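        # Half precision on GPU; fall back to float32 on CPU
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto"  # Requires the `accelerate` package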
    )
    print("Model loaded successfully!")
    yield
    # Shutdown (optional cleanup)

# Initialize FastAPI with lifespan
app = FastAPI(title="LLM Inference API", version="1.0", lifespan=lifespan)

# Global model and tokenizer
MODEL = None
TOKENIZER = None

@app.post("/generate", response_model=GenerationResponse)
async def generate_text(request: GenerationRequest):
    if MODEL is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    start_time = time.time()
    request_id = str(uuid.uuid4())
    
    try:
        inputs = TOKENIZER(request.prompt, return_tensors="pt").to(MODEL.device)
        
        with torch.no_grad():
            outputs = MODEL.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True,
                pad_token_id=TOKENIZER.eos_token_id
            )
        
        generated_text = TOKENIZER.decode(outputs[0], skip_special_tokens=True)
        tokens_generated = outputs.shape[1] - inputs.input_ids.shape[1]
        inference_time = time.time() - start_time
        
        return GenerationResponse(
            generated_text=generated_text,
            request_id=request_id,
            tokens_generated=tokens_generated,
            inference_time=inference_time
        )
        
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")

@app.get("/health")
async def health_check():
    return {
        "status": "healthy", 
        "model_loaded": MODEL is not None,
        "device": str(MODEL.device) if MODEL else "none"
    }
# Run with: uvicorn script_name:app --host 0.0.0.0 --port 8000 --reload

#### 1

In [1]:
'./distilgpt2-local', './gpt2-local'

('./distilgpt2-local', './gpt2-local')

In [None]:
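import requests
import json

class LLMClient:
    def __init__(self, base_url="http://localhost:8000", timeout=60):
        self.base_url = base_url
        self.timeout = timeout  # Seconds to wait for the server

    def generate(self, prompt, **kwargs):
        """Sync generation: returns the parsed JSON response."""
        response = requests.post(
            f"{self.base_url}/generate",
            json={"prompt": prompt, **kwargs},
            timeout=self.timeout,
        )
        response.raise_for_status()  # Surface 4xx/5xx errors instead of parsing them
        return response.json()

    def generate_stream(self, prompt, **kwargs):
        """Streaming generation: yields one parsed JSON chunk per `data:` line."""
        response = requests.post(
            f"{self.base_url}/generate-stream",
            json={"prompt": prompt, **kwargs},
            stream=True,
            timeout=self.timeout,
        )
        response.raise_for_status()

        for line in response.iter_lines():
            if not line:
                continue  # Skip blank separator lines
            data = line.decode("utf-8")
            # Strip only the leading prefix, not any "data: " inside the payload
            if data.startswith("data:"):
                data = data[len("data:"):]
            if data.strip():
                yield json.loads(data)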

# Test client
def test_client():
    client = LLMClient()
    
    # Test sync generation
    print("=== Sync Generation ===")
    result = client.generate("The future of AI is", max_tokens=50)
    print(f"Result: {result['generated_text']}")
    
    # Test streaming
    print("\n=== Streaming Generation ===")
    for chunk in client.generate_stream("Explain quantum computing:", max_tokens=30):
        if not chunk['finished']:
            print(chunk['token'], end="", flush=True)
    print()

# Run client test
test_client()
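The client above expects each streamed line to look like `data: {json}`, with a `token` field and a `finished` flag. A minimal sketch of a generator that produces this format, plus the matching parsing step, is shown below. The helper names `sse_events` and `parse_sse_line` are hypothetical, and the token list stands in for real model output.

```python
import json


def sse_events(tokens):
    """Yield one `data: {...}` event per token, then a final 'finished' marker."""
    for token in tokens:
        payload = json.dumps({"token": token, "finished": False})
        yield "data: " + payload + "\n\n"
    # Last event tells the client that generation is complete
    yield "data: " + json.dumps({"token": "", "finished": True}) + "\n\n"


def parse_sse_line(line):
    """Mirror of the client-side parsing: strip the prefix, decode the JSON."""
    text = line.strip()
    if not text.startswith("data:"):
        return None
    return json.loads(text[len("data:"):])


chunks = [parse_sse_line(event) for event in sse_events(["Hello", ",", " world"])]
text = "".join(c["token"] for c in chunks if not c["finished"])
print(text)  # Hello, world
```

On the server side, a generator like this could be passed to FastAPI's `StreamingResponse` with `media_type="text/event-stream"`; how the tokens are obtained from the model is left out of this sketch.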
