# 🌐 Local Model Serving with FastAPI

**Module 04 | Notebook 1 of 4**

Learn to create production-ready REST APIs for your ML models using FastAPI.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Create a FastAPI application for model serving
2. Implement prediction endpoints
3. Handle input validation with Pydantic
4. Test your API locally

---

In [None]:
%%capture
!pip install transformers torch fastapi uvicorn pydantic requests

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

---

## 1️⃣ Why FastAPI?

### REST API Serving Pattern

```
┌─────────────┐     HTTP Request      ┌─────────────┐
│   Client    │─────────────────────→ │   FastAPI   │
│  (Browser,  │     {"text": "..."}   │   Server    │
│   Mobile)   │                       └──────┬──────┘
└─────────────┘                              │
       ↑                                     ▼
       │                              ┌─────────────┐
       │      HTTP Response           │    Model    │
       └───────────────────────────── │  Inference  │
             {"label": "POSITIVE"}    └─────────────┘
```

### FastAPI Advantages

| Feature | Benefit |
|---------|--------|
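| **Automatic docs** | Swagger UI at `/docs` and ReDoc at `/redoc`, generated from the OpenAPI schema |
| **Type hints** | Request data is validated against the annotated types |
| **Async support** | `async def` endpoints can overlap many I/O-bound requests |
| **Fast** | Built on Starlette and Uvicorn (ASGI); among the faster Python web frameworks |
| **Modern** | Designed around standard Python type hints and `async`/`await` |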

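To make the "async support" row concrete, here is a minimal sketch that uses only the standard library's `asyncio` (no FastAPI) to show why awaiting I/O lets a single worker overlap many requests. `fake_io_call` and `handle_many` are hypothetical names, and the sleep stands in for a database or HTTP call. The benefit applies to I/O waits only; a model forward pass is compute-bound and does not yield to the event loop.

```python
import asyncio
import time


async def fake_io_call(request_id: int, delay_s: float = 0.1) -> int:
    """Hypothetical stand-in for an awaitable I/O call (database, HTTP, disk)."""
    await asyncio.sleep(delay_s)  # yields control so other requests can progress
    return request_id


async def handle_many(n_requests: int) -> tuple[list[int], float]:
    """Run n simulated requests concurrently and report the wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_io_call(i) for i in range(n_requests)))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    # In a notebook cell an event loop is already running,
    # so use `await handle_many(5)` there instead of asyncio.run().
    results, elapsed = asyncio.run(handle_many(5))
    # The five 0.1 s waits overlap, so the total stays well below 0.5 s.
    print(f"{len(results)} requests in {elapsed:.2f}s")
```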
---

## 2️⃣ Load the Model

In [None]:
# Load model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
model.eval()

print(f"Model loaded: {model_name}")
print(f"Labels: {model.config.id2label}")

In [None]:
# Test prediction function
def predict(text: str) -> dict:
    """Run inference on input text."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    ).to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_idx = probs.argmax().item()
    
    return {
        "label": model.config.id2label[pred_idx],
        "confidence": probs[pred_idx].item(),
        "probabilities": {
            model.config.id2label[i]: probs[i].item() 
            for i in range(len(probs))
        }
    }

# Test
result = predict("This movie was fantastic!")
print(f"Test prediction: {result}")
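
To see what the softmax and argmax steps in `predict` do, the sketch below repeats them in plain Python on made-up logits. The logit values are hypothetical, and the `id2label` mapping is written out by hand for illustration; in the notebook it comes from `model.config.id2label`.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one input, ordered like the id2label mapping.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-2.0, 2.0]

probs = softmax(logits)
pred_idx = max(range(len(probs)), key=probs.__getitem__)  # argmax

result = {
    "label": id2label[pred_idx],
    "confidence": probs[pred_idx],  # about 0.982 for these logits
    "probabilities": {id2label[i]: p for i, p in enumerate(probs)},
}
print(result)
```

The `confidence` field is simply the largest softmax probability, so it reflects how far apart the logits are rather than a calibrated measure of correctness.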

---

## 3️⃣ Create the FastAPI Application

The cell below holds the complete FastAPI application as a string so that a later cell can write it to `app.py`. In a production project, you would keep this code in its own file.

In [None]:
# FastAPI application code
app_code = '''
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import Dict, List
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Initialize app
app = FastAPI(
    title="Sentiment Analysis API",
    description="A REST API for sentiment classification using DistilBERT",
    version="1.0.0"
)

# Load model at startup
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Request/Response schemas
class PredictionRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000, description="Text to classify")
    
    # Pydantic v2 style; the nested `class Config` form is deprecated
    model_config = {
        "json_schema_extra": {
            "example": {"text": "This movie was absolutely fantastic!"}
        }
    }

class PredictionResponse(BaseModel):
    label: str
    confidence: float
    probabilities: Dict[str, float]

class BatchRequest(BaseModel):
    texts: List[str] = Field(..., min_length=1, max_length=100, description="1 to 100 texts to classify")

class HealthResponse(BaseModel):
    status: str
    model: str
    device: str

# Endpoints
@app.get("/health", response_model=HealthResponse)
def health_check():
    """Check if the API is running and model is loaded."""
    return {
        "status": "healthy",
        "model": model_name,
        "device": str(device)
    }

@app.post("/predict", response_model=PredictionResponse)
def predict_sentiment(request: PredictionRequest):
    """Predict sentiment for a single text."""
    try:
        inputs = tokenizer(
            request.text,
            return_tensors="pt",
            truncation=True,
            max_length=512
        ).to(device)
        
        with torch.no_grad():
            outputs = model(**inputs)
        
        probs = torch.softmax(outputs.logits, dim=-1)[0]
        pred_idx = probs.argmax().item()
        
        return {
            "label": model.config.id2label[pred_idx],
            "confidence": probs[pred_idx].item(),
            "probabilities": {
                model.config.id2label[i]: probs[i].item() 
                for i in range(len(probs))
            }
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

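@app.post("/predict/batch", response_model=List[PredictionResponse])
def predict_batch(request: BatchRequest):
    """Predict sentiment for multiple texts (one forward pass per text)."""
    results = []
    for text in request.texts:
        try:
            # Reuse the single-text schema so each item gets the same length checks
            req = PredictionRequest(text=text)
        except ValueError as e:
            # Pydantic's ValidationError subclasses ValueError; report it as a client error
            raise HTTPException(status_code=422, detail=str(e))
        results.append(predict_sentiment(req))
    return results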

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

# Display the code
print("FastAPI Application Code:")
print("=" * 60)
print(app_code)

In [None]:
# Save to file
with open("./app.py", "w") as f:
    f.write(app_code)

print("✅ Application saved to app.py")
print("\nTo run the server:")
print("  python app.py")
print("  OR")
print("  uvicorn app:app --reload --host 0.0.0.0 --port 8000")

---

## 4️⃣ Understanding the Application

### Request/Response Models (Pydantic)

```python
class PredictionRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)
```

This provides:
- **Automatic validation** (text must be 1-5000 characters)
- **Documentation** (shown in Swagger UI)
- **Type hints** for IDE support

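For comparison, here is roughly what you would write by hand without the `Field` constraints. This is a plain-Python sketch, not Pydantic's implementation: `validate_text` is a hypothetical helper and its error messages are made up. With FastAPI, a request body that fails validation is rejected with a 422 response before your endpoint code runs.

```python
def validate_text(payload: dict, min_length: int = 1, max_length: int = 5000) -> str:
    """Hand-rolled stand-in for the checks the Field constraints declare."""
    if "text" not in payload:
        raise ValueError("text: field required")
    text = payload["text"]
    if not isinstance(text, str):
        raise ValueError("text: must be a string")
    if len(text) < min_length:
        raise ValueError(f"text: must have at least {min_length} character(s)")
    if len(text) > max_length:
        raise ValueError(f"text: must have at most {max_length} characters")
    return text


# One valid payload followed by three that break a rule.
for candidate in ({"text": "Great film"}, {"text": ""}, {"text": "x" * 5001}, {}):
    try:
        validate_text(candidate)
        print("accepted")
    except ValueError as err:
        print("rejected:", err)
```

Declaring the same rules once on the model keeps the validation, the generated documentation, and the type hints in sync.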
### Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/predict` | POST | Single text prediction |
| `/predict/batch` | POST | Batch predictions |
| `/docs` | GET | Swagger UI (automatic) |
| `/redoc` | GET | ReDoc UI (automatic) |

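The `/predict/batch` handler in this notebook calls the model once per text. A single forward pass over all texts needs rows of equal length, which Hugging Face tokenizers produce when called with a list of strings and `padding=True`. The plain-Python sketch below illustrates what that padding step does conceptually; `pad_batch`, the token ids, and `pad_id=0` are made up for illustration.

```python
def pad_batch(
    token_id_lists: list[list[int]], pad_id: int = 0
) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad each sequence to the longest one and build the attention mask."""
    longest = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids for three texts of different lengths.
batch = [[101, 7, 8, 9, 102], [101, 7, 102], [101, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```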
---

## 5️⃣ Testing the API

Once the server is running, you can test it using `requests`:

In [None]:
# Example client code (run when server is active)
client_code = '''
import requests

BASE_URL = "http://localhost:8000"

# Health check
response = requests.get(f"{BASE_URL}/health")
print("Health:", response.json())

# Single prediction
response = requests.post(
    f"{BASE_URL}/predict",
    json={"text": "This movie was fantastic!"}
)
print("Prediction:", response.json())

# Batch prediction
response = requests.post(
    f"{BASE_URL}/predict/batch",
    json={
        "texts": [
            "I love this product!",
            "Terrible experience, never again.",
            "It was okay."
        ]
    }
)
print("Batch:", response.json())
'''

print("Client Test Code:")
print("=" * 60)
print(client_code)

# Save client code
with open("./test_client.py", "w") as f:
    f.write(client_code)
print("\n✅ Client code saved to test_client.py")
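
Because the server loads the model before it starts accepting connections, a client launched right after the server can fail with a connection error. One possible way to handle this is to poll `/health` first. The helper below is a sketch using only the standard library; `wait_until_healthy` and its default timings are assumptions, not part of the notebook's code.

```python
import time
import urllib.request


def wait_until_healthy(base_url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll GET {base_url}/health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error statuses
            # (URLError and HTTPError are OSError subclasses).
            pass
        time.sleep(interval_s)
    return False


# Usage, with the server from app.py running on port 8000:
#     if wait_until_healthy("http://localhost:8000"):
#         ...  # send the /predict requests
```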

---

## 6️⃣ Production Best Practices

In [None]:
production_tips = """
## Production Best Practices

### 1. Model Loading
- Load model ONCE at startup, not per request
- Use a `lifespan` handler for initialization (`@app.on_event("startup")` still works but is deprecated)

### 2. Error Handling
- Use try/except and HTTPException
- Return meaningful error messages

### 3. Logging
```python
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
```

### 4. Rate Limiting
```python
# Third-party package, installed separately: pip install fastapi-limiter
from fastapi_limiter import FastAPILimiter
```

### 5. CORS (for web clients)
```python
from fastapi.middleware.cors import CORSMiddleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # wide open for local testing; list specific origins in production
    allow_methods=["*"],
    allow_headers=["*"],
)
```

### 6. Async for I/O
```python
@app.post("/predict")
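async def predict(request: PredictionRequest):
    # Use async def for handlers that await I/O (database, files, HTTP).
    # Compute-bound inference blocks the event loop; prefer plain def there.
    ...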
```

### 7. Health Checks
- Include model status, memory usage, GPU utilization
- Kubernetes probes and Docker health checks can poll this endpoint for readiness
"""

print(production_tips)
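
The rate-limiting tip names a library but not the mechanism. A common approach is a token bucket, sketched below in plain Python. This illustrates the general idea only and makes no claim about how `fastapi-limiter` works internally; the `TokenBucket` class and its parameters are hypothetical. The state lives in one process, so a multi-worker deployment would need a shared store instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self) -> bool:
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_s=1.0, capacity=3)
decisions = [bucket.allow() for _ in range(5)]
print(decisions)  # the first three calls pass; later ones depend on elapsed time
```

In an API, a request for which `allow()` returns `False` would typically be answered with HTTP 429 (Too Many Requests).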

---

## 🎯 Student Challenge

### Challenge: Add New Endpoints

In [None]:
# TODO: Extend the API with these features:

# 1. Add a `/tokenize` endpoint that returns token information
#    - Input: {"text": "..."}
#    - Output: {"tokens": [...], "token_ids": [...], "num_tokens": N}

# 2. Add model info endpoint `/model/info`
#    - Output: {"name": "...", "parameters": N, "vocab_size": N}

# 3. Add request timing middleware
#    - Log request duration for each call

# Your solution:


---

## 📝 Key Takeaways

1. **FastAPI** provides automatic docs, validation, and async support
2. **Pydantic models** define request/response schemas with validation
3. **Load models once** at startup for efficiency
4. **Health endpoints** are essential for production monitoring
5. **Batch endpoints** cut per-request HTTP overhead; the largest gains come from running the texts through the model as one padded batch

---

## ➡️ Next Steps

Continue to `02_gradio_ui.ipynb` for interactive web interfaces!