# Chapter 40: Building Prediction Services

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Design a RESTful API to serve predictions from your trained model
- Choose between synchronous and asynchronous prediction services based on latency requirements
- Implement a prediction service using FastAPI and Flask
- Handle request validation, error responses, and logging
- Add authentication and rate limiting to secure your API
- Document your API using OpenAPI/Swagger
- Package the service as a standalone application for deployment
- Test the API locally and with automated tests
- Understand the differences between online (real‑time) and batch prediction services
- Deploy the service using containers and orchestration tools

---

## **40.1 Introduction to Prediction Services**

A model that sits on disk is not useful until it can make predictions on new data. In a production environment, we typically expose the model through a **prediction service** – an API that accepts input data and returns predictions. This decouples the model from the applications that consume its predictions (e.g., trading dashboards, mobile apps, other microservices).

For the NEPSE prediction system, we might build a service that:

- Accepts a stock symbol and returns the predicted next‑day return.
- Accepts a batch of recent market data for multiple stocks and returns predictions.
- Runs as a scheduled batch job that pushes predictions to a database.

In this chapter, we focus on building a **real‑time prediction API** using FastAPI, a modern Python web framework. We'll also discuss batch prediction services briefly.

---

## **40.2 Service Architecture Patterns**

Before coding, we need to decide on the architecture. Common patterns include:

### **40.2.1 Online (Real‑Time) Prediction**

- Client sends a request with input data.
- Service preprocesses the input, runs inference with a model that is already loaded in memory, and returns the prediction.
- Latency is critical (milliseconds to seconds).
- Suitable for interactive applications or low‑latency trading signals.

### **40.2.2 Batch Prediction**

- Predictions are generated periodically (e.g., once per day) for all stocks.
- Results are stored in a database or file.
- Clients query the pre‑computed results.
- Simpler to implement, no real‑time infrastructure required.

### **40.2.3 Streaming Prediction**

- Input data arrives as a stream (e.g., tick data).
- Predictions are generated on the fly and emitted to a stream.
- Requires stream processing frameworks (Kafka, Flink). We'll touch on this in Chapter 42.

For the NEPSE system, a daily batch prediction might suffice, but we'll build an API for on‑demand predictions as a learning exercise.

---

## **40.3 REST API Design**

A well‑designed REST API follows conventions:

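- Use nouns for resources. Strictly speaking, `/predict` names an action and `/predictions` would be the more RESTful choice; this chapter keeps `/predict` because it is short and common for model‑serving endpoints.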
- Use HTTP methods appropriately: `POST` for creating a prediction (since we are sending data).
- Accept JSON in the request body, return JSON.
- Use meaningful status codes: 200 for success, 400 for bad input, 404 for not found, 500 for server error.
- Version your API (e.g., `/api/v1/predictions`).

For our prediction service, we'll design two endpoints:

- `POST /api/v1/predict` – single prediction for one stock.
- `POST /api/v1/predict/batch` – batch predictions for multiple stocks.

Request body example for single prediction:

```json
{
  "symbol": "NEPSE",
  "date": "2025-03-16",
  "features": {
    "Open": 1200.5,
    "High": 1210.0,
    "Low": 1195.0,
    "Close": 1205.0,
    "Vol": 1500000,
    "VWAP": 1202.3,
    "Prev_Close": 1198.0
  }
}
```

Response:

```json
{
  "symbol": "NEPSE",
  "prediction_date": "2025-03-16",
  "predicted_return": 0.35,
  "model_version": "v1.2.0"
}
```
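
To make the status‑code conventions concrete, here is a framework‑independent sketch that checks a request body of the shape shown above and returns a status code together with a JSON‑serializable body. The function name `validate_prediction_payload` and the error messages are illustrative assumptions, not part of any library.

```python
from datetime import datetime

REQUIRED_KEYS = ("symbol", "date", "features")

def validate_prediction_payload(payload):
    """Return (status_code, body) for a single-prediction request body."""
    if not isinstance(payload, dict):
        return 400, {"error": "Request body must be a JSON object"}
    missing = [k for k in REQUIRED_KEYS if k not in payload]
    if missing:
        return 400, {"error": f"Missing fields: {', '.join(missing)}"}
    try:
        # Dates are expected in ISO format, e.g. 2025-03-16
        datetime.strptime(payload["date"], "%Y-%m-%d")
    except (TypeError, ValueError):
        return 400, {"error": "date must use the format YYYY-MM-DD"}
    features = payload["features"]
    if not isinstance(features, dict) or not features:
        return 400, {"error": "features must be a non-empty object"}
    # bool is a subclass of int in Python, so it is excluded explicitly
    bad = [k for k, v in features.items()
           if isinstance(v, bool) or not isinstance(v, (int, float))]
    if bad:
        return 400, {"error": f"Non-numeric features: {', '.join(bad)}"}
    return 200, {"status": "ok"}
```

A web framework would call a check like this before running the model, so that malformed input is answered with a 400 rather than surfacing later as a server error.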

---

## **40.4 FastAPI Implementation**

FastAPI is a modern, high‑performance web framework for building APIs with Python. It automatically generates OpenAPI documentation and supports asynchronous programming.

### **40.4.1 Setting Up FastAPI**

First, install FastAPI and an ASGI server like Uvicorn:

```bash
pip install fastapi uvicorn
```

Create a file `app.py`:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import joblib
import numpy as np
from typing import List

app = FastAPI(title="NEPSE Prediction API", version="1.0.0")

# Load model and preprocessor at startup
model = joblib.load("models/pipeline.joblib")
feature_names = joblib.load("models/feature_names.joblib")  # list of expected features

class PredictionRequest(BaseModel):
    symbol: str
    date: str
    features: dict

class BatchPredictionRequest(BaseModel):
    requests: List[PredictionRequest]

class PredictionResponse(BaseModel):
    symbol: str
    prediction_date: str
    predicted_return: float
    model_version: str = "v1.2.0"

@app.get("/")
def root():
    return {"message": "NEPSE Prediction API"}

@app.post("/api/v1/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    """
    Predict next day's return for a single stock.
    """
    try:
        # Convert features dict to array in correct order
        features_array = np.array([[request.features.get(f, 0) for f in feature_names]])
        # Predict
        pred = model.predict(features_array)[0]
        return PredictionResponse(
            symbol=request.symbol,
            prediction_date=request.date,
            predicted_return=float(pred)
        )
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))

@app.post("/api/v1/predict/batch", response_model=List[PredictionResponse])
async def predict_batch(request: BatchPredictionRequest):
    """
    Predict next day's return for multiple stocks.
    """
    try:
        # Build feature matrix
        X = []
        for req in request.requests:
            X.append([req.features.get(f, 0) for f in feature_names])
        X = np.array(X)
        preds = model.predict(X)
        responses = []
        for req, pred in zip(request.requests, preds):
            responses.append(PredictionResponse(
                symbol=req.symbol,
                prediction_date=req.date,
                predicted_return=float(pred)
            ))
        return responses
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

**Explanation:**

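- **Pydantic models** (`PredictionRequest`, etc.) define the expected request/response structure and provide automatic validation. A request that does not match the model is rejected with a `422 Unprocessable Entity` response before the endpoint code runs.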
- The model is loaded once at startup (outside the endpoint functions) to avoid reloading on every request.
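- The endpoints are declared with `async def`, so FastAPI runs them directly on the event loop. Because `model.predict` is CPU‑bound, it blocks the loop while it runs. This is acceptable for a fast model; for slower models, declare the endpoint with plain `def` so that FastAPI runs it in a thread pool.
- We extract features in the order expected by the model (`feature_names`). If a feature is missing, we default to 0, which keeps the example short but can hide incomplete input.
- If an error occurs, we raise `HTTPException` with a 400 status code. Because the `except` clause catches every exception, failures inside the model are also reported as 400; a production service would reserve 400 for bad input and return 500 for server‑side errors.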
- The API documentation is automatically available at `/docs`.
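
Defaulting missing features to 0 is convenient, but a zero price or volume looks like valid input and can produce a misleading prediction. One alternative, sketched here with a hypothetical helper `build_feature_row` (not part of the service above), is to reject incomplete input so that the endpoint can answer with a 400.

```python
def build_feature_row(features, feature_names):
    """Order feature values as the model expects, rejecting incomplete input."""
    missing = [name for name in feature_names if name not in features]
    if missing:
        raise ValueError(f"Missing features: {', '.join(missing)}")
    # Extra keys in `features` are ignored; only the expected names are used
    return [float(features[name]) for name in feature_names]

expected = ["Open", "High", "Low", "Close"]
row = build_feature_row(
    {"Close": 1205.0, "Open": 1200.5, "Low": 1195.0, "High": 1210.0}, expected
)
print(row)  # [1200.5, 1210.0, 1195.0, 1205.0]
```

The returned list can be wrapped as `np.array([row])` before calling `model.predict`, exactly as the endpoint does with its inline list comprehension.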

### **40.4.2 Running the Service**

```bash
uvicorn app:app --reload --host 0.0.0.0 --port 8000
```

Then visit `http://localhost:8000/docs` to see the interactive Swagger UI.
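
Besides the Swagger UI, the service can be called from any HTTP client. The sketch below uses only the standard library to build the request for the single‑prediction endpoint; the helper name `build_prediction_request` is an assumption for illustration, and the base URL assumes the server is running locally on port 8000 as above.

```python
import json
import urllib.request

def build_prediction_request(base_url, symbol, date, features):
    """Build (but do not send) a POST request for the single-prediction endpoint."""
    body = json.dumps({"symbol": symbol, "date": date, "features": features})
    return urllib.request.Request(
        url=f"{base_url}/api/v1/predict",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_prediction_request(
    "http://localhost:8000", "NEPSE", "2025-03-16", {"Open": 1200.5, "Close": 1205.0}
)
# With the server running, send the request and decode the JSON response:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```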

---

## **40.5 Flask Implementation (Alternative)**

Flask is another popular choice, though it is synchronous by default. Here's a minimal Flask version:

```python
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
model = joblib.load("models/pipeline.joblib")
feature_names = joblib.load("models/feature_names.joblib")

@app.route('/api/v1/predict', methods=['POST'])
def predict():
    data = request.get_json()
    try:
        features = data['features']
        X = np.array([[features.get(f, 0) for f in feature_names]])
        pred = model.predict(X)[0]
        return jsonify({
            'symbol': data['symbol'],
            'prediction_date': data['date'],
            'predicted_return': float(pred),
            'model_version': 'v1.2.0'
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 400

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
```

**Explanation:**  
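Flask is simpler but lacks built‑in request validation, so you would need to add it manually (e.g., using `marshmallow`). Its async support is also limited: Flask 2.0 and later accept `async` view functions, but the framework still runs on WSGI. FastAPI is generally preferred for new projects.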

---

## **40.6 Asynchronous Processing**

If predictions take a long time (e.g., large deep learning models), you may want to process them asynchronously to avoid blocking the request thread. This involves:

- Accepting the request and immediately returning a `202 Accepted` with a task ID.
- Processing the prediction in the background.
- Allowing the client to poll a status endpoint for the result.

FastAPI can integrate with background tasks:

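```python
# Continues app.py: uses app, model, np, feature_names, HTTPException, PredictionRequest
from fastapi import BackgroundTasks
import uuid

# In-memory task store; use Redis or a database when running several workers
tasks = {}

def run_prediction(task_id, features):
    # Long-running prediction, executed after the response has been sent
    try:
        result = model.predict(features)
        # Convert the NumPy array to a list so it can be serialized to JSON
        tasks[task_id] = {"status": "done", "result": result.tolist()}
    except Exception as e:
        tasks[task_id] = {"status": "failed", "error": str(e)}

@app.post("/api/v1/predict/async", status_code=202)
async def predict_async(request: PredictionRequest, background_tasks: BackgroundTasks):
    task_id = str(uuid.uuid4())
    # Prepare features in the order the model expects (same as the sync endpoint)
    features = np.array([[request.features.get(f, 0) for f in feature_names]])
    tasks[task_id] = {"status": "processing"}
    background_tasks.add_task(run_prediction, task_id, features)
    return {"task_id": task_id, "status": "processing"}

@app.get("/api/v1/predict/async/{task_id}")
async def get_result(task_id: str):
    if task_id not in tasks:
        raise HTTPException(status_code=404, detail="Unknown task ID")
    # Returns the status, plus the result or error once the task has finished
    return {"task_id": task_id, **tasks[task_id]}
```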

**Explanation:**  
This pattern is useful for long‑running inferences. The client can poll the status endpoint until the result is ready.
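
On the client side, polling can be sketched as a small loop. This example assumes the status endpoint reports a `status` field that stays `"processing"` until the task finishes; the helper `poll_until_done` and the stand‑in responses are hypothetical, and a real client would replace the lambda with an HTTP call.

```python
import time

def poll_until_done(fetch_status, interval=0.5, timeout=30.0):
    """Call fetch_status() until the task leaves the 'processing' state."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") != "processing":
            return status
        if time.monotonic() + interval > deadline:
            raise TimeoutError("Task did not finish before the timeout")
        time.sleep(interval)

# Stand-in for repeated calls to GET /api/v1/predict/async/{task_id}
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "done", "result": [0.35]},
])
final = poll_until_done(lambda: next(responses), interval=0.01, timeout=1.0)
print(final)  # {'status': 'done', 'result': [0.35]}
```

A fixed interval keeps the sketch short; for tasks with widely varying run times, increasing the interval between attempts reduces load on the service.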

---

## **40.7 Batch Prediction Services**

For daily batch predictions, we might not need a REST API. Instead, we can write a script that:

1. Fetches the latest data for all stocks.
2. Computes features.
3. Loads the model (or multiple models).
4. Generates predictions.
5. Saves predictions to a database or CSV.

This script can be scheduled with cron, Airflow, or a cloud scheduler.

```python
# batch_predict.py
from datetime import datetime

import joblib
import pandas as pd

from src.data.features import FeatureEngineer

# NOTE: `config` and `feature_cols` are not defined in this script; they are
# assumed to be provided by the project's configuration and training code.

def run_batch_predictions():
    # Compute the date stamp once so input and output files always match
    today = datetime.today().strftime('%Y%m%d')

    # Load data
    df = pd.read_csv(f"data/raw/nepse_{today}.csv")

    # Feature engineering (reuse feature engineering module)
    engineer = FeatureEngineer(config)
    df_features = engineer.create_features(df)

    # For each symbol, load its model and predict
    predictions = []
    for symbol in df_features['Symbol'].unique():
        model = joblib.load(f"models/{symbol}/model.joblib")
        symbol_data = df_features[df_features['Symbol'] == symbol]
        X = symbol_data[feature_cols]
        preds = model.predict(X)
        # Store the prediction for this symbol's last row, with that symbol's
        # own last date (assumes the raw data also has a 'Symbol' column)
        predictions.append({
            'symbol': symbol,
            'date': df.loc[df['Symbol'] == symbol, 'Date'].iloc[-1],
            'prediction': float(preds[-1]),
        })

    # Save to CSV or database
    pred_df = pd.DataFrame(predictions)
    pred_df.to_csv(f"data/predictions/{today}_predictions.csv", index=False)

if __name__ == "__main__":
    run_batch_predictions()
```
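
Once the script has written its output, clients read the pre‑computed results instead of calling a model. As a rough sketch, assuming a CSV with the columns `symbol`, `date`, and `prediction` written by the script above, a lookup table can be built with the standard library. The helper name `load_latest_predictions` and the sample rows are made up for illustration.

```python
import csv
import io

def load_latest_predictions(csv_file):
    """Map each symbol to its most recent prediction row from a predictions CSV."""
    latest = {}
    for row in csv.DictReader(csv_file):
        current = latest.get(row["symbol"])
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings
        if current is None or row["date"] > current["date"]:
            latest[row["symbol"]] = {
                "date": row["date"],
                "prediction": float(row["prediction"]),
            }
    return latest

# Illustrative sample data, not real predictions
sample = io.StringIO(
    "symbol,date,prediction\n"
    "NABIL,2025-03-14,0.10\n"
    "NABIL,2025-03-16,0.35\n"
    "NICA,2025-03-16,-0.20\n"
)
predictions = load_latest_predictions(sample)
print(predictions["NABIL"])  # {'date': '2025-03-16', 'prediction': 0.35}
```

In practice the file object would come from `open(...)` on the predictions file, or the same lookup would be a query against the database the batch job writes to.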

---

## **40.8 Authentication and Authorization**

If your API is exposed to the internet, you need to secure it. Common methods:

- **API Key:** Client includes a key in the header.
- **OAuth2:** More complex, suitable for user authentication.
- **JWT:** Token‑based authentication.

FastAPI provides built‑in support for OAuth2 and API keys.

### **40.8.1 API Key Example**

```python
from fastapi import Security, HTTPException
from fastapi.security import APIKeyHeader

API_KEY = "secret-key-123"  # demo only; load from an environment variable in production
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API Key")
    return api_key

@app.post("/api/v1/predict", dependencies=[Security(verify_api_key)])
async def predict(request: PredictionRequest):
    ...  # endpoint code as in Section 40.4
```

**Explanation:**  
The client must include the header `X-API-Key: secret-key-123`. The key should be stored securely (environment variable) and not hard‑coded.
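
A minimal sketch of that advice, using only the standard library: the key is read from an environment variable and compared in constant time with `secrets.compare_digest`. The variable name `PREDICTION_API_KEY` and both helper functions are assumptions for illustration.

```python
import os
import secrets

def load_api_key(env_var="PREDICTION_API_KEY"):
    """Read the expected key from the environment, failing fast if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key

def is_valid_api_key(provided, expected):
    """Compare keys in constant time to avoid leaking information via timing."""
    if provided is None:
        return False
    return secrets.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))

os.environ["PREDICTION_API_KEY"] = "example-key"  # demo only; set this outside the code
expected_key = load_api_key()
print(is_valid_api_key("example-key", expected_key))  # True
print(is_valid_api_key("wrong-key", expected_key))    # False
```

Inside `verify_api_key`, the `!=` comparison could be replaced by a call to a helper like `is_valid_api_key`.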

---

## **40.9 Rate Limiting**

To prevent abuse, you may want to limit the number of requests per client. FastAPI does not include rate limiting out of the box, but you can use middleware or a library like `slowapi`.

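```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Identify clients by their remote IP address
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/api/v1/predict")
@limiter.limit("5/minute")
async def predict(request: Request, payload: PredictionRequest):
    # slowapi needs a parameter named `request` holding the Starlette Request,
    # so the JSON body is received as `payload` instead.
    ...
```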
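
To show what a limit such as `5/minute` means, here is a small fixed‑window limiter written with the standard library. It illustrates the idea only and is not how `slowapi` is implemented; the class name `FixedWindowLimiter` and the injectable `clock` parameter are assumptions made so that the example can be run without waiting.

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each `window`-second window."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._windows = {}  # client_id -> (window_start, request_count)

    def allow(self, client_id):
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window has expired
        if count >= self.limit:
            return False  # an API would answer with 429 Too Many Requests
        self._windows[client_id] = (start, count + 1)
        return True

# A fake clock makes the behaviour reproducible
fake_now = [0.0]
window_limiter = FixedWindowLimiter(limit=5, window=60, clock=lambda: fake_now[0])
results = [window_limiter.allow("10.0.0.1") for _ in range(6)]
print(results)  # [True, True, True, True, True, False]

fake_now[0] = 61.0  # more than 60 seconds later
print(window_limiter.allow("10.0.0.1"))  # True
```

Counters kept in process memory are not shared between workers, so a service running several processes needs a shared store behind its limiter.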

---

## **40.10 API Documentation**

FastAPI automatically generates OpenAPI documentation. You can add descriptions to endpoints and models using docstrings and Pydantic field descriptions.

```python
class PredictionRequest(BaseModel):
    symbol: str = Field(..., description="NEPSE stock symbol")
    date: str = Field(..., description="Date of the input data (YYYY-MM-DD)")
    features: dict = Field(..., description="Dictionary of feature values")
```

The interactive docs at `/docs` allow users to try out the API.

---

## **40.11 Service Testing**

Test your API endpoints using `pytest` and FastAPI's `TestClient`, which requires the `httpx` package to be installed.

```python
# test_api.py
from fastapi.testclient import TestClient
from app import app

client = TestClient(app)

def test_predict():
    response = client.post("/api/v1/predict", json={
        "symbol": "NEPSE",
        "date": "2025-03-16",
        "features": {"Open": 1200, "High": 1210, "Low": 1195, "Close": 1205, "Vol": 1500000}
    })
    assert response.status_code == 200
    data = response.json()
    assert "predicted_return" in data
```

Run tests with `pytest`.

---

## **40.12 Deployment**

Once the API is ready, you can deploy it using:

- **Container** (Docker) – as shown in Chapter 38.
- **Serverless** – using AWS Lambda with API Gateway (requires an ASGI adapter such as Mangum to run a FastAPI app).
- **Cloud Run** / **Azure Container Instances** for simple container hosting.

Example Dockerfile:

```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
COPY models/ ./models/

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Then deploy to a cloud service.

---

## **40.13 Chapter Summary**

In this chapter, we built a prediction service for the NEPSE system using FastAPI.

- **REST API design:** We defined endpoints for single and batch predictions.
- **FastAPI implementation:** We created a service that loads a model, validates input, and returns predictions.
- **Flask alternative:** A simpler, synchronous version.
- **Asynchronous processing:** For long‑running predictions, we used background tasks.
- **Batch prediction:** A script for daily scheduled predictions.
- **Security:** API key authentication and rate limiting.
- **Documentation:** Auto‑generated Swagger UI.
- **Testing:** Unit tests for endpoints.
- **Deployment:** Docker and cloud options.

### **Practical Takeaways for the NEPSE System:**

- For real‑time needs, use FastAPI to serve predictions with low latency.
- Store models and preprocessing artifacts in a known location, loaded at startup.
- Validate input using Pydantic models.
- Secure the API with an API key if exposed externally.
- For batch predictions, schedule a script using cron/Airflow.
- Containerize the service for reproducible deployment.

In the next chapter, **Chapter 41: Batch Prediction Systems**, we will explore batch processing in more detail, including scheduling, data preparation at scale, and integration with data warehouses.

---

**End of Chapter 40**

