# Chapter 43: Model Deployment Strategies

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand different deployment patterns (canary, blue‑green, shadow) and when to use each
- Containerize a machine learning model using Docker for consistent and portable deployment
- Use Docker Compose to orchestrate multi‑service applications locally
- Deploy models to Kubernetes for automated scaling, rolling updates, and self‑healing
- Evaluate serverless and edge deployment options for low‑latency and cost‑sensitive scenarios
- Design deployment pipelines that integrate with CI/CD systems
- Implement rollback strategies to revert to a previous model version safely

---

## Introduction

After you have trained, validated, and packaged a model (like our NEPSE stock predictor), the next critical step is to **deploy** it into a production environment where it can serve predictions. Model deployment is not a one‑time event; it is an ongoing process that must handle updates, scaling, failures, and varying load. The strategy you choose affects the system’s reliability, latency, cost, and ability to evolve.

In this chapter, we will explore various deployment patterns and technologies, using our NEPSE prediction system as a concrete example. We will start with simple containerization, then move to orchestration with Kubernetes, and finally touch on serverless and edge deployments. By the end, you will be equipped to design a deployment pipeline that safely rolls out new models while minimising downtime and risk.

---

## 43.1 Deployment Patterns

Deployment patterns describe how a new version of a model is introduced and how traffic is shifted from the old version to the new one. The goal is to reduce risk: if the new version behaves poorly, we want to minimise the impact on users and be able to revert quickly.

### 43.1.1 Canary Deployment

In a **canary deployment**, the new model version is initially exposed to a small subset of users or requests. If it performs well (according to monitoring metrics), the traffic is gradually increased until it reaches 100%. If errors spike, the canary can be rolled back without affecting most users.

**NEPSE Example:**  
Suppose we have a new model that predicts intra‑day price movements. We might route 5% of the incoming prediction requests to the new model and 95% to the old one. After observing no degradation in latency or accuracy over a few hours, we increase the canary to 25%, then 50%, and finally 100%.

**Implementation Approach:**  
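Canary deployments can be implemented at the traffic-routing layer (e.g., with a service mesh such as Istio or an ingress controller that supports weighted routing) or within the application (e.g., by reading a feature flag from a configuration server). A plain Kubernetes `Service` cannot assign explicit weights; it simply spreads traffic across whichever Pods match its selector.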

```yaml
# Istio VirtualService with weighted routing (nepse-predictor-v1 and -v2 are separate Services)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: nepse-predictor
spec:
  hosts:
  - nepse-predictor
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: nepse-predictor-v2
      weight: 100
  - route:
    - destination:
        host: nepse-predictor-v1
      weight: 95
    - destination:
        host: nepse-predictor-v2
      weight: 5
```

**Explanation:**  
This Istio `VirtualService` contains two rules, evaluated in order. Requests carrying the header `x-canary: true` match the first rule and always go to version 2, which is useful for internal testing. All other traffic falls through to the second rule, which sends 95% of requests to version 1 and 5% to version 2. As confidence grows, the weights can be adjusted gradually.
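
The application-level alternative can be sketched with a deterministic hash of a request or user key, so that the same key always lands on the same model version and raising the percentage only ever adds keys to the canary. This is a minimal sketch, not part of the NEPSE codebase: the function names `canary_bucket` and `choose_version` and the version labels are hypothetical.

```python
import hashlib

BUCKETS = 10_000  # resolution of the split: 1 bucket = 0.01% of keys


def canary_bucket(request_key: str) -> int:
    """Map a request or user key to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def choose_version(request_key: str, canary_percent: float) -> str:
    """Return "v2" for roughly canary_percent of keys and "v1" otherwise."""
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    threshold = canary_percent / 100.0 * BUCKETS
    return "v2" if canary_bucket(request_key) < threshold else "v1"
```

In practice `canary_percent` would be read from configuration (a feature flag), so the split can be changed without redeploying the application.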

### 43.1.2 Blue‑Green Deployment

**Blue‑green deployment** maintains two identical production environments: the **blue** (current) and the **green** (new). At any time, only one environment serves live traffic. When a new version is ready, it is deployed to the green environment, thoroughly tested, and then the router is switched to send all traffic to green. Blue becomes the standby for the next deployment.

**NEPSE Example:**  
We have two sets of servers running our prediction API: blue (v1.0) and green (v1.1). After validating green internally, we flip the load balancer from blue to green. If problems arise, we flip back immediately.

**Implementation:**  
Blue‑green requires a load balancer or router that can switch between two backend groups.

```yaml
# Kubernetes service pointing to blue deployment
apiVersion: v1
kind: Service
metadata:
  name: nepse-predictor
spec:
  selector:
    app: nepse-predictor
    version: blue   # initially blue
  ports:
  - port: 80
    targetPort: 8000
```

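To switch to green, we simply update the `selector` to `version: green`. Kubernetes then routes new connections to the green pods as soon as the Service's endpoints are updated, which usually takes a few seconds; long-lived connections that were already open to blue pods may persist until they close.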
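
The selector change can also be scripted. The sketch below only builds the patch document and the corresponding `kubectl patch` argument list; the helper names are hypothetical, and it assumes the Service is named `nepse-predictor` as in the manifest above.

```python
import json


def selector_patch(color: str) -> str:
    """Return a JSON merge patch that points the Service at one environment."""
    if color not in ("blue", "green"):
        raise ValueError("color must be 'blue' or 'green'")
    patch = {"spec": {"selector": {"app": "nepse-predictor", "version": color}}}
    return json.dumps(patch)


def switch_command(color: str, service: str = "nepse-predictor") -> list:
    """Build the kubectl argument list that applies the patch."""
    return ["kubectl", "patch", "service", service,
            "--type", "merge", "-p", selector_patch(color)]
```

Passing `switch_command("green")` to `subprocess.run(..., check=True)` would perform the switch, and `switch_command("blue")` would reverse it.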

**Advantages:**  
- Instant rollback (just change selector back).  
- No mixed versions during the switch.

**Disadvantages:**  
- Requires double the resources during deployment.  
- Database schema changes must be backward‑compatible if both environments share the database.

### 43.1.3 Shadow Deployment

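**Shadow (or mirroring) deployment** sends a copy of production traffic to the new model while the old model continues to serve the live responses. The new model’s predictions are logged and compared, but never returned to users. This allows you to validate performance under real‑world load without affecting the responses users receive. The main costs are the extra compute for the duplicated traffic and the need to ensure the shadow model has no side effects (for example, writing to production databases).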

**NEPSE Example:**  
We deploy a new model as a shadow service. Every prediction request is duplicated and sent to both the old and the new model. The old model’s response is returned to the user, while the new model’s output is stored for offline analysis. After a week of validation, we may decide to promote it.

**Implementation:**  
Many service meshes (like Istio) support traffic mirroring.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: nepse-predictor
spec:
  hosts:
  - nepse-predictor
  http:
  - route:
    - destination:
        host: nepse-predictor-v1
      weight: 100
    mirror:
      host: nepse-predictor-v2
    mirrorPercentage:
      value: 100.0
```

**Explanation:**  
This configuration sends all requests to `v1` but mirrors 100% of them to `v2`. Responses from `v2` are ignored, but the traffic allows us to measure latency and error rates under production conditions.
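
Once both models' outputs are logged, the offline comparison can be as simple as joining the two logs on a request ID. The following is a minimal sketch under the assumption that each log has been loaded into a dictionary mapping request ID to predicted value; the function name and the tolerance are illustrative, not part of the NEPSE system.

```python
from statistics import mean


def compare_shadow(live: dict, shadow: dict, tolerance: float = 0.01) -> dict:
    """Summarise how closely shadow predictions track the live ones."""
    shared = sorted(set(live) & set(shadow))
    if not shared:
        raise ValueError("no overlapping request IDs to compare")
    diffs = [abs(live[rid] - shadow[rid]) for rid in shared]
    return {
        "compared": len(shared),
        # Requests the shadow never answered (timeouts, errors, dropped mirrors)
        "missing_in_shadow": len(set(live) - set(shadow)),
        "mean_abs_diff": mean(diffs),
        "max_abs_diff": max(diffs),
        # Fraction of requests where the two models agree within tolerance
        "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
    }
```

Agreement with the old model is only one signal; where actual prices later become available, both sets of predictions can be scored against them in the same way.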

---

## 43.2 Containerization

Containerization packages a model and its dependencies into a lightweight, portable unit that runs consistently across different environments. Docker is the most popular container platform.

### 43.2.1 Docker Fundamentals

A Docker **image** contains everything needed to run an application: code, runtime, system tools, libraries, settings. A **container** is a running instance of an image. Images are built from a **Dockerfile**.

For our NEPSE prediction API (built with FastAPI, for example), we need a Dockerfile that:

- Starts from a base Python image.
- Copies the application code and the trained model.
- Installs dependencies.
- Exposes the port on which the API listens.
- Defines the command to run the application.

**Example Dockerfile:**

```dockerfile
# Use official Python image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy requirements first (for better layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY app/ ./app/
COPY models/ ./models/

# Expose port 8000
EXPOSE 8000

# Command to run the FastAPI app with uvicorn
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

**Explanation:**  
- We use a slim Python image to keep the image size small.  
- Copying `requirements.txt` first allows Docker to cache the `pip install` layer; if only the code changes, we don’t reinstall dependencies.  
- The final `CMD` launches the FastAPI application using Uvicorn, binding to all interfaces so that it can be reached from outside the container.

**Building and running:**

```bash
docker build -t nepse-predictor:v1 .
docker run -p 8000:8000 nepse-predictor:v1
```

Now the API is accessible at `http://localhost:8000`.

### 43.2.2 Docker Compose

When our system consists of multiple containers (e.g., prediction API, database, message queue), we can use **Docker Compose** to define and run them together.

**Example `docker-compose.yml` for NEPSE system:**

```yaml
version: '3.8'

services:
  predictor:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_HOST=redis
      - MODEL_PATH=/app/models/nepse_model.pkl
    depends_on:
      - redis
    volumes:
      - ./models:/app/models  # mount for live model updates

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

volumes:
  redis_data:
```

**Explanation:**  
- The `predictor` service is built from the current directory (using the Dockerfile above).  
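- It depends on `redis`, so Compose starts the Redis container first. Note that `depends_on` only orders startup; it does not wait for Redis to be ready to accept connections unless a health check condition is configured.  
- Environment variables configure the predictor to connect to Redis.  
- A volume mounts the local `models` directory into the container (hiding the copy baked into the image), allowing us to update the model file without rebuilding the image. The running process still has to reload the file or be restarted to pick up the change.  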
- The Redis service uses a named volume to persist data.

Running `docker compose up` (or `docker-compose up` with the legacy standalone binary) starts both containers, and they can communicate via their service names (`redis` resolves to the Redis container).
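
Because Compose's `depends_on` controls start order rather than readiness, the predictor may come up before Redis accepts connections. One way to handle this inside the application is to wait for the TCP port before continuing startup. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, and an open port is only an approximation of "Redis is ready".

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Retry a TCP connection until it succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

At startup the predictor could call `wait_for_port(os.environ.get("REDIS_HOST", "redis"), 6379)` and exit with an error if it returns `False`, letting the restart policy try again.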

### 43.2.3 Best Practices for Containerizing ML Models

- **Keep images small**: Use slim base images, multi‑stage builds, and remove unnecessary packages. Smaller images reduce pull times and attack surface.
- **Never bake secrets**: Use environment variables or secret management tools (e.g., Docker secrets, Kubernetes secrets) for API keys, database passwords.
- **Version your models**: Model files should be versioned and possibly pulled from a model registry at runtime, not baked into the image (unless the model rarely changes).
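- **Health checks**: Define `HEALTHCHECK` in the Dockerfile (honoured by Docker and Docker Swarm) or liveness and readiness probes in the orchestrator (Kubernetes ignores the Dockerfile `HEALTHCHECK`) so the platform knows when the container is healthy and ready.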
- **Log to stdout/stderr**: Containers should log to standard output; the orchestrator collects these logs.
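
To make the "version your models" practice concrete, a service can verify the checksum of the model artifact before loading it, so that the file on disk is known to be the version that was approved. This is a minimal sketch: the helper names are hypothetical, and where the expected checksum comes from (a model registry, a release manifest) is an assumption left open here.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file without reading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_sha256: str) -> Path:
    """Return the path if the file matches the expected checksum, else raise."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise RuntimeError(
            f"model checksum mismatch: expected {expected_sha256}, got {actual}"
        )
    return Path(path)
```

The check also guards against loading a tampered pickle file, which matters because unpickling executes code from the file.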

---

## 43.3 Orchestration

When you have multiple containers running across several machines, you need an **orchestrator** to manage them: deploy, scale, network, and heal automatically.

### 43.3.1 Kubernetes

Kubernetes (K8s) is the de facto standard for container orchestration. It provides:

- **Pods**: The smallest deployable units (one or more containers).
- **Deployments**: Declarative updates for Pods (supports rolling updates and rollbacks).
- **Services**: Stable network endpoints to access a set of Pods.
- **Ingress**: External access, often with load balancing and SSL termination.
- **ConfigMaps and Secrets**: Configuration and sensitive data.
- **Horizontal Pod Autoscaler**: Automatically scales the number of Pods based on CPU/memory or custom metrics.

**Deploying the NEPSE predictor to Kubernetes:**

First, we create a **Deployment** manifest:

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nepse-predictor-v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nepse-predictor
      version: v1
  template:
    metadata:
      labels:
        app: nepse-predictor
        version: v1
    spec:
      containers:
      - name: predictor
        image: myregistry/nepse-predictor:v1
        ports:
        - containerPort: 8000
        env:
        - name: REDIS_HOST
          value: redis-service
        - name: MODEL_PATH
          value: /app/models/nepse_model.pkl
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
```

**Explanation:**  
- The Deployment ensures three replicas are always running.  
- The container image is pulled from `myregistry`.  
- Environment variables configure the connection to Redis (via a Service named `redis-service`).  
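- Resource requests tell the scheduler how much CPU and memory to reserve for each Pod; limits cap usage, so a container that exceeds its memory limit is killed and one that exceeds its CPU limit is throttled.  
- The liveness probe restarts a container that stops responding on `/health`; the readiness probe keeps a Pod out of the Service's endpoints until `/ready` succeeds (for example, once the model has been loaded). The application must implement both endpoints.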
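
The two probe paths need matching endpoints in the application. The sketch below shows the intended behaviour using Python's standard-library `http.server` as a dependency-free stand-in; in the FastAPI app the same logic would sit behind two GET routes. The `MODEL_READY` flag and `make_server` helper are hypothetical names.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MODEL_READY = threading.Event()  # set by the app once the model is loaded


class ProbeHandler(BaseHTTPRequestHandler):
    def _send(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and able to answer requests
            self._send(200, {"status": "alive"})
        elif self.path == "/ready":
            # Readiness: only accept traffic once the model is available
            if MODEL_READY.is_set():
                self._send(200, {"status": "ready"})
            else:
                self._send(503, {"status": "loading"})
        else:
            self._send(404, {"error": "not found"})


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    """Create a server that answers the liveness and readiness probes."""
    return ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler)
```

Keeping the two checks separate means a slow model load delays traffic (readiness fails) without causing Kubernetes to restart the container (liveness still succeeds).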

Next, a **Service** to expose the Deployment internally:

```yaml
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nepse-predictor
spec:
  selector:
    app: nepse-predictor
    version: v1
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
```

For external access, an **Ingress**:

```yaml
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nepse-predictor-ingress
spec:
  rules:
  - host: predict.nepse.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nepse-predictor
            port:
              number: 80
```

**Deploying with `kubectl`:**

```bash
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
```

Kubernetes will pull the image, start the Pods, and expose them through the Service and Ingress.

### 43.3.2 Docker Swarm

Docker Swarm is Docker’s native clustering and orchestration solution. It is simpler to set up than Kubernetes but offers fewer features. It uses the same Docker Compose file format with some extensions.

**Example stack file for NEPSE predictor:**

```yaml
version: '3.8'

services:
  predictor:
    image: myregistry/nepse-predictor:v1
    ports:
      - "8000:8000"
    environment:
      - REDIS_HOST=redis
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
    networks:
      - nepse-net

  redis:
    image: redis:alpine
    networks:
      - nepse-net

networks:
  nepse-net:
```

Deploy with `docker stack deploy -c stack.yml nepse`.

### 43.3.3 Cloud Orchestration

All major cloud providers offer managed Kubernetes services:

- **Amazon EKS** (Elastic Kubernetes Service)
- **Google GKE** (Google Kubernetes Engine)
- **Azure AKS** (Azure Kubernetes Service)

These services handle the control plane (master nodes) for you, making it easier to run Kubernetes without managing the underlying infrastructure.

---

## 43.4 Serverless Deployment

Serverless platforms allow you to run code without provisioning or managing servers. They automatically scale to zero when idle, which can be cost‑effective for sporadic workloads. However, they typically impose cold‑start latencies, which may be unacceptable for real‑time predictions.

**Options for serverless ML:**

- **AWS Lambda** with container support (up to 10 GB memory, 15 min timeout).
- **Google Cloud Run** – runs stateless containers, scales automatically, pay‑per‑request.
- **Azure Functions** – Azure’s function‑as‑a‑service offering, comparable to AWS Lambda.

**NEPSE Example with Cloud Run:**

Package your FastAPI app in a container as before. Then deploy to Cloud Run:

```bash
gcloud builds submit --tag gcr.io/myproject/nepse-predictor
gcloud run deploy nepse-predictor --image gcr.io/myproject/nepse-predictor --platform managed --allow-unauthenticated
```

Cloud Run will provide an HTTPS endpoint. It scales from zero to many instances based on traffic. However, if a request arrives when no instance is running, there is a cold start (typically a few hundred milliseconds to a few seconds, and longer if a large model must be loaded at startup). For many prediction workloads, this is acceptable; for high‑frequency trading, it is not.
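
Part of the cold-start cost is loading the model itself. A common mitigation is to load the model once per container instance and reuse it for every subsequent request, so that only the first request on a new instance pays the loading time. The sketch below assumes the model is stored as a plain pickle file; for a model saved with joblib, `joblib.load(path)` would replace the `pickle.load` call. The function name `get_model` is hypothetical.

```python
import functools
import pickle


@functools.lru_cache(maxsize=1)
def get_model(path: str):
    """Load the model on first use and cache it for later requests.

    Only unpickle files from a trusted source: unpickling can run arbitrary code.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```

A request handler would then call `get_model(os.environ["MODEL_PATH"])` on each request; after the first call the cached object is returned immediately.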

---

## 43.5 Edge Deployment

Edge deployment runs the model on devices close to the data source (e.g., on a broker’s trading workstation, or on an IoT device). Benefits include ultra‑low latency and offline operation. Challenges include limited compute resources and the difficulty of distributing model updates to many devices.

For NEPSE, edge deployment might mean running a lightweight model on a trader’s laptop that makes predictions based on local data, or on a Raspberry Pi at a branch office.

**Tools for edge ML:**

- **TensorFlow Lite** – optimised for mobile and embedded devices.
- **ONNX Runtime** – cross‑platform inference.
- **AWS IoT Greengrass** – runs Lambda functions on edge devices.
- **Azure IoT Edge** – runs containerized workloads on edge devices.

**Example: Converting an XGBoost model to a format that runs on a Raspberry Pi using ONNX:**

```python
import joblib
from onnxmltools.convert import convert_xgboost
from skl2onnx.common.data_types import FloatTensorType

# Load the trained XGBoost model (only unpickle files from a trusted source)
model = joblib.load('nepse_xgboost.pkl')

# Convert to ONNX; the input is a float tensor with a dynamic batch size and 4 features
initial_type = [('float_input', FloatTensorType([None, 4]))]
onnx_model = convert_xgboost(model, initial_types=initial_type)

# Save
with open("nepse_xgboost.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```

On the edge device, you can load the ONNX model and run inference using ONNX Runtime.
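
A minimal sketch of that inference step follows. It assumes `onnxruntime` and `numpy` are installed on the device and that the model was exported with four input features as above; the helper names are hypothetical, and what the first output contains (predicted values or class labels) depends on the kind of model that was converted.

```python
def validate_features(rows, n_features: int = 4):
    """Check that every row has n_features values and return them as floats."""
    cleaned = []
    for i, row in enumerate(rows):
        if len(row) != n_features:
            raise ValueError(
                f"row {i} has {len(row)} features, expected {n_features}"
            )
        cleaned.append([float(v) for v in row])
    return cleaned


def load_session(model_path: str = "nepse_xgboost.onnx"):
    """Create the inference session once and reuse it for every prediction."""
    import onnxruntime as ort  # imported lazily: only needed on the device
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def predict(session, rows):
    """Run a batch of feature rows through the ONNX model."""
    import numpy as np
    input_name = session.get_inputs()[0].name  # 'float_input' in the export above
    batch = np.asarray(validate_features(rows), dtype=np.float32)
    outputs = session.run(None, {input_name: batch})
    return outputs[0]
```

Inputs must be `float32` to match the `FloatTensorType` declared at conversion time, which is why the batch is cast explicitly.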

---

## 43.6 Hybrid Deployment

A hybrid deployment combines multiple strategies. For example, you might use Kubernetes for the core prediction service but serverless for batch processing, or edge for real‑time alerts and cloud for retraining.

For NEPSE, a hybrid approach could be:

- **Edge** on traders’ workstations for ultra‑low latency signals.
- **Cloud** (Kubernetes) for the main API that powers dashboards and mobile apps.
- **Serverless** for ad‑hoc backtesting requests.

---

## 43.7 Deployment Pipelines

A deployment pipeline automates the steps from code commit to production. For ML, this includes training, validation, packaging, deployment, and monitoring. Tools like Jenkins, GitLab CI, GitHub Actions, and Argo CD help orchestrate these steps.

**Example GitHub Actions workflow for deploying to Kubernetes:**

```yaml
name: Deploy to Kubernetes

on:
  push:
    branches: [ main ]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
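    - uses: actions/checkout@v4

    # Assumes registry credentials are configured, e.g. by adding a docker/login-action step before this one
    - name: Build Docker image
      run: docker build -t myregistry/nepse-predictor:${{ github.sha }} .

    - name: Push to registry
      run: docker push myregistry/nepse-predictor:${{ github.sha }}

    - name: Set up kubectl
      uses: azure/setup-kubectl@v4
      with:
        version: 'latest'

    # Assumes cluster credentials (a kubeconfig) are already available to kubectl
    # The Deployment name matches the manifest in Section 43.3.1
    - name: Deploy to Kubernetes
      run: |
        kubectl set image deployment/nepse-predictor-v1 predictor=myregistry/nepse-predictor:${{ github.sha }}
        kubectl rollout status deployment/nepse-predictor-v1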
```

**Explanation:**  
- On every push to main, the workflow builds a Docker image tagged with the commit SHA.  
- It pushes the image to a container registry.  
- It updates the Kubernetes deployment with the new image using `kubectl set image`.  
- `rollout status` waits for the rollout to complete.

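This pipeline implements a **continuous deployment** strategy where every commit to `main` goes to production. As written, the workflow runs no tests, so in practice you would add test and model‑validation steps before the deploy step. For more control, you might also add a manual approval step.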
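
A model-validation step can be a small script whose exit code decides whether the pipeline continues. The sketch below compares the candidate model's mean absolute error with the production model's on the same hold-out data and fails if the candidate is more than a chosen margin worse. The function names and the 2% margin are illustrative assumptions, not values taken from the NEPSE project.

```python
def mean_absolute_error(actual, predicted) -> float:
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def passes_gate(candidate_mae: float, production_mae: float,
                max_regression: float = 0.02) -> bool:
    """Allow deployment if the candidate is at most max_regression worse."""
    return candidate_mae <= production_mae * (1.0 + max_regression)


def gate_exit_code(actual, candidate_pred, production_pred) -> int:
    """Return 0 if the candidate may be deployed, 1 otherwise."""
    candidate = mean_absolute_error(actual, candidate_pred)
    production = mean_absolute_error(actual, production_pred)
    return 0 if passes_gate(candidate, production) else 1
```

Calling `sys.exit(gate_exit_code(...))` from a workflow step placed before the deploy step stops the job when the gate fails, because a non-zero exit code fails a `run` step.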

---

## 43.8 Rollback Strategies

No matter how careful you are, a new model version may introduce errors. You must be able to roll back quickly.

**Rollback mechanisms:**

- **Kubernetes rollback**: `kubectl rollout undo deployment/nepse-predictor-v1` reverts the Deployment defined in Section 43.3.1 to its previous revision.
- **Blue‑green**: Simply switch the router back to blue.
- **Canary**: Reduce the weight of the new version to zero and shift all traffic back to the old version.
- **Database rollbacks**: If the new model writes predictions to a database, you might need to mark or delete them. Ideally, model writes are idempotent and easily reversible.

**Testing rollbacks:** Practice them in a staging environment. Measure the time to revert; it should be within your service level objectives.
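
Both the decision to roll back and the time it takes can be made explicit in code. The following is a minimal sketch with hypothetical names and thresholds: it triggers a rollback when the new version's error rate exceeds a multiple of the baseline, but only after enough requests have been observed, and it times whatever rollback action is passed in.

```python
import time


def should_roll_back(errors: int, requests: int, baseline_rate: float,
                     min_requests: int = 200, max_ratio: float = 2.0) -> bool:
    """Decide whether the new version's error rate justifies a rollback."""
    if requests < min_requests:
        return False  # too little traffic to judge reliably
    return errors / requests > baseline_rate * max_ratio


def timed(action) -> float:
    """Run a rollback action and return how long it took, in seconds."""
    start = time.monotonic()
    action()
    return time.monotonic() - start
```

In a staging drill, `action` could wrap the `kubectl rollout undo` call in `subprocess.run(..., check=True)`, and the returned duration can be compared against the recovery time allowed by your service level objectives.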

---

## Chapter Summary

In this chapter, we covered the essential strategies and technologies for deploying machine learning models into production, using the NEPSE prediction system as a running example.

- We explored **deployment patterns** (canary, blue‑green, shadow) that allow safe introduction of new model versions.
- We learned to **containerize** our model with Docker, creating portable images that run anywhere.
- We used **Docker Compose** to orchestrate multi‑container applications locally.
- We dived into **Kubernetes** for production‑grade orchestration, with examples of Deployments, Services, and Ingress.
- We touched on **serverless** and **edge** deployment options for different latency and cost requirements.
- We designed a **deployment pipeline** using GitHub Actions to automate builds and deployments.
- Finally, we discussed **rollback strategies** to recover from failures.

With these tools and patterns, you can deploy your NEPSE model—or any time‑series prediction model—reliably and safely. The next chapter will cover **Monitoring and Observability**, ensuring that once deployed, you can keep track of your model’s health and performance in real time.

---

**End of Chapter 43**

