# Chapter 60: AI/ML Model Deployment

Machine learning systems fundamentally differ from traditional software in one critical dimension: they depend on data, not just code. While conventional CI/CD pipelines version source code and compile binaries, ML pipelines must version datasets, track experiments, manage model artifacts, and validate probabilistic outputs against evolving data distributions. Deploying a model involves not merely pushing a container image, but ensuring that training data lineage, feature engineering logic, and model weights remain synchronized across development, staging, and production environments. This chapter examines MLOps (Machine Learning Operations)—the extension of DevOps practices to machine learning—covering how to automate model training, serve predictions at scale, and maintain model performance through continuous monitoring and automated retraining.

## 60.1 ML Pipelines

ML pipelines orchestrate the end-to-end lifecycle of machine learning models, from data ingestion through training, validation, and deployment. Unlike traditional build pipelines that compile deterministic code, ML pipelines handle stochastic processes where the same code may produce different models when trained on different data snapshots.

### The MLOps Lifecycle

**Continuous Integration (CI) for ML:**
Extends beyond code testing to include:
- Data validation (schema checks, drift detection)
- Feature engineering pipeline testing
- Model unit tests (architecture validation, gradient flow checks)
- Integration tests between model and serving infrastructure

**Continuous Training (CT):**
A stage unique to ML, in which models retrain automatically when:
- New labeled data arrives (batch retraining)
- Data drift exceeds a threshold (triggered retraining)
- A scheduled interval elapses (nightly or weekly retraining)
- Performance degradation is detected in production
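
The four triggers above can be combined into a single decision function. The following is a minimal sketch rather than a prescribed implementation: the `RetrainSignals` fields, the threshold defaults, and the reason labels are all illustrative assumptions, and a real system would read these signals from its monitoring stack.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RetrainSignals:
    new_labeled_rows: int       # labeled rows received since the last training run
    drift_score: float          # any drift statistic, e.g. PSI (hypothetical input)
    last_trained_at: datetime   # when the current model was trained
    live_accuracy: float        # accuracy on recent labeled production traffic


def retrain_reasons(
    s: RetrainSignals,
    now: datetime,
    min_new_rows: int = 10_000,            # assumed threshold
    drift_threshold: float = 0.2,          # assumed threshold
    max_age: timedelta = timedelta(days=7),
    min_accuracy: float = 0.85,            # assumed threshold
) -> list[str]:
    """Return the triggers that fired; an empty list means no retraining is needed."""
    reasons = []
    if s.new_labeled_rows >= min_new_rows:
        reasons.append("new_data")
    if s.drift_score > drift_threshold:
        reasons.append("drift")
    if now - s.last_trained_at >= max_age:
        reasons.append("schedule")
    if s.live_accuracy < min_accuracy:
        reasons.append("performance")
    return reasons
```

Returning the reasons instead of a bare boolean lets the pipeline record why a retraining run started, which helps when auditing model lineage later.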

**Continuous Delivery (CD) for ML:**
Deploys not just code, but model artifacts, feature stores, and inference configurations. This includes canary deployments where a small percentage of traffic routes to the new model while monitoring for prediction quality degradation.

**Continuous Monitoring:**
Tracks model performance metrics (accuracy, precision, recall, F1) alongside infrastructure metrics (latency, throughput). Unlike software monitoring which focuses on availability, ML monitoring watches for concept drift—when the statistical relationship between inputs and outputs changes over time.

### Pipeline Architecture

A production ML pipeline typically consists of:

**Data Extraction:**
Pulls raw data from warehouses (Snowflake, BigQuery), lakes (S3, GCS), or streaming sources (Kafka, Kinesis). Versioning occurs at this stage using tools like DVC (Data Version Control) or LakeFS.

**Data Validation:**
Checks for schema skew (new columns, type changes), distribution shifts (training-serving skew), and missing values. TensorFlow Data Validation (TFDV) or Great Expectations validate data against predefined schemas.
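
To make the idea concrete, here is a small standard-library sketch of the kind of checks that TFDV or Great Expectations automate. It is not either library's API: the `EXPECTED_SCHEMA` columns, the `validate_rows` helper, and the 5% missing-value limit are assumptions chosen for illustration.

```python
# Hypothetical schema for a churn dataset: column name -> expected Python type
EXPECTED_SCHEMA = {"customer_id": int, "tenure_months": int, "monthly_charges": float}


def validate_rows(rows, schema, max_missing_fraction=0.05):
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    missing_counts = {col: 0 for col in schema}

    for i, row in enumerate(rows):
        # Schema check: columns that the schema does not know about
        extra = set(row) - set(schema)
        if extra:
            problems.append(f"row {i}: unexpected columns {sorted(extra)}")

        for col, expected_type in schema.items():
            value = row.get(col)
            if value is None:
                missing_counts[col] += 1
            elif not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {col} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )

    # Missing-value check: fail a column when too many values are absent
    for col, count in missing_counts.items():
        if rows and count / len(rows) > max_missing_fraction:
            problems.append(f"{col}: {count}/{len(rows)} values missing")

    return problems
```

A pipeline step would typically fail the run when the returned list is non-empty, so that bad data never reaches training.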

**Feature Engineering:**
Transforms raw data into features suitable for model consumption. In production pipelines, feature engineering must be consistent between training and serving to prevent training-serving skew.
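
One common way to keep the two paths consistent is to put the feature logic in a single function that both the training job and the serving code import. The sketch below assumes a churn-style record with `tenure_months`, `monthly_charges`, and `contract` fields; these names and the specific transforms are illustrative only.

```python
import math


def build_features(raw: dict) -> list[float]:
    """Single source of truth for feature logic, shared by training and serving."""
    tenure = max(raw["tenure_months"], 0)
    charges = raw["monthly_charges"]
    return [
        float(tenure),
        math.log1p(charges),                            # compress the long tail of charges
        charges / (tenure + 1),                         # spend relative to tenure
        1.0 if raw["contract"] == "monthly" else 0.0,   # one-hot style flag
    ]


def build_training_matrix(records: list[dict]) -> list[list[float]]:
    """Training path: apply the shared logic to every historical record."""
    return [build_features(r) for r in records]


def features_for_request(payload: dict) -> list[list[float]]:
    """Serving path: apply the same logic to a single request payload."""
    return [build_features(payload)]
```

Because both paths call `build_features`, a change to a transform cannot reach training without also reaching serving.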

**Model Training:**
Executes training algorithms, tracking hyperparameters and metrics. Distributed training across multiple GPUs requires specific orchestration (Horovod, Ray Train, or PyTorch DDP).

**Model Evaluation:**
Validates against holdout datasets and custom metrics. Includes fairness checks (demographic parity, equalized odds) and bias detection.
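
The two fairness checks named above can be computed directly from binary predictions, labels, and a group attribute. This is a simplified sketch under the assumption of 0/1 predictions and labels; the function names are illustrative, and dedicated fairness libraries offer more complete implementations.

```python
def positive_rate(preds):
    """Fraction of predictions equal to 1."""
    return sum(preds) / len(preds) if preds else 0.0


def demographic_parity_difference(preds, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    by_group = {}
    for p, g in zip(preds, groups):
        by_group.setdefault(g, []).append(p)
    rates = [positive_rate(v) for v in by_group.values()]
    return max(rates) - min(rates) if rates else 0.0


def equalized_odds_difference(preds, labels, groups):
    """Largest gap in true-positive rate or false-positive rate between groups."""
    gaps = []
    for outcome in (1, 0):  # label 1 gives TPR, label 0 gives FPR
        by_group = {}
        for p, y, g in zip(preds, labels, groups):
            if y == outcome:
                by_group.setdefault(g, []).append(p)
        rates = [positive_rate(v) for v in by_group.values()]
        if rates:
            gaps.append(max(rates) - min(rates))
    return max(gaps, default=0.0)
```

An evaluation step could fail the pipeline when either value exceeds a limit agreed with the model's stakeholders.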

**Model Packaging:**
Bundles model weights, preprocessing code, and dependencies into deployable artifacts (Docker images, SavedModel format, ONNX, or MLflow models).

**Deployment:**
Pushes models to serving infrastructure with traffic management strategies (A/B testing, shadow deployments).

### CI/CD Integration Example

```python
# Kubeflow Pipelines definition (Python DSL, KFP SDK v1 API)
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

@create_component_from_func
def data_validation(input_path: str, schema_path: str) -> str:
    import tensorflow_data_validation as tfdv
    
    # Compute statistics; the pipeline passes a CSV file, so use the CSV reader
    stats = tfdv.generate_statistics_from_csv(input_path)
    
    # Load schema
    schema = tfdv.load_schema_text(schema_path)
    
    # Validate
    anomalies = tfdv.validate_statistics(stats, schema)
    
    if len(anomalies.anomaly_info) > 0:
        raise ValueError(f"Data anomalies detected: {anomalies}")
    
    return "Validation passed"

@create_component_from_func
def train_model(
    data_path: str,
    model_params: dict,
    output_path: str
) -> str:
    import xgboost as xgb
    import pandas as pd
    import pickle
    from sklearn.model_selection import train_test_split
    
    # Load data
    df = pd.read_csv(data_path)
    X = df.drop('target', axis=1)
    y = df['target']
    
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    
    # Train; since XGBoost 2.0, early_stopping_rounds is a constructor argument,
    # not a fit() argument
    model = xgb.XGBClassifier(**model_params, early_stopping_rounds=10)
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)]
    )
    
    # Save; built-in open() only handles local paths, so a gs:// output_path
    # needs a GCS-aware client
    with open(output_path, 'wb') as f:
        pickle.dump(model, f)
    
    return output_path

@create_component_from_func
def evaluate_model(
    model_path: str,
    test_data_path: str,
    threshold: float
) -> bool:
    import pickle
    import pandas as pd
    from sklearn.metrics import accuracy_score
    
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    
    df = pd.read_csv(test_data_path)
    X = df.drop('target', axis=1)
    y = df['target']
    
    preds = model.predict(X)
    accuracy = accuracy_score(y, preds)
    
    if accuracy < threshold:
        raise ValueError(f"Accuracy {accuracy} below threshold {threshold}")
    
    return True

@dsl.pipeline(
    name='Churn Prediction Pipeline',
    description='End-to-end training pipeline'
)
def churn_pipeline(
    data_path: str = 'gs://bucket/data/train.csv',
    schema_path: str = 'gs://bucket/schema/schema.pbtxt',
    model_params: dict = {'max_depth': 6, 'n_estimators': 100}
):
    # Data validation
    validation_op = data_validation(data_path, schema_path)
    
    # Training
    train_op = train_model(
        data_path=data_path,
        model_params=model_params,
        output_path='gs://bucket/models/model.pkl'
    ).after(validation_op)
    
    # Evaluation
    eval_op = evaluate_model(
        model_path=train_op.output,
        test_data_path='gs://bucket/data/test.csv',
        threshold=0.85
    ).after(train_op)
    
    # Conditional deployment
    with dsl.Condition(eval_op.output == True):
        deploy_op = dsl.ContainerOp(
            name='deploy-model',
            image='gcr.io/project/deployer:latest',
            arguments=['--model-path', train_op.output]
        )

# Compile and run
kfp.compiler.Compiler().compile(churn_pipeline, 'pipeline.yaml')
```

## 60.2 Model Training in CI

Integrating model training into CI requires managing computational resources, tracking experiments, and ensuring reproducibility across runs.

### Resource Management

**GPU Scheduling:**
Training deep learning models requires GPU resources, which Kubernetes manages via device plugins:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    image: gcr.io/project/ml-trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 4  # Request 4 GPUs
      requests:
        nvidia.com/gpu: 4
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory  # Shared memory for multiprocessing
      sizeLimit: 10Gi
  nodeSelector:
    node-type: gpu-node  # Schedule on GPU-enabled nodes
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

**Distributed Training:**
For large models (LLMs, computer vision), distribute training across multiple nodes:

```python
# PyTorch Distributed Data Parallel (DDP); launch with torchrun, which sets LOCAL_RANK
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

def train():
    local_rank = setup()
    # MyModel, dataloader, criterion, optimizer, and epochs are placeholders
    # defined elsewhere; the dataloader should use a DistributedSampler
    model = MyModel().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    
    # Training loop
    for epoch in range(epochs):
        for batch in dataloader:
            inputs, labels = batch
            inputs = inputs.to(local_rank)
            labels = labels.to(local_rank)
            
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    
    # Release the process group once training finishes
    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```

**Spot/Preemptible Instances:**
Reduce training costs by using spot instances with checkpointing:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-training
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"  # GKE spot nodes
      containers:
      - name: trainer
        image: trainer:latest
        command: ["python", "train.py", "--resume-from-checkpoint"]
        volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: training-checkpoints
      restartPolicy: OnFailure
```
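
The Job above only helps if `train.py` can actually resume. The following is a minimal standard-library sketch of that pattern: the JSON checkpoint format, the `run_epoch` callback, and the function names are assumptions for illustration, and a real training script would also persist model and optimizer state.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a preemption mid-write cannot leave a corrupt file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state when no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def train(checkpoint_path, total_epochs, run_epoch):
    """Resume from the last completed epoch and checkpoint after each one."""
    state = load_checkpoint(checkpoint_path)
    for epoch in range(state["epoch"], total_epochs):
        run_epoch(epoch)  # the process may be preempted during this call
        save_checkpoint(checkpoint_path, {"epoch": epoch + 1})
```

When the spot node is reclaimed and the Job restarts the container, `train` picks up at the first epoch that was not checkpointed, so at most one epoch of work is lost.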

### Experiment Tracking

Track hyperparameters, metrics, and artifacts using MLflow or Weights & Biases:

```python
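import mlflow
import mlflow.sklearn
import xgboost as xgb
from sklearn.metrics import precision_score

# X_train, y_train, X_test, and y_test are assumed to be loaded elsewhere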

def train_with_tracking(params):
    mlflow.set_experiment("churn-prediction")
    
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params(params)
        
        # Train
        model = xgb.XGBClassifier(**params)
        model.fit(X_train, y_train)
        
        # Evaluate
        preds = model.predict(X_test)
        accuracy = model.score(X_test, y_test)
        
        # Log metrics
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("precision", precision_score(y_test, preds))
        
        # Log model
        mlflow.sklearn.log_model(
            model, 
            "model",
            registered_model_name="churn-model"
        )
        
        # Log artifacts (confusion matrix, ROC curve)
        mlflow.log_artifact("confusion_matrix.png")

# Hyperparameter sweep
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1],
    'n_estimators': [100, 200]
}

for params in ParameterGrid(param_grid):
    train_with_tracking(params)
```

### Data Versioning

Datasets change as often as code but are usually too large to store in Git directly. Use DVC (Data Version Control) or Git LFS to version data alongside code:

```bash
# Initialize DVC
dvc init

# Track dataset
dvc add data/train.csv

# Push to remote storage (S3, GCS, Azure)
dvc remote add -d storage s3://mybucket/dvcstore
dvc push

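# Commit metadata to git (dvc add writes data/train.csv.dvc and data/.gitignore;
# dvc remote add updates .dvc/config)
git add data/train.csv.dvc data/.gitignore .dvc/config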
git commit -m "Add version 2.0 of training data"

# Checkout specific version
git checkout v1.0
dvc checkout  # Restores data matching that commit
```

**Pipeline Versioning:**
```yaml
# dvc.yaml
stages:
  data_preprocessing:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw.csv
    outs:
      - data/processed.csv
  
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/processed.csv
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```

## 60.3 Model Serving

Model serving makes trained models available for prediction, either as HTTP/gRPC endpoints for real-time inference or as batch jobs for offline predictions.

### Serving Patterns

**Real-time Inference (Online):**
Low-latency (milliseconds) predictions via REST/gRPC APIs. Suitable for fraud detection, recommendation systems, and search ranking.

**Batch Inference (Offline):**
Process large datasets periodically, writing predictions to databases or files. Suitable for churn prediction, demand forecasting, and report generation.

**Streaming Inference:**
Process events in real-time using Kafka/Kinesis streams with stateful windowing.

### Model Serving Architectures

**Model-as-Container:**
Package model and inference code in containers:

```python
# predict.py
import os
import pickle
from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)

# Load model at startup; MODEL_PATH is set in the Dockerfile
with open(os.environ.get('MODEL_PATH', '/models/model.pkl'), 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)
    probability = model.predict_proba(features).tolist()
    
    return jsonify({
        'prediction': int(prediction[0]),
        'probability': probability
    })

@app.route('/health')
def health():
    return jsonify({'status': 'healthy'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```

```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY predict.py .
COPY models/ /models/

ENV MODEL_PATH=/models/model.pkl
EXPOSE 8080

CMD ["python", "predict.py"]
```

**Model Servers (TF Serving, TorchServe, MLflow):**
Dedicated high-performance servers optimized for model serving:

```yaml
# TensorFlow Serving deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/serving:latest
        ports:
        - containerPort: 8501  # REST API
        - containerPort: 8500  # gRPC
        volumeMounts:
        - name: models
          mountPath: /models
        env:
        - name: MODEL_NAME
          value: "churn_model"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-storage
---
apiVersion: v1
kind: Service
metadata:
  name: tf-serving
spec:
  selector:
    app: tf-serving
  ports:
  - name: rest
    port: 8501
    targetPort: 8501
  - name: grpc
    port: 8500
    targetPort: 8500
```

### Scaling Strategies

**Horizontal Pod Autoscaling (HPA):**
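Scale on inference request latency or queue depth. Custom metrics such as the one below must be exposed to the HPA through a metrics adapter (for example, Prometheus Adapter):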

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_latency_p99
      target:
        type: AverageValue
        averageValue: "100m"  # 0.1 in the metric's unit; 100ms if the metric is reported in seconds
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
```
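
The scaling decision itself follows the formula documented for the HPA: the desired replica count is the current count multiplied by the ratio of the observed metric to the target, rounded up. The sketch below reproduces that calculation with the bounds from the manifest above. It is a simplification: it applies the default 10% tolerance but ignores the `behavior` policies and stabilization windows, and the function name is illustrative.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 2,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA calculation: ceil(current * observed / target)."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count unchanged
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas and maxReplicas
    return max(min_replicas, min(max_replicas, desired))
```

For example, with 4 replicas, an observed p99 latency of 250 ms, and a 100 ms target, the ratio is 2.5 and the function returns 10.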

**GPU Autoscaling:**
Use Kubernetes Event-driven Autoscaling (KEDA) for queue-based scaling:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference-scaler
spec:
  scaleTargetRef:
    name: gpu-inference
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-cluster:9092
      consumerGroup: inference-group
      topic: inference-requests
      lagThreshold: "100"
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
```

## 60.4 Kubernetes for ML

Kubernetes provides the orchestration layer for ML workloads but requires specific configurations for GPU support, high-throughput networking, and storage optimization.

### GPU Support

**NVIDIA GPU Operator:**
Automates GPU driver installation, device plugin, and monitoring:

```bash
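# Add the NVIDIA Helm repository, then install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \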
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true
```

**Time-Slicing (GPU Sharing):**
Share a single GPU across multiple pods for inference workloads. Time-slicing provides no memory or fault isolation between the pods that share a GPU:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Advertise each physical GPU as 4 schedulable replicas
---
# Pod requesting shared GPU
apiVersion: v1
kind: Pod
metadata:
  name: inference-1
spec:
  containers:
  - name: model
    image: inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # One time-sliced replica, not a guaranteed quarter of the GPU
```

### Storage for ML

**High-Performance PVCs:**
Training requires high-throughput storage for large datasets:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadOnlyMany  # Shared across training nodes
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi
  volumeMode: Filesystem
---
# Pod mounting data
apiVersion: v1
kind: Pod
metadata:
  name: data-loader
spec:
  containers:
  - name: loader
    image: downloader:latest
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: training-data
```

**Cache Warming:**
Use Fluid or Alluxio to cache remote data locally:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
spec:
  mounts:
  - mountPoint: s3://mybucket/imagenet
    name: imagenet
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: imagenet
spec:
  replicas: 2
  tieredstore:
    levels:
    - mediumtype: SSD
      path: /var/lib/alluxio
      quota: 500Gi
      high: "0.95"
      low: "0.7"
```

## 60.5 Kubeflow

Kubeflow is a Kubernetes-native platform for ML workflows, providing integrated components for pipelines, training, serving, and metadata tracking.

### Architecture Components

- **Pipelines:** Visual DAG editor and execution engine for ML workflows.
- **Notebooks:** Jupyter notebooks integrated with Kubernetes RBAC.
- **Katib:** Hyperparameter tuning and neural architecture search.
- **Training Operator:** Distributed training (TFJob, PyTorchJob, MPIJob).
- **KServe:** Model serving platform with canary rollouts and auto-scaling.

### Installation

```bash
# Install Kubeflow using manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources..."
  sleep 10
done

# Or use a packaged distribution (EKS, GKE, AKS specific) and follow the
# vendor's installation instructions. The kfctl CLI used by early Kubeflow
# releases is deprecated and is not available for Kubeflow 1.8.
```

### Kubeflow Pipelines (KFP)

Define reusable, composable ML workflows:

```python
from kfp import dsl
from kfp.components import load_component_from_url

# Load pre-built components (the URLs below are illustrative; point them at
# real component.yaml files)
data_prep_op = load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/data_prep/component.yaml'
)
train_op = load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/xgboost/train/component.yaml'
)

@dsl.pipeline(
    name='End-to-End Churn Pipeline',
    description='Production pipeline with data validation'
)
def churn_pipeline(
    data_url: str = 's3://bucket/data.csv',
    model_path: str = 's3://bucket/models/',
    threshold: float = 0.85
):
    # Data preparation
    prep_task = data_prep_op(
        input_path=data_url,
        output_path='s3://bucket/processed/'
    )
    
    # Training (hyperparameters are fixed here; Katib could supply tuned values)
    train_task = train_op(
        train_data=prep_task.outputs['train_data'],
        validation_data=prep_task.outputs['val_data'],
        num_boost_round=100,
        max_depth=6
    ).after(prep_task)
    
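    # Model evaluation; file_outputs maps a file written by the container
    # to the named output used in the condition below
    eval_task = dsl.ContainerOp(
        name='evaluate-model',
        image='gcr.io/project/evaluator:latest',
        arguments=[
            '--model-path', train_task.outputs['model_path'],
            '--test-data', prep_task.outputs['test_data'],
            '--threshold', threshold
        ],
        file_outputs={'accuracy': '/tmp/accuracy.txt'}  # assumed path written by the evaluator image
    ).after(train_task)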
    
    # Conditional deployment
    with dsl.Condition(eval_task.outputs['accuracy'] > threshold):
        deploy_task = dsl.ContainerOp(
            name='deploy-model',
            image='gcr.io/project/kserve-deployer:latest',
            arguments=[
                '--model-path', train_task.outputs['model_path'],
                '--model-name', 'churn-model',
                '--namespace', 'production'
            ]
        )

# Compile
import kfp.compiler as compiler
compiler.Compiler().compile(churn_pipeline, 'churn_pipeline.tar.gz')
```

### KServe for Model Serving

Deploy models with advanced serving capabilities:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model
  namespace: production
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      storageUri: s3://bucket/models/churn/v2
      resources:
        requests:
          memory: 2Gi
          cpu: "1"
        limits:
          memory: 4Gi
          cpu: "2"
    minReplicas: 2
    maxReplicas: 10
    containerConcurrency: 100  # Max concurrent requests per pod
  
  transformer:  # Pre/post processing
    containers:
    - name: transformer
      image: gcr.io/project/churn-transformer:latest
      resources:
        requests:
          memory: 1Gi
  
  explainer:  # Model explainability (SHAP, LIME)
    alibi:
      type: ShapTree
      storageUri: s3://bucket/explainers/churn/
```

**Canary Deployment with KServe:**
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model
spec:
  predictor:
    canaryTrafficPercent: 20  # 20% to new version
    model:
      modelFormat:
        name: xgboost
      storageUri: s3://bucket/models/churn/v3
```

## 60.6 MLflow

MLflow provides an open-source platform for the ML lifecycle, focusing on experimentation, reproducibility, and deployment. Unlike Kubeflow's Kubernetes-native approach, MLflow is Python-centric and easier to adopt incrementally.

### Components

- **Tracking:** Log parameters, metrics, and artifacts.
- **Projects:** Package code in reusable, reproducible formats.
- **Models:** Standardize model packaging for diverse deployment targets.
- **Registry:** Centralized model store with versioning and stage transitions.

### Deployment Integration

**Model Registry:**
```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Transition model to production
client.transition_model_version_stage(
    name="churn-model",
    version=3,
    stage="Production"
)

# Load production model
model = mlflow.pyfunc.load_model("models:/churn-model/Production")
```

**CI/CD Pipeline Integration:**
```yaml
# GitHub Actions workflow
name: ML Model CI/CD
on:
  push:
    paths:
      - 'models/**'
      - 'src/**'

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.9'
    
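    - name: Train model
      run: |
        pip install -r requirements.txt
        python train.py --experiment-name churn
        # Assumes train.py writes the MLflow run ID to run_id.txt (hypothetical convention)
        echo "RUN_ID=$(cat run_id.txt)" >> "$GITHUB_ENV"
    
    - name: Evaluate and register
      run: |
        # Steps run with "bash -e", so a failing evaluate.py stops the job before registration
        python evaluate.py --run-id "$RUN_ID"
        python register_model.py --run-id "$RUN_ID" --name churn-model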
    
    - name: Deploy to staging
      run: |
        mlflow models build-docker -m models:/churn-model/latest -n churn-model:staging
        kubectl set image deployment/churn-model model=churn-model:staging -n staging
    
    - name: Integration tests
      run: |
        pytest tests/integration/ --endpoint http://staging-api/model
    
    - name: Promote to production
      if: github.ref == 'refs/heads/main'
      run: |
        mlflow models build-docker -m models:/churn-model/latest -n churn-model:prod
        kubectl set image deployment/churn-model model=churn-model:prod -n production
```

## 60.7 Model Versioning

Model versioning extends beyond semantic versioning to include data lineage, feature schemas, and training configurations.

### Model Registry Patterns

**Semantic Versioning for Models:**
- **Major:** Breaking changes (input schema changes, different algorithm)
- **Minor:** Backward compatible improvements (hyperparameter tuning, more data)
- **Patch:** Bug fixes (correcting label errors, code fixes)
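
These rules are simple enough to automate in a registration script. The sketch below is one possible encoding, with hypothetical flag names (`schema_changed`, `algorithm_changed`, `retrained`) standing in for whatever change detection the pipeline performs.

```python
def next_model_version(
    current: str,
    *,
    schema_changed: bool = False,
    algorithm_changed: bool = False,
    retrained: bool = False,
) -> str:
    """Bump a MAJOR.MINOR.PATCH model version according to the kind of change."""
    major, minor, patch = (int(part) for part in current.split("."))
    if schema_changed or algorithm_changed:
        return f"{major + 1}.0.0"           # breaking change for consumers
    if retrained:
        return f"{major}.{minor + 1}.0"     # compatible improvement
    return f"{major}.{minor}.{patch + 1}"   # fix with no interface change
```

For instance, retraining version `1.4.2` on more data yields `1.5.0`, while changing the input schema yields `2.0.0`.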

**Git-Based Versioning:**
Store model metadata in Git, weights in object storage:

```python
# Tag model with Git commit
import git
import mlflow

repo = git.Repo(search_parent_directories=True)
commit_hash = repo.head.object.hexsha

mlflow.set_tag("git_commit", commit_hash)
mlflow.set_tag("git_branch", repo.active_branch.name)
```

**Data Lineage:**
Track which dataset version trained which model:

```python
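# Using DVC and MLflow together
import dvc.api
import git
import mlflow
import mlflow.sklearn

# Storage URL of the tracked file; it embeds the content hash, so it
# identifies the exact data version
data_version = dvc.api.get_url("data/train.csv")
# Git commit that pins the .dvc metadata files
git_commit = git.Repo(search_parent_directories=True).head.object.hexsha

with mlflow.start_run():
    mlflow.log_param("data_version", data_version)
    mlflow.log_param("dvc_commit", git_commit)
    
    # Train and log model (model, X, and y are defined elsewhere)
    model.fit(X, y)
    mlflow.sklearn.log_model(model, "model")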

### Model Artifacts

Package models with all dependencies:

```yaml
# conda.yaml for MLflow model
name: churn-env
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pip
  - pip:
    - mlflow==2.8.0
    - scikit-learn==1.3.0
    - xgboost==2.0.0
    - cloudpickle==2.2.1
```

## 60.8 A/B Testing Models

A/B testing validates new models against production traffic before full rollout, measuring business metrics (conversion, revenue) rather than just accuracy.

### Implementation Strategies

**Shadow Mode:**
Run the new model alongside the production model, comparing predictions without affecting users:

```python
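# Application logic
import logging
from concurrent.futures import ThreadPoolExecutor

# Background workers keep shadow inference off the request path
shadow_pool = ThreadPoolExecutor(max_workers=4)

def shadow_predict(input_data, prod_result):
    try:
        shadow_result = new_model.predict(input_data)
        log_comparison(input_data, prod_result, shadow_result)
    except Exception:
        # A failing shadow model must never affect production responses
        logging.exception("Shadow prediction failed")

def predict(input_data):
    # Production prediction (prod_model, new_model, and log_comparison
    # are defined elsewhere)
    prod_result = prod_model.predict(input_data)
    
    # Shadow prediction runs in the background; the caller does not wait for it
    shadow_pool.submit(shadow_predict, input_data, prod_result)
    
    return prod_result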
```

**Traffic Splitting:**
Route a percentage of requests to the new model:

```yaml
# Istio VirtualService for model A/B testing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-ab-test
spec:
  hosts:
  - churn-api
  http:
  - match:
    - headers:
        x-model-version:
          exact: "v2"
    route:
    - destination:
        host: churn-api-v2
      weight: 100
  - route:
    - destination:
        host: churn-api-v1
      weight: 90
    - destination:
        host: churn-api-v2
      weight: 10
```
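
Weighted routing splits individual requests, so the same user may see both models across calls. When the experiment measures per-user business metrics, a deterministic assignment is often preferable. The sketch below hashes a user ID into one of 100 buckets; the function name, the experiment key, and the variant labels are illustrative assumptions.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, challenger_percent: int = 10) -> str:
    """Deterministically map a user to 'champion' or 'challenger'."""
    # Salting with the experiment name gives each experiment an independent split
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return "challenger" if bucket < challenger_percent else "champion"
```

The application can then set the `x-model-version` header from the assigned variant, so the header-match rule in the VirtualService above routes the request to the matching backend.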

**Champion/Challenger:**
Continuously evaluate new models (challengers) against the production model (champion), promoting a challenger only when it demonstrates a statistically significant improvement:

```python
from scipy import stats

def evaluate_challenger(champion_preds, challenger_preds, actuals):
    champion_acc = (champion_preds == actuals).mean()
    challenger_acc = (challenger_preds == actuals).mean()
    
    # Paired t-test
    t_stat, p_value = stats.ttest_rel(
        (challenger_preds == actuals).astype(int),
        (champion_preds == actuals).astype(int)
    )
    
    if p_value < 0.05 and challenger_acc > champion_acc:
        return "promote"
    return "retain"
```
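
Because per-example correctness is binary, McNemar's test is a frequently used alternative to the paired t-test for comparing two classifiers on the same examples. The following standard-library sketch computes the exact two-sided version; it is offered as an alternative technique under that assumption, not as the method used above, and the function name is illustrative.

```python
from math import comb


def mcnemar_exact_p(champion_correct, challenger_correct):
    """Two-sided exact McNemar p-value from paired per-example correctness flags."""
    # Only discordant pairs (exactly one model correct) carry information
    only_champion = sum(
        1 for a, b in zip(champion_correct, challenger_correct) if a and not b
    )
    only_challenger = sum(
        1 for a, b in zip(champion_correct, challenger_correct) if b and not a
    )
    n = only_champion + only_challenger
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way

    # Under the null hypothesis each discordant pair favors either model with p = 0.5
    k = min(only_champion, only_challenger)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A promotion rule would combine this p-value with the direction of the difference, as in the function above: promote only when the p-value is below the chosen significance level and the challenger wins more discordant pairs.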

### Monitoring Model Performance

**Drift Detection:**
Monitor for data drift (input distribution changes) and concept drift (relationship changes):

```python
# Evidently AI for drift detection
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)

if report.as_dict()['metrics'][0]['result']['dataset_drift']:
    trigger_retraining()
```
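
To show what a drift statistic measures, here is a dependency-free sketch of the Population Stability Index (PSI) for a single numeric feature. The equal-width binning, the `eps` floor for empty bins, and the function name are implementation assumptions; libraries such as Evidently use their own binning and test selection.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature, binned on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all reference values are equal

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp values outside the reference range
        # Floor at eps so empty bins do not cause log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A commonly cited rule of thumb treats values below 0.1 as stable and values above roughly 0.25 as significant drift, though suitable thresholds depend on the feature and the application.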

**Business Metrics:**
Track downstream business impact, not just model accuracy:

```python
# Log business metrics
mlflow.log_metric("revenue_per_user", revenue)
mlflow.log_metric("conversion_rate", conversions / total_users)
mlflow.log_metric("model_accuracy", accuracy)
```

---

## Chapter Summary and Preview

This chapter explored MLOps, the extension of CI/CD principles to machine learning systems, and addressed the unique challenges posed by data dependencies, stochastic training processes, and model performance degradation. We established that ML pipelines add stages beyond traditional CI/CD: Continuous Training (automated retraining on new data), model evaluation gates (validating model quality, including business metrics, before promotion), and Continuous Monitoring (detecting data drift and concept drift).

Kubeflow provides a comprehensive Kubernetes-native platform for ML workflows, integrating pipelines, distributed training, and model serving through KServe, while MLflow offers a lighter-weight, Python-centric approach focused on experimentation tracking and model registry management. Both platforms emphasize model versioning that captures not just code state, but data lineage, hyperparameters, and feature engineering logic.

Model serving strategies range from real-time HTTP APIs (requiring low-latency autoscaling) to batch inference jobs (processing large datasets periodically), with Kubernetes providing the orchestration layer for GPU scheduling, high-performance storage, and resource optimization. A/B testing for ML models extends beyond infrastructure canaries to statistical validation of model performance, using shadow mode for safe evaluation and champion/challenger patterns for continuous improvement.

**Key Takeaways:**
- Version data separately from code using DVC or similar tools; a model is inseparable from the dataset that trained it
- Implement training-serving skew detection to ensure feature engineering consistency between batch training and online inference
- Use model registries (MLflow, Kubeflow Metadata) to manage promotion workflows (Staging → Production → Archived) with approval gates
- Monitor for data drift using statistical tests (Kolmogorov-Smirnov, Population Stability Index) to trigger automated retraining
- Package models with explicit dependency manifests (conda.yaml, requirements.txt) to prevent environment mismatches
- Implement circuit breakers for model serving; fall back to baseline models when prediction latency exceeds SLAs or confidence thresholds drop
- Log prediction inputs and outputs for audit trails and debugging, ensuring compliance with AI governance regulations

**Next Chapter Preview:**
Chapter 61: Mobile App CI/CD addresses the unique constraints of deploying to managed ecosystems (iOS App Store, Google Play Store) with stringent review processes, code signing requirements, and binary distribution models. We will explore fastlane for automating screenshot generation and app store submissions, Flutter and React Native build pipelines for cross-platform development, and strategies for over-the-air (OTA) updates using services like CodePush. The chapter examines mobile-specific testing requirements including device farm testing, UI automation with Appium, and managing provisioning profiles and certificates in CI environments. We will also investigate progressive web app (PWA) deployment strategies that blur the line between native and web distribution, completing our coverage of client-side deployment automation.

