# Chapter 62: CI/CD for Machine Learning

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the fundamentals of Continuous Integration and Continuous Delivery (CI/CD) and their importance for ML systems
- Identify the unique challenges of applying CI/CD to machine learning compared to traditional software
- Design and implement automated testing strategies for ML projects, including data tests, model tests, and infrastructure tests
- Set up continuous integration pipelines that automatically validate code, data, and models
- Build continuous delivery pipelines that deploy models to staging and production environments
- Implement continuous deployment strategies for models that pass all validation gates
- Apply GitOps principles to manage ML infrastructure and model versions declaratively
- Use Infrastructure as Code (IaC) to provision and manage ML environments
- Orchestrate end‑to‑end ML pipelines that combine training, validation, and deployment
- Adopt best practices for CI/CD in ML to ensure reliable, reproducible, and frequent model updates

---

## Introduction

In the previous chapter, we introduced MLOps and its importance for production ML systems. One of the core practices of MLOps is **CI/CD**—Continuous Integration and Continuous Delivery (or Deployment). CI/CD pipelines automate the process of building, testing, and deploying software, enabling teams to deliver changes frequently and reliably.

For machine learning systems, CI/CD is both more important and more challenging than for traditional software. An ML system has multiple artifacts: code, data, features, and models. Each of these can change and must be validated. A change in the training data could break the model just as easily as a code change. Therefore, CI/CD for ML must encompass data and model validation alongside traditional software tests.

In this chapter, we will explore how to adapt CI/CD principles to machine learning. We will build automated pipelines for the NEPSE prediction system that test data quality, validate model performance, and deploy new models safely. We'll use tools like GitHub Actions, Jenkins, and MLflow, and discuss best practices for each stage.

---

## 62.1 CI/CD Fundamentals

Before diving into ML‑specific aspects, let's review the basic concepts of CI/CD.

### 62.1.1 Continuous Integration (CI)

**Continuous Integration** is the practice of automatically integrating code changes from multiple contributors into a shared repository several times a day. Each integration is verified by an automated build and tests, allowing teams to detect problems early.

Key elements of CI:
- Developers frequently merge code to a central repository (e.g., `main` branch).
- An automated server (e.g., Jenkins, GitHub Actions) runs a build and tests on every push.
- If tests fail, the team is notified and fixes the issue immediately.

For ML, CI extends beyond code to include data and model validation.

### 62.1.2 Continuous Delivery (CD)

**Continuous Delivery** extends CI by ensuring that the software can be released to production at any time. After passing CI, the application is automatically deployed to a staging environment that mirrors production, where further tests (e.g., integration tests, performance tests) are run. The release to production is still a manual decision but can be done with a single click.

### 62.1.3 Continuous Deployment

**Continuous Deployment** goes one step further: every change that passes all stages of the production pipeline is automatically released to users, with no human intervention. This requires a high level of confidence in the automated testing and deployment process.

For ML systems, continuous deployment of models is possible but requires robust validation and monitoring to prevent deploying a bad model.

---

## 62.2 ML‑Specific CI/CD Challenges

ML systems introduce unique challenges for CI/CD:

1. **Multiple artifacts**: Besides code, we have data, features, and models. Each must be versioned and tested.
2. **Non‑deterministic training**: Model training involves randomness. The same code and data can produce slightly different models. Tests must account for this variability.
3. **Data and concept drift**: A model that performed well at training time may degrade in production due to changing data. CI/CD pipelines must include monitoring and trigger retraining.
4. **Model validation is complex**: Evaluating a model requires more than just a unit test. It involves comparing against baselines, checking for fairness, and measuring performance on holdout data.
5. **Training is expensive**: Running a full training pipeline on every code change may be prohibitively expensive. We need strategies to decide when to retrain.

Despite these challenges, CI/CD for ML is achievable and essential for maintaining production systems.
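
Challenge 2 has a direct consequence for testing: an assertion on an exact metric value will fail at random, so tests usually pin the seed and check a tolerance band instead. The following is a minimal sketch of that idea. `train_and_score` is a hypothetical stand-in for a real training run (it fits a threshold on synthetic data), and the thresholds `min_accuracy` and `max_spread` are illustrative assumptions, not recommended values.

```python
import random
import statistics


def train_and_score(seed: int, n_per_class: int = 500) -> float:
    """Hypothetical stand-in for a training run; returns holdout accuracy.

    Draws two well-separated Gaussian classes, "trains" a midpoint
    threshold on one sample, and scores it on a fresh sample.
    """
    rng = random.Random(seed)  # all randomness flows from the seed
    train0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    train1 = [rng.gauss(4.0, 1.0) for _ in range(n_per_class)]
    threshold = (statistics.mean(train0) + statistics.mean(train1)) / 2

    test0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    test1 = [rng.gauss(4.0, 1.0) for _ in range(n_per_class)]
    correct = sum(x < threshold for x in test0) + sum(x >= threshold for x in test1)
    return correct / (2 * n_per_class)


def check_training_stability(seeds, min_accuracy=0.90, max_spread=0.05):
    """Fail if any seed scores too low or the scores disagree too much."""
    scores = [train_and_score(s) for s in seeds]
    assert min(scores) >= min_accuracy, f"worst seed scored {min(scores):.3f}"
    assert max(scores) - min(scores) <= max_spread, "accuracy varies too much across seeds"
    return scores


scores = check_training_stability(seeds=[0, 1, 2, 3, 4])
```

Fixing the seed makes a single run repeatable, while checking several seeds guards against a result that only holds for one lucky initialization.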

---

## 62.3 Automated Testing for ML

Testing in ML systems must cover multiple layers:

- **Code tests**: Unit tests for functions (e.g., feature engineering functions, data loading).
- **Data tests**: Validate input data schema, distributions, and quality.
- **Model tests**: Verify model performance, consistency, and fairness.
- **Infrastructure tests**: Ensure the serving infrastructure works correctly (e.g., API responses).

### 62.3.1 Code Tests

Code tests are standard `pytest` unit tests for ordinary Python functions. For example, we can test the RSI calculation function.

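```python
# test_features.py
import numpy as np
import pandas as pd

from features import compute_rsi


def test_rsi_calculation():
    prices = pd.Series([100, 101, 102, 101, 100, 99, 98])
    # Expected values assume compute_rsi averages gains and losses with a
    # simple moving average over `period` price changes. With Wilder
    # smoothing the last value would be 32.0 instead of 20.0.
    expected_rsi = pd.Series([np.nan, np.nan, np.nan, np.nan, np.nan, 40.0, 20.0])
    result = compute_rsi(prices, period=5)
    # check_less_precise was removed in pandas 2.0; use an explicit tolerance
    pd.testing.assert_series_equal(result, expected_rsi, check_exact=False, atol=0.01)
```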

**Explanation:**  
We test a core feature engineering function with known inputs and expected outputs. This ensures that changes to the code do not break the feature calculation.
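
Expected values for a test like the one above depend on which RSI variant the function implements, so it helps to derive them from a small reference implementation that can be checked by hand. The sketch below is a dependency-free version that assumes simple moving averages of gains and losses; `rsi_sma` is a hypothetical helper, not the `compute_rsi` function from `features.py`.

```python
import math


def rsi_sma(prices, period):
    """RSI using simple moving averages of gains and losses.

    Returns a list as long as `prices`; positions without a full window
    of `period` price changes hold NaN.
    """
    changes = [b - a for a, b in zip(prices, prices[1:])]
    out = [math.nan] * len(prices)
    for i in range(period, len(prices)):
        window = changes[i - period:i]  # the `period` changes ending at price i
        avg_gain = sum(c for c in window if c > 0) / period
        avg_loss = sum(-c for c in window if c < 0) / period
        if avg_loss == 0:
            out[i] = 100.0  # convention used here when the window has no losses
        else:
            rs = avg_gain / avg_loss
            out[i] = 100.0 - 100.0 / (1.0 + rs)
    return out


values = rsi_sma([100, 101, 102, 101, 100, 99, 98], period=5)
print(values[5], values[6])  # approximately 40.0 and 20.0
```

For the first complete window the changes are +1, +1, -1, -1, -1, so the average gain is 0.4, the average loss is 0.6, and the RSI is 100 - 100 / (1 + 0.4/0.6) = 40.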

### 62.3.2 Data Tests

Data tests validate that the input data meets expectations. For the NEPSE system, we might check:

- No missing values in critical columns.
- Prices are positive.
- Volume is integer and non‑negative.
- Date column is in the correct format and increasing.

**Example using `great_expectations` (legacy pandas API, releases before 1.0):**

```python
import great_expectations as ge

def test_data_quality(df):  # `df` is assumed to come from a pytest fixture
    df_ge = ge.from_pandas(df)

    # Expectations
    df_ge.expect_column_values_to_not_be_null('Close')
    # strict_min=True rejects a price of exactly 0 (prices must be positive)
    df_ge.expect_column_values_to_be_between('Close', min_value=0, strict_min=True)
    df_ge.expect_column_values_to_be_between('Volume', min_value=0)
    df_ge.expect_column_values_to_match_strftime_format('Date', '%Y-%m-%d')

    # Run validation
    results = df_ge.validate()
    assert results['success'], "Data quality checks failed"
```

**Explanation:**  
`great_expectations` allows us to define declarative expectations for data. These can be run in CI to catch data issues early.
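
The four checks listed above can also be written without a validation library, which is sometimes enough for a first CI gate. The following is a rough sketch under the assumption that rows arrive as dictionaries with `Date`, `Close`, and `Volume` keys; `find_data_problems` is a hypothetical helper name.

```python
from datetime import datetime


def find_data_problems(rows):
    """Return human-readable problems; an empty list means the data passed."""
    problems = []
    previous_date = None
    for i, row in enumerate(rows):
        close, volume = row.get("Close"), row.get("Volume")
        if close is None:
            problems.append(f"row {i}: Close is missing")
        elif close <= 0:
            problems.append(f"row {i}: Close must be positive, got {close}")

        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(volume, int) or isinstance(volume, bool) or volume < 0:
            problems.append(f"row {i}: Volume must be a non-negative integer, got {volume!r}")

        try:
            date = datetime.strptime(str(row.get("Date")), "%Y-%m-%d")
        except ValueError:
            problems.append(f"row {i}: Date is not in YYYY-MM-DD format")
            continue
        if previous_date is not None and date <= previous_date:
            problems.append(f"row {i}: Date does not increase")
        previous_date = date
    return problems


good_rows = [
    {"Date": "2024-01-01", "Close": 100.5, "Volume": 1000},
    {"Date": "2024-01-02", "Close": 101.0, "Volume": 0},
]
assert find_data_problems(good_rows) == []
```

Returning a list of problems rather than failing on the first one gives the CI log a complete picture of what is wrong with a batch.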

### 62.3.3 Model Tests

Model tests go beyond simple accuracy checks. They might include:

- **Performance on holdout set**: Ensure the new model meets a minimum accuracy threshold.
- **Comparison with baseline**: New model should outperform the current production model (or a simple heuristic).
- **Invariance tests**: The model should not change its prediction for small, semantically meaningless changes (e.g., adding tiny noise).
- **Fairness tests**: Ensure the model performs similarly across different groups (if applicable).

**Example: Comparing with baseline**

```python
from sklearn.metrics import accuracy_score

# The models and test data are assumed to be provided as pytest fixtures
def test_model_against_baseline(new_model, baseline_model, X_test, y_test):
    new_preds = new_model.predict(X_test)
    baseline_preds = baseline_model.predict(X_test)

    new_acc = accuracy_score(y_test, new_preds)
    baseline_acc = accuracy_score(y_test, baseline_preds)

    assert new_acc > baseline_acc, f"New model ({new_acc:.3f}) not better than baseline ({baseline_acc:.3f})"
```
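
The invariance tests mentioned in the list above can be sketched in the same spirit: perturb each input by a negligible amount and assert that the prediction barely moves. In this sketch `predict_proba` is a hypothetical stand-in model (a logistic function with fixed weights), and the noise scale and tolerance are illustrative assumptions.

```python
import math
import random


def predict_proba(features):
    """Hypothetical stand-in model: logistic regression with fixed weights."""
    weights = [0.8, -0.5, 0.3, 0.1]
    bias = -0.2
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def max_prediction_shift(predict, inputs, noise_scale=1e-6, trials=20, seed=0):
    """Largest change in prediction when every feature is jittered by tiny noise."""
    rng = random.Random(seed)
    worst = 0.0
    for features in inputs:
        base = predict(features)
        for _ in range(trials):
            noisy = [x + rng.uniform(-noise_scale, noise_scale) for x in features]
            worst = max(worst, abs(predict(noisy) - base))
    return worst


samples = [[1.2, 3.4, 5.6, 7.8], [0.0, 0.0, 0.0, 0.0], [-2.0, 1.5, 0.3, 4.0]]
shift = max_prediction_shift(predict_proba, samples)
assert shift < 1e-5, f"prediction moved by {shift:.2e} under negligible noise"
```

A smooth model like this one passes easily; the test is more informative for models with hard thresholds or discretized features, where a tiny change can flip a prediction.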

### 62.3.4 Infrastructure Tests

Once a model is deployed, we need to test that the serving infrastructure works correctly. This includes:

- API endpoint returns correct HTTP status codes.
- Response time is within limits.
- Model inference works with expected input formats.

**Example using `requests` and `pytest`:**

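```python
import requests

def test_prediction_endpoint():
    url = "http://localhost:8000/predict"
    sample_input = {"features": [1.2, 3.4, 5.6, 7.8]}

    # A timeout keeps the test from hanging if the service is down
    response = requests.post(url, json=sample_input, timeout=5)
    assert response.status_code == 200
    # Example latency budget of 500 ms; adjust to the service's actual limit
    assert response.elapsed.total_seconds() < 0.5
    data = response.json()
    assert 'probability' in data
    assert 0 <= data['probability'] <= 1
```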
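
A test like this needs a running service. In CI, one option is to start a lightweight stub on a free local port so the client-side checks (status code, response shape, response time) can run without the real model. The sketch below uses only the standard library; `StubPredictHandler` is a hypothetical stand-in that always returns a probability of 0.5, and the 2-second budget is an arbitrary assumption.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubPredictHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real service: always answers 0.5."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if self.path != "/predict" or "features" not in payload:
            self.send_error(400)
            return
        body = json.dumps({"probability": 0.5}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass


def call_predict(base_url, features, timeout=2.0):
    """POST features to /predict; return (status, parsed JSON, seconds elapsed)."""
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler bypasses any proxy configured in the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.perf_counter()
    with opener.open(request, timeout=timeout) as response:
        data = json.loads(response.read())
        status = response.status
    return status, data, time.perf_counter() - start


def run_smoke_test():
    server = HTTPServer(("127.0.0.1", 0), StubPredictHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}"
        return call_predict(url, [1.2, 3.4, 5.6, 7.8])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, data, elapsed = run_smoke_test()
    print(status, data, f"{elapsed * 1000:.1f} ms")
```

The stub verifies the test harness and the request format, not the model itself, so it complements rather than replaces a test against the staging deployment.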

---

## 62.4 Continuous Integration for ML

A CI pipeline for ML should run automatically on every push to the repository (or on a schedule). A typical ML CI pipeline might include:

1. **Lint and format code** (e.g., `black`, `flake8`).
2. **Run unit tests** on feature engineering code.
3. **Run data validation tests** on a sample of the data.
4. **Train a quick model** on a subset of data to ensure the training script runs.
5. **Evaluate the model** on a small validation set (quick check).
6. **Package the model** and any necessary artifacts.

**Example GitHub Actions workflow for ML CI:**

```yaml
name: ML CI

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.9'
    
    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install -r requirements-dev.txt  # testing tools
    
    - name: Lint with flake8
      run: flake8 src/ tests/
    
    - name: Run unit tests
      run: pytest tests/unit
    
    - name: Run data validation
      run: python scripts/validate_data.py --sample-size 1000
    
    - name: Quick model training test
      run: python scripts/train_quick.py --max-samples 5000 --max-epochs 2
    
    - name: Package model (if training succeeded)
      run: python scripts/package_model.py
```

**Explanation:**  
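This workflow runs on every push to `main` or `develop` and on every pull request that targets `main`. It performs code quality checks, unit tests, data validation, and a quick training test to catch errors early. Each step runs only if the previous one succeeded, so the model is packaged only after training passes. The quick training uses a subset of data to avoid excessive runtime.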
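
To keep runtime down further, a CI job might first decide which stages a change actually needs, for example skipping the training check when only documentation changed. The following is one possible sketch; the path patterns and stage names in `STAGE_RULES` are hypothetical and would need to match the real repository layout.

```python
from fnmatch import fnmatchcase

ALL_STAGES = {"unit_tests", "data_validation", "quick_training"}

# Hypothetical mapping from path patterns to the stages they affect.
# Rules are checked in order and the first match wins.
STAGE_RULES = [
    ("docs/*", set()),
    ("tests/*", {"unit_tests"}),
    ("src/features/*", {"unit_tests", "data_validation", "quick_training"}),
    ("src/*", {"unit_tests", "quick_training"}),
    ("data/*", {"data_validation", "quick_training"}),
]


def stages_for_change(changed_files):
    """Return the set of pipeline stages to run for a list of changed paths."""
    stages = set()
    for path in changed_files:
        for pattern, affected in STAGE_RULES:
            if fnmatchcase(path, pattern):
                stages |= affected
                break
        else:
            # No rule matched: be conservative and run everything
            stages |= ALL_STAGES
    return stages


print(sorted(stages_for_change(["docs/intro.md", "src/train.py"])))
# ['quick_training', 'unit_tests']
```

Falling back to all stages for unrecognized paths trades some runtime for safety: a forgotten rule causes extra work rather than a skipped test.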

---

## 62.5 Continuous Delivery for ML

Continuous delivery ensures that a validated model can be deployed to production at any time. After CI passes, we typically deploy to a staging environment that mirrors production, run more extensive tests, and then promote to production.

**Staging environment tests might include:**

- **Full training on all data** (if not too expensive).
- **Integration tests** with downstream systems (e.g., databases, dashboards).
- **Performance tests** under simulated load.
- **Shadow deployment**: Send real traffic to the new model but use predictions for logging only, not for decisions.

### 62.5.1 Staging Deployment with MLflow

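MLflow can help manage model staging. After training, we register the model in the MLflow Model Registry and then move the new version to the "Staging" stage. (Registry stages are deprecated in recent MLflow releases in favor of aliases and tags; the stage-based call shown here applies to versions that still support it.)

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Assumes version 5 of "NEPSE_Predictor" has already been registered
client.transition_model_version_stage(
    name="NEPSE_Predictor",
    version=5,
    stage="Staging"
)
```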

Then, in the staging environment, we can automatically deploy all models in the "Staging" stage.
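
A deployment job first needs to find out which versions are in that stage. The sketch below assumes a client that behaves like `MlflowClient`, whose `search_model_versions` method returns version objects with `version` and `current_stage` attributes. `versions_in_stage` and `FakeRegistryClient` are hypothetical names; the fake client exists only so the example runs without an MLflow server.

```python
from types import SimpleNamespace


def versions_in_stage(client, model_name, stage="Staging"):
    """Return the registry versions of `model_name` currently in `stage`, lowest first."""
    candidates = client.search_model_versions(f"name='{model_name}'")
    return sorted(int(mv.version) for mv in candidates if mv.current_stage == stage)


class FakeRegistryClient:
    """In-memory stand-in for the registry client, for demonstration only."""

    def search_model_versions(self, filter_string):
        return [
            SimpleNamespace(version="3", current_stage="Production"),
            SimpleNamespace(version="5", current_stage="Staging"),
            SimpleNamespace(version="4", current_stage="Staging"),
        ]


print(versions_in_stage(FakeRegistryClient(), "NEPSE_Predictor"))  # [4, 5]
```

Passing the client in as an argument keeps the lookup logic testable in CI, where no registry may be reachable.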

### 62.5.2 GitHub Actions for CD

We can extend our GitHub Actions workflow to deploy to staging after CI passes on the main branch.

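```yaml
name: ML CD

on:
  workflow_run:
    workflows: ["ML CI"]
    types: [completed]
    branches: [ main ]

jobs:
  deploy-staging:
    # Run only if the CI workflow finished successfully
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
      with:
        ref: ${{ github.event.workflow_run.head_sha }}
    # ... install dependencies, etc.
    - name: Train full model
      run: python scripts/train_full.py
    - name: Register model in MLflow
      run: python scripts/register_model.py
    - name: Deploy to staging
      run: python scripts/deploy_staging.py
    - name: Run staging tests
      run: pytest tests/staging

  promote-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    # Waits for approval when the "production" environment is configured
    # with required reviewers in the repository settings
    environment: production
    steps:
    - name: Promote to production
      run: echo "Ready for production"
```

**Explanation:**  
This workflow starts when the `ML CI` workflow completes on `main` and proceeds only if CI succeeded. It trains a full model, registers it, deploys to staging, and runs the staging tests. The `promote-production` job then pauses for manual approval, provided the `production` environment has required reviewers configured (it could instead promote automatically if confidence is high).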

---

## 62.6 Continuous Deployment for ML

Continuous deployment automatically promotes models to production after passing all validation gates. This requires a high level of automation and confidence in the testing process.

**Gates for auto‑promotion might include:**

- Model performance on staging data exceeds current production model by a statistically significant margin.
- No degradation in fairness metrics.
- Latency and throughput meet SLOs.
- Shadow deployment shows no anomalies.

**Example: Auto‑promotion script**

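```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# evaluate_model, measure_latency, and update_production_endpoint are
# project-specific helpers assumed to be defined elsewhere.

def should_promote(staging_model_id, production_model_id, test_data):
    staging_metrics = evaluate_model(staging_model_id, test_data)
    production_metrics = evaluate_model(production_model_id, test_data)

    # Require a fixed margin of one percentage point. This is a simple
    # threshold, not a statistical significance test.
    if staging_metrics['accuracy'] > production_metrics['accuracy'] + 0.01:
        # Also check latency
        staging_latency = measure_latency(staging_model_id)
        if staging_latency < 100:  # milliseconds
            return True
    return False

# staging_model_id, production_model_id, staging_version, and test_data
# are assumed to be supplied by the surrounding pipeline.
if should_promote(staging_model_id, production_model_id, test_data):
    client.transition_model_version_stage(
        name="NEPSE_Predictor",
        version=staging_version,
        stage="Production"
    )
    # Also update the serving endpoint
    update_production_endpoint(staging_model_id)
```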

**Explanation:**  
The script compares the staging model against the current production model on a holdout dataset. If it meets criteria (better accuracy and acceptable latency), it is automatically promoted to production.
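
The gate list above asks for a statistically significant improvement, which a fixed accuracy margin does not establish on its own. One common way to check it is a paired bootstrap over the holdout rows. The sketch below is a minimal version under stated assumptions: a one-sided 95% bound, 2,000 resamples, and hypothetical function names.

```python
import random


def bootstrap_accuracy_gain(y_true, new_preds, old_preds, n_resamples=2000, seed=0):
    """Paired bootstrap over test rows.

    Returns (observed_gain, lower_bound), where lower_bound is the 5th
    percentile of the resampled accuracy difference (new minus old).
    """
    rng = random.Random(seed)
    n = len(y_true)
    # Per-row difference in correctness: +1, 0, or -1
    diffs = [int(new == t) - int(old == t) for t, new, old in zip(y_true, new_preds, old_preds)]
    observed = sum(diffs) / n

    gains = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample rows with replacement
        gains.append(sum(sample) / n)
    gains.sort()
    lower = gains[int(0.05 * n_resamples)]
    return observed, lower


def is_significantly_better(y_true, new_preds, old_preds):
    """True if the lower bound of the accuracy gain is above zero."""
    _, lower = bootstrap_accuracy_gain(y_true, new_preds, old_preds)
    return lower > 0.0


# Toy data: the new model is right everywhere, the old one misses 30 of 200 rows
y_true = [1] * 100 + [0] * 100
new_preds = list(y_true)
old_preds = [0] * 30 + [1] * 70 + [0] * 100
print(is_significantly_better(y_true, new_preds, old_preds))  # True
```

Resampling the per-row differences, rather than each model's accuracy separately, keeps the comparison paired: both models are always scored on the same rows.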

---

## 62.7 GitOps for ML

**GitOps** is a practice where the entire system's desired state is described in Git, and automated processes ensure the actual state matches the desired state. For ML, this means:

- Model versions, configurations, and infrastructure definitions are stored in Git.
- Changes are made via pull requests.
- CI/CD pipelines automatically apply the changes.

### 62.7.1 Infrastructure as Code (IaC)

Use tools like Terraform, AWS CloudFormation, or Pulumi to define your ML infrastructure (e.g., Kubernetes clusters, S3 buckets, IAM roles). Store these definitions in Git.

**Example Terraform snippet for an S3 bucket:**

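```hcl
resource "aws_s3_bucket" "model_artifacts" {
  bucket = "nepse-model-artifacts"
  # New buckets are private by default, so no ACL is set here
}

# With AWS provider v4 and later, versioning is a separate resource
resource "aws_s3_bucket_versioning" "model_artifacts" {
  bucket = aws_s3_bucket.model_artifacts.id

  versioning_configuration {
    status = "Enabled"
  }
}
```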

When a pull request changes this file, a CI job can run `terraform plan` to show the changes, and after merge, `terraform apply` to enact them.

### 62.7.2 Model Versions in Git

For smaller projects, you might store model files directly in Git (using Git LFS). More commonly, the model artifact lives in a registry (like MLflow), and Git holds only a pointer to it: the model name and version that should be deployed.

**Example: A YAML file describing the desired production model**

```yaml
# production-model.yaml
model_name: NEPSE_Predictor
model_version: 5
serving_endpoint: https://api.nepse.example.com/predict
```

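A GitOps controller could watch this file and ensure the deployed model matches the specified version. Argo CD does this for Kubernetes manifests; for a custom file like this one, a small reconciliation job in the CD pipeline can play the same role.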
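
The core of any such reconciliation is a comparison between the desired state in Git and the state that is actually deployed. The following sketch illustrates that comparison for the flat file above. Both functions are hypothetical, and the parser handles only simple `key: value` lines, not general YAML.

```python
def parse_flat_yaml(text):
    """Parse `key: value` lines only; enough for a flat file, not general YAML."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue
        key, _, value = line.partition(":")  # split at the first colon only
        config[key.strip()] = value.strip()
    return config


def plan_actions(desired, deployed):
    """Compare desired state (from Git) with deployed state; return actions to converge."""
    actions = []
    if desired["model_version"] != deployed.get("model_version"):
        actions.append(f"deploy {desired['model_name']} version {desired['model_version']}")
    if desired["serving_endpoint"] != deployed.get("serving_endpoint"):
        actions.append(f"route {desired['serving_endpoint']} to the new deployment")
    return actions


desired = parse_flat_yaml("""
# production-model.yaml
model_name: NEPSE_Predictor
model_version: 5
serving_endpoint: https://api.nepse.example.com/predict
""")
deployed = {"model_version": "4", "serving_endpoint": "https://api.nepse.example.com/predict"}
print(plan_actions(desired, deployed))  # ['deploy NEPSE_Predictor version 5']
```

Because the function only plans actions and returns them, the same code can run in a pull-request check (to preview the change) and after merge (to apply it).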

---

## 62.8 Infrastructure as Code (IaC) for ML

IaC is a key enabler of GitOps. It allows you to provision and manage ML infrastructure programmatically, ensuring consistency and reproducibility.

### 62.8.1 Terraform for ML Infrastructure

Terraform can manage cloud resources for the entire ML pipeline:

- **Data storage**: S3 buckets, BigQuery datasets.
- **Compute**: SageMaker instances, Kubernetes clusters.
- **Networking**: VPCs, security groups.
- **IAM roles and policies**.

**Example: Provisioning a SageMaker notebook instance**

```hcl
resource "aws_sagemaker_notebook_instance" "nepse_notebook" {
  name          = "nepse-data-science"
  role_arn      = aws_iam_role.sagemaker_role.arn
  instance_type = "ml.t3.medium"
  
  tags = {
    Project = "NEPSE"
  }
}
```

### 62.8.2 Kubernetes Manifests for Model Serving

If you deploy models on Kubernetes, you can define the deployment and service in YAML and store them in Git.

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nepse-predictor-v5
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nepse-predictor
      version: v5
  template:
    metadata:
      labels:
        app: nepse-predictor
        version: v5
    spec:
      containers:
      - name: predictor
        image: myregistry/nepse-predictor:v5
        ports:
        - containerPort: 8000
```

Updating the version in Git and syncing with the cluster (using Argo CD) deploys the new model.

---

## 62.9 Pipeline Orchestration

CI/CD pipelines for ML often need to orchestrate multiple steps: data validation, training, evaluation, deployment. Tools like Apache Airflow, Prefect, and Kubeflow Pipelines are designed for this.

### 62.9.1 Airflow DAG for ML Pipeline

An Airflow DAG can define the entire workflow, with dependencies and retries.

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'ml-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'nepse_training_pipeline',
    default_args=default_args,
    schedule='@weekly',  # this argument was named schedule_interval before Airflow 2.4
    catchup=False
)

def validate_data(**context):
    # Run data validation
    pass

def train_model(**context):
    # Train model
    pass

def evaluate_model(**context):
    # Evaluate and compare with baseline
    pass

def deploy_if_better(**context):
    # Deploy if evaluation passes
    pass

validate = PythonOperator(task_id='validate_data', python_callable=validate_data, dag=dag)
train = PythonOperator(task_id='train_model', python_callable=train_model, dag=dag)
evaluate = PythonOperator(task_id='evaluate_model', python_callable=evaluate_model, dag=dag)
deploy = PythonOperator(task_id='deploy_if_better', python_callable=deploy_if_better, dag=dag)

validate >> train >> evaluate >> deploy
```

**Explanation:**  
This DAG runs weekly. Once the placeholder functions are filled in, it validates data, trains a model, evaluates it, and deploys only if the model meets the criteria. Airflow handles scheduling, retries, and monitoring.

---

## 62.10 Best Practices for ML CI/CD

1. **Start small**: Automate one part of the pipeline at a time (e.g., data validation first).
2. **Version everything**: Code, data, models, and environments.
3. **Test data as rigorously as code**: Data drift can break models.
4. **Use staging environments**: Test models in an environment that mirrors production.
5. **Monitor deployed models**: CI/CD doesn't stop at deployment; monitor performance and trigger retraining.
6. **Keep pipelines fast**: Use incremental training or smaller datasets for quick CI tests.
7. **Secure secrets**: Never hard‑code credentials; use secrets managers (e.g., GitHub Secrets, HashiCorp Vault).
8. **Document the pipeline**: Make it easy for team members to understand and modify.
9. **Have a rollback plan**: Automate the ability to revert to a previous model version.
10. **Embrace GitOps**: Use Git as the single source of truth for both code and configuration.
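
As an illustration of practice 9, the rollback decision itself can be a small, testable function. This sketch assumes the pipeline keeps a history of `(version, healthy)` pairs, where `healthy` records whether that version passed monitoring while it was live; the function name and data layout are hypothetical.

```python
def rollback_target(history, current_version):
    """Pick the most recent earlier version that was marked healthy.

    `history` is a list of (version, healthy) tuples.
    """
    earlier = [v for v, healthy in history if healthy and v < current_version]
    if not earlier:
        raise LookupError("no healthy earlier version to roll back to")
    return max(earlier)


history = [(3, True), (4, False), (5, True), (6, True)]
print(rollback_target(history, current_version=6))  # 5
print(rollback_target(history, current_version=5))  # 3 (version 4 was unhealthy)
```

Skipping versions that were themselves unhealthy avoids rolling back from one bad model to another.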

---

## Chapter Summary

In this chapter, we explored how to apply CI/CD principles to machine learning systems, using the NEPSE prediction system as a concrete example. We covered:

- The fundamentals of CI/CD and their adaptation to ML.
- ML‑specific testing strategies: data tests, model tests, infrastructure tests.
- Building continuous integration pipelines with GitHub Actions.
- Continuous delivery to staging environments and continuous deployment with automatic promotion.
- GitOps and Infrastructure as Code for managing ML infrastructure declaratively.
- Orchestrating complex ML pipelines with Airflow.
- Best practices for reliable and efficient ML CI/CD.

By implementing CI/CD for the NEPSE system, we can ensure that model updates are delivered quickly, reliably, and safely. This transforms the project from a static model into a living system that adapts to changing market conditions.

In the next chapter, we will discuss **Feature Stores**, a critical component for managing features consistently across training and serving.

---

**End of Chapter 62**