# Chapter 46: Continuous Retraining Strategies

## Learning Objectives

By the end of this chapter, you will be able to:

- Determine when and why to retrain a time‑series prediction model in production
- Distinguish between time‑based, performance‑based, and drift‑based retraining triggers
- Implement batch retraining pipelines that periodically update models using historical data
- Apply incremental learning techniques to update models without full retraining
- Explore online learning algorithms that adapt continuously to new data streams
- Use active learning to selectively label and incorporate the most informative new examples
- Design continuous training pipelines with orchestration tools like Apache Airflow
- Compare model versions using A/B testing and shadow deployments
- Build automated retraining systems that monitor drift and trigger updates without manual intervention

---

## Introduction

A model deployed to production is never truly finished. Markets evolve, user behaviour changes, and the underlying data‑generating process shifts. For a time‑series prediction system like our NEPSE stock predictor, the statistical properties of financial data are constantly in flux. A model that performed well last month may underperform today because of new market regimes, regulatory changes, or unexpected events.

**Continuous retraining** is the practice of updating your model on a regular basis—or in response to specific triggers—to maintain its accuracy and relevance. In this chapter, we will explore the spectrum of retraining strategies, from simple scheduled retraining to sophisticated online learning algorithms. We will also discuss how to build automated pipelines that monitor model performance, detect drift, and retrain without human intervention, all while ensuring that new model versions are thoroughly validated before they serve traffic.

Using the NEPSE prediction system as a running example, we will implement practical retraining workflows and discuss the trade‑offs between freshness, computational cost, and stability.

---

## 46.1 When to Retrain

Deciding when to retrain a model is a fundamental design question. There are three primary drivers:

1. **Time‑based retraining** – Retrain on a fixed schedule (e.g., every week, every month). This is simple to implement and ensures the model is periodically refreshed with the most recent data.
2. **Performance‑based retraining** – Retrain when the model’s prediction error exceeds a threshold on recent data (once ground truth is available). This directly targets degradation but may react late.
3. **Drift‑based retraining** – Retrain when statistical tests detect significant data drift or concept drift. This can be a leading indicator of future performance drop.

For the NEPSE system, a combination is often best. You might retrain every week (time‑based) but also trigger an extra retraining cycle if a critical feature like `Volume` drifts beyond a certain PSI threshold (drift‑based), or if the daily error rate spikes (performance‑based).
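The combined policy described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the helper names `psi` and `should_retrain`, the equal-width binning, and every threshold (7 days, 0.5 error rate, 0.2 PSI) are assumptions chosen for the example and should be tuned on your own data.

```python
import math
from datetime import date

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clip values outside the reference range into the edge bins
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0)
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

def should_retrain(last_trained, today, recent_error, volume_psi,
                   max_age_days=7, error_threshold=0.5, psi_threshold=0.2):
    """Return the list of triggers that fired (an empty list means no retraining)."""
    reasons = []
    if (today - last_trained).days >= max_age_days:
        reasons.append("time")
    if recent_error > error_threshold:
        reasons.append("performance")
    if volume_psi > psi_threshold:
        reasons.append("drift")
    return reasons

# Made-up Volume samples: the current window is shifted upwards
reference_volume = [float(v) for v in range(100)]
current_volume = [v + 50.0 for v in reference_volume]

print(should_retrain(date(2025, 3, 10), date(2025, 3, 12),
                     recent_error=0.42,
                     volume_psi=psi(reference_volume, current_volume)))
```

Returning the list of fired triggers, rather than a single boolean, makes it easy to log why a retraining cycle was started.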

---

## 46.2 Batch Retraining

Batch retraining is the most common approach: periodically (e.g., nightly) you collect all available data up to the present, preprocess it, engineer features, and train a new model from scratch. The new model is then validated and, if it passes acceptance tests, deployed to replace the old one.

### 46.2.1 A Simple Batch Retraining Pipeline

Consider a daily retraining script for our NEPSE model:

```python
# retrain_daily.py
import logging
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

logging.basicConfig(level=logging.INFO)

MODEL_PATH = Path('models/nepse_model_latest.pkl')

def load_and_prepare_data():
    """Load all available NEPSE data up to yesterday and build the target."""
    df = pd.read_csv('nepse_all.csv', parse_dates=['Date'])
    # Assume we have features already engineered in the CSV
    # Or we could run feature engineering here
    df = df[df['Date'] < pd.Timestamp.now().normalize()]
    df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)
    # Binary target: does the next trading day close higher? (computed per symbol)
    next_close = df.groupby('Symbol')['Close'].shift(-1)
    df['target'] = (next_close > df['Close']).astype(int)
    # The latest row of each symbol has no next-day close yet, so drop it
    return df[next_close.notna()]

def split_features(df):
    """Separate the feature matrix from the target."""
    X = df.drop(columns=['Date', 'Close', 'Symbol', 'target'])
    return X, df['target']

def train_model(train_df):
    """Train a new model on the training period."""
    X, y = split_features(train_df)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)
    return model

def validate_model(model, val_df):
    """Quick validation on last month's data, which the model has not seen."""
    X_val, y_val = split_features(val_df)
    acc = accuracy_score(y_val, model.predict(X_val))
    logging.info(f"Validation accuracy on last month: {acc:.4f}")
    return acc

if __name__ == "__main__":
    df = load_and_prepare_data()
    # Hold out the most recent month so validation is out-of-sample
    split_date = pd.Timestamp.now().normalize() - pd.DateOffset(months=1)
    model = train_model(df[df['Date'] < split_date])
    acc = validate_model(model, df[df['Date'] >= split_date])
    if acc > 0.55:  # threshold
        MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(model, MODEL_PATH)
        logging.info("New model saved.")
    else:
        logging.warning("New model did not meet accuracy threshold; not deployed.")
```

**Explanation:**  
- The script loads all historical data, builds the next‑day direction target per symbol, trains a random forest on everything except the most recent month, and evaluates on that held‑out month. Validating on rows the model was trained on would inflate the accuracy.  
- If the validation accuracy exceeds a simple threshold (55%, against roughly 50% for random guessing when up and down days are balanced), it saves the model.  
- This script could be run daily via a cron job or an orchestrator like Airflow.

### 46.2.2 Challenges with Batch Retraining

- **Computational cost**: Retraining from scratch every day on a growing dataset becomes expensive. Incremental approaches may be needed.
- **Dominance of old data**: When using all historical data, the model may be slow to adapt to recent changes because older observations far outnumber recent ones.
- **Validation**: You must ensure the new model does not regress on important segments (e.g., high‑volatility periods).

To address the dominance of old data, many practitioners use a **sliding window**: train only on the last N months (e.g., 12 months). This makes the model more responsive to recent patterns and reduces training time.

```python
window_size = 365  # calendar days, i.e. roughly one year of trading history
cutoff = pd.Timestamp.now().normalize() - pd.DateOffset(days=window_size)
train_data = df[df['Date'] >= cutoff]
```

---

## 46.3 Incremental Learning

Incremental learning updates the model with each new batch of data without retraining from scratch; online (or streaming) learning, covered in Section 46.4, is the special case where each batch is a single example. This is usually much cheaper than full retraining and can adapt faster. Many algorithms support incremental updates: linear models (SGD), Naive Bayes, and some tree‑based models (e.g., the Hoeffding Tree in the `river` library).

### 46.3.1 Incremental Learning with River

`river` is a Python library for online machine learning. It provides estimators with a `learn_one` method that updates the model with a single example.

**Example: Incremental logistic regression for NEPSE direction prediction.**

```python
from river import linear_model, preprocessing, metrics
import pandas as pd

# Assume `df` is a DataFrame sorted by date that holds only numeric feature
# columns plus a binary `target` column (drop Date/Symbol beforehand)
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

metric = metrics.Accuracy()

# Simulate streaming data
for i, (_, row) in enumerate(df.iterrows(), start=1):
    # Convert row to dict of features (excluding target)
    x = row.drop('target').to_dict()
    y = bool(row['target'])

    # Predict before learning, so accuracy is measured on unseen examples
    y_pred = model.predict_one(x)
    if y_pred is not None:
        metric.update(y, y_pred)

    # Update model with this example
    model.learn_one(x, y)

    # Every 1000 examples, log current accuracy
    if i % 1000 == 0:
        print(f"Iteration {i}, accuracy: {metric.get():.4f}")
```

**Explanation:**  
- The model is wrapped in a pipeline that first scales features (online) and then trains a logistic regression.  
- For each new trading day, we call `learn_one` to update the model. We can also monitor its accuracy on the fly.  
- This approach uses very little memory and can run indefinitely.

### 46.3.2 When to Use Incremental Learning

Incremental learning is ideal when:
- Data arrives continuously and you need near‑real‑time adaptation.
- You cannot afford to retrain from scratch frequently.
- The underlying process changes gradually.

For NEPSE, incremental learning could be used to update the model after each trading day. However, care must be taken because financial data often exhibits non‑stationarity and sudden shifts; incremental models may overreact to noise if the learning rate is too high.

---

## 46.4 Online Learning

Online learning is a subset of incremental learning where the model is updated after each individual instance. It is particularly useful for real‑time systems where predictions must be made immediately and the model can adapt on the fly.

### 46.4.1 Online Learning with Vowpal Wabbit

Vowpal Wabbit (VW) is a fast online learning system that supports various algorithms and is widely used in production.

**Example training a logistic regression model with VW via command line:**

```bash
# Convert data to VW format
# Each line: label |f feature1:value feature2:value ...
# With --loss_function logistic, labels must be -1 or 1 (not 0/1)
vw --data train.vw --loss_function logistic --passes 1 --learning_rate 0.1 -f model.vw
```

For Python integration, you can use the `vowpalwabbit` package, which provides Python bindings for VW.

### 46.4.2 Challenges of Online Learning

- **Choice of learning rate**: Too high and the model oscillates; too low and it adapts slowly.
- **Catastrophic forgetting**: The model may forget old patterns if it only sees recent data.
- **Evaluation**: Standard cross‑validation doesn't apply; you must use progressive validation.
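Progressive validation (also called test‑then‑train) scores each example before the model learns from it, so every prediction is made on unseen data. The sketch below shows the idea with a tiny hand‑written SGD logistic regression. Both the `OnlineLogisticRegression` class and the synthetic stream are illustrative assumptions; in practice you would plug in a `river` or VW model and real NEPSE features.

```python
import math
import random

class OnlineLogisticRegression:
    """Tiny SGD logistic regression, written out for illustration only."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba_one(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(min(z, 30.0), -30.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss with respect to the linear score
        error = self.predict_proba_one(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error

def progressive_validation(model, stream):
    """Test-then-train: score each example before the model learns from it."""
    correct = 0
    total = 0
    for x, y in stream:
        y_pred = int(model.predict_proba_one(x) >= 0.5)
        correct += int(y_pred == y)
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0

# Synthetic, linearly separable stream (a placeholder for real features)
rng = random.Random(0)
stream = []
for _ in range(2000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    stream.append((x, int(x[0] + x[1] > 0)))

accuracy = progressive_validation(OnlineLogisticRegression(2, learning_rate=0.1), stream)
print(f"Progressive accuracy: {accuracy:.3f}")
```

Re‑running the loop with different `learning_rate` values is a simple way to see the trade‑off from the first bullet: a larger rate adapts faster but makes the weights jump more after each example.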

---

## 46.5 Active Learning

Active learning is a strategy where the model selectively requests labels for the most informative unlabeled examples. In a prediction system, you might not receive immediate feedback (the true next‑day price is only known the next day). Active learning can help you choose which examples to store for later retraining, reducing the amount of labeled data needed.

For NEPSE, you could decide to only retrain on days where the model was highly uncertain (e.g., prediction probability near 0.5) or where the market exhibited unusual behaviour.

**Example: Uncertainty sampling**

```python
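import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data for illustration; replace with real NEPSE features and labels
rng = np.random.default_rng(42)
X_initial = rng.normal(size=(200, 5))
y_initial = (X_initial[:, 0] > 0).astype(int)
stream = rng.normal(size=(50, 5))  # unlabeled examples arriving over time

model = RandomForestClassifier(n_estimators=100, random_state=42)
# Initially train on some data
model.fit(X_initial, y_initial)

retraining_buffer = []  # examples selected for the next retraining batch

# Stream of unlabeled examples
for x_unlabeled in stream:
    proba = model.predict_proba([x_unlabeled])[0]
    uncertainty = 1 - np.max(proba)   # lower max probability = higher uncertainty
    if uncertainty > 0.4:  # threshold
        # Ask for label (in practice, wait for next day's price)
        # Store this example for the next retraining batch
        retraining_buffer.append(x_unlabeled)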
```

---

## 46.6 Continuous Training Pipelines

In a mature MLOps setup, retraining is not a one‑off script but a continuous pipeline integrated with your CI/CD and monitoring systems. Tools like Apache Airflow, Prefect, or Kubeflow Pipelines allow you to orchestrate complex workflows that include:

- Data extraction and validation
- Feature engineering
- Model training and hyperparameter tuning
- Model validation (e.g., against a baseline)
- Model deployment (if validation passes)

### 46.6.1 Example Airflow DAG for Weekly Retraining

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'ds-team',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

def extract_data(**context):
    # Pull new data from database or data lake
    pass

def engineer_features(**context):
    # Run feature engineering scripts
    pass

def train_model(**context):
    # Train model and save artifact
    pass

def validate_model(**context):
    # Compare new model with current production model
    pass

def deploy_model(**context):
    # Update production endpoint with new model
    pass

with DAG(
    'nepse_retrain',
    schedule_interval='0 2 * * 0',  # 2 AM every Sunday
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args
) as dag:

    t1 = PythonOperator(task_id='extract_data', python_callable=extract_data)
    t2 = PythonOperator(task_id='engineer_features', python_callable=engineer_features)
    t3 = PythonOperator(task_id='train_model', python_callable=train_model)
    t4 = PythonOperator(task_id='validate_model', python_callable=validate_model)
    t5 = PythonOperator(task_id='deploy_model', python_callable=deploy_model)

    t1 >> t2 >> t3 >> t4 >> t5
```

**Explanation:**  
- This DAG runs weekly on Sunday at 2 AM.  
- It chains tasks: data extraction → feature engineering → training → validation → deployment.  
- If validation fails, you could add a branch that sends an alert instead of deploying.

---

## 46.7 A/B Testing Models

Before fully deploying a new model, you should test it against the current production model to ensure it actually improves performance in the live environment. A/B testing (or champion/challenger) is the standard method.

### 46.7.1 Setting Up an A/B Test for NEPSE

- **Champion**: Current production model.
- **Challenger**: Newly trained candidate model.
- **Traffic split**: 90% champion, 10% challenger (or 50/50 if you have enough traffic).
- **Metric**: Compare prediction accuracy (once ground truth arrives), business metrics (e.g., hypothetical trading returns), or system metrics (latency).

You need to ensure that the assignment is consistent (e.g., by user ID or by symbol) to avoid contamination. For a stock‑level predictor, you might split by symbol: some symbols use champion, others challenger.

**Implementation using a feature flag or routing service:**

```python
import random

def get_model_for_symbol(symbol):
    if random.random() < 0.1:  # 10% of requests
        return challenger_model
    else:
        return champion_model
```

But this random split per request may cause the same symbol to switch models, making comparison noisy. A better approach: hash the symbol and use modulo to assign it consistently.

```python
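import hashlib

def get_model_for_symbol(symbol):
    # Python's built-in hash() is salted per process for strings, so it would
    # assign symbols differently after every restart; use a stable digest instead
    bucket = int(hashlib.md5(symbol.encode('utf-8')).hexdigest(), 16) % 10
    if bucket == 0:  # roughly 10% of symbols
        return challenger_model
    else:
        return champion_model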
```

### 46.7.2 Statistical Significance

After a test period, you must determine if the challenger is statistically significantly better. Use a statistical test appropriate for your metric (e.g., t‑test for continuous metrics, chi‑square for proportions). Be mindful of multiple testing if you compare many metrics.

```python
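import numpy as np
from scipy.stats import ttest_ind

# champion_accuracies: list of daily accuracies for champion
# challenger_accuracies: list of daily accuracies for challenger
champion_mean = np.mean(champion_accuracies)
challenger_mean = np.mean(challenger_accuracies)

# Welch's t-test (does not assume equal variances); two-sided by default
t_stat, p_value = ttest_ind(challenger_accuracies, champion_accuracies, equal_var=False)

if p_value < 0.05 and challenger_mean > champion_mean:
    print("Challenger is significantly better; deploy.")
else:
    print("No significant improvement; keep champion.")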
```
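When the metric is a proportion, such as the share of correct direction calls, a two‑proportion z‑test is a common alternative; for a 2×2 table it is equivalent to the chi‑square test mentioned above. The following is a minimal standard‑library sketch. The counts are made up for illustration, and the Bonferroni adjustment at the end is one simple (and conservative) way to account for comparing several metrics.

```python
import math

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two proportions (b minus a)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("Both groups are all-correct or all-wrong; test undefined.")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: champion 540/1000 correct, challenger 600/1000 correct
z, p_value = two_proportion_z_test(540, 1000, 600, 1000)

# Bonferroni correction when, say, three metrics are compared in the same test
n_metrics = 3
alpha = 0.05 / n_metrics

if p_value < alpha and z > 0:
    print("Challenger is significantly better on this metric.")
else:
    print("No significant improvement on this metric.")
```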

---

## 46.8 Model Comparison Framework

Beyond simple A/B testing, you may want to maintain a **model registry** that tracks all versions, their performance metrics, and their deployment status. Tools like MLflow, Weights & Biases, or a simple database can serve this purpose.

### 46.8.1 Tracking Model Performance Over Time

Store metrics for each model version, and compare them not only at training time but also as they serve production.

```python
# Example record in model registry database
{
    "model_id": "rf_20250315",
    "training_date": "2025-03-15",
    "training_accuracy": 0.62,
    "validation_accuracy": 0.59,
    "production_accuracy_last_week": 0.58,
    "data_drift_detected": False,
    "deployment_status": "champion"
}
```

You can build a dashboard that shows the performance of all models over time, helping you spot when a champion starts degrading and a challenger might be ready.
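As a rough illustration of the "simple database" option, the sketch below keeps a registry in SQLite using only the standard library. The table layout mirrors the example record above, while the `register` and `best_challenger` helpers, the `min_gain` parameter, and the two sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent registry
conn.execute("""
    CREATE TABLE model_registry (
        model_id TEXT PRIMARY KEY,
        training_date TEXT,
        validation_accuracy REAL,
        production_accuracy_last_week REAL,
        deployment_status TEXT
    )
""")

def register(conn, record):
    """Insert one model version into the registry."""
    conn.execute("INSERT INTO model_registry VALUES (?, ?, ?, ?, ?)", record)

def best_challenger(conn, min_gain=0.01):
    """Return the id of the best challenger that beats the champion's
    live accuracy by at least min_gain, or None if there is none."""
    champion = conn.execute(
        "SELECT production_accuracy_last_week FROM model_registry "
        "WHERE deployment_status = 'champion'"
    ).fetchone()
    if champion is None:
        return None
    row = conn.execute(
        "SELECT model_id FROM model_registry "
        "WHERE deployment_status = 'challenger' "
        "AND production_accuracy_last_week >= ? "
        "ORDER BY production_accuracy_last_week DESC LIMIT 1",
        (champion[0] + min_gain,),
    ).fetchone()
    return row[0] if row else None

# Made-up versions for illustration only
register(conn, ("rf_a", "2025-03-08", 0.58, 0.55, "champion"))
register(conn, ("rf_b", "2025-03-15", 0.59, 0.58, "challenger"))

print(best_challenger(conn))
```

For a challenger, `production_accuracy_last_week` would typically come from shadow‑mode predictions scored once ground truth arrives.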

---

## 46.9 Automated Retraining

The ultimate goal is to automate the entire retraining loop: monitor → detect → retrain → validate → deploy → monitor again. This is sometimes called **auto‑pilot** mode.

### 46.9.1 Triggering Retraining from Drift Detection

We can extend the drift detection system from Chapter 45 to trigger a retraining pipeline when drift exceeds a threshold.

**Simplified logic:**

```python
def monitor_and_retrain():
    drift_detected = check_for_drift()
    if drift_detected:
        # Trigger retraining pipeline (e.g., via Airflow API)
        trigger_retraining_dag()
```

In practice, you might use a tool like **Argo Events** or **AWS Lambda** to respond to drift metrics.
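One way to fill in `trigger_retraining_dag()` is to call Airflow's REST API. The sketch below assumes an Airflow 2.x deployment with the stable REST API enabled, where a DAG run is created with `POST /api/v1/dags/{dag_id}/dagRuns`; other Airflow versions may expose a different path. The base URL, the PSI threshold, and the `conf` payload are hypothetical, and authentication is deliberately left out because it depends on how your Airflow instance is configured.

```python
import json
import urllib.request

def build_dag_run_request(base_url, dag_id, conf):
    """Build (without sending) a POST request that asks Airflow to start a DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )

def monitor_and_retrain(psi_value, psi_threshold=0.2, send=None):
    """Trigger the retraining DAG when drift exceeds the threshold."""
    if psi_value <= psi_threshold:
        return None
    request = build_dag_run_request(
        "http://localhost:8080",   # hypothetical Airflow webserver URL
        "nepse_retrain",           # DAG id from the weekly retraining example
        {"reason": "drift", "psi": psi_value},
    )
    if send is not None:
        send(request)  # e.g. urllib.request.urlopen, once authentication is added
    return request

request = monitor_and_retrain(0.35)
print(request.full_url)
```

Keeping request construction separate from sending makes the trigger easy to unit‑test without a running Airflow instance.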

### 46.9.2 Safe Rollout of Automatically Retrained Models

Automated retraining carries risk: a model trained on anomalous data could be worse than the current one. Therefore, any automatically trained model should go through validation gates:

- **Unit tests**: Ensure feature schemas match.
- **Performance on holdout set**: Must beat a baseline (e.g., previous model or simple heuristic).
- **Shadow deployment**: Run the new model in shadow mode for a period, logging its predictions for comparison without impacting users.
- **Canary deployment**: Gradually roll out the new model while monitoring.

Only after all gates pass should the model become the new champion.
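The gates can be expressed as an ordered list of checks that stops at the first failure. This is a minimal sketch under stated assumptions: the `CandidateReport` fields, the `evaluate_gates` helper, and the thresholds are all hypothetical, and in a real pipeline each value would be produced by a separate stage (unit tests, holdout evaluation, shadow run, canary rollout).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class CandidateReport:
    """Metrics collected for a retrained model (all fields are hypothetical)."""
    feature_names: List[str]
    holdout_accuracy: float
    shadow_accuracy: float
    canary_error_rate: float

def evaluate_gates(candidate: CandidateReport,
                   expected_features: List[str],
                   baseline_accuracy: float,
                   champion_shadow_accuracy: float,
                   max_canary_error_rate: float = 0.01) -> Tuple[bool, Optional[str]]:
    """Run the gates in order; return (passed, name_of_first_failed_gate)."""
    gates = [
        ("schema", candidate.feature_names == expected_features),
        ("holdout", candidate.holdout_accuracy > baseline_accuracy),
        ("shadow", candidate.shadow_accuracy >= champion_shadow_accuracy),
        ("canary", candidate.canary_error_rate <= max_canary_error_rate),
    ]
    for name, passed in gates:
        if not passed:
            return False, name
    return True, None

# Made-up numbers for illustration
candidate = CandidateReport(["Open", "High", "Low", "Volume"], 0.59, 0.57, 0.002)
print(evaluate_gates(candidate, ["Open", "High", "Low", "Volume"],
                     baseline_accuracy=0.56, champion_shadow_accuracy=0.58))
```

Reporting the name of the failed gate gives the alerting system something concrete to show when a candidate is rejected.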

---

## 46.10 Best Practices for Continuous Retraining

1. **Automate the pipeline** – Manual retraining is error‑prone and doesn’t scale.
2. **Version everything** – Data, code, and models should be versioned together.
3. **Monitor model performance in production** – Use the same metrics you used in validation.
4. **Set up alerts for performance degradation** – Don’t rely solely on scheduled retraining.
5. **Use a holdout set that reflects production** – Time‑series cross‑validation is essential.
6. **Consider the cost‑benefit** – Retraining too often wastes resources; retraining too rarely leads to stale models.
7. **Test the retraining process itself** – Ensure your pipeline can handle data gaps, schema changes, etc.
8. **Keep a human in the loop for critical decisions** – For high‑stakes applications like trading, fully automated deployment may require regulatory approval.
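To make practice 5 concrete, here is a minimal sketch of expanding‑window (walk‑forward) splits, in which the training rows always precede the test rows in time. The `walk_forward_splits` helper is hypothetical and assumes the rows are already sorted by date; scikit‑learn's `TimeSeriesSplit` offers a ready‑made version of the same idea.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs where training data
    always comes strictly before the test data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits.")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Ten time-ordered rows, three folds of two test rows each
for train_idx, test_idx in walk_forward_splits(n_samples=10, n_splits=3, test_size=2):
    print(f"train up to row {train_idx[-1]}, test rows {test_idx}")
```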

---

## Chapter Summary

In this chapter, we explored the spectrum of continuous retraining strategies for time‑series prediction systems like the NEPSE stock predictor. We covered:

- The three main triggers for retraining: time‑based, performance‑based, and drift‑based.
- Batch retraining with scheduled jobs, including sliding windows to focus on recent data.
- Incremental and online learning techniques that update models continuously without full retraining.
- Active learning to selectively label the most informative examples.
- Building continuous training pipelines with orchestration tools like Apache Airflow.
- A/B testing and shadow deployments to safely validate new model versions.
- Automated retraining loops that respond to drift and performance degradation.

By implementing a robust retraining strategy, you ensure that your NEPSE prediction system remains accurate and reliable even as market dynamics evolve. In the next chapter, we will discuss **A/B Testing and Model Comparison** in greater depth, focusing on statistical rigor and experiment design.

---

**End of Chapter 46**

