# Chapter 83: Multi-Model Systems

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand why a single model may not be sufficient for complex time‑series prediction tasks.
- Distinguish between different multi‑model patterns: ensemble, routing, cascade, and expert systems.
- Implement ensemble methods (bagging, boosting, stacking) in the context of time‑series, using the NEPSE stock prediction system as an example.
- Design a model router that dynamically selects the best model based on market conditions or input features.
- Build a cascade system where a simple model handles easy cases and a complex model is invoked only when needed.
- Evaluate the performance of multi‑model systems, including per‑model accuracy and overall system latency.
- Monitor for model drift and automatically adjust routing or retrain individual models.
- Understand the trade‑offs between accuracy, complexity, and computational cost.

---

## **83.1 Introduction to Multi‑Model Systems**

In previous chapters, we focused on building and deploying a single model for a given prediction task. However, real‑world time‑series prediction often benefits from using **multiple models** in concert. Reasons include:

- **Diverse market regimes**: A single model may perform well during stable periods but fail during high volatility. A regime‑switching approach can use different models for different conditions.
- **Complementary strengths**: Some models capture linear trends well, while others excel at non‑linear patterns. Combining them can yield better overall performance.
- **Robustness**: Ensembles reduce the risk of choosing a single “wrong” model.
- **Computational efficiency**: A lightweight model can handle most requests, reserving a heavy model for complex cases.
- **Interpretability**: A simple model may be used for explanations, while a complex one provides accuracy.

In the NEPSE system, we might have:

- A **linear model** (e.g., ARIMA) for calm periods.
- A **tree‑based model** (e.g., XGBoost) for normal conditions.
- A **deep learning model** (e.g., LSTM) for volatile periods with complex patterns.
- An **ensemble** that averages predictions from all three.

Multi‑model systems introduce additional complexity: we must decide which model to use when, how to combine outputs, and how to maintain multiple models. This chapter explores the patterns and practical implementation.

---

## **83.2 Patterns for Multi‑Model Systems**

We can classify multi‑model systems into several patterns:

### **83.2.1 Ensemble**
Multiple models are trained on the same data (or different views of it) and their predictions are combined, typically by averaging (regression) or voting (classification). Ensembles are often more accurate and robust than any single model. Common ensemble methods include bagging, boosting, and stacking.

### **83.2.2 Model Routing**
A **router** or **selector** decides which model to use for each prediction based on input features or context. For example, a classifier might predict the current market regime (calm, volatile, trending) and then invoke the model specialised for that regime.

### **83.2.3 Cascade**
Models are arranged in a sequence. A simple, fast model is applied first; if its confidence is high enough, its prediction is used. If not, a more complex model is invoked. This is common in computer vision but can apply to time‑series (e.g., if a linear model’s residual is large, trigger an LSTM).

### **83.2.4 Expert Systems**
Different models are trained on different subsets of data (e.g., by sector, by time of day) and a rule‑based system selects the appropriate expert. In NEPSE, we might have separate models for banking stocks, hydropower stocks, etc., because they behave differently.
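
A rule‑based expert selector can be as small as a dictionary lookup. The sketch below is illustrative only: the two expert functions are hypothetical stand‑ins rather than trained models, and the sector keys and feature names are assumptions chosen to match the examples in this chapter.

```python
from typing import Callable, Dict

Expert = Callable[[Dict[str, float]], float]

def banking_expert(features: Dict[str, float]) -> float:
    # Hypothetical stand-in: persistence forecast (tomorrow = last close)
    return features["lag_1"]

def hydropower_expert(features: Dict[str, float]) -> float:
    # Hypothetical stand-in: blend of last close and 10-day moving average
    return 0.5 * features["lag_1"] + 0.5 * features["sma_10"]

# Rule table: one expert per sector
EXPERTS: Dict[str, Expert] = {
    "banking": banking_expert,
    "hydropower": hydropower_expert,
}

def select_expert(sector: str, default: Expert = banking_expert) -> Expert:
    """Rule-based selection: look up the sector, fall back to a default expert."""
    return EXPERTS.get(sector, default)

features = {"lag_1": 1000.0, "sma_10": 990.0}
prediction = select_expert("hydropower")(features)
print(prediction)
```

In a real system each entry would hold a model trained only on that sector's data, and the fallback would typically be a model trained on all sectors.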

### **83.2.5 Hybrid**
Combinations of the above, e.g., an ensemble of routed models.

In this chapter, we will implement examples of ensemble, routing, and cascade using the NEPSE dataset.

---

## **83.3 Ensemble Methods**

Ensemble methods combine multiple models to produce a single prediction. They are well‑studied and often yield state‑of‑the‑art results.

### **83.3.1 Bagging (Bootstrap Aggregating)**
Train multiple instances of the same model on different bootstrap samples of the training data. For regression, predictions are averaged; for classification, majority vote. Random forests are a classic example.

In time‑series, we must be careful with bootstrapping because temporal order matters. Instead of random sampling with replacement, we can use **block bootstrapping** (sampling blocks of consecutive days) to preserve autocorrelation. However, a simpler approach is to train on different time periods (e.g., different years) – this is more like a time‑series ensemble.
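
To make block bootstrapping concrete, here is a minimal sketch of a moving block bootstrap that returns row indices for one resample. The function name and the block length of 10 are assumptions for illustration; a suitable block length depends on how quickly the autocorrelation in the series decays.

```python
import numpy as np

def moving_block_bootstrap(n_obs, block_len, rng):
    """Return row indices for one resample built from contiguous blocks."""
    n_blocks = int(np.ceil(n_obs / block_len))
    # Each block starts at a random position that leaves room for a full block
    starts = rng.integers(0, n_obs - block_len + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])
    return idx[:n_obs]  # trim so the resample has the original length

rng = np.random.default_rng(0)
idx = moving_block_bootstrap(n_obs=100, block_len=10, rng=rng)
print(idx[:20])
```

Each bagging member would then be trained on `X.iloc[idx]` and `y.iloc[idx]` for a fresh `idx`. Lag features should be computed before resampling, because the order of blocks in the resample no longer reflects calendar time.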

### **83.3.2 Boosting**
Sequentially train models, each focusing on the errors of the previous one. Gradient boosting machines (XGBoost, LightGBM) are themselves ensembles of weak learners. In a multi‑model system, we might treat different boosting models as separate members.

### **83.3.3 Stacking (Stacked Generalization)**
Train several base models (level‑0) and then train a meta‑model (level‑1) that learns to combine their predictions optimally. The meta‑model is trained on the predictions of the base models on a hold‑out validation set.

### **83.3.4 Implementing an Ensemble for NEPSE**

We'll build a stacking ensemble with three base models:

- **ARIMA** (statistical)
- **XGBoost** (tree‑based)
- **LSTM** (deep learning)

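We'll use synthetic data as a stand‑in for historical NEPSE prices and train on a time‑based split. The LSTM is only sketched in this section; the final stack combines ARIMA and XGBoost.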

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error
import xgboost as xgb
from statsmodels.tsa.arima.model import ARIMA
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
import warnings
warnings.filterwarnings("ignore")

# Generate synthetic NEPSE data (as in Chapter 74)
def generate_nepse_data(days=1000):
    dates = pd.date_range(start='2020-01-01', periods=days, freq='B')
    prices = 1000 + np.cumsum(np.random.randn(days) * 5)
    df = pd.DataFrame({
        'date': dates,
        'close': prices,
        'volume': np.random.lognormal(12, 1, days)
    })
    # Add some features (simplified)
    df['lag_1'] = df['close'].shift(1)
    df['lag_5'] = df['close'].shift(5)
    df['sma_10'] = df['close'].rolling(10).mean()
    df['volatility'] = df['close'].rolling(20).std()
    df = df.dropna().reset_index(drop=True)
    return df

np.random.seed(42)  # fix the seed so the synthetic data is reproducible
df = generate_nepse_data(days=1500)
print(df.head())

# Prepare features and target (predict next day close)
feature_cols = ['lag_1', 'lag_5', 'sma_10', 'volatility', 'volume']
X = df[feature_cols]
y = df['close'].shift(-1).dropna()
X = X.iloc[:-1]  # align

# Time-based split
split_idx = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

# --- Base Model 1: ARIMA (univariate, using only close) ---
# For simplicity, we'll use a fixed order; in practice, tune with AIC.
arima_model = ARIMA(y_train, order=(5,1,0))
arima_fit = arima_model.fit()
# Stacking needs base-model predictions on data the models were not trained on,
# so that the meta-model learns from honest, out-of-sample forecasts.
# Section 83.3.5 generates them with walk-forward validation.
```

**Explanation:**

- We generate synthetic NEPSE data and create simple features.
- We fit the first base model, ARIMA, on the training targets with a fixed order; XGBoost and the LSTM are introduced in the next section.
- For a stacking ensemble, we need to generate predictions from base models on a validation set that was not used to train them. This requires careful time‑based splitting to avoid look‑ahead.
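
Before moving to stacking, a plain average of base forecasts is a useful baseline. The sketch below is a simplified illustration that runs with NumPy alone: it uses two stand‑in forecasters (a persistence forecast and a 10‑day moving average) in place of ARIMA and XGBoost, and its own synthetic random walk rather than the `df` above.

```python
import numpy as np

rng = np.random.default_rng(42)
close = 1000 + np.cumsum(rng.normal(0, 5, size=500))  # synthetic random-walk prices

window = 10
t = np.arange(window - 1, len(close) - 1)  # days on which a forecast is made
y_true = close[t + 1]                      # next-day close

# Stand-in base forecasts, each using only data up to and including day t
pred_persistence = close[t]
pred_sma = np.array([close[i - window + 1:i + 1].mean() for i in t])

# Simple average ensemble: equal weights, nothing to fit
pred_avg = (pred_persistence + pred_sma) / 2.0

def mae(actual, predicted):
    return float(np.mean(np.abs(actual - predicted)))

scores = {
    "persistence": mae(y_true, pred_persistence),
    "sma_10": mae(y_true, pred_sma),
    "average": mae(y_true, pred_avg),
}
for name, score in scores.items():
    print(f"{name:12s} MAE: {score:.2f}")
```

By the triangle inequality, the MAE of the average can never exceed the mean of the two individual MAEs, which is one reason equal‑weight averaging is a hard baseline to beat.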

### **83.3.5 Time‑Series Cross‑Validation for Stacking**

We'll use a walk‑forward approach to generate out‑of‑sample predictions for the base models, then train a meta‑model.

```python
def walk_forward_predictions(model_fn, X, y, initial_train_size, step=1):
    """
    Generate out-of-sample predictions using walk-forward validation.
    model_fn: function that takes (X_train, y_train) and returns a predict function.
    """
    n = len(y)
    predictions = np.full(n, np.nan)
    for i in range(initial_train_size, n, step):
        train_idx = slice(0, i)
        test_idx = i
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_test = X.iloc[[test_idx]]
        model = model_fn(X_train, y_train)
        pred = model(X_test)
        predictions[test_idx] = pred
    return predictions

# Example for XGBoost
def train_xgb(X_train, y_train):
    model = xgb.XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)
    model.fit(X_train, y_train)
    return lambda X: model.predict(X)[0]

xgb_preds = walk_forward_predictions(train_xgb, X, y, initial_train_size=500, step=1)

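# For ARIMA (univariate, so the features are ignored)
# The signature must match what walk_forward_predictions calls: model_fn(X_train, y_train)
def train_arima(X_train, y_train):
    # Pass a plain array so forecast() returns an array rather than an index-labelled Series
    fit = ARIMA(y_train.to_numpy(), order=(5,1,0)).fit()
    return lambda X: fit.forecast(steps=1)[0]  # X ignored

arima_preds = walk_forward_predictions(train_arima, pd.DataFrame(index=y.index), y, initial_train_size=500, step=1)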

# For LSTM (requires reshaping)
def train_lstm(X_train, y_train):
    # Reshape to [samples, timesteps, features] – we'll use 10 timesteps
    # This is simplified; in practice you'd create sequences
    model = Sequential([
        LSTM(50, activation='relu', input_shape=(10, X_train.shape[1])),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')
    # Create sequences
    def create_sequences(X, y, seq_len=10):
        X_seq, y_seq = [], []
        for i in range(seq_len, len(X)):
            X_seq.append(X.iloc[i-seq_len:i].values)
            y_seq.append(y.iloc[i])
        return np.array(X_seq), np.array(y_seq)
    X_seq, y_seq = create_sequences(X_train, y_train)
    model.fit(X_seq, y_seq, epochs=10, verbose=0)
    def predict(X_new):
        # X_new is a single row; need last 10 rows from training to form sequence
        # This is tricky in walk-forward; for simplicity, we'll skip LSTM in this example.
        return np.nan
    return predict

# For simplicity, we'll proceed with xgb_preds and arima_preds only.

# Keep only the rows where both base models produced a prediction
valid_idx = ~np.isnan(xgb_preds) & ~np.isnan(arima_preds)
X_meta = pd.DataFrame({
    'xgb': xgb_preds[valid_idx],
    'arima': arima_preds[valid_idx]
})
y_meta = y[valid_idx].reset_index(drop=True)

# Time-based split of the meta-dataset: fit on the first 80%, evaluate on the last 20%
from sklearn.linear_model import LinearRegression
meta_split = int(len(X_meta) * 0.8)
meta_model = LinearRegression()
meta_model.fit(X_meta.iloc[:meta_split], y_meta.iloc[:meta_split])

# Evaluate on the held-out tail, which the meta-model has not seen
X_test_meta = X_meta.iloc[meta_split:]
y_test_meta = y_meta.iloc[meta_split:]
y_pred_meta = meta_model.predict(X_test_meta)
mae = mean_absolute_error(y_test_meta, y_pred_meta)
print(f"Stacking ensemble MAE: {mae:.2f}")
```

**Explanation:**

- `walk_forward_predictions` simulates a rolling forecast: for each day from `initial_train_size` onward, train the model on all previous data and predict the next day. This yields out‑of‑sample predictions that can be used to train the meta‑model without look‑ahead bias.
- We implement this for XGBoost and ARIMA (LSTM would require careful sequence handling).
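- The meta‑model (linear regression) learns a weighted combination of the base models' predictions by least squares.
- Stacking can outperform simple averaging because the meta‑model weights each base model by how well its out‑of‑sample predictions tracked the target over the meta‑training period. The gain is not guaranteed and should be checked against a plain average on held‑out data.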
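
The sequence handling that the LSTM needs can be isolated in two small helpers. This is a sketch under one assumption: as elsewhere in this chapter, row `i` of `X` holds features known on day `i` and `y[i]` is the next day's close, so a window that ends at row `i` may be paired with `y[i]` without look‑ahead. The helper names are hypothetical.

```python
import numpy as np

def make_windows(X, y, seq_len):
    """Training windows of shape (samples, seq_len, features).
    The window ending at row i is paired with y[i]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_seq = np.stack([X[i - seq_len + 1:i + 1] for i in range(seq_len - 1, len(X))])
    y_seq = y[seq_len - 1:]
    return X_seq, y_seq

def window_for_row(X, row_idx, seq_len):
    """Single prediction window ending at row_idx, shaped (1, seq_len, features)."""
    X = np.asarray(X, dtype=float)
    if row_idx < seq_len - 1:
        raise ValueError("not enough history to build a full window")
    return X[row_idx - seq_len + 1:row_idx + 1][np.newaxis, ...]

# Toy data: 20 rows, 2 features
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.arange(20, dtype=float)
X_seq, y_seq = make_windows(X_demo, y_demo, seq_len=5)
last_window = window_for_row(X_demo, row_idx=19, seq_len=5)
print(X_seq.shape, y_seq.shape, last_window.shape)
```

In a walk‑forward loop, the model would be fitted on `make_windows(X[:i], y[:i], seq_len)` and the forecast for step `i` would use `window_for_row(X, i, seq_len)`, which means the prediction function needs access to the full feature history rather than a single row.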

---

## **83.4 Model Routing**

In model routing, we dynamically select which model to use for each prediction based on the input features or the current market state. This is particularly useful when different models excel under different conditions.

### **83.4.1 Regime Detection as a Router**

We can build a regime classifier that predicts the market state (e.g., using features like volatility, volume, trend strength). Then, for each regime, we have a specialised prediction model.

Steps:

1. **Label historical data** with regimes (e.g., using volatility thresholds, trend following indicators, or clustering).
2. **Train a classifier** (e.g., random forest) to predict the regime from features.
3. **Train a separate prediction model** for each regime.
4. At inference, first predict the regime, then use the corresponding model.

### **83.4.2 Implementing a Router for NEPSE**

We'll define three regimes based on 20‑day volatility:

- **Low volatility**: rolling std < 20
- **Medium volatility**: 20 ≤ rolling std < 40
- **High volatility**: rolling std ≥ 40

```python
# Add volatility to dataframe
df['volatility_20'] = df['close'].rolling(20).std()

# Define regimes
def label_regime(row):
    if row['volatility_20'] < 20:
        return 'low'
    elif row['volatility_20'] < 40:
        return 'medium'
    else:
        return 'high'

df['regime'] = df.apply(label_regime, axis=1)

# Shift target (next day close)
df['target'] = df['close'].shift(-1)
# Reset the index so that the positional split below lines up with the index labels
df = df.dropna().reset_index(drop=True)

# Train a regime classifier. It predicts the current day's regime from features
# that are known at prediction time, so no future information is used.
feature_cols = ['lag_1', 'lag_5', 'sma_10', 'volume']
X_class = df[feature_cols]
y_class = df['regime']

# Time-based split
split = int(0.8 * len(df))
X_train_class, X_test_class = X_class.iloc[:split], X_class.iloc[split:]
y_train_class, y_test_class = y_class.iloc[:split], y_class.iloc[split:]

# Train classifier (simple random forest)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_class, y_train_class)

# Evaluate classifier accuracy
print("Classifier accuracy:", clf.score(X_test_class, y_test_class))

# Now train separate regression models for each regime
is_train = np.arange(len(df)) < split  # positional mask: training period only
models = {}
for regime in ['low', 'medium', 'high']:
    mask = (df['regime'] == regime).to_numpy() & is_train
    X_reg = df.loc[mask, feature_cols]
    y_reg = df.loc[mask, 'target']
    if len(X_reg) > 0:
        model = xgb.XGBRegressor(n_estimators=100, max_depth=5)
        model.fit(X_reg, y_reg)
        models[regime] = model
    else:
        models[regime] = None

# Global fallback model, trained on the whole training period
global_model = xgb.XGBRegressor(n_estimators=100, max_depth=5)
global_model.fit(df.loc[is_train, feature_cols], df.loc[is_train, 'target'])

# Predict on test set using router
y_pred = []
for i in range(split, len(df)):
    X_row = df.iloc[[i]][feature_cols]  # one-row DataFrame keeps dtypes and feature names
    # Predict regime
    regime = clf.predict(X_row)[0]
    model = models.get(regime)
    if model is not None:
        pred = model.predict(X_row)[0]
    else:
        # Fall back to the global model when no regime-specific model exists
        pred = global_model.predict(X_row)[0]
    y_pred.append(pred)

y_true = df['target'].iloc[split:].values
mae = mean_absolute_error(y_true, y_pred)
print(f"Routing model MAE: {mae:.2f}")
```

**Explanation:**

- We label each day with a volatility regime based on the current day's volatility.
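- A classifier is trained to predict the same day's regime from that day's features, so no look‑ahead is involved. Because the label here is a direct function of `volatility_20`, which is itself known at prediction time, the regime could simply be computed from the thresholds. The classifier is kept to illustrate the pattern, which matters more when the regime must be forecast (e.g., tomorrow's regime) or comes from labels that are unavailable at inference.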
- Separate XGBoost models are trained for each regime using only data from that regime.
- At inference, we first classify the regime, then use the corresponding model.
- If a regime model doesn't exist (e.g., not enough training data), we fall back to a global model.

---

## **83.5 Cascade Systems**

A cascade system processes predictions through a sequence of models, typically starting with a cheap, simple model and only invoking a more expensive model if the simple model's confidence is low.

### **83.5.1 Designing a Cascade for NEPSE**

We can use a linear model (e.g., ARIMA) as the first stage. If its prediction error on recent known data is high, or if the predicted value has high uncertainty (e.g., wide confidence interval), we invoke a complex model (e.g., XGBoost or LSTM).

To measure confidence, we can use:

- The standard error of the ARIMA forecast.
- The residual from a recent validation period.
- A separate uncertainty model.
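
The second option, recent residuals, can be tracked incrementally without refitting anything. The sketch below is one possible design, not a prescribed one: the class name, the 5‑day window, and the threshold are assumptions.

```python
from collections import deque

class ResidualGate:
    """Tracks the simple model's recent absolute errors and decides
    whether the cascade should escalate to the complex model."""

    def __init__(self, window=5, threshold=10.0):
        self.errors = deque(maxlen=window)  # keeps only the last `window` errors
        self.threshold = threshold

    def update(self, prediction, actual):
        # Call once the actual value for a past forecast becomes known
        self.errors.append(abs(actual - prediction))

    def recent_mae(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def should_escalate(self):
        # Escalate only when the window is full and the recent MAE is too high
        return len(self.errors) == self.errors.maxlen and self.recent_mae() > self.threshold

gate = ResidualGate(window=3, threshold=2.0)
for pred, actual in [(10.0, 11.0), (10.0, 12.0), (10.0, 13.0)]:
    gate.update(pred, actual)
calm = gate.should_escalate()      # recent MAE equals the threshold, so no escalation

gate.update(10.0, 20.0)            # one large miss pushes the recent MAE up
stressed = gate.should_escalate()
print(calm, stressed)
```

Requiring a full window before escalating avoids reacting to the first one or two errors after start‑up; whether that is desirable depends on how costly an early miss is.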

### **83.5.2 Implementing a Simple Cascade**

```python
from statsmodels.tsa.arima.model import ARIMA
import numpy as np

class CascadePredictor:
    def __init__(self, simple_model_fn, complex_model, threshold=5.0):
        self.simple_model_fn = simple_model_fn  # function that fits the simple model on the history so far
        self.complex_model = complex_model      # pre-trained complex model
        self.threshold = threshold              # max allowed gap between simple forecast and recent average

    def predict(self, X, y_history, date_idx):
        """
        X: features for the current prediction (used by the complex model)
        y_history: sequence of actuals observed so far
        date_idx: number of observations in y_history available for fitting
        """
        # Fit the simple model on observations 0 .. date_idx-1
        simple = self.simple_model_fn(y_history[:date_idx])
        # One-step-ahead forecast from the simple model
        simple_pred = simple.forecast(steps=1)[0]

        # A stricter confidence check would score the simple model's recent
        # one-step forecasts against the actuals, which requires storing past
        # forecasts or refitting repeatedly. This example uses a cheaper heuristic:
        # if the simple forecast deviates too much from the recent average,
        # treat the situation as unusual and escalate to the complex model.
        recent_avg = np.mean(y_history[date_idx-5:date_idx]) if date_idx >= 5 else simple_pred
        if abs(simple_pred - recent_avg) > self.threshold:
            # Use complex model
            X_input = X.values.reshape(1, -1)
            complex_pred = self.complex_model.predict(X_input)[0]
            return complex_pred
        else:
            return simple_pred

# Example usage
# We need to pre-train a complex model on the entire training set
complex_model = xgb.XGBRegressor(n_estimators=100)
complex_model.fit(X_train, y_train)

# Simple model function (ARIMA with fixed order)
# Named fit_arima to avoid clashing with the walk-forward train_arima, which has a different signature
def fit_arima(y):
    model = ARIMA(y, order=(5,1,0))
    return model.fit()

cascade = CascadePredictor(fit_arima, complex_model, threshold=10.0)

# Simulate walk-forward
y_history = y_train.tolist()  # start with training data
predictions = []
for i in range(len(X_test)):
    X_row = X_test.iloc[i]
    pred = cascade.predict(X_row, y_history, len(y_history))
    predictions.append(pred)
    # Append actual test value to history (simulating we observe it next day)
    y_history.append(y_test.iloc[i])

mae = mean_absolute_error(y_test, predictions)
print(f"Cascade MAE: {mae:.2f}")
```

**Explanation:**

- The cascade predictor uses a simple ARIMA model that is retrained on all available history for each step (computationally heavy; in practice you might use a rolling ARIMA or a pre‑fitted model with updates).
- The decision to invoke the complex model is based on a simple heuristic: if the simple model's prediction deviates from the recent average by more than a threshold, we assume the situation is unusual and use the complex model.
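- The complex model is invoked only for the flagged cases. Whether this saves computation depends on the relative costs: here the ARIMA refit at every step is itself expensive, so savings materialise only when the first stage is genuinely cheap (e.g., a pre‑fitted model with incremental updates). Compare the cascade's MAE with that of the complex model alone to confirm that accuracy holds up.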

---

## **83.6 Multi‑Model System Architecture**

In a production system, multi‑model logic needs to be integrated into the prediction service. We can extend the microservices architecture from Chapter 81:

- **Model Registry** now stores multiple models, each with metadata about its type, regime, or ensemble weight.
- **Prediction Service** contains a **router** component that, based on input features, decides which model(s) to invoke.
- For ensembles, the prediction service may call multiple models in parallel and combine results.
- For cascades, it may call models sequentially.

We can also have a dedicated **Ensemble Service** that handles the combination logic, but keeping it within the prediction service is simpler.
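
The interplay between registry metadata and the router can be sketched in a few lines. Everything here is hypothetical scaffolding: `ModelEntry`, `PredictionService`, and the stub models are illustrative names, the registry is an in‑memory list rather than a real model registry, and the router collapses the regimes into just `low` and `high` for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Features = Dict[str, float]

@dataclass
class ModelEntry:
    name: str
    predict: Callable[[Features], float]
    regime: str = "any"   # regime this model serves; "any" means always included
    weight: float = 1.0   # ensemble weight used when several models are selected

class PredictionService:
    def __init__(self, registry, route):
        self.registry = list(registry)
        self.route = route  # function mapping features to a regime name

    def predict(self, features):
        regime = self.route(features)
        selected = [m for m in self.registry if m.regime in (regime, "any")]
        if not selected:
            raise LookupError(f"no model registered for regime {regime!r}")
        total = sum(m.weight for m in selected)
        value = sum(m.weight * m.predict(features) for m in selected) / total
        # Return which models were used so the monitoring layer can log it
        return {"prediction": value, "regime": regime,
                "models": [m.name for m in selected]}

def volatility_router(features):
    return "high" if features["volatility"] >= 40 else "low"

# Stub models standing in for trained ones
registry = [
    ModelEntry("baseline", lambda f: f["lag_1"], regime="any", weight=1.0),
    ModelEntry("calm_stub", lambda f: f["lag_1"], regime="low", weight=3.0),
    ModelEntry("volatile_stub", lambda f: f["sma_10"], regime="high", weight=3.0),
]
service = PredictionService(registry, volatility_router)
result = service.predict({"lag_1": 1000.0, "sma_10": 996.0, "volatility": 55.0})
print(result)
```

Returning the list of models alongside the prediction keeps routing decisions observable, which the monitoring discussion in the next section relies on.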

---

## **83.7 Monitoring and Maintenance**

With multiple models, monitoring becomes more complex. We need to track:

- **Per‑model performance** (MAE, bias) over time, to detect when a model degrades.
- **Router accuracy** (if using a classifier), to ensure it's selecting the right model.
- **Ensemble weights** (if using stacking) – they may drift and need recalibration.
- **Latency** – cascades may have variable latency depending on how often the complex model is invoked.

We can extend the monitoring service from Chapter 73 to log which model was used for each prediction and compute per‑model metrics. Alerts can be set up if a particular model's performance drops below a threshold.
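
As an illustration of per‑model metrics, the sketch below aggregates a small prediction log with pandas. The log rows, model names, and alert threshold are made up for the example; a real log would come from the prediction service.

```python
import pandas as pd

# Made-up prediction log: which model served each request, and what happened
log = pd.DataFrame({
    "model": ["xgb_low", "xgb_low", "xgb_high", "xgb_high", "xgb_low"],
    "prediction": [1002.0, 998.0, 1010.0, 990.0, 1001.0],
    "actual": [1000.0, 1000.0, 1000.0, 1000.0, 1000.0],
})
log["error"] = log["prediction"] - log["actual"]
log["abs_error"] = log["error"].abs()

# Per-model request count, MAE, and bias (mean signed error)
per_model = log.groupby("model").agg(
    n=("error", "size"),
    mae=("abs_error", "mean"),
    bias=("error", "mean"),
)

mae_alert_threshold = 5.0  # illustrative threshold
degraded = per_model.index[per_model["mae"] > mae_alert_threshold].tolist()
print(per_model)
print("Models breaching the MAE threshold:", degraded)
```

In production the same aggregation would usually run over a rolling time window so that old errors do not mask recent degradation.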

### **83.7.1 Automated Retraining and Recalibration**

- For routing models, if the classifier's accuracy drops, we may need to retrain it with new regime labels.
- For stacking, the meta‑model weights can be recalibrated periodically using recent out‑of‑sample predictions.
- For individual models, retraining schedules may differ (e.g., complex models retrained less often).
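
Recalibrating stacking weights amounts to refitting the meta‑model on a recent window of out‑of‑sample base predictions. The following sketch uses ordinary least squares via NumPy; the function name, the 250‑row window, and the synthetic inputs are assumptions for illustration.

```python
import numpy as np

def recalibrate_weights(base_preds, actuals, window=250):
    """Least-squares stacking weights (with intercept) fitted on the
    most recent `window` rows of out-of-sample base predictions."""
    P = np.asarray(base_preds, dtype=float)[-window:]
    y = np.asarray(actuals, dtype=float)[-window:]
    A = np.column_stack([np.ones(len(P)), P])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

# Synthetic check: the target is built with known weights 0.7 and 0.3
rng = np.random.default_rng(0)
xgb_like = rng.normal(1000, 20, size=400)
arima_like = rng.normal(1000, 20, size=400)
actuals = 0.7 * xgb_like + 0.3 * arima_like

intercept, weights = recalibrate_weights(
    np.column_stack([xgb_like, arima_like]), actuals
)
print("intercept:", round(float(intercept), 4), "weights:", np.round(weights, 4))
```

A shorter window adapts faster to regime changes but gives noisier weights; comparing the new weights with the previous ones before deploying them is a cheap sanity check.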

---

## **83.8 Trade‑offs and Best Practices**

### **83.8.1 Accuracy vs. Complexity**
Adding more models generally improves accuracy up to a point, but the gain diminishes. Consider the computational cost and maintenance overhead.

### **83.8.2 Latency**
Ensembles that run models in parallel can be acceptable if each model is fast. Cascades can reduce average latency but may have high latency for the worst‑case path.
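
The cascade latency trade‑off can be made concrete with simple arithmetic: the first stage always runs, and the second stage runs only for the fraction of requests that escalate. The timings below are illustrative figures, not measurements.

```python
def cascade_latency(t_simple_ms, t_complex_ms, escalation_rate):
    """Return (expected, worst-case) latency in milliseconds for a two-stage cascade."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    expected = t_simple_ms + escalation_rate * t_complex_ms  # stage 1 always runs
    worst = t_simple_ms + t_complex_ms                       # escalated requests pay for both
    return expected, worst

expected, worst = cascade_latency(t_simple_ms=2.0, t_complex_ms=40.0, escalation_rate=0.10)
print(f"expected: {expected:.1f} ms, worst case: {worst:.1f} ms")
```

Note that the worst case is slower than calling the complex model directly, so a cascade only pays off when the escalation rate is low enough.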

### **83.8.3 Interpretability**
Simple models are more interpretable. In a multi‑model system, you can still provide explanations from the simple model when it is used, or use SHAP on the ensemble (though it becomes complex).

### **83.8.4 Testing**
Test each model individually and the combined system. Ensure that the routing logic itself doesn't introduce bias.

### **83.8.5 Start Simple**
Begin with a single strong model. Add multi‑model complexity only when you have evidence that it improves performance on a hold‑out set.

---

## **83.9 Case Study: Multi‑Model NEPSE System in Production**

Imagine a production NEPSE system that uses:

- A **LightGBM model** as the primary workhorse.
- An **LSTM model** for high‑volatility regimes, triggered by a router.
- A **simple ARIMA** as a fallback if the primary model is unavailable.
- An **ensemble** of the LightGBM and LSTM for the final prediction during volatile periods.

The system logs predictions and model usage. Once a week, a batch job recomputes the router's classifier and retrains the LightGBM model on new data. The LSTM is retrained monthly due to higher computational cost.

If the ensemble's performance degrades, an alert is sent to the data science team.

---

## **83.10 Future Directions**

- **Automated model selection**: Use meta‑learning to choose the best model architecture based on time‑series characteristics.
- **Neural architecture search** for time‑series: automatically discover the best model combination.
- **Online learning** for routers: adapt the routing policy in real time based on recent performance.

---

## **Chapter Summary**

In this chapter, we explored multi‑model systems for time‑series prediction. We discussed ensemble methods (bagging, boosting, stacking), model routing based on regime detection, and cascade systems that trade off accuracy for computational efficiency. We implemented a stacking ensemble for NEPSE using walk‑forward validation, a router using volatility regimes, and a simple cascade. We also covered monitoring, maintenance, and best practices. Multi‑model systems can significantly improve robustness and accuracy, but they require careful design and monitoring. In the next chapter, we will delve into **Real‑Time Learning Systems**, where models update continuously as new data arrives.

---

**End of Chapter 83**

