# Chapter 37: Error Analysis

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the critical role of systematic error analysis in model development and refinement
- Identify and visualize systematic error patterns in time‑series predictions
- Perform residual analysis to diagnose model misspecification, including tests for autocorrelation and heteroscedasticity
- Analyze error distributions and their properties (skewness, kurtosis, normality)
- Conduct conditional error analysis to determine when and why the model fails (e.g., error as a function of feature values, market regimes)
- Detect temporal error patterns such as autocorrelation and volatility clustering
- Assess the impact of outliers on model performance using influence measures
- Cluster errors to identify distinct regimes of poor performance
- Perform root cause analysis to trace errors back to data issues, feature gaps, or model limitations
- Develop an error analysis workflow and document findings for iterative model improvement

---

## **37.1 Introduction to Error Analysis**

Error analysis is the systematic investigation of a model's prediction errors to understand their nature, causes, and potential remedies. While evaluation metrics (RMSE, MAE, etc.) give a summary of performance, they do not reveal *why* the model makes mistakes or *when* it is most unreliable. Error analysis fills this gap by examining the residuals (errors) in detail, uncovering patterns that can guide feature engineering, model selection, and data collection.

For the NEPSE prediction system, error analysis can answer questions like:

- Does the model consistently overpredict on Mondays?
- Are errors larger on high‑volatility days?
- Does the model fail to capture sudden reversals?
- Are there specific stocks or time periods where performance degrades?

By answering these questions, we can iteratively improve the model.

---

## **37.2 Residual Analysis Basics**

Residuals are the differences between actual and predicted values:  
`eᵢ = yᵢ - ŷᵢ`

A good model should have residuals that are:

- **Unbiased:** Mean close to zero.
- **Homoscedastic:** Constant variance over time.
- **Uncorrelated:** No autocorrelation.
- **Approximately normally distributed** (optional, but useful for inference).

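We will examine these properties on return predictions from a sample model (e.g., a random forest). In practice these would be NEPSE return predictions; the code below uses synthetic residuals so that it runs on its own.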

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
from scipy import stats
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Assume we have trained a model and have predictions on a test set
# For demonstration, we'll create synthetic residuals
np.random.seed(42)
y_test = np.random.randn(200) * 2  # actual returns
y_pred = y_test + np.random.randn(200) * 0.5  # predictions with noise
residuals = y_test - y_pred

# Basic statistics
print(f"Mean of residuals: {np.mean(residuals):.4f}")
print(f"Std of residuals: {np.std(residuals):.4f}")
print(f"Skewness: {stats.skew(residuals):.4f}")
# stats.kurtosis uses the Fisher definition: 0 for a normal distribution
print(f"Excess kurtosis: {stats.kurtosis(residuals):.4f}")

# Plot residuals over time (index)
plt.figure(figsize=(12,4))
plt.plot(residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Test sample index')
plt.ylabel('Residual')
plt.title('Residuals over time')
plt.show()
```

**Explanation:**  
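We compute basic statistics: a mean near zero indicates unbiasedness. Skewness and kurtosis describe the shape of the distribution (`stats.kurtosis` reports excess kurtosis, so a normal distribution scores 0). A residual plot helps spot trends, clusters, or changing variance.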
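
Whether a mean is "near zero" can be checked more formally with a one-sample t-test. The sketch below is illustrative only: `residuals_demo` is a synthetic array with a deliberately planted bias, not output from the NEPSE model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic residuals with a planted positive bias (assumption for illustration)
residuals_demo = rng.normal(loc=0.5, scale=1.0, size=200)

# H0: the mean residual is zero (no systematic over- or under-prediction)
t_stat, p_value = stats.ttest_1samp(residuals_demo, popmean=0.0)
print(f"Mean residual: {residuals_demo.mean():.4f}")
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4g}")

biased = p_value < 0.05
print("Evidence of bias" if biased else "No strong evidence of bias")
```

The t-test assumes independent residuals. If the residuals are autocorrelated, the p-value will be too optimistic and should be read with caution.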

---

## **37.3 Systematic Error Patterns**

We can look for patterns by plotting residuals against:

- Predicted values
- Individual features
- Time (day of week, month, etc.)

### **37.3.1 Residuals vs. Predicted**

A random scatter around zero is ideal. A funnel shape suggests heteroscedasticity.

```python
plt.figure(figsize=(12,4))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.title('Residuals vs. Predicted')
plt.show()
```
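
A funnel shape can also be quantified rather than judged by eye. One simple option, sketched here on synthetic data, is the Spearman rank correlation between the predictions and the absolute residuals. The names `pred_demo` and `resid_demo` are placeholders, and the noise is constructed to grow with the prediction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pred_demo = rng.uniform(0.5, 3.0, size=300)
# Noise scale grows with the prediction, producing a funnel by construction
resid_demo = rng.normal(0.0, 1.0, size=300) * pred_demo

# A positive rank correlation means the error magnitude rises with the prediction
rho, p_value = stats.spearmanr(pred_demo, np.abs(resid_demo))
print(f"Spearman rho (prediction vs |residual|): {rho:.3f}, p-value: {p_value:.4g}")
```

A rho close to zero is consistent with constant variance; a clearly positive or negative value suggests that the spread of the errors depends on the predicted level.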

### **37.3.2 Residuals vs. Features**

Plot residuals against each feature to see if errors depend on feature values. For example, if errors are larger when RSI is extreme, the model may not handle overbought/oversold conditions well.

```python
# Assume we have a DataFrame X_test with feature columns
# For demonstration, we'll create a synthetic feature
X_test = pd.DataFrame({
    'RSI': np.random.uniform(0, 100, 200),
    'Lag1': np.random.randn(200)
})

fig, axes = plt.subplots(1, 2, figsize=(12,4))
axes[0].scatter(X_test['RSI'], residuals, alpha=0.5)
axes[0].set_xlabel('RSI')
axes[0].set_ylabel('Residual')
axes[0].axhline(y=0, color='r', linestyle='--')

axes[1].scatter(X_test['Lag1'], residuals, alpha=0.5)
axes[1].set_xlabel('Lag1')
axes[1].set_ylabel('Residual')
axes[1].axhline(y=0, color='r', linestyle='--')
plt.tight_layout()
plt.show()
```

### **37.3.3 Residuals by Time Categories**

If we have datetime information, we can group residuals by day of week, month, etc.

```python
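# Assume we have a test set with dates
# Note: freq='B' generates Monday-Friday dates; with real data, use the exchange's actual trading dates
dates = pd.date_range('2023-01-01', periods=200, freq='B')
df_test = pd.DataFrame({'Date': dates, 'Residual': residuals})
df_test['DayOfWeek'] = df_test['Date'].dt.dayofweek  # 0 = Monday, ..., 6 = Sunday
df_test['Month'] = df_test['Date'].dt.month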

# Boxplot by day of week
df_test.boxplot(column='Residual', by='DayOfWeek')
plt.title('Residuals by Day of Week')
plt.suptitle('')
plt.show()
```

**Explanation:**  
If the median residual is not zero on certain days, the model may have systematic bias related to day‑of‑week effects.
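
To back up the visual impression with a test, the residuals can be compared across days with a Kruskal-Wallis test, which does not assume normality. This is a minimal sketch on synthetic data; the Monday bias is injected on purpose so the test has something to detect.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
dates_demo = pd.date_range("2023-01-02", periods=250, freq="B")
df_demo = pd.DataFrame({"Date": dates_demo, "Residual": rng.normal(size=250)})
df_demo["DayOfWeek"] = df_demo["Date"].dt.dayofweek  # 0 = Monday

# Inject a bias on Mondays (assumption for illustration)
df_demo.loc[df_demo["DayOfWeek"] == 0, "Residual"] += 1.0

# Per-day summary: count, mean, and median residual
summary = df_demo.groupby("DayOfWeek")["Residual"].agg(["count", "mean", "median"])
print(summary)

# H0: residuals come from the same distribution on every day of the week
groups = [g["Residual"].to_numpy() for _, g in df_demo.groupby("DayOfWeek")]
h_stat, p_value = stats.kruskal(*groups)
print(f"Kruskal-Wallis H: {h_stat:.3f}, p-value: {p_value:.4g}")
```

A small p-value only says that at least one day differs; the summary table shows which one.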

---

## **37.4 Residual Distribution Analysis**

We can check if residuals are approximately normal using a Q‑Q plot and statistical tests.

```python
# Q-Q plot
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()

# Shapiro-Wilk test for normality
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
if shapiro_p > 0.05:
    print("Residuals appear normally distributed (fail to reject H0)")
else:
    print("Residuals do not appear normally distributed")
```

**Note:** Normality is not strictly required for forecasting models, but severe non‑normality may indicate model misspecification or outliers.
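
The Jarque-Bera test is a common complement to Shapiro-Wilk because it is built directly from skewness and kurtosis. The sketch below contrasts a normal sample with a heavy-tailed one; both samples are synthetic and the variable names are placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
samples = {
    "normal": rng.normal(0.0, 1.0, size=500),
    "heavy-tailed": rng.standard_t(df=3, size=500),  # fat tails, as often seen in returns
}

results = {}
for name, r in samples.items():
    jb_stat, jb_p = stats.jarque_bera(r)
    results[name] = {
        "jb_stat": jb_stat,
        "jb_p": jb_p,
        "skew": stats.skew(r),
        "excess_kurtosis": stats.kurtosis(r),
    }
    print(f"{name}: JB={jb_stat:.2f}, p={jb_p:.4g}, "
          f"skew={results[name]['skew']:.3f}, "
          f"excess kurtosis={results[name]['excess_kurtosis']:.3f}")
```

Large positive excess kurtosis points to fat tails, meaning that extreme errors occur more often than a normal distribution would predict.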

---

## **37.5 Temporal Error Patterns**

Time‑series residuals often exhibit autocorrelation or volatility clustering.

### **37.5.1 Autocorrelation**

Autocorrelation means that errors are correlated with their own past values. This violates the independence assumption and suggests the model is missing some temporal structure (e.g., an AR term).

```python
# ACF plot
plot_acf(residuals, lags=30)
plt.show()

# Durbin-Watson test (values near 2 indicate no autocorrelation)
dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw:.4f}")
# If dw << 2, positive autocorrelation; if dw >> 2, negative autocorrelation.
```

If autocorrelation is present, consider adding lagged features or using an ARIMA model.

### **37.5.2 Heteroscedasticity (Volatility Clustering)**

In financial returns, volatility tends to cluster: large changes follow large changes. If residuals show periods of high variance followed by low variance, it suggests the model does not capture volatility dynamics.

```python
# Plot squared residuals
plt.figure(figsize=(12,4))
plt.plot(residuals**2)
plt.title('Squared Residuals (volatility proxy)')
plt.show()

# Breusch-Pagan test for heteroscedasticity
# The test regresses the squared residuals on a set of explanatory variables,
# so it needs a design matrix with a constant plus at least one regressor.
X_design = sm.add_constant(X_test)  # assuming X_test is a DataFrame of features
bp_test = het_breuschpagan(residuals, X_design)
print(f"Breusch-Pagan LM statistic: {bp_test[0]:.4f}, p-value: {bp_test[1]:.4f}")
if bp_test[1] < 0.05:
    print("Evidence of heteroscedasticity")
else:
    print("No strong evidence of heteroscedasticity")
```

**Explanation:**  
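If heteroscedasticity is present, consider models that explicitly model volatility (e.g., GARCH) or use heteroscedasticity‑robust standard errors. Note that Breusch‑Pagan tests whether the residual variance depends on the supplied features; it does not directly test for clustering in time, for which an ARCH LM test on the residuals is the usual check.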
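
Volatility clustering in time can be tested with Engle's ARCH LM test, available in statsmodels as `het_arch`. The sketch below simulates ARCH(1) residuals with assumed parameters (0.2 and 0.5) purely for illustration.

```python
import numpy as np
from statsmodels.stats.diagnostic import het_arch

rng = np.random.default_rng(5)
n = 1000

# Simulate ARCH(1) residuals: today's variance depends on yesterday's squared residual
e = np.zeros(n)
for t in range(1, n):
    sigma2 = 0.2 + 0.5 * e[t - 1] ** 2
    e[t] = rng.normal() * np.sqrt(sigma2)

# H0: no ARCH effects (squared residuals are not autocorrelated) up to 5 lags
lm_stat, lm_p, f_stat, f_p = het_arch(e, nlags=5)
print(f"ARCH LM statistic: {lm_stat:.3f}, p-value: {lm_p:.4g}")
```

A small p-value indicates that large errors tend to follow large errors, which is the signature of volatility clustering.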

---

## **37.6 Conditional Error Analysis**

We can analyze errors conditional on specific events or regimes. For example, we might want to know if errors are larger on days when the market is volatile, or when a stock hits a circuit breaker.

### **37.6.1 Error by Market Regime**

Define market regimes (e.g., high/low volatility) and compare error distributions.

```python
# Example: classify days by volatility (e.g., using VIX or rolling std)
# For demonstration, we'll create a synthetic volatility measure
volatility = np.abs(np.random.randn(200)) + 0.5
high_vol = volatility > np.percentile(volatility, 75)
low_vol = ~high_vol  # complement of the high-volatility mask

errors_high = residuals[high_vol]
errors_low = residuals[low_vol]

print(f"Mean error (high vol): {np.mean(errors_high):.4f}, std: {np.std(errors_high):.4f}")
print(f"Mean error (low vol): {np.mean(errors_low):.4f}, std: {np.std(errors_low):.4f}")

# Boxplot comparison
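plt.boxplot([errors_high, errors_low])
# Set tick labels separately; the `labels` keyword is deprecated in recent matplotlib releases
plt.xticks([1, 2], ['High Vol', 'Low Vol'])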
plt.ylabel('Residual')
plt.title('Error Distribution by Volatility Regime')
plt.show()
```
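
To check whether the difference in spread between regimes is more than noise, one option is Levene's test for equal variances. This is a sketch on synthetic data in which the residual scale is made proportional to a hypothetical volatility measure, `vol_demo`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
vol_demo = np.abs(rng.normal(size=400)) + 0.5
# Residual scale proportional to volatility: larger errors in turbulent periods
resid_demo = rng.normal(size=400) * vol_demo

high = vol_demo > np.percentile(vol_demo, 75)
std_high = resid_demo[high].std()
std_low = resid_demo[~high].std()

# H0: both regimes have the same residual variance
# center="median" gives the Brown-Forsythe variant, which is robust to non-normality
stat, p_value = stats.levene(resid_demo[high], resid_demo[~high], center="median")
print(f"Std (high vol): {std_high:.3f}, Std (low vol): {std_low:.3f}")
print(f"Levene statistic: {stat:.3f}, p-value: {p_value:.4g}")
```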

### **37.6.2 Error by Feature Value**

We can bin a feature (e.g., RSI) and compute average absolute error per bin.

```python
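bins = pd.cut(X_test['RSI'], bins=10)
# observed=False keeps empty bins in the output and avoids a pandas FutureWarning for categorical keys
grouped = pd.DataFrame({'RSI_bin': bins, 'AbsError': np.abs(residuals)}).groupby('RSI_bin', observed=False).mean()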
print(grouped)
grouped.plot(kind='bar')
plt.ylabel('Mean Absolute Error')
plt.title('MAE by RSI Bin')
plt.show()
```

**Explanation:**  
If MAE is higher for extreme RSI values, the model may not capture mean reversion well.
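
Equal-width bins can hold very different numbers of observations, which makes the per-bin MAE hard to compare. A possible alternative is quantile binning with `pd.qcut`, reporting the count next to the mean. The data below is synthetic, with errors constructed to grow as RSI moves away from 50.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
rsi_demo = rng.uniform(0, 100, size=500)
# Errors are larger when RSI is far from 50 (by construction)
scale = 0.5 + np.abs(rsi_demo - 50) / 50
abs_err_demo = np.abs(rng.normal(size=500) * scale)

df_demo = pd.DataFrame({"RSI": rsi_demo, "AbsError": abs_err_demo})
# Five quantile bins, each holding roughly the same number of rows
df_demo["RSI_bin"] = pd.qcut(df_demo["RSI"], q=5)
table = df_demo.groupby("RSI_bin", observed=True)["AbsError"].agg(["count", "mean"])
print(table)
```

Reporting the count guards against over-interpreting a high MAE that rests on only a handful of observations.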

---

## **37.7 Outlier Impact Analysis**

Outliers can disproportionately affect model performance. We need to identify influential points and decide whether to treat them.

### **37.7.1 Cook's Distance**

Cook's distance measures how much the fitted values of a linear regression change when a single observation is removed. It is defined for linear models; for tree models, we can use a brute‑force analogue: refit the model without each point and observe the change.

```python
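from sklearn.base import clone

# For linear models, Cook's distance is available from statsmodels.
# For tree models, approximate influence with a leave-one-out refit loop.
def influence_analysis(model, X, y, indices):
    """
    Estimate influence by refitting the model without each index.
    X and y should be the data the model is fitted on (normally the training set).
    This is computationally expensive, so we may sample a subset of indices.
    """
    # Baseline: the same estimator fitted on all rows, so the comparison is like-for-like
    base_pred = clone(model).fit(X, y).predict(X)
    original_rmse = np.sqrt(np.mean((y - base_pred)**2))
    influences = []
    for idx in indices:
        X_loo = np.delete(X, idx, axis=0)
        y_loo = np.delete(y, idx)
        model_loo = clone(model).fit(X_loo, y_loo)
        pred_loo = model_loo.predict(X)
        rmse_loo = np.sqrt(np.mean((y - pred_loo)**2))
        influences.append(rmse_loo - original_rmse)
    return np.array(influences)

# `model` is assumed to be an estimator with fixed hyperparameters, for example:
model = RandomForestRegressor(n_estimators=50, random_state=42)

# Sample a few points to test (the synthetic arrays stand in for training data here)
sample_idx = np.random.choice(len(residuals), size=50, replace=False)
influences = influence_analysis(model, X_test.values, y_test, sample_idx)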
plt.hist(influences, bins=20)
plt.xlabel('Change in RMSE when point removed')
plt.title('Influence of Points')
plt.show()
```

**Explanation:**  
Points whose removal changes the RMSE substantially are highly influential: the fit depends heavily on them. They may be outliers or important data points. We should examine them to see if they are data errors or genuine market events.

### **37.7.2 Identifying Outliers in Residuals**

We can flag residuals beyond a threshold (e.g., 3 standard deviations).

```python
threshold = 3 * np.std(residuals)
outliers = np.abs(residuals) > threshold
print(f"Number of outliers: {np.sum(outliers)}")
print(f"Outlier indices: {np.where(outliers)[0]}")
```

If outliers are data errors, we may correct them; if they represent extreme but real events (e.g., market crash), we may want the model to capture them, but they may dominate the loss function.
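
The standard deviation is itself inflated by the outliers it is meant to detect. A robust alternative, sketched here, is the modified z-score based on the median absolute deviation (MAD), with the conventional cutoff of 3.5. The two extreme values are planted in synthetic data for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
resid_demo = rng.normal(size=200)
resid_demo[[10, 50]] = [15.0, -12.0]  # planted extreme errors

median = np.median(resid_demo)
mad = np.median(np.abs(resid_demo - median))

# 0.6745 rescales the MAD so the score is comparable to a z-score under normality
robust_z = 0.6745 * (resid_demo - median) / mad
robust_outliers = np.where(np.abs(robust_z) > 3.5)[0]

print(f"Classical std: {np.std(resid_demo):.3f}")
print(f"Robust scale (1.4826 * MAD): {1.4826 * mad:.3f}")
print(f"Robust outlier indices: {robust_outliers}")
```

The gap between the classical and the robust scale estimate shows how strongly a few extreme residuals distort the 3‑sigma threshold.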

---

## **37.8 Error Clustering**

We can cluster error‑prone regions by using features that characterize the context of each error. For example, we can create a dataset where each row is a prediction error, along with the feature values at that time, and then cluster these rows to find groups of similar errors.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Create error context dataset
feature_cols = list(X_test.columns)  # feature columns only, captured before adding error columns
error_context = X_test.copy()
error_context['Error'] = residuals
error_context['AbsError'] = np.abs(residuals)

# Normalize features (excluding error columns)
features_for_clustering = error_context[feature_cols].values
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_for_clustering)

# Cluster the error contexts (e.g., into 3 clusters)
# n_init is set explicitly so results do not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(features_scaled)
error_context['Cluster'] = clusters

# Analyze each cluster
for cluster in range(3):
    cluster_data = error_context[error_context['Cluster'] == cluster]
    print(f"Cluster {cluster}: size {len(cluster_data)}, mean abs error {cluster_data['AbsError'].mean():.4f}")
    print("Feature means in this cluster:")
    print(cluster_data[feature_cols].mean())
    print()
```

**Explanation:**  
This helps identify regimes where errors are systematically high. For example, one cluster might have high RSI and low lagged returns, indicating the model fails in overbought conditions.
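
The number of clusters is a free choice. One common heuristic is to compare silhouette scores across candidate values of `k`. This is a sketch on synthetic data with three well-separated groups; on real error contexts the scores are usually much less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
# Three synthetic "error contexts" in a two-feature space (RSI-like, lagged-return-like)
centers = np.array([[20.0, -1.0], [50.0, 0.0], [80.0, 1.0]])
X_demo = np.vstack([c + rng.normal(scale=[3.0, 0.2], size=(60, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X_demo)

# Higher silhouette means tighter, better separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```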

---

## **37.9 Root Cause Analysis**

Once we identify problematic patterns, we need to trace them back to their causes. Possible root causes include:

- **Missing features:** The model lacks a crucial indicator (e.g., news sentiment).
- **Feature engineering issues:** Lag windows too short, wrong transformations.
- **Data quality problems:** Errors in the raw data (e.g., incorrect prices).
- **Model limitations:** The model cannot capture certain non‑linearities.
- **Temporal shifts:** The market regime has changed (concept drift).

We can use SHAP to explain individual large errors.

```python
import shap

# Assumes `model` is the fitted tree-based model that produced y_pred
explainer = shap.TreeExplainer(model)
# For a large error instance, compute SHAP values
large_error_idx = np.argmax(np.abs(residuals))
shap_values_instance = explainer.shap_values(X_test.iloc[large_error_idx:large_error_idx+1])
shap.force_plot(explainer.expected_value, shap_values_instance[0], X_test.iloc[large_error_idx])
```

**Explanation:**  
The force plot shows which features contributed most to the prediction (and thus the error). If the model's prediction was too high, we can see which features pushed it up.
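
A complementary way to locate where errors concentrate, offered here as one possible approach rather than part of the workflow above, is to fit a shallow decision tree that predicts the absolute error from the features. The splits then read as rules describing error-prone regions. The data is synthetic, with errors made larger when RSI exceeds 70.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(11)
X_demo = pd.DataFrame({
    "RSI": rng.uniform(0, 100, size=500),
    "Lag1": rng.normal(size=500),
})
# By construction, errors are much larger when RSI > 70
scale = np.where(X_demo["RSI"] > 70, 3.0, 0.5)
abs_err_demo = np.abs(rng.normal(size=500) * scale)

# A shallow tree keeps the rules readable; min_samples_leaf avoids tiny leaves
error_tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=30, random_state=0)
error_tree.fit(X_demo, abs_err_demo)
print(export_text(error_tree, feature_names=list(X_demo.columns)))

# The root split is the single most informative rule for error size
root_feature = X_demo.columns[error_tree.tree_.feature[0]]
root_threshold = error_tree.tree_.threshold[0]
print(f"Root split: {root_feature} <= {root_threshold:.1f}")
```

Rules found this way are hypotheses about root causes, not proof; they still need to be checked against the raw data and the market context.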

---

## **37.10 Iterative Improvement Workflow**

Error analysis should be an iterative process:

1. **Train initial model** and evaluate on test set.
2. **Perform error analysis** (as above) to identify patterns.
3. **Hypothesize causes** and propose changes (e.g., add new feature, transform feature, change model).
4. **Implement changes** and retrain.
5. **Validate** that the changes reduce the targeted errors without harming overall performance.
6. **Repeat**.

This is analogous to the scientific method and is key to building robust models.
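
Step 5 can be made concrete by comparing the old and new predictions both overall and on the slice of data that the change was meant to fix. The helper `compare_models` and all arrays below are hypothetical; the "old" predictions carry a planted bias on every fifth observation.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

def compare_models(y_true, pred_old, pred_new, mask):
    """RMSE overall and on a targeted slice for two sets of predictions."""
    return {
        "overall_old": rmse(y_true, pred_old),
        "overall_new": rmse(y_true, pred_new),
        "slice_old": rmse(y_true[mask], pred_old[mask]),
        "slice_new": rmse(y_true[mask], pred_new[mask]),
    }

rng = np.random.default_rng(12)
y_demo = rng.normal(size=300)
noise = rng.normal(scale=0.5, size=300)
target_slice = np.arange(300) % 5 == 0       # e.g., one weekday
pred_old = y_demo + noise + 1.0 * target_slice  # biased on the slice
pred_new = y_demo + noise                       # bias removed

report = compare_models(y_demo, pred_old, pred_new, target_slice)
for key, value in report.items():
    print(f"{key}: {value:.4f}")

# Accept the change only if the slice improves and the overall error does not get worse
accepted = (report["slice_new"] < report["slice_old"]
            and report["overall_new"] <= report["overall_old"])
print(f"Change accepted: {accepted}")
```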

---

## **37.11 Documentation**

Document your error analysis findings. For each major error pattern, note:

- Description of the pattern
- Possible causes
- Impact (e.g., how much it degrades performance)
- Proposed solution
- Whether the solution was implemented and its effect

This documentation is valuable for future debugging and for knowledge transfer.
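
One lightweight way to keep these notes consistent is a small record type that mirrors the fields listed above and serializes to JSON. The class name `ErrorFinding` and the example values are placeholders, not findings from the NEPSE model.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class ErrorFinding:
    """One documented error pattern, mirroring the checklist fields."""
    pattern: str
    possible_causes: List[str]
    impact: str
    proposed_solution: str
    implemented: bool = False
    observed_effect: str = ""

# Placeholder entry for illustration only
finding = ErrorFinding(
    pattern="Residual median differs from zero on one weekday",
    possible_causes=["Missing day-of-week feature"],
    impact="To be measured",
    proposed_solution="Add a day-of-week indicator and retrain",
)

record = json.dumps(asdict(finding), indent=2)
print(record)
```

Storing one such record per pattern, for example in a JSON file kept under version control next to the model code, makes it easy to see which hypotheses were tested and what happened.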

---

## **37.12 Chapter Summary**

In this chapter, we walked through a thorough error analysis of a time‑series forecasting model, using synthetic residuals as a stand‑in for NEPSE return predictions.

- **Residual analysis** provided basic diagnostics: mean, variance, distribution, and plots against predictions and features.
- **Temporal patterns** were examined via autocorrelation and heteroscedasticity tests.
- **Conditional error analysis** revealed how errors vary with feature values and market regimes.
- **Outlier impact analysis** identified influential points.
- **Error clustering** grouped error contexts to find systematic failure modes.
- **Root cause analysis** used SHAP to explain large errors.
- **Iterative improvement workflow** turned insights into actionable model enhancements.
- **Documentation** ensured findings are captured and shared.

### **Practical Takeaways for the NEPSE System:**

- Regularly perform error analysis after each model iteration.
- Look for systematic biases (e.g., day‑of‑week effects) and address them with additional features.
- Check for autocorrelation; if present, consider adding lagged features (AR terms) or switching to ARIMA.
- If errors are heteroscedastic, consider volatility‑based features or models that account for changing variance.
- Use SHAP to understand individual large errors and guide feature engineering.
- Document findings to build institutional knowledge about market behavior and model limitations.

In the next chapter, **Chapter 38: From Development to Production**, we will discuss the crucial steps to move a model from a Jupyter notebook to a production environment, covering code organization, testing, and deployment considerations.

---

**End of Chapter 37**

