# Chapter 33: Time‑Series Specific Evaluation

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand why standard evaluation metrics must be adapted for time‑series forecasting
- Evaluate model performance across multiple forecast horizons (1‑day, 5‑day, etc.)
- Compute cumulative forecast error to assess long‑term trend accuracy
- Measure directional accuracy (sign prediction) as a complement to magnitude errors
- Calculate hit rate to quantify how often predictions are within a tolerance band
- Apply economic evaluation metrics (e.g., profitability, Sharpe ratio) to assess real‑world value
- Use risk‑adjusted metrics (Sharpe, Sortino, Calmar) to compare models
- Compare models against simple benchmarks (naïve, historical mean, etc.)
- Perform statistical significance tests (Diebold‑Mariano) to determine if one model is truly better
- Develop a robust model comparison framework for time‑series
- Adopt best practices for evaluating forecasts in a financial context

---

## **33.1 Introduction to Time‑Series Specific Evaluation**

Evaluating forecasting models in a time‑series context goes beyond simple regression or classification metrics. Financial time series have unique characteristics: autocorrelation, non‑stationarity, and economic significance. A model that achieves a low RMSE may still be useless if it consistently misses turning points or yields negative trading profits. Conversely, a model with higher RMSE might capture directional moves correctly and generate profitable trades.

Therefore, we need a suite of evaluation tools that capture different aspects of forecast quality:

- **Forecast horizon:** How does accuracy decay as we predict further into the future?
- **Cumulative error:** Does the model correctly track the overall trend?
- **Directional accuracy:** Does the model get the sign right?
- **Hit rate:** Are predictions within an acceptable error band?
- **Economic utility:** Can the model be turned into a profitable trading strategy?
- **Risk adjustment:** How does the model perform per unit of risk?
- **Statistical significance:** Is the improvement over a baseline genuine?

We will explore each of these using the NEPSE dataset and example models (e.g., LSTM, ARIMA, naïve forecast).

---

## **33.2 Forecast Horizon Evaluation**

A model may perform well at one‑step‑ahead forecasts but poorly at longer horizons. It is essential to evaluate performance across multiple horizons. For the NEPSE system, we might want to know how accurate our 1‑day, 5‑day, and 20‑day forecasts are.

### **33.2.1 Multi‑Step Forecasting and Horizon‑Specific Metrics**

We can compute RMSE, MAE, or other metrics separately for each horizon. For example, if we have a model that predicts the next 5 days simultaneously (direct multi‑step), we can evaluate each step individually.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

# Assume we have a multi-step forecast model that predicts 5 steps ahead
# For demonstration, we'll generate synthetic multi-step predictions
np.random.seed(42)
n_samples = 100
horizons = 5
y_true_multi = np.random.randn(n_samples, horizons)  # actual future returns
y_pred_multi = y_true_multi + np.random.randn(n_samples, horizons) * 0.5  # predictions with error

# Compute RMSE per horizon
rmse_per_horizon = np.sqrt(np.mean((y_true_multi - y_pred_multi) ** 2, axis=0))
mae_per_horizon = np.mean(np.abs(y_true_multi - y_pred_multi), axis=0)

for h in range(horizons):
    print(f"Horizon {h+1}: RMSE = {rmse_per_horizon[h]:.4f}, MAE = {mae_per_horizon[h]:.4f}")

# Plot
plt.figure(figsize=(10,6))
plt.plot(range(1, horizons+1), rmse_per_horizon, marker='o', label='RMSE')
plt.plot(range(1, horizons+1), mae_per_horizon, marker='s', label='MAE')
plt.xlabel('Forecast Horizon (days)')
plt.ylabel('Error')
plt.title('Forecast Error by Horizon')
plt.legend()
plt.grid(True)
plt.show()
```

**Explanation:**  
We compute errors for each forecast step separately. In practice, error typically increases with horizon as uncertainty accumulates; for NEPSE, for instance, the 5‑day‑ahead RMSE might turn out to be roughly double the 1‑day RMSE. The synthetic data above adds the same noise level at every step, so its error curve is roughly flat. Horizon‑specific errors help in setting realistic expectations and deciding which horizons are reliable enough for trading.
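With real data, the matrix of actual values has to be aligned so that row `t` holds the observations that follow time `t`. The sketch below shows one way to build it from a one‑dimensional return series. The function name `make_horizon_targets` and the sample values are assumptions made for illustration, not part of the NEPSE pipeline.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_horizon_targets(returns, horizons):
    """Row t holds the next `horizons` returns after time t: r[t+1], ..., r[t+horizons]."""
    returns = np.asarray(returns, dtype=float)
    # Slide a window over returns[1:], so row t starts at the observation after t
    return sliding_window_view(returns[1:], horizons).copy()

# Hypothetical daily returns in percent (illustrative values only)
daily_returns = np.array([0.4, -0.2, 0.1, 0.3, -0.5, 0.2, 0.0, -0.1])
targets = make_horizon_targets(daily_returns, horizons=3)

print(targets.shape)  # (5, 3): the last rows are dropped because their future is unknown
print(targets)
```

A matrix built this way can take the place of `y_true_multi` in the per‑horizon calculation, provided the predictions are aligned to the same rows.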

---

## **33.3 Cumulative Forecast Error**

Cumulative forecast error measures how well the model tracks the overall trend over a period. It is especially important for multi‑step forecasts where we care about the cumulative return over a horizon, not just each step individually.

For example, if we predict the next 5 daily returns, the cumulative 5‑day return is obtained by compounding them when they are simple returns, `(1 + r1) * (1 + r2) * ... * (1 + r5) - 1`, or by summing them when they are log returns. We can compare the predicted cumulative return to the actual cumulative return.

```python
# Compound the daily simple returns (given in percent) into cumulative returns
# Divide by 100 to convert percent to decimal, then scale the result back to percent
y_true_cumulative = (np.cumprod(1 + y_true_multi / 100, axis=1) - 1) * 100
y_pred_cumulative = (np.cumprod(1 + y_pred_multi / 100, axis=1) - 1) * 100

# Each row now holds the cumulative return (%) at horizons 1..5
# RMSE of cumulative returns per horizon, in the same units as the step-wise RMSE
rmse_cumulative = np.sqrt(np.mean((y_true_cumulative - y_pred_cumulative) ** 2, axis=0))

plt.figure(figsize=(10,6))
plt.plot(range(1, horizons+1), rmse_per_horizon, marker='o', label='Step-wise RMSE')
plt.plot(range(1, horizons+1), rmse_cumulative, marker='s', label='Cumulative RMSE')
plt.xlabel('Horizon (days)')
plt.ylabel('RMSE')
plt.title('Step-wise vs Cumulative Forecast Error')
plt.legend()
plt.grid(True)
plt.show()
```

**Explanation:**  
Cumulative error often grows faster than step‑wise error because errors accumulate. A model that is good at daily forecasts may still be poor at predicting the 5‑day trend if its errors are positively correlated across steps, since they then add up instead of cancelling out. The cumulative RMSE makes this visible.
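Whether a cumulative return is a product or a sum depends on how the returns are defined. The short check below, using made‑up return values, shows that compounding simple returns and summing log returns give the same result, while summing simple returns is only an approximation.

```python
import numpy as np

# Hypothetical daily simple returns in percent (illustrative values only)
simple_pct = np.array([1.0, -2.0, 0.5, 1.5, -1.0])
simple = simple_pct / 100.0

# Simple returns compound multiplicatively
compounded = np.prod(1 + simple) - 1

# Log returns add up; converting the sum back gives the same compounded return
log_returns = np.log1p(simple)
from_logs = np.expm1(log_returns.sum())

# Summing simple returns ignores compounding and is only approximate
naive_sum = simple.sum()

print(f"Compounded simple returns: {compounded:.6f}")
print(f"Sum of log returns, converted back: {from_logs:.6f}")
print(f"Sum of simple returns: {naive_sum:.6f}")
```

The gap between the first and third figures is small for a few days of small returns but widens with longer horizons and larger moves.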

---

## **33.4 Directional Accuracy**

Directional accuracy (also called sign prediction accuracy) measures how often the model correctly predicts whether the return will be positive or negative. It is a classification metric applied to regression forecasts. For a trading strategy, getting the direction right is often more important than the exact magnitude.

We can compute directional accuracy by comparing the sign of the predicted return with the sign of the actual return.

```python
def directional_accuracy(y_true, y_pred):
    """
    Compute the proportion of times the sign of prediction matches the sign of actual.
    """
    true_sign = np.sign(y_true)
    pred_sign = np.sign(y_pred)
    # np.sign returns 0 for a zero value, so a zero counts as correct only when both actual and predicted are zero
    correct = (true_sign == pred_sign)
    return np.mean(correct)

# For a single horizon (e.g., 1-day ahead)
dir_acc_1 = directional_accuracy(y_true_multi[:, 0], y_pred_multi[:, 0])
print(f"1-day directional accuracy: {dir_acc_1:.4f}")

# For multi-step, we can compute per horizon or overall
dir_acc_per_horizon = [directional_accuracy(y_true_multi[:, h], y_pred_multi[:, h]) for h in range(horizons)]
print("Directional accuracy per horizon:", np.round(dir_acc_per_horizon, 4))
```

**Explanation:**  
Directional accuracy above 0.5 suggests the model has some predictive power for the sign, provided up and down days are roughly balanced and the sample is large enough to rule out chance. For NEPSE, if we achieve 55% directional accuracy, we might be able to build a profitable trading strategy, depending on transaction costs.
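Whether 55% is meaningfully better than a coin flip depends on how many forecasts were scored. A rough check is an exact one‑sided binomial test against a 50% hit probability. The sketch below assumes the hits are independent, which may not hold for autocorrelated series, and the helper name `directional_accuracy_pvalue` is illustrative.

```python
from scipy import stats

def directional_accuracy_pvalue(n_correct, n_total, p_null=0.5):
    """One-sided p-value for H0: hit probability <= p_null (exact binomial test)."""
    # P(X >= n_correct) under Binomial(n_total, p_null)
    return stats.binom.sf(n_correct - 1, n_total, p_null)

# The same 55% accuracy on two hypothetical sample sizes
p_small = directional_accuracy_pvalue(55, 100)
p_large = directional_accuracy_pvalue(550, 1000)

print(f"55/100 correct:   p-value = {p_small:.4f}")
print(f"550/1000 correct: p-value = {p_large:.4f}")
```

With 100 forecasts, 55% accuracy is well within what chance alone produces; with 1,000 forecasts the same accuracy is much harder to attribute to luck.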

---

## **33.5 Hit Rate**

Hit rate measures the proportion of predictions that fall within a certain tolerance band around the actual value. It is useful when we care about being "close enough" rather than exact precision. For example, we might define a hit as `|y_true - y_pred| < threshold`, where threshold could be 0.5% for returns.

```python
def hit_rate(y_true, y_pred, threshold=0.5):
    """
    Proportion of predictions within threshold (absolute error).
    """
    errors = np.abs(y_true - y_pred)
    hits = errors < threshold
    return np.mean(hits)

# Example with threshold 0.5%
hit_rate_1 = hit_rate(y_true_multi[:, 0], y_pred_multi[:, 0], threshold=0.5)
print(f"1-day hit rate (within 0.5%): {hit_rate_1:.4f}")
```

**Explanation:**  
Hit rate is intuitive and can be tailored to the application. For a day trader, a threshold of 0.2% might be appropriate; for a swing trader, 1% might be acceptable.
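Since the appropriate threshold depends on the use case, it can help to report the hit rate at several thresholds at once. The following is a minimal sketch with made‑up returns; `hit_rate_curve` is a hypothetical helper, not an established function.

```python
import numpy as np

def hit_rate_curve(y_true, y_pred, thresholds):
    """Hit rate (share of absolute errors below the threshold) for each threshold."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.array([np.mean(abs_err < t) for t in thresholds])

# Hypothetical actual and predicted returns in percent (illustrative values only)
y_true = np.array([0.3, -0.8, 1.2, 0.1, -0.4])
y_pred = np.array([0.2, -0.1, 0.9, 0.7, -0.45])
thresholds = [0.2, 0.5, 1.0]

rates = hit_rate_curve(y_true, y_pred, thresholds)
for t, r in zip(thresholds, rates):
    print(f"Within {t:.1f}%: {r:.2f}")
```

The curve is non‑decreasing in the threshold by construction, so what matters is how quickly it rises.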

---

## **33.6 Economic Evaluation Metrics**

Ultimately, a forecasting model's value is determined by its economic utility. We can simulate a simple trading strategy based on the model's predictions and compute profit and loss (P&L), Sharpe ratio, maximum drawdown, etc.

### **33.6.1 Simple Trading Simulation**

Assume we have a model that predicts the next day's return. We can go long if the predicted return is positive, short if negative, and hold for one day. The daily return of the strategy is:

`strategy_return = sign(pred) * actual_return`

We then compute cumulative returns and risk metrics.

```python
# Simulate a simple strategy
pred_sign = np.sign(y_pred_multi[:, 0])
strategy_returns = pred_sign * y_true_multi[:, 0]  # daily returns (%)

# Cumulative wealth (starting with 1)
cumulative_wealth = np.cumprod(1 + strategy_returns / 100)
total_return = (cumulative_wealth[-1] - 1) * 100

# Annualized Sharpe ratio (assuming 252 trading days)
sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()

print(f"Total Return: {total_return:.2f}%")
print(f"Sharpe Ratio: {sharpe:.4f}")

# Plot cumulative wealth
plt.figure(figsize=(10,6))
plt.plot(cumulative_wealth, label='Strategy')
plt.plot(np.cumprod(1 + y_true_multi[:, 0] / 100), label='Buy & Hold')
plt.xlabel('Days')
plt.ylabel('Cumulative Wealth')
plt.title('Trading Strategy Performance')
plt.legend()
plt.show()
```

**Explanation:**  
This simulation shows whether the model's directional calls translate into profit. We compare against a buy‑and‑hold benchmark. As a rule of thumb, an annualized Sharpe ratio above 1 is considered good. Because the synthetic predictions here are the true returns plus noise, the resulting figures are far better than anything achievable on real data. For NEPSE, we must also account for transaction costs (brokerage, taxes), which we omit here for simplicity but should be included in a real evaluation.
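One simple way to bring costs into the simulation is to charge a fee whenever the position changes. The sketch below assumes a flat cost per unit of position change. The 0.4% figure is a placeholder rather than an actual NEPSE fee, and `strategy_returns_with_costs` is a hypothetical helper.

```python
import numpy as np

def strategy_returns_with_costs(pred, actual_pct, cost_pct=0.4):
    """
    Net daily returns (%) of a long/short sign strategy.
    cost_pct: assumed cost in percent per unit of position change (placeholder value).
    """
    position = np.sign(np.asarray(pred, dtype=float))
    actual_pct = np.asarray(actual_pct, dtype=float)
    # Turnover: absolute position change each day, starting from a flat position
    turnover = np.abs(np.diff(position, prepend=0.0))
    gross = position * actual_pct
    return gross - cost_pct * turnover

# Hypothetical predictions and realized returns in percent
pred = np.array([0.5, 0.2, -0.3, -0.1, 0.4])
actual = np.array([1.0, -0.5, -1.0, 0.5, 2.0])

net = strategy_returns_with_costs(pred, actual, cost_pct=0.4)
gross = np.sign(pred) * actual

print("Gross returns:", gross, "sum =", gross.sum())
print("Net returns:  ", net, "sum =", round(net.sum(), 4))
```

Flipping from long to short counts as two units of turnover here, so strategies that change direction often are penalized the most.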

---

## **33.7 Risk‑Adjusted Metrics**

Risk‑adjusted metrics normalize returns by the risk taken. Common metrics include:

- **Sharpe ratio:** (average return in excess of the risk‑free rate) / (standard deviation of returns). The code in this chapter assumes a risk‑free rate of zero.
- **Sortino ratio:** (average excess return) / (downside deviation); penalizes only volatility below a target return.
- **Calmar ratio:** (annualized return) / (maximum drawdown)

These metrics allow comparison across strategies with different risk profiles.

```python
def downside_deviation(returns, target=0):
    """Downside deviation: root-mean-square shortfall below `target`, over all observations."""
    shortfall = np.minimum(returns - target, 0)
    return np.sqrt(np.mean(shortfall**2))

sharpe = strategy_returns.mean() / strategy_returns.std() * np.sqrt(252)
sortino = strategy_returns.mean() / downside_deviation(strategy_returns) * np.sqrt(252)

# Maximum drawdown: largest peak-to-trough decline as a fraction of the running peak
running_peak = np.maximum.accumulate(cumulative_wealth)
max_drawdown = np.max((running_peak - cumulative_wealth) / running_peak)

# Returns are in percent, so divide by 100 to match the fractional drawdown
calmar = (strategy_returns.mean() * 252 / 100) / max_drawdown

print(f"Sharpe Ratio: {sharpe:.4f}")
print(f"Sortino Ratio: {sortino:.4f}")
print(f"Calmar Ratio: {calmar:.4f}")
```

**Explanation:**  
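Higher ratios indicate better risk‑adjusted performance. As a rule of thumb, a Sharpe ratio above 1 is considered good and above 2 exceptional. However, such high values are rare in efficient markets, so a very high ratio is a reason to check for look‑ahead bias or omitted costs.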
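To make the two downside measures concrete, here is a small worked example on hand‑picked numbers where the answers can be verified by eye. The wealth path and returns are invented for illustration.

```python
import numpy as np

def max_drawdown(wealth):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    wealth = np.asarray(wealth, dtype=float)
    running_peak = np.maximum.accumulate(wealth)
    return np.max((running_peak - wealth) / running_peak)

def downside_deviation(returns, target=0.0):
    """Root-mean-square shortfall below `target`, averaged over all observations."""
    shortfall = np.minimum(np.asarray(returns, dtype=float) - target, 0.0)
    return np.sqrt(np.mean(shortfall ** 2))

# Hypothetical wealth path: peaks at 1.21, falls to 0.968, later recovers
wealth = np.array([1.00, 1.10, 1.21, 0.968, 1.05, 1.30, 1.17])
mdd = max_drawdown(wealth)
print(f"Maximum drawdown: {mdd:.2%}")  # (1.21 - 0.968) / 1.21 = 20%

# Hypothetical daily returns in percent: only -2 and -1 fall below the target
returns = np.array([1.0, -2.0, 0.5, -1.0])
dd = downside_deviation(returns)
print(f"Downside deviation: {dd:.4f}")  # sqrt((4 + 1) / 4)
```

Note that the later drop from 1.30 to 1.17 is only a 10% drawdown, so the earlier 20% decline remains the maximum.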

---

## **33.8 Benchmark Comparison**

A model should always be compared against simple benchmarks to ensure it adds value. Common benchmarks for time‑series:

- **Naïve forecast (persistence):** predict the last observed value.
- **Historical mean:** predict the average of the training period.
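- **Drift:** extrapolate the average historical change, i.e., the straight line through the first and last observations.
- **Seasonal naïve:** for seasonal data, predict the value from the same season in the previous cycle (e.g., the same month last year).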

We already introduced MASE in Chapter 31, which compares MAE to the naïve forecast's MAE. We can also compute relative metrics like **Theil's U**, which compares RMSE to the RMSE of a random walk.

```python
# Naïve forecast (persistence) for 1-step ahead
y_naive = np.roll(y_true_multi[:, 0], 1)  # shift by one so each prediction is the previous actual
y_naive[0] = 0  # np.roll wraps the last value to the front; no prior observation exists, so use 0

rmse_model = np.sqrt(mean_squared_error(y_true_multi[:, 0], y_pred_multi[:, 0]))
rmse_naive = np.sqrt(mean_squared_error(y_true_multi[:, 0], y_naive))

print(f"Model RMSE: {rmse_model:.4f}")
print(f"Naïve RMSE: {rmse_naive:.4f}")
print(f"RMSE ratio (model/naïve): {rmse_model/rmse_naive:.4f}")
if rmse_model < rmse_naive:
    print("Model beats naïve forecast.")
```

**Explanation:**  
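If the ratio is less than 1, the model improves upon persistence. Persistence is a demanding benchmark mainly for price levels, where it corresponds to a random walk. For returns, the comparable hard‑to‑beat benchmarks are a zero or historical‑mean forecast, and beating those is a significant achievement.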
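The benchmarks listed above can be built in a few lines. The sketch below works on price levels, uses invented index values and invented model forecasts, and reports a simplified level‑based ratio in the spirit of Theil's U (model RMSE divided by the RMSE of the no‑change forecast), not the full textbook definition.

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

# Hypothetical index levels (illustrative only)
prices = np.array([100.0, 102.0, 101.0, 103.0, 106.0, 105.0])
actual = prices[1:]

# Persistence: the forecast for t+1 is the value at t
naive = prices[:-1]
# Expanding historical mean: uses only observations available up to t
hist_mean = np.cumsum(prices)[:-1] / np.arange(1, len(prices))
# Hypothetical model forecasts for the same dates
model = np.array([101.5, 101.5, 102.5, 105.0, 105.5])

u_ratio = rmse(actual, model) / rmse(actual, naive)

print(f"RMSE naive: {rmse(actual, naive):.4f}")
print(f"RMSE historical mean: {rmse(actual, hist_mean):.4f}")
print(f"RMSE model: {rmse(actual, model):.4f}")
print(f"U-style ratio (model / naive): {u_ratio:.4f}")
```

A ratio below 1 means the model beats the no‑change forecast on this sample; a ratio of 1 or more means it adds nothing over persistence.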

---

## **33.9 Statistical Significance Testing**

Even if a model has lower RMSE, we need to know if the improvement is statistically significant. The **Diebold‑Mariano test** is commonly used to compare predictive accuracy of two forecasts. It tests the null hypothesis that the two forecasts have equal predictive accuracy.

```python
import scipy.stats as stats

def diebold_mariano(e1, e2, h=1, method='HLN'):
    """
    Diebold-Mariano test for equal predictive accuracy under squared-error loss.
    e1, e2: forecast errors from two models (same length, same dates).
    h: forecast horizon; autocovariances up to lag h-1 enter the variance.
    method: 'HLN' applies the Harvey-Leybourne-Newbold small-sample correction
            and uses a t distribution; anything else uses the normal distribution.
    """
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2  # squared error loss differential
    n = len(d)
    d_bar = d.mean()

    def autocov(k):
        # Sample autocovariance of d at lag k
        return np.sum((d[k:] - d_bar) * (d[:n - k] - d_bar)) / n

    # Long-run variance of d: lag 0 plus twice the autocovariances up to lag h-1.
    # For h=1 this is the plain variance; for h>1 it can turn negative in small samples.
    long_run_var = autocov(0) + 2 * sum(autocov(k) for k in range(1, h))
    DM = d_bar / np.sqrt(long_run_var / n)

    if method == 'HLN':
        DM *= np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
        p_value = 2 * stats.t.sf(np.abs(DM), df=n - 1)
    else:
        p_value = 2 * stats.norm.sf(np.abs(DM))
    return DM, p_value

# Example: compare model vs naïve forecast errors
e_model = y_true_multi[:, 0] - y_pred_multi[:, 0]
e_naive = y_true_multi[:, 0] - y_naive

dm_stat, p_val = diebold_mariano(e_model, e_naive)
print(f"Diebold-Mariano statistic: {dm_stat:.4f}, p-value: {p_val:.4f}")
if p_val < 0.05:
    print("Reject null: models have different predictive accuracy.")
else:
    print("Cannot reject null: models may have equal accuracy.")
```

**Explanation:**  
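A small p‑value (e.g., < 0.05) indicates that the difference in forecast accuracy is statistically significant. Because the loss differential is defined as `e1**2 - e2**2`, a negative statistic means the first model has the lower squared error. This helps avoid concluding superiority based on chance.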

---

## **33.10 Model Comparison Framework**

To compare multiple models systematically, we need a framework that includes:

1. **Train/validation/test splits** (temporal).
2. **Multiple evaluation metrics** (RMSE, MAE, directional accuracy, Sharpe, etc.).
3. **Statistical tests** to compare models pairwise.
4. **Robustness checks** across different time periods (walk‑forward validation).

We can create a table summarizing performance across models and metrics.

```python
# Example: compare three models (ARIMA, LSTM, Naïve) on multiple metrics
models = ['ARIMA', 'LSTM', 'Naïve']
metrics = ['RMSE', 'MAE', 'DirAcc', 'Sharpe']
# Assume we have precomputed results (placeholder values for illustration)
# Building the frame from numeric data keeps a float dtype, which idxmin/idxmax require
results = pd.DataFrame(
    [[0.85, 0.62, 0.53, 0.8],   # ARIMA
     [0.82, 0.60, 0.55, 1.1],   # LSTM
     [0.90, 0.68, 0.50, 0.0]],  # Naïve
    index=models, columns=metrics,
)

print(results)

# Identify best model per metric
print("\nBest model per metric:")
for metric in metrics:
    best = results[metric].idxmax() if metric in ['DirAcc', 'Sharpe'] else results[metric].idxmin()
    print(f"{metric}: {best}")
```

**Explanation:**  
A summary table helps in making a final selection. Often, a model that performs well across multiple metrics is preferred.
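Rather than typing results in by hand, each row of such a table can be produced by one evaluation function applied to every model's forecasts. The sketch below uses synthetic data and two stand‑in "models"; the function name `evaluate_forecast` and the model labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def evaluate_forecast(y_true, y_pred):
    """Return RMSE, MAE, directional accuracy and annualized Sharpe of a sign strategy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    strat = np.sign(y_pred) * y_true  # long/short on the predicted sign
    sd = strat.std(ddof=1)
    return {
        'RMSE': np.sqrt(np.mean(err ** 2)),
        'MAE': np.mean(np.abs(err)),
        'DirAcc': np.mean(np.sign(y_true) == np.sign(y_pred)),
        'Sharpe': np.sqrt(252) * strat.mean() / sd if sd > 0 else np.nan,
    }

# Synthetic returns and two stand-in forecasts (not real models)
rng = np.random.default_rng(0)
y_true = rng.normal(0, 1, 250)
forecasts = {
    'Noisy signal': y_true + rng.normal(0, 1, 250),
    'Zero forecast': np.zeros(250),
}

results = pd.DataFrame(
    {name: evaluate_forecast(y_true, f) for name, f in forecasts.items()}
).T
print(results.round(3))
```

In this setup the two forecasts have similar RMSE by construction, yet only the noisy signal carries directional information, which is the kind of difference a single metric would hide. The zero forecast never takes a position, so its Sharpe ratio is undefined.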

---

## **33.11 Evaluation Best Practices**

To ensure robust evaluation in a time‑series context, follow these best practices:

- **Use walk‑forward validation:** Evaluate models over multiple rolling windows to see performance stability.
- **Avoid look‑ahead bias:** All features and targets must be constructed using only past information.
- **Report multiple metrics:** No single metric tells the full story.
- **Benchmark against simple models:** Always compare to naïve, mean, or ARIMA.
- **Consider economic value:** A model with slightly higher RMSE but better directional accuracy may be more profitable.
- **Test for statistical significance:** Use Diebold‑Mariano or similar tests when comparing models.
- **Check performance across market regimes:** Evaluate separately in bull, bear, and sideways markets.
- **Account for transaction costs:** In trading simulations, include realistic costs.
- **Document the evaluation procedure:** Ensure reproducibility.
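The walk‑forward idea from the first bullet can be sketched as an index generator in which every test block strictly follows its training block. The function name and parameters below are illustrative. scikit‑learn's `TimeSeriesSplit` offers similar functionality and may be preferable in practice.

```python
import numpy as np

def walk_forward_splits(n_obs, train_size, test_size, expanding=True):
    """Yield (train_idx, test_idx) pairs; each test block follows its training block."""
    start = train_size
    while start + test_size <= n_obs:
        # Expanding window keeps all history; rolling window keeps the last train_size points
        train_start = 0 if expanding else start - train_size
        yield np.arange(train_start, start), np.arange(start, start + test_size)
        start += test_size

splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(f"train {train_idx[0]}-{train_idx[-1]}  test {test_idx[0]}-{test_idx[-1]}")
```

Computing the chapter's metrics on each test block separately, rather than on one pooled test set, shows how stable a model's performance is over time.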

---

## **33.12 Chapter Summary**

In this chapter, we explored evaluation techniques specific to time‑series forecasting, with a focus on financial applications like the NEPSE system.

- **Horizon evaluation** reveals how accuracy degrades with longer forecasts.
- **Cumulative error** assesses trend tracking.
- **Directional accuracy** measures sign prediction, often more important than magnitude.
- **Hit rate** gives a flexible "close enough" metric.
- **Economic metrics** (P&L, Sharpe) connect model performance to real‑world value.
- **Risk‑adjusted metrics** (Sortino, Calmar) account for downside risk.
- **Benchmark comparison** ensures the model adds value over simple baselines.
- **Statistical significance testing** (Diebold‑Mariano) validates improvements.
- **Model comparison framework** organizes evaluation across multiple models and metrics.
- **Best practices** guide robust, realistic assessment.

### **Practical Takeaways for the NEPSE System:**

- Always evaluate on multiple horizons, as short‑term accuracy may not translate to long‑term profitability.
- Use directional accuracy and trading simulations to gauge economic value.
- Compare against a naïve forecast; if you cannot beat it, your model adds no value for trading.
- Apply walk‑forward validation to test stability over time.
- Consider transaction costs in simulations to avoid overestimating profitability.

In the next chapter, **Chapter 34: Cross‑Validation Techniques**, we will delve into advanced resampling methods for time‑series, including purged and embargoed cross‑validation, to obtain more reliable performance estimates.

---

**End of Chapter 33**