# Chapter 31: Evaluation Metrics for Regression

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the purpose of evaluation metrics in assessing regression model performance
- Compute and interpret Mean Absolute Error (MAE) for NEPSE return predictions
- Explain why Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are sensitive to outliers
- Use Mean Absolute Percentage Error (MAPE) and Symmetric MAPE (sMAPE) to express errors in percentage terms
- Apply R‑squared (R²) and Adjusted R‑squared to measure the proportion of variance explained
- Recognize the limitations of percentage errors when dealing with values near zero
- Implement Mean Absolute Scaled Error (MASE) to compare against a naive baseline
- Compare different metrics and select the most appropriate for financial forecasting tasks
- Avoid common pitfalls when interpreting regression metrics in time‑series contexts

---

## **31.1 Introduction to Regression Metrics**

After training a regression model to predict next‑day returns or prices for NEPSE stocks, we need to evaluate how well it performs. Evaluation metrics quantify the difference between predicted values and actual observed values. Choosing the right metric is crucial because it reflects the business objective and guides model selection.

For financial forecasting, we care about different aspects:

- **Magnitude of error:** How far off are our predictions? (MAE, RMSE)
- **Directional accuracy:** Did we at least get the sign right? (not covered here, see Chapter 32)
- **Relative error:** How large is the error compared to the actual value? (MAPE)
- **Variance explained:** How much of the price movement does the model capture? (R²)
- **Improvement over baseline:** Is the model better than a simple forecast (e.g., predicting the mean or last value)? (MASE)

No single metric tells the whole story. We will examine each metric, its formula, interpretation, and implementation on the NEPSE dataset.

---

## **31.2 Mean Absolute Error (MAE)**

MAE measures the average magnitude of errors in a set of predictions, without considering their direction. It is the average absolute difference between predicted and actual values.

**Formula:**  
`MAE = (1/n) * Σ |yᵢ - ŷᵢ|`

**Interpretation:** MAE is in the same units as the target variable. For return predictions (%), MAE tells us the average absolute percentage point error. A lower MAE indicates better accuracy.

**Advantages:** Easy to understand, more robust to outliers than squared‑error metrics (since it does not square errors).

**Disadvantages:** Does not penalize large errors more than small ones, which might be important in finance (a huge miss could be disastrous).
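To make this disadvantage concrete, the sketch below uses two made-up error profiles (hypothetical numbers, not NEPSE data). Both have the same MAE, even though one of them hides a much larger worst-case miss.

```python
import numpy as np

# Hypothetical absolute errors in percentage points (illustrative only)
errors_even = np.array([1.0, 1.0, 1.0, 1.0])    # four moderate misses
errors_spiky = np.array([0.0, 0.0, 0.0, 4.0])   # three perfect calls, one huge miss

mae_even = np.mean(np.abs(errors_even))
mae_spiky = np.mean(np.abs(errors_spiky))

print(f"MAE (even):  {mae_even:.2f}, worst miss: {errors_even.max():.2f}")
print(f"MAE (spiky): {mae_spiky:.2f}, worst miss: {errors_spiky.max():.2f}")
```

MAE alone cannot tell these two profiles apart, so it helps to look at the largest residuals as well.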

### **31.2.1 Computing MAE for NEPSE Return Predictions**

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Assume we have actual returns (y_test) and predictions (y_pred) from a model
# For demonstration, we'll create synthetic predictions
np.random.seed(42)
y_test = np.random.randn(100) * 2  # actual returns (percentage points)
y_pred = y_test + np.random.randn(100) * 0.5  # predictions with some error

mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.4f} percentage points")
```

**Explanation:**  
`mean_absolute_error` from scikit‑learn computes the average absolute difference. In the NEPSE context, an MAE of 0.5 means our predictions miss the actual return by half a percentage point on average (for example, predicting 1.2% when the realized return was 0.7%). This is interpretable and useful for comparing models.

---

## **31.3 Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)**

MSE is the average of squared differences between predictions and actuals. RMSE is the square root of MSE, bringing the metric back to the original units.

**Formulas:**  
`MSE = (1/n) * Σ (yᵢ - ŷᵢ)²`  
`RMSE = √MSE`

**Interpretation:** Because errors are squared, larger errors are penalized more heavily. RMSE is in the same units as the target. For return predictions, an RMSE of 1.0 means the typical error magnitude is about one percentage point, with large misses weighted more heavily than in MAE, so RMSE is more sensitive to outliers.

**Advantages:** Differentiable, commonly used as a loss function. Emphasizes large errors, which may be important in risk‑sensitive applications.

**Disadvantages:** Sensitive to outliers; can be dominated by a few extreme errors.

### **31.3.1 Computing MSE and RMSE for NEPSE**

```python
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f} percentage points")
```

**Explanation:**  
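For the same predictions, RMSE is never smaller than MAE. The two are equal only when every error has the same magnitude, and the gap widens as the error sizes become more uneven. If the errors are normally distributed with zero mean, RMSE ≈ 1.25 × MAE (the exact factor is √(π/2)). Comparing RMSE across models helps identify which model avoids huge mistakes.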
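The relationship between RMSE and MAE can be checked numerically. The sketch below draws simulated zero-mean normal errors (the sample size and standard deviation are arbitrary choices for illustration) and compares the observed ratio with √(π/2).

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated zero-mean normal forecast errors (illustrative, not NEPSE data)
errors = rng.normal(loc=0.0, scale=0.5, size=100_000)

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
ratio = rmse / mae

print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}, ratio: {ratio:.4f}")
print(f"Theoretical ratio sqrt(pi/2): {np.sqrt(np.pi / 2):.4f}")
```

A ratio well above 1.25 on real residuals suggests heavier tails than a normal distribution, meaning a few large misses are inflating RMSE.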

---

## **31.4 Mean Absolute Percentage Error (MAPE)**

MAPE expresses error as a percentage of the actual value, making it scale‑independent and easy to communicate.

**Formula:**  
`MAPE = (100/n) * Σ |(yᵢ - ŷᵢ) / yᵢ|`

**Interpretation:** MAPE = 5% means the average error is 5% of the actual value. Useful when the target has varying scales (e.g., different stocks with different price levels).

**Advantages:** Intuitive percentage interpretation.

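**Disadvantages:** Undefined when yᵢ = 0 and unstable for values close to zero, where a handful of small actual values can dominate the average. It is also asymmetric: for a positive target and a non‑negative forecast, an under‑forecast can cost at most 100%, while an over‑forecast has no upper bound. For returns, which are often zero or near zero, MAPE is problematic. Since returns are already expressed as percentages, MAE is already in percentage points and is usually the better choice.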

### **31.4.1 Computing MAPE for NEPSE Returns (with Caution)**

```python
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    # Avoid division by zero
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"MAPE: {mape:.2f}%")
```

**Explanation:**  
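The mask excludes observations whose actual return is exactly zero (adding a small epsilon to the denominator is an alternative). Exact zeros occur when a stock's price does not change. Returns that are close to zero but not exactly zero pass through the mask and can each contribute a very large percentage term. MAPE can still be informative but should be used with caution.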
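The instability near zero is easy to reproduce. In the sketch below (hypothetical returns in percentage points, with a helper named `mape` that follows the same masking convention), every forecast misses by exactly 0.5 points, yet one nearly flat day changes the result dramatically.

```python
import numpy as np

def mape(y_true, y_pred):
    """MAPE in percent, skipping exact zeros in y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

# Hypothetical daily returns in percentage points (illustrative only)
y_true_large = np.array([2.0, -1.5, 3.0, -2.5])
y_true_small = np.array([2.0, -1.5, 0.01, -2.5])   # one nearly flat day
offset = 0.5                                        # identical miss on every day

mape_large = mape(y_true_large, y_true_large + offset)
mape_small = mape(y_true_small, y_true_small + offset)

print(f"MAPE without near-zero day: {mape_large:.2f}%")
print(f"MAPE with near-zero day:    {mape_small:.2f}%")
```

The absolute errors are identical in both cases, so the jump comes entirely from the small denominator.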

---

## **31.5 Symmetric Mean Absolute Percentage Error (sMAPE)**

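sMAPE was introduced to reduce MAPE's asymmetry and its division‑by‑zero problem. Its denominator is the average of the absolute actual and absolute predicted values, `(|yᵢ| + |ŷᵢ|) / 2`, which is why the formula carries a factor of 200 rather than 100.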

**Formula:**  
`sMAPE = (200/n) * Σ |yᵢ - ŷᵢ| / (|yᵢ| + |ŷᵢ|)`

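**Interpretation:** Ranges from 0% to 200%. Lower is better. It treats the actual and predicted values symmetrically and remains defined when the actual value is zero (a term is undefined only when both values are zero). Often used in forecasting competitions (e.g., M4 competition).

**Advantages:** Symmetric in its two arguments, bounded, handles zeros (if both are zero, the term is 0/0, usually defined as 0).

**Disadvantages:** Still sensitive when both actual and predicted are near zero. Whenever the actual and predicted values have opposite signs, the term hits the 200% maximum regardless of how small the miss is, which matters for returns. Interpretation is less intuitive than MAPE.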

### **31.5.1 Computing sMAPE for NEPSE**

```python
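def smape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = np.abs(y_true) + np.abs(y_pred)
    diff = np.abs(y_true - y_pred)
    # Divide only where the denominator is non-zero; 0/0 terms stay at 0.
    # (np.where alone would still evaluate the division and emit a warning.)
    terms = np.divide(200 * diff, denominator,
                      out=np.zeros_like(denominator), where=denominator != 0)
    return np.mean(terms)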

smape_value = smape(y_test, y_pred)
print(f"sMAPE: {smape_value:.2f}%")
```

**Explanation:**  
For NEPSE returns, sMAPE gives a bounded relative measure. A value of 10% means that, on average, the absolute error is 10% of the mean magnitude of the actual and predicted values.
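Because returns change sign, it is worth inspecting the individual sMAPE terms. The sketch below uses a hypothetical helper, `smape_terms`, and made-up values to show that any pair with opposite signs contributes exactly 200%, whether the miss is tiny or large.

```python
import numpy as np

def smape_terms(y_true, y_pred):
    """Per-observation sMAPE terms in percent; 0/0 is treated as 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    diff = np.abs(y_true - y_pred)
    return np.divide(200 * diff, denom, out=np.zeros_like(denom), where=denom != 0)

# Hypothetical returns in percentage points (illustrative only)
y_true = np.array([0.05, 3.00, 2.00, 0.00])
y_pred = np.array([-0.05, -3.00, 1.00, 0.00])

terms = smape_terms(y_true, y_pred)
print(terms)  # first two pairs have opposite signs, so both terms are 200
```

The first pair misses by only 0.1 points and the second by 6 points, yet sMAPE scores them identically.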

---

## **31.6 R‑squared (R²) and Adjusted R‑squared**

R² measures the proportion of variance in the target that is explained by the model. It is defined as:

`R² = 1 - (SS_res / SS_tot)`

where SS_res is the sum of squared residuals, and SS_tot is the total sum of squares (variance of the target times n‑1). R² ranges from -∞ to 1, with 1 indicating perfect fit, 0 indicating the model predicts no better than the mean, and negative values indicating worse than the mean.

**Interpretation:** An R² of 0.8 means the model explains 80% of the variance in returns. In finance, achieving high R² is very difficult because returns are noisy.

**Adjusted R²** penalizes the inclusion of extra features:  
`Adj R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)`  
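where p is the number of predictors and n is the number of observations. Unlike R², which never decreases on the training data when a feature is added, adjusted R² falls when a new feature contributes too little explanatory power, which discourages padding the model with useless predictors.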
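A quick way to get a feel for the penalty is to hold R² fixed and vary p. The values below (R² = 0.05, n = 100) are hypothetical and chosen only to resemble a typical return-prediction setting.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R² for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2, n = 0.05, 100  # hypothetical values for illustration
results = {p: adjusted_r2(r2, n, p) for p in (1, 10, 30)}

for p, value in results.items():
    print(f"p = {p:>2}: adjusted R² = {value:.4f}")
```

With a small R², the adjustment quickly pushes the value below zero as predictors are added, which is a hint that the raw R² may be mostly noise fitting.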

### **31.6.1 Computing R² and Adjusted R²**

```python
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.4f}")

# Adjusted R²
n = len(y_test)
p = 1  # number of features (assuming one predictor; adjust as needed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R²: {adj_r2:.4f}")
```

**Explanation:**  
A low R² (e.g., 0.05) is typical for return prediction. It does not mean the model is useless; it may still be good enough to support a trading signal. Be careful with R² on price levels: because prices are highly autocorrelated, even a "no change" forecast achieves an R² close to 1, so a high value there says little about forecasting skill.
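The contrast between price-level R² and return R² can be illustrated on simulated data. The sketch below generates a random-walk price path (an assumption made for illustration, not a model of NEPSE) and scores the same "no change" forecast in both spaces, using a small hand-written `r2` helper.

```python
import numpy as np

def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.02, size=1000)   # simulated daily returns (fractions)
prices = 100 * np.cumprod(1 + returns)       # simulated price path

# "No change" forecast in price space: tomorrow's price equals today's price
r2_prices = r2(prices[1:], prices[:-1])

# The same forecast in return space predicts a return of 0 every day
actual_returns = returns[1:]
r2_returns = r2(actual_returns, np.zeros_like(actual_returns))

print(f"R² on prices:  {r2_prices:.4f}")
print(f"R² on returns: {r2_returns:.4f}")
```

The forecast contains no information about the next move, yet it looks excellent on price levels. Evaluating on returns, or against a naive baseline, removes that illusion.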

---

## **31.7 Mean Absolute Scaled Error (MASE)**

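MASE compares the model's MAE to the MAE of a naive benchmark forecast (e.g., the previous observation, or the seasonal naive). It was proposed by Hyndman and Koehler (2006) to overcome scale dependence and issues with MAPE. In their definition the denominator is the in‑sample (training‑set) MAE of the one‑step naive forecast; the code in this section evaluates the naive benchmark on the test set instead, which is a common simplification.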

**Formula:**  
`MASE = MAE_model / MAE_naive`

A common naive benchmark for non‑seasonal time series is the **persistence forecast**: predict the next value as the last observed value. For seasonal series, the naive forecast is the value from the same season last period.

**Interpretation:** MASE < 1 means the model is better than the naive forecast. MASE > 1 means it is worse. It is scale‑free and can be averaged across series.

### **31.7.1 Computing MASE for NEPSE Returns**

```python
def persistence_forecast(y_train, y_test):
    """Naive one-step forecast: each prediction is the previous actual value."""
    y_train, y_test = np.asarray(y_train), np.asarray(y_test)
    # The first test prediction uses the last training observation
    return np.concatenate(([y_train[-1]], y_test[:-1]))

# For demonstration, create a separate synthetic training series.
# In practice, y_train holds the returns that precede the test period
# and must not overlap with y_test.
y_train = np.random.randn(400) * 2

# Compute naive forecast errors on the test set
y_naive = persistence_forecast(y_train, y_test)
mae_naive = mean_absolute_error(y_test, y_naive)

mae_model = mean_absolute_error(y_test, y_pred)
mase = mae_model / mae_naive
print(f"MAE (model): {mae_model:.4f}, MAE (naive): {mae_naive:.4f}")
print(f"MASE: {mase:.4f}")

**Explanation:**  
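If MASE = 0.8, our model's MAE is 20% lower than that of simply predicting the last observed return. For returns this is a modest bar rather than strong evidence: daily returns typically show little persistence, so yesterday's return is a weak forecast. It is worth also comparing against predicting zero or the historical mean before concluding that the model captures real dynamics.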
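For comparison, the sketch below follows the Hyndman and Koehler scaling, where the denominator is the in-sample MAE of the one-step naive forecast on the training data. The function name `mase_in_sample` and the tiny arrays are illustrative assumptions.

```python
import numpy as np

def mase_in_sample(y_train, y_test, y_pred):
    """Test-set MAE divided by the in-sample MAE of the one-step naive forecast."""
    y_train, y_test, y_pred = (
        np.asarray(a, dtype=float) for a in (y_train, y_test, y_pred)
    )
    mae_model = np.mean(np.abs(y_test - y_pred))
    # In-sample naive forecast: y[t-1] predicts y[t], so errors are first differences
    mae_naive = np.mean(np.abs(np.diff(y_train)))
    return mae_model / mae_naive

# Hypothetical values chosen so the arithmetic is easy to follow
y_train = np.array([1.0, 3.0, 2.0, 4.0, 3.0])   # |diffs| = 2, 1, 2, 1 -> scale 1.5
y_test = np.array([2.0, 5.0])
y_pred = np.array([2.5, 4.0])                   # |errors| = 0.5, 1.0 -> MAE 0.75

result = mase_in_sample(y_train, y_test, y_pred)
print(f"MASE (in-sample scaling): {result:.2f}")  # 0.75 / 1.5 = 0.50
```

Scaling by the training data keeps the denominator fixed across models and test windows, which makes MASE values easier to compare.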

---

## **31.8 Choosing the Right Metric**

The choice of metric depends on the business objective:

- If you care about the typical error magnitude (e.g., for setting stop‑loss levels), use **MAE** or **RMSE**.
- If large errors are disproportionately costly (e.g., a huge misprediction could cause a large loss), **RMSE** is more appropriate because it penalizes large errors.
- If you want to compare across different stocks or time periods, use a relative measure like **MAPE** or **sMAPE**, but be cautious with near‑zero values.
- If you want to know how much variance is explained, **R²** is useful, but low values are expected for returns.
- If you need to benchmark against a simple rule (like "tomorrow will be the same as today"), **MASE** directly tells you if your model adds value.

In practice, report multiple metrics to give a comprehensive view. For the NEPSE system, we might focus on MAE (interpretable in percentage points) and MASE (to ensure we beat a naive baseline). For trading strategies, directional accuracy (Chapter 32) might be even more important.
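One way to report several metrics together is a small helper that returns them in a dictionary. The sketch below is a minimal version under stated assumptions: the name `regression_report` is hypothetical, the data are simulated, and MASE uses in-sample naive scaling.

```python
import numpy as np

def regression_report(y_true, y_pred, y_train):
    """Return MAE, RMSE, R² and MASE (in-sample naive scaling) in one dict."""
    y_true, y_pred, y_train = (
        np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train)
    )
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - np.sum(err ** 2) / ss_tot
    naive_scale = np.mean(np.abs(np.diff(y_train)))  # one-step naive MAE on training data
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MASE": mae / naive_scale}

# Simulated returns in percentage points (illustrative only)
rng = np.random.default_rng(42)
y_train = rng.normal(0, 2, 400)
y_true = rng.normal(0, 2, 100)
y_pred = y_true + rng.normal(0, 0.5, 100)

report = regression_report(y_true, y_pred, y_train)
for name, value in report.items():
    print(f"{name:>5}: {value:.4f}")
```

Keeping the metrics in one structure makes it easy to log them side by side for every model and test window.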

---

## **31.9 Pitfalls and Considerations**

### **31.9.1 Outliers**

One extreme error can dominate MSE and RMSE. Inspect residuals to see if a few points are driving the metric. If outliers are genuine (e.g., market crashes), they should be included; if they are data errors, clean them.
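The effect of a single extreme error can be quantified directly. The sketch below uses made-up errors (99 misses of 0.5 points plus one 10-point miss) and reports how much of the total squared error that one observation accounts for.

```python
import numpy as np

def mae(errors):
    return np.mean(np.abs(errors))

def rmse(errors):
    return np.sqrt(np.mean(np.square(errors)))

# Hypothetical errors in percentage points (illustrative only)
errors = np.full(99, 0.5)                        # 99 moderate misses
errors_with_outlier = np.append(errors, 10.0)    # plus one extreme miss

mae_out = mae(errors_with_outlier)
rmse_out = rmse(errors_with_outlier)
outlier_share = 10.0 ** 2 / np.sum(errors_with_outlier ** 2)

print(f"Without outlier: MAE = {mae(errors):.3f}, RMSE = {rmse(errors):.3f}")
print(f"With outlier:    MAE = {mae_out:.3f}, RMSE = {rmse_out:.3f}")
print(f"Share of squared error from the outlier: {outlier_share:.1%}")
```

MAE rises only slightly, while RMSE more than doubles and most of the squared error comes from one point.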

### **31.9.2 Scale Dependence**

MAE and RMSE are scale‑dependent. If you switch from predicting returns (small numbers) to predicting prices (large numbers), the metrics will not be comparable. Always use the same units or rely on relative metrics.
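Scale dependence shows up even with a simple change of units. In the sketch below (simulated data, hypothetical helper names), the same forecasts are expressed in percentage points and as fractions: MAE changes by the conversion factor, while MASE does not change.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mase(y_train, y_true, y_pred):
    """MAE scaled by the in-sample MAE of the one-step naive forecast."""
    return mae(y_true, y_pred) / np.mean(np.abs(np.diff(y_train)))

# Simulated returns in percentage points (illustrative only)
rng = np.random.default_rng(7)
y_train = rng.normal(0, 2, 200)
y_test = rng.normal(0, 2, 50)
y_pred = y_test + rng.normal(0, 0.5, 50)

scale = 0.01  # convert percentage points to fractions
mae_pp = mae(y_test, y_pred)
mae_frac = mae(y_test * scale, y_pred * scale)
mase_pp = mase(y_train, y_test, y_pred)
mase_frac = mase(y_train * scale, y_test * scale, y_pred * scale)

print(f"MAE  (points): {mae_pp:.4f}   MAE  (fractions): {mae_frac:.6f}")
print(f"MASE (points): {mase_pp:.4f}   MASE (fractions): {mase_frac:.4f}")
```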

### **31.9.3 Percentage Errors with Small Values**

MAPE can blow up when actual values are near zero. For returns, this happens regularly. Consider sMAPE, which is at least bounded, or MASE, which avoids dividing by individual actual values.

### **31.9.4 Overfitting to Metrics**

Tuning a model to minimize a specific metric on a validation set can lead to overfitting to that validation set. Confirm the final model on a held‑out test set that played no part in tuning, and consider multiple metrics.

### **31.9.5 Statistical Significance**

A lower RMSE on one test set does not guarantee the model is truly better. Use statistical tests (e.g., Diebold‑Mariano) to compare forecasts.
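As a rough illustration of the idea, the sketch below implements a simplified Diebold‑Mariano statistic for one-step forecasts under squared-error loss. It assumes the loss differentials are serially uncorrelated and uses a normal approximation for the p-value. Longer horizons need an autocorrelation-robust variance, which this sketch omits. The function name and the simulated data are illustrative.

```python
import math
import numpy as np

def diebold_mariano(y_true, pred_a, pred_b):
    """Simplified DM test (one-step forecasts, squared-error loss, no HAC correction).

    Returns (statistic, two-sided p-value). A negative statistic means
    forecast A has the lower average loss.
    """
    y_true, pred_a, pred_b = (
        np.asarray(a, dtype=float) for a in (y_true, pred_a, pred_b)
    )
    d = (y_true - pred_a) ** 2 - (y_true - pred_b) ** 2   # loss differential
    stat = d.mean() / math.sqrt(d.var(ddof=1) / len(d))
    p_value = math.erfc(abs(stat) / math.sqrt(2))          # standard normal, two-sided
    return stat, p_value

# Simulated data (illustrative only)
rng = np.random.default_rng(3)
y = rng.normal(0, 2, 250)
pred_a = y + rng.normal(0, 0.5, 250)   # more accurate forecast
pred_b = y + rng.normal(0, 1.0, 250)   # less accurate forecast

stat, p_value = diebold_mariano(y, pred_a, pred_b)
print(f"DM statistic: {stat:.2f}, p-value: {p_value:.4g}")
```

A small p-value suggests the difference in accuracy is unlikely to be due to chance on this test set. It does not show that the advantage will persist in a different market regime.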

---

## **31.10 Chapter Summary**

In this chapter, we covered the essential regression evaluation metrics and applied them to the NEPSE return prediction problem.

- **MAE** gives the average absolute error in the original units.
- **MSE/RMSE** penalize large errors more heavily.
- **MAPE/sMAPE** express errors as percentages, useful for scale‑free comparison.
- **R²** measures variance explained, but low values are typical for noisy returns.
- **Adjusted R²** penalizes extra features.
- **MASE** compares the model to a naive persistence forecast, indicating whether the model adds value.

We also discussed how to choose the right metric based on the business goal and the pitfalls to avoid.

### **Practical Takeaways for the NEPSE System:**

- Report MAE (or RMSE) in percentage points for return forecasts.
- Always compute MASE to ensure the model outperforms a simple "no change" forecast.
- Use sMAPE if you need a relative measure and are concerned about zeros.
- Remember that even a low R² can be acceptable if the model provides a profitable trading signal.
- Validate metrics on out‑of‑sample data and consider multiple metrics to get a full picture.

In the next chapter, **Chapter 32: Evaluation Metrics for Classification**, we will shift focus to metrics for binary and multi‑class predictions, such as accuracy, precision, recall, F1‑score, ROC‑AUC, and others, with applications to direction forecasting.

---

**End of Chapter 31**