# Day 19: Model Evaluation Metrics
## Comprehensive Forecast Assessment Framework

**Objective:** Master quantitative evaluation metrics for time series forecasts. Implement and compare MAE, RMSE, MAPE, SMAPE, and mean directional accuracy (MDA), then use them to choose among candidate forecasting models.

**Key Concepts:**
- **Accuracy metrics**: MAE, RMSE, MAPE - measuring prediction error
- **Financial metrics**: MDA (mean directional accuracy) - trading perspective
- **Robustness**: Understanding metric sensitivity to outliers
- **Scale-dependence**: When to use percentage vs absolute metrics
- **Metric selection**: Choosing appropriate metrics for your use case
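
To make the robustness point concrete, here is a minimal sketch comparing MAE and RMSE on a small set of made-up errors, with and without one large miss. The error values and the helper names `mae` and `rmse` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical forecast errors (illustrative values, not from the gold data)
errors_clean = np.array([1.0, -1.0, 2.0, -2.0, 1.0, -1.0])
errors_outlier = errors_clean.copy()
errors_outlier[-1] = -30.0  # replace the last error with one large miss

def mae(e):
    """Mean absolute error of an error vector."""
    return np.mean(np.abs(e))

def rmse(e):
    """Root mean squared error of an error vector."""
    return np.sqrt(np.mean(e ** 2))

for label, e in [("clean", errors_clean), ("with outlier", errors_outlier)]:
    print(f"{label:13s} MAE={mae(e):6.2f}  RMSE={rmse(e):6.2f}  RMSE/MAE={rmse(e) / mae(e):4.2f}")
```

Both metrics rise when the outlier is added, but RMSE rises faster, so the RMSE/MAE ratio grows. A large gap between the two on real forecasts is a hint that a few big errors dominate.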

In [None]:
# 1. Import Required Libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.subplots import make_subplots

from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from scipy import stats
import gc

print("✓ All libraries imported successfully")

In [None]:
# Load data
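# Try each known location in turn; catch only missing-file errors so that
# genuine parsing problems are not silently swallowed.
try:
    df = pd.read_csv('./data/gold_prices.csv', parse_dates=['Date'])
except FileNotFoundError:
    try:
        df = pd.read_csv('../data/gold_prices.csv', parse_dates=['Date'])
    except FileNotFoundError:
        df = pd.read_csv('/home/tads/Work/TADSPROJ/30-days-of-tsa/day19/data/gold_prices.csv', 
                        parse_dates=['Date'])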

# Clean data
if 'Price' not in df.columns:
    df = df.rename(columns={'Adj Close': 'Price'})
df = df.drop_duplicates(subset=['Date']).sort_values('Date').reset_index(drop=True)

print(f"✓ Data loaded: {len(df)} observations")
print(f"  Date range: {df['Date'].min().date()} to {df['Date'].max().date()}")
print(f"  Price range: ${df['Price'].min():.2f} to ${df['Price'].max():.2f}")

# Train-test split (80/20)
train_size = int(len(df) * 0.8)
train_df = df[:train_size].copy()
test_df = df[train_size:].copy()

print(f"\nTrain-Test Split:")
print(f"  Train: {len(train_df)} obs | {train_df['Date'].min().date()} to {train_df['Date'].max().date()}")
print(f"  Test:  {len(test_df)} obs | {test_df['Date'].min().date()} to {test_df['Date'].max().date()}")

train_prices = train_df['Price'].values
test_prices = test_df['Price'].values

## Section 2: Implement Evaluation Metrics

Define comprehensive evaluation metrics for time series forecasts:
- **MAE**: Mean Absolute Error - average magnitude of errors
- **RMSE**: Root Mean Squared Error - penalizes large errors  
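- **MAPE**: Mean Absolute Percentage Error - scale-independent, but undefined when an actual value is zero and unstable when actuals are near zero
- **SMAPE**: Symmetric MAPE - bounded at 200%, so near-zero actual values cannot inflate it without limit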
- **ME/MPE**: Mean Error/Percentage Error - measures bias
- **MDA**: Mean Directional Accuracy - useful for trading
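
For reference, the metrics above can be written out as follows. The notation is introduced here for illustration: $y_t$ is the actual value, $\hat{y}_t$ the forecast, $e_t = y_t - \hat{y}_t$ the error, and $n$ the number of test observations. The MDA line follows the difference-based definition used in `calculate_metrics`.

$$
\begin{aligned}
\text{MAE} &= \frac{1}{n}\sum_{t=1}^{n} |e_t| \\
\text{RMSE} &= \sqrt{\frac{1}{n}\sum_{t=1}^{n} e_t^2} \\
\text{MAPE} &= \frac{100}{n}\sum_{t=1}^{n} \left|\frac{e_t}{y_t}\right| \\
\text{SMAPE} &= \frac{100}{n}\sum_{t=1}^{n} \frac{2\,|e_t|}{|y_t| + |\hat{y}_t|} \\
\text{ME} &= \frac{1}{n}\sum_{t=1}^{n} e_t \\
\text{MPE} &= \frac{100}{n}\sum_{t=1}^{n} \frac{e_t}{y_t} \\
\text{MDA} &= \frac{100}{n-1}\sum_{t=2}^{n} \mathbf{1}\!\left[(y_t > y_{t-1}) = (\hat{y}_t > \hat{y}_{t-1})\right]
\end{aligned}
$$

With $e_t = y_t - \hat{y}_t$, a positive ME or MPE means the model under-forecasts on average, and a negative value means it over-forecasts.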

In [None]:
def calculate_metrics(actual, forecast):
    """
    Calculate comprehensive evaluation metrics for time series forecast
    
    Parameters:
    -----------
    actual : array-like
        Actual values
    forecast : array-like
        Forecasted values
    
    Returns:
    --------
    dict : Dictionary with all calculated metrics
    """
    
    actual = np.asarray(actual)
    forecast = np.asarray(forecast)
    errors = actual - forecast
    
    # 1. Mean Absolute Error (MAE)
    mae = np.mean(np.abs(errors))
    
    # 2. Root Mean Squared Error (RMSE)
    rmse = np.sqrt(np.mean(errors ** 2))
    
    # 3. Mean Absolute Percentage Error (MAPE)
    mape = np.mean(np.abs(errors / actual)) * 100
    
    # 4. Symmetric MAPE (SMAPE)
    smape = np.mean(2 * np.abs(errors) / (np.abs(actual) + np.abs(forecast))) * 100
    
    # 5. Mean Error (ME) - bias
    me = np.mean(errors)
    
    # 6. Mean Percentage Error (MPE)
    mpe = np.mean(errors / actual) * 100
    
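    # 7. Mean Directional Accuracy (MDA)
    # Compares the direction of consecutive changes in the actuals with that of the forecasts.
    # Note: a flat forecast has zero differences, so it always counts as "not up".
    actual_direction = np.diff(actual) > 0
    forecast_direction = np.diff(forecast) > 0
    mda = np.mean(actual_direction == forecast_direction) * 100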
    
    # 8. Theil's U Statistic (normalized vs naive)
    naive_forecast = actual[:-1]  # Last value carried forward
    theil_u = np.sqrt(np.sum((actual[1:] - forecast[1:]) ** 2) / np.sum((actual[1:] - naive_forecast) ** 2))
    
    # 9. Error statistics
    error_std = np.std(errors)
    error_min = np.min(errors)
    error_max = np.max(errors)
    
    # 10. Directional Accuracy (simple direction match)
    da = mda  # Same as MDA
    
    return {
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape,
        'SMAPE': smape,
        'ME': me,
        'MPE': mpe,
        'MDA': mda,
        'Theil_U': theil_u,
        'Error_Std': error_std,
        'Error_Min': error_min,
        'Error_Max': error_max,
        'DA': da
    }

print("✓ Metrics calculation functions defined")

## Section 3: Fit Multiple Forecasting Models

Compare performance across different model types:
- **ARIMA(0,1,0)**: Random walk (baseline)
- **ARIMA(1,1,0)**: AR component
- **ARIMA(0,1,1)**: MA component
- **Naive**: Last value carried forward
- **Seasonal Naive**: 252-day annual baseline
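
The seasonal naive baseline in the list above can be sketched as a small helper that repeats the last full season of training data across the forecast horizon. The function name `seasonal_naive` and the toy series are hypothetical and used only for illustration.

```python
import numpy as np

def seasonal_naive(train, horizon, period):
    """Repeat the last `period` training values until `horizon` values are produced."""
    train = np.asarray(train, dtype=float)
    if len(train) < period:
        raise ValueError("need at least one full season of training data")
    last_season = train[-period:]          # most recent complete season
    reps = int(np.ceil(horizon / period))  # enough repeats to cover the horizon
    return np.tile(last_season, reps)[:horizon]

# Toy series with a period of 4 (illustrative values only)
toy_train = np.array([10, 11, 12, 13, 20, 21, 22, 23])
toy_forecast = seasonal_naive(toy_train, horizon=6, period=4)
print(toy_forecast)  # → [20. 21. 22. 23. 20. 21.]
```

The key detail is the slice `train[-period:]`, which takes everything from `period` steps before the end up to the end of the training series.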

In [None]:
print("Fitting forecasting models...\n")

forecasts = {}

# 1. ARIMA(0,1,0) - Random walk
try:
    model_arima010 = ARIMA(train_prices, order=(0,1,0))
    model_arima010_fit = model_arima010.fit()
    forecasts['ARIMA(0,1,0)'] = model_arima010_fit.forecast(steps=len(test_prices))
    print("✓ ARIMA(0,1,0) fitted")
except Exception as e:
    print(f"✗ ARIMA(0,1,0) failed: {str(e)[:50]}")

# 2. ARIMA(1,1,0) - AR component
try:
    model_arima110 = ARIMA(train_prices, order=(1,1,0))
    model_arima110_fit = model_arima110.fit()
    forecasts['ARIMA(1,1,0)'] = model_arima110_fit.forecast(steps=len(test_prices))
    print("✓ ARIMA(1,1,0) fitted")
except Exception as e:
    print(f"✗ ARIMA(1,1,0) failed: {str(e)[:50]}")

# 3. ARIMA(0,1,1) - MA component
try:
    model_arima011 = ARIMA(train_prices, order=(0,1,1))
    model_arima011_fit = model_arima011.fit()
    forecasts['ARIMA(0,1,1)'] = model_arima011_fit.forecast(steps=len(test_prices))
    print("✓ ARIMA(0,1,1) fitted")
except Exception as e:
    print(f"✗ ARIMA(0,1,1) failed: {str(e)[:50]}")

# 4. Naive forecast (last value)
forecasts['Naive'] = np.full(len(test_prices), train_prices[-1])
print("✓ Naive fitted (last value)")

# 5. Seasonal Naive (252-day annual)
seasonal_period = 252
if len(train_prices) >= seasonal_period:
    seasonal_naive_forecast = train_prices[-seasonal_period:]  # last full season of training data
    # Repeat to match test length
    seasonal_naive = np.tile(seasonal_naive_forecast, int(np.ceil(len(test_prices) / seasonal_period)))
    forecasts['Seasonal Naive'] = seasonal_naive[:len(test_prices)]
    print(f"✓ Seasonal Naive fitted (period={seasonal_period})")

print(f"\n✓ All models fitted. Total models: {len(forecasts)}")

## Section 4: Calculate Metrics for All Models

Evaluate each forecast using all implemented metrics.

In [None]:
# Calculate metrics for all models
all_metrics = {}

for model_name, forecast in forecasts.items():
    metrics = calculate_metrics(test_prices, forecast)
    all_metrics[model_name] = metrics

# Create metrics dataframe
metrics_df = pd.DataFrame(all_metrics).T

# Format for display
metrics_display = metrics_df.copy()
metrics_display['MAE'] = metrics_display['MAE'].apply(lambda x: f"{x:.2f}")
metrics_display['RMSE'] = metrics_display['RMSE'].apply(lambda x: f"{x:.2f}")
metrics_display['MAPE'] = metrics_display['MAPE'].apply(lambda x: f"{x:.2f}%")
metrics_display['SMAPE'] = metrics_display['SMAPE'].apply(lambda x: f"{x:.2f}%")
metrics_display['ME'] = metrics_display['ME'].apply(lambda x: f"{x:.2f}")
metrics_display['MPE'] = metrics_display['MPE'].apply(lambda x: f"{x:.2f}%")
metrics_display['MDA'] = metrics_display['MDA'].apply(lambda x: f"{x:.2f}%")
metrics_display['DA'] = metrics_display['DA'].apply(lambda x: f"{x:.2f}%")
metrics_display['Theil_U'] = metrics_display['Theil_U'].apply(lambda x: f"{x:.4f}")
metrics_display['Error_Std'] = metrics_display['Error_Std'].apply(lambda x: f"{x:.2f}")

print("\n" + "="*120)
print("METRICS COMPARISON TABLE")
print("="*120)
print(metrics_display.to_string())
print("="*120)

## Section 5: Error Distribution Analysis

Analyze properties of forecast errors: distribution, skewness, kurtosis, and bounds.
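
As a point of comparison for the ±1σ and ±2σ coverage figures, here is a minimal sketch on synthetic, normally distributed errors. The synthetic data and the helper name `sigma_coverage` are assumptions made for illustration; real forecast errors are often skewed or heavy-tailed and will deviate from these benchmarks.

```python
import numpy as np
from scipy import stats

# Synthetic errors drawn from a normal distribution (assumption for illustration)
rng = np.random.default_rng(0)
synthetic_errors = rng.normal(loc=0.0, scale=5.0, size=10_000)

def sigma_coverage(errors, k):
    """Percentage of errors within k standard deviations of the mean error."""
    centered = errors - np.mean(errors)
    return np.mean(np.abs(centered) <= k * np.std(errors)) * 100

cov_1 = sigma_coverage(synthetic_errors, 1)
cov_2 = sigma_coverage(synthetic_errors, 2)
skewness = stats.skew(synthetic_errors)
excess_kurtosis = stats.kurtosis(synthetic_errors)  # Fisher definition: normal -> 0

print(f"Within ±1σ: {cov_1:.1f}%  |  Within ±2σ: {cov_2:.1f}%")
print(f"Skewness: {skewness:.3f}  |  Excess kurtosis: {excess_kurtosis:.3f}")
```

For normal errors the coverage is roughly 68% and 95%, with skewness and excess kurtosis near zero. Much higher ±1σ coverage combined with large positive excess kurtosis suggests a few extreme errors are inflating σ.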

In [None]:
# Error distribution analysis
print("\n" + "="*120)
print("ERROR DISTRIBUTION ANALYSIS")
print("="*120)

for model_name, forecast in forecasts.items():
    errors = test_prices - forecast
    
    print(f"\n{model_name}:")
    print(f"  Mean error (bias):     {np.mean(errors):8.2f}")
    print(f"  Std dev of errors:     {np.std(errors):8.2f}")
    print(f"  Skewness:              {stats.skew(errors):8.4f}")
    print(f"  Excess kurtosis:       {stats.kurtosis(errors):8.4f}")
    print(f"  Min error (overest):   {np.min(errors):8.2f}")
    print(f"  Max error (underest):  {np.max(errors):8.2f}")
    
    # Error bounds
    mean_err = np.mean(errors)
    std_err = np.std(errors)
    bound_1sigma = (np.mean(np.abs(errors - mean_err) <= 1 * std_err)) * 100
    bound_2sigma = (np.mean(np.abs(errors - mean_err) <= 2 * std_err)) * 100
    
    print(f"  Errors within ±1σ:     {bound_1sigma:6.1f}%")
    print(f"  Errors within ±2σ:     {bound_2sigma:6.1f}%")

## Section 6: Comprehensive Visualizations

Create interactive plots comparing forecasts, errors, and distributions.

In [None]:
# Create comprehensive visualization
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=(
        'Forecast vs Actual',
        'MAE/RMSE Comparison',
        'Forecast Errors Over Time',
        'Error Distribution (First Model)',
        'Directional Accuracy',
        'MAPE/SMAPE Comparison'
    ),
    specs=[[{}, {}], [{}, {}], [{}, {}]],
    vertical_spacing=0.12,
    horizontal_spacing=0.10
)

# Convert test dates for x-axis
test_dates = test_df['Date'].values

# 1. Forecast vs Actual (top left)
fig.add_trace(
    go.Scatter(x=test_dates, y=test_prices, name='Actual', 
               line=dict(color='black', width=2), mode='lines'),
    row=1, col=1
)

colors = ['red', 'blue', 'green', 'purple', 'orange']
for (model_name, forecast), color in zip(forecasts.items(), colors):
    fig.add_trace(
        go.Scatter(x=test_dates, y=forecast, name=model_name,
                   line=dict(color=color, width=1, dash='dash'), mode='lines'),
        row=1, col=1
    )

# 2. MAE/RMSE Comparison (top right)
mae_values = [all_metrics[m]['MAE'] for m in forecasts.keys()]
rmse_values = [all_metrics[m]['RMSE'] for m in forecasts.keys()]

fig.add_trace(
    go.Bar(x=list(forecasts.keys()), y=mae_values, name='MAE',
           marker=dict(color='skyblue')),
    row=1, col=2
)
fig.add_trace(
    go.Bar(x=list(forecasts.keys()), y=rmse_values, name='RMSE',
           marker=dict(color='salmon')),
    row=1, col=2
)

# 3. Errors Over Time (middle left)
for model_name, forecast in forecasts.items():
    errors = test_prices - forecast
    fig.add_trace(
        go.Scatter(x=test_dates, y=errors, name=f'{model_name} Error',
                   mode='markers', marker=dict(size=3)),
        row=2, col=1
    )

# 4. Error Distribution (middle right) - for first model
first_model = list(forecasts.keys())[0]
errors_first = test_prices - forecasts[first_model]
fig.add_trace(
    go.Histogram(x=errors_first, name=f'{first_model} Error Dist',
                marker=dict(color='lightblue'), nbinsx=30),
    row=2, col=2
)

# 5. Directional Accuracy (bottom left)
mda_values = [all_metrics[m]['MDA'] for m in forecasts.keys()]
fig.add_trace(
    go.Bar(x=list(forecasts.keys()), y=mda_values, name='MDA',
           marker=dict(color='lightgreen')),
    row=3, col=1
)

# 6. MAPE/SMAPE Comparison (bottom right)
mape_values = [all_metrics[m]['MAPE'] for m in forecasts.keys()]
smape_values = [all_metrics[m]['SMAPE'] for m in forecasts.keys()]

fig.add_trace(
    go.Bar(x=list(forecasts.keys()), y=mape_values, name='MAPE',
           marker=dict(color='wheat')),
    row=3, col=2
)
fig.add_trace(
    go.Bar(x=list(forecasts.keys()), y=smape_values, name='SMAPE',
           marker=dict(color='plum')),
    row=3, col=2
)

# Update layout
fig.update_xaxes(title_text="Date", row=1, col=1)
fig.update_yaxes(title_text="Price ($)", row=1, col=1)

fig.update_xaxes(title_text="Model", row=1, col=2)
fig.update_yaxes(title_text="Error ($)", row=1, col=2)

fig.update_xaxes(title_text="Date", row=2, col=1)
fig.update_yaxes(title_text="Error ($)", row=2, col=1)

fig.update_xaxes(title_text="Error ($)", row=2, col=2)
fig.update_yaxes(title_text="Frequency", row=2, col=2)

fig.update_xaxes(title_text="Model", row=3, col=1)
fig.update_yaxes(title_text="MDA (%)", row=3, col=1)

fig.update_xaxes(title_text="Model", row=3, col=2)
fig.update_yaxes(title_text="Error (%)", row=3, col=2)

fig.update_layout(height=1200, width=1400, showlegend=True, hovermode='x unified')
fig.write_html("evaluation_metrics.html")
print("\n✓ Saved interactive visualization: evaluation_metrics.html")

## Section 7: Financial Context Analysis

Interpret metrics in financial context - implications for trading, risk management, and decision-making.
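
Directional accuracy can be defined in more than one way. The sketch below shows a common alternative that judges each forecast against the previous actual value, which is the question a trader faces: will the price be above or below where it is now? It assumes one-step-ahead forecasts, where the previous actual is known at forecast time. The function name `directional_accuracy` and the toy numbers are hypothetical.

```python
import numpy as np

def directional_accuracy(actual, forecast):
    """Percentage of steps where the forecast calls the move from the last actual correctly.

    Assumes one-step-ahead forecasts, so actual[t-1] is known when forecast[t] is made.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    actual_move = np.sign(actual[1:] - actual[:-1])       # realized direction
    predicted_move = np.sign(forecast[1:] - actual[:-1])  # direction implied by the forecast
    return np.mean(actual_move == predicted_move) * 100

# Toy example (illustrative values only)
toy_actual = np.array([100, 102, 101, 105, 104])
toy_pred = np.array([100, 101, 103, 104, 106])
toy_da = directional_accuracy(toy_actual, toy_pred)
print(f"Directional accuracy: {toy_da:.1f}%")  # → Directional accuracy: 50.0%
```

Under this definition a forecast equal to the last actual has sign 0, so it only matches when the price does not move at all. That keeps a flat forecast from being credited for every down day.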

In [None]:
# Detailed financial analysis
print("\n" + "="*120)
print("FINANCIAL CONTEXT ANALYSIS")
print("="*120)

# Find best model by each metric
print("\nBest Model by Criterion:")
print("-" * 50)

metrics_to_check = {
    'MAE': 'Lower is better (average absolute error)',
    'RMSE': 'Lower is better (penalizes large errors)',
    'MAPE': 'Lower is better (percentage error)',
    'MDA': 'Higher is better (directional accuracy)',
    'Theil_U': 'Lower is better (vs naive baseline)'
}

for metric, description in metrics_to_check.items():
    if metric in ['MDA']:  # Higher is better
        best_model = max(all_metrics, key=lambda x: all_metrics[x][metric])
    else:  # Lower is better
        best_model = min(all_metrics, key=lambda x: all_metrics[x][metric])
    
    value = all_metrics[best_model][metric]
    print(f"\n{metric:15s}: {best_model:20s} ({value:8.2f})")
    print(f"  {description}")

# Trading implications
print("\n" + "="*120)
print("TRADING IMPLICATIONS")
print("="*120)

for model_name, forecast in forecasts.items():
    errors = test_prices - forecast
    mda = all_metrics[model_name]['MDA']
    mae = all_metrics[model_name]['MAE']
    
    print(f"\n{model_name}:")
    print(f"  Direction accuracy: {mda:.1f}%")
    
    if mda > 55:
        print(f"    → Better than random (50%) - potentially profitable on directional signals")
    elif mda > 50:
        print(f"    → Slightly better than random")
    else:
        print(f"    → No better than random - avoid using for directional trading")
    
    # PnL simulation
    actual_direction = np.diff(test_prices) > 0
    forecast_direction = np.diff(forecast) > 0
    correct_directions = actual_direction == forecast_direction
    
    # Assume $1000 per correct direction, -$500 per wrong direction
    pnl = np.sum(correct_directions * 1000) - np.sum(~correct_directions * 500)
    win_rate = np.mean(correct_directions) * 100
    
    print(f"  Win rate: {win_rate:.1f}%")
    print(f"  Simulated PnL (+$1000 per hit / -$500 per miss): ${pnl:,.0f}")
    print(f"  Avg error magnitude (MAE): ${mae:.2f}")

# Risk analysis
print("\n" + "="*120)
print("RISK ANALYSIS")
print("="*120)

for model_name, forecast in forecasts.items():
    errors = test_prices - forecast
    max_error = np.max(np.abs(errors))
    percentile_95 = np.percentile(np.abs(errors), 95)
    percentile_99 = np.percentile(np.abs(errors), 99)
    
    avg_price = np.mean(test_prices)
    max_error_pct = (max_error / avg_price) * 100
    
    print(f"\n{model_name}:")
    print(f"  Max error:        ${max_error:8.2f} ({max_error_pct:5.1f}% of avg price)")
    print(f"  95th percentile:  ${percentile_95:8.2f}")
    print(f"  99th percentile:  ${percentile_99:8.2f}")
    
    if max_error > 20:
        print(f"  Risk level: HIGH (errors can exceed $20)")
    elif max_error > 10:
        print(f"  Risk level: MODERATE (errors can exceed $10)")
    else:
        print(f"  Risk level: LOW (max error does not exceed $10)")

## Section 8: Summary and Recommendations

Key insights and best practices for model selection and metric interpretation.

In [None]:
print("\n" + "="*120)
print("KEY INSIGHTS AND BEST PRACTICES")
print("="*120)

insights = """
1. **Metric Selection Matters**
   - Different metrics emphasize different aspects of forecast quality
   - MAE: Robust, interpretable average error
   - RMSE: Penalizes large errors (outlier sensitive)
   - MAPE: Scale-independent percentage error
   - MDA: Directional accuracy for trading

2. **No Single Best Metric**
   - Use complementary metrics (MAE + RMSE)
   - Always compare to baseline (naive forecast)
   - Consider domain-specific requirements

3. **Financial Context**
   - For trading: MDA (directional accuracy) most important
   - For risk management: RMSE and max error matter
   - For reporting: MAPE is most interpretable to stakeholders

4. **Common Pitfalls**
   - Relying on MAPE alone (fails on near-zero values)
   - Ignoring outliers (not all large errors are equal)
   - Comparing RMSE across different scales
   - Forgetting to compare to naive baseline

5. **Best Practices**
   - Report multiple metrics in tables
   - Visualize forecasts vs actual
   - Show error distributions (histograms, Q-Q plots)
   - Include confidence intervals around forecasts
   - Document assumptions and limitations
"""

print(insights)

print("\n" + "="*120)
print("METRIC SELECTION FRAMEWORK FOR GOLD PRICE FORECASTING")
print("="*120)

framework = """
For Portfolio Management:
  → Use RMSE (emphasizes large deviations)
  → Monitor error bounds (±1σ, ±2σ)
  → Focus on tail risk (max error)

For Trading Signals:
  → Use MDA (directional accuracy)
  → Need MDA > 55% for edge
  → Combine with risk management rules

For Internal Reporting:
  → Use MAE (easy to explain: "$X average error")
  → Use MAPE (percentage: "2% average error")
  → Show metrics table and visualization

For Stakeholder Communication:
  → Use MAPE ("Average error is 2.3%")
  → Use forecast vs actual plot
  → Highlight comparison to naive baseline
"""

print(framework)

print("\n" + "="*120)
print("✓ Day 19: Model Evaluation Metrics Complete!")
print("="*120)

## Section 1: Load and Prepare Data

Load the gold price data from Yahoo Finance, clean it, and create a train/test split for model evaluation.