# Forecasting Pipeline - Model Training and Evaluation

This notebook builds and evaluates machine learning models that predict Europe Base Port container prices one week ahead.

---

## Approach

1. **Load prepared data** from `data/processed/model_data.csv`
2. **Exclude current price** from features (prevent data leakage)
3. **Train-test split** using time-based split (80/20)
4. **Baseline models** using lagged prices for comparison
5. **Advanced models** with hyperparameter tuning
6. **Model evaluation** using RMSE, MAE, MAPE, R²
7. **Feature importance analysis**
8. **Final model selection and validation**

## Models to Test

- **Baseline**: Naive (last week's price), Moving Average (4-week)
- **Linear Models**: Linear Regression, Ridge, Lasso
- **Tree-based**: Random Forest, XGBoost, Gradient Boosting


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

Libraries imported successfully!


## Step 1: Load Prepared Model Data

In [2]:
# Load the prepared model data from the previous notebook
df_model = pd.read_csv('data/processed/model_data.csv', parse_dates=['Date'], index_col='Date')

print(f"Loaded model data: {df_model.shape[0]} rows, {df_model.shape[1]} columns")
print(f"Date range: {df_model.index.min().date()} to {df_model.index.max().date()}")
print(f"\nColumns: {list(df_model.columns)}")
print(f"\nMissing values:\n{df_model.isnull().sum()[df_model.isnull().sum() > 0]}")
print(f"\nFirst 5 rows:")
df_model.head()

Loaded model data: 267 rows, 199 columns
Date range: 2019-02-15 to 2025-04-25

Columns: ['SCFI_Index', 'Europe_Base_Price', 'global_total_events', 'global_disruption_events', 'extreme_crisis_events', 'high_velocity_media_events', 'black_swan_candidate_events', 'global_avg_impact', 'global_worst_event_impact', 'global_avg_sentiment', 'global_total_media_mentions', 'global_peak_event_media', 'maritime_conflict_events', 'infrastructure_attack_events', 'trade_restriction_events', 'protest_events', 'middle_east_disruption', 'asia_disruption', 'europe_disruption', 'russia_ukraine_disruption', 'egypt_disruption', 'yemen_disruption', 'unique_sources', 'Brent_Price', 'sh_portcalls_container', 'sh_portcalls_dry_bulk', 'sh_portcalls_general_cargo', 'sh_portcalls_roro', 'sh_portcalls_tanker', 'sh_portcalls_cargo', 'sh_portcalls', 'sh_import_container', 'sh_import_dry_bulk', 'sh_import_general_cargo', 'sh_import_roro', 'sh_import_tanker', 'sh_import_cargo', 'sh_import', 'sh_export_container', 'sh_e

Date,SCFI_Index,Europe_Base_Price,global_total_events,global_disruption_events,extreme_crisis_events,high_velocity_media_events,black_swan_candidate_events,global_avg_impact,global_worst_event_impact,global_avg_sentiment,...,trade_ais_export_general_cargo_lag_1w,trade_ais_import_roro_lag_1w,trade_ais_export_roro_lag_1w,trade_value_import_total_lag_1w,trade_value_export_total_lag_1w,trade_volume_export_total_lag_1w,trade_volume_import_total_lag_1w,trade_trade_value_lag_1w,trade_trade_volume_lag_1w,trade_ObjectId_lag_1w
2019-02-15,888.29,906,279818,75418,22028,11,1,0.634257,-10,-2.123476,...,2579928.0,156697.503171,192011.115264,85.738538,77.845838,76.693894,83.558004,80.556455,80.125949,2907.0
2019-02-22,847.75,831,275158,74994,20687,11,2,0.612086,-10,-1.919688,...,2579928.0,156697.503171,192011.115264,85.738538,77.845838,76.693894,83.558004,80.556455,80.125949,2907.0
2019-03-01,803.71,796,278990,74514,20917,12,4,0.661356,-10,-1.847598,...,2579928.0,156697.503171,192011.115264,85.738538,77.845838,76.693894,83.558004,80.556455,80.125949,2907.0
2019-03-08,766.92,754,250800,68848,18028,8,2,0.585994,-10,-1.956145,...,3292372.0,237832.57364,305087.79624,104.381826,100.640893,99.91674,103.260289,101.925655,101.588514,2914.0
2019-03-15,742.1,714,281385,84994,26589,16,1,0.250296,-10,-2.406428,...,3292372.0,237832.57364,305087.79624,104.381826,100.640893,99.91674,103.260289,101.925655,101.588514,2914.0


## Step 2: Prepare Features and Target

In [3]:
# Drop rows with missing target values
# The target column from 02_data_understanding.ipynb is 'price_1w_ahead'
df_clean = df_model.dropna(subset=['price_1w_ahead']).copy()

print(f"Rows after removing missing targets: {len(df_clean)}")

# Define target and features
target_col = 'price_1w_ahead'
y = df_clean[target_col]

# ====================================================================
# CRITICAL FIX: Exclude Europe_Base_Price to prevent data leakage!
# We predict NEXT week's price, so we can't use THIS week's price as a feature
# ====================================================================
exclude_cols = [
    target_col,  # Don't include target in features
    'Europe_Base_Price',  # CRITICAL: Exclude current week's price (data leakage!)
]

# Feature columns (exclude target and current price)
feature_cols = [col for col in df_clean.columns if col not in exclude_cols]

X = df_clean[feature_cols]

print(f"\n{'='*70}")
print("FEATURE SELECTION (DATA LEAKAGE PREVENTION)")
print(f"{'='*70}")
print(f"Target variable: {target_col}")
print(f"Excluded columns: {exclude_cols}")
print(f"Number of features: {len(feature_cols)}")
print(f"Features sample: {feature_cols[:10]}...") 
print(f"\n✓ Europe_Base_Price EXCLUDED to prevent data leakage")
print(f"✓ Using only lagged features and external data for prediction")
print(f"\nTarget statistics:")
print(y.describe())


Rows after removing missing targets: 267

FEATURE SELECTION (DATA LEAKAGE PREVENTION)
Target variable: price_1w_ahead
Excluded columns: ['price_1w_ahead', 'Europe_Base_Price']
Number of features: 197
Features sample: ['SCFI_Index', 'global_total_events', 'global_disruption_events', 'extreme_crisis_events', 'high_velocity_media_events', 'black_swan_candidate_events', 'global_avg_impact', 'global_worst_event_impact', 'global_avg_sentiment', 'global_total_media_mentions']...

✓ Europe_Base_Price EXCLUDED to prevent data leakage
✓ Using only lagged features and external data for prediction

Target statistics:
count     267.000000
mean     2929.112360
std      2393.071425
min       580.000000
25%       876.000000
50%      1997.000000
75%      4732.000000
max      7797.000000
Name: price_1w_ahead, dtype: float64
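
The target and lag columns are built in `02_data_understanding.ipynb`, which is not shown here. As a rough sketch of how such columns line up in time, the block below assumes (this is an assumption, not something this notebook confirms) that `price_1w_ahead` is the price shifted up by one weekly row and `price_lag_1w` is the price shifted down by one row. The five prices are the `Europe_Base_Price` values from the `head()` output in Step 1.

```python
import pandas as pd

# First five weekly prices from the head() output in Step 1.
dates = pd.date_range("2019-02-15", periods=5, freq="7D")
toy = pd.DataFrame({"Europe_Base_Price": [906, 831, 796, 754, 714]}, index=dates)

# Assumed construction (hypothetical; the real logic lives in the previous notebook).
toy["price_1w_ahead"] = toy["Europe_Base_Price"].shift(-1)  # price at t+1
toy["price_lag_1w"] = toy["Europe_Base_Price"].shift(1)     # price at t-1

print(toy)
```

If the columns are built this way, `price_lag_1w` on a given row is two weeks older than the target on that row, which is worth keeping in mind when reading the naive baseline in Step 6.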


## Step 3: Train-Test Split (Time-Based)

In [4]:
# Time-based split (80/20)
split_idx = int(len(df_clean) * 0.8)

X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y.iloc[:split_idx]
y_test = y.iloc[split_idx:]

print(f"Training set: {len(X_train)} samples ({X_train.index.min().date()} to {X_train.index.max().date()})")
print(f"Test set: {len(X_test)} samples ({X_test.index.min().date()} to {X_test.index.max().date()})")
print(f"\nTraining target range: ${y_train.min():.2f} to ${y_train.max():.2f}")
print(f"Test target range: ${y_test.min():.2f} to ${y_test.max():.2f}")

Training set: 213 samples (2019-02-15 to 2024-02-02)
Test set: 54 samples (2024-02-09 to 2025-04-25)

Training target range: $580.00 to $7797.00
Test target range: $1200.00 to $5051.00
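
The split above can also be wrapped in a small helper that refuses to run on an unsorted index. This is a minimal sketch; `time_based_split` is a hypothetical name, and the toy frame only mimics the row count of the real data (267), not its dates or values.

```python
import pandas as pd

def time_based_split(frame, train_frac=0.8):
    """Split a chronologically sorted frame into a leading train part and a trailing test part."""
    if not frame.index.is_monotonic_increasing:
        raise ValueError("index must be sorted in time order before splitting")
    split_idx = int(len(frame) * train_frac)
    return frame.iloc[:split_idx], frame.iloc[split_idx:]

# Toy frame with the same number of weekly rows as the notebook (267).
idx = pd.date_range("2019-02-15", periods=267, freq="7D")
toy = pd.DataFrame({"value": range(267)}, index=idx)

train, test = time_based_split(toy)
print(len(train), len(test))  # 213 54
```

Because the test rows all come after the training rows, no future observation can influence the fitted model.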


## Step 4: Feature Scaling

In [5]:
# Check for non-numeric columns before scaling
print("Checking data types in features...")
print(X_train.dtypes.value_counts())
print("\nNon-numeric columns:")
non_numeric_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()
print(non_numeric_cols)

# Remove non-numeric columns from features
if len(non_numeric_cols) > 0:
    print(f"\nRemoving {len(non_numeric_cols)} non-numeric columns: {non_numeric_cols}")
    X_train = X_train.select_dtypes(include=[np.number])
    X_test = X_test.select_dtypes(include=[np.number])
    feature_cols = X_train.columns.tolist()
    print(f"\nRemaining features: {len(feature_cols)}")

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("\nFeatures scaled successfully!")
print(f"Scaled feature shape: {X_train_scaled.shape}")

Checking data types in features...
float64    176
int64       19
object       2
Name: count, dtype: int64

Non-numeric columns:
['trade_region', 'trade_ISO3']

Removing 2 non-numeric columns: ['trade_region', 'trade_ISO3']

Remaining features: 195

Features scaled successfully!
Scaled feature shape: (213, 195)
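
The point of calling `fit_transform` on the training set and only `transform` on the test set is that the mean and standard deviation come from training rows alone. The sketch below reproduces that idea with plain NumPy on made-up data; to my understanding `StandardScaler` with default settings applies the same per-column formula, using the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
X_test_toy = rng.normal(loc=70.0, scale=10.0, size=(20, 3))  # later period, shifted level

# Statistics come from the training rows only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0)

X_train_z = (X_train_toy - mu) / sigma
X_test_z = (X_test_toy - mu) / sigma  # reuse training statistics; do not refit

print(X_train_z.mean(axis=0).round(6))
print(X_test_z.mean(axis=0).round(2))
```

The scaled test columns do not have zero mean, and that is expected: a level shift in the later period stays visible to the model instead of being normalized away.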


## Step 5: Define Evaluation Metrics

In [6]:
def evaluate_model(y_true, y_pred, model_name):
    """
    Calculate and display model performance metrics.
    """
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    r2 = r2_score(y_true, y_pred)
    
    metrics = {
        'Model': model_name,
        'RMSE': rmse,
        'MAE': mae,
        'MAPE': mape,
        'R²': r2
    }
    
    print(f"\n{'='*60}")
    print(f"{model_name} Performance")
    print(f"{'='*60}")
    print(f"RMSE: ${rmse:.2f}")
    print(f"MAE:  ${mae:.2f}")
    print(f"MAPE: {mape:.2f}%")
    print(f"R²:   {r2:.4f}")
    print(f"{'='*60}")
    
    return metrics

# Store all model results
results = []

print("Evaluation functions defined!")

Evaluation functions defined!
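
To make the four metrics concrete, here is a small worked example on three made-up prices, computed by hand with NumPy rather than through `evaluate_model`. The numbers are illustrative only.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

err = y_true - y_pred                       # [-10, 20, 0]
rmse = np.sqrt(np.mean(err ** 2))           # sqrt(500 / 3)
mae = np.mean(np.abs(err))                  # 30 / 3
mape = np.mean(np.abs(err / y_true)) * 100  # mean of 10%, 10%, 0%
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"RMSE={rmse:.2f} MAE={mae:.2f} MAPE={mape:.2f}% R2={r2:.4f}")
# RMSE=12.91 MAE=10.00 MAPE=6.67% R2=0.9893
```

MAPE divides by the actual value, so it is undefined when any actual price is zero; the target here has a minimum of 580, so that case does not arise.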


## Step 6: Baseline Models

In [7]:
# ====================================================================
# BASELINE MODELS - Using Lagged Price Features (No Data Leakage!)
# ====================================================================

# Check what lagged price features are available
print("Available lagged price columns:")
price_lag_cols = [col for col in df_clean.columns if 'price_lag' in col.lower()]
print(price_lag_cols)

# Baseline 1: Naive Forecast (use price from 1 week ago)
if 'price_lag_1w' in df_clean.columns:
    print(f"\n✓ Using 'price_lag_1w' for naive baseline")
    y_pred_naive = df_clean.loc[y_test.index, 'price_lag_1w']
    metrics_naive = evaluate_model(y_test, y_pred_naive, 'Naive (Last Week)')
    results.append(metrics_naive)
    
    # Baseline 2: Moving Average (4-week average of LAGGED prices)
    print("\n✓ Calculating 4-week moving average from lagged prices")
    y_pred_ma = []
    
    for test_idx in y_test.index:
        # Use the lagged prices available at prediction time.
        # Candidate columns are price_lag_1w through price_lag_4w; any that are
        # absent from df_clean are skipped, so the mean may cover fewer than four lags.
        lagged_prices = []
        for lag in [1, 2, 3, 4]:
            lag_col = f'price_lag_{lag}w'
            if lag_col in df_clean.columns and test_idx in df_clean.index:
                lagged_prices.append(df_clean.loc[test_idx, lag_col])
        
        if len(lagged_prices) >= 2:  # Need at least 2 lags
            y_pred_ma.append(np.mean(lagged_prices))
        elif 'price_lag_1w' in df_clean.columns:
            # Fallback to naive if not enough lags
            y_pred_ma.append(df_clean.loc[test_idx, 'price_lag_1w'])
        else:
            y_pred_ma.append(np.nan)
    
    y_pred_ma = pd.Series(y_pred_ma, index=y_test.index)
    
    # Remove any NaN predictions
    valid_idx = ~y_pred_ma.isna()
    if valid_idx.sum() > 0:
        metrics_ma = evaluate_model(y_test[valid_idx], y_pred_ma[valid_idx], '4-Week Moving Average')
        results.append(metrics_ma)
    else:
        print("⚠️  Could not create moving average baseline - insufficient lagged data")
else:
    print("\n❌ ERROR: price_lag_1w not found!")
    print("   This is created in 02_data_understanding.ipynb")
    print("   Please re-run that notebook to create lagged price features")
    print(f"\n   Available columns: {[c for c in df_clean.columns if 'price' in c.lower()]}")


Available lagged price columns:
['price_lag_1w', 'price_lag_2w', 'price_lag_4w']

✓ Using 'price_lag_1w' for naive baseline

Naive (Last Week) Performance
RMSE: $446.85
MAE:  $358.20
MAPE: 13.42%
R²:   0.8471

✓ Calculating 4-week moving average from lagged prices

4-Week Moving Average Performance
RMSE: $672.02
MAE:  $557.48
MAPE: 21.11%
R²:   0.6543
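
The output above lists only `price_lag_1w`, `price_lag_2w` and `price_lag_4w`, so the "4-week" average is in practice a mean over three lagged values. The loop can be written as a row-wise mean over whichever lag columns exist. This is a sketch on made-up rows, not a drop-in replacement: pandas skips `NaN` values by default, whereas `np.mean` in the loop would propagate them.

```python
import pandas as pd

# Toy rows using the lag columns reported in the output above; values are made up.
toy = pd.DataFrame(
    {
        "price_lag_1w": [1000.0, 1100.0, 1200.0],
        "price_lag_2w": [900.0, 1000.0, 1100.0],
        "price_lag_4w": [800.0, 900.0, None],
    }
)

# Keep only the lag columns that actually exist, as the loop does.
wanted = [f"price_lag_{k}w" for k in (1, 2, 3, 4)]
present = [c for c in wanted if c in toy.columns]

# Row-wise mean over the available lags; NaN values are skipped by default.
ma_pred = toy[present].mean(axis=1)
print(present)
print(ma_pred.tolist())  # [900.0, 1000.0, 1150.0]
```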


## Step 7: Linear Regression Models

In [8]:
# Linear Regression
print("\nTraining Linear Regression...")
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
y_pred_lr = lr_model.predict(X_test_scaled)
metrics_lr = evaluate_model(y_test, y_pred_lr, 'Linear Regression')
results.append(metrics_lr)

# Ridge Regression
print("\nTraining Ridge Regression...")
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)
y_pred_ridge = ridge_model.predict(X_test_scaled)
metrics_ridge = evaluate_model(y_test, y_pred_ridge, 'Ridge Regression')
results.append(metrics_ridge)

# Lasso Regression
print("\nTraining Lasso Regression...")
lasso_model = Lasso(alpha=1.0, max_iter=5000)
lasso_model.fit(X_train_scaled, y_train)
y_pred_lasso = lasso_model.predict(X_test_scaled)
metrics_lasso = evaluate_model(y_test, y_pred_lasso, 'Lasso Regression')
results.append(metrics_lasso)


Training Linear Regression...

Linear Regression Performance
RMSE: $758.39
MAE:  $652.47
MAPE: 26.17%
R²:   0.5597

Training Ridge Regression...

Ridge Regression Performance
RMSE: $859.34
MAE:  $731.41
MAPE: 27.11%
R²:   0.4347

Training Lasso Regression...

Lasso Regression Performance
RMSE: $667.79
MAE:  $583.28
MAPE: 23.34%
R²:   0.6586
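
For intuition about what `alpha` does in Ridge, the closed-form solution can be written out directly. The sketch below uses synthetic data with few rows per feature, loosely mirroring the 213 x 195 training matrix; `ridge_coefficients` is a hypothetical helper, and the intercept is handled by centering so that it is left unpenalized.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 60, 40  # few rows per feature, loosely like 213 x 195
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [3.0, -2.0, 1.0]
y = X @ true_w + rng.normal(scale=0.5, size=n_samples)

# Center so the intercept can be handled separately and left unpenalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(Xc, yc, alpha):
    """Closed-form ridge solution w = (X'X + alpha*I)^-1 X'y on centered data."""
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    return np.linalg.solve(gram, Xc.T @ yc)

for alpha in (0.0, 1.0, 100.0):
    w = ridge_coefficients(Xc, yc, alpha)
    print(f"alpha={alpha:>5}: ||w|| = {np.linalg.norm(w):.3f}")
```

With `alpha=0` this reduces to ordinary least squares, and larger values shrink the coefficient vector. Note that scikit-learn's Lasso objective scales the squared-error term by `1 / (2 * n_samples)`, so `alpha=1.0` in Ridge and `alpha=1.0` in Lasso are not equally strong penalties.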


## Step 8: Tree-Based Models

In [9]:
# Random Forest
print("\nTraining Random Forest...")
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
metrics_rf = evaluate_model(y_test, y_pred_rf, 'Random Forest')
results.append(metrics_rf)

# Gradient Boosting
print("\nTraining Gradient Boosting...")
gb_model = GradientBoostingRegressor(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42
)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
metrics_gb = evaluate_model(y_test, y_pred_gb, 'Gradient Boosting')
results.append(metrics_gb)

# XGBoost
print("\nTraining XGBoost...")
xgb_model = xgb.XGBRegressor(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    min_child_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
metrics_xgb = evaluate_model(y_test, y_pred_xgb, 'XGBoost')
results.append(metrics_xgb)


Training Random Forest...

Random Forest Performance
RMSE: $604.85
MAE:  $484.42
MAPE: 20.18%
R²:   0.7199

Training Gradient Boosting...

Gradient Boosting Performance
RMSE: $683.95
MAE:  $587.56
MAPE: 25.87%
R²:   0.6419

Training XGBoost...

XGBoost Performance
RMSE: $536.96
MAE:  $449.53
MAPE: 20.84%
R²:   0.7793


## Step 9: Model Comparison

In [10]:
# Create comparison dataframe
df_results = pd.DataFrame(results)
df_results = df_results.sort_values('RMSE')

print("\n" + "="*80)
print("MODEL PERFORMANCE COMPARISON")
print("="*80)
print(df_results.to_string(index=False))
print("="*80)

# Visualize model comparison
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('RMSE', 'MAE', 'MAPE (%)', 'R²')
)

fig.add_trace(go.Bar(x=df_results['Model'], y=df_results['RMSE'], name='RMSE'), row=1, col=1)
fig.add_trace(go.Bar(x=df_results['Model'], y=df_results['MAE'], name='MAE'), row=1, col=2)
fig.add_trace(go.Bar(x=df_results['Model'], y=df_results['MAPE'], name='MAPE'), row=2, col=1)
fig.add_trace(go.Bar(x=df_results['Model'], y=df_results['R²'], name='R²'), row=2, col=2)

fig.update_xaxes(tickangle=45)
fig.update_layout(height=800, showlegend=False, title_text="Model Performance Metrics")
fig.show()


MODEL PERFORMANCE COMPARISON
                Model       RMSE        MAE      MAPE       R²
    Naive (Last Week) 446.847206 358.203704 13.416669 0.847150
              XGBoost 536.956768 449.531544 20.836169 0.779288
        Random Forest 604.847892 484.420793 20.182208 0.719947
     Lasso Regression 667.786818 583.284944 23.341549 0.658631
4-Week Moving Average 672.023595 557.481481 21.110146 0.654286
    Gradient Boosting 683.952265 587.559274 25.873810 0.641904
    Linear Regression 758.385445 652.473107 26.171279 0.559721
     Ridge Regression 859.342758 731.408095 27.108211 0.434697
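
One compact way to read this table is a skill score relative to the naive baseline, `1 - RMSE_model / RMSE_naive`. This is a common forecasting convention rather than something the notebook computes; the RMSE values below are copied from the table above.

```python
import pandas as pd

# Test RMSE values copied from the comparison table above.
rmse = pd.Series(
    {
        "Naive (Last Week)": 446.847206,
        "XGBoost": 536.956768,
        "Random Forest": 604.847892,
        "Lasso Regression": 667.786818,
        "4-Week Moving Average": 672.023595,
        "Gradient Boosting": 683.952265,
        "Linear Regression": 758.385445,
        "Ridge Regression": 859.342758,
    }
)

# Skill score: 1 - RMSE_model / RMSE_naive. Positive means better than naive.
skill = 1 - rmse / rmse["Naive (Last Week)"]
print(skill.round(3).sort_values(ascending=False))
```

Every trained model has a negative score here; XGBoost, the closest, has an RMSE roughly 20% higher than the naive baseline.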


## Step 10: Feature Importance Analysis

In [11]:
# Get feature importances from XGBoost
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': xgb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print(f"\nTop 20 Most Important Features (XGBoost):")
print(feature_importance.head(20).to_string(index=False))

# Visualize top 20 features
fig = px.bar(
    feature_importance.head(20),
    x='Importance',
    y='Feature',
    orientation='h',
    title='Top 20 Feature Importances (XGBoost)'
)
fig.update_layout(height=600, yaxis={'categoryorder': 'total ascending'})
fig.show()


Top 20 Most Important Features (XGBoost):
                                       Feature  Importance
                                    SCFI_Index    0.599849
                                  price_lag_1w    0.160590
                                  price_lag_2w    0.123736
             trade_ais_portcalls_general_cargo    0.018771
              trade_ais_import_dry_bulk_lag_1w    0.010142
 choke_bab_el-mandeb_strait_capacity_container    0.006091
                                  sh_portcalls    0.005207
                  choke_malacca_strait_n_total    0.004899
                     sh_portcalls_cargo_lag_1w    0.004230
    choke_bab_el-mandeb_strait_capacity_lag_1w    0.004220
 choke_taiwan_strait_capacity_container_lag_1w    0.003331
choke_malacca_strait_capacity_container_lag_2w    0.002798
                       global_peak_event_media    0.002764
     choke_gibraltar_strait_capacity_container    0.002701
         trade_ais_import_general_cargo_lag_1w    0.002325
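
A cumulative sum shows how concentrated the importance is. The values below are the top five rows of the table above; the reading assumes the importances sum to 1 across all 195 features, which is how the scikit-learn-style `feature_importances_` attribute is normally reported.

```python
import pandas as pd

# Top entries copied from the XGBoost importance table above.
importance = pd.Series(
    {
        "SCFI_Index": 0.599849,
        "price_lag_1w": 0.160590,
        "price_lag_2w": 0.123736,
        "trade_ais_portcalls_general_cargo": 0.018771,
        "trade_ais_import_dry_bulk_lag_1w": 0.010142,
    }
)

cumulative = importance.cumsum()
print(cumulative.round(4))
```

Under that assumption, `SCFI_Index` and the two price lags together account for about 88% of the total importance, with the remaining features sharing the rest.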
             

## Step 11: Prediction Visualization

In [12]:
# Create prediction dataframe
pred_df = pd.DataFrame({
    'Date': y_test.index,
    'Actual': y_test.values,
    'Naive': y_pred_naive.values,
    'XGBoost': y_pred_xgb,
    'Random Forest': y_pred_rf,
    'Gradient Boosting': y_pred_gb
}).set_index('Date')

# Plot actual vs predicted
fig = go.Figure()

fig.add_trace(go.Scatter(x=pred_df.index, y=pred_df['Actual'], mode='lines+markers', 
                         name='Actual', line=dict(color='black', width=2)))
fig.add_trace(go.Scatter(x=pred_df.index, y=pred_df['XGBoost'], mode='lines+markers',
                         name='XGBoost', line=dict(dash='dash')))
fig.add_trace(go.Scatter(x=pred_df.index, y=pred_df['Random Forest'], mode='lines+markers',
                         name='Random Forest', line=dict(dash='dot')))
fig.add_trace(go.Scatter(x=pred_df.index, y=pred_df['Naive'], mode='lines',
                         name='Naive Baseline', line=dict(color='gray', dash='dash', width=1), opacity=0.5))

fig.update_layout(
    title='Actual vs Predicted Europe Base Port Prices (Test Set)',
    xaxis_title='Date',
    yaxis_title='Price (USD)',
    height=600,
    hovermode='x unified'
)
fig.show()

## Step 12: Hyperparameter Tuning

In [13]:
print("\nPerforming hyperparameter tuning for XGBoost...")

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'min_child_weight': [1, 3, 5]
}

# Use TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=3)

# Grid search
xgb_tuned = xgb.XGBRegressor(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(
    xgb_tuned,
    param_grid,
    cv=tscv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: ${-grid_search.best_score_:.2f}")

# Evaluate tuned model
y_pred_tuned = grid_search.predict(X_test)
metrics_tuned = evaluate_model(y_test, y_pred_tuned, 'XGBoost (Tuned)')
results.append(metrics_tuned)


Performing hyperparameter tuning for XGBoost...
Fitting 3 folds for each of 36 candidates, totalling 108 fits

Best parameters: {'learning_rate': 0.1, 'max_depth': 7, 'min_child_weight': 5, 'n_estimators': 100}
Best CV RMSE: $1794.79

XGBoost (Tuned) Performance
RMSE: $527.89
MAE:  $437.66
MAPE: 18.53%
R²:   0.7867
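
To see what the cross-validation folds look like, the sketch below runs `TimeSeriesSplit(n_splits=3)` on a placeholder array with the same number of rows as the training set (213) and also reproduces the candidate count from the log above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_train = 213  # training rows reported in Step 3
placeholder = np.zeros((n_train, 1))  # only the row count matters for the split

tscv = TimeSeriesSplit(n_splits=3)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(placeholder), start=1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train rows 0-{train_idx[-1]}, "
          f"validation rows {val_idx[0]}-{val_idx[-1]}")

# Grid size: product of the number of values per parameter in param_grid.
n_candidates = 2 * 3 * 2 * 3
print(n_candidates, n_candidates * tscv.get_n_splits())  # 36 108
```

The first fold trains on only 54 weeks. That may be one contributor to the gap between the CV RMSE ($1794.79) and the test RMSE ($527.89), though the notebook does not investigate this.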


## Step 13: Save Best Model and Results

In [14]:
import pickle
import os

# Create models directory
os.makedirs('models', exist_ok=True)

# Save best model
best_model_final = grid_search.best_estimator_
with open('models/best_model_xgboost.pkl', 'wb') as f:
    pickle.dump(best_model_final, f)
print("✓ Best model saved")

# Save scaler (used by the linear models only; the tree-based models were fit on unscaled features)
with open('models/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
print("✓ Scaler saved")

# Save feature names
with open('models/feature_names.pkl', 'wb') as f:
    pickle.dump(feature_cols, f)
print("✓ Feature names saved")

# Save results
df_results_final = pd.DataFrame(results).sort_values('RMSE')
df_results_final.to_csv('data/processed/model_comparison_results.csv', index=False)
print("✓ Results saved")

# Save predictions
pred_final = pd.DataFrame({
    'Date': y_test.index,
    'Actual': y_test.values,
    'Predicted': y_pred_tuned,
    'Error': y_test.values - y_pred_tuned
})
pred_final.to_csv('data/processed/test_predictions.csv', index=False)
print("✓ Predictions saved")

print("\n" + "="*80)
print("MODEL TRAINING COMPLETE!")
print("="*80)
print(f"Best Model: XGBoost (Tuned)")
print(f"Test RMSE: ${metrics_tuned['RMSE']:.2f}")
print(f"Test MAE: ${metrics_tuned['MAE']:.2f}")
print(f"Test MAPE: {metrics_tuned['MAPE']:.2f}%")
print(f"Test R²: {metrics_tuned['R²']:.4f}")
print("="*80)

✓ Best model saved
✓ Scaler saved
✓ Feature names saved
✓ Results saved
✓ Predictions saved

MODEL TRAINING COMPLETE!
Best Model: XGBoost (Tuned)
Test RMSE: $527.89
Test MAE: $437.66
Test MAPE: 18.53%
Test R²: 0.7867
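
When the saved artifacts are reloaded later, the feature list is what keeps new data aligned with the columns seen at training time. The sketch below round-trips a short stand-in list through `pickle` in a temporary directory and uses it to select and order columns; the three names and the new row are hypothetical placeholders for the real 195-feature list.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for the saved feature list; the real one has 195 names.
feature_names = ["SCFI_Index", "price_lag_1w", "price_lag_2w"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_names.pkl")
    with open(path, "wb") as f:
        pickle.dump(feature_names, f)
    with open(path, "rb") as f:
        loaded_names = pickle.load(f)

# New data may arrive with extra columns or a different column order.
new_rows = pd.DataFrame(
    {
        "price_lag_2w": [2100.0],
        "trade_region": ["EU"],
        "SCFI_Index": [1350.0],
        "price_lag_1w": [2200.0],
    }
)

missing = [c for c in loaded_names if c not in new_rows.columns]
if missing:
    raise KeyError(f"missing feature columns: {missing}")

X_new = new_rows[loaded_names]  # select and order columns as at training time
print(list(X_new.columns))
```

`X_new` could then be passed to the `predict` method of the model loaded from `models/best_model_xgboost.pkl`.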


In [15]:
# Display final comparison of ALL models
print("\n" + "="*80)
print("FINAL MODEL PERFORMANCE COMPARISON (ALL MODELS)")
print("="*80)
df_all_results = pd.DataFrame(results).sort_values('RMSE')
print(df_all_results.to_string(index=False))
print("="*80)

# Highlight the best model
best_model_name = df_all_results.iloc[0]['Model']
best_rmse = df_all_results.iloc[0]['RMSE']
best_r2 = df_all_results.iloc[0]['R²']

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"   RMSE: ${best_rmse:.2f}")
print(f"   R²: {best_r2:.4f}")

# Check if naive baseline is winning (this would be bad!)
if 'Naive' in best_model_name:
    print("\n⚠️  WARNING: Naive baseline is the best model!")
    print("   This suggests:")
    print("   1. Features may not be predictive")
    print("   2. Possible data leakage")
    print("   3. Need to review feature engineering in 02_data_understanding.ipynb")
else:
    print(f"\n✅ Good! Advanced models outperform the naive baseline.")
    
    # Calculate improvement over naive
    naive_rmse = df_all_results[df_all_results['Model'].str.contains('Naive', case=False)]['RMSE'].values
    if len(naive_rmse) > 0:
        improvement = ((naive_rmse[0] - best_rmse) / naive_rmse[0]) * 100
        print(f"   Improvement over naive: {improvement:.1f}% reduction in RMSE")


FINAL MODEL PERFORMANCE COMPARISON (ALL MODELS)
                Model       RMSE        MAE      MAPE       R²
    Naive (Last Week) 446.847206 358.203704 13.416669 0.847150
      XGBoost (Tuned) 527.889524 437.662491 18.528637 0.786679
              XGBoost 536.956768 449.531544 20.836169 0.779288
        Random Forest 604.847892 484.420793 20.182208 0.719947
     Lasso Regression 667.786818 583.284944 23.341549 0.658631
4-Week Moving Average 672.023595 557.481481 21.110146 0.654286
    Gradient Boosting 683.952265 587.559274 25.873810 0.641904
    Linear Regression 758.385445 652.473107 26.171279 0.559721
     Ridge Regression 859.342758 731.408095 27.108211 0.434697

🏆 BEST MODEL: Naive (Last Week)
   RMSE: $446.85
   R²: 0.8471

⚠️  WARNING: Naive baseline is the best model!
   This suggests:
   1. Features may not be predictive
   2. Possible data leakage
   3. Need to review feature engineering in 02_data_understanding.ipynb


## Final Model Comparison Summary

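Let's review all model performances to identify the best model. On the test set, the naive last-week baseline has the lowest RMSE ($446.85, R² 0.8471), ahead of the tuned XGBoost model ($527.89, R² 0.7867).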