# Challenger Model: Gradient Boosting (CatBoost)

**Goal**: Implement and verify the 'Challenger' model designed during the research phase.

**Key Strategies (The 5 Pillars)**:
1.  **Objective Function**: `Tweedie` (p=1.5) to handle the 31% share of zero-sales rows (sparsity).
2.  **Payday Effect**: Explicit `days_to_payday` countdown and 15th/End-of-Month flags.
3.  **Store Clustering**: Interaction features for High-Lift-Weekend clusters (14, 5, 11).
4.  **Structural Breaks**: Binary flag for the **Earthquake Period** (April-May 2016) to isolate the anomaly.
5.  **Exogenous**: Oil Price Trend (7-day rolling average).
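
To make the payday and structural-break strategies above concrete, here is a minimal sketch of how such features could be computed for a single date. The function names `days_to_payday` and `is_earthquake_period` are hypothetical and are not taken from `RetailFeatureEngineer`; the exact boundaries of the earthquake window (April 1 to May 31, 2016) are an assumption based on the "April-May 2016" description.

```python
import calendar
from datetime import date

# Assumed window for the earthquake flag (boundaries are illustrative).
EARTHQUAKE_START = date(2016, 4, 1)
EARTHQUAKE_END = date(2016, 5, 31)


def days_to_payday(d: date) -> int:
    """Days until the next payday, assuming paydays fall on the 15th
    and on the last day of each month. Returns 0 on a payday."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    if d.day <= 15:
        return 15 - d.day
    return last_day - d.day


def is_earthquake_period(d: date) -> int:
    """Binary flag: 1 if the date falls inside the assumed earthquake window."""
    return int(EARTHQUAKE_START <= d <= EARTHQUAKE_END)


print(days_to_payday(date(2017, 8, 1)))          # countdown to the 15th
print(days_to_payday(date(2017, 8, 16)))         # countdown to month end
print(is_earthquake_period(date(2016, 4, 16)))   # inside the window
```

A countdown gives the model a smooth signal in the days leading up to payday, while the binary flag lets the trees isolate the anomalous period instead of averaging it into the ordinary seasonal pattern.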

In [None]:
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor, Pool
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add root to path for imports
sys.path.append('..')
from src.features.features import RetailFeatureEngineer, create_lags

pd.set_option('display.max_columns', None)
sns.set_theme(style="whitegrid")

## 1. Load Canon Data

In [None]:
# Load optimized daily data
df = pd.read_parquet('../data/processed/daily_canon.parquet')

# Sort for correct lagging
df = df.sort_values(by=['store_nbr', 'family', 'date']).reset_index(drop=True)

print(f"Data Loaded: {df.shape}")

## 2. Feature Engineering Pipeline

In [None]:
print("Generating lags (this may be memory-intensive)...")
# 1. Create the heavy lag features first (can be slow)
df_features = create_lags(df, lags=[7, 14, 28])

# 2. Apply Retail Specific Features (Payday, Clusters)
engineer = RetailFeatureEngineer()
df_features = engineer.transform(df_features)

# 3. Handle Missing Values from Lags
df_features = df_features.dropna(subset=['sales_lag_28'])

# --- CRITICAL FIX FOR TWEEDIE ERROR ---
# Tweedie loss strictly forbids NaNs in the target variable (sales).
df_features = df_features.dropna(subset=['sales'])

print(f"Feature Set Ready: {df_features.shape}")
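
The internals of `create_lags` are not shown in this notebook. As a rough sketch of what a per-series lag step typically looks like, the hypothetical helper `add_group_lags` below shifts `sales` within each `(store_nbr, family)` group so that one series never leaks into another. The column naming `sales_lag_<n>` mirrors the `sales_lag_28` column used above; everything else is an assumption.

```python
import pandas as pd


def add_group_lags(frame, lags, group_cols=("store_nbr", "family"), target="sales"):
    """Add lagged copies of `target`, shifted within each group.

    Hypothetical stand-in for `create_lags`; the real implementation may differ.
    """
    out = frame.sort_values([*group_cols, "date"]).copy()
    grouped = out.groupby(list(group_cols))[target]
    for lag in lags:
        # shift() runs per group, so the first `lag` rows of each series are NaN
        out[f"{target}_lag_{lag}"] = grouped.shift(lag)
    return out


# Toy data: two stores, one product family, four days each
demo = pd.DataFrame({
    "store_nbr": [1] * 4 + [2] * 4,
    "family": ["BREAD"] * 8,
    "date": list(pd.date_range("2017-01-01", periods=4)) * 2,
    "sales": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
demo_lagged = add_group_lags(demo, lags=[1, 2])
print(demo_lagged)
```

The leading rows of every series have no history, which is why the cell above drops rows where `sales_lag_28` is missing.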

## 3. Train/Validation Split
**Validation**: the last 15 days of the training data (Aug 1 to Aug 15, 2017), chosen to match the duration of the test set.

In [None]:
split_date = '2017-08-01'
mask_train = df_features['date'] < split_date
mask_val = (df_features['date'] >= split_date) & (df_features['is_train_day'] == 1)

X = df_features.drop(columns=['sales', 'date', 'id', 'set', 'transactions', 'transactions_missing'])
y = df_features['sales']

# Categorical Features for CatBoost
cat_features = ['store_nbr', 'family', 'city', 'state', 'type', 'cluster']
# Ensure strings for cat features
X[cat_features] = X[cat_features].astype(str)

X_train = X[mask_train]
y_train = y[mask_train]

X_val = X[mask_val]
y_val = y[mask_val]

print(f"Train Size: {X_train.shape}, Val Size: {X_val.shape}")

## 4. Model Training (CatBoost)
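The model is trained with **Tweedie loss** (variance power 1.5), which suits zero-inflated, non-negative targets such as daily sales. Early stopping is monitored with RMSE on the validation set.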
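
For intuition about why a Tweedie objective tolerates exact zeros, the sketch below computes the standard Tweedie unit deviance for a variance power between 1 and 2. This is the textbook formula written in NumPy, not CatBoost's internal implementation, and the helper name `tweedie_deviance` is an assumption made for illustration.

```python
import numpy as np


def tweedie_deviance(y_true, y_pred, power=1.5):
    """Mean Tweedie deviance for 1 < power < 2 (compound Poisson-gamma range).

    Targets must be non-negative and free of NaNs; predictions must be positive.
    """
    if not 1 < power < 2:
        raise ValueError("this sketch only covers 1 < power < 2")
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    if np.isnan(y).any():
        raise ValueError("targets must not contain NaN")
    if np.any(y < 0) or np.any(mu <= 0):
        raise ValueError("targets must be >= 0 and predictions > 0")
    dev = 2 * (
        np.power(y, 2 - power) / ((1 - power) * (2 - power))
        - y * np.power(mu, 1 - power) / (1 - power)
        + np.power(mu, 2 - power) / (2 - power)
    )
    return float(dev.mean())


# A zero target yields a finite penalty that grows with the predicted mean
print(tweedie_deviance([0.0], [1.0]))
print(tweedie_deviance([0.0], [4.0]))
```

Because the deviance stays finite at `y = 0`, zero-sales days contribute a well-defined penalty, whereas a log-transformed target would need an offset to handle them.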

In [None]:
model = CatBoostRegressor(
    iterations=500, 
    learning_rate=0.1,
    depth=8,
    loss_function='Tweedie:variance_power=1.5', 
    eval_metric='RMSE',
    random_seed=42,
    verbose=100,
    allow_writing_files=False
)

train_pool = Pool(X_train, y_train, cat_features=cat_features)
val_pool = Pool(X_val, y_val, cat_features=cat_features)

model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=50)


## 5. Evaluation & Feature Importance

In [None]:
# Feature Importance
feature_importances = model.get_feature_importance()
feature_names = X_train.columns

fi_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
fi_df = fi_df.sort_values(by='importance', ascending=False).head(20)

plt.figure(figsize=(10, 8))
sns.barplot(x='importance', y='feature', data=fi_df, palette='viridis')
plt.title('CatBoost Feature Importance (Top 20)')
plt.show()

In [None]:
# WAPE Calculation (Weighted Average Percentage Error)
preds = model.predict(X_val)
preds = np.maximum(preds, 0)

wape_score = np.sum(np.abs(y_val - preds)) / np.sum(y_val)
print(f"--- VALIDATION WAPE: {wape_score:.4f} ---")
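
The WAPE calculation above can be wrapped in a small reusable function. The sketch below is one possible version; the name `wape` and the guard against a zero denominator are additions for illustration and are not part of the original notebook.

```python
import numpy as np


def wape(y_true, y_pred):
    """Weighted Average Percentage Error: sum(|y - y_hat|) / sum(|y|).

    Returns NaN when all actuals are zero, since the ratio is undefined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true).sum()
    if denom == 0:
        return float("nan")
    return float(np.abs(y_true - y_pred).sum() / denom)


# Toy example: absolute errors 2 + 1 + 3 = 6 over total actual sales of 40
print(f"{wape([10, 0, 30], [8, 1, 33]):.4f}")
```

Because the denominator aggregates over all rows, individual zero-sales rows do not cause division by zero, which makes WAPE a more stable choice than MAPE for sparse retail data.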