# Phase 2: Advanced Model Training & Evaluation

This notebook trains and evaluates predictive models (e.g., churn, LTV). It incorporates the accuracy upgrades specified in the project brief: it consumes data that has already been validated and imputed by the preparation pipeline, and it evaluates models with time-aware cross-validation.

**Workflow:**
1.  Run `data_preparation_pipeline.py` first to generate the analysis-ready data.
2.  This notebook loads the final, clean data from `data/analysis_ready/`.
3.  It then trains an XGBoost classifier using the best hyperparameters found via Optuna and evaluates it with `TimeSeriesSplit`.

In [3]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import roc_auc_score
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## 1. Load Analysis-Ready Data

Load the clean, validated, and imputed data generated by the preparation pipeline.

In [4]:
try:
    # Load the final trends data
    df = pd.read_csv('data/analysis_ready/final_monthly_trends.csv')
    df['date'] = pd.to_datetime(df['date'])
    df.sort_values('date', inplace=True)
    
    logging.info("Successfully loaded analysis-ready data.")
    print(df.head())
except FileNotFoundError:
    logging.error("'final_monthly_trends.csv' not found. Please run data_preparation_pipeline.py first.")
    raise  # Stop here: every later cell depends on df being defined

2025-07-01 22:24:14,041 - INFO - Successfully loaded analysis-ready data.


         date  mapping_key  total_spend   net_revenue  imputed_bool  \
11 2022-01-01    affiliate      68418.0  46565.416276          True   
3  2022-01-01  paid search      51972.0  44426.868685          True   
7  2022-01-01  paid social      17057.0  39886.713684          True   
19 2022-02-01  paid social       6602.0  38527.202857          True   
15 2022-02-01  paid search      35202.0  42246.189912          True   

    monthly_roas  monthly_cac  
11      0.680602    74.267960  
3       0.854823    62.768116  
7       2.338437    55.022581  
19      5.835687    25.688716  
15      1.200108    43.298893  


## 2. Feature Engineering & Model Preparation

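Prepare the features (X) and target variable (y) for the churn model. For this demonstration, we create a synthetic `churned` column by thresholding `monthly_roas`. Because `monthly_roas` is also used as a feature, the target is a deterministic function of the inputs, so the AUC values reported in Section 3 exercise the pipeline mechanics rather than measure real predictive skill.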

In [5]:
# Create a sample target variable for demonstration (e.g., churn prediction)
# In a real scenario, this would be based on actual customer behavior.
df['churned'] = (df['monthly_roas'] < 1.5).astype(int)

# Define features and target
features = ['total_spend', 'net_revenue', 'monthly_roas', 'monthly_cac']
target = 'churned'

X = df[features]
y = df[target]

logging.info(f"Features for modeling: {features}")

2025-07-01 22:24:29,949 - INFO - Features for modeling: ['total_spend', 'net_revenue', 'monthly_roas', 'monthly_cac']
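
Since the demo label is derived from `monthly_roas`, a quick per-feature check can show whether any single column already separates the classes on its own. The following is a minimal sketch under stated assumptions: the helper name `single_feature_separability` and the toy values are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_separability(X, y):
    """Return max(AUC, 1 - AUC) per feature; 1.0 means the feature alone separates the classes."""
    scores = {}
    for col in X.columns:
        auc = roc_auc_score(y, X[col])
        scores[col] = max(auc, 1.0 - auc)
    return pd.Series(scores).sort_values(ascending=False)

# Toy data (hypothetical values) that mimics how the demo target is built
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "total_spend": rng.uniform(5_000, 70_000, size=40),
    "monthly_roas": np.linspace(0.5, 6.0, 40),
})
toy["churned"] = (toy["monthly_roas"] < 1.5).astype(int)

separability = single_feature_separability(
    toy[["total_spend", "monthly_roas"]], toy["churned"]
)
print(separability)
```

A score of 1.0 for a feature is a strong hint that the target leaks through it. In this notebook the same check could be run as `single_feature_separability(X, y)`.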


## 3. Train Model with Best Hyperparameters and TimeSeriesSplit

This section implements Tasks 8.3 and 8.4. It uses the best parameters found by the Optuna script and evaluates the model with `TimeSeriesSplit`, an expanding-window cross-validation in which each fold trains on earlier rows and tests on the rows that immediately follow them.
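
To make the expanding-window behavior concrete, the sketch below runs `TimeSeriesSplit` on a plain index array and records the size of each fold. The row count of 123 is an assumption inferred from the fold sizes logged by the training cell (a final training set of 103 rows plus 20 test rows); the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 123  # assumed row count, inferred from the logged fold sizes
tscv_demo = TimeSeriesSplit(n_splits=5)

fold_sizes = []
for train_idx, test_idx in tscv_demo.split(np.arange(n_rows)):
    # Every training position precedes every test position (no look-ahead)
    assert train_idx.max() < test_idx.min()
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)
# → [(23, 20), (43, 20), (63, 20), (83, 20), (103, 20)]
```

The splitter works purely on row position, which is why the data must already be sorted chronologically before it is passed in.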

In [6]:
# Best parameters found from the hyperparameter_tuning.py script (Task 8.4)
best_params = {
    'n_estimators': 184,
    'max_depth': 6,
    'learning_rate': 0.1842290290462422,
    'subsample': 0.7885720186693017,
    'colsample_bytree': 0.8096297971911706,
    'gamma': 3.8149755795467173,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'random_state': 42
}

logging.info("Using best hyperparameters found by Optuna.")

# Initialize the model with the best parameters
model = xgb.XGBClassifier(**best_params)

# Use TimeSeriesSplit for cross-validation (Task 8.3)
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

logging.info(f"Performing cross-validation with TimeSeriesSplit (n_splits={n_splits})...")

auc_scores = []
# X must be in chronological order (df was sorted by date on load)
for fold, (train_index, test_index) in enumerate(tscv.split(X), start=1):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # fit() trains a fresh booster on each call, so folds do not share state
    model.fit(X_train, y_train)
    
    # Predict probabilities for the positive class
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Note: roc_auc_score raises ValueError if y_test contains a single class
    auc = roc_auc_score(y_test, y_pred_proba)
    auc_scores.append(auc)
    logging.info(f"Fold {fold}: Train size={len(X_train)}, Test size={len(X_test)}, AUC={auc:.4f}")

mean_auc = np.mean(auc_scores)
logging.info(f"\nFinal Mean Cross-Validated AUC: {mean_auc:.4f}")

print("\nModel training and evaluation complete.")
print(f"The average AUC score using the optimized model is: {mean_auc:.4f}")

2025-07-01 22:24:47,057 - INFO - Using best hyperparameters found by Optuna.
2025-07-01 22:24:47,066 - INFO - Performing cross-validation with TimeSeriesSplit (n_splits=5)...
2025-07-01 22:24:47,338 - INFO - Fold 1: Train size=23, Test size=20, AUC=1.0000
2025-07-01 22:24:47,467 - INFO - Fold 2: Train size=43, Test size=20, AUC=1.0000
2025-07-01 22:24:47,600 - INFO - Fold 3: Train size=63, Test size=20, AUC=0.2798
2025-07-01 22:24:47,732 - INFO - Fold 4: Train size=83, Test size=20, AUC=1.0000
2025-07-01 22:24:47,881 - INFO - Fold 5: Train size=103, Test size=20, AUC=1.0000
2025-07-01 22:24:47,881 - INFO - 
Final Mean Cross-Validated AUC: 0.8560



Model training and evaluation complete.
The average AUC score using the optimized model is: 0.8560
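
The data holds several channel rows per month, so a purely row-based split can place rows from the same month on both sides of a fold boundary. One possible alternative, sketched here as an assumption rather than as the project's method, is to split on unique dates and then map each date block back to row positions. The helper name `date_grouped_splits` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

def date_grouped_splits(dates, n_splits=5):
    """Yield (train_pos, test_pos) row positions; rows sharing a date stay on one side."""
    dates = pd.Series(pd.to_datetime(dates)).reset_index(drop=True)
    unique_dates = pd.DatetimeIndex(dates.unique()).sort_values()
    tscv = TimeSeriesSplit(n_splits=n_splits)
    # Split the positions of the unique dates, then map each date block back to rows
    for train_d, test_d in tscv.split(np.arange(len(unique_dates))):
        train_pos = np.flatnonzero(dates.isin(unique_dates[train_d]).to_numpy())
        test_pos = np.flatnonzero(dates.isin(unique_dates[test_d]).to_numpy())
        yield train_pos, test_pos

# Toy frame: 12 months x 3 channels (hypothetical values)
months = pd.date_range("2022-01-01", periods=12, freq="MS")
toy_dates = pd.Series(months.repeat(3))

splits = list(date_grouped_splits(toy_dates, n_splits=3))
for train_pos, test_pos in splits:
    print(len(train_pos), len(test_pos))
```

In this notebook the positions could be consumed with `X.iloc[train_pos]` and `y.iloc[train_pos]` after calling `date_grouped_splits(df['date'])`, since the helper returns positional rather than label-based indices.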
