# 🛒 End-to-End Churn Prediction Pipeline (Olist)
**Goal:** Predict customer churn in a non-contractual e-commerce setting and identify key drivers for retention strategies.

---
### 📌 Executive Summary
This notebook runs the final production pipeline end to end. It loads the raw data, engineers features (RFM + Logistics + Satisfaction), trains three baseline models, selects a champion, and generates an actionable **"High Risk Lead List"** for the Marketing Team.

**Key Results:**
* **Champion Model:** Logistic Regression (selected for high recall and interpretability)
* **Primary Churn Driver:** Delivery Delays (Logistics)
* **Secondary Driver:** Review Scores (Satisfaction)

In [None]:
# Setup path to import from 'src' folder
import sys
import os
import pandas as pd

# Add the parent directory to system path to access 'src'
sys.path.append(os.path.abspath('..'))

# Import our custom modules
from src.data_prep import load_and_clean_data, generate_churn_labels, build_features, split_and_scale
from src.modeling import train_baseline_models, evaluate_models, get_model_coefficients, generate_leads
from src.visualization import set_plot_style, plot_confusion_matrix, plot_roc_curve, plot_coefficients

# Set visual style
set_plot_style()
pd.set_option('display.max_columns', None)

print("✅ Setup Complete. Modules Loaded.")

### 1️⃣ Data Pipeline Execution
We load 9 raw CSV files, merge them into a single customer view, and apply "Smart Imputation" for missing values using geospatial context.
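
The imputation logic itself lives in `src.data_prep` and is not shown in this notebook. As a rough illustration of what imputation with geospatial context could look like, the sketch below fills a missing numeric value with the median of the customer's state and falls back to the global median when a state has no observed values. The helper `impute_by_region` and the column names `customer_state` and `delivery_days` are assumptions made for this example, not the pipeline's actual implementation.

```python
import numpy as np
import pandas as pd


def impute_by_region(df, value_col, region_col):
    """Fill NaNs in value_col with the regional median, then the global median.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    # Median of the observed values within each region (NaN if the region has none)
    regional_median = out.groupby(region_col)[value_col].transform("median")
    out[value_col] = out[value_col].fillna(regional_median)
    # Fallback for regions where every value was missing
    out[value_col] = out[value_col].fillna(df[value_col].median())
    return out


# Toy data: SP and RJ have observed values, AM has none
toy = pd.DataFrame({
    "customer_state": ["SP", "SP", "SP", "RJ", "RJ", "AM"],
    "delivery_days": [8.0, 10.0, np.nan, 14.0, np.nan, np.nan],
})
imputed = impute_by_region(toy, "delivery_days", "customer_state")
print(imputed)
```

Grouping by region keeps the filled value close to what nearby customers experienced, which a single global fill cannot do.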

In [None]:
# 1. Load Data
# Note: We use '../data/raw/' because the notebook is inside the 'notebooks' folder
dfs = load_and_clean_data(data_path='../data/raw/')

# 2. Generate Labels (Churn Definition: >280 days inactivity)
df_targets = generate_churn_labels(dfs['orders'], dfs['customers'], threshold_days=280)

# 3. Feature Engineering (RFM + Logistics + Satisfaction)
df_features = build_features(dfs, df_targets)

# 4. Split & Scale (ready for model training)
X_train_scaled, X_test_scaled, y_train, y_test, scaler, df_modeling = split_and_scale(
    df_features, 
    target_col='is_churn', 
    test_size=0.2
)

print(f"\nPipeline Output Shape: {X_train_scaled.shape}")
display(X_train_scaled.head(3))
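
The churn definition used in step 2 above (more than 280 days of inactivity) can be sketched as follows. This is a minimal illustration, not the code inside `generate_churn_labels`: it assumes the orders already carry a `customer_unique_id` column, that inactivity is measured from the latest purchase date in the data, and that the helper name `label_churn` is free to use.

```python
import pandas as pd


def label_churn(orders, threshold_days=280):
    """Flag customers whose last purchase is more than threshold_days old.

    Hypothetical helper; the snapshot date is assumed to be the most
    recent purchase in the data.
    """
    orders = orders.copy()
    orders["order_purchase_timestamp"] = pd.to_datetime(
        orders["order_purchase_timestamp"]
    )
    snapshot = orders["order_purchase_timestamp"].max()
    # Most recent purchase per customer
    last_purchase = orders.groupby("customer_unique_id")[
        "order_purchase_timestamp"
    ].max()
    recency_days = (snapshot - last_purchase).dt.days
    labels = pd.DataFrame({
        "recency_days": recency_days,
        "is_churn": (recency_days > threshold_days).astype(int),
    })
    return labels.reset_index()


toy_orders = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b", "c"],
    "order_purchase_timestamp": [
        "2017-01-10", "2018-08-01", "2017-09-01", "2018-08-30",
    ],
})
labels = label_churn(toy_orders)
print(labels)
```

In a non-contractual setting there is no cancellation event, so an inactivity threshold like this one stands in for the churn label.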

### 2️⃣ Model Training & Evaluation
We train three baseline models (Logistic Regression, Random Forest, XGBoost), with class imbalance handled during training.
**Strategy:** We prioritize **Recall** (capturing as many churners as possible) over Precision.
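
How `train_baseline_models` handles class imbalance is not shown in this notebook. One common option is class weighting, sketched below with scikit-learn's `class_weight="balanced"` on synthetic data. The dataset, the variable names, and the choice of weighting over resampling are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features (about 10% positives)
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scores = {}
for name, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    scores[name] = {
        "recall": recall_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
    }

# Compare how the two settings trade recall against precision
for name, s in scores.items():
    print(f"{name:>10}: recall={s['recall']:.2f} precision={s['precision']:.2f}")
```

Weighting the minority class usually raises recall at some cost in precision, which is the trade-off the strategy above accepts.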

In [None]:
# Train Models
models = train_baseline_models(X_train_scaled, y_train)

# Evaluate on Test Set
print("\n📊 Model Performance Comparison:")
results = evaluate_models(models, X_test_scaled, y_test)
display(results)

# Select Champion Model (Logistic Regression for Interpretability & Recall)
champion_model = models['Logistic Regression']

### 3️⃣ Interpretability: Why are they leaving?
Using Logistic Regression coefficients, we identify the direction and magnitude of risk factors.
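
To make the reading of coefficients concrete, here is a minimal sketch of a coefficient table for a fitted scikit-learn `LogisticRegression`. It is not the code behind `get_model_coefficients`: the helper `coefficients_table`, the `Odds_Ratio` column, and the toy feature names are assumptions. Only the `Coefficient` column name is taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def coefficients_table(model, feature_names):
    """Return coefficients sorted from most churn-increasing to most churn-reducing.

    Hypothetical helper for illustration only.
    """
    table = pd.DataFrame({
        "Feature": feature_names,
        "Coefficient": model.coef_[0],
    })
    # exp(coef) is the multiplicative change in odds per one-unit increase
    table["Odds_Ratio"] = np.exp(table["Coefficient"])
    return table.sort_values("Coefficient", ascending=False).reset_index(drop=True)


# Toy standardized features: delay pushes churn up, review score pushes it down
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(500, 2)), columns=["delivery_delay", "review_score"]
)
logit = (1.5 * X["delivery_delay"] - 1.0 * X["review_score"]).to_numpy()
y = (rng.random(500) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
coef_table = coefficients_table(model, X.columns.tolist())
print(coef_table)
```

A positive coefficient means the feature raises the log-odds of churn and a negative one lowers it. On scaled features, each coefficient is the effect of a one-unit change on that scaled axis, so magnitudes can be compared across features.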

In [None]:
# Extract Coefficients
feature_names = X_train_scaled.columns.tolist()
df_coef = get_model_coefficients(champion_model, feature_names)

# Visualize Drivers (Save to reports folder)
plot_coefficients(df_coef, top_n=10, save_path='../reports/figures/churn_drivers.png')

# Display Top Risk Factors
print("🚨 Top Risk Factors (Positive Coefficient = Increases Churn):")
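# Sort explicitly so the top rows are the largest positive coefficients
display(df_coef[df_coef['Coefficient'] > 0].sort_values('Coefficient', ascending=False).head(5))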

### 4️⃣ Business Action: High Risk Lead List
We map the model's mathematical probabilities back to raw business data to create a readable list for the Marketing Team.
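
The mapping step can be pictured as an index join followed by a probability cut-off. The sketch below is an assumption about how such a step might work, not the body of `generate_leads`: the helper `build_lead_list`, the `churn_probability` column, the toy columns, and the use of `>=` at the threshold are all hypothetical.

```python
import pandas as pd


def build_lead_list(churn_proba, scaled_index, df_raw, threshold=0.75):
    """Attach churn probabilities to raw rows by index and keep the risky ones.

    Hypothetical helper for illustration only.
    """
    proba = pd.Series(churn_proba, index=scaled_index, name="churn_probability")
    # Index alignment: select the raw rows that share labels with the scaled rows
    leads = df_raw.loc[proba.index].assign(churn_probability=proba)
    leads = leads[leads["churn_probability"] >= threshold]
    return leads.sort_values("churn_probability", ascending=False)


df_raw = pd.DataFrame(
    {"recency_days": [300, 12, 410, 95], "avg_review_score": [2.0, 5.0, 1.0, 4.0]},
    index=[101, 102, 103, 104],
)
# Pretend rows 103, 101 and 104 landed in the test split, in that order
test_index = pd.Index([103, 101, 104])
# Stand-in for the positive-class column of model.predict_proba(...)
proba = [0.91, 0.78, 0.20]

leads = build_lead_list(proba, test_index, df_raw)
print(leads)
```

Joining on the index rather than on row position matters because a train/test split shuffles rows, so position alone would pair probabilities with the wrong customers.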

In [None]:
# Generate Leads from the Test Set (Simulating active customers)
# We map predictions back to 'df_modeling' (Raw Data) using indices
leads = generate_leads(
    model=champion_model, 
    X_scaled=X_test_scaled, 
    df_raw=df_modeling, 
    threshold=0.75
)

print(f"\n📋 ACTION REQUIRED: Identified {len(leads)} High-Risk Customers")
display(leads.head(10))

# Export for CRM
leads.to_csv('../reports/high_risk_customers.csv')
print("✅ Exported list to '../reports/high_risk_customers.csv'")