# 🤖 Day 3: Feature Engineering + Model Training + Selection

**Customer Churn Analytics Project**

This notebook covers:
1. Feature Engineering
2. Preprocessing Pipeline
3. Model Training (LR, RF, GB, XGBoost)
4. Model Comparison
5. Threshold Tuning
6. Save Best Model

In [1]:
# Imports
import sys
sys.path.insert(0, '../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from pathlib import Path

# Project modules
from config import OUTPUT_DIR
from preprocess import load_clean_data, split_data
from features import (
    create_interaction_features, apply_log_transforms,
    build_preprocessing_pipeline, prepare_features,
    get_feature_names, get_X_y, NUMERIC_FEATURES, CATEGORICAL_FEATURES
)
from train_models import (
    get_models, train_and_compare_models,
    save_model, save_metrics, save_comparison,
    get_feature_importance, find_best_threshold
)
from evaluate import (
    plot_roc_curve, plot_precision_recall_curve,
    plot_confusion_matrix, plot_feature_importance,
    plot_threshold_analysis, print_model_summary
)

print("✓ All modules loaded successfully!")

XGBoost not available, will skip XGB model
✓ All modules loaded successfully!


## 1. Load Cleaned Data

In [2]:
# Load cleaned dataset
df = load_clean_data()
print(f"\nShape: {df.shape}")
print(f"Churn rate: {df['churn'].mean()*100:.2f}%")
df.head()

Loaded cleaned dataset: 5,000 rows × 17 columns

Shape: (5000, 17)
Churn rate: 21.94%


| | customer_id | age | gender | location | device_type | acquisition_channel | plan_type | monthly_price | auto_renew | total_sessions_30d | avg_session_minutes_30d | total_crashes_30d | failed_payments_30d | total_amount_success_30d | support_tickets_30d | avg_resolution_time_30d | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C000001 | 50 | Female | Nagpur | Web | Ads | Standard | 499 | 1 | 207 | 30.09 | 3 | 0 | 488.79 | 0 | 0.0 | 0 |
| 1 | C000002 | 34 | Male | Patna | Android | Partner | Standard | 499 | 1 | 233 | 27.37 | 5 | 0 | 477.52 | 0 | 0.0 | 0 |
| 2 | C000003 | 45 | Female | Bangalore | Android | Ads | Standard | 499 | 1 | 206 | 25.24 | 3 | 0 | 501.11 | 0 | 0.0 | 0 |
| 3 | C000004 | 18 | Male | Nagpur | iOS | Ads | Basic | 199 | 1 | 158 | 20.67 | 0 | 0 | 203.95 | 0 | 0.0 | 0 |
| 4 | C000005 | 40 | Male | Vadodara | Android | Organic | Premium | 999 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0.0 | 1 |


## 2. Feature Engineering

In [3]:
# Create interaction features
df_engineered = create_interaction_features(df)
print(f"\nNew columns: {df_engineered.shape[1] - df.shape[1]}")

Created interaction features: sessions_per_crash, payment_failure_rate, support_per_session, avg_minutes_per_session

New columns: 4
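
The body of `create_interaction_features` is not shown in this notebook. As a rough sketch only, two of the four ratios could be computed as below. The function name `add_ratio_features` and the `+ 1` smoothing in the denominators are assumptions, not the project's actual definitions. `payment_failure_rate` and `avg_minutes_per_session` are left out because their inputs cannot be inferred from the output above.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_interaction_features (two ratios only)."""
    out = df.copy()
    sessions = out["total_sessions_30d"]
    # Assumed smoothing: +1 keeps the ratio finite when a customer has no crashes.
    out["sessions_per_crash"] = sessions / (out["total_crashes_30d"] + 1)
    # Assumed smoothing: +1 keeps the ratio finite when a customer has no sessions.
    out["support_per_session"] = out["support_tickets_30d"] / (sessions + 1)
    return out


# Rows 0 and 4 from the df.head() output, restricted to the columns used here.
demo = pd.DataFrame(
    {
        "total_sessions_30d": [207, 0],
        "total_crashes_30d": [3, 0],
        "support_tickets_30d": [0, 0],
    }
)
ratios = add_ratio_features(demo)
print(ratios[["sessions_per_crash", "support_per_session"]])
```

With this smoothing, an inactive customer (zero sessions, zero crashes) gets a ratio of 0 rather than `NaN` or `inf`, which matters because the downstream scaler cannot handle infinite values.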


In [4]:
# Apply log transforms for skewed features
df_engineered = apply_log_transforms(df_engineered)
print(f"\nFinal columns: {df_engineered.shape[1]}")

Applied log1p transform to 7 columns

Final columns: 28
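
The log step adds one `_log` column per transformed input, which is why the column count goes from 21 to 28. A minimal sketch of that pattern follows, assuming the originals are kept alongside the new columns (as the feature list in the next cell suggests). The helper name `add_log1p_columns` is hypothetical.

```python
import numpy as np
import pandas as pd


def add_log1p_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in for apply_log_transforms."""
    out = df.copy()
    for col in columns:
        # log1p(x) = log(1 + x) is defined at x = 0, so zero counts map to 0.
        out[f"{col}_log"] = np.log1p(out[col])
    return out


demo = pd.DataFrame({"total_sessions_30d": [207, 0], "total_crashes_30d": [3, 0]})
logged = add_log1p_columns(demo, ["total_sessions_30d", "total_crashes_30d"])
print(logged)
```

`log1p` is preferred over a plain `log` here because several of the 30-day counts are zero for some customers.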


In [5]:
# Check new features
new_cols = [c for c in df_engineered.columns if c not in df.columns]
print("New engineered features:")
for col in new_cols:
    print(f"  - {col}")

New engineered features:
  - sessions_per_crash
  - payment_failure_rate
  - support_per_session
  - avg_minutes_per_session
  - total_sessions_30d_log
  - avg_session_minutes_30d_log
  - total_crashes_30d_log
  - failed_payments_30d_log
  - total_amount_success_30d_log
  - support_tickets_30d_log
  - avg_resolution_time_30d_log


## 3. Train/Test Split

In [6]:
# Get feature columns
numeric_features, categorical_features = get_feature_names(
    include_interactions=True, include_log=True
)

# Filter to available columns
numeric_features = [f for f in numeric_features if f in df_engineered.columns]
categorical_features = [f for f in categorical_features if f in df_engineered.columns]

print(f"Numeric features: {len(numeric_features)}")
print(f"Categorical features: {len(categorical_features)}")

Numeric features: 21
Categorical features: 5


In [7]:
# Get X and y
all_features = numeric_features + categorical_features
X = df_engineered[all_features]
y = df_engineered['churn']

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

X shape: (5000, 26)
y shape: (5000,)


In [8]:
# Split data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train churn rate: {y_train.mean()*100:.2f}%")
print(f"Test churn rate: {y_test.mean()*100:.2f}%")

Train: (4000, 26), Test: (1000, 26)
Train churn rate: 21.95%
Test churn rate: 21.90%


## 4. Build Preprocessing Pipeline

In [9]:
# Build sklearn preprocessing pipeline
preprocessor = build_preprocessing_pipeline(
    numeric_features=numeric_features,
    categorical_features=categorical_features
)

Built preprocessing pipeline:
  Numeric features: 21
  Categorical features: 5


## 5. Train and Compare Models

In [10]:
# Train all models and compare
all_metrics, comparison_df, best_pipeline = train_and_compare_models(
    X_train, X_test, y_train, y_test, preprocessor
)

MODEL TRAINING AND COMPARISON

Training LogisticRegression...
  ROC-AUC: 1.0000
  F1: 1.0000
  Best Threshold: 0.10

Training RandomForest...
  ROC-AUC: 1.0000
  F1: 1.0000
  Best Threshold: 0.25

Training GradientBoosting...
  ROC-AUC: 1.0000
  F1: 1.0000
  Best Threshold: 0.10

✓ Best Model: LogisticRegression (ROC-AUC: 1.0000)


In [11]:
# Display comparison table
print("\nüìä MODEL COMPARISON")
display(comparison_df)


üìä MODEL COMPARISON


| | model | accuracy | precision | recall | f1 | roc_auc | pr_auc | best_threshold |
|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.1 |
| 1 | RandomForest | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.25 |
| 2 | GradientBoosting | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.1 |
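
A score of 1.0 on every metric for all three models is unusual for churn data and often means one feature separates the classes on its own. One quick check is to score each numeric feature by itself, as sketched below. The helper `single_feature_auc` and the synthetic demo data are illustrative assumptions. The demo deliberately builds a column that is zero for every churner to show what a perfectly separating feature looks like; it makes no claim about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def single_feature_auc(X: pd.DataFrame, y) -> pd.Series:
    """ROC-AUC of each numeric column used alone as the score."""
    scores = {}
    for col in X.columns:
        auc = roc_auc_score(y, X[col])
        # An AUC below 0.5 means the feature ranks in the opposite direction,
        # so fold it around 0.5 to measure separating power either way.
        scores[col] = max(auc, 1.0 - auc)
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic data (not from the project): churners always have zero sessions.
rng = np.random.default_rng(0)
y_demo = np.array([0, 1] * 100)
X_demo = pd.DataFrame(
    {
        "noise": rng.normal(size=200),
        "total_sessions_30d": np.where(
            y_demo == 1, 0, rng.integers(1, 300, size=200)
        ),
    }
)
ranked = single_feature_auc(X_demo, y_demo)
print(ranked)
```

Running the same helper on `X_train[numeric_features]` and `y_train` would show whether any single column reaches an AUC near 1.0 in the real data.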


## 6. Save Metrics and Comparison

In [12]:
# Save metrics
save_metrics(all_metrics)

# Save comparison
save_comparison(comparison_df)

✓ Metrics saved: C:\Users\Lenovo\Desktop\churn\churn_project\outputs\metrics.json
✓ Comparison saved: C:\Users\Lenovo\Desktop\churn\churn_project\outputs\model_comparison.csv


WindowsPath('C:/Users/Lenovo/Desktop/churn/churn_project/outputs/model_comparison.csv')

## 7. Best Model Analysis

In [13]:
# Get best model name (assumes comparison_df is sorted best-first)
best_model_name = comparison_df.iloc[0]['model']
best_metrics = all_metrics[best_model_name]

print(f"Best Model: {best_model_name}")
print(f"ROC-AUC: {best_metrics['roc_auc']:.4f}")

Best Model: LogisticRegression
ROC-AUC: 1.0000


In [14]:
# Get predictions from best model
y_proba = best_pipeline.predict_proba(X_test)[:, 1]
y_pred = best_pipeline.predict(X_test)

In [15]:
# Plot ROC curve
plots_dir = OUTPUT_DIR / 'plots'
plot_roc_curve(y_test.values, y_proba, model_name=best_model_name, plots_dir=plots_dir)

  ✓ Saved: roc_curve.png


WindowsPath('C:/Users/Lenovo/Desktop/churn/churn_project/outputs/plots/roc_curve.png')

In [16]:
# Plot Precision-Recall curve
plot_precision_recall_curve(y_test.values, y_proba, model_name=best_model_name, plots_dir=plots_dir)

  ✓ Saved: precision_recall_curve.png


WindowsPath('C:/Users/Lenovo/Desktop/churn/churn_project/outputs/plots/precision_recall_curve.png')

In [17]:
# Confusion matrix at threshold 0.5
plot_confusion_matrix(y_test.values, y_pred, threshold=0.5, plots_dir=plots_dir)

  ✓ Saved: confusion_matrix_t50.png


WindowsPath('C:/Users/Lenovo/Desktop/churn/churn_project/outputs/plots/confusion_matrix_t50.png')

## 8. Threshold Tuning

In [18]:
# Find best threshold
best_threshold, best_f1 = find_best_threshold(best_pipeline, X_test, y_test, metric='f1')

print(f"\nüéØ THRESHOLD TUNING")
print(f"Best threshold: {best_threshold:.2f}")
print(f"F1 at best threshold: {best_f1:.4f}")

# Business reasoning
print("\nüìù Reasoning:")
print("   - Lower threshold = catch more churners (higher recall)")
print("   - Higher threshold = fewer false positives (higher precision)")
print(f"   - Chosen {best_threshold:.2f} to balance precision and recall")


🎯 THRESHOLD TUNING
Best threshold: 0.10
F1 at best threshold: 1.0000

📝 Reasoning:
   - Lower threshold = catch more churners (higher recall)
   - Higher threshold = fewer false positives (higher precision)
   - Chosen 0.10 to balance precision and recall
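
The precision/recall trade-off described above can be made concrete with a small threshold sweep. The sketch below is not the project's `find_best_threshold`. The function names, the 0.05 grid, and the toy labels are assumptions. It also uses names like `y_val` to suggest tuning on a validation split, since a threshold picked on the test set gives an optimistic F1.

```python
import numpy as np


def f1_at(y_true: np.ndarray, y_proba: np.ndarray, threshold: float) -> float:
    """F1 score when predicting churn for probabilities >= threshold."""
    y_pred = y_proba >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def sweep_thresholds(y_true, y_proba, grid=None):
    """Return (threshold, f1) for the best F1 on the grid."""
    grid = np.arange(0.05, 1.0, 0.05) if grid is None else grid
    scores = [f1_at(y_true, y_proba, t) for t in grid]
    best = int(np.argmax(scores))  # on ties, the lowest threshold wins
    return float(grid[best]), scores[best]


# Toy validation labels and predicted churn probabilities (illustrative only).
y_val = np.array([0, 0, 0, 1, 1, 0, 1, 0])
p_val = np.array([0.03, 0.22, 0.37, 0.42, 0.91, 0.12, 0.32, 0.47])
threshold, f1 = sweep_thresholds(y_val, p_val)
print(f"best threshold={threshold:.2f}, F1={f1:.2f}")
```

When many thresholds tie, as happens if F1 is 1.0 across a wide range, the reported value reflects only the tie-breaking rule. A business cost for false positives and false negatives is then a better basis for the choice.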


In [19]:
# Plot threshold analysis
plot_threshold_analysis(y_test.values, y_proba, plots_dir=plots_dir)

  ✓ Saved: threshold_analysis.png


WindowsPath('C:/Users/Lenovo/Desktop/churn/churn_project/outputs/plots/threshold_analysis.png')

In [20]:
# Confusion matrix at best threshold
y_pred_best = (y_proba >= best_threshold).astype(int)
plot_confusion_matrix(y_test.values, y_pred_best, threshold=best_threshold, plots_dir=plots_dir)

  ✓ Saved: confusion_matrix_t10.png


WindowsPath('C:/Users/Lenovo/Desktop/churn/churn_project/outputs/plots/confusion_matrix_t10.png')

## 9. Feature Importance

In [21]:
# Get feature importance
importance_df = get_feature_importance(best_pipeline, all_features)

print("\nüîù Top 10 Features:")
display(importance_df.head(10))


üîù Top 10 Features:


| | feature | importance |
|---|---|---|
| 3 | num__total_sessions_30d | 2.532245 |
| 4 | num__avg_session_minutes_30d | 2.212337 |
| 15 | num__avg_session_minutes_30d_log | 1.472334 |
| 14 | num__total_sessions_30d_log | 1.218215 |
| 2 | num__auto_renew | 0.964961 |
| 13 | num__avg_minutes_per_session | 0.713551 |
| 10 | num__sessions_per_crash | 0.698469 |
| 5 | num__total_crashes_30d | 0.469092 |
| 9 | num__avg_resolution_time_30d | 0.404964 |
| 21 | cat__gender_Female | 0.346182 |


In [22]:
# Plot feature importance
if len(importance_df) > 0:
    plot_feature_importance(importance_df, top_n=15, plots_dir=plots_dir)

  ✓ Saved: feature_importance.png


## 10. Save Best Model

In [23]:
# Save best model
save_model(best_pipeline)

✓ Model saved: C:\Users\Lenovo\Desktop\churn\churn_project\outputs\models\best_model.pkl


WindowsPath('C:/Users/Lenovo/Desktop/churn/churn_project/outputs/models/best_model.pkl')
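
A saved model is only useful if it reloads and predicts identically. The round-trip check below is a sketch that assumes `joblib` serialization (the internals of `save_model` are not shown). It uses a tiny stand-in model and a temporary directory rather than the project's output path.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (illustrative only, not the project's pipeline).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_model.pkl"
    joblib.dump(model, path)       # serialize the fitted estimator
    restored = joblib.load(path)   # load it back as a new object

# The restored model should give exactly the same probabilities.
same_predictions = np.array_equal(
    model.predict_proba(X), restored.predict_proba(X)
)
print(f"Round trip OK: {same_predictions}")
```

Pickled scikit-learn models are tied to the library version that wrote them, so recording the scikit-learn version next to the `.pkl` file makes later loading more predictable.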

## 11. Final Summary

In [24]:
# Print model summary
print_model_summary(
    model_name=best_model_name,
    metrics=best_metrics,
    feature_importance=importance_df,
    top_n_features=10
)

print("\n✓ Saved outputs:")
print("   - outputs/models/best_model.pkl")
print("   - outputs/metrics.json")
print("   - outputs/model_comparison.csv")
print("   - outputs/plots/*.png")


BEST MODEL SUMMARY

📊 Model: LogisticRegression

📈 Performance Metrics:
   • ROC-AUC: 1.0000
   • PR-AUC: 1.0000
   • Accuracy: 1.0000
   • Precision: 1.0000
   • Recall: 1.0000
   • F1: 1.0000
   • Best Threshold: 0.10

🔝 Top 10 Features:
    4. num__total_sessions_30d: 2.5322
    5. num__avg_session_minutes_30d: 2.2123
   16. num__avg_session_minutes_30d_log: 1.4723
   15. num__total_sessions_30d_log: 1.2182
    3. num__auto_renew: 0.9650
   14. num__avg_minutes_per_session: 0.7136
   11. num__sessions_per_crash: 0.6985
    6. num__total_crashes_30d: 0.4691
   10. num__avg_resolution_time_30d: 0.4050
   22. cat__gender_Female: 0.3462

✓ Saved outputs:
   - outputs/models/best_model.pkl
   - outputs/metrics.json
   - outputs/model_comparison.csv
   - outputs/plots/*.png
