# Customer Churn Exploration Notebook

This notebook is a template for exploring the customer churn dataset and developing initial models.

## Table of Contents
1. Setup & Imports
2. Load Data
3. Data Exploration
4. Data Cleaning
5. Feature Engineering
6. Model Training
7. Evaluation
8. Next Steps

## 1. Setup & Imports

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ML imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    f1_score, roc_auc_score, confusion_matrix, classification_report
)

# Project imports (the relative path assumes the working directory is one level below the project root)
import sys
sys.path.insert(0, '../src')
from churn_mlops.data import generate_sample_data

# Settings
pd.set_option('display.max_columns', None)
# This style name requires matplotlib >= 3.6 (older releases call it 'seaborn-whitegrid')
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

## 2. Load Data

In [None]:
# Generate sample data for exploration
# In production, replace with: df = pd.read_csv('path/to/data.csv')
df = generate_sample_data(n_samples=5000)

print(f"Dataset shape: {df.shape}")
df.head()

## 3. Data Exploration

In [None]:
# Basic info
print("Dataset Info:")
df.info()
print("\nBasic Statistics:")
df.describe()

In [None]:
# Check target distribution
print("Churn Distribution:")
print(df['churn'].value_counts(normalize=True))

plt.figure(figsize=(6, 4))
df['churn'].value_counts().plot(kind='bar')
plt.title('Churn Distribution')
plt.xlabel('Churn (0=No, 1=Yes)')
plt.ylabel('Count')
plt.show()

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())

In [None]:
# Correlation heatmap for numeric features
numeric_df = df.select_dtypes(include=[np.number])

plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()

## 4. Data Cleaning

TODO: Add data cleaning steps as needed

In [None]:
# Handle missing values (if any)
# df = df.fillna(...)

# Remove duplicates (if any)
# df = df.drop_duplicates()

print(f"Clean dataset shape: {df.shape}")
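
As a starting point for the TODO in this section, here is a minimal sketch of one possible cleaning pass. It runs on a tiny hypothetical frame (`toy`) whose column names echo the ones used in this notebook. The helper name `basic_clean` and the fill strategies (median for numeric columns, an `'unknown'` label for the rest) are assumptions for illustration, not a recommendation for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names echo this notebook, values are made up.
toy = pd.DataFrame({
    "customer_id": ["a1", "a2", "a3", "a3"],
    "tenure": [12.0, np.nan, 30.0, 30.0],
    "contract": ["monthly", None, "yearly", "yearly"],
    "churn": [0, 1, 0, 0],
})


def basic_clean(frame: pd.DataFrame, target: str = "churn") -> pd.DataFrame:
    """Drop duplicate rows and unlabeled rows, then fill remaining gaps."""
    out = frame.drop_duplicates().dropna(subset=[target]).copy()
    numeric_cols = out.select_dtypes(include=[np.number]).columns
    other_cols = out.columns.difference(numeric_cols)
    # Assumed strategy: median for numeric gaps, 'unknown' for everything else.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    out[other_cols] = out[other_cols].fillna("unknown")
    return out.reset_index(drop=True)


cleaned = basic_clean(toy)
print(cleaned)
```

Rows without a target value are dropped rather than imputed, since a filled-in label would be invented data.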

## 5. Feature Engineering

In [None]:
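# Create derived features
# The +1 in each denominator guards against division by zero
# (e.g. new customers with tenure == 0 or total_charges == 0)
df['avg_monthly_spend'] = df['total_charges'] / (df['tenure'] + 1)
df['charge_ratio'] = df['monthly_charges'] / (df['total_charges'] + 1)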

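# Encode categorical variables
# Skip customer_id: if it is stored as a string, encoding it would add a
# 'customer_id_encoded' column that leaks the identifier into the features
categorical_cols = df.select_dtypes(include=['object']).columns.drop('customer_id', errors='ignore')
print(f"Categorical columns: {list(categorical_cols)}")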

# Use label encoding for simplicity
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col + '_encoded'] = le.fit_transform(df[col])
    label_encoders[col] = le
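
Label encoding maps each category to an integer in sorted order, and a linear model such as logistic regression will read those integers as an ordered scale. The sketch below contrasts it with one-hot encoding via `pd.get_dummies`, a common alternative. The `contract` column and its values are hypothetical, and whether one-hot encoding actually helps on the churn data is untested here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column; the values are made up for illustration.
toy = pd.DataFrame({"contract": ["yearly", "monthly", "two_year", "monthly"]})

# Label encoding: classes are sorted, so the codes follow alphabetical order.
le = LabelEncoder()
codes = le.fit_transform(toy["contract"])
print(dict(zip(le.classes_, range(len(le.classes_)))))

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy, columns=["contract"], dtype=int)
print(one_hot.columns.tolist())
```

Tree-based models such as the random forest used later are generally less sensitive to this choice than linear models.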

In [None]:
# Prepare features and target
target = 'churn'
exclude_cols = ['customer_id', target] + list(categorical_cols)
feature_cols = [c for c in df.columns if c not in exclude_cols]

print(f"Features: {feature_cols}")

X = df[feature_cols]
y = df[target]

## 6. Model Training

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

# Train Logistic Regression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_scaled, y_train)

## 7. Evaluation

In [None]:
def evaluate_model(model, X_test, y_test, model_name):
    """Print metrics for a fitted binary classifier; return (y_pred, y_proba)."""
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    print(f"\n{'='*50}")
    print(f"{model_name} Results")
    print(f"{'='*50}")
    print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")
    print(f"ROC AUC:   {roc_auc_score(y_test, y_proba):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    return y_pred, y_proba

In [None]:
# Evaluate Random Forest
rf_pred, rf_proba = evaluate_model(rf_model, X_test, y_test, "Random Forest")

In [None]:
# Evaluate Logistic Regression
lr_pred, lr_proba = evaluate_model(lr_model, X_test_scaled, y_test, "Logistic Regression")

In [None]:
# Feature importance (Random Forest)
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Feature Importance (Random Forest)')
plt.show()

## 8. Next Steps

After exploration, consider:

1. **Move to production code**: Convert this notebook to proper Python modules
2. **Track experiments**: Use MLflow to track different experiments
3. **Hyperparameter tuning**: Use GridSearchCV or similar
4. **Feature selection**: Remove low-importance features
5. **Cross-validation**: Use k-fold CV for more robust evaluation
6. **Model selection**: Try additional models (XGBoost, etc.)
7. **Production deployment**: Package model for API serving
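
Items 3 and 5 in the list above can be sketched together with scikit-learn's `cross_val_score` and `GridSearchCV`. This is a minimal illustration, not a tuned setup: `X_demo` and `y_demo` are synthetic stand-ins generated by `make_classification`, and the parameter grid is deliberately tiny. In the notebook, the training split and a wider grid would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the churn features and target (assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Stratified folds keep the churn / no-churn ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Item 5: k-fold cross-validation of a single configuration.
base_model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)
cv_scores = cross_val_score(base_model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"ROC AUC per fold: {cv_scores.round(3)}")

# Item 3: grid search over a small, illustrative grid.
param_grid = {"max_depth": [5, 10], "n_estimators": [25, 50]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=cv, scoring="roc_auc"
)
search.fit(X_demo, y_demo)
print(f"Best params: {search.best_params_}")
print(f"Best mean ROC AUC: {search.best_score_:.3f}")
```

Running the search on the training split only, and keeping the test split for a single final evaluation, avoids tuning against the data used to report results.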

In [None]:
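# Save best model for deployment (optional)
# Note: if you save lr_model instead, save `scaler` too, since lr_model was trained on scaled features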
# import pickle
# with open('../models/churn_model.pkl', 'wb') as f:
#     pickle.dump(rf_model, f)