# Churn Prediction Modeling Notebook

**Note:** This notebook is a continuation of the Exploratory Data Analysis (EDA) notebook. It covers model development, evaluation, and preparing data outputs for Tableau.


# 🚀 Project Overview

**Problem Statement:** Subscription businesses lose revenue to customer churn.  
**Goal:** Train two simple, interpretable models (Logistic Regression and a Decision Tree) to generate churn risk scores, then export the scores for a Tableau dashboard.

In [None]:
# churn_modeling_colab.ipynb

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
BASE_PATH = '/content/drive/MyDrive/Capstone3'


In [None]:
# STEP 0: Imports for preprocessing, modeling, and evaluation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np

In [None]:
# STEP 1: Load the cleaned dataset from Drive
df = pd.read_csv(f"{BASE_PATH}/cleaned_churn_data_final.csv")
print("Available columns:", df.columns.tolist())

Available columns: ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn', 'tenure_bucket', 'charges_ratio', 'Churn_num']


In [None]:
# STEP 2: Preprocessing
# 2.1: Drop original 'Churn' and use 'Churn_num' as target
if 'Churn' in df.columns:
    df = df.drop('Churn', axis=1)

In [None]:
# 2.2: Create dummy variables for categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.drop('tenure_bucket', errors='ignore')
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)


In [None]:
# 2.3: Drop unencoded tenure_bucket (still string)
if 'tenure_bucket' in df_encoded.columns:
    df_encoded = df_encoded.drop('tenure_bucket', axis=1)

In [None]:
# 2.4: Standardize numeric features (excluding target 'Churn_num')
numerical_cols = df_encoded.select_dtypes(include=['int64', 'float64']).columns.drop('Churn_num')
scaler = StandardScaler()
df_encoded[numerical_cols] = scaler.fit_transform(df_encoded[numerical_cols])

In [None]:
# 2.5: Split data into training and testing sets
X = df_encoded.drop('Churn_num', axis=1)
y = df_encoded['Churn_num']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Save preprocessed dataset to Drive
df_encoded.to_csv(f"{BASE_PATH}/telco_preprocessed.csv", index=False)
print("✅ Preprocessed data saved to telco_preprocessed.csv")


✅ Preprocessed data saved to telco_preprocessed.csv
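Step 2.4 fits the scaler on all rows before the split in step 2.5, so test-set statistics influence the scaling. One common alternative, sketched below on synthetic data, is to put the scaler in a scikit-learn pipeline so it is fitted on the training rows only. The data and the names `X_demo`, `y_demo`, and `pipe` are hypothetical. This sketch scales every column, so scaling only the numeric columns of this dataset would additionally need something like `ColumnTransformer`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 3 numeric features.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training fold only, then reuses
# those training statistics when it transforms the test fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

scaler_demo = pipe.named_steps["standardscaler"]
test_accuracy = pipe.score(X_te, y_te)
print("Scaler means (training rows only):", scaler_demo.mean_)
print("Test accuracy:", test_accuracy)
```

With only a few scaled columns and several thousand rows, the effect on the reported metrics is likely small, but the pipeline keeps the test set fully unseen.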


In [None]:
# STEP 3: Modeling
# 3.1: Logistic Regression
logreg = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
logreg.fit(X_train, y_train)
y_pred_log = logreg.predict(X_test)
y_prob_log = logreg.predict_proba(X_test)[:, 1]
print("Logistic Regression Results")
print(classification_report(y_test, y_pred_log))
print("ROC AUC:", roc_auc_score(y_test, y_prob_log))
# Largest positive coefficients, i.e. features that raise the log-odds of churn the most
importance_log = pd.Series(logreg.coef_[0], index=X.columns).sort_values(ascending=False)
print("Top 5 Churn Drivers (LogReg):")
print(importance_log.head())

Logistic Regression Results
              precision    recall  f1-score   support

           0       0.91      0.71      0.80      1035
           1       0.50      0.80      0.62       374

    accuracy                           0.74      1409
   macro avg       0.71      0.76      0.71      1409
weighted avg       0.80      0.74      0.75      1409

ROC AUC: 0.8490583585212741
Top 5 Churn Drivers (LogReg):
InternetService_Fiber optic    1.288699
StreamingMovies_Yes            0.474621
StreamingTV_Yes                0.461778
MultipleLines_Yes              0.421806
charges_ratio                  0.420602
dtype: float64


# 🔍 Logistic Regression Model Results

**✅ Model Performance**

- Accuracy: 74%
- Precision (Churn = 1): 50%, meaning that when the model predicts churn, it is correct half the time
- Recall (Churn = 1): 80%, meaning the model catches 80% of all actual churners
- ROC-AUC Score: 0.85

An ROC-AUC of 0.85 indicates good discriminatory power: given a randomly chosen churner and a randomly chosen non-churner, the model assigns the churner the higher score about 85% of the time.
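Precision and recall both depend on the 0.5 probability cutoff that `predict` applies by default, whereas ROC-AUC does not. The sketch below uses a small set of made-up labels and probabilities (not this notebook's data) to show how moving the cutoff trades one metric against the other. The names `y_true_demo`, `y_prob_demo`, and `results` are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted churn probabilities.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob_demo = np.array([0.10, 0.20, 0.35, 0.55, 0.60, 0.30, 0.45, 0.65, 0.80, 0.90])

results = {}
for threshold in (0.4, 0.5, 0.7):
    # Label a customer as a churner when the probability reaches the cutoff.
    y_hat = (y_prob_demo >= threshold).astype(int)
    p = precision_score(y_true_demo, y_hat)
    r = recall_score(y_true_demo, y_hat)
    results[threshold] = (p, r)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the cutoff flags fewer customers, which tends to raise precision and lower recall. Which balance is right depends on the cost of a retention offer relative to the cost of a lost customer.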


---


# 📊 Top Drivers of Churn
Based on model coefficients (impact on the log-odds of churn, holding the other features fixed). These are the five largest positive coefficients; features with negative coefficients, which lower the odds of churn, are not listed.

- Fiber Optic Internet Service (+1.29)
→ Customers with fiber internet have markedly higher odds of churn; this is by far the largest coefficient

- Streaming Movies Enabled (+0.47)
→ Streaming movie subscribers have higher odds of churn

- Streaming TV Enabled (+0.46)
→ Suggests bundling services may not be enough to retain users

- Multiple Phone Lines (+0.42)
→ More complex accounts may correlate with churn

- High Charges Relative to Tenure (charges_ratio) (+0.42)
→ Customers paying a lot for a short time are high-risk
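Log-odds coefficients are easier to read as odds ratios, obtained by exponentiating them. The sketch below does this for the five coefficients printed in the logistic regression output. For the dummy features a one-unit change means switching from 0 to 1; for `charges_ratio` it means one standard deviation, assuming that column was standardized in step 2.4.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the logistic regression output above.
coefs = pd.Series({
    "InternetService_Fiber optic": 1.288699,
    "StreamingMovies_Yes": 0.474621,
    "StreamingTV_Yes": 0.461778,
    "MultipleLines_Yes": 0.421806,
    "charges_ratio": 0.420602,
})

# exp(coefficient) is the multiplicative change in the odds of churn for
# a one-unit increase in the feature, holding the other features fixed.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

An odds ratio above 1 means the feature raises the odds of churn. These are changes in odds, not in probability, so they should not be read as "x times more likely to churn".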

# 📌 Key Takeaways
The model catches most churners (80% recall), which is critical for proactive retention. Only half of its churn predictions are correct (50% precision), so retention offers will also reach many customers who would have stayed.

According to the coefficients, fiber customers with multiple services and high relative charges are the most at-risk. These churn drivers can be used to target segment-specific offers (e.g., discounts or contract changes).



In [None]:
# 3.2: Decision Tree (Shallow)
tree = DecisionTreeClassifier(max_depth=3, class_weight='balanced', random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)
y_prob_tree = tree.predict_proba(X_test)[:, 1]
print("\nDecision Tree Results")
print(classification_report(y_test, y_pred_tree))
print("ROC AUC:", roc_auc_score(y_test, y_prob_tree))
# Feature importance
importance_tree = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Top 5 Churn Drivers (DT):")
print(importance_tree.head())



Decision Tree Results
              precision    recall  f1-score   support

           0       0.93      0.57      0.71      1035
           1       0.43      0.88      0.57       374

    accuracy                           0.65      1409
   macro avg       0.68      0.73      0.64      1409
weighted avg       0.80      0.65      0.67      1409

ROC AUC: 0.7862473843292257
Top 5 Churn Drivers (DT):
Contract_Two year              0.492014
Contract_One year              0.307701
InternetService_Fiber optic    0.135436
StreamingMovies_Yes            0.041401
TotalCharges                   0.011653
dtype: float64


# 🌲 Decision Tree Model Results (Max Depth = 3)

**✅ Model Performance**
- Accuracy: 65%
- Precision (Churn = 1): 43%
- Recall (Churn = 1): 88%. The tree catches most churners, but more than half of its churn predictions are false alarms.
- ROC-AUC Score: 0.79. Reasonable performance, though weaker than logistic regression (0.85).


---


# 📊 Top Drivers of Churn (Feature Importance)
Feature importances measure how much each feature reduces impurity in the tree and carry no sign. The direction noted for each feature is an interpretation, not a model output.

- Contract Type: Two-Year (− churn): 49% importance
→ Customers on long-term contracts are least likely to churn
- Contract Type: One-Year (− churn): 31%
→ Medium-term contracts also correlate with low churn
- Fiber Optic Internet Service (+ churn): 14%
→ Fiber users more likely to leave despite fast service
- Streaming Movies Enabled (+ churn): 4%
- TotalCharges (slightly + churn): 1%

# 📌 Key Takeaways
The decision tree identifies contract type as the strongest predictor of churn: the two contract features together account for about 80% of total feature importance. Customers on month-to-month plans are far more likely to churn.

This model is very interpretable and useful for business rules-based segmentation. Retention campaigns should prioritize moving customers to longer-term contracts.
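A shallow tree can be turned into readable if/else rules with scikit-learn's `export_text`. The sketch below fits a depth-3 tree on synthetic data and prints its rules. The feature names and data are hypothetical stand-ins, not this notebook's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (hypothetical feature names and values).
rng = np.random.default_rng(0)
feature_names = ["is_month_to_month", "has_fiber", "tenure_scaled"]
X_demo = np.column_stack([
    rng.integers(0, 2, size=300),
    rng.integers(0, 2, size=300),
    rng.normal(size=300),
])
# Churn is made more likely for month-to-month fiber customers.
noise = rng.normal(scale=0.5, size=300)
y_demo = ((X_demo[:, 0] + X_demo[:, 1] + noise) > 1.5).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
demo_tree.fit(X_demo, y_demo)

# Each printed line is a split condition or a leaf with its predicted class.
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

Applied to the model in this notebook, the equivalent call would be `export_text(tree, feature_names=list(X.columns))`.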

In [None]:
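# STEP 4: Export for Tableau Dashboard
# Scores come from the decision tree and cover every row, including rows used for training
full_scores = df.copy()
full_scores['churn_score'] = tree.predict_proba(X)[:, 1]
full_scores['predicted_label'] = tree.predict(X)
full_scores['churn_risk_level'] = pd.cut(
    full_scores['churn_score'],
    bins=[0, 0.33, 0.66, 1.0],
    labels=['Low', 'Medium', 'High'],
    include_lowest=True  # keep scores of exactly 0.0 in 'Low' instead of NaN
)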
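`pd.cut` builds right-closed intervals by default, so with bins starting at 0 a score of exactly 0.0 falls outside the first bin and becomes `NaN` unless `include_lowest=True` is passed. A small sketch with made-up scores shows the difference:

```python
import pandas as pd

# Made-up scores, including both boundary values 0.0 and 1.0.
scores = pd.Series([0.0, 0.2, 0.5, 0.9, 1.0])
bins = [0, 0.33, 0.66, 1.0]
labels = ["Low", "Medium", "High"]

# Default: intervals are (0, 0.33], (0.33, 0.66], (0.66, 1.0],
# so a score of exactly 0.0 is not assigned to any bin.
default_cut = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left.
inclusive_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"score": scores, "default": default_cut, "inclusive": inclusive_cut}))
```

This matters for tree models in particular, because a pure leaf yields a probability of exactly 0.0 or 1.0.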

In [None]:
# STEP 5: Add segment tag

def tag_segment(row):
    if pd.isna(row.get('tenure_bucket')) or pd.isna(row.get('Contract')):
        return 'General'
    contract = str(row['Contract']).strip()
    tenure = str(row['tenure_bucket']).strip()
    if contract == 'Month-to-month' and tenure in ['0-12', '13-24']:
        return 'High Risk Segment'
    return 'General'

full_scores['segment_tag'] = full_scores.apply(tag_segment, axis=1)

In [None]:
# STEP 6: Save Tableau-ready unscaled dataset
full_scores.to_csv(f"{BASE_PATH}/telco_tableau_unscaled.csv", index=False)
print("✅ Exported unscaled data for Tableau to telco_tableau_unscaled.csv")

# Export feature importances for Top Drivers sheet
importance_tree.head(10).to_csv(f"{BASE_PATH}/churn_driver_importance.csv", header=['importance'], index_label='feature')
print("✅ Exported top churn drivers to churn_driver_importance.csv")


✅ Exported unscaled data for Tableau to telco_tableau_unscaled.csv
✅ Exported top churn drivers to churn_driver_importance.csv


# 🤔 Logistic Regression vs. Decision Tree

| Metric            | Logistic Regression | Decision Tree |
| ----------------- | ------------------- | ------------- |
| ROC-AUC           | **0.85** ✅          | 0.79          |
| Accuracy          | **74%** ✅           | 65%           |
| Precision (Churn) | **50%** ✅           | 43%           |
| Recall (Churn)    | 80%                 | **88%** ✅     |
| Interpretability  | Medium              | **High** ✅    |

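*Conclusion:*
- Use **Logistic Regression** for scoring (better ROC‑AUC, accuracy, and precision).
- Use **Decision Tree** insights for business rules (for example, retention strategies based on contract type).
- Note that `churn_score` in `telco_tableau_unscaled.csv` (Step 4) is computed from the Decision Tree; the Logistic Regression probabilities are exported to `model_results_for_tableau.csv` below.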

# 📚 Next Steps
- **Tableau Dashboard:** Connect CSVs and build interactive sheets: Risk Distribution, Segment Explorer, Top Drivers, Modeling Results

In [None]:
# Logistic Regression predictions on the test set
y_pred_logreg = logreg.predict(X_test)
y_proba_logreg = logreg.predict_proba(X_test)[:, 1]

# Decision Tree predictions on the test set
y_pred_tree = tree.predict(X_test)
y_proba_tree = tree.predict_proba(X_test)[:, 1]

# Map a probability to a risk level, using the same cutoffs as the pd.cut bins in Step 4
def risk_level(prob):
    if prob > 0.66:
        return 'High'
    elif prob > 0.33:
        return 'Medium'
    else:
        return 'Low'

# The dataset has no CustomerID column, so the row index of the
# original DataFrame (preserved in X_test.index) serves as the identifier
results_df = pd.DataFrame({
    'CustomerID': X_test.index,
    'Actual': y_test.values,
    'LogReg_Predicted': y_pred_logreg,
    'LogReg_Probability': y_proba_logreg,
    'Tree_Predicted': y_pred_tree,
    'Tree_Probability': y_proba_tree
})

results_df['LogReg_Risk_Level'] = results_df['LogReg_Probability'].apply(risk_level)
results_df['Tree_Risk_Level'] = results_df['Tree_Probability'].apply(risk_level)

# Export to CSV
results_df.to_csv(f"{BASE_PATH}/model_results_for_tableau.csv", index=False)
print("✅ Exported to model_results_for_tableau.csv")


✅ Exported to model_results_for_tableau.csv


In [None]:
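# Export ROC curve points for the logistic regression model
from sklearn.metrics import roc_curve
fpr, tpr, thresh = roc_curve(y_test, y_proba_logreg)
# The first threshold is a placeholder above every score (np.inf in recent scikit-learn versions)
roc_df = pd.DataFrame({'FPR': fpr, 'TPR': tpr, 'Threshold': thresh})
roc_df.to_csv(f"{BASE_PATH}/roc_data.csv", index=False)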
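The exported ROC points can also be used to choose a probability cutoff. The sketch below uses made-up labels and scores: it drops the placeholder first threshold that `roc_curve` returns, then picks the cutoff that maximizes Youden's J statistic (TPR minus FPR). Youden's J is one common heuristic and is not something this notebook uses. The names `roc_demo`, `roc_finite`, and `best` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Hypothetical labels and scores, not the notebook's data.
y_true_demo = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score_demo = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)
roc_demo = pd.DataFrame({"FPR": fpr, "TPR": tpr, "Threshold": thresholds})

# The first row is the (0, 0) corner with a placeholder threshold above
# every score; keep only thresholds that are valid probabilities.
roc_finite = roc_demo[roc_demo["Threshold"] <= 1.0].copy()

# Youden's J rewards a high true positive rate and a low false positive rate.
roc_finite["J"] = roc_finite["TPR"] - roc_finite["FPR"]
best = roc_finite.loc[roc_finite["J"].idxmax()]
print(best)
```

Filtering the placeholder row before export also avoids a non-finite value in the `Threshold` column, which some tools may not parse as a number.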