# **Credit Card Customer Churn Analysis (ML)**

## Objectives – Machine Learning (ML)

In this notebook, I will build, evaluate, and interpret a predictive model for customer churn using the processed Credit Card Customer Churn dataset. The goal is to leverage insights from the EDA notebook to select relevant features, train a logistic regression model, and assess which customer attributes significantly influence churn probability.

This notebook focuses on:

* Preparing data for modeling, including feature selection, scaling, and train/test splitting.
* Building an interpretable Logistic Regression model to predict churn (binary classification).
* Evaluating model performance using metrics such as Accuracy, Precision, Recall, F1-score, and ROC-AUC.
* Visualizing feature importance to identify which factors drive churn.
* Translating model results into actionable insights for business stakeholders.

### **Inputs**

To run this notebook, the following inputs are required:

* Processed dataset CSV: The cleaned and transformed Credit Card Customer Churn dataset produced in the ETL notebook.
* Python libraries, including but not limited to:
    * pandas – data manipulation
    * numpy – numerical operations
    * scikit-learn – machine learning algorithms, scaling, train/test split, evaluation metrics
    * matplotlib and seaborn – visualizations for performance metrics and feature importance

### **Outputs**

This notebook will generate:

* Train/Test Sets: Split datasets ready for modeling.
* Scaled Features: Standardized numeric features (zero mean, unit variance), which help the solver converge and make coefficients comparable across features.
* Model Performance Metrics: Accuracy, Precision, Recall, F1-score, ROC-AUC, and confusion matrix visualizations.
* Feature Importance Visualization: Bar charts of logistic regression coefficients showing positive or negative influence on churn.
* Interpretation of Results: Insights on which features support or refute the original hypotheses (H1–H3).
* Optional Export: DataFrame with predicted churn probabilities for downstream dashboard integration or further analysis.

### **Workflow**

**1. Data Preparation**

* Load the processed dataset.
* Separate features (X) and target (y).
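* Scale numeric features, fitting the scaler on the training data only so that test-set statistics do not leak into the model.
* Encode categorical variables if needed (they are already one-hot encoded in the processed dataset).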

**2. Train/Test Split**

* Split the dataset into training and testing sets (e.g., 80/20).
* Stratify on the target variable (Attrition_Flag) so that both sets keep the same proportion of churned and retained customers.
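
A minimal sketch of the split and scaling steps follows. It runs on synthetic stand-in data rather than the processed CSV: the column names `Customer_Age` and `Total_Trans_Ct`, the 0/1 encoding of `Attrition_Flag`, and the churn rate are all assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed dataset (hypothetical columns and values).
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "Customer_Age": rng.integers(26, 74, size=n),
    "Total_Trans_Ct": rng.integers(10, 140, size=n),
    # Assumed encoding: 1 = churned, 0 = retained.
    "Attrition_Flag": (rng.random(n) < 0.16).astype(int),
})

# Separate features (X) and target (y).
X = df.drop(columns="Attrition_Flag")
y = df["Attrition_Flag"]

# 80/20 split, stratified so both sets keep the same churn proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Churn rate (train): {y_train.mean():.3f}")
print(f"Churn rate (test):  {y_test.mean():.3f}")
```

The two printed churn rates should be nearly identical, which is the effect of `stratify=y`.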

**3. Model Training**

* Train a Logistic Regression model.
* Optionally, compare with another model (e.g., Random Forest, which provides its own feature-importance scores).
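
As a rough illustration of the training step, the sketch below fits a Logistic Regression on synthetic data from `make_classification`. The data, the 80/20 class weights, and `max_iter=1000` are assumptions for the example, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (the imbalance level is an arbitrary choice).
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_clusters_per_class=1,
    weights=[0.8, 0.2], random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale using statistics from the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the model; max_iter is raised to give the solver room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Column 1 of predict_proba holds the probability of the positive class (label 1).
churn_proba = model.predict_proba(X_test_scaled)[:, 1]
print(churn_proba[:5].round(3))
```

The `churn_proba` array is the kind of per-customer score that could be attached to a DataFrame for the optional export.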

**4. Model Evaluation**

* Evaluate predictions on the test set using standard metrics.
* Visualize results via confusion matrix and ROC curve.
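
The evaluation step might look like the following sketch, again on synthetic data from `make_classification` (an assumption for illustration; the real notebook would use the held-out test set from the processed dataset).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_clusters_per_class=1,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Hard predictions for threshold-based metrics, probabilities for ROC-AUC.
X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_proba),
}
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

With an imbalanced target, accuracy alone can look strong while recall on churners stays low, so it is worth reading the metrics together. For the plots, `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` and `RocCurveDisplay.from_predictions(y_test, y_proba)` are one option in recent scikit-learn versions.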

**5. Feature Interpretation**

* Extract model coefficients.
* Rank features by absolute coefficient size, i.e., by the strength of their influence on the log-odds of churn.
* Visualize top features using bar charts for easy stakeholder interpretation.
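
One way to extract and rank the coefficients is sketched below. The feature names (`feature_0` … `feature_5`) and the synthetic data are hypothetical placeholders. Because the inputs are standardized, each coefficient is the change in log-odds per one standard deviation of the feature, and `exp(coefficient)` is the corresponding odds ratio.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_clusters_per_class=1,
    weights=[0.8, 0.2], random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Standardize so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# One row per feature, sorted by absolute coefficient (largest first).
coef_table = (
    pd.DataFrame({"feature": feature_names, "coefficient": model.coef_[0]})
    .assign(
        abs_coefficient=lambda d: d["coefficient"].abs(),
        odds_ratio=lambda d: np.exp(d["coefficient"]),
    )
    .sort_values("abs_coefficient", ascending=False)
    .reset_index(drop=True)
)
print(coef_table)
```

A positive coefficient (odds ratio above 1) raises the predicted odds of the positive class, and a negative one lowers them. A horizontal bar chart of `coefficient` against `feature` is one simple way to present this table to stakeholders.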

**6. Hypothesis Insights**

* Map the top features back to H1–H3:
    * H1 – Tenure & Age
    * H2 – Credit usage & transactions
    * H3 – Income & card category
* Summarize which hypotheses are supported or contradicted by the model.

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# ML & preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Style
sns.set_style("whitegrid")
