# Customer Churn Prediction – End-to-End ML Project

This notebook builds a small but complete **machine learning pipeline** for predicting whether a customer will churn.

We will:

1. Generate a synthetic "telecom churn" dataset.
2. Explore the data with **pandas** and **matplotlib**.
3. Train and evaluate baseline models with **scikit-learn**.
4. Visualize performance and feature importance.

You can use this as a portfolio-ready ML project on GitHub.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

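## 1. Create a Synthetic Churn Dataset

We will simulate a simple telecom-style dataset with six features. The values come from `make_classification`, so the names below are illustrative labels rather than real telecom measurements: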

- `monthly_charges`
- `tenure_months`
- `num_support_calls`
- `num_addon_services`
- `has_paperless_billing`
- `uses_mobile_app`

In [None]:
# Generate a synthetic binary classification dataset
X, y = make_classification(
    n_samples=1500,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=2,
    weights=[0.7, 0.3],  # roughly 30% churn; default label noise (flip_y) shifts it slightly
    random_state=RANDOM_STATE,
)

# Wrap into a DataFrame with readable (illustrative) feature names
columns = [
    "monthly_charges",
    "tenure_months",
    "num_support_calls",
    "num_addon_services",
    "has_paperless_billing",
    "uses_mobile_app",
]
df = pd.DataFrame(X, columns=columns)
df["churn"] = y

# Apply some simple transformations so numbers look realistic
df["monthly_charges"] = (df["monthly_charges"] - df["monthly_charges"].min())
df["monthly_charges"] = 30 + 70 * df["monthly_charges"] / df["monthly_charges"].max()

df["tenure_months"] = (df["tenure_months"] - df["tenure_months"].min())
df["tenure_months"] = (24 * df["tenure_months"] / df["tenure_months"].max()).round()

df["num_support_calls"] = (df["num_support_calls"] - df["num_support_calls"].min())
df["num_support_calls"] = (5 * df["num_support_calls"] / df["num_support_calls"].max()).round()

df["num_addon_services"] = (df["num_addon_services"] - df["num_addon_services"].min())
df["num_addon_services"] = (4 * df["num_addon_services"] / df["num_addon_services"].max()).round()

# Convert the last two features to binary (0/1) flags by thresholding at the median
df["has_paperless_billing"] = (df["has_paperless_billing"] > df["has_paperless_billing"].median()).astype(int)
df["uses_mobile_app"] = (df["uses_mobile_app"] > df["uses_mobile_app"].median()).astype(int)

df.head()
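
The four rescaling steps above repeat the same min-max pattern. A minimal sketch of a helper that captures it, assuming a hypothetical name `rescale_to_range` that does not appear elsewhere in this notebook, and using made-up input values:

```python
import pandas as pd


def rescale_to_range(series, low, high, round_result=False):
    """Linearly map a numeric Series onto [low, high] (min-max scaling)."""
    shifted = series - series.min()
    scaled = low + (high - low) * shifted / shifted.max()
    return scaled.round() if round_result else scaled


# Tiny illustrative input (made-up values, not the notebook's dataset)
raw = pd.Series([-2.0, 0.0, 1.0, 3.0])
charges = rescale_to_range(raw, 30, 100)                   # like monthly_charges
tenure = rescale_to_range(raw, 0, 24, round_result=True)   # like tenure_months
print(charges.tolist(), tenure.tolist())
```

The same call with `low=0, high=5` or `low=0, high=4` would cover the support-call and add-on columns.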

## 2. Exploratory Data Analysis (EDA)

In [None]:
df.describe().T

In [None]:
# Class balance
class_counts = df["churn"].value_counts().sort_index()
class_counts

In [None]:
# Plot histograms for a few key numeric features
fig, axes = plt.subplots(2, 2, figsize=(10, 6))

axes = axes.flatten()
features_to_plot = ["monthly_charges", "tenure_months", "num_support_calls", "num_addon_services"]

for ax, col in zip(axes, features_to_plot):
    ax.hist(df[col], bins=20)
    ax.set_title(col)
    ax.set_xlabel(col)
    ax.set_ylabel("Count")

fig.tight_layout()
plt.show()

We can also look at how the churn rate varies with some of these features.

In [None]:
# Helper: plot the churn rate within each bin of a numeric feature
def plot_binned_churn(feature, bins, figsize=(5, 3)):
    temp = df.copy()
    temp["bin"] = pd.cut(temp[feature], bins=bins)
    # observed=False keeps empty bins and avoids the pandas FutureWarning
    churn_rate = temp.groupby("bin", observed=False)["churn"].mean()

    plt.figure(figsize=figsize)
    churn_rate.plot(kind="bar")
    plt.ylabel("Churn rate")
    plt.title(f"Churn rate vs {feature}")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

plot_binned_churn("monthly_charges", bins=6)
plot_binned_churn("tenure_months", bins=6)

## 3. Train / Test Split

In [None]:
feature_cols = [
    "monthly_charges",
    "tenure_months",
    "num_support_calls",
    "num_addon_services",
    "has_paperless_billing",
    "uses_mobile_app",
]
X = df[feature_cols].values
y = df["churn"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)

X_train.shape, X_test.shape
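
Because the split is stratified, the churn share should be nearly identical in both partitions. A small self-contained sketch of that check, run on stand-in arrays (`X_demo` and `y_demo` are illustrative, not the notebook's `X` and `y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 1000 rows with a 30% positive class (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 300 + [0] * 700)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratify=y, both partitions keep roughly the original class ratio
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train churn rate: {train_rate:.3f}, test churn rate: {test_rate:.3f}")
```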

## 4. Baseline Model – Logistic Regression

In [None]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred_lr = log_reg.predict(X_test)

acc_lr = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression accuracy: {acc_lr:.3f}\n")
print(classification_report(y_test, y_pred_lr, target_names=["No churn", "Churn"]))
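
Regularized logistic regression (scikit-learn's default) can be sensitive to feature scale, and the features here range from 0/1 flags to charges near 100. One common option, not used in the cell above, is to standardize inside a pipeline so the scaler is fit on the training data only. A minimal sketch on freshly generated stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data using the same generator settings as the notebook
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.7, 0.3], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only, then reused at predict time
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
scaled_acc = scaled_lr.score(X_te, y_te)
print(f"Scaled logistic regression accuracy: {scaled_acc:.3f}")
```

Whether scaling changes the score depends on the data; the main benefit is that preprocessing and model travel together as one estimator.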

In [None]:
# Confusion matrix plot
cm = confusion_matrix(y_test, y_pred_lr)

fig, ax = plt.subplots(figsize=(4, 3))
im = ax.imshow(cm, interpolation="nearest")
ax.set_title("Logistic Regression – Confusion Matrix")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")

ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(["No churn", "Churn"])
ax.set_yticklabels(["No churn", "Churn"])

for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, cm[i, j], ha="center", va="center")

fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
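
The confusion matrix also yields the precision and recall values that `classification_report` prints. A short sketch of that arithmetic on a made-up 2x2 matrix (the counts are illustrative, not results from this notebook):

```python
import numpy as np

# Illustrative counts only. Rows are actual classes, columns are predictions,
# matching scikit-learn's confusion_matrix layout: [[TN, FP], [FN, TP]].
cm_demo = np.array([[240, 22],
                    [48, 65]])
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)             # of predicted churners, how many churned
recall = tp / (tp + fn)                # of actual churners, how many were caught
accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that were right
print(f"precision={precision:.3f}, recall={recall:.3f}, accuracy={accuracy:.3f}")
```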

## 5. Tree-Based Model – Random Forest

In [None]:
rf = RandomForestClassifier(
    n_estimators=200,
    random_state=RANDOM_STATE,
    max_depth=None,
)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest accuracy: {acc_rf:.3f}\n")
print(classification_report(y_test, y_pred_rf, target_names=["No churn", "Churn"]))

In [None]:
# Compare model performance
models = ["Logistic Regression", "Random Forest"]
accuracies = [acc_lr, acc_rf]

plt.figure(figsize=(5, 3))
plt.bar(models, accuracies)
plt.ylim(0, 1.0)
plt.ylabel("Accuracy")
plt.title("Model Comparison")
plt.tight_layout()
plt.show()
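
Accuracy alone can flatter a model on imbalanced data: always predicting "No churn" would already score about 0.70 here. As a hedged complement to the bar chart, the sketch below compares both models against a majority-class baseline using 5-fold cross-validated ROC AUC. It regenerates stand-in data, and the `candidates` dictionary is an illustrative name, not part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data using the same generator settings as the notebook
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.7, 0.3], random_state=42,
)

candidates = {
    "Majority baseline": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated ROC AUC; 0.5 means no better than chance
auc_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, score in auc_scores.items():
    print(f"{name}: mean ROC AUC = {score:.3f}")
```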

## 6. Feature Importance

For the Random Forest model, the impurity-based `feature_importances_` attribute shows which features contribute most to the churn prediction.

In [None]:
importances = rf.feature_importances_
idx = np.argsort(importances)[::-1]

sorted_features = [feature_cols[i] for i in idx]
sorted_importances = importances[idx]

plt.figure(figsize=(6, 4))
plt.bar(range(len(sorted_features)), sorted_importances)
plt.xticks(range(len(sorted_features)), sorted_features, rotation=45, ha="right")
plt.ylabel("Importance")
plt.title("Random Forest – Feature Importance")
plt.tight_layout()
plt.show()
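
Impurity-based importances are computed from the training data and can overstate features with many distinct values. Permutation importance on held-out data is a common cross-check. A minimal sketch using scikit-learn's `permutation_importance` on stand-in data (the `feature_{i}` labels are placeholders, not the notebook's column names):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data using the same generator settings as the notebook
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.7, 0.3], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle each test-set column in turn and measure the drop in accuracy
result = permutation_importance(
    forest, X_te, y_te, n_repeats=10, random_state=42
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Features whose mean importance sits near zero are ones the model can lose without a measurable drop in test accuracy.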

## 7. Wrap-up

In this notebook, we:

- Built a synthetic **customer churn** dataset.
- Performed quick **EDA** with summary stats and plots.
- Trained and evaluated **Logistic Regression** and **Random Forest** models.
- Visualized performance and **feature importance**.

This structure mirrors a typical small data science workflow and can serve as a starting point for a GitHub portfolio project.