# Customer Churn Prediction – Machine Learning Project

This notebook builds an end-to-end machine learning pipeline to predict customer churn using the Telco Customer Churn dataset. The goal is to identify customers who are at high risk of leaving so that the business can take proactive retention measures.


## 1. Introduction

Customer churn is one of the most important business problems for subscription-based services.  
The objective of this project is to:

- Explore customer behaviour  
- Identify major churn drivers  
- Build machine learning models that can classify at-risk customers  
- Provide business recommendations based on insights  

We use the **Telco Customer Churn** dataset, which includes information on customer demographics, service usage, account details, and churn labels.


## 2. Importing Required Libraries

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    roc_curve,
    confusion_matrix,
    accuracy_score
)

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"D:\Downlaods\WA_Fn-UseC_-Telco-Customer-Churn.csv")

## 2. Dataset Overview

The Telco dataset contains:

- **7043 rows** (customers)  
- **20+ features** including:
  - Customer demographics  
  - Subscription details  
  - Billing information  
  - Service usage  
- **Target variable:** `Churn` (Yes/No)

Before modeling, we explore the dataset to understand trends and patterns.


In [None]:
print(df.head())

In [None]:
print(df.info())

In [None]:
print(df.describe())

## 3. Cleaning and Formatting the Dataset

In [None]:
df.isna().sum()

In [None]:
df.info()
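
A caveat on the check above: `isna()` only counts true missing values, so blank strings inside a text column are not flagged. The sketch below uses a small made-up frame (not the real dataset) to show how coercing to numeric exposes such entries. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real data: one blank TotalCharges entry stored as text.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# isna() reports nothing because " " is a valid (non-null) string.
missing_reported = int(toy["TotalCharges"].isna().sum())

# Coercing to numeric turns unparseable entries into NaN, which exposes them.
coerced = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after_coerce = int(coerced.isna().sum())

print("reported by isna():", missing_reported)
print("after numeric coercion:", missing_after_coerce)
```

This is why the `TotalCharges` column is converted with `errors='coerce'` in the preprocessing section before missing values are handled.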

## 4. Exploratory Data Analysis (EDA)

EDA helps us understand how customers behave and which groups are more likely to churn.

Key areas explored:

- Churn distribution  
- Tenure patterns  
- Monthly charges  
- Contract type  
- Correlation between features  

These insights guide feature engineering and model selection.


### 4.1 Churn Distribution Analysis

In [None]:
sns.countplot(x=df['Churn'])
plt.title("Churn Distribution")
plt.show()

df["Churn"].value_counts(normalize=True)


The churn distribution is imbalanced, with approximately:

- **73% Non-churners**
- **27% Churners**

This imbalance means evaluation metrics like **accuracy** alone are unreliable.  
We will therefore focus on:

- Recall (churn class)  
- F1-score  
- ROC-AUC  
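
To make the point about accuracy concrete, here is a minimal sketch with a hypothetical test set of 100 customers that mirrors the roughly 73/27 split. The "model" simply predicts "no churn" for everyone; all values are made up for illustration.

```python
# Hypothetical test set of 100 customers mirroring the ~73/27 split.
y_true = [0] * 73 + [1] * 27

# A trivial "model" that always predicts the majority class (no churn).
y_pred = [0] * 100

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the churn class: share of actual churners that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
churn_recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, churn recall = {churn_recall:.2f}")
```

The trivial model looks respectable on accuracy while catching no churners at all, which is why recall on the churn class is tracked separately.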


### 4.2 Tenure Distribution Analysis

In [None]:
sns.histplot(df["tenure"], kde = True)
plt.title("Tenure Distribution")
plt.show()

The distribution of customer tenure shows three patterns:

- A large spike at very low tenure (0–5 months): a large share of the customer base is newly onboarded
- A second spike around 70–72 months: a sizeable group of long-standing customers
- A dip in the mid-range (20–50 months)

The histogram alone does not show churn, but combined with the churn labels the picture is consistent: customers with **very low tenure (0–6 months)** show the highest churn, while customers who have stayed long are much less likely to churn.  
This suggests that dissatisfaction starts early, highlighting the importance of strong onboarding.
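
One way to check the tenure claim directly is to bucket tenure and compute the churn rate per bucket. The sketch below does this on a small made-up frame; the bin edges, labels, and the `toy` data are assumptions for illustration, not results from the real dataset.

```python
import pandas as pd

# Made-up customers; Churn is still "Yes"/"No" at this stage of the notebook.
toy = pd.DataFrame({
    "tenure": [1, 3, 5, 14, 30, 45, 60, 71, 72, 2],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

# Hypothetical tenure bands in months; include_lowest keeps tenure == 0 in the first band.
bins = [0, 6, 24, 48, 72]
labels = ["0-6", "7-24", "25-48", "49-72"]
toy["tenure_band"] = pd.cut(toy["tenure"], bins=bins, labels=labels, include_lowest=True)

# Churn rate per band = mean of a boolean "churned" flag.
churn_rate = (
    toy.assign(churned=toy["Churn"].eq("Yes"))
       .groupby("tenure_band", observed=True)["churned"]
       .mean()
)
print(churn_rate)
```

Running the same grouping on `df` would give the per-band churn rates for the actual data.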

### 4.3 Monthly Charges Analysis

In [None]:
sns.histplot(df["MonthlyCharges"], kde=True)
plt.title("Monthly Charge Distribution")
plt.show()

The distribution of `MonthlyCharges` reveals:

- A cluster of customers paying very low charges (~$20–30)
- A major concentration around $70–100
- A long tail reaching above $100

Customers with higher monthly charges tend to churn more often. Two plausible explanations:

- Higher cost leads to increased dissatisfaction
- Price-sensitive customers may switch to cheaper competitors

This is consistent with the common view that pricing pressure is a major driver of churn in telecom and data services.


### 4.4 Contract Type vs Churn Analysis

In [None]:
sns.countplot(data=df, x= 'Contract', hue = "Churn")
plt.title("Churn By Contract Type")
plt.xticks(rotation= 45)
plt.show()

Contract type plays a crucial role:

- **Month-to-month contracts** have the highest churn: commitment is minimal and customers can leave easily  
- **One-year contracts** have significantly lower churn, a moderate retention effect  
- **Two-year contracts** have the lowest churn: customers with long-term commitments rarely churn early  

This pattern suggests that long-term commitments stabilize retention.
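
Because the count plot shows raw counts and the contract groups differ in size, it can help to look at churn rates per contract type. A minimal sketch using a row-normalized crosstab on made-up data (the `toy` frame is an assumption, not the real dataset):

```python
import pandas as pd

# Made-up customers for illustration only.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "Yes", "No",
              "Yes", "No", "No", "No",
              "No", "No"],
})

# normalize="index" makes each row sum to 1, so the "Yes" column is the churn rate.
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```

Applying the same call to `df["Contract"]` and `df["Churn"]` gives the rates for the actual data.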


## 5. Data Preprocessing

To prepare the data for modeling:

1. Convert `TotalCharges` to numeric  
2. Handle missing values  
3. Drop unnecessary identifiers (`customerID`)  
4. Encode categorical variables using one-hot encoding  
5. Train-test split (80/20, stratified on `Churn`)  
6. Scale features using StandardScaler, fitted on the training set only

Preprocessing ensures the dataset is clean and machine-learning ready.
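
The order of splitting and scaling matters: if the scaler is fitted on all rows, statistics from the test set leak into training. The sketch below shows the leak-free order on synthetic data. The arrays `X` and `y` are randomly generated placeholders, not the Telco features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholders: 100 rows, 2 features, ~73/27 class split.
rng = np.random.default_rng(13)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = np.array([0] * 73 + [1] * 27)

# Split first, keeping the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std learned from train only
X_test_scaled = scaler.transform(X_test)        # same mean/std reused on test

print("train column means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("test shape:", X_test_scaled.shape)
```

The test columns will generally not have exactly zero mean after scaling, which is expected because they are transformed with the training statistics.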


In [None]:
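# TotalCharges is read as text; entries that cannot be parsed become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors = 'coerce')

# Fill the resulting missing values with the column median
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())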

In [None]:
df["Churn"] = df["Churn"].map({"Yes":1, "No":0})
df_encoded = pd.get_dummies(df.drop(["customerID"], axis = 1), drop_first=True)

### Correlation Heatmap

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df_encoded.corr(), cmap='coolwarm', center=0)
plt.title("Correlation Heatmap")
plt.show()


The heatmap highlights:

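- A clear negative correlation between **tenure** and **churn**
- Positive correlation between:
  - Month-to-month contracts and churn (visible indirectly: with `drop_first=True`, month-to-month is the dropped baseline category, so it appears as negative correlations for the one-year and two-year contract columns)
  - Electronic check payment and churn
  - Fiber optic internet and churn  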

These will be important features for modeling.
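
With many encoded columns, a heatmap can be hard to read. A complementary view is to pull out the `Churn` column of the correlation matrix and sort it. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up, already-numeric data standing in for the encoded frame.
toy = pd.DataFrame({
    "tenure": [1, 2, 5, 40, 60, 72],
    "MonthlyCharges": [90.0, 85.0, 70.0, 55.0, 30.0, 20.0],
    "Churn": [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target, most negative first.
churn_corr = toy.corr()["Churn"].drop("Churn").sort_values()
print(churn_corr)
```

On the real data, the equivalent would be `df_encoded.corr()["Churn"].drop("Churn").sort_values()`.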


In [None]:
df_encoded.head()

In [None]:
df_encoded.isna().sum()

## 5. Model Training

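We train three models. Logistic Regression serves as an untuned baseline, while Random Forest and XGBoost are tuned with randomized search: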

- Logistic Regression  
- Random Forest  
- XGBoost  

These models provide a good mix of:

- Linear interpretability (LR)  
- Non-linear structure (RF)  
- Boosted performance (XGB)


In [None]:
from sklearn.model_selection import train_test_split

x = df_encoded.drop('Churn', axis=1)
y = df_encoded["Churn"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=13, stratify=y)


In [None]:
# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)


In [None]:
def random_search(model, parameters, model_type):
    # Sample 50 candidates from the grid; score each by 5-fold cross-validated ROC-AUC
    search = RandomizedSearchCV(model, parameters, n_iter=50, scoring='roc_auc',
                                cv=5, n_jobs=1, verbose=1, random_state=13)
    search.fit(x_train, y_train)
    best_model = search.best_estimator_  # already refit on the full training set
    print(f"Best {model_type} Parameters:", search.best_params_)
    return best_model

In [None]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=500)

lr_model.fit(x_train, y_train)
pred = lr_model.predict(x_test)
probs = lr_model.predict_proba(x_test)[:, 1]

In [None]:
rf_classifier = RandomForestClassifier(random_state=13, class_weight='balanced')
rf_parameter_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

best_rf_model = random_search(rf_classifier, rf_parameter_grid, "Random Forest")
rf_pred = best_rf_model.predict(x_test)
rf_probs = best_rf_model.predict_proba(x_test)[:, 1]

In [None]:
xgb_classifier = XGBClassifier(eval_metric = 'logloss', random_state=13)
xgb_parameter_grid = {
    'n_estimators': [200, 300, 500, 800],
    'learning_rate': [0.01, 0.03, 0.05, 0.1],
    'max_depth': [3, 4, 5, 6, 8],
    'subsample': [0.6, 0.7, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 1.0],
    'gamma': [0, 1, 5],
    'reg_alpha': [0, 0.01, 0.1],
    'reg_lambda': [1, 2, 5]
}
best_xgb_model = random_search(xgb_classifier, xgb_parameter_grid, "XGBoost")

# RandomizedSearchCV refits the best estimator on the full training set by default,
# so no extra fit call is needed before predicting.
xgb_pred = best_xgb_model.predict(x_test)
xgb_probs = best_xgb_model.predict_proba(x_test)[:, 1]
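
For a sense of how much of each search space the randomized search actually visits, the arithmetic below counts the combinations in the two grids above and compares them with the 50 sampled candidates. The per-parameter counts are read off the grids; the variable names are only for illustration.

```python
from math import prod

# Number of candidate values per hyperparameter, read off the grids above.
rf_sizes = [3, 5, 4, 4, 3, 2]          # n_estimators, max_depth, ..., bootstrap
xgb_sizes = [4, 4, 5, 4, 4, 3, 3, 3]   # n_estimators, learning_rate, ..., reg_lambda

n_iter = 50   # candidates sampled per search
cv_folds = 5  # folds per candidate

for name, sizes in (("Random Forest", rf_sizes), ("XGBoost", xgb_sizes)):
    total = prod(sizes)
    print(f"{name}: {total} combinations, "
          f"{n_iter / total:.2%} sampled, "
          f"{n_iter * cv_folds} cross-validation fits")
```

Only a small fraction of each grid is explored, so the reported "best" parameters are the best among the sampled candidates rather than the best in the full grid.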

## 6. Model Evaluation

Models are evaluated using:

- **Recall (Churn class)** – priority metric  
- **F1-score (Churn)**  
- **Accuracy**  
- **ROC-AUC**  

Why recall matters:
We want to catch as many churners as possible, because missing a churner is costlier than mistakenly flagging a loyal customer.
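
If recall is the priority, the decision threshold applied to predicted probabilities is one lever: lowering it below the default 0.5 flags more customers, which raises recall at the cost of precision. The sketch below illustrates this with eight hypothetical customers. The probabilities, labels, and the helper `recall_precision` are made up for illustration.

```python
# Hypothetical churn probabilities and true labels (1 = churned).
probs_demo = [0.92, 0.71, 0.48, 0.41, 0.35, 0.22, 0.15, 0.05]
labels_demo = [1, 1, 1, 0, 1, 0, 0, 0]

def recall_precision(threshold):
    """Return (recall, precision) for the churn class at a given threshold."""
    preds = [int(p >= threshold) for p in probs_demo]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels_demo))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels_demo))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels_demo))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

for threshold in (0.5, 0.3):
    recall, precision = recall_precision(threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

In this toy case, moving the threshold from 0.5 to 0.3 catches every churner but also flags one loyal customer.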


In [None]:
print("Logistic Regression Classification Report:")
print(classification_report(y_test, pred))

print("Random Forest Classifier Classification Report:")
print(classification_report(y_test, rf_pred))

print("XGBoost Classifier Classification Report:")
print(classification_report(y_test, xgb_pred))

In [None]:
def plot_confusion_matrix(y_true, y_pred, title):
    cm  =  confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot = True, fmt = "d" , cmap="Blues")
    plt.title(title)
    plt.xlabel("Prediction")
    plt.ylabel("Actual")
    plt.show()

plot_confusion_matrix(y_test, rf_pred, "Random Forest Classifier Confusion Matrix")
plot_confusion_matrix(y_test, xgb_pred, "XGBoost Classifier Confusion Matrix")

### ROC Curve

The ROC curve plots the true positive rate against the false positive rate across all classification thresholds.  
The **AUC score** (area under this curve) summarizes how well the model ranks churners above non-churners:

- 0.5 = random guessing  
- 1.0 = perfect classifier  
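
One way to read the AUC: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes it by comparing every positive/negative pair. The function name `pairwise_auc` and the example scores are made up for illustration.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores for six customers.
labels_demo = [1, 1, 0, 1, 0, 0]
scores_demo = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

print(round(pairwise_auc(labels_demo, scores_demo), 3))
```

Here 8 of the 9 positive/negative pairs are ordered correctly. The pairwise loop is quadratic, so it is meant for intuition only; `roc_auc_score` is the practical choice on real data.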


In [None]:
print("Logistic Regression ROC-AUC:", roc_auc_score(y_test, probs))
print("Random Forest ROC-AUC:", roc_auc_score(y_test, rf_probs))
print("XGBoost ROC-AUC:", roc_auc_score(y_test, xgb_probs))

In [None]:
fpr_lr, tpr_lr, _ = roc_curve(y_test, probs)
fpr_rf, tpr_rf,_ = roc_curve(y_test, rf_probs)
fpr_xgb, tpr_xgb,_ = roc_curve(y_test, xgb_probs)

plt.figure(figsize=(10,6))
plt.plot(fpr_lr, tpr_lr, label = "Logistic Regression")
plt.plot(fpr_rf, tpr_rf, label = "Random Forest")
plt.plot(fpr_xgb, tpr_xgb, label = "XGBoost")
plt.plot([0,1],[0,1], linestyle = '--')

plt.title("ROC Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

## 7. Feature Importance (XGBoost)

XGBoost provides feature importance based on gain (improvement in splits).  
Top churn drivers include:

- Fiber optic internet  
- Month-to-month contract  
- Electronic check payment  
- Low tenure  
- Lack of security/tech support add-ons  


In [None]:
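importance_dict = best_xgb_model.get_booster().get_score(importance_type = 'gain')

# The model was trained on a scaled NumPy array, so XGBoost names features f0, f1, ...
# Map each index back to the original column name in x.
mapped_importance = {
    x.columns[int(k[1:])]: v for k, v in importance_dict.items()
}

mapped_importance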

In [None]:
xgb_importance = pd.DataFrame({
    'feature': list(mapped_importance.keys()),
    'importance': list(mapped_importance.values())
})

xgb_importance = xgb_importance.sort_values(by = 'importance', ascending= False)

plt.figure(figsize=(10,8))
plt.barh(xgb_importance['feature'][:15], xgb_importance['importance'][:15])
plt.title("Top 15 XGBoost Feature Importance")
plt.gca().invert_yaxis()
plt.show()

## 8. Business Insights

Based on the model’s feature importance:

1. **Promote longer-term contracts**  
2. **Improve onboarding for new customers**  
3. **Investigate fiber optic dissatisfaction**  
4. **Encourage auto-pay instead of electronic checks**  
5. **Bundle security/tech-support with subscriptions**

These actions target the churn drivers the model ranks highest and are intended to reduce churn.


## 9. Conclusion

This project successfully:

- Explored the Telco dataset  
- Identified key churn factors  
- Built ML models to predict churn  
- Selected the best model (XGBoost)  
- Delivered actionable business recommendations  

The pipeline can be extended using:
- SMOTE oversampling  
- SHAP interpretability  
- Deployment via Streamlit/Flask  
- A Power BI dashboard  
