
# Customer Churn Prediction

## Objective
Build an end-to-end machine learning pipeline to predict whether a customer will churn based on historical customer data.

**Dataset:** Telco Customer Churn Dataset (Public)  
**Target Variable:** Churn (Yes / No)


## Import Required Libraries

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report,
    confusion_matrix
)

sns.set_theme(style="whitegrid")  # set_theme is the preferred name; sns.set is an alias


## Load Dataset

In [None]:

df = pd.read_csv("Telco-Customer-Churn.csv")
df.head()


## Dataset Overview

In [None]:

df.info()

In [None]:

df.describe()

## Target Variable Distribution

In [None]:

sns.countplot(x="Churn", data=df)
plt.title("Churn Distribution")
plt.show()


## Data Cleaning

In [None]:

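# Non-numeric TotalCharges entries (e.g. blanks) cannot be parsed and become NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
# Drop every row that contains a missing value, including the unparseable ones above.
df.dropna(inplace=True)
# customerID is an identifier, not a predictive feature.
df.drop("customerID", axis=1, inplace=True)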
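
As a minimal sketch of what the cleaning step does, the snippet below applies the same calls to a made-up three-row frame. The values are hypothetical and only illustrate how `errors="coerce"` turns a blank string into `NaN` so that `dropna` can remove the row.

```python
import pandas as pd

# Hypothetical rows; the blank TotalCharges mimics an unparseable entry.
toy = pd.DataFrame({
    "customerID": ["a", "b", "c"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# Unparseable strings become NaN instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Same cleanup as the cell above, written without inplace=True.
cleaned = toy.dropna().drop(columns="customerID")
print(n_missing, cleaned.shape)
```

Checking `n_missing` before dropping rows makes it visible how many records the cleaning step discards.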


## Exploratory Data Analysis

In [None]:

sns.boxplot(x="Churn", y="tenure", data=df)
plt.title("Tenure vs Churn")
plt.show()


In [None]:

sns.boxplot(x="Churn", y="MonthlyCharges", data=df)
plt.title("Monthly Charges vs Churn")
plt.show()


In [None]:

sns.countplot(x="Contract", hue="Churn", data=df)
plt.title("Contract Type vs Churn")
plt.show()



## EDA Insights
- Customers with low tenure churn more often than long-standing customers.
- Customers who churn tend to have higher monthly charges.
- Month-to-month contracts show higher churn than longer-term contracts.
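
The plots support these observations visually; a churn rate per group makes the same comparison numeric. The sketch below uses a small made-up frame (the rows are hypothetical, not taken from the dataset) and assumes `Churn` still holds the raw `Yes` / `No` strings.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# Share of churned customers within each contract type.
churn_rate = (
    toy["Churn"].eq("Yes")
    .groupby(toy["Contract"])
    .mean()
    .sort_values(ascending=False)
)
print(churn_rate)
```

On the real data the same expression can be run with `df` in place of `toy`, before the target is label-encoded.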


## Feature Engineering

In [None]:

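# LabelEncoder sorts the classes alphabetically, so "No" -> 0 and "Yes" -> 1.
le = LabelEncoder()
df["Churn"] = le.fit_transform(df["Churn"])

# One-hot encode the remaining categorical columns, dropping one level per feature.
df = pd.get_dummies(df, drop_first=True)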
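
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame (the rows are hypothetical). The first level in sorted order is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical rows: one categorical and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 12, 24],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Numeric columns pass through unchanged, and a row with all indicator columns equal to zero corresponds to the dropped baseline level.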



## Feature Engineering Justification
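- One-hot encoding converts categorical columns into numeric indicator columns; `drop_first=True` removes one redundant level per feature.
- Scaling puts features on a comparable range, which helps Logistic Regression converge and keeps regularization from being dominated by large-valued features.
- Class imbalance is handled with `class_weight="balanced"` in both models.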
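
As a rough illustration of the class-weight point, scikit-learn documents the `"balanced"` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula with NumPy on hypothetical labels.

```python
import numpy as np

# Hypothetical labels: three non-churners and one churner.
y_toy = np.array([0, 0, 0, 1])

counts = np.bincount(y_toy)                      # samples per class
weights = len(y_toy) / (len(counts) * counts)    # "balanced" heuristic
print(dict(enumerate(weights)))
```

The minority class receives the larger weight, so errors on churners cost more during training.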


## Train-Test Split & Scaling

In [None]:

X = df.drop("Churn", axis=1)
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

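# Fit the scaler on the training split only, so no test-set statistics leak into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # returns a NumPy array; column names live in X.columns
X_test = scaler.transform(X_test)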
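
The following sketch, on synthetic data, illustrates the two properties this cell relies on: `stratify` keeps the class ratio the same in both splits, and a scaler fitted on the training split standardizes that split exactly. The arrays are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 20% positive class.
X_toy = np.arange(100, dtype=float).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit on the training split only, then reuse the fitted statistics.
scaler_toy = StandardScaler().fit(X_tr)
X_tr_s = scaler_toy.transform(X_tr)
X_te_s = scaler_toy.transform(X_te)

train_rate, test_rate = y_tr.mean(), y_te.mean()
print(train_rate, test_rate, X_tr_s.mean(), X_tr_s.std())
```

The test split is transformed with training statistics, so its mean and standard deviation are generally close to, but not exactly, 0 and 1.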


## Model Training

In [None]:

# class_weight="balanced" weights each class inversely to its frequency.
lr = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)

lr.fit(X_train, y_train)
rf.fit(X_train, y_train)


## Model Evaluation

In [None]:

def evaluate(model, name):
    """Print test-set metrics for a fitted classifier (uses the global X_test, y_test)."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the churn class (1)
    print(name)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1:", f1_score(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_prob))
    print("-" * 40)

evaluate(lr, "Logistic Regression")
evaluate(rf, "Random Forest")
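
For reference, the threshold-based metrics printed above all derive from four confusion-matrix counts. The sketch below spells out the arithmetic in plain Python; `metrics_from_counts` is a hypothetical helper and the counts are made up, not results from this notebook.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }

# Hypothetical counts for illustration only.
m = metrics_from_counts(tp=30, fp=10, fn=20, tn=140)
print(m)
```

With imbalanced classes, accuracy can stay high while recall on churners is low, which is why the evaluation reports all of these metrics rather than accuracy alone.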


## Confusion Matrix

In [None]:

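# Rows are actual classes, columns are predicted classes (0 = No, 1 = Yes).
cm = confusion_matrix(y_test, rf.predict(X_test))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["No", "Yes"], yticklabels=["No", "Yes"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Random Forest Confusion Matrix")
plt.show()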
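
When the individual counts are needed rather than the heatmap, a binary confusion matrix can be unpacked with `ravel()`, which returns the cells in the order true negatives, false positives, false negatives, true positives. The labels below are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = churn, 0 = no churn.
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(tn, fp, fn, tp)
```

For churn, `fn` counts the churners the model missed, which is usually the costlier kind of error.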


## Feature Importance

In [None]:

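# X_train is a NumPy array after scaling, so feature names are taken from X.columns.
importance = pd.Series(rf.feature_importances_, index=X.columns)
importance.sort_values(ascending=False).head(10)  # ten most important features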
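
As a sanity check on how to read these numbers, the sketch below fits a small forest on synthetic data where one column fully determines the label and the other is pure noise. The column names `signal` and `noise` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on "signal".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = (X_demo["signal"] > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo.to_numpy(), y_demo)

# Impurity-based importances are normalized to sum to 1.
imp = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
imp = imp.sort_values(ascending=False)
print(imp)
```

Because the values sum to 1, they rank features relative to each other; they do not indicate the direction of a feature's effect on churn.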



## Business Insights & Conclusion

- Month-to-month customers churn more than customers on longer contracts.
- Low-tenure customers churn more, so early engagement is critical.
- Customers with higher monthly charges churn more, so pricing strategy affects retention.

Random Forest performed best and is selected as the final model.
