# Project Overview: Customer Churn Prediction 📉

This project predicts customer churn from historical customer data, here the Telco Customer Churn dataset. By applying machine learning models, we aim to identify customers at high risk of leaving so that the company can target them with retention strategies.

### Project Setup and Data Ingestion

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
# Set global display options
plt.style.use('seaborn-v0_8-whitegrid')

In [3]:
# Load the data (assumes the Telco churn file 'WA_Fn-UseC_-Telco-Customer-Churn.csv'
# has been saved locally as 'Telco-Customer-Churn.csv')
try:
    df = pd.read_csv('Telco-Customer-Churn.csv')
    print("Data loaded successfully.")
    print("\n--- Initial Data Snapshot ---")
    print(df.head())
    print("\n--- Churn Distribution ---")
    print(df['Churn'].value_counts(normalize=True))
except FileNotFoundError:
    print("ERROR: CSV file not found. Please obtain a suitable churn dataset (e.g., Telco Churn from Kaggle).")
    # Re-raise instead of calling exit(), which would shut down the notebook kernel
    raise

Data loaded successfully.

--- Initial Data Snapshot ---
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  Tec

### Data Cleaning and Feature Engineering

##### Data Cleaning and Type Correction
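The 'TotalCharges' column is loaded as a string (object dtype) because a few rows contain a blank space instead of a number. These blanks represent missing values, so we drop those rows and convert the column to a numeric type.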

In [4]:
# 'customerID' is an identifier, not a feature
df.drop('customerID', axis=1, inplace=True)

In [5]:
# Correct 'TotalCharges' type and handle missing values
# Replace blank strings with NaN, drop those rows, then convert to numeric
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan)
df.dropna(subset=['TotalCharges'], inplace=True)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])
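
As a side note, the same cleaning can be sketched with `pd.to_numeric(..., errors='coerce')`, which turns any unparseable entry into NaN in a single step. The toy series below is hypothetical and only stands in for the real column.

```python
import pandas as pd

# Hypothetical stand-in for 'TotalCharges': numeric strings plus one blank entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors='coerce' converts unparseable values (such as ' ') to NaN
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())

# Drop the rows that could not be parsed
cleaned = converted.dropna()

print(f"Entries coerced to NaN: {n_missing}")
print(f"dtype after conversion: {converted.dtype}")
print(f"Rows kept: {len(cleaned)}")
```

Counting the coerced entries before dropping them is a cheap way to confirm how many rows the cleaning step removes.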

In [6]:
# Convert the target variable 'Churn' to a binary integer (0 or 1)
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
print(f"\nRows after cleaning: {len(df)}")


Rows after cleaning: 7032


We separate features (X) from the target (y) and split the data into training and testing sets.

In [7]:
# Separate features (X) and target (y)
X = df.drop('Churn', axis=1)
y = df['Churn']

# Split data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Training set size: 5625
Testing set size: 1407
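
The `stratify=y` argument keeps the churn rate the same in both splits, which matters when the classes are imbalanced. A minimal sketch on hypothetical labels (the 27% positive rate and the random features are assumptions for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 270 positives out of 1000
rng = np.random.default_rng(0)
y_toy = np.array([1] * 270 + [0] * 730)
X_toy = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits should match the overall positive rate
print(f"Overall positive rate: {y_toy.mean():.3f}")
print(f"Train positive rate:   {y_tr.mean():.3f}")
print(f"Test positive rate:    {y_te.mean():.3f}")
```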


We define which columns are categorical (to be one-hot encoded) and which are numerical (to be standardized). Scaling matters for models such as Logistic Regression, whose regularized fit depends on the scale of each feature.

In [8]:
# Identify column types
numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(include='object').columns.tolist()

In [9]:
# Create preprocessing transformers
numerical_transformer = StandardScaler() # Standardize numerical features
categorical_transformer = OneHotEncoder(handle_unknown='ignore') # Encode categorical features

In [13]:
# Combine transformers into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'
)
print("\n--- Preprocessing Pipeline Defined ---")


--- Preprocessing Pipeline Defined ---


### Model Building and Evaluation


##### Model 1 - Logistic Regression
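Logistic Regression is a strong baseline model. Because scikit-learn applies L2 regularization by default, the fit is sensitive to feature scaling, which is why the numerical features are standardized first.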

In [15]:
# Create the Logistic Regression Pipeline
logreg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='liblinear', random_state=42))
])

# Train the model
logreg_pipeline.fit(X_train, y_train)

# Predict and Evaluate
y_pred_logreg = logreg_pipeline.predict(X_test)
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)

# Displaying
print("\n--- Model 1: Logistic Regression Results ---")
print(f"Accuracy: {accuracy_logreg:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_logreg))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_logreg))


--- Model 1: Logistic Regression Results ---
Accuracy: 0.8038
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87      1033
           1       0.65      0.57      0.61       374

    accuracy                           0.80      1407
   macro avg       0.75      0.73      0.74      1407
weighted avg       0.80      0.80      0.80      1407

Confusion Matrix:
 [[917 116]
 [160 214]]
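
The headline metrics can be recomputed by hand from the confusion matrix printed above, which makes clear where the churn-class precision and recall come from. A short sketch using those four counts (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the Logistic Regression output above
cm = np.array([[917, 116],
               [160, 214]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_churn = tp / (tp + fp)   # of predicted churners, how many churned
recall_churn = tp / (tp + fn)      # of actual churners, how many were caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f"Accuracy:          {accuracy:.4f}")
print(f"Precision (churn): {precision_churn:.2f}")
print(f"Recall (churn):    {recall_churn:.2f}")
print(f"F1 (churn):        {f1_churn:.2f}")
```

The recall line is the one to watch for retention work: the model misses 160 of the 374 actual churners in the test set even though overall accuracy is about 80%.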


##### Model 2 - Decision Tree Classifier
A Decision Tree provides a non-linear, interpretable model. We limit its depth to prevent overfitting.

In [16]:
# Create the Decision Tree Pipeline
dtree_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(max_depth=5, random_state=42))
])

# Train the model
dtree_pipeline.fit(X_train, y_train)

# Predict and Evaluate
y_pred_dtree = dtree_pipeline.predict(X_test)
accuracy_dtree = accuracy_score(y_test, y_pred_dtree)

# Displaying
print("\n--- Model 2: Decision Tree Results ---")
print(f"Accuracy: {accuracy_dtree:.4f}")
print("Classification Report:\n", classification_report(y_test, y_pred_dtree))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_dtree))


--- Model 2: Decision Tree Results ---
Accuracy: 0.7896
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.85      0.86      1033
           1       0.60      0.61      0.61       374

    accuracy                           0.79      1407
   macro avg       0.73      0.73      0.73      1407
weighted avg       0.79      0.79      0.79      1407

Confusion Matrix:
 [[881 152]
 [144 230]]
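
To illustrate why the depth is capped, the sketch below compares a depth-limited tree with an unrestricted one on synthetic data. The dataset comes from `make_classification` with label noise and is an assumption for illustration, so the numbers say nothing about the churn data itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (not the churn data)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Compare train and test accuracy for a shallow and an unrestricted tree
results = {}
for depth in (5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"test={results[depth][1]:.3f}")
```

A large gap between train and test accuracy for the unrestricted tree is the usual sign of overfitting. In practice the depth would be chosen by cross-validation rather than fixed at 5.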


##### Analyzing Logistic Regression Coefficients
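For Logistic Regression, the sign of each coefficient shows the direction of a feature's effect on the log-odds of churn, and the magnitude shows its strength. Because the numerical features were standardized, their coefficients describe the effect of a one-standard-deviation change.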

In [17]:
# Get feature names after one-hot encoding
feature_names = logreg_pipeline['preprocessor'].get_feature_names_out()

# Get coefficients from the trained model
coefficients = logreg_pipeline['classifier'].coef_[0]

# Create a DataFrame for easy viewing
feature_importance_logreg = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
}).sort_values(by='Coefficient', ascending=False)

print("\n--- Top 10 Features Driving CHURN (Positive Coefficient) ---")
print(feature_importance_logreg.head(10))

print("\n--- Top 10 Features Preventing CHURN (Negative Coefficient) ---")
print(feature_importance_logreg.tail(10).iloc[::-1])


--- Top 10 Features Driving CHURN (Positive Coefficient) ---
                                Feature  Coefficient
3                     num__TotalCharges     0.640922
36         cat__Contract_Month-to-month     0.601468
16     cat__InternetService_Fiber optic     0.558548
43  cat__PaymentMethod_Electronic check     0.171924
32                 cat__StreamingTV_Yes     0.170074
35             cat__StreamingMovies_Yes     0.156036
18               cat__OnlineSecurity_No     0.152477
27                  cat__TechSupport_No     0.131335
14               cat__MultipleLines_Yes     0.071819
0                    num__SeniorCitizen     0.071338

--- Top 10 Features Preventing CHURN (Negative Coefficient) ---
                                      Feature  Coefficient
1                                 num__tenure    -1.349386
38                     cat__Contract_Two year    -0.785987
15                   cat__InternetService_DSL    -0.614660
2                         num__MonthlyCharges    -0.50

## Interpretation Key:

1. Positive Coefficient: Increases the log-odds of a customer churning (e.g., a month-to-month contract).

2. Negative Coefficient: Decreases the log-odds of a customer churning (e.g., higher tenure).
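
Log-odds are hard to read directly, so a common follow-up is to exponentiate the coefficients into odds ratios. The sketch below does this for four coefficients copied from the output above; the selection is illustrative only.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the Logistic Regression output (log-odds scale)
coefs = pd.Series({
    "num__tenure": -1.349386,
    "cat__Contract_Two year": -0.785987,
    "cat__Contract_Month-to-month": 0.601468,
    "cat__InternetService_Fiber optic": 0.558548,
})

# exp(coefficient) is the multiplicative change in the odds of churn
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

An odds ratio above 1 means the feature raises the odds of churn and a value below 1 means it lowers them. For a standardized feature such as tenure, the ratio applies to an increase of one standard deviation, holding the other features fixed.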