Project Overview

The goal of this project is to predict customer churn for a telecom company. Customer churn occurs when a customer stops using the company’s services, and predicting it allows the company to take proactive steps to retain customers.

We are using the Kaggle Telecom Churn Dataset (telecom_churn.csv) for this analysis. The dataset contains customer account information, service usage, and whether the customer has churned.

This is a binary classification problem because the target variable, churn, is categorical: it indicates whether a customer has churned (True) or not (False).

Stakeholder: The telecom company is the primary stakeholder. By predicting churn, it can:

Identify high-risk customers and implement retention strategies.

Reduce overall customer loss and revenue decline.

Improve marketing campaigns by targeting customers likely to churn.

IMPORTS 

In [53]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


LOADING DATASET

In [54]:
df = pd.read_csv("../data/telecom_churn.csv")
df.head()


,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


DATA INFO

In [60]:

df.info()
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

,account length,area code,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,437.182418,8.09901,179.775098,100.435644,30.562307,200.980348,100.114311,17.08354,200.872037,100.107711,9.039325,10.237294,4.479448,2.764581,1.562856
std,39.822106,42.37129,13.688365,54.467389,20.069084,9.259435,50.713844,19.922625,4.310668,50.573847,19.568609,2.275873,2.79184,2.461214,0.753773,1.315491
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0


DEFINING FEATURES AND TARGET

In [61]:

# Features and target
X = df.drop(columns="churn")
y = df["churn"]

print("Feature shape:", X.shape)
print("Target shape:", y.shape)


Feature shape: (3333, 20)
Target shape: (3333,)
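
Before modeling, it can help to look at how the two classes are balanced, since a skewed target makes plain accuracy easy to misread. The sketch below is not part of the original notebook. It builds a stand-in series, `y_demo` (a hypothetical name), from the class counts reported for the test split later in this notebook (713 non-churners, 121 churners) and computes the accuracy of always predicting the majority class.

```python
import pandas as pd

# Stand-in target built from the test-split counts reported in this notebook
# (713 False, 121 True); `y_demo` is a hypothetical name, not the notebook's `y`.
y_demo = pd.Series([False] * 713 + [True] * 121, name="churn")

# Share of each class; a strong skew means accuracy alone can mislead
class_share = y_demo.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print("Majority-class baseline accuracy:", round(majority_baseline, 4))
```

On the real data the same two calls can be run on `y` directly. Any model whose accuracy sits close to this baseline is probably not separating churners from non-churners well.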


CHECK FOR MISSING VALUES

In [65]:
# Check for missing values
print("Total NaNs in X:", X.isna().sum().sum())

# Only check numeric columns for infinity
numeric_cols = X.select_dtypes(include=[np.number])
print("Total infinite values in numeric columns:", np.isinf(numeric_cols.values).sum())


Total NaNs in X: 0
Total infinite values in numeric columns: 0


CONVERT CATEGORICAL COLUMNS 

In [66]:
# Identify object columns
categorical_cols = X.select_dtypes(include=["object"]).columns
print("Categorical columns:", list(categorical_cols))

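# Convert categorical columns to dummy variables.
# Note: 'phone number' is unique per customer, so it expands into thousands of dummy columns.
# dtype=int keeps the encoded frame fully numeric on pandas versions where dummies default to bool.
X = pd.get_dummies(X, drop_first=True, dtype=int)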

# Check updated shape
print("New X shape after encoding:", X.shape)



Categorical columns: ['state', 'phone number', 'international plan', 'voice mail plan']
New X shape after encoding: (3333, 3400)
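
The jump from 20 to 3,400 columns is consistent with `phone number` being a per-customer identifier: with `drop_first=True`, 3,333 distinct phone numbers alone would contribute 3,332 dummy columns. The sketch below is an illustration on a hypothetical five-row frame (`demo`) whose values are taken from the `df.head()` output above. It shows one way to spot such a column and how the encoded width changes when it is dropped. It is not what the notebook does, and the results reported in later cells still include the phone-number dummies.

```python
import pandas as pd

# Hypothetical miniature frame; values copied from the df.head() output above
demo = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH", "OK"],
    "phone number": ["382-4657", "371-7191", "358-1921", "375-9999", "330-6626"],
    "international plan": ["no", "no", "no", "yes", "yes"],
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
})

# Distinct values per non-numeric column; a count equal to the number of rows
# suggests an identifier rather than a useful category
cardinality = demo.select_dtypes(exclude=["number"]).nunique()
print(cardinality)

# Encode with and without the identifier column
encoded_all = pd.get_dummies(demo, drop_first=True, dtype=int)
encoded_no_id = pd.get_dummies(
    demo.drop(columns=["phone number"]), drop_first=True, dtype=int
)
print("With phone number:", encoded_all.shape)
print("Without phone number:", encoded_no_id.shape)
```

Identifier dummies carry no information that generalizes to unseen customers, so dropping the column before encoding is usually the safer choice.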


CHECK AGAIN FOR NaNs/Infs

In [67]:
# Now X should be fully numeric
print("Total NaNs in X after encoding:", X.isna().sum().sum())
print("Total infinite values in X after encoding:", np.isinf(X.values).sum())


Total NaNs in X after encoding: 0
Total infinite values in X after encoding: 0


TRAIN-TEST SPLIT

In [68]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y  # Keeps same churn ratio in train/test
)

print("Training shape:", X_train.shape)
print("Testing shape:", X_test.shape)


Training shape: (2499, 3400)
Testing shape: (834, 3400)


SCALING

In [69]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
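
The cell above fits the scaler on the training split only and reuses it on the test split, which avoids leaking test statistics into training. A scikit-learn `Pipeline` expresses the same idea in one object. The following is a minimal sketch on synthetic data from `make_classification`, not the churn dataset, and names such as `pipe` and `X_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the churn dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# fit() learns the scaling statistics from the training split only;
# predict() and score() reuse those statistics on new data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

Bundling the steps this way also makes cross-validation safer, because the scaler is refit inside every fold.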


BASELINE LOGISTIC REGRESSION MODEL

In [70]:
# LogisticRegression is already imported in the IMPORTS cell

# Initialize the model
logreg = LogisticRegression(max_iter=1000, random_state=42)

# Fit on scaled training data
logreg.fit(X_train_scaled, y_train)

# Predict on test data
y_pred_logreg = logreg.predict(X_test_scaled)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred_logreg))
print("\nClassification Report:\n", classification_report(y_test, y_pred_logreg))


Accuracy: 0.8561151079136691

Classification Report:
               precision    recall  f1-score   support

       False       0.86      1.00      0.92       713
        True       1.00      0.01      0.02       121

    accuracy                           0.86       834
   macro avg       0.93      0.50      0.47       834
weighted avg       0.88      0.86      0.79       834
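
The report above shows a churn recall of 0.01: the model predicts "no churn" for almost every customer, and its accuracy is close to the share of non-churners in the test split (713 of 834). One common option, not used in this notebook, is to reweight the classes with `class_weight="balanced"` and to inspect the confusion matrix alongside recall. The sketch below runs on synthetic imbalanced data, so its numbers say nothing about the churn dataset. All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 85% negatives and 15% positives (an assumption
# chosen to resemble the imbalance seen in the reports above)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Compare minority-class recall and the full confusion matrix for both models
for name, model in [("plain", plain), ("balanced", weighted)]:
    pred = model.predict(X_te)
    print(name, "recall on positive class:", recall_score(y_te, pred))
    print(confusion_matrix(y_te, pred))

cm_weighted = confusion_matrix(y_te, weighted.predict(X_te))
```

Rows of the confusion matrix are true classes and columns are predicted classes, so the bottom-left cell counts missed churners.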



DECISION TREE MODEL

In [71]:
# DecisionTreeClassifier is already imported in the IMPORTS cell

# Initialize model (baseline)
dt = DecisionTreeClassifier(random_state=42)

# Fit on training data
dt.fit(X_train, y_train)  # Trees do NOT require scaling

# Predict
y_pred_dt = dt.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))


Accuracy: 0.9328537170263789

Classification Report:
               precision    recall  f1-score   support

       False       0.94      0.98      0.96       713
        True       0.86      0.64      0.74       121

    accuracy                           0.93       834
   macro avg       0.90      0.81      0.85       834
weighted avg       0.93      0.93      0.93       834



TUNE HYPERPARAMETERS

In [72]:
# Constrain tree growth with max_depth and min_samples_leaf to reduce overfitting
dt_tuned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42)
dt_tuned.fit(X_train, y_train)
y_pred_dt_tuned = dt_tuned.predict(X_test)

# Evaluate tuned tree
print("Accuracy (tuned):", accuracy_score(y_test, y_pred_dt_tuned))
print("\nClassification Report (tuned):\n", classification_report(y_test, y_pred_dt_tuned))


Accuracy (tuned): 0.920863309352518

Classification Report (tuned):
               precision    recall  f1-score   support

       False       0.94      0.97      0.95       713
        True       0.78      0.64      0.70       121

    accuracy                           0.92       834
   macro avg       0.86      0.80      0.83       834
weighted avg       0.92      0.92      0.92       834
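
The cell above tries a single hand-picked setting and scores it on the test split. A more systematic alternative, shown here only as a sketch, is a cross-validated grid search on the training data, keeping the test split for a final check. The grid values, the `f1` scoring choice, and the synthetic data are all assumptions for illustration and do not come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data; not the churn dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Hypothetical search grid over the two hyperparameters used in the notebook
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation on the training split, scored by F1 on the positive class
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
print("Test accuracy of best tree:", search.best_estimator_.score(X_te, y_te))
```

Selecting hyperparameters by cross-validation keeps the test split untouched until the end, so the final score is a fairer estimate of performance on new customers.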



FEATURE IMPORTANCE

In [73]:
# Get feature importances
importances = pd.Series(dt_tuned.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head(10)


total day charge          0.235978
customer service calls    0.163728
total intl charge         0.118413
international plan_yes    0.105244
total eve charge          0.090944
total intl calls          0.083226
total day minutes         0.081572
voice mail plan_yes       0.070977
total eve minutes         0.035203
total day calls           0.007923
dtype: float64
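
Both `total day charge` and `total day minutes` appear in the ranking. In the rows shown by `df.head()`, the charge looks like a fixed multiple of the minutes, so the two columns may carry nearly the same information, and a tree can split its importance between them. The sketch below checks this on those five rows only. Whether the relationship holds for the full dataset is an assumption that would need to be verified on `df`.

```python
import numpy as np

# Values copied from the first five rows of the df.head() output
day_minutes = np.array([265.1, 161.6, 243.4, 299.4, 166.7])
day_charge = np.array([45.07, 27.47, 41.38, 50.9, 28.34])

# Charge per minute for each row; near-constant values point to a linear relation
rate = day_charge / day_minutes
print("Charge per minute:", rate.round(4))

# Pearson correlation between the two columns
corr = np.corrcoef(day_minutes, day_charge)[0, 1]
print("Correlation:", corr)
```

If the same pattern holds across all rows, it may be reasonable to keep only one column from each minutes/charge pair before reading the importances.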