# Classification of Churn Dataset: A Comprehensive Guide to Preprocessing and Modeling

This notebook walks step by step through classifying a customer churn dataset. It covers the data preprocessing the models depend on and compares machine learning models for predicting customer churn. The notebook includes:

- **Introduction**: Overview of the churn dataset and its importance.
- **Preprocessing**:
  - Handling missing values.
  - Encoding categorical features.
  - Feature scaling.
  - Splitting the data into training and testing sets.
- **Modeling**:
  - Implementing classification algorithms.
  - Evaluating model performance using accuracy, precision, recall, and F1-score.
- **Conclusion**: Summary of results and next steps for improvement.

By the end of this notebook, you will have preprocessed the churn data and trained classifiers that predict which customers are likely to churn.


In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

In [3]:
train_data = pd.read_csv('https://raw.githubusercontent.com/Ganesh-1912004/End-to-End-ML-Project-Pune-House-Price-Prediction-Model-/refs/heads/main/customer_churn_dataset-training-master.csv')
train_data

Unnamed: 0,CustomerID,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction,Churn
0,2.0,30.0,Female,39.0,14.0,5.0,18.0,Standard,Annual,932.00,17.0,1.0
1,3.0,65.0,Female,49.0,1.0,10.0,8.0,Basic,Monthly,557.00,6.0,1.0
2,4.0,55.0,Female,14.0,4.0,6.0,18.0,Basic,Quarterly,185.00,3.0,1.0
3,5.0,58.0,Male,38.0,21.0,7.0,7.0,Standard,Monthly,396.00,29.0,1.0
4,6.0,23.0,Male,32.0,20.0,5.0,8.0,Basic,Monthly,617.00,20.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
440828,449995.0,42.0,Male,54.0,15.0,1.0,3.0,Premium,Annual,716.38,8.0,0.0
440829,449996.0,25.0,Female,8.0,13.0,1.0,20.0,Premium,Annual,745.38,2.0,0.0
440830,449997.0,26.0,Male,35.0,27.0,1.0,5.0,Standard,Quarterly,977.31,9.0,0.0
440831,449998.0,28.0,Male,55.0,14.0,2.0,0.0,Standard,Quarterly,602.55,2.0,0.0


In [4]:
train_data.drop(columns = ['CustomerID'], inplace = True)
train_data

Unnamed: 0,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction,Churn
0,30.0,Female,39.0,14.0,5.0,18.0,Standard,Annual,932.00,17.0,1.0
1,65.0,Female,49.0,1.0,10.0,8.0,Basic,Monthly,557.00,6.0,1.0
2,55.0,Female,14.0,4.0,6.0,18.0,Basic,Quarterly,185.00,3.0,1.0
3,58.0,Male,38.0,21.0,7.0,7.0,Standard,Monthly,396.00,29.0,1.0
4,23.0,Male,32.0,20.0,5.0,8.0,Basic,Monthly,617.00,20.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
440828,42.0,Male,54.0,15.0,1.0,3.0,Premium,Annual,716.38,8.0,0.0
440829,25.0,Female,8.0,13.0,1.0,20.0,Premium,Annual,745.38,2.0,0.0
440830,26.0,Male,35.0,27.0,1.0,5.0,Standard,Quarterly,977.31,9.0,0.0
440831,28.0,Male,55.0,14.0,2.0,0.0,Standard,Quarterly,602.55,2.0,0.0


In [6]:
train_data.isnull().sum()  # isnull() is an alias of isna()

Unnamed: 0,0
Age,1
Gender,1
Tenure,1
Usage Frequency,1
Support Calls,1
Payment Delay,1
Subscription Type,1
Contract Length,1
Total Spend,1
Last Interaction,1


In [7]:
train_data.dtypes

Unnamed: 0,0
Age,float64
Gender,object
Tenure,float64
Usage Frequency,float64
Support Calls,float64
Payment Delay,float64
Subscription Type,object
Contract Length,object
Total Spend,float64
Last Interaction,float64


In [8]:
from sklearn.preprocessing import LabelEncoder

# Fit one LabelEncoder per categorical (object-dtype) column
encoders = {}
for column in train_data.columns:
    if train_data[column].dtype == 'object':
        le = LabelEncoder()
        le.fit(train_data[column])
        encoders[column] = le

def transform(dataFrame: pd.DataFrame) -> pd.DataFrame:
    """Encode categorical columns in place using the already-fitted encoders."""
    for column in dataFrame.columns:
        if dataFrame[column].dtype == 'object':
            # Use transform (not fit_transform) so the fitted mapping is reused
            dataFrame[column] = encoders[column].transform(dataFrame[column])
    return dataFrame

transformed_data = transform(train_data)
transformed_data

Unnamed: 0,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction,Churn
0,30.0,0,39.0,14.0,5.0,18.0,2,0,932.00,17.0,1.0
1,65.0,0,49.0,1.0,10.0,8.0,0,1,557.00,6.0,1.0
2,55.0,0,14.0,4.0,6.0,18.0,0,2,185.00,3.0,1.0
3,58.0,1,38.0,21.0,7.0,7.0,2,1,396.00,29.0,1.0
4,23.0,1,32.0,20.0,5.0,8.0,0,1,617.00,20.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
440828,42.0,1,54.0,15.0,1.0,3.0,1,0,716.38,8.0,0.0
440829,25.0,0,8.0,13.0,1.0,20.0,1,0,745.38,2.0,0.0
440830,26.0,1,35.0,27.0,1.0,5.0,2,2,977.31,9.0,0.0
440831,28.0,1,55.0,14.0,2.0,0.0,2,2,602.55,2.0,0.0


In [11]:
columnsName = list(transformed_data.columns)

In [14]:
columnsName

['Age',
 'Gender',
 'Tenure',
 'Usage Frequency',
 'Support Calls',
 'Payment Delay',
 'Subscription Type',
 'Contract Length',
 'Total Spend',
 'Last Interaction',
 'Churn']

In [15]:
transformed_data.dtypes

Unnamed: 0,0
Age,float64
Gender,int64
Tenure,float64
Usage Frequency,float64
Support Calls,float64
Payment Delay,float64
Subscription Type,int64
Contract Length,int64
Total Spend,float64
Last Interaction,float64


In [16]:
# Fill missing values with the column mean. Assign the result back to the
# column instead of calling fillna(..., inplace=True) on a column selection,
# which pandas warns will stop working in pandas 3.0.
for columns in transformed_data.columns:
    if transformed_data[columns].isnull().sum() > 0:
        transformed_data[columns] = transformed_data[columns].fillna(transformed_data[columns].mean())


In [None]:
transformed_data.isnull().sum()

Unnamed: 0,0
Age,0
Gender,0
Tenure,0
Usage Frequency,0
Support Calls,0
Payment Delay,0
Subscription Type,0
Contract Length,0
Total Spend,0
Last Interaction,0
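
Two details of this imputation are worth a second look. Imputation runs after label encoding here, so a missing category may already have been turned into a code of its own before `fillna` sees it (an assumption about the encoder's behaviour that is worth checking on your scikit-learn version). Also, filling a missing `Churn` label with the column mean and then casting to `int` silently assigns class 0. A minimal sketch of an alternative order, on a hypothetical toy frame rather than the real dataset: drop rows without a label first, then impute numeric columns with the median and categorical columns with the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same kinds of gaps as the churn table.
toy = pd.DataFrame({
    "Age": [30.0, 65.0, np.nan, 58.0],
    "Gender": ["Female", "Male", np.nan, "Male"],
    "Total Spend": [932.0, np.nan, np.nan, 396.0],
    "Churn": [1.0, 1.0, np.nan, 0.0],
})

# A row without a label cannot be used for supervised training, so drop it.
cleaned = toy.dropna(subset=["Churn"]).copy()

numeric_cols = cleaned.select_dtypes(include="number").columns.drop("Churn")
categorical_cols = cleaned.select_dtypes(exclude="number").columns

# Median for numeric features, most frequent value for categorical ones.
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
for col in categorical_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

cleaned["Churn"] = cleaned["Churn"].astype(int)
print(cleaned)
```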


In [None]:
# Standardize the feature columns (every column except the last one, Churn)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(transformed_data.iloc[:, :-1])
pd.DataFrame(scaled_data, columns = transformed_data.columns[:-1])

Unnamed: 0,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction
0,-0.75,-1.15,0.45,-0.21,0.45,0.61,1.21,-1.11,1.25,0.29
1,2.06,-1.15,1.03,-1.72,2.08,-0.60,-1.25,0.00,-0.31,-0.99
2,1.26,-1.15,-1.00,-1.38,0.78,0.61,-1.25,1.12,-1.85,-1.34
3,1.50,0.87,0.39,0.60,1.11,-0.72,1.21,0.00,-0.98,1.69
4,-1.32,0.87,0.04,0.49,0.45,-0.60,-1.25,0.00,-0.06,0.64
...,...,...,...,...,...,...,...,...,...,...
440828,0.21,0.87,1.32,-0.09,-0.85,-1.21,-0.02,-1.11,0.35,-0.75
440829,-1.16,-1.15,-1.35,-0.33,-0.85,0.85,-0.02,-1.11,0.47,-1.45
440830,-1.07,0.87,0.22,1.30,-0.85,-0.96,1.21,1.12,1.44,-0.64
440831,-0.91,0.87,1.38,-0.21,-0.52,-1.57,1.21,1.12,-0.12,-1.45
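
To make the scaled values above easier to interpret: `StandardScaler` subtracts each column's mean and divides by its population standard deviation, so every feature ends up with mean 0 and unit variance. The short check below reproduces that by hand on five hypothetical ages; the array `ages` is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five illustrative ages as a single-feature column.
ages = np.array([[30.0], [65.0], [55.0], [58.0], [23.0]])

scaled = StandardScaler().fit_transform(ages)

# Same computation by hand: z = (x - mean) / std, with population std (ddof=0).
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
```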


In [None]:
y = transformed_data.iloc[:, -1].astype(int)

In [None]:
transformed_data.shape

(440833, 11)

In [None]:
y

Unnamed: 0,Churn
0,1
1,1
2,1
3,1
4,1
...,...
440828,0
440829,0
440830,0
440831,0


In [None]:
X_train, X_test, y_train, y_test = train_test_split(scaled_data, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape

(352666, 10)
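
In this notebook the scaler is fitted on the full dataset before the split, so statistics from the test rows influence the training features. On a dataset this large the effect is probably small, but a common way to avoid it is to split first and wrap the scaler and the model in a pipeline, so the scaler is fitted on training rows only. The sketch below uses synthetic data (`X_demo`, `y_demo`) as a stand-in for the churn features, and adds `stratify` to keep the class ratio equal in both splits; both are assumptions, not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features: 500 rows, 10 columns.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 10))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Split first; stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses it on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
print("held-out accuracy:", pipe.score(X_te, y_te))
```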

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Accuracy:", accuracy)

Logistic Regression Accuracy: 0.852688647679971


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.85      0.83     38143
           1       0.88      0.85      0.87     50024

    accuracy                           0.85     88167
   macro avg       0.85      0.85      0.85     88167
weighted avg       0.85      0.85      0.85     88167
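
The precision, recall, and F1 figures in the report all derive from the confusion matrix. As a small worked example on eight hypothetical labels (not taken from the churn data), the sketch below computes them by hand and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions (1 = churned, 0 = stayed).
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_hat = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of predicted churners, how many really churned
recall = tp / (tp + fn)     # of real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp, precision, recall, f1)
```

For churn, recall on class 1 is often the figure to watch, since a missed churner usually costs more than an unnecessary retention offer.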



In [None]:
!pip install lazypredict



In [None]:
from lazypredict.Supervised import LazyClassifier

# Benchmark many classifiers on the first 1,000 rows only, to keep the run fast
classifier = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = classifier.fit(X_train[:1000], X_test[:1000], y_train[:1000], y_test[:1000])
print(models)

100%|██████████| 32/32 [00:02<00:00, 13.95it/s]

[LightGBM] [Info] Number of positive: 565, number of negative: 435
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000062 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 482
[LightGBM] [Info] Number of data points in the train set: 1000, number of used features: 10
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.565000 -> initscore=0.261480
[LightGBM] [Info] Start training from score 0.261480
                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
LGBMClassifier                     0.99               0.99     0.99      0.99   
RandomForestClassifier             0.98               0.99     0.99      0.99   
ExtraTreesClassifier               0.98               0.98     0.98      0.98   
BaggingClassifier                  0.97      
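
Because the comparison above uses only 1,000 training and 1,000 test rows, the ranking is best treated as a shortlist rather than a final result. One way to confirm a shortlisted model is stratified k-fold cross-validation on more data. The sketch below does this for a random forest on synthetic data from `make_classification`; the data, the 5 folds, and the 50 trees are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the scaled churn features and labels.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)

# Five stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
forest = RandomForestClassifier(n_estimators=50, random_state=42)

# One balanced-accuracy score per fold; the spread hints at stability.
scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")
print("mean:", scores.mean(), "std:", scores.std())
```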


