# Introduction

This notebook performs exploratory data analysis (EDA) and builds a binary classification model to predict customer churn. Churn occurs when a customer stops using a company's product or subscription. Here, churn refers to customers who have stopped using the bank's services. Given a labeled dataset, we train a binary classifier to predict, from the same features, whether a new customer is likely to stop using the service.

# Loading Data

In [1]:
import pandas as pd

traindf = pd.read_csv("traindata.csv")
traindf.head()

Unnamed: 0,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,15674932,Okwudilichukwu,668,France,Male,33.0,3,0.0,2,1.0,0.0,181449.97,0
1,1,15749177,Okwudiliolisa,627,France,Male,33.0,1,0.0,2,1.0,1.0,49503.5,0
2,2,15694510,Hsueh,678,France,Male,40.0,10,0.0,2,1.0,0.0,184866.69,0
3,3,15741417,Kao,581,France,Male,34.0,2,148882.54,1,1.0,1.0,84560.88,0
4,4,15766172,Chiemenam,716,Spain,Male,33.0,5,0.0,2,1.0,1.0,15068.83,0


In [2]:
traindf.size

2310476

In [3]:
traindf.shape
#165034 rows, 14 columns

(165034, 14)

In [4]:
def summary(df):
    """Return per-column dtype, missing counts, unique counts, and min/max."""
    print(f'data shape: {df.shape}')
    summ = pd.DataFrame(df.dtypes, columns=['data type'])
    missing = df.isnull().sum().values  # count missing values once and reuse
    summ['#missing'] = missing
    summ['%missing'] = missing / len(df) * 100
    summ['#unique'] = df.nunique().values
    # describe(include='all') leaves min/max as NaN for non-numeric columns
    desc = df.describe(include='all').transpose()
    summ['min'] = desc['min'].values
    summ['max'] = desc['max'].values
    return summ

summary(traindf)
# No missing values, binary variables already numerically encoded

data shape: (165034, 14)


Unnamed: 0,data type,#missing,%missing,#unique,min,max
id,int64,0,0.0,165034,0.0,165033.0
CustomerId,int64,0,0.0,23221,15565701.0,15815690.0
Surname,object,0,0.0,2797,,
CreditScore,int64,0,0.0,457,350.0,850.0
Geography,object,0,0.0,3,,
Gender,object,0,0.0,2,,
Age,float64,0,0.0,71,18.0,92.0
Tenure,int64,0,0.0,11,0.0,10.0
Balance,float64,0,0.0,30075,0.0,250898.09
NumOfProducts,int64,0,0.0,4,1.0,4.0


In [5]:
testdf = pd.read_csv("test.csv")
summary(testdf)

data shape: (110023, 13)


Unnamed: 0,data type,#missing,%missing,#unique,min,max
id,int64,0,0.0,110023,165034.0,275056.0
CustomerId,int64,0,0.0,19698,15565701.0,15815690.0
Surname,object,0,0.0,2708,,
CreditScore,int64,0,0.0,454,350.0,850.0
Geography,object,0,0.0,3,,
Gender,object,0,0.0,2,,
Age,float64,0,0.0,74,18.0,92.0
Tenure,int64,0,0.0,11,0.0,10.0
Balance,float64,0,0.0,22513,0.0,250898.09
NumOfProducts,int64,0,0.0,4,1.0,4.0


In [6]:
# Select the numeric columns, excluding the last column ("Exited")
numeric = traindf.iloc[:, :-1].select_dtypes(include=['float64', 'int64'])

# Compute the interquartile range (IQR) of each numeric column
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1

# Flag rows with any value outside the 1.5 * IQR fences.
# Note: this also covers id, CustomerId, and the binary flags; a binary
# column whose IQR is 0 has every minority-value row flagged as an outlier.
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
outliers = ((numeric < lower_fence) | (numeric > upper_fence)).any(axis=1)

# Remove outliers
traindf = traindf[~outliers]
traindf

Unnamed: 0,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,15674932,Okwudilichukwu,668,France,Male,33.0,3,0.00,2,1.0,0.0,181449.97,0
1,1,15749177,Okwudiliolisa,627,France,Male,33.0,1,0.00,2,1.0,1.0,49503.50,0
2,2,15694510,Hsueh,678,France,Male,40.0,10,0.00,2,1.0,0.0,184866.69,0
3,3,15741417,Kao,581,France,Male,34.0,2,148882.54,1,1.0,1.0,84560.88,0
4,4,15766172,Chiemenam,716,Spain,Male,33.0,5,0.00,2,1.0,1.0,15068.83,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165027,165027,15703793,Bevan,767,France,Female,44.0,4,76554.06,2,1.0,0.0,77837.63,0
165028,165028,15704770,Oluchukwu,630,France,Male,50.0,8,0.00,2,1.0,1.0,5962.50,0
165029,165029,15667085,Meng,667,Spain,Female,33.0,2,0.00,1,1.0,1.0,131834.75,0
165031,165031,15664752,Hsia,565,France,Male,31.0,5,0.00,1,1.0,1.0,127429.56,0
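
The IQR rule above is applied to every numeric column, including the binary flags. A minimal sketch on made-up data (the toy frame, the `iqr_outlier_mask` helper, and the `continuous_cols` list are assumptions for illustration, not part of the original notebook) shows why that matters: when more than 75% of a binary column holds the same value, its IQR is 0 and every row with the other value is flagged.

```python
import pandas as pd

# Toy data (hypothetical): one continuous column and one binary flag
toy = pd.DataFrame({
    "CreditScore": [600, 620, 640, 660, 680, 700, 720, 740, 760, 780],
    "HasCrCard":   [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
})

def iqr_outlier_mask(df, cols):
    """Return a boolean Series marking rows outside the 1.5 * IQR fences."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((df[cols] < lower) | (df[cols] > upper)).any(axis=1)

# Applying the rule to all columns flags both HasCrCard == 0 rows
mask_all = iqr_outlier_mask(toy, ["CreditScore", "HasCrCard"])

# Restricting the rule to continuous columns (an assumed choice) flags none
continuous_cols = ["CreditScore"]
mask_continuous = iqr_outlier_mask(toy, continuous_cols)

print(mask_all.sum(), mask_continuous.sum())
```

Whether the binary and identifier columns should be excluded from outlier removal on the real data is a modelling decision; the sketch only illustrates the mechanism.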


Churn is recorded in the Exited column: a value of 1 means the customer has left the bank, and 0 means they have stayed. We use Exited as the target variable.
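
Before modelling a binary target, it usually helps to check how balanced the two classes are, because accuracy is easy to inflate when one class dominates. A minimal sketch, using a small made-up `Exited` column rather than the real data:

```python
import pandas as pd

# Hypothetical labels standing in for traindf['Exited']
exited = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="Exited")

# Share of each class; the share of 1s is the churn rate
class_share = exited.value_counts(normalize=True).sort_index()
churn_rate = class_share.loc[1]

# Accuracy of a baseline that always predicts the majority class
baseline_accuracy = class_share.max()

print(f"churn rate: {churn_rate:.2f}, majority baseline: {baseline_accuracy:.2f}")
```

On the real data the same two lines would be applied to `traindf['Exited']`; any model accuracy is best read against that majority-class baseline.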

# Model Creation 

In [7]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Separate features (X) and target variable (y)
X = traindf.drop(columns=['Exited'])  # Exclude the target variable
y = traindf['Exited']

In [8]:
# Define categorical and numeric features
categorical_features = ['Surname', 'Geography', 'Gender']
# Every remaining column is treated as numeric, including id and CustomerId
numeric_features = [col for col in X.columns if col not in categorical_features]

In [9]:
# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
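
The transformer above passes `id` and `CustomerId` through as numeric features and one-hot encodes `Surname`. One possible variation, sketched here on made-up rows, drops those identifier-like columns before encoding. Whether that helps on the real data is not something this notebook tests, and the `identifier_cols` list is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows with the same kinds of columns as the training data
X_toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "CustomerId": [15674932, 15749177, 15694510, 15741417],
    "Surname": ["Okwudilichukwu", "Okwudiliolisa", "Hsueh", "Kao"],
    "CreditScore": [668, 627, 678, 581],
    "Geography": ["France", "France", "Spain", "Spain"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Age": [33.0, 33.0, 40.0, 34.0],
})

# Assumed list of identifier-like columns to leave out of the model
identifier_cols = ["id", "CustomerId", "Surname"]
X_model = X_toy.drop(columns=identifier_cols)

categorical = ["Geography", "Gender"]
numeric = [c for c in X_model.columns if c not in categorical]

preprocessor_toy = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ]
)

# 2 numeric columns + 2 Geography levels + 2 Gender levels
encoded = preprocessor_toy.fit_transform(X_model)
print(encoded.shape)
```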

In [16]:
# Combine the preprocessor with the classifier in a pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', XGBClassifier())])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Exited is already encoded as 0/1, so thresholding at 0.5 leaves the labels
# unchanged and only guarantees an integer dtype
threshold = 0.5
y_binary = (y_train > threshold).astype(int)

# Train the model
pipeline.fit(X_train, y_binary)


In [17]:
from sklearn.model_selection import cross_val_score

# Perform cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_binary, cv=5, scoring='accuracy')



In [18]:
print(cv_scores)

[0.87340113 0.87549801 0.87754246 0.87560285 0.87381389]
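
The five fold scores are easier to compare with other models when reduced to a mean and a spread. A minimal sketch, with the values copied from the output above:

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above
cv_scores = np.array([0.87340113, 0.87549801, 0.87754246, 0.87560285, 0.87381389])

mean_accuracy = cv_scores.mean()
std_accuracy = cv_scores.std()  # population standard deviation across folds

print(f"CV accuracy: {mean_accuracy:.4f} +/- {std_accuracy:.4f}")
```

The mean comes out at roughly 0.875, with the folds differing by well under one percentage point.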


In [13]:
# Predict on the unlabeled test set and on the training split
test_predictions = pipeline.predict(testdf)
train_predictions = pipeline.predict(X_train)

# Accuracy on the data the model was fitted on; this is an optimistic estimate
train_accuracy = accuracy_score(y_train, train_predictions)
print(f'Training Accuracy: {train_accuracy}')

Training Accuracy: 0.9999790310236005
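
Training accuracy is measured on rows the model has already seen, so it says little about generalization. The earlier split also produced `X_test` and `y_test`, which are never scored. A minimal sketch of comparing the two, on synthetic data and with scikit-learn's `RandomForestClassifier` standing in for the notebook's pipeline (both are assumptions made so the example runs on its own):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on the first feature plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in model; the notebook itself uses an XGBClassifier pipeline
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# A large gap between the two suggests overfitting
print(f"train: {train_acc:.3f}, held-out: {test_acc:.3f}")
```

In the notebook, the equivalent call would be `accuracy_score(y_test, pipeline.predict(X_test))`.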


In [14]:
predictions_df = pd.DataFrame({'id': testdf['id'], 'Exited': test_predictions})
predictions_df.to_csv('PipelinePassthroughXGB.csv', index=False)

The model's most important predictive features are Age, Balance, and IsActiveMember (whether the customer is an active member).