**CUSTOMER CHURN PREDICTION**


by Srijita Ghosh Hajra

Overview:
In this notebook, I try to predict bank customer churn using the Kaggle dataset downloaded with the command below.

kaggle datasets download -d shantanudhakadd/bank-customer-churn-prediction

Importing Libraries

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [11]:
df = pd.read_csv('Churn_Modelling.csv')
df

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [12]:
df.shape

(10000, 14)

In [13]:
# Dataset information: column names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


Exploratory Data Analysis

In [14]:
df.isnull().values.any()

False

In [15]:
# Split the data into churned (Exited == 1) and retained (Exited == 0) customers

churn = df[df['Exited']==1]

not_churn = df[df['Exited']==0]
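
Since the data is split by the `Exited` flag here, it may help to also look at how many rows fall into each class. The sketch below is only illustrative: `sample` is a hypothetical five-row frame built from the rows displayed earlier, not the full dataset, so the printed shares say nothing about the real class balance.

```python
import pandas as pd

# Hypothetical mini-frame copied from the first five displayed rows;
# in the notebook this would be the full df read from Churn_Modelling.csv.
sample = pd.DataFrame({
    "CustomerId": [15634602, 15647311, 15619304, 15701354, 15737888],
    "Exited": [1, 0, 1, 0, 0],
})

# Absolute counts and relative shares of each class in the target column.
class_counts = sample["Exited"].value_counts()
class_share = sample["Exited"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

On the real data, the same two calls on `df['Exited']` would show whether churners are a minority, which matters when reading the accuracy scores later.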

In [16]:
X1 = df.drop(['Exited',"Surname"], axis=1)
y = df['Exited']
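
The feature matrix above still contains `RowNumber` and `CustomerId`, which look like row identifiers rather than customer attributes. A possible variant, shown here only as a sketch on a hypothetical three-row frame, drops them together with `Surname`. The names `id_columns` and `X1_no_ids` are my own, and using this variant would change the accuracy figures reported later in the notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for df, using values from the displayed rows.
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Exited": [1, 0, 1],
})

# Columns assumed to be identifiers with no predictive meaning.
id_columns = ["RowNumber", "CustomerId", "Surname"]

# Drop the identifiers and the target to get the feature matrix.
X1_no_ids = sample.drop(columns=id_columns + ["Exited"])
y_sample = sample["Exited"]

print(X1_no_ids.columns.tolist())
```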

In [17]:
X = pd.get_dummies(X1,columns=['Gender'])
X.head(5)

,RowNumber,CustomerId,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Gender_Female,Gender_Male
0,1,15634602,619,France,42,2,0.0,1,1,1,101348.88,1,0
1,2,15647311,608,Spain,41,1,83807.86,1,0,1,112542.58,1,0
2,3,15619304,502,France,42,8,159660.8,3,1,0,113931.57,1,0
3,4,15701354,699,France,39,1,0.0,2,0,0,93826.63,1,0
4,5,15737888,850,Spain,43,2,125510.82,1,1,1,79084.1,1,0


In [18]:
# Encode the categorical Geography column as integer labels with LabelEncoder
label_encoder = LabelEncoder()
X['Geography'] = label_encoder.fit_transform(X['Geography'])
X


,RowNumber,CustomerId,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Gender_Female,Gender_Male
0,1,15634602,619,0,42,2,0.00,1,1,1,101348.88,1,0
1,2,15647311,608,2,41,1,83807.86,1,0,1,112542.58,1,0
2,3,15619304,502,0,42,8,159660.80,3,1,0,113931.57,1,0
3,4,15701354,699,0,39,1,0.00,2,0,0,93826.63,1,0
4,5,15737888,850,2,43,2,125510.82,1,1,1,79084.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,771,0,39,5,0.00,2,1,0,96270.64,0,1
9996,9997,15569892,516,0,35,10,57369.61,1,1,1,101699.77,0,1
9997,9998,15584532,709,0,36,7,0.00,1,0,1,42085.58,1,0
9998,9999,15682355,772,1,42,3,75075.31,2,1,0,92888.52,0,1
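
`LabelEncoder` maps the countries to 0, 1, and 2 in alphabetical order (France, Germany, Spain), which gives them a numeric ordering that has no real meaning. An alternative I could have used is one-hot encoding, in the same way as for `Gender`. The sketch below shows this on a hypothetical four-row column; `geo` and `geo_onehot` are illustrative names.

```python
import pandas as pd

# Hypothetical four-row stand-in for the Geography column.
geo = pd.DataFrame({"Geography": ["France", "Spain", "France", "Germany"]})

# One indicator column per country instead of a single integer code.
geo_onehot = pd.get_dummies(geo, columns=["Geography"], dtype=int)

print(geo_onehot)
```

This mostly matters for linear models such as logistic regression, which treat the integer codes as magnitudes; tree-based models are usually less sensitive to it.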


In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
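
The split above is purely random. If the two classes are unbalanced, passing `stratify=y` keeps the churn rate the same in the training and test sets. This is a minimal sketch on synthetic data (`X_demo` and `y_demo` are made up, with 20% positives chosen only for illustration), not a rerun of the notebook's split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with 20 positives and 80 negatives.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```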

Random Forest

In [20]:
# Model Creation and Training
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions and Evaluation
y_train_pred = rf_model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)

y_test_pred = rf_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Train Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")


Train Accuracy: 1.0
Test Accuracy: 0.8645
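
Accuracy alone does not show how well the churners themselves are identified. A confusion matrix together with precision and recall gives a fuller picture. The sketch below uses hand-made labels (`y_true_demo` and `y_pred_demo` are hypothetical, not model output) just to show the calls.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6 retained customers (0) and 4 churners (1).
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Precision and recall for the positive (churn) class.
precision = precision_score(y_true_demo, y_pred_demo)
recall = recall_score(y_true_demo, y_pred_demo)

print(cm)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
```

In this notebook the same functions could be called with `y_test` and `y_test_pred` in place of the demo lists.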


Logistic Regression

In [21]:
from sklearn.linear_model import LogisticRegression

# Model Creation and Training
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)

# Predictions and Evaluation
y_train_pred = logreg_model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)

y_test_pred = logreg_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Train Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")


Train Accuracy: 0.7945
Test Accuracy: 0.8035
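
The logistic regression above is fitted on unscaled features, and columns such as `Balance` and `Age` have very different ranges, which can make the solver converge slowly. One option is to standardize the features inside a pipeline. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the rule `age > 50` are invented for the example), so it shows the mechanics only and says nothing about how the score on this dataset would change.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in with two features on very different scales.
balance = rng.uniform(0, 200_000, size=200)
age = rng.uniform(18, 90, size=200)
X_demo = np.column_stack([balance, age])

# Made-up target that depends only on age, for illustration.
y_demo = (age > 50).astype(int)

# The scaler is fitted on the training data only, then the model is trained.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scaled_logreg.fit(X_demo, y_demo)

demo_accuracy = scaled_logreg.score(X_demo, y_demo)
print(f"Demo accuracy: {demo_accuracy}")
```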


Conclusion:

With logistic regression, the training and test accuracies are 0.7945 and 0.8035, respectively.
With random forest, the training and test accuracies are 1.0 and 0.8645, respectively.
So random forest gives the higher test accuracy for predicting customer churn, although its perfect training accuracy suggests that it overfits the training data.
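
To put the accuracy figures above in context, they can be compared with a baseline that always predicts the most frequent class. The sketch below uses `DummyClassifier` on a made-up target (`y_demo`, with 8 of 10 rows in class 0, is hypothetical), so the printed value is not the baseline for this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: features are ignored by the dummy model.
X_demo = np.zeros((10, 1))
y_demo = np.array([0] * 8 + [1] * 2)

# Always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print(f"Baseline accuracy: {baseline_accuracy}")
```

In the notebook, fitting the baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test` would show how much of each model's accuracy comes from the class balance alone.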