# Predicting Customer Churn with Machine Learning

In this project, we want to predict which customers might leave a company. By knowing this, businesses can take steps to keep their customers. We will try different machine learning models like Random Forest, Logistic Regression, and Decision Tree to find the best one.

## Initialization

In [1]:
# Loading all the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.utils import resample

## Load data

In [2]:
# Load the data file into a DataFrame (fall back to the datasets/ folder)
try:
    df = pd.read_csv('Churn.csv')
except FileNotFoundError:
    df = pd.read_csv('datasets/Churn.csv')

df.head()


| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


## EDA

In [3]:
df.describe()

| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 9091.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 4.99769 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.894723 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


In [4]:
df.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

We check the dataset for missing values. The Tenure column has 909 nulls, and every other column is complete. We fill the missing Tenure values with the column median.

In [5]:
df['Tenure']= df['Tenure'].fillna(df['Tenure'].median())
df.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
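One caveat with the median fill above: the median is computed on the full dataset, before the train/validation/test split. A minimal sketch of an alternative, assuming a toy DataFrame (the values and the use of `SimpleImputer` are illustrative and not part of the original notebook), learns the median from the training rows only and reuses it elsewhere:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the Tenure column (hypothetical values)
toy_train = pd.DataFrame({'Tenure': [1.0, 2.0, np.nan, 9.0]})
toy_valid = pd.DataFrame({'Tenure': [np.nan, 4.0]})

# Learn the median from the training rows only
imputer = SimpleImputer(strategy='median')
train_filled = imputer.fit_transform(toy_train)

# Reuse the training median for the validation rows
valid_filled = imputer.transform(toy_valid)

print(imputer.statistics_)
print(valid_filled.ravel())
```

With 909 of 10,000 values missing, the difference between the two approaches is likely small here, but fitting on the training rows keeps the validation and test sets untouched by preprocessing.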

In [6]:
# Drop the first three columns (RowNumber, CustomerId, Surname); they are identifiers, not features
df = df.iloc[:, 3:]
df.head()

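| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |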


We remove the first three columns (RowNumber, CustomerId, and Surname) because they only identify a customer and carry no predictive information.
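Dropping by position works only as long as the column order never changes. A small sketch of the same step done by column name, using a made-up toy frame (the values are hypothetical; only the column names come from the dataset):

```python
import pandas as pd

# Toy frame with the same identifier columns as the dataset (values are made up)
toy = pd.DataFrame({
    'RowNumber': [1, 2],
    'CustomerId': [101, 102],
    'Surname': ['A', 'B'],
    'CreditScore': [619, 608],
    'Exited': [1, 0],
})

# Dropping by name does not depend on where the columns sit
toy_features = toy.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
print(toy_features.columns.tolist())
```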

## Preprocessing the data

We are preparing the data for further analysis.

### Encode Categorical Variables

In [9]:

# We are encoding the categorical variables 'Geography' and 'Gender' using one-hot encoding.
# This process converts categorical data into a format that can be provided to machine learning algorithms.
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['Geography', 'Gender']]).toarray()

# Create DataFrame for the encoded data with appropriate column names
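# Note: get_feature_names was removed in scikit-learn 1.2; get_feature_names_out is its replacement
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Geography', 'Gender']))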
df = df.drop(['Geography', 'Gender'], axis=1)

# Concatenate the original DataFrame and the encoded DataFrame
df = pd.concat([df, encoded_df], axis=1)
df.head()


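| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 608 | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 502 | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 699 | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 850 | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |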


### Scale Numerical Features

In [10]:
# Scaling is deferred until after the train/validation/test split, so this
# earlier attempt on the full DataFrame is left commented out.
#scaler= StandardScaler()
#numeric_features = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
#df[numeric_features]= scaler.fit_transform(df[numeric_features])
#df.head()

### Split the Data and Deal with Class Imbalance

In [11]:
target = df['Exited']
features = df.drop('Exited', axis=1)

x_train, x_temp, y_train, y_temp = train_test_split(features, target, test_size=0.4, random_state=12345, stratify=target)
x_valid, x_test, y_valid, y_test = train_test_split(x_temp, y_temp, test_size=0.5, random_state=12345, stratify=y_temp)
print(x_train.shape)
print(y_train.shape)
print(x_valid.shape)
print(y_valid.shape)
print(x_test.shape)
print(y_test.shape)


(6000, 13)
(6000,)
(2000, 13)
(2000,)
(2000, 13)
(2000,)
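The two-step split above yields a 60/20/20 partition, and `stratify` is meant to keep the share of churned customers roughly equal in each part. A quick sketch of how that could be checked, using synthetic labels with about the same 20% positive rate (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 20% positives, similar to the Exited column
toy_y = np.array([0] * 800 + [1] * 200)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

# Same two-step 60/20/20 split as in the notebook
tX_train, tX_temp, ty_train, ty_temp = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=12345, stratify=toy_y)
tX_valid, tX_test, ty_valid, ty_test = train_test_split(
    tX_temp, ty_temp, test_size=0.5, random_state=12345, stratify=ty_temp)

# The positive rate should be close to 0.2 in every part
for name, part in [('train', ty_train), ('valid', ty_valid), ('test', ty_test)]:
    print(name, len(part), round(part.mean(), 3))
```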


In [12]:
# Combine x_train and y_train for upsampling
train_data = pd.concat([x_train, y_train], axis=1)

majority = train_data[train_data['Exited'] == 0]
minority = train_data[train_data['Exited'] == 1]

minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=123)
upsampled = pd.concat([majority, minority_upsampled])

x_train_up = upsampled.drop('Exited', axis=1)
y_train_up = upsampled['Exited']
print(x_train_up.shape)
print(y_train_up.shape)


(9556, 13)
(9556,)
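The upsampling step above can be wrapped in a small reusable helper. This is a sketch only: the function name `upsample_minority` is hypothetical, and it assumes a binary target whose two classes have different sizes.

```python
import pandas as pd
from sklearn.utils import resample

def upsample_minority(data, target_col, random_state=123):
    """Resample the minority class with replacement until both classes are equal in size.

    Assumes a binary target with unequal class counts.
    """
    counts = data[target_col].value_counts()
    majority = data[data[target_col] == counts.idxmax()]
    minority = data[data[target_col] == counts.idxmin()]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=random_state)
    # Shuffle so the classes are not stored in two contiguous blocks
    return pd.concat([majority, minority_up]).sample(frac=1, random_state=random_state)

# Toy data: 8 negatives and 2 positives
toy = pd.DataFrame({'x': range(10), 'Exited': [0] * 8 + [1] * 2})
balanced = upsample_minority(toy, 'Exited')
print(balanced['Exited'].value_counts().to_dict())
```

Only the training set should be passed through such a helper; the validation and test sets keep their original class ratio so that the scores reflect real conditions.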


In [29]:
inner= pd.merge(x_train_up, x_test, how='inner')
inner

| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [33]:
# downsampling
majority_downsampled = resample(majority, replace=False, n_samples=len(minority), random_state=123)

downsampled = pd.concat([majority_downsampled, minority])

x_train_down = downsampled.drop('Exited', axis=1)
y_train_down = downsampled['Exited']

print(x_train_down.shape)
print(y_train_down.shape)

(2444, 13)
(2444,)


In [31]:
inner= pd.merge(x_train_down, x_test, how='inner')
inner

| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_France | Geography_Germany | Geography_Spain | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
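Both inner merges return an empty frame, which suggests that no feature row of the resampled training sets also appears in the test set. For contrast, here is a toy sketch of what the same check reports when an overlap does exist (the frames are made up for illustration):

```python
import pandas as pd

# Toy train/test frames that share exactly one row: (3, 30)
toy_train = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [10, 20, 20, 30]})
toy_test = pd.DataFrame({'a': [3, 4], 'b': [30, 40]})

# With no `on` argument, merge joins on all shared columns;
# dropping duplicates first counts each distinct shared row once
overlap = pd.merge(toy_train.drop_duplicates(), toy_test.drop_duplicates(), how='inner')
print(len(overlap))
print(overlap)
```

Note that this check compares feature values exactly, so it should run before scaling, as it does in the notebook.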


### Scale Numerical Features

In [35]:
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

scaler = StandardScaler()

# Fit the scaler on the training data
scaler.fit(x_train[numeric])

# Transform the training, validation, and test data with the same scaler
x_train[numeric] = scaler.transform(x_train[numeric])
x_valid[numeric] = scaler.transform(x_valid[numeric])
x_test[numeric] = scaler.transform(x_test[numeric])

In [36]:
# Refit the scaler on the upsampled training data.
# Note: x_valid and x_test keep the scaling fitted on x_train above.
scaler.fit(x_train_up[numeric])

# Transform the upsampled training data
x_train_up[numeric] = scaler.transform(x_train_up[numeric])

In [37]:
# Refit the scaler on the downsampled training data.
# Note: x_valid and x_test keep the scaling fitted on x_train above.
scaler.fit(x_train_down[numeric])

# Transform the downsampled training data
x_train_down[numeric] = scaler.transform(x_train_down[numeric])
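In the cells above the scaler is refitted on each resampled training set, while `x_valid` and `x_test` keep the scaling learned from `x_train`. The means and standard deviations therefore differ slightly between what a model is trained on and what it is evaluated on. One way to avoid this is to bundle the scaler with the model in a `Pipeline`, so the scaling learned during `fit` is reused automatically at prediction time. A minimal sketch with synthetic data (the arrays and the choice of logistic regression are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features standing in for the real data
rng = np.random.default_rng(0)
syn_X_train = rng.normal(loc=50, scale=10, size=(200, 3))
syn_y_train = (syn_X_train[:, 0] > 50).astype(int)
syn_X_valid = rng.normal(loc=50, scale=10, size=(50, 3))

# The pipeline fits the scaler and the model on the same training data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(syn_X_train, syn_y_train)

# At prediction time the raw validation data is scaled with the training statistics
syn_proba = pipe.predict_proba(syn_X_valid)[:, 1]
print(syn_proba.shape)
```

With this layout the raw (unscaled) splits can be passed around directly, and a resampled training set simply produces a pipeline with its own matching scaler.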

## Model

In [41]:
# Helper that prints accuracy, F1, and AUC-ROC for a fitted model (validation set by default)
def check_model(model, x_valid= x_valid, y_valid= y_valid ):
    predicted= model.predict(x_valid)
    prop= model.predict_proba(x_valid)[:, 1]
    print('Accuracy: ', model.score(x_valid, y_valid))
    print('F1 score: ', f1_score(y_valid, predicted))
    print('AUC-ROC score: ', roc_auc_score(y_valid, prop))

We define a function to evaluate the performance of different models using accuracy, F1 score, and AUC-ROC score.
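Because F1 is the metric used to pick the final model, it helps to see how it is built from precision and recall. The sketch below uses a tiny hand-made set of labels (not taken from the dataset) and checks the manual formula against `f1_score`:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hand-made example: 4 churners, 6 non-churners
demo_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
demo_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred).ravel()

precision = tp / (tp + fp)   # share of predicted churners that really churned
recall = tp / (tp + fn)      # share of real churners that were found
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual)
print(f1_score(demo_true, demo_pred))
```

Accuracy in this example is 0.7 even though half of the churners are missed, which is why accuracy alone can look flattering on imbalanced data.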

### Random Forest Classifier

In [39]:
# Normal data
random_forest1= RandomForestClassifier(random_state=15).fit(x_train, y_train)
check_model(random_forest1)

Accuracy:  0.865
F1 score:  0.5945945945945946
AUC-ROC score:  0.8616743336781949


In [40]:
# upsampled data 
random_forest1= RandomForestClassifier(random_state=15).fit(x_train_up, y_train_up)
check_model(random_forest1)

Accuracy:  0.8535
F1 score:  0.6267515923566879
AUC-ROC score:  0.8574782614050646


In [42]:
# Downsampled data 
random_forest1= RandomForestClassifier(random_state=15).fit(x_train_down, y_train_down)
check_model(random_forest1 )

Accuracy:  0.765
F1 score:  0.5833333333333334
AUC-ROC score:  0.8617882611587349


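On the validation set, the default Random Forest reaches its best F1 score with upsampled data (0.63, versus 0.59 on the original data and 0.58 with downsampling), although accuracy is highest on the original data.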

In [43]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [ 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


# Perform grid search
grid_search = GridSearchCV(estimator=random_forest1, param_grid=param_grid, cv=5, scoring='f1', n_jobs=-1, verbose=2)
grid_search.fit(x_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best Parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
Best Score: 0.5691190275296278
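The grid search above reports only the single best combination out of 81 candidates. Looking at the full `cv_results_` table shows how close the runners-up are and how much the score varies across folds. A small sketch on synthetic data with a deliberately tiny grid (the dataset and grid values are assumptions chosen to keep the example fast):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 80/20), standing in for the churn features
syn_X, syn_y = make_classification(n_samples=300, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)

small_grid = {'max_depth': [3, 6], 'min_samples_leaf': [1, 4]}
demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=15),
    param_grid=small_grid, cv=3, scoring='f1')
demo_search.fit(syn_X, syn_y)

# One row per candidate, sorted from best to worst mean F1
cv_table = pd.DataFrame(demo_search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(cv_table.sort_values('rank_test_score'))
```

If several candidates fall within one standard deviation of the best score, the simpler or faster one may be a reasonable choice.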


In [44]:
# Tuned model trained on the original (unbalanced) data
random_forest2= RandomForestClassifier(random_state=15, max_depth= 20, min_samples_split=10, min_samples_leaf=1, n_estimators=200).fit(x_train, y_train)
check_model(random_forest2 )

Accuracy:  0.864
F1 score:  0.5878787878787879
AUC-ROC score:  0.867916481919401


In [45]:
# Tuned model trained on the upsampled data
random_forest2= RandomForestClassifier(random_state=15, max_depth= 20, min_samples_split=10, min_samples_leaf=1, n_estimators=200).fit(x_train_up, y_train_up)
check_model(random_forest2 )

Accuracy:  0.849
F1 score:  0.6378896882494005
AUC-ROC score:  0.8636857695339442


In [46]:
# Tuned model trained on the downsampled data
random_forest2= RandomForestClassifier(random_state=15, max_depth= 20, min_samples_split=10, min_samples_leaf=1, n_estimators=200).fit(x_train_down, y_train_down)
check_model(random_forest2 )

Accuracy:  0.7615
F1 score:  0.5841325196163906
AUC-ROC score:  0.8649913168785102


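The tuned model trained on upsampled data gives the highest validation F1 score so far (0.64), slightly above the default model on upsampled data (0.63). With downsampled data, the tuned model (0.58) still trails both.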

### Logistic Regression

In [47]:
logistic_regression = LogisticRegression(random_state=15).fit(x_train, y_train)
check_model(logistic_regression )


Accuracy:  0.8115
F1 score:  0.31078610603290674
AUC-ROC score:  0.787463666371071


In [48]:
logistic_regression = LogisticRegression(random_state=15).fit(x_train_up, y_train_up)
check_model(logistic_regression )

Accuracy:  0.6735
F1 score:  0.501906941266209
AUC-ROC score:  0.7916759040299537


In [49]:
logistic_regression = LogisticRegression(random_state=15).fit(x_train_down, y_train_down)
check_model(logistic_regression)

Accuracy:  0.6695
F1 score:  0.4996214988644966
AUC-ROC score:  0.7918976007488423


### Decision Tree Classifier

In [50]:
decision_tree= DecisionTreeClassifier(random_state=15).fit(x_train, y_train)
check_model(decision_tree )

Accuracy:  0.795
F1 score:  0.5095693779904307
AUC-ROC score:  0.693504286136565


In [51]:
# Decision Tree with unbalanced data

#for i in range(1, 8):
#    test_model= DecisionTreeClassifier(random_state=15, max_depth=i).fit(x_train, y_train)
#    print(f'The scores when max depth is {i}')
#    check_model(test_model )


param_grid = {'max_depth': range(1, 8)}
dt = DecisionTreeClassifier(random_state=15)
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='f1')
grid_search.fit(x_train, y_train)

# Print the best parameters and best score
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best F1 score: {grid_search.best_score_}')

# Evaluate the best model on the validation set
best_model = grid_search.best_estimator_
check_model(best_model, x_valid, y_valid)


Best parameters: {'max_depth': 7}
Best F1 score: 0.5524627895428031
Accuracy:  0.8605
F1 score:  0.5985611510791367
AUC-ROC score:  0.8303311902650508


The grid search selects a max depth of 7, which gives a validation F1 score of 0.60.

In [52]:
# Decision Tree with upsampled data
#for i in range(1, 8):
#    test_model= DecisionTreeClassifier(random_state=15, max_depth=i).fit(x_train_up, y_train_up)
#    print(f'The scores when max depth is {i}')
#    check_model(test_model )

grid_search.fit(x_train_up, y_train_up)

# Print the best parameters and best score
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best F1 score: {grid_search.best_score_}')

# Evaluate the best model on the validation set
best_model = grid_search.best_estimator_
check_model(best_model, x_valid, y_valid)

Best parameters: {'max_depth': 7}
Best F1 score: 0.7887637413284656
Accuracy:  0.755
F1 score:  0.5601436265709157
AUC-ROC score:  0.839266029904424


In [53]:
# Decision Tree with downsampled data
#for i in range(1, 8):
#    test_model= DecisionTreeClassifier(random_state=15, max_depth=i).fit(x_train_down, y_train_down)
#    print(f'The scores when max depth is {i}')
#    check_model(test_model)

grid_search.fit(x_train_down, y_train_down)

# Print the best parameters and best score
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best F1 score: {grid_search.best_score_}')

# Evaluate the best model on the validation set
best_model = grid_search.best_estimator_
check_model(best_model, x_valid, y_valid)

Best parameters: {'max_depth': 6}
Best F1 score: 0.7597803828132926
Accuracy:  0.785
F1 score:  0.601113172541744
AUC-ROC score:  0.8474295497093308


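Resampling changes the tuned decision tree's validation F1 score only slightly: 0.60 on the original data, 0.56 with upsampling, and 0.60 with downsampling. Accuracy, however, drops from 0.86 on the original data to 0.76 and 0.79 respectively, so resampling does not clearly help this model.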

## Best Model

In [54]:
# Best model (tuned Random Forest, upsampled training data), evaluated on the test set
best_model= RandomForestClassifier(random_state=15, max_depth= 20, min_samples_split=10, min_samples_leaf=1, n_estimators=200).fit(x_train_up, y_train_up)
check_model(best_model, x_test, y_test)

Accuracy:  0.834
F1 score:  0.5931372549019608
AUC-ROC score:  0.848120848120848


The tuned Random Forest classifier trained on upsampled data had the highest validation F1 score (0.64) of the models we tried. On the held-out test set it reaches an F1 score of 0.59, an accuracy of 0.83, and an AUC-ROC of 0.85. Upsampling raised the validation F1 score from 0.59 (original data) to 0.64, which is why we chose this model for our task.

### DummyClassifier Performance

We are evaluating the performance of a Dummy classifier as a baseline for comparison with other models.

In [55]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(x_train, y_train)

# Evaluate the model
check_model(dummy_clf)

Accuracy:  0.796
F1 score:  0.0
AUC-ROC score:  0.5
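The baseline numbers follow directly from the class balance. An accuracy of 0.796 on 2,000 validation rows means 1,592 of them belong to class 0, and a model that always predicts class 0 never finds a churner. The sketch below rebuilds those counts by hand (the label array is reconstructed from the reported accuracy, not read from the data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# 1592 non-churners and 408 churners, as implied by accuracy 0.796 on 2000 rows
base_true = np.array([0] * 1592 + [1] * 408)

# A most-frequent baseline predicts class 0 for everyone
base_pred = np.zeros_like(base_true)
base_scores = np.zeros(len(base_true))  # constant probability for class 1

base_acc = accuracy_score(base_true, base_pred)
base_f1 = f1_score(base_true, base_pred, zero_division=0)  # no true positives
base_auc = roc_auc_score(base_true, base_scores)           # constant scores cannot rank

print(base_acc, base_f1, base_auc)
```

Any useful model therefore has to beat 0.796 accuracy while keeping a non-zero F1 score, which all three model families above manage on at least one of the training sets.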


## Conclusion

After testing several models, the Random Forest classifier trained on upsampled data was the best at predicting customer churn. It had the highest F1 score on the validation set (0.64) and scored 0.59 on the test set, compared with 0.0 for the most-frequent-class baseline. F1 balances finding customers who might leave (recall) against raising too many false alarms (precision). The model can help businesses identify at-risk customers and take action to keep them.