## Logistic Regression Approach

1. Convert the business problem into a data science problem.
2. Load data.
3. Understand the data.
4. Data preprocessing.
5. Exploratory data analysis (EDA).
6. Model building.
7. Model diagnostics.
8. Predictions and evaluation.
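
The modelling steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the bank dataset: the two made-up features, the 80/20 split, and the random seeds are all assumptions chosen only to make the example runnable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (NOT the churn dataset): 2 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
noise = rng.normal(scale=0.5, size=500)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

# Split first, then fit the scaler on the training part only,
# so no information from the test set leaks into preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)

# Model building
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Predictions and evaluation
y_pred = model.predict(scaler.transform(X_test))
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")
```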

### 1. Problem Statement

Build a classifier that predicts whether a bank customer has churned (left the bank), based on features such as credit score, balance, tenure, and gender.

In [1]:
# import necessary modules
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 800)
pd.set_option('display.max_columns', 500)

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

### 2. Load Data

In [2]:
churn_data = pd.read_csv('data.csv', index_col='RowNumber')
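
As a small aside on what `index_col='RowNumber'` does, the sketch below reads a three-row excerpt from an in-memory string instead of `data.csv`. The excerpt reuses rows displayed elsewhere in this notebook; the `csv_text` and `sample` names are hypothetical.

```python
import io
import pandas as pd

# Three-row excerpt in CSV form (stand-in for data.csv)
csv_text = """RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
"""

# index_col moves RowNumber out of the columns and into the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col='RowNumber')
print(sample.index.name)
print(sample.shape)
```

With `RowNumber` as the index, it is not treated as a feature column in later steps.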

### 3. Understanding the Data

In [3]:
churn_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 13 columns):
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
churn_data.describe()

| | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 15690940.0 | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 71936.19 | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |
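
The summary reports a mean of 0.2037 for `Exited` over 10,000 rows. For a 0/1 column the mean is the fraction of ones, so about 20% of customers churned and the classes are imbalanced. The sketch below rebuilds a stand-in `Exited` column from those two summary numbers only (it does not read `data.csv`) to show the class shares and the accuracy a trivial "nobody churns" predictor would reach.

```python
import pandas as pd

# Counts implied by the summary statistics: 10,000 rows, mean(Exited) = 0.2037
n_rows = 10_000
n_churned = round(0.2037 * n_rows)

# Stand-in column with the same class balance (row order is arbitrary)
exited = pd.Series([1] * n_churned + [0] * (n_rows - n_churned), name="Exited")

# Share of each class; the larger share is the majority-class baseline accuracy
class_share = exited.value_counts(normalize=True)
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any model evaluated on accuracy alone should be compared against this baseline, which is one reason to also look at the confusion matrix and classification report.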


In [5]:
churn_data.head()

| RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [6]:
num_cols = churn_data.select_dtypes(include=np.number).columns
cat_cols = churn_data.select_dtypes(exclude=np.number).columns

print(f"Numerical columns: {num_cols.values}\n")
print(f"Categorical columns: {cat_cols.values}\n")

Numerical columns: ['CustomerId' 'CreditScore' 'Age' 'Tenure' 'Balance' 'NumOfProducts'
 'HasCrCard' 'IsActiveMember' 'EstimatedSalary' 'Exited']

Categorical columns: ['Surname' 'Geography' 'Gender']
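
Not every column listed above is necessarily a useful model input. The sketch below separates candidate features from the target and from identifier-like columns. Treating `CustomerId` and `Surname` as identifiers with no predictive value is an assumption made here for illustration, and the `target`, `id_cols`, `num_features`, and `cat_features` names are hypothetical.

```python
import pandas as pd

# Column lists as printed by the cell above
num_cols = pd.Index(['CustomerId', 'CreditScore', 'Age', 'Tenure', 'Balance',
                     'NumOfProducts', 'HasCrCard', 'IsActiveMember',
                     'EstimatedSalary', 'Exited'])
cat_cols = pd.Index(['Surname', 'Geography', 'Gender'])

target = 'Exited'                     # churn label from the problem statement
id_cols = ['CustomerId', 'Surname']   # assumption: identifiers, not features

# Remove the target and identifier columns from each list
num_features = num_cols.drop([target, 'CustomerId'])
cat_features = cat_cols.drop('Surname')

print(f"Numerical features: {list(num_features)}")
print(f"Categorical features: {list(cat_features)}")
```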



### 4. Data Preprocessing

In [7]:
# One-hot encode the `Geography` column, since logistic regression requires numerical inputs
# (dtype=int keeps the dummy columns as 0/1 rather than True/False on newer pandas versions)
churn_data_new = pd.get_dummies(prefix='Geo', data=churn_data, columns=['Geography'], dtype=int)
churn_data_new.head()

| RowNumber | CustomerId | Surname | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geo_France | Geo_Germany | Geo_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15634602 | Hargrave | 619 | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 0 | 0 |
| 2 | 15647311 | Hill | 608 | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 0 | 1 |
| 3 | 15619304 | Onio | 502 | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 0 | 0 |
| 4 | 15701354 | Boni | 699 | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 1 | 0 | 0 |
| 5 | 15737888 | Mitchell | 850 | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 0 | 1 |
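
One property of full one-hot encoding is worth noting: the three `Geo_*` columns always sum to 1 in each row, so any one of them is determined by the other two. The sketch below, on a toy frame rather than the real data, compares the full encoding with the `drop_first=True` variant of `pd.get_dummies`. Whether to drop a level is a modelling choice, not something this notebook prescribes.

```python
import pandas as pd

# Toy frame with the three countries seen in the dummy column names
toy = pd.DataFrame({'Geography': ['France', 'Spain', 'France', 'Germany', 'Spain']})

# Full encoding: one column per country
full = pd.get_dummies(toy, columns=['Geography'], prefix='Geo', dtype=int)

# Reduced encoding: the first level (alphabetically, France) becomes the baseline
reduced = pd.get_dummies(toy, columns=['Geography'], prefix='Geo',
                         drop_first=True, dtype=int)

# Each row of the full encoding has exactly one 1
rows_sum_to_one = bool((full.sum(axis=1) == 1).all())

print(list(full.columns))
print(list(reduced.columns))
print(rows_sum_to_one)
```

In the reduced encoding, a row with zeros in both remaining columns represents the baseline country.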


In [14]:
# Encode the binary `Gender` column as integers: 1 = Male, 0 = Female
churn_data_new['Gender'] = churn_data_new['Gender'].map({'Male':1, 'Female':0})
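
One caveat with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN` without raising an error. The sketch below shows this on a toy series (the lowercase `'female'` and the missing value are invented for illustration) and counts the unmapped entries, a check that could be run after the encoding step.

```python
import pandas as pd

# Toy series with one misspelled label and one missing value (illustrative only)
gender = pd.Series(['Female', 'Male', 'female', None])

mapped = gender.map({'Male': 1, 'Female': 0})

# Values absent from the mapping come back as NaN
n_unmapped = int(mapped.isna().sum())
print(mapped)
print(f"Unmapped values: {n_unmapped}")
```

If the count is zero on the real column, every row was one of the two expected labels.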

In [19]:
churn_data_new.head()

| RowNumber | CustomerId | Surname | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geo_France | Geo_Germany | Geo_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15634602 | Hargrave | 619 | 0 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 1 | 0 | 0 |
| 2 | 15647311 | Hill | 608 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 0 | 1 |
| 3 | 15619304 | Onio | 502 | 0 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 1 | 0 | 0 |
| 4 | 15701354 | Boni | 699 | 0 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 1 | 0 | 0 |
| 5 | 15737888 | Mitchell | 850 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 0 | 1 |


### 5. Exploratory Data Analysis (EDA)