### Data Preprocessing, Analysis, and Visualization for building a Machine learning model

In [10]:
# Importing Libraries and Dataset

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler 
  
dataset = pd.read_csv('Churn_Modelling.csv')

In [6]:
dataset.head() #first five rows from dataset

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [4]:
dataset.info()  #information about the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


#### Data Preprocessing

In [12]:
#Finding Missing Values and Handling them

dataset.isnull().any() 


RowNumber          False
CustomerId         False
Surname            False
CreditScore        False
Geography          False
Gender             False
Age                False
Tenure             False
Balance            False
NumOfProducts      False
HasCrCard          False
IsActiveMember     False
EstimatedSalary    False
Exited             False
dtype: bool

For example,
We observe that there are 3 columns containing null values. The 3 columns are Geography, Gender, and Age. Now we need to remove the null values, to do this there are 3 ways they are:

Deleting rows
Replacing null with custom values
Replacing using Mean, Median, and Mode
In this scenario, we replace null values with Mean and Mode.

In [13]:
dataset["Geography"].fillna(dataset["Geography"].mode()[0],inplace = True) 
dataset["Gender"].fillna(dataset["Gender"].mode()[0],inplace = True) 
dataset["Age"].fillna(dataset["Age"].mean(),inplace = True)

#### Label Encoding
Label Encoding is used to convert textual data to integer data. As we know there are two textual data type columns which are “Geography” and “Gender”. 

In [14]:
le = LabelEncoder() 
dataset['Geography'] = le.fit_transform(dataset["Geography"]) 
dataset['Gender'] = le.fit_transform(dataset["Gender"]) 

#### Splitting Dependent and Independent Variables


In [19]:
x = dataset.iloc[:,3:13].values 
y = dataset.iloc[:,13:14].values

# Splitting into Train and Test Dataset 
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 0)


#### Feature Scaling
Feature Scaling is a technique done to normalize the independent variables.

In [28]:
sc = StandardScaler() 
x_train = sc.fit_transform(x_train) 
x_test = sc.fit_transform(x_test)

#### Model Training and Evaluation

In [27]:
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.svm import SVC 
from sklearn.linear_model import LogisticRegression 
  
from sklearn import metrics 

# Reshape y_train to be a 1D array if necessary
y_train = y_train.ravel()

# Initialize the classifiers
knn = KNeighborsClassifier(n_neighbors=3) 
rfc = RandomForestClassifier(n_estimators = 7, criterion = 'entropy',random_state =7) 
svc = SVC() 
lc = LogisticRegression() 

# List of classifiers
classifiers = [rfc, knn, svc, lc]

# making predictions on the training set 
for clf in (rfc, knn, svc,lc): 
    clf.fit(x_train, y_train) 
    y_pred = clf.predict(x_test) 
    print("Accuracy score of ",clf.__class__.__name__,"=", 100*metrics.accuracy_score(y_test, y_pred))

Accuracy score of  RandomForestClassifier = 84.5
Accuracy score of  KNeighborsClassifier = 82.45
Accuracy score of  SVC = 86.15
Accuracy score of  LogisticRegression = 80.75
