<img src="https://res.cloudinary.com/dn1j6dpd7/image/fetch/f_auto,q_auto,w_736/https://www.livechat.com/wp-content/uploads/2016/04/customer-churn@2x.jpg">

# Our methodology


## Data visualization 

    Loading the data
    Take a quick look at our data 
    Understanding our data
    Finding the correlations
    
## Data preparation

    Outliers detection
    Skewness correction
    
## Data splitting


## Pipeline

    Encoding 
    Feature scaling
    
## Modeling

    Building the model
    Evaluation with cross-validation
    
## Fine-tuning 

    Finding the best hyperparameters

    
## Testing our model

    Evaluate the model with the test set
    

### Importing needed libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from scipy.stats import norm
from scipy import stats
import seaborn as sns
warnings.filterwarnings('ignore')
%matplotlib inline
plt.style.use('fivethirtyeight')

## Data visualization 

### Loading the data

In [None]:
data = pd.read_csv('/kaggle/input/churn-modelling/Churn_Modelling.csv') 

### Take a quick look at our data

In [None]:
data.head()

In [None]:
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
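# Variance is only defined for numeric columns (Geography and Gender are text)
data.select_dtypes(include='number').var()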

### Understanding our data

In [None]:
data.hist(bins=50, figsize=(30,25)) 
plt.show()

In [None]:
sns.countplot(x="Exited", data=data)
plt.show()

In [None]:
sns.countplot(x="Gender", data=data)
plt.show()

In [None]:
# Color by the target; hue="Age" would create one legend entry per distinct age
sns.displot(data, x="Age", hue="Exited")
plt.show()

In [None]:
sns.countplot(x="HasCrCard", data=data)
plt.show()

In [None]:
sns.countplot(x="IsActiveMember", data=data)
plt.show()

In [None]:
sns.countplot(x="NumOfProducts", data=data)
plt.show()

### Find the correlations

In [None]:
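# Correlate numeric columns only; recent pandas versions raise an error on text columns
corr_matrix = data.select_dtypes(include='number').corr()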
f, ax = plt.subplots(figsize=(25, 15))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr_matrix, cmap=cmap, vmax=.5, annot=True, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

In [None]:
High_corr = corr_matrix.nlargest(4, 'Exited')['Exited'].index
High_corr

In [None]:
corr_matrix["Exited"].sort_values(ascending=False)

In [None]:
new_df = data.copy()

## Data preparation

### Outliers Detection

There are various methods for detecting outliers. I am going to use IQR here because it works fine for me, but 

you can try other methods like 

            1. Z-score method
            2. Robust Z-score
            3. I.Q.R method
            4. Winsorization method (Percentile Capping)
            5. DBSCAN Clustering
            6. Isolation Forest
            7. Visualizing the data
            
IQR stands for "interquartile range".

This method depends on two values:
    
    Q1 >> the value a quarter of the way through the sorted data. Usually this is the 0.25 quantile, but the code below uses 0.12 to avoid deleting a lot of data.
    
    Q3 >> the value three-quarters of the way through the sorted data. Usually this is the 0.75 quantile, but the code below uses 0.88 for the same reason.
    
How IQR works:
    First it sorts the data and finds the median.
    Then it takes the values below the median and finds their median, "Q1", and takes the values 
    above the overall median and finds their median, "Q3".
    
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/1200px-Boxplot_vs_PDF.svg.png">

Then we take the difference between Q3 and Q1, which is the IQR. Values more than 1.5 * IQR below Q1 or above Q3 are treated as outliers.
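
As a small worked example of the rule described above, here is a sketch on made-up numbers (the array is hypothetical and not taken from the churn data). It uses the classic 0.25 and 0.75 quartiles so the arithmetic is easy to follow, whereas the notebook's own `IQR` helper uses wider quantiles.

```python
import numpy as np

# Toy data with one obvious outlier (hypothetical values, not from the churn dataset)
values = np.array([18, 22, 25, 27, 30, 31, 35, 38, 41, 95])

# Classic quartiles; the notebook's IQR() helper uses wider quantiles instead
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything more than 1.5 * IQR beyond the quartiles is flagged as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("Q1, Q3, IQR:", q1, q3, iqr)
print("limits:", lower_limit, upper_limit)
print("outliers:", outliers)
```

Only the value 95 falls outside the limits here; widening the quantiles, as the notebook does, pushes both limits further out and flags fewer rows.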

#### But before getting our hands dirty, let's define some functions that we will use a lot:
    "IQR" to calculate the IQR and the outlier limits for us 
    "upper" and "lower" to fetch the rows with values above the upper limit or below the lower limit 
    "outliers_del" to delete them 
    "plot" to draw a box plot of a column 
    "outlier_compare" to compare the data before and after deleting the outliers and correcting the skewness
    
I will write a comment for each function when creating it.

In [None]:
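# This function calculates the IQR for us and returns the rows that are above or below the limits, as follows.
# Note: 0.12 and 0.88 are used instead of the usual 0.25 and 0.75 quantiles.
def IQR(column_name):
    Q1 = new_df[column_name].quantile(0.12)
    Q3 = new_df[column_name].quantile(0.88)
    iqr = Q3 - Q1  # lower-case name so it does not shadow the function IQR
    upper_limit = Q3 + 1.5 * iqr
    lower_limit = Q1 - 1.5 * iqr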
    values_upper = new_df[new_df[column_name] > upper_limit]
    values_lower = new_df[new_df[column_name] < lower_limit]
    
    return values_upper, values_lower, upper_limit, lower_limit

In [None]:
# This function checks whether any rows lie above the upper limit.
# It looks at shape[0], the number of rows: a shape such as (2, 63) means 2 rows contain outliers.
# If there are any, it returns those rows.
def upper(column_name):
    # Recompute the limits here instead of relying on global variables
    values_upper, _, _, _ = IQR(column_name)
    if values_upper.shape[0] > 0:
        print("Outliers above the upper limit: ")
        return values_upper
    else:
        print("There are no values higher than the upper limit!")

In [None]:
# Same as above, but for values below the lower limit
def lower(column_name):
    # Recompute the limits here instead of relying on global variables
    _, values_lower, _, _ = IQR(column_name)
    if values_lower.shape[0] > 0:
        print("Outliers below the lower limit: ")
        return values_lower
    else:
        print("There are no values lower than the lower limit!")

In [None]:
# This function deletes any rows above the upper limit or below the lower limit
def outliers_del(column_name):
    # new_df is reassigned below, so it has to be declared global
    global new_df
    _, _, upper_limit, lower_limit = IQR(column_name)
    # Keep values equal to a limit: IQR() only flags values strictly beyond it
    new_df = new_df[(new_df[column_name] >= lower_limit) & (new_df[column_name] <= upper_limit)]
    print("the old data shape is :", data.shape)
    print("the new data shape is :", new_df.shape)

In [None]:
# This function draws a box plot of a column from the original data
def plot(column_name):
    plt.style.use('fivethirtyeight')
    plt.figure(figsize=(16,5))
    #plt.subplot(1,2,1)
    # we could use fit=norm to draw, in black, the normal distribution that the data should follow
    #sns.distplot(data[column_name], fit=norm)
    plt.subplot(1,2,1)
    # pass the column as x= because newer seaborn versions treat the first positional argument as data
    sns.boxplot(x=data[column_name], palette="rocket")
    plt.show()

In [None]:
# This function draws the box plot before (data) and after (new_df) deleting the outliers
def outlier_compare(column_name):
    plt.style.use('fivethirtyeight')
    plt.figure(figsize=(25,15))
    plt.subplot(2,2,1)
    sns.boxplot(x=data[column_name], palette="rocket")
    plt.subplot(2,2,2)
    sns.boxplot(x=new_df[column_name], palette="rocket")
    plt.show()

In [None]:
Upper_Outliers_columns = []
Lower_Outliers_columns = []
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
for column in new_df:
    if new_df[column].dtype in numeric_dtypes:
        values_upper, values_lower, upper_limit, lower_limit = IQR(column)
        if values_upper.shape[0] > 0:
            Upper_Outliers_columns.append(column)
        if values_lower.shape[0] > 0:
            Lower_Outliers_columns.append(column)

In [None]:
print('Columns with values above the upper limit: ', Upper_Outliers_columns)
print('Columns with values below the lower limit: ', Lower_Outliers_columns)

Well, I will ignore NumOfProducts and Exited because those are categorical data!

Let's start with Age, then CreditScore.

#### Age

In [None]:
plot('Age')

In [None]:
values_upper, values_lower, upper_limit, lower_limit = IQR('Age')

In [None]:
upper('Age')

In [None]:
lower('Age')

In [None]:
outliers_del('Age')

In [None]:
outlier_compare('Age')

### Skewness

In [None]:
from scipy.stats import skew

skewness_list = {}
for i in new_df:
    if new_df[i].dtype != "object":
        skewness_list[i] = skew(new_df[i])

skewness = pd.DataFrame({'Skew' :skewness_list})
plt.style.use('fivethirtyeight')
plt.figure(figsize=(15,9))
plt.xlabel('Features', fontsize=15)
plt.ylabel('Skewness', fontsize=15)
plt.bar(range(len(skewness_list)), list(skewness_list.values()), align='center')
# Label each bar with its column name; rotation must be a number, not the string '90'
plt.xticks(range(len(skewness_list)), list(skewness_list.keys()), rotation=90)

plt.show()

In [None]:
skewness_list

Well, I tried to correct the skewness of Age, but I got a lower score, so we will keep everything as it is.
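
For readers who want to experiment anyway, here is a minimal sketch of one common way to reduce right skew, a `log1p` transform. This is an assumption for illustration, not necessarily the correction that was tried in this notebook, and the sample below is synthetic rather than the real Age column.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample standing in for a feature such as Age
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.5, sigma=0.5, size=5000)

# log1p compresses the long right tail
transformed = np.log1p(sample)

skew_before = skew(sample)
skew_after = skew(transformed)
print("skew before:", skew_before)
print("skew after: ", skew_after)
```

A lower skewness does not guarantee a better model, which matches the experience described above, so it is worth comparing cross-validation scores with and without the transform.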

## Data splitting

In [None]:
X = new_df.drop("Exited", axis=1)

In [None]:
y = new_df['Exited'].copy()

In [None]:
X.shape, y.shape

Why do we use stratify when splitting?

Some classification problems can exhibit a large imbalance in the distribution of the target 
classes: for instance there could be several times more negative samples than positive samples. 
In such cases it is recommended to use stratified sampling as implemented in StratifiedKFold 
and StratifiedShuffleSplit to ensure that relative class frequencies are approximately preserved in 
each train and validation fold.
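
To see what stratification does in practice, here is a small sketch on a made-up target (the arrays `X_toy` and `y_toy` are hypothetical, with 20% positives). It only illustrates the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy)

# With stratify, the positive rate is (approximately) preserved in both parts
print("overall positive rate:", y_toy.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters more the rarer the positive class is.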

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, shuffle=True, random_state=42, stratify=y)

## Pipeline

What will we do here?

Well, I used to do the following steps one by one, but when I learned to do them with a pipeline, everything changed!

It is much more organized and simpler, so here is what we will combine into our pipeline.

##### Encoding

Most machine learning algorithms work only with numerical features, and here we have some categorical features, so we need to convert them into numbers. That is what we call encoding. There are many ways to encode your data:

    Encoding 
    Replacing 
    Get dummies 
    and more
    
The most popular one is "get_dummies", but a big note here: if you are going to use get_dummies, you have to use it before splitting the data. If you apply "get_dummies" to the training set and the test set separately, you can get two different sets of columns (for example when a category is missing from one of them), and the model will not work!

We will use the sklearn encoder called "OneHotEncoder".
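
The column mismatch mentioned above can be sketched on two tiny made-up frames (the `train` and `test` frames are hypothetical). The `handle_unknown='ignore'` option is an extra assumption for this sketch; the notebook itself uses `OneHotEncoder()` with its defaults.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split in which 'Spain' never appears in the test part
train = pd.DataFrame({'Geography': ['France', 'Spain', 'Germany', 'France']})
test = pd.DataFrame({'Geography': ['Germany', 'France']})

# get_dummies builds columns from whatever categories it sees in each frame
dummies_train = pd.get_dummies(train)
dummies_test = pd.get_dummies(test)
print(list(dummies_train.columns))
print(list(dummies_test.columns))

# OneHotEncoder learns the categories once, on the training data only
encoder = OneHotEncoder(handle_unknown='ignore')
encoded_train = encoder.fit_transform(train).toarray()
encoded_test = encoder.transform(test).toarray()
print(encoded_train.shape, encoded_test.shape)
```

The two `get_dummies` results have different numbers of columns, while the encoder produces the same width for both frames because the categories were fixed when it was fitted.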

##### Feature Scaling

Machine learning algorithms that use gradient descent as an optimization technique, like linear regression, logistic regression, neural networks, etc., work best when the data is scaled.

We have three methods in sklearn:

MinMaxScaler(feature_range = (0, 1)) will transform each value in the column proportionally within the range [0,1]. Use this as the first scaler choice to transform a feature, as it will preserve the shape of the distribution (no distortion).

StandardScaler() will transform each value in the column so that the column has mean 0 and standard deviation 1, i.e., each value will be standardized by subtracting the mean and dividing by the standard deviation. Use StandardScaler if you know the data distribution is normal.

If there are outliers, use RobustScaler(). Alternatively you could remove the outliers and use either of the above 2 scalers (the choice depends on whether the data is normally distributed).

We deleted most outliers earlier, so we can use MinMaxScaler or StandardScaler.
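
To compare the three scalers side by side, here is a small sketch on a made-up column (the `feature` array is hypothetical and contains one deliberate outlier).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One hypothetical feature with a single large outlier
feature = np.array([[10.0], [12.0], [14.0], [16.0], [200.0]])

min_max = MinMaxScaler().fit_transform(feature)     # (x - min) / (max - min)
standard = StandardScaler().fit_transform(feature)  # (x - mean) / std
robust = RobustScaler().fit_transform(feature)      # (x - median) / IQR

print("min-max: ", min_max.ravel())
print("standard:", standard.ravel())
print("robust:  ", robust.ravel())
```

With the outlier present, min-max scaling squeezes the four ordinary values close to 0, while the robust scaler keeps them evenly spread around 0, which is why removing outliers first makes MinMaxScaler a reasonable choice.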




In [None]:
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
# Encoders
from sklearn.preprocessing import OneHotEncoder
# Scaling
from sklearn.preprocessing import MinMaxScaler
# feature selection 
from sklearn.feature_selection import SelectPercentile, chi2
# Cols transform
from sklearn.compose import make_column_transformer
# Pipeline
from sklearn.pipeline import make_pipeline
# cross val
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold
# interactive diagrams of Pipelines 
from sklearn import set_config
set_config(display='diagram')

In [None]:
OHE = OneHotEncoder()
MMS = MinMaxScaler()
cv = StratifiedKFold(5, random_state=1, shuffle=True)

In [None]:
column_trans = make_column_transformer(
    (OHE, ['Geography', 'Gender']),
    (MMS, ['CreditScore', 'Balance', 'EstimatedSalary']),
    remainder='passthrough')

In [None]:
column_trans.fit_transform(X_train)

## Modeling

#### RandomForestClassifier

In [None]:
# max_features='auto' meant 'sqrt' for classifiers and was removed in newer scikit-learn versions
RF = RandomForestClassifier(random_state=4, criterion='gini', max_depth=10, max_features='sqrt')

In [None]:
RF_pipe = make_pipeline(column_trans, RF)

In [None]:
RF_pipe

In [None]:
cross_val_score(RF_pipe, X_train, y_train, cv=cv, scoring='accuracy').mean()

This is very important: when cross-validation is applied to the entire pipeline, the data is first split into cv 
folds, and each fold is then passed through the whole pipeline. This is better than preprocessing all the data first 
and cross-validating only the model, because this way the entire 
pipeline process is validated!
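
The same idea can be sketched on synthetic data (`X_demo`, `y_demo`, `demo_pipe` and `demo_cv` are hypothetical names standing in for the notebook's training data, pipeline and splitter).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
demo_cv = StratifiedKFold(5, shuffle=True, random_state=1)

# For each fold, the scaler is fitted on the training part only,
# then the fitted scaler and model are applied to the held-out part
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=demo_cv, scoring='accuracy')
print(scores)
print(scores.mean())
```

Because the scaler sits inside the pipeline, no statistics from a held-out fold are used when fitting, so the averaged score is a fairer estimate.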

In [None]:
RF_pipe.fit(X_train, y_train);

This fits the pipeline on X_train only. When we call predict, the pipeline only transforms the data using what it learned from X_train during fit. In other words, it learns from the training data and transforms the test data based on the training data, which prevents data leakage! Data leakage occurs when the model learns something from the test data that did not exist in the training data.
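
A tiny sketch of the fit-then-transform behaviour with a single scaler (the arrays `train_col` and `test_col` are made-up values, not columns from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[0.0], [5.0], [10.0]])  # hypothetical training values
test_col = np.array([[20.0]])                 # value outside the training range

scaler = MinMaxScaler()
scaler.fit(train_col)  # learns min=0 and max=10 from the training data only

# transform reuses the training min/max, so the test value can fall outside [0, 1]
scaled_test = scaler.transform(test_col)
print(scaled_test)
```

If the scaler had been fitted on the test value as well, its maximum would have changed, and information from the test set would have leaked into the preprocessing.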

In [None]:
RF_pipe.score(X_train, y_train)

In [None]:
RF_pred = RF_pipe.predict(X_train)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, RF_pred)

#### LogisticRegression

In [None]:
# multi_class is left at its default ('auto'); passing it explicitly is deprecated in newer scikit-learn versions
LR = LogisticRegression(random_state=4, C=1, max_iter=1000, penalty='l1', solver='saga')

In [None]:
LR_pipe = make_pipeline(column_trans, LR)

In [None]:
LR_pipe

In [None]:
cross_val_score(LR_pipe, X_train, y_train, cv=cv, scoring='accuracy').mean()

In [None]:
LR_pipe.fit(X_train, y_train);

In [None]:
LR_pipe.score(X_train, y_train)

In [None]:
LR_pred = LR_pipe.predict(X_train)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, LR_pred)

#### KNeighborsClassifier

In [None]:
KNN = KNeighborsClassifier(algorithm='auto', leaf_size=30, n_neighbors=9, weights='uniform')

In [None]:
KNN_pipe = make_pipeline(column_trans, KNN)

In [None]:
KNN_pipe

In [None]:
cross_val_score(KNN_pipe, X_train, y_train, cv=cv, scoring='accuracy').mean()

In [None]:
KNN_pipe.fit(X_train, y_train);

In [None]:
KNN_pipe.score(X_train, y_train)

In [None]:
KNN_pred = KNN_pipe.predict(X_train)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, KNN_pred)

## Fine-Tune Our Model

After training our models and getting an idea of how they perform, it is now time to find the optimal hyperparameters.
One way to do that would be to fiddle with the hyperparameters manually until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.
Instead, you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values using cross-validation. The amazing thing here is that it 
searches over the hyperparameters of the entire pipeline, not only the model!

GridSearchCV may take a long time, so you can also try RandomizedSearchCV. This method samples random combinations 
of hyperparameters, and you decide how many combinations it tries, which keeps the cost under control when a full 
grid search would be too expensive.
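
Here is a minimal sketch of RandomizedSearchCV on synthetic data. The names `X_demo`, `y_demo` and `param_distributions` and the candidate values are assumptions for illustration; the notebook itself only runs GridSearchCV.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical candidate values; 3 * 4 * 2 = 24 combinations in total
param_distributions = {
    'n_estimators': [20, 40, 60],
    'max_depth': [3, 5, 7, 9],
    'criterion': ['gini', 'entropy'],
}

# n_iter caps the number of sampled combinations (here 5 out of 24)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=4),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring='accuracy',
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

To search a whole pipeline instead of a bare model, the keys would need the step prefix used in the cells below, for example `randomforestclassifier__max_depth`.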

#### RandomForestClassifier

In [None]:
hyper_params = {}
hyper_params['randomforestclassifier__n_estimators'] = [150, 200, 250]
hyper_params['randomforestclassifier__max_depth'] = [7, 8, 9, 10]
hyper_params['randomforestclassifier__criterion'] = ['gini','entropy']
# 'auto' is dropped: it was the same as 'sqrt' and was removed in newer scikit-learn versions
hyper_params['randomforestclassifier__max_features'] = ['sqrt', 'log2']

In [None]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(RF_pipe, hyper_params, cv=cv, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train);

In [None]:
grid.best_score_

In [None]:
grid.best_params_

Convert the results into a DataFrame

In [None]:
results = pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]

In [None]:
results.sort_values('rank_test_score')

#### LogisticRegression

In [None]:
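# The l1 penalty is only supported by liblinear and saga, so each penalty gets its own grid
common = {'logisticregression__C': [.001, .01, .1, 1],
          'logisticregression__max_iter': [100, 1000]}
hyper_params = [
    {**common, 'logisticregression__penalty': ['l1'],
     'logisticregression__solver': ['liblinear', 'saga']},
    {**common, 'logisticregression__penalty': ['l2'],
     'logisticregression__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
]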

In [None]:
grid = GridSearchCV(LR_pipe, hyper_params, cv=cv, scoring='accuracy')
grid.fit(X_train, y_train);

In [None]:
grid.best_score_

In [None]:
grid.best_params_

In [None]:
results = pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]

In [None]:
results.sort_values('rank_test_score')

#### KNeighborsClassifier

In [None]:
hyper_params = {}
hyper_params['kneighborsclassifier__n_neighbors'] = [5, 6, 7, 8, 9]
hyper_params['kneighborsclassifier__weights'] = ['uniform','distance']
hyper_params['kneighborsclassifier__algorithm'] = ['auto', 'ball_tree', 'kd_tree', 'brute']
hyper_params['kneighborsclassifier__leaf_size'] = [30, 40, 50]

In [None]:
grid = GridSearchCV(KNN_pipe, hyper_params, cv=cv, scoring='accuracy')
grid.fit(X_train, y_train);

In [None]:
grid.best_score_

In [None]:
grid.best_params_

In [None]:
results = pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]

In [None]:
results.sort_values('rank_test_score')

## Evaluate Our System on the Test Set 

In [None]:
# Evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay (1.0+) replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import roc_auc_score

#### RF

In [None]:
RF_pred = RF_pipe.predict(X_test)

In [None]:
accuracy_score(y_test, RF_pred)

Confusion matrix

In [None]:
ConfusionMatrixDisplay.from_estimator(RF_pipe, X_test, y_test, cmap=plt.cm.Blues)
plt.show()

In [None]:
# ROC AUC is computed from the predicted probability of the positive class, not from hard labels
roc_auc_score(y_test, RF_pipe.predict_proba(X_test)[:, 1])

#### KNN

In [None]:
KNN_pred = KNN_pipe.predict(X_test)

In [None]:
accuracy_score(y_test, KNN_pred)

Confusion matrix

In [None]:
ConfusionMatrixDisplay.from_estimator(KNN_pipe, X_test, y_test, cmap=plt.cm.Blues)
plt.show()

In [None]:
# ROC AUC is computed from the predicted probability of the positive class, not from hard labels
roc_auc_score(y_test, KNN_pipe.predict_proba(X_test)[:, 1])

Well, the best score we got is around 87% from RandomForestClassifier. While modeling and testing I got a 92% score, but when I ran the code again the score changed. If you have an idea why this happened, just let me know!