# Assignment - Supervised Learning

### Objective

Data Analysis to identify the potential customers who have a higher probability of purchasing the loan.


# Context


This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with minimal budget.

The department wants to build a model that will help them identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reduce the cost of the campaign.

# Attribute Information: Column descriptions



    ID                    :Customer ID 

    Age                   :Customer's age in completed years 

    Experience            :#years of professional experience 

    Income                :Annual income of the customer ($000) 

    ZIPCode               :Home Address ZIP code 

    Family                :Family size of the customer 

    CCAvg                 :Avg. spending on credit cards per month ($000) 

    Education             :Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional 

    Mortgage              :Value of house mortgage if any. ($000) 

    Personal Loan         :Did this customer accept the personal loan offered in the last campaign? 

    Securities Account    :Does the customer have a securities account with the bank? 

    CD Account            :Does the customer have a certificate of deposit (CD) account with the bank? 

    Online                :Does the customer use internet banking facilities? 

    CreditCard            :Does the customer use a credit card issued by UniversalBank?

### Acknowledgements

This data set was given as part of a course in machine learning. I have also added my observations on the data. I thank Great Learning and my faculty for giving me the opportunity to work on this dataset.

### Inspiration

Study the data distribution in each attribute and share the findings. Use a classification model to predict the likelihood of a liability customer buying a personal loan.

### Technologies & Libraries

Python3

    Logistic Regression ; KNN Classifier ; Naive Bayes ; SVM ; Metrics ; Preprocessing ; 

    Pandas ; NumPy ; Matplotlib ; Seaborn ; SKLearn ; SciPy ; Statsmodels ; Copy ; OS ; re ; traceback ; string ; Scikitplot
    
    
        
        



In [None]:
## Necessary libraries are imported

import warnings 
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

import os
import statsmodels.api as sm
import scipy.stats as stats
import copy
import pandas.core.algorithms as algos
from pandas import Series
import re
import traceback
import string
import scikitplot as skplt

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc
from scipy.stats import zscore
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import model_selection
from sklearn import preprocessing
from sklearn import svm
from sklearn.preprocessing import MinMaxScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from yellowbrick.model_selection import FeatureImportances


##### Dataset

BankLoan.csv


In [None]:
print(os.listdir("../input"))

In [None]:
# Dataset is read 
bank_per_loan_df = pd.read_csv('../input/Bank_Personal_Loan_Modelling(1).csv')

In [None]:
bank_per_loan_df.head()

# Pandas Profiling

I have attached the pandas profiling HTML file separately. Generating it in Jupyter Notebook gave an error due to version issues, so I generated the report in Google Colab instead. It is included as a separate HTML file in the submission.

Code used in Colab:

 !pip install -U pandas-profiling
 
 import pandas_profiling as pp
 
 profile= pp.ProfileReport(bank_per_loan_df)
 
 profile.to_file('./output.html') 


##### Points observed by profile report & univariate analysis:

    The data set has 0 missing cells.

    It has 7 numeric variables: ‘Age’, ‘CCAvg’, ‘ID’, ‘Income’, ‘Mortgage’, ‘ZIPCode’, ‘Experience’
    It has 2 categorical variables: ‘Education’, ‘Family’
    It has 5 Boolean variables: ‘CD Account’, ‘CreditCard’, ‘Online’, ‘Personal Loan’, ‘Securities Account’
    Personal Loan is highly correlated with Income, average spending on credit cards, mortgage & whether the customer has a certificate of deposit (CD) account with the bank.
    Also, Experience is highly correlated with Age (ρ = 0.994214857)

##### Categorical
    42% of the candidates are graduated, while 30% are professional and 28% are Undergraduate.
    Around 29% of the customer’s family size is 1.

##### Boolean
    94% of customers do not have a certificate of deposit (CD) account with the bank.
    Around 71% of customers do not use a credit card issued by UniversalBank.
    Around 60% of customers use internet banking facilities.
    Around 90% of customers did not accept the personal loan offered in the last campaign.
    Around 90% of customers do not have a securities account with the bank.
    
##### Numeric
    The mean age of the customers is 45 with a standard deviation of 11.5, which is consistent with our hypothesis of an age range of 30–50. The skewness is very slightly negative (Skewness = -0.02934068151), so the curve is fairly symmetrical
    The mean of Avg. spending on credit cards per month is 1.93 with a standard deviation of 1.75. The curve is highly positively skewed (Skewness = 1.598443337)
    The mean annual income of the customer is 73.77 with a standard deviation of 46. The curve is moderately positively skewed (Skewness = 0.8413386073)
    The mean value of house mortgage is 56.5 with a standard deviation of 101.71. The curve is highly positively skewed (Skewness = 2.104002319) and there are a lot of outliers present (Kurtosis = 4.756796669)
    
    
    Also, the ‘ID’, ‘ZIPCode’ & ‘Experience’ columns are not needed for further analysis: ‘ID’ and ‘ZIPCode’ are just identifiers, and ‘Experience’ is highly correlated with ‘Age’.
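
Dropping the columns named above could look like the following. This is a minimal sketch on a toy frame: `toy_df`, its values and the exact column spellings are assumptions made for illustration, so the names should be checked against `bank_per_loan_df.columns` before applying it to the real data.

```python
import pandas as pd

# Toy stand-in for bank_per_loan_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "Experience": [1, 19, 15],
    "ZIP Code": [10001, 10002, 10003],
    "Income": [49, 34, 11],
})

# Drop the identifier-like columns and the column that duplicates Age.
columns_to_drop = ["ID", "ZIP Code", "Experience"]
reduced_df = toy_df.drop(columns=columns_to_drop)

print(reduced_df.columns.tolist())
```

`drop` returns a new frame, so the original data stays available for the exploratory plots.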


#### Variables definition



    ID - Customer ID
    Age - Customer's age in completed years
    Experience - Number of years of professional experience.
    Income - Annual income of the customer (in $ 1000).
    ZIPCode - Home Address ZIP code.
    Family - Family size of the customer
    CCAvg - Avg. spending on credit cards per month - in thousands usd
    Education - Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
    Mortgage - Value of house mortgage if any - in thousands usd
    Personal Loan - Did this customer accept the personal loan offered in the last campaign?
    Securities Account - Does the customer have a securities account with the bank?
    CD Account - Does the customer have a certificate of deposit (CD) account with the bank?
    Online - Does the customer use internet banking facilities?
    CreditCard - Does the customer uses a credit card issued by UniversalBank?



##### Categorical Feature:


    Family
    Education
    ID
    Zip Code
    Securities Account
    CD Account
    Online
    Credit Card

##### Numerical feature:

    Age
    Experience
    Income
    CCAvg
    Mortgage

##### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Question 1: Read the column description and ensure you understand each attribute well

In [None]:
print(bank_per_loan_df.columns)
print(bank_per_loan_df.shape)

##### Observation

We have 13 independent variables and 1 dependent variable i.e. ‘Personal Loan’ in the data set. Also, we got 5000 rows which can be split into test & train datasets.

In [None]:
bank_per_loan_df.info()

##### Observation

    No Missing Values

In [None]:
bank_per_loan_df.isna().apply(pd.value_counts)   #null value check

##### Observation

    No Null Values
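
A more compact way to read the same check is to count missing values per column. The sketch below uses a toy frame (`toy_df` is hypothetical and deliberately contains one missing value) so the effect is visible.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing Age value (made-up data).
toy_df = pd.DataFrame({
    "Age": [25, 45, np.nan, 39],
    "Income": [49, 34, 11, 100],
})

# Number of missing values in each column; all zeros means no nulls.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# A single boolean answer for the whole frame.
has_any_missing = bool(toy_df.isna().any().any())
print("Any missing values:", has_any_missing)
```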

###### Check for duplicate records

In [None]:
bank_per_loan_df.duplicated().any()

##### Observation


no duplicate records

In [None]:
bank_per_loan_df.describe().T

##### Observation

    Column 'Experience' has negative values
    
    Binary variables "Personal Loan", "CreditCard", "Online", "CD Account", "Securities Account" have clean data
    
    Ordinal categorical variables "Family" and "Education" are clean too

Replacing the negative values with the median value of the column

In [None]:
any(bank_per_loan_df['Experience'] < 0)

In [None]:
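# The median is computed over the whole column, negative entries included
exp_med = bank_per_loan_df["Experience"].median()
print("Experience median is", exp_med)
# Assign the result back: an in-place replace() on a .loc selection may act on a copy
bank_per_loan_df["Experience"] = bank_per_loan_df["Experience"].replace([-1, -2, -3], exp_med)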

In [None]:
any(bank_per_loan_df['Experience'] < 0)

In [None]:
bank_per_loan_df.describe().T
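
The replacement above lists the negative values explicitly. A variant that does not need to know them in advance is sketched below on a toy series. Two assumptions differ from the notebook: the data are made up, and the median is taken over the valid (non-negative) entries only, so the bad values do not pull it down.

```python
import pandas as pd

# Toy Experience column with two invalid negative entries (made-up data).
toy_exp = pd.Series([1, 19, -1, 15, -3, 9, 8], name="Experience")

# Flag every negative entry, whatever its value.
is_negative = toy_exp < 0

# Median of the valid entries only (assumption: the notebook uses the overall median).
median_valid = toy_exp[~is_negative].median()

# Replace flagged entries and keep everything else unchanged.
cleaned_exp = toy_exp.mask(is_negative, median_valid)

print("Median of valid values:", median_valid)
print(cleaned_exp.tolist())
```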

###### ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Question2: Study the data distribution in each attribute, share your findings. 

##### Finding: ( analysis is shown below)

1). ID: This attribute can be dropped. It is only a customer identifier, so its distribution carries no information.

2). Age: Three small peaks indicate that three age values are slightly more frequent. However, the mean and median of the attribute are nearly equal, and the distribution is reasonably symmetric.

3). Education: Mean and median are almost equal. The data is well distributed, with a few peaks showing the dominance of particular values.

4). Income: The data is highly right (positively) skewed. Lower-income customers make up most of the sample.

5). ZIP Code: The attribute has sharp peaks, telling us that more data was collected from particular places. The spread is also small in the sample. Data from more places could be collected.

6). Family: It has 4 peaks (4 values); families with the fewest members are the most frequent in the sample.

7). Mortgage: This attribute is highly right (positively) skewed, with a very high peak on the left telling us that most customers have little or no mortgage while very few have a large one.

8). Securities Account: This attribute tells us that the majority of customers do not have a securities account.

9). CD Account: Most of the customers don't have CD accounts.

10). Online: A higher number of customers use online banking in the sample.

11). Credit Card: Fewer customers use the bank's credit card than do not.


# Univariate Analysis of the continuous variables - 1

In [None]:
plt.figure(figsize= (40.5,40.5))
plt.subplot(5,3,1)
plt.hist(bank_per_loan_df.Age, color='lightpink', edgecolor = 'black')
plt.xlabel('Age')

plt.subplot(5,3,2)
plt.hist(bank_per_loan_df.Experience, color='darkblue', edgecolor = 'black')
plt.xlabel('Experience')

plt.subplot(5,3,3)
plt.hist(bank_per_loan_df.Income, color='lightblue', edgecolor = 'black')
plt.xlabel('Income')

plt.subplot(5,3,4)
plt.hist(bank_per_loan_df.CCAvg, color='lightgreen', edgecolor = 'black')
plt.xlabel('Credit Card Average')

plt.subplot(5,3,5)
plt.hist(bank_per_loan_df.Mortgage, color='purple', edgecolor = 'black')
plt.xlabel('Mortgage')

plt.show()

##### Observation

    Age & Experience seem to be quite normally distributed

    Income, CC Average & Mortgage are highly skewed



In [None]:
# Checking for Skewness of data
Skewness = pd.DataFrame({'Skewness' : [stats.skew(bank_per_loan_df.Age),stats.skew(bank_per_loan_df.Experience),stats.skew(bank_per_loan_df.Income),stats.skew(bank_per_loan_df.CCAvg)
                                      ,stats.skew(bank_per_loan_df.Mortgage)]},index=['Age','Experience','Income','CCAvg','Mortgage'])
Skewness

##### Observation

    Age and Experience seem to be quite symmetrical

    Income, CCAvg and Mortgage are positively skewed; as they are highly skewed, there will be quite a lot of extreme values
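
The verbal labels used for skewness can be made explicit with a small helper. The sketch below applies a common rule of thumb (below 0.5 in magnitude: fairly symmetric; up to 1.0: moderate; above 1.0: high) to the skewness values quoted earlier in the profile-report notes. The thresholds and the helper name `classify_skew` are assumptions for illustration, not something the notebook defines.

```python
def classify_skew(skewness):
    """Label a skewness value using the 0.5 / 1.0 rule of thumb (an assumption)."""
    magnitude = abs(skewness)
    if magnitude < 0.5:
        return "fairly symmetric"
    strength = "moderately" if magnitude <= 1.0 else "highly"
    direction = "positively" if skewness > 0 else "negatively"
    return f"{strength} {direction} skewed"

# Skewness values quoted in the profile-report notes of this notebook.
reported_skewness = {
    "Age": -0.02934068151,
    "Income": 0.8413386073,
    "CCAvg": 1.598443337,
    "Mortgage": 2.104002319,
}

labels = {name: classify_skew(value) for name, value in reported_skewness.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```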

# Univariate Analysis of the continuous variables - 2

In [None]:
plt.figure(figsize= (25,25))
plt.subplot(5,2,1)
sns.boxplot(x= bank_per_loan_df.Age, color='yellow')

plt.subplot(5,2,2)
sns.boxplot(x= bank_per_loan_df.Experience, color='red')

plt.subplot(5,2,3)
sns.boxplot(x= bank_per_loan_df.Income, color='lightgreen')

plt.subplot(5,2,4)
sns.boxplot(x= bank_per_loan_df.CCAvg, color='lightpink')

plt.subplot(5,2,5)
sns.boxplot(x= bank_per_loan_df.Mortgage, color='lightblue')

##### Inference

    Age is normally distributed, with the majority of customers between 35 and 55 years of age. We can infer this from the boxplot above, and the output of describe() shows that the mean is almost equal to the median.

    Experience is normally distributed, with most customers having between 11 and 30 years of experience. Here also the mean is nearly equal to the median.

    Income is positively skewed. Majority of the customers have income between 45K and 55K. The mean being greater than the median confirms the positive skew.

    CCAvg is also positively skewed: average spending ranges from 0K to 10K, and the majority spend less than 2.5K.

    Mortgage: 70% of the individuals have a mortgage of less than 40K. However, the max value is 635K.
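
The points drawn beyond the whiskers of a boxplot follow the 1.5 × IQR rule by default, and they can be counted directly. The sketch below does this with NumPy on made-up mortgage values; `iqr_outlier_mask` is a hypothetical helper and the numbers are not taken from the dataset.

```python
import numpy as np

def iqr_outlier_mask(values, whisker=1.5):
    """Return a boolean mask of points outside [Q1 - w*IQR, Q3 + w*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Made-up mortgage values in $000: many zeros and one very large value.
toy_mortgage = [0, 0, 0, 0, 0, 0, 101, 120, 150, 635]

outlier_mask = iqr_outlier_mask(toy_mortgage)
n_outliers = int(outlier_mask.sum())
print("Number of outliers:", n_outliers)
print("Outlier values:", np.asarray(toy_mortgage)[outlier_mask].tolist())
```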



# Univariate Analysis of the categorical variables

In [None]:
plt.figure(figsize=(30,45))

plt.subplot(6,2,1)
bank_per_loan_df['Family'].value_counts().plot(kind="bar", align='center',color = 'green',edgecolor = 'black')
plt.xlabel("Number of Family Members")
plt.ylabel("Count")
plt.title("Family Members Distribution")

plt.subplot(6,2,2)
bank_per_loan_df['Education'].value_counts().plot(kind="bar", align='center',color = 'blue',edgecolor = 'black')
plt.xlabel('Level of Education')
plt.ylabel('Count ')
plt.title('Education Distribution')

plt.subplot(6,2,3)
bank_per_loan_df['Securities Account'].value_counts().plot(kind="bar", align='center',color = 'red',edgecolor = 'black')
plt.xlabel('Holding Securities Account')
plt.ylabel('Count')
plt.title('Securities Account Distribution')

plt.subplot(6,2,4)
bank_per_loan_df['CD Account'].value_counts().plot(kind="bar", align='center',color = 'lightpink',edgecolor = 'black')
plt.xlabel('Holding CD Account')
plt.ylabel('Count')
plt.title("CD Account Distribution")

plt.subplot(6,2,5)
bank_per_loan_df['Online'].value_counts().plot(kind="bar", align='center',color = 'lightgreen',edgecolor = 'black')
plt.xlabel('Accessing Online Banking Facilities')
plt.ylabel('Count')
plt.title("Online Banking Distribution")

plt.subplot(6,2,6)
bank_per_loan_df['CreditCard'].value_counts().plot(kind="bar", align='center',color = 'yellow',edgecolor = 'black')
plt.xlabel('Holding Credit Card')
plt.ylabel('Count')
plt.title("Credit Card Distribution")

##### Observations

    The variables Family and Education are ordinal. Family sizes are fairly evenly distributed
    Most customers hold neither a Securities Account nor a CD Account; the difference is large

In [None]:
#Pairplot
sns.pairplot(bank_per_loan_df.iloc[:,1:])

##### Observation

    'Age' has an association with 'Experience'
    
    Age is normally distributed, with the majority of customers between 30 and 60 years of age. We can confirm this by looking at the describe() output above, which shows that the mean is almost equal to the median


    Experience is normally distributed, with most customers having 8 or more years of experience. Here the mean is nearly equal to the median. The raw data contained negative Experience values. This is probably a data input error, as it is not possible to have negative years of experience. These values were replaced with the column median above. Because only a small share of records is affected, deleting them would also have been an option.


    Income is positively skewed. Majority of the customers have income between 45K and 55K. The mean being greater than the median confirms the positive skew

    CCAvg is also positively skewed: average spending ranges from 0K to 10K, and the majority spend less than 2.5K

    Mortgage: 70% of the individuals have a mortgage of less than 40K. However, the max value is 635K

    The variables Family and Education are ordinal. Family sizes are fairly evenly distributed

# Dependant variable analysis

In [None]:
bank_per_loan_df["Personal Loan"].value_counts().to_frame()

In [None]:
pd.value_counts(bank_per_loan_df["Personal Loan"]).plot(kind="bar")

# Bivariate Analysis

Hypotheses based on the data and loan awareness:

    -Customers with high salaries are less likely to buy personal loans, while customers with medium or low salaries are more likely to buy them.
    
    
    -The more earning family members there are, the lower the probability of buying a personal loan.
    
    
    -Customers aged roughly 30–50 will buy personal loans.
    
    
    -Education level can affect the buying probability: graduates and advanced professionals are more likely to buy personal loans from a bank than undergraduates.

###### Categorical Independent Variable vs Target Variable

In [None]:
sns.countplot(x='Personal Loan',hue='Education',data=bank_per_loan_df)
table=pd.crosstab(bank_per_loan_df['Education'],bank_per_loan_df['Personal Loan'])
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Education vs Purchase')
plt.xlabel('Education')
plt.ylabel('Proportion of Loans')
# undergraduates have a much lower probability of taking the loan

In [None]:
education = pd.crosstab(bank_per_loan_df['Education'],bank_per_loan_df['Personal Loan'])
education.div(education.sum(1).astype(float),axis =0).plot(kind='bar',stacked=True)
print('cross tabulation can be given as : ', '\n', education)
print('cross tabulation can be given in percentage as : ', '\n', education.div(education.sum(1).astype(float),axis =0))

##### Observation

From the above plots, we can infer that customers who are more educated have a higher probability of buying personal loans. Hence our hypothesis was true…!
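
The row-wise proportions computed above with `div(...)` can also be obtained directly through the `normalize` argument of `pd.crosstab`. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd

# Made-up sample: education level and whether the loan was accepted.
toy_df = pd.DataFrame({
    "Education": [1, 1, 1, 1, 2, 2, 3, 3],
    "Personal Loan": [0, 0, 0, 1, 0, 1, 1, 0],
})

# normalize="index" divides each row by its row total,
# giving the share of non-buyers and buyers per education level.
loan_rate_by_education = pd.crosstab(
    toy_df["Education"], toy_df["Personal Loan"], normalize="index"
)
print(loan_rate_by_education)
```

Each row sums to 1, so the column for `Personal Loan == 1` reads as the conversion rate within that education level.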

In [None]:
sns.countplot(x='Family',data=bank_per_loan_df,hue='Personal Loan',palette='Set1')

In [None]:
family = pd.crosstab(bank_per_loan_df['Family'],bank_per_loan_df['Personal Loan'])
family.div(family.sum(1).astype(float),axis =0).plot(kind='bar',stacked=True)
print('cross tabulation can be given as : ', '\n', family)
print('cross tabulation can be given in percentage as : ', '\n', family.div(family.sum(1).astype(float),axis =0))

##### Observation

The number of family members does not significantly affect the probability. Hence it contradicts our hypothesis that the number of family members will affect the probability.

# Influence of income and education on personal loan

In [None]:
sns.boxplot(x='Education',y='Income',hue='Personal Loan',data=bank_per_loan_df)

##### Observation : 
    
    
    
    
    It seems that customers whose education level is 1 have higher incomes. However, customers who have taken the personal loan have similar income levels across education levels

# Influence of mortgage and education on personal loan

In [None]:
sns.boxplot(x="Education", y='Mortgage', hue="Personal Loan", data=bank_per_loan_df,color='pink')

##### Observation

From the above chart, it seems that both customers without a personal loan and customers with one can have high mortgages.

# Boolean Independent Variable vs Target Variable

Let us now look at the Boolean variables (‘CD_Account’, ‘Credit_Card’, ‘Online’, ‘Securities Account’) vs Target variable (‘Personal_Loan’)

In [None]:
sns.countplot(x='CD Account',data=bank_per_loan_df,hue='Personal Loan')

In [None]:
CD_Account = pd.crosstab(bank_per_loan_df['CD Account'],bank_per_loan_df['Personal Loan'])
CD_Account.div(CD_Account.sum(1).astype(float),axis =0).plot(kind='bar',stacked=True)
print('cross tabulation can be given as : ', '\n', CD_Account)
print('cross tabulation can be given in percentage as : ', '\n', CD_Account.div(CD_Account.sum(1).astype(float),axis =0))

##### Observation

Customers who have a certificate of deposit (CD) account with the bank seem more likely to buy personal loans from the bank.

##### Let us now compare personal loan buyers who use or don’t use a credit card issued by UniversalBank

In [None]:
credit = pd.crosstab(bank_per_loan_df['CreditCard'],bank_per_loan_df['Personal Loan'])
credit.div(credit.sum(1).astype(float),axis =0).plot(kind='bar',stacked=True)
print('cross tabulation can be given as : ', '\n', credit)
print('cross tabulation can be given in percentage as : ', '\n', credit.div(credit.sum(1).astype(float),axis =0))

##### Observation

Whether or not a customer uses a credit card issued by UniversalBank doesn’t seem to affect the probability of buying a personal loan.

##### Let us now compare personal loan buyers who use or don’t use internet banking facilities:

In [None]:
online = pd.crosstab(bank_per_loan_df['Online'],bank_per_loan_df['Personal Loan'])
online.div(online.sum(1).astype(float),axis =0).plot(kind='bar',stacked=True)
print('cross tabulation can be given as : ', '\n', online)
print('cross tabulation can be given in percentage as : ', '\n', online.div(online.sum(1).astype(float),axis =0))

###### Observation

Whether or not a customer uses internet banking facilities doesn’t seem to affect the probability of buying a personal loan.

##### Let us now compare personal loan buyers who have or don’t have a securities account with the bank:

In [None]:
sns.countplot(x="Securities Account", data=bank_per_loan_df,hue="Personal Loan")

In [None]:
securities = pd.crosstab(bank_per_loan_df['Securities Account'],bank_per_loan_df['Personal Loan'])
securities.div(securities.sum(1).astype(float),axis =0).plot(kind='bar',stacked=True)
print('cross tabulation can be given as : ', '\n', securities)
print('cross tabulation can be given in percentage as : ', '\n', securities.div(securities.sum(1).astype(float),axis =0))

##### Observations

Whether or not a customer has a securities account with the bank does not seem to affect the probability of buying a personal loan.
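
The observations in this section are read off the stacked bar charts. One way to back such a reading with a number is a chi-square test of independence on the cross tabulation. This is a swapped-in technique that the notebook itself does not use, and the counts below are hypothetical, chosen only to show one table with identical loan rates and one with very different rates.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts, rows = flag value (0, 1), columns = Personal Loan (0, 1).
tables = {
    "similar loan rates": [[1800, 200], [2700, 300]],     # 10% vs 10%
    "different loan rates": [[4300, 400], [160, 140]],    # about 9% vs 47%
}

p_values = {}
for name, table in tables.items():
    chi2, p_value, dof, expected = chi2_contingency(table)
    p_values[name] = p_value
    verdict = "association" if p_value < 0.05 else "no evidence of association"
    print(f"{name}: chi2={chi2:.2f}, p={p_value:.4f} -> {verdict}")
```

With the real data, the table passed in would be the output of `pd.crosstab` for the Boolean column and 'Personal Loan'.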

# Influence of few attributes on 'Personal Loan' - Dependant Variable

In [None]:
sns.scatterplot(y = 'Income', x = 'Age', data = bank_per_loan_df, hue = 'Personal Loan')

In [None]:
plt.figure(figsize=(15,15))

plt.subplot(3,1,1)
# x and y are passed by keyword; recent seaborn versions reject them as positional arguments
sns.scatterplot(x='CCAvg', y='Income', hue='Personal Loan', data=bank_per_loan_df, palette=['lightpink','green'])

plt.subplot(3,1,2)
sns.scatterplot(x='Family', y='Income', hue='Personal Loan', data=bank_per_loan_df, palette=['lightblue','purple'])

plt.subplot(3,1,3)
sns.scatterplot(x='Income', y='Mortgage', hue='Personal Loan', data=bank_per_loan_df, palette=['lightgreen','green'])

##### Observation

    The graphs show that persons who have a personal loan have a higher credit card average.

    It is clearly visible that as the number of family members increases (say >=3), the need for a loan also increases.

    It is clear that as income increases (approx. 100K and above), the mortgage value also increases gradually, along with the need for a personal loan.

In [None]:
plt.figure(figsize=(15,15))

plt.subplot(3,1,1)
# x and y are passed by keyword; recent seaborn versions reject them as positional arguments
sns.scatterplot(x='Age', y='Experience', hue='Personal Loan', data=bank_per_loan_df, palette=['lightpink','green'])

plt.subplot(3,1,2)
sns.scatterplot(x='Education', y='Income', hue='Personal Loan', data=bank_per_loan_df, palette=['lightgreen','green'])

plt.subplot(3,1,3)
sns.scatterplot(x='Education', y='Mortgage', hue='Personal Loan', data=bank_per_loan_df, palette=['red','green'])

##### Observation

    'Age' has a very strong association with 'Experience', but neither shows a relationship with the loan attribute.
    It seems that customers with education level 1 have higher incomes, roughly equal to those of the customers who have taken the personal loan.
    Customers with education levels 2 & 3 who take a personal loan seem to have high mortgages.

In [None]:
plt.figure(figsize=(15,15))

plt.subplot(2,2,1)
sns.countplot(x="Securities Account", data=bank_per_loan_df ,hue="Personal Loan")

plt.subplot(2,2,2)
sns.countplot(x='CD Account' ,data=bank_per_loan_df ,hue='Personal Loan')

##### Observation

    Most customers, with or without a loan, do not hold a securities account; only a small proportion of customers with a loan hold one.
    Most customers without a CD account do not have a loan either, whereas a much larger share of customers with a CD account have a loan

In [None]:
sns.distplot(bank_per_loan_df[bank_per_loan_df["Personal Loan"] == 0]['Income'], color = 'b')
sns.distplot(bank_per_loan_df[bank_per_loan_df["Personal Loan"] == 1]['Income'], color = 'y')

##### Observation

    The graph shows that those who have a personal loan also have a higher income.

In [None]:
sns.distplot( bank_per_loan_df[bank_per_loan_df['Personal Loan'] == 0]['CCAvg'], color = 'r')
sns.distplot( bank_per_loan_df[bank_per_loan_df['Personal Loan'] == 1]['CCAvg'], color = 'g')

In [None]:
print('Credit card spending of Non-Loan customers: ',bank_per_loan_df[bank_per_loan_df['Personal Loan'] == 0]['CCAvg'].median()*1000)
print('Credit card spending of Loan customers    : ', bank_per_loan_df[bank_per_loan_df['Personal Loan'] == 1]['CCAvg'].median()*1000)

##### Observation: 

The graph shows that persons who have a personal loan have a higher credit card average. Customers who took a loan have a median monthly credit card spending of 3800 dollars, while customers who did not have a median of 1400 dollars, so higher spending indicates a higher probability of taking a personal loan. This could be useful information.
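
The two medians can be computed in a single step with `groupby`. The sketch below uses made-up spending values (in $000, like `CCAvg`); `toy_df` is hypothetical and only chosen so that the group medians are easy to check by eye.

```python
import pandas as pd

# Made-up monthly credit card spending in $000 for six customers.
toy_df = pd.DataFrame({
    "Personal Loan": [0, 0, 0, 1, 1, 1],
    "CCAvg": [0.5, 1.4, 2.0, 3.0, 3.8, 6.0],
})

# Median spending per loan group, converted from $000 to dollars.
median_spend_usd = toy_df.groupby("Personal Loan")["CCAvg"].median() * 1000
print(median_spend_usd)
```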

In [None]:
plt.figure(figsize=(10,5))
sns.relplot(x="Income", y="CCAvg" ,aspect = 2 ,data=bank_per_loan_df)
plt.show()
plt.clf()

##### Observation

Income and average credit card spending are related in a roughly linear fashion, and the points are densest in the 50k-100k income bracket.

In [None]:
fig, ax = plt.subplots()
colors = {1:'red',2:'yellow',3:'green'}
ax.scatter(bank_per_loan_df['Experience'],bank_per_loan_df['Age'],c=bank_per_loan_df['Education'].apply(lambda x:colors[x]))
plt.xlabel('Experience')
plt.ylabel('Age')

##### Observation:

The above plot shows that experience and age have a positive correlation: as experience increases, age also increases. The colors show the education level. There is a gap in the mid-forties of age, and there are more people at the undergraduate level

In [None]:
plt.figure(figsize=(10,5))
sns.relplot(x="Income", y="Mortgage", #hue="Personal Loan",
            col="Education", data=bank_per_loan_df)
plt.show()
plt.clf()

###### Observation


Customers with higher income and higher education levels have very few mortgages. Plus, there are some smart people who don't have any mortgages across all education levels.

The mortgages are mainly concentrated among individuals with an annual income of 0k-80k, irrespective of educational background.

In [None]:
plt.figure(figsize=(5,5))
sns.relplot(x="Experience", y="Income",col = "Education",
             data=bank_per_loan_df)
plt.show()
plt.clf()

##### Observation:

Income is spread evenly across experience levels and is not linearly related to experience.

It should be noted that even with Bachelor's-level education, individuals with more experience have higher incomes than their more educated counterparts.

The scatter plot is sparse above 100k USD for more highly educated customers with the same experience.

In [None]:
plt.figure(figsize=(5,5))
sns.relplot(x="Income", y="Mortgage",col = "Family",# hue="Education",
             data=bank_per_loan_df)
plt.show()
plt.clf()

##### Observation

Income and Mortgage are linearly related.

When we look at this relation with respect to the number of family members, we see that above 100k USD annual income, families of 3 and 4 have lower mortgages than families of 1 or 2.

In [None]:
melted_data = pd.melt(bank_per_loan_df.iloc[:,9:])
melted_data.loc[melted_data['value'] == 0 , ['value']] = 'No'
melted_data.loc[melted_data['value'] == 1 , ['value']] = 'Yes'
plt.figure(figsize=(15,5))
sns.countplot(x="variable", hue="value", data=melted_data)
plt.show()
plt.close()

##### Observation
From the above graph we can observe that a major portion of the customers have no securities or CD accounts.

Around 3000 use online banking, but not many use credit cards.

Credit card usage is generally highest among younger people.

The average age in this dataset is 45, so it makes sense that there are fewer credit card users.

In [None]:
numerical_1 = ['Age' , 'Experience' ,'Family' ,'Income' , 'CCAvg' , 'Mortgage']
fig, ax_1 = plt.subplots(2, 3, figsize=(15, 10))
for var_1, subplot in zip(numerical_1, ax_1.flatten()):
    sns.boxplot(x='Personal Loan', y=var_1, data=bank_per_loan_df, ax=subplot)
plt.show()
plt.clf()

Our EDA can conclude with an analysis of the numerical values against the categorical Personal Loan feature, and a box plot is a good way to do it.

The age of the customer is not a defining factor in whether the person will accept a personal loan.
Years of professional experience are also not a governing factor.

As we saw in the previous graph, a family of 3 or 4 has lower mortgages even with higher incomes. Based on this box plot, the reason may be that taking a personal loan with a higher interest rate seems justified to them. Will show it in a graph below.

As expected, the higher the income, the greater the chance that a person will accept a personal loan offer from the bank.

If one's average credit card spending per month is higher, they will probably accept a personal loan offer.
A higher mortgage means a customer might accept a personal loan offer.

# Checking for correlation

In [None]:
# Correlation with heat map
corr_overall = bank_per_loan_df.corr()
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5})
plt.figure(figsize=(13,7))
# create a mask so we only see the correlation values once
mask = np.zeros_like(corr_overall)
mask[np.triu_indices_from(mask, 1)] = True
a = sns.heatmap(corr_overall,mask=mask, annot=True, fmt='.2f')
rotx = a.set_xticklabels(a.get_xticklabels(), rotation=90)
roty = a.set_yticklabels(a.get_yticklabels(), rotation=30)

###### Observation

Income and CCAvg are moderately correlated.

Age and Experience are highly correlated

In [None]:
ncol_2 = ['Age', 'Income','CCAvg', 'Mortgage']
grid = sns.PairGrid(bank_per_loan_df, y_vars = 'Experience', x_vars = ncol_2, height = 4)
grid.map(sns.regplot);

##### Observation:

'Age' has a very strong association with 'Experience'.

Is there some association between personal characteristics and the fact that a person obtained a Personal Loan?

Let's check which values, or groups of values, of each variable fall in the group that has a 'Personal Loan' and in the group that does not.

Since we found a strong association between 'Age' and 'Experience', we decided to exclude 'Experience' from the following analysis steps to avoid multicollinearity.
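
Multicollinearity can be quantified with the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below is a hand-rolled NumPy version on synthetic data, written only to illustrate the idea: the helper `vif_per_column`, the random seed and the generated columns are all assumptions, and it is not the `variance_inflation_factor` function imported at the top of the notebook.

```python
import numpy as np

def vif_per_column(X):
    """VIF for each column: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - (residuals @ residuals) / ((target - target.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data: experience is almost a linear function of age, income is unrelated.
rng = np.random.default_rng(0)
age = rng.uniform(23, 67, size=500)
experience = age - 25 + rng.normal(0, 1, size=500)
income = rng.uniform(8, 224, size=500)

vifs = vif_per_column(np.column_stack([age, experience, income]))
for name, value in zip(["Age", "Experience", "Income"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A VIF well above 10 is a common warning sign, which is what the two nearly collinear columns produce here, while the unrelated column stays close to 1.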

##### QUANTITATIVE VARIABLES

['Age', 'Income', 'CCAvg', 'Mortgage'] with Personal Loan

In [None]:
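# .copy() gives an independent frame, so adding a column later does not raise SettingWithCopyWarning
quant_df = bank_per_loan_df[['Personal Loan', 'Age', 'Income', 'CCAvg', 'Mortgage']].copy()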

In [None]:
bank_per_loan_df[['Personal Loan', 'Age', 'Income', 'CCAvg', 'Mortgage']].corr()

In [None]:
sns.heatmap(bank_per_loan_df[['Personal Loan', 'Age', 'Income', 'CCAvg', 'Mortgage']].corr(), annot = True)

In [None]:
bank_per_loan_df[['Personal Loan', 'Age', 'Income', 'CCAvg', 'Mortgage']].corr()['Personal Loan'][1:].plot.bar()

##### Observation

    The chart above shows the correlation between each independent variable and the dependent variable: 'Income' and 'CCAvg' (average credit card spending) show some correlation with 'Personal Loan'.

##### Let's check our confidence in this statement with a logistic regression model:

In [None]:
quant_df['intercept'] = 1
log_mod_check = sm.Logit(quant_df['Personal Loan'], quant_df[['intercept', 'Age', 'Income', 'CCAvg', 'Mortgage']]).fit()

In [None]:
log_mod_check.summary()

In [None]:
# exclude 'intercept'
log_mod_check.pvalues[1:5].plot.bar()
plt.axhline(y = 0.05);

##### Observation

We can say with confidence that 'Income' and 'CCAvg' both have a statistically significant association with 'Personal Loan', since their p-values in the logistic regression are below 0.05.

###### The bar chart of coefficient distribution

In [None]:
# exclude 'intercept'
log_mod_check.params[1:5].plot.bar();

##### Observation

'CCAvg' has the largest coefficient, so it shows the strongest per-unit association with 'Personal Loan'.

Filter the variables with p-values less than 0.05 and store each variable and its coefficient in a dictionary.

In [None]:
quant_df_main = {}
for i in log_mod_check.params[1:5].to_dict().keys():
    if log_mod_check.pvalues[i] < 0.05:
        quant_df_main[i] = log_mod_check.params[i]
    else:
        continue

In [None]:
quant_df_main

In [None]:
quant_df_main_odds = {k : np.exp(v) for k, v in quant_df_main.items()}
quant_df_main_odds

##### Observation :

'Personal Loan' has a statistically significant association with:

    'Income' : coef = 0.03508
    'CCAvg' : coef = 0.06879
Both variables are positively associated with 'Personal Loan'. Since both are measured in units of $1000, we may say the following:

    For each $1000 increase in 'Income' we expect the odds of selling a Personal Loan to increase by 3.57%, holding everything else constant

    For each $1000 increase in 'CCAvg' we expect the odds of selling a Personal Loan to increase by 7.12%, holding everything else constant
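
The percentages above come from exponentiating the logistic regression coefficients: a coefficient b multiplies the odds by exp(b) per unit, so the percent change is (exp(b) - 1) × 100. A minimal sketch of that arithmetic, using the coefficients reported above (the helper name `odds_change_pct` is hypothetical):

```python
import math

def odds_change_pct(coef, delta=1.0):
    """Percent change in the odds for a `delta`-unit increase in a predictor."""
    return (math.exp(coef * delta) - 1.0) * 100.0

# Coefficients as reported in the observation above.
coefs = {"Income": 0.03508, "CCAvg": 0.06879}

for name, coef in coefs.items():
    print(f"{name}: odds ratio = {math.exp(coef):.4f}, "
          f"change per $1000 = {odds_change_pct(coef):.2f}%")

# Effects compound multiplicatively: a $10k rise in Income multiplies the
# odds by exp(10 * coef), which is more than 10 times the one-unit percentage.
income_10k = odds_change_pct(coefs["Income"], delta=10)
print(f"Income +$10k: {income_10k:.2f}%")
```

The last line is a reminder that the per-unit percentage cannot simply be multiplied by the size of the change.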

###### CATEGORICAL VARIABLES

'ZIP Code', 'Family', 'Education'

'Family' and 'Education' are ordinal categorical variables, so we may apply logistic regression directly to them. 'ZIP Code' is nominal, so we need to build dummy variables to check whether an association exists.

In [None]:
cat_df = bank_per_loan_df[['ZIP Code', 'Family', 'Education', 'Personal Loan']].copy()

'Family' and 'Education'

In [None]:
cat_df.corr()

In [None]:
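# select by label: the positional slice [0:2] would pick 'ZIP Code' and 'Family'
cat_df.corr()['Personal Loan'][['Family', 'Education']]

In [None]:
cat_df.corr()['Personal Loan'][['Family', 'Education']].plot.bar();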

###### Observation :

'Family' and 'Education' have a low association with 'Personal Loan'.




Let's check our confidence with logistic regression.

In [None]:
cat_df['intercept'] = 1
log_mod_1 = sm.Logit(cat_df['Personal Loan'], cat_df[['intercept', 'Family', 'Education']]).fit()

In [None]:
log_mod_1.summary()

##### Observation:

We can say with confidence that 'Family' and 'Education' both have a statistically significant association with 'Personal Loan', since their p-values in the logistic regression are below 0.05.

The bar chart of coefficient distribution

In [None]:
# exclude 'intercept'
log_mod_1.params[1:3].plot.bar();

##### Observation:

'Education' has the strongest association with 'Personal Loan'.

In [None]:
cat_df_main = {}
for i in log_mod_1.params[1:3].to_dict().keys():
    if log_mod_1.pvalues[i] < 0.05:
        cat_df_main[i] = log_mod_1.params[i]
    else:
        continue 
    
cat_df_main

In [None]:
cat_df_odds_1 = {k : np.exp(v) for k, v in cat_df_main.items()}
cat_df_odds_1

##### Observation:

Conclusion:

'Personal Loan' has a statistically significant association with:

    'Family' : coef = 0.16231
    'Education' : coef = 0.54873
Both variables are positively associated with 'Personal Loan'. We may say the following:

    For each unit increase in 'Family' we expect the odds of selling a Personal Loan to increase by 17.62%, holding everything else constant

    For each unit increase in 'Education' we expect the odds of selling a Personal Loan to increase by 73.11%, holding everything else constant

##### ZIP Code

In [None]:
cat_df.head()

In [None]:
zip_df = cat_df[['Personal Loan', 'intercept','ZIP Code']].copy()
zip_df.head(2)

Let's check how we can group the 'ZIP Code' values to minimize the number of dummies.

In [None]:
zip_df['ZIP Code'].nunique()

In [None]:
# make a string version of the original column, call it 'ZIP Code_str'
zip_df['ZIP Code_str'] = zip_df['ZIP Code'].astype(str)
zip_df.info()

# make the new columns using string indexing
zip_df['zip_df_2'] = zip_df['ZIP Code_str'].str[0:2]
zip_df['zip_df_3'] = zip_df['ZIP Code_str'].str[0:3]
zip_df.info()
zip_df.head()

In [None]:
zip_df['zip_df_3'].nunique()

In [None]:
zip_df['zip_df_2'].nunique()

In [None]:
zip_df['zip_df_2'].value_counts()

This grouping looks fine for a first view, since we assume that the initial Personal Loan campaign was spread evenly across all ZIP codes.

Before we get dummies, let us drop 'ZIP Code', 'ZIP Code_str' and 'zip_df_3'.

In [None]:
zip_2_df = copy.deepcopy(zip_df)
zip_2_df

In [None]:
zip_2_df.drop(['ZIP Code', 'ZIP Code_str','zip_df_3'], axis=1, inplace=True)
zip_2_df

Let's get dummies...

In [None]:
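# dtype=int keeps the dummies numeric (recent pandas versions default to bool), which the Logit fit below needs
dum_zip_df = pd.get_dummies(zip_2_df, prefix = "Z", drop_first = True, dtype = int)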
dum_zip_df

In [None]:
# Fit a logit model
# exclude 'Personal Loan' from the independent variables
dum_zip_df_columns = dum_zip_df.columns.drop('Personal Loan').tolist()
dum_zip_df_columns

In [None]:
log_mod_2 = sm.Logit(dum_zip_df['Personal Loan'], dum_zip_df[dum_zip_df_columns]).fit()

In [None]:
log_mod_2.summary()

##### Observation

We find no statistically significant association between any ZIP Code group and 'Personal Loan', since all of their p-values in the logistic regression are above 0.05.

##### BINARY VARIABLES

'Securities Account', 'CD Account', 'Online', 'CreditCard'

In [None]:
bin_df = bank_per_loan_df[['Personal Loan', 'Securities Account', 'CD Account', 'Online', 'CreditCard']].copy()
bin_df.head()

In [None]:
bin_df.corr()['Personal Loan']

In [None]:
bin_df.corr()['Personal Loan'][1:].plot.bar();

##### Observation

'CD Account' is the only variable with a moderate association.

In [None]:
#Let's fit logistic regression
bin_df['intercept'] = 1
bin_df_colmn = bin_df.columns.drop('Personal Loan').tolist()
log_mod_3 = sm.Logit(bin_df['Personal Loan'], bin_df[bin_df_colmn]).fit()

In [None]:
log_mod_3.summary()

In [None]:
log_mod_4 = sm.Logit(bin_df['Personal Loan'], bin_df[['intercept', 'CD Account']]).fit()

In [None]:
log_mod_4.summary()

In [None]:
# look the coefficient up by label rather than by position
bin_odds = {'CD Account' : np.exp(log_mod_4.params['CD Account'])}
bin_odds

##### Observation


Conclusion:

'Personal Loan' has a statistically significant positive association with only:

    'CD Account' : coef = 2.40
We may say the following:

    For a customer who holds a CD Account with the bank, we expect the odds of selling a Personal Loan to be about 11 times higher, holding everything else constant

### Summary Conclusion:

'Personal Loan' has a statistically significant association with:

    'CD Account' : coef = 2.40 : odds ratio = 11.07
    'Family' : coef = 0.16231 : odds ratio = 1.176
    'Education' : coef = 0.54873 : odds ratio = 1.731
    'Income' : coef = 0.03508 : odds ratio = 1.036
    'CCAvg' : coef = 0.06879 : odds ratio = 1.071
All five variables are positively associated with 'Personal Loan'. We may say the following:

    For a customer who holds a CD Account with the bank, we expect the odds of selling a Personal Loan to be about 11 times higher, holding everything else constant

    For each unit increase in 'Family' we expect the odds of selling a Personal Loan to increase by 17.62%, holding everything else constant

    For each unit increase in 'Education' we expect the odds of selling a Personal Loan to increase by 73.11%, holding everything else constant

    For each $1000 increase in 'Income' we expect the odds of selling a Personal Loan to increase by 3.57%, holding everything else constant

    For each $1000 increase in 'CCAvg' we expect the odds of selling a Personal Loan to increase by 7.12%, holding everything else constant

###### ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Question 3: Get the target distribution.

## Target Label

Personal Loan will be the target

In [None]:
# 0 = didn't take the loan in the last campaign (90.4%)
# 1 = took the loan in the last campaign (9.6%)
bank_per_loan_df["Personal Loan"].value_counts().to_frame()

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
bank_per_loan_df["Personal Loan"].value_counts().plot(kind="bar")

In [None]:
count_no_sub = len(bank_per_loan_df[bank_per_loan_df['Personal Loan']==0])
print('count_no_sub :',count_no_sub)
count_sub = len(bank_per_loan_df[bank_per_loan_df['Personal Loan']==1])
print('count_sub :',count_sub)
pct_of_no_sub = count_no_sub/(count_no_sub+count_sub)
print("percentage of no subscription is", pct_of_no_sub*100)
pct_of_sub = count_sub/(count_no_sub+count_sub)
print("percentage of subscription is", pct_of_sub*100)

##### Looking into the distribution of the various attributes in relation to the target.

In [None]:
bank_per_loan_df.groupby(bank_per_loan_df['Personal Loan']).mean()

##### Observations: 
    
1). The average income of customers who took the loan is more than double that of customers who didn't take the loan last year.

2). The average spending on credit cards per month ($000) is also more than double for the customers who took the loan.

3). The average mortgage of customers who took the loan is approximately double that of customers who did not.

4). The average education level is lower for customers who did not take the loan.

As given in the data description, 9.6% of customers took the loan in the last campaign.
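
With only 9.6% positives, accuracy on its own is a weak yardstick: a model that never predicts a loan already scores about 90%. The sketch below illustrates this on a hypothetical 1,000-customer sample with the same acceptance rate (the sample is an assumption, not the bank data).

```python
# Hypothetical 1,000-customer sample with the same 9.6% acceptance rate.
y_true = [1] * 96 + [0] * 904
y_pred = [0] * len(y_true)          # always predict "no loan"

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(y_true)

accuracy = correct / len(y_true)    # share of all customers classified correctly
recall = true_pos / actual_pos      # share of actual buyers that were found
print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

This is why recall and precision for class 1 deserve attention alongside accuracy when the models are compared.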

###### ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Question 4: Split the data into training and test set in the ratio of 70:30 respectively.

# Data Split 70:30 Ratio

In [None]:
train_set, test_set = train_test_split(bank_per_loan_df.drop(['ID','Experience'], axis=1), test_size=0.3 , random_state= 77)

In [None]:
train_labels = train_set.pop('Personal Loan')
test_labels = test_set.pop('Personal Loan')

In [None]:
train_set_indep = bank_per_loan_df.drop(['Experience' ,'ID'] , axis = 1).drop(labels= "Personal Loan" , axis = 1)
train_set_dep = bank_per_loan_df["Personal Loan"]
X = np.array(train_set_indep)
Y = np.array(train_set_dep)
# first 3500 rows (70%) for training, all remaining rows for testing
X_Train = X[:3500, :]
X_Test = X[3500:, :]
Y_Train = Y[:3500]
Y_Test = Y[3500:]

#### I also show a couple of model variations and scale the data with the standard scaler method to improve model accuracy, as shown later below.

###### --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Question 5: Use different classification models (Logistic, K-NN and Naïve Bayes) to predict the likelihood of a liability customer buying personal loans

# Logistic Regression

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_Train,Y_Train)

In [None]:
predict = logmodel.predict(X_Test)
predictProb = logmodel.predict_proba(X_Test)

In [None]:
# Confusion Matrix
cm = confusion_matrix(Y_Test, predict)
class_label = ["Negative", "Positive"]  # confusion_matrix orders rows and columns as class 0, then class 1
df_cm = pd.DataFrame(cm, index = class_label, columns = class_label)
sns.heatmap(df_cm, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# Classification Report
print(classification_report(Y_Test, predict))

In [None]:
predictProb = logmodel.predict_proba(X_Test)
skplt.metrics.plot_cumulative_gain(Y_Test, predictProb)
plt.show()

In [None]:
skplt.metrics.plot_precision_recall(Y_Test, predictProb)
plt.show()

# K-NN

##### Choosing K on the hold-out set
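
The loop that follows scores every K on the hold-out test set, so the reported optimum is tuned to that particular split. A minimal sketch of the cross-validated alternative, which picks K from training data alone, is shown here using only the standard library. The two-cluster toy data and the helper names `knn_predict` and `kfold_accuracy` are assumptions for illustration, not part of the notebook.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def kfold_accuracy(X, y, k_neighbors, n_folds=5):
    """Mean accuracy over interleaved folds; each fold is held out once."""
    fold_scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        hits = sum(
            knn_predict(train_X, train_y, X[i], k_neighbors) == y[i]
            for i in test_idx
        )
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / n_folds

# Toy data (assumption): two well-separated clusters, ten points each.
X = [(i * 0.1, i * 0.1) for i in range(10)] + \
    [(5 + i * 0.1, 5 + i * 0.1) for i in range(10)]
y = [0] * 10 + [1] * 10

scores = {k: kfold_accuracy(X, y, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)   # first K with the highest mean accuracy
print(scores, best_k)
```

In the notebook's own stack, `model_selection.cross_val_score`, which the Model Comparison cell uses, plays the role of `kfold_accuracy`.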

In [None]:
# Creating odd list of K for KNN
myList = list(range(1,20))
# Subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))

In [None]:
# Empty list that will hold accuracy scores
ac_scores = []

# Perform accuracy metrics for values from 1,3,5....19
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_Train, Y_Train)
    
    # Predict the response
    Y_Pred = knn.predict(X_Test)
    
    # Evaluate accuracy
    scores = accuracy_score(Y_Test, Y_Pred)
    ac_scores.append(scores)

# Changing to misclassification error
MSE = [1 - x for x in ac_scores]

# Determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)

##### Model

In [None]:
knn = KNeighborsClassifier(n_neighbors= 13 , weights = 'uniform', metric = 'euclidean')
knn.fit(X_Train, Y_Train)    
predicted = knn.predict(X_Test)
acc = accuracy_score(Y_Test, predicted)
print(acc)

##### Misclassification Error vs K

In [None]:
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')

In [None]:
# Confusion Matrix
cm1 = confusion_matrix(Y_Test, predicted)
class_label = ["Negative", "Positive"]  # confusion_matrix orders rows and columns as class 0, then class 1
df_cm1 = pd.DataFrame(cm1, index = class_label, columns = class_label)
sns.heatmap(df_cm1, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# Classification Report
print(classification_report(Y_Test, predicted))

In [None]:
y_probas17= knn.predict_proba(X_Test)
skplt.metrics.plot_cumulative_gain(Y_Test, y_probas17)
plt.show()

In [None]:
skplt.metrics.plot_precision_recall(Y_Test, y_probas17)
plt.show()

# Naive Bayes

In [None]:
# Model
naive_model = GaussianNB()
naive_model.fit(train_set, train_labels)
prediction = naive_model.predict(test_set)
naive_model.score(test_set,test_labels)

In [None]:
# Confusion Matrix
cm2 = confusion_matrix(test_labels, prediction)
class_label = ["Negative", "Positive"]  # confusion_matrix orders rows and columns as class 0, then class 1
df_cm2 = pd.DataFrame(cm2, index = class_label, columns = class_label)
sns.heatmap(df_cm2, annot = True, fmt = "d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# Classification Report
print(classification_report(test_labels, prediction))

In [None]:
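# score the same hold-out split (test_set / test_labels) that the Naive Bayes model was evaluated on above
y_probas67 = naive_model.predict_proba(test_set)
skplt.metrics.plot_cumulative_gain(test_labels, y_probas67)
plt.show()

In [None]:
skplt.metrics.plot_precision_recall(test_labels, y_probas67)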
plt.show()

# Model Comparison

In [None]:
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('LR', LogisticRegression()))
models.append(('NB', GaussianNB()))

# Evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
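    # random_state only takes effect (and is only accepted by recent scikit-learn) when shuffle=True
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=12345)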
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

###### ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# 6. Print the confusion matrix for all the above models
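
When reading the matrices in this section, it helps to recall the layout scikit-learn's `confusion_matrix` uses: rows are true labels, columns are predicted labels, and classes appear in sorted order, so class 0 (no loan) comes first. The standard-library sketch below rebuilds that layout on made-up labels and derives precision and recall from it; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix [[TN, FP], [FN, TP]]: rows are true labels, columns predicted."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels for illustration (assumption); 1 = accepted the loan.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of the predicted buyers, how many really bought
recall = tp / (tp + fn)      # of the real buyers, how many were predicted
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For the campaign, the bottom row matters most: false negatives are buyers the bank would never contact.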

In [None]:
# Confusion Matrix
cm = confusion_matrix(Y_Test, predict)
class_label = ["Negative", "Positive"]  # confusion_matrix orders rows and columns as class 0, then class 1
df_cm = pd.DataFrame(cm, index = class_label, columns = class_label)
sns.heatmap(df_cm, annot = True, fmt = "d")
plt.title("Confusion Matrix via Logistic Regression")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# Confusion Matrix
cm1 = confusion_matrix(Y_Test, predicted)
class_label = ["Negative", "Positive"]  # confusion_matrix orders rows and columns as class 0, then class 1
df_cm1 = pd.DataFrame(cm1, index = class_label, columns = class_label)
sns.heatmap(df_cm1, annot = True, fmt = "d")
plt.title("Confusion Matrix via K-NN")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# Confusion Matrix
cm2 = confusion_matrix(test_labels, prediction)
class_label = ["Negative", "Positive"]  # confusion_matrix orders rows and columns as class 0, then class 1
df_cm2 = pd.DataFrame(cm2, index = class_label, columns = class_label)
sns.heatmap(df_cm2, annot = True, fmt = "d")
plt.title("Confusion Matrix via Naive Bayes")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# 7. Give your reasoning on which is the best model in this case and why it performs better?



### Model Comparison

In [None]:
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('LR', LogisticRegression()))
models.append(('NB', GaussianNB()))

# Evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
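    # random_state only takes effect (and is only accepted by recent scikit-learn) when shuffle=True
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=12345)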
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    
# Boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()


Summary

    The aim of the bank is to convert its liability customers into loan customers.
    They want to set up a new marketing campaign; hence, they need information about the connection between the variables given in the data.
    Three classification algorithms were used in this study.
    From the graph above, the 'Logistic Regression' algorithm appears to have the highest accuracy, so we can choose it as our final model.
    
    
    The Logistic Regression model is the best because its accuracy on the train and test sets is nearly the same and its precision and recall are good. Its confusion matrix is also better than those of the other models.

The requirement is to classify the target. KNN is distance-based, which is not ideal for this situation. Although its accuracy is good, the confusion matrix shows that its correct predictions are not at an acceptable level.

Naive Bayes gives lower accuracy than the other models, meaning it is less likely to determine the target correctly.




##### ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

#### Another approach to modelling: removing outliers with a z-score greater than 3, scaling the data via standardization, ROC, and others

In [None]:
bank_per_loan_1_df = copy.deepcopy(bank_per_loan_df)
bank_per_loan_1_df

Because of these outliers, the bulk of the Mortgage data sits on the left and the right tail is longer. This is called right skewness. One way to reduce the effect of the extreme values is to drop the rows whose absolute z-score is 3 or more.

In [None]:
bank_per_loan_1_df['Mortgage_zscore'] = np.abs(stats.zscore(bank_per_loan_1_df['Mortgage']))
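# keep rows within 3 standard deviations; reassign instead of dropping in place on a filtered slice
bank_per_loan_1_df = bank_per_loan_1_df[bank_per_loan_1_df['Mortgage_zscore']<3]
bank_per_loan_1_df = bank_per_loan_1_df.drop('Mortgage_zscore', axis=1)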

In [None]:
bank_per_loan_1_df.shape

The outliers are removed.
Here I kept only the rows whose z-score is less than 3; the threshold can be varied as needed. This dropped more than 100 rows containing outliers, and now we can start building the model.
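
The filter can be sketched without SciPy to show what the z-score step does. The version below uses the population standard deviation, which is also the default of `stats.zscore`; the mortgage values are made up for illustration and `zscore_filter` is a hypothetical helper.

```python
import statistics

def zscore_filter(values, threshold=3.0):
    """Keep values whose absolute z-score (population std) is below threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std < threshold]

# Made-up mortgage values in $000 (assumption): mostly zero, one extreme.
mortgages = [0] * 15 + [80, 95, 110, 120, 635]
kept = zscore_filter(mortgages)

print(len(mortgages), "->", len(kept), "rows; max kept =", max(kept))
```

Note that the remaining values are still right-skewed, since most of them are zero: the filter trims the far tail rather than making the distribution symmetric.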

In [None]:
bank_per_loan_1_df.info()

Also, the 'ID', 'ZIP Code' and 'Experience' columns are not needed for further analysis: 'ID' and 'ZIP Code' are just identifiers, and 'Experience' is highly correlated with 'Age'.

In [None]:
# also dropping 'ID', 'Experience' and 'ZIP Code'
bank_per_loan_1_df.drop('ID', axis =1, inplace =True)
bank_per_loan_1_df.drop('Experience', axis =1, inplace =True)
bank_per_loan_1_df.drop('ZIP Code', axis =1, inplace =True)
bank_per_loan_1_df.shape

In [None]:
bank_per_loan_1_df.info()

# Logistic Regression

In [None]:
X_1 = bank_per_loan_1_df.drop('Personal Loan', axis =1) 
y_1 = bank_per_loan_1_df['Personal Loan']

X1_train, X1_test, y1_train, y1_test = train_test_split(X_1,y_1,test_size=0.3, random_state = 2)

In [None]:
LogReg_model = LogisticRegression()
LogReg_model.fit(X1_train,y1_train)

In [None]:
y_1_pred = LogReg_model.predict(X1_test)
print(classification_report(y1_test,y_1_pred))
print(accuracy_score(y1_test,y_1_pred))
print(confusion_matrix(y1_test,y_1_pred))

In [None]:
LogReg_prob = LogReg_model.predict_proba(X1_test)
fpr1,tpr1, thresholds1 = roc_curve(y1_test, LogReg_prob[:,1])
roc_auc1 = auc(fpr1,tpr1)
print("Area under the ROC curve  :  %f" %roc_auc1)

#### STANDARDIZATION
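
A common convention for standardization is to learn the mean and standard deviation from the training split only and to reuse them on the test split, so that no information from the test rows leaks into preprocessing. With scikit-learn's `StandardScaler` this corresponds to `fit_transform` on the training data and `transform` on the test data. The standard-library sketch below shows the idea on made-up incomes; `fit_scaler` and `transform` are hypothetical helpers.

```python
import statistics

def fit_scaler(train_column):
    """Learn mean and population std from the training column only."""
    return statistics.fmean(train_column), statistics.pstdev(train_column)

def transform(column, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mean) / std for v in column]

# Made-up incomes in $000 (assumption).
train_income = [20, 40, 60, 80, 100]
test_income = [50, 150]

mean, std = fit_scaler(train_income)
scaled_train = transform(train_income, mean, std)
scaled_test = transform(test_income, mean, std)

print([round(v, 3) for v in scaled_train])
print([round(v, 3) for v in scaled_test])
```

The scaled training column has mean 0 and standard deviation 1 by construction; the test column generally does not, and that is expected.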

In [None]:
col_names_standard = bank_per_loan_1_df.columns
scaler_1 = preprocessing.StandardScaler()
# fit the scaler on the training data only, then reuse its statistics on the test data
scaled_X1_train = scaler_1.fit_transform(X1_train)
scaled_X1_test = scaler_1.transform(X1_test)

In [None]:
LogReg_scaled_model = LogisticRegression()
LogReg_scaled_model.fit(scaled_X1_train,y1_train)

In [None]:
y_1_scaled_pred = LogReg_scaled_model.predict(scaled_X1_test)
print(classification_report(y1_test,y_1_scaled_pred))
print(accuracy_score(y1_test,y_1_scaled_pred))
print(confusion_matrix(y1_test,y_1_scaled_pred))

In [None]:
LogReg_scaled_prob = LogReg_scaled_model.predict_proba(scaled_X1_test)
fpr2,tpr2, thresholds2 = roc_curve(y1_test, LogReg_scaled_prob[:,1])
roc_auc2 = auc(fpr2,tpr2)
print("Area under the ROC curve  :  %f" % roc_auc2)

We get an increase in accuracy, and the evaluation metrics clearly differ once the data is standardized. As mentioned before, accuracy alone cannot tell us how well the model predicts, so we also look at recall.

We get a recall value of 66%, which means the model did much better at identifying true positives.

Also, the area under the curve is around 96%, much higher than before.

From here on, we analyze the other models with scaled data only.
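
The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen buyer receives a higher predicted score than a randomly chosen non-buyer. The sketch below computes it from that definition on made-up scores, using only the standard library; `roc_auc` is a hypothetical helper, not the `roc_curve` / `auc` pair used in the cells.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities (assumption).
y_true = [0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.30, 0.60, 0.40, 0.70, 0.90]

auc_value = roc_auc(y_true, scores)
print(f"AUC = {auc_value:.4f}")
```

Because it depends only on how the scores rank customers, AUC is unaffected by the 0.5 decision threshold that accuracy and recall rely on.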

# Naive Bayes

In [None]:
naive_model_2 = GaussianNB()
naive_model_2.fit(scaled_X1_train, y1_train)

y_2_scaled_pred = naive_model_2.predict(scaled_X1_test)
print(classification_report(y1_test,y_2_scaled_pred))
print(accuracy_score(y1_test,y_2_scaled_pred))
print(confusion_matrix(y1_test,y_2_scaled_pred))

In [None]:
naive_scaled_prob = naive_model_2.predict_proba(scaled_X1_test)
fpr3,tpr3, thresholds3 = roc_curve(y1_test, naive_scaled_prob[:,1])
roc_auc3 = auc(fpr3,tpr3)
print("Area under the ROC curve  :  %f" % roc_auc3)

We got an accuracy score of around 90% with a recall value of 64%, which is lower than that of Logistic Regression.

Also, the area under the curve is around 93%, lower than for Logistic Regression.

Hence Naive Bayes turns out not to be a good classifier for this particular dataset.

# kNN

In [None]:
kNN_scaled_model = KNeighborsClassifier(n_neighbors= 3)
kNN_scaled_model.fit(scaled_X1_train, y1_train)  

y_3_scaled_pred = kNN_scaled_model.predict(scaled_X1_test)
print(classification_report(y1_test,y_3_scaled_pred))
print(accuracy_score(y1_test,y_3_scaled_pred))
print(confusion_matrix(y1_test,y_3_scaled_pred))

In [None]:
kNN_scaled_prob = kNN_scaled_model.predict_proba(scaled_X1_test)
fpr4,tpr4, thresholds4 = roc_curve(y1_test, kNN_scaled_prob[:,1])
roc_auc4 = auc(fpr4,tpr4)
print("Area under the ROC curve  :  %f" % roc_auc4)

And here we are with around 97% accuracy in determining whether a customer will buy the personal loan or not. The recall value of 75% is also much better than that of the Logistic Regression and Naive Bayes algorithms, and the area under the curve is fairly good.

# SVM (Support Vector Machine)

In [None]:
clf_1 = svm.SVC(C=3, kernel ='rbf', probability = True)
clf_1.fit(scaled_X1_train, y1_train)

y_4_scaled_pred = clf_1.predict(scaled_X1_test)
print(classification_report(y1_test,y_4_scaled_pred))
print(accuracy_score(y1_test,y_4_scaled_pred))
print(confusion_matrix(y1_test,y_4_scaled_pred))

In [None]:
svm_scaled_prob = clf_1.predict_proba(scaled_X1_test)
fpr5,tpr5, thresholds5 = roc_curve(y1_test, svm_scaled_prob[:,1])
roc_auc5 = auc(fpr5,tpr5)
print("Area under the ROC curve  :  %f" % roc_auc5)

We got a 98% accuracy score with an 81% recall value, and the area under the curve is about 99%. Just wow!

##### -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Another approach: breaking the features down into sub-categories, then doing the model analysis

In [None]:
loan_ml1_df = copy.deepcopy(bank_per_loan_df)
loan_ml1_df = loan_ml1_df.drop('Experience', axis=1)
loan_ml1_df = loan_ml1_df.drop('ID', axis=1)
loan_ml1_df

In [None]:
max_bin = 20
force_bin = 3

# define a binning function
def mono_bin(Y, X, n = max_bin):
    
    df1 = pd.DataFrame({"X": X, "Y": Y})
    justmiss = df1[['X','Y']][df1.X.isnull()]
    notmiss = df1[['X','Y']][df1.X.notnull()]
    r = 0
    while np.abs(r) < 1:
        try:
            d1 = pd.DataFrame({"X": notmiss.X, "Y": notmiss.Y, "Bucket": pd.qcut(notmiss.X, n)})
            d2 = d1.groupby('Bucket', as_index=True)
            r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
            n = n - 1 
        except Exception as e:
            n = n - 1

    if len(d2) == 1:
        n = force_bin         
        bins = np.quantile(notmiss.X, np.linspace(0, 1, n))  # public NumPy call in place of the private pandas helper
        if len(np.unique(bins)) == 2:
            bins = np.insert(bins, 0, 1)
            bins[1] = bins[1]-(bins[1]/2)
        d1 = pd.DataFrame({"X": notmiss.X, "Y": notmiss.Y, "Bucket": pd.cut(notmiss.X, np.unique(bins),include_lowest=True)}) 
        d2 = d1.groupby('Bucket', as_index=True)
    
    d3 = pd.DataFrame({},index=[])
    d3["MIN_VALUE"] = d2.min().X
    d3["MAX_VALUE"] = d2.max().X
    d3["COUNT"] = d2.count().Y
    d3["EVENT"] = d2.sum().Y
    d3["NONEVENT"] = d2.count().Y - d2.sum().Y
    d3=d3.reset_index(drop=True)
    
    if len(justmiss.index) > 0:
        d4 = pd.DataFrame({'MIN_VALUE':np.nan},index=[0])
        d4["MAX_VALUE"] = np.nan
        d4["COUNT"] = justmiss.count().Y
        d4["EVENT"] = justmiss.sum().Y
        d4["NONEVENT"] = justmiss.count().Y - justmiss.sum().Y
        d3 = pd.concat([d3, d4], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    d3["EVENT_RATE"] = d3.EVENT/d3.COUNT
    d3["NON_EVENT_RATE"] = d3.NONEVENT/d3.COUNT
    d3["DIST_EVENT"] = d3.EVENT/d3.sum().EVENT
    d3["DIST_NON_EVENT"] = d3.NONEVENT/d3.sum().NONEVENT
    d3["WOE"] = np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["IV"] = (d3.DIST_EVENT-d3.DIST_NON_EVENT)*np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["VAR_NAME"] = "VAR"
    d3 = d3[['VAR_NAME','MIN_VALUE', 'MAX_VALUE', 'COUNT', 'EVENT', 'EVENT_RATE', 'NONEVENT', 'NON_EVENT_RATE', 'DIST_EVENT','DIST_NON_EVENT','WOE', 'IV']]       
    d3 = d3.replace([np.inf, -np.inf], 0)
    d3.IV = d3.IV.sum()
    
    return(d3)

def char_bin(Y, X):
        
    df1 = pd.DataFrame({"X": X, "Y": Y})
    justmiss = df1[['X','Y']][df1.X.isnull()]
    notmiss = df1[['X','Y']][df1.X.notnull()]    
    df2 = notmiss.groupby('X',as_index=True)
    
    d3 = pd.DataFrame({},index=[])
    d3["COUNT"] = df2.count().Y
    d3["MIN_VALUE"] = df2.sum().Y.index
    d3["MAX_VALUE"] = d3["MIN_VALUE"]
    d3["EVENT"] = df2.sum().Y
    d3["NONEVENT"] = df2.count().Y - df2.sum().Y
    
    if len(justmiss.index) > 0:
        d4 = pd.DataFrame({'MIN_VALUE':np.nan},index=[0])
        d4["MAX_VALUE"] = np.nan
        d4["COUNT"] = justmiss.count().Y
        d4["EVENT"] = justmiss.sum().Y
        d4["NONEVENT"] = justmiss.count().Y - justmiss.sum().Y
        d3 = pd.concat([d3, d4], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    d3["EVENT_RATE"] = d3.EVENT/d3.COUNT
    d3["NON_EVENT_RATE"] = d3.NONEVENT/d3.COUNT
    d3["DIST_EVENT"] = d3.EVENT/d3.sum().EVENT
    d3["DIST_NON_EVENT"] = d3.NONEVENT/d3.sum().NONEVENT
    d3["WOE"] = np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["IV"] = (d3.DIST_EVENT-d3.DIST_NON_EVENT)*np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["VAR_NAME"] = "VAR"
    d3 = d3[['VAR_NAME','MIN_VALUE', 'MAX_VALUE', 'COUNT', 'EVENT', 'EVENT_RATE', 'NONEVENT', 'NON_EVENT_RATE', 'DIST_EVENT','DIST_NON_EVENT','WOE', 'IV']]      
    d3 = d3.replace([np.inf, -np.inf], 0)
    d3.IV = d3.IV.sum()
    d3 = d3.reset_index(drop=True)
    
    return(d3)

def data_vars(df1, target):
    
    # Use the target Series' own name to skip the target column.
    # (Parsing the caller's source line breaks on names with spaces such as 'Personal Loan'.)
    final = str(getattr(target, 'name', ''))
    
    x = df1.dtypes.index
    count = -1
    
    for i in x:
        if i.upper() not in (final.upper()):
            if np.issubdtype(df1[i], np.number) and len(Series.unique(df1[i])) > 2:
                conv = mono_bin(target, df1[i])
                conv["VAR_NAME"] = i
                count = count + 1
            else:
                conv = char_bin(target, df1[i])
                conv["VAR_NAME"] = i            
                count = count + 1
                
            if count == 0:
                iv_df = conv
            else:
                iv_df = pd.concat([iv_df, conv], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    iv = pd.DataFrame({'IV':iv_df.groupby('VAR_NAME').IV.max()})
    iv = iv.reset_index()
    return(iv_df,iv)

In [None]:
final_iv, IV = data_vars(loan_ml1_df,loan_ml1_df['Personal Loan'])

In [None]:
final_iv

In [None]:
IV

In [None]:
IV.sort_values('IV')
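
The `WOE` and `IV` columns produced above follow two short formulas: for each bin, WOE = ln(DIST_EVENT / DIST_NON_EVENT), and IV is the sum over bins of (DIST_EVENT - DIST_NON_EVENT) × WOE. The standard-library sketch below applies them to made-up counts. `woe_iv` is a hypothetical helper, and it assumes every bin has at least one event and one non-event, whereas the notebook's functions replace infinite values with 0.

```python
import math

def woe_iv(bins):
    """bins: list of (events, non_events) per bin. Returns (WOE list, IV)."""
    total_events = sum(e for e, _ in bins)
    total_non_events = sum(n for _, n in bins)
    woes, iv = [], 0.0
    for events, non_events in bins:
        dist_event = events / total_events              # share of all buyers in this bin
        dist_non_event = non_events / total_non_events  # share of all non-buyers in this bin
        woe = math.log(dist_event / dist_non_event)
        woes.append(woe)
        iv += (dist_event - dist_non_event) * woe
    return woes, iv

# Made-up counts per income band (assumption): (buyers, non-buyers).
income_bins = [(10, 400), (30, 300), (60, 200)]
woes, iv = woe_iv(income_bins)

print([round(w, 4) for w in woes], round(iv, 4))
```

A positive WOE marks a bin where buyers are over-represented; every term in the IV sum is non-negative, so a larger IV means the variable separates buyers from non-buyers more strongly.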

In [None]:
loan_ml1_df['California']=(loan_ml1_df['ZIP Code']<96200).astype(int)
loan_ml1_df['undergraduate']=(loan_ml1_df['Education']==1).astype(int)
loan_ml1_df['graduate']=(loan_ml1_df['Education']==2).astype(int)
loan_ml1_df['family_1']=(loan_ml1_df['Family']==1).astype(int)
loan_ml1_df['family_2']=(loan_ml1_df['Family']==2).astype(int)
loan_ml1_df['family_3']=(loan_ml1_df['Family']==3).astype(int)
loan_ml1_df=loan_ml1_df.drop('ZIP Code',axis=1)
loan_ml1_df=loan_ml1_df.drop('Education',axis=1)
loan_ml1_df=loan_ml1_df.drop('Family',axis=1)
loan_ml1_df

In [None]:
loan_ml1_df['Age_0_25']=(loan_ml1_df['Age']<=25).astype(int)
loan_ml1_df['Age_25_30']=(loan_ml1_df['Age']>25).astype(int) & (loan_ml1_df['Age']<=30).astype(int)
loan_ml1_df['Age_30_35']=(loan_ml1_df['Age']>30).astype(int) & (loan_ml1_df['Age']<=35).astype(int)
loan_ml1_df['Age_35_40']=(loan_ml1_df['Age']>35).astype(int) & (loan_ml1_df['Age']<=40).astype(int)
loan_ml1_df['Age_40_45']=(loan_ml1_df['Age']>40).astype(int) & (loan_ml1_df['Age']<=45).astype(int)
loan_ml1_df['Age_45_50']=(loan_ml1_df['Age']>45).astype(int) & (loan_ml1_df['Age']<=50).astype(int)
loan_ml1_df['Age_50_55']=(loan_ml1_df['Age']>50).astype(int) & (loan_ml1_df['Age']<=55).astype(int)
loan_ml1_df['Age_55_60']=(loan_ml1_df['Age']>55).astype(int) & (loan_ml1_df['Age']<=60).astype(int)
loan_ml1_df['Age_60_65']=(loan_ml1_df['Age']>60).astype(int) & (loan_ml1_df['Age']<=65).astype(int)
loan_ml1_df

In [None]:
loan_ml1_df['CC_0_1']=(loan_ml1_df['CCAvg']<=1).astype(int)
loan_ml1_df['CC_1_2']=(loan_ml1_df['CCAvg']>1).astype(int) & (loan_ml1_df['CCAvg']<=2).astype(int)
loan_ml1_df['CC_2_3']=(loan_ml1_df['CCAvg']>2).astype(int) & (loan_ml1_df['CCAvg']<=3).astype(int)
loan_ml1_df['CC_3_4']=(loan_ml1_df['CCAvg']>3).astype(int) & (loan_ml1_df['CCAvg']<=4).astype(int)
loan_ml1_df['CC_4_5']=(loan_ml1_df['CCAvg']>4).astype(int) & (loan_ml1_df['CCAvg']<=5).astype(int)
loan_ml1_df['CC_5_6']=(loan_ml1_df['CCAvg']>5).astype(int) & (loan_ml1_df['CCAvg']<=6).astype(int)
loan_ml1_df['CC_6_7']=(loan_ml1_df['CCAvg']>6).astype(int) & (loan_ml1_df['CCAvg']<=7).astype(int)
loan_ml1_df['CC_7_8']=(loan_ml1_df['CCAvg']>7).astype(int) & (loan_ml1_df['CCAvg']<=8).astype(int)
loan_ml1_df['CC_8_9']=(loan_ml1_df['CCAvg']>8).astype(int) & (loan_ml1_df['CCAvg']<=9).astype(int)
loan_ml1_df

In [None]:
loan_ml1_df['Income_0_20']=(loan_ml1_df['Income']<=20).astype(int)
loan_ml1_df['Income_20_40']=(loan_ml1_df['Income']>20).astype(int) & (loan_ml1_df['Income']<=40).astype(int)
loan_ml1_df['Income_40_60']=(loan_ml1_df['Income']>40).astype(int) & (loan_ml1_df['Income']<=60).astype(int)
loan_ml1_df['Income_60_80']=(loan_ml1_df['Income']>60).astype(int) & (loan_ml1_df['Income']<=80).astype(int)
loan_ml1_df['Income_80_100']=(loan_ml1_df['Income']>80).astype(int) & (loan_ml1_df['Income']<=100).astype(int)
loan_ml1_df['Income_100_120']=(loan_ml1_df['Income']>100).astype(int) & (loan_ml1_df['Income']<=120).astype(int)
loan_ml1_df['Income_120_140']=(loan_ml1_df['Income']>120).astype(int) & (loan_ml1_df['Income']<=140).astype(int)
loan_ml1_df['Income_140_160']=(loan_ml1_df['Income']>140).astype(int) & (loan_ml1_df['Income']<=160).astype(int)
loan_ml1_df['Income_160_180']=(loan_ml1_df['Income']>160).astype(int) & (loan_ml1_df['Income']<=180).astype(int)
loan_ml1_df['Income_180_200']=(loan_ml1_df['Income']>180).astype(int) & (loan_ml1_df['Income']<=200).astype(int)
loan_ml1_df

In [None]:
loan_ml1_df['Mortgage_0_75']=(loan_ml1_df['Mortgage']==0).astype(int)
loan_ml1_df['Mortgage_75_125']=(loan_ml1_df['Mortgage']>=75).astype(int) & (loan_ml1_df['Mortgage']<125).astype(int)
loan_ml1_df['Mortgage_125_175']=(loan_ml1_df['Mortgage']>=125).astype(int) & (loan_ml1_df['Mortgage']<175).astype(int)
loan_ml1_df['Mortgage_175_225']=(loan_ml1_df['Mortgage']>=175).astype(int) & (loan_ml1_df['Mortgage']<225).astype(int)
loan_ml1_df['Mortgage_225_275']=(loan_ml1_df['Mortgage']>=225).astype(int) & (loan_ml1_df['Mortgage']<275).astype(int)
loan_ml1_df['Mortgage_275_325']=(loan_ml1_df['Mortgage']>=275).astype(int) & (loan_ml1_df['Mortgage']<325).astype(int)
loan_ml1_df['Mortgage_325_400']=(loan_ml1_df['Mortgage']>=325).astype(int) & (loan_ml1_df['Mortgage']<400).astype(int)
loan_ml1_df['Mortgage_400_500']=(loan_ml1_df['Mortgage']>=400).astype(int) & (loan_ml1_df['Mortgage']<500).astype(int)
loan_ml1_df

In [None]:
loan_ml1_df[['Age_sq','Income_sq','CCAvg_sq','Mortgage_sq']]=loan_ml1_df[['Age','Income','CCAvg','Mortgage']].apply(lambda x: np.square(x))
loan_ml1_df[['Age_sqrt','Income_sqrt','CCAvg_sqrt','Mortgage_sqrt']]=loan_ml1_df[['Age','Income','CCAvg','Mortgage']].apply(lambda x: np.sqrt(x))
loan_ml1_df[['Age_ln','Income_ln','CCAvg_ln','Mortgage_ln']]=loan_ml1_df[['Age','Income','CCAvg','Mortgage']].apply(lambda x: np.log(x))
loan_ml1_df

In [None]:
# np.log(0) is -inf for customers with no mortgage; reset those entries to 0
loan_ml1_df.loc[loan_ml1_df['Mortgage_ln']<0,['Mortgage_ln']]= 0

In [None]:
correl_13 = loan_ml1_df.corr().abs()
# Select upper triangle of correlation matrix (np.bool was removed from NumPy; use the builtin bool)
upper = correl_13.where(np.triu(np.ones(correl_13.shape), k=1).astype(bool))
# Find feature columns with correlation greater than 0.75
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]
print(to_drop)
correl_13.to_csv('file13.csv',index=False)
# Styler.set_precision was removed in newer pandas; format(precision=...) replaces it
correl_13.style.background_gradient(cmap='coolwarm').format(precision=2)

In [None]:
# can remove age and its transformations
# can remove mortgage and its transformations except mortgage_sq
# can remove income transformations
# can remove all CCavg variables
loan_ml1_df.columns
loan_ml1_df_new=loan_ml1_df.drop(['Age','CCAvg','Income','Mortgage','Age_sq', 'Income_sq', 'CCAvg_sq',
                'Age_sqrt','Income_sqrt', 'CCAvg_sqrt', 'Mortgage_sqrt', 'Age_ln','Income_ln',
                'CCAvg_ln', 'Mortgage_ln','Age_0_25', 'Age_25_30', 'Age_30_35','Age_35_40', 
                'Age_40_45', 'Age_45_50', 'Age_50_55', 'Age_55_60','Age_60_65','California',
                'Mortgage_sq'],axis=1)
loan_ml1_df_new
# no dependency on Online, CreditCard, Securities Account
# customers with high CCAvg are more likely to take a personal loan
# families with income below 100k are less likely to take a loan
# customers with a higher mortgage are more likely to take the loan
# customers with income above 50k are more likely to take the personal loan
# customers with a CD account have a high probability of taking the loan
# undergraduates have a very low probability of taking the loan
# families of more than 3 are more likely to take the loan
# 'CC_0_1', 'CC_1_2', 'CC_2_3', 'CC_3_4', 'CC_4_5', 'CC_5_6','CC_6_7', 'CC_7_8', 'CC_8_9',
# 'Mortgage_0_75', 'Mortgage_75_125', 'Mortgage_125_175','Mortgage_175_225','Mortgage_225_275', 'Mortgage_275_325', 'Mortgage_325_400', 'Mortgage_400_500',

In [None]:
loan_ml1_df_new.columns

In [None]:
correl_24 = loan_ml1_df_new.corr().abs()
# Select upper triangle of correlation matrix (np.bool was removed from NumPy; use the builtin bool)
upper = correl_24.where(np.triu(np.ones(correl_24.shape), k=1).astype(bool))
# Find feature columns with correlation greater than 0.75
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]
print(to_drop)
correl_24.to_csv('file24.csv',index=False)
# Styler.set_precision was removed in newer pandas; format(precision=...) replaces it
correl_24.style.background_gradient(cmap='coolwarm').format(precision=2)

In [None]:
X7=loan_ml1_df_new.iloc[:,1:]
y7=loan_ml1_df_new.iloc[:,0]
X7_train,X7_test,y7_train,y7_test=train_test_split(X7,y7,test_size=0.3,random_state=28)

In [None]:
print("X_train shape   ",X7_train.shape)
print("X_test shape   ",X7_test.shape)
print("y_train shape   ",y7_train.shape)
print("y_test shape   ",y7_test.shape)

In [None]:
logit_model_base7=sm.Logit(endog=y7_train,exog=X7_train)
result_17=logit_model_base7.fit()
print(result_17.summary())

In [None]:
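def elimination(x,sl,y):
    # Backward elimination: refit the model, drop the predictor with the
    # largest p-value, and stop once every remaining p-value is <= sl.
    # Note: x is modified in place, so the frame passed in (X7_train)
    # loses the same columns.
    while len(x.columns)>0:
        lr=sm.Logit(y,x).fit()
        worst=lr.pvalues.idxmax()
        if lr.pvalues[worst]<=sl:
            break
        del x[worst]
    return x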

sl = 0.05
x7_model=elimination(X7_train,sl,y7_train)
lr7=sm.Logit(endog=y7_train,exog=x7_model).fit()
print(lr7.summary())

vif=pd.DataFrame()
vif['VIF Factor']=[variance_inflation_factor(x7_model.values,i) for i in range(x7_model.shape[1])]
vif['features']=x7_model.columns
vif

In [None]:
del x7_model['CC_0_1']
vif=pd.DataFrame()
vif['VIF Factor']=[variance_inflation_factor(x7_model.values,i) for i in range(x7_model.shape[1])]
vif['features']=x7_model.columns
vif

In [None]:
del x7_model['Online']
vif=pd.DataFrame()
vif['VIF Factor']=[variance_inflation_factor(x7_model.values,i) for i in range(x7_model.shape[1])]
vif['features']=x7_model.columns
vif

In [None]:
del x7_model['CreditCard']
vif=pd.DataFrame()
vif['VIF Factor']=[variance_inflation_factor(x7_model.values,i) for i in range(x7_model.shape[1])]
vif['features']=x7_model.columns
vif

In [None]:
del x7_model['CC_8_9']
vif=pd.DataFrame()
vif['VIF Factor']=[variance_inflation_factor(x7_model.values,i) for i in range(x7_model.shape[1])]
vif['features']=x7_model.columns
vif

In [None]:
del x7_model['CC_1_2']
del x7_model['CC_2_3']
del x7_model['CC_7_8']
vif=pd.DataFrame()
vif['VIF Factor']=[variance_inflation_factor(x7_model.values,i) for i in range(x7_model.shape[1])]
vif['features']=x7_model.columns
vif

In [None]:
del x7_model['CC_6_7']
del x7_model['Income_80_100']
vif=pd.DataFrame()
vif['VIF Factor']=[variance_inflation_factor(x7_model.values,i) for i in range(x7_model.shape[1])]
vif['features']=x7_model.columns
vif
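
The VIF table is rebuilt by hand after every deletion above. As a minimal sketch, the hypothetical helper `vif_table` below (not part of the notebook) computes one VIF per column from the diagonal of the inverse correlation matrix, which is the intercept-included definition of VIF. Its values can therefore differ from the `variance_inflation_factor` results above, which work on the design matrix as given. The data is synthetic and only illustrates the behavior.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """Return one VIF per column, read from the diagonal of the inverse
    correlation matrix of the predictors."""
    corr = features.corr().values
    vif_values = np.diag(np.linalg.inv(corr))
    return pd.DataFrame({'VIF Factor': vif_values, 'features': features.columns})

# Synthetic predictors (assumption): x3 is nearly a copy of x1, x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
toy = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

vif = vif_table(toy)
print(vif)
```

In this toy frame the two nearly collinear columns get a large VIF while the independent column stays close to 1.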

In [None]:
mylist7=list(x7_model.columns)
print(mylist7)
print(X7_test)
print(X7_train)
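# Select exactly the surviving columns, in training order; a regex substring
# match could also keep columns whose names merely contain a kept name
X7_test=X7_test[mylist7]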

In [None]:
X7_test

In [None]:
X7_train

In [None]:
X7

In [None]:
x7_model

### Logistic Regression

In [None]:
lr7 = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                         intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
                         penalty='l2', random_state=42, solver='liblinear',
                         tol=0.0001, verbose=0, warm_start=False)
print(lr7.fit(X7_train, y7_train))
print(lr7.score(X7_test, y7_test))
print(lr7.predict(X7_test.iloc[[0]]))
print(lr7.predict_proba(X7_test.iloc[[0]]))
print(lr7.predict_log_proba(X7_test.iloc[[0]]))
print(lr7.decision_function(X7_test.iloc[[0]]))

The intercept is the log-odds of the baseline condition (all predictors equal to zero). We can convert it back to a probability.

In [None]:
lr7.intercept_

#### Using the inverse logit function, we see that the baseline for personal loan approval is 3.11%:

In [None]:
def inv_logit(z):
    # Map log-odds z to a probability: exp(z) / (1 + exp(z))
    return np.exp(z) / (1 + np.exp(z))
inv_logit(lr7.intercept_)

In [None]:
y_pred7 = lr7.predict(X7_test)
skplt.metrics.plot_confusion_matrix(y7_test, y_pred7)
plt.show()

In [None]:
y_probas17 = lr7.predict_proba(X7_test)
skplt.metrics.plot_cumulative_gain(y7_test, y_probas17)
plt.show()

In [None]:
skplt.metrics.plot_precision_recall(y7_test, y_probas17)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(6, 4))
fi_viz = FeatureImportances(lr7)
fi_viz.fit(X7, y7)
fi_viz.show()  # poof() is deprecated in newer Yellowbrick releases

### Naive Bayes

In [None]:
nb7 = GaussianNB(priors=None, var_smoothing=1e-09)
print(nb7.fit(X7_train, y7_train))
print(nb7.score(X7_test, y7_test))
print(nb7.predict(X7_test.iloc[[0]]))
print(nb7.predict_proba(X7_test.iloc[[0]]))
print(nb7.predict_log_proba(X7_test.iloc[[0]]))

In [None]:
y_pred27 = nb7.predict(X7_test)
skplt.metrics.plot_confusion_matrix(y7_test, y_pred27)
plt.show()

In [None]:
y_probas27 = nb7.predict_proba(X7_test)
skplt.metrics.plot_cumulative_gain(y7_test, y_probas27)
plt.show()

In [None]:
skplt.metrics.plot_precision_recall(y7_test, y_probas27)
plt.show()

### KNN

In [None]:
knc7 = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                            metric_params=None, n_jobs=1, n_neighbors=5,
                            p=2, weights='uniform')
print(knc7.fit(X7_train, y7_train))
print(knc7.score(X7_test, y7_test))
print(knc7.predict(X7_test.iloc[[0]]))
print(knc7.predict_proba(X7_test.iloc[[0]]))

In [None]:
y_pred37 = knc7.predict(X7_test)
skplt.metrics.plot_confusion_matrix(y7_test, y_pred37)
plt.show()

In [None]:
y_probas37= knc7.predict_proba(X7_test)
skplt.metrics.plot_cumulative_gain(y7_test, y_probas37)
plt.show()

In [None]:
skplt.metrics.plot_precision_recall(y7_test, y_probas37)
plt.show()

##### -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Additional EDA Analysis 

##### Which main characteristics have a stronger association with Personal Loan, and how strong is that association?

Here is a subset of the initial data frame with only the characteristics that have a positive, higher-than-moderate association with 'Personal Loan'.

In [None]:
exp_df = bank_per_loan_df[['Income', 'CCAvg', 'Family', 'Education', 'CD Account', 'Personal Loan']].copy()
exp_df

Let's apply logistic regression on this subset

In [None]:
exp_df['intercept'] = 1

log_mod_5 = sm.Logit(exp_df['Personal Loan'], exp_df[['intercept','Income', 'CCAvg', 'Family', 'Education', 'CD Account']]).fit()

Get P-Values for each variable

In [None]:
log_mod_5.pvalues[1:]

##### All p-values are less than 0.05

Get the odds ratio for each variable

In [None]:
odds_exp = np.exp(log_mod_5.params)
odds_exp

In [None]:
odds_df = pd.DataFrame(odds_exp[1:], columns = ["Odds"])
odds_df

In [None]:
odds_df['odds_increment'] = odds_df.Odds
odds_df

Here is the data frame of main characteristics and their odds ratios: the factor by which the odds of selling a Personal Loan change when the variable increases by one unit.

In [None]:
odds_df.sort_values('Odds', ascending = False)
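
To make the reading of an odds ratio concrete, here is a minimal sketch with hypothetical coefficients (the names `feature_a`, `feature_b`, `feature_c` and their values are assumptions, not the fitted values of `log_mod_5`). Exponentiating a logit coefficient gives the multiplicative change in odds for a one-unit increase, and subtracting 1 expresses that change as a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical logit coefficients (log-odds per one-unit increase)
coefs = pd.Series({'feature_a': np.log(2.0),
                   'feature_b': 0.0,
                   'feature_c': np.log(0.5)})

odds_ratio = np.exp(coefs)               # multiplicative change in odds
pct_change = (odds_ratio - 1.0) * 100.0  # same change expressed in percent

summary = pd.DataFrame({'Odds': odds_ratio, 'pct_change_in_odds': pct_change})
print(summary.sort_values('Odds', ascending=False))
```

An odds ratio of 2 doubles the odds (+100%), 1 leaves them unchanged, and 0.5 halves them (-50%).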

##### Chart of the relative strength of association between Personal Loan and each main characteristic

In [None]:
sizes = odds_df.Odds.tolist()# list of sizes of slices
labels = odds_df.index.tolist() # list of labels 
explode = (0.15, 0.1, 0.2, 0.1, 0)  # offset of each slice from the center; the last slice stays in place
fig = plt.figure(figsize=(10, 5))
plt.suptitle('The Proportion of Strength of Association  Between  \n Personal Loan and Main Characteristics', \
          fontsize = 14, y = 1.18)
plt.axis('equal'); # set the aspect ratio to equal so the pie is drawn as a circle
plt.pie(sizes, labels = labels, explode = explode, radius = 1.5, \
        shadow = True, startangle = 90,autopct= '%1.1f%%')

plt.savefig('proportion_of_stregth_of_association1.png', bbox_inches = 'tight');

##### Which segments of the main characteristics have a stronger association with Personal Loan?

Let's take a closer look at each of the main characteristics.

#### CD Account


Here is the distribution of "Personal Loan" values among groups of "CD Account" values

In [None]:
series_cd = exp_df[exp_df['Personal Loan'] == 1]['CD Account'].value_counts()
series_cd

In [None]:
series_cdd = exp_df[exp_df['Personal Loan'] == 0]['CD Account'].value_counts()
series_cdd

In [None]:
pd.DataFrame(dict( NO_PL= series_cdd, PL= series_cd,)).plot.bar(figsize = (8,6))
plt.ylabel('Frequency')
plt.xticks(np.arange(2),('No CD Account','CD Account'), rotation = 'horizontal')
plt.legend(('NO Personal Loan', 'Personal Loan'));
plt.title('Distribution of "Personal Loan" Values \n among Groups of "CD Account" Values', fontsize = 14, y = 1.05);
plt.savefig('distribution_of_PL_among_CDacc1.png', bbox_inches = 'tight')

##### Observation
We may say that the proportion of customers who have a Personal Loan among those who have a CD account with the Bank is quite high.


Let's see the exact proportion of loan holders among CD account holders.

In [None]:
series = exp_df[exp_df['CD Account'] == 1]['Personal Loan'].value_counts()
series

In [None]:
plt.axis('equal')
plt.title('Proportion of Customers Who Have Personal Loan and Who Don\'t,\n among CD Account Holders', \
          fontsize = 14, y = 1.2)
labels = ['NO Personal Loan','Personal Loan']
plt.pie(series, labels = labels,autopct= '%1.1f%%', shadow = True,explode = (0.1, 0), radius = 1.6, startangle = 90)
plt.savefig('Proportion_of_loanees_among_depositees1.png', bbox_inches = 'tight');

##### Conclusion

    46.4% of CD Account holders have a Personal Loan.
    For the 'CD Account' characteristic, the main segment for selling a Personal Loan is people who already have a CD Account with the Bank.
    Target value of the 'CD Account' variable = 1
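
The per-group proportions read from the pie chart can also be tabulated directly. This is a minimal sketch, assuming a toy frame `toy` in place of `exp_df` (the values are made up for illustration); `normalize='index'` makes each row of the cross-tabulation sum to 1.

```python
import pandas as pd

# Toy data standing in for exp_df (assumption for illustration)
toy = pd.DataFrame({
    'CD Account':    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'Personal Loan': [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Each row holds the share of non-holders (0) and holders (1) of a Personal Loan
loan_share = pd.crosstab(toy['CD Account'], toy['Personal Loan'], normalize='index')
print(loan_share)
```

The same call with 'Education' or 'Family' as the row variable would give the shares for every level at once.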

#### Education


Here is the distribution of "Personal Loan" values among groups of "Education" values

In [None]:
series_ed = exp_df[exp_df['Personal Loan'] == 1]['Education'].value_counts()
series_ed

In [None]:
series_edd = exp_df[exp_df['Personal Loan'] == 0]['Education'].value_counts()
series_edd

In [None]:
pd.DataFrame(dict(NO_PL= series_edd, PL= series_ed)).plot.bar(figsize = (8,6))
plt.ylabel('Frequency')
plt.xlabel('Education Level')
plt.xticks(np.arange(3),('1','2','3'), rotation = 'horizontal')
plt.legend(('NO Personal Loan', 'Personal Loan'))
plt.title('Distribution of "Personal Loan" Values \n among Groups of "Education" Values', fontsize = 14, y = 1.05);
plt.savefig('distribution_PL_among_Education1.png', bbox_inches = 'tight')

###### Observations 

We may say that the proportion of customers who have a Personal Loan is higher among those with the third and second levels of education than among those with the first level.

Let's see the exact numbers of proportions.

In [None]:
series_edu_3 = exp_df[exp_df['Education'] == 3]['Personal Loan'].value_counts()
series_edu_3

In [None]:
series_edu_2 = exp_df[exp_df['Education'] == 2]['Personal Loan'].value_counts()
series_edu_2

In [None]:
series_edu_1 = exp_df[exp_df['Education'] == 1]['Personal Loan'].value_counts()
series_edu_1

In [None]:
labels = ['NO Personal Loan','Personal Loan']
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize = (18,6),subplot_kw=dict(aspect="equal"))
plt.axis('equal')
ax1.pie(series_edu_3, labels = labels, autopct= '%1.1f%%', shadow = True,explode = (0, 0.1), radius = 1.25, startangle = 90)
ax1.set_title('Education Level 3',fontsize = 14, y = 1.1)

ax2.pie(series_edu_2, labels = labels, autopct= '%1.1f%%', shadow = True,explode = (0, 0.1), radius = 1.25, startangle = 90)
ax2.set_title('Education Level 2', fontsize = 14, y = 1.1)

ax3.pie(series_edu_1, labels = labels, autopct= '%1.1f%%', shadow = True,explode = (0, 0.1), radius = 1.25, startangle = 90);
ax3.set_title('Education Level 1',fontsize = 14, y = 1.1)

plt.suptitle('Proportion of Customers Who Have Personal Loan and Who Don\'t, by Education Level', \
             fontsize = 16, y = 1.12);

plt.savefig('Proportion_of_PL_among edu_levels1.png', bbox_inches = 'tight');

In [None]:
series_edu_4 = exp_df[exp_df['Personal Loan'] == 1]['Education'].value_counts()
series_edu_4

In [None]:
plt.axis('equal')
plt.title('Proportion of Customers With Different Levels of Education \n among Personal Loan Holders', \
          fontsize = 14, y = 1.3)
labels = ['Education Level  3',' Education Level 2','Education Level 1']
plt.pie(series_edu_4, labels = labels, autopct= '%1.2f%%', shadow = True,explode = (0.1, 0, 0), radius = 1.6, startangle = 90);
plt.savefig('Proportion_edu_levels_among_PL1.png', bbox_inches = 'tight');

##### Conclusion


    42.7% and 37.9% of customers who have a Personal Loan have Education Level 3 and Level 2, respectively.
    For the 'Education' characteristic, the main segments for selling a Personal Loan are people with the second and third levels of education.
    Target values of the 'Education' variable are 3 and 2, in descending order of priority.

#### Family

Here is the distribution of "Personal Loan" values among groups of "Family" values

In [None]:
series_fam = exp_df[exp_df['Personal Loan'] == 1]['Family'].value_counts()
series_fam

In [None]:
series_famm = exp_df[exp_df['Personal Loan'] == 0]['Family'].value_counts()
series_famm

In [None]:
pd.DataFrame(dict( NO_PL = series_famm, PL= series_fam,)).plot.bar(figsize = (8,6))
plt.ylabel('Frequency')
plt.xlabel('Family Size')
plt.xticks(np.arange(4),('1', '2', '3', '4'), rotation = 'horizontal')
plt.legend(('NO Personal Loan', 'Personal Loan'));
plt.title('Distribution of "Personal Loan" Values \n among Groups of "Family" Values', fontsize = 14, y = 1.05);
plt.savefig('distribution_of_PL_among_family1.png', bbox_inches = 'tight')

##### Observation

We may say that the proportion of customers who have a Personal Loan is highest among those with Family size 3 and 4.


Let's see the exact proportions of loan holders for Family sizes 3 and 4.

In [None]:
series_fam_3 = exp_df[exp_df['Family'] == 3]['Personal Loan'].value_counts()
series_fam_3

In [None]:
series_fam_4 = exp_df[exp_df['Family'] == 4]['Personal Loan'].value_counts()
series_fam_4

In [None]:
labels = ['NO Personal Loan','Personal Loan']

fig, (ax1, ax2) = plt.subplots(1,2, figsize = (12,6),subplot_kw=dict(aspect="equal"))
fig.suptitle('Proportion of Customers Who Have Personal Loan and Who Don\'t, \
among Different Family Sizes', fontsize = 16, y = 1.1, x = 0.51);

ax1.pie(series_fam_3, labels = labels, autopct= '%1.1f%%', shadow = True,explode = (0, 0.1), radius = 1.25, startangle = 90)
ax1.set_title('Family Size 3',fontsize = 14, y = 1.1)

ax2.pie(series_fam_4, labels = labels, autopct= '%1.1f%%', shadow = True,explode = (0, 0.1), radius = 1.25, startangle = 90)
ax2.set_title('Family Size 4', fontsize = 14, y = 1.1);

plt.savefig('Proportion_of_PL_among_family_levels1.png', bbox_inches = 'tight');

In [None]:
plt.axis('equal')
plt.title('Proportion of Customers With Different Family Sizes \n among Personal Loan Holders', \
          fontsize = 14, y = 1.3)
labels = ['Family 2',' Family 1','Family 3','Family 4']
plt.pie(series_fam.sort_values(ascending = True), labels = labels, \
        autopct= '%1.2f%%', shadow = True, explode = (0.1, 0.1, 0.1,0.15), radius = 1.6, startangle = 90);
plt.savefig('Proportion_family_size_among_PL1.png', bbox_inches = 'tight');

##### Conclusion

    27.9% and 27.7% of customers who have a Personal Loan have Family size 4 and 3, respectively.
    
    For the 'Family' characteristic, the main segments for selling a Personal Loan are people with Family size 3 and 4.
    
    Target values of the 'Family' variable are 3 and 4, in descending order of priority, since the proportion of people who have a Personal Loan is highest for Family size 3: 13.2%.

#### CCAvg

Here is the distribution of "CCAvg" values among Personal Loan holders and among whole population.

In [None]:
series_cca = exp_df[exp_df['Personal Loan'] == 1]['CCAvg'].value_counts()
series_cca

In [None]:
series_cca.describe()

In [None]:
width = 1.5 # width of bins in histogram - adjust it to find a good grouping
series_cca.plot.hist(bins = np.arange(series_cca.min(), series_cca.max() + width, width ), figsize = (8,6))
plt.xlabel('CCAvg')
plt.axvline(x = series_cca.mean(), color = 'red')
plt.axvline(x = series_cca.min(), color = 'green')
plt.axvline(x = series_cca.mean() + series_cca.std(), color = 'green')
plt.title('Distribution of "CCAvg" values among "Personal Loan" holders', fontsize = 14, y = 1.05);
plt.savefig('Distrib_ccavg_among_PL1.png', bbox_inches = 'tight')

##### Observation

We may say that 'CCAvg' values can be divided into three groups, in descending order of priority based on their frequency among Personal Loan holders:

    Group I: 1 < CCAvg < 2.5
    Group II: 4 < CCAvg < 5.5
    Group III: 7 < CCAvg < 8.5
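
`value_counts()` returns how often each distinct value occurs, so a histogram of `series_cca` bins those frequencies rather than the CCAvg values themselves. As a complementary check, the following minimal sketch groups raw values into bins and computes the share of loan holders per bin. The frame `toy`, the bin `edges` and all numbers are assumptions for illustration, not results from the data set.

```python
import pandas as pd

# Toy data standing in for exp_df (assumption for illustration)
toy = pd.DataFrame({
    'CCAvg':         [0.5, 0.8, 1.5, 2.0, 2.4, 3.0, 4.5, 5.0, 6.5, 8.0],
    'Personal Loan': [0,   0,   0,   1,   0,   0,   1,   1,   1,   1],
})

# Illustrative bin edges; include_lowest keeps a value equal to the first edge
edges = [0, 2.5, 5.5, 10]
toy['CCAvg_bin'] = pd.cut(toy['CCAvg'], bins=edges, include_lowest=True)

# Number of customers and share of loan holders in each CCAvg bin
rate_by_bin = toy.groupby('CCAvg_bin', observed=False)['Personal Loan'].agg(['count', 'mean'])
print(rate_by_bin)
```

Looking at the share per bin, rather than the raw count, avoids favoring bins that simply contain more customers.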

In [None]:
series_ccaa = exp_df['CCAvg'].value_counts()
width = 8.5 # width of bins in histogram - adjust it to find a good grouping
series_ccaa.plot.hist(bins = np.arange(series_ccaa.min(), series_ccaa.max() + width, width ), figsize = (8,6))
plt.xlabel('CCAvg')
plt.title('Distribution of "CCAvg" values among whole population', fontsize = 14, y = 1.05);
plt.savefig('Distrib_ccavg_among_population1.png', bbox_inches = 'tight')

##### Observation
We may say that all of the 'CCAvg' groups defined as priority groups for selling a Personal Loan lie inside a segment with fairly high frequency in the whole population.

##### Conclusion

Target groups of the 'CCAvg' characteristic, in descending order of priority:

    Group I: 1 < CCAvg < 2.5
    Group II: 4 < CCAvg < 5.5
    Group III: 7 < CCAvg < 8.5


#### Income

Here is the distribution of "Income" values among Personal Loan holders and among whole population

In [None]:
series_inc = exp_df[exp_df['Personal Loan'] == 1]['Income'].value_counts()
series_inc

In [None]:
series_inc.describe()

In [None]:
width = 1.5 # width of bins in histogram - adjust it to find a good grouping
series_inc.plot.hist(bins = np.arange(series_inc.min(), series_inc.max() + width, width ), figsize = (8,6))
plt.xlabel('Income')
plt.axvline(x = series_inc.mean(), color = 'red')
plt.axvline(x = series_inc.min(), color = 'green')
plt.axvline(x = series_inc.mean() + series_inc.std(), color = 'green')
plt.title('Distribution of "Income" values among "Personal Loan" holders', fontsize = 14, y = 1.05);
plt.savefig('Distrib_income_among_PL1.png', bbox_inches = 'tight')

##### Observation

We may say that 'Income' values can be divided into three groups, in descending order of priority based on their frequency among Personal Loan holders:

    Group I: 1 < Income < 2.5
    Group II: 4 < Income < 5.5
    Group III: 7 < Income < 8.5

In [None]:
series_incc = exp_df['Income'].value_counts()
width = 8.5 # width of bins in histogram - adjust it to find a good grouping
series_incc.plot.hist(bins = np.arange(series_incc.min(), series_incc.max() + width, width ), figsize = (8,6))
plt.xlabel('Income')
plt.title('Distribution of "Income" values among whole population', fontsize = 14, y = 1.05);
plt.savefig('Distrib_income_among_population1.png', bbox_inches = 'tight')

##### Observation
We may say that all of the 'Income' groups defined as priority groups for selling a Personal Loan lie inside a segment with fairly high frequency in the whole population.

##### Conclusion

Target groups of the 'Income' characteristic are:

    Group I: 1 < Income < 2.5
    Group II: 4 < Income < 5.5
    Group III: 7 < Income < 8.5

##### Summary Conclusion

We carried out a simple step-by-step analysis of customer characteristics to identify patterns that help choose the subset of customers with a higher probability of buying the new product, "Personal Loan", from the Bank. We performed the following steps:

    We checked each of the twelve characteristics for an association with whether the product was sold.
    We found FIVE main characteristics that have a higher-than-moderate strength of association with the product.
    We analyzed the main characteristics and identified segments in each with different strengths of association with the product.

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------