# DATA SET: Bank_Personal_Loan_Modelling.csv

Data Description:
The file Bank.xls contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

Domain: Banking

Context: This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with minimal budget.

Objective: The classification goal is to predict the likelihood of a liability customer buying personal loans.

## 1. Import the necessary libraries

In [None]:
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

# Importing libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Importing plotting libraries
import matplotlib.pyplot as plt   

#importing seaborn for statistical plots
import seaborn as sns

#Let us break the X and y dataframes into training set and test set. For this we will use
#Sklearn package's data splitting function which is based on random function

from sklearn.model_selection import train_test_split

import numpy as np
#import os,sys
from scipy import stats

# calculate accuracy measures and confusion matrix
from sklearn import metrics

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 2. Read the data 

In [None]:
datapath = '../input'

In [None]:
my_data = pd.read_csv(datapath+'/Bank_Personal_Loan_Modelling.csv')
my_data.columns = ["ID","Age","Experience","Income","ZIPCode","Family","CCAvg","Education","Mortgage","Personal_Loan","SecuritiesAccount","CDAccount","Online","CreditCard"]


## 3. Basic EDA

In [None]:
my_data.head(10)

a. The variable ID does not add any particular information.

b. There are 2 nominal variables:

    1. ID
    2. ZIP Code

c. There are 2 ordinal categorical variables:

    1. Family - Family size of the customer
    2. Education - Education level of the customer

d. There are 5 numeric variables:

    1. Age: Age of the customer
    2. Experience: Years of experience of the customer
    3. Income: Annual income in dollars
    4. CCAvg: Average credit card spending
    5. Mortgage: Value of house mortgage

e. There are 5 binary categorical variables:

    1. Personal Loan: Did this customer accept the personal loan offered in the last campaign?
    2. Securities Account: Does the customer have a securities account with the bank?
    3. CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
    4. Online: Does the customer use internet banking facilities?
    5. Credit Card: Does the customer use a credit card issued by UniversalBank?

f. The target variable is Personal Loan.

### a. Shape of the data

In [None]:
my_data.shape

There are 5000 customers.

In [None]:
my_data.columns

### b. Data type of each attribute 

In [None]:
my_data.dtypes

Almost all attributes are numeric.

## c.Check for the null values 

In [None]:
#null values
my_data.isnull().values.any()

## d. Checking the presence of missing values 

In [None]:
# True if at least one cell in the frame is missing
val = my_data.isnull().values.any()

if val:
    print("Missing values present : ", my_data.isnull().values.sum())
    my_data = my_data.dropna()
else:
    print("No missing values present")
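
A single boolean says whether anything is missing, but not where. The sketch below shows a per-column count on a small, made-up frame (`toy` and its values are hypothetical stand-ins for `my_data`); the same two calls should work unchanged on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for my_data; the values are made up.
toy = pd.DataFrame({
    "Age": [25, 40, np.nan, 51],
    "Income": [49, np.nan, np.nan, 120],
    "Family": [4, 3, 1, 2],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the frame
total_missing = int(missing_per_column.sum())
print("Total missing:", total_missing)
```

Knowing which columns are affected helps decide between dropping rows and imputing a value.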

## e. 5 point summary of numerical attributes 

In [None]:
my_data.describe().T

In [None]:
my_data.info()

## f.Finding unique data 

In [None]:
my_data.apply(lambda x: len(x.unique()))

In [None]:
#Find Shape
my_data.shape

In [None]:
#Find Mean
my_data.mean()

In [None]:
#Find Median
my_data.median()

In [None]:
#Find Standard Deviation
my_data.std()

## g. Plotting histograms to check whether the data columns are normal or almost normal

In [None]:
my_data.hist(figsize=(10,10),color="blueviolet",grid=False)
plt.show()

# 4.PairPlot

In [None]:
sns.pairplot(my_data.iloc[:,1:])

### 1. The "Age" feature is almost normally distributed, with the majority of customers between 30 and 60 years old. The median is approximately equal to the mean.
### 2. The "Experience" feature is also almost normally distributed, and its mean is approximately equal to its median. Some negative values are present, which must be corrected because experience cannot be negative.
### 3. The distributions of "Income", "CCAvg" and "Mortgage" are positively skewed.
### 4. For "Income" the mean is greater than the median. The majority of the customers have an income between 45K and 55K.
### 5. For "CCAvg" the majority of the customers spend less than 2.5K, and spending ranges from 0 to 10K.
### 6. For "Mortgage" almost 70% of the customers have a house mortgage value below 40K, and the maximum value is 635K.
### 7. "Family" and "Education" are roughly evenly distributed.

In [None]:
my_data[my_data['Experience'] < 0]['Experience'].count()

There are 52 records with negative experience. These values have to be cleaned.

### Cleaning the negative values 

In [None]:
my_dataExp = my_data.loc[my_data['Experience'] >0]
negExp = my_data.Experience < 0
column_name = 'Experience'
my_data_list = my_data.loc[negExp]['ID'].tolist()

In [None]:
negExp.value_counts()

52 records with negative experience

In [None]:
for cust_id in my_data_list:
    row = my_data['ID'] == cust_id   # boolean mask selecting this customer
    age = my_data.loc[row, 'Age'].iloc[0]
    education = my_data.loc[row, 'Education'].iloc[0]
    df_filtered = my_dataExp[(my_dataExp.Age == age) & (my_dataExp.Education == education)]
    exp = df_filtered['Experience'].median()
    my_data.loc[row, 'Experience'] = exp

# For each ID with negative experience, get the values of the Age and Education columns.
# Then filter the dataframe of records with positive experience for rows matching
# that Age and Education, and take the median of their Experience.
# Finally, write that median back to the record that had negative experience.
# Note: if no record matches, the median is NaN and the value becomes missing.
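
The same idea (replace each invalid value with the median of valid records that share its Age and Education) can be sketched without an explicit loop. The frame `toy` below is made up for illustration, and treating zero as a valid experience value is an assumption of this sketch; the cell above only uses records with experience greater than zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real notebook uses my_data.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Age": [25, 25, 25, 40, 40, 40],
    "Education": [1, 1, 1, 2, 2, 2],
    "Experience": [1, 3, -1, 14, 16, -2],
})

# Treat negative experience as missing so it does not enter the medians
exp = toy["Experience"].where(toy["Experience"] >= 0)

# Median experience of the valid rows within each (Age, Education) group
group_median = exp.groupby([toy["Age"], toy["Education"]]).transform("median")

# Fill only the invalid rows; fall back to the overall median if a group
# has no valid row at all
toy["Experience"] = exp.fillna(group_median).fillna(exp.median())
print(toy)
```

The fallback to the overall median is one possible choice for groups with no valid record; other choices (for example, deriving experience from age) are equally defensible.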

### Check if there are any records still present with negative Experience 

In [None]:
my_data[my_data['Experience'] < 0]['Experience'].count()

In [None]:
my_data.describe().T

### Measure of skewness  

In [None]:
my_data.skew(axis = 0, skipna = True) 

## 5.Boxplot 

In [None]:
sns.boxplot(x=my_data["Age"])

In [None]:
sns.boxplot(x=my_data["Experience"])

In [None]:
sns.boxplot(x=my_data["Income"])

In [None]:
# Box plots of every column, grouped by the target
my_data.boxplot(by = 'Personal_Loan',  layout=(4,4), figsize=(20, 20))
plt.show()

# Individual box plots, each drawn on its own figure
for col in ['Age', 'Income', 'Education']:
    plt.figure()
    my_data.boxplot(col)
    plt.show()


In [None]:
my_data['Personal_Loan'].hist(bins=10)

In [None]:
sns.boxplot(x='Education',y='Income',hue='Personal_Loan',data=my_data)

### Customers with education level 1 have higher incomes than the others.

### Customers who took the personal loan have similar income levels across education levels.

### Customers with education levels 2 and 3 and no personal loan also have similar income levels.

In [None]:
sns.boxplot(x="Education", y='Mortgage', hue="Personal_Loan", data=my_data)

### There are many outliers in each case.

### Customers both with and without a personal loan include many with high mortgages.

In [None]:
sns.boxplot(x="Family",y="Income",hue="Personal_Loan",data=my_data)

### Families with income below 100K are less likely to take a loan than families with high income.

# 6.CountPlot

In [None]:
sns.countplot(x='Family',data=my_data,hue='Personal_Loan')

### The Family attribute does not have much impact on Personal Loan.

### However, families of size 3 take more personal loans than the other family sizes.

In [None]:
sns.countplot(x="SecuritiesAccount", data=my_data,hue="Personal_Loan")

### Most customers with a securities account do not have a personal loan.

In [None]:
sns.countplot(x='CDAccount',data=my_data,hue='Personal_Loan')

### Most customers without a CD account do not have a personal loan.

### Customers with a CD account are much more likely to have a personal loan.

In [None]:
sns.countplot(x='Online',data=my_data,hue='Personal_Loan')

### Customers with a personal loan are the minority both among online banking users and among non-users.

In [None]:
sns.countplot(x='CreditCard',data=my_data,hue='Personal_Loan')

### Customers with a personal loan are the minority both among credit card holders and among non-holders.

# 7.ScatterPlot

In [None]:
plt.figure(figsize = (10,8))
sns.scatterplot(x = "Experience", y = "Age",data =my_data, hue = "Education")
plt.xlabel("Experience")
plt.ylabel("Age")
plt.title("Distribution of Education by Age and Experience")

### Experience and Age are positively correlated: as Experience increases, Age also increases.

### The colors for the education level show that more customers are at the undergraduate level.

# 8.DistPlot

In [None]:
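# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(my_data[my_data.Personal_Loan == 0]['CCAvg'], kde=True, stat='density')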

In [None]:
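# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(my_data[my_data.Personal_Loan == 1]['CCAvg'], kde=True, stat='density')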

### Customers with a personal loan tend to have a higher CCAvg.

In [None]:
#Credit card spending of Non-Loan customers
my_data[my_data.Personal_Loan == 0]['CCAvg'].median()*1000

In [None]:
#Credit card spending of Loan customers
my_data[my_data.Personal_Loan == 1]['CCAvg'].median()*1000

### Customers with a personal loan have a median credit card spending of 3800 dollars, whereas customers without a loan have a median of 1400 dollars. Higher credit card spending therefore goes with a higher probability of taking a personal loan.

# 9.Calculate the correlation matrix

In [None]:
cor=my_data.corr()
cor

# 10.Heatmap

In [None]:
plt.subplots(figsize=(10,8))
sns.heatmap(cor,annot=True)

# 11.Conclusion from EDA:

### 1. The "Age" feature is almost normally distributed, with the majority of customers between 30 and 60 years old. The median is approximately equal to the mean.
### 2. The "Experience" feature is also almost normally distributed, and its mean is approximately equal to its median. It contained some negative values, which were replaced because experience cannot be negative.
### 3. The distributions of "Income", "CCAvg" and "Mortgage" are positively skewed.
### 4. For "Income" the mean is greater than the median. The majority of the customers have an income between 45K and 55K.
### 5. For "CCAvg" the majority of the customers spend less than 2.5K, and spending ranges from 0 to 10K.
### 6. For "Mortgage" almost 70% of the customers have a house mortgage value below 40K, and the maximum value is 635K.
### 7. "Family" and "Education" are roughly evenly distributed.
### 8. Income and CCAvg are moderately correlated.
### 9. Experience and Age are positively correlated.
### 10. Families with income below 100K are less likely to take a loan than families with high income.
### 11. Customers with education level 1 have higher incomes than the others.
### 12. Customers both with and without a personal loan include many with high mortgages.
### 13. The Family attribute does not have much impact on Personal Loan, although families of size 3 take more personal loans than the other family sizes.
### 14. Most customers with a securities account do not have a personal loan.
### 15. Most customers without a CD account do not have a personal loan.
### 16. Customers with a personal loan are the minority in both groups of the Online and CreditCard attributes.

# 12.Applying classification models (Logistic, K-NN and Naïve Bayes,SVM)

# A.Logistic regression 

In [None]:
data=my_data.drop(['ID','ZIPCode','Experience'], axis =1 )
data.head(10)

In [None]:
data.info()

In [None]:
data1=data[['Age','Income','Family','CCAvg','Education','Mortgage','SecuritiesAccount','CDAccount','Online','CreditCard','Personal_Loan']]

In [None]:
data1.head(10)

In [None]:
data1.shape

In [None]:
data1["Personal_Loan"].value_counts(normalize=True)

In [None]:
array = data1.values
X = array[:,0:10] # select all rows and the first 10 columns, which are the attributes
Y = array[:,10]   # select all rows and the last column (index 10), which is the classification "0", "1"
test_size = 0.30 # taking 70:30 training and test set
seed = 15  # Random number seed for repeatability of the code
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed) # To set the random state
type(X_train)
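
Because only 9.6% of customers accepted the loan, a purely random split can leave the training and test sets with slightly different shares of positives. A minimal sketch of a stratified split follows; `X_demo` and `y_demo` are synthetic stand-ins, not the bank data, and the `stratify` argument of `train_test_split` is the only change relative to the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 1000 rows, 10 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 10))
y_demo = np.zeros(1000, dtype=int)
y_demo[:96] = 1  # 9.6% positive class, mirroring the accepted-loan rate

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=15, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```

Both printed shares should sit close to 0.096, which keeps accuracy and recall comparable between the two parts.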

In [None]:
# Fit the model on the 70% training set and evaluate it on the 30% test set
model = LogisticRegression(max_iter=1000)  # higher iteration limit so the solver can converge on unscaled features
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
model_score = model.score(X_test, y_test)
print('Accuracy:',model_score)
print('confusion_matrix:')
print(metrics.confusion_matrix(y_test, y_predict))
A=model_score  # Accuracy of Logistic regression model
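
Logistic regression tends to converge more reliably when the features are on a similar scale, and accuracy alone hides how well the rare positive class is found. The sketch below, on synthetic data generated with `make_classification` (an assumption, not the bank data), wraps scaling and the model in one pipeline and prints per-class precision and recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (about 10% positives) as a stand-in
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=15, stratify=y_demo
)

# The scaler is fitted on the training part only, inside the pipeline
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_tr, y_tr)

# Precision, recall and F1 for each class on the held-out part
report = classification_report(y_te, logit.predict(X_te))
print(report)
```

Putting the scaler inside the pipeline means the test rows never influence the scaling parameters.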

# B.Naive Bayes

In [None]:
#from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer 
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [None]:
X = data1.values[:,0:10]  ## Features (first 10 columns)
Y = data1.values[:,10]  ## Target (last column, Personal_Loan)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 7)

In [None]:
clf = GaussianNB()
clf.fit(X_train, Y_train)

In [None]:
Y_pred = clf.predict(X_test)

In [None]:
B=accuracy_score(Y_test, Y_pred, normalize = True) #Accuracy of Naive Bayes' Model
print('Accuracy_score:',B)

In [None]:
from sklearn.metrics import recall_score
print(recall_score(Y_test, Y_pred))

In [None]:
print('Confusion_matrix:')
print(metrics.confusion_matrix(Y_test,Y_pred))
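
Accuracy and recall can both be read directly off the confusion matrix. The worked example below uses ten made-up labels (not model output from this notebook) to show how the four cells map to accuracy, recall and precision.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels purely for illustration (1 = accepted the loan)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 0])

# scikit-learn orders the binary matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)      # share of actual acceptors that were found
precision = tp / (tp + fp)   # share of predicted acceptors that were right
print(accuracy, recall, precision)
```

For a marketing campaign, recall on the positive class indicates how many likely buyers the model would actually reach.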

# C.KNN

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

In [None]:
# Standardize the features only; the target must not be scaled
feature_cols = data1.columns.drop('Personal_Loan')
X_std = pd.DataFrame(StandardScaler().fit_transform(data1[feature_cols]))
X_std.columns = feature_cols

In [None]:
#split the dataset into training and test datasets
import numpy as np
from sklearn.model_selection import train_test_split

# Transform data into features and target
# (standardized features only; Personal_Loan is excluded to avoid target leakage)
X = np.array(X_std)
y = np.array(data1['Personal_Loan'])

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
# loading library
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

# fit and evaluate the model for several values of k
for k in (1, 3, 5, 7):
    # instantiate learning model
    knn = KNeighborsClassifier(n_neighbors=k)

    # fitting the model
    knn.fit(X_train, y_train)

    # predict the response
    y_pred = knn.predict(X_test)

    # evaluate accuracy
    print("k =", k, "accuracy:", accuracy_score(y_test, y_pred))

In [None]:
myList = list(range(1,20))

# subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))

In [None]:
ac_scores = []

# perform accuracy metrics for values from 1,3,5....19
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # predict the response
    y_pred = knn.predict(X_test)
    # evaluate accuracy
    scores = accuracy_score(y_test, y_pred)
    ac_scores.append(scores)

# changing to misclassification error
MSE = [1 - x for x in ac_scores]

# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)

In [None]:
#Plot misclassification error vs k (with k value on X-axis) using matplotlib.
import matplotlib.pyplot as plt
# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
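
Picking k by its score on the test set means the test set is no longer an independent check. One common alternative, sketched here on synthetic data (`make_classification` output, not the bank data), is to choose k by cross-validation on the training part and touch the test part only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

neighbors = list(range(1, 20, 2))  # odd k from 1 to 19
cv_scores = []
for k in neighbors:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validated accuracy on the training data only
    cv_scores.append(cross_val_score(model, X_tr, y_tr, cv=5).mean())

best_k = neighbors[cv_scores.index(max(cv_scores))]
print("Best k by cross-validation:", best_k)

# The held-out test set is used once, after k has been chosen
final = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final.fit(X_tr, y_tr)
print("Test accuracy:", final.score(X_te, y_te))
```

The pipeline refits the scaler inside every fold, so no fold sees statistics computed from its own validation rows.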

In [None]:
#Use k=1 as the final model for prediction
knn = KNeighborsClassifier(n_neighbors = 1)

# fitting the model
knn.fit(X_train, y_train)

# predict the response
y_pred = knn.predict(X_test)

# evaluate accuracy
C=accuracy_score(y_test, y_pred)   #Accuracy of KNN model
print('Accuracy_score:',C)    
print(recall_score(y_test, y_pred))


In [None]:
print('Confusion_matrix:')
print(metrics.confusion_matrix(y_test, y_pred))

# D.SVM 

In [None]:
from sklearn.model_selection import train_test_split

# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score, confusion_matrix

target = my_data["Personal_Loan"]
# The target column is dropped from the features to avoid target leakage
features=my_data.drop(['ID','ZIPCode','Experience','Personal_Loan'], axis =1 )
X_train, X_test, y_train, y_test = train_test_split(features,target, test_size = 0.30, random_state = 10)

In [None]:
from sklearn.svm import SVC

# Building a Support Vector Machine on train data
svc_model= SVC(kernel='linear')
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)


In [None]:
# check the accuracy on the training set and on the test set
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))

In [None]:
# confusion_matrix expects the true labels first, then the predictions
print("Confusion Matrix:\n",confusion_matrix(y_test,prediction))

In [None]:
#Store the training-set accuracy for each kernel in a dataframe for final comparison
resultsDf = pd.DataFrame({'Kernel':['Linear'], 'Accuracy': svc_model.score(X_train, y_train)})
resultsDf = resultsDf[['Kernel', 'Accuracy']]
resultsDf

In [None]:
# Building a Support Vector Machine on train data
svc_model = SVC(kernel='rbf')
svc_model.fit(X_train, y_train)

In [None]:
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))

In [None]:
#Store the training-set accuracy for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Kernel':['RBF'], 'Accuracy': svc_model.score(X_train, y_train)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Kernel', 'Accuracy']]
resultsDf

In [None]:
#Building a Support Vector Machine on train data(changing the kernel)
svc_model  = SVC(kernel='poly')
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)

print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))

In [None]:
#Store the training-set accuracy for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Kernel':['Poly'], 'Accuracy': svc_model.score(X_train, y_train)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Kernel', 'Accuracy']]
resultsDf

In [None]:
svc_model = SVC(kernel='sigmoid')
svc_model.fit(X_train, y_train)

prediction = svc_model.predict(X_test)

##print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))

In [None]:
#Store the training-set accuracy for each kernel in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Kernel':['Sigmoid'], 'Accuracy': svc_model.score(X_train, y_train)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Kernel', 'Accuracy']]
resultsDf
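
The four kernel cells repeat the same steps and record the score on the training data. A compact alternative is sketched below on synthetic data (`make_classification` output, an assumption): it loops over the kernels, scales the features inside a pipeline, and records both training and test accuracy so that overfitting is visible.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=10, stratify=y_demo
)

rows = []
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    # SVMs are sensitive to feature scale, so scaling is part of the model
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_tr, y_tr)
    rows.append({
        "Kernel": kernel,
        "Train accuracy": model.score(X_tr, y_tr),
        "Test accuracy": model.score(X_te, y_te),
    })

kernel_results = pd.DataFrame(rows)
print(kernel_results)
```

A large gap between the two accuracy columns for a kernel would suggest that it overfits the training data.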

# 13.Comparison of different  Models:

In [None]:
print(A) #Accuracy of Logistic regression model

In [None]:
print(B) #Accuracy of Naive Bayes' Model

In [None]:
print(C)  #Accuracy of KNN Model

In [None]:
resultsDf #Training-set accuracy of the SVM model for each kernel
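
The scores printed above come from different splits (random states 15, 7, 5 and 10, with test sizes of 0.30, 0.30, 0.2 and 0.30), and the SVM table holds training accuracy, so they are not directly comparable. The sketch below, again on synthetic data and with illustrative hyperparameters (k=5 and the RBF kernel are assumptions), evaluates all four model families on one shared split and reports test accuracy and recall side by side.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
# One split shared by every model
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=15, stratify=y_demo
)

models = {
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
    "KNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_te, y_hat),
        "Recall": recall_score(y_te, y_hat),
    })

comparison = pd.DataFrame(rows).set_index("Model")
print(comparison)
```

With roughly 10% positives, a model that always predicts "no loan" already scores about 90% accuracy, which is why the recall column is worth reading alongside it.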

# Conclusion:

## The classification goal is to predict the likelihood of a liability customer buying personal loans.

## The bank wants to run a new marketing campaign, so it needs information about the relationships between the variables in the dataset.

## Four classification models were studied here.

## From the accuracy scores, the KNN algorithm seems to have the highest accuracy and stability.

## SVM can also be used, as all of its kernels have good accuracy as well.