# Bank Loan Modelling

## Importing modules and data

<b>The classification goal is to predict the likelihood of a liability customer buying personal
loans.

dataset: https://www.kaggle.com/itsmesunil/bank-loan-modelling/download

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
# Path to the Excel workbook (adjust to your environment)
path='../input/bank-loan-modelling/Bank_Personal_Loan_Modelling.xlsx'
df=pd.read_excel(path,sheet_name='Data')
df.head()

# Understanding the Data

 <b>Attributes Information</b>
    
    1. ID: Customer ID
    2. Age: Customer's age in completed years
    3. Experience: Number of years of professional experience
    4. Income: Annual income of the customer
    5. ZIP Code: Home Address ZIP code.
    6. Family: Family size of the customer
    7. CCAvg: Avg. spending on credit cards per month
    8. Education: Education Level. 1: Undergrad; 2: Graduate; 3:Advanced/Professional
    9. Mortgage: Value of house mortgage if any.
    10. Personal Loan: Did this customer accept the personal loan offered in the last campaign?
    11. Securities Account: Does the customer have a securities account with the bank?
    12. CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
    13. Online: Does the customer use internet banking facilities?
    14. CreditCard: Does the customer use a credit card issued by the bank?

In [None]:
df.shape

In [None]:
df.describe()

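1. Experience contains some negative values, which is not possible for years of experience.
2. The maximum of Income, CCAvg, and Mortgage is much higher than the mean, which suggests that these columns contain extreme values. The binary columns (Securities Account, CD Account, CreditCard) also have a maximum well above their mean, but that only reflects that most customers have a value of 0.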
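
The negative Experience values noted above could be inspected and repaired before any modelling. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration). Replacing a negative value with its absolute value is an assumption about how the error arose, not something the dataset description prescribes.

```python
import pandas as pd

# Hypothetical toy data standing in for df[['Age', 'Experience']].
toy = pd.DataFrame({
    'Age': [23, 24, 35, 45, 29],
    'Experience': [-1, -2, 10, 20, 4],
})

# Count the rows with an impossible (negative) experience.
n_negative = int((toy['Experience'] < 0).sum())
print(n_negative)

# Assumption: treat a negative value as a sign error and keep its magnitude.
toy['Experience'] = toy['Experience'].abs()
print(toy['Experience'].min())
```

On the real data the same check would read `(df['Experience'] < 0).sum()`.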

In [None]:
df.isnull().sum()

<b>The data contains no null values, so next we check whether each column has an appropriate datatype

In [None]:
df.info()

<b>Every attribute has a numeric datatype (int or float); none of them contains character or object data

In [None]:
df.nunique()

<b>Based on the number of unique values, we can separate the continuous and categorical variables

In [None]:
categorical_variables=[col for col in df.columns if df[col].nunique()<=5]
print(categorical_variables)
continuous_variables=[col for col in df.columns if df[col].nunique()>5]
print(continuous_variables)

<b>We remove Personal Loan from the categorical variable list because it is the target (dependent) variable, and ID from the continuous variable list because it is only an identifier and takes no part in modelling

In [None]:
categorical_variables.remove("Personal Loan")
print(categorical_variables)
continuous_variables.remove("ID")
print(continuous_variables)

# Data Visualization

### Univariate Analysis

<b>Analysing the distribution of individual attributes

<b>Continuous variable

In [None]:
fig=plt.figure(figsize=(20,10))
#fig.subplots_adjust(wspace=0.4,hspace=0.4)
for i,col in enumerate(continuous_variables):
    ax=fig.add_subplot(2,3,i+1)
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(df[col],kde=True,ax=ax)

1. Age and Experience are approximately uniformly distributed and have very similar distributions.
2. Income, CCAvg, and Mortgage are positively skewed.
3. ZIP Code is negatively skewed, or its values come almost entirely from a single region.
4. Most Mortgage values are 0.
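
The visual impression of skew can be cross-checked numerically with `DataFrame.skew()`. This is a minimal sketch on synthetic data; the column names and distributions are assumptions chosen only to show one roughly symmetric and one right-skewed column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns: one roughly symmetric, one with a long right tail.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'symmetric': rng.uniform(20, 60, size=1000),
    'right_skewed': rng.exponential(scale=50, size=1000),
})

# Sample skewness: close to 0 for symmetric data, positive for a right tail.
skewness = toy.skew()
print(skewness)
```

On the real data the equivalent call would be `df[continuous_variables].skew()`.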

<b>Categorical variables

In [None]:
fig=plt.figure(figsize=(20,10))
#fig.subplots_adjust(wspace=0.4,hspace=0.4)
for i,col in enumerate(categorical_variables):
    ax=fig.add_subplot(2,3,i+1)
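    # pass the column as x explicitly; newer seaborn treats the first positional argument as data
    sns.countplot(x=df[col],ax=ax)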

1. Most customers do not have a Securities Account, CD Account, or CreditCard.
2. More customers use internet banking facilities than do not.
3. The largest groups of customers are undergraduates and customers with a family size of one.

### Bivariate Analysis 

<b>Analysing each column (independent attribute) first with respect to Personal Loan (dependent attribute), and then relating the columns to one another to find patterns in the data

<b>Continuous Variables

In [None]:
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(continuous_variables):
    ax=fig.add_subplot(2,3,i+1)
    sns.boxplot(y=df[col],x=df['Personal Loan'])

In [None]:
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(continuous_variables):
    ax=fig.add_subplot(2,3,i+1)
    # kdeplot replaces the deprecated distplot(hist=False)
    sns.kdeplot(df.loc[df['Personal Loan']==0,col],ax=ax,label='No Personal Loan')
    sns.kdeplot(df.loc[df['Personal Loan']==1,col],ax=ax,label='Personal Loan')
    ax.legend()

1. Personal Loan shows little variation with Age and Experience.
2. Income has a clear effect on Personal Loan: customers with a high income are more likely to have a personal loan.
3. CCAvg also shows a clear relationship with Personal Loan: customers with a personal loan have higher average monthly credit card spending.
4. Customers with a high Mortgage are more likely to have opted for a personal loan.

<b>Categorical Variable

In [None]:
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(categorical_variables):
    ax=fig.add_subplot(2,3,i+1)
    sns.barplot(x=col,y='Personal Loan',data=df,ci=None)

1. Customers with a family size of 3 are the most likely to have a personal loan.
2. Customers with an undergraduate degree are less likely to have a personal loan than customers with a Graduate or Advanced/Professional degree.
3. Customers with a CD Account or a Securities Account are more likely to have a personal loan.
4. Whether a customer uses online facilities or has a credit card has little effect on the chances of having a personal loan.
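
The bar heights in the plots above are group means of the 0/1 target, which is the share of customers in each group who accepted the loan. The same numbers can be tabulated with a `groupby`. This is a minimal sketch on a hypothetical toy frame that reuses the dataset's column names; the values are made up.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset.
toy = pd.DataFrame({
    'Education':     [1, 1, 1, 1, 2, 2, 3, 3],
    'Personal Loan': [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 target within each group is the acceptance rate.
rate = toy.groupby('Education')['Personal Loan'].mean()
print(rate)
```

Applied to the real data, `df.groupby(col)['Personal Loan'].mean()` for each `col` in `categorical_variables` would give the values behind each bar.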

<b>Income is a strong attribute affecting the chances of having a personal loan: the higher the income, the higher the chances. So we analyse Income against the other attributes

In [None]:
con=continuous_variables.copy()
con.remove('Income')

In [None]:
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(con):
    ax=fig.add_subplot(2,3,i+1)
    # x and y must be keyword arguments in newer seaborn versions
    sns.scatterplot(x='Income',y=col,hue='Personal Loan',data=df,ax=ax)

1. Age and Experience have little effect: customers with a high income are likely to have a personal loan regardless of age group.
2. ZIP Code shows that the customers all come from one particular area, with no pattern in the chances of having a personal loan.
3. Customers with a high income and a personal loan show high average monthly credit card spending and high house mortgage values.

In [None]:
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(categorical_variables):
    ax=fig.add_subplot(2,3,i+1)
    # x and y must be keyword arguments in newer seaborn versions
    sns.scatterplot(x=col,y='Income',hue='Personal Loan',data=df,ax=ax)

1. Customers with higher degrees, a family size greater than 3, and a high income have personal loans.
2. Customers with a CD Account are likely to have a personal loan.

<b>After Income, we look at the relation of CCAvg with the other attributes

In [None]:
con.remove('CCAvg')

In [None]:
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(con):
    ax=fig.add_subplot(2,2,i+1)
    # x and y must be keyword arguments in newer seaborn versions
    sns.scatterplot(x='CCAvg',y=col,hue='Personal Loan',data=df,ax=ax)

1. Age and Experience show the same patterns for customers with and without a personal loan.
2. Customers with high CCAvg and high Mortgage are more likely to have a personal loan.

In [None]:
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(categorical_variables):
    ax=fig.add_subplot(2,3,i+1)
    # x and y must be keyword arguments in newer seaborn versions
    sns.scatterplot(x=col,y='CCAvg',hue='Personal Loan',data=df,ax=ax)

1. CCAvg shows trends similar to those of Income.

In [None]:
fig=plt.figure(figsize=(20,10))
for i,col in enumerate(categorical_variables):
    ax=fig.add_subplot(2,3,i+1)
    sns.countplot(x=col,hue='Personal Loan',data=df)

<b>These graphs confirm what we observed above, for example that customers with a CD Account are more likely to have a personal loan

# Feature Engineering

### Data Cleaning

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

<b>Since the data contains no duplicate rows, we can set ID as the index of the dataframe

In [None]:
df.set_index("ID",inplace=True)

<b>Since ZIP Code doesn't show any effect on the chances of a personal loan, it is better to remove it from our data

In [None]:
df.drop('ZIP Code',axis=1,inplace=True)

ZIP Code covers a single area and therefore varies very little, so we removed it.

<b>Finding relationship between Experience and Age

In [None]:
corr=df.corr()
plt.figure(figsize=(10,10))
plt.title('Correlation')
sns.heatmap(corr > 0.90, annot=True, square=True)

In [None]:
df[['Age','Experience','Personal Loan']].corr()

Since Age shows a slightly better correlation with Personal Loan than Experience does, we drop the Experience attribute.

In [None]:
df.drop('Experience',axis=1,inplace=True)

<b>Creating Attributes

We create a new feature, Account, as the sum of CD Account and Securities Account: it is 0 if the customer has neither account and positive if the customer has at least one (2 if both). We then check its correlation with the target variable. If it relates to the target better than the two original attributes we keep it, otherwise we remove it.

In [None]:
df['Account']=df['CD Account']+df['Securities Account']

In [None]:
df[['CD Account','Securities Account','Account','Personal Loan']].corr()

Since Account shows a weaker correlation with Personal Loan than CD Account does, we drop it.
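
Adding the two columns produces a count (0, 1, or 2) rather than a strict 0/1 flag. If a pure "has at least one account" indicator were wanted instead, it could be derived from the sum. This is a minimal sketch on a hypothetical toy frame; the names `Account_count` and `Account_any` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data with the dataset's column names.
toy = pd.DataFrame({
    'CD Account':         [0, 1, 0, 1],
    'Securities Account': [0, 0, 1, 1],
})

# Sum of the two columns: counts the accounts held (0, 1, or 2).
toy['Account_count'] = toy['CD Account'] + toy['Securities Account']

# Indicator: 1 if the customer holds at least one of the two accounts.
toy['Account_any'] = (toy['Account_count'] > 0).astype(int)
print(toy)
```

The two encodings can correlate differently with the target, so the choice between them is worth checking on the real data.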

In [None]:
df.drop('Account',axis=1,inplace=True)

Next we create an attribute, Facilities, as the sum of Online and CreditCard: it is 0 if the customer uses neither online banking nor a bank-issued credit card and positive otherwise (2 if both). We again check its relation with the target variable and keep the attribute if it relates to the target better than the original attributes.

In [None]:
df['Facilities']=df['Online']+df['CreditCard']

In [None]:
df[['Facilities','Online','CreditCard','Personal Loan']].corr()

We will keep the Facilities attribute and drop the Online and CreditCard attributes

In [None]:
df.drop(['Online','CreditCard'],axis=1,inplace=True)

In [None]:
df.head()

<b>Applying Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

In [None]:
scaled_df=scaler.fit_transform(df.drop('Personal Loan',axis=1))

In [None]:
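# keep the customer IDs as index so the features stay aligned with df['Personal Loan']
scaled_df=pd.DataFrame(scaled_df,index=df.index)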

In [None]:
scaled_df.columns=df.drop('Personal Loan',axis=1).columns
scaled_df.head()
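
Fitting the scaler on the full dataset before the train/test split lets statistics from the test rows influence the training features. One common alternative is to fit the scaler on the training split only, for example through a scikit-learn pipeline. This is a minimal sketch on synthetic data: `make_classification` stands in for the loan data, and this is not the approach used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the binary target.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100
)

# The pipeline fits the scaler on the training rows only, then the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# At prediction time the test rows are scaled with the training mean and std.
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

With this layout the same `pipe` object can be passed to cross-validation utilities without re-implementing the scaling step.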

# Model Development

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [None]:
X=scaled_df
y=df['Personal Loan']

In [None]:
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=100)
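
When the target is imbalanced (in this dataset far fewer customers accepted the loan than declined it), a purely random split can give the training and test sets slightly different positive rates. Passing `stratify` to `train_test_split` keeps the class proportions equal in both. This is a minimal sketch with a made-up target containing 10% positives; it is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 100 positives out of 1000 rows.
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```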

In [None]:
model_list=[]
model_f1_score=[]
model_accuracy_score=[]

### LogisticRegression

In [None]:
model_list.append('LogisticRegression')
lm=LogisticRegression()

In [None]:
lm.fit(x_train,y_train)

In [None]:
yhat_lm=lm.predict(x_test)

In [None]:
lm_score=f1_score(y_test,yhat_lm)
model_f1_score.append(lm_score)
lm_score

In [None]:
lm_accuracy=accuracy_score(y_test,yhat_lm)
model_accuracy_score.append(lm_accuracy)
lm_accuracy

In [None]:
print(classification_report(y_test,yhat_lm))

In [None]:
sns.heatmap(confusion_matrix(y_test,yhat_lm),annot=True,fmt='',cmap='YlGnBu')
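
The accuracy and F1 score computed above can both be read off the confusion matrix. This is a minimal sketch with made-up counts (not results from this notebook) showing why the two metrics can differ noticeably when the positive class is rare.

```python
# Hypothetical confusion-matrix counts; the positive class is "loan accepted".
tn, fp, fn, tp = 1340, 10, 50, 100

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy is dominated by the many true negatives, while F1 depends only on how the positive class is handled, which is why both are tracked for every model here.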

### DecisionTreeClassifier

In [None]:
model_list.append('DecisionTreeClassifier')
tree=DecisionTreeClassifier()

In [None]:
tree.fit(x_train,y_train)

In [None]:
yhat_tree=tree.predict(x_test)

In [None]:
tree_score=f1_score(y_test,yhat_tree)
model_f1_score.append(tree_score)
tree_score

In [None]:
tree_accuracy=accuracy_score(y_test,yhat_tree)
model_accuracy_score.append(tree_accuracy)
tree_accuracy

In [None]:
print(classification_report(y_test,yhat_tree))

In [None]:
sns.heatmap(confusion_matrix(y_test,yhat_tree),annot=True,fmt='',cmap='YlGnBu')

### RandomForestClassifier

In [None]:
model_list.append('RandomForestClassifier')
forest=RandomForestClassifier()

In [None]:
forest.fit(x_train,y_train)

In [None]:
yhat_forest=forest.predict(x_test)

In [None]:
forest_score=f1_score(y_test,yhat_forest)
model_f1_score.append(forest_score)
forest_score

In [None]:
forest_accuracy=accuracy_score(y_test,yhat_forest)
model_accuracy_score.append(forest_accuracy)
forest_accuracy

In [None]:
print(classification_report(y_test,yhat_forest))

In [None]:
sns.heatmap(confusion_matrix(y_test,yhat_forest),annot=True,fmt='',cmap='YlGnBu')

### SVC

In [None]:
model_list.append('SVC')
svc=SVC()

In [None]:
svc.fit(x_train,y_train)

In [None]:
yhat_svc=svc.predict(x_test)

In [None]:
svc_score=f1_score(y_test,yhat_svc)
model_f1_score.append(svc_score)
svc_score

In [None]:
svc_accuracy=accuracy_score(y_test,yhat_svc)
model_accuracy_score.append(svc_accuracy)
svc_accuracy

In [None]:
print(classification_report(y_test,yhat_svc))

In [None]:
sns.heatmap(confusion_matrix(y_test,yhat_svc),annot=True,fmt='',cmap='YlGnBu')

### KNeighborsClassifier

In [None]:
model_list.append('KNeighborsClassifier')
neighbour=KNeighborsClassifier()

In [None]:
neighbour.fit(x_train,y_train)

In [None]:
yhat_neighbour=neighbour.predict(x_test)

In [None]:
neighbour_score=f1_score(y_test,yhat_neighbour)
model_f1_score.append(neighbour_score)
neighbour_score

In [None]:
neighbour_accuracy=accuracy_score(y_test,yhat_neighbour)
model_accuracy_score.append(neighbour_accuracy)
neighbour_accuracy

In [None]:
print(classification_report(y_test,yhat_neighbour))

In [None]:
sns.heatmap(confusion_matrix(y_test,yhat_neighbour),annot=True,fmt='',cmap='YlGnBu')
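
The five model sections above repeat the same fit, predict, and score steps. They could be condensed into a single loop over a dictionary of estimators. This is a minimal sketch on synthetic data: `make_classification` stands in for the scaled loan features, and the `random_state` arguments on the tree-based models are an added assumption to make runs repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features and the target.
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100
)

# random_state is an added assumption so tree-based results are repeatable.
models = {
    'LogisticRegression': LogisticRegression(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
    'RandomForestClassifier': RandomForestClassifier(random_state=0),
    'SVC': SVC(),
    'KNeighborsClassifier': KNeighborsClassifier(),
}

# Fit each model and collect both metrics on the held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    results[name] = {
        'f1': f1_score(y_te, y_pred),
        'accuracy': accuracy_score(y_te, y_pred),
    }

for name, scores in results.items():
    print(f"{name}: f1={scores['f1']:.3f}, accuracy={scores['accuracy']:.3f}")
```

The `results` dictionary plays the role of `model_list`, `model_f1_score`, and `model_accuracy_score` in one structure.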

### Plotting the Results

<b>F1-Score

In [None]:
fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(x=model_list,y=model_f1_score,ax=ax)
ax.set_title("F1 Score of Models on Test Data",pad=20)
ax.set_xlabel("Models",labelpad=20)
ax.set_ylabel("F1_Score",labelpad=20)
plt.xticks(rotation=90)

for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate('{:.0%}'.format(height), (x+0.25, y + height + 0.01))

<b>Accuracy Score

In [None]:
fig,ax=plt.subplots(figsize=(10,6))
sns.barplot(x=model_list,y=model_accuracy_score,ax=ax)
ax.set_title("Accuracy of Models on Test Data",pad=20)
ax.set_xlabel("Models",labelpad=20)
ax.set_ylabel("Accuracy",labelpad=20)
plt.xticks(rotation=90)

for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate('{:.0%}'.format(height), (x+0.25, y + height + 0.01))

# Conclusion

Among the five models implemented, DecisionTreeClassifier and RandomForestClassifier give the same, best results on the test data, with an accuracy of almost 98% and an F1 score of about 91%.