# Overview

This case is about a bank called Thera Bank whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with minimal budget.

# Objective 

The classification goal is to predict the likelihood of a liability customer buying personal loans.

# Data Description

The file Bank_loan.csv contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
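The class balance quoted above matters for everything that follows, so it helps to spell out the arithmetic. The short sketch below uses only the two figures from the description (5000 customers, 480 acceptances); the variable names are illustrative. It also derives the accuracy of a trivial model that always predicts "no loan", which is a useful floor when reading accuracy figures later on.

```python
# Figures taken from the data description above.
total_customers = 5000
accepted = 480

# Share of customers in the positive class (accepted the loan).
acceptance_rate = accepted / total_customers

# Accuracy of a model that always predicts "no loan" (majority class).
baseline_accuracy = 1 - acceptance_rate

print(f"Acceptance rate: {acceptance_rate:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Because of this imbalance, precision and recall on the positive class are more informative than accuracy alone.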



### Data Information:

1. **ID** : Customer ID

2. **Age** : Customer's age in completed years

3. **Experience** : Years of professional experience

4. **Income** : Annual income of the customer ($000)

5. **ZIP Code** : Home Address ZIP code.

6. **Family** : Family size of the customer

7. **CCAvg** : Avg. spending on credit cards per month ($000)

8. **Education** : Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional

9. **Mortgage** : Value of house mortgage if any. ($000)

10. **Personal Loan** : Did this customer accept the personal loan offered in the last campaign?

11. **Securities Account** : Does the customer have a securities account with the bank?

12. **CD Account** : Does the customer have a certificate of deposit (CD) account with the bank?

13. **Online** : Does the customer use internet banking facilities?

14. **CreditCard** : Does the customer use a credit card issued by UniversalBank?

# Activities to be Performed

* Exploratory Data Analysis (EDA)

* Data Cleaning

* Building Model

* Model Evaluation

In [1]:
# Importing the dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [2]:
data = pd.read_csv('Data/Bank_loan.csv')

In [3]:
data.head()

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


In [4]:
# Shape of Dataset
data.shape

(5000, 14)

* Rows - 5000
* Columns - 14

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB


In [6]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,5000.0,2500.5,1443.520003,1.0,1250.75,2500.5,3750.25,5000.0
Age,5000.0,45.3384,11.463166,23.0,35.0,45.0,55.0,67.0
Experience,5000.0,20.1046,11.467954,-3.0,10.0,20.0,30.0,43.0
Income,5000.0,73.7742,46.033729,8.0,39.0,64.0,98.0,224.0
ZIP Code,5000.0,93152.503,2121.852197,9307.0,91911.0,93437.0,94608.0,96651.0
Family,5000.0,2.3964,1.147663,1.0,1.0,2.0,3.0,4.0
CCAvg,5000.0,1.937938,1.747659,0.0,0.7,1.5,2.5,10.0
Education,5000.0,1.881,0.839869,1.0,1.0,2.0,3.0,3.0
Mortgage,5000.0,56.4988,101.713802,0.0,0.0,0.0,101.0,635.0
Personal Loan,5000.0,0.096,0.294621,0.0,0.0,0.0,0.0,1.0


In [7]:
#checking for null values in the dataset
data.isna().sum()

ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64

There are no null values in the dataset.

# Exploratory Data Analysis

In [None]:
sns.pairplot(data)

In [None]:
data.corr()

From the correlation table above, the customer ID shows little relationship with the other attributes, with weak negative correlations against almost all of them.

In [None]:
# Checking any outliers present in Age Attribute
sns.boxplot(data['Age'])

In [None]:
# Checking the distribution of Age Attribute
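# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay gives a comparable view
sns.histplot(data['Age'], kde=True, color='b')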

In [None]:
data['Age'].mean()

In [None]:
data['Age'].median()

The Age attribute is roughly symmetrically distributed: the mean (45.3) and median (45) nearly coincide.

In [None]:
sns.boxplot(data['Experience'])

In [None]:
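# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay gives a comparable view
sns.histplot(data['Experience'], kde=True)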

In [None]:
data['Experience'].mean()

In [None]:
data['Experience'].median()

The Experience attribute is roughly symmetric as well, with most customers having between 10 and 30 years of experience; the mean (20.1) and median (20) are nearly equal. The column also contains negative values (minimum -3), which are most likely data-entry errors. We will therefore replace each of them with the median experience of customers with positive experience who have the same age and education level.

In [None]:
data[data['Experience'] < 0].head()

In [None]:
# Checking how many rows contain negative Experience values
data[data['Experience'] < 0]['Experience'].count()

In [None]:
sns.boxplot(data['Income'])

**Income** is positively skewed. The middle half of the customers have an income between 39K and 98K (the interquartile range reported by `describe()`), and the boxplot above confirms the long right tail.
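Skewness can also be checked numerically rather than read off a plot. The sketch below is a minimal illustration on made-up values (not taken from Bank_loan.csv); `shape_summary` is a hypothetical helper name. A mean well above the median together with a positive `skew()` is the usual signature of a right-skewed column.

```python
import pandas as pd

def shape_summary(values):
    """Return mean, median and sample skewness of a numeric column."""
    s = pd.Series(values, dtype="float64")
    return {"mean": s.mean(), "median": s.median(), "skew": s.skew()}

# Toy values (assumed for illustration): one large value creates a long right tail.
toy_income = [20, 30, 40, 50, 200]
summary = shape_summary(toy_income)
print(summary)
```

On the real data the same helper could be called as `shape_summary(data['Income'])` or `shape_summary(data['CCAvg'])`.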

In [None]:
sns.boxplot(data['CCAvg'])

**CCAvg** is also positively skewed: monthly spending ranges from 0K to 10K, and 75% of the customers spend less than 2.5K.

In [None]:
sns.boxplot(data['Family'])

In [None]:
sns.boxplot(data['Mortgage'])

**Mortgage**: most customers have a mortgage of less than 40K (the median is 0, so at least half have none), but the values span a wide range from 0K to 635K.
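The points drawn beyond the whiskers of a boxplot follow the usual 1.5 × IQR rule, and they can be counted directly. The sketch below assumes that rule and runs on made-up values; `iqr_outliers` is a hypothetical helper, not part of pandas or seaborn.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]

# Toy values (assumed for illustration): mostly zero, with one very large mortgage.
toy_mortgage = [0, 0, 0, 0, 100, 120, 635]
outliers = iqr_outliers(toy_mortgage)
print(len(outliers), "outlier(s):", outliers.tolist())
```

Applied to the notebook's data, `iqr_outliers(data['Mortgage'])` would list the customers that the boxplot marks as outliers.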

# Data Cleaning

In [None]:
PositiveExp = data.loc[data['Experience'] > 0]
NegativeExp = data['Experience'] < 0
col_name = 'Experience'
my_list = data.loc[NegativeExp]['ID'].tolist()

For the record with the ID, get the value of Age column

For the record with the ID, get the value of Education column

Filter the records matching the above criteria from the data frame which has records with positive experience and take the median

Apply the median back to the location which had negative experience

In [None]:
for cust_id in my_list:
    # Boolean mask selecting the record with this customer ID
    row = data['ID'] == cust_id
    age = data.loc[row, 'Age'].iloc[0]
    education = data.loc[row, 'Education'].iloc[0]
    # Customers with positive experience and the same age and education level
    filtered_data = PositiveExp[(PositiveExp['Age'] == age) & (PositiveExp['Education'] == education)]
    # The median is NaN when no matching customer exists
    Exp = filtered_data['Experience'].median()
    data.loc[row, 'Experience'] = Exp
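The same imputation can be sketched without an explicit loop by computing a per-group median and writing it back only where Experience is negative. This is an alternative formulation under the same assumptions as the steps above (match on Age and Education, use only positive experience); `impute_negative_experience` is a hypothetical helper and the toy frame is not taken from Bank_loan.csv.

```python
import numpy as np
import pandas as pd

def impute_negative_experience(df):
    """Replace negative Experience with the median positive Experience
    of customers sharing the same Age and Education."""
    out = df.copy()
    exp = out["Experience"].astype(float)
    # Keep only positive values; everything else becomes NaN and is ignored by median.
    positive = exp.where(exp > 0)
    group_median = positive.groupby([out["Age"], out["Education"]]).transform("median")
    negative = exp < 0
    # Where Experience is negative, take the group median; otherwise keep the value.
    out["Experience"] = exp.where(~negative, group_median)
    return out

# Toy data (assumed for illustration).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [25, 25, 25, 30, 30],
    "Education": [1, 1, 1, 2, 2],
    "Experience": [-1, 2, 4, -2, 0],
})
cleaned = impute_negative_experience(toy)
print(cleaned)
```

In the toy frame the second negative value has no positive-experience match in its group, so it becomes NaN; rows like that need to be dropped or filled separately.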

In [None]:
# checking if there are records with negative experience
data[data['Experience'] < 0]['Experience'].count()

In [None]:
data.describe().T

In [None]:
# Checking the relation between Education, Income and Personal Loan.
sns.boxplot(x='Education', y='Income', hue='Personal Loan', data=data)

Customers with education level 1 have higher incomes overall; however, customers who have taken the personal loan have similar income levels across all education levels.

In [None]:
top5_loc = data[data['Personal Loan']==1]['ZIP Code'].value_counts().head(5)
top5_loc

In [None]:
sns.boxplot(x="Education", y='Mortgage', hue="Personal Loan", data=data)

**Observation :**

Customers who have personal loans also tend to have higher mortgages.

In [None]:
# Is there any influence of family size on whether a customer accepts a personal loan or not?

sns.countplot(x='Family',data=data,hue='Personal Loan',palette='Set1')

In [None]:
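from scipy import stats

# Compare family size between customers who accepted the loan (1) and those who did not (0)
stats.ttest_ind(data[data['Personal Loan'] == 1]['Family'], data[data['Personal Loan'] == 0]['Family'])


If the p-value of this test is above 0.05, family size shows no statistically significant effect on the decision to take a loan; a smaller p-value would indicate that the two groups differ.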
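Because Family takes only four distinct values, a chi-square test of independence on the Family × Personal Loan contingency table is a reasonable alternative to the t-test. This is a swapped-in technique, not the one used in the notebook, and the sketch runs on made-up data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (assumed for illustration), not taken from Bank_loan.csv.
toy = pd.DataFrame({
    "Family": [1, 1, 2, 2, 3, 3, 4, 4, 1, 2, 3, 4],
    "Personal Loan": [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0],
})

# Rows: family size; columns: loan not accepted (0) / accepted (1).
table = pd.crosstab(toy["Family"], toy["Personal Loan"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With the notebook's data the table would be built as `pd.crosstab(data['Family'], data['Personal Loan'])`.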

In [None]:
sns.countplot(x='CD Account',hue='Personal Loan',data=data)

Customers who do not have a CD account mostly do not have a personal loan either, and they make up the majority of the data. Among customers who do have a CD account, a much larger share also have a loan.
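A count plot shows absolute numbers, so group sizes can hide the actual acceptance rate. The sketch below computes the rate per group directly. `acceptance_rate_by` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd

def acceptance_rate_by(df, column, target="Personal Loan"):
    """Customers, acceptances and acceptance rate for each value of `column`."""
    return df.groupby(column)[target].agg(customers="count", accepted="sum", rate="mean")

# Toy data (assumed for illustration).
toy = pd.DataFrame({
    "CD Account": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "Personal Loan": [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
})
rates = acceptance_rate_by(toy, "CD Account")
print(rates)
```

The same call with `data` in place of `toy` would give the share of CD account holders who accepted the loan, which is the figure the observation above relies on.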

In [None]:
# Personal Loan Vs Credit Card Average

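# distplot is deprecated since seaborn 0.11; histplot on a density scale keeps both groups comparable
sns.histplot(data[data['Personal Loan'] == 0]['CCAvg'], kde=True, stat='density', color='b')
sns.histplot(data[data['Personal Loan'] == 1]['CCAvg'], kde=True, stat='density', color='g')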

In [None]:
data[data['Personal Loan'] == 0]['CCAvg'].median()*1000

In [None]:
data[data['Personal Loan'] == 1]['CCAvg'].median()*1000

The graph shows that customers who have a personal loan have higher average credit card spending.

A median monthly credit card spend of about 3,800 dollars is associated with a higher probability of taking a personal loan, while customers with lower spending (median of about 1,400 dollars) are less likely to take one.

In [None]:
# HeatMap

plt.figure(figsize=(25, 25))
ax = sns.heatmap(data.corr(), vmax=.8, square=True, fmt='.2f', annot=True, linecolor='white', linewidths=0.01)
plt.title('Correlation Matrix')
rotx = ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
roty = ax.set_yticklabels(ax.get_yticklabels(), rotation=30)
plt.show()

Age and Experience are highly correlated.

Income and CCAvg are also correlated.
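Reading strongly correlated pairs off a large heatmap is error-prone, so they can also be listed programmatically. The sketch below is one way to do it; `high_corr_pairs`, the 0.8 threshold and the toy values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (assumed for illustration): Experience tracks Age, Income does not.
toy = pd.DataFrame({
    "Age": [25, 35, 45, 55, 65],
    "Experience": [1, 10, 20, 30, 41],
    "Income": [100, 20, 80, 30, 60],
})
pairs = high_corr_pairs(toy)
print(pairs)
```

When two features are this strongly correlated, keeping both adds little information for a linear model, which is worth bearing in mind before fitting logistic regression.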

In [None]:
# Drop any rows left with NaN (e.g. Experience values that could not be imputed)
data = data.dropna()

In [None]:
# Splitting features and targets
x = data.drop('Personal Loan', axis=1)
y = data['Personal Loan']

# Splitting data into Train and Test

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=80)

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape
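With only about 9.6% positive cases, a purely random split can leave the train and test sets with slightly different class ratios. Passing `stratify` to `train_test_split` keeps the ratio the same in both parts. The sketch below demonstrates this on synthetic data with an assumed 10% positive rate; it is an optional variation, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): 1000 rows, 10% positive class.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=80, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```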

# Building Model

## 1.) Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()

In [None]:
case = lr_model.fit(x_train, y_train)

In [None]:
y_pred =lr_model.predict(x_test)
y_pred

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

lr_accuracy = accuracy_score(y_test, y_pred)*100
lr_precision = precision_score(y_test, y_pred)*100
lr_recall = recall_score(y_test, y_pred)*100
lr_f1score = f1_score(y_test, y_pred)*100

print("Accuracy : {:.2f}%".format(lr_accuracy))
print("Precision : {:.2f}%".format(lr_precision))
print("Recall : {:.2f}%".format(lr_recall))
print("F1 Score : {:.2f}%".format(lr_f1score))

In [None]:
from sklearn.metrics import confusion_matrix

conf = confusion_matrix(y_test, y_pred)
# Class 0 = loan not accepted, class 1 = loan accepted
class_names = ['Negative', 'Positive']
plt.clf()
plt.imshow(conf, interpolation='nearest', cmap=plt.cm.Wistia)
plt.title('Confusion matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names, rotation=45)
plt.yticks(tick_marks, class_names)
# scikit-learn layout: rows are actual classes, columns are predicted classes
s = [['TN', 'FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j, i, str(s[i][j]) + '=' + str(conf[i][j]))
plt.show()
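The four cells of the confusion matrix are enough to reproduce accuracy, precision, recall and F1 by hand, which makes it easier to see how the printed metrics relate to the plot. The sketch below uses the standard definitions; `metrics_from_counts` is a hypothetical helper and the counts are made up, not results from this notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Precision: of all predicted positives, how many were truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all actual positives, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts (assumed for illustration).
m = metrics_from_counts(tn=940, fp=10, fn=20, tp=30)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

With scikit-learn's layout the counts can be unpacked from a 2×2 matrix as `tn, fp, fn, tp = conf.ravel()`.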

## 2.) Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc_model = DecisionTreeClassifier(criterion = 'entropy' , max_depth = 3)

In [None]:
dtc_model.fit(x_train, y_train)

In [None]:
y_pred = dtc_model.predict(x_test)

In [None]:
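# scikit-learn metrics expect (y_true, y_pred); scale to percent to match the print format below
dtc_accuracy = accuracy_score(y_test, y_pred)*100
dtc_precision = precision_score(y_test, y_pred)*100
dtc_recall = recall_score(y_test, y_pred)*100
dtc_f1score = f1_score(y_test, y_pred)*100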

print("Accuracy : {:.2f}%".format(dtc_accuracy))
print("Precision : {:.2f}%".format(dtc_precision))
print("Recall : {:.2f}%".format(dtc_recall))
print("F1 Score : {:.2f}%".format(dtc_f1score))

In [None]:
confusion_matrix(y_test, y_pred)
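scikit-learn's metric functions take their arguments in the order `(y_true, y_pred)`. Accuracy and F1 are symmetric in the two arguments for binary labels, but precision and recall are not: swapping the arguments exchanges them. The sketch below shows this on a small made-up example.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels (assumed for illustration): 3 actual positives, 2 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 0, 0, 1, 0, 0, 0, 0]

correct_precision = precision_score(y_true, y_hat)
correct_recall = recall_score(y_true, y_hat)

# Passing the arguments in the wrong order silently returns the other metric.
swapped_precision = precision_score(y_hat, y_true)

print("precision:", correct_precision)
print("recall:", correct_recall)
print("precision with swapped arguments:", swapped_precision)
```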