<a href="https://colab.research.google.com/github/Leon-Castelino/Internship-Project/blob/master/Bank_Loan_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Internship Project Title: Marketing Campaign for Banking Products**

**Abstract:**

A bank has a data file ***'Bank_Loan.csv'*** which contains data on 5000 customers. This bank has a growing customer base and wants to increase its base of borrowers (asset customers) to bring in more loan business and earn more through the interest on loans. A campaign that the bank ran last year for liability customers showed that only 480 out of 5000 customers (i.e., 9.6%) accepted the personal loan that was offered to them. The bank wants to convert the liability customers to personal loan customers while retaining them as depositors.

The goal of the department is to build a model that will help them identify the potential customers who have a higher probability of purchasing the loan which will eventually increase the success ratio while at the same time reduce the cost of the campaign. 

The above mentioned datafile includes customer demographic information (age, income etc), customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan).



### **STEP 1**

**1.1 Importing the Required Datasets and Libraries**

In [None]:
#importing all the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

#plot outputs will be displayed inline in the notebook
%matplotlib inline

In [None]:
#uploading the .csv file
from google.colab import files
uploaded=files.upload()

In [None]:
#reading the .csv file
import io
data=pd.read_csv(io.BytesIO(uploaded['Bank_Loan.csv']))

In [None]:
#checking all column names
data.columns

In [None]:
#checking the top 5 elements to ensure the data starts from index 0
data.head()

In [None]:
#checking the last 5 rows to confirm 5000 entries
data.tail()

**1.2. Checking the DataTypes**

In [None]:
#displaying datatype of each column and non-null rows for each column
data.info()

**1.3 Checking the Summary stats**

In [None]:
#displaying the summary stats i.e. mean, standard deviation, count etc
data.describe()

**1.4 Checking Null Values**

In [None]:
#display number of null values present in each column
data.isnull().sum()   

**1.5 Checking the Shape of the DataFrame**

In [None]:
#displaying number of rows and columns
data.shape

### **STEP 2**

**2. Preprocessing the Data**

In [None]:
#counting negative values in each column
data.lt(0).sum()

Only 'Experience' column consists of negative values in the entire dataset.

In [None]:
#replacing negative values with np.nan (zero is a valid value and is kept)
data['Experience']=data['Experience'].mask(data['Experience']<0)

In [None]:
#replacing np.nan values with median (fillna returns a new Series, so assign it back)
data['Experience']=data['Experience'].fillna(data['Experience'].median())

In [None]:
#displaying the summary stats i.e. mean, standard deviation, count etc
data.describe()
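One detail of the imputation step is easy to miss: `Series.fillna` returns a new Series instead of changing the column, so the result has to be assigned back. The following is a minimal sketch on made-up values; the toy `exp` Series is an assumption for illustration and is not taken from 'Bank_Loan.csv'.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Experience' column (hypothetical values)
exp = pd.Series([5.0, -1.0, 10.0, -3.0, 20.0])

# Mark negative entries as missing
exp = exp.mask(exp < 0)

# fillna returns a new Series, so the result must be assigned back
exp = exp.fillna(exp.median())
print(exp.tolist())
```

The median is computed from the remaining valid values only, because pandas skips NaN by default.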

Since our main motive is to predict the likelihood of a customer buying a personal loan using a Logistic Regression model, it is necessary to avoid multicollinearity: highly correlated features carry overlapping information and would be given too much combined weight by the model.

Here, we expect that the 'ID' and 'Experience' columns will not play a major role in whether a customer buys a personal loan, and we also suspect that the 'Age' and 'Experience' columns are correlated.

By using scatter plots and .corr() function we can find out the correlation between 'Age' and 'Experience'.

In [None]:
#using scatter plots to check linearity between 'Age' and 'Experience'
x=data['Age']
y=data['Experience']
plt.scatter(x,y,s=10,color='green')
plt.xlabel('Age')
plt.ylabel('Experience')
plt.show()

In [None]:
#depicting correlation between 'Age' and 'Experience'
exp=data['Experience']
age=data['Age']
correlation=exp.corr(age)
correlation

Here, we can see that 'Age' and 'Experience' columns are correlated and either one of them can be dropped along with 'ID' to avoid multicollinearity.
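To illustrate why two such columns are nearly redundant, here is a small sketch on synthetic data. It assumes, purely for illustration, that experience is roughly age minus 25 plus a little noise; the numbers are not taken from the bank dataset.

```python
import numpy as np

# Hypothetical ages; experience assumed to start around age 25, plus small noise
rng = np.random.default_rng(0)
age = rng.integers(25, 65, size=200)
experience = age - 25 + rng.integers(-1, 2, size=200)

# Pearson correlation coefficient between the two toy columns
r = np.corrcoef(age, experience)[0, 1]
print(round(r, 3))
```

When the coefficient is this close to 1, one column can be predicted almost exactly from the other, which is the situation that motivates dropping one of them.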

In [None]:
#dropping irrelevant columns
data=data.drop(['ID','Experience'],axis=1)
data.head()

### **STEP 3**

**3. Exploratory Data Analysis (EDA)**

**3.1 Unique Values**

In [None]:
#displaying number of unique values in each column
data.nunique()

We saw that the 'ZIP Code' column has 467 unique values, which is very large. Since ZIP Code is a nominal categorical variable, we could create dummy variables using one-hot encoding, but that would add hundreds of sparse columns and worsen the multicollinearity problem. Hence we drop the 'ZIP Code' column.

In [None]:
#dropping 'ZIP Code' column
data=data.drop(['ZIP Code'], axis=1)
data.head()
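The cost of one-hot encoding a high-cardinality column can be seen on a tiny example. This is a sketch with made-up ZIP codes (an assumption, not values from the dataset): every distinct code becomes its own column.

```python
import pandas as pd

# Hypothetical ZIP codes: each distinct code becomes its own 0/1 column
zips = pd.Series(['94720', '90089', '94720', '91107', '90024'], name='ZIP Code')
dummies = pd.get_dummies(zips)
print(dummies.shape)  # (rows, number of distinct codes)
```

With 467 distinct codes the same call would produce 467 columns, most of them almost entirely zero.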

**3.2 People with Zero Mortgage**

In [None]:
#displaying total number of people with zero mortgage
(data.Mortgage==0).sum()

**3.3 People with Zero CC per month**

In [None]:
#displaying total number of people with zero credit card spending per month
(data.CCAvg==0).sum()

**3.4 Value Counts of Categorical Values**

Apart from 'Age', 'Income', 'CCAvg' and 'Mortgage', all remaining columns are categorical in nature.

In [None]:
#displaying counts of values of 'Family' column in descending order
data['Family'].value_counts()

In [None]:
#displaying counts of values of 'Education' column in descending order 
data['Education'].value_counts()

In [None]:
#displaying counts of values of 'Securities Account' column in descending order
data['Securities Account'].value_counts()

In [None]:
#displaying counts of values of 'CD Account' column in descending order
data['CD Account'].value_counts()

In [None]:
#displaying counts of values of 'Online' column in descending order
data['Online'].value_counts()

In [None]:
#displaying counts of values of 'CreditCard' column in descending order
data['CreditCard'].value_counts()

**3.5 Univariate and Bivariate Analysis**

**A). Univariate Analysis**





We use 'distplot' to plot the distribution of a single variable; in this case, a numerical variable.

In [None]:
#'Age' column is symmetrically distributed
sns.distplot(data['Age'])

In [None]:
#'Income' column is positively skew distributed
sns.distplot(data['Income'])

In [None]:
#'CCAvg' column is positively skew distributed
sns.distplot(data['CCAvg'])

In [None]:
#'Mortgage' column is highly skew distributed
sns.distplot(data['Mortgage'])

We use 'countplot' to plot a categorical variable.

In [None]:
#analysis of 'Family' column
sns.countplot(x=data['Family'])

In [None]:
#analysis of 'Education' column
sns.countplot(x=data['Education'])

In [None]:
#analysis of 'Securities Account' column
sns.countplot(x=data['Securities Account'])

In [None]:
#analysis of 'CD Account' column
sns.countplot(x=data['CD Account'])

In [None]:
#analysis of 'Online' column
sns.countplot(x=data['Online'])

In [None]:
#analysis of 'CreditCard' column
sns.countplot(x=data['CreditCard'])

**B). Multivariate Analysis**

In [None]:
#plotting graphs of all columns with respect to each other
sns.pairplot(data)

We use 'boxplot' to compare the distribution of a numerical variable across the levels of a categorical variable, split by whether or not the customer accepted a Personal Loan.

(Only 4 graphs have been plotted)

In [None]:
#Customers with higher Income accepted the Personal Loan at every Education level
sns.boxplot(x='Education',y='Income',hue='Personal Loan',data=data,showfliers=False)

In [None]:
#Customers with higher Income accepted the Personal Loan for both values of 'CreditCard'
sns.boxplot(x='CreditCard',y='Income',hue='Personal Loan',data=data,showfliers=False)

In [None]:
#Customers with higher credit card spending (CCAvg) accepted the Personal Loan for both values of 'CreditCard'
sns.boxplot(x='CreditCard',y='CCAvg',hue='Personal Loan',data=data,showfliers=False)

In [None]:
#Customers with higher Income accepted the Personal Loan for both values of 'Securities Account'
sns.boxplot(x='Securities Account',y='Income',hue='Personal Loan',data=data,showfliers=False)

We use 'countplot' to plot a categorical value with respect to Personal Loan.

(Only 4 graphs have been plotted)

In [None]:
#Majority of customers having Security Accounts don't have a Personal Loan 
sns.countplot(x='Securities Account',hue='Personal Loan',data=data)

In [None]:
#Customers with Higher Education don't have a Personal Loan
sns.countplot(x='Education',hue='Personal Loan',data=data)

In [None]:
#Majority of customers having CD Account don't have a Personal Loan
sns.countplot(x='CD Account',hue='Personal Loan',data=data)

In [None]:
#Majority of customers having Credit Cards don't have a Personal Loan
sns.countplot(x='CreditCard',hue='Personal Loan',data=data)

Here we plot the Correlation Matrix as a HeatMap. The 'Income' and 'CCAvg' columns show a relatively high correlation of 0.65.

In [None]:
#plotting the HeatMap using numerical data
fig,ax=plt.subplots(figsize=(15,10))
sns.heatmap(data.corr(),cmap='RdPu',annot=True);

### **STEP 4**

Before doing any transformations, we split the entire dataset into two parts: the features data_X and the target data_y.

In [None]:
#creating attribute (X) and target (y) variables
#.copy() makes data_X an independent DataFrame so new columns can be added safely later
data_X=data.loc[:,data.columns!='Personal Loan'].copy()
data_y=data['Personal Loan']

In [None]:
#displaying rows and columns
data_X.shape, data_y.shape

**4. Transformation for Feature Variables**

In Step 3, we saw that 'Income' and 'CCAvg' columns were not symmetrically distributed. Hence we perform Power Transformations to make them symmetrical to the Standard Bell Curve.

In [None]:
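#applying box-cox transformation to 'Income' column (requires strictly positive values)
p=PowerTransformer(method='box-cox', standardize=False)
p.fit(data_X['Income'].values.reshape(-1,1))
num=p.transform(data_X['Income'].values.reshape(-1,1))
#writing the transformed values back so that the models actually use them
data_X['Income']=num.ravel()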

In [None]:
#'Income' column is normally distributed
sns.distplot(num)

In [None]:
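#applying yeo-johnson transformation to 'CCAvg' column (works with zero values)
p=PowerTransformer(method='yeo-johnson', standardize=False)
p.fit(data_X['CCAvg'].values.reshape(-1,1))
num=p.transform(data_X['CCAvg'].values.reshape(-1,1))
#writing the transformed values back so that the models actually use them
data_X['CCAvg']=num.ravel()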

In [None]:
#'CCAvg' column is normally distributed
sns.distplot(num)
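The effect of a power transformation can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normal data as a stand-in for a right-skewed column such as 'Income' (an assumption; it is not the bank data) and compares the sample skewness before and after a Box-Cox transformation.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed, strictly positive data (Box-Cox requires values > 0)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=0.8, size=2000).reshape(-1, 1)

# Fit the transformation and apply it in one step
pt = PowerTransformer(method='box-cox', standardize=False)
x_t = pt.fit_transform(x)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = skew(x.ravel())
skew_after = skew(x_t.ravel())
print(skew_before, skew_after)
```

For columns that contain zeros, such as 'CCAvg', `method='yeo-johnson'` is the variant that accepts non-positive values.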

In Step 3, we also noticed that the 'Mortgage' column is highly skewed: most of its values are zero or close to zero, and there are gaps between the remaining values. Hence we avoid a Power Transformation and use the Binning Method to group the values into a fixed number of bins.

In [None]:
#binning applied to 'Mortgage' column (each value is replaced by the label of its bin)
data_X['Mortgage_Int']=pd.cut(data_X['Mortgage'],bins=[0,100,200,300,400,500,600,700],labels=[0,1,2,3,4,5,6],include_lowest=True)
data_X.drop('Mortgage',axis=1,inplace=True)

In [None]:
#displaying top 5 rows of the transformed feature data
data_X.head(5)

In [None]:
#distribution of the binned 'Mortgage' column
sns.distplot(data_X.Mortgage_Int)
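How `pd.cut` assigns values to bins is easiest to see on a handful of numbers. This sketch reuses the same bin edges and labels as above on hypothetical mortgage values chosen to hit the bin boundaries.

```python
import pandas as pd

# Hypothetical mortgage values, including both edges of the first bin
mortgage = pd.Series([0, 100, 101, 250, 635])
edges = [0, 100, 200, 300, 400, 500, 600, 700]

# Intervals are right-closed: (0, 100], (100, 200], ...; include_lowest=True
# also puts the value 0 into the first bin
bins = pd.cut(mortgage, bins=edges, labels=[0, 1, 2, 3, 4, 5, 6], include_lowest=True)
print(bins.astype(int).tolist())
```

Any value above the last edge (700 here) would fall outside every bin and become NaN, so the edges need to cover the full range of the column.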

### **STEP 5**

**5.1 Normalizing the Data**

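Here, we normalize data_X with min-max scaling so that every feature lies in the range [0, 1]. This keeps features with large values, such as 'Income', from dominating features with small values, which matters most for distance-based models such as K-Nearest Neighbour. Scaling changes only the range of a feature, not the shape of its distribution.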

In [None]:
#Normalizing the data using the MinMaxScaler method
data_X_scaler_minmax=preprocessing.MinMaxScaler(feature_range=(0,1))
data_X=data_X_scaler_minmax.fit_transform(data_X)
data_X
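Min-max scaling maps each column with `x' = (x - min) / (max - min)`. The sketch below checks this against `MinMaxScaler` on a small hypothetical array whose two columns stand in for features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical features on very different scales (e.g. income and family size)
X = np.array([[20.0, 1.0],
              [60.0, 2.0],
              [100.0, 4.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Same result computed by hand: (x - min) / (max - min), column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(scaled)
```

Note that `fit_transform` returns a NumPy array, so column names are lost after this step.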

In [None]:
#displaying number of rows and columns
data_X.shape

**5.2 Splitting the Data using Stratified Sampling**

We use Stratified Sampling on data_y so that the training and testing sets keep the same proportion of 'Personal Loan' classes as the full dataset.

In [None]:
#splitting the data into training and testing data
X_train,X_test,y_train,y_test=train_test_split(data_X, data_y, test_size=0.3, stratify=data_y, random_state=0)
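The effect of `stratify` can be verified directly. This sketch builds synthetic labels with the same imbalance as the campaign described in the abstract (480 positives out of 5000) and a placeholder feature column; neither is the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same imbalance as the campaign: 480 positives in 5000
y = np.array([1] * 480 + [0] * 4520)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Both parts keep roughly the original 9.6% positive rate
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the share of positives in the test set would vary from one random split to another, which makes metrics on a rare class less stable.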

In [None]:
#displaying training and testing data
X_train,X_test,y_train,y_test

In [None]:
#displaying number of rows and columns
X_train.shape, X_test.shape, y_train.shape, y_test.shape

### **STEP 6**

**6. Logistic Regression Model**

Using a Logistic Regression Model fitted on the Training data, we predict the likelihood of a customer buying a Personal Loan and evaluate the predictions on the Testing data.

In [None]:
#fitting Logistic Regression model with training data
Logic_Reg=LogisticRegression()
Logic_Reg.fit(X_train,y_train)

In [None]:
#making predictions on Testing Data
pred=Logic_Reg.predict(X_test)
pred

In [None]:
#using score() to compute the accuracy on Training and Testing Data
train_score=Logic_Reg.score(X_train, y_train)
print('Training Data Accuracy:', train_score*100)
test_score=Logic_Reg.score(X_test, y_test)
print('Testing Data Accuracy:', test_score*100)

### **STEP 7**

**7. Metrics to Evaluate Logistic Regression Model's Performance**

In order to check the model's performance, we simply use three different methods i.e. Confusion Matrix, ROC Curve and Precision-Recall Curve
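These additional metrics matter because the classes are imbalanced. As a rough reference point, the sketch below computes the accuracy of a trivial model that always predicts "no loan", using only the counts given in the abstract.

```python
# Class counts stated in the abstract: 480 acceptances out of 5000 customers
n_total = 5000
n_positive = 480

# A trivial model that always predicts "no loan" is right for every negative
baseline_accuracy = (n_total - n_positive) / n_total
print(baseline_accuracy)
```

Any model's accuracy is best read against this baseline, since a high accuracy alone does not show that the model finds the customers who accept the loan.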

**7.1 Confusion Matrix**

First, we make predictions on how many customers would buy a Personal Loan and accordingly plot the Confusion Matrix.

In [None]:
#creating two different classes
#class_names = 0 implies customer rejecting Personal Loan
#class_names = 1 implies customer buying Personal Loan
class_names=[0,1]

In [None]:
#counting correct and incorrect predictions
#(stored as conf_mat so the imported confusion_matrix function is not overwritten)
conf_mat=metrics.confusion_matrix(y_test, pred)
conf_mat

In [None]:
#plotting confusion matrix for Logistic Regression
fig,ax=plt.subplots()
sns.heatmap(pd.DataFrame(conf_mat,index=class_names,columns=class_names),annot=True,cmap="YlOrRd",fmt='g')
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual Labels')
plt.xlabel('Predicted Labels')
plt.show()
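The four cells of a binary confusion matrix can be turned into precision and recall. The sketch below uses hypothetical labels rather than model output, and imports the function under the alias `sk_confusion_matrix` (a name chosen here) so that it cannot clash with a variable of the same name.

```python
import numpy as np
from sklearn.metrics import confusion_matrix as sk_confusion_matrix

# Hypothetical labels, not model output: 1 = accepted the loan, 0 = rejected
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = sk_confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted buyers who really bought
recall = tp / (tp + fn)     # share of actual buyers the model found
print(tn, fp, fn, tp, precision, recall)
```

For a marketing campaign, recall indicates how many potential buyers are reached, while precision indicates how much of the campaign effort is spent on customers who actually buy.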

**7.2 ROC Curve**

First, we calculate the AUC Score and then plot the ROC Curve.

In [None]:
#calculating AUC score
probs = Logic_Reg.predict_proba(X_test)[::,1]
fpr,tpr,threshold = metrics.roc_curve(y_test,probs)
auc = metrics.roc_auc_score(y_test,probs)
print('AUC Score:',auc*100)

In [None]:
#plotting ROC Curve for Logistic Regression
plt.plot(fpr,tpr,label='ROC curve (AUC = %.3f)' % auc)
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
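The AUC score has a simple interpretation: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. A minimal sketch with four hypothetical customers:

```python
from sklearn.metrics import roc_auc_score

# Four hypothetical customers: true labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Of the 4 (positive, negative) pairs, only 0.35 vs 0.4 is ranked the wrong
# way round, so 3 of 4 pairs are ordered correctly
auc_value = roc_auc_score(y_true, scores)
print(auc_value)
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.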

**7.3 Precision-Recall Curve**

In [None]:
#calculating Precision-Recall score
prec_recall = average_precision_score(y_test,probs)
print('Precision-Recall Score:',prec_recall*100)

In [None]:
#plotting Precision-Recall Curve
precision,recall,_= precision_recall_curve(y_test,probs)
plt.step(recall, precision,where='post',label='Average precision = %.3f' % prec_recall)
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
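The score reported by `average_precision_score` summarizes the Precision-Recall curve as a weighted mean of precisions, where each weight is the increase in recall. A small sketch on hypothetical labels and probabilities shows the calculation:

```python
from sklearn.metrics import average_precision_score

# Hypothetical labels and predicted probabilities for four customers
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Ranked by score: 0.8 (pos), 0.4 (neg), 0.35 (pos), 0.1 (neg).
# Precision is 1/1 at the first positive and 2/3 at the second,
# and each positive adds 0.5 to recall
ap = average_precision_score(y_true, scores)
by_hand = 0.5 * (1 / 1) + 0.5 * (2 / 3)
print(ap, by_hand)
```

Because it focuses on the positive class, this score is usually more informative than accuracy when only a small share of customers accept the loan.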

We check the overall accuracy of Logistic Regression model.

In [None]:
#displaying overall accuracy of Logistic Regression Model
accuracy = metrics.accuracy_score(y_test,pred)
A1=accuracy
print("Accuracy:",accuracy*100)

### **STEP 8**

**8. Classification Algorithms**

We compare the Accuracy of different classifiers on the same dataset and use the results to decide which one is the Best Classifier.

**8.1 Decision Tree Classifier**

In [None]:
#fitting Decision Tree Classifier with training data
Dec_Tree=DecisionTreeClassifier(criterion='entropy',max_depth=5,min_samples_leaf=2,random_state=0)
Dec_Tree.fit(X_train,y_train)

In [None]:
#making predictions on Testing Data
pred=Dec_Tree.predict(X_test)
pred

In [None]:
#using score() to compute the accuracy on Training and Testing Data
train_score=Dec_Tree.score(X_train, y_train)
print('Training Data Accuracy:', train_score*100)
test_score=Dec_Tree.score(X_test, y_test)
print('Testing Data Accuracy:', test_score*100)

In [None]:
#counting correct and incorrect predictions
#(stored as conf_mat so the imported confusion_matrix function is not overwritten)
class_names=[0,1]
conf_mat=metrics.confusion_matrix(y_test, pred)
conf_mat

In [None]:
#plotting confusion matrix for Decision Tree
fig,ax=plt.subplots()
sns.heatmap(pd.DataFrame(conf_mat,index=class_names,columns=class_names),annot=True,cmap="Purples",fmt='g')
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual Labels')
plt.xlabel('Predicted Labels')
plt.show()

In [None]:
#calculating AUC score
probs = Dec_Tree.predict_proba(X_test)[::,1]
fpr,tpr,threshold = metrics.roc_curve(y_test,probs)
auc = metrics.roc_auc_score(y_test,probs)
print('AUC Score:',auc*100)

In [None]:
#plotting ROC Curve for Decision Tree
plt.plot(fpr,tpr,label='ROC curve (AUC = %.3f)' % auc)
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

In [None]:
#calculating Precision-Recall score
prec_recall = average_precision_score(y_test,probs)
print('Precision-Recall Score:',prec_recall*100)

In [None]:
#plotting Precision-Recall Curve
precision,recall,_= precision_recall_curve(y_test,probs)
plt.step(recall, precision,where='post',label='Average precision = %.3f' % prec_recall)
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()

In [None]:
#displaying overall accuracy of Decision Tree Classifier
accuracy = metrics.accuracy_score(y_test,pred)
A2=accuracy
print("Accuracy:",accuracy*100)

**8.2 Random Forest Classifier**

In [None]:
#fitting Random Forest Classifier with training data
Ran_For=RandomForestClassifier(n_estimators=100,max_features=2,max_depth=5,min_samples_leaf=2,random_state=0)
Ran_For.fit(X_train,y_train)

In [None]:
#making predictions on Testing Data
pred=Ran_For.predict(X_test)
pred

In [None]:
#using score() to compute the accuracy on Training and Testing Data
train_score=Ran_For.score(X_train, y_train)
print('Training Data Accuracy:', train_score*100)
test_score=Ran_For.score(X_test, y_test)
print('Testing Data Accuracy:', test_score*100)

In [None]:
#counting correct and incorrect predictions
#(stored as conf_mat so the imported confusion_matrix function is not overwritten)
class_names=[0,1]
conf_mat=metrics.confusion_matrix(y_test, pred)
conf_mat

In [None]:
#plotting confusion matrix for Random Forest Classifier
fig,ax=plt.subplots()
sns.heatmap(pd.DataFrame(conf_mat,index=class_names,columns=class_names),annot=True,cmap="Greens",fmt='g')
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual Labels')
plt.xlabel('Predicted Labels')
plt.show()

In [None]:
#calculating AUC score
probs = Ran_For.predict_proba(X_test)[::,1]
fpr,tpr,threshold = metrics.roc_curve(y_test,probs)
auc = metrics.roc_auc_score(y_test,probs)
print('AUC Score:',auc*100)

In [None]:
#plotting ROC Curve for Random Forest Classifier
plt.plot(fpr,tpr,label='ROC curve (AUC = %.3f)' % auc)
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

In [None]:
#calculating Precision-Recall score
prec_recall = average_precision_score(y_test,probs)
print('Precision-Recall Score:',prec_recall*100)

In [None]:
#plotting Precision-Recall Curve
precision,recall,_= precision_recall_curve(y_test,probs)
plt.step(recall, precision,where='post',label='Average precision = %.3f' % prec_recall)
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()

In [None]:
#displaying overall accuracy of Random Forest Classifier
accuracy = metrics.accuracy_score(y_test,pred)
A3=accuracy
print("Accuracy:",accuracy*100)

**8.3 K-Nearest Neighbour Classifier**

In [None]:
#fitting K-Nearest Neighbour Classifier with training data
K_N = KNeighborsClassifier(n_neighbors=5,weights ='uniform',metric='euclidean')
K_N.fit(X_train,y_train)

In [None]:
#making predictions on Testing Data
pred=K_N.predict(X_test)
pred

In [None]:
#using score() to compute the accuracy on Training and Testing Data
train_score=K_N.score(X_train, y_train)
print('Training Data Accuracy:', train_score*100)
test_score=K_N.score(X_test, y_test)
print('Testing Data Accuracy:', test_score*100)

In [None]:
#counting correct and incorrect predictions
#(stored as conf_mat so the imported confusion_matrix function is not overwritten)
class_names=[0,1]
conf_mat=metrics.confusion_matrix(y_test, pred)
conf_mat

In [None]:
#plotting confusion matrix for K-Nearest Neighbour Classifier
fig,ax=plt.subplots()
sns.heatmap(pd.DataFrame(conf_mat,index=class_names,columns=class_names),annot=True,cmap="Reds",fmt='g')
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual Labels')
plt.xlabel('Predicted Labels')
plt.show()

In [None]:
#calculating AUC score
probs = K_N.predict_proba(X_test)[::,1]
fpr,tpr,threshold = metrics.roc_curve(y_test,probs)
auc = metrics.roc_auc_score(y_test,probs)
print('AUC Score:',auc*100)

In [None]:
#plotting ROC Curve for K-Nearest Neighbour Classifier
plt.plot(fpr,tpr,label='ROC curve (AUC = %.3f)' % auc)
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

In [None]:
#calculating Precision-Recall score
prec_recall = average_precision_score(y_test,probs)
print('Precision-Recall Score:',prec_recall*100)

In [None]:
#plotting Precision-Recall Curve
precision,recall,_= precision_recall_curve(y_test,probs)
plt.step(recall, precision,where='post',label='Average precision = %.3f' % prec_recall)
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()

In [None]:
#displaying overall accuracy of K-Nearest Neighbour Classifier
accuracy = metrics.accuracy_score(y_test,pred)
A4=accuracy
print("Accuracy:",accuracy*100)

### **STEP 9**

**9. Business Understanding Of Model**

In [None]:
#accuracies of all models
print('Accuracy of Logistic Regression:',A1*100)
print('Accuracy of Decision Trees:',A2*100)
print('Accuracy of Random Forest:',A3*100)
print('Accuracy of K-Nearest Neighbour:',A4*100)
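The accuracies above come from a single 70/30 split, so small differences between models may partly reflect that particular split. One possible way to make the comparison more robust is stratified cross-validation with `cross_val_score`, which the notebook imports but does not use. The sketch below is an illustration under stated assumptions: it runs on synthetic imbalanced data from `make_classification` rather than on the bank data, and it scores with average precision instead of accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives); a stand-in, not the bank data
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.9, 0.1],
                           random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                            min_samples_leaf=2, random_state=0),
    'K-Nearest Neighbour': KNeighborsClassifier(n_neighbors=5),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring='average_precision').mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(name, round(score, 3))
```

On the real data, `X` and `y` would be replaced by `data_X` and `data_y`; the ranking of models obtained this way may or may not match the single-split accuracies.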

In this project, the dataset provided was imported, preprocessed, normalized and split into training and testing data. A Logistic Regression Model was trained on the training data, and its accuracy was compared with the accuracy of several other classifiers.

We used three additional classifiers, namely Decision Tree Classifier, Random Forest Classifier and K-Nearest Neighbour Classifier.

The main motive of the project was to convert the liability customers to personal loan customers. EDA (Exploratory Data Analysis) gave a clearer picture of outliers, the distribution of variables, etc., which can help the bank gain an edge in converting liability customers to loan customers.

Of the above classifiers, the Decision Tree Classifier achieved the highest accuracy, and hence its algorithm can be used to identify the potential customers who would purchase a loan. Hence, the Decision Tree can be used as the final model for this problem.
