### Objective: 
The classification goal is to predict the likelihood of a liability customer buying personal loans.

### (1) Import the datasets and libraries; check datatypes, statistical summary, shape, null values, and incorrect imputation

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
## importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns

In [None]:
# importing data

df = pd.read_csv("Axe_Bank_Personal_Loan_Data.csv")

In [None]:
df

In [None]:
df.shape

In [None]:
df.info()        ## shows the datatype and non-null count of each column

In [None]:
df.describe()

In [None]:
# There are negative numbers in Experience, probably a data-entry error.
# Convert them to non-negative values with .abs()

df['Experience'] = df['Experience'].abs()
df.describe()
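
Taking the absolute value assumes that every negative entry is a sign error. Counting the affected rows and listing the distinct negative values first is a cheap way to check that assumption. A minimal sketch on a toy frame (the values below are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the real data; these values are illustrative only.
toy = pd.DataFrame({"Age": [24, 25, 45, 52], "Experience": [-1, -2, 20, 27]})

# Inspect the negative entries before overwriting them.
negative = toy[toy["Experience"] < 0]
n_negative = len(negative)
negative_values = sorted(negative["Experience"].unique().tolist())

# Same correction as in the notebook: flip the sign of the negative entries.
toy["Experience"] = toy["Experience"].abs()
```

If the negative values were large or numerous, treating them as missing might be a safer choice than flipping the sign.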

In [None]:
df.isnull().sum()  #check for null Values

### (2) EDA

* Number of unique values in each column?
* Number of people with zero mortgage?
* Number of people with zero credit card spending per month?
* Value counts of all categorical columns.
* Univariate and bivariate analysis
* Get the data model-ready

In [None]:
df.nunique()
# gives number of unique values in each column

In [None]:
df.drop(['ID','ZIP Code'],axis=1,inplace=True)

# Dropping 'ID' because every value is unique, so the column gives no insight for building a model
# 'ZIP Code' represents region, but with only 5000 customers spread over many regions
# the region-wise distribution does not help here, so 'ZIP Code' is dropped as well

<b>Value Counts for Categorical Data</b>

In [None]:
vc = df[['Personal Loan', 'Securities Account', 'CD Account',
       'Online', 'CreditCard']].sum().reset_index().rename(columns={'index':'Col_Name',0:"Value_Count_1"})
vc['Value_Count_0'] = df.shape[0] - vc['Value_Count_1']
vc

# Counts of 0s and 1s for every categorical column with two unique values (0, 1)

In [None]:
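# Convert the counts to percentages of all customers (row count taken from df instead of hard-coding 5000)
vc['Value_Count_0']=(vc['Value_Count_0']*100)/df.shape[0]
vc['Value_Count_1']=(vc['Value_Count_1']*100)/df.shape[0]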
vc
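
For 0/1 columns the percentage of 1s is simply the column mean times 100, so the same table can be built in one step. A small sketch under that assumption, using a toy frame with invented values (the names `toy` and `pct` are illustrative):

```python
import pandas as pd

# Toy stand-in for the binary columns; values are illustrative only.
toy = pd.DataFrame({
    "Personal Loan": [0, 0, 0, 1],
    "Online": [1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of 1s; scale it to percent.
pct_1 = toy.mean() * 100
pct = pd.DataFrame({"Pct_0": 100 - pct_1, "Pct_1": pct_1})
```

This only works for columns that contain nothing but 0 and 1; columns with more categories need `value_counts(normalize=True)` instead.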

In [None]:
df[df['Mortgage']==0].shape[0]

# Number of people with zero home mortgage; most people do not have a mortgage

## Bivariate Analysis

In [None]:
pd.crosstab(df['Personal Loan'], df['CreditCard'])

In [None]:
143/(1327+143)  # share of credit card holders who took a personal loan (counts read from the crosstab above)

In [None]:
pd.crosstab(df['Personal Loan'], df['CreditCard'],normalize='columns')

`The distribution of the target variable is the same whether CreditCard is 0 or 1, so CreditCard is dropped.`
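
The comparison behind this decision can be written as a per-group rate: the share of loan takers among customers with and without a credit card. A minimal sketch on toy data (the rows below are invented so that both groups have the same rate; they are not the dataset's counts):

```python
import pandas as pd

# Toy data constructed so that both groups have a 25% loan rate.
toy = pd.DataFrame({
    "CreditCard":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Personal Loan": [0, 0, 0, 1, 0, 0, 0, 1],
})

# Share of loan takers within each CreditCard group.
rate_by_card = toy.groupby("CreditCard")["Personal Loan"].mean()

# The same numbers appear in the class-1 row of the column-normalised crosstab.
ct = pd.crosstab(toy["Personal Loan"], toy["CreditCard"], normalize="columns")

# A gap close to zero suggests the feature carries little signal on its own.
gap = abs(rate_by_card[1] - rate_by_card[0])
```

A near-zero gap only shows that the feature is uninformative by itself; it cannot rule out that it matters in combination with other features.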

In [None]:
df.drop('CreditCard',axis=1,inplace=True)

In [None]:
pd.crosstab(df['Personal Loan'], df['Education'],normalize='columns')

In [None]:
pd.crosstab(df['Personal Loan'], df['Family'],normalize='columns')

In [None]:
sns.distplot(df[df['Personal Loan']==0]['Mortgage'],color='r',label=0)
sns.distplot(df[df['Personal Loan']==1]['Mortgage'],color='g',label=1)
plt.legend()
plt.show()

# Most people with zero mortgage do not take personal loans

In [None]:
sns.distplot(df[df['Personal Loan']==0]['Income'],color='r',label=0)
sns.distplot(df[df['Personal Loan']==1]['Income'],color='g',label=1)
plt.legend()
plt.show()

In [None]:
# More high-income customers take personal loans than low-income customers

In [None]:
sns.distplot(df[df['Personal Loan']==0]['CCAvg'],color='r',label=0)
sns.distplot(df[df['Personal Loan']==1]['CCAvg'],color='g',label=1)
plt.legend()
plt.show();

In [None]:
# People with high average monthly credit card spending tend to take personal loans
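
The visual impressions from the three plots can be backed by a numeric summary, for example the median of each feature per target class. A sketch on a toy frame (values invented for illustration; the column names follow the notebook):

```python
import pandas as pd

# Toy stand-in; values are illustrative only.
toy = pd.DataFrame({
    "Personal Loan": [0, 0, 0, 1, 1],
    "Income": [40, 60, 80, 150, 170],
    "CCAvg": [0.5, 1.0, 1.5, 4.0, 5.0],
})

# Median of each feature within each target class.
group_medians = toy.groupby("Personal Loan")[["Income", "CCAvg"]].median()
```

Medians are used here because skewed features such as income are less distorted by a few extreme values than means are.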

In [None]:
df[df['CCAvg']==0].shape[0]

# Number of people with zero monthly credit card spending

In [None]:
df['Family'] = df['Family'].astype('category')
df['Education'] = df['Education'].astype('category')

In [None]:
df.head()

### (3) Split the data into training and test set in the ratio of 70:30 respectively

In [None]:
# Separate the independent attributes, i.e. every column except 'Personal Loan'
# Store the target column ('Personal Loan') in y

X = df.drop(columns='Personal Loan')  # independent variables

y = df['Personal Loan']  # target variable as a 1-D Series, the shape scikit-learn expects


In [None]:
X = pd.get_dummies(X,drop_first=True)
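
`pd.get_dummies` only encodes object, string, and category columns, which is why `Family` and `Education` were cast to `category` earlier. With `drop_first=True` the first level of each encoded column is dropped and acts as the baseline. A small sketch on a toy frame (values are illustrative):

```python
import pandas as pd

# Toy frame with one numeric and one categorical column.
toy = pd.DataFrame({"Income": [49, 34, 11], "Education": [1, 2, 3]})
toy["Education"] = toy["Education"].astype("category")

# Numeric columns pass through; the category column becomes indicator columns.
# Level 1 is dropped, so a row with Education_2 = Education_3 = 0 means level 1.
encoded = pd.get_dummies(toy, drop_first=True)
```

Dropping one level avoids perfectly collinear indicator columns, which matters for linear models such as logistic regression.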

In [None]:
y.head()  

In [None]:
X.head()

In [None]:
# Create the training and test sets in a 70:30 ratio (other ratios work too)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=70)

# random_state seeds the shuffle so the split is repeatable
# if random_state is not set, every run generates a different train/test sample
# test_size sets the fraction of rows held out as test data

# Two inputs (X and y) are split, so four outputs are returned: train and test for X, train and test for y
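
When the positive class is rare, a purely random split can leave the train and test sets with different class proportions. `train_test_split` accepts a `stratify` argument that preserves the proportions. The notebook does not use it; the sketch below is an optional variant on toy data (100 rows with 10 positives, invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 10% positives.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the 10% positive rate in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=70, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
```

Switching the notebook's split to a stratified one would change which rows land in the test set, so the confusion-matrix counts quoted later would change as well.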

In [None]:
X_train.shape,X_test.shape

In [None]:
X_train.head()

### (4) Train a Logistic Regression model to predict the likelihood of a customer buying personal loans. Print all the metrics relevant for evaluating the model performance

In [None]:
## importing necessary metrics to evaluate model performance

from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,roc_curve

# Blank lists to store model name, training score, testing score, recall, precision and roc

algo= []
tr = []
te = []
recall = []
precision = []
roc = []

**Logistic Regression**

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=7)

model.fit(X_train, y_train)

In [None]:
model.coef_.round(2)

In [None]:
model.intercept_.round(2)
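
The raw coefficient array is easier to read when each value is paired with its feature name. Exponentiating a coefficient gives the odds ratio for a one-unit increase in that feature, with the other features held fixed. A minimal sketch on synthetic data (`toy_model`, `X_toy`, and `y_toy` are hypothetical names, and the data is randomly generated, not the bank data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data in which only Income drives the target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Income": rng.normal(size=200),
    "CCAvg": rng.normal(size=200),
})
y_toy = (X_toy["Income"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

toy_model = LogisticRegression(random_state=7).fit(X_toy, y_toy)

# One row per feature: log-odds coefficient and the matching odds ratio.
coef_table = pd.DataFrame(
    {"coef": toy_model.coef_[0], "odds_ratio": np.exp(toy_model.coef_[0])},
    index=X_toy.columns,
)
```

Coefficients are only comparable across features when the features are on similar scales, which is not the case for the unscaled bank data.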

In [None]:
y_pred_class=model.predict(X_test)
y_pred_prob=model.predict_proba(X_test)

In [None]:
y_pred_class[:20]

In [None]:
y_pred_class[:5][:]

In [None]:

y_pred_prob[:5,:]

In [None]:

y_pred_prob[:5,0]

In [None]:
#y_pred_prob[:20,:]
(y_pred_prob[:5,1]>0.5)*1  # column 1 is P(class 1); thresholding it at 0.5 reproduces the predicted classes
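
`predict_proba` returns one column per class, in the order given by `classes_`, so with 0/1 labels column 0 holds P(class 0) and column 1 holds P(class 1). For a binary model, `predict` is equivalent to thresholding column 1 at 0.5. A sketch that checks this on synthetic data (`toy_model` and the generated arrays are illustrative, not the notebook's model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

toy_model = LogisticRegression(random_state=7).fit(X_toy, y_toy)
proba = toy_model.predict_proba(X_toy)

# Thresholding P(class 1) at 0.5 should give the same labels as predict().
from_threshold = (proba[:, 1] > 0.5).astype(int)
same_as_predict = bool((from_threshold == toy_model.predict(X_toy)).all())

# Each row of predict_proba is a probability distribution over the classes.
rows_sum_to_one = bool(np.allclose(proba.sum(axis=1), 1.0))
```

Thresholding column 0 instead would flag the rows predicted as class 0, which is the inverse of the predicted label.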

<b>Confusion Matrix</b>

In [None]:
## helper that draws the confusion matrix as an annotated heatmap
def draw_cm( actual, predicted ):
    cm = confusion_matrix( actual, predicted)
    sns.heatmap(cm, annot=True,  fmt='.0f', xticklabels = [0,1] , yticklabels = [0,1] )
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()

In [None]:
draw_cm(y_test,y_pred_class);

In [None]:
95/(95+50)  # recall = TP / (TP + FN), counts read from the confusion matrix above

In [None]:
95/(95+22)  # precision = TP / (TP + FP)

In [None]:
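draw_cm(y_test,y_pred_prob[:,1]>.7);  # confusion matrix at a stricter threshold of 0.7

In [None]:
78/(78+67)  # recall at threshold 0.7

In [None]:
78/(78+6)  # precision at threshold 0.7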
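
The hand-computed ratios can be cross-checked with the scikit-learn metric functions imported earlier. The sketch below rebuilds label arrays from the counts quoted above for the 0.5 threshold (TP = 95, FN = 50, FP = 22). The true-negative count is not stated in the text, so `tn` is a placeholder that assumes 1500 test rows; precision, recall, and F1 do not depend on it.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Counts quoted in the notebook for the 0.5 threshold.
tp, fn, fp = 95, 50, 22
tn = 1333  # placeholder (assumes 1500 test rows); does not affect the metrics below

# Rebuild label arrays that produce exactly these confusion-matrix counts.
y_true = np.array([1] * tp + [1] * fn + [0] * fp + [0] * tn)
y_hat = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

recall_05 = recall_score(y_true, y_hat)        # TP / (TP + FN)
precision_05 = precision_score(y_true, y_hat)  # TP / (TP + FP)
f1_05 = f1_score(y_true, y_hat)                # 2*TP / (2*TP + FP + FN)
```

On the real data the same functions can be called directly on `y_test` and `y_pred_class`, which avoids copying counts from the plot by hand.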

**What the confusion matrix cells mean**

* True Positive (observed=1,predicted=1): Predicted that the personal loan would be taken, and the customer took it

* False Positive (observed=0,predicted=1): Predicted that the personal loan would be taken, but the customer did not take it

* True Negative (observed=0,predicted=0): Predicted that the personal loan would not be taken, and the customer did not take it

* False Negative (observed=1,predicted=0): Predicted that the personal loan would not be taken, but the customer took it

Here the focus should be on recall. The target variable is 'Personal Loan', i.e. whether the customer accepts the personal loan or not, and the bank wants more people to accept it. A False Negative is a customer who would take the loan but is not flagged by the model, so the bank loses a real customer. Fewer False Negatives means higher Recall, hence the focus should be on increasing Recall.
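
One lever for trading precision against recall is the decision threshold: lowering it flags more customers, which raises recall and usually lowers precision. A minimal sketch on toy labels and probabilities (all values invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
prob_1 = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.70, 0.90])

# Recall and precision at three thresholds, stored as (recall, precision).
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_hat = (prob_1 > threshold).astype(int)
    results[threshold] = (
        recall_score(y_true, y_hat),
        precision_score(y_true, y_hat),
    )
```

In this toy example recall falls from 1.0 to 1/3 as the threshold rises from 0.3 to 0.7, while precision rises from 0.6 to 1.0. This is the same direction of change seen between the 0.5 and 0.7 confusion matrices.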

Once the model reaches the desired performance, it can be deployed for practical use: the bank can predict which customers are likely to say yes to a personal loan and apply the model to upcoming customers.

<b>ROC Curve</b>

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob[:,1])

In [None]:
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
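
The curve can be summarised in a single number with `roc_auc_score`, which was imported earlier: the area under the ROC curve equals the probability that a randomly chosen positive is scored above a randomly chosen negative. A sketch on toy data (values invented for illustration; the `toy_` names are chosen so the notebook's own `fpr`, `tpr`, and `thresholds` are not overwritten):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
prob_1 = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.70, 0.90])

# 14 of the 15 (positive, negative) pairs are ranked correctly, so AUC = 14/15.
toy_auc = roc_auc_score(y_true, prob_1)

# The curve itself always runs from (0, 0) to (1, 1).
toy_fpr, toy_tpr, toy_thresholds = roc_curve(y_true, prob_1)
```

On the real data the call would take `y_test` and `y_pred_prob[:,1]`, the same inputs that are passed to `roc_curve`.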

In [None]:
roc_df=pd.DataFrame({'fpr':fpr,'tpr':tpr,'thresholds':thresholds})
roc_df

### Reference Links & Additional Material:

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

<b>Model Evaluation & Validation </b>

* https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
* https://medium.com/analytics-vidhya/a-simple-introduction-to-validating-and-testing-a-model-part-1-2a0765deb198

<b>Blogs on the Same Data</b>
* https://medium.com/@rohanaggarwal45/thera-bank-case-with-univariate-as-well-as-bivariate-analysis-all-the-machine-learning-models-7f61d04eaa2a

* https://www.kaggle.com/pritech/bank-personal-loan-modelling