### Personal Loan Campaign Modelling

#### Objective

The goal of the project is to predict the likelihood that a liability customer will convert and buy a personal loan from the bank.

#### Import the necessary libraries

In [None]:
import os 

os.getcwd()

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # matplotlib.pyplot plots data
%matplotlib inline 
from sklearn.model_selection import train_test_split
import missingno as msno
import zipcodes as zc

#### Read the dataset

In [None]:
data=pd.read_csv('therabankdata.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.describe().transpose()


In [None]:
data.nunique()  # Number of unique values in each column

##### Basic checks on the data before getting it ready for analysis 

In [None]:
def basic_checks(df):
    
    print('='*50)
    print('Shape of the dataframe is: \n',df.shape)
    print('='*50)
    print('Basic stats for the data: \n',df.describe())
    print('='*50)
    print('Data type and info :')
    print(df.info())
    print('='*50)
    print('Missing value information : \n',df.isnull().any())
    print('='*50)
    print('Sum of missing values if any : \n',df.isnull().sum())

In [None]:
basic_checks(data)

##### Missing values matrix

In [None]:
msno.matrix(data)

The matrix shows no missing values in the data.

##### Zero mortgage and zero credit spend from the dataset 

In [None]:
Mortgagezero=data['Mortgage']==0

In [None]:
Mortgagezero.value_counts()  # 3462 records have no mortgage

Approximately 70% of the customers have no mortgage 
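The shares discussed in this section can be read directly off boolean masks. The following is a minimal sketch on a made-up frame (the `toy` values are illustrative and not taken from `therabankdata.csv`): the mean of a boolean mask gives the share of rows that satisfy it, and the mean of the 0/1 target inside that mask gives the loan acceptance rate within the group.

```python
import pandas as pd

# Toy stand-in for the bank data (illustrative values, not the real file).
toy = pd.DataFrame({
    "Mortgage":      [0, 0, 0, 0, 120, 250, 0, 90],
    "Personal Loan": [0, 1, 0, 0, 1,   0,   0, 1],
})

no_mortgage = toy["Mortgage"] == 0

# Share of all customers with no mortgage: the mean of a boolean mask.
share_no_mortgage = no_mortgage.mean()

# Loan acceptance rate within the no-mortgage group:
# the mean of a 0/1 column is the fraction of ones.
loan_rate_no_mortgage = toy.loc[no_mortgage, "Personal Loan"].mean()

print(share_no_mortgage, loan_rate_no_mortgage)
```

Keeping the denominator explicit (rows inside the mask) avoids accidentally dividing by the size of a different group.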

In [None]:
Mortgagezero1=data[(data['Mortgage']==0)&(data['Personal Loan']==1)]

In [None]:
# Loan takers among zero-mortgage customers, divided by all zero-mortgage customers.
# Series division aligns on index labels, so dividing the two value_counts() results does not give this ratio.
len(Mortgagezero1)/Mortgagezero.sum()

About 9% of those who don't have a mortgage have opted for a personal loan, close to the overall acceptance rate.

Credit spend is zero

In [None]:
creditspendzero=data['CCAvg']==0

In [None]:
creditspendzero.value_counts()

About 2% of the customers don't spend on their credit cards.

#### EDA - Part 1 - Summary

1. The dataset has 5000 records and 14 columns.

2. Summary statistics show that:

a. Customers aged 23 to 67 were targeted for the campaign; the average age is about 45.
b. Customers have an average work experience of 20 years, with income ranging from 8k to 224k.
c. The average family size is 2.3, with an average credit card spend of 1.93k.
d. The customer base is not highly educated; most customers fall in education levels 1 and 2.
e. Mortgage values range from 0 to 635k USD.

3. Null value checks show that there are no null values in the data.

4. Approximately 70% of the customers have no mortgage, and about 9% of them opted for a personal loan in last year's campaign, close to the overall acceptance rate. Because they carry no mortgage debt, these customers could still be a good opportunity for the bank, as they could prove to be low risk.

5. About 2% of the customer base doesn't spend on credit cards.


#### Dropping ID as it adds no value to the dataset

In [None]:
data.drop(['ID'],axis=1,inplace=True)

In [None]:
data.head()


##### Data Cleaning 

Some data cleaning is required on the dataset. The following tasks need to be done:

1. The ID column needs to be dropped.
2. ZIP Code needs to be converted so that meaningful insights can be generated from it.
3. Education needs to be converted from a numeric to a categorical variable.
4. The Experience column shows negative values, which need to be converted to the corresponding positive values, assuming they were data entry issues.
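The four cleaning tasks above can be sketched on a small frame. This is only an illustration under assumptions: `toy` and `clean` are hypothetical names, the values are made up, and the negative Experience values are treated as sign errors as stated in the list.

```python
import pandas as pd

# Made-up rows that mimic the relevant columns of the bank data.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Experience": [-1, 5, -3, 20],
    "Education": [1, 2, 3, 1],
    "ZIP Code": [91107, 90089, 94720, 94112],
})

clean = toy.copy()                                  # work on a copy, keep the raw frame intact
clean = clean.drop(columns=["ID"])                  # 1. the identifier carries no signal
clean["ZIP Code"] = clean["ZIP Code"].astype(str)   # 2. treat ZIP as a label, not a number
clean["Education"] = clean["Education"].astype(str) # 3. make Education categorical
clean["Experience"] = clean["Experience"].abs()     # 4. flip negative values to positive

print(clean)
```

Working on an explicit `.copy()` means the raw frame can still be inspected after cleaning.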


In [None]:
data1=data.copy()  # copy, so that cleaning steps do not also modify `data`

In [None]:
data1.head()

In [None]:
data1['Experience']=data1['Experience'].abs()  # negative values (-3,-2,-1) are treated as sign errors

In [None]:
data1.head()

In [None]:
basic_checks(data1)

##### Processing the zip codes 

In [None]:
#zc.matching('94143')

In [None]:
data1['ZIP Code']=data1['ZIP Code'].astype(str)

In [None]:
data1.dtypes

In [None]:
def zipcode_state(x):
    l= zc.matching(x)
    if len(l)>0:
        state=l[0]['state']
    else :
        state='Unknown'
    return state

In [None]:
data1['state']=data1['ZIP Code'].map(zipcode_state)

In [None]:
data1['state'].value_counts()

In [None]:
def zipcode_city(x):
    l= zc.matching(x)
    if len(l)>0:
        City=l[0]['city']
    else :
        City='Unknown'
    return City

In [None]:
data1['city']=data1['ZIP Code'].map(zipcode_city)

In [None]:
data1['city'].value_counts()

In [None]:
data1.head()

##### Education - making it a categorical variable

In [None]:
data1.dtypes

In [None]:
data1['Education']=data1['Education'].astype(str)

In [None]:
data1.dtypes

##### Value counts for categorical columns

In [None]:
def valuecounts(variable):
    print(data1[variable].value_counts())

In [None]:
valuecounts('ZIP Code')

In [None]:
valuecounts('city')

In [None]:
valuecounts('state')

In [None]:
valuecounts('Education')

##### Univariate & Bivariate analysis

##### Plotting the correlations between the variables 

In [None]:
data1.corr(numeric_only=True)  # skip the string columns (ZIP Code, Education, state, city)

In [None]:
# Plot the correlation matrix so the relationships are easier to see
def plot_corr(df, size=11):
    corr = df.corr(numeric_only=True)  # numeric columns only
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')


In [None]:
plot_corr(data1)

##### Correlation pairplot 

In [None]:
corr_data=data1.iloc[:,0:12]
sns.pairplot(corr_data)


##### Univariate & Bivariate plots

In [None]:
def plots(variable):
    fig=plt.figure(figsize=(10,5))
    plt.subplot(131)
    sns.histplot(data1[variable], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.subplot(132)
    sns.boxplot(x=data1[variable])
    plt.subplot(133)
    sns.boxplot(x=data1['Personal Loan'],y=data1[variable])

Age - Bivariate and Univariate

In [None]:
plots('Age')

Experience - Bivariate and Univariate

In [None]:
plots('Experience')

Income - Bivariate and Univariate

In [None]:
plots('Income')

CCAvg - Bivariate and Univariate

In [None]:
plots('CCAvg')

Mortgage - Bivariate and Univariate

In [None]:
plots('Mortgage')

#### Countplots for the categorical variables

In [None]:
def catplot(variable):
    fig=plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    sns.countplot(x=variable,data=data1)
    plt.subplot(1,2,2)
    sns.countplot(x=variable, hue='Personal Loan', data=data1)

In [None]:
catplot('Family')

In [None]:
catplot('Education')

##### Personal Loan

In [None]:
sns.countplot(x='Personal Loan',data=data1)

In [None]:
sns.countplot(x='Securities Account',hue='Personal Loan',data=data1)

In [None]:
sns.countplot(x='CD Account',hue='Personal Loan',data=data1)

In [None]:
sns.countplot(x='Online',hue='Personal Loan',data=data1)

In [None]:
sns.countplot(x='CreditCard',hue='Personal Loan',data=data1)

#### Zip Codes

Which STATE has the highest customer base ?

In [None]:
sns.countplot(x='state',hue='Personal Loan',data=data1)

California is the bank's customer base. Let's look at the breakdown by city.

Top 10 cities in California by number of bank customers.

In [None]:
data1[['city','Personal Loan']]

In [None]:
test=data1['city'].value_counts(ascending=False)

In [None]:
test.head(10)

In [None]:
data1['city'].nunique() ## Number of unique cities in the dataset is 245.

#### EDA part -2 

##### Correlation :

1. Age and Experience seem to be highly correlated.
2. Personal Loan and Income seem to be somewhat correlated.
3. We don't see any other significant correlations between the variables.
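Reading the strongest relationships off a full correlation matrix by eye is error-prone. A minimal sketch of ranking the variable pairs by absolute correlation follows; the data is synthetic (a hypothetical `toy` frame in which Experience is constructed to track Age), so the numbers are not those of the bank data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(23, 68, size=200)
toy = pd.DataFrame({
    "Age": age,
    "Experience": age - 25 + rng.integers(-2, 3, size=200),  # tracks Age closely
    "Income": rng.integers(8, 225, size=200),                 # unrelated to Age
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)

top_pair = pairs.index[0]
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well.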

###### Univariate and bivariate plots :

###### Numerical variables :

1. Age is roughly normally distributed. Customers who did not opt for a personal loan range in age from about 20 to 60, and those who did range from the mid-twenties to a little over 60. No outliers are seen.

2. Income is right-skewed. For customers who did not opt for a personal loan there are outliers above 150k. For those who did, there are no outliers and income ranges from 60k to 200k.

3. CCAvg is right-skewed, with outliers showing spend greater than 5k. Customers with a personal loan have a wider range of CCAvg, and their median spend is clearly higher.

4. Mortgage is also right-skewed, with plenty of outliers. The IQR is larger for those who opted for a personal loan.

***Outliers seemed to have no effect on the overall model (this was checked and the option struck off), so they were not treated during data cleaning.***
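The outliers mentioned above are the points a boxplot draws beyond its whiskers. Assuming the usual 1.5 × IQR whisker rule, a minimal sketch of counting them could look like this; `iqr_outlier_mask` is a hypothetical helper and the income values are illustrative only.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

income = [20, 35, 40, 45, 50, 60, 64, 70, 85, 220]  # illustrative values only
mask = iqr_outlier_mask(income)
print(mask.sum(), "outlier(s):", np.asarray(income)[mask])
```

Applied per column, the mask gives a quick count of how many rows a treatment step (capping or removal) would touch.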



###### Categorical variables :

1. ZIP code was converted from numerical to categorical to extract city and state. CA (California) has the highest number of customers; the top 5 cities by customer count are Los Angeles, San Diego, San Francisco, Berkeley and Sacramento. The customers come from 245 different cities.

2. The most popular family sizes are singles and couples. Among those who opted for a personal loan, the family size is around 3.

3. Most of the customer base has a basic educational qualification, followed by those with an advanced qualification.

4. Personal loans were mostly taken by customers with higher educational qualifications.

5. 11% of Securities Account holders have a personal loan.

6. 46% of CD Account holders have a personal loan.

7. Online: 9% of those who use internet banking have a personal loan.

8. Credit card: approximately 10% of those who have a credit card with the bank have a personal loan.



### Dependent variable distribution

0 - Customers that did not opt for a personal loan in the previous campaign

1 - Customers that did opt for a personal loan

In [None]:
data1['Personal Loan'].value_counts()

90.4% did not opt for a personal loan; 9.6% did.

In [None]:
sns.countplot(x='Personal Loan',data=data1)

Changes to other attributes with respect to the dependent variable

In [None]:
data1.groupby(['Personal Loan']).mean(numeric_only=True)  # string columns are skipped

In [None]:
data1.groupby(['Personal Loan']).median(numeric_only=True)  # string columns are skipped

##### Inferences:

1. 90.4% of the customers did not opt for a personal loan; 9.6% did.
2. The average age is 45 both for those who opted for a personal loan and for those who didn't. The average experience of customers who opted for a personal loan is 19.8 years.
3. For those who opted for a personal loan, the average family size is 2.6 and the average credit card spend is 3.9k.

##### Getting the data ready for model building :

1. Apply ***one-hot encoding*** to the Education variable.
2. Drop the variables 'city' and 'state' from the dataset, as the initial round of analysis on the ZIP code is done.
3. Drop ZIP Code from the dataset.

NB: ***The model was run both with and without ZIP Code as a variable. Adding ZIP Code only creates many more columns after one-hot encoding and added no value to model performance, so ZIP Code was dropped, leaving model performance unchanged.***

4. Split the data into train and test sets in a 70:30 ratio.

5. Once the data is ready, apply the Logistic Regression classifier from sklearn.
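The encoding and splitting steps can be sketched on a toy frame. Two things here are assumptions rather than what the notebook does: the data is made up, and the split passes `stratify=y`, an optional argument that keeps the minority-class share equal in the train and test sets, which can matter when only about 10% of rows are positive.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with a 10% positive rate (100 rows, 10 positives).
toy = pd.DataFrame({
    "Income":        [40, 85, 120, 30, 150, 60, 95, 45, 170, 55] * 10,
    "Education":     ["1", "2", "3", "1", "3", "2", "1", "1", "3", "2"] * 10,
    "Personal Loan": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] * 10,
})

# One-hot encode the string-typed Education column; other columns pass through.
encoded = pd.get_dummies(toy, columns=["Education"])

X = encoded.drop(columns="Personal Loan")
y = encoded["Personal Loan"]

# 70:30 split; stratify=y preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y.mean(), y_train.mean(), y_test.mean())
```

Without stratification the class ratio in each split depends on the random seed, which makes recall on the test set noisier.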





In [None]:
data1=data1.drop(['city', 'state'], axis=1)

In [None]:
data1.columns

Dropping the ZIP Code from the model 

In [None]:
data2=data1.drop(['ZIP Code'],axis=1)

In [None]:
data2.dtypes

In [None]:
data2=pd.get_dummies(data2)

In [None]:
data2.head()

##### Split the data into train and test 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X=data2.drop('Personal Loan',axis=1)
Y=data2['Personal Loan']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

In [None]:
x_train.head()

In [None]:
print("{0:0.2f}% data is in training set".format((len(x_train)/len(data2.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(data2.index)) * 100))

In [None]:
print("Original Personal Loan True Values    : {0} ({1:0.2f}%)".format(len(data2.loc[data2['Personal Loan'] == 1]), (len(data2.loc[data2['Personal Loan'] == 1])/len(data2.index)) * 100))
print("Original Personal Loan False Values   : {0} ({1:0.2f}%)".format(len(data2.loc[data2['Personal Loan'] == 0]), (len(data2.loc[data2['Personal Loan'] == 0])/len(data2.index)) * 100))
print("")
print("Training Personal Loan True Values    : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Personal Loan False Values   : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Personal Loan True Values        : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Personal Loan False Values       : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")

## Modelling - Logistic regression

1. Build a basic logistic regression with default parameters. Note the accuracy, recall, precision, F1 score and ROC AUC score.
2. Tweak the parameters C and solver to see any improvement in model performance.
3. Balance the data to improve the performance measures.
4. Compare the performance measures and pick the model that gives the best recall and ROC AUC scores.
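Step 2 is done with manual loops in this notebook. As an alternative (a swapped-in technique, not the notebook's method), scikit-learn's `GridSearchCV` can search `C` and `solver` together and rank candidates by cross-validated recall. A minimal sketch on synthetic data from `make_classification`, with about 10% positives to mimic the imbalance:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (about 10% positives); not the bank data.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)

param_grid = {
    "C": [0.01, 0.1, 0.25, 0.5, 0.75, 1],
    "solver": ["liblinear", "lbfgs"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",  # rank candidates by recall on the positive class
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the score is averaged over cross-validation folds of the training data, the test set stays untouched until the final comparison.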

In [None]:
from sklearn import metrics

#from sklearn import confusion_matrix,recall_score,precision_score,f1_score,roc_auc_score,accuracy_score

from sklearn.linear_model import LogisticRegression

## Fitting the model on training dataset 

model = LogisticRegression()
model.fit(x_train,y_train)

## Predicting on test set 
y_predict=model.predict(x_test)


Inferences:

The above gives us the classifier with default parameters. Note that the default solver depends on the scikit-learn version: 'liblinear' before version 0.22 and 'lbfgs' from 0.22 onwards.

***Model scores on the train and test datasets***

In [None]:
model_score=model.score(x_test,y_test)
print(model_score)

In [None]:
model_score_train=model.score(x_train,y_train)
print(model_score_train)

Both the train and test datasets show a good model score. Let us now print the confusion matrix.

#### Confusion Matrix

In [None]:
confusion_matrix=metrics.confusion_matrix(y_test,y_predict)
print('The Confusion Matrix is displayed below :')
print('')

print(confusion_matrix)
print('')


## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()

In [None]:
print(metrics.classification_report(y_test,y_predict))

The model has high precision, low recall and a moderate F1 score. We would like to see a higher recall score.
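To make the link between the confusion matrix and these scores explicit, the sketch below derives precision, recall and F1 by hand from the four cells. The labels are made up for illustration; for labels `[0, 1]` scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
import numpy as np
from sklearn import metrics

# Illustrative labels only: 4 actual buyers, 3 predicted buyers.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted buyers, how many actually bought
recall = tp / (tp + fn)     # of the actual buyers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Recall depends only on the bottom row of the matrix (FN and TP), which is why reducing false negatives is the direct way to raise it.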

#### Accuracy,Precision,Recall,ROC_auc_score

In [None]:
## Accuracy

print('Training Accuracy is :',model.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',model.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC auc score is :',metrics.roc_auc_score(y_test,y_predict))
print('')



#### ROC/AUC curve

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

roc_auc_score(y_test,y_predict)

In [None]:
logit_roc_auc = roc_auc_score(y_test, model.predict_proba(x_test)[:,1])  # area under the plotted curve, from probabilities
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(x_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

#### Improving Model performance 

In [None]:
## Getting the regression model parameters

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
model.get_params()

In [None]:
train_score=[]
test_score=[]
solver=['newton-cg','lbfgs','liblinear','sag', 'saga']

for i in solver:
    model=LogisticRegression(random_state=None,penalty='l2', C = 1,solver=i)
    model.fit(x_train,y_train)
    y_predict=model.predict(x_test)
    train_score.append(round(model.score(x_train,y_train),2))
    test_score.append(round(model.score(x_test,y_test),2))
    
print(solver)
print()
print(train_score)
print()
print(test_score)

In [None]:
# Picking the 'liblinear' solver; next we tweak the C value
# to see whether it improves recall and precision

In [None]:
### Model 

model=LogisticRegression(random_state=None,penalty='l2', C = 1,solver='liblinear')
model.fit(x_train,y_train)
y_predict=model.predict(x_test)


In [None]:
train_score=[]
test_score=[]

C=[0.01,0.1,0.25,0.5,0.75,1]

for i in C :
    model=LogisticRegression(random_state=None,penalty='l2', C = i ,solver='liblinear')
    model.fit(x_train,y_train)
    y_predict=model.predict(x_test)
    train_score.append(round(model.score(x_train,y_train),3))
    test_score.append(round(model.score(x_test,y_test),3))
  
print(C)
print()
print(train_score)
print()
print(test_score)

#print(metrics.f1_score(y_test,y_predict))

In [None]:
## Accuracy
model=LogisticRegression(random_state=None,penalty='l2',solver='liblinear')
model.fit(x_train,y_train)
y_predict=model.predict(x_test)

print('Training Accuracy is :',model.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',model.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC auc score is :',metrics.roc_auc_score(y_test,y_predict))
print('')

##### Confusion Matrix

In [None]:
confusionmatrix=metrics.confusion_matrix(y_test,y_predict)
print('The Confusion Matrix is displayed below :')
print('')

print(confusionmatrix)
print('')


## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()

The true positives show a slight improvement, along with the recall, F1 and ROC AUC scores. We will therefore keep these parameters and next set class_weight='balanced', as our data is highly imbalanced.

`Business Insights`

True Negative (observed=0, predicted=0)

Predicted that the customer would not accept the personal loan and the customer did not.

False Positive (observed=0, predicted=1)

Predicted that the customer would accept the personal loan while the customer did not.

True Positive (observed=1, predicted=1)

Predicted that the customer would accept the personal loan and the customer did.

False Negative (observed=1, predicted=0)

Predicted that the customer would not accept the loan when the customer did.

`Metrics of main interest`

In the first model we get an accuracy greater than 90%; however, accuracy is misleading on imbalanced data, so we focus on precision, recall and F1 score here.

The false negatives in this case are missed opportunities for the bank to make money. Hence the lower the number of false negatives, the better.

We therefore tune the model so that it gives us the best F1 score, recall and precision.

##### Treating the imbalance in the data by tweaking the class_weight parameter

After tuning the parameters above, we set the class_weight parameter to 'balanced' to see whether the model performance improves.
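For context, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_that_class)`, so errors on the rare class cost more during fitting. A minimal sketch that reproduces the weights by hand and checks them against scikit-learn's `compute_class_weight`; the label vector is illustrative and only mimics the 90.4% / 9.6% split reported above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels with a 90.4% / 9.6% split (1000 rows).
y = np.array([0] * 904 + [1] * 96)

classes = np.unique(y)

# Manual formula: n_samples / (n_classes * per-class counts).
manual = len(y) / (len(classes) * np.bincount(y))

# The same weights as computed by scikit-learn.
sk = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(dict(zip(classes.tolist(), manual.round(3).tolist())))
```

The minority class receives a weight roughly nine times larger than the majority class here, which pushes the decision boundary towards predicting more positives and so trades precision for recall.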

In [None]:
## Accuracy
model=LogisticRegression(random_state=None,penalty='l2',solver='liblinear',class_weight='balanced')
model.fit(x_train,y_train)
y_predict=model.predict(x_test)

print('Training Accuracy is :',model.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',model.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC auc score is :',metrics.roc_auc_score(y_test,y_predict))
print('')

In [None]:
confusionmatrix=metrics.confusion_matrix(y_test,y_predict)
print('The Confusion Matrix is displayed below :')
print('')

print(confusionmatrix)
print('')


## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()

`Business Insights`

True Negative (observed=0, predicted=0)

Predicted that the customer would not accept the personal loan and the customer did not.

False Positive (observed=0, predicted=1)

Predicted that the customer would accept the personal loan while the customer did not.

True Positive (observed=1, predicted=1)

Predicted that the customer would accept the personal loan and the customer did.

False Negative (observed=1, predicted=0)

Predicted that the customer would not accept the loan when the customer did.

`Metrics of main interest`

***Our fundamental assumption here is that we are most interested in identifying customers who will convert to personal loan customers, hence high recall and AUC score should be our metrics.***

Since accuracy is not our main metric, we look at the other numbers.

The ***false negatives*** are much lower than in the previous model, meaning fewer missed opportunities and a better conversion rate for the personal loan, thus increasing income for the bank.

The number of ***true positives*** has also gone up in the current model, meaning more predictions that a customer would buy a personal loan match the actual outcome.

The number of ***false positives*** has also gone up; however, these errors are relatively harmless. Precision goes down as the false positives have gone up.

It is worth noting that accuracy has gone down by a small percentage, while recall and ROC AUC have shown a significant increase.

Although F1 score and precision have gone down, our focus is on recall and AUC, as these reflect the false negatives and true positives, and both show a significant improvement. We will therefore select this as our final model.
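One caveat on the ROC AUC scores printed in this notebook: they come from `roc_auc_score(y_test, y_predict)`, that is, from hard 0/1 predictions. For a binary problem that quantity equals balanced accuracy, (TPR + TNR) / 2, at the default 0.5 threshold. A threshold-independent AUC would instead use the predicted probabilities. The sketch below illustrates the difference on synthetic data from `make_classification`; it does not reproduce the bank results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# AUC from hard labels: a single operating point on the ROC curve.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from probabilities: the area under the full ROC curve.
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
# Balanced accuracy: mean of recall on each class.
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

print(auc_from_labels, bal_acc, auc_from_scores)
```

When comparing models such as the unweighted and the balanced one, reporting both numbers separates a genuinely better ranking of customers from a mere shift in the decision threshold.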