# <font color='green'>Lending Club Loan Data Analysis</font>

### DESCRIPTION

Create a model that predicts whether or not a loan will default, using historical data.

### Problem Statement:  

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, you use historical data from 2007 to 2015 to build a deep learning model that predicts the chance of default for future loans. As you will see later, the dataset is highly imbalanced and includes many features, which makes the problem more challenging.

**Domain:** Finance

Analysis to be done: Perform data preprocessing and build a deep learning prediction model. 

**Content:**

Dataset columns and definition:

 

**credit.policy:** 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

**purpose:** The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

**int.rate:** The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

**installment:** The monthly installments owed by the borrower if the loan is funded.

**log.annual.inc:** The natural log of the self-reported annual income of the borrower.

**dti:** The debt-to-income ratio of the borrower (amount of debt divided by annual income).

**fico:** The FICO credit score of the borrower.

**days.with.cr.line:** The number of days the borrower has had a credit line.

**revol.bal:** The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

**revol.util:** The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

**inq.last.6mths:** The borrower's number of inquiries by creditors in the last 6 months.

**delinq.2yrs:** The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

**pub.rec:** The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
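Two of the columns above are stored in transformed units: `int.rate` is a proportion and `log.annual.inc` is a natural log. A minimal sketch of converting them back to everyday units, using made-up values rather than rows from the dataset:

```python
import math

# Illustrative values only; these are not rows from loan_data.csv.
log_annual_inc = 11.0   # natural log of the self-reported annual income
int_rate = 0.11         # interest rate stored as a proportion

annual_income = math.exp(log_annual_inc)   # invert the natural log
int_rate_percent = int_rate * 100          # proportion -> percent

print(f"annual income ~ {annual_income:,.0f}")
print(f"interest rate = {int_rate_percent:.1f}%")
```

The model itself can keep using the transformed columns; the conversion is only useful when reading plots or summary statistics.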

 

**Steps to perform:**

Perform exploratory data analysis and feature engineering, then build a deep learning model that predicts whether or not a loan will default, using the historical data.

**Tasks:**

1.     Feature Transformation

    * Transform categorical values into numerical values (discrete)

2.     Exploratory data analysis of different factors of the dataset.

3.     Additional Feature Engineering

    * You will check the correlation between features and will drop those features which have a strong correlation

    * This will help reduce the number of features and will leave you with the most relevant features

4.     Modeling

    * After applying EDA and feature engineering, you are now ready to build the predictive models

    * In this part, you will create a deep learning model using Keras with the TensorFlow backend



In [None]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the dataset (pandas is already imported as pd above)
data=pd.read_csv("loan_data.csv")

In [None]:
data.head(10)

In [None]:
# The dataset has 9578 rows and 14 columns
data.shape

In [None]:
data.describe()

In [None]:
# check null values
data.isnull().sum()

In [None]:
data.isnull().sum().any()

In [None]:
data.dtypes

Exploratory data analysis of different factors of the dataset.

In [None]:
data['not.fully.paid'].value_counts()

0 - Fully paid

1 - Not fully paid

The classes are imbalanced: fully paid loans (0) far outnumber loans that were not fully paid (1).
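To put a number on the imbalance, it helps to look at class shares rather than raw counts. A minimal sketch on toy labels (the helper name `class_balance` and the label values are illustrative assumptions, not the real column):

```python
from collections import Counter

def class_balance(labels):
    """Return {label: share of rows} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels, not the real column: 8 fully paid (0), 2 not fully paid (1).
toy_labels = [0] * 8 + [1] * 2
shares = class_balance(toy_labels)
print(shares)
```

On the notebook's DataFrame, `data['not.fully.paid'].value_counts(normalize=True)` should report the same kind of shares directly.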

In [None]:
sns.countplot(x=data['not.fully.paid'])
plt.show()

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x=data['purpose'])
plt.show()

In [None]:
# purpose and not fully paid
plt.figure(figsize=(15,6))
sns.countplot(x='purpose',hue='not.fully.paid',data=data)
plt.show()

In [None]:
# Bivariate analysis: FICO score vs. interest rate
sns.jointplot(x='fico',y='int.rate',data=data, kind='hex', color='g')
plt.show()

In [None]:
sns.scatterplot(x='fico',y='int.rate',data=data)
plt.show()

In [None]:
sns.histplot(data['fico'])
plt.show()

In [None]:
sns.histplot(x='fico', hue='not.fully.paid', data=data)
plt.show()

#### Feature Transformation
#### Transform categorical values into numerical values (discrete)

In [None]:
# Handle the imbalanced dataset: check the class counts before resampling
data['not.fully.paid'].value_counts()

In [None]:
not_fully_paid_0 = data[data['not.fully.paid']==0]
not_fully_paid_1 = data[data['not.fully.paid']==1]

In [None]:
not_fully_paid_0.shape

In [None]:
not_fully_paid_1.shape

In [None]:
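# Upsample the minority class (sampling with replacement) to the size of the majority class
data_minor_upsample=resample(not_fully_paid_1,replace=True,n_samples=len(not_fully_paid_0),random_state=42)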

In [None]:
new_data=pd.concat([not_fully_paid_0,data_minor_upsample])

In [None]:
# Shuffle the rows so the two classes are interleaved (fixed seed for reproducibility)
new_data=shuffle(new_data,random_state=42)

In [None]:
new_data['not.fully.paid'].value_counts()

In [None]:
new_data.shape
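One caveat with upsampling the full dataset before the train/test split: copies of the same minority row can end up on both sides of the split, which tends to inflate test scores. A common alternative is to hold out the test set first and upsample only the training rows. The sketch below shows that ordering in plain Python on toy rows; the function name, the `(id, label)` row layout, and the seed are illustrative assumptions, and it is a different procedure from the cells above.

```python
import random

def split_then_upsample(rows, label_of, test_fraction=0.2, seed=42):
    """Hold out a test set first, then upsample the minority class in train only."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test, train = rows[:n_test], rows[n_test:]

    majority = [r for r in train if label_of(r) == 0]
    minority = [r for r in train if label_of(r) == 1]
    # Sample minority rows with replacement until both classes are the same size.
    # Assumes the training part contains at least one minority row.
    n_extra = max(0, len(majority) - len(minority))
    extra = rng.choices(minority, k=n_extra)
    balanced_train = majority + minority + extra
    rng.shuffle(balanced_train)
    return balanced_train, test

# Toy rows: (row id, label); every fifth row belongs to the minority class.
toy_rows = [(i, 1 if i % 5 == 0 else 0) for i in range(100)]
balanced_train, test = split_then_upsample(toy_rows, label_of=lambda r: r[1])
print(len(balanced_train), len(test))
```

With this ordering the test set keeps the original class balance, so accuracy alone becomes less informative and per-class precision and recall matter more.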

In [None]:
new_data.dtypes

In [None]:
# Convert the categorical 'purpose' column into integer codes
le = LabelEncoder()

In [None]:
for i in new_data.columns:
    if new_data[i].dtypes =='object':
        new_data[i]=le.fit_transform(new_data[i])

In [None]:
new_data.dtypes
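Label encoding assigns each category an integer according to the sorted order of the category names, so the resulting numbers imply an ordering that the loan purposes do not really have. The sketch below reproduces that mapping in plain Python and shows one-hot encoding as a possible alternative; both helper functions are illustrative and are not scikit-learn APIs.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order of the categories."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot(values):
    """Alternative encoding: one 0/1 indicator column per category."""
    classes = sorted(set(values))
    return [[int(v == c) for c in classes] for v in values], classes

purposes = ["debt_consolidation", "credit_card", "all_other", "credit_card"]
codes, classes = label_encode(purposes)
print(classes)  # ['all_other', 'credit_card', 'debt_consolidation']
print(codes)    # [2, 1, 0, 1]

indicators, _ = one_hot(purposes)
print(indicators[0])  # [0, 0, 1]
```

Whether the implied ordering hurts a neural network on this dataset is an empirical question; one-hot encoding simply removes the assumption at the cost of more input columns.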

### Additional Feature Engineering

#### You will check the correlation between features and will drop those features which have a strong correlation

#### This will help reduce the number of features and will leave you with the most relevant features
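One way to read this step is as a greedy filter: walk through the features and keep one only if it is not strongly correlated with any feature already kept. The sketch below does that on toy columns in plain Python. The 0.9 threshold, the helper names, and the toy values are assumptions for illustration; the task statement does not fix a threshold.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],   # nearly a multiple of "a"
    "c": [5, 1, 4, 2, 3],      # only weakly related to "a"
}
print(drop_correlated(toy))  # ['a', 'c']
```

The result depends on column order, since the first feature of a correlated pair is the one that survives.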

In [None]:
new_data.corr()

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(new_data.corr(),annot=True)
plt.show()

In [None]:
# Absolute correlation of each column with the target, strongest first
new_data.corr().abs()['not.fully.paid'].sort_values(ascending=False)

In [None]:
new_data.columns

In [None]:
# Select the feature columns used for modeling
X=new_data[['credit.policy','purpose', 'int.rate', 'installment','fico','revol.bal','revol.util','inq.last.6mths','pub.rec']]

In [None]:
X.shape

In [None]:
y=new_data['not.fully.paid']

In [None]:
# Create train set & test set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.2,random_state=42)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Apply scaling
sc=StandardScaler()

In [None]:
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
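The scaler is fitted on the training set only and then reused on the test set, so no test statistics leak into preprocessing. A minimal sketch of the same idea on a single toy column (the values and helper names are illustrative; the population standard deviation matches what `StandardScaler` uses):

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values with previously learned statistics."""
    return [(x - mean) / std for x in column]

train_fico = [660.0, 700.0, 740.0]   # toy values, not real rows
test_fico = [700.0, 780.0]

mean, std = fit_scaler(train_fico)              # statistics come from train only
train_scaled = transform(train_fico, mean, std)
test_scaled = transform(test_fico, mean, std)   # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

Scaled test values can fall outside the range seen in training, which is expected.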

#### In this part, you will create a deep learning model using Keras with the TensorFlow backend


In [None]:
# Create the architecture:
# two hidden Dense layers, each followed by dropout
model=Sequential()
model.add(Dense(19,activation='relu',input_shape=[9]))
model.add(Dropout(0.20))

model.add(Dense(10,activation='relu'))
model.add(Dropout(0.20))

# output layer
model.add(Dense(1,activation='sigmoid'))

In [None]:
model.summary()
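The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout adds no parameters. A short sketch for the layer sizes used above (the helper name is illustrative):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 9 input features -> Dense(19) -> Dense(10) -> Dense(1)
layer_sizes = [9, 19, 10, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [190, 200, 11]
print(sum(per_layer))  # 401
```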

In [None]:
# compile the model
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

In [None]:
early_stop=EarlyStopping(monitor='val_loss',min_delta=0.001,mode='min',patience=10,verbose=1)
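With these settings, training should stop once `val_loss` has failed to improve by at least `min_delta` for `patience` consecutive epochs. The function below is a simplified reading of that rule for illustration, not the Keras implementation, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.001):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # counts as an improvement
            best, wait = loss, 0
        else:                         # no sufficient improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses; patience shortened to 3 to keep the example small.
losses = [0.70, 0.65, 0.64, 0.6395, 0.641, 0.6399]
print(early_stopping_epoch(losses, patience=3, min_delta=0.001))  # 6
```

Note that the drop from 0.64 to 0.6395 is smaller than `min_delta`, so it does not reset the counter.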

In [None]:
# Train once, keeping the history and stopping early when val_loss stalls
history=model.fit(X_train,y_train,
          epochs=50,
          batch_size=256,
          validation_data=(X_test,y_test),
          callbacks=[early_stop])

In [None]:
model.evaluate(X_test,y_test)

In [None]:
y_pred=model.predict(X_test)

In [None]:
y_pred

In [None]:
predictions=(y_pred>0.5).astype('int')

In [None]:
predictions

In [None]:
y_test

In [None]:
accuracy_score(y_test,predictions)

In [None]:
print(classification_report(y_test,predictions))
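scikit-learn's metric functions expect the true labels first and the predictions second. Accuracy is unaffected by the order, but precision and recall trade places when the arguments are swapped. A small sketch on toy labels (the helper is illustrative, not a scikit-learn function):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

correct_order = precision_recall(y_true, y_pred)   # precision 1/2, recall 1/3
swapped_order = precision_recall(y_pred, y_true)   # precision 1/3, recall 1/2
print(correct_order, swapped_order)
```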

In [None]:
model.save('loan_default1.h5')

Model 2 architecture (stored in the variable `model1`)

In [None]:
# Create the architecture for the second model, with batch normalization
from tensorflow.keras.layers import BatchNormalization

# First hidden layer
model1=Sequential()
model1.add(Dense(128,activation='relu',input_shape=[9]))
model1.add(BatchNormalization())
model1.add(Dropout(0.20))

# Second hidden layer
model1.add(Dense(64,activation='tanh'))
model1.add(BatchNormalization())
model1.add(Dropout(0.20))

# Third hidden layer
model1.add(Dense(32,activation='relu'))
model1.add(BatchNormalization())
model1.add(Dropout(0.20))

# Output layer
model1.add(Dense(1,activation='sigmoid'))

In [None]:
model1.summary()

In [None]:
# compile the model
model1.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

In [None]:
history=model1.fit(X_train,y_train,
          epochs=100,
          batch_size=256,
          validation_data=(X_test,y_test))

In [None]:
model1.evaluate(X_test,y_test)

In [None]:
model1.evaluate(X_train,y_train)

Hyperparameter tuning in Keras

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
def build_model(hp):
    model=Sequential()

    # First hidden layer. The names 'units', 'activation' and 'rate' are reused
    # for every hidden layer below, so each is sampled once per trial and all
    # three hidden layers share the same values.
    model.add(Dense(units=hp.Int('units',min_value=32,max_value=1024,step=16),
                    activation=hp.Choice('activation',['relu','tanh']),input_shape=[9]))
    model.add(BatchNormalization())
    model.add(Dropout(hp.Float('rate',min_value=0.1,max_value=0.5,step=0.1)))

    # Second hidden layer
    model.add(Dense(units=hp.Int('units',min_value=32,max_value=1024,step=16),
                    activation=hp.Choice('activation',['relu','tanh'])))
    model.add(BatchNormalization())
    model.add(Dropout(hp.Float('rate',min_value=0.1,max_value=0.5,step=0.1)))

    # Third hidden layer
    model.add(Dense(units=hp.Int('units',min_value=32,max_value=1024,step=16),
                    activation=hp.Choice('activation',['relu','tanh'])))
    model.add(BatchNormalization())
    model.add(Dropout(hp.Float('rate',min_value=0.1,max_value=0.5,step=0.1)))

    # Output layer
    model.add(Dense(1,activation='sigmoid'))

    learning_rate=hp.Float('learning_rate',min_value=0.001,max_value=0.1,step=0.01)

    model.compile(loss='binary_crossentropy',
                  optimizer=keras.optimizers.Adam(learning_rate),
                  metrics=['accuracy'])
    return model

In [None]:
import keras_tuner as kt

In [None]:
build_model(kt.HyperParameters())

In [None]:
rtuner=kt.RandomSearch(hypermodel=build_model,
                       objective='val_accuracy',
                       max_trials=10                   
                      )
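It is worth estimating how much of the search space ten random trials can cover. The sketch below counts the distinct configurations, assuming that each stepped range takes the values `min, min + step, ...` up to `max` and that hyperparameters sharing a name share one value per trial; both points are assumptions about the tuner's behavior rather than something verified here.

```python
from decimal import Decimal

def n_steps(lo, hi, step):
    """Number of values lo, lo + step, ... that do not exceed hi."""
    return int((Decimal(hi) - Decimal(lo)) // Decimal(step)) + 1

units = n_steps("32", "1024", "16")               # 63 layer widths
rates = n_steps("0.1", "0.5", "0.1")              # 5 dropout rates
learning_rates = n_steps("0.001", "0.1", "0.01")  # 10 learning rates
activations = 2                                   # 'relu' or 'tanh'

grid_size = units * activations * rates * learning_rates
print(grid_size)        # 6300
print(10 / grid_size)   # share of the grid sampled by max_trials=10
```

Under these assumptions, ten trials sample well under one percent of the grid, so the best configuration found is better treated as a reasonable candidate than as an optimum.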

In [None]:
rtuner.search(X_train,y_train,
             epochs=50,validation_data=(X_test,y_test),
             verbose=2)

In [None]:
par=rtuner.get_best_hyperparameters()

In [None]:
par[0].values  # hyperparameter values of the best trial

In [None]:
models=rtuner.get_best_models()

In [None]:
len(models)

In [None]:
models[0].summary()

In [None]:
y_pred=models[0].predict(X_test)>=0.5

In [None]:
y_pred

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)