## Data Summary
The dataset comprises the following features:

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").

int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

installment: The monthly installments owed by the borrower if the loan is funded.

log.annual.inc: The natural log of the self-reported annual income of the borrower.

dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

fico: The FICO credit score of the borrower.

days.with.cr.line: The number of days the borrower has had a credit line.

revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

not.fully.paid: 1 if the loan was not fully paid, and 0 otherwise. This is the target variable.
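As a quick illustration of two of the encodings described above, the sketch below converts one `log.annual.inc` value back to dollars and one `int.rate` value to a percentage. The two numbers are taken from the first row of the dataset; the variable names are illustrative only.

```python
import math

# Values from the first row of loan_data.csv
log_annual_inc = 11.350407   # natural log of self-reported annual income
int_rate = 0.1189            # interest rate stored as a proportion

annual_income = math.exp(log_annual_inc)   # undo the natural log
rate_percent = int_rate * 100              # proportion -> percent

print(f"Annual income: ~${annual_income:,.0f}")
print(f"Interest rate: {rate_percent:.2f}%")
```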

### Problem Statement:

For companies like LendingClub, correctly predicting whether a loan will default is very important. In this project, using historical data from 2007 to 2015, you will build a deep learning model to predict the chance of default for future loans.

#### Data source: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis

In [1]:
#Packages for data reading and manipulation
import pandas as pd
import numpy as np
#for data visualization
import plotly.express as px
import plotly.subplots as sp
import plotly.graph_objs as go
#for encoding
from sklearn.preprocessing import OneHotEncoder
#for data scaling
from sklearn.preprocessing import StandardScaler
#for data splitting
from sklearn.model_selection import train_test_split
#for modelling
import tensorflow as tf
#for evaluation
from sklearn.metrics import accuracy_score, confusion_matrix

import warnings
warnings.filterwarnings('ignore')


# Data Pre-processing

In [2]:
df = pd.read_csv("/kaggle/input/lending-club-loan-data-analysis/loan_data.csv") #loading data
df.head()

Unnamed: 0,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0


In [3]:
df.describe() #data summary statistics

Unnamed: 0,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
count,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0
mean,0.80497,0.12264,319.089413,10.932117,12.606679,710.846314,4560.767197,16913.96,46.799236,1.577469,0.163708,0.062122,0.160054
std,0.396245,0.026847,207.071301,0.614813,6.88397,37.970537,2496.930377,33756.19,29.014417,2.200245,0.546215,0.262126,0.366676
min,0.0,0.06,15.67,7.547502,0.0,612.0,178.958333,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.1039,163.77,10.558414,7.2125,682.0,2820.0,3187.0,22.6,0.0,0.0,0.0,0.0
50%,1.0,0.1221,268.95,10.928884,12.665,707.0,4139.958333,8596.0,46.3,1.0,0.0,0.0,0.0
75%,1.0,0.1407,432.7625,11.291293,17.95,737.0,5730.0,18249.5,70.9,2.0,0.0,0.0,0.0
max,1.0,0.2164,940.14,14.528354,29.96,827.0,17639.95833,1207359.0,119.0,33.0,13.0,5.0,1.0


In [4]:
df.isnull().sum() #null values

credit.policy        0
purpose              0
int.rate             0
installment          0
log.annual.inc       0
dti                  0
fico                 0
days.with.cr.line    0
revol.bal            0
revol.util           0
inq.last.6mths       0
delinq.2yrs          0
pub.rec              0
not.fully.paid       0
dtype: int64

In [5]:
df.shape #data dimensions

(9578, 14)

In [6]:
df.columns 

Index(['credit.policy', 'purpose', 'int.rate', 'installment', 'log.annual.inc',
       'dti', 'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not.fully.paid'],
      dtype='object')

In [7]:
df.nunique() #unique values per column

credit.policy           2
purpose                 7
int.rate              249
installment          4788
log.annual.inc       1987
dti                  2529
fico                   44
days.with.cr.line    2687
revol.bal            7869
revol.util           1035
inq.last.6mths         28
delinq.2yrs            11
pub.rec                 6
not.fully.paid          2
dtype: int64

In [8]:
df.dtypes # variables data types

credit.policy          int64
purpose               object
int.rate             float64
installment          float64
log.annual.inc       float64
dti                  float64
fico                   int64
days.with.cr.line    float64
revol.bal              int64
revol.util           float64
inq.last.6mths         int64
delinq.2yrs            int64
pub.rec                int64
not.fully.paid         int64
dtype: object

### Exploratory Analysis

In [9]:
df.head()

Unnamed: 0,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0


In [10]:
# target value counts
target_value_counts = df['not.fully.paid'].value_counts()

# the counts are passed directly, so the full data frame is not needed here
fig = px.pie(names=target_value_counts.index, values=target_value_counts.values)

fig.update_layout(title='Not Fully Paid')

fig.show()


In [11]:
#purpose vs not fully paid
grouped = df.groupby(['purpose', 'not.fully.paid']).size().reset_index(name='count')
print(grouped)
fig = px.bar(grouped, x='purpose', y='count', color='not.fully.paid', 
             barmode='group', labels={'count': 'Count', 'purpose': 'Purpose', 'not.fully.paid': 'Not Fully Paid'})
fig.show()


               purpose  not.fully.paid  count
0            all_other               0   1944
1            all_other               1    387
2          credit_card               0   1116
3          credit_card               1    146
4   debt_consolidation               0   3354
5   debt_consolidation               1    603
6          educational               0    274
7          educational               1     69
8     home_improvement               0    522
9     home_improvement               1    107
10      major_purchase               0    388
11      major_purchase               1     49
12      small_business               0    447
13      small_business               1    172


In [12]:
df.columns

Index(['credit.policy', 'purpose', 'int.rate', 'installment', 'log.annual.inc',
       'dti', 'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not.fully.paid'],
      dtype='object')

In [13]:
#Relationship between target and features 
grouped1 = df.pivot_table(values='int.rate', index='not.fully.paid', aggfunc='mean')

grouped2 = df.pivot_table(values='installment', index='not.fully.paid', aggfunc='mean')

grouped3 = df.pivot_table(values='log.annual.inc', index='not.fully.paid', aggfunc='mean')

grouped4 = df.pivot_table(values='dti', index='not.fully.paid', aggfunc='mean')

grouped5 = df.pivot_table(values='fico', index='not.fully.paid', aggfunc='mean')


fig = sp.make_subplots(rows=3, cols=2,
                       subplot_titles=['Interest Rate', 'Installment', 'Log Annual Income', 'DTI',
                                       'FICO'])

# Add traces for each pivot table to the subplot
fig.add_trace(go.Bar(x=grouped1.index, y=grouped1['int.rate'], name='Interest Rate'), row=1, col=1)
fig.add_trace(go.Bar(x=grouped2.index, y=grouped2['installment'], name='Installment'), row=1, col=2)
fig.add_trace(go.Bar(x=grouped3.index, y=grouped3['log.annual.inc'], name='Log Annual Income'), row=2, col=1)
fig.add_trace(go.Bar(x=grouped4.index, y=grouped4['dti'], name='DTI'), row=2, col=2)
fig.add_trace(go.Bar(x=grouped5.index, y=grouped5['fico'], name='FICO'), row=3, col=1)

fig.show()



In [14]:
grouped6 = df.pivot_table(values='days.with.cr.line', index='not.fully.paid', aggfunc='mean')

grouped7 = df.pivot_table(values='revol.bal', index='not.fully.paid', aggfunc='mean')

grouped8 = df.pivot_table(values='revol.util', index='not.fully.paid', aggfunc='mean')

grouped9 = df.pivot_table(values='delinq.2yrs', index='not.fully.paid', aggfunc='mean')

grouped10 = df.pivot_table(values='inq.last.6mths', index='not.fully.paid', aggfunc='mean')

grouped11 = df.pivot_table(values='pub.rec', index='not.fully.paid', aggfunc='mean')

fig = sp.make_subplots(rows=3, cols=2,
                       subplot_titles=['Days with CR Line', 'Revolving Balance', 'Revolving Utilization',
                                       'Delinq. in Last 2 Years', 'Inquiries in Last 6 Months', 'Public Records'])
fig.add_trace(go.Bar(x=grouped6.index, y=grouped6['days.with.cr.line'], name='Days with CR Line'), row=1, col=1)
fig.add_trace(go.Bar(x=grouped7.index, y=grouped7['revol.bal'], name='Revolving Balance'), row=1, col=2)
fig.add_trace(go.Bar(x=grouped8.index, y=grouped8['revol.util'], name='Revolving Utilization'), row=2, col=1)
fig.add_trace(go.Bar(x=grouped9.index, y=grouped9['delinq.2yrs'], name='Delinq. in Last 2 Years'), row=2, col=2)
fig.add_trace(go.Bar(x=grouped10.index, y=grouped10['inq.last.6mths'], name='Inquiries in Last 6 Months'), row=3, col=1)
fig.add_trace(go.Bar(x=grouped11.index, y=grouped11['pub.rec'], name='Public Records'), row=3, col=2)
fig.show()



In [15]:
df.columns

Index(['credit.policy', 'purpose', 'int.rate', 'installment', 'log.annual.inc',
       'dti', 'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not.fully.paid'],
      dtype='object')

### Data Encoding

In [16]:
encoder = OneHotEncoder(sparse_output=False, drop='first').set_output(transform="pandas")
cat_encoded = encoder.fit_transform(df[['purpose']])
cat_encoded #one-hot encoding because purpose is nominal: its categories have no hierarchical order

Unnamed: 0,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business
0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...
9573,0.0,0.0,0.0,0.0,0.0,0.0
9574,0.0,0.0,0.0,0.0,0.0,0.0
9575,0.0,1.0,0.0,0.0,0.0,0.0
9576,0.0,0.0,0.0,1.0,0.0,0.0


In [17]:
df = pd.concat([df,cat_encoded],axis=1) #merging the encoded with the original data frame

In [18]:
df.columns

Index(['credit.policy', 'purpose', 'int.rate', 'installment', 'log.annual.inc',
       'dti', 'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not.fully.paid',
       'purpose_credit_card', 'purpose_debt_consolidation',
       'purpose_educational', 'purpose_home_improvement',
       'purpose_major_purchase', 'purpose_small_business'],
      dtype='object')

In [19]:
df.drop(['purpose'],axis=1,inplace=True) #deleting the string purpose column

#### Correlation

In [20]:

corr = df[['credit.policy', 'int.rate', 'installment', 'log.annual.inc',
                    'dti', 'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
                    'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not.fully.paid',
                    'purpose_credit_card', 'purpose_debt_consolidation',
                    'purpose_educational', 'purpose_home_improvement',
                    'purpose_major_purchase', 'purpose_small_business']].corr()

fig = px.imshow(corr, text_auto=".2f",color_continuous_scale='Viridis', aspect="auto")

fig.update_layout(title='Correlation Heatmap',
                  xaxis_title='Features',
                  yaxis_title='Features')
fig.update_layout(height=1000)

fig.show()
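
`DataFrame.corr()` computes the Pearson correlation by default: the covariance of two columns divided by the product of their standard deviations. A small NumPy sketch with the first five `int.rate` and `fico` values from the dataset (five rows are only enough to illustrate the calculation, not to estimate the real relationship):

```python
import numpy as np

# First five rows of int.rate and fico from the dataset
int_rate = np.array([0.1189, 0.1071, 0.1357, 0.1008, 0.1426])
fico = np.array([737, 707, 682, 712, 667], dtype=float)

# Pearson correlation written out from its definition
dx = int_rate - int_rate.mean()
dy = fico - fico.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same quantity from NumPy's built-in helper
r_numpy = np.corrcoef(int_rate, fico)[0, 1]

print(r_manual, r_numpy)
```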

### Data Splitting

In [21]:
df.columns

Index(['credit.policy', 'int.rate', 'installment', 'log.annual.inc', 'dti',
       'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec', 'not.fully.paid',
       'purpose_credit_card', 'purpose_debt_consolidation',
       'purpose_educational', 'purpose_home_improvement',
       'purpose_major_purchase', 'purpose_small_business'],
      dtype='object')

In [22]:
x = df[['credit.policy', 'int.rate', 'installment', 'log.annual.inc', 'dti', #defining features and target variable
       'fico', 'days.with.cr.line', 'revol.bal', 'revol.util',
       'inq.last.6mths', 'delinq.2yrs', 'pub.rec',
       'purpose_credit_card', 'purpose_debt_consolidation',
       'purpose_educational', 'purpose_home_improvement',
       'purpose_major_purchase', 'purpose_small_business']]
y = df['not.fully.paid']
print(len(x))
print(len(y))

9578
9578


In [23]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25) #splitting the data into training and testing 75/25
print(x_train.shape)
print(x_test.shape)

(7183, 18)
(2395, 18)
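
Because only about 16% of loans are not fully paid, a purely random split can leave the two sets with slightly different class balances, and without a fixed seed the split changes on every run. One option, shown here as a sketch on synthetic labels rather than the notebook's data, is to pass `stratify` and `random_state` to `train_test_split` (the seed value 42 and the synthetic arrays are arbitrary choices). Applying this to the notebook would change the numbers reported further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target: 1000 rows, 16% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 160 + [0] * 840)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,
    stratify=y_demo,    # keep the class ratio the same in both splits
    random_state=42,    # fixed seed so the split is reproducible
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```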


In [24]:
from sklearn.preprocessing import StandardScaler #scaling the data sets 
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
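
The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training. A small NumPy sketch of what `StandardScaler` computes, z = (x - mean) / std with the mean and standard deviation taken from the training data (the arrays below are made-up values for illustration):

```python
import numpy as np

# Made-up single-feature data standing in for a train/test split
train = np.array([[600.0], [650.0], [700.0], [750.0]])
test = np.array([[680.0], [800.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0), matching StandardScaler

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.ravel())
print(test_scaled.ravel())
```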

# Custom Neural Network

In [25]:
import tensorflow as tf #custom model with three hidden layers and a sigmoid output layer

class CustomNetwork(tf.keras.Model):
    def __init__(self):
        super(CustomNetwork, self).__init__()
        self.dense1 = tf.keras.layers.Dense(100, activation='sigmoid')
        self.dropout1 = tf.keras.layers.Dropout(0.2)
        self.dense2 = tf.keras.layers.Dense(50, activation='sigmoid')
        self.dropout2 = tf.keras.layers.Dropout(0.2)
        self.dense3 = tf.keras.layers.Dense(20, activation='sigmoid')
        self.dropout3 = tf.keras.layers.Dropout(0.2)
        self.dense4 = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dropout1(x)
        x = self.dense2(x)
        x = self.dropout2(x)
        x = self.dense3(x)
        x = self.dropout3(x)
        x = self.dense4(x)
        return x


model = CustomNetwork()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=tf.keras.losses.BinaryCrossentropy(), #binary cross entropy because the classification is binary 1,0
              metrics=['accuracy'])

model.fit(x_train_scaled, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7a367b754910>
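
Binary cross-entropy penalizes confident wrong predictions much more heavily than confident correct ones. As a rough illustration of the quantity being minimized, the sketch below computes it with NumPy for a few hand-picked probabilities (the helper `binary_cross_entropy` and the example values are illustrative; Keras applies its own clipping and averaging internally):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for probabilities of exactly 0 or 1
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

labels = [0, 1, 0, 1]
confident_right = [0.05, 0.95, 0.05, 0.95]
confident_wrong = [0.95, 0.05, 0.95, 0.05]

loss_right = binary_cross_entropy(labels, confident_right)
loss_wrong = binary_cross_entropy(labels, confident_wrong)

print("Loss when confidently right:", loss_right)
print("Loss when confidently wrong:", loss_wrong)
```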

In [26]:
y_pred = model.predict(x_test_scaled) #model predictions
y_pred



array([[0.05399778],
       [0.32406816],
       [0.08176716],
       ...,
       [0.1325723 ],
       [0.13701808],
       [0.13314494]], dtype=float32)

In [27]:
# use a new name so the original data frame `df` is not overwritten
results = pd.DataFrame({'Prediction': y_pred.flatten(), 'Label': y_test})
print(results.head(10))


      Prediction  Label
3348    0.053998      0
9536    0.324068      1
5801    0.081767      0
3355    0.128471      0
8733    0.133123      0
4277    0.135725      0
1194    0.142374      0
1896    0.138604      1
2739    0.305392      0
8836    0.321406      1


In [28]:
from sklearn.metrics import accuracy_score  #model evaluation
accuracy = accuracy_score(y_test, y_pred.round()) 
print("Accuracy:", accuracy)

Accuracy: 0.8275574112734865
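
Accuracy alone can be misleading here: about 16% of loans in the dataset are not fully paid, so always predicting 0 would already score roughly 84%. The sketch below uses only the ten predictions printed above (far too few rows to draw conclusions from; the helper `summarize` and the alternative 0.3 threshold are illustrative assumptions) to show how recall for the positive class depends on the decision threshold:

```python
# The ten (prediction, label) pairs printed above
preds = [0.053998, 0.324068, 0.081767, 0.128471, 0.133123,
         0.135725, 0.142374, 0.138604, 0.305392, 0.321406]
labels = [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]

def summarize(threshold):
    """Return accuracy, precision and recall at the given threshold."""
    predicted = [int(p >= threshold) for p in preds]
    tp = sum(1 for yhat, y in zip(predicted, labels) if yhat == 1 and y == 1)
    fp = sum(1 for yhat, y in zip(predicted, labels) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(predicted, labels) if yhat == 0 and y == 1)
    tn = sum(1 for yhat, y in zip(predicted, labels) if yhat == 0 and y == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

for threshold in (0.5, 0.3):
    acc, prec, rec = summarize(threshold)
    print(f"threshold={threshold}: accuracy={acc:.2f}, precision={prec:.2f}, recall={rec:.2f}")
```

On this sample, the 0.5 threshold implied by `y_pred.round()` flags no loan as a default, while 0.3 recovers two of the three. On the full test set, `sklearn.metrics.classification_report` or `roc_auc_score` would give a fuller picture than accuracy.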


In [29]:
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_test, y_pred.round()) #rows are true labels, columns are predicted labels
fig = px.imshow(cm,text_auto=True, color_continuous_scale='RdYlBu', labels=dict(x='Predicted', y='True', color='Count'),
                title='Confusion matrix') 
fig.show()

# Using Sequential API

In [30]:
model2 = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='sigmoid'), 
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(50, activation='sigmoid'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(20, activation='sigmoid'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model2.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

model2.fit(x_train_scaled, y_train, epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7a36605b8130>
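
Both models stack the same fully connected layers, so they should have the same number of trainable parameters. A dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases; the sketch below tallies this for the 18 input features and the layer widths used above (the helper name `dense_params` is illustrative; dropout layers add no parameters):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# 18 input features, then the Dense layer widths from the model definition
layer_sizes = [18, 100, 50, 20, 1]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print("Parameters per layer:", per_layer)
print("Total parameters:", sum(per_layer))
```

Once a model has been built, `model2.summary()` can be used to check this count against Keras.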

In [31]:
y_pred2 = model2.predict(x_test_scaled)
y_pred2



array([[0.02420272],
       [0.2990367 ],
       [0.05011741],
       ...,
       [0.09919486],
       [0.11836512],
       [0.09478398]], dtype=float32)

In [32]:
from sklearn.metrics import accuracy_score 
accuracy = accuracy_score(y_test, y_pred2.round()) 
print("Accuracy:", accuracy)

Accuracy: 0.8275574112734865
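
Both models report the same accuracy, which is close to what the class balance alone would give, so the minority class may be getting little attention during training. One common option is to weight the classes inversely to their frequency. The sketch below computes such weights from the dataset-wide counts summed from the purpose table (8045 fully paid, 1533 not fully paid); the `n / (n_classes * count)` heuristic and the variable names are assumptions, and in practice the counts would be taken from `y_train`:

```python
# Dataset-wide class counts, summed from the purpose table
class_counts = {0: 8045, 1: 1533}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: weight each class inversely to its frequency
class_weight = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

print(class_weight)
```

A dictionary like this can be passed to Keras through the `class_weight` argument of `model.fit`.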
