# Business Background
LendingClub, based in San Francisco, California, is a pioneering peer-to-peer lending company in the US. It set a precedent by becoming the first of its kind to register its services as securities with the Securities and Exchange Commission (SEC) and introduced loan trading on a secondary market. Today, it stands as the globe's premier peer-to-peer lending platform.

You work at LendingClub, which offers a range of loan options to urban clients. Whenever a loan request is submitted, LendingClub must evaluate the applicant's profile to make an informed approval decision. That decision carries two risks:

* Denying a loan to an applicant who is capable of repayment means missed business opportunities for LendingClub.
* Conversely, approving a loan to an applicant prone to defaulting can spell financial setbacks for the company.

The provided dataset encompasses historical data on loan applicants, highlighting who defaulted and who didn't. The goal is to discern patterns that signal the likelihood of an applicant defaulting. Such insights can guide strategies like denying the loan, adjusting the loan amount, or setting higher interest rates for riskier borrowers.
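
The trade-off between the two risks listed above can be put in rough numbers. The sketch below is only an illustration: the per-loan costs and the error counts are hypothetical values chosen for the example, not LendingClub figures, and `total_cost` is a helper name introduced here.

```python
# Hypothetical per-loan costs, chosen only for illustration (not LendingClub figures).
COST_MISSED_GOOD_LOAN = 1_000   # interest lost by rejecting a borrower who would repay
COST_APPROVED_DEFAULT = 5_000   # principal lost by approving a borrower who defaults


def total_cost(rejected_good: int, approved_bad: int) -> int:
    """Total cost of a decision policy given counts of the two error types."""
    return (rejected_good * COST_MISSED_GOOD_LOAN
            + approved_bad * COST_APPROVED_DEFAULT)


# Two hypothetical policies evaluated on the same pool of applicants.
strict_policy = total_cost(rejected_good=300, approved_bad=20)
lenient_policy = total_cost(rejected_good=50, approved_bad=120)

print(f"Strict policy cost:  {strict_policy}")
print(f"Lenient policy cost: {lenient_policy}")
```

When one kind of error is much more expensive than the other, the cheaper policy is not necessarily the one with the fewest total mistakes.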

Curated from https://www.kaggle.com/code/faressayah/lending-club-loan-defaulters-prediction/notebook by Fares Sayah

# Settings

In [None]:
# ! pip install -q shap

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report,
    roc_auc_score, roc_curve, auc,
)
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC
from tensorflow.keras.metrics import Recall

pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

# Load Data

Note: the data has been cleaned up and transformed. To learn more about the original data, check here
https://www.kaggle.com/code/faressayah/lending-club-loan-defaulters-prediction/notebook

In [None]:
data_url = 'https://raw.githubusercontent.com/JHU-CDHAI/Dataset/main/lending_club_loan_processed.csv'

data = pd.read_csv(data_url)
print(data.shape)
data.head()

In [None]:
# 0: means Fully Paid
# 1: means Charged Off
data['loan_status'].value_counts()
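
Raw counts are easier to interpret as shares. If the two classes turn out to be imbalanced, the share of the majority class is also the accuracy a trivial "always predict the majority" model would reach, which is a useful baseline for later. A minimal sketch on made-up labels (`toy_status` stands in for `data['loan_status']`):

```python
import pandas as pd

# Toy labels standing in for data['loan_status'] (0 = Fully Paid, 1 = Charged Off).
toy_status = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="loan_status")

# normalize=True returns class shares instead of raw counts.
class_share = toy_status.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```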

In [None]:
data.head()

In [None]:
data.columns

In [None]:
data.shape

In [None]:
# Comment this out if you want to use Zip Code information.
new_cols = [i for i in data.columns if 'zip' not in i]
data = data[new_cols]

In [None]:
data.columns

# Data preparation

In [None]:
train, test = train_test_split(data, test_size=0.33, random_state=42)

print(train.shape)
print(test.shape)

# (264796, 81)
# (130423, 81)

In [None]:
# Removing outliers
print(train.shape)
train = train[train['annual_inc'] <= 250000]
train = train[train['dti'] <= 50]
train = train[train['open_acc'] <= 40]
train = train[train['total_acc'] <= 80]
train = train[train['revol_util'] <= 120]
train = train[train['revol_bal'] <= 250000]
print(train.shape)
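
The chained filters above can also be written as a single boolean mask, which keeps the thresholds in one place. This is a sketch on a made-up frame with only two of the columns; `upper_bounds` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame with two of the columns filtered above; values are made up.
toy = pd.DataFrame({
    "annual_inc": [40_000, 300_000, 85_000, 120_000],
    "dti":        [12.5,   18.0,    75.0,   30.1],
})

# Upper bounds copied from the filters above (illustrative subset).
upper_bounds = {"annual_inc": 250_000, "dti": 50}

# Build one boolean mask that is True only where every column is within its bound.
mask = pd.Series(True, index=toy.index)
for column, bound in upper_bounds.items():
    mask &= toy[column] <= bound

filtered = toy[mask]
print(f"Kept {len(filtered)} of {len(toy)} rows")
```

Note that a comparison against a missing value evaluates to `False`, so both this version and the chained version drop rows where a filtered column is NaN.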

In [None]:
# Separate the features (X) from the target (y)
X_train, y_train = train.drop('loan_status', axis=1), train['loan_status']
X_test,  y_test  = test.drop('loan_status', axis=1),  test['loan_status']

In [None]:
# y_train

In [None]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
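
The scaler is fitted on the training set only and then reused on the test set, so no information from the test set leaks into preprocessing. A side effect is that scaled test values can fall outside the [0, 1] range. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data; values are made up.
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[5.0], [25.0], [40.0]])

toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns min=10, max=30
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training min/max

print(toy_train_scaled.ravel())
print(toy_test_scaled.ravel())
```

Here the test values 5 and 40 lie outside the training range, so they map below 0 and above 1 respectively.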

In [None]:
X_train = np.array(X_train).astype(np.float32)
X_test  = np.array(X_test).astype(np.float32)
y_train = np.array(y_train).astype(np.float32)
y_test  = np.array(y_test).astype(np.float32)

# Simple Neural Network model

## Build Model

In [None]:
# Build a simple neural network model
model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # Binary classification, so use sigmoid activation
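
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each `Dense` layer has one weight per input-unit pair plus one bias per unit. The helper names below are introduced for illustration, and `n_features=10` is an arbitrary value, not the width of the actual dataset.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


def network_param_count(n_features: int, layer_units=(64, 32, 1)) -> int:
    """Total trainable parameters for a stack of Dense layers."""
    total = 0
    n_inputs = n_features
    for n_units in layer_units:
        total += dense_param_count(n_inputs, n_units)
        n_inputs = n_units  # the next layer takes this layer's output as input
    return total


# n_features=10 is an arbitrary illustration, not the width of the actual dataset.
print(network_param_count(n_features=10))
```

Calling it with `X_train.shape[1]` should give the same total that `model.summary()` reports for the model above.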

## Train Model

In [None]:
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[Recall(), 'accuracy'])

In [None]:
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

# Plot the training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

## Evaluate Model

In [None]:
# evaluate() returns the loss followed by the metrics in the order passed to compile(): recall, then accuracy
test_loss, test_recall, test_accuracy = model.evaluate(X_test, y_test, verbose=2)
print(f'Test accuracy: {test_accuracy}')
print(f'Test recall: {test_recall}')

In [None]:
# The default decision threshold is 0.5. Feel free to try other thresholds and see how the confusion matrix changes.
y_pred_probs = model.predict(X_test)

In [None]:
# Set the threshold

threshold = 0.5
y_pred = (y_pred_probs > threshold).astype(int)

# Now you can use 'y_pred' for evaluation with the new threshold
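
To see what moving the threshold does, it helps to sweep a few values and compare the resulting confusion matrices. The sketch below uses made-up labels and probabilities (`toy_true`, `toy_probs`) rather than the model's output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predicted probabilities; values are made up.
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
toy_probs = np.array([0.05, 0.20, 0.35, 0.60, 0.45, 0.80, 0.25, 0.10, 0.55, 0.40])

results = {}
for t in (0.3, 0.5, 0.7):
    toy_pred = (toy_probs > t).astype(int)
    # With labels=[0, 1], ravel() yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()
    results[t] = (int(tn), int(fp), int(fn), int(tp))
    print(f"threshold={t:.1f}  TN={tn} FP={fp} FN={fn} TP={tp}")
```

Lowering the threshold catches more defaulters (fewer false negatives) at the price of flagging more good borrowers (more false positives); raising it does the opposite.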


## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
from sklearn.metrics import precision_score, recall_score

# Assuming y_pred and y_test are numpy arrays or lists containing binary labels (0 or 1)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("Precision:", precision)
print("Recall:", recall)
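
Both scores can also be read directly off the confusion matrix, which makes their meaning explicit. A minimal sketch on a made-up matrix in scikit-learn's layout (rows are true labels, columns are predicted labels):

```python
import numpy as np

# Toy confusion matrix in scikit-learn's layout: rows = truth, columns = prediction.
# [[TN, FP],
#  [FN, TP]]
toy_cm = np.array([[50, 10],
                   [5, 35]])

tn, fp, fn, tp = toy_cm.ravel()

precision_manual = tp / (tp + fp)  # share of predicted defaulters who actually defaulted
recall_manual = tp / (tp + fn)     # share of actual defaulters the model caught

print(f"Precision: {precision_manual:.3f}")
print(f"Recall:    {recall_manual:.3f}")
```

Applying the same two formulas to `cm` should reproduce the values returned by `precision_score` and `recall_score`.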

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')

Questions to think about:

1. How good is the model's performance?
2. Use Gemini AI to create the ROC curve. On the curve, mark the point that corresponds to the p = 0.5 threshold.
3. Can you think of ways to further improve the model's performance?