# Introduction

In this project, we aim to build a predictive model to determine whether a borrower will fully repay a loan or default (charged off), based on historical data provided by LendingClub.

This classification task has real-world implications for financial institutions, helping reduce default risk and improve lending decisions.

We'll go through a complete machine learning pipeline, including data preprocessing, exploratory analysis, model training, evaluation, and model persistence.




### Data Features
-----
There are many LendingClub data sets on Kaggle. Here is the information on this particular data set:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>sub_grade</td>
      <td>LC assigned loan subgrade</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_title</td>
      <td>The job title supplied by the Borrower when applying for the loan.*</td>
    </tr>
    <tr>
      <th>7</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>8</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>9</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>10</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>11</th>
      <td>issue_d</td>
      <td>The month which the loan was funded</td>
    </tr>
    <tr>
      <th>12</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>13</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>14</th>
      <td>title</td>
      <td>The loan title provided by the borrower</td>
    </tr>
    <tr>
      <th>15</th>
      <td>zip_code</td>
      <td>The first 3 numbers of the zip code provided by the borrower in the loan application.</td>
    </tr>
    <tr>
      <th>16</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>17</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>18</th>
      <td>earliest_cr_line</td>
      <td>The month the borrower's earliest reported credit line was opened</td>
    </tr>
    <tr>
      <th>19</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>20</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>21</th>
      <td>revol_bal</td>
      <td>Total credit revolving balance</td>
    </tr>
    <tr>
      <th>22</th>
      <td>revol_util</td>
      <td>Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.</td>
    </tr>
    <tr>
      <th>23</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>24</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
    <tr>
      <th>25</th>
      <td>application_type</td>
      <td>Indicates whether the loan is an individual application or a joint application with two co-borrowers</td>
    </tr>
    <tr>
      <th>26</th>
      <td>mort_acc</td>
      <td>Number of mortgage accounts.</td>
    </tr>
    <tr>
      <th>27</th>
      <td>pub_rec_bankruptcies</td>
      <td>Number of public record bankruptcies</td>
    </tr>
  </tbody>
</table>

----

# Section 1: Data Loading

In this section, we load the LendingClub dataset and conduct a quick overview of the data structure.

Steps:
- Import necessary libraries
- Load the dataset into a Pandas DataFrame
- Display dataset shape and column information
- Preview the first few rows

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
df = pd.read_csv('data/lending_club_loan_data.csv')
df.head(5)

In [None]:
df.info() 

In [None]:
df.describe().T

# Section 2: Exploratory Data Analysis (EDA)

EDA helps uncover trends, patterns, and relationships in the data.

In this section, we:
- Visualize class imbalance between loan statuses
- Explore distributions of numerical and categorical features
- Analyze correlations with the target variable (`loan_status`)
- Identify potential features for modeling

**Create a countplot of the (`loan_status`) column.**

In [None]:
sns.countplot(x='loan_status',data=df,hue='loan_status',legend=False)
plt.savefig("plots/CountPlot_of_LoanStatus.png", dpi=300, bbox_inches='tight')
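
The countplot shows the class imbalance visually; the same information can also be read off as proportions. The following is a minimal sketch on a small made-up frame (the `toy` values are assumptions for illustration, not figures from the LendingClub data):

```python
import pandas as pd

# Toy stand-in for df; the notebook itself would use the loaded DataFrame.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid"] * 8 + ["Charged Off"] * 2
})

# normalize=True returns each class's share of all rows instead of raw counts.
class_share = toy["loan_status"].value_counts(normalize=True)
print(class_share)
```

Knowing the share of each class up front helps when reading accuracy later, since a model that always predicts the majority class already scores that share.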

**Create a histogram of the (`loan_amnt`) column.**

In [None]:
plt.figure(figsize=(12,6))  
sns.histplot(df['loan_amnt'], bins=40, edgecolor=None, alpha=0.7)
plt.savefig("plots/Hits_Count_Histplot_of_LoanAmount.png", dpi=300, bbox_inches='tight')

**Calculate the correlation between all continuous numeric variables using the `.corr()` method.**

In [None]:
df.corr(numeric_only=True)

**Visualize using a heatmap**

In [None]:
corr_df=df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr_df, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.savefig("plots/Heatmap_of_Corr.png", dpi=300, bbox_inches='tight')

**`loan_amnt` and `installment` are almost perfectly correlated, so we explore this pair of features further.**

In [None]:
print(df['loan_amnt'].describe())
print("\n")
print(df['installment'].describe())

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x='installment', y='loan_amnt', data=df)
plt.savefig("plots/Scatterplot_of_Installment_vs_LoanAmount.png", dpi=300, bbox_inches='tight')
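
One likely reason for the near-perfect correlation is that the installment is derived from the loan amount. The sketch below uses the standard fixed-rate amortization formula; that LendingClub computes `installment` exactly this way is an assumption, and `monthly_installment` is a hypothetical helper rather than part of the dataset or any library.

```python
def monthly_installment(principal: float, annual_rate_pct: float, term_months: int) -> float:
    """Fixed monthly payment for a fully amortizing loan (standard annuity formula)."""
    monthly_rate = annual_rate_pct / 100 / 12
    if monthly_rate == 0:
        # With no interest, the principal is simply spread evenly over the term.
        return principal / term_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)


# For a fixed rate and term, the payment scales linearly with the principal,
# which would explain why the two columns move together.
print(round(monthly_installment(10_000, 12.0, 36), 2))
print(round(monthly_installment(20_000, 12.0, 36), 2))
```

If this relationship holds, `installment` carries little information beyond `loan_amnt`, `int_rate`, and `term`.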

**Create a boxplot showing the relationship between the (`loan_status`) and the (`loan_amnt`).**

In [None]:
sns.boxplot(x='loan_status', y='loan_amnt', data=df, hue='loan_status', palette='Set1', legend=False)
plt.savefig("plots/Boxplot_of_LoanStatus_vs_Loanamount.png", dpi=300, bbox_inches='tight')

**Calculate the summary statistics for the loan amount, grouped by the loan_status.**

In [None]:
df['loan_amnt'].groupby(df['loan_status']).describe()

**Now let's explore the Grade and SubGrade columns that LendingClub attributes to the loans.**

In [None]:
grades=df['grade'].to_numpy()
grades = np.array(sorted(set(grades)))  
grades

In [None]:
sub_grades=df['sub_grade'].to_numpy()
sub_grades = np.array(sorted(set(sub_grades)))  
sub_grades

**Create a countplot per (`grade`) column.**

In [None]:
sns.countplot(x='grade', data=df,hue='loan_status', palette='Set1')
plt.savefig("plots/Countplot_of_Grade.png", dpi=300, bbox_inches='tight')

**Create a count plot per (`sub_grade`)**

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='sub_grade', data=df, hue='sub_grade', order=sub_grades, hue_order=sub_grades, palette='coolwarm', legend=False)
plt.savefig("plots/Countplot_of_SubGrade.png", dpi=300, bbox_inches='tight')

**Create a count plot per (`sub_grade`), but here we will use hue as (`loan_status`) to compare**

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='sub_grade', data=df,palette='coolwarm',order=sub_grades, hue='loan_status')
plt.savefig("plots/Countplot_of_SubGrade_w_hue_LoanStatus.png", dpi=300, bbox_inches='tight')

# Section 3: Feature Engineering and Data PreProcessing

Here we transform the data into a format suitable for machine learning algorithms.

Steps include:
- Encoding categorical variables (e.g., one-hot encoding)
- Scaling or normalizing numerical features
- Creating new features (e.g., interaction terms or flag variables)
- Removing highly correlated, irrelevant, or redundant features
- Identifying and handling missing values
- Ensuring correct data types
- Removing duplicate rows (if any)

**Now we will create a new column called (`loan_repaid`), which contains a 1 if the (`loan_status`) is "Fully Paid" and a 0 if it is "Charged Off".**

In [None]:
status=df['loan_status'].to_numpy()
status = np.array(sorted(set(status)))  
status

In [None]:
def ispaid(x):
    if x=="Fully Paid":
        return 1
    return 0

In [None]:
df['loan_repaid'] = df['loan_status'].apply(ispaid)

In [None]:
df[['loan_status', 'loan_repaid']]


**Create a bar plot showing the correlation of the numeric features to the new (`loan_repaid`) column.**

In [None]:
df.corr(numeric_only=True)['loan_repaid'].drop('loan_repaid').dropna().sort_values().plot(kind='bar')
plt.savefig("plots/Barplot_of_Corr_with_LoanRepaid.png", dpi=300, bbox_inches='tight')

**Our new Data**

In [None]:
df.head(5)

### Missing Data

**Let's explore the columns with missing data. We use a variety of factors to decide whether they would be useful, and whether we should keep them, discard them, or fill in the missing values.**

In [None]:
len(df)

**Create a Series that displays the total count of missing values per column.**

In [None]:
total_count= df.isnull().sum()
total_count

**Express the missing values as a percentage of the total number of rows.**

In [None]:
total_count = (total_count / len(df)) * 100
total_count

***We will handle the missing data feature by feature, choosing the best approach we can for each in order to get better results.***

**Check how many unique employment job titles there are.**

In [None]:
df['emp_title'].nunique()

In [None]:
df['emp_title'].value_counts()

**There is no practical way to save this feature: it has too many distinct values to convert into dummy variables, so the best solution is to drop it.**

In [None]:
df.drop(columns=['emp_title'], inplace=True)

In [None]:
df.head(5)

**Create a count plot of the (`emp_length`) feature column.**

In [None]:
df['emp_length'].unique()

In [None]:
order = ['< 1 year', '1 year', '2 years', '3 years','4 years','5 years','6 years','7 years','8 years', '9 years','10+ years']
order


In [None]:
plt.figure(figsize=(12,5))
sns.countplot(x='emp_length', data=df, hue='emp_length', order=order, hue_order=order, palette='Set1', legend=False)
plt.savefig("plots/Countplot_of_EmpLength.png", dpi=300, bbox_inches='tight')

**Plot out the countplot with a hue separating Fully Paid vs Charged Off**

In [None]:
plt.figure(figsize=(12,5))
sns.countplot(x='emp_length', data=df, order=order,hue='loan_status', palette='Set1')
plt.savefig("plots/Countplot_of_EmpLength_w_hue_LoanStatus.png", dpi=300, bbox_inches='tight')

**This still doesn't tell us whether there is a strong relationship between employment length and being charged off. What we want is the percentage of charge-offs per category, that is, what percent of people in each employment category didn't pay back their loan. There are many ways to create this Series.**

In [None]:
emp_co = df[df['loan_status']=="Charged Off"]['emp_length'].value_counts()
emp_co

In [None]:
emp_fp = df[df['loan_status']=="Fully Paid"]['emp_length'].value_counts()
emp_fp


In [None]:
# Share of charged-off loans among all loans in each employment-length category
per_cat = emp_co / (emp_co + emp_fp)
per_cat

In [None]:
plt.figure(figsize=(10,4))
sns.barplot(per_cat, palette='Set1')
plt.savefig("plots/Countplot_of_Emplength_w_per_cat.png", dpi=300, bbox_inches='tight')
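
The per-category charge-off percentage described above can also be computed in a single step. This is a minimal sketch on made-up rows (the `toy` frame is an assumption for illustration); it relies on the fact that the mean of a boolean flag within a group equals the share of `True` values.

```python
import pandas as pd

# Toy stand-in for df with the two columns used above (values are made up).
toy = pd.DataFrame({
    "emp_length": ["1 year", "1 year", "1 year", "1 year",
                   "10+ years", "10+ years", "10+ years", "10+ years", "10+ years"],
    "loan_status": ["Charged Off", "Fully Paid", "Fully Paid", "Fully Paid",
                    "Charged Off", "Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid"],
})

# The mean of a boolean flag within each group is the share of charged-off loans.
charge_off_rate = (
    toy["loan_status"].eq("Charged Off")
    .groupby(toy["emp_length"])
    .mean()
)
print(charge_off_rate)
```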

***Charge-off rates are extremely similar across all employment lengths, so we will drop the (`emp_length`) column.***

In [None]:
df.drop(columns=['emp_length'], inplace=True)

In [None]:
df.head(5)

**Revisit the DataFrame to see what feature columns still have missing data.**

In [None]:
df.isnull().sum()

**Review the title column vs the purpose column.**

In [None]:
df['purpose'].head(10)

In [None]:
df['purpose'].unique()

In [None]:
df['title'].head(10)

In [None]:
df['title'].unique()

**The title column is simply a string subcategory/description of the purpose column. So, we will drop the title column.**

In [None]:
df.drop(columns=['title'], inplace=True)

In [None]:
df.head(5)

**Create a (`value_counts`) of the (`mort_acc`) column.**

In [None]:
value_counts=df['mort_acc'].value_counts()
value_counts    

**Let's review the other columns to see which correlates most highly with (`mort_acc`).**

In [None]:
df.corr(numeric_only=True)['mort_acc']
 

**It looks like the (`total_acc`) feature correlates with (`mort_acc`), which makes sense! Let's try a fillna() approach: we will group the dataframe by (`total_acc`) and calculate the mean value of (`mort_acc`) for each (`total_acc`) entry.**

In [None]:
mort_acc_mean = df.groupby('total_acc')['mort_acc'].mean()
mort_acc_mean


**Let's fill in the missing (`mort_acc`) values based on their total_acc value. If the mort_acc is missing, then we will fill in that missing value with the mean value corresponding to its (`total_acc`) value from the Series we created above. This involves using an .apply() method with two columns.**

In [None]:
def fill_missing(tot,mort):
    if pd.isnull(mort):
        return mort_acc_mean[tot]
    return mort

In [None]:
df['mort_acc'] = df.apply(lambda row: fill_missing(row['total_acc'],row['mort_acc']), axis=1)

In [None]:
df['mort_acc'].isnull().sum()
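
A row-wise `.apply()` can be slow on a large DataFrame. The same group-mean fill can be written with `map` and `fillna`, shown here as a minimal sketch on made-up rows (the `toy` frame and `group_mean` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df (values are made up).
toy = pd.DataFrame({
    "total_acc": [5, 5, 5, 10, 10],
    "mort_acc": [0.0, 2.0, np.nan, 4.0, np.nan],
})

# Mean mort_acc per total_acc value; NaN entries are skipped by mean().
group_mean = toy.groupby("total_acc")["mort_acc"].mean()

# Look up each row's group mean, then use it only where mort_acc is missing.
toy["mort_acc"] = toy["mort_acc"].fillna(toy["total_acc"].map(group_mean))
print(toy)
```

Rows whose `total_acc` group has no observed `mort_acc` at all would remain missing under either approach.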

In [None]:
df.head(5)

**We will recheck the remaining features that have missing data.**

In [None]:
df.isnull().sum()

**`revol_util` and the `pub_rec_bankruptcies` have missing data points, but they account for less than 0.5% of the total data. We will remove the rows that are missing those values in those columns with dropna().**

In [None]:
df.dropna(inplace=True)


In [None]:
df.isnull().sum()

### Categorical Variables and Dummy Variables

**We're done working with the missing data! Now we just need to deal with the string values due to the categorical columns.**

**List all the columns that are currently non-numeric**

In [None]:
df.select_dtypes(include=object).columns    

---
**Let's now go through all the string features to see what we should do with them.**

#### `term` feature

**Convert the term feature into either a 36 or 60 integer numeric data type.**

In [None]:
df['term']=df['term'].apply(lambda x: int(x.split()[0]))

In [None]:
df['term'].value_counts()

#### `grade` feature

**We already know grade is part of sub_grade, so just drop the grade feature.**

In [None]:
df.drop(columns='grade', inplace=True)

In [None]:
df.head(5)

#### `sub_grade` feature

**We will convert the subgrade into dummy variables. Then concatenate these new columns to the original dataframe.**

In [None]:
dummies = pd.get_dummies(df['sub_grade'], drop_first=True)

In [None]:
df=pd.concat([df.drop(columns=['sub_grade']), dummies], axis=1)

In [None]:
df.columns

In [None]:
df.select_dtypes(include=object).columns    

#### `verification_status`, `application_type`, `initial_list_status`, and `purpose` features
**We will convert these columns into dummy variables and concatenate them with the original dataframe.**

In [None]:
vs_dummies=pd.get_dummies(df['verification_status'], drop_first=True)
df=pd.concat([df.drop(columns=['verification_status']), vs_dummies], axis=1)

In [None]:
at_dummies=pd.get_dummies(df['application_type'], drop_first=True)
df=pd.concat([df.drop(columns=['application_type']), at_dummies], axis=1)

In [None]:
ils_dummies=pd.get_dummies(df['initial_list_status'], drop_first=True)
df=pd.concat([df.drop(columns=['initial_list_status']), ils_dummies], axis=1)

In [None]:
p_dummies=pd.get_dummies(df['purpose'], drop_first=True)
df=pd.concat([df.drop(columns=['purpose']), p_dummies], axis=1)

In [None]:
df.columns
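
The four conversions above follow the same pattern, so they could also be expressed as one `pd.get_dummies` call with the `columns=` argument. This is a minimal sketch on made-up rows (the `toy` frame is an assumption for illustration). Unlike the cells above, it prefixes each dummy with its source column name, so the resulting column names differ.

```python
import pandas as pd

# Toy stand-in for df (values are made up).
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000],
    "verification_status": ["Verified", "Not Verified", "Source Verified"],
    "initial_list_status": ["w", "f", "w"],
})

# columns= encodes the listed features and drops the originals in one call;
# each dummy is prefixed with its source column, which avoids name clashes.
encoded = pd.get_dummies(
    toy,
    columns=["verification_status", "initial_list_status"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

Passing `dtype=int` yields 0/1 columns rather than booleans, which keeps the final feature matrix purely numeric.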

In [None]:
df.head(25)

#### `home_ownership` feature
**We will review its value_counts.**

In [None]:
df['home_ownership'].value_counts()

**We will convert these to dummy variables, but replace NONE and ANY with OTHER, so that we end up with just 4 categories, MORTGAGE, RENT, OWN, OTHER. Then concatenate them with the original dataframe.**

In [None]:
df['home_ownership'] = df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')


In [None]:
df['home_ownership'].value_counts()

In [None]:
ho_dummies=pd.get_dummies(df['home_ownership'], drop_first=True)
df=pd.concat([df.drop(columns=['home_ownership']), ho_dummies], axis=1)

In [None]:
df.head(5)

#### `address` feature
**Now let's feature engineer a zip code column from the address in the data set. We will create a column called `zip_code` that extracts the zip code from the address column.**

In [None]:
df['zip_code']=df['address'].apply(lambda x: x.split()[-1])

In [None]:
df.head()

In [None]:
df['zip_code'].value_counts()

**Now we will make this `zip_code` column into dummy variables, concatenate the result, and drop the original `zip_code` column along with the `address` column.**

In [None]:
zc_dummies=pd.get_dummies(df['zip_code'],drop_first=True)
df=pd.concat([df.drop(['zip_code','address'], axis=1),zc_dummies], axis=1)

In [None]:
df.head()

#### `issue_d` feature

**Keeping this feature would be data leakage: when using our model we wouldn't know beforehand whether or not a loan would be issued, so in theory we wouldn't have an `issue_d`. We therefore drop this feature.**

In [None]:
df.drop(['issue_d'],inplace=True, axis=1)

In [None]:
df.columns

#### `earliest_cr_line` feature
**This appears to be a historical time stamp feature. We will extract the year from this feature using a .apply function, then convert it to a numeric feature. Then set this new data to a feature column called `earliest_cr_year`, then drop the `earliest_cr_line` feature.**

In [None]:
df['earliest_cr_year']=df['earliest_cr_line'].apply(lambda x: int(x.split('-')[1]))

In [None]:
df.drop(['earliest_cr_line'],inplace=True, axis=1)
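
The year can also be extracted by parsing the column as a date. The sketch below assumes the values follow a `Mon-YYYY` layout such as `Jun-1990`, which is consistent with the `split('-')` call above but is not confirmed here; the sample values are made up.

```python
import pandas as pd

# Toy values in the assumed "Mon-YYYY" layout.
earliest = pd.Series(["Jun-1990", "Jul-2004", "Jan-1999"])

# An explicit format avoids per-element guessing and fails loudly on bad input.
parsed = pd.to_datetime(earliest, format="%b-%Y")
earliest_cr_year = parsed.dt.year
print(earliest_cr_year.tolist())
```

Parsing to a datetime first also makes it easy to derive other features later, such as the month.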

In [None]:
df.head()

# Section 4: Train Test Split

To evaluate model generalization, we split the dataset into:
- **Training set**: used to train the model
- **Testing set**: used to evaluate model performance on unseen data

We typically use an 80/20 or 70/30 split.

### Splitting the data

**We will split the data into training and testing splits.**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df.drop(['loan_status'], inplace=True, axis=1)

In [None]:
df.columns

In [None]:
X= df.drop('loan_repaid', axis=1).values

In [None]:
y= df['loan_repaid'].values

In [None]:
print(len(df))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
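
Because the two loan outcomes are imbalanced, it may help to keep the class ratio identical in both splits. This is a minimal sketch of the `stratify` option of `train_test_split` on made-up data (the `X_toy` and `y_toy` arrays are assumptions for illustration, not the notebook's `X` and `y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 repaid (1) and 20 charged off (0); values are made up.
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```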

### Normalizing the Data

**We will use a MinMaxScaler to normalize the feature data X_train and X_test.**

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler=MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test=scaler.transform(X_test)
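
The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. The NumPy-only sketch below mirrors the arithmetic `MinMaxScaler` performs with its default `feature_range` of (0, 1); the arrays are made up, and the sketch does not handle constant columns.

```python
import numpy as np

# Toy single-feature splits (values are made up).
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

# "fit": learn the per-feature min and range from the training split only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "transform": apply the same training statistics to both splits.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled.ravel())  # stays within [0, 1]
print(test_scaled.ravel())   # may fall outside [0, 1]
```

Test values outside the training range map outside [0, 1], which is expected and not a sign of a bug.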

# Section 5: Model Building

In this section, we build machine learning models to classify loan status.

Models used:
- **Artificial Neural Network (ANN)** using Keras
- **Random Forest Classifier** using scikit-learn

We fit each model on the training data and compare their capabilities.

Hyperparameters and architectures are chosen based on experimentation or grid search.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

#### ANN Model

In [None]:
model = Sequential()

model.add(Dense(128,  activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(units=1,activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam')

In [None]:
model.fit( x=X_train,y=y_train,epochs=25,batch_size=256,validation_data=(X_test, y_test))

In [None]:
loss=pd.DataFrame(model.history.history)
loss

In [None]:
loss.plot(figsize=(12,6))
plt.savefig("plots/Loss_vs_ValLoss_of_ANN.png", dpi=300, bbox_inches='tight')

#### Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
rf=RandomForestClassifier(class_weight='balanced',random_state=101)

In [None]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

In [None]:
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid, 
                              scoring='f1', cv=3, verbose=2, n_jobs=-1)


In [None]:
grid_search_rf.fit(X_train, y_train)
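
After the search finishes, the winning hyperparameters and the cross-validated score can be inspected on the fitted `GridSearchCV` object. The sketch below runs a deliberately tiny search on synthetic data so it finishes quickly; the data, the reduced grid, and the names `search` and `best_rf` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the search finishes quickly (not the loan data).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=101)

search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=101),
    param_grid={"n_estimators": [10, 20], "max_depth": [None, 5]},
    scoring="f1",
    cv=3,
)
search.fit(X_toy, y_toy)

# best_params_ holds the winning combination, best_score_ its mean CV F1,
# and best_estimator_ the model refit on all of the training data.
print(search.best_params_)
print(round(search.best_score_, 3))
best_rf = search.best_estimator_
```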

# Section 6: Model Evaluation

We evaluate model performance using several metrics:

- **Confusion Matrix**: true positives, false positives, etc.
- **Accuracy**: overall correctness
- **Precision & Recall**: especially important with imbalanced data
- **F1 Score**: harmonic mean of precision and recall
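
The metrics listed above all derive from the four confusion-matrix counts. The following is a minimal sketch with made-up counts; `binary_metrics` is a hypothetical helper, and it assumes every denominator is nonzero.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive class from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 80 true positives, 20 false positives,
# 10 false negatives, 90 true negatives.
metrics = binary_metrics(tp=80, fp=20, fn=10, tn=90)
print(metrics)
```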

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

**For ANN Model**

In [None]:
ann_pred=model.predict(X_test)

In [None]:
print(classification_report(y_test,ann_pred.round()))
print("\n")
print(confusion_matrix(y_test,ann_pred.round()))

**For Random Forest Model**

In [None]:
rf_pred=grid_search_rf.predict(X_test)

In [None]:
print(classification_report(y_test,rf_pred.round()))
print("\n")
print(confusion_matrix(y_test,rf_pred.round()))

# Section 7: Training/Testing with a Random Row

After training and evaluating our models on the test set, it's valuable to simulate how the model would behave in a real-world scenario.

In this section, we:
- Randomly select a sample from the original dataset (excluding the target).
- Feed the row into the trained model to get predictions.
- Compare the predicted labels with the actual `loan_status` values (if available).

This process is a sanity check on individual cases rather than on overall metrics. Because the row is drawn from the full dataset, it may have been seen during training.

In [None]:
import random

### Predicting 1

In [None]:
fully_paid_df=df[df['loan_repaid']==1]

In [None]:
random.seed(101)
# randint includes both endpoints, so the upper bound must be len - 1
random_ind_1 = random.randint(0, len(fully_paid_df) - 1)

new_customer_1 = fully_paid_df.drop('loan_repaid',axis=1).iloc[random_ind_1]
new_customer_1

In [None]:
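# The model was trained on scaled features, so scale the row with the fitted scaler first
model.predict(scaler.transform(new_customer_1.astype(float).values.reshape(1, -1)))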

**Now let's check whether the person repaid the loan or not.**

In [None]:
fully_paid_df.iloc[random_ind_1]['loan_repaid']

In [None]:
#Both are 1, so the model is correct in predicting that this customer will repay the loan.

### Predicting 0

In [None]:
charged_off_df=df[df['loan_repaid']==0]

In [None]:
random.seed()
# randint includes both endpoints, so the upper bound must be len - 1
random_ind_0 = random.randint(0, len(charged_off_df) - 1)

new_customer_0 = charged_off_df.drop('loan_repaid',axis=1).iloc[random_ind_0]
new_customer_0

In [None]:
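# The model was trained on scaled features, so scale the row with the fitted scaler first
model.predict(scaler.transform(new_customer_0.astype(float).values.reshape(1, -1)))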


**Now let's check whether the person repaid the loan or not.**

In [None]:
charged_off_df.iloc[random_ind_0]['loan_repaid']

In [None]:
#Oops, here we have a problem with the imbalanced class 0

# Section 8: Saving Models

After training and validating the models, we save them for future inference.

Tools used:
- `joblib` for scikit-learn models
- `.h5` format for Keras models 

In [None]:
import joblib

In [None]:
joblib.dump(grid_search_rf, 'models/rf_model.pkl')
model.save('models/ann_model.h5')  
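
Inference on new rows also needs the fitted `MinMaxScaler`, since both models were trained on scaled features. The sketch below shows a `joblib` round trip for a scaler; the toy matrix, the temporary directory, and the `scaler.pkl` file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training matrix standing in for X_train (values are made up).
X_fit = np.array([[0.0, 100.0], [5.0, 200.0], [10.0, 300.0]])
fitted_scaler = MinMaxScaler().fit(X_fit)

# A temporary directory keeps the sketch self-contained; the path is hypothetical.
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(fitted_scaler, scaler_path)
    restored_scaler = joblib.load(scaler_path)

# The restored scaler reproduces the original transformation on a new row.
new_row = np.array([[2.5, 250.0]])
print(restored_scaler.transform(new_row))
```

Saving the scaler next to the models keeps preprocessing and prediction consistent when the models are loaded elsewhere.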

# THANK YOU!!