# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
df = pd.read_csv('Resources/lending_data.csv')
df

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


## Check the data and perform any necessary cleaning ##

In [3]:
# check data types
df.dtypes

loan_size           float64
interest_rate       float64
borrower_income       int64
debt_to_income      float64
num_of_accounts       int64
derogatory_marks      int64
total_debt            int64
loan_status           int64
dtype: object

All values are numeric.

In [4]:
# check for null values
for column in df.columns:
    print(f"Column {column} has {df[column].isnull().sum()} null values")

Column loan_size has 0 null values
Column interest_rate has 0 null values
Column borrower_income has 0 null values
Column debt_to_income has 0 null values
Column num_of_accounts has 0 null values
Column derogatory_marks has 0 null values
Column total_debt has 0 null values
Column loan_status has 0 null values


There are no null values.

In [5]:
# Find duplicate entries
# duplicated() compares entire rows by default and marks each repeat of an earlier row as True
print(f"Duplicate entries: {df.duplicated().sum()}")

Duplicate entries: 72307
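
On the question of what `duplicated()` checks: by default `DataFrame.duplicated()` compares every column and marks a row `True` when an identical row appears earlier in the frame, so the first occurrence is not counted. A minimal sketch on a made-up frame (the values below are illustrative, not taken from `lending_data.csv`):

```python
import pandas as pd

# Toy frame for illustration only; not the lending data.
toy = pd.DataFrame({
    "loan_size": [9000.0, 9000.0, 9000.0, 8400.0],
    "interest_rate": [6.963, 6.963, 7.100, 6.692],
    "loan_status": [0, 0, 0, 0],
})

# Default: all columns are compared and the first occurrence is not flagged.
full_row_dupes = toy.duplicated()

# keep=False flags every member of a duplicated group, including the first.
all_members = toy.duplicated(keep=False)

# subset= restricts the comparison to the named columns.
same_loan_size = toy.duplicated(subset=["loan_size"])

print(full_row_dupes.tolist())
print(all_members.tolist())
print(same_loan_size.tolist())
```

Because the features here are a handful of numeric fields with no borrower identifier, identical rows may well be distinct loans, which is one reason to count them rather than drop them.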


In [6]:
df['loan_status'].value_counts()

0    75036
1     2500
Name: loan_status, dtype: int64

In [7]:
2500/75036

0.0333173410096487

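#### The data input is imbalanced and prone to base-rate fallacy ####
Only about 3% of loans (2,500 of 77,536) have a loan_status of 1; the 0.033 above is the ratio of 1s to 0s. With so few positive examples, both models may struggle to predict loan_status 1 accurately, and a high overall accuracy can hide that.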
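
The class shares can be worked out from the counts printed above. The sketch below also computes the accuracy of a hypothetical baseline that always predicts 0, which illustrates why accuracy alone is a weak yardstick on this data. The counts are copied from the `value_counts()` output; the baseline is an assumption added for illustration, not a model from this notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 75036, 1: 2500}
total = sum(counts.values())

# Share of each class among all loans.
shares = {label: n / total for label, n in counts.items()}

# Hypothetical baseline: always predict the majority class (0).
baseline_accuracy = max(counts.values()) / total

print(f"Total loans: {total}")
print(f"Share of loan_status 1: {shares[1]:.4f}")
print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should beat that baseline on the minority class, not only on overall accuracy.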

#### No cleaning required ####

## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression, and a Random Forests Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct! 

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

#### Prediction ####
Both models are classification models, which is appropriate for the dataset.

Logistic Regression is suitable when the two classes can be separated by a roughly linear boundary.

Question: how will the data points be distributed?
Estimate: I would plot 'ability to pay' and 'liability'. 

X axis: income * debt-to-income-ratio. High value = better ability to pay = more likely to be approved.
Y axis: loan size * interest rate. High value = greater liability = less likely to be approved.

A line of best fit would be a positive relationship between x and y = linear = Logistic Regression would work.

However, we can't see how the clusters will be distributed (what people are actually applying for). The cut-off for loan approval may be a clear line, but how are the point clusters arranged? Would they align with a perpendicular line? What if they are spread evenly?
The point clusters would be aligned from 'strongly no' to 'strongly yes' ('no' being most likely rejected, 'yes' most likely approved). So, Logistic Regression could work in this case.
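
The intuition above rests on logistic regression drawing a straight decision boundary: it passes a weighted sum of the features through a sigmoid and predicts 1 when the probability exceeds 0.5, which happens exactly when the weighted sum is positive. A minimal sketch with made-up numbers (`weights`, `bias` and the two example points are assumptions for illustration, not fitted values):

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 under a logistic regression model."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical weights for two scaled features; not learned from the data.
weights = [1.5, -2.0]
bias = 0.25

point_a = [1.0, 0.2]   # score = 1.5 - 0.4 + 0.25 = 1.35, positive side
point_b = [0.1, 0.8]   # score = 0.15 - 1.6 + 0.25 = -1.2, negative side

for point in (point_a, point_b):
    p = predict_proba(point, weights, bias)
    print(point, round(p, 3), int(p > 0.5))
```

If the two classes sit on either side of such a line in feature space, this kind of model should do well; if they are interleaved, it will not.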

What is Random Forests Classifier suited to? Does it take into account every field or only a pair at a time? 

I suspect it would struggle if the comparison between 'ability to pay' and 'liability' is not captured. 
Example: large loan amount gets approved, low loan amount does not. This could be due to the large loan applicant having a high income, low interest rate, low debt-to-income ratio. The implication from a simplistic data point is that large loans are more likely to be approved than small loans, which seems counterintuitive.

Other things to consider:
Loans can also be rejected due to their purpose. Small loans may be sought simply because the applicant is struggling financially and cannot pay for an expense, whereas a large loan or mortgage is usually an investment which is expected to return a profit long term.
*People undergoing financial hardship may be categorised differently altogether and qualify for financial assistance from the government or a repayment plan from the bank, rather than just being rejected outright for a small loan. I have no experience in this area so this is just speculation.*

## Split the Data into Training and Testing Sets

In [8]:
# Split the data into X_train, X_test, y_train, y_test
X = df.drop(columns='loan_status')
y = df['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
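
With a class this rare, a purely random split can leave the two sets with slightly different positive rates. One option, not used in this notebook, is to pass `stratify=y` so that both sets keep the same class proportions. A small sketch on synthetic labels (the array sizes and the 3% rate are assumptions chosen to mimic the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with 3% positives. Not the lending data.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:300] = 1

# stratify=y_demo keeps the positive rate the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(f"Train positive rate: {y_tr.mean():.4f}")
print(f"Test positive rate:  {y_te.mean():.4f}")
```

The default `test_size` of 0.25 is left unchanged here, matching the split used in the cell above.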

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [10]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X_train_scaled, y_train)
print(f'Logistic Regression test score: {model.score(X_test_scaled, y_test)}')

Logistic Regression test score: 0.9936545604622369


In [11]:
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test_scaled)
confusion_matrix(y_test, y_pred)

array([[18652,   113],
       [   10,   609]], dtype=int64)

In [12]:
# Loan status 0 = Negative, 1 = Positive
TNeg = 18652
FPos = 113
FNeg = 10
TPos = 609
accuracy = (TPos + TNeg) / (TPos + FPos + TNeg + FNeg)
print(f'Logistic Regression accuracy: {accuracy}')

# Base rate fallacy exploration
NegPrecision = TNeg/(TNeg + FNeg)
PosPrecision = TPos/(TPos + FPos)
print(f'Precision for Negative status: {NegPrecision}')
print(f'Precision for Positive status: {PosPrecision}')


Logistic Regression accuracy: 0.9936545604622369
Precision for Negative status: 0.9994641517522238
Precision for Positive status: 0.8434903047091413
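
Rather than typing the four cells by hand, they can be unpacked from the matrix itself. For a binary problem scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, with rows as true labels and columns as predictions. The sketch below hard-codes the logistic regression matrix printed above so that it runs on its own, and adds recall for class 1; in the notebook the same counts would come from `tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()`.

```python
# Logistic regression confusion matrix copied from the output above.
matrix = [[18652, 113],
          [10, 609]]

# Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of loans predicted 1, the share that are 1
recall_pos = tp / (tp + fn)      # of loans that are 1, the share that were found

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (class 1): {precision_pos:.4f}")
print(f"Recall (class 1): {recall_pos:.4f}")
```

Recall is the figure most sensitive to the class imbalance, since its denominator is only the 619 positive loans in the test set.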


In [13]:
from sklearn.metrics import classification_report

print('Logistic Regression Classification Report')
print(classification_report(y_test, y_pred))

Logistic Regression Classification Report
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.84      0.98      0.91       619

    accuracy                           0.99     19384
   macro avg       0.92      0.99      0.95     19384
weighted avg       0.99      0.99      0.99     19384



In [14]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier().fit(X_train_scaled, y_train)
print(f'Random Forest test score: {clf.score(X_test_scaled, y_test)}')

Random Forest test score: 0.9914878250103177


In [15]:
y_pred = clf.predict(X_test_scaled)
confusion_matrix(y_test, y_pred)

array([[18664,   101],
       [   64,   555]], dtype=int64)

In [18]:
# Loan status 0 = Negative, 1 = Positive
TNeg = 18664
FPos = 101
FNeg = 64
TPos = 555
accuracy= (TPos + TNeg)/ (TPos + FPos + TNeg + FNeg)
print(f'Random Forest accuracy: {accuracy}')

# Base rate fallacy exploration
NegId = TNeg + FNeg
PosId = TPos + FPos

NegPrecision = TNeg/NegId
PosPrecision = TPos/PosId

print(f'Precision for Negative status: {NegPrecision}')
print(f'Precision for Positive status: {PosPrecision}')

Random Forest accuracy: 0.9914878250103177
Precision for Negative status: 0.9965826569841948
Precision for Positive status: 0.8460365853658537


In [19]:
print('Random Forests Classification Report')
print(classification_report(y_test, y_pred))

Random Forests Classification Report
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.90      0.87       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.93     19384
weighted avg       0.99      0.99      0.99     19384



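Both models reached about 99% overall accuracy, but Logistic Regression did better on the minority class: it missed 10 of the 619 positive loans (recall 0.98, F1 0.91), whereas Random Forest missed 64 (recall 0.90, F1 0.87). Precision for positive loans was similar (0.84 vs 0.85). Both models were less accurate on positive loans than on negative ones, most likely because of the imbalance between negative and positive values in the dataset.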
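
To compare the two models on the minority class, the class-1 metrics can be recomputed from the confusion matrices printed earlier. A short sketch (the helper `class1_metrics` is a name introduced here for illustration):

```python
def class1_metrics(matrix):
    """Return precision, recall and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Matrices copied from the notebook outputs above.
matrices = {
    "Logistic Regression": [[18652, 113], [10, 609]],
    "Random Forest": [[18664, 101], [64, 555]],
}

for name, matrix in matrices.items():
    precision, recall, f1 = class1_metrics(matrix)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Note that the Random Forest was fitted without a fixed `random_state`, so its matrix may differ slightly from run to run.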