# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
df = pd.read_csv(Path("Resources/lending_data.csv"))
df.head()


| Unnamed: 0 | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |
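The `Unnamed: 0` column in the output above usually means the CSV was saved with its row index, which pandas then reads back as an ordinary column. The sketch below is a minimal illustration on an in-memory copy of the five rows shown above. It assumes the file's header starts with an empty field, which is not confirmed by the notebook, and `sample_csv`, `raw`, and `clean` are illustrative names.

```python
import io

import pandas as pd

# Five rows copied from the displayed output. The leading empty header field
# is an assumption about how the original file was written.
sample_csv = io.StringIO(
    ",loan_size,interest_rate,borrower_income,debt_to_income,"
    "num_of_accounts,derogatory_marks,total_debt,loan_status\n"
    "0,10700.0,7.672,52800,0.431818,5,1,22800,0\n"
    "1,8400.0,6.692,43600,0.311927,3,0,13600,0\n"
    "2,9000.0,6.963,46100,0.349241,3,0,16100,0\n"
    "3,10700.0,7.664,52700,0.43074,5,1,22700,0\n"
    "4,10800.0,7.698,53000,0.433962,5,1,23000,0\n"
)

# Default read: the saved index comes back as a column named "Unnamed: 0".
raw = pd.read_csv(sample_csv)
print(list(raw.columns))

# Reading with index_col=0 treats the first column as the row index instead.
sample_csv.seek(0)
clean = pd.read_csv(sample_csv, index_col=0)
print(list(clean.columns))
```

Reading the file this way would keep the row counter out of the feature matrix. The scores reported in this notebook were produced with the column left in.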


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

I predict that the Logistic Regression model will perform better, because all of the features are numeric rather than categorical.

## Split the Data into Training and Testing Sets

In [4]:
# Create X (features) and Y (target)
X = df.drop(columns=['loan_status'])
Y = df['loan_status']

# Split the data into X_train, X_test, Y_train, Y_test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
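The split above uses the defaults, so the rows that land in each set change from run to run. One possible variation, sketched here on a small synthetic frame rather than the real data, fixes `random_state` for reproducibility and passes `stratify` so both sets keep the same share of the rare class. The frame `toy`, its values, and the choice to drop `Unnamed: 0` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lending data: 1000 rows, 32 of them in class 1.
rng = np.random.default_rng(0)
n_rows = 1000
toy = pd.DataFrame({
    "Unnamed: 0": np.arange(n_rows),                # saved row index
    "loan_size": rng.normal(10000, 2000, n_rows),   # made-up values
    "interest_rate": rng.normal(7.5, 1.0, n_rows),  # made-up values
    "loan_status": np.r_[np.ones(32, dtype=int), np.zeros(n_rows - 32, dtype=int)],
})

# Drop the target and the row-index column so neither is used as a feature.
X_toy = toy.drop(columns=["loan_status", "Unnamed: 0"])
y_toy = toy["loan_status"]

# random_state makes the split repeatable; stratify keeps class shares equal.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy
)

print(f"Class 1 share overall: {y_toy.mean():.3f}")
print(f"Class 1 share in train: {y_tr.mean():.3f}")
print(f"Class 1 share in test: {y_te.mean():.3f}")
```

With so few positive rows, an unstratified split can leave the test set with noticeably more or fewer of them than the training set, which makes precision and recall harder to compare across runs.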

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [7]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression(random_state=0).fit(X_train, Y_train)
print(f"Regression Training Score: {reg.score(X_train, Y_train)}")
print(f"Regression Testing Score: {reg.score(X_test, Y_test)}")

Regression Training Score: 0.9920553033429633
Regression Testing Score: 0.9917457697069748


Because the two loan classes are heavily imbalanced, accuracy alone doesn't tell us much about a model's performance, so I'm also going to compute a confusion matrix and run a classification report to find precision and recall for both models.
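To see why accuracy is a weak signal here, consider a "model" that always predicts class 0. The short calculation below uses the test-set class counts that appear in this notebook's classification report (18760 rows of class 0 and 624 of class 1); the variable names are illustrative.

```python
# Test-set class counts taken from the "support" column of the report.
n_class_0 = 18760
n_class_1 = 624

# A classifier that always answers 0 is right on every class-0 row
# and wrong on every class-1 row.
baseline_accuracy = n_class_0 / (n_class_0 + n_class_1)
baseline_recall_class_1 = 0 / n_class_1

print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")
print(f"Always-predict-0 recall on class 1: {baseline_recall_class_1:.4f}")
```

That trivial baseline already reaches about 0.968 accuracy while finding none of the class 1 loans, so an accuracy of 0.99 is a smaller improvement than it first appears.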

In [9]:
from sklearn.metrics import confusion_matrix

Y_true = Y_test
Y_pred = reg.predict(X_test)
confusion_matrix(Y_true, Y_pred)

array([[18666,    94],
       [   66,   558]], dtype=int64)

In [10]:
tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred).ravel()
print(f"True positives (TP): {tp}")
print(f"True negatives (TN): {tn}")
print(f"False positives (FP): {fp}")
print(f"False negatives (FN): {fn}")

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"Accuracy: {accuracy}")

precision = tp / (tp + fp)
print(f"Precision: {precision}")

sensitivity = tp / (tp + fn)
print(f"Recall/Sensitivity: {sensitivity}")

f1 = 2*precision*sensitivity / (precision + sensitivity)
print(f"f1 Score: {f1}")

True positives (TP): 558
True negatives (TN): 18666
False positives (FP): 94
False negatives (FN): 66
Accuracy: 0.9917457697069748
Precision: 0.8558282208588958
Recall/Sensitivity: 0.8942307692307693
f1 Score: 0.8746081504702194
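As a cross-check on the hand-written formulas, the same numbers can be obtained from scikit-learn's metric functions. The sketch below rebuilds label arrays from the four confusion-matrix counts printed above, so it runs without the original data; `y_true_demo` and `y_pred_demo` are illustrative names.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Counts from the Logistic Regression confusion matrix above.
tn, fp, fn, tp = 18666, 94, 66, 558

# Rebuild label arrays that reproduce those counts:
# actual 0s come first (tn + fp of them), then actual 1s (fn + tp).
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

precision_check = precision_score(y_true_demo, y_pred_demo)
recall_check = recall_score(y_true_demo, y_pred_demo)
f1_check = f1_score(y_true_demo, y_pred_demo)

print(f"Precision: {precision_check}")
print(f"Recall/Sensitivity: {recall_check}")
print(f"f1 Score: {f1_check}")
```

These functions score class 1 as the positive label by default, which matches the `tn, fp, fn, tp` unpacking used in the cell above.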


In [11]:
from sklearn.metrics import classification_report
print(classification_report(Y_true, Y_pred))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18760
           1       0.86      0.89      0.87       624

    accuracy                           0.99     19384
   macro avg       0.93      0.94      0.94     19384
weighted avg       0.99      0.99      0.99     19384



In [12]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
forst = RandomForestClassifier(random_state=0).fit(X_train, Y_train)
print(f"Forest Training Score: {forst.score(X_train, Y_train)}")
print(f"Forest Testing Score: {forst.score(X_test, Y_test)}")

Forest Training Score: 0.9973173751547668
Forest Testing Score: 0.9921068922822947


In [13]:
from sklearn.metrics import confusion_matrix

Y_true = Y_test
Y_pred = forst.predict(X_test)
confusion_matrix(Y_true, Y_pred)

array([[18667,    93],
       [   60,   564]], dtype=int64)

In [14]:
tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred).ravel()
print(f"True positives (TP): {tp}")
print(f"True negatives (TN): {tn}")
print(f"False positives (FP): {fp}")
print(f"False negatives (FN): {fn}")

accuracy = (tp + tn) / (tp + fp + tn + fn) 
print(f"Accuracy: {accuracy}")

precision = tp / (tp + fp)
print(f"Precision: {precision}")

sensitivity = tp / (tp + fn)
print(f"Recall/Sensitivity: {sensitivity}")

f1 = 2*precision*sensitivity / (precision + sensitivity)
print(f"f1 Score: {f1}")

True positives (TP): 564
True negatives (TN): 18667
False positives (FP): 93
False negatives (FN): 60
Accuracy: 0.9921068922822947
Precision: 0.8584474885844748
Recall/Sensitivity: 0.9038461538461539
f1 Score: 0.8805620608899296


In [15]:
from sklearn.metrics import classification_report
print(classification_report(Y_true, Y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18760
           1       0.86      0.90      0.88       624

    accuracy                           0.99     19384
   macro avg       0.93      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



*Which model performed better? How does that compare to your prediction?*

The Random Forest Classifier (RFC) performed better than the Logistic Regression (LR) model, but only by a hair. On the test set, for class 1, RFC scored slightly higher on precision (0.858 vs. 0.856), recall (0.904 vs. 0.894), and F1 score (0.881 vs. 0.875). RFC also produced fewer false positives (93 vs. 94) and fewer false negatives (60 vs. 66). Test accuracy was nearly identical (0.9921 vs. 0.9917).

My prediction did not turn out to be correct. I would be interested in seeing how the two models compare on different data sets and types of data.
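One way to follow up on that question is to score both models with cross-validation on another data set, so the comparison does not hinge on a single train/test split. The sketch below is only an illustration: it uses a synthetic, imbalanced data set from `make_classification` instead of real lending data, and the sample size, class weights, and `max_iter` value are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (roughly 97% class 0); not lending data.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=7, weights=[0.97], random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

# Score each model with 5-fold cross-validation on F1 for class 1.
mean_f1 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="f1")
    mean_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The spread across folds gives a rough sense of whether a gap as small as the one observed here is larger than the variation between splits.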