In [2]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

In [5]:
## Data Import
lend_data = pd.read_csv('Resources/lending_data.csv')
lend_data.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


## Pre-Model thoughts and predictions

I believe that the Random Forest Classifier will outperform the Logistic Regression model *if* I prepare the data to suit each model's strengths. Conversely, I believe Logistic Regression is more likely to perform better if I do not build well-thought-out models. From my reading and understanding, Logistic Regression models can overfit, which could make the model's predictions less reliable. Logistic Regression also fits a single linear decision boundary, so on its own it cannot capture interactions between features, a limitation that a Random Forest Classifier does not share. With my limited understanding of credit risk evaluation, I would expect that accurately deciding whether or not someone is lent money requires weighing several factors together. I also believe that you will not always have access to the information that would make a Random Forest Classifier outperform the Logistic Regression model.

With that said, I have hedged heavily, attaching plenty of conditions to either model's performance. For the sake of this assignment, I predict that on this dataset the Logistic Regression model will outperform the Random Forest Classifier. My primary reasoning is that all the data is numeric, which from my reading favors the Logistic Regression model. My secondary reasoning is that, with my current ability, I am more likely to succeed with the Logistic Regression model.


## Training and Fitting Logistic Regression

In [283]:
lend_data['loan_status'].unique()

array([0, 1], dtype=int64)

In [303]:
## X_train, X_test, y_train, y_test

## Independent input variables
X = lend_data.drop('loan_status', axis=1)

## Dependent outcome variable
y = lend_data['loan_status']
print(X.shape, y.shape)

(77536, 7) (77536,)


In [304]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
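
As a quick sanity check on the split: `train_test_split` holds out 25% of the rows by default when neither `test_size` nor `train_size` is passed. The minimal sketch below redoes that arithmetic for this dataset's 77,536 rows. The variable names are my own, and rounding the test count up is my understanding of scikit-learn's behavior, not something shown in this notebook.

```python
import math

n_rows = 77536            # rows in lend_data, from X.shape above
default_test_size = 0.25  # scikit-learn default when no size is passed

# Test rows are rounded up; the remainder goes to training
n_test = math.ceil(n_rows * default_test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```

The test count should match the total `support` printed in the classification reports.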

In [301]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

targets = ["0", "1"]

# Fit on the training split, then predict the held-out test split
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
prediction = log_model.predict(X_test)

# Per-class metrics, followed by train and test accuracy
print(classification_report(y_test, prediction, target_names=targets))
print(log_model.score(X_train, y_train))
print(log_model.score(X_test, y_test))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384

0.9921240885954051
0.9918489475856377
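
To make the report above easier to read, here is a minimal sketch of how the `f1-score` column relates to `precision` and `recall`: it is their harmonic mean. The inputs are the rounded values printed for class 1, so the result only approximates what scikit-learn computed from the raw counts. The helper name `f1_from` is made up for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded class-1 values from the Logistic Regression report
f1_class_1 = f1_from(0.85, 0.91)
print(round(f1_class_1, 2))
```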


In [302]:
## A dataframe visualization of the predicted loan status vs actual loan status.

pd.DataFrame({"predictions": prediction, "actual":y_test})

Unnamed: 0,predictions,actual
60914,0,0
36843,0,0
1966,0,0
70137,0,0
27237,0,0
...,...,...
45639,0,0
11301,0,0
51614,0,0
4598,0,0


Without a reference value for what to expect, I would have spent far more time working out whether the 99% accuracy was a cause for concern. The reference accuracy is `0.9908171687990095` and my result is `0.9918489475856377`. Either way, if I am interpreting it correctly, the 99% accuracy means that the trained model predicted the correct loan status (0 or 1) for about 99% of the test rows, based on the independent input variables I passed to it.
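
One way to judge whether 99% is suspicious is to compare it against a trivial baseline. The sketch below assumes a hypothetical "model" that always predicts class 0, and uses the `support` counts from the report above (18765 and 619). The variable names are my own.

```python
# Test-set class counts, taken from the support column of the report
n_class_0 = 18765
n_class_1 = 619
n_test = n_class_0 + n_class_1

# Accuracy of always predicting the majority class (0)
baseline_accuracy = n_class_0 / n_test
model_accuracy = 0.9918489475856377  # Logistic Regression test score

print(f"baseline: {baseline_accuracy:.4f}")
print(f"model:    {model_accuracy:.4f}")
```

Because only about 3% of the test rows are class 1, always guessing 0 already scores close to 97%. The per-class precision and recall for class 1 therefore say more about the model than overall accuracy does.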

## Training and Fitting Random Forest

In [78]:
X = lend_data.drop('loan_status', axis=1)
y = lend_data['loan_status']
targets = ["0", "1"]

In [296]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [300]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Fit the forest on the training split, then predict the test split
trees = RandomForestClassifier(random_state=1).fit(X_train, y_train)
prediction = trees.predict(X_test)
print(classification_report(y_test, prediction, target_names=targets))
print(f"Trained: {trees.score(X_train, y_train)}")
print(f"Test: {trees.score(X_test, y_test)}")

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.89      0.87       619

    accuracy                           0.99     19384
   macro avg       0.92      0.94      0.93     19384
weighted avg       0.99      0.99      0.99     19384

Trained: 0.9975409272252029
Test: 0.9914878250103177


The training accuracy is very strong, and the test accuracy is also strong. As explained above, I would be skeptical if I had not been given reference values. The reference value for this model is `0.9910751134956666`, which, as far as I know, is not far from the value I received, `0.9914878250103177`.
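
A rough way to compare the two models on overfitting is the gap between training and test accuracy. The sketch below only redoes arithmetic on the four scores printed in this notebook. The dictionary layout and names are my own, and a small gap is a hint, not proof, that a model generalizes.

```python
# (train accuracy, test accuracy) as printed by each model above
scores = {
    "logistic_regression": (0.9921240885954051, 0.9918489475856377),
    "random_forest": (0.9975409272252029, 0.9914878250103177),
}

# Gap between training and test accuracy for each model
gaps = {name: train - test for name, (train, test) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```

The Random Forest gap is larger (about 0.006 versus about 0.0003), which suggests it fits the training data somewhat more tightly, although both gaps are small.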

## Conclusions / Final Thoughts

Both models performed well, and about equally well. The Logistic Regression model returned a test accuracy of about `0.991848`, and the Random Forest Classifier returned about `0.991487`, a difference of roughly `0.00036`. I still consider myself a layman-in-learning when it comes to machine learning, and at this point justifying a difference that small feels like splitting hairs. My prediction is numerically correct in that the Logistic Regression model produced the higher score. Still, to me the prediction feels like a wash: either model can be used on this data to estimate the loan status that a given set of parameters deserves.
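
To put the `0.00036` difference in concrete terms, the sketch below converts each test accuracy back into a count of misclassified rows, using the 19384 test rows shown in the reports. The names are my own; this is only arithmetic on the printed scores.

```python
n_test = 19384  # test rows, from the classification reports

lr_accuracy = 0.9918489475856377
rf_accuracy = 0.9914878250103177

# Misclassified rows = test rows * (1 - accuracy), rounded to a whole row
lr_errors = round(n_test * (1 - lr_accuracy))
rf_errors = round(n_test * (1 - rf_accuracy))

print(lr_errors, rf_errors, rf_errors - lr_errors)
```

By this arithmetic the two models differ on only 7 of the 19384 test rows.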

Why did they perform nearly the same? Setting aside any potential user error, the explanation may lie in the dataset. From my understanding, the classes in this dataset are close to linearly separable, so the data does not require the nonlinear, multi-feature analysis that would give Random Forest an advantage.

Ultimately, based on these results, I would say either model can be trained on this dataset to produce a good result, since both appear well suited to it. To me, that means the markers that determine a 0 or 1 for a given loan status are consistent and well captured by the parameters in this dataset. The dataset contains a lot of information that gives a good basis for prediction, even for a human. Loan size, debt-to-income ratio, and derogatory marks seem like strong rules of thumb for lending decisions. If my understanding is correct, these are also relevant factors in one's credit score, and as far as I know a credit score can be a deciding factor in whether a borrower is considered for a loan.

So, my conclusion would be that both models are suitable for this dataset, and both perform well.