# Jeff Pinegar
jeffPinegar1@gmail.com <br>
717-982-0516<br>
## Challenge 19 - Supervised Machine Learning Credit Risk Evaluator<br>

Due: Feb. 22, 2023<br>

---

In [1]:
import numpy as np
import pandas as pd
import os

# ML Data prep
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

# ML models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# ML Evaluation
from sklearn.metrics import confusion_matrix, classification_report


## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
my_data = os.path.join('.','Resources', 'lending_data.csv')
df_raw = pd.read_csv(my_data, index_col=False)
df_raw.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [3]:
df_raw.describe()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
count,77536.0,77536.0,77536.0,77536.0,77536.0,77536.0,77536.0,77536.0
mean,9805.562577,7.292333,49221.949804,0.377318,3.82661,0.392308,19221.949804,0.032243
std,2093.223153,0.889495,8371.635077,0.081519,1.904426,0.582086,8371.635077,0.176646
min,5000.0,5.25,30000.0,0.0,0.0,0.0,0.0,0.0
25%,8700.0,6.825,44800.0,0.330357,3.0,0.0,14800.0,0.0
50%,9500.0,7.172,48100.0,0.376299,4.0,0.0,18100.0,0.0
75%,10400.0,7.528,51400.0,0.416342,4.0,1.0,21400.0,0.0
max,23800.0,13.235,105200.0,0.714829,16.0,3.0,75200.0,1.0


In [4]:
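# Clean the data: remove exact duplicate rows, then rows with missing values.
# drop_duplicates() and dropna() each return a new DataFrame, so df_raw is left unchanged.
df = df_raw
df = df.drop_duplicates()
df = df.dropna()
df.describe()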

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
count,5229.0,5229.0,5229.0,5229.0,5229.0,5229.0,5229.0,5229.0
mean,12844.214955,8.583721,61376.286097,0.461037,6.560719,1.078409,31376.286097,0.301396
std,4779.22887,2.03113,19116.38397,0.166747,4.306406,0.974999,19116.38397,0.458908
min,5000.0,5.25,30000.0,0.0,0.0,0.0,0.0,0.0
25%,8800.0,6.857,45100.0,0.334812,3.0,0.0,15100.0,0.0
50%,11500.0,8.016,56000.0,0.464286,5.0,1.0,26000.0,0.0
75%,17800.0,10.674,81100.0,0.630086,11.0,2.0,51100.0,1.0
max,23800.0,13.235,105200.0,0.714829,16.0,3.0,75200.0,1.0
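
Dropping duplicates cut the frame from 77,536 rows to 5,229 and moved the mean of `loan_status` from about 0.03 to about 0.30, so it is worth checking how many rows are duplicates and how the class balance shifts. A minimal sketch on a small made-up frame (the `toy` values are hypothetical, not taken from `lending_data.csv`):

```python
import pandas as pd

# Hypothetical toy data: the values are made up for illustration only
toy = pd.DataFrame({
    "loan_size": [9000.0, 9000.0, 9000.0, 18000.0, 18500.0],
    "interest_rate": [6.9, 6.9, 6.9, 10.8, 11.0],
    "loan_status": [0, 0, 0, 1, 1],
})

# Count rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Compare the share of class 1 before and after removing duplicates
share_before = toy["loan_status"].mean()
deduped = toy.drop_duplicates()
share_after = deduped["loan_status"].mean()

print(f"duplicate rows: {n_duplicates}")
print(f"class 1 share before: {share_before:.2f}, after: {share_after:.2f}")
```

Running the same check on the real frame would show whether the duplicated rows are concentrated in class 0, which the shift in the `loan_status` mean suggests.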


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression, and a Random Forests Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct! 

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

---
## Answer - Logistic Regression
I believe logistic regression is the best choice for this application. It is the simpler model, and it is likely the better one provided its assumptions hold. The key assumptions are:

* Binary outcome (dependent variable) - CHECK
* Independent observations - I will remove duplicate rows - CHECK
* The relationship between the independent variables and the log-odds of the outcome is linear - CHECK
* Large enough sample size to ensure stability. The ratio of observations to independent variables is very high in this case - CHECK
* No multicollinearity: there should be no high correlation among the independent variables in the model. I noted that one variable is the ratio of two other variables. I plan to test the model with the ratio removed and, separately, with its components removed, thereby eliminating the collinearity - CHECK
* Equal variance of the independent variables - I will apply standard scaling to the data so that every feature has the same variance - CHECK
* Outliers are a problem for both models. An examination of the data indicates there are no significant outliers - CHECK

Other factors to consider:
* Every classification will likely need to be explained, and the results of logistic regression are easier to explain. 
* Logistic regression runs very effectively on a small data set like this one. If the dataset were enormous, however, a random forest would likely perform better.

On the other hand, a random forest would be the better choice if we suspected that any of the relationships were non-linear, or if there were many features, some of them irrelevant or highly correlated. Because debt_to_income is the ratio of two other values in the set, it is a likely source of collinearity, so I will test the model with either this value or its components removed.
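
One way to check the collinearity concern is to look at pairwise correlations among the features before deciding what to drop. The sketch below uses synthetic values (the numbers, the `rng` seed, and the 0.8 cutoff are assumptions, not the lending data) in which `debt_to_income` is computed as `total_debt / borrower_income`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the lending features (values are made up)
income = rng.uniform(30_000, 105_000, size=500)
debt = income * rng.uniform(0.2, 0.7, size=500)
features = pd.DataFrame({
    "borrower_income": income,
    "total_debt": debt,
    "debt_to_income": debt / income,
})

# Pearson correlation between every pair of features
corr = features.corr()
print(corr.round(2))

# Flag pairs whose absolute correlation exceeds an arbitrary cutoff
threshold = 0.8
cols = list(corr.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Any pair that shows up in `high_pairs` is a candidate for dropping one of its members before fitting the logistic regression.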


## Split the Data into Training and Testing Sets

In [5]:
# Separate out the dependent and independent variables
y = df["loan_status"].values


# Drop y out of the dataframe to get the independent variables
X = df.drop("loan_status", axis=1)

# debt_to_income is the ratio of total_debt to borrower_income, so I am dropping those two columns to reduce collinearity and dependence and improve the reliability of the model
X = X.drop('total_debt', axis=1)
X = X.drop('borrower_income', axis=1)


In [6]:
X.head()

Unnamed: 0,loan_size,interest_rate,debt_to_income,num_of_accounts,derogatory_marks
0,10700.0,7.672,0.431818,5,1
1,8400.0,6.692,0.311927,3,0
2,9000.0,6.963,0.349241,3,0
3,10700.0,7.664,0.43074,5,1
4,10800.0,7.698,0.433962,5,1


In [7]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
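
With its default `test_size` of 0.25, `train_test_split` holds out a quarter of the rows, which for 5,229 rows gives the 1,308 test samples that appear as the support in the classification reports. The sketch below reproduces those sizes on synthetic data and adds `stratify`, an option not used in the cell above, to keep the class ratio the same in both subsets. The feature values and the 30% class share are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Synthetic stand-in with the same number of rows as the cleaned frame
n_rows = 5229
X_demo = rng.normal(size=(n_rows, 5))
y_demo = (rng.random(n_rows) < 0.3).astype(int)  # roughly 30% class 1

# Default test_size is 0.25; stratify keeps the class ratio equal across the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=4, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(f"class 1 share: train {y_tr.mean():.3f}, test {y_te.mean():.3f}")
```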

In [8]:
# Scale the X data with StandardScaler()
# Fit the scaler on the training data only
# StandardScaler centers each feature on zero and rescales it to unit variance, i.e. it converts the values to z-scores (mean = 0, variance = 1)
scaler = StandardScaler().fit(X_train)

# use the fitted scaler to transform both the training and the testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled

array([[ 0.20951277,  0.21375305,  0.4825129 ,  0.32336824, -0.0912527 ],
       [-0.88081568, -0.87413624, -0.79421148, -0.84050809, -1.11941892],
       [-0.08403719, -0.08177788,  0.22069233, -0.14218229, -0.0912527 ],
       ...,
       [-0.14694076, -0.13703574,  0.16954273, -0.14218229, -0.0912527 ],
       [-0.67113713, -0.66395899, -0.46433579, -0.60773283, -1.11941892],
       [-0.9646871 , -0.95948991, -0.94322692, -1.07328336, -1.11941892]])
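
The z-score description can be checked directly: `StandardScaler` subtracts the training mean and divides by the training standard deviation (population form, `ddof=0`), and the test set is transformed with those same training statistics. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and testing features (two columns on very different scales)
X_train_demo = np.array([[9000.0, 6.9], [10700.0, 7.7], [18000.0, 10.8], [8400.0, 6.7]])
X_test_demo = np.array([[12000.0, 8.0]])

scaler_demo = StandardScaler().fit(X_train_demo)

# Manual z-scores using the training mean and population standard deviation
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # NumPy uses ddof=0 by default
manual_train = (X_train_demo - train_mean) / train_std
manual_test = (X_test_demo - train_mean) / train_std

# Both comparisons should agree with the scaler's own output
train_matches = np.allclose(scaler_demo.transform(X_train_demo), manual_train)
test_matches = np.allclose(scaler_demo.transform(X_test_demo), manual_test)
print(train_matches, test_matches)
```

Fitting on the training split alone keeps information about the test rows out of the scaling statistics.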

## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [10]:
# Train a Logistic Regression model and print the model score
classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)
print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

Training Data Score: 0.9173680183626626
Testing Data Score: 0.922782874617737
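
Part of the case for logistic regression is that its results are easier to explain. One common way to do that, sketched here on synthetic data rather than the lending set, is to read the fitted coefficients as odds ratios: with standardized features, `exp(coef)` is the multiplicative change in the odds of class 1 for a one-standard-deviation increase in that feature. The feature names and the data-generating rule below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data: class 1 becomes more likely as the first feature grows;
# the second feature has no effect on the outcome
X_demo = rng.normal(size=(1000, 2))
logits = 2.0 * X_demo[:, 0] - 0.5
y_demo = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression().fit(X_scaled, y_demo)

# exp(coefficient) = odds ratio per one-standard-deviation increase
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(["feature_a", "feature_b"], odds_ratios):
    print(f"{name}: odds ratio {ratio:.2f}")
```

An odds ratio well above 1 marks a feature that pushes predictions toward class 1, and a ratio near 1 marks a feature with little influence.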


In [11]:
# continue the evaluation of the model
y_true = y_test

# Predict - use the model to calculate results for the test data
y_pred = classifier.predict(X_test_scaled)

# create and evaluate the confusion matrix
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"True positives (TP): {tp}")
print(f"True negatives (TN): {tn}")
print(f"False positives (FP): {fp}")
print(f"False negatives (FN): {fn}")
precision = tp / (tp + fp)
print(f'precision =  {precision}')
accuracy = (tp + tn)/(tp+tn+fp+fn)
print(f'accuracy =  {accuracy}')
sensitivity = tp / (tp + fn)
print(f'sensitivity =  {sensitivity}')
F1_j = 2*tp/(2*tp+fn+fp)
print(f'F1 = {F1_j}')
print(classification_report(y_true, y_pred))

True positives (TP): 373
True negatives (TN): 834
False positives (FP): 87
False negatives (FN): 14
precision =  0.8108695652173913
accuracy =  0.922782874617737
sensitivity =  0.9638242894056848
F1 = 0.8807556080283353
              precision    recall  f1-score   support

           0       0.98      0.91      0.94       921
           1       0.81      0.96      0.88       387

    accuracy                           0.92      1308
   macro avg       0.90      0.93      0.91      1308
weighted avg       0.93      0.92      0.92      1308



In [12]:
# Train a Random Forest Classifier model and print the model score
clf = RandomForestClassifier(random_state=1, n_estimators=50).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.966590155572558
Testing Score: 0.8585626911314985


In [13]:
# continue the evaluation of the model
y_true = y_test

# Predict - use the model to calculate results for the test data
y_pred = clf.predict(X_test_scaled)

# create and evaluate the confusion matrix
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"True positives (TP): {tp}")
print(f"True negatives (TN): {tn}")
print(f"False positives (FP): {fp}")
print(f"False negatives (FN): {fn}")
precision = tp / (tp + fp)
print(f'precision =  {precision}')
accuracy = (tp + tn)/(tp+tn+fp+fn)
print(f'accuracy =  {accuracy}')
sensitivity = tp / (tp + fn)
print(f'sensitivity =  {sensitivity}')
F1_j = 2*tp/(2*tp+fn+fp)
print(f'F1 = {F1_j}')
print(classification_report(y_true, y_pred))

True positives (TP): 287
True negatives (TN): 836
False positives (FP): 85
False negatives (FN): 100
precision =  0.771505376344086
accuracy =  0.8585626911314985
sensitivity =  0.7416020671834626
F1 = 0.7562582345191041
              precision    recall  f1-score   support

           0       0.89      0.91      0.90       921
           1       0.77      0.74      0.76       387

    accuracy                           0.86      1308
   macro avg       0.83      0.82      0.83      1308
weighted avg       0.86      0.86      0.86      1308



---
## Conclusion

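Both models performed well. However, the logistic regression model performed better, with higher test accuracy (0.92 versus 0.86), a higher F1 score for class 1 (0.88 versus 0.76), and far fewer false negatives (14 versus 100). The random forest scored 0.97 on the training data but only 0.86 on the test data, a gap that suggests it overfit the training set.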
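
The comparison can be reproduced from the four confusion-matrix counts printed for each model. The sketch below recomputes the metrics from those reported counts; the helper name `summarize` is hypothetical.

```python
def summarize(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Counts as printed in the two evaluation cells
logistic = summarize(tp=373, tn=834, fp=87, fn=14)
forest = summarize(tp=287, tn=836, fp=85, fn=100)

for metric in logistic:
    print(f"{metric:>9}: logistic {logistic[metric]:.3f} | forest {forest[metric]:.3f}")
```

The two models have nearly identical false-positive counts (87 and 85), so the difference between them comes almost entirely from recall on class 1.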

The random forest model has more hyperparameters to tune, and adjusting them would likely improve its fit. Still, given the performance and simplicity of the logistic regression model, this would not be time well spent.
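
If tuning the random forest were pursued, a cross-validated grid search is one common approach. The following is a minimal sketch on synthetic data from `make_classification`; the parameter grid and the data are assumptions, not settings tested in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the five lending features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=4)

# Limiting depth and leaf size restrains how closely each tree can fit the training data
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=1),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_tr, y_tr)

test_accuracy = search.best_estimator_.score(X_te, y_te)
print(search.best_params_)
print(f"test accuracy: {test_accuracy:.3f}")
```

Scoring on F1 rather than accuracy is a design choice here, made because the conclusion weighs false negatives heavily.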
