### Logistic Regression vs. Random Forest Classifier
#### Prediction:
I expect the Random Forest Classifier to perform better. It is an ensemble model, so although it may take longer to train, combining many decision trees should reduce the overfitting that a single tree is prone to. Each prediction averages the trees in the forest, which should also limit the influence of outliers. For these reasons, I predict that the Random Forest Classifier will be the more accurate of the two models.



In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

In [2]:
# Import the data
file_path = Path("Resources/lending_data.csv")
df = pd.read_csv(file_path)
df.head() 

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [3]:
# Define the X (features) and y (target) sets
y = df["loan_status"].values
X = df.drop("loan_status", axis=1)
target_names = ["not approved", "approved"]

In [4]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [5]:
# Optional feature scaling (left commented out, so the models below are trained on unscaled data)
# scaler = StandardScaler().fit(X_train)
# X_train_scaled = scaler.transform(X_train)
# X_test_scaled = scaler.transform(X_test)
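
If scaling were enabled, one way to keep the scaler and the model together is a scikit-learn pipeline. The following is a minimal sketch, not the approach used in this notebook: it assumes a synthetic dataset from `make_classification` as a stand-in for the lending data, and the names `X_demo`, `y_demo`, and `scaled_model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the lending dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# The pipeline fits the scaler on the training data only, then applies the
# same fitted scaler to any data passed to predict() or score()
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
scaled_model.fit(X_tr, y_tr)

test_score = scaled_model.score(X_te, y_te)
print(f"Testing Data Score (scaled): {test_score}")
```

Wrapping both steps in one object avoids accidentally fitting the scaler on the test data.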

In [6]:
# Create the logistic regression model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=10000)
classifier

LogisticRegression(max_iter=10000)

In [8]:
# Train a Logistic Regression model 
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [9]:
# Print the model score
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9921240885954051
Testing Data Score: 0.9918489475856377


In [10]:
# Re-create the split with the same random_state so both models use identical train/test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# scaler = StandardScaler().fit(X_train)
# X_train_scaled = scaler.transform(X_train)
# X_test_scaled = scaler.transform(X_test)

In [11]:
# Import a Random Forests classifier
from sklearn.ensemble import RandomForestClassifier

In [12]:
# Train a Random Forest Classifier model
clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [13]:
# Print the classification report and the model scores
print(classification_report(y_test, y_pred, target_names=target_names))
print(f'Training Score: {clf.score(X_train, y_train)}')
print(f'Testing Score: {clf.score(X_test, y_test)}')

              precision    recall  f1-score   support

not approved       1.00      0.99      1.00     18765
    approved       0.85      0.89      0.87       619

    accuracy                           0.99     19384
   macro avg       0.92      0.94      0.93     19384
weighted avg       0.99      0.99      0.99     19384

Training Score: 0.9975409272252029
Testing Score: 0.9914878250103177
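
The printed outputs can be compared more directly with a little arithmetic. The sketch below copies the scores and support counts from the outputs above and computes each model's train/test gap, plus the accuracy that always predicting the majority class would achieve. The names `scores`, `gaps`, and `majority_baseline` are illustrative and not part of the notebook.

```python
# Scores copied from the notebook outputs above
scores = {
    "LogisticRegression": {"train": 0.9921240885954051, "test": 0.9918489475856377},
    "RandomForestClassifier": {"train": 0.9975409272252029, "test": 0.9914878250103177},
}

# Train/test gap: training accuracy minus testing accuracy
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train/test gap = {gap:.4f}")

# Test-set support counts copied from the classification report above
support = {"not approved": 18765, "approved": 619}
total = sum(support.values())

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(support.values()) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

The gap is about 0.0003 for logistic regression and about 0.0061 for the random forest, while the majority-class baseline is already about 0.9681, so both testing scores sit only a few points above a model that learns nothing.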


### Conclusion:
The results did not match my prediction. The testing scores are nearly identical (0.9918 for logistic regression versus 0.9915 for the random forest), but the logistic regression model shows a much smaller gap between its training and testing scores (about 0.0003 versus 0.006), so in this case it generalized slightly better. The data is also heavily imbalanced: the test set contains 18,765 "not approved" loans and only 619 "approved" loans, so accuracy is dominated by the majority class and may overstate how well either model performs. Balancing the data, or comparing the models on per-class metrics such as precision and recall, would give a truer picture of how evenly matched the models are.
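
One possible adjustment for the imbalance is to weight the classes during training and to score the models with a metric that is not dominated by the majority class. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` with roughly a 97% / 3% class split rather than the lending data, and `class_weight="balanced"` is one option among several (resampling is another). The names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: NOT the lending dataset)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, weights=[0.97, 0.03], random_state=1
)

# stratify keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

models = {
    "plain": LogisticRegression(max_iter=10000),
    "balanced": LogisticRegression(max_iter=10000, class_weight="balanced"),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    results[name] = {
        "accuracy": model.score(X_te, y_te),
        # Balanced accuracy is the mean of the per-class recalls
        "balanced_accuracy": balanced_accuracy_score(y_te, y_hat),
    }
    print(name, results[name])
```

Because balanced accuracy averages recall across classes, a model that ignores the minority class scores close to 0.5 on it even when its plain accuracy is high.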