In [5]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

In [8]:
# Import the data
lending_scores_df = pd.read_csv('resources/lending_data.csv')
lending_scores_df.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


### Consider the models
You will be creating and comparing two models on this data: a logistic regression and a random forest classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct! Write down (in markdown cells in your Jupyter Notebook or in a separate document) your prediction, and provide justification for your educated guess.

As a rule of thumb, logistic regression tends to perform better when the number of noise variables is less than or equal to the number of explanatory variables, while a random forest tends to show higher true and false positive rates as the number of explanatory variables in a dataset increases.

The features are:
- loan size
- interest rate
- borrower income
- debt to income
- number of accounts
- derogatory marks
- total debt

I believe all of these features contribute to a customer's loan status, so I would not expect much noise. We also don't have a large number of variables: only 7 features. Going with the idea that random forests do better with more variables, I predict the logistic regression model will perform better.
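
The rule of thumb about noise variables can be probed on synthetic data. The following is a minimal sketch, not a result from the lending dataset: it assumes a `make_classification` problem with 7 informative features, and the helper `compare_with_noise` and the noise counts (0, 7, 50) are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def compare_with_noise(n_noise, n_informative=7, seed=42):
    """Return test accuracy of both models after appending n_noise pure-noise columns.

    Hypothetical helper: the data is synthetic, not the lending dataset.
    """
    X, y = make_classification(
        n_samples=2000,
        n_features=n_informative,
        n_informative=n_informative,
        n_redundant=0,
        n_repeated=0,
        random_state=seed,
    )
    # Append columns of Gaussian noise that carry no information about y
    rng = np.random.default_rng(seed)
    X = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))])

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

    # Fit the scaler on the training split only, as in the notebook
    scaler = StandardScaler().fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_test_s = scaler.transform(X_test)

    log_reg = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
    forest = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_train_s, y_train)
    return {
        "logistic_regression": log_reg.score(X_test_s, y_test),
        "random_forest": forest.score(X_test_s, y_test),
    }


# Compare the two models with no noise, as many noise columns as features, and many more
noise_results = {n: compare_with_noise(n) for n in (0, 7, 50)}
for n, scores in noise_results.items():
    print(f"noise columns: {n:>2}  {scores}")
```

Whatever pattern this prints holds only for this synthetic setup; it is a way to test the intuition, not evidence about the lending data.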

### Fit a LogisticRegression model and RandomForestClassifier model
Create a LogisticRegression model, fit it to the data, and print the model's score. Do the same for a RandomForestClassifier. You may choose any starting hyperparameters you like. Which model performed better? How does that compare to your prediction? Write down your results and thoughts.

In [16]:
# Split the data into X_train, X_test, y_train, y_test
# loan_status is the target (y); all remaining columns are features (X)
y = lending_scores_df['loan_status'].values
X = lending_scores_df.drop('loan_status', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [17]:
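# Let's provide scaled data for both models
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)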
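
The fit-on-train, transform-both pattern can also be packaged with scikit-learn's `make_pipeline`, which refits the scaler on whatever data is passed to `fit`. This is only a sketch of an alternative, not what the notebook does: the synthetic data from `make_classification` and the names `X_pipe`, `y_pipe`, and `scaled_log_reg` are assumptions standing in for the lending data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lending data (assumption, 7 features)
X_pipe, y_pipe = make_classification(n_samples=1000, n_features=7, random_state=42)
X_pipe_train, X_pipe_test, y_pipe_train, y_pipe_test = train_test_split(
    X_pipe, y_pipe, random_state=42
)

# The pipeline fits the scaler on the training split, then fits the model on the scaled output
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_log_reg.fit(X_pipe_train, y_pipe_train)

# score() applies the already-fitted scaler to the test split before predicting
pipeline_score = scaled_log_reg.score(X_pipe_test, y_pipe_test)
print(f"Testing Data Score: {pipeline_score}")
```

The design benefit is that the test split can never leak into the scaler, because the scaler is only ever fit inside `fit`.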

In [18]:
from sklearn.linear_model import LogisticRegression
# Train a Logistic Regression model and print the model score
# Starting with logistic regression, a tried-and-true model that I predict will do better
model = LogisticRegression()
log_reg = model.fit(X_train_scaled, y_train)
print(f"Training Data Score: {log_reg.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {log_reg.score(X_test_scaled, y_test)}")

Training Data Score: 0.9941188609162196
Testing Data Score: 0.9941704498555509


In [None]:
# Train a Random Forest Classifier model and print the model score
# Let's keep the same train/test split and scaled data

In [19]:
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

In [20]:
clf = RandomForestClassifier(random_state=42, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9971970009629936
Testing Score: 0.9918489475856377


### Conclusion
It would be interesting to try a noisier dataset, since both models are extremely accurate here. The logistic regression model edged out the random forest classifier on the test set, 0.9942 to 0.9918, a difference of about 0.23 percentage points.
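
The scores above are accuracies, while the earlier rule of thumb is phrased in terms of true and false positive rates. One way to get at those rates is through a confusion matrix. The sketch below assumes synthetic data from `make_classification` rather than the lending data, and the names `demo_model`, `X_cm`, and `y_cm` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lending data (assumption, 7 features, binary target)
X_cm, y_cm = make_classification(n_samples=5000, n_features=7, random_state=42)
X_cm_train, X_cm_test, y_cm_train, y_cm_test = train_test_split(
    X_cm, y_cm, stratify=y_cm, random_state=42
)

demo_model = LogisticRegression(max_iter=1000).fit(X_cm_train, y_cm_train)

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_cm_test, demo_model.predict(X_cm_test)).ravel()

true_positive_rate = tp / (tp + fn)   # share of actual positives that were caught
false_positive_rate = fp / (fp + tn)  # share of actual negatives that were flagged
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"TPR: {true_positive_rate:.4f}  FPR: {false_positive_rate:.4f}  Accuracy: {accuracy:.4f}")
```

In the notebook itself, the same four lines of arithmetic could be applied to `log_reg` and `clf` with `X_test_scaled` and `y_test` to compare the two models on these rates.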

Some ad hoc hyperparameter tuning of the random forest (varying `n_estimators`) is below.

In [21]:
clf = RandomForestClassifier(random_state=42, n_estimators=50).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9971626083367726
Testing Score: 0.9919521254643004


In [22]:
clf = RandomForestClassifier(random_state=42, n_estimators=10).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9968874673270051
Testing Score: 0.9915394139496492
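
The three runs above differ only in `n_estimators`, so they can be expressed as a single loop. The following is a sketch under assumptions: it uses synthetic data from `make_classification` so that it runs on its own, and the names `X_rf`, `y_rf`, `tree_counts`, and `rf_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled lending data (assumption, 7 features)
X_rf, y_rf = make_classification(n_samples=1000, n_features=7, random_state=42)
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(X_rf, y_rf, random_state=42)

# Same tree counts as the manual runs in the notebook
tree_counts = [10, 50, 500]
rf_scores = {}
for n in tree_counts:
    rf = RandomForestClassifier(random_state=42, n_estimators=n).fit(X_rf_train, y_rf_train)
    # Store (training accuracy, testing accuracy) for this tree count
    rf_scores[n] = (rf.score(X_rf_train, y_rf_train), rf.score(X_rf_test, y_rf_test))

for n, (train_score, test_score) in rf_scores.items():
    print(f"n_estimators={n:>3}  Training Score: {train_score:.4f}  Testing Score: {test_score:.4f}")
```

Swapping in `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` would reproduce the three cells above in one pass.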
