#### Dependencies and Setup

In [34]:
# dependencies
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [35]:
# import the data
file_path = Path('Resources/lending_data.csv')
lending_df = pd.read_csv(file_path)
lending_df.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt | loan_status |
|---|---|---|---|---|---|---|---|---|
| 0 | 10700.0 | 7.672 | 52800 | 0.431818 | 5 | 1 | 22800 | 0 |
| 1 | 8400.0 | 6.692 | 43600 | 0.311927 | 3 | 0 | 13600 | 0 |
| 2 | 9000.0 | 6.963 | 46100 | 0.349241 | 3 | 0 | 16100 | 0 |
| 3 | 10700.0 | 7.664 | 52700 | 0.43074 | 5 | 1 | 22700 | 0 |
| 4 | 10800.0 | 7.698 | 53000 | 0.433962 | 5 | 1 | 23000 | 0 |


In [36]:
# verify non-null data
lending_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 non-null  float64
 1   interest_rate     77536 non-null  float64
 2   borrower_income   77536 non-null  int64  
 3   debt_to_income    77536 non-null  float64
 4   num_of_accounts   77536 non-null  int64  
 5   derogatory_marks  77536 non-null  int64  
 6   total_debt        77536 non-null  int64  
 7   loan_status       77536 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 4.7 MB


#### Prediction
I expect the logistic regression model to perform better than the random forest classifier, since logistic regression tends to do well when the target is categorical or binary. Because the data indicates whether a loan was approved with a "1" or a "0", logistic regression should be better suited.
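To illustrate why a binary target fits logistic regression, here is a minimal sketch of how the model turns a linear score into a probability and then into a 0/1 label. The weights, bias, and feature values are hypothetical and chosen only for illustration; they are not fitted from the lending data.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted 0/1 label)."""
    # Linear score: bias plus the weighted sum of the features.
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical weights and features (assumption, not taken from the notebook).
prob, label = predict_label([1.2, -0.4], weights=[2.0, 1.0], bias=-1.0)
print(f"probability of class 1: {prob:.4f}, predicted label: {label}")
```

scikit-learn's `LogisticRegression` learns the weights and bias from the training data; the thresholding step is what makes its output line up with a "1" or "0" target.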

In [37]:
# define the X and y sets
y = lending_df['loan_status'].values
X = lending_df.drop('loan_status', axis = 1)

In [38]:
# split the features and target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=777)
X_train.head()

| | loan_size | interest_rate | borrower_income | debt_to_income | num_of_accounts | derogatory_marks | total_debt |
|---|---|---|---|---|---|---|---|
| 36107 | 10200.0 | 7.447 | 50700 | 0.408284 | 4 | 1 | 20700 |
| 24526 | 9900.0 | 7.325 | 49500 | 0.393939 | 4 | 0 | 19500 |
| 39387 | 8500.0 | 6.757 | 44200 | 0.321267 | 3 | 0 | 14200 |
| 46172 | 8700.0 | 6.807 | 44700 | 0.328859 | 3 | 0 | 14700 |
| 39976 | 10900.0 | 7.77 | 53700 | 0.441341 | 5 | 1 | 23700 |


##### Unscaled Results

In [39]:
# train a Logistic Regression model and print the model score (unscaled)
# (LogisticRegression was already imported in the setup cell)
model_us = LogisticRegression(max_iter=10000)
model_us.fit(X_train, y_train)
print(f'Unscaled LR Training Data Score: {model_us.score(X_train, y_train)}')
print(f'Unscaled LR Testing Data Score: {model_us.score(X_test, y_test)}')

Unscaled LR Training Data Score: 0.9923304443527308
Unscaled LR Testing Data Score: 0.9912298803136608


In [40]:
# train a Random Forest Classifier model and print the model score (unscaled)
classifier_us = RandomForestClassifier(random_state=777, n_estimators=500).fit(X_train, y_train)
print(f'Unscaled RFC Training Score: {classifier_us.score(X_train, y_train)}')
print(f'Unscaled RFC Testing Score: {classifier_us.score(X_test, y_test)}')

Unscaled RFC Training Score: 0.9974893382858715
Unscaled RFC Testing Score: 0.991642591828312


In [41]:
# scale the data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

##### Scaled Results

In [42]:
# train a Logistic Regression model and print the model score (scaled)
model_s = LogisticRegression()
model_s.fit(X_train_scaled, y_train)
print(f'Scaled LR Training Data Score: {model_s.score(X_train_scaled, y_train)}')
print(f'Scaled LR Testing Data Score: {model_s.score(X_test_scaled, y_test)}')

Scaled LR Training Data Score: 0.9942736277342138
Scaled LR Testing Data Score: 0.9937061494015683


In [43]:
# train a Random Forest Classifier model and print the model score (scaled)
classifier_s = RandomForestClassifier(random_state=777, n_estimators=500).fit(X_train_scaled, y_train)
print(f'Scaled RFC Training Score: {classifier_s.score(X_train_scaled, y_train)}')
print(f'Scaled RFC Testing Score: {classifier_s.score(X_test_scaled, y_test)}')

Scaled RFC Training Score: 0.9974893382858715
Scaled RFC Testing Score: 0.9913330581923235


#### Analysis

For both the unscaled and the scaled logistic regression (LR) models, the training and testing scores were very close to each other, and the scaled and unscaled LR results were also close. The random forest classifier (RFC) scores were likewise close to one another, although the gap between training and testing was somewhat larger than for LR. I wondered whether my choice of random state (777) played a role and whether the scores would drop with a much lower value; since `random_state` only seeds the shuffling, its magnitude by itself should not matter. On the testing data, RFC slightly outperformed LR on the unscaled features (0.99164 vs. 0.99122), while LR came out ahead on the scaled features (0.99370 vs. 0.99133); RFC had the higher training score in both cases. I was expecting the LR models to do clearly better in this instance.

Here are my initial results (truncated): 
* Unscaled LR Training Data Score: 0.99233
* Unscaled LR Testing Data Score: 0.99122
* Unscaled RFC Training Score: 0.99748
* Unscaled RFC Testing Score: 0.99164
* Scaled LR Training Data Score: 0.99427
* Scaled LR Testing Data Score: 0.99370
* Scaled RFC Training Score: 0.99748
* Scaled RFC Testing Score: 0.99133
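
As a quick check on how close these numbers are, the following sketch computes the gap between training and testing accuracy for each model and picks out the highest testing score. The values are copied from the cell outputs above; the helper name `train_test_gaps` is my own and not part of any library.

```python
# Scores copied from the cell outputs: (training accuracy, testing accuracy).
scores = {
    "Unscaled LR": (0.9923304443527308, 0.9912298803136608),
    "Unscaled RFC": (0.9974893382858715, 0.991642591828312),
    "Scaled LR": (0.9942736277342138, 0.9937061494015683),
    "Scaled RFC": (0.9974893382858715, 0.9913330581923235),
}


def train_test_gaps(results):
    """Return training score minus testing score for each model."""
    return {name: train - test for name, (train, test) in results.items()}


gaps = train_test_gaps(scores)
# Model with the highest accuracy on the held-out testing data.
best_test = max(scores, key=lambda name: scores[name][1])

for name, (train, test) in scores.items():
    print(f"{name:<13} train={train:.5f} test={test:.5f} gap={gaps[name]:.5f}")
print(f"Highest testing score: {best_test}")
```

A larger gap means the model fits the training data more closely than it generalizes, so the gap is a useful companion to the raw testing score.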

I initially expected these scores to change if I lowered the random state. On reflection, running the models again with a different random state, while keeping the LR iterations at 10,000 and n_estimators at 500, would most likely not move the results in any significant way. So while my initial assumption that logistic regression would yield better results was not clearly confirmed, I can take solace that the two models ended in a virtual tie.
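
One way to test the random state question directly is to repeat the split and fit with several seeds and look at the spread of testing scores. The sketch below does that on synthetic stand-in data from `make_classification`, which is an assumption so the example runs without the CSV; the seed list and the reduced `n_estimators=100` are also illustrative choices for speed, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def scores_across_seeds(X, y, seeds):
    """Return testing accuracy per model for each train/test split seed."""
    results = {"LR": [], "RFC": []}
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
        models = {
            # Scaling is fitted on the training split only, inside the pipeline.
            "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000)),
            # Fewer trees than the notebook's 500, purely to keep the sketch fast.
            "RFC": RandomForestClassifier(n_estimators=100, random_state=seed),
        }
        for name, model in models.items():
            model.fit(X_train, y_train)
            results[name].append(model.score(X_test, y_test))
    return {name: np.array(values) for name, values in results.items()}


# Stand-in data (assumption): swap in the notebook's X and y to test the real case.
X_demo, y_demo = make_classification(n_samples=2000, n_features=7, random_state=0)
seed_scores = scores_across_seeds(X_demo, y_demo, seeds=[1, 42, 777, 9999])

for name, values in seed_scores.items():
    print(f"{name}: mean={values.mean():.4f}, std={values.std():.4f}")
```

If the standard deviation across seeds is comparable to the difference between the two models, that difference is best read as noise from the split rather than a real advantage.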