## Prediction

#### I predict that the Logistic Regression model will do better than the Random Forest Classifier. My reasoning is that the Random Forest tends to have an edge on data with many categorical features, and since this data set is made up of numeric features, I expect Logistic Regression to come out ahead.


In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Import the data
file = Path("Resources/lending_data.csv")

credit_df = pd.read_csv(file)

credit_df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [3]:
# Create the feature matrix X by dropping the target column
X = credit_df.drop(columns=["loan_status"])

X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


In [4]:
# Create the target vector y
y = credit_df["loan_status"]

In [5]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
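
The split above relies on the defaults of `train_test_split`. As a rough illustration of what those defaults do, the sketch below uses synthetic data (the `X_demo`/`y_demo` arrays and the 10% positive rate are assumptions for the example, not properties of `lending_data.csv`). It also shows the optional `stratify` argument, which may be worth considering if the classes turn out to be imbalanced; the notebook does not check this.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows, 3 features, 10% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# Default behaviour: shuffle, then hold out 25% of the rows for testing.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, random_state=1
)

# Optional: stratify so both subsets keep the same class proportions.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(Xa_train.shape, Xa_test.shape)   # 750 training rows, 250 test rows
print(ya_test.mean(), yb_test.mean())  # positive rate in each test set
```

Without `stratify`, the positive rate in the test set can drift from the overall rate by chance; with it, the rate is preserved.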

In [6]:
X_train.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
29175,8600.0,6.792,44500,0.325843,3,0,14500
23020,7800.0,6.419,41000,0.268293,2,0,11000
31269,10000.0,7.386,50100,0.401198,4,1,20100
35479,9300.0,7.093,47300,0.365751,3,0,17300
13470,9200.0,7.045,46900,0.360341,3,0,16900


In [7]:
X_test.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
60914,12600.0,8.469,60300,0.502488,6,1,30300
36843,9800.0,7.289,49200,0.390244,4,0,19200
1966,10900.0,7.77,53700,0.441341,5,1,23700
70137,10700.0,7.666,52700,0.43074,5,1,22700
27237,9900.0,7.353,49800,0.39759,4,0,19800


In [8]:
# create a logistic regression model
from sklearn.linear_model import LogisticRegression

L_classifier = LogisticRegression()

L_classifier

LogisticRegression()

In [9]:
# Train a Logistic Regression model 
L_classifier.fit(X_train, y_train)

LogisticRegression()

In [10]:
#print the model score
print(f"Training Data Score: {L_classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {L_classifier.score(X_test, y_test)}")

Training Data Score: 0.9921240885954051
Testing Data Score: 0.9918489475856377
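
For a classifier, `score` reports mean accuracy: the fraction of rows whose predicted label matches the true label. The sketch below checks that on a small synthetic problem (the `X_demo`/`y_demo` data and the `model` name are made up for the illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical two-feature data with a simple linear decision rule.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)
predictions = model.predict(X_demo)

# Three ways to arrive at the same number.
score_from_method = model.score(X_demo, y_demo)
score_by_hand = float(np.mean(predictions == y_demo))
score_from_metric = accuracy_score(y_demo, predictions)

print(score_from_method, score_by_hand, score_from_metric)
```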


In [11]:
#Create the Random Forest Classifier
F_classifier = RandomForestClassifier(random_state=1)

F_classifier

RandomForestClassifier(random_state=1)

In [12]:
# Train a Random Forest Classifier model 
F_classifier.fit(X_train, y_train)

RandomForestClassifier(random_state=1)

In [13]:
# Print the model score
print(f"Training Data Score: {F_classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {F_classifier.score(X_test, y_test)}")

Training Data Score: 0.9975409272252029
Testing Data Score: 0.9914878250103177


## What if I scaled the data?

#### Both models score very close to each other on the test set, but the Logistic Regression came out about 0.0004 ahead (0.99185 vs. 0.99149). This is in line with my prediction, though the margin is much smaller than I expected. The Random Forest also scores higher on the training data (0.99754) than on the test data, a wider train/test gap than the Logistic Regression shows. Next, I would like to explore what happens if I scale the data.
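
The gaps can be worked out directly from the scores printed above. This is only arithmetic on the reported numbers; the variable names are chosen for the example.

```python
# Scores copied from the unscaled runs above.
lr_train, lr_test = 0.9921240885954051, 0.9918489475856377
rf_train, rf_test = 0.9975409272252029, 0.9914878250103177

# Test-set difference between the two models.
model_gap = lr_test - rf_test

# Train/test difference within each model (a rough overfitting signal).
lr_train_test_gap = lr_train - lr_test
rf_train_test_gap = rf_train - rf_test

print(f"LR minus RF on test data: {model_gap:.4f}")          # 0.0004
print(f"LR train minus test:      {lr_train_test_gap:.4f}")  # 0.0003
print(f"RF train minus test:      {rf_train_test_gap:.4f}")  # 0.0061
```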


In [14]:
#scaling the X data

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
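
As a sketch of what this scaling step computes: `StandardScaler` subtracts each column's mean and divides by its standard deviation, with both statistics learned from the training data only and then reused on the test data. The NumPy version below reproduces that idea on hypothetical data (the `train` and `test` arrays are made up; their two columns only mimic the very different ranges of columns such as `loan_size` and `debt_to_income`).

```python
import numpy as np

# Hypothetical data: column 0 is in the thousands, column 1 is a small ratio.
rng = np.random.default_rng(2)
train = np.column_stack([rng.normal(10000, 2000, 300), rng.normal(0.4, 0.08, 300)])
test = np.column_stack([rng.normal(10000, 2000, 100), rng.normal(0.4, 0.08, 100)])

# "fit": learn the per-column statistics from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same training statistics to both subsets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))  # about 0 and 1
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))    # close, not exact
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step, which is why the test columns end up only approximately standardized.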

In [15]:
# create a logistic regression model
from sklearn.linear_model import LogisticRegression

L_classifier_scaled = LogisticRegression()

L_classifier_scaled

LogisticRegression()

In [16]:
# Train a Logistic Regression model 
L_classifier_scaled.fit(X_train_scaled, y_train)

LogisticRegression()

In [17]:
#print the model score
print(f"Training Data Score: {L_classifier_scaled.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {L_classifier_scaled.score(X_test_scaled, y_test)}")

Training Data Score: 0.9942908240473243
Testing Data Score: 0.9936545604622369


In [18]:
#Create the Random Forest Classifier
F_classifier_scaled = RandomForestClassifier(random_state=1)

F_classifier_scaled

RandomForestClassifier(random_state=1)

In [19]:
# Train a Random Forest Classifier model 
F_classifier_scaled.fit(X_train_scaled, y_train)

RandomForestClassifier(random_state=1)

In [20]:
# Print the model score
print(f"Training Data Score: {F_classifier_scaled.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {F_classifier_scaled.score(X_test_scaled, y_test)}")

Training Data Score: 0.9975409272252029
Testing Data Score: 0.9914878250103177


## Results

#### After scaling the data, the Logistic Regression test score improved (from 0.99185 to 0.99365), while the Random Forest scores stayed exactly the same. That is expected for a tree-based model: it splits on feature thresholds, so shifting and rescaling a feature does not change which splits it can find. The Logistic Regression is now about 0.002 ahead on the test set. This is still in line with my prediction, and the scores are still very close. I do think the Logistic Regression performed better because the features are numeric. One column, derogatory_marks, takes only a few discrete values and behaves more like a categorical feature, which I think helps the Random Forest. Either way, since both models score above 0.99 on the test set, I think either one would be a good predictor of whether a loan will be approved.
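
The unchanged Random Forest score can be reproduced on synthetic data. The sketch below is an illustration under assumptions, not a rerun of this notebook: `X_demo`, `y_demo`, and the column scales are invented, and only the Random Forest is compared.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data with three columns on very different scales.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(600, 3)) * np.array([5000.0, 1.0, 0.05])
y_demo = (X_demo[:, 0] / 5000.0 + X_demo[:, 2] / 0.05 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Standardize using statistics from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Same model settings and seed, trained on raw and on scaled features.
rf_raw = RandomForestClassifier(random_state=1).fit(X_tr, y_tr).score(X_te, y_te)
rf_scaled = RandomForestClassifier(random_state=1).fit(X_tr_s, y_tr).score(X_te_s, y_te)

print(rf_raw, rf_scaled)
```

The two scores should agree, up to rare floating-point tie-breaking, because standardization preserves the ordering of values within each column and tree splits depend only on that ordering.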
