# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data
file_path = Path("Resources/lending_data.csv")
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression and a Random Forest Classifier. Before you create, fit, and score the models, predict which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

Between Logistic Regression and the Random Forest Classifier, I believe the Random Forest Classifier will give more accurate predictions for this dataset. Given the difficulty of making credit decisions, I expect the data to form two clusters in a specific region that can be divided by a line.

## Split the Data into Training and Testing Sets

In [3]:
# Split the data into X_train, X_test, y_train, y_test
y = df["loan_status"].values
X = df.drop("loan_status", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)
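The call above relies on the `train_test_split` defaults, which hold out 25% of the rows for testing. The sketch below shows that default on stand-in data and, as an option, the `stratify` argument, which keeps the class ratio the same in both subsets. The arrays `X_demo` and `y_demo` are hypothetical and are not taken from `lending_data.csv`; whether stratification matters for the real `loan_status` column depends on its class balance, which this notebook does not check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, 10% of them in class 1.
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.array([0] * 180 + [1] * 20)

# Default split: test_size falls back to 0.25, so 150 train rows and 50 test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=13)
print(X_tr.shape, X_te.shape)

# Stratified split: each subset keeps about the same share of class 1 as y_demo.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=13, stratify=y_demo
)
print(y_tr_s.mean(), y_te_s.mean())
```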

In [4]:
X_train.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
26017,11500.0,7.999,55900,0.463327,5,1,25900
36500,10500.0,7.59,52000,0.423077,4,1,22000
63022,7800.0,6.43,41100,0.270073,2,0,11100
5234,7900.0,6.494,41700,0.280576,2,0,11700
73437,10000.0,7.373,50000,0.4,4,0,20000


In [5]:
X_test.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
73983,8500.0,6.726,43900,0.316629,3,0,13900
30429,9100.0,7.009,46600,0.356223,3,0,16600
38175,10700.0,7.663,52700,0.43074,5,1,22700
50897,9500.0,7.179,48200,0.377593,4,0,18200
36810,9200.0,7.053,47000,0.361702,3,0,17000


## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [10]:
# Train a Logistic Regression model and print the model score
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

In [11]:
classifier.fit(X_train, y_train)

In [12]:
print(f"Training Score: {classifier.score(X_train, y_train)}")
print(f"Testing Score: {classifier.score(X_test, y_test)}")

Training Score: 0.9917457697069748
Testing Score: 0.9924164259182832
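The Logistic Regression above is fit on unscaled features, while the Random Forest later in this notebook is fit on scaled ones. One possible way to put both models on the same footing is to wrap the scaler and the classifier in a pipeline, sketched below. The data comes from `make_classification` and is only a synthetic stand-in for the lending data, so the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the real lending file).
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=13)

# fit() learns the scaling from the training rows only, then fits the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# score() applies the same scaling before computing mean accuracy.
train_score = pipe.score(X_tr, y_tr)
test_score = pipe.score(X_te, y_te)
print(f"Training Score: {train_score}")
print(f"Testing Score: {test_score}")
```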


In [13]:
# Scale the features, then train a Random Forest Classifier model and print the model score
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [15]:
X_train_scaled

array([[ 0.80886832,  0.79384697,  0.79707891, ...,  0.61617959,
         1.04220332,  0.79707891],
       [ 0.33129969,  0.33418772,  0.33137672, ...,  0.0910059 ,
         1.04220332,  0.33137672],
       [-0.95813562, -0.96949134, -0.97020118, ..., -0.95934147,
        -0.67439464, -0.97020118],
       ...,
       [ 0.80886832,  0.82531509,  0.82096107, ...,  0.61617959,
         1.04220332,  0.82096107],
       [ 0.37905655,  0.35666494,  0.35525888, ...,  0.0910059 ,
         1.04220332,  0.35525888],
       [-1.34019053, -1.31901219, -1.31649255, ..., -1.48451515,
        -0.67439464, -1.31649255]])

In [16]:
X_test_scaled

array([[-0.62383758, -0.63682841, -0.63585089, ..., -0.43416778,
        -0.67439464, -0.63585089],
       [-0.3372964 , -0.31877567, -0.31344169, ..., -0.43416778,
        -0.67439464, -0.31344169],
       [ 0.42681342,  0.41622959,  0.41496429, ...,  0.61617959,
         1.04220332,  0.41496429],
       ...,
       [ 0.09251537,  0.11278705,  0.11643725, ...,  0.0910059 ,
         1.04220332,  0.11643725],
       [ 0.28354283,  0.29934802,  0.29555348, ...,  0.0910059 ,
         1.04220332,  0.29555348],
       [-0.3372964 , -0.35698696, -0.36120602, ..., -0.43416778,
        -0.67439464, -0.36120602]])
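The scaled arrays above hold z-scores: `StandardScaler` subtracts each column's mean and divides by its standard deviation, both learned from the data passed to `fit`. The sketch below reproduces that by hand on a small sample. `X_demo` is a hypothetical four-row array whose values are copied from the `loan_size` and `interest_rate` columns of `df.head()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample: loan_size and interest_rate from the first four rows of df.head().
X_demo = np.array([
    [10700.0, 7.672],
    [8400.0, 6.692],
    [9000.0, 6.963],
    [10700.0, 7.664],
])

scaled = StandardScaler().fit_transform(X_demo)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
```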

In [17]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state = 13).fit(X_train_scaled, y_train)

In [18]:
print(f"Training Score: {clf.score(X_train_scaled, y_train)}")
print(f"Testing Score: {clf.score(X_test_scaled, y_test)}")

Training Score: 0.9972313935892144
Testing Score: 0.9913330581923235


Although the Random Forest model had a higher training score than the Logistic Regression model (0.9972 vs. 0.9917), the Logistic Regression model had the higher testing score (0.9924 vs. 0.9913). This runs counter to my prediction that the Random Forest Classifier would be more accurate. However, both models had accuracy scores greater than 99%, making them both successful.
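The scores reported above are mean accuracy, which is what `score` returns for scikit-learn classifiers. Accuracy alone can hide poor performance on a rare class, so a per-class check may be worth adding. The sketch below illustrates the effect with hypothetical labels, not the notebook's predictions. This notebook does not show the class balance of `loan_status`, so whether the concern applies here is an open question.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 95 rows of class 0 and 5 rows of class 1.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)  # recall for class 1 (the default pos_label)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc)  # 0.95
print(rec)  # 0.0, because no class 1 row is predicted correctly
print(cm)
```

With the real models, the same functions could be called on `y_test` and the output of `predict` for each classifier.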