# Credit Risk Evaluator

In [18]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [19]:
# Import the data
file_path = Path("Resources/lending_data.csv")
lending_df = pd.read_csv(file_path)
lending_df.head()

,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression model and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

#### My prediction is that the Logistic Regression model will perform better in this instance because it should produce a more accurate representation of the data we will be using.

## Split the Data into Training and Testing Sets

In [20]:
# Create the features DataFrame, X, by removing the loan_status column. 
# Create y, the labels set, by using the loan_status column.
y = lending_df['loan_status'].values #returns a numpy array
X = lending_df.drop('loan_status', axis=1)

In [21]:
# Split the data into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.5) # set the test size to 50%
X.head()

,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


In [22]:
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')

X_train shape: (38768, 7)
X_test shape: (38768, 7)
y_train shape: (38768,)
y_test shape: (38768,)


## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

#### Logistic Regression Model
Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set.
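
Under the hood, logistic regression turns a weighted sum of the features into a probability with the sigmoid function and then applies a threshold. The sketch below illustrates that idea only; the weights, bias, and the `predict_label` helper are hypothetical and are not taken from the fitted model in this notebook.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    # Linear score: w . x + b
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return probability, int(probability >= threshold)

# Hypothetical weights for two features (illustration only)
prob, label = predict_label([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(f"P(class 1) = {prob:.2f}, predicted label = {label}")
```

Fitting the model amounts to choosing the weights and bias that make these probabilities agree with the training labels as closely as possible.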

In [23]:
# Import the Logistic Regression package
from sklearn.linear_model import LogisticRegression
# Create a Logistic Regression model
classifier = LogisticRegression(max_iter=10000) # max_iter caps the number of solver iterations allowed for convergence
classifier

LogisticRegression(max_iter=10000)

In [24]:
# Fit (train) our model by using the training data
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [25]:
# print the model score
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.991565208419315
Testing Data Score: 0.9925453982666116


In [30]:
# Import confusion_matrix and build the matrix from the test-set predictions
from sklearn.metrics import confusion_matrix
y_true = y_test
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_true, y_pred)
cm

array([[37371,   184],
       [  105,  1108]], dtype=int64)

In [48]:
# Unpack the confusion matrix and compute the accuracy
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9925453982666116
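
Because the two classes are very unevenly sized, it can help to compare this accuracy with a trivial baseline that always predicts the majority class. A minimal sketch, with the counts copied by hand from the confusion matrix above (the variable names are illustrative):

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 37371, 184, 105, 1108

total = tn + fp + fn + tp
actual_class_0 = tn + fp   # test rows with loan_status == 0
actual_class_1 = fn + tp   # test rows with loan_status == 1

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(actual_class_0, actual_class_1) / total
model_accuracy = (tp + tn) / total

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
print(f"Logistic regression:     {model_accuracy:.4f}")
```

A baseline this high suggests that accuracy alone says little about how well the minority class is handled, which is why the per-class precision and recall are worth reading alongside it.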


In [49]:
print(f"True positives (TP): {tp}")
print(f"True negatives (TN): {tn}")
print(f"False positives (FP): {fp}")
print(f"False negatives (FN): {fn}")

True positives (TP): 1108
True negatives (TN): 37371
False positives (FP): 184
False negatives (FN): 105


In [50]:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     37555
           1       0.86      0.91      0.88      1213

    accuracy                           0.99     38768
   macro avg       0.93      0.95      0.94     38768
weighted avg       0.99      0.99      0.99     38768
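
The class `1` row of the report can be reproduced from the confusion-matrix counts. A minimal sketch using the standard definitions of precision, recall, and F1, with the counts copied by hand from the output above:

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 37371, 184, 105, 1108

# Precision: of the rows predicted as class 1, the share that really are class 1
precision = tp / (tp + fp)
# Recall: of the rows that really are class 1, the share the model found
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
```

Rounded to two decimals, these values agree with the class `1` row of the classification report.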



#### Random Forest Classifier
Random Forest is an ensemble classifier that trains a number of decision trees on different subsets of the given dataset and combines their predictions to improve predictive accuracy and reduce overfitting.

In [43]:
# Scale the data before fitting the Random Forest Classifier
# (optional here: tree-based models do not depend on feature scale)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [44]:
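# Import the RandomForestClassifier package
from sklearn.ensemble import RandomForestClassifier
# Note: this re-splits the data with a different seed and the default 25% test size,
# so the split differs from the one the Logistic Regression model was trained on
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Refit the scaler on the new training split and transform both sets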
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [45]:
# Create a Random Forest Classifier model
clf = RandomForestClassifier(random_state=1, n_estimators=500)

In [46]:
# Fit the Random Forest Classifier to the data
clf.fit(X_train_scaled, y_train)

RandomForestClassifier(n_estimators=500, random_state=1)

In [47]:
# print the model score
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9975409272252029
Testing Score: 0.9917457697069748


In [15]:
print(f'Logistic Regression model testing score: {classifier.score(X_test, y_test)}')
print(f'Random Forest Classifier model testing score: {clf.score(X_test_scaled, y_test)}')

Logistic Regression model testing score: 0.9919521254643004
Random Forest Classifier model testing score: 0.9917457697069748


#### Which model performed better? How does that compare to your prediction? 

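The Logistic Regression model classified the test data slightly more accurately than the Random Forest Classifier (testing scores of about 0.9920 versus 0.9917), which matches my prediction. However, there is not a significant difference in accuracy between the two models. The two models were also fit on different train/test splits (`random_state=42` with a 50% test set versus `random_state=1` with the default 25% test set), so this comparison is only approximate.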
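
In this notebook the Logistic Regression model and the Random Forest Classifier are fit on different splits, because `train_test_split` is called a second time with a different `random_state` and test size. One way to make the comparison like-for-like is to split once and reuse that split for every model. The sketch below shows the pattern on synthetic data from `make_classification`; the data, variable names, and hyperparameters are assumptions for illustration, and its scores say nothing about `lending_data.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption; not lending_data.csv)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=7, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

# One split, reused by every model; stratify keeps the class ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                    # same training rows for each model
    scores[name] = model.score(X_te, y_te)   # same held-out rows for each model
    print(f"{name} testing score: {scores[name]:.4f}")
```

Applied to this notebook, the same idea would mean dropping the second `train_test_split` call and fitting both models on the original `X_train` and `y_train`.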