# Credit Risk Evaluator

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

%matplotlib inline
from matplotlib import pyplot as plt

from sklearn.datasets import make_regression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

## Retrieve the Data

The data is located in the Challenge Files Folder:

* `lending_data.csv`

Import the data using Pandas. Display the resulting dataframe to confirm the import was successful.

In [2]:
# Import the data

lend_data_df = pd.read_csv('./Resources/lending_data.csv')

lend_data_df.head(40)

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
5,10100.0,7.438,50600,0.407115,4,1,20600,0
6,10300.0,7.49,51100,0.412916,4,1,21100,0
7,8800.0,6.857,45100,0.334812,3,0,15100,0
8,9300.0,7.096,47400,0.367089,3,0,17400,0
9,9700.0,7.248,48800,0.385246,4,0,18800,0


In [4]:
lend_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 non-null  float64
 1   interest_rate     77536 non-null  float64
 2   borrower_income   77536 non-null  int64  
 3   debt_to_income    77536 non-null  float64
 4   num_of_accounts   77536 non-null  int64  
 5   derogatory_marks  77536 non-null  int64  
 6   total_debt        77536 non-null  int64  
 7   loan_status       77536 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 4.7 MB


## Predict Model Performance

You will be creating and comparing two models on this data: a Logistic Regression and a Random Forest Classifier. Before you create, fit, and score the models, make a prediction as to which model you think will perform better. You do not need to be correct!

Write down your prediction in the designated cells in your Jupyter Notebook, and provide justification for your educated guess.

# My Prediction

Logistic Regression is a simple and computationally efficient method for binary classification. It tends to achieve a high level of accuracy when the classes are linearly separable.

Random forests can handle both regression and classification tasks; the Random Forest Classifier (RFC) is the classification variant. The RFC model works well with large datasets and with non-linear data, and it often gives more accurate predictions than simpler classification algorithms.

Therefore, I predict that the RFC model will perform better overall than the Logistic Regression model, as the data provided may not be linearly separable.
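The reasoning behind this prediction can be illustrated with a small sketch. It does not use the lending data: the two datasets below are synthetic stand-ins (an assumption for illustration), one roughly linearly separable and one clearly non-linear, and `compare_models` is a hypothetical helper name. It only shows the general tendency that a random forest gains the most over logistic regression when the class boundary is not linear.

```python
from sklearn.datasets import make_classification, make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def compare_models(X, y, seed=23):
    """Return (lr_test_accuracy, rfc_test_accuracy) for one dataset."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    rfc = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    return lr.score(X_test, y_test), rfc.score(X_test, y_test)


# Synthetic stand-ins (assumptions, not the lending data):
# a roughly linearly separable problem and a clearly non-linear one.
X_lin, y_lin = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=23,
)
X_moons, y_moons = make_moons(n_samples=1000, noise=0.2, random_state=23)

lin_scores = compare_models(X_lin, y_lin)
moons_scores = compare_models(X_moons, y_moons)

print(f"Linear data     LR: {lin_scores[0]:.3f}  RFC: {lin_scores[1]:.3f}")
print(f"Two-moons data  LR: {moons_scores[0]:.3f}  RFC: {moons_scores[1]:.3f}")
```

If the two models score about the same on a dataset, that is a hint (not proof) that the classes are close to linearly separable.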

## Split the Data into Training and Testing Sets

In [5]:
# Split the data into x_train, x_test, y_train, y_test

x = lend_data_df.drop("loan_status", axis=1)

y = lend_data_df["loan_status"].values

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=23)


## Create, Fit and Compare Models

Create a Logistic Regression model, fit it to the data, and print the model's score. Do the same for a Random Forest Classifier. You may choose any starting hyperparameters you like. 

Which model performed better? How does that compare to your prediction? Write down your results and thoughts in the designated markdown cell.

In [7]:
# Train a Logistic Regression model and print the model score

classifier_LR = LogisticRegression()

classifier_LR.fit(x_train, y_train)

print(f"LR Training Data: {classifier_LR.score(x_train, y_train)}")

print(f"LR Testing Data: {classifier_LR.score(x_test, y_test)}")



LR Training Data: 0.9918145549594167
LR Testing Data: 0.9925711927362774


In [9]:
# Train a Random Forest Classifier model and print the model score

# Standardize the features; the scaler is fit on the training split only
scaler_standard = StandardScaler().fit(x_train)

x_train_scaler = scaler_standard.transform(x_train)

x_test_scaler = scaler_standard.transform(x_test)

Classifier_model = RandomForestClassifier(random_state=23).fit(x_train_scaler, y_train)

# Predictions on the test split, kept for any further evaluation
y_pred = Classifier_model.predict(x_test_scaler)

print(f'RFC Training Data: {Classifier_model.score(x_train_scaler, y_train)}')

print(f'RFC Testing Data: {Classifier_model.score(x_test_scaler, y_test)}')



RFC Training Data: 0.9973001788416563
RFC Testing Data: 0.9922616591002889
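In the cells above, the features are standardized for the Random Forest Classifier but not for the Logistic Regression. One way to give both models identical preprocessing is to wrap each in a pipeline. The following is a minimal sketch under stated assumptions: it runs on a synthetic stand-in dataset from `make_classification` rather than `lending_data.csv`, and the names `X_demo`, `models`, and `test_scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lending data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=2000, n_features=7, random_state=23)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=23)

# Each pipeline fits the scaler on the training split only, then the model.
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RFC": make_pipeline(StandardScaler(), RandomForestClassifier(random_state=23)),
}

test_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_scores[name] = model.score(X_te, y_te)
    print(f"{name} Training Data: {model.score(X_tr, y_tr):.4f}")
    print(f"{name} Testing Data: {test_scores[name]:.4f}")
```

Tree-based models are largely insensitive to feature scaling, so the pipeline mainly matters for the Logistic Regression; using it for both simply keeps the comparison like for like.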


# Conclusion

On the training data, the RFC model scores higher than the LR model (0.9973 versus 0.9918), which on its own would support my initial prediction.

On the testing data, however, the two scores are almost identical: 0.9923 for the RFC model and 0.9926 for the LR model (both to 4 d.p.). There is no meaningful difference between the LR and RFC test scores. This could be because the classes are close to linearly separable, in which case both models would be expected to reach similar outcomes.

Judged on the testing data, the LR model is marginally better than the RFC model, so the final outcome does not support my prediction.
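When two accuracy scores are this close, the `confusion_matrix` and `classification_report` functions imported at the top of the notebook can show whether the models differ in how they treat each class. The sketch below is illustrative only: it uses a synthetic, deliberately imbalanced stand-in dataset (an assumption, not a statement about the lending data), and the names `X_demo`, `y_pred_demo`, and `cm` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=7, weights=[0.95, 0.05], random_state=23
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=23, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred_demo = model.predict(X_te)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_te, y_pred_demo)
print(cm)

# Per-class precision, recall, and F1 alongside overall accuracy.
print(classification_report(y_te, y_pred_demo))
```

The same two calls could be applied to `y_test` and the `y_pred` array already computed for the RFC model in this notebook.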