# Consider the models: Logistic Regression vs Random Forest Classifier

Make a prediction as to which model you think will perform better and provide justification for your educated guess.

1. Prediction
    - <strong><em>Random Forest Classifier</em></strong>
2. Justification
    - Because loan status is a known binary category (0 or 1) and there are several independent variables (features), I expect the random forest classifier to predict loan approval better. Each decision tree in the forest is trained on a random sample of the data and considers a random subset of features, so the ensemble can capture more complicated, non-linear feature patterns. Logistic regression is also a classifier, but it fits a single linear decision boundary, so it may miss feature interactions that the forest can model.
    
<strong><em>see below for outcome</em></strong>

In [31]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Credit Risk Evaluator: Lending Dataset

The lending data is loaded from the Resources folder into pandas. The data reflects individual borrowers' financial information: loan amount, interest rate, borrower's income, debt-to-income ratio, number of accounts, derogatory marks, total debt, and loan status. The loan status indicates whether the borrower is approved for the loan (0) or not (1).

### Lending Dataset Columns

1. loan_size	
2. interest_rate	
3. borrower_income	
4. debt_to_income	
5. num_of_accounts	
6. derogatory_marks	
7. total_debt	
8. loan_status: 0 = approved, 1 = not approved

Loan Approval Dataset (2022). Data generated by Trilogy Education Services, a 2U, Inc. brand, and is intended for educational purposes only.

In [2]:
# Import the data
data_df = pd.read_csv('/Users/tanishacooper/code/supervised_machine_learning/Resources/lending_data.csv')
data_df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [26]:
# data_df.describe()
# data_df.info()

In [33]:
# Make sure every column is numeric (check dtypes with data_df.info()); coerce any object columns to numbers
for col in data_df.columns:
    if data_df[col].dtype == 'object':
        data_df[col] = pd.to_numeric(data_df[col], errors='coerce')

# Create X by dropping the loan_status column; y is True when the loan is approved (loan_status is 0)
X = data_df.drop('loan_status', axis=1)
y = data_df['loan_status'] != 1

X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000
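The coercion loop above can be illustrated on a tiny frame. This is a minimal sketch, not the notebook's data: `demo_df` and its values are hypothetical, and the dtype check uses `pd.api.types.is_numeric_dtype` in place of the `== 'object'` comparison so it behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy frame: one column arrives as text with an unparseable entry
demo_df = pd.DataFrame({
    "loan_size": ["10700.0", "8400.0", "n/a"],
    "num_of_accounts": [5, 3, 3],
})

# Coerce every non-numeric column; values that cannot be parsed become NaN
for col in demo_df.columns:
    if not pd.api.types.is_numeric_dtype(demo_df[col]):
        demo_df[col] = pd.to_numeric(demo_df[col], errors="coerce")

print(demo_df.dtypes)
print(demo_df.isna().sum())  # count of values lost to coercion, per column
```

Checking the NaN counts after coercion shows whether any rows would need to be dropped or imputed before fitting a model.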


In [5]:
# Split the data into X_train, X_test, y_train, y_test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# X_test
# X_train
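The split-then-scale steps above can also be bundled into a scikit-learn `Pipeline`, so the scaler is always fit on the training split only. The sketch below is illustrative: it uses synthetic data from `make_classification` as a stand-in for the lending data, and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lending data (assumption: 7 numeric features)
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The pipeline fits the scaler on the training data, then the model on the scaled output
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test split before predicting
print(f"Pipeline test accuracy: {pipe.score(X_te, y_te):.3f}")
```

One possible benefit of this design is that the same preprocessing is applied to every model that is compared, which keeps the comparison consistent.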

# Logistic Regression

Logistic regression is a statistical method for predicting binary outcomes from data.

Examples of this are "yes" vs. "no" or "young" vs. "old". 

These are categories that translate to a probability of being a 0 or a 1.

We use the lending data to predict whether a borrower is likely to be approved (0) or not approved (1) for the loan, based on the borrower's financial features.
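How a score turns into a probability of being a 0 or a 1 can be sketched with the logistic (sigmoid) function. This is a minimal illustration using only the standard library: the helper names `sigmoid` and `predict_label`, the 0.5 threshold, and the example scores are assumptions, not values taken from the fitted model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(score: float, threshold: float = 0.5) -> int:
    """Return 1 when the probability reaches the threshold, else 0."""
    return int(sigmoid(score) >= threshold)

# Hypothetical linear scores (weights times features plus intercept)
for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_label(score))
```

A score of 0 maps to a probability of exactly 0.5, which is why the sign of the linear score decides the predicted class at the default threshold.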

### Create a Logistic Regression Model

In [16]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=1000)
classifier

LogisticRegression(max_iter=1000)

In [17]:
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [18]:
# Print the training and testing scores for the Logistic Regression model
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9919177328380795
Testing Data Score: 0.9924680148576145


# Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Random forest is a supervised machine learning algorithm that is widely used for classification and regression problems. It builds decision trees on different samples of the data and combines them: by majority vote for classification and by averaging for regression.

One of the most important features of the random forest algorithm is that it can handle both continuous target variables, as in regression, and categorical target variables, as in classification. It tends to perform well on classification problems.
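The way a forest combines its trees can be sketched directly. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so the sketch below reproduces that averaging. The data is synthetic, and `X_demo`, `y_demo`, `tree_probs`, and `manual_pred` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: 7 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree
tree_probs = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

# The predicted class is the one with the highest mean probability
manual_pred = forest.classes_[np.argmax(tree_probs, axis=1)]
print("Matches forest.predict:", np.array_equal(manual_pred, forest.predict(X_demo)))
```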

### Random Forest Classifier Model

In [20]:
# Train a Random Forest Classifier model and print the model score
clf = RandomForestClassifier(random_state=42, n_estimators=1000).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9971970009629936
Testing Score: 0.9918489475856377


## Fit a LogisticRegression model and RandomForestClassifier model

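Which model performed better? 
- On the testing data the two models were nearly tied: Logistic Regression scored 0.9925 and the Random Forest Classifier scored 0.9918. The Random Forest Classifier scored higher on the training data (0.9972 vs. 0.9919), so its slightly lower testing score suggests mild overfitting.

How does that compare to your prediction? 
- I predicted that the Random Forest Classifier would perform better. It did on the training data, but on the testing data Logistic Regression was marginally ahead. "A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting (scikit-learn, 2007-2022)." Random forest classifiers are attractive when high predictive performance matters more than interpretability. The decision trees within the Random Forest Classifier split the data into smaller groups based on features such as debt-to-income ratio and loan amount.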
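Because the testing scores printed above differ by less than 0.001, a single train/test split may not be enough to separate the two models. One way to check is k-fold cross-validation; the sketch below assumes synthetic data in place of the lending data, and `models` and `cv_results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 7 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=1)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Score each model on 5 folds and report the mean and spread of accuracy
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

If the gap between the mean scores is smaller than the fold-to-fold spread, the difference between the models is probably not meaningful.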