In [31]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd 

In [32]:
# Load dataset
df = pd.read_csv("Resources/lending_data.csv")
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


<b>Prediction:</b> Based on the data and the requested analysis, I predict that logistic regression will be the better model.

<b>Rationale:</b> Multiple logistic regression is designed for analyses with one nominal dependent variable and two or more measurement (numeric) variables. That matches this dataset: loan_status is a binary label and every other column is numeric.

Here the goal is to learn whether, and how, the other variables affect the nominal variable, so that the model can help decide whether a loan should be issued.
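
As a rough illustration of the rationale, logistic regression turns a weighted sum of the numeric features into a probability for the nominal outcome by passing it through the sigmoid function. The sketch below uses invented coefficients and an invented intercept; they are assumptions for illustration, not values fitted to this dataset.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    """Probability of class 1 for one row, given hypothetical model weights."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)

# Hypothetical weights for two features (interest_rate, debt_to_income).
# These numbers are made up purely to show the mechanics.
example_coefficients = [0.8, 2.0]
example_intercept = -8.0

# Feature values taken from the first row of the table above.
p = predict_probability([7.672, 0.431818], example_coefficients, example_intercept)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default decision threshold
```

With real data the coefficients come from fitting the model; their signs and sizes are what indicate how each variable affects the outcome.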

In [33]:
# Define the X (features) and y (target) sets
y = df["loan_status"].values
X = df.drop("loan_status", axis=1)
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


In [34]:
# Split data into training and testing data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
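
The test-set confusion matrix later in this notebook shows only 619 positive cases out of 19,384 rows, so the classes are heavily imbalanced. One option worth considering is a stratified split, which keeps the class ratio the same in both partitions. The sketch below is a minimal example on a toy stand-in dataset (the `X_demo` and `y_demo` names and their contents are assumptions, not the lending data), using the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the lending data: 1,000 rows, 3% positive labels.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:30] = 1

# stratify=y_demo keeps the share of positives (almost) equal in both partitions.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

train_rate = yd_train.mean()  # fraction of positives in the training split
test_rate = yd_test.mean()    # fraction of positives in the testing split
```

For the notebook's own split, the equivalent change would be passing `stratify=y` in the call above.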

In [35]:
# Create a logistic regression model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=10000)
classifier

In [36]:
# Fit (train) model by using the training data
classifier.fit(X_train, y_train)

In [37]:
# Score the model (accuracy) on the training and testing data
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9921240885954051
Testing Data Score: 0.9918489475856377


In [38]:
# Confusion matrix on the test set (rows: true class, columns: predicted class)
from sklearn.metrics import confusion_matrix
y_true = y_test
y_pred = classifier.predict(X_test)
confusion_matrix(y_true, y_pred)

array([[18663,   102],
       [   56,   563]], dtype=int64)
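
Because class 1 is rare in this data, accuracy alone can hide how well the model handles it. The short sketch below derives accuracy, precision, recall, and F1 for class 1 directly from the four counts in the test-set confusion matrix above; the variable names are illustrative.

```python
# Counts copied from the test-set confusion matrix above
# (rows: true class, columns: predicted class).
tn, fp = 18663, 102
fn, tp = 56, 563

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)     # share of true 1s that the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way reproduces the testing score reported earlier (about 0.9918), while precision (about 0.85) and recall (about 0.91) for class 1 are noticeably lower.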

In [39]:
# Confusion matrix over the full dataset (training and testing rows combined)
confusion_matrix(y, classifier.predict(X))

array([[74657,   379],
       [  237,  2263]], dtype=int64)

In [40]:
# Import the random forest classifier, a synthetic-data generator, and a feature scaler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

In [41]:
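# Create data
# NOTE: make_classification generates a synthetic dataset (100 samples by default,
# 50 features here), so from this point on X and y no longer hold the lending data.
X, y = make_classification(random_state=1, n_features=50, n_informative=5, n_redundant=0)
X = pd.DataFrame(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)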

In [44]:
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)
print(f"RandomForestClassifier Training Data Score: {clf.score(X_train_scaled, y_train)}")
print(f"RandomForestClassifier Testing Data Score: {clf.score(X_test_scaled, y_test)}")

RandomForestClassifier Training Data Score: 1.0
RandomForestClassifier Testing Data Score: 0.76
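
The random forest above was fit to data produced by `make_classification`, while the logistic regression was fit to the lending data. For a like-for-like comparison, both models would need to be trained and scored on the same split. The sketch below shows one way to do that. The helper name `compare_on_same_split` and the synthetic demonstration data are assumptions introduced here so the block runs without the CSV, and `n_estimators=100` is chosen only to keep the demo quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def compare_on_same_split(features, target, seed=1):
    """Fit both models on one train/test split and return their accuracy scores."""
    f_train, f_test, t_train, t_test = train_test_split(
        features, target, random_state=seed
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=10000),
        "random_forest": RandomForestClassifier(random_state=seed, n_estimators=100),
    }
    scores = {}
    for name, model in models.items():
        model.fit(f_train, t_train)
        scores[name] = {
            "train": model.score(f_train, t_train),
            "test": model.score(f_test, t_test),
        }
    return scores

# Demonstration on a small synthetic dataset (not the lending data).
demo_X, demo_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=1
)
demo_scores = compare_on_same_split(demo_X, demo_y)
```

To apply this to the lending data, the call would be `compare_on_same_split(df.drop("loan_status", axis=1), df["loan_status"])`, since `X` and `y` were overwritten earlier in the notebook. No results for that run are claimed here.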


<b>Conclusion:</b> With accuracy scores of about 0.99 on both the training and testing data, the logistic regression model performed as predicted and generalizes well on the lending data.

The random forest classifier scored a perfect 1.0 on its training data but only 0.76 on its testing data, a gap that points to overfitting. Note, however, that the random forest above was fit to a synthetic dataset generated with make_classification rather than to the lending data, so the two sets of scores are not a like-for-like comparison. On the evidence here, logistic regression is the stronger choice, but the random forest would need to be retrained on the lending data to confirm that.

