In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

## Preprocessing

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [3]:
# Familiarizing myself with the data to be used in this project
# train_df.head()
# test_df.head()

In [4]:
# Convert categorical data to numeric and separate the target feature
# for both the training and the testing data
X_train = pd.get_dummies(train_df.drop('target', axis=1))
X_test = pd.get_dummies(test_df.drop('target', axis=1))
y_train = train_df['target']
y_test = test_df['target']

In [5]:
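# Add dummy columns that appear in the training set but not in the testing set,
# filled with 0, and put the testing columns in the same order as the training
# columns (any column found only in the testing set is dropped)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)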
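
To see why this alignment step matters, here is a minimal sketch on two made-up frames. The names `toy_train`, `toy_test`, `amount`, and `flag` are hypothetical and are not columns from the loan files. When a category level only appears in one frame, `pd.get_dummies` produces a different set of columns for each frame, and `reindex` is one way to line them up again.

```python
import pandas as pd

# Hypothetical toy frames: the level 'Y' of 'flag' only appears in the training data
toy_train = pd.DataFrame({"amount": [100, 200, 300], "flag": ["N", "Y", "N"]})
toy_test = pd.DataFrame({"amount": [150, 250], "flag": ["N", "N"]})

toy_X_train = pd.get_dummies(toy_train)
toy_X_test = pd.get_dummies(toy_test)

# Columns the model would be trained on but that the testing frame lacks
missing = [col for col in toy_X_train.columns if col not in toy_X_test.columns]
print(missing)

# Add the missing dummies as 0 and match the training column order in one call
toy_X_test = toy_X_test.reindex(columns=toy_X_train.columns, fill_value=0)
print(list(toy_X_test.columns))
```

Matching the column order as well as the column set keeps each feature in the position the model saw during training.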

## Consider the Models

Logistic regression is well suited to binary classification, such as 'high-risk' vs 'low-risk' in this dataset. I expect the logistic regression model to perform better than the random forest classifier on this dataset. Let's code and see!

In [7]:
# Train the Logistic Regression model on the unscaled data and print the model score
model = LogisticRegression()

model.fit(X_train, y_train)

print(f"Training Score: {model.score(X_train, y_train)}")
print(f"Testing Score: {model.score(X_test, y_test)}")

Training Score: 0.652791461412151
Testing Score: 0.5082943428328371


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Thoughts
The logistic regression did not perform as well as I had hoped: it scored about 0.65 on the training data and only about 0.51 on the testing data. The warning above shows that the solver reached its iteration limit, which may be because the model did not have scaled data to work with. I am interested to see if scaling the data later will improve performance!
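
The warning can also be checked in code. The sketch below is an illustration on synthetic data from `make_classification`, not the loan data, and every `_demo` name is made up for the example. It compares the fitted model's `n_iter_` attribute with `max_iter` to see whether the solver stopped because it ran out of iterations.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: this is NOT the loan data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
# Inflate one feature to mimic columns measured in very different units
X_demo[:, 0] *= 10_000

model_demo = LogisticRegression(max_iter=100)
with warnings.catch_warnings():
    # Silence the warning here; the check below reports the same information
    warnings.simplefilter("ignore", ConvergenceWarning)
    model_demo.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually used
iterations_used = int(model_demo.n_iter_[0])
hit_limit = iterations_used >= model_demo.max_iter
print(f"Iterations used: {iterations_used} of {model_demo.max_iter}")
print(f"Stopped at the iteration limit: {hit_limit}")
```

If `hit_limit` comes back `True`, the fitted coefficients may not be the best the model can do, so the score is worth treating with some caution.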

In [9]:
# Train a Random Forest Classifier model and print the model score

clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train, y_train)

print(f'Training Score: {clf.score(X_train, y_train)}')
print(f'Testing Score: {clf.score(X_test, y_test)}')

Training Score: 1.0
Testing Score: 0.646958740961293


## Thoughts
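The random forest classifier outperformed the logistic regression model on the testing data (about 0.65 vs 0.51)! Go figure. Its perfect training score of 1.0 next to a much lower testing score does suggest overfitting, though. I'll be interested to see if this holds true after the data is scaled.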
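
A training score of 1.0 next to a testing score of about 0.65 says little about how the forest handles unseen rows. One optional way to get a less optimistic estimate without touching the testing set is the out-of-bag score. The sketch below uses synthetic data (not the loan data) and made-up `_demo` names, so treat it as an illustration rather than part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: this is NOT the loan data)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)

# oob_score=True scores each row using only the trees that did not see it
rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
rf_demo.fit(X_demo, y_demo)

train_score = rf_demo.score(X_demo, y_demo)
oob_score = rf_demo.oob_score_
print(f"Training Score: {train_score}")
print(f"Out-of-bag Score: {oob_score}")
```

The gap between the two numbers gives a rough sense of how much the training score flatters the model.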

In [10]:
# Scale the data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [11]:
# Train the Logistic Regression model on the scaled data and print the model score
model = LogisticRegression()

model.fit(X_train_scaled, y_train)

print(f"Training Score: {model.score(X_train_scaled, y_train)}")
print(f"Testing Score: {model.score(X_test_scaled, y_test)}")

Training Score: 0.710919540229885
Testing Score: 0.7598894087622289


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [12]:
# Train a Random Forest Classifier model on the scaled data and print the model score
clf = RandomForestClassifier(random_state=1, n_estimators=500).fit(X_train_scaled, y_train)

print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 1.0
Testing Score: 0.6480221182475542


## Final Reflection
Without scaling, the random forest classifier outperformed the logistic regression model. However, after applying `StandardScaler` to the data, the logistic model was the winner (testing score of about 0.76 vs 0.65)! It is interesting to see that scaling improved the logistic regression model significantly but had little to no effect on the random forest model. I wouldn't say I fully understand why this is. One likely factor is that decision trees split on feature thresholds, which rescaling does not change, while the logistic regression solver is sensitive to feature scale. Either way, this exercise has certainly reinforced the importance of testing different models when approaching a predictive problem!
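
The near-identical random forest scores before and after scaling can be explored with a small experiment. The sketch below is run on synthetic data (not the loan data) with made-up `_demo` names. It trains the same forest on raw and on standardized features and measures how often the two sets of predictions agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the loan data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
# Put one feature on a much larger scale than the others
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Fit the scaler on the training split only, then transform both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Same hyperparameters and seed; only the feature scaling differs
rf_raw = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr_scaled, y_tr)

# Fraction of test rows where both forests predict the same class
agreement = np.mean(rf_raw.predict(X_te) == rf_scaled.predict(X_te_scaled))
print(f"Prediction agreement: {agreement}")
```

Standardizing shifts and stretches each column but keeps the ordering of its values, so a tree can find equivalent split points either way. Small differences can still appear from floating-point effects.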