In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [3]:
train_df["loan_status"]

0         low_risk
1         low_risk
2         low_risk
3         low_risk
4         low_risk
           ...    
12175    high_risk
12176    high_risk
12177    high_risk
12178    high_risk
12179    high_risk
Name: loan_status, Length: 12180, dtype: object

In [4]:
# Convert categorical data to numeric and separate target feature for training data
X_train = pd.get_dummies(train_df.drop(columns=["loan_status"]))
y_train = train_df["loan_status"]

In [5]:
# Convert categorical data to numeric and separate target feature for testing data
X_test = pd.get_dummies(test_df.drop(columns=["loan_status"]))
y_test = test_df["loan_status"]

In [6]:
# add missing dummy variables to testing set
for col in X_train.columns:
    if col not in X_test.columns:
        X_test[col] = 0
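
The loop above adds any dummy column that the testing set lacks, but it appends the new column at the end, so the column order of the two frames can differ. A minimal sketch of an alternative, using `DataFrame.reindex` on small made-up frames (the `*_demo` names and values are hypothetical, not taken from the loan data):

```python
import pandas as pd

# Hypothetical toy data: the test frame has no rows with grade "C".
train_demo = pd.DataFrame({"amount": [100, 200, 300], "grade": ["A", "B", "C"]})
test_demo = pd.DataFrame({"amount": [150, 250], "grade": ["A", "B"]})

X_train_demo = pd.get_dummies(train_demo)
X_test_demo = pd.get_dummies(test_demo)

# Give the test frame exactly the training columns, in the same order.
# Columns missing from the test frame are created and filled with 0;
# columns that appear only in the test frame would be dropped.
X_test_demo = X_test_demo.reindex(columns=X_train_demo.columns, fill_value=0)

print(list(X_test_demo.columns))
```

Unlike the loop, this also fixes the column order, which matters if the model matches features by position.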


I predict that the logistic regression model will score higher than the random forest classifier because it estimates class probabilities from every feature at once, whereas each split in a random forest considers only a random subset of the features.

In [7]:
# Train the Logistic Regression model on the unscaled data and print the model score
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.6485221674876848
Testing Data Score: 0.5253083794130158


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
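
The warning above offers two remedies: raise `max_iter` or scale the data. A minimal sketch that applies both on synthetic data; the `*_demo` names, the value `max_iter=1000`, and the exaggerated feature scale are illustrative assumptions, not settings taken from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 10_000  # exaggerate one feature's scale, as raw loan amounts might

# Scale first, then fit with a larger iteration budget than the default of 100.
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_demo.fit(X_demo, y_demo)

# n_iter_ holds the number of solver iterations actually used.
n_iter_demo = pipe_demo.named_steps["logisticregression"].n_iter_
print(f"Iterations used: {n_iter_demo[0]}")
```

If the number of iterations used stays below `max_iter`, the solver stopped on its own and the warning should not appear.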


In [8]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(f'Training Score: {clf.score(X_train, y_train)}')
print(f'Testing Score: {clf.score(X_test, y_test)}')

Training Score: 1.0
Testing Score: 0.5986814121650361


The random forest classifier performed better than the logistic regression on the test data (about 0.60 versus 0.53), although both scores are still rather low. My original assumption appears to be wrong: the random forest, which considers only a random subset of features at each split, scored higher than the regression model, which uses all features. The random forest's training score of 1.0 against a test score of about 0.60 also suggests that it is overfitting the training data.

The logistic regression model should perform better on the scaled data. Scaling puts all features on a comparable scale, and because logistic regression combines every feature into a single weighted sum, it should benefit more than the random forest classifier, which splits on one feature at a time.

In [9]:
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [10]:
# Train the Logistic Regression model on the scaled data and print the model score
classifier_scaled = LogisticRegression()
classifier_scaled.fit(X_train_scaled, y_train)
print(f"Training Data Score: {classifier_scaled.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier_scaled.score(X_test_scaled, y_test)}")

Training Data Score: 0.713136288998358
Testing Data Score: 0.7201190982560612


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [11]:
# Train a Random Forest Classifier model on the scaled data and print the model score
clf_scaled = RandomForestClassifier()
clf_scaled.fit(X_train_scaled, y_train)
print(f'Training Score: {clf_scaled.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf_scaled.score(X_test_scaled, y_test)}')

Training Score: 1.0
Testing Score: 0.6522756273925989


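On the test data, both models scored higher with scaling (logistic regression: about 0.53 to 0.72; random forest: about 0.60 to 0.65), and the logistic regression model outperformed the random forest classifier, just as I expected for the scaled data. Because no random_state was set, part of the random forest's change may reflect run-to-run randomness rather than the scaling itself.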
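
Tree-based models split on feature thresholds, so standardizing the features should in principle leave a random forest's predictions essentially unchanged. One way to separate the effect of scaling from run-to-run randomness is to fix `random_state`. A minimal sketch on synthetic data; the `*_demo` names, sample sizes, and seeds are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scaler_demo = StandardScaler().fit(X_tr)

# Same seed for both forests, so the only difference is the scaling.
rf_raw = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=42).fit(
    scaler_demo.transform(X_tr), y_tr
)

pred_raw = rf_raw.predict(X_te)
pred_scaled = rf_scaled.predict(scaler_demo.transform(X_te))

# Share of test rows on which the two forests agree.
agreement = np.mean(pred_raw == pred_scaled)
print(f"Share of identical predictions: {agreement:.3f}")
```

If the two forests agree on nearly every row, a gap between scaled and unscaled random forest scores in an unseeded run is more likely due to randomness than to scaling.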