# Supervised Learning Exercise
In this notebook, I will be analyzing loan data from Lending Club. The data have been undersampled to give an even number of high-risk and low-risk loans.

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [3]:
len(train_df.columns)

86

In [4]:
set(train_df.dtypes)

{dtype('int64'), dtype('float64'), dtype('O')}

# Data Cleaning

In [5]:
# Convert categorical (object) columns in the training data to dummy variables
X_train = train_df.copy()
cat_columns = []
for col, dtype in zip(train_df.columns, train_df.dtypes):
    if dtype == 'object':
        cat_columns.append(col)
        new_dummies = pd.get_dummies(train_df[col], prefix=col)
        X_train.drop(columns=[col], inplace=True)
        X_train = pd.concat([X_train, new_dummies], axis=1)

In [6]:
# Convert categorical (object) columns in the testing data to dummy variables
test_num = test_df.copy()
test_cat = []
for col, dtype in zip(test_df.columns, test_df.dtypes):
    if dtype == 'object':
        # Record the column in the test list, not the training list
        test_cat.append(col)
        new_dummies = pd.get_dummies(test_df[col], prefix=col)
        test_num.drop(columns=[col], inplace=True)
        test_num = pd.concat([test_num, new_dummies], axis=1)

In [7]:
set(X_train.columns) - set(test_num.columns)

{'debt_settlement_flag_Y'}

In [8]:
X_train.drop(columns = ['debt_settlement_flag_Y'], inplace=True)
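
The mismatch above arises because the `debt_settlement_flag_Y` dummy is never created for the 2020 data. A more general way to keep the two encoded frames aligned is sketched below on small made-up frames; the `toy_train` and `toy_test` names and their values are illustrative assumptions, not taken from the Lending Club files. The idea is to encode every object column in one `pd.get_dummies` call and then reindex the test frame to the training columns:

```python
import pandas as pd

# Toy stand-ins for the real data; values are invented for illustration.
toy_train = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000],
    "debt_settlement_flag": ["N", "Y", "N"],
})
toy_test = pd.DataFrame({
    "loan_amnt": [1500, 2500],
    "debt_settlement_flag": ["N", "N"],  # category "Y" never occurs here
})

# get_dummies encodes the object-dtype columns and leaves numeric ones alone.
train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)

# Columns that exist only in the training frame.
missing = set(train_encoded.columns) - set(test_encoded.columns)

# Reindex the test frame to the training columns, filling absent dummies with 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
```

Unlike the cell above, this keeps the extra column in both frames instead of dropping it from the training data; either choice leaves the two frames with matching columns.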

In [9]:
X_train.columns

Index(['Unnamed: 0', 'index', 'loan_amnt', 'int_rate', 'installment',
       'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc',
       'pub_rec', 'revol_bal', 'total_acc', 'out_prncp', 'out_prncp_inv',
       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code',
       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',
       'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',
       'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
       'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',
       'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',
       'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',
       'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
       'mort_acc', 'mths_since_recent_bc'

# Data Analysis

I will be performing both logistic regression and random forest classification. My current belief is that logistic regression will generate the better model, because the data set has a very large number of dimensions and I worry that this will cause the random forest to overfit.

In [10]:
#Split off the Target Columns
target_train = X_train['loan_status_high_risk'].copy()
X_train.drop(columns=['loan_status_high_risk', 'loan_status_low_risk'], inplace=True)
target_test = test_num['loan_status_high_risk'].copy()
test_num.drop(columns=['loan_status_high_risk', 'loan_status_low_risk'], inplace=True)

In [15]:
# Train the Logistic Regression model on the unscaled data and print the model score
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, target_train)
lr.score(test_num, target_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.5859208847299021

The logistic model above is not performing very well (about 58.6% accuracy), and the solver stopped at its iteration limit. This is likely because the data have not been scaled; I will return to this later to see whether scaling yields better results.
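
One way to confirm that a solver stopped at the iteration cap, rather than relying on the warning text alone, is to compare the fitted model's `n_iter_` attribute with `max_iter`. The sketch below uses synthetic data from `make_classification` with one artificially inflated feature (an assumption made only to mimic unscaled inputs), not the Lending Club data, and all `demo` names are hypothetical:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the first feature is blown up to mimic unscaled inputs.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 10_000

demo_lr = LogisticRegression(max_iter=5)
with warnings.catch_warnings():
    # Silence the warning so the check below is the only signal.
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    demo_lr.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used; reaching max_iter means the cap was hit.
hit_cap = bool(demo_lr.n_iter_.max() >= demo_lr.max_iter)
print(f"iterations used: {demo_lr.n_iter_}, hit cap: {hit_cap}")
```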

In [17]:
# Train a Random Forest Classifier model and print the model score
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, target_train)
rfc.score(test_num, target_test)

0.6020842194810719

As of right now, our random forest model is outperforming our logistic regression model. However, the regression model has not yet converged. I was not expecting to see random forest outperform logistic regression here, but I believe that its small edge may be because it can more easily process the large number of dimensions. After scaling both data sets, I will tune their parameters to determine which might produce a better model once optimized.

# Scaled Data Analysis

In [20]:
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(test_num)

In [23]:
# Train the Logistic Regression model on the scaled data and print the model scores
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_scaled, target_train)
print(f'Training Score: {lr.score(X_train_scaled, target_train)}')
print(f'Testing Score: {lr.score(test_scaled, target_test)}')

Training Score: 0.7127257799671592
Testing Score: 0.7203317737133135


This is a result I feel very positive about. It does not show any evidence of overfitting, while still getting a reasonably strong accuracy.
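
The scale-then-fit steps can also be bundled into a single estimator so that the scaler is always fitted on the training data only. This is a minimal sketch on synthetic data (the `demo` names and the `make_classification` data are assumptions for illustration), using scikit-learn's `make_pipeline`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The pipeline fits the scaler on X_tr, then fits the classifier on the scaled values.
demo_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
demo_pipeline.fit(X_tr, y_tr)

# score() applies the already-fitted scaler before predicting.
demo_train_score = demo_pipeline.score(X_tr, y_tr)
demo_test_score = demo_pipeline.score(X_te, y_te)
print(f"Training Score: {demo_train_score:.3f}")
print(f"Testing Score: {demo_test_score:.3f}")
```

A pipeline like this can be passed directly to a grid search, in which case the scaler is refitted inside each cross-validation fold.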

In [24]:
# Train a Random Forest Classifier model on the scaled data and print the model scores
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_scaled, target_train)
print(f'Training Score: {rfc.score(X_train_scaled, target_train)}')
print(f'Testing Score: {rfc.score(test_scaled, target_test)}')

Training Score: 1.0
Testing Score: 0.6014461931093151


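Here we see that the random forest classifier has done what I expected and severely overfit the data: it scores perfectly on the training set but only about 60% on the test set. The testing score is nearly identical to the one achieved with unscaled data, which is expected because tree-based models split on feature thresholds and are largely insensitive to scaling. This suggests to me that the random forest was already overfitting to an extreme before the data were scaled.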
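
The near-identical test scores with and without scaling can be illustrated with a small experiment. The sketch below uses synthetic data and hypothetical `demo` names (assumptions for illustration, not the loan data) and fits two forests with the same seed, one on raw and one on standardized features, then measures how often their predictions agree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

demo_scaler = StandardScaler().fit(X_tr)

# Same seed for both forests, so the only difference is the feature scaling.
raw_rfc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scaled_rfc = RandomForestClassifier(random_state=0).fit(demo_scaler.transform(X_tr), y_tr)

raw_pred = raw_rfc.predict(X_te)
scaled_pred = scaled_rfc.predict(demo_scaler.transform(X_te))

# Fraction of test rows on which the two forests give the same label.
agreement = np.mean(raw_pred == scaled_pred)
print(f"Share of identical predictions: {agreement:.3f}")
```

Small differences can still appear through floating-point effects, which is consistent with the slightly different scores seen above.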

# Tuned Data Analysis
I will now perform two grid searches to find the best set of parameters for remedying the problems that these models are facing.

In [32]:
from sklearn.model_selection import GridSearchCV

lr_param_grid = {
    'solver': ['lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [100, 300, 500, 1000],
}

rfc_param_grid = {
    'n_estimators': [10, 30, 50, 100, 500],
    'max_depth': [2, 3, 4, 5, 7, 10, None],
    # 1/3 of the features is a setting often used for regression forests; included to see whether it helps here.
    'max_features': ['auto', 'sqrt', 1/3, None]
}

In [33]:
lr = LogisticRegression()
rfc = RandomForestClassifier()
grid_lr = GridSearchCV(lr, lr_param_grid, verbose=3)
grid_rfc = GridSearchCV(rfc, rfc_param_grid, verbose = 3)

In [29]:
grid_lr.fit(X_train_scaled, target_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV 1/5] END .....................max_iter=100, solver=lbfgs; total time=   0.1s
[CV 2/5] END .....................max_iter=100, solver=lbfgs; total time=   0.1s
[CV 3/5] END .....................max_iter=100, solver=lbfgs; total time=   0.1s
[CV 4/5] END .....................max_iter=100, solver=lbfgs; total time=   0.2s
[CV 5/5] END .....................max_iter=100, solver=lbfgs; total time=   0.2s
[CV 1/5] END .................max_iter=100, solver=liblinear; total time=   0.8s
[CV 2/5] END .................max_iter=100, solver=liblinear; total time=   0.7s
[CV 3/5] END .................max_iter=100, solver=liblinear; total time=   0.7s
[CV 4/5] END .................max_iter=100, solver=liblinear; total time=   0.8s
[CV 5/5] END .................max_iter=100, solver=liblinear; total time=   0.8s
[CV 1/5] END .......................max_iter=100, solver=sag; total time=   0.8s
[CV 2/5] END .......................max_iter=100, solver=sag; total time=   0.8s
[CV 3/5] END .......................max_iter=100, solver=sag; total time=   0.8s
[CV 4/5] END .......................max_iter=100, solver=sag; total time=   0.8s
[CV 5/5] END .......................max_iter=100, solver=sag; total time=   0.8s
[CV 1/5] END ......................max_iter=100, solver=saga; total time=   1.0s
[CV 2/5] END ......................max_iter=100, solver=saga; total time=   1.0s
[CV 3/5] END ......................max_iter=100, solver=saga; total time=   1.0s
[CV 4/5] END ......................max_iter=100, solver=saga; total time=   1.0s
[CV 5/5] END ......................max_iter=100, solver=saga; total time=   1.0s
[CV 1/5] END .....................max_iter=300, solver=lbfgs; total time=   0.4s
[CV 2/5] END .....................max_iter=300, solver=lbfgs; total time=   0.4s
[CV 3/5] END .....................max_iter=300, solver=lbfgs; total time=   0.4s
[CV 4/5] END .....................max_iter=300, solver=lbfgs; total time=   0.4s
[CV 5/5] END .....................max_iter=300, solver=lbfgs; total time=   0.5s
[CV 1/5] END .................max_iter=300, solver=liblinear; total time=   0.7s
[CV 2/5] END .................max_iter=300, solver=liblinear; total time=   0.7s
[CV 3/5] END .................max_iter=300, solver=liblinear; total time=   0.7s
[CV 4/5] END .................max_iter=300, solver=liblinear; total time=   0.8s
[CV 5/5] END .................max_iter=300, solver=liblinear; total time=   0.8s
[CV 1/5] END .......................max_iter=300, solver=sag; total time=   2.6s
[CV 2/5] END .......................max_iter=300, solver=sag; total time=   2.6s
[CV 3/5] END .......................max_iter=300, solver=sag; total time=   2.6s
[CV 4/5] END .......................max_iter=300, solver=sag; total time=   2.6s
[CV 5/5] END .......................max_iter=300, solver=sag; total time=   2.6s
[CV 1/5] END ......................max_iter=300, solver=saga; total time=   3.1s
[CV 2/5] END ......................max_iter=300, solver=saga; total time=   3.1s
[CV 3/5] END ......................max_iter=300, solver=saga; total time=   3.1s
[CV 4/5] END ......................max_iter=300, solver=saga; total time=   3.1s
[CV 5/5] END ......................max_iter=300, solver=saga; total time=   3.3s
[CV 1/5] END .....................max_iter=500, solver=lbfgs; total time=   0.4s
[CV 2/5] END .....................max_iter=500, solver=lbfgs; total time=   0.4s
[CV 3/5] END .....................max_iter=500, solver=lbfgs; total time=   0.4s
[CV 4/5] END .....................max_iter=500, solver=lbfgs; total time=   0.4s
[CV 5/5] END .....................max_iter=500, solver=lbfgs; total time=   0.5s
[CV 1/5] END .................max_iter=500, solver=liblinear; total time=   0.7s
[CV 2/5] END .................max_iter=500, solver=liblinear; total time=   0.8s
[CV 3/5] END .................max_iter=500, solver=liblinear; total time=   0.7s
[CV 4/5] END .................max_iter=500, solver=liblinear; total time=   0.8s
[CV 5/5] END .................max_iter=500, solver=liblinear; total time=   0.8s
[CV 1/5] END .......................max_iter=500, solver=sag; total time=   4.3s
[CV 2/5] END .......................max_iter=500, solver=sag; total time=   4.4s
[CV 3/5] END .......................max_iter=500, solver=sag; total time=   4.4s
[CV 4/5] END .......................max_iter=500, solver=sag; total time=   4.5s
[CV 5/5] END .......................max_iter=500, solver=sag; total time=   4.4s
[CV 1/5] END ......................max_iter=500, solver=saga; total time=   5.3s
[CV 2/5] END ......................max_iter=500, solver=saga; total time=   5.2s
[CV 3/5] END ......................max_iter=500, solver=saga; total time=   5.4s
[CV 4/5] END ......................max_iter=500, solver=saga; total time=   5.3s
[CV 5/5] END ......................max_iter=500, solver=saga; total time=   5.2s
[CV 1/5] END ....................max_iter=1000, solver=lbfgs; total time=   0.4s
[CV 2/5] END ....................max_iter=1000, solver=lbfgs; total time=   0.4s
[CV 3/5] END ....................max_iter=1000, solver=lbfgs; total time=   0.4s
[CV 4/5] END ....................max_iter=1000, solver=lbfgs; total time=   0.4s
[CV 5/5] END ....................max_iter=1000, solver=lbfgs; total time=   0.5s
[CV 1/5] END ................max_iter=1000, solver=liblinear; total time=   0.7s
[CV 2/5] END ................max_iter=1000, solver=liblinear; total time=   0.7s
[CV 3/5] END ................max_iter=1000, solver=liblinear; total time=   0.7s
[CV 4/5] END ................max_iter=1000, solver=liblinear; total time=   0.8s
[CV 5/5] END ................max_iter=1000, solver=liblinear; total time=   0.8s
[CV 1/5] END ......................max_iter=1000, solver=sag; total time=   8.7s
[CV 2/5] END ......................max_iter=1000, solver=sag; total time=   8.9s
[CV 3/5] END ......................max_iter=1000, solver=sag; total time=   8.8s
[CV 4/5] END ......................max_iter=1000, solver=sag; total time=   8.9s
[CV 5/5] END ......................max_iter=1000, solver=sag; total time=   8.8s
[CV 1/5] END .....................max_iter=1000, solver=saga; total time=  10.6s
[CV 2/5] END .....................max_iter=1000, solver=saga; total time=  10.6s
[CV 3/5] END .....................max_iter=1000, solver=saga; total time=   8.4s
[CV 4/5] END .....................max_iter=1000, solver=saga; total time=  10.6s
[CV 5/5] END .....................max_iter=1000, solver=saga; total time=  10.7s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


GridSearchCV(estimator=LogisticRegression(),
             param_grid={'max_iter': [100, 300, 500, 1000],
                         'solver': ['lbfgs', 'liblinear', 'sag', 'saga']},
             verbose=3)

In [30]:
grid_lr.best_params_

{'max_iter': 100, 'solver': 'lbfgs'}

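We can see that the best model found by grid search on our scaled data in fact uses scikit-learn's default parameters for logistic regression (`solver='lbfgs'`, `max_iter=100`). Note that this configuration still triggered a convergence warning when it was refit on the full training set.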

In [34]:
grid_rfc.fit(X_train_scaled, target_train)

Fitting 5 folds for each of 140 candidates, totalling 700 fits
[CV 1/5] END max_depth=2, max_features=auto, n_estimators=10; total time=   0.0s
[CV 2/5] END max_depth=2, max_features=auto, n_estimators=10; total time=   0.0s
[CV 3/5] END max_depth=2, max_features=auto, n_estimators=10; total time=   0.0s
[CV 4/5] END max_depth=2, max_features=auto, n_estimators=10; total time=   0.0s
[CV 5/5] END max_depth=2, max_features=auto, n_estimators=10; total time=   0.0s
[CV 1/5] END max_depth=2, max_features=auto, n_estimators=30; total time=   0.1s
[CV 2/5] END max_depth=2, max_features=auto, n_estimators=30; total time=   0.1s
[CV 3/5] END max_depth=2, max_features=auto, n_estimators=30; total time=   0.2s
[CV 4/5] END max_depth=2, max_features=auto, n_estimators=30; total time=   0.2s
[CV 5/5] END max_depth=2, max_features=auto, n_estimators=30; total time=   0.2s
[CV 1/5] END max_depth=2, max_features=auto, n_estimators=50; total time=   0.3s
[CV 2/5] END max_depth=2, max_features=auto, n

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [2, 3, 4, 5, 7, 10, None],
                         'max_features': ['auto', 'sqrt', 0.3333333333333333,
                                          None],
                         'n_estimators': [10, 30, 50, 100, 500]},
             verbose=3)

In [35]:
grid_rfc.best_score_

0.5446633825944172
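
`best_score_` is the mean cross-validated accuracy of the best parameter combination, computed on folds of the training data, so it measures something different from the test scores reported earlier. The sketch below, on synthetic data with a deliberately tiny grid (both assumptions for illustration, with hypothetical `demo` names), shows how the winning parameters, the cross-validated score, and a held-out score could be read off a fitted search:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the loan data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# A 2 x 2 grid keeps the search fast; the notebook's grid is much larger.
demo_grid = {"n_estimators": [10, 30], "max_depth": [2, None]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=1), demo_grid, cv=3)
demo_search.fit(X_tr, y_tr)

print(demo_search.best_params_)
print(f"Mean CV accuracy of best candidate: {demo_search.best_score_:.3f}")

# With the default refit=True, score() uses the best estimator refit on all of X_tr.
demo_holdout = demo_search.score(X_te, y_te)
print(f"Held-out accuracy: {demo_holdout:.3f}")
```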