# Supervised Machine Learning Notes

I took one look at the dataset and decided I could do much better. My comparison of the source data and final processed data revealed:

- Loss of ~75% of the available data by choosing to drop all rows with any NaN value.
- Inclusion of redundant and meaningless (all values the same) columns.
- Exclusion of columns with meaningful information related to credit risk.
- Errors in the assignment of credit risk categories, and a questionable choice of assigning "In Grace Period" (loans that are 1 to 15 days past due) to high risk.

Cleaning the dataset and finding reasonable ways to handle the NaNs was not easy. I ended up writing a JSON file with instructions for the data transformations and the methods for filling NaNs. I also re-downloaded the datasets from LendingClub, which was beneficial since they have been updated since the homework was developed. Was the effort worth it? See the addendum at the end!
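To give a feel for the JSON-driven approach, here is a minimal sketch. The instruction format (`"fill": "median"`, `"mode"`, `"constant"`), the helper `apply_fill_instructions`, and the toy data are all made up for illustration. They are not the actual schema or code in my processing notebooks.

```python
import json
import pandas as pd

# Hypothetical instruction format: column name -> how to fill its NaNs.
instructions_json = """
{
  "emp_length": {"fill": "median"},
  "home_ownership": {"fill": "mode"},
  "pub_rec_bankruptcies": {"fill": "constant", "value": 0}
}
"""

def apply_fill_instructions(df, instructions):
    """Return a copy of df with NaNs filled column by column per the instructions."""
    out = df.copy()
    for col, rule in instructions.items():
        if col not in out.columns:
            continue  # ignore rules for columns this dataset does not have
        method = rule["fill"]
        if method == "median":
            out[col] = out[col].fillna(out[col].median())
        elif method == "mode":
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        elif method == "constant":
            out[col] = out[col].fillna(rule["value"])
        else:
            raise ValueError(f"Unknown fill method: {method}")
    return out

# Toy data standing in for the raw LendingClub columns
raw = pd.DataFrame({
    "emp_length": [10.0, None, 2.0],
    "home_ownership": ["RENT", None, "RENT"],
    "pub_rec_bankruptcies": [None, 1.0, 0.0],
})
cleaned = apply_fill_instructions(raw, json.loads(instructions_json))
print(cleaned)
```

Keeping the rules in a JSON file rather than in code makes it easy to apply the same treatment to both the 2019 and 2020 Q1 downloads.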

The fresh LendingClub datasets, my processing Jupyter notebooks, and output files are under /Resources/Generator in my repo, if you'd like to take a look. I didn't include my Excel workbook (which is a bit messy) or a couple of screening Jupyter notebooks I used along the way, but I'd be happy to share them. Feel free to use any or all of my work for future classes without restriction.

In [1]:
import numpy as np
import pandas as pd
import os
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

In [2]:
data_dir = 'Resources'

df_train = pd.read_csv(os.path.join(data_dir, '2019loans.csv'))
df_train

Unnamed: 0,target,loan_amnt,term,int_rate,installment,grade,emp_length,home_ownership,annual_inc,verification_status,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,len_cr_hist_yrs
0,low_risk,6025.0,36.0,0.1640,213.02,C,10.0,MORTGAGE,40000.0,Source Verified,...,0.0,2.0,76.9,0.0,0.0,264142.0,1093.0,2600.0,0.0,13.998631
1,low_risk,13000.0,36.0,0.1774,468.29,C,1.0,MORTGAGE,120000.0,Source Verified,...,0.0,3.0,94.7,20.0,1.0,153209.0,94459.0,18900.0,133309.0,24.670773
2,low_risk,10000.0,60.0,0.0819,203.68,A,7.0,MORTGAGE,70000.0,Source Verified,...,0.0,1.0,93.3,0.0,0.0,101702.0,57977.0,17600.0,75502.0,10.165640
3,low_risk,10000.0,60.0,0.1033,214.10,B,1.0,MORTGAGE,25000.0,Not Verified,...,0.0,3.0,100.0,0.0,0.0,533288.0,139385.0,2000.0,138357.0,12.914442
4,low_risk,40000.0,60.0,0.1240,897.89,B,10.0,MORTGAGE,50000.0,Not Verified,...,0.0,1.0,100.0,25.0,0.0,152187.0,116566.0,18100.0,108887.0,15.501711
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98867,high_risk,25000.0,36.0,0.1033,810.56,B,1.0,MORTGAGE,150000.0,Source Verified,...,0.0,1.0,78.9,100.0,0.0,296082.0,80389.0,22200.0,75382.0,24.251882
98868,high_risk,31900.0,60.0,0.1308,727.14,B,6.0,MORTGAGE,145000.0,Not Verified,...,0.0,0.0,80.0,80.0,0.0,469750.0,76916.0,55000.0,0.0,31.082820
98869,high_risk,40000.0,36.0,0.0881,1268.46,A,10.0,MORTGAGE,53000.0,Source Verified,...,0.0,2.0,100.0,0.0,0.0,398250.0,11772.0,11400.0,0.0,15.668720
98870,high_risk,15000.0,36.0,0.1774,540.34,C,3.0,RENT,54080.0,Not Verified,...,0.0,1.0,90.0,100.0,0.0,48968.0,29786.0,2900.0,33968.0,6.083504


In [3]:
df_test = pd.read_csv(os.path.join(data_dir, '2020Q1loans.csv'))
df_test

Unnamed: 0,target,loan_amnt,term,int_rate,installment,grade,emp_length,home_ownership,annual_inc,verification_status,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,len_cr_hist_yrs
0,low_risk,10000.0,36.0,0.0881,317.12,A,10.0,MORTGAGE,56777.0,Verified,...,0.0,5.0,90.0,0.0,0.0,28521.0,7738.0,13500.0,4500.0,17.503080
1,low_risk,17050.0,60.0,0.2305,481.14,D,6.0,OWN,27000.0,Not Verified,...,0.0,2.0,100.0,50.0,0.0,43100.0,19383.0,29400.0,0.0,13.084189
2,low_risk,8000.0,36.0,0.0881,253.70,A,6.0,OWN,63999.0,Not Verified,...,0.0,5.0,94.4,0.0,1.0,84985.0,47085.0,7700.0,52385.0,10.836413
3,low_risk,16150.0,60.0,0.1612,393.77,C,3.0,MORTGAGE,165000.0,Verified,...,0.0,10.0,100.0,0.0,0.0,629755.0,77548.0,172850.0,65000.0,25.084189
4,low_risk,28000.0,60.0,0.1308,638.24,B,5.0,RENT,119584.0,Source Verified,...,0.0,3.0,100.0,0.0,0.0,132927.0,102569.0,17300.0,115127.0,17.670089
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12783,high_risk,8000.0,60.0,0.1862,205.86,D,8.0,RENT,38000.0,Source Verified,...,0.0,3.0,95.0,0.0,1.0,31357.0,19595.0,1500.0,9657.0,10.165640
12784,high_risk,30000.0,36.0,0.2055,1123.34,D,1.0,RENT,180000.0,Source Verified,...,0.0,1.0,100.0,33.3,0.0,218686.0,209389.0,5600.0,209986.0,14.913073
12785,high_risk,17000.0,36.0,0.1524,591.32,C,8.0,RENT,240000.0,Source Verified,...,0.0,4.0,87.5,16.7,0.0,151330.0,113872.0,9800.0,140230.0,17.333333
12786,high_risk,25000.0,36.0,0.2565,1002.62,D,2.0,RENT,60000.0,Source Verified,...,0.0,4.0,100.0,62.5,0.0,57163.0,42830.0,16000.0,41163.0,19.835729


In [4]:
# Convert categorical data to numeric and separate target feature for training data
y_train = df_train['target']
risk_map = {'low_risk': 0, 'high_risk': 1}  # avoid shadowing the built-in map()
y_train = y_train.map(risk_map)

X_train = df_train.drop('target', axis=1)
X_train = pd.get_dummies(X_train).astype('float')

In [5]:
# Convert categorical data to numeric and separate target feature for testing data
y_test = df_test['target']
risk_map = {'low_risk': 0, 'high_risk': 1}  # avoid shadowing the built-in map()
y_test = y_test.map(risk_map)

X_test = df_test.drop('target', axis=1)
X_test = pd.get_dummies(X_test).astype('float')

In [6]:
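# Identify columns present in X_train but missing from X_test and add them as zeros
col_train = list(X_train.columns)
col_test = list(X_test.columns)

# Set difference (not symmetric difference), so columns unique to X_test are not overwritten
for add_col in sorted(set(col_train) - set(col_test)):
    X_test[add_col] = 0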
    
# Re-order columns in X_test to match X_train
X_test = X_test[X_train.columns]
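
The add-then-reorder step can also be done in one call with `DataFrame.reindex`. This is a sketch on toy frames, not the notebook's data. The column names `grade_G` and `grade_X` are hypothetical dummy columns chosen to show a category missing from the test set and one missing from the training set.

```python
import pandas as pd

# Toy frames standing in for the one-hot encoded X_train and X_test
train = pd.DataFrame({"loan_amnt": [1000.0, 2000.0],
                      "grade_A": [1.0, 0.0],
                      "grade_G": [0.0, 1.0]})
test = pd.DataFrame({"grade_A": [1.0],
                     "loan_amnt": [1500.0],
                     "grade_X": [1.0]})

# Keep exactly the training columns, in training order:
# missing columns are created and filled with 0, extra columns are dropped.
aligned = test.reindex(columns=train.columns, fill_value=0)
print(aligned)
```

Either way, the goal is the same: the test matrix must have the same columns in the same order as the matrix the model was fit on.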

## Prediction: Logistic Regression vs. Random Forest for Unscaled Data

I predict the Random Forest will perform better for the unscaled data. I suspect the large numeric ranges will affect the ability of the Logistic Regression model to create a viable 0 to 1 result.

In [30]:
# Train the Logistic Regression model on the unscaled data and print the model score
classifier = LogisticRegression(solver='liblinear', max_iter=500, random_state=66).fit(X_train, y_train)
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 0.9493486528036249
Testing Data Score: 0.9071004066312167


In [15]:
# Train the Random Forest Classifier model on the unscaled data and print the model score
clf = RandomForestClassifier(random_state=66, 
                              n_estimators=50, 
                              criterion='gini', 
                              min_samples_split=50,
                              bootstrap=False).fit(X_train, y_train)
print(f'Training Score: {clf.score(X_train, y_train)}')
print(f'Testing Score: {clf.score(X_test, y_test)}')

Training Score: 0.9783963103810988
Testing Score: 0.9303253049734126


### Discussion for Unscaled Data

The Random Forest model did perform better: both its training and testing scores were higher than those for Logistic Regression. Initially the Random Forest training score was 1.0 and the testing score was similar to that of the Logistic Regression. As I tuned the parameters, the training score decreased and the testing score increased.

I believe the Random Forest model could be improved further (i.e., less over-fitting and a higher testing score) by using a robust grid search or random search. However, the models ran far too slowly for me to have time to try this.
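
For reference, a random search could look roughly like the sketch below. It runs on a small synthetic dataset from `make_classification` so it finishes quickly. The parameter ranges, `n_iter`, and `cv` are illustrative assumptions, not values I tuned on the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=66)

# Candidate values to sample from (illustrative only)
param_distributions = {
    "n_estimators": [20, 50],
    "min_samples_split": [2, 20, 50],
    "max_depth": [None, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=66),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled parameter combinations
    cv=3,              # 3-fold cross-validation on the training data only
    random_state=66,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated score: {search.best_score_:.3f}")
```

On the full loan data, sampling a handful of combinations like this (optionally with `n_jobs=-1` to use all cores) would likely be more affordable than an exhaustive `GridSearchCV`.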

## Prediction: Logistic Regression vs. Random Forest for Scaled Data

Having digested some guidance while playing around with the unscaled models, I predict the Logistic Regression will perform better for the scaled data. Picking 0 versus 1 is pretty much baked into Logistic Regression, making it a better option than Random Forest. However, it's likely I'll need to use a different solver to get a good fit.

In [7]:
# Scale the data
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=col_train)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=col_train)

In [11]:
# Train the Logistic Regression model on the scaled data and print the model score
classifier = LogisticRegression(solver='saga', 
                                max_iter=500, 
                                random_state=66, 
                                tol=0.005,
                                penalty='elasticnet',
                                l1_ratio=0.5).fit(X_train_scaled, y_train)
print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

Training Data Score: 0.9419957116271543
Testing Data Score: 0.9157804191429465


In [16]:
# Train a Random Forest Classifier model on the scaled data and print the model score
clf = RandomForestClassifier(random_state=66, 
                              n_estimators=50, 
                              criterion='gini', 
                              min_samples_split=50,
                              bootstrap=False).fit(X_train_scaled, y_train)
print(f'Training Score: {clf.score(X_train_scaled, y_train)}')
print(f'Testing Score: {clf.score(X_test_scaled, y_test)}')

Training Score: 0.9783963103810988
Testing Score: 0.9304035032843291


### Discussion for Scaled Data

The Random Forest model performed essentially the same for the scaled data as the unscaled data, while the Logistic Regression testing score (using a different solver, one that converges better on scaled features) improved slightly from 0.907 to 0.916. Is this enough to make an investor's day? I don't know.
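
The near-identical Random Forest scores are expected: tree splits depend only on the ordering of values within each feature, and standardization preserves that ordering. The sketch below checks this on synthetic data (an assumption for illustration, with one feature deliberately inflated to a large range). Tiny differences can still arise from floating-point effects.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature on a much larger scale than the others
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=66)
X_demo[:, 0] *= 10_000
X_fit, X_new = X_demo[:300], X_demo[300:]
y_fit = y_demo[:300]

# Fit the scaler on the fitting rows only, then reuse it for the held-out rows
scaler = StandardScaler().fit(X_fit)

rf_raw = RandomForestClassifier(n_estimators=20, random_state=66).fit(X_fit, y_fit)
rf_scaled = RandomForestClassifier(n_estimators=20, random_state=66).fit(
    scaler.transform(X_fit), y_fit)

# Fraction of held-out rows where both forests predict the same class
agreement = (rf_raw.predict(X_new) == rf_scaled.predict(scaler.transform(X_new))).mean()
print(f"Share of identical predictions: {agreement:.3f}")
```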

It's difficult to draw meaningful conclusions without spending significant time on the problem. Would reducing the number of columns help? Almost certainly. Grid search? Possibly. Dimensionality reduction is also likely to improve the input data set.
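
As a starting point for the dimensionality reduction idea, here is a minimal sketch that chains scaling, PCA, and Logistic Regression in a scikit-learn pipeline. It uses synthetic data with deliberately redundant columns, and the 95% explained-variance threshold is an arbitrary assumption. I have not run this on the loan data, so it says nothing about whether the scores would actually improve.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with many redundant columns
X_demo, y_demo = make_classification(n_samples=400, n_features=30, n_informative=5,
                                     n_redundant=15, random_state=66)

# Scale, keep enough principal components to explain 95% of the variance, then classify
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=500),
)
pipe.fit(X_demo, y_demo)

n_kept = pipe.named_steps["pca"].n_components_
print(f"Components kept: {n_kept} of {X_demo.shape[1]}")
print(f"Training score: {pipe.score(X_demo, y_demo):.3f}")
```

Wrapping the steps in a pipeline also means the scaler and PCA are fit on the training data only and then applied unchanged to the test data.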

#### Addendum: Did All My Data Cleaning Help?

Well, I just ran the original datasets through the exact same models. **Wow!** My work really paid off in terms of improving the fit of the models *a priori*. So much so, in fact, that the performance differences between the models using my dataset are fairly small. Here's the breakdown for the test scores:

                       My Dataset   Orig Dataset
     =============================================
     Unscaled data
         Log Regr        0.9071      0.5766
         Random For      0.9303      0.6474
     Scaled data
         Log Regr        0.9158      0.7414
         Random For      0.9304      0.6470

