# Predictions 

I predict that the Logistic regression model will perform better using scaled data by reducing any sparsity that can introduce noise. In comparison, I dont think the random forest classifier model will improve much after scaling and will produce similar scores for both sclaed and unscaled data, as these models perform well with large data sets. I expect the random forest classifier model to perform better with this large dataset, and also predict that scaling the data will lead to slightly better model scores in the random forest classifier model and substantially better scores in the logistic regression. I expect scaling the data to also reduce overfitting effects in both models.

In [46]:
# Import dependancies
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [47]:
# Read in the datasets
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [48]:
train_df.head()

Unnamed: 0,loan_amnt,int_rate,installment,home_ownership,annual_inc,verification_status,pymnt_plan,dti,delinq_2yrs,inq_last_6mths,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,hardship_flag,debt_settlement_flag,target
0,7000.0,0.1894,256.38,MORTGAGE,75000.0,Not Verified,n,28.62,0.0,2.0,...,87.5,0.0,0.0,352260.0,62666.0,35000.0,10000.0,N,N,low_risk
1,40000.0,0.1614,975.71,MORTGAGE,102000.0,Source Verified,n,11.72,2.0,0.0,...,0.0,0.0,0.0,294664.0,109911.0,9000.0,71044.0,N,N,low_risk
2,11000.0,0.2055,294.81,RENT,45000.0,Verified,n,37.25,1.0,3.0,...,7.7,0.0,0.0,92228.0,36007.0,33000.0,46328.0,N,N,low_risk
3,4000.0,0.1612,140.87,MORTGAGE,38000.0,Not Verified,n,42.89,1.0,0.0,...,100.0,0.0,0.0,284273.0,52236.0,13500.0,52017.0,N,N,low_risk
4,14000.0,0.1797,505.93,MORTGAGE,43000.0,Source Verified,n,22.16,1.0,0.0,...,25.0,0.0,0.0,120280.0,88147.0,33300.0,78680.0,N,N,low_risk


In [49]:
train_df['target'].values

array(['low_risk', 'low_risk', 'low_risk', ..., 'high_risk', 'high_risk',
       'high_risk'], dtype=object)

In [65]:
# Convert categorical data to numeric and separate target feature for training data
train_dum = train_df.drop('target', axis=1)
X = pd.get_dummies(train_dum)
y = train_df['target']
new_values = {'low_risk' : 0, 'high_risk' : 1}   
y = y.replace(new_values)

In [62]:
# Convert categorical data to numeric and separate target feature for testing data
test_dum = test_df.drop('target', axis=1)
X_test = pd.get_dummies(test_dum)
y_test = test_df['target']
new_values = {'low_risk' : 0, 'high_risk' : 1}   
y_test = y_test.replace(new_values)

In [64]:
X_test.columns

Index(['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti',
       'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal',
       'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code',
       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',
       'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',
       'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
       'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',
       'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',
       'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',
       'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
       'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_in

In [67]:
X.columns

Index(['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti',
       'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal',
       'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_amnt', 'collections_12_mths_ex_med', 'policy_code',
       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',
       'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',
       'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',
       'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',
       'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',
       'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',
       'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
       'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_in

In [53]:
# add missing dummy variables to testing set
X_test['debt_settlement_flag_Y'] = 0

In [54]:
# check added values
X_test['debt_settlement_flag_Y'].values

array([0, 0, 0, ..., 0, 0, 0])

# Logistic Regression model

In [55]:
# Train the Logistic Regression model on the unscaled data and print the model score
classifier = LogisticRegression()

# Fit our model using the training data
classifier.fit(X, y)
print(f"Training Data Score: {classifier.score(X, y)}")
print(f"Testing Data Score: {classifier.score(X, y)}")

Training Data Score: 0.6528735632183909
Testing Data Score: 0.6528735632183909


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# RandomForestClassifier 

In [56]:
# Train a Random Forest Classifier model and print the model score
model = RandomForestClassifier(random_state=1)
model.fit(X, y)
print(f"Score with training data: {model.score(X, y)}")
print(f"Score with test data: {model.score(X_test, y_test)}")

Score with training data: 1.0
Score with test data: 0.6384517226712038


# Unscaled data analysis 

For the unscaled data, the random forest model seems to not be performing as well on the test data, compared to the train data suggesting overfitting. 
The logistic regression model has performed the same on both test and train data.


# Scaling the data 

In [57]:
# Scale the data
scaler = StandardScaler().fit(X)
X_train_scaled = scaler.transform(X)
X_test_scaled = scaler.transform(X_test)

In [58]:
# Train the Logistic Regression model on the scaled data and print the model score
model = LogisticRegression()
model.fit(X_train_scaled, y)
print(f"Score with training data: {model.score(X_train_scaled, y)}")
print(f"Score with test data: {model.score(X_test_scaled, y_test)}")

Score with training data: 0.710919540229885
Score with test data: 0.7598894087622289


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [59]:
# Train a Random Forest Classifier model on the scaled data and print the model score
model = RandomForestClassifier()
model.fit(X_train_scaled, y)
print(f"Score with training data: {model.score(X_train_scaled, y)}")
print(f"Score with test data: {model.score(X_test_scaled, y_test)}")

Score with training data: 0.9999178981937603
Score with test data: 0.6495108464483199


# Conclusions

The linear regression model performed better once the data was scaled, whereas the random forest classifier model has a similar performance for both scaled and unscaled data, suggesting scaling data has little impact to this models outcome. 
The random forest model shows overfitting again, where the training data produces a better score than the test data.
Overall, both models used don't fit this type of data well, and seem to be unreliable.