In [1]:
# import pandas as pd
# import numpy as np
# # import matplotlib.pyplot as plt
# %matplotlib inline

We've talked about Random Forests. Now it's time to build one.

Here we'll use data from Lending Club (2015) to predict the status of a loan given some information about it. You can download the dataset [here](https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1).

In [2]:
# # Replace the path with the correct path for your data.
# y2015 = pd.read_csv(
#     'https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/LoanStats3d.csv',
#     skipinitialspace=True,
#     header=1
# )

# # Note the warning about dtypes.

In [3]:
# y2015.head()

## The Blind Approach

Now, as we've seen before, creating a model is the easy part. Let's try throwing everything we've got into a Random Forest without much thought. scikit-learn requires the independent variables to be numeric, and all we want is dummy variables, so let's use `get_dummies` from pandas to generate a dummy variable for every categorical column and see what happens with this naive approach.

In [4]:
# from sklearn import ensemble
# from sklearn.model_selection import cross_val_score

# rfc = ensemble.RandomForestClassifier()
# X = y2015.drop(columns='loan_status')
# Y = y2015['loan_status']
# X = pd.get_dummies(X)

# cross_val_score(rfc, X, Y, cv=5)

Did your kernel die? My kernel died.

Guess it isn't always going to be that easy...

Can you think of what went wrong?

(You're going to have to restart your kernel and reload the data, but don't run the model again or you'll crash the kernel again!)

## Data Cleaning

Well, `get_dummies` can be very memory intensive, particularly if the data are typed poorly. We got a warning about that earlier. Columns with mixed data types get stored as objects, and that can create huge problems. Our dataset is about 400,000 rows. If a column that should be numeric is typed as an object, `get_dummies` may see up to 400,000 distinct values and try to create a dummy for each of them. That's bad. Let's look at all our categorical variables and see how many distinct values each one has...
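
To see why cardinality matters, here is a small sketch on a made-up frame. The column names `grade` and `emp_title` are borrowed from this dataset, but the values are synthetic and the row count is an arbitrary assumption. Each distinct value in a string column becomes its own dummy column, so the width of the result is roughly the sum of the distinct counts.

```python
import pandas as pd

# Synthetic stand-in for the loan data: one low-cardinality column and
# one column where every row holds a different string.
n_rows = 1000
toy = pd.DataFrame(
    {
        "grade": ["A", "B", "C", "D"] * (n_rows // 4),
        "emp_title": [f"job_{i}" for i in range(n_rows)],
        "loan_amnt": range(n_rows),
    }
)

# Distinct values per categorical column.
n_unique = toy[["grade", "emp_title"]].nunique()

# Numeric columns pass through unchanged; each string column expands
# into one dummy column per distinct value.
dummies = pd.get_dummies(toy)

print(n_unique)
print(dummies.shape)
```

Scaling the same pattern up to hundreds of thousands of rows and distinct values is what exhausts memory.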

In [5]:
# categorical = y2015.select_dtypes(include=['object'])
# for i in categorical:
#     column = categorical[i]
#     print(i)
#     print(column.nunique())

Well, that right there is what's called a problem. Some of these have over a hundred thousand distinct values. Let's drop the ones with over 30 unique values, converting to numeric where it makes sense. Checking whether a numeric conversion makes sense takes a lot of exploratory code. It's a manual process that we'll skip past here, showing only the resulting conversions.

You could extract numeric features from the dates, but here we'll just drop them. There's a lot of data, so it shouldn't be a huge problem.
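
If you did want to keep the dates, one possible approach is sketched below. It assumes the date strings look like `Dec-2015`; the sample values and the `issue_year` / `issue_month` names are made up for illustration, so check the real columns before relying on the format string.

```python
import pandas as pd

# Hypothetical sample in a "Mon-YYYY" style; the real column may differ.
dates = pd.Series(["Dec-2015", "Jan-2015", "Jul-2015"], name="issue_d")

# Parse with an explicit format so malformed strings raise instead of
# being silently misread.
parsed = pd.to_datetime(dates, format="%b-%Y")

# Split the timestamp into plain numeric features.
date_features = pd.DataFrame(
    {
        "issue_year": parsed.dt.year,
        "issue_month": parsed.dt.month,
    }
)
print(date_features)
```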

In [6]:
# # Convert ID and Interest Rate to numeric.
# y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
# y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# # Drop other columns with many unique values.
# y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
#             'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

Wonder what was causing the dtype error on the id column, which _should_ have all been integers? Let's look at the end of the file.

In [7]:
# y2015.tail()

In [8]:
# # Remove two summary rows at the end that don't actually contain data.
# y2015 = y2015[:-2]

Now this should be better. Let's try again.

In [9]:
# pd.get_dummies(y2015)

It finally works! We had to sacrifice sub grade, address state, and description, but that's fine. If you want to include them, you could generate their dummies separately and then append them back to the dataframe.
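
A minimal sketch of that "encode separately, then append" idea follows. The frame here is a toy with invented values; only the `addr_state` and `loan_amnt` column names come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "loan_amnt": [1000, 2500, 4000, 1200],
        "addr_state": ["CA", "NY", "CA", "TX"],
    }
)

# Set the categorical column aside before building the main frame.
state = toy["addr_state"]
main = toy.drop(columns=["addr_state"])

# Encode it on its own, then append the result column-wise.
state_dummies = pd.get_dummies(state, prefix="addr_state")
combined = pd.concat([main, state_dummies], axis=1)

print(combined.columns.tolist())
```

Because `pd.concat` aligns on the index, this only works cleanly if no rows were dropped or reordered between the two steps.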

## Second Attempt

Now let's try this model again.

We're also going to drop columns that contain missing values, rather than impute, because our data is rich enough that we can probably get away with it.

This model may take a few minutes to run.

In [10]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

# rfc = ensemble.RandomForestClassifier()
# X = y2015.drop(columns='loan_status')
# Y = y2015['loan_status']
# X = pd.get_dummies(X)
# X = X.dropna(axis=1)

# cross_val_score(rfc, X, Y, cv=10)

The score that cross validation reports is the accuracy of the forest on each held-out fold. Here we're about 98% accurate.

That works pretty well, but there are a few potential problems. Firstly, we didn't really do much in the way of feature selection or model refinement. As such there are a lot of features in there that we don't really need. Some of them are actually quite impressively useless.

There's also some variance in the scores. The fact that one fold gave us only 93% accuracy while others gave higher than 98% is concerning. This variance could be reduced by increasing the number of estimators. That will make it take even longer to run, however, and it is already quite slow.
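
One way to check the effect of the number of estimators without waiting on the full dataset is to compare fold-to-fold spread on a small synthetic problem. The sketch below uses `make_classification` as a stand-in for the loan data, and the tree counts (5 and 50) are arbitrary choices; nothing guarantees the larger forest wins on any single run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic classification problem, not the Lending Club data.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# Record the mean and standard deviation of the fold scores per forest size.
results = {}
for n_estimators in (5, 50):
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    scores = cross_val_score(forest, X_toy, y_toy, cv=5)
    results[n_estimators] = (scores.mean(), scores.std())

for n_estimators, (mean, std) in results.items():
    print(f"{n_estimators:>3} trees: mean={mean:.3f}, std={std:.3f}")
```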

## DRILL: Third Attempt

So here's your task. Get rid of as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross validation.

You'll want to do a few things in this process. First, dive into the data that we have and see which features are most important. This can be the raw features or the generated dummies. You may want to use PCA or correlation matrices.
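
As one illustration of the correlation-matrix route, the sketch below flags features that are nearly duplicates of an earlier column. The data is synthetic (a made-up `funded_amnt_inv` that is `funded_amnt` plus a little noise), and the 0.95 cutoff is an assumption you would want to tune.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "funded_amnt": base,
        "funded_amnt_inv": base + rng.normal(scale=0.01, size=200),
        "dti": rng.normal(size=200),
    }
)

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column that is highly correlated with a column to its left.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```

Dropping the flagged columns keeps one representative of each highly correlated group.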

Can you do it without using anything related to payment amount or outstanding principal? How do you know?
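
A crude first pass at excluding those features is to filter column names by substring, as sketched here. The token list (`pymnt`, `prncp`) is an assumption based on how this dataset names its columns; name matching will miss related fields such as `installment` or `recoveries`, so treat the result as a starting point and review it by hand.

```python
# A few column names taken from this dataset, for illustration only.
columns = [
    "loan_amnt",
    "int_rate",
    "dti",
    "out_prncp",
    "out_prncp_inv",
    "total_pymnt",
    "total_rec_prncp",
    "last_pymnt_amnt",
    "revol_bal",
]

# Substrings assumed to mark payment or principal information.
leak_tokens = ("pymnt", "prncp")

leaky = [col for col in columns if any(tok in col for tok in leak_tokens)]
safe = [col for col in columns if col not in leaky]

print("excluded:", leaky)
print("kept:", safe)
```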

In [11]:
%reload_ext nb_black


In [12]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectFromModel, SelectKBest
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import confusion_matrix, classification_report

# !pip install category_encoders
from category_encoders import LeaveOneOutEncoder, TargetEncoder


In [13]:
import warnings
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor


def print_vif(x):
    """Utility for checking multicollinearity assumption
    
    :param x: input features to check using VIF. This is assumed to be a pandas.DataFrame
    :return: nothing is returned; the VIFs are printed as a pandas Series
    """
    # Silence numpy FutureWarning about .ptp
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        x = sm.add_constant(x)

    vifs = []
    for i in range(x.shape[1]):
        vif = variance_inflation_factor(x.values, i)
        vifs.append(vif)

    print("VIF results\n-------------------------------")
    print(pd.Series(vifs, index=x.columns))
    print("-------------------------------\n")


In [14]:
# Replace the path with the correct path for your data.
y2015 = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/LoanStats3d.csv",
    skipinitialspace=True,
    header=1,
)

# Note the warning about dtypes.


In [15]:
# Your code here.

# Convert ID and Interest Rate to numeric.
y2015["id"] = pd.to_numeric(y2015["id"], errors="coerce")
y2015["int_rate"] = pd.to_numeric(y2015["int_rate"].str.strip("%"), errors="coerce")

# Drop other columns with many unique values.
y2015.drop(
    [
        "url",
        "emp_title",
        "zip_code",
        "earliest_cr_line",
        "revol_util",
        "sub_grade",
        "addr_state",
        "desc",
    ],
    axis=1,
    inplace=True,
)


In [16]:
# Remove two summary rows at the end that don't actually contain data.
y2015 = y2015[:-2]


In [17]:
missingness_df = y2015.isna().mean().sort_values(ascending=False)


In [18]:
# Drop every column with more than 5% missing values.
sparse_cols = missingness_df[missingness_df > 0.05].index
y2015 = y2015.drop(columns=sparse_cols)



In [19]:
y2015.isna().mean()

id                            0.0
member_id                     0.0
loan_amnt                     0.0
funded_amnt                   0.0
funded_amnt_inv               0.0
                             ... 
tax_liens                     0.0
tot_hi_cred_lim               0.0
total_bal_ex_mort             0.0
total_bc_limit                0.0
total_il_high_credit_limit    0.0
Length: 78, dtype: float64


In [20]:
y2015 = y2015.dropna()


In [21]:
drop_cols = [
    "title",
    "last_pymnt_d",
    "acc_now_delinq",
    "policy_code",
    "purpose",
    "last_credit_pull_d",
    "delinq_amnt",
    "tax_liens",
    "grade",
    "loan_status",
    "id",
    "member_id",
    "issue_d",
    "pub_rec",
    "delinq_2yrs",
    "verification_status",
    "home_ownership",
    "num_tl_120dpd_2m",
    "num_tl_30dpd",
    "num_tl_90g_dpd_24m",
    "mort_acc",
    "chargeoff_within_12_mths",
    "collections_12_mths_ex_med",
    "initial_list_status",
    "inq_last_6mths",
    "tot_coll_amt",
    "num_sats",
    "tot_cur_bal",
    "open_acc",
    "pub_rec_bankruptcies",
    "mo_sin_rcnt_rev_tl_op",
    "total_acc",
    "bc_util",
    "num_tl_op_past_12m",
    "num_bc_sats",
    "num_actv_bc_tl",
    "percent_bc_gt_75",
    "num_actv_rev_tl",
    "num_op_rev_tl",
    "num_accts_ever_120_pd",
    "pct_tl_nvr_dlq",
    "acc_open_past_24mths",
    "num_il_tl",
    "mo_sin_rcnt_tl",
    "num_bc_tl",
    "num_rev_tl_bal_gt_0",
    "annual_inc",
    "total_rec_late_fee",
    "avg_cur_bal",
]
X = y2015.drop(columns=drop_cols)
y = y2015["loan_status"]


In [22]:
X.columns

Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'pymnt_plan', 'dti', 'revol_bal', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_amnt', 'application_type', 'total_rev_hi_lim',
       'bc_open_to_buy', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op',
       'mths_since_recent_bc', 'num_rev_accts', 'tot_hi_cred_lim',
       'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit'],
      dtype='object')


In [23]:
X = pd.get_dummies(X, drop_first=True)



In [24]:
# encoder = TargetEncoder(cols=cat_cols)
# encoder.fit(X, y)
# X = encoder.transform(X)



In [25]:
model = RandomForestClassifier(n_estimators=20, n_jobs=-1)
model.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)


In [26]:
model.score(X, y)

0.9967473938947494


In [27]:
feat_imp = pd.DataFrame({"feat": X.columns, "importance": model.feature_importances_})


In [28]:
# Scratch lists of low-importance features (commas restored where adjacent
# strings were being silently concatenated).
[
    "title",
    "last_pymnt_d",
    "num_tl",
    "policy_code",
    "purpose",
    "last_credit_pull_d",
    "delinq_amnt",
    "tax_liens",
    "grade",
    "issue_d",
    "pub_rec",
    "delinq_2yrs",
    "verification_status",
    "home_ownership",
    "num_tl_120dpd",
    "num_tl_30dpd",
    "num_tl_90g_dpd",
    "mort_acc",
    "chargeoff_within_12_mths",
    "collections_12_mths_ex_med",
    "initial_list_status_w",
    "inq_last_6mths",
    "tot_coll_amt",
    "num_sats",
    "tot_cur_bal",
]
[
    "open_acc",
    "pub_rec_bankruptcies",
    "mo_sin_rcnt_rev_tl_op",
    "total_acc",
    "bc_util",
    "num_tl_op_past_12m",
    "num_bc_sats",
    "num_actv_bc_tl",
    "percent_bc_gt_75",
    "num_actv_rev_tl",
    "num_op_rev_tl",
]
[
    "num_accts_ever_120_pd",
    "pct_tl_nvr_dlq",
    "acc_open_past_24mths",
    "num_il_tl",
    "mo_sin_rcnt_tl",
    "num_bc_tl",
    "num_rev_tl_bal_gt_0",
    "annual_inc",
    "total_rec_late_fee",
    "avg_cur_bal",
]

[
    "num_rev_accts",
]

['num_rev_accts']


In [29]:
feat_imp.sort_values(by="importance", ascending=False).head(16)

                       feat  importance
8             out_prncp_inv    0.239001
15          last_pymnt_amnt    0.199523
7                 out_prncp    0.178244
11          total_rec_prncp    0.080282
10          total_pymnt_inv    0.050711
9               total_pymnt    0.041496
14  collection_recovery_fee    0.032173
13               recoveries    0.021492
0                 loan_amnt    0.018466
12            total_rec_int    0.017636



In [30]:
# Keep only the three most important features.
keep_cols = (
    feat_imp.sort_values(by="importance", ascending=False)["feat"].head(3).tolist()
)


In [31]:
X = y2015[keep_cols]


In [36]:
y

0             Current
1             Current
2          Fully Paid
3             Current
5             Current
             ...     
421089    Charged Off
421090        Current
421092    Charged Off
421093    Charged Off
421094        Current
Name: loan_status, Length: 385537, dtype: object


In [32]:
# Define the classifier before fitting; `rfc` did not exist yet at this point.
rfc = RandomForestClassifier()
rfc.fit(X, y)


In [34]:
n_sample = 10000


In [35]:
# Use all available cores to keep the run time down.
rfc = RandomForestClassifier(n_jobs=-1)

# Sample the features, then pull the matching labels by index so rows stay aligned.
X_sample = X.sample(n=n_sample, random_state=34)
y_sample = y.loc[X_sample.index]
cross_val_score(rfc, X_sample, y_sample, cv=10)

KeyboardInterrupt: 


In [None]:
# col_filter = selector.get_support()
# X.columns[col_filter]