<H1>Drill: Build Performant Random Forest Classifier</H1><br><br>
The drill here is to predict loan status using a random forest classifier that is as "lean" as possible, meaning it uses minimal data to achieve high accuracy (consistently above 90% in cross-validation).<br><br>
More specifically, the challenge is to exclude payment amount and outstanding principal. Luckily, I'd already spent a lot of time playing with and cleaning this data. Hopefully those hours of data preparation will finally pay off!<br><br>
Because of my previous experience with this data set, I'm going to refine the challenge a little more. I'd like to distinguish paid-off loans from charged-off loans using only the information that is available to lenders on the Lending Club platform when they fund loans. This would make my model useful to lenders who want to minimize lending risk.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
%matplotlib inline

df = pd.read_csv(
    'LoanStats3d.csv',
    skipinitialspace=True,
    header=1
)

df = df[:-2] #Drop last two "summary" rows



In [2]:
#Work on a copy so the raw frame stays untouched (no filtering by loan status happens here)
loans = df.copy()

In [3]:
can_use = [] #This list will contain features that lenders can use when looking at loans to invest in.

can_use += [     
    'issue_d',
    'loan_amnt',            #An important feature missing from this data is the FICO score range,
    'sub_grade',            #which is available for lenders. Let's see how well we can do without it.
    'int_rate',
    'term',
    'installment',
    'home_ownership',
    'emp_title', 
    'emp_length',
    'zip_code', 
    'addr_state', 
    'verification_status', 
    'application_type', 
    'annual_inc',
    'annual_inc_joint',
    'dti',
    'dti_joint',
    'earliest_cr_line',
    'open_acc',
    'total_acc',
    'revol_bal',
    'revol_util',
    'inq_last_6mths',
    'acc_now_delinq',
    'delinq_amnt',
    'delinq_2yrs',
    'mths_since_last_delinq',
    'pub_rec',
    'mths_since_last_record',
    'mths_since_last_major_derog',
    'collections_12_mths_ex_med'    
]


Let the cleaning begin.

In [4]:
#Convert dollar amounts to integer cents
#(astype('int') truncates to whole dollars before scaling, so the cents themselves are dropped)
loans.loan_amnt = loans.loan_amnt.astype('int')*100
loans.installment = loans.installment.astype('int')*100
loans.annual_inc = loans.annual_inc.astype('int')*100
#Fall back to individual income for non-joint applications
loans.annual_inc_joint = np.where((pd.isnull(loans.annual_inc_joint)), loans.annual_inc, loans.annual_inc_joint)
#Note: the filled-in values are already in cents, so they get scaled by 100 a second time here
loans.annual_inc_joint = loans.annual_inc_joint.astype('int')*100
loans.revol_bal = loans.revol_bal.astype('int')*100
loans.delinq_amnt = loans.delinq_amnt.astype('int')*100

#Convert strings to numeric types
loans.int_rate = pd.to_numeric(loans.int_rate.str.slice(0, -1)) #Convert percentages (strings) to numeric values
loans.revol_util = pd.to_numeric(loans.revol_util.str.slice(0, -1))

loans.term = pd.to_numeric(loans.term.str.slice(0, 3)).astype('int') #Convert loan term to number
loans['term_type'] = np.where((loans.term == 60), 1, 0) #Bool for loan term, as only two types exist.

loans.dti_joint = np.where((pd.isnull(loans.dti_joint)), loans.dti, loans.dti_joint) 
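
Casting to int before multiplying by 100 drops the cents. Here is a minimal sketch of a rounding-first alternative. It uses a made-up `dollars` series rather than the real columns, so the numbers are purely illustrative:

```python
import pandas as pd

# Hypothetical dollar amounts (not taken from LoanStats3d.csv)
dollars = pd.Series([123.45, 1000.00, 0.99])

# Truncate, then scale (the order used in the cell above): cents are lost
truncated = dollars.astype('int') * 100

# Scale, round, then cast: cents are kept
cents = (dollars * 100).round().astype('int')

print(truncated.tolist())
print(cents.tolist())
```

Whether the lost cents matter for a tree-based model is doubtful, but the second form matches what the "in cents" comment promises.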

In [5]:
#Convert employment length strings to numeric values

def get_emp_length(emp_length):
    """Takes our emp_length data (string) and returns numeric value
    in number of years.
    """
    unique_values = {         #keys taken from loans.emp_length.unique()
        '10+ years' : 10, 
        '< 1 year' : .5,      #between 0 and 1
        '3 years' : 3, 
        '9 years' : 9, 
        '4 years' : 4, 
        '5 years' : 5,
        '1 year' : 1, 
        '6 years' : 6, 
        '2 years' : 2, 
        '7 years' : 7, 
        '8 years' : 8, 
        'n/a': 0
    }
    return unique_values[emp_length]

loans.emp_length = loans.emp_length.apply(get_emp_length)

In [6]:
#Want to convert dates into numeric format.

def convert_date(input_string):
    """changes date from format 'Mon-YYYY' to
    an integer number of months before Jan 1, 2016
    (when this data was published)
    """
    months_num = {
        'Jan' : 1,
        'Feb' : 2,
        'Mar' : 3,
        'Apr' : 4,
        'May' : 5,
        'Jun' : 6,
        'Jul' : 7,
        'Aug' : 8,
        'Sep' : 9,
        'Oct' : 10,
        'Nov' : 11,
        'Dec' : 12
    }
    mon = input_string[:3]
    year = int(input_string[-4:])
    if mon in months_num:
        num = months_num[mon]
    else:
        raise ValueError('{} not found in dictionary'.format(mon))
    months_passed = (12 - num) + ((2015 - year) * 12)
    if not str(months_passed).isnumeric():
        raise Exception('Error: return object not numeric: {}'.format(months_passed))
    return months_passed

loans.issue_d = loans.issue_d.apply(convert_date)
loans.earliest_cr_line = loans.earliest_cr_line.apply(convert_date)
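
For comparison, the same month count can be sketched with the standard library instead of a hand-written month dictionary. The helper name `months_before_2016` is my own, and the sketch assumes English month abbreviations (the default C/English locale) in the 'Mon-YYYY' format:

```python
from datetime import datetime

def months_before_2016(date_string):
    """Months between a 'Mon-YYYY' date and December 2015.

    Hypothetical helper; uses the same arithmetic as convert_date.
    """
    parsed = datetime.strptime(date_string, '%b-%Y')  # e.g. 'Aug-2003'
    return (2015 - parsed.year) * 12 + (12 - parsed.month)

print(months_before_2016('Aug-2003'))
```

`strptime` raises a `ValueError` on an unrecognized month, which covers the explicit dictionary check above.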

In [7]:
#Number of months between the first credit line and the loan's issue date
loans['fcl_before_loan'] = loans.earliest_cr_line - loans.issue_d

can_use += ['fcl_before_loan']

#Remove this feature, as its relevance is encoded into the new feature
can_use.remove('earliest_cr_line')
#I'll keep issue date in case there are any time-dependent effects that the model can pick up on.

In [8]:
#Convert home status to a numeric value

def convert_home(home_type):
    """Convert housing status to numeric value.
    Tries to rank numbers roughly by wealth-association with each
    homeownership status.
    """
    unique_values = {   #keys taken from loans.home_ownership.unique()
        'OTHER' : 0, 
        'NONE' : 0,     #These three vague categories all assigned to zero
        'ANY' : 0,
        'RENT' : 1,
        'MORTGAGE' : 2, 
        'OWN' : 3
    }
    return unique_values[home_type]

loans.home_ownership = loans.home_ownership.apply(convert_home)

In [9]:
# Convert grade to numeric value

def convert_grade(grade):
    """Converts borrower sub-grade to numeric value (A1 is best, G5 is worst)"""
    unique_grades = { #keys taken from loans.sub_grade.unique()
        'A1' : 35,
        'A2' : 34,
        'A3' : 33,
        'A4' : 32,
        'A5' : 31,
        'B1' : 30,
        'B2' : 29,
        'B3' : 28,
        'B4' : 27,
        'B5' : 26,
        'C1' : 25,
        'C2' : 24,
        'C3' : 23,
        'C4' : 22,
        'C5' : 21,
        'D1' : 20,
        'D2' : 19,
        'D3' : 18,
        'D4' : 17,
        'D5' : 16,
        'E1' : 15,
        'E2' : 14,
        'E3' : 13,
        'E4' : 12,
        'E5' : 11,
        'F1' : 10,
        'F2' : 9,
        'F3' : 8,
        'F4' : 7,
        'F5' : 6,
        'G1' : 5,
        'G2' : 4,
        'G3' : 3,
        'G4' : 2,
        'G5' : 1
    }
    return unique_grades[grade]

loans.sub_grade = loans.sub_grade.apply(convert_grade)
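
The lookup table above follows a regular pattern, so the same ranking can be computed rather than enumerated. This is a sketch only; the name `sub_grade_to_rank` is hypothetical, and it assumes every sub-grade is one letter A to G followed by one digit 1 to 5:

```python
def sub_grade_to_rank(sub_grade):
    """Map 'A1'..'G5' to 35..1, mirroring the dictionary in convert_grade."""
    letter, digit = sub_grade[0], int(sub_grade[1])
    if letter not in 'ABCDEFG' or not 1 <= digit <= 5:
        raise ValueError('unexpected sub-grade: {}'.format(sub_grade))
    # Position in the ordered list of sub-grades: 0 for A1, 34 for G5
    position = 'ABCDEFG'.index(letter) * 5 + (digit - 1)
    return 35 - position

print(sub_grade_to_rank('B3'))
```

The explicit dictionary is easier to audit at a glance; the formula is harder to mistype.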

In [10]:
def handle_nulls(series):  #Nulls presumably mean the event never happened, so treat them as "very long ago"
    """Takes pandas series with null values and replaces them with a value significantly higher than
    the maximum value.
    """
    new_val = series.max() + 5 * series.std()  #Set values way higher than the maximum
    nulls = pd.isnull(series)
    return np.where(nulls, new_val, series)

loans.mths_since_last_delinq = handle_nulls(loans.mths_since_last_delinq)
loans.mths_since_last_record = handle_nulls(loans.mths_since_last_record)
loans.mths_since_last_major_derog = handle_nulls(loans.mths_since_last_major_derog)
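
To make the fill rule concrete, here is a toy example with invented numbers. It assumes, as the cell above seems to, that a missing "months since" value means the event is not on record at all:

```python
import numpy as np
import pandas as pd

# Toy "months since last event" column; NaN stands for "no event on record"
months_since = pd.Series([10.0, 20.0, 30.0, np.nan])

# Same rule as handle_nulls: maximum plus five (sample) standard deviations
sentinel = months_since.max() + 5 * months_since.std()
filled = np.where(months_since.isnull(), sentinel, months_since)

print(sentinel)
print(filled)
```

Because decision trees split on thresholds, a single large sentinel lets the forest separate "never happened" from every real value with one split.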

In [11]:
loans.emp_title = loans.emp_title.fillna('other')

#Map each title to its rank by frequency (0 = most common)
titles = {title: i for i, title in enumerate(loans.emp_title.value_counts().index)}
loans.emp_title = loans.emp_title.map(titles)

In [12]:
loans.zip_code = pd.to_numeric(loans.zip_code.str[:3]) #Maybe this can help pick up subtle geo-trends

In [13]:
loans.revol_util = np.where((pd.isnull(loans.revol_util)), 0, loans.revol_util)

In [14]:
loans.loan_status.value_counts()

Current               287414
Fully Paid             87989
Charged Off            29178
Late (31-120 days)      9510
In Grace Period         4320
Late (16-30 days)       1888
Default                  796
Name: loan_status, dtype: int64

In [15]:
#Map each state to its rank by frequency (0 = most common)
states = {state: i for i, state in enumerate(loans.addr_state.value_counts().index)}
loans.addr_state = loans.addr_state.map(states)

In [16]:
def set_verification_status(status):
    """Enumerates verification statuses.
    """
    statuses = {
        'Not Verified' : 0,
        'Verified' : 1,
        'Source Verified' : 2
    }
    return statuses[status]

loans.verification_status = loans.verification_status.apply(set_verification_status)

In [17]:
loans.application_type = np.where((loans.application_type == 'INDIVIDUAL'), 0, 1)

In [18]:
#Very rough estimate of cash flow, using income and dti
loans['cash_flow'] = loans.annual_inc_joint/12 - (loans.dti_joint * loans.annual_inc_joint * .004)
can_use += ['cash_flow']

In [19]:
loans.cash_flow.describe()

count    4.210950e+05
mean     9.795257e+06
std      5.675659e+07
min     -5.961667e+08
25%     -9.882667e+06
50%      5.221333e+06
75%      2.281031e+07
max      7.871067e+09
Name: cash_flow, dtype: float64
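
The negative 25th percentile follows directly from the formula: the estimate changes sign at a fixed DTI, whatever the income. A small worked sketch, where `rough_cash_flow` is a hypothetical single-borrower version of the feature and the inputs are invented:

```python
def rough_cash_flow(annual_income, dti):
    """Same formula as the cash_flow feature, for one borrower."""
    return annual_income / 12 - dti * annual_income * 0.004

# DTI at which the estimate crosses zero, independent of income
break_even_dti = 1 / (12 * 0.004)

print(rough_cash_flow(60000, 10))   # DTI below the break-even point
print(rough_cash_flow(60000, 25))   # DTI above the break-even point
print(break_even_dti)
```

So any borrower with a DTI above roughly 20.8 gets a negative value, which is worth remembering when reading this feature as "cash flow".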

In [20]:
#I want to try and classify "good" from "bad" loans, so I'm going to set the target variable into two groups:
#On time payers, and late payers of all types. 
#This makes this model into a useful tool for lenders.
good_statuses = ['Current', 'Fully Paid']
loans['is_good'] = np.where((loans.loan_status.isin(good_statuses)), 1, 0)
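
Since `is_good` lumps the two largest statuses together, it is worth checking how imbalanced the target is before modelling. The sketch below copies the counts from the `loan_status.value_counts()` output shown earlier and computes what always predicting "good" would score:

```python
# Counts copied from the loan_status.value_counts() output above
status_counts = {
    'Current': 287414,
    'Fully Paid': 87989,
    'Charged Off': 29178,
    'Late (31-120 days)': 9510,
    'In Grace Period': 4320,
    'Late (16-30 days)': 1888,
    'Default': 796,
}
good_statuses = ['Current', 'Fully Paid']

n_good = sum(status_counts[s] for s in good_statuses)
n_total = sum(status_counts.values())

# Accuracy of a "model" that always predicts is_good = 1
baseline_accuracy = n_good / n_total
print(n_good, n_total, round(baseline_accuracy, 4))
```

Any accuracy reported for the classifier should be read against this figure of roughly 0.8915.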

<H2>Random Forest Classifier</H2><br><br>
Now I'm finally ready to put my classifier together!

In [21]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

In [65]:
rfc = ensemble.RandomForestClassifier(
    criterion='entropy',                #These conditions seem to do best.
    max_depth=11
)
sample_ = loans.sample(frac=1)  #Shuffle rows (cross_val_score's default folds follow row order, without shuffling)

X = np.nan_to_num(sample_[can_use]) #Deals with some inf values that appeared while making features

Y = sample_['is_good']

In [66]:
#rfc.fit(X, Y)

scores = pd.Series(cross_val_score(rfc, X, Y, cv=10))

#Note the extremely consistent performance!
print('mean score is : {} +/- {}, min: {}'.format(scores.mean(), scores.std(), scores.min()))

mean score is : 0.8914995429597093 +/- 3.398675636643912e-05, min: 0.8914721318482984


In [44]:
scores

0    0.891477
1    0.891477
2    0.891498
3    0.891496
4    0.891520
5    0.891496
6    0.891496
7    0.891496
8    0.891496
9    0.891472
dtype: float64
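
Fold scores this tightly clustered invite a sanity check against a trivial model. The sketch below shows one way to set that up with scikit-learn's `DummyClassifier` and a second metric. It runs on synthetic data (`X_demo`, `y_demo` are invented, with features that carry no signal), so it illustrates the procedure only and says nothing about the loan data itself:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and Y: about 89% positives, uninformative features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = (rng.uniform(size=2000) < 0.89).astype(int)

# Shuffled, stratified folds instead of shuffling the frame by hand
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'majority': DummyClassifier(strategy='most_frequent'),
    'forest': RandomForestClassifier(
        n_estimators=20, criterion='entropy', max_depth=11, random_state=0
    ),
}

results = {}
for name, model in models.items():
    results[name] = {
        metric: cross_val_score(model, X_demo, y_demo, cv=cv, scoring=metric).mean()
        for metric in ('accuracy', 'balanced_accuracy')
    }

for name, metrics in results.items():
    print(name, metrics)
```

If the forest's accuracy sits next to the majority model's and its balanced accuracy stays near 0.5, the forest is mostly predicting the dominant class.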

I'll take it!<br>

<H2>Conclusion</H2><br>

Although I've come just short of the goal of getting above 90% accuracy, I'm very happy to have gotten this close and to have reduced the variance in score so drastically: all ten folds land between 0.89147 and 0.89152. One caveat is worth keeping in mind. By the status counts above, about 89.1% of these loans (375,403 of 421,095) are labeled "good", so always predicting "good" would score almost exactly the same. Further tweaking should be judged against that baseline.<br><br>
More important to me is that this model uses only information investors have when they fund a loan. It would be more helpful to train on more recent loans (data up to July 2017 is available), and if the model can be shown to clearly beat the baseline, lenders could find it useful in minimizing risk.
