This notebook does a quick and dirty cleaning of the data, just enough to make it usable for our baseline ML models. Further refinements (feature engineering, categorical encoding, etc.) will be done later. Different models will require different cleaning, so this notebook mostly just establishes the train/val/test sets.

In [1]:
import os, warnings, gc, requests, json, re
from requests.auth import HTTPDigestAuth
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)

The cleaning function does some very simple cleaning like converting interest rates from strings to floats.

In [2]:
def cleaning(df):
            
    # drop non-loan rows (rows with no loan amount)
    index = df[df['loan_amnt'].isnull()].index
    df.drop(index=index, inplace=True)
    
    # convert % rate from str to float
    df['int_rate'] = df['int_rate'].str[:-1].astype(float)
    df['revol_util'] = df['revol_util'].str[:-1].astype(float)
    
    # convert term from str to int
    df['term'] = df['term'].str.strip().str[:2].astype(int)
    
    # drop weird old rows (loans flagged as not meeting the credit policy)
    index = (df[(df['loan_status'] == 'Does not meet the credit policy. Status:Charged Off')
            |(df['loan_status'] == 'Does not meet the credit policy. Status:Fully Paid')]).index
    df.drop(index=index, inplace=True)
    
    # convert dates to useable formats
    df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format='%b-%Y')
    df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')
    
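    # convert earliest_cr_line from a date to whole months of credit history at loan issue
    # both dates are parsed to the first of a month, so a year/month difference is exact
    # (dividing by np.timedelta64(1, 'M') is not supported in recent pandas versions)
    df['earliest_cr_line'] = ((df['issue_d'].dt.year - df['earliest_cr_line'].dt.year) * 12
                              + (df['issue_d'].dt.month - df['earliest_cr_line'].dt.month)).astype(int)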
    
    return df 
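
As a quick illustration of the string conversions above, here is a minimal sketch on a made-up two-row frame. The sample values are assumptions chosen to mimic the raw format (a trailing `%` on rates, a leading space on `term`); they are not rows from the Lending Club files.

```python
import pandas as pd

# toy rows that mimic the raw string formats (made-up values)
raw = pd.DataFrame({
    'int_rate': ['12.98%', '23.40%'],
    'term': [' 36 months', ' 60 months'],
})

# drop the trailing '%' and cast to float
rates = raw['int_rate'].str[:-1].astype(float)

# strip whitespace, keep the first two characters, cast to int
terms = raw['term'].str.strip().str[:2].astype(int)

print(rates.tolist(), terms.tolist())
```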

The historical data is split across several csv files. The get_data function cleans each one, then combines them all into one dataframe.

In [3]:
def get_data(approved_files, data_path):
    # clean each file separately, then concatenate once at the end
    frames = []
    for file in approved_files:
        print('reading in {}'.format(file))
        # header=1: column names are on the second line of each file
        temp_df = pd.read_csv(data_path/file, header=1)
        frames.append(cleaning(temp_df))
    return pd.concat(frames, ignore_index=True)
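
`header=1` tells pandas to take the column names from the second line of a file and to discard the line before it. Below is a small sketch with an in-memory CSV. The notice line and the values are invented for illustration; the only assumption about the real files is that they carry one extra line above the header.

```python
import io
import pandas as pd

# made-up file contents: one notice line, then the real header, then data
csv_text = (
    "Some notice line above the header\n"
    "id,loan_amnt,term\n"
    "1,9600, 36 months\n"
    "2,4000, 60 months\n"
)

# header=1 uses the second line (index 1) as column names and skips line 0
toy = pd.read_csv(io.StringIO(csv_text), header=1)
print(toy.columns.tolist())
```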

My data folder also contains info on rejected loans. Because of this, I build a list containing only the csv files for approved and issued loans.

In [4]:
data_path = Path('data') # replace with the name of your data folder if different
files = os.listdir(data_path)
# only files with approved loans start with 'L'
approved_files = [f for f in files if f[0]=='L']

Clean and combine all historical loan files into one.

In [5]:
df = get_data(approved_files, data_path)

reading in LoanStats_securev1_2018Q4.csv
reading in LoanStats3b_securev1.csv
reading in LoanStats3c_securev1.csv
reading in LoanStats3d_securev1.csv
reading in LoanStats_securev1_2018Q2.csv
reading in LoanStats_securev1_2018Q3.csv
reading in LoanStats_securev1_2018Q1.csv
reading in LoanStats_securev1_2019Q1.csv
reading in LoanStats_securev1_2017Q1.csv
reading in LoanStats_securev1_2017Q2.csv
reading in LoanStats_securev1_2017Q3.csv
reading in LoanStats_securev1_2017Q4.csv
reading in LoanStats_securev1_2016Q2.csv
reading in LoanStats3a_securev1.csv
reading in LoanStats_securev1_2016Q3.csv
reading in LoanStats_securev1_2016Q1.csv
reading in LoanStats_securev1_2016Q4.csv


In [6]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,...,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,145647242,,9600.0,9600.0,9600.0,36,12.98,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,2018-12-01,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,home_improvement,Home improvement,401xx,KY,0.84,...,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,145248657,,4000.0,4000.0,4000.0,36,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,2018-12-01,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,070xx,NJ,26.33,...,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,145640422,,2500.0,2500.0,2500.0,36,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,2018-12-01,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,109xx,NY,18.24,...,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,145631930,,30000.0,30000.0,30000.0,60,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,2018-12-01,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,713xx,LA,26.52,...,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,145638579,,5000.0,5000.0,5000.0,36,17.97,180.69,D,D1,Administrative,6 years,MORTGAGE,59280.0,Source Verified,2018-12-01,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,490xx,MI,10.51,...,,,,N,,,,,,,,,,,,,,,N,,,,,,


In [7]:
df.shape

(2373594, 150)

In [8]:
# to view all column names
#list(df.columns)

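The dataframe currently holds information that is not available at the time of investment (payment history, hardship and settlement details, etc.), so those columns cannot be used as model features.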

Enter your LC account info to preview the currently listed loans. From the currently available loans create a list of all available features. These are the features we know we can use in our ML models. Then try to match these features to the ones available in the Lending Club historical data.

In [9]:
api_key =  {'Authorization': ''} # put your api key here
investor_id = '' # put your account id here

# get loan listings data
loans = 'https://api.lendingclub.com/api/investor/v1/loans/listing'
res = requests.get(loans, headers=api_key)
data = json.loads(res.text)

# grabs the available features
avail_cols = list(data['loans'][0].keys())
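
The cell above takes the feature names from the first listed loan only. A slightly more defensive sketch is shown below: it collects keys across every loan in the response and fails clearly when nothing is listed. The helper name `listing_columns` and the sample payload are hypothetical; the only thing assumed about the real response is the top-level `loans` list used above.

```python
def listing_columns(payload):
    """Return feature names found in a parsed listing response, in first-seen order."""
    loans = payload.get('loans') or []
    if not loans:
        raise ValueError('no loans in listing response')
    cols = []
    seen = set()
    for loan in loans:
        for key in loan:
            if key not in seen:
                seen.add(key)
                cols.append(key)
    return cols

# made-up payload shaped like the parsed response
sample = {'loans': [{'id': 1, 'loanAmount': 5000.0},
                    {'id': 2, 'loanAmount': 7500.0, 'term': 36}]}
print(listing_columns(sample))
```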

In [10]:
# to view columns from the listings api
#avail_cols

Feature names from the api differ from those in the historical data. Most of the differences are slight formatting changes (camelCase vs. snake_case), so regex substitutions are used to match those column names. A dictionary maps the remaining api feature names to their historical counterparts. Some of the features in the dictionary could also be matched using regex.

In [11]:
to_map = {'secAppCollections12MthsExMed': 'sec_app_collections_12_mths_ex_med',
          'secAppInqLast6Mths': 'sec_app_inq_last_6mths',
          'numAcctsEver120Ppd': 'num_accts_ever_120_pd',
          'inqLast6Mths': 'inq_last_6mths',
          'numTl120dpd2m': 'num_tl_120dpd_2m',
          'numTl30dpd': 'num_tl_30dpd',
          'numTl90gDpd24m': 'num_tl_90g_dpd_24m',
          'numTlOpPast12m': 'num_tl_op_past_12m',
          'collections12MthsExMed': 'collections_12_mths_ex_med',
          'isIncV': 'verification_status',
          'isIncVJoint': 'verification_status_joint',
          'openIl12m': 'open_il_12m',
          'openIl24m': 'open_il_24m',
          'openRv12m': 'open_rv_12m',
          'openRv24m': 'open_rv_24m',
          'secAppChargeoffWithin12Mths': 'sec_app_chargeoff_within_12_mths',
          'addrZip': 'zip_code',
          'accOpenPast24Mths': 'acc_open_past_24mths',
          'chargeoffWithin12Mths': 'chargeoff_within_12_mths',
          'inqLast12m': 'inq_last_12m',
          'delinq2Yrs': 'delinq_2yrs',
          'percentBcGt75': 'percent_bc_gt_75',
          'loanAmount': 'loan_amnt',
          'iLUtil': 'il_util',          
         }

# cols with info regarding loan performance from LC dataset
# these get stored with the data used for modeling so we can see returns on investment later
cols_of_interest = ['issue_d','loan_status','total_pymnt', 'total_rec_int',
                    'total_rec_late_fee','total_rec_prncp', 'recoveries',
                    'collection_recovery_fee', 'last_pymnt_d']
                    

# cols dropped from the listed loans features (these features are not in historical data)
to_drop = ['reviewStatus', 'housingPayment', 'creditPullD', 'ilsExpD', 'mtgPayment', 'expD', 'acceptD',
          'investorCount','serviceFeeRate', 'disbursementMethod', 'listD', 'expDefaultRate',
          'reviewStatusD','fundedAmount']

The next block takes the feature names returned by the Lending Club api call above and converts them to the column names used in the historical data. The result is the list of available features to use for our models.

In [12]:
# performs feature matching between features from api call and features from dataset
# some features were easy to match with regex but for others it was quicker to write the mapping manually
# this can probably be cleaned up further
api_cols = []
for col in avail_cols:
    # manually mapped names take priority
    if col in to_map:
        api_cols.append(to_map[col])
        continue
    # skip listing-only features that have no historical counterpart
    if col in to_drop:
        continue
    # camelCase -> snake_case: underscore before each capital letter, then lowercase
    new_col = re.sub(r'([A-Z])', r'_\1', col).lower()
    # underscore before each run of digits ([0-9]+ keeps multi-digit numbers intact)
    new_col = re.sub(r'([0-9]+)', r'_\1', new_col)
    api_cols.append(new_col)
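
The regex step can be isolated into a small helper and checked on a few names. This is only a sketch: the helper name `camel_to_snake` is hypothetical, and it places the quantifier inside the capture group (`([0-9]+)`) so that a run of digits such as `12` is kept whole and gets a single underscore.

```python
import re

def camel_to_snake(name):
    """Convert an api-style camelCase name to a snake_case column name."""
    # underscore before each capital letter, then lowercase everything
    snake = re.sub(r'([A-Z])', r'_\1', name).lower()
    # underscore before each run of digits
    return re.sub(r'([0-9]+)', r'_\1', snake)

for name in ['memberId', 'ficoRangeLow', 'openAcc6m', 'openIl12m']:
    print(name, '->', camel_to_snake(name))
```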

In [13]:
api_cols[:10]

['id',
 'member_id',
 'loan_amnt',
 'term',
 'int_rate',
 'installment',
 'grade',
 'sub_grade',
 'emp_length',
 'home_ownership']

In [14]:
len(api_cols)

105

api_cols is the list of usable features from the api call, expressed as the feature names used in the historical data. Now reassign the dataframe to one holding only the features we want.

In [15]:
df = df[api_cols+cols_of_interest]
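
Selecting with a list raises a `KeyError` if any converted name has no match in the historical data. One way to see which names failed to match before subsetting is sketched below on a toy frame; the column names and values are made up.

```python
import pandas as pd

# toy frame and wanted-column list (made-up names and values)
toy = pd.DataFrame({'id': [1, 2], 'loan_amnt': [9600.0, 4000.0], 'grade': ['B', 'E']})
wanted = ['id', 'loan_amnt', 'sub_grade']

# columns requested but absent from the frame
missing = [c for c in wanted if c not in toy.columns]
print(missing)

# select only what exists so the lookup cannot raise a KeyError
present = [c for c in wanted if c in toy.columns]
subset = toy[present]
```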

In [16]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_inc,verification_status,desc,purpose,zip_code,addr_state,initial_list_status,emp_title,acc_now_delinq,acc_open_past_24mths,bc_open_to_buy,percent_bc_gt_75,bc_util,dti,delinq_2yrs,...,total_cu_tl,inq_last_12m,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,revol_bal_joint,open_act_il,sec_app_open_act_il,issue_d,loan_status,total_pymnt,total_rec_int,total_rec_late_fee,total_rec_prncp,recoveries,collection_recovery_fee,last_pymnt_d
0,145647242,,9600.0,36,12.98,323.37,B,B5,,MORTGAGE,35704.0,Not Verified,,home_improvement,401xx,KY,w,,0.0,3.0,3452.0,0.0,17.8,0.84,0.0,...,0.0,1.0,,,,,,,,,,,,,0.0,,2018-12-01,Current,1317.72,425.24,0.0,892.48,0.0,0.0,May-2019
1,145248657,,4000.0,36,23.4,155.68,E,E1,3 years,RENT,90000.0,Source Verified,,debt_consolidation,070xx,NJ,w,Security,0.0,15.0,20174.0,0.0,7.9,26.33,0.0,...,0.0,4.0,,,,,,,,,,,,,4.0,,2018-12-01,Current,770.6,366.75,0.0,403.85,0.0,0.0,May-2019
2,145640422,,2500.0,36,13.56,84.92,C,C1,10+ years,RENT,55000.0,Not Verified,,debt_consolidation,109xx,NY,w,Chef,0.0,9.0,34360.0,0.0,5.9,18.24,0.0,...,11.0,2.0,,,,,,,,,,,,,2.0,,2018-12-01,Current,421.78,131.95,0.0,289.83,0.0,0.0,May-2019
3,145631930,,30000.0,60,18.94,777.23,D,D2,10+ years,MORTGAGE,90000.0,Source Verified,,debt_consolidation,713xx,LA,w,Postmaster,0.0,10.0,13761.0,0.0,8.3,26.52,0.0,...,15.0,2.0,,,,,,,,,,,,,4.0,,2018-12-01,Current,11338.8,2210.86,0.0,9127.94,0.0,0.0,May-2019
4,145638579,,5000.0,36,17.97,180.69,D,D1,6 years,MORTGAGE,59280.0,Source Verified,,debt_consolidation,490xx,MI,w,Administrative,0.0,4.0,13800.0,0.0,0.0,10.51,0.0,...,5.0,0.0,,,,,,,,,,,,,1.0,,2018-12-01,Current,715.27,282.84,0.0,432.43,0.0,0.0,May-2019


In [17]:
# save data
df.to_pickle('clean_data/prelim-all.pkl')

Next create train/val/test splits. Since only historical data will be available when choosing which loans to invest in, I chose to split the data by date. For simplicity, only 36m term loans are used instead of both 36m and 60m.

The most recent data is from Q1 2019, so 36m loans from Q1 2016 and earlier should be fully paid off or confirmed as defaulted by now. The most recent of these loans, from Q2 2015 to Q1 2016, will be used as the test set.

The validation set will have loans from Q2 2014 to Q1 2015.

The train set will have loans from Q1 2014 and earlier.

In [18]:
df = pd.read_pickle('clean_data/prelim-all.pkl')

I chose to restrict to 36-month loans so that we have full payment info on more recent data. The other loan term is 5 years and would limit the time periods available for a train/val/holdout data split.

In [19]:
# restrict to 36m loans (copy so new columns can be added later without touching a slice)
df = df[df['term']==36].copy()

In [20]:
df.shape

(1685745, 114)

Next a few features are added to later calculate annualized returns.

In [21]:
# how long until the loan was paid off/reached default status
df['loan_length'] = (pd.to_datetime(df['last_pymnt_d'])-df['issue_d']).dt.days

# loans that did not receive any payments had NA values
# by adding 30 days it provides a timeframe to calculate returns
# NA values messed up the calculation
df['loan_length'] = df['loan_length'].fillna(30)

# loans that were paid off in the same month had a loan length of 0
# loan length of zero pushed returns to infinity
# loans also take some time to originate after initial investment
# adding 30 days creates cleaner and more realistic return calculations
df.loc[df['loan_length']==0, 'loan_length'] = 30

# recoveries are not factored into the last_pymnt_d feature
# without this adjustment a loan can make the first payment, then default and make a partial recovery payment,
# and return on investment would be above 100%
# adding a year for the recovery period is also fairly realistic
df.loc[df['recoveries']>0, 'loan_length'] =df.loc[df['recoveries']>0, 'loan_length']+365

# standard annualized return formula
df['returns'] = (df['total_pymnt']/df['loan_amnt'])**(365/(df['loan_length'])) - 1
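
To make the formula concrete, here is a small worked sketch with made-up amounts. The function name `annualized_return` is hypothetical; it applies the same expression as the cell above to plain numbers.

```python
def annualized_return(total_pymnt, loan_amnt, loan_length_days):
    """Annualized return: (total paid / amount lent) ** (365 / days) - 1."""
    return (total_pymnt / loan_amnt) ** (365 / loan_length_days) - 1

# a 10% gain over exactly one year stays 10%
one_year = annualized_return(1100.0, 1000.0, 365)

# the same 10% gain spread over two years annualizes to about 4.9%
two_years = annualized_return(1100.0, 1000.0, 730)

# nothing repaid: the full investment is lost
total_loss = annualized_return(0.0, 1000.0, 30)

print(round(one_year, 4), round(two_years, 4), total_loss)
```

Because the exponent is 365 divided by the loan length, very short loan lengths blow up the result, which is why the zero-day and missing values are replaced with 30 days above.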


In [22]:
df.sort_values('returns')

Unnamed: 0,id,member_id,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,annual_inc,verification_status,desc,purpose,zip_code,addr_state,initial_list_status,emp_title,acc_now_delinq,acc_open_past_24mths,bc_open_to_buy,percent_bc_gt_75,bc_util,dti,delinq_2yrs,...,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,revol_bal_joint,open_act_il,sec_app_open_act_il,issue_d,loan_status,total_pymnt,total_rec_int,total_rec_late_fee,total_rec_prncp,recoveries,collection_recovery_fee,last_pymnt_d,loan_length,returns
119633,141422742,,17500.0,36,17.97,632.41,D,D1,8 years,OWN,32000.0,Source Verified,,credit_card,295xx,SC,w,Assistant manager,0.0,9.0,21592.0,10.0,18.8,30.08,0.0,...,,,,,,,,,,,,,1.0,,2018-10-01,Charged Off,0.000000,0.00,0.0,0.00,0.00,0.0,,30.0,-1.000000
1090438,131393983,,10000.0,36,11.98,332.05,B,B5,1 year,RENT,45000.0,Source Verified,,debt_consolidation,103xx,NY,w,mail carrier,0.0,5.0,2722.0,20.0,61.1,17.36,0.0,...,,,,,,,,,,,,,1.0,,2018-04-01,Charged Off,0.000000,0.00,0.0,0.00,0.00,0.0,,30.0,-1.000000
1090462,131133802,,2100.0,36,14.07,71.85,C,C3,1 year,RENT,62500.0,Source Verified,,other,530xx,WI,f,Cnc operator,0.0,4.0,3234.0,0.0,14.9,1.09,0.0,...,,,,,,,,,,,,,0.0,,2018-04-01,Charged Off,70.210000,22.98,0.0,47.23,0.00,0.0,May-2018,30.0,-1.000000
1091065,130945620,,19200.0,36,6.07,584.72,A,A2,8 years,OWN,86500.0,Source Verified,,medical,940xx,CA,w,Asset recovery specialist,0.0,2.0,115536.0,0.0,1.4,2.63,0.0,...,,,,,,,,,,,,,0.0,,2018-04-01,Charged Off,0.000000,0.00,0.0,0.00,0.00,0.0,,30.0,-1.000000
1413673,146976744,,12000.0,36,14.47,412.88,C,C2,8 years,RENT,49000.0,Not Verified,,debt_consolidation,601xx,IL,w,Diamond Setter,0.0,8.0,10495.0,33.3,51.4,32.55,0.0,...,,,,,,,,,,,,,2.0,,2019-01-01,Charged Off,0.000000,0.00,0.0,0.00,0.00,0.0,,30.0,-1.000000
1413635,147023595,,9500.0,36,7.56,295.78,A,A3,10+ years,OWN,83600.0,Not Verified,,debt_consolidation,356xx,AL,w,Chemical laboratory analyst,0.0,3.0,61681.0,0.0,17.8,20.86,0.0,...,,,,,,,,,,,,,3.0,,2019-01-01,Charged Off,0.000000,0.00,0.0,0.00,0.00,0.0,,30.0,-1.000000
1092719,131172557,,10000.0,36,13.58,339.74,C,C2,5 years,RENT,38000.0,Not Verified,,major_purchase,750xx,TX,f,Cashier..cook,0.0,4.0,250.0,100.0,92.9,3.07,0.0,...,,,,,,,,,,,,,0.0,,2018-04-01,Charged Off,0.000000,0.00,0.0,0.00,0.00,0.0,,30.0,-1.000000
1090410,130821434,,5600.0,36,21.85,213.44,D,D5,1 year,RENT,50000.0,Source Verified,,vacation,967xx,HI,w,Care Coordinator,0.0,6.0,11844.0,20.0,51.1,21.41,0.0,...,680.0,684.0,May-1993,0.0,0.0,8.0,95.1,6.0,0.0,0.0,,74637.0,2.0,2.0,2018-04-01,Charged Off,0.000000,0.00,0.0,0.00,0.00,0.0,,30.0,-1.000000
1093015,131275227,,10000.0,36,14.07,342.12,C,C3,2 years,RENT,40000.0,Source Verified,,debt_consolidation,553xx,MN,f,Auditor,0.0,2.0,6189.0,50.0,22.6,14.47,0.0,...,,,,,,,,,,,,,1.0,,2018-04-01,Charged Off,334.300000,109.43,0.0,224.87,0.00,0.0,May-2018,30.0,-1.000000
77247,143092136,,3000.0,36,25.34,119.82,E,E3,2 years,RENT,73000.0,Source Verified,,small_business,770xx,TX,w,Rn,0.0,4.0,25295.0,0.0,1.6,14.96,2.0,...,,,,,,,,,,,,,2.0,,2018-11-01,Charged Off,115.600000,59.13,0.0,56.47,0.00,0.0,Dec-2018,30.0,-1.000000


The train/val/holdout split is designed to simulate future investments: train on historical data, validate on data from a later time period, and hold out the most recent loans with full payment/charge-off information (issued about 3 years ago) as the test set.

In [23]:
# pd.datetime was removed from pandas; pd.Timestamp is the supported equivalent
test_end = pd.Timestamp(2016,3,1)
test_start = pd.Timestamp(2015,3,1)
test_df = df[(df['issue_d'] > test_start) & (df['issue_d'] <= test_end)]

# verify start and end dates for test data
test_df.issue_d.min(), test_df.issue_d.max()

(Timestamp('2015-04-01 00:00:00'), Timestamp('2016-03-01 00:00:00'))

In [24]:
val_end = pd.Timestamp(2015,3,1)
val_start = pd.Timestamp(2014,3,1)
val_df = df[(df['issue_d'] > val_start) & (df['issue_d'] <= val_end)]

# verify start and end dates for val data
val_df.issue_d.min(), val_df.issue_d.max()

(Timestamp('2014-04-01 00:00:00'), Timestamp('2015-03-01 00:00:00'))

In [25]:
train_end = pd.Timestamp(2014,3,1)
train_df = df[df['issue_d'] <= train_end]

# verify start and end dates for train data
train_df.issue_d.min(), train_df.issue_d.max()

(Timestamp('2007-06-01 00:00:00'), Timestamp('2014-03-01 00:00:00'))
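
The three cells above repeat the same date filter with different bounds. A minimal sketch of that filter as one helper follows; the name `split_by_date` and the toy frame are hypothetical, and the bounds mirror the ones used above (exclusive start, inclusive end).

```python
import pandas as pd

def split_by_date(frame, start, end, col='issue_d'):
    """Rows with start < frame[col] <= end; start=None means no lower bound."""
    mask = frame[col] <= end
    if start is not None:
        mask &= frame[col] > start
    return frame[mask]

# toy frame with one loan per boundary month (made-up ids)
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'issue_d': pd.to_datetime(['2014-03-01', '2014-04-01', '2015-03-01', '2015-04-01']),
})

train = split_by_date(toy, None, pd.Timestamp(2014, 3, 1))
val = split_by_date(toy, pd.Timestamp(2014, 3, 1), pd.Timestamp(2015, 3, 1))
test = split_by_date(toy, pd.Timestamp(2015, 3, 1), pd.Timestamp(2016, 3, 1))
print(train['id'].tolist(), val['id'].tolist(), test['id'].tolist())
```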

Verify that no test or val loans leak into the train data. Testing the intersection between sets is much faster than finding the overlap between lists of ids.

In [26]:
train_ids = set(train_df.id)
val_ids = set(val_df.id)
test_ids = set(test_df.id)

In [27]:
train_ids.intersection(val_ids)

set()

In [28]:
train_ids.intersection(test_ids)

set()

In [29]:
val_ids.intersection(test_ids)

set()
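
The three checks above can also be run as a single assertion-style test. The sketch below uses `set.isdisjoint`, which stops at the first shared element instead of building the full intersection; the helper name `check_no_overlap` and the ids are made up.

```python
def check_no_overlap(**id_sets):
    """Raise if any two of the named id sets share an element."""
    names = list(id_sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if not id_sets[a].isdisjoint(id_sets[b]):
                raise ValueError(f'{a} and {b} share ids')
    return True

# made-up ids standing in for train_ids, val_ids and test_ids
ok = check_no_overlap(train={1, 2}, val={3}, test={4, 5})
print(ok)
```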

Check loan status within each data split.

In [30]:
train_df['loan_status'].value_counts()

Fully Paid     181170
Charged Off     25892
Name: loan_status, dtype: int64

In [31]:
val_df['loan_status'].value_counts()

Fully Paid     158743
Charged Off     26322
Name: loan_status, dtype: int64

In [32]:
test_df['loan_status'].value_counts()

Fully Paid            273746
Charged Off            48567
Late (31-120 days)       223
Current                  161
In Grace Period           16
Late (16-30 days)         10
Default                    1
Name: loan_status, dtype: int64

In [33]:
# for simplicity, keep only loans with a final outcome (fully paid or charged off)
test_df = test_df[(test_df['loan_status']=='Fully Paid')|(test_df['loan_status']=='Charged Off')]

In [34]:
test_df['loan_status'].value_counts()

Fully Paid     273746
Charged Off     48567
Name: loan_status, dtype: int64

In [35]:
train_df.to_pickle('clean_data/api_train_df.pkl')
val_df.to_pickle('clean_data/api_val_df.pkl')
test_df.to_pickle('clean_data/api_test_df.pkl')