This notebook performs a high-level analysis of the loans in the dataset, primarily to assess financial performance, identify drivers of defaults, and set financial targets for the machine learning model.

Key Statistics
* Default Rate: 19.17%
* Gains from good loans: ~913M
* Losses from bad loans: ~560M
* Net Gains: ~353M
* The loans in the bottom two grades (F and G) are unprofitable overall, and the grade above them (E) is barely profitable

### What we've learned so far
##### (this data is repeated at the bottom of this notebook, but placed here for convenience)
* Removing loans with an interest rate above 13.5% would give up approximately 87M in gains, but increase the annualized return on total loans by about 25%. The company would deploy ~2.4B less in capital, but only earn around 87M less. The argument is that Lending Club could find higher-yield uses for that ~2.4B than the ~1.21% annualized gain these loans deliver. Note that as of this writing (4/08/22) two-year T-Bills are paying 2.51%, up from 1.63% a month ago.
* Removing all loans below Class D barely changes net gains, but increases returns on the loan portfolio by about 15.6%. I.e. the loans below Class D barely broke even, while tying up roughly 815M in lending funds.
* On average, each bad loan wipes out the gains from ~2.6 good loans. This means a "false positive" (approving a loan that eventually goes into default) is roughly 2.6X more expensive than a "false negative" (rejecting a customer who would have paid off their loan).
* In a perfect world, the company would earn about 5.5% annualized from its lending business, i.e. the loans that were paid in full delivered a ~5.5% annualized return.
* Even though F & G loans constitute only ~2.8% of all loans, both categories are a net loss, so LC should stop originating loans in these grades.
* While we don't know how much it costs to service/manage an individual loan, the low average return (~337) of the loans in class E (8.22% of all loans) plus the default rate of 37.9% suggests that the company should stop originating loans in that category as well. A loan class with a high default rate is likely more expensive to manage than, say, Class A or B loans, so once operational costs are subtracted from the Class E gains, the return is probably significantly less than 337.


For the purposes of this exercise, "success" will be a model that provides greater value than the two proposed interventions:
* Not originating loans to customers with loan grades below D, meaning: <16.5% default rate, >352M in gains, <5.09B in deployed capital
* Setting an interest rate cap at 13.5%, meaning: <12.3% default rate, >267M in gains, <3.6B in deployed capital

There are clear options to improve the performance of LC's lending portfolio, so for a model to provide business value it would have to outperform one or both of the interventions identified via the EDA process. I.e. the model would need to provide a third option that is superior to the other two, which both amount to a fairly standard banking move: tighten your lending standards.

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from numpy import mean
from sklearn import linear_model, metrics
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from xgboost import XGBClassifier, plot_importance

# local helper class with the EDA functions used throughout this notebook
from eda_class import EDA

%matplotlib inline

# silence all warnings (including FutureWarning) so they don't clutter the output
warnings.filterwarnings('ignore')




In [2]:
# set size parameters for the plots 
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100 


# set view parameters for the data frames 
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999




In [3]:
# instantiate the EDA functions that will help us analyze the data 

eda_functions = EDA()

In [4]:
# quickly revisit some of the EDA from the prior analysis in order to acquire some data around
# cost trade-offs, i.e. cost of a false negative (rejecting a good loan) vs. a false positive (approving a bad loan)

data = pd.read_csv('data/LC_2015_clean(4)_updated_April2022.csv')

# moving the loan status column so the most important columns are all in one place
status_column = data.pop('loan_status')
data.insert(5, 'loan_status', status_column)


data.head(10)

Unnamed: 0,funded_amnt,term,int_rate,installment,grade,loan_status,emp_length,home_ownership,annual_inc,verification_status,issue_d,purpose,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,inq_last_6mths,mths_since_last_delinq,open_acc,pub_rec,pub_rec_bankruptcies,revol_bal,revol_util,total_acc,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,collections_12_mths_ex_med,application_type,acc_now_delinq,chargeoff_within_12_mths,acc_open_past_24mths,avg_cur_bal,delinq_amnt,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_il_tl,num_tl_120dpd_2m,pct_tl_nvr_dlq,tot_coll_amt,tot_cur_bal,total_bal_ex_mort,emp_length_months,dti_dec,length_of_credit_history,monthly_income,monthly_debt_payments,updated_monthly_debt_payments,lost_principle,total_payments,post_loan_dti,net_gain,debt_consolidation,consumer_credit,other,countdown_zero_delinq,countdown_zero_revol_delinq
0,20000.0,36 months,0.1485,691.84,C,Fully Paid,6,RENT,110000.0,Not Verified,2015-12-01,credit_card,IL,12.45,0.0,2007-06-01,690.0,0.0,0.0,8.0,0.0,0.0,21374.0,0.845,12.0,24889.01336,24889.01,20000.0,4889.01,0.0,0.0,Individual,0.0,0.0,3.0,5356.0,0.0,102.0,16.0,10.0,0.0,9.0,0.0,0.0,1.0,0.0,100.0,0.0,37491.0,37491.0,72,0.1245,102,9166.666667,1141.25,1833.09,0.0,35.9751,0.199973,4889.01,1,0,0,0.0,0.0
1,20000.0,36 months,0.1577,700.88,D,Fully Paid,5,RENT,70000.0,Not Verified,2015-12-01,house,FL,22.21,0.0,2004-07-01,680.0,0.0,38.0,24.0,0.0,0.0,19077.0,0.366,63.0,21780.58678,21780.59,20000.0,1780.59,0.0,0.0,Individual,0.0,0.0,9.0,2759.0,0.0,137.0,2.0,2.0,1.0,9.0,0.0,5.0,39.0,0.0,92.1,264.0,63456.0,63456.0,60,0.2221,137,5833.333333,1295.583333,1996.463333,0.0,31.076057,0.342251,1780.59,0,1,0,-46.0,0.0
2,10000.0,60 months,0.1797,253.78,D,Charged Off,2,MORTGAGE,55000.0,Not Verified,2015-12-01,credit_card,CO,35.7,0.0,2001-04-01,685.0,0.0,0.0,14.0,0.0,0.0,38623.0,0.78,28.0,5558.2,5558.2,2687.15,2871.05,0.0,0.0,Individual,0.0,0.0,4.0,20578.0,0.0,176.0,7.0,6.0,6.0,6.0,0.0,0.0,6.0,0.0,100.0,0.0,288087.0,71518.0,24,0.357,176,4583.333333,1636.25,1890.03,7312.85,21.901647,0.41237,-4441.8,1,0,0,0.0,0.0
3,20000.0,36 months,0.0849,631.26,B,Fully Paid,10,MORTGAGE,85000.0,Not Verified,2015-12-01,major_purchase,SC,17.61,1.0,1999-02-01,705.0,0.0,3.0,8.0,0.0,0.0,826.0,0.057,15.0,21538.50898,21538.51,20000.0,1538.51,0.0,0.0,Individual,0.0,0.0,4.0,17700.0,0.0,55.0,32.0,13.0,3.0,8.0,0.0,1.0,9.0,0.0,93.3,0.0,141601.0,27937.0,120,0.1761,201,7083.333333,1247.375,1878.635,0.0,34.11987,0.265219,1538.51,0,1,0,-81.0,0.0
4,10000.0,36 months,0.0649,306.45,A,Fully Paid,6,RENT,85000.0,Not Verified,2015-12-01,credit_card,PA,13.07,0.0,2002-04-01,685.0,1.0,0.0,14.0,1.0,1.0,10464.0,0.345,23.0,10998.97157,10998.97,10000.0,998.97,0.0,0.0,Individual,0.0,0.0,7.0,1997.0,0.0,129.0,1.0,1.0,1.0,1.0,0.0,0.0,3.0,0.0,95.7,8341.0,27957.0,27957.0,72,0.1307,164,7083.333333,925.791667,1232.241667,0.0,35.89157,0.173964,998.97,1,0,0,0.0,0.0
5,8000.0,36 months,0.1148,263.74,B,Fully Paid,10,MORTGAGE,42000.0,Not Verified,2015-12-01,credit_card,RI,34.8,0.0,1994-11-01,700.0,0.0,75.0,8.0,0.0,0.0,7034.0,0.391,18.0,8939.580503,8939.58,8000.0,939.58,0.0,0.0,Individual,0.0,0.0,5.0,28528.0,0.0,253.0,15.0,10.0,1.0,10.0,0.0,1.0,5.0,0.0,94.4,0.0,199696.0,113782.0,120,0.348,252,3500.0,1218.0,1481.74,0.0,33.895429,0.423354,939.58,1,0,0,-9.0,0.0
6,28000.0,36 months,0.0649,858.05,A,Fully Paid,10,MORTGAGE,92000.0,Not Verified,2015-12-01,debt_consolidation,NC,21.6,0.0,1984-05-01,720.0,0.0,42.0,16.0,0.0,0.0,51507.0,0.645,24.0,29939.01773,29939.02,28000.0,1939.02,0.0,0.0,Individual,0.0,0.0,1.0,13819.0,0.0,379.0,19.0,19.0,2.0,0.0,42.0,0.0,4.0,0.0,91.7,0.0,221110.0,74920.0,120,0.216,379,7666.666667,1656.0,2514.05,0.0,34.891927,0.32792,1939.02,1,0,0,-42.0,-42.0
7,18000.0,60 months,0.1199,400.31,C,Fully Paid,10,MORTGAGE,112000.0,Not Verified,2015-12-01,debt_consolidation,AZ,8.68,0.0,1993-11-01,800.0,0.0,0.0,17.0,0.0,0.0,10711.0,0.155,27.0,18387.22,18387.22,18000.0,387.22,0.0,0.0,Individual,0.0,0.0,5.0,17089.0,0.0,265.0,1.0,1.0,4.0,10.0,0.0,0.0,6.0,0.0,100.0,0.0,205067.0,36127.0,120,0.0868,264,9333.333333,810.133333,1210.443333,0.0,45.932452,0.12969,387.22,1,0,0,0.0,0.0
8,9600.0,36 months,0.0749,298.58,A,Fully Paid,8,MORTGAGE,60000.0,Not Verified,2015-12-01,credit_card,SC,22.44,0.0,1996-06-01,695.0,0.0,0.0,7.0,0.0,0.0,7722.0,0.594,9.0,10636.09843,10636.1,9600.0,1036.1,0.0,0.0,Individual,0.0,0.0,2.0,7912.0,0.0,91.0,9.0,9.0,0.0,9.0,0.0,0.0,5.0,0.0,100.0,0.0,55387.0,55387.0,96,0.2244,233,5000.0,1122.0,1420.58,0.0,35.622274,0.284116,1036.1,1,0,0,0.0,0.0
9,25000.0,36 months,0.0749,777.55,A,Fully Paid,10,MORTGAGE,109000.0,Not Verified,2015-12-01,debt_consolidation,VA,26.02,0.0,2001-12-01,745.0,1.0,0.0,9.0,0.0,0.0,20862.0,0.543,19.0,26224.23,26224.23,25000.0,1224.23,0.0,0.0,Individual,0.0,0.0,2.0,33976.0,0.0,168.0,13.0,13.0,3.0,0.0,0.0,0.0,7.0,0.0,100.0,0.0,305781.0,68056.0,120,0.2602,167,9083.333333,2363.483333,3141.033333,0.0,33.726744,0.345802,1224.23,1,0,0,0.0,0.0


In [5]:
# overall performance 
overall_performance = eda_functions.calc_performance(data)


#default rate
default = eda_functions.categorical_count(data, 'loan_status')

In [6]:
# overall performance
# ~5.9B in total loans vs. ~353M in total net gains

overall_performance.head()

Unnamed: 0,total_loan_value,total_net_gains,avg_interest rate,avg_loan_amount,avg_gains($),avg_gains(%),annualized_return
0,5909868050.0,353343408.02431,12.553595,15307.366478,915.20775,5.978871,1.9374
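
The `EDA` class comes from a local module whose source is not shown in this notebook. The sketch below is a hypothetical reconstruction of the aggregate columns only. The formulas are inferred from the output above (for example, ~353.3M / ~5.91B ≈ 5.98%, which matches `avg_gains(%)`), the use of `funded_amnt` as the loan value and a simple mean for the interest rate are assumptions, and `annualized_return` is left out because its formula cannot be inferred from the output. The name `calc_performance_sketch` is made up for illustration.

```python
import pandas as pd


def calc_performance_sketch(loans: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for EDA.calc_performance (annualized_return omitted)."""
    total_loan_value = loans["funded_amnt"].sum()   # assumed source column
    total_net_gains = loans["net_gain"].sum()
    n_loans = len(loans)
    return pd.DataFrame([{
        "total_loan_value": total_loan_value,
        "total_net_gains": total_net_gains,
        # assumed: simple (unweighted) mean, expressed in percent
        "avg_interest rate": loans["int_rate"].mean() * 100,
        "avg_loan_amount": total_loan_value / n_loans,
        "avg_gains($)": total_net_gains / n_loans,
        "avg_gains(%)": total_net_gains / total_loan_value * 100,
    }])


# three rows borrowed from data.head(10), used purely as toy input
toy = pd.DataFrame({
    "funded_amnt": [20000.0, 10000.0, 10000.0],
    "int_rate": [0.1485, 0.1797, 0.0649],
    "net_gain": [4889.01, -4441.80, 998.97],
})
summary = calc_performance_sketch(toy)
```

Dividing total net gains by total loan value (rather than averaging per-loan percentages) weights each loan by its size, which is what the output above appears to do.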


In [7]:
# defaults - default rate is about 19.17% 

default

Unnamed: 0,loan_status,count,per_of_total
0,Fully Paid,312052,0.808257
1,Charged Off,74028,0.191743
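
Since the source of `categorical_count` is not shown either, here is a minimal sketch of a function that would produce a table with the same shape as the output above: one row per category, sorted by count, with each category's share of the total. The name `categorical_count_sketch` and the implementation are assumptions, not the author's code.

```python
import pandas as pd


def categorical_count_sketch(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical stand-in for EDA.categorical_count."""
    counts = df[column].value_counts()  # sorted by count, descending
    return pd.DataFrame({
        column: counts.index,
        "count": counts.to_numpy(),
        "per_of_total": counts.to_numpy() / len(df),
    })


# toy input: four paid loans and one charged-off loan
toy = pd.DataFrame({"loan_status": ["Fully Paid"] * 4 + ["Charged Off"]})
status_counts = categorical_count_sketch(toy, "loan_status")
```

For a binary status column like this one, the `per_of_total` value on the `Charged Off` row is the default rate.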


In [8]:
# split out two separate datasets for paid vs. defaulted loans 

default_df = data[(data['loan_status'] == 'Charged Off')]
default_losses = default_df['net_gain'].sum()


paid_df = data[(data['loan_status'] == 'Fully Paid')]
paid_gains = paid_df['net_gain'].sum()


In [9]:
# calculate the performance of the good loans

paid_performance = eda_functions.calc_performance(paid_df)
paid_performance

Unnamed: 0,total_loan_value,total_net_gains,avg_interest rate,avg_loan_amount,avg_gains($),avg_gains(%),annualized_return
0,4707936775.0,912922028.77431,11.966344,15087.026441,2925.544553,19.391128,5.494593


In [10]:
# calculate the performance of the bad loans 

bad_performance = eda_functions.calc_performance(default_df)
bad_performance


Unnamed: 0,total_loan_value,total_net_gains,avg_interest rate,avg_loan_amount,avg_gains($),avg_gains(%),annualized_return
0,1201931275.0,-559578620.75,15.029049,16236.171111,-7559.013086,-46.556624,-29.185224


The first item that immediately jumps out is that the average interest rate for defaulted loans is about 25% higher than the rate for loans that were paid off. This is probably most reflective of riskier customers being given higher interest rates to shield LC against that risk, but given the burden caused by higher interest rates, LC is probably hurting itself more than it is protecting against risk. The second item is that bad loans are very expensive: despite good loans outnumbering bad loans by roughly 4:1, the bad loans wiped out nearly two thirds of the gains from the good loans. I.e. every bad loan wiped out the gains from ~2.6 good loans.
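The ratios quoted here can be reproduced directly from the two performance tables above. This is only a worked calculation on the printed figures; the variable names are chosen for illustration.

```python
# figures copied from the paid_performance and bad_performance outputs
avg_gain_paid = 2925.544553          # avg_gains($), fully paid loans
avg_loss_default = 7559.013086       # magnitude of avg_gains($), charged-off loans
paid_gains_total = 912_922_028.77    # total_net_gains, fully paid loans
default_losses_total = 559_578_620.75  # magnitude of total_net_gains, charged-off loans
avg_rate_paid = 11.966344            # avg_interest rate, fully paid loans
avg_rate_default = 15.029049         # avg_interest rate, charged-off loans

# how many average good loans it takes to cover one average bad loan
good_loans_per_default = avg_loss_default / avg_gain_paid

# share of the good-loan gains consumed by the bad-loan losses
share_of_gains_lost = default_losses_total / paid_gains_total

# relative interest rate premium on loans that ended up defaulting
rate_premium = avg_rate_default / avg_rate_paid - 1

print(f"{good_loans_per_default:.2f} good loans per default")
print(f"{share_of_gains_lost:.1%} of gains lost to defaults")
print(f"{rate_premium:.1%} higher average rate on defaulted loans")
```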

In [11]:
# let's check and see if there are loans that are defaulted, but the person made enough payments
# so that the loan still delivered a positive return 

positive_default = default_df[(default_df['net_gain'] > 0)]

positive_default 



Unnamed: 0,funded_amnt,term,int_rate,installment,grade,loan_status,emp_length,home_ownership,annual_inc,verification_status,issue_d,purpose,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,inq_last_6mths,mths_since_last_delinq,open_acc,pub_rec,pub_rec_bankruptcies,revol_bal,revol_util,total_acc,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,collections_12_mths_ex_med,application_type,acc_now_delinq,chargeoff_within_12_mths,acc_open_past_24mths,avg_cur_bal,delinq_amnt,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_il_tl,num_tl_120dpd_2m,pct_tl_nvr_dlq,tot_coll_amt,tot_cur_bal,total_bal_ex_mort,emp_length_months,dti_dec,length_of_credit_history,monthly_income,monthly_debt_payments,updated_monthly_debt_payments,lost_principle,total_payments,post_loan_dti,net_gain,debt_consolidation,consumer_credit,other,countdown_zero_delinq,countdown_zero_revol_delinq
16,16000.0,36 months,0.1288,538.18,C,Charged Off,10,MORTGAGE,65000.0,Not Verified,2015-12-01,small_business,AL,18.96,0.0,1985-12-01,675.0,0.0,33.0,7.0,1.0,0.0,5157.0,0.543,20.0,17695.03,17695.03,13833.12,3402.05,161.46,0.0,Individual,0.0,0.0,2.0,5683.0,0.0,360.0,1.0,1.0,0.0,14.0,33.0,1.0,12.0,0.0,80.0,1830.0,39781.0,39781.0,120,0.1896,359,5416.666667,1027.000000,1565.180000,2166.88,32.879390,0.288956,1396.63,0,0,1,-51.0,-51.0
126,16000.0,36 months,0.1577,560.70,D,Charged Off,1,MORTGAGE,70000.0,Not Verified,2015-12-01,debt_consolidation,NJ,8.78,0.0,2004-03-01,705.0,1.0,0.0,7.0,0.0,0.0,15986.0,0.761,25.0,19186.72,19186.72,14888.33,4140.43,0.00,0.0,Individual,0.0,0.0,5.0,25497.0,0.0,141.0,4.0,4.0,1.0,4.0,0.0,0.0,2.0,0.0,100.0,0.0,178477.0,31037.0,12,0.0878,141,5833.333333,512.166667,1072.866667,1111.67,34.219226,0.183920,3028.76,1,0,0,0.0,0.0
141,5000.0,36 months,0.1199,166.05,C,Charged Off,1,RENT,30000.0,Source Verified,2015-12-01,debt_consolidation,MD,9.16,0.0,2005-04-01,675.0,0.0,29.0,12.0,2.0,0.0,5426.0,0.283,15.0,5388.34,5388.34,4193.70,945.52,0.00,0.0,Individual,0.0,0.0,8.0,493.0,0.0,128.0,9.0,9.0,0.0,9.0,29.0,1.0,0.0,0.0,86.7,0.0,5426.0,5426.0,12,0.0916,128,2500.000000,229.000000,395.050000,806.30,32.450105,0.158020,139.22,1,0,0,-55.0,-55.0
151,30000.0,60 months,0.1849,769.83,D,Charged Off,2,MORTGAGE,85000.0,Source Verified,2015-12-01,debt_consolidation,NJ,35.93,0.0,2003-10-01,665.0,0.0,45.0,21.0,1.0,1.0,12381.0,0.499,49.0,37644.63,37644.63,22265.82,15378.81,0.00,0.0,Individual,0.0,0.0,10.0,9300.0,0.0,146.0,9.0,4.0,2.0,0.0,0.0,0.0,14.0,0.0,97.7,0.0,195290.0,106382.0,24,0.3593,146,7083.333333,2545.041667,3314.871667,7734.18,48.899926,0.467982,7644.63,1,0,0,-39.0,0.0
273,10400.0,36 months,0.1344,352.63,C,Charged Off,10,MORTGAGE,150000.0,Verified,2015-12-01,debt_consolidation,MO,18.51,1.0,1991-03-01,670.0,1.0,22.0,23.0,0.0,0.0,29336.0,0.667,44.0,10784.00,10784.00,8099.76,2332.77,52.89,0.0,Individual,0.0,0.0,14.0,10990.0,0.0,297.0,2.0,2.0,2.0,2.0,22.0,3.0,17.0,0.0,74.4,0.0,252779.0,145972.0,120,0.1851,297,12500.000000,2313.750000,2666.380000,2300.24,30.581629,0.213310,85.42,1,0,0,-62.0,-62.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
385756,8100.0,36 months,0.1499,280.75,C,Charged Off,4,MORTGAGE,52600.0,Source Verified,2015-01-01,debt_consolidation,MI,17.70,0.0,1990-05-01,660.0,1.0,25.0,11.0,0.0,0.0,8152.0,0.703,37.0,9140.51,9140.51,7007.98,1962.53,15.00,0.0,Individual,0.0,0.0,10.0,18569.0,0.0,295.0,1.0,1.0,7.0,1.0,32.0,1.0,12.0,0.0,89.2,490.0,167119.0,39799.0,48,0.1770,296,4383.333333,775.850000,1056.600000,1092.02,32.557471,0.241049,885.51,1,0,0,-59.0,-52.0
385929,15775.0,60 months,0.1924,411.30,E,Charged Off,1,OWN,36000.0,Not Verified,2015-01-01,credit_card,CA,14.23,0.0,2006-10-01,700.0,2.0,79.0,22.0,0.0,0.0,9695.0,0.257,27.0,20452.58,20452.58,9084.31,7745.27,0.00,0.0,Individual,0.0,0.0,7.0,6143.0,0.0,85.0,2.0,2.0,0.0,1.0,0.0,2.0,9.0,0.0,100.0,0.0,135139.0,135139.0,12,0.1423,99,3000.000000,426.900000,838.200000,6690.69,49.726672,0.279400,1054.58,1,0,0,-5.0,0.0
385945,12000.0,36 months,0.1144,395.37,B,Charged Off,10,MORTGAGE,125000.0,Not Verified,2015-01-01,home_improvement,CA,34.07,0.0,1994-10-01,670.0,1.0,29.0,17.0,0.0,0.0,25924.0,0.585,42.0,13031.96,13004.81,10836.16,2195.80,0.00,0.0,Individual,0.0,0.0,8.0,36564.0,0.0,170.0,3.0,3.0,3.0,6.0,29.0,0.0,15.0,0.0,97.6,0.0,621585.0,108575.0,120,0.3407,243,10416.666667,3548.958333,3944.328333,1163.84,32.961429,0.378656,1031.96,0,1,0,-55.0,-55.0
385991,30000.0,36 months,0.1366,1020.39,C,Charged Off,1,MORTGAGE,120000.0,Verified,2015-01-01,debt_consolidation,TX,18.23,0.0,1992-11-01,705.0,0.0,0.0,11.0,0.0,0.0,22658.0,0.693,27.0,31392.60,31392.60,24108.33,6468.46,0.00,0.0,Individual,0.0,0.0,5.0,27866.0,0.0,185.0,5.0,5.0,3.0,9.0,0.0,0.0,14.0,0.0,100.0,0.0,306521.0,82076.0,12,0.1823,265,10000.000000,1823.000000,2843.390000,5891.67,30.765296,0.284339,576.79,1,0,0,0.0,0.0


Given that there were roughly 74k bad loans, the above indicates that a little less than 10% of the bad loans still turned a profit. This suggests that a machine learning model that approves some of these loans (false positives, in classification terms) wouldn't necessarily be a bad thing, as the goal is to maximize profits, not just reduce the default rate.
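One way to act on this is to score a model by the net gain of the loans it would approve rather than by classification accuracy alone. The sketch below uses a toy frame with a hypothetical `approve` column standing in for model decisions; the `net_gain` values are borrowed from rows shown earlier, and nothing here reflects an actual model.

```python
import pandas as pd

loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Charged Off", "Fully Paid"],
    "net_gain": [4889.01, -4441.80, 1396.63, 998.97],
    "approve": [True, False, True, False],  # hypothetical model decisions
})

# share of charged-off loans that still returned more than was lent
is_default = loans["loan_status"] == "Charged Off"
profitable_defaults = loans[is_default & (loans["net_gain"] > 0)]
share_profitable = len(profitable_defaults) / is_default.sum()

# profit-based view of the model: what it earns and what it leaves on the table
portfolio_gain = loans.loc[loans["approve"], "net_gain"].sum()
forgone_gain = loans.loc[~loans["approve"] & (loans["net_gain"] > 0), "net_gain"].sum()
avoided_loss = -loans.loc[~loans["approve"] & (loans["net_gain"] < 0), "net_gain"].sum()
```

In this toy case the approved charged-off loan counts as a classification error, yet it adds to `portfolio_gain`, which is the point made above.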

In [12]:
# LC groups loans by grade, how does the performance of the loans look within those grades?


grade_analysis = data[['int_rate', 'net_gain', 'loan_status', 'fico_range_low','grade', 
                       'monthly_debt_payments', 
                       'countdown_zero_delinq','annual_inc',
                       'monthly_income', 'dti_dec', 'post_loan_dti']]


# convert loan status to dummy variables so we have a column for paid and a column for default
# with boolean yes/no variables as 1s and 0s 
grade_performance = pd.get_dummies(grade_analysis, columns = ['loan_status'])

# update the column names 
grade_performance = grade_performance.rename(columns={'loan_status_Charged Off': 'default',
                                                      'loan_status_Fully Paid': 'paid'})


grade_performance = grade_performance.groupby('grade').mean()

grade_performance.reset_index(inplace = True)

default_col = grade_performance.pop('default')
grade_performance.insert(1, 'default_rate', default_col)

grade_performance.head(7)



Unnamed: 0,grade,default_rate,int_rate,net_gain,fico_range_low,monthly_debt_payments,countdown_zero_delinq,annual_inc,monthly_income,dti_dec,post_loan_dti,paid
0,A,0.053517,0.069413,1013.090019,719.340537,1163.548274,-21.07169,92904.659333,7742.054944,0.162396,0.22911,0.946483
1,B,0.12296,0.10043,1140.22864,694.785282,1123.354834,-26.838001,80390.292756,6699.191063,0.17961,0.250233,0.87704
2,C,0.209486,0.132971,1059.093132,686.500232,1126.85386,-26.982221,73907.382085,6158.948507,0.196116,0.274171,0.790514
3,D,0.298315,0.167231,756.907693,682.838034,1153.083213,-26.799308,69925.417166,5827.118097,0.213271,0.303537,0.701685
4,E,0.378653,0.192983,337.293011,681.936256,1202.351869,-26.706381,72090.865136,6007.572095,0.216452,0.315093,0.621347
5,F,0.465843,0.236242,-740.637748,680.649755,1206.732995,-26.962951,72888.933013,6074.077751,0.21388,0.323252,0.534157
6,G,0.505226,0.268314,-1449.532723,679.168741,1112.8468,-28.800398,71157.650582,5929.804215,0.201376,0.320982,0.494774


In [13]:
# let's see how many loans are originated in each category 
# access the categorical count function to get a count of each category + calculate 
# percentage of total.

grade_count = eda_functions.categorical_count(data, 'grade')

# extract the default rate column and the grade column, so we can add the default rates per 
# grade to the grade count data frame 

grade_default = grade_performance[['grade', 'default_rate']]


# use a join to add the default column to the grade count data frame 
grade_count = pd.merge(grade_count, grade_default, on='grade')


# calculate the number of defaulted loans 
grade_count['default_count'] = grade_count['count'] * grade_count['default_rate']

# calculate total number of defaulted loans 
total_defaults = grade_count['default_count'].sum()

grade_count['default_per_of_total'] = grade_count['default_count']/total_defaults



grade_count.head(7)


Unnamed: 0,grade,count,per_of_total,default_rate,default_count,default_per_of_total
0,C,109683,0.284094,0.209486,22977.0,0.310383
1,B,108328,0.280584,0.12296,13320.0,0.179932
2,A,68950,0.17859,0.053517,3690.0,0.049846
3,D,56370,0.146006,0.298315,16816.0,0.227157
4,E,31752,0.082242,0.378653,12023.0,0.162412
5,F,8988,0.02328,0.465843,4187.0,0.05656
6,G,2009,0.005204,0.505226,1015.0,0.013711


Looking at the data so far, it's fairly obvious that the company should stop originating grade F and G loans. Given the small gains and high default rate, it should stop originating E loans as well, as the overhead and servicing costs undoubtedly erase a significant portion of the small gains from that class. What would performance look like if we removed those loans?
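The per-grade totals are not printed above, but they can be approximated by multiplying each grade's loan count by its mean `net_gain`, both taken from the two grade tables. This is a worked calculation on the printed figures; the column name `avg_net_gain` is chosen for illustration.

```python
import pandas as pd

# counts from grade_count, mean net_gain from grade_performance
low_grades = pd.DataFrame({
    "grade": ["E", "F", "G"],
    "count": [31752, 8988, 2009],
    "avg_net_gain": [337.293011, -740.637748, -1449.532723],
})

# total dollars gained or lost per grade
low_grades["total_net_gain"] = low_grades["count"] * low_grades["avg_net_gain"]

# net contribution of grades E, F and G combined
combined_net_gain = low_grades["total_net_gain"].sum()
print(low_grades[["grade", "total_net_gain"]])
print(f"combined: {combined_net_gain:,.0f}")
```

Grade E's roughly 10.7M in gains is almost entirely offset by the losses in F and G, leaving the three grades together at a little over 1M.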



In [14]:
# keep only the loans in grades A through D
top_grades = data[data['grade'].isin(['A', 'B', 'C', 'D'])]

top_performance = eda_functions.calc_performance(top_grades)

top_performance.head()



Unnamed: 0,total_loan_value,total_net_gains,avg_interest rate,avg_loan_amount,avg_gains($),avg_gains(%),annualized_return
0,5094368475.0,352202643.64421,11.556468,14838.067273,1025.839914,6.913568,2.230594


In [15]:
#default rate
default_top_grades = eda_functions.categorical_count(top_grades, 'loan_status')

default_top_grades

Unnamed: 0,loan_status,count,per_of_total
0,Fully Paid,286528,0.834553
1,Charged Off,56803,0.165447


Getting rid of loans below class D takes roughly 815M in loans off the table while reducing profit by barely 1M, and it raises average returns by ~15.6% (annualized return goes from ~1.94% to ~2.23%). In other words, the return on the loans that would no longer be originated was roughly 0.14%.
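The return on the removed loans follows from the difference between the two performance tables (all loans vs. grades A through D). A short worked calculation on the printed figures, with dictionary keys chosen for illustration:

```python
# figures copied from overall_performance and top_performance
all_loans = {"loan_value": 5_909_868_050.0, "net_gains": 353_343_408.02}
top_grades_only = {"loan_value": 5_094_368_475.0, "net_gains": 352_202_643.64}

# capital and gains attributable to the grade E, F and G loans
capital_removed = all_loans["loan_value"] - top_grades_only["loan_value"]
gains_removed = all_loans["net_gains"] - top_grades_only["net_gains"]

# total (not annualized) return earned on that capital
return_on_removed = gains_removed / capital_removed

print(f"capital removed: {capital_removed:,.0f}")
print(f"gains removed:   {gains_removed:,.0f}")
print(f"return on removed loans: {return_on_removed:.2%}")
```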

In [16]:
# as a comparison, what would happen if instead of no longer originating loans in classes E, F and G, 
# the company were to instead set an interest rate cap, since interest rates and chances of default
# are highly correlated? 

rate_cap = data[(data['int_rate'] <= 0.135)]

cap_performance = eda_functions.calc_performance(rate_cap)

cap_performance

Unnamed: 0,total_loan_value,total_net_gains,avg_interest rate,avg_loan_amount,avg_gains($),avg_gains(%),annualized_return
0,3551611025.0,266735565.154124,9.908338,14495.18825,1088.627725,7.51027,2.421427


In [17]:
#default rate
default_rate_cap = eda_functions.categorical_count(rate_cap, 'loan_status')
default_rate_cap

Unnamed: 0,loan_status,count,per_of_total
0,Fully Paid,214927,0.877181
1,Charged Off,30093,0.122819
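
To compare the two interventions side by side, the printed results can be collected into one frame and expressed relative to the full portfolio. This is a sketch built from the outputs above; the scenario labels and derived column names are made up for illustration.

```python
import pandas as pd

# figures copied from the performance and default-rate outputs above
scenarios = pd.DataFrame(
    {
        "total_loan_value": [5_909_868_050.0, 5_094_368_475.0, 3_551_611_025.0],
        "total_net_gains": [353_343_408.02, 352_202_643.64, 266_735_565.15],
        "avg_gains_pct": [5.978871, 6.913568, 7.510270],
        "annualized_return": [1.9374, 2.230594, 2.421427],
        "default_rate": [0.191743, 0.165447, 0.122819],
    },
    index=["all_loans", "grades_A_to_D", "rate_cap_13.5"],
)

base = scenarios.loc["all_loans"]
comparison = pd.DataFrame({
    "capital_freed": base["total_loan_value"] - scenarios["total_loan_value"],
    "gains_given_up": base["total_net_gains"] - scenarios["total_net_gains"],
    "annualized_return_uplift": scenarios["annualized_return"] / base["annualized_return"] - 1,
    "avg_gains_pct_uplift": scenarios["avg_gains_pct"] / base["avg_gains_pct"] - 1,
    "default_rate": scenarios["default_rate"],
})
print(comparison.round(4))
```

The rate cap frees far more capital and lowers the default rate further, while the grade cutoff keeps nearly all of the gains.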


### What we've learned so far:
* Removing loans with an interest rate above 13.5% would give up approximately 87M in gains, but increase the annualized return on total loans by about 25%. The company would deploy ~2.4B less in capital, but only earn around 87M less. The argument is that Lending Club could find higher-yield uses for that ~2.4B than the ~1.21% annualized gain these loans deliver. Note that as of this writing (4/08/22) two-year T-Bills are paying 2.51%, up from 1.63% a month ago.
* Removing all loans below Class D barely changes net gains, but increases returns on the loan portfolio by about 15.6%. I.e. the loans below Class D barely broke even, while tying up roughly 815M in lending funds.
* On average, each bad loan wipes out the gains from ~2.6 good loans. This means a "false positive" (approving a loan that eventually goes into default) is roughly 2.6X more expensive than a "false negative" (rejecting a customer who would have paid off their loan).
* In a perfect world, the company would earn about 5.5% annualized from its lending business, i.e. the loans that were paid in full delivered a ~5.5% annualized return.
* Even though F & G loans constitute only ~2.8% of all loans, both categories are a net loss, so LC should stop originating loans in these grades.
* While we don't know how much it costs to service/manage an individual loan, the low average return (~337) of the loans in class E (8.22% of all loans) plus the default rate of 37.9% suggests that the company should stop originating loans in that category as well. A loan class with a high default rate is likely more expensive to manage than, say, Class A or B loans, so once operational costs are subtracted from the Class E gains, the return is probably significantly less than 337.


For the purposes of this exercise, "success" will be a model that provides greater value than the two proposed interventions:
* Not originating loans to customers with loan grades below D, meaning: <16.5% default rate, >352M in gains, <5.09B in deployed capital
* Setting an interest rate cap at 13.5%, meaning: <12.3% default rate, >267M in gains, <3.6B in deployed capital

There are clear options to improve the performance of LC's lending portfolio, so for a model to provide business value it would have to outperform one or both of the interventions identified via the EDA process. I.e. the model would need to provide a third option that is superior to the other two, which both amount to a fairly standard banking move: tighten your lending standards.
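The success criteria can be written down as an explicit check against the two baselines. The sketch below is one strict reading of "outperform" (higher gains with no higher default rate and no more deployed capital); the baseline figures come from the outputs above, while the function name, the dictionary layout, and the candidate numbers are made up for illustration.

```python
# baseline figures copied from the grade-cutoff and rate-cap outputs
BASELINES = {
    "grades_A_to_D": {
        "default_rate": 0.165447,
        "net_gains": 352_202_643.64,
        "deployed_capital": 5_094_368_475.0,
    },
    "rate_cap_13.5": {
        "default_rate": 0.122819,
        "net_gains": 266_735_565.15,
        "deployed_capital": 3_551_611_025.0,
    },
}


def beats_baseline(candidate: dict, baseline: dict) -> bool:
    """True if the candidate earns more without more defaults or more capital."""
    return (
        candidate["net_gains"] > baseline["net_gains"]
        and candidate["default_rate"] <= baseline["default_rate"]
        and candidate["deployed_capital"] <= baseline["deployed_capital"]
    )


# made-up portfolio produced by a hypothetical model
candidate = {
    "default_rate": 0.15,
    "net_gains": 360_000_000.0,
    "deployed_capital": 5_000_000_000.0,
}
results = {name: beats_baseline(candidate, b) for name, b in BASELINES.items()}
print(results)
```

A looser definition, such as comparing annualized return on deployed capital, may be more appropriate if the freed capital can be redeployed elsewhere.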
