# Prediction of Loan Default with a Classification Model
LendingClub is a leading peer-to-peer lending company. Peer-to-peer lending is disrupting the banking industry because it connects borrowers directly with potential lenders and investors. LendingClub specializes in small personal finance loans. In this notebook, you will build a classification model based on data from the LendingClub website. Classification models are mainly used to score the likelihood of an event occurring. For loan data, the model predicts whether a loan will be paid off in full or will need to be charged off and possibly go into default. You can use the model to score the quality of current loans and identify the ones most likely to default.

https://dato.com/learn/gallery/notebooks/predict-loan-default.html

## Import the Data into an SFrame
Import GraphLab Create and set the canvas target so that SFrames and SGraphs are displayed inside the IPython notebook.

In [1]:
import graphlab as gl
gl.canvas.set_target('ipynb')



Read in a 257 MB CSV file containing data for loan originations.

In [2]:
loans = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/lending_club/loanStats.csv')


[INFO] Start server at: ipc:///tmp/graphlab_server-93358 - Server binary: /Library/Python/2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1441762835.log
[INFO] GraphLab Server Version: 1.5.2


PROGRESS: Downloading http://s3.amazonaws.com/dato-datasets/lending_club/loanStats.csv to /var/tmp/graphlab-rbekbolatov/93358/000000.csv
PROGRESS: Finished parsing file http://s3.amazonaws.com/dato-datasets/lending_club/loanStats.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.647075 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,int,int,int,str,float,float,str,str,str,str,str,int,str,str,str,str,str,str,str,str,str,str,float,int,str,int,int,int,int,int,int,float,int,str,float,float,float,float,float,float,float,float,float,str,float,str,str,int,str,int,int,str,int,int,int,int,float,int,int,int,int,float,str,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 76455 lines. Lines per second: 66691.4
PROGRESS: Finished parsing 

In [3]:
loans

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade
1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2
1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4
1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5
1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1
1075358,1311748,3000,3000,3000,60 months,12.69,67.79,B,B5
1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4
1069639,1304742,7000,7000,7000,60 months,15.96,170.08,C,C5
1072053,1288686,3000,3000,3000,36 months,18.64,109.43,E,E1
1071795,1306957,5600,5600,5600,60 months,21.28,152.39,F,F2
1071570,1306721,5375,5375,5350,60 months,12.69,121.45,B,B5

emp_title,emp_length,home_ownership,annual_inc,is_inc_v,issue_d,loan_status
,10+ years,RENT,24000,Verified,20111201T000000,Fully Paid
Ryder,< 1 year,RENT,30000,Source Verified,20111201T000000,Charged Off
,10+ years,RENT,12252,Not Verified,20111201T000000,Fully Paid
AIR RESOURCES BOARD,10+ years,RENT,49200,Source Verified,20111201T000000,Fully Paid
University Medical Group,1 year,RENT,80000,Source Verified,20111201T000000,Current
Veolia Transportaton,3 years,RENT,36000,Source Verified,20111201T000000,Fully Paid
Southern Star Photography,8 years,RENT,47004,Not Verified,20111201T000000,Current
MKC Accounting,9 years,RENT,48000,Source Verified,20111201T000000,Fully Paid
,4 years,OWN,40000,Source Verified,20111201T000000,Charged Off
Starbucks,< 1 year,RENT,15000,Verified,20111201T000000,Charged Off

pymnt_plan,url,desc,purpose,title
n,https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/22/11 > I need to ...,credit_card,Computer
n,https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/22/11 > I plan to use ...,car,bike
n,https://www.lendingclub.com/browse/loanDetail. ...,,small_business,real estate business
n,https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/21/11 > to pay for ...,other,personel
n,https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/21/11 > I plan on ...,other,Personal
n,https://www.lendingclub.com/browse/loanDetail. ...,,wedding,My wedding loan I promise to pay back ...
n,https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/18/11 > I am planning ...,debt_consolidation,Loan
n,https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/16/11 > Downpayment ...,car,Car Downpayment
n,https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/21/11 > I own a small ...,small_business,Expand Business & Buy Debt Portfolio ...
n,https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/16/11 > I'm trying to ...,other,Building my credit history. ...

zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record
860xx,AZ,27.65,0,19850101T000000,1,,
309xx,GA,1.0,0,19990401T000000,5,,
606xx,IL,8.72,0,20011101T000000,2,,
917xx,CA,20.0,0,19960201T000000,1,35.0,
972xx,OR,17.94,0,19960101T000000,0,38.0,
852xx,AZ,11.2,0,20041101T000000,3,,
280xx,NC,23.51,0,20050701T000000,1,,
900xx,CA,5.35,0,20070101T000000,2,,
958xx,CA,5.55,0,20040401T000000,2,,
774xx,TX,18.08,0,20040901T000000,0,,

open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt
3,0,13648,83.7,9,f,0.0,0.0,5861.07
3,0,1687,9.4,4,f,0.0,0.0,1008.71
2,0,2956,98.5,10,f,0.0,0.0,3003.65
10,0,5598,21.0,37,f,0.0,0.0,12226.3
15,0,27783,53.9,38,f,1384.34,1384.34,2496.48
9,0,7963,28.3,12,f,0.0,0.0,5631.38
7,0,17726,85.6,11,f,3365.36,3365.36,6265.96
4,0,8221,87.5,4,f,0.0,0.0,3938.14
11,0,5210,32.6,13,f,0.0,0.0,646.02
2,0,9279,36.5,3,f,0.0,0.0,1476.19

total_pymnt_inv,...
5831.78,...
1008.71,...
3003.65,...
12226.3,...
2496.48,...
5631.38,...
6265.96,...
3938.14,...
646.02,...
1469.34,...


---
## Fit a Model to Inactive Loans
Restrict the analysis to loans that are no longer considered active, since these are the ones for which we have an outcome (fully paid, charged off, or in default). This is indicated by the feature inactive_loans. The response variable is bad_loans, which flags loans that are charged off or in actual default. Strip out rows with missing values in the training data for the set of variables used as predictors.

In [5]:
inactive_loans = loans[loans['inactive_loans']]
response = 'bad_loans'
features = ['grade',                     # grade of the loan (categorical)
            'sub_grade_num',             # sub-grade of the loan as a number from 0 to 1
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'payment_inc_ratio',         # ratio of the monthly payment to income
            'delinq_2yrs',               # number of delinquencies
            'delinq_2yrs_zero',          # no delinquencies in last 2 years
            'inq_last_6mths',            # number of creditor inquiries in last 6 months
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'open_acc',                  # number of open credit accounts
            'pub_rec',                   # number of derogatory public records
            'pub_rec_zero',              # no derogatory public records
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

inactive_loans, loans_with_na = inactive_loans[[response] + features].dropna_split()

print 'Dropping {} rows; keeping {} '.format(loans_with_na.num_rows(), 
                                             inactive_loans.num_rows())
inactive_loans.show()

Dropping 29 rows; keeping 122578 
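
To make the effect of dropna_split concrete, here is a minimal plain-Python sketch of the same idea: rows with a missing value in any of the chosen columns go into one list, complete rows into another. It does not use GraphLab; the function below, the use of None to mark a missing value, and the toy records are all assumptions made for illustration.

```python
def dropna_split(rows, columns):
    """Split rows into (complete, incomplete) lists.

    `rows` is a list of dicts. A value of None (or an absent key) in any
    of `columns` is treated as missing. This is an illustrative stand-in,
    not GraphLab's implementation.
    """
    complete, incomplete = [], []
    for row in rows:
        if any(row.get(col) is None for col in columns):
            incomplete.append(row)
        else:
            complete.append(row)
    return complete, incomplete


# Hypothetical toy records, not taken from the LendingClub data.
toy_rows = [
    {'bad_loans': 0, 'dti': 12.0, 'revol_util': 40.0},
    {'bad_loans': 1, 'dti': 3.5, 'revol_util': None},
    {'bad_loans': 0, 'dti': None, 'revol_util': 75.0},
    {'bad_loans': 0, 'dti': 20.0, 'revol_util': 21.0},
]
kept, dropped = dropna_split(toy_rows, ['bad_loans', 'dti', 'revol_util'])
```

Here `kept` holds the two complete rows and `dropped` holds the two rows with a missing value, mirroring the "keeping" and "dropping" counts printed above.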


Divide the inactive loans into two sets: a training set to fit the model and a test set to evaluate performance.

In [7]:
train_set, test_set = inactive_loans.random_split(0.8, seed=1)
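
A seeded random split can be sketched in plain Python as follows. This is an illustration under assumptions, not GraphLab's code: each row is independently assigned to the first set with the given probability, so the split is only approximately 80/20, and the fixed seed makes it repeatable. The function name and the toy row list are hypothetical.

```python
import random


def random_split(rows, fraction, seed=None):
    """Assign each row to the first list with probability `fraction`.

    Using a dedicated Random instance with a fixed seed makes the split
    reproducible across runs.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


# Hypothetical stand-in for the loan rows.
rows = list(range(1000))
train_rows, test_rows = random_split(rows, 0.8, seed=1)
```

Calling the function again with the same seed returns the same two lists, which is why fixing seed=1 in the notebook makes the later accuracy figures reproducible.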

---
Fit a logistic regression to the train_set. It is important to set the argument class_weights='auto'. This will adjust the training set to ensure the bad loans are more highly represented. Otherwise, the model will under-predict the probability of a bad loan.

In [9]:
base_model = gl.logistic_classifier.create(train_set, 
                                           target='bad_loans',  
                                           class_weights='auto', 
                                           features=features, 
                                           validation_set=None)
base_model.summary()

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 98060
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 18
PROGRESS: Number of unpacked features : 18
PROGRESS: Number of coefficients    : 36
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+
PROGRESS: | 1         | 2        | 1.202510     | 0.643198          |
PROGRESS: | 2         | 3        | 1.328448     | 0.646472          |
PROGRESS: | 3         | 4        | 1.468999     | 0.647216          |
PROGRESS: | 4         | 5        | 1.602687     | 0.647196          |
PROGRESS: | 5         | 6        | 1.728122     | 0.647196          |
PROGRESS: +-----------+----------+--------------+-------------------+
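
The idea behind class weighting can be sketched with a common inverse-frequency scheme, in which every class ends up with the same total weight. Whether class_weights='auto' uses exactly this formula is an assumption; the function and the toy labels below are hypothetical and written in plain Python.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to its frequency.

    weight(c) = n / (k * n_c), where n is the number of examples, k the
    number of classes and n_c the number of examples in class c.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return dict((label, n / float(k * count))
                for label, count in counts.items())


# Hypothetical labels: 80 good loans (0) and 20 bad loans (1).
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)
```

With these toy labels each bad loan counts four times as much as a good loan, so both classes contribute equally to the fit.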

---
Now apply the model to the holdout set.

In [10]:
base_model.evaluate(test_set)

{'accuracy': 0.6438942817521821, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  1666 |
 |      1       |        1        |  2958 |
 |      0       |        1        |  7065 |
 |      0       |        0        | 12829 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns]}

---
Overall, the prediction accuracy for the test set should be about 64%. The accuracy for the test set is roughly the same as for the training set, so you can be confident that the model is not over-fitting the data. In fact, the accuracy for logistic regression is about as good as the accuracy for other classifiers (e.g., boosted trees) for this data. For noisy data, it is generally true that simpler classifiers tend to perform as well as or better than more sophisticated techniques.
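
The accuracy figure can be re-derived from the confusion matrix shown in the evaluation output. The sketch below uses plain Python and the four counts from that table; the variable names are illustrative. It also computes precision and recall for the bad-loan class, which the evaluation output does not report directly.

```python
# Counts keyed by (target_label, predicted_label), copied from the
# confusion matrix in the evaluation output.
confusion = {
    (1, 0): 1666,
    (1, 1): 2958,
    (0, 1): 7065,
    (0, 0): 12829,
}

tp = confusion[(1, 1)]  # bad loans predicted bad
fn = confusion[(1, 0)]  # bad loans predicted good
fp = confusion[(0, 1)]  # good loans predicted bad
tn = confusion[(0, 0)]  # good loans predicted good

total = tp + fn + fp + tn
accuracy = (tp + tn) / float(total)   # share of all loans classified correctly
recall = tp / float(tp + fn)          # share of bad loans that were caught
precision = tp / float(tp + fp)       # share of flagged loans that were bad
```

The accuracy matches the reported 0.6439. Recall for bad loans is about 64%, while precision is about 30%, so most loans flagged as bad in the test set were in fact paid off.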

## Apply the Model to Active Loans
Now refit the model using all the inactive loans and use it to predict which active loans are most likely to be charged off or go into default.

In [11]:
base_model = gl.logistic_classifier.create(inactive_loans, 
                                           target='bad_loans',  
                                           class_weights='auto', 
                                           features=features, 
                                           validation_set=None,
                                           verbose=False)
active_loans = loans[1 - loans['inactive_loans']]
active_loans['prob_bad'] = base_model.predict(active_loans, output_type='probability')

---
Using the topk method, you can identify the riskiest 100 loans.

In [13]:
worst_loans = active_loans.topk('prob_bad', k=100)
worst_loans[['prob_bad'] + features].show()
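
Selecting the k highest-scoring rows does not require a full sort. As a plain-Python sketch of the same idea (not GraphLab's implementation), heapq.nlargest from the standard library returns the top k rows in descending order of a key. The function name and the toy scores below are hypothetical.

```python
import heapq


def topk(rows, column, k):
    """Return the k rows with the largest value in `column`, highest first."""
    return heapq.nlargest(k, rows, key=lambda row: row[column])


# Hypothetical scored loans, not taken from the LendingClub data.
scored = [
    {'id': 'a', 'prob_bad': 0.12},
    {'id': 'b', 'prob_bad': 0.81},
    {'id': 'c', 'prob_bad': 0.47},
    {'id': 'd', 'prob_bad': 0.93},
]
riskiest = topk(scored, 'prob_bad', k=2)
```

Here `riskiest` contains loan 'd' followed by loan 'b', the two rows with the highest predicted probability of going bad.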

## Wrap-up
In this notebook, in just a few lines of code, you fit a model to inactive loans to predict default. You then applied this model to predict which loans are most likely to be charged off or go into default. This allows you to identify the riskiest loans and possibly take preemptive action to avoid charging off the loan.