### **DSBD Project 2 Proposal**
### **LendingClub Peer-to-Peer Lending Default Prediction**
#### Team 14: Yunhan (Claire) Xu, Qiyang (Cathy) Chen

Peer-to-Peer (P2P) lending, which emerged in the mid-2000s, allows individuals (borrowers) to obtain loans directly from other individuals (investors) through platforms such as LendingClub, Prosper, and Upstart. This kind of "social lending" has become an alternative financing approach that cuts out the financial institution as the middleman. P2P lending lets investors earn a higher return than a bank deposit would offer. However, credit risk is one of the biggest concerns for investors making investment decisions on a P2P platform: when a borrower defaults, that is, fails to make the required payments by the agreed date, the investor incurs a loss.

In this project, we plan to improve LendingClub's ability to identify borrowers who are likely to default by constructing a credit risk assessment system. Specifically, we will build machine-learned classification models trained on [LendingClub's historical loan data](https://www.kaggle.com/ethon0426/lending-club-20072020q1) to predict loan default. The dataset includes the loan grade, which reflects the credit risk derived from the credit report and loan application. We plan to use the loan grade as the baseline and compare it with our model's performance. In particular, we will compute the default rate for loan grades A to C and compare it with our precision (among all the loans that we predict as defaulted, the share that actually defaulted).
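The grade baseline could be computed roughly as follows. This is a minimal sketch on a small made-up frame, not the project's actual code: the `grade` and `loan_status` columns come from the dataset, while the helper `default_rate_by_grade` is a hypothetical name, and the set of defaulted statuses is assumed to match the definition given in the Data section.

```python
import pandas as pd

# Assumption: the statuses counted as defaulted follow the Data section.
DEFAULT_STATUSES = {
    "Charged Off",
    "Default",
    "Late (16-30 days)",
    "Late (31-120 days)",
}


def default_rate_by_grade(df: pd.DataFrame) -> pd.Series:
    """Return the share of defaulted loans within each loan grade."""
    is_default = df["loan_status"].isin(DEFAULT_STATUSES)
    return is_default.groupby(df["grade"]).mean()


# Toy data for illustration only (not real LendingClub records).
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "C", "C", "C", "C"],
    "loan_status": [
        "Fully Paid", "Fully Paid",
        "Fully Paid", "Charged Off",
        "Charged Off", "Fully Paid", "Late (31-120 days)", "Fully Paid",
    ],
})

rates = default_rate_by_grade(toy)
print(rates)
```

On the real data, the per-grade rates for A to C would serve as the reference values against which the model is judged.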

The domain of our project is as follows:
- Task (T) - Classification task that predicts defaulted loans
- Performance Measure (P) - % of defaulted loans correctly classified
    - Precision: among all the loans that we predicted as defaulted, how many of them actually defaulted
    - Recall: among all the loans that actually defaulted, how many of them we successfully identified
    - F-1 Score: harmonic mean of precision and recall
- Experience (E) - LendingClub database of loan records with pre-specified classifications of loan grade
<!-- machine-learned classification models (SVM, logistic regression, decision trees, random forests, XGBoost, etc.) -->

For the classification task of identifying borrowers who are likely to default on LendingClub loans, we will train our models on the LendingClub loan records with the pre-specified loan grade classifications removed, and use precision, recall, and F-1 score as performance metrics to measure how well we complete the task.
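To make the three metrics concrete, here is a small, dependency-free sketch that computes them from true and predicted labels. The function name `precision_recall_f1` and the toy labels are illustrative assumptions; a library implementation could be used instead in the actual project.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (1 = defaulted)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults we caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults we missed

    # Precision: share of predicted defaults that really defaulted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual defaults that were predicted as defaults.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case three of the five predicted defaults are real (precision 0.6) and three of the four actual defaults are found (recall 0.75).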

### Data

The original data contains all the LendingClub loan data from 2007 to 2018, and the corresponding 151 features are related to borrowers' credit history and loan characteristics. We performed some basic data cleaning beforehand:
- For scoping purposes, we focus only on data from the latest four years, i.e., 2015-2018.
- There are seven loan statuses: Charged Off, Current, Default, Fully Paid, In Grace Period, Late (16-30 days), Late (31-120 days). As we don't know whether a "Current" loan will eventually default, we disregard those loans. We consider Late (16-30 days), Late (31-120 days), Default, and Charged Off as defaulted loans and Fully Paid as a desirable loan.
- We dropped 25 irrelevant features, such as "url", "zip_code", "IDs", etc. 
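The cleaning steps above might look roughly like the sketch below. It is not the code that produced `LC_15_18_new.csv`; it runs on a tiny made-up frame. The function `clean_loans` and the `is_default` column are hypothetical names, the `issue_d` format is assumed to be `Dec-15` as in the notebook output, and only the two dropped columns named in the text are listed because the full set of 25 is not given.

```python
import pandas as pd

# Label mapping taken from the status definitions above (1 = defaulted).
STATUS_TO_LABEL = {
    "Fully Paid": 0,
    "Charged Off": 1,
    "Default": 1,
    "Late (16-30 days)": 1,
    "Late (31-120 days)": 1,
}
# Only the column names quoted in the text; the full list is not specified.
IRRELEVANT_COLUMNS = ["url", "zip_code"]


def clean_loans(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep 2015-2018 loans that are not Current and add a binary label."""
    df = raw.copy()
    # Assumption: issue_d looks like "Dec-15" (month abbreviation, 2-digit year).
    issued = pd.to_datetime(df["issue_d"], format="%b-%y")
    df["issue_year"] = issued.dt.year
    # Statuses without a mapping (e.g. "In Grace Period") get a missing label.
    df["is_default"] = df["loan_status"].map(STATUS_TO_LABEL)

    keep = df["issue_year"].between(2015, 2018) & (df["loan_status"] != "Current")
    return df[keep].drop(columns=IRRELEVANT_COLUMNS, errors="ignore")


# Toy data for illustration only.
toy = pd.DataFrame({
    "issue_d": ["Dec-15", "Jan-14", "Mar-17", "Jun-18", "Feb-16"],
    "loan_status": ["Fully Paid", "Charged Off", "Current",
                    "Late (16-30 days)", "In Grace Period"],
    "url": ["u1", "u2", "u3", "u4", "u5"],
})

cleaned = clean_loans(toy)
print(cleaned[["issue_year", "loan_status", "is_default"]])
```

The text does not assign In Grace Period loans to either class, so the sketch leaves their label missing rather than guessing.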

In [1]:
import pandas as pd
data = pd.read_csv('LC_15_18_new.csv')


Let's take a look at the data:

In [2]:
data.head()

| Unnamed: 0 | funded_amnt | term | int_rate | grade | sub_grade | emp_length | home_ownership | annual_inc | verification_status | issue_d | ... | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term | issue_month | issue_year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3600 | 36 months | 13.99 | C | C4 | 10+ years | MORTGAGE | 55000.0 | Not Verified | Dec-15 | ... | Cash | N |  |  |  |  |  |  | Dec | 2015 |
| 1 | 24700 | 36 months | 11.99 | C | C1 | 10+ years | MORTGAGE | 65000.0 | Not Verified | Dec-15 | ... | Cash | N |  |  |  |  |  |  | Dec | 2015 |
| 2 | 20000 | 60 months | 10.78 | B | B4 | 10+ years | MORTGAGE | 63000.0 | Not Verified | Dec-15 | ... | Cash | N |  |  |  |  |  |  | Dec | 2015 |
| 3 | 10400 | 60 months | 22.45 | F | F1 | 3 years | MORTGAGE | 104433.0 | Source Verified | Dec-15 | ... | Cash | N |  |  |  |  |  |  | Dec | 2015 |
| 4 | 11950 | 36 months | 13.44 | C | C3 | 4 years | RENT | 34000.0 | Source Verified | Dec-15 | ... | Cash | N |  |  |  |  |  |  | Dec | 2015 |


In [4]:
# Check the data size
data.shape

(927931, 128)

The data has 927,931 rows and 128 columns.

In [6]:
# Check the data size for each issue year
data.groupby('issue_year').size()

issue_year
2015    377796
2016    300346
2017    181728
2018     68061
dtype: int64

It makes sense that the most recent years include fewer records: many of those loans have not yet reached maturity, so whether they will end up fully paid or defaulted is still unknown, and such "Current" loans were excluded during cleaning.

In [7]:
# Check the data size for each loan status
data.groupby('loan_status').size()

loan_status
Charged Off           192060
Default                   39
Fully Paid            702191
In Grace Period         8235
Late (16-30 days)       4276
Late (31-120 days)     21130
dtype: int64

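75.67% of the loans are fully paid, while the remaining 24.33% fall into the other statuses. Note that this 24.33% includes the 8,235 In Grace Period loans; counting only Charged Off, Default, Late (16-30 days), and Late (31-120 days) as defaulted gives 23.44%.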
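The class shares can be rechecked directly from the status counts printed above. The sketch below hard-codes those counts and assumes the defaulted statuses are the four named in the cleaning rules; In Grace Period is reported separately because the text does not assign it to either class.

```python
# Counts copied from the loan_status output above.
status_counts = {
    "Charged Off": 192060,
    "Default": 39,
    "Fully Paid": 702191,
    "In Grace Period": 8235,
    "Late (16-30 days)": 4276,
    "Late (31-120 days)": 21130,
}
# Assumption: defaulted statuses as defined in the cleaning rules.
DEFAULT_STATUSES = (
    "Charged Off",
    "Default",
    "Late (16-30 days)",
    "Late (31-120 days)",
)

total = sum(status_counts.values())
fully_paid_share = status_counts["Fully Paid"] / total
defaulted_share = sum(status_counts[s] for s in DEFAULT_STATUSES) / total
grace_share = status_counts["In Grace Period"] / total

print(f"total loans:     {total}")                  # 927931
print(f"fully paid:      {fully_paid_share:.2%}")   # 75.67%
print(f"defaulted:       {defaulted_share:.2%}")    # 23.44%
print(f"in grace period: {grace_share:.2%}")        # 0.89%
```

Either way the classes are imbalanced at roughly three fully paid loans for every defaulted one, which is worth keeping in mind when interpreting precision and recall.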