<img src="./result/logo.png" alt="Drawing" align="left" style="width: 600px;"/>

# Lending Club Loan Data Analysis and Modeling

Classification is one of the two most common data science problems (the other is regression). In supervised classification, imbalanced data are common yet very challenging.

For example, credit card fraud detection, disease classification, and network intrusion detection are all classification problems with imbalanced data.

In this project, working with the Lending Club loan data, we hope to correctly predict whether or not a loan will default, using historical data.

# 0. Contents
For a traditional data science project, there are some common steps:

1. Problem Statement
    - Hypothesis and Goal
2. Data Collection
    - Can take $70\%$ of the total time for some real-world projects
3. Data Cleaning
    - Business Sense
    - Data Exploration
    - And so on
4. Visualization and Feature Engineering
    - Categorical vs. Numerical features
    - Missing Values
    - Feature Transformation
    - Feature Normalization
    - And so on
5. Machine Learning
    - Logistic Regression
    - Random Forest
    - Boosting
    - Neural Networks
    - And so on
6. Conclusions

Further, the feature engineering and machine learning steps are usually iterative. You may need to go through several rounds before the modeling part is finished.

# 1. Problem Statement

For companies like Lending Club, correctly predicting whether or not a loan will default in the future is very important. In this project, we use historical data, specifically the Lending Club loan data from 2007 to 2018, to build a machine learning model that predicts the chance of default for future loans.

As I will show later, this dataset is highly imbalanced and includes a lot of features, which makes this problem more challenging. 
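To make the difficulty concrete, here is a minimal sketch of why plain accuracy is misleading on imbalanced labels. The 5% default rate and the helper names `accuracy` and `recall` are illustrative assumptions, not figures or code from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual = sum(1 for t in y_true if t == positive)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return hits / actual if actual else 0.0


# Hypothetical labels: 1 = default, 0 = no default (5% positives, made up for illustration).
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts the majority class.
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)  # 0.95
baseline_recall = recall(y_true, y_pred)      # 0.0
```

The always-negative baseline looks accurate yet never catches a default, which is why metrics such as recall, precision, or AUC are usually more informative for this kind of problem.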

# 2. Data Collection

There are several ways to download the dataset. For example, you can get it from Lending Club's [website](https://www.lendingclub.com/info/download-data.action) or from [Kaggle](https://www.kaggle.com/wendykan/lending-club-loan-data).

For this project, I downloaded the data from [Lending Club's website](https://www.lendingclub.com/info/download-data.action).

The original data is split across several csv files and has $145$ features. Based on a simple exploration and my understanding of the data, I did an initial round of data cleaning. More specifically:
* Keep only the data from 2014 to 2018 for model building
* Remove features with a large amount of missing values
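The two steps above can be sketched with pandas as follows. This is only an illustration, not the notebook's actual code: the function name `initial_clean`, the `max_missing_frac=0.5` cutoff, and the assumption that `issue_d` is formatted like `Dec-2015` are all hypothetical.

```python
import pandas as pd


def initial_clean(df, first_year=2014, last_year=2018, max_missing_frac=0.5):
    """Keep loans issued in [first_year, last_year] and drop sparse columns.

    Assumptions (not taken from the notebook): `issue_d` looks like
    "Dec-2015", and a column is dropped when more than `max_missing_frac`
    of its values are missing.
    """
    issued = pd.to_datetime(df["issue_d"], format="%b-%Y")
    in_range = issued.dt.year.between(first_year, last_year)
    df = df.loc[in_range]

    # Fraction of missing values per column, computed after the year filter.
    missing_frac = df.isna().mean()
    keep = missing_frac[missing_frac <= max_missing_frac].index
    return df[keep]
```

The missing-value fraction is computed after filtering by year, so a column that is only sparsely populated in the older data is not dropped unnecessarily.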

The full procedure is in the Jupyter Notebook [1. Data Collection and Cleaning]. After the above two steps, there are 87 features left, which are listed below.

| Attribute                  | Explanation                                                       |
| -------------------------- | ----------------------------------------------------------------- |
| loan_amnt                  | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
| funded_amnt                | The total amount committed to that loan at that point in time. |
| funded_amnt_inv            | The total amount committed by investors for that loan at that point in time. |
| term                       | The number of payments on the loan. Values are in months and can be either 36 or 60. |
| int_rate                   | Interest Rate on the loan |
| installment                | The monthly payment owed by the borrower if the loan originates. |
| grade                      | LC assigned loan grade |
| sub_grade                  | LC assigned loan subgrade |
| emp_title                  | The job title supplied by the Borrower when applying for the loan. |
| emp_length                 | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
| home_ownership             | The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER. |
| annual_inc                 | The self-reported annual income provided by the borrower during registration. |
| verification_status        | Indicates if income was verified by LC, not verified, or if the income source was verified |
| issue_d                    | The month which the loan was funded |
| loan_status                | Current status of the loan |
| pymnt_plan                 | Indicates if a payment plan has been put in place for the loan |
| purpose                    | A category provided by the borrower for the loan request.  |
| title                      | The loan title provided by the borrower |
| zip_code                   | The first 3 numbers of the zip code provided by the borrower in the loan application. |
| addr_state                 | The state provided by the borrower in the loan application |
| dti                        | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| delinq_2yrs                | The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years |
| earliest_cr_line           | The date the borrower's earliest reported credit line was opened |
| inq_last_6mths             | The number of inquiries in past 6 months (excluding auto and mortgage inquiries) |
| open_acc                   | The number of open credit lines in the borrower's credit file. |
| pub_rec                    | Number of derogatory public records |
| revol_bal                  | Total credit revolving balance |
| revol_util                 | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
| total_acc                  | The total number of credit lines currently in the borrower's credit file |
| initial_list_status        | The initial listing status of the loan. Possible values are: W, F |
| out_prncp                  | Remaining outstanding principal for total amount funded |
| out_prncp_inv              | Remaining outstanding principal for portion of total amount funded by investors |
| total_pymnt                | Payments received to date for total amount funded |
| total_pymnt_inv            | Payments received to date for portion of total amount funded by investors |
| total_rec_prncp            | Principal received to date |
| total_rec_int              | Interest received to date |
| total_rec_late_fee         | Late fees received to date |
| recoveries                 | Post charge-off gross recovery |
| collection_recovery_fee    | Post charge-off collection fee |
| last_pymnt_d               | Last month payment was received |
| last_pymnt_amnt            | Last total payment amount received |
| last_credit_pull_d         | The most recent month LC pulled credit for this loan |
| collections_12_mths_ex_med | Number of collections in 12 months excluding medical collections |
| policy_code                | Publicly available products have policy_code=1; new products not publicly available have policy_code=2 |
| application_type           | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
| acc_now_delinq             | The number of accounts on which the borrower is now delinquent. |
| tot_coll_amt               | Total collection amounts ever owed |
| tot_cur_bal                | Total current balance of all accounts |
| total_rev_hi_lim           | Total revolving high credit/credit limit |
| acc_open_past_24mths       | Number of trades opened in past 24 months. |
| avg_cur_bal                | Average current balance of all accounts |
| bc_open_to_buy             | Total open to buy on revolving bankcards. |
| bc_util                    | Ratio of total current balance to high credit/credit limit for all bankcard accounts. |
| chargeoff_within_12_mths   | Number of charge-offs within 12 months |
| delinq_amnt                | The past-due amount owed for the accounts on which the borrower is now delinquent. |
| mo_sin_old_il_acct         | Months since oldest installment account opened |
| mo_sin_old_rev_tl_op       | Months since oldest revolving account opened |
| mo_sin_rcnt_rev_tl_op      | Months since most recent revolving account opened |
| mo_sin_rcnt_tl             | Months since most recent account opened |
| mort_acc                   | Number of mortgage accounts. |
| mths_since_recent_bc       | Months since most recent bankcard account opened. |
| mths_since_recent_inq      | Months since most recent inquiry. |
| num_accts_ever_120_pd      | Number of accounts ever 120 or more days past due |
| num_actv_bc_tl             | Number of currently active bankcard accounts |
| num_actv_rev_tl            | Number of currently active revolving trades |
| num_bc_sats                | Number of satisfactory bankcard accounts |
| num_bc_tl                  | Number of bankcard accounts |
| num_il_tl                  | Number of installment accounts |
| num_op_rev_tl              | Number of open revolving accounts |
| num_rev_accts              | Number of revolving accounts |
| num_rev_tl_bal_gt_0        | Number of revolving trades with balance >0 |
| num_sats                   | Number of satisfactory accounts |
| num_tl_120dpd_2m           | Number of accounts currently 120 days past due (updated in past 2 months) |
| num_tl_30dpd               | Number of accounts currently 30 days past due (updated in past 2 months) |
| num_tl_90g_dpd_24m         | Number of accounts 90 or more days past due in last 24 months |
| num_tl_op_past_12m         | Number of accounts opened in past 12 months |
| pct_tl_nvr_dlq             | Percent of trades never delinquent |
| percent_bc_gt_75           | Percentage of all bankcard accounts > 75% of limit. |
| pub_rec_bankruptcies       | Number of public record bankruptcies |
| tax_liens                  | Number of tax liens |
| tot_hi_cred_lim            | Total high credit/credit limit |
| total_bal_ex_mort          | Total credit balance excluding mortgage |
| total_bc_limit             | Total bankcard high credit/credit limit |
| total_il_high_credit_limit | Total installment high credit/credit limit |
| hardship_flag              | Flags whether or not the borrower is on a hardship plan |
| disbursement_method        | The method by which the borrower receives their loan. Possible values are: CASH, DIRECT_PAY |
| debt_settlement_flag       | Flags whether or not the borrower, who has charged off, is working with a debt-settlement company |

# 3. Data Cleaning

# 4. Visualization and Feature Engineering

# 5. Machine Learning Modeling

## Imbalanced Data
+ Anomaly Detection or Outlier Analysis
+ Over-sampling
+ Under-sampling
+ SMOTE and ADASYN
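As a rough illustration of the two simplest options in the list above, here is a minimal standard-library sketch of random over-sampling and random under-sampling for a binary target. The function names are hypothetical, and the code assumes the class labeled `minority` really is the smaller one. SMOTE and ADASYN, which synthesize new minority samples instead of duplicating existing ones, are not shown here.

```python
import random


def _split_indices(labels, minority):
    """Return (minority_indices, majority_indices) for a binary label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = majority_idx + minority_idx + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


def random_undersample(rows, labels, minority=1, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    kept = rng.sample(majority_idx, len(minority_idx))
    idx = kept + minority_idx
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]
```

Either kind of resampling should normally be applied to the training split only, so that the validation and test sets keep the original class distribution.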

## Models
+ Logistic Regression
+ Random Forest
+ Boosting
+ Hierarchical Model
+ And so on
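As one concrete instance of the first model in the list, here is a minimal pure-Python sketch of logistic regression trained by batch gradient descent, with an optional weight on the positive (default) class as a simple way to account for imbalance. The function names, the learning rate, the epoch count, and the toy data are illustrative assumptions. A real run on the loan data would more likely use a library implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=500, pos_weight=1.0):
    """Minimize the (class-weighted) log loss with batch gradient descent."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Gradient of the weighted log loss w.r.t. the logit is weight * (p - y).
            weight = pos_weight if yi == 1 else 1.0
            err = weight * (predict_proba(w, b, xi) - yi)
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy, made-up data: one feature, 4 negatives and 2 positives.
X_demo = [[-2.0], [-1.5], [-1.0], [-0.5], [1.0], [2.0]]
y_demo = [0, 0, 0, 0, 1, 1]
w_demo, b_demo = fit_logistic(X_demo, y_demo, pos_weight=2.0)
```

Setting `pos_weight` to the ratio of negatives to positives (2.0 in the toy data) makes both classes contribute equally to the loss, which is an alternative to resampling the data.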

# 6. Conclusions