# Predicting Return on Investment (ROI) of Consumer Loans

**Andrew Nicholls** | Email: andrew.s.nicholls@gmail.com | [Github](https://github.com/Booleans)

If you are viewing this notebook on GitHub, I recommend using the following nbviewer link instead to ensure proper formatting and working interactive charts.

[nbviewer link](https://nbviewer.jupyter.org/github/Booleans/Lending-Club-Loan-Analysis/blob/master/Loan_Analysis_Regression.ipynb)

## Summary

**Contents:**

1. [Problem Definition and Background Information](#1)
2. [Data Preparation: Wrangling, Cleaning, and Feature Extraction](#2)
3. [Exploratory Data Analysis](#3)
4. [Machine Learning Models](#4)
5. [Results and Findings](#5)
<a id='1'></a>

# 1. Introduction

### Problem Definition

LendingClub Corporation operates an online marketplace that connects borrowers and investors in the United States. Its marketplace facilitates various types of loan products for consumers and small businesses, including unsecured personal loans, super prime consumer loans, unsecured education and patient finance loans, and unsecured small business loans. The company also offers investors an opportunity to invest in a range of loans based on term and credit characteristics. However, borrowers default on many of the loans issued through Lending Club. The goal of this notebook is to use the available historical loan data to build a model that predicts the return on investment of a new loan.
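
The notebook does not define ROI at this point, so the following is only a minimal sketch of one common definition: total payments received minus the amount funded, divided by the amount funded. The column names `funded_amnt` and `total_pymnt` and the helper `simple_roi` are assumptions made for illustration; the target actually used later in the analysis may be computed differently (for example, annualized).

```python
import pandas as pd


def simple_roi(total_received: pd.Series, amount_funded: pd.Series) -> pd.Series:
    """Return (total received - amount funded) / amount funded for each loan.

    Hypothetical helper for illustration; not taken from the notebook's src/ code.
    """
    return (total_received - amount_funded) / amount_funded


# Toy data with assumed column names; the values are made up.
example = pd.DataFrame({
    "funded_amnt": [5000.0, 2500.0],
    "total_pymnt": [5500.0, 1000.0],
})
example["roi"] = simple_roi(example["total_pymnt"], example["funded_amnt"])
print(example)
```

Under this definition a fully repaid loan with interest has a positive ROI, while a loan that defaults early has a negative one.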

### Files Provided

Lending Club provides CSV files of historical data for its loans. Each file contains complete loan data for all loans issued during the stated time period, including the loan status (Current, Late, Fully Paid, etc.) and the latest payment information. The files can be downloaded from the [Lending Club Statistics Page](https://www.lendingclub.com/info/download-data.action). As of the creation of this project, the latest data available from Lending Club was for Q1 2018.

For definitions of the fields contained in the historical data, please see the [Lending Club Data Dictionary](https://github.com/Booleans/consumer-loan-survival-analysis/blob/master/data/LCDataDictionary.xlsx?raw=true).

The LoanStats3a file also contains information on loan applications that were declined and never issued. I have discarded those rows, as they are not relevant to predicting loan defaults. I have also removed the last two rows of every spreadsheet, as they contain aggregate information on the number of loans in the file.
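
As a rough illustration of dropping the trailing summary rows, the sketch below uses the `skipfooter` argument of `pandas.read_csv`, which requires the Python parsing engine. The in-memory CSV, its column subset, and the wording of the footer rows are made up for the example; the notebook's own loading code is not shown here and may take a different approach.

```python
import io

import pandas as pd

# Toy stand-in for a loan CSV: two data rows followed by two summary rows.
raw_csv = io.StringIO(
    "id,loan_amnt,loan_status\n"
    "1,5000,Fully Paid\n"
    "2,2500,Charged Off\n"
    "Summary row one (made up),,\n"
    "Summary row two (made up),,\n"
)

# skipfooter drops the last N lines of the file; it needs engine="python".
loans_demo = pd.read_csv(raw_csv, skipfooter=2, engine="python")
print(loans_demo)
```

Dropping the footer at read time keeps the `id` column numeric, since the summary text never enters the DataFrame.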

In [1]:
import datetime
import pickle
from datetime import datetime as dt

import numpy as np
import pandas as pd

# Print small floats in fixed-point notation instead of scientific notation.
np.set_printoptions(suppress=True)

%run src/columns.py
%run src/data-cleaning.py
%run src/feature-engineering.py

Read in the raw data. For now, load it from the pickled DataFrame rather than from the original CSV files.

In [2]:
loans_original = pd.read_pickle('data/raw_dataframe.pkl.bz2')
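
How `raw_dataframe.pkl.bz2` was produced is not shown in this notebook. As a hedged sketch, a compressed pickle like it can be written with `DataFrame.to_pickle`, where pandas infers bz2 compression from the file extension, and read back with `pd.read_pickle`. The toy frame and temporary path below are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for the full raw loan data.
frame = pd.DataFrame({"id": [1077501, 1077430], "loan_amnt": [5000.0, 2500.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "raw_dataframe.pkl.bz2")
    frame.to_pickle(path)            # compression inferred from the ".bz2" suffix
    restored = pd.read_pickle(path)  # decompression inferred the same way

print(restored.equals(frame))
```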

In [3]:
loans_original.head()

| Unnamed: 0 | id | loan_amnt | term | int_rate | installment | grade | emp_length | home_ownership | annual_inc | verification_status | ... | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 5000.0 | 36 months | 10.65% | 162.87 | B | 10+ years | RENT | 24000.0 | Verified | ... |  |  |  |  | 0.0 | 0.0 |  |  |  |  |
| 1 | 1077430 | 2500.0 | 60 months | 15.27% | 59.83 | C | < 1 year | RENT | 30000.0 | Source Verified | ... |  |  |  |  | 0.0 | 0.0 |  |  |  |  |
| 2 | 1077175 | 2400.0 | 36 months | 15.96% | 84.33 | C | 10+ years | RENT | 12252.0 | Not Verified | ... |  |  |  |  | 0.0 | 0.0 |  |  |  |  |
| 3 | 1076863 | 10000.0 | 36 months | 13.49% | 339.31 | C | 10+ years | RENT | 49200.0 | Source Verified | ... |  |  |  |  | 0.0 | 0.0 |  |  |  |  |
| 4 | 1075358 | 3000.0 | 60 months | 12.69% | 67.79 | B | 1 year | RENT | 80000.0 | Source Verified | ... |  |  |  |  | 0.0 | 0.0 |  |  |  |  |


In [4]:
loans = loans_original.sample(n=30000)
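
Because `sample` draws rows at random, the rows shown in later outputs will differ from run to run. If reproducibility matters, `DataFrame.sample` accepts a `random_state` argument. The sketch below uses a made-up frame and an arbitrary seed; the notebook itself does not set one.

```python
import pandas as pd

# Made-up population standing in for loans_original.
population = pd.DataFrame({"id": range(100)})

# The same seed yields the same rows on every call.
sample_a = population.sample(n=10, random_state=42)
sample_b = population.sample(n=10, random_state=42)

print(sample_a.equals(sample_b))
```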

In [5]:
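# Cleaning helpers are defined in src/data-cleaning.py.
loans = drop_loan_status(loans)
loans = drop_joint_applicant_loans(loans)
# Convert percentage strings such as '10.65%' to numeric values.
loans = fix_rate_cols(loans)
# Rows without an issue date cannot be parsed by fix_date_cols.
loans.dropna(subset=['issue_d'], inplace=True)
loans = fix_date_cols(loans)
# Convert terms such as '36 months' to integers, then keep 36-month loans only.
loans = clean_loan_term_col(loans)
loans = only_include_36_month_loans(loans)
loans = clean_employment_length(loans)
# Flag missing values in *_missing columns before filling them with a sentinel.
loans = create_missing_data_boolean_columns(loans)
loans = fill_nas(loans, value=-99)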
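
The helper functions above live in `src/data-cleaning.py`, which is not reproduced in this notebook. The sketch below is a guess at the kind of transformations involved, inferred only from the before and after outputs (`'10.65%'` becoming a number, `'36 months'` becoming `36`, `'10+ years'` becoming `10.0`, and the `*_missing` indicator columns). The function `clean_sketch` and its regular expressions are hypothetical and are not the author's implementation.

```python
import pandas as pd

# Made-up rows that mimic the raw string formats shown in the first output.
raw = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "int_rate": ["10.65%", "15.27%", "15.96%"],
    "emp_length": ["10+ years", "< 1 year", None],
})


def clean_sketch(df: pd.DataFrame, fill_value: float = -99) -> pd.DataFrame:
    """Hypothetical stand-in for the notebook's cleaning helpers."""
    out = df.copy()
    # '10.65%' -> 10.65
    out["int_rate"] = out["int_rate"].str.rstrip("%").astype(float)
    # '36 months' -> 36, then keep only 36-month loans
    out["term"] = out["term"].str.extract(r"(\d+)", expand=False).astype(int)
    out = out[out["term"] == 36].copy()
    # '10+ years' -> 10.0 (how '< 1 year' should map is an assumption: here, 1.0)
    out["emp_length"] = out["emp_length"].str.extract(r"(\d+)", expand=False).astype(float)
    # Record which values were missing before they are overwritten
    out["emp_length_missing"] = out["emp_length"].isna().astype(int)
    # Replace remaining missing values with a sentinel
    return out.fillna(fill_value)


cleaned = clean_sketch(raw)
print(cleaned)
```

Creating the indicator columns before filling matters: once missing values are replaced by `-99`, the information about which values were originally absent would otherwise be lost.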

In [6]:
loans.head()

| Unnamed: 0 | id | loan_amnt | term | int_rate | installment | grade | emp_length | home_ownership | annual_inc | verification_status | ... | num_tl_30dpd_missing | num_tl_90g_dpd_24m_missing | num_tl_op_past_12m_missing | pct_tl_nvr_dlq_missing | percent_bc_gt_75_missing | pub_rec_bankruptcies_missing | tot_hi_cred_lim_missing | total_bal_ex_mort_missing | total_bc_limit_missing | total_il_high_credit_limit_missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 70537 | 96728210 | 35000.0 | 36 | 13.49 | 1187.57 | C | 10.0 | MORTGAGE | 242000.0 | Verified | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 43591 | 7341146 | 10000.0 | 36 | 15.1 | 347.15 | C | 7.0 | MORTGAGE | 220000.0 | Verified | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 51109 | 124820645 | 8000.0 | 36 | 6.72 | 246.0 | A | 10.0 | MORTGAGE | 41500.0 | Not Verified | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 95097 | 76331622 | 22000.0 | 36 | 5.32 | 662.53 | A | 9.0 | OWN | 160000.0 | Source Verified | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 84715 | 131399632 | 8000.0 | 36 | 6.71 | 245.96 | A | 3.0 | MORTGAGE | 68000.0 | Source Verified | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [9]:
add_supplemental_rate_data(loans)

KeyError: 'year'
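
The body of `add_supplemental_rate_data` is not shown, so the cause of this error cannot be confirmed from the notebook. A `KeyError: 'year'` is what pandas raises when code looks up a column named `year` that does not exist. One possible explanation, offered purely as an assumption, is that the function expects a `year` column derived from `issue_d`. The sketch below reproduces that kind of failure on made-up data and shows one way to derive the column; `rates_demo` and `benchmark_rate` are hypothetical names.

```python
import pandas as pd

# Made-up frames; benchmark_rate values are illustrative only.
loans_demo = pd.DataFrame({"issue_d": pd.to_datetime(["2011-12-01", "2015-06-01"])})
rates_demo = pd.DataFrame({"year": [2011, 2015], "benchmark_rate": [1.0, 2.0]})

# Looking up a column that does not exist raises KeyError.
try:
    loans_demo["year"]
    raised = False
except KeyError:
    raised = True

# Deriving the column from the datetime issue date makes a join on it possible.
loans_demo["year"] = loans_demo["issue_d"].dt.year
merged = loans_demo.merge(rates_demo, on="year", how="left")
print(merged)
```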