# Predicting Return on Investment (ROI) of Consumer Loans

**Andrew Nicholls** | Email: andrew.s.nicholls@gmail.com | [Github](https://github.com/Booleans)

If you are viewing this notebook on GitHub, I recommend using the nbviewer link below instead, which ensures proper formatting and working interactive charts.

[nbviewer link](https://nbviewer.jupyter.org/github/Booleans/Lending-Club-Loan-Analysis/blob/master/Loan_Analysis_Regression.ipynb)

## Summary

**Contents:**

1. [Problem Definition and Background Information](#1)
2. [Data Preparation: Wrangling, Cleaning, and Feature Extraction](#2)
3. [Exploratory Data Analysis](#3)
4. [Machine Learning Models](#4)
5. [Results and Findings](#5)
<a id='1'></a>

# 1. Introduction

### Problem Definition

LendingClub Corporation operates an online marketplace that connects borrowers and investors in the United States. The marketplace facilitates several types of loan products for consumers and small businesses, including unsecured personal loans, super prime consumer loans, unsecured education and patient finance loans, and unsecured small business loans. The company also lets investors invest in a range of loans based on term and credit characteristics. However, many loans issued through Lending Club end in default. The goal of this notebook is to use the available historical loan data to build a model that predicts the return on investment a new loan will generate.

### Files Provided

Lending Club provides CSV files of historical data for its loans. Each file contains complete loan data for all loans issued during the stated time period, including the loan status (Current, Late, Fully Paid, etc.) and the latest payment information. The files can be downloaded from the [Lending Club Statistics Page](https://www.lendingclub.com/info/download-data.action). When this project was created, the latest data available from Lending Club was for Q1 2018.

For information on the definitions of fields contained within the historical data please see the [Lending Club Data Dictionary](https://github.com/Booleans/consumer-loan-survival-analysis/blob/master/data/LCDataDictionary.xlsx?raw=true).

The LoanStats3a file also contains information on loan applications that were declined and never issued. I have discarded those rows of data as they are not relevant to predicting loan defaults. I have also removed the last 2 rows of every spreadsheet, as they contained aggregate information on the number of loans in the file.
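The cleanup described above can be sketched roughly as follows. This is a minimal illustration on a tiny in-memory stand-in, not the code used in this project: the column names, the footer text, and the "Does not meet the credit policy" prefix used to flag declined applications are all assumptions about the file layout.

```python
import io

import pandas as pd

# Tiny stand-in for a LoanStats file; the real files are much wider.
# Column names, footer text, and the status prefix below are assumptions.
raw_csv = "\n".join([
    "id,loan_amnt,loan_status",
    "1,5000,Fully Paid",
    "2,2400,Charged Off",
    "3,1000,Does not meet the credit policy. Status:Fully Paid",
    "Total amount funded in policy code 1: 7400",
    "Total amount funded in policy code 2: 1000",
])

# skipfooter drops the trailing aggregate rows; it requires the Python engine.
loans = pd.read_csv(io.StringIO(raw_csv), skipfooter=2, engine="python")

# Discard rows flagged (by the assumed prefix) as declined applications.
declined = loans["loan_status"].str.startswith("Does not meet the credit policy")
loans = loans[~declined].reset_index(drop=True)
```

Dropping the footer at read time keeps the numeric columns from being parsed as strings, which would otherwise happen if the aggregate rows were loaded and removed afterwards.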

```python
import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly
import seaborn as sns

%matplotlib inline
```

<a id='2'></a>


# 2. Data Preparation: Wrangling, Cleaning, and Feature Extraction

Certain columns can be ignored immediately and never loaded into the dataframe. Some columns provide information that was not available at loan origination (e.g. interest received from loan payments, late fees collected). Zip code is excluded because Lending Club has purposely removed the last 2 digits of the zip code to prevent investors from discriminating against borrowers in poor and minority neighborhoods. Other columns are dropped because they contain information that is not relevant to whether a loan will default (e.g. borrower ID number, URL of the loan listing). Joint loan applications are relatively new at Lending Club, so all columns related to joint applications and secondary applicants are ignored as well.

For a complete list of columns ignored please see the [Column_Info.py](Column_Info.py) file.
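One way to skip columns at load time is to pass a callable to the `usecols` parameter of `pandas.read_csv`. The sketch below assumes a small, hypothetical set of ignored column names and a toy CSV; the actual list for this project lives in Column_Info.py and may be applied differently.

```python
import io

import pandas as pd

# Hypothetical subset of ignored columns, for illustration only;
# the full list used by this project is in Column_Info.py.
COLUMNS_TO_IGNORE = {
    "id", "url", "zip_code", "total_rec_int", "total_rec_late_fee",
}

# Toy CSV standing in for a Lending Club file.
raw_csv = "\n".join([
    "id,loan_amnt,int_rate,zip_code,total_rec_int",
    "1,5000,10.65,860xx,863.16",
    "2,2400,15.96,606xx,605.67",
])

# A callable passed to usecols is evaluated against each column name,
# so ignored columns are never loaded into the dataframe.
loans = pd.read_csv(
    io.StringIO(raw_csv),
    usecols=lambda name: name not in COLUMNS_TO_IGNORE,
)
```

Filtering at read time rather than calling `drop` afterwards avoids holding the unwanted columns in memory, which matters for files as wide as these.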

### Read in Relevant Loans:

We keep only loans that have been either fully paid or charged off, because the final return on investment cannot be calculated for loans that are still current.
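A minimal sketch of this filter, assuming the status lives in a column named `loan_status` with the labels "Fully Paid" and "Charged Off" (the toy data and the `FINAL_STATUSES` name are illustrative, not taken from the project code):

```python
import pandas as pd

# Toy data; the "loan_status" column and its labels are assumptions.
loans = pd.DataFrame({
    "loan_amnt": [5000, 2400, 10000, 7000],
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Late (31-120 days)"],
})

# Statuses for which the loan outcome is final.
FINAL_STATUSES = ["Fully Paid", "Charged Off"]

# Keep only loans whose outcome is known, so their ROI can be computed.
completed = loans[loans["loan_status"].isin(FINAL_STATUSES)].reset_index(drop=True)
```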