# Lending Club

In [2]:
%matplotlib inline

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import roc_curve, auc, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from scipy.stats import ttest_ind
import matplotlib.dates as mdates

sns.set_style('white')

## Introduction

Crowdfunding has become a new and exciting way to raise capital and to invest. Lending Club has joined this trend by offering loans with fixed interest rates and terms that the public can choose to invest in. Lending Club screens the loan applications; only 10% are approved and subsequently offered to the public. By investing a small amount in many different loans, investors can diversify their portfolio and so keep the default risk to a minimum (estimated by Lending Club to be 4%). Lending Club charges a fee of 1% for its services. For investors this is an interesting way to earn a return, since it supposedly gives more stable returns than the stock market and higher interest rates than a savings account. The profit depends on the interest rate and the default rate. It is therefore interesting to see whether certain characteristics of the loan or the borrower are associated with a higher chance of default, since this might help investors improve their returns.

Lending Club makes its loan records public via its website. An earlier release holds the records from 2007-2011, and there has also been a Kaggle contest with a preprocessed Lending Club dataset. In April 2016 Lending Club published its 2007-2015 records on Kaggle as a dataset rather than as a contest. This is the dataset we will work with in this project. Most previous work has used one of the earlier releases. Like most of that work, we focus on distinguishing good loans from bad loans, but most earlier analyses also included loans that were still running. This poses a problem, since a loan with a 'late' status can still recover and end as 'fully paid', and a 'current' loan can still end as 'charged off'. We therefore focus only on closed loans, which are either 'fully paid' or 'charged off'. As a consequence, our results are not fully comparable with previous work that included current loans.

To predict whether a loan will end as 'charged off' we will use machine learning algorithms. According to previous work, Logistic Regression and Random Forest work best. However, work that incorporates no external datasets usually ends up with an Area Under the Curve (AUC) score around 0.7, which is not great but better than chance (0.5). The most important feature is usually found to be 'grade', a risk assessment of the loan given by Lending Club itself. The grades run from A to G, with subgrades such as A1. The idea is that the closer a grade is to G, the higher the chance of default. The interest rate is usually also higher for riskier loans, to keep them attractive for investors.
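As a rough illustration of how an AUC score is obtained for the two model families mentioned above, the sketch below fits both on synthetic data. The data come from scikit-learn's `make_classification` and are not Lending Club records; the class balance, feature counts, and hyperparameters are assumptions chosen only to make the example run, so the resulting scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (NOT Lending Club records):
# roughly 80% of class 0 ("fully paid") and 20% of class 1 ("charged off").
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hyperparameters are illustrative assumptions, not tuned values.
models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC is computed from the predicted probability of the positive class,
    # not from hard 0/1 predictions.
    proba = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, proba)

print(auc_scores)
```

An AUC of 0.5 corresponds to ranking loans at random and 1.0 to a perfect ranking, which is the scale on which the 0.7 from previous work should be read.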

In this project, we will first explore the data. We will check whether Lending Club's claimed 4% default rate holds. Next, we will see whether loans with riskier grades (closer to G) indeed have higher interest rates and higher default rates. We will close the exploration with how profitable the loans in the different grade categories actually are. We then move on to prediction, where we will use Random Forest and Logistic Regression to separate 'charged off' from 'fully paid' loans. We will see whether a model with only grade performs as well as a model with all features, and whether we can predict grade from the features, which would indicate whether Lending Club publishes the features it uses. Finally, we will see whether return on investment increases with our algorithm and whether we can give some tips to Lending Club investors.

## Methods

In [None]:
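# Load the Lending Club loan records.
loans = pd.read_csv('../data/loan.csv')
# Keep only closed loans; copy so later column assignments do not hit a view.
closed_loans = loans[loans['loan_status'].isin(['Fully Paid', 'Charged Off'])].copy()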
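
Once only closed loans remain, the default rate is simply the share of them that ended as 'Charged Off'. The sketch below shows this calculation, overall and per grade, on a small made-up frame. The `toy_loans` rows and the `charged_off` helper column are hypothetical and exist only for illustration; the same steps would apply to `closed_loans`, assuming it has `loan_status` and `grade` columns.

```python
import pandas as pd

# Toy records for illustration only; the values are made up.
toy_loans = pd.DataFrame({
    'grade': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'loan_status': ['Fully Paid', 'Fully Paid', 'Current', 'Charged Off',
                    'Fully Paid', 'Charged Off', 'Late (31-120 days)',
                    'Fully Paid'],
})

# Same filter as above: drop loans whose final outcome is not yet known.
closed = toy_loans[
    toy_loans['loan_status'].isin(['Fully Paid', 'Charged Off'])
].copy()

# Binary target: 1 if the loan was charged off, 0 if it was fully paid.
closed['charged_off'] = (closed['loan_status'] == 'Charged Off').astype(int)

# The mean of a 0/1 column is the fraction of ones, i.e. the default rate.
overall_default_rate = closed['charged_off'].mean()
default_rate_by_grade = closed.groupby('grade')['charged_off'].mean()

print(overall_default_rate)
print(default_rate_by_grade)
```

Encoding the outcome as a 0/1 column keeps the default rate a plain mean, and the same column can later serve as the prediction target.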