
# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 5: Loan Classification
#### Due Wednesday, December 21, 9 am

### Overview

You are going to create a model to predict loan status (good / bad) using data from Lending Club.  Download the 2015 data from [here](https://www.lendingclub.com/info/download-data.action).  

#### Data Manipulation

The data require some cleaning before you can build the model.  Think about what you are trying to predict, and how to re-engineer categories in order to do this.  What are the categories for loan status in the data?<br>
*hint*:  only use loans that have been determined (i.e. not current loans).<br>
*hint*:  re-categorize the loans into "good" and "bad" (only two categories)

Let's use annual income, debt-to-income, interest rate, loan term, funded amount and home ownership to model the loan status.  If you don't know what these features are, have a look at the data dictionary on the Lending Club [page](https://www.lendingclub.com/info/download-data.action).

#### EDA
Before doing any kind of modelling, explore the data.  For example, what is the distribution of good / bad loans?  Are interest rate and DTI related?  Make some pivot tables / plots to better understand the data you have.

#### Model
Create your classification model using the above features!<br>
*hint*: your data must be numerical in order to create your model.  Are all of the data numerical?  What can you do to make them numerical?  (Look-up dummy variables)

Once you have your model, make a prediction based on the first row of data.  What is the probability of loan repayment for this person?  If your boss asked you whether the person is going to repay, what would you say?


binomial distribution bc 

prob is parameter - pdf


**Deliverables**: a Jupyter  notebook including EDA (plotting) and your model.  Or you can work in pycharm, but you must submit EDA as well.  Also you should submit a blog post describing your project.


In [2]:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('./loan.csv')
df.head()


  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,66310712,71035433.0,35000.0,35000.0,35000.0,60 months,14.85%,829.9,C,C5,...,0.0,1.0,100.0,0.0,0.0,0.0,381215.0,52226.0,62500.0,18000.0
1,68476807,73366655.0,10400.0,10400.0,10400.0,60 months,22.45%,289.91,F,F1,...,0.0,4.0,96.6,60.0,0.0,0.0,439570.0,95768.0,20300.0,88097.0
2,68341763,72928789.0,20000.0,20000.0,20000.0,60 months,10.78%,432.66,B,B4,...,0.0,0.0,100.0,50.0,0.0,0.0,218418.0,18696.0,6200.0,14877.0
3,68466916,73356753.0,25000.0,25000.0,25000.0,36 months,7.49%,777.55,A,A4,...,0.0,0.0,100.0,20.0,0.0,0.0,373572.0,68056.0,38400.0,82117.0
4,68466961,73356799.0,28000.0,28000.0,28000.0,36 months,6.49%,858.05,A,A2,...,0.0,0.0,91.7,22.2,0.0,0.0,304003.0,74920.0,41500.0,42503.0


In [3]:
# SELECTING COLUMNS

#Let's use annual income, debt-to-income, interest rate, loan term, funded amount and home ownership 

column_set = ["annual_inc", "dti", "int_rate", "term", "funded_amnt", "home_ownership","loan_status"]
working_set1 = df.loc[:, column_set]

#I only want loan status that is not 'current'
status_list = ['Fully Paid', 'Default','Charged Off']

# TOOK OUT : current, 'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)'
working_set2 = working_set1[working_set1["loan_status"].isin(status_list)]

working_set2.head()

Unnamed: 0,annual_inc,dti,int_rate,term,funded_amnt,home_ownership,loan_status
1,104433.0,25.37,22.45%,60 months,10400.0,MORTGAGE,Fully Paid
3,109000.0,26.02,7.49%,36 months,25000.0,MORTGAGE,Fully Paid
5,112000.0,8.68,11.99%,60 months,18000.0,MORTGAGE,Fully Paid
8,55000.0,25.49,19.89%,36 months,8650.0,RENT,Fully Paid
18,180000.0,14.67,9.17%,36 months,20000.0,MORTGAGE,Fully Paid


In [4]:
# Turn all columns into numbers

def clean_interest(x):
    x = float((x.split("%"))[0])
    return x
    
working_set2.loc[:, "int_rate"] = [clean_interest(x) for x in working_set2.loc[:, "int_rate"]]


def normalize_status(x):
    if x == "Charged Off":
        return 0.0
    elif x == "Default":
        return 0.0
    else:
        return 1.0

working_set2.loc[:, "loan_status"] = [normalize_status(x) for x in working_set2.loc[:, "loan_status"]]


# convert owernship column into dummy variables

working_set2.loc[:,"Mortgage"] = [1.0 if x == "MORTGAGE" else 0.0 for x in working_set2.loc[:, "home_ownership"]]

working_set2.loc[:, "Rent"] = [1.0 if x =="RENT" else 0.0 for x in working_set2.loc[:, "home_ownership"]]

working_set2.loc[:, "Own"] = [1.0 if x == "OWN" else 0.0 for x in working_set2.loc[:, "home_ownership"]]

working_set2.loc[:, "Any"] = [1.0 if x== "ANY" else 0.0 for x in working_set2.loc[:, "home_ownership"]]


#Turn term from object into float

def clean_term(x):
    y = x[1:3]
    if y == "60":
        return 1.0
    else:
        return 0.0
   
    
working_set2.loc[:, "term"] = [clean_term(x) for x in working_set2["term"]]


#show first 5 rows of our aggregated data
working_set2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)


Unnamed: 0,annual_inc,dti,int_rate,term,funded_amnt,home_ownership,loan_status,Mortgage,Rent,Own,Any
1,104433.0,25.37,22.45,1.0,10400.0,MORTGAGE,1.0,1.0,0.0,0.0,0.0
3,109000.0,26.02,7.49,0.0,25000.0,MORTGAGE,1.0,1.0,0.0,0.0,0.0
5,112000.0,8.68,11.99,1.0,18000.0,MORTGAGE,1.0,1.0,0.0,0.0,0.0
8,55000.0,25.49,19.89,0.0,8650.0,RENT,1.0,0.0,1.0,0.0,0.0
18,180000.0,14.67,9.17,0.0,20000.0,MORTGAGE,1.0,1.0,0.0,0.0,0.0


In [5]:
#describes data type information
working_set2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101152 entries, 1 to 421093
Data columns (total 11 columns):
annual_inc        101152 non-null float64
dti               101152 non-null float64
int_rate          101152 non-null float64
term              101152 non-null float64
funded_amnt       101152 non-null float64
home_ownership    101152 non-null object
loan_status       101152 non-null float64
Mortgage          101152 non-null float64
Rent              101152 non-null float64
Own               101152 non-null float64
Any               101152 non-null float64
dtypes: float64(10), object(1)
memory usage: 9.3+ MB


In [6]:
# NORMALIZE COLUMNS

#normalize interest rates between 1 and 0

minim_I = working_set2["int_rate"].min()
maxim_I = working_set2["int_rate"].max()

def normalizeI(x):
    return (x-minim_I)/(maxim_I - minim_I)
working_set2.loc[:, "int_rate"] = working_set2["int_rate"].apply(normalizeI)



# normalize DTI

minim_D = working_set2["dti"].min()
maxim_D = working_set2["dti"].max()

def normalizeD(x):
    return (x-minim_D)/(maxim_D - minim_D)

working_set2.loc[:, "dti"] = working_set2["dti"].apply(normalizeD)



#normalize funded amount

minim_F = working_set2["funded_amnt"].min()
maxim_F = working_set2["funded_amnt"].max()

def normalizeF(x):
    return (x-minim_F)/(maxim_F - minim_F)
working_set2.loc[:, "funded_amnt"] = working_set2["funded_amnt"].apply(normalizeF)


# Normalize annual income

minim_A = working_set2["annual_inc"].min()
maxim_A = working_set2["annual_inc"].max()

def normalizeA(x):
    return (x-minim_A)/(maxim_A - minim_A)

working_set2.loc[:, "annual_inc"] = working_set2["annual_inc"].apply(normalizeA)


In [7]:
#TOTALLY CLEANED PANDAS DF
#all the columns, minus home_owernship column with strings
clean_set = working_set2.loc[:, ["annual_inc", "dti", "int_rate", "term", "funded_amnt", "Mortgage", "Rent","Own", "Any", "loan_status"]]


In [8]:
#sampling 6k of good loans and bad loans. 
#the dataset is unbalanced and is affecting parameters.

good_loan = clean_set[clean_set.loan_status == 1]
bad_loan = clean_set[clean_set.loan_status == 0]

good_df = good_loan.sample(n=3000)
bad_df= bad_loan.sample(n=3000)

data2 = pd.concat([good_df, bad_df])
data2.tail()


Unnamed: 0,annual_inc,dti,int_rate,term,funded_amnt,Mortgage,Rent,Own,Any,loan_status
208293,0.005393,0.001733,0.338403,1.0,0.264706,0.0,1.0,0.0,0.0,0.0
306315,0.004719,0.002583,0.599916,0.0,0.338971,1.0,0.0,0.0,0.0,0.0
207308,0.005056,0.003277,0.366286,1.0,0.352941,0.0,1.0,0.0,0.0,0.0
151682,0.004494,0.002955,0.493029,0.0,0.041176,0.0,1.0,0.0,0.0,0.0
352491,0.011685,0.001303,0.108576,0.0,0.191912,1.0,0.0,0.0,0.0,0.0


In [9]:
# PANDAS DF --> Numpy array

data1 = np.array(data2)

## Bayes 

In [10]:
sig = lambda x: 1./(1+np.exp(-x))

def lnpred(data, a):
    atom = a[0] + a[1] * data[0] + a[2] * data[1] + a[3] * data[2] + a[4]*data[3] + a[5]*data[4] + a[6]*data[5] +a[7] * data[6] + a[8] * data[7] + a[9]*data[8] 
    if data[9] == 1: 
        return np.log(sig(atom))
    else:
        return np.log(1 - (sig(atom)))
    
def lnprob(a, data):
    prior = np.sum([e**2 for e in a])
    return -0.5 * prior + np.sum([lnpred(e,a) for e in data])


In [11]:
import numpy as np
import emcee
import corner
import matplotlib.pyplot as plt
%matplotlib inline


class Bayes:
    def __init__(self, lnprob, data, nwalkers, ndim, nsteps):
        self.lnprob = lnprob
        self.data = data
        self.nwalkers = nwalkers
        self.ndim = ndim
        self.nsteps = nsteps
        self.N = len(data)
        
        sampler = emcee.EnsembleSampler(self.nwalkers, self.ndim, self.lnprob)
        p0 = np.random.rand(self.nwalkers * self.ndim).reshape((self.nwalkers , self.ndim))
        pos, prob, state = sampler.run_mcmc(p0, 1000)
        sampler.reset()
        pos, prob, state = sampler.run_mcmc(pos, self.nsteps)
        self.samples = sampler.flatchain
        global samples2
        samples2 = self.samples
        
    def MonteCarlo(self, f, samples):
        N = len(self.samples)
        return 1/float(N)*sum([f(e) for e in self.samples])
        
    def reg(self):
        #calling the MonteCarlo method for each column in my samples, 
        global omega0, omega1, omega2, omega3,omega4, omega5, omega6, omega7, omega8, omega9
        omega0 = self.MonteCarlo(lambda x: x[0], self.samples)
        omega1 = self.MonteCarlo(lambda x: x[1], self.samples)
        omega2 = self.MonteCarlo(lambda x: x[2], self.samples)
        omega3 = self.MonteCarlo(lambda x: x[3], self.samples)
        omega4 = self.MonteCarlo(lambda x: x[4], self.samples)
        omega5 = self.MonteCarlo(lambda x: x[5], self.samples)
        omega6 = self.MonteCarlo(lambda x: x[6], self.samples)
        omega7 = self.MonteCarlo(lambda x: x[7], self.samples)
        omega8 = self.MonteCarlo(lambda x: x[8], self.samples)
        omega9 = self.MonteCarlo(lambda x: x[8], self.samples)
        
        print ('omega0 = {0}, omega1 = {1},omega2 = {2},omega3 = {3}, omega4 = {4}'\
               'omega5 = {5}, omega6 = {6},omega7 = {7},omega8 = {8},omega9 = {9}'\
               .format(omega0, omega1, omega2, omega3,omega4, omega5, omega6, omega7, omega8, omega9))
        


In [12]:
ellen = Bayes(lambda a: lnprob(a,data1), data1, 26,10,1000)

In [13]:
ellen.reg()


omega0 = 1.05989008917, omega1 = 0.135981978518,omega2 = -0.235113842087,omega3 = -3.37674250055, omega4 = -0.234196073689omega5 = -0.0136364675112, omega6 = 0.524788702911,omega7 = 0.0307544045628,omega8 = 0.354092310583,omega9 = 0.354092310583


In [14]:
omegas = [omega0, omega1, omega2, omega3,omega4, omega5, omega6, omega7, omega8, omega9]
print omegas

[1.0598900891667695, 0.1359819785184323, -0.23511384208698494, -3.3767425005518237, -0.23419607368944345, -0.013636467511155445, 0.52478870291067259, 0.030754404562830063, 0.35409231058287327, 0.35409231058287327]


In [22]:
sig = lambda x: 1./(1+np.exp(-x))

## OUR TEST SAMPLES


In [23]:
# Test 1
#predict_prob(104433.0, 25.37, 22.45, 1, 104000, 1,0,0,0)

In [24]:
# Test 2
#predict_prob(64400.0, 27.19, 1.99, 1, 12000, 0,1,0,0)

In [64]:
def predict_prob(a,data):
    atom = a[0] + a[1] * data[0] + a[2] * data[1] + a[3] * data[2] + a[4]*data[3] + \
    a[5]*data[4] + a[6]*data[5] +a[7] * data[6] + a[8] * data[7] + a[9]*data[8] 
    return sig(atom)



In [65]:
N = len(samples2)
MonteCarlo =  lambda f,e:  1/float(N)*sum([f(e) for e in samples2])
predict = lambda d: MonteCarlo(lambda a: predict_prob(a,d),samples2)

In [66]:
# GOOD LOAN

row1= [0.01,0.001179,0.01,0.0,0.294118,1.0,0,0,0]


predict(row1)


0.8244040609129909

In [67]:
# BAD LOAN
row2= [0.004831,0.002506,0.281791,0.0,0.551471,0.0,1.0,0.0,0.0,0.0]
predict(row2)


0.53284011078192728

In [68]:
#my blog is having issues. i'm sorry-!

## Blog: Project 5: Loan Classification

Objective: Create a model to predict loan status using data from Lending Club.

Data Manipulation: Clean and re-engineer categories of data. Data was also normalized so that its prepped for regression analysis.

Data Exploration: Before going into the data modeling, exploratory data analysis was performed. General exploration suggested that home ownership status (mortgage, rent, own, other) seemed to be a big indication of whether someone paid or not. Also, the distribution of good to bad loans were extremely unbalanced. After I had cleaned my dataset there were almost 3 times more good loans than bad loans (76,833 – good, 24319- bad). To even give my model more exposure and information around bad loans, I created a random sample of 3,000 data samples from each category (good and bad loans).  


Regression: Ran a regression on my model, predicted omegas using Emcee Sampler and walkers.

Test Sample: Once I found omegas, I created a predictive function where I could input attributes as a hypothetical test case. 



