For this project we will explore publicly available data from [LendingClub.com](https://www.lendingclub.com). Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in people whose profile suggests a high probability of paying you back. We will try to create a model that helps predict this.

Lending Club had a [very interesting year in 2016](https://en.wikipedia.org/wiki/Lending_Club#2016), so let's check out some of their data and keep the context in mind. This data is from before they even went public.

We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from [here](https://www.lendingclub.com/info/download-data.action) or just use the csv already provided. We recommend the provided csv, as it has been cleaned of NA values.

Here is what the columns represent:
* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").
* int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
* installment: The monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: The natural log of the self-reported annual income of the borrower.
* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: The FICO credit score of the borrower.
* days.with.cr.line: The number of days the borrower has had a credit line.
* revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
* not.fully.paid: The target variable: 1 if the loan was not paid back in full, and 0 otherwise.
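
Two of these columns are stored in transformed units, which is easy to forget when reading the raw table. As a small sketch (the variable names are illustrative, and the values are copied from the first row of the data), the log income and the proportional interest rate can be converted back like this:

```python
import numpy as np

# Values copied from the first row of loan_data.csv (illustrative variable names).
log_annual_inc = 11.350407
int_rate = 0.1189

# log.annual.inc is a natural log, so np.exp recovers the income itself.
annual_income = np.exp(log_annual_inc)

# int.rate is a proportion, so multiply by 100 to get a percentage.
int_rate_percent = int_rate * 100

print(f"annual income: {annual_income:,.0f}")
print(f"interest rate: {int_rate_percent:.2f}%")
```

For this row the income comes out at roughly 85,000 and the rate at 11.89%.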

# Import Libraries

**Import the usual libraries for pandas and plotting. You can import sklearn later on.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Get the Data

**Use pandas to read loan_data.csv as a dataframe called df.**

In [2]:
df = pd.read_csv('loan_data.csv')
df.head()

Unnamed: 0,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0


In [3]:
## Select non numerical categorical column names
mylist = list(df.select_dtypes(include=['object']).columns)
print(mylist)


unique=df.purpose.unique()
print("\n")
print("Unique Categorical features:",unique)

count=df.purpose.unique().size
print("\n")
print("Count of Categorical features:",count)

['purpose']


Unique Categorical features: ['debt_consolidation' 'credit_card' 'all_other' 'home_improvement'
 'small_business' 'major_purchase' 'educational']


Count of Categorical features: 7


In [4]:
## Create dummy variables for non numerical categorical variables
dummies = pd.get_dummies(df[mylist], prefix= mylist)
#print(dummies.head())

df.drop(mylist, axis=1, inplace = True) ## Drop Non numerical categorical columns
#df.head()

df=pd.concat([df,dummies], axis =1 ) ## added encoded categorical columns
df.head()

Unnamed: 0,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid,purpose_all_other,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business
0,1,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0,0,0,1,0,0,0,0
1,1,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0,0,1,0,0,0,0,0
2,1,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0,0,0,1,0,0,0,0
3,1,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0,0,0,1,0,0,0,0
4,1,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0,0,1,0,0,0,0,0
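
Keeping all seven dummy columns alongside an intercept makes the columns perfectly collinear, because the seven indicators always sum to 1. A minimal sketch of one common way around this, assuming a tiny made-up frame rather than the real data, is to pass `drop_first=True` to `pd.get_dummies`:

```python
import pandas as pd

# Toy stand-in for the loans frame (made-up rows, not the real data).
toy = pd.DataFrame({
    "purpose": ["credit_card", "all_other", "educational", "credit_card"],
    "fico": [707, 682, 712, 667],
})

# drop_first=True drops the first category in sorted order ("all_other"),
# which becomes the baseline that the remaining dummies are compared to.
encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

print(encoded.columns.tolist())
```

Tree-based models do not need this, but it matters for linear models such as the OLS fit used for feature elimination.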


**Separate the features (X) from the target (y), then check out the info() method on df.**

In [5]:
X = df.drop('not.fully.paid', axis=1) ## Independent variables (features)
y = df['not.fully.paid'] ## Dependent variable (target)

In [7]:
print(X[0:5])
print("\n")
print(y[0:5])

   credit.policy  int.rate  installment  log.annual.inc    dti  fico  \
0              1    0.1189       829.10       11.350407  19.48   737   
1              1    0.1071       228.22       11.082143  14.29   707   
2              1    0.1357       366.86       10.373491  11.63   682   
3              1    0.1008       162.34       11.350407   8.10   712   
4              1    0.1426       102.92       11.299732  14.97   667   

   days.with.cr.line  revol.bal  revol.util  inq.last.6mths  delinq.2yrs  \
0        5639.958333      28854        52.1               0            0   
1        2760.000000      33623        76.7               0            0   
2        4710.000000       3511        25.6               1            0   
3        2699.958333      33667        73.2               1            0   
4        4066.000000       4740        39.5               0            1   

   pub.rec  purpose_all_other  purpose_credit_card  \
0        0                  0                    0   
1 

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 20 columns):
credit.policy                 9578 non-null int64
int.rate                      9578 non-null float64
installment                   9578 non-null float64
log.annual.inc                9578 non-null float64
dti                           9578 non-null float64
fico                          9578 non-null int64
days.with.cr.line             9578 non-null float64
revol.bal                     9578 non-null int64
revol.util                    9578 non-null float64
inq.last.6mths                9578 non-null int64
delinq.2yrs                   9578 non-null int64
pub.rec                       9578 non-null int64
not.fully.paid                9578 non-null int64
purpose_all_other             9578 non-null uint8
purpose_credit_card           9578 non-null uint8
purpose_debt_consolidation    9578 non-null uint8
purpose_educational           9578 non-null uint8
purpose_home_improvement      9

In [10]:
import statsmodels.api as sm ## sm.OLS is provided by statsmodels.api; formula.api offers the formula-based ols instead

In [11]:
## Backward elimination with a significance level of 0.05. If eliminating a
## variable reduces the adjusted R-squared, the variable is retained and
## elimination stops.
## Note: the function reads the target y from the enclosing (global) scope.
def backwardElimination(x, SL):
    numVars = len(x[0])
    temp = np.zeros(x.shape).astype(int) ## Holds removed columns; here x has 9578 rows and 20 columns
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        adjR_before = regressor_OLS.rsquared_adj.astype(float)
        if maxVar > SL:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    temp[:,j] = x[:, j]
                    x = np.delete(x, j, 1)
                    tmp_regressor = sm.OLS(y, x).fit()
                    adjR_after = tmp_regressor.rsquared_adj.astype(float)
                    if (adjR_before >= adjR_after):
                        ## Removing the column did not help, so roll back and stop
                        x_rollback = np.hstack((x, temp[:,[0,j]]))
                        x_rollback = np.delete(x_rollback, j, 1)
                        print(regressor_OLS.summary())
                        return x_rollback
                    else:
                        continue
    print(regressor_OLS.summary())
    return x

In [12]:
X=np.append(arr=np.ones((9578,1)).astype(int),values=X, axis=1)
SL = 0.05
X = backwardElimination(X, SL)

                            OLS Regression Results                            
Dep. Variable:         not.fully.paid   R-squared:                       0.062
Model:                            OLS   Adj. R-squared:                  0.060
Method:                 Least Squares   F-statistic:                     39.45
Date:                Sun, 13 Jan 2019   Prob (F-statistic):          7.85e-120
Time:                        20:31:15   Log-Likelihood:                -3674.6
No. Observations:                9578   AIC:                             7383.
Df Residuals:                    9561   BIC:                             7505.
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6727      0.057     11.833      0.0
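
The adjusted R-squared that the elimination loop compares can be checked by hand from the summary above. A quick sketch of the standard formula, `1 - (1 - R2) * (n - 1) / (n - p - 1)`, using the rounded figures printed in the summary (so the result is only approximate):

```python
# Figures read off the OLS summary (R-squared is rounded to 3 decimals there).
r_squared = 0.062
n_obs = 9578      # No. Observations
df_model = 16     # Df Model (predictors, excluding the constant)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)

print(round(adj_r_squared, 3))
```

The result agrees with the reported Adj. R-squared of 0.060. With this many observations the penalty is small, so the two values differ only in the third decimal.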

In [13]:
X[0]

array([1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 8.29100000e+02,
       1.13504065e+01, 7.37000000e+02, 5.21000000e+01, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 5.63900000e+03])

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
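
The split above is unseeded, so every run produces different numbers, and with roughly 16% of loans in the positive class the class mix can drift between train and test. A hedged sketch on synthetic data (the arrays `X_demo` and `y_demo` and the seed value are assumptions for illustration) showing the `random_state` and `stratify` arguments of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 16% positives (not the real loan data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 160 + [0] * 840)

# random_state makes the split repeatable; stratify keeps the class ratio
# the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```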

In [16]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()

classifier.fit(X_train,y_train)
predictions = classifier.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

[[2019  405]
 [ 351   99]]
             precision    recall  f1-score   support

          0       0.85      0.83      0.84      2424
          1       0.20      0.22      0.21       450

avg / total       0.75      0.74      0.74      2874
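
To see where the numbers in the report come from, the class-1 metrics can be recomputed directly from the confusion matrix printed above. A small sketch (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[2019, 405],
               [351, 99]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)  # share of loans flagged as not fully paid that really were
recall_1 = tp / (tp + fn)     # share of loans not fully paid that were flagged

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2))
```

These reproduce the 0.20 precision and 0.22 recall reported for class 1, and an overall accuracy of about 0.74.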



In [17]:
from sklearn.model_selection import cross_val_score 

accuracies_logistic= cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10) 
accuracies_logistic_mean=accuracies_logistic.mean()*100
print("Mean Accuracy:Decision Tree=",accuracies_logistic_mean)

accuracies_logistic_std=accuracies_logistic.std()*100
print("Standard Deviation - Accuracy:Decision Tree=",accuracies_logistic_std)

Mean Accuracy:Decision Tree= 73.97096014763314
Standard Deviation - Accuracy:Decision Tree= 1.5602779496850616


In [21]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=25) ## Hyperparameter
classifier.fit(X_train,y_train)
predictions = classifier.predict(X_test)


from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

from sklearn.model_selection import cross_val_score 

accuracies_logistic= cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10) 
accuracies_logistic_mean=accuracies_logistic.mean()*100
print("Mean Accuracy:Random Forest=",accuracies_logistic_mean)

accuracies_logistic_std=accuracies_logistic.std()*100
print("Standard Deviation:Random Forest=",accuracies_logistic_std)

[[2377   47]
 [ 426   24]]
             precision    recall  f1-score   support

          0       0.85      0.98      0.91      2424
          1       0.34      0.05      0.09       450

avg / total       0.77      0.84      0.78      2874

Mean Accuracy:Random Forest= 83.12940506907658
Standard Deviation:Random Forest= 0.8725942679161864


In [18]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

from sklearn.model_selection import cross_val_score 

accuracies_logistic= cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10) 
accuracies_logistic_mean=accuracies_logistic.mean()*100
print("Mean Accuracy:Naive Bayes=",accuracies_logistic_mean)

accuracies_logistic_std=accuracies_logistic.std()*100
print("Standard Deviation:Naive Bayes=",accuracies_logistic_std)

[[2076  348]
 [ 318  132]]
             precision    recall  f1-score   support

          0       0.87      0.86      0.86      2424
          1       0.28      0.29      0.28       450

avg / total       0.77      0.77      0.77      2874

Mean Accuracy:Naive Bayes= 77.56561978006496
Standard Deviation:Naive Bayes= 1.3496414121822171
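
The three evaluation cells repeat the same steps. One way to reduce the copy-paste, sketched here with a hypothetical helper named `evaluate` and synthetic data from `make_classification` (neither is part of the original notebook), is to pass the model name in explicitly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate(name, model, X_train, y_train, cv=10):
    """Hypothetical helper: return mean and std of CV accuracy in percent."""
    scores = cross_val_score(estimator=model, X=X_train, y=y_train, cv=cv)
    mean, std = scores.mean() * 100, scores.std() * 100
    print(f"Mean Accuracy:{name}= {mean:.2f}")
    print(f"Standard Deviation:{name}= {std:.2f}")
    return mean, std


# Synthetic stand-in for the loan features (not the real data).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

results = {
    name: evaluate(name, model, X_tr, y_tr)
    for name, model in [
        ("Decision Tree", DecisionTreeClassifier(random_state=0)),
        ("Naive Bayes", GaussianNB()),
    ]
}
```

Because the label travels with the model, the printed name cannot fall out of step with the classifier being scored.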
