# Lending Club

LendingClub is a peer-to-peer online lending platform. It is the world's largest marketplace connecting borrowers and investors: consumers and small business owners lower the cost of their credit and enjoy a better experience than with traditional bank lending, while investors earn attractive risk-adjusted returns. Essentially, borrowers apply for loans and LendingClub assigns each an interest rate. Individual investors choose which loans to fund, raising capital for a loan in much the same way as a crowdfunding campaign. As an investor, your returns depend on the loans you choose (both their interest rates and their default rates). Therefore, if you can better predict which borrowers will pay back their loans, you can expect better investment returns.

In this assignment, you will analyze data from LendingClub (<a href = "https://www.lendingclub.com/">www.lendingclub.com</a>). Using the lending data from 2007-2010, you need to create models that predict whether or not borrowers paid back their loan in full. The final model should minimize the number of borrowers who did not pay back their loan in full but were predicted to have done so, that is, the false negatives for the `not.fully.paid = 1` class (this is our model selection criterion).
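To make the selection criterion concrete, here is a minimal sketch of how the quantity to be minimized can be counted. The helper name `count_false_negatives` and the toy labels are assumptions made up for illustration; they are not part of the assignment data.

```python
def count_false_negatives(y_true, y_pred, positive=1):
    """Count cases whose true label is `positive` but which were predicted otherwise."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)

# Made-up labels for illustration: 1 = not fully paid, 0 = fully paid.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 0, 1, 0, 1, 0]

# Borrowers who did not repay but were predicted to repay.
missed_defaults = count_false_negatives(toy_true, toy_pred)

# Recall on class 1 is the share of actual non-payers that were caught,
# so fewer false negatives means higher recall on that class.
recall_on_defaults = 1 - missed_defaults / toy_true.count(1)

print(missed_defaults, recall_on_defaults)
```

Because minimizing false negatives on a fixed test set is equivalent to maximizing recall on class 1, recall is a natural scoring choice for the grid searches that follow.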


You need to create a Random Forest model and a Support Vector Machine (SVM) model using the same training/testing data. For both models, you need to optimize the parameters using a grid search.
- For random forest, test the following number of trees in the forest: 10, 50, 100, 200, 300, 500, 800
- For SVM, test the following:
    - C values: 0.1, 1, 10
    - gamma values: "auto", "scale"
    - kernel: "poly", "linear", "rbf"
    
Do not drop any of the features, and make sure to scale them using StandardScaler (otherwise the grid search for the SVM will take a very long time).
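One possible way to combine the scaling requirement with the SVM grid above is to put `StandardScaler` and `SVC` in a scikit-learn `Pipeline`, so the scaler is refit inside every cross-validation fold. This is a sketch of an alternative arrangement, not the approach this notebook takes. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `pipe`, and `search` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption), not the LendingClub file.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["auto", "scale"],
    "svc__kernel": ["poly", "linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_)
```

The grid has 3 x 2 x 3 = 18 candidates, which with 5 folds gives 90 fits.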

At the very bottom of your notebook, please explain how your models have performed and which model performed the best given the criteria.

Here is what the columns in the data represent:
* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "home_improvement", "major_purchase", "small_business", and "all_other").
* int.rate: The interest rate of the loan, as a proportion. Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
* installment: The monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: The natural log of the self-reported annual income of the borrower.
* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: The FICO credit score of the borrower.
* days.with.cr.line: The number of days the borrower has had a credit line.
* revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
* not.fully.paid: 1 if the borrower did not pay back their loan in full, 0 if they paid back their loan in full.




# Import Libraries


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [3]:
from sklearn.svm import SVC

# Get the Data



In [4]:
data = pd.read_csv("lending_club.csv")

# Exploratory Data Analysis

In [5]:
data.head()

,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.167,674.53,10.887437,19.87,667,3904.958333,17176,73.4,1,0,0,1
1,1,all_other,0.074,82.31,9.21034,1.2,807,3899.958333,82,2.3,0,0,0,1
2,1,all_other,0.1218,166.5,10.915088,22.45,702,1800.0,16957,67.0,3,0,0,1
3,1,debt_consolidation,0.1287,420.42,10.545341,10.39,707,3119.958333,12343,67.8,0,0,0,1
4,1,debt_consolidation,0.1114,82.01,11.156251,18.09,712,8130.0,14482,84.2,0,0,0,1


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   credit.policy      9578 non-null   int64  
 1   purpose            9578 non-null   object 
 2   int.rate           9578 non-null   float64
 3   installment        9578 non-null   float64
 4   log.annual.inc     9578 non-null   float64
 5   dti                9578 non-null   float64
 6   fico               9578 non-null   int64  
 7   days.with.cr.line  9578 non-null   float64
 8   revol.bal          9578 non-null   int64  
 9   revol.util         9578 non-null   float64
 10  inq.last.6mths     9578 non-null   int64  
 11  delinq.2yrs        9578 non-null   int64  
 12  pub.rec            9578 non-null   int64  
 13  not.fully.paid     9578 non-null   int64  
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB


In [7]:
data.describe()

,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
count,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0
mean,0.750783,0.125528,322.757966,10.928091,12.805928,708.647004,4531.012359,17909.38,47.812683,1.779808,0.164022,0.063688,0.248486
std,0.432582,0.02705,210.603178,0.626615,6.926247,37.836367,2507.372905,38553.29,29.13193,2.447285,0.545643,0.265119,0.432158
min,0.0,0.06,15.67,7.547502,0.0,612.0,178.958333,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.1099,164.5625,10.545447,7.3825,682.0,2791.291667,3140.5,23.7,0.0,0.0,0.0,0.0
50%,1.0,0.1253,270.41,10.915088,12.9,702.0,4110.0,8593.0,47.8,1.0,0.0,0.0,0.0
75%,1.0,0.1426,444.56,11.289832,18.18,732.0,5707.78125,18493.25,72.1,3.0,0.0,0.0,0.0
max,1.0,0.2164,940.14,14.180154,29.96,827.0,17639.95833,1207359.0,119.0,33.0,13.0,5.0,1.0


In [8]:
data.isna().sum()

credit.policy        0
purpose              0
int.rate             0
installment          0
log.annual.inc       0
dti                  0
fico                 0
days.with.cr.line    0
revol.bal            0
revol.util           0
inq.last.6mths       0
delinq.2yrs          0
pub.rec              0
not.fully.paid       0
dtype: int64

# Data Cleaning

In [9]:
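# One-hot encode the categorical "purpose" column; drop_first=True drops the
# first category ("all_other") so the dummy columns are not redundant.
data = pd.get_dummies(data, columns=["purpose"], drop_first=True, dtype=int)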

In [10]:
data.head()

,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business
0,1,0.167,674.53,10.887437,19.87,667,3904.958333,17176,73.4,1,0,0,1,0,1,0,0,0,0
1,1,0.074,82.31,9.21034,1.2,807,3899.958333,82,2.3,0,0,0,1,0,0,0,0,0,0
2,1,0.1218,166.5,10.915088,22.45,702,1800.0,16957,67.0,3,0,0,1,0,0,0,0,0,0
3,1,0.1287,420.42,10.545341,10.39,707,3119.958333,12343,67.8,0,0,0,1,0,1,0,0,0,0
4,1,0.1114,82.01,11.156251,18.09,712,8130.0,14482,84.2,0,0,0,1,0,1,0,0,0,0
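As a small illustration of what the dummy encoding does, the sketch below applies `pd.get_dummies` with `drop_first=True` to a made-up frame. The frame `toy` and its values are assumptions for illustration, not rows from the LendingClub data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "purpose": ["all_other", "credit_card", "educational", "credit_card"],
        "fico": [700, 680, 720, 690],
    }
)

# Categories are ordered alphabetically, so "all_other" is the dropped
# baseline: a row with every purpose_* column equal to 0 means "all_other".
encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one category loses no information, because the baseline is implied when all remaining dummy columns are 0.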


In [11]:
# Features: every column except the target
X = data.drop("not.fully.paid", axis=1)
# Target: 1 if the loan was not fully paid, 0 otherwise
y = data["not.fully.paid"]

# Train Test Split


In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [13]:
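scaler = StandardScaler()

In [14]:
# Fit the scaler on the training data only, then apply the same
# transformation to the test data so no test information leaks into training.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)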

# Training 1st model


In [15]:
rf_param_grid = {'n_estimators': [10, 50, 100, 200, 300, 500, 800]} 

In [16]:
# scoring='recall' ranks candidates by recall on the positive class (not.fully.paid = 1)
rf_grid = GridSearchCV(RandomForestClassifier(), rf_param_grid, scoring='recall', verbose=3)

In [17]:
rf_grid.fit(X_train, y_train)

Fitting 5 folds for each of 7 candidates, totalling 35 fits
[CV 1/5] END ...................n_estimators=10;, score=0.473 total time=   0.0s
[CV 2/5] END ...................n_estimators=10;, score=0.490 total time=   0.0s
[CV 3/5] END ...................n_estimators=10;, score=0.532 total time=   0.0s
[CV 4/5] END ...................n_estimators=10;, score=0.447 total time=   0.0s
[CV 5/5] END ...................n_estimators=10;, score=0.455 total time=   0.0s
[CV 1/5] END ...................n_estimators=50;, score=0.499 total time=   0.5s
[CV 2/5] END ...................n_estimators=50;, score=0.503 total time=   0.5s
[CV 3/5] END ...................n_estimators=50;, score=0.542 total time=   0.5s
[CV 4/5] END ...................n_estimators=50;, score=0.491 total time=   0.5s
[CV 5/5] END ...................n_estimators=50;, score=0.488 total time=   0.8s
[CV 1/5] END ..................n_estimators=100;, score=0.506 total time=   2.4s
[CV 2/5] END ..................n_estimators=100;,

In [18]:
rf_model = RandomForestClassifier()

# Predictions and Evaluation of 1st model


In [19]:
rf_predictions = rf_grid.predict(X_test)

In [20]:
rf_grid.best_params_

{'n_estimators': 800}
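Since `best_params_` reports only the winner, it can help to look at the mean and spread of the cross-validated scores for every candidate before reading much into the chosen number of trees. The sketch below shows one way to do that through `cv_results_`. It uses synthetic data and a shortened grid to stay fast, and the names `X_demo`, `y_demo`, and `demo_grid` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption), not the LendingClub file.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

demo_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50, 100]},  # shortened grid for the sketch
    scoring="recall",
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate, in the same order as "params".
results = demo_grid.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(f"{params['n_estimators']:>4} trees: mean recall={mean:.3f} (+/- {std:.3f})")
```

If the differences between candidates are smaller than the fold-to-fold spread, the selected value may reflect noise rather than a real advantage.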

In [21]:
rf_grid_predictions = rf_grid.predict(X_test)

In [22]:
from sklearn.metrics import classification_report, confusion_matrix

In [23]:
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, rf_grid_predictions))

[[1458   24]
 [ 175  259]]


In [24]:
# classification_report expects (y_true, y_pred)
print(classification_report(y_test, rf_grid_predictions))

              precision    recall  f1-score   support

           0       0.89      0.98      0.94      1482
           1       0.92      0.60      0.72       434

    accuracy                           0.90      1916
   macro avg       0.90      0.79      0.83      1916
weighted avg       0.90      0.90      0.89      1916



# Training 2nd model

In [25]:
svc_param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["auto", "scale"],
    "kernel": ["poly", "linear", "rbf"],
}

In [26]:
svc_grid = GridSearchCV(SVC(), svc_param_grid, scoring='recall', verbose=3)

In [27]:
svc_grid.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5] END ....C=0.1, gamma=auto, kernel=poly;, score=0.059 total time=   1.9s
[CV 2/5] END ....C=0.1, gamma=auto, kernel=poly;, score=0.082 total time=   2.0s
[CV 3/5] END ....C=0.1, gamma=auto, kernel=poly;, score=0.080 total time=   1.8s
[CV 4/5] END ....C=0.1, gamma=auto, kernel=poly;, score=0.059 total time=   2.0s
[CV 5/5] END ....C=0.1, gamma=auto, kernel=poly;, score=0.069 total time=   2.0s
[CV 1/5] END ..C=0.1, gamma=auto, kernel=linear;, score=0.000 total time=   1.3s
[CV 2/5] END ..C=0.1, gamma=auto, kernel=linear;, score=0.000 total time=   1.2s
[CV 3/5] END ..C=0.1, gamma=auto, kernel=linear;, score=0.000 total time=   1.3s
[CV 4/5] END ..C=0.1, gamma=auto, kernel=linear;, score=0.000 total time=   1.2s
[CV 5/5] END ..C=0.1, gamma=auto, kernel=linear;, score=0.000 total time=   1.3s
[CV 1/5] END .....C=0.1, gamma=auto, kernel=rbf;, score=0.000 total time=   2.8s
[CV 2/5] END .....C=0.1, gamma=auto, kernel=rbf;

# Predictions and Evaluation of 2nd model

In [28]:
svc_grid.best_params_

{'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}

In [29]:
svc_grid_predictions = svc_grid.predict(X_test)

In [30]:
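# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, svc_grid_predictions))

[[1400   82]
 [ 316  118]]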


In [31]:
# classification_report expects (y_true, y_pred)
print(classification_report(y_test, svc_grid_predictions))

              precision    recall  f1-score   support

           0       0.82      0.94      0.88      1482
           1       0.59      0.27      0.37       434

    accuracy                           0.79      1916
   macro avg       0.70      0.61      0.62      1916
weighted avg       0.76      0.79      0.76      1916
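
The two models can be compared on the stated criterion with a short sketch like the one below. The counts are copied from the two confusion matrices printed above for the 1,916 test loans. The dictionary layout and key names are assumptions for illustration, and "false negative" here means a borrower who did not fully pay but was predicted to pay.

```python
# Counts copied from the confusion matrices above; the key names are
# hypothetical labels chosen for this sketch.
results = {
    "random_forest": {"tn": 1458, "fp": 24, "fn": 175, "tp": 259},
    "svc": {"tn": 1400, "fp": 82, "fn": 316, "tp": 118},
}

for name, c in results.items():
    # Recall on class 1: share of actual non-payers that the model flagged.
    recall = c["tp"] / (c["tp"] + c["fn"])
    print(f"{name}: false negatives={c['fn']}, recall on class 1={recall:.2f}")

# The selection criterion: fewest non-payers predicted as fully paid.
best = min(results, key=lambda name: results[name]["fn"])
print(best)
```

On this particular split, the random forest misses 175 of the 434 non-payers and the SVC misses 316, so the random forest does better on the criterion. Both still miss a substantial share, and a different random split could shift these counts.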

