# Assignment2

LendingClub is a peer-to-peer online lending platform. It is the world’s largest marketplace connecting borrowers and investors, where consumers and small business owners lower the cost of their credit and enjoy a better experience than traditional bank lending, and investors earn attractive risk-adjusted returns. Essentially, borrowers apply for loans and are assigned an interest rate by LendingClub. Individual investors choose which loans to fund, raising capital for a loan in much the same way as a crowdfunding campaign. As an investor, your returns depend on the loans you choose (both their interest rates and their default rates). Therefore, if you can better predict which borrowers will pay back their loans, you can expect better investment returns.

In this assignment, you will be analyzing data from LendingClub (<a href = "https://www.lendingclub.com/">www.lendingclub.com</a>). Using the lending data from 2007-2010, you need to create models that predict whether or not borrowers paid back their loan in full. The final model should minimize the number of borrowers who did not pay back their loan in full but were predicted to have done so (this is our model selection criterion).
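
The selection criterion above amounts to counting false negatives when `not.fully.paid = 1` is treated as the positive class. A minimal sketch of that count, using made-up labels and a hypothetical helper name `count_missed_defaults` (neither comes from the assignment):

```python
import numpy as np

def count_missed_defaults(y_true, y_pred):
    """Count borrowers who did not pay in full (1) but were predicted as paid (0)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return int(np.sum((y_true == 1) & (y_pred == 0)))

# Made-up labels for illustration only (not taken from loan_data.csv)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1]

missed = count_missed_defaults(y_true, y_pred)
print(missed)
```

A lower count is better under this criterion, regardless of how overall accuracy changes.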


You need to create a Random Forest model and a Support Vector Machine (SVM) model using the same training/testing data. For both models, you need to optimize the parameters using a grid search.
- For random forest, test the following number of trees in the forest: 10, 50, 100, 200, 300, 500, 800
- For SVM, test the following:
    - C values: 0.1, 1, 10
    - gamma values: "auto", "scale"
    - kernel: "poly", "linear", "rbf"
    
Do not drop any of the features, and make sure to scale them using StandardScaler (otherwise the grid search for the SVM will take a very long time).
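
One way to combine the scaling requirement with the SVM grid is to put `StandardScaler` and `SVC` in a scikit-learn `Pipeline`, so the scaler is refit on the training part of every cross-validation fold. The sketch below runs on synthetic data from `make_classification` as a stand-in for the loan data. The names `X_demo`, `y_demo`, `pipe`, `svm_param_grid` and `svm_search` are illustrative, and `scoring="recall"` is an assumption chosen to match the selection criterion, not something the assignment prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the loan features and target (assumption)
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

# Scaling happens inside the pipeline, so it is learned from training folds only
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Same grid as the assignment; the "svc__" prefix routes each value to the SVC step
svm_param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["auto", "scale"],
    "svc__kernel": ["poly", "linear", "rbf"],
}

# Assumption: recall on class 1 rewards models with fewer false negatives
svm_search = GridSearchCV(pipe, svm_param_grid, cv=3, scoring="recall")
svm_search.fit(X_demo, y_demo)
print(svm_search.best_params_)
```

With the default `scoring=None`, the search would rank candidates by accuracy instead, which does not directly target the stated criterion.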

At the very bottom of your notebook, please explain how your models have performed and which model performed the best given the criteria.

Here is what the columns in the data represent:
* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
* int.rate: The interest rate of the loan, as a proportion. Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
* installment: The monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: The natural log of the self-reported annual income of the borrower.
* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: The FICO credit score of the borrower.
* days.with.cr.line: The number of days the borrower has had a credit line.
* revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
* not.fully.paid: 1 if the borrower did not pay back their loan in full, 0 if they paid back their loan in full.
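
Two of these columns are stored in transformed units: `log.annual.inc` is a natural log and `int.rate` is a proportion. A short sketch of converting them back to more familiar units, using hypothetical values rather than rows from the file:

```python
import numpy as np

# Hypothetical values for illustration (not read from loan_data.csv)
log_annual_inc = 11.0
int_rate = 0.1189

# Undo the natural log to recover income in the original currency units
annual_inc = np.exp(log_annual_inc)

# Express the proportion as a percentage
int_rate_pct = int_rate * 100

print(round(annual_inc, 2), round(int_rate_pct, 2))
```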




# Import Libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Get the Data



In [None]:
loan = pd.read_csv('loan_data.csv')

In [None]:
loan.info()

# Exploratory Data Analysis

In [None]:
loan.head()

In [None]:
loan.describe()

In [None]:
loan["credit.policy"].plot.hist()

In [None]:
loan["not.fully.paid"].plot.hist()

In [None]:
viz = loan[['fico', 'installment', 'int.rate','revol.bal']]
viz.hist(bins = 40)
plt.show()

# Data Cleaning

In [None]:
sns.heatmap(loan.isnull(),yticklabels=False,cbar=False,cmap="viridis")

There are no missing values in the file.
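
A heatmap can hide a handful of isolated gaps, so a numeric count per column is a useful cross-check. The sketch below uses a small hypothetical frame named `demo` with one deliberately missing value. On the real data the same calls would be applied to `loan`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for `loan`, with one missing fico value
demo = pd.DataFrame({
    "fico": [707, 682, np.nan],
    "int.rate": [0.1189, 0.1071, 0.1357],
})

# Number of missing entries in each column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(demo.isnull().values.any())
print(has_missing)
```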

In [None]:
purpose_type = pd.get_dummies(loan["purpose"], drop_first=False)

In [None]:
purpose_type.head()

In [None]:
loan = pd.concat([loan,purpose_type],axis=1)
loan.head()

In [None]:
loan.drop(['purpose'],axis=1,inplace=True)

In [None]:
loan.head()
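
The encode, concatenate and drop steps above can also be written as a single call by passing `columns=` to `pd.get_dummies`. This is an alternative sketch on a hypothetical frame named `demo`. Unlike the cells above, it prefixes each new column with `purpose_`.

```python
import pandas as pd

# Hypothetical mini-frame standing in for `loan`
demo = pd.DataFrame({
    "purpose": ["credit_card", "educational", "credit_card"],
    "fico": [707, 682, 737],
})

# Encodes `purpose`, appends the indicator columns, and drops the original column
encoded = pd.get_dummies(demo, columns=["purpose"], drop_first=False)
print(list(encoded.columns))
```

Keeping `drop_first=False` retains one indicator per category, which matches the notebook's choice.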

# Train Test Split


In [None]:
X = loan.drop('not.fully.paid',axis=1)
y = loan['not.fully.paid']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled=sc.transform(X_test)

# Training 1st model


In [47]:
from sklearn.ensemble import RandomForestClassifier

In [48]:
rfc = RandomForestClassifier(criterion="entropy",n_estimators=500,random_state=42)

In [51]:
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [55]:
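# Caution: rfc was fit on the unscaled X_train above, but this predicts on X_test_scaled.
# The mismatch likely explains the near-chance scores recorded below;
# fit and predict should use the same feature space (both scaled or both unscaled).
y_pred = rfc.predict(X_test_scaled)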

In [53]:
from sklearn.metrics import classification_report,confusion_matrix

In [56]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[797 638]
 [272 209]]
              precision    recall  f1-score   support

           0       0.75      0.56      0.64      1435
           1       0.25      0.43      0.31       481

   micro avg       0.53      0.53      0.53      1916
   macro avg       0.50      0.49      0.48      1916
weighted avg       0.62      0.53      0.56      1916
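
To read the matrix above in terms of the selection criterion, it helps to unpack it into its four cells. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so with `not.fully.paid = 1` as the positive class the bottom-left cell is the false-negative count. The sketch re-enters the recorded matrix by hand. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix recorded above: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[797, 638],
               [272, 209]])

# ravel() flattens row by row: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()

# Recall and precision for class 1 (did not pay back in full)
recall_not_paid = tp / (tp + fn)
precision_not_paid = tp / (tp + fp)

print(fn, fp, round(recall_not_paid, 2), round(precision_not_paid, 2))
```

The rounded recall and precision agree with the class 1 row of the classification report.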



# Predictions and Evaluation of 1st model


In [58]:
from sklearn.model_selection import GridSearchCV

In [59]:
param_grid = {'n_estimators': [10, 50, 100, 200, 300, 500, 800]}
rfr = RandomForestClassifier(random_state = 42)

In [60]:
grid = GridSearchCV(estimator = rfr, param_grid = param_grid, cv = 3, n_jobs = 1, verbose = 0, return_train_score=True)

In [61]:
grid.fit(X_train,y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=1,
       param_grid={'n_estimators': [10, 50, 100, 200, 300, 500, 800]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [62]:
grid.best_params_

{'n_estimators': 800}

In [63]:
grid.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [64]:
grid_predictions = grid.predict(X_test)

In [65]:
print(confusion_matrix(y_test,grid_predictions))
print(classification_report(y_test,grid_predictions))

[[1407   28]
 [ 221  260]]
              precision    recall  f1-score   support

           0       0.86      0.98      0.92      1435
           1       0.90      0.54      0.68       481

   micro avg       0.87      0.87      0.87      1916
   macro avg       0.88      0.76      0.80      1916
weighted avg       0.87      0.87      0.86      1916



In [None]:
y_test.describe()

# Training 2nd model

In [66]:
from sklearn.svm import SVC

In [67]:
model = SVC(gamma="auto")
# The default value of gamma has been updated to "scale" in scikit-learn since I recorded the class lecture. 

In [68]:
model.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [69]:
predictions = model.predict(X_test)

In [70]:
from sklearn.metrics import classification_report,confusion_matrix

In [71]:
print(confusion_matrix(y_test,predictions))

[[1435    0]
 [ 225  256]]


In [72]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.86      1.00      0.93      1435
           1       1.00      0.53      0.69       481

   micro avg       0.88      0.88      0.88      1916
   macro avg       0.93      0.77      0.81      1916
weighted avg       0.90      0.88      0.87      1916



# Predictions and Evaluation of 2nd model

In [73]:
param_grid = {'C': [0.1,1,10], 'gamma': ["auto","scale"],'kernel': ['poly','linear','rbf']} 

In [74]:
grid = GridSearchCV(SVC(),param_grid,verbose=3)

In [None]:
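# This can take a while. Fit on the standardized features:
# an SVM grid search on unscaled data is far slower.
grid.fit(X_train_scaled,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 18 candidates, totalling 54 fits
[CV] C=0.1, gamma=auto, kernel=poly ..................................


In [None]:
grid.best_params_

In [None]:
grid.best_estimator_

In [None]:
# Predict on the test set transformed with the same scaler used for training
grid_predictions = grid.predict(X_test_scaled)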

In [None]:
print(confusion_matrix(y_test,grid_predictions))

In [None]:
print(classification_report(y_test,grid_predictions))

# Conclusion

Accuracy increased from the single random forest (0.53) to the grid-searched random forest (0.87). The best model uses 800 trees. False negatives (FN), meaning borrowers who did not pay back their loan in full but were predicted to have done so, fell from 272 to 221, a reduction of (272-221)/272 = 18.8%. This is the quantity the selection criterion asks us to minimize. False positives (FP) fell much further, from 638 to 28, so the grid search improved FP more than FN. FN matters more here: an FP costs the lending platform little, whereas an FN is expensive, because the model labels a borrower who still owes money as one who paid in full. Note that the first forest was fit on unscaled features and evaluated on scaled ones, so part of this improvement may come from removing that mismatch rather than from tuning the number of trees.
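
The comparison of the two random forest runs can be checked with a few lines of arithmetic. The counts below are copied from the two confusion matrices recorded earlier in the notebook. The variable names are illustrative.

```python
# Counts taken from the recorded confusion matrices (first forest vs. grid search)
fn_before, fn_after = 272, 221
fp_before, fp_after = 638, 28

# Relative reduction in each error type
fn_reduction = (fn_before - fn_after) / fn_before
fp_reduction = (fp_before - fp_after) / fp_before

print(f"FN reduction: {fn_reduction:.1%}")
print(f"FP reduction: {fp_reduction:.1%}")
```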

I am somewhat concerned that the random forest model may be overfitting this data. However, from what I found online, an accuracy of about 87% is considered acceptable.
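
One way to probe the overfitting concern is to compare training and validation scores, which `GridSearchCV` records in `cv_results_` when `return_train_score=True`. The sketch below uses synthetic data and a shortened grid so it runs quickly. The names `X_demo`, `y_demo` and `search` are illustrative, and the numbers it prints say nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan features and target (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 50]},  # shortened grid, for speed only
    cv=3,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# A large gap between training and validation accuracy suggests overfitting
results = search.cv_results_
for params, train, val in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"]
):
    print(params, round(train, 3), round(val, 3), round(train - val, 3))
```

A training score near 1.0 paired with a clearly lower validation score would support the concern. Similar scores would ease it.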

I could not obtain results for the SVM grid search because my laptop stopped responding after I ran the cell. Sorry about this.
