___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# LendingClub Project 
*This project is based on the Pierian Data data science course.*


In the second part of the LendingClub project, we will take a look at stacked generalization (stacking) models. If you haven't seen the first part, you can access it by clicking on the [LendingClub: Part1 - Basic Analysis and Modeling](https://github.com/PatrykRadon/LendingClub-Project/blob/master/LendingClub:%20Part1%20-%20Basic%20Analysis%20and%20Modeling.ipynb) link.



In this notebook we will try to lean even further towards predicting non-paid loans (the more financially critical aspect), and at the same time we will try to improve on the paid ones. Still, due to limited computational capacity, this is only a demonstration of how far we could potentially take this.

Without further ado, let's do some imports and get to work!

In [268]:
import pandas as pd
import numpy as np

In [62]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

In [63]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [136]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [65]:
from sklearn.ensemble import StackingClassifier

In [48]:
from sklearn.metrics import classification_report, confusion_matrix

In [67]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

In [66]:
lcdf = pd.read_csv('dummy_loan.csv')

In [68]:
X = lcdf.drop('not.fully.paid',axis=1)
y = lcdf['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
X_train_lay1, X_train_lay2, y_train_lay1, y_train_lay2 = train_test_split(X_train, y_train, test_size=0.50)

## Let's stack some classifiers!

In short, let's have some classifiers focus on predicting non-paid loans, while the others simply try to be as accurate as possible!

In [265]:
randf_biased_csf = RandomForestClassifier(n_estimators=4000, max_depth=7, class_weight={0: 1, 1: 2})
randf_csf = RandomForestClassifier(n_estimators=4000, max_depth=4)

xgb_csf = XGBClassifier(
    learning_rate=0.01,
    n_estimators=4000,
    max_depth=3,
    min_child_weight=1,
    gamma=0.18,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27,
    reg_alpha=1)

nb_csf = GaussianNB()

knn1_csf = KNeighborsClassifier(n_neighbors=2)
knn2_csf = KNeighborsClassifier(n_neighbors=4)
knn3_csf = KNeighborsClassifier(n_neighbors=8)
knn4_csf = KNeighborsClassifier(n_neighbors=16)
knn5_csf = KNeighborsClassifier(n_neighbors=32)
knn6_csf = KNeighborsClassifier(n_neighbors=64)

lreg_csf = make_pipeline(StandardScaler(),LogisticRegression())
lreg_biased_csf = make_pipeline(StandardScaler(),LogisticRegression(class_weight={0: 1, 1: 2}))

svc_csf = make_pipeline(StandardScaler(),SVC(kernel='rbf', probability=True))
svc_biased_csf = make_pipeline(StandardScaler(),SVC(kernel='rbf', probability=True, class_weight={0: 1, 1: 2}))
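
The "biased" variants above rely on `class_weight={0: 1, 1: 2}`, which makes every non-paid loan count twice as much as a paid one during training. A minimal sketch of that effect follows. It uses synthetic data from `make_classification` as a stand-in for the loan data, and all names in it (`X_demo`, `plain`, `biased`, and so on) are illustrative assumptions rather than part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumed ~16% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Same model twice: once unweighted, once with class 1 weighted double.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
biased = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 2}).fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_biased = recall_score(y_te, biased.predict(X_te))
print(f"recall on class 1: plain={recall_plain:.2f}, biased={recall_biased:.2f}")
```

On imbalanced data the weighted model would typically flag more positives, usually at some cost in precision; the exact numbers depend on the data.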


**First of all, let's see how our model performs without focusing too much on the non-paid loans.**

For this purpose, we will need an unbiased final estimator, the one that combines the predictions of the base models. In our case, we will use XGBoost.

In [263]:
xgb_l1_csf = XGBClassifier(
    learning_rate=0.01,
    n_estimators=3000,
    max_depth=4,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1)

In [231]:
estimators1 = [('xgb',xgb_csf),
             ('randf', randf_csf),
             ('nb',nb_csf),
             ('knn1',knn1_csf),
             ('knn2',knn2_csf),
             ('knn3',knn3_csf),
             ('knn4',knn4_csf),
             ('knn5',knn5_csf),
             ('knn6',knn6_csf),
             ('svc',svc_csf),
             ('lreg',lreg_csf)]

In [232]:
stacked_csf1 = StackingClassifier(
    estimators=estimators1, final_estimator=xgb_l1_csf
)
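
Roughly speaking, `StackingClassifier` collects cross-validated (out-of-fold) predictions from every base estimator and trains the `final_estimator` on those predictions instead of on the raw features. The sketch below reproduces that idea by hand. It is a simplified illustration under stated assumptions, not the library's implementation: the data is synthetic, two small models stand in for the eleven base estimators above, and names such as `meta_features` and `final_model` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; not the LendingClub set).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

base_models = [GaussianNB(), KNeighborsClassifier(n_neighbors=8)]

# Level 1: out-of-fold probability of class 1 from each base model,
# so no row is predicted by a model that saw it during fitting.
meta_features = np.column_stack([
    cross_val_predict(m, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 2: the final estimator learns how to combine the base predictions.
final_model = LogisticRegression().fit(meta_features, y_demo)
print(meta_features.shape, final_model.coef_.shape)
```

Because the cross-validation happens inside the stacker, a manual split such as `X_train_lay1` / `X_train_lay2` is presumably not required when `StackingClassifier` is fitted on `X_train` directly.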

In [233]:
stacked_model1 = stacked_csf1.fit(X_train, y_train)

In [234]:
y_pred = stacked_model1.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.99      0.91      2411
           1       0.39      0.04      0.07       463

    accuracy                           0.84      2874
   macro avg       0.62      0.51      0.49      2874
weighted avg       0.77      0.84      0.77      2874



In [177]:
# Cross-validate each base model on its own (note: this runs on the test split).
models = [xgb_csf, randf_csf, nb_csf, knn1_csf, knn2_csf, knn3_csf, knn4_csf, knn5_csf, knn6_csf, svc_csf, lreg_csf]
labels = ['XGBoost', 'Random Forest', 'naive Bayes', 'Knn2', 'Knn4', 'Knn8', 'Knn16', 'Knn32', 'Knn64', 'SVC', 'Logistic Regression']
for csf, label in zip(models, labels):
    scores = cross_val_score(csf, X_test, y_test, scoring='accuracy', cv=5)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.83 (+/- 0.01) [XGBoost]
Accuracy: 0.84 (+/- 0.00) [Random Forest]
Accuracy: 0.83 (+/- 0.01) [naive Bayes]
Accuracy: 0.82 (+/- 0.01) [Knn2]
Accuracy: 0.83 (+/- 0.00) [Knn4]
Accuracy: 0.83 (+/- 0.00) [Knn8]
Accuracy: 0.84 (+/- 0.00) [Knn16]
Accuracy: 0.84 (+/- 0.00) [Knn32]
Accuracy: 0.84 (+/- 0.00) [Knn64]
Accuracy: 0.84 (+/- 0.00) [SVC]
Accuracy: 0.84 (+/- 0.00) [Logistic Regression]
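
These accuracies need a baseline to be meaningful. The test set holds 2411 paid and 463 non-paid loans, so always predicting "paid" is already about 84% accurate. The sketch below checks this with scikit-learn's `DummyClassifier` on labels that have the same class counts. The placeholder features and the names `X_demo`, `y_demo` and `baseline` are assumptions made for the example; only the class counts come from the classification report.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Labels with the same class counts as the test set; features are placeholders.
y_demo = np.array([0] * 2411 + [1] * 463)
X_demo = np.zeros((len(y_demo), 1))

# A "model" that always predicts the majority class (paid).
baseline = DummyClassifier(strategy="most_frequent")
acc = cross_val_score(baseline, X_demo, y_demo, scoring="accuracy", cv=5)
rec = cross_val_score(baseline, X_demo, y_demo, scoring="recall", cv=5)

print("Accuracy: %0.2f, recall on class 1: %0.2f" % (acc.mean(), rec.mean()))
```

Since the baseline reaches similar accuracy while catching no non-paid loans at all, a scorer such as `scoring='recall'` or `scoring='f1'` would likely separate the base models more clearly than accuracy does.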


## First observations

Well, it is not worse, that's for sure. It performs roughly on par with our best attempts from part 1.

**But** that is not why we are here. What we want is to bias some estimators towards catching the non-paid loans, and to use the robustness and precision of XGBoost to compensate for them!

So, let's get straight to it!

In [266]:
estimators2 = [('xgb',xgb_csf),
             ('randf', randf_biased_csf),
             ('nb',nb_csf),
             ('knn1',knn1_csf),
             ('knn2',knn2_csf),
             ('knn3',knn3_csf),
             ('knn4',knn4_csf),
             ('knn5',knn5_csf),
             ('knn6',knn6_csf),
             ('svc',svc_biased_csf)]

In [267]:
stacked_csf2 = StackingClassifier(
    estimators=estimators2, final_estimator=lreg_biased_csf
)

In [260]:
stacked_model2 = stacked_csf2.fit(X_train, y_train)

In [261]:
y_pred = stacked_model2.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.96      0.91      2411
           1       0.44      0.16      0.24       463

    accuracy                           0.83      2874
   macro avg       0.65      0.56      0.57      2874
weighted avg       0.79      0.83      0.80      2874



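For comparison, here are the results of Random Forest from part 1 (evaluated on a different random test split, as the support counts show, so the comparison is approximate):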

                   precision    recall  f1-score   support

               0       0.86      0.94      0.90      2431
               1       0.33      0.15      0.20       443

        accuracy                           0.82      2874
       macro avg       0.59      0.55      0.55      2874
    weighted avg       0.78      0.82      0.79      2874

## Success!

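We have improved on nearly every metric! Compared with the Random Forest from part 1, precision on non-paid loans rose from 0.33 to 0.44, recall from 0.15 to 0.16 and f1-score from 0.20 to 0.24, while overall accuracy went from 0.82 to 0.83 and precision on paid loans stayed at 0.86. As shown, combining the robust, flexible XGBoost with some biased estimators can provide us with the benefits of both! As a result, we got a more accurate and more precise estimator that still focuses on the more financially critical aspect.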