H2O GBM tuning guide:  
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb

## Preparation

Use the dataset provided in eLearning.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 1500)

import warnings
warnings.filterwarnings('ignore')

# Extend cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [2]:
#Run once
#!pip install category_encoders
from category_encoders.target_encoder import TargetEncoder

### Load data

In [3]:
X_train = pd.read_csv('C:/Users/monta/OneDrive/Desktop/BUAN_AML/Module 5/SBA_loans_train.csv')
X_test  = pd.read_csv('C:/Users/monta/OneDrive/Desktop/BUAN_AML/Module 5/SBA_loans_test.csv')

In [4]:
X_train.head(n=3)

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,Defaulted
0,Huntsville,AL,35811,"BUSINESS LOAN CENTER, LLC",FL,621310,73,1,2.0,2,1,0,1,N,N,25000.0,0.0,25000.0,21250.0,1
1,SCOTTSDALE,AZ,85254,WELLS FARGO BANK NATL ASSOC,CA,0,84,3,2.0,0,0,0,0,N,N,52000.0,0.0,52000.0,46800.0,1
2,BANGOR,ME,4401,BANGOR SAVINGS BANK,ME,323110,84,9,1.0,0,0,1,1,0,Y,150000.0,0.0,150000.0,127500.0,0


In [5]:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (337186, 20)
Test shape: (112396, 20)


In [6]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 337186 entries, 0 to 337185
Data columns (total 20 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   City               337177 non-null  object 
 1   State              337180 non-null  object 
 2   Zip                337186 non-null  int64  
 3   Bank               336587 non-null  object 
 4   BankState          336583 non-null  object 
 5   NAICS              337186 non-null  int64  
 6   Term               337186 non-null  int64  
 7   NoEmp              337186 non-null  int64  
 8   NewExist           337140 non-null  float64
 9   CreateJob          337186 non-null  int64  
 10  RetainedJob        337186 non-null  int64  
 11  FranchiseCode      337186 non-null  int64  
 12  UrbanRural         337186 non-null  int64  
 13  RevLineCr          335483 non-null  object 
 14  LowDoc             336198 non-null  object 
 15  DisbursementGross  337186 non-null  float64
 16  Ba

## Prepare Dataset

Replace missing values in every column of both X_train and X_test: use zero for numerical variables and "Missing" for categorical variables.

Encode the categorical variables with a target encoder.
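
The notebook's own preparation cell is not shown, so the following is only a minimal sketch of the two steps described above, run on a toy frame. It uses plain pandas group means rather than `category_encoders.TargetEncoder` (which also applies smoothing, so its encoded values would differ). The helper names `fill_missing` and `mean_target_encode` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with 0 in numeric columns and 'Missing' in all other columns."""
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.select_dtypes(exclude="number").columns
    fill_values = {c: 0 for c in num_cols}
    fill_values.update({c: "Missing" for c in cat_cols})
    return df.fillna(fill_values)


def mean_target_encode(train, test, cols, target):
    """Replace each column in `cols` by the per-category target mean learned on train only."""
    tr, tst = train.copy(), test.copy()
    prior = train[target].mean()  # fallback for categories unseen in train
    for col in cols:
        means = train.groupby(col)[target].mean()
        tr[col + "_te"] = train[col].map(means)
        tst[col + "_te"] = test[col].map(means).fillna(prior)
    return tr.drop(columns=cols), tst.drop(columns=cols)


# Toy stand-ins for X_train / X_test (hypothetical values)
toy_train = pd.DataFrame({
    "State": ["AL", "AL", "AZ", None],
    "NoEmp": [1.0, None, 3.0, 9.0],
    "Defaulted": [1, 0, 1, 0],
})
toy_test = pd.DataFrame({
    "State": ["AZ", "ME"],
    "NoEmp": [2.0, 5.0],
    "Defaulted": [0, 1],
})

toy_train, toy_test = fill_missing(toy_train), fill_missing(toy_test)
X_tr_toy, X_tst_toy = mean_target_encode(toy_train, toy_test, ["State"], "Defaulted")

# Separate the target from the features, as the later cells expect (X_tr, Y_tr, ...)
Y_tr_toy = X_tr_toy.pop("Defaulted")
Y_tst_toy = X_tst_toy.pop("Defaulted")
print(X_tr_toy)
print(X_tst_toy)
```

The encoding is learned on the training frame only and then applied to the test frame, so no test labels leak into the features.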

In [8]:
X_tr.head(n=3)

Unnamed: 0,City_te,State_te,Bank_te,BankState_te,RevLineCr_te,LowDoc_te,Zip_te,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,DisbursementGross,BalanceGross,GrAppv,SBA_Appv
0,0.380952,0.167744,0.308181,0.158105,0.146342,0.186457,0.347826,621310,73,1,2.0,2,1,0,1,25000.0,0.0,25000.0,21250.0
1,0.191919,0.200634,0.138341,0.221678,0.146342,0.186457,0.228916,0,84,3,2.0,0,0,0,0,52000.0,0.0,52000.0,46800.0
2,0.125984,0.096586,0.0625,0.076696,0.149252,0.09074,0.088889,323110,84,9,1.0,0,0,1,1,150000.0,0.0,150000.0,127500.0


In [9]:
X_tst.head(n=3)

Unnamed: 0,City_te,State_te,Bank_te,BankState_te,RevLineCr_te,LowDoc_te,Zip_te,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,DisbursementGross,BalanceGross,GrAppv,SBA_Appv
0,0.083335,0.224163,0.073593,0.222704,0.146342,0.09074,0.090913,0,84,1,2.0,0,0,1,0,42000.0,0.0,42000.0,33600.0
1,0.128531,0.149706,0.138341,0.175423,0.146342,0.186457,0.170732,0,84,7,1.0,0,0,1,0,15000.0,0.0,15000.0,13500.0
2,0.170213,0.224163,0.0,0.222704,0.146342,0.186457,0.229167,0,240,19,1.0,15,0,1,0,497000.0,0.0,497000.0,497000.0


In [10]:
print("Train shape:", X_tr.shape)
print("Test shape:", X_tst.shape)

Train shape: (337186, 19)
Test shape: (112396, 19)


## Datasets for all questions

For all questions, use X_tr and X_tst (the datasets after categorical-variable encoding).

## Question 1 - 2 points

Train sklearn `GradientBoostingClassifier` with default parameters and `random_state=0`.
Display:
- AUC on Testing data
- Accuracy on Testing data
- Number of trees for the trained classifier

In [11]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

gbc = GradientBoostingClassifier(random_state=0)

In [12]:
gbc1 = gbc.fit(X_tr,Y_tr)

### Accuracy

In [13]:
print("Accuracy on training set: {:.3f}".format(gbc.score(X_tr, Y_tr)))
print()
print("Accuracy on testing set: {:.3f}".format(gbc.score(X_tst, Y_tst)))
print()

print(classification_report(Y_tst, gbc.predict(X_tst)))

Accuracy on training set: 0.935

Accuracy on testing set: 0.919

              precision    recall  f1-score   support

           0       0.93      0.97      0.95     92689
           1       0.84      0.67      0.74     19707

    accuracy                           0.92    112396
   macro avg       0.89      0.82      0.85    112396
weighted avg       0.92      0.92      0.92    112396



### AUC score

In [14]:
from sklearn import metrics

yscore_tr = gbc1.predict_proba(X_tr)[:, 1]
yscore = gbc1.predict_proba(X_tst)[:, 1]

auc_tr = metrics.roc_auc_score(Y_tr, yscore_tr)
auc_tst = metrics.roc_auc_score(Y_tst, yscore)

print("AUC score for train set is:", auc_tr)
print()
print("AUC score for test set is:", auc_tst)

AUC score for train set is: 0.9696718962677164

AUC score for test set is: 0.9505761225251513


### Number of Trees

In [15]:
print('No. of classes: ', gbc1.n_classes_)
print('Classes: ', gbc1.classes_)
print('No. of features: ', gbc1.n_features_in_)
print('No. of Estimators/Trees: ', len(gbc1.estimators_))

No. of classes:  2
Classes:  [0 1]
No. of features:  19
No. of Estimators/Trees:  100


## Question 2 - 3 points

Train sklearn `GradientBoostingClassifier` with the following parameters:
```
n_estimators=1000, 
learning_rate=0.1,
subsample=0.8,
max_features=0.8,
n_iter_no_change=5,
max_depth=3, 
random_state=0
```

Display:
- AUC on Testing data
- Accuracy on Testing data
- Number of trees for the trained classifier

In [16]:
gbclf2 = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.1,subsample=0.8,
                                    max_features=0.8,n_iter_no_change=5,
                                    max_depth=3, random_state=0)

### AUC

In [17]:
gbclf2 = gbclf2.fit(X_tr,Y_tr)

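yscore_tr2 = gbclf2.predict_proba(X_tr)[:, 1]
# Score the Question 2 model on the test set. The original cell scored gbc1 here,
# so the test AUC recorded below equals the Question 1 value; re-run to refresh it.
yscore_tst2 = gbclf2.predict_proba(X_tst)[:, 1]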

auc_tr2 = metrics.roc_auc_score(Y_tr, yscore_tr2)
auc_tst2 = metrics.roc_auc_score(Y_tst, yscore_tst2)
print("AUC score for train set is:", auc_tr2)
print()
print("AUC score for test set is:", auc_tst2)

AUC score for train set is: 0.9805974692862043

AUC score for test set is: 0.9505761225251513


### Accuracy

In [18]:
print("Accuracy on training set: {:.3f}".format(gbclf2.score(X_tr, Y_tr)))
print()
print("Accuracy on testing set: {:.3f}".format(gbclf2.score(X_tst, Y_tst)))
print()

print(classification_report(Y_tst, gbclf2.predict(X_tst)))

Accuracy on training set: 0.950

Accuracy on testing set: 0.933

              precision    recall  f1-score   support

           0       0.95      0.97      0.96     92689
           1       0.86      0.74      0.79     19707

    accuracy                           0.93    112396
   macro avg       0.90      0.86      0.88    112396
weighted avg       0.93      0.93      0.93    112396



### Number of trees

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
dt_cl = DecisionTreeClassifier(random_state=0)
bag_cl = BaggingClassifier(dt_cl,oob_score=True)
bag_cl.fit(X_tr, Y_tr)

BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=0),
                  oob_score=True)

In [20]:
print('No. of classes: ', gbclf2.n_classes_)
print('Classes: ', gbclf2.classes_)
print('No. of features: ', gbclf2.n_features_in_)
print('No. of Estimators/Trees: ', len(gbclf2.estimators_))

No. of classes:  2
Classes:  [0 1]
No. of features:  19
No. of Estimators/Trees:  437


## Question 3 - 10 points

Use grid search to train at least 16 `GradientBoostingClassifier` models. Tune the following parameters and set `random_state=0`:
```
n_estimators, 
learning_rate,
subsample,
max_features,
n_iter_no_change,
max_depth, 
```
Perform the grid search with `cv=3`.
To speed up training, set `n_jobs` close to the number of available cores on your machine. For example, on a machine with 8 CPU cores, set `n_jobs=4` or `n_jobs=6`.

For the best model (based on AUC) display:
- Model parameters
- AUC on Testing data
- Accuracy on Testing data
- Number of trees for the trained classifier

**Important**: Training will take a long time, because you will be training at least 16x3=48 models. First test that your grid search works on a small subset of the data.
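
A smoke test of that kind could look like the sketch below. The synthetic data from `make_classification` stands in for a small subset of `X_tr`/`Y_tr`, and the grid values are arbitrary small numbers chosen to run quickly; both are assumptions, not the values used in this notebook. Only four of the six parameters are varied here (2 x 2 x 2 x 2 = 16 candidates), and `scoring="roc_auc"` makes the search rank candidates by AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a small subset of (X_tr, Y_tr); an assumption for illustration.
X_small, y_small = make_classification(n_samples=600, n_features=10, random_state=0)

# 2 x 2 x 2 x 2 = 16 candidates; the remaining tuned parameters are held fixed.
param_grid = {
    "n_estimators": [20, 40],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
    "max_depth": [2, 3],
}

grid = GridSearchCV(
    estimator=GradientBoostingClassifier(
        max_features="sqrt", n_iter_no_change=5, random_state=0
    ),
    param_grid=param_grid,
    scoring="roc_auc",  # rank candidates by AUC, as the question asks
    cv=3,
    n_jobs=1,  # raise toward your core count on the real data
)
grid.fit(X_small, y_small)

print("Candidates:", len(grid.cv_results_["params"]))
print("Best params:", grid.best_params_)
print("Best mean CV AUC:", grid.best_score_)
```

Once this runs end to end, the same code can be pointed at the full training data with a larger grid.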

Once you have trained the grid object, save it to disk so that you can retrieve it without retraining. You can find an example of how to save a model in the Project 1 template.
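
The Project 1 template is not reproduced here, so the following is only a sketch of one way to save and restore a fitted grid object, using the standard-library `pickle` module (the cells below use `cloudpickle`, which follows the same `dump`/`load` pattern). The tiny grid, the synthetic data, and the file name `demo_grid.pkl` are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small fitted grid to have something to save (synthetic data, hypothetical grid)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 3]},
    scoring="roc_auc",
    cv=3,
).fit(X_demo, y_demo)

# Write the fitted object to disk (a temporary directory here)
path = os.path.join(tempfile.mkdtemp(), "demo_grid.pkl")
with open(path, "wb") as f:
    pickle.dump(demo_grid, f)

# Later, possibly in a new session: restore it without retraining
with open(path, "rb") as f:
    restored = pickle.load(f)

same = (restored.predict(X_demo) == demo_grid.predict(X_demo)).all()
print("Restored grid reproduces predictions:", same)
```

A pickled model should be loaded with the same scikit-learn version that created it.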

**Optional** questions (if you plan to practice data science, you should make every effort to answer them):
- Why do you think the number of trees for the best model is less than `n_estimators`?
- Think about all the parameters you are tuning with grid search.
- How do those parameters help reduce overfitting?
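
One way to explore the first optional question is to compare a model with early stopping against one without it. The sketch below uses synthetic data and arbitrary settings (assumptions, not this notebook's data). When `n_iter_no_change` is set, scikit-learn holds out `validation_fraction` of the training data and stops adding trees once the validation score has not improved for that many iterations; the number of trees actually fitted is exposed as `n_estimators_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; an assumption for illustration.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Early stopping enabled: up to 1000 trees, stop after 5 rounds without improvement.
early = GradientBoostingClassifier(
    n_estimators=1000,
    n_iter_no_change=5,
    validation_fraction=0.1,
    random_state=0,
).fit(X, y)

# Early stopping disabled: every requested tree is built.
full = GradientBoostingClassifier(
    n_estimators=50,
    n_iter_no_change=None,
    random_state=0,
).fit(X, y)

print("Requested 1000, fitted with early stopping:", early.n_estimators_)
print("Requested 50, fitted without early stopping:", full.n_estimators_)
print("len(estimators_) matches n_estimators_:", len(early.estimators_) == early.n_estimators_)
```

With `n_iter_no_change=None`, as in the best parameters reported further down, no early stopping takes place and the fitted model contains all `n_estimators` trees.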

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import make_scorer
import pickle
import cloudpickle
#creating Scoring parameter: 
scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score),'recall':make_scorer(recall_score)}

# GBM hyperparameters
# gbc_param  = {
#     "n_estimators":[500,1000,1500],
#     "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1],
#     "subsample":[1,.8],
#     "max_features":["log2","sqrt"],
#     "n_iter_no_change":[5,None],
#     "max_depth":[3,5]}

# # Train and validate a Cartesian grid of GBMs
# gbc_grid  = GridSearchCV(cv=3,n_jobs=6,estimator=GradientBoostingClassifier(random_state = 0),param_grid=gbc_param)

# gbc_grid.fit(X_tr,Y_tr)
with open(r"gbc_grid", "rb") as gbc_grid_file:
    gbc_grid = cloudpickle.load(gbc_grid_file)
    
print('n_estimators:', gbc_grid.best_estimator_.get_params()['n_estimators'])
print('Best learning_rate:', gbc_grid.best_estimator_.get_params()['learning_rate'])
print('Best subsample:', gbc_grid.best_estimator_.get_params()['subsample'])
print('Best max_features:', gbc_grid.best_estimator_.get_params()['max_features'])
print('Best n_iter_no_change:', gbc_grid.best_estimator_.get_params()['n_iter_no_change'])
print('Best max_depth:', gbc_grid.best_estimator_.get_params()['max_depth'])

n_estimators: 1500
Best learning_rate: 0.1
Best subsample: 1
Best max_features: log2
Best n_iter_no_change: None
Best max_depth: 5


In [22]:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
# Best GBM hyperparameters found by the grid search

gbc  = GradientBoostingClassifier(n_estimators=1500,learning_rate = 0.1,subsample= 1,max_features= 'log2',
                                  n_iter_no_change= None,max_depth=5)

gbc.fit(X_tr, Y_tr)
gbc.score(X_tst, Y_tst)

0.9403804405850742

### Saving Models

In [23]:
import pickle
import cloudpickle

with open('gbc', 'wb') as gbc_file:
    cloudpickle.dump(gbc, gbc_file)

In [24]:
import pickle
import cloudpickle

with open('gbc_grid', 'wb') as gbc_grid_file:
    cloudpickle.dump(gbc_grid, gbc_grid_file)

### Classification Report

In [25]:

# Predict once and reuse the predictions for the report
GBC_tst_pred = gbc.predict(X_tst)

print(classification_report(Y_tst, GBC_tst_pred))

              precision    recall  f1-score   support

           0       0.95      0.98      0.96     92689
           1       0.87      0.77      0.82     19707

    accuracy                           0.94    112396
   macro avg       0.91      0.87      0.89    112396
weighted avg       0.94      0.94      0.94    112396



### AUC 

In [26]:

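yscore_tr3 = gbc.predict_proba(X_tr)[:, 1]
yscore_tst3 = gbc.predict_proba(X_tst)[:, 1]

auc_tr2 = metrics.roc_auc_score(Y_tr, yscore_tr3)
auc_tst2 = metrics.roc_auc_score(Y_tst, yscore_tst3)

# Display the scores the question asks for
print("AUC score for train set is:", auc_tr2)
print("AUC score for test set is:", auc_tst2)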

### Accuracy

In [27]:
print("Accuracy on training set: {:.3f}".format(gbc.score(X_tr, Y_tr)))
print()
print("Accuracy on testing set: {:.3f}".format(gbc.score(X_tst, Y_tst)))
print()

Accuracy on training set: 0.969

Accuracy on testing set: 0.940



### Number of trees

In [28]:
print('No. of classes: ', gbc.n_classes_)
print('Classes: ', gbc.classes_)
print('No. of features: ', gbc.n_features_in_)
print('No. of Estimators/Trees: ', len(gbc.estimators_))

No. of classes:  2
Classes:  [0 1]
No. of features:  19
No. of Estimators/Trees:  1500


## Question 4 - 5 points

Train a stacked ensemble model using `StackingClassifier` (`from sklearn.ensemble import StackingClassifier`).

A good guide on how to train a stacked ensemble model can be found here: https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/


Stacking is a technique for building a meta-learner: a level-1 model trained on the predictions of at least two level-0 models. To avoid overfitting, the meta-learner is trained on out-of-fold predictions, that is, predictions each level-0 model makes on data it did not see while being fitted.

For example, with cv=3 each level-0 model is fitted three times, each time on 2/3 of the training data. The held-out 1/3 was not used to fit that sub-model, so its predictions on that part are out-of-fold and can be used to train the meta-learner.
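
The out-of-fold idea can be made concrete with `cross_val_predict`, which is also what `StackingClassifier` relies on internally to build the meta-learner's training features. The sketch below is a hand-rolled illustration on synthetic data; the two small level-0 models and the `LogisticRegression` meta-learner are assumptions chosen for speed, not the models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data; an assumption for illustration.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Two hypothetical level-0 models
level0 = [
    GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0),
    GradientBoostingClassifier(n_estimators=40, max_depth=3, random_state=0),
]

# With cv=3, every row is predicted by a sub-model that was fitted on the other 2/3.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=3, method="predict_proba")[:, 1]
    for model in level0
])

# The meta-learner (level 1) is trained on the out-of-fold probabilities only.
meta = LogisticRegression().fit(oof, y)

print("Meta-learner training matrix shape:", oof.shape)
print("Meta-learner coefficients:", meta.coef_)
```

`StackingClassifier` wraps these steps and additionally refits each level-0 model on the full training data for use at prediction time; its `cv` parameter controls the number of folds (5 by default).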

Use two level-0 models:
- Best model from Question 3
- Worst model from Question 3

You only need the model parameters, since sklearn `StackingClassifier` will retrain the models.

Train `StackingClassifier` and produce:
- AUC on Testing data
- Accuracy on Testing data


*Hint*: to find the best/worst model parameters in your `grid_search` object, you can use the code below. Change `grid_search` to the variable name that holds your `GridSearchCV` object.
```
import numpy as np

best_model_idx = np.argmax(grid_search.cv_results_['mean_test_score'])
worst_model_idx = np.argmin(grid_search.cv_results_['mean_test_score']) 
print("Index of worst model:",worst_model_idx)
print("Index of best model:",best_model_idx)

print("Best model params:")
print(grid_search.cv_results_['params'][best_model_idx])
print("")
print("Worst model params:")
print(grid_search.cv_results_['params'][worst_model_idx])
```
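
As a follow-up to the hint, the parameter dictionaries in `cv_results_['params']` can be unpacked straight into new, unfitted estimators for the level-0 models. The sketch below demonstrates this on a tiny synthetic grid; the data, grid values, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Tiny synthetic grid search so the example is self-contained.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [5, 50], "learning_rate": [0.01, 0.1]},
    scoring="roc_auc",
    cv=3,
).fit(X_demo, y_demo)

scores = grid_search.cv_results_["mean_test_score"]
best_params = grid_search.cv_results_["params"][np.argmax(scores)]
worst_params = grid_search.cv_results_["params"][np.argmin(scores)]

# Rebuild unfitted level-0 models from the two parameter dictionaries.
best_model = GradientBoostingClassifier(random_state=0, **best_params)
worst_model = GradientBoostingClassifier(random_state=0, **worst_params)

print("Best params:", best_params)
print("Worst params:", worst_params)
```

The index returned by `np.argmax` on `mean_test_score` corresponds to `grid_search.best_index_`, so `best_params` here matches `grid_search.best_params_`.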

In [29]:
import numpy as np

best_model_idx = np.argmax(gbc_grid.cv_results_['mean_test_score'])
worst_model_idx = np.argmin(gbc_grid.cv_results_['mean_test_score']) 
print("Index of worst model:",worst_model_idx)
print("Index of best model:",best_model_idx)

print("Best model params:")
print(gbc_grid.cv_results_['params'][best_model_idx])
print("")
print("Worst model params:")
print(gbc_grid.cv_results_['params'][worst_model_idx])

Index of worst model: 3
Index of best model: 226
Best model params:
{'learning_rate': 0.1, 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 1500, 'n_iter_no_change': None, 'subsample': 1}

Worst model params:
{'learning_rate': 0.01, 'max_depth': 3, 'max_features': 'log2', 'n_estimators': 500, 'n_iter_no_change': None, 'subsample': 0.8}


## Results

- Index of worst model: 3
- Index of best model: 226
- Best model params: `{'learning_rate': 0.1, 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 1500, 'n_iter_no_change': None, 'subsample': 1}`
- Worst model params: `{'learning_rate': 0.01, 'max_depth': 3, 'max_features': 'log2', 'n_estimators': 500, 'n_iter_no_change': None, 'subsample': 0.8}`

## Both Best and Worst Models

In [31]:
from sklearn.ensemble import GradientBoostingClassifier

bad_gbc = GradientBoostingClassifier(n_estimators=500,learning_rate = 0.01,subsample= .8,max_features= 'log2',
                                  n_iter_no_change= None,max_depth=3)

good_gbc = GradientBoostingClassifier(n_estimators=1500,learning_rate = 0.1,subsample= 1,max_features= 'log2',
                                  n_iter_no_change= None,max_depth=5)

In [32]:
bad_gbc.fit(X_tr, Y_tr)
print(bad_gbc.score(X_tst, Y_tst))


good_gbc.fit(X_tr, Y_tr)
print(good_gbc.score(X_tst, Y_tst))

print('The Bad GBC Classification Report is:',classification_report(Y_tst, bad_gbc.predict(X_tst)))
print()
print('The Good GBC Classification Report is:',classification_report(Y_tst, good_gbc.predict(X_tst)))


0.8957881063383039
0.9405138972917185
The Bad GBC Classification Report is:               precision    recall  f1-score   support

           0       0.91      0.97      0.94     92689
           1       0.80      0.54      0.65     19707

    accuracy                           0.90    112396
   macro avg       0.85      0.76      0.79    112396
weighted avg       0.89      0.90      0.89    112396


The Good GBC Classification Report is:               precision    recall  f1-score   support

           0       0.95      0.98      0.96     92689
           1       0.87      0.78      0.82     19707

    accuracy                           0.94    112396
   macro avg       0.91      0.88      0.89    112396
weighted avg       0.94      0.94      0.94    112396



## Stack model

In [33]:
from sklearn.ensemble import StackingClassifier
estimators = [ ('bad_gbc', bad_gbc), ('good_gbc', good_gbc)]

stack = StackingClassifier(estimators=estimators, final_estimator=GradientBoostingClassifier())



In [34]:
stack

StackingClassifier(estimators=[('bad_gbc',
                                GradientBoostingClassifier(learning_rate=0.01,
                                                           max_features='log2',
                                                           n_estimators=500,
                                                           subsample=0.8)),
                               ('good_gbc',
                                GradientBoostingClassifier(max_depth=5,
                                                           max_features='log2',
                                                           n_estimators=1500,
                                                           subsample=1))],
                   final_estimator=GradientBoostingClassifier())

In [35]:
stack.fit(X_tr, Y_tr)

pred = stack.predict(X_tst)

In [36]:
pred

array([0, 0, 0, ..., 1, 0, 0], dtype=int64)

In [37]:
print('The Stack Classification Report is:',classification_report(Y_tst, pred))


The Stack Classification Report is:               precision    recall  f1-score   support

           0       0.96      0.97      0.96     92689
           1       0.86      0.79      0.82     19707

    accuracy                           0.94    112396
   macro avg       0.91      0.88      0.89    112396
weighted avg       0.94      0.94      0.94    112396



In [38]:
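yscore_stack = stack.predict_proba(X_tst)[:, 1]

# Stack AUC on the test set. The original cell printed auc_tr2, a train-set AUC
# from an earlier cell, which is the value recorded below; re-run to refresh it.
auc_stack_tst = metrics.roc_auc_score(Y_tst, yscore_stack)
print("AUC score for test set is:", auc_stack_tst)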

AUC score for train set is: 0.9913303008325828
