# Project 2 Homework

For this project you're going to apply hyperparameter optimization to both a regression and a classification problem. It looks like a lot to do below, but it's mostly a matter of modifying code from the presentation. 

## Objective

For each of the models in problems 1 and 2 below, apply the following 4 tuning methods from the presentation: GridSearchCV, RandomizedSearchCV, BayesianOptimization, and TPOT.
* **For TPOT**: In Problem 1 do only hyperparameter optimization. In Problem 2 do **both** hyperparameter optimization and also run TPOT and let it choose the model. See the presentation for examples of both.

### What to submit

For each problem you need to include the following:

1. A pandas table that reports:
    * The best parameters for each tuning method
    * The optimized score from the test data
    * The number of model fits used in the optimization
2. A brief discussion about which hyperparameter optimization approach worked best

### Notes:
* **For problem 1**: your pandas table should include the best parameters for each of the 4 tuning methods above.
* **For problem 2**: your pandas table should include the best parameters for each of the 5 tuning methods (the 4 methods above and the TPOT model search).
* **For GridSearchCV**: you should include at least 2 or 3 values for each hyperparameter and one of those values should be the default.
* **For BayesianOptimization**: you'll have to use `int()` or `bool()` to cast the float values of the hyperparameters inside your `cv_score()` function.
* **For TPOT**: you should use a finer grid than for GridSearchCV, but not more than 10 to 20 possible values for each hyperparameter.  You could lower the number of possible values to keep the search space smaller.
    * If your code is too slow you can reduce the number of cross-validation folds to 3 and if your dataset is really large you can randomly choose a smaller subset of the rows.
* Use section headers to label your work.  Your summary / discussion should be more than simply "XYZ is the best model", but it also shouldn't be more than a few paragraphs and a table.
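
The casting note for BayesianOptimization above can be sketched as follows. This is a minimal illustration, assuming the optimizer hands back every hyperparameter as a float; the helper name `cast_hyperparameters` and the ordering of the values are hypothetical, not taken from the presentation.

```python
def cast_hyperparameters(values):
    """Convert a row of floats into typed random forest hyperparameters.

    `values` is assumed (for illustration) to be ordered as
    (n_estimators, min_samples_split, min_samples_leaf, max_features, bootstrap).
    """
    return {
        "n_estimators": int(values[0]),
        "min_samples_split": int(values[1]),
        "min_samples_leaf": int(values[2]),
        "max_features": float(values[3]),  # stays continuous
        "bootstrap": bool(values[4]),      # 1.0 -> True, 0.0 -> False
    }

example = cast_hyperparameters([150.0, 2.0, 1.0, 0.34, 1.0])
print(example)
```

The resulting dictionary can be unpacked straight into a model constructor with `**`, which is why the keys match the hyperparameter names.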


### Regarding data

* You can use either the specified dataset or you can choose your own.  
    * If you use your own data it should have at least 500 rows and 10 features.  
    * If your data has categorical features you'll need to "one hot" encode them (convert each categorical feature into multiple binary features).  <a href="https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/">Here is a nice tutorial</a>.  For categories with only two values you can remove one of the two one-hot encoded columns.
* If you do want to use your own data, we suggest first getting things working with the suggested datasets.  Finding, cleaning, and preparing data can take a lot of time.
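
One-hot encoding as described above might look like the sketch below, assuming pandas is available. The column names and values are made up for illustration. Note that `drop_first=True` drops the first level of every encoded column, which for a two-valued category leaves a single binary column.

```python
import pandas as pd

# Hypothetical toy data: one numeric feature and two categorical features
loans = pd.DataFrame({
    "amount": [1000, 2500, 400],
    "term": ["short", "long", "short"],       # two categories
    "purpose": ["car", "home", "education"],  # three categories
})

# Convert each categorical column into binary indicator columns
encoded = pd.get_dummies(loans, columns=["term", "purpose"], drop_first=True)
print(list(encoded.columns))
```

If only the two-valued categories should lose a column, encode those separately with `drop_first=True` and the remaining ones with the default `drop_first=False`.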

# Problem 1 - Optimize Random Forest Regression

### Find optimized hyperparameters for a random forest regression model. 

You may use either the diabetes data used in the presentation or a dataset that you choose.  **You do not need to include the TPOT general search for this problem** (use TPOT to optimize RandomForestRegressor, but don't run TPOT to choose a model). Here are ranges for a subset of the hyperparameters:

Hyperparameter |Type | Default Value | Typical Range
---- | ---- | ---- | ----
n_estimators | discrete / integer | 100 | 10 to 150
max_features | continuous / float | 1.0 | 0.05 to 1.0
min_samples_split | discrete / integer | 2 | 2 to 20
min_samples_leaf | discrete / integer | 1 | 1 to 20
bootstrap | discrete / boolean | True | True, False


You can add other hyperparameters to the optimization if you wish.
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">Documentation for sklearn RandomForestRegressor</a>

<font color = "blue"> *** 15 points: </font>

In [1]:
#REGRESSION PROBLEM

#Imports
import numpy as np
import pandas as pd

#Loading and formatting the data
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = np.array(diabetes.data)
y = np.array(diabetes.target)

#Setting up results df
results = pd.DataFrame(None, index=["GridSearch", "RandomSearch", "BayesianOptimization", "tpot"], columns=["bootstrap","max_features","min_samples_leaf","min_samples_split","n_estimators",
                                      "Best R-Squared","Best MSE","Best Root MSE","Number of fits"])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=394) #my results will differ from lecture

#creating the Random Forest Regression model
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(random_state=0)
rf_model.fit(X_train,y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

In [3]:
#*******************GRID SEARCH********************

from sklearn.model_selection import GridSearchCV

# define the grid of hyperparams
params = {
    "n_estimators": [25, 50, 75, 100, 125],
    "max_features": [0.1, 0.5, 1.0],  # floats are fractions of the features; an integer 1 would mean one feature per split
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 5, 10, 15],
    "bootstrap": [True, False]
}

# setup the grid search
grid_search = GridSearchCV(rf_model,
                           param_grid=params,
                           cv=5,
                           verbose=1,
                           n_jobs=1,
                           return_train_score=True)

#we fit to training data first
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 480 candidates, totalling 2400 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 2400 out of 2400 | elapsed:  4.0min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=10, n_jobs=None,
                                             oob_score=False, random_state=0,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=1,
             param_grid={'bootstrap': [True, False],
             

In [4]:
#Save the best hyperparams, matching each one to its column by name
for name, value in grid_search.best_params_.items():
    results.at["GridSearch", name] = value
#save the number of model fits into the dataframe manually
results.at["GridSearch","Number of fits"] = 5*3*4*4*2*5
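
The hand-computed product above can also be derived from the grid itself. This is a small sketch using only the standard library; the function name `count_grid_fits` is introduced here for illustration.

```python
from math import prod

def count_grid_fits(param_grid, n_folds):
    """Number of cross-validation fits in an exhaustive grid search."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * n_folds

params = {
    "n_estimators": [25, 50, 75, 100, 125],
    "max_features": [0.1, 0.5, 1.0],
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 5, 10, 15],
    "bootstrap": [True, False],
}
n_fits = count_grid_fits(params, n_folds=5)
print(n_fits)  # 2400
```

This counts cross-validation fits only; the final refit of the best candidate on the full training set is not included.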

In [5]:
#function definition
def my_regression_results(model):
    score_test = model.score(X_test,y_test)

    y_pred = model.predict(X_test)

    from sklearn.metrics import mean_squared_error
    mse = mean_squared_error(y_test,y_pred)
    rmse = np.sqrt(mse)
    
    return [round(score_test,4), round(mse,4), round(rmse,4)]
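
For reference, the three statistics returned by `my_regression_results` can be worked out by hand. The sketch below uses plain Python and made-up numbers; for a scikit-learn regressor, `model.score` is the coefficient of determination (R-squared). The function name `regression_stats` is introduced here for illustration.

```python
from math import sqrt

def regression_stats(y_true, y_pred):
    """Return (R-squared, MSE, RMSE) for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return 1 - ss_res / ss_tot, mse, sqrt(mse)

r2, mse, rmse = regression_stats([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(round(r2, 4), round(mse, 4), round(rmse, 4))
```

Because RMSE is in the same units as the target, it is usually the easiest of the three to interpret.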

In [6]:
#apply my_regression_results to test data
stats = my_regression_results(grid_search)
results.at["GridSearch", "Best R-Squared"] = stats[0]
results.at["GridSearch", "Best MSE"] = stats[1]
results.at["GridSearch", "Best Root MSE"] = stats[2]

In [7]:
#*************RANDOM SEARCH*****************

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

params = {
    "n_estimators": randint(10, 150),
    "max_features": uniform(0.05, 0.95),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
    "bootstrap": [True, False]
}

random_search = RandomizedSearchCV(
    rf_model,
    param_distributions=params,
    random_state=394, # like setting the seed
    n_iter=25, #just checking 25 randomly selected sets of hyperparameters
    cv=5,
    verbose=1,
    n_jobs=1,
    return_train_score=True)

random_search.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 125 out of 125 | elapsed:   14.4s finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=10, n_jobs=None,
                                                   oob_score=False,
                                                   random_state=0...
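
The `scipy.stats` distributions in the random search space above take `(loc, scale)` and `(low, high)` arguments, which is easy to misread. The sketch below, assuming SciPy is installed, samples from two of them to show the ranges actually covered: `uniform(0.05, 0.95)` spans 0.05 to 1.0, and `randint(10, 150)` never returns 150 because its upper bound is exclusive.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale]
max_features_dist = uniform(0.05, 0.95)
# randint(low, high) covers the integers low, ..., high - 1
n_estimators_dist = randint(10, 150)

feature_draws = max_features_dist.rvs(size=1000, random_state=0)
estimator_draws = n_estimators_dist.rvs(size=1000, random_state=0)

print(feature_draws.min(), feature_draws.max())
print(estimator_draws.min(), estimator_draws.max())
```

To make 150 a possible value for `n_estimators`, the distribution would be `randint(10, 151)`.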


In [8]:
#Save the best hyperparams, matching each one to its column by name
for name, value in random_search.best_params_.items():
    results.at["RandomSearch", name] = value
#save the number of model fits into the dataframe manually
results.at["RandomSearch","Number of fits"] = 25*5

In [9]:
#apply my_regression_results to test data
stats = my_regression_results(random_search)
results.at["RandomSearch", "Best R-Squared"] = stats[0]
results.at["RandomSearch", "Best MSE"] = stats[1]
results.at["RandomSearch", "Best Root MSE"] = stats[2]

In [10]:
#*************BAYESIAN OPTIMIZATION*****************

np.random.seed(394) 
from GPyOpt.methods import BayesianOptimization
from sklearn.model_selection import cross_val_score, KFold

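# For 'discrete' variables GPyOpt reads 'domain' as the set of allowed values,
# so integer ranges are enumerated; 'continuous' domains are (lower, upper) bounds.
hp_bounds = [{
    'name': 'n_estimators',
    'type': 'discrete',
    'domain': tuple(range(10, 151))
}, {
    'name': 'min_samples_split',
    'type': 'discrete',
    'domain': tuple(range(2, 21))
}, {
    'name': 'min_samples_leaf',
    'type': 'discrete',
    'domain': tuple(range(1, 21))
}, {
    'name': 'max_features',
    'type': 'continuous',
    'domain': (0.05, 1)
}, {
    'name': 'bootstrap',
    'type': 'discrete',
    'domain': (True, False)
}]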

def cv_score(hyp_parameters):
    hyp_parameters = hyp_parameters[0] # GPyOpt passes a 2D array with one row per point; take that row
    rf_model = RandomForestRegressor(n_estimators=int(hyp_parameters[0]),
                                 min_samples_split=int(hyp_parameters[1]),
                                 min_samples_leaf=int(hyp_parameters[2]),
                                 max_features=hyp_parameters[3],
                                 bootstrap=bool(hyp_parameters[4]))
    scores = cross_val_score(rf_model,
                             X=X_train,
                             y=y_train,
                             cv=KFold(n_splits=5))
    return np.array(scores.mean())  # return average of 5-fold scores


optimizer = BayesianOptimization(f=cv_score, #Here is our cv_score function
                                 domain=hp_bounds, #The argument to our cv_score function
                                 model_type='GP',
                                 acquisition_type='EI',
                                 acquisition_jitter=0.05,
                                 exact_feval=True,
                                 maximize=True,
                                 verbosity=True)

optimizer.run_optimization(max_iter=20,verbosity=True)

num acquisition: 1, time elapsed: 2.16s
num acquisition: 2, time elapsed: 3.79s
num acquisition: 3, time elapsed: 5.45s
num acquisition: 4, time elapsed: 7.70s
num acquisition: 5, time elapsed: 9.22s
num acquisition: 6, time elapsed: 11.49s
num acquisition: 7, time elapsed: 14.04s
num acquisition: 8, time elapsed: 17.52s
num acquisition: 9, time elapsed: 19.63s
num acquisition: 10, time elapsed: 21.49s
num acquisition: 11, time elapsed: 23.56s
num acquisition: 12, time elapsed: 25.72s
num acquisition: 13, time elapsed: 28.07s
num acquisition: 14, time elapsed: 31.39s
num acquisition: 15, time elapsed: 34.45s
num acquisition: 16, time elapsed: 37.43s
num acquisition: 17, time elapsed: 39.97s
num acquisition: 18, time elapsed: 43.63s
num acquisition: 19, time elapsed: 46.27s
num acquisition: 20, time elapsed: 48.72s
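
One detail worth checking in a GPyOpt search space: as I understand the GPyOpt documentation, the `domain` of a `discrete` variable is the set of allowed values rather than a lower and upper bound. A domain written as `(10, 150)` would therefore permit only the values 10 and 150. The sketch below builds enumerated domains in plain Python; the helper `integer_domain` and the list name `hp_bounds_sketch` are hypothetical, and GPyOpt itself is not imported.

```python
def integer_domain(low, high):
    """Enumerate every integer from low to high inclusive as a tuple."""
    return tuple(range(low, high + 1))

hp_bounds_sketch = [
    {"name": "n_estimators", "type": "discrete", "domain": integer_domain(10, 150)},
    {"name": "min_samples_split", "type": "discrete", "domain": integer_domain(2, 20)},
    {"name": "max_features", "type": "continuous", "domain": (0.05, 1.0)},  # true bounds
]

print(len(hp_bounds_sketch[0]["domain"]))
```

An optimum that sits at the endpoints of every integer range is a hint that the search only ever saw those endpoints.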


In [11]:
#extracting the optimized hyperparams
best_hyp_set = {}
for i in range(len(hp_bounds)):
    if hp_bounds[i]['type'] == 'continuous':
        best_hyp_set[hp_bounds[i]['name']] = optimizer.x_opt[i]
    else:
        best_hyp_set[hp_bounds[i]['name']] = int(optimizer.x_opt[i])

#Save the best hyperparams, matching each one to its column by name
for name, value in best_hyp_set.items():
    results.at["BayesianOptimization", name] = value
#save the number of model fits into the dataframe manually
results.at["BayesianOptimization","Number of fits"] = 25*5

In [12]:
bayopt_search = RandomForestRegressor(**best_hyp_set)
bayopt_search.fit(X_train,y_train)

RandomForestRegressor(bootstrap=1, criterion='mse', max_depth=None,
                      max_features=0.3403767613646175, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=150,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [13]:
#apply my_regression_results to test data
stats = my_regression_results(bayopt_search)
results.at["BayesianOptimization", "Best R-Squared"] = stats[0]
results.at["BayesianOptimization", "Best MSE"] = stats[1]
results.at["BayesianOptimization", "Best Root MSE"] = stats[2]

In [14]:
####################################################
#********************  TPOT  **********************
####################################################

from tpot import TPOTRegressor

tpot_config = {
    'sklearn.ensemble.RandomForestRegressor': {
        "n_estimators": [10, 25, 50, 75, 100, 125, 150],
        "max_features": [0.05, 0.1, 0.3, 0.5, 0.7, 1.0],
        "min_samples_split": [2, 5, 7, 10, 12, 15, 17, 20],
        "min_samples_leaf": [1, 3, 5, 7, 10, 12, 15, 17, 20],
        "bootstrap": [True, False]
    }
}

tpot = TPOTRegressor(generations=5,
                     population_size=20,
                     verbosity=2,
                     config_dict=tpot_config,
                     cv=5,
                     scoring='r2',
                     random_state=394)
tpot.fit(X_train, y_train)
tpot.export('tpot_RandomForestRegressor.py') # export the model


Generation 1 - Current best internal CV score: 0.4486717031231568
Generation 2 - Current best internal CV score: 0.4520778668841704
Generation 3 - Current best internal CV score: 0.45739544435638324
Generation 4 - Current best internal CV score: 0.45739544435638324
Generation 5 - Current best internal CV score: 0.45739544435638324

Best pipeline: RandomForestRegressor(RandomForestRegressor(RandomForestRegressor(input_matrix, bootstrap=True, max_features=1.0, min_samples_leaf=12, min_samples_split=12, n_estimators=10), bootstrap=True, max_features=0.1, min_samples_leaf=1, min_samples_split=2, n_estimators=75), bootstrap=True, max_features=1.0, min_samples_leaf=7, min_samples_split=10, n_estimators=125)


In [0]:
#Best model results from TPOT (per Piazza post @512, it is not practical to save to my 'results' dataframe)

#from "tpot_RandomForestRegressor.py"

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator


tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=394)


exported_pipeline = make_pipeline(
    StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=1.0, min_samples_leaf=12, min_samples_split=12, n_estimators=10)),
    StackingEstimator(estimator=RandomForestRegressor(bootstrap=True, max_features=0.1, min_samples_leaf=1, min_samples_split=2, n_estimators=75)),
    RandomForestRegressor(bootstrap=True, max_features=1.0, min_samples_leaf=7, min_samples_split=10, n_estimators=125)
)

exported_pipeline.fit(training_features, training_target)
tpot_predictions = exported_pipeline.predict(testing_features)  # separate name so the `results` DataFrame is not overwritten


In [18]:
stats = my_regression_results(tpot)

results.at["tpot", "Best R-Squared"] = stats[0]
results.at["tpot", "Best MSE"] = stats[1]
results.at["tpot", "Best Root MSE"] = stats[2]
results.at["tpot","Number of fits"] = 6*20*5
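
The TPOT fit count above follows from how many pipelines the genetic search evaluates: the initial population plus one batch of offspring per generation, each scored with cross-validation. The sketch below shows that arithmetic, assuming the offspring size equals the population size (TPOT's default when `offspring_size` is not set); the function name `count_tpot_fits` is introduced for illustration.

```python
def count_tpot_fits(generations, population_size, n_folds, offspring_size=None):
    """Cross-validation fits used by a TPOT run with the given settings."""
    if offspring_size is None:
        offspring_size = population_size  # assumed default
    n_pipelines = population_size + generations * offspring_size
    return n_pipelines * n_folds

print(count_tpot_fits(generations=5, population_size=20, n_folds=5))  # 600
```

The 120 pipelines implied here match the `max=120` shown by the TPOT progress bar for this run.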

In [19]:
results

Method | bootstrap | max_features | min_samples_leaf | min_samples_split | n_estimators | Best R-Squared | Best MSE | Best Root MSE | Number of fits
---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----
GridSearch | True | 0.5 | 1.0 | 5.0 | 100.0 | 0.398 | 3818.58 | 61.7947 | 2400
RandomSearch | True | 0.399503 | 1.0 | 7.0 | 133.0 | 0.3996 | 3808.72 | 61.7148 | 125
BayesianOptimization | 1 | 0.340377 | 1.0 | 2.0 | 150.0 | 0.4148 | 3711.98 | 60.9261 | 125
tpot |  |  |  |  |  | 0.3467 | 4144.2 | 64.3755 | 600


### Summary:
<font color = "blue"> *** 5 points: </font>

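The Random Forest model does not appear to fit the diabetes dataset as well as the XGBoost model from the Project handout. Even after tuning the hyperparameters with four different methods, the test-set statistics from each fit were consistently worse than those of the XGBoost model.

Among the four methods, Bayesian optimization gave the best test R-squared (0.4148) using 125 fits. Random search was close behind (0.3996, also 125 fits), while grid search reached a similar 0.398 but needed 2400 fits. TPOT scored lowest on the test data (0.3467 with 600 fits). For this problem the cheaper random and Bayesian searches did at least as well as the exhaustive grid at a small fraction of the cost.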

# Problem 2 - Optimize XGBoost Classifier

### Find optimized hyperparameters for an xgboost classifier model. 

This problem contains 5 parts.


### Notes:

#### About the data
The first cell below loads a subset of the loans default data from DS705 and your job is to predict whether a loan defaults or not.  The `status_Bad` column is the target column and a 1 indicates a loan that defaulted.  We have selected a subset of the original data that includes 2000 each of good and bad loans.  The data has already been cleaned and encoded.  You're welcome to look into a different dataset, but start by getting this working and then add your own data.

#### This is classification, not regression
The score for each model will be accuracy and not MSE.  Your summary table should include accuracy, sensitivity, and precision for each optimized model applied to the test data.  <a href="https://classeval.wordpress.com/introduction/basic-evaluation-measures/">Here is a nice overview of metrics for binary classification data</a> that includes definitions of accuracy and the other measures.

For the models you'll mostly just need to change 'regressor' to 'classifier', e.g. `XGBClassifier` instead of `XGBRegressor`.


Hyperparameter | Type | Default Value | Typical Range
---- | ---- | ---- | ----
n_estimators | discrete / integer | 100 | 50 to 150
max_depth | discrete / integer | 3| 1 to 10
min_child_weight | discrete / integer | 1 | 1 to 20
learning_rate | continuous / float | 0.1 | 0.001 to 1
subsample | continuous / float | 1 | 0.05 to 1
reg_lambda | continuous / float | 1 | 0 to 5
reg_alpha  | continuous / float | 0 | 0 to 5

## Part 1: Loading the data

In [1]:
#CLASSIFICATION PROBLEM

# Do not change this cell for loading and preparing the data
import pandas as pd
import numpy as np

X = pd.read_csv('./data/loans_subset.csv')

# split into predictors and target
# convert to numpy arrays for xgboost, OK for other models too
y = np.array(X['status_Bad']) # 1 for bad loan, 0 for good loan
X = np.array(X.drop(columns = ['status_Bad']))

# split into test and training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0) #we select 10% for the test dataset

## Part 2

Write a function called `my_classifier_results` modeled after `my_regression_results` that applies a model to the test data and prints out the accuracy, sensitivity, precision, and the confusion matrix.  There is no need to make a plot.

<font color = "blue"> *** 5 points - (don't delete this cell) </font>

In [2]:
def my_classifier_results(model):
    accuracy = model.score(X_test,y_test)
    
    
    from sklearn.metrics import confusion_matrix
    import pandas as pd

    y_pred = model.predict(X_test)
    
    #must put true before predictions in confusion matrix function
    cmtx = pd.DataFrame(
        confusion_matrix(y_test, y_pred, labels=[1,0]), 
        index=['actual:bad', 'actual:good'], 
        columns=['pred:bad','pred:good']
    )
    display(cmtx)
    
    sensitivity = cmtx['pred:bad']['actual:bad']/(cmtx['pred:bad']['actual:bad']+cmtx['pred:good']['actual:bad'])
    
    precision = cmtx['pred:bad']['actual:bad']/(cmtx['pred:bad']['actual:bad']+cmtx['pred:bad']['actual:good'])
    
    return {"Accuracy":accuracy, "Sensitivity":sensitivity, "Precision":precision} 
    

## Part 3

Start by training some baseline models using default values of the hyperparameters.  We've included logistic regression in a cell below to get you started.  Use `LogisticRegression`, `RandomForestClassifier`, and `GaussianNB` (Gaussian Naive Bayes) from `sklearn`.  Also use `XGBClassifier` from `xgboost` where you may need to include `objective="binary:logistic"` as an option. The default scoring method for all of the `sklearn` classifiers is accuracy. Apply `my_classifier_results` to the test data for each model.

<font color = "blue"> *** 10 points - (don't delete this cell) </font>

<font color="red"> -2 points, should </font>

In [3]:
# We've included this code to get you started
from sklearn.linear_model import LogisticRegression

# we do need to go higher than the default iterations for the solver to get convergence
# and the explicit declaration of the solver avoids a warning message, otherwise
# the parameters are defaults.
logreg_model = LogisticRegression(solver='lbfgs',max_iter=1000)

logreg_model.fit(X_train, y_train)

# Use score method to get accuracy of model
score = logreg_model.score(X_test, y_test) # this is accuracy
print(score)

# obtaining the confusion matrix and making it look nice

from sklearn.metrics import confusion_matrix
import pandas as pd

y_pred = logreg_model.predict(X_test)

# must put true before predictions in confusion matrix function
cmtx = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[1,0]), 
    index=['actual:bad', 'actual:good'], 
    columns=['pred:bad','pred:good']
)
display(cmtx)

0.5475


Unnamed: 0,pred:bad,pred:good
actual:bad,126,71
actual:good,110,93
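
The accuracy, sensitivity, and precision requested for this problem can all be derived from the four counts in a confusion matrix like the one above. The sketch below is plain Python with a hypothetical helper name, `classification_stats`; it treats "bad" (label 1) as the positive class and plugs in the logistic regression counts.

```python
def classification_stats(tp, fn, fp, tn):
    """Accuracy, sensitivity (recall), and precision from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)  # share of actual bad loans that were caught
    precision = tp / (tp + fp)    # share of predicted bad loans that really were bad
    return accuracy, sensitivity, precision

# tp = actual bad, predicted bad; fn = actual bad, predicted good;
# fp = actual good, predicted bad; tn = actual good, predicted good
accuracy, sensitivity, precision = classification_stats(tp=126, fn=71, fp=110, tn=93)
print(round(accuracy, 4), round(sensitivity, 4), round(precision, 4))  # 0.5475 0.6396 0.5339
```

The accuracy agrees with the 0.5475 printed by `logreg_model.score`, which is a quick check that the rows and columns are being read in the right order.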


In [4]:
#Setting up final_results df
final_results = pd.DataFrame(None, index=["Baseline_LR","Baseline_RF", "Baseline_GaussianNB","Baseline_XGBClassifier",
                                          "GridSearch", "RandomSearch", "BayesianOptimization", "TPOT_parameters", "TPOT_model_selection"], 
                                 columns=["learning_rate", "max_depth", "n_estimators", "subsample", "min_child_weight", "reg_lambda","reg_alpha",
                                         "Accuracy","Sensitivity","Precision","Number of fits"])

In [5]:
#*********LOGISTIC REGRESSION BASELINE (using my_classifier_results)*************

stats_dict = my_classifier_results(logreg_model)

#Save the test-set metrics (the baseline has no tuned hyperparameters)
for name, value in stats_dict.items():
    final_results.at["Baseline_LR", name] = value
#save the number of model fits into the dataframe manually
final_results.at["Baseline_LR","Number of fits"] = 1

Unnamed: 0,pred:bad,pred:good
actual:bad,126,71
actual:good,110,93


In [6]:
#*********RANDOM FOREST BASELINE*************
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state=0)
rf_model.fit(X_train,y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [7]:
stats_dict = my_classifier_results(rf_model)

#Save the test-set metrics (the baseline has no tuned hyperparameters)
for name, value in stats_dict.items():
    final_results.at["Baseline_RF", name] = value
#save the number of model fits into the dataframe manually
final_results.at["Baseline_RF","Number of fits"] = 1

Unnamed: 0,pred:bad,pred:good
actual:bad,100,97
actual:good,63,140


In [8]:
#*********GAUSSIAN NAIVE BAYES BASELINE*************
from sklearn.naive_bayes import GaussianNB

gauss_model = GaussianNB()
gauss_model.fit(X_train,y_train)

stats_dict = my_classifier_results(gauss_model)

#Save the test-set metrics (the baseline has no tuned hyperparameters)
for name, value in stats_dict.items():
    final_results.at["Baseline_GaussianNB", name] = value
#save the number of model fits into the dataframe manually
final_results.at["Baseline_GaussianNB","Number of fits"] = 1

Unnamed: 0,pred:bad,pred:good
actual:bad,160,37
actual:good,139,64


In [9]:
#*********XGBClassifier BASELINE*************
import xgboost as xgb

xgbr_model = xgb.XGBClassifier(objective ="binary:logistic")
xgbr_model.fit(X_train,y_train)

stats_dict = my_classifier_results(xgbr_model)

#Save the test-set metrics (the baseline has no tuned hyperparameters)
for name, value in stats_dict.items():
    final_results.at["Baseline_XGBClassifier", name] = value
#save the number of model fits into the dataframe manually
final_results.at["Baseline_XGBClassifier","Number of fits"] = 1

Unnamed: 0,pred:bad,pred:good
actual:bad,132,65
actual:good,70,133


## Part 4

Now use the four hyperparameter optimization techniques on `XGBClassifier` and TPOT general model optimization.  Apply `my_classifier_results` to the test data in each case.
* Feel free to use 3 folds instead of 5 for cross validation to speed things up. 
* Choose a very small number of iterations, population size, etc. until you're sure things are working correctly, then turn up the numbers.  General TPOT optimization will take a while (fair warning: it took about 30 minutes on my MacBook Pro with generations = 10, population_size=40, and cv=5)  
* The hyperparameters to consider are the same as they were in the presentation; they are listed in the table in the notes above.

<font color = "blue"> *** 10 points - (don't delete this cell) </font>

In [10]:
#*******************GRID SEARCH********************

from sklearn.model_selection import GridSearchCV

# define the grid
params = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 4, 6],
    "n_estimators": [75, 125],
    "subsample": [0.25, 0.75],
    "min_child_weight": [7, 13],
    "reg_lambda": [2, 4],
    "reg_alpha": [2, 4]
}

# setup the grid search
grid_search = GridSearchCV(xgbr_model,
                           param_grid=params,
                           cv=3,
                           verbose=1,
                           n_jobs=-1,
                           return_train_score=True)

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 192 candidates, totalling 576 fits


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=-1)]: Done 576 out of 576 | elapsed:  6.1min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='warn', n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.1], 'max_depth': [2, 4, 6],
                         'min_child_weight': [7, 13], 'n_estimators': [75, 125],
         

In [11]:
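#Save the best hyperparams, matching each one to its column by name
#(best_params_ is not in the same order as the dataframe columns)
for name, value in grid_search.best_params_.items():
    final_results.at["GridSearch", name] = value

stats_dict = my_classifier_results(grid_search)

#Save the stats from the test data
for name, value in stats_dict.items():
    final_results.at["GridSearch", name] = value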
#save the number of model fits into the dataframe manually
final_results.at["GridSearch","Number of fits"] = 2*3*2*2*2*2*2*3

Unnamed: 0,pred:bad,pred:good
actual:bad,134,63
actual:good,71,132


In [12]:
#*************RANDOM SEARCH*****************
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

params = {
    "learning_rate": [0.01, 0.1, 0.25, 0.75, 1.0],
    "max_depth": randint(1, 10),
    "n_estimators": randint(50, 100),
    "subsample": uniform(0.05, 0.95),  # so uniform on [.05,.05+.95] = [.05,1.]
    "min_child_weight": randint(1, 20),
    "reg_alpha": uniform(0, 5),
    "reg_lambda": uniform(0, 5)
}

random_search = RandomizedSearchCV(
    xgbr_model,
    param_distributions=params,
    random_state=394, # like setting the seed
    n_iter=25, #just checking 25 randomly selected sets of hyperparameters
    cv=5,
    verbose=1,
    n_jobs=1,
    return_train_score=True)

random_search.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 125 out of 125 | elapsed:  1.5min finished


RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=None,
                                           objective='binary:logistic',
                                           random_state=0, reg_alpha=0...
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f3f8b09bac8>,
                                        'reg_alpha': <scipy.stats._distn_infrastructure.rv_fro

In [13]:
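#Save the best hyperparams, matching each one to its column by name
#(best_params_ is not in the same order as the dataframe columns)
for name, value in random_search.best_params_.items():
    final_results.at["RandomSearch", name] = value

stats_dict = my_classifier_results(random_search)

#Save the stats from the test data
for name, value in stats_dict.items():
    final_results.at["RandomSearch", name] = value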
#save the number of model fits into the dataframe manually
final_results.at["RandomSearch","Number of fits"] = 5*25

Unnamed: 0,pred:bad,pred:good
actual:bad,131,66
actual:good,77,126


In [14]:
#*************BAYESIAN OPTIMIZATION*****************

np.random.seed(394)  # fix the seed so the run is repeatable
from GPyOpt.methods import BayesianOptimization
from sklearn.model_selection import cross_val_score, KFold

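# Search space for GPyOpt. NOTE: for 'discrete' variables GPyOpt reads 'domain'
# as the explicit set of allowed values, not a (low, high) range, so (1, 10)
# offers only 1 and 10. Use e.g. tuple(range(1, 11)) to search every integer.
hp_bounds = [{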
    'name': 'learning_rate',
    'type': 'continuous',
    'domain': (0.001, 1.0)
}, {
    'name': 'max_depth',
    'type': 'discrete',
    'domain': (1, 10)
}, {
    'name': 'n_estimators',
    'type': 'discrete',
    'domain': (50, 100)
}, {
    'name': 'subsample',
    'type': 'continuous',
    'domain': (0.05, 1.0)
}, {
    'name': 'min_child_weight',
    'type': 'discrete',
    'domain': (1, 20)
}, {
    'name': 'reg_alpha',
    'type': 'continuous',
    'domain': (0, 5)
}, {
    'name': 'reg_lambda',
    'type': 'continuous',
    'domain': (0, 5)
}]


# Optimization objective
def cv_score(hyp_parameters):
    hyp_parameters = hyp_parameters[0]  # GPyOpt passes a 2-D array with one row per candidate; keep that row
    xgb_model = xgb.XGBClassifier(objective='binary:logistic',
                                 learning_rate=hyp_parameters[0],
                                 max_depth=int(hyp_parameters[1]),
                                 n_estimators=int(hyp_parameters[2]),
                                 subsample=hyp_parameters[3],
                                 min_child_weight=int(hyp_parameters[4]),
                                 reg_alpha=hyp_parameters[5],
                                 reg_lambda=hyp_parameters[6])
    scores = cross_val_score(xgb_model,
                             X=X_train,
                             y=y_train,
                             cv=KFold(n_splits=5))
    return np.array(scores.mean())  # return average of 5-fold scores


optimizer = BayesianOptimization(f=cv_score, #Here is our cv_score function
                                 domain=hp_bounds,  # search space the optimizer draws candidates from
                                 model_type='GP',
                                 acquisition_type='EI',
                                 acquisition_jitter=0.05,
                                 exact_feval=True,
                                 maximize=True,
                                 verbosity=True)

optimizer.run_optimization(max_iter=20,verbosity=True)

num acquisition: 1, time elapsed: 2.27s
num acquisition: 2, time elapsed: 4.00s
num acquisition: 3, time elapsed: 6.04s
num acquisition: 4, time elapsed: 7.74s
num acquisition: 5, time elapsed: 9.06s
num acquisition: 6, time elapsed: 10.80s
num acquisition: 7, time elapsed: 12.63s
num acquisition: 8, time elapsed: 14.71s
num acquisition: 9, time elapsed: 16.60s
num acquisition: 10, time elapsed: 18.42s
num acquisition: 11, time elapsed: 19.95s
num acquisition: 12, time elapsed: 22.17s
num acquisition: 13, time elapsed: 25.05s
num acquisition: 14, time elapsed: 27.08s
num acquisition: 15, time elapsed: 29.32s
num acquisition: 16, time elapsed: 31.54s
num acquisition: 17, time elapsed: 33.76s
num acquisition: 18, time elapsed: 36.26s
num acquisition: 19, time elapsed: 38.85s
num acquisition: 20, time elapsed: 42.12s
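
GPyOpt hands the objective a 2-D array with one row per candidate, and every entry is a float, which is why `cv_score` indexes `[0]` and casts with `int()`. The sketch below shows that decoding step in isolation using plain NumPy; `decode_candidate` and `param_specs` are hypothetical names, and the sample values are invented.

```python
import numpy as np

# (name, type) pairs in the same order as the search-space definition
param_specs = [
    ("learning_rate", "continuous"),
    ("max_depth", "discrete"),
    ("n_estimators", "discrete"),
    ("subsample", "continuous"),
]


def decode_candidate(x, specs):
    """Convert a (1, n_params) float array into a dict of typed keyword arguments."""
    row = np.atleast_2d(x)[0]  # take the single candidate row
    return {
        name: int(round(value)) if kind == "discrete" else float(value)
        for (name, kind), value in zip(specs, row)
    }


candidate = np.array([[0.1, 3.0, 100.0, 0.8]])  # made-up values
kwargs = decode_candidate(candidate, param_specs)
print(kwargs)
```

A dictionary built this way can be unpacked into the model constructor with `**kwargs`, which avoids relying on positional indices such as `hyp_parameters[4]`.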


In [15]:
best_hyp_set = {}
for i in range(len(hp_bounds)):
    if hp_bounds[i]['type'] == 'continuous':
        best_hyp_set[hp_bounds[i]['name']] = optimizer.x_opt[i]
    else:
        best_hyp_set[hp_bounds[i]['name']] = int(optimizer.x_opt[i])

bayopt_search = xgb.XGBClassifier(objective='binary:logistic',**best_hyp_set)
bayopt_search.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=1.0, max_delta_step=0, max_depth=1,
              min_child_weight=20, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=3.733804277871873, reg_lambda=5.0, scale_pos_weight=1,
              seed=None, silent=None, subsample=1.0, verbosity=1)

In [16]:
#Save the best hyperparams
final_results.at["BayesianOptimization","learning_rate":"reg_alpha"] = best_hyp_set

stats_dict = my_classifier_results(bayopt_search)

#Save the stats from the test data
final_results.at["BayesianOptimization","Accuracy":"Precision"] = stats_dict
#save the number of model fits into the dataframe manually
final_results.at["BayesianOptimization","Number of fits"] = 5*25

| | pred:bad | pred:good |
|---|---|---|
| actual:bad | 130 | 67 |
| actual:good | 65 | 138 |


In [17]:
#************* TPOT - hyperparameter optimization only *****************
    
from tpot import TPOTClassifier

tpot_config = {
    'xgboost.XGBClassifier': {
        'n_estimators': [50, 100, 125],
        'max_depth': range(1, 11),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.],
        'subsample': np.arange(0.05, 1.01, 0.05),
        'min_child_weight': range(1, 21),
        'reg_alpha': range(1, 6),
        'reg_lambda': range(1, 6),
        'nthread': [1],
        'objective': ['binary:logistic']
    }
}

tpot = TPOTClassifier(generations=5,
                     population_size=20,
                     verbosity=2,
                     config_dict=tpot_config,
                     cv=3,
                     random_state=394)
tpot.fit(X_train, y_train)
tpot.export('tpot_XGBClassifier.py') # export the model


Generation 1 - Current best internal CV score: 0.6588888888888889
Generation 2 - Current best internal CV score: 0.6588888888888889
Generation 3 - Current best internal CV score: 0.6588888888888889
Generation 4 - Current best internal CV score: 0.6588888888888889
Generation 5 - Current best internal CV score: 0.6588888888888889

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.001, max_depth=3, min_child_weight=11, n_estimators=125, nthread=1, objective=binary:logistic, reg_alpha=3, reg_lambda=1, subsample=0.3)


In [0]:
# 'tpot_XGBClassifier.py' file contents

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=394)

# Average CV score on the training set was:0.6588888888888889
exported_pipeline = XGBClassifier(learning_rate=0.001, max_depth=3, min_child_weight=11, n_estimators=125, nthread=1, objective="binary:logistic", reg_alpha=3, reg_lambda=1, subsample=0.3)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

In [18]:
stats_dict = my_classifier_results(tpot)

#Save the stats from the test data
final_results.at["TPOT_parameters","Accuracy":"Precision"] = stats_dict
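#save the number of model fits into the dataframe manually
# NOTE: TPOT's documented budget is population_size + generations*offspring_size
# = 20 + 5*20 = 120 pipelines, i.e. up to 120*3 = 360 fits with cv=3; 400 is the value that was recorded
final_results.at["TPOT_parameters","Number of fits"] = 4*20*5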

| | pred:bad | pred:good |
|---|---|---|
| actual:bad | 137 | 60 |
| actual:good | 85 | 118 |


In [19]:
#************* TPOT - including model selection *****************

tpot = TPOTClassifier(generations=5,
                     population_size=20,
                     verbosity=2,
                     cv=3,
                     random_state=394)

tpot.fit(X_train, y_train)
tpot.export('tpot_model_selection.py') # export the model


Generation 1 - Current best internal CV score: 0.6602777777777779
Generation 2 - Current best internal CV score: 0.6644444444444445
Generation 3 - Current best internal CV score: 0.6644444444444445
Generation 4 - Current best internal CV score: 0.6644444444444445
Generation 5 - Current best internal CV score: 0.6644444444444445

Best pipeline: BernoulliNB(GradientBoostingClassifier(VarianceThreshold(input_matrix, threshold=0.01), learning_rate=0.1, max_depth=2, max_features=0.35000000000000003, min_samples_leaf=5, min_samples_split=14, n_estimators=100, subsample=0.8500000000000001), alpha=10.0, fit_prior=False)


In [0]:
# 'tpot_model_selection.py' file contents

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=394)

# Average CV score on the training set was:0.6644444444444445
exported_pipeline = make_pipeline(
    VarianceThreshold(threshold=0.01),
    StackingEstimator(estimator=GradientBoostingClassifier(learning_rate=0.1, max_depth=2, max_features=0.35000000000000003, min_samples_leaf=5, min_samples_split=14, n_estimators=100, subsample=0.8500000000000001)),
    BernoulliNB(alpha=10.0, fit_prior=False)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

In [20]:
stats_dict = my_classifier_results(tpot)

#Save the stats from the test data
final_results.at["TPOT_model_selection","Accuracy":"Precision"] = stats_dict
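#save the number of model fits into the dataframe manually
# NOTE: with cv=3 and 20 + 5*20 = 120 evaluated pipelines, up to 120*3 = 360 fits would be expected; 400 is the value that was recorded
final_results.at["TPOT_model_selection","Number of fits"] = 4*20*5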
final_results

| | pred:bad | pred:good |
|---|---|---|
| actual:bad | 132 | 65 |
| actual:good | 73 | 130 |


| | learning_rate | max_depth | n_estimators | subsample | min_child_weight | reg_lambda | reg_alpha | Accuracy | Sensitivity | Precision | Number of fits |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline_LR | | | | | | | | 0.5475 | 0.639594 | 0.533898 | 1 |
| Baseline_RF | | | | | | | | 0.6 | 0.507614 | 0.613497 | 1 |
| Baseline_GaussianNB | | | | | | | | 0.56 | 0.812183 | 0.535117 | 1 |
| Baseline_XGBClassifier | | | | | | | | 0.6625 | 0.670051 | 0.653465 | 1 |
| GridSearch | 0.1 | 2.0 | 75.0 | 0.75 | 7.0 | 4.0 | 2.0 | 0.665 | 0.680203 | 0.653659 | 576 |
| RandomSearch | 0.1 | 1.0 | 86.0 | 0.779526 | 13.0 | 2.57853 | 0.844422 | 0.6425 | 0.664975 | 0.629808 | 125 |
| BayesianOptimization | 1.0 | 1.0 | 100.0 | 1.0 | 20.0 | 5.0 | 3.7338 | 0.67 | 0.659898 | 0.666667 | 125 |
| TPOT_parameters | | | | | | | | 0.6375 | 0.695431 | 0.617117 | 400 |
| TPOT_model_selection | | | | | | | | 0.655 | 0.670051 | 0.643902 | 400 |
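
The "Number of fits" column can be cross-checked with a little arithmetic. This sketch assumes GPyOpt's default of 5 initial design points (the notebook does not set `initial_design_numdata`) and TPOT's documented budget of `population_size + generations * offspring_size` pipeline evaluations, with `offspring_size` equal to `population_size`. Refits of the final model are not counted, and `cv_fits` is a hypothetical helper.

```python
def cv_fits(n_candidates, n_folds):
    """Each candidate is fit once per cross-validation fold."""
    return n_candidates * n_folds


# Random search: n_iter=25 candidates, cv=5
random_search_fits = cv_fits(25, 5)

# Bayesian optimization: 5 initial points (assumed default) + max_iter=20, cv=5
bayes_fits = cv_fits(5 + 20, 5)

# TPOT: population_size=20, generations=5, offspring_size assumed to be 20, cv=3
tpot_pipelines = 20 + 5 * 20
tpot_fits = cv_fits(tpot_pipelines, 3)

print(random_search_fits, bayes_fits, tpot_pipelines, tpot_fits)
```

Under these assumptions the random search and Bayesian optimization counts agree with the table, while the TPOT budget comes out somewhat below the recorded 400.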


## Part 5 - Summary

* In addition to your summary table, answer:
    * The bank isn't as concerned about misclassifying some truly good loans as they are interested in correctly predicting truly bad loans.  Which model should they use?  Why?

<font color = "blue"> *** 5 points - (don't delete this cell) </font>

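If the bank cares most about correctly identifying the truly bad loans, it should use the model with the highest sensitivity, that is, the largest fraction of truly bad loans that the model flags as bad.

Of all the models tested here, the baseline Gaussian Naive Bayes model with default hyperparameters has the highest sensitivity (0.812). The trade-off is that its accuracy (0.56) and precision (0.535) are among the lowest in the table, so it also flags many good loans as bad, which the bank has said it can tolerate.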
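
The choice can be read straight off the summary table by ranking on sensitivity. The sketch below re-enters the Accuracy and Sensitivity columns from the table by hand (in the notebook they live in `final_results`) and picks the row with the largest sensitivity; the variable names are illustrative.

```python
import pandas as pd

# Values copied from the final results table
summary = pd.DataFrame(
    {
        "Accuracy": [0.5475, 0.6, 0.56, 0.6625, 0.665, 0.6425, 0.67, 0.6375, 0.655],
        "Sensitivity": [0.639594, 0.507614, 0.812183, 0.670051, 0.680203,
                        0.664975, 0.659898, 0.695431, 0.670051],
    },
    index=["Baseline_LR", "Baseline_RF", "Baseline_GaussianNB",
           "Baseline_XGBClassifier", "GridSearch", "RandomSearch",
           "BayesianOptimization", "TPOT_parameters", "TPOT_model_selection"],
)

# Row label of the model that catches the largest share of bad loans
best_for_bad_loans = summary["Sensitivity"].idxmax()

print(summary.sort_values("Sensitivity", ascending=False))
print("Highest sensitivity:", best_for_bad_loans)
```

Ranking on `"Accuracy"` instead would pick a different row, which is why the bank's stated priority matters for the recommendation.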