# <font color='blue'>Parameter Optimization with Randomized Search</font>

Every machine learning model exposes settings that customize how it learns. These settings are called hyperparameters.

In code, a machine learning algorithm is represented by a function or class, and the arguments that customize it are precisely what we call hyperparameters.

It is also common for people to refer to the model's coefficients (found at the end of training) as parameters, so the two terms should not be confused.
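The difference between hyperparameters and learned coefficients can be seen in a minimal sketch. It assumes a synthetic dataset and a `LogisticRegression` model, neither of which is part of this notebook; they are used only because the coefficients are easy to inspect.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, used only for illustration (not the credit data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by us before training
model = LogisticRegression(C=0.5, max_iter=200)
hyperparams = model.get_params()
print(hyperparams["C"], hyperparams["max_iter"])

# Parameters (coefficients): learned from the data during training
model.fit(X_demo, y_demo)
print(model.coef_.shape, model.intercept_.shape)
```

`get_params()` is available before `fit` is called, whereas `coef_` and `intercept_` only exist after training.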

Part of our job as Data Scientists is to find the best combination of hyperparameters for each model.

In Ensemble Methods, we have the hyperparameters of the base estimator and the hyperparameters of the ensemble model:

* Base estimator:

estim_base = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform')

* Ensemble Model:

BaggingClassifier(base_estimator=estim_base,
                  bootstrap=True, bootstrap_features=False, max_features=0.5,
                  max_samples=0.5, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=None, verbose=0,
                  warm_start=False)
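As a rough sketch of how the two levels of hyperparameters can be listed together, `get_params(deep=True)` returns both the ensemble's own settings and the nested settings of the base estimator, the latter with a double-underscore prefix. The base estimator is passed positionally here because the keyword name depends on the scikit-learn version (`base_estimator` in older releases, `estimator` in newer ones).

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Base estimator passed positionally to avoid the version-dependent keyword name
estim_base = KNeighborsClassifier(n_neighbors=5)
ensemble = BaggingClassifier(estim_base, n_estimators=10,
                             max_samples=0.5, max_features=0.5)

# deep=True also returns the base estimator's settings as "<prefix>__<name>"
all_params = ensemble.get_params(deep=True)
top_level = {k: v for k, v in all_params.items() if "__" not in k}
nested = {k: v for k, v in all_params.items() if "__" in k}

print(sorted(top_level))  # hyperparameters of the ensemble model
print(sorted(nested))     # hyperparameters of the base estimator
```

The prefixed names are the ones a search would use to tune the base estimator through the ensemble.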

## Extremely Randomized Forest

Standard model, with manually chosen hyperparameters.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# Loading dataset
data = pd.read_excel('data/credit.xls', skiprows = 1)

In [3]:
print(data)

          ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  \
0          1      20000    2          2         1   24      2      2     -1   
1          2     120000    2          2         2   26     -1      2      0   
2          3      90000    2          2         2   34      0      0      0   
3          4      50000    2          2         1   37      0      0      0   
4          5      50000    1          2         1   57     -1      0     -1   
...      ...        ...  ...        ...       ...  ...    ...    ...    ...   
29995  29996     220000    1          3         1   39      0      0      0   
29996  29997     150000    1          3         2   43     -1     -1     -1   
29997  29998      30000    1          2         2   37      4      3      2   
29998  29999      80000    1          3         1   41      1     -1      0   
29999  30000      50000    1          2         1   46      0      0      0   

       PAY_4  ...  BILL_AMT4  BILL_AMT5  BILL_AMT6 

In [4]:
# setting target
target = 'default payment next month'
y = np.asarray(data[target])

In [5]:
# setting predictors
features = data.columns.drop(['ID', target])
X = np.asarray(data[features])

In [6]:
# Splitting of training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 99)

In [7]:
# Creating classifier
clf = ExtraTreesClassifier(n_estimators = 500, random_state = 99)

In [8]:
# Training model
clf.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=500,
                     n_jobs=None, oob_score=False, random_state=99, verbose=0,
                     warm_start=False)

In [9]:
# Score
scores = cross_val_score(clf, X_train, y_train, cv = 3, scoring = 'accuracy', n_jobs = -1)

In [10]:
# Printing
print ("ExtraTreesClassifier -> Training Accuracy: Average = %0.3f Standard Deviation = %0.3f" % (np.mean(scores), np.std(scores)))

ExtraTreesClassifier -> Training Accuracy: Average = 0.812 Standard Deviation = 0.002


In [11]:
# Predictions
y_pred = clf.predict(X_test)

In [12]:
# Confusion Matrix
confusionMatrix = confusion_matrix(y_test, y_pred)
print (confusionMatrix)

[[6532  446]
 [1273  749]]


In [13]:
# Accuracy
print("Test Accuracy:", accuracy_score(y_test, y_pred))

Test Accuracy: 0.809
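The test accuracy can be recomputed by hand from the confusion matrix printed above. The sketch below hard-codes those four counts and assumes scikit-learn's convention that rows are true classes and columns are predicted classes. The per-class recall is an addition for illustration, not something the notebook computes.

```python
import numpy as np

# Counts copied from the confusion matrix of the manually configured model
cm = np.array([[6532, 446],
               [1273, 749]])

# Accuracy = correctly classified samples (the diagonal) / all samples
accuracy = np.trace(cm) / cm.sum()

# Recall per class = diagonal / number of true samples in each class (row totals)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 3))
print(np.round(recall_per_class, 3))
```

The second class (default next month) is recovered far less often than the first, which overall accuracy alone does not show.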


## Hyperparameter Optimization with Randomized Search

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Randomized Search draws a fixed number of hyperparameter combinations at random from the specified distributions (values given as lists are sampled uniformly). A model is built and evaluated with cross-validation for each sampled combination.
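The sampling step can be looked at in isolation with scikit-learn's `ParameterSampler`. This is a minimal sketch with a hypothetical, smaller search space than the one used in this notebook; no model is trained here.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical search space: 3 * 3 * 2 = 18 possible combinations
param_dist_demo = {"max_depth": [3, 7, None],
                   "min_samples_leaf": [1, 2, 4],
                   "bootstrap": [True, False]}

# Draw 5 of the 18 combinations; random_state makes the draw repeatable
candidates = list(ParameterSampler(param_dist_demo, n_iter=5, random_state=0))

for params in candidates:
    print(params)
```

Each dictionary is one candidate that a randomized search would fit and score. `n_iter` controls the trade-off between run time and how much of the space is explored.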

In [14]:
# Import
from sklearn.model_selection import RandomizedSearchCV

In [15]:
# Setting of parameters
param_dist = {"max_depth": [1, 3, 7, 8, 12, None],
              "max_features": [8, 9, 10, 11, 16, 22],
              "min_samples_split": [8, 10, 11, 14, 16, 19],
              "min_samples_leaf": [1, 2, 3, 4, 5, 6, 7],
              "bootstrap": [True, False]}

# For the classifier created with ExtraTrees, we test 25 randomly sampled combinations of parameters
rsearch = RandomizedSearchCV(clf, param_distributions = param_dist, n_iter = 25, return_train_score = True)

# Fitting the search on the training data; each candidate is scored by cross-validation
rsearch.fit(X_train, y_train)

# Results 
rsearch.cv_results_

# Printing the best estimator
bestclf = rsearch.best_estimator_
print (bestclf)

# Applying the best estimator to carry out the predictions
y_pred = bestclf.predict(X_test)

# Confusion Matrix
confusionMatrix = confusion_matrix(y_test, y_pred)
print(confusionMatrix)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='gini',
                     max_depth=7, max_features=22, max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=6, min_samples_split=11,
                     min_weight_fraction_leaf=0.0, n_estimators=500,
                     n_jobs=None, oob_score=False, random_state=99, verbose=0,
                     warm_start=False)
[[6652  326]
 [1290  732]]
0.8204444444444444


In [16]:
# Getting the results for all sampled parameter combinations
rsearch.cv_results_

{'mean_fit_time': array([ 1.73276591,  5.6754566 ,  1.44881407,  4.86209957,  1.86060635,
         3.19942625,  2.48792402,  1.19794106,  1.12878776,  5.68522199,
         7.05678161,  1.71638417,  4.21872195,  1.65388274,  2.76597857,
         7.6410505 ,  3.42891932,  4.48763871,  8.6336147 ,  1.6180977 ,
        17.13844609,  1.2311066 ,  1.72368709,  1.74332809,  5.33069785]),
 'std_fit_time': array([0.02920974, 0.13275819, 0.02895639, 0.12291092, 0.01744542,
        0.11123478, 0.03371552, 0.05790591, 0.05005262, 0.09638464,
        0.11958016, 0.00921741, 0.01845165, 0.0097986 , 0.01785686,
        0.05675859, 0.05520176, 0.0278813 , 0.11361021, 0.01553819,
        0.48353004, 0.03080309, 0.06647351, 0.10591987, 0.20548049]),
 'mean_score_time': array([0.18082841, 0.50313139, 0.13964534, 0.26077239, 0.17606695,
        0.22832497, 0.23130226, 0.15233994, 0.1467642 , 0.35081633,
        0.57697352, 0.16485286, 0.23387305, 0.16475828, 0.16421469,
        0.50292643, 0.33973002, 0.4
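The raw `cv_results_` dictionary is hard to read, so one option is to load it into a pandas DataFrame and sort by rank. The sketch below assumes a synthetic dataset and a much smaller search than the notebook's, so that it runs quickly on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the credit data, so the sketch is self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(ExtraTreesClassifier(n_estimators=20, random_state=0),
                            param_distributions={"max_depth": [3, 7, None],
                                                 "min_samples_leaf": [1, 2, 4]},
                            n_iter=4, cv=3, random_state=0)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of equal-length arrays, one entry per sampled candidate
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))

# Best candidate and its mean cross-validated score
print(search.best_params_, search.best_score_)
```

The same pattern should apply to `rsearch.cv_results_` from the cell above.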

## Grid Search vs. Randomized Search for Hyperparameter Estimation

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Grid Search evaluates every combination of the candidate values supplied for each hyperparameter, which together form a grid.
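To get a feel for how quickly a grid grows, the combinations can be enumerated with `ParameterGrid` before running any search. This sketch uses the same candidate values as the random forest grid in this section. The fold count is an assumption for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid used for the random forest in this section
param_grid_demo = {"max_depth": [3, None],
                   "max_features": [1, 3, 10],
                   "min_samples_leaf": [1, 3, 10],
                   "bootstrap": [True, False],
                   "criterion": ["gini", "entropy"]}

grid = ParameterGrid(param_grid_demo)

# Number of combinations: 2 * 3 * 3 * 2 * 2 = 72
print(len(grid))

# With k-fold cross-validation, each combination is fitted k times
cv_folds = 5  # assumed fold count, for illustration only
print(len(grid) * cv_folds)
```

Adding one more value to any hyperparameter multiplies the total, which is why a randomized search with a fixed `n_iter` is often cheaper.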

In [17]:
import numpy as np
from time import time
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits

# Gets the dataset
digits = load_digits()
X, y = digits.data, digits.target

# Building the classifier
clf = RandomForestClassifier(n_estimators = 20)

In [18]:
# Randomized Search

# Values of the parameters that will be tested
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Running Randomized Search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, 
                                   param_distributions = param_dist, 
                                   n_iter = n_iter_search,
                                   return_train_score=True)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV ran in %.2f seconds for %d candidates for model parameters."
      % ((time() - start), n_iter_search))

# Prints the combinations of the parameters and their respective accuracy means
random_search.cv_results_

RandomizedSearchCV ran in 2.51 seconds for 20 candidates for model parameters.


{'mean_fit_time': array([0.03226606, 0.03018363, 0.03576922, 0.05826879, 0.03517334,
        0.02979279, 0.0251437 , 0.03163997, 0.02345339, 0.05700334,
        0.02769613, 0.0297219 , 0.03892167, 0.03199426, 0.03271087,
        0.02482438, 0.02065023, 0.04467638, 0.01982141, 0.03476063]),
 'std_fit_time': array([0.00149683, 0.00163327, 0.00141961, 0.01284446, 0.00469383,
        0.00046   , 0.00019163, 0.00076863, 0.00064321, 0.0014836 ,
        0.00019763, 0.00128362, 0.00027435, 0.00036117, 0.00084605,
        0.00031218, 0.00030193, 0.00050351, 0.0004298 , 0.00018396]),
 'mean_score_time': array([0.00354131, 0.00276637, 0.00340072, 0.00423392, 0.00338666,
        0.00343529, 0.00259709, 0.00276828, 0.0026087 , 0.00313926,
        0.0026269 , 0.00281437, 0.00305533, 0.00255203, 0.00259693,
        0.00257484, 0.00300376, 0.00376336, 0.00277996, 0.00300956]),
 'std_score_time': array([1.51585767e-04, 9.51408635e-05, 1.75850723e-04, 7.09208097e-04,
        9.74909659e-04, 4.16790516e-
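In the randomized search above, `sp_randint(1, 11)` is a distribution rather than a list. A minimal sketch of what it produces: integers drawn uniformly from 1 to 10, with the upper bound excluded. The sample size and seed are arbitrary choices for illustration.

```python
from scipy.stats import randint as sp_randint

# sp_randint(low, high) is a discrete uniform distribution over low, ..., high - 1
dist = sp_randint(1, 11)

# Draw many values to see the range; a search draws one value per candidate
samples = dist.rvs(size=1000, random_state=0)

print(samples.min(), samples.max())
```

Passing a distribution lets the search try any integer in the range, while a list restricts it to the values listed.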

In [19]:
# Grid Search

# Using a complete grid of all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Running Grid Search
grid_search = GridSearchCV(clf, param_grid = param_grid, return_train_score = True)
start = time()
grid_search.fit(X, y)

print("GridSearchCV ran in %.2f seconds for all candidate combinations of model parameters."
      % (time() - start))
grid_search.cv_results_

GridSearchCV ran in 8.87 seconds for all candidate combinations of model parameters.


{'mean_fit_time': array([0.0268321 , 0.02451921, 0.02426394, 0.02615468, 0.02548774,
        0.02512439, 0.03135093, 0.03105704, 0.03083913, 0.03292267,
        0.02737737, 0.02524376, 0.03647367, 0.03344862, 0.03104472,
        0.0517416 , 0.04824162, 0.04341102, 0.02462808, 0.02531171,
        0.02411946, 0.02793813, 0.02762294, 0.02735162, 0.03413828,
        0.03360184, 0.0346667 , 0.03637369, 0.02931825, 0.02685134,
        0.0406034 , 0.03614839, 0.03242389, 0.06104136, 0.05499665,
        0.04817398, 0.01921701, 0.01959022, 0.0189321 , 0.0221076 ,
        0.02269069, 0.02142596, 0.03113921, 0.02938668, 0.02921391,
        0.03211141, 0.0239497 , 0.02177993, 0.03666615, 0.0322543 ,
        0.02796698, 0.06074222, 0.05637495, 0.05275265, 0.01992265,
        0.01939774, 0.01959833, 0.02299802, 0.02293682, 0.02232981,
        0.0332485 , 0.03301231, 0.03299697, 0.03643783, 0.02768588,
        0.02238965, 0.04519471, 0.03850102, 0.03242294, 0.07235845,
        0.06804371, 0.05737551]