## Hyperparameter optimization

Hyperparameters are the parameters that control the training/fitting process of a model, as opposed to the parameters the model learns from the data.

Many combinations of hyperparameters can give a reasonable model. So, how would you find the best parameters?
- One method is to evaluate every combination and keep the one that gives the best value of the metric. Let's see how this is done.

In [1]:
best_accuracy = 0
best_parameters = {"a": 0, "b": 0, "c": 0}

In [None]:
# loop over all values for a, b & c
# MODEL, training_data, validation_data and targets are placeholders
for a in range(1, 11):
    for b in range(1, 11):
        for c in range(1, 11):
            # initialize model with current parameters
            model = MODEL(a, b, c)
            # fit the model
            model.fit(training_data)
            # make predictions
            preds = model.predict(validation_data)
            # calculate accuracy
            accuracy = metrics.accuracy_score(targets, preds)
            # save params if current accuracy
            # is greater than best accuracy
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_parameters["a"] = a
                best_parameters["b"] = b
                best_parameters["c"] = c


A search over a grid of values to find the best combination of parameters is known as `grid search`.
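The nested loops above can be written more compactly with `itertools.product`. The sketch below is only an illustration: `toy_score` is a made-up stand-in for "fit the model and compute validation accuracy", chosen so that the best combination is known in advance.

```python
from itertools import product

# Hypothetical stand-in for fitting a model and scoring it.
# It peaks at a=3, b=7, c=5; a real score would come from a model.
def toy_score(a, b, c):
    return 1.0 - ((a - 3) ** 2 + (b - 7) ** 2 + (c - 5) ** 2) / 300.0

grid = {"a": range(1, 11), "b": range(1, 11), "c": range(1, 11)}

best_score = float("-inf")
best_parameters = None
n_evaluated = 0

# product() yields every combination, exactly like three nested loops
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = toy_score(**params)
    n_evaluated += 1
    if score > best_score:
        best_score = score
        best_parameters = params

print(n_evaluated, best_parameters, best_score)
```

With 10 values per parameter the loop runs 1000 times, which is why grid search gets expensive quickly as parameters are added.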

In [1]:
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection

In [2]:
# read the training data
df = pd.read_csv(r"C:\Users\lenovo\Desktop\Disha Github\Machine Learning Approaches\Machine-Learning\datasets\train.csv")

In [3]:
df.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


In [4]:
# here we have training features
X = df.drop("price_range", axis=1).values
# and the targets
y = df.price_range.values

In [5]:
# define the model here
# i am using random forest with n_jobs=-1
# n_jobs=-1 => use all cores
classifier = ensemble.RandomForestClassifier(n_jobs=-1)

In [6]:
# define a grid of parameters
# this can be a dictionary or a list of
# dictionaries
param_grid = {
    "n_estimators": [100, 200, 250, 300, 400, 500],
    "max_depth": [1, 2, 5, 7, 11, 15],
    "criterion": ["gini", "entropy"],
}

In [7]:
# initialize grid search
# estimator is the model that we have defined
# param_grid is the grid of parameters
# we use accuracy as our metric. you can define your own
# higher value of verbose implies a lot of details are printed
# cv=5 means that we are using 5 fold cv; with an integer cv and
# a classifier, scikit-learn uses stratified folds
model = model_selection.GridSearchCV(
    estimator=classifier,
    param_grid=param_grid,
    scoring="accuracy",
    verbose=10,
    n_jobs=1,
    cv=5,
)

In [8]:
# fit the model and extract best score
model.fit(X, y)
print(f"Best score: {model.best_score_}")

Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV 1/5; 1/72] START criterion=gini, max_depth=1, n_estimators=100..............
[CV 1/5; 1/72] END criterion=gini, max_depth=1, n_estimators=100;, score=0.583 total time=   6.6s
[CV 2/5; 1/72] START criterion=gini, max_depth=1, n_estimators=100..............
[CV 2/5; 1/72] END criterion=gini, max_depth=1, n_estimators=100;, score=0.532 total time=   0.1s
[CV 3/5; 1/72] START criterion=gini, max_depth=1, n_estimators=100..............
[CV 3/5; 1/72] END criterion=gini, max_depth=1, n_estimators=100;, score=0.662 total time=   0.2s
[CV 4/5; 1/72] START criterion=gini, max_depth=1, n_estimators=100..............
[CV 4/5; 1/72] END criterion=gini, max_depth=1, n_estimators=100;, score=0.613 total time=   0.2s
[CV 5/5; 1/72] START criterion=gini, max_depth=1, n_estimators=100..............
[CV 5/5; 1/72] END criterion=gini, max_depth=1, n_estimators=100;, score=0.547 total time=   0.1s
[CV 1/5; 2/72] START criterion=gini, max_de
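The first line of the log above follows directly from the size of the grid. As a quick check, this small sketch repeats the grid values from the earlier cell and multiplies the number of options per parameter by the number of folds:

```python
from math import prod

# same values as param_grid in the cell above
param_grid = {
    "n_estimators": [100, 200, 250, 300, 400, 500],
    "max_depth": [1, 2, 5, 7, 11, 15],
    "criterion": ["gini", "entropy"],
}
n_folds = 5

# number of candidates = product of the number of values per parameter
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 72 360
```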

In [9]:
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print(f"\t{param_name}: {best_parameters[param_name]}")

Best parameters set:
	criterion: entropy
	max_depth: 15
	n_estimators: 300


### Random search

- In random search, we randomly select combinations of parameters and calculate the cross-validation score for each of them.
- It takes less time than grid search because we evaluate only a fixed number of sampled combinations instead of every combination in the grid.
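To make the idea concrete, here is a minimal sketch that draws 20 combinations from a grid of 14 × 30 × 2 values using only the standard library. The seed and the variable names are arbitrary choices for the illustration, and this is not how scikit-learn implements its random search internally.

```python
import random
from itertools import product

param_grid_random = {
    "n_estimators": list(range(100, 1500, 100)),  # 14 values
    "max_depth": list(range(1, 31)),              # 30 values
    "criterion": ["gini", "entropy"],             # 2 values
}

# every combination a full grid search would have to evaluate
all_combinations = list(product(*param_grid_random.values()))

rng = random.Random(42)  # arbitrary seed, only for reproducibility
n_iter = 20

# sample n_iter distinct combinations (sampling without replacement)
sampled = rng.sample(all_combinations, n_iter)
candidates = [dict(zip(param_grid_random.keys(), c)) for c in sampled]

print(len(all_combinations), len(candidates))
```

Only 20 of the 840 possible combinations are evaluated, which is where the time saving comes from.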

In [10]:
# define a grid of parameters
# this can be a dictionary or a list of
# dictionaries
param_grid_random = {
    "n_estimators": np.arange(100, 1500, 100),
    "max_depth": np.arange(1, 31),
    "criterion": ["gini", "entropy"],
}

In [11]:

# initialize random search
# estimator is the model that we have defined
# param_distributions is the grid/distribution of parameters
# we use accuracy as our metric. you can define your own
# higher value of verbose implies a lot of details are printed
# cv=5 means that we are using 5 fold cv; with an integer cv and
# a classifier, scikit-learn uses stratified folds
# n_iter is the number of iterations we want
# if param_distributions has all the values as list,
# random search will be done by sampling without replacement
# if any of the parameters come from a distribution,
# random search uses sampling with replacement
model_random = model_selection.RandomizedSearchCV(
    estimator=classifier,
    param_distributions=param_grid_random,
    n_iter=20,
    scoring="accuracy",
    verbose=10,
    n_jobs=1,
    cv=5,
)

In [12]:
# fit the model and extract best score
model_random.fit(X, y)


Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV 1/5; 1/20] START criterion=gini, max_depth=10, n_estimators=700.............
[CV 1/5; 1/20] END criterion=gini, max_depth=10, n_estimators=700;, score=0.875 total time=   7.6s
[CV 2/5; 1/20] START criterion=gini, max_depth=10, n_estimators=700.............
[CV 2/5; 1/20] END criterion=gini, max_depth=10, n_estimators=700;, score=0.887 total time=   1.5s
[CV 3/5; 1/20] START criterion=gini, max_depth=10, n_estimators=700.............
[CV 3/5; 1/20] END criterion=gini, max_depth=10, n_estimators=700;, score=0.895 total time=   1.0s
[CV 4/5; 1/20] START criterion=gini, max_depth=10, n_estimators=700.............
[CV 4/5; 1/20] END criterion=gini, max_depth=10, n_estimators=700;, score=0.870 total time=   1.2s
[CV 5/5; 1/20] START criterion=gini, max_depth=10, n_estimators=700.............
[CV 5/5; 1/20] END criterion=gini, max_depth=10, n_estimators=700;, score=0.860 total time=   1.1s
[CV 1/5; 2/20] START criterion=gini, m

In [15]:
print(f"Best score: {model_random.best_score_}")
print("Best parameters set:")
best_parameters = model_random.best_estimator_.get_params()
for param_name in sorted(param_grid_random.keys()):
    print(f"\t{param_name}: {best_parameters[param_name]}")

Best score: 0.8869999999999999
Best parameters set:
	criterion: entropy
	max_depth: 30
	n_estimators: 1400


- More advanced hyperparameter optimization techniques treat tuning as the minimization of a function and apply a minimization algorithm to it.

- Several kinds of minimization algorithms can be used, such as:
    - the `downhill simplex algorithm`, also known as `Nelder-Mead optimization`,
    - a `Bayesian technique` with a Gaussian process for finding optimal parameters,
    - a `genetic algorithm`.
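As a small illustration of function minimization, the sketch below runs the Nelder-Mead method from SciPy on a toy function. The `objective` function and the starting point are assumptions made for this example; in hyperparameter tuning the objective would instead train a model and return a loss.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical objective: a smooth bowl with its minimum at (3, -1).
def objective(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

# Nelder-Mead only needs function values, no gradients
result = minimize(objective, x0=np.array([1.0, 1.0]), method="Nelder-Mead")

print(result.x, result.fun)
```

The optimizer only ever calls `objective`, which is the same contract that the libraries used later in this section expect.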

##### Gaussian process

- These kinds of algorithms need a function they can optimize. Most of the time, this means minimizing the function, just as we minimize a loss.

- Bayesian optimization with a Gaussian process can be done with the `gp_minimize` function from the scikit-optimize (`skopt`) library.

In [2]:
! pip install scikit-optimize

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-optimize
  Using cached scikit_optimize-0.9.0-py2.py3-none-any.whl (100 kB)
Collecting pyaml>=16.9
  Downloading pyaml-23.5.9-py3-none-any.whl (17 kB)
Collecting PyYAML
  Downloading PyYAML-6.0-cp39-cp39-win_amd64.whl (151 kB)
     ------------------------------------ 151.6/151.6 KB 411.1 kB/s eta 0:00:00
Installing collected packages: PyYAML, pyaml, scikit-optimize
Successfully installed PyYAML-6.0 pyaml-23.5.9 scikit-optimize-0.9.0




In [6]:
import numpy as np
import pandas as pd
from functools import partial
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection

from skopt import gp_minimize
from skopt import space

In [7]:
def optimize(params, param_names, x, y):
    """
    The main optimization function. 
    This function takes all the arguments from the search space
    and training features and targets. It then initializes
    the models by setting the chosen parameters and runs 
    cross-validation and returns a negative accuracy score
    :param params: list of params from gp_minimize
    :param param_names: list of param names. order is important!
    :param x: training data
    :param y: labels/targets
    :return: negative accuracy after 5 folds
    """

    # convert params to dictionary
    params = dict(zip(param_names, params))

    # initialize model with current parameters
    model = ensemble.RandomForestClassifier(**params)

    # initialize stratified k-fold
    kf = model_selection.StratifiedKFold(n_splits=5)

    # initialize accuracy list
    accuracies = []

    # loop over all folds
    for train_idx, test_idx in kf.split(X=x, y=y):
        xtrain = x[train_idx]
        ytrain = y[train_idx]
        xtest = x[test_idx]
        ytest = y[test_idx]

        # fit model for current fold
        model.fit(xtrain, ytrain)

        # create predictions
        preds = model.predict(xtest)

        # calculate and append accuracy
        fold_accuracy = metrics.accuracy_score(ytest, preds)
        accuracies.append(fold_accuracy)

    # return negative accuracy
    return -1 * np.mean(accuracies)


In [8]:
# read the training data
data_df = pd.read_csv(r"C:\Users\lenovo\Desktop\Disha Github\Machine Learning Approaches\Machine-Learning\datasets\train.csv")

In [9]:
# features are all columns without price_range
# note that there is no id column in this dataset
# here we have training features
X = data_df.drop("price_range", axis=1).values
# and the targets
y = data_df.price_range.values

In [19]:
# define a parameter space
param_space = [
    # max_depth is an integer between 3 and 15
    space.Integer(3, 15, name="max_depth"),
    # n_estimators is an integer between 100 and 1500
    space.Integer(100, 1500, name="n_estimators"),
    # criterion is a category. here we define list of categories
    space.Categorical(["gini", "entropy"], name="criterion"),
    # you can also have Real numbered space and define a
    # distribution you want to pick it from
    space.Real(0.01, 1, prior="uniform", name="max_features"),
]

In [23]:
param_space

[Integer(low=3, high=15, prior='uniform', transform='normalize'),
 Integer(low=100, high=1500, prior='uniform', transform='normalize'),
 Categorical(categories=('gini', 'entropy'), prior=None),
 Real(low=0.01, high=1, prior='uniform', transform='normalize')]

In [11]:
# make a list of param names
# this has to be in the same order as the search space,
# because the optimize function zips names and values together
param_names = [
    "max_depth",
    "n_estimators",
    "criterion",
    "max_features",
]

In [12]:
# by using functools partial, i am creating a
# new function which has same parameters as the
# optimize function except for the fact that
# only one param, i.e. the "params" parameter is
# required. this is how gp_minimize expects the
# optimization function to be. you can get rid of this
# by reading data inside the optimize function or by
# defining the optimize function here.
optimization_function = partial(
    optimize,
    param_names=param_names,
    x=X,
    y=y,
)
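If `functools.partial` is unfamiliar, this tiny standalone sketch shows the same pattern on a toy function. The `objective` function and its `scale` argument are hypothetical and exist only for the illustration.

```python
from functools import partial

# Hypothetical objective with the same shape as an optimizer callback:
# one tunable argument (params) plus extra arguments that stay fixed.
def objective(params, param_names, scale):
    named = dict(zip(param_names, params))
    return scale * (named["a"] + named["b"])

# freeze everything except `params`
single_arg_objective = partial(objective, param_names=["a", "b"], scale=-1)

# the new function now needs only the list of parameter values
value = single_arg_objective([2, 3])
print(value)  # -5
```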


In [24]:
# now we call gp_minimize from scikit-optimize
# gp_minimize uses bayesian optimization for
# minimization of the optimization function.
# we need a space of parameters, the function itself,
# the number of calls/iterations we want to have
result = gp_minimize(
    optimization_function,
    dimensions=param_space,
    n_calls=15,
    n_random_starts=10,
    verbose=10,
)


Iteration No: 1 started. Evaluating function at random point.


AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
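The error above appears because the installed scikit-optimize release still refers to `np.int`, an alias that newer NumPy releases no longer provide (to my knowledge it was removed in NumPy 1.24). Two common workarounds are installing an older NumPy (for example `numpy<1.24`) or restoring the alias before calling `gp_minimize`. The snippet below sketches the second option. It is a stopgap for this particular `AttributeError`, not a fix in the library, and other incompatibilities may remain.

```python
import numpy as np

# Restore the removed alias only if it is missing, so older NumPy
# versions are left untouched. Run this before calling gp_minimize.
if not hasattr(np, "int"):
    np.int = int

print(np.int is int)
```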

In [None]:
# create best params dict and print it
best_params = dict(zip(param_names, result.x))
print(best_params)

- Another useful library for hyperparameter optimization is `hyperopt`.
- `hyperopt` uses the Tree-structured Parzen Estimator (TPE) to find optimal parameters.

In [26]:
! pip install hyperopt

Defaulting to user installation because normal site-packages is not writeable
Collecting hyperopt
  Downloading hyperopt-0.2.7-py2.py3-none-any.whl (1.6 MB)
     ---------------------------------------- 1.6/1.6 MB 8.4 MB/s eta 0:00:00
Collecting networkx>=2.2
  Downloading networkx-3.1-py3-none-any.whl (2.1 MB)
     ---------------------------------------- 2.1/2.1 MB 26.4 MB/s eta 0:00:00
Collecting py4j
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     ---------------------------------------- 200.5/200.5 KB ? eta 0:00:00
Collecting future
  Downloading future-0.18.3.tar.gz (840 kB)
     ------------------------------------- 840.9/840.9 KB 51.9 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting tqdm
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
     ---------------------------------------- 77.1/77.1 KB 4.2 MB/s eta 0:00:00
Collecting cloudpickle
  Downloading cloudpickle-2.2.1-py3-none-any.



In [28]:
import numpy as np
import pandas as pd
from functools import partial
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll.base import scope


In [42]:
def optimize_hypo(params, x, y):
   """
   The main optimization function. 
   This function takes all the arguments from the search space
   and training features and targets. It then initializes
   the models by setting the chosen parameters and runs 
   cross-validation and returns a negative accuracy score
   :param params: dict of params from hyperopt
   :param x: training data
   :param y: labels/targets
   :return: negative accuracy after 5 folds
   """
   # initialize model with current parameters
   model = ensemble.RandomForestClassifier(**params)
   # initialize stratified k-fold
   kf = model_selection.StratifiedKFold(n_splits=5)

   # initialize accuracy list
   accuracies_new = []

   # loop over all folds
   for train_idx, test_idx in kf.split(X=x, y=y):
      xtrain = x[train_idx]
      ytrain = y[train_idx]
      xtest = x[test_idx]
      ytest = y[test_idx]

      # fit model for current fold
      model.fit(xtrain, ytrain)

      # create predictions
      preds = model.predict(xtest)

      # calculate and append accuracy
      fold_accuracy = metrics.accuracy_score(ytest, preds)
      accuracies_new.append(fold_accuracy)

   # return negative accuracy
   return -1 * np.mean(accuracies_new)

In [31]:
# read the training data
data_df = pd.read_csv(r"C:\Users\lenovo\Desktop\Disha Github\Machine Learning Approaches\Machine-Learning\datasets\train.csv")

# features are all columns without price_range
# note that there is no id column in this dataset
# here we have training features
X = data_df.drop("price_range", axis=1).values

# and the targets
y = data_df.price_range.values




In [44]:
# define a parameter space
# now we use hyperopt
param_space_hypo = {
    # quniform gives round(uniform(low, high) / q) * q
    # we want int values for depth and estimators
    "max_depth": scope.int(hp.quniform("max_depth", 1, 15, 1)),
    "n_estimators": scope.int(
        hp.quniform("n_estimators", 100, 1500, 1)
    ),
    # choice chooses from a list of values
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
    # uniform chooses a value between two values
    "max_features": hp.uniform("max_features", 0, 1),
}

In [45]:
# partial function
optimization_function_hypo = partial(
 optimize_hypo,
 x=X,
 y=y
 )

In [34]:
# initialize trials to keep logging information
trials = Trials()

In [46]:
# run hyperopt
hopt = fmin(
 fn=optimization_function_hypo,
 space=param_space_hypo,
 algo=tpe.suggest,
 max_evals=15,
 trials=trials
 )

print(hopt)

100%|██████████| 15/15 [04:07<00:00, 16.48s/trial, best loss: -0.9085000000000001]
{'criterion': 1, 'max_depth': 14.0, 'max_features': 0.9164623170029373, 'n_estimators': 514.0}
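Note that the dictionary printed above is not yet a set of model arguments: for `hp.choice` it holds the index of the chosen option, and for `hp.quniform` it holds floats. The sketch below decodes it by hand, with the values copied from the output above. `criterion_choices` is a helper name introduced here. `hyperopt` also ships a `space_eval` helper that should do the same mapping from the search space.

```python
# values copied from the fmin output above
hopt = {
    "criterion": 1,
    "max_depth": 14.0,
    "max_features": 0.9164623170029373,
    "n_estimators": 514.0,
}

# same order as the list passed to hp.choice in the search space
criterion_choices = ["gini", "entropy"]

best_params = {
    # hp.choice returns an index into the list of options
    "criterion": criterion_choices[hopt["criterion"]],
    # quniform values come back as floats, so cast them to int
    "max_depth": int(hopt["max_depth"]),
    "max_features": hopt["max_features"],
    "n_estimators": int(hopt["n_estimators"]),
}

print(best_params)
```

The resulting `best_params` dictionary can be unpacked into `ensemble.RandomForestClassifier(**best_params)`.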


The ways of tuning hyperparameters described above are the most common, and they work with almost all models: linear regression, logistic regression, tree-based methods, gradient boosting models such as xgboost and lightgbm, and even neural networks.

When you create large models or introduce a lot of features, you also make the model susceptible to overfitting the training data. To avoid overfitting, you need to introduce noise in the training data features or penalize the cost function. This penalization is called regularization and helps the model generalize.

- In linear models, the most common types of regularization are L1 and L2. Linear regression with an L1 penalty is also known as Lasso regression, and with an L2 penalty as Ridge regression.

- When it comes to neural networks, we use dropout, augmentations, added noise, etc. to regularize our models. Using hyperparameter optimization, you can also find the right penalty to use.
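For the linear case, the two penalties are simple to compute by hand. In this minimal sketch, `weights` and `alpha` are made-up values, and the exact scaling of the penalty term differs between libraries.

```python
# Hypothetical weights of a linear model and a penalty strength.
weights = [0.5, -2.0, 0.0, 1.5]
alpha = 0.1

# L1 (Lasso-style) penalty: alpha * sum of absolute weights
l1_penalty = alpha * sum(abs(w) for w in weights)

# L2 (Ridge-style) penalty: alpha * sum of squared weights
l2_penalty = alpha * sum(w ** 2 for w in weights)

print(l1_penalty, l2_penalty)
```

The penalty is added to the cost function, and `alpha` is the hyperparameter you would tune with any of the search methods above.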

| Model | Optimize | Range of Values |
|---|---|---|
| Linear Regression | - fit_intercept <br><br> - normalize | - True/False <br><br>- True/False |
| Ridge | - alpha <br><br> - fit_intercept <br><br> - normalize |-  0.01, 0.1, 1.0, 10, 100  <br><br>- True/False <br><br>- True/False |
| k-neighbors | - n_neighbors <br><br> - p | - 2,4,8,16.... <br><br>- 2,3... |
| svm | - C<br><br>- gamma<br><br>- class_weight |- 0.001,0.01..10..100..1000  <br><br>- ‘auto’, RandomSearch* <br><br>- ‘balanced’ , None |
| Logistic Regression | - penalty <br><br> - C | - l1 or l2 <br><br>-  0.001, 0.01…..10...100 |
| Lasso | - alpha <br><br> - normalize | -  0.1, 1.0, 10 <br><br>- True/False |
| Random Forest | - n_estimators<br><br>- max_depth<br><br>- min_samples_split<br><br>- min_samples_leaf<br><br>- max_features |- 120, 300, 500, 800, 1200<br><br>- 5, 8, 15, 25, 30, None<br><br>- 1, 2, 5, 10, 15, 100<br><br>- 1, 2, 5, 10<br><br>- log2, sqrt, None |
| XGBoost | - eta<br><br>- gamma<br><br>- max_depth<br><br>- min_child_weight<br><br>- subsample<br><br>- colsample_bytree<br><br>- lambda<br><br>- alpha |- 0.01,0.015, 0.025, 0.05, 0.1<br><br>- 0.05-0.1,0.3,0.5,0.7,0.9,1.0<br><br>- 3, 5, 7, 9, 12, 15, 17, 25<br><br>- 1, 3, 5, 7<br><br>- 0.6, 0.7, 0.8, 0.9, 1.0<br><br>- 0.6, 0.7, 0.8, 0.9, 1.0<br><br>- 0.01-0.1, 1.0 , RandomSearch*<br><br>- 0, 0.1, 0.5, 1.0 RandomSearch*|

