# Scikit-Optimize Definitions

**Definitions**

* Uniform distribution: This means that every value in a certain range has an equal chance of being selected. For example, if you're choosing a learning rate between 0.01 and 0.1, any value in that range is equally likely to be picked.

* Log-uniform distribution: This means the logarithm of the value is uniformly distributed, so every order of magnitude in the range has an equal chance of being selected. Compared with a uniform distribution over the same range, many more samples land among the smaller values. For example, for a learning rate between 0.001 and 0.1, a value between 0.001 and 0.01 is as likely to be picked as a value between 0.01 and 0.1.

Log-uniform is useful when a hyperparameter spans several orders of magnitude and you expect smaller values to be more effective but still want to explore larger ones.

**Example**

In Artificial Neural Networks (ANNs), we commonly use log-uniform distributions for tuning the learning rate. Here's why:

* Learning rates often span multiple orders of magnitude (e.g., 0.0001 to 0.1). A log-uniform distribution samples each order of magnitude equally often, so the smaller values, which are typically more effective for fine-tuning, are explored thoroughly.

* Small changes in the learning rate can have a big impact on training, particularly at the low end of the range. Log-uniform sampling draws more candidates from that region, which often leads to better results.

* In contrast, a uniform distribution over 0.0001 to 0.1 places about 90% of its samples between 0.01 and 0.1, so the smaller, often more effective learning rates are rarely tried. This is why log-uniform is preferred for the learning rate in ANN tuning.
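The difference between the two distributions can be checked with a small sampling experiment. The sketch below uses only the Python standard library rather than skopt, and the names `share_below`, `uniform_share`, and `log_uniform_share` are made up for this illustration. It draws learning rates between 0.0001 and 0.1 both ways and measures how many fall below 0.01. In theory the share is roughly 10% for uniform sampling and two thirds for log-uniform sampling.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
low, high = 1e-4, 1e-1
n_samples = 100_000

# Uniform: every value in [low, high] is equally likely
uniform_samples = [rng.uniform(low, high) for _ in range(n_samples)]

# Log-uniform: draw the exponent uniformly, then map back with 10 ** exponent
log_uniform_samples = [
    10 ** rng.uniform(math.log10(low), math.log10(high)) for _ in range(n_samples)
]


def share_below(samples, threshold):
    """Fraction of the samples that are smaller than the threshold."""
    return sum(value < threshold for value in samples) / len(samples)


uniform_share = share_below(uniform_samples, 0.01)
log_uniform_share = share_below(log_uniform_samples, 0.01)

print(f"uniform:     {uniform_share:.1%} of samples below 0.01")
print(f"log-uniform: {log_uniform_share:.1%} of samples below 0.01")
```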

# Basic Syntax

## Scikit-optimize search space - distributions


In [None]:
from skopt.space import Real, Integer, Categorical  # Import classes to define hyperparameter spaces
from skopt.utils import use_named_args

# Step 1: Define an integer hyperparameter space
# Integer(10, 120) covers the integers from 10 to 120 (both included), sampled with uniform probability
space = Integer(10, 120, prior="uniform", name="example")

# Step 2: Define a continuous real-valued hyperparameter space
# Real(0.00001, 0.1) defines the range of values between 0.00001 and 0.1 with uniform probability
space = Real(0.00001, 0.1, prior="uniform", name="example")

# Log-uniform samples uniformly in log space, so every order of magnitude is equally likely; useful when the range spans several orders of magnitude
space = Real(0.00001, 0.1, prior="log-uniform", name="example")

# Step 3: Define a categorical hyperparameter space
# Categorical(['A', 'B', 'C']) defines discrete choices, and one will be selected with equal probability
space = Categorical(['A', 'B', 'C'], name="example")

# Step 4: Define a complete hyperparameter grid for optimization
# param_grid is a list of different hyperparameters with their respective ranges/types
param_grid = [
    Integer(10, 120, name="n_estimators"),  # Number of trees in the model
    Integer(1, 5, name="max_depth"),  # Maximum depth of each tree
    Real(0.0001, 0.1, prior='log-uniform', name='learning_rate'),  # Learning rate with log-uniform distribution
    Real(0.001, 0.999, prior='log-uniform', name="min_samples_split"),  # Minimum samples split
    Categorical(['log_loss', 'exponential'], name="loss"),  # Loss function
]

## Objective Function

In [None]:
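# Imports used by the objective function
import numpy as np
from sklearn.model_selection import cross_val_score

# Assumes `gbm` (the gradient boosting classifier to tune), `X_train`, and `y_train`
# were defined earlier in the notebook; they are not shown in this section

# Step 1: Design a function to maximize accuracy using GBM with cross-validation
# The decorator `use_named_args` lets the objective receive the list of sampled values as keyword arguments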
@use_named_args(param_grid)
def objective(**params):

    # Step 2: Update the GBM model with new parameters
    # Set the parameters of the Gradient Boosting Model to the current set of hyperparameters
    gbm.set_params(**params)

    # Step 3: Cross-validation to evaluate the model
    # Perform 3-fold cross-validation on the training data to compute the accuracy for the current parameters
    value = np.mean(
        cross_val_score(
            gbm,
            X_train,        # Training features
            y_train,        # Training labels
            cv=3,           # 3-fold cross-validation
            n_jobs=4,       # Parallel processing with 4 jobs
            scoring='accuracy')  # Scoring based on accuracy
    )

    # Step 4: Negate the value since the optimizer minimizes the objective
    # Return the negative of the accuracy as the objective function minimizes the value
    return -value
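The two ideas in the objective function, receiving the sampled values by name and negating the score, can be tried in isolation. The sketch below is not skopt's implementation: `named_args_sketch` is a hypothetical stand-in for `use_named_args`, and `toy_objective` with its made-up accuracy formula replaces the cross-validated GBM so that the block runs without any data.

```python
from functools import wraps


def named_args_sketch(names):
    """Hypothetical stand-in for use_named_args: maps a list of values to keyword arguments."""
    def decorator(func):
        @wraps(func)
        def wrapper(values):
            if len(values) != len(names):
                raise ValueError("expected one value per hyperparameter name")
            return func(**dict(zip(names, values)))
        return wrapper
    return decorator


@named_args_sketch(["n_estimators", "max_depth", "learning_rate"])
def toy_objective(n_estimators, max_depth, learning_rate):
    # Made-up "accuracy" that peaks at max_depth=3 and learning_rate=0.05;
    # n_estimators is accepted but ignored in this toy formula
    accuracy = 1.0 - abs(learning_rate - 0.05) - 0.01 * abs(max_depth - 3)
    # The optimizer minimizes, so a higher accuracy must become a lower value
    return -accuracy


# The optimizer passes one flat list per evaluation, in the order of the search space
print(toy_objective([100, 3, 0.05]))
print(toy_objective([100, 5, 0.05]))
```

The better configuration yields the more negative value, which is what a minimizer looks for.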

# Randomized Search


In [None]:
# Importing necessary functions and libraries for optimization
from skopt import dummy_minimize  # for the randomized search

# for the analysis of results after the search
from skopt.plots import (
    plot_convergence,  # plots the convergence of the search
    plot_evaluations,  # plots the evaluations of the objective function
)

# Importing the hyperparameter space definition utilities
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args  # for mapping hyperparameters to functions

# Step 1: Perform a random search over the hyperparameter space
search = dummy_minimize(
    objective,  # the objective function to minimize
    param_grid,  # the hyperparameter space
    n_calls=50,  # number of evaluations of the objective function
    random_state=0,  # ensures reproducibility
)
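Conceptually, a random search draws every point independently from the search space and keeps track of the best value seen so far. The sketch below illustrates that idea with the standard library only. It is not how `dummy_minimize` is implemented, and `toy_objective`, `sample_point`, and the two-parameter space are assumptions made for the illustration. The running minimum it computes is the kind of curve a convergence plot displays.

```python
import random
from itertools import accumulate


def toy_objective(learning_rate, max_depth):
    """Made-up stand-in for the negated cross-validation accuracy (lower is better)."""
    return abs(learning_rate - 0.01) * 10 + abs(max_depth - 3) * 0.05


rng = random.Random(0)  # fixed seed for repeatability


def sample_point():
    learning_rate = 10 ** rng.uniform(-4, -1)  # log-uniform in [0.0001, 0.1]
    max_depth = rng.randint(1, 5)  # uniform integer in [1, 5], both ends included
    return learning_rate, max_depth


n_calls = 50
points = [sample_point() for _ in range(n_calls)]  # no point depends on earlier results
scores = [toy_objective(*point) for point in points]

# Best value found after each call
best_so_far = list(accumulate(scores, min))
best_point = points[scores.index(min(scores))]

print("best point:", best_point)
print("best score:", best_so_far[-1])
```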

## Plotting

The evaluation plot shows which hyperparameter values the search sampled. It does not show the model's performance directly. Here’s what it highlights:

* Axes: Each row and column represents a hyperparameter (like n_estimators or max_depth).

* Diagonal: Each panel on the diagonal is a histogram of the values sampled for one hyperparameter.

* Points: Each off-diagonal panel is a scatter plot of the sampled values for a pair of hyperparameters. The color of a point encodes the order in which it was evaluated, not how well it performed, and a red marker shows the best combination found.

* Trends: With a guided search, clusters of late points show the regions the optimizer considered promising. With a random search, the points simply follow the sampling distributions.

* Gaps: It helps you spot regions of the search space that were barely explored.

In [None]:
import matplotlib.pyplot as plt  # needed for plt.show()

# Step 2: Plot the convergence of the search
plot_convergence(search)  # Shows the best objective value found so far after each call

# Step 3: Plot evaluations for each dimension of the hyperparameter space
dim_names = ['n_estimators', 'max_depth', 'min_samples_split', 'learning_rate', 'loss']  # hyperparameter names
plot_evaluations(result=search, plot_dims=dim_names)  # Shows which hyperparameter values were sampled

plt.show()  # Display the plots

# Bayesian Optimization

## Bayesian Optimization with Gaussian Process




In [None]:
from skopt import gp_minimize # Bayesian Opt with GP

# for the analysis
from skopt.plots import (
    plot_convergence,
    plot_evaluations,
    plot_objective,
)

from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args

# gp_minimize performs Bayesian optimization with a Gaussian Process surrogate,
# using a Matérn kernel by default

gp_ = gp_minimize(
    objective, # the objective function to minimize
    param_grid, # the hyperparameter space
    n_initial_points=10, # the number of initial points at which f(x) is evaluated before the GP is used
    acq_func='EI', # the acquisition function (Expected Improvement)
    n_calls=30, # the total number of evaluations of f(x), including the initial points
    random_state=0,
)
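The `acq_func='EI'` setting selects Expected Improvement, which scores a candidate by how much it is expected to improve on the best value found so far, given the surrogate's predicted mean and standard deviation at that point. The sketch below writes out the textbook formula for a minimization problem with the standard library only. The names `expected_improvement`, `best_y`, and `xi` are chosen for this illustration, and skopt's own implementation may differ in details such as the default margin.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best_y, xi=0.0):
    """Expected Improvement for minimization.

    mu, sigma: surrogate mean and standard deviation at the candidate point
    best_y:    lowest objective value observed so far
    xi:        optional margin by which the candidate must beat best_y
    """
    improvement = best_y - mu - xi
    if sigma <= 0.0:
        # No uncertainty: the improvement is known exactly
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)


best_y = -0.90  # e.g. a negated accuracy of 0.90

# A candidate predicted to be slightly worse can still score well if it is uncertain
print(expected_improvement(mu=-0.89, sigma=0.001, best_y=best_y))
print(expected_improvement(mu=-0.89, sigma=0.05, best_y=best_y))
```

The second candidate has the same predicted mean but a larger uncertainty, so its Expected Improvement is higher. This is how the acquisition function balances exploitation and exploration.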

### Plotting

In [None]:
import matplotlib.pyplot as plt  # needed for plt.show()

# Step 1: Plot the convergence of the Bayesian optimization
# Shows the best objective value found so far after each call
plot_convergence(gp_)

# Step 2: Define the hyperparameter names for the objective and evaluation plots
# Hyperparameter names
dim_names = ['n_estimators', 'max_depth', 'min_samples_split', 'learning_rate', 'loss']

# Step 3: Plot the objective for each hyperparameter
# Shows the partial dependence of the surrogate model's estimated objective on each hyperparameter
plot_objective(result=gp_, plot_dims=dim_names)
plt.show()

# Step 4: Plot the evaluations for each dimension of the hyperparameter space
# Visualizes which hyperparameter values were evaluated
plot_evaluations(result=gp_, plot_dims=dim_names)
plt.show()

## Bayesian Optimization with Random Forests (SMAC)


In [None]:
# Step 1: Import the necessary modules for Bayesian optimization and analysis
# skopt contains optimization functions and plotting tools
from skopt import forest_minimize  # Bayesian Optimization with RF as surrogate
from skopt.plots import plot_convergence, plot_evaluations, plot_objective  # Plotting tools for analysis
from skopt.space import Real, Integer, Categorical  # Define the hyperparameter search space
from skopt.utils import use_named_args  # Decorator to pass hyperparameters as named arguments

# Step 2: Perform Bayesian optimization using Random Forest as the surrogate model
# forest_minimize optimizes the objective function by exploring the hyperparameter space
fm_ = forest_minimize(
    objective,  # The objective function to minimize
    param_grid,  # The hyperparameter space
    base_estimator='RF',  # Use Random Forest as the surrogate model
    n_initial_points=10,  # Number of points to evaluate the objective function to start with
    acq_func='EI',  # Expected Improvement acquisition function
    n_calls=30,  # Total number of evaluations of the objective function, including the initial points
    random_state=0,  # Ensure reproducibility
    n_jobs=4,  # Use 4 cores to fit and query the surrogate model
)

## Bayesian Optimization with GBM as surrogate


In [None]:
# Step 1: Import the necessary modules for Bayesian optimization and analysis
# skopt contains optimization functions and plotting tools
from skopt import gbrt_minimize  # Bayesian Optimization with GBM as surrogate
from skopt.plots import plot_convergence, plot_evaluations, plot_objective  # Plotting tools for analysis
from skopt.space import Real, Integer, Categorical  # Define the hyperparameter search space
from skopt.utils import use_named_args  # Decorator to pass hyperparameters as named arguments

# Step 2: Perform Bayesian optimization using Gradient Boosted Machines as the surrogate
# gbrt_minimize optimizes the objective function by exploring the hyperparameter space
gbm_ = gbrt_minimize(
    objective,  # The objective function to minimize
    param_grid,  # The hyperparameter space
    n_initial_points=10,  # Number of points to evaluate the objective function to start with
    acq_func='EI',  # Expected Improvement acquisition function
    n_calls=30,  # Total number of evaluations of the objective function, including the initial points
    random_state=0,  # Ensure reproducibility
    n_jobs=4,  # Use 4 cores to fit and query the surrogate model
)

**Note**
* **RF** surrogates are, as a rule of thumb, better suited to noisy, irregular objective functions, where you need a more exploration-driven search.
* **GBM** surrogates are a good choice when faster convergence and higher precision are desired, though fitting them may require more computational resources.

## Bayesian Optimization with XGBoost

In [None]:
# Step 1: Import XGBoost and necessary libraries for optimization
import xgboost as xgb  # Importing the XGBoost library for model building
from skopt import gp_minimize  # Importing the Gaussian Process optimization function

# Step 2: Import plotting functions for analysis
# These functions will help visualize the optimization process
from skopt.plots import (
    plot_convergence,  # For plotting the convergence of the optimization
    plot_evaluations,  # For visualizing the evaluation of hyperparameters
    plot_objective,    # For plotting the objective function values
)
from skopt.space import Real, Integer, Categorical  # For defining hyperparameter space
from skopt.utils import use_named_args  # For using named arguments in the objective function

# Step 3: Set up the Gaussian Process Optimization
# gp_minimize performs Bayesian Optimization using a Matérn kernel by default
# Note: `objective` and `param_grid` are assumed to have been redefined for an XGBoost model; that step is not shown here
gp_ = gp_minimize(
    objective,  # The objective function to minimize
    param_grid,  # The hyperparameter space to explore
    n_initial_points=10,  # Number of initial points to evaluate f(x)
    acq_func='EI',  # The acquisition function used for optimization
    n_calls=40,  # Total number of evaluations of f(x), including the initial points
    random_state=0,  # Ensures reproducibility of results
)

## Parallelizing Bayesian Optimization with Gaussian Process


In [None]:
# Step 1: Import necessary modules for optimization and parallelization
# Optimizer is used for the optimization process, Parallel and delayed for parallel computations
from skopt import Optimizer  # For the optimization process
from joblib import Parallel, delayed  # For parallel computation

# Step 2: Initialize the Optimizer for Bayesian Optimization
# Optimizer uses Gaussian Processes (GP) as the surrogate model
optimizer = Optimizer(
    dimensions=param_grid,  # The hyperparameter space
    base_estimator="GP",  # The surrogate model (Gaussian Process)
    n_initial_points=10,  # Number of initial points to evaluate the objective function
    acq_func='EI',  # Expected Improvement acquisition function
    random_state=0,  # Ensure reproducibility
    n_jobs=4,  # Number of cores for parallel computation
)

# Step 3: Perform optimization in parallel over 10 iterations
# Each iteration evaluates 4 points in parallel and updates the optimizer
for i in range(10):
    x = optimizer.ask(n_points=4)  # Generate 4 points to evaluate the objective function
    y = Parallel(n_jobs=4)(delayed(objective)(v) for v in x)  # Evaluate the objective function in parallel
    optimizer.tell(x, y)  # Update the optimizer with the evaluated points

**Note**

Bayesian optimization itself is not inherently a parallel algorithm, as it traditionally works sequentially. Each new set of hyperparameters is chosen based on the information from all the previous evaluations. However, it can be adapted for parallelization using certain strategies.

**Important**

When you run the optimization in parallel, the algorithm picks multiple sets of hyperparameters at the same time (e.g., 4 sets) and evaluates them simultaneously. The trade-off here is that while this parallel evaluation speeds up the process, the algorithm doesn't get the chance to learn from each individual evaluation before choosing the next set of hyperparameters.
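One common way to pick several points at once is the "constant liar" strategy: after choosing a point, the optimizer pretends it has already been evaluated and assigns it a fixed fake value (the "lie"), so the next pick is pushed elsewhere. The sketch below illustrates the idea on a one-dimensional toy problem with the standard library only. The nearest-neighbour surrogate, the lower-confidence-bound acquisition, and all names (`predict`, `acquisition`, `ask_batch`) are assumptions made for this illustration, not skopt's internals.

```python
def toy_objective(x):
    """Made-up objective with its minimum at x = 0.3."""
    return (x - 0.3) ** 2


def predict(x, observed):
    """Toy surrogate: mean is the value of the nearest known point, uncertainty is the distance to it."""
    nearest_x, nearest_y = min(observed, key=lambda pair: abs(pair[0] - x))
    return nearest_y, abs(nearest_x - x)


def acquisition(x, observed, kappa=2.0):
    """Lower confidence bound: a lower score marks a more promising point."""
    mean, uncertainty = predict(x, observed)
    return mean - kappa * uncertainty


def ask_batch(observed, candidates, n_points):
    """Choose n_points candidates with the constant-liar strategy."""
    lie = min(y for _, y in observed)  # the constant fake value
    believed = list(observed)  # real observations plus lies
    pool = list(candidates)
    batch = []
    for _ in range(n_points):
        best = min(pool, key=lambda x: acquisition(x, believed))
        batch.append(best)
        pool.remove(best)
        believed.append((best, lie))  # pretend this point was already evaluated
    return batch  # the lies are discarded; only real results get stored


observed = [(x, toy_objective(x)) for x in (0.0, 0.5, 1.0)]
candidates = [i / 100 for i in range(101)]

for _ in range(3):
    batch = ask_batch(observed, candidates, n_points=4)
    results = [toy_objective(x) for x in batch]  # this step would run in parallel
    observed.extend(zip(batch, results))  # the "tell" step uses the real values

print("best x found:", min(observed, key=lambda pair: pair[1])[0])
```

All four points of a batch are chosen before any of their real results is known, which is the trade-off described above.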



## Bayesian Optimization with BayesSearchCV

In [None]:
# Step 1: Import BayesSearchCV for hyperparameter optimization
# BayesSearchCV is a wrapper for using Bayesian optimization in hyperparameter tuning
from skopt import BayesSearchCV

# Step 2: Set up the BayesSearchCV for hyperparameter search
# BayesSearchCV expects the search space as a dict that maps parameter names to dimensions,
# so the list of named dimensions is converted first
search = BayesSearchCV(
    estimator=gbm,  # The model to optimize
    search_spaces={dim.name: dim for dim in param_grid},  # The hyperparameter space to explore
    scoring='accuracy',  # Same metric as in the objective function (BayesSearchCV maximizes the score)
    cv=3,  # Number of cross-validation folds
    n_iter=50,  # Number of iterations for the search
    random_state=10,  # Ensures reproducibility of results
    n_jobs=4,  # Number of CPU cores for parallel processing
    refit=True  # Refits the model using the best found hyperparameters
)

# Step 3: Fit the BayesSearchCV to the training data
# This will start the hyperparameter optimization process
search.fit(X_train, y_train)

## Bayesian Optimisation with different Kernels


In [None]:
# Step 1: Import the squared exponential kernel and necessary libraries
# Importing Radial Basis Function (RBF) for Gaussian Process Regression
from sklearn.gaussian_process.kernels import RBF
from skopt import gp_minimize  # For performing Bayesian Optimization
from skopt.plots import plot_convergence  # For plotting the convergence of the optimization process
from skopt.space import Real, Integer, Categorical  # For defining the hyperparameter space
from skopt.utils import use_named_args  # For using named arguments in the objective function
from skopt.learning import GaussianProcessRegressor  # Importing the Gaussian Process Regressor

# Step 2: Define the kernel for Gaussian Process Regression
# This kernel is a Radial Basis Function (RBF) with specified length scale and bounds
kernel = 1.0 * RBF(length_scale=1.0, length_scale_bounds=(1e-1, 10.0))

# Display the kernel
kernel  # This line shows the defined kernel for confirmation

# Step 3: Initialize the Gaussian Process Regressor
# This sets up the regressor with the defined kernel and additional parameters
gpr = GaussianProcessRegressor(
    kernel=kernel,  # The kernel defined above
    normalize_y=True,  # Normalize the output for better stability
    noise="gaussian",  # Assumes Gaussian noise in the observations
    n_restarts_optimizer=2  # Number of restarts for the optimizer
)

# Step 4: Perform Bayesian Optimization using gp_minimize
# This function optimizes the objective function using the Gaussian Process Regressor as the surrogate model
gp_ = gp_minimize(
    objective,  # The objective function to minimize
    dimensions=param_grid,  # The hyperparameter space to explore
    base_estimator=gpr,  # The Gaussian Process Regressor as the surrogate model
    n_initial_points=5,  # Number of initial points to evaluate
    acq_optimizer="sampling",  # Method for acquisition function optimization
    random_state=42  # Ensures reproducibility of results
)

**Description**

To use a kernel other than the default, define a GaussianProcessRegressor with that kernel and pass it to gp_minimize as the base_estimator. This ensures that the surrogate model leverages the specific properties of the kernel you are interested in, such as how smooth it assumes the objective function to be.
The GaussianProcessRegressor is the surrogate model itself: it fits the kernel's hyperparameters (for example, the length scale) to the evaluations collected so far and predicts the objective, together with an uncertainty estimate, at untried points.
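To see what a different kernel changes, it helps to compare how quickly the assumed correlation between two points decays with their distance. The sketch below evaluates the standard formulas for the RBF (squared exponential) kernel and for a Matérn kernel with smoothness ν = 5/2, using the standard library only. The function names `rbf` and `matern52` are made up for this illustration, and ν = 5/2 is an assumption about the default rather than something stated in this notebook.

```python
import math


def rbf(distance, length_scale=1.0):
    """RBF (squared exponential) kernel value for two points that are `distance` apart."""
    return math.exp(-0.5 * (distance / length_scale) ** 2)


def matern52(distance, length_scale=1.0):
    """Matern kernel with smoothness nu = 5/2."""
    scaled = math.sqrt(5.0) * distance / length_scale
    return (1.0 + scaled + scaled ** 2 / 3.0) * math.exp(-scaled)


# Both kernels equal 1 at distance 0 and decay as the points move apart
for distance in (0.0, 0.5, 1.0, 2.0):
    print(f"d={distance:.1f}  RBF={rbf(distance):.4f}  Matern 5/2={matern52(distance):.4f}")

# A larger length scale makes distant points look more similar
print(rbf(1.0, length_scale=1.0), rbf(1.0, length_scale=10.0))
```

The RBF kernel assumes a very smooth objective, while the Matérn kernel allows rougher functions. The `length_scale_bounds` argument in the cell above limits how far the fitted length scale can move from its starting value.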

**Side note**

Use GaussianProcessRegressor when your task is to predict continuous values (regression). For example, predicting house prices or temperature based on input features.

Use GaussianProcessClassifier when your task is to categorize data into distinct classes (classification). For example, determining if an email is spam or classifying images into different categories.