## Hyperparameter Tuning on the Iris Dataset

- The Iris dataset is widely used in machine learning and data analysis. It is named after the Iris flowering plant and was introduced by the British statistician and biologist Ronald Fisher in 1936. It is frequently used as a beginner's dataset for learning and practicing classification algorithms.

- The Iris dataset consists of measurements of four features for three species of Iris flowers: Setosa, Versicolor, and Virginica.

- **The four features are:**
  1. Sepal length (in centimeters)
  2. Sepal width (in centimeters)
  3. Petal length (in centimeters)
  4. Petal width (in centimeters)

- The target variable is the species of each flower, that is, the class to which each sample belongs. It is a categorical variable with three possible values: Setosa, Versicolor, and Virginica.

- There are 50 samples for each of the three species, for a total of 150 samples in the dataset.
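
The structure described above can be checked directly from scikit-learn's copy of the dataset. This is a minimal sketch that assumes scikit-learn and NumPy are installed; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)
print(iris.feature_names)

# Species names; the target array encodes them as the integers 0, 1, 2
print([str(name) for name in iris.target_names])

# Number of samples per class
class_counts = np.bincount(iris.target)
print({str(name): int(n) for name, n in zip(iris.target_names, class_counts)})
```

Note that scikit-learn stores the species as integer labels, which is why the target array printed later in this notebook contains only 0, 1, and 2.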

In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials

# If you are running Databricks Runtime for Machine Learning, `mlflow` is already installed and you can skip the following line. 
import mlflow

In [9]:
# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

In [10]:
print(X)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

In [11]:
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


### Part 1. Single-machine Hyperopt workflow

Here are the steps in a Hyperopt workflow:  
1. Define a function to minimize.  
2. Define a search space over hyperparameters.  
3. Select a search algorithm.  
4. Run the tuning algorithm with Hyperopt `fmin()`.

For more information, see the [Hyperopt documentation](https://github.com/hyperopt/hyperopt/wiki/FMin).

#### 1- Define a function to minimize.
We use a support vector machine classifier. The objective is to find the best value for the regularization parameter C.

In [12]:
def objective(C):
    # Create a support vector classifier model
    clf = SVC(C=C)
    
    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()
    
    # Hyperopt tries to minimize the objective function. A higher accuracy
    # value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}
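
To make the sign convention concrete, the sketch below evaluates one candidate by hand outside of Hyperopt. It assumes scikit-learn is installed; `C=1.0` is an arbitrary value chosen for illustration, and `fold_scores` and `loss` are names introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# With the default cv setting, cross_val_score runs 5-fold cross-validation
# (stratified for classifiers) and returns one accuracy score per fold.
fold_scores = cross_val_score(SVC(C=1.0), X, y)
mean_accuracy = fold_scores.mean()

# The loss handed to Hyperopt is the negated mean accuracy, so a more
# accurate model yields a lower (more negative) loss.
loss = -mean_accuracy
print(len(fold_scores), mean_accuracy, loss)
```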

#### 2- Define the search space over hyperparameters

In [13]:
search_space = hp.lognormal('C', 0, 1.0)
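
`hp.lognormal('C', 0, 1.0)` draws values whose logarithm is normally distributed with mean 0 and standard deviation 1, so every candidate `C` is positive. The sketch below illustrates that distribution using only the Python standard library; it is not Hyperopt's own sampler, and the sample size and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

mu, sigma = 0.0, 1.0  # same parameters as hp.lognormal('C', 0, 1.0)

# A log-normal draw is exp(z) with z ~ Normal(mu, sigma), so it is always positive
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(10_000)]

print("min:   ", min(samples))
print("median:", statistics.median(samples))  # close to exp(mu) = 1
print("max:   ", max(samples))
```

A log-normal prior is a common fit for a regularization strength such as `C`, because it spreads candidates across orders of magnitude rather than evenly along a linear range.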

#### 3- Select a search algorithm
The two main choices are:

`hyperopt.tpe.suggest`: Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on past results.

`hyperopt.rand.suggest`: Random search, a non-adaptive approach that samples over the search space.

In [14]:
algo = tpe.suggest

#### 4- Run the tuning algorithm with Hyperopt `fmin()`

In [23]:
argmin = fmin(
  fn=objective,
  space=search_space,
  algo=algo,
  max_evals=200
  )

100%|█████████████████████████████████████████████| 200/200 [00:04<00:00, 43.55trial/s, best loss: -0.9866666666666667]


In [24]:
print("Best value found: ", argmin)

Best value found:  {'C': 4.513431083458916}
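
`fmin()` returns only the best hyperparameter values, not a fitted model. One possible follow-up step, sketched here under the assumption that scikit-learn is installed, is to re-check the cross-validated accuracy for the reported `C` and then fit a final model. The names `best_C` and `final_model` are introduced for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Value reported by the run above; a different run will generally find a different C
best_C = 4.513431083458916

# Re-check the cross-validated accuracy for this value of C
cv_accuracy = cross_val_score(SVC(C=best_C), X, y).mean()
print("Cross-validated accuracy:", cv_accuracy)

# Fit the final model on all of the data
final_model = SVC(C=best_C).fit(X, y)
print("Predicted classes for the first three samples:", final_model.predict(X[:3]))
```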



### Part 2. Distributed tuning using Apache Spark and MLflow

- **Distributed tuning** is a technique for tuning the hyperparameters of ML models by spreading the trials across multiple machines, which can significantly speed up the search.

- Apache Spark is a distributed computing framework that can be used to run distributed tuning jobs.
- MLflow is an open-source platform for managing the end-to-end ML lifecycle, including hyperparameter tuning.

To distribute tuning, add one more argument to `fmin()`: a `trials` argument set to an instance of the `SparkTrials` class.

`SparkTrials` takes 2 optional arguments:  
* `parallelism`: Number of models to fit and evaluate concurrently. The default is the number of available Spark task slots.
* `timeout`: Maximum time (in seconds) that `fmin()` can run. The default is no maximum time limit.

In [17]:
"""
from hyperopt import SparkTrials

spark_trials = SparkTrials()
with mlflow.start_run():
  argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16,
    trials=spark_trials
  )
"""

Exception: SparkTrials cannot import pyspark classes.  Make sure that PySpark is available in your environment.  E.g., try running 'import pyspark'