# Import packages and load data

In [0]:
from sklearn.datasets import load_iris  # Classification dataset
from sklearn.model_selection import cross_val_score  # Estimates model performance using cross-validation
from sklearn.svm import SVC  # Support Vector Classifier (SVC), for classification tasks

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials

# This import can be skipped on Databricks Runtime for Machine Learning.
import mlflow

**Iris dataset:**

The Iris dataset is a collection of measurements for 150 iris flowers that is used in machine learning to classify the flowers into species. It's a classic dataset that's often used to introduce beginners to machine learning. 

What's in the Iris dataset?

150 samples of iris flowers 

3 species of iris flowers: Iris setosa, Iris versicolor, and Iris virginica (Target variable)

4 features for each flower: sepal length, sepal width, petal length, and petal width 

In [0]:
iris = load_iris()
X = iris.data
y = iris.target

In [0]:
print(X)

In [0]:
print(y)
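
Printing the raw arrays is hard to read, so a more compact check of the dataset description above can help. The following is a minimal sketch using attributes of the object returned by load_iris(); the variable name class_counts is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Shape of the feature matrix: (number of samples, number of features)
print(iris.data.shape)

# Names of the four measured features and the three species
print(iris.feature_names)
print(iris.target_names)

# Number of samples per species (targets are encoded as 0, 1, 2)
class_counts = np.bincount(iris.target)
print(class_counts)
```

The class counts show whether the dataset is balanced, which matters when accuracy is used as the tuning metric.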

# Part 1: Single-machine Hyperopt workflow

Steps in the Hyperopt workflow:

1. Define a function to minimize

2. Define a search space over hyperparameters

3. Select a search algorithm

4. Run the tuning algorithm with hyperopt fmin()

## Define a function to minimize

Here we use a Support Vector Classifier. The objective of this step is to find the best value for the regularization parameter C.

In [0]:
def Objective(C):
    # Create an SVC model; C is the regularization (penalty) parameter of the error term
    clf = SVC(C=C)
    # Use cross-validation (5-fold by default) to estimate the performance of the model
    accuracy = cross_val_score(clf, X, y).mean()
    # Hyperopt minimizes the loss, so return the negative accuracy
    return {'loss': -accuracy, 'status': STATUS_OK}

## Define a search space over hyperparameters

hp.lognormal(label, mu, sigma)

Returns a value drawn according to exp(normal(mu, sigma)) so that the logarithm of the return value is normally distributed. When optimizing, this variable is constrained to be positive.

In [0]:
search_space = hp.lognormal('C', 0, 1.0)
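
To get a feel for the values this search space produces, the distribution can be reproduced without Hyperopt. The sketch below draws exp(normal(mu, sigma)) samples with the Python standard library, using the same mu = 0 and sigma = 1.0 as above; it imitates the distribution only and is not how Hyperopt samples internally. The seed and sample count are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable
mu, sigma = 0.0, 1.0     # same parameters as hp.lognormal('C', 0, 1.0)

# Draw a normal variate and exponentiate it: the result is log-normally distributed
samples = [math.exp(rng.gauss(mu, sigma)) for _ in range(10_000)]

# Every sample is strictly positive, which suits a parameter such as C
print("all positive:", all(s > 0 for s in samples))

# The median of a log-normal distribution is exp(mu), so it should be close to 1 here
print("median:", round(statistics.median(samples), 2))
```

With mu = 0 the candidate values of C cluster around 1, with a long tail toward larger values.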

## Select a search algorithm

**hyperopt.tpe.suggest:** Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on past results.

**hyperopt.rand.suggest:** Random search, a non-adaptive approach that samples over the search space.

In [0]:
algo = tpe.suggest
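
To make "non-adaptive" concrete, here is a minimal standard-library sketch of random search over a log-normal search space. It is not Hyperopt's implementation: toy_loss is a hypothetical stand-in for the real objective, and the optimum at C = 2.0 is invented for illustration.

```python
import math
import random

def toy_loss(c):
    # Hypothetical loss with its minimum at c = 2.0; stands in for the negative accuracy
    return (math.log(c) - math.log(2.0)) ** 2

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Non-adaptive: all 15 candidates come from the same distribution,
# and no draw is influenced by the losses of earlier candidates
candidates = [rng.lognormvariate(0.0, 1.0) for _ in range(15)]
losses = [toy_loss(c) for c in candidates]

# Keep the candidate with the smallest loss
best_loss, best_c = min(zip(losses, candidates))
print(f"best C={best_c:.3f}, loss={best_loss:.4f}")
```

An adaptive method such as TPE differs in the sampling step: it uses the losses observed so far to decide where to draw the next candidate.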

## Run the tuning algorithm with hyperopt fmin()



In [0]:
argmin = fmin(fn=Objective,
              space=search_space,
              algo=algo,
              max_evals=15,  # Maximum number of models to try
              )

In [0]:
print("Best value found:", argmin)
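
fmin() returns a dictionary keyed by the label used in the search space, here 'C'. A common next step is to refit a final model with that value. The sketch below assumes this workflow; best_params is a placeholder standing in for argmin, and its value of 1.0 is illustrative, not a result from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder for the dictionary returned by fmin(); replace with argmin in the notebook
best_params = {"C": 1.0}

X, y = load_iris(return_X_y=True)

# Rebuild the classifier with the selected regularization parameter
final_clf = SVC(C=best_params["C"])

# Re-estimate performance with cross-validation, then fit on all the data
cv_accuracy = cross_val_score(final_clf, X, y).mean()
final_clf.fit(X, y)

print(f"C={best_params['C']:.3f}, mean CV accuracy={cv_accuracy:.3f}")
```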

# Part 2: Distributed tuning using Apache Spark and MLflow

### To distribute tuning, add one more argument to fmin(): a Trials class called SparkTrials.

SparkTrials takes 2 optional arguments:

**parallelism:** Number of models to fit and evaluate concurrently. The default is the number of available Spark task slots.

**timeout:** Maximum time (in seconds) that fmin() can run. The default is no maximum time limit.
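
Both arguments can be passed by keyword. The sketch below keeps the settings in a plain dictionary and builds the SparkTrials object in a small helper; make_spark_trials and spark_trials_kwargs are names introduced here for illustration, and the values 4 and 600 are arbitrary. Calling the helper requires hyperopt and an active Spark session.

```python
# Illustrative settings; adjust them to the cluster size and the time budget
spark_trials_kwargs = {
    "parallelism": 4,  # fit and evaluate at most 4 models concurrently
    "timeout": 600,    # let fmin() run for at most 600 seconds
}

def make_spark_trials(**kwargs):
    """Build a SparkTrials object from keyword arguments (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without Spark installed
    from hyperopt import SparkTrials
    return SparkTrials(**kwargs)

# On a Spark cluster:
# spark_trials = make_spark_trials(**spark_trials_kwargs)
```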

This example uses the very simple objective function defined in Part 1. Because that function runs quickly, the overhead of starting the Spark jobs dominates the calculation time, so the distributed case takes longer than the single-machine case. For typical real-world problems, the objective function is more complex, and using SparkTrials to distribute the calculations will be faster than single-machine tuning.

Automated MLflow tracking is enabled by default. To use it, call mlflow.start_run() before calling fmin() as shown in the example.

In [0]:
from hyperopt import SparkTrials

In [0]:
help(SparkTrials)

In [0]:
spark_trials = SparkTrials()

In [0]:
with mlflow.start_run():  # Start an MLflow run so that the tuning results are tracked
    argmin = fmin(fn=Objective,
                  space=search_space,
                  algo=algo,
                  max_evals=15,
                  trials=spark_trials
    )

In [0]:
print("Best value found:", argmin)