## Model Selection with Hyperopt & MLflow

In [1]:
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials

import mlflow

### California housing dataset

- The California housing dataset is a widely used machine learning dataset that ships with the scikit-learn library.
- It contains information about housing prices in various districts of California. The dataset is often used for regression tasks to predict the median house value in a given district based on several features.

- **The California housing dataset provides the following information for each district:**

1) MedInc: Median income of households in the district.
2) HouseAge: Median age of houses in the district.
3) AveRooms: Average number of rooms per household.
4) AveBedrms: Average number of bedrooms per household.
5) Population: Total population in the district.
6) AveOccup: Average number of household members.
7) Latitude: Latitude of the district's location.
8) Longitude: Longitude of the district's location.
9) MedHouseVal: Median value of houses in the district, in units of $100,000 (the target variable, returned separately from the eight features above).

- The goal of using this dataset is typically to build a regression model that can predict the median house value based on the given features.

In [2]:
X, y = fetch_california_housing(return_X_y=True)

### Feature engineering 
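#### Scale the features

Feature scaling transforms the numerical features in a dataset to a common scale. It is a crucial step in data preprocessing because it brings the features to a similar range of magnitude, which matters for scale-sensitive models such as the SVM and logistic regression used below. `StandardScaler` standardizes each feature to zero mean and unit variance.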
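As a rough illustration of what standardization does to a single feature column, the sketch below applies z = (x - mean) / std in plain Python. The helper name `standardize` and the sample values are hypothetical and are not part of the notebook; `StandardScaler` performs the equivalent computation per column using the population standard deviation.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return [(v - mean) / std for v in values]

# Hypothetical feature column with a large range (e.g. a population count)
population = [300.0, 1200.0, 2500.0, 800.0, 5200.0]
scaled = standardize(population)

# After scaling, the column has mean ~0 and standard deviation ~1
print([round(z, 3) for z in scaled])
```

Because every column ends up on the same scale, a feature measured in thousands no longer dominates one measured in single digits.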

In [3]:
from sklearn.preprocessing import StandardScaler

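# Per-feature means before scaling (a notebook cell displays only its last expression)
X.mean(axis=0)
# Note: fitting the scaler on all rows before cross-validation lets fold statistics leak;
# wrapping the scaler and model in a Pipeline would avoid this.
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Per-feature means after scaling: all approximately zero
X.mean(axis=0)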

array([ 6.60969987e-17,  5.50808322e-18,  6.60969987e-17, -1.06030602e-16,
       -1.10161664e-17,  3.44255201e-18, -1.07958431e-15, -8.52651283e-15])

#### Convert the numeric target column to discrete values

Discretizing the target means transforming a target variable that contains continuous numeric data into discrete categories, which turns the regression problem into a classification problem. This is typically done through binning or quantization.

**Binning**: This involves grouping a range of continuous values into bins or categories. For instance, ages in a dataset could be converted from numerical values like 22, 34, 45, etc., into categorical bins such as '20-29', '30-39', '40-49'.  
**Quantization**: This method assigns each continuous value to a discrete class that represents a range or a specific condition. For example, converting a continuous income variable into 'low', 'medium', and 'high' income categories based on predefined income thresholds.

In this notebook, the target is split at its median into two classes: 0 for districts below the median house value and 1 for all others.
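The two ideas can be sketched in plain Python. The bin edges, labels, and sample values here are hypothetical; the median split in the first part mirrors the rule the notebook applies to the target.

```python
from bisect import bisect_right
from statistics import median

# Hypothetical continuous target values (not taken from the dataset)
values = [0.9, 1.4, 1.8, 2.1, 2.6, 3.5, 4.8]

# Two-class split at the median: 0 below the median, 1 otherwise
threshold = median(values)
binary = [0 if v < threshold else 1 for v in values]

# Multi-class binning with predefined edges: value < 1.5 -> 'low',
# 1.5 <= value < 3.0 -> 'medium', value >= 3.0 -> 'high'
edges = [1.5, 3.0]
labels = ['low', 'medium', 'high']
binned = [labels[bisect_right(edges, v)] for v in values]

print(binary)  # [0, 0, 0, 1, 1, 1, 1]
print(binned)  # ['low', 'low', 'medium', 'medium', 'medium', 'high', 'high']
```

Splitting at the median has the practical advantage of producing two classes of roughly equal size, which keeps accuracy a meaningful metric.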

In [4]:
y_discrete = np.where(y < np.median(y), 0, 1)

In [5]:
print(y_discrete)

[1 1 1 ... 0 0 0]



### Hyperopt workflow
#### Define the function to minimize

In [6]:
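def objective(params):
    # Copy the sampled parameters so the dict passed in by Hyperopt is not mutated
    params = dict(params)
    classifier_type = params.pop('type')
    if classifier_type == 'svm':
        clf = SVC(**params)
    elif classifier_type == 'rf':
        clf = RandomForestClassifier(**params)
    elif classifier_type == 'logreg':
        clf = LogisticRegression(**params)
    else:
        # An unknown type indicates a bug in the search space, so fail loudly
        raise ValueError(f"Unknown classifier type: {classifier_type}")
    # Mean accuracy over scikit-learn's default 5-fold cross-validation
    accuracy = cross_val_score(clf, X, y_discrete).mean()

    # Because fmin() tries to minimize the objective, this function must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}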

#### Define the search space over hyperparameters

The **search space** is the set of all hyperparameter configurations that the optimizer is allowed to explore. In Hyperopt it is built from `hp` expressions such as `hp.choice` and `hp.lognormal`, each identified by a unique label.

#### Select the search algorithm

The two main choices are:
* `hyperopt.tpe.suggest`: Tree of Parzen Estimators, a Bayesian approach that iteratively and adaptively selects new hyperparameter settings to explore based on previous results
* `hyperopt.rand.suggest`: Random search, a non-adaptive approach that samples over the search space
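
To make the non-adaptive option concrete, the following is a toy sketch of random search written without Hyperopt. The quadratic `toy_loss`, the sampling range, and the helper `random_search` are assumptions made for illustration. Inside `fmin`, `hyperopt.rand.suggest` plays this role, while `tpe.suggest` additionally uses earlier trial results to decide where to sample next.

```python
import random

def toy_loss(c):
    """Hypothetical loss with its minimum at c = 2.0."""
    return (c - 2.0) ** 2

def random_search(loss_fn, n_trials, seed=0):
    """Sample candidates independently and keep the one with the lowest loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        c = rng.uniform(0.0, 5.0)  # each draw ignores all previous results
        trials.append({'C': c, 'loss': loss_fn(c)})
    best = min(trials, key=lambda t: t['loss'])
    return best, trials

best, trials = random_search(toy_loss, n_trials=32)
print(best)
```

Random search is a useful baseline: if an adaptive algorithm does not beat it within the same number of evaluations, the search space or objective may need rethinking.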

In [14]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

search_space = hp.choice('classifier_type', [
    {
        'type': 'svm',
        'C': hp.lognormal('SVM_C', 0, 1.0),
        'kernel': hp.choice('kernel', ['linear', 'rbf'])
    },
    {
        'type': 'rf',
        'max_depth': hp.choice('max_depth', [int(x) for x in np.arange(2, 6, 1)]),  # Ensure max_depth is an integer
        'criterion': hp.choice('criterion', ['gini', 'entropy'])
    },
    {
        'type': 'logreg',
        'C': hp.lognormal('LR_C', 0, 1.0),
        'solver': hp.choice('solver', ['liblinear', 'lbfgs'])
    },
])


# Set your algorithm to use for hyperparameter optimization
algo = tpe.suggest

# Initialize Trials object
trials = Trials()

# Start the MLflow run and hyperparameter optimization (mlflow is imported in the first cell)
with mlflow.start_run():
    best_results = fmin(
        fn=objective,
        space=search_space,
        algo=algo,
        max_evals=32,
        trials=trials  # Use the initialized Trials object here
    )


100%|███████████████████████████████████████████████| 32/32 [09:36<00:00, 18.02s/trial, best loss: -0.8375968992248062]
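
When reading the result, two details are easy to miss. The reported best loss is the negated accuracy, so a loss of -0.8376 corresponds to a mean cross-validated accuracy of about 0.8376. Also, for parameters defined with `hp.choice`, `fmin` returns the index of the chosen option rather than the option itself; `hyperopt.space_eval(search_space, best_results)` maps those indices back to values. The sketch below imitates that decoding by hand in plain Python. The `example_best` dictionary is a hypothetical return value, not the actual result of the run above.

```python
# Options in the same order as they appear in the search space
classifier_options = ['svm', 'rf', 'logreg']
criterion_options = ['gini', 'entropy']
max_depth_options = [2, 3, 4, 5]

# Hypothetical fmin() return value: hp.choice entries are indices, not values
example_best = {'classifier_type': 1, 'criterion': 0, 'max_depth': 2}

decoded = {
    'type': classifier_options[example_best['classifier_type']],
    'criterion': criterion_options[example_best['criterion']],
    'max_depth': max_depth_options[example_best['max_depth']],
}
print(decoded)  # {'type': 'rf', 'criterion': 'gini', 'max_depth': 4}

# The loss is the negated accuracy, so flip the sign to recover it
best_loss = -0.8375968992248062
print(round(-best_loss, 4))  # 0.8376
```

Reading `max_depth: 2` as a tree depth of 2 instead of the third option (depth 4) is a common mistake, which is why decoding with `space_eval` before reusing the parameters is worthwhile.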
