## Hyperparameter Optimization with Maggy

*Note: currently this notebook needs to be run with a PySpark kernel to work properly!*

In this notebook, we'll use the [Maggy](https://maggy.ai/master/) library from Hopsworks to run hyperparameter tuning experiments. In particular, we will:

- Load a training dataset from the feature store.
- Train models on the dataset using different hyperparameters.

![tutorial-flow](images/maggy_hp.png)

We will train our model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, and PyTorch.

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
31,application_1653473648291_0132,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

### Load Training Data

First, we'll need to fetch the training dataset that we created in the previous notebook. Since we're running this notebook in a PySpark kernel, the splits are returned as Spark DataFrames, which we'll convert back to Pandas DataFrames with `toPandas()`.

In [3]:
feature_view = fs.get_feature_view("churn_feature_view", 1)

td_version = 1 
_, td_df_random = feature_view.get_training_dataset_splits({'train': 70, 'validation': 30}, version = td_version)

X_train = td_df_random["train"].toPandas()
X_val = td_df_random["validation"].toPandas()

X_train.head()

   gender  seniorcitizen  partner  ...  monthlycharges  totalcharges  churn
0       0              0        0  ...        0.069652      0.002907      0
1       0              0        0  ...        0.063184      0.002833      1
2       0              0        0  ...        0.061194      0.002810      0
3       0              0        0  ...        0.066169      0.002867      0
4       0              0        0  ...        0.068159      0.002890      1

[5 rows x 20 columns]

Next, we'll separate the target column from the features.

We will train a model to predict `churn` given the rest of the features.

In [4]:
target = feature_view.label[0]

y_train = X_train.pop(target)
y_val = X_val.pop(target)

Let's check the distribution of our target label.

In [5]:
y_train.value_counts(normalize=True)

0    0.730319
1    0.269681
Name: churn, dtype: float64

We can see that the distribution is unbalanced.

### Hyperparameter Optimization

In the following example, we'll use a random forest ensemble model and do a hyperparameter search over the number of trees in the ensemble (`n_estimators`). Since our dataset is unbalanced we will evaluate each hyperparameter configuration using the *F1-score* rather than *accuracy*.
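
To make the choice of metric concrete, here is a minimal sketch in plain Python. It uses toy labels (an assumption, chosen to mimic the roughly 73/27 split seen above, not the real dataset) and a hypothetical "model" that always predicts the majority class. The helper functions `accuracy` and `f1` are written out by hand for illustration; in the notebook itself we use scikit-learn's `f1_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        # Without any true positives the score is reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (assumed): 73 non-churners and 27 churners.
y_true = [0] * 73 + [1] * 27
# A trivial baseline that never predicts churn.
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_f1 = f1(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.2f}, f1={baseline_f1:.2f}")
```

The baseline looks respectable by accuracy even though it never identifies a single churner, while its F1-score for the churn class is zero.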

First, we define a training function that will return an evaluation score given a hyperparameter configuration.

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def training_function(n_estimators):
    clf = RandomForestClassifier(class_weight="balanced", n_estimators=n_estimators)
    clf.fit(X_train, y_train)
    preds = clf.predict(X_val)
    score = f1_score(y_val, preds)
    return score

Note that this code assumes that `X_train`, `y_train`, `X_val`, and `y_val` already exist in the notebook's namespace.
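
If you would rather not depend on notebook-level variables, one option is to bind the data explicitly with a small factory function. The sketch below is only an illustration: `make_training_function`, `MajorityClassifier`, and the toy data are hypothetical names introduced here, and the stand-in classifier exists solely so that the example runs without scikit-learn.

```python
def make_training_function(X_train, y_train, X_val, y_val, build_model, score_fn):
    """Return a one-argument training function with the data bound explicitly."""

    def training_function(n_estimators):
        model = build_model(n_estimators)
        model.fit(X_train, y_train)
        preds = model.predict(X_val)
        return score_fn(y_val, preds)

    return training_function


class MajorityClassifier:
    """Hypothetical stand-in model that always predicts the most common label."""

    def __init__(self, n_estimators):
        self.n_estimators = n_estimators  # accepted but unused by this stand-in

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (assumed) just to exercise the factory.
demo_train_fn = make_training_function(
    X_train=[[0], [1], [2], [3]],
    y_train=[0, 0, 0, 1],
    X_val=[[4], [5]],
    y_val=[0, 1],
    build_model=MajorityClassifier,
    score_fn=accuracy,
)
demo_score = demo_train_fn(10)
print(demo_score)
```

With the real data you would pass `lambda n: RandomForestClassifier(class_weight="balanced", n_estimators=n)` as `build_model` and `f1_score` as `score_fn`.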

Let's test the code to see that it works.

In [7]:
score = training_function(1)
print(f"Score: {score}")

Score: 0.4953271028037383

Now let's see if we can find a value for `n_estimators` that gives us a better score.

To do this we'll define a search space, which represents the set of possible values we want to consider for our hyperparameters. We'll also need to define datatypes for the hyperparameters.

In [8]:
from maggy import Searchspace

sp = Searchspace(n_estimators=('INTEGER', [1, 100]))

Hyperparameter added: n_estimators

Next, we'll define a configuration for our hyperparameter search. Some important parameters are:
- `num_trials`: Number of models to train. You should set this based on how much time you are willing to spend. We'll run only two trials here to showcase the functionality.
- `optimizer`: Strategy used to determine the next parameter value to try. We will use random search, but you can read about alternatives [here](https://maggy.ai/master/hpo/strategies/).
- `direction`: Should be set to `max` if the output of `train_fn` should be maximized, otherwise `min`. Since a higher F1-score is better, we use `max`.
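
To build intuition for how these parameters interact, here is a minimal, sequential sketch of random search over an integer range. It is not Maggy's implementation. `random_search` and `toy_objective` are hypothetical names introduced for illustration, the toy objective stands in for a real training function, and the keys of the returned dict are simply modelled on the result dict that `lagom` returns.

```python
import random


def random_search(train_fn, low, high, num_trials, direction="max", seed=None):
    """Sample integers uniformly from [low, high] and keep the best score."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        value = rng.randint(low, high)  # both endpoints are inclusive
        trials.append((value, train_fn(value)))
    # direction decides whether the highest or the lowest score wins.
    pick = max if direction == "max" else min
    best_value, best_score = pick(trials, key=lambda trial: trial[1])
    return {
        "best_hp": {"n_estimators": best_value},
        "best_val": best_score,
        "metric_list": [score for _, score in trials],
    }


def toy_objective(n_estimators):
    """Hypothetical stand-in for a training function; peaks at n_estimators == 60."""
    return 1.0 - ((n_estimators - 60) / 100) ** 2


search_result = random_search(toy_objective, low=1, high=100, num_trials=5, seed=0)
print(search_result["best_hp"], search_result["best_val"])
```

Each extra trial costs one full model training, which is why `num_trials` is mainly a time budget.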

Now we can run the `lagom` method, which tries to find the best value. Lagom is a Swedish word that means "just right". The function is "lagom" in the way it uses your resources.

In [10]:
from maggy import experiment
result = experiment.lagom(train_fn=training_function, 
                          searchspace=sp,
                          optimizer='randomsearch', 
                          direction='max',
                          num_trials=2,
                          name='churn_lr')
result


Started Maggy Experiment: churn_lr, application_1653473648291_0132, run 1

------ RandomSearch Results ------ direction(max) 
BEST combination {"n_estimators": 89} -- metric 0.5510835913312694
WORST combination {"n_estimators": 30} -- metric 0.5309548793284364
AVERAGE metric -- 0.5410192353298529
EARLY STOPPED Trials -- 0
Total job time 0 hours, 0 minutes, 25 seconds

Finished Experiment
{'best_id': 'faa7f5aad3e410ec', 'best_val': 0.5510835913312694, 'best_hp': {'n_estimators': 89}, 'worst_id': 'a71a06c8d311c21f', 'worst_val': 0.5309548793284364, 'worst_hp': {'n_estimators': 30}, 'avg': 0.5410192353298529, 'metric_list': [0.5510835913312694, 0.5309548793284364], 'num_trials': 2, 'early_stopped': 0}


The function returns a dict with results from our experiment. Of special interest is the `best_hp` entry, which contains the best hyperparameters found. Let's save this dict.

In [13]:
import pickle

with open("best_params.pickle", "wb") as f:
    pickle.dump(result["best_hp"], f)
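
As a quick sanity check, the saved parameters can be read back with `pickle.load`. The sketch below is self-contained: it writes a copy of the dict to a temporary directory (an assumption made so the example does not touch your working directory) and hard-codes the value found in the run above.

```python
import os
import pickle
import tempfile

# Value copied from the experiment output above; your run may find a different one.
best_hp = {"n_estimators": 89}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_params.pickle")

    # Write the dict in binary mode, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(best_hp, f)

    # Read it back, also in binary mode.
    with open(path, "rb") as f:
        loaded_hp = pickle.load(f)

print(loaded_hp)
```

The loaded dict can then be unpacked into the model constructor, for example `RandomForestClassifier(class_weight="balanced", **loaded_hp)`.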

You can also upload this file to your cluster using the *hopsworks* library. To do this you would run the following code:

In [14]:
import hopsworks

hopsworks_conn = hopsworks.connection()
project = hopsworks_conn.get_project()
dataset_api = project.get_dataset_api()

uploaded_file_path = dataset_api.upload("best_params.pickle", "Resources")
print(uploaded_file_path)

Connected. Call `.close()` to terminate connection gracefully.
Resources/best_params.pickle
Uploading: 100.000%|##########| 32/32 elapsed<00:00 remaining<00:00

### Next Steps

In the next notebook, we'll look at how to register a model to the [Hopsworks Model Registry](https://docs.hopsworks.ai/machine-learning-api/latest), which enables us to version control our models and easily create APIs for them.