## Hyperparameter Optimization with Maggy

### **Note: currently this notebook must be run with a PySpark kernel to work properly!**

In this notebook, we'll use the [Maggy](https://maggy.ai/master/) library from Hopsworks to run experiments with hyperparameter tuning. In particular we will:

- Load a training dataset from the feature store.
- Train models on the dataset using different hyperparameters.

![tutorial-flow](images/maggy_hp.png)

We will train our model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, and PyTorch.

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
0,application_1653266850468_0007,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

In [2]:
feature_view = fs.get_feature_view("transactions_view", 1)

As described in the previous notebook, the feature view records which feature is the label:

In [3]:
feature_view.label

['fraud_label']

### Load Training Data

First, we'll need to fetch the training dataset that we created in the previous notebook. Since we're running this notebook in a PySpark kernel, we'll get Spark DataFrames, which we'll need to convert back to pandas DataFrames.


In [4]:
_, td_df = feature_view.get_training_dataset_splits({'train': 80, 'validation': 20}, start_time=None, end_time=None, version=1)



In [5]:
td_df["train"].show()

+-----------+--------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|fraud_label|category|              amount|  age_at_transaction|days_until_card_expires|           loc_delta|   trans_volume_mstd|   trans_volume_mavg|          trans_freq|      loc_delta_mavg|
+-----------+--------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|          0|       5|                 0.0|0.010857729704653134|      0.850452102272883| 0.02495461846742725|0.003957852755597837|0.003957852755597837|0.003957852755597837|3.537142033300619E-5|
|          0|       5|                 0.0| 0.04737893138702051|     0.9437215841540204|0.035718227936116044|5.155452750525806E-4|5.155452750525806E-4|5.155452750525806E-4| 0.09750806388585448|
|          0|       5|        

In [6]:
td_df["train"].printSchema()

root
 |-- fraud_label: integer (nullable = true)
 |-- category: integer (nullable = true)
 |-- amount: double (nullable = true)
 |-- age_at_transaction: double (nullable = true)
 |-- days_until_card_expires: double (nullable = true)
 |-- loc_delta: double (nullable = true)
 |-- trans_volume_mstd: double (nullable = true)
 |-- trans_volume_mavg: double (nullable = true)
 |-- trans_freq: double (nullable = true)
 |-- loc_delta_mavg: double (nullable = true)

In [7]:
X_train = td_df["train"].toPandas()
X_val = td_df['validation'].toPandas()

X_train.head()

   fraud_label  category  amount  ...  trans_volume_mavg  trans_freq  loc_delta_mavg
0            0         5     0.0  ...           0.003958    0.003958        0.000035
1            0         5     0.0  ...           0.000516    0.000516        0.097508
2            0         5     0.0  ...           0.002621    0.002621        0.074447
3            0         5     0.0  ...           0.159000    0.159000        0.000128
4            0         5     0.0  ...           0.074303    0.074303        0.000086

[5 rows x 10 columns]

We will train a model to predict `fraud_label` given the rest of the features.

In [8]:
target = feature_view.label[0]

y_train = X_train.pop(target)
y_val = X_val.pop(target)

Let's check the distribution of our target label.

In [9]:
y_train.value_counts(normalize=True)

0    0.998444
1    0.001556
Name: fraud_label, dtype: float64

Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. Thus we should somehow address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, we'll use the simplest method which is to just supply a class weight parameter to our learning algorithm. The class weight will affect how much importance is attached to each class, which in our case means that higher importance will be placed on positive (fraudulent) samples.
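
To make the effect of class weights concrete, here is a minimal sketch that derives inverse-frequency weights from label counts. This is the heuristic that scikit-learn applies for `class_weight='balanced'`; it is shown only as an illustration and is not what the search later in this notebook does. The helper name `balanced_class_weights` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels (hypothetical): 998 legitimate transactions and 2 fraudulent ones.
toy_labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(toy_labels)

# The rare positive class receives a much larger weight than the majority class.
print(weights)
```

With weights like these, each misclassified fraudulent sample contributes far more to the loss than a misclassified legitimate one.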

### Hyperparameter Optimization

In the following example, we'll use a simple logistic regression model and do a hyperparameter search over class weights. Since our dataset is imbalanced, we will evaluate each hyperparameter configuration using the *F1-score* rather than *accuracy*.
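
The reason for preferring the F1-score can be seen in a small sketch. The pure-Python `accuracy` and `f1` helpers and the toy labels below are assumptions made for illustration (the notebook itself uses `sklearn.metrics.f1_score`). Two predictors reach the same accuracy, but only one of them ever detects fraud.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        # Convention used here: no true positives means an F1-score of 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 998 legitimate transactions followed by 2 frauds.
y_true = [0] * 998 + [1] * 2

# Predictor A never flags anything as fraud.
always_negative = [0] * 1000
# Predictor B raises one false alarm, catches one fraud, and misses one.
useful = [0] * 997 + [1] + [1, 0]

acc_naive, f1_naive = accuracy(y_true, always_negative), f1(y_true, always_negative)
acc_useful, f1_useful = accuracy(y_true, useful), f1(y_true, useful)

print(acc_naive, f1_naive)    # same accuracy as B, but F1 is 0.0
print(acc_useful, f1_useful)  # same accuracy as A, but F1 is 0.5
```

Accuracy cannot tell the two predictors apart, whereas the F1-score drops to zero for the one that never predicts the positive class.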

First, we define a training function that will return an evaluation score given a hyperparameter configuration.

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

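def training_function(pos_class_weight):
    # The two class weights sum to 1; a larger pos_class_weight puts more
    # emphasis on the positive (fraudulent) class.
    clf = LogisticRegression(class_weight={0: 1.0 - pos_class_weight, 1: pos_class_weight}, solver='liblinear')
    clf.fit(X_train, y_train)
    # Evaluate on the validation split using the F1-score.
    preds = clf.predict(X_val)
    score = f1_score(y_val, preds)
    # Return the score under the 'metric' key.
    return {'metric': score}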

Note that this code assumes that the `X_train`, `y_train`, `X_val`, and `y_val` variables already exist in the namespace.

Let's test the code to see that it works.

In [11]:
score = training_function(0.5)

Now let's see if we can find a value for `pos_class_weight` that gives us a better score.

To do this we'll define a search space, which represents the set of possible values we want to consider for our hyperparameters. We'll also need to define datatypes for the hyperparameters.

In [12]:
from maggy import Searchspace

sp = Searchspace(pos_class_weight=('DOUBLE', [0.1, 0.9]))

Hyperparameter added: pos_class_weight

Next we'll define a configuration for our hyperparameter search. Some important parameters are:
- `num_trials`: Number of models to train. You should set this based on how much time you are willing to spend. We'll just do two trials here to showcase the functionality.
- `optimizer`: Strategy used to determine the next parameter value to try. We will just use random search (`randomsearch`), but you can read about alternatives [here](https://maggy.ai/master/hpo/strategies/).
- `direction`: Should be set to `max` if the output of `train_fn` should be maximized, otherwise `min`.
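
To illustrate how these three parameters interact, here is a minimal, sequential sketch of a random search in plain Python. It is not Maggy's implementation and says nothing about how Maggy schedules trials; the names `random_search` and `toy_train_fn`, the `seed` argument, and the toy objective are all assumptions made for this example. The returned keys merely mimic the shape of a result summary.

```python
import random


def random_search(train_fn, low, high, num_trials, direction="max", seed=0):
    """Sample `num_trials` values uniformly from [low, high] and keep the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        value = rng.uniform(low, high)
        # The training function returns a dict with the score under 'metric'.
        metric = train_fn(value)["metric"]
        trials.append((value, metric))
    # `direction` decides whether a higher or a lower metric counts as better.
    pick = max if direction == "max" else min
    best_value, best_metric = pick(trials, key=lambda t: t[1])
    return {"best_hp": best_value, "best_val": best_metric, "num_trials": len(trials)}


def toy_train_fn(pos_class_weight):
    # Hypothetical objective that peaks at 0.7; it stands in for real training.
    return {"metric": 1.0 - abs(pos_class_weight - 0.7)}


summary = random_search(toy_train_fn, low=0.1, high=0.9, num_trials=5)
print(summary)
```

More trials give the search more chances to land near a good value, which is why `num_trials` is mainly a trade-off against the time you are willing to spend.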

Now we can run the `lagom` method, which tries to find the best value. Lagom is a Swedish word that means "just right". The function is "lagom" in the way it uses your resources.

In [13]:
from maggy import experiment
result = experiment.lagom(train_fn=training_function, 
                          searchspace=sp,
                          optimizer='randomsearch', 
                          direction='max',
                          num_trials=2,
                          name='fraud_lr')

Started Maggy Experiment: fraud_lr, application_1653266850468_0007, run 1

------ RandomSearch Results ------ direction(max) 
BEST combination {"pos_class_weight": 0.772225376982996} -- metric 0.0
WORST combination {"pos_class_weight": 0.772225376982996} -- metric 0.0
AVERAGE metric -- 0.0
EARLY STOPPED Trials -- 0
Total job time 0 hours, 0 minutes, 40 seconds

Finished Experiment


The function returns a dict with results from our experiment. Of special interest is of course the `best_hp` dict, which contains the best hyperparameters found. Let's inspect the result and then save this dict.

In [14]:
result

{'best_id': 'f6959958e73eeda3', 'best_val': 0.0, 'best_hp': {'pos_class_weight': 0.772225376982996}, 'worst_id': 'f6959958e73eeda3', 'worst_val': 0.0, 'worst_hp': {'pos_class_weight': 0.772225376982996}, 'avg': 0.0, 'metric_list': [0.0, 0.0], 'num_trials': 2, 'early_stopped': 0}

In [15]:
import pickle

with open("best_params.pickle", "wb") as f:
    pickle.dump(result["best_hp"], f)
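
As a quick check that the saved parameters can be restored later, the following sketch writes a dict of the same shape to a temporary location, loads it back, and rebuilds the `class_weight` mapping used in the training function. The temporary directory and the variable names are assumptions made for this example; the value is copied from the experiment output above.

```python
import os
import pickle
import tempfile

# Same shape as result["best_hp"]; the value is taken from the output above.
best_hp = {"pos_class_weight": 0.772225376982996}

# Write to a temporary directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "best_params.pickle")
with open(path, "wb") as f:
    pickle.dump(best_hp, f)

# Load the parameters back, as a later notebook might do.
with open(path, "rb") as f:
    loaded = pickle.load(f)

# Rebuild the class_weight mapping in the same way as training_function.
pos = loaded["pos_class_weight"]
class_weight = {0: 1.0 - pos, 1: pos}
print(class_weight)
```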

You can also upload this file to your cluster using the *hopsworks* library. To do this you would run the following code:

In [16]:
import hopsworks

hopsworks_conn = hopsworks.connection()
project = hopsworks_conn.get_project()
dataset_api = project.get_dataset_api()

uploaded_file_path = dataset_api.upload("best_params.pickle", "Resources")
print(uploaded_file_path)

Connected. Call `.close()` to terminate connection gracefully.
Resources/best_params.pickle
Uploading: 100.000%|##########| 43/43 elapsed<00:00 remaining<00:00

### Next Steps

In the next notebook, we'll look at how to register a model to the [Hopsworks Model Registry](https://docs.hopsworks.ai/machine-learning-api/latest), which enables us to version control our models and easily create APIs for them.