## Hyperparameter Optimization with Maggy

### **Note: currently this notebook needs to be run with a PySpark kernel to work properly!*

In this notebook, we'll use the [Maggy](https://maggy.ai/master/) library from Hopsworks to run experiments with hyperparameter tuning. In particular we will:

- Load a training dataset from the feature store.
- Train models on the dataset using different hyperparameters.

![tutorial-flow](images/maggy_hp.png)

We will train our model using standard Python and Scikit-learn, although it could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, and PyTorch.

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
0,application_1653087438552_0010,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

In [2]:
feature_view = fs.get_feature_view("transactions_view", 1)

## TODO: (Davit): explain about label here or in the previous 

In [3]:
feature_view.label

['fraud_label']

### Load Training Data

First, we'll need to fetch the training dataset that we created in the previous notebook. Since we're running this notebook in a PySpark Kernel we'll get Spark Dataframes, which we'll need to convert back to Pandas Dataframes.

In [20]:
_, train_df = feature_view.get_training_dataset_splits({'train': 80}, start_time=None, end_time=None, version = 1)
_, val_df = feature_view.get_training_dataset_splits({'validation': 20}, start_time=None, end_time=None, version = 1)



In [22]:
train_df.show()

+-----------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|        _c0|     _c1|                 _c2|                 _c3|                 _c4|                 _c5|                 _c6|                 _c7|                 _c8|                 _c9|
+-----------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|fraud_label|category|              amount|  age_at_transaction|days_until_card_e...|           loc_delta|   trans_volume_mstd|   trans_volume_mavg|          trans_freq|      loc_delta_mavg|
|          0|       5|1.067794613710282...|0.006711374929193003|  0.7216365405497585|0.028996255497714032|8.709211442635827E-4|8.709211442635827E-4|8.709211442635827E-4| 8.31250223954274E-5|
|          0|       5|1.101163195388729E-5|  

In [None]:
X_train = train_df.toPandas()
X_val = val_df.toPandas()

X_train.head()

We will train a model to predict `fraud_label` given the rest of the features.

In [11]:
target = feature_view.label

y_train = X_train.pop(target)
y_val = X_val.pop(target)

An error was encountered:
"None of [Index(['fraud_label'], dtype='object')] are in the [columns]"
Traceback (most recent call last):
  File "/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/core/frame.py", line 5226, in pop
    return super().pop(item=item)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/core/generic.py", line 870, in pop
    result = self[item]
  File "/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/core/frame.py", line 3464, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
  File "/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1314, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/core/indexing.py", line 1374, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['fraud_lab

Let's check the distribution of our target label.

In [12]:
y_train.value_counts(normalize=True)

An error was encountered:
name 'y_train' is not defined
Traceback (most recent call last):
NameError: name 'y_train' is not defined



Notice that the distribution is extremely skewed, which is natural considering that fraudulent transactions make up a tiny part of all transactions. Thus we should somehow address the class imbalance. There are many approaches for this, such as weighting the loss function, over- or undersampling, creating synthetic data, or modifying the decision threshold. In this example, we'll use the simplest method which is to just supply a class weight parameter to our learning algorithm. The class weight will affect how much importance is attached to each class, which in our case means that higher importance will be placed on positive (fraudulent) samples.

### Hyperparameter Optimization

In the following example, we'll use a simple logistic regression model and do a hyperparameter search over class weights. Since our dataset is unbalanced we will evaluate each hyperparameter configuration using the *F1-score* rather than *accuracy*.

First, we define a training function that will return an evaluation score given a hyperparameter configuration.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def training_function(pos_class_weight):
    clf = LogisticRegression(class_weight={0: 1.0 - pos_class_weight, 1: pos_class_weight}, solver='liblinear')
    clf.fit(X_train, y_train)
    preds = clf.predict(X_val)
    score = f1_score(y_val, preds)
    return {'metric': score} 

Note that this code assumes that the `X_train`, `y_train` etc variables already exist in the namespace.

Let's test the code to see that it works.

In [14]:
score = training_function(0.5)
print(f"Score: {score}")

An error was encountered:
name 'y_train' is not defined
Traceback (most recent call last):
  File "<stdin>", line 6, in training_function
NameError: name 'y_train' is not defined



Now let's see if we can find a value for `class_weight` that gives us a better score.

To do this we'll define a search space, which represents the set of possible values we want to consider for our hyperparameters. We'll also need to define datatypes for the hyperparameters.

In [15]:
from maggy import Searchspace

sp = Searchspace(pos_class_weight=('DOUBLE', [0.1, 0.9]))

Hyperparameter added: pos_class_weight

Next we'll define a configuration for our hyperparameter search. Some important parameters are:
- `num_trials`: Number of models to train. You should set this based on how much time you are willing to spend. We'll just do five trials here to showcase the functionality.
- `optimizer`: Strategy used to determine the next parameter value to try. We will just use grid search, but you can read about alternatives [here](https://maggy.ai/master/hpo/strategies/).
- `direction`: Should be set to `max` if the output of `train_fn` should be maximized, otherwise `min`.

Now we can run the `lagom` method, which tries to find the best value. Lagom is a Swedish word that means "just right". The function is "lagom" in the way it uses your resources.

In [16]:
from maggy import experiment
result = experiment.lagom(train_fn=training_function, 
                          searchspace=sp,
                          optimizer='randomsearch', 
                          direction='max',
                          num_trials=2,
                          name='fraud_lr')

HBox(children=(FloatProgress(value=0.0, description='Maggy experiment', max=2.0, style=ProgressStyle(descripti…

An error was encountered:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 1628) (hopsworks0.logicalclocks.com executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/srv/hops/spark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/srv/hops/spark/python/lib/pyspark.zip/pyspark/worker.py", line 594, in process
    out_iter = func(split_index, iterator)
  File "/srv/hops/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
  File "/srv/hops/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
  File "/srv/hops/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2916, in pipeline_func
  [Previous line repeated 1 more time]
  File "/srv/hops/spark/python/lib/pyspark.zip/pysp




The function returns a dict with results from our experiment. Of special interest is of course the `best_config` dict, which contains the best hyperparameters found. Let's save this dict.

In [17]:
result

An error was encountered:
name 'result' is not defined
Traceback (most recent call last):
NameError: name 'result' is not defined



In [18]:
import pickle

with open("best_params.pickle", "wb") as f:
    pickle.dump(result["best_hp"], f)

An error was encountered:
name 'result' is not defined
Traceback (most recent call last):
NameError: name 'result' is not defined



You can also upload this file to your cluster using the *hopsworks* library. To do this you would run the following code:

In [19]:
import hopsworks

hopsworks_conn = hopsworks.connection()
project = hopsworks_conn.get_project()
dataset_api = project.get_dataset_api()

uploaded_file_path = dataset_api.upload("best_params.pickle", "Resources")
print(uploaded_file_path)

Connected. Call `.close()` to terminate connection gracefully.
Resources/best_params.pickle
Uploading: 0.000%|          | 0/0 elapsed<00:00 remaining<?

### Next Steps

In the next notebook, we'll look at how to register a model to the [Hopsworks Model Registry](https://docs.hopsworks.ai/machine-learning-api/latest), which enables us to version control our models and easily create APIs for them.