# Hyperparameter optimization with Maggy

*Note: this notebook needs to be run with a PySpark kernel to work properly!*

We'll use the [Maggy](https://maggy.ai/master/) library from Hopsworks to run experiments with hyperparameter tuning. 

The idea is to wrap the training code in a function and hand that function to a Maggy experiment, which runs it once per trial. We can reuse code similar to the previous notebook, `4_model_training_and_registration.ipynb`.

The function will accept input parameters corresponding to model parameters that we want to tune. In this tutorial, a simple optimization we might want to do is to find a suitable value for the class weight on the positive class. With another model, for instance random forest or gradient boosting, you might want to tune the number of trees in the ensemble, the maximum depth of the trees, and so on.

Start by reading the training and validation data in the same way as for the previous notebook. We give the code without comments for brevity.

In [1]:
import hsfs
import pandas as pd

conn = hsfs.connection()
fs = conn.get_feature_store()
td = fs.get_training_dataset("transactions_dataset_splitted", version=2)
train_df = td.read('train')
val_df = td.read('validation')

if not isinstance(train_df, pd.DataFrame):
    train_df = train_df.toPandas()
    val_df = val_df.toPandas()
    
target = 'fraud_label'
features = [col for col in train_df.columns if col != target]

X_train, y_train = train_df[features], train_df[target]
X_val, y_val = val_df[features], val_df[target]

Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
10,application_1649849762999_0153,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

Now wrap the training code into a function that returns some metric we want to optimize. 

This code assumes that `X_train`, `y_train`, `X_val`, and `y_val` already exist in the enclosing namespace. You could instead make them arguments of `fraud_logreg_train`, but you would then need to change how the function is passed to the experiment further down, for example by binding the data with Python's `functools.partial`.

In [8]:
def fraud_logreg_train(class_weight):
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score
    clf = LogisticRegression(class_weight={0: 1, 1: class_weight}, solver='liblinear')
    clf.fit(X_train, y_train)
    preds = clf.predict(X_val)
    pos_prec = precision_score(y_true=y_val, y_pred=preds, pos_label=1)
    return pos_prec

First, check that the function works on its own.

In [12]:
fraud_logreg_train(500)

0.004691789368604942
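
As noted earlier, the data could be passed to the training function as arguments rather than read from the namespace. The following is a minimal sketch of that pattern using `functools.partial`. The names `train_with_data`, `precision`, and the `toy_*` variables are hypothetical, the data is made up, and the "model" is a toy threshold rule standing in for logistic regression so that the sketch runs without any dependencies. Whether Maggy accepts a `partial` object as `train_fn` has not been checked here.

```python
from functools import partial


def precision(y_true, y_pred):
    """Fraction of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def train_with_data(class_weight, X_train, y_train, X_val, y_val):
    """Hypothetical training function that takes its data explicitly.

    Toy stand-in for a real model: a value is flagged as positive when it
    exceeds the mean positive training value divided by class_weight.
    """
    positives = [x for x, y in zip(X_train, y_train) if y == 1]
    threshold = sum(positives) / len(positives) / class_weight
    preds = [1 if x > threshold else 0 for x in X_val]
    return precision(y_val, preds)


# Made-up toy data, only here to make the sketch runnable.
toy_X_train, toy_y_train = [10, 20, 30, 400, 500], [0, 0, 0, 1, 1]
toy_X_val, toy_y_val = [15, 250, 300, 600], [0, 0, 1, 1]

# Bind the data once; the resulting callable only needs class_weight.
train_fn = partial(
    train_with_data,
    X_train=toy_X_train,
    y_train=toy_y_train,
    X_val=toy_X_val,
    y_val=toy_y_val,
)

print(train_fn(class_weight=1))
print(train_fn(class_weight=2))
```

After binding, `train_fn` has the same single-hyperparameter call shape as `fraud_logreg_train`, so the rest of the workflow would not need to know where the data comes from.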

Now we have to define the search space, meaning the interval to be searched for the best value. We are trying to optimize the positive-class precision, and since the positive class is very rare, the weight on it should probably be large. Let's try between 10 and 10000.

In [14]:
from maggy import Searchspace

sp = Searchspace(class_weight=('DOUBLE', [10, 10000]))

Hyperparameter added: class_weight

Now we can run the `lagom` function, which tries to find the best value. Lagom is a Swedish word that means "just right". We give it the training wrapper function and the search space we just defined. Since precision is a metric we want to be as high as possible, we use `direction='max'`.

`num_trials` is simply how many models will be trained; you should set this based on how much time you are willing to spend. We'll just do five trials here to showcase the functionality.

The `optimizer` is the strategy used to determine the next parameter value to try. We will just use random search, which often works well in practice. You can read about alternatives [here](https://maggy.ai/master/hpo/strategies/).

In [15]:
from maggy import experiment

result = experiment.lagom(
    train_fn=fraud_logreg_train,
    searchspace=sp,
    optimizer='randomsearch',
    direction='max',
    num_trials=5,
    name='fraud_lr',
)

Started Maggy Experiment: fraud_lr, application_1649849762999_0153, run 1

------ RandomSearch Results ------ direction(max) 
BEST combination {"class_weight": 1228.9784393829098} -- metric 0.00253848905950968
WORST combination {"class_weight": 8672.772219669952} -- metric 0.002301660046847948
AVERAGE metric -- 0.002397147451510917
EARLY STOPPED Trials -- 0
Total job time 0 hours, 0 minutes, 30 seconds

Finished Experiment
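
To make the random search strategy concrete, here is a plain-Python sketch of the general idea: draw a value from the interval, evaluate the training function, and keep the best result. It is an illustration only, not Maggy's implementation. The function `random_search`, the objective `toy_objective`, and the uniform sampling are assumptions, and trials run one after another here.

```python
import random


def random_search(train_fn, low, high, num_trials, direction="max", seed=0):
    """Evaluate train_fn at num_trials random points in [low, high]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    trials = []
    for _ in range(num_trials):
        value = rng.uniform(low, high)  # assumption: uniform sampling
        trials.append((train_fn(value), value))

    # Pick the trial with the highest (or lowest) metric.
    pick = max if direction == "max" else min
    best_metric, best_value = pick(trials)
    return {
        "best_val": best_metric,
        "best_hp": {"class_weight": best_value},
        "metric_list": [metric for metric, _ in trials],
        "num_trials": num_trials,
    }


def toy_objective(class_weight):
    """Hypothetical metric that peaks at class_weight == 1000."""
    return -abs(class_weight - 1000)


sketch_result = random_search(toy_objective, low=10, high=10000, num_trials=5)
print(sketch_result["best_hp"], sketch_result["best_val"])
```

With only five trials, the best value found depends heavily on where the random draws happen to land, which is why `num_trials` is worth increasing when time allows.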


In [16]:
result

{'best_id': 'febae4089cd78087', 'best_val': 0.00253848905950968, 'best_hp': {'class_weight': 1228.9784393829098}, 'worst_id': '800cb7116ffe4670', 'worst_val': 0.002301660046847948, 'worst_hp': {'class_weight': 8672.772219669952}, 'avg': 0.002397147451510917, 'metric_list': [0.00253848905950968, 0.0023123936093849343, 0.002331002331002331, 0.002301660046847948, 0.002502192210809694], 'num_trials': 5, 'early_stopped': 0}
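
The returned dictionary can be inspected like any other Python dictionary, for example to pull out the best class weight for a final retraining run. The sketch below copies a subset of the values printed above into a literal (named `result_copy` here, a hypothetical name) so that it runs on its own. In the notebook you would use `result` directly.

```python
# Subset of the values printed above, copied so the sketch is self-contained.
result_copy = {
    "best_val": 0.00253848905950968,
    "best_hp": {"class_weight": 1228.9784393829098},
    "worst_val": 0.002301660046847948,
    "avg": 0.002397147451510917,
    "metric_list": [
        0.00253848905950968,
        0.0023123936093849343,
        0.002331002331002331,
        0.002301660046847948,
        0.002502192210809694,
    ],
    "num_trials": 5,
}

# The hyperparameter value of the best trial.
best_weight = result_copy["best_hp"]["class_weight"]

# How far apart the best and worst trials are.
metric_spread = result_copy["best_val"] - result_copy["worst_val"]

# The reported average is the mean of the per-trial metrics.
mean_metric = sum(result_copy["metric_list"]) / result_copy["num_trials"]

print(f"best class_weight: {best_weight:.1f}")
print(f"metric spread across trials: {metric_spread:.6f}")
print(f"mean metric: {mean_metric:.6f}")
```

The spread between the best and worst trial is small compared with the metric itself, so five trials over this interval do not show a strong preference for any particular class weight.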