# Hyperparameter Optimization on PACE

This notebook is a tutorial on using AmpOpt to tune the hyperparameters of an amptorch model on the PACE cluster.

Before starting this notebook, please make sure that you've followed all the steps in [SETUP.md](../docs/SETUP.md).

Tip: open this notebook on a GPU-enabled PACE Jupyter job by running this command from the project root:

```
./gpu-notebook.sh
```

In [None]:
import ampopt
from ampopt.utils import format_params
from ampopt.study import get_study

## 1. Create MySQL Port

In order to run hyperparameter tuning jobs, we need a MySQL port.

Update `.env` with your MySQL settings. If SSH is required, also fill in the last five variables:

```
MYSQL_USERNAME=
MYSQL_PASSWORD=
HPOPT_DB=
MYSQL_HOSTNAME=

MYSQL_PORT=
SSH_HOST=
SSH_USER=
SSH_PASS=
SSH_PORT=
```
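Before submitting jobs, it can help to check that `.env` is filled in. The following is a minimal sketch, assuming plain `KEY=VALUE` lines; the helpers `parse_env` and `missing_keys` are hypothetical and not part of AmpOpt, and how AmpOpt itself loads `.env` is not shown here.

```python
# Hypothetical helpers: the key names come from the template above, but
# this is not how AmpOpt necessarily reads ".env".
DB_KEYS = ["MYSQL_USERNAME", "MYSQL_PASSWORD", "HPOPT_DB", "MYSQL_HOSTNAME"]
SSH_KEYS = ["MYSQL_PORT", "SSH_HOST", "SSH_USER", "SSH_PASS", "SSH_PORT"]


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, use_ssh=False):
    """Return the required keys that are absent or empty."""
    required = DB_KEYS + (SSH_KEYS if use_ssh else [])
    return [key for key in required if not values.get(key)]


# Made-up example values, for illustration only.
example = """
MYSQL_USERNAME=alice
MYSQL_PASSWORD=secret
HPOPT_DB=hpopt
MYSQL_HOSTNAME=localhost
MYSQL_PORT=
"""
env = parse_env(example)
print(missing_keys(env))                # without SSH: nothing is missing
print(missing_keys(env, use_ssh=True))  # with SSH: the last five keys are missing
```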

## 2. Preprocessing

AmpOpt requires data to be preprocessed using the preferred fingerprinting scheme and preprocessing pipeline, and saved in LMDB format, before hyperparameter optimization. This avoids wasting work by repeating the featurization for every optimization trial.

With AmpOpt, preprocessing and saving to LMDB is as easy as:

In [None]:
ampopt.preprocess("../data/oc20_50k_alex.extxyz", "../data/oc20_300_test.traj")

The data should be readable by either `ase.io.Trajectory` or `ase.io.read`. 

If you have several files, the first will be used to fit the transformers (e.g. for feature scaling). This prevents data leakage.
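To illustrate why fitting on the first file only prevents leakage, here is a toy sketch with a simple standard scaler. The functions and numbers are hypothetical stand-ins; AmpOpt's actual transformers are not shown in this notebook.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the scaling statistics from the first (training) file only."""
    return mean(values), pstdev(values)


def transform(values, scaler):
    """Apply previously fitted statistics without refitting."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]  # stand-in for features from the first file
test = [10.0, 20.0]                # stand-in for features from a later file

scaler = fit_scaler(train)              # statistics come from train only
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)   # test data never influences the scaler
```

If the scaler were fitted on both files together, the test values would shift the mean and standard deviation, and information about the test set would leak into training.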

## 3. Running an Individual Training Job

Before we launch into running hyperparameter tuning jobs, let's train an individual model and evaluate it to get a (poor) baseline.

In [None]:
ampopt.eval_score(
    epochs=10,
    train_fname="../data/oc20_50k_alex.lmdb",
    valid_fname="../data/oc20_300_test.traj",
    dropout_rate=0.,
    lr=1e-3,
    gamma=1.,
    num_nodes=5,
    num_layers=5,
    port=port,
)

The performance of this model is poor, but that's to be expected: we only trained it for 10 epochs. We'll improve this score in the next section.

## 4. Running Tuning Jobs on PACE

Let's first run a single tuning job to try and find the optimal number of layers and number of nodes per layer when training for just 10 epochs.

We only need to supply a single dataset; amptorch will split 10% of the data off as a validation set.
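Conceptually, a 10% hold-out looks like the sketch below. This is only an illustration: `split_validation` is a hypothetical helper, and whether amptorch shuffles before splitting or how it seeds the split is an assumption here.

```python
import random


def split_validation(items, val_fraction=0.1, seed=0):
    """Shuffle indices and hold out a fraction of the items for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(items) * val_fraction)
    val_items = [items[i] for i in indices[:n_val]]
    train_items = [items[i] for i in indices[n_val:]]
    return train_items, val_items


frames = list(range(50))  # stand-in for 50 structures in a dataset
train_items, val_items = split_validation(frames)
print(len(train_items), len(val_items))
```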

The `study` argument can be anything, though we should be careful not to name this study the same as a previous study. It's how we'll later retrieve the study to perform analysis.

For `params`, we can pass any of the following hyperparameters:

- Learnable Parameters:
    - `num_layers`, the number of layers of the neural network
    - `num_nodes`, the number of nodes per layer
    - `dropout_rate`, the rate of [dropout](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/) during training
    - `lr`, the learning rate for gradient descent
    - `gamma`, the decay parameter for the learning rate
- Non-Learnable Parameters:
    - `step_size`, the number of epochs after which the learning rate decreases by `gamma`
    - `batch_size`, the size of minibatches for gradient descent
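The interaction between `lr`, `gamma`, and `step_size` can be sketched as follows, assuming the usual step-decay rule (as in PyTorch's `StepLR`), where the learning rate is multiplied by `gamma` every `step_size` epochs. The function `step_lr` is a hypothetical illustration, and the exact scheduler amptorch uses is not stated here.

```python
def step_lr(lr, gamma, step_size, epoch):
    """Learning rate at a given epoch under step decay (assumed rule)."""
    return lr * gamma ** (epoch // step_size)


# With gamma=0.5 and step_size=10, the rate halves every 10 epochs.
schedule = [step_lr(1e-3, 0.5, 10, epoch) for epoch in (0, 9, 10, 25)]

# With gamma=1.0, the learning rate stays constant.
constant = [step_lr(1e-3, 1.0, 10, epoch) for epoch in (0, 10, 100)]
```

Under this rule, fixing `gamma=1.0` effectively disables learning rate decay.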

Any learnable parameter not fixed in the `params` argument will be learned during hyperparameter optimization. Any non-learnable parameter will be given a default value.

The learnable and non-learnable parameters, as well as default values in the amptorch config, are specified in `src/ampopt/train.py`. Feel free to tweak them.
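The rule for fixed, tuned, and defaulted parameters can be sketched as below. The parameter names come from the lists above, but `plan_search` is a hypothetical helper and the default values are placeholders, not the ones in `src/ampopt/train.py`.

```python
# Names taken from the lists above; the defaults are made-up placeholders.
LEARNABLE = ["num_layers", "num_nodes", "dropout_rate", "lr", "gamma"]
NON_LEARNABLE_DEFAULTS = {"step_size": 100, "batch_size": 32}  # hypothetical


def plan_search(fixed):
    """Split hyperparameters into a tuned list and a dict of constants."""
    unknown = set(fixed) - set(LEARNABLE) - set(NON_LEARNABLE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown hyperparameters: {sorted(unknown)}")
    # Learnable parameters that were not fixed are left for the optimizer.
    tuned = [name for name in LEARNABLE if name not in fixed]
    # Fixed values override defaults; everything here stays constant.
    constants = {**NON_LEARNABLE_DEFAULTS, **fixed}
    return tuned, constants


tuned, constants = plan_search({"dropout_rate": 0.0, "gamma": 1.0})
print(tuned)
print(constants)
```

With `dropout_rate` and `gamma` fixed, the remaining learnable parameters (`num_layers`, `num_nodes`, and `lr`) are the ones left to the search.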

In [None]:
ampopt.run_pace_tuning_job(
    study="tutorial1",
    trials=10,
    epochs=10,
    data="../data/oc20_50k_alex.lmdb",
    params=format_params(
        dropout_rate=0.0,
        gamma=1.0,
    ),
)

We can check that our job was successfully submitted:

In [None]:
ampopt.view_jobs()

The three jobs are as follows:

- The first job, `mysql`, is running MySQL
- The second job, `pace-jupyter-not`, is running the Jupyter notebook instance
- The third job, `tune-amptorch-hy`, is the tuning job we just triggered.

Once the job is finished, it will disappear from `ampopt.view_jobs()`. It will generate 2 log files, one for the stdout and one for the stderr. It's worth checking the log files to verify that the job completed successfully.

We can load the study as follows:

In [None]:
tutorial1 = get_study("tutorial1")

Let's take a quick look at the trials we ran:

In [None]:
tutorial1.trials_dataframe()

## 5. Parallel Tuning Jobs

Of course, for optimizing over a large hyperparameter search space, we will want to parallelize our jobs. Doing this with AmpOpt and PACE is easy: simply run `ampopt.run_pace_tuning_job()` several times. For example:

In [None]:
for _ in range(5):
    ampopt.run_pace_tuning_job(
        study="tutorial2",
        trials=20,
        epochs=100,
        data="../data/oc20_50k_alex.lmdb",
    )

## 6. Reports and Summaries

To get a summary of all studies currently in the database, run:

In [None]:
ampopt.view_studies()

For a particular study, you can load it into memory and use `optuna.visualization.matplotlib` to easily visualise the study.

AmpOpt provides a single function for generating several interesting plots:

In [None]:
ampopt.generate_report("tutorial1")

You can then view the generated plots in the `reports` folder of the project root.

Finally, perhaps you have run some experiments that aren't useful, and you'd like to clean up the list of studies. Run:

In [None]:
ampopt.delete_studies("tutorial1", "tutorial2")

In [None]:
ampopt.tune(
    jobs=5,
    study="50K-alex-local",
    trials=5,
    epochs=100,
    data="../data/oc20_50k_alex.lmdb",
)

In [None]:
ampopt.delete_studies("50K-alex-local")

Setting e.g. `jobs=2` in `ampopt.tune` would run 2 processes, but on PACE it's more efficient to submit several jobs with `ampopt.run_pace_tuning_job()` instead, as in section 5.