# Hyperparameter Optimization on PACE

This notebook is a tutorial on using AmpOpt to tune the hyperparameters of an amptorch model on the PACE cluster.

Before starting this notebook, please make sure that you've followed all the steps in [SETUP.md](../docs/SETUP.md).

Tip: open this notebook on a GPU-enabled PACE Jupyter job by running this command from the project root:

```
./gpu-notebook.sh
```

In [27]:
import ampopt
from ampopt.utils import format_params
from ampopt.study import get_study

## 1. Starting MySQL

To run hyperparameter tuning jobs on PACE, we need a separate job running MySQL in the background; its database stores the studies and their trials.

In [5]:
ampopt.ensure_mysql_running()

Starting mysql job
115629.sched-pace-ice.pace.gatech.edu
Waiting for mysql job 115629 to start...
Waiting for mysql job 115629 to start...
mysql running, job ID: 115629


This function checks whether a MySQL job is already running and starts one if not.
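
The check-then-start pattern can be sketched as follows. This is only an illustration of the idea, not AmpOpt's implementation: `ensure_job_running`, `list_jobs` and `submit_job` are hypothetical names standing in for whatever queries and submits to the scheduler, and the set of "active" states is an assumption.

```python
from typing import Callable, Iterable

# Scheduler states treated as "running or about to run" (assumption).
ACTIVE_STATES = {"R", "Q"}


def ensure_job_running(
    name: str,
    list_jobs: Callable[[], Iterable[dict]],
    submit_job: Callable[[str], str],
) -> str:
    """Return the ID of an active job called `name`, submitting one if needed."""
    for job in list_jobs():
        if job["name"] == name and job["status"] in ACTIVE_STATES:
            return job["id"]  # reuse the existing job
    return submit_job(name)  # nothing active, so start a new one


# Hypothetical stand-ins for the scheduler, for demonstration only.
queue = [{"id": "115643", "name": "pace-jupyter-not", "status": "R"}]


def fake_submit(name: str) -> str:
    job = {"id": "115629", "name": name, "status": "Q"}
    queue.append(job)
    return job["id"]


first = ensure_job_running("mysql", lambda: queue, fake_submit)
second = ensure_job_running("mysql", lambda: queue, fake_submit)
print(first, second, len(queue))
```

The second call finds the job submitted by the first, so calling the function repeatedly does not pile up duplicate jobs.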

## 2. Preprocessing

Before hyperparameter optimization, AmpOpt requires the data to be preprocessed with the preferred fingerprinting scheme and preprocessing pipeline and saved in LMDB format. This avoids repeating the featurization in every optimization trial.

With AmpOpt, preprocessing and saving to LMDB is as easy as:

In [4]:
ampopt.preprocess("../data/oc20_50k_alex.extxyz", "../data/oc20_300_test.traj")

Creating LMDBs from files /storage/home/hpaceice1/amckenzie9/bdqm-hyperparam-tuning/data/oc20_50k_alex.extxyz, /storage/home/hpaceice1/amckenzie9/bdqm-hyperparam-tuning/data/oc20_300_test.traj
/storage/home/hpaceice1/amckenzie9/bdqm-hyperparam-tuning/data/oc20_50k_alex.lmdb already exists, aborting


The data should be readable by either `ase.io.Trajectory` or `ase.io.read`.

If you pass several files, the transformers (e.g. for feature scaling) are fitted on the first file only and then applied to the others. This prevents data leakage.
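
To see why fitting on the first file only avoids leakage, here is a minimal sketch using plain standardization. It is not AmpOpt's pipeline: `fit_scaler` and `transform` are hypothetical helpers, and the numbers are made up.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute mean and standard deviation from the fitting data only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero spread
    return mu, sigma


def transform(values, scaler):
    """Standardize values with statistics computed elsewhere."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]  # stands in for the first file
test = [10.0, 20.0]                # stands in for a later file

scaler = fit_scaler(train)              # fitted on the first dataset only
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)   # reuses the training statistics
```

Because the test values never enter `fit_scaler`, nothing about the test set influences how the training data is scaled.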

## 3. Running an Individual Training Job

Before we launch into running hyperparameter tuning jobs, let's train an individual model and evaluate it to get a (poor) baseline.

In [5]:
ampopt.eval_score(
    epochs=10,
    train_fname="../data/oc20_50k_alex.lmdb",
    valid_fname="../data/oc20_300_test.traj",
    dropout_rate=0.,
    lr=1e-3,
    gamma=1.,
    num_nodes=5,
    num_layers=5,
)

Loading validation data labels...


Results saved to ./checkpoints/2022-04-19-15-50-11-16322358-7c5c-4945-b592-e4a998f2c7d6


loading from /storage/home/hpaceice1/amckenzie9/bdqm-hyperparam-tuning/data/oc20_50k_alex.lmdb: 100%|██████████| 50000/50000 [00:11<00:00, 4467.90 images/s]


Loading dataset: 50000 images
Use Xavier initialization
Loading model: 291 parameters
Loading skorch trainer
  epoch    train_energy_mae    train_loss    cp      lr     dur
-------  ------------------  ------------  ----  ------  ------
      1            288.7328        0.6193     +  0.0010  1.8943
      2            151.0952        0.3248     +  0.0010  1.6754
      3            147.0156        0.3158     +  0.0010  1.6832
      4            144.9110        0.3114     +  0.0010  1.6841
      5            142.7045        0.3065     +  0.0010  1.6727
      6            141.7071        0.3045     +  0.0010  1.6730
      7            139.3646        0.2996     +  0.0010  1.6710
      8            137.6498        0.2959     +  0.0010  1.6715
      9            135.7527        0.2917     +  0.0010  1.6783
     10            1

128.52952354382325

The performance of this model is poor, but that's to be expected: we only trained it for 10 epochs. We'll improve this score in the next section.
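
The score is an error, so lower is better. Assuming it is the mean absolute error (MAE) of the predicted energies, which is how the output of `ampopt.generate_report` labels it, the metric can be sketched as follows. The helper name and the numbers are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one sample")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


true_energies = [1.0, 2.0, 4.0]       # made-up reference values
predicted_energies = [1.5, 2.0, 3.0]  # made-up model outputs

mae = mean_absolute_error(true_energies, predicted_energies)
print(mae)
```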

## 4. Running Tuning Jobs on PACE

Let's first run a single tuning job to try to find the optimal number of layers and number of nodes per layer when training for just 10 epochs.

We only need to supply a single dataset; amptorch will split 10% of the data off as a validation set.

The `study` argument can be any name, as long as it does not clash with a previous study. We'll use it later to retrieve the study for analysis.

For `params`, we can pass any of the following hyperparameters:

- Learnable Parameters:
    - `num_layers`, the number of layers of the neural network
    - `num_nodes`, the number of nodes per layer
    - `dropout_rate`, the rate of [dropout](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/) during training
    - `lr`, the learning rate for gradient descent
    - `gamma`, the decay parameter for the learning rate
- Non-Learnable Parameters:
    - `step_size`, the number of epochs between learning-rate decays; at each decay the rate is scaled by `gamma`
    - `batch_size`, the size of minibatches for gradient descent
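
To make the roles of `lr`, `gamma` and `step_size` concrete, here is a minimal sketch of a step-decay schedule. It assumes the schedule behaves like PyTorch's `StepLR`, where the rate is multiplied by `gamma` every `step_size` epochs; whether amptorch uses exactly this scheduler is an assumption, and `step_decay_lr` is a hypothetical helper.

```python
def step_decay_lr(initial_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate at a given (0-indexed) epoch under step decay."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    n_decays = epoch // step_size  # number of decays applied so far
    return initial_lr * gamma ** n_decays


# gamma=1.0, as used in this notebook's examples, disables the decay entirely.
constant = [step_decay_lr(1e-3, 1.0, 10, e) for e in (0, 10, 25)]

# gamma=0.5 halves the rate every 10 epochs.
decayed = [step_decay_lr(1e-3, 0.5, 10, e) for e in (0, 10, 25)]
```

Under this assumption, fixing `gamma=1.0` amounts to training with a constant learning rate, which leaves `lr` as the only rate-related parameter to tune.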

Any learnable parameter not fixed in the `params` argument will be tuned during hyperparameter optimization. Any non-learnable parameter not fixed there will be given a default value.

The learnable and non-learnable parameters, as well as the default values in the amptorch config, are specified in `src/ampopt/train.py`. Feel free to tweak them.
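
The rule for which parameters get tuned can be sketched like this. It is not the code in `src/ampopt/train.py`: `split_params` is a hypothetical helper, and the default values for `step_size` and `batch_size` are placeholders rather than AmpOpt's actual defaults.

```python
LEARNABLE = ("num_layers", "num_nodes", "dropout_rate", "lr", "gamma")

# Placeholder defaults for illustration only (not AmpOpt's real values).
NON_LEARNABLE_DEFAULTS = {"step_size": 20, "batch_size": 256}


def split_params(fixed: dict):
    """Return (names to tune, constant values) for a dict of fixed parameters."""
    unknown = set(fixed) - set(LEARNABLE) - set(NON_LEARNABLE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Learnable parameters that were not fixed are left to the optimizer.
    tuned = [name for name in LEARNABLE if name not in fixed]
    # Fixed values override the defaults of non-learnable parameters.
    constants = {**NON_LEARNABLE_DEFAULTS, **fixed}
    return tuned, constants


tuned, constants = split_params({"dropout_rate": 0.0, "gamma": 1.0})
print(tuned)
```

Fixing `dropout_rate` and `gamma` leaves `num_layers`, `num_nodes` and `lr` to be tuned, which matches the parameters listed for `tutorial1` in the output of `ampopt.view_studies()`.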

In [15]:
ampopt.run_pace_tuning_job(
    study="tutorial1",
    trials=10,
    epochs=10,
    data="../data/oc20_50k_alex.lmdb",
    params=format_params(
        dropout_rate=0.0,
        gamma=1.0,
    ),
)

mysql running, job ID: 115629
115650.sched-pace-ice.pace.gatech.edu


We can check that our job was successfully submitted:

In [16]:
ampopt.view_jobs()

    id   username    queue             name sessid nds tsk memory     time status  elapsed             node
115629 amckenzie9 pace-ice            mysql 139625   1   1     -- 08:00:00      R 06:01:16 atl1-1-02-009-31
115643 amckenzie9 pace-ice pace-jupyter-not 186065   1   1     -- 03:00:00      R 00:38:16 atl1-1-02-009-31
115650 amckenzie9 pace-ice tune-amptorch-hy     --   1   1    2gb 02:00:00      Q       --               --


The three jobs are as follows:

- The first job, `mysql`, is running MySQL.
- The second job, `pace-jupyter-not`, is running the Jupyter notebook instance.
- The third job, `tune-amptorch-hy`, is the tuning job we just triggered.

Once the job is finished, it will disappear from the output of `ampopt.view_jobs()`. It generates two log files, one for stdout and one for stderr. It's worth checking them to verify that the job completed successfully.

We can load the study as follows:

In [10]:
tutorial1 = get_study("tutorial1")

Let's take a quick look at the trials we ran:

In [11]:
tutorial1.trials_dataframe()

| | number | value | datetime_start | datetime_complete | duration | params_lr | params_num_layers | params_num_nodes | state |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 120.043 | 2022-04-19 16:04:11 | 2022-04-19 16:05:15 | 0 days 00:01:04 | 6.4e-05 | 17 | 26 | COMPLETE |
| 1 | 1 | 119.026 | 2022-04-19 16:05:15 | 2022-04-19 16:05:45 | 0 days 00:00:30 | 0.000516 | 9 | 17 | COMPLETE |
| 2 | 2 | 136.51 | 2022-04-19 16:05:45 | 2022-04-19 16:06:20 | 0 days 00:00:35 | 1.4e-05 | 17 | 17 | COMPLETE |
| 3 | 3 | 109.859 | 2022-04-19 16:06:20 | 2022-04-19 16:06:49 | 0 days 00:00:29 | 0.005522 | 8 | 11 | COMPLETE |
| 4 | 4 | 107.0 | 2022-04-19 16:06:49 | 2022-04-19 16:07:22 | 0 days 00:00:33 | 0.005742 | 14 | 27 | COMPLETE |
| 5 | 5 | 113.928 | 2022-04-19 16:07:22 | 2022-04-19 16:07:51 | 0 days 00:00:29 | 0.000646 | 7 | 23 | COMPLETE |
| 6 | 6 | | 2022-04-19 16:07:51 | NaT | NaT | 7e-05 | 14 | 17 | RUNNING |
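
As a quick sanity check, we can pick out the best completed trial from the rows above. The sketch below copies a subset of the columns into a CSV string and uses only the standard library; it assumes that a lower value is better, which holds if the score is an error such as MAE. In a live notebook you would work with the DataFrame returned by `trials_dataframe()` directly.

```python
import csv
import io

# Subset of the columns shown above, copied by hand.
TRIALS_CSV = """number,value,params_lr,params_num_layers,params_num_nodes,state
0,120.043,6.4e-05,17,26,COMPLETE
1,119.026,0.000516,9,17,COMPLETE
2,136.51,1.4e-05,17,17,COMPLETE
3,109.859,0.005522,8,11,COMPLETE
4,107.0,0.005742,14,27,COMPLETE
5,113.928,0.000646,7,23,COMPLETE
6,,7e-05,14,17,RUNNING
"""

rows = list(csv.DictReader(io.StringIO(TRIALS_CSV)))

# Trials that are still running have no value yet, so skip them.
completed = [row for row in rows if row["state"] == "COMPLETE"]

# Assumption: the objective is minimized.
best = min(completed, key=lambda row: float(row["value"]))
print(best["number"], best["value"], best["params_num_layers"], best["params_num_nodes"])
```

Among these trials, the two lowest scores both used the largest learning rates sampled, though a handful of trials is far too few to draw conclusions from.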


## 5. Parallel Tuning Jobs

Of course, for optimizing over a large hyperparameter search space, we will want to parallelize our jobs. Doing this with AmpOpt and PACE is easy: simply run `ampopt.run_pace_tuning_job()` several times. For example:

In [12]:
for _ in range(5):
    ampopt.run_pace_tuning_job(
        study="tutorial2",
        trials=20,
        epochs=100,
        data="../data/oc20_50k_alex.lmdb",
    )

mysql running, job ID: 115629
115645.sched-pace-ice.pace.gatech.edu
mysql running, job ID: 115629
115646.sched-pace-ice.pace.gatech.edu
mysql running, job ID: 115629
115647.sched-pace-ice.pace.gatech.edu
mysql running, job ID: 115629
115648.sched-pace-ice.pace.gatech.edu
mysql running, job ID: 115629
115649.sched-pace-ice.pace.gatech.edu


## 6. Reports and Summaries

To get a summary of all studies currently in the database, run:

In [26]:
ampopt.view_studies()

Study 50K-alex-with-lr-and-gamma:
  Params:
    - gamma
    - lr
    - num_layers
    - num_nodes
  Best score: 68.9657
  Num trials: 150
Study cmaes-oc20-3k:
  Params:
    - lr
    - num_layers
    - num_nodes
  Best score: 91.3529
  Num trials: 60
Study random-oc20-3k:
  Params:
    - lr
    - num_layers
    - num_nodes
  Best score: 90.7533
  Num trials: 60
Study tpe-oc20-3k:
  Params:
    - lr
    - num_layers
    - num_nodes
  Best score: 91.9822
  Num trials: 60
Study tutorial1:
  Params:
    - lr
    - num_layers
    - num_nodes
  Best score: 106.676
  Num trials: 4


Once a particular study is loaded into memory, you can visualize it with the plotting functions in `optuna.visualization.matplotlib`.

AmpOpt provides a single function for generating several interesting plots:

In [25]:
ampopt.generate_report("tutorial1")

[W 2022-04-19 16:15:17,077] Param num_nodes unique value length is less than 2.


Best params: {'lr': 0.00112022, 'num_layers': 17, 'num_nodes': 21} with MAE 111.756
Report saved to /storage/home/hpaceice1/amckenzie9/bdqm-hyperparam-tuning/report/tutorial1


You can then view the generated plots in the `report` folder of the project root.

Finally, perhaps you have run some experiments that aren't useful, and you'd like to clean up the list of studies. Run:

In [14]:
ampopt.delete_studies("tutorial1", "tutorial2")

Deleted study tutorial1.
Deleted study tutorial2.


In [None]:
ampopt.tune(
    jobs=5,
    study="50K-alex-local",
    trials=5,
    epochs=100,
    data="../data/oc20_50k_alex.lmdb",
)

[I 2022-04-19 03:26:19,472] A new study created in RDB with name: 50K-alex


Running hyperparam tuning with:
 - study_name: 50K-alex
 - dataset: ../data/oc20_50k_alex.lmdb
 - n_trials: 2
 - sampler: CmaEs
 - pruner: Median
 - num epochs: 10
 - params:
   - dropout_rate: 0.0
   - gamma: 1.0
   - lr: 0.001
Results saved to ./checkpoints/2022-04-19-03-26-19-13e6986f-ff7c-4686-a28b-43276c760144


loading from /storage/home/hpaceice1/amckenzie9/bdqm-hyperparam-tuning/data/oc20_50k_alex.lmdb: 100%|██████████| 50000/50000 [00:11<00:00, 4465.65 images/s]


Loading dataset: 50000 images
Use Xavier initialization
Loading model: 6085 parameters
Loading skorch trainer
  epoch    train_energy_mae    train_loss    val_energy_mae    valid_loss    cp      lr     dur
-------  ------------------  ------------  ----------------  ------------  ----  ------  ------
      1            225.0493        0.4833          134.0000        0.2880     +  0.0010  2.4851
      2            137.3736        0.2951          142.5463        0.3062        0.0010  2.2722
      3            134.1583        0.2883          140.4345        0.3021        0.0010  2.2691
      4            132.0226        0.2837          125.1809        0.2691     +  0.0010  2.2657
      5            127.5917        0.2742          120.0584        0.2583     +  0.0010  2.2666
      6            128.2016        0.2755          123.3189        0.2651

[I 2022-04-19 03:26:57,810] Trial 0 finished with value: 113.89310749326707 and parameters: {'num_layers': 17, 'num_nodes': 18}. Best is trial 0 with value: 113.893.

Results saved to ./checkpoints/2022-04-19-03-26-57-dcaeb72b-5c50-4caa-8823-3269af3ab316


loading from /storage/home/hpaceice1/amckenzie9/bdqm-hyperparam-tuning/data/oc20_50k_alex.lmdb: 100%|██████████| 50000/50000 [00:10<00:00, 4585.89 images/s]


Loading dataset: 50000 images
Use Xavier initialization
Loading model: 8001 parameters
Loading skorch trainer
  epoch    train_energy_mae    train_loss    val_energy_mae    valid_loss    cp      lr     dur
-------  ------------------  ------------  ----------------  ------------  ----  ------  ------
      1            190.3676        0.4089          139.4427        0.2996     +  0.0010  2.0187
      2            134.3880        0.2887          130.9475        0.2816     +  0.0010  2.0094
      3            132.1665        0.2840          127.2361        0.2735     +  0.0010  2.0090
      4            129.9690        0.2793          126.0410        0.2711     +  0.0010  2.0163
      5            129.0102        0.2772          121.9872        0.2623     +  0.0010  2.0125
      6            125.5399

[I 2022-04-19 03:27:31,342] Trial 1 finished with value: 111.71088543808824 and parameters: {'num_layers': 12, 'num_nodes': 25}. Best is trial 1 with value: 111.711.


In [None]:
ampopt.delete_studies("50K-alex-local")

Setting e.g. `jobs=2` in `ampopt.tune` would run 2 processes,
but on PACE it's more efficient to submit several jobs with `ampopt.run_pace_tuning_job()` instead, as shown in Section 5.