Tune preference comparison example hyperparameters #771

Closed

Conversation

@timokau (Contributor) commented Aug 21, 2023

Description

This PR changes the hyperparameters of the preference comparisons example to values that result in much more reliable training. The main point of discussion is how we should handle the examples in the notebooks. I have left them unchanged for now.

See the commit message (included here for convenience) for details.

Commit Message

The preference comparison example previously did not show significant learning. It usually ended with a reward < -1000, which can be considered "failed" in the Pendulum environment. This commit updates the parameters to avoid this. It could be argued that hyperparameter optimization for the examples is bad, since it gives a skewed impression of the library. I think this is okay as long as we acknowledge that the parameters were optimized, and it is much nicer to have a working example as a starting point.

I tuned the hyperparameters with a mix of syne_tune [1] and manual tuning. Since training can have very high variance, I repeated each training run multiple times (up to 100) and used multi-fidelity optimization (PASHA and ASHA) to find a good configuration. I set the objective to the 90% upper confidence bound of the mean final-evaluation reward over all the training runs.
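
To make that objective concrete, here is a minimal sketch of how such a bound could be computed from the per-run final rewards. The helper name and the normal approximation are my own illustration, not the actual tuning code:

```py
import numpy as np

def reward_ucb_90(final_rewards):
    """90% upper confidence bound of the mean final-evaluation reward.

    Uses a normal approximation across repeated training runs; this helper
    only illustrates the objective described above.
    """
    rewards = np.asarray(final_rewards, dtype=float)
    mean = rewards.mean()
    # Standard error of the mean over the repeated runs.
    sem = rewards.std(ddof=1) / np.sqrt(len(rewards))
    # One-sided 90% quantile of the standard normal distribution.
    return mean + 1.2816 * sem
```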

Unfortunately the optimization process was a bit messy since I was just getting started with syne_tune, so it is difficult to provide a full script to cleanly reproduce the results. I used something akin to this configuration space:

```py
import syne_tune.config_space as cs

config_space = {
    "reward_epochs": cs.randint(1, 20),
    "ppo_clip_range": cs.uniform(0.0, 0.3),
    "ppo_ent_coef": cs.uniform(0.0, 0.01),
    "ppo_gae_lambda": cs.uniform(0.9, 0.99),
    "ppo_n_epochs": cs.randint(5, 25),
    "discount_factor": cs.uniform(0.9, 1.0),
    "use_sde": cs.choice(["true", "false"]),
    "sde_sample_freq": cs.randint(1, 5),
    "ppo_lr": cs.loguniform(1e-4, 5e-3),
    "exploration_frac": cs.uniform(0, 0.1),
    "num_iterations": cs.randint(5, 100),
    "initial_comparison_frac": cs.uniform(0.05, 0.25),
    "initial_epoch_multiplier": cs.randint(1, 4),
    "query_schedule": cs.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_timesteps": 50_000,
    "total_comparisons": 200,
    "max_evals": 100,
}
```

and the configuration I selected in the end is this one:

```py
{
    "reward_epochs": 10,
    "ppo_clip_range": 0.1,
    "ppo_ent_coef": 0.01,
    "ppo_gae_lambda": 0.90,
    "ppo_n_epochs": 15,
    "discount_factor": 0.97,
    "use_sde": "false",
    "sde_sample_freq": 1,
    "ppo_lr": 2e-3,
    "exploration_frac": 0.05,
    "num_iterations": 60,
    "initial_comparison_frac": 0.10,
    "initial_epoch_multiplier": 4,
    "query_schedule": "hyperbolic",
}
```
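
For orientation, the `ppo_*`, `discount_factor`, `use_sde`, and `sde_sample_freq` entries correspond to standard stable-baselines3 `PPO` constructor arguments. A rough sketch of that mapping follows; the keyword names on the `PPO` side are real SB3 arguments, but the wiring here is illustrative rather than the exact example code, and `venv` is a stand-in for the example's vectorized Pendulum environment:

```py
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

venv = make_vec_env("Pendulum-v1", n_envs=4)  # stand-in for the example's venv

agent = PPO(
    policy="MlpPolicy",
    env=venv,
    learning_rate=2e-3,   # ppo_lr
    clip_range=0.1,       # ppo_clip_range
    ent_coef=0.01,        # ppo_ent_coef
    gae_lambda=0.90,      # ppo_gae_lambda
    n_epochs=15,          # ppo_n_epochs
    gamma=0.97,           # discount_factor
    use_sde=False,        # use_sde
    sde_sample_freq=1,    # sde_sample_freq (only used when use_sde=True)
)
```

The remaining entries (reward_epochs, num_iterations, exploration_frac, initial_comparison_frac, initial_epoch_multiplier, query_schedule) configure the preference-comparisons side of the example rather than PPO itself.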

Here are the (rounded) evaluation results of the 100 runs of the configuration:

```
[ -155,  -100,  -132,  -150,  -164,  -110,  -195,  -194,  -168,
  -148,  -177,  -113,  -176,  -205,  -106,  -169,  -123,  -104,
  -151,  -169,  -157,  -184,  -130,  -151,  -108,  -111,  -202,
  -142,  -198,  -138,  -178,  -104,  -174,  -149,  -113,  -107,
  -122,  -198,  -428,  -221,  -217,  -141,  -192,  -158,  -139,
  -219,  -230,  -209,  -141,  -173,  -118,  -176,  -108,  -290,
  -810,  -182,  -159,  -178,  -247,  -205,  -165,  -672,  -250,
  -138,  -166,  -282,  -133,  -147,  -111,  -145,  -148,  -116,
  -436,  -140,  -190,  -137,  -194,  -177,  -193, -1043,  -243,
  -183,  -156,  -183,  -184,  -186,  -141,  -144,  -194,  -112,
  -178,  -146,  -140,  -130,  -143,  -618,  -402,  -236,  -171,
  -163]
```

Mean (before rounding): -196.49
Fraction of runs < -800:  2/100
Fraction of runs > -200: 79/100
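
These summary numbers can be recomputed with a couple of lines of numpy; in this sketch, `rewards` stands in for the full list of 100 values printed above (only the first row is repeated here):

```py
import numpy as np

# Stand-in for the full list of 100 evaluation returns shown above.
rewards = np.array([-155, -100, -132, -150, -164, -110, -195, -194, -168])

print(f"Mean: {rewards.mean():.2f}")
print(f"Runs < -800: {(rewards < -800).sum()}/{len(rewards)}")
print(f"Runs > -200: {(rewards > -200).sum()}/{len(rewards)}")
```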

This is far from perfect. I didn't include all parameters in the optimization. The 50,000 steps and 200 queries are likely overkill. Still, it significantly improves the example that users see first.

I only changed the example on the main documentation page, not the notebooks. Those are already out of sync with the main example, so I am not sure how best to proceed with them.

[1] https://github.com/awslabs/syne-tune

Testing

I trained the agent 100 times with the updated configuration and reported the results above.

@michalzajac-ml (Contributor) left a comment

Thank you so much for contributing to imitation, @timokau!
The PR looks nice overall; I left a couple of small comments.
Additionally, could you please modify the notebook tutorial (docs/tutorials/5_train_preference_comparisons.ipynb) to match these settings as well? We want to reach reasonable performance in the tutorials too.

```py
# initial_epoch_multiplier, query_schedule) used in this example have been
# approximately fine-tuned to reach a reasonable initial experience. It's
# worth noting that we did not optimize all parameters, and those we did
# optimize may not be optimal.
```

For brevity, I'd suggest skipping this comment in the .rst doc and putting it inside the notebook instead.

```py
    rng=rng,
)

querent = preference_comparisons.PreferenceQuerent()
```

What is this line? Should it be removed?

```diff
 pref_comparisons = preference_comparisons.PreferenceComparisons(
     trajectory_generator,
     reward_net,
-    num_iterations=5,
+    num_iterations=60,
```

Could you change it back to 5 and add a comment like "Set to 60 for better performance"? The reason is that we want this example to run as fast as possible (the .rst docs are included in automated tests).
In the notebook, we can just use 60.
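
A sketch of what that could look like in the .rst example (the comment wording is the reviewer's suggestion):

```py
num_iterations=5,  # Set to 60 for better performance
```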

```diff
 )
-pref_comparisons.train(total_timesteps=5_000, total_comparisons=200)
+pref_comparisons.train(total_timesteps=50_000, total_comparisons=200)
```

```py
reward, _ = evaluate_policy(agent.policy, venv, 10)
print("Reward:", reward)
```

Can we report the mean +/- std here instead?
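
For reference, stable-baselines3's `evaluate_policy` already returns the mean and standard deviation by default, so the change could look roughly like this (a sketch, not the final code):

```py
from stable_baselines3.common.evaluation import evaluate_policy

reward_mean, reward_std = evaluate_policy(agent.policy, venv, 10)
print(f"Reward: {reward_mean:.0f} +/- {reward_std:.0f}")
```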

@AdamGleave (Member) commented

Thanks for the contribution @timokau. Do let us know if you need any pointers or clarification on how to port this over to the notebook. The hyperparameters for the example and the notebook can be the same (apart from the total number of training timesteps, which, as @zajaczajac mentioned, can be higher in the notebook than in the examples). It's fine if the notebook includes additional code (like visualizing results), as it's intended to be more fully-featured than the examples in the *.rst files.

AdamGleave pushed a commit that referenced this pull request Sep 12, 2023
…#782)

* Tune preference comparison example hyperparameters

* Add changes to notebook

* Change number notation in cell.

* clear outputs from notebook

* remove empty code cell

* fix variable name in preference_comparison

* Run black

* remove whitespace

---------

Co-authored-by: Timo Kaufmann <timokau@zoho.com>

@AdamGleave (Member) commented

This has now been merged in #782, which incorporates this change and adds it to the notebook. Thanks @timokau for the contribution!

@AdamGleave closed this Sep 12, 2023

@timokau (Contributor, Author) commented Sep 13, 2023

Thanks a lot @AdamGleave, @lukasberglund and @zajaczajac! Glad to see the changes polished up and merged. Things are a bit busy right now and it would have taken me a couple more days to get back to this.
