Tune preference comparison example hyperparameters #771

Closed

Conversation

@timokau (Contributor) commented Aug 21, 2023

Description

This PR changes the hyperparameters of the preference comparisons example to values that result in much more reliable training. The main point of discussion is how we should handle the examples in the notebooks. I have left them unchanged for now.

See the commit message (included here for convenience) for details.

Commit Message

The preference comparison example previously did not show significant learning. It usually ended with a reward < -1000, which can be considered "failed" in the Pendulum environment. This commit updates the parameters to avoid this. It could be argued that hyperparameter optimization for the examples is bad, since it gives a skewed impression of the library. I think this is okay as long as we acknowledge that the parameters were optimized, and it is much nicer to have a working example as a starting point.

I tuned the hyperparameters with a mix of syne_tune [1] and manual tuning. Since training can have very high variance, I repeated each training run multiple times (up to 100) and used multi-fidelity optimization (PASHA and ASHA) to find a good configuration. I set the objective to the 90% upper confidence bound of the mean final-evaluation reward over all the training runs.
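
To make that objective concrete, here is a minimal sketch of how such a bound could be computed from the per-run final rewards. The helper name and the normal approximation are my own illustration, not the actual tuning code:

```py
import numpy as np

def reward_ucb_90(final_rewards):
    """90% upper confidence bound of the mean final-evaluation reward.

    Uses a normal approximation across repeated training runs; this helper
    only illustrates the objective described above.
    """
    rewards = np.asarray(final_rewards, dtype=float)
    mean = rewards.mean()
    # Standard error of the mean over the repeated runs.
    sem = rewards.std(ddof=1) / np.sqrt(len(rewards))
    # One-sided 90% quantile of the standard normal distribution.
    return mean + 1.2816 * sem
```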

Unfortunately the optimization process was a bit messy since I was just getting started with syne_tune, so it is difficult to provide a full script to cleanly reproduce the results. I used something akin to this configuration space:

```py
import syne_tune.config_space as cs

config_space = {
    "reward_epochs": cs.randint(1, 20),
    "ppo_clip_range": cs.uniform(0.0, 0.3),
    "ppo_ent_coef": cs.uniform(0.0, 0.01),
    "ppo_gae_lambda": cs.uniform(0.9, 0.99),
    "ppo_n_epochs": cs.randint(5, 25),
    "discount_factor": cs.uniform(0.9, 1.0),
    "use_sde": cs.choice(["true", "false"]),
    "sde_sample_freq": cs.randint(1, 5),
    "ppo_lr": cs.loguniform(1e-4, 5e-3),
    "exploration_frac": cs.uniform(0, 0.1),
    "num_iterations": cs.randint(5, 100),
    "initial_comparison_frac": cs.uniform(0.05, 0.25),
    "initial_epoch_multiplier": cs.randint(1, 4),
    "query_schedule": cs.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_timesteps": 50_000,
    "total_comparisons": 200,
    "max_evals": 100,
}
```

and the configuration I selected in the end is this one:

```py
{
    "reward_epochs": 10,
    "ppo_clip_range": 0.1,
    "ppo_ent_coef": 0.01,
    "ppo_gae_lambda": 0.90,
    "ppo_n_epochs": 15,
    "discount_factor": 0.97,
    "use_sde": "false",
    "sde_sample_freq": 1,
    "ppo_lr": 2e-3,
    "exploration_frac": 0.05,
    "num_iterations": 60,
    "initial_comparison_frac": 0.10,
    "initial_epoch_multiplier": 4,
    "query_schedule": "hyperbolic",
}
```
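
For orientation, the `ppo_*`, `discount_factor`, `use_sde`, and `sde_sample_freq` entries correspond to standard stable-baselines3 `PPO` constructor arguments. A rough sketch of that mapping follows; the keyword names on the `PPO` side are real SB3 arguments, but the wiring here is illustrative rather than the exact example code, and `venv` is a stand-in for the example's vectorized Pendulum environment:

```py
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

venv = make_vec_env("Pendulum-v1", n_envs=4)  # stand-in for the example's venv

agent = PPO(
    policy="MlpPolicy",
    env=venv,
    learning_rate=2e-3,   # ppo_lr
    clip_range=0.1,       # ppo_clip_range
    ent_coef=0.01,        # ppo_ent_coef
    gae_lambda=0.90,      # ppo_gae_lambda
    n_epochs=15,          # ppo_n_epochs
    gamma=0.97,           # discount_factor
    use_sde=False,        # use_sde
    sde_sample_freq=1,    # sde_sample_freq (only used when use_sde=True)
)
```

The remaining entries (reward_epochs, num_iterations, exploration_frac, initial_comparison_frac, initial_epoch_multiplier, query_schedule) configure the preference-comparisons side of the example rather than PPO itself.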

Here are the (rounded) evaluation results of the 100 runs of the configuration:

```
[ -155,  -100,  -132,  -150,  -164,  -110,  -195,  -194,  -168,
  -148,  -177,  -113,  -176,  -205,  -106,  -169,  -123,  -104,
  -151,  -169,  -157,  -184,  -130,  -151,  -108,  -111,  -202,
  -142,  -198,  -138,  -178,  -104,  -174,  -149,  -113,  -107,
  -122,  -198,  -428,  -221,  -217,  -141,  -192,  -158,  -139,
  -219,  -230,  -209,  -141,  -173,  -118,  -176,  -108,  -290,
  -810,  -182,  -159,  -178,  -247,  -205,  -165,  -672,  -250,
  -138,  -166,  -282,  -133,  -147,  -111,  -145,  -148,  -116,
  -436,  -140,  -190,  -137,  -194,  -177,  -193, -1043,  -243,
  -183,  -156,  -183,  -184,  -186,  -141,  -144,  -194,  -112,
  -178,  -146,  -140,  -130,  -143,  -618,  -402,  -236,  -171,
  -163]
```

Mean (before rounding): -196.49
Fraction of runs < -800:  2/100
Fraction of runs > -200: 79/100
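
These summary numbers can be recomputed with a couple of lines of numpy; in this sketch, `rewards` stands in for the full list of 100 values printed above (only the first row is repeated here):

```py
import numpy as np

# Stand-in for the full list of 100 evaluation returns shown above.
rewards = np.array([-155, -100, -132, -150, -164, -110, -195, -194, -168])

print(f"Mean: {rewards.mean():.2f}")
print(f"Runs < -800: {(rewards < -800).sum()}/{len(rewards)}")
print(f"Runs > -200: {(rewards > -200).sum()}/{len(rewards)}")
```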

This is far from perfect. I didn't include all parameters in the optimization. The 50,000 steps and 200 queries are likely overkill. Still, it significantly improves the example that users see first.

I only changed the example on the main documentation page, not the notebooks. Those are already out of sync with the main example, so I am not sure how best to proceed with them.

[1] https://github.com/awslabs/syne-tune

Testing

I trained the agent 100 times with the updated configuration and reported the results above.

@michalzajac-ml (Contributor) left a comment

Thank you so much for contributing to imitation, @timokau!
The PR looks nice overall; I left a couple of small comments.
Additionally, could you please modify the notebook tutorial (docs/tutorials/5_train_preference_comparisons.ipynb) to match these settings as well? We want to reach reasonable performance in the tutorials too.

```py
# initial_epoch_multiplier, query_schedule) used in this example have been
# approximately fine-tuned to reach a reasonable initial experience. It's
# worth noting that we did not optimize all parameters, and those we did
# optimize may not be optimal.
```

For brevity, I'd suggest skipping this comment in the .rst doc and putting it inside the notebook instead.

```py
    rng=rng,
)

querent = preference_comparisons.PreferenceQuerent()
```

What is this line? Should it be removed?

```diff
 pref_comparisons = preference_comparisons.PreferenceComparisons(
     trajectory_generator,
     reward_net,
-    num_iterations=5,
+    num_iterations=60,
```

Could you change it back to 5 and add a comment like "Set to 60 for better performance"? The reason is that we want this example to run as fast as possible (the .rst docs are included in automated tests).
In the notebook, we can just use 60.
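
A sketch of what that could look like in the .rst example (the comment wording is the reviewer's suggestion):

```py
num_iterations=5,  # Set to 60 for better performance
```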

```diff
 )
-pref_comparisons.train(total_timesteps=5_000, total_comparisons=200)
+pref_comparisons.train(total_timesteps=50_000, total_comparisons=200)
```

```py
reward, _ = evaluate_policy(agent.policy, venv, 10)
print("Reward:", reward)
```

Can we report the mean +/- std here instead?
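
For reference, stable-baselines3's `evaluate_policy` already returns the mean and standard deviation by default, so the change could look roughly like this (a sketch, not the final code):

```py
from stable_baselines3.common.evaluation import evaluate_policy

reward_mean, reward_std = evaluate_policy(agent.policy, venv, 10)
print(f"Reward: {reward_mean:.0f} +/- {reward_std:.0f}")
```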

@AdamGleave (Member) commented

Thanks for the contribution @timokau. Do let us know if you need any pointers or clarification on how to port this over to the notebook. The hyperparameters for the example and the notebook can be the same (apart from the total number of training timesteps, which, as @zajaczajac mentioned, can be higher in the notebook than in the examples). It's fine if the notebook includes additional code (like visualizing results), as it's intended to be more fully-featured than the examples in the *.rst files.

AdamGleave pushed a commit that referenced this pull request Sep 12, 2023
…#782)

* Tune preference comparison example hyperparameters

* Add changes to notebook

* Change number notation in cell.

* clear outputs from notebook

* remove empty code cell

* fix variable name in preference_comparison

* Run black

* remove whitespace

---------

Co-authored-by: Timo Kaufmann <timokau@zoho.com>

@AdamGleave (Member) commented

This has now been merged in #782, which incorporates this change and adds it to the notebook. Thanks @timokau for the contribution!

@AdamGleave closed this Sep 12, 2023

@timokau (Contributor, Author) commented Sep 13, 2023

Thanks a lot @AdamGleave, @lukasberglund and @zajaczajac! Glad to see the changes polished up and merged. Things are a bit busy right now and it would have taken me a couple more days to get back to this.
