Tune preference comparison example hyperparameters #771
Conversation
The preference comparison example previously did not show significant learning. It usually ended with a reward < -1000, which can be considered "failed" in the Pendulum environment. This commit updates the parameters to avoid this.

It could be argued that hyperparameter optimization for the examples is bad, since it gives a skewed impression of the library. I think as long as we acknowledge that the parameters were optimized this is okay though, and it is much nicer if we have a working example as a starting point.

I have tuned the hyperparameters with a mix of syne_tune [1] and manual tuning. Since the training can have very high variance, I repeated each training run multiple (up to 100) times and used multi-fidelity optimization (PASHA and ASHA) to find a good configuration. I set the objective to the 90% upper-confidence-bound of the mean final-evaluation reward over all the training runs.

Unfortunately the optimization process was a bit messy since I was just getting started with syne_tune, so it is difficult to provide a full script to cleanly reproduce the results.
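For concreteness, a minimal sketch of what such an objective could look like (this is an illustration, not the actual tuning code; the helper name and the exact confidence-bound formula are assumptions on our part):

```python
import math

# One-sided 90% normal quantile; an assumption about how the bound was formed.
Z_90 = 1.2816

def ucb_objective(final_rewards):
    """90% upper confidence bound of the mean final-evaluation reward.

    Takes the final evaluation reward of each repeated training run and
    returns mean + z * standard_error, rewarding configurations whose
    mean is high even after accounting for run-to-run variance.
    """
    n = len(final_rewards)
    mean = sum(final_rewards) / n
    # Sample variance of the per-run rewards.
    var = sum((r - mean) ** 2 for r in final_rewards) / (n - 1)
    return mean + Z_90 * math.sqrt(var / n)
```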
I used something akin to this configuration space:

```py
import syne_tune.config_space as cs

config_space = {
    "reward_epochs": cs.randint(1, 20),
    "ppo_clip_range": cs.uniform(0.0, 0.3),
    "ppo_ent_coef": cs.uniform(0.0, 0.01),
    "ppo_gae_lambda": cs.uniform(0.9, 0.99),
    "ppo_n_epochs": cs.randint(5, 25),
    "discount_factor": cs.uniform(0.9, 1.0),
    "use_sde": cs.choice(["true", "false"]),
    "sde_sample_freq": cs.randint(1, 5),
    "ppo_lr": cs.loguniform(1e-4, 5e-3),
    "exploration_frac": cs.uniform(0, 0.1),
    "num_iterations": cs.randint(5, 100),
    "initial_comparison_frac": cs.uniform(0.05, 0.25),
    "initial_epoch_multiplier": cs.randint(1, 4),
    "query_schedule": cs.choice(["constant", "hyperbolic", "inverse_quadratic"]),
    "total_timesteps": 50_000,
    "total_comparisons": 200,
    "max_evals": 100,
}
```

and the configuration I selected in the end is this one:

```py
{
    "reward_epochs": 10,
    "ppo_clip_range": 0.1,
    "ppo_ent_coef": 0.01,
    "ppo_gae_lambda": 0.90,
    "ppo_n_epochs": 15,
    "discount_factor": 0.97,
    "use_sde": "false",
    "sde_sample_freq": 1,
    "ppo_lr": 2e-3,
    "exploration_frac": 0.05,
    "num_iterations": 60,
    "initial_comparison_frac": 0.10,
    "initial_epoch_multiplier": 4,
    "query_schedule": "hyperbolic",
}
```

Here are the (rounded) evaluation results of the 100 runs of the configuration:

```
[ -155, -100, -132, -150, -164, -110, -195, -194, -168,  -148,
  -177, -113, -176, -205, -106, -169, -123, -104, -151,  -169,
  -157, -184, -130, -151, -108, -111, -202, -142, -198,  -138,
  -178, -104, -174, -149, -113, -107, -122, -198, -428,  -221,
  -217, -141, -192, -158, -139, -219, -230, -209, -141,  -173,
  -118, -176, -108, -290, -810, -182, -159, -178, -247,  -205,
  -165, -672, -250, -138, -166, -282, -133, -147, -111,  -145,
  -148, -116, -436, -140, -190, -137, -194, -177, -193, -1043,
  -243, -183, -156, -183, -184, -186, -141, -144, -194,  -112,
  -178, -146, -140, -130, -143, -618, -402, -236, -171,  -163]
```

Mean (before rounding): -196.49
Fraction of runs < -800: 2/100
Fraction of runs > -200: 79/100

This is far from perfect. I didn't include all parameters in the optimization. The 50,000 steps and 200 queries are likely overkill. Still, it significantly improves the example that users see first.

I only changed the example on the main documentation page, not the notebooks. Those are already out of sync with the main example, so I am not sure how best to proceed with them.

[1] https://github.com/awslabs/syne-tune
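The summary statistics reported above are easy to recompute from the raw results. A small stdlib sketch (the function name is ours, and the sample below is only an illustrative subset of the 100 reported values):

```python
def run_stats(results):
    """Summarize final-evaluation rewards across repeated training runs."""
    n = len(results)
    return {
        "mean": sum(results) / n,
        # Fraction of near-failure runs (< -800) and clearly-learning runs (> -200),
        # the two thresholds used in the report above.
        "frac_below_800": sum(1 for r in results if r < -800) / n,
        "frac_above_200": sum(1 for r in results if r > -200) / n,
    }

# Illustrative subset of the reported results, not the full 100 runs:
sample = [-155, -100, -132, -1043, -810, -104]
```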
Thank you so much for contributing to imitation, @timokau !
The PR looks nice overall, I left a couple of small comments.
Additionally, could you please modify the notebook tutorial (docs/tutorials/5_train_preference_comparisons.ipynb) to match these settings as well? We want to reach reasonable performance in the tutorials too.
```diff
+# initial_epoch_multiplier, query_schedule) used in this example have been
+# approximately fine-tuned to reach a reasonable initial experience. It's
+# worth noting that we did not optimize all parameters; those we did optimize
+# may not be optimal.
```
For brevity, I'd suggest skipping this comment in the .rst doc and considering putting it inside the notebook instead.
```diff
     rng=rng,
 )

+querent = preference_comparisons.PreferenceQuerent()
```
What is this line? Should it be removed?
```diff
 pref_comparisons = preference_comparisons.PreferenceComparisons(
     trajectory_generator,
     reward_net,
-    num_iterations=5,
+    num_iterations=60,
```
Could you set it back to 5 and add a comment like "Set to 60 for better performance"? The reason is that we want this example to run as fast as possible (the .rst docs are included in automated tests).
In the notebook, we can have just 60.
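Concretely, the suggestion amounts to something like the following sketch (not runnable on its own; the remaining constructor arguments are elided):

```py
pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    num_iterations=5,  # Set to 60 for better performance
    # ...remaining arguments as before...
)
```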
```diff
 )
-pref_comparisons.train(total_timesteps=5_000, total_comparisons=200)
+pref_comparisons.train(total_timesteps=50_000, total_comparisons=200)

 reward, _ = evaluate_policy(agent.policy, venv, 10)
 print("Reward:", reward)
```
Can we report mean +/- std here instead?
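For what it's worth, stable-baselines3's `evaluate_policy` already returns the standard deviation as its second return value by default, so the example only needs to stop discarding it. If you have per-episode returns instead (e.g. via `return_episode_rewards=True`), formatting the summary is a stdlib one-liner; a minimal sketch (the helper name is ours):

```python
import statistics

def format_reward(episode_rewards):
    """Format a list of per-episode returns as 'mean +/- std'."""
    mean = statistics.mean(episode_rewards)
    # Population std; use statistics.stdev for the sample std instead.
    std = statistics.pstdev(episode_rewards)
    return f"Reward: {mean:.1f} +/- {std:.1f}"
```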
Thanks for the contribution @timokau. Do let us know if you need any pointers/clarification on how to port this over to the notebook. The hyperparameters for the example & notebook can be the same (apart from the total number of training timesteps, which, as @zajaczajac mentioned, can be higher in the notebook than in the examples). It's fine if the notebook includes additional code (like visualizing results), as it's intended to be more fully-featured than the examples.
…#782)

* Tune preference comparison example hyperparameters (commit message as reproduced above)
* Add changes to notebook
* Change number notation in cell
* Clear outputs from notebook
* Remove empty code cell
* Fix variable name in preference_comparison
* Run black
* Remove whitespace

Co-authored-by: Timo Kaufmann <timokau@zoho.com>
Thanks a lot @AdamGleave, @lukasberglund and @zajaczajac! Glad to see the changes polished up and merged. It's a bit busy right now, and it would have taken me a couple more days to get back to this.
Description
This PR changes the hyperparameters of the preference comparisons example to values that result in much more reliable training. The main point of discussion is how we should handle the examples in the notebooks. I have left them unchanged for now.
See the commit message (reproduced in the conversation above) for details.
Testing
I trained the agent 100 times with the updated configuration and reported the results above.