Adding reward ensembles and conservative reward functions #460

Merged: 72 commits into master on Jul 28, 2022

Conversation

@levmckinney (Collaborator) commented on Jul 5, 2022

This pull request creates a base class for reward functions that keep track of their epistemic uncertainty, provides an ensemble-based implementation of it, and adds a conservative reward wrapper.
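As a rough illustration of that design (a minimal, self-contained sketch only: ToyRewardEnsemble, ConservativeReward, and all parameters below are invented for this example and are not the classes added in this PR), an ensemble whose member disagreement estimates epistemic uncertainty, plus a wrapper that penalizes that disagreement, might look like this:

```python
import torch
import torch.nn as nn


class ToyRewardEnsemble(nn.Module):
    """Illustrative ensemble of small MLP reward models over (obs, act) pairs."""

    def __init__(self, obs_dim: int, act_dim: int, n_members: int = 5):
        super().__init__()
        self.members = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(obs_dim + act_dim, 64),
                    nn.ReLU(),
                    nn.Linear(64, 1),
                )
                for _ in range(n_members)
            ]
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor):
        x = torch.cat([obs, act], dim=-1)
        rewards = torch.stack([m(x).squeeze(-1) for m in self.members], dim=0)
        # The mean is the reward estimate; the std across members is a crude
        # proxy for epistemic uncertainty.
        return rewards.mean(dim=0), rewards.std(dim=0)


class ConservativeReward(nn.Module):
    """Illustrative wrapper: pessimistic reward = ensemble mean - penalty * std."""

    def __init__(self, ensemble: ToyRewardEnsemble, penalty: float = 1.0):
        super().__init__()
        self.ensemble = ensemble
        self.penalty = penalty

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        mean, std = self.ensemble(obs, act)
        return mean - self.penalty * std


# Usage: pessimistic rewards for a batch of 8 transitions.
ensemble = ToyRewardEnsemble(obs_dim=4, act_dim=2)
conservative = ConservativeReward(ensemble, penalty=0.5)
obs, act = torch.randn(8, 4), torch.randn(8, 2)
print(conservative(obs, act).shape)  # torch.Size([8])
```

The penalty coefficient in such a wrapper trades off pessimism against fidelity to the ensemble's mean reward estimate.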

TODO:

  • Implement seeding correctly for the ensemble.
  • Add the ensemble reward function to make_reward.
  • Add the conservative reward wrapper.
  • Add an option for training with a reward ensemble to train_preference_comparisons.py.
  • Write tests for ensemble creation.
  • Write tests for evaluating the ensemble.
  • Write tests for saving and loading the ensemble.
  • Write tests for the conservative reward wrapper.
  • Write the reward ensemble trainer in preference_comparisons.py.
  • Write tests for the reward ensemble.
  • Add logging for the reward standard deviation.
  • Add an option to make the reward conservative to train_rl.py.

codecov bot commented on Jul 5, 2022

Codecov Report

Merging #460 (64c4220) into master (83dfc39) will increase coverage by 0.14%.
The diff coverage is 99.76%.

@@            Coverage Diff             @@
##           master     #460      +/-   ##
==========================================
+ Coverage   96.67%   96.82%   +0.14%     
==========================================
  Files          80       82       +2     
  Lines        6775     7127     +352     
==========================================
+ Hits         6550     6901     +351     
- Misses        225      226       +1     
Impacted Files Coverage Δ
src/imitation/scripts/common/reward.py 98.64% <97.05%> (-1.36%) ⬇️
src/imitation/algorithms/preference_comparisons.py 98.95% <100.00%> (+0.18%) ⬆️
src/imitation/rewards/reward_nets.py 97.07% <100.00%> (+1.45%) ⬆️
src/imitation/rewards/serialize.py 100.00% <100.00%> (ø)
...ion/scripts/config/train_preference_comparisons.py 84.72% <100.00%> (+0.21%) ⬆️
src/imitation/scripts/config/train_rl.py 77.63% <100.00%> (+0.29%) ⬆️
.../imitation/scripts/train_preference_comparisons.py 98.21% <100.00%> (+0.03%) ⬆️
src/imitation/scripts/train_rl.py 100.00% <100.00%> (ø)
src/imitation/testing/reward_nets.py 100.00% <100.00%> (ø)
tests/algorithms/test_preference_comparisons.py 100.00% <100.00%> (ø)
... and 3 more


(Several review comment threads on src/imitation/rewards/reward_nets.py, now outdated and resolved.)
@AdamGleave (Member) commented:

Finished the initial review now. Seems like a reasonable design. I left some more detailed comments inline. Please request re-review when they are addressed.

@levmckinney (Collaborator, Author) commented on Jul 9, 2022

@yawen-d It should be possible to train the reward ensemble now. If you have time, can you test that this most basic version works? You can do this by simply setting the reward class used by the reward ingredient to a reward ensemble.

I have not quite figured out how to get the conservative wrapper applied during retraining with train_rl.py. It's an issue with reward deserialization: currently it returns a RewardFn, so I'll have to think more carefully about how to introduce the reward wrapper. Once that is completed, the entire pipeline for training ensembles of reward functions and making them conservative should be done.
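To make the deserialization issue concrete (a hypothetical sketch only, not code from this PR: RewardFn below merely mimics the shape of imitation's per-transition reward callable, and make_conservative is an invented helper), wrapping at the level of the loaded callable rather than the reward net could look like this:

```python
from typing import Callable

import numpy as np

# Roughly the shape of a deserialized reward: a callable mapping batches of
# (state, action, next_state, done) to per-transition rewards.
RewardFn = Callable[[np.ndarray, np.ndarray, np.ndarray, np.ndarray], np.ndarray]


def make_conservative(
    mean_fn: RewardFn, std_fn: RewardFn, penalty: float = 1.0
) -> RewardFn:
    """Hypothetical wrapper: pessimistic reward = ensemble mean - penalty * std."""

    def conservative_fn(state, action, next_state, done):
        mean = mean_fn(state, action, next_state, done)
        std = std_fn(state, action, next_state, done)
        return mean - penalty * std

    return conservative_fn
```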

@yawen-d (Contributor) commented on Jul 10, 2022

Thanks for the implementations!

> @yawen-d It should be possible to train the reward ensemble now. If you have time, can you test that this most basic version works? You can do this by simply setting the reward class used by the reward ingredient to a reward ensemble.

Sure. I just started some light benchmarking runs on the current version.

In addition, another feature to consider is tracking the reward variance over time.

levmckinney marked this pull request as ready for review on July 12, 2022 at 14:27
AdamGleave changed the base branch from master to windows-ci-improvements on July 26, 2022 at 23:42
@AdamGleave (Member) left a comment

LGTM.

Before merging, please:

  • Review my changes. I pushed some small changes that restructure one piece of the code and fix some typos.
  • Wait for windows-ci-improvements to be merged (if it has not been reviewed yet, you can review it to accelerate that). This PR can then be rebased and merged into master.

Base automatically changed from windows-ci-improvements to master on July 28, 2022 at 06:27
@AdamGleave (Member) commented:

I think this is good to merge once the conflicts are resolved.

@levmckinney (Collaborator, Author) commented:

It seems like rerunning the tests fixed things; see #502.

levmckinney merged commit c1a992a into master on Jul 28, 2022
levmckinney deleted the reward_ensemble branch on July 28, 2022 at 18:38