Adding reward ensembles and conservative reward functions #460
Conversation
Codecov Report
@@ Coverage Diff @@
## master #460 +/- ##
==========================================
+ Coverage 96.67% 96.82% +0.14%
==========================================
Files 80 82 +2
Lines 6775 7127 +352
==========================================
+ Hits 6550 6901 +351
- Misses 225 226 +1
Finished initial review now. Seems like a reasonable design. Left some more detailed comments inline. Please request re-review when addressed.
Co-authored-by: Adam Gleave <adam@gleave.me>
… EnsembleRewardNet initializer
…ng before this ...
Force-pushed 9b6b987 to 42378f8
@yawen-d It should be possible to train the reward ensemble now. If you have time, can you test that this most basic version works? You can do this by setting the reward class used by the reward ingredient to be a reward ensemble. I have not quite figured out how to get the conservative wrapper applied during retraining with
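For illustration, here is a minimal numpy sketch of the ensemble idea being discussed: average the member networks' predictions and expose the spread across members as epistemic uncertainty. The class and member functions here are hypothetical stand-ins; the actual PR implements this on top of imitation's torch-based reward networks.

```python
import numpy as np

class RewardEnsemble:
    """Toy stand-in for an ensemble reward net: averages member
    predictions and reports the spread across members."""

    def __init__(self, members):
        self.members = members  # list of callables: obs -> reward

    def predict(self, obs):
        # Stack per-member predictions: shape (n_members, batch).
        preds = np.stack([m(obs) for m in self.members])
        # Mean is the ensemble reward; std is the epistemic uncertainty.
        return preds.mean(axis=0), preds.std(axis=0)

# Two dummy "reward nets" that disagree more on larger observations.
members = [lambda obs: obs * 1.0, lambda obs: obs * 1.2]
ensemble = RewardEnsemble(members)
mean, std = ensemble.predict(np.array([0.0, 1.0, 2.0]))
# mean → [0.0, 1.1, 2.2]; std → [0.0, 0.1, 0.2]
```

Swapping the reward class for an ensemble like this leaves the training loop unchanged, since downstream code only consumes the mean prediction.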
Thanks for the implementations!
Sure. I just started some light benchmarking environments on the current version. In addition, another feature to consider is tracking the reward variance over time.
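Tracking the reward variance over time could be as simple as logging the per-step variance across ensemble members, e.g. to monitor whether the members converge. A hedged sketch (the function name and data layout are assumptions, not part of the PR):

```python
import numpy as np

def track_reward_variance(ensemble_preds_over_time):
    """Given per-step member predictions with shape (steps, members),
    return the per-step variance across ensemble members for logging."""
    preds = np.asarray(ensemble_preds_over_time)
    return preds.var(axis=1)

# Three logged steps, two ensemble members each.
history = [[1.0, 1.2], [0.9, 1.5], [1.1, 1.1]]
per_step_var = track_reward_variance(history)
# per_step_var → [0.01, 0.09, 0.0]
```

A decreasing variance curve would suggest the ensemble members are agreeing; a growing one flags states where the reward model is uncertain.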
a63c0f0
to
af895cb
Compare
LGTM.
Before merging, please:
- Review my changes. I pushed some small changes to restructure one piece of the code, and fix some typos.
- Wait for windows-ci-improvements to get merged (if not already reviewed, you can review it to accelerate that). We should then rebase this PR onto master and merge it.
I think this is good to merge once the conflicts are resolved.
Seems like rerunning the tests fixed things. See #502.
This pull request adds a base class for reward functions that track their epistemic uncertainty, an ensemble-based implementation of that base class, and a conservative reward wrapper.
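The conservative wrapper can be summarized as penalizing the ensemble mean by its epistemic uncertainty. A minimal sketch, assuming a mean-minus-beta-standard-deviations rule (the `beta` parameter name is hypothetical; see the PR diff for the actual interface):

```python
import numpy as np

def conservative_reward(mean, std, beta=1.0):
    """Pessimistic reward: subtract beta standard deviations of
    ensemble disagreement from the mean ensemble prediction."""
    return mean - beta * std

# Two states with equal mean reward; the uncertain one is penalized.
mean = np.array([1.0, 1.0])
std = np.array([0.0, 0.5])
r = conservative_reward(mean, std)
# r → [1.0, 0.5]
```

This biases the learned policy away from states where the ensemble members disagree, which is the usual motivation for conservative reward functions.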
TODO:
- make_reward
- train_preference_comparison.py
- train_preference.py