Train a reward model based on RankGen #78
Comments
I'd be happy to take a look at this. If I understand correctly, you'd like me to add a reward head on top of RankGen and train it on the summarize-from-feedback comparisons?
A few further questions:
I am not sure whether the method described in the RankGen paper can be directly applied here (I only skimmed the paper). If you know more, let me know... My first thought was to take the RankGen embedding, project it to a scalar reward and then finetune with a cross-entropy loss (e.g. the pairwise objective from the summarize-from-feedback paper).
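A minimal sketch of that idea, assuming a generic encoder that returns one pooled embedding per example (the actual RankGen interface may differ); the pairwise cross-entropy objective follows the summarize-from-feedback setup, where the preferred completion should receive the higher scalar reward:

```python
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Linear projection of a pooled text embedding to a scalar reward."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder  # placeholder for a RankGen-style encoder
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        # assumption: the encoder returns one pooled embedding per example, shape (batch, hidden)
        emb = self.encoder(input_ids, attention_mask=attention_mask)
        return self.value_head(emb).squeeze(-1)  # (batch,) scalar rewards


def pairwise_loss(chosen_rewards, rejected_rewards):
    """Pairwise cross-entropy as in 'Learning to summarize from human feedback':
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```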
That will be the topic of a separate issue.
The idea is to use an equal train/val split of the OpenAI summarization dataset for both reward models ... the instructor-based one from #77 and the other one based on RankGen. The simplest metric we could compare is accuracy (how often the models match the preference ranking provided by humans, e.g. the fraction of comparison pairs where the model scores the human-preferred summary higher).
That would be a good first thing you could find out and report back. :-)
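For the accuracy comparison mentioned above, a rough sketch (with `reward_fn` and the pair format as placeholder interfaces, not a fixed API) could look like:

```python
import torch


@torch.no_grad()
def preference_accuracy(reward_fn, pairs):
    """Fraction of human-labelled comparisons where the reward model scores
    the preferred summary higher than the rejected one.

    `reward_fn(prompt, summary) -> float` and the (prompt, preferred, rejected)
    tuples are placeholders for whatever interface the two models expose.
    """
    correct = sum(
        reward_fn(prompt, preferred) > reward_fn(prompt, rejected)
        for prompt, preferred, rejected in pairs
    )
    return correct / max(len(pairs), 1)
```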
As another reference, you could also take a look at the RM code by @theblackcat102: https://github.com/theblackcat102/copycat/blob/master/train_critic.py ...
Okay, yeah I also just skimmed it, but it seems like RankGen operates by embedding the prefix and suffix into vectors of the same size and then taking their dot product, which is, of course, already a scalar and (in theory) should be fine to use as the reward. My intuition is that a linear probe from the prefix alone, or combining the prefix and suffix into a single string and projecting down from that embedding, wouldn't take as much advantage of the pretrained weights, but maybe I can experiment with both approaches.
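To illustrate the dot-product option, a small sketch that assumes the prefix and suffix embeddings have already been produced by the RankGen encoders (obtaining them is left to the actual martiansideofthemoon/rankgen API):

```python
import torch


def rank_suffixes(prefix_vec: torch.Tensor, suffix_vecs: torch.Tensor) -> torch.Tensor:
    """Score several candidate suffixes against one prefix.

    prefix_vec:  (hidden,) embedding of the prefix
    suffix_vecs: (num_candidates, hidden) embeddings of the candidate suffixes
    Returns (num_candidates,) dot-product scores; each score is already a
    scalar and could serve directly as a reward.
    """
    return suffix_vecs @ prefix_vec
```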
👍
Sounds good. I was unsure whether we were going to do the full RLHF process as well, so I was wondering which LM we were supposed to optimize. It makes sense to compare the reward models independently first, though.
Will do!
Ok, I see ... in our case we indeed have a well-defined prefix (e.g. the user's instruction / the conversation so far). For other cases it is probably possible to split the text somewhere to compute the 'coherence' of those two segments (as they do in the paper for beam search). The interesting part for us will then be training the model on the ranking data that we generate: for a given prefix we will be able to get multiple results from our db, together with combined ranking scores, which allows us to generate comparison pairs for training.
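One possible way to turn such db records (several completions per prefix with combined ranking scores) into comparison pairs for reward-model training; the record layout here is only an assumption about what the export will look like:

```python
from itertools import combinations


def ranking_to_pairs(prefix, ranked_completions):
    """Turn (completion, combined_ranking_score) records for one prefix into
    (prefix, chosen, rejected) comparison pairs.

    `ranked_completions` is assumed to be a list of (text, score) tuples;
    every pair with a strict score difference yields one comparison.
    """
    pairs = []
    for (a, score_a), (b, score_b) in combinations(ranked_completions, 2):
        if score_a == score_b:
            continue  # skip ties
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append((prefix, chosen, rejected))
    return pairs
```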
You can also take a look at the trlx library for a reward-function implementation.
Hey, a few Qs,
Thanks, I'll take a look :)
Added a draft PR, mainly working out of this notebook for now:
As per discussion with @theblackcat102 I decided to build this on top of their trainer. The code for this is in the new PR. I am also tracking the training run on W&B here.
Add a reward-head (a linear projection of the model's embedding to a scalar value) to RankGen (see martiansideofthemoon/rankgen and the RankGen paper) and train it on human feedback (good/bad example pairs) from the openai/summarize-from-feedback dataset (see the Learning to summarize from human feedback paper for details about the objective).
Put the code in the model/reward/rankgen folder.

Background:
We want to implement the RLHF stack for Open-Assistant in parallel to our data collection effort. As a temporary fill-in we use existing RLHF datasets like OpenAI's learning to summarize for model development. Instructor was proposed as a promising base-model candidate for a reward model.
You could use bits of the reward-model training code that I wrote a couple of weeks ago, which contains data-loading code for the summarize-from-feedback data, as inspiration. If you like, you can of course use a framework like pytorch_lightning.
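For reference, a rough sketch of loading the comparison data with Hugging Face `datasets`; the hub id, config name and field layout are assumptions that should be checked against the data-loading code mentioned above:

```python
from datasets import load_dataset

# The hub id, the "comparisons" config and the field names below are assumptions --
# verify them against the existing data-loading code before relying on them.
ds = load_dataset("openai/summarize_from_feedback", "comparisons")

example = ds["train"][0]
prompt = example["info"]["post"]                                # the text to be summarized
chosen = example["summaries"][example["choice"]]["text"]        # human-preferred summary
rejected = example["summaries"][1 - example["choice"]]["text"]  # rejected summary
```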