
Train a reward model based on RankGen #78

Closed
andreaskoepf opened this issue Dec 27, 2022 · 9 comments · Fixed by #313

andreaskoepf (Collaborator) commented Dec 27, 2022

Add a reward-head (a linear projection of the model's embedding to a scalar value) to RankGen (see martiansideofthemoon/rankgen and the RankGen paper) and train it on human feedback (good/bad example pairs) from the openai/summarize-from-feedback dataset (see the Learning to Summarize from Human Feedback paper for details about the objective).

  • Place your training code in a new model/reward/rankgen folder.
  • Please use wandb for experiment tracking; measure at least loss and accuracy (based on the score of the good example being greater than that of the bad example).
  • Try to avoid modifying the original model; if possible, aggregate the existing model (i.e. add the existing model as a member of the new model class).
  • Compare with the results from Train a reward model based on Instructor #77.
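A rough sketch of what such a wrapper could look like, assuming the RankGen encoder can be driven like a generic Hugging Face T5 encoder; the checkpoint name, the mean pooling, and the class name below are placeholders rather than the actual RankGen setup:

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel

class RankGenRewardModel(nn.Module):
    """Wraps a pretrained encoder (kept as a member, not modified) and adds a reward head."""

    def __init__(self, base_name: str = "t5-base"):  # placeholder checkpoint name
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(base_name)
        self.reward_head = nn.Linear(self.encoder.config.d_model, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # mean-pool over non-padding tokens (a simple placeholder pooling strategy)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.reward_head(pooled).squeeze(-1)  # one scalar reward per example
```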

Background:
We want to implement the RLHF stack for Open-Assistant in parallel to our data collection effort. As a temporary fill-in we use existing RLHF datasets like OpenAI's learning to summarize for model development. Instructor was proposed as a promising base-model candidate for a reward model.

You could use bits of the reward model training code that I wrote a couple of weeks ago, which contains data loading code for the summarize-from-feedback data, as inspiration. If you like, you can of course use a framework like pytorch_lightning.
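For reference, a hedged sketch of pulling the comparison pairs from the Hugging Face hub; the dataset name and field layout below are my assumptions about the hub version of openai/summarize-from-feedback and should be double-checked against the actual data:

```python
from datasets import load_dataset

# "comparisons" contains pairs of candidate summaries plus the human choice
ds = load_dataset("openai/summarize_from_feedback", "comparisons", split="train")

def to_pair(example):
    # 'choice' marks which of the two candidate summaries the human preferred
    chosen = example["summaries"][example["choice"]]["text"]
    rejected = example["summaries"][1 - example["choice"]]["text"]
    return {"post": example["info"]["post"], "chosen": chosen, "rejected": rejected}

pairs = ds.map(to_pair)
```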

bth5032 (Contributor) commented Dec 27, 2022

I'd be happy to take a look at this.

If I understand correctly, you'd like me to:

  1. Start with the pretrained RankGen model.
  2. Use the dot-product method (between prefix and suffix) explained in the RankGen paper to score the summaries from the SFF dataset.
  3. Treat the dot-product score as the reward for the RM loss from SFF.

A few further questions:

  1. Do we want to then implement the policy optimization step as well?
  2. What is the baseline generative model we are using, e.g. to do comparisons with Train a reward model based on Instructor #77?
  3. What kind of compute resources do you expect this to require?

andreaskoepf (Collaborator, Author) commented Dec 27, 2022

> 2. Use the dot-product method (between prefix and suffix) explained in the RankGen paper to score the summaries from the SFF dataset.

I am not sure whether the method described in the RankGen paper can be directly applied here (I only skimmed the paper); if you know more, let me know. My first thought was to take the RankGen embedding, project it to a scalar reward, and then finetune with a pairwise cross-entropy loss (e.g. loss = -torch.mean(torch.log(torch.sigmoid(pos - neg)))).
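For concreteness, a runnable sketch of that loss; using F.logsigmoid(pos - neg) is equivalent to torch.log(torch.sigmoid(pos - neg)) but numerically more stable:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    # pos / neg: scalar rewards for the human-preferred and the rejected example
    return -F.logsigmoid(pos - neg).mean()

pos = torch.tensor([1.2, 0.3, 0.8])
neg = torch.tensor([0.4, 0.9, -0.1])
print(pairwise_ranking_loss(pos, neg))  # smaller when pos scores exceed neg scores
```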

> A few further questions:
> 1. Do we want to then implement the policy optimization step as well?

That will be the topic of a separate issue.

> 2. What is the baseline generative model we are using, e.g. to do comparisons with

The idea is to use the same train/val split of the OpenAI summarization dataset for both reward models: the Instructor-based one from #77 and the RankGen-based one here. The simplest metric we could compare is accuracy: how often the model matches the preference ranking provided by humans, i.e. rm(human_preferred_example) > rm(other_element_of_example_pair).
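As a sketch, that accuracy metric could be computed like this (names are illustrative):

```python
import torch

def preference_accuracy(preferred_scores: torch.Tensor, other_scores: torch.Tensor) -> float:
    # fraction of pairs where the reward model agrees with the human preference
    return (preferred_scores > other_scores).float().mean().item()
```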

> 3. What kind of compute resources do you expect this to require?

That would be a good first thing you could find out and report back. :-)

andreaskoepf (Collaborator, Author) commented

As another reference, you could also take a look at the RM code by @theblackcat102: https://github.com/theblackcat102/copycat/blob/master/train_critic.py

bth5032 (Contributor) commented Dec 27, 2022

> I am not sure whether the method described in the RankGen paper can be directly applied here (I only skimmed the paper); if you know more, let me know. My first thought was to take the RankGen embedding, project it to a scalar reward, and then finetune with a pairwise cross-entropy loss (e.g. loss = -torch.mean(torch.log(torch.sigmoid(pos - neg)))).

Okay, yeah, I also just skimmed it, but it seems like RankGen operates by embedding the prefix and suffix into vectors of the same size and then taking their dot product, which is, of course, already a scalar and (in theory) should be alright to use as the reward.

My intuition is that just doing a linear probe from the prefix, or combining the prefix and suffix into a single string and projecting down from that embedding, wouldn't take advantage of the pretrained weights as much, but maybe I can experiment with both approaches.
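A tiny sketch of that dot-product scoring, assuming we already have same-sized prefix and suffix embeddings (how RankGen actually produces them is abstracted away here):

```python
import torch

def dot_product_reward(prefix_emb: torch.Tensor, suffix_emb: torch.Tensor) -> torch.Tensor:
    # prefix_emb / suffix_emb: (batch, d) embeddings of the post and a candidate summary
    return (prefix_emb * suffix_emb).sum(dim=-1)  # one scalar score per pair
```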

> That will be the topic of a separate issue.

👍

> The idea is to use the same train/val split of the OpenAI summarization dataset for both reward models: the Instructor-based one from #77 and the RankGen-based one here. The simplest metric we could compare is accuracy: how often the model matches the preference ranking provided by humans, i.e. rm(human_preferred_example) > rm(other_element_of_example_pair).

Sounds good. I was unsure whether we were going to do the full RLHF process as well, so I was wondering which LM we were supposed to optimize. It makes sense to compare the reward models independently first, though.

> That would be a good first thing you could find out and report back. :-)

Will do!

andreaskoepf (Collaborator, Author) commented

> My intuition is that just doing a linear probe from the prefix, or combining the prefix and suffix into a single string and projecting down from that embedding, wouldn't take advantage of the pretrained weights as much, but maybe I can experiment with both approaches.

OK, I see. In our case we indeed have a well-defined prefix (e.g. the user's instruction / the conversation so far). For other cases it is probably possible to split the text somewhere to compute the 'coherence' of those two segments (as they do in the paper for beam search). The interesting part for us would then be training the model on the ranking data that we collect: we will be able to get multiple results for a given prefix from our db, together with combined ranking scores, which allows us to generate preference pairs for training.
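A hypothetical sketch of turning several scored replies for one prefix into pairwise training examples; the data layout here is an assumption, not the actual Open-Assistant schema:

```python
from itertools import combinations

def ranked_replies_to_pairs(replies):
    # replies: list of (text, combined_ranking_score) tuples for the same prefix
    ordered = sorted(replies, key=lambda r: r[1], reverse=True)
    # every higher-ranked reply is paired as "preferred" against every lower-ranked one
    return [(better[0], worse[0]) for better, worse in combinations(ordered, 2)]

pairs = ranked_replies_to_pairs([("reply A", 2.0), ("reply B", 0.5), ("reply C", 1.1)])
# -> [("reply A", "reply C"), ("reply A", "reply B"), ("reply C", "reply B")]
```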

dhruv2601 commented

You can also take a look at the trlx library for its reward function implementation.

bth5032 (Contributor) commented Jan 1, 2023

Hey, a few questions:

  1. Any suggestions for how to compare with the other tasks? I'm thinking we might want to make comparing our pretrained models its own issue, since all these rankers are being built in parallel by different people. Otherwise, perhaps we should just use the loss or some other figure of merit, e.g. treat this as binary classification and use the standard accuracy, F1 score, ROC, etc., but I think we should at least have a fixed list of the summarize-from-feedback examples, right?

  2. I heard there was a Discord, can I get an invite?

  3. I'll make a PR tonight with a sketch of what I'm doing. So far I have just been working with the webgpt dataset, but I think I have the basic training infrastructure set up for the model. It seems like my PC can't handle the t5-xl model, but I can definitely train a t5-base model and maybe t5-large. Will we have training infra somewhere for this project?

@dhruv2601

> You can also take a look at the trlx library for its reward function implementation.

Thanks, I'll take a look :)

bth5032 (Contributor) commented Jan 1, 2023

Added a draft PR; I'm mainly working out of this notebook for now:

bth5032 (Contributor) commented Jan 3, 2023

As per the discussion with @theblackcat102, I decided to build this on top of their trainer. The code for this is in the new PR. I am also training the model on W&B here.

@yk linked a pull request on Jan 3, 2023 that will close this issue.