
Scoring-algorithm for crowd-sourced data collection #37

Closed
andreaskoepf opened this issue Dec 19, 2022 · 10 comments

@andreaskoepf
Collaborator

Interactions with our human-feedback front-ends (Discord bot + website) should be incentivized with a user score that is shown on a leaderboard (gamification), plus a news feed that notifies (on Discord and the website) about recent user activity (potentially anonymized if desired by the user).

It is not sufficient for us to collect a large number of user interactions. We especially want to incentivize submitting high-quality instruct-fulfillment data points and human-written agent responses (demonstrations). We plan to estimate the quality of user-provided data in a similar way to how feedback on language-model outputs is generated: by using human feedback (e.g. grading/ranking input provided by other users).

We want to use a mix of distributed community and centralized administrator moderation. Centralized admin moderation (disabling users and deleting their data) is expensive and not easily scalable.

Challenging user behavior may include:

  • random ranking and grading
  • giving intentionally wrong feedback
  • politically motivated fake news (e.g. flooding the dataset)
  • duplicate data entry
  • DoS attacks
    ...
@andreaskoepf andreaskoepf moved this to 📫 Triage in Open-Assistant Dec 19, 2022
@yk yk moved this from 📫 Triage to 🛠 Todo in Open-Assistant Dec 19, 2022
@yk
Collaborator

yk commented Dec 19, 2022

Maybe to add: this goes beyond just the scoring algorithm. The main question is how we can design the system to incentivize as much useful and high-quality feedback as possible.

@MattAlexMiracle
Collaborator

I have a first draft of the scoring and question selection system as a pull request:
The overall goals I'm targeting are:

  1. "No trolls": we do not want to reward or amplify malicious characters interacting with the system
  2. "No echo chamber": we do not want to greedily reward the consensus opinion as that would just produce an echo chamber of uninteresting and therefore lower quality prompts. It also discourages people from interacting with their system using their honest opinion, since that gives less points than if they ranked based on the presumed "mean opinion".

In this first draft I built a system to rate users in the three domains "voting", "prompting" and "ranking", each with its own individual point store (for a total score we can weight and merge those).

Further, we also tally how many "good questions" a user has submitted ("good" meaning "above average"). This can be used to prevent spam on the leaderboards (and in the system), as we can simply filter out people below a certain quality threshold (e.g. "if you have contributed at least 20 times and your score is above 80%, we show your contributions on the leaderboard and may include them in our dataset").

To do this we need to (at some point) retire a question from being shown to users; only then can we use the consensus mechanism to reward users (otherwise the consensus might still change).
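
Here is a minimal Python sketch of the leaderboard/eligibility filter described above; the field names and the concrete thresholds (20 contributions, 80% "good") are placeholders taken from the example, not an actual schema:

```python
from dataclasses import dataclass


@dataclass
class UserStats:
    contributions: int       # total contributions that have received peer feedback
    good_contributions: int  # contributions judged "above average" by peers


def eligible_for_leaderboard(stats: UserStats,
                             min_contributions: int = 20,
                             min_good_ratio: float = 0.8) -> bool:
    """Return True if the user passes the spam/quality threshold."""
    if stats.contributions < min_contributions:
        return False
    return stats.good_contributions / stats.contributions >= min_good_ratio
```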

I envision a workflow like this (a rough code sketch follows the list):

  1. use the maximum information gain metric to sample from the questions,
     $$x \sim \mathrm{Categorical}(\mathrm{softmax}(\mathrm{infogain}(\text{questions})))$$
     to create a pseudo-Thompson sampling scheme over all questions
  2. Once a question reaches either a threshold number of votes, or falls below a minimum information gain (i.e. continuing to vote isn't expected to do a lot), we can retire the question into a "passive pool" of finished questions
  3. When we move a question from the "active pool" of questions into the "passive pool" of finished questions, we compute the consensus, and reward the people that took part in answering the question
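
A rough sketch of this sampling/retirement loop, assuming a hypothetical `infogain()` function that scores each active question; the thresholds and the `num_votes` attribute are placeholders:

```python
import numpy as np


def sample_question(active_questions, infogain, rng=None):
    """Draw a question with probability softmax(infogain(q)) over the active pool."""
    rng = rng or np.random.default_rng()
    gains = np.array([infogain(q) for q in active_questions], dtype=float)
    logits = gains - gains.max()                  # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return active_questions[rng.choice(len(active_questions), p=probs)]


def should_retire(question, infogain, max_votes=50, min_gain=0.05):
    """Move a question to the passive pool once further votes add little value."""
    return question.num_votes >= max_votes or infogain(question) < min_gain
```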

@MattAlexMiracle
Collaborator

> LGTM, thank you
> if I understand correctly, a "vote" is a best-one-of-N judgement. Is there also a variant if each user provides a complete ranking of all N choices? I'm thinking if we already make the user read all of the options, they might as well rank all of them.

@yk moving your question over from the closed issue.
at the moment I consider three scoring systems:

  1. best-of-N (how good was the vote)
  2. prompt value (i.e. the dual to best-of-N: how good was the question)
  3. ranking (i.e. given K options to rank, how well does a user's ranking line up with the consensus ranking)

The latter seems to be what you are getting at. This is currently done by obtaining N rankings (which might be partial) and merging them using the ranked pairs algorithm, giving a consensus rank τ'. I then compute the Kendall-tau correlation between a user's ranking and τ' and shift it such that instead of being in [-1,1] it is in [0,1], which then gives you your ranking points.
I consider a ranking "good" if it has a correlation of at least 0.5.
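
A small sketch of just the scoring step, assuming both rankings are given as rank positions over the same items (building the consensus rank τ' via ranked pairs is not shown); I read the 0.5 "good" threshold as applying to the raw correlation:

```python
from scipy.stats import kendalltau


def ranking_score(user_ranking, consensus_ranking):
    """Kendall-tau correlation shifted from [-1, 1] into [0, 1]."""
    tau, _ = kendalltau(user_ranking, consensus_ranking)
    return (tau + 1.0) / 2.0


def is_good_ranking(user_ranking, consensus_ranking, min_correlation=0.5):
    """A ranking counts as "good" if its raw correlation is at least 0.5."""
    tau, _ = kendalltau(user_ranking, consensus_ranking)
    return tau >= min_correlation


# Example: rank positions of the same four answers, user vs. consensus.
print(ranking_score([1, 2, 3, 4], [1, 3, 2, 4]))  # ~0.83
```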

@TMRolle

TMRolle commented Jan 1, 2023

Just a thought: could this be greatly simplified by using an existing text dataset to source the responses, and working backwards to manually or semi-automatically generate the prompts to match?

I.e. take the existing text of a Wikipedia page for the Arctic Monkeys, and then just have a human back-write a prompt like "Write a wikipedia-style page about the English rock band 'The Arctic Monkeys'".

It seems like an approach like this would significantly simplify the process while still maintaining a high level of control over dataset content balance and response quality. Large swaths of the training data could likely be generated from semi-structured/rich metadata content fairly easily as well, such as by using the Wikidata and Wikipedia datasets along with a few template variations of "Write a wikipedia-style page about the musician [musician name]. Be sure to include details about [items found in structured table of contents]."

You probably would want to limit the amount of template-generated data, but it could be useful for bootstrapping an initial training set, and it could potentially approach human-generated prompt quality by using a much less sophisticated LLM to generate paraphrasings of the template prompts.
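
As an illustration of the template idea (the templates and metadata below are made up, not an actual Wikidata integration):

```python
import random

# Hand-written template variations; a small LLM could later paraphrase these.
TEMPLATES = [
    "Write a wikipedia-style page about the musician {name}. "
    "Be sure to include details about {sections}.",
    "Compose an encyclopedia-style article on {name}, covering {sections}.",
]


def make_prompt(metadata: dict) -> str:
    """Fill a randomly chosen template with structured metadata."""
    sections = ", ".join(metadata["toc"])
    return random.choice(TEMPLATES).format(name=metadata["name"], sections=sections)


# Example usage with hand-written metadata:
print(make_prompt({"name": "Arctic Monkeys",
                   "toc": ["History", "Musical style", "Discography"]}))
```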

The use of fixed high-quality response data would also lend itself to a natural gamification approach for human-generated prompts, where users could propose prompts for a given response in a Jeopardy-style game, and proposed prompts could be ranked by other users and points awarded through a simple reddit-style upvote/downvote points system.

@yk
Collaborator

yk commented Jan 1, 2023

@TMRolle fully agree, and we're heavily thinking about (and some people are already working on) prompt-based data. The problem is that it quickly leads to overfitting. The impressive thing about something like ChatGPT is how versatile it is, i.e. how many different tasks it can do, even beyond what its creators had intended. If we go the template way, the training dataset will consist of lots of samples from relatively few tasks, and the danger is that the model becomes too task-specific. So you're definitely correct in pointing out that the amount of data generated like this has to be carefully balanced, and in my estimation the optimal share might be quite low. Which is a bummer, given how easy it would be to produce.

@TMRolle

TMRolle commented Jan 2, 2023

Makes sense on the structured data + template approach - though I think it would still likely be a good approach even for fully human-generated instruction prompts. It's much faster and easier for humans to look at a desired output and come up with a corresponding input prompt than it is to start from scratch and write both the prompt and the response, and that could presumably be done with arbitrary excerpts from an existing language-model training dataset.

@andreaskoepf
Collaborator Author

Several ideas have been discussed during the last days. I will summarize a simple system that I propose to implement.

  • There are two related scoring systems that have different requirements:
    a) gamification reward for the user (leaderboard) to incentivize high-quality feedback & peer-review/moderation
    b) ranked completions of a given prompt for the RM dataset or to evaluate/compare a trained model

  • For gamification we want timely feedback and a monotonically increasing score. Negative rewards should only be assigned to users when we know with high certainty that they deliberately misbehaved, e.g. submitted inappropriate prompts or spam. Ratings and rankings by (random) other users provide a strong basis for judging user input.

  • Rankings exported from the DB are less time critical but should be as accurate as possible.

Steps to collect and rank replies for a given prompt (generation of initial prompts has to be discussed separately):

  1. MessageTree growing phase: A new message tree is created with an initial prompt. We assume this prompt has already been screened and passed basic quality tests (not described here). The system sets a goal for the total number of descendants to collect in the tree. The system hands out text-reply tasks to users with randomly sampled nodes of the tree as the conversation thread (traced to the root). To create wider and less deep trees, nodes could be sampled inversely proportional to their depth (see the sketch after this list). The collection phase creates a random conversation tree and ends when the desired total number of messages in the tree has been reached.

  2. Quality feedback/Rating phase: During the tree-growing phase, as soon as new messages have been sent to the backend they can be handed out to random users (different from the author) for quality feedback. For this feedback, a user-generated completion is shown together with the conversation context, and the user is asked to check a couple of quality features, maybe also to assign multiple attributes to it. Quality feedback is used to auto-delete/exclude messages from the tree, which also influences when the tree is declared mature and ready for the ranking phase.

  3. Once the desired number of messages passing the QA-feedback step has been collected, the random tree can go into ranking mode. No new messages are accepted, and the messages of all nodes with two or more direct replies can be handed out to users for ranking.
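
A rough sketch of the depth-weighted node sampling mentioned in step 1; the `MessageNode` type and the 1/(depth+1) weighting are assumptions, not the actual backend implementation:

```python
import random
from dataclasses import dataclass


@dataclass
class MessageNode:
    id: str
    depth: int  # the initial prompt (root) has depth 0


def sample_node_to_extend(nodes: list[MessageNode]) -> MessageNode:
    """Pick a node to hand out as a text-reply task, favoring shallow nodes
    so the tree grows wide rather than deep."""
    weights = [1.0 / (node.depth + 1) for node in nodes]
    return random.choices(nodes, weights=weights, k=1)[0]
```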

During ranking we can assign reward to the original text authors depending on individual ranking results, e.g. 3 points for 1st place, 2 points for 2nd, 1 point for 3rd and 0 for others (or exponentially decaying, for example 1st: 9, 2nd: 3, 3rd: 1).
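
The two reward schemes from the example, as a small sketch (the base of 3 and the top-3 cutoff are taken from the numbers above; everything else is assumed):

```python
def linear_reward(place: int) -> int:
    """1st -> 3, 2nd -> 2, 3rd -> 1, everything else -> 0."""
    return max(0, 4 - place) if place >= 1 else 0


def exponential_reward(place: int, base: int = 3, top: int = 3) -> int:
    """1st -> 9, 2nd -> 3, 3rd -> 1, everything else -> 0 (for base=3, top=3)."""
    return base ** (top - place) if 1 <= place <= top else 0
```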

(to be continued)

@MattAlexMiracle
Collaborator

I have a system for task-selection over at #383.
This method gives a self-adjusting distribution over whether to do "rating", "prompting", "answering" or "ranking".

There's already a system for selecting which question to vote on; this one now also adds a system for choosing which prompt to produce answers to.
The last thing will be selecting which item to rate next.
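
Not the actual #383 implementation, but a generic illustration of drawing the next task type from a categorical distribution whose weights can be re-adjusted (e.g. based on how many items each queue currently needs):

```python
import random


def next_task_type(weights: dict[str, float]) -> str:
    """weights maps task type -> unnormalized weight; higher weight = drawn more often."""
    task_types, w = zip(*weights.items())
    return random.choices(task_types, weights=w, k=1)[0]


# Example: favor ranking while many message trees are waiting for rankings.
print(next_task_type({"rating": 1.0, "prompting": 0.5, "answering": 2.0, "ranking": 3.0}))
```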

@hemangjoshi37a
Contributor

To resolve this issue, it looks like you need to design a scoring algorithm for crowd-sourced data collection that incentivizes users to provide high-quality feedback. You want to gamify the process and create a leaderboard to show the scores of users. It also looks like you want to use a mix of distributed community and centralized administrator moderation.

You also want to prevent challenging user behavior such as random ranking and grading, giving intentionally wrong feedback, politically motivated fake news, duplicate data entry, and DoS attacks.

To do this, you can consider the following steps:

  1. Determine the domains that you want to rate users in, such as voting, prompting, and ranking.
  2. Set up individual point stores for each domain.
  3. Tally the number of "good" (above average) questions that a user has submitted.
  4. Use a maximum information gain metric to sample questions and create a pseudo-Thompson sampling scheme.
  5. When a question reaches a certain number of votes or falls below a minimum information gain, retire the question into a "passive pool" of finished questions.
  6. Compute the consensus when a question is retired, and reward the users that took part in answering it.
  7. Use the consensus to compute a score for each user.
  8. Display the scores on a leaderboard and use them to filter out users below a certain quality threshold.

You may also want to consider using techniques such as reputation systems, social proof, and loss aversion to incentivize users to provide high-quality feedback.

It is not clear from the prompt whether you need specific code for this issue. If you do, please provide more details and context about what you need help with.

@andreaskoepf
Collaborator Author

I'm closing this because we have had a simple version running for multiple weeks now.

@github-project-automation github-project-automation bot moved this from 🛠 Todo to ✅ Done in Open-Assistant Feb 20, 2023