Scoring-algorithm for crowd-sourced data collection #37
Maybe to add: this goes beyond just the scoring algorithm. The main question is how we can design the system to incentivize as much useful and high-quality feedback as possible.
I have a first draft of the scoring and question selection system as a pull request:
In this first draft I built a system to rate users in the three domains "voting", "prompting" and "ranking", each with its own individual point store (for a total score we can weight and merge those). We also tally how many "good" questions a user has submitted (with "good" meaning "above average"). This can be used to prevent spam on the leaderboards (and in the system), since we can simply filter out people below a certain quality threshold (e.g. "if you have contributed at least 20 times and your score is above 80%, we show your contributions on the leaderboard and may include them in our dataset"). To do this we need to (at some point) retire a question from being shown to users, since only then can we use the consensus mechanisms to reward users (otherwise the consensus might still change). I envision a workflow like this:
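As a rough illustration of the three-domain point stores and the leaderboard quality threshold described above (not the actual implementation from the pull request; all field names, weights and thresholds here are hypothetical placeholders):

```python
from dataclasses import dataclass

@dataclass
class UserScore:
    """Hypothetical per-user record; field names and weights are illustrative only."""
    voting_points: float = 0.0
    prompting_points: float = 0.0
    ranking_points: float = 0.0
    contributions: int = 0        # total questions/answers submitted
    good_contributions: int = 0   # contributions rated "above average" by consensus

    def total(self, w_vote=1.0, w_prompt=1.0, w_rank=1.0) -> float:
        # Weighted merge of the three domain scores into one leaderboard score.
        return (w_vote * self.voting_points
                + w_prompt * self.prompting_points
                + w_rank * self.ranking_points)

    def quality(self) -> float:
        # Fraction of contributions that ended up above the consensus average.
        return self.good_contributions / self.contributions if self.contributions else 0.0

def leaderboard_eligible(user: UserScore, min_contributions=20, min_quality=0.8) -> bool:
    """Spam filter: e.g. 'at least 20 contributions and a quality above 80%'."""
    return user.contributions >= min_contributions and user.quality() >= min_quality
```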
@yk moving your question over from the closed issue.
the latter seems to be what you are getting at. This is currently done by obtaining N rankings (which might be partial) and merging them using the ranked-pairs algorithm, giving a consensus ranking τ'. I then compute the Kendall-tau correlation and shift it from [-1, 1] to [0, 1], which then gives you your ranking points.
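A minimal sketch of just the scoring step (the ranked-pairs merge that produces the consensus ranking is omitted), assuming rankings are given as lists of item ids ordered best-to-worst and using scipy's kendalltau for the correlation:

```python
from scipy.stats import kendalltau

def ranking_points(user_ranking, consensus_ranking):
    """Shift Kendall-tau from [-1, 1] into [0, 1] to use it as a point reward.

    Both arguments are lists of the same item ids, ordered best-to-worst.
    Partial rankings would first need to be restricted to the shared items.
    """
    # Convert each ordering into a rank vector over a common item order.
    items = consensus_ranking
    user_pos = {item: pos for pos, item in enumerate(user_ranking)}
    consensus_ranks = list(range(len(items)))
    user_ranks = [user_pos[item] for item in items]

    tau, _p_value = kendalltau(consensus_ranks, user_ranks)
    return (tau + 1.0) / 2.0  # tau = 1 (perfect agreement) -> 1.0, tau = -1 -> 0.0

# Example: swapping two neighbours of the consensus gives slightly less than 1.0.
print(ranking_points(["a", "b", "d", "c"], ["a", "b", "c", "d"]))
```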
Just a thought: could this be greatly simplified by using an existing text dataset to source the responses, and working backwards to manually or semi-automatically human-generate the prompts to match? I.e. take the existing text of a Wikipedia page for the Arctic Monkeys, and then just have a human back-write a prompt like "Write a wikipedia-style page about the English rock band "The Arctic Monkeys"". It seems like an approach like this would significantly simplify the process while still maintaining a high level of control over dataset content balance and response quality.

Large swaths of the training data could likely be generated fairly easily from semi-structured/rich metadata content as well, such as by using the Wikidata and Wikipedia datasets along with a few template variations of "Write a wikipedia-style page about the musician [musician name]. Be sure to include details about [items found in structured table of contents]." You would probably want to limit the amount of template-generated data, but it could be useful for bootstrapping an initial training set, and could potentially approach human-generated prompt quality by using a much less sophisticated LLM to generate paraphrasings of the template prompts.

The use of fixed high-quality response data would also lend itself to a natural gamification approach for human-generated prompts, where users could propose prompts for a given response in a Jeopardy-style game, and proposed prompts could be ranked by other users, with points awarded through a simple reddit-style upvote/downvote system.
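A toy sketch of the template idea (the template strings and metadata fields are invented here for illustration, not taken from any actual Wikidata pipeline):

```python
import random

# Hypothetical structured metadata, e.g. pulled from Wikidata/Wikipedia.
record = {
    "name": "Arctic Monkeys",
    "occupation": "English rock band",
    "sections": ["History", "Musical style", "Discography"],
}

TEMPLATES = [
    'Write a wikipedia-style page about the {occupation} "{name}". '
    "Be sure to include details about {section_list}.",
    'Write an encyclopedia article on "{name}" ({occupation}), covering {section_list}.',
]

def backwrite_prompt(rec):
    """Back-write a prompt for an existing response from structured metadata."""
    template = random.choice(TEMPLATES)
    return template.format(
        name=rec["name"],
        occupation=rec["occupation"],
        section_list=", ".join(rec["sections"]),
    )

print(backwrite_prompt(record))
```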
@TMRolle fully agree, and we're thinking heavily about (and some people are already working on) prompt-based data. The problem is that it quickly leads to overfitting. The impressive thing about something like ChatGPT is how versatile it is, i.e. how many different tasks it can do, even beyond what its creators had intended. If we go the template route, the training dataset will consist of lots of samples from relatively few tasks, and the danger is that the model becomes too task-specific. So you're definitely correct in pointing out that the amount of data generated like this has to be carefully balanced, and in my estimation the optimal share might be quite low. Which is a bummer, given how easy it would be to produce.
Makes sense on the structured data + template approach - though I think it would still likely be a good approach even for fully human-generated instruction prompts, given that it's much faster and easier for humans to look at a desired output and come up with a corresponding input prompt than it is to start from scratch and write both the prompt and the response, and since that could presumably be done with arbitrary excerpts from an existing language-model training dataset.
Several ideas have been discussed over the last few days. I will summarize a simple system that I propose to implement.
Steps to collect and rank replies for a given prompt (generation of initial prompts has to be discussed separately):
During ranking we can assign rewards to the original text authors depending on individual ranking results, e.g. 3 points for 1st place, 2 points for 2nd, 1 point for 3rd and 0 for others (or exponentially decaying, for example 1st: 9, 2nd: 3, 3rd: 1). (to be continued)
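A minimal sketch of the two reward schedules mentioned (linear 3/2/1/0 and exponentially decaying 9/3/1); the function names and parameters are placeholders:

```python
def linear_reward(place: int) -> int:
    """3 points for 1st place, 2 for 2nd, 1 for 3rd, 0 for everything below."""
    return max(0, 3 - (place - 1))

def exponential_reward(place: int, base: int = 3, top: int = 3) -> int:
    """Exponentially decaying: 1st -> 9, 2nd -> 3, 3rd -> 1, others -> 0 (base=3)."""
    return base ** (top - place) if place <= top else 0

assert [linear_reward(p) for p in range(1, 6)] == [3, 2, 1, 0, 0]
assert [exponential_reward(p) for p in range(1, 6)] == [9, 3, 1, 0, 0]
```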
I have a system for task selection over at #383. There's already a system for selecting which question to show for voting; this one now also adds a system for choosing which prompt to produce answers to.
To resolve this issue, it looks like you need to design a scoring algorithm for crowd-sourced data collection that incentivizes users to provide high-quality feedback. You want to gamify the process and create a leaderboard to show the scores of users. It also looks like you want to use a mix of distributed community and centralized administrator moderation. You also want to prevent challenging user behavior such as random ranking and grading, giving intentionally wrong feedback, politically motivated fake news, duplicate data entry, and DoS attacks. To do this, you can consider the following steps:
You may also want to consider using techniques such as reputation systems, social proof, and loss aversion to incentivize users to provide high-quality feedback. It is not clear from the prompt whether you need specific code for this issue. If you do, please provide more details and context about what you need help with.
I'm closing this because we have had a simple version running for multiple weeks now.
Interactions with our human-feedback front-ends (discord bot + website) should be incentivized with a user score that is shown on a leaderboard (gamification), plus a news feed that notifies (on discord and website) about recent user activity (potentially anonymized if desired by the user).
It is not sufficient for us to collect a large number of user interactions. We want to especially incentivize submitting high-quality instruct-fulfillment data points and human-written agent responses (demonstrations). We plan to estimate the quality of user-provided data in a similar way to how feedback on language-model outputs is generated - by using human feedback (e.g. grading/ranking input provided by other users).
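A rough sketch of the idea of estimating submission quality from other users' grades; the grade scale, the minimum-feedback threshold, and the plain-mean aggregation are assumptions, not the actual implementation:

```python
from statistics import mean

def submission_quality(grades):
    """Estimate quality of a user-submitted prompt/demonstration from peer grades.

    `grades` are scores in [0, 1] given by other users (e.g. mapped from a
    grading widget or derived from ranking positions). Returns None until
    enough feedback has accumulated to be meaningful.
    """
    MIN_GRADES = 3  # arbitrary threshold before we trust the estimate
    if len(grades) < MIN_GRADES:
        return None
    return mean(grades)

print(submission_quality([0.8, 1.0, 0.6]))  # -> 0.8
```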
We want to use a mix of distributed community and centralized administrator moderation. Centralized admin moderation (disabling users and deleting their data) is expensive and not easily scalable.
Challenging user behavior may include:
...