-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Message tree state machine #555
Conversation
…_LABELING setting)
…tree_state_machine2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks awesome!
one question I had was if by nature of how we grow trees and when we stop, can it be that there are many children that have no siblings? in that case, those would never be ranked at all. Not something we have to fix now, just to keep in mind.
Also, very dangerous stuff around raw sql statements, makes things more interesting ;) let's just keep this in mind and be super weary of any property changes to the models. Eventually, we might want to re-write this to sqlmodel/sqlalchemy
num_required_rankings: int = 3 | ||
"""Number of rankings in which the message participated.""" | ||
|
||
mandatory_labels_initial_prompt: Optional[list[protocol_schema.TextLabel]] = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we shouldn't ask "helpful" as mandatory, at least not for initial prompt and prompter reply, because it doesn't really apply to their situation. Maybe we also need to re-think the helpful label, maybe calling it "appropriate" or so, meaning that a prompt is really a prompt, an assistant response is of (helpful) assistance, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking a bit more, I think "appropriate" as I formulate it here would just be the opposite of "spam". the question is, if we know something is "not spam", in what way is that not enough, to the point where we'd need to collect another quality-gate label?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I removet helpful
from all and also simplified the
def _calculate_acceptance(self, labels: list[TextLabels]):
# calculate acceptance based on spam label
return np.mean([1 - l.labels[protocol_schema.TextLabel.spam] for l in labels])
task_type = TaskType(np.random.choice(a=len(task_weights), p=task_weights)) | ||
|
||
logger.debug(f"Selected task type: {task_type}") | ||
return TaskType(task_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't task_type
already a TaskType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think added the cast to the np.random.choice and forgot to remove it there.
|
||
num_missing_replies = sum(x.remaining_messages for x in active_tree_sizes) | ||
|
||
task_tpye = self._task_selection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tpye is my favorite typo ;-)
The trees are constructed randomly. Especially the deeper nodes are at risk of being alone. It currently depends much on the following configuration settings:
Parents have a higher chance of being selected: Mainly because they are longer in the tree and participate in more selection processes. New children first have to go through the review-phase before they themselves can become parents.
I see the problem with schema-changes but problems likely surface quickly since most of the queries are run for the first few interactions with the system. Maybe moving the queries into stored-procedures would also be a good option (not sure how the larger queries would look as sql-alchemy code). |
Approve regarding the broken web e2e tests. We'll fix that in a separate PR. |
Message trees are created with an initial prompt. After checking the initial prompt quality via peer-review a tree enters the growing phase in which concurrently prompter/assistant replies and labeling tasks are handed out. When the desired number of messages surpassing the minimum acceptable quality has been collected the tree goes into ranking phase. In ranking phase replies of messages with more than one reply are presented to humans to be ordered from best to worst. When enough ranking results have been collected for all messages the tree enters the ready for scoring phase in which the scoring algorithm combines all user feedback and computes the ranking-scores of all children in the message tree. After the scoring algorithm ran the message tree is completed and enters the ready for export state.
Beside the ready for export additional terminal error-states exist: scoring_failed, aborted_low_grade, halted_by_moderator .. in terminal states no further tasks are handed out to users.