Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Message tree state machine #555

Merged
merged 29 commits into from
Jan 11, 2023
Merged

Conversation

andreaskoepf
Copy link
Collaborator

Message trees are created with an initial prompt. After checking the initial prompt quality via peer-review a tree enters the growing phase in which concurrently prompter/assistant replies and labeling tasks are handed out. When the desired number of messages surpassing the minimum acceptable quality has been collected the tree goes into ranking phase. In ranking phase replies of messages with more than one reply are presented to humans to be ordered from best to worst. When enough ranking results have been collected for all messages the tree enters the ready for scoring phase in which the scoring algorithm combines all user feedback and computes the ranking-scores of all children in the message tree. After the scoring algorithm ran the message tree is completed and enters the ready for export state.

Beside the ready for export additional terminal error-states exist: scoring_failed, aborted_low_grade, halted_by_moderator .. in terminal states no further tasks are handed out to users.

@andreaskoepf andreaskoepf marked this pull request as ready for review January 11, 2023 07:08
@andreaskoepf andreaskoepf requested a review from yk as a code owner January 11, 2023 07:08
Copy link
Collaborator

@yk yk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks awesome!

one question I had was if by nature of how we grow trees and when we stop, can it be that there are many children that have no siblings? in that case, those would never be ranked at all. Not something we have to fix now, just to keep in mind.

Also, very dangerous stuff around raw sql statements, makes things more interesting ;) let's just keep this in mind and be super weary of any property changes to the models. Eventually, we might want to re-write this to sqlmodel/sqlalchemy

num_required_rankings: int = 3
"""Number of rankings in which the message participated."""

mandatory_labels_initial_prompt: Optional[list[protocol_schema.TextLabel]] = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we shouldn't ask "helpful" as mandatory, at least not for initial prompt and prompter reply, because it doesn't really apply to their situation. Maybe we also need to re-think the helpful label, maybe calling it "appropriate" or so, meaning that a prompt is really a prompt, an assistant response is of (helpful) assistance, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking a bit more, I think "appropriate" as I formulate it here would just be the opposite of "spam". the question is, if we know something is "not spam", in what way is that not enough, to the point where we'd need to collect another quality-gate label?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I removet helpful from all and also simplified the

def _calculate_acceptance(self, labels: list[TextLabels]):
        # calculate acceptance based on spam label
        return np.mean([1 - l.labels[protocol_schema.TextLabel.spam] for l in labels])

task_type = TaskType(np.random.choice(a=len(task_weights), p=task_weights))

logger.debug(f"Selected task type: {task_type}")
return TaskType(task_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't task_type already a TaskType?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think added the cast to the np.random.choice and forgot to remove it there.


num_missing_replies = sum(x.remaining_messages for x in active_tree_sizes)

task_tpye = self._task_selection(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tpye is my favorite typo ;-)

@andreaskoepf
Copy link
Collaborator Author

one question I had was if by nature of how we grow trees and when we stop, can it be that there are many children that have no siblings? in that case, those would never be ranked at all. Not something we have to fix now, just to keep in mind.

The trees are constructed randomly. Especially the deeper nodes are at risk of being alone. It currently depends much on the following configuration settings:

    max_tree_depth: int = 6
    """Maximum depth of message tree."""

    max_children_count: int = 5
    """Maximum number of reply messages per tree node."""

    goal_tree_size: int = 15
    """Total number of messages to gather per tree"""

Parents have a higher chance of being selected: Mainly because they are longer in the tree and participate in more selection processes. New children first have to go through the review-phase before they themselves can become parents.

Also, very dangerous stuff around raw sql statements, makes things more interesting ;) let's just keep this in mind and be super weary of any property changes to the models. Eventually, we might want to re-write this to sqlmodel/sqlalchemy

I see the problem with schema-changes but problems likely surface quickly since most of the queries are run for the first few interactions with the system. Maybe moving the queries into stored-procedures would also be a good option (not sure how the larger queries would look as sql-alchemy code).

@fozziethebeat
Copy link
Collaborator

Approve regarding the broken web e2e tests. We'll fix that in a separate PR.

@andreaskoepf andreaskoepf merged commit 14fa08e into main Jan 11, 2023
@andreaskoepf andreaskoepf deleted the 531_inital_message_tree_state_machine2 branch January 11, 2023 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants