Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix prefix formatting #885

Merged
merged 5 commits into from Jan 23, 2023
Merged

Fix prefix formatting #885

merged 5 commits into from Jan 23, 2023

Conversation

theblackcat102
Copy link
Collaborator

Move question and answer token to formatting.py and added in dataset phase instead of collator. This fix soda and prosocial dialogue which previously added the prefix as follow:

<human><prefix>Your prefix</prefix>prompt question<bot>answer text <eos token>

dataset with prefix (soda, prosocial_dialogue) should now be as follow :

<prefix>Your prefix</prefix><human>prompt question<bot>answer text <eos token>



def format_pair(pair):
return "{} {} {}".format(QA_SPECIAL_TOKENS["Question"], pair[0], QA_SPECIAL_TOKENS["Answer"]), pair[1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make this into
-+ return "{}{}{}".format(QA_SPECIAL_TOKENS["Question"], pair[0], QA_SPECIAL_TOKENS["Answer"]), pair[1]

Not sure how all the tokenizer handle spaces for the models we have

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@sanagno sanagno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProsocialDialogue and SODA should also have the format_parirs

@theblackcat102
Copy link
Collaborator Author

@sanagno I added the prefix token in the preprocess phase, so there's no reason to add during get_item

@sanagno
Copy link
Collaborator

sanagno commented Jan 23, 2023

Thanks, looks great

@sanagno sanagno merged commit 0cfc6a3 into main Jan 23, 2023
@sanagno sanagno deleted the sft-formatting branch January 23, 2023 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants