Fix prefix formatting #885

theblackcat102 · 2023-01-22T14:00:55Z

Move the question and answer tokens to formatting.py and add them in the dataset phase instead of in the collator. This fixes SODA and ProsocialDialogue, which previously added the prefix as follows:

<human><prefix>Your prefix</prefix>prompt question<bot>answer text <eos token>

Datasets with a prefix (soda, prosocial_dialogue) should now be formatted as follows:

<prefix>Your prefix</prefix><human>prompt question<bot>answer text <eos token>
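
The ordering change can be illustrated with a minimal sketch. It assumes the question and answer tokens are the literal strings `<human>` and `<bot>` shown above; the helper name `format_with_prefix` and the `eos_token` default are hypothetical and are not taken from this PR.

```python
# Assumed token values, copied from the example strings in this description.
QA_SPECIAL_TOKENS = {"Question": "<human>", "Answer": "<bot>"}


def format_with_prefix(prefix, question, answer, eos_token="<eos token>"):
    """Hypothetical helper: place the prefix block before the question token."""
    prefix_block = "<prefix>{}</prefix>".format(prefix)
    return "{}{}{}{}{} {}".format(
        prefix_block,                   # prefix comes first (the fixed order)
        QA_SPECIAL_TOKENS["Question"],  # then the question token
        question,
        QA_SPECIAL_TOKENS["Answer"],    # then the answer token
        answer,
        eos_token,
    )


print(format_with_prefix("Your prefix", "prompt question", "answer text"))
```

Under these assumptions the printed string matches the target layout above, with the prefix block ahead of the question token rather than nested after it.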

sanagno · 2023-01-22T17:56:21Z

model/supervised_finetuning/custom_datasets/formatting.py

+
+
+def format_pair(pair):
+    return "{} {} {}".format(QA_SPECIAL_TOKENS["Question"], pair[0], QA_SPECIAL_TOKENS["Answer"]), pair[1]


Maybe we can make this into
+    return "{}{}{}".format(QA_SPECIAL_TOKENS["Question"], pair[0], QA_SPECIAL_TOKENS["Answer"]), pair[1]

Not sure how all the tokenizers handle spaces for the models we have.
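
To make the difference concrete, here is a small sketch comparing the two variants of format_pair discussed above. The token values `<human>` and `<bot>` are assumed from the PR description, and the names `format_pair_spaced` and `format_pair_compact` are hypothetical labels for this comparison only.

```python
# Assumed token values; the real mapping lives in formatting.py.
QA_SPECIAL_TOKENS = {"Question": "<human>", "Answer": "<bot>"}


def format_pair_spaced(pair):
    # Original version: spaces separate the special tokens from the text.
    return "{} {} {}".format(QA_SPECIAL_TOKENS["Question"], pair[0], QA_SPECIAL_TOKENS["Answer"]), pair[1]


def format_pair_compact(pair):
    # Suggested version: no spaces, so the tokenizer never sees extra whitespace.
    return "{}{}{}".format(QA_SPECIAL_TOKENS["Question"], pair[0], QA_SPECIAL_TOKENS["Answer"]), pair[1]


pair = ("prompt question", "answer text")
print(repr(format_pair_spaced(pair)[0]))
print(repr(format_pair_compact(pair)[0]))
```

The compact form avoids depending on how a given tokenizer treats whitespace next to special tokens, which may vary between models.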

sanagno

ProsocialDialogue and SODA should also have the format_pairs

theblackcat102 · 2023-01-23T02:46:22Z

@sanagno I added the prefix token in the preprocess phase, so there's no reason to add it during get_item

sanagno · 2023-01-23T08:51:41Z

Thanks, looks great

theblackcat102 added 4 commits January 21, 2023 03:31

[feature] move data formatting into dataset, instead of collator

62a203f

[feature] add pythia and limit translation pair

f5b2a34

Merge branch 'main' into sft-formatting

98bf148

[fix] prosocial dialogue format error

736f46f

theblackcat102 requested a review from sanagno as a code owner January 22, 2023 14:00

theblackcat102 added the ml label Jan 22, 2023

sanagno reviewed Jan 22, 2023


sanagno approved these changes Jan 22, 2023


[fix] remove spaces in format_pair

b8990d9

sanagno merged commit 0cfc6a3 into main Jan 23, 2023

sanagno deleted the sft-formatting branch January 23, 2023 08:51
