Feature/dataset entry #2520
Conversation
…e/dialogue-data-collator-tests
Do I understand correctly that right now this system message format isn't used e.g. for the OA dataset? Just because I didn't spot where it gets hooked into the data pipeline.
But it looks great! Can be merged.
model/model_training/tests/resources/data_collator/tokenizer.json
if mode == Mode.sft:
    return qa_list
elif mode == Mode.rm:
    raise NotImplementedError("This is currently not implemented.")
Does this mean that RM training will not work after this is merged?
Yep, that is correct. Currently it is only implemented for Dolly, which does not support rm, so no change for now. But this needs to be implemented soon.
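For context, here is a self-contained sketch of the dispatch under discussion. `Mode`, `sft`/`rm`, `qa_list`, and the error message come from the diff; the function name and `Enum` boilerplate are assumptions added to make it runnable.

```python
from enum import Enum


class Mode(Enum):
    sft = "sft"
    rm = "rm"


def format_entry(mode: Mode, qa_list: list) -> list:
    # SFT returns the formatted question/answer list; the RM branch is
    # deliberately left unimplemented until reward-model formatting is added.
    if mode == Mode.sft:
        return qa_list
    elif mode == Mode.rm:
        raise NotImplementedError("This is currently not implemented.")
```

So any dataset that returns a `DatasetEntry` and is used in `rm` mode will fail loudly at this point rather than silently producing wrong formatting.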
]
if len(relevant_system_infos) > 0:
    shuffle(relevant_system_infos)
    system_tag_key_values = "\n".join([f"{k}: {v}" for k, v in relevant_system_infos])
Should the values be quantized here or are they already within a certain set? (I think real values could be difficult to understand for the model)
The keys here are one of context, lang, length, quality, humor and creativity. The values for quality, humor and creativity can be floating point numbers between 0 and 1 (this is checked by the validators, and a test for this is added as well, see here).
Yes, so if they can be anything between 0 and 1 (including e.g. 0.1826357216), then I think we should round them (e.g. to a single decimal) to make it easier for the model to understand.
done
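The resolved behavior can be sketched like this: float values are rounded before the key/value pairs are shuffled and joined into the system tag. The function and parameter names are hypothetical; the shuffle-and-join shape mirrors the diff above.

```python
from random import shuffle


def format_system_tag(relevant_system_infos, decimals=1):
    # Round float values (e.g. 0.1826357216 -> 0.2) so the model sees a
    # small, discrete set of values instead of arbitrary reals.
    infos = [
        (k, round(v, decimals) if isinstance(v, float) else v)
        for k, v in relevant_system_infos
    ]
    shuffle(infos)  # pair order is randomized, as in the diff
    return "\n".join(f"{k}: {v}" for k, v in infos)
```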
closes #2708
Add a pydantic BaseModel class (equivalent to a dataclass but with stronger guarantees) to return from the Dolly dataset. Add the formatting functionality in the dataset entry class.
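A rough sketch of that pattern, using a plain dataclass as a stand-in for the pydantic BaseModel. The token strings, field names and method name here are assumptions for illustration; only the idea (a dataset entry object that renders itself with the special tokens and the eos token) comes from the PR.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder values; the real ones live in QA_SPECIAL_TOKENS
# in the training code.
QA_SPECIAL_TOKENS = {"Question": "<|prompter|>", "Answer": "<|assistant|>"}


@dataclass
class DatasetEntry:
    questions: List[str]
    answers: List[str]

    def get_formatted(self, eos_token: str) -> List[str]:
        # Interleave questions and answers, prefixing each turn with its
        # special token and closing it with the eos token.
        out = []
        for q, a in zip(self.questions, self.answers):
            out.append(f"{QA_SPECIAL_TOKENS['Question']}{q}{eos_token}")
            out.append(f"{QA_SPECIAL_TOKENS['Answer']}{a}{eos_token}")
        return out


entry = DatasetEntry(questions=["What is 2+2?"], answers=["4"])
formatted = entry.get_formatted(eos_token="</s>")
```

Centralizing the formatting in one method like this is what lets the class "remove all the formatting errors we had previously": every dataset produces turns through the same code path.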
This PR does quite a bit:
- Adds a DatasetEntry class that formats entries with the QA_SPECIAL_TOKENS and the eos_token. This class should work as a general pattern to store and format single dataset entries. This class should also remove all the formatting errors we had previously with the datasets.
- Adds tests for the DialogueDataCollator. I trained a minimal tokenizer only on the tokens that are present in the tests to not bloat the code (still a lot of LOC).
- Makes the DialogueDataCollator work with the newly introduced DatasetEntry class.
- Lets the collator handle both DatasetEntry objects and old rows from the dataset.

todos: