
Feature/dataset entry #2520

Merged
merged 26 commits into from Apr 20, 2023
Conversation

Collaborator

@CloseChoice CloseChoice commented Apr 14, 2023

closes #2708

Add a pydantic BaseModel class (equivalent to a dataclass, but with stronger guarantees) that is returned from the Dolly dataset, and add the formatting functionality in the dataset entry class.
This PR does quite a bit:

  • add the pydantic dependency
  • introduce a new DatasetEntry class, which provides a method to do the formatting based on the mode, the QA_SPECIAL_TOKENS, and the eos_token. This class should work as a general pattern to store and format single dataset entries, and should also remove all the formatting errors we previously had with the datasets.
  • add tests for DialogueDataCollator. I trained a minimal tokenizer only on the tokens that are present in the tests, so as not to bloat the code (still a lot of LOC).
  • add tests for the interplay of DialogueDataCollator and the newly introduced DatasetEntry class.
  • add small fixes so that handling of the new DatasetEntry stays backwards compatible with old rows from the dataset.

Todos:

  • mask system
  • remove changes in config
  • write tests for formatting of dataset entry
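The DatasetEntry pattern described above might be sketched roughly as follows. This is a minimal illustration with hypothetical field names and token values, not the PR's actual implementation; the real QA_SPECIAL_TOKENS and class live in the repo:

```python
from enum import Enum
from pydantic import BaseModel

# Hypothetical special tokens for illustration; the real values come from
# the project's QA_SPECIAL_TOKENS.
QA_SPECIAL_TOKENS = {"Question": "<|prompter|>", "Answer": "<|assistant|>"}


class Mode(str, Enum):
    sft = "sft"
    rm = "rm"


class DatasetEntry(BaseModel):
    questions: list[str]
    answers: list[str]

    def get_formatted(self, mode: Mode, eos_token: str) -> list[str]:
        # Interleave prompter/assistant turns, wrapping each turn with its
        # special token and terminating it with the eos_token.
        if mode == Mode.sft:
            qa_list: list[str] = []
            for q, a in zip(self.questions, self.answers):
                qa_list.append(f"{QA_SPECIAL_TOKENS['Question']}{q}{eos_token}")
                qa_list.append(f"{QA_SPECIAL_TOKENS['Answer']}{a}{eos_token}")
            return qa_list
        raise NotImplementedError("This is currently not implemented.")
```

Because the class is a pydantic model rather than a plain dataclass, field types are validated at construction time, which is the "stronger guarantees" part mentioned above.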

@CloseChoice CloseChoice marked this pull request as ready for review April 19, 2023 20:11
Collaborator

@dvruette dvruette left a comment


Do I understand correctly that this system message format isn't currently used, e.g. for the OA dataset? Just because I didn't spot where it gets hooked into the data pipeline.

But looks great! Can be merged.

if mode == Mode.sft:
    return qa_list
elif mode == Mode.rm:
    raise NotImplementedError("This is currently not implemented.")
Collaborator

Does this mean that RM training will not work after this is merged?

Collaborator Author

Yep, that is correct. Currently this is only implemented for Dolly, which does not support RM training, so nothing changes for now. But it needs to be implemented soon.

]
if len(relevant_system_infos) > 0:
    shuffle(relevant_system_infos)
system_tag_key_values = "\n".join([f"{k}: {v}" for k, v in relevant_system_infos])
Collaborator

Should the values be quantized here, or are they already drawn from a fixed set? (I think arbitrary real values could be difficult for the model to understand.)

Collaborator Author

The keys here are one of context, lang, length, quality, humor, and creativity. The values for quality, humor, and creativity can be floating point numbers between 0 and 1 (this is checked by the validators, and a test for this is added as well, see here).
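The range check mentioned here could be expressed with a pydantic validator along these lines. This is a sketch with assumed field names, not the PR's exact code:

```python
from typing import Optional

from pydantic import BaseModel, validator


class SystemProperties(BaseModel):
    # None means the property was not annotated for this entry.
    quality: Optional[float] = None
    humor: Optional[float] = None
    creativity: Optional[float] = None

    @validator("quality", "humor", "creativity")
    def between_zero_and_one(cls, v):
        # Reject any value outside the closed interval [0, 1].
        if v is not None and not (0 <= v <= 1):
            raise ValueError(f"value {v} not in [0, 1]")
        return v
```

Constructing `SystemProperties(humor=1.5)` then raises a validation error instead of silently producing an out-of-range system tag.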

Collaborator

Yes, so if they can be anything between 0 and 1 (including e.g. 0.1826357216), then I think we should round them (e.g. to a single decimal) to make it easier for the model to understand.
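The suggested rounding could be folded into the system-tag formatting roughly like this. A sketch under the assumption that the key/value pairs look like the snippet quoted above; the helper name is made up:

```python
from random import shuffle


def format_system_infos(relevant_system_infos: list) -> str:
    # Round float values to one decimal so the model only ever sees the
    # small set {0.0, 0.1, ..., 1.0} instead of arbitrary reals.
    rounded = [
        (k, round(v, 1) if isinstance(v, float) else v)
        for k, v in relevant_system_infos
    ]
    shuffle(rounded)  # randomize the order of system properties
    return "\n".join(f"{k}: {v}" for k, v in rounded)
```

For example, `0.1826357216` becomes `0.2` in the rendered system tag.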

Collaborator Author

Done.

model/model_training/custom_datasets/formatting.py (outdated, resolved)
@CloseChoice CloseChoice merged commit 42e0798 into LAION-AI:main Apr 20, 2023
1 check passed
Successfully merging this pull request may close these issues.

Fill <|system|> prompt during supervised fine-tuning with lang, len & text-labels
3 participants