Feature/dataset entry #2520
Conversation
…e/dialogue-data-collator-tests
Do I understand correctly that right now this system message format isn't used e.g. for the OA dataset? Just because I didn't spot where it gets hooked into the data pipeline.
But it looks great! Can be merged.
model/model_training/tests/resources/data_collator/tokenizer.json
if mode == Mode.sft:
    return qa_list
elif mode == Mode.rm:
    raise NotImplementedError("This is currently not implemented.")
Does this mean that RM training will not work after this is merged?
Yep, that is correct. Currently it is only implemented for Dolly, which does not support rm, so no change for now. But this needs to be implemented soon.
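For context, here is a self-contained sketch of the dispatch under discussion. `Mode`, `sft`/`rm`, `qa_list`, and the error message come from the diff; the function name and `Enum` boilerplate are assumptions added to make it runnable.

```python
from enum import Enum


class Mode(Enum):
    sft = "sft"
    rm = "rm"


def format_entry(mode: Mode, qa_list: list) -> list:
    # SFT returns the formatted question/answer list; the RM branch is
    # deliberately left unimplemented until reward-model formatting is added.
    if mode == Mode.sft:
        return qa_list
    elif mode == Mode.rm:
        raise NotImplementedError("This is currently not implemented.")
```

So any dataset that returns a `DatasetEntry` and is used in `rm` mode will fail loudly at this point rather than silently producing wrong formatting.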
]
if len(relevant_system_infos) > 0:
    shuffle(relevant_system_infos)
    system_tag_key_values = "\n".join([f"{k}: {v}" for k, v in relevant_system_infos])
Should the values be quantized here or are they already within a certain set? (I think real values could be difficult to understand for the model)
The keys here are one of context, lang, length, quality, humor and creativity. The values for quality, humor and creativity can be floating point numbers between 0 and 1 (this is checked by the validators, and a test for this is added as well, see here).
Yes, so if they can be anything between 0 and 1 (including e.g. 0.1826357216), then I think we should round them (e.g. to a single decimal) to make it easier for the model to understand.
done
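The resolved behavior can be sketched like this: float values are rounded before the key/value pairs are shuffled and joined into the system tag. The function and parameter names are hypothetical; the shuffle-and-join shape mirrors the diff above.

```python
from random import shuffle


def format_system_tag(relevant_system_infos, decimals=1):
    # Round float values (e.g. 0.1826357216 -> 0.2) so the model sees a
    # small, discrete set of values instead of arbitrary reals.
    infos = [
        (k, round(v, decimals) if isinstance(v, float) else v)
        for k, v in relevant_system_infos
    ]
    shuffle(infos)  # pair order is randomized, as in the diff
    return "\n".join(f"{k}: {v}" for k, v in infos)
```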
closes #2708
Add a pydantic BaseModel class (equivalent to a dataclass but with stronger guarantees) to return from the Dolly dataset. Add the formatting functionality in the dataset entry class.
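A rough sketch of that pattern, using a plain dataclass as a stand-in for the pydantic BaseModel. The token strings, field names and method name here are assumptions for illustration; only the idea (a dataset entry object that renders itself with the special tokens and the eos token) comes from the PR.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder values; the real ones live in QA_SPECIAL_TOKENS
# in the training code.
QA_SPECIAL_TOKENS = {"Question": "<|prompter|>", "Answer": "<|assistant|>"}


@dataclass
class DatasetEntry:
    questions: List[str]
    answers: List[str]

    def get_formatted(self, eos_token: str) -> List[str]:
        # Interleave questions and answers, prefixing each turn with its
        # special token and closing it with the eos token.
        out = []
        for q, a in zip(self.questions, self.answers):
            out.append(f"{QA_SPECIAL_TOKENS['Question']}{q}{eos_token}")
            out.append(f"{QA_SPECIAL_TOKENS['Answer']}{a}{eos_token}")
        return out


entry = DatasetEntry(questions=["What is 2+2?"], answers=["4"])
formatted = entry.get_formatted(eos_token="</s>")
```

Centralizing the formatting in one method like this is what lets the class "remove all the formatting errors we had previously": every dataset produces turns through the same code path.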
This PR does quite a bit:
- Adds a DatasetEntry class that formats entries with the QA_SPECIAL_TOKENS and the eos_token. This class should work as a general pattern to store and format single dataset entries. This class should also remove all the formatting errors we had previously with the datasets.
- Adds tests for the DialogueDataCollator. I trained a minimal tokenizer only on the tokens that are present in the tests to not bloat the code (still a lot of LOC).
- Makes the DialogueDataCollator work with the newly introduced DatasetEntry class.
- Lets the collator handle both DatasetEntry objects and old rows from the dataset.

todos: