
Switch to filtered prosocial dataset #3162

Merged
merged 1 commit into LAION-AI:main on May 20, 2023

Conversation

@TCLProject
Contributor

This PR modifies ProsocialDialogue to use a filtered version of the dataset I have created, with less irrelevant data and fewer rejections.

In this modified dataset I have filtered out the lines where the safety label is "casual" or "possibly/probably needs caution", which I have found to be mostly irrelevant, as well as some lines where the response is phrased as a refusal to act on a request, which might hurt the model's performance.
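For illustration, a minimal sketch of this kind of filtering, assuming the public allenai/prosocial-dialog schema (`safety_label` and `response` fields, JSON-lines files); the refusal phrases below are placeholders, not the exact terms that were filtered:

```python
# Sketch only: assumes JSON-lines input and the allenai/prosocial-dialog
# schema. The refusal markers are illustrative placeholders.
import json

DROPPED_LABELS = {
    "__casual__",
    "__possibly_needs_caution__",
    "__probably_needs_caution__",
}

# Placeholder refusal phrasings; the actual filter used other terms.
REFUSAL_MARKERS = ("I can't help with", "I'm sorry, but")

def keep(row: dict) -> bool:
    # Drop rows carrying the low-signal safety labels...
    if row.get("safety_label") in DROPPED_LABELS:
        return False
    # ...and rows whose response reads as a refusal.
    response = row.get("response", "")
    return not any(marker in response for marker in REFUSAL_MARKERS)

with open("train.json") as src, open("train_filtered.json", "w") as dst:
    for line in src:
        if keep(json.loads(line)):
            dst.write(line)
```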

This is an alternative solution that may work instead of removing the dataset completely, as mentioned in #3144.

@olliestanley
Collaborator

Relevant: #3144

@olliestanley changed the title from "Switch to filtered dataset" to "Switch to filtered prosocial dataset" on May 14, 2023
Modified ProsocialDialogue training to use the filtered version of the dataset with less irrelevant data and fewer rejections.
@andreaskoepf
Collaborator

Thanks a lot for working on prosocial. We received some negative comments about SFT-8 (not deployed yet), which used 15% of prosocial-dialog and an unfiltered version of gpt4all. The discussion of the dataset mixture for SFT-9 is still very much at the beginning.

@andreaskoepf
Collaborator

andreaskoepf commented May 15, 2023

(@TCLProject if you want to help us determine the OA SFT-9 dataset mix, please contact Ollie or me via DM on Discord. almostEvil___ is coordinating the SFT-9 project.)

@0x22almostEvil
Contributor

Thanks!

@andreaskoepf
Collaborator

Nice, thanks for filtering!

@olliestanley
Collaborator

One point of confusion from me: the new filtered dataset is 221 MB and 3013 pages of rows on the HF viewer. The original is much smaller, only 117 MB and 1203 pages. Could there be a duplication issue? It would also be really good if you could include your filtering code in a directory under data/datasets/ in this repo, if possible.

@andreaskoepf merged commit d39976a into LAION-AI:main on May 20, 2023
1 check passed
@TCLProject
Contributor Author

TCLProject commented Jun 8, 2023

I apologize for the late reply.
As for the code: the filtering was done on the command line, and I have since cleared my bash history, but it used a couple of grep commands I could probably get close to replicating (down to the exact words that were filtered).

As for the duplicates: this is a point of confusion for me too, believe it or not. I appear to have misunderstood how HF datasets work. The train.json in the original dataset is roughly 85 MB, while the train.json in the filtered dataset I uploaded (which I thought was the only file that mattered) is roughly 40 MB. HF seems to have combined it with all the other JSON files, which is not the intended behavior, and I do not understand why it acted that way (I do apologize). The other JSON files are there to preserve the dataset at different stages of filtering (e.g. with "possibly" but no "casual", with "probably" but no word filtering, etc.). I did not intend for all of the files to be merged together and would appreciate a pointer on why that happened and how I can prevent it from happening.

@olliestanley
Collaborator

If you upload multiple JSON files, loading the HF dataset will by default combine them all, unless a specific file is given as an argument to the load_dataset() call. I made a change in a follow-up PR to add this argument, so it's no longer an issue.
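For illustration, a minimal sketch of that behavior with the datasets library; the repository id below is a placeholder, not the real one:

```python
# Sketch of the load_dataset behavior described above; the repo id
# "user/filtered-prosocial-dialog" is a hypothetical placeholder.
from datasets import load_dataset

# Without data_files, every JSON file in the dataset repo is picked up
# and combined, which is what inflated the row count here:
merged = load_dataset("user/filtered-prosocial-dialog")

# Pinning data_files restricts loading to the intended file:
train_only = load_dataset(
    "user/filtered-prosocial-dialog",
    data_files={"train": "train.json"},
)
```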

@TCLProject
Contributor Author

Good to know, thank you!
