
Switch to filtered prosocial dataset #3162

Merged
merged 1 commit into LAION-AI:main on May 20, 2023

Conversation

@TCLProject
Contributor

This PR modifies ProsocialDialogue to use a filtered version of the dataset I have created, with less irrelevant data and fewer rejections.

In this modified dataset I have filtered out the lines where the safety label is "casual" or "possibly/probably needs caution", which I have found to be mostly irrelevant, as well as some lines where the response is phrased as a refusal to act on a request, which might hurt the model's performance.
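For illustration, a minimal sketch of this kind of filtering, assuming the public allenai/prosocial-dialog schema (`safety_label` and `response` fields, JSON-lines files); the refusal phrases below are placeholders, not the exact terms that were filtered:

```python
# Sketch only: assumes JSON-lines input and the allenai/prosocial-dialog
# schema. The refusal markers are illustrative placeholders.
import json

DROPPED_LABELS = {
    "__casual__",
    "__possibly_needs_caution__",
    "__probably_needs_caution__",
}

# Placeholder refusal phrasings; the actual filter used other terms.
REFUSAL_MARKERS = ("I can't help with", "I'm sorry, but")

def keep(row: dict) -> bool:
    # Drop rows carrying the low-signal safety labels...
    if row.get("safety_label") in DROPPED_LABELS:
        return False
    # ...and rows whose response reads as a refusal.
    response = row.get("response", "")
    return not any(marker in response for marker in REFUSAL_MARKERS)

with open("train.json") as src, open("train_filtered.json", "w") as dst:
    for line in src:
        if keep(json.loads(line)):
            dst.write(line)
```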

This is an alternative solution that may work instead of removing the dataset completely, as mentioned in #3144.

@olliestanley
Collaborator

Relevant: #3144

@olliestanley changed the title from "Switch to filtered dataset" to "Switch to filtered prosocial dataset" on May 14, 2023
Modified ProsocialDialogue training to use the filtered version of the dataset with less irrelevant data and fewer rejections.
@andreaskoepf
Collaborator

Thanks a lot for working on prosocial. We received some negative comments about SFT-8 (not deployed yet), which used 15% of prosocial-dialog and an unfiltered version of gpt4all. The discussion of the dataset mixture for SFT-9 is still very much at the beginning.

@andreaskoepf
Collaborator

andreaskoepf commented May 15, 2023

(@TCLProject if you want to help us determine the OA SFT-9 dataset mix, please contact Ollie or me via DM on Discord. almostEvil___ is coordinating the SFT-9 project.)

@0x22almostEvil
Contributor

Thanks!

@andreaskoepf
Collaborator

Nice, thanks for filtering!

@olliestanley
Collaborator

One point of confusion from me: the new filtered dataset is 221 MB and 3013 pages of rows on the HF viewer. The original is much smaller, only 117 MB and 1203 pages. Could there be a duplication issue? It would also be really good if you could include your filtering code in a directory under data/datasets/ in this repo, if possible.

@andreaskoepf merged commit d39976a into LAION-AI:main on May 20, 2023
1 check passed
@TCLProject
Contributor Author

TCLProject commented Jun 8, 2023

I apologize for the late reply.
As for the code: the filtering was done on the command line, and I have since cleared my bash history, but it used a couple of grep commands I could probably get close to replicating (down to the exact words that were filtered).

As for the duplicates: this is a point of confusion for me too, believe it or not. I appear to have misunderstood how HF datasets work. The train.json in the original dataset is roughly 85 MB, while the train.json in the filtered dataset I uploaded (which I thought was the only file that mattered) is roughly 40 MB. HF seems to have combined it with all the other JSON files, which is not the intended behavior, and I do not understand why it acted that way (I do apologize). The other JSON files are there to preserve the dataset at different stages of filtering (e.g. with "possibly" but no "casual", with "probably" but no word filtering, etc.). I did not intend for all of the files to be merged together and would appreciate a pointer on why that happened and how I can prevent it from happening.

@olliestanley
Collaborator

If you upload multiple JSON files, loading the HF dataset will by default combine them all, unless a specific file is given as an argument to the load_dataset() call. I made a change in a follow-up PR to add this argument, so it's no longer an issue.
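For illustration, a minimal sketch of that behavior with the datasets library; the repository id below is a placeholder, not the real one:

```python
# Sketch of the load_dataset behavior described above; the repo id
# "user/filtered-prosocial-dialog" is a hypothetical placeholder.
from datasets import load_dataset

# Without data_files, every JSON file in the dataset repo is picked up
# and combined, which is what inflated the row count here:
merged = load_dataset("user/filtered-prosocial-dialog")

# Pinning data_files restricts loading to the intended file:
train_only = load_dataset(
    "user/filtered-prosocial-dialog",
    data_files={"train": "train.json"},
)
```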

@TCLProject
Contributor Author

Good to know, thank you!
