Switch to filtered prosocial dataset #3162
Conversation
Relevant: #3144
Modified ProsocialDialogue training to use the filtered version of the dataset, with less irrelevant data and fewer rejections.
Thanks a lot for working on prosocial. We got some negative comments for SFT-8 (not yet deployed), which used 15% of prosocial-dialog and an unfiltered version of gpt4all. The discussion of the dataset mixture for SFT-9 is still very much at the beginning.
(@TCLProject, if you want to help us determine the OA SFT-9 dataset mix, please contact Ollie or me via DM on Discord; almostEvil___ is coordinating the SFT-9 project.)
Thanks!
Nice, thanks for filtering!
One point of confusion from me: the new filtered dataset is 221 MB and 3013 pages of rows in the HF viewer, while the original is much smaller, only 117 MB and 1203 pages. Could there be a duplication issue? It would also be really good if you could include your code for filtering in a directory in
I apologize for the late reply. As for the duplicates: this is a point of confusion for me too, believe it or not. I appear to have misunderstood how HF datasets work. The train.json in the original dataset is roughly 85 MB, and the train.json (which I thought was what mattered) in the filtered dataset I have uploaded is roughly 40 MB. HF seems to have combined it with all the other JSON files, which is not the intended behavior, and I do not understand why it acted that way (my apologies). The other JSON files are there to preserve the dataset at different points of filtering (e.g. with "possibly" but no "casual", with "probably" but no word filtering, etc.). I did not intend for all of the files to be merged together and would appreciate a pointer on why that happened and how I can prevent it.
If you upload multiple JSONs, loading the HF dataset will by default combine them all unless a specific one is given as an argument to the
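A minimal sketch of that behaviour, assuming the `data_files` argument of `datasets.load_dataset` is the argument referred to above (the repo id below is a placeholder, not the actual uploaded dataset):

```python
from datasets import load_dataset

# Placeholder repo id standing in for the uploaded filtered dataset.
REPO_ID = "user/prosocial-dialog-filtered"

# Default behaviour: every JSON file found in the repo is treated as a data
# file, so they can all end up concatenated into the loaded dataset.
ds_everything = load_dataset(REPO_ID)

# Restricting the load to a single file avoids the merge.
ds_train_only = load_dataset(REPO_ID, data_files="train.json")
print(ds_train_only)
```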
Good to know, thank you!
This PR modifies ProsocialDialogue to use the filtered version of the dataset I have created, with less irrelevant data and fewer rejections.
In this modified dataset I have filtered out the lines where the safety label is "casual" or "possibly/probably needs caution", which I have found to be mostly irrelevant, as well as some lines where the phrasing of the response might hurt the model's performance by refusing to act on a request.
This is an alternative solution that may work instead of removing the dataset completely, as mentioned in #3144
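For reference, a rough sketch of the kind of filtering described above, not the exact code used for this PR; the allenai/prosocial-dialog column names (`safety_label`, `response`), the label strings, and the refusal phrase list are assumptions:

```python
from datasets import load_dataset

# Safety labels treated as mostly irrelevant (exact label strings in
# allenai/prosocial-dialog are assumed here and may differ).
DROP_LABELS = {"__casual__", "__possibly_needs_caution__", "__probably_needs_caution__"}

# Hypothetical refusal-style phrasings to screen out of responses.
REFUSAL_PHRASES = ("i can't help with that", "i won't help", "i cannot assist")

def keep(example):
    # Drop rows with a low-signal safety label.
    if example["safety_label"] in DROP_LABELS:
        return False
    # Drop rows whose response reads as a flat refusal.
    response = example["response"].lower()
    return not any(phrase in response for phrase in REFUSAL_PHRASES)

ds = load_dataset("allenai/prosocial-dialog", split="train")
filtered = ds.filter(keep)
print(f"{len(ds)} rows -> {len(filtered)} rows after filtering")
```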