Curate SFT-9 dataset mixes #3144
I'm here!
Would it be possible to use the Starcoder dataset?
Currently we include some RedPajama data with a language-modelling objective during SFT to try to prevent catastrophic forgetting of pretraining knowledge. Maybe it would be possible to do something similar with StarCoder data. But I don't think we could train on the whole dataset; that would be hugely expensive and more in the realm of foundation-model pretraining than assistant finetuning.
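The mixing described above could be sketched roughly like this: interleave pretraining-style (LM) examples into the SFT stream at a target ratio. This is a minimal illustration, not the actual trainer code; the function and field names are hypothetical.

```python
import random

def mix_streams(sft_examples, lm_examples, lm_fraction=0.15, seed=0):
    """Interleave LM (pretraining-style) examples into an SFT stream so that
    roughly `lm_fraction` of the emitted examples come from the LM set.
    Hypothetical sketch; real trainers usually do this at the dataloader level."""
    rng = random.Random(seed)
    lm_iter = iter(lm_examples)
    for ex in sft_examples:
        yield ("sft", ex)
        # Emit an LM example with odds that make its overall share ~= lm_fraction.
        if rng.random() < lm_fraction / (1.0 - lm_fraction):
            lm_ex = next(lm_iter, None)
            if lm_ex is not None:
                yield ("lm", lm_ex)

mixed = list(mix_streams(range(1000), range(1000), lm_fraction=0.2, seed=42))
```

The point is only that the LM data rides along in the same batches, keeping gradient pressure on general language modelling while the SFT objective dominates.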
Understandable. I'd assume that a large portion of current OA users are coders, so it might be reasonable for the model to have a good understanding of coding from the start. It's already pretty decent in SFT-7, but there will always be room for improvement.
Yeah, we might collect some good coding datasets as well. I'm currently working on collecting reasoning, logic, and semantics datasets, as I've noticed some problems in this area.
Would it be possible to train a model without any legally questionable datasets, such as code_alpaca and gpt4all, which AFAIK were generated with the OpenAI API, whose terms don't allow training models on its output? A fully open-source model like this would be very helpful.
This PR modifies ProsocialDialogue to use a filtered version of the dataset I have created, with less irrelevant data and fewer refusals. In this modified dataset I have filtered out the mostly irrelevant lines where the safety label is "casual" or "possibly/probably needs caution", which I have found to be largely pointless, as well as some lines where the response's phrasing might hurt the model's performance by refusing to act on a request. This is an alternative that may work instead of removing the dataset completely, as mentioned in #3144.
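A filter like the one described could look roughly like this. The label strings follow the double-underscore convention used by ProsocialDialog exports, and the refusal-marker list is a hypothetical placeholder; both would need to be checked against the actual dataset.

```python
# Labels assumed to follow the ProsocialDialog convention; verify against
# the actual dataset before use.
DROP_LABELS = {"__casual__", "__possibly_needs_caution__", "__probably_needs_caution__"}
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")  # hypothetical phrase list

def keep_example(row):
    """Keep rows whose safety label is strong and whose response does not
    read as a flat refusal."""
    if row.get("safety_label") in DROP_LABELS:
        return False
    response = row.get("response", "").lower()
    return not any(marker in response for marker in REFUSAL_MARKERS)

rows = [
    {"safety_label": "__casual__", "response": "Sure, here you go."},
    {"safety_label": "__needs_caution__", "response": "Please be careful with that."},
    {"safety_label": "__needs_caution__", "response": "I can't help with that."},
]
filtered = [r for r in rows if keep_example(r)]  # keeps only the second row
```

Substring matching on refusal phrases is crude; in practice one might want a classifier or manual review pass, but the label-based cut alone already removes the bulk of the low-signal rows.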
Given the results of the Guanaco paper, I think it is clear that SFT-9 should use a much smaller finetuning set and focus on high quality. I suggest we try a run that drops the synthetic datasets, with exceptions perhaps for those synthetic datasets which are clearly high quality.
While we should definitely use QLoRA (a groundbreaking result for the whole ML community) and try a super-high-quality final fine-tuning run (e.g. OA top-1 threads, as was done for Guanaco), I think the overall situation is not that simple. We already followed a 2-stage training approach. Guanaco of course goes a step further and trains only on the highest-quality OA data. When we decided to use the full OA set (I implemented a top-k thread filter which was not used), the idea was to create a diverse SFT output as input to the RL stage. We were also probably a bit afraid of overfitting with too small a dataset (and we saw that naive dropout didn't work as well for the larger LLaMA models as it did for Pythia; LIMA showed a better approach). And since QLoRA allows much faster iterations, they could try a lot of different configurations in a short amount of time (a rapid feedback loop is extremely beneficial if you have the right eval metrics). In the fine print of his Twitter mega-thread Tim Dettmers writes:
What we clearly see is that the style of the model output can already be greatly modified with 1k (LIMA) or 10k (QLoRA) examples. Whether additional "pre-training" is beneficial for capabilities was IMO not analyzed. We observed that pre-training clearly has an influence (e.g. negative with prosocial and positive with grade-school-math). I see two obvious solutions/approaches for chat vs. plugins:
I agree that there is a clear distinction between datasets useful for chat-tuning vs instruction-following, but have a few points here.
This makes sense, but it seems to me that even if we continue the 2-stage approach we can most likely get sufficiently diverse outputs for RL even with a highly filtered OA set.
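The top-k thread filter mentioned earlier could look roughly like this: walk an OA-style message tree and keep only the k best-ranked replies at each branch, yielding root-to-leaf threads. The `text`/`replies`/`rank` field names are assumptions about the export format, not the actual schema.

```python
def top_k_threads(root, k=1):
    """Extract root-to-leaf conversation threads from a message tree,
    keeping only the k best-ranked replies at each branch (lower rank = better).
    Field names are assumptions about the OA export format."""
    threads = []

    def walk(node, path):
        path = path + [node["text"]]
        replies = sorted(node.get("replies", []), key=lambda r: r.get("rank", 0))[:k]
        if not replies:
            threads.append(path)  # reached a leaf: one complete thread
        for reply in replies:
            walk(reply, path)

    walk(root, [])
    return threads

tree = {
    "text": "How do I sort a list in Python?",
    "replies": [
        {"text": "Use sorted(my_list).", "rank": 0},
        {"text": "Write your own bubble sort.", "rank": 1},
    ],
}
best = top_k_threads(tree, k=1)  # keeps only the rank-0 reply
```

With k=1 this collapses each tree to its single best thread (the Guanaco-style selection); larger k trades quality for diversity, which is the tension being discussed here.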
Yes, my suggestion would be to retain these high-quality instruction-following datasets (e.g. math instructions, poetry instructions, coding instructions, and I believe Dragan is building a plugin-instruction dataset) but remove the synthetic chat datasets. It seems like we do not need the Alpaca, Vicuna, WizardLM, prosocial, roleplay, etc. datasets, which are chat-focused and likely to be lower quality than filtered OA and Dolly data, perhaps with the exception of alpaca_gpt4. There are some datasets I haven't looked at, so I'm less sure about those (OIG file, soda, webgpt, recipes, wiki QA datasets).
I personally prefer the idea of having a single model which can do both; it aligns much better with the OA vision of running on consumer hardware. So the system-prompt idea seems like a good starting point, imo.
I am for the single-mode/model approach. And completely removing instruction datasets seems a bit too much. But I am all in for keeping only top-quality samples/datasets.
Iterate on the SFT-8 dataset mixes to create pretraining and final SFT mixes for SFT-9. This requires investigating the quality and usefulness of the datasets. Community input is welcome below. See the sft8_training branch for the code state corresponding to the SFT-8 configs below.

SFT-8 pretraining mix

SFT-8 final SFT mix
Leading on this: @0x22almostEvil
Some initial requests from the community include removal or reduction/filtering of the prosocial_dialogue and explain_prosocial datasets from pretraining.