
Curate SFT-9 dataset mixes #3144

Open

olliestanley opened this issue May 13, 2023 · 10 comments

@olliestanley
Collaborator

olliestanley commented May 13, 2023

Iterate on the SFT-8 dataset mixes to create pretraining and final SFT mixes for SFT-9. This requires investigating the quality and usefulness of the datasets. Community input is welcome below. See the sft8_training branch for the code state corresponding to the SFT-8 configs below.

SFT-8 pretraining mix
  datasets:
    - gpteacher_roleplay:
        val_split: 0.05
    - red_pajama:
        fraction: 0.25
        max_val_set: 1000
    - wizardlm_70k:
        val_split: 0.05
        max_val_set: 500
    - joke:
        val_split: 0.05
    - poem_instructions:
        val_split: 0.025
    - oa_stackexchange:
        val_split: 0.05
        fraction: 0.1
        max_val_set: 1000
    - tell_a_joke:
        val_split: 0.05
        max_val_set: 250
    - webgpt:
        val_split: 0.05
        max_val_set: 250
    - gpt4all:
        val_split: 0.01
        max_val_set: 1000
    - alpaca_gpt4:
        val_split: 0.025
        max_val_set: 250
    - code_alpaca:
        val_split: 0.05
        max_val_set: 250
    - vicuna:
        max_val_set: 250
    - oig_file:
        source_url: https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl
        max_count: 10000
        min_length: 250
        val_split: 0.05
        max_val_set: 250
    - minimath:
        val_split: 0.05
    - humaneval_mbpp_codegen_qa:
        val_split: 0.05
    - humaneval_mbpp_testgen_qa:
        val_split: 0.05
    - grade_school_math_instructions:
        val_split: 0.05
    - recipes:
        val_split: 0.05
    - cmu_wiki_qa:
        val_split: 0.05
    - oa_wiki_qa_bart_10000row:
        val_split: 0.05
        max_val_set: 250
    - prosocial_dialogue:
        fraction: 0.1
        max_val_set: 250
    - explain_prosocial:
        fraction: 0.075
        max_val_set: 250
    - soda:
        fraction: 0.25
        max_val_set: 1000
    - oa_leet10k:
        val_split: 0.05
        max_val_set: 250
    - dolly15k:
        val_split: 0.05
        max_val_set: 300
SFT-8 final SFT mix
  datasets:
    - oasst_export:
        lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
        input_file_path: 2023-05-06_OASST_labels.jsonl.gz
        val_split: 0.05
    - vicuna:
        val_split: 0.05
        max_val_set: 800
        fraction: 0.4
    - dolly15k:
        val_split: 0.05
        max_val_set: 300
    - grade_school_math_instructions:
        val_split: 0.05
    - code_alpaca:
        val_split: 0.05
        max_val_set: 250
    - red_pajama:
        fraction: 0.05
        max_val_set: 1000
    - wizardlm_70k:
        val_split: 0.05
        max_val_set: 500
        fraction: 0.4
    - poem_instructions:
        fraction: 0.5
        val_split: 0.025

Leading on this: @0x22almostEvil

Some initial requests from the community include removal, reduction, or filtering of the prosocial_dialogue and explain_prosocial datasets in the pretraining mix.
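For illustration only, a reduced prosocial configuration could simply lower the sampled fractions relative to the SFT-8 pretraining mix above. This is a hypothetical sketch, not an agreed change, and the fraction values are placeholders:

  datasets:
    - prosocial_dialogue:
        fraction: 0.05     # placeholder, reduced from 0.1 in the SFT-8 mix above
        max_val_set: 250
    - explain_prosocial:
        fraction: 0.025    # placeholder, reduced from 0.075 in the SFT-8 mix above
        max_val_set: 250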

@0x22almostEvil
Contributor

I'm here!

@r7l

r7l commented May 14, 2023

Would it be possible to use the Starcoder dataset?

@olliestanley
Collaborator Author

Would it be possible to use the Starcoder dataset?

Currently we include some RedPajama data with a language modelling objective during SFT to try to prevent catastrophic forgetting of pretraining knowledge. Maybe it would be possible to do something similar with StarCoder data. But I don't think we could train on the whole dataset; that would be hugely expensive and more in the realm of foundation model pretraining than assistant finetuning.
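For illustration, if a StarCoder loader were added to the codebase, a small slice could be mixed in the same way as red_pajama. This is a hypothetical sketch only: the dataset name starcoder and the numbers below are placeholders, not an existing config:

  datasets:
    - starcoder:             # hypothetical loader name, not currently in the codebase
        fraction: 0.05       # placeholder: keep the code slice small, as with red_pajama
        max_val_set: 1000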

@r7l

r7l commented May 14, 2023

Understandable. I'd assume that a large portion of current OA users are coders, so it might be reasonable for the model to have a good understanding of coding from the start. Coding is already pretty decent in SFT-7, but there will always be room for improvement.

@0x22almostEvil
Contributor

0x22almostEvil commented May 14, 2023

Yeah, we might collect some good datasets for coding as well.

I'm currently working on collecting reasoning, logic, and semantics datasets, as I've noticed some problems in this area.

@marcelklehr

Is it possible to train a model without any legally questionable datasets, such as code_alpaca and gpt4all, which AFAIK were generated with the OpenAI API, whose terms don't allow training models on its output? A fully open-source model like this would be very helpful.

andreaskoepf pushed a commit that referenced this issue May 20, 2023
This PR modifies ProsocialDialogue to use the filtered version of the
dataset I have created, with less irrelevant data and fewer rejections.

In this modified dataset I have filtered out the lines where the safety
label is "casual" or "possibly/probably needs caution", which I found to
be mostly irrelevant, as well as some lines where the phrasing of the
response might hurt the model's performance by refusing to act on a
request.

This is an alternative solution that may work instead of removing the
dataset completely, as mentioned in #3144
@olliestanley
Collaborator Author

Given the results of the Guanaco paper, I think it is clear SFT-9 should use a much smaller set of finetuning data and focus on high quality. I suggest we try a run dropping synthetic datasets, perhaps with exceptions for those synthetic datasets which are clearly high quality.

@andreaskoepf
Collaborator

andreaskoepf commented May 25, 2023

While we definitely should use QLoRA (a groundbreaking result for the whole ML community) and try only a super-high-quality final fine-tuning run (i.e. OA top-1 threads, as was done for Guanaco), I think the overall situation is not that simple.

We already followed a 2-stage training approach. Guanaco of course goes a step further and trained only on the highest-quality OA data. When we decided to use the full OA set (I implemented a top-k thread filter which was not used), the idea was to create a diverse SFT output as input to the RL stage. We were also probably a bit afraid of overfitting on too small a dataset (and we saw that naive dropout for the larger LLaMA models didn't work as well as for Pythia; LIMA showed a better approach). And since QLoRA allows much faster iterations, they could try a lot of different configurations in a short amount of time (a rapid feedback loop is extremely beneficial if you have the right eval metrics).

In the fine print of his Twitter mega-thread, Tim Dettmers writes:

  • "Our main finding here: (1) instruction tuning datasets are good for instruction following but bad for chatbot performance; (2) you can create a 99.3% of ChatGPT performance level chatbot with QLoRA in just 24 hours of fine-tuning!" (tweet) -> i.e instruction following and "fluffy" chat are two different things
  • "its really bad at math" (tweet)

What we clearly see is that the style of the model output can already be greatly modified with 1k (LIMA) or 10k (QLoRA) examples. Whether additional "pre-training" is beneficial for capabilities was IMO not analyzed. We observed that pre-training clearly has an influence (e.g. negative with prosocial and positive with grade-school math).
Also, we know that our SFT7e3 model, although it fails most of the time to generate rhyming poems, is our best model for following instructions and handling plugin requests. The larger LLaMA models were pre-trained on 1.4T tokens... the question is of course whether adding further datasets like synthetic instructions improves the desired model behavior or whether they have detrimental effects. For the pro-social "safety" datasets we concluded that their effect is overall negative and that they should be removed from future runs, but for others it is less clear and needs further analysis.

I see two obvious solutions/approaches for chat vs. plugins:

  • use something like a "mode" field in the system prompt to specify whether we want instruction-following or fluffy-talk mode
  • use multiple specialized models, e.g. one for chat and another one for instruction following

@olliestanley
Collaborator Author

I agree that there is a clear distinction between datasets useful for chat-tuning vs instruction-following, but I have a few points here.

We already followed a 2-stage training approach. Guanaco of course goes a step further and trained only on the highest quality OA-data. When we decided to use the full OA set (I implemented a top-k thread filter which was not used) the idea was to create a diverse SFT output as input to the RL stage.

This makes sense, but it seems to me that even if we continue the 2-stage approach, we can most likely get sufficiently diverse outputs for RL from a highly filtered OA set.

In the fine-print of his twitter mega-thread Tim Dettmers writes:

  • "Our main finding here: (1) instruction tuning datasets are good for instruction following but bad for chatbot performance; (2) you can create a 99.3% of ChatGPT performance level chatbot with QLoRA in just 24 hours of fine-tuning!" (tweet) -> i.e instruction following and "fluffy" chat are two different things
  • "its really bad at math" (tweet)

Yes, my suggestion would be to retain these high-quality instruction-following datasets (e.g. math instructions, poetry instructions, coding instructions, and I believe Dragan is building a plugin-instruction dataset) but remove the synthetic chat datasets. It seems like we do not need the Alpaca, Vicuna, WizardLM, prosocial, roleplay, etc. datasets, which are chat-focused and likely to be lower quality than filtered OA and Dolly data, perhaps with the exception of alpaca_gpt4?

There are some datasets I haven't looked at, so I am less sure about those (OIG file, soda, webgpt, recipes, wiki QA datasets).
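For illustration, a slimmed-down SFT-9 mix along these lines might look roughly like the following. This is a hypothetical sketch only, not an agreed config; it reuses entries from the SFT-8 configs above, the values are placeholders, and the comment marks where the (currently unused) top-k thread filter could apply:

  datasets:
    - oasst_export:
        lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
        input_file_path: 2023-05-06_OASST_labels.jsonl.gz
        val_split: 0.05
        # hypothetical: restrict to the highest-ranked threads via the unused top-k filter
    - dolly15k:
        val_split: 0.05
        max_val_set: 300
    - grade_school_math_instructions:
        val_split: 0.05
    - code_alpaca:
        val_split: 0.05
        max_val_set: 250
    - poem_instructions:
        fraction: 0.5
        val_split: 0.025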

I see two obvious solutions/approaches for chat vs. plugins:

  • use something like a "mode" field in the system prompt to specify whether we want instruction-following or fluffy-talk mode
  • use multiple specialized models, e.g. one for chat and another one for instruction following

I personally prefer the idea of having a single model which can do both, as it aligns much better with the OA vision of running on consumer hardware. So the system prompt idea seems like a good starting point, imo.

@draganjovanovich
Collaborator

I am for the single mode/model approach. Completely removing instruction datasets seems a bit too much, but I am all in for keeping only top-quality samples/datasets.
