
Curate SFT-9 dataset mixes #3144

Open

olliestanley opened this issue May 13, 2023 · 10 comments

@olliestanley
Collaborator

olliestanley commented May 13, 2023

Iterate on the SFT-8 dataset mixes to create pretraining and final SFT mixes for SFT-9. This requires investigating the quality and usefulness of the datasets. Community input is welcome below. See the sft8_training branch for the code state corresponding to the SFT-8 configs below.

SFT-8 pretraining mix
  datasets:
    - gpteacher_roleplay:
        val_split: 0.05
    - red_pajama:
        fraction: 0.25
        max_val_set: 1000
    - wizardlm_70k:
        val_split: 0.05
        max_val_set: 500
    - joke:
        val_split: 0.05
    - poem_instructions:
        val_split: 0.025
    - oa_stackexchange:
        val_split: 0.05
        fraction: 0.1
        max_val_set: 1000
    - tell_a_joke:
        val_split: 0.05
        max_val_set: 250
    - webgpt:
        val_split: 0.05
        max_val_set: 250
    - gpt4all:
        val_split: 0.01
        max_val_set: 1000
    - alpaca_gpt4:
        val_split: 0.025
        max_val_set: 250
    - code_alpaca:
        val_split: 0.05
        max_val_set: 250
    - vicuna:
        max_val_set: 250
    - oig_file:
        source_url: https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl
        max_count: 10000
        min_length: 250
        val_split: 0.05
        max_val_set: 250
    - minimath:
        val_split: 0.05
    - humaneval_mbpp_codegen_qa:
        val_split: 0.05
    - humaneval_mbpp_testgen_qa:
        val_split: 0.05
    - grade_school_math_instructions:
        val_split: 0.05
    - recipes:
        val_split: 0.05
    - cmu_wiki_qa:
        val_split: 0.05
    - oa_wiki_qa_bart_10000row:
        val_split: 0.05
        max_val_set: 250
    - prosocial_dialogue:
        fraction: 0.1
        max_val_set: 250
    - explain_prosocial:
        fraction: 0.075
        max_val_set: 250
    - soda:
        fraction: 0.25
        max_val_set: 1000
    - oa_leet10k:
        val_split: 0.05
        max_val_set: 250
    - dolly15k:
        val_split: 0.05
        max_val_set: 300
SFT-8 final SFT mix
  datasets:
    - oasst_export:
        lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
        input_file_path: 2023-05-06_OASST_labels.jsonl.gz
        val_split: 0.05
    - vicuna:
        val_split: 0.05
        max_val_set: 800
        fraction: 0.4
    - dolly15k:
        val_split: 0.05
        max_val_set: 300
    - grade_school_math_instructions:
        val_split: 0.05
    - code_alpaca:
        val_split: 0.05
        max_val_set: 250
    - red_pajama:
        fraction: 0.05
        max_val_set: 1000
    - wizardlm_70k:
        val_split: 0.05
        max_val_set: 500
        fraction: 0.4
    - poem_instructions:
        fraction: 0.5
        val_split: 0.025

Leading on this: @0x22almostEvil

Some initial requests from the community include removal, reduction, or filtering of the prosocial_dialogue and explain_prosocial datasets in the pretraining mix.
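For illustration only, a reduced prosocial configuration could simply lower the sampled fractions relative to the SFT-8 pretraining mix above. This is a hypothetical sketch, not an agreed change, and the fraction values are placeholders:

  datasets:
    - prosocial_dialogue:
        fraction: 0.05     # placeholder, reduced from 0.1 in the SFT-8 mix above
        max_val_set: 250
    - explain_prosocial:
        fraction: 0.025    # placeholder, reduced from 0.075 in the SFT-8 mix above
        max_val_set: 250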

@0x22almostEvil
Contributor

I'm here!

@r7l

r7l commented May 14, 2023

Would it be possible to use the Starcoder dataset?

@olliestanley
Collaborator Author

Would it be possible to use the Starcoder dataset?

Currently we include some RedPajama data with a language modelling objective during SFT to try to prevent catastrophic forgetting of pretraining knowledge. Maybe it would be possible to do something similar with StarCoder data. But I don't think we could train on the whole dataset; that would be hugely expensive and more in the realm of foundation model pretraining than assistant finetuning.
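For illustration, if a StarCoder loader were added to the codebase, a small slice could be mixed in the same way as red_pajama. This is a hypothetical sketch only: the dataset name starcoder and the numbers below are placeholders, not an existing config:

  datasets:
    - starcoder:             # hypothetical loader name, not currently in the codebase
        fraction: 0.05       # placeholder: keep the code slice small, as with red_pajama
        max_val_set: 1000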

@r7l

r7l commented May 14, 2023

Understandable. I'd assume that a large portion of current OA users are coders, so it might be reasonable for the model to have a good understanding of coding from the start. Coding is already pretty decent in SFT-7, but there will always be room for improvement.

@0x22almostEvil
Contributor

0x22almostEvil commented May 14, 2023

Yeah, we might collect some good datasets for coding as well.

I'm currently working on collecting reasoning, logic, and semantics datasets, as I've noticed some problems in this area.

@marcelklehr

Is it possible to train a model without any legally questionable datasets, such as code_alpaca and gpt4all, which AFAIK were generated with the OpenAI API, whose terms don't allow training models on its output? A fully open-source model like this would be very helpful.

andreaskoepf pushed a commit that referenced this issue May 20, 2023
This PR modifies ProsocialDialogue to use the filtered version of the
dataset I have created, with less irrelevant data and fewer rejections.

In this modified dataset I have filtered out the lines where the safety
label is "casual" or "possibly/probably needs caution", which I found to
be mostly irrelevant, as well as some lines where the phrasing of the
response might hurt the model's performance by refusing to act on a
request.

This is an alternative solution that may work instead of removing the
dataset completely, as mentioned in #3144
@olliestanley
Collaborator Author

Given the results of the Guanaco paper, I think it is clear SFT-9 should use a much smaller set of finetuning data and focus on high quality. I suggest we try a run dropping synthetic datasets, perhaps with exceptions for those synthetic datasets which are clearly high quality.

@andreaskoepf
Collaborator

andreaskoepf commented May 25, 2023

While we definitely should use QLoRA (a groundbreaking result for the whole ML community) and try only a super-high-quality final fine-tuning run (i.e. OA top-1 threads, as was done for Guanaco), I think the overall situation is not that simple.

We already followed a 2-stage training approach. Guanaco of course goes a step further and trained only on the highest-quality OA data. When we decided to use the full OA set (I implemented a top-k thread filter which was not used), the idea was to create a diverse SFT output as input to the RL stage. We were also probably a bit afraid of overfitting on too small a dataset (and we saw that naive dropout for the larger LLaMA models didn't work as well as for Pythia; LIMA showed a better approach). And since QLoRA allows much faster iterations, they could try a lot of different configurations in a short amount of time (a rapid feedback loop is extremely beneficial if you have the right eval metrics).

In the fine print of his Twitter mega-thread, Tim Dettmers writes:

  • "Our main finding here: (1) instruction tuning datasets are good for instruction following but bad for chatbot performance; (2) you can create a 99.3% of ChatGPT performance level chatbot with QLoRA in just 24 hours of fine-tuning!" (tweet) -> i.e instruction following and "fluffy" chat are two different things
  • "its really bad at math" (tweet)

What we clearly see is that the style of the model output can already be greatly modified with 1k (LIMA) or 10k (QLoRA) examples. Whether additional "pre-training" is beneficial for capabilities was IMO not analyzed. We observed that pre-training clearly has an influence (e.g. negative with prosocial and positive with grade-school math).
Also, we know that our SFT7e3 model, although it fails most of the time to generate rhyming poems, is our best model for following instructions and handling plugin requests. The larger LLaMA models were pre-trained on 1.4T tokens... the question is of course whether adding further datasets like synthetic instructions improves the desired model behavior or whether they have detrimental effects. For the pro-social "safety" datasets we concluded that their effect is overall negative and that they should be removed from future runs, but for others it is less clear and needs further analysis.

I see two obvious solutions/approaches for chat vs. plugins:

  • use something like a "mode" field in the system prompt to specify whether we want instruction-following or fluffy-talk mode
  • use multiple specialized models, e.g. one for chat and another one for instruction following

@olliestanley
Collaborator Author

I agree that there is a clear distinction between datasets useful for chat-tuning vs instruction-following, but I have a few points here.

We already followed a 2-stage training approach. Guanaco of course goes a step further and trained only on the highest quality OA-data. When we decided to use the full OA set (I implemented a top-k thread filter which was not used) the idea was to create a diverse SFT output as input to the RL stage.

This makes sense, but it seems to me that even if we continue the 2-stage approach, we can most likely get sufficiently diverse outputs for RL from a highly filtered OA set.

In the fine-print of his twitter mega-thread Tim Dettmers writes:

  • "Our main finding here: (1) instruction tuning datasets are good for instruction following but bad for chatbot performance; (2) you can create a 99.3% of ChatGPT performance level chatbot with QLoRA in just 24 hours of fine-tuning!" (tweet) -> i.e instruction following and "fluffy" chat are two different things
  • "its really bad at math" (tweet)

Yes, my suggestion would be to retain these high-quality instruction-following datasets (e.g. math instructions, poetry instructions, coding instructions, and I believe Dragan is building a plugin-instruction dataset) but remove the synthetic chat datasets. It seems like we do not need the Alpaca, Vicuna, WizardLM, prosocial, roleplay, etc. datasets, which are chat-focused and likely to be lower quality than filtered OA and Dolly data, perhaps with the exception of alpaca_gpt4?

There are some datasets I haven't looked at, so I am less sure about those (OIG file, soda, webgpt, recipes, wiki QA datasets).
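For illustration, a slimmed-down SFT-9 mix along these lines might look roughly like the following. This is a hypothetical sketch only, not an agreed config; it reuses entries from the SFT-8 configs above, the values are placeholders, and the comment marks where the (currently unused) top-k thread filter could apply:

  datasets:
    - oasst_export:
        lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
        input_file_path: 2023-05-06_OASST_labels.jsonl.gz
        val_split: 0.05
        # hypothetical: restrict to the highest-ranked threads via the unused top-k filter
    - dolly15k:
        val_split: 0.05
        max_val_set: 300
    - grade_school_math_instructions:
        val_split: 0.05
    - code_alpaca:
        val_split: 0.05
        max_val_set: 250
    - poem_instructions:
        fraction: 0.5
        val_split: 0.025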

I see two obvious solutions/approaches for chat vs. plugins:

  • use something like a "mode" field in the system prompt to specify whether we want instruction-following or fluffy-talk mode
  • use multiple specialized models, e.g. one for chat and another one for instruction following

I personally prefer the idea of having a single model which can do both, as it aligns much better with the OA vision of running on consumer hardware. So the system prompt idea seems like a good starting point, imo.

@draganjovanovich
Collaborator

I am for the single mode/model approach. Completely removing instruction datasets seems a bit too much, but I am all in for keeping only top-quality samples/datasets.
