Add data sampling features for SFT and future projects ( RLHF ) as well #910

theblackcat102 · 2023-01-24T04:03:04Z

Currently we are reaching more than 2M pairs of total datasets available in our SFT pipeline. Existing code simple put a hard limit of pairs each datasets if len(pairs) > max_length: break.

Ideally we want to write a sampler which takes in the ratio of each datasets represent in a single epoch. This sampler takes a list of ratio weights from config.yaml as input.

config.yaml would look something like this:

datasets:
   - prompt_dialogue
      - ratio : 0.1
   - webgpt
      - ratio : 0.2

the ratio could be a percentage ratio or number of rows? I am not sure which one is bit straightforward and ease of understanding.

Any idea or proposal is welcomed.

The text was updated successfully, but these errors were encountered:

sanagno · 2023-01-24T07:57:04Z

I can take care of this

hemangjoshi37a · 2023-01-26T19:00:50Z

Great idea @theblackcat102, having a data sampler that takes in the ratio of each dataset in a single epoch would be a great addition to our SFT pipeline and future projects like RLHF.

One approach to implement this could be to create a custom data sampler class that takes in a list of ratio weights from a config file like config.yaml. This class would then use these weights to determine the number of samples to be taken from each dataset in a single epoch.

As you mentioned, the ratio could be specified as a percentage or a number of rows. A percentage ratio might be more straightforward and easier to understand, as it would allow users to specify the ratio of samples taken from each dataset in terms of the proportion of the total dataset they represent.

Alternatively, specifying the ratio as a number of rows would allow users to specify the exact number of samples to be taken from each dataset. This approach would be more flexible and would allow for more fine-grained control over the sampling process.

I suggest that we should go with the first approach of percentage ratio, as it could be more straightforward and easy to understand.

I can work on a pull request to implement this feature and would appreciate any input or suggestions you may have. Let me know if this works for you.

sanagno · 2023-01-27T11:01:56Z

Hey @hemangjoshi37a feel free to start on this. If not I can implement something based on your ideas, I was planning to have something by Saturday.

hemangjoshi37a · 2023-01-27T11:04:10Z

@sanagno I will try to contribute in this direction as time permits. Because I have other projects running in parallel that are in my repositories.

maw501 · 2023-01-31T15:28:40Z

I'm happy to pick this up. I'm new around here (this would be my first contribution) so would happily work on it with input from others.

Let me know if you are okay with this.

sanagno · 2023-01-31T16:01:10Z

Sounds great @maw501, thanks a lot! I will follow this closely

maw501 · 2023-02-01T17:34:16Z

Hi @sanagno, I have quite a few questions (sorry!).

Questions related to this issue ❓

Should the sampling be happening per epoch, or once at the start of training when the data is loaded? If every epoch (e.g. each epoch we take a different 10% of a given dataset) then I'm going to have to have another think as I'm not quite clear at the moment where this would then happen given we use the transformers Trainer class for the training loop and so would need to work at the DataLoader level AFAIU (or we write a loop ourselves over epochs and redefine SFTTrainer every epoch with a new dataset).
What does "we are reaching more than 2M pairs of total datasets" mean in the original issue? Particularly the "2M" bit? (cc. @theblackcat102 since it's their comment).
Regardless of the answer to 1, whilst it seems unlikely (?) a user will know the actual dataset sizes we could mimic the approach in train_test_split (i.e. if the user specifies a fraction parameter as a float between 0.0 and 1.0 we use this, if they specify size as an int we take that many examples, otherwise if they don't specify anything we take all the data). See the yaml example below. Note that the user not knowing the dataset sizes limits how much we can assume: so the above logic operates on a single dataset based on its size.
Where the logic should live: if this happens once before we start training I think the natural place for this is inside get_dataset. The advantage of this is that it minimally touches the rest of the code (the alternative is to have to make changes in quite a few places, which I'm loathe to do initially). The downside to this approach is that we get all the data first, then subset, which might be slow (though I'm tempted to not worry about this until it's actually a problem - indeed, getting all the data is close to the current behaviour).

Other questions 🙏

Is the idea when working on SFT to use a conda env and run things locally with our own GPUs (I don't have one)?
Related to the above, it seems I hit an error for wandb since I'm not part of your supervised-finetuning project - could you please add me if this is required?
From having a look around at the datasets code it feels a tad complex at the moment - I'm tempted to open another issue whose goal would be to clean up the interfaces for datasets and allow easier configuration. This would hopefully make future changes (such as this one) possible by only changing (ideally) one part of the code.

Suggested yaml format:

 datasets:
    - webgpt:
        fraction : 0.1
    - prompt_dialogue:
        size : 2000
    - squad_v2
    - adversarial_qa

Thanks, let me know your thoughts!

sanagno · 2023-02-02T08:54:02Z

Hi @maw501, great points!

The more general we make it the better for us. So ideally we could resample after each epoch. There are a few "hacky way" we can do this, e.g. rearranging datapoints in the dataset via a callback at the end of training.
We have a lot of different data sources at the moment! We want to use different datasets but are still interested in prioritizing some kind of data. E.g. using a huge translation dataset may be useful for finetuning, but these examples shouldn't overwhelm examples from other datasets, e.g. dialogue creation.
That seems great to me. Perhaps we can have something even more general, where we have a fraction and a size, and the minimum of the two is selected.
I think the subset idea is fine for the moment. We could have better logic for some of the bigger datasets, but since we are manually creating the data from some of them during init, this is not going to work "perfectly" for all of the datasets.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add data sampling features for SFT and future projects ( RLHF ) as well #910

Add data sampling features for SFT and future projects ( RLHF ) as well #910

theblackcat102 commented Jan 24, 2023

sanagno commented Jan 24, 2023

hemangjoshi37a commented Jan 26, 2023

sanagno commented Jan 27, 2023

hemangjoshi37a commented Jan 27, 2023

maw501 commented Jan 31, 2023

sanagno commented Jan 31, 2023

maw501 commented Feb 1, 2023 •

edited

Loading

sanagno commented Feb 2, 2023

maw501 commented Feb 2, 2023

sanagno commented Feb 2, 2023

maw501 commented Feb 6, 2023

Add data sampling features for SFT and future projects ( RLHF ) as well #910

Add data sampling features for SFT and future projects ( RLHF ) as well #910

Comments

theblackcat102 commented Jan 24, 2023

sanagno commented Jan 24, 2023

hemangjoshi37a commented Jan 26, 2023

sanagno commented Jan 27, 2023

hemangjoshi37a commented Jan 27, 2023

maw501 commented Jan 31, 2023

sanagno commented Jan 31, 2023

maw501 commented Feb 1, 2023 • edited Loading

sanagno commented Feb 2, 2023

maw501 commented Feb 2, 2023

sanagno commented Feb 2, 2023

maw501 commented Feb 6, 2023

maw501 commented Feb 1, 2023 •

edited

Loading