Add data sampling features for SFT and future projects (RLHF) as well #910

Closed
theblackcat102 opened this issue Jan 24, 2023 · 11 comments · Fixed by #1368

@theblackcat102
Collaborator

Currently we are reaching more than 2M pairs across all the datasets available in our SFT pipeline. The existing code simply puts a hard limit on the number of pairs from each dataset: if len(pairs) > max_length: break.

Ideally we want to write a sampler which takes in the ratio that each dataset represents in a single epoch. This sampler would take a list of ratio weights from config.yaml as input.

config.yaml would look something like this:

datasets:
   - prompt_dialogue:
       ratio: 0.1
   - webgpt:
       ratio: 0.2

The ratio could be a percentage or a number of rows? I am not sure which one is more straightforward and easier to understand.

Any ideas or proposals are welcome.
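
For concreteness, a minimal sketch of what such a sampler could look like, assuming the per-dataset ratios from config.yaml have already been parsed into a list of floats (every name below is illustrative, not existing code in this repo):

    import random
    from torch.utils.data import ConcatDataset, Subset

    def subsample_for_epoch(datasets, ratios, seed=0):
        """Keep a ratio-sized slice of each dataset and concatenate the slices.

        datasets: list of map-style datasets (anything with len() and indexing)
        ratios:   one float in (0, 1] per dataset, read from config.yaml
        """
        rng = random.Random(seed)
        subsets = []
        for dataset, ratio in zip(datasets, ratios):
            n = max(1, int(len(dataset) * ratio))         # rows kept from this dataset
            indices = rng.sample(range(len(dataset)), n)  # sample without replacement
            subsets.append(Subset(dataset, indices))
        return ConcatDataset(subsets)

Calling this once per epoch with a different seed would give a different slice of each dataset every epoch; calling it once at load time keeps the current single-pass behaviour, just with ratios instead of a hard cut-off.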

@sanagno sanagno self-assigned this Jan 24, 2023
@sanagno
Collaborator

sanagno commented Jan 24, 2023

I can take care of this.

@hemangjoshi37a
Contributor

Great idea @theblackcat102, a data sampler that takes in the ratio of each dataset in a single epoch would be a valuable addition to our SFT pipeline and to future projects like RLHF.

One approach to implement this could be to create a custom data sampler class that takes in a list of ratio weights from a config file like config.yaml. This class would then use these weights to determine the number of samples to be taken from each dataset in a single epoch.

As you mentioned, the ratio could be specified as a percentage or a number of rows. A percentage ratio might be more straightforward and easier to understand, as it would allow users to specify the ratio of samples taken from each dataset in terms of the proportion of the total dataset they represent.

Alternatively, specifying the ratio as a number of rows would allow users to specify the exact number of samples to be taken from each dataset. This approach would be more flexible and would allow for more fine-grained control over the sampling process.

I suggest we go with the first approach, the percentage ratio, as it is more straightforward and easier to understand.

I can work on a pull request to implement this feature and would appreciate any input or suggestions you may have. Let me know if this works for you.

@sanagno
Collaborator

sanagno commented Jan 27, 2023

Hey @hemangjoshi37a, feel free to start on this. If not, I can implement something based on your ideas; I was planning to have something by Saturday.

@hemangjoshi37a
Contributor

@sanagno I will try to contribute in this direction as time permits, since I have other projects running in parallel in my own repositories.

@maw501
Contributor

maw501 commented Jan 31, 2023

I'm happy to pick this up. I'm new around here (this would be my first contribution) so would happily work on it with input from others.

Let me know if you are okay with this.

@sanagno
Collaborator

sanagno commented Jan 31, 2023

Sounds great @maw501, thanks a lot! I will follow this closely.

@maw501
Contributor

maw501 commented Feb 1, 2023

Hi @sanagno, I have quite a few questions (sorry!).

Questions related to this issue ❓

  1. Should the sampling happen per epoch, or once at the start of training when the data is loaded? If it's every epoch (e.g. each epoch we take a different 10% of a given dataset), I'll need to think some more about where this would happen: since we use the transformers Trainer class for the training loop, it would need to work at the DataLoader level AFAIU (or we write the loop over epochs ourselves and redefine SFTTrainer every epoch with a new dataset).
  2. What does "we are reaching more than 2M pairs of total datasets" mean in the original issue? Particularly the "2M" bit? (cc. @theblackcat102 since it's their comment).
  3. Regardless of the answer to 1, whilst it seems unlikely (?) that a user will know the actual dataset sizes, we could mimic the approach in train_test_split (i.e. if the user specifies a fraction parameter as a float between 0.0 and 1.0 we use it, if they specify size as an int we take that many examples, and if they specify neither we take all the data). See the yaml example below and the sketch after it. Note that the user not knowing the dataset sizes limits how much we can assume: this logic operates on a single dataset based on its own size.
  4. Where should the logic live? If this happens once before we start training, I think the natural place for it is inside get_dataset. The advantage is that it minimally touches the rest of the code (the alternative is making changes in quite a few places, which I'm loath to do initially). The downside is that we get all the data first and then subset, which might be slow (though I'm tempted not to worry about this until it's actually a problem; indeed, getting all the data is close to the current behaviour).

Other questions 🙏

  1. Is the idea when working on SFT to use a conda env and run things locally with our own GPUs (I don't have one)?
  2. Related to the above, it seems I hit an error for wandb since I'm not part of your supervised-finetuning project - could you please add me if this is required?
  3. From having a look around at the datasets code it feels a tad complex at the moment - I'm tempted to open another issue whose goal would be to clean up the interfaces for datasets and allow easier configuration. This would hopefully make future changes (such as this one) possible by only changing (ideally) one part of the code.

Suggested yaml format:

 datasets:
    - webgpt:
        fraction: 0.1
    - prompt_dialogue:
        size: 2000
    - squad_v2
    - adversarial_qa
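
To make the fraction/size logic in question 3 concrete, here is a rough sketch of the per-dataset resolution (the helper name is made up for illustration):

    def resolve_num_samples(dataset_size, fraction=None, size=None):
        """Mimic the train_test_split convention for a single dataset.

        fraction: float in (0.0, 1.0] -> keep that proportion of the rows
        size:     int                 -> keep exactly that many rows (capped at the dataset size)
        neither specified             -> keep everything
        """
        if fraction is not None:
            return max(1, int(dataset_size * fraction))
        if size is not None:
            return min(size, dataset_size)  # never request more rows than exist
        return dataset_size

With the yaml above, webgpt would resolve via fraction=0.1, prompt_dialogue via size=2000, and squad_v2 / adversarial_qa would keep all of their rows.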

Thanks, let me know your thoughts!

@sanagno
Collaborator

sanagno commented Feb 2, 2023

Hi @maw501, great points!

  1. The more general we make it the better for us, so ideally we could resample after each epoch. There are a few "hacky ways" we could do this, e.g. rearranging datapoints in the dataset via a callback at the end of each epoch (see the sketch after this list).
  2. We have a lot of different data sources at the moment! We want to use many different datasets but are still interested in prioritizing certain kinds of data. E.g. a huge translation dataset may be useful for finetuning, but its examples shouldn't overwhelm examples from other datasets, e.g. dialogue creation.
  3. That seems great to me. Perhaps we can have something even more general, where we have a fraction and a size, and the minimum of the two is selected.
  4. I think the subset idea is fine for the moment. We could have better logic for some of the bigger datasets, but since we are manually creating the data from some of them during init, this is not going to work "perfectly" for all of the datasets.
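
To illustrate the callback idea from point 1, a sketch only: it assumes the train dataset object exposes a resample() method, which is hypothetical and not something the current code has.

    from transformers import TrainerCallback

    class ResampleCallback(TrainerCallback):
        """Re-draws the per-dataset subsets at the end of every epoch."""

        def __init__(self, train_dataset):
            self.train_dataset = train_dataset  # assumed to implement resample(seed)

        def on_epoch_end(self, args, state, control, **kwargs):
            # state.epoch is a float; use it to vary the seed between epochs
            self.train_dataset.resample(seed=int(state.epoch))

The callback would then be passed to the Trainer via its callbacks argument.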

Other questions:

  1. For this issue, no GPUs should be needed, but yes that is the general idea :))
  2. Just send me your wandb account. For now there is the flag --wandb-entity that you can set to your own wandb account.
  3. There are a lot of different data sources, so it's always going to be a bit messy. But feel free to propose changes as you want!

Thanks again :))

@maw501
Contributor

maw501 commented Feb 2, 2023

Thanks for your thoughtful comments @sanagno - greatly appreciated. 🙇

I'll have a think about the best way to implement the per-epoch sampling, but from a quick look, overriding the get_train_dataloader method in the Trainer class looks like a potentially non-hacky way. Note this means we'll be keeping all the data in memory whilst training (and only using a subset per epoch); we'll have to check whether this causes any problems.
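
Roughly what I have in mind, as a sketch only (dataset_sizes and ratios would come from the config and be stored on the trainer; none of this is existing repo code):

    import torch
    from torch.utils.data import DataLoader, Sampler
    from transformers import Trainer

    class PerEpochRatioSampler(Sampler):
        """Draws a fresh ratio-weighted set of indices every time the DataLoader starts an epoch."""

        def __init__(self, dataset_sizes, ratios):
            self.dataset_sizes = dataset_sizes  # sizes of the concatenated datasets, in order
            self.ratios = ratios                # per-dataset fractions from the config

        def __iter__(self):
            picked, offset = [], 0
            for size, ratio in zip(self.dataset_sizes, self.ratios):
                n = max(1, int(size * ratio))
                picked.append(torch.randperm(size)[:n] + offset)  # new draw on every epoch
                offset += size
            order = torch.cat(picked)
            return iter(order[torch.randperm(len(order))].tolist())  # shuffle across datasets

        def __len__(self):
            return sum(max(1, int(s * r)) for s, r in zip(self.dataset_sizes, self.ratios))

    class RatioSamplingTrainer(Trainer):
        def get_train_dataloader(self):
            sampler = PerEpochRatioSampler(self.dataset_sizes, self.ratios)  # hypothetical attributes set at init
            return DataLoader(
                self.train_dataset,
                batch_size=self.args.train_batch_size,
                sampler=sampler,
                collate_fn=self.data_collator,
            )

Because the sampler's __iter__ runs at the start of every epoch when the DataLoader is iterated, each epoch sees a different ratio-weighted subset while the full data stays in memory.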

It will most likely be early next week before I have the PR, if that's okay (away this weekend 😬).

@sanagno
Collaborator

sanagno commented Feb 2, 2023

Sounds good!

@maw501
Contributor

maw501 commented Feb 6, 2023

Just a quick update on this: the single GPU version is working and was simple enough.

I've got a bit more to do to get it to work in a distributed setting but hope to have something to share in the next day or two.
