Add data sampling features for SFT and future projects (RLHF) as well #910
Comments
I can take care of this |
Great idea @theblackcat102, having a data sampler that takes in the ratio of each dataset in a single epoch would be a great addition to our SFT pipeline and future projects like RLHF.
One approach would be a custom data sampler class that takes a list of ratio weights from a config file such as config.yaml and uses them to determine how many samples to draw from each dataset in a single epoch.
As you mentioned, the ratio could be specified as a percentage or as a number of rows. A percentage is more straightforward and easier to understand, since users state what proportion of each dataset to sample. A number of rows lets users specify the exact number of samples drawn from each dataset, which is more flexible and gives finer-grained control over the sampling process.
I suggest we go with the percentage approach because it is simpler. I can work on a pull request to implement this and would appreciate any input or suggestions. Let me know if this works for you. |
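To make the percentage idea concrete, here is a minimal sketch using only the Python standard library. It is not code from this repository: the class name `FractionSampler`, the method `epoch_indices`, and the `sizes` / `fractions` dictionaries are hypothetical names chosen for illustration, and it assumes every fraction lies in (0, 1].

```python
import random
from typing import Dict, List, Tuple


class FractionSampler:
    """Hypothetical sketch: draw a fresh random subset of each dataset per epoch."""

    def __init__(self, sizes: Dict[str, int], fractions: Dict[str, float], seed: int = 0):
        self.sizes = sizes          # dataset name -> number of rows available
        self.fractions = fractions  # dataset name -> fraction to use per epoch
        self.seed = seed

    def epoch_indices(self, epoch: int) -> List[Tuple[str, int]]:
        # Seeding with the epoch number gives a different subset each epoch
        # while keeping every epoch reproducible.
        rng = random.Random(self.seed + epoch)
        picked: List[Tuple[str, int]] = []
        for name, size in self.sizes.items():
            fraction = self.fractions.get(name, 1.0)  # default: use the whole dataset
            k = min(size, round(size * fraction))
            # Sample without replacement inside a single dataset.
            picked.extend((name, i) for i in rng.sample(range(size), k))
        rng.shuffle(picked)  # mix rows from different datasets
        return picked
```

Compared with a hard cutoff, rows beyond the first `k` still get a chance to appear in later epochs, because the subset is redrawn each time.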
Hey @hemangjoshi37a feel free to start on this. If not I can implement something based on your ideas, I was planning to have something by Saturday. |
@sanagno I will try to contribute in this direction as time permits, since I have other projects running in parallel in my own repositories. |
I'm happy to pick this up. I'm new around here (this would be my first contribution) so would happily work on it with input from others. Let me know if you are okay with this. |
Sounds great @maw501, thanks a lot! I will follow this closely |
Hi @sanagno, I have quite a few questions (sorry!). Questions related to this issue ❓
Other questions 🙏
Suggested yaml format:
    datasets:
      - webgpt:
          fraction: 0.1
      - prompt_dialogue:
          size: 2000
      - squad_v2
      - adversarial_qa
Thanks, let me know your thoughts! |
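As a rough illustration of how such a config could be interpreted, the sketch below turns the already-parsed `datasets` list into a number of rows per dataset. It assumes a YAML loader has produced plain strings for bare entries and single-key dictionaries for entries with options, with `fraction` and `size` nested under the dataset name. The function name `resolve_sample_counts` and the dataset sizes in the usage note are hypothetical.

```python
from typing import Dict, List, Optional, Union

# A bare name ("squad_v2") or a single-key mapping ({"webgpt": {"fraction": 0.1}}).
Entry = Union[str, Dict[str, Optional[Dict[str, float]]]]


def resolve_sample_counts(entries: List[Entry], dataset_sizes: Dict[str, int]) -> Dict[str, int]:
    """Hypothetical helper: map each configured dataset to rows used per epoch."""
    counts: Dict[str, int] = {}
    for entry in entries:
        if isinstance(entry, str):
            name, options = entry, {}
        else:
            if len(entry) != 1:
                raise ValueError(f"expected exactly one dataset per entry, got {entry!r}")
            name, options = next(iter(entry.items()))
            options = options or {}
        if "fraction" in options and "size" in options:
            raise ValueError(f"{name}: give either 'fraction' or 'size', not both")
        total = dataset_sizes[name]
        if "fraction" in options:
            count = round(total * options["fraction"])
        elif "size" in options:
            count = int(options["size"])
        else:
            count = total  # no option: use the whole dataset
        counts[name] = min(count, total)  # never ask for more rows than exist
    return counts
```

With made-up sizes such as `{"webgpt": 20000, "prompt_dialogue": 5000, "squad_v2": 300, "adversarial_qa": 100}`, the config above would resolve to 2000 rows each from `webgpt` and `prompt_dialogue`, and all rows from the other two datasets.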
Hi @maw501, great points!
Other questions:
Thanks again :)) |
Thanks for your thoughtful comments @sanagno - greatly appreciated. 🙇 I'll have a think about the best way to implement the sampling per epoch but from a quick look overriding the It will most likely be early next week now that I have the PR if that's okay (away this weekend 😬 ). |
Sounds good! |
Just a quick update on this: the single GPU version is working and was simple enough. I've got a bit more to do to get it to work in a distributed setting but hope to have something to share in the next day or two. |
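For readers wondering what changes in the distributed case, one common pattern is sketched below; it is an assumption about the general technique, not the code from the eventual PR. Every process shuffles the same index list with a seed shared across ranks, pads it so it divides evenly, and keeps every `world_size`-th element starting at its own `rank`. The function name `shard_for_rank` and its parameters are hypothetical.

```python
import math
import random
from typing import List


def shard_for_rank(num_samples: int, epoch: int, rank: int, world_size: int, seed: int = 0) -> List[int]:
    """Hypothetical sketch: give each rank its own slice of a shared shuffle."""
    if num_samples == 0:
        return []
    # Same seed on every rank, so all processes agree on the global order.
    rng = random.Random(seed + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    # Pad by repeating indices from the start so every rank gets the same count.
    total = math.ceil(num_samples / world_size) * world_size
    order = (order * math.ceil(total / num_samples))[:total]
    # Strided slice: rank 0 takes positions 0, world_size, 2 * world_size, ...
    return order[rank::world_size]
```

Because the shuffle depends only on `seed + epoch`, no communication between processes is needed to keep the shards in agreement. The padding means a few rows may be seen twice in an epoch when the sample count is not divisible by `world_size`.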
We are now reaching more than 2M pairs in total across the datasets available in our SFT pipeline. The existing code simply puts a hard limit on the number of pairs taken from each dataset:
    if len(pairs) > max_length: break
Ideally we want to write a sampler that takes in the ratio each dataset represents in a single epoch. This sampler takes a list of ratio weights from config.yaml as input.
config.yaml would look something like this:
The ratio could be a percentage or a number of rows. I am not sure which one is more straightforward and easier to understand.
Any ideas or proposals are welcome.