Add data sampling to the training pipeline #208

Closed
jackapbutler opened this issue May 3, 2023 · 0 comments
Labels: enhancement (New feature or request), priority: medium, work package: model training (Relates to the model training pipeline)

jackapbutler (Collaborator) commented May 3, 2023

To train on a mixture of natural language and chemistry datasets, we want to be able to load more than one tokenised Hugging Face dataset into the training pipeline and specify sampling ratios that define the composition of the final training dataset.

For example:

data:
  paths: /fsx/proj-chemnlp/data/EleutherAI/pythia-1b/marianna13/chemrxiv .... <another>
  ratios: 25:75 # final dataset has a composition of 25% dataset A and 75% dataset B

See this issue for further discussion of possible implementations.
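
One possible way to consume a config like the one above is sketched here. This is not the pipeline's actual implementation: `parse_ratios` and `build_mixture` are hypothetical helper names, and the sketch assumes each path holds a single tokenised dataset written with `save_to_disk`. It leans on `interleave_datasets` from the Hugging Face `datasets` library, which samples from several datasets according to a list of probabilities.

```python
from typing import List, Sequence


def parse_ratios(ratios: str) -> List[float]:
    """Turn a colon-separated ratio string such as "25:75" into
    sampling probabilities that sum to 1."""
    weights = [float(part) for part in ratios.split(":")]
    if any(w < 0 for w in weights):
        raise ValueError(f"ratios must be non-negative, got {ratios!r}")
    total = sum(weights)
    if total <= 0:
        raise ValueError(f"ratios must sum to a positive value, got {ratios!r}")
    return [w / total for w in weights]


def build_mixture(paths: Sequence[str], ratios: str, seed: int = 42):
    """Hypothetical helper: load one tokenised dataset per path and
    interleave them according to the given ratios."""
    # Imported lazily so parse_ratios can be used without `datasets` installed.
    from datasets import interleave_datasets, load_from_disk

    probabilities = parse_ratios(ratios)
    if len(paths) != len(probabilities):
        raise ValueError("need exactly one ratio per dataset path")

    # Assumption: each path was written with Dataset.save_to_disk.
    datasets = [load_from_disk(path) for path in paths]

    # Rows are drawn from each dataset with the matching probability.
    return interleave_datasets(datasets, probabilities=probabilities, seed=seed)
```

With the default stopping behaviour, `interleave_datasets` stops as soon as one of the source datasets is exhausted, so the final composition approximates the requested ratios and does not match them exactly; the total size is bounded by the smallest dataset relative to its ratio.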

jackapbutler added the enhancement, work package: model training, and priority: medium labels on May 3, 2023
jackapbutler changed the title from "Add minimal data sampling to the pipeline" to "Add data sampling to the training pipeline" on May 3, 2023
bethanyconnolly self-assigned this on May 9, 2023