Add data sampling to the training pipeline #208

jackapbutler · 2023-05-03T14:25:45Z

For training on a mixture of natural language and chemistry datasets, we want to be able to load more than one tokenised Hugging Face dataset into the training pipeline, specifying sampling parameters (such as ratios) that define the composition of the final training dataset.

For example:

data:
  paths: /fsx/proj-chemnlp/data/EleutherAI/pythia-1b/marianna13/chemrxiv .... <another>
  ratios: 25:75 # final dataset has a composition of 25% dataset A and 75% dataset B
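
As a rough illustration of how a `ratios` value like the one above might be interpreted, here is a minimal sketch in plain Python. The helper names `parse_ratios` and `sample_mixture` are hypothetical and not part of the existing pipeline. It assumes the ratios are colon-separated non-negative weights, and that sampling is done with replacement. For real Hugging Face datasets, the resulting probabilities could instead be passed to `datasets.interleave_datasets(..., probabilities=...)`.

```python
import random
from typing import List, Sequence


def parse_ratios(ratios: str) -> List[float]:
    """Turn a string such as "25:75" into probabilities that sum to 1."""
    weights = [float(part) for part in ratios.split(":")]
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError(f"invalid ratios: {ratios!r}")
    total = sum(weights)
    return [w / total for w in weights]


def sample_mixture(
    datasets: Sequence[Sequence],
    ratios: str,
    num_samples: int,
    seed: int = 0,
) -> list:
    """Draw num_samples examples, picking the source dataset by ratio.

    Assumption: sampling is with replacement, so a small dataset can be
    over-sampled without running out of examples.
    """
    probabilities = parse_ratios(ratios)
    if len(probabilities) != len(datasets):
        raise ValueError("need exactly one ratio per dataset")
    rng = random.Random(seed)
    mixed = []
    for _ in range(num_samples):
        # Pick which dataset to draw from, then pick one example from it.
        source = rng.choices(range(len(datasets)), weights=probabilities, k=1)[0]
        mixed.append(rng.choice(datasets[source]))
    return mixed


if __name__ == "__main__":
    dataset_a = ["a1", "a2", "a3"]        # stand-in for tokenised dataset A
    dataset_b = ["b1", "b2", "b3", "b4"]  # stand-in for tokenised dataset B
    mixture = sample_mixture([dataset_a, dataset_b], "25:75", num_samples=1000)
    share_b = sum(example.startswith("b") for example in mixture) / len(mixture)
    print(f"share of dataset B: {share_b:.2f}")
```

Normalising the weights means `25:75` and `1:3` describe the same mixture, and a fixed seed keeps the composition reproducible across runs.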

See this issue for further discussion of possible implementations.


jackapbutler added the enhancement, work package: model training, and priority: medium labels May 3, 2023

jackapbutler changed the title ~~Add minimal data sampling to the pipeline~~ Add data sampling to the training pipeline May 3, 2023

bethanyconnolly self-assigned this May 9, 2023

bethanyconnolly closed this as completed May 10, 2023
