Add data sampling to the training pipeline #208
Labels: enhancement, priority: medium, work package: model training
For training on a mixture of natural language and chemistry datasets, we want to be able to load more than one tokenised Hugging Face dataset into the training pipeline, specifying sampling metrics that define the composition of the final training dataset.
i.e.
See this issue for further discussion of possible implementations.
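As a rough illustration of the request, here is a minimal sketch of weighted sampling across several tokenised datasets. It is not the pipeline's implementation: the function name `sample_mixture`, the dataset names, the weights, and the wrap-around behaviour for exhausted datasets are all assumptions made for this example. It uses only the standard library and treats each dataset as an indexable sequence of examples. The Hugging Face `datasets` library also provides `interleave_datasets`, which accepts `probabilities` and `seed` arguments and may be a more direct fit.

```python
import random
from typing import Any, Dict, Iterator, Sequence, Tuple


def sample_mixture(
    datasets: Dict[str, Sequence[Any]],
    weights: Dict[str, float],
    num_samples: int,
    seed: int = 0,
) -> Iterator[Tuple[str, Any]]:
    """Yield (dataset_name, example) pairs, choosing the source dataset
    for each draw with probability proportional to its weight.

    Hypothetical helper for illustration only.
    """
    if set(datasets) != set(weights):
        raise ValueError("datasets and weights must have the same keys")
    if any(len(data) == 0 for data in datasets.values()):
        raise ValueError("every dataset must contain at least one example")
    if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    names = list(datasets)
    probs = [weights[name] for name in names]
    rng = random.Random(seed)  # local RNG so results are reproducible
    cursors = {name: 0 for name in names}

    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        data = datasets[name]
        # Assumption: wrap around (oversample) once a dataset is exhausted.
        example = data[cursors[name] % len(data)]
        cursors[name] += 1
        yield name, example


# Toy stand-ins for tokenised datasets (assumed shape: dicts with input_ids).
natural_language = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
chemistry = [{"input_ids": [10, 11]}, {"input_ids": [12]}, {"input_ids": [13, 14]}]

mixture = list(
    sample_mixture(
        {"natural_language": natural_language, "chemistry": chemistry},
        {"natural_language": 0.7, "chemistry": 0.3},
        num_samples=1000,
        seed=42,
    )
)
```

Expressing the composition as per-dataset weights keeps the configuration small, and the fixed seed makes the resulting mixture reproducible across runs. Whether exhausted datasets should wrap around or end sampling is a design decision left open here.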