-
Notifications
You must be signed in to change notification settings - Fork 86
HF-preprocessed provider #506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
pefontana
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one @dsocolobsky !
Just one thing, maybe we can check
in the PreprocessedDataProvider we check the sequence length
shared/data-provider/src/preprocessed.rs:99
let input_ids = list_to_vec(
&row,
inputs_column,
Some(num_tokens_per_sequence),
)?;
Maybe we can do the same here, but only if doesnt involve a lot of code
| // For example if row_indices = [5, 6, 7, 10, 11] then the loop does: | ||
| // - [5, 6, 7] in first iteration since they are consecutive | ||
| // - [10, 11] in second iteration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how this works with the shuffled indices. There’s a small chance we could end up with consecutive numbers, which would mean making one request per index, so we’d likely get rate-limited sooner rather than later. Maybe we could request the data using ordered indices and then shuffle the results afterward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a look at this but at a first glance it seems like we can't avoid fetching 1 by 1 in the case the indices are shuffled. It's very unlikely that if we have e.g. 4 shuffled indices in the range [0,40000] some of them will be consecutive.
| if self.num_rows == 0 { | ||
| return vec![]; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're already checking this in the get_samples function before calling this, can this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth it to leave the check just in case at some point we re-use it somewhere. If we ever call this before checking before if num_rows=0 it will access out of bounds and panic. I prefer not to have this hidden condition.
|
I think in the |
|
it would be useful for me to learn about this if you could add information about this new data provider under the Provider Configuration subheading in |
Adds a new
HuggingFacePreprocessedDataProviderData Provider which fetches pre-tokenized datasets from HuggingFace via their streaming interface (so, on-demand).I added a new test config
hf-preprocessed-config.tomlusingemozilla/Hermes-3-Preprocessed-Llama3, training seemed to work alright but I'm not that familiar with these datasets/api so requires further testing.