Conversation

@dsocolobsky (Contributor) commented Jan 16, 2026

Adds a new HuggingFacePreprocessedDataProvider data provider that fetches pre-tokenized datasets from HuggingFace via their streaming interface (i.e., on demand).

I added a new test config, hf-preprocessed-config.toml, using emozilla/Hermes-3-Preprocessed-Llama3. Training seemed to work fine, but I'm not that familiar with these datasets or the API, so this needs further testing.
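
For readers unfamiliar with on-demand fetching, here is a minimal sketch (not the PR's actual code) of pulling a slice of rows from the Hugging Face datasets-server /rows endpoint, which is one streaming-style interface such a provider could use. The function name, the config=default parameter, and the error handling are illustrative assumptions.

    use serde_json::Value;

    /// Illustrative only: fetch `length` rows starting at `offset` from the
    /// Hugging Face datasets-server, one way to read a dataset on demand
    /// instead of downloading it up front.
    fn fetch_rows(
        dataset: &str,
        split: &str,
        offset: usize,
        length: usize,
    ) -> Result<Vec<Value>, Box<dyn std::error::Error>> {
        let url = format!(
            "https://datasets-server.huggingface.co/rows\
             ?dataset={dataset}&config=default&split={split}\
             &offset={offset}&length={length}"
        );
        let body: Value = reqwest::blocking::get(&url)?.json()?;
        // The response wraps each row under a "row" key inside "rows".
        Ok(body["rows"]
            .as_array()
            .cloned()
            .unwrap_or_default()
            .into_iter()
            .map(|entry| entry["row"].clone())
            .collect())
    }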

@dsocolobsky dsocolobsky marked this pull request as ready for review January 20, 2026 14:34
@dsocolobsky dsocolobsky changed the title Draft: HF-preprocessed provider HF-preprocessed provider Jan 20, 2026
@pefontana (Contributor) left a comment

Nice one @dsocolobsky!
Just one thing: in the PreprocessedDataProvider we check the sequence length, at shared/data-provider/src/preprocessed.rs:99:

                                let input_ids = list_to_vec(
                                    &row,
                                    inputs_column,
                                    Some(num_tokens_per_sequence),
                                )?;

Maybe we can do the same here, but only if it doesn't involve a lot of code.
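
A rough sketch of the suggested check in the new provider, assuming list_to_vec and the surrounding names (row, inputs_column, num_tokens_per_sequence) from the snippet above are in scope there, and using anyhow::bail! as a stand-in for whatever error type the provider actually returns:

    // Sketch only: reuse the same list_to_vec call as preprocessed.rs:99 and
    // add an explicit length check on top, in case the helper doesn't
    // already enforce it for streamed rows.
    let input_ids = list_to_vec(&row, inputs_column, Some(num_tokens_per_sequence))?;
    if input_ids.len() != num_tokens_per_sequence {
        anyhow::bail!(
            "row has {} tokens, expected {num_tokens_per_sequence}",
            input_ids.len()
        );
    }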

Comment on lines +226 to +228
// For example if row_indices = [5, 6, 7, 10, 11] then the loop does:
// - [5, 6, 7] in first iteration since they are consecutive
// - [10, 11] in second iteration
Contributor

I wonder how this works with the shuffled indices. There's only a small chance we'd end up with consecutive numbers, which would mean making one request per index, so we'd likely get rate-limited sooner rather than later. Maybe we could request the data using ordered indices and then shuffle the results afterward.

@dsocolobsky (Contributor, Author) commented Jan 23, 2026

I've taken a look at this, but at first glance it seems like we can't avoid fetching one by one when the indices are shuffled. If we have, e.g., 4 shuffled indices in the range [0, 40000], it's very unlikely that any of them will be consecutive.
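
For illustration, a self-contained sketch of the consecutive-run grouping described in the quoted comment; the function name and the (start, len) output shape are made up here and not taken from the PR:

    /// Illustrative sketch: split sorted row indices into runs of consecutive
    /// values, so each run can be fetched with a single ranged request.
    /// For [5, 6, 7, 10, 11] this yields [(5, 3), (10, 2)] as (start, len) pairs.
    fn consecutive_runs(sorted_indices: &[usize]) -> Vec<(usize, usize)> {
        let mut runs = Vec::new();
        let mut iter = sorted_indices.iter().copied();
        let Some(first) = iter.next() else {
            return runs;
        };
        let (mut start, mut len) = (first, 1);
        for idx in iter {
            if idx == start + len {
                // Still consecutive: extend the current run.
                len += 1;
            } else {
                runs.push((start, len));
                (start, len) = (idx, 1);
            }
        }
        runs.push((start, len));
        runs
    }

    fn main() {
        assert_eq!(consecutive_runs(&[5, 6, 7, 10, 11]), vec![(5, 3), (10, 2)]);
    }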

Comment on lines +202 to +204
if self.num_rows == 0 {
return vec![];
}
Contributor

I think we're already checking this in the get_samples function before calling this; can this be removed?

@dsocolobsky (Contributor, Author) commented Jan 23, 2026

I think it's worth leaving the check in, just in case we re-use this somewhere at some point. If we ever call this without first checking whether num_rows == 0, it will access out of bounds and panic. I prefer not to have that hidden precondition.

@IAvecilla (Contributor) commented:

I think in the train.rs example we use the Preprocessed variant, but it expects the data to be available locally. This is not a priority and can be done in another PR, but at some point we'll want to add this variant so that the train example can run without needing the data locally beforehand. #26

@philrhc commented Jan 22, 2026

It would help me learn about this if you could add information about the new data provider under the Provider Configuration subheading in psyche-book/src/explain/data-provider.md.
