Conversation

@dsocolobsky (Contributor) commented Jan 16, 2026

Adds a new HuggingFacePreprocessedDataProvider data provider that fetches pre-tokenized datasets from HuggingFace via their streaming interface (i.e., on demand).

I added a new test config, hf-preprocessed-config.toml, using emozilla/Hermes-3-Preprocessed-Llama3. Training seemed to work fine, but I'm not that familiar with these datasets or the API, so this needs further testing.
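
For readers unfamiliar with on-demand fetching, here is a minimal sketch (not the PR's actual code) of pulling a slice of rows from the Hugging Face datasets-server /rows endpoint, which is one streaming-style interface such a provider could use. The function name, the config=default parameter, and the error handling are illustrative assumptions.

    use serde_json::Value;

    /// Illustrative only: fetch `length` rows starting at `offset` from the
    /// Hugging Face datasets-server, one way to read a dataset on demand
    /// instead of downloading it up front.
    fn fetch_rows(
        dataset: &str,
        split: &str,
        offset: usize,
        length: usize,
    ) -> Result<Vec<Value>, Box<dyn std::error::Error>> {
        let url = format!(
            "https://datasets-server.huggingface.co/rows\
             ?dataset={dataset}&config=default&split={split}\
             &offset={offset}&length={length}"
        );
        let body: Value = reqwest::blocking::get(&url)?.json()?;
        // The response wraps each row under a "row" key inside "rows".
        Ok(body["rows"]
            .as_array()
            .cloned()
            .unwrap_or_default()
            .into_iter()
            .map(|entry| entry["row"].clone())
            .collect())
    }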

@dsocolobsky dsocolobsky marked this pull request as ready for review January 20, 2026 14:34
@dsocolobsky dsocolobsky changed the title Draft: HF-preprocessed provider HF-preprocessed provider Jan 20, 2026
@pefontana (Contributor) left a comment

Nice one @dsocolobsky!
Just one thing: in the PreprocessedDataProvider we check the sequence length, at shared/data-provider/src/preprocessed.rs:99:

                                let input_ids = list_to_vec(
                                    &row,
                                    inputs_column,
                                    Some(num_tokens_per_sequence),
                                )?;

Maybe we can do the same here, but only if it doesn't involve a lot of code.
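
A rough sketch of the suggested check in the new provider, assuming list_to_vec and the surrounding names (row, inputs_column, num_tokens_per_sequence) from the snippet above are in scope there, and using anyhow::bail! as a stand-in for whatever error type the provider actually returns:

    // Sketch only: reuse the same list_to_vec call as preprocessed.rs:99 and
    // add an explicit length check on top, in case the helper doesn't
    // already enforce it for streamed rows.
    let input_ids = list_to_vec(&row, inputs_column, Some(num_tokens_per_sequence))?;
    if input_ids.len() != num_tokens_per_sequence {
        anyhow::bail!(
            "row has {} tokens, expected {num_tokens_per_sequence}",
            input_ids.len()
        );
    }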

Comment on lines +226 to +228
// For example if row_indices = [5, 6, 7, 10, 11] then the loop does:
// - [5, 6, 7] in first iteration since they are consecutive
// - [10, 11] in second iteration
Contributor

I wonder how this works with the shuffled indices. There's only a small chance we'd end up with consecutive numbers, which would mean making one request per index, so we'd likely get rate-limited sooner rather than later. Maybe we could request the data using ordered indices and then shuffle the results afterward.

@dsocolobsky (Contributor, Author) commented Jan 23, 2026

I've taken a look at this, but at first glance it seems like we can't avoid fetching one by one when the indices are shuffled. If we have, e.g., 4 shuffled indices in the range [0, 40000], it's very unlikely that any of them will be consecutive.
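
For illustration, a self-contained sketch of the consecutive-run grouping described in the quoted comment; the function name and the (start, len) output shape are made up here and not taken from the PR:

    /// Illustrative sketch: split sorted row indices into runs of consecutive
    /// values, so each run can be fetched with a single ranged request.
    /// For [5, 6, 7, 10, 11] this yields [(5, 3), (10, 2)] as (start, len) pairs.
    fn consecutive_runs(sorted_indices: &[usize]) -> Vec<(usize, usize)> {
        let mut runs = Vec::new();
        let mut iter = sorted_indices.iter().copied();
        let Some(first) = iter.next() else {
            return runs;
        };
        let (mut start, mut len) = (first, 1);
        for idx in iter {
            if idx == start + len {
                // Still consecutive: extend the current run.
                len += 1;
            } else {
                runs.push((start, len));
                (start, len) = (idx, 1);
            }
        }
        runs.push((start, len));
        runs
    }

    fn main() {
        assert_eq!(consecutive_runs(&[5, 6, 7, 10, 11]), vec![(5, 3), (10, 2)]);
    }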

Comment on lines +202 to +204
if self.num_rows == 0 {
return vec![];
}
Contributor

I think we're already checking this in the get_samples function before calling this; can this be removed?

@dsocolobsky (Contributor, Author) commented Jan 23, 2026

I think it's worth leaving the check in, just in case we re-use this somewhere at some point. If we ever call this without first checking whether num_rows == 0, it will access out of bounds and panic. I prefer not to have that hidden precondition.

@IAvecilla (Contributor) commented:

I think in the train.rs example we use the Preprocessed variant, but it expects the data to be available locally. This is not a priority and can be done in another PR, but at some point we'll want to add this variant so that the train example can run without needing the data locally beforehand. #26

@philrhc commented Jan 22, 2026

It would help me learn about this if you could add information about the new data provider under the Provider Configuration subheading in psyche-book/src/explain/data-provider.md.
