Add length column to tokenize_and_chunk output #252
Open
jammastergirish wants to merge 1 commit into main from
Conversation
When data.chunk_length > 0, dataset preparation routes through
tokenize_and_chunk, which previously returned only input_ids and
doc_ids. Every downstream consumer expects a length column:
- bergson/build.py:87,100 (build worker)
- bergson/score/score.py:290,313 (score worker)
- bergson/hessians/hessian_approximations.py:159
The non-chunked path adds length implicitly via tokenizer
return_length=True; the chunked path skipped it, so any chunked
build/score/hessian run failed at first batch with:
ValueError: Column 'length' doesn't exist.
Every chunk has length chunk_size by construction, so the fix
is one line in chunk_batch's return dict.
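The fix described above can be sketched as a self-contained Python function. The real chunk_batch lives in bergson/data.py and its chunking details (remainder handling, how doc_ids are assigned) are assumptions here; the point is the added "length" key, which is constant by construction:

```python
def chunk_batch(batch: dict, chunk_size: int) -> dict:
    """Flatten per-document token ids into fixed-size chunks.

    Hypothetical sketch: batch["input_ids"] is assumed to be a list of
    per-document token-id lists; the trailing remainder of each document
    is dropped so every chunk is exactly chunk_size tokens long.
    """
    input_ids: list[list[int]] = []
    doc_ids: list[int] = []
    for doc_id, ids in enumerate(batch["input_ids"]):
        for start in range(0, len(ids) - chunk_size + 1, chunk_size):
            input_ids.append(ids[start : start + chunk_size])
            doc_ids.append(doc_id)
    return {
        "input_ids": input_ids,
        "doc_ids": doc_ids,
        # The fix: every chunk has length chunk_size by construction,
        # mirroring the column the non-chunked path gets from
        # return_length=True.
        "length": [chunk_size] * len(input_ids),
    }
```

For example, two documents of 10 and 7 tokens with chunk_size=4 yield three chunks, each reporting length 4.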
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

When `data.chunk_length > 0`, dataset preparation routes through `tokenize_and_chunk` (bergson/data.py:700). That function previously returned only `input_ids` and `doc_ids`. Every downstream consumer expects a `length` column too:

- bergson/build.py:87,100 — build worker
- bergson/score/score.py:290,313 — score worker
- bergson/hessians/hessian_approximations.py:159

The non-chunked path adds `length` implicitly because `tokenize` (data.py:633) passes `return_length=True` to the tokenizer. The chunked path skipped that entirely, so any build/score/hessian run with `chunk_length > 0` crashed at the first batch:

ValueError: Column 'length' doesn't exist.

Fix

Every chunk produced by `tokenize_and_chunk` has length `chunk_size` by construction, so this is a one-line addition to `chunk_batch`'s return dict.

How this surfaced

Hit while running the new build → score pipeline example (on the upcoming PR in `add_build_score_pipeline_example_yaml`), which uses `chunk_length: 1024`. Both the build and score steps would have crashed on the same missing column. We also hit this in earlier work.
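For illustration, the failure mode the workers hit can be reproduced with a toy consumer (hypothetical function name; the real workers are in the files listed under Summary). Before the fix, the chunked path's output simply lacked the column:

```python
def sum_batch_lengths(columns: dict[str, list]) -> int:
    # Downstream workers read the "length" column on every batch; the
    # first access fails if the chunked path never produced it.
    if "length" not in columns:
        raise ValueError("Column 'length' doesn't exist.")
    return sum(columns["length"])

pre_fix = {"input_ids": [[1, 2, 3, 4]], "doc_ids": [0]}  # no "length"
post_fix = {**pre_fix, "length": [4]}                    # with the fix

try:
    sum_batch_lengths(pre_fix)
except ValueError as err:
    print(err)                       # Column 'length' doesn't exist.

print(sum_batch_lengths(post_fix))   # 4
```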