
Add length column to tokenize_and_chunk output#252

Open
jammastergirish wants to merge 1 commit into main from fix-length-column-in-chunked-tokenize

Conversation

jammastergirish (Collaborator) commented Apr 29, 2026

Summary

When data.chunk_length > 0, dataset preparation routes through tokenize_and_chunk (bergson/data.py:700). That function previously returned only input_ids and doc_ids. Every downstream consumer expects a length column too:

  • bergson/build.py:87,100 — build worker
  • bergson/score/score.py:290,313 — score worker
  • bergson/hessians/hessian_approximations.py:159

The non-chunked path adds length implicitly because tokenize (data.py:633) passes return_length=True to the tokenizer. The chunked path skipped that entirely, so any build / score / hessian run with chunk_length > 0 crashed at the first batch:

ValueError: Column 'length' doesn't exist.
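To make the divergence between the two paths concrete, here is a minimal, hypothetical sketch (function and variable names are illustrative, not bergson's actual code): the plain tokenize path gets `length` for free from the tokenizer's `return_length=True`, while the chunked path rebuilds its output dict from scratch and silently drops the column.

```python
def fake_tokenize(batch, vocab):
    """Stand-in for a HF tokenizer call with return_length=True:
    the returned mapping includes a 'length' entry per example."""
    input_ids = [[vocab[tok] for tok in text.split()] for text in batch["text"]]
    return {
        "input_ids": input_ids,
        "length": [len(ids) for ids in input_ids],  # added by return_length=True
    }

def broken_chunk_batch(batch, chunk_size):
    """Stand-in for the pre-fix chunked path: flattens documents into
    fixed-size chunks but never emits a 'length' column."""
    flat = [tok for ids in batch["input_ids"] for tok in ids]
    n_chunks = len(flat) // chunk_size
    return {
        "input_ids": [flat[i * chunk_size:(i + 1) * chunk_size]
                      for i in range(n_chunks)],
        # no "length" key -> downstream missing-column crash
    }

vocab = {"the": 0, "cat": 1, "sat": 2}
tokenized = fake_tokenize({"text": ["the cat sat", "the cat"]}, vocab)
assert "length" in tokenized            # non-chunked path has it
chunked = broken_chunk_batch(tokenized, chunk_size=2)
assert "length" not in chunked          # chunked path silently drops it
```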

Fix

Every chunk produced by tokenize_and_chunk has length chunk_size by construction, so this is a one-line addition to chunk_batch's return dict:

return {
    "input_ids": token_chunks,
    "doc_ids": doc_chunks,
    "length": [chunk_size] * n_chunks,
}
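For context, a self-contained sketch of the corrected chunking logic is below. Helper and variable names here are assumptions for illustration; the real implementation lives in tokenize_and_chunk in bergson/data.py.

```python
def chunk_batch(batch, chunk_size):
    """Flatten a batch of tokenized documents into fixed-size chunks,
    tracking which document each chunk came from (illustrative sketch)."""
    flat_tokens, flat_docs = [], []
    for doc_id, ids in enumerate(batch["input_ids"]):
        flat_tokens.extend(ids)
        flat_docs.extend([doc_id] * len(ids))

    n_chunks = len(flat_tokens) // chunk_size  # drop the ragged tail
    token_chunks = [flat_tokens[i * chunk_size:(i + 1) * chunk_size]
                    for i in range(n_chunks)]
    doc_chunks = [flat_docs[i * chunk_size] for i in range(n_chunks)]

    return {
        "input_ids": token_chunks,
        "doc_ids": doc_chunks,
        # Every chunk is exactly chunk_size tokens, so the length column
        # is constant by construction -- this is the one-line fix.
        "length": [chunk_size] * n_chunks,
    }

out = chunk_batch({"input_ids": [[1, 2, 3], [4, 5, 6, 7]]}, chunk_size=2)
assert out["length"] == [2, 2, 2]
assert all(len(c) == 2 for c in out["input_ids"])
```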

How this surfaced

We hit this while running the new build → score pipeline example (on the upcoming PR in add_build_score_pipeline_example_yaml), which uses chunk_length: 1024. Both the build and score steps would have crashed on the same missing column. We had also seen the failure in earlier work.


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
