
Add length column to tokenize_and_chunk output#252

Open
jammastergirish wants to merge 1 commit into main from fix-length-column-in-chunked-tokenize

Conversation

jammastergirish (Collaborator) commented Apr 29, 2026

Summary

When data.chunk_length > 0, dataset preparation routes through tokenize_and_chunk (bergson/data.py:700). That function previously returned only input_ids and doc_ids. Every downstream consumer expects a length column too:

  • bergson/build.py:87,100 — build worker
  • bergson/score/score.py:290,313 — score worker
  • bergson/hessians/hessian_approximations.py:159

The non-chunked path adds length implicitly because tokenize (data.py:633) passes return_length=True to the tokenizer. The chunked path skipped that entirely, so any build / score / hessian run with chunk_length > 0 crashed at the first batch:

ValueError: Column 'length' doesn't exist.
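To make the divergence between the two paths concrete, here is a minimal, hypothetical sketch (function and variable names are illustrative, not bergson's actual code): the plain tokenize path gets `length` for free from the tokenizer's `return_length=True`, while the chunked path rebuilds its output dict from scratch and silently drops the column.

```python
def fake_tokenize(batch, vocab):
    """Stand-in for a HF tokenizer call with return_length=True:
    the returned mapping includes a 'length' entry per example."""
    input_ids = [[vocab[tok] for tok in text.split()] for text in batch["text"]]
    return {
        "input_ids": input_ids,
        "length": [len(ids) for ids in input_ids],  # added by return_length=True
    }

def broken_chunk_batch(batch, chunk_size):
    """Stand-in for the pre-fix chunked path: flattens documents into
    fixed-size chunks but never emits a 'length' column."""
    flat = [tok for ids in batch["input_ids"] for tok in ids]
    n_chunks = len(flat) // chunk_size
    return {
        "input_ids": [flat[i * chunk_size:(i + 1) * chunk_size]
                      for i in range(n_chunks)],
        # no "length" key -> downstream missing-column crash
    }

vocab = {"the": 0, "cat": 1, "sat": 2}
tokenized = fake_tokenize({"text": ["the cat sat", "the cat"]}, vocab)
assert "length" in tokenized            # non-chunked path has it
chunked = broken_chunk_batch(tokenized, chunk_size=2)
assert "length" not in chunked          # chunked path silently drops it
```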

Fix

Every chunk produced by tokenize_and_chunk has length chunk_size by construction, so this is a one-line addition to chunk_batch's return dict:

return {
    "input_ids": token_chunks,
    "doc_ids": doc_chunks,
    "length": [chunk_size] * n_chunks,
}
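For context, a self-contained sketch of the corrected chunking logic is below. Helper and variable names here are assumptions for illustration; the real implementation lives in tokenize_and_chunk in bergson/data.py.

```python
def chunk_batch(batch, chunk_size):
    """Flatten a batch of tokenized documents into fixed-size chunks,
    tracking which document each chunk came from (illustrative sketch)."""
    flat_tokens, flat_docs = [], []
    for doc_id, ids in enumerate(batch["input_ids"]):
        flat_tokens.extend(ids)
        flat_docs.extend([doc_id] * len(ids))

    n_chunks = len(flat_tokens) // chunk_size  # drop the ragged tail
    token_chunks = [flat_tokens[i * chunk_size:(i + 1) * chunk_size]
                    for i in range(n_chunks)]
    doc_chunks = [flat_docs[i * chunk_size] for i in range(n_chunks)]

    return {
        "input_ids": token_chunks,
        "doc_ids": doc_chunks,
        # Every chunk is exactly chunk_size tokens, so the length column
        # is constant by construction -- this is the one-line fix.
        "length": [chunk_size] * n_chunks,
    }

out = chunk_batch({"input_ids": [[1, 2, 3], [4, 5, 6, 7]]}, chunk_size=2)
assert out["length"] == [2, 2, 2]
assert all(len(c) == 2 for c in out["input_ids"])
```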

How this surfaced

We hit this while running the new build → score pipeline example (on the upcoming PR in add_build_score_pipeline_example_yaml), which uses chunk_length: 1024. Both the build and score steps would have crashed on the same missing column. We had also seen the failure in earlier work.


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
