feat: add chunking module, remove chonkie#84

Merged
stephantul merged 9 commits into main from new-chunkle
May 11, 2026

Conversation

@stephantul stephantul commented May 10, 2026

This PR removes chonkie from semble. One of chonkie's direct dependencies, tokie, was causing installation failures for users on Python 3.14. Because uvx defaults to the latest stable release of Python, users without a pre-existing Python environment would pull 3.14 when running, e.g.,

uvx --from "semble[mcp]" semble

See #81 and #80 for examples of this issue. Upon closer inspection, we don't use any features from either chonkie or tokie in our code, except the code chunker. As the code chunker is mostly a refinement algorithm on top of tree-sitter, we opted to reimplement this algorithm.

The actual PR implements a recursive algorithm on top of either:

  1. chunks produced by tree-sitter (for languages that we recognize and that are supported by tree-sitter-language-pack).
  2. lines produced by simple splitting (for documentation and languages we don't support).

The algorithm works as follows. Given a hierarchically organized set of chunks, each with a start and end span, and a desired chunk size, we create chunks by grouping adjacent chunks that are below the chunk size. If a chunk is larger than the desired chunk size, we instead split it into its children and apply the same algorithm to them. We then perform a refinement step in which we merge adjacent chunks.

For 2), above, we only perform the second step, as lines are not organized hierarchically.
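
As a rough illustration of the description above, here is a minimal sketch, not the code from this PR. It assumes spans are character offsets, that "chunk size" is measured as span length, and that merging is greedy from left to right. `Node`, `split`, `merge_adjacent`, `chunk_tree`, and `chunk_lines` are hypothetical names; `Node` stands in for a tree-sitter node. The sketch also simplifies by doing one split pass followed by one merge pass, and it ignores any text between sibling nodes.

```python
from __future__ import annotations

from dataclasses import dataclass, field

Span = tuple[int, int]  # (start, end) character offsets; an assumption of this sketch


@dataclass
class Node:
    """Hypothetical stand-in for a syntax-tree node: a span plus child nodes."""

    start: int
    end: int
    children: list[Node] = field(default_factory=list)

    @property
    def size(self) -> int:
        return self.end - self.start


def split(node: Node, max_size: int) -> list[Span]:
    """Keep a node whole if it fits (or has no children), else recurse into its children."""
    if node.size <= max_size or not node.children:
        return [(node.start, node.end)]
    spans: list[Span] = []
    for child in node.children:
        spans.extend(split(child, max_size))
    return spans


def merge_adjacent(spans: list[Span], max_size: int) -> list[Span]:
    """Greedily merge neighbouring spans while the merged span stays within max_size."""
    merged: list[Span] = []
    for start, end in spans:
        if merged and end - merged[-1][0] <= max_size:
            merged[-1] = (merged[-1][0], end)  # extend the previous chunk
        else:
            merged.append((start, end))  # start a new chunk
    return merged


def chunk_tree(root: Node, max_size: int) -> list[Span]:
    """Case 1: hierarchical input. Split oversized nodes, then merge neighbours."""
    return merge_adjacent(split(root, max_size), max_size)


def chunk_lines(text: str, max_size: int) -> list[Span]:
    """Case 2: flat input. Lines have no hierarchy, so only the merge step applies."""
    spans: list[Span] = []
    offset = 0
    for line in text.splitlines(keepends=True):
        spans.append((offset, offset + len(line)))
        offset += len(line)
    return merge_adjacent(spans, max_size)
```

In this sketch, a leaf that is still larger than `max_size` is emitted as-is rather than split further; how the actual module handles that case is not stated in this description.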

The PR removes 11 dependencies and maintains 100% test coverage. Performance on our internal benchmark is maintained.

@stephantul stephantul requested a review from Pringled May 10, 2026 15:59

codecov Bot commented May 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
src/semble/chunking/__init__.py 100.00% <100.00%> (ø)
src/semble/chunking/chunking.py 100.00% <100.00%> (ø)
src/semble/chunking/core.py 100.00% <100.00%> (ø)
src/semble/index/create.py 100.00% <100.00%> (ø)

... and 5 files with indirect coverage changes

@Pringled Pringled left a comment

Couple of small comments but looks great

Comment thread src/semble/chunking/core.py Outdated
Comment thread src/semble/chunking/core.py Outdated
Comment thread src/semble/chunking/chunking.py Outdated
@stephantul stephantul merged commit b399168 into main May 11, 2026
30 of 31 checks passed
@stephantul stephantul deleted the new-chunkle branch May 11, 2026 07:13