
Conversation

deniswaker (Collaborator) commented Sep 26, 2025

Replaces the fixed-window chunker with a configurable chunk_text(...) that is sentence-aware, token-aware (via tiktoken when available), and emits detailed metadata for each chunk. Handles very long sentences and oversized chunks gracefully and reduces tiny fragments through configurable merging.

What Changed

  • Sentence-aware chunking

    • Joins sentences up to a character limit to produce coherent chunks.
  • Token-aware splitting

    • Uses tiktoken if present; otherwise falls back to a lightweight token estimator.
    • Supports a token_target to guide chunk sizes.
  • Chunk metadata output

    • Returns a list of dicts with: id, order, offset, length, text, estimated_tokens (see the usage sketch after this list).
  • Long sentence handling

    • Splits very long single sentences into windows that respect limits.
  • Secondary splitting for large chunks

    • Further subdivides oversized chunks based on token_target to keep sizes consistent.
  • Tiny-chunk merging

    • Merges sub-threshold fragments into the previous chunk using min_chunk_length to avoid many tiny pieces.
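
Taken together, a call could look like the sketch below. The parameter names follow the chunk_text signature shown in this PR; the concrete values are illustrative assumptions, not the shipped defaults.

```python
from services.chunking_service import chunk_text

document_text = open("report.txt", encoding="utf-8").read()

chunks = chunk_text(
    document_text,
    max_length=1200,       # soft character cap per chunk
    overlap=0,             # character overlap (fixed-window mode)
    by_sentence=True,      # join whole sentences up to max_length
    min_chunk_length=200,  # fragments below this merge into the previous chunk
    token_target=300,      # guides secondary splitting of oversized chunks
)

for c in chunks:
    # Each chunk is a metadata dict, not a plain string.
    print(c["id"], c["order"], c["offset"], c["length"], c["estimated_tokens"])
```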

Motivation

  • Improve downstream retrieval and summarization quality by keeping sentences together.
  • Provide deterministic sizing for LLM contexts while retaining readability.
  • Enable diagnostics and auditing through rich per-chunk metadata.

Implementation Notes

  • New chunk_text(...) API encapsulates sentence detection, token estimation, windowing, and merging.
  • Prefers tiktoken when installed; gracefully degrades without it (a sketch of this fallback pattern follows the list).
  • Offsets and lengths are computed against the original input to enable traceability.
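
The estimator itself is not shown in this PR description, but the tiktoken-with-fallback pattern it describes could be sketched as follows. The encoding name and the 4-characters-per-token heuristic are assumptions (the latter mirrors the token_target * 4 character approximation visible in the reviewed code).

```python
def _estimate_tokens_from_text(text: str) -> int:
    """Estimate token count, preferring tiktoken when it is installed."""
    try:
        import tiktoken  # optional dependency
        enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding choice
        return len(enc.encode(text))
    except ImportError:
        # Lightweight fallback: roughly 4 characters per token in English text.
        return max(1, len(text) // 4)
```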

Configuration & Defaults

  • max_chars_per_chunk: soft cap used during sentence joining.
  • token_target: desired token size when tokenization is available.
  • min_chunk_length: lower bound for merging tiny fragments.
  • Sensible defaults provided; all parameters are overrideable.

Backward Compatibility

  • Supersedes the previous fixed-window approach; call sites should switch to chunk_text(...) (a migration sketch follows this list).
  • Output now returns metadata dicts instead of plain strings (intentional API improvement).
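
For call sites that previously consumed plain strings, the migration is a one-liner; the variable names here are illustrative:

```python
# Before this PR, chunk_text returned a list of strings; it now returns metadata dicts.
chunk_dicts = chunk_text(document_text, max_length=1200)
plain_strings = [c["text"] for c in chunk_dicts]  # recover the old shape if needed
```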

Risks & Mitigations

  • Environment variance: token counts differ when tiktoken is absent → mitigated by fallback estimator.
  • Chunk size drift: extreme inputs may still overrun targets → mitigated by secondary splitting.

Summary by CodeRabbit

  • New Features

    • Improved chunking with sentence-aware and token-aware sizing.
    • Optional token target to refine large chunks.
    • Returns structured chunk metadata (id, order, offsets, length, estimated tokens) instead of plain text.
    • Automatically merges very small chunks for more consistent results.
  • Bug Fixes

    • Stronger input validation and clearer error messages for invalid inputs.
  • Refactor

    • Streamlined chunking flow for more reliable, consistent output.
    • Expanded API parameters to configure sentence-based splitting and minimum chunk length.


coderabbitai bot commented Sep 26, 2025

Walkthrough

Replaced fixed-window chunking with token- and sentence-aware logic. Introduced optional token estimation, sentence splitting/joining, metadata construction per chunk, merging of small chunks, and revised error handling. The public API of chunk_text changed to accept new parameters and now returns a list of metadata dicts.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Chunking service overhaul**<br>`services/chunking_service.py` | Reworked chunking flow to be sentence/token-aware; added token estimation (tiktoken or fallback), sentence split/join helpers, a metadata builder, tiny-chunk merging, and revised error handling. Updated the chunk_text signature to include by_sentence, min_chunk_length, and token_target; the return type is now a list of metadata dicts. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant CS as ChunkingService
    participant TE as TokenEstimator (tiktoken/fallback)

    Client->>CS: chunk_text(text, max_length, overlap, by_sentence, min_chunk_length, token_target)
    rect rgba(200,220,255,0.25)
    note over CS: Validate inputs
    CS-->>Client: DocumentChunkError (on invalid inputs)
    end

    alt by_sentence == true
        CS->>CS: Split text by sentence boundaries
        CS->>CS: Join sentences into initial chunks (max_length/overlap)
    else
        CS->>CS: Fixed-window chunking (max_length/overlap)
    end

    opt token_target provided
        CS->>TE: Estimate tokens per chunk
        TE-->>CS: token counts
        CS->>CS: Refine/split oversized chunks to meet token_target
    end

    CS->>CS: Build metadata per chunk (id, order, offset, length, text, estimated_tokens)
    CS->>CS: Merge tiny chunks (< min_chunk_length) into previous
    CS->>CS: Reassign final order
    CS-->>Client: List[Dict]: chunk metadata

    rect rgba(255,230,200,0.25)
    note over CS: Any processing failure -> DocumentChunkError
    end
```
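
The merge-and-reorder steps at the bottom of the diagram could be implemented roughly as in this sketch; the field updates are inferred from the metadata keys listed above, and note (per the review comment below) that recomputing length from merged text can diverge from source offsets:

```python
def _merge_tiny_chunks(chunks: list[dict], min_chunk_length: int) -> list[dict]:
    """Fold chunks shorter than min_chunk_length into their predecessor."""
    merged: list[dict] = []
    for c in chunks:
        if merged and len(c["text"]) < min_chunk_length:
            prev = merged[-1]
            prev["text"] += " " + c["text"]
            prev["length"] = len(prev["text"])
            prev["estimated_tokens"] += c["estimated_tokens"]
        else:
            merged.append(c)
    for i, c in enumerate(merged):
        c["order"] = i  # reassign final order after merging
    return merged
```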

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nibble sentences, hop through tokens bright,
Stitching tiny thoughts into chunks just right.
With whiskered care, I merge the small and shy,
Count tokens, split giants, let bad inputs fly.
A metadata warren—ordered, neat, concise—
My carrots? Clean chunks, precisely sliced. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly indicates the feature scope (sentence- and token-aware chunking with rich metadata and smart merging), which matches the core changes in the PR. |


coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed between 77ca247 and 9219e90.

📒 Files selected for processing (1)
  • services/chunking_service.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
services/chunking_service.py (1)
services/exceptions.py (1)
  • DocumentChunkError (17-19)
🪛 Ruff (0.13.1)
services/chunking_service.py

15-15: Do not catch blind exception: Exception

(BLE001)


27-28: try-except-pass detected, consider logging the exception

(S110)


27-27: Do not catch blind exception: Exception

(BLE001)


157-157: Consider moving this statement to an else block

(TRY300)
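
Assuming lines 15 and 27 correspond to the optional-tiktoken guards, both BLE001 findings and the S110 warning could be resolved by narrowing the exception types and logging instead of passing silently; a sketch:

```python
import logging

logger = logging.getLogger(__name__)

try:
    import tiktoken  # optional dependency
except ImportError:  # narrower than a blind `except Exception` (BLE001)
    tiktoken = None

def _get_encoding():
    if tiktoken is None:
        return None
    try:
        return tiktoken.get_encoding("cl100k_base")
    except (ValueError, KeyError):
        # Log rather than silently pass (S110).
        logger.warning("tiktoken installed but encoding lookup failed; using fallback estimator")
        return None
```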

Comment on lines +115 to +133
```python
        refined: List[Tuple[int, str]] = []  # list of (offset, text)
        offset = 0
        for rc in raw_chunks:
            off = text.find(rc, offset)
            if off == -1:
                off = offset
            # if token_target set and estimated tokens exceed twice the target, split
            if token_target is not None:
                est = _estimate_tokens_from_text(rc)
                if est > token_target * 2:
                    # split by character windows approximating tokens
                    approx_chars = max(100, token_target * 4)
                    for i in range(0, len(rc), approx_chars):
                        refined.append((off + i, rc[i:i+approx_chars]))
                    offset = off + len(rc)
                    continue
            refined.append((off, rc))
            offset = off + len(rc)
```


⚠️ Potential issue | 🔴 Critical

Fix metadata offsets when chunks overlap or we normalize whitespace.

Recovering offsets with text.find(...) breaks as soon as we use overlap or mutate whitespace. Example: with by_sentence=False, max_length=10, overlap=3, chunk #2 really starts at index 7, but because offset has already advanced to 10 the search fails, we fall back to off = offset = 10, and the returned metadata points at the wrong span. _join_sentences also collapses whitespace with ' '.join, so the chunk text no longer appears verbatim in the source; find fails again and we reuse the previous offset, making both offset and length diverge from the original document—violating the contract advertised in this PR.

Please carry the true start index together with each chunk instead of rediscovering it. Have both chunk builders return (start_idx, chunk_text) tuples (preserving the exact slice from text) and consume them directly here:

```diff
-        refined: List[Tuple[int, str]] = []  # list of (offset, text)
-        offset = 0
-        for rc in raw_chunks:
-            off = text.find(rc, offset)
-            if off == -1:
-                off = offset
+        refined: List[Tuple[int, str]] = []
+        for off, rc in raw_chunks:
             if token_target is not None:
                 est = _estimate_tokens_from_text(rc)
                 if est > token_target * 2:
                     approx_chars = max(100, token_target * 4)
                     for i in range(0, len(rc), approx_chars):
                         refined.append((off + i, rc[i:i+approx_chars]))
-                    offset = off + len(rc)
                     continue
             refined.append((off, rc))
-            offset = off + len(rc)
```

…and update _join_sentences / the fixed-window branch so they emit those (start_idx, chunk) tuples using untouched slices from the original string. Without this change any consumer that trusts offset/length will read the wrong portion of the document.
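
A fixed-window builder that carries true offsets, as suggested, could look like this sketch (the helper name is hypothetical):

```python
from typing import List, Tuple

def _fixed_window_chunks(text: str, max_length: int, overlap: int) -> List[Tuple[int, str]]:
    """Return (start_index, exact_slice) pairs so offsets never need rediscovery."""
    step = max(1, max_length - overlap)
    return [(start, text[start:start + max_length]) for start in range(0, len(text), step)]
```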

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In services/chunking_service.py around lines 115-133, stop using text.find() to
recover offsets (which breaks when chunks overlap or whitespace is normalized);
instead have the sentence- and fixed-window chunk builders return tuples
(start_idx, chunk_text) using untouched slices from the original text, then
consume those tuples here (use the provided start_idx as off rather than
computing it), and when token_target triggers further sub-splits compute
sub-slice offsets relative to the provided start_idx (e.g., off + i) so all
refined entries keep true start indices and lengths.

fehranbit self-requested a review September 30, 2025 05:44
fehranbit merged commit f8ad998 into main Sep 30, 2025
1 check passed