
Conversation

deniswaker (Collaborator) commented Sep 26, 2025

Replaces the fixed-window chunker with a configurable chunk_text(...) that is sentence-aware, token-aware (via tiktoken when available), and emits detailed metadata for each chunk. Handles very long sentences and oversized chunks gracefully and reduces tiny fragments through configurable merging.

What Changed

  • Sentence-aware chunking

    • Joins sentences up to a character limit to produce coherent chunks.
  • Token-aware splitting

    • Uses tiktoken if present; otherwise falls back to a lightweight token estimator.
    • Supports a token_target to guide chunk sizes.
  • Chunk metadata output

    • Returns a list of dicts with: id, order, offset, length, text, estimated_tokens (see the usage sketch after this list).
  • Long sentence handling

    • Splits very long single sentences into windows that respect limits.
  • Secondary splitting for large chunks

    • Further subdivides oversized chunks based on token_target to keep sizes consistent.
  • Tiny-chunk merging

    • Merges sub-threshold fragments into the previous chunk using min_chunk_length to avoid many tiny pieces.
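
Taken together, a call could look like the sketch below. The parameter names follow the chunk_text signature shown in this PR; the concrete values are illustrative assumptions, not the shipped defaults.

```python
from services.chunking_service import chunk_text

document_text = open("report.txt", encoding="utf-8").read()

chunks = chunk_text(
    document_text,
    max_length=1200,       # soft character cap per chunk
    overlap=0,             # character overlap (fixed-window mode)
    by_sentence=True,      # join whole sentences up to max_length
    min_chunk_length=200,  # fragments below this merge into the previous chunk
    token_target=300,      # guides secondary splitting of oversized chunks
)

for c in chunks:
    # Each chunk is a metadata dict, not a plain string.
    print(c["id"], c["order"], c["offset"], c["length"], c["estimated_tokens"])
```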

Motivation

  • Improve downstream retrieval and summarization quality by keeping sentences together.
  • Provide deterministic sizing for LLM contexts while retaining readability.
  • Enable diagnostics and auditing through rich per-chunk metadata.

Implementation Notes

  • New chunk_text(...) API encapsulates sentence detection, token estimation, windowing, and merging.
  • Prefers tiktoken when installed; gracefully degrades without it (a sketch of this fallback pattern follows the list).
  • Offsets and lengths are computed against the original input to enable traceability.
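
The estimator itself is not shown in this PR description, but the tiktoken-with-fallback pattern it describes could be sketched as follows. The encoding name and the 4-characters-per-token heuristic are assumptions (the latter mirrors the token_target * 4 character approximation visible in the reviewed code).

```python
def _estimate_tokens_from_text(text: str) -> int:
    """Estimate token count, preferring tiktoken when it is installed."""
    try:
        import tiktoken  # optional dependency
        enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding choice
        return len(enc.encode(text))
    except ImportError:
        # Lightweight fallback: roughly 4 characters per token in English text.
        return max(1, len(text) // 4)
```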

Configuration & Defaults

  • max_chars_per_chunk: soft cap used during sentence joining.
  • token_target: desired token size when tokenization is available.
  • min_chunk_length: lower bound for merging tiny fragments.
  • Sensible defaults provided; all parameters are overrideable.

Backward Compatibility

  • Supersedes the previous fixed-window approach; call sites should switch to chunk_text(...) (a migration sketch follows this list).
  • Output now returns metadata dicts instead of plain strings (intentional API improvement).
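
For call sites that previously consumed plain strings, the migration is a one-liner; the variable names here are illustrative:

```python
# Before this PR, chunk_text returned a list of strings; it now returns metadata dicts.
chunk_dicts = chunk_text(document_text, max_length=1200)
plain_strings = [c["text"] for c in chunk_dicts]  # recover the old shape if needed
```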

Risks & Mitigations

  • Environment variance: token counts differ when tiktoken is absent → mitigated by fallback estimator.
  • Chunk size drift: extreme inputs may still overrun targets → mitigated by secondary splitting.

Summary by CodeRabbit

  • New Features

    • Improved chunking with sentence-aware and token-aware sizing.
    • Optional token target to refine large chunks.
    • Returns structured chunk metadata (id, order, offsets, length, estimated tokens) instead of plain text.
    • Automatically merges very small chunks for more consistent results.
  • Bug Fixes

    • Stronger input validation and clearer error messages for invalid inputs.
  • Refactor

    • Streamlined chunking flow for more reliable, consistent output.
    • Expanded API parameters to configure sentence-based splitting and minimum chunk length.


coderabbitai bot commented Sep 26, 2025

Walkthrough

Replaced fixed-window chunking with token- and sentence-aware logic. Introduced optional token estimation, sentence splitting/joining, metadata construction per chunk, merging of small chunks, and revised error handling. The public API of chunk_text changed to accept new parameters and now returns a list of metadata dicts.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Chunking service overhaul**<br>`services/chunking_service.py` | Reworked chunking flow to be sentence/token-aware; added token estimation (tiktoken or fallback), sentence split/join helpers, a metadata builder, tiny-chunk merging, and revised error handling. Updated the chunk_text signature to include by_sentence, min_chunk_length, and token_target; the return type is now a list of metadata dicts. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant CS as ChunkingService
    participant TE as TokenEstimator (tiktoken/fallback)

    Client->>CS: chunk_text(text, max_length, overlap, by_sentence, min_chunk_length, token_target)
    rect rgba(200,220,255,0.25)
    note over CS: Validate inputs
    CS-->>Client: DocumentChunkError (on invalid inputs)
    end

    alt by_sentence == true
        CS->>CS: Split text by sentence boundaries
        CS->>CS: Join sentences into initial chunks (max_length/overlap)
    else
        CS->>CS: Fixed-window chunking (max_length/overlap)
    end

    opt token_target provided
        CS->>TE: Estimate tokens per chunk
        TE-->>CS: token counts
        CS->>CS: Refine/split oversized chunks to meet token_target
    end

    CS->>CS: Build metadata per chunk (id, order, offset, length, text, estimated_tokens)
    CS->>CS: Merge tiny chunks (< min_chunk_length) into previous
    CS->>CS: Reassign final order
    CS-->>Client: List[Dict]: chunk metadata

    rect rgba(255,230,200,0.25)
    note over CS: Any processing failure -> DocumentChunkError
    end
```
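
The merge-and-reorder steps at the bottom of the diagram could be implemented roughly as in this sketch; the field updates are inferred from the metadata keys listed above, and note (per the review comment below) that recomputing length from merged text can diverge from source offsets:

```python
def _merge_tiny_chunks(chunks: list[dict], min_chunk_length: int) -> list[dict]:
    """Fold chunks shorter than min_chunk_length into their predecessor."""
    merged: list[dict] = []
    for c in chunks:
        if merged and len(c["text"]) < min_chunk_length:
            prev = merged[-1]
            prev["text"] += " " + c["text"]
            prev["length"] = len(prev["text"])
            prev["estimated_tokens"] += c["estimated_tokens"]
        else:
            merged.append(c)
    for i, c in enumerate(merged):
        c["order"] = i  # reassign final order after merging
    return merged
```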

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nibble sentences, hop through tokens bright,
Stitching tiny thoughts into chunks just right.
With whiskered care, I merge the small and shy,
Count tokens, split giants, let bad inputs fly.
A metadata warren—ordered, neat, concise—
My carrots? Clean chunks, precisely sliced. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00%, below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly indicates the feature scope (sentence- and token-aware chunking with rich metadata and smart merging), which matches the core changes in the PR. |


coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed between 77ca247 and 9219e90.

📒 Files selected for processing (1)
  • services/chunking_service.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
services/chunking_service.py (1)
services/exceptions.py (1)
  • DocumentChunkError (17-19)
🪛 Ruff (0.13.1)
services/chunking_service.py

15-15: Do not catch blind exception: Exception

(BLE001)


27-28: try-except-pass detected, consider logging the exception

(S110)


27-27: Do not catch blind exception: Exception

(BLE001)


157-157: Consider moving this statement to an else block

(TRY300)
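
Assuming lines 15 and 27 correspond to the optional-tiktoken guards, both BLE001 findings and the S110 warning could be resolved by narrowing the exception types and logging instead of passing silently; a sketch:

```python
import logging

logger = logging.getLogger(__name__)

try:
    import tiktoken  # optional dependency
except ImportError:  # narrower than a blind `except Exception` (BLE001)
    tiktoken = None

def _get_encoding():
    if tiktoken is None:
        return None
    try:
        return tiktoken.get_encoding("cl100k_base")
    except (ValueError, KeyError):
        # Log rather than silently pass (S110).
        logger.warning("tiktoken installed but encoding lookup failed; using fallback estimator")
        return None
```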

Comment on lines +115 to +133
```python
        refined: List[Tuple[int, str]] = []  # list of (offset, text)
        offset = 0
        for rc in raw_chunks:
            off = text.find(rc, offset)
            if off == -1:
                off = offset
            # if token_target set and estimated tokens exceed twice the target, split
            if token_target is not None:
                est = _estimate_tokens_from_text(rc)
                if est > token_target * 2:
                    # split by character windows approximating tokens
                    approx_chars = max(100, token_target * 4)
                    for i in range(0, len(rc), approx_chars):
                        refined.append((off + i, rc[i:i+approx_chars]))
                    offset = off + len(rc)
                    continue
            refined.append((off, rc))
            offset = off + len(rc)
```


⚠️ Potential issue | 🔴 Critical

Fix metadata offsets when chunks overlap or we normalize whitespace.

Recovering offsets with text.find(...) breaks as soon as we use overlap or mutate whitespace. Example: with by_sentence=False, max_length=10, overlap=3, chunk #2 really starts at index 7, but because offset has already advanced to 10 the search fails, we fall back to off = offset = 10, and the returned metadata points at the wrong span. _join_sentences also collapses whitespace with ' '.join, so the chunk text no longer appears verbatim in the source; find fails again and we reuse the previous offset, making both offset and length diverge from the original document—violating the contract advertised in this PR.

Please carry the true start index together with each chunk instead of rediscovering it. Have both chunk builders return (start_idx, chunk_text) tuples (preserving the exact slice from text) and consume them directly here:

```diff
-        refined: List[Tuple[int, str]] = []  # list of (offset, text)
-        offset = 0
-        for rc in raw_chunks:
-            off = text.find(rc, offset)
-            if off == -1:
-                off = offset
+        refined: List[Tuple[int, str]] = []
+        for off, rc in raw_chunks:
             if token_target is not None:
                 est = _estimate_tokens_from_text(rc)
                 if est > token_target * 2:
                     approx_chars = max(100, token_target * 4)
                     for i in range(0, len(rc), approx_chars):
                         refined.append((off + i, rc[i:i+approx_chars]))
-                    offset = off + len(rc)
                     continue
             refined.append((off, rc))
-            offset = off + len(rc)
```

…and update _join_sentences / the fixed-window branch so they emit those (start_idx, chunk) tuples using untouched slices from the original string. Without this change any consumer that trusts offset/length will read the wrong portion of the document.
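
A fixed-window builder that carries true offsets, as suggested, could look like this sketch (the helper name is hypothetical):

```python
from typing import List, Tuple

def _fixed_window_chunks(text: str, max_length: int, overlap: int) -> List[Tuple[int, str]]:
    """Return (start_index, exact_slice) pairs so offsets never need rediscovery."""
    step = max(1, max_length - overlap)
    return [(start, text[start:start + max_length]) for start in range(0, len(text), step)]
```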

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In services/chunking_service.py around lines 115-133, stop using text.find() to
recover offsets (which breaks when chunks overlap or whitespace is normalized);
instead have the sentence- and fixed-window chunk builders return tuples
(start_idx, chunk_text) using untouched slices from the original text, then
consume those tuples here (use the provided start_idx as off rather than
computing it), and when token_target triggers further sub-splits compute
sub-slice offsets relative to the provided start_idx (e.g., off + i) so all
refined entries keep true start indices and lengths.

fehranbit self-requested a review September 30, 2025 05:44
fehranbit merged commit f8ad998 into main Sep 30, 2025
1 check passed