feat(text-chunking): sentence-/token-aware chunk_text(...) with rich metadata & smart merging #35
Conversation
…e-aware splitting, and add metadata generation
Walkthrough
Replaced fixed-window chunking with token- and sentence-aware logic. Introduced optional token estimation, sentence splitting/joining, metadata construction per chunk, merging of small chunks, and revised error handling. The public API of chunk_text changed to accept new parameters and now returns a list of metadata dicts.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant CS as ChunkingService
    participant TE as TokenEstimator (tiktoken/fallback)
    Client->>CS: chunk_text(text, max_length, overlap, by_sentence, min_chunk_length, token_target)
    rect rgba(200,220,255,0.25)
        note over CS: Validate inputs
        CS-->>Client: DocumentChunkError (on invalid inputs)
    end
    alt by_sentence == true
        CS->>CS: Split text by sentence boundaries
        CS->>CS: Join sentences into initial chunks (max_length/overlap)
    else
        CS->>CS: Fixed-window chunking (max_length/overlap)
    end
    opt token_target provided
        CS->>TE: Estimate tokens per chunk
        TE-->>CS: token counts
        CS->>CS: Refine/split oversized chunks to meet token_target
    end
    CS->>CS: Build metadata per chunk (id, order, offset, length, text, estimated_tokens)
    CS->>CS: Merge tiny chunks (< min_chunk_length) into previous
    CS->>CS: Reassign final order
    CS-->>Client: List[Dict]: chunk metadata
    rect rgba(255,230,200,0.25)
        note over CS: Any processing failure -> DocumentChunkError
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Tip: 👮 Agentic pre-merge checks are now available in preview! Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. Please see the documentation for more information. Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- services/chunking_service.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
- services/chunking_service.py (1): services/exceptions.py (1) → DocumentChunkError (17-19)
🪛 Ruff (0.13.1)
services/chunking_service.py
- 15-15: Do not catch blind exception: `Exception` (BLE001)
- 27-28: `try`-`except`-`pass` detected, consider logging the exception (S110)
- 27-27: Do not catch blind exception: `Exception` (BLE001)
- 157-157: Consider moving this statement to an `else` block (TRY300)
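One common way to address the BLE001/S110 findings is to narrow the exception and log it instead of passing silently. A minimal sketch, assuming the flagged lines guard an optional `tiktoken` import (the encoding name is also an assumption):

```python
import logging

logger = logging.getLogger(__name__)

try:
    import tiktoken  # optional dependency for exact token counts
    _ENCODER = tiktoken.get_encoding("cl100k_base")
except ImportError as exc:  # narrow: only the expected failure mode
    # Logging (rather than a bare `pass`) satisfies S110 and keeps the
    # degraded-mode decision visible in logs.
    logger.debug("tiktoken unavailable, using fallback estimator: %s", exc)
    _ENCODER = None
```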
```python
refined: List[Tuple[int, str]] = []  # list of (offset, text)
offset = 0
for rc in raw_chunks:
    off = text.find(rc, offset)
    if off == -1:
        off = offset
    # if token_target set and estimated tokens exceed twice the target, split
    if token_target is not None:
        est = _estimate_tokens_from_text(rc)
        if est > token_target * 2:
            # split by character windows approximating tokens
            approx_chars = max(100, token_target * 4)
            for i in range(0, len(rc), approx_chars):
                refined.append((off + i, rc[i:i + approx_chars]))
            offset = off + len(rc)
            continue
    refined.append((off, rc))
    offset = off + len(rc)
```
Fix metadata offsets when chunks overlap or we normalize whitespace.

Recovering offsets with `text.find(...)` breaks as soon as we use overlap or mutate whitespace. Example: with `by_sentence=False`, `max_length=10`, `overlap=3`, chunk #2 really starts at index 7, but because `offset` has already advanced to 10 the search fails, we fall back to `off = offset = 10`, and the returned metadata points at the wrong span. `_join_sentences` also collapses whitespace with `' '.join`, so the chunk text no longer appears verbatim in the source; `find` fails again and we reuse the previous offset, making both `offset` and `length` diverge from the original document, violating the contract advertised in this PR. (A self-contained repro follows.)
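To make the failure concrete, here is a small repro of the overlap case described above (illustrative only; it mirrors the fixed-window branch with `max_length=10`, `overlap=3`):

```python
text = "abcdefghijklmnopqrst"
max_length, overlap = 10, 3

# Fixed-window chunks with overlap: step = max_length - overlap = 7.
raw_chunks = [text[i:i + max_length]
              for i in range(0, len(text), max_length - overlap)]

offset = 0
for rc in raw_chunks:
    off = text.find(rc, offset)  # searches only from `offset` onward
    if off == -1:
        off = offset             # fallback used by the current code
    print(f"{rc!r}: reported offset {off}, true offset {text.index(rc)}")
    offset = off + len(rc)
# Chunk 2 ('hijklmnopq') truly starts at 7 but is reported at 10,
# and every later offset is wrong as well.
```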
Please carry the true start index together with each chunk instead of rediscovering it. Have both chunk builders return `(start_idx, chunk_text)` tuples (preserving the exact slice from `text`) and consume them directly here:
```diff
-refined: List[Tuple[int, str]] = []  # list of (offset, text)
-offset = 0
-for rc in raw_chunks:
-    off = text.find(rc, offset)
-    if off == -1:
-        off = offset
+refined: List[Tuple[int, str]] = []
+for off, rc in raw_chunks:
     if token_target is not None:
         est = _estimate_tokens_from_text(rc)
         if est > token_target * 2:
             approx_chars = max(100, token_target * 4)
             for i in range(0, len(rc), approx_chars):
                 refined.append((off + i, rc[i:i + approx_chars]))
             continue
     refined.append((off, rc))
-    offset = off + len(rc)
```
…and update `_join_sentences` / the fixed-window branch so they emit those `(start_idx, chunk)` tuples using untouched slices from the original string, as sketched below. Without this change, any consumer that trusts `offset`/`length` will read the wrong portion of the document.
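A minimal sketch of the fixed-window side of that suggestion; the function name and signature are assumptions, not code from the PR:

```python
from typing import List, Tuple

def _fixed_window_chunks(text: str, max_length: int, overlap: int) -> List[Tuple[int, str]]:
    """Return (start_idx, chunk) pairs built from exact slices of `text`,
    so downstream offset/length metadata always matches the source."""
    step = max(1, max_length - overlap)
    return [(i, text[i:i + max_length]) for i in range(0, len(text), step)]
```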
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In services/chunking_service.py around lines 115-133, stop using text.find() to
recover offsets (which breaks when chunks overlap or whitespace is normalized);
instead have the sentence- and fixed-window chunk builders return tuples
(start_idx, chunk_text) using untouched slices from the original text, then
consume those tuples here (use the provided start_idx as off rather than
computing it), and when token_target triggers further sub-splits compute
sub-slice offsets relative to the provided start_idx (e.g., off + i) so all
refined entries keep true start indices and lengths.
Replaces the fixed-window chunker with a configurable `chunk_text(...)` that is sentence-aware, token-aware (via `tiktoken` when available), and emits detailed metadata for each chunk. Handles very long sentences and oversized chunks gracefully and reduces tiny fragments through configurable merging.

What Changed
- Sentence-aware chunking.
- Token-aware splitting: uses `tiktoken` if present, otherwise falls back to a lightweight token estimator; `token_target` guides chunk sizes.
- Chunk metadata output: `id`, `order`, `offset`, `length`, `text`, `estimated_tokens` (see the example after this list).
- Long sentence handling.
- Secondary splitting for large chunks: uses `token_target` to keep sizes consistent.
- Tiny-chunk merging: controlled by `min_chunk_length` to avoid many tiny pieces.
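For illustration, the shape of one chunk's metadata; the field names come from the PR, while the concrete values are invented:

```python
chunk_meta = {
    "id": "chunk-0003",       # unique chunk identifier (format assumed)
    "order": 3,               # final position after merging and reordering
    "offset": 2048,           # start index within the original document
    "length": 512,            # character length of the chunk
    "text": "...",            # the chunk's text
    "estimated_tokens": 128,  # tiktoken count, or the fallback estimate
}
```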
Motivation

Implementation Notes
- The `chunk_text(...)` API encapsulates sentence detection, token estimation, windowing, and merging.
- Uses `tiktoken` when installed; gracefully degrades without it. A sketch of such a fallback estimator is shown below.
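A minimal sketch of what the lightweight fallback estimator could look like. The name `_estimate_tokens_from_text` appears in the diff above, but this body is an assumption; the ~4 characters/token heuristic matches the `token_target * 4` character window used when splitting oversized chunks:

```python
def _estimate_tokens_from_text(text: str) -> int:
    # Rough estimate when tiktoken is unavailable: average English text
    # runs about 4 characters per token, consistent with the
    # `token_target * 4` window used for secondary splitting.
    return max(1, len(text) // 4)
```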
Configuration & Defaults
- `max_chars_per_chunk`: soft cap used during sentence joining.
- `token_target`: desired token size when tokenization is available.
- `min_chunk_length`: lower bound for merging tiny fragments.

Backward Compatibility
- The public API of `chunk_text(...)` has changed: it accepts new parameters and now returns a list of metadata dicts (see the Walkthrough above).

Risks & Mitigations
- `tiktoken` is absent → mitigated by the fallback estimator.

Summary by CodeRabbit
New Features
Bug Fixes
Refactor