feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits #1348
…able splits

Document_chunker currently splits Markdown tables mid-row when the table is larger than a single chunk window, producing garbled rows that are useless for RAG retrieval (issue MODSetter#1334).

Changes:
- document_chunker.py: add chunk_text_hybrid() that detects Markdown table blocks with a regex, emits each table as an indivisible single chunk, and feeds the surrounding prose through the normal chunk_text() chunker.
- indexing_pipeline_service.py: route normal (non-code) documents through chunk_text_hybrid instead of chunk_text so tables are protected by default.

Fixes MODSetter#1334
@guangyang1206 is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel. A member of the Team first needs to authorize it.
Description
This PR adds a chunk_text_hybrid() function to the document chunker and wires it into the indexing pipeline so that Markdown tables are never split mid-row during document ingestion.

The original chunk_text() function treats the entire document as a flat character stream, so a Markdown table gets cut in the middle whenever it is larger than a single chunk window. This produces garbled, incomplete rows that are useless for RAG retrieval.

Changes:
- surfsense_backend/app/indexing_pipeline/document_chunker.py: _TABLE_BLOCK_RE regex to detect Markdown table blocks (header + separator + rows)
- chunk_text_hybrid(text): detects Markdown table blocks, emits each table as a single indivisible chunk, and feeds the surrounding prose through the normal chunk_text() chunker
- surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py: chunk_text_hybrid for normal (non-code) documents; code documents still go through chunk_text(use_code_chunker=True) as before

Screenshots
Checklist
- chunk_text() is unchanged; chunk_text_hybrid() is additive

Fixes #1334
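For readers who want a concrete picture of the table-aware chunking described in the changes above, here is a minimal sketch. It is not the code from this PR. The pattern `TABLE_BLOCK_RE_SKETCH`, the function `chunk_text_hybrid_sketch`, the stand-in `naive_chunk_text`, and the `max_chars` parameter are all hypothetical names introduced for illustration; the project's real `_TABLE_BLOCK_RE` and `chunk_text()` are not shown on this page and may behave differently.

```python
import re
from typing import List

# Hypothetical pattern (NOT the PR's _TABLE_BLOCK_RE): a header row,
# a separator row such as |---|:--:|, then zero or more body rows.
TABLE_BLOCK_RE_SKETCH = re.compile(
    r"^[ \t]*\|.*\|[ \t]*\n"                      # header row
    r"[ \t]*\|(?:[ \t]*:?-+:?[ \t]*\|)+[ \t]*\n"  # separator row
    r"(?:[ \t]*\|.*\|[ \t]*(?:\n|\Z))*",          # body rows
    re.MULTILINE,
)


def naive_chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Stand-in for the project's chunk_text(): fixed-size character windows."""
    text = text.strip()
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def chunk_text_hybrid_sketch(text: str, max_chars: int = 200) -> List[str]:
    """Emit each Markdown table whole; chunk the prose around it normally."""
    chunks: List[str] = []
    cursor = 0
    for match in TABLE_BLOCK_RE_SKETCH.finditer(text):
        # Prose between the previous table (or start of text) and this table.
        prose = text[cursor:match.start()]
        if prose.strip():
            chunks.extend(naive_chunk_text(prose, max_chars))
        # The table is appended as one chunk, regardless of its length.
        chunks.append(match.group(0).strip())
        cursor = match.end()
    # Prose after the last table.
    tail = text[cursor:]
    if tail.strip():
        chunks.extend(naive_chunk_text(tail, max_chars))
    return chunks
```

One consequence of this design is that a very large table yields a chunk that exceeds the normal window size; the sketch accepts that in exchange for keeping rows intact.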
High-level PR Summary
This PR introduces a table-aware text chunking mechanism to prevent Markdown tables from being split mid-row during document ingestion. A new chunk_text_hybrid() function detects Markdown table blocks using regex, emits each complete table as an indivisible chunk, and processes surrounding prose text through the existing chunker. The indexing pipeline now routes non-code documents through this hybrid chunker by default, ensuring table data remains intact for RAG retrieval while preserving the original chunking behavior for code documents.

⏱️ Estimated Review Time: 5-15 minutes
💡 Review Order Suggestion
surfsense_backend/app/indexing_pipeline/document_chunker.py
surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py
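The routing rule from the summary (hybrid chunker for non-code documents, unchanged code chunker otherwise) could be sketched as below. This is an assumption-laden illustration, not the code in indexing_pipeline_service.py: the function `route_chunking` and the `is_code_document` flag are hypothetical, and the two chunkers are passed in as parameters so the sketch stays self-contained. Only the `use_code_chunker=True` keyword comes from the PR description.

```python
from typing import Callable, List


def route_chunking(
    text: str,
    is_code_document: bool,  # hypothetical flag; the real pipeline may decide this differently
    chunk_text: Callable[..., List[str]],
    chunk_text_hybrid: Callable[[str], List[str]],
) -> List[str]:
    """Pick a chunker per document type, as described in the PR summary."""
    if is_code_document:
        # Code documents keep the pre-existing behavior.
        return chunk_text(text, use_code_chunker=True)
    # Everything else goes through the table-aware path by default.
    return chunk_text_hybrid(text)
```

Passing the chunkers in as callables also makes the routing easy to unit-test with fakes, without importing the real pipeline.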