feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits #1348

Merged
MODSetter merged 1 commit into MODSetter:dev from guangyang1206:feat/document-chunker-table-aware-hybrid-1334
May 5, 2026

guangyang1206 commented May 5, 2026

document_chunker currently splits Markdown tables mid-row when a table is larger than a single chunk window, producing garbled rows that are useless for RAG retrieval (issue #1334).

Changes:

  • document_chunker.py: add chunk_text_hybrid() that detects Markdown table blocks with a regex, emits each table as an indivisible single chunk, and feeds the surrounding prose through the normal chunk_text() chunker.
  • indexing_pipeline_service.py: route normal (non-code) documents through chunk_text_hybrid instead of chunk_text so tables are protected by default.

Fixes #1334

Description

This PR adds a chunk_text_hybrid() function to the document chunker and wires it into the indexing pipeline so that Markdown tables are never split mid-row during document ingestion.

The original chunk_text() function treats the entire document as a flat character stream, so a Markdown table is cut mid-row whenever it is larger than a single chunk window. The result is garbled, incomplete rows that are useless for RAG retrieval.
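To make the failure mode concrete, here is a minimal sketch of a flat character-window chunker cutting through a table. `naive_chunk` and the 50-character window are hypothetical stand-ins chosen for illustration; they are not the project's chunk_text() implementation or its real window size.

```python
def naive_chunk(text: str, size: int) -> list[str]:
    """Hypothetical stand-in: split text into fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


doc = (
    "Quarterly results:\n"
    "| Region | Revenue |\n"
    "| --- | --- |\n"
    "| North | 120 |\n"
    "| South | 95 |\n"
)

chunks = naive_chunk(doc, size=50)

# The first window ends partway through the separator row, so neither
# chunk contains a complete, parseable table.
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

Because the window boundary ignores line structure, the header lands in one chunk and the data rows in another, which is the behaviour this PR is meant to prevent.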

Changes:

surfsense_backend/app/indexing_pipeline/document_chunker.py

  • Added _TABLE_BLOCK_RE regex to detect Markdown table blocks (header + separator + rows)
  • Added chunk_text_hybrid(text):
    1. Finds all Markdown table blocks in the document
    2. Emits each table as a single, indivisible chunk
    3. Passes non-table prose segments through the existing chunk_text() chunker
    4. Returns chunks in document order
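The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the pattern `TABLE_BLOCK_RE` is a guess at what a table-block regex could look like (the PR's actual _TABLE_BLOCK_RE is not shown here), and `chunk_text_hybrid_sketch`, its `chunk_prose` parameter, and `split_paragraphs` are hypothetical names standing in for chunk_text_hybrid() and chunk_text().

```python
import re
from typing import Callable

# Assumed pattern: a header row, a separator row such as "| --- | :---: |",
# then zero or more body rows. Not the PR's actual _TABLE_BLOCK_RE.
TABLE_BLOCK_RE = re.compile(
    r"^[ \t]*\|.*\|[ \t]*\n"                            # header row
    r"[ \t]*\|(?:[ \t]*:?-+:?[ \t]*\|)+[ \t]*(?:\n|$)"  # separator row
    r"(?:[ \t]*\|.*\|[ \t]*(?:\n|$))*",                 # body rows
    re.MULTILINE,
)


def chunk_text_hybrid_sketch(
    text: str, chunk_prose: Callable[[str], list[str]]
) -> list[str]:
    """Emit each table as one chunk; send the prose between tables to chunk_prose."""
    chunks: list[str] = []
    cursor = 0
    for match in TABLE_BLOCK_RE.finditer(text):
        prose = text[cursor:match.start()]
        if prose.strip():
            chunks.extend(chunk_prose(prose))
        chunks.append(match.group(0).strip())  # whole table, never split
        cursor = match.end()
    tail = text[cursor:]
    if tail.strip():
        chunks.extend(chunk_prose(tail))
    return chunks


def split_paragraphs(text: str) -> list[str]:
    """Hypothetical prose chunker used only for this example."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


doc = (
    "Intro paragraph.\n\n"
    "| A | B |\n"
    "| --- | --- |\n"
    "| 1 | 2 |\n"
    "| 3 | 4 |\n\n"
    "Closing paragraph.\n"
)
chunks = chunk_text_hybrid_sketch(doc, split_paragraphs)
```

Walking the matches with a cursor keeps the output in document order without a separate sort. One trade-off of treating tables as indivisible is that a very large table yields a chunk larger than the normal window.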

surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py

  • Updated the chunking call site to use chunk_text_hybrid for normal (non-code) documents
  • Code documents still use chunk_text(use_code_chunker=True) as before
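The routing described above might look something like the sketch below. `select_chunks` and the `is_code` flag are hypothetical, and the two chunkers are passed in as parameters with fake stand-ins so the example runs on its own; the real call site in indexing_pipeline_service.py may be structured differently. Only the `use_code_chunker=True` keyword comes from the PR text.

```python
from typing import Callable


def select_chunks(
    text: str,
    is_code: bool,
    chunk_text: Callable[..., list[str]],
    chunk_text_hybrid: Callable[[str], list[str]],
) -> list[str]:
    """Hypothetical dispatcher: code keeps the code chunker, the rest is table-aware."""
    if is_code:
        return chunk_text(text, use_code_chunker=True)
    return chunk_text_hybrid(text)


# Fake chunkers that only label which path was taken.
def fake_chunk_text(text: str, use_code_chunker: bool = False) -> list[str]:
    return [f"code:{text}" if use_code_chunker else f"plain:{text}"]


def fake_chunk_text_hybrid(text: str) -> list[str]:
    return [f"hybrid:{text}"]


code_chunks = select_chunks("def f(): pass", True, fake_chunk_text, fake_chunk_text_hybrid)
doc_chunks = select_chunks("| a | b |", False, fake_chunk_text, fake_chunk_text_hybrid)
```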

Screenshots

| Before | After |
| --- | --- |
| Table rows split across chunks | Each table is one indivisible chunk |

Checklist

  • I have read the contributing guidelines
  • This change does not require a documentation update
  • Backwards-compatible — chunk_text() is unchanged; chunk_text_hybrid() is additive
  • No external dependencies added

Fixes #1334

High-level PR Summary

This PR introduces a table-aware text chunking mechanism to prevent Markdown tables from being split mid-row during document ingestion. A new chunk_text_hybrid() function detects Markdown table blocks using regex, emits each complete table as an indivisible chunk, and processes surrounding prose text through the existing chunker. The indexing pipeline now routes non-code documents through this hybrid chunker by default, ensuring table data remains intact for RAG retrieval while preserving the original chunking behavior for code documents.

⏱️ Estimated Review Time: 5-15 minutes

💡 Review Order Suggestion
| Order | File Path |
| --- | --- |
| 1 | surfsense_backend/app/indexing_pipeline/document_chunker.py |
| 2 | surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py |

vercel Bot commented May 5, 2026

@guangyang1206 is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.


coderabbitai Bot commented May 5, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

MODSetter merged commit 489dd0a into MODSetter:dev on May 5, 2026
4 of 11 checks passed
2 participants