feat(chunker): add table-aware chunk_text_hybrid to prevent mid-row table splits by guangyang1206 · Pull Request #1348 · MODSetter/SurfSense

guangyang1206 · 2026-05-05T11:15:00Z

…able splits

document_chunker currently splits Markdown tables mid-row when a table is larger than a single chunk window, producing garbled rows that are useless for RAG retrieval (issue #1334).

Changes:

document_chunker.py: add chunk_text_hybrid() that detects Markdown table blocks with a regex, emits each table as an indivisible single chunk, and feeds the surrounding prose through the normal chunk_text() chunker.
indexing_pipeline_service.py: route normal (non-code) documents through chunk_text_hybrid instead of chunk_text so tables are protected by default.

Fixes #1334

Description

This PR adds a chunk_text_hybrid() function to the document chunker and wires it into the indexing pipeline so that Markdown tables are never split mid-row during document ingestion.

The original chunk_text() function treats the entire document as a flat character stream, so a Markdown table is often cut mid-row when it does not fit in a single chunk window. The resulting rows are garbled and incomplete, which makes them useless for RAG retrieval.
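To make the failure mode concrete, here is a minimal sketch. `naive_chunks` is a hypothetical fixed-window splitter used only for illustration; it is not the project's chunk_text(), whose internals are not shown in this PR.

```python
def naive_chunks(text: str, window: int) -> list[str]:
    """Hypothetical stand-in for a flat, character-based chunker."""
    return [text[i:i + window] for i in range(0, len(text), window)]


table = (
    "| name | qty |\n"
    "| --- | --- |\n"
    "| apple | 3 |\n"
    "| pear | 5 |\n"
)

# A window smaller than the table forces a cut somewhere inside it.
chunks = naive_chunks(table, window=35)
for chunk in chunks:
    print(repr(chunk))
```

With this window the first chunk ends partway through the "apple" row, so neither chunk contains that row intact.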

Changes:

`surfsense_backend/app/indexing_pipeline/document_chunker.py`

Added _TABLE_BLOCK_RE regex to detect Markdown table blocks (header + separator + rows)
Added chunk_text_hybrid(text):
1. Finds all Markdown table blocks in the document
2. Emits each table as a single, indivisible chunk
3. Passes non-table prose segments through the existing chunk_text() chunker
4. Returns chunks in document order
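The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the regex pattern is a guess at what a "header + separator + rows" matcher could look like, and `chunk_text_hybrid_sketch`, `chunk_prose`, and `split_paragraphs` are hypothetical names (the prose chunker is passed in so the example runs without the SurfSense backend).

```python
import re
from typing import Callable

# Assumed pattern, not the PR's actual _TABLE_BLOCK_RE:
# a header row, a separator row, then zero or more body rows.
_TABLE_BLOCK_RE = re.compile(
    r"^[ \t]*\|.*\|[ \t]*\n"                          # header row
    r"[ \t]*\|[ \t:|-]*-[ \t:|-]*\|[ \t]*(?:\n|$)"    # separator row
    r"(?:[ \t]*\|.*\|[ \t]*(?:\n|$))*",               # body rows
    re.MULTILINE,
)


def chunk_text_hybrid_sketch(
    text: str, chunk_prose: Callable[[str], list[str]]
) -> list[str]:
    """Emit each table whole; send the prose between tables to chunk_prose."""
    chunks: list[str] = []
    cursor = 0
    for match in _TABLE_BLOCK_RE.finditer(text):
        prose = text[cursor:match.start()]
        if prose.strip():
            chunks.extend(chunk_prose(prose))
        chunks.append(match.group(0).strip())  # the table, never split
        cursor = match.end()
    tail = text[cursor:]
    if tail.strip():
        chunks.extend(chunk_prose(tail))
    return chunks  # document order is preserved by construction


def split_paragraphs(prose: str) -> list[str]:
    """Hypothetical prose chunker: one chunk per non-empty paragraph."""
    return [p.strip() for p in prose.split("\n\n") if p.strip()]


doc = (
    "Intro paragraph.\n\n"
    "| name | qty |\n"
    "| --- | --- |\n"
    "| apple | 3 |\n"
    "| pear | 5 |\n"
    "\nClosing paragraph.\n"
)
chunks = chunk_text_hybrid_sketch(doc, split_paragraphs)
```

One consequence of this design is that a table chunk can exceed the normal chunk size, since it is never subdivided.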

`surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py`

Updated the chunking call site to use chunk_text_hybrid for normal (non-code) documents
Code documents still use chunk_text(use_code_chunker=True) as before
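The routing described here might look like the sketch below. The `is_code` flag and the `chunk_document` wrapper are assumptions made for illustration; the PR text only states that code documents keep using chunk_text(use_code_chunker=True) and everything else goes through chunk_text_hybrid. The two `fake_*` functions are stubs so the example runs on its own.

```python
def chunk_document(text, *, is_code, chunk_text, chunk_text_hybrid):
    """Hypothetical call-site wrapper: pick a chunker by document type."""
    if is_code:
        # Code documents keep the existing code-aware path.
        return chunk_text(text, use_code_chunker=True)
    # Everything else gets table protection by default.
    return chunk_text_hybrid(text)


# Stubs standing in for the real chunkers, for demonstration only.
def fake_chunk_text(text, use_code_chunker=False):
    return [("code" if use_code_chunker else "prose", text)]


def fake_chunk_text_hybrid(text):
    return [("hybrid", text)]


code_result = chunk_document(
    "def f(): pass",
    is_code=True,
    chunk_text=fake_chunk_text,
    chunk_text_hybrid=fake_chunk_text_hybrid,
)
doc_result = chunk_document(
    "| a | b |",
    is_code=False,
    chunk_text=fake_chunk_text,
    chunk_text_hybrid=fake_chunk_text_hybrid,
)
```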

Screenshots

Before	After
Table rows split across chunks	Each table is one indivisible chunk

Checklist

I have read the contributing guidelines
This change does not require a documentation update
Backwards-compatible — chunk_text() is unchanged; chunk_text_hybrid() is additive
No external dependencies added

Fixes #1334

High-level PR Summary

This PR introduces a table-aware text chunking mechanism to prevent Markdown tables from being split mid-row during document ingestion. A new chunk_text_hybrid() function detects Markdown table blocks using regex, emits each complete table as an indivisible chunk, and processes surrounding prose text through the existing chunker. The indexing pipeline now routes non-code documents through this hybrid chunker by default, ensuring table data remains intact for RAG retrieval while preserving the original chunking behavior for code documents.

⏱️ Estimated Review Time: 5-15 minutes

💡 Review Order Suggestion

Order	File Path
1	`surfsense_backend/app/indexing_pipeline/document_chunker.py`
2	`surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py`

vercel · 2026-05-05T11:15:04Z

@guangyang1206 is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-05-05T11:15:07Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8c8e41ff-3828-4f85-a00e-ad14ab9baa3f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

MODSetter merged commit 489dd0a into MODSetter:dev May 5, 2026
4 of 11 checks passed

MODSetter merged 1 commit into MODSetter:dev from guangyang1206:feat/document-chunker-table-aware-hybrid-1334
