fix: prevent Document nodes from bypassing chunking pipeline by washi4 · Pull Request #509 · HKUDS/DeepTutor

washi4 · 2026-05-23T10:07:56Z

Summary

Fixes a bug where all documents bypass SentenceSplitter during KB indexing, causing full-document text (1M+ chars) to be sent to the embedding API and triggering HTTP 400 input-length errors.

Root Cause

LlamaIndex's Document class inherits from BaseNode. The previous filtering logic:

text_documents = [doc for doc in documents if not isinstance(doc, BaseNode)]

classified every Document as a pre-embedded node, skipping the chunking pipeline entirely.
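
The effect of that filter can be reproduced in isolation. The sketch below uses minimal stand-in classes (hypothetical, not the real LlamaIndex classes) that only mirror the inheritance relationship described above, so it runs without LlamaIndex installed:

```python
# Stand-in classes that only mimic the inheritance described in this PR.
# They are NOT the real llama_index classes.
class BaseNode:
    def __init__(self, text="", embedding=None):
        self.text = text
        self.embedding = embedding


class Document(BaseNode):
    """Plays the role of a Document, which is itself a BaseNode."""


documents = [
    Document(text="a very long PDF body ..."),
    Document(text="another file"),
]

# The previous filter: intended to keep only items that still need chunking.
text_documents = [doc for doc in documents if not isinstance(doc, BaseNode)]

# Every Document is a BaseNode, so nothing is left for the splitter.
print(len(text_documents))  # 0
```

Because the list is empty, no document reaches the splitter, which matches the symptom reported in the Summary.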

Fix

Replace the naive isinstance(doc, BaseNode) check with _has_precomputed_embedding() that:

- Always treats Document instances as needing chunking
- Only skips nodes that are non-Document BaseNode subclasses and already carry an embedding vector (e.g. ImageNode from multimodal loaders)

Testing

Verified end-to-end: KB creation with a 1M+ char PDF now correctly chunks into 512-token segments and embeds successfully via DashScope text-embedding-v3.

LlamaIndex's Document class inherits from BaseNode, so the previous isinstance(doc, BaseNode) check classified all Documents as pre-embedded nodes, skipping SentenceSplitter entirely. This sent full-document text (1M+ chars) to the embedding API, triggering HTTP 400 input-length errors. Replace the type check with an explicit embedding-presence check that correctly distinguishes pre-embedded nodes (e.g. ImageNode from multimodal loaders) from regular Documents that still require splitting.

pancacake · 2026-05-27T12:56:49Z

Thanks for your contribution!

Security: lock down the TutorBot tool sandbox (shell exec is opt-in, all filesystem/shell access confined to the bot workspace) and isolate per-user resources, closing #518, #517, #516, #515, #514 and #506 (first hardened in #507). Bug fixes: chat input disabled after the first turn (#520), KB embedding failure on long documents (#521 / #509), profile creation under Docker (#512 / #513), Qwen reasoning models failing native tool calling (#527 / #528), the GPT-5 init-wizard token parameter (#508), and oversized session-event truncation (#524). Features: HTTP/SSE API for multi-turn chat with a specific TutorBot (#511), multimodal image fallback for vision-capable providers without a capability entry, safe ZIP knowledge upload, and a /settings/network page with model fetching (community PRs #522 and #523 reimplemented locally). Also bumps __version__ to 1.4.1, adds the v1.4.1 release notes, updates the README Releases section, and ships the Astro + Starlight docs site under site/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pancacake merged commit 5c0af99 into HKUDS:dev May 27, 2026

pancacake merged 1 commit into HKUDS:dev from washi4:fix/ingestion-chunking-bypass
