fix: prevent Document nodes from bypassing chunking pipeline #509

Merged
pancacake merged 1 commit into HKUDS:dev from washi4:fix/ingestion-chunking-bypass on May 27, 2026
@washi4 (Contributor) commented May 23, 2026

Summary

Fixes a bug where all documents bypass SentenceSplitter during KB indexing, causing full-document text (1M+ chars) to be sent to the embedding API and triggering HTTP 400 input-length errors.

Root Cause

LlamaIndex's Document class inherits from BaseNode. The previous filtering logic:

text_documents = [doc for doc in documents if not isinstance(doc, BaseNode)]

classified every Document as a pre-embedded node, skipping the chunking pipeline entirely.
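The failure mode can be reproduced without LlamaIndex installed. The sketch below uses stand-in `BaseNode` and `Document` classes that only mirror the inheritance relationship described above; they are illustrative assumptions, not the real library classes.

```python
# Stand-in classes that mirror only the inheritance described above.
# They are NOT the real llama_index classes.
class BaseNode:
    def __init__(self, text="", embedding=None):
        self.text = text
        self.embedding = embedding


class Document(BaseNode):
    pass


documents = [Document(text="a" * 1_000_000), Document(text="short doc")]

# The previous filter: every Document is also a BaseNode, so nothing passes.
text_documents = [doc for doc in documents if not isinstance(doc, BaseNode)]

print(len(text_documents))  # 0: no document is handed to the splitter
```

Because the list of documents to chunk comes back empty, each full document text is embedded as a single input.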

Fix

Replace the naive isinstance(doc, BaseNode) check with a _has_precomputed_embedding() helper that:

  1. Always treats Document instances as needing chunking
  2. Only skips nodes that are non-Document BaseNode subclasses and already carry an embedding vector (e.g. ImageNode from multimodal loaders)

Testing

Verified end-to-end: KB creation with a 1M+ char PDF now correctly chunks into 512-token segments and embeds successfully via DashScope text-embedding-v3.

LlamaIndex's Document class inherits from BaseNode, so the previous
isinstance(doc, BaseNode) check classified all Documents as pre-embedded
nodes, skipping SentenceSplitter entirely. This sent full-document text
(1M+ chars) to the embedding API, triggering HTTP 400 input-length errors.

Replace the type check with an explicit embedding-presence check that
correctly distinguishes pre-embedded nodes (e.g. ImageNode from multimodal
loaders) from regular Documents that still require splitting.
pancacake merged commit 5c0af99 into HKUDS:dev on May 27, 2026

@pancacake (Collaborator) commented

Thanks for your contribution!

pancacake added a commit that referenced this pull request May 27, 2026
Security: lock down the TutorBot tool sandbox (shell exec is opt-in, all
filesystem/shell access confined to the bot workspace) and isolate per-user
resources, closing #518, #517, #516, #515, #514 and #506 (first hardened in
#507).

Bug fixes: chat input disabled after the first turn (#520), KB embedding
failure on long documents (#521 / #509), profile creation under Docker
(#512 / #513), Qwen reasoning models failing native tool calling (#527 / #528),
the GPT-5 init-wizard token parameter (#508), and oversized session-event
truncation (#524).

Features: HTTP/SSE API for multi-turn chat with a specific TutorBot (#511),
multimodal image fallback for vision-capable providers without a capability
entry, safe ZIP knowledge upload, and a /settings/network page with model
fetching (community PRs #522 and #523 reimplemented locally).

Also bumps __version__ to 1.4.1, adds the v1.4.1 release notes, updates the
README Releases section, and ships the Astro + Starlight docs site under site/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>