fix(rag): guard against None embeddings in LlamaIndex pipeline #347

Merged
pancacake merged 1 commit into HKUDS:dev from kagura-agent:fix/embedding-none-guard
Apr 20, 2026
@kagura-agent
Contributor

Summary

Fixes #346 — RAG queries crash with TypeError: unsupported operand type(s) for *: 'NoneType' and 'float' when a stored embedding vector is None.

Root Cause

When an embedding provider returns {"embedding": null} for a chunk, two things go wrong:

  1. _extract_embeddings_from_response uses item.get("embedding", []) — but dict.get() only returns the default when the key is absent, not when the value is explicitly None. So None passes through.
  2. CustomEmbedding._get_text_embeddings trusts the result without validation, allowing None vectors to be stored in the index and crash np.dot at query time.
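
The `dict.get()` behavior described above can be checked in isolation. The snippet below is a minimal sketch, not code from the PR: the dictionaries are made-up stand-ins for one item of a provider response, and the multiplication only imitates the kind of arithmetic that fails at query time.

```python
# Hypothetical response items; the real provider payload may carry more fields.
item_missing = {}                    # key absent
item_null = {"embedding": None}      # key present, value explicitly null

# dict.get() falls back to the default only when the key is absent.
missing_default = item_missing.get("embedding", [])
null_default = item_null.get("embedding", [])

# "or []" also replaces a falsy value such as None.
null_guarded = item_null.get("embedding") or []

# Multiplying the leaked None by a float raises the TypeError from the report.
try:
    null_default * 0.5
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(missing_default, null_default, null_guarded)
print(error_message)
```

Here `missing_default` and `null_guarded` are empty lists, while `null_default` is still `None`, which is the value that slips through the original code.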

Fix

Two-layer defense:

  1. Adapter layer (openai_compatible.py): Changed item.get("embedding", []) to item.get("embedding") or [] so explicit None values are caught.
  2. Pipeline layer (llamaindex.py): Added post-embed validation in _get_text_embeddings — any None vectors are replaced with zero vectors and logged as errors. This prevents silent storage corruption regardless of which adapter is used.
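
The two layers could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged code: `extract_embeddings` and `replace_none_vectors` are hypothetical names, the `data` list is assumed from the usual OpenAI-compatible response shape, and the zero-vector dimension is passed in explicitly because the PR text does not say how it is determined.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def extract_embeddings(response: dict) -> List[List[float]]:
    """Adapter-layer sketch: treat a missing or null 'embedding' as an empty list."""
    # Assumes items live under a "data" key, as in OpenAI-compatible responses.
    return [item.get("embedding") or [] for item in response.get("data", [])]


def replace_none_vectors(
    vectors: List[Optional[List[float]]], dim: int
) -> List[List[float]]:
    """Pipeline-layer sketch: swap None vectors for zero vectors and log each one."""
    cleaned = []
    for index, vector in enumerate(vectors):
        if vector is None:
            # Log loudly so the upstream provider problem is not hidden.
            logger.error("Embedding %d is None; substituting a zero vector", index)
            cleaned.append([0.0] * dim)
        else:
            cleaned.append(vector)
    return cleaned


adapter_result = extract_embeddings(
    {"data": [{"embedding": None}, {"embedding": [1.0, 2.0]}]}
)
pipeline_result = replace_none_vectors([None, [1.0, 2.0]], dim=2)
```

The second function checks only for `None`, matching the description above. The adapter sketch turns `None` into an empty list instead, and how the real pipeline handles empty vectors is not stated in this PR.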

Testing

  • Added unit test for None embedding extraction in test_extract_embeddings.py
  • All 22 extraction tests pass
  • All 45 passing embedding/RAG tests still pass (10 pre-existing failures due to missing async plugin — unrelated)

…#346)

When an embedding provider returns null for a chunk's embedding vector,
the None value gets stored in the vector index and causes a TypeError
in LlamaIndex's similarity computation (np.dot with NoneType).

Two-layer fix:
1. _extract_embeddings_from_response: use 'or []' instead of
   get(key, default) so explicit None values are caught (get() only
   uses the default when the key is absent, not when it's None).
2. CustomEmbedding._get_text_embeddings: validate the batch result
   and replace any None vectors with zero vectors, logging an error
   to surface the upstream issue.

Closes HKUDS#346
@pancacake pancacake merged commit 509d3ec into HKUDS:dev Apr 20, 2026
2 of 4 checks passed
pancacake added a commit that referenced this pull request Apr 20, 2026
- New `assets/releases/ver1-2-1.md` covering #348 (per-stage chat token
  limits), #349 (Regenerate across CLI/WS/Web UI), the regenerate UI
  harmony polish, and bug fixes #347 / #345 / #352.
- README release-notes block updated to surface v1.2.1 above v1.2.0.

Made-with: Cursor