Empty Embedding Vectors and Size Mismatch in RAG Pipeline #263

@BelovedYaoo

Description:
During execution of the RAG (Retrieval-Augmented Generation) pipeline, several documents are skipped or filtered out because of:

Empty embedding vectors (e.g., documents 5310–5319).
Embedding size mismatch (e.g., VanillaChestLoot.java has an embedding of size 0, but 2048 is expected).

We hope targeted repair functions can be added for these cases. :-)
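To make the request more concrete, here is a rough sketch of what a targeted repair function might look like. It is not the project's actual code: `Doc`, `repair_embeddings`, and `embed_fn` are hypothetical names, and the real document class and embedder call will differ. The idea is to re-embed only the chunks whose vector is empty or has the wrong size, and to report the ones that still fail.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a stored chunk; not the project's real class."""
    text: str
    vector: List[float] = field(default_factory=list)
    file_path: str = ""


def repair_embeddings(
    docs: List[Doc],
    embed_fn: Callable[[str], List[float]],
    expected_size: Optional[int] = None,
) -> List[int]:
    """Re-embed only documents whose vector is empty or has the wrong size.

    Returns the indices of documents that are still invalid after the retry.
    """
    if expected_size is None:
        # Assume the most common non-empty size is the correct one.
        sizes = Counter(len(d.vector) for d in docs if d.vector)
        if not sizes:
            raise ValueError("no document has a non-empty embedding")
        expected_size = sizes.most_common(1)[0][0]

    still_broken: List[int] = []
    for i, doc in enumerate(docs):
        if len(doc.vector) == expected_size:
            continue  # already valid, leave untouched
        try:
            new_vector = list(embed_fn(doc.text))
        except Exception:
            new_vector = []  # treat an embedder failure as "still empty"
        if len(new_vector) == expected_size:
            doc.vector = new_vector
        else:
            still_broken.append(i)
    return still_broken


if __name__ == "__main__":
    # Toy embedder for illustration only: returns nothing for blank text.
    def fake_embed(text: str) -> List[float]:
        return [0.0] * 4 if text.strip() else []

    docs = [Doc("a", [1, 2, 3, 4]), Doc("b"), Doc(""), Doc("c", [1, 2, 3, 4])]
    print(repair_embeddings(docs, fake_embed))
```

Returning the indices that could not be repaired would let the caller decide whether to drop those chunks (as happens now) or surface them to the user.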

Log Excerpt:
2025-07-08 23:38:47,328 - INFO - api.data_pipeline - data_pipeline.py:804 - Loaded 16050 documents from existing database
2025-07-08 23:38:47,328 - INFO - api.rag - rag.py:421 - Loaded 16050 documents for retrieval
2025-07-08 23:38:47,331 - WARNING - api.rag - rag.py:335 - Document 5310 has empty embedding vector, skipping
2025-07-08 23:38:47,332 - WARNING - api.rag - rag.py:335 - Document 5311 has empty embedding vector, skipping
2025-07-08 23:38:47,332 - WARNING - api.rag - rag.py:335 - Document 5312 has empty embedding vector, skipping
2025-07-08 23:38:47,332 - WARNING - api.rag - rag.py:335 - Document 5313 has empty embedding vector, skipping
2025-07-08 23:38:47,332 - WARNING - api.rag - rag.py:335 - Document 5314 has empty embedding vector, skipping
2025-07-08 23:38:47,332 - WARNING - api.rag - rag.py:335 - Document 5315 has empty embedding vector, skipping
2025-07-08 23:38:47,333 - WARNING - api.rag - rag.py:335 - Document 5316 has empty embedding vector, skipping
2025-07-08 23:38:47,333 - WARNING - api.rag - rag.py:335 - Document 5317 has empty embedding vector, skipping
2025-07-08 23:38:47,333 - WARNING - api.rag - rag.py:335 - Document 5318 has empty embedding vector, skipping
2025-07-08 23:38:47,333 - WARNING - api.rag - rag.py:335 - Document 5319 has empty embedding vector, skipping
2025-07-08 23:38:47,337 - INFO - api.rag - rag.py:350 - Target embedding size: 2048 (found in 16040 documents)
2025-07-08 23:38:47,339 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaChestLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,339 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaChestLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,339 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaChestLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,339 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaEntityLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,339 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaEntityLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,340 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaEntityLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,340 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaEntityLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,340 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaFishingLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,340 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaGiftLoot.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,340 - WARNING - api.rag - rag.py:377 - Filtering out document 'src\main\java\net\minecraft\data\loot\packs\VanillaLootTableProvider.java' due to embedding size mismatch: 0 != 2048
2025-07-08 23:38:47,344 - INFO - api.rag - rag.py:384 - Embedding validation complete: 16040/16050 documents have valid embeddings
2025-07-08 23:38:47,344 - WARNING - api.rag - rag.py:390 - Filtered out 10 documents due to embedding issues
2025-07-08 23:38:47,344 - INFO - api.rag - rag.py:429 - Using 16040 documents with valid embeddings for retrieval
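The repeated paths in the excerpt suggest that several chunks of the same file ended up with empty embeddings, and the totals (10 skipped, 10 filtered) suggest both warnings refer to the same 10 chunks. As a small diagnostic aid, the following sketch counts filtered chunks per file from log lines in the format shown above. The function name `count_filtered_chunks` is made up for illustration, and the regular expression assumes the message wording stays exactly as in this log.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the "Filtering out document ..." warning as it appears in the log above.
FILTER_PATTERN = re.compile(
    r"Filtering out document '(?P<path>.+?)' "
    r"due to embedding size mismatch: (?P<got>\d+) != (?P<want>\d+)"
)


def count_filtered_chunks(log_lines: Iterable[str]) -> Counter:
    """Count how many filtered chunks each file path contributed."""
    counts: Counter = Counter()
    for line in log_lines:
        match = FILTER_PATTERN.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical log file name): python count_filtered.py < application.log
    for path, n in count_filtered_chunks(sys.stdin).most_common():
        print(f"{n:3d}  {path}")
```

Knowing which files are affected could help narrow down whether the empty vectors come from the file contents (for example very large chunks) or from a transient embedder failure.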
