Pseudo/tag backfill - indexer base index fast pass, faster index path for unchanged files by voarsh2 · Pull Request #46 · Context-Engine-AI/Context-Engine

voarsh2 · 2025-12-06T01:13:37Z

Bug fixes:

Fixes some bugs in watch_index, namely:

- Avoid deriving a root-level "/work-<hash>" collection in multi-repo mode
- Resolve per-file Qdrant collections via get_collection_for_file for all data ops
- Fix on_deleted and move/rename delete paths to use repo-specific collections instead of the watcher’s default_collection

Improvements:

Pseudo / tags passthrough in `hybrid_search`

Ensure index‑time pseudo and tags metadata are carried through in hybrid_search results:
- In run_hybrid_search, we now extract pseudo / tags from the Qdrant payload/metadata and attach them to each returned item.
- This is a schema enrichment only; the core dense + lexical + heuristic scoring pipeline is unchanged.
Why this matters:
- Downstream consumers (e.g. repo_search rerankers, IDE integrations, MCP clients) can now:
  - Display GLM/LLM‑generated summary labels and tags alongside each candidate,
  - Optionally incorporate pseudo / tags into their own secondary scoring or UI logic.
- This makes it possible to build higher‑level rerankers and UIs that understand index‑time labels without having to re‑query Qdrant for metadata.
Scope:
- No behavior change to base ranking in hybrid_search itself; we intentionally keep pseudo / tags as metadata inputs for higher layers, not as mandatory ranking signals.
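The passthrough described above can be sketched roughly as follows. This is a minimal illustration, not the code in run_hybrid_search: the function name `attach_index_metadata` and the assumption that labels sit either at the payload top level or under a `metadata` key are hypothetical.

```python
from typing import Any, Dict


def attach_index_metadata(hit_payload: Dict[str, Any], item: Dict[str, Any]) -> Dict[str, Any]:
    """Copy index-time pseudo/tags from a Qdrant-style payload onto a result item."""
    # Assumption: labels live at the payload top level or under "metadata".
    metadata = hit_payload.get("metadata") or {}
    pseudo = hit_payload.get("pseudo") or metadata.get("pseudo")
    tags = hit_payload.get("tags") or metadata.get("tags")

    enriched = dict(item)  # leave the caller's item (and its score) untouched
    if pseudo:
        enriched["pseudo"] = str(pseudo)
    if tags:
        enriched["tags"] = [str(t) for t in tags]
    return enriched


payload = {"metadata": {"pseudo": "Watches repo files", "tags": ["watcher", "index"]}}
print(attach_index_metadata(payload, {"path": "scripts/watch_index.py", "score": 0.91}))
```

The score is copied through unchanged, which mirrors the "schema enrichment only" scope: pseudo / tags ride along as extra fields for higher layers to use or ignore.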

repo_search `mode` knob (code_first / balanced / docs_first)

Add an explicit mode knob to the repo_search MCP tool:
- Supported values: code_first, balanced (default), docs_first.
- Plumbed through repo_search → run_hybrid_search (in scripts/hybrid_search.py) and exposed via repo_search_compat.
Make run_hybrid_search implementation/doc weighting mode‑aware:
- code_first (default / no mode):
  - Full IMPLEMENTATION_BOOST.
  - Full DOCUMENTATION_PENALTY.
  - Slight TEST_FILE_PENALTY to prefer implementation over tests.
- balanced:
  - Keeps implementation boost.
  - Halves the structural doc penalty to let strong docs compete more fairly when they’re clearly relevant.
- docs_first:
  - Reduces implementation boost.
  - Disables structural documentation penalty so docs can surface before code when they’re equally relevant.
- Documentation penalties are strictly structural (README/docs/.md/etc), never query‑phrase based.
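A rough sketch of the mode-aware weighting listed above. The constant names come from this PR, but their numeric values, the 0.5 reduction factor for docs_first, the test-file handling outside code_first, and the path heuristics are all placeholders chosen for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Placeholder values; the real constants live in scripts/hybrid_search.py.
IMPLEMENTATION_BOOST = 0.20
DOCUMENTATION_PENALTY = 0.10
TEST_FILE_PENALTY = 0.05


def mode_weights(mode: Optional[str]) -> Tuple[float, float, float]:
    """Return (impl_boost, doc_penalty, test_penalty) for a mode."""
    if mode == "balanced":
        return IMPLEMENTATION_BOOST, DOCUMENTATION_PENALTY / 2, 0.0
    if mode == "docs_first":
        # "Reduces implementation boost": the 0.5 factor is an assumption.
        return IMPLEMENTATION_BOOST * 0.5, 0.0, 0.0
    # code_first, or no mode passed to the hybrid search layer.
    return IMPLEMENTATION_BOOST, DOCUMENTATION_PENALTY, TEST_FILE_PENALTY


def is_doc_path(path: str) -> bool:
    """Purely structural check: README, docs/ directory, or .md suffix."""
    p = PurePosixPath(path.lower())
    return p.suffix == ".md" or p.name.startswith("readme") or "docs" in p.parts[:-1]


def is_test_path(path: str) -> bool:
    p = PurePosixPath(path.lower())
    return p.name.startswith("test_") or "tests" in p.parts[:-1]


def adjust_score(score: float, path: str, mode: Optional[str] = None) -> float:
    boost, doc_penalty, test_penalty = mode_weights(mode)
    if is_doc_path(path):
        return score - doc_penalty
    if is_test_path(path):
        return score - test_penalty
    return score + boost
```

Note that `is_doc_path` looks only at the path, never at the query text, which is the "strictly structural" property called out above.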
MCP‑side, keep result shaping mode‑aware in repo_search:
- code_first groups core implementation code, other code, then docs when ranking.
- docs_first inverts that: docs first, then implementation/test code.
- A simple post‑processing shim ensures at least N “core code” hits appear in the top‑K for code_first, tunable via:
  - REPO_SEARCH_CODE_FIRST_MIN_CORE
  - REPO_SEARCH_CODE_FIRST_TOP_K
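One way the code_first shim could work is sketched below. The environment variable names are from this PR; the function name, the default values, and the exact promote/demote policy are assumptions, not the merged implementation.

```python
import os
from typing import Callable, List


def ensure_min_core(results: List[dict], is_core: Callable[[dict], bool],
                    min_core: int, top_k: int) -> List[dict]:
    """Promote core-code hits from below the top-K until the head holds min_core of them."""
    min_core = min(min_core, top_k)
    head, tail = list(results[:top_k]), list(results[top_k:])
    need = min_core - sum(1 for r in head if is_core(r))
    if need <= 0:
        return list(results)

    promoted = [r for r in tail if is_core(r)][:need]
    promoted_ids = {id(r) for r in promoted}
    rest_tail = [r for r in tail if id(r) not in promoted_ids]

    # Demote the lowest-ranked non-core hits to make room in the head.
    demoted: List[dict] = []
    for i in range(len(head) - 1, -1, -1):
        if len(demoted) == len(promoted):
            break
        if not is_core(head[i]):
            demoted.append(head.pop(i))
    demoted.reverse()  # keep the demoted hits in their original relative order
    return head + promoted + demoted + rest_tail


def shim_settings() -> tuple:
    # Defaults here are placeholders; only the variable names come from the PR.
    min_core = int(os.environ.get("REPO_SEARCH_CODE_FIRST_MIN_CORE", "2"))
    top_k = int(os.environ.get("REPO_SEARCH_CODE_FIRST_TOP_K", "8"))
    return min_core, top_k
```

Hits that are neither promoted nor demoted keep their relative order, so the shim only perturbs the ranking as much as needed to meet the minimum.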
Empirical behavior (manual stress tests):
- Queries with mixed code + docs (e.g. “how does hybrid search work”, “remote upload delta client and watcher”, “context answer tool usage and architecture”):
  - code_first reliably pulls .py implementation files to the top and pushes .md docs down, while still keeping high‑value docs in the tail of the top‑K.
  - docs_first flips the ordering: doc pages (README, MCP_API, CLAUDE.example, GETTING_STARTED, etc.) dominate the top‑K, with implementation files and tests following.
  - The default/balanced mode sits between these two, returning a blended set (a mix of implementation + docs), suitable when the caller doesn’t have a strong preference.
- Purely docs‑oriented queries (e.g. “Kubernetes deployment configuration and documentation”) converge across all three modes, as expected, because there are effectively no strong code competitors.
- Overall, mode behaves as intended:
  - When there is real competition between code and docs, it meaningfully tilts the top‑K toward the selected side.
  - When the corpus is effectively “all code” or “all docs” for a query, it doesn’t introduce noise.
Usage guidance:
- Use mode="code_first" for agents or workflows that need “where is this implemented?” answers, but still want docs as a fallback in the same call.
- Use mode="docs_first" when you primarily want conceptual/usage explanations from documentation and only occasionally need to dive into code.
- For “only code” or “only docs” scenarios, prefer hard filters (path_glob / not_glob) and use mode as a softer preference layer on top.

Perf:

Implements a mechanism to skip re-indexing files based on file size and modification time.

This optimization, enabled by the INDEX_FS_FASTPATH environment variable, significantly speeds up indexing, particularly when dealing with large repositories or frequent re-indexing operations where file contents may not have changed. The logic retrieves file metadata (size and mtime) from the cache and compares it with the current file's metadata. If they match, the file is skipped, avoiding unnecessary re-reading and processing.

The change also updates the cache to store file size and mtime along with the file hash.
When fast-fs is enabled, this commit refreshes the file hash cache with size/mtime information during file skipping. This ensures the cache remains up-to-date even when files are skipped due to unchanged content, and enhances the fast-fs performance.

- Add an INDEX_FS_FASTPATH-gated precheck in index_repo that walks files using cache.json fs metadata (size/mtime) and, when every file matches, exits early before model construction and Qdrant client setup, making true no-change runs much cheaper.
- Leave behavior unchanged for any new/changed/uncached files or entries missing size/mtime metadata (these still fall back to the original full index path).
- Add a TODO above the collection health check to note that these expensive Qdrant probes should eventually be split into a dedicated "health-check-only" mode, so "nothing changed" runs can remain fast while still offering an explicit way to validate collections.
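The size/mtime check and the early-exit precheck described in this section could look roughly like this. It is a sketch under assumptions: the cache layout (`{path: {"size": ..., "mtime": ...}}`), the function names, and the set of truthy values accepted for INDEX_FS_FASTPATH are hypothetical.

```python
import os
from typing import Dict, Iterable


def fs_fastpath_enabled() -> bool:
    # Assumption: common truthy strings enable the feature.
    return os.environ.get("INDEX_FS_FASTPATH", "").strip().lower() in {"1", "true", "yes", "on"}


def is_unchanged(path: str, cache: Dict[str, dict]) -> bool:
    """True only when cached size and mtime both match the file on disk."""
    entry = cache.get(path)
    if not entry or "size" not in entry or "mtime" not in entry:
        return False  # uncached or missing fs metadata: take the full index path
    try:
        st = os.stat(path)
    except OSError:
        return False
    return st.st_size == entry["size"] and int(st.st_mtime) == int(entry["mtime"])


def nothing_changed(paths: Iterable[str], cache: Dict[str, dict]) -> bool:
    """Precheck: every file must match before the run may exit early."""
    return all(is_unchanged(p, cache) for p in paths)
```

A caller would run `nothing_changed(...)` before constructing the embedding model or the Qdrant client and return immediately when it is true. A single mismatch or missing entry falls back to the normal path, which matches the "leave behavior unchanged" bullet.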

- Add an explicit mode knob to the repo_search MCP tool (code_first, balanced, docs_first)
- Plumb mode through repo_search → run_hybrid_search for in-process hybrid search calls
- Make hybrid_search implementation/doc weighting mode-aware:
  - Default/code_first: full IMPLEMENTATION_BOOST and DOCUMENTATION_PENALTY
  - balanced: keep impl boost, halve structural doc penalty
  - docs_first: reduce impl boost and disable structural doc penalty
- Keep documentation penalties purely structural (README/docs/.md/etc) instead of query-phrase based
- Add MCP-side mode-aware reordering in repo_search:
  - Group core implementation code, other code, and docs differently for code_first vs docs_first
  - Implement a code_first post-processing shim to ensure at least N core code hits in the top-K
  - Tunable via REPO_SEARCH_CODE_FIRST_MIN_CORE and REPO_SEARCH_CODE_FIRST_TOP_K
- Thread the mode argument through repo_search_compat so clients can select modes via the compat wrapper

Ensures the hybrid search uses the greater value between the originally requested limit and the rerank_top_n value when reranking is enabled. Also enforces the user-requested limit on the final result set.
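In sketch form, the limit handling amounts to widening the candidate pool for the reranker and trimming afterwards. The function names below are illustrative only.

```python
from typing import List


def candidate_limit(limit: int, rerank_enabled: bool, rerank_top_n: int) -> int:
    """How many candidates to fetch before reranking."""
    return max(limit, rerank_top_n) if rerank_enabled else limit


def finalize(results: List[dict], limit: int) -> List[dict]:
    """Enforce the user-requested limit on the final result set."""
    return results[:limit]
```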

Refactors the core code classification logic for more accurate identification, re-using hybrid_search's helpers when available. This change avoids duplicating extension and path-based heuristics and allows for better mode-aware reordering of search results.

Ensures that pseudo and tag metadata from index time are carried through in hybrid search results. This allows downstream consumers, such as repo search rerankers, to incorporate index-time GLM/LLM labels into their scoring or display logic. It enriches candidate documents with pseudo/tags information when available, improving reranking and search result context.

feat(hybrid-search): auto-scale search parameters for large codebases

Automatically adjust RRF and retrieval parameters based on collection size to maintain search quality at scale (100k-500k+ LOC codebases).

Changes:
- Add _scale_rrf_k(): logarithmic RRF k scaling for better score discrimination
- Add _adaptive_per_query(): sqrt-based candidate retrieval scaling
- Add _normalize_scores(): z-score + sigmoid normalization for compressed distributions
- Add _get_collection_stats(): cached collection size lookup (5-min TTL)
- Apply scaling to both MCP (run_hybrid_search) and CLI paths
- All scaling enabled by default, no configuration required

Scaling behavior:
- Threshold: 10,000 points (configurable via HYBRID_LARGE_THRESHOLD)
- RRF k: 60 → up to 180 (3x max, logarithmic)
- Per-query: 24 → up to 72 (3x max, sqrt scaling)
- Score normalization spreads compressed score ranges

Small codebases (<10k points) are unaffected - parameters unchanged.

=== Large Codebase Scaling Tests ===
- LARGE_COLLECTION_THRESHOLD: 10000
- MAX_RRF_K_SCALE: 3.0
- SCORE_NORMALIZE_ENABLED: True
- Base RRF_K: 60

--- RRF K Scaling ---
- 5000 points -> k=60
- 10000 points -> k=60
- 50000 points -> k=101
- 100000 points -> k=120
- 250000 points -> k=143
- 500000 points -> k=161

--- Per-Query Scaling ---
- 5000 points -> per_query=24 (filtered=24)
- 10000 points -> per_query=24 (filtered=24)
- 50000 points -> per_query=53 (filtered=37)
- 100000 points -> per_query=72 (filtered=53)
- 250000 points -> per_query=72 (filtered=72)
- 500000 points -> per_query=72 (filtered=72)

--- Score Normalization ---
- Before (compressed): [0.5, 0.505, 0.51, 0.495]
- After (spread): [0.4443, 0.5557, 0.6617, 0.3383]
- Range: 0.4950-0.5100 -> 0.3383-0.6617

-------------------
- Collection: codebase
- Points: 16622
- Threshold: 10000
- Scaling Active: True
- RRF K: 60 -> 73 (scale factor: 1.22x)
- Per-query: 24 -> 30 (no filters)
- Per-query: 24 -> 24 (with filters)
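The following sketch shows formulas that reproduce the unfiltered figures in the test output above. They were fitted to those reported numbers and are not taken from the commit itself, so the actual _scale_rrf_k / _adaptive_per_query / _normalize_scores may be written differently. In particular, the 0.5 factor inside the sigmoid and the use of the population standard deviation are inferred, and the filtered per-query variant is not modeled.

```python
import math
from typing import List

LARGE_COLLECTION_THRESHOLD = 10_000
MAX_SCALE = 3.0


def scale_rrf_k(base_k: int, points: int) -> int:
    """Logarithmic scaling: 1x at the threshold, +1x per 10x growth, capped at 3x."""
    if points <= LARGE_COLLECTION_THRESHOLD:
        return base_k
    scale = 1.0 + math.log10(points / LARGE_COLLECTION_THRESHOLD)
    return int(base_k * min(MAX_SCALE, scale))


def adaptive_per_query(base: int, points: int) -> int:
    """Square-root scaling of the per-query candidate count, capped at 3x."""
    if points <= LARGE_COLLECTION_THRESHOLD:
        return base
    scale = math.sqrt(points / LARGE_COLLECTION_THRESHOLD)
    return int(base * min(MAX_SCALE, scale))


def normalize_scores(scores: List[float]) -> List[float]:
    """Z-score each value, then squash with a sigmoid to spread a compressed range."""
    if len(scores) < 2:
        return list(scores)
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    if std == 0:
        return list(scores)
    # The 0.5 factor is inferred from the reported before/after values.
    return [1.0 / (1.0 + math.exp(-0.5 * (s - mean) / std)) for s in scores]


for n in (5_000, 50_000, 100_000, 500_000):
    print(n, scale_rrf_k(60, n), adaptive_per_query(24, n))
```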

Adds a background worker to backfill missing pseudo/tags and lexical vectors in Qdrant. This allows for a two-phase indexing process where base vectors are written first, followed by a background process to enrich them. This is enabled via the `PSEUDO_BACKFILL_ENABLED` environment variable and configured with interval and batch size.

Adds an init container to each indexer service deployment that waits for Qdrant to be available before starting the indexer. This ensures that the indexer does not start processing data before Qdrant is ready to accept connections, preventing potential data loss or corruption.

Adds a debug mode for the pseudo backfill process, enabled via the `PSEUDO_BACKFILL_DEBUG` environment variable. When enabled, the backfill process tracks and reports detailed statistics, including scanned points, GLM calls, the number of filled and updated vectors, and the reasons for skipping vectors, providing insights into the backfill's performance and potential bottlenecks.
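A single backfill pass with the debug counters described above might be structured as follows. Everything here is a hypothetical sketch: the point layout, the `label` callback standing in for the GLM call, and the counter names are assumptions, and the lexical-vector part of the backfill is not modeled.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


def backfill_batch(points: Iterable[dict],
                   label: Callable[[str], Tuple[str, List[str]]],
                   batch_size: int,
                   debug: bool = False) -> Tuple[Dict[object, dict], Dict[str, int]]:
    """Collect pseudo/tags updates for points that lack them; optionally report stats."""
    stats: Counter = Counter()
    updates: Dict[object, dict] = {}
    for point in points:
        if len(updates) >= batch_size:
            break
        stats["scanned"] += 1
        payload = point.get("payload") or {}
        if payload.get("pseudo") and payload.get("tags"):
            stats["skip:already_labeled"] += 1
            continue
        text = payload.get("text", "")
        if not text:
            stats["skip:no_text"] += 1
            continue
        stats["glm_calls"] += 1
        pseudo, tags = label(text)  # stand-in for the GLM/LLM labelling call
        updates[point["id"]] = {"pseudo": pseudo, "tags": tags}
        stats["filled"] += 1
    return updates, (dict(stats) if debug else {})
```

A worker gated on `PSEUDO_BACKFILL_ENABLED` would call this on an interval and write `updates` back to Qdrant; with `PSEUDO_BACKFILL_DEBUG` set it would also log the returned counters.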

voarsh2 · 2025-12-06T04:34:11Z

I am debating the usefulness of the "mode" MCP arg for code_first/docs_first... might rip it out after experimenting some more.

Refactors commit search to incorporate lexical scoring, allowing for ranking of results by relevance when a query is provided. This change replaces the previous strict "all tokens must appear" filter with a field-aware scoring mechanism, enabling the system to identify and prioritize commits that better match the specified behavior phrase. The results are then sorted by score and trimmed to the requested limit.
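The shift from a strict "all tokens must appear" filter to field-aware scoring can be illustrated like this. The field names and weights are hypothetical; only the score-sort-trim shape follows the description.

```python
import re
from typing import Dict, List

# Hypothetical field weights: a match in the subject counts more than one in the body.
FIELD_WEIGHTS: Dict[str, float] = {"subject": 3.0, "files": 2.0, "body": 1.0}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def score_commit(commit: dict, query: str) -> float:
    """Sum weighted counts of distinct query tokens found in each field."""
    query_tokens = set(tokenize(query))
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        value = commit.get(field, "")
        if isinstance(value, list):
            value = " ".join(value)
        score += weight * len(query_tokens & set(tokenize(value)))
    return score


def search_commits(commits: List[dict], query: str, limit: int) -> List[dict]:
    scored = [(score_commit(c, query), c) for c in commits]
    scored = [pair for pair in scored if pair[0] > 0]  # partial matches are kept
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:limit]]
```

Under the old strict filter, a commit matching only one of two query tokens would be dropped; here it is retained with a lower score.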

Ensures related paths are emitted in the appropriate path space (host or container) based on the PATH_EMIT_MODE environment variable. This change introduces a mapping of container paths to host paths to ensure consistent path representation for human-facing interfaces, while preserving container paths for backend usage.
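A minimal sketch of the container-to-host mapping, assuming a single root prefix. `PATH_EMIT_MODE` is the variable named above; the `/work` container root, the `host_root` parameter, and the default of emitting container paths when the variable is unset are assumptions.

```python
import os
from typing import Optional


def emit_path(container_path: str, host_root: str,
              container_root: str = "/work", mode: Optional[str] = None) -> str:
    """Return the path in host or container space depending on the emit mode."""
    mode = (mode or os.environ.get("PATH_EMIT_MODE", "container")).lower()
    if mode != "host":
        return container_path  # backend usage keeps container paths

    root = container_root.rstrip("/")
    if container_path == root or container_path.startswith(root + "/"):
        return host_root.rstrip("/") + container_path[len(root):]
    return container_path  # outside the mapped root: leave untouched
```

For example, `emit_path("/work/repo/a.py", "/home/me/src", mode="host")` yields `/home/me/src/repo/a.py`, while the same call with `mode="container"` returns the input unchanged.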

Implements an optional feature to enhance commit search using vector embeddings. When enabled by a configuration setting, a semantic score is computed for the query and blended with the lexical/lineage score. This commit also lazy-loads the fastembed library and adds a sanitize helper, so environments without these dependencies still function with a pure lexical search.
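The blending and the lazy import can be sketched as below. The blend weight, the function names, and the linear blend itself are assumptions; `TextEmbedding` is fastembed's embedding class, loaded only on demand so that the lexical path works without it.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def blended_score(lexical: float,
                  query_vec: Optional[Sequence[float]],
                  commit_vec: Optional[Sequence[float]],
                  weight: float = 0.3) -> float:
    """Blend lexical and semantic scores; fall back to lexical when vectors are missing."""
    if query_vec is None or commit_vec is None:
        return lexical
    return (1.0 - weight) * lexical + weight * cosine(query_vec, commit_vec)


def load_embedder():
    """Lazy import: returns None when fastembed is not installed."""
    try:
        from fastembed import TextEmbedding
    except ImportError:
        return None
    return TextEmbedding()
```

When `load_embedder()` returns `None`, callers simply pass `None` vectors and every commit keeps its pure lexical score.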

Passes through index-time metadata

Passes through index-time pseudo/tags metadata to downstream consumers. This allows MCP clients, rerankers, and IDEs to optionally incorporate GLM/LLM labels into their scoring or display logic.

Emphasizes the mandatory nature of Qdrant-Indexer tool usage. Reinforces the importance of semantic search with short natural-language queries, discouraging grep/regex syntax.

voarsh and others added 14 commits November 30, 2025 14:23

Improves hybrid search limit handling

eb2bfcf

k8 - when patching images, set qdrant image pull policy to always so …

c09f5da

…there's nothing to patch

Merge branch 'test' into feat/repo-search-ranking-code-first-4

d3b0f79

voarsh added 3 commits December 7, 2025 17:08

voarsh2 changed the title ~~Pseudo / Tag backfill via watcher - indexer base index pass, fast-fs index path for unchanged files~~ Pseudo/tag backfill - indexer base index fast pass, faster index path for unchanged files Dec 7, 2025

voarsh added 2 commits December 7, 2025 18:30

Ref: 58be093

6cac485

docs(claude example): Clarifies Qdrant-Indexer usage rules.

7c1097b

voarsh2 marked this pull request as ready for review December 8, 2025 17:53

m1rl0k merged commit 64c436c into Context-Engine-AI:test Dec 8, 2025
1 check passed

voarsh2 deleted the feat/repo-search-ranking-code-first-4 branch December 10, 2025 04:09

m1rl0k added a commit that referenced this pull request Mar 1, 2026

Merge pull request #46 from voarsh2/feat/repo-search-ranking-code-fir…

7a8ae2a

…st-4 Pseudo/tag backfill - indexer base index fast pass, faster index path for unchanged files

m1rl0k merged 19 commits into Context-Engine-AI:test from voarsh2:feat/repo-search-ranking-code-first-4


3 participants
