Pseudo/tag backfill - indexer base index fast pass, faster index path for unchanged files #46

Merged
m1rl0k merged 19 commits into Context-Engine-AI:test from voarsh2:feat/repo-search-ranking-code-first-4
Dec 8, 2025

voarsh2 (Contributor) commented Dec 6, 2025

Bug fixes:

Fixes some bugs in watch_index, namely:

  • Avoid deriving a root-level "/work-<hash>" collection in multi-repo mode
  • Resolve per-file Qdrant collections via get_collection_for_file for all data ops
  • Fix on_deleted and move/rename delete paths to use repo-specific collections instead of the watcher’s default_collection

Improvements:

Pseudo / tags passthrough in hybrid_search

  • Ensure index‑time pseudo and tags metadata are carried through in hybrid_search results:

    • In run_hybrid_search, we now extract pseudo / tags from the Qdrant payload/metadata and attach them to each returned item.
    • This is a schema enrichment only; the core dense + lexical + heuristic scoring pipeline is unchanged.
  • Why this matters:

    • Downstream consumers (e.g. repo_search rerankers, IDE integrations, MCP clients) can now:
      • Display GLM/LLM‑generated summary labels and tags alongside each candidate,
      • Optionally incorporate pseudo / tags into their own secondary scoring or UI logic.
    • This makes it possible to build higher‑level rerankers and UIs that understand index‑time labels without having to re‑query Qdrant for metadata.
  • Scope:

    • No behavior change to base ranking in hybrid_search itself; we intentionally keep pseudo / tags as metadata inputs for higher layers, not as mandatory ranking signals.
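The passthrough described above can be sketched roughly as follows. This is a minimal illustration, not the code in `run_hybrid_search`: the helper name `attach_index_labels` is hypothetical, and the payload layout (labels either at the top level or under a nested `metadata` dict) is an assumption.

```python
from typing import Any, Dict, Optional


def attach_index_labels(
    item: Dict[str, Any], payload: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Copy index-time pseudo/tags from a payload onto a result item.

    Hypothetical helper: the payload layout (top-level keys or a nested
    "metadata" dict) is assumed for illustration.
    """
    payload = payload or {}
    metadata = payload.get("metadata") or {}
    enriched = dict(item)  # copy, so the caller's item and its score stay untouched

    pseudo = payload.get("pseudo") or metadata.get("pseudo")
    tags = payload.get("tags") or metadata.get("tags")

    # Only attach labels that are actually present; never touch the score.
    if isinstance(pseudo, str) and pseudo.strip():
        enriched["pseudo"] = pseudo.strip()
    if isinstance(tags, (list, tuple)):
        cleaned = [str(t) for t in tags if str(t).strip()]
        if cleaned:
            enriched["tags"] = cleaned
    return enriched
```

Because the helper only adds keys, a consumer that ignores `pseudo` / `tags` sees the same ranking as before.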

repo_search mode knob (code_first / balanced / docs_first)

  • Add an explicit mode knob to the repo_search MCP tool:

    • Supported values: code_first, balanced (default), docs_first.
    • Plumbed through repo_search → run_hybrid_search (in scripts/hybrid_search.py) and exposed via repo_search_compat.
  • Make run_hybrid_search implementation/doc weighting mode‑aware:

    • code_first (default / no mode):
      • Full IMPLEMENTATION_BOOST.
      • Full DOCUMENTATION_PENALTY.
      • Slight TEST_FILE_PENALTY to prefer implementation over tests.
    • balanced:
      • Keeps implementation boost.
      • Halves the structural doc penalty to let strong docs compete more fairly when they’re clearly relevant.
    • docs_first:
      • Reduces implementation boost.
      • Disables structural documentation penalty so docs can surface before code when they’re equally relevant.
    • Documentation penalties are strictly structural (README/docs/.md/etc), never query‑phrase based.
  • MCP‑side, keep result shaping mode‑aware in repo_search:

    • code_first groups core implementation code, other code, then docs when ranking.
    • docs_first inverts that: docs first, then implementation/test code.
    • A simple post‑processing shim ensures at least N “core code” hits appear in the top‑K for code_first, tunable via:
      • REPO_SEARCH_CODE_FIRST_MIN_CORE
      • REPO_SEARCH_CODE_FIRST_TOP_K
  • Empirical behavior (manual stress tests):

    • Queries with mixed code + docs (e.g. “how does hybrid search work”, “remote upload delta client and watcher”, “context answer tool usage and architecture”):
      • code_first reliably pulls .py implementation files to the top and pushes .md docs down, while still keeping high‑value docs in the tail of the top‑K.
      • docs_first flips the ordering: doc pages (README, MCP_API, CLAUDE.example, GETTING_STARTED, etc.) dominate the top‑K, with implementation files and tests following.
      • The default/balanced mode sits between these two, returning a blended set (a mix of implementation + docs), suitable when the caller doesn’t have a strong preference.
    • Purely docs‑oriented queries (e.g. “Kubernetes deployment configuration and documentation”) converge across all three modes, as expected, because there are effectively no strong code competitors.
    • Overall, mode behaves as intended:
      • When there is real competition between code and docs, it meaningfully tilts the top‑K toward the selected side.
      • When the corpus is effectively “all code” or “all docs” for a query, it doesn’t introduce noise.
  • Usage guidance:

    • Use mode="code_first" for agents or workflows that need “where is this implemented?” answers, but still want docs as a fallback in the same call.
    • Use mode="docs_first" when you primarily want conceptual/usage explanations from documentation and only occasionally need to dive into code.
    • For “only code” or “only docs” scenarios, prefer hard filters (path_glob / not_glob) and use mode as a softer preference layer on top.
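To make the mode-dependent weighting above concrete, here is a small sketch. The constant names come from this PR, but their numeric values, the "half" reduction for `docs_first`, the choice to apply the test penalty only in `code_first`, and the path heuristics are all assumptions for illustration; the real logic lives in `scripts/hybrid_search.py`.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder magnitudes; the real values are not stated in this PR.
IMPLEMENTATION_BOOST = 0.20
DOCUMENTATION_PENALTY = 0.10
TEST_FILE_PENALTY = 0.05


@dataclass(frozen=True)
class ModeWeights:
    impl_boost: float
    doc_penalty: float
    test_penalty: float


def weights_for_mode(mode: Optional[str]) -> ModeWeights:
    """Map a repo_search mode to structural weights (illustrative only)."""
    mode = (mode or "").strip().lower()
    if mode == "balanced":
        # Keep the implementation boost, halve the structural doc penalty.
        return ModeWeights(IMPLEMENTATION_BOOST, DOCUMENTATION_PENALTY * 0.5, 0.0)
    if mode == "docs_first":
        # Reduce the boost (assumed: by half) and disable the doc penalty.
        return ModeWeights(IMPLEMENTATION_BOOST * 0.5, 0.0, 0.0)
    # code_first, or no mode at all: full boost and full penalties.
    return ModeWeights(IMPLEMENTATION_BOOST, DOCUMENTATION_PENALTY, TEST_FILE_PENALTY)


def is_structural_doc(path: str) -> bool:
    """Purely structural check (README/docs/.md); never looks at the query."""
    p = path.lower()
    name = p.rsplit("/", 1)[-1]
    return p.endswith((".md", ".rst")) or name.startswith("readme") or "/docs/" in f"/{p}"


def is_test_file(path: str) -> bool:
    p = path.lower()
    return p.rsplit("/", 1)[-1].startswith("test_") or "/tests/" in f"/{p}"


def adjust_score(base: float, path: str, mode: Optional[str]) -> float:
    """Apply the mode's structural adjustment to a base hybrid score."""
    w = weights_for_mode(mode)
    if is_structural_doc(path):
        return base - w.doc_penalty
    if is_test_file(path):
        return base - w.test_penalty
    return base + w.impl_boost
```

In this sketch `docs_first` only narrows the gap between code and docs; putting docs ahead of code is left to the MCP-side grouping in `repo_search`.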

Perf:

Implements a mechanism to skip re-indexing files based on file size and modification time.

This optimization, enabled by the INDEX_FS_FASTPATH environment variable, significantly speeds up indexing, particularly when dealing with large repositories or frequent re-indexing operations where file contents may not have changed. The logic retrieves file metadata (size and mtime) from the cache and compares it with the current file's metadata. If they match, the file is skipped, avoiding unnecessary re-reading and processing.

The change also updates the cache to store file size and mtime along with the file hash.
When fast-fs is enabled, this commit refreshes the file hash cache with size/mtime information during file skipping. This ensures the cache remains up-to-date even when files are skipped due to unchanged content, and enhances the fast-fs performance.

  • Add an INDEX_FS_FASTPATH-gated precheck in index_repo that walks files
    using cache.json fs metadata (size/mtime) and, when every file matches,
    exits early before model construction and Qdrant client setup, making
    true no-change runs much cheaper.
  • Leave behavior unchanged for any new/changed/uncached files or entries
    missing size/mtime metadata (these still fall back to the original
    full index path).
  • Add a TODO above the collection health check to note that these
    expensive Qdrant probes should eventually be split into a dedicated
    "health-check-only" mode, so "nothing changed" runs can remain fast
    while still offering an explicit way to validate collections.
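The size/mtime precheck can be sketched as below. This is an illustration under assumptions, not the code in `index_repo`: the cache layout (a dict keyed by path with `size` and `mtime` fields), the accepted truthy values for `INDEX_FS_FASTPATH`, and the helper names are all hypothetical.

```python
import os
from typing import Any, Dict, Iterable, Optional


def fs_fastpath_enabled() -> bool:
    """Read the INDEX_FS_FASTPATH gate (accepted truthy values are assumed)."""
    value = os.environ.get("INDEX_FS_FASTPATH", "")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def is_unchanged(path: str, cache_entry: Optional[Dict[str, Any]]) -> bool:
    """True only when cached size and mtime both match the file on disk."""
    if not cache_entry:
        return False  # uncached file: fall back to the full index path
    size = cache_entry.get("size")
    mtime = cache_entry.get("mtime")
    if size is None or mtime is None:
        return False  # old cache entry without fs metadata
    try:
        st = os.stat(path)
    except OSError:
        return False  # missing or unreadable file is treated as changed
    return st.st_size == size and int(st.st_mtime) == int(mtime)


def all_files_unchanged(paths: Iterable[str], cache: Dict[str, Dict[str, Any]]) -> bool:
    """Precheck: if every file matches, the run can exit before any heavy setup."""
    return all(is_unchanged(p, cache.get(p)) for p in paths)
```

A caller would run `all_files_unchanged` only when `fs_fastpath_enabled()` is true, and return early before constructing the embedding model or the Qdrant client. Any single mismatch sends the whole run down the original path.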

voarsh and others added 14 commits November 30, 2025 14:23
- Add an explicit mode knob to the repo_search MCP tool (code_first, balanced, docs_first)
- Plumb mode through repo_search → run_hybrid_search for in-process hybrid search calls
- Make hybrid_search implementation/doc weighting mode-aware:
  - Default/code_first: full IMPLEMENTATION_BOOST and DOCUMENTATION_PENALTY
  - balanced: keep impl boost, halve structural doc penalty
  - docs_first: reduce impl boost and disable structural doc penalty
- Keep documentation penalties purely structural (README/docs/.md/etc) instead of query-phrase based
- Add MCP-side mode-aware reordering in repo_search:
  - Group core implementation code, other code, and docs differently for code_first vs docs_first
- Implement a code_first post-processing shim to ensure at least N core code hits in the top-K
  - Tunable via REPO_SEARCH_CODE_FIRST_MIN_CORE and REPO_SEARCH_CODE_FIRST_TOP_K
- Thread the mode argument through repo_search_compat so clients can select modes via the compat wrapper
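One way the code_first post-processing shim could work is sketched below. The function name `ensure_min_core`, the `is_core` callback, and the default values passed to `env_int` are hypothetical; only the two environment variable names come from this PR.

```python
import os
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def env_int(name: str, default: int) -> int:
    """Read an integer setting, falling back to the default on bad input."""
    try:
        return int(os.environ.get(name, default))
    except (TypeError, ValueError):
        return default


def ensure_min_core(
    results: Sequence[T], is_core: Callable[[T], bool], min_core: int, top_k: int
) -> List[T]:
    """Promote core-code hits so at least min_core of them sit in the top-K."""
    ranked = list(results)
    head, tail = ranked[:top_k], ranked[top_k:]
    missing = min_core - sum(1 for r in head if is_core(r))
    while missing > 0:
        # Best-ranked core hit outside the top-K, if any.
        promote = next((i for i, r in enumerate(tail) if is_core(r)), None)
        # Lowest-ranked non-core hit inside the top-K, if any.
        demote = next(
            (i for i in range(len(head) - 1, -1, -1) if not is_core(head[i])), None
        )
        if promote is None or demote is None:
            break  # nothing left to swap
        promoted = tail.pop(promote)
        demoted = head.pop(demote)
        head.append(promoted)
        tail.insert(0, demoted)  # demoted hit lands just below the top-K
        missing -= 1
    return head + tail
```

A caller might read the tunables as `env_int("REPO_SEARCH_CODE_FIRST_MIN_CORE", 2)` and `env_int("REPO_SEARCH_CODE_FIRST_TOP_K", 8)`; those defaults are placeholders, not values from the PR.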
Ensures the hybrid search uses the greater value between the originally requested limit and the rerank_top_n value when reranking is enabled.
Also enforces the user-requested limit on the final result set.
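The limit handling in this commit amounts to widening the candidate pool before reranking and trimming afterwards. A minimal sketch, with hypothetical helper names:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def candidate_limit(limit: int, rerank_enabled: bool, rerank_top_n: int) -> int:
    """How many candidates hybrid search should return before reranking."""
    return max(limit, rerank_top_n) if rerank_enabled else limit


def enforce_limit(results: Sequence[T], limit: int) -> List[T]:
    """Trim the final (possibly reranked) list to what the user asked for."""
    return list(results)[:limit]
```

With `limit=10` and `rerank_top_n=50`, the reranker sees 50 candidates and the caller still receives 10.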
Refactors the core code classification logic for more accurate
identification, re-using hybrid_search's helpers when available.

This change avoids duplicating extension and path-based heuristics
and allows for better mode-aware reordering of search results.
Ensures that pseudo and tag metadata from index time are carried through in hybrid search results.

This allows downstream consumers, such as repo search rerankers, to incorporate index-time GLM/LLM labels into their scoring or display logic. It enriches candidate documents with pseudo/tags information when available, improving reranking and search result context.
feat(hybrid-search): auto-scale search parameters for large codebases

Automatically adjust RRF and retrieval parameters based on collection
size to maintain search quality at scale (100k-500k+ LOC codebases).

Changes:
- Add _scale_rrf_k(): logarithmic RRF k scaling for better score discrimination
- Add _adaptive_per_query(): sqrt-based candidate retrieval scaling
- Add _normalize_scores(): z-score + sigmoid normalization for compressed distributions
- Add _get_collection_stats(): cached collection size lookup (5-min TTL)
- Apply scaling to both MCP (run_hybrid_search) and CLI paths
- All scaling enabled by default, no configuration required

Scaling behavior:
- Threshold: 10,000 points (configurable via HYBRID_LARGE_THRESHOLD)
- RRF k: 60 → up to 180 (3x max, logarithmic)
- Per-query: 24 → up to 72 (3x max, sqrt scaling)
- Score normalization spreads compressed score ranges

Small codebases (<10k points) are unaffected - parameters unchanged.
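For illustration, the following sketch reproduces the scaling figures reported in this commit message. The formulas are inferred from those reported numbers, not copied from `_scale_rrf_k()` or `_adaptive_per_query()`. In particular, halving the effective collection size when filters are present is an assumption that happens to match the reported "filtered" values.

```python
import math

LARGE_COLLECTION_THRESHOLD = 10_000  # reported default for HYBRID_LARGE_THRESHOLD
MAX_SCALE = 3.0                      # reported cap: 3x the base value


def scale_rrf_k(base_k: int, points: int) -> int:
    """Logarithmic RRF k scaling above the threshold, capped at 3x."""
    if points <= LARGE_COLLECTION_THRESHOLD:
        return base_k
    factor = 1.0 + math.log10(points / LARGE_COLLECTION_THRESHOLD)
    return int(base_k * min(factor, MAX_SCALE))


def adaptive_per_query(base: int, points: int, has_filters: bool = False) -> int:
    """Square-root scaling of per-query candidates, capped at 3x."""
    # Assumption: filters halve the effective collection size.
    effective = points / 2 if has_filters else points
    if effective <= LARGE_COLLECTION_THRESHOLD:
        return base
    factor = math.sqrt(effective / LARGE_COLLECTION_THRESHOLD)
    return int(base * min(factor, MAX_SCALE))
```

For the 16,622-point `codebase` collection mentioned in the commit, `scale_rrf_k(60, 16622)` gives 73 and `adaptive_per_query(24, 16622)` gives 30, matching the reported values.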

=== Large Codebase Scaling Tests ===
LARGE_COLLECTION_THRESHOLD: 10000
MAX_RRF_K_SCALE: 3.0
SCORE_NORMALIZE_ENABLED: True
Base RRF_K: 60

--- RRF K Scaling ---
     5000 points -> k=60
    10000 points -> k=60
    50000 points -> k=101
   100000 points -> k=120
   250000 points -> k=143
   500000 points -> k=161

--- Per-Query Scaling ---
     5000 points -> per_query=24 (filtered=24)
    10000 points -> per_query=24 (filtered=24)
    50000 points -> per_query=53 (filtered=37)
   100000 points -> per_query=72 (filtered=53)
   250000 points -> per_query=72 (filtered=72)
   500000 points -> per_query=72 (filtered=72)

--- Score Normalization ---
  Before (compressed): [0.5, 0.505, 0.51, 0.495]
  After (spread):      [0.4443, 0.5557, 0.6617, 0.3383]
  Range: 0.4950-0.5100 -> 0.3383-0.6617

-------------------
Collection: codebase
Points: 16622
Threshold: 10000
Scaling Active: True

RRF K: 60 -> 73 (scale factor: 1.22x)
Per-query: 24 -> 30 (no filters)
Per-query: 24 -> 24 (with filters)
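The score normalization figures in the test output above are consistent with a z-score (population standard deviation) passed through a sigmoid with a slope of 0.5. The sketch below is inferred from those numbers and is not taken from `_normalize_scores()`; the behavior for an all-equal input is an assumption.

```python
import math
from typing import List, Sequence


def normalize_scores(scores: Sequence[float]) -> List[float]:
    """Spread a compressed score range using z-score + sigmoid."""
    if not scores:
        return []
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(variance)
    if std == 0.0:
        # Assumption: identical scores all map to the sigmoid midpoint.
        return [0.5] * len(scores)
    # The 0.5 slope is inferred from the reported before/after values.
    return [1.0 / (1.0 + math.exp(-0.5 * (s - mean) / std)) for s in scores]
```

The transform is monotonic, so it changes the spacing of scores but never their order.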
- Avoid deriving a root-level "/work-<hash>" collection in multi-repo mode
- Resolve per-file Qdrant collections via get_collection_for_file for all data ops
- Fix on_deleted and move/rename delete paths to use repo-specific collections instead of the watcher’s default_collection
Adds a background worker to backfill missing pseudo/tags and lexical vectors in Qdrant.

This allows for a two-phase indexing process where base vectors are written first, followed by a background process to enrich them. This is enabled via the `PSEUDO_BACKFILL_ENABLED` environment variable and configured with an interval and a batch size.
Adds an init container to each indexer service deployment
that waits for Qdrant to be available before starting the
indexer. This ensures that the indexer does not start
processing data before Qdrant is ready to accept connections,
preventing potential data loss or corruption.
Adds a debug mode for the pseudo backfill process, enabled via the `PSEUDO_BACKFILL_DEBUG` environment variable.

When enabled, the backfill process tracks and reports detailed statistics, including scanned points, GLM calls, the number of filled and updated vectors, and the reasons for skipping vectors, providing insights into the backfill's performance and potential bottlenecks.
Implements a mechanism to skip re-indexing files based on file size and modification time.

This optimization, enabled by the `INDEX_FS_FASTPATH` environment variable, significantly speeds up indexing, particularly when dealing with large repositories or frequent re-indexing operations where file contents may not have changed. The logic retrieves file metadata (size and mtime) from the cache and compares it with the current file's metadata. If they match, the file is skipped, avoiding unnecessary re-reading and processing.

The change also updates the cache to store file size and mtime along with the file hash.
When fast-fs is enabled, this commit refreshes the file hash
cache with size/mtime information during file skipping. This
ensures the cache remains up-to-date even when files are
skipped due to unchanged content, and enhances the fast-fs
performance.
- Add an INDEX_FS_FASTPATH-gated precheck in index_repo that walks files
  using cache.json fs metadata (size/mtime) and, when every file matches,
  exits early before model construction and Qdrant client setup, making
  true no-change runs much cheaper.
- Leave behavior unchanged for any new/changed/uncached files or entries
  missing size/mtime metadata (these still fall back to the original
  full index path).
- Add a TODO above the collection health check to note that these
  expensive Qdrant probes should eventually be split into a dedicated
  "health-check-only" mode, so "nothing changed" runs can remain fast
  while still offering an explicit way to validate collections.
voarsh2 (Contributor, Author) commented Dec 6, 2025

I am debating the usefulness of the "mode" MCP arg for code_first/docs_first... might rip it out after experimenting some more.

Refactors commit search to incorporate lexical scoring,
allowing for ranking of results by relevance when a query is provided.

This change replaces the previous strict "all tokens must appear" filter
with a field-aware scoring mechanism, enabling the system to identify
and prioritize commits that better match the specified behavior phrase.
The results are then sorted by score and trimmed to the requested limit.
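A field-aware lexical scorer of the kind described here could look like the sketch below. The field names (`subject`, `body`, `files`), their weights, and the function names are hypothetical; the commit message only says that scoring is field-aware and that results are sorted and trimmed.

```python
import re
from typing import Any, Dict, List, Sequence

# Hypothetical weights: a hit in the subject counts more than one in the body.
FIELD_WEIGHTS = {"subject": 3.0, "files": 2.0, "body": 1.0}


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def score_commit(commit: Dict[str, Any], query: str) -> float:
    """Sum weighted counts of distinct query tokens found in each field."""
    query_tokens = set(_tokens(query))
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        value = commit.get(field) or ""
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)
        score += weight * len(query_tokens & set(_tokens(str(value))))
    return score


def search_commits(
    commits: Sequence[Dict[str, Any]], query: str, limit: int
) -> List[Dict[str, Any]]:
    """Rank commits by lexical score; partial matches are kept, not filtered."""
    if not query.strip():
        return list(commits)[:limit]  # no query: keep the original order
    scored = [(score_commit(c, query), c) for c in commits]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:limit]]
```

Unlike an "all tokens must appear" filter, a commit that matches only one query token still appears, just lower in the list.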
Ensures related paths are emitted in the appropriate path space
(host or container) based on the PATH_EMIT_MODE environment variable.

This change introduces a mapping of container paths to host paths
to ensure consistent path representation for human-facing interfaces,
while preserving container paths for backend usage.
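As a rough illustration of the path-space switch, the sketch below maps a container path to a host path when `PATH_EMIT_MODE` is `host`. The function name, the explicit root arguments, and the default of `container` when the variable is unset are assumptions; the commit message does not specify them.

```python
import os
from typing import Optional


def emit_path(
    container_path: str,
    container_root: str,
    host_root: str,
    mode: Optional[str] = None,
) -> str:
    """Return the path in host or container space, depending on the mode."""
    # Assumed default: container paths unless PATH_EMIT_MODE says otherwise.
    mode = (mode or os.environ.get("PATH_EMIT_MODE", "container")).strip().lower()
    if mode != "host":
        return container_path
    root = container_root.rstrip("/")
    # Only rewrite paths that really live under the container root.
    if container_path == root or container_path.startswith(root + "/"):
        return host_root.rstrip("/") + container_path[len(root):]
    return container_path
```

For example, with a container root of `/work` and a host root of `/home/u/src`, `/work/repo/a.py` is emitted as `/home/u/src/repo/a.py` in host mode and unchanged in container mode.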
Implements an optional feature to enhance commit search using vector embeddings.

When enabled by a configuration setting, a semantic score is computed for the query and blended with the lexical/lineage score.

This commit also includes lazy loading of the fastembed library and a sanitize helper to allow environments without these dependencies to still function with a pure lexical search.
@voarsh2 voarsh2 changed the title from "Pseudo / Tag backfill via watcher - indexer base index pass, fast-fs index path for unchanged files" to "Pseudo/tag backfill - indexer base index fast pass, faster index path for unchanged files" on Dec 7, 2025
Passes through index-time metadata

Passes through index-time pseudo/tags metadata to downstream consumers.

This allows MCP clients, rerankers, and IDEs to optionally incorporate GLM/LLM labels into their scoring or display logic.
Emphasizes the mandatory nature of Qdrant-Indexer tool usage.

Reinforces the importance of semantic search with short natural-language queries, discouraging grep/regex syntax.
@voarsh2 voarsh2 marked this pull request as ready for review December 8, 2025 17:53
@m1rl0k m1rl0k merged commit 64c436c into Context-Engine-AI:test Dec 8, 2025
1 check passed
@voarsh2 voarsh2 deleted the feat/repo-search-ranking-code-first-4 branch December 10, 2025 04:09
m1rl0k added a commit that referenced this pull request Mar 1, 2026
…st-4

Pseudo/tag backfill - indexer base index fast pass, faster index path for unchanged files
3 participants