fix(ck-engine): apply path scope before top_k in semantic search#111

Merged
runonthespot merged 1 commit into main from fix/semantic-scoped-search-top-k on May 23, 2026

@runonthespot
Contributor

Problem

`semantic_search_v3` computed cosine similarity for every chunk in the index, sorted them, took the global top_k, and only THEN applied the path filter. With a whole-codebase index plus a narrow `path=` query, the global top_k could be entirely consumed by chunks outside the requested scope — leaving the path filter with nothing to keep and returning an empty result set even though the in-scope file had great matches.

Found via code review (task #10 in the session task list).
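
To make the ordering problem concrete, here is a minimal sketch that contrasts truncating before filtering with filtering before truncating. It is not the `ck-engine` code: `Hit`, `top_k_then_filter`, `filter_then_top_k`, and `sample_hits` are hypothetical names, the scores are made up, and a plain string prefix stands in for the real path check.

```rust
// Hypothetical stand-in for a scored chunk; not the ck-engine type.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    path: &'static str,
    score: f32,
}

/// Pre-fix ordering: rank every chunk, keep the global top_k, then filter by path.
fn top_k_then_filter(mut hits: Vec<Hit>, prefix: &str, top_k: usize) -> Vec<Hit> {
    hits.sort_by(|a, b| b.score.total_cmp(&a.score)); // highest score first
    hits.truncate(top_k);
    hits.retain(|h| h.path.starts_with(prefix));
    hits
}

/// Post-fix ordering: filter by path first, then rank and keep top_k.
fn filter_then_top_k(mut hits: Vec<Hit>, prefix: &str, top_k: usize) -> Vec<Hit> {
    hits.retain(|h| h.path.starts_with(prefix));
    hits.sort_by(|a, b| b.score.total_cmp(&a.score)); // highest score first
    hits.truncate(top_k);
    hits
}

/// Made-up scores: three out-of-scope chunks outrank the one in-scope chunk.
fn sample_hits() -> Vec<Hit> {
    vec![
        Hit { path: "noisy/a.rs", score: 0.95 },
        Hit { path: "noisy/b.rs", score: 0.93 },
        Hit { path: "noisy/c.rs", score: 0.91 },
        Hit { path: "scoped/target.rs", score: 0.80 },
    ]
}

fn main() {
    // The global top 3 are all in noisy/, so the late filter has nothing left to keep.
    let before = top_k_then_filter(sample_hits(), "scoped/", 3);
    assert!(before.is_empty());

    // Filtering first leaves the in-scope chunk to be ranked and returned.
    let after = filter_then_top_k(sample_hits(), "scoped/", 3);
    assert_eq!(after.len(), 1);
    assert_eq!(after[0].path, "scoped/target.rs");
}
```

The in-scope chunk does not need a poor score for the bug to appear; it only needs to rank below `top_k` out-of-scope chunks.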

Fix

  1. Build a `PathScope` (All / Dir / File) once up-front with the target canonicalized once.
  2. Apply scope at sidecar-collection time — before we even read embeddings or compute similarities. Earliest possible filter.
  3. Remove the now-redundant per-result path check from the iteration loop, which had also been re-canonicalizing `options.path` on every iteration.
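
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: only the name `PathScope` and its three variants come from this PR, while the constructor `new`, the method `contains`, and the fallback to the raw path when canonicalization fails are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Sketch of a path scope; the real ck-engine type may differ.
#[derive(Debug, Clone, PartialEq)]
enum PathScope {
    All,
    Dir(PathBuf),
    File(PathBuf),
}

impl PathScope {
    /// Build the scope once, canonicalizing the target a single time.
    /// Assumption: fall back to the raw path if canonicalization fails.
    fn new(target: Option<&Path>) -> Self {
        match target {
            None => PathScope::All,
            Some(p) => {
                let canonical = p.canonicalize().unwrap_or_else(|_| p.to_path_buf());
                if canonical.is_dir() {
                    PathScope::Dir(canonical)
                } else {
                    PathScope::File(canonical)
                }
            }
        }
    }

    /// Cheap per-candidate check, meant to run before any embedding is read.
    /// `candidate` is assumed to be canonical already.
    fn contains(&self, candidate: &Path) -> bool {
        match self {
            PathScope::All => true,
            // Path::starts_with compares whole components, so
            // "/repo/scoped_other" is not inside "/repo/scoped".
            PathScope::Dir(dir) => candidate.starts_with(dir),
            PathScope::File(file) => candidate == file.as_path(),
        }
    }
}

fn main() {
    // No path argument means the whole index is in scope.
    assert_eq!(PathScope::new(None), PathScope::All);

    // Variants are built directly here so the example does not touch the filesystem.
    let dir = PathScope::Dir(PathBuf::from("/repo/scoped"));
    assert!(dir.contains(Path::new("/repo/scoped/target.rs")));
    assert!(!dir.contains(Path::new("/repo/scoped_other/x.rs")));

    let file = PathScope::File(PathBuf::from("/repo/scoped/target.rs"));
    assert!(file.contains(Path::new("/repo/scoped/target.rs")));
    assert!(!file.contains(Path::new("/repo/noisy/a.rs")));
}
```

Because `contains` borrows the already-canonical target, the per-candidate check does no filesystem work. Avoiding that repeated work is the point of hoisting the canonicalization out of the loop.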

Tests

  • 3 `PathScope` unit tests (`All` / `Dir` / `File`). Run under `--no-default-features` too — no embedder dependency.
  • 1 integration regression test `test_scoped_search_does_not_lose_results_to_global_top_k`: indexes 8 noisy files about "database connection" in `noisy/`, plus 1 in-scope file in `scoped/`, then searches scope=`scoped/` with `top_k=3`. Pre-fix: empty result. Post-fix: in-scope file present. Gated on `fastembed` since it needs a real embedder.

Test plan

  • `cargo test -p ck-engine --no-default-features` — 18 passed (was 15)
  • `cargo clippy -p ck-engine --no-default-features --all-targets -- -D warnings` — clean
  • `cargo fmt --all --check` — clean
  • CI green on this PR (fastembed-feature test also passes)

🤖 Generated with Claude Code

semantic_search_v3 was computing the top_k from all chunks in the
index BEFORE applying the requested path filter. With a whole-codebase
index plus a narrow `path=` query, the global top_k could be entirely
consumed by chunks outside the requested scope, leaving the path
filter with nothing to keep and returning an empty result set even
though the in-scope file contained excellent matches.

Two changes:
- Build the scope once up-front (PathScope enum) and filter at the
  sidecar-collection stage. Earliest possible filter, also avoids
  loading and ranking embeddings we'd discard anyway.
- Hoist the canonicalize() of `options.path` out of the per-result
  loop where it was running N times.

Tests:
- 3 PathScope unit tests (All / Dir / File) — run under all features
  including --no-default-features.
- 1 regression test guarding the original symptom: index 8 noisy
  files about TOPIC_A in a sibling directory, plus 1 in-scope file, search the
  scoped dir with top_k=3, assert the result is non-empty and stays
  in scope. Gated on fastembed since it needs a real embedder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@runonthespot runonthespot merged commit d7154a8 into main May 23, 2026
14 checks passed
@runonthespot runonthespot deleted the fix/semantic-scoped-search-top-k branch May 23, 2026 19:21
runonthespot added a commit that referenced this pull request May 23, 2026
Ships the bug fixes and security work merged today:

- #111 fix: scoped semantic search returned [] when global top_k
  was consumed by chunks outside the requested path scope
- #112 security: MCP tool handlers were sandbox-escapable via
  any readable host path; added allowed_roots + canonicalize check
- #106 fix: MCP tool schemas now Gemini-API compatible
  (no more union types in JSON Schema)
- #100 fix: oneshot 0.1.13 patches a use-after-free race
- #99 chore: docs-site + ck-vscode dev-dep bumps

See CHANGELOG.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>