feat(search): hybrid code search — exact + semantic in one call (#67) #75

Merged: Doorman11991 merged 1 commit into master from feat/issue-67-hybrid-search on May 31, 2026

@Doorman11991
Owner

Hybrid code search — exact + semantic in one call

Implements the ideas from #67: a search tool that combines regex/keyword AND semantic search in the same call, over a symbol-aware (AST-ish) local index, fully on-device.

What it does

New hybrid_search tool ("grep on steroids"):

  • Exact matching (regex/keyword) — the precision of grep.
  • Semantic ranking — surfaces code that does what you describe even when it doesn't contain the query words.
  • Symbol-aware chunks — files are split into function/class/method-centered chunks via lightweight definition-boundary detection across JS/TS/Python/Go/Rust/Java/etc., so semantic ranking operates on coherent units.
  • One call, fused score — BM25 + a hashed bag-of-words vector, with a boost when a chunk also matches exactly. Exact+semantic hits and semantic-only hits are marked with different symbols in the output.
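To illustrate the symbol-aware chunking bullet, here is a minimal sketch of definition-boundary detection. It is not the code in src/tools/hybrid_search.js: the pattern list, the `splitIntoChunks` name, and the chunk shape are all assumptions made for illustration.

```javascript
// Hypothetical sketch, not the actual hybrid_search implementation.
// Each pattern matches a line that commonly starts a definition.
const DEFINITION_START = [
  /^\s*(export\s+)?(async\s+)?function\b/, // JS/TS functions
  /^\s*(export\s+)?class\b/,               // JS/TS/Python classes
  /^\s*(async\s+)?def\b/,                  // Python functions and methods
  /^\s*func\b/,                            // Go
  /^\s*(pub\s+)?fn\b/,                     // Rust
];

// Split source text into chunks that each begin at a definition line.
// Anything before the first definition becomes its own leading chunk.
function splitIntoChunks(source) {
  const lines = source.split('\n');
  const chunks = [];
  let start = 0;
  for (let i = 1; i < lines.length; i++) {
    if (DEFINITION_START.some((re) => re.test(lines[i]))) {
      chunks.push({ startLine: start + 1, text: lines.slice(start, i).join('\n') });
      start = i;
    }
  }
  chunks.push({ startLine: start + 1, text: lines.slice(start).join('\n') });
  return chunks;
}
```

A line-based splitter like this is cheap and needs no parser, at the cost of occasionally cutting at a keyword that appears inside a string or comment.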

Modes: hybrid (default), regex, keyword, semantic.
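The fused score described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the engine in src/rag/index_store: the vector size, the FNV-1a bucket hash, the additive fusion, the 1.5 boost, and every function name here are hypothetical.

```javascript
// Hypothetical sketch of score fusion; constants and names are assumptions.
const DIMS = 256;        // size of the hashed bag-of-words vector
const EXACT_BOOST = 1.5; // multiplier when a chunk also matches exactly

function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9_]+/g) || [];
}

// FNV-1a hash, reduced to a bucket index in [0, DIMS).
function bucket(token) {
  let h = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    h ^= token.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % DIMS;
}

function hashedVector(tokens) {
  const v = new Float64Array(DIMS);
  for (const t of tokens) v[bucket(t)] += 1;
  return v;
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Standard BM25 for one document against the tokenized corpus `docs`.
function bm25(queryTokens, docTokens, docs, avgLen, k1 = 1.2, b = 0.75) {
  let score = 0;
  for (const q of new Set(queryTokens)) {
    const tf = docTokens.filter((t) => t === q).length;
    if (!tf) continue;
    const df = docs.filter((d) => d.includes(q)).length;
    const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
    score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * docTokens.length) / avgLen));
  }
  return score;
}

// Rank chunk texts; pass a non-global RegExp (or null) as exactPattern.
function hybridRank(query, chunkTexts, exactPattern) {
  const q = tokenize(query);
  const qVec = hashedVector(q);
  const docs = chunkTexts.map(tokenize);
  const avgLen = docs.reduce((n, d) => n + d.length, 0) / Math.max(docs.length, 1);
  return chunkTexts
    .map((text, i) => {
      const exact = exactPattern ? exactPattern.test(text) : false;
      const base = bm25(q, docs[i], docs, avgLen) + cosine(qVec, hashedVector(docs[i]));
      return { text, exact, score: exact ? base * EXACT_BOOST : base };
    })
    .sort((a, b) => b.score - a.score);
}
```

In this sketch a regex-only mode would amount to filtering on `exact`, and a semantic-only mode to calling `hybridRank` with a null pattern; how the real tool maps its four modes is not shown in this PR description.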

Example (run against this repo):

hybrid_search "restore terminal on suspend"
● src/tui/terminal.js:106 suspend   [score 13.86]
● src/tui/fullscreen.js:758 ...     (Ctrl+Z handler)

None of those lines contain all the query words — semantic ranking finds them.

On the referenced projects

#67 suggested colgrep (Rust + ColBERT multi-vector) and semble (Python + model2vec). Both are great, but they pull in heavy native/Python runtimes and downloaded model weights. SmallCode's whole premise is staying small and fully local with zero external services, so this reuses the existing local hybrid engine (src/rag/index_store) instead of shipping an embedding model — same single-call hybrid ergonomics, no new dependencies, runs instantly on CPU. If someone wants true neural embeddings later, this leaves room to layer a semantic MCP on top.

Changes

  • New: src/tools/hybrid_search.js, test/hybrid_search.test.js (11 cases).
  • Wired into: bin/executor.js (tool case, path-contained via safeResolvePath), bin/tools.js (schema), src/compiled/tool_router.js + src/tools/two_stage_router.js (search/code-intel categories), src/tools/dedup.js (marked pure).
  • Docs: README feature section + CHANGELOG.
  • Tunables: SMALLCODE_HYBRID_MAX_FILES (1500), SMALLCODE_HYBRID_MAX_BYTES (512KiB).
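For readers wondering how the two tunables might be consumed, a minimal sketch follows. The variable names and defaults come from the list above; the `envInt` and `loadLimits` helpers are hypothetical, and whether the byte limit applies per file or in total is not stated here.

```javascript
// Hypothetical helper: parse a positive-integer environment variable,
// falling back to the default when it is unset or invalid.
function envInt(name, fallback, env = process.env) {
  const n = Number.parseInt(env[name] ?? '', 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Defaults taken from the PR description: 1500 files, 512 KiB.
function loadLimits(env = process.env) {
  return {
    maxFiles: envInt('SMALLCODE_HYBRID_MAX_FILES', 1500, env),
    maxBytes: envInt('SMALLCODE_HYBRID_MAX_BYTES', 512 * 1024, env),
  };
}
```

For example, running with `SMALLCODE_HYBRID_MAX_FILES=200` set in the environment would make `loadLimits().maxFiles` equal to 200 in this sketch.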

Testing

Full suite 313 passing (node --test test/*.test.js, excluding the pre-existing environment-dependent shell_session.test.js). Build clean.

Closes #67.

Adds a hybrid_search tool that fuses exact regex/keyword matching with semantic ranking over a symbol-aware local index, in a single call. Finds code that does what you describe even without the query words, and ranks grep-style hits by relevance.

New src/tools/hybrid_search.js reuses the existing local BM25 + hashed-vector engine (src/rag/index_store) — fully offline, zero model downloads, no native runtime, no external services. Modes: hybrid (default), regex, keyword, semantic. Wired into executor, tool schemas, search/code-intel routing, and dedup; path args contained via safeResolvePath. Inspired by colgrep and semble (issue #67), kept dependency-free to match SmallCode's local-first design. 11 new tests; full suite 313 passing.
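The behavior of safeResolvePath is not shown in this PR description. As a generic illustration of what containing a path argument usually means, here is a sketch with a hypothetical `resolveInside` helper; it does not follow symlinks, which a production check would also need to consider.

```javascript
const path = require('node:path');

// Hypothetical containment check, not the project's safeResolvePath.
// Resolves userPath against root and rejects anything that escapes root.
function resolveInside(root, userPath) {
  const absRoot = path.resolve(root);
  const target = path.resolve(absRoot, userPath);
  const rel = path.relative(absRoot, target);
  const escapes = rel === '..' || rel.startsWith('..' + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`Path escapes project root: ${userPath}`);
  }
  return target;
}
```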
Doorman11991 merged commit 4a62918 into master on May 31, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

(suggestion) search tool with regex/keyword/semantic capabilities