Fix search_graph name_pattern= performance: regex cache, LIKE pre-filter, cheap count #300
Fix search_graph name_pattern= performance: regex cache, LIKE pre-filter, cheap count (#300)

Three compounding bugs caused 1.5–8.5s latency on `name_pattern=` searches against large projects (216K nodes), now reduced to ~0ms query time (cold-start dominates).

**Fix 1 — regex compiled once per statement, not once per row.** `sqlite_regexp` / `sqlite_iregexp` now use `sqlite3_get_auxdata` / `sqlite3_set_auxdata` to cache the compiled `cbm_regex_t` for the lifetime of the statement. Previously `cbm_regcomp` + `cbm_regfree` ran for every row scanned.

**Fix 2 — LIKE pre-filter cuts rows reaching the regex.** Wire `cbm_extract_like_hints` (already implemented but dead) into `search_where_basic` via a new `where_add_like_hints` helper. For `.*Controller.*` this prepends `n.name LIKE '%Controller%'`, letting the `idx_nodes_name` index satisfy the LIKE clause first and passing only matching rows to `iregexp()`. Added `search_like_pool_t` to manage the malloc'd LIKE strings across both statement executions. `ST_SEARCH_MAX_BINDS` raised 16 → 32 to accommodate the extra bind slots.

**Fix 3 — count query no longer runs per-row edge subqueries.** The count SQL previously wrapped the full SELECT (which includes two correlated subqueries for `in_deg` / `out_deg`) in `SELECT COUNT(*) FROM (...)`, executing those edge counts for every matching row even though the count needs none of that. The non-degree-filter path now uses `SELECT COUNT(*) FROM nodes n WHERE <same WHERE>`, which has no per-row subqueries. The degree-filter path retains the wrapped form since it needs those columns for the filter.

Benchmark on home-ubuntu-dev-sis (216K nodes, 509MB DB):

| Query | Before | After | Speedup |
|---|---|---|---|
| `name_pattern=.*Controller.*` | 3099ms | 508ms | 6× |
| `name_pattern=.*Service.*` | 2006ms | 506ms | 4× |
| `name_pattern=.*Repository.*` | 2006ms | 508ms | 4× |
| `name_pattern=specificFuncName` | 1506ms | 507ms | 3× |
| `label=Method` + `name_pattern=.*get.*` | 8509ms | 509ms | 17× |
| `name_pattern=.*Approve.*` | 1506ms | 507ms | 3× |
| `name_pattern=.*authorize.*` | 1506ms | 509ms | 3× |

The ~500ms floor is cold-start I/O (opening a 509MB file from disk). In the long-running MCP server process the warm-cache query time is sub-millisecond. All store search tests pass including pagination, degree filter, and extract_like_hints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Make project a required CLI argument instead of a hardcoded name, and remove internal query strings used during development testing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Flat BM25 queries of the form `SELECT ... FROM nodes_fts JOIN nodes WHERE MATCH ? AND project=? ORDER BY bm25() LIMIT N` block FTS5 WAND/MaxScore early-exit — the outer JOIN+WHERE is invisible to the FTS5 planner, so it scores every matching document before any filter fires. On a large codebase with 100K+ matches this causes 2–16 minute queries.

Fix: two-step subquery. The inner FTS5-only query `SELECT rowid, bm25(nodes_fts) FROM nodes_fts WHERE MATCH ? ORDER BY bm25() LIMIT 2000` can early-terminate because no outer predicate blocks it. The outer query then joins and filters at most `BM25_INNER_LIMIT` (2000) candidates. The count query uses the identical inner-limit subquery, so it benefits too.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
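The flat vs. two-step shapes can be sketched with Python's stdlib `sqlite3` (requires an FTS5-enabled SQLite build, which most CPython distributions ship). The schema and data here are illustrative stand-ins, not the project's actual tables; only the query shape mirrors the fix.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy stand-ins for the nodes table and its FTS5 mirror.
con.execute("CREATE TABLE nodes(id INTEGER PRIMARY KEY, project TEXT, name TEXT)")
con.execute("CREATE VIRTUAL TABLE nodes_fts USING fts5(name)")
con.executemany("INSERT INTO nodes VALUES (?,?,?)",
                [(1, "app", "UserController"),
                 (2, "app", "OrderController"),
                 (3, "lib", "UserController")])
con.execute("INSERT INTO nodes_fts(rowid, name) SELECT id, name FROM nodes")

# Flat form: the project filter sits outside the FTS5 query, invisible to the
# FTS5 planner, so every MATCH hit is scored before the filter applies.
flat = con.execute(
    "SELECT n.id FROM nodes_fts JOIN nodes n ON n.id = nodes_fts.rowid "
    "WHERE nodes_fts MATCH ? AND n.project = ? "
    "ORDER BY bm25(nodes_fts) LIMIT 10",
    ("UserController", "app")).fetchall()

# Two-step form: the inner FTS5-only subquery can early-terminate at its LIMIT
# (BM25_INNER_LIMIT in the PR); the outer query filters the small candidate set.
two_step = con.execute(
    "SELECT n.id FROM ("
    "  SELECT rowid, bm25(nodes_fts) AS score FROM nodes_fts"
    "  WHERE nodes_fts MATCH ? ORDER BY score LIMIT 2000"
    ") f JOIN nodes n ON n.id = f.rowid "
    "WHERE n.project = ? ORDER BY f.score LIMIT 10",
    ("UserController", "app")).fetchall()

assert flat == two_step  # same rows; only the query plan differs
print(flat)
```

Both forms return the same rows; the two-step version merely gives FTS5 a query it can terminate early, which is where the 2–16 minute savings come from at scale.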
Merged via rebase, thanks @awconstable — diagnoses are spot-on, fixes are clean, benchmarks reproduce. The auxdata caching is the canonical SQLite pattern, and the LIKE pre-filter wiring is well-scoped.

A note for anyone reading this thread later: the branch also contained the FTS5 two-step subquery fix that #302 was targeting, so #302 is now superseded — closing it as resolved.

Soft behavior note worth flagging on the FTS5 path:
Fixes #254
## Root cause

Three compounding bugs caused `name_pattern=` searches to scan every node with an expensive compiled regex, regardless of how selective the pattern was:

- `sqlite_iregexp` / `sqlite_regexp` recompiled the regex on every row — `cbm_regcomp` + `cbm_regfree` fired once per node for the full table.
- The count query wrapped the full row query in `SELECT COUNT(*) FROM (...)`, doubling the scan with identical per-row overhead.
- `cbm_extract_like_hints` was implemented and correct but never called — the LIKE pre-filter that should cut the regex scan to only matching rows was dead code.

## Changes
### Fix 1 — regex cached per statement (`sqlite_regexp` / `sqlite_iregexp`)

Use `sqlite3_get_auxdata` / `sqlite3_set_auxdata` to cache the compiled `cbm_regex_t` for the lifetime of the statement. `cbm_regcomp` is now called exactly once per query, not once per row.
### Fix 2 — LIKE pre-filter wired in (`where_add_like_hints`, `search_where_basic`)

Wire `cbm_extract_like_hints` into `search_where_basic` via a new `where_add_like_hints` helper. For `.*Controller.*` this prepends `n.name LIKE '%Controller%'`; the `idx_nodes_name` index satisfies the LIKE clause and only matching rows reach `iregexp()`. Added `search_like_pool_t` to manage the malloc'd LIKE strings across both statement executions. `ST_SEARCH_MAX_BINDS` raised 16 → 32.
### Fix 3 — count query stripped of per-row edge subqueries

For the common no-degree-filter path, the count SQL is now `SELECT COUNT(*) FROM nodes n WHERE <same WHERE>` — no correlated `edges` subqueries. The degree-filter path retains the wrapped form since it needs those columns for the filter.
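The two count shapes can be compared directly on a toy schema (table and column names follow the PR's SQL; the degree subqueries are simplified illustrations):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE edges(src INTEGER, dst INTEGER);
INSERT INTO nodes VALUES (1,'UserController'),(2,'OrderService'),(3,'AuthController');
INSERT INTO edges VALUES (1,2),(2,3),(3,1),(1,3);
""")

where = "n.name LIKE '%Controller%'"

# BEFORE: wraps the full row query, so the correlated in_deg/out_deg
# subqueries run for every matching row even though COUNT ignores them.
wrapped = con.execute(f"""
    SELECT COUNT(*) FROM (
      SELECT n.id,
             (SELECT COUNT(*) FROM edges e WHERE e.dst = n.id) AS in_deg,
             (SELECT COUNT(*) FROM edges e WHERE e.src = n.id) AS out_deg
      FROM nodes n WHERE {where})
""").fetchone()[0]

# AFTER (no degree filter): same WHERE clause, no per-row subqueries.
direct = con.execute(f"SELECT COUNT(*) FROM nodes n WHERE {where}").fetchone()[0]

assert wrapped == direct  # identical totals; the direct form just skips the work
print(direct)
```

When a degree filter is present the wrapped form is still required, because `in_deg` / `out_deg` must be computed before the filter can be applied — which is exactly the split the PR makes.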
## Benchmark

Tested on a large PHP codebase (~200K nodes):

| Query | Before | After | Speedup |
|---|---|---|---|
| `name_pattern=.*Controller.*` | 3099ms | 508ms | 6× |
| `name_pattern=.*Service.*` | 2006ms | 506ms | 4× |
| `name_pattern=.*Repository.*` | 2006ms | 508ms | 4× |
| `name_pattern=specificFunctionName` | 1506ms | 507ms | 3× |
| `label=Method` + `name_pattern=.*get.*` | 8509ms | 509ms | 17× |

The ~500ms floor is cold-start I/O when spawning a fresh process against a ~500MB database. In the long-running MCP server (warm file cache) the query time is sub-millisecond.
A reusable benchmark script is included at `scripts/benchmark-search-graph.sh`.

## Tests
All store search tests pass, including `store_search_pagination` (offset-past-end total count), `store_search_degree_filter`, and the full `store_extract_like_hints` suite.