feat: PHP grep backend for post_content LIKE queries (closes #43) #59
Merged
After the Index/Map Architecture (PR #41), post_content is stored as an empty string in SQLite and lazy-loaded from .md files on demand. WHERE post_content LIKE '%foo%' silently matches nothing, breaking WP default search (?s=foo) and any plugin that queries post_content directly.

Fix it without reintroducing the coupling PR #41 removed: grep the .md files on disk instead of rebuilding a full-text index inside SQLite.

Changes:
- New WP_Markdown_Search class encapsulates the grep logic. Iterates _markdown_file_index (post_id -> file_path) and case-insensitively matches the needle against each source file. Per-request cache keyed by lowercased needle.
- Driver intercepts SELECT queries with post_content LIKE clauses and rewrites each one into (table.)?ID IN (1,2,3) based on grep results, or 0=1 when nothing matches. Only the %needle% contains-match shape is rewritten; prefix, suffix, or embedded-wildcard patterns are left untouched for SQLite to handle.
- Extension point: markdown_db_search_matching_ids filter lets an FTS5, Meilisearch, or Elasticsearch backend short-circuit the default grep without patching core.
- 20 pure-PHP smoke tests covering grep correctness, multi-word AND queries, escaped LIKE wildcards, table prefix preservation, and unsupported pattern passthrough. Run via: php tests/smoke-search.php

Closes #43
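The clause rewrite described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `sketch_extract_needle` and `sketch_rewrite_like`, the regular expression, and the `$find_ids` callable are all assumptions introduced here, and the sketch returns the SQL unchanged when there is nothing to rewrite.

```php
<?php
// Hypothetical sketch; none of these names come from the PR itself.

/**
 * Turn the body of a LIKE literal into a plain needle.
 * Returns null unless the literal has the %needle% contains-match shape.
 */
function sketch_extract_needle(string $literal): ?string
{
    if (strlen($literal) < 3 || $literal[0] !== '%' || substr($literal, -1) !== '%') {
        return null;                          // prefix, suffix, or empty pattern
    }
    $inner  = substr($literal, 1, -1);
    $len    = strlen($inner);
    $needle = '';
    for ($i = 0; $i < $len; $i++) {
        $c = $inner[$i];
        if ($c === '\\') {
            if ($i + 1 >= $len) {
                return null;                  // the closing % was itself escaped
            }
            $needle .= $inner[++$i];          // \% \_ \\ \' become the literal character
        } elseif ($c === "'") {
            $needle .= "'";                   // doubled '' becomes a single quote
            $i++;
        } elseif ($c === '%' || $c === '_') {
            return null;                      // embedded wildcard: leave for SQLite
        } else {
            $needle .= $c;
        }
    }
    return $needle === '' ? null : $needle;
}

/**
 * Rewrite every supported post_content LIKE clause into ID IN (...) or 0=1.
 * $find_ids is any callable mapping a needle to a list of post IDs.
 */
function sketch_rewrite_like(string $sql, callable $find_ids): string
{
    // Group 1: optional table prefix; group 2: body of the quoted literal.
    $pattern = <<<'RE'
/(?<![\w.])((?:\w+\.)?)post_content\s+LIKE\s+'((?:[^'\\]|\\.|'')*)'/i
RE;
    $out = preg_replace_callback($pattern, function (array $m) use ($find_ids) {
        $needle = sketch_extract_needle($m[2]);
        if ($needle === null) {
            return $m[0];                     // unsupported shape stays untouched
        }
        $ids = array_map('intval', $find_ids($needle));
        return $ids ? $m[1] . 'ID IN (' . implode(',', $ids) . ')' : '0=1';
    }, $sql);
    return $out ?? $sql;                      // fall back to the input on a regex error
}
```

Because each clause is replaced on its own, a query with two `post_content LIKE` clauses joined by `AND` ends up as two independent `ID IN (...)` or `0=1` conditions, which keeps the AND semantics without extra handling.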
Problem
After the Index/Map Architecture (PR #41), `post_content` is stored as an empty string in SQLite and lazy-loaded from `.md` files on demand. This means:
- `WHERE post_content LIKE '%keyword%'` matches nothing
- WP default search (`?s=keyword`) returns no results
Approach
Rather than rebuild a full-text index inside SQLite, which would duplicate file content back into the DB and reintroduce the coupling PR #41 deliberately removed, grep the `.md` files directly.

The layering matches the PR #41 thesis: SQLite is an index, files are the content store. Search is a content concern, so it honors that boundary.
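As a rough illustration of what grepping the files means here, the sketch below walks a post_id → file_path map and collects the IDs whose file contains the needle. The class name `Sketch_Markdown_Grep` and its method are hypothetical and are not the `WP_Markdown_Search` implementation. The sketch also assumes ASCII-only case folding via `stripos` and treats an empty needle as matching nothing.

```php
<?php
// Hypothetical sketch of the grep step; not the PR's actual class.
final class Sketch_Markdown_Grep
{
    /** @var array<int,string> post_id => path to the .md source file */
    private array $file_index;

    /** @var array<string,int[]> results cached per lowercased needle */
    private array $cache = [];

    public function __construct(array $file_index)
    {
        $this->file_index = $file_index;
    }

    /** Return the IDs of posts whose source file contains $needle. */
    public function matching_ids(string $needle): array
    {
        if ($needle === '') {
            return [];                        // assumption: empty needle matches nothing
        }
        $key = strtolower($needle);
        if (isset($this->cache[$key])) {
            return $this->cache[$key];        // repeat query in the same request
        }
        $ids = [];
        foreach ($this->file_index as $post_id => $path) {
            if (!is_readable($path)) {
                continue;                     // skip stale index entries
            }
            $content = file_get_contents($path);
            if ($content !== false && stripos($content, $needle) !== false) {
                $ids[] = (int) $post_id;
            }
        }
        return $this->cache[$key] = $ids;
    }
}
```

The cache lives on the instance, so it lasts only as long as the object does, which in a typical PHP request lifecycle means one request.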
Changes
- New `WP_Markdown_Search` class: encapsulates grep + rewrite. Iterates `_markdown_file_index` (post_id → file_path) and case-insensitively matches the needle against each source file. A per-request cache keyed by the lowercased needle avoids re-grepping on repeat queries (common during `WP_Query` + `FOUND_ROWS()` cycles).
- Driver interception: `WP_Markdown_Driver::query()` runs the rewriter before handing the SQL to SQLite. Only the `%needle%` contains-match shape is rewritten; prefix (`foo%`), suffix (`%foo`), and embedded-wildcard patterns are left untouched for SQLite to handle natively (matching pre-PR-41 behavior for those edge cases). Table prefixes (`wp_posts.post_content`) are preserved in the rewrite.
- Multi-word AND semantics: `?s=foo bar` generates separate `post_content LIKE` clauses joined by the outer `AND`. Each clause is rewritten independently, so the AND semantics emerge naturally without special handling.
- Escaped wildcards: `$wpdb->esc_like()` produces patterns like `'%50\% OFF%'`. The needle extractor unescapes `\%`, `\_`, `\\`, `\'`, and doubled `''` quotes before grepping.
- Extension point: the `markdown_db_search_matching_ids` filter lets an FTS5, Meilisearch, or Elasticsearch backend short-circuit the default grep without patching core. Return an array of IDs to override; return `null` (default) to use the built-in grep.
Why not FTS5?
FTS5 would work, but at a cost that doesn't match the install base:
For the current install base (sites with <1,000 posts), grep is the right tradeoff. The filter extension point makes FTS5 a drop-in later if anyone hits the scaling wall.
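A backend swap through the filter might look something like the sketch below. The callback signature is an assumption: the description only says that returning an array of IDs overrides the grep and returning `null` falls through to it, so the two arguments (current value, needle) are guessed. The function names and the hard-coded index are placeholders standing in for a real FTS5 or Meilisearch query.

```php
<?php
// Hypothetical sketch; the filter's real argument list may differ.

/** Stand-in for an external search index (term => post IDs). */
function sketch_fake_index(): array
{
    return ['foo' => [12, 40]];
}

/**
 * Filter callback: return an array of post IDs to override the built-in grep,
 * or null to let the default grep run.
 *
 * @param int[]|null $ids    value passed along the filter chain (assumed null by default)
 * @param string     $needle search term (assumed second argument)
 * @return int[]|null
 */
function sketch_external_search_ids($ids, string $needle)
{
    if (is_array($ids)) {
        return $ids;                          // an earlier callback already answered
    }
    $index = sketch_fake_index();
    $key   = strtolower($needle);
    // Unknown terms return null so the built-in grep still handles them.
    return array_key_exists($key, $index) ? $index[$key] : null;
}

// Register only when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('markdown_db_search_matching_ids', 'sketch_external_search_ids', 10, 2);
}
```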
Testing
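20 pure-PHP smoke tests in `tests/smoke-search.php` covering:
- Grep correctness
- Multi-word AND queries
- No match: `0=1` substitution
- SQL without `post_content LIKE`: returns `null` (no-op)
- Unsupported patterns (`'foo%'`) left untouched
- Escaped wildcards (`'%50\% OFF%'`) unescaped correctly
- Table prefix preservation (`wp_posts.` kept, bare stays bare)

Run: `php tests/smoke-search.php` (all 20 pass).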
Closes #43.