feat: PHP grep backend for post_content LIKE queries (closes #43) #59

Merged
chubes4 merged 1 commit into main from feat/grep-content-search
Apr 21, 2026

chubes4 commented on Apr 21, 2026

Problem

After the Index/Map Architecture (PR #41), post_content is stored as an empty string in SQLite and lazy-loaded from .md files on demand. This means:

  • WHERE post_content LIKE '%keyword%' matches nothing
  • WordPress default search (?s=keyword) returns no results
  • Any plugin that queries post content via SQL silently breaks

Approach

Rather than rebuild a full-text index inside SQLite — which would duplicate file content back into the DB and reintroduce the coupling PR #41 deliberately removed — grep the .md files directly.

The layering matches the PR #41 thesis: SQLite is an index, files are the content store. Search is a content concern, so it honors that boundary.

query() receives:
  SELECT * FROM wp_posts WHERE post_content LIKE '%foo%'
         ↓
  extract needle "foo"
  grep files in _markdown_file_index
         ↓
  rewrite to:
  SELECT * FROM wp_posts WHERE ID IN (10, 42, 58)
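
The flow above can be sketched in a few lines of PHP. This is a minimal illustration, not the code in this PR: the function name `sketch_rewrite_content_like` and the `$find_ids` callback (a stand-in for the file grep) are hypothetical, and the regex only accepts needles without quotes, wildcards, or backslashes.

```php
<?php
// Hypothetical sketch; names are illustrative, not the PR's API.

/**
 * Rewrite each post_content LIKE '%needle%' clause into an ID lookup.
 *
 * @param string   $sql      Incoming query.
 * @param callable $find_ids Assumed callback: needle string => list of post IDs.
 * @return string|null Rewritten SQL, or null when no clause qualified.
 */
function sketch_rewrite_content_like(string $sql, callable $find_ids): ?string
{
    // Group 1: optional table qualifier. Group 2: a needle free of quotes,
    // wildcards, and backslashes (escaped patterns are out of scope here).
    $regex = '/((?:\w+\.)?)\bpost_content\s+LIKE\s+\'%([^\'%_\\\\]+)%\'/i';

    $rewritten = false;
    $out = preg_replace_callback(
        $regex,
        function (array $m) use ($find_ids, &$rewritten): string {
            $rewritten = true;
            $ids = array_map('intval', $find_ids($m[2]));
            if ($ids === []) {
                return '0=1'; // nothing matched: make the clause always false
            }
            return $m[1] . 'ID IN (' . implode(', ', $ids) . ')';
        },
        $sql
    );

    return $rewritten ? $out : null;
}

// Stand-in for the grep step: only "foo" has matches.
$lookup = fn(string $needle): array => $needle === 'foo' ? [10, 42, 58] : [];

echo sketch_rewrite_content_like(
    "SELECT * FROM wp_posts WHERE post_content LIKE '%foo%'",
    $lookup
), "\n";
// SELECT * FROM wp_posts WHERE ID IN (10, 42, 58)
```

Returning null when nothing qualifies lets the caller pass the original SQL through unchanged.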

Changes

  • New WP_Markdown_Search class — encapsulates grep + rewrite. Iterates _markdown_file_index (post_id → file_path), case-insensitively matches the needle against each source file. Per-request cache keyed by lowercased needle avoids re-grepping on repeat queries (common during WP_Query + FOUND_ROWS() cycles).

  • Driver interception: WP_Markdown_Driver::query() runs the rewriter before handing the SQL to SQLite. Only the %needle% contains-match shape is rewritten; prefix (foo%), suffix (%foo), and embedded-wildcard patterns are left untouched for SQLite to handle natively (matching pre-PR-41 behavior for those edge cases). Table qualifiers (as in wp_posts.post_content) are preserved in the rewrite, so that clause becomes wp_posts.ID IN (...).

  • Multi-word AND semantics: ?s=foo bar generates separate post_content LIKE clauses joined by the outer AND. Each clause is rewritten independently, so the AND semantics emerge naturally without special handling.

  • Escaped wildcards: $wpdb->esc_like() produces patterns like '%50\% OFF%'. The needle extractor unescapes \%, \_, \\, \', and doubled '' quotes before grepping.

  • Extension point — the markdown_db_search_matching_ids filter lets an FTS5, Meilisearch, or Elasticsearch backend short-circuit the default grep without patching core. Return an array of IDs to override; return null (default) to use the built-in grep.
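
As a rough illustration of the grep step, the per-request cache, and the override hook described in the list above, here is a minimal sketch. It is not the actual WP_Markdown_Search class: `Sketch_Markdown_Grep`, its constructor arguments, and the `$scans` counter are hypothetical. The arguments passed to the markdown_db_search_matching_ids filter are not shown in this PR, so the sketch takes a plain callable in place of `apply_filters()`.

```php
<?php
// Hypothetical sketch; class and member names are illustrative only.

final class Sketch_Markdown_Grep
{
    /** @var array<int,string> post_id => file_path */
    private array $file_index;

    /** @var array<string,int[]> lowercased needle => matching IDs */
    private array $cache = [];

    /** @var callable|null Stand-in for the filter: needle => int[]|null */
    private $override;

    /** Number of full file scans performed (for demonstration only). */
    public int $scans = 0;

    public function __construct(array $file_index, ?callable $override = null)
    {
        $this->file_index = $file_index;
        $this->override   = $override;
    }

    /** @return int[] IDs of posts whose source file contains the needle. */
    public function matching_ids(string $needle): array
    {
        if ($needle === '') {
            return [];
        }

        $key = strtolower($needle);
        if (array_key_exists($key, $this->cache)) {
            return $this->cache[$key]; // repeat query in the same request
        }

        // An external backend may answer first; null means "use the grep".
        $ids = $this->override !== null ? ($this->override)($needle) : null;

        if (!is_array($ids)) {
            $ids = [];
            $this->scans++;
            foreach ($this->file_index as $post_id => $path) {
                $content = @file_get_contents($path);
                // stripos() gives a case-insensitive substring match.
                if ($content !== false && stripos($content, $needle) !== false) {
                    $ids[] = (int) $post_id;
                }
            }
        }

        return $this->cache[$key] = $ids;
    }
}
```

Keying the cache on the lowercased needle means 'Foo' and 'foo' share one scan, which is safe because the match itself is case-insensitive.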

Why not FTS5?

FTS5 would work, but at a cost that doesn't match the install base:

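|                    | PHP grep (this PR)          | FTS5                         |
|--------------------|-----------------------------|------------------------------|
| Code               | ~300 lines                  | ~400 + migration             |
| Sync bugs          | None (files are the index)  | Possible on every write      |
| Cold boot cost     | Zero                        | Populate index from files    |
| Corruption surface | None                        | Another table (compare #47)  |
| 100 posts          | ~5ms                        | ~1ms                         |
| 1,000 posts        | ~30ms                       | ~2ms                         |
| 10,000 posts       | ~300ms                      | ~3ms                         |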

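For the current install base (sites with <1,000 posts), grep costs at most about 30ms per search by the approximate figures above and needs no index sync or migration, which makes it the better tradeoff for now. The filter extension point makes FTS5 a drop-in later if anyone hits the scaling wall.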

Testing

20 pure-PHP smoke tests in tests/smoke-search.php covering:

  • Case-insensitive grep across real fixture files
  • Single-file, multi-file, and no-match outcomes
  • Per-request cache hit on repeat needle
  • Full WP_Query-style SQL rewrite (title/excerpt/content OR group)
  • Multi-word AND (separate LIKE groups each get rewritten)
  • No-match → 0=1 substitution
  • No post_content LIKE → returns null (no-op)
  • Prefix-only pattern ('foo%') left untouched
  • Escaped wildcard ('%50\% OFF%') unescaped correctly
  • Table prefix preservation (wp_posts. kept, bare stays bare)

Run: php tests/smoke-search.php — all 20 pass.
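
To make the pattern-shape and escaped-wildcard cases in the list above concrete, here is a minimal sketch of a needle extractor. It is an assumption about how such a step could work, not the code under test: `sketch_extract_needle` is a hypothetical name, and it takes the raw text between the SQL quotes.

```php
<?php
// Hypothetical sketch; not the extractor shipped in this PR.

/**
 * Return the literal needle for a %needle% pattern, or null for any
 * other shape (prefix, suffix, embedded wildcard, empty needle).
 */
function sketch_extract_needle(string $body): ?string
{
    $tokens = []; // each token: [is_wildcard, character]
    $len = strlen($body);

    for ($i = 0; $i < $len; $i++) {
        $c    = $body[$i];
        $next = $i + 1 < $len ? $body[$i + 1] : '';

        if ($c === '\\' && $next !== '' && strpos("%_\\'", $next) !== false) {
            $tokens[] = [false, $next]; // \%  \_  \\  \'  => literal
            $i++;
        } elseif ($c === "'" && $next === "'") {
            $tokens[] = [false, "'"];   // doubled quote => literal quote
            $i++;
        } elseif ($c === '%' || $c === '_') {
            $tokens[] = [true, $c];     // unescaped wildcard
        } else {
            $tokens[] = [false, $c];
        }
    }

    $n = count($tokens);
    if ($n < 3 || $tokens[0] !== [true, '%'] || $tokens[$n - 1] !== [true, '%']) {
        return null; // not wrapped in unescaped % on both sides
    }

    $needle = '';
    for ($i = 1; $i < $n - 1; $i++) {
        if ($tokens[$i][0]) {
            return null; // embedded wildcard: leave the clause alone
        }
        $needle .= $tokens[$i][1];
    }

    return $needle;
}

var_dump(sketch_extract_needle('%50\% OFF%')); // string(7) "50% OFF"
var_dump(sketch_extract_needle('foo%'));       // NULL
```

Tokenizing before checking the shape matters for a pattern such as '%50\%': its final % is escaped, so it is a suffix match and is correctly rejected.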

Files

 db.php                                   |   1 +
 inc/class-wp-markdown-driver.php         |  55 ++++++
 inc/class-wp-markdown-search.php         | 285 +++++++++++++++++++++++++++++++
 tests/smoke-search.php                   | 243 ++++++++++++++++++++++++++
 tests/stubs/stub-wp-markdown-driver.php  |  32 ++++
 tests/stubs/stub-wp-markdown-storage.php |  27 +++

Closes #43.

After the Index/Map Architecture (PR #41), post_content is stored as an
empty string in SQLite and lazy-loaded from .md files on demand. WHERE
post_content LIKE '%foo%' silently matches nothing, breaking WP default
search (?s=foo) and any plugin that queries post_content directly.

Fix it without reintroducing the coupling PR #41 removed: grep the .md
files on disk instead of rebuilding a full-text index inside SQLite.

Changes:

- New WP_Markdown_Search class encapsulates the grep logic. Iterates
  _markdown_file_index (post_id -> file_path) and case-insensitively
  matches the needle against each source file. Per-request cache keyed
  by lowercased needle.

- Driver intercepts SELECT queries with post_content LIKE clauses and
  rewrites each one into (table.)?ID IN (1,2,3) based on grep results,
  or 0=1 when nothing matches. Only the %needle% contains-match shape
  is rewritten; prefix, suffix, or embedded-wildcard patterns are left
  untouched for SQLite to handle.

- Extension point: markdown_db_search_matching_ids filter lets an
  FTS5, Meilisearch, or Elasticsearch backend short-circuit the default
  grep without patching core.

- 20 pure-PHP smoke tests covering grep correctness, multi-word AND
  queries, escaped LIKE wildcards, table prefix preservation, and
  unsupported pattern passthrough. Run via: php tests/smoke-search.php

Closes #43
chubes4 merged commit dfdee76 into main on Apr 21, 2026

Development

Successfully merging this pull request may close these issues.

FTS5 search index for post_content
