feat: Index/Map Architecture — SQLite as index, .md files as content store #41

Merged
chubes4 merged 4 commits into main from feature/index-architecture
Apr 16, 2026

chubes4 commented Apr 16, 2026

Summary

Phase 1 of the Index/Map Architecture: SQLite becomes an index, markdown files become the only content store. No more duplicating post_content into SQLite.

  • Boot now parses frontmatter only (skips file bodies), cutting boot I/O by ~90%
  • post_content is stored as an empty string in SQLite for markdown-sourced posts
  • New _markdown_file_index table maps post_id → file_path, file_mtime, file_size
  • Driver lazy-loads content from .md files on SELECT queries that include post_content
  • Write engine updates the file index after writing/deleting .md files
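
The index-plus-lazy-load idea in the list above can be sketched as follows. This is a minimal, language-agnostic illustration in Python, not the plugin's PHP code: the table and column names come from this PR, while `build_index`, `read_body`, and `get_post` are hypothetical helpers, and the schema types are assumptions.

```python
import os
import sqlite3

def build_index(conn, posts):
    """Store metadata only: post_content stays empty, the file index maps IDs to paths."""
    conn.execute("CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_title TEXT, post_content TEXT)")
    conn.execute(
        "CREATE TABLE _markdown_file_index ("
        "post_id INTEGER PRIMARY KEY, file_path TEXT, file_mtime REAL, file_size INTEGER)"
    )
    for post_id, title, path in posts:
        st = os.stat(path)
        conn.execute("INSERT INTO wp_posts VALUES (?, ?, '')", (post_id, title))
        conn.execute(
            "INSERT INTO _markdown_file_index VALUES (?, ?, ?, ?)",
            (post_id, path, st.st_mtime, st.st_size),
        )

def read_body(path):
    """Return the text after the closing '---' of the frontmatter block."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    if text.startswith("---\n"):
        end = text.find("\n---\n", 3)
        if end != -1:
            return text[end + 5:]
    return text  # no frontmatter found: treat the whole file as content

def get_post(conn, post_id):
    """Fetch a row, then fill post_content from the .md file named in the index."""
    row = conn.execute(
        "SELECT ID, post_title, post_content FROM wp_posts WHERE ID = ?", (post_id,)
    ).fetchone()
    if row is None:
        return None
    post = dict(zip(("ID", "post_title", "post_content"), row))
    idx = conn.execute(
        "SELECT file_path FROM _markdown_file_index WHERE post_id = ?", (post_id,)
    ).fetchone()
    if idx is not None:
        post["post_content"] = read_body(idx[0])
    return post
```

Callers of `get_post` never see the empty string stored in SQLite, which is the sense in which the lazy-load is transparent.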

Changes

| File | What |
| --- | --- |
| class-wp-markdown-storage.php | parse_file() gains $metadata_only flag; new read_frontmatter_only() (line-by-line, stops at closing ---); new public read_content_from_file() for lazy-loading; posts carry _source_file path |
| class-wp-markdown-loader.php | load_posts() calls get_all_posts(true), inserts empty post_content, creates & populates _markdown_file_index |
| class-wp-markdown-driver.php | query() intercepts SELECT results on wp_posts and resolves content via resolve_content(); file index cache loaded once into memory; update_file_index() / remove_from_file_index() for write-path |
| class-wp-markdown-write-engine.php | Updates file index after writing .md files; removes index entry on DELETE |
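
The line-by-line frontmatter read described for read_frontmatter_only() might look roughly like this. It is a Python sketch rather than the PHP method itself, and it assumes flat `key: value` frontmatter; nested YAML would need a real parser.

```python
import io

def read_frontmatter_only(stream):
    """Collect 'key: value' pairs line by line and stop at the closing '---'.

    The body after the closing delimiter is never read, which is where the
    boot-time I/O saving comes from.
    """
    first = stream.readline()
    if first.strip() != "---":
        return {}  # file does not start with a frontmatter block
    meta = {}
    for line in stream:
        if line.strip() == "---":
            break  # closing delimiter: leave the rest of the file unread
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta
```

Passing an open file handle (or an `io.StringIO` in tests) keeps the function independent of where the markdown lives.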

Design doc

See wiki article: Markdown DB: Index Architecture Design (ID 128)

Testing

All existing functionality preserved — lazy-load is transparent:

  • wp post list --post_type=wiki --format=count → 42 ✅
  • wp post get 58 --field=post_content → full article content ✅
  • wp intelligence wiki read karpathy-llm-wiki-pattern → full content ✅
  • wp intelligence wiki tree → full tree ✅
  • Regular posts and pages resolve correctly ✅

Next phases

  • Phase 2: Optimized frontmatter-only parsing (already partially done — read_frontmatter_only())
  • Phase 3: Persistent SQLite + manifest (warm boot in ~5ms)
  • Phase 4: Smart resolution skipping (skip file reads for metadata-only queries)

chubes4 added 3 commits April 16, 2026 08:53
…s content store

Phase 1 of the Index/Map Architecture:
- SQLite stores empty post_content for markdown-sourced posts
- New _markdown_file_index table maps post_id → file_path
- Boot parses frontmatter only (skip body), cutting boot I/O ~90%
- Driver lazy-loads content from .md files on SELECT queries
- Write engine updates file index after writing .md files
- All existing functionality preserved; lazy-load is transparent

Switch from :memory: to persistent on-disk SQLite index file.
On cold boot (no file), full load from disk as before.
On warm boot (file exists), incremental sync only:
- _json_file_manifest tracks JSON file mtimes — only reload changed tables
- _markdown_file_index tracks .md file mtimes — only re-parse changed posts
- Detect new files (INSERT), changed files (UPDATE), deleted files (DELETE)
- Falls back to full reload if incremental sync fails

SQLite index file: wp-content/markdown-index.sqlite (~700KB for 43 posts)
WAL journal mode for concurrent read/write safety.
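
As a rough illustration of the warm-boot sync described in this commit, the mtime comparison and the fallback could be sketched as below. This is Python rather than the plugin's PHP, `plan_sync` and `warm_boot` are hypothetical names, and comparing mtimes alone is an assumption made here for brevity.

```python
def plan_sync(indexed, on_disk):
    """Compare stored mtimes with the filesystem.

    Both arguments map file path -> mtime. Returns the operations the index
    needs, keyed by the SQL verb the commit message associates with each case.
    """
    new = sorted(set(on_disk) - set(indexed))
    deleted = sorted(set(indexed) - set(on_disk))
    changed = sorted(p for p in on_disk if p in indexed and on_disk[p] != indexed[p])
    return {"INSERT": new, "UPDATE": changed, "DELETE": deleted}

def warm_boot(indexed, on_disk, apply_op, full_reload):
    """Apply the incremental plan; fall back to a full reload if anything fails."""
    try:
        for op, paths in plan_sync(indexed, on_disk).items():
            for path in paths:
                apply_op(op, path)
        return "incremental"
    except Exception:
        full_reload()
        return "full"
```

Unchanged files never reach `apply_op`, so a warm boot with no edits does no parsing at all.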
…systems

WAL mode requires shared memory files (-shm) that don't work across
container/host filesystem boundaries (e.g. Studio). The PRAGMA
journal_mode query was corrupting the SQLite file on warm boot.

Changed to null (SQLite default DELETE mode) which is safe everywhere.
WAL can still be enabled via SQLITE_JOURNAL_MODE constant if needed.
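
The opt-in behaviour described above can be sketched with Python's sqlite3 module. `open_index` and `current_journal_mode` are hypothetical helpers; the `journal_mode` argument stands in for the SQLITE_JOURNAL_MODE constant, whose exact handling in the plugin is not shown on this page.

```python
import sqlite3

VALID_MODES = {"DELETE", "TRUNCATE", "PERSIST", "MEMORY", "WAL", "OFF"}

def open_index(path, journal_mode=None):
    """Open the index file; touch the journal mode only when explicitly asked.

    With journal_mode=None no PRAGMA is issued at all, so SQLite keeps its
    default rollback journal (DELETE) for an on-disk database.
    """
    if journal_mode is not None and journal_mode.upper() not in VALID_MODES:
        raise ValueError(f"unsupported journal mode: {journal_mode!r}")
    conn = sqlite3.connect(path)
    if journal_mode is not None:
        # PRAGMA values cannot be bound as parameters, hence the whitelist above.
        conn.execute(f"PRAGMA journal_mode = {journal_mode.upper()}")
    return conn

def current_journal_mode(conn):
    """Report the journal mode currently in effect, in lowercase."""
    return conn.execute("PRAGMA journal_mode").fetchone()[0]
```

Leaving the default untouched avoids creating the shared-memory file that WAL depends on, which is the failure mode this commit works around.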
When the persistent SQLite index file is corrupted (e.g. unclean
shutdown, filesystem issues), the site now self-heals:
1. Detects corruption via 'file is not a database' exception
2. Deletes the corrupted .sqlite file + any journal files
3. Falls back to cold boot (full rebuild from .md/JSON files)
4. Logs the recovery for admin visibility

Extracted boot_connection() from db_connect() so the connection
setup can be retried cleanly after deleting the corrupted file.
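
The four recovery steps above could be sketched like this in Python. The function names mirror boot_connection() and db_connect() from this commit, but the bodies are assumptions: the probe query, the list of journal suffixes, and the logger name are all illustrative.

```python
import logging
import os
import sqlite3

log = logging.getLogger("markdown_db")  # hypothetical logger name

def boot_connection(path):
    """Open the file and force SQLite to read the header so corruption surfaces now."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
    except sqlite3.DatabaseError:
        conn.close()
        raise
    return conn

def db_connect(path):
    """Return (connection, rebuilt); rebuilt is True when the file was replaced."""
    try:
        return boot_connection(path), False
    except sqlite3.DatabaseError as exc:
        if "not a database" not in str(exc):
            raise  # some other failure: do not delete anything
        log.warning("Corrupted index %s, deleting and rebuilding: %s", path, exc)
        for leftover in (path, path + "-journal", path + "-wal", path + "-shm"):
            if os.path.exists(leftover):
                os.remove(leftover)
        # The retry creates a fresh, empty file; the caller then runs a cold boot.
        return boot_connection(path), True
```

Keeping the connection setup in its own function is what makes the single retry possible without duplicating the setup code.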
chubes4 merged commit d6bf813 into main Apr 16, 2026
chubes4 deleted the feature/index-architecture branch April 16, 2026 13:54
chubes4 added a commit that referenced this pull request Apr 21, 2026
After the Index/Map Architecture (PR #41), post_content is stored as an
empty string in SQLite and lazy-loaded from .md files on demand. WHERE
post_content LIKE '%foo%' silently matches nothing, breaking WP default
search (?s=foo) and any plugin that queries post_content directly.

Fix it without reintroducing the coupling PR #41 removed: grep the .md
files on disk instead of rebuilding a full-text index inside SQLite.

Changes:

- New WP_Markdown_Search class encapsulates the grep logic. Iterates
  _markdown_file_index (post_id -> file_path) and case-insensitively
  matches the needle against each source file. Per-request cache keyed
  by lowercased needle.

- Driver intercepts SELECT queries with post_content LIKE clauses and
  rewrites each one into (table.)?ID IN (1,2,3) based on grep results,
  or 0=1 when nothing matches. Only the %needle% contains-match shape
  is rewritten; prefix, suffix, or embedded-wildcard patterns are left
  untouched for SQLite to handle.

- Extension point: markdown_db_search_matching_ids filter lets an
  FTS5, Meilisearch, or Elasticsearch backend short-circuit the default
  grep without patching core.

- 20 pure-PHP smoke tests covering grep correctness, multi-word AND
  queries, escaped LIKE wildcards, table prefix preservation, and
  unsupported pattern passthrough. Run via: php tests/smoke-search.php

Closes #43
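
The grep-and-rewrite approach in this commit can be illustrated with a short Python sketch. It is not the WP_Markdown_Search class: `contains_needle`, `grep_ids`, and `rewrite_query` are hypothetical names, the regular expression is a simplification, and backslash is assumed to be the LIKE escape character.

```python
import re

# Matches an optional "table." prefix, then post_content LIKE '<pattern>'.
LIKE_RE = re.compile(r"((?:\w+\.)?)post_content\s+LIKE\s+'((?:[^']|'')*)'", re.IGNORECASE)

def contains_needle(pattern):
    """Return the needle for a %needle% pattern, or None for any other shape."""
    if len(pattern) < 3 or not pattern.startswith("%") or not pattern.endswith("%"):
        return None
    if pattern.endswith("\\%"):
        return None  # trailing % is escaped, so this is not a contains-match
    inner = pattern[1:-1]
    if re.search(r"(?<!\\)[%_]", inner):
        return None  # embedded wildcard: leave it for SQLite
    return inner.replace("\\%", "%").replace("\\_", "_").replace("''", "'")

def grep_ids(needle, file_index, cache):
    """Case-insensitive scan of every indexed file, cached by lowercased needle."""
    key = needle.lower()
    if key not in cache:
        ids = []
        for post_id, path in sorted(file_index.items()):
            with open(path, encoding="utf-8") as fh:
                if key in fh.read().lower():
                    ids.append(post_id)
        cache[key] = ids
    return cache[key]

def rewrite_query(sql, file_index, cache):
    """Replace each supported LIKE clause with an ID list, or 0=1 when nothing matches."""
    def replace(match):
        prefix, pattern = match.group(1), match.group(2)
        needle = contains_needle(pattern)
        if needle is None:
            return match.group(0)  # unsupported shape passes through untouched
        ids = grep_ids(needle, file_index, cache)
        if not ids:
            return "0=1"
        return f"{prefix}ID IN ({','.join(map(str, ids))})"
    return LIKE_RE.sub(replace, sql)
```

Because each LIKE clause is rewritten independently, a multi-word search that ANDs several clauses keeps its AND semantics after the rewrite.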