Support local filesystem sources, PGlite, and dev fixtures#2
Conversation
jpr5
left a comment
There was a problem hiding this comment.
Code Review — Local Filesystem Sources, PGlite, and Dev Fixtures
Good architecture — PGlite duck-typing, the orchestrator's local/git partitioning, and the broken-docs fixture are all well-designed. Found 2 critical issues, 8 important, and 5 suggestions.
🔴 Critical — Must Fix
C1. Path traversal — no validation on local source paths
source-indexer.ts: repoDir = path.resolve(this.sourceConfig.path) with zero boundary validation. A config entry like path: /etc would read and index arbitrary files. Git-backed sources are naturally sandboxed to a cloned repo; local sources have no sandboxing.
Fix: validate that resolved local source paths don't escape the project root or an explicit allowlist. At minimum, reject absolute paths and resolve relative to a known base directory.
C2. Previously-indexed local sources silently skipped on startup
orchestrator.ts checkAndIndex(): local sources with a last_commit_sha go into sourcesOk, then the remote-check loop filters them out via .filter(s => s.repo). Result: developer modifies local docs, restarts server, gets stale search results with no indication.
Fix: after the git-backed remote-check loop, always queue a full reindex for local sources in sourcesOk:
const localSourcesOk = sourcesOk.filter(s => !s.repo);
if (localSourcesOk.length > 0) {
console.log(`[orchestrator] Queuing reindex for ${localSourcesOk.length} local source(s)`);
this.queue.push({ type: 'full-reindex-local', sources: localSourcesOk });
this.drain().catch(err => console.error('[orchestrator] drain() failed:', err));
}🟡 Important — Should Fix
I1. repo_url: '' for local sources inflates stats
source-indexer.ts: repo_url: this.sourceConfig.repo ?? ''. The health endpoint's COUNT(DISTINCT repo_url) counts '' as a repo. Use null instead and update the stat query to exclude NULLs.
I2. getHeadSha() returns non-deterministic timestamps for local sources
fullIndex() and getHeadSha() both return local-${Date.now()} but are called at different times, producing mismatched values. Use a content-based hash (file listing + mtimes) for meaningful change detection.
I3. PGlite wrapper duck-types pg.Pool via as unknown as pg.Pool
db/client.ts: only implements query, connect, end. Extract a minimal DbPool interface to prevent future code from accidentally using pg-specific APIs that silently fail with PGlite.
I4. getPool() JSDoc is wrong
Says "creating it on first call" but for PGlite URLs it now throws if called before initializeSchema().
I5. fullIndex() JSDoc says "clone/pull the repo" for all sources
For local sources there is no clone/pull. Opening sentence is factually wrong for the new code path.
I6. No automated tests
PGlite support (73 lines), local source indexing (67 lines), webhook changes — zero automated tests. Critical untested paths: PGlite URL parsing/routing, wrapper query result shape, SourceConfig with optional repo, local source path resolution, orchestrator mixed source handling.
I7. Webhook 403 leaks configuration state
github.ts: { error: "Webhook secret not configured" } — return generic { error: "Forbidden" } and log the reason server-side.
I8. Stale threshold log message hardcodes ">24h" but threshold is configurable
orchestrator.ts: use the configured stale_threshold_hours value in the log message.
🔵 Suggestions
- S1. Whitespace-only webhook secret (
" ") bypasses the guard — use.trim(). - S2. Verify
generateMigration()/generateSchema()don't contain DDL incompatible with PGlite transactions. - S3. Add a security comment above
!cfg.githubWebhookSecretcheck explaining the rationale. - S4.
closePool()— ifpool.end()throws, pool is not nulled; second call retries on half-closed pool. - S5. Consider a discriminated union for
SourceConfig(GitSourceConfig | LocalSourceConfig) to eliminate!assertions.
✅ What's Working Well
- PGlite
initializePGlite()wraps DDL in a transaction with proper ROLLBACK handling incrementalIndex()correctly falls back tofullIndex()for local sources- Zod
.refine()forbranch requires repoprevents misconfiguration - Orchestrator guards consistently prevent local sources from entering git code paths
- Broken-docs fixture is well-designed for testing LLM grounding quality
- npm scripts are correct and well-organized
Summary: 2 critical (path traversal, stale local sources), 8 important. Path traversal is most urgent; stale-on-restart is the most common user-facing bug.
Sources without a `repo` field are treated as local: files are read directly from the configured path, repo_url is stored as NULL, and the orchestrator always reindexes them on startup since there is no remote HEAD to compare against. - Make SourceConfig.repo optional with branch-requires-repo refinement - Add local source path handling in SourceIndexer (computeLocalSha, local walk root, nullable repo_url) - Add full-reindex-local job type in orchestrator with startup reindex - Validate local source paths exist at config load time - Make repo_url column nullable in DB schema and stats query
Default GITHUB_WEBHOOK_SECRET to empty string so the server starts without it (webhooks simply reject with 403). Guard on missing/ whitespace-only secret, return generic "Forbidden" instead of leaking configuration state.
When DATABASE_URL starts with pglite://, use an in-process PGlite instance instead of connecting to an external PostgreSQL server. The PGlite wrapper duck-types as pg.Pool, supporting the query/ connect/end surface used by queries.ts.
Adds a fixture with correct and intentionally broken API docs for testing search quality and LLM grounding. Includes npm scripts for running the fixture server, fixture MCP instances, and launching Claude Code against a local instance.
What's this about?
Until now, mcp-docs could only index sources from git repos. This PR makes it possible to index local directories too — no git clone, no GitHub, just point at a folder of markdown files and go.
Combined with a new PGlite option (in-process Postgres, zero setup), you can now run the full mcp-docs stack locally with nothing more than an OpenAI key:
# Index local docs using an in-process database — no Postgres required DATABASE_URL=pglite:///tmp/my-docs \ MCP_DOCS_CONFIG=my-config.yaml \ npm run devLocal source config example
Sources no longer need a
repofield. Just pointpathat a local directory:Included test fixture
A "Breeze API" fixture is included as a working example — a tiny weather API with docs that can be indexed end-to-end:
There's also a
fixture:breeze-broken-docsvariant with intentionally bad docs for testing how the system handles poor-quality content.Other changes
GITHUB_WEBHOOK_SECRETis now optional — the server starts without it (returns 403 on webhook calls instead of crashing at boot). One less env var for local dev.closePool()— clean shutdown helper that doesn't trigger pool initialization if it never happened.Test plan
npm run fixture:breeze-docsindexes local docs into PGlite and serves search resultsnpm run fixture:breeze-broken-docshandles the broken-docs variantGITHUB_WEBHOOK_SECRET🤖 Generated with Claude Code