Skip to content

Support local filesystem sources, PGlite, and dev fixtures#2

Merged
jpr5 merged 4 commits into
mme/collectfrom
mme/fixtures
Apr 2, 2026
Merged

Support local filesystem sources, PGlite, and dev fixtures#2
jpr5 merged 4 commits into
mme/collectfrom
mme/fixtures

Conversation

@mme

@mme mme commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

What's this about?

Until now, mcp-docs could only index sources from git repos. This PR makes it possible to index local directories too — no git clone, no GitHub, just point at a folder of markdown files and go.

Combined with a new PGlite option (in-process Postgres, zero setup), you can now run the full mcp-docs stack locally with nothing more than an OpenAI key:

# Index local docs using an in-process database — no Postgres required
DATABASE_URL=pglite:///tmp/my-docs \
MCP_DOCS_CONFIG=my-config.yaml \
npm run dev

Local source config example

Sources no longer need a repo field. Just point path at a local directory:

sources:
  - name: my-docs
    type: markdown
    path: ./docs            # local path, no repo needed
    file_patterns:
      - "**/*.md"
    chunk:
      target_tokens: 600
      overlap_tokens: 50

Included test fixture

A "Breeze API" fixture is included as a working example — a tiny weather API with docs that can be indexed end-to-end:

# Terminal 1: Start the mock API (optional, just for realism)
npm run fixture:breeze-api

# Terminal 2: Index its docs with PGlite and start the MCP server
npm run fixture:breeze-docs

# Terminal 3: Connect Claude to the running MCP server
npm run claude

There's also a fixture:breeze-broken-docs variant with intentionally bad docs for testing how the system handles poor-quality content.

Other changes

  • GITHUB_WEBHOOK_SECRET is now optional — the server starts without it (returns 403 on webhook calls instead of crashing at boot). One less env var for local dev.
  • closePool() — clean shutdown helper that doesn't trigger pool initialization if it never happened.

Test plan

  • npm run fixture:breeze-docs indexes local docs into PGlite and serves search results
  • npm run fixture:breeze-broken-docs handles the broken-docs variant
  • Existing git-backed source indexing is unaffected
  • Server starts without GITHUB_WEBHOOK_SECRET
  • Webhook endpoint returns 403 when secret is unconfigured

🤖 Generated with Claude Code

@jpr5 jpr5 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — Local Filesystem Sources, PGlite, and Dev Fixtures

Good architecture — PGlite duck-typing, the orchestrator's local/git partitioning, and the broken-docs fixture are all well-designed. Found 2 critical issues, 8 important, and 5 suggestions.


🔴 Critical — Must Fix

C1. Path traversal — no validation on local source paths

source-indexer.ts: repoDir = path.resolve(this.sourceConfig.path) with zero boundary validation. A config entry like path: /etc would read and index arbitrary files. Git-backed sources are naturally sandboxed to a cloned repo; local sources have no sandboxing.

Fix: validate that resolved local source paths don't escape the project root or an explicit allowlist. At minimum, reject absolute paths and resolve relative to a known base directory.

C2. Previously-indexed local sources silently skipped on startup

orchestrator.ts checkAndIndex(): local sources with a last_commit_sha go into sourcesOk, then the remote-check loop filters them out via .filter(s => s.repo). Result: developer modifies local docs, restarts server, gets stale search results with no indication.

Fix: after the git-backed remote-check loop, always queue a full reindex for local sources in sourcesOk:

const localSourcesOk = sourcesOk.filter(s => !s.repo);
if (localSourcesOk.length > 0) {
    console.log(`[orchestrator] Queuing reindex for ${localSourcesOk.length} local source(s)`);
    this.queue.push({ type: 'full-reindex-local', sources: localSourcesOk });
    this.drain().catch(err => console.error('[orchestrator] drain() failed:', err));
}

🟡 Important — Should Fix

I1. repo_url: '' for local sources inflates stats

source-indexer.ts: repo_url: this.sourceConfig.repo ?? ''. The health endpoint's COUNT(DISTINCT repo_url) counts '' as a repo. Use null instead and update the stat query to exclude NULLs.

I2. getHeadSha() returns non-deterministic timestamps for local sources

fullIndex() and getHeadSha() both return local-${Date.now()} but are called at different times, producing mismatched values. Use a content-based hash (file listing + mtimes) for meaningful change detection.

I3. PGlite wrapper duck-types pg.Pool via as unknown as pg.Pool

db/client.ts: only implements query, connect, end. Extract a minimal DbPool interface to prevent future code from accidentally using pg-specific APIs that silently fail with PGlite.

I4. getPool() JSDoc is wrong

Says "creating it on first call" but for PGlite URLs it now throws if called before initializeSchema().

I5. fullIndex() JSDoc says "clone/pull the repo" for all sources

For local sources there is no clone/pull. Opening sentence is factually wrong for the new code path.

I6. No automated tests

PGlite support (73 lines), local source indexing (67 lines), webhook changes — zero automated tests. Critical untested paths: PGlite URL parsing/routing, wrapper query result shape, SourceConfig with optional repo, local source path resolution, orchestrator mixed source handling.

I7. Webhook 403 leaks configuration state

github.ts: { error: "Webhook secret not configured" } — return generic { error: "Forbidden" } and log the reason server-side.

I8. Stale threshold log message hardcodes ">24h" but threshold is configurable

orchestrator.ts: use the configured stale_threshold_hours value in the log message.


🔵 Suggestions

  • S1. Whitespace-only webhook secret (" ") bypasses the guard — use .trim().
  • S2. Verify generateMigration()/generateSchema() don't contain DDL incompatible with PGlite transactions.
  • S3. Add a security comment above !cfg.githubWebhookSecret check explaining the rationale.
  • S4. closePool() — if pool.end() throws, pool is not nulled; second call retries on half-closed pool.
  • S5. Consider a discriminated union for SourceConfig (GitSourceConfig | LocalSourceConfig) to eliminate ! assertions.

✅ What's Working Well

  • PGlite initializePGlite() wraps DDL in a transaction with proper ROLLBACK handling
  • incrementalIndex() correctly falls back to fullIndex() for local sources
  • Zod .refine() for branch requires repo prevents misconfiguration
  • Orchestrator guards consistently prevent local sources from entering git code paths
  • Broken-docs fixture is well-designed for testing LLM grounding quality
  • npm scripts are correct and well-organized

Summary: 2 critical (path traversal, stale local sources), 8 important. Path traversal is most urgent; stale-on-restart is the most common user-facing bug.

@mme mme changed the base branch from main to mme/collect April 2, 2026 17:57
mme added 4 commits April 2, 2026 20:09
Sources without a `repo` field are treated as local: files are read
directly from the configured path, repo_url is stored as NULL, and
the orchestrator always reindexes them on startup since there is no
remote HEAD to compare against.

- Make SourceConfig.repo optional with branch-requires-repo refinement
- Add local source path handling in SourceIndexer (computeLocalSha,
  local walk root, nullable repo_url)
- Add full-reindex-local job type in orchestrator with startup reindex
- Validate local source paths exist at config load time
- Make repo_url column nullable in DB schema and stats query
Default GITHUB_WEBHOOK_SECRET to empty string so the server starts
without it (webhooks simply reject with 403). Guard on missing/
whitespace-only secret, return generic "Forbidden" instead of leaking
configuration state.
When DATABASE_URL starts with pglite://, use an in-process PGlite
instance instead of connecting to an external PostgreSQL server.
The PGlite wrapper duck-types as pg.Pool, supporting the query/
connect/end surface used by queries.ts.
Adds a fixture with correct and intentionally broken API docs for
testing search quality and LLM grounding. Includes npm scripts for
running the fixture server, fixture MCP instances, and launching
Claude Code against a local instance.
@jpr5 jpr5 merged commit 860dd67 into mme/collect Apr 2, 2026
@jpr5 jpr5 deleted the mme/fixtures branch April 2, 2026 18:42
@jpr5 jpr5 mentioned this pull request Apr 2, 2026
jpr5 added a commit that referenced this pull request Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants