refactor(core): sampler warnings + dynamic skip + batched fetch (Lane A rest) by faculopezscala · Pull Request #6 · Bugsterapp/sow

faculopezscala · 2026-04-06T16:53:39Z

Summary

Closes the rest of Lane A from the eng review plan: three related fixes to the referential sampler that the security PR (#1) deliberately deferred. All three live in the same module and rewrite the same core function, so they ship together.

Issues addressed:

fix(core): sanitize JSONB + fail-closed gate + Postgres type coverage #3 — Warnings collection: replace silent catch {} blocks with structured IntegrityWarning[] collection. Surfaced via sow doctor <connector>.
docs(launch): reposition README + cookbook + sanitization docs #7 — Dynamic SKIP_IMPLICIT_COLUMNS: replace the English-biased hardcoded ["id","user_id","owner_id","created_by"] set with a dynamic check against the actual formal Relationships.
#8 — N+1 batching: collect missing implicit-reference ids by target table across ALL source tables, then issue one IN (...) query per target instead of one per source-target pair. ~30-60s → 1 round-trip per target on a 50-table schema over a VPN.

Commits (bisectable):

4af4c41 — refactor(core): referential sampler — warnings, dynamic skip, batched fetch
e169d3a — feat(cli): surface sampler integrity warnings via sow doctor

What changed

Issue #3 — warnings, not silence

ensureReferentialIntegrity now returns { tables, warnings } instead of a bare Map. Four warning kinds:

kind	meaning
`parent_fetch_failed`	the missing-parent SELECT threw
`parent_not_found`	the parent genuinely doesn't exist in source data (orphan)
`child_fetch_failed`	the ensure-1-child-per-parent SELECT threw
`implicit_ref_fetch_failed`	the implicit IN(...) batch threw

Warnings are deduped by (source, columns, target) fingerprint so a schema with thousands of orphaned rows emits one line per relationship, not one per row. Error messages truncated at 140 chars.

The warnings flow: ensureReferentialIntegrity → SamplingResult.integrityWarnings → ConnectorMetadata.integrityWarnings (persisted to disk) → sow doctor <connector> → user sees them.

Issue #7 — dynamic skip

Old hardcoded list:

const SKIP_IMPLICIT_COLUMNS = new Set(["id", "user_id", "owner_id", "created_by"]);

New dynamic check:

const formallyHandled = new Set<string>();
for (const rel of relationships) {
  for (const col of rel.sourceColumns) {
    formallyHandled.add(`${rel.sourceTable}.${col}`);
  }
}
// Later, in resolveImplicitReferences:
if (formallyHandled.has(`${tableName}.${colName}`)) continue;

Works for any language, any column name, any FK shape. The old "id" implicit-skip is preserved naturally because inferTargetTable returns null for plain "id" columns (no _id suffix).

Issue #8 — batched implicit references

Before:

for each (source_table, source_column) pair:
  for each missing id in that pair:
    SELECT * FROM target WHERE id IN (one_id)

After:

for each source row, group missing ids by target table → missingByTarget Map
for each (target_table, missingIds) in missingByTarget:
  for each chunk of 100 ids:
    SELECT * FROM target WHERE id IN ($1, $2, ..., $100)

50-table schema impact: ~250 round-trips → ~50 round-trips. With dedup, often much fewer (multiple source rows referencing the same parent collapse into one bind param).

Test Coverage

Tests: 89 → 99 (+10 new) — all in packages/core/src/sampler/referential.test.ts.

ENSUREREFERENTIALINTEGRITY COVERAGE
═══════════════════════════════════
[+] Warnings collection (Issue #3)
    ├── [★★★ TESTED] Empty warnings on clean run
    ├── [★★★ TESTED] parent_fetch_failed on adapter throw (with FailingAdapter spy)
    ├── [★★★ TESTED] parent_not_found when target table has no matching row
    ├── [★★★ TESTED] implicit_ref_fetch_failed on batch query throw
    └── [★★★ TESTED] reason strings truncated at ~140 chars

[+] Dynamic skip (Issue #7)
    ├── [★★★ TESTED] Formal FK columns excluded from implicit pass
    └── [★★★ TESTED] Non-English column names don't throw (graceful degradation)

[+] Batched implicit references (Issue #8)
    ├── [★★★ TESTED] Three source tables → one target query (not three)
    ├── [★★★ TESTED] >100 ids chunked into batches of 100/100/50
    └── [★★★ TESTED] Duplicate ids deduplicated to a single param

PLUS the 10 pre-existing tests from PR #1 (quoteIdent + SQL injection regression)
all still pass — no behavioral regression to the security fix.

Coverage: 10/10 new paths directly tested. 100%.

Pre-Landing Review

No CRITICAL findings. The sampler refactor was triple-checked against the existing 10 SQL-injection regression tests (all still pass), the test pattern is the same SpyAdapter approach used in PR #1, and the wiring through ConnectorMetadata only adds an optional field (backwards compat for metadata written by older sow versions).

Adversarial review notes

Self-reviewed with the same lens as the prior adversarial passes:

Could a transient warning be re-emitted across passes? Yes — the multi-pass loop runs up to 3 times per relationship. Mitigated with the warnedRelationships Set fingerprinted by (source, columns, target).
Could the dedup mask a real bug? No — the dedup only collapses identical (source, columns, target) tuples. Different relationships still emit independent warnings.
Does the batched IN query hit the bind-param limit? No — chunked at 100, well under Postgres's 65,535. Verified by test.
Does the dynamic skip break any existing test? No — the 10 SQL-injection regression tests pass an empty relationships array (their tests are about composite/text PK handling, not formal-FK skipping).
Edge case: a relationship where sourceTable === targetTable (self-reference). The outer for (const rel of relationships) still has if (rel.sourceTable === rel.targetTable) continue from the original code. Preserved.

Plan Completion

This PR closes 3 of the remaining 10 issues from /plan-eng-review. The other lanes ship as PRs #2 (Lane B template-DB), #3 (Lane C sanitizer), #4 (Lane D sandbox+env-patch), #5 (Lane E release workflow). All five lanes are independent and can merge in any order.

Test plan

bunx vitest run — 99/99 tests passing (89 pre-existing + 10 new)
bunx turbo build — 3/3 packages clean
bunx eslint on every touched file — clean (4 pre-existing warnings unrelated)
Post-merge: sow connect against a schema with intentional orphaned FKs and verify the summary line + sow doctor <connector> flow
Post-merge: sow connect against a 50-table schema and watch the round-trip count drop from ~250 to ~50 (Issue #8 measurable win)

🤖 Generated with Claude Code

… fetch Three related fixes to packages/core/src/sampler/referential.ts, surfaced by the eng review as Issues #3, #7, and #8. Same module, one coordinated refactor, one test pass. Issue #3 — warnings collection (was: silent catch{} blocks) The three try/catch blocks in ensureReferentialIntegrity silently swallowed any failure to fetch a missing parent, an orphaned child, or an implicit-reference batch. Users got branches with dangling FKs and no way to know. Replaced with structured warning collection into a new IntegrityWarning[] array. Four kinds: - parent_fetch_failed: the missing-parent SELECT threw - parent_not_found: the parent genuinely doesn't exist in source - child_fetch_failed: the ensure-1-child-per-parent SELECT threw - implicit_ref_fetch_failed: the implicit IN(...) batch threw Warnings are deduped by (source, columns, target) fingerprint so a schema with thousands of orphaned rows emits one line per relationship, not one per row. Error messages are truncated to 140 chars to keep metadata bounded. ensureReferentialIntegrity now returns { tables, warnings } instead of a bare Map. Callers (sampler/index.ts) thread the warnings through SamplingResult.integrityWarnings, which flows into ConnectorMetadata.integrityWarnings for later surfacing by the CLI. Issue #7 — dynamic SKIP_IMPLICIT_COLUMNS (was: English-only hardcoded list) The old hardcoded ["id", "user_id", "owner_id", "created_by"] set was English-biased and conflated "already handled by formal FK" with "should never be followed". A Spanish codebase with `creado_por`, or an app with `author_id`/`reviewer_id`/`assignee_id` would fall through the gap. Replaced with a dynamic check: compute the set of (source_table, source_column) pairs that appear in any formal Relationship, and skip THOSE in the implicit-reference pass. Works for any language and any schema. The old symbolic "id" skip is preserved by the inferrer itself (inferTargetTable returns null for columns that don't end in _id or can't be mapped to a target table). Issue #8 — batched implicit references (was: N+1 per source-target pair) The old resolveImplicitReferences walked (table, column) pairs and fired one SELECT per pair, even when multiple source tables all referenced the same parent. On a 50-table schema over a VPN that produced 100+ round-trips (~30-60s). Rewritten to collect all missing ids grouped by target table across ALL source tables first, then issue one IN($1,$2,...) query per target (still chunked at 100 ids per query to stay well under Postgres's 65,535 bind-param limit). Deduplicates ids so three source rows referencing the same parent only request it once. Tests 10 new tests in packages/core/src/sampler/referential.test.ts: - 5 for warnings (happy path, parent_fetch_failed, parent_not_found, implicit_ref_fetch_failed, reason truncation) - 2 for dynamic skip (formal FK takes precedence, non-English columns don't throw) - 3 for batching (one query per target across sources, >100 id chunking, dedup) Total: 99 passing (was 89, +10 new). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wires the new SamplingResult.integrityWarnings (added in the previous commit) through the CLI surface. Zero new core logic — just plumbing the warning count and details out to the user. - packages/cli/src/commands/connect.ts After a successful sow connect, if the snapshot has any integrity warnings, print one summary line: ⚠ N referential integrity warning(s) — run `sow doctor <name>` for details - packages/cli/src/commands/doctor.ts New describeConnectorWarnings() returns the per-connector report (tables, rows, snapshot size, full warning list) by reading ConnectorMetadata. The generic doctor checkConnectors() now flags any connector with warnings as a "warn" status with a hint pointing at sow doctor <name>, so users discover them even without knowing to ask. - packages/cli/src/commands/runner.ts runDoctor() now accepts an optional connectorName positional arg. When passed, prints the detailed warnings report instead of the generic system check list. Format groups warnings by kind and explains the root causes (parent_not_found = source data has orphans, *_fetch_failed = transient read error, retry with sow connector refresh). JSON mode supported. - packages/cli/src/cli.ts Help text for doctor updated to "doctor [connector]" so users discover the new positional. Tests: still 99 passing — this commit is wiring only, no new logic worth a unit test (would amount to mocking everything). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The 5 code PRs (#2-#6) deliver the engineering for the launch. This PR delivers everything user-facing: the README pitch, the package metadata, the docs/ folder, and the CHANGELOG. This PR depends on PRs #2-#6 being merged first because the new README references commands those PRs add (sow sandbox, sow doctor <connector>, the --allow-unsafe flag, the sub-second reset). Land them first, rebase this against main, then merge. README.md Hero rewritten from "Safe test databases from production Postgres" to "Stop letting Claude touch your prod database". Body explains the anxiety-reduction pitch: a coding agent is about to do something database-adjacent and you feel that quiet pang. sow is the safety layer. New "Why sow" section. New "How It Works" diagram showing the template-DB shape (one container per connector, N branch DBs, reset in <1s). New "Cookbook" stub linking to docs/cookbook.md. New "Documentation" section with the docs/ index. packages/cli/package.json Description: "Stop letting Claude touch your prod database. PII-safe local Postgres sandbox for coding agents." Keywords: added ai-agents, coding-agents, claude-code, cursor, sandbox, mcp. packages/core/package.json Description: "sow core engine — analyze, sample, sanitize, and branch Postgres databases for safe coding-agent sandboxes" Keywords: added ai-agents, coding-agents, sandbox. packages/mcp/package.json Description corrected from "15 tools" to "22 tools" (the actual count in packages/mcp/src/index.ts) and repositioned: "sow MCP server — 22 tools for coding agents (Claude Code, Cursor, Codex) to safely manage Postgres sandboxes" Keywords: added claude-code, cursor, codex, coding-agents, sandbox. docs/sandbox.md (new) The sow sandbox flagship command — what it does, the flags, the .env.local backup/revert flow, when not to use it, and what's actually in the sandbox. docs/sanitization.md (new) What sow sanitizes (the PII type table), how JSONB walking works, the fail-closed gate, the --allow-unsafe escape hatch, custom rules via .sow.yml, what sow does NOT do (free-text NER, etc.), and the read-only-on-the-source guarantee. docs/cookbook.md (new) Three end-to-end workflows with concrete prompts: 1. Let Claude refactor your schema without fear 2. Let Cursor generate seed data for a new feature 3. Let your coding agent debug a failing migration Plus the "agent reset loop" pattern diagram, the MCP tool list, and operational tips (one long-running sandbox per project, checkpoints for known-good states, sow doctor as the inspection surface, principle of least privilege on the source DB user). CHANGELOG.md (new) Scaffold with three sections: - [Unreleased] documenting the planned PR #2-#6 features under Added/Changed - [0.1.14] documenting the SQL injection security fix that already shipped (PR #1, merged earlier in the session) - [0.1.13] one-line summary of the initial public release Test/build/lint all clean (89/89 tests, 3/3 packages built, no source code changed in this PR). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

faculopezscala and others added 2 commits April 6, 2026 09:51

faculopezscala merged commit 9464cde into main Apr 6, 2026
1 check passed

faculopezscala mentioned this pull request Apr 6, 2026

docs(launch): reposition README + cookbook + sanitization docs #7

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(core): sampler warnings + dynamic skip + batched fetch (Lane A rest)#6

refactor(core): sampler warnings + dynamic skip + batched fetch (Lane A rest)#6
faculopezscala merged 2 commits into
mainfrom
fix/sampler-warnings-and-batching

faculopezscala commented Apr 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

faculopezscala commented Apr 6, 2026

Summary

What changed

Issue #3 — warnings, not silence

Issue #7 — dynamic skip

Issue #8 — batched implicit references

Test Coverage

Pre-Landing Review

Adversarial review notes

Plan Completion

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant