Skip to content

refactor(core): sampler warnings + dynamic skip + batched fetch (Lane A rest)#6

Merged
faculopezscala merged 2 commits into
mainfrom
fix/sampler-warnings-and-batching
Apr 6, 2026
Merged

refactor(core): sampler warnings + dynamic skip + batched fetch (Lane A rest)#6
faculopezscala merged 2 commits into
mainfrom
fix/sampler-warnings-and-batching

Conversation

@faculopezscala
Copy link
Copy Markdown
Contributor

Summary

Closes the rest of Lane A from the eng review plan: three related fixes to the referential sampler that the security PR (#1) deliberately deferred. All three live in the same module and rewrite the same core function, so they ship together.

Issues addressed:

  • fix(core): sanitize JSONB + fail-closed gate + Postgres type coverage #3 — Warnings collection: replace silent catch {} blocks with structured IntegrityWarning[] collection. Surfaced via sow doctor <connector>.
  • docs(launch): reposition README + cookbook + sanitization docs #7 — Dynamic SKIP_IMPLICIT_COLUMNS: replace the English-biased hardcoded ["id","user_id","owner_id","created_by"] set with a dynamic check against the actual formal Relationships.
  • #8 — N+1 batching: collect missing implicit-reference ids by target table across ALL source tables, then issue one IN (...) query per target instead of one per source-target pair. ~30-60s → 1 round-trip per target on a 50-table schema over a VPN.

Commits (bisectable):

  1. 4af4c41refactor(core): referential sampler — warnings, dynamic skip, batched fetch
  2. e169d3afeat(cli): surface sampler integrity warnings via sow doctor

What changed

Issue #3 — warnings, not silence

ensureReferentialIntegrity now returns { tables, warnings } instead of a bare Map. Four warning kinds:

kind meaning
parent_fetch_failed the missing-parent SELECT threw
parent_not_found the parent genuinely doesn't exist in source data (orphan)
child_fetch_failed the ensure-1-child-per-parent SELECT threw
implicit_ref_fetch_failed the implicit IN(...) batch threw

Warnings are deduped by (source, columns, target) fingerprint so a schema with thousands of orphaned rows emits one line per relationship, not one per row. Error messages truncated at 140 chars.

The warnings flow: ensureReferentialIntegritySamplingResult.integrityWarningsConnectorMetadata.integrityWarnings (persisted to disk) → sow doctor <connector> → user sees them.

Issue #7 — dynamic skip

Old hardcoded list:

const SKIP_IMPLICIT_COLUMNS = new Set(["id", "user_id", "owner_id", "created_by"]);

New dynamic check:

const formallyHandled = new Set<string>();
for (const rel of relationships) {
  for (const col of rel.sourceColumns) {
    formallyHandled.add(`${rel.sourceTable}.${col}`);
  }
}
// Later, in resolveImplicitReferences:
if (formallyHandled.has(`${tableName}.${colName}`)) continue;

Works for any language, any column name, any FK shape. The old "id" implicit-skip is preserved naturally because inferTargetTable returns null for plain "id" columns (no _id suffix).

Issue #8 — batched implicit references

Before:

for each (source_table, source_column) pair:
  for each missing id in that pair:
    SELECT * FROM target WHERE id IN (one_id)

After:

for each source row, group missing ids by target table → missingByTarget Map
for each (target_table, missingIds) in missingByTarget:
  for each chunk of 100 ids:
    SELECT * FROM target WHERE id IN ($1, $2, ..., $100)

50-table schema impact: ~250 round-trips → ~50 round-trips. With dedup, often much fewer (multiple source rows referencing the same parent collapse into one bind param).

Test Coverage

Tests: 89 → 99 (+10 new) — all in packages/core/src/sampler/referential.test.ts.

ENSUREREFERENTIALINTEGRITY COVERAGE
═══════════════════════════════════
[+] Warnings collection (Issue #3)
    ├── [★★★ TESTED] Empty warnings on clean run
    ├── [★★★ TESTED] parent_fetch_failed on adapter throw (with FailingAdapter spy)
    ├── [★★★ TESTED] parent_not_found when target table has no matching row
    ├── [★★★ TESTED] implicit_ref_fetch_failed on batch query throw
    └── [★★★ TESTED] reason strings truncated at ~140 chars

[+] Dynamic skip (Issue #7)
    ├── [★★★ TESTED] Formal FK columns excluded from implicit pass
    └── [★★★ TESTED] Non-English column names don't throw (graceful degradation)

[+] Batched implicit references (Issue #8)
    ├── [★★★ TESTED] Three source tables → one target query (not three)
    ├── [★★★ TESTED] >100 ids chunked into batches of 100/100/50
    └── [★★★ TESTED] Duplicate ids deduplicated to a single param

PLUS the 10 pre-existing tests from PR #1 (quoteIdent + SQL injection regression)
all still pass — no behavioral regression to the security fix.

Coverage: 10/10 new paths directly tested. 100%.

Pre-Landing Review

No CRITICAL findings. The sampler refactor was triple-checked against the existing 10 SQL-injection regression tests (all still pass), the test pattern is the same SpyAdapter approach used in PR #1, and the wiring through ConnectorMetadata only adds an optional field (backwards compat for metadata written by older sow versions).

Adversarial review notes

Self-reviewed with the same lens as the prior adversarial passes:

  • Could a transient warning be re-emitted across passes? Yes — the multi-pass loop runs up to 3 times per relationship. Mitigated with the warnedRelationships Set fingerprinted by (source, columns, target).
  • Could the dedup mask a real bug? No — the dedup only collapses identical (source, columns, target) tuples. Different relationships still emit independent warnings.
  • Does the batched IN query hit the bind-param limit? No — chunked at 100, well under Postgres's 65,535. Verified by test.
  • Does the dynamic skip break any existing test? No — the 10 SQL-injection regression tests pass an empty relationships array (their tests are about composite/text PK handling, not formal-FK skipping).
  • Edge case: a relationship where sourceTable === targetTable (self-reference). The outer for (const rel of relationships) still has if (rel.sourceTable === rel.targetTable) continue from the original code. Preserved.

Plan Completion

This PR closes 3 of the remaining 10 issues from /plan-eng-review. The other lanes ship as PRs #2 (Lane B template-DB), #3 (Lane C sanitizer), #4 (Lane D sandbox+env-patch), #5 (Lane E release workflow). All five lanes are independent and can merge in any order.

Test plan

  • bunx vitest run — 99/99 tests passing (89 pre-existing + 10 new)
  • bunx turbo build — 3/3 packages clean
  • bunx eslint on every touched file — clean (4 pre-existing warnings unrelated)
  • Post-merge: sow connect against a schema with intentional orphaned FKs and verify the summary line + sow doctor <connector> flow
  • Post-merge: sow connect against a 50-table schema and watch the round-trip count drop from ~250 to ~50 (Issue #8 measurable win)

🤖 Generated with Claude Code

faculopezscala and others added 2 commits April 6, 2026 09:51
… fetch

Three related fixes to packages/core/src/sampler/referential.ts, surfaced
by the eng review as Issues #3, #7, and #8. Same module, one coordinated
refactor, one test pass.

Issue #3 — warnings collection (was: silent catch{} blocks)
  The three try/catch blocks in ensureReferentialIntegrity silently
  swallowed any failure to fetch a missing parent, an orphaned child,
  or an implicit-reference batch. Users got branches with dangling FKs
  and no way to know. Replaced with structured warning collection into
  a new IntegrityWarning[] array. Four kinds:
    - parent_fetch_failed: the missing-parent SELECT threw
    - parent_not_found:    the parent genuinely doesn't exist in source
    - child_fetch_failed:  the ensure-1-child-per-parent SELECT threw
    - implicit_ref_fetch_failed: the implicit IN(...) batch threw
  Warnings are deduped by (source, columns, target) fingerprint so
  a schema with thousands of orphaned rows emits one line per
  relationship, not one per row. Error messages are truncated to 140
  chars to keep metadata bounded.

  ensureReferentialIntegrity now returns { tables, warnings } instead
  of a bare Map. Callers (sampler/index.ts) thread the warnings through
  SamplingResult.integrityWarnings, which flows into
  ConnectorMetadata.integrityWarnings for later surfacing by the CLI.

Issue #7 — dynamic SKIP_IMPLICIT_COLUMNS (was: English-only hardcoded list)
  The old hardcoded ["id", "user_id", "owner_id", "created_by"] set
  was English-biased and conflated "already handled by formal FK" with
  "should never be followed". A Spanish codebase with `creado_por`, or
  an app with `author_id`/`reviewer_id`/`assignee_id` would fall
  through the gap.

  Replaced with a dynamic check: compute the set of (source_table,
  source_column) pairs that appear in any formal Relationship, and
  skip THOSE in the implicit-reference pass. Works for any language
  and any schema. The old symbolic "id" skip is preserved by the
  inferrer itself (inferTargetTable returns null for columns that
  don't end in _id or can't be mapped to a target table).

Issue #8 — batched implicit references (was: N+1 per source-target pair)
  The old resolveImplicitReferences walked (table, column) pairs and
  fired one SELECT per pair, even when multiple source tables all
  referenced the same parent. On a 50-table schema over a VPN that
  produced 100+ round-trips (~30-60s).

  Rewritten to collect all missing ids grouped by target table across
  ALL source tables first, then issue one IN($1,$2,...) query per
  target (still chunked at 100 ids per query to stay well under
  Postgres's 65,535 bind-param limit). Deduplicates ids so three
  source rows referencing the same parent only request it once.

Tests
  10 new tests in packages/core/src/sampler/referential.test.ts:
    - 5 for warnings (happy path, parent_fetch_failed,
      parent_not_found, implicit_ref_fetch_failed, reason truncation)
    - 2 for dynamic skip (formal FK takes precedence, non-English
      columns don't throw)
    - 3 for batching (one query per target across sources, >100 id
      chunking, dedup)
  Total: 99 passing (was 89, +10 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wires the new SamplingResult.integrityWarnings (added in the previous
commit) through the CLI surface. Zero new core logic — just plumbing
the warning count and details out to the user.

- packages/cli/src/commands/connect.ts
  After a successful sow connect, if the snapshot has any integrity
  warnings, print one summary line:
    ⚠ N referential integrity warning(s) — run `sow doctor <name>` for details

- packages/cli/src/commands/doctor.ts
  New describeConnectorWarnings() returns the per-connector report
  (tables, rows, snapshot size, full warning list) by reading
  ConnectorMetadata. The generic doctor checkConnectors() now flags
  any connector with warnings as a "warn" status with a hint pointing
  at sow doctor <name>, so users discover them even without knowing
  to ask.

- packages/cli/src/commands/runner.ts
  runDoctor() now accepts an optional connectorName positional arg.
  When passed, prints the detailed warnings report instead of the
  generic system check list. Format groups warnings by kind and
  explains the root causes (parent_not_found = source data has
  orphans, *_fetch_failed = transient read error, retry with sow
  connector refresh). JSON mode supported.

- packages/cli/src/cli.ts
  Help text for doctor updated to "doctor [connector]" so users
  discover the new positional.

Tests: still 99 passing — this commit is wiring only, no new logic
worth a unit test (would amount to mocking everything).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@faculopezscala faculopezscala merged commit 9464cde into main Apr 6, 2026
1 check passed
faculopezscala added a commit that referenced this pull request Apr 6, 2026
The 5 code PRs (#2-#6) deliver the engineering for the launch. This PR
delivers everything user-facing: the README pitch, the package metadata,
the docs/ folder, and the CHANGELOG.

This PR depends on PRs #2-#6 being merged first because the new README
references commands those PRs add (sow sandbox, sow doctor <connector>,
the --allow-unsafe flag, the sub-second reset). Land them first, rebase
this against main, then merge.

README.md
  Hero rewritten from "Safe test databases from production Postgres" to
  "Stop letting Claude touch your prod database". Body explains the
  anxiety-reduction pitch: a coding agent is about to do something
  database-adjacent and you feel that quiet pang. sow is the safety
  layer. New "Why sow" section. New "How It Works" diagram showing the
  template-DB shape (one container per connector, N branch DBs, reset
  in <1s). New "Cookbook" stub linking to docs/cookbook.md. New
  "Documentation" section with the docs/ index.

packages/cli/package.json
  Description: "Stop letting Claude touch your prod database. PII-safe
  local Postgres sandbox for coding agents."
  Keywords: added ai-agents, coding-agents, claude-code, cursor,
  sandbox, mcp.

packages/core/package.json
  Description: "sow core engine — analyze, sample, sanitize, and branch
  Postgres databases for safe coding-agent sandboxes"
  Keywords: added ai-agents, coding-agents, sandbox.

packages/mcp/package.json
  Description corrected from "15 tools" to "22 tools" (the actual count
  in packages/mcp/src/index.ts) and repositioned: "sow MCP server — 22
  tools for coding agents (Claude Code, Cursor, Codex) to safely manage
  Postgres sandboxes"
  Keywords: added claude-code, cursor, codex, coding-agents, sandbox.

docs/sandbox.md (new)
  The sow sandbox flagship command — what it does, the flags, the
  .env.local backup/revert flow, when not to use it, and what's
  actually in the sandbox.

docs/sanitization.md (new)
  What sow sanitizes (the PII type table), how JSONB walking works,
  the fail-closed gate, the --allow-unsafe escape hatch, custom rules
  via .sow.yml, what sow does NOT do (free-text NER, etc.), and the
  read-only-on-the-source guarantee.

docs/cookbook.md (new)
  Three end-to-end workflows with concrete prompts:
  1. Let Claude refactor your schema without fear
  2. Let Cursor generate seed data for a new feature
  3. Let your coding agent debug a failing migration
  Plus the "agent reset loop" pattern diagram, the MCP tool list,
  and operational tips (one long-running sandbox per project,
  checkpoints for known-good states, sow doctor as the inspection
  surface, principle of least privilege on the source DB user).

CHANGELOG.md (new)
  Scaffold with three sections:
  - [Unreleased] documenting the planned PR #2-#6 features under
    Added/Changed
  - [0.1.14] documenting the SQL injection security fix that already
    shipped (PR #1, merged earlier in the session)
  - [0.1.13] one-line summary of the initial public release

Test/build/lint all clean (89/89 tests, 3/3 packages built, no source
code changed in this PR).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant