Problem
Today's provincial consolidate phase took ~20 min wall to move 4 pg_dumps (~10 GB total) from M1 + 3 cyphers to M4. M1's tailnet path was particularly slow (~1.7 MB/s). Most of the wall time was data movement, not pg_restore.
Meanwhile, snapshot_bcfp.sh already uses S3 (s3://newgraph/) for --with-bcfp-views. The plumbing is half-built. Routing consolidate through S3 would:
- Let all hosts upload in parallel (vs serialized sftp)
- Use multi-part downloads on M4 (vs single-stream scp)
- Decouple "data safe" (upload done) from "M4 has it" (download done) — cyphers could burn the moment upload completes
- Bypass M1's slow tailnet (S3 over public internet is much faster than M1's WireGuard path)
Why not done today
Audit of aws CLI state on each host (2026-05-14):
| Host | aws CLI status | Issue |
|---|---|---|
| M4 | ✗ broken | Homebrew python 3.14 missing _XML_SetAllocTrackerActivationThreshold in pyexpat |
| M1 | ✗ not on non-interactive PATH | Interactive works, headless doesn't |
| cy[job1] | ✓ installed (v2.34.40) but Access Denied on s3://newgraph/ | IAM key can't write |
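The three states in the audit (not on PATH, present but failing to start, working) can be told apart with a small check. This is a minimal sketch, not the audit that was actually run; `aws_cli_state` and the `AWS_BIN` override are hypothetical names introduced here for illustration.

```shell
# Classify the aws CLI as seen by the *current* shell.
# Prints one of: missing | broken | ok
aws_cli_state() {
  # AWS_BIN is a hypothetical override so the check can target another binary.
  local bin="${AWS_BIN:-aws}"
  if ! command -v "$bin" >/dev/null 2>&1; then
    echo "missing"   # not on this shell's PATH (the M1 headless case)
  elif ! "$bin" --version >/dev/null 2>&1; then
    echo "broken"    # found, but fails to start (the M4 pyexpat case)
  else
    echo "ok"        # starts; S3 write access still needs a separate test
  fi
}
```

Running the check as `ssh <host> 'command -v aws'` exercises the non-interactive shell, which is where M1 differs from an interactive login. An "ok" result says nothing about IAM permissions: the cypher case (installed but Access Denied) only shows up when a real write to the bucket is attempted.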
So S3-based consolidate requires:
- Fix M4 aws CLI (try brew reinstall awscli, or pin python 3.12 dep, or pivot to uv-managed install)
- Wire M1 aws CLI into non-interactive PATH (similar to libpq fix already in snapshot_bcfp.sh)
- Create an IAM key/policy that grants write access to a transit prefix (e.g. s3://newgraph/consolidate-transit/<TS>/) usable from all 5 hosts (M4 + M1 + cyphers; importantly, ALSO bake into the cypher snapshot/cloud-init so new cyphers come up ready)
- Optional: lifecycle rule to auto-delete transit objects > N days old
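For the optional lifecycle rule, a sketch of what the configuration could look like. The rule ID, the 7-day value for N, and the helper name `transit_lifecycle_json` are assumptions for illustration; only the bucket and prefix come from this issue.

```shell
# Emit an S3 lifecycle configuration that expires objects under a prefix.
# Arguments: <prefix> <days>
transit_lifecycle_json() {
  local prefix="$1" days="$2"
  cat <<EOF
{
  "Rules": [
    {
      "ID": "expire-consolidate-transit",
      "Filter": { "Prefix": "${prefix}" },
      "Status": "Enabled",
      "Expiration": { "Days": ${days} }
    }
  ]
}
EOF
}
```

One way to apply it: write the output to a file with `transit_lifecycle_json consolidate-transit/ 7 > lifecycle.json`, then run `aws s3api put-bucket-lifecycle-configuration --bucket newgraph --lifecycle-configuration file://lifecycle.json`. That call replaces the bucket's entire lifecycle configuration, so any existing rules on newgraph would need to be merged into the file first.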
Proposed: rewrite consolidate_schema.R to use S3
After above prereqs:
consolidate_schema(
  schema = "fresh",
  sources = list(
    list(host = "m1", via = "docker", bucket = ...),
    list(host = paste0("cypher@", IP), via = "docker", bucket = ...)
  ),
  transit = "s3://newgraph/consolidate-transit",
  backup = TRUE)
Flow:
- Each source: pg_dump | aws s3 cp - s3://newgraph/consolidate-transit/<TS>/<host>.dump.gz (stream + compress, no local temp file)
- M4: parallel aws s3 cp of all 4 source dumps (multi-part, multi-stream)
- M4: parallel pg_restore of each dump as it arrives
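The flow above could be sketched in shell roughly as follows. This is an illustration under assumptions, not the planned implementation: the function names, the database and schema arguments, and the explicit gzip stage (added here to match the .dump.gz suffix) are all hypothetical, and the docker wrapping implied by via = "docker" is left out.

```shell
# Build the transit object URI for one source host.
transit_uri() {
  local base="$1" ts="$2" host="$3"
  printf '%s/%s/%s.dump.gz\n' "${base%/}" "$ts" "$host"
}

# Source side: stream a dump straight to S3, no local temp file.
# Caveat: without `set -o pipefail` (bash), a failed pg_dump can go unnoticed.
upload_dump() {
  local uri="$1" db="$2" schema="$3"
  pg_dump --format=custom --schema="$schema" "$db" | gzip | aws s3 cp - "$uri"
}

# M4 side: start one download per host in the background, then wait for all.
fetch_all() {
  local base="$1" ts="$2" dest="$3"
  shift 3
  local host pid pids="" rc=0
  for host in "$@"; do
    aws s3 cp "$(transit_uri "$base" "$ts" "$host")" "$dest/$host.dump.gz" &
    pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || rc=1   # report failure if any download failed
  done
  return "$rc"
}

# M4 side: restore one downloaded dump (pg_restore reads stdin when no file is given).
restore_dump() {
  local file="$1" db="$2"
  gunzip -c "$file" | pg_restore --dbname="$db"
}
```

A source is "data safe" once `upload_dump` returns success, independent of when M4 runs `fetch_all`. Restoring each dump as it arrives would mean calling `restore_dump` per host after its own download finishes rather than after the final wait; that refinement is omitted here.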
Acceptance
- Provincial consolidate phase: ≤10 min wall for 4 sources × ~3 GB each (vs ~20-60 min today, depending on M1 tailnet luck)
- Cyphers burnable immediately after their S3 upload completes (M4's download is fully decoupled)
- No reliance on M1 tailnet bandwidth for consolidate
- Falls back to scp if transit = NULL (preserve today's path as fallback)
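As a sanity check on the first target, the numbers above imply a minimum aggregate transfer rate. A back-of-envelope sketch, assuming decimal units (1 GB = 1000 MB); both helper names are hypothetical.

```shell
# Minimum aggregate rate (MB/s, rounded up) to move total_mb within budget_s seconds.
required_mbps() {
  local total_mb="$1" budget_s="$2"
  echo $(( (total_mb + budget_s - 1) / budget_s ))
}

# Whole seconds (rounded down) to move size_mb at rate_kbps (KB/s).
transfer_seconds() {
  local size_mb="$1" rate_kbps="$2"
  echo $(( size_mb * 1000 / rate_kbps ))
}
```

`required_mbps 12000 600` gives 20, so 4 sources × ~3 GB in 10 min needs at least ~20 MB/s aggregate, and more in practice because pg_restore shares the same budget. `transfer_seconds 3000 1700` gives 1764, i.e. roughly 29 min for a single ~3 GB dump at M1's observed 1.7 MB/s.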
Cross-ref
- link#168 (decouple bcfp compare from pipeline run) — different concern but same general "data movement is the bottleneck" theme
- Original concrete pain point: 2026-05-13 provincial run, M1→M4 scp stalled at 26% / ran 1.7 MB/s