S3-based consolidate: route pg_dumps through s3://newgraph/ instead of serial scp #170

@NewGraphEnvironment

Problem

Today's provincial consolidate phase took ~20 min of wall time to move 4 pg_dumps (~10 GB total) from M1 + 3 cyphers to M4. M1's tailnet path was particularly slow (~1.7 MB/s). Most of the wall time was data movement, not pg_restore.

Meanwhile, snapshot_bcfp.sh already uses S3 (s3://newgraph/) for --with-bcfp-views. The plumbing is half-built. Routing consolidate through S3 would:

  • Let all hosts upload in parallel (vs serial scp)
  • Use multi-part downloads on M4 (vs single-stream scp)
  • Decouple "data safe" (upload done) from "M4 has it" (download done) — cyphers could burn the moment upload completes
  • Bypass M1's slow tailnet (S3 over public internet is much faster than M1's WireGuard path)

Why not done today

Audit of aws CLI state on each host (2026-05-14):

Host     | aws CLI status                                              | Issue
M4       | ✗ broken                                                    | Homebrew python 3.14 missing _XML_SetAllocTrackerActivationThreshold in pyexpat
M1       | ✗ not on non-interactive PATH                               | Interactive works, headless doesn't
cy[job1] | ✓ installed (v2.34.40) but Access Denied on s3://newgraph/  | IAM key can't write

So S3-based consolidate requires:

  1. Fix M4 aws CLI (try brew reinstall awscli, or pin python 3.12 dep, or pivot to uv-managed install)
  2. Wire M1 aws CLI into non-interactive PATH (similar to libpq fix already in snapshot_bcfp.sh)
  3. Create an IAM key/policy that grants write access to a transit prefix (e.g. s3://newgraph/consolidate-transit/<TS>/), usable from all 5 hosts (M4 + M1 + 3 cyphers). Importantly, also bake it into the cypher snapshot/cloud-init so new cyphers come up ready.
  4. Optional: lifecycle rule to auto-delete transit objects > N days old
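
A preflight check along these lines could confirm prereqs 1 to 3 on each host before a run. This is only a sketch: the function names and the preflight/ sub-prefix are hypothetical, and it assumes the IAM policy also allows deleting the probe object.

```shell
#!/usr/bin/env bash
# Preflight sketch: is the aws CLI usable from this (non-interactive) shell,
# and can the current credentials write to the transit prefix?
# Function names and the preflight/ sub-prefix are hypothetical.

TRANSIT="${TRANSIT:-s3://newgraph/consolidate-transit}"

# Returns 0 if `aws` resolves on PATH and runs far enough to print its version.
aws_cli_ok() {
  command -v aws >/dev/null 2>&1 && aws --version >/dev/null 2>&1
}

# Returns 0 if a tiny probe object can be written to and removed from the prefix.
# Assumes the key is allowed to delete as well as put.
transit_writable() {
  local key
  key="$TRANSIT/preflight/$(hostname)-$$.txt"
  printf 'ok\n' | aws s3 cp - "$key" --only-show-errors || return 1
  aws s3 rm "$key" --only-show-errors
}

# Runs both checks; returns 1 for a CLI problem, 2 for a permissions problem.
preflight() {
  aws_cli_ok || { echo "aws CLI missing or broken on $(hostname)" >&2; return 1; }
  transit_writable || { echo "cannot write to $TRANSIT" >&2; return 2; }
  echo "preflight ok"
}
```

To catch the M1-style PATH problem, the functions would need to be sourced and preflight called from the same kind of headless shell the pipeline uses (e.g. over ssh without a login shell), not from an interactive terminal.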

Proposed: rewrite consolidate_schema.R to use S3

After above prereqs:

consolidate_schema(
  schema = "fresh",
  sources = list(
    list(host = "m1",                  via = "docker", bucket = ...),
    list(host = paste0("cypher@", IP), via = "docker", bucket = ...)
  ),
  transit = "s3://newgraph/consolidate-transit",
  backup = TRUE)

Flow:

  1. Each source: pg_dump | aws s3 cp - s3://newgraph/consolidate-transit/<TS>/<host>.dump.gz (stream + compress, no local temp file)
  2. M4: parallel aws s3 cp of all 4 source dumps (multi-part, multi-stream)
  3. M4: parallel pg_restore of each dump as it arrives
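
The three steps above might look roughly like this in shell. This is a sketch under several assumptions, not the planned consolidate_schema.R implementation: the function names, DB_NAME, and the local staging directory are hypothetical. It assumes pg_dump's custom format (-Fc), which is compressed internally and readable by pg_restore, so the object is named <host>.dump rather than <host>.dump.gz. It also ignores the docker wrapping implied by via = "docker".

```shell
#!/usr/bin/env bash
# Sketch of the S3 transit flow. Function names, DB_NAME and the staging
# directory are hypothetical; dumps use pg_dump custom format (-Fc).

TRANSIT="${TRANSIT:-s3://newgraph/consolidate-transit}"
DB_NAME="${DB_NAME:-postgres}"   # hypothetical database name

# Build the object URL for one host's dump under a run timestamp.
transit_key() {
  local ts="$1" host="$2"
  printf '%s/%s/%s.dump\n' "$TRANSIT" "$ts" "$host"
}

# Step 1 (on each source): stream the dump straight to S3, no local temp file.
# The subshell keeps pipefail from leaking into the caller's shell.
upload_dump() {
  local ts="$1" host="$2" schema="$3"
  (
    set -o pipefail
    pg_dump -Fc -n "$schema" -d "$DB_NAME" |
      aws s3 cp - "$(transit_key "$ts" "$host")" --only-show-errors
  )
}

# Steps 2 and 3 (on M4): download every dump in parallel and restore each
# one as soon as its own download finishes. Returns 1 if any host failed.
fetch_and_restore() {
  local ts="$1" dest="$2"
  shift 2
  local host pid rc=0
  local pids=()
  mkdir -p "$dest"
  for host in "$@"; do
    (
      aws s3 cp "$(transit_key "$ts" "$host")" "$dest/$host.dump" --only-show-errors &&
        pg_restore -d "$DB_NAME" "$dest/$host.dump"
    ) &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || rc=1
  done
  return "$rc"
}
```

On a source this would be called as upload_dump "$TS" m1 fresh, and on M4 as fetch_and_restore "$TS" /tmp/consolidate m1 cy1 cy2 cy3 (host labels here are placeholders). Once upload_dump returns 0 the dump is in S3, which is the point at which a cypher could be burned.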

Acceptance

  • Provincial consolidate phase: ≤10 min wall for 4 sources × ~3 GB each (vs ~20-60 min today, depending on M1 tailnet luck)
  • Cyphers burnable immediately after their S3 upload completes (M4's download is fully decoupled)
  • No reliance on M1 tailnet bandwidth for consolidate
  • Falls back to scp if transit = NULL (preserve today's path as fallback)

Cross-ref

  • #168 (decouple bcfp compare from pipeline run): a different concern, but the same general "data movement is the bottleneck" theme
  • Original concrete pain point: 2026-05-13 provincial run, where the M1→M4 scp stalled at 26% and ran at ~1.7 MB/s
