bcfp tunnel drops during long provincial runs cause silent per-WSG errors #167

@NewGraphEnvironment

Problem

During today's 75-min provincial dispatch (2026-05-13, link 0.36.0), the bcfp tunnel on M4 dropped at some point and stayed down. Verified after dispatch ended:

  • M4: pg_isready -h localhost -p 63333 → no response
  • No ssh -fN -L 63333:... process running

Effect on the run: 10 of 217 WSGs errored. Patterns:

  • 3 explicit connection refused on :63333 (SHUL, STIR, TABR — end-of-run, after tunnel had died)
  • 2 SSL SYSCALL error: EOF detected (FINA, SALR — mid-run, tunnel mid-drop)
  • 1 ignoring SIGPIPE signal (COWN — broken pipe to dead connection)
  • 4 attempt to apply non-function (GRAI, MAHD, NBNK, TATR; likely the same root cause: once conn_ref dropped, dbGetQuery(conn_ref, ...) returned NULL or a closure instead of a result, and downstream R code crashed with a cryptic error)

The 4 R "attempt to apply non-function" errors are particularly nasty: the message doesn't mention the tunnel, looks like a code bug, and costs the operator hours of diagnosis.

Why it's bad

  • Silent failure mode: tunnel dies → R error chain doesn't surface "tunnel is dead", just symptom errors that look like data problems
  • Per-WSG retries needed to recover (re-run 10 WSGs manually after tunnel reopens)
  • Long runs are the most affected — short smokes never trigger it
  • No mechanism in run_provincial_parity.R to detect/reconnect mid-run

Proposed

Three options, smallest blast radius first:

(a) Wrap the tunnel in autossh in the trifecta_provincial.sh M4 wrapper script. autossh monitors the tunnel via heartbeat and auto-reconnects on drop. Drop-in replacement:

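autossh -M 0 -N -o "ServerAliveInterval=30" -o "ServerAliveCountMax=3" \
        -o "ExitOnForwardFailure=yes" \
        -L 63333:127.0.0.1:5432 db_newgraph &

-M 0 disables autossh's own monitoring port, so drop detection relies on ssh's ServerAlive probes. With ServerAliveInterval=30 and ServerAliveCountMax=3, ssh exits after 3 unanswered probes (about 90 s) and autossh restarts it. ExitOnForwardFailure=yes makes ssh exit, rather than stay up without the forward, if port 63333 cannot be bound. Connections already open through the forward are still lost when ssh exits, so a query in flight at the moment of the drop can still fail; connections opened after the restart work again.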

(b) Detect-and-reconnect in run_provincial_parity.R per-WSG. Before each WSG, check nc -z localhost 63333; if down, open a fresh tunnel via system(). More R-side complexity but doesn't add a system dependency.
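A minimal sketch of the check-then-reopen step in (b), written as shell functions that an R system() call could invoke. The function names (tunnel_up, open_tunnel, ensure_tunnel) and the TUNNEL_PORT / TUNNEL_HOST variables are hypothetical; the port, forward target, and host alias are copied from the autossh command above. It assumes nc supports -z on the host.

```shell
# Hypothetical helpers; defaults mirror the forward used elsewhere in this issue.
TUNNEL_PORT="${TUNNEL_PORT:-63333}"
TUNNEL_HOST="${TUNNEL_HOST:-db_newgraph}"

# Return 0 if something accepts TCP connections on the local tunnel port.
tunnel_up() {
  nc -z localhost "$TUNNEL_PORT" >/dev/null 2>&1
}

# Open the forward in the background; fail if the port cannot be bound.
open_tunnel() {
  ssh -fN -o ExitOnForwardFailure=yes \
      -L "${TUNNEL_PORT}:127.0.0.1:5432" "$TUNNEL_HOST"
}

# Call before each WSG: no-op when the port answers, otherwise reopen and re-check.
ensure_tunnel() {
  tunnel_up && return 0
  echo "[tunnel-dropped] port ${TUNNEL_PORT} is down, reopening" >&2
  open_tunnel || return 1
  sleep 2   # give ssh a moment to bind the local port
  tunnel_up
}
```

A non-zero exit from ensure_tunnel means the tunnel could not be restored, which the caller can treat as a reason to stop or skip the WSG instead of running into a symptom error.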

(c) Detect-and-fail-loud, retry-only-failed. Add explicit tunnel-health check at WSG start. On drop, error with explicit "[tunnel-dropped]" prefix in the RDS error stub. Post-run, scan for that prefix and retry just those WSGs.
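The post-run scan in (c) could look roughly like this. It assumes the per-WSG error stubs are also available as a tab-separated text file with one "WSG<TAB>message" line per watershed group; that export format and the function name tunnel_failed_wsgs are hypothetical, since the issue only says the stubs are stored as RDS.

```shell
# Print the WSG codes whose error message starts with the "[tunnel-dropped]" prefix.
# $1: path to a tab-separated file, "WSG<TAB>message" per line (assumed format).
tunnel_failed_wsgs() {
  awk -F '\t' 'index($2, "[tunnel-dropped]") == 1 { print $1 }' "$1"
}
```

The resulting list (one WSG per line) is the retry set; errors without the prefix are left alone so real code or data problems are not masked by a blanket re-run.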

Recommendation

Start with (a) — add autossh to the trifecta_provincial.sh inline-tunnel block for M4 (and cyphers, since they have the same pattern). One-line dep: brew install autossh on M4, apt install autossh in cypher cloud-init.

Pairs naturally with the decouple-compare issue (separate filing) — even with autossh, decoupling the bcfp compare from the link pipeline run makes the system more resilient.

Acceptance

  • Provincial 217-WSG dispatch: zero WSGs error with connection refused / SSL SYSCALL EOF / SIGPIPE / attempt to apply non-function
  • Tunnel transparently survives ≥1 ssh-level drop event during a long run
  • Operator never needs to manually reopen tunnel mid-run
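The first acceptance criterion can be checked mechanically. The sketch below counts log lines matching any of the four symptom strings listed under Problem; the log file layout (one error message per line) and the function names are assumptions, not something this repo is known to provide.

```shell
# Count lines in $1 that match any of the four tunnel symptom patterns.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_tunnel_symptoms() {
  grep -c -i -E 'connection refused|SSL SYSCALL error: EOF detected|ignoring SIGPIPE signal|attempt to apply non-function' "$1" || true
}

# Succeed only when the run log contains no tunnel symptom at all.
acceptance_ok() {
  [ "$(count_tunnel_symptoms "$1")" -eq 0 ]
}
```

The match is case-insensitive because the exact capitalization of the connection-refused message can differ between clients.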

Reference run

  • Wall: 1h15m36s (217 WSGs, 5 hosts)
  • 207 OK, 10 errors (all tunnel-related per analysis above)
  • Per-WSG timing CSV: data-raw/logs/provincial_parity/20260513_1839_per_wsg_times.csv

Update 2026-05-24: superseded by tunnel-free compare (#175)

Chosen direction: eliminate the tunnel from the compare rather than keep it alive with autossh. bcfp's published output is loaded into the local DB tunnel-free by snapshot_bcfp.sh --with-bcfp-views (fresh.streams_vw_bcfp; build verified via s3://fresh-bc/bcfishpass/log.json). The compare (lnk_compare_mapping_code, #175) diffs link's persisted output against that local table — one local SQL join, no :63333. No tunnel → no drops → this failure mode disappears. Fix tracked under #175; close this when the compare is tunnel-free.
