Problem
During today's 75-min provincial dispatch (2026-05-13, link 0.36.0), the bcfp tunnel on M4 dropped at some point and stayed down. Verified after dispatch ended:
- M4: pg_isready -h localhost -p 63333 → no response
- No ssh -fN -L 63333:... process running
Effect on the run: 10 of 217 WSGs errored. Patterns:
- 3 explicit connection refused on :63333 (SHUL, STIR, TABR; end-of-run, after tunnel had died)
- 2 SSL SYSCALL error: EOF detected (FINA, SALR; mid-run, tunnel mid-drop)
- 1 ignoring SIGPIPE signal (COWN; broken pipe to dead connection)
- 4 attempt to apply non-function (GRAI, MAHD, NBNK, TATR; likely same root cause: dbGetQuery(conn_ref, ...) returned NULL/closure when conn_ref dropped, downstream R code crashed with cryptic error)
The 4 R "attempt to apply non-function" errors are particularly nasty: the message doesn't mention the tunnel, looks like a code bug, and costs the operator hours of diagnosis.
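A minimal sketch of how the four symptom patterns above could be mapped back to the tunnel when triaging a run. The function name classify_wsg_error and the labels it prints are hypothetical, not part of the existing scripts; only the matched message fragments come from this run.

```shell
#!/bin/sh
# Hypothetical triage helper: label an error message with the tunnel symptom
# it most likely represents. Patterns are the four seen in this run; the
# output labels are invented for illustration.
classify_wsg_error() {
  case "$1" in
    *[Cc]onnection\ refused*)            echo "tunnel:refused" ;;
    *"SSL SYSCALL error: EOF detected"*) echo "tunnel:eof" ;;
    *SIGPIPE*)                           echo "tunnel:sigpipe" ;;
    # Not tunnel-specific on its own; flagged as suspect, not confirmed.
    *"attempt to apply non-function"*)   echo "tunnel:suspect" ;;
    *)                                   echo "other" ;;
  esac
}
```

The "attempt to apply non-function" case is deliberately labeled as suspect only, since the link to the tunnel is inferred above rather than confirmed.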
Why it's bad
- Silent failure mode: tunnel dies → R error chain doesn't surface "tunnel is dead", just symptom errors that look like data problems
- Per-WSG retries needed to recover (re-run 10 WSGs manually after tunnel reopens)
- Long runs are the most affected; short smoke runs never trigger it
- No mechanism in run_provincial_parity.R to detect/reconnect mid-run
Proposed
Three options, smallest blast radius first:
(a) Wrap the tunnel in autossh in the trifecta_provincial.sh M4 wrapper script. autossh monitors the tunnel via heartbeat and auto-reconnects on drop. Drop-in replacement:
    autossh -M 0 -o "ServerAliveInterval=30" -o "ServerAliveCountMax=3" \
      -L 63333:127.0.0.1:5432 db_newgraph -N &
-M 0 disables autossh's own monitoring port, so autossh only restarts ssh when ssh exits; drop detection comes from ssh's ServerAlive probes. With ServerAliveInterval=30 + ServerAliveCountMax=3, ssh gives up on a dead link after ~90s and autossh reopens the tunnel: silent recovery.
(b) Detect-and-reconnect in run_provincial_parity.R per-WSG. Before each WSG, check nc -z localhost 63333; if down, open a fresh tunnel via system(). More R-side complexity but doesn't add a system dependency.
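The check-then-reopen logic in (b) might look roughly like this, sketched in shell rather than R so it could also sit in the wrapper script. The function names probe_tunnel, open_tunnel and ensure_tunnel are hypothetical, and the ssh forward spec is an assumption copied from the autossh example in (a), not the command currently in use.

```shell
#!/bin/sh
# Sketch of option (b): probe the forwarded port, reopen the tunnel if down.

probe_tunnel() {
  # Succeeds when something is listening on the local forward port.
  nc -z localhost 63333 >/dev/null 2>&1
}

open_tunnel() {
  # Assumption: same forward spec as the autossh example in option (a).
  ssh -fN -L 63333:127.0.0.1:5432 db_newgraph >/dev/null 2>&1
}

ensure_tunnel() {
  if probe_tunnel; then
    echo "tunnel-ok"
    return 0
  fi
  # Port is down: try once to reopen, then probe again before trusting it.
  open_tunnel || true
  if probe_tunnel; then
    echo "tunnel-reopened"
    return 0
  fi
  # Still down: emit the fail-loud marker so the caller can stop or retry.
  echo "[tunnel-dropped]"
  return 1
}
```

From R, the same function could be reached through system(), with a non-zero exit status treated as a reason to skip or fail the WSG.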
(c) Detect-and-fail-loud, retry-only-failed. Add explicit tunnel-health check at WSG start. On drop, error with explicit "[tunnel-dropped]" prefix in the RDS error stub. Post-run, scan for that prefix and retry just those WSGs.
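The post-run scan in (c) could be sketched as below. RDS stubs are not readable from shell, so this assumes the errors are also written to a plain-text, tab-separated log (WSG code, then error message); that log format and the function name wsgs_to_retry are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the option (c) post-run scan.
# Assumed input: a tab-separated file, one row per failed WSG:
#   <WSG code><TAB><error message>
wsgs_to_retry() {
  # Print the WSG code for rows whose message starts with the marker.
  awk -F '\t' 'index($2, "[tunnel-dropped]") == 1 { print $1 }' "$1"
}
```

The resulting list would then feed whatever per-WSG rerun the dispatch already supports, so only the tunnel-affected WSGs are repeated.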
Recommendation
Start with (a) — add autossh to the trifecta_provincial.sh inline-tunnel block for M4 (and cyphers, since they have the same pattern). One-line dep: brew install autossh on M4, apt install autossh in cypher cloud-init.
Pairs naturally with the decouple-compare issue (separate filing) — even with autossh, decoupling the bcfp compare from the link pipeline run makes the system more resilient.
Acceptance
- Provincial 217-WSG dispatch: zero WSGs error with connection refused / SSL SYSCALL EOF / SIGPIPE / attempt to apply non-function
- Tunnel transparently survives ≥1 ssh-level drop event during a long run
- Operator never needs to manually reopen tunnel mid-run
Reference run
- Wall: 1h15m36s (217 WSGs, 5 hosts)
- 207 OK, 10 errors (all tunnel-related per analysis above)
- Per-WSG timing CSV: data-raw/logs/provincial_parity/20260513_1839_per_wsg_times.csv
Update 2026-05-24: superseded by tunnel-free compare (#175)
Chosen direction: eliminate the tunnel from the compare rather than keep it alive with autossh. bcfp's published output is loaded into the local DB tunnel-free by snapshot_bcfp.sh --with-bcfp-views (fresh.streams_vw_bcfp; build verified via s3://fresh-bc/bcfishpass/log.json). The compare (lnk_compare_mapping_code, #175) diffs link's persisted output against that local table — one local SQL join, no :63333. No tunnel → no drops → this failure mode disappears. Fix tracked under #175; close this when the compare is tunnel-free.