bcfp tunnel drops during long provincial runs cause silent per-WSG errors

## Problem

During today's 75-min provincial dispatch (2026-05-13, link 0.36.0), the bcfp tunnel on M4 dropped at some point and stayed down. Verified after dispatch ended:
- M4: `pg_isready -h localhost -p 63333` → no response
- No `ssh -fN -L 63333:...` process running

Effect on the run: 10 of 217 WSGs errored. Patterns:
- 3 explicit `connection refused on :63333` (SHUL, STIR, TABR — end-of-run, after tunnel had died)
- 2 `SSL SYSCALL error: EOF detected` (FINA, SALR — mid-run, tunnel mid-drop)
- 1 `ignoring SIGPIPE signal` (COWN — broken pipe to dead connection)
- 4 `attempt to apply non-function` (GRAI, MAHD, NBNK, TATR): likely the same root cause. `dbGetQuery(conn_ref, ...)` presumably returned NULL or a closure once `conn_ref` dropped, and downstream R code then crashed with this cryptic error.

The four `attempt to apply non-function` errors are particularly nasty: the message never mentions the tunnel, looks like a code bug, and costs the operator hours of diagnosis.

## Why it's bad

- Silent failure mode: tunnel dies → R error chain doesn't surface "tunnel is dead", just symptom errors that look like data problems
- Per-WSG retries needed to recover (re-run 10 WSGs manually after tunnel reopens)
- Long runs are the most affected — short smokes never trigger it
- No mechanism in `run_provincial_parity.R` to detect/reconnect mid-run

## Proposed

Three options, smallest blast radius first:

**(a) Wrap the tunnel in `autossh`** in the `trifecta_provincial.sh` M4 wrapper script. `autossh` monitors the tunnel via heartbeat and auto-reconnects on drop. Drop-in replacement:

```bash
autossh -M 0 -o "ServerAliveInterval=30" -o "ServerAliveCountMax=3" \
        -L 63333:127.0.0.1:5432 db_newgraph -N &
```

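`-M 0` disables autossh's own monitoring port, so drop detection is left to ssh's ServerAlive keepalives. With `ServerAliveInterval=30` + `ServerAliveCountMax=3`, ssh exits after ~90s of unanswered keepalives and autossh restarts it, with no operator action. Note that a restart restores the listener on `:63333` but not database connections that were already open through the old tunnel; those still have to be re-established by the client.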

**(b) Detect-and-reconnect in `run_provincial_parity.R` per-WSG.** Before each WSG, check `nc -z localhost 63333`; if down, open a fresh tunnel via `system()`. More R-side complexity but doesn't add a system dependency.
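A minimal shell sketch of the check-then-reopen step in (b), which the R script could call via `system()` before each WSG. The function names, the retry count, and `TUNNEL_RETRY_SLEEP` are hypothetical; the forward spec and host alias are copied from the `autossh` example above.

```bash
# Hypothetical helper, not taken from run_provincial_parity.R or trifecta_provincial.sh.
TUNNEL_PORT="${TUNNEL_PORT:-63333}"

# Return 0 if something is listening on localhost:$TUNNEL_PORT.
tunnel_up() {
  nc -z localhost "$TUNNEL_PORT" >/dev/null 2>&1
}

# Open a fresh background tunnel; ExitOnForwardFailure makes ssh fail
# instead of lingering when the local port cannot be bound.
tunnel_open() {
  ssh -fN -o ExitOnForwardFailure=yes \
      -L "${TUNNEL_PORT}:127.0.0.1:5432" db_newgraph
}

# Check the port, reopen if it is down, and re-check.
# Returns 0 only if the port is reachable at the end.
ensure_tunnel() {
  tries="${1:-3}"
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    tunnel_up && return 0
    echo "tunnel down on :${TUNNEL_PORT}, reopening (attempt ${attempt}/${tries})" >&2
    tunnel_open || true
    sleep "${TUNNEL_RETRY_SLEEP:-2}"
    attempt=$((attempt + 1))
  done
  tunnel_up
}
```

Calling `ensure_tunnel || exit 1` before each WSG would cover drops between WSGs; a drop in the middle of a WSG would still surface as a query error, and any R connection object opened through the old tunnel would need to be recreated.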

**(c) Detect-and-fail-loud, retry-only-failed.** Add explicit tunnel-health check at WSG start. On drop, error with explicit `"[tunnel-dropped]"` prefix in the RDS error stub. Post-run, scan for that prefix and retry just those WSGs.
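The fail-loud and retry-only-failed steps in (c) could look roughly like this at shell level. This is a sketch under assumptions: it supposes errors are mirrored into a tab-separated text log with one `<WSG><TAB><message>` line per failure, whereas the proposal itself places the prefix in the RDS error stub. All function names and the log layout are hypothetical.

```bash
# Prefix from option (c); the tab-separated log layout is an assumption.
TUNNEL_TAG='[tunnel-dropped]'
TUNNEL_PORT="${TUNNEL_PORT:-63333}"

# Return 0 if something is listening on localhost:$TUNNEL_PORT.
tunnel_up() {
  nc -z localhost "$TUNNEL_PORT" >/dev/null 2>&1
}

# Before a WSG starts: if the tunnel is down, print a tagged error line
# ("<WSG>\t[tunnel-dropped] ...") and return non-zero.
require_tunnel() {
  if ! tunnel_up; then
    printf '%s\t%s port %s unreachable\n' "$1" "$TUNNEL_TAG" "$TUNNEL_PORT"
    return 1
  fi
}

# After the run: list the WSGs whose error message starts with the tag.
tunnel_failed_wsgs() {
  awk -F'\t' -v tag="$TUNNEL_TAG" 'index($2, tag) == 1 { print $1 }' "$1" | sort -u
}
```

The output of `tunnel_failed_wsgs` is the retry list: only WSGs that failed because the tunnel was down, leaving genuine code or data errors out of the re-run.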

## Recommendation

Start with (a): add `autossh` to the `trifecta_provincial.sh` inline-tunnel block for M4 (and the cyphers, since they use the same pattern). One-line dependency: `brew install autossh` on M4, `apt install autossh` in the cypher cloud-init.

Pairs naturally with the decouple-compare issue (separate filing) — even with autossh, decoupling the bcfp compare from the link pipeline run makes the system more resilient.

## Acceptance

- Provincial 217-WSG dispatch: zero WSGs error with `connection refused` / `SSL SYSCALL EOF` / SIGPIPE / `attempt to apply non-function`
- Tunnel transparently survives ≥1 ssh-level drop event during a long run
- Operator never needs to manually reopen tunnel mid-run
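The first criterion can be checked mechanically after a run. The sketch below counts log lines matching the four error signatures seen on 2026-05-13; it assumes per-WSG errors end up in a plain-text log, and the function name is hypothetical.

```bash
# Error signatures observed in the 2026-05-13 reference run.
TUNNEL_ERROR_PATTERN='connection refused|SSL SYSCALL error|SIGPIPE|attempt to apply non-function'

# Print the number of lines in the given log that match any signature.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_tunnel_errors() {
  grep -cE "$TUNNEL_ERROR_PATTERN" "$1" || true
}
```

A count of 0 satisfies the criterion. A non-zero count still needs a look at the matching lines, since `attempt to apply non-function` can also come from a genuine code bug.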

## Reference run

- Wall: 1h15m36s (217 WSGs, 5 hosts)
- 207 OK, 10 errors (all tunnel-related per analysis above)
- Per-WSG timing CSV: `data-raw/logs/provincial_parity/20260513_1839_per_wsg_times.csv`


## Update 2026-05-24: superseded by tunnel-free compare (#175)

Chosen direction: **eliminate the tunnel from the compare** rather than keep it alive with `autossh`. bcfp's published output is loaded into the local DB tunnel-free by `snapshot_bcfp.sh --with-bcfp-views` (`fresh.streams_vw_bcfp`; build verified via `s3://fresh-bc/bcfishpass/log.json`). The compare (`lnk_compare_mapping_code`, #175) diffs link's persisted output against that local table — one local SQL join, no `:63333`. No tunnel → no drops → this failure mode disappears. Fix tracked under #175; close this when the compare is tunnel-free.

