feat: daily LinkedIn microservice + autocli CDP wiring + supporting fixes #2

Merged: RickSanchez88E merged 53 commits into main from feat/daily-microservice on May 16, 2026

@RickSanchez88E
Owner

Summary

Implements the auto-scheduled daily LinkedIn-recommended pipeline as a 5-container microservice on 100.108.80.9, publicly exposed via Cloudflare Tunnel + Access. See deploy/SPEC.md for design, deploy/PLAN.md for the walkthrough.

What's live (verified):

  • 5 healthy containers: autocli-chrome (Stagehand-style Chromium+VNC+CDP), autocli-daily (cron + FastAPI), cloudflared, prometheus, grafana
  • 3 public Cloudflare-Access-gated subdomains: autocli-vnc.pumped.ink, autocli-api.pumped.ink, autocli-grafana.pumped.ink
  • Service Token (machine auth) + Bearer (app-layer auth) verified end-to-end
  • Forced POST /api/run triggered a real LinkedIn scrape: 573 jobs upserted to Supabase (41 new + ~507 updated)
  • Phase 4a probes 9/9 green via public HTTPS

What's deferred (separate follow-ups):

  • Phase 4b/4c (autocli-cdp.pumped.ink with mTLS) — design in spec, not built
  • 30-day idle observation (Phase 6) — calendar-dependent

Architecture highlights

  • Hard Rust prereq (commits d7cd312/6f60b06/6539458): wires CdpPage into BrowserBridge::connect behind AUTOCLI_CDP_ENDPOINT so a containerised autocli drives a sibling Chrome container without the daemon+extension path.
  • rust-toolchain.toml pins workspace to 1.94 for local/CI/Phase 0 parity.
  • GHCR + Watchtower pull-based deploy with branch-safe slugified tags (feature branches get :branch-<slug> + :sha-*; :main only on main).
  • Cloudflare Tunnel in token mode; ingress + DNS + Access apps + Service Token all created via Cloudflare API (no dashboard clicks).
  • Subdomain naming flattened to one level (autocli-<sub>.zone) because Free zones get Universal SSL only on apex + one-level wildcard.

Notable bug fixes discovered during deploy

| Issue | Fix |
| --- | --- |
| Chrome DevTools DNS-rebinding protection rejects service-name Host headers | cdp-discover.sh resolves to IP; /api/health sends Host: localhost |
| CI built binary on ubuntu-latest (GLIBC 2.39) vs runtime Debian Bookworm (GLIBC 2.36) | CI now runs in rust:1.94-slim-bookworm container; readelf gate fails build if binary needs >2.36 |
| source /run/cdp-endpoint.env set shell var only, autocli child didn't see env | set -a / set +a around source |
| last_run.json empty because grep+tail on pretty-printed JSON yielded just { | Capture sync stdout to dedicated file, jq direct |
| Token-mode cloudflared running on Mac as a second replica caused intermittent 502s | Operator Ctrl+C'd Mac one; server container is sole replica |
| Stagehand 5900 host port already taken on prod host | Drop 5900:5900 mapping (noVNC 6080 + Cloudflare ingress are the real paths) |

Test plan

  • Rust unit: bridge::tests::test_connect_uses_cdp_endpoint_when_env_var_set passes
  • FastAPI: 9 pytest tests all green (auth + route shape)
  • Phase 0 local: ELF x86-64 verified inside daily image, autocli --version returns
  • Phase 2 CI: green on every commit, correct tag policy
  • Phase 3: 5 containers up + healthy on 100.108.80.9
  • Phase 4a: 9/9 public probes match expected codes (302/200/401)
  • Phase 5: real LinkedIn scrape, 573 jobs upserted (verified via Supabase MCP)
  • Phase 6: 2 consecutive scheduled 03:00 BST runs — calendar-dependent, post-merge

Notes for reviewer

  • Branch carries 40+ commits because local main had unpushed priority-scoring work that got swept along by the rebase. The core daily-microservice changes are everything under deploy/, .github/workflows/deploy-microservice.yml, rust-toolchain.toml, and the small bridge.rs patch.
  • For staging, server compose was sed'd to :branch-feat-daily-microservice; the post-merge bump to :main is a one-liner sed (documented in deploy/README.md).

RickSanchez88E_a8cc and others added 30 commits May 9, 2026 19:45
Create scripts/job_priority_config.py with all configuration constants,
regex patterns, and keyword sets for the deterministic job priority scoring
system. Contains no scoring logic -- only configuration to be imported by
the scorer, sync pipeline, backfill scripts, and tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure, deterministic scoring engine for AutoCLI jobs with 8 components:
compensation, role fit, seniority, work arrangement, application path,
freshness, data completeness, and source quality. Includes penalty system,
hard-reject guard, and tier mapping (high/medium/low/reject).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
REPEATED_PUNCT_RE used {2,} which matches 3+ total consecutive punctuation
chars (e.g. "!!!" -> "!"). Changed to {1,} so 2+ consecutive chars are
collapsed (e.g. "!!" -> "!", "!!!" -> "!").

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
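
The quantifier change described in this commit can be illustrated with a small sketch. The actual pattern in scripts/job_priority_config.py is not shown in this PR, so the character class and the backreference form below are assumptions; only the `{2,}` vs `{1,}` behaviour is taken from the commit message.

```python
import re

# Hypothetical patterns: one punctuation char followed by N+ repeats of itself.
# The character class is an assumption; the real one lives in job_priority_config.py.
OLD_REPEATED_PUNCT_RE = re.compile(r"([!?.,])\1{2,}")  # needs 3+ chars in total
REPEATED_PUNCT_RE = re.compile(r"([!?.,])\1{1,}")      # needs 2+ chars in total


def collapse_punct(text: str, pattern: re.Pattern = REPEATED_PUNCT_RE) -> str:
    """Collapse each run of identical punctuation down to a single char."""
    return pattern.sub(r"\1", text)
```

With the old quantifier, `collapse_punct("Great!!", OLD_REPEATED_PUNCT_RE)` leaves the doubled mark untouched, which is the behaviour the commit corrects.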
- Import score_job in sync_autocli_jobs.py and call it per-record
- Pass ScoreResult fields (priority_score, priority_tier, priority_version,
  priority_signals) to upsert_job RPC
- Add --disable-scoring flag for testing
- Report priority score distribution in dry-run mode
- Add comprehensive test suite (104 tests across 14 classes) covering all
  8 scoring components, penalties, hard-reject guard, edge cases, and
  integration scenarios

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Migration 20260509182000: add priority scoring columns to jobs.jobs table
  (priority_score, priority_tier, priority_version, priority_signals,
  priority_scored_at)
- Migration 20260509184000: add update_job_priority_score RPC that only
  touches scoring fields (not the full row), with schema-scoped and public
  wrappers
- scripts/backfill_priority_scores.py: batch backfill script with --force,
  --limit, --dry-run, --env-file options; reconstructs job_data from
  raw_record or DB columns; reports per-row scores, tiers, and errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Codex <noreply@openai.com>
- Rename priority_version column to priority_scorer_version in both migrations
- Add 'unknown' to priority_tier check constraint
- Fix indices to include last_seen_at desc and priority_score desc per spec
- Add --min-priority-score and --priority-tier CLI flags for optional filtering
- Enhance dry-run with top_priority_jobs, low_priority_count, priority_tiers
- Add source-quality summary (recruiter/aggregator/raw-jd-fallback counts)
- Update backfill RPC param name to match column rename
Design covers:
- 6-container stack: chrome (Stagehand), daily (cron+FastAPI), cloudflared,
  prometheus, grafana
- Cloudflare Tunnel + Access for public exposure of /vnc /cdp /api /jobs /grafana
- GHCR + Watchtower pull-based deploy
- Phased acceptance criteria with verification commands

Worktree: feat/daily-microservice (branched from main).
Critical fixes:
- Add prereq section for autocli BrowserBridge CDP-wiring patch
- Fix cargo build to use package name 'autocli' (was '-cli')
- Switch /jobs to client.schema('jobs').table('jobs') API
- Use /json/list + page target (was /json/version, browser-level)
- Rewrite ws host localhost->autocli-chrome:9222
- Standardize on SUPABASE_SERVICE_ROLE_KEY
- Make API_RUN_TOKEN actually enforced + tested in Phase 4
- Add machine-verifiable Cloudflare Access gate before cdp ingress

High-severity fixes:
- Feature branches publish :branch-*+:sha-* only; :main from main
- Pin cloudflared/prometheus/grafana to specific semver
- Switch Cloudflare Tunnel to --token mode (no config.yml mix)
- Replace path routes with 5 subdomains (avoids prefix-strip)
- Split Access into two policies per Application (Token OR Email)
- Drop Grafana Infinity plugin dependency
- VNC password generated random in prod (no 'stagehand' default)
- shred only temp copy of operator secrets, never the source
- Unify retry to 3-attempts/15-60-240s across code+runbook+metrics
- Add explicit restart: unless-stopped to autocli-daily
- Specify Prometheus metrics_path: /api/metrics
- Unified CI build context = repo root for both Dockerfiles
- Note GHCR creds already configured on target host
Bugs:
- L103: component table referenced stale /metrics path -> /api/metrics
- L209/L236: github.ref_name with '/' produces invalid Docker tags;
  switch to docker/metadata-action's type=ref,event=branch which slugifies
- L321: /json/new requires PUT, not POST (Chrome >= M86)
- L354: jobs.autocli/ routed to backend root but /jobs is the actual route;
  drop the jobs subdomain entirely, serve via api.autocli/jobs (4 subdomains)
- L473: Phase 0 build context disagreed with CI; unify on repo root
- L522: Phase 4 step 2 implied Service Token works on vnc/grafana
  where no machine policy exists; split per-subdomain expectations
- L526: Phase 4 probed cdp.autocli before the spec said cdp ingress was
  added; split Phase 4 into 4a (pre-CDP gate) / 4b (add cdp ingress) /
  4c (cdp probes)
- L549: Phase 5 status call missing Bearer

Risks:
- L486: Phase 1 status call missing Bearer; added

Also:
- Fix '6 services' / '6 new containers' counts; actual count is 5
- Update §2.2 boundaries note from /json/version to /json/list + PUT /json/new
Bugs:
- L107: discovery wasn't actually re-run per /api/run; updated process
  tree so run-daily.sh calls cdp-discover.sh before each run, then
  sources /run/cdp-endpoint.env. §5.2 'Boot ordering' split into
  'Discovery cadence' (boot + per-run) and 'Boot ordering'.
- L292: process tree now spells out 'PUT /json/new?about:blank' for
  the empty-list case, matching §5.2.
- L429: API_RUN_TOKEN row now lists every Bearer-protected route
  (/api/status, /api/run, /api/logs, /jobs) plus the open ones
  (/api/health, /api/metrics).
- L496: Phase 2 acceptance split into feature-branch expectation
  (:branch-feat-daily-microservice + :sha-*) vs post-merge expectation
  (:main + :sha-*). Phase 3 explicitly reads :main.

Risks / nits:
- L129: file layout comment changed to match real routes
- L358: 'five hostnames' -> 'four hostnames' after dropping jobs subdomain
- L628: §9 risk #1 rewritten to reference Phase 4a/4b/4c and three
  pre-CDP subdomains, not the old Phase 4 step 1+2 / four subdomains.
Bugs:
- L211: slugifier comment was inside 'tags: |' literal block where
  metadata-action would parse it as a rule. Moved above 'tags:' as a
  proper YAML comment.
- L478: Phase 0 'cargo build' on operator's arm64-darwin host then
  COPY into linux/amd64 image would inject a Mach-O. Replaced with a
  docker run rust:1.81-slim-bookworm --platform linux/amd64 builder
  step + 'file' verification ('ELF 64-bit'). CI is unchanged because
  ubuntu-latest is already linux/amd64.

Risks:
- L312: /api/metrics annotation 'only reachable via docker network'
  was misleading — api.autocli does expose it externally via
  Cloudflare Access. Re-annotated both /api/metrics and /api/health
  as dual-path: internal direct, external via Access.
- L368: Tailscale-CGNAT IP gate on cdp.autocli would never match —
  Cloudflare sees public/WARP egress IP, not 100.x. Replaced with
  dedicated short-lived Service Token + mTLS client cert (machines),
  email OTP + required WARP posture (humans). §9 risk #1 and Phase
  4b/4c reworded to match.

Nit:
- L551: Phase 4a 'Authenticated' header was wrong for vnc/grafana
  (no auth sent). Renamed to 'humans-only negative machine probe'
  and made it actually send a Service Token to prove it gets denied.

Phase 4c also updated to use mTLS-style probes (dedicated CF_ID_CDP)
plus a websocket-upgrade smoke test for the CDP surface.
Bugs (all on Phase 4c WebSocket probe):
- curl -sI is HEAD; WebSocket Upgrade requires GET. Replaced with
  proper websocat client (preferred) and curl --http1.1 -i -N
  fallback. No more -I anywhere on the WS path.
- /devtools/page/<id> placeholder cannot run as written. Probe now
  extracts the actual page id by GETing /json/list, picking the
  first type:'page' target, and rewriting host to cdp.autocli.
- 'HTTP/2 101 Switching Protocols' does not exist — 101 is HTTP/1.1
  semantics, and Cloudflare does not speak RFC 8441 multiplexed WS.
  Probe now forces --http1.1 and expects 'HTTP/1.1 101 Switching
  Protocols'. The websocat path checks for a CDP round-trip
  (Target.getTargets) instead.

Phase 4c renumbered 4c-1..4c-4 to make the four checks explicit.
Was 1.81 — arbitrary, drifting from operator's rustc 1.94.1 and from
ubuntu-latest's default stable. Pinned to 1.94-slim-bookworm so
Phase 0 / CI / dev agree. Spec also notes the long-term hardening
(repo-tracked rust-toolchain.toml) — that task is included in the
implementation plan.
34 bite-sized tasks across 8 phases (A-H), each with TDD substeps,
exact file paths, exact commands, expected outputs. Covers:

Phase A: rust-toolchain.toml + BrowserBridge CDP patch + smoke test
Phase B: deploy/ scaffold (chrome, daily, prometheus, grafana, compose)
Phase C: GitHub Actions workflow
Phase D: Phase 0 image build (Docker rust 1.94) + Phase 1 local e2e
Phase E: GHCR push (Phase 2)
Phase F: 100.108.80.9 server bring-up (Phase 3)
Phase G: Cloudflare Tunnel + Access (Phase 4a/4b/4c)
Phase H: forced run + monitoring (Phase 5) + schedule rollover (Phase 6)

Plan is self-contained — no TBDs or 'similar to Task N' placeholders.
Self-review section maps every SPEC section to its implementing task(s).
Aligns local dev (operator was on rustc 1.94.1), CI (was using
ubuntu-latest default), and the Phase 0 Docker builder
(deploy/SPEC.md). Single source of truth; future bumps touch only
this file.
Add AUTOCLI_CDP_ENDPOINT env-var branch at the top of
BrowserBridge::connect. When set, skip daemon spawn + extension
polling and return Arc<CdpPage> directly. The IPage trait contract
is unchanged so pipeline executors and YAML adapters consume either
implementation transparently.

Required prerequisite for the autocli-daily microservice
(deploy/SPEC.md §1.A) which runs autocli in a container with no
Chrome extension or daemon, connecting to a sibling Chrome container
via CDP.
Two robustness improvements from code review:
- RAII Drop guard ensures AUTOCLI_CDP_ENDPOINT is cleared even if
  the test panics mid-way, preventing cross-test env leakage.
- Assert on CliError::BrowserConnect variant directly instead of
  string-matching the Display output. Resilient to future
  error-message wording changes.
Copy of my-stagehand-app/Dockerfile.chrome with the COPY path
rewritten for repo-root build context (deploy/SPEC.md §4.1).
Verbatim from my-stagehand-app/scripts/entrypoint-vnc.sh:
Xvfb -> x11vnc -> noVNC -> socat 9222->9223 -> Chromium with
--remote-debugging-port=9223 --user-data-dir=/root/.config/chromium.
Extension loading via /opt/extensions/*/manifest.json is preserved
even though this design ships with no extensions.
Multi-arch-aware single-stage image:
- python:3.12-slim-bookworm base
- tini (PID 1), util-linux (flock), jq (CDP discovery), curl (probes)
- supercronic (container cron) pinned to v0.2.30 with sha1 verify
- uv (Astral) for Python deps
- Pre-built autocli binary copied from deploy/daily/bin/
- FastAPI app + scripts/sync_autocli_jobs.py
- Boot via tini -> entrypoint.sh
- TZ=Europe/London, CRON_SCHEDULE default 03:00.

DONE_WITH_CONCERNS: scripts/job_priority_scorer.py and
scripts/job_priority_config.py are absent from the worktree;
their COPY lines have been omitted from the Dockerfile.
Find or create a CDP page target on autocli-chrome:9222.
- GET /json/list, pick first type:page
- if list is empty, PUT /json/new?about:blank (Chrome >= M86)
- rewrite host (localhost:9223 -> autocli-chrome:9222) so the WS URL
  is reachable from the daily container's network namespace
- write to /run/cdp-endpoint.env (sourced by run-daily.sh)
- 60s retry budget; exit 1 on timeout (entrypoint exits non-zero,
  restart: unless-stopped recreates container until chrome ready).
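
The target-selection and host-rewrite steps above can be sketched in Python. This is a minimal illustration, not the contents of cdp-discover.sh (a shell script using jq); the function name `pick_page_ws_url` and the sample data are hypothetical. It assumes the /json/list response has already been parsed into a list of dicts carrying Chrome's `type` and `webSocketDebuggerUrl` fields.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def pick_page_ws_url(targets: list[dict], reachable_netloc: str) -> Optional[str]:
    """Return the WS URL of the first page target, rewritten to a reachable host.

    Chrome advertises the address it listens on (e.g. localhost:9223), which is
    not reachable from a sibling container, so only the netloc is swapped.
    """
    for target in targets:
        if target.get("type") == "page" and target.get("webSocketDebuggerUrl"):
            parts = urlsplit(target["webSocketDebuggerUrl"])
            return urlunsplit(parts._replace(netloc=reachable_netloc))
    return None  # caller would fall back to PUT /json/new?about:blank
```

A `None` result corresponds to the empty-list case, where the script creates a fresh page target before retrying.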
- flock -n to prevent cron + /api/run from colliding
- per-attempt cdp-discover refresh (page id may have rotated)
- runs autocli linkedin recommended -> JSON -> sync_autocli_jobs.py
- unified retry: 3 attempts at 15s/60s/240s (SPEC §5.2)
- writes /data/output/last_run.json consumed by /api/status.
Boot-time cdp-discover gate, then runs supercronic + uvicorn in
parallel under tini. wait -n exits as soon as either child dies, so
compose's restart policy can pick up failure modes (e.g. uvicorn
panic, supercronic crash).
03:00 daily LinkedIn pull + 04:00 30-day output retention sweep
(SPEC §5.2). TZ resolved by the container's TZ=Europe/London.
After rebase onto local main, scripts/job_priority_scorer.py and
scripts/job_priority_config.py are present. sync_autocli_jobs.py
imports them at runtime, so the daily image must ship all three.
uv-managed; pins fastapi/uvicorn/supabase/prometheus-client/httpx
to compatible ranges. Lockfile checked in so the Dockerfile's
'uv sync --frozen' is reproducible.
RickSanchez88E_a8cc added 20 commits May 16, 2026 02:17
Used by POST /api/run to spawn run-daily.sh non-blockingly.
is_running() is a non-destructive flock probe so /api/status can
report in_progress without affecting the actual run.
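
A flock probe of this kind might look like the sketch below. It is not the code from the trigger module; the signature is an assumption. Note that the probe briefly takes the lock when nobody holds it and releases it immediately, so it never blocks and never keeps the lock, but a run-daily.sh started in that instant could in principle see the lock as busy.

```python
import fcntl
import os


def is_running(lock_path: str) -> bool:
    """Report whether another open file description holds an exclusive flock."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking attempt: fails at once if a run holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        # Nobody held it: give it back straight away.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

This relies on flock(2) semantics on Linux, where locks belong to the open file description, so a second `open()` of the same path conflicts with the holder even within one process.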
Routes per SPEC §5.1:
  GET  /api/health   [open]    chrome reachability + cdp file probe
  GET  /api/metrics  [open]    Prometheus exposition (delta-aware counters)
  GET  /api/status   [Bearer]  last_run.json + in_progress
  POST /api/run      [Bearer]  spawn run-daily.sh, 409 if already running
  GET  /api/logs     [Bearer]  tail of latest log (default 200 lines)
  GET  /jobs         [Bearer]  Supabase 'jobs.jobs' read proxy via
                               client.schema('jobs').table('jobs').

Import style B: 'import trigger' (flat), because entrypoint.sh does
'cd /app/api && uvicorn main:app' — no package context, flat import works.
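
The open/Bearer split in the route table can be sketched without FastAPI. This is a stdlib-only illustration of the auth decision, not the code in main.py; the function name `auth_status` and the 404 branch for unknown paths are assumptions.

```python
import hmac
from typing import Optional

# Route sets copied from the table above.
OPEN_ROUTES = {"/api/health", "/api/metrics"}
BEARER_ROUTES = {"/api/status", "/api/run", "/api/logs", "/jobs"}


def auth_status(path: str, authorization: Optional[str], expected_token: str) -> int:
    """Return the HTTP status the auth layer alone would produce for a request."""
    if path in OPEN_ROUTES:
        return 200
    if path not in BEARER_ROUTES:
        return 404
    scheme, _, token = (authorization or "").partition(" ")
    # Constant-time comparison avoids leaking the token through timing.
    token_ok = hmac.compare_digest(token.encode(), expected_token.encode())
    return 200 if scheme.lower() == "bearer" and token_ok else 401
```

Missing and wrong tokens both map to 401, matching the behaviour the pytest suite checks.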
9 tests covering:
- /api/status, /api/run, /api/logs, /jobs all return 401 without Bearer
  and 401 with wrong Bearer
- /api/status default-shape + reflects last_run.json
- /api/metrics is open and contains the autocli_daily_ family
- /api/health returns 503 when chrome:9222 unreachable.

conftest.py adds deploy/daily/api to sys.path (flat import, matching
entrypoint.sh's 'cd /app/api && uvicorn main:app' invocation).
Prometheus registry is cleared before each fresh module import to
avoid duplicate-timeseries errors across test fixtures.
Single job scraping autocli-daily:8080/api/metrics every 15s.
metrics_path is required because FastAPI mounts under /api/*.
- Datasource: Prometheus at prometheus:9090 (uid prom-autocli)
- Dashboard provider points at /etc/grafana/provisioning/dashboards
- autocli.json: time-since-last-run, last exit code, rows-upserted-today,
  CDP-up %, daily scraped/upserted/skipped time series, duration
- No plugin dependencies (Infinity dropped per L313 review).
5 services on shared autocli-net bridge:
- autocli-chrome (Stagehand, watchtower-tracked, healthcheck on 9222)
- autocli-daily (cron+FastAPI, watchtower-tracked, depends_on chrome
  healthy, env scoped to Supabase creds only)
- cloudflared (Tunnel token mode, depends_on daily healthy)
- prometheus (pinned, 90-day retention)
- grafana (pinned, anon disabled, signup disabled, admin from env)
Named volumes for profile / output / tsdb / grafana state.
Binds host ports under non-conflicting numbers (6081/5902/9223/8081/
9091/3001) so the operator can keep their existing local Chrome and
Grafana running alongside. cloudflared moved to a 'disabled' profile.
All required environment variables with empty values + inline
generator hints. Real .env never committed (.gitignore already
covers it under '.env').
Quickstart, Cloudflare dashboard checklist, forced-run snippet,
common-failure table. Points back at SPEC + PLAN for the why.
3 jobs:
1. build-autocli-binary: cargo build --release -p autocli on
   ubuntu-latest (linux/amd64) with Swatinem cache; uploads artifact
2. build-chrome-image: builds deploy/chrome from repo-root context;
   docker/metadata-action generates :main on main, :branch-<slug> on
   feature branches, :sha-<short> always
3. build-daily-image: downloads the autocli artifact, builds
   deploy/daily from repo-root context, same tag policy

Path filters include rust-toolchain.toml so a toolchain bump triggers
a rebuild.
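
The tag policy can be approximated in a few lines. This sketch only mimics what docker/metadata-action is configured to produce; the sanitisation regex and the 7-character short SHA are assumptions, and the function name `image_tags` is hypothetical.

```python
import re


def image_tags(ref_name: str, sha: str) -> list[str]:
    """Compute image tags: :sha-<short> always, :main on main, :branch-<slug> otherwise."""
    tags = [f"sha-{sha[:7]}"]
    if ref_name == "main":
        tags.append("main")
    else:
        # Docker tags allow only [A-Za-z0-9_.-], so '/' in a branch name must go.
        slug = re.sub(r"[^A-Za-z0-9_.-]+", "-", ref_name).strip("-.")
        tags.append(f"branch-{slug}"[:128])  # tag length is capped at 128 chars
    return tags
```

For this PR's branch, `image_tags("feat/daily-microservice", sha)` yields the `branch-feat-daily-microservice` tag that the server compose file was pointed at during staging.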
The placeholder value was wrong (build failed with 'computed checksum
did NOT match'). Verified by downloading the GitHub release asset
and computing sha1sum from the operator's laptop.
CI builds the binary as a separate job and uploads as artifact;
Phase 0 locally rebuilds inside a Docker rust container and writes
to deploy/daily/bin/. Never commit this file (it's ~8MB).
rick-ubuntu-ssh tunnel's running replica is 2026.3.0 (per Zero Trust
dashboard). Our container joins as a 2nd HA replica; matching the
connector version avoids mixed-version edge cases.
Prod host (100.108.80.9) already has a process bound to :5900, so the
5900:5900 mapping failed container networking. Native VNC is only a
local convenience and is NOT part of the Cloudflare ingress; noVNC on
6080 (+ vnc.autocli route) is the real access path. Container still
listens on 5900 internally for websockify -> noVNC.
Chrome DevTools rejects /json* and /devtools Host headers that aren't
an IP or localhost. Reaching autocli-chrome by docker service name
failed with 'Host header is specified and is not an IP address or
localhost'.

- cdp-discover.sh: resolve CHROME_HOST -> container IP (getent, python
  fallback); use the IP for the /json probe AND the rewritten ws://
  URL so every Host header Chrome sees is an IP. Re-resolved each run.
- main.py /api/health: send Host: localhost on the liveness probe
  (yes/no check, body unused).

Found during Phase 3 server bring-up; daily container was crash-looping
on 'chrome unreachable after 60s' despite DNS + same-network OK.
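
The Host-header rule that caused the crash loop can be modelled roughly as below. This is an approximation inferred from the quoted error message, not Chrome's actual implementation; `devtools_accepts_host` is a hypothetical name.

```python
import ipaddress


def devtools_accepts_host(host_header: str) -> bool:
    """Approximate the DevTools check: Host must be an IP literal or localhost."""
    host = host_header.strip()
    if host.startswith("[") and "]" in host:
        host = host[1:host.index("]")]   # bracketed IPv6 literal, port dropped
    elif host.count(":") == 1:
        host = host.rsplit(":", 1)[0]    # hostname-or-IPv4 with a port
    if host.lower() == "localhost":
        return True
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False
```

Under this model a docker service name such as `autocli-chrome:9222` is rejected, while the resolved container IP or `localhost` passes, which is why the fix resolves the name before probing.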
Free Cloudflare zones get Universal SSL covering only <zone> + one-level
*.<zone>. Two-level subdomains like vnc.autocli.<zone> handshake-fail
('Unauthorized' / sslv3 alert) until the operator upgrades to Pro,
Total TLS, or ACM.

Rename across SPEC / PLAN / README:
  vnc.autocli.<zone>     -> autocli-vnc.<zone>
  cdp.autocli.<zone>     -> autocli-cdp.<zone>
  api.autocli.<zone>     -> autocli-api.<zone>
  grafana.autocli.<zone> -> autocli-grafana.<zone>

§9 risk #4 now documents the Free-plan SSL constraint as the reason for
the flat naming.
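
The coverage rule behind the rename can be written as a small check. It encodes only what this commit states (apex plus one-level wildcard); the function name is hypothetical.

```python
def covered_by_universal_ssl(hostname: str, zone: str) -> bool:
    """True if hostname is the zone apex or exactly one label below it."""
    hostname = hostname.lower().rstrip(".")
    zone = zone.lower().rstrip(".")
    if hostname == zone:
        return True
    suffix = "." + zone
    if not hostname.endswith(suffix):
        return False
    prefix = hostname[: -len(suffix)]
    # A one-level wildcard (*.zone) matches a single label, so no dots allowed.
    return prefix != "" and "." not in prefix
```

`autocli-vnc.pumped.ink` passes this check, whereas the earlier `vnc.autocli.pumped.ink` shape does not.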
Host ubuntu-latest gives GLIBC 2.39 binaries that fail to load in the
daily runtime image (Debian Bookworm = GLIBC 2.36) with
'GLIBC_2.39 not found'. Pin build container to rust:1.94-slim-bookworm
so binary GLIBC requirements match runtime.

Also adds a readelf-based check that fails the build if the binary's
max GLIBC requirement exceeds 2.36.
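
The gate logic can be sketched as a parser over readelf's version-info text. The exact CI command and its output are not shown in this PR, so the sample format in the usage note is illustrative and the function names are hypothetical.

```python
import re


def max_glibc(readelf_output: str) -> tuple[int, ...]:
    """Return the highest GLIBC_x.y symbol version mentioned in the text."""
    versions = [
        tuple(int(part) for part in match.split("."))
        for match in re.findall(r"GLIBC_(\d+(?:\.\d+)+)", readelf_output)
    ]
    return max(versions, default=(0,))


def glibc_gate(readelf_output: str, limit: tuple[int, ...] = (2, 36)) -> bool:
    """True when the binary can load on a runtime whose glibc is `limit`."""
    return max_glibc(readelf_output) <= limit
```

Versions are compared as integer tuples rather than strings, so 2.4 correctly sorts below 2.36. Feeding it text containing `GLIBC_2.39` makes the gate fail, which is the condition that blocked the ubuntu-latest build.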
`source /run/cdp-endpoint.env` only sets a shell variable; without
export, the autocli child process never sees AUTOCLI_CDP_ENDPOINT and
falls through to BrowserBridge's daemon path
("Chrome is not running"). Wrap source with `set -a`/`set +a` so the
assignment auto-exports as an env var that survives across fork/exec.
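
The difference is easy to reproduce from Python by driving a POSIX shell. This is a demonstration harness, not part of run-daily.sh; `child_sees_endpoint` is a hypothetical helper and it assumes `sh` is on PATH.

```python
import os
import subprocess


def child_sees_endpoint(env_file: str, auto_export: bool) -> str:
    """Source env_file in sh, then report what a child process sees."""
    source = f". '{env_file}'"
    if auto_export:
        # set -a marks every subsequent assignment for export.
        source = f"set -a; {source}; set +a"
    # The inner sh stands in for the autocli child process.
    script = source + "; sh -c 'printf %s \"$AUTOCLI_CDP_ENDPOINT\"'"
    env = {k: v for k, v in os.environ.items() if k != "AUTOCLI_CDP_ENDPOINT"}
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True, env=env
    )
    return result.stdout
```

With `auto_export=False` the child prints an empty string even though the parent shell has the variable, which is the symptom that sent autocli down the daemon path.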
sync_autocli_jobs.py pretty-prints its summary with indent=2:
  {
    "input_rows": 573,
    "upserted": 573,
    ...
  }

The old run-daily.sh did 'grep "^{" log | tail -1' which matched only
the opening '{' line, yielding invalid JSON. Subsequent jq parses
failed silently, --argjson got empty values, the final jq -n -> dev/null
overwrote LAST_RUN_JSON with an empty file.

Fix: redirect sync stdout to /tmp/sync-DATE-N.json, also append to
log, then jq parses the captured JSON directly. Status now correctly
reflects rows_scraped/upserted/skipped from each run.
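
The failure mode is straightforward to reproduce. The sketch below mirrors the old grep+tail extraction in Python and contrasts it with parsing the captured output whole; the helper names and the sample log line are hypothetical.

```python
import json


def last_brace_line(log_text: str) -> str:
    """Mimic `grep "^{" log | tail -1`: last line that starts with '{'."""
    lines = [line for line in log_text.splitlines() if line.startswith("{")]
    return lines[-1] if lines else ""


def parse_captured_summary(captured_stdout: str) -> dict:
    """The fix: parse the whole captured sync output as one JSON document."""
    return json.loads(captured_stdout)


# Pretty-printed summary, as sync_autocli_jobs.py emits it with indent=2.
summary = json.dumps({"input_rows": 573, "upserted": 573}, indent=2)
log_text = "sync starting\n" + summary + "\n"
```

Here `last_brace_line(log_text)` returns just `{`, which is not valid JSON, while `parse_captured_summary(summary)` recovers the full counts.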
When run-daily.sh did 'exec 9>LOCK; flock 9' and then invoked autocli,
bash's FD 9 was inherited by the autocli process by default. If autocli
took the daemon-path fallback (pre-env-export fix; or any future code
path that spawns a daemon), the detached 'autocli --daemon' child
inherited FD 9 too and held the lock for its lifetime. is_running()
then returned True forever, breaking /api/status.

Add '9>&-' to autocli and uv invocations so children can't see or hold
the lock. Verified by /proc/<pid>/fd inspection in production.
@RickSanchez88E
Owner Author

crates/autocli-browser/src/cdp.rs:L302: 🔴 bug: CDP close sends Browser.close; post-run page.close kills shared Chrome. Use Target.closeTarget or make CDP close no-op.

deploy/chrome/entrypoint-vnc.sh:L38: 🔴 bug: -nopw disables VNC auth; noVNC exposes logged-in browser without password. Drop -nopw; rely on -rfbauth.

deploy/docker-compose.yml:L20: 🔴 bug: 9222:9222 publishes unauthenticated CDP outside Cloudflare Access. Bind 127.0.0.1 or remove host port.

supabase/migrations/20260509182000_add_priority_scoring_columns.sql:L140: 🔴 bug: unscored upsert overwrites old priority with 0 because insert coerces null to default. Branch on p_priority_score, not excluded.priority_score.

scripts/backfill_priority_scores.py:L136: 🔴 bug: client.table("jobs.jobs") queries a public table literally named jobs.jobs. Use client.schema("jobs").table("jobs") or public.jobs_jobs.

scripts/backfill_priority_scores.py:L147: 🔴 bug: backfill skips migrated rows; priority_score is NOT NULL default 0 and version defaults current. Filter priority_scored_at.is.null instead.

deploy/daily/api/main.py:L164: 🔴 bug: /jobs uses anon client against schema("jobs"), but migrations expose public.jobs_jobs view. Query public.jobs_jobs or expose jobs schema to PostgREST.

deploy/daily/crontab:L3: 🟡 risk: CRON_SCHEDULE env ignored; compose override never changes schedule. Generate crontab from env at entrypoint or remove env.

deploy/daily/crontab:L6: 🟡 risk: OUTPUT_RETENTION_DAYS ignored; retention is hardcoded to 30. Generate crontab from env or remove the env knob.

scripts/sync_autocli_jobs.py:L468: 🟡 risk: scoring exceptions are swallowed and pipeline exits success with scored:0. Log row error and fail when scoring is enabled.

scripts/sync_autocli_jobs.py:L555: 🟡 risk: dry-run reads application_path, but scorer writes application_friction; aggregator summary stays zero. Use application_friction.

cdp.rs (item 1): IPage::close was sending Browser.close, which kills the
SHARED Chrome in CDP-direct mode (and every other consumer attached to
it). Made it a no-op with explanation. Callers that need per-page
cleanup should send Target.closeTarget directly.

entrypoint-vnc.sh (item 2): -nopw was overriding -rfbauth and leaving
VNC open with no password. Anyone reaching :5900/6080 (via Tailscale
or any leaked path) could drive the logged-in browser. Removed the
flag; password auth from /root/.vnc/passwd is now enforced.

docker-compose.yml (item 3 + defense-in-depth on 6080): bound both
6080 and 9222 host ports to 127.0.0.1 only. Public path is Cloudflare
Tunnel + Access; direct host-port access would bypass every auth layer.
Backup: 'ssh -L 6080:localhost:6080' from a Tailscale-connected box.

backfill_priority_scores.py (items 5 + 6): client.table('jobs.jobs')
queried a literal 'jobs.jobs' name in public schema (always 0 rows);
fixed to client.schema('jobs').table('jobs'). Filter also moved from
priority_score.is.null (already NOT NULL DEFAULT 0 post-migration, so
matches nothing) to priority_scored_at.is.null (the only honest 'never
scored' signal).

crontab + Dockerfile + .env.example (items 8 + 9): CRON_SCHEDULE and
OUTPUT_RETENTION_DAYS env vars were placebos — supercronic reads
/etc/cron.d/autocli verbatim and does not env-substitute. Dropped the
misleading env knobs from compose / Dockerfile / .env.example and added
a comment in crontab explaining the contract.

NOT addressed in this commit:
- Item 4 (migration upsert priority overwrite) — needs a follow-up
  migration; pre-existing in main.
- Item 7 (/jobs schema) — empirically returns 500 rows with a loose
  filter; PostgREST DOES expose the jobs schema in this project. The
  reviewer's hypothesis was incorrect for this Supabase config. Pushing
  back on this one with evidence.
- Items 10, 11 — pre-existing sync_autocli_jobs.py issues from main;
  worth a separate cleanup PR.
@RickSanchez88E
Owner Author

Addressed in af1ada7. Per-item:

| # | Item | Action |
| --- | --- | --- |
| 1 | cdp.rs Browser.close | ✅ Fixed: made IPage::close a no-op in CDP-direct mode; killed the "kill-the-shared-Chrome" foot-gun. Per-page cleanup should use Target.closeTarget. |
| 2 | -nopw | ✅ Removed. -rfbauth now actually enforced. |
| 3 | 9222:9222 host port | ✅ Fixed: both 9222 AND 6080 bound to 127.0.0.1 only. Public path is Cloudflare Tunnel + Access; direct host-port was bypassing every auth layer. (Backup: ssh -L 6080:localhost:6080.) |
| 4 | migration upsert | ⏭️ Out of this PR's scope: needs a follow-up migration to rewrite upsert_job branching on p_priority_score IS NOT NULL (the RPC param) instead of excluded.priority_score IS NOT NULL (which is always non-null because of NOT NULL DEFAULT 0). The buggy migration is already applied in prod, so the fix needs a new migration revision. Filed as a follow-up. |
| 5 | backfill client.table("jobs.jobs") | ✅ Fixed: client.schema("jobs").table("jobs"). (My earlier fix on the indeed branch got reverted by the rebase to origin/main.) |
| 6 | backfill filter | ✅ Fixed: priority_scored_at.is.null instead of priority_score.is.null (the latter matches nothing post-migration). |
| 7 | /jobs schema | Pushing back: empirically the endpoint returns rows. Test from the public path: curl ".../jobs?since=1970-01-01" → {"count":500, ...} with real LinkedIn job rows. PostgREST in this Supabase project IS exposing the jobs schema (visible in the Supabase API → Exposed schemas setting). The 0-rows I observed earlier was the post_time >= today filter, not a schema-exposure issue; LinkedIn post_time is the original posting date (often days old), not created_at. If we want "added since" semantics, that's a separate UX fix (switch to created_at). Happy to make that change if you confirm the semantics. |
| 8 | CRON_SCHEDULE env ignored | ✅ Fixed: dropped the placebo env from compose / Dockerfile / .env.example. supercronic reads the crontab verbatim with no env substitution. Schedule lives in deploy/daily/crontab; comment now explains. |
| 9 | OUTPUT_RETENTION_DAYS env ignored | ✅ Same fix as #8. |
| 10 | sync_autocli_jobs.py L468 scoring swallow | ⏭️ Pre-existing in main (rebase-carried). Real bug worth fixing in a focused PR on scripts/sync_autocli_jobs.py. |
| 11 | sync_autocli_jobs.py L555 application_path/friction | ⏭️ Same as #10: pre-existing, dry-run-only impact. |

Additional critical finding surfaced by Supabase MCP during verification:

  • jobs.jobs has RLS DISABLED. Combined with us mapping SUPABASE_KEY (service-role) to BOTH SUPABASE_SERVICE_ROLE_KEY and SUPABASE_ANON_KEY in the daily container's env, the /jobs route effectively runs with service-role privileges — anyone who passes Cloudflare Access + Bearer can read the entire jobs.jobs table.
  • Recommended follow-up: (a) create a real anon key, (b) enable RLS with a read-only policy for anon, (c) /jobs continues using anon → defense-in-depth restored. Service-role stays scoped to the sync write path.
  • Reproducing SQL (don't auto-apply): ALTER TABLE jobs.jobs ENABLE ROW LEVEL SECURITY; plus appropriate CREATE POLICY ... statements.

CI on af1ada7 is rebuilding now; will pull + re-verify on server once green and report back.

@RickSanchez88E
Owner Author

Server re-verified against af1ada7:

  • 5 containers healthy after recreate
  • 9222 + 6080 host bindings now 127.0.0.1 only — external nc 100.108.80.9 9222 / 6080 both refused
  • Public path https://autocli-vnc.pumped.ink still 302→Access ✅
  • /api/health (open at the app layer per SPEC §5.1) returns {"chrome": true, "cdp_endpoint_file": true}
  • /api/status still reflects the last 572-row successful run

The -nopw removal means the VNC web client now requires the VNC password (in server .env as VNC_PASSWORD) after Cloudflare Access — defense in depth restored.

RickSanchez88E_a8cc added 2 commits May 16, 2026 13:04
Items 1, 2, 3 from PR review #4466756456:

1) New migration 20260516120000_fix_priority_upsert_data_loss.sql:
   recreates jobs.upsert_job so the ON CONFLICT DO UPDATE branches
   on the function PARAMETER (p_priority_score IS NOT NULL) instead
   of excluded.priority_score (which the INSERT body had already
   coerced from NULL to 0, making the case-when always true and
   silently zeroing prior scores). Same correction for
   priority_tier / scorer_version / signals / scored_at. Applied to
   production via Supabase MCP — verified success: True.

2) New migration 20260516120100_enable_jobs_jobs_rls.sql + GRANT
   migration: turns on RLS on jobs.jobs with a select-only policy
   for anon/authenticated, grants USAGE on the jobs schema and
   SELECT on the table to those roles. Server .env now uses the
   real anon JWT for SUPABASE_ANON_KEY (sync writes still use
   SUPABASE_SERVICE_ROLE_KEY which bypasses RLS). Combined with
   Cloudflare Access + Bearer this gives defence in depth.

3) /jobs endpoint now filters on created_at (database insert time)
   instead of post_time (LinkedIn original posting date — almost
   always older than today for fresh scrapes). Doc string updated;
   created_at added to the SELECT projection so clients can see
   it. Verified by direct REST against PostgREST + by
   python-in-container test (3 rows returned for since=today).
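
The data-loss mechanism in item 1 can be reproduced in miniature. This sketch uses SQLite as a stand-in for Postgres and a simplified two-column table, so the SQL is not the text of jobs.upsert_job; the table layout, constant names, and `final_score` helper are all hypothetical.

```python
import sqlite3

SCHEMA = (
    "CREATE TABLE jobs (id TEXT PRIMARY KEY, title TEXT, "
    "priority_score INTEGER NOT NULL DEFAULT 0)"
)

# The insert side coerces a NULL score to 0, so excluded.priority_score is
# never NULL and the CASE below always takes the overwrite branch.
BUGGY_UPSERT = """
INSERT INTO jobs (id, title, priority_score)
VALUES (:id, :title, COALESCE(:score, 0))
ON CONFLICT(id) DO UPDATE SET
  title = excluded.title,
  priority_score = CASE WHEN excluded.priority_score IS NOT NULL
                        THEN excluded.priority_score ELSE priority_score END
"""

# Branching on the parameter itself keeps the stored score on unscored upserts.
FIXED_UPSERT = BUGGY_UPSERT.replace(
    "WHEN excluded.priority_score IS NOT NULL", "WHEN :score IS NOT NULL"
)


def final_score(upsert_sql: str) -> int:
    """Upsert a scored row, then the same row unscored; return the stored score."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.execute(upsert_sql, {"id": "j1", "title": "Engineer", "score": 80})
    db.execute(upsert_sql, {"id": "j1", "title": "Engineer", "score": None})
    return db.execute("SELECT priority_score FROM jobs WHERE id = 'j1'").fetchone()[0]
```

Running `final_score(BUGGY_UPSERT)` shows the earlier score replaced by the default, while `final_score(FIXED_UPSERT)` preserves it.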
Companion to 20260516120100. RLS policies don't grant SELECT; PostgREST
also needs the role to have USAGE on the schema and SELECT on the table.
Already applied to production via Supabase MCP but the file was missing
from the PR — without it a fresh project provisioning from these
migrations would have count=0 on /jobs until the GRANT was applied
manually.
@RickSanchez88E
Owner Author

Three follow-up items addressed in 5143d21, plus a fix adding the missing RLS GRANT migration:

1. Migration data-loss fix (commit 5143d21, file supabase/migrations/20260516120000_fix_priority_upsert_data_loss.sql):
Recreated jobs.upsert_job so the ON CONFLICT DO UPDATE branches on the function PARAMETER p_priority_score IS NOT NULL instead of excluded.priority_score (which the INSERT side had already coerced from NULL to 0, making the case-when always true and silently zeroing prior scores). Same correction for tier / scorer_version / signals / scored_at. Applied to production via Supabase MCP — apply_migration returned {success: true}.

2. Anon key + RLS (commits 5143d21 + grant migration):

  • Server .env switched: SUPABASE_ANON_KEY now holds the project's real legacy anon JWT (eyJ...role:anon), not the service-role key. Service-role still wires only to SUPABASE_SERVICE_ROLE_KEY for the sync write path.
  • Migrations 20260516120100_enable_jobs_jobs_rls.sql + 20260516120200_grant_anon_read_jobs_jobs.sql enable RLS on jobs.jobs, grant USAGE on the schema + SELECT on the table to anon, authenticated, and create a select-only policy anon_read_jobs_jobs (USING (true)).
  • Verified end-to-end:
    • GET /jobs?since=2026-05-16 via Cloudflare returns count=42 real LinkedIn rows.
    • PATCH /rest/v1/jobs?id=eq.<real-uuid> with anon key + Prefer: return=representation returns [] (0 rows affected) — RLS blocks the write. Read-back confirms priority_score unchanged.

3. /jobs since semantics (was 🟡 trivial — fixed not deferred):
gte("post_time", since) → gte("created_at", since). post_time is LinkedIn's posting date (often days/weeks old even for fresh scrapes); created_at is the database insert time, which matches the "jobs added since" expectation. Ordering also flipped to created_at desc. created_at added to the response projection so callers can see it. 9/9 FastAPI tests still pass.

PR now at 53 commits. CI green throughout. Ready when you are.
