
feat: telemetry coverage for read-side commands + paid-tier fallback (3.1.0) #82

Merged
Mikola Lysenko (mikolalysenko) merged 11 commits into main from feat/telemetry-coverage-and-paid-fallback
May 26, 2026


@mikolalysenko Mikola Lysenko (mikolalysenko) commented May 26, 2026

Summary

  • Telemetry now covers every CLI command — scan, get, list, setup, repair, unlock, and the new vex (OpenVEX) — joining the apply/remove/rollback trio that already shipped. Twelve new PatchTelemetryEventType variants + thirteen tracker functions; all flow through the existing track_patch_event send path.
  • scan and get automatically fall back from the authenticated API to the public proxy on 401/403 (a stale/revoked token no longer blocks free patches). Warning to stderr; resulting telemetry event tagged fallback_to_proxy: true. Conservative classifier: 404, 5xx, network, and rate-limit errors do NOT trigger fallback.
  • SOCKET_OFFLINE=1 (airgap) now disables telemetry universally via is_telemetry_disabled(), so apply no longer attempts a 5-second telemetry POST against api.socket.dev when the operator explicitly requested airgap.

Test plan

  • cargo test --workspace --all-features — exit 0 locally.
  • New cargo coverage:
    • tests/telemetry_e2e.rs — apply/scan/get/list each fire telemetry against a wiremock recorder; SOCKET_OFFLINE=1 produces zero /telemetry POSTs for all four; scan falls back on 401 + tags the resulting event; scan does NOT fall back on 500.
    • scan_invariants.rs — withdrawn-patch lifecycle (preserve-on-API-silence, prune-on-uninstall, scan-without-apply-is-read-only).
    • telemetry_helpers_e2e.rs: SOCKET_OFFLINE branch of is_telemetry_disabled (truthy + non-truthy values).
  • Reviewer to confirm CHANGELOG entry under ## [3.1.0] reads accurately.

Notes for reviewers

  • No behavior change for apply/remove/rollback beyond the airgap gate.
  • apply / remove / rollback / vex keep their fail-loud semantics — the proxy fallback is intentionally read-side only.
  • Version sync via scripts/version-sync.sh. The npm workspace catalog: protocol blocked the npm install step, so the per-platform packages and pyproject were finished manually.

Assisted-by: Claude Code:opus-4-7

is_telemetry_disabled() now returns true when SOCKET_OFFLINE is "1" or
"true". Airgap mode promises "never contact the network"; the telemetry
endpoint is a network call, so honoring SOCKET_OFFLINE here keeps every
command (apply, remove, rollback — plus future scan/get/etc.) compliant
without requiring per-command gating.

Adds three integration tests in telemetry_helpers_e2e.rs and extends
the existing test_is_telemetry_disabled unit test with the new branch
(including "0" and "" non-truthy values).
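As a rough illustration of the airgap gate described above, here is a minimal sketch. The `_sketch` suffix, the `lookup` closure (a stand-in for the environment read), and the `is_truthy` helper are assumptions made for this example; the real is_telemetry_disabled() also has other branches that this commit message does not show.

```rust
/// Truthy values named in the commit message: exactly "1" or "true".
fn is_truthy(value: &str) -> bool {
    matches!(value, "1" | "true")
}

/// Sketch of the SOCKET_OFFLINE branch only.
/// `lookup` is a hypothetical indirection over the environment so the
/// logic can be exercised without mutating process-wide state.
fn is_telemetry_disabled_sketch<F>(lookup: F) -> bool
where
    F: Fn(&str) -> Option<String>,
{
    // An unset variable, "0", and "" all leave telemetry enabled.
    lookup("SOCKET_OFFLINE").map_or(false, |v| is_truthy(&v))
}
```

Against the real environment this could be called as `is_telemetry_disabled_sketch(|k| std::env::var(k).ok())`.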

Assisted-by: Claude Code:opus-4-7

…eeping + vex

Extends PatchTelemetryEventType with 12 new variants covering scan, get
(emits patch_fetched / patch_fetch_failed for symmetry with the
existing apply naming convention), list, repair, setup, unlock, and the
new vex (OpenVEX) command. Adds matching convenience tracker functions
that funnel through the existing track_patch_event send path — no new
HTTP plumbing.

The scan/get trackers carry a fallback_to_proxy flag so we can measure
how often the auth endpoint downgrades to the public proxy once that
fallback path lands.

No call sites yet — wiring into each command file follows in subsequent
commits so this commit stays a pure data-model addition.
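The "convenience tracker funnels through one send path" shape can be sketched as follows. Everything here is a simplified assumption: the `TelemetryEvent` struct, the in-memory `sink`, and the metadata layout are hypothetical, and this stand-in `track_patch_event` only records the event instead of sending it.

```rust
use std::collections::BTreeMap;

/// Hypothetical event record; the real payload shape is not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
struct TelemetryEvent {
    event_type: &'static str,
    metadata: BTreeMap<&'static str, String>,
}

/// Stand-in for the shared send path: it only appends to `sink`.
fn track_patch_event(sink: &mut Vec<TelemetryEvent>, event: TelemetryEvent) {
    sink.push(event);
}

/// Convenience tracker for a successful fetch. It adds no transport of
/// its own and only shapes the metadata, including the fallback flag.
fn track_patch_fetched(sink: &mut Vec<TelemetryEvent>, fallback_to_proxy: bool) {
    let mut metadata = BTreeMap::new();
    metadata.insert("fallback_to_proxy", fallback_to_proxy.to_string());
    track_patch_event(
        sink,
        TelemetryEvent {
            event_type: "patch_fetched",
            metadata,
        },
    );
}
```

Because every tracker ends in the same call, a single gate or transport change in the send path applies to all event types.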

Assisted-by: Claude Code:opus-4-7

…air, unlock, vex

Each command now fires a success/failure event through the existing
track_patch_event send path. Concrete coverage:

- list: patch_listed (count surfaced)
- setup: patch_setup (detected package manager: npm/pnpm)
- unlock: patch_unlocked (was_held + released metadata) + patch_unlock_failed
- repair: patch_repaired (downloaded + cleaned counts) + patch_repair_failed
- scan: patch_scanned (per-tier counts, can_access_paid, ecosystems,
  fallback_to_proxy=false placeholder) + patch_scan_failed when every
  batch errored (previously hidden as "zero patches found")
- get (UUID path only for now): patch_fetched on success,
  patch_fetch_failed on paid_required / not_found / API error.
  CVE/GHSA/PURL search-error paths also surface patch_fetch_failed.
- vex: vex_generated on success, vex_failed via a small async helper
  that wraps each emit_envelope_error call site.

Renamed the unlock tracker's "broken" parameter to "released" — unlock
never breaks a held lock (that's `--break-lock` on mutating subcommands);
the bool actually describes whether the lock file was removed.

No new HTTP plumbing; trackers reuse track_patch_event. Behavior preserved
on existing apply/remove/rollback paths.
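The twelve event names listed above can be modeled as an enum. This is a sketch only: the type name `ReadSideEvent`, the `ALL` constant, and `is_failure` are hypothetical, the existing apply/remove/rollback variants are omitted, and the assumption that the wire names equal the names in the list is not confirmed by this PR.

```rust
/// Hypothetical mirror of the twelve new event types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReadSideEvent {
    PatchListed,
    PatchSetup,
    PatchUnlocked,
    PatchUnlockFailed,
    PatchRepaired,
    PatchRepairFailed,
    PatchScanned,
    PatchScanFailed,
    PatchFetched,
    PatchFetchFailed,
    VexGenerated,
    VexFailed,
}

impl ReadSideEvent {
    const ALL: [ReadSideEvent; 12] = [
        Self::PatchListed,
        Self::PatchSetup,
        Self::PatchUnlocked,
        Self::PatchUnlockFailed,
        Self::PatchRepaired,
        Self::PatchRepairFailed,
        Self::PatchScanned,
        Self::PatchScanFailed,
        Self::PatchFetched,
        Self::PatchFetchFailed,
        Self::VexGenerated,
        Self::VexFailed,
    ];

    /// Event name as written in the commit message (assumed wire name).
    fn as_str(self) -> &'static str {
        match self {
            Self::PatchListed => "patch_listed",
            Self::PatchSetup => "patch_setup",
            Self::PatchUnlocked => "patch_unlocked",
            Self::PatchUnlockFailed => "patch_unlock_failed",
            Self::PatchRepaired => "patch_repaired",
            Self::PatchRepairFailed => "patch_repair_failed",
            Self::PatchScanned => "patch_scanned",
            Self::PatchScanFailed => "patch_scan_failed",
            Self::PatchFetched => "patch_fetched",
            Self::PatchFetchFailed => "patch_fetch_failed",
            Self::VexGenerated => "vex_generated",
            Self::VexFailed => "vex_failed",
        }
    }

    /// Failure events share the `_failed` suffix.
    fn is_failure(self) -> bool {
        self.as_str().ends_with("_failed")
    }
}
```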

Assisted-by: Claude Code:opus-4-7

Three new cargo tests in scan_invariants.rs covering patch-management
behaviors the existing matrix didn't pin down:

- scan_prune_keeps_entry_when_package_installed_but_api_silent: a
  manifest entry must survive --prune when the underlying package is
  still installed locally but the API has fallen silent on patches
  for it. Pins the current --prune scope (crawl-absence, not
  API-absence) so a future regression to over-pruning is loud.

- scan_prune_removes_withdrawn_patch_entry: when the underlying
  package is uninstalled (no longer in crawl results), --prune
  removes the manifest entry even with a stale blob still on disk.
  The blob is left for the existing repair-side GC to handle.

- scan_detects_update_without_touching_existing_blobs: a newer UUID
  from the API surfaces in the `updates` array, but scan without
  --apply must leave the on-disk manifest and blobs byte-for-byte
  unchanged. Read-only invariant.
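The --prune scope pinned by the first two tests (crawl-absence, not API-absence) can be sketched as a pure function. The manifest layout used here (package identifier mapped to patch UUID) and the function name are assumptions for illustration, not the real manifest format.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Removes manifest entries whose package no longer appears in the
/// local crawl and returns the removed keys. What the API reports is
/// deliberately not an input: API silence never prunes an entry.
fn prune_manifest(
    manifest: &mut BTreeMap<String, String>,
    crawled: &BTreeSet<String>,
) -> Vec<String> {
    let removed: Vec<String> = manifest
        .keys()
        .filter(|purl| !crawled.contains(*purl))
        .cloned()
        .collect();
    for purl in &removed {
        // Only the manifest entry goes away; any blob on disk is left
        // for a separate cleanup pass.
        manifest.remove(purl);
    }
    removed
}
```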

Assisted-by: Claude Code:opus-4-7

…airgap

New tests/telemetry_e2e.rs spawns the released binary against a
wiremock server that fronts both the patches endpoints AND the
telemetry endpoint, then counts POSTs against
/v0/orgs/{slug}/telemetry filtered by event_type.

Coverage:
- scan_emits_patch_scanned_telemetry_on_success
- list_emits_patch_listed_telemetry_when_telemetry_enabled
- get_emits_patch_fetched_telemetry_on_uuid_lookup_success
  (tolerates either fetched/fetch_failed — the apply step is allowed
   to fail in the test env; the invariant is that *some* event fires)
- {apply,scan,get,list}_skips_telemetry_in_airgap_mode — confirms the
  central is_telemetry_disabled() gate suppresses everything when
  SOCKET_OFFLINE=1, regardless of command.

Caught a real test-only bug along the way: send_telemetry_event reads
SOCKET_API_URL from the *environment*, not from the clap --api-url
arg. The test harness now sets both env + flag so the telemetry POST
lands on the same mock recording the API requests.

Assisted-by: Claude Code:opus-4-7

Adds `build_proxy_fallback_client(&overrides)` + `is_fallback_candidate(&err)`
in api/client.rs. The constructor builds a public-proxy-mode ApiClient
from the same overrides used by `get_api_client_with_overrides`,
deliberately dropping the auth token. The classifier flags 401/403
errors as fallback-eligible; everything else (404, 5xx, network,
rate-limit, parse) surfaces unchanged.

`scan.rs` and `get.rs` (UUID path) catch the first such error from
the authenticated endpoint, log a warning to stderr, rebuild the
client, retry the same request once, and continue. A new
`fallback_to_proxy` bool plumbed through to the existing telemetry
trackers carries the incidence into observability.

Behavior is deliberately conservative:
- Read commands only — `apply`/`remove`/`rollback`/`vex` keep their
  pre-existing fail-loud-on-auth semantics.
- 404, 5xx, network, parse errors do NOT trigger fallback; they
  surface as before so backend issues stay visible.
- Free patches still resolve via the proxy; paid patches return the
  same "paid_required" structured error the no-token path already
  emits.
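The classify-then-retry-once flow described above might look roughly like this. The `ApiError` enum and the `with_proxy_fallback` helper are hypothetical stand-ins; only the name `is_fallback_candidate` and the 401/403 rule come from the commit message, and the real code is presumably async and rebuilds an ApiClient where this sketch takes two closures.

```rust
/// Hypothetical error type; the real one lives in api/client.rs.
#[derive(Debug, Clone, PartialEq)]
enum ApiError {
    Http(u16),
    RateLimited,
    Network(String),
    Parse(String),
}

/// Only auth failures qualify. 404, 5xx, rate-limit, network, and
/// parse errors surface unchanged.
fn is_fallback_candidate(err: &ApiError) -> bool {
    matches!(err, ApiError::Http(401) | ApiError::Http(403))
}

/// Runs `authed` once. On a 401/403 it warns on stderr and retries the
/// request exactly once through `proxy`. The bool reports whether the
/// fallback path was taken, for tagging telemetry.
fn with_proxy_fallback<T, A, P>(authed: A, proxy: P) -> (Result<T, ApiError>, bool)
where
    A: FnOnce() -> Result<T, ApiError>,
    P: FnOnce() -> Result<T, ApiError>,
{
    let first = authed();
    if let Err(ref err) = first {
        if is_fallback_candidate(err) {
            eprintln!(
                "warning: authenticated request failed ({:?}); retrying via public proxy",
                err
            );
            return (proxy(), true);
        }
    }
    (first, false)
}
```

Returning the flag alongside the result keeps the telemetry tag tied to the code path that was actually taken, so it does not have to be inferred later.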

Assisted-by: Claude Code:opus-4-7

Two new tests in telemetry_e2e.rs:

- scan_falls_back_to_proxy_on_401_and_tags_telemetry: stands up two
  mock servers (auth endpoint 401s, proxy endpoint succeeds), asserts
  scan exits 0 after the swap, the fallback warning hits stderr, and
  the resulting patch_scanned event carries fallback_to_proxy: true
  in metadata.

- scan_does_not_fall_back_on_500: pins the conservative scope of the
  classifier. A 500 from the auth endpoint must NOT trigger the
  proxy retry — backend errors should stay visible. Asserts zero
  hits against the proxy mock and no fallback warning on stderr.

Assisted-by: Claude Code:opus-4-7

Workspace Cargo.toml, all npm wrapper + per-platform packages, and
PyPI pyproject.toml synced via scripts/version-sync.sh (with manual
fixup for the per-platform packages since npm install couldn't
process the workspace catalog: protocol).

CHANGELOG entry covers: telemetry events across the read-side and
housekeeping commands, the 401/403 auth → public-proxy fallback in
scan/get, the SOCKET_OFFLINE airgap gate, and the new behavioral +
lifecycle test coverage that backs all of it.

Assisted-by: Claude Code:opus-4-7

cargo clippy --workspace --all-features -- -D warnings flagged
track_patch_scanned under clippy::too_many_arguments (8 arguments
against the default limit of 7). Grouping the per-tier counts +
ecosystems list + fallback flag + auth tuple into a struct would
force every call site to build a config object for a single
fire-and-forget tracker, which is worse ergonomics. Allowing the lint
on that function is the right call; `track_patch_event` already
exists for callers that want full control.
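For reference, allowing the lint on a single function looks like the sketch below. The parameter list is a guess assembled from the description (per-tier counts, ecosystems, fallback flag, auth tuple); the real signature of track_patch_scanned is not shown in this PR, and the returned string exists only to make the example checkable.

```rust
/// Hypothetical eight-argument tracker. The attribute silences the
/// too-many-arguments lint for this one function only.
#[allow(clippy::too_many_arguments)]
fn track_patch_scanned_sketch(
    free_count: usize,
    paid_count: usize,
    total: usize,
    can_access_paid: bool,
    ecosystems: &[&str],
    fallback_to_proxy: bool,
    api_token: Option<&str>,
    org_slug: Option<&str>,
) -> String {
    // A real tracker would hand this off to the shared send path.
    format!(
        "patch_scanned free={} paid={} total={} can_access_paid={} ecosystems={} fallback_to_proxy={} authed={}",
        free_count,
        paid_count,
        total,
        can_access_paid,
        ecosystems.join(","),
        fallback_to_proxy,
        api_token.is_some() && org_slug.is_some()
    )
}
```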

Assisted-by: Claude Code:opus-4-7

The dashboard displays an SRI-format hash (`sha512-<base64>`) of each
API token for identification — that's the value stored in
api_tokens.hash, NOT what to set in SOCKET_API_TOKEN. Users who copy
the displayed hash hit a confusing 401 "Invalid API token" with no
hint about the mistake.

Adds two pure helpers in api/client.rs:
- validate_token_shape() — non-authoritative shape check against
  sktsec_<44>_api / sktsec_<44>_agent. Returns a redacted-preview
  warning message when the shape is obviously wrong.
- looks_like_token_hash() — true for sha256-/sha384-/sha512- prefixes.

Wires them into:
- get_api_client_with_overrides — warns on stderr before the first
  network call when the configured token is malformed.
- resolve_org_slug's 401 branch — appends a "you set the hash, not
  the token" hint when both conditions are met (Unauthorized + the
  token starts with sha###-).

Six new unit tests cover the canonical + agent shapes, the SRI hash,
short tokens, missing suffix, and the SRI-prefix detector. README's
env-var table now spells out the distinction in one sentence.

Pure additive — valid tokens see no output. The server's regex
remains the source of truth; we only flag values that are obviously
malformed client-side so the user doesn't waste a round trip.
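A minimal sketch of the two helpers, under stated assumptions: the return type (`Option<String>` carrying the warning), the eight-character redacted preview, and the choice to check only the length of the 44-character body (not its character set) are all guesses. The server's regex stays authoritative either way.

```rust
/// True for SRI-style hash prefixes such as the one the dashboard shows.
fn looks_like_token_hash(token: &str) -> bool {
    ["sha256-", "sha384-", "sha512-"]
        .iter()
        .any(|prefix| token.starts_with(*prefix))
}

/// Redacted preview: the first few characters only (assumed format).
fn redact(token: &str) -> String {
    let head: String = token.chars().take(8).collect();
    format!("{}...", head)
}

/// Non-authoritative shape check against sktsec_<44>_api and
/// sktsec_<44>_agent. Returns a warning for obviously wrong shapes and
/// None for plausible ones.
fn validate_token_shape(token: &str) -> Option<String> {
    let body = token.strip_prefix("sktsec_").and_then(|rest| {
        rest.strip_suffix("_api")
            .or_else(|| rest.strip_suffix("_agent"))
    });
    match body {
        Some(b) if b.len() == 44 => None,
        _ => Some(format!(
            "warning: SOCKET_API_TOKEN does not look like an API token (got {})",
            redact(token)
        )),
    }
}
```

A caller could combine the two: emit the shape warning before the first network call, and append the "you set the hash, not the token" hint only when a 401 arrives and `looks_like_token_hash` is true.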

Assisted-by: Claude Code:opus-4-7

Setting SOCKET_OFFLINE=1, SOCKET_DEBUG=1, or any other bool global
arg via env crashed at clap parse time:

  error: invalid value '1' for '--offline'
    [possible values: true, false]

clap's default bool parser only accepts "true"/"false". The internal
env-mirroring in apply_env_toggles() already writes "1" when a flag
is passed (so downstream code in telemetry.rs reads "1" via
read_env_with_legacy), and that internal read-side accepts both "1"
and "true". The user-facing input side was the asymmetric piece.

Wires BoolishValueParser (accepts "true"/"false"/"yes"/"no"/"1"/"0"/
"on"/"off"/"y"/"n") onto every bool global with an env attribute:
offline, global, json, verbose, silent, dry_run, yes, break_lock,
debug, no_telemetry.

CLI flag usage (--debug, --offline, etc.) is unchanged. Env var
usage now matches the canonical "1 means yes" convention every
operator expects.
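To make the accepted value set concrete, here is a std-only sketch of boolish parsing. It is not clap's implementation: the function name is hypothetical, the value list is copied from the commit message, and case-insensitive matching is an assumption of this sketch.

```rust
/// Parses the boolish spellings listed above into a bool.
/// Anything else is rejected with a short error message.
fn parse_boolish(raw: &str) -> Result<bool, String> {
    match raw.to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" | "on" | "y" => Ok(true),
        "false" | "no" | "0" | "off" | "n" => Ok(false),
        other => Err(format!("invalid boolish value '{}'", other)),
    }
}
```

With a parser of this kind on the env-backed flags, SOCKET_OFFLINE=1 and SOCKET_OFFLINE=true resolve to the same setting, which is the symmetry the commit restores.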

Assisted-by: Claude Code:opus-4-7

@mikolalysenko Mikola Lysenko (mikolalysenko) merged commit 1493421 into main May 26, 2026
42 checks passed
@mikolalysenko Mikola Lysenko (mikolalysenko) deleted the feat/telemetry-coverage-and-paid-fallback branch May 26, 2026 22:15