feat(ops): internal observability dashboard at /ops #2075

Merged
aalemayhu merged 11 commits into main from feat/ops-observability
May 9, 2026

Conversation

aalemayhu (Contributor) commented May 9, 2026

Summary

  • Adds an internal-only /ops dashboard for Al with two tabs:
    • Engineering — inbound request volume, route latency (avg/p95), outbound API calls per service, error rate by route/service. Read from two new append-only Postgres tables (request_logs, outbound_call_logs) populated by a global Express middleware and an opt-in instrumentedAxios wrapper.
    • Business — six big-number cards (MRR, Net new MRR MTD, Active paying subs, Churn 30d, Failed payments 7d, New paid conversions 7d) + four Recharts panels (MRR 90d area, active-subs 90d line, new-vs-churned 12-week paired bars, failed-payments 12-week bars). All Stripe-derived, no DB persistence — historicals are reconstructed from a single paginated subscriptions.list({status:'all'}) walk plus a 12-week invoices.list walk, cached for 15 min per metric.
  • Writes for the engineering side go through a fire-and-forget ObservabilitySink that batches every 5s or 100 rows; the request path takes a single Date.now() and a res.on('finish') callback — no blocking work, no impact on user-facing latency.
  • Six existing axios callers (Notion OAuth + media + icons, Google login, Dropbox download, Drive download) are routed through the wrapper so we capture host+pathname, status, and duration. URLs are query-stripped before persistence and we never store bodies, headers, owners, IPs, or tokens.
  • The dashboard itself is a lazy-loaded React Query page using Recharts (~113KB gzip, only loaded on /ops); auto-refreshes every 30s, pauses while the tab is hidden, falls back to the last good snapshot on a transient API error so the charts don't blank out. The Business tab mirrors the same visual treatment — ChartPanel wrapper, 220px chart frame, snapshot-during-refetch, no <pre> JSON anywhere.
  • Access is gated server-side by RequireOpsAccess returning 404 (not 403 — we don't reveal the dashboard exists). The matching features.ops flag drives the navbar entry, hidden for everyone except the ops owner.
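The fire-and-forget batching described above can be sketched roughly as follows. This is a minimal illustration, not the ObservabilitySink from this PR: the `BatchingSink` class, the `LogRow` shape, and the `insertBatch` callback are hypothetical names, and only the 100-row / 5s thresholds and the drop-on-error behaviour come from the description.

```typescript
// Hypothetical row shape; the real tables have their own columns.
interface LogRow {
  route: string;
  status: number;
  durationMs: number;
}

type InsertBatch = (rows: LogRow[]) => Promise<void>;

class BatchingSink {
  private buffer: LogRow[] = [];
  private timer: ReturnType<typeof setInterval> | undefined;
  private insertBatch: InsertBatch;
  private maxRows: number;
  private intervalMs: number;

  constructor(insertBatch: InsertBatch, maxRows = 100, intervalMs = 5000) {
    this.insertBatch = insertBatch;
    this.maxRows = maxRows;
    this.intervalMs = intervalMs;
  }

  // Starts the periodic flush (every intervalMs).
  start(): void {
    this.timer = setInterval(() => void this.flush(), this.intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }

  // Called from the request path: synchronous and never awaited.
  record(row: LogRow): void {
    this.buffer.push(row);
    if (this.buffer.length >= this.maxRows) void this.flush();
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    try {
      await this.insertBatch(batch);
    } catch (err) {
      // Drop the batch instead of retrying: instrumentation must not back up.
      console.error('observability flush failed, dropped rows:', batch.length, err);
    }
  }
}
```

Swapping the buffer before awaiting the insert means rows recorded during a slow flush land in the next batch rather than being lost or written twice.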

Note: PR #2076 is auto-merged/redundant; this PR is the canonical landing for both Engineering and Business tabs.

Spec / design

  • Spec: Documentation/ops-observability/SPEC.md
  • Design: Documentation/ops-observability/DESIGN.md

Test plan

  • pnpm test for the new files (41/41 green for engineering side; 14/14 for BusinessMetricsService including time-series cases; 5/5 for BusinessTab.test.tsx).
  • pnpm tsc -p . clean.
  • pnpm --filter 2anki-web typecheck clean.
  • pnpm --filter 2anki-web test clean (72/72).
  • pnpm --filter 2anki-web lint clean (Biome).
  • pnpm --filter 2anki-web build produces a separate OpsPage-*.js chunk (~113KB gzip) so Recharts is not paid for outside /ops.
  • Manual smoke on staging: /ops (Engineering) — four charts, window dropdown, 30s tick.
  • Reconciliation gate on /ops/business: eyeball the six cards, then cross-check MRR and active-paying-subs against the Stripe dashboard. Don't declare success until those two numbers line up to within rounding.
  • Visit as a non-Al user → 404 from the API and no navbar entry.

Risks

  • The migration adds two indexed tables. Indexes are (created_at desc) and (<key>, created_at desc); size grows linearly with traffic. Without a retention/rotation cron we'll need to revisit once we have a week of real volume — see follow-ups.
  • The sink uses the singleton DB connection. If a flush ever blocks longer than the 5s interval, batches stack up in memory; on insert error we drop and log via console.error rather than retry, which is the right call (instrumentation must never threaten the request path).
  • Business tab Stripe call budget: ~2 paginated walks per refresh × 96 refreshes/day (15-min cache) = ~192 walks/day, or roughly 200 list calls/day while each walk fits in a single page (more as pagination grows). Well under Stripe rate limits.
  • Historical-reconstruction edge cases for the Business tab: canceled_at is set when cancellation is requested (potentially before period end), while ended_at is the actual termination time. We treat endedAt ?? canceledAt as the historical "active until" boundary, and we explicitly do not try to reconstruct trial→active transitions (a sub whose current status === 'trialing' is excluded from the time series; that is a v3 problem).
  • Rollback plan: revert this PR and run migrations/<...>_observability.js down. The middleware and instrumentedAxios wrapper become no-ops if the tables vanish (sink swallows insert errors and continues). Reverting also drops the Business tab; nothing else in the app depends on it.
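To make the "active until" rule from the Risks list concrete, here is a small sketch of how a point-in-time membership check could look. The `SubSnapshot` shape and both function names are hypothetical, timestamps are assumed to be epoch seconds, and treating the boundary as exclusive is an assumption rather than something the PR states.

```typescript
// Hypothetical, trimmed-down view of a subscription (epoch seconds).
interface SubSnapshot {
  status: string;
  created: number;
  canceledAt: number | null;
  endedAt: number | null;
}

// endedAt wins when present; canceledAt is the fallback boundary.
function activeUntil(sub: SubSnapshot): number | null {
  return sub.endedAt ?? sub.canceledAt;
}

// True when the subscription counts toward the series at time t.
function wasActiveAt(sub: SubSnapshot, t: number): boolean {
  if (sub.status === 'trialing') return false; // excluded from history entirely
  if (sub.created > t) return false; // not created yet
  const until = activeUntil(sub);
  return until === null || t < until; // assumption: boundary is exclusive
}
```

A subscription created at t=100 with canceledAt=200 and endedAt=300 is counted at t=250, because endedAt takes precedence over the earlier cancellation request.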

Future work (intentionally not in this PR)

  • Retention / rotation cron. Spec calls this out as out-of-scope until we see real volume. We'll likely partition by week or run a daily prune job once we know our row rate.
  • Drill-down row view. Spec says "no per-row table view in v1; if Al needs one later, we add it." Will revisit after a week of using the charts.
  • Trial→active reconstruction in MRR history. The current historical rule excludes any sub whose current status is trialing. Reconstructing the moment a trial converted requires walking subscription events, not just the sub list.

Goal alignment

Scaling toward 300K users requires we know (1) what's slow and what's broken on the engineering side and (2) what's happening to revenue on the business side, before users or the bank tell us. This PR is the foundation that makes every later perf/reliability and pricing bet data-driven instead of vibes-driven, with zero user-facing surface area added.

🤖 Generated with Claude Code

aalemayhu and others added 4 commits May 9, 2026 12:39
Adds the persistence + read path for an internal observability dashboard:
two append-only tables (request_logs, outbound_call_logs), an
ObservabilitySink that buffers and batch-flushes (5s interval, 100-row
threshold) so writes never block requests, and the layered read pipeline
RouteOpsRouter -> OpsController -> GetOpsMetricsUseCase ->
ObservabilityQueryService -> ObservabilityRepository returning bucketed
JSON for /ops.

The inbound `requestLoggingMiddleware` is mounted globally before
routers, captures method/route-template/status/duration on
res.on('finish'), and skips /ops/* itself.

`/ops/api/metrics` is gated by RequireOpsAccess (404 for non-Al, not 403
- we do not reveal the dashboard's existence). The matching feature
flag `features.ops` is exposed via /api/users/debug/locals so the
frontend can render the navbar entry only for Al.

Spec + design notes live in Documentation/ops-observability/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
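For readers unfamiliar with the res.on('finish') pattern, the inbound logging described in this commit could be sketched as below. It is an illustration only: the minimal request/response interfaces stand in for Express types so the snippet is self-contained, and `createRequestLogging`, `RequestLogRow`, and the `record` callback are hypothetical names.

```typescript
// Minimal structural stand-ins for the Express request/response types.
interface MinimalRequest {
  method: string;
  path: string;
  route?: { path: string };
}
interface MinimalResponse {
  statusCode: number;
  on(event: 'finish', listener: () => void): unknown;
}
interface RequestLogRow {
  method: string;
  route: string;
  status: number;
  durationMs: number;
}

function createRequestLogging(
  record: (row: RequestLogRow) => void,
  skipPrefixes: string[] = ['/ops']
) {
  return (req: MinimalRequest, res: MinimalResponse, next: () => void): void => {
    // The dashboard must not log its own traffic.
    if (skipPrefixes.some((p) => req.path === p || req.path.startsWith(p + '/'))) {
      next();
      return;
    }
    const startedAt = Date.now();
    res.on('finish', () => {
      record({
        method: req.method,
        // Prefer the route template (e.g. /users/:id) over the concrete path.
        route: req.route?.path ?? req.path,
        status: res.statusCode,
        durationMs: Date.now() - startedAt,
      });
    });
    next();
  };
}
```

The only work on the request path is one Date.now() call and registering a listener; the row is built after the response has been sent. Falling back to req.path when no route matched is an assumption of this sketch.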
Routes the six existing axios call sites that target our allowlisted
external services through `instrumentedAxios` so each call records its
service label, host+pathname (query stripped), status code, and
duration into outbound_call_logs:

- NotionService.getAccessData       -> notion
- NotionService.helpers.downloadMediaOrSkip -> notion
- NotionService.helpers.renderIcon  -> notion
- AuthenticationService.loginWithGoogle -> google_drive
- handleDropbox upload download     -> dropbox
- handleGoogleDrive upload download -> google_drive

Skipped on purpose:
- BlockBookmark.useMetadata fetches arbitrary user-supplied URLs and
  doesn't fit the closed allowlist; instrumenting it would require
  expanding the allowlist with no useful service label. Leaving it
  un-instrumented matches the spec ("if a caller doesn't fit, skip
  it").
- Patreon has no live HTTP caller in the codebase today; the label is
  reserved for the future Patreon webhook ingest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
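The "host+pathname (query stripped)" label can be derived with the standard URL class. The sketch below shows one way to do it together with a timing wrapper; `toEndpointLabel`, `timedCall`, and `OutboundRow` are hypothetical names, and recording status 0 when the call throws is an assumption of this example, not documented behaviour of instrumentedAxios.

```typescript
interface OutboundRow {
  service: string;
  endpoint: string;
  status: number;
  durationMs: number;
}

// Keeps host and pathname; the query string and fragment are discarded.
function toEndpointLabel(rawUrl: string): string {
  const url = new URL(rawUrl);
  return `${url.host}${url.pathname}`;
}

// Times any call that resolves to something with a numeric status.
async function timedCall<T extends { status: number }>(
  service: string,
  rawUrl: string,
  call: () => Promise<T>,
  record: (row: OutboundRow) => void
): Promise<T> {
  const startedAt = Date.now();
  let status = 0; // assumption: 0 marks "no response received"
  try {
    const response = await call();
    status = response.status;
    return response;
  } finally {
    record({
      service,
      endpoint: toEndpointLabel(rawUrl),
      status,
      durationMs: Date.now() - startedAt,
    });
  }
}
```

Because only the label is persisted, tokens or codes carried in query parameters never reach outbound_call_logs.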
Renders the four charts spec'd in
Documentation/ops-observability/DESIGN.md from the aggregated
/ops/api/metrics payload: inbound volume stacked by status class,
top-15 routes by avg/p95 latency, outbound calls per service over
time, and side-by-side error-rate bars for routes and services.

- Recharts is added as a dependency and only ships in the lazy-loaded
  /ops chunk - no impact on the upload/Ankify hot paths.
- Window is a URL query param (?window=1h|24h|7d, default 24h),
  reload-safe.
- React Query handles the 30-second background refetch; refetch is
  paused while the tab is hidden via refetchIntervalInBackground:
  false.
- The visible chart data falls back to the last successful snapshot on
  fetch error so a transient 5xx doesn't blank out the dashboard - it
  just renders the alert banner above the panels.
- All numeric formatting (% color thresholds, status grouping, bucket
  labels) is covered by unit tests in opsHelpers.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
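A reload-safe window parameter with a default could be parsed along these lines. This is a sketch under the stated assumptions: `parseWindow` and the `OpsWindow` type are hypothetical names, and falling back to the default on an unknown value (rather than rejecting it) is a guess at the intended behaviour.

```typescript
const WINDOWS = ['1h', '24h', '7d'] as const;
type OpsWindow = (typeof WINDOWS)[number];
const DEFAULT_WINDOW: OpsWindow = '24h';

// Accepts a location.search-style string such as "?window=7d".
function parseWindow(search: string): OpsWindow {
  const raw = new URLSearchParams(search).get('window');
  return WINDOWS.find((w) => w === raw) ?? DEFAULT_WINDOW;
}
```

Keeping the window in the URL rather than component state means a reload or a shared link shows the same time range.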
Adds an "Ops" entry to the desktop and mobile navbar that's gated on
`features.ops` from /api/users/debug/locals - hidden for everyone
except the ops owner, who is also the only user the backend will serve
/ops/api/metrics for. A small uppercase "admin" tag sits next to the
label so it reads as an internal tool, not a regular feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
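The server-side half of this gate, answering 404 rather than 403 for anyone who is not the ops owner, might look like the sketch below. How the owner is identified is not stated in this PR, so the `isOpsOwner` predicate is a hypothetical placeholder, and the minimal interfaces stand in for Express types.

```typescript
// Minimal structural stand-ins for the Express request/response types.
interface GateRequest {
  user?: { email: string };
}
interface GateResponse {
  status(code: number): GateResponse;
  end(): void;
}

function requireOpsAccess(isOpsOwner: (req: GateRequest) => boolean) {
  return (req: GateRequest, res: GateResponse, next: () => void): void => {
    if (isOpsOwner(req)) {
      next();
      return;
    }
    // 404 instead of 403, so the response does not confirm the route exists.
    res.status(404).end();
  };
}
```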
@aalemayhu aalemayhu marked this pull request as ready for review May 9, 2026 10:42
verb: 'get' | 'delete',
url: string,
config: AxiosRequestConfig | undefined
): Promise<AxiosResponse<T>> => axios[verb]<T>(url, config);
aalemayhu and others added 3 commits May 9, 2026 12:46
URL.pathname is already normalized by the URL parser, so the
`/\/+$/` regex was doing nothing useful for endpoint labels.
Removing it also clears Sonar S5852 (slow-regex hotspot) — the
regex is linear, but reviewing it adds noise we don't need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dashboard was fetching `/ops/api/metrics`, but vite's dev proxy
only forwards `/api/*` to the backend, so the SPA's index.html came
back as `<!DOCTYPE...` and the page rendered an empty state. The
production server's DefaultRouter catch-all `^/(?!api).*` would have
hit the same issue once deployed.

Moving the JSON endpoint to `/api/ops/metrics` matches the codebase
convention (`/api/*` for JSON, everything else for SPA), so vite
proxies it automatically and DefaultRouter's regex naturally lets it
through. The dashboard page itself still lives at `/ops`.

requestLoggingMiddleware now skips both `/ops/*` (the page) and
`/api/ops/*` (its own polling) so the dashboard never logs itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
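The routing problem described in this commit is easy to reproduce with the catch-all pattern quoted above. The snippet only exercises that regular expression; `servedAsSpa` is a hypothetical helper name.

```typescript
// The catch-all from the commit message: any path that does not start with /api.
const spaCatchAll = /^\/(?!api).*/;

const servedAsSpa = (path: string): boolean => spaCatchAll.test(path);

console.log(servedAsSpa('/ops/api/metrics')); // true: the old endpoint got index.html
console.log(servedAsSpa('/api/ops/metrics')); // false: the new endpoint reaches the API
console.log(servedAsSpa('/ops')); // true: the dashboard page itself stays in the SPA
```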
Adds /api/ops/business/metrics returning six business metrics (MRR, net new
MRR MTD, active paying subs, 30d churn, 7d failed payments, 7d new paid
conversions). Five come from the local subscriptions table via the new
SubscriptionsAnalyticsRepository; failed payments hit Stripe with a 15-min
in-memory cache. Owner-only via existing RequireOpsAccess.

The Engineering view moves into a sibling EngineeringTab under a shared
OpsLayout, with /ops/business rendering raw JSON in a <pre> as a sanity
check before the card grid in PR B. Hourly Stripe sync added to
ScheduleCleanup so the local subscriptions table stays fresh enough for
the 15-min response cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aalemayhu and others added 4 commits May 9, 2026 16:06
Drops the local-first approach (subscriptions table read) in favour of
calling Stripe directly with a 15-min server-side cache. Decouples the
dashboard from updateStripeSubscriptions, which Al kept manual-only.

- Delete SubscriptionsAnalyticsRepository + test
- BusinessMetricsService now owns all six metric queries via Stripe SDK
  - mrr_usd and active_paying_subs share one paginated walk of
    subscriptions.list({status: 'active'}) — assert in tests
  - net_new_mrr_mtd_usd: subscriptions.list({created.gte: month_start})
  - new_paid_conversions_7d: subscriptions.list({created.gte: 7d_ago})
  - churn_30d_pct: subscriptions.search('canceled_at>:30d') / active count
  - failed_payments_7d: invoices.list filtered to open + attempts > 0
  - MRR normalization in TS: per-item unit_amount × quantity × factor
    where factor is {month:1, year:1/12, week:4.33, day:30}/interval_count
- OpsRouter drops repository wiring; service constructed empty
- Tests now mock the Stripe SDK (external dep), cover yearly/weekly
  normalization, multi-item subs, trialing exclusion, partial-failure
  errors[], cache hit/expiry, single active-list walk

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
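The MRR normalization rule listed above (unit_amount × quantity × factor, divided by interval_count) can be written out as follows. This is a sketch rather than the BusinessMetricsService code: `PriceItem`, `itemMonthlyCents`, and `subscriptionMrrUsd` are hypothetical names, and dividing by 100 assumes unit_amount is in cents.

```typescript
type Interval = 'day' | 'week' | 'month' | 'year';

// Hypothetical flattened view of one subscription item.
interface PriceItem {
  unitAmount: number; // minor units (cents)
  quantity: number;
  interval: Interval;
  intervalCount: number;
}

// Factors from the commit message: how many billing intervals fit in a month.
const MONTHLY_FACTOR: Record<Interval, number> = {
  month: 1,
  year: 1 / 12,
  week: 4.33,
  day: 30,
};

function itemMonthlyCents(item: PriceItem): number {
  return (item.unitAmount * item.quantity * MONTHLY_FACTOR[item.interval]) / item.intervalCount;
}

// Sums all items of a subscription and converts cents to dollars.
function subscriptionMrrUsd(items: PriceItem[]): number {
  const cents = items.reduce((sum, item) => sum + itemMonthlyCents(item), 0);
  return cents / 100;
}
```

For example, a yearly price of 12000 cents contributes about $10 of MRR, and a price of 3000 cents billed every 3 months also contributes $10.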
Stripe Search API uses range operators without the colon (`canceled_at>1234`),
not `canceled_at>:1234`. The colon form is rejected with 'Ensure you have
properly quoted values'. Adds a regex assertion on the search call so the
syntax can't regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
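A query builder plus the kind of regex assertion mentioned in this commit might look like the sketch below. The function name and the exact pattern are hypothetical; only the colon-free `canceled_at>1234` form is taken from the message above.

```typescript
// Builds the range query for "canceled within the last N days".
function canceledSinceQuery(nowMs: number, days: number): string {
  const since = Math.floor(nowMs / 1000) - days * 24 * 60 * 60;
  return `canceled_at>${since}`; // no colon before the value
}

// Guard against the rejected `canceled_at>:1234` form creeping back in.
const RANGE_SYNTAX = /^canceled_at>\d+$/;

const query = canceledSinceQuery(1_700_000_000_000, 30);
console.log(query); // canceled_at>1697408000
console.log(RANGE_SYNTAX.test(query)); // true
console.log(RANGE_SYNTAX.test('canceled_at>:1697408000')); // false
```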
Reconstructs four historical series from a single paginated walk of
subscriptions.list({status:'all'}) plus the existing 12-week invoices
walk: mrr_timeseries (90d), active_subs_timeseries (90d),
conversions_vs_churn_weekly (12w), failed_payments_weekly (12w).
Today snapshots and time-series share one walk per refresh; each metric
is independently cached for 15 min and partial failures still surface
via response.errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
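Per-metric caching with partial failures reported alongside the successful values could be structured roughly like this. `TtlCache` and `collectMetrics` are hypothetical names, and the injectable clock exists only to make the sketch testable; the 15-minute TTL and the errors array are the parts taken from the description.

```typescript
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value while fresh, otherwise loads and stores it.
  async getOrLoad(key: string, load: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;
    const value = await load(); // a failed load is not cached
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

interface MetricsResponse {
  metrics: Record<string, number>;
  errors: string[];
}

// Each metric is loaded and cached independently; one failure does not sink the rest.
async function collectMetrics(
  cache: TtlCache<number>,
  loaders: Record<string, () => Promise<number>>
): Promise<MetricsResponse> {
  const metrics: Record<string, number> = {};
  const errors: string[] = [];
  await Promise.all(
    Object.entries(loaders).map(async ([name, load]) => {
      try {
        metrics[name] = await cache.getOrLoad(name, load);
      } catch (err) {
        errors.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
      }
    })
  );
  return { metrics, errors };
}
```

With a 15-minute TTL (`new TtlCache<number>(15 * 60 * 1000)`), a dashboard polling every 30 seconds triggers at most one upstream load per metric per 15 minutes.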
Six big-number cards (MRR, Net new MRR MTD, Active paying subs, Churn
30d, Failed payments 7d, New paid conversions 7d) sit above a 2x2 grid
of Recharts panels: MRR area, active-subs line, new-vs-churned paired
bars, failed-payments bars. Mirrors the Engineering tab's visual
treatment (ChartPanel wrapper, 220px chart frame, 30s auto-refresh,
last-snapshot retention during refetches, pageWide container). The
previous <pre> JSON fallback is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aalemayhu aalemayhu merged commit db7f1c2 into main May 9, 2026
6 checks passed
@aalemayhu aalemayhu deleted the feat/ops-observability branch May 9, 2026 14:58

sonarqubecloud Bot commented May 9, 2026

Quality Gate failed

Failed conditions
4.1% Duplication on New Code (required ≤ 3%)
C Security Rating on New Code (required ≥ A)


aalemayhu added a commit that referenced this pull request May 9, 2026
PR #2075 introduced four new Recharts components (MRR, ActiveSubs,
ConversionsChurn, FailedPaymentsWeekly) that each duplicated chart
margins, axis tick props, axis stroke, grid stroke, tooltip cursor
fill, and the tooltip wrapper markup. Sonar measured the new code at
4.1% duplication, failing the <=3% quality gate.

Extract the shared tokens into timeSeriesChartHelpers.ts and the
tooltip shell + row into TimeSeriesTooltipShell.tsx, then route the
four new charts through both. Also pass tooltip components by
reference to <Tooltip content={X}> instead of an inline arrow,
addressing the typescript:S6478 "component-definition-in-parent"
finding on those four files at the same time. Engineering charts
(InboundVolume, LatencyByRoute, OutboundByService, ErrorRate) are
left untouched per scope discipline — they were not the source of
the new-code duplication delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aalemayhu added a commit that referenced this pull request May 9, 2026
PR #2075 merged with the SonarCloud Code Analysis check at
conclusion=FAILURE because that check is not on the GitHub branch
protection required-checks list, and neither safety.py nor
check-commit-message.py inspect PR check status before
`gh pr merge`. Sonar's quality-gate failure (4.1% duplication on
new code, C security rating) went uncaught.

Add a third PreToolUse hook, check-merge-status.py, that:

  * Detects `gh pr merge` invocations in several command shapes
    (positional number, --rebase first, full PR URL, no PR -> let gh
    resolve from the current branch).
  * Calls `gh pr view <ref> --json statusCheckRollup` and inspects
    every entry — not just named-required ones.
  * If any conclusion == "FAILURE", denies the tool call with a
    bullet list of the failing check names and a hint to bypass via
    CLAUDE_SKIP_SAFETY=1.
  * On gh / network / parse errors, prints a one-line warning to
    stderr and exits 0 — never breaks a legitimate merge over a
    transient API issue.

Wired into .claude/settings.json alongside safety.py and
check-commit-message.py under the same Bash PreToolUse matcher.

Manual verification:
  PR 2075 (FAILURE on SonarCloud) -> deny.
  PR 2074 (all green)             -> allow.
  Non-merge command `ls`          -> allow.
  --rebase 2075 / URL form        -> deny (same).
  CLAUDE_SKIP_SAFETY=1 bypass     -> allow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
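The deny/allow decision this hook makes can be illustrated independently of the hook itself. The real check-merge-status.py is a Python script; the sketch below restates only its decision logic in TypeScript, the language of the application code. The `CheckEntry` fields are assumed from the `--json statusCheckRollup` output described above, and `failingChecks` and `decideMerge` are hypothetical names.

```typescript
// Assumed shape of one statusCheckRollup entry; only these fields are used.
interface CheckEntry {
  name?: string;
  conclusion?: string | null;
}

type Decision = { allow: true } | { allow: false; reason: string };

// Inspects every entry, not only the checks marked as required.
function failingChecks(ghJson: string): string[] {
  const parsed = JSON.parse(ghJson) as { statusCheckRollup?: CheckEntry[] };
  return (parsed.statusCheckRollup ?? [])
    .filter((check) => check.conclusion === 'FAILURE')
    .map((check) => check.name ?? '(unnamed check)');
}

function decideMerge(ghJson: string): Decision {
  let failing: string[];
  try {
    failing = failingChecks(ghJson);
  } catch {
    // Fail open: a parse or API problem must not block a legitimate merge.
    console.error('check-merge-status: could not read check status, allowing merge');
    return { allow: true };
  }
  if (failing.length === 0) return { allow: true };
  const reason = [
    'Refusing to merge, failing checks:',
    ...failing.map((name) => `  * ${name}`),
    'Set CLAUDE_SKIP_SAFETY=1 to bypass.',
  ].join('\n');
  return { allow: false, reason };
}
```

Failing open on errors trades strictness for availability, which matches the stated goal of never breaking a merge over a transient API issue.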