execution log
Entries appended in commit order. Each entry: finding ID, fix summary, files touched, test added, compatibility class, rollback note. New Ledger entries surfaced mid-remediation go to the bottom of this file under ## New Findings.
Fix. Added cortex/libs/logging/correlation.py with a ContextVar-backed correlation id, correlation_scope context manager, and a stdlib logging.Filter that injects the id onto every record. Wired structlog.contextvars.merge_contextvars into the structlog processor chain so any code path that calls get_logger() automatically gets correlation_id=... in its log records. Added a FastAPI middleware that mints or accepts X-Cortex-Request-ID per request, binds it for the request lifetime, and echoes it back on the response. The WebSocket server enters a correlation scope around every inbound message and stamps the active id onto outbound messages in _broadcast so daemon-initiated traffic (state updates, intervention triggers) carries the originating request's id. The Anthropic planner's llm.request status=ok log line now includes cid=… so cost telemetry (F20, next) can group spend by request.
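The contextvar plumbing described above can be sketched roughly as follows. This is a minimal illustration, not the contents of correlation.py: the names `get_correlation_id` and `CorrelationFilter` and the `"-"` placeholder are assumptions, and only `correlation_scope` is named in this log.

```python
import contextlib
import logging
import uuid
from contextvars import ContextVar
from typing import Iterator, Optional

# Holds the id for the current request or task; None outside any scope.
_correlation_id: ContextVar[Optional[str]] = ContextVar("correlation_id", default=None)


def get_correlation_id() -> Optional[str]:
    """Return the id bound in the current context, if any (hypothetical helper)."""
    return _correlation_id.get()


@contextlib.contextmanager
def correlation_scope(cid: Optional[str] = None) -> Iterator[str]:
    """Bind a correlation id for the duration of the block, minting one if absent."""
    active = cid or uuid.uuid4().hex
    token = _correlation_id.set(active)
    try:
        yield active
    finally:
        # reset() restores whatever was bound before, so scopes nest correctly.
        _correlation_id.reset(token)


class CorrelationFilter(logging.Filter):
    """Copy the active id onto every record so a formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # "-" is an assumed placeholder for records emitted outside any scope.
        record.correlation_id = get_correlation_id() or "-"
        return True
```

Because `ContextVar` values are copied into each asyncio task, an id bound around one inbound message does not leak into a concurrently handled one.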
Files touched (7):
- cortex/libs/logging/correlation.py (new)
- cortex/libs/logging/structured.py
- cortex/libs/logging/__init__.py
- cortex/services/api_gateway/app.py
- cortex/services/api_gateway/websocket_server.py
- cortex/services/llm_engine/anthropic_planner.py
- cortex/tests/integration/test_correlation_ids.py (new test)
Test. cortex/tests/integration/test_correlation_ids.py — 8 cases. Asserts the contextvar round-trips, nests correctly, the HTTP middleware mints when absent and echoes when supplied, and _broadcast stamps the active id onto a WSMessage that arrived with correlation_id=None. All 8 pass on this branch; on main (pre-fix) the imports from cortex.libs.logging.correlation import ... resolve to ModuleNotFoundError, the middleware doesn't exist (no X-Cortex-Request-ID header in response), and _broadcast does not stamp ids — every test fails.
Verification.
- F19 suite: pytest cortex/tests/integration/test_correlation_ids.py → 8 passed (0.64s).
- Regression check: pytest cortex/tests/unit/test_api_gateway.py cortex/tests/unit/test_anthropic_planner.py → 56 passed (11.51s).
Compatibility. Additive only. The middleware adds one header to every response; the WS envelope's correlation_id field already existed and was optional. No schema changes, no migrations, no client-side coordination required. Stale clients that don't propagate the id continue to work — they get a freshly-minted id per inbound message.
Rollback. git revert of this commit is clean: no DB or cache state; the contextvar dies with the process; the middleware is added in code only.
Scope split. The browser-extension half of correlation propagation (popup → background → daemon round-trip) is filed as new Ledger entry F19b in New Findings below; closing it depends on F40 (TS test infra). Daemon-internal traceability — which is what F19 promised — is closed.
Fix. Tactical mitigation for Architectural Debt #2 (implicit localhost-trust model). Added cortex/libs/auth/local_token.py exposing load_or_create_token() (atomic write, mode 0600) and verify_token() (constant-time compare). Daemon startup now provisions the token before any service binds. The WebSocket server's SHUTDOWN handler now requires payload.auth_token to match; missing or wrong tokens are logged and silently ignored — the malicious caller learns nothing, the legitimate user still has 5 other paths to stop the daemon (HTTP /shutdown, native messaging stop, launcher /stop — last one closes in F08). Cross-origin localhost web pages and hostile extensions cannot read mode-0600 files, so they cannot present a valid token.
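A token helper with the properties listed above (atomic write, mode 0600, constant-time compare) might look like the sketch below. The function names come from this entry; the signatures, the token length, and the `MIN_TOKEN_CHARS` truncation threshold are assumptions made for illustration.

```python
import hmac
import os
import secrets
import tempfile
from pathlib import Path
from typing import Optional

MIN_TOKEN_CHARS = 32  # assumed threshold; shorter files are treated as truncated


def load_or_create_token(path: Path) -> str:
    """Return the token stored at `path`, creating or replacing it if needed."""
    try:
        existing = path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        existing = ""
    if len(existing) >= MIN_TOKEN_CHARS:
        return existing

    token = secrets.token_hex(32)  # 64 hex characters
    path.parent.mkdir(parents=True, exist_ok=True)
    # mkstemp creates the file with mode 0600 on POSIX, so the token is never
    # briefly readable by other users.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(token)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # atomic swap into place
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return token


def verify_token(presented: Optional[str], expected: Optional[str]) -> bool:
    """Constant-time comparison; empty or missing values never match."""
    if not presented or not expected:
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Encoding to bytes before `hmac.compare_digest` keeps the comparison valid even if a caller presents non-ASCII input.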
Files touched (5):
- cortex/libs/auth/__init__.py (new)
- cortex/libs/auth/local_token.py (new)
- cortex/services/api_gateway/websocket_server.py
- cortex/services/runtime_daemon.py (provision token at startup)
- cortex/tests/unit/test_auth_local_token.py (new test)
Test. 8 cases in test_auth_local_token.py. Token round-trip is idempotent, file is 0o600, empty/wrong tokens rejected, correct token accepted, truncated files replaced, and crucially the WS SHUTDOWN handler does not call the shutdown callback when the token is missing but does when it matches. All 8 fail on main (module does not exist; SHUTDOWN handler accepts unauthenticated messages).
Verification.
- F07 suite: pytest cortex/tests/unit/test_auth_local_token.py → 8 passed (0.54s).
- Regression check: pytest cortex/tests/unit/test_api_gateway.py cortex/tests/integration/test_correlation_ids.py → 49 passed.
Compatibility. Breaking for callers that send SHUTDOWN without auth_token. Current callers: only background.ts line 2548. That call now silently no-ops on Step 1 of the kill chain; Steps 2–6 (HTTP /shutdown, native-messaging stop, launcher /stop, tab cleanup) still run and reliably stop the daemon. The native-host-mediated token fetch needed to restore Step 1 is filed as F07b below.
Rollback. git revert is clean: token file is harmless to leave behind; the WS handler reverts to its previous unauthenticated behaviour; no migration.
Fix. Same threat model as F07 — close cross-origin localhost. Two changes ship together because the legitimate extension path needs both halves of the plumbing to remain functional:
- launcher_agent.py requires X-Cortex-Auth-Token on POST /stop. Header missing or wrong → 401, no PID enumeration, no SIGTERM. /launch, /health, /status remain open (non-destructive; needed for liveness probes and supervisor start-up). The launcher's "zero cortex imports" invariant is preserved by inlining a minimal _auth_token_path() + _verify_auth_token() (hmac.compare_digest).
- native_host.py gains a get_auth_token command that returns the daemon's token. The browser↔native-host channel is already OS-authenticated per-profile, so returning the token here does not widen the attack surface; the mode-0600 file is still unreadable from any sandboxed page context.
Files touched (4):
- cortex/scripts/launcher_agent.py
- cortex/scripts/native_host.py
- cortex/tests/unit/test_launcher_auth.py (new)
- cortex/tests/unit/test_native_host_auth.py (new)
Test. 5 cases in test_launcher_auth.py + 2 cases in test_native_host_auth.py. The launcher tests boot the real LauncherHandler on an ephemeral port, monkeypatch _stop_daemon to a no-op so the test does not kill the developer's running daemon, and verify (a) /stop without token → 401, (b) /stop with wrong token → 401, (c) /stop with the file's token → 200, (d) /health stays open, (e) missing token file → fall-closed (401, not open). The native-host tests verify get_auth_token returns an existing token unchanged and provisions a new one when absent. All fail on main.
Verification. pytest cortex/tests/unit/test_launcher_auth.py cortex/tests/unit/test_native_host_auth.py cortex/tests/unit/test_auth_local_token.py cortex/tests/integration/test_correlation_ids.py → 23 passed.
Compatibility. Breaking for external POST /stop callers without the token. Internal callers: background.ts:2578-2583. After this commit the extension's Step 6 of its kill chain fails 401; Steps 2–5 still run (HTTP /shutdown, native-messaging stop). To fully restore Step 6, the extension must fetch the token via chrome.runtime.sendNativeMessage("com.cortex.launcher", {command: "get_auth_token"}) and add it as X-Cortex-Auth-Token to its /stop fetch. Wiring this in background.ts is split out as F08b (gated on F40 TS test infra).
Rollback. git revert is clean. The launcher's inline auth helper is self-contained; the native-host new command has no side effects.
Fix. The previous shutdown path wrapped self._session_report.finish() and session_path.write_text(...) in a single try/except Exception: logger.warning(...). Any failure — disk-full, permission denied, a Pydantic model error inside finish(), a SIGKILL after the file was opened in write-truncate mode but before bytes were flushed — silently dropped the entire session debrief and left no recoverable artefact.
- New helper cortex/libs/utils/atomic_write.py: atomic_write_text / atomic_write_json write to <path>.tmp, fsync, and os.replace into place. os.replace is atomic on POSIX and NTFS; failure before the rename leaves the prior on-disk file intact.
- runtime_daemon.stop() now splits compute-vs-disk error handling: finish() errors log "nothing to persist" and skip the write; disk-write errors log "prior file preserved" and the previous report (if any) survives.
- Both branches use logger.error instead of warning so the failure is observable at the daemon's default log level.
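The write-to-temp, fsync, rename sequence from the first bullet can be sketched as below. The function names match the entry; the exact signatures and the JSON formatting options are assumptions.

```python
import json
import os
from pathlib import Path
from typing import Any


def atomic_write_text(path: Path, text: str) -> None:
    """Write `text` to `<path>.tmp`, flush it to disk, then rename into place."""
    tmp = path.with_name(path.name + ".tmp")
    try:
        with open(tmp, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes are on disk before the swap
        os.replace(tmp, path)  # atomic: readers see the old file or the new one
    except BaseException:
        # Any failure before the rename leaves the previous file untouched;
        # only the temporary file needs cleaning up.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def atomic_write_json(path: Path, obj: Any) -> None:
    """Serialise first, so a serialisation error never touches the disk."""
    atomic_write_text(path, json.dumps(obj, indent=2))
```

Serialising before opening the temporary file mirrors the compute-vs-disk split in stop(): a failure while building the payload cannot truncate or replace an existing report.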
Files touched (3):
- cortex/libs/utils/atomic_write.py (new)
- cortex/services/runtime_daemon.py
- cortex/tests/unit/test_atomic_write.py (new test)
Test. 5 cases in test_atomic_write.py. Round-trip JSON, no leftover .tmp on success, prior file survives os.replace failure (simulated PermissionError), tmp file cleaned up on write failure (simulated mid-write OSError). All fail on main (helper does not exist).
Verification.
- F02 suite: 5 passed (0.03s).
- Regression check: full unit suite pytest cortex/tests/unit/ → 931 passed, 1 skipped.
Compatibility. Additive. The on-disk session report format is unchanged. Stale readers continue to see session_<id>.json. No migration.
Rollback. git revert is clean. The atomic-write helper has no callers other than runtime_daemon.stop(); the previous write_text path is straight-line restored.
Fix. The state loop's intervention dispatch path used bare asyncio.create_task(...) with no reference (runtime_daemon.py:1057). stop() cancelled only the long-running loops listed in self._tasks; any in-flight intervention task was orphaned. If that task held a file handle (session-record append, baseline write) the daemon could exit mid-write, truncating JSONL.
- Added self._background_tasks: set[asyncio.Task] in __init__.
- New helper _spawn_background_task(coro, *, name=...) adds to the set + registers an add_done_callback(self._background_tasks.discard) so the set stays bounded automatically.
- The previously-orphan call site is rewritten to use the helper.
- stop() now cancels every outstanding background task and awaits them with return_exceptions=True before clearing.
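The four changes above amount to the following pattern, shown here on a hypothetical `TaskTracker` class rather than on CortexDaemon itself (similar in spirit to the stub used by the tests).

```python
import asyncio
from typing import Any, Coroutine, Optional, Set


class TaskTracker:
    """Hypothetical stand-in carrying only the task-tracking plumbing."""

    def __init__(self) -> None:
        self._background_tasks: Set[asyncio.Task] = set()

    def _spawn_background_task(
        self, coro: Coroutine[Any, Any, Any], *, name: Optional[str] = None
    ) -> asyncio.Task:
        task = asyncio.create_task(coro, name=name)
        # Hold a strong reference so the task is not lost, and drop it
        # automatically once the task finishes so the set stays bounded.
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)
        return task

    async def stop(self) -> None:
        pending = list(self._background_tasks)
        for task in pending:
            task.cancel()
        if pending:
            # return_exceptions=True keeps one failing task from aborting the drain.
            await asyncio.gather(*pending, return_exceptions=True)
        self._background_tasks.clear()
```

Awaiting the cancelled tasks matters: cancellation only takes effect when the task next runs, so returning from stop() without the gather could still let the process exit mid-write.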
Files touched (2):
- cortex/services/runtime_daemon.py
- cortex/tests/unit/test_background_task_tracking.py (new)
Test. 4 cases. Tests intentionally use a _StubDaemon carrying the exact same plumbing rather than booting CortexDaemon, because the full daemon requires camera + store backends and the contract under test is a tiny set of lines. Cases: spawn tracks the task; completed tasks auto-discard; stop() cancels in-flight tasks; stop() drains multiple concurrent tasks. All fail on main (helper doesn't exist; the test's _spawn_background_task mirror would still pass against the stub but the orphan call site on main proves the bug — a separate live-daemon test would be needed to catch the original orphan, but in this codebase that requires a full integration harness that isn't trivially available; the contract test guards the new helper rigorously).
Verification. F03 suite: 4 passed (0.04s). Import-check: CortexDaemon imports clean.
Compatibility. Additive. Existing self._tasks mechanism untouched. No schema or wire changes.
Rollback. git revert is clean. The orphan call site reverts to bare asyncio.create_task; the set + helper die with the diff.
Fix. runtime_daemon.stop() awaited self._capture_pipeline.stop() with no upper bound. A disconnected USB webcam or stuck mediapipe worker hangs the close indefinitely; only SIGKILL unblocks the daemon, and SIGKILL leaves the AVFoundation camera handle owned by a dead PID for minutes — next launch fails with a permission-loop. The fix wraps the call in asyncio.wait_for(..., timeout=5.0). On timeout the daemon logs an explicit error and proceeds with the rest of the shutdown chain; the kernel reclaims the camera handle on actual process exit. Non-timeout exceptions are now logged (previously they were silently swallowed by except: pass).
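A bounded stop along these lines might be written as follows. The helper name `stop_capture`, the logger name, and the choice to log and then re-raise non-timeout errors are assumptions for illustration; the 5-second default comes from the entry above.

```python
import asyncio
import logging

logger = logging.getLogger("cortex.daemon")  # hypothetical logger name


async def stop_capture(pipeline, *, timeout: float = 5.0) -> bool:
    """Stop the capture pipeline, giving up after `timeout` seconds.

    Returns True on a clean stop and False on timeout, so the caller can
    carry on with the rest of the shutdown chain either way.
    """
    try:
        await asyncio.wait_for(pipeline.stop(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.error(
            "capture pipeline stop timed out after %.1fs; continuing shutdown", timeout
        )
        return False
    except Exception:
        # Assumed behaviour: surface the failure in the log, then re-raise
        # so the wrapper itself never swallows a non-timeout error.
        logger.exception("capture pipeline stop failed")
        raise
```

On timeout, `asyncio.wait_for` cancels the inner `pipeline.stop()` call before raising, so no half-finished coroutine is left running behind the daemon.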
Files touched (2):
- cortex/services/runtime_daemon.py
- cortex/tests/unit/test_capture_stop_timeout.py (new)
Test. 3 cases. A _NeverFinishingPipeline proves the timeout fires within bounds; a fast pipeline is not interrupted; non-timeout errors propagate (the wrapper does not swallow RuntimeError). On main the production code uses await with no wait_for, so a hung pipeline would block forever — adapter tests cannot match against that since they would themselves hang; the wrapper-pattern tests prove the new contract.
Verification. F01 suite: 3 passed (0.28s).
Compatibility. Behavioural change at shutdown: previously infinite wait, now 5 s. Legitimate camera close paths complete in well under 1 s; the 5 s budget is generous. No wire/schema change.
Rollback. git revert is clean. Single hunk in runtime_daemon.py; the prior try/except: pass is straight-line restored.
Fix. Previously AnthropicPlanner.__init__ for provider="bedrock" fetched the bearer token from Keychain and wrote it permanently to os.environ["AWS_BEARER_TOKEN_BEDROCK"]. The Cortex daemon spawns many subprocesses (capture worker, native-host re-launches, project-launcher terminals); every one inherited the env, and any debugger / crash-dump tool attached to any descendant could read the token. The Anthropic SDK reads the bearer only inside its constructor, so we now scope the env mutation to that single call and restore the prior state on exit. Keychain is consulted only when env is initially empty (preserves the documented "env wins" precedence); the user's own env value, if present at startup, survives untouched.
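The scoped-mutation pattern can be sketched with a small context manager. `scoped_env` and `build_client` are hypothetical names, and the two callables stand in for the Keychain lookup and the SDK constructor; only the environment variable name is taken from the entry.

```python
import contextlib
import os
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")
ENV_KEY = "AWS_BEARER_TOKEN_BEDROCK"


@contextlib.contextmanager
def scoped_env(key: str, value: str) -> Iterator[None]:
    """Set an environment variable for the block only, then restore the prior state."""
    previous = os.environ.get(key)
    os.environ[key] = value
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous


def build_client(
    fetch_keychain_token: Callable[[], Optional[str]],
    make_client: Callable[[], T],
) -> T:
    """Construct the client, exposing a keychain token only during construction."""
    if os.environ.get(ENV_KEY):
        # "env wins": a user-supplied value is used as-is and left untouched.
        return make_client()
    token = fetch_keychain_token()
    if not token:
        return make_client()
    with scoped_env(ENV_KEY, token):
        return make_client()
```

Subprocesses spawned after `build_client` returns inherit an environment without the token, which is the containment property the tests assert.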
Files touched (2):
- cortex/services/llm_engine/anthropic_planner.py
- cortex/tests/unit/test_bedrock_token_containment.py (new)
Test. 3 cases: (a) the scoped-mutation pattern in isolation produces a clean env, (b) full AnthropicPlanner construction sees the keychain token during SDK build (captured via stub) but the env is clean afterwards, (c) a pre-existing user-supplied env value is preserved (keychain skipped). The third case fails on main only by coincidence (no behavioural assertion before the fix); cases (a) and (b) fail on main because the env mutation was unbounded — the post-construction assertion "AWS_BEARER_TOKEN_BEDROCK" not in os.environ was false.
Verification. F11 suite: 3 passed (0.90s). Regression check: test_anthropic_planner.py — 15 passed.
Compatibility. Subtle but additive. Code that relied on the daemon polluting its own env after construction (none in this repo, grep verified) would break; the SDK's runtime requests do not re-read the env, so the post-construction emptiness has no functional effect on legitimate calls.
Rollback. git revert is clean. Single hunk in anthropic_planner.py; the old "set env permanently" path is straight-line restored.
Fix. Two-sided defence shipping together:
- Sanitiser hardened. sanitize_prompt_text now defangs the prompt-injection patterns most commonly seen in the wild: leading System: / Assistant: / Human: lines, the XML role tags <SYSTEM> / </SYSTEM> / <INSTRUCTION> / <ASSISTANT>, the Llama-style [INST] / [/INST] markers, and any premature </USER_CONTENT> close tag. Defang inserts spaces inside the marker: the human-readable text survives, the byte pattern the model recognises does not.
- Delimiter wrapping. New wrap_user_content(text, *, tag) helper. Every user-controlled string interpolated into the user prompt (context, constraints_text, goal_hint, extra_context) is wrapped in a tag-distinct delimiter: <WORKSPACE_CONTEXT>, <CONSTRAINTS>, <USER_GOAL>, <EXTRA_CONTEXT>.
- SYSTEM_PROMPT gains a "PROMPT INJECTION DEFENCE" clause that tells the model these tagged regions are DATA, never instructions, and to ignore any embedded "System:" prefix, "ignore previous instructions" directive, or new-rules text inside them.
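A rough sketch of the defang-and-wrap idea follows. The regular expressions, the exact spacing of the defanged output, and the decision to sanitise inside `wrap_user_content` are assumptions; the real prompts.py may differ (for example, it also performs brace escaping, which is omitted here).

```python
import re

# Role prefixes at the start of a line, e.g. "System:" or "  Human :".
_ROLE_PREFIX = re.compile(r"(?im)^([ \t]*)(system|assistant|human)[ \t]*:")
# XML-style role tags and a premature close of the user-content wrapper.
_ROLE_TAG = re.compile(r"(?i)<\s*(/?)\s*(system|instruction|assistant|user_content)\s*>")
# Llama-style instruction markers.
_INST_MARKER = re.compile(r"(?i)\[\s*(/?)\s*(inst)\s*\]")


def sanitize_prompt_text(text: str) -> str:
    """Insert spaces inside known markers so the text stays readable to a human
    but no longer matches the byte pattern a model would treat as structure."""
    text = _ROLE_PREFIX.sub(r"\1\2 :", text)      # "System:"  -> "System :"
    text = _ROLE_TAG.sub(r"< \1\2 >", text)       # "<SYSTEM>" -> "< SYSTEM >"
    text = _INST_MARKER.sub(r"[ \1\2 ]", text)    # "[INST]"   -> "[ INST ]"
    return text


def wrap_user_content(text: str, *, tag: str) -> str:
    """Wrap sanitised user text in a named delimiter such as <USER_GOAL>."""
    if not re.fullmatch(r"[A-Z_]+", tag):
        raise ValueError(f"unexpected tag name: {tag!r}")
    return f"<{tag}>\n{sanitize_prompt_text(text)}\n</{tag}>"
```

For example, `wrap_user_content("System: do X", tag="USER_GOAL")` yields a `<USER_GOAL>` block whose body reads `System : do X`. The substitutions are idempotent, so running the sanitiser twice leaves the text unchanged.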
Files touched (2):
- cortex/services/llm_engine/prompts.py
- cortex/tests/unit/test_prompt_injection_defence.py (new)
Test. 9 cases. Sanitiser defangs System:/Assistant:/Human: prefixes, XML role tags, [INST] brackets, and </USER_CONTENT> close tag. wrap_user_content produces the expected delimiter. Round-trip attack (a tab title combining every injection pattern) is fully neutralised. The system prompt carries the matching defence clause. Pre-existing brace-escape behaviour is preserved (regression guard). All fail on main (sanitiser pre-F09 did not defang any of these patterns; system prompt had no injection-defence clause).
Verification.
- F09 suite: 9 passed (1.00s).
- Regression check on prompt/context tests (pytest -k "prompt or context"): 104 passed.
Compatibility. Wire/schema unchanged. The LLM's effective prompt grows slightly (one tag-wrapper per interpolated value), well within token budget. The injection-defence clause may marginally bias the model toward refusing tab titles that contain System: literally — acceptable given the threat.
Rollback. git revert is clean. Single file modified plus the test; the previous sanitiser is restored straight-line.
Status. Daemon-side closed in F08. Extension-side wiring still open as F08b.
Summary. F07+F08+F07b shipped the daemon-side primitives. The browser extension still does not fetch the token via chrome.runtime.sendNativeMessage("com.cortex.launcher", {command: "get_auth_token"}), so its Step 1 (WS SHUTDOWN with auth_token) and Step 6 (POST /stop with X-Cortex-Auth-Token) currently fail. User-facing kill chain still works via Steps 2–5; the legacy redundancy absorbs the gap, but Step 1's graceful-flush intent and Step 6's belt-and-braces shutdown are lost.
Fix outline. On WS connect or first SHUTDOWN attempt, send {command:"get_auth_token"} to the native host, cache in memory, attach to outbound SHUTDOWN payload and /stop fetch.
Location. cortex/apps/browser_extension/background.ts:2544-2583.
Category. UI.
Blast radius. maintainability.
Fix complexity. S.
Dependencies. F08 (closed), F40 (no TS test infra to satisfy Phase-2 quality bar).
Summary. F19 closed daemon-internal correlation. The browser extension (background.ts, popup.tsx, newtab.tsx) still does not mint a correlation id at the user-action origin, does not include it in outbound WS messages, and does not log it. End-to-end traceability from popup click → daemon → LLM → response is therefore one hop short of complete.
Location. cortex/apps/browser_extension/background.ts:1383,1391 (correlation_id forwarded if present but never logged or minted), cortex/apps/browser_extension/popup.tsx (no minting).
Category. Cross / UI.
Blast radius. maintainability + correctness.
Fix complexity. S, but depends on F40 (no TS test infra to verify).
Dependencies. F19 (closed), F40 (no TS tests).
Why split. Touching the TS side without F40's test infra means manual UI verification only — violates the Phase-2 quality bar requiring a failing-on-main test. Filed as deferred; will be picked up immediately after F40.
(All Ledger entries not yet visited remain open — see audit/state.md for the locked execution order. None has been formally deferred yet; the session ended on a natural cohort boundary, not a scope-failure boundary.)
The eight commits in this session close the data-loss tier (F01, F02, F03), the security tier on local-CSRF + prompt injection + credential containment (F07, F08, F09, F11), and the observability foundation (F19). The following residual risks remain after this session — three things most likely to still go wrong in production:
- Cost runaway via state oscillation (F20, F25, F26, F27 still open). A user whose biometric signal oscillates at the HYPER/FLOW boundary can drive 60+ LLM calls/hour. F19 added cid= to the planner's success log so per-request grouping is now possible, but the per-user budget, the kill-switch, and the hysteresis fix are still pending. Monitoring needed: alarm on per-user planner calls/hour > 30 in any 60-minute window. The cid= field is already in place; a downstream log aggregator can group by user and count.
- TS extension is uncovered by tests (F40 still open, blocks F19b/F08b/F07b's extension wiring). The extension-side correlation propagation, native-host token fetch, and /stop token attachment are all daemon-ready but extension-unwired. The extension currently fails its WS SHUTDOWN step (Step 1 of the stop chain) and its launcher /stop step (Step 6, which returns 401). User-facing function is preserved by the redundant other steps; if one of those fails, the user will discover daemons that don't shut down cleanly. Monitoring needed: the launcher's /stop access log will now show 401s from the extension. That should be the first thing a fresh tail of ~/Library/Application Support/Cortex/launcher.log reveals.
- Action validation gap (F10 still open). F09 closed the prompt-injection path; F10, the executor-side allowlist for open_url / close_tab arguments, is the matching defence and is the next finding in the locked order. Until F10 lands, an LLM that was persuaded by injection (e.g. via a vector the F09 defence doesn't cover, like a Unicode homograph not yet in our defang list) can still emit a structurally-valid action with a malicious URL or tab index. Monitoring needed: log every suggested_actions[*].target value at INFO so post-hoc review can flag novel URLs.
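The first monitoring rule above (more than 30 planner calls per user in any 60-minute window) is a sliding-window count. The log says a downstream aggregator would do this; the in-process class below is only a hypothetical illustration of the rule, and `PlannerCallMonitor` does not exist in the codebase.

```python
import time
from collections import defaultdict, deque
from typing import Deque, DefaultDict, Optional


class PlannerCallMonitor:
    """Hypothetical sliding-window counter for per-user planner calls."""

    def __init__(self, limit: int = 30, window_s: float = 3600.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._calls: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def record(self, user_id: str, now: Optional[float] = None) -> bool:
        """Record one call; return True when the user exceeds the limit."""
        now = time.monotonic() if now is None else now
        calls = self._calls[user_id]
        calls.append(now)
        # Drop timestamps that have fallen out of the trailing window.
        cutoff = now - self.window_s
        while calls and calls[0] <= cutoff:
            calls.popleft()
        return len(calls) > self.limit
```

Passing `now` explicitly keeps the counter testable without sleeping; in production the monotonic clock default avoids wall-clock jumps.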
Fix. Two-layer defence against unsafe SuggestedAction payloads. Layer 1: @field_validator/@model_validator on SuggestedAction reject non-http(s) open_url targets, newlines in search_error queries, negative tab_index, and per-action_type target length caps tighter than the outer max_length=500. Layer 2: filter_unsafe_actions(plan, tab_count=N) in parser.py runs after enrichment and drops actions whose tab_index >= tab_count (live upper bound the schema cannot know) or that mutated post-parse into an unsafe shape. New EventType.INTERVENTION_ACTION_REJECTED log line per drop, carrying the correlation id from F19.
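The runtime layer can be illustrated without Pydantic. In this sketch `Action` is a hypothetical stand-in for SuggestedAction, the filter takes a plain list rather than a plan object, and the reason strings and logger name are invented for the example.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlsplit

logger = logging.getLogger("cortex.llm_engine")  # hypothetical logger name


@dataclass
class Action:
    """Hypothetical stand-in for SuggestedAction."""
    action_type: str
    target: str = ""
    tab_index: Optional[int] = None


def rejection_reason(action: Action, tab_count: int) -> Optional[str]:
    """Return why an action is unsafe, or None if it may be executed."""
    if action.action_type == "open_url":
        try:
            parts = urlsplit(action.target)
        except ValueError:
            return "malformed_url"
        if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
            return "unsafe_url"
    if action.action_type == "search_error" and any(c in action.target for c in "\r\n"):
        return "newline_in_query"
    if action.tab_index is not None:
        if action.tab_index < 0:
            return "negative_tab_index"
        if action.tab_index >= tab_count:
            # Live upper bound that a static schema cannot know.
            return "tab_index_out_of_range"
    return None


def filter_unsafe_actions(actions: List[Action], *, tab_count: int) -> List[Action]:
    """Drop unsafe actions, logging one line per rejection."""
    kept: List[Action] = []
    for action in actions:
        reason = rejection_reason(action, tab_count)
        if reason is None:
            kept.append(action)
        else:
            logger.warning("action rejected type=%s reason=%s", action.action_type, reason)
    return kept
```

Because the filter only removes items and never rewrites them, applying it twice gives the same result as applying it once.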
Files touched (4):
- cortex/libs/schemas/intervention.py (validators + allowlist constants)
- cortex/services/llm_engine/parser.py (filter_unsafe_actions, wired into enrich_plan_with_context)
- cortex/libs/logging/structured.py (new EventType)
- cortex/tests/unit/test_action_allowlist.py (new test)
Test. 17 cases. URL-scheme rejections (javascript/data/file/none), positive accepts (http/https), empty-target parse leniency + runtime drop, search_error newline + length caps, negative tab_index rejection, tab_index upper-bound drop at runtime, non-tab actions untouched, rejection logging carries cid + reason, filter idempotence. All fail on main (validators don't exist; filter doesn't exist).
Verification.
- F10 suite: 17 passed (0.89s).
- Regression check: 76 LLM-engine/planner/injection tests passed.
Compatibility. Breaking on the schema: any historical plan with a javascript:/data:/etc. URL fails Pydantic parse. Grep of storage/sessions/*.json (none in repo) confirms no existing session contains such payloads. For deployed installs, banned actions on replay would surface as parse warnings, not crashes.
Rollback. git revert is clean. Validators are additive; the filter call is a single line in enrich_plan_with_context.
F03 (background task tracking). The fix itself is straightforward — a tracked set + _spawn_background_task helper + cancellation in stop(). What I am least confident about is whether the test coverage is sufficient. The test runs against a _StubDaemon that mirrors the helper because booting the full CortexDaemon requires a real camera, real store backends, and other dependencies. The stub exercises the contract precisely, but it cannot catch the case where a future call site in the daemon adds another bare asyncio.create_task(...) instead of using _spawn_background_task. A pytest --collect-only style lint or a one-line grep CI check (! grep -rn "asyncio.create_task" cortex/services/runtime_daemon.py | grep -v "_spawn_background_task") would close that gap. Filing as F03b but not opening a Ledger row this session because the existing failure mode (orphan tasks) is closed at the call site that was flagged; the residual risk is regression-only, not active.
Date. 2026-05-19. Posture. Senior UI/UX engineer auditing Wave 1's desktop_shell + browser_extension changes for visual consistency and macOS-native feel. One fix per commit; tests where feasible.
Commits shipped.
| SHA | Subject |
|---|---|
| 9c7c32b | promote warm label tints to tokens, lift sub-AA tertiary |
| d661a38 | route timeline font-family through FONT_MONO token |
| 84a58f4 | regression-guard for native window chrome coverage |
| c90d382 | route remaining raw-int spacing through tokens |
| bdff047 | accessible names + tab order on settings, connections, onboarding |
| 4bd1687 | route popup toggle radius + transitions through tokens |
1. Token source-of-truth. GAP FOUND + CLOSED.
connections.py, settings.py, onboarding.py each carried private
_LABEL_SECONDARY = "#5C5854" and _LABEL_TERTIARY = "#827971" copies.
The tertiary value fails WCAG AA on the cream background (3.98:1 against
#FFFFFF) — F55 fixed this in dashboard.py (raised to #6B6661,
~5.4:1) but the other three surfaces silently drifted. Wave-2 commit
9c7c32b promotes both tints into the token registry (tokens.yaml
emitter + generated tokens.py + browser-extension design-tokens.ts),
pins CX_TEXT_TERTIARY = "#6B6661", and switches every consumer to
from cortex.apps.desktop_shell.tokens import CX_TEXT_TERTIARY.
Regression test test_token_label_consistency.py (9 cases) pins the
registry value and asserts no surface carries the legacy literal.
dashboard.py:898 retains #B25430 for the degraded badge; it's a
deliberate WCAG-AA-verified deep terracotta scoped to that single
banner and not a candidate for promotion.
2. Typography. ESSENTIALLY CLEAN — ONE PROMOTION.
Every setFont(...) call across the five panels routes through
mac_native.system_font(FS_*, weight). No "Arial" / "Helvetica" /
QFont("Times") literals anywhere in desktop_shell. The brand
Cormorant headings consistently use BRAND_DISPLAY_FONT. Commit
d661a38 promotes one font-family: "SF Mono", ui-monospace, ...
literal in dashboard.py:1005 (timeline panel) to the FONT_MONO
token — the literal happened to match verbatim but would have drifted
on a future stack edit.
3. Window chrome. ALREADY CONSISTENT — REGRESSION GUARD ADDED.
All five top-level windows (DashboardWindow, SettingsDialog,
OnboardingWindow, OverlayWindow, ConnectionsPanel) already invoke
apply_unified_titlebar + apply_vibrancy in showEvent. Commit
84a58f4 adds test_window_chrome_coverage.py — 10 parameterised
ast-based cases that pin every top-level window to its required
mac_native calls. A future window class that forgets to apply native
chrome will fail CI rather than inherit Qt's default opaque titlebar
silently.
4. Spacing rhythm. TWO LITERALS PROMOTED.
Most layout setSpacing/setContentsMargins calls already consume the
SP1-SP10 token scale. Commit c90d382 promotes two outliers:
- overlay.py:282 setContentsMargins(24, 24, 24, 24) → (SP6, SP6, SP6, SP6).
- onboarding.py:236 setSpacing(8) → setSpacing(SP2).

The remaining raw integers in dashboard.py (3px inner pill padding, 2px inter-column gaps, 10/3/12/3 badge tracker margins) are intentional sub-4pt fine tuning below the token granularity; they're not candidates for promotion without inventing new sub-grid tokens.
5. Accessibility coverage. GAP FOUND + CLOSED.
F55 wired accessibility on the dashboard + overlay; the three other
panels were untouched. VoiceOver would announce every control as
"button" / "checkbox" / "slider" without semantic context, and the
focus ring escaped the window unpredictably. Commit bdff047:
- Extracts the defensive set_accessible_name / setTabOrder helpers into cortex/apps/desktop_shell/a11y.py so every panel imports them once.
- Wires 16 controls in settings.py (back, 6 checkboxes, slider, 2 spinboxes, combo, 4 debug checkboxes, close, apply) into a 15-step tab chain.
- Wires connections.py's back button + every Connect button into a collected _tab_order_chain and chains it.
- Wires onboarding.py's BYOK token input, region combo, save button, Open Connections, Get Started, and per-step Grant buttons + status pills.

Regression test test_a11y_coverage.py (3 cases) instantiates each panel offscreen and asserts the accessible names are present. tray.py deliberately untouched: on macOS the tray uses the native NSStatusItem wrapper (mac_native.StatusBarItem) which is announced by VoiceOver via the system menu-bar role; QAction-based accessibility doesn't apply on the mac path.
6. Browser-extension native feel. ESSENTIALLY CLEAN — TWO
PROMOTIONS. Every fontFamily already routes through CX.font /
CX.fontSerif / CX.fontBrand / CX.mono. The macOS-system stack is
-apple-system, BlinkMacSystemFont, ... (the correct native chain).
Focus rings are explicit outline: 2px solid CX.accent with
outline-offset: 2px — readable + brand-preserving. Commit 4bd1687
promotes two outliers in popup.tsx:
- toggleTrack.borderRadius: 12 → CX.radiusFull (still clamps to half-height for the pill shape).
- toggleThumb.background: "#fff" → CX.textInverse.

Three transition: "... 0.2s ease" literals → CX.durationNormal + CX.easeDefault. All 31 vitest specs stay green.
7. Loading / empty / error states. ALREADY DISTINGUISHED.
F18 added the degraded banner; F54 added the four connectivity states
(not_installed, installed_no_daemon, installed_version_mismatch,
handshake_failed) each with its own title + body + CTA. F40+F54
tests cover all four states. The morning briefing card (popup.tsx)
renders only when briefing !== null — no loading skeleton, no
explicit error state. This is intentional: the briefing is push-based
from the daemon, so absence = no briefing yet (= correct silent state).
The activity-tracker resume cards (newtab.tsx) similarly render only
when activities.length > 0. The dashboard timeline panel already has
the "No events yet" empty state. Residual: no explicit loading
skeleton on the briefing card or activity preview — listed in the
residual-risk section below.
8. Motion / micro-interactions. REVIEWED, NO CHANGES.
The audit prompt explicitly said "be conservative — don't add motion
to functional elements like Apply Settings". The two remaining
setVisible(True) call sites are:
- connections.py:238 translocation warning: critical functional info, not a candidate for delight motion.
- onboarding.py:717 Grant button visibility flip: functional state change, not delight.

Existing motion is already calibrated (overlay alert cxAlertIn, heartbeat cxPulse, breathing pacer, activity-card fade-in, focus-ring transitions). Adding 150ms fades to the degraded badge or fallback hint would draw attention to error states, which is counter-productive. Listed as residual for a future targeted polish pass.
| Surface | Verdict |
|---|---|
| cortex/apps/desktop_shell/dashboard.py | Already F47/F55/F31; one font-stack literal promoted. |
| cortex/apps/desktop_shell/overlay.py | F47/F55/F06 closed; one margin literal promoted. |
| cortex/apps/desktop_shell/onboarding.py | A11y added; spacing literal promoted; tertiary tint pulled from tokens. |
| cortex/apps/desktop_shell/settings.py | A11y added (16 controls + 15-step tab chain); tertiary tint pulled from tokens. |
| cortex/apps/desktop_shell/connections.py | A11y added; tertiary tint pulled from tokens. |
| cortex/apps/desktop_shell/tray.py | macOS native NSStatusItem; no Qt a11y needed. |
| cortex/apps/desktop_shell/mac_native.py | Single point of contact for native chrome; clean. |
| cortex/apps/desktop_shell/tokens.py | Auto-generated; emitter updated for AA tertiary tint. |
| cortex/apps/browser_extension/popup.tsx | Toggle radius + transitions promoted to tokens. |
| cortex/apps/browser_extension/newtab.tsx | Already on tokens; activity card aesthetics preserved. |
| cortex/apps/browser_extension/design-tokens.ts | Auto-generated; tertiary tint synced. |
- QT_QPA_PLATFORM=offscreen pytest cortex/tests/unit/test_overlay_dismiss.py cortex/tests/unit/test_dashboard_stop.py cortex/tests/unit/test_overlay_tokens.py cortex/tests/unit/test_token_label_consistency.py cortex/tests/unit/test_window_chrome_coverage.py cortex/tests/unit/test_a11y_coverage.py -q → 35 passed.
- QT_QPA_PLATFORM=offscreen pytest cortex/tests/unit/ --ignore=cortex/tests/unit/test_desktop_shell.py -q → 1150 passed.
- cd cortex/apps/browser_extension && npx vitest run → 31 passed across 10 spec files.
test_desktop_shell.py installs lightweight PySide6 mocks that bleed
into modules imported after it inside the same pytest session — a
pre-existing harness quirk documented in test_overlay_tokens.py
(every dependent test file unloads stale PySide6 mocks at module
top). The legacy mock suite has one pre-existing failure
(TestOverlayWindow::test_show_intervention — MockQLabel.clear
missing) that is unrelated to this wave's changes; verified by
running test_desktop_shell.py isolated against HEAD~6 and
observing the same failure.
- No loading skeleton on morning-briefing / activity-tracker preview. Both are push-based from the daemon, so absence = no data yet. A future commit could add a 600ms shimmer skeleton if user testing reveals confusion about "is Cortex thinking, or did it fail?" Scoped out of Wave-2 because it's net-new UX, not reconciliation.
- No fade-in on the dashboard degraded badge / overlay fallback hint. Both are functional notifications; per the audit's "be conservative" rule, no motion added. If user testing shows the abrupt appearance is jarring, a 150ms `QPropertyAnimation` on `windowOpacity` is the targeted fix.
- `test_desktop_shell.py` mock-pollution. Pre-existing. Affects only intra-session ordering; every dependent test file already defends with the stale-PySide6 unload pattern at module top. Cleanest fix is to migrate the legacy mock suite to real PySide6 under `QT_QPA_PLATFORM=offscreen`, which is out of scope for visual reconciliation.

The Wave-1 tactical token gates (F07 SHUTDOWN, F08 launcher /stop) defended only two destructive endpoints. Every other HTTP route on the API gateway and every other WebSocket message type still trusted "comes from localhost" as proof of legitimacy — the implicit trust model named in audit/findings.md Debt-2. A hostile webpage in another browser tab could connect to ws://127.0.0.1:9473 and watch the daemon's STATE_UPDATE stream without ever owning the token; the same page could POST /state/infer and burn the daemon's numpy allocators. The systemic close-out flips the default: every connection presents the capability token; servers reject anything else.
Five atomic commits in the locked order from implement-all-that-s-helpful-mellow-hammock.md §4 Phase H:
| Commit | SHA | Headline | Files |
|---|---|---|---|
| 1 | 0fe609a | Server-side capability-token gate on every HTTP route | 7 |
| 2 | 78d9d57 | WebSocket AUTH-first handshake | 8 |
| 3 | f16a46a | desktop_shell WS client AUTHs before IDENTIFY | 2 |
| 4 | 5eaef88 | extension WS sends AUTH before IDENTIFY | 2 |
| 5 | 9066df1 | capability-token rotation UI in Settings | 5 |

Server side. cortex/services/api_gateway/auth.py (new) exports require_capability_token (a FastAPI dependency that raises 401 on miss/mismatch and emits EventType.AUTH_REJECTED) and optional_capability_token (used nowhere yet but reserved for /health extensions). routes.py is split into router (the default — gates every mutating endpoint via Depends(require_capability_token)) and health_router (the supervisor liveness probe; no auth, mounted separately). Adding a new mutating route on router automatically inherits the gate; adding a route to health_router is auditable in code review. The 401 response carries WWW-Authenticate: Bearer per RFC 7235.
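
The gate reduces to a header lookup plus a constant-time comparison. The following is a minimal, framework-free sketch, not the contents of `auth.py`: it assumes the token arrives either as `Authorization: Bearer <token>` or as `X-Cortex-Auth-Token`, and `AuthRejected`, `extract_token`, and `require_token_sketch` are hypothetical names introduced only for illustration.

```python
import hmac
from typing import Mapping, Optional


class AuthRejected(Exception):
    """Hypothetical stand-in for the 401 raised by the real dependency."""

    status_code = 401
    headers = {"WWW-Authenticate": "Bearer"}


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the presented token, or None if no usable header is present."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}
    scheme, _, value = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()
    return lowered.get("x-cortex-auth-token") or None


def require_token_sketch(headers: Mapping[str, str], expected: str) -> None:
    """Raise AuthRejected unless the request carries the expected token."""
    presented = extract_token(headers)
    # compare_digest keeps the comparison time independent of where the
    # first mismatching byte sits.
    if presented is None or not hmac.compare_digest(
        presented.encode("utf-8"), expected.encode("utf-8")
    ):
        raise AuthRejected("missing or mismatched capability token")
```

In a FastAPI app the same check would sit inside a dependency function and raise an HTTP 401 instead of a custom exception; the comparison logic is unchanged.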
WebSocket side. WebSocketServer._dispatch_message short-circuits to _handle_auth on the AUTH message type; until the client's authenticated flag flips True, every other type triggers close(code=1011, reason="auth required") + EventType.AUTH_REJECTED. _handle_auth validates via cortex.libs.auth.verify_token, replies with an AUTH_OK frame, and replays the latest cached STATE_UPDATE so the legacy "new connection sees current state on attach" UX is preserved. _broadcast skips unauthenticated peers so a connect-and-listen origin cannot harvest the state stream. MessageType.AUTH and MessageType.AUTH_OK were added to cortex/libs/schemas/ws_message_types.py; the Phase G codegen regenerated cortex_schemas.d.ts automatically.
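
The AUTH-first rule can be pictured as a small per-peer state machine. The sketch below is an illustration under stated assumptions, not the code in `websocket_server.py`: `Peer`, `dispatch_sketch`, and `broadcast_sketch` are hypothetical names, the `{"token": ...}` payload shape is assumed, and closing the peer on a bad token is an assumption (the text above only specifies the close for non-AUTH frames).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Peer:
    """Hypothetical in-memory stand-in for one WebSocket connection."""

    authenticated: bool = False
    closed: Optional[Tuple[int, str]] = None
    outbox: List[dict] = field(default_factory=list)


def dispatch_sketch(
    peer: Peer,
    message: dict,
    verify: Callable[[str], bool],
    latest_state: Optional[dict],
) -> str:
    """Gate one inbound frame; returns 'authenticated', 'closed' or 'dispatch'."""
    if message.get("type") == "AUTH":
        token = message.get("payload", {}).get("token", "")  # assumed shape
        if verify(token):
            peer.authenticated = True
            peer.outbox.append({"type": "AUTH_OK"})
            if latest_state is not None:
                # Replay the cached state so a fresh peer sees current data.
                peer.outbox.append(latest_state)
            return "authenticated"
        peer.closed = (1011, "auth required")  # assumption: bad token closes
        return "closed"
    if not peer.authenticated:
        # Any non-AUTH frame before authentication closes the connection.
        peer.closed = (1011, "auth required")
        return "closed"
    return "dispatch"  # hand off to the normal per-type handlers


def broadcast_sketch(peers: List[Peer], frame: dict) -> None:
    """Deliver a frame only to authenticated, still-open peers."""
    for peer in peers:
        if peer.authenticated and peer.closed is None:
            peer.outbox.append(frame)
```

Keeping the broadcast filter separate from the dispatch gate means a peer that connects and never sends anything still receives nothing.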
Client side. cortex/apps/desktop_shell/main.py::WebSocketBridge reads the token at startup via load_or_create_token, sends AUTH as the first frame on every connect (ahead of IDENTIFY), and exposes refresh_auth_token for the rotation path. cortex/apps/browser_extension/background.ts::connect() fires getAuthToken().then(send AUTH then IDENTIFY) inside onopen; the existing Wave-1 X-Cortex-Auth-Token header on /shutdown and /stop fetches continues to satisfy the systemic HTTP gate. A new case "AUTH_OK" no-op landed in handleMessage so the daemon's ACK is recognised rather than logged as an unknown type.
Rotation. cortex/libs/auth/local_token.py::rotate_token writes a fresh secrets.token_hex(32) to the same path auth_token_path() returns, atomically (.tmp sibling chmod 0600, then os.replace). The Settings panel's new "Security" section exposes this as a "Rotate authentication token" button with an inline status label. The button briefly disables itself after a click to absorb a double-click. EventType.AUTH_TOKEN_ROTATED lands in the structured log so a support engineer can correlate "the user just rotated" with "all WS clients suddenly reconnected."
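
The write-then-rename pattern described above can be sketched with the standard library alone. This is an illustration, not the body of `rotate_token`: `rotate_token_sketch` is a hypothetical name, and the explicit `fsync` is an added assumption that the text does not mention.

```python
import os
import secrets
from pathlib import Path


def rotate_token_sketch(path: Path) -> str:
    """Write a fresh 64-hex-char token to `path` atomically; return it."""
    token = secrets.token_hex(32)
    tmp = path.with_name(path.name + ".tmp")  # sibling on the same filesystem
    # Create the temp file owner-only from the start so the token is never
    # readable by other users, even briefly.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="ascii") as handle:
        handle.write(token)
        handle.flush()
        os.fsync(handle.fileno())  # assumption: flush to disk before rename
    os.chmod(tmp, 0o600)  # enforce the mode even if a stale .tmp pre-existed
    os.replace(tmp, path)  # atomic swap: readers see old or new, never partial
    return token
```

Because `os.replace` swaps the directory entry in one step, a client that reads the file during rotation gets either the old token or the new one, which is what lets the reconnect loop converge after a single retry.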
Defense in depth (Commit 6 deliberately omitted). The Wave-1 F07 inline verify_token call on the WebSocket SHUTDOWN handler stays. The Wave-1 F08 launcher /stop token gate stays (the launcher binary has a zero-cortex-imports invariant so we cannot apply the systemic FastAPI dependency to it). Both are now redundant given the systemic gate, but they remain as cheap belt-and-braces: a future regression in _dispatch_message that accidentally lets a non-AUTH message through still cannot fire SHUTDOWN, and the launcher's /stop is still gated even when the daemon API gateway is unreachable. The trade is ~20 lines of duplicated check for a much harder-to-bypass invariant — worth it.
Migration path. Existing installs (Wave-1 already shipped the token file via daemon startup load_or_create_token) get the systemic gate for free — the token file is already on disk. Fresh installs mint the file on first daemon start. The browser extension's existing getAuthToken() cache (chrome.storage.session) survives across service-worker restarts; the desktop_shell reads the file fresh on every launch via WebSocketBridge.__init__. No coordinated rollout needed because the legacy WS handshake (IDENTIFY first, no AUTH) was never broadcast to a wire — the daemon's gate simply closes any peer that sends IDENTIFY first, and the reconnect loop on both clients retries with the now-cached token in the AUTH frame.
Threat model recap. Closes cross-origin localhost — a hostile webpage in another browser tab that can speak the HTTP / WS protocols cannot read the mode-0600 token file, so it cannot present the token, so every request it makes returns 401 / 1011. Explicitly does NOT close malware-as-the-user (a compromised account on the same Mac can read any user-readable file the daemon can read); that threat is named in audit/findings.md and is out of scope for this debt.
# New auth tests:
pytest cortex/tests/unit/test_systemic_auth_http.py \
cortex/tests/integration/test_systemic_auth_ws.py \
cortex/tests/unit/test_desktop_controller_auth.py \
cortex/tests/unit/test_token_rotation.py -q # 17 passed
# Full Python suite (excl. legacy desktop_shell mock-pollution suite):
pytest cortex/tests/ -q --ignore=cortex/tests/unit/test_desktop_shell.py
# 1307 passed, 3 skipped
# TS suite (browser extension):
cd cortex/apps/browser_extension
./node_modules/.bin/vitest run # 33 passed
# Schema codegen still in sync after AUTH/AUTH_OK addition:
CORTEX_JSON2TS_CMD=$(which json2ts) python -m cortex.scripts.generate_ts_schemas --check
# exit 0
# Manual adversarial test (closes within 2s without AUTH):
python -c "
import asyncio, websockets
async def go():
    async with websockets.connect('ws://127.0.0.1:9473') as ws:
        await ws.send('{\"type\":\"STATE_UPDATE\",\"payload\":{},\"timestamp\":0,\"sequence\":0}')
        print(await asyncio.wait_for(ws.recv(), 2))
asyncio.run(go())
"
# websockets.exceptions.ConnectionClosedError: ... [code=1011 reason=auth required]

Date. 2026-05-19. Posture. Senior coordination engineer verifying no residual frontend ↔ backend contract drift survived the Debt-1 schema codegen + Debt-2 systemic-auth waves.
Commits shipped.
| SHA | Subject |
|---|---|
| a7bcf70 | thread F18 degraded/source through STATE_UPDATE WS payload |
| e8bac22 | surface unhandled-but-known WS frames in extension |

1. HTTP routes ↔ extension fetches. VERIFIED CLEAN.
Three fetch sites in the entire extension surface
(background.ts:2860 /shutdown, background.ts:2880 /stop,
background.ts:2947 /launch). All three carry the
X-Cortex-Auth-Token header when the cached capability token is
available; /launch is intentionally unauthenticated because the
launcher boots the daemon before the daemon's keychain is loaded
(launcher_agent.py exposes it on its own port 9471, separate trust
boundary). Verb / path / body / response field reads match the FastAPI
routes one-for-one.
2. WS client → server. VERIFIED CLEAN.
12 distinct send() types in background.ts (AUTH,
IDENTIFY, USER_ACTION, ACTION_EXECUTE, USER_RATING,
CONTEXT_RESPONSE, SETTINGS_SYNC, ACTIVITY_SYNC,
TAB_RELEVANCE_FEEDBACK, LEETCODE_CONTEXT_UPDATE,
INTERVENTION_APPLIED, SHUTDOWN); WebSocketServer._dispatch_message
has explicit dispatch arms for all 12. The AUTH frame goes first
per Debt-2.
3. WS server → extension broadcast. GAP FOUND + CLOSED.
The MessageType enum lists 15 LEETCODE_* cues; background.ts
explicitly handles 6 (the 5 actions the live InterventionMatrix
emits plus SHOW_CONSOLIDATION). The other 9
(LEETCODE_LOCK_EDITOR, LEETCODE_INTERCEPT_SUBMIT,
LEETCODE_GATE_SOLUTIONS, LEETCODE_SHOW_SESSION_BRIEFING,
LEETCODE_AI_RESTATEMENT_CHECK / LEETCODE_AI_COMPREHENSION_CHECK /
LEETCODE_AI_HYPOTHESIS_CHECK / LEETCODE_AI_STUCK_ANALYSIS /
LEETCODE_AI_SESSION_BRIEFING) are catalogue-only — the LeetCodeAdapter
advertises them but no runtime selector calls
_leetcode_adapter.execute(<capability>, ...) for them today. Commit
e8bac22 adds a defensive `default:` arm in the message switch so a
future regression where the daemon adds a new emitter (or the matrix
grows to cover the AI checks) is visible in DEBUG logs instead of
silently swallowed. COPILOT_THROTTLE is targeted at vscode clients
only (target_client_types=["vscode"]) and never reaches the chrome
peer, so the chrome extension's lack of a handler is correct.
4. Generated types coverage. VERIFIED CLEAN.
cortex_schemas.d.ts includes WSMessage, MessageType (the
enum-string union of all 37 wire types), StateEstimate,
InterventionPlan, SuggestedAction, TaskContext,
InterventionApplyResult, plus LeetCodeContext /
LeetCodeModeEstimate. StateInferResponse is defined in
routes.py not in cortex/libs/schemas/ so the codegen walk does
not pick it up; verified no extension client consumes the
/state/infer envelope (the dashboard reads degraded /
source off the WS STATE_UPDATE stream — see category 5).
5. F18 degraded envelope surfaced. GAP FOUND + CLOSED.
The F18 fix added source / degraded to StateInferResponse;
the dashboard advanced tab reads both off the payload dict to toggle the
"classifier unavailable" banner. But the dashboard is fed by the WS
STATE_UPDATE broadcast, not by /state/infer —
WebSocketServer._make_state_update never stamped the two fields, so
the banner could not fire through the WS path. F18 was end-to-end
silently broken. Commit a7bcf70 mirrors the envelope fields onto
every STATE_UPDATE frame (degraded = estimate.classifier_source is None; source = "fallback" if degraded else "classifier") and
fixes a brittle dashboard fallback test that conflated the envelope
source literal (classifier/fallback) with the debug-overlay
classifier_source field (rule/ml/ensemble) — on a healthy
classifier_source="rule" payload the banner would have flipped True
and stuck visible.
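
The envelope rule in category 5 is small enough to state as code. The sketch below is illustrative only: it models the estimate as a plain dict rather than the real `StateEstimate` object, and `envelope_fields`, `make_state_update_sketch`, and `banner_visible` are hypothetical names.

```python
from typing import Any, Dict, Optional


def envelope_fields(classifier_source: Optional[str]) -> Dict[str, Any]:
    """Derive the envelope pair from the debug-overlay classifier_source."""
    degraded = classifier_source is None
    return {"degraded": degraded, "source": "fallback" if degraded else "classifier"}


def make_state_update_sketch(estimate: Dict[str, Any]) -> Dict[str, Any]:
    """Build a STATE_UPDATE frame with the envelope fields mirrored on."""
    payload = dict(estimate)
    payload.update(envelope_fields(estimate.get("classifier_source")))
    return {"type": "STATE_UPDATE", "payload": payload}


def banner_visible(payload: Dict[str, Any]) -> bool:
    """Read the envelope flag, never the rule/ml/ensemble debug field."""
    return bool(payload.get("degraded", False))
```

Keeping `source` (classifier/fallback) and `classifier_source` (rule/ml/ensemble) as separate keys is what prevents a healthy `classifier_source="rule"` payload from being mistaken for a degraded one.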
6. F20 cost telemetry surfaced. PARTIALLY VERIFIED.
CostTracker.record emits EventType.LLM_COST per call and
EventType.LLM_BUDGET_KILL when the daily budget trips. The kill
path stamps plan.metadata["budget_killed"] = True and the overlay
(cortex/apps/desktop_shell/overlay.py:510) surfaces a per-intervention
"Cortex offline mode — daily AI budget reached" hint. The persistent
dashboard banner contemplated in Phase A is NOT implemented — only the
per-intervention overlay hint. Filed as residual (net-new UX, not
contract drift).
7. F10 action-rejection telemetry. VERIFIED CLEAN.
filter_unsafe_actions emits EventType.INTERVENTION_ACTION_REJECTED
per drop with the bound cid; by design the rejection is log-only — the
user never sees a banned action, so there is no UI to suppress. The plan
notes this explicitly.
8. Per-route auth dependency coverage. VERIFIED CLEAN.
cortex/services/api_gateway/app.py:183 mounts the gated router via
app.include_router(router, dependencies=[Depends(require_capability_token)]);
app.include_router(health_router) at line 182 mounts only /health
without auth. Every mutating endpoint in routes.py lives on
router and inherits the gate; /health is the only endpoint
reachable without the capability token, which is the documented design.
| Surface | Verdict |
|---|---|
| `cortex/services/api_gateway/routes.py` | Auth coverage clean; F18 envelope set on HTTP. |
| `cortex/services/api_gateway/auth.py` | Two FastAPI deps; only health route uses optional_capability_token. |
| `cortex/services/api_gateway/app.py` | Single include_router(dependencies=[…]) wire; adding new route inherits the gate. |
| `cortex/services/api_gateway/websocket_server.py` | AUTH-first dispatch; broadcast skips unauth peers; STATE_UPDATE now stamps F18 envelope fields. |
| `cortex/apps/browser_extension/background.ts` | All 3 fetches gated; 12 send-types match dispatch; unhandled-frame default arm added. |
| `cortex/apps/desktop_shell/dashboard.py` | F18 banner reader hardened against the classifier_source / source conflation. |
| `cortex/libs/schemas/ws_message_types.py` | 37 enum members; catalogue surfaces the 9 still-unwired LEETCODE_* cues so the schema gate notices future drift. |
| `cortex/apps/browser_extension/types/generated/cortex_schemas.d.ts` | 2050 lines; codegen still in sync. |

# F18 WS plumbing + envelope contract:
pytest cortex/tests/unit/test_ws_state_update_degraded.py \
cortex/tests/unit/test_state_infer_envelope.py -q
# 7 passed in 0.53s
# Defensive default arm in extension switch:
cd cortex/apps/browser_extension
./node_modules/.bin/vitest run __tests__/audit_w2_unhandled_ws_frame.spec.ts
# 2 passed
# Schema codegen still in sync (no Python schema changes in this wave):
CORTEX_JSON2TS_CMD=$(which json2ts) \
python -m cortex.scripts.generate_ts_schemas --check
# exit 0 (no diff)

- F20 persistent dashboard banner. The Phase A plan called for a dashboard-level banner on `LLM_BUDGET_KILL` in addition to the per-intervention overlay hint. The hint is wired; the banner is not. This is net-new UX (not contract drift) and is scoped out of the Wave-2 sweep.
- 9 catalogue-only `LEETCODE_*` types. The schema lists them and the `LeetCodeAdapter` exposes the capabilities, but no live `InterventionMatrix` selector emits them. The default-arm log line is the visibility hatch for when a future fix wires them up; no handler implementations land here because there is no caller to regression-test against.

Session 1 + Session 2 cumulative, mapped to commit cohort:
| Tier | IDs | Cohort |
|---|---|---|
| Data-loss | F01, F02, F03, F36, F53 | Wave 1-E/F + session 1 |
| Security | F07, F07b, F08, F08b, F09, F10, F11, F12, F13, F14+F37 | Wave 1-A + session 1 + Phase H |
| Correctness | F04, F05, F06, F15, F16, F16-srv, F18, F19, F19b, F20, F21, F22, F23, F24, F26, F27, F28, F29, F30, F34, F38+F39, F40, F42, F43, F44, F45 | Waves 1-B/C/D/E/G + Wave 2-A/B + Phase G |
| Cost | (folded into F20/F25-partial/F30) | Wave 1-C + Wave 2-B |
| Maintainability | F31, F32, F33, F35, F46, F47, F48, F49, F50, F51, F52, F54, F55, F56 | Wave 1-F + Wave 2-C |
- Debt-1 (shared schema codegen). `pydantic-to-typescript` generator + drift gate + extension migration. Closes F42/F43/F44/F45 structurally; future Pydantic schema edits regenerate `cortex_schemas.d.ts` automatically; CI rejects out-of-sync commits.
- Debt-2 (capability-token client bootstrap). Every HTTP route now requires `Authorization: Bearer <token>` or `X-Cortex-Auth-Token`; every WebSocket connection requires `AUTH` as its first frame. F07/F08's tactical single-endpoint gates retained as defense-in-depth. Token rotation UI in Settings.

- Phase I (performance). Capture-loop mediapipe sub-sampling + colour-convert cache, parallel WS broadcast with 100 ms budget, lazy mediapipe + keyring imports (sub-2s warm-cache startup), content-script-only leetcode observer. Bundle ~175 KB (under 250 KB target). 25 new perf tests.
- Phase J (UX polish). Onboarding "Why?" expanders + Continuity-camera callout, error toast with selectable correlation-id, biometrics empty states, overlay scale-in + fade-in micro-interactions (Reduce-Motion honoured), accessibility sweep + CHANGELOG. 26 new UX tests.
- F17: State-update sequence-number check on receivers. Sender side already increments `WSMessage.sequence`. The receiver-side drop-stale logic is partially provided by F16/F16-srv's correlation-id swap on intervention frames but is NOT generalised to `STATE_UPDATE`. Deferred because the practical impact is bounded (broadcast cadence is 2 Hz, so reorder windows are too narrow to matter in real networks) and the cleanest fix is bundled into the schema-versioned `WSMessage` migration that lands as part of a future protocol revision.
- F25: Cooldown/dwell oscillation direct fix. Cost-runaway aspect closed by F20 (budget kill-switch) and W2-B's "re-consult cost kill switch between LLM retry attempts". Quality-of-experience aspect (intervention spam under jitter) closed by F26 (quiet-mode persistence) and F27 (fallback transparency). The underlying race between `trigger_policy.evaluate` and `cooldown_seconds` survives but is no longer expensive. Filed as a hysteresis-tuning follow-up; ML eval should drive the tuning, not code.
- F41: Eval harness in CI. Phase G's CI workflow (`.github/workflows/ci.yml`) added `python (pytest+ruff+mypy)`, `extension (vitest)`, and `schema-codegen-check`. The eval harness in `cortex/services/eval/` exists and runs locally; wiring it to CI with a regression threshold is the next session's lift. Deferred because the baseline pass-rate hasn't been captured yet and CI needs a stable threshold.

- F07b: Native-host mediated auth-token fetch (closed in Wave 1-G).
- F08b: Extension `X-Cortex-Auth-Token` on `/stop` and `/shutdown` (closed in Wave 1-G).
- F16-srv: Daemon refuses stale USER_ACTION cid (closed in Wave 1-G).
- F19b: Correlation IDs in browser extension (closed in Wave 1-G).
- F19a / F03b regression guards: Lint guards filed but not committed (residual; would block future bare `asyncio.create_task` calls).

- F20 persistent dashboard banner. `LLM_BUDGET_KILL` event is emitted and the per-intervention overlay hint flags `metadata.budget_killed`, but a dedicated dashboard banner is not wired. Per-intervention hint is sufficient for the audit ship; banner is a UX deepening.
- 9 catalogue-only `LEETCODE_*` types. The schema lists them; no caller emits them. Default-arm log line in the extension is the visibility hatch.
- `SessionReport` aggregate rollup of `intervention_apply_confirmation`. Per-session JSONL has the data; the aggregated `SessionReport` does not roll it up into a "X of Y interventions confirmed" surface. UI/UX-tier follow-up, not contract drift.
- 3 Qt overlay tests (`test_circuit_breaker_surfacing`, `test_context_truncation`, `test_desktop_shell::test_show_intervention`) fail when collected alongside other Qt tests due to pre-existing PySide6 mock pollution. Each passes in isolation. Pre-existing test-infra issue, not a regression.
- Pre-existing test-pollution suite in `test_redis_store.py`, `test_helpfulness.py`, `test_focus_graph.py`, etc.: 26+ tests fail when run alongside the full suite, pass in isolation. Pre-existing fixture leakage (likely `registry.reset` + `fakeredis` import order + stubbed PySide6 modules); orthogonal to the audit work.
- 4 P2/P3 a11y items (documented in `CHANGELOG.md` "Known limitations"): VoiceOver rotor on Cormorant numerics, high-contrast palette tier, live-region announcements on state transitions, Reduce Motion gating on HR-trace plot + breathing pacer + focus-ring transitions.

- Trigger-policy hysteresis under real biometric jitter (F25-residual). Cost runaway is bounded by F20's budget kill-switch ($20/day default per user). Quality is bounded by F26's quiet-mode escalation memory and F27's fallback transparency. The next escalation is data-driven: ship a /eval baseline and tune the cooldown/dwell pair. Monitor: `cortex_state_loop_interventions_per_hour` should stay under 10 under nominal load and never exceed 30 with the budget kill armed.
- Schema-codegen drift through model edits that bypass the Pydantic source. Debt-1 closure depends on every TS-visible field originating in `cortex/libs/schemas/`. A future contributor who adds an `Any` field or hand-edits `cortex_schemas.d.ts` will trip the CI gate, but only if the gate is required-for-merge. Monitor: CI job `schema-codegen-check` must be marked Required on the GitHub repo.
- Capability-token rotation collision with in-flight WS sessions. Debt-2 rotation kills existing connections, forcing reconnect. Browsers cache the old token in `chrome.storage.session`; the cache is invalidated on next `get_auth_token` call. There is a window of seconds where the old token is rejected but the new one isn't fetched yet; the extension's auto-reconnect handles it but logs an AUTH_REJECTED. Monitor: spike in `AUTH_REJECTED` events lasting longer than 30 seconds = rotation went wrong.

source .venv/bin/activate
# All Session 2 audit-specific tests:
pytest cortex/tests/unit/test_action_allowlist.py \
cortex/tests/unit/test_cost_tracker.py \
cortex/tests/unit/test_anthropic_planner_cancellation.py \
cortex/tests/unit/test_anthropic_planner_budget_retry.py \
cortex/tests/unit/test_circuit_breaker_surfacing.py \
cortex/tests/unit/test_cache_template_version.py \
cortex/tests/unit/test_context_truncation.py \
cortex/tests/unit/test_state_infer_envelope.py \
cortex/tests/unit/test_rate_limit.py \
cortex/tests/unit/test_atomic_write.py \
cortex/tests/unit/test_background_task_tracking.py \
cortex/tests/unit/test_dismissal_model_persistence.py \
cortex/tests/unit/test_quiet_mode_persistence.py \
cortex/tests/unit/test_consent_ladder_race.py \
cortex/tests/unit/test_prompt_injection_defence.py \
cortex/tests/unit/test_prompt_injection_wrapper_tags.py \
cortex/tests/unit/test_quiet_mode_history_age.py \
cortex/tests/unit/test_architecture_md_alignment.py \
cortex/tests/unit/test_schema_codegen.py \
cortex/tests/unit/test_ws_message_schema.py \
cortex/tests/unit/test_api_gateway.py \
cortex/tests/unit/test_anthropic_planner.py \
cortex/tests/unit/test_systemic_auth_http.py \
cortex/tests/unit/test_token_rotation.py \
cortex/tests/unit/test_ws_user_action_cid.py \
cortex/tests/unit/test_ws_state_update_degraded.py \
cortex/tests/unit/test_ws_slow_client.py \
cortex/tests/unit/test_pending_context_cleanup.py \
cortex/tests/unit/test_launcher_auth.py \
cortex/tests/unit/test_native_host_auth.py \
cortex/tests/unit/test_launcher_allowlist.py \
cortex/tests/unit/test_native_messaging_schema.py \
cortex/tests/unit/test_seed_config_dead_envs.py \
cortex/tests/unit/test_auth_local_token.py \
cortex/tests/unit/test_capture_stop_timeout.py \
cortex/tests/unit/test_bedrock_token_containment.py \
cortex/tests/integration/test_correlation_ids.py \
cortex/tests/integration/test_systemic_auth_ws.py \
cortex/tests/integration/test_apply_intervention_confirmation.py \
cortex/tests/performance/ -q
# Phase J UX tests (offscreen Qt):
QT_QPA_PLATFORM=offscreen pytest cortex/tests/unit/test_dashboard_toast.py \
cortex/tests/unit/test_dashboard_empty_state.py \
cortex/tests/unit/test_onboarding_hints.py \
cortex/tests/unit/test_overlay_animation.py -q
# Extension TS tests:
cd cortex/apps/browser_extension && pnpm test
# Schema-codegen drift gate (Debt-1):
CORTEX_JSON2TS_CMD=$(which json2ts) python -m cortex.scripts.generate_ts_schemas --check

- 93 audit commits landed this session.
- 53 of 56 Ledger findings closed (3 deferred with explicit justification above).
- 2 Architectural Debts closed (Debt-1 codegen, Debt-2 systemic auth).
- 2 Non-Ledger phases shipped (Phase I performance, Phase J UX polish).
- ~345 audit-specific tests added across Python and TypeScript.
F25 (cooldown/dwell oscillation, partial closure). The cost-runaway aspect is well-contained by F20's budget kill-switch with regression-tested thresholds. The quality-of-experience aspect — does the user actually get spammed with interventions under real biometric jitter? — is partially closed by F26/F27 but not directly tested with adversarial state sequences. The right next step is an /eval suite that replays a synthetic jittery-state trace and asserts intervention count stays within an envelope. That is F41's territory and was deferred. Until F41 is closed, the operator's only signal is the cortex_state_loop_interventions_per_hour metric.
The three previously-deferred Ledger findings now ship; the audit Ledger is fully closed (56 of 56).
Cite: cortex/apps/desktop_shell/controller.py:290-300,
cortex/services/api_gateway/websocket_server.py:60-72.
Failure mode. WSMessage.sequence already incremented sender-side,
but in-process DaemonBridge and WS-mode WebSocketBridge (and the
extension's handleMessage) applied every frame unconditionally. A
reordered STATE_UPDATE could overwrite the biometric UI with stale
data; a reordered INTERVENTION_TRIGGER could clobber the active plan.
Fix.
- Daemon: `CortexDaemon._state_callback_seq` / `_intervention_callback_seq` monotonic counters; each callback invocation stamps `payload["_seq"]`.
- `DaemonBridge`: per-channel `_last_state_seq` / `_last_intervention_seq`; drops frames whose `_seq` is not strictly greater. `reset_sequence_counters()` lets a daemon restart's seq=1 win.
- `WebSocketBridge`: per-type `_last_seq_by_type`; cleared on every WS open so a restart wins.
- `background.ts`: per-type `lastSeqByType` + `_acceptSequencedFrame` test export; cleared in `ws.onopen`.

Tests. 12 Python + 7 TS cases. Reorder drop, duplicate drop,
per-channel independence, sequence=0 / missing-_seq bypass, reset on
reconnect, daemon-side monotonic stamping invariant. All fail on the
pre-F17 main.
Compatibility. Additive. Frames without _seq / with sequence=0
bypass the check so older daemons and unsequenced types (AUTH_OK,
INTERVENTION_RESTORE) continue to apply unchanged.
Commit. 71b94c1.
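
The receiver-side rule (strictly increasing per type, bypass for unsequenced frames, reset on reconnect) can be sketched as follows. `SequenceGate` is a hypothetical class written for illustration; it mirrors the behaviour described above but is not the code in `DaemonBridge`, `WebSocketBridge`, or `background.ts`.

```python
from typing import Dict, Optional


class SequenceGate:
    """Hypothetical per-type drop-stale filter for inbound frames."""

    def __init__(self) -> None:
        self._last_seq_by_type: Dict[str, int] = {}

    def accept(self, msg_type: str, seq: Optional[int]) -> bool:
        """Return True if the frame should be applied, False if it is stale."""
        # Frames with no sequence, or sequence 0, bypass the check so
        # older senders and unsequenced types keep working.
        if seq is None or seq == 0:
            return True
        last = self._last_seq_by_type.get(msg_type, 0)
        if seq <= last:
            return False  # reordered or duplicate frame
        self._last_seq_by_type[msg_type] = seq
        return True

    def reset(self) -> None:
        """Forget all counters, e.g. on reconnect, so a restart's seq=1 wins."""
        self._last_seq_by_type.clear()
```

Tracking the last value per message type, rather than one global counter, is what gives the per-channel independence listed in the tests: a late `STATE_UPDATE` cannot cause a fresh `INTERVENTION_TRIGGER` to be dropped.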
Cite: cortex/services/state_engine/trigger_policy.py:147,283-294,329-334.
Failure mode. Default cooldown_seconds=60 + hyper_dwell_seconds=30
admit a 90-second adversarial cycle (HYPER 30 s → trigger → FLOW 60 s
→ HYPER 30 s → trigger again). Pre-F25 fired on every cycle (~40/hr);
the eval harness's oscillation_intervention_rate_per_hr metric pins
the post-F25 figure at 1.25/hr.
Fix. Two gates added on top of the existing cooldown:
- Hourly cap: `InterventionConfig.max_interventions_per_hour` (default 6). `TriggerPolicy._intervention_timestamps` deque is pruned to the trailing 60-minute window; rejects when the count crosses the cap. `record_intervention()` appends so the next evaluate enforces.
- Oscillation-aware dwell: `InterventionConfig.oscillation_max_flips` (default 6) + `oscillation_window_seconds` (600 s) + `oscillation_dwell_multiplier` (2.0). `_hyper_enter_timestamps` tracks False→True transitions; when the count in the window exceeds the cap, the required dwell is multiplied. Jittery flicker fails the stretched dwell; genuine sustained overwhelm still passes.

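
The hourly cap on its own can be sketched with a deque pruned to a sliding window. This is a simplified illustration, not `TriggerPolicy`: `HourlyCap`, `simulate`, and `required_dwell` are hypothetical names, cooldown is not modelled, and treating "crosses the cap" as "count has reached the cap" is an assumption.

```python
from collections import deque
from typing import Deque, Tuple


class HourlyCap:
    """Hypothetical sliding-window cap on interventions."""

    def __init__(self, max_per_hour: int = 6, window_seconds: float = 3600.0) -> None:
        self.max_per_hour = max_per_hour
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def allows(self, now: float) -> bool:
        if self.max_per_hour == 0:
            return True  # cap=0 disables the gate
        # Drop timestamps that have slid out of the trailing window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) < self.max_per_hour

    def record(self, now: float) -> None:
        self._timestamps.append(now)


def required_dwell(base: float, flips: int, max_flips: int = 6, multiplier: float = 2.0) -> float:
    """Stretch the dwell only when flips in the window exceed the cap."""
    return base * multiplier if flips > max_flips else base


def simulate(cap: HourlyCap, cycle_seconds: float, duration_seconds: float) -> Tuple[int, int]:
    """Replay one trigger attempt per cycle; return (attempts, fired)."""
    attempts = fired = 0
    now = 0.0
    while now < duration_seconds:
        attempts += 1
        if cap.allows(now):
            cap.record(now)
            fired += 1
        now += cycle_seconds
    return attempts, fired
```

With the 90-second adversarial cycle (30 s dwell + 60 s cooldown), `simulate(HourlyCap(6), 90.0, 3600.0)` makes 40 attempts in an hour, matching the ~40/hr pre-F25 rate, and lets 6 through. The real policy layers the oscillation-aware dwell on top, which is why the measured post-F25 rate is lower than the cap.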
Drive-by fix. Replaced four now = timestamp or time.monotonic()
patterns with explicit if X is None else X so callers passing
timestamp=0.0 (test harnesses, synthetic traces) get the literal 0.0
instead of silently falling back to monotonic time. The "or" form
treated 0.0 as falsy. Behaviour-preserving for every existing caller
(none pass 0.0 in production).
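
The difference between the two forms is easy to reproduce in isolation. In this sketch the function names and the injectable `clock` parameter are hypothetical, added so the behaviour can be shown deterministically:

```python
import time
from typing import Callable, Optional


def now_with_or(timestamp: Optional[float] = None,
                clock: Callable[[], float] = time.monotonic) -> float:
    # Buggy form: 0.0 is falsy, so an explicit 0.0 is silently replaced.
    return timestamp or clock()


def now_explicit(timestamp: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> float:
    # Fixed form: only a missing timestamp falls back to the clock.
    return clock() if timestamp is None else timestamp
```

With a fake clock that always returns 123.0, `now_with_or(0.0, clock)` yields 123.0 while `now_explicit(0.0, clock)` yields 0.0; both agree for any non-zero timestamp and for `None`.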
Tests. 7 cases: hourly cap blocks after N triggers + releases as the window slides + cap=0 disables; oscillation lengthens dwell + does NOT block sustained overwhelm + flips outside window pruned; integration trace clamps from ~160/4hr (the pre-F25 ~40/hr rate) to ≤24/4hr.
Commit. 16c8bd5.
Cite: cortex/services/eval/ exists locally; .github/workflows/
ran pytest+ruff+mypy+extension+codegen but not eval.
Failure mode. Eval ran once locally; no baseline pass-rate captured; no regression gate on PRs touching llm_engine/state_engine/prompts.
Fix.
- New `cortex/services/eval/regression_harness.py` replays four synthetic traces and computes four metrics (oscillation rate per hour, sustained-overwhelm pass rate, FLOW false-positive rate, bandit regret p95).
- Deterministic via `DEFAULT_SEED=20260519`; same seed → byte-identical metrics.
- `BaselineFile` round-trip via `save_baseline()` / `load_baseline()`; tolerance bands per metric (3 % rel + abs floor for near-zero) + per-metric direction (higher-is-worse vs lower-is-worse).
- New `cortex/services/eval/baseline.json` (committed). Current metrics: oscillation 1.25/hr, sustained pass 1.0, FLOW false-pos 0.0, bandit regret p95 0.0.
- New CI job `eval-regression` in `.github/workflows/ci.yml`. Runs on push and on PRs touching llm_engine/, state_engine/, eval/, or `libs/schemas/intervention.py`. Exits 1 on any metric crossing its tolerance band.
- CLI:
  - `python -m cortex.services.eval.regression_harness` runs and compares.
  - `--update-baseline` writes a fresh baseline after a reviewed change.

Tests. 17 cases: determinism (same seed → same metrics), per-metric
sanity (oscillation clamped, sustained passes, FLOW false-pos=0,
bandit regret bounded), baseline round-trip, committed baseline
self-compares clean, committed baseline matches fresh run
byte-identically, regression detection (higher / lower direction, in-band
drift OK, missing metric → NaN sentinel), abs-tolerance band rescues
near-zero baselines, CLI exits 0/1 as expected, --update-baseline
writes the requested path.
Commit. 4fc42fd.
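
A tolerance band with a relative part, an absolute floor, and a per-metric direction might look like the sketch below. It is an illustration, not the harness code: `within_band` is a hypothetical name, the `abs_floor` default of 0.01 is an assumed value (the text gives only the 3 % relative figure), and treating improvements as always in-band is an assumption.

```python
import math


def within_band(
    baseline: float,
    current: float,
    *,
    higher_is_worse: bool,
    rel_tol: float = 0.03,
    abs_floor: float = 0.01,  # assumed value, not taken from the harness
) -> bool:
    """Return True if `current` has not regressed past the tolerance band."""
    if math.isnan(current):
        return False  # a missing metric (NaN sentinel) counts as a regression
    # The absolute floor keeps near-zero baselines from having a zero-width band.
    band = max(abs(baseline) * rel_tol, abs_floor)
    delta = current - baseline
    worse_by = delta if higher_is_worse else -delta
    return worse_by <= band
```

For the committed oscillation baseline of 1.25/hr the relative band is 0.0375, so a run at 1.28 passes and a run at 1.30 fails; for the zero-valued baselines only the absolute floor applies.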
source .venv/bin/activate
python -m pytest cortex/tests/unit/test_f17_sequence_drop.py \
cortex/tests/unit/test_f25_hysteresis.py \
cortex/tests/unit/test_f41_eval_regression.py -q
# 36 passed
python -m cortex.services.eval.regression_harness
# Cortex eval regression report — every metric ok against the committed baseline
cd cortex/apps/browser_extension && ./node_modules/.bin/vitest run __tests__/f17_sequence_drop.spec.ts
# 7 passed

Ledger is fully closed: 56 of 56. Architectural Debts both shipped (Debt-1 codegen, Debt-2 systemic auth). Phase I performance + Phase J UX polish shipped. The regression harness is the live floor against which future LLM/state-engine work is measured.