π release(v1.5): rc.28 β handshake cold-start race + perf/security/refactor batch#396
Merged
Conversation
Restructure Roadmap and Recent Updates to match the Migration pattern: centered `<h2>` outside the `<details>`, with a `<summary>` line inside acting as the collapse trigger. Keeps headings visually consistent with other top-level sections while preserving the collapsed-by-default body.
handleAckEvent fires void this.handshake() synchronously and returns immediately; the SSE handler chain doesn't await it. A burst of dd:ack events (e.g. during a reconnect storm) could start a second handshake while deregisterAgentComponents β registerAgentComponents was still running on the first, producing a double-registration window where both old and new component stubs were briefly live. Same class as the rc.27 operationId fix. Adds a handshakeInProgress in-flight Promise guard: handshake() now returns the active in-flight promise to subsequent callers and only fires _doHandshake() once per burst. Resets via .finally() so the next genuine handshake gets a fresh run. Tests cover both burst dd:ack and direct concurrent handshake() calls.
Three independent perf cleanups on the read/poll hot paths: - store: bump LokiJS autosaveInterval 60s β 300s. The autosave does a synchronous JSON.stringify of the entire in-memory DB on the event loop; at scale (200+ containers with security-scan data) the 60s cadence was a measurable stall. Explicit shutdown save is unchanged. - Docker watcher: same-source container lookup now calls getContainersRaw() instead of getContainers(), skipping the per-container structuredClone+redaction pass that fires on every poll cycle. The watcher only needs id/name for pruneOldContainers. - SSE event buffer: replaySince() replaces a linear ring.filter over up to 500 buffered events with a binary search on the monotonic counter. Cuts reconnect-storm CPU on the broadcast side. Deferred (tracked in .planning/perf-followup.md): routing api/stats, agent/api/event, and prometheus/container off the cloned read path β needs a coordinated type-contract change.
- registry/index.ts: getState() now returns Readonly<RegistryState>. Callers in Docker.ts, Dockercompose.ts, and scheduler.ts read-only. Returning the live mutable global was a defensive hazard waiting on the first caller to mutate it by accident. - registry/index.ts: SIGINT/SIGTERM handlers swap to process.once() so re-entry of init() can't stack duplicate shutdown handlers. - registries/providers/dhi: maskConfiguration() now uses the shared BaseRegistry.maskSensitiveFields helper, so future config fields are masked by default instead of silently dropped from logs. - watchers/providers/docker/label.ts: legacy ddTriggerInclude and ddTriggerExclude constants get @deprecated JSDoc pointing at the v1.7.0 removal and the canonical dd.action.* / dd.notification.* replacements. ECR maskConfiguration is intentionally hand-rolled β ECR extends Registry directly (not BaseRegistry), so maskSensitiveFields isn't available without a deeper refactor.
The command action spawns the subprocess with the controller's full process.env so PATH, HOME, TZ etc. behave as operators expect. This also means every DD_* secret the controller loaded (registry passwords, OIDC client secrets, agent tokens) is visible to the spawned process. This is admin-only and the command is fully operator-controlled, so it is not exploitable β but operators running third-party scripts need to know. Adds: - Security note in runCommand JSDoc with the unset DD_* wrapper pattern - Warning callout in content/docs/current/configuration/triggers/command Filtering to an allowlist is queued for v1.7.0; in an RC it would silently break every existing command-action user who depends on inherited env.
10 services were calling response.json() directly, so when a reverse proxy returned an HTML 502/504 the user got a cryptic "SyntaxError: Unexpected token '<'" instead of a useful message. Swap 17 happy-path call sites to readJsonResponse, which sniffs Content- Type and surfaces a clean error. Skipped 5 call sites that live on !response.ok error paths with intentional .catch(() => fallback) handling β converting them would swallow the existing fallback messages.
Backend (app/model/container-update-operation.ts) defines 'scanning' and 'sbom-generating' as legal ContainerUpdateOperationPhase values; the UI's authoritative copy was missing both. isContainerUpdateOperationPhase and isInProgressContainerUpdateOperationPhase therefore returned false for those states, so getOperationPhase (operations store) silently dropped them. Render-time fell back to the generic 'updating' label instead of 'Scanning' / 'Generating SBOM'. Adds both to CONTAINER_UPDATE_OPERATION_PHASES and IN_PROGRESS_CONTAINER_UPDATE_OPERATION_PHASES to match the backend.
Two defense-in-depth hardenings: - app/api/auth.ts: the auth router's express.json() had no limit, so unauthenticated POSTs to /auth/login could spool arbitrarily large bodies into memory. The main API router already enforces 256kb; the auth router now uses 64kb (login payloads are tiny). - app/api/sse.ts: per-IP and per-session SSE caps already existed (10 each), but a global cap across all sessions was missing β many distinct authenticated users/IPs could still exhaust file descriptors. Adds DD_SSE_MAX_CLIENTS (default 500) checked as the first guard in eventsHandler; returns 429 when exceeded. Documents DD_SSE_MAX_CLIENTS in content/docs/current/configuration/server.
Five small wins on production hot paths: - ui/vite.config.ts: __VUE_OPTIONS_API__ flipped true β false. Codebase is 100% Composition API; the Options runtime is dead weight (~10 KB). - ui/vite.config.ts: dedicated 'i18n' code-split group for vue-i18n, so the locale runtime doesn't ride along in every route's vendor chunk. - ui/src/stores/eventStream.ts: SSE reconnect was a fixed 5 s delay. After a pod restart, hundreds of browser tabs all reconnected on exactly the same beat. Switches to exponential backoff with jitter: min(30s, 1s * 2^errors) * (0.5..1.0). - app/api/api.ts: /openapi.json was re-running JSON.stringify on the full document on every request. Pre-serialize once at startup, send the cached string. - app/security/scheduler.ts: per-container security alerts were awaited serially per digest group. Switches to Promise.allSettled so one failure doesn't sink the rest and group fan-out runs in parallel.
Two hot-path tightenings that bite on lagging or busy clients: - ContainersGroupedViews.vue: openActionsContainer was a computed that depended on the full displayContainers ref, so every SSE patch on any container field invalidated it and re-ran a linear .find over ~200 containers. Short-circuit when no menu is open. The faster containerIndexById Map exists inside useContainerSsePatchPipeline but is not exposed to this component; revisit if/when it is. - app/api/log-stream.ts: trySendLogEntry was sending JSON frames synchronously into the ws write buffer with no bufferedAmount check. A slow consumer grew the buffer unboundedly until the connection collapsed and took memory with it. Now drops live entries when bufferedAmount > 2 MB, and closes the socket with 1008 Policy Violation after 3 consecutive drops. The drop counter resets on a successful drain.
β¦-safe
- app/api/openapi/paths/notification-outbox.ts (new): adds OpenAPI
entries for GET /api/notifications/outbox, POST .../{id}/retry,
DELETE .../{id}. The router was mounted but never described, so the
contract validator couldn't catch response-shape regressions.
- app/notifications/outbox-worker.ts: OutboxWorker.stop() cleared
this.inflight synchronously without awaiting dispatches already in
flight. A dispatch that completed after stop would still call
markOutboxEntryDelivered / markOutboxEntryAttempted on the store,
and a subsequent start() would re-enqueue the same entry β spurious
extra delivery attempt. Adds a stopped flag checked on both the
success and failure branches of dispatch(), reset on start().
`src/icons.ts` (822-line icon-name lookup table) and `src/i18n/locales.ts` (locale strings) were dominating the `ui-shell-app` mutation slice. Their string-literal mutants caused transitive jsdom hangs in Vue tests (63% of tested mutants timed out), pushing the slice past the 6h runner timeout. Mutating data tables is also low-signal β surviving mutants only confirm that data isn't asserted on, which is correct for a lookup table. Excluded the two files in both the workflow matrix and `stryker.conf.mjs` so local and CI runs stay in sync.
A `app-orch-trigger-notify-secondary` shard in run 26340803265 failed with `fatal: could not read Username for 'https://github.com'` during the initial `git fetch`. All three of actions/checkout's internal retries reused the same token and were rejected identically, even though 26 sibling matrix slots in the same run checked out fine β a one-off GitHub-side token race. Wrapped checkout in both jobs (`stryker` matrix + `aggregate`) with `continue-on-error: true` β 30s sleep β second invocation, so a transient 401 has time to clear before re-issuing the auth header.
Match the established "<emoji> <Category>: <Subject>" naming used by the other custom workflows (CI Verify, E2E: Playwright, Release: Cut, Security: *, Quality: *) so the Actions sidebar reads consistently.
readJsonResponse defaults its generic to unknown, so json.error was a TS2339 in trigger.ts:154. Pass the expected shape explicitly so the error narrowing compiles without changing runtime behavior.
b2999eb added a fourth i18n group to vite.config.ts but the priority spec still asserted three groups. Align the test (count, order, name) with the current four-group reality.
07e3580 moved Docker.getContainers' same-source lookup from storeContainer.getContainers to .getContainersRaw, but the two "normalize non-string watcher agent" tests still only stubbed getContainers. The unstubbed getContainersRaw returned undefined, the spread threw, the try/catch swallowed it, and the registry.getState assertion never fired. Stub getContainersRaw to keep the path under test exercised.
β¦t race) When AgentClient._doHandshake received 0 containers it pruned the controller-side store unconditionally, wiping last-known state for any agent that had just restarted with an empty in-memory store while docker still had N running containers. The agent's first watch cycle would repopulate seconds later, but the controller's UI rendered 0 until then. Skip the prune when containers.length === 0 and defer to the next authoritative dd:watcher-snapshot (already gated on !containerEnumerationFailed && enrichmentErrors === 0, so a 0-container snapshot really means 0). Warn when hasConnectedOnce so genuinely-empty fresh connections stay silent. Non-zero handshakes prune normally. Adds defence-in-depth on the controller side that complements the rc.25 watcher-snapshot suppression (d02080a) on the agent side. Fixes: #386
The two #289 entries shipped in rc.27 release notes but were left under [Unreleased] because release-cut.yml uses the [Unreleased] fallback for pre-release tags without stamping the file. Move them into the proper [1.5.0-rc.27] section so the upcoming rc.28 release notes don't re-publish them.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
biggest-littlest
previously approved these changes
May 27, 2026
Member
biggest-littlest
left a comment
There was a problem hiding this comment.
Approving rc.28: handshake cold-start race fix + perf/security batch. CHANGELOG cleaned up.
ALARGECOMPANY
previously approved these changes
May 27, 2026
Member
ALARGECOMPANY
left a comment
There was a problem hiding this comment.
LGTM β rc.28 ready to cut.
Codecov Reportβ All modified and coverable lines are covered by tests. π’ Thoughts on this report? Let us know! |
The npm ci step transitively triggers @playwright/browser-chromium's postinstall, which downloads ~167 MiB of chromium. PR #396 hit the 5-minute timeout twice with the binary reaching 100% but the overall step expiring β the timeout was right at the edge of what cold-cache CDN runs need. 10 minutes matches the dedicated chromium install step that runs immediately after.
5826de3
biggest-littlest
approved these changes
May 28, 2026
Member
biggest-littlest
left a comment
There was a problem hiding this comment.
Re-approving after CI timeout bump.
ALARGECOMPANY
approved these changes
May 28, 2026
Member
ALARGECOMPANY
left a comment
There was a problem hiding this comment.
Re-approving after CI fix; rc.28 ready.
4 tasks
s-b-e-n-s-o-n
added a commit
that referenced
this pull request
May 28, 2026
β¦397) ## Summary Three consecutive Playwright runs failed at the same step on `cdn.playwright.dev` CDN throttling: - rc.28 PR #396 first attempt (`bca0b7bb`) - rc.28 PR #396 second attempt (`bca0b7bb`) - post-merge on main (`7dc8b2d3`) β blocking release-cut for v1.5.0-rc.28 - post-merge retry attempt 2 β same shape Each attempt: chromium download reaches 100% at ~9m30s, then `node install.js` SIGTERMs at the 10-min wall clock. Same workflow file, same package versions β CDN throttling is currently routing each runner pool through a slow edge. ### Changes - **actions/cache for `~/.cache/ms-playwright`** keyed on lockfile hash + `github.ref_name`. Each branch gets its own cache namespace (matches the scoping pattern in `quality-mutation-monthly.yml`). After the first cold-cache run, subsequent runs short-circuit the postinstall download. - **timeout_minutes 10 β 15** covers the worst observed throttled-download case with margin for the `npm install` post-extraction work. - Inline `# zizmor: ignore[cache-poisoning]` pragma with rationale; the workflow's push:main + PR triggers cause zizmor to flag any actions/cache as a poisoning risk regardless of key scope. The dedicated `Install Playwright Chromium` step that runs after `npm ci` keeps its 10-min retry budget and will now hit the warm cache on every run. ## Test plan - [x] Pre-push gates green locally (clean-tree, biome, qlty, qlty-smells, typecheck, 100% coverage, build, zizmor) - [x] zizmor passes on the edited workflow - [ ] PR-event Playwright workflow run passes on this commit - [ ] After merge: rerun the failed main-branch Playwright on `7dc8b2d3` (which won't have the fix on its tree but will share the cache β chromium binary persists), OR re-dispatch release-cut against the new merge commit
6 tasks
s-b-e-n-s-o-n
added a commit
that referenced
this pull request
May 28, 2026
β¦or rc.28 prep (#399) ## Summary Consolidates **45 commits** of CI hardening, security scanner improvements, dependency bumps, and the resulting test fixes onto `main` so rc.28 can be cut cleanly. Builds on top of the already-merged #396 (rc.28 app code) and #397/#398 (Playwright cache/runner-image fixes). **Theme breakdown:** - π¦ **Dependency bumps** (12) β Biome, Vitest, Playwright, Artillery, Codecov, CodeQL, Snyk, Nuclei, ZAP, cosign, Docker actions, zizmor - π **Security workflow hardening** (14) β Scorecard, ZAP/Snyk SARIF uploads, Nuclei DAST, CodeQL query pack pin, Crowdin sync, DAST credential masking, Harden Runner coverage, Renovate scoping, release SBOM attestation/verification with pinned Trivy, Docker build metadata standardization, release retry normalization - π§ **Workflow hygiene** (8) β Playwright host-install download skip, mutation artifact layout verification, Stryker shard dashboards, Artillery advisory gates, Crowdin emoji prefix, Playwright image sync, share runtime path filters, move workflow tests out of the app suite - β‘ **Performance** (1) β split load test profiles into parallel jobs - π§ͺ **Test coverage** (5) β ZAP SARIF converter branches, Playwright retry artifact retention, workflow tests (release metadata, action pin fixtures, retry attempts, Playwright download env), script tests wired into local + CI gates - π **Follow-up fixes from this consolidation** (3): - `π¨ style(lint): apply Biome auto-fixes after bumped rules` β bumped Biome flagged pre-existing literal-key / template-string / formatter issues - `π fix(ci): drop invalid artifact-metadata permission scope` β `artifact-metadata` isn't a real GHA scope; `attestations: write` already covers it - `π§ͺ test(ui): align crowdin-config invariant with single-source-of-truth` β `377b6bdf` moved language scoping into `crowdin.yml`, so the pre-existing UI test had to flip its assertion Local pre-push pipeline all green (clean-tree β biome β qlty β qlty-smells β scripts-test β workflow-tests β typecheck-ui β coverage 100% sharded β build sharded β zizmor). ## Test plan - [ ] CI Verify (`ci-verify.yml`) passes β load-test and cucumber jobs now skip the Playwright postinstall download - [ ] E2E Playwright (`e2e-playwright.yml`) passes inside the official Playwright container (#398 carry-forward) - [ ] Mutation slices unchanged from main + new mutation artifact layout check passes - [ ] CodeQL + Scorecard + zizmor + qlty all green - [ ] Workflow tests (`.github/tests/`, 26 tests) pass in CI - [ ] After merge: `gh workflow run release-cut.yml --ref main -f release_tag=v1.5.0-rc.28`
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cuts v1.5.0-rc.28 β 19 commits since rc.27, all on
feature/v1.5-rc28.Highlights
7ef4aa61) βAgentClient._doHandshakenow deferspruneOldContainerswhen handshake returns 0 containers, eliminating the cold-start race where a freshly-restarted agent would wipe the controller's last-known container count between handshake and first watch cycle. Combined with the rc.25 (d02080ae) snapshot suppression and rc.26 (512c3751) stats-changed broadcast, the agent count display is now robust across the full restart lifecycle. Cesc1986's rc.27 reporter log proved the failure mode (agent-client.ml Handshake β 0after restart, wiping 5 known containers).3da42587) β in-flight_doHandshakePromise reused across overlapping reconnects.5e02f6df,9b6e1b97) β auth JSON body limit + global SSE connection cap; doc note that Command trigger inherits allDD_*env.07e3580e,b2999eb7,dceec898) β store/watcher/SSE replay hot paths, bundle splitting, SSE jitter, OpenAPI cache, alert fan-out, containers menu scan, WS log backpressure.47c0cf44,4d34b5e7,6ee07006) β notifications outbox openapi docs + drain-safe stop, registry contract hygiene,readJsonResponseadoption.de07d412) β include scanning/sbom-generating in phase enum.794497e5,06e56be3,612e81c4,06eed8dc,82caeb08,3721f66b) β stubgetContainersRawin same-source watcher tests, vite codeSplitting i18n group,runTriggerjson type narrowing, Crowdin Sync emoji prefix, transient 401 checkout retry, exclude pure-data files from UI mutation slice.CHANGELOG
[Unreleased]now contains only the Agents still displaying 0 running containersΒ #386 deferred-prune entry (will be stamped to[1.5.0-rc.28]byrelease-cut.yml).[1.5.0-rc.27]section added retroactively for the two Regression with updating container stateΒ #289 entries that shipped in rc.27 release notes but were never stamped into the file.Test plan
bca0b7bb: clean-tree, biome, qlty, qlty-smells, typecheck-ui, 100% coverage, buildnpx vitest run agent/AgentClient.test.tsβ 455/455 pass (includes 4 new regression tests for handshake-zero prune-skip)gh workflow run release-cut.yml --ref main -f release_tag=v1.5.0-rc.28Fixes: #386