Skip to content

πŸš€ release(v1.5): rc.28 β€” handshake cold-start race + perf/security/refactor batch#396

Merged
s-b-e-n-s-o-n merged 20 commits into
mainfrom
feature/v1.5-rc28
May 28, 2026
Merged

πŸš€ release(v1.5): rc.28 β€” handshake cold-start race + perf/security/refactor batch#396
s-b-e-n-s-o-n merged 20 commits into
mainfrom
feature/v1.5-rc28

Conversation

@s-b-e-n-s-o-n
Copy link
Copy Markdown
Contributor

Summary

Cuts v1.5.0-rc.28 β€” 19 commits since rc.27, all on feature/v1.5-rc28.

Highlights

  • Agents still displaying 0 running containersΒ #386 follow-through (7ef4aa61) β€” AgentClient._doHandshake now defers pruneOldContainers when handshake returns 0 containers, eliminating the cold-start race where a freshly-restarted agent would wipe the controller's last-known container count between handshake and first watch cycle. Combined with the rc.25 (d02080ae) snapshot suppression and rc.26 (512c3751) stats-changed broadcast, the agent count display is now robust across the full restart lifecycle. Cesc1986's rc.27 reporter log proved the failure mode (agent-client.ml Handshake β†’ 0 after restart, wiping 5 known containers).
  • Agents still displaying 0 running containersΒ #386 concurrent-handshake guard (3da42587) β€” in-flight _doHandshake Promise reused across overlapping reconnects.
  • Security caps (5e02f6df, 9b6e1b97) β€” auth JSON body limit + global SSE connection cap; doc note that Command trigger inherits all DD_* env.
  • Perf batch (07e3580e, b2999eb7, dceec898) β€” store/watcher/SSE replay hot paths, bundle splitting, SSE jitter, OpenAPI cache, alert fan-out, containers menu scan, WS log backpressure.
  • Refactors (47c0cf44, 4d34b5e7, 6ee07006) β€” notifications outbox openapi docs + drain-safe stop, registry contract hygiene, readJsonResponse adoption.
  • UI fix (de07d412) β€” include scanning/sbom-generating in phase enum.
  • CI polish (794497e5, 06e56be3, 612e81c4, 06eed8dc, 82caeb08, 3721f66b) β€” stub getContainersRaw in same-source watcher tests, vite codeSplitting i18n group, runTrigger json type narrowing, Crowdin Sync emoji prefix, transient 401 checkout retry, exclude pure-data files from UI mutation slice.

CHANGELOG

Test plan

  • All pre-push gates green on bca0b7bb: clean-tree, biome, qlty, qlty-smells, typecheck-ui, 100% coverage, build
  • npx vitest run agent/AgentClient.test.ts β†’ 455/455 pass (includes 4 new regression tests for handshake-zero prune-skip)
  • Required CI checks green on this PR
  • After merge: gh workflow run release-cut.yml --ref main -f release_tag=v1.5.0-rc.28
  • Reporter (Cesc1986) verifies Agents still displaying 0 running containersΒ #386 fixed on rc.28 image

Fixes: #386

Restructure Roadmap and Recent Updates to match the Migration pattern:
centered `<h2>` outside the `<details>`, with a `<summary>` line inside
acting as the collapse trigger. Keeps headings visually consistent with
other top-level sections while preserving the collapsed-by-default body.
handleAckEvent fires void this.handshake() synchronously and returns
immediately; the SSE handler chain doesn't await it. A burst of dd:ack
events (e.g. during a reconnect storm) could start a second handshake
while deregisterAgentComponents β†’ registerAgentComponents was still
running on the first, producing a double-registration window where both
old and new component stubs were briefly live. Same class as the rc.27
operationId fix.

Adds a handshakeInProgress in-flight Promise guard: handshake() now
returns the active in-flight promise to subsequent callers and only
fires _doHandshake() once per burst. Resets via .finally() so the next
genuine handshake gets a fresh run.

Tests cover both burst dd:ack and direct concurrent handshake() calls.
Three independent perf cleanups on the read/poll hot paths:

- store: bump LokiJS autosaveInterval 60s β†’ 300s. The autosave does a
  synchronous JSON.stringify of the entire in-memory DB on the event
  loop; at scale (200+ containers with security-scan data) the 60s
  cadence was a measurable stall. Explicit shutdown save is unchanged.

- Docker watcher: same-source container lookup now calls
  getContainersRaw() instead of getContainers(), skipping the
  per-container structuredClone+redaction pass that fires on every
  poll cycle. The watcher only needs id/name for pruneOldContainers.

- SSE event buffer: replaySince() replaces a linear ring.filter over
  up to 500 buffered events with a binary search on the monotonic
  counter. Cuts reconnect-storm CPU on the broadcast side.

Deferred (tracked in .planning/perf-followup.md): routing
api/stats, agent/api/event, and prometheus/container off the
cloned read path β€” needs a coordinated type-contract change.
- registry/index.ts: getState() now returns Readonly<RegistryState>.
  Callers in Docker.ts, Dockercompose.ts, and scheduler.ts read-only.
  Returning the live mutable global was a defensive hazard waiting
  on the first caller to mutate it by accident.

- registry/index.ts: SIGINT/SIGTERM handlers swap to process.once() so
  re-entry of init() can't stack duplicate shutdown handlers.

- registries/providers/dhi: maskConfiguration() now uses the shared
  BaseRegistry.maskSensitiveFields helper, so future config fields
  are masked by default instead of silently dropped from logs.

- watchers/providers/docker/label.ts: legacy ddTriggerInclude and
  ddTriggerExclude constants get @deprecated JSDoc pointing at the
  v1.7.0 removal and the canonical dd.action.* / dd.notification.*
  replacements.

ECR maskConfiguration is intentionally hand-rolled β€” ECR extends
Registry directly (not BaseRegistry), so maskSensitiveFields isn't
available without a deeper refactor.
The command action spawns the subprocess with the controller's full
process.env so PATH, HOME, TZ etc. behave as operators expect. This
also means every DD_* secret the controller loaded (registry passwords,
OIDC client secrets, agent tokens) is visible to the spawned process.
This is admin-only and the command is fully operator-controlled, so it
is not exploitable β€” but operators running third-party scripts need to
know.

Adds:
- Security note in runCommand JSDoc with the unset DD_* wrapper pattern
- Warning callout in content/docs/current/configuration/triggers/command

Filtering to an allowlist is queued for v1.7.0; in an RC it would
silently break every existing command-action user who depends on
inherited env.
10 services were calling response.json() directly, so when a reverse
proxy returned an HTML 502/504 the user got a cryptic
"SyntaxError: Unexpected token '<'" instead of a useful message. Swap
17 happy-path call sites to readJsonResponse, which sniffs Content-
Type and surfaces a clean error.

Skipped 5 call sites that live on !response.ok error paths with
intentional .catch(() => fallback) handling β€” converting them would
swallow the existing fallback messages.
Backend (app/model/container-update-operation.ts) defines 'scanning'
and 'sbom-generating' as legal ContainerUpdateOperationPhase values;
the UI's authoritative copy was missing both. isContainerUpdateOperationPhase
and isInProgressContainerUpdateOperationPhase therefore returned false
for those states, so getOperationPhase (operations store) silently
dropped them. Render-time fell back to the generic 'updating' label
instead of 'Scanning' / 'Generating SBOM'.

Adds both to CONTAINER_UPDATE_OPERATION_PHASES and
IN_PROGRESS_CONTAINER_UPDATE_OPERATION_PHASES to match the backend.
Two defense-in-depth hardenings:

- app/api/auth.ts: the auth router's express.json() had no limit, so
  unauthenticated POSTs to /auth/login could spool arbitrarily large
  bodies into memory. The main API router already enforces 256kb;
  the auth router now uses 64kb (login payloads are tiny).

- app/api/sse.ts: per-IP and per-session SSE caps already existed
  (10 each), but a global cap across all sessions was missing β€” many
  distinct authenticated users/IPs could still exhaust file
  descriptors. Adds DD_SSE_MAX_CLIENTS (default 500) checked as the
  first guard in eventsHandler; returns 429 when exceeded.

Documents DD_SSE_MAX_CLIENTS in content/docs/current/configuration/server.
Five small wins on production hot paths:

- ui/vite.config.ts: __VUE_OPTIONS_API__ flipped true β†’ false. Codebase
  is 100% Composition API; the Options runtime is dead weight (~10 KB).

- ui/vite.config.ts: dedicated 'i18n' code-split group for vue-i18n,
  so the locale runtime doesn't ride along in every route's vendor
  chunk.

- ui/src/stores/eventStream.ts: SSE reconnect was a fixed 5 s delay.
  After a pod restart, hundreds of browser tabs all reconnected on
  exactly the same beat. Switches to exponential backoff with jitter:
  min(30s, 1s * 2^errors) * (0.5..1.0).

- app/api/api.ts: /openapi.json was re-running JSON.stringify on the
  full document on every request. Pre-serialize once at startup, send
  the cached string.

- app/security/scheduler.ts: per-container security alerts were
  awaited serially per digest group. Switches to Promise.allSettled so
  one failure doesn't sink the rest and group fan-out runs in parallel.
Two hot-path tightenings that bite on lagging or busy clients:

- ContainersGroupedViews.vue: openActionsContainer was a computed that
  depended on the full displayContainers ref, so every SSE patch on
  any container field invalidated it and re-ran a linear .find over
  ~200 containers. Short-circuit when no menu is open. The faster
  containerIndexById Map exists inside useContainerSsePatchPipeline
  but is not exposed to this component; revisit if/when it is.

- app/api/log-stream.ts: trySendLogEntry was sending JSON frames
  synchronously into the ws write buffer with no bufferedAmount check.
  A slow consumer grew the buffer unboundedly until the connection
  collapsed and took memory with it. Now drops live entries when
  bufferedAmount > 2 MB, and closes the socket with 1008 Policy
  Violation after 3 consecutive drops. The drop counter resets on a
  successful drain.
…-safe

- app/api/openapi/paths/notification-outbox.ts (new): adds OpenAPI
  entries for GET /api/notifications/outbox, POST .../{id}/retry,
  DELETE .../{id}. The router was mounted but never described, so the
  contract validator couldn't catch response-shape regressions.

- app/notifications/outbox-worker.ts: OutboxWorker.stop() cleared
  this.inflight synchronously without awaiting dispatches already in
  flight. A dispatch that completed after stop would still call
  markOutboxEntryDelivered / markOutboxEntryAttempted on the store,
  and a subsequent start() would re-enqueue the same entry β†’ spurious
  extra delivery attempt. Adds a stopped flag checked on both the
  success and failure branches of dispatch(), reset on start().
`src/icons.ts` (822-line icon-name lookup table) and
`src/i18n/locales.ts` (locale strings) were dominating the
`ui-shell-app` mutation slice. Their string-literal mutants caused
transitive jsdom hangs in Vue tests (63% of tested mutants timed
out), pushing the slice past the 6h runner timeout. Mutating
data tables is also low-signal β€” surviving mutants only confirm
that data isn't asserted on, which is correct for a lookup table.

Excluded the two files in both the workflow matrix and
`stryker.conf.mjs` so local and CI runs stay in sync.
A `app-orch-trigger-notify-secondary` shard in run 26340803265
failed with `fatal: could not read Username for 'https://github.com'`
during the initial `git fetch`. All three of actions/checkout's
internal retries reused the same token and were rejected
identically, even though 26 sibling matrix slots in the same run
checked out fine β€” a one-off GitHub-side token race.

Wrapped checkout in both jobs (`stryker` matrix + `aggregate`)
with `continue-on-error: true` β†’ 30s sleep β†’ second invocation,
so a transient 401 has time to clear before re-issuing the auth
header.
Match the established "<emoji> <Category>: <Subject>" naming used by
the other custom workflows (CI Verify, E2E: Playwright, Release: Cut,
Security: *, Quality: *) so the Actions sidebar reads consistently.
readJsonResponse defaults its generic to unknown, so json.error was a
TS2339 in trigger.ts:154. Pass the expected shape explicitly so the
error narrowing compiles without changing runtime behavior.
b2999eb added a fourth i18n group to vite.config.ts but the priority
spec still asserted three groups. Align the test (count, order, name)
with the current four-group reality.
07e3580 moved Docker.getContainers' same-source lookup from
storeContainer.getContainers to .getContainersRaw, but the two
"normalize non-string watcher agent" tests still only stubbed
getContainers. The unstubbed getContainersRaw returned undefined, the
spread threw, the try/catch swallowed it, and the registry.getState
assertion never fired. Stub getContainersRaw to keep the path under
test exercised.
…t race)

When AgentClient._doHandshake received 0 containers it pruned the
controller-side store unconditionally, wiping last-known state for any
agent that had just restarted with an empty in-memory store while
docker still had N running containers. The agent's first watch cycle
would repopulate seconds later, but the controller's UI rendered 0
until then.

Skip the prune when containers.length === 0 and defer to the next
authoritative dd:watcher-snapshot (already gated on
!containerEnumerationFailed && enrichmentErrors === 0, so a 0-container
snapshot really means 0). Warn when hasConnectedOnce so genuinely-empty
fresh connections stay silent. Non-zero handshakes prune normally.

Adds defence-in-depth on the controller side that complements the rc.25
watcher-snapshot suppression (d02080a) on the agent side.

Fixes: #386
The two #289 entries shipped in rc.27 release notes but were left under
[Unreleased] because release-cut.yml uses the [Unreleased] fallback for
pre-release tags without stamping the file. Move them into the proper
[1.5.0-rc.27] section so the upcoming rc.28 release notes don't
re-publish them.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
drydock-demo Ready Ready Preview, Comment May 27, 2026 11:58pm
drydock-website Ready Ready Preview, Comment May 27, 2026 11:58pm

Copy link
Copy Markdown
Member

@biggest-littlest biggest-littlest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving rc.28: handshake cold-start race fix + perf/security batch. CHANGELOG cleaned up.

ALARGECOMPANY
ALARGECOMPANY previously approved these changes May 27, 2026
Copy link
Copy Markdown
Member

@ALARGECOMPANY ALARGECOMPANY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM β€” rc.28 ready to cut.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

βœ… All modified and coverable lines are covered by tests.

πŸ“’ Thoughts on this report? Let us know!

The npm ci step transitively triggers @playwright/browser-chromium's
postinstall, which downloads ~167 MiB of chromium. PR #396 hit the
5-minute timeout twice with the binary reaching 100% but the overall
step expiring β€” the timeout was right at the edge of what cold-cache
CDN runs need. 10 minutes matches the dedicated chromium install step
that runs immediately after.
Copy link
Copy Markdown
Member

@biggest-littlest biggest-littlest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after CI timeout bump.

Copy link
Copy Markdown
Member

@ALARGECOMPANY ALARGECOMPANY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after CI fix; rc.28 ready.

@s-b-e-n-s-o-n s-b-e-n-s-o-n merged commit 7dc8b2d into main May 28, 2026
25 checks passed
@s-b-e-n-s-o-n s-b-e-n-s-o-n deleted the feature/v1.5-rc28 branch May 28, 2026 00:20
s-b-e-n-s-o-n added a commit that referenced this pull request May 28, 2026
…397)

## Summary

Three consecutive Playwright runs failed at the same step on
`cdn.playwright.dev` CDN throttling:
- rc.28 PR #396 first attempt (`bca0b7bb`)
- rc.28 PR #396 second attempt (`bca0b7bb`)
- post-merge on main (`7dc8b2d3`) β€” blocking release-cut for
v1.5.0-rc.28
- post-merge retry attempt 2 β€” same shape

Each attempt: chromium download reaches 100% at ~9m30s, then `node
install.js` SIGTERMs at the 10-min wall clock. Same workflow file, same
package versions β€” CDN throttling is currently routing each runner pool
through a slow edge.

### Changes

- **actions/cache for `~/.cache/ms-playwright`** keyed on lockfile hash
+ `github.ref_name`. Each branch gets its own cache namespace (matches
the scoping pattern in `quality-mutation-monthly.yml`). After the first
cold-cache run, subsequent runs short-circuit the postinstall download.
- **timeout_minutes 10 β†’ 15** covers the worst observed
throttled-download case with margin for the `npm install`
post-extraction work.
- Inline `# zizmor: ignore[cache-poisoning]` pragma with rationale; the
workflow's push:main + PR triggers cause zizmor to flag any
actions/cache as a poisoning risk regardless of key scope.

The dedicated `Install Playwright Chromium` step that runs after `npm
ci` keeps its 10-min retry budget and will now hit the warm cache on
every run.

## Test plan
- [x] Pre-push gates green locally (clean-tree, biome, qlty,
qlty-smells, typecheck, 100% coverage, build, zizmor)
- [x] zizmor passes on the edited workflow
- [ ] PR-event Playwright workflow run passes on this commit
- [ ] After merge: rerun the failed main-branch Playwright on `7dc8b2d3`
(which won't have the fix on its tree but will share the cache β€”
chromium binary persists), OR re-dispatch release-cut against the new
merge commit
s-b-e-n-s-o-n added a commit that referenced this pull request May 28, 2026
…or rc.28 prep (#399)

## Summary

Consolidates **45 commits** of CI hardening, security scanner
improvements, dependency bumps, and the resulting test fixes onto `main`
so rc.28 can be cut cleanly. Builds on top of the already-merged #396
(rc.28 app code) and #397/#398 (Playwright cache/runner-image fixes).

**Theme breakdown:**
- πŸ“¦ **Dependency bumps** (12) β€” Biome, Vitest, Playwright, Artillery,
Codecov, CodeQL, Snyk, Nuclei, ZAP, cosign, Docker actions, zizmor
- πŸ”’ **Security workflow hardening** (14) β€” Scorecard, ZAP/Snyk SARIF
uploads, Nuclei DAST, CodeQL query pack pin, Crowdin sync, DAST
credential masking, Harden Runner coverage, Renovate scoping, release
SBOM attestation/verification with pinned Trivy, Docker build metadata
standardization, release retry normalization
- πŸ”§ **Workflow hygiene** (8) β€” Playwright host-install download skip,
mutation artifact layout verification, Stryker shard dashboards,
Artillery advisory gates, Crowdin emoji prefix, Playwright image sync,
share runtime path filters, move workflow tests out of the app suite
- ⚑ **Performance** (1) β€” split load test profiles into parallel jobs
- πŸ§ͺ **Test coverage** (5) β€” ZAP SARIF converter branches, Playwright
retry artifact retention, workflow tests (release metadata, action pin
fixtures, retry attempts, Playwright download env), script tests wired
into local + CI gates
- πŸ› **Follow-up fixes from this consolidation** (3):
- `🎨 style(lint): apply Biome auto-fixes after bumped rules` β€” bumped
Biome flagged pre-existing literal-key / template-string / formatter
issues
- `πŸ› fix(ci): drop invalid artifact-metadata permission scope` β€”
`artifact-metadata` isn't a real GHA scope; `attestations: write`
already covers it
- `πŸ§ͺ test(ui): align crowdin-config invariant with
single-source-of-truth` β€” `377b6bdf` moved language scoping into
`crowdin.yml`, so the pre-existing UI test had to flip its assertion

Local pre-push pipeline all green (clean-tree β†’ biome β†’ qlty β†’
qlty-smells β†’ scripts-test β†’ workflow-tests β†’ typecheck-ui β†’ coverage
100% sharded β†’ build sharded β†’ zizmor).

## Test plan

- [ ] CI Verify (`ci-verify.yml`) passes β€” load-test and cucumber jobs
now skip the Playwright postinstall download
- [ ] E2E Playwright (`e2e-playwright.yml`) passes inside the official
Playwright container (#398 carry-forward)
- [ ] Mutation slices unchanged from main + new mutation artifact layout
check passes
- [ ] CodeQL + Scorecard + zizmor + qlty all green
- [ ] Workflow tests (`.github/tests/`, 26 tests) pass in CI
- [ ] After merge: `gh workflow run release-cut.yml --ref main -f
release_tag=v1.5.0-rc.28`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Agents still displaying 0 running containers

3 participants