Merged
63 changes: 63 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,69 @@

All notable changes to workflow-plugin-digitalocean are documented here.

## [v0.7.8] - 2026-04-24

### Added

- `AppPlatformDriver.Troubleshoot` implements `interfaces.Troubleshooter` from workflow
v0.18.10. On deploy health-check failure wfctl automatically fetches the app's
in-progress/pending/active deployment slots (prioritised in that order) plus up to
5 recent historical deployments, synthesises `[]Diagnostic` entries with per-phase
root-cause lines extracted from `Progress.SummarySteps` and `Progress.Steps`, and
surfaces them in CI output — no DO console trip required to diagnose failures.
- `pickTroubleshootDeployments` helper: priority-ordered candidate selection with dedup.
- `buildDiagnosticFor` helper: structured Diagnostic extraction per deployment.
- `extractCause` helper: scans log tail / reason messages for common error patterns
(`Error:`, `exit status`, `panic:`, `fatal:`, `failed to`, …) with last-line fallback.
- `ResourceDriver.Troubleshoot` gRPC dispatch in plugin `InvokeMethod`; returns
`codes.Unimplemented` for drivers that don't implement `Troubleshooter` so wfctl
silently no-ops without error.
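
The pattern scan described for `extractCause` can be sketched roughly as below. This is a minimal illustration, not the shipped helper: the signature, the bottom-up scan order, and the case-sensitive substring match are assumptions, and the pattern list covers only the patterns named above (the entry elides the rest).

```go
package main

import "strings"

// causePatterns lists only the patterns named in the changelog entry;
// the real helper may match more.
var causePatterns = []string{"Error:", "exit status", "panic:", "fatal:", "failed to"}

// extractCause walks a log tail from the bottom up and returns the first
// line containing a known error pattern. If no line matches, it falls back
// to the last non-empty line. (Assumed behaviour, for illustration only.)
func extractCause(logTail string) string {
	lines := strings.Split(logTail, "\n")
	lastNonEmpty := ""
	for i := len(lines) - 1; i >= 0; i-- {
		line := strings.TrimSpace(lines[i])
		if line == "" {
			continue
		}
		if lastNonEmpty == "" {
			lastNonEmpty = line // remember the fallback candidate
		}
		for _, p := range causePatterns {
			if strings.Contains(line, p) {
				return line
			}
		}
	}
	return lastNonEmpty
}
```

Scanning from the end favours the most recent failure line, which is usually the one closest to the actual cause in a deploy log.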

### Fixed

- **State-heal for stale name-as-ProviderID** — `AppPlatformDriver.Update` and `Delete`
now call `resolveProviderID` before hitting the DO API. When `ref.ProviderID` is not a
canonical UUID (36 chars, hyphens at positions 8/13/18/23), the driver logs a WARN and
transparently falls back to `findAppByName` to recover the real UUID. The healed UUID is
returned in `ResourceOutput.ProviderID` so wfctl rewrites state on the next Apply — no
manual teardown or state editing required.

Root-cause: a pre-v0.7.7 code path in `DOProvider.Apply` substituted `spec.Name` as
`ProviderID` when the godo API returned a zero-ID response. v0.7.7 added an empty-ID
guard on the Create path but did not heal existing stale state. v0.7.8 heals it at
Update/Delete time. Triggered by BMW staging deploy `24901939350` where
`state.json` contained `ProviderID="bmw-staging"` instead of the app UUID.

- New shared helper `isUUIDLike(s string) bool` in `internal/drivers/shared.go` — used
by `resolveProviderID`; 11-case table-driven unit test in `shared_test.go`.
- A WARN log (`"state-heal"` keyword) is emitted when heal fires so operators can observe
state drift in CI output without the deploy failing.
- New integration-test harness in `internal/drivers/integration_test_helpers_test.go`:
`fakeAppsClient` (full `AppPlatformClient` stub with per-method call tracking),
`inMemoryState` (minimal state round-trip store), and `applySim` (mimics wfctl's
Apply→persist loop). Five integration tests in `app_platform_integration_test.go`
exercise the full Create → state persist → Update flow including:
- UUID stored (not spec name) after Create
- No heal for valid UUID on Update
- Stale name healed on Update (core BMW regression test)
- Stale name healed on Delete
- Clear error when heal can't resolve the name

### Changed

- Depends on workflow v0.18.10.1 (was v0.18.6).
- `AppPlatformDriver.Troubleshoot`: empty `ProviderID` now returns `(nil, nil)` instead
of an error; `ListDeployments` errors are best-effort (swallowed, slot-based data used).
- Test ProviderIDs updated from `"app-123"` to proper UUID format throughout driver tests
(required because `"app-123"` is not UUID-like and would trigger the heal path).

Comment on lines +53 to +56 (Copilot AI, Apr 24, 2026):

This entry says the plugin depends on workflow v0.18.10, but go.mod in this PR still requires v0.18.6 (and currently uses a local replace). Please update the changelog to match what will actually ship in this release, or bump the module dependency to the stated version once the tag exists.

### Known follow-up (v0.7.9)

- Replicate state-heal (`resolveProviderID` equivalent) across the other UUID-based
drivers (`vpc`, `firewall`, `database`, `cache`, `load_balancer`, `certificate`,
`api_gateway`, `kubernetes`, `droplet`) — the same class of stale state is theoretically
possible for any driver that was deployed before v0.7.7's empty-ID guard.

## [v0.7.7] - 2026-04-24

### Fixed
159 changes: 159 additions & 0 deletions docs/plans/2026-04-24-provider-id-state-heal-design.md
@@ -0,0 +1,159 @@
# DO Plugin — ProviderID State-Heal + Integration Test Coverage

**Status:** Approved (autonomous pipeline, 2026-04-24)

**Target release:** workflow-plugin-digitalocean v0.7.8 (bundle into in-flight PR #22)

**Related tasks:** #67 (this work), #58 (v0.7.7 empty-ID guards — prior attempt), #25 (BMW staging deploy unblock), #64 / #72 (wfctl observability).

---

## Problem

BMW staging deploy has failed twice tonight with the same error:

```
update/bmw-staging: PUT https://api.digitalocean.com/v2/apps/bmw-staging: 400 invalid uuid
```

DO's App Platform API requires a UUID in the URL path (`/v2/apps/{uuid}`). BMW's IaC state at `s3://bmw-iac-state/staging/state.json` has `ProviderID = "bmw-staging"` (the resource name) instead of the real UUID (`f8b6200c-3bba-48a7-8bf1-7a3e3a885eb5`). DO rejects the PUT.

v0.7.7 shipped as the fix for this bug class ("empty-ID guards + capture UUID from API response"), but the bug recurred — meaning either (a) v0.7.7 doesn't actually prevent the UUID→name substitution in all code paths, or (b) the state that's been failing was populated by an earlier version and never healed.

User mandate: proper fix, with the test that would have caught this. No teardown. No hand-edit of state files.

---

## What's verified

1. **wfctl is a passthrough for `ProviderID`.** `cmd/wfctl/infra_apply.go:281-299` reads `result.Resources[i].ProviderID` directly into `ResourceState.ProviderID` — if the driver returns the correct UUID, state gets the UUID. So the bug lives in the driver or its call chain, not wfctl.
2. **`AppPlatformDriver.Create` writes `ProviderID: app.ID`** (line 403 via `appOutput(app)`). On the success path, the UUID flows correctly.
3. **State-heal is already drafted** on feat/v0.7.8-troubleshoot (uncommitted): `resolveProviderID(ctx, ref)` + `isUUIDLike` + wired into Update/Delete. The draft matches the design below.
4. **Other DO drivers are vulnerable** to the same class of bug wherever the resource is identified by a UUID (api_gateway, database, cache, certificate, droplet, load_balancer, vpc, firewall, reserved_ip). DNS and Spaces use non-UUID identifiers, so they are not affected.

---

## Approach

Three layers of defense, each addressable independently, bundled together for v0.7.8:

### Layer 1 — Root-cause audit of `v0.7.7` empty-ID guard

The v0.7.7 "empty-ID guard" (commit `94c9227`) is suspect. Audit it:

- Does any path in `Create` fall back to `ref.Name` when `app.ID == ""`? If yes, that's wrong — it should error out loudly so state never silently persists a name-as-UUID.
- Is there a gRPC-boundary issue where `ResourceOutput.ProviderID` is getting dropped in marshaling, causing wfctl to reuse the previous state's (broken) ProviderID?
- Was the BMW-staging state originally populated pre-v0.7.7, and the v0.7.7 fix only prevents NEW-creation drift but doesn't heal EXISTING drift?

**Deliverable:** a brief root-cause write-up in the PR body naming the specific code path that produced BMW's bad state. Fix it if it still exists (e.g., replace silent fallback with `return nil, fmt.Errorf(...)` at the offending site).
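
A minimal sketch of the "error out loudly" behaviour, assuming simplified stand-in types: `app`, `resourceOutput`, `errEmptyProviderID`, and `outputFromCreate` are hypothetical names for illustration and do not mirror the real godo or workflow types.

```go
package main

import (
	"errors"
	"fmt"
)

// app and resourceOutput are simplified, hypothetical stand-ins for the
// provider's App type and the workflow ResourceOutput type.
type app struct {
	ID   string
	Name string
}

type resourceOutput struct {
	Name       string
	ProviderID string
}

// errEmptyProviderID lets callers detect this failure with errors.Is.
var errEmptyProviderID = errors.New("provider returned an empty ID")

// outputFromCreate refuses to build an output when the API response has no
// ID. It never substitutes the spec name, so a name can never be persisted
// to state as if it were a UUID.
func outputFromCreate(specName string, created *app) (*resourceOutput, error) {
	if created == nil || created.ID == "" {
		return nil, fmt.Errorf("create %q: %w", specName, errEmptyProviderID)
	}
	return &resourceOutput{Name: specName, ProviderID: created.ID}, nil
}
```

Failing here means the apply stops before state is written, which is the property the audit is looking for.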

### Layer 2 — State-heal in the driver, per-driver

Each DO driver using UUID-format identifiers validates `ref.ProviderID` shape before hitting the API. On mismatch, fall back to the driver's existing name-based lookup (`findAppByName` / `findDatabaseByName` / etc.) to resolve the real UUID, use it for the API call, and return `ResourceOutput.ProviderID` containing the healed UUID so wfctl rewrites state transparently.

Shape check is per-driver because DO's identifier formats differ:
- App Platform, Database, Cache, Certificate, Droplet, Load Balancer, VPC, Firewall, Reserved IP → UUID
- DNS Domain → domain name (no heal needed; the name IS the ID)
- Spaces → bucket name (same)

A shared `isUUIDLike(s string) bool` helper in `internal/drivers/shared.go` (or equivalent) removes per-driver duplication. Drivers that need heal call it; drivers that don't skip it.
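
One possible shape for the helper, as a sketch only. The changelog describes the check as 36 characters with hyphens at positions 8/13/18/23; requiring hex digits in the remaining positions is an added assumption here, and `isHexDigit` is a hypothetical helper name.

```go
package main

// isUUIDLike reports whether s has the canonical UUID shape:
// 36 characters, hyphens at indexes 8, 13, 18 and 23, and (as an
// assumption of this sketch) hex digits everywhere else.
func isUUIDLike(s string) bool {
	if len(s) != 36 {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch i {
		case 8, 13, 18, 23:
			if c != '-' {
				return false
			}
		default:
			if !isHexDigit(c) {
				return false
			}
		}
	}
	return true
}

// isHexDigit accepts 0-9, a-f and A-F.
func isHexDigit(c byte) bool {
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
}
```

With this shape, `isUUIDLike("bmw-staging")` is false and the UUID from BMW's state passes, so only stale names take the heal path.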

**For v0.7.8 (tonight's scope):** implement state-heal on `AppPlatformDriver` only — that's the unblocker for BMW. Ship.

**For v0.7.9 (follow-up, separate task):** audit + replicate heal across the remaining UUID drivers. Each needs its own `findByName` prerequisite (most have it per task #21 v0.7.3).

### Layer 3 — Integration-test harness

The test infrastructure that would have caught v0.7.7's regression. New file `internal/drivers/app_platform_integration_test.go` plus supporting scaffolding in `internal/drivers/testhelpers_test.go`:

**Harness components:**
- `fakeAppsClient` — embeds `godo.AppsService` interface, tracks calls (which method, with which arguments), returns configurable responses with real UUIDs
- `inMemoryState` — implementing enough of workflow's state-store surface to round-trip `ResourceState` writes and reads
- `apply(driver, spec)` helper that mimics what wfctl's `infra_apply.go` does: call `Create`/`Update`, take returned `ResourceOutput`, persist to in-memory state, return the state snapshot
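
The three components above could fit together roughly as follows. This is a heavily reduced sketch: `appsAPI`, `driver`, and the string-only signatures are hypothetical simplifications, not the godo `AppsService` interface or the real driver, and state is modelled as a plain map.

```go
package main

import "fmt"

// resourceOutput is a simplified stand-in for the workflow ResourceOutput.
type resourceOutput struct {
	Name       string
	ProviderID string
}

// appsAPI is a hypothetical, reduced client surface for this sketch.
type appsAPI interface {
	Create(name string) (id string, err error)
	Update(id string) error
}

// fakeAppsClient records every call and returns a configurable ID.
type fakeAppsClient struct {
	nextID string
	calls  []string
}

func (f *fakeAppsClient) Create(name string) (string, error) {
	f.calls = append(f.calls, "Create:"+name)
	return f.nextID, nil
}

func (f *fakeAppsClient) Update(id string) error {
	f.calls = append(f.calls, "Update:"+id)
	return nil
}

// driver is a toy driver that forwards to the client.
type driver struct{ client appsAPI }

func (d driver) Create(name string) (resourceOutput, error) {
	id, err := d.client.Create(name)
	if err != nil {
		return resourceOutput{}, err
	}
	if id == "" {
		return resourceOutput{}, fmt.Errorf("create %q: empty ID", name)
	}
	return resourceOutput{Name: name, ProviderID: id}, nil
}

func (d driver) Update(name, providerID string) (resourceOutput, error) {
	if err := d.client.Update(providerID); err != nil {
		return resourceOutput{}, err
	}
	return resourceOutput{Name: name, ProviderID: providerID}, nil
}

// inMemoryState maps resource name to the persisted ProviderID.
type inMemoryState map[string]string

// apply mimics the apply-then-persist loop: Create when the resource is
// unknown, Update otherwise, then store whatever ProviderID came back.
func apply(d driver, state inMemoryState, name string) error {
	var out resourceOutput
	var err error
	if id, ok := state[name]; ok {
		out, err = d.Update(name, id)
	} else {
		out, err = d.Create(name)
	}
	if err != nil {
		return err
	}
	state[name] = out.ProviderID
	return nil
}
```

Because `apply` persists only what the driver returns, a test can assert on both the recorded client calls and the resulting state, which is what the tests below rely on.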

**Tests:**

| Test | Seed state | Action | Assertion |
|---|---|---|---|
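| `TestAppPlatform_Create_PersistsUUIDInState` | empty | Create with mock returning `App{ID:"uuid-abc", Name:"bmw-staging"}`, where `"uuid-abc"` stands in for a canonical 36-character UUID | state has `ProviderID="uuid-abc"` |
| `TestAppPlatform_Update_UsesExistingUUID` | `ProviderID="uuid-abc"` (must be a canonical UUID in the real test; otherwise `isUUIDLike` returns false and the heal path fires) | Update | fakeClient saw `Update(ctx, "uuid-abc", ...)`; no `findByName` call |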
| `TestAppPlatform_Update_HealsStaleName` | `ProviderID="bmw-staging"` (stale name) | Update | fakeClient saw `findAppByName` → Update with real UUID; returned `ResourceOutput.ProviderID` is the UUID; warn log captured |
| `TestAppPlatform_Delete_HealsStaleName` | `ProviderID="bmw-staging"` | Delete | Same as Update but for Delete |
| `TestIsUUIDLike_TableDriven` | n/a | pure function | canonical UUIDs pass; names, empty, too-short, missing-hyphens all fail |

**Why these tests pin down the regression:**
- The "happy Create → state has UUID" test fails loudly if anyone ever adds a silent name-fallback in Create. That's the test v0.7.7 didn't have.
- The heal tests cover the defense-in-depth path that matters when state is already corrupt.

### Layer 4 (optional, out of scope) — wfctl-core generic heal hook

Considered and deferred. Reasoning:

**For:** a generic `driver.ValidateProviderID(id) bool` + wfctl calling it before every Update/Delete with Read-by-name fallback means all drivers across all providers benefit uniformly. Less per-driver boilerplate.

**Against:** `ValidateProviderID` shape varies (UUIDs for DO, ARNs for AWS, project/location/resource strings for GCP) — the contract essentially devolves to "let the driver decide." At which point pushing the heal into the driver (where it already has the name-lookup logic) is simpler and more honest. The generic hook buys only marginal deduplication and constrains future providers who may have their own ID conventions.

Verdict: skip the generic hook. Each driver owns its heal.

---

## Data flow (post-v0.7.8, BMW-like recovery)

```
wfctl infra apply --env staging
→ planResourcesForEnv → Plan: 1 action(s): update bmw-staging
→ provider.Apply(plan)
→ AppPlatformDriver.Update(ctx, ref{Name:"bmw-staging", ProviderID:"bmw-staging"}, spec)
→ resolveProviderID(ctx, ref)
→ isUUIDLike("bmw-staging") → false
→ log.Printf("warn: app platform \"bmw-staging\": ProviderID \"bmw-staging\" is not UUID-like; resolving by name (state-heal)")
→ findAppByName(ctx, "bmw-staging") → App{ID:"f8b6200c-...", ...}
→ return "f8b6200c-..."
→ client.Update(ctx, "f8b6200c-...", AppUpdateRequest{Spec:...}) → App{ID:"f8b6200c-...", ...}
→ client.CreateDeployment(ctx, "f8b6200c-...", ...)
→ return ResourceOutput{ProviderID:"f8b6200c-...", ...}
→ wfctl persists ResourceState{ProviderID:"f8b6200c-..."} (healed)
→ Deploy continues: pre_deploy migrations run, app becomes ACTIVE
```

Same flow applies to `Delete` on stale state; user never sees the "invalid uuid" error again.
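
The `resolveProviderID` step in the flow above can be sketched as follows. The `appFinder` interface, `resourceRef` type, and `staticFinder` test double are hypothetical names introduced for illustration; the real driver calls `findAppByName` on itself, and its exact signature is not shown in this document.

```go
package main

import (
	"context"
	"fmt"
	"log"
)

// resourceRef is a simplified stand-in for the driver's resource reference.
type resourceRef struct {
	Name       string
	ProviderID string
}

// appFinder is a hypothetical abstraction over the name-based lookup.
type appFinder interface {
	FindAppIDByName(ctx context.Context, name string) (string, error)
}

// isUUIDLike checks length and hyphen positions only (8, 13, 18, 23).
func isUUIDLike(s string) bool {
	if len(s) != 36 {
		return false
	}
	return s[8] == '-' && s[13] == '-' && s[18] == '-' && s[23] == '-'
}

// resolveProviderID returns ref.ProviderID when it already looks like a
// UUID. Otherwise it logs a warning and recovers the ID by name.
func resolveProviderID(ctx context.Context, finder appFinder, ref resourceRef) (string, error) {
	if isUUIDLike(ref.ProviderID) {
		return ref.ProviderID, nil
	}
	log.Printf("warn: app platform %q: ProviderID %q is not UUID-like; resolving by name (state-heal)",
		ref.Name, ref.ProviderID)
	id, err := finder.FindAppIDByName(ctx, ref.Name)
	if err != nil {
		return "", fmt.Errorf("state-heal for %q: %w", ref.Name, err)
	}
	if !isUUIDLike(id) {
		return "", fmt.Errorf("state-heal for %q: lookup returned non-UUID ID %q", ref.Name, id)
	}
	return id, nil
}

// staticFinder is a map-backed test double for appFinder.
type staticFinder map[string]string

func (f staticFinder) FindAppIDByName(_ context.Context, name string) (string, error) {
	id, ok := f[name]
	if !ok {
		return "", fmt.Errorf("no app named %q", name)
	}
	return id, nil
}
```

The caller would use the returned ID for the API request and also place it in `ResourceOutput.ProviderID`, so the healed value is what gets persisted.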

---

## Rollout

**v0.7.8 (this PR #22):**
1. Root-cause audit write-up in PR description.
2. Shared `isUUIDLike` helper in `internal/drivers/shared.go`.
3. `AppPlatformDriver.resolveProviderID` wired into Update, Delete (already drafted, commit it).
4. Integration-test harness + 5 tests from the table above.
5. CHANGELOG v0.7.8 entry describing the heal, the root-cause finding, and the integration-test harness.
6. Continue to bundle Troubleshoot work from earlier in PR #22.

**v0.7.9 (follow-up task, filed post-merge):**
- Audit + replicate heal across: database, cache, certificate, droplet, load_balancer, vpc, firewall, reserved_ip, api_gateway.
- One commit per driver, same pattern.
- Integration tests for each (parameterize the existing harness).

**BMW consumer bump:**
- Single PR: setup-wfctl v0.18.9 → v0.18.10.1 (after v0.18.10.1 tags) + workflow-plugin-digitalocean v0.7.7 → v0.7.8.
- Retries deploy. State-heal kicks in on the stale `ProviderID="bmw-staging"` during the infra apply step, state gets rewritten with the real UUID, deploy proceeds.

---

## Success criteria

- `internal/drivers/app_platform_integration_test.go` exercises the bug that produced BMW's failure and passes (TDD — write the stale-name test FIRST, watch it fail against main, then apply the heal patch).
- `v0.7.8` merges and tags.
- BMW deploy retry on v0.7.8 against the current (broken-state) staging env completes successfully: `wfctl infra apply` heals state, pre-deploy migration runs, app reaches ACTIVE, /healthz returns 200, auto-promote to prod fires on /healthz green.
- No future regression in the "happy Create produces UUID-state" path — golden test pins it.

---

## Non-goals

- wfctl-core generic `ValidateProviderID` hook (deferred, may never be needed).
- Other DO drivers' heal (v0.7.9).
- AWS / GCP / Azure equivalents (each provider audits + fixes its own drivers in its own repo).
- `wfctl infra heal` command for ad-hoc recovery (nice-to-have, not needed once driver-level heal exists).
- Retroactive state-file repair tool (not needed — first Update on stale state heals it transparently).