Skip to content

feat(app-platform): v0.7.8 — Troubleshoot impl + gRPC dispatch (Tasks 11-13)#22

Merged
intel352 merged 13 commits intomainfrom
feat/v0.7.8-troubleshoot
Apr 24, 2026
Merged

feat(app-platform): v0.7.8 — Troubleshoot impl + gRPC dispatch (Tasks 11-13)#22
intel352 merged 13 commits intomainfrom
feat/v0.7.8-troubleshoot

Conversation

@intel352
Copy link
Copy Markdown
Contributor

@intel352 intel352 commented Apr 24, 2026

Summary

  • Refines `AppPlatformDriver.Troubleshoot` to prioritise the app's current deployment slots (InProgress > Pending > Active) before falling back to paginated history, so the most actionable failure context comes first.
  • Adds `pickTroubleshootDeployments`, `buildDiagnosticFor`, `deploymentCauseAndPhase`, and `extractCause` helpers with full unit coverage.
  • Adds `ResourceDriver.Troubleshoot` gRPC dispatch in `InvokeMethod`; returns `codes.Unimplemented` for non-Troubleshooter drivers (now includes `resourceType` in message) so wfctl silently no-ops.
  • 23 new tests across 3 scenarios: slot/history ordering, phase name extraction (SummarySteps → leaf Steps → dep.Cause → terminal phase), and `extractCause` table (10 cases).
  • State-heal for stale name-as-ProviderID: `AppPlatformDriver.Update` and `Delete` now call `resolveProviderID` — if `ref.ProviderID` is not a canonical UUID, a WARN is logged and a name-based lookup heals it transparently. Returns the healed UUID in `ResourceOutput` so wfctl rewrites state on the next Apply.
  • Shared `isUUIDLike` helper in `internal/drivers/shared.go`.
  • Integration-test harness (`fakeAppsClient` + `inMemoryState` + `applySim`) exercises the full Create/Update/Delete loop; 5 integration tests pin the "stale name heals to UUID" behavior.

Root-cause (v0.7.8 addendum — BMW deploy failure 24901939350)

What happened: BMW staging `state.json` contained `ProviderID="bmw-staging"` (the app name) instead of a DO UUID. When wfctl called `Update`, the DO API rejected the `PUT /v2/apps/bmw-staging` with `400 invalid uuid`.

Where the name came from: A pre-v0.7.7 `Create` run hit a code path in `DOProvider.Apply` that substituted `spec.Name` for `ProviderID` when the godo API returned a zero-ID response (documented in v0.7.7 commit `94c9227`). That fallback has been removed.

Why v0.7.7 didn't catch it: v0.7.7's empty-ID guard only protects new Create calls — it errors out early rather than silently storing a name. But it can't retroactively fix existing state entries that already have the bad shape. BMW's state predated the guard.

What v0.7.8 does: `resolveProviderID` in `Update` and `Delete` detects the stale name shape (via `isUUIDLike`) and heals it via `findAppByName` on the fly, before any DO API call. The returned `ResourceOutput.ProviderID` carries the real UUID so wfctl transparently rewrites state. No manual teardown or state editing required.

Verified: reverting `resolveProviderID` call causes `TestUpdate_HealsStaleName` to fail with:

app_platform_stateheal_test.go:157: listCalls = 0, want ≥ 1 (findAppByName must fire during heal)
app_platform_stateheal_test.go:161: Update called with "bmw-staging", want UUID "f8b6200c-3bba-48a7-8bf1-7a3e3a885eb5" (state-heal failed)

Test has teeth.

Test plan

  • `GOWORK=off go test -race -short ./...` — all green (commit 143aae7).
  • `TestAppPlatform_Update_HealsStaleName` — stale name resolves to UUID, Update called with UUID, state rewritten, WARN log contains "state-heal".
  • `TestAppPlatform_Create_PersistsUUIDInState` — Create returns UUID from API, not spec name.
  • `TestIsUUIDLike_TableDriven` — 11 cases in `shared_test.go`.
  • Copilot findings addressed: test quality (listCalls counter, not len(listApps)), comment accuracy, error message diagnosability, CHANGELOG version text.

Copilot AI review requested due to automatic review settings April 24, 2026 19:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds App Platform “troubleshooting” support so wfctl can automatically fetch and display recent DO App Platform deployments when a health-check poll times out.

Changes:

  • Extends AppPlatformClient with ListDeployments and implements AppPlatformDriver.Troubleshoot() to convert recent deployments into []interfaces.Diagnostic.
  • Updates mocks to satisfy the new client interface and adds unit tests for Troubleshoot behaviors.
  • Updates module metadata (go.mod/go.sum) to support the workflow interface change (currently via a local replace).

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
internal/drivers/app_platform.go Adds ListDeployments to the client interface and implements Troubleshoot() to return recent deployment diagnostics.
internal/drivers/app_platform_test.go Extends the App Platform mock client and adds new unit tests for Troubleshoot().
internal/drivers/deploy_test.go Updates deploy mock client to satisfy the expanded AppPlatformClient interface.
go.mod Adds a local replace for workflow to pick up the new Troubleshooter interface during development.
go.sum Removes checksum entries associated with the prior workflow version in conjunction with module changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread go.mod Outdated
Comment on lines +229 to +230

replace github.com/GoCodeAlone/workflow => ../workflow
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The committed replace github.com/GoCodeAlone/workflow => ../workflow will break builds for anyone who doesn’t have that relative path (including CI). Replace directives should not point to local disk paths in a PR; instead update require to the released/pseudo version that contains interfaces.Troubleshooter (or use a versioned replace to a VCS ref) and drop the local replace before merge.

Suggested change
replace github.com/GoCodeAlone/workflow => ../workflow

Copilot uses AI. Check for mistakes.
Comment thread internal/drivers/app_platform.go Outdated
if ref.ProviderID == "" {
return nil, fmt.Errorf("app platform troubleshoot %q: no ProviderID available", ref.Name)
}
deps, _, err := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{PerPage: troubleshootMaxDeployments})
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the rest of the codebase’s paginated list calls, initialize godo.ListOptions with Page: 1 as well as PerPage. Other drivers (and findAppByName above) always set Page: 1, and relying on the SDK/API default page can lead to surprising results if defaults change.

Suggested change
deps, _, err := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{PerPage: troubleshootMaxDeployments})
deps, _, err := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{Page: 1, PerPage: troubleshootMaxDeployments})

Copilot uses AI. Check for mistakes.
Comment thread internal/drivers/app_platform.go Outdated
Comment on lines +240 to +248
// Surface the first non-empty error step message when cause is empty.
if cause == "" && dep.Progress != nil {
for _, step := range dep.Progress.Steps {
if step.Status == godo.DeploymentProgressStepStatus_Error && step.Reason != nil && step.Reason.Message != "" {
cause = step.Reason.Message
break
}
}
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback logic for when dep.Cause is empty (scanning dep.Progress.Steps for an error step Reason.Message) isn’t covered by the new tests. Add a unit test that sets Cause to empty and populates Progress.Steps with an error step to assert the diagnostic’s Cause is derived from the step message.

Copilot generated this review using guidance from organization custom instructions.
@intel352 intel352 changed the title feat(app-platform): v0.7.8 — Troubleshooter: auto-fetch recent deployments on health-check timeout feat(app-platform): v0.7.8 — Troubleshoot impl + gRPC dispatch (Tasks 11-13) Apr 24, 2026
@intel352 intel352 requested a review from Copilot April 24, 2026 19:46
intel352 and others added 2 commits April 24, 2026 15:49
…oyments on health-check failure

Adds AppPlatformDriver.Troubleshoot(ctx, ref, failureMsg) implementing the
new optional interfaces.Troubleshooter from workflow v0.18.10.

When wfctl's health-check poll times out, it now calls Troubleshoot and
prints up to 5 recent deployments (ID, phase, cause, timestamp) so the
user can immediately see root cause without visiting the DO console.

Changes:
- AppPlatformClient interface gains ListDeployments (matches godo.AppsService)
- Troubleshoot() fetches last 5 deployments and maps them to []Diagnostic
  — falls back to error step reason when dep.Cause is empty
- mockAppClient + deployMockClient in tests gain ListDeployments stub
- 3 new unit tests: happy path, missing ProviderID, API error

Also updates go.mod to replace github.com/GoCodeAlone/workflow with
local path during development (will be replaced with require v0.18.10
when that tag is published).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…(Tasks 11-13)

- Replace basic ListDeployments-only Troubleshoot with prioritised slot approach:
  InProgressDeployment > PendingDeployment > ActiveDeployment > historical (≤5).
  Empty ProviderID now returns (nil, nil); ListDeployments errors are best-effort.
- Add pickTroubleshootDeployments(app, hist): priority-ordered dedup candidate list.
- Add buildDiagnosticFor(dep): extracts phase/cause from Progress.SummarySteps,
  Progress.Steps, and dep.Cause in priority order.
- Add deploymentCauseAndPhase(dep): pure extraction helper.
- Add extractCause(tail): scans for Error/exit status/panic/fatal patterns with
  last-line fallback.
- Add ResourceDriver.Troubleshoot dispatch in InvokeMethod; returns
  codes.Unimplemented for non-Troubleshooter drivers so wfctl silently no-ops.
- 23 new tests: Troubleshoot scenarios, pickTroubleshootDeployments ordering/dedup,
  buildDiagnosticFor phases, extractCause table (10 cases), deploymentCauseAndPhase.
- Update 3 stale tests in app_platform_test.go to match new empty-ProviderID and
  best-effort ListDeployments semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +255 to +272
// pickTroubleshootDeployments returns up to 3 candidate deployments in priority
// order: InProgress > Pending > Active, followed by unique historical entries.
// The active-deployment slot is included only when the other two are absent,
// since an active-and-healthy deployment seldom needs troubleshooting.
func pickTroubleshootDeployments(app *godo.App, historical []*godo.Deployment) []*godo.Deployment {
seen := map[string]bool{}
var result []*godo.Deployment
add := func(dep *godo.Deployment) {
if dep == nil || dep.ID == "" || seen[dep.ID] {
return
}
seen[dep.ID] = true
result = append(result, dep)
}
add(app.InProgressDeployment)
add(app.PendingDeployment)
add(app.ActiveDeployment)
for _, dep := range historical {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the Active deployment slot is included only when InProgress/Pending are absent, but the implementation always adds app.ActiveDeployment (subject to nil/ID checks). Either adjust the logic to match the comment (to preserve space for historical deployments when an in-progress/pending slot exists), or update the comment so it accurately describes the current behavior.

Copilot uses AI. Check for mistakes.
Comment thread CHANGELOG.md
Comment on lines +20 to +22
### Changed
- Depends on workflow v0.18.10 (was v0.18.9) once tagged.
- `AppPlatformDriver.Troubleshoot`: empty `ProviderID` now returns `(nil, nil)` instead
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entry says the plugin depends on workflow v0.18.10, but go.mod in this PR still requires v0.18.6 (and currently uses a local replace). Please update the changelog to match what will actually ship in this release, or bump the module dependency to the stated version once the tag exists.

Copilot uses AI. Check for mistakes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/drivers/app_platform.go Outdated
if app == nil {
return nil, nil
}
hist, _, _ := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{PerPage: troubleshootMaxDeployments})
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListDeployments is called with &godo.ListOptions{PerPage: ...} but no Page is set. Elsewhere in this driver (findAppByName) pagination is explicitly initialized with Page: 1; leaving it as the zero value risks requesting an invalid page depending on godo/API behavior. Consider setting Page: 1 here as well.

Suggested change
hist, _, _ := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{PerPage: troubleshootMaxDeployments})
hist, _, _ := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{
Page: 1,
PerPage: troubleshootMaxDeployments,
})

Copilot uses AI. Check for mistakes.
Comment on lines +258 to +262
// pickTroubleshootDeployments returns up to 3 candidate deployments in priority
// order: InProgress > Pending > Active, followed by unique historical entries.
// The active-deployment slot is included only when the other two are absent,
// since an active-and-healthy deployment seldom needs troubleshooting.
func pickTroubleshootDeployments(app *godo.App, historical []*godo.Deployment) []*godo.Deployment {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says the Active deployment slot is included only when InProgress/Pending are absent, but the implementation below always adds app.ActiveDeployment and caps results at 3. That can crowd out historical deployments (and if Active is healthy it likely produces no diagnostic). Please either update the logic to match the comment (include Active conditionally) or adjust the comment/selection cap so you don’t lose actionable history.

Copilot uses AI. Check for mistakes.
Comment on lines +103 to +105
case "ResourceDriver.Troubleshoot":
return m.invokeDriverTroubleshoot(args)

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds a new InvokeMethod dispatch path for ResourceDriver.Troubleshoot, but there doesn’t appear to be corresponding coverage in internal/module_instance_test.go (that file already tests the other ResourceDriver dispatch cases). Please add unit tests for (1) a driver implementing interfaces.Troubleshooter returning serialized diagnostics and (2) a non-Troubleshooter driver returning codes.Unimplemented so the gRPC behavior is locked in.

Copilot generated this review using guidance from organization custom instructions.
Comment thread CHANGELOG.md Outdated

### Changed

- Depends on workflow v0.18.10 (was v0.18.9) once tagged.
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changelog claims this release depends on workflow v0.18.10 / was v0.18.9, but go.mod currently requires github.com/GoCodeAlone/workflow v0.18.6 (and this PR adds a local replace). Please align the changelog with the actual dependency state for this PR (or defer the version-bump note to the follow-up commit that updates go.mod).

Suggested change
- Depends on workflow v0.18.10 (was v0.18.9) once tagged.
- Added troubleshooting support compatible with the workflow Troubleshooter interface;
defer the workflow module version-bump note until the follow-up commit that updates
`go.mod`.

Copilot uses AI. Check for mistakes.
- Add Page: 1 to ListDeployments call (consistent with other drivers)
- Fix pickTroubleshootDeployments comment to match code (all 3 slots
  always collected, not just when InProgress/Pending absent)
- Add TestDeploymentCauseAndPhase_LeafStepFallback_NoCause: covers
  path where dep.Cause is empty and cause comes from Progress.Steps
  leaf step Reason.Message

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +258 to +277
// pickTroubleshootDeployments returns up to 3 candidate deployments in priority
// order: InProgress > Pending > Active, followed by unique historical entries.
// All three deployment slots (InProgress, Pending, Active) are collected in
// priority order; historical deployments fill remaining capacity up to 3 total.
func pickTroubleshootDeployments(app *godo.App, historical []*godo.Deployment) []*godo.Deployment {
seen := map[string]bool{}
var result []*godo.Deployment
add := func(dep *godo.Deployment) {
if dep == nil || dep.ID == "" || seen[dep.ID] {
return
}
seen[dep.ID] = true
result = append(result, dep)
}
add(app.InProgressDeployment)
add(app.PendingDeployment)
add(app.ActiveDeployment)
for _, dep := range historical {
if len(result) >= 3 {
break
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pickTroubleshootDeployments hard-caps the candidate list to 3 deployments, which conflicts with troubleshootMaxDeployments = 5 and the PR/CHANGELOG description of including slot deployments plus up to 5 historical ones. Please align the cap and the docs (either increase this limit or reduce the constant/docs accordingly).

Copilot uses AI. Check for mistakes.
Comment thread CHANGELOG.md Outdated
Comment on lines +25 to +27
- Depends on workflow v0.18.10 (was v0.18.9) once tagged.
- `AppPlatformDriver.Troubleshoot`: empty `ProviderID` now returns `(nil, nil)` instead
of an error; `ListDeployments` errors are best-effort (swallowed, slot-based data used).
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changelog claims a dependency on workflow v0.18.10 (and says it was v0.18.9), but go.mod currently requires v0.18.6 and this PR adds a local replace instead of a version bump. Please update the changelog to match what’s actually in go.mod, or complete the dependency bump once the tag exists.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +5
package drivers

// White-box tests for Troubleshoot helpers.
// Using package drivers (not drivers_test) so unexported helpers are accessible.

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other tests under internal/drivers use the external drivers_test package (e.g. internal/drivers/app_platform_test.go:1), but this new file uses package drivers to reach unexported helpers. That increases coupling to internals and makes refactors harder; prefer testing via the public Troubleshoot API from drivers_test, or expose the minimal helper behavior needed.

Copilot uses AI. Check for mistakes.
Remove local replace directive; use published v0.18.10 which ships
interfaces.Troubleshooter and interfaces.Diagnostic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +395 to +399
diagList[i] = map[string]any{
"id": d.ID,
"phase": d.Phase,
"cause": d.Cause,
"at": d.At.Format(time.RFC3339),
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invokeDriverTroubleshoot always formats d.At using RFC3339. If a driver returns a Diagnostic with a zero At (e.g., tests/terminal-phase diagnostics without CreatedAt), this will serialize as year 0001, which is misleading for users and may break consumers expecting a valid timestamp. Consider omitting at or returning an empty string when d.At.IsZero().

Suggested change
diagList[i] = map[string]any{
"id": d.ID,
"phase": d.Phase,
"cause": d.Cause,
"at": d.At.Format(time.RFC3339),
at := ""
if !d.At.IsZero() {
at = d.At.Format(time.RFC3339)
}
diagList[i] = map[string]any{
"id": d.ID,
"phase": d.Phase,
"cause": d.Cause,
"at": at,

Copilot uses AI. Check for mistakes.
Comment on lines +103 to +105
case "ResourceDriver.Troubleshoot":
return m.invokeDriverTroubleshoot(args)

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResourceDriver.Troubleshoot is newly dispatched from InvokeMethod, but internal/module_instance_test.go has no InvokeMethod coverage for this method (success path and codes.Unimplemented for non-Troubleshooter drivers). Add unit tests to lock in the plugin RPC behavior.

Copilot generated this review using guidance from organization custom instructions.
Comment on lines +275 to +280
for _, dep := range historical {
if len(result) >= 3 {
break
}
add(dep)
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pickTroubleshootDeployments limits to 3 candidates (see the len(result) >= 3 break) before buildDiagnosticFor is applied. Non-actionable slot deployments (e.g., healthy ActiveDeployment) can consume capacity and prevent additional failing historical deployments from being returned. Consider applying the cap after filtering for actionable deployments, or skipping slot deployments that would not yield a diagnostic.

Copilot uses AI. Check for mistakes.
Comment on lines +242 to +246
return nil, nil
}
hist, _, _ := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{Page: 1, PerPage: troubleshootMaxDeployments})
candidates := pickTroubleshootDeployments(app, hist)
if len(candidates) == 0 {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Troubleshoot always calls ListDeployments (history) even when the three app slot deployments already fill the candidate budget (currently capped at 3). That’s an extra API call on every troubleshoot invocation without affecting output in that case. Consider fetching history only when there’s remaining capacity after considering the slot deployments (and align troubleshootMaxDeployments/comments with the actual max returned).

Copilot uses AI. Check for mistakes.
Comment thread go.mod
Comment on lines 5 to 7
require (
github.com/GoCodeAlone/workflow v0.18.6
github.com/GoCodeAlone/workflow v0.18.10
github.com/aws/aws-sdk-go-v2 v1.41.5
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description states the workflow bump to v0.18.10 is “blocked” and will be added later, but this commit already updates go.mod to v0.18.10 (and go.sum too). Please reconcile the description vs. code (either update the PR text or defer the bump).

Copilot uses AI. Check for mistakes.
intel352 and others added 3 commits April 24, 2026 16:39
…(v0.7.8)

Three layers:
1. Root-cause audit of v0.7.7 empty-ID guard — find any silent name-fallback
   that put BMW's state into name-as-ProviderID shape in the first place.
2. Per-driver state-heal in Update/Delete (already drafted in working tree):
   isUUIDLike shape check → findAppByName fallback when stale → return
   ResourceOutput with healed UUID so wfctl rewrites state transparently.
3. Integration-test harness (new): fakeAppsClient + in-memory state store +
   5 tests including the specific "stale name heals to UUID" scenario that
   would have caught tonight's regression.

Scope v0.7.8: AppPlatformDriver only (unblocks BMW). Other UUID drivers
(database, cache, certificate, droplet, lb, vpc, firewall, reserved_ip,
api_gateway) deferred to v0.7.9 with same pattern.

Non-goals: wfctl-core generic ValidateProviderID hook (skipped — each
driver owns its heal since ID format varies per provider).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: BMW state.json contains ProviderID="bmw-staging" (the app name)
from a pre-v0.7.7 broken run. The driver Create path is correct (appOutput
always uses app.ID), but any Update/Delete with a stale non-UUID ProviderID
would send that name as the DO API path parameter, causing 400 invalid uuid.

Changes:
- Add isUUIDLike helper: 36 chars + hyphens at positions 8/13/18/23
- Add resolveProviderID: if ProviderID is not UUID-like, log a WARN and
  resolve by name via findAppByName before proceeding
- Update uses resolveProviderID so PUT /v2/apps/<id> always gets a UUID
- Delete uses resolveProviderID for the same reason
- Update existing tests to use UUID-format ProviderIDs (was "app-123")
- New app_platform_stateheal_test.go: TestIsUUIDLike_TableDriven,
  TestCreate_ProviderIDIsUUIDFromAPI, TestUpdate_HealsStaleName,
  TestUpdate_UsesUUIDWhenProviderIDIsValid, TestUpdate_HealStaleName_LookupFails,
  TestDelete_HealsStaleName, TestDelete_UsesUUIDWhenProviderIDIsValid,
  TestDelete_HealStaleName_LookupFails

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8-task implementation plan for the ProviderID state-heal work:
- Task 0: go.mod bump to v0.18.10.1 (blocks on impl-migrations' hotfix tag)
- Task 1: commit already-drafted resolveProviderID heal in AppPlatformDriver
- Task 2: extract isUUIDLike to shared.go with TableDriven test
- Task 3: root-cause audit of v0.7.7 empty-ID guard + PR-body writeup
- Task 4: integration harness (fakeAppsClient + inMemoryState + applySim)
- Task 5: 5 integration tests covering Create-persists-UUID, Update with
  valid UUID, Update heals stale name, Delete heals stale name, Update
  fails clearly when app not found
- Task 6: CHANGELOG v0.7.8 state-heal section
- Task 7: review cycle + push + Copilot + DM team-lead for merge
- Task 8: tag v0.7.8 (team-lead) + file v0.7.9 follow-up task

Each task has exact file paths, test code, commit messages, and commands.
TDD with frequent commits per team convention.

Ownership: impl-digitalocean-2 owns tasks 1-7, team-lead owns task 8 + tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/drivers/app_platform.go Outdated
}

// troubleshootMaxDeployments is the maximum number of recent historical
// deployments fetched by Troubleshoot beyond the active/in-progress slots.
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for troubleshootMaxDeployments says it's the max number of historical deployments fetched "beyond the active/in-progress slots", but pickTroubleshootDeployments caps total candidates at 3 (including slots). Either update the comment/constant name to reflect actual behavior, or adjust the selection logic if the intent is to include slot deployments plus additional history.

Suggested change
// deployments fetched by Troubleshoot beyond the active/in-progress slots.
// deployments fetched by Troubleshoot for candidate selection.

Copilot uses AI. Check for mistakes.
Comment on lines +150 to +153
// findAppByName should NOT have been called (List should not have been invoked).
if len(c.listApps) != 0 && c.updateCalledID != uuid {
t.Error("findAppByName should not be called when ProviderID is already a UUID")
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test intends to assert that findAppByName (and thus List) was not called, but it checks len(c.listApps), which is just the configured return value rather than a call-tracking signal. As written, the assertion will always pass even if List is invoked. Consider adding a listCalled counter/flag to stateHealClient.List and asserting on that instead.

Copilot uses AI. Check for mistakes.
Comment thread internal/module_instance.go Outdated
}
ts, ok := driver.(interfaces.Troubleshooter)
if !ok {
return nil, status.Error(codes.Unimplemented, "driver does not implement Troubleshooter")
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Unimplemented error message is very generic; including the resourceType would make troubleshooting easier if it ever surfaces outside the intended wfctl no-op path.

Suggested change
return nil, status.Error(codes.Unimplemented, "driver does not implement Troubleshooter")
return nil, status.Error(codes.Unimplemented, fmt.Sprintf("resource driver %q does not implement Troubleshooter", resourceType))

Copilot uses AI. Check for mistakes.
Comment thread CHANGELOG.md Outdated

### Changed

- Depends on workflow v0.18.10 (was v0.18.9) once tagged.
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog says workflow dependency was bumped from v0.18.9 → v0.18.10, but this PR’s go.mod change is v0.18.6 → v0.18.10. Please correct the prior version in the entry so release notes match the actual diff.

Suggested change
- Depends on workflow v0.18.10 (was v0.18.9) once tagged.
- Depends on workflow v0.18.10 (was v0.18.6) once tagged.

Copilot uses AI. Check for mistakes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
intel352 and others added 2 commits April 24, 2026 16:48
fakeAppsClient + inMemoryState + applySim harness mirrors wfctl's
apply→persist loop so tests exercise the full Create/Update/Delete path
against an in-memory DO API.

5 integration tests:
- TestAppPlatform_Create_PersistsUUIDInState — UUID from API stored, not name
- TestAppPlatform_Update_UsesExistingUUID — no heal fires for valid UUID
- TestAppPlatform_Update_HealsStaleName — stale name resolves to UUID, state
  rewritten, WARN log emitted (regression invariant: test FAILS without heal)
- TestAppPlatform_Delete_HealsStaleName — same heal on Delete path
- TestAppPlatform_Update_HealFails_WhenAppNotFound — clear error, not silent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expand the v0.7.8 Fixed entry with:
- Root-cause narrative (BMW deploy 24901939350, pre-v0.7.7 name substitution)
- isUUIDLike shared helper note
- WARN log / state-heal keyword callout
- Integration-test harness description (fakeAppsClient, inMemoryState, applySim, 5 tests)
- Known follow-up section for v0.7.9 (replicate heal across other UUID drivers)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Finding 1 (comment): Update troubleshootMaxDeployments doc to clarify the
cap applies to the historical listing pass, separate from slot-based candidates.

Finding 2 (CRITICAL test quality): stateHealClient.List now increments a
listCalls counter; TestUpdate_UsesUUIDWhenProviderIDIsValid and
TestUpdate_HealsStaleName both assert on listCalls (not len(listApps) which
was the configured input, not a call-tracking signal). Verified: commenting
out resolveProviderID causes both heal tests to FAIL with:
  - "listCalls = 0, want ≥ 1 (findAppByName must fire during heal)"
  - "Update called with bmw-staging, want UUID f8b6200c-... (state-heal failed)"
Tests have teeth.

Finding 3 (error message): codes.Unimplemented message now includes
resourceType for diagnosability: "resource driver %q does not implement
Troubleshooter".

Finding 4 (CHANGELOG): Corrects version bump text from v0.18.9→v0.18.10
to v0.18.6→v0.18.10.1 (actual go.mod history).

Also: bump go.mod to workflow v0.19.0-alpha.1.0.20260424204713-c02f228f1e08
(the pseudo-version for the v0.18.10.1 hotfix merge commit c02f228 on main;
no stable v0.18.10.1 tag yet in proxy, pseudo-version is the canonical ref).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +370 to +403
// invokeDriverTroubleshoot checks whether the driver implements
// interfaces.Troubleshooter and, if so, calls Troubleshoot and serialises the
// result. Returns codes.Unimplemented when the driver does not implement the
// interface so wfctl's remoteResourceDriver can silently no-op.
func (m *doModuleInstance) invokeDriverTroubleshoot(args map[string]any) (map[string]any, error) {
resourceType, _ := args["resource_type"].(string)
if resourceType == "" {
return nil, fmt.Errorf("ResourceDriver.Troubleshoot: missing resource_type arg")
}
driver, err := m.provider.ResourceDriver(resourceType)
if err != nil {
return nil, fmt.Errorf("ResourceDriver.Troubleshoot: %w", err)
}
ts, ok := driver.(interfaces.Troubleshooter)
if !ok {
return nil, status.Error(codes.Unimplemented, fmt.Sprintf("resource driver %q does not implement Troubleshooter", resourceType))
}
ref := refFromArgs(args)
failureMsg, _ := args["failure_msg"].(string)
diags, err := ts.Troubleshoot(context.Background(), ref, failureMsg)
if err != nil {
return nil, err
}
diagList := make([]any, len(diags))
for i, d := range diags {
diagList[i] = map[string]any{
"id": d.ID,
"phase": d.Phase,
"cause": d.Cause,
"at": d.At.Format(time.RFC3339),
}
}
return map[string]any{"diagnostics": diagList}, nil
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New ResourceDriver.Troubleshoot dispatch is not covered by internal/module_instance_test.go. Add tests that (1) verify dispatch to a driver implementing interfaces.Troubleshooter and correct result marshaling (including RFC3339 timestamp), and (2) verify drivers that don’t implement Troubleshooter return a codes.Unimplemented error so callers can no-op as intended.

Copilot generated this review using guidance from organization custom instructions.
Comment on lines +124 to +127
t.Errorf("Update called with ID %q, want UUID %q", c.updateCalledID, uuid)
}
// findAppByName must NOT fire when ProviderID is already a valid UUID.
if c.listCalls != 0 {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion doesn’t actually verify that findAppByName/List was not called: it checks len(c.listApps) (seed data) and also gates on updateCalledID != uuid, which is expected to be equal and makes the condition false. To make the test meaningful, track List call count in stateHealClient (or make List return an error and assert Update succeeds when ProviderID is already UUID-like).

Copilot uses AI. Check for mistakes.
Comment on lines +13 to +17
## Repo + branch

**Repo:** `/Users/jon/workspace/workflow-plugin-digitalocean`
**Branch:** `feat/v0.7.8-troubleshoot` (already open as PR #22)

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This plan doc hard-codes a contributor-specific local path (e.g. /Users/jon/...), which isn’t portable and may leak personal workstation details into the repo. Prefer relative paths or generic placeholders in committed documentation.

Copilot uses AI. Check for mistakes.
Comment on lines +3 to +4
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The document contains tool-specific meta-instructions (e.g. “For Claude…”) that aren’t part of the implementation plan itself and add noise for human readers. Consider removing these lines or moving them to a separate, non-committed prompt/runbook.

Suggested change
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Copilot uses AI. Check for mistakes.
Comment on lines +265 to +272
}
app, _, err := d.client.Get(ctx, ref.ProviderID)
if err != nil {
return nil, fmt.Errorf("troubleshoot: get app %q: %w", ref.Name, WrapGodoError(err))
}
if app == nil {
return nil, nil
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Troubleshoot uses ref.ProviderID directly for Get/ListDeployments. If state still contains a stale name-as-ProviderID (the bug class this PR heals), this will call the DO API with a non-UUID path param and likely fail with "invalid uuid", preventing diagnostics from being returned. Consider reusing resolveProviderID here (and then using the resolved UUID for both Get and ListDeployments) so Troubleshoot remains effective even before an Update/Delete has healed state.

Suggested change
}
app, _, err := d.client.Get(ctx, ref.ProviderID)
if err != nil {
return nil, fmt.Errorf("troubleshoot: get app %q: %w", ref.Name, WrapGodoError(err))
}
if app == nil {
return nil, nil
}
providerID, err := d.resolveProviderID(ctx, ref)
if err != nil {
return nil, fmt.Errorf("troubleshoot: resolve app %q provider id: %w", ref.Name, err)
}
app, _, err := d.client.Get(ctx, providerID)
if err != nil {
return nil, fmt.Errorf("troubleshoot: get app %q: %w", ref.Name, WrapGodoError(err))
}
if app == nil {
return nil, nil
}
hist, _, _ := d.client.ListDeployments(ctx, providerID, &godo.ListOptions{Page: 1, PerPage: troubleshootMaxDeployments})

Copilot uses AI. Check for mistakes.
@intel352 intel352 merged commit e24df9c into main Apr 24, 2026
3 checks passed
@intel352 intel352 deleted the feat/v0.7.8-troubleshoot branch April 24, 2026 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants