feat(app-platform): v0.7.8 — Troubleshoot impl + gRPC dispatch (Tasks 11-13)#22
feat(app-platform): v0.7.8 — Troubleshoot impl + gRPC dispatch (Tasks 11-13)#22
Conversation
There was a problem hiding this comment.
Pull request overview
Adds App Platform “troubleshooting” support so wfctl can automatically fetch and display recent DO App Platform deployments when a health-check poll times out.
Changes:
- Extends
AppPlatformClientwithListDeploymentsand implementsAppPlatformDriver.Troubleshoot()to convert recent deployments into[]interfaces.Diagnostic. - Updates mocks to satisfy the new client interface and adds unit tests for Troubleshoot behaviors.
- Updates module metadata (
go.mod/go.sum) to support the workflow interface change (currently via a local replace).
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/drivers/app_platform.go | Adds ListDeployments to the client interface and implements Troubleshoot() to return recent deployment diagnostics. |
| internal/drivers/app_platform_test.go | Extends the App Platform mock client and adds new unit tests for Troubleshoot(). |
| internal/drivers/deploy_test.go | Updates deploy mock client to satisfy the expanded AppPlatformClient interface. |
| go.mod | Adds a local replace for workflow to pick up the new Troubleshooter interface during development. |
| go.sum | Removes checksum entries associated with the prior workflow version in conjunction with module changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| replace github.com/GoCodeAlone/workflow => ../workflow |
There was a problem hiding this comment.
The committed replace github.com/GoCodeAlone/workflow => ../workflow will break builds for anyone who doesn’t have that relative path (including CI). Replace directives should not point to local disk paths in a PR; instead update require to the released/pseudo version that contains interfaces.Troubleshooter (or use a versioned replace to a VCS ref) and drop the local replace before merge.
| replace github.com/GoCodeAlone/workflow => ../workflow |
| if ref.ProviderID == "" { | ||
| return nil, fmt.Errorf("app platform troubleshoot %q: no ProviderID available", ref.Name) | ||
| } | ||
| deps, _, err := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{PerPage: troubleshootMaxDeployments}) |
There was a problem hiding this comment.
For consistency with the rest of the codebase’s paginated list calls, initialize godo.ListOptions with Page: 1 as well as PerPage. Other drivers (and findAppByName above) always set Page: 1, and relying on the SDK/API default page can lead to surprising results if defaults change.
| deps, _, err := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{PerPage: troubleshootMaxDeployments}) | |
| deps, _, err := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{Page: 1, PerPage: troubleshootMaxDeployments}) |
| // Surface the first non-empty error step message when cause is empty. | ||
| if cause == "" && dep.Progress != nil { | ||
| for _, step := range dep.Progress.Steps { | ||
| if step.Status == godo.DeploymentProgressStepStatus_Error && step.Reason != nil && step.Reason.Message != "" { | ||
| cause = step.Reason.Message | ||
| break | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The fallback logic for when dep.Cause is empty (scanning dep.Progress.Steps for an error step Reason.Message) isn’t covered by the new tests. Add a unit test that sets Cause to empty and populates Progress.Steps with an error step to assert the diagnostic’s Cause is derived from the step message.
…oyments on health-check failure Adds AppPlatformDriver.Troubleshoot(ctx, ref, failureMsg) implementing the new optional interfaces.Troubleshooter from workflow v0.18.10. When wfctl's health-check poll times out, it now calls Troubleshoot and prints up to 5 recent deployments (ID, phase, cause, timestamp) so the user can immediately see root cause without visiting the DO console. Changes: - AppPlatformClient interface gains ListDeployments (matches godo.AppsService) - Troubleshoot() fetches last 5 deployments and maps them to []Diagnostic — falls back to error step reason when dep.Cause is empty - mockAppClient + deployMockClient in tests gain ListDeployments stub - 3 new unit tests: happy path, missing ProviderID, API error Also updates go.mod to replace github.com/GoCodeAlone/workflow with local path during development (will be replaced with require v0.18.10 when that tag is published). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…(Tasks 11-13) - Replace basic ListDeployments-only Troubleshoot with prioritised slot approach: InProgressDeployment > PendingDeployment > ActiveDeployment > historical (≤5). Empty ProviderID now returns (nil, nil); ListDeployments errors are best-effort. - Add pickTroubleshootDeployments(app, hist): priority-ordered dedup candidate list. - Add buildDiagnosticFor(dep): extracts phase/cause from Progress.SummarySteps, Progress.Steps, and dep.Cause in priority order. - Add deploymentCauseAndPhase(dep): pure extraction helper. - Add extractCause(tail): scans for Error/exit status/panic/fatal patterns with last-line fallback. - Add ResourceDriver.Troubleshoot dispatch in InvokeMethod; returns codes.Unimplemented for non-Troubleshooter drivers so wfctl silently no-ops. - 23 new tests: Troubleshoot scenarios, pickTroubleshootDeployments ordering/dedup, buildDiagnosticFor phases, extractCause table (10 cases), deploymentCauseAndPhase. - Update 3 stale tests in app_platform_test.go to match new empty-ProviderID and best-effort ListDeployments semantics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // pickTroubleshootDeployments returns up to 3 candidate deployments in priority | ||
| // order: InProgress > Pending > Active, followed by unique historical entries. | ||
| // The active-deployment slot is included only when the other two are absent, | ||
| // since an active-and-healthy deployment seldom needs troubleshooting. | ||
| func pickTroubleshootDeployments(app *godo.App, historical []*godo.Deployment) []*godo.Deployment { | ||
| seen := map[string]bool{} | ||
| var result []*godo.Deployment | ||
| add := func(dep *godo.Deployment) { | ||
| if dep == nil || dep.ID == "" || seen[dep.ID] { | ||
| return | ||
| } | ||
| seen[dep.ID] = true | ||
| result = append(result, dep) | ||
| } | ||
| add(app.InProgressDeployment) | ||
| add(app.PendingDeployment) | ||
| add(app.ActiveDeployment) | ||
| for _, dep := range historical { |
There was a problem hiding this comment.
The comment says the Active deployment slot is included only when InProgress/Pending are absent, but the implementation always adds app.ActiveDeployment (subject to nil/ID checks). Either adjust the logic to match the comment (to preserve space for historical deployments when an in-progress/pending slot exists), or update the comment so it accurately describes the current behavior.
| ### Changed | ||
| - Depends on workflow v0.18.10 (was v0.18.9) once tagged. | ||
| - `AppPlatformDriver.Troubleshoot`: empty `ProviderID` now returns `(nil, nil)` instead |
There was a problem hiding this comment.
This entry says the plugin depends on workflow v0.18.10, but go.mod in this PR still requires v0.18.6 (and currently uses a local replace). Please update the changelog to match what will actually ship in this release, or bump the module dependency to the stated version once the tag exists.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
486cb1f to
379abbe
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if app == nil { | ||
| return nil, nil | ||
| } | ||
| hist, _, _ := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{PerPage: troubleshootMaxDeployments}) |
There was a problem hiding this comment.
ListDeployments is called with &godo.ListOptions{PerPage: ...} but no Page is set. Elsewhere in this driver (findAppByName) pagination is explicitly initialized with Page: 1; leaving it as the zero value risks requesting an invalid page depending on godo/API behavior. Consider setting Page: 1 here as well.
| hist, _, _ := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{PerPage: troubleshootMaxDeployments}) | |
| hist, _, _ := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{ | |
| Page: 1, | |
| PerPage: troubleshootMaxDeployments, | |
| }) |
| // pickTroubleshootDeployments returns up to 3 candidate deployments in priority | ||
| // order: InProgress > Pending > Active, followed by unique historical entries. | ||
| // The active-deployment slot is included only when the other two are absent, | ||
| // since an active-and-healthy deployment seldom needs troubleshooting. | ||
| func pickTroubleshootDeployments(app *godo.App, historical []*godo.Deployment) []*godo.Deployment { |
There was a problem hiding this comment.
The comment says the Active deployment slot is included only when InProgress/Pending are absent, but the implementation below always adds app.ActiveDeployment and caps results at 3. That can crowd out historical deployments (and if Active is healthy it likely produces no diagnostic). Please either update the logic to match the comment (include Active conditionally) or adjust the comment/selection cap so you don’t lose actionable history.
| case "ResourceDriver.Troubleshoot": | ||
| return m.invokeDriverTroubleshoot(args) | ||
|
|
There was a problem hiding this comment.
This PR adds a new InvokeMethod dispatch path for ResourceDriver.Troubleshoot, but there doesn’t appear to be corresponding coverage in internal/module_instance_test.go (that file already tests the other ResourceDriver dispatch cases). Please add unit tests for (1) a driver implementing interfaces.Troubleshooter returning serialized diagnostics and (2) a non-Troubleshooter driver returning codes.Unimplemented so the gRPC behavior is locked in.
|
|
||
| ### Changed | ||
|
|
||
| - Depends on workflow v0.18.10 (was v0.18.9) once tagged. |
There was a problem hiding this comment.
The changelog claims this release depends on workflow v0.18.10 / was v0.18.9, but go.mod currently requires github.com/GoCodeAlone/workflow v0.18.6 (and this PR adds a local replace). Please align the changelog with the actual dependency state for this PR (or defer the version-bump note to the follow-up commit that updates go.mod).
| - Depends on workflow v0.18.10 (was v0.18.9) once tagged. | |
| - Added troubleshooting support compatible with the workflow Troubleshooter interface; | |
| defer the workflow module version-bump note until the follow-up commit that updates | |
| `go.mod`. |
- Add Page: 1 to ListDeployments call (consistent with other drivers) - Fix pickTroubleshootDeployments comment to match code (all 3 slots always collected, not just when InProgress/Pending absent) - Add TestDeploymentCauseAndPhase_LeafStepFallback_NoCause: covers path where dep.Cause is empty and cause comes from Progress.Steps leaf step Reason.Message Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // pickTroubleshootDeployments returns up to 3 candidate deployments in priority | ||
| // order: InProgress > Pending > Active, followed by unique historical entries. | ||
| // All three deployment slots (InProgress, Pending, Active) are collected in | ||
| // priority order; historical deployments fill remaining capacity up to 3 total. | ||
| func pickTroubleshootDeployments(app *godo.App, historical []*godo.Deployment) []*godo.Deployment { | ||
| seen := map[string]bool{} | ||
| var result []*godo.Deployment | ||
| add := func(dep *godo.Deployment) { | ||
| if dep == nil || dep.ID == "" || seen[dep.ID] { | ||
| return | ||
| } | ||
| seen[dep.ID] = true | ||
| result = append(result, dep) | ||
| } | ||
| add(app.InProgressDeployment) | ||
| add(app.PendingDeployment) | ||
| add(app.ActiveDeployment) | ||
| for _, dep := range historical { | ||
| if len(result) >= 3 { | ||
| break |
There was a problem hiding this comment.
pickTroubleshootDeployments hard-caps the candidate list to 3 deployments, which conflicts with troubleshootMaxDeployments = 5 and the PR/CHANGELOG description of including slot deployments plus up to 5 historical ones. Please align the cap and the docs (either increase this limit or reduce the constant/docs accordingly).
| - Depends on workflow v0.18.10 (was v0.18.9) once tagged. | ||
| - `AppPlatformDriver.Troubleshoot`: empty `ProviderID` now returns `(nil, nil)` instead | ||
| of an error; `ListDeployments` errors are best-effort (swallowed, slot-based data used). |
There was a problem hiding this comment.
The changelog claims a dependency on workflow v0.18.10 (and says it was v0.18.9), but go.mod currently requires v0.18.6 and this PR adds a local replace instead of a version bump. Please update the changelog to match what’s actually in go.mod, or complete the dependency bump once the tag exists.
| package drivers | ||
|
|
||
| // White-box tests for Troubleshoot helpers. | ||
| // Using package drivers (not drivers_test) so unexported helpers are accessible. | ||
|
|
There was a problem hiding this comment.
All other tests under internal/drivers use the external drivers_test package (e.g. internal/drivers/app_platform_test.go:1), but this new file uses package drivers to reach unexported helpers. That increases coupling to internals and makes refactors harder; prefer testing via the public Troubleshoot API from drivers_test, or expose the minimal helper behavior needed.
Remove local replace directive; use published v0.18.10 which ships interfaces.Troubleshooter and interfaces.Diagnostic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| diagList[i] = map[string]any{ | ||
| "id": d.ID, | ||
| "phase": d.Phase, | ||
| "cause": d.Cause, | ||
| "at": d.At.Format(time.RFC3339), |
There was a problem hiding this comment.
invokeDriverTroubleshoot always formats d.At using RFC3339. If a driver returns a Diagnostic with a zero At (e.g., tests/terminal-phase diagnostics without CreatedAt), this will serialize as year 0001, which is misleading for users and may break consumers expecting a valid timestamp. Consider omitting at or returning an empty string when d.At.IsZero().
| diagList[i] = map[string]any{ | |
| "id": d.ID, | |
| "phase": d.Phase, | |
| "cause": d.Cause, | |
| "at": d.At.Format(time.RFC3339), | |
| at := "" | |
| if !d.At.IsZero() { | |
| at = d.At.Format(time.RFC3339) | |
| } | |
| diagList[i] = map[string]any{ | |
| "id": d.ID, | |
| "phase": d.Phase, | |
| "cause": d.Cause, | |
| "at": at, |
| case "ResourceDriver.Troubleshoot": | ||
| return m.invokeDriverTroubleshoot(args) | ||
|
|
There was a problem hiding this comment.
ResourceDriver.Troubleshoot is newly dispatched from InvokeMethod, but internal/module_instance_test.go has no InvokeMethod coverage for this method (success path and codes.Unimplemented for non-Troubleshooter drivers). Add unit tests to lock in the plugin RPC behavior.
| for _, dep := range historical { | ||
| if len(result) >= 3 { | ||
| break | ||
| } | ||
| add(dep) | ||
| } |
There was a problem hiding this comment.
pickTroubleshootDeployments limits to 3 candidates (see the len(result) >= 3 break) before buildDiagnosticFor is applied. Non-actionable slot deployments (e.g., healthy ActiveDeployment) can consume capacity and prevent additional failing historical deployments from being returned. Consider applying the cap after filtering for actionable deployments, or skipping slot deployments that would not yield a diagnostic.
| return nil, nil | ||
| } | ||
| hist, _, _ := d.client.ListDeployments(ctx, ref.ProviderID, &godo.ListOptions{Page: 1, PerPage: troubleshootMaxDeployments}) | ||
| candidates := pickTroubleshootDeployments(app, hist) | ||
| if len(candidates) == 0 { |
There was a problem hiding this comment.
Troubleshoot always calls ListDeployments (history) even when the three app slot deployments already fill the candidate budget (currently capped at 3). That’s an extra API call on every troubleshoot invocation without affecting output in that case. Consider fetching history only when there’s remaining capacity after considering the slot deployments (and align troubleshootMaxDeployments/comments with the actual max returned).
| require ( | ||
| github.com/GoCodeAlone/workflow v0.18.6 | ||
| github.com/GoCodeAlone/workflow v0.18.10 | ||
| github.com/aws/aws-sdk-go-v2 v1.41.5 |
There was a problem hiding this comment.
PR description states the workflow bump to v0.18.10 is “blocked” and will be added later, but this commit already updates go.mod to v0.18.10 (and go.sum too). Please reconcile the description vs. code (either update the PR text or defer the bump).
…(v0.7.8) Three layers: 1. Root-cause audit of v0.7.7 empty-ID guard — find any silent name-fallback that put BMW's state into name-as-ProviderID shape in the first place. 2. Per-driver state-heal in Update/Delete (already drafted in working tree): isUUIDLike shape check → findAppByName fallback when stale → return ResourceOutput with healed UUID so wfctl rewrites state transparently. 3. Integration-test harness (new): fakeAppsClient + in-memory state store + 5 tests including the specific "stale name heals to UUID" scenario that would have caught tonight's regression. Scope v0.7.8: AppPlatformDriver only (unblocks BMW). Other UUID drivers (database, cache, certificate, droplet, lb, vpc, firewall, reserved_ip, api_gateway) deferred to v0.7.9 with same pattern. Non-goals: wfctl-core generic ValidateProviderID hook (skipped — each driver owns its heal since ID format varies per provider). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: BMW state.json contains ProviderID="bmw-staging" (the app name) from a pre-v0.7.7 broken run. The driver Create path is correct (appOutput always uses app.ID), but any Update/Delete with a stale non-UUID ProviderID would send that name as the DO API path parameter, causing 400 invalid uuid. Changes: - Add isUUIDLike helper: 36 chars + hyphens at positions 8/13/18/23 - Add resolveProviderID: if ProviderID is not UUID-like, log a WARN and resolve by name via findAppByName before proceeding - Update uses resolveProviderID so PUT /v2/apps/<id> always gets a UUID - Delete uses resolveProviderID for the same reason - Update existing tests to use UUID-format ProviderIDs (was "app-123") - New app_platform_stateheal_test.go: TestIsUUIDLike_TableDriven, TestCreate_ProviderIDIsUUIDFromAPI, TestUpdate_HealsStaleName, TestUpdate_UsesUUIDWhenProviderIDIsValid, TestUpdate_HealStaleName_LookupFails, TestDelete_HealsStaleName, TestDelete_UsesUUIDWhenProviderIDIsValid, TestDelete_HealStaleName_LookupFails Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8-task implementation plan for the ProviderID state-heal work: - Task 0: go.mod bump to v0.18.10.1 (blocks on impl-migrations' hotfix tag) - Task 1: commit already-drafted resolveProviderID heal in AppPlatformDriver - Task 2: extract isUUIDLike to shared.go with TableDriven test - Task 3: root-cause audit of v0.7.7 empty-ID guard + PR-body writeup - Task 4: integration harness (fakeAppsClient + inMemoryState + applySim) - Task 5: 5 integration tests covering Create-persists-UUID, Update with valid UUID, Update heals stale name, Delete heals stale name, Update fails clearly when app not found - Task 6: CHANGELOG v0.7.8 state-heal section - Task 7: review cycle + push + Copilot + DM team-lead for merge - Task 8: tag v0.7.8 (team-lead) + file v0.7.9 follow-up task Each task has exact file paths, test code, commit messages, and commands. TDD with frequent commits per team convention. Ownership: impl-digitalocean-2 owns tasks 1-7, team-lead owns task 8 + tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| // troubleshootMaxDeployments is the maximum number of recent historical | ||
| // deployments fetched by Troubleshoot beyond the active/in-progress slots. |
There was a problem hiding this comment.
The comment for troubleshootMaxDeployments says it's the max number of historical deployments fetched "beyond the active/in-progress slots", but pickTroubleshootDeployments caps total candidates at 3 (including slots). Either update the comment/constant name to reflect actual behavior, or adjust the selection logic if the intent is to include slot deployments plus additional history.
| // deployments fetched by Troubleshoot beyond the active/in-progress slots. | |
| // deployments fetched by Troubleshoot for candidate selection. |
| // findAppByName should NOT have been called (List should not have been invoked). | ||
| if len(c.listApps) != 0 && c.updateCalledID != uuid { | ||
| t.Error("findAppByName should not be called when ProviderID is already a UUID") | ||
| } |
There was a problem hiding this comment.
This test intends to assert that findAppByName (and thus List) was not called, but it checks len(c.listApps), which is just the configured return value rather than a call-tracking signal. As written, the assertion will always pass even if List is invoked. Consider adding a listCalled counter/flag to stateHealClient.List and asserting on that instead.
| } | ||
| ts, ok := driver.(interfaces.Troubleshooter) | ||
| if !ok { | ||
| return nil, status.Error(codes.Unimplemented, "driver does not implement Troubleshooter") |
There was a problem hiding this comment.
The Unimplemented error message is very generic; including the resourceType would make troubleshooting easier if it ever surfaces outside the intended wfctl no-op path.
| return nil, status.Error(codes.Unimplemented, "driver does not implement Troubleshooter") | |
| return nil, status.Error(codes.Unimplemented, fmt.Sprintf("resource driver %q does not implement Troubleshooter", resourceType)) |
|
|
||
| ### Changed | ||
|
|
||
| - Depends on workflow v0.18.10 (was v0.18.9) once tagged. |
There was a problem hiding this comment.
Changelog says workflow dependency was bumped from v0.18.9 → v0.18.10, but this PR’s go.mod change is v0.18.6 → v0.18.10. Please correct the prior version in the entry so release notes match the actual diff.
| - Depends on workflow v0.18.10 (was v0.18.9) once tagged. | |
| - Depends on workflow v0.18.10 (was v0.18.6) once tagged. |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fakeAppsClient + inMemoryState + applySim harness mirrors wfctl's apply→persist loop so tests exercise the full Create/Update/Delete path against an in-memory DO API. 5 integration tests: - TestAppPlatform_Create_PersistsUUIDInState — UUID from API stored, not name - TestAppPlatform_Update_UsesExistingUUID — no heal fires for valid UUID - TestAppPlatform_Update_HealsStaleName — stale name resolves to UUID, state rewritten, WARN log emitted (regression invariant: test FAILS without heal) - TestAppPlatform_Delete_HealsStaleName — same heal on Delete path - TestAppPlatform_Update_HealFails_WhenAppNotFound — clear error, not silent Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expand the v0.7.8 Fixed entry with: - Root-cause narrative (BMW deploy 24901939350, pre-v0.7.7 name substitution) - isUUIDLike shared helper note - WARN log / state-heal keyword callout - Integration-test harness description (fakeAppsClient, inMemoryState, applySim, 5 tests) - Known follow-up section for v0.7.9 (replicate heal across other UUID drivers) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Finding 1 (comment): Update troubleshootMaxDeployments doc to clarify the cap applies to the historical listing pass, separate from slot-based candidates. Finding 2 (CRITICAL test quality): stateHealClient.List now increments a listCalls counter; TestUpdate_UsesUUIDWhenProviderIDIsValid and TestUpdate_HealsStaleName both assert on listCalls (not len(listApps) which was the configured input, not a call-tracking signal). Verified: commenting out resolveProviderID causes both heal tests to FAIL with: - "listCalls = 0, want ≥ 1 (findAppByName must fire during heal)" - "Update called with bmw-staging, want UUID f8b6200c-... (state-heal failed)" Tests have teeth. Finding 3 (error message): codes.Unimplemented message now includes resourceType for diagnosability: "resource driver %q does not implement Troubleshooter". Finding 4 (CHANGELOG): Corrects version bump text from v0.18.9→v0.18.10 to v0.18.6→v0.18.10.1 (actual go.mod history). Also: bump go.mod to workflow v0.19.0-alpha.1.0.20260424204713-c02f228f1e08 (the pseudo-version for the v0.18.10.1 hotfix merge commit c02f228 on main; no stable v0.18.10.1 tag yet in proxy, pseudo-version is the canonical ref). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // invokeDriverTroubleshoot checks whether the driver implements | ||
| // interfaces.Troubleshooter and, if so, calls Troubleshoot and serialises the | ||
| // result. Returns codes.Unimplemented when the driver does not implement the | ||
| // interface so wfctl's remoteResourceDriver can silently no-op. | ||
| func (m *doModuleInstance) invokeDriverTroubleshoot(args map[string]any) (map[string]any, error) { | ||
| resourceType, _ := args["resource_type"].(string) | ||
| if resourceType == "" { | ||
| return nil, fmt.Errorf("ResourceDriver.Troubleshoot: missing resource_type arg") | ||
| } | ||
| driver, err := m.provider.ResourceDriver(resourceType) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("ResourceDriver.Troubleshoot: %w", err) | ||
| } | ||
| ts, ok := driver.(interfaces.Troubleshooter) | ||
| if !ok { | ||
| return nil, status.Error(codes.Unimplemented, fmt.Sprintf("resource driver %q does not implement Troubleshooter", resourceType)) | ||
| } | ||
| ref := refFromArgs(args) | ||
| failureMsg, _ := args["failure_msg"].(string) | ||
| diags, err := ts.Troubleshoot(context.Background(), ref, failureMsg) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| diagList := make([]any, len(diags)) | ||
| for i, d := range diags { | ||
| diagList[i] = map[string]any{ | ||
| "id": d.ID, | ||
| "phase": d.Phase, | ||
| "cause": d.Cause, | ||
| "at": d.At.Format(time.RFC3339), | ||
| } | ||
| } | ||
| return map[string]any{"diagnostics": diagList}, nil | ||
| } |
There was a problem hiding this comment.
New ResourceDriver.Troubleshoot dispatch is not covered by internal/module_instance_test.go. Add tests that (1) verify dispatch to a driver implementing interfaces.Troubleshooter and correct result marshaling (including RFC3339 timestamp), and (2) verify drivers that don’t implement Troubleshooter return a codes.Unimplemented error so callers can no-op as intended.
| t.Errorf("Update called with ID %q, want UUID %q", c.updateCalledID, uuid) | ||
| } | ||
| // findAppByName must NOT fire when ProviderID is already a valid UUID. | ||
| if c.listCalls != 0 { |
There was a problem hiding this comment.
This assertion doesn’t actually verify that findAppByName/List was not called: it checks len(c.listApps) (seed data) and also gates on updateCalledID != uuid, which is expected to be equal and makes the condition false. To make the test meaningful, track List call count in stateHealClient (or make List return an error and assert Update succeeds when ProviderID is already UUID-like).
| ## Repo + branch | ||
|
|
||
| **Repo:** `/Users/jon/workspace/workflow-plugin-digitalocean` | ||
| **Branch:** `feat/v0.7.8-troubleshoot` (already open as PR #22) | ||
|
|
There was a problem hiding this comment.
This plan doc hard-codes a contributor-specific local path (e.g. /Users/jon/...), which isn’t portable and may leak personal workstation details into the repo. Prefer relative paths or generic placeholders in committed documentation.
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. | ||
|
|
There was a problem hiding this comment.
The document contains tool-specific meta-instructions (e.g. “For Claude…”) that aren’t part of the implementation plan itself and add noise for human readers. Consider removing these lines or moving them to a separate, non-committed prompt/runbook.
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. |
| } | ||
| app, _, err := d.client.Get(ctx, ref.ProviderID) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("troubleshoot: get app %q: %w", ref.Name, WrapGodoError(err)) | ||
| } | ||
| if app == nil { | ||
| return nil, nil | ||
| } |
There was a problem hiding this comment.
Troubleshoot uses ref.ProviderID directly for Get/ListDeployments. If state still contains a stale name-as-ProviderID (the bug class this PR heals), this will call the DO API with a non-UUID path param and likely fail with "invalid uuid", preventing diagnostics from being returned. Consider reusing resolveProviderID here (and then using the resolved UUID for both Get and ListDeployments) so Troubleshoot remains effective even before an Update/Delete has healed state.
| } | |
| app, _, err := d.client.Get(ctx, ref.ProviderID) | |
| if err != nil { | |
| return nil, fmt.Errorf("troubleshoot: get app %q: %w", ref.Name, WrapGodoError(err)) | |
| } | |
| if app == nil { | |
| return nil, nil | |
| } | |
| providerID, err := d.resolveProviderID(ctx, ref) | |
| if err != nil { | |
| return nil, fmt.Errorf("troubleshoot: resolve app %q provider id: %w", ref.Name, err) | |
| } | |
| app, _, err := d.client.Get(ctx, providerID) | |
| if err != nil { | |
| return nil, fmt.Errorf("troubleshoot: get app %q: %w", ref.Name, WrapGodoError(err)) | |
| } | |
| if app == nil { | |
| return nil, nil | |
| } | |
| hist, _, _ := d.client.ListDeployments(ctx, providerID, &godo.ListOptions{Page: 1, PerPage: troubleshootMaxDeployments}) |
Summary
Root-cause (v0.7.8 addendum — BMW deploy failure 24901939350)
What happened: BMW staging `state.json` contained `ProviderID="bmw-staging"` (the app name) instead of a DO UUID. When wfctl called `Update`, the DO API rejected the `PUT /v2/apps/bmw-staging` with `400 invalid uuid`.
Where the name came from: A pre-v0.7.7 `Create` run hit a code path in `DOProvider.Apply` that substituted `spec.Name` for `ProviderID` when the godo API returned a zero-ID response (documented in v0.7.7 commit `94c9227`). That fallback has been removed.
Why v0.7.7 didn't catch it: v0.7.7's empty-ID guard only protects new Create calls — it errors out early rather than silently storing a name. But it can't retroactively fix existing state entries that already have the bad shape. BMW's state predated the guard.
What v0.7.8 does: `resolveProviderID` in `Update` and `Delete` detects the stale name shape (via `isUUIDLike`) and heals it via `findAppByName` on the fly, before any DO API call. The returned `ResourceOutput.ProviderID` carries the real UUID so wfctl transparently rewrites state. No manual teardown or state editing required.
Verified: reverting `resolveProviderID` call causes `TestUpdate_HealsStaleName` to fail with:
Test has teeth.
Test plan