Skip to content

feat: v0.18.10 — wfctl deploy-log observability (Troubleshooter interface)#477

Merged
intel352 merged 15 commits intomainfrom
feat/v0.18.10-observability
Apr 24, 2026
Merged

feat: v0.18.10 — wfctl deploy-log observability (Troubleshooter interface)#477
intel352 merged 15 commits intomainfrom
feat/v0.18.10-observability

Conversation

@intel352
Copy link
Copy Markdown
Contributor

@intel352 intel352 commented Apr 24, 2026

Summary

  • Adds optional Troubleshooter interface on ResourceDriver (with Detail field on Diagnostic) so drivers can explain deploy failures without requiring the operator to open the provider console.
  • Wires wfctl to invoke Troubleshoot automatically after deploy health-check timeout or infra apply failure; renders diagnostics via CI-provider-agnostic group blocks (GHA ::group::, GitLab section_start, Jenkins/CircleCI dashed separators) and writes a Markdown summary to $GITHUB_STEP_SUMMARY.
  • Backwards compatible: drivers that don't implement Troubleshooter silently no-op via codes.Unimplemented; existing exit codes and error messages are preserved.

Commits

  • dc9d918 interfaces: add Detail field + compile-time Troubleshooter check
  • b0dce2f wfctl: Troubleshoot gRPC dispatch with Unimplemented fallback
  • 4ee8b40 wfctl: CIGroupEmitter with provider detection (GHA/GitLab/Jenkins/CircleCI)
  • b033869 wfctl: step-summary Markdown writer with golden tests
  • c44ea53 wfctl: call Troubleshoot after deploy health-check failure
  • b250e8e wfctl: call Troubleshoot after infra apply failure
  • 4658ece wfctl: e2e tests for Troubleshoot wiring
  • d708b6d docs: CHANGELOG v0.18.10

Test plan

  • go test -race -short ./... passes (all packages)
  • New unit tests: remote driver Troubleshoot dispatch (3), CI provider detection (6), step-summary golden (4), deploy troubleshoot helpers (4), infra apply wiring (2), e2e (2)
  • go vet ./... clean
  • After merge: DO plugin v0.7.8 implements AppPlatformDriver.Troubleshoot; BMW retries deploy and sees pre-deploy migration error in GHA output within 30s of timeout.

Design: docs/plans/2026-04-24-wfctl-deploy-log-observability-design.md
Follow-up: workflow-plugin-digitalocean v0.7.8 (separate PR, blocked by this tag)

🤖 Generated with Claude Code

intel352 and others added 2 commits April 24, 2026 15:27
Introduces optional Troubleshooter interface on ResourceDriver.  Drivers that
can explain their own failures (DO App Platform in v0.7.8, other clouds
later) implement Troubleshoot(ctx, ref, failureMsg) returning structured
[]Diagnostic. wfctl invokes it automatically after a health-check timeout or
deploy error and renders diagnostics in CI-provider-agnostic group blocks
plus a Markdown summary to $GITHUB_STEP_SUMMARY (and equivalents).

Scope:
- workflow v0.18.10 — interface, wfctl wiring, gRPC default-UNIMPLEMENTED,
  CI group emitter, tests.
- workflow-plugin-digitalocean v0.7.8 — AppPlatformDriver.Troubleshoot via
  godo deployments + per-phase logs.
- BMW bump — single PR after both upstream tags.

Non-goals v0.18.10: generic StreamLogs API; AWS/GCP/Azure implementations;
real-time streaming. All deferred to v0.19.0 alongside tasks #42 and #63.

Includes initial uncommitted draft of the Troubleshooter/Diagnostic types in
interfaces/iac_resource_driver.go (from impl-migrations' pre-pipeline draft;
adopted as design anchor).

Closes task #64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t on health-check timeout

Add live progress output to the plugin health-check poll loop so users
can see what's happening during long deploys instead of silence followed
by "timed out":

- Print start message: "→ health poll: waiting for <name> (timeout: 10m)"
- Every poll: emit timestamped status when message changes
- Heartbeat every 30 s when no new status arrives (silent-wait bug fix)
- On success: "✓ healthy (elapsed)"

On timeout, emit a structured failure block:
  ❌ Deploy health check timed out for "bmw-staging" after 10m0s
     Last observed status: no deployment found
     Recent deployments (via provider API):
       • dep-abc  phase=ERROR          14:45:03  — image pull failed: ...

The failure block is populated by the new optional interfaces.Troubleshooter
interface (added in b7d0ac1). Drivers that implement Troubleshoot() return
[]interfaces.Diagnostic; wfctl calls it automatically with a 15 s deadline.
Drivers that don't implement it produce no extra output (graceful degradation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
….7.8 + BMW bump)

19-task implementation plan covering:
- Phase 1 (tasks 1-10): workflow v0.18.10 — Troubleshooter interface, gRPC
  dispatch, CI emitters (GHA/GitLab/Jenkins/CircleCI), step-summary writer,
  failure-path wiring in deploy_providers + infra_apply, full test coverage,
  PR review, merge, tag.
- Phase 2 (tasks 11-17): workflow-plugin-digitalocean v0.7.8 —
  AppPlatformDriver.Troubleshoot via godo deployments + per-phase logs,
  cause extraction, gRPC dispatch, tests, PR, merge, tag.
- Phase 3 (tasks 18-19): BMW bump setup-wfctl + DO plugin pins.

Ownership: impl-migrations (workflow core), impl-digitalocean-2 (DO plugin),
impl-bmw-2 (BMW bump), team-lead (tags + BMW merge), spec+code reviewers.

TDD with frequent commits; each task has failing-test-first steps + exact
code samples. Dependency: Task 14 blocked by Task 10; Task 18 blocked by
Task 17.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional IaC driver troubleshooting interface and enhances wfctl deploy health-check polling output to improve observability during long-running deployments and timeouts.

Changes:

  • Introduce interfaces.Troubleshooter and interfaces.Diagnostic for provider-side failure diagnostics.
  • Add live progress/heartbeat logging during health-check polling and a structured timeout block.
  • Add a design document describing the intended end-to-end observability/troubleshooting UX.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
interfaces/iac_resource_driver.go Adds Diagnostic and the optional Troubleshooter interface for IaC drivers.
cmd/wfctl/deploy_providers.go Adds progress logging during health polling and attempts auto-troubleshoot on timeout.
docs/plans/2026-04-24-wfctl-deploy-log-observability-design.md Design doc for deploy-log observability and troubleshooting behavior.

var lastMsg string
var lastProgress time.Time

fmt.Printf(" → health poll: waiting for %q to become healthy (timeout: %s)\n", name, healthPollDefaultTimeout)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastProgress starts as the zero time, so time.Since(lastProgress) will be huge and the heartbeat will emit immediately (and in tests with zero intervals can spam output). Initialize lastProgress to start/time.Now() after the initial "waiting" line so the first heartbeat occurs after healthPollProgressInterval.

Suggested change
fmt.Printf(" → health poll: waiting for %q to become healthy (timeout: %s)\n", name, healthPollDefaultTimeout)
fmt.Printf(" → health poll: waiting for %q to become healthy (timeout: %s)\n", name, healthPollDefaultTimeout)
lastProgress = start

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 01f8b03: lastProgress initialised to start so the first heartbeat fires after healthPollProgressInterval, not immediately.

Comment on lines 1109 to 1112
select {
case <-pollCtx.Done():
if lastMsg != "" {
return fmt.Errorf("plugin health check %q: timed out waiting for healthy; last status: %s", name, lastMsg)
}
return fmt.Errorf("plugin health check %q: timed out waiting for healthy", name)
return healthPollTimeout(ctx, driver, ref, name, lastMsg, start)
case <-time.After(interval):
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pollCtx.Done() can be triggered by parent context cancellation as well as deadline expiry, but this path always reports a timeout and prints the timeout failure block. Consider checking pollCtx.Err() (canceled vs deadline exceeded) and return a cancellation error without the timeout messaging when the user/parent cancels.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 01f8b03: both pollCtx.Done() select case and the post-sleep pollCtx.Err() check now test errors.Is(pollCtx.Err(), context.Canceled) and return a plain "plugin health check: cancelled" error; only DeadlineExceeded falls through to the timeout-and-troubleshoot path.

Comment thread cmd/wfctl/deploy_providers.go Outdated
Comment on lines +1132 to +1135
baseErr := fmt.Sprintf("plugin health check %q: timed out waiting for healthy after %s", name, elapsed)
if lastMsg != "" {
baseErr = fmt.Sprintf("%s; last status: %s", baseErr, lastMsg)
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned error text now includes an elapsed duration ("... timed out ... after %s") whereas the previous error string did not. This conflicts with the PR description claim that existing error text for grep-based parsing remains unchanged; consider keeping the legacy error string and emitting the extra timing/context only in the printed failure block.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 01f8b03: the returned error string is restored to the legacy format (plugin health check "%s": timed out waiting for healthy; last status: ...) — no elapsed duration. Duration is now only in the human-readable os.Stderr print block.

Comment thread cmd/wfctl/deploy_providers.go Outdated
Comment on lines +1143 to +1145
// Auto-troubleshoot if the driver supports it.
if ts, ok := driver.(interfaces.Troubleshooter); ok {
tCtx, tCancel := context.WithTimeout(context.Background(), 15*time.Second)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

driver is typically a remoteResourceDriver for external IaC plugins, and that type currently only implements interfaces.ResourceDriver. As a result, this type assertion will always be false for plugin-backed providers and auto-troubleshoot will never run. To make this feature work, implement Troubleshoot on the remote driver and route it through InvokeService (and handle unimplemented gracefully).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b0dce2fremoteResourceDriver.Troubleshoot method added at line 691 of deploy_providers.go. It calls d.invoker.InvokeService("ResourceDriver.Troubleshoot", ...) and returns (nil, nil) silently on codes.Unimplemented, so the type assertion now succeeds for all plugin-backed drivers.

Comment thread cmd/wfctl/deploy_providers.go Outdated
Comment on lines +1145 to +1147
tCtx, tCancel := context.WithTimeout(context.Background(), 15*time.Second)
defer tCancel()
diags, tErr := ts.Troubleshoot(tCtx, ref, lastMsg)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The troubleshoot call ignores the caller context by using context.Background(), so it will keep running even if the parent ctx is canceled (e.g., Ctrl-C). Also, failureMsg is passed as lastMsg (status text) rather than the actual failure/timeout error string; pass the constructed failure message (e.g., baseErr) and derive the timeout context from ctx so cancellation is respected.

Suggested change
tCtx, tCancel := context.WithTimeout(context.Background(), 15*time.Second)
defer tCancel()
diags, tErr := ts.Troubleshoot(tCtx, ref, lastMsg)
tCtx, tCancel := context.WithTimeout(ctx, 15*time.Second)
defer tCancel()
diags, tErr := ts.Troubleshoot(tCtx, ref, baseErr)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c44ea53healthPollTimeout now calls troubleshootAfterFailure(ctx, ...) (parent context, not context.Background()) with a 30s bounded timeout, and passes errors.New(baseErr) (the full timeout error string) as the failure message — not lastMsg.

Comment on lines +33 to +41
```go
// Diagnostic is a single troubleshooting finding returned by a Troubleshooter.
type Diagnostic struct {
ID string `json:"id"` // provider-side id (deployment id, task arn)
Phase string `json:"phase"` // terminal or current phase
Cause string `json:"cause"` // human-readable root cause
At time.Time `json:"at"` // when the event occurred
Detail string `json:"detail,omitempty"` // optional verbose tail (log excerpt, stack)
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design doc's Diagnostic example includes a Detail field, but the actual interfaces.Diagnostic added in this PR does not. Either update the doc snippet to match the shipped interface or add Detail to the interface (and ensure it’s plumbed through the wfctl rendering).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in dc9d918Detail string \json:"detail,omitempty"`added tointerfaces.Diagnostic`. Design doc snippet now matches the shipped interface.

Comment on lines +79 to +82
After `pluginDeployProvider.Deploy()` returns error OR `HealthCheck` polling times out, check whether the driver implements Troubleshooter (via gRPC probe). If yes, call `Troubleshoot(ctx, ref, originalErr.Error())` with a 30-second timeout, render diagnostics, then propagate the original error.

**Hook 2 — `wfctl infra apply` (cmd/wfctl/infra_apply.go)**
When `driver.Apply()` or `HealthCheck()` fails, same logic.
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc specifies calling Troubleshoot with a 30-second timeout, but wfctl currently uses a 15-second timeout. Please align the doc with the implementation (or vice versa) to avoid confusion for driver/plugin authors.

Suggested change
After `pluginDeployProvider.Deploy()` returns error OR `HealthCheck` polling times out, check whether the driver implements Troubleshooter (via gRPC probe). If yes, call `Troubleshoot(ctx, ref, originalErr.Error())` with a 30-second timeout, render diagnostics, then propagate the original error.
**Hook 2 — `wfctl infra apply` (cmd/wfctl/infra_apply.go)**
When `driver.Apply()` or `HealthCheck()` fails, same logic.
After `pluginDeployProvider.Deploy()` returns error OR `HealthCheck` polling times out, check whether the driver implements Troubleshooter (via gRPC probe). If yes, call `Troubleshoot(ctx, ref, originalErr.Error())` with a 15-second timeout, render diagnostics, then propagate the original error.
**Hook 2 — `wfctl infra apply` (cmd/wfctl/infra_apply.go)**
When `driver.Apply()` or `HealthCheck()` fails, check whether the driver implements Troubleshooter (via gRPC probe). If yes, call `Troubleshoot(ctx, ref, originalErr.Error())` with a 15-second timeout, render diagnostics, then propagate the original error.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change needed — code (c44ea53) and design doc both specify 30s. The 15s was only in the original inline code in bb69de8, which was superseded in c44ea53.

Comment thread cmd/wfctl/deploy_providers.go Outdated
Comment on lines +1127 to +1130
// healthPollTimeout builds the timeout error and auto-troubleshoots via the
// driver's Troubleshooter implementation (if any) before returning.
func healthPollTimeout(ctx context.Context, driver interfaces.ResourceDriver, ref interfaces.ResourceRef, name, lastMsg string, start time.Time) error {
elapsed := time.Since(start).Round(time.Second)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New behaviors in pollUntilHealthy/healthPollTimeout (heartbeat progress output and the Troubleshooter path) are not covered by the existing health-check tests in this package. Adding unit tests that assert (a) Troubleshoot is invoked on timeout and (b) it’s skipped when not supported would help prevent regressions.

Copilot generated this review using guidance from organization custom instructions.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests added across Tasks 5-7: deploy_providers_troubleshoot_test.go (TestEmitDiagnostics_WritesGroupBlock, TestEmitDiagnostics_EmptyIsNoop, TestTroubleshootAfterFailure_Timeout, TestTroubleshootAfterFailure_NonTroubleshooterSkips), infra_apply_troubleshoot_test.go (TestInfraApply_EmitsDiagnosticsOnFailure, TestInfraApply_NonTroubleshooterNocrash), and e2e_deploy_troubleshoot_test.go (TestE2EDeployFailure_EmitsFullTroubleshootBlock, TestE2ENoTroubleshooter_NoCrash).

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 24, 2026

⏱ Benchmark Results

No significant performance regressions detected.

benchstat comparison (baseline → PR)
## benchstat: baseline → PR
baseline-bench.txt:245: parsing iteration count: invalid syntax
baseline-bench.txt:432525: parsing iteration count: invalid syntax
baseline-bench.txt:792958: parsing iteration count: invalid syntax
baseline-bench.txt:1204415: parsing iteration count: invalid syntax
baseline-bench.txt:1605123: parsing iteration count: invalid syntax
baseline-bench.txt:1965250: parsing iteration count: invalid syntax
benchmark-results.txt:245: parsing iteration count: invalid syntax
benchmark-results.txt:358302: parsing iteration count: invalid syntax
benchmark-results.txt:653470: parsing iteration count: invalid syntax
benchmark-results.txt:975305: parsing iteration count: invalid syntax
benchmark-results.txt:1281390: parsing iteration count: invalid syntax
benchmark-results.txt:1594774: parsing iteration count: invalid syntax
goos: linux
goarch: amd64
pkg: github.com/GoCodeAlone/workflow/dynamic
cpu: AMD EPYC 9V74 80-Core Processor                
                            │ baseline-bench.txt │
                            │       sec/op       │
InterpreterCreation-4               3.592m ± 78%
ComponentLoad-4                     2.697m ±  0%
ComponentExecute-4                  1.400µ ±  1%
PoolContention/workers-1-4          784.9n ±  2%
PoolContention/workers-2-4          785.2n ±  1%
PoolContention/workers-4-4          785.4n ±  1%
PoolContention/workers-8-4          784.3n ±  1%
PoolContention/workers-16-4         785.6n ±  1%
ComponentLifecycle-4                2.700m ±  2%
SourceValidation-4                  1.613µ ±  1%
RegistryConcurrent-4                586.8n ±  4%
LoaderLoadFromString-4              2.726m ±  0%
geomean                             13.21µ

                            │ baseline-bench.txt │
                            │        B/op        │
InterpreterCreation-4               2.027Mi ± 0%
ComponentLoad-4                     2.180Mi ± 0%
ComponentExecute-4                  1.203Ki ± 0%
PoolContention/workers-1-4          1.203Ki ± 0%
PoolContention/workers-2-4          1.203Ki ± 0%
PoolContention/workers-4-4          1.203Ki ± 0%
PoolContention/workers-8-4          1.203Ki ± 0%
PoolContention/workers-16-4         1.203Ki ± 0%
ComponentLifecycle-4                2.183Mi ± 0%
SourceValidation-4                  1.984Ki ± 0%
RegistryConcurrent-4                1.133Ki ± 0%
LoaderLoadFromString-4              2.182Mi ± 0%
geomean                             15.25Ki

                            │ baseline-bench.txt │
                            │     allocs/op      │
InterpreterCreation-4                15.68k ± 0%
ComponentLoad-4                      18.02k ± 0%
ComponentExecute-4                    25.00 ± 0%
PoolContention/workers-1-4            25.00 ± 0%
PoolContention/workers-2-4            25.00 ± 0%
PoolContention/workers-4-4            25.00 ± 0%
PoolContention/workers-8-4            25.00 ± 0%
PoolContention/workers-16-4           25.00 ± 0%
ComponentLifecycle-4                 18.07k ± 0%
SourceValidation-4                    32.00 ± 0%
RegistryConcurrent-4                  2.000 ± 0%
LoaderLoadFromString-4               18.06k ± 0%
geomean                               183.3

cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                            │ benchmark-results.txt │
                            │        sec/op         │
InterpreterCreation-4                 3.662m ± 166%
ComponentLoad-4                       3.541m ±   1%
ComponentExecute-4                    1.910µ ±   1%
PoolContention/workers-1-4            1.204µ ±   3%
PoolContention/workers-2-4            1.202µ ±   1%
PoolContention/workers-4-4            1.205µ ±   1%
PoolContention/workers-8-4            1.213µ ±   1%
PoolContention/workers-16-4           1.212µ ±   0%
ComponentLifecycle-4                  3.667m ±   3%
SourceValidation-4                    2.287µ ±   2%
RegistryConcurrent-4                  961.9n ±   5%
LoaderLoadFromString-4                3.749m ±   1%
geomean                               18.77µ

                            │ benchmark-results.txt │
                            │         B/op          │
InterpreterCreation-4                  2.027Mi ± 0%
ComponentLoad-4                        2.180Mi ± 0%
ComponentExecute-4                     1.203Ki ± 0%
PoolContention/workers-1-4             1.203Ki ± 0%
PoolContention/workers-2-4             1.203Ki ± 0%
PoolContention/workers-4-4             1.203Ki ± 0%
PoolContention/workers-8-4             1.203Ki ± 0%
PoolContention/workers-16-4            1.203Ki ± 0%
ComponentLifecycle-4                   2.183Mi ± 0%
SourceValidation-4                     1.984Ki ± 0%
RegistryConcurrent-4                   1.133Ki ± 0%
LoaderLoadFromString-4                 2.182Mi ± 0%
geomean                                15.25Ki

                            │ benchmark-results.txt │
                            │       allocs/op       │
InterpreterCreation-4                   15.68k ± 0%
ComponentLoad-4                         18.02k ± 0%
ComponentExecute-4                       25.00 ± 0%
PoolContention/workers-1-4               25.00 ± 0%
PoolContention/workers-2-4               25.00 ± 0%
PoolContention/workers-4-4               25.00 ± 0%
PoolContention/workers-8-4               25.00 ± 0%
PoolContention/workers-16-4              25.00 ± 0%
ComponentLifecycle-4                    18.07k ± 0%
SourceValidation-4                       32.00 ± 0%
RegistryConcurrent-4                     2.000 ± 0%
LoaderLoadFromString-4                  18.06k ± 0%
geomean                                  183.3

pkg: github.com/GoCodeAlone/workflow/middleware
cpu: AMD EPYC 9V74 80-Core Processor                
                                  │ baseline-bench.txt │
                                  │       sec/op       │
CircuitBreakerDetection-4                  233.3n ± 5%
CircuitBreakerExecution_Success-4          17.59n ± 0%
CircuitBreakerExecution_Failure-4          55.15n ± 0%
geomean                                    60.94n

                                  │ baseline-bench.txt │
                                  │        B/op        │
CircuitBreakerDetection-4                 144.0 ± 0%
CircuitBreakerExecution_Success-4         0.000 ± 0%
CircuitBreakerExecution_Failure-4         0.000 ± 0%
geomean                                              ¹
¹ summaries must be >0 to compute geomean

                                  │ baseline-bench.txt │
                                  │     allocs/op      │
CircuitBreakerDetection-4                 1.000 ± 0%
CircuitBreakerExecution_Success-4         0.000 ± 0%
CircuitBreakerExecution_Failure-4         0.000 ± 0%
geomean                                              ¹
¹ summaries must be >0 to compute geomean

cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                  │ benchmark-results.txt │
                                  │        sec/op         │
CircuitBreakerDetection-4                     455.5n ± 5%
CircuitBreakerExecution_Success-4             59.73n ± 0%
CircuitBreakerExecution_Failure-4             64.93n ± 0%
geomean                                       120.9n

                                  │ benchmark-results.txt │
                                  │         B/op          │
CircuitBreakerDetection-4                    144.0 ± 0%
CircuitBreakerExecution_Success-4            0.000 ± 0%
CircuitBreakerExecution_Failure-4            0.000 ± 0%
geomean                                                 ¹
¹ summaries must be >0 to compute geomean

                                  │ benchmark-results.txt │
                                  │       allocs/op       │
CircuitBreakerDetection-4                    1.000 ± 0%
CircuitBreakerExecution_Success-4            0.000 ± 0%
CircuitBreakerExecution_Failure-4            0.000 ± 0%
geomean                                                 ¹
¹ summaries must be >0 to compute geomean

pkg: github.com/GoCodeAlone/workflow/module
cpu: AMD EPYC 9V74 80-Core Processor                
                                 │ baseline-bench.txt │
                                 │       sec/op       │
JQTransform_Simple-4                     624.4n ± 29%
JQTransform_ObjectConstruction-4         1.060µ ±  0%
JQTransform_ArraySelect-4                2.559µ ±  0%
JQTransform_Complex-4                    32.00µ ±  1%
JQTransform_Throughput-4                 1.284µ ±  0%
SSEPublishDelivery-4                     50.13n ±  1%
geomean                                  1.231µ

                                 │ baseline-bench.txt │
                                 │        B/op        │
JQTransform_Simple-4                   1.273Ki ± 0%
JQTransform_ObjectConstruction-4       1.773Ki ± 0%
JQTransform_ArraySelect-4              2.625Ki ± 0%
JQTransform_Complex-4                  16.22Ki ± 0%
JQTransform_Throughput-4               1.984Ki ± 0%
SSEPublishDelivery-4                     0.000 ± 0%
geomean                                             ¹
¹ summaries must be >0 to compute geomean

                                 │ baseline-bench.txt │
                                 │     allocs/op      │
JQTransform_Simple-4                     10.00 ± 0%
JQTransform_ObjectConstruction-4         15.00 ± 0%
JQTransform_ArraySelect-4                30.00 ± 0%
JQTransform_Complex-4                    324.0 ± 0%
JQTransform_Throughput-4                 17.00 ± 0%
SSEPublishDelivery-4                     0.000 ± 0%
geomean                                             ¹
¹ summaries must be >0 to compute geomean

cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                 │ benchmark-results.txt │
                                 │        sec/op         │
JQTransform_Simple-4                        989.0n ± 22%
JQTransform_ObjectConstruction-4            1.498µ ±  1%
JQTransform_ArraySelect-4                   3.244µ ±  1%
JQTransform_Complex-4                       35.50µ ±  1%
JQTransform_Throughput-4                    1.819µ ±  2%
SSEPublishDelivery-4                        76.78n ±  3%
geomean                                     1.696µ

                                 │ benchmark-results.txt │
                                 │         B/op          │
JQTransform_Simple-4                      1.273Ki ± 0%
JQTransform_ObjectConstruction-4          1.773Ki ± 0%
JQTransform_ArraySelect-4                 2.625Ki ± 0%
JQTransform_Complex-4                     16.22Ki ± 0%
JQTransform_Throughput-4                  1.984Ki ± 0%
SSEPublishDelivery-4                        0.000 ± 0%
geomean                                                ¹
¹ summaries must be >0 to compute geomean

                                 │ benchmark-results.txt │
                                 │       allocs/op       │
JQTransform_Simple-4                        10.00 ± 0%
JQTransform_ObjectConstruction-4            15.00 ± 0%
JQTransform_ArraySelect-4                   30.00 ± 0%
JQTransform_Complex-4                       324.0 ± 0%
JQTransform_Throughput-4                    17.00 ± 0%
SSEPublishDelivery-4                        0.000 ± 0%
geomean                                                ¹
¹ summaries must be >0 to compute geomean

pkg: github.com/GoCodeAlone/workflow/schema
cpu: AMD EPYC 9V74 80-Core Processor                
                                    │ baseline-bench.txt │
                                    │       sec/op       │
SchemaValidation_Simple-4                    826.8n ± 2%
SchemaValidation_AllFields-4                 1.254µ ± 1%
SchemaValidation_FormatValidation-4          1.211µ ± 2%
SchemaValidation_ManySchemas-4               1.211µ ± 2%
geomean                                      1.110µ

                                    │ baseline-bench.txt │
                                    │        B/op        │
SchemaValidation_Simple-4                   0.000 ± 0%
SchemaValidation_AllFields-4                0.000 ± 0%
SchemaValidation_FormatValidation-4         0.000 ± 0%
SchemaValidation_ManySchemas-4              0.000 ± 0%
geomean                                                ¹
¹ summaries must be >0 to compute geomean

                                    │ baseline-bench.txt │
                                    │     allocs/op      │
SchemaValidation_Simple-4                   0.000 ± 0%
SchemaValidation_AllFields-4                0.000 ± 0%
SchemaValidation_FormatValidation-4         0.000 ± 0%
SchemaValidation_ManySchemas-4              0.000 ± 0%
geomean                                                ¹
¹ summaries must be >0 to compute geomean

cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                    │ benchmark-results.txt │
                                    │        sec/op         │
SchemaValidation_Simple-4                      1.020µ ± 18%
SchemaValidation_AllFields-4                   1.502µ ±  7%
SchemaValidation_FormatValidation-4            1.471µ ±  1%
SchemaValidation_ManySchemas-4                 1.477µ ±  6%
geomean                                        1.350µ

                                    │ benchmark-results.txt │
                                    │         B/op          │
SchemaValidation_Simple-4                      0.000 ± 0%
SchemaValidation_AllFields-4                   0.000 ± 0%
SchemaValidation_FormatValidation-4            0.000 ± 0%
SchemaValidation_ManySchemas-4                 0.000 ± 0%
geomean                                                   ¹
¹ summaries must be >0 to compute geomean

                                    │ benchmark-results.txt │
                                    │       allocs/op       │
SchemaValidation_Simple-4                      0.000 ± 0%
SchemaValidation_AllFields-4                   0.000 ± 0%
SchemaValidation_FormatValidation-4            0.000 ± 0%
SchemaValidation_ManySchemas-4                 0.000 ± 0%
geomean                                                   ¹
¹ summaries must be >0 to compute geomean

pkg: github.com/GoCodeAlone/workflow/store
cpu: AMD EPYC 9V74 80-Core Processor                
                                   │ baseline-bench.txt │
                                   │       sec/op       │
EventStoreAppend_InMemory-4                762.0n ±  4%
EventStoreAppend_SQLite-4                  8.402m ± 73%
GetTimeline_InMemory/events-10-4           7.556µ ± 33%
GetTimeline_InMemory/events-50-4           41.93µ ±  1%
GetTimeline_InMemory/events-100-4          83.60µ ±  1%
GetTimeline_InMemory/events-500-4          425.6µ ±  1%
GetTimeline_InMemory/events-1000-4         869.2µ ±  4%
GetTimeline_SQLite/events-10-4             65.24µ ±  3%
GetTimeline_SQLite/events-50-4             171.3µ ±  3%
GetTimeline_SQLite/events-100-4            296.7µ ±  1%
GetTimeline_SQLite/events-500-4            1.299m ±  0%
GetTimeline_SQLite/events-1000-4           2.535m ±  0%
geomean                                    172.3µ

                                   │ baseline-bench.txt │
                                   │        B/op        │
EventStoreAppend_InMemory-4                 801.5 ±  8%
EventStoreAppend_SQLite-4                 1.990Ki ± 27%
GetTimeline_InMemory/events-10-4          7.953Ki ±  0%
GetTimeline_InMemory/events-50-4          46.62Ki ±  0%
GetTimeline_InMemory/events-100-4         94.48Ki ±  0%
GetTimeline_InMemory/events-500-4         472.8Ki ±  0%
GetTimeline_InMemory/events-1000-4        944.3Ki ±  0%
GetTimeline_SQLite/events-10-4            16.74Ki ±  0%
GetTimeline_SQLite/events-50-4            87.14Ki ±  0%
GetTimeline_SQLite/events-100-4           175.4Ki ±  0%
GetTimeline_SQLite/events-500-4           846.1Ki ±  0%
GetTimeline_SQLite/events-1000-4          1.639Mi ±  0%
geomean                                   67.44Ki

                                   │ baseline-bench.txt │
                                   │     allocs/op      │
EventStoreAppend_InMemory-4                  7.000 ± 0%
EventStoreAppend_SQLite-4                    52.00 ± 2%
GetTimeline_InMemory/events-10-4             125.0 ± 0%
GetTimeline_InMemory/events-50-4             653.0 ± 0%
GetTimeline_InMemory/events-100-4           1.306k ± 0%
GetTimeline_InMemory/events-500-4           6.514k ± 0%
GetTimeline_InMemory/events-1000-4          13.02k ± 0%
GetTimeline_SQLite/events-10-4               382.0 ± 0%
GetTimeline_SQLite/events-50-4              1.852k ± 0%
GetTimeline_SQLite/events-100-4             3.681k ± 0%
GetTimeline_SQLite/events-500-4             18.54k ± 0%
GetTimeline_SQLite/events-1000-4            37.29k ± 0%
geomean                                     1.160k

cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                   │ benchmark-results.txt │
                                   │        sec/op         │
EventStoreAppend_InMemory-4                   1.154µ ±  7%
EventStoreAppend_SQLite-4                     1.018m ±  5%
GetTimeline_InMemory/events-10-4              13.66µ ±  4%
GetTimeline_InMemory/events-50-4              66.68µ ± 16%
GetTimeline_InMemory/events-100-4             118.5µ ±  1%
GetTimeline_InMemory/events-500-4             604.5µ ±  1%
GetTimeline_InMemory/events-1000-4            1.229m ±  1%
GetTimeline_SQLite/events-10-4                83.25µ ±  1%
GetTimeline_SQLite/events-50-4                239.7µ ±  1%
GetTimeline_SQLite/events-100-4               436.6µ ±  2%
GetTimeline_SQLite/events-500-4               1.925m ±  2%
GetTimeline_SQLite/events-1000-4              3.791m ±  2%
geomean                                       206.4µ

                                   │ benchmark-results.txt │
                                   │         B/op          │
EventStoreAppend_InMemory-4                     821.0 ± 8%
EventStoreAppend_SQLite-4                     1.985Ki ± 2%
GetTimeline_InMemory/events-10-4              7.953Ki ± 0%
GetTimeline_InMemory/events-50-4              46.62Ki ± 0%
GetTimeline_InMemory/events-100-4             94.48Ki ± 0%
GetTimeline_InMemory/events-500-4             472.8Ki ± 0%
GetTimeline_InMemory/events-1000-4            944.3Ki ± 0%
GetTimeline_SQLite/events-10-4                16.74Ki ± 0%
GetTimeline_SQLite/events-50-4                87.14Ki ± 0%
GetTimeline_SQLite/events-100-4               175.4Ki ± 0%
GetTimeline_SQLite/events-500-4               846.1Ki ± 0%
GetTimeline_SQLite/events-1000-4              1.639Mi ± 0%
geomean                                       67.56Ki

                                   │ benchmark-results.txt │
                                   │       allocs/op       │
EventStoreAppend_InMemory-4                     7.000 ± 0%
EventStoreAppend_SQLite-4                       53.00 ± 0%
GetTimeline_InMemory/events-10-4                125.0 ± 0%
GetTimeline_InMemory/events-50-4                653.0 ± 0%
GetTimeline_InMemory/events-100-4              1.306k ± 0%
GetTimeline_InMemory/events-500-4              6.514k ± 0%
GetTimeline_InMemory/events-1000-4             13.02k ± 0%
GetTimeline_SQLite/events-10-4                  382.0 ± 0%
GetTimeline_SQLite/events-50-4                 1.852k ± 0%
GetTimeline_SQLite/events-100-4                3.681k ± 0%
GetTimeline_SQLite/events-500-4                18.54k ± 0%
GetTimeline_SQLite/events-1000-4               37.29k ± 0%
geomean                                        1.162k

Benchmarks run with go test -bench=. -benchmem -count=6.
Regressions ≥ 20% are flagged. Results compared via benchstat.

@intel352 intel352 changed the title feat(wfctl): v0.18.10 — observability pass (progress logging + auto-troubleshoot) feat: v0.18.10 — wfctl deploy-log observability (Troubleshooter interface) Apr 24, 2026
@intel352 intel352 requested a review from Copilot April 24, 2026 19:50
- lastProgress initialized to start (not zero) so heartbeat fires after
  healthPollProgressInterval, not immediately on first loop iteration
- pollCtx.Done() now checks Err() to distinguish context.Canceled
  (parent Ctrl-C / pipeline abort) from context.DeadlineExceeded
  (our own timeout), returning a plain cancelled error in the former case
- healthPollTimeout restores legacy returned error text format
  (no 'after N' suffix) for grep-based parser compatibility; elapsed
  duration is now only in the human-readable stderr print block
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 10 comments.

Comment on lines +15 to +18
t.Setenv("GITHUB_ACTIONS", "true")
summary := t.TempDir() + "/summary.md"
t.Setenv("GITHUB_STEP_SUMMARY", summary)

Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test sets GITHUB_STEP_SUMMARY but never asserts that a summary file was written/updated. Given the PR goal includes writing a Markdown step summary, it would be good to assert the file contains the expected header/diagnostic content (or, if summaries are intentionally not written here, drop the env var setup).

Copilot generated this review using guidance from organization custom instructions.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The env var is set for CI provider detection (so detectCIProvider() returns githubEmitter), not because this test writes a summary. WriteStepSummary is not wired into the failure path yet — that's v0.18.11 scope (task #71). Once wired, this test will get a summary assertion.

Comment on lines +55 to +67
func TestGitlabEmitter_GroupMarkers(t *testing.T) {
var buf bytes.Buffer
e := gitlabEmitter{}
e.GroupStart(&buf, "my-section")
e.GroupEnd(&buf)
out := buf.String()
if !strings.Contains(out, "section_start") {
t.Errorf("missing section_start: %q", out)
}
if !strings.Contains(out, "section_end") {
t.Errorf("missing section_end: %q", out)
}
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestGitlabEmitter_GroupMarkers only checks that output contains section_start/section_end, but it won’t catch mismatched section IDs (which prevents GitLab from closing the fold). Consider asserting that the same section ID appears in both markers (e.g., via regex capture) to lock in correct behavior.

Copilot generated this review using guidance from organization custom instructions.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1f6ae4b. TestGitlabEmitter_GroupMarkers now uses regexp to capture the section IDs from both markers and asserts they are equal.

Comment thread cmd/wfctl/ci_output.go Outdated
Comment on lines +50 to +57
func (g gitlabEmitter) GroupStart(w io.Writer, name string) {
id := fmt.Sprintf("wfctl_%d", time.Now().UnixNano())
fmt.Fprintf(w, "\x1b[0Ksection_start:%d:%s\r\x1b[0K%s\n", time.Now().Unix(), id, name)
}
func (g gitlabEmitter) GroupEnd(w io.Writer) {
id := fmt.Sprintf("wfctl_%d", time.Now().UnixNano())
fmt.Fprintf(w, "\x1b[0Ksection_end:%d:%s\r\x1b[0K\n", time.Now().Unix(), id)
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitLab section_start/section_end markers must use the same section ID; this implementation generates a new random ID in GroupEnd, so sections won’t close properly in GitLab UI. Store the ID created in GroupStart (e.g., in the emitter struct with pointer receiver) and reuse it in GroupEnd.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1f6ae4b. gitlabEmitter now uses pointer receivers and stores sectionID in GroupStart, reusing it in GroupEnd. detectCIProvider() returns &gitlabEmitter{}. Test updated to verify matching IDs via regex capture.

Comment on lines +1176 to +1179
}

// healthPollTimeout builds the timeout error, emits a structured failure block,
// and auto-troubleshoots via the driver's Troubleshooter (if any) before returning.
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the timeout error text from the previous "timed out waiting for healthy" form to include an elapsed suffix ("... after %s"). The PR description says existing error messages are preserved; if callers parse this message, this is a breaking change. Consider keeping the returned error string exactly as before and printing elapsed time only in the additional stderr diagnostics/progress output.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already fixed in 01f8b03. healthPollTimeout now returns the legacy error text plugin health check %q: timed out waiting for healthy (line 1185 in the current diff). The elapsed duration after %s only appears in the human-readable os.Stderr print block, not in the returned error.

Comment on lines +1188 to +1191
}

// Print structured failure block (elapsed only in the human-readable output).
fmt.Fprintf(os.Stderr, "\n❌ Deploy health check timed out for %q after %s\n", name, elapsed)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WriteStepSummary / the step-summary golden tests are added, but the deploy failure path here only calls troubleshootAfterFailure and never writes a summary file. This doesn’t match the PR description/design goal (Markdown summary to $GITHUB_STEP_SUMMARY). Consider wiring WriteStepSummary(detectCIProvider(), SummaryInput{...}) into the same failure/success paths where diagnostics are emitted.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally deferred to v0.18.11 per team decision. WriteStepSummary infrastructure is landed here; wiring it into deploy/infra failure paths is tracked in task #71 (wfctl v0.18.11: expanded observability).

Comment on lines +15 to +38
// applyFailProvider is an IaCProvider that fails Apply and optionally
// implements Troubleshooter.
type applyFailProvider struct {
applyCapture
applyErr error
// troubleshoot fields
isTroubleshooter bool
diags []interfaces.Diagnostic
tsErr error
tsCalls int
}

func (p *applyFailProvider) Apply(_ context.Context, plan *interfaces.IaCPlan) (*interfaces.ApplyResult, error) {
p.mu.Lock()
defer p.mu.Unlock()
p.applyCalled = true
p.appliedPlan = plan
return nil, p.applyErr
}

func (p *applyFailProvider) Troubleshoot(_ context.Context, _ interfaces.ResourceRef, _ string) ([]interfaces.Diagnostic, error) {
p.tsCalls++
return p.diags, p.tsErr
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applyFailProvider always defines a Troubleshoot method, so it always satisfies interfaces.Troubleshooter regardless of isTroubleshooter. This means TestInfraApply_NonTroubleshooterNocrash doesn’t actually exercise the non-troubleshooter path. Use a separate provider type without a Troubleshoot method for that test, and remove the unused isTroubleshooter flag if it’s no longer needed.

Copilot generated this review using guidance from organization custom instructions.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1f6ae4b. Added plainFailProvider (no Troubleshoot method) as a separate type and use it for TestInfraApply_NonTroubleshooterNocrash. Removed isTroubleshooter field from applyFailProvider. The non-troubleshooter test now also asserts diagBuf is empty.

Comment on lines +40 to +69
func TestInfraApply_EmitsDiagnosticsOnFailure(t *testing.T) {
t.Setenv("GITHUB_ACTIONS", "true")

diags := []interfaces.Diagnostic{
{ID: "dep-abc", Phase: "pre_deploy", Cause: "migration failed", At: mustTime("2026-04-24T00:00:00Z")},
}
provider := &applyFailProvider{
applyErr: errors.New("API error"),
isTroubleshooter: true,
diags: diags,
}

var errBuf bytes.Buffer
infraApplyTroubleshootTimeout = 5 * time.Second
defer func() { infraApplyTroubleshootTimeout = 30 * time.Second }()

specs := []interfaces.ResourceSpec{{Name: "bmw-staging", Type: "app_platform"}}
err := applyWithProviderAndStore(context.Background(), provider, "digitalocean", specs, nil, nil)
if err == nil {
t.Fatal("expected error from failing apply")
}
if !strings.Contains(err.Error(), "API error") {
t.Errorf("original error not preserved: %v", err)
}
if provider.tsCalls != 1 {
t.Errorf("Troubleshoot not called: tsCalls=%d", provider.tsCalls)
}
_ = errBuf
_ = fmt.Sprintf // keep import
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestInfraApply_EmitsDiagnosticsOnFailure doesn’t assert on any emitted diagnostics/group markers (and errBuf is unused), so it won’t catch regressions in stderr output. Capture os.Stderr (or refactor applyWithProviderAndStore to accept an io.Writer) and assert the troubleshoot block contains the expected diagnostic text; also remove the _ = fmt.Sprintf import workaround once assertions are added.

Copilot generated this review using guidance from organization custom instructions.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1f6ae4b. applyWithProviderAndStore now accepts io.Writer for diagnostic output. TestInfraApply_EmitsDiagnosticsOnFailure passes &diagBuf and asserts the output contains the GHA group marker and diagnostic cause. Removed the dead errBuf and _ = fmt.Sprintf hack.

Comment thread cmd/wfctl/infra_apply.go
Comment on lines 252 to 255
if err != nil {
em := detectCIProvider()
troubleshootAfterFailure(ctx, os.Stderr, provider, interfaces.ResourceRef{}, err, infraApplyTroubleshootTimeout, em)
return fmt.Errorf("apply: %w", err)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Infra-apply failure path invokes Troubleshoot, but (like deploy) it doesn’t write a step-summary via WriteStepSummary, even when running in GitHub Actions with $GITHUB_STEP_SUMMARY set. If summaries are part of the v0.18.10 contract, consider wiring summary writing here too (or clarifying that summaries are deploy-only).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above — WriteStepSummary wiring for the infra apply path is deferred to v0.18.11 (task #71). The infrastructure (WriteStepSummary, SummaryInput, golden tests) is all present; caller wiring is the next step.

Comment on lines +18 to +26
func TestDiagnostic_JSONRoundtrip(t *testing.T) {
d := Diagnostic{
ID: "dep-abc", Phase: "pre_deploy", Cause: "exit 1",
At: time.Now().UTC().Truncate(time.Second), Detail: "line1\nline2",
}
// simple JSON marshal/unmarshal sanity
// (will fail initially if fields aren't exported with json tags)
_ = d
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestDiagnostic_JSONRoundtrip is currently a no-op (it never marshals/unmarshals or asserts anything), so it won’t catch missing JSON tags/exports as the comment suggests. Consider actually JSON-marshaling Diagnostic, unmarshaling back, and asserting key fields (incl. At and Detail) round-trip correctly.

Copilot generated this review using guidance from organization custom instructions.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1f6ae4b. TestDiagnostic_JSONRoundtrip now marshals Diagnostic to JSON, unmarshals it back, and asserts all fields (ID, Phase, Cause, Detail, At) round-trip correctly.

Comment thread cmd/wfctl/infra_apply.go Outdated
Comment on lines +253 to +254
em := detectCIProvider()
troubleshootAfterFailure(ctx, os.Stderr, provider, interfaces.ResourceRef{}, err, infraApplyTroubleshootTimeout, em)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Apply failure this calls Troubleshoot with an empty interfaces.ResourceRef{}. For providers like DO App Platform, Troubleshoot will likely need at least ref.Name and/or ref.ProviderID to find the right deployment, so this wiring may effectively no-op in real usage. Consider deriving a meaningful ref (e.g., from plan.Actions, specs[0] when single-resource, or current state entries to populate ProviderID) before calling troubleshootAfterFailure.

Suggested change
em := detectCIProvider()
troubleshootAfterFailure(ctx, os.Stderr, provider, interfaces.ResourceRef{}, err, infraApplyTroubleshootTimeout, em)
ref := interfaces.ResourceRef{}
if len(plan.Actions) == 1 {
ref.Name = plan.Actions[0].Resource.Name
ref.Type = plan.Actions[0].Resource.Type
} else if len(specs) == 1 {
ref.Name = specs[0].Name
ref.Type = specs[0].Type
}
em := detectCIProvider()
troubleshootAfterFailure(ctx, os.Stderr, provider, ref, err, infraApplyTroubleshootTimeout, em)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1f6ae4b. Applied the suggested approach: single-action plans populate ref from plan.Actions[0], single-spec configs fall back to specs[0], multi-resource plans leave ref empty (best-effort). The io.Writer refactor also lets tests capture and assert the troubleshoot output.

- ci_output: gitlabEmitter now stores sectionID in GroupStart and reuses
  it in GroupEnd so GitLab section folds close correctly (was generating
  a new random ID on each call)
- ci_output_test: verify start/end section IDs match via regex capture
- infra_apply: applyWithProviderAndStore accepts io.Writer for diagnostic
  output (replaces hardcoded os.Stderr) enabling test capture
- infra_apply: derive meaningful ResourceRef (name+type from plan.Actions
  or specs) before calling troubleshootAfterFailure instead of empty ref
- infra_apply_troubleshoot_test: add plainFailProvider (no Troubleshooter)
  to actually exercise the non-troubleshooter no-op path; assert stderr
  output contains group markers and diagnostic cause
- interfaces/troubleshooter_test: TestDiagnostic_JSONRoundtrip now
  marshals/unmarshals and asserts all fields including At and Detail

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hook 2 (applyWithProviderAndStore failure path) passes an IaCProvider to
troubleshootAfterFailure whose type assertion against Troubleshooter always
returns false — the call is currently a no-op. Add a TODO(v0.18.11) comment
documenting the gap so it's explicit rather than silent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Comment on lines +693 to +694
"ref": ref,
"failure_msg": failureMsg,
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remoteResourceDriver.Troubleshoot passes ref as a struct inside the InvokeService args map. The underlying transport converts args via structpb.NewStruct, which does not accept arbitrary structs; on conversion error it falls back to an empty struct, so the plugin will receive missing/empty args and Troubleshoot will not work. Encode ref as JSON-compatible primitives (e.g., pass ref_name/ref_type/ref_provider_id like other methods, or JSON-roundtrip ref into a map[string]any) before calling InvokeService.

Suggested change
"ref": ref,
"failure_msg": failureMsg,
"ref_name": ref.Name,
"ref_type": ref.Type,
"ref_provider_id": ref.ProviderID,
"failure_msg": failureMsg,

Copilot uses AI. Check for mistakes.
Comment thread cmd/wfctl/infra_apply.go
Comment on lines +267 to +272
// TODO(v0.18.11): provider is interfaces.IaCProvider, which does not implement
// interfaces.Troubleshooter. troubleshootAfterFailure's type assertion always
// returns false here — this hook is currently a no-op for all IaC providers.
// Full diagnostics require per-ResourceDriver access to be threaded through
// applyInfraModules (see task #69).
troubleshootAfterFailure(ctx, w, provider, ref, err, infraApplyTroubleshootTimeout, em)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the provider.Apply error path, troubleshooting is invoked against the interfaces.IaCProvider (troubleshootAfterFailure(..., provider, ...)). For remote/plugin providers, the IaCProvider wrapper does not implement interfaces.Troubleshooter; the Troubleshoot method is on the ResourceDriver returned by provider.ResourceDriver(resourceType). As written, infra apply failures will silently skip troubleshooting for plugin-backed providers. Consider looking up the appropriate ResourceDriver for ref.Type (when available) and passing that into troubleshootAfterFailure instead of the provider itself.

Suggested change
// TODO(v0.18.11): provider is interfaces.IaCProvider, which does not implement
// interfaces.Troubleshooter. troubleshootAfterFailure's type assertion always
// returns false here — this hook is currently a no-op for all IaC providers.
// Full diagnostics require per-ResourceDriver access to be threaded through
// applyInfraModules (see task #69).
troubleshootAfterFailure(ctx, w, provider, ref, err, infraApplyTroubleshootTimeout, em)
troubleshootTarget := any(provider)
if ref.Type != "" {
if driver := provider.ResourceDriver(ref.Type); driver != nil {
troubleshootTarget = driver
}
}
troubleshootAfterFailure(ctx, w, troubleshootTarget, ref, err, infraApplyTroubleshootTimeout, em)

Copilot uses AI. Check for mistakes.
Comment on lines +1197 to +1200
em := detectCIProvider()
troubleshootAfterFailure(ctx, os.Stderr, driver, ref, errors.New(baseErr), 30*time.Second, em)

return errors.New(baseErr)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description/CHANGELOG say wfctl writes a Markdown summary to $GITHUB_STEP_SUMMARY, but the new WriteStepSummary helper is never called from the deploy timeout path (or elsewhere in cmd/wfctl). As a result, even in GitHub Actions the summary file will remain untouched. Either wire WriteStepSummary into the deploy/apply success+failure paths (building a SummaryInput from available data) or adjust the docs/changelog to avoid claiming this behavior.

Copilot uses AI. Check for mistakes.
golangci-lint gosec rule G302 requires file permissions ≤ 0600.
The summary file is written to a CI-runner-managed path from
GITHUB_STEP_SUMMARY; 0644 had world-read which triggered the lint.
Change to 0600 and add nolint comment explaining the path is env-controlled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
if err != nil {
return fmt.Errorf("open summary: %w", err)
}
defer f.Close()
@intel352 intel352 merged commit 11a43c9 into main Apr 24, 2026
18 checks passed
@intel352 intel352 deleted the feat/v0.18.10-observability branch April 24, 2026 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants