feat: capture aks-node-controller errors into Guest Agent Events#7773
feat: capture aks-node-controller errors into Guest Agent Events#7773
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds Guest Agent event logging to the aks-node-controller wrapper script to capture both successful completions and errors for monitoring and telemetry purposes.
Changes:
- Added
createGuestAgentEventfunction to generate Guest Agent events in JSON format - Events are now created for both successful completion and error cases
- Added test coverage for error scenario using ShellSpec
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh | Implements Guest Agent event creation with a new createGuestAgentEvent function and calls it for both success and failure cases |
| spec/parts/linux/cloud-init/artifacts/aks_node_controller_wrapper_spec.sh | Adds test case to verify Guest Agent event is created on non-zero exit with correct TaskName, EventLevel, and Message |
parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Outdated
Show resolved
Hide resolved
parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Outdated
Show resolved
Hide resolved
spec/parts/linux/cloud-init/artifacts/aks_node_controller_wrapper_spec.sh
Outdated
Show resolved
Hide resolved
parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Outdated
Show resolved
Hide resolved
parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh
Outdated
Show resolved
Hide resolved
- Make GuestAgentEvent and ReadEvents public for test assertions - Add CreateEventFunc type with NewCreateEventFunc factory for DI - Inject events directory via App struct instead of hardcoded path - Add TestApp helper struct for cleaner test setup - Simplify tests using ReadEvents helper
aks-node-controller/app.go
Outdated
There was a problem hiding this comment.
fmt.Sprintf("... %s", err) uses the %s verb with an error value, which formats as %!s(<type=value>) rather than the error message. Use %v or err.Error() so the Guest Agent Event contains the real error text.
| message := fmt.Sprintf("aks-node-controller exited with error %s", err) | |
| message := fmt.Sprintf("aks-node-controller exited with error %v", err) |
| startTime := time.Now() | ||
| a.createEvent(cmd.taskName, "Starting", helpers.EventLevelInformational, startTime, startTime) | ||
|
|
There was a problem hiding this comment.
a.createEvent(...) is called unconditionally. If an App instance is constructed without createEvent set (e.g., in future tests/utility code), this will panic at runtime. Consider defaulting createEvent to a no-op in Run/run (or in an App constructor) when it is nil.
- Add EventLogger struct with LogEvent and Events methods - LogEvent writes events, Events reads them (for testing) - Cleaner API than function type with separate ReadEvents
| // Use nanosecond timestamp as filename, based on current time to ensure uniqueness | ||
| // This provides better collision avoidance than milliseconds | ||
| eventsFileName := fmt.Sprintf("%d.json", time.Now().UnixNano()) | ||
| eventFilePath := filepath.Join(l.Dir, eventsFileName) |
There was a problem hiding this comment.
Using time.Now().UnixNano() as the filename is not guaranteed to be unique (clock resolution can be coarser than 1ns, and values can repeat), so back-to-back events can collide and overwrite each other, losing telemetry. Consider using an atomic counter/loop with O_EXCL, os.CreateTemp, or adding an additional uniqueness component (pid/random) to the filename.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| exitCode := tt.App.Run(context.Background(), []string{"aks-node-controller", "provision", "--provision-config=parser/testdata/test_aksnodeconfig.json", "--dry-run"}) | ||
| assert.Equal(t, 0, exitCode) | ||
| if reflect.ValueOf(tt.App.cmdRun).Pointer() != reflect.ValueOf(cmdRunnerDryRun).Pointer() { | ||
| t.Fatal("app.cmdRunner is expected to be cmdRunnerDryRun") |
There was a problem hiding this comment.
The failure message here still refers to app.cmdRunner, but the field was renamed to cmdRun. Updating the message will make test failures clearer and avoid confusion during future refactors.
| t.Fatal("app.cmdRunner is expected to be cmdRunnerDryRun") | |
| t.Fatal("app.cmdRun is expected to be cmdRunnerDryRun") |
| // Use nanosecond timestamp as filename, based on current time to ensure uniqueness | ||
| // This provides better collision avoidance than milliseconds | ||
| eventsFileName := fmt.Sprintf("%d.json", time.Now().UnixNano()) | ||
| eventFilePath := filepath.Join(l.Dir, eventsFileName) |
There was a problem hiding this comment.
LogEvent uses time.Now().UnixNano() alone as the event filename. This is not guaranteed to be unique on all platforms/VMs (clock resolution can be coarser than 1ns), and concurrent/rapid successive calls can overwrite an earlier event file, losing telemetry and causing test flakiness. Consider generating filenames via os.CreateTemp (or add a monotonic counter/O_EXCL retry loop) to guarantee uniqueness.
| startTime := time.Now() | ||
| a.eventLogger.LogEvent(cmd.taskName, "Starting", helpers.EventLevelInformational, startTime, startTime) | ||
|
|
||
| err := cmd.handler(a, ctx, args) | ||
| endTime := time.Now() | ||
| if err != nil { | ||
| message := fmt.Sprintf("aks-node-controller exited with error %s", err.Error()) | ||
| a.eventLogger.LogEvent(cmd.taskName, message, helpers.EventLevelError, startTime, endTime) | ||
| } else { | ||
| a.eventLogger.LogEvent(cmd.taskName, "Completed", helpers.EventLevelInformational, startTime, endTime) | ||
| } |
There was a problem hiding this comment.
run() emits a "Starting" guest agent event and then a second event for "Completed"/error. Existing guest-agent event emitters in parts/linux/cloud-init/artifacts/ generally emit a single event per operation (Timestamp=startTime, OperationId=endTime). Emitting two events per command increases event volume and also makes it hard to correlate start/end because OperationId will differ between the two events. If the goal is to capture errors, consider only emitting an event on failure (or use a single event emitted at the end with start/end timing in the Message).
| app := App{cmdRunner: cmdRunner} | ||
| app := App{ | ||
| cmdRun: cmdRunner, | ||
| eventLogger: helpers.NewEventLogger("/var/log/azure/Microsoft.Azure.Extensions.CustomScript/events"), |
There was a problem hiding this comment.
could we make this not hardcoded and instead come from the service config ? this file won't exists on non Azure VM
There was a problem hiding this comment.
We create the file. And ignore any errors in the process.
There are plenty of hardcoded values in our scripts. Making everything configurable can be harder to track and read. Is there a place where we want to change it?
I thought the path is outside of our control.
What this PR does / why we need it:
capture aks-node-controller errors into Guest Agent Events
Which issue(s) this PR fixes:
Fixes #