feat: capture aks-node-controller errors into Guest Agent Events #7773

Merged
Devinwong merged 38 commits into main from devinwon/controller-guest-agent-Event on Feb 9, 2026
Devinwong (Collaborator) commented Feb 3, 2026

What this PR does / why we need it:
Captures aks-node-controller errors as Guest Agent Events.

Which issue(s) this PR fixes:

Fixes #

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds Guest Agent event logging to the aks-node-controller wrapper script to capture both successful completions and errors for monitoring and telemetry purposes.

Changes:

  • Added createGuestAgentEvent function to generate Guest Agent events in JSON format
  • Events are now created for both successful completion and error cases
  • Added test coverage for error scenario using ShellSpec

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/aks-node-controller-wrapper.sh | Implements Guest Agent event creation with a new createGuestAgentEvent function and calls it for both success and failure cases |
| spec/parts/linux/cloud-init/artifacts/aks_node_controller_wrapper_spec.sh | Adds test case to verify Guest Agent event is created on non-zero exit with correct TaskName, EventLevel, and Message |
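
A minimal sketch of what such an event could look like when built and serialized in Go. Only the field names TaskName, EventLevel, Message, Timestamp, and OperationId appear in this conversation; the struct name, the string types, the timestamp layout, and the sample values are assumptions for illustration, not the PR's actual code or the Guest Agent's full schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// guestAgentEvent is a hypothetical shape for one event file. The field
// names are the ones mentioned in this conversation; the string types and
// the timestamp layout are assumptions.
type guestAgentEvent struct {
	Timestamp   string
	OperationId string
	TaskName    string
	EventLevel  string
	Message     string
}

// sampleStart is a fixed time so the demo output is reproducible.
var sampleStart = time.Date(2026, 2, 3, 1, 0, 0, 0, time.UTC)

// newEvent stores the start time in Timestamp and the end time in
// OperationId, the mapping a review comment in this thread describes
// for existing emitters.
func newEvent(task, level, msg string, start, end time.Time) guestAgentEvent {
	const layout = "2006-01-02 15:04:05.000" // assumed layout
	return guestAgentEvent{
		Timestamp:   start.Format(layout),
		OperationId: end.Format(layout),
		TaskName:    task,
		EventLevel:  level,
		Message:     msg,
	}
}

func main() {
	ev := newEvent("aks-node-controller", "Error", "exited with code 1",
		sampleStart, sampleStart.Add(2*time.Second))
	out, err := json.MarshalIndent(ev, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```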

Copilot AI review requested due to automatic review settings February 3, 2026 01:00
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

- Make GuestAgentEvent and ReadEvents public for test assertions
- Add CreateEventFunc type with NewCreateEventFunc factory for DI
- Inject events directory via App struct instead of hardcoded path
- Add TestApp helper struct for cleaner test setup
- Simplify tests using ReadEvents helper
Copilot AI review requested due to automatic review settings February 7, 2026 03:08
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Copilot AI Feb 7, 2026

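fmt.Sprintf("... %s", err) passes an error value to the %s verb. For a non-nil error, fmt calls Error() for both %s and %v, so the output is the same; the %!s(<nil>) form appears only when err is nil. Prefer %v or err.Error(), the conventional ways to format an error, so the Guest Agent Event always contains readable error text.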

Suggested change
- message := fmt.Sprintf("aks-node-controller exited with error %s", err)
+ message := fmt.Sprintf("aks-node-controller exited with error %v", err)
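
As a quick check of how fmt treats error values, the sketch below formats the same error with both verbs. For a non-nil error, both verbs call Error() and produce identical text; they differ only when the value is nil, where %s yields %!s(<nil>) and %v yields <nil>. The helper names and the sample error are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errSample stands in for whatever error the handler returns (hypothetical).
var errSample = errors.New("provision failed")

// withS formats the error using the %s verb.
func withS(err error) string {
	return fmt.Sprintf("aks-node-controller exited with error %s", err)
}

// withV formats the error using the %v verb.
func withV(err error) string {
	return fmt.Sprintf("aks-node-controller exited with error %v", err)
}

func main() {
	fmt.Println(withS(errSample)) // non-nil: Error() is used
	fmt.Println(withV(errSample)) // non-nil: Error() is used
	fmt.Println(withS(nil))       // nil: ends in %!s(<nil>)
	fmt.Println(withV(nil))       // nil: ends in <nil>
}
```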

Comment on lines 119 to 121
startTime := time.Now()
a.createEvent(cmd.taskName, "Starting", helpers.EventLevelInformational, startTime, startTime)

Copilot AI Feb 7, 2026

a.createEvent(...) is called unconditionally. If an App instance is constructed without createEvent set (e.g., in future tests/utility code), this will panic at runtime. Consider defaulting createEvent to a no-op in Run/run (or in an App constructor) when it is nil.

- Add EventLogger struct with LogEvent and Events methods
- LogEvent writes events, Events reads them (for testing)
- Cleaner API than function type with separate ReadEvents
@Devinwong Devinwong enabled auto-merge (squash) February 7, 2026 05:18
Copilot AI review requested due to automatic review settings February 7, 2026 07:23
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment on lines +57 to +60
// Use nanosecond timestamp as filename, based on current time to ensure uniqueness
// This provides better collision avoidance than milliseconds
eventsFileName := fmt.Sprintf("%d.json", time.Now().UnixNano())
eventFilePath := filepath.Join(l.Dir, eventsFileName)
Copilot AI Feb 7, 2026

Using time.Now().UnixNano() as the filename is not guaranteed to be unique (clock resolution can be coarser than 1ns, and values can repeat), so back-to-back events can collide and overwrite each other, losing telemetry. Consider using an atomic counter/loop with O_EXCL, os.CreateTemp, or adding an additional uniqueness component (pid/random) to the filename.
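
A minimal sketch of the os.CreateTemp option, under the assumption that the consumer of the events directory accepts any *.json filename (this thread does not confirm that). os.CreateTemp replaces the last "*" in the pattern with a random string and refuses to reuse an existing name, so it never overwrites an earlier event. It creates files with mode 0600, which may or may not suit the reader of these files. The function names and the "event-*.json" pattern are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeEvent stores payload in a newly created file under dir and returns
// its path. The file name is chosen by os.CreateTemp.
func writeEvent(dir string, payload []byte) (string, error) {
	f, err := os.CreateTemp(dir, "event-*.json")
	if err != nil {
		return "", err
	}
	if _, err := f.Write(payload); err != nil {
		f.Close()
		os.Remove(f.Name()) // do not leave a partial event behind
		return "", err
	}
	if err := f.Close(); err != nil {
		return "", err
	}
	return f.Name(), nil
}

// writeMany writes n events into a fresh temporary directory and returns
// the generated base names; it exists only to exercise writeEvent.
func writeMany(n int) ([]string, error) {
	dir, err := os.MkdirTemp("", "events")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	names := make([]string, 0, n)
	for i := 0; i < n; i++ {
		p, err := writeEvent(dir, []byte(`{"Message":"demo"}`))
		if err != nil {
			return nil, err
		}
		names = append(names, filepath.Base(p))
	}
	return names, nil
}

func main() {
	names, err := writeMany(3)
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println(names)
}
```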

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 9, 2026 05:51
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

exitCode := tt.App.Run(context.Background(), []string{"aks-node-controller", "provision", "--provision-config=parser/testdata/test_aksnodeconfig.json", "--dry-run"})
assert.Equal(t, 0, exitCode)
if reflect.ValueOf(tt.App.cmdRun).Pointer() != reflect.ValueOf(cmdRunnerDryRun).Pointer() {
t.Fatal("app.cmdRunner is expected to be cmdRunnerDryRun")
Copilot AI Feb 9, 2026

The failure message here still refers to app.cmdRunner, but the field was renamed to cmdRun. Updating the message will make test failures clearer and avoid confusion during future refactors.

Suggested change
- t.Fatal("app.cmdRunner is expected to be cmdRunnerDryRun")
+ t.Fatal("app.cmdRun is expected to be cmdRunnerDryRun")

Comment on lines +60 to +63
// Use nanosecond timestamp as filename, based on current time to ensure uniqueness
// This provides better collision avoidance than milliseconds
eventsFileName := fmt.Sprintf("%d.json", time.Now().UnixNano())
eventFilePath := filepath.Join(l.Dir, eventsFileName)
Copilot AI Feb 9, 2026

LogEvent uses time.Now().UnixNano() alone as the event filename. This is not guaranteed to be unique on all platforms/VMs (clock resolution can be coarser than 1ns), and concurrent/rapid successive calls can overwrite an earlier event file, losing telemetry and causing test flakiness. Consider generating filenames via os.CreateTemp (or add a monotonic counter/O_EXCL retry loop) to guarantee uniqueness.
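
The O_EXCL retry alternative could look roughly like the sketch below, which keeps the numeric timestamp as the base name and appends a counter only when that name is already taken. The function names, the "-<n>" suffix scheme, and the retry limit of 100 are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// createUnique opens a new file named <base>.json, or <base>-<n>.json when
// earlier candidates already exist. O_EXCL makes the open fail instead of
// truncating an existing file.
func createUnique(dir string, base int64) (*os.File, error) {
	for n := 0; n < 100; n++ {
		name := fmt.Sprintf("%d.json", base)
		if n > 0 {
			name = fmt.Sprintf("%d-%d.json", base, n)
		}
		f, err := os.OpenFile(filepath.Join(dir, name),
			os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
		if err == nil {
			return f, nil
		}
		if !errors.Is(err, fs.ErrExist) {
			return nil, err // a real failure, not a name collision
		}
	}
	return nil, errors.New("no free event filename after 100 attempts")
}

// demoNames forces a collision by reusing one base value twice.
func demoNames() ([]string, error) {
	dir, err := os.MkdirTemp("", "events")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	const base = int64(1770000000000000000) // in real use: time.Now().UnixNano()
	var names []string
	for i := 0; i < 2; i++ {
		f, err := createUnique(dir, base)
		if err != nil {
			return nil, err
		}
		names = append(names, filepath.Base(f.Name()))
		f.Close()
	}
	return names, nil
}

func main() {
	names, err := demoNames()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println(names)
}
```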

Comment on lines +119 to 129
startTime := time.Now()
a.eventLogger.LogEvent(cmd.taskName, "Starting", helpers.EventLevelInformational, startTime, startTime)

err := cmd.handler(a, ctx, args)
endTime := time.Now()
if err != nil {
	message := fmt.Sprintf("aks-node-controller exited with error %s", err.Error())
	a.eventLogger.LogEvent(cmd.taskName, message, helpers.EventLevelError, startTime, endTime)
} else {
	a.eventLogger.LogEvent(cmd.taskName, "Completed", helpers.EventLevelInformational, startTime, endTime)
}
Copilot AI Feb 9, 2026

run() emits a "Starting" guest agent event and then a second event for "Completed"/error. Existing guest-agent event emitters in parts/linux/cloud-init/artifacts/ generally emit a single event per operation (Timestamp=startTime, OperationId=endTime). Emitting two events per command increases event volume and also makes it hard to correlate start/end because OperationId will differ between the two events. If the goal is to capture errors, consider only emitting an event on failure (or use a single event emitted at the end with start/end timing in the Message).
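
A sketch of the single-event variant, where one event is emitted after the handler returns and carries both the start and end times. The recorder type is an in-memory stand-in for the PR's event logger, and all names and the string event levels are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// event holds what a single emitted record would carry.
type event struct {
	TaskName, Level, Message string
	Start, End               time.Time
}

// recorder collects events in memory; it stands in for the real logger.
type recorder struct{ events []event }

func (r *recorder) log(task, msg, level string, start, end time.Time) {
	r.events = append(r.events, event{
		TaskName: task, Level: level, Message: msg, Start: start, End: end,
	})
}

// errSample is a hypothetical handler failure used by the demo.
var errSample = errors.New("provision failed")

// runOnce runs handler and emits exactly one event once it has returned,
// so start and end always travel together in the same record.
func runOnce(r *recorder, task string, handler func() error) error {
	start := time.Now()
	err := handler()
	end := time.Now()
	if err != nil {
		msg := fmt.Sprintf("aks-node-controller exited with error %v", err)
		r.log(task, msg, "Error", start, end)
		return err
	}
	r.log(task, "Completed", "Informational", start, end)
	return nil
}

func main() {
	r := &recorder{}
	_ = runOnce(r, "provision", func() error { return nil })
	_ = runOnce(r, "provision", func() error { return errSample })
	for _, e := range r.events {
		fmt.Println(e.TaskName, e.Level, e.Message)
	}
}
```

The trade-off is that a process killed mid-run leaves no event at all, which a separate "Starting" event would have recorded.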

@Devinwong Devinwong merged commit 392281e into main Feb 9, 2026
34 of 35 checks passed
@Devinwong Devinwong deleted the devinwon/controller-guest-agent-Event branch February 9, 2026 18:47
- app := App{cmdRunner: cmdRunner}
+ app := App{
+ 	cmdRun:      cmdRunner,
+ 	eventLogger: helpers.NewEventLogger("/var/log/azure/Microsoft.Azure.Extensions.CustomScript/events"),
Collaborator

Could we make this not hardcoded and instead have it come from the service config? This file won't exist on a non-Azure VM.

Contributor

We create the file and ignore any errors in the process.

There are plenty of hardcoded values in our scripts. Making everything configurable can be harder to track and read. Is there a place where we want to change it?

I thought the path is outside of our control.
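
The "create it and ignore errors" behavior described in this reply could be sketched as below. This is an illustration under assumptions, not the PR's code: the function names, file modes, and the demo paths are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// logBestEffort tries to create dir and write payload into it, and reports
// whether that worked. Failures are deliberately swallowed so that event
// logging can never fail the caller.
func logBestEffort(dir, name string, payload []byte) bool {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return false
	}
	return os.WriteFile(filepath.Join(dir, name), payload, 0o644) == nil
}

// demo exercises one writable location and one that cannot be created
// because its parent is a regular file.
func demo() (okWrite, badWrite bool) {
	root, err := os.MkdirTemp("", "events-demo")
	if err != nil {
		return false, false
	}
	defer os.RemoveAll(root)

	okWrite = logBestEffort(filepath.Join(root, "events"), "1.json", []byte(`{}`))

	blocker := filepath.Join(root, "blocker")
	_ = os.WriteFile(blocker, nil, 0o644)
	badWrite = logBestEffort(filepath.Join(blocker, "events"), "1.json", []byte(`{}`))
	return okWrite, badWrite
}

func main() {
	ok, bad := demo()
	fmt.Println("writable directory:", ok)
	fmt.Println("unusable directory:", bad)
}
```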
