
fix: context deadline exceeded shouldn't throw an error #203

Merged
ambermingxin merged 4 commits into main from fix/offline-mode-context-exceed
May 22, 2026

Conversation

@ambermingxin
Collaborator

@ambermingxin ambermingxin commented May 22, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Offline mode now performs an immediate export at startup so short-lived runs produce output before the first scheduled interval.
  • Improvements

    • Server shutdown now coordinates and waits for background loops to exit, with bounded wait and clearer timeout handling.
  • Documentation

    • Helm chart README, values, daemonset template, and install guide updated with optional enrollment metadata flags and usage notes.
  • Tests

    • Added tests for offline export behavior and for background-loop coordination and shutdown.

Review Change Stack

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 22, 2026

📝 Walkthrough

Walkthrough

Adds an immediate one-time export when the exporter runs in OfflineMode; introduces Server loop lifecycle state (cancelable loop context + waitgroup) and a wait helper with tests; and adds optional Helm enrollment metadata flags, template wiring, and docs.

Changes

Offline exporter initial export

Layer / File(s) Summary
Offline mode initial export
internal/exporter/exporter.go, internal/exporter/exporter_test.go
Start() now performs an immediate export() when OfflineMode is enabled, logging failures and updating lastExport on success; tests assert output files are created and include collected data.
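
As a rough illustration of the behavior summarized above, the following sketch shows one way a `Start()` method could run a single export before any scheduled interval. The `Exporter` struct, its injected `export` function field, and the use of the standard `log` package are assumptions made for this example; only `Start()`, `export()`, `OfflineMode`, and `lastExport` are named in the walkthrough, and the real code in internal/exporter/exporter.go may be structured differently.

```go
package main

import (
	"log"
	"time"
)

// Exporter is a hypothetical stand-in for the real exporter type.
// Only OfflineMode, lastExport, and export are named in the PR summary;
// the shape of this struct is an assumption.
type Exporter struct {
	OfflineMode bool
	lastExport  time.Time
	export      func() error // injected here so the sketch is testable
}

// Start runs one export immediately when OfflineMode is enabled.
// A failed initial export is logged and does not fail startup;
// lastExport is updated only on success.
func (e *Exporter) Start() error {
	if e.OfflineMode {
		if err := e.export(); err != nil {
			log.Printf("initial offline export failed: %v", err)
		} else {
			e.lastExport = time.Now()
		}
	}
	// The periodic export loop would be started here (omitted).
	return nil
}
```

With this shape, a short-lived offline run that exits before the first tick still leaves one output file behind, which is the scenario the summary describes.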

Server loop lifecycle and shutdown

Layer / File(s) Summary
waitForWaitGroup and tests
internal/server/server.go, internal/server/server_test.go
Adds waitForWaitGroup(wg, timeout) and test coverage verifying completion and timeout behavior; includes necessary test import changes.
Server loop fields and New() wiring
internal/server/server.go
Server adds loopWG, loopCtx, and loopCancel; New() creates a dedicated cancelable loopCtx stored on the server and used to start background loops.
Loop wiring, Add/Done, and Stop() shutdown
internal/server/server.go
Inventory and attestation loops use loopCtx and increment/decrement loopWG; Stop() cancels loopCtx and waits up to 10s for loops to exit, logging a warning on timeout.

Helm enrollment flags and docs

Layer / File(s) Summary
Values, template args, and docs
deployments/helm/fleet-intelligence-agent/values.yaml, deployments/helm/fleet-intelligence-agent/templates/daemonset.yaml, deployments/helm/fleet-intelligence-agent/README.md, docs/install-helm.md
Adds enroll.nodeGroup and enroll.computeZone values, conditionally passes --node-group/--compute-zone to the fleetint enroll initContainer when set, and documents usage and upgrade examples including null vs empty-string semantics.

Sequence Diagram(s)

sequenceDiagram
  participant ComponentA
  participant ComponentB
  ComponentA->>ComponentB: observable interaction

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • mukilsh

Poem

🐇 I hopped to run the exporter fast,
I nudged a tick to export first and last,
Flags tucked in Helm like carrots in beds,
Loops said goodnight and rested their heads,
A tiny file appeared — mission fed.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title claims to fix a 'context deadline exceeded' error handling issue, but the actual changes add offline mode exports, background loop synchronization, and Helm enrollment parameters, none of which directly address context deadline error handling. | Revise the title to accurately reflect the main changes, such as 'feat: add offline mode export and improve server shutdown coordination' or similar. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/offline-mode-context-exceed

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread internal/exporter/exporter.go
Comment thread internal/server/server.go Outdated
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
internal/server/server.go (1)

223-224: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't suppress joined loop failures just because the context ended.

Once ctx.Err() is set, this drops any error that matches context.Canceled or context.DeadlineExceeded, including joined/wrapped errors that also contain a real loop failure. That still masks the scenario called out earlier instead of only silencing the expected shutdown path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/server/server.go` around lines 223 - 224, The current condition
suppresses any error that wraps context.Canceled/DeadlineExceeded; change the
check to only suppress when the error is exactly the cancellation error (not
wrapped): in the block referencing ctx and err, replace errors.Is(err,
context.Canceled)/errors.Is(err, context.DeadlineExceeded) with direct equality
checks (err == context.Canceled || err == context.DeadlineExceeded) while still
requiring ctx != nil && ctx.Err() != nil, so wrapped/joined loop failures are
not accidentally dropped.
🧹 Nitpick comments (1)
internal/exporter/exporter_test.go (1)

343-346: ⚡ Quick win

Consider adding a brief sleep after Start() for test reliability.

The existing test at lines 287-305 includes a 100ms sleep after calling Start() before reading the directory. For consistency and to guard against file system buffering delays, consider adding the same safety margin here.

Suggested addition
 err = exporter.Start()
 require.NoError(t, err)
+
+// Brief wait to ensure file system operations complete
+time.Sleep(100 * time.Millisecond)

 entries, err := os.ReadDir(tmpDir)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/exporter/exporter_test.go` around lines 343 - 346, Add a short sleep
after calling exporter.Start() to match the other test's safety margin and avoid
flaky reads: after invoking exporter.Start() (the call to exporter.Start() that
precedes reading tmpDir with os.ReadDir), insert a brief pause (e.g., 100ms)
before calling os.ReadDir(tmpDir) so the exporter has time to write files and
the filesystem buffers settle; keep the delay minimal and document it with a
one-line comment near the exporter.Start() call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deployments/helm/fleet-intelligence-agent/values.yaml`:
- Around line 64-67: The values.yaml defaults for enrollment metadata use empty
strings for nodeGroup and computeZone which causes the template checks (see
daemonset.yaml checks using "ne nil" at the node-group/compute-zone flag
generation) to always pass and emit --node-group "" / --compute-zone ""; change
the defaults for the nodeGroup and computeZone keys in values.yaml from "" to
null so the template's nil checks behave correctly and preserve existing values
(also ensure README.md table remains consistent with the null defaults).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3ce52339-623f-4ed1-bcb8-b89bf6475581

📥 Commits

Reviewing files that changed from the base of the PR and between 58f4a2f and 6b39c2b.

📒 Files selected for processing (7)
  • deployments/helm/fleet-intelligence-agent/README.md
  • deployments/helm/fleet-intelligence-agent/templates/daemonset.yaml
  • deployments/helm/fleet-intelligence-agent/values.yaml
  • docs/install-helm.md
  • internal/exporter/exporter_test.go
  • internal/server/server.go
  • internal/server/server_test.go
✅ Files skipped from review due to trivial changes (1)
  • deployments/helm/fleet-intelligence-agent/README.md

Comment thread deployments/helm/fleet-intelligence-agent/values.yaml Outdated
Comment thread internal/server/server.go Outdated
Comment thread deployments/helm/fleet-intelligence-agent/templates/daemonset.yaml Outdated
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
internal/server/server.go (1)

495-496: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Differentiate expected context timeout exits from real loop failures.

Line 495 and Line 532 now suppress only context.Canceled. In timeout-driven shutdown, these loops can legitimately exit with context.DeadlineExceeded, which will now be logged as an error and conflicts with the PR objective. Suppress DeadlineExceeded only when it comes from the loop context (ctx.Err()), and still log real internal deadline failures.

Proposed patch
@@
 	s.loopWG.Add(1)
 	go func() {
 		defer s.loopWG.Done()
-		if err := manager.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
-			log.Logger.Errorw("inventory loop manager exited", "error", err)
-		}
+		if err := manager.Run(ctx); err != nil {
+			expectedCtxExit := errors.Is(err, context.Canceled) ||
+				(errors.Is(err, context.DeadlineExceeded) && errors.Is(ctx.Err(), context.DeadlineExceeded))
+			if !expectedCtxExit {
+				log.Logger.Errorw("inventory loop manager exited", "error", err)
+			}
+		}
 	}()
 }
@@
 	s.loopWG.Add(1)
 	go func() {
 		defer s.loopWG.Done()
-		if err := manager.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
-			log.Logger.Errorw("attestation loop exited", "error", err)
-		}
+		if err := manager.Run(ctx); err != nil {
+			expectedCtxExit := errors.Is(err, context.Canceled) ||
+				(errors.Is(err, context.DeadlineExceeded) && errors.Is(ctx.Err(), context.DeadlineExceeded))
+			if !expectedCtxExit {
+				log.Logger.Errorw("attestation loop exited", "error", err)
+			}
+		}
 	}()
 }

Also applies to: 532-533

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/server/server.go` around lines 495 - 496, The current error
suppression only filters context.Canceled, so a context.DeadlineExceeded
returned during timeout-driven shutdown is logged as an error; update the
error-checking around manager.Run(ctx) (and the other loop Run call at the
later occurrence) to suppress DeadlineExceeded only when it originated from the
loop context by checking ctx.Err(). That is, keep the existing suppression for
context.Canceled, and add logic so that errors.Is(err,
context.DeadlineExceeded) is ignored only if ctx.Err() ==
context.DeadlineExceeded; otherwise treat DeadlineExceeded as a real error and
log it with log.Logger.Errorw.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ed80e414-67a0-4341-8abb-f96029360639

📥 Commits

Reviewing files that changed from the base of the PR and between 5707e83 and 74c6009.

📒 Files selected for processing (2)
  • internal/server/server.go
  • internal/server/server_test.go
💤 Files with no reviewable changes (1)
  • internal/server/server_test.go

@ambermingxin ambermingxin merged commit 022201f into main May 22, 2026
9 checks passed
@ambermingxin ambermingxin deleted the fix/offline-mode-context-exceed branch May 22, 2026 20:37

2 participants