fix: harden startup and export under slow DCGM and machine info by jingxiang-z · Pull Request #157 · NVIDIA/fleet-intelligence-agent

jingxiang-z · 2026-04-06T22:26:37Z

Summary

- Prewarm machine info and wait up to 5 seconds on the first export before falling back to cached or empty machine info.
- Move machine info refresh behind a single-flight cache so later exports never block on slow NVML/DCGM-backed machine info collection.
- Separate the collection and HTTP export timeout contexts so slow collection does not consume the send budget.
- Bound DCGM initialization during server startup to 1 minute and fall back to a no-op DCGM instance when startup enumeration is too slow.
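For readers unfamiliar with the single-flight cache idea, here is a minimal sketch, not the code from this PR. The type `cachedInfo`, its fields, and the `fetch` callback are hypothetical names chosen for illustration; the real provider lives in `internal/exporter/collector/machine_info_provider.go` and may differ.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cachedInfo is a hypothetical stand-in for a cached machine info provider.
type cachedInfo struct {
	mu         sync.Mutex
	value      string
	ok         bool
	refreshing bool
	fetch      func() (string, error) // assumed slow, e.g. NVML/DCGM-backed
}

// Get returns the cached value and never blocks on a fetch.
func (c *cachedInfo) Get() (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value, c.ok
}

// RefreshAsync starts at most one background fetch at a time.
// It reports whether this call started a new fetch.
func (c *cachedInfo) RefreshAsync(done *sync.WaitGroup) bool {
	c.mu.Lock()
	if c.refreshing {
		c.mu.Unlock()
		return false // a fetch is already in flight; reuse it
	}
	c.refreshing = true
	c.mu.Unlock()

	done.Add(1)
	go func() {
		defer done.Done()
		v, err := c.fetch() // the slow call runs outside the lock
		c.mu.Lock()
		defer c.mu.Unlock()
		c.refreshing = false
		if err == nil {
			c.value, c.ok = v, true // keep the old value on failure
		}
	}()
	return true
}

// runDemo requests five refreshes while the first fetch is still in flight
// and returns how many fetches actually ran plus the cached result.
func runDemo() (fetches int32, value string, ok bool) {
	var calls int32
	release := make(chan struct{})
	c := &cachedInfo{fetch: func() (string, error) {
		atomic.AddInt32(&calls, 1)
		<-release // simulate a slow backend
		return "gpu-node-1", nil
	}}
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		c.RefreshAsync(&wg)
	}
	close(release)
	wg.Wait()
	value, ok = c.Get()
	return atomic.LoadInt32(&calls), value, ok
}

func main() {
	fetches, value, ok := runDemo()
	fmt.Println(fetches, value, ok) // prints "1 gpu-node-1 true"
}
```

The point of the design is that `Get` only ever touches the cache, so an export never waits on the slow backend once the first value is available.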

Behavior Notes

- The first export now makes a best effort to include machine info without waiting indefinitely.
- Later exports reuse cached machine info and refresh it in the background only when stale.
- The initial machine-info wait is a one-time gate and now honors collection context cancellation.
- One export cycle can now take up to roughly the collection timeout plus the export timeout, because those budgets are intentionally separate.
- If DCGM startup takes more than 1 minute, the server continues startup with a no-op DCGM instance and skips DCGM-backed checks instead of hanging.
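The bounded-startup behavior in the last note can be illustrated with a small sketch. This is an assumption-laden example, not the server code: `healthBackend`, `realBackend`, `noopBackend`, and `initWithTimeout` are hypothetical names, and the demo uses millisecond budgets in place of the 1 minute limit.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// healthBackend is a hypothetical interface standing in for a DCGM instance.
type healthBackend interface{ Name() string }

type realBackend struct{}

func (realBackend) Name() string { return "real" }

type noopBackend struct{}

func (noopBackend) Name() string { return "noop" }

// initWithTimeout runs start in a goroutine and falls back to a no-op
// backend if start fails or does not finish within limit.
func initWithTimeout(parent context.Context, limit time.Duration,
	start func(context.Context) (healthBackend, error)) healthBackend {

	ctx, cancel := context.WithTimeout(parent, limit)
	defer cancel()

	type result struct {
		b   healthBackend
		err error
	}
	ch := make(chan result, 1) // buffered so a late start never blocks on send
	go func() {
		b, err := start(ctx)
		ch <- result{b, err}
	}()

	select {
	case r := <-ch:
		if r.err != nil {
			return noopBackend{}
		}
		return r.b
	case <-ctx.Done():
		return noopBackend{}
	}
}

func fastStart(context.Context) (healthBackend, error) { return realBackend{}, nil }

func slowStart(ctx context.Context) (healthBackend, error) {
	<-ctx.Done() // simulates enumeration that outlasts the budget
	return nil, ctx.Err()
}

// demoNames initializes once with a fast start and once with a slow one.
func demoNames() (string, string) {
	fast := initWithTimeout(context.Background(), time.Second, fastStart)
	slow := initWithTimeout(context.Background(), 20*time.Millisecond, slowStart)
	return fast.Name(), slow.Name()
}

func main() {
	fast, slow := demoNames()
	fmt.Println(fast, slow) // prints "real noop"
}
```

One caveat of this pattern: if the real initialization ignores its context, the goroutine keeps running in the background until the call returns, even though startup has already moved on with the no-op instance.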

Testing

go test ./...
go test -race ./internal/exporter/collector
go test github.com/NVIDIA/fleet-intelligence-sdk/pkg/nvidia-query/dcgm ./internal/server ./internal/exporter

Related ticket: GPUHEALTH-1735

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

coderabbitai · 2026-04-06T22:26:45Z

📝 Walkthrough

Machine info collection was refactored to use an asynchronous cached provider; the collector best-effort reads cached info (with an optional initial wait) and schedules background refreshes. The exporter now creates separate timeout-derived contexts for collection and for HTTP transmission.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Machine Info Provider `internal/exporter/collector/machine_info_provider.go` | Adds `machineInfoProvider` interface and `cachedMachineInfoProvider` that caches machine info, deduplicates concurrent refreshes, supports async refresh and a one-time initial-wait mechanism, and handles staleness. |
| Collector `internal/exporter/collector/collector.go` | Instantiates provider when enabled, uses `provider.Get()` to populate `data.MachineInfo` (best-effort), waits briefly for initial refresh when needed, and triggers `RefreshAsync` for background updates. |
| Collector Tests & Mocks `internal/exporter/collector/collector_test.go` | Adds tests and `mockMachineInfoProvider` covering cached reads, initial prewarm/wait behavior, non-blocking collection when refresh stalls, deduplication of concurrent refreshes, and resilience to refresh failures. |
| Exporter Contexts `internal/exporter/exporter.go` | Uses a dedicated timeout-derived `collectionCtx` for metadata refresh and collection, and a separate `exportCtx` for HTTP transmission (offline path unchanged). |
| Exporter Tests & Mock HTTP Writer `internal/exporter/exporter_test.go` | Adds `MockHTTPWriter` and `TestExportUsesSeparateContextsForCollectionAndHTTP` to verify HTTP export receives a fresh, non-expired context when collection overruns. |

Sequence Diagram(s)

sequenceDiagram
    participant Collector
    participant MachineInfoProvider
    participant NVML
    participant Cache

    Collector->>MachineInfoProvider: New(includeMachineInfo=true, options...)
    MachineInfoProvider->>MachineInfoProvider: start initial async refresh()
    MachineInfoProvider->>NVML: fetch machine info (background)
    NVML-->>MachineInfoProvider: machine info / error
    MachineInfoProvider->>Cache: update cached value (on success)
    MachineInfoProvider-->>Collector: initialRefreshDone (closes)

    Collector->>MachineInfoProvider: Collect(): WaitForInitialRefresh(maxWait)
    MachineInfoProvider-->>Collector: Get() => cached machine info (best-effort)
    Collector->>MachineInfoProvider: RefreshAsync() (schedule next refresh if stale)

sequenceDiagram
    participant Exporter
    participant Collector
    participant HTTPWriter

    Exporter->>Exporter: create collectionCtx (timeout T1)
    Exporter->>Collector: Collect(collectionCtx)
    Note over Collector: collection may use cached data / wait briefly
    Collector-->>Exporter: collection result

    Exporter->>Exporter: create exportCtx (timeout T2, independent)
    Exporter->>HTTPWriter: Send(exportCtx)
    HTTPWriter-->>Exporter: send result
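The second diagram can be sketched in code as follows. This is a minimal illustration under assumed names (`exportOnce`, `collect`, `send`, and the timeout values are hypothetical), not the exporter's actual implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// exportOnce gives collection and sending independent budgets.
// Both contexts derive from parent, not from each other, so a slow
// collection cannot eat into the time reserved for the send.
func exportOnce(parent context.Context, collectTimeout, sendTimeout time.Duration,
	collect func(context.Context) string,
	send func(context.Context, string) error) error {

	collectionCtx, cancelCollect := context.WithTimeout(parent, collectTimeout)
	data := collect(collectionCtx)
	cancelCollect()

	exportCtx, cancelExport := context.WithTimeout(parent, sendTimeout)
	defer cancelExport()
	return send(exportCtx, data)
}

// demoSendSeesFreshContext reports whether the send step still gets a
// live context after collection has used up its whole budget.
func demoSendSeesFreshContext() bool {
	collect := func(ctx context.Context) string {
		<-ctx.Done() // collection overruns until its deadline fires
		return "partial"
	}
	send := func(ctx context.Context, data string) error {
		if ctx.Err() != nil {
			return errors.New("send context already expired")
		}
		return nil
	}
	err := exportOnce(context.Background(), 10*time.Millisecond, time.Second, collect, send)
	return err == nil
}

func main() {
	fmt.Println(demoSendSeesFreshContext()) // prints "true"
}
```

The trade-off is the one stated in the PR description: a full cycle can now last roughly T1 plus T2 in the worst case.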

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I nibbled cache where NVML slept tight,
Async refreshes tend the data at night,
Two timeouts now hop—collection and send,
No more expired contexts to upend,
A joyful hop for stable insight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective: hardening export behavior when machine info collection is slow, which aligns with the key change of separating collection timeouts, implementing cached machine info, and ensuring exports don't hang indefinitely. |

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1546195fc1

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 973-979: The mockMachineInfoProvider accesses the cached field
from multiple goroutines (methods Get and setCached) without synchronization,
causing race failures; add a mutex (e.g., sync.Mutex or sync.RWMutex) as a field
on mockMachineInfoProvider and use it to protect all reads/writes of cached
(wrap cached reads in RLock/RUnlock in Get and Lock/Unlock in setCached), and
update any other helper methods that touch cached (and initialRefreshDone if it
reads/writes cached) to use the same mutex so the mock mirrors the production
provider's locking.

In `@internal/exporter/collector/machine_info_provider.go`:
- Around line 35-39: The WaitForInitialRefresh method currently ignores caller
cancellation; change its signature on machineInfoProvider from
WaitForInitialRefresh(maxWait time.Duration) to WaitForInitialRefresh(ctx
context.Context, maxWait time.Duration), update all callers (notably the Collect
path in collector.go) to pass the collection context (collectionCtx) into
WaitForInitialRefresh, and update every implementation of machineInfoProvider
(including RefreshAsync/Wait loops) so the initial-wait logic selects on
ctx.Done() as well as the timeout to ensure the wait is canceled promptly when
the context is canceled.
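As an illustration of the first finding above (guarding the mock's `cached` field), a mutex-protected test double might look like the sketch below. The `machineInfo` struct and `mockProvider` type are hypothetical stand-ins; the real mock in `collector_test.go` may have more fields and methods.

```go
package main

import (
	"fmt"
	"sync"
)

// machineInfo is a hypothetical stand-in for machineinfo.MachineInfo.
type machineInfo struct{ Hostname string }

// mockProvider sketches a test double whose cached field is guarded
// by a read/write mutex so concurrent Get and setCached calls are safe.
type mockProvider struct {
	mu     sync.RWMutex
	cached *machineInfo
}

// Get takes the read lock, so many readers can proceed in parallel.
func (m *mockProvider) Get() (*machineInfo, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.cached, m.cached != nil
}

// setCached takes the write lock while swapping the pointer.
func (m *mockProvider) setCached(info *machineInfo) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.cached = info
}

// demoConcurrentAccess hammers the mock from several goroutines and
// returns the final cached hostname.
func demoConcurrentAccess() (string, bool) {
	m := &mockProvider{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			m.setCached(&machineInfo{Hostname: "node-a"})
		}()
		go func() {
			defer wg.Done()
			m.Get()
		}()
	}
	wg.Wait()
	info, ok := m.Get()
	if !ok {
		return "", false
	}
	return info.Hostname, true
}

func main() {
	host, ok := demoConcurrentAccess()
	fmt.Println(host, ok) // prints "node-a true"
}
```

Running a sketch like this with `go run -race` is one way to confirm that the locking removes the reported data race.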

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e9dcc4d-ee24-4ca4-8c92-1bad9c96278f

📥 Commits

Reviewing files that changed from the base of the PR and between f86cfff and 1546195.

📒 Files selected for processing (5)

internal/exporter/collector/collector.go
internal/exporter/collector/collector_test.go
internal/exporter/collector/machine_info_provider.go
internal/exporter/exporter.go
internal/exporter/exporter_test.go

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

internal/exporter/collector/machine_info_provider.go (1)

35-39: ⚠️ Potential issue | 🟠 Major

Honor caller cancellation during the initial wait.

WaitForInitialRefresh only watches the done channel and the timer, so the first collect can still sit here for the full 5s after its collection context has already been canceled. That undermines the collection/export timeout split this PR is introducing. Please thread context.Context through this method and update the collector.go call site to pass the collection context.

⏱️ Suggested fix

 type machineInfoProvider interface {
 	Get() (*machineinfo.MachineInfo, bool)
 	RefreshAsync(parent context.Context)
-	WaitForInitialRefresh(maxWait time.Duration) bool
+	WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool
 }
@@
-func (p *cachedMachineInfoProvider) WaitForInitialRefresh(maxWait time.Duration) bool {
+func (p *cachedMachineInfoProvider) WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool {
 	if p == nil || maxWait <= 0 {
 		return false
 	}
@@
 	select {
 	case <-p.initialRefreshDone:
 		return true
+	case <-ctx.Done():
+		return false
 	case <-timer.C:
 		return false
 	}
 }

Also applies to: 119-140
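A self-contained version of the suggested `select` is sketched below so the three outcomes can be exercised in isolation. The free function `waitForInitial` and its `done` parameter are assumptions for illustration; in the PR the logic is a method on `cachedMachineInfoProvider` that reads `p.initialRefreshDone`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForInitial returns true only if done closes before ctx is
// canceled or maxWait elapses.
func waitForInitial(ctx context.Context, done <-chan struct{}, maxWait time.Duration) bool {
	if maxWait <= 0 {
		return false
	}
	timer := time.NewTimer(maxWait)
	defer timer.Stop()

	select {
	case <-done:
		return true
	case <-ctx.Done(): // caller gave up; stop waiting immediately
		return false
	case <-timer.C:
		return false
	}
}

// demoWait exercises the three exits: refresh finished, caller
// canceled, and timer expired.
func demoWait() (finished, afterCancel, afterTimeout bool) {
	done := make(chan struct{})
	close(done)
	finished = waitForInitial(context.Background(), done, time.Second)

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	afterCancel = waitForInitial(ctx, make(chan struct{}), time.Second)

	afterTimeout = waitForInitial(context.Background(), make(chan struct{}), 10*time.Millisecond)
	return
}

func main() {
	fmt.Println(demoWait()) // prints "true false false"
}
```

In the canceled case the call returns right away instead of sitting out the full `maxWait`, which is the behavior the review asks for.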

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/collector/machine_info_provider.go` around lines 35 - 39,
The WaitForInitialRefresh method on interface machineInfoProvider currently
blocks on only a timer and its done channel and therefore ignores caller
cancellation; change the signature to WaitForInitialRefresh(ctx context.Context,
maxWait time.Duration) bool, update every concrete implementation of
machineInfoProvider to accept the context and to select on ctx.Done() alongside
the existing timer/done channels, and update the caller in collector.go to pass
the collection context when calling WaitForInitialRefresh; ensure RefreshAsync,
Get and any struct receiver methods still satisfy the interface after the
signature change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 494-497: The test currently calls
provider.markInitialRefreshDone() inside provider.refreshFn before blocking on
<-blocker, which signals completion too early; change provider.refreshFn so it
first blocks on <-blocker and only calls provider.markInitialRefreshDone() after
the blocker unblocks, ensuring the test exercises the 5s timeout path when
Collect() is invoked.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9efe8286-b237-40d9-adc7-bb330ad6d704

📥 Commits

Reviewing files that changed from the base of the PR and between 1546195 and 465d156.

📒 Files selected for processing (2)

internal/exporter/collector/collector_test.go
internal/exporter/collector/machine_info_provider.go

jingxiang-z · 2026-04-06T23:33:34Z

@coderabbitai review

mukilsh

LGTM

fix: harden export under slow machine info collection

1546195

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

jingxiang-z marked this pull request as ready for review April 6, 2026 22:27

chatgpt-codex-connector Bot reviewed Apr 6, 2026

Comment thread internal/exporter/collector/collector.go Outdated

fix: avoid repeated initial machine info wait

465d156

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

coderabbitai Bot reviewed Apr 6, 2026

Comment thread internal/exporter/collector/collector_test.go

Comment thread internal/exporter/collector/machine_info_provider.go

coderabbitai Bot reviewed Apr 6, 2026

Comment thread internal/exporter/collector/collector_test.go

jingxiang-z added 2 commits April 6, 2026 15:56

fix: cancel initial machine info wait with context

7d29806

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

fix: bound dcgm startup during server init

9c1c3a0

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

jingxiang-z changed the title ~~fix: harden export under slow machine info collection~~ fix: harden startup and export under slow DCGM and machine info Apr 6, 2026

jingxiang-z requested a review from mukilsh April 7, 2026 16:34

jingxiang-z self-assigned this Apr 7, 2026

mukilsh approved these changes Apr 7, 2026

jingxiang-z merged commit 810acfb into main Apr 7, 2026
9 checks passed

jingxiang-z deleted the fix/export-machine-info-resilience branch April 7, 2026 17:21

jingxiang-z merged 4 commits into main from fix/export-machine-info-resilience
