
fix: harden startup and export under slow DCGM and machine info #157

Merged
jingxiang-z merged 4 commits into main from
fix/export-machine-info-resilience
Apr 7, 2026

@jingxiang-z
Collaborator

@jingxiang-z jingxiang-z commented Apr 6, 2026

Summary

  • prewarm machine info and wait up to 5 seconds on the first export before falling back to cached or empty machine info
  • move machine info refresh behind a single-flight cache so later exports never block on slow NVML/DCGM-backed machine info collection
  • separate collection and HTTP export timeout contexts so slow collection does not consume the send budget
  • bound DCGM initialization during server startup to 1 minute and fall back to a no-op DCGM instance when startup enumeration is too slow
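
The single-flight cache in the second bullet can be pictured with a minimal sketch. It is not the PR's `cachedMachineInfoProvider`: the `info` type, the `cachedProvider` struct, its fields, and the `ttl` staleness rule are all hypothetical names chosen for illustration. The point is only that readers never wait on the slow fetch and that overlapping refresh requests collapse into one call.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// info is a hypothetical stand-in for the real machine info payload.
type info struct{ Hostname string }

// cachedProvider keeps the last successful value and allows at most one
// refresh to run at a time. All names here are illustrative assumptions.
type cachedProvider struct {
	mu         sync.Mutex
	cached     *info
	fetchedAt  time.Time
	refreshing bool
	ttl        time.Duration
	fetch      func() (*info, error) // slow, NVML/DCGM-like call
	wg         sync.WaitGroup        // lets the demo wait for background work
}

// Get never calls fetch; it returns whatever is cached right now.
func (p *cachedProvider) Get() (*info, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.cached, p.cached != nil
}

// RefreshAsync starts a background fetch only when the cache is stale and
// no other refresh is already in flight.
func (p *cachedProvider) RefreshAsync() {
	p.mu.Lock()
	fresh := p.cached != nil && time.Since(p.fetchedAt) < p.ttl
	if p.refreshing || fresh {
		p.mu.Unlock()
		return
	}
	p.refreshing = true
	p.wg.Add(1)
	p.mu.Unlock()

	go func() {
		defer p.wg.Done()
		v, err := p.fetch()
		p.mu.Lock()
		defer p.mu.Unlock()
		p.refreshing = false
		if err == nil {
			p.cached, p.fetchedAt = v, time.Now()
		}
	}()
}

// demoSingleFlight issues a burst of refresh requests while one fetch is in
// flight and reports how many times the slow fetch actually ran.
func demoSingleFlight() int64 {
	var calls atomic.Int64
	p := &cachedProvider{ttl: time.Minute}
	p.fetch = func() (*info, error) {
		calls.Add(1)
		time.Sleep(50 * time.Millisecond)
		return &info{Hostname: "node-a"}, nil
	}
	for i := 0; i < 10; i++ {
		p.RefreshAsync()
	}
	p.wg.Wait()
	p.RefreshAsync() // the cache is fresh now, so this is a no-op
	p.wg.Wait()
	return calls.Load()
}

func main() {
	fmt.Println("fetch calls:", demoSingleFlight())
}
```

A failed fetch leaves the previous cached value in place, which matches the "cached or empty" fallback described above only in spirit; the real provider may handle errors differently.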

Behavior Notes

  • first export now makes a best effort to include machine info without waiting indefinitely
  • later exports reuse cached machine info and refresh it in the background only when stale
  • the initial machine-info wait is a one-time gate and now honors collection context cancellation
  • one export cycle can now take up to roughly the collection timeout plus the export timeout, because the two budgets are intentionally separate
  • if DCGM startup takes more than 1 minute, the server continues startup with a no-op DCGM instance and skips DCGM-backed checks instead of hanging
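
The bounded-startup behavior in the last note can be sketched as follows. This is an assumption-laden illustration, not the server's code: `checker`, `realChecker`, `noopChecker`, and `initWithBudget` are hypothetical names, and falling back to the no-op instance on an init error is a choice made here for brevity rather than something the PR states.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// checker is a hypothetical interface standing in for the DCGM instance.
type checker interface{ Name() string }

type realChecker struct{}

func (realChecker) Name() string { return "real" }

type noopChecker struct{}

func (noopChecker) Name() string { return "noop" }

// initWithBudget runs initFn in a goroutine and returns its result if it
// finishes within budget; otherwise it returns a no-op checker so startup
// can continue.
func initWithBudget(parent context.Context, budget time.Duration, initFn func(context.Context) (checker, error)) checker {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()

	type result struct {
		c   checker
		err error
	}
	// Buffered so a late initFn can still send and exit after we give up.
	done := make(chan result, 1)
	go func() {
		c, err := initFn(ctx)
		done <- result{c, err}
	}()

	select {
	case r := <-done:
		if r.err != nil {
			return noopChecker{} // assumption: errors also fall back to no-op
		}
		return r.c
	case <-ctx.Done():
		return noopChecker{}
	}
}

// demoInit runs one fast and one stalled initialization.
func demoInit() (fast, slow string) {
	quick := func(context.Context) (checker, error) { return realChecker{}, nil }
	hung := func(context.Context) (checker, error) {
		time.Sleep(300 * time.Millisecond) // ignores ctx, like a stuck library call
		return realChecker{}, nil
	}
	fast = initWithBudget(context.Background(), time.Second, quick).Name()
	slow = initWithBudget(context.Background(), 20*time.Millisecond, hung).Name()
	return fast, slow
}

func main() {
	fast, slow := demoInit()
	fmt.Println(fast, slow)
}
```

The buffered channel matters: without it, an initialization that finishes after the deadline would block forever on the send.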

Testing

  • go test ./...
  • go test -race ./internal/exporter/collector
  • go test github.com/NVIDIA/fleet-intelligence-sdk/pkg/nvidia-query/dcgm ./internal/server ./internal/exporter

Related ticket: GPUHEALTH-1735

Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@coderabbitai

coderabbitai Bot commented Apr 6, 2026

📝 Walkthrough


Machine info collection was refactored to use an asynchronous cached provider; the collector best-effort reads cached info (with an optional initial wait) and schedules background refreshes. The exporter now creates separate timeout-derived contexts for collection and for HTTP transmission.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Machine Info Provider (internal/exporter/collector/machine_info_provider.go) | Adds machineInfoProvider interface and cachedMachineInfoProvider that caches machine info, deduplicates concurrent refreshes, supports async refresh and a one-time initial-wait mechanism, and handles staleness. |
| Collector (internal/exporter/collector/collector.go) | Instantiates provider when enabled, uses provider.Get() to populate data.MachineInfo (best-effort), waits briefly for initial refresh when needed, and triggers RefreshAsync for background updates. |
| Collector Tests & Mocks (internal/exporter/collector/collector_test.go) | Adds tests and mockMachineInfoProvider covering cached reads, initial prewarm/wait behavior, non-blocking collection when refresh stalls, deduplication of concurrent refreshes, and resilience to refresh failures. |
| Exporter Contexts (internal/exporter/exporter.go) | Uses a dedicated timeout-derived collectionCtx for metadata refresh and collection, and a separate exportCtx for HTTP transmission (offline path unchanged). |
| Exporter Tests & Mock HTTP Writer (internal/exporter/exporter_test.go) | Adds MockHTTPWriter and TestExportUsesSeparateContextsForCollectionAndHTTP to verify HTTP export receives a fresh, non-expired context when collection overruns. |

Sequence Diagram(s)

sequenceDiagram
    participant Collector
    participant MachineInfoProvider
    participant NVML
    participant Cache

    Collector->>MachineInfoProvider: New(includeMachineInfo=true, options...)
    MachineInfoProvider->>MachineInfoProvider: start initial async refresh()
    MachineInfoProvider->>NVML: fetch machine info (background)
    NVML-->>MachineInfoProvider: machine info / error
    MachineInfoProvider->>Cache: update cached value (on success)
    MachineInfoProvider-->>Collector: initialRefreshDone (closes)

    Collector->>MachineInfoProvider: Collect(): WaitForInitialRefresh(maxWait)
    MachineInfoProvider-->>Collector: Get() => cached machine info (best-effort)
    Collector->>MachineInfoProvider: RefreshAsync() (schedule next refresh if stale)
sequenceDiagram
    participant Exporter
    participant Collector
    participant HTTPWriter

    Exporter->>Exporter: create collectionCtx (timeout T1)
    Exporter->>Collector: Collect(collectionCtx)
    Note over Collector: collection may use cached data / wait briefly
    Collector-->>Exporter: collection result

    Exporter->>Exporter: create exportCtx (timeout T2, independent)
    Exporter->>HTTPWriter: Send(exportCtx)
    HTTPWriter-->>Exporter: send result
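
The exporter flow in the second diagram can be sketched in Go. This is a rough illustration under stated assumptions: `exportOnce`, `collectFn`, and `sendFn` are hypothetical names, and the real exporter's signatures are not shown in this conversation. What it demonstrates is that the export context is created only after collection returns, so an overrunning collection cannot consume the send budget.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectFn and sendFn are hypothetical signatures used only in this sketch.
type collectFn func(ctx context.Context) string
type sendFn func(ctx context.Context, payload string) error

// exportOnce gives collection and sending their own deadlines.
func exportOnce(parent context.Context, collectTimeout, sendTimeout time.Duration, collect collectFn, send sendFn) error {
	collectionCtx, cancelCollect := context.WithTimeout(parent, collectTimeout)
	data := collect(collectionCtx)
	cancelCollect()

	// Created only after collection ends, so the send deadline starts now.
	exportCtx, cancelExport := context.WithTimeout(parent, sendTimeout)
	defer cancelExport()
	return send(exportCtx, data)
}

// demoSeparateContexts lets collection run until its deadline fires and
// then checks whether the send step still sees a live context.
func demoSeparateContexts() (collectExpired bool, sendErr error) {
	collect := func(ctx context.Context) string {
		<-ctx.Done() // overrun the whole collection budget
		collectExpired = ctx.Err() != nil
		return "partial"
	}
	send := func(ctx context.Context, _ string) error {
		return ctx.Err() // nil while the export context is still live
	}
	sendErr = exportOnce(context.Background(), 20*time.Millisecond, time.Second, collect, send)
	return collectExpired, sendErr
}

func main() {
	expired, err := demoSeparateContexts()
	fmt.Println("collection expired:", expired, "send error:", err)
}
```

The trade-off is the one named in the behavior notes: a single cycle may now last up to roughly T1 + T2 instead of sharing one deadline.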

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I nibbled cache where NVML slept tight,
Async refreshes tend the data at night,
Two timeouts now hop—collection and send,
No more expired contexts to upend,
A joyful hop for stable insight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.05%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective: hardening export behavior when machine info collection is slow, which aligns with the key change of separating collection timeouts, implementing cached machine info, and ensuring exports don't hang indefinitely. |


@jingxiang-z jingxiang-z marked this pull request as ready for review April 6, 2026 22:27

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1546195fc1


Comment thread internal/exporter/collector/collector.go Outdated
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 973-979: The mockMachineInfoProvider accesses the cached field
from multiple goroutines (methods Get and setCached) without synchronization,
causing race failures; add a mutex (e.g., sync.Mutex or sync.RWMutex) as a field
on mockMachineInfoProvider and use it to protect all reads/writes of cached
(wrap cached reads in RLock/RUnlock in Get and Lock/Unlock in setCached), and
update any other helper methods that touch cached (and initialRefreshDone if it
reads/writes cached) to use the same mutex so the mock mirrors the production
provider's locking.

In `@internal/exporter/collector/machine_info_provider.go`:
- Around line 35-39: The WaitForInitialRefresh method currently ignores caller
cancellation; change its signature on machineInfoProvider from
WaitForInitialRefresh(maxWait time.Duration) to WaitForInitialRefresh(ctx
context.Context, maxWait time.Duration), update all callers (notably the Collect
path in collector.go) to pass the collection context (collectionCtx) into
WaitForInitialRefresh, and update every implementation of machineInfoProvider
(including RefreshAsync/Wait loops) so the initial-wait logic selects on
ctx.Done() as well as the timeout to ensure the wait is canceled promptly when
the context is canceled.
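
For the first finding (the unsynchronized mock), one possible shape of a race-safe mock is sketched below. The `machineInfo` type and `mockProvider` name are hypothetical stand-ins; only the idea of guarding `cached` with a mutex in both `Get` and `setCached` comes from the review comment.

```go
package main

import (
	"fmt"
	"sync"
)

// machineInfo is a hypothetical stand-in for the real machine info type.
type machineInfo struct{ Hostname string }

// mockProvider guards cached with an RWMutex so tests can call Get and
// setCached from different goroutines under -race.
type mockProvider struct {
	mu     sync.RWMutex
	cached *machineInfo
}

// Get takes the read lock, so concurrent readers do not block each other.
func (m *mockProvider) Get() (*machineInfo, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.cached, m.cached != nil
}

// setCached takes the write lock before replacing the cached value.
func (m *mockProvider) setCached(v *machineInfo) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.cached = v
}

// demoMock hammers the mock from several goroutines and returns the final
// cached hostname.
func demoMock() string {
	m := &mockProvider{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			m.setCached(&machineInfo{Hostname: "node-a"})
		}()
		go func() {
			defer wg.Done()
			m.Get()
		}()
	}
	wg.Wait()
	v, _ := m.Get()
	return v.Hostname
}

func main() {
	fmt.Println(demoMock())
}
```

Running a sketch like this with `go run -race` is a quick way to confirm that the locking covers every access to the shared field.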

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3e9dcc4d-ee24-4ca4-8c92-1bad9c96278f

📥 Commits

Reviewing files that changed from the base of the PR and between f86cfff and 1546195.

📒 Files selected for processing (5)
  • internal/exporter/collector/collector.go
  • internal/exporter/collector/collector_test.go
  • internal/exporter/collector/machine_info_provider.go
  • internal/exporter/exporter.go
  • internal/exporter/exporter_test.go

Comment thread internal/exporter/collector/collector_test.go
Comment thread internal/exporter/collector/machine_info_provider.go

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
internal/exporter/collector/machine_info_provider.go (1)

35-39: ⚠️ Potential issue | 🟠 Major

Honor caller cancellation during the initial wait.

WaitForInitialRefresh only watches the done channel and the timer, so the first collect can still sit here for the full 5s after its collection context has already been canceled. That undermines the collection/export timeout split this PR is introducing. Please thread context.Context through this method and update the collector.go call site to pass the collection context.

⏱️ Suggested fix
 type machineInfoProvider interface {
 	Get() (*machineinfo.MachineInfo, bool)
 	RefreshAsync(parent context.Context)
-	WaitForInitialRefresh(maxWait time.Duration) bool
+	WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool
 }
@@
-func (p *cachedMachineInfoProvider) WaitForInitialRefresh(maxWait time.Duration) bool {
+func (p *cachedMachineInfoProvider) WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool {
 	if p == nil || maxWait <= 0 {
 		return false
 	}
@@
 	select {
 	case <-p.initialRefreshDone:
 		return true
+	case <-ctx.Done():
+		return false
 	case <-timer.C:
 		return false
 	}
 }

Also applies to: 119-140

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/exporter/collector/machine_info_provider.go` around lines 35 - 39,
The WaitForInitialRefresh method on interface machineInfoProvider currently
blocks on only a timer and its done channel and therefore ignores caller
cancellation; change the signature to WaitForInitialRefresh(ctx context.Context,
maxWait time.Duration) bool, update every concrete implementation of
machineInfoProvider to accept the context and to select on ctx.Done() alongside
the existing timer/done channels, and update the caller in collector.go to pass
the collection context when calling WaitForInitialRefresh; ensure RefreshAsync,
Get and any struct receiver methods still satisfy the interface after the
signature change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 494-497: The test currently calls
provider.markInitialRefreshDone() inside provider.refreshFn before blocking on
<-blocker, which signals completion too early; change provider.refreshFn so it
first blocks on <-blocker and only calls provider.markInitialRefreshDone() after
the blocker unblocks, ensuring the test exercises the 5s timeout path when
Collect() is invoked.
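
The ordering issue in this finding can be illustrated with a small sketch. The `provider` type below is hypothetical and far simpler than the real one; it only reproduces the three-way select from the suggested fix and shows a refresh function that blocks first and marks completion afterwards, so the wait genuinely exercises its timeout path.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// provider is a hypothetical, stripped-down stand-in for the cached provider.
type provider struct {
	once sync.Once
	done chan struct{}
}

func newProvider() *provider { return &provider{done: make(chan struct{})} }

// markInitialRefreshDone closes the done channel exactly once.
func (p *provider) markInitialRefreshDone() { p.once.Do(func() { close(p.done) }) }

// WaitForInitialRefresh returns true only if the initial refresh finished
// before the timer fired or the context was canceled.
func (p *provider) WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool {
	if maxWait <= 0 {
		return false
	}
	timer := time.NewTimer(maxWait)
	defer timer.Stop()
	select {
	case <-p.done:
		return true
	case <-ctx.Done():
		return false
	case <-timer.C:
		return false
	}
}

// demoStalledRefresh blocks the refresh before it signals completion, so
// the first wait must time out; the second wait succeeds after unblocking.
func demoStalledRefresh() (timedOut, doneAfterUnblock bool) {
	p := newProvider()
	blocker := make(chan struct{})
	finished := make(chan struct{})
	go func() {
		defer close(finished)
		<-blocker                  // stall first...
		p.markInitialRefreshDone() // ...and only then report completion
	}()
	timedOut = !p.WaitForInitialRefresh(context.Background(), 30*time.Millisecond)
	close(blocker)
	<-finished
	doneAfterUnblock = p.WaitForInitialRefresh(context.Background(), time.Second)
	return timedOut, doneAfterUnblock
}

func main() {
	timedOut, done := demoStalledRefresh()
	fmt.Println("timed out:", timedOut, "done after unblock:", done)
}
```

If the two lines inside the goroutine were swapped, the first wait would return true immediately and the timeout path would never run, which is the problem the comment describes.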


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9efe8286-b237-40d9-adc7-bb330ad6d704

📥 Commits

Reviewing files that changed from the base of the PR and between 1546195 and 465d156.

📒 Files selected for processing (2)
  • internal/exporter/collector/collector_test.go
  • internal/exporter/collector/machine_info_provider.go

Comment thread internal/exporter/collector/collector_test.go
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@jingxiang-z jingxiang-z changed the title from "fix: harden export under slow machine info collection" to "fix: harden startup and export under slow DCGM and machine info" Apr 6, 2026
@jingxiang-z
Collaborator Author

@coderabbitai review

@jingxiang-z jingxiang-z requested a review from mukilsh April 7, 2026 16:34
@jingxiang-z jingxiang-z self-assigned this Apr 7, 2026
Contributor

@mukilsh mukilsh left a comment


LGTM

@jingxiang-z jingxiang-z merged commit 810acfb into main Apr 7, 2026
9 checks passed
@jingxiang-z jingxiang-z deleted the fix/export-machine-info-resilience branch April 7, 2026 17:21