fix: harden startup and export under slow DCGM and machine info#157
fix: harden startup and export under slow DCGM and machine info#157jingxiang-z merged 4 commits intomainfrom
Conversation
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
📝 WalkthroughWalkthroughMachine info collection was refactored to use an asynchronous cached provider; the collector best-effort reads cached info (with an optional initial wait) and schedules background refreshes. The exporter now creates separate timeout-derived contexts for collection and for HTTP transmission. Changes
Sequence Diagram(s)sequenceDiagram
participant Collector
participant MachineInfoProvider
participant NVML
participant Cache
Collector->>MachineInfoProvider: New(includeMachineInfo=true, options...)
MachineInfoProvider->>MachineInfoProvider: start initial async refresh()
MachineInfoProvider->>NVML: fetch machine info (background)
NVML-->>MachineInfoProvider: machine info / error
MachineInfoProvider->>Cache: update cached value (on success)
MachineInfoProvider-->>Collector: initialRefreshDone (closes)
Collector->>MachineInfoProvider: Collect(): WaitForInitialRefresh(maxWait)
MachineInfoProvider-->>Collector: Get() => cached machine info (best-effort)
Collector->>MachineInfoProvider: RefreshAsync() (schedule next refresh if stale)
sequenceDiagram
participant Exporter
participant Collector
participant HTTPWriter
Exporter->>Exporter: create collectionCtx (timeout T1)
Exporter->>Collector: Collect(collectionCtx)
Note over Collector: collection may use cached data / wait briefly
Collector-->>Exporter: collection result
Exporter->>Exporter: create exportCtx (timeout T2, independent)
Exporter->>HTTPWriter: Send(exportCtx)
HTTPWriter-->>Exporter: send result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1546195fc1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 973-979: The mockMachineInfoProvider accesses the cached field
from multiple goroutines (methods Get and setCached) without synchronization,
causing race failures; add a mutex (e.g., sync.Mutex or sync.RWMutex) as a field
on mockMachineInfoProvider and use it to protect all reads/writes of cached
(wrap cached reads in RLock/RUnlock in Get and Lock/Unlock in setCached), and
update any other helper methods that touch cached (and initialRefreshDone if it
reads/writes cached) to use the same mutex so the mock mirrors the production
provider's locking.
In `@internal/exporter/collector/machine_info_provider.go`:
- Around line 35-39: The WaitForInitialRefresh method currently ignores caller
cancellation; change its signature on machineInfoProvider from
WaitForInitialRefresh(maxWait time.Duration) to WaitForInitialRefresh(ctx
context.Context, maxWait time.Duration), update all callers (notably the Collect
path in collector.go) to pass the collection context (collectionCtx) into
WaitForInitialRefresh, and update every implementation of machineInfoProvider
(including RefreshAsync/Wait loops) so the initial-wait logic selects on
ctx.Done() as well as the timeout to ensure the wait is canceled promptly when
the context is canceled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3e9dcc4d-ee24-4ca4-8c92-1bad9c96278f
📒 Files selected for processing (5)
internal/exporter/collector/collector.gointernal/exporter/collector/collector_test.gointernal/exporter/collector/machine_info_provider.gointernal/exporter/exporter.gointernal/exporter/exporter_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/exporter/collector/machine_info_provider.go (1)
35-39:⚠️ Potential issue | 🟠 MajorHonor caller cancellation during the initial wait.
WaitForInitialRefreshonly watches the done channel and the timer, so the first collect can still sit here for the full 5s after its collection context has already been canceled. That undermines the collection/export timeout split this PR is introducing. Please threadcontext.Contextthrough this method and update thecollector.gocall site to pass the collection context.⏱️ Suggested fix
type machineInfoProvider interface { Get() (*machineinfo.MachineInfo, bool) RefreshAsync(parent context.Context) - WaitForInitialRefresh(maxWait time.Duration) bool + WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool } @@ -func (p *cachedMachineInfoProvider) WaitForInitialRefresh(maxWait time.Duration) bool { +func (p *cachedMachineInfoProvider) WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool { if p == nil || maxWait <= 0 { return false } @@ select { case <-p.initialRefreshDone: return true + case <-ctx.Done(): + return false case <-timer.C: return false } }Also applies to: 119-140
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/exporter/collector/machine_info_provider.go` around lines 35 - 39, The WaitForInitialRefresh method on interface machineInfoProvider currently blocks on only a timer and its done channel and therefore ignores caller cancellation; change the signature to WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool, update every concrete implementation of machineInfoProvider to accept the context and to select on ctx.Done() alongside the existing timer/done channels, and update the caller in collector.go to pass the collection context when calling WaitForInitialRefresh; ensure RefreshAsync, Get and any struct receiver methods still satisfy the interface after the signature change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/exporter/collector/collector_test.go`:
- Around line 494-497: The test currently calls
provider.markInitialRefreshDone() inside provider.refreshFn before blocking on
<-blocker, which signals completion too early; change provider.refreshFn so it
first blocks on <-blocker and only calls provider.markInitialRefreshDone() after
the blocker unblocks, ensuring the test exercises the 5s timeout path when
Collect() is invoked.
---
Duplicate comments:
In `@internal/exporter/collector/machine_info_provider.go`:
- Around line 35-39: The WaitForInitialRefresh method on interface
machineInfoProvider currently blocks on only a timer and its done channel and
therefore ignores caller cancellation; change the signature to
WaitForInitialRefresh(ctx context.Context, maxWait time.Duration) bool, update
every concrete implementation of machineInfoProvider to accept the context and
to select on ctx.Done() alongside the existing timer/done channels, and update
the caller in collector.go to pass the collection context when calling
WaitForInitialRefresh; ensure RefreshAsync, Get and any struct receiver methods
still satisfy the interface after the signature change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9efe8286-b237-40d9-adc7-bb330ad6d704
📒 Files selected for processing (2)
internal/exporter/collector/collector_test.gointernal/exporter/collector/machine_info_provider.go
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
|
@coderabbitai review |
Summary
Behavior Notes
Testing
go test ./...go test -race ./internal/exporter/collectorgo test github.com/NVIDIA/fleet-intelligence-sdk/pkg/nvidia-query/dcgm ./internal/server ./internal/exporterRelated ticket: GPUHEALTH-1735