Skip to content

feat: add loop and retry for dcgm detection#180

Merged
ambermingxin merged 4 commits into
mainfrom
feat/detect-dcgm
Apr 28, 2026
Merged

feat: add loop and retry for dcgm detection#180
ambermingxin merged 4 commits into
mainfrom
feat/detect-dcgm

Conversation

@ambermingxin
Copy link
Copy Markdown
Collaborator

@ambermingxin ambermingxin commented Apr 23, 2026

Description

  • Introduce a reconnecting DCGM instance wrapper in third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go.
  • Track desired registrations in wrapper state (health watches, watched fields, added entities) so calls made while DCGM is unavailable are replayed when reconnect succeeds.
  • Run a background reconcile loop every 30s (or configurable, default 30s) that:
  1. attempts dcgm.Init when unavailable,
  2. atomically swaps in the real instance on success,
  3. replays registrations,
  4. logs state transition once (down->up / up->down).
  • Update FieldValueCache in field_cache.go to support late availability:
  1. don’t permanently skip start when DCGM is initially unavailable,
  2. lazily (re)run field-group setup when instance becomes available and fields exist.
  • Keep HealthCache polling but reduce noisy warnings when DCGM is simply not loaded yet; reserve warning/error for real API failures.
  • Handle direct-setup components (dcgm/cpu, dcgm/prof) with lazy one-time setup in Check() (or explicit EnsureInitialized) so they can recover after late DCGM availability too.

Note

  • Current deferred registration: components declare watches/fields immediately; DCGM wrapper buffers intent and replays when DCGM is up.
  • Alternative approach (register only after up): some manager must remember all intended watches/fields and execute registration later.
    The current implementation allows minimal invasive change to existing component flow, and it works for startup races and late DCGM readiness. It also aturally supports replays when DCGM reconnects (not just first startup).

Summary by CodeRabbit

  • Improvements

    • GPU monitoring now registers watches earlier and tolerates delayed DCGM availability; registrations are replayed when DCGM reconnects.
    • New background reconnecting behavior remembers and replays health/field watches and entity registrations.
    • Polling and health checks treat transient DCGM unavailability as recoverable; NVSwitch/entity registration is deduplicated and tracked.
  • Tests

    • Added tests for reconnect behavior, deferred registrations, uninitialized-group handling, and state replay.

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Constructors now register DCGM watches/fields when a DCGM instance object exists even if DCGM is not yet connected. A reconnectingInstance was added to remember and replay deferred registrations, run periodic reconnect attempts, and allow polling/health caches to tolerate DCGM being temporarily unavailable.

Changes

Cohort / File(s) Summary
DCGM Component Guard Relaxation
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/clock/component.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/inforom/component.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/mem/component.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvlink/component.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/pcie/component.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/thermal/component.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/utilization/component.go
Removed direct DCGMExists() guard in New() constructors; health-watch/field registration calls are invoked for any non-nil dcgmInstance. Runtime IsSupported()/Check() still validate DCGM availability.
NVSwitch Synchronized Registration
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvswitch/component.go
Adds registeredSwitches + switchMu and ensureSwitchEntitiesRegistered() to register only unseen NVSwitch IDs, track counts, and report entity-add failures.
Field Watching Infrastructure
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go
Makes field-group name mutable, adds idempotent ensureFieldWatchingSetup(), ensures on-demand setup in Poll(), and changes Start()/Poll() to always start polling and to tolerate missing DCGM or zero watched fields without returning errors.
Health Cache Graceful Degradation
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go
Treats missing DCGM or uninitialized group handles as transient in Poll() (log + return nil). Introduces errTransientGroupNotReady; healthCheckDirect() obtains group handle via Instance and returns transient sentinel when handle==0.
Instance Reconnection and Initialization
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
Adds unexported reconnectingInstance that records deferred registrations (groups/entities/fields/health watches), runs a background reconnect loop, builds connected instances, replays recorded state on connect, swaps active instance, invokes reconnect callbacks, and supports graceful Shutdown. NewWithContext now returns the reconnecting wrapper in more startup scenarios.
Field/Health Tests and Mocks
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go, third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go, third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache_test.go
Adds reconnect-focused tests, transient-group-handle tests, and test-only mocks (tracking, blocking replay, and health-cache mock) to validate replay of deferred registrations and Poll/health behavior when DCGM is not yet ready.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant RI as ReconnectingInstance
    participant CI as ConnectedInstance
    participant DCGM as DCGM Library

    App->>RI: AddHealthWatch / AddFieldsToWatch / AddEntityGroup
    RI->>RI: Record deferred registrations (watches, fields, entities)
    Note over RI: Background reconnect loop (periodic)
    RI->>DCGM: Attempt Init / Connect
    alt DCGM available
        RI->>CI: Create connected instance
        RI->>CI: Replay deferred watches, fields, entities
        CI->>DCGM: Register health watches and fields/groups
        RI->>RI: Swap active instance, run reconnect callbacks
    else DCGM unavailable
        RI->>RI: Keep deferred registrations, retry later
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 I stash the watches while DCGM sleeps,
I hop in loops and wait through quiet beats,
then on reconnect I replay every call,
watches wake, fields answer, none fall,
a small rabbit's queue keeps the system neat.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: add loop and retry for dcgm detection' accurately describes the main change: introducing a background reconnection loop for DCGM detection and initialization. The title is concise and clearly summarizes the primary functionality added.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/detect-dcgm

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.go (1)

82-83: Update the guard comment to match current behavior.

Line 82 says “Only initialize if DCGM is available”, but the condition at Line 83 now checks only that the instance is non-nil. Consider rewording to avoid confusion during maintenance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.go`
around lines 82 - 83, The comment above the if check is outdated; update the
guard comment to accurately describe the current condition that is being checked
(c.dcgmInstance != nil) so it doesn't claim full DCGM availability when only a
non-nil instance is required. Edit the comment immediately preceding the if that
references c.dcgmInstance to something like "Initialize only when dcgmInstance
is present (non-nil)" or "Proceed if c.dcgmInstance is non-nil" so it directly
matches the behavior in component.go and the if (c.dcgmInstance != nil) guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go`:
- Around line 259-266: The code currently treats a zero group handle
(inst.GetGroupHandle() -> groupHandle.GetHandle() == 0) as a permanent error
which gets stored in lastError; instead treat this as a transient
bootstrap/reconnect condition so it doesn't surface as persistent unhealthy.
Change the zero-handle branch so it returns a transient error (e.g., a specific
sentinel like errTransientGroupNotReady or wrap the error so
IsTransientError(...) returns true) rather than a normal non-transient error;
update IsTransientError to recognize that sentinel/message if needed. Ensure the
change is applied where groupHandle.GetHandle() is checked (near
inst.GetGroupHandle()) so dcgm.HealthCheck is only called when handle != 0 and
transient errors are not persisted to lastError.

---

Nitpick comments:
In
`@third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.go`:
- Around line 82-83: The comment above the if check is outdated; update the
guard comment to accurately describe the current condition that is being checked
(c.dcgmInstance != nil) so it doesn't claim full DCGM availability when only a
non-nil instance is required. Edit the comment immediately preceding the if that
references c.dcgmInstance to something like "Initialize only when dcgmInstance
is present (non-nil)" or "Proceed if c.dcgmInstance is non-nil" so it directly
matches the behavior in component.go and the if (c.dcgmInstance != nil) guard.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b0aef71-ca4d-41a8-849c-d990a872825f

📥 Commits

Reviewing files that changed from the base of the PR and between 24e2b5c and 00843a2.

📒 Files selected for processing (13)
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/clock/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/inforom/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/mem/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvlink/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvswitch/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/pcie/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/thermal/component.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/utilization/component.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00843a2478

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go Outdated
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go (1)

46-49: Optional: strengthen the post-Poll() assertion.

Right now this test only checks err == nil. You can also assert default PASS + no incidents to lock expected fallback semantics.

Proposed assertion extension
-	_, _, err := hc.GetResult(dcgm.DCGM_HEALTH_WATCH_PCIE)
+	health, incidents, err := hc.GetResult(dcgm.DCGM_HEALTH_WATCH_PCIE)
 	if err != nil {
 		t.Fatalf("GetResult() should not expose transient group-not-ready as lastError, got: %v", err)
 	}
+	if health != dcgm.DCGM_HEALTH_RESULT_PASS {
+		t.Fatalf("expected default PASS result, got: %v", health)
+	}
+	if len(incidents) != 0 {
+		t.Fatalf("expected no incidents for default result, got: %d", len(incidents))
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go`
around lines 46 - 49, After calling hc.GetResult(dcgm.DCGM_HEALTH_WATCH_PCIE)
capture the returned result and strengthen the assertion: assert result.Status
== dcgm.DCGM_HEALTH_RESULT_PASS and assert there are no incidents (e.g.,
len(result.Incidents) == 0) so the test verifies the default PASS fallback
semantics; update the t.Fatalf messages on failure to include the actual result
for easier debugging.
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go (1)

141-145: Consider lowering transient retry log level to avoid periodic noise.

This path can run every poll interval during warmup/reconnect. Using debug (or throttling) would keep routine transient retries quieter.

Proposed tweak
-			log.Logger.Infow("DCGM transient error, will retry",
+			log.Logger.Debugw("DCGM transient error, will retry",
 				"component", "health_cache",
 				"error", err)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go`
around lines 141 - 145, The transient-error branch in health_cache (the block
checking errors.Is(err, errTransientGroupNotReady) || IsTransientError(err)) is
using log.Logger.Infow and should be made quieter to avoid poll-interval noise;
change that log call to a debug-level call (e.g., log.Logger.Debugw) or wrap it
with a rate/throttle so transient retries during warmup/reconnect are not
repeatedly logged at info level, keeping the same structured fields
("component","health_cache","error", err) and leaving the return nil behavior
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go`:
- Around line 46-49: After calling hc.GetResult(dcgm.DCGM_HEALTH_WATCH_PCIE)
capture the returned result and strengthen the assertion: assert result.Status
== dcgm.DCGM_HEALTH_RESULT_PASS and assert there are no incidents (e.g.,
len(result.Incidents) == 0) so the test verifies the default PASS fallback
semantics; update the t.Fatalf messages on failure to include the actual result
for easier debugging.

In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go`:
- Around line 141-145: The transient-error branch in health_cache (the block
checking errors.Is(err, errTransientGroupNotReady) || IsTransientError(err)) is
using log.Logger.Infow and should be made quieter to avoid poll-interval noise;
change that log call to a debug-level call (e.g., log.Logger.Debugw) or wrap it
with a rate/throttle so transient retries during warmup/reconnect are not
repeatedly logged at info level, keeping the same structured fields
("component","health_cache","error", err) and leaving the return nil behavior
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 380d9e6d-db9d-41f8-a95b-b442c2c5e013

📥 Commits

Reviewing files that changed from the base of the PR and between 00843a2 and 640805a.

📒 Files selected for processing (2)
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/health_cache_test.go

@jingxiang-z
Copy link
Copy Markdown
Collaborator

Need to test this before merge. Need to verify after the dcgm available, is agent recreate the DCGM group and fieldgroup.

@jingxiang-z
Copy link
Copy Markdown
Collaborator

jingxiang-z commented Apr 23, 2026

Code Review — Medium Issues

### [Critical] New() is not wrapped with reconnectingInstance
Retracted — server.go on main already calls nvidiadcgm.NewWithContext(...), so the reconnect wrapper is correctly wired up. My local copy was stale.


[Medium] fieldGroupID is never reset on DCGM reconnect — field watching is not re-established

When reconnectNow() succeeds it calls previousInst.Shutdown(), which destroys the DCGM group and invalidates the field group handle. But FieldValueCache.fieldGroupID is never cleared, so ensureFieldWatchingSetup() always short-circuits on the guard:

if fc.fieldGroupID.GetHandle() != 0 {
    return nil  // stale handle; WatchFieldsWithGroupEx never re-called on the new connection
}

DCGM will not push field data for the new connection, so GetLatestValuesForFields silently returns empty/stale values after every reconnect.

Suggested fix: expose a ResetFieldGroup() method on FieldValueCache that zeroes fieldGroupID (under registrationMu), and have reconnectingInstance accept a slice of reset callbacks that are invoked after a successful reconnect.


[Medium] current is swapped before previousInst.Shutdown() — teardown races with readers

inst.mu.Lock()
previousInst := inst.current
inst.current = connectedInst  // new instance visible to all readers here
inst.mu.Unlock()

_ = previousInst.Shutdown()  // destroys group + cleanup *after* the swap

Between the unlock and Shutdown(), a concurrent FieldValueCache.Poll() mid-flight may still hold a reference to previousInst's group handle; Shutdown() concurrently destroys that group, which can produce a spurious DCGM error on the inflight call. Low-risk today but a latent hazard as more callers are added.

Suggested fix: call previousInst.Shutdown() before the swap, or keep the old instance alive until all inflight calls have drained.

@ambermingxin
Copy link
Copy Markdown
Collaborator Author

Code Review — Critical & Medium Issues

[Critical] New() is not wrapped with reconnectingInstance — reconnect is unreachable from the server

Only NewWithContext gets the reconnectingInstance wrapper, but internal/server/server.go calls nvidiadcgm.New(), which is unchanged in this PR. New() still returns a plain *instance or bare NewNoOp() on failure with no reconnect loop.

As-is, newReconnectingInstance is dead code for the main server execution path. (scan.go intentionally uses New() since it's one-shot — that's fine.)

Options:

  • Modify New() to also wrap with newReconnectingInstance, or
  • Add a NewWithAutoReconnect(ctx context.Context) function and update server.go to call it instead of New()

in current code server.go already uses nvidiadcgm.NewWithContext(...) in current main. Are you referring to any other occurrence?

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go (1)

90-133: ⚠️ Potential issue | 🟠 Major

Don't retain a destroyed field-group handle after setup failure.

Line 118 stores fc.fieldGroupID before WatchFieldsWithGroupEx(...) succeeds. If that watch call fails, Line 127 destroys the handle, but the cached handle stays non-zero, so future calls return early at Line 90 and never retry setup.

Proposed fix
 	fieldGroupID, err := dcgm.FieldGroupCreate(fieldGroupName, watchedFields)
 	if err != nil {
 		setupErr := fmt.Errorf("failed to create DCGM field group: %w", err)
 		// Store error so GetResult() returns it to components
 		fc.mu.Lock()
 		fc.lastError = setupErr
 		fc.mu.Unlock()
 		return setupErr
 	}
-	fc.fieldGroupID = fieldGroupID

 	updateFreqMicroseconds := int64(fc.pollInterval / time.Microsecond)
 	maxKeepAge := fc.pollInterval.Seconds() * 2
 	maxKeepSamples := int32(3)

 	err = dcgm.WatchFieldsWithGroupEx(fieldGroupID, fc.instance.GetGroupHandle(),
 		updateFreqMicroseconds, maxKeepAge, maxKeepSamples)
 	if err != nil {
 		dcgm.FieldGroupDestroy(fieldGroupID)
 		setupErr := fmt.Errorf("failed to set up DCGM field watching: %w", err)
 		// Store error so GetResult() returns it to components
 		fc.mu.Lock()
 		fc.lastError = setupErr
 		fc.mu.Unlock()
 		return setupErr
 	}
+	fc.fieldGroupID = fieldGroupID
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go`
around lines 90 - 133, The code stores fc.fieldGroupID before
WatchFieldsWithGroupEx succeeds, so if WatchFieldsWithGroupEx fails the handle
is destroyed but fc.fieldGroupID remains non-zero and prevents retries; fix by
only assigning fc.fieldGroupID after WatchFieldsWithGroupEx returns nil, or if
you keep the current order, ensure you clear fc.fieldGroupID (set to zero) while
holding fc.mu immediately after dcgm.FieldGroupDestroy(fieldGroupID) and before
returning the setup error, and also ensure lastError is set under the same lock
so future GetResult/retries behave correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache_test.go`:
- Around line 97-105: The test currently uses a non-fatal assert.NotNil for
mockInstance.callback then unconditionally calls mockInstance.callback(), which
can cause a panic; make the check fatal so the test fails cleanly if the
callback isn't registered (e.g. replace assert.NotNil(t, mockInstance.callback,
...) with a fatal assertion such as require.NotNil(t, mockInstance.callback,
"reconnect callback should be registered") or an explicit if
mockInstance.callback == nil { t.Fatalf("reconnect callback should be
registered") }) before the subsequent fc.mu.Lock()/mockInstance.callback()
invocation; keep the rest of the setup (fc.lastError, fc.lastUpdate, fc.values)
unchanged.

In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go`:
- Around line 512-535: The loop never re-enters reconnect mode because
DCGMExists() returns a static dcgmExists flag set at creation; change the logic
so reconnectLoop observes the actual runtime DCGM availability instead of that
static flag—either make DCGMExists() query the underlying live instance state
(or remove the cached dcgmExists and update it on disconnects) and then have
reconnectLoop call that dynamic check each tick, using the existing symbols
reconnectLoop, DCGMExists, reconnectNow, reconnectInterval, stopCh and the
dcgmExists field to locate and update the code paths.
- Around line 548-557: The reconnect swap is not atomic with old-instance
teardown and reconnects can race with Shutdown/reconnectNow; modify the logic
around inst.current, inst.currentMu, runReconnectCallbacks(), reconnectNow() and
Shutdown() so the swap + callback + previousInst.Shutdown() happen as a single
serialized operation and in-flight reconnects are aborted once Shutdown begins:
add a small synchronization token (e.g. a generation counter or a
reconnectInProgress flag guarded by currentMu or a new reconnectMu) that
reconnectNow() reads and compares before writing inst.current, increment that
token when starting the serialized swap, perform inst.current = connectedInst,
call inst.runReconnectCallbacks(), then call previousInst.Shutdown() while still
owning the token so no other reconnect can install a new current mid-teardown,
and have Shutdown() set a "shuttingDown" marker (or increment the token) so any
concurrent reconnectNow() detects this and aborts early; update reconnectNow(),
Shutdown(), and the swap site (where previousInst is captured) to use this
token/flag to ensure atomic swap+teardown and to abort in-flight reconnects.

---

Outside diff comments:
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go`:
- Around line 90-133: The code stores fc.fieldGroupID before
WatchFieldsWithGroupEx succeeds, so if WatchFieldsWithGroupEx fails the handle
is destroyed but fc.fieldGroupID remains non-zero and prevents retries; fix by
only assigning fc.fieldGroupID after WatchFieldsWithGroupEx returns nil, or if
you keep the current order, ensure you clear fc.fieldGroupID (set to zero) while
holding fc.mu immediately after dcgm.FieldGroupDestroy(fieldGroupID) and before
returning the setup error, and also ensure lastError is set under the same lock
so future GetResult/retries behave correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fbb47656-c90b-4548-923c-ea417dd9bb06

📥 Commits

Reviewing files that changed from the base of the PR and between 640805a and fd9778f.

📒 Files selected for processing (4)
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/field_cache_test.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go (1)

280-287: Minor: Direct struct initialization bypasses constructor validation.

The test directly instantiates reconnectingInstance{} to avoid starting the reconnect loop, which is reasonable for this test. However, this pattern could become fragile if the struct adds required fields in the future.

Consider adding a comment explaining why this is done, or creating a test-only constructor that doesn't start the goroutine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go`
around lines 280 - 287, The test constructs reconnectingInstance directly which
bypasses constructor validation; either add a brief comment above this
instantiation explaining that the direct struct init is intentional to avoid
starting the reconnect goroutine (and must be kept in sync if fields are added),
or provide and use a test-only constructor like NewTestReconnectingInstance that
returns a pre-initialized reconnectingInstance without launching the reconnect
loop; reference the reconnectingInstance type, the NewNoOp() call used for
current, and the Shutdown() method to ensure any new constructor or comment
clarifies identical field initialization (watchedFields, groupEntities,
reconnectInterval, stopCh) and the reason for not starting the background
goroutine.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go`:
- Around line 280-287: The test constructs reconnectingInstance directly which
bypasses constructor validation; either add a brief comment above this
instantiation explaining that the direct struct init is intentional to avoid
starting the reconnect goroutine (and must be kept in sync if fields are added),
or provide and use a test-only constructor like NewTestReconnectingInstance that
returns a pre-initialized reconnectingInstance without launching the reconnect
loop; reference the reconnectingInstance type, the NewNoOp() call used for
current, and the Shutdown() method to ensure any new constructor or comment
clarifies identical field initialization (watchedFields, groupEntities,
reconnectInterval, stopCh) and the reason for not starting the background
goroutine.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5939284a-37dd-40cd-b4bd-c5aba2c87cf0

📥 Commits

Reviewing files that changed from the base of the PR and between fd9778f and f66d7fb.

📒 Files selected for processing (2)
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go

Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go (1)

696-704: Minor style note: defer placement.

The pattern of placing defer inst.currentMu.RUnlock() after capturing the value (line 699) is unconventional but functionally correct. The more idiomatic Go pattern places defer immediately after lock acquisition:

inst.currentMu.RLock()
defer inst.currentMu.RUnlock()
currentInst := inst.current

This is consistent across multiple methods in this file, so it may be intentional. Not a correctness issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go` around
lines 696 - 704, The defer for unlocking should be placed immediately after
acquiring the read lock for idiomatic clarity: in
reconnectingInstance.DCGMExists(), call inst.currentMu.RLock() then defer
inst.currentMu.RUnlock() and only then read currentInst := inst.current; update
other similar methods in this file to follow the same pattern (look for methods
on reconnectingInstance that use currentMu.RLock/RUnlock and current) to keep
style consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go`:
- Around line 696-704: The defer for unlocking should be placed immediately
after acquiring the read lock for idiomatic clarity: in
reconnectingInstance.DCGMExists(), call inst.currentMu.RLock() then defer
inst.currentMu.RUnlock() and only then read currentInst := inst.current; update
other similar methods in this file to follow the same pattern (look for methods
on reconnectingInstance that use currentMu.RLock/RUnlock and current) to keep
style consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bcc28f22-54a2-444f-b739-81d44c4f9473

📥 Commits

Reviewing files that changed from the base of the PR and between f66d7fb and 66db289.

📒 Files selected for processing (2)
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance.go
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go
✅ Files skipped from review due to trivial changes (1)
  • third_party/fleet-intelligence-sdk/pkg/nvidia-query/dcgm/instance_test.go

@ambermingxin ambermingxin merged commit a601987 into main Apr 28, 2026
9 checks passed
@ambermingxin ambermingxin deleted the feat/detect-dcgm branch April 28, 2026 20:02
jingxiang-z pushed a commit that referenced this pull request Apr 28, 2026
Signed-off-by: Amber Xue <ambermingxin@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants