fix: drive storage cleanup through reset before discovery (#5555701) by williampnvidia · Pull Request #1748 · NVIDIA/infra-controller

williampnvidia · 2026-05-17T21:21:55Z

Summary

- Stop Scout's discovery/no-API path from performing hidden NVMe/HDD/SAS cleanup; discovery now leaves destructive storage cleanup to the API-directed reset path
- Add cleanup context to WaitingForCleanup so the state machine can distinguish initial discovery cleanup from deprovision cleanup and route recovery correctly
- Drive initial discovery through WaitingForCleanup { InitialDiscovery } before returning Scout Action::Discovery
- Preserve HostInit cleanup context across repeated NVMECleanFailed retries so a successful retry returns to HostInit/WaitingForDiscovery, not the deprovision flow
- Add regressions covering firmware-upgrade Scout boots, assigned discovery-image boots, and repeated HostInit cleanup failures
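
As a rough illustration of the context-aware routing described in the summary, the sketch below models a cleanup context carried on the waiting state and used to pick the next state. Only `CleanupContext`, its two variants, `WaitingForCleanup`, and `WaitingForDiscovery` come from this PR; `HostState`, `DeprovisionContinues`, and `state_after_cleanup` are hypothetical stand-ins, and the real `ManagedHostState` has many more variants and fields.

```rust
/// Why a host is waiting for storage cleanup (variant names from this PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CleanupContext {
    /// Cleanup on the deprovision path; the walkthrough lists this as the default.
    #[default]
    Deprovision,
    /// Cleanup that runs before a host goes through initial discovery.
    InitialDiscovery,
}

/// Hypothetical, heavily reduced stand-in for `ManagedHostState`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HostState {
    WaitingForCleanup { cleanup_context: CleanupContext },
    WaitingForDiscovery,
    /// Placeholder for wherever the deprovision flow continues (assumption).
    DeprovisionContinues,
}

/// Hypothetical helper: pick the state that follows a successful cleanup.
/// Returns `None` when the host is not waiting for cleanup at all.
pub fn state_after_cleanup(state: &HostState) -> Option<HostState> {
    match state {
        HostState::WaitingForCleanup { cleanup_context } => Some(match cleanup_context {
            CleanupContext::InitialDiscovery => HostState::WaitingForDiscovery,
            CleanupContext::Deprovision => HostState::DeprovisionContinues,
        }),
        _ => None,
    }
}
```

Keeping the context on the state itself means the success path needs no extra lookup to decide where the host goes next.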

Recovery context note

NVMECleanFailed recovery currently uses FailureSource::StateMachineArea(HostInit)
to remember that the failure happened during initial discovery cleanup. That keeps
this fix scoped and avoids changing serialized FailureCause shape in this PR.
A follow-up should consider storing cleanup context directly on the storage-cleanup
failure cause instead of inferring it from FailureSource.
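
To make the inference in this note concrete, here is a minimal sketch of recovering a cleanup context from the failure source. `FailureSource::StateMachineArea(HostInit)` and `CleanupContext` are named in this PR; the `Other` variants and the function `cleanup_context_for_recovery` are hypothetical placeholders, and the real enums may well have a different shape.

```rust
/// Reduced stand-in for the real enum; `Other` is a hypothetical placeholder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateMachineArea {
    HostInit,
    Other,
}

/// Reduced stand-in for the real enum; `Other` is a hypothetical placeholder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureSource {
    StateMachineArea(StateMachineArea),
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CleanupContext {
    Deprovision,
    InitialDiscovery,
}

/// Hypothetical helper: a failure recorded against HostInit is treated as an
/// initial-discovery cleanup failure; every other source falls back to deprovision.
pub fn cleanup_context_for_recovery(source: FailureSource) -> CleanupContext {
    match source {
        FailureSource::StateMachineArea(StateMachineArea::HostInit) => {
            CleanupContext::InitialDiscovery
        }
        _ => CleanupContext::Deprovision,
    }
}
```

The fallback arm shows why the follow-up is worth considering: any source that is not HostInit is silently treated as deprovision, whereas a context stored on the failure cause would be explicit.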

Dev environment validation

- Deployed the local Carbide build into the local-dev control-plane environment and verified that the updated pods rolled out
- Normal delete path: provisioned a host, deleted it, and observed the API cleanup state, Scout RESET, NVMe cleanup, a successful cleanup_machine_completed, and no destructive cleanup during the later DISCOVERY
- Force-delete/re-ingest path: force-deleted the host, then observed the predicted host move through WaitingForCleanup { InitialDiscovery }, the actual host receive RESET, cleanup complete successfully, and DISCOVERY log the no-API cleanup skip message
- Injected an NVMe cleanup failure in Scout for the delete path and verified the machine moved to NVMECleanFailed
- Restored the non-failing Scout artifact, rebooted the host, and verified the machine recovered from NVMECleanFailed
- Repeated the injected NVMe cleanup failure and recovery validation for the force-delete/re-ingest path
- Verified API state transitions and Scout cleanup behavior in Loki logs

Test plan

cargo +1.90.0 fmt --all
git diff --check

Summary by CodeRabbit

New Features
- Introduced context-aware cleanup flow that distinguishes between initial discovery and deprovision cleanup paths for improved failure recovery.
Bug Fixes
- Enhanced failure handling to preserve initial-discovery cleanup path during recovery, preventing incorrect state transitions.
- Clarified storage cleanup responsibility, ensuring NVMe and HDD cleanup is properly delegated to the reset workflow.

Signed-off-by: Josh P <williamp@nvidia.com>

ianderson-nvidia · 2026-05-19T16:10:12Z

@coderabbitai review

coderabbitai · 2026-05-19T16:10:18Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-05-19T16:16:26Z

Walkthrough

This pull request refactors machine cleanup state handling to distinguish between initial-discovery cleanup and deprovision cleanup flows. A new CleanupContext enum is added to the model, cleanup state transitions are refactored to preserve context throughout the machine lifecycle, scout handlers are updated to enforce cleanup prerequisites before discovery, and tests are expanded to validate context-aware behavior.

Changes

Initial Discovery Cleanup Context Refactoring

Layer / File(s)	Summary
Model schema and display updates `crates/api-model/src/machine/mod.rs`	`CleanupContext` enum added with `Deprovision` (default) and `InitialDiscovery` variants; `ManagedHostState::WaitingForCleanup` extended with `cleanup_context: CleanupContext` field; `Display` and `dpu_state_string` rendering updated to match new shape.
Scout handler cleanup and discovery logic `crates/api/src/handlers/machine_scout.rs`	`cleanup_machine_completed` derives `FailureSource` dynamically from machine state to preserve `HostInit` context on retries. `forge_agent_control` gates initial discovery on `last_cleanup_time`, returning `Retry` if cleanup not yet performed.
State controller transition refactoring `crates/api/src/state_controller/machine/handler.rs`	Major refactoring: `WaitingForCleanup` handling destructures and preserves `cleanup_context`; all substate transitions (`SecureEraseBoss`, `CreateBossVolume`) use new `waiting_for_cleanup_state(...)` helper; failure recovery paths select context based on source (initial-discovery vs. deprovision); BOSS job handling computes and threads context through error and success paths; new helpers centralize context extraction and post-cleanup state derivation.
Metrics and observability `crates/api/src/state_controller/machine/io.rs`	`metric_state_names` match arm updated to destructure `cleanup_state` while ignoring new field; metric output preserved.
Test fixture initial discovery cleanup helper `crates/api/src/tests/common/api_fixtures/site_explorer.rs`	New `complete_initial_discovery_cleanup_if_needed` helper detects whether initial cleanup is required, drives state controller to correct cleanup state, issues forge-agent reset, completes cleanup, and waits for return to `WaitingForDiscovery`. Helper injected into both host state-controller iteration chains to support both old and new lifecycle behaviors.
Test assertions and new validations `crates/api/src/tests/common/api_fixtures/instance.rs`, `crates/api/src/tests/host_bmc_firmware_test.rs`, `crates/api/src/tests/instance.rs`, `crates/api/src/tests/ipxe.rs`, `crates/api/src/tests/machine_states.rs`	All test imports updated to include `CleanupContext`. Existing state assertions now specify `cleanup_context: CleanupContext::Deprovision`. New tests validate that repeated initial-discovery cleanup failures preserve `HostInit` source, and that `forge_agent_control` returns `Noop` (not `Reset`) when `last_cleanup_time` is absent for firmware-upgrade and discovery-boot states. Existing measurement/host-init test modified to expect initial-discovery cleanup context. Metric counts adjusted in `test_dpu_and_host_till_ready`.
Scout deprovision and boundary clarification `crates/api/src/tests/resource_pool.rs`, `crates/scout/src/deprovision/scrabbing.rs`, `crates/scout/src/main.rs`	`run_no_api` deprovision path now clears TPM and skips NVMe/HDD cleanup with explanatory log. Test database setup refactored to use SQLx `PgConnectOptions` and `PgPoolOptions`. Comment clarified to document that discovery prep must not scrub storage.
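
The gating that the walkthrough attributes to `forge_agent_control` can be pictured with the sketch below. It assumes the simplest reading of the table: a host waiting for cleanup is told to reset, and discovery is withheld until a cleanup time has been recorded. `Action::Discovery`, `Reset`, `Retry`, and `last_cleanup_time` appear in this PR; `HostPhase` and `scout_action` are hypothetical, and the actual handler's conditions and return values may differ.

```rust
use std::time::SystemTime;

/// Actions the API can hand back to Scout; names follow the PR text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Discovery,
    Reset,
    Retry,
}

/// Hypothetical, reduced view of the host state seen by the handler.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostPhase {
    WaitingForCleanup,
    WaitingForDiscovery,
}

/// Hypothetical decision function: destructive cleanup happens only via Reset,
/// and Discovery is returned only once a cleanup time has been recorded.
pub fn scout_action(phase: HostPhase, last_cleanup_time: Option<SystemTime>) -> Action {
    match (phase, last_cleanup_time) {
        // The API directs the storage cleanup explicitly.
        (HostPhase::WaitingForCleanup, _) => Action::Reset,
        // No cleanup recorded yet: ask Scout to come back later (assumed behavior).
        (HostPhase::WaitingForDiscovery, None) => Action::Retry,
        // Cleanup already recorded: discovery may proceed without scrubbing storage.
        (HostPhase::WaitingForDiscovery, Some(_)) => Action::Discovery,
    }
}
```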

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A cleanup context hops through the state,
Distinguishing when it's deprovision's gate.
Initial discovery waits its turn,
While helpers ensure the helpers don't burn.
Scout and controller in harmony dance,
Giving each cleanup its proper stance!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.14% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: routing destructive storage cleanup through the reset action before discovery, matching the core objective of preventing Scout's discovery path from performing storage cleanup.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

crates/api/src/tests/host_bmc_firmware_test.rs (1)
2759-2763: ⚡ Quick win

Use a checked SQLx query or DB helper for cleanup time reset.

This raw sqlx::query string bypasses compile-time validation and is inconsistent with the coding guideline requiring "compile-time checked queries for database operations". Either switch to a sqlx::query_as! checked macro or create a helper similar to the existing db::machine::update_cleanup_time function (which currently handles NOW() updates). The same pattern appears in machine_states.rs and should be addressed there as well.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api/src/tests/host_bmc_firmware_test.rs` around lines 2759 - 2763, The
raw sqlx::query call that sets last_cleanup_time to NULL bypasses compile-time
checked queries; replace it with a compile-time-checked alternative or a DB
helper. Either (A) use sqlx's checked macro (e.g., sqlx::query! with the SQL and
mh.host().id as a typed parameter) in place of the raw sqlx::query invocation,
or (B) add/reuse a helper in db::machine (e.g., extend
db::machine::update_cleanup_time to accept Option<DateTime> or add
db::machine::set_cleanup_time_null) and call that helper from the test/wherever
the raw query appears (also update the similar occurrence in machine_states.rs).
Ensure the new call accepts the mutable transaction (txn.as_mut()) and
returns/propagates the Result instead of unwrapping.

crates/api/src/tests/machine_states.rs (1)
787-792: ⚡ Quick win

Strengthen the second-failure assertion to prove the retry path executed.

At Line 787, this currently re-checks only failure_details.source, which can still pass if the second cleanup_machine_completed call is a no-op. Add an assertion on failed-state retry progression (or another monotonic field) to make this regression test deterministic.

Proposed assertion tightening

     let mut txn = env.db_txn().await;
     let host = mh.host().db_machine(&mut txn).await;
+    assert!(matches!(
+        host.current_state(),
+        ManagedHostState::Failed { retry_count: 1, .. }
+    ));
     assert!(matches!(
         host.failure_details.source,
         FailureSource::StateMachineArea(StateMachineArea::HostInit)
     ));

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api/src/tests/machine_states.rs` around lines 787 - 792, The test
currently only re-checks host.failure_details.source after the second
cleanup_machine_completed call, which can pass even if the retry path didn't
run; fetch and store the host's failure-related monotonic field (e.g.,
host.failure_details.retry_count or a timestamp like
host.failure_details.last_failed_at) before invoking cleanup_machine_completed
the second time, call cleanup_machine_completed (the function under test) again,
then re-fetch host via mh.host().db_machine(&mut txn).await and assert that that
monotonic field has increased/advanced to prove the retry path executed;
reference the symbols txn, env.db_txn(), mh.host(), host.failure_details,
FailureSource::StateMachineArea(StateMachineArea::HostInit) and
cleanup_machine_completed when making the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api/src/tests/machine_states.rs`:
- Around line 805-811: The test uses raw sqlx::query(...) SQL strings (e.g., the
UPDATE setting last_cleanup_time) which bypass compile-time checking; replace
these with SQLx compile-time checked queries (e.g., sqlx::query! or
sqlx::query_as! macros) or use your typed DB helper so the query and its
parameters (mh.id) are validated at build time — locate the raw calls in
machine_states.rs (the sqlx::query(...) that updates machines.last_cleanup_time
and the similar raw calls around the other noted ranges) and convert them to the
corresponding sqlx::query! macro invocation with the proper parameter
placeholder and typed return mapping.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3d6d91fc-7551-4a0c-b1d9-1e7cf2f0c4c8

📥 Commits

Reviewing files that changed from the base of the PR and between 755d116 and 6b6bb64.

📒 Files selected for processing (13)

crates/api-model/src/machine/mod.rs
crates/api/src/handlers/machine_scout.rs
crates/api/src/state_controller/machine/handler.rs
crates/api/src/state_controller/machine/io.rs
crates/api/src/tests/common/api_fixtures/instance.rs
crates/api/src/tests/common/api_fixtures/site_explorer.rs
crates/api/src/tests/host_bmc_firmware_test.rs
crates/api/src/tests/instance.rs
crates/api/src/tests/ipxe.rs
crates/api/src/tests/machine_states.rs
crates/api/src/tests/resource_pool.rs
crates/scout/src/deprovision/scrabbing.rs
crates/scout/src/main.rs

williampnvidia requested a review from a team as a code owner May 17, 2026 21:21

williampnvidia requested review from ajf, krish-nvidia, martinraumann and stoo-davies May 17, 2026 21:29

williampnvidia added 4 commits May 18, 2026 14:16

fix: keep storage cleanup out of discovery

d6f94a0

Signed-off-by: Josh P <williamp@nvidia.com>

fix: drive initial cleanup through reset

5060943

Signed-off-by: Josh P <williamp@nvidia.com>

Add storage cleanup recovery regressions

53af9ae

test: drive initial cleanup in host fixture

f96cbec

williampnvidia force-pushed the fix/scout-discovery-no-storage-cleanup branch from 9f63f6f to f96cbec May 18, 2026 21:19

williampnvidia added 2 commits May 18, 2026 15:08

test: update host readiness metrics for cleanup

cd6d0a4

test: stabilize cleanup fixture tests

6b6bb64

krish-nvidia reviewed May 19, 2026

Comment thread crates/api/src/state_controller/machine/handler.rs Outdated

krish-nvidia reviewed May 19, 2026

Comment thread crates/api/src/handlers/machine_scout.rs Outdated

coderabbitai Bot reviewed May 19, 2026

Comment thread crates/api/src/tests/machine_states.rs Outdated

test: address cleanup review feedback

d03b23f

krish-nvidia self-requested a review May 20, 2026 13:48

krish-nvidia approved these changes May 20, 2026

williampnvidia wants to merge 7 commits into NVIDIA:main from williampnvidia:fix/scout-discovery-no-storage-cleanup

3 participants
