
fix: drive storage cleanup through reset before discovery (#5555701) #1748

Open
williampnvidia wants to merge 7 commits into NVIDIA:main from williampnvidia:fix/scout-discovery-no-storage-cleanup

Conversation

@williampnvidia
Contributor

@williampnvidia williampnvidia commented May 17, 2026

Summary

  • Stop Scout's discovery/no-API path from performing hidden NVMe/HDD/SAS cleanup;
    discovery now leaves destructive storage cleanup to the API-directed reset path
  • Add cleanup context to WaitingForCleanup so the state machine can distinguish
    initial discovery cleanup from deprovision cleanup and route recovery correctly
  • Drive initial discovery through WaitingForCleanup { InitialDiscovery } before
    returning Scout Action::Discovery
  • Preserve HostInit cleanup context across repeated NVMECleanFailed retries so a
    successful retry returns to HostInit/WaitingForDiscovery, not the deprovision flow
  • Add regressions covering firmware-upgrade Scout boots, assigned discovery-image
    boots, and repeated HostInit cleanup failures
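
To make the context-carrying state concrete, here is a minimal sketch rather than the PR's actual code. `CleanupContext`, its two variants, and the `cleanup_context` field are named in this PR; `HostState`, `DeprovisionNext`, and `state_after_cleanup` are hypothetical stand-ins for the much larger real state machine.

```rust
/// Why the host is in cleanup. Variant names follow the PR;
/// everything else in this sketch is a simplified assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CleanupContext {
    /// Default, so existing cleanup states keep their old meaning.
    #[default]
    Deprovision,
    InitialDiscovery,
}

/// Heavily reduced stand-in for the real ManagedHostState (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HostState {
    WaitingForCleanup { cleanup_context: CleanupContext },
    WaitingForDiscovery,
    /// Hypothetical placeholder for wherever the deprovision flow continues.
    DeprovisionNext,
}

/// Pick the state to enter once cleanup has completed successfully.
/// Returns None when the host is not waiting for cleanup at all.
pub fn state_after_cleanup(state: &HostState) -> Option<HostState> {
    match state {
        HostState::WaitingForCleanup { cleanup_context } => Some(match cleanup_context {
            // Initial discovery cleanup hands back to discovery.
            CleanupContext::InitialDiscovery => HostState::WaitingForDiscovery,
            // Deprovision cleanup continues the deprovision flow.
            CleanupContext::Deprovision => HostState::DeprovisionNext,
        }),
        _ => None,
    }
}
```

Carrying the context on the state itself means the success path never has to guess which flow it belongs to.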

Recovery context note

NVMECleanFailed recovery currently uses FailureSource::StateMachineArea(HostInit)
to remember that the failure happened during initial discovery cleanup. That keeps
this fix scoped and avoids changing serialized FailureCause shape in this PR.
A follow-up should consider storing cleanup context directly on the storage-cleanup
failure cause instead of inferring it from FailureSource.
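
As a rough illustration of the inference described above (a sketch under assumptions, not the handler code), the context can be encoded into the failure source when cleanup fails and decoded again on recovery. `FailureSource::StateMachineArea(StateMachineArea::HostInit)` is taken from this PR; the `Other` and `Scout` variants and both function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CleanupContext {
    Deprovision,
    InitialDiscovery,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateMachineArea {
    HostInit,
    /// Hypothetical catch-all for every other area.
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureSource {
    StateMachineArea(StateMachineArea),
    /// Hypothetical variant standing in for failures reported by Scout.
    Scout,
}

/// Choose the failure source to record when storage cleanup fails.
pub fn failure_source_for(context: CleanupContext) -> FailureSource {
    match context {
        CleanupContext::InitialDiscovery => {
            FailureSource::StateMachineArea(StateMachineArea::HostInit)
        }
        CleanupContext::Deprovision => FailureSource::Scout,
    }
}

/// Recover the cleanup context from a recorded failure source.
/// Anything other than HostInit falls back to the deprovision flow.
pub fn recovery_context(source: FailureSource) -> CleanupContext {
    match source {
        FailureSource::StateMachineArea(StateMachineArea::HostInit) => {
            CleanupContext::InitialDiscovery
        }
        _ => CleanupContext::Deprovision,
    }
}
```

The round trip only holds while no other failure uses the HostInit source for cleanup, which is the coupling the proposed follow-up would remove.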

Dev environment validation

  • Deployed the local Carbide build into the local-dev control-plane
    environment and verified that the updated pods rolled out
  • Normal delete path: provisioned a host, deleted it, observed API cleanup
    state, Scout RESET, NVMe cleanup, successful cleanup_machine_completed,
    and no destructive cleanup during the later DISCOVERY
  • Force-delete/re-ingest path: force-deleted the host, then observed the
    predicted host move through WaitingForCleanup { InitialDiscovery }, the
    actual host receive RESET, cleanup complete successfully, and the
    subsequent DISCOVERY log the no-API cleanup skip message
  • Injected an NVMe cleanup failure in Scout for the delete path and verified
    the machine moved to NVMECleanFailed
  • Restored the non-failing Scout artifact, rebooted the host, and verified the
    machine recovered from NVMECleanFailed
  • Repeated the injected NVMe cleanup failure and recovery validation for the
    force-delete/re-ingest path
  • Verified API state transitions and Scout cleanup behavior in Loki logs

Test plan

  • cargo +1.90.0 fmt --all
  • git diff --check

Summary by CodeRabbit

  • New Features

    • Introduced context-aware cleanup flow that distinguishes between initial discovery and deprovision cleanup paths for improved failure recovery.
  • Bug Fixes

    • Enhanced failure handling to preserve initial-discovery cleanup path during recovery, preventing incorrect state transitions.
    • Clarified storage cleanup responsibility, ensuring NVMe and HDD cleanup is properly delegated to the reset workflow.
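
The responsibility split can be pictured as a plan of steps per Scout action. This is a hedged sketch only: `Step`, `planned_steps`, and `is_destructive_storage` are hypothetical names, the TPM step on the discovery side follows the walkthrough of this PR, and the exact contents and order of the reset plan are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Discovery,
    Reset,
}

/// Hypothetical cleanup steps; names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    ClearTpm,
    NvmeCleanup,
    HddCleanup,
}

impl Step {
    /// True for steps that erase data on storage devices.
    pub fn is_destructive_storage(self) -> bool {
        matches!(self, Step::NvmeCleanup | Step::HddCleanup)
    }
}

/// Steps Scout would run for a given action in this sketch.
pub fn planned_steps(action: Action) -> Vec<Step> {
    match action {
        // Discovery prep must not scrub storage.
        Action::Discovery => vec![Step::ClearTpm],
        // Destructive cleanup runs only when the API directs a reset
        // (assumed step list).
        Action::Reset => vec![Step::NvmeCleanup, Step::HddCleanup],
    }
}
```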

Review Change Stack

@williampnvidia williampnvidia force-pushed the fix/scout-discovery-no-storage-cleanup branch from 9f63f6f to f96cbec Compare May 18, 2026 21:19
Comment thread crates/api/src/state_controller/machine/handler.rs Outdated
Comment thread crates/api/src/handlers/machine_scout.rs Outdated
@ianderson-nvidia
Contributor

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

Walkthrough

This pull request refactors machine cleanup state handling to distinguish between initial-discovery cleanup and deprovision cleanup flows. A new CleanupContext enum is added to the model, cleanup state transitions are refactored to preserve context throughout the machine lifecycle, scout handlers are updated to enforce cleanup prerequisites before discovery, and tests are expanded to validate context-aware behavior.

Changes

Initial Discovery Cleanup Context Refactoring

| Layer | File(s) | Summary |
|---|---|---|
| Model schema and display updates | crates/api-model/src/machine/mod.rs | CleanupContext enum added with Deprovision (default) and InitialDiscovery variants; ManagedHostState::WaitingForCleanup extended with cleanup_context: CleanupContext field; Display and dpu_state_string rendering updated to match new shape. |
| Scout handler cleanup and discovery logic | crates/api/src/handlers/machine_scout.rs | cleanup_machine_completed derives FailureSource dynamically from machine state to preserve HostInit context on retries. forge_agent_control gates initial discovery on last_cleanup_time, returning Retry if cleanup not yet performed. |
| State controller transition refactoring | crates/api/src/state_controller/machine/handler.rs | Major refactoring: WaitingForCleanup handling destructures and preserves cleanup_context; all substate transitions (SecureEraseBoss, CreateBossVolume) use new waiting_for_cleanup_state(...) helper; failure recovery paths select context based on source (initial-discovery vs. deprovision); BOSS job handling computes and threads context through error and success paths; new helpers centralize context extraction and post-cleanup state derivation. |
| Metrics and observability | crates/api/src/state_controller/machine/io.rs | metric_state_names match arm updated to destructure cleanup_state while ignoring new field; metric output preserved. |
| Test fixture initial discovery cleanup helper | crates/api/src/tests/common/api_fixtures/site_explorer.rs | New complete_initial_discovery_cleanup_if_needed helper detects whether initial cleanup is required, drives state controller to correct cleanup state, issues forge-agent reset, completes cleanup, and waits for return to WaitingForDiscovery. Helper injected into both host state-controller iteration chains to support both old and new lifecycle behaviors. |
| Test assertions and new validations | crates/api/src/tests/common/api_fixtures/instance.rs, crates/api/src/tests/host_bmc_firmware_test.rs, crates/api/src/tests/instance.rs, crates/api/src/tests/ipxe.rs, crates/api/src/tests/machine_states.rs | All test imports updated to include CleanupContext. Existing state assertions now specify cleanup_context: CleanupContext::Deprovision. New tests validate that repeated initial-discovery cleanup failures preserve HostInit source, and that forge_agent_control returns Noop (not Reset) when last_cleanup_time is absent for firmware-upgrade and discovery-boot states. Existing measurement/host-init test modified to expect initial-discovery cleanup context. Metric counts adjusted in test_dpu_and_host_till_ready. |
| Scout deprovision and boundary clarification | crates/api/src/tests/resource_pool.rs, crates/scout/src/deprovision/scrabbing.rs, crates/scout/src/main.rs | run_no_api deprovision path now clears TPM and skips NVMe/HDD cleanup with explanatory log. Test database setup refactored to use SQLx PgConnectOptions and PgPoolOptions. Comment clarified to document that discovery prep must not scrub storage. |
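
The gate on last_cleanup_time described for forge_agent_control could look roughly like the sketch below. It is an assumption-laden illustration, not the handler itself: `Phase` and `next_action` are hypothetical, and the choice of `Retry` for the not-yet-cleaned case follows the handler row above (the test row mentions Noop for other states, so the exact action per state should be checked against the code).

```rust
use std::time::SystemTime;

/// Actions Scout can be told to take; reduced to what the sketch needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Discovery,
    Reset,
    Retry,
}

/// Hypothetical two-state view of the host lifecycle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    WaitingForCleanup,
    WaitingForDiscovery,
}

/// Decide what to tell Scout. Discovery is only handed out once a cleanup
/// timestamp exists, so destructive cleanup always happens via Reset first.
pub fn next_action(phase: Phase, last_cleanup_time: Option<SystemTime>) -> Action {
    match (phase, last_cleanup_time) {
        // While cleanup is pending, direct Scout to the reset path.
        (Phase::WaitingForCleanup, _) => Action::Reset,
        // Cleanup has been recorded: discovery may proceed.
        (Phase::WaitingForDiscovery, Some(_)) => Action::Discovery,
        // No cleanup recorded yet: ask Scout to come back later (assumed).
        (Phase::WaitingForDiscovery, None) => Action::Retry,
    }
}
```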

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A cleanup context hops through the state,
Distinguishing when it's deprovision's gate.
Initial discovery waits its turn,
While helpers ensure the helpers don't burn.
Scout and controller in harmony dance,
Giving each cleanup its proper stance!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: routing destructive storage cleanup through the reset action before discovery, matching the core objective of preventing Scout's discovery path from performing storage cleanup. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/api/src/tests/host_bmc_firmware_test.rs (1)

2759-2763: ⚡ Quick win

Use a checked SQLx query or DB helper for cleanup time reset.

This raw sqlx::query string bypasses compile-time validation and is inconsistent with the coding guideline requiring "compile-time checked queries for database operations". Either switch to a sqlx::query_as! checked macro or create a helper similar to the existing db::machine::update_cleanup_time function (which currently handles NOW() updates). The same pattern appears in machine_states.rs and should be addressed there as well.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api/src/tests/host_bmc_firmware_test.rs` around lines 2759 - 2763, The
raw sqlx::query call that sets last_cleanup_time to NULL bypasses compile-time
checked queries; replace it with a compile-time-checked alternative or a DB
helper. Either (A) use sqlx's checked macro (e.g., sqlx::query! with the SQL and
mh.host().id as a typed parameter) in place of the raw sqlx::query invocation,
or (B) add/reuse a helper in db::machine (e.g., extend
db::machine::update_cleanup_time to accept Option<DateTime> or add
db::machine::set_cleanup_time_null) and call that helper from the test/wherever
the raw query appears (also update the similar occurrence in machine_states.rs).
Ensure the new call accepts the mutable transaction (txn.as_mut()) and
returns/propagates the Result instead of unwrapping.
crates/api/src/tests/machine_states.rs (1)

787-792: ⚡ Quick win

Strengthen the second-failure assertion to prove the retry path executed.

At Line 787, this currently re-checks only failure_details.source, which can still pass if the second cleanup_machine_completed call is a no-op. Add an assertion on failed-state retry progression (or another monotonic field) to make this regression test deterministic.

Proposed assertion tightening
     let mut txn = env.db_txn().await;
     let host = mh.host().db_machine(&mut txn).await;
+    assert!(matches!(
+        host.current_state(),
+        ManagedHostState::Failed { retry_count: 1, .. }
+    ));
     assert!(matches!(
         host.failure_details.source,
         FailureSource::StateMachineArea(StateMachineArea::HostInit)
     ));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api/src/tests/machine_states.rs` around lines 787 - 792, The test
currently only re-checks host.failure_details.source after the second
cleanup_machine_completed call, which can pass even if the retry path didn't
run; fetch and store the host's failure-related monotonic field (e.g.,
host.failure_details.retry_count or a timestamp like
host.failure_details.last_failed_at) before invoking cleanup_machine_completed
the second time, call cleanup_machine_completed (the function under test) again,
then re-fetch host via mh.host().db_machine(&mut txn).await and assert that that
monotonic field has increased/advanced to prove the retry path executed;
reference the symbols txn, env.db_txn(), mh.host(), host.failure_details,
FailureSource::StateMachineArea(StateMachineArea::HostInit) and
cleanup_machine_completed when making the change.
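
The snapshot-then-compare pattern suggested here can be sketched in isolation. Everything in this block is hypothetical: `FakeHost` is an in-memory stand-in for the database-backed host, `retry_count` is only the example field the review proposes, and the fake `cleanup_machine_completed` merely imitates a handler that records a failed cleanup.

```rust
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct FailureDetails {
    /// Hypothetical monotonic counter (the review only says "e.g. retry_count").
    pub retry_count: u32,
}

/// In-memory stand-in for the host record fetched in the real test.
#[derive(Debug, Default)]
pub struct FakeHost {
    pub failure_details: FailureDetails,
}

impl FakeHost {
    /// Imitates the handler under test: a failed cleanup report bumps the counter.
    pub fn cleanup_machine_completed(&mut self, succeeded: bool) {
        if !succeeded {
            self.failure_details.retry_count += 1;
        }
    }
}

/// Snapshot the counter, run `act`, and report whether it strictly advanced.
/// A no-op call leaves the counter unchanged and therefore returns false.
pub fn advanced<F: FnOnce(&mut FakeHost)>(host: &mut FakeHost, act: F) -> bool {
    let before = host.failure_details.retry_count;
    act(host);
    host.failure_details.retry_count > before
}
```

Asserting on strict advancement is what distinguishes a real retry from a call that silently did nothing, which an equality check on the failure source alone cannot do.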
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api/src/tests/machine_states.rs`:
- Around line 805-811: The test uses raw sqlx::query(...) SQL strings (e.g., the
UPDATE setting last_cleanup_time) which bypass compile-time checking; replace
these with SQLx compile-time checked queries (e.g., sqlx::query! or
sqlx::query_as! macros) or use your typed DB helper so the query and its
parameters (mh.id) are validated at build time — locate the raw calls in
machine_states.rs (the sqlx::query(...) that updates machines.last_cleanup_time
and the similar raw calls around the other noted ranges) and convert them to the
corresponding sqlx::query! macro invocation with the proper parameter
placeholder and typed return mapping.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3d6d91fc-7551-4a0c-b1d9-1e7cf2f0c4c8

📥 Commits

Reviewing files that changed from the base of the PR and between 755d116 and 6b6bb64.

📒 Files selected for processing (13)
  • crates/api-model/src/machine/mod.rs
  • crates/api/src/handlers/machine_scout.rs
  • crates/api/src/state_controller/machine/handler.rs
  • crates/api/src/state_controller/machine/io.rs
  • crates/api/src/tests/common/api_fixtures/instance.rs
  • crates/api/src/tests/common/api_fixtures/site_explorer.rs
  • crates/api/src/tests/host_bmc_firmware_test.rs
  • crates/api/src/tests/instance.rs
  • crates/api/src/tests/ipxe.rs
  • crates/api/src/tests/machine_states.rs
  • crates/api/src/tests/resource_pool.rs
  • crates/scout/src/deprovision/scrabbing.rs
  • crates/scout/src/main.rs

Comment thread crates/api/src/tests/machine_states.rs Outdated
@krish-nvidia krish-nvidia self-requested a review May 20, 2026 13:48

3 participants