fix: drive storage cleanup through reset before discovery (#5555701)#1748
fix: drive storage cleanup through reset before discovery (#5555701)#1748williampnvidia wants to merge 7 commits into
Conversation
Signed-off-by: Josh P <williamp@nvidia.com>
Signed-off-by: Josh P <williamp@nvidia.com>
9f63f6f to
f96cbec
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
WalkthroughThis pull request refactors machine cleanup state handling to distinguish between initial-discovery cleanup and deprovision cleanup flows. A new ChangesInitial Discovery Cleanup Context Refactoring
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
crates/api/src/tests/host_bmc_firmware_test.rs (1)
2759-2763: ⚡ Quick winUse a checked SQLx query or DB helper for cleanup time reset.
This raw
sqlx::querystring bypasses compile-time validation and is inconsistent with the coding guideline requiring "compile-time checked queries for database operations". Either switch to asqlx::query_as!checked macro or create a helper similar to the existingdb::machine::update_cleanup_timefunction (which currently handlesNOW()updates). The same pattern appears inmachine_states.rsand should be addressed there as well.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api/src/tests/host_bmc_firmware_test.rs` around lines 2759 - 2763, The raw sqlx::query call that sets last_cleanup_time to NULL bypasses compile-time checked queries; replace it with a compile-time-checked alternative or a DB helper. Either (A) use sqlx's checked macro (e.g., sqlx::query! with the SQL and mh.host().id as a typed parameter) in place of the raw sqlx::query invocation, or (B) add/ reuse a helper in db::machine (e.g., extend db::machine::update_cleanup_time to accept Option<DateTime> or add db::machine::set_cleanup_time_null) and call that helper from the test/wherever the raw query appears (also update the similar occurrence in machine_states.rs). Ensure the new call accepts the mutable transaction (txn.as_mut()) and returns/propagates the Result instead of unwrapping.crates/api/src/tests/machine_states.rs (1)
787-792: ⚡ Quick winStrengthen the second-failure assertion to prove the retry path executed.
At Line 787, this currently re-checks only
failure_details.source, which can still pass if the secondcleanup_machine_completedcall is a no-op. Add an assertion on failed-state retry progression (or another monotonic field) to make this regression test deterministic.Proposed assertion tightening
let mut txn = env.db_txn().await; let host = mh.host().db_machine(&mut txn).await; + assert!(matches!( + host.current_state(), + ManagedHostState::Failed { retry_count: 1, .. } + )); assert!(matches!( host.failure_details.source, FailureSource::StateMachineArea(StateMachineArea::HostInit) ));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api/src/tests/machine_states.rs` around lines 787 - 792, The test currently only re-checks host.failure_details.source after the second cleanup_machine_completed call, which can pass even if the retry path didn't run; fetch and store the host's failure-related monotonic field (e.g., host.failure_details.retry_count or a timestamp like host.failure_details.last_failed_at) before invoking cleanup_machine_completed the second time, call cleanup_machine_completed (the function under test) again, then re-fetch host via mh.host().db_machine(&mut txn).await and assert that that monotonic field has increased/advanced to prove the retry path executed; reference the symbols txn, env.db_txn(), mh.host(), host.failure_details, FailureSource::StateMachineArea(StateMachineArea::HostInit) and cleanup_machine_completed when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api/src/tests/machine_states.rs`:
- Around line 805-811: The test uses raw sqlx::query(...) SQL strings (e.g., the
UPDATE setting last_cleanup_time) which bypass compile-time checking; replace
these with SQLx compile-time checked queries (e.g., sqlx::query! or
sqlx::query_as! macros) or use your typed DB helper so the query and its
parameters (mh.id) are validated at build time — locate the raw calls in
machine_states.rs (the sqlx::query(...) that updates machines.last_cleanup_time
and the similar raw calls around the other noted ranges) and convert them to the
corresponding sqlx::query! macro invocation with the proper parameter
placeholder and typed return mapping.
---
Nitpick comments:
In `@crates/api/src/tests/host_bmc_firmware_test.rs`:
- Around line 2759-2763: The raw sqlx::query call that sets last_cleanup_time to
NULL bypasses compile-time checked queries; replace it with a
compile-time-checked alternative or a DB helper. Either (A) use sqlx's checked
macro (e.g., sqlx::query! with the SQL and mh.host().id as a typed parameter) in
place of the raw sqlx::query invocation, or (B) add/ reuse a helper in
db::machine (e.g., extend db::machine::update_cleanup_time to accept
Option<DateTime> or add db::machine::set_cleanup_time_null) and call that helper
from the test/wherever the raw query appears (also update the similar occurrence
in machine_states.rs). Ensure the new call accepts the mutable transaction
(txn.as_mut()) and returns/propagates the Result instead of unwrapping.
In `@crates/api/src/tests/machine_states.rs`:
- Around line 787-792: The test currently only re-checks
host.failure_details.source after the second cleanup_machine_completed call,
which can pass even if the retry path didn't run; fetch and store the host's
failure-related monotonic field (e.g., host.failure_details.retry_count or a
timestamp like host.failure_details.last_failed_at) before invoking
cleanup_machine_completed the second time, call cleanup_machine_completed (the
function under test) again, then re-fetch host via mh.host().db_machine(&mut
txn).await and assert that that monotonic field has increased/advanced to prove
the retry path executed; reference the symbols txn, env.db_txn(), mh.host(),
host.failure_details,
FailureSource::StateMachineArea(StateMachineArea::HostInit) and
cleanup_machine_completed when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 3d6d91fc-7551-4a0c-b1d9-1e7cf2f0c4c8
📒 Files selected for processing (13)
crates/api-model/src/machine/mod.rscrates/api/src/handlers/machine_scout.rscrates/api/src/state_controller/machine/handler.rscrates/api/src/state_controller/machine/io.rscrates/api/src/tests/common/api_fixtures/instance.rscrates/api/src/tests/common/api_fixtures/site_explorer.rscrates/api/src/tests/host_bmc_firmware_test.rscrates/api/src/tests/instance.rscrates/api/src/tests/ipxe.rscrates/api/src/tests/machine_states.rscrates/api/src/tests/resource_pool.rscrates/scout/src/deprovision/scrabbing.rscrates/scout/src/main.rs
Summary
discovery now leaves destructive storage cleanup to the API-directed reset path
WaitingForCleanupso the state machine can distinguishinitial discovery cleanup from deprovision cleanup and route recovery correctly
WaitingForCleanup { InitialDiscovery }beforereturning Scout
Action::DiscoveryNVMECleanFailedretries so asuccessful retry returns to
HostInit/WaitingForDiscovery, not the deprovision flowboots, and repeated HostInit cleanup failures
Recovery context note
NVMECleanFailedrecovery currently usesFailureSource::StateMachineArea(HostInit)to remember that the failure happened during initial discovery cleanup. That keeps
this fix scoped and avoids changing serialized
FailureCauseshape in this PR.A follow-up should consider storing cleanup context directly on the storage-cleanup
failure cause instead of inferring it from
FailureSource.Dev environment validation
environment and verified the updated pods rolled out
state, Scout
RESET, NVMe cleanup, successfulcleanup_machine_completed,and no destructive cleanup during the later
DISCOVERYhost move through
WaitingForCleanup { InitialDiscovery }, the actual hostreceive
RESET, cleanup complete successfully, thenDISCOVERYlog theno-API cleanup skip message
the machine moved to
NVMECleanFailedmachine recovered from
NVMECleanFailedforce-delete/re-ingest path
Test plan
cargo +1.90.0 fmt --allgit diff --checkSummary by CodeRabbit
New Features
Bug Fixes