fix(api): abort switch/machine reprovisioning when parent rack is in …#1884

Open
vinodchitraliNVIDIA wants to merge 3 commits into
NVIDIA:mainfrom
vinodchitraliNVIDIA:vc/rack_ready

Conversation

@vinodchitraliNVIDIA
Contributor

…Error

When the rack maintenance flow bails out into RackState::Error, switches in SwitchControllerState::ReProvisioning and machines in HostReprovision stayed stuck, waiting for sub-states (firmware upgrade, NVOS update, NMX-C configure) that the rack would never advance.

Add an early-exit at the start of each reprovisioning handler: look up the parent rack and, if it is in Error, clear the rack-driven reprovisioning request bit (switch_reprovisioning_requested / host_reprovisioning_requested) and transition the object back to its pre-reprovisioning Ready state.
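
A minimal sketch of the early-exit described above, shown for the switch case. The `Switch` struct, the two simplified enums, and the function name `abort_if_rack_in_error` are hypothetical stand-ins for illustration, not the definitions used in the handlers; only `RackState::Error`, the `ReProvisioning` and `Ready` states, and the `switch_reprovisioning_requested` flag come from the description.

```rust
/// Simplified stand-in for the rack state; only `Error` matters for the guard.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RackState {
    Maintenance,
    Error,
}

/// Simplified stand-in for the switch controller state (assumption: the real
/// enum carries sub-states that are omitted here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SwitchState {
    Ready,
    ReProvisioning,
}

/// Hypothetical switch record carrying the rack-driven request bit.
#[derive(Debug)]
struct Switch {
    state: SwitchState,
    switch_reprovisioning_requested: bool,
}

/// Intended to run at the start of the reprovisioning handler.
/// Returns `true` if reprovisioning was aborted because the parent rack is in
/// `Error`; otherwise leaves the switch untouched and returns `false`.
fn abort_if_rack_in_error(switch: &mut Switch, parent_rack: RackState) -> bool {
    if switch.state != SwitchState::ReProvisioning || parent_rack != RackState::Error {
        return false;
    }
    // Clear the rack-driven request bit so the switch is not pulled back
    // into reprovisioning, then return to the pre-reprovisioning state.
    switch.switch_reprovisioning_requested = false;
    switch.state = SwitchState::Ready;
    true
}
```

The machine handler would presumably follow the same shape with `host_reprovisioning_requested`; that symmetry is inferred from the description, not from the diff.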

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@vinodchitraliNVIDIA vinodchitraliNVIDIA requested a review from a team as a code owner May 22, 2026 07:07
@copy-pr-bot

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vinodchitraliNVIDIA vinodchitraliNVIDIA force-pushed the vc/rack_ready branch 2 times, most recently from 306cf5b to 3b90e01 Compare May 22, 2026 08:32
Contributor

@kensimon kensimon left a comment

The pool connection change is an easy one to make; that's what the "request changes" is about. The question of rack-triggered reprovisioning is just for discussion.

Comment thread crates/api/src/state_controller/machine/handler.rs
Comment thread crates/api/src/state_controller/switch/reprovisioning.rs Outdated
kensimon added a commit to kensimon/infra-controller-core that referenced this pull request May 22, 2026
A PoolConnection may or may not correspond to an active connection to the
database, but even if it doesn't, it consumes sqlx's pool limits.

A change in NVIDIA#1884 should probably have fired this lint, so add support
to detect it.
@Matthias247
Contributor

I'm not sure I understand the sequencing here:

  • Wouldn't the Rack only enter error state after all maintenance activities are done, including the switch reprovisioning? In this case we should never enter the state.
  • Or is this for, e.g., a multi-step rack maintenance activity where, as a first step, we perform FW updates, that step fails and moves the rack to error, and then the switch would never apply?
  • If it's about the latter, we could also avoid the problem by only setting the switch_reprovisioning_requested on the switches after the first step is done?

@vinodchitraliNVIDIA
Contributor Author

I'm not sure I understand the sequencing here:

  • Wouldn't the Rack only enter error state after all maintenance activities are done, including the switch reprovisioning? In this case we should never enter the state.
  • Or is this for, e.g., a multi-step rack maintenance activity where, as a first step, we perform FW updates, that step fails and moves the rack to error, and then the switch would never apply?
  • If it's about the latter, we could also avoid the problem by only setting the switch_reprovisioning_requested on the switches after the first step is done?

It's a multi-step activity, say 1 -> 2 -> 3 -> 4.

If step 2 fails because of the machines, the rack moves to Failed. But the switches may already have finished their part and moved on to step 3, where they wait for an update from the rack state controller, which is in the Failed state. The result is a switch stuck in that state.

@vinodchitraliNVIDIA
Contributor Author

I'm not sure I understand the sequencing here:

  • Wouldn't the Rack only enter error state after all maintenance activities are done, including the switch reprovisioning? In this case we should never enter the state.
  • Or is this for, e.g., a multi-step rack maintenance activity where, as a first step, we perform FW updates, that step fails and moves the rack to error, and then the switch would never apply?
  • If it's about the latter, we could also avoid the problem by only setting the switch_reprovisioning_requested on the switches after the first step is done?

It's a multi-step activity, say 1 -> 2 -> 3 -> 4.

If step 2 fails because of the machines, the rack moves to Failed. But the switches may already have finished their part and moved on to step 3, where they wait for an update from the rack state controller, which is in the Failed state. The result is a switch stuck in that state.

This change makes sure the switch aborts the rest of its activity (steps 3 and 4) and moves to Ready.
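
The scenario above can be sketched as a toy model. Everything here is hypothetical and only meant to illustrate the sequencing: `RackProgress`, `SwitchProgress`, `tick`, `run`, and the rule that a switch advances past step N only once the rack has reached step N are assumptions, not the controllers' actual logic.

```rust
/// Number of steps in the hypothetical 1 -> 2 -> 3 -> 4 maintenance flow.
const LAST_STEP: u8 = 4;

/// Toy model of the rack: either it has reached some step, or it has failed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RackProgress {
    AtStep(u8),
    Failed,
}

/// Toy model of the switch: waiting for the rack at a step, or back in Ready.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SwitchProgress {
    WaitingAt(u8),
    Ready,
}

/// One controller iteration for the switch. `abort_on_failed_rack` toggles
/// the behaviour this change introduces.
fn tick(switch: SwitchProgress, rack: RackProgress, abort_on_failed_rack: bool) -> SwitchProgress {
    match (switch, rack) {
        // With the change: a failed rack aborts the remaining steps.
        (SwitchProgress::WaitingAt(_), RackProgress::Failed) if abort_on_failed_rack => {
            SwitchProgress::Ready
        }
        // The rack has caught up, so the switch can move on.
        (SwitchProgress::WaitingAt(step), RackProgress::AtStep(rack_step)) if rack_step >= step => {
            if step >= LAST_STEP {
                SwitchProgress::Ready
            } else {
                SwitchProgress::WaitingAt(step + 1)
            }
        }
        // Otherwise nothing changes: the switch keeps waiting (or stays Ready).
        (unchanged, _) => unchanged,
    }
}

/// Runs `ticks` iterations against a rack that no longer changes.
fn run(start: SwitchProgress, rack: RackProgress, abort_on_failed_rack: bool, ticks: u32) -> SwitchProgress {
    (0..ticks).fold(start, |switch, _| tick(switch, rack, abort_on_failed_rack))
}
```

In this model, a switch waiting at step 3 against a failed rack stays at step 3 for any number of iterations when `abort_on_failed_rack` is `false`, and returns to `Ready` on the first iteration when it is `true`.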

@amit-pabalkar

Approving

vinodchitraliNVIDIA and others added 3 commits May 23, 2026 13:05
…Error

When the rack maintenance flow bails out into RackState::Error, switches in
SwitchControllerState::ReProvisioning and machines in HostReprovision stayed
stuck, waiting for sub-states (firmware upgrade, NVOS update, NMX-C configure)
that the rack would never advance.

Add an early-exit at the start of each reprovisioning handler: look up the
parent rack and, if it is in Error, clear the rack-driven reprovisioning
request bit (switch_reprovisioning_requested / host_reprovisioning_requested)
and transition the object back to its pre-reprovisioning Ready state.

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
Co-authored-by: Ken Simon <ken@kensimon.io>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Co-authored-by: Ken Simon <ken@kensimon.io>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
5 participants