fix(api): abort switch/machine reprovisioning when parent rack is in … by vinodchitraliNVIDIA · Pull Request #1884 · NVIDIA/infra-controller

vinodchitraliNVIDIA · 2026-05-22T07:07:48Z

…Error

When the rack maintenance flow bails out into RackState::Error, switches in SwitchControllerState::ReProvisioning and machines in HostReprovision stayed stuck, waiting for sub-states (firmware upgrade, NVOS update, NMX-C configure) that the rack would never advance.

Add an early-exit at the start of each reprovisioning handler: look up the parent rack and, if it is in Error, clear the rack-driven reprovisioning request bit (switch_reprovisioning_requested / host_reprovisioning_requested) and transition the object back to its pre-reprovisioning Ready state.
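A minimal sketch of the early exit described above, for the switch side only. This is not the code from the PR: the types are simplified stand-ins, and `RackState::Maintenance`, `Outcome`, and `handle_reprovisioning` are hypothetical names introduced for illustration. Only `RackState::Error`, `SwitchControllerState::ReProvisioning`, `Ready`, and `switch_reprovisioning_requested` come from the description.

```rust
// Simplified, hypothetical types; the real controller state is richer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RackState {
    Ready,
    Maintenance, // assumed name for "maintenance in progress"
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SwitchControllerState {
    Ready,
    ReProvisioning,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Switch {
    pub state: SwitchControllerState,
    /// Set by the rack maintenance flow to request reprovisioning.
    pub switch_reprovisioning_requested: bool,
}

/// Outcome of one handler iteration (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Parent rack is in Error, so reprovisioning was abandoned.
    Aborted,
    /// Keep waiting on the current reprovisioning sub-state.
    Wait,
    /// The switch is not reprovisioning; nothing to do.
    Idle,
}

pub fn handle_reprovisioning(switch: &mut Switch, parent_rack: Option<RackState>) -> Outcome {
    if switch.state != SwitchControllerState::ReProvisioning {
        return Outcome::Idle;
    }
    // Early exit: the rack will never advance the sub-states once it is in
    // Error, so clear the request bit and go back to Ready.
    if parent_rack == Some(RackState::Error) {
        switch.switch_reprovisioning_requested = false;
        switch.state = SwitchControllerState::Ready;
        return Outcome::Aborted;
    }
    // Otherwise continue with the normal sub-state handling (elided here).
    Outcome::Wait
}
```

The machine handler would presumably follow the same shape with `host_reprovisioning_requested`; that part is not sketched here.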

Description

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

copy-pr-bot · 2026-05-22T07:07:52Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kensimon

The pool connection change is an easy one to make; that's what the "request changes" is about. The question of rack-triggered reprovisioning is just for discussion.

A PoolConnection may or may not correspond to an active connection to the database, but even if it doesn't, it consumes sqlx's pool limits. A change in NVIDIA#1884 should probably have fired this lint, so add support to detect it.
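To illustrate why a held connection matters, here is a small standard-library model of a pool with a fixed limit. It is a stand-in, not sqlx's API: `Pool`, `Guard`, `try_acquire`, and `lookup_then_release` are hypothetical names. The point it tries to show is that a checked-out handle occupies a slot until it is dropped, so scoping it tightly frees the slot before any slow work (in async code, before an `.await`).

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Stand-in for a connection pool with a fixed limit (not sqlx's API).
pub struct Pool {
    limit: usize,
    in_use: Rc<Cell<usize>>,
}

/// Stand-in for a checked-out connection; releases its slot on drop.
pub struct Guard {
    in_use: Rc<Cell<usize>>,
}

impl Pool {
    pub fn new(limit: usize) -> Self {
        Pool { limit, in_use: Rc::new(Cell::new(0)) }
    }

    /// Returns `None` when every slot is already taken.
    pub fn try_acquire(&self) -> Option<Guard> {
        if self.in_use.get() >= self.limit {
            return None;
        }
        self.in_use.set(self.in_use.get() + 1);
        Some(Guard { in_use: Rc::clone(&self.in_use) })
    }

    pub fn in_use(&self) -> usize {
        self.in_use.get()
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        self.in_use.set(self.in_use.get() - 1);
    }
}

/// Reads a value inside a short scope so the slot is released before any
/// later slow work.
pub fn lookup_then_release(pool: &Pool) -> Option<u32> {
    let value = {
        let _conn = pool.try_acquire()?;
        42 // pretend this came from a query
    }; // `_conn` is dropped here, freeing the slot
    // Slow work would go here without holding a pool slot.
    Some(value)
}
```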

Matthias247 · 2026-05-22T18:13:36Z

I'm not sure I understand the sequencing here:

Wouldn't the Rack only enter the error state after all maintenance activities are done, including the switch reprovisioning? In that case we should never enter this state.
Or is this for, e.g., a multi-step rack maintenance activity, where as a first step we perform FW updates, that step fails and moves the rack to error, and then the switch would never apply?
If it's about the latter, we could also avoid the problem by only setting switch_reprovisioning_requested on the switches after the first step is done?

vinodchitraliNVIDIA · 2026-05-22T20:02:13Z

I'm not sure I understand the sequencing here:

Wouldn't the Rack only enter the error state after all maintenance activities are done, including the switch reprovisioning? In that case we should never enter this state.

Or is this for, e.g., a multi-step rack maintenance activity, where as a first step we perform FW updates, that step fails and moves the rack to error, and then the switch would never apply?

If it's about the latter, we could also avoid the problem by only setting switch_reprovisioning_requested on the switches after the first step is done?

It's a multi-step activity, say steps 1 -> 2 -> 3 -> 4.

If step 2 fails because of the machines, the rack moves to Failed. But the switches may already have finished their part and moved on to step 3, where they wait for an update from the rack state controller, which is in the Failed state. The result is that the switches stay stuck.

vinodchitraliNVIDIA · 2026-05-22T20:03:38Z

The change makes sure the switch aborts the rest of its activity (steps 3 and 4) and moves to Ready.
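The sequencing in this reply can be sketched as a tiny simulation. Everything here is hypothetical (`Rack`, `SwitchStatus`, `tick`, `run`, and the four-step numbering follow the comment's informal wording, including "Failed" where the PR description says RackState::Error); it only models the claim that a switch waiting on a failed rack never progresses unless it aborts on its own.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rack {
    InProgress,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SwitchStatus {
    /// Waiting at the given step for the rack controller to advance it.
    WaitingAt(u8),
    Ready,
}

/// One controller tick for a switch. `abort_on_failed_rack` toggles the
/// early exit proposed in this PR.
pub fn tick(switch: SwitchStatus, rack: Rack, abort_on_failed_rack: bool) -> SwitchStatus {
    match (switch, rack) {
        // With the early exit, a failed rack sends the switch back to Ready.
        (SwitchStatus::WaitingAt(_), Rack::Failed) if abort_on_failed_rack => SwitchStatus::Ready,
        // Without it, a failed rack never advances the switch, so it keeps waiting.
        (SwitchStatus::WaitingAt(step), Rack::Failed) => SwitchStatus::WaitingAt(step),
        // A healthy rack advances the switch one step per tick, up to step 4.
        (SwitchStatus::WaitingAt(step), Rack::InProgress) if step < 4 => {
            SwitchStatus::WaitingAt(step + 1)
        }
        (SwitchStatus::WaitingAt(_), Rack::InProgress) => SwitchStatus::Ready,
        (SwitchStatus::Ready, _) => SwitchStatus::Ready,
    }
}

/// Runs `ticks` controller iterations and returns the final switch status.
pub fn run(mut switch: SwitchStatus, rack: Rack, abort: bool, ticks: usize) -> SwitchStatus {
    for _ in 0..ticks {
        switch = tick(switch, rack, abort);
    }
    switch
}
```

Calling `run(SwitchStatus::WaitingAt(3), Rack::Failed, false, 100)` leaves the switch at step 3 no matter how many ticks pass, while the same call with `abort` set to `true` reaches `Ready` on the first tick.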

amit-pabalkar · 2026-05-22T20:20:29Z

Approving

…Error

When the rack maintenance flow bails out into RackState::Error, switches in SwitchControllerState::ReProvisioning and machines in HostReprovision stayed stuck, waiting for sub-states (firmware upgrade, NVOS update, NMX-C configure) that the rack would never advance.

Add an early-exit at the start of each reprovisioning handler: look up the parent rack and, if it is in Error, clear the rack-driven reprovisioning request bit (switch_reprovisioning_requested / host_reprovisioning_requested) and transition the object back to its pre-reprovisioning Ready state.

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>

vinodchitraliNVIDIA requested a review from a team as a code owner May 22, 2026 07:07

vinodchitraliNVIDIA force-pushed the vc/rack_ready branch 2 times, most recently from 306cf5b to 3b90e01 Compare May 22, 2026 08:32

kensimon requested changes May 22, 2026

Comment thread crates/api/src/state_controller/machine/handler.rs

Comment thread crates/api/src/state_controller/switch/reprovisioning.rs Outdated

kensimon mentioned this pull request May 22, 2026

txn_held_across_await lint: Fire on holding a PoolConnection #1893

Open

10 tasks

vinodchitraliNVIDIA force-pushed the vc/rack_ready branch from 2c5a81b to 83a06ba Compare May 22, 2026 20:12

vinodchitraliNVIDIA enabled auto-merge (squash) May 22, 2026 20:14

vinodchitraliNVIDIA requested a review from kensimon May 22, 2026 20:16

kensimon approved these changes May 22, 2026

vinodchitraliNVIDIA force-pushed the vc/rack_ready branch from 83a06ba to fd14b6a Compare May 23, 2026 06:18

anunna0 approved these changes May 23, 2026

vinodchitraliNVIDIA and others added 3 commits May 23, 2026 13:05

Apply suggestion from @kensimon

7a215ae

Co-authored-by: Ken Simon <ken@kensimon.io> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>

Apply suggestion from @kensimon

76897d4

Co-authored-by: Ken Simon <ken@kensimon.io> Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com> Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>

vinodchitraliNVIDIA force-pushed the vc/rack_ready branch from fd14b6a to 76897d4 Compare May 23, 2026 07:36

vinodchitraliNVIDIA disabled auto-merge May 23, 2026 07:36

vinodchitraliNVIDIA wants to merge 3 commits into NVIDIA:main from vinodchitraliNVIDIA:vc/rack_ready

5 participants
