fix(api): abort switch/machine reprovisioning when parent rack is in …#1884
vinodchitraliNVIDIA wants to merge 3 commits into
Conversation
306cf5b to 3b90e01
kensimon
left a comment
The pool connection change is an easy one to make; that's what the "request changes" is about. The question of rack-triggered reprovisioning is just for discussion.
A PoolConnection may or may not correspond to an active connection to the database, but even if it doesn't, it consumes sqlx's pool limits. A change in NVIDIA#1884 should probably have fired this lint, so add support to detect it.
I'm not sure I understand the sequencing here:
It's a multi-step activity. Suppose the steps run 1 -> 2 -> 3 -> 4. If step 2 fails because of the machines, the rack moves to Failed. But if the switches have finished their part and moved on to step 3, they then wait for an update from the rack state controller, which is in the Failed state. The result is a stuck state for the switch.
The change makes sure the switch aborts the rest of its activity (steps 3 and 4) and moves to
2c5a81b to 83a06ba
Approving
83a06ba to fd14b6a
…Error

When the rack maintenance flow bails out into RackState::Error, switches in SwitchControllerState::ReProvisioning and machines in HostReprovision were staying stuck waiting for sub-states (firmware upgrade, NVOS update, NMX-C configure) the rack would never advance.

Add an early-exit at the start of each reprovisioning handler: look up the parent rack and, if it is in Error, clear the rack-driven reprovisioning request bit (switch_reprovisioning_requested / host_reprovisioning_requested) and transition the object back to its pre-reprovisioning Ready state.

Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
Co-authored-by: Ken Simon <ken@kensimon.io>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>

Co-authored-by: Ken Simon <ken@kensimon.io>
Signed-off-by: Vinod Chitrali <51107486+vinodchitraliNVIDIA@users.noreply.github.com>
Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
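
For readers skimming the commits, the early exit described in the first commit message might look roughly like the sketch below. This is not the code from the PR: the types `RackState`, `SwitchState`, `Switch`, and `Outcome`, the `Maintenance` variant, and the function `handle_reprovisioning` are hypothetical stand-ins, and only the `switch_reprovisioning_requested` flag name and the Error/Ready/ReProvisioning state names come from the commit text. It covers the switch case only, and it passes the parent rack state in as an argument instead of looking it up in the database.

```rust
// Hypothetical, simplified model of the rack and switch states.
// The real controller types in the repository are not shown on this page.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RackState {
    Maintenance, // assumed name for "rack is still driving the flow"
    Error,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SwitchState {
    Ready,
    ReProvisioning,
}

#[derive(Debug)]
struct Switch {
    state: SwitchState,
    // Rack-driven request bit named in the commit message.
    switch_reprovisioning_requested: bool,
}

// What a single handler pass decided to do (hypothetical).
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Aborted,
    Continue,
}

// Early exit at the start of the reprovisioning handler: if the parent
// rack is in Error, clear the request bit and return the switch to Ready.
// Otherwise leave the switch untouched so the normal sub-states can run.
fn handle_reprovisioning(switch: &mut Switch, parent_rack: Option<RackState>) -> Outcome {
    if parent_rack == Some(RackState::Error) {
        switch.switch_reprovisioning_requested = false;
        switch.state = SwitchState::Ready;
        return Outcome::Aborted;
    }
    Outcome::Continue
}
```

In this sketch a switch with no known parent rack (`None`) simply continues; whether the real handlers treat a missing rack that way is an assumption.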
fd14b6a to 76897d4