@riceriley59 riceriley59 commented Aug 15, 2025

Implement New Interrupt Package Ordering

Summary

Changes the interrupt package ordering so that nodes are cordoned, waited on, and drained before packages are applied, rather than after. This matches the user expectation that an interrupting package clears the node first, and it lets operations that require exclusive system access run safely.

Problem Statement

Previously, the interrupt package order was: apply → config → cordon → wait → drain → interrupt

This ordering was problematic because:

  • It didn't match user expectations that "if interrupting, then you drain/cordon before doing anything"
  • Critical operations requiring exclusive system access (like kernel module unloading) couldn't be performed safely while workloads were still running
  • Use cases like GPU mode switching from graphics to display mode were not properly supported

Solution

Code Changes: Modified the interrupt flow in the Skyhook controller to implement the new ordering
Documentation Changes: Updated all documentation to reflect the new interrupt package order

New Order: cordon → wait → drain → apply → config → interrupt

This ensures:

  1. Workloads are safely evacuated before any package operations begin
  2. Package operations have exclusive access to system resources when needed
  3. Critical use cases work correctly (kernel module unloading, GPU mode switching, driver updates)
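The difference between the two orderings can be checked mechanically. The following is a minimal sketch, not the operator's code: the stage names are plain strings copied from this description, and `indexOf` and `evacuatesBeforeApply` are hypothetical helpers introduced only for illustration.

```go
package main

import "fmt"

// Stage names mirror the labels used in this description; they are
// illustrative strings, not the operator's real types.
var (
	oldOrder = []string{"apply", "config", "cordon", "wait", "drain", "interrupt"}
	newOrder = []string{"cordon", "wait", "drain", "apply", "config", "interrupt"}
)

// indexOf returns the position of stage in order, or -1 if it is absent.
func indexOf(order []string, stage string) int {
	for i, s := range order {
		if s == stage {
			return i
		}
	}
	return -1
}

// evacuatesBeforeApply reports whether cordon, wait, and drain all
// appear before apply in the given order.
func evacuatesBeforeApply(order []string) bool {
	apply := indexOf(order, "apply")
	if apply < 0 {
		return false
	}
	for _, s := range []string{"cordon", "wait", "drain"} {
		i := indexOf(order, s)
		if i < 0 || i > apply {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("old order evacuates first:", evacuatesBeforeApply(oldOrder))
	fmt.Println("new order evacuates first:", evacuatesBeforeApply(newOrder))
}
```

Only the new order passes the check, which is the property the three guarantees above rely on.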

Use Case Example

GPU Mode Switching Scenario:

  1. Cordon/Wait/Drain: Safely remove all workloads from the node
  2. Apply: Unload the graphics kernel module (now safe with no competing workloads)
  3. Config: Configure the GPU for display mode
  4. Interrupt: Reboot to complete the mode change
  5. Post-Interrupt: Verify the new driver is loaded correctly

Changes Made

Code Changes

  • operator/internal/controller/skyhook_controller.go: Modified interrupt flow logic to cordon and drain nodes before package application
  • Updated the ProcessInterrupt() and EnsureNodeIsReadyForInterrupt() functions to implement the new ordering
  • Maintained safety checks and error handling throughout the new flow

Documentation Updates

  • docs/operator-status-definitions.md: Updated stage flow diagrams to show correct interrupt ordering
  • README.md: Clarified stage ordering differences between interrupt and non-interrupt packages
  • docs/interrupt_flow.md:
    • Created comprehensive documentation explaining the interrupt flow and rationale
    • Added detailed sequence breakdown including uninstall scenarios
    • Documented why this ordering is critical for safety and system stability
    • Added best practices including Grafana dashboard monitoring
  • docs/README.md: Added reference to new interrupt flow documentation

Key Ordering Changes

  • Old: apply → config → cordon → wait → drain → interrupt
  • New: uninstall (if needed) → cordon → wait → drain → apply/upgrade → config → interrupt

@riceriley59 riceriley59 force-pushed the change-interrupts-order branch from 708b1fa to 32cf141 August 15, 2025 20:26
@riceriley59 riceriley59 merged commit fb0126b into main Aug 15, 2025
14 of 16 checks passed
@riceriley59 riceriley59 deleted the change-interrupts-order branch August 15, 2025 22:59