2 changes: 1 addition & 1 deletion .vscode/launch.json
@@ -11,7 +11,7 @@
"mode": "debug",
"program": "${workspaceRoot}/operator/cmd/main.go",
"cwd": "${workspaceRoot}/operator",
"buildFlags": "--ldflags '-X github.com/NVIDIA/skyhook/internal/version.GIT_SHA=foobars -X github.com/NVIDIA/skyhook/internal/version.VERSION=v0.5.0'",
"buildFlags": "--ldflags '-X github.com/NVIDIA/skyhook/operator/internal/version.GIT_SHA=foobars -X github.com/NVIDIA/skyhook/operator/internal/version.VERSION=v0.5.0'",
"env": {
"ENABLE_WEBHOOKS": "false",
"LOG_ENCODER": "console",
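
The corrected `buildFlags` matter because the linker's `-X importpath.name=value` flag only sets a package-level string variable when the import path matches the package exactly; a mismatched path typically has no effect and raises no error. The sketch below is a minimal stand-alone illustration, assuming the variables live in `main` rather than in the repository's `version` package. The default values and the `Describe` helper are hypothetical.

```go
package main

import "fmt"

// GIT_SHA and VERSION stand in for the variables of the same name in the
// operator's version package. The defaults below are assumptions for this
// sketch; they are what you see when no -X flag overrides them.
var (
	GIT_SHA = "unknown"
	VERSION = "dev"
)

// Describe is a hypothetical helper that formats the build metadata.
func Describe() string {
	return fmt.Sprintf("%s (%s)", VERSION, GIT_SHA)
}

func main() {
	fmt.Println(Describe())
}
```

Building this sketch with `go build -ldflags "-X main.GIT_SHA=foobars -X main.VERSION=v0.5.0"` replaces both defaults. Pointing `-X` at a path where the variable does not exist leaves the defaults in place, which is the symptom the path fix in this hunk addresses.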
10 changes: 10 additions & 0 deletions README.md
@@ -186,9 +186,19 @@ The operator will apply steps in a package throughout different lifecycle stages

The stages are applied in this order:

**Without Interrupts:**
- Uninstall -> Apply -> Config (No Upgrade)
- Upgrade -> Config (With Upgrade)

**With Interrupts:**
For packages that require interrupts, the node is first cordoned and drained to ensure workloads are safely evacuated before package operations begin:
- Uninstall -> Apply -> Config -> Interrupt -> Post-Interrupt (No Upgrade)
- Upgrade -> Config -> Interrupt -> Post-Interrupt (With Upgrade)

This ensures that when operations like kernel module unloading or system reboots are required, they happen after workloads have been safely removed and any necessary pre-interrupt package operations have completed.

**NOTE**: If a package is removed from the SCR, only the uninstall stage is run for that package.

**Semantic versioning is strictly enforced in the operator** to support upgrade and uninstall. Semantic versioning lets the
operator tell whether a package is being upgraded or downgraded, and it also enforces good versioning practices.
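
As a rough illustration of how a version comparison yields a direction, here is a minimal Go sketch. It is not the operator's implementation: `parseVersion` and `Direction` are hypothetical names, and the parser only accepts plain `MAJOR.MINOR.PATCH` strings (pre-release and build metadata are out of scope).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "MAJOR.MINOR.PATCH" (with an optional leading "v")
// into three non-negative integers. Hypothetical helper for this sketch.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q is not MAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("version %q has invalid component %q", v, p)
		}
		out[i] = n
	}
	return out, nil
}

// Direction reports "upgrade", "downgrade", or "unchanged" when moving
// from the current version to the desired one.
func Direction(current, desired string) (string, error) {
	c, err := parseVersion(current)
	if err != nil {
		return "", err
	}
	d, err := parseVersion(desired)
	if err != nil {
		return "", err
	}
	// Compare major, then minor, then patch; the first difference decides.
	for i := range c {
		switch {
		case d[i] > c[i]:
			return "upgrade", nil
		case d[i] < c[i]:
			return "downgrade", nil
		}
	}
	return "unchanged", nil
}

func main() {
	for _, pair := range [][2]string{{"1.2.3", "1.3.0"}, {"3.2.3", "3.2.1"}, {"1.0.0", "1.0.0"}} {
		dir, err := Direction(pair[0], pair[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s: %s\n", pair[0], pair[1], dir)
	}
}
```

Rejecting anything that does not parse is what makes the direction unambiguous: with a free-form version string there would be no reliable way to choose between the upgrade and uninstall stages.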

3 changes: 3 additions & 0 deletions docs/README.md
@@ -14,6 +14,9 @@ This directory contains user and operator documentation for Skyhook. Here you'll
- [Runtime Required](runtime_required.md):
How to use the runtime required taint and feature in Skyhook.

- [Interrupt Flow and Ordering](interrupt_flow.md):
Detailed explanation of how Skyhook handles packages with interrupts, including the interrupt sequence.

- [Strict Ordering](ordering_of_skyhooks.md): How and why the operator applies each Skyhook Custom Resource in a deterministic sequential order.

- **Resources**
87 changes: 87 additions & 0 deletions docs/interrupt_flow.md
@@ -0,0 +1,87 @@
# Interrupt Flow and Ordering

This document explains how Skyhook handles packages that require interrupts and the specific ordering of operations to ensure safe and reliable execution.

## Overview

When a package requires an interrupt (such as a reboot or service restart), Skyhook follows a specific sequence to ensure that workloads are safely evacuated from the node before any potentially disruptive operations occur.

## Interrupt Flow Sequence

### For packages WITH interrupts:

1. **Uninstall** (if downgrading) - Package uninstallation operations are executed
2. **Cordon** - Node is marked as unschedulable to prevent new workloads from being scheduled
3. **Wait** - System waits for any conflicting workloads to naturally complete or be rescheduled
4. **Drain** - Remaining workloads are gracefully evicted from the node
5. **Apply** / **Upgrade** (if upgrading) - Package installation/upgrade operations are executed
6. **Config** - Configuration and setup operations are performed
7. **Interrupt** - The actual interrupt operation (reboot, service restart, etc.) is executed
8. **Post-Interrupt** - Any cleanup or verification operations after the interrupt

### For packages WITHOUT interrupts:

1. **Uninstall** (if downgrading) - Package uninstallation operations are executed
2. **Apply** / **Upgrade** (if upgrading) - Package installation/upgrade operations are executed
3. **Config** - Configuration and setup operations are performed

## Why This Order Matters

The **uninstall → cordon → wait → drain → apply/upgrade → config → interrupt** sequence is critical for several reasons:

### Safety First
- Workloads are safely removed before any potentially disruptive operations
- Prevents data loss or service interruption for running applications
- Ensures the node is in a clean state before package operations begin

### Use Cases
This ordering is particularly important for scenarios such as:

- **Kernel module changes**: Unloading kernel modules while workloads are present could cause system instability
- **GPU mode switching**: Changing a GPU from graphics to compute mode requires exclusive access
- **Driver updates**: Hardware driver changes need exclusive access to the hardware
- **System reboots**: Require all workloads to be evacuated first

### Example Scenario

Consider a package that needs to unload a kernel module, perform some operations, and then reboot:

```yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: gpu-mode-switch
spec:
  packages:
    gpu-driver:
      version: "1.0.0"
      image: "example/gpu-driver"
      interrupt:
        type: "reboot"
```

**Flow:**
1. **Cordon**: Node becomes unschedulable
2. **Wait**: Workloads that should block the interrupt (for example, those matched by `podNonInterruptLabels`) are given time to complete
3. **Drain**: Remaining workloads are evicted
4. **Apply**: GPU driver package operations run (unload old module, install new)
5. **Config**: Configuration files are updated
6. **Interrupt**: System reboots to complete the driver change
7. **Post-Interrupt**: Verification that the new driver is loaded correctly

## Technical Implementation

The interrupt flow is managed by the `ProcessInterrupt` and `EnsureNodeIsReadyForInterrupt` functions in the Skyhook controller, which:

- Check for conflicting workloads using label selectors
- Coordinate the cordon and drain operations
- Ensure the node is ready before proceeding with package operations
- Handle the timing and sequencing of all stages
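
To make the "conflicting workloads" check concrete, here is a minimal Go sketch of equality-based label matching. It is an assumption-laden illustration rather than the controller's code: `Pod`, `matches`, and `ConflictingPods` are hypothetical names, the real controller works with Kubernetes API types, and the example labels are invented. In this sketch an empty selector matches nothing.

```go
package main

import "fmt"

// Pod is a pared-down stand-in for a Kubernetes pod: a name and its labels.
type Pod struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every key/value pair in selector is present in labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// ConflictingPods returns the names of pods that match the selector and
// should therefore block the interrupt until they finish.
func ConflictingPods(pods []Pod, selector map[string]string) []string {
	var names []string
	if len(selector) == 0 {
		// Assumption for this sketch: no selector means nothing is protected.
		return names
	}
	for _, p := range pods {
		if matches(selector, p.Labels) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "trainer-0", Labels: map[string]string{"app": "training"}},
		{Name: "web-0", Labels: map[string]string{"app": "web"}},
	}
	blocking := ConflictingPods(pods, map[string]string{"app": "training"})
	if len(blocking) > 0 {
		fmt.Println("waiting on:", blocking)
	} else {
		fmt.Println("node is ready to drain")
	}
}
```

Keeping the check separate from the drain step mirrors the documented order: the node is cordoned, matching workloads are waited on, and only then are the remaining pods evicted.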

## Best Practices

- Always test interrupt-enabled packages in non-production environments first
- Use appropriate `podNonInterruptLabels` selectors to identify important workloads that should block interrupts
- Consider the impact of node cordoning on cluster capacity
- Monitor package logs during interrupt operations for troubleshooting
- Use Grafana dashboards to monitor interrupt operations and track package state transitions across your cluster (see [docs/metrics/](metrics/) for dashboard setup and configuration)
9 changes: 6 additions & 3 deletions docs/operator-status-definitions.md
@@ -75,7 +75,10 @@ upgrade → config
```

### With Interrupts:
When a package requires an interrupt, the node is first cordoned and drained before package operations begin:
```
uninstall → apply → config → interrupt → post-interrupt
upgrade → config → interrupt → post-interrupt
```
uninstall (if downgrading) → cordon → wait → drain → apply → config → interrupt → post-interrupt
cordon → wait → drain → upgrade (if upgrading) → config → interrupt → post-interrupt
```

**Note**: The cordon, wait, and drain phases ensure that workloads are safely removed from the node before any package operations that require interrupts (such as reboots or kernel module changes) are executed.
34 changes: 19 additions & 15 deletions k8s-tests/chainsaw/skyhook/config-skyhook/assert.yaml
@@ -21,7 +21,7 @@ metadata:
skyhook.nvidia.com/test-node: skyhooke2e
skyhook.nvidia.com/status_config-skyhook: in_progress
annotations:
("skyhook.nvidia.com/nodeState_config-skyhook" && parse_json("skyhook.nvidia.com/nodeState_config-skyhook")):
("skyhook.nvidia.com/nodeState_config-skyhook" && parse_json("skyhook.nvidia.com/nodeState_config-skyhook")):
{
"baxter|3.2.1": {
"name": "baxter",
@@ -30,22 +30,26 @@ metadata:
"stage": "apply",
"state": "in_progress"
},
"dexter|1.2.3": {
"name": "dexter",
"version": "1.2.3",
"spencer|3.2.3": {
"name": "spencer",
"version": "3.2.3",
"image": "ghcr.io/nvidia/skyhook/agentless",
"stage": "apply",
"state": "in_progress"
},
"spencer|3.2.3": {
"name": "spencer",
"version": "3.2.3",
"dexter|1.2.3": {
"name": "dexter",
"version": "1.2.3",
"image": "ghcr.io/nvidia/skyhook/agentless",
"stage": "apply",
"state": "in_progress"
}
},
}
skyhook.nvidia.com/status_config-skyhook: in_progress
spec:
taints:
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
status:
(conditions[?type == 'skyhook.nvidia.com/config-skyhook/NotReady']):
- reason: "Incomplete"
@@ -62,13 +66,7 @@ status:
status: in_progress
nodeState:
(values(@)):
- dexter|1.2.3:
name: dexter
state: in_progress
version: '1.2.3'
stage: apply
image: ghcr.io/nvidia/skyhook/agentless
baxter|3.2.1:
- baxter|3.2.1:
name: baxter
state: in_progress
version: '3.2.1'
@@ -80,6 +78,12 @@ status:
version: '3.2.3'
stage: apply
image: ghcr.io/nvidia/skyhook/agentless
dexter|1.2.3:
name: dexter
state: in_progress
version: '1.2.3'
stage: apply
image: ghcr.io/nvidia/skyhook/agentless
nodeStatus:
# grab values should be one and is complete
(values(@)):
22 changes: 21 additions & 1 deletion k8s-tests/chainsaw/skyhook/interrupt-grouping/assert.yaml
@@ -13,7 +13,27 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---
kind: Node
apiVersion: v1
metadata:
  labels:
    skyhook.nvidia.com/test-node: skyhooke2e
    skyhook.nvidia.com/status_interrupt-grouping: in_progress
  annotations:
    skyhook.nvidia.com/status_interrupt-grouping: in_progress
spec:
  taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
status:
  (conditions[?type == 'skyhook.nvidia.com/interrupt-grouping/NotReady']):
    - reason: "Incomplete"
      status: "True"
  (conditions[?type == 'skyhook.nvidia.com/interrupt-grouping/Erroring']):
    - reason: "Not Erroring"
      status: "False"
---
kind: Pod
apiVersion: v1
metadata:
84 changes: 84 additions & 0 deletions k8s-tests/chainsaw/skyhook/interrupt/assert-a.yaml
@@ -0,0 +1,84 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---
kind: Pod
apiVersion: v1
metadata:
  namespace: skyhook
  labels:
    skyhook.nvidia.com/name: interrupt
    skyhook.nvidia.com/package: jason-1.3.2
  annotations:
    ("skyhook.nvidia.com/package" && parse_json("skyhook.nvidia.com/package")):
      {
        "name": "jason",
        "version": "1.3.2",
        "skyhook": "interrupt",
        "stage": "apply",
        "image": "ghcr.io/nvidia/skyhook/agentless"
      }
  ownerReferences:
    - apiVersion: skyhook.nvidia.com/v1alpha1
      kind: Skyhook
      name: interrupt
spec:
  initContainers:
    - name: jason-init
      image: ghcr.io/nvidia/skyhook/agentless:1.3.2
    - name: jason-apply
      image: ghcr.io/nvidia/skyhook/agentless:3.2.3
      args:
        ([0]): apply
        ([1]): /root
        (length(@)): 3
    - name: jason-applycheck
      image: ghcr.io/nvidia/skyhook/agentless:3.2.3
      args:
        ([0]): apply-check
        ([1]): /root
        (length(@)): 3
---
apiVersion: v1
kind: Node
metadata:
  labels:
    skyhook.nvidia.com/test-node: skyhooke2e
    skyhook.nvidia.com/status_interrupt: in_progress
  annotations:
    ("skyhook.nvidia.com/nodeState_interrupt" && parse_json("skyhook.nvidia.com/nodeState_interrupt")):
      {
        "jason|1.3.2": {
          "name": "jason",
          "version": "1.3.2",
          "image": "ghcr.io/nvidia/skyhook/agentless",
          "stage": "config",
          "state": "complete"
        }
      }
    skyhook.nvidia.com/status_interrupt: in_progress
spec:
  taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
status:
  (conditions[?type == 'skyhook.nvidia.com/interrupt/NotReady']):
    - reason: "Incomplete"
      status: "True"
  (conditions[?type == 'skyhook.nvidia.com/interrupt/Erroring']):
    - reason: "Not Erroring"
      status: "False"
---