Skyhook is a Kubernetes-aware package manager that lets cluster administrators safely modify and maintain the underlying hosts, declaratively and at scale.
Managing and updating Kubernetes clusters is challenging. While Kubernetes advocates treating compute as disposable, certain scenarios make this difficult:
- Updating hosts without re-imaging:
  - Limited excess hardware/capacity for rolling replacements
  - Long node replacement times (can take hours with some cloud providers)
- OS image management:
  - Maintain a common base image with workload-specific overlays instead of multiple OS images
- Workload sensitivity:
  - Some workloads can't be moved, are difficult to move, or take a long time to migrate
Skyhook functions like a package manager but for your entire Kubernetes cluster, with three main components:
- Skyhook Operator - Manages installing, updating, and removing packages
- Skyhook Custom Resource (SCR) - Declarative definitions of changes to apply
- Packages - The actual modifications you want to implement
Skyhook works in any Kubernetes environment (self-managed, on-prem, cloud) and shines when you need:
- Kubernetes-aware scheduling that protects important workloads
- Rolling or simultaneous updates across your cluster
- Declarative configuration management for host-level changes
- Native Kubernetes integration - Packages are standard Kubernetes resources compatible with GitOps tools like ArgoCD, Helm, and Flux
- Autoscaling support - Ensure newly created nodes are properly configured before they become schedulable
- First-class upgrades - Deploys changes with minimal disruption, waiting for running workloads to complete when needed
- Interruption Budget: a percentage or count of nodes that may be interrupted at once
- Node Selectors: label selectors that choose which nodes to apply to (node labels)
- Pod Non-Interrupt Labels: labels marking pods that must never be interrupted
- Package Interrupt: a service restart (containerd, cron, anything managed by systemd) or a reboot
- Additional Tolerations: tolerations added to the package pods
- Runtime Required: requires the node to join the cluster with a custom taint; the package's work is done before that taint is removed.
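Taken together, these controls live in the Skyhook spec. The fragment below is a minimal sketch of how they could fit into an SCR; the field names here are assumptions based on the feature names above, so check the CRD shipped with the chart for the exact spelling:

```yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: tuning-example
spec:
  # Assumed field names below -- verify against the chart's CRD.
  # Interrupt at most 1 node at a time (could also be a percent).
  interruptionBudget:
    count: 1
  # Only nodes carrying this label are touched.
  nodeSelectors:
    matchLabels:
      node-role.kubernetes.io/worker: "true"
  # Pods with this label are never interrupted.
  podNonInterruptLabels:
    matchLabels:
      critical-workload: "true"
  packages:
    tuning:
      version: 1.0.0
      image: ghcr.io/nvidia/skyhook-packages/shellscript
      # Restart containerd after the package applies.
      interrupt:
        type: service
        services:
          - containerd
```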
There are a few pre-built generalist packages available at NVIDIA/skyhook-packages.
- Install cert-manager:

```shell
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.15.2/cert-manager.yaml
```

- Create a secret for the operator to pull images:

```shell
kubectl create secret generic node-init-secret \
  --from-file=.dockerconfigjson=${HOME}/.config/containers/auth.json \
  --type=kubernetes.io/dockerconfigjson -n skyhook
```

- Install the operator:

```shell
helm install skyhook ./chart --namespace skyhook
```
Here is an example package using the shellscript package. Put this in a file called demo.yaml and apply it with `kubectl apply -f demo.yaml`:
```yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  labels:
    app.kubernetes.io/part-of: skyhook-operator
    app.kubernetes.io/created-by: skyhook-operator
  name: demo
spec:
  nodeSelectors:
    matchLabels:
      skyhook.nvidia.com/test-node: demo
  packages:
    tuning:
      version: 1.1.0
      image: ghcr.io/nvidia/skyhook-packages/shellscript
      configMap:
        apply.sh: |-
          #!/bin/bash
          echo "hello world" > /skyhook-hello-world
          sleep 5
        apply_check.sh: |-
          #!/bin/bash
          cat /skyhook-hello-world
          sleep 5
        config.sh: |-
          #!/bin/bash
          echo "a config is run" >> /skyhook-hello-world
          sleep 5
        config_check.sh: |-
          #!/bin/bash
          grep "config" /skyhook-hello-world
          sleep 5
```
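Note that the demo's nodeSelectors only match nodes labeled `skyhook.nvidia.com/test-node=demo`, so label at least one node before applying (the node name here is an example):

```shell
# Label a node so it matches the demo's nodeSelector (worker-1 is a placeholder).
kubectl label node worker-1 skyhook.nvidia.com/test-node=demo
# Then apply the package definition.
kubectl apply -f demo.yaml
```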
```shell
kubectl get pods -w -n skyhook
```

There will be a pod for each lifecycle stage (apply and config, in this case).
```shell
kubectl describe skyhooks.skyhook.nvidia.com/demo
```

The Status will show the overall package status as well as the status of each node.
Per-node state is also recorded in node annotations:

```shell
kubectl get nodes -o jsonpath='{range .items[?(@.metadata.labels.skyhook\.nvidia\.com/test-node=="demo")]}{.metadata.annotations.skyhook\.nvidia\.com/nodeState_demo}{"\n"}{end}'
```
The operator will apply steps in a package throughout different lifecycle stages. This ensures that the right steps are applied in the right situations and in the correct order.
- Upgrade: This stage runs whenever a package's version is upgraded in the SCR.
- Uninstall: This stage runs whenever a package's version is downgraded or the package is removed from the SCR.
- Apply: This stage always runs at least once.
- Config: This stage runs when a configmap changes, and on the first application of the SCR.
- Interrupt: This stage runs when a package has an interrupt defined, or when a key with a config interrupt defined changes in the package's configmap.
- Post-Interrupt: This stage runs when a package's interrupt has finished.
The stages are applied in this order:
- Uninstall -> Apply -> Config -> Interrupt -> Post-Interrupt (No Upgrade)
- Upgrade -> Config -> Interrupt -> Post-Interrupt (With Upgrade)
Semantic versioning is strictly enforced by the operator in order to support upgrade and uninstall. It lets the operator determine which direction a package is moving while also enforcing good versioning practices.
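As an illustration only (this is not the operator's actual code), semver ordering is what decides the direction. In shell, the comparison can be sketched with GNU `sort -V`:

```shell
#!/bin/bash
# Hypothetical sketch: how semver ordering decides upgrade vs. uninstall.
# direction CURRENT DESIRED -> prints the action the operator would take.
direction() {
  local current="$1" desired="$2"
  if [ "$current" = "$desired" ]; then
    echo "no-op"
  elif [ "$(printf '%s\n%s\n' "$current" "$desired" | sort -V | head -n1)" = "$current" ]; then
    # current sorts first, so desired is newer
    echo "upgrade"
  else
    # desired sorts first, so this is a downgrade
    echo "uninstall-then-apply"
  fi
}

direction 1.1.0 1.2.0   # -> upgrade
direction 1.2.0 1.1.0   # -> uninstall-then-apply
```

`sort -V` orders versions numerically per component, so 1.10.0 correctly sorts after 1.9.0, which a plain lexical sort would get wrong.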
Part of how the operator works is the skyhook-agent. Packages have to be created in a way the operator knows how to use; this is where the agent comes into play (more on that later). A package is a container image that meets these requirements:
- It must contain bash, so the base image needs to be at least something like busybox or alpine.
- Its config must be valid; JSON Schema is used to validate it. The agent has a built-in tool to validate the config, and this tool should be used to test packages before publishing.
- The file system structure must adhere to:

```
/skyhook-package
├── skyhook_dir/{steps}
├── root_dir/{static files}
└── config.json
```
This repository includes an example Kyverno policy that demonstrates how to restrict the images that can be used in Skyhook packages. While this is not a complete policy, it serves as a template that end users can modify to fit their security needs.
The policy prevents the creation of Skyhook resources that contain packages with restricted image patterns. Specifically, it blocks:
- Images containing 'shellscript:' anywhere in the image name
- Images from Docker Hub (matching 'docker.io/*')
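The policy shipped in the repository is the reference. Purely as a hedged sketch, a Kyverno ClusterPolicy enforcing this kind of restriction could be shaped roughly like the following; the rule name, message, and the JMESPath over spec.packages are assumptions, not copied from the repo:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-skyhook-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-docker-hub-images
      match:
        any:
          - resources:
              kinds:
                - Skyhook
      validate:
        message: "Skyhook packages may not use Docker Hub images."
        deny:
          conditions:
            any:
              # Assumption: package images live under spec.packages.<name>.image
              - key: "{{ contains(to_string(request.object.spec.packages), 'docker.io/') }}"
                operator: Equals
                value: true
```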
If you are going to use Kyverno, make sure to turn on creation of the skyhook-viewer-role in the operator's values file (rbac.createSkyhookViewerRole: true) and then bind Kyverno to that role. Example binding:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kyverno-skyhook-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skyhook-viewer-role
subjects:
  - kind: ServiceAccount
    name: kyverno-reports-controller
    namespace: kyverno
```
The operator is a Kubernetes operator that monitors cluster events and coordinates the installation and lifecycle of Skyhook packages.
The agent does the operator's work and runs as a separate container from the package. The agent knows how to read a package (/skyhook-package/config.json) and implements the lifecycle that packages go through.