Rename reimage kubectl subcommand to repave #36
Conversation
Agent-Logs-Url: https://github.com/Azure/unbounded-kube/sessions/be3e52e7-8845-433a-8188-fda746389200 Co-authored-by: jveski <7576912+jveski@users.noreply.github.com>
Thoughts on reset vs repave?
I prefer repave personally since it seems more specific than reset, but I don't have strong opinions here honestly |
I guess the question I have is whether a repave implies an update? In my mind repave does. You're putting a fresh something down but it's fundamentally new vs resetting to the original state (which is generally what reimage implies). |
|
Yeah, it implies an update, i.e. the latest machine spec is used during
@copilot Fix the failing linter |
Agent-Logs-Url: https://github.com/Azure/unbounded-kube/sessions/36fd6f5e-eaeb-48e8-96b2-94a60e93f824 Co-authored-by: jveski <7576912+jveski@users.noreply.github.com>
Fixed in 80f40a8. Two issues:
@copilot fix the smoke tests |
…readiness check Agent-Logs-Url: https://github.com/Azure/unbounded-kube/sessions/b00f698f-0cd8-4677-a867-6f03ead0a2df Co-authored-by: jveski <7576912+jveski@users.noreply.github.com>
Fixed in e2913e9. The smoke test was failing because kindnet enters CrashLoopBackOff on the smoke-node (a pre-existing flaky issue also affecting
@copilot rebase |
…e-to-repave
# Conflicts:
#	api/v1alpha3/machine_types.go
Co-authored-by: jveski <7576912+jveski@users.noreply.github.com>
Follow the API rename from ReimageCounter to RepaveCounter that landed in main via #36. Update Go field references, JSON patch fields in e2e tests, Python variables, CI comments, and documentation.
* Replace gRPC daemon with Machine CR watch-based daemon

  Replace the gRPC task-pull daemon with a watch-based daemon that monitors the Machine CR on the control plane and reconciles the local node to match the desired state.

  - Watch Machine CR for spec drift (version, image) and operation counter drift (reimageCounter, rebootCounter)
  - Perform alternating nspawn machine updates (kube1/kube2) on drift
  - Update Machine CR status: phase, conditions (NodeUpdated), and acknowledge operation counters
  - Authenticate with the bootstrap token from the applied config (not the kubelet kubeconfig, which has nspawn-internal paths)
  - Add RBAC (ClusterRole + ClusterRoleBinding) for system:bootstrappers
  - Add e2e tests: daemon validation, version upgrade via reimageCounter patch, applied config verification
  - Fix double-reconciliation bug: re-GET the Machine CR before drift detection to avoid stale watch events

* Refactor: extract shared phase helpers and consolidate machineRun

  Extract rootfs.Provision, nodestart.StartNode, and nodestop.StopNode as shared composite tasks used by both the initial agent start and the node update flow. This removes duplicated phase orchestration from start.go and nodeupdate.Execute. Move the duplicated machineRun helper into utilexec.MachineRun so all packages (nodestart, nodestop, nodeupdate) share a single implementation. Replace the empty nodestop stubs (StopContainerd, StopKubelet) with a real StopNode task that gracefully stops services before nspawn teardown.

* Refactor agent daemon: consolidate into daemon package and extract phase helpers

  - Move the nodeupdate package to cmd/agent/internal/daemon, renaming Execute to updateNode (unexported). The daemon package now owns the full watch loop, reconciliation, kube client, scheme, drift detection, and node update logic.
  - Slim cmd/daemon.go to just cobra wiring that calls daemon.Run(ctx, log). Delete cmd/scheme.go (moved to the daemon package).
  - Remove the Config field from the NodeStart goal state. PersistAppliedConfig and StartNode now take *provision.AgentConfig as an explicit parameter instead of smuggling it through the goal state struct.
  - Extract WaitForKubelet into phases/nodestart/wait_kubelet.go.
  - Extract PersistAppliedConfig into phases/nodestart/persist_config.go.
  - Add a reset.RemoveAppliedConfig task and a reset.CleanupMachine composite that combines RemoveNSpawnConfig + RemoveMachine + RemoveAppliedConfig. Replace the inline os.Remove in node update with the composite task.

* Simplify e2e test flow to single linear sequence

  Remove the two-case structure (pre-existing CR vs no CR with VM recreation) and replace it with a single linear flow: join, validate the self-registered CR, upgrade, reset, rejoin. This removes the need for VM recreation between test cases.

* Move Machine CR registration from start command to daemon

  The daemon now registers the Machine CR at startup before entering the watch loop, instead of the start command doing it as a separate phase. This ensures registration happens even on rejoin after reset, and keeps all Machine CR interaction in the daemon package. Remove the now-unused nodestart.RegisterMachine phase and its tests. Add tests for the daemon's registerMachine and buildMachineCR functions.

* Move EnableDaemon into task list, add machines.target dep, unexport findActiveMachine

  - Move EnableDaemon from a separate call into the Serial task list so all phases are in one place.
  - Add a machines.target dependency to the daemon systemd unit so it waits for the nspawn machine to be running before starting.
  - Unexport FindActiveMachine since it is only used within the daemon package.
  - Add polling to validate_machine_cr_created in e2e tests since the daemon now registers the Machine CR asynchronously after startup.

* Fix implicit string concatenation flagged by CodeQL

  Use explicit + concatenation for the multi-line shell command string in validate_upgrade to avoid the implicit-concatenation-in-list warning.

* Fix lint: lowercase error string per Go conventions

* Update RBAC and daemon doc to match current architecture

  Add the create verb to the machines resource in agent RBAC (needed for daemon self-registration). Update daemon.md to accurately describe bootstrap token auth, the system:bootstrappers group, Machine CR registration at startup, operation counter drift as the sole reconciliation trigger, and the machines.target systemd dependency.

* Address PR #37 review nits and drop post-rejoin CR check

  - Annotate bare error returns in goalstates/resolve.go (plombardi89)
  - Add a MachineConditionNodeUpdated const to api/v1alpha3 (jveski)
  - Use apimeta.SetStatusCondition instead of a manual condition loop (jveski)
  - Set the NodeUpdated condition alongside the Provisioning phase update (jveski)
  - Drop the validate-machine-cr-created step after rejoin in the e2e workflow

* Decouple watch loop from reconciliation with async worker

  Move reconciliation to a worker goroutine signalled via a buffered channel (capacity 1). The watch loop now performs a non-blocking send on each MODIFIED/ADDED event and immediately returns to draining the watch stream. This prevents backpressure on the API server's HTTP/2 connection when reconciliation takes time (rootfs provisioning ~15s). The worker calls handleMachineEvent, which re-GETs the Machine CR from the API server, so coalesced signals naturally pick up the latest state. Multiple events arriving during an in-flight reconciliation are merged into a single follow-up reconciliation.

* Use client-go workqueue for async reconciliation

  Replace the hand-rolled channel-based worker with client-go's TypedRateLimitingInterface workqueue. This is the standard Kubernetes controller building block and provides deduplication, rate limiting with exponential backoff on failures, and proper shutdown semantics. The watch loop calls queue.Add(machineName) on events; the workqueue deduplicates if the key is already queued or being processed. On reconciliation failure, runWorker calls AddRateLimited for backoff retry; on success it calls Forget to reset the rate limiter.

* Add TODO for bootstrap token credential strategy in buildKubeClient

* Replace global NewKubeClient var with parameter injection

* Consolidate agent daemon RBAC into bootstrapper RBAC

* Fix flaky e2e: add site label to bootstrap token secret

* Stop and remove agent daemon systemd unit during reset

* Unify config dirs under /etc/unbounded/{kube,agent}

* Add SHA-256 sidecar checksum for applied config integrity

  Write a .sha256 companion file alongside the applied config JSON to detect on-disk corruption (e.g. bitflips). Each file is written atomically via renameio; a missing sidecar is treated as a warning (older agent or crash between writes), while a present sidecar with the wrong digest returns ErrChecksumMismatch.

  - Write path: PersistAppliedConfig writes the checksum after the config
  - Read path: findActiveMachine verifies the checksum before trusting the data
  - Reset path: RemoveAppliedConfig cleans up the sidecar file
  - Tests: ComputeChecksum, VerifyChecksum (match/mismatch/missing/error)

* Update daemon doc: consolidate sections, add SVG diagram

  Combine the Drift Detection and Applied Config Integrity sections into a single Applied Config and Drift Detection section that describes the AgentConfig fields, operation counter triggers, and persistence with the integrity guard. Replace the ASCII diagram with an SVG following the existing style. Simplify the systemd and RBAC sections to prose descriptions. Call out bootstrap token auth as temporary.

* Rename reimage to repave in agent daemon, e2e, and docs

  Follow the API rename from ReimageCounter to RepaveCounter that landed in main via #36. Update Go field references, JSON patch fields in e2e tests, Python variables, CI comments, and documentation.

* Consolidate daemonUnit const and remove reboot counter drift check

* Implement in-place update of nspawn machine

  Add alternating (blue/green) nspawn machine update logic and a long-running agent daemon that registers the Machine CR at startup. The daemon discovers the active nspawn machine, builds a kube client from the applied config, ensures a Machine CR exists, then blocks until shutdown. The update logic (updateNode, hasDrift, findActiveMachine) is in place, but the trigger mechanism is TBD. Removes the Machine CR watch loop, workqueue, status updates, and operations counter logic from the prior implementation. Also removes daemon-specific e2e steps and the design doc.

* Add license header and StartLimitIntervalSec=0 to daemon unit file
Summary
Renames the `reimage` kubectl subcommand to `repave` and updates all references across the codebase.

Changes
kubectl subcommand

- `cmd/kubectl-unbounded/app/machine_reimage.go` to `machine_repave.go`
- `machineReimageCommand()` -> `machineRepaveCommand()`
- `runReimage()` -> `runRepave()`
- `reimage NAME` -> `repave NAME`

API types (`api/v1alpha3/machine_types.go`)

- `MachineConditionReimaged` -> `MachineConditionRepaved` (condition string: `"Repaved"`)
- `ReimageCounter` -> `RepaveCounter` (in both `OperationsSpec` and `OperationsStatus`)
- `reimageCounter` -> `repaveCounter`

CRD YAML

- `deploy/machina/crd/unbounded-kube.io_machines.yaml` updated to match the new field names

Internal packages

- `internal/metalman/lifecycle/reconciler.go`: `reimageTimeout` -> `repaveTimeout`, variable renames
- `internal/metalman/redfish/reconciler.go`: `condReimaged` -> `condRepaved`, variable renames
- `internal/metalman/netboot/http.go`: variable and log message renames

Documentation

- `docs/content/concepts/bare-metal.md`
- `docs/content/guides/pxe.md`
- `docs/content/reference/machina-crd.md`
- `docs/content/reference/architecture.md`
- `cmd/metalman/README.md`

Other

- `images/host-ubuntu2404/assets/grub.cfg.tmpl`: GRUB template
- `hack/smoke-metalman.py`: smoke test script

Verification
- Build passes (`go build ./cmd/kubectl-unbounded/...`)
- Tests pass (`go test ./cmd/kubectl-unbounded/... ./internal/metalman/...`)
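For readers unfamiliar with the counter pattern behind `RepaveCounter`, a repave is requested by bumping the counter in spec, and the node acknowledges it by copying the value into status; drift exists while the two differ. The sketch below uses assumed struct shapes (only the `RepaveCounter`, `OperationsSpec`, and `OperationsStatus` names come from the API above; everything else is illustrative):

```go
package main

import "fmt"

// Assumed shapes, loosely following api/v1alpha3: RepaveCounter lives in
// both OperationsSpec (desired) and OperationsStatus (last acknowledged).
type OperationsSpec struct{ RepaveCounter int64 }
type OperationsStatus struct{ RepaveCounter int64 }

type Machine struct {
	Spec   struct{ Operations OperationsSpec }
	Status struct{ Operations OperationsStatus }
}

// hasRepaveDrift reports whether a repave has been requested (spec counter
// bumped) that the node has not yet acknowledged in status.
func hasRepaveDrift(m Machine) bool {
	return m.Spec.Operations.RepaveCounter != m.Status.Operations.RepaveCounter
}

func main() {
	var m Machine
	m.Spec.Operations.RepaveCounter = 2  // operator requests a repave
	m.Status.Operations.RepaveCounter = 1
	fmt.Println(hasRepaveDrift(m)) // true: a repave is pending

	m.Status.Operations.RepaveCounter = 2 // daemon acknowledges after repaving
	fmt.Println(hasRepaveDrift(m)) // false
}
```

A counter (rather than a boolean flag) lets repeated repave requests be distinguished and makes acknowledgement idempotent, which matches the "repave implies an update" semantics discussed in the conversation.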