Rollout automation #17

Open: squaremo wants to merge 18 commits into main

Conversation

@squaremo (Contributor) commented Feb 2, 2022

This adds a "work in progress" RFC on changing the layering to make rollouts possible. It's an RFC, rather than a design, because it proposes changes to the design -- if I can get it into a good shape (including, possibly, a reference implementation), then I'll update the design docs as a follow-up.

  • get happy with the design as described in the RFC
  • reference implementation
  • make changes to design docs

This adds a "work in progress" RFC on changing the layering to make
rollouts possible. It's an RFC, rather than a design, because it
proposes changes to the design -- if I can get it into a good shape
(including, possibly, a reference implementation), then I'll update
the design docs as a follow-up.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
The partition type was the locus for the module->cluster assignment,
with the idea being that a rollout controller can change which
clusters belong to which partitions to effect a rollout.

However, it's not necessary to have an intermediate object for this
relation -- clusters can simply be owned by modules directly, without
losing generality. Since partitions don't pay their way, I've removed
them from the design.
Using owner references will interact with Kubernetes' garbage
collection. Even if it's possible to work around that with, e.g.,
finalisers, it seems like a bad idea to subvert GC. It also doesn't
help with atomicity, for which other workarounds are needed.

Instead, use the object that already represents the assignment of
modules to a cluster -- the RemoteAssemblage -- and change its meaning
to suit. More shuffling may be necessary to make BootstrapModules fit
too.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
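
Sketched below is one way the repurposed RemoteAssemblage could look: an object recording, for one cluster, the kubeconfig secret to use and the syncs assigned to it. This is illustrative only; the field names and types are assumptions, not the API as implemented in this PR.

    // Hypothetical sketch of a repurposed RemoteAssemblage; the real field
    // names and types in this PR may differ.
    package api

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    type RemoteAssemblage struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec RemoteAssemblageSpec `json:"spec,omitempty"`
    }

    type RemoteAssemblageSpec struct {
        // KubeconfigRef names the secret holding the kubeconfig for the
        // downstream cluster the syncs are applied to.
        KubeconfigRef corev1.LocalObjectReference `json:"kubeconfigRef"`

        // Syncs are the syncs, assigned from modules, to apply remotely.
        Syncs []RemoteSync `json:"syncs,omitempty"`
    }

    type RemoteSync struct {
        Name string `json:"name"` // which module the sync came from
        URL  string `json:"url"`  // git repository to sync from
        Path string `json:"path"` // directory within the repository to apply
    }
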
After realising that assemblages represent the assignment of modules
to clusters -- thus no need for another object -- a further
realisation is that building a Rollout layer on top of modules makes
modules essentially passive objects. Another way to look at it is that
Modules may as well be the locus for rollout logic, since otherwise
they do not have a job to do.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
This is just to indicate where the user is expected to interact with
the system. It doesn't affect the technical bits of the design, but
might help orient readers.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    kubebuilder create api --kind ProxyAssemblage ...

One change needed for rollouts is to have an "assemblage" object to be
the place all the assigned bootstrap modules (syncs which get applied
remotely from the control plane) get recorded for each cluster. A good
name for that would be `RemoteAssemblage` (i.e., an assemblage which is
applied remotely) -- but that name is already used for the one that
proxies to a remote cluster, which would be better-named
`ProxyAssemblage`. To create myself some bother, I'm adding the new
type and the rename at the same time.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
As part of the rename of RemoteAssemblage to ProxyAssemblage, this
commit copies the implementation of the remote assemblage controller
to the proxy assemblage controller, ports the tests for the former to
the latter, and likewise swaps the type used in module_controller.go.

The implementation for (repurposed) remote assemblages will be new,
and need new tests.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
At present, bootstrap modules get expanded directly into Flux
primitives (one GitRepository, and a Kustomization per
cluster). Incremental rollouts could be implemented within this
scheme; however, it would not help third party integrations much since
they'd have to implement all of it themselves. Instead, I'm
repurposing the RemoteAssemblage type -- the behaviour of which was
moved to the better-named ProxyAssemblage -- to represent a set of
syncs to be applied remotely to a cluster.

In this commit:

 - change the definition of the RemoteAssemblage type so it contains
   syncs to apply remotely, rather than syncs to proxy (those were
   moved to ProxyAssemblage)

 - move the "expand to Flux primitives" code from the bootstrap module
   controller to the remote assemblage controller

 - implement the construction of remote assemblages in the bootstrap
   module controller

 - fix up type names (remote vs. proxy) in RBAC annotations and comments
 - adapt test code to the above changes

The aim is to eventually put the commonly useful bits -- expansion to
Flux primitives, and binding evaluation -- in the assemblage code, and
the rollout logic in the module code. An integration or extension can
then replace the module part by building on the assemblage part.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
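
To make the "expand to Flux primitives" step concrete, here is a sketch (not the code in this PR) of turning one assigned sync into a Flux Kustomization that targets the downstream cluster through its kubeconfig secret. It builds an unstructured object so the example does not depend on a particular Flux API version; the helper name and the interval are illustrative.

    package controllers

    import (
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    )

    // kustomizationFor builds a Flux Kustomization that applies one sync to
    // the downstream cluster identified by its kubeconfig secret.
    func kustomizationFor(namespace, name, path, sourceName, kubeconfigSecret string) *unstructured.Unstructured {
        k := &unstructured.Unstructured{}
        k.SetAPIVersion("kustomize.toolkit.fluxcd.io/v1beta2")
        k.SetKind("Kustomization")
        k.SetNamespace(namespace)
        k.SetName(name)
        k.Object["spec"] = map[string]interface{}{
            "interval": "1m",
            "prune":    true,
            "path":     path,
            "sourceRef": map[string]interface{}{
                "kind": "GitRepository", // created alongside, from the sync's git source
                "name": sourceName,
            },
            // apply to the downstream cluster rather than the control plane
            "kubeConfig": map[string]interface{}{
                "secretRef": map[string]interface{}{
                    "name": kubeconfigSecret,
                },
            },
        }
        return k
    }
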
When you create a cluster with Cluster API, it will typically take
some minutes between the Cluster object (and its companions) appearing
and the provider creating a secret containing a kubeconfig for the
newly-provisioned cluster. If you have a bootstrap module already, the
Flux primitives targeting the new cluster will fail in that period,
possibly leading to a long back-off.

There are at least a couple of places where this can be mitigated:

 - only create a RemoteAssemblage when a cluster is marked ready;
 - only create Flux primitives when the required secret exists;

Here, I've implemented the latter, since it is at a lower level and
therefore will apply more broadly.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
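
A minimal sketch of that check, assuming a controller-runtime client (the function and parameter names are illustrative, not the PR's code): if the secret is missing, the reconciler requeues instead of creating Flux objects that would immediately fail and back off.

    package controllers

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
        "k8s.io/apimachinery/pkg/types"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // kubeconfigPresent reports whether the kubeconfig secret for a cluster
    // exists yet. When it returns false, the caller should skip creating the
    // Flux primitives and return a result with RequeueAfter set.
    func kubeconfigPresent(ctx context.Context, c client.Client, namespace, secretName string) (bool, error) {
        var secret corev1.Secret
        err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: secretName}, &secret)
        if apierrors.IsNotFound(err) {
            return false, nil
        }
        return err == nil, err
    }
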
The assemblage controller has a func for calculating the ready state
of the Kustomization objects it created. This is useful for the remote
assemblage controller too, which is in a different layer (module
rather than assemblage) -- so, recreate it in pkg/sync to be available
to both layers.

I changed the argument type; it seems like a calculation specific to
the Flux Kustomize type. But if I need it for other things, perhaps it
can be generalised, or just replaced with kstatus.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
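
The shared helper might look roughly like the following, assuming it classifies the Flux Kustomization's "Ready" condition; the PR's actual function in pkg/sync (and the state names it uses) may differ.

    package sync

    import (
        apimeta "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    type ReadyState string

    const (
        ReadyStateReady    ReadyState = "Ready"
        ReadyStateFailed   ReadyState = "Failed"
        ReadyStateUpdating ReadyState = "Updating"
    )

    // readyState classifies an object's Ready condition. It takes the
    // conditions rather than a Kustomization value, so both the assemblage
    // and module layers can use it.
    func readyState(conditions []metav1.Condition) ReadyState {
        ready := apimeta.FindStatusCondition(conditions, "Ready")
        switch {
        case ready == nil || ready.Status == metav1.ConditionUnknown:
            return ReadyStateUpdating // not reconciled yet, or in progress
        case ready.Status == metav1.ConditionTrue:
            return ReadyStateReady
        default:
            return ReadyStateFailed
        }
    }
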
 - adapt the code for tracking assemblages and recording sync states
   and module summaries, from module_controller to
   bootstrapmodule_controller
 - use KustomizeReadyState from pkg/api/, consistently
 - try to get the RBAC annotations more accurate (again)

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
It's expected (and useful) behaviour to recreate a controller-created
object that is deleted out of band (e.g., by a pesky human). To make
it work this way, it's usually only necessary for the controller to
watch the objects in question, since a deletion will then trigger a
reconciliation of the owner object.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
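
In controller-runtime terms, the key wiring is the Owns() clause in the controller setup: deleting an owned object triggers a reconcile of its owner, which then recreates it. The sketch below uses a placeholder import path and the type names from the sketches earlier in this PR; it is not the project's actual setup code.

    package controllers

    import (
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/reconcile"

        fleetv1 "example.com/fleet/api/v1alpha1" // placeholder import path
    )

    // setupModuleController declares that the Module controller owns
    // RemoteAssemblages, so an out-of-band deletion of one triggers a
    // reconcile of the owning Module, which recreates it.
    func setupModuleController(mgr ctrl.Manager, r reconcile.Reconciler) error {
        return ctrl.NewControllerManagedBy(mgr).
            For(&fleetv1.Module{}).
            Owns(&fleetv1.RemoteAssemblage{}).
            Complete(r)
    }
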
Previously, the remote assemblage controller was made to skip (and
come back to) a cluster when its kubeconfig secret is missing. This
accounts broadly for races in provisioning clusters, since ultimately
it's the secret that matters.

However, with Cluster API providers it's possible for a kubeconfig
secret to exist before the cluster is ready to be used. This situation
_also_ leads to long backoffs and apparent lack of responsiveness. To
mitigate that, this commit adds a similar mechanism to the above, at
another layer: clusters that are not marked as ready (in
`.status.controlPlaneReady`) are skipped. This is at the module (and
bootstrapmodule) level, because it's modules that react to clusters --
assemblages just refer to kubeconfig secrets.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
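
A sketch of the readiness gate at the module level, assuming Cluster API's Cluster type; the field checked is the `.status.controlPlaneReady` named in the commit above, but the helper itself is illustrative rather than the PR's code.

    package controllers

    import (
        clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    )

    // clusterReady reports whether a Cluster API cluster is ready to have
    // syncs assigned to it; modules skip clusters that are still
    // provisioning, and come back to them later, to avoid long back-offs.
    func clusterReady(cluster *clusterv1.Cluster) bool {
        return cluster.Status.ControlPlaneReady
    }
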
The ProxyAssemblage controller does not have the opportunity to be
notified of changes in a downstream cluster (or at least, this would
be tricky to arrange, especially at scale), so it won't in general see
when its Assemblage counterpart changes status.

A dumb but effective way to make sure updates are seen is to simply
requeue each ProxyAssemblage to be re-examined after a small
interval. This will be OK at moderate scale (I would guess O(100)
clusters), since more worker threads can be configured to cope with
polling. Larger scales may need a less brute-force approach to be
figured out.

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
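
The "dumb but effective" polling amounts to returning a RequeueAfter from each successful reconcile, roughly as below; the reconciler type is a stand-in and the interval is illustrative.

    package controllers

    import (
        "context"
        "time"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    type ProxyAssemblageReconciler struct{}

    // Reconcile sketch: after syncing with the downstream cluster, requeue
    // after a fixed interval so status changes there are eventually seen,
    // even though no events arrive from that cluster.
    func (r *ProxyAssemblageReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // ... fetch the ProxyAssemblage, apply/read its counterpart
        // Assemblage in the downstream cluster, and mirror its status ...
        return ctrl.Result{RequeueAfter: time.Minute}, nil
    }
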
 - fix a dependency loop in the demo with CRDs and their use
 - fix a typo in the enrol script
 - give some clues about manually cleaning up

Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>