nixos/kubernetes: stabilize cluster deployment/startup across machines #56789
Conversation
to protect services from crashing and clobbering the logs when certificates are not in place yet, and to make sure services are activated once certificates are ready. To prevent errors similar to "kube-controller-manager.path: Failed to enter waiting state: Too many open files", fs.inotify.max_user_instances has to be increased.
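For illustration, a minimal sketch of the two measures above in NixOS module form; the sysctl value and the certificate path are placeholders, not the ones from this PR's diff:

```nix
{
  # Raise the inotify instance limit so the many .path units don't exhaust it.
  boot.kernel.sysctl."fs.inotify.max_user_instances" = 8192;

  # Hold back kube-controller-manager until its certificate exists, instead of
  # letting the service crash-loop and clobber the log.
  systemd.paths.kube-controller-manager = {
    wantedBy = [ "kubernetes.target" ];
    pathConfig = {
      # When the file appears, systemd starts kube-controller-manager.service.
      PathExists = "/var/lib/kubernetes/secrets/kube-controller-manager.pem";
    };
  };
}
```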
by adding targets and curl wait loops to services, to ensure services are not started before the services they depend on are reachable. Extra targets cfssl-online.target and kube-apiserver-online.target synchronize starts across machines, and node-online.target ensures docker is restarted and ready to deploy containers on after flannel has negotiated the network CIDR with the apiserver. Since flannel needs to be started before the addon-manager to configure the docker interface, it has to have its own RBAC bootstrap service. The curl wait loops within the other services exist to ensure that a service, once started, can do its work immediately instead of clobbering the log with failing conditions. By ensuring kubernetes.target is only reached after starting the cluster, it can be used in the tests as a wait condition. In kube-certmgr-bootstrap, the mkdir is needed so the unit does not fail to start. The following is the relevant part of systemctl list-dependencies default.target:

● ├─certmgr.service
● ├─cfssl.service
● ├─docker.service
● ├─etcd.service
● ├─flannel.service
● ├─kubernetes.target
● │ ├─kube-addon-manager.service
● │ ├─kube-proxy.service
● │ ├─kube-apiserver-online.target
● │ │ ├─flannel-rbac-bootstrap.service
● │ │ ├─kube-apiserver-online.service
● │ │ ├─kube-apiserver.service
● │ │ ├─kube-controller-manager.service
● │ │ └─kube-scheduler.service
● │ └─node-online.target
● │   ├─node-online.service
● │   ├─flannel.target
● │   │ ├─flannel.service
● │   │ └─mk-docker-opts.service
● │   └─kubelet.target
● │     └─kubelet.service
● ├─network-online.target
● │ └─cfssl-online.target
● │   ├─certmgr.service
● │   ├─cfssl-online.service
● │   └─kube-certmgr-bootstrap.service
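As a rough sketch of the curl wait-loop pattern described above (not the exact diff): it assumes a module context where `pkgs` and `top` are in scope, and the `addonManager` certificate attribute name is an assumption.

```nix
systemd.services.kube-addon-manager = {
  wantedBy = [ "kubernetes.target" ];
  after = [ "kube-control-plane-online.target" ];
  # Poll the apiserver's /healthz until it answers, so the unit starts doing
  # real work immediately instead of failing and spamming the journal.
  preStart = ''
    until ${pkgs.curl}/bin/curl --silent --fail \
        --cacert ${top.caFile} \
        --cert ${top.pki.certs.addonManager.cert} \
        --key ${top.pki.certs.addonManager.key} \
        ${top.apiserverAddress}/healthz -o /dev/null; do
      sleep 2
    done
  '';
};
```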
to speed up startup time because it can be parallelized.
to prevent errors in the log about missing permissions when the addon manager starts the dashboard.
within the node join script, since certmgr is taking care of restarting services.
@calbrecht Thank you for this! Looks good at first glance, and tests pass.
Would be nice if kubernetes components would support socket-based activation ... then we would get correct ordering of startup for free if we'd use
@arianvp there is an open issue in the kubernetes repo about socket-based activation, but it does not apply to starting core components, I figure.
@calbrecht This was what I had for now :-) Quite a bit actually. Feel free to catch me on IRC if you want to discuss some of it in a more informal fashion.
echo "Restarting flannel..." >&1 | ||
systemctl restart flannel | ||
''} | ||
if [ y = $do_restart ]; then |
Bash nitpick:
Couldn't this be written as just if [ -s "${certmgrAPITokenPath}" ]; then, and then line 378 can be removed.
My intention was: if the file is present and not empty before running the node-join script, then restart certmgr; otherwise the systemd path unit will take care of starting certmgr.
Fine by me then.
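For reference, the suggested simplification could look roughly like this inside the (hypothetical) node-join script fragment; [ -s FILE ] is only true when FILE exists and is non-empty, so the separate existence check and the do_restart flag become unnecessary:

```nix
script = ''
  if [ -s "${certmgrAPITokenPath}" ]; then
    # Token already on disk and non-empty: just restart certmgr so it picks
    # the token up. Otherwise the systemd path unit starts certmgr once the
    # token file appears.
    systemctl restart certmgr
  fi
'';
```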
fi
'')
];
serviceConfig = {
RestartSec = "10s";
TimeoutSec = "500";
RestartSec = "1s";
Do we really want this to try and restart every second without any StartLimit or anything?
Removed the restart altogether in 5684034, since the service has a preStart wait loop anyway.
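For comparison, the StartLimit variant hinted at above might have looked roughly like this sketch (unit name reused from the diff, values arbitrary):

```nix
systemd.services.kube-apiserver-online.serviceConfig = {
  Restart = "on-failure";
  RestartSec = "1s";
};
# Rate-limit the restarts: give up after 5 failed attempts within 30 seconds.
systemd.services.kube-apiserver-online.unitConfig = {
  StartLimitIntervalSec = 30;
  StartLimitBurst = 5;
};
```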
# because the network of docker0 has been changed by flannel.
script = let
docker-env = "/run/flannel/docker";
flannel-date = "stat --print=%Y ${docker-env}";
Urgh. This is not very nice, but neither was my restart-dance, and for now I have no prettier suggestion. :-) It is certainly complicated the way flannel negotiates the pod CIDR. We should get rid of flannel as default CNI in the future, I think.
Ok, I am not experienced in networking; whatever you say, I am in for it :)
Like I said, I have no better suggestion :-) It just seems a little flaky to rely on timestamps for getting the restart order right. Perhaps the only right solution is to replace Flannel with something else, but that's a toy project for another time.
apiGroup = "rbac.authorization.k8s.io"; | ||
kind = "ClusterRole"; | ||
name = "flannel"; | ||
systemd.services.flannel-rbac-bootstrap = mkIf (top.apiserver.enable && (elem "RBAC" top.apiserver.authorizationMode)) { |
May I ask why you changed this? Flannel doesn't need RBAC permissions if it doesn't use kubernetes as storage backend, which is why the mkIf previously had a guard for the flannel storage backend option. Furthermore, the addonManager bootstrap addons were used to assign privileges, but you created a separate service - why?
First, the storageBackend was always "kubernetes", as declared in line 10 of this file. I created this because, in my view, the flannel service needs to start before any addons might be deployed to nodes, and nodes couldn't be found before the flannel service configured the network.
In 6e9037f there is now a new service, kube-addon-manager-bootstrap.service, whose purpose is to hold the higher permissions needed to invoke kubectl apply, and not to conflict with the mkWaitCurl of kube-addon-manager.service, which does not need the clusterAdmin certs but the addonManager certs.
I think it is my mistake that I forgot to mkDefault the storageBackend in line 40, but that doesn't change the fact that we should always be able to handle/support the situation where someone decides to do storageBackend = mkForce "etcd". This is why the mkIf on storageBackend is needed, for the same reason that we cannot assume that easyCerts is always true.
Anyway, the new version of the diff looks better. I see why a new service was needed now, because of the order of things starting. :-)
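A sketch of the tightened guard discussed above, assuming the flannel module's storageBackend option is reachable via config (not the exact commit):

```nix
systemd.services.flannel-rbac-bootstrap =
  mkIf (top.apiserver.enable
        && config.services.flannel.storageBackend == "kubernetes"
        && elem "RBAC" top.apiserver.authorizationMode) {
    # one-shot kubectl apply of the flannel ClusterRole/ClusterRoleBinding
    # (body elided; see the diff context above)
  };
```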
inherit cert key;
})}
kubectl -s ${top.apiserverAddress} --certificate-authority=${top.caFile} --client-certificate=${top.pki.certs.clusterAdmin.cert} --client-key=${top.pki.certs.clusterAdmin.key} apply -f ${concatStringsSep " \\\n -f " files}
This is why assigning RBAC rights is nice to have isolated in the "bootstrap-addons" loop. We don't want too many individual units to have access to the clusterAdmin certificates, or run kubectl commands against the cluster. In either case, this unit poses a problem for clusters where easyCerts = false. We cannot reference top.pki.certs without guarding for mkIf top.easyCerts. Preferably, everything easyCerts related should just be declared in pki.nix.
Ok, I get the point about the access to the clusterAdmin certs and the easyCerts.
I just found, by delaying the startup of kube-addon-manager until it can do its work, that flannel needs its roles earlier.
I also created a different oneshot service for the bootstrapAddons set, to activate it as soon as the apiserver was online, but threw it away because I was confused about whether there would be any "pod creation requests" involved.
The bootstrapAddons confuses me when thinking about the startup order; I'd rather think of it as a bootstrapAccessControl or similar, which does this only on the master right after the kube-apiserver is working, which is the intention of bootstrapAddons, right?

Preferably, everything easyCerts related should just be declared in pki.nix.

Agree, do you think of the mkCert calls here as well?
Removed the bootstrapping in 6e9037f
mkCert I think can stay here, because it doesn't require easyCerts to be true, and it fits well in the same file as the component needing the certs.
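The easyCerts guard asked for above might look roughly like this sketch; the service name is taken from the discussion, and the kubectl invocation mirrors the diff above:

```nix
systemd.services.kube-addon-manager-bootstrap =
  mkIf (top.apiserver.enable && top.easyCerts) {
    # Only inside this mkIf may we reference top.pki.certs.clusterAdmin;
    # with easyCerts = false the operator supplies their own RBAC bootstrap.
    script = ''
      kubectl -s ${top.apiserverAddress} \
        --certificate-authority=${top.caFile} \
        --client-certificate=${top.pki.certs.clusterAdmin.cert} \
        --client-key=${top.pki.certs.clusterAdmin.key} \
        apply -f ${concatStringsSep " \\\n -f " files}
    '';
  };
```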
systemd.services.kube-apiserver-online = mkIf top.flannel.enable {
Why the mkIf on flannel here? What happens if I disable flannel and go with another CNI? Will dependencies on kube-apiserver-online still work?
Hmm, I see. The mkIf currently is there because the mkWaitCurl is using the certificates of the flannel client to check the apiserver /healthz path.
I decided to use the flannel certs because it seemed to me that they will always be present on the master and nodes.
So maybe on the master the mkWaitCurl could be called with the cluster-admin cert, and maybe this will work with the kubelet-client cert on the nodes (not sure if this is working because the node somehow is not present at this point, but will try).
Resolved in 7323b77#diff-8d468cb36655335308a8eb4c3999b59aR372 by either using the clusterAdmin cert or the kubelet cert.
See my comment here: 7323b77#r32635029
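In other words, the health-check certificate is chosen per role; a rough sketch (attribute names assumed, not taken from the actual diff):

```nix
# Client certificate used by mkWaitCurl against /healthz:
# clusterAdmin on the master, the kubelet client cert on worker nodes.
waitCert =
  if top.apiserver.enable
  then top.pki.certs.clusterAdmin
  else top.pki.certs.kubelet;
```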
systemd.services.kube-apiserver-online = mkIf top.flannel.enable {
description = "apiserver control plane is online";
wantedBy = [ "kube-apiserver-online.target" ];
after = [ "kube-scheduler.service" "kube-controller-manager.service" ];
I guess what's confusing me here is that there is a kube-apiserver-online.service and a kube-apiserver-online.target. I think (maybe, if I understood the intention correctly) the latter should be renamed to kube-control-plane-online.target, and that target should contain all the master components: kube-apiserver, kube-controller-manager, kube-scheduler, kube-addon-manager.
Ok, will rename it, but I'd prefer not to include the addon-manager and to start that one after node-online.target, for the reasons I gave above.
Renamed in ff91d58
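Roughly, the renamed target acts as a synchronization point for the master components; a sketch (not the exact commit):

```nix
systemd.targets.kube-control-plane-online = {
  wantedBy = [ "kubernetes.target" ];
  before = [ "kubernetes.target" ];
};
# Master components hook themselves into the target, e.g.:
systemd.services.kube-apiserver-online = {
  wantedBy = [ "kube-control-plane-online.target" ];
  before = [ "kube-control-plane-online.target" ];
};
```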
@@ -72,7 +72,7 @@ in
systemd.services.kube-addon-manager = {
description = "Kubernetes addon manager";
wantedBy = [ "kubernetes.target" ];
after = [ "kube-apiserver.service" ];
after = [ "kube-apiserver-online.target" "node-online.target" ];
The addon manager doesn't require node-online, does it? Isn't it just about submitting something to the apiserver? Whatever is submitted will be scheduled later, when one or more nodes come online.
The reason I made the addon-manager depend on node-online.target is that I wanted to make sure flannel had already configured and restarted docker at the time of requesting new containers to be started.
I thought that if they were requested earlier, they had to be destroyed and restarted to actually be connected to the newly created subnet. And if this is a non-issue, then it is just a matter of my preference not to start services that cannot function correctly and therefore just spam the logs with warnings about not being able to find the node.
I mean, the addon-manager cannot do anything if there are no nodes online. Or do I lack the deeper understanding here?
It activates now within kube-control-plane-online.target: ff91d58#diff-e6ede1705ae680d8146d5db307a2a392R74
All the addon manager does is submit stuff to the apiserver (etcd ultimately). If some of that "stuff" requires scheduling (i.e. pods), the kube-scheduler and kube-controller-manager will handle it from there. If there are no nodes online yet, the pods will simply end up in status Pending until at least one node shows up, but this is not something the addon manager should wait for. :-) All it requires is the apiserver. I see your new version of the diff implements that. Fine.
@@ -354,5 +405,16 @@ in
};
})
{
systemd.targets.kubelet = {
wantedBy = [ "node-online.target" ];
Could it be renamed to kube-node-online.target? Just node-online.target can be confusing on multi-purpose hosts. Also, I think that kube-proxy should be part of this target.
Sure, I will rename it. And for the kube-proxy, I will check.
Looks good!
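A sketch of the renamed node-level target, with kubelet (and, as suggested, possibly kube-proxy) wanted by it (not the exact commit):

```nix
systemd.targets.kube-node-online = {
  wantedBy = [ "kubernetes.target" ];
  before = [ "kubernetes.target" ];
};
systemd.targets.kubelet = {
  wantedBy = [ "kube-node-online.target" ];
  before = [ "kube-node-online.target" ];
};
# Possibly also:
systemd.services.kube-proxy.wantedBy = [ "kube-node-online.target" ];
```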
@@ -272,11 +272,30 @@ in
###### implementation
config = mkMerge [
(mkIf cfg.enable {
(mkIf cfg.enable (let
apiserverPaths = [
I fear there might be an issue regarding compatibility with easyCerts = false here? It should be allowed to run a less secure cluster with AlwaysAllow as authorizer. I think we should consider putting some of these path-lists in pki.nix, to make it possible to opt out of some of these path dependencies.
Yes, I understand. Will move all occurrences into pki.nix.
Moved the remaining path units and usage in ff382c1
Looks good. Will have to test without easyCerts to see whether everything is still compatible.
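A sketch of the opt-out idea: keep the certificate path-dependencies in pki.nix behind the easyCerts switch, so an AlwaysAllow / easyCerts = false cluster does not inherit them (the exact list of paths is an assumption):

```nix
systemd.paths.kube-apiserver = mkIf (top.apiserver.enable && top.easyCerts) {
  wantedBy = [ "kubernetes.target" ];
  pathConfig.PathExists = [
    top.caFile
    top.apiserver.tlsCertFile
    top.apiserver.tlsKeyFile
  ];
};
```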
There's also this: https://www.freedesktop.org/software/systemd/man/systemd-socket-proxyd.html which retrofits socket activation on any non-socket-activated service.
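An untested sketch of how systemd-socket-proxyd could be wired in front of a non-socket-activated component; the addresses, ports, and unit names are placeholders:

```nix
systemd.sockets.kube-apiserver-proxy = {
  wantedBy = [ "sockets.target" ];
  # The socket systemd listens on; a connection to it pulls the proxy up.
  socketConfig.ListenStream = "127.0.0.1:6444";
};
systemd.services.kube-apiserver-proxy = {
  requires = [ "kube-apiserver.service" "kube-apiserver-proxy.socket" ];
  after = [ "kube-apiserver.service" "kube-apiserver-proxy.socket" ];
  # Forward accepted connections to the real (non-socket-activated) apiserver.
  serviceConfig.ExecStart =
    "${pkgs.systemd}/lib/systemd/systemd-socket-proxyd 127.0.0.1:6443";
};
```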
@@ -73,6 +73,18 @@ let
};
};
mkWaitCurl = { address ? cfg.apiserverAddress, sleep ? 2, path ? "", args ? "-o /dev/null",
Just noticed by accident: when deploying a different deployment with nixops, nix complains about cfg.apiserverAddress being used but not defined. Probably because it has no default value in the options. Have to work around that somehow.
@calbrecht apiserverAddress is mkDefault'ed to the value of masterAddress here: https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/services/cluster/kubernetes/default.nix#L286 . The reason is the assumption that you have only one master node, but it should still be possible to override this. In our production cluster, for example, we have 3 master nodes and a load-balanced address that divides requests between them all. In that case, we want a setup where masterAddress != apiserverAddress.
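As a usage example of the override described above (hostnames are made up): masterAddress stays the single master / cfssl host, while apiserverAddress points at a load-balanced endpoint in front of several masters.

```nix
services.kubernetes = {
  masterAddress = "master1.example.org";
  # Overrides the mkDefault that would otherwise point at masterAddress.
  apiserverAddress = "https://k8s-api.example.org:6443";
};
```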
@johanot Ok, so you have the cfssl service running on the masterAddress, right?
However, the issue here was that I defined the kube-control-plane-online service and target within apiserver.nix but outside of the mkIf cfg.enabled set, which probably led to evaluation of this function declaration in every other nix-build. Fixed this in 154356d.
Looks good!
@calbrecht Sorry I went kind of offline on this one and didn't help seeing it through. Anyway, now I'm "awake" again :-) It seems, unfortunately, that the DNS and RBAC test cases stopped working on master after merging this. Looks like
@johanot, what errors do you mean? When reverting the kubernetes version upgrades back to 1.13.4, the tests run successfully. Does this have to do with restoring the --user and --password flags in the changes from 1.13.4 -> 1.13.5?
@calbrecht Looks like your assumption is right. It does indeed break with 1.13.4 -> 1.13.5. It is sometimes baffling to me what is apparently allowed in a minor kubernetes release. I mean, it clearly breaks the API, but that's somehow ok. :) Just confirms to me that we need to be careful even with a revision bump of kubernetes. The "error" I was talking about is that it looks like the client cert is
@johanot It would be great if you could look into it, because I am unsure how to proceed: whether to generate a new cert or take an existing one, and whether it's ok to rely on the pki within default.nix, etc.
…factor" This reverts commit 7dc6e77, reversing changes made to bce47ea. Motivation for the revert in NixOS#67563
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/kubernetes-network-malfunction-after-upgrading-to-19-09/4620/1
…factor" This reverts commit 7dc6e77, reversing changes made to bce47ea. Motivation for the revert in NixOS#67563 (cherry picked from commit 00975b5)
Motivation for this change
I had a tough time getting to know Kubernetes with all the errors in the logs, caused by services dying because of missing required certificates (which were not generated yet) or simply because services depended on other services that were not started yet.
This PR makes all the kubernetes-related services start in order, even across machines, and ensures that the requirements of each service are met before it starts. The kubernetes tests previously produced about 28K lines of log and ran in about 4m 20s; they are now down to 13K lines and 3m 40s. All services start cleanly on the first boot.
Please see the commit messages for further information about the changes.