nixos/kubernetes: stabilize cluster deployment/startup across machines #56789
Conversation
to protect services from crashing and clobbering the logs when certificates are not in place yet, and to make sure services are activated once certificates are ready. To prevent errors similar to "kube-controller-manager.path: Failed to enter waiting state: Too many open files", fs.inotify.max_user_instances has to be increased.
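For illustration, a minimal sketch of the two measures above in NixOS module form; the sysctl value and the certificate path are placeholders, not the ones from this PR's diff:

```nix
{
  # Raise the inotify instance limit so the many .path units don't exhaust it.
  boot.kernel.sysctl."fs.inotify.max_user_instances" = 8192;

  # Hold back kube-controller-manager until its certificate exists, instead of
  # letting the service crash-loop and clobber the log.
  systemd.paths.kube-controller-manager = {
    wantedBy = [ "kubernetes.target" ];
    pathConfig = {
      # When the file appears, systemd starts kube-controller-manager.service.
      PathExists = "/var/lib/kubernetes/secrets/kube-controller-manager.pem";
    };
  };
}
```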
by adding targets and curl wait loops to services, to ensure services are not started before the services they depend on are reachable. Extra targets cfssl-online.target and kube-apiserver-online.target synchronize starts across machines, and node-online.target ensures docker is restarted and ready to deploy containers on after flannel has negotiated the network CIDR with the apiserver. Since flannel needs to be started before the addon-manager to configure the docker interface, it has to have its own RBAC bootstrap service. The curl wait loops within the other services exist to ensure that a service, once started, can do its work immediately instead of clobbering the log with failing conditions. By ensuring kubernetes.target is only reached after starting the cluster, it can be used in the tests as a wait condition. In kube-certmgr-bootstrap, the mkdir is needed so the unit does not fail to start. The following is the relevant part of systemctl list-dependencies default.target:

● ├─certmgr.service
● ├─cfssl.service
● ├─docker.service
● ├─etcd.service
● ├─flannel.service
● ├─kubernetes.target
● │ ├─kube-addon-manager.service
● │ ├─kube-proxy.service
● │ ├─kube-apiserver-online.target
● │ │ ├─flannel-rbac-bootstrap.service
● │ │ ├─kube-apiserver-online.service
● │ │ ├─kube-apiserver.service
● │ │ ├─kube-controller-manager.service
● │ │ └─kube-scheduler.service
● │ └─node-online.target
● │   ├─node-online.service
● │   ├─flannel.target
● │   │ ├─flannel.service
● │   │ └─mk-docker-opts.service
● │   └─kubelet.target
● │     └─kubelet.service
● ├─network-online.target
● │ └─cfssl-online.target
● │   ├─certmgr.service
● │   ├─cfssl-online.service
● │   └─kube-certmgr-bootstrap.service
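As a rough sketch of the curl wait-loop pattern described above (not the exact diff): it assumes a module context where `pkgs` and `top` are in scope, and the `addonManager` certificate attribute name is an assumption.

```nix
systemd.services.kube-addon-manager = {
  wantedBy = [ "kubernetes.target" ];
  after = [ "kube-control-plane-online.target" ];
  # Poll the apiserver's /healthz until it answers, so the unit starts doing
  # real work immediately instead of failing and spamming the journal.
  preStart = ''
    until ${pkgs.curl}/bin/curl --silent --fail \
        --cacert ${top.caFile} \
        --cert ${top.pki.certs.addonManager.cert} \
        --key ${top.pki.certs.addonManager.key} \
        ${top.apiserverAddress}/healthz -o /dev/null; do
      sleep 2
    done
  '';
};
```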
to speed up startup time because it can be parallelized.
to prevent errors in the log about missing permissions when the addon manager starts the dashboard.
within the node join script, since certmgr is taking care of restarting services.
@calbrecht Thank you for this! Looks good at first glance, and tests pass.
Would be nice if kubernetes components would support socket-based activation ... then we would get correct ordering of startup for free if we'd use
@arianvp there is an open issue in the kubernetes repo about socket-based activation, but it does not apply to starting core components, I figure.
@calbrecht This was what I had for now :-) Quite a bit actually. Feel free to catch me on IRC if you want to discuss some of it in a more informal fashion.
echo "Restarting flannel..." >&1 | ||
systemctl restart flannel | ||
''} | ||
if [ y = $do_restart ]; then |
Bash nitpick:
Couldn't this be written as just if [ -s "${certmgrAPITokenPath}" ]; then, and then line 378 can be removed.
My intention was: if the file is present and not empty before running the node-join script, then restart certmgr; otherwise the systemd path unit will take care of starting certmgr.
Fine by me then.
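For reference, the suggested simplification could look roughly like this inside the (hypothetical) node-join script fragment; [ -s FILE ] is only true when FILE exists and is non-empty, so the separate existence check and the do_restart flag become unnecessary:

```nix
script = ''
  if [ -s "${certmgrAPITokenPath}" ]; then
    # Token already on disk and non-empty: just restart certmgr so it picks
    # the token up. Otherwise the systemd path unit starts certmgr once the
    # token file appears.
    systemctl restart certmgr
  fi
'';
```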
fi
'')
];
serviceConfig = {
RestartSec = "10s";
TimeoutSec = "500";
RestartSec = "1s";
Do we really want this to try and restart every second without any StartLimit or anything?
Removed the restart altogether in 5684034, since the service has a preStart wait loop anyway.
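For comparison, the StartLimit variant hinted at above might have looked roughly like this sketch (unit name reused from the diff, values arbitrary):

```nix
systemd.services.kube-apiserver-online.serviceConfig = {
  Restart = "on-failure";
  RestartSec = "1s";
};
# Rate-limit the restarts: give up after 5 failed attempts within 30 seconds.
systemd.services.kube-apiserver-online.unitConfig = {
  StartLimitIntervalSec = 30;
  StartLimitBurst = 5;
};
```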
# because the network of docker0 has been changed by flannel.
script = let
docker-env = "/run/flannel/docker";
flannel-date = "stat --print=%Y ${docker-env}";
Urgh. This is not very nice, but neither was my restart-dance, and for now I have no prettier suggestion. :-) It is certainly complicated the way flannel negotiates the pod CIDR. We should get rid of flannel as default CNI in the future, I think.
Ok, I am not experienced in networking; whatever you say, I am in for it :)
Like I said, I have no better suggestion :-) It just seems a little flaky to rely on timestamps for getting the restart order right. Perhaps the only right solution is to replace Flannel with something else, but that's a toy project for another time.
apiGroup = "rbac.authorization.k8s.io"; | ||
kind = "ClusterRole"; | ||
name = "flannel"; | ||
systemd.services.flannel-rbac-bootstrap = mkIf (top.apiserver.enable && (elem "RBAC" top.apiserver.authorizationMode)) { |
May I ask why you changed this? Flannel doesn't need RBAC permissions if it doesn't use kubernetes as storage backend, which is why the mkIf previously had a guard for the flannel storage backend option. Furthermore, the addonManager bootstrap addons were used to assign privileges, but you created a separate service - why?
First, the storageBackend was always "kubernetes", as declared in line 10 of this file. I created this because, in my view, the flannel service needs to start before any addons might be deployed to nodes, and nodes couldn't be found before the flannel service configured the network.
In 6e9037f there is now a new service, kube-addon-manager-bootstrap.service, whose purpose is to hold the higher permissions needed to invoke kubectl apply, and not to conflict with the mkWaitCurl of kube-addon-manager.service, which does not need the clusterAdmin certs but the addonManager certs.
I think it is my mistake that I forgot to mkDefault the storageBackend in line 40, but that doesn't change the fact that we should always be able to handle/support the situation where someone decides to do storageBackend = mkForce "etcd". This is why the mkIf on storageBackend is needed, for the same reason that we cannot assume that easyCerts is always true.
Anyway, the new version of the diff looks better. I see why a new service was needed now, because of the order of things starting. :-)
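A sketch of the tightened guard discussed above, assuming the flannel module's storageBackend option is reachable via config (not the exact commit):

```nix
systemd.services.flannel-rbac-bootstrap =
  mkIf (top.apiserver.enable
        && config.services.flannel.storageBackend == "kubernetes"
        && elem "RBAC" top.apiserver.authorizationMode) {
    # one-shot kubectl apply of the flannel ClusterRole/ClusterRoleBinding
    # (body elided; see the diff context above)
  };
```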
inherit cert key;
})}
kubectl -s ${top.apiserverAddress} --certificate-authority=${top.caFile} --client-certificate=${top.pki.certs.clusterAdmin.cert} --client-key=${top.pki.certs.clusterAdmin.key} apply -f ${concatStringsSep " \\\n -f " files}
This is why assigning RBAC rights is nice to have isolated in the "bootstrap-addons" loop. We don't want too many individual units to have access to the clusterAdmin certificates, or run kubectl commands against the cluster. In either case, this unit poses a problem for clusters where easyCerts = false. We cannot reference top.pki.certs without guarding for mkIf top.easyCerts. Preferably, everything easyCerts related should just be declared in pki.nix.
Ok, I get the point about the access to the clusterAdmin certs and the easyCerts.
I just found, by delaying the startup of kube-addon-manager until it can do its work, that flannel needs its roles earlier.
I also created a different oneshot service for the bootstrapAddons set, to activate it as soon as the apiserver was online, but threw it away because I was confused about whether there would be any "pod creation requests" involved.
The bootstrapAddons confuses me when thinking about the startup order; I'd rather think of it as a bootstrapAccessControl or similar, which does this only on the master right after the kube-apiserver is working, which is the intention of bootstrapAddons, right?

Preferably, everything easyCerts related should just be declared in pki.nix.

Agree, do you think of the mkCert calls here as well?
Removed the bootstrapping in 6e9037f
mkCert I think can stay here, because it doesn't require easyCerts to be true, and it fits well in the same file as the component needing the certs.
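The easyCerts guard asked for above might look roughly like this sketch; the service name is taken from the discussion, and the kubectl invocation mirrors the diff above:

```nix
systemd.services.kube-addon-manager-bootstrap =
  mkIf (top.apiserver.enable && top.easyCerts) {
    # Only inside this mkIf may we reference top.pki.certs.clusterAdmin;
    # with easyCerts = false the operator supplies their own RBAC bootstrap.
    script = ''
      kubectl -s ${top.apiserverAddress} \
        --certificate-authority=${top.caFile} \
        --client-certificate=${top.pki.certs.clusterAdmin.cert} \
        --client-key=${top.pki.certs.clusterAdmin.key} \
        apply -f ${concatStringsSep " \\\n -f " files}
    '';
  };
```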
systemd.services.kube-apiserver-online = mkIf top.flannel.enable {
Why the mkIf on flannel here? What happens if I disable flannel and go with another CNI? Will dependencies on kube-apiserver-online still work?
Hmm, I see. The mkIf currently is there because the mkWaitCurl is using the certificates of the flannel client to check the apiserver /healthz path.
I decided to use the flannel certs because it seemed to me that they will always be present on the master and nodes.
So maybe on the master the mkWaitCurl could be called with the cluster-admin cert, and maybe this will work with the kubelet-client cert on the nodes (not sure if this is working because the node somehow is not present at this point, but will try).
Resolved in 7323b77#diff-8d468cb36655335308a8eb4c3999b59aR372 by either using the clusterAdmin cert or the kubelet cert.
See my comment here: 7323b77#r32635029
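In other words, the health-check certificate is chosen per role; a rough sketch (attribute names assumed, not taken from the actual diff):

```nix
# Client certificate used by mkWaitCurl against /healthz:
# clusterAdmin on the master, the kubelet client cert on worker nodes.
waitCert =
  if top.apiserver.enable
  then top.pki.certs.clusterAdmin
  else top.pki.certs.kubelet;
```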
systemd.services.kube-apiserver-online = mkIf top.flannel.enable {
description = "apiserver control plane is online";
wantedBy = [ "kube-apiserver-online.target" ];
after = [ "kube-scheduler.service" "kube-controller-manager.service" ];
I guess what's confusing me here is that there is a kube-apiserver-online.service and a kube-apiserver-online.target. I think (maybe, if I understood the intention correctly) the latter should be renamed to kube-control-plane-online.target, and that target should contain all the master components: kube-apiserver, kube-controller-manager, kube-scheduler, kube-addon-manager.
Ok, will rename it, but I'd prefer not to include the addon-manager and to start that one after node-online.target, for the reasons I gave above.
Renamed in ff91d58
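Roughly, the renamed target acts as a synchronization point for the master components; a sketch (not the exact commit):

```nix
systemd.targets.kube-control-plane-online = {
  wantedBy = [ "kubernetes.target" ];
  before = [ "kubernetes.target" ];
};
# Master components hook themselves into the target, e.g.:
systemd.services.kube-apiserver-online = {
  wantedBy = [ "kube-control-plane-online.target" ];
  before = [ "kube-control-plane-online.target" ];
};
```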
@@ -72,7 +72,7 @@ in
systemd.services.kube-addon-manager = {
description = "Kubernetes addon manager";
wantedBy = [ "kubernetes.target" ];
after = [ "kube-apiserver.service" ];
after = [ "kube-apiserver-online.target" "node-online.target" ];
The addon manager doesn't require node-online, does it? Isn't it just about submitting something to the apiserver? Whatever is submitted will be scheduled later, when one or more nodes come online.
The reason I made the addon-manager depend on node-online.target is that I wanted to make sure flannel had already configured and restarted docker at the time of requesting new containers to be started.
I thought that if they were requested earlier, they had to be destroyed and restarted to actually be connected to the newly created subnet. And if this is a non-issue, then it is just a matter of my preference not to start services that cannot function correctly and therefore just spam the logs with warnings about not being able to find the node.
I mean, the addon-manager cannot do anything if there are no nodes online. Or do I lack the deeper understanding here?
It activates now within kube-control-plane-online.target: ff91d58#diff-e6ede1705ae680d8146d5db307a2a392R74
All the addon manager does is submit stuff to the apiserver (etcd ultimately). If some of that "stuff" requires scheduling (i.e. pods), the kube-scheduler and kube-controller-manager will handle it from there. If there are no nodes online yet, the pods will simply end up in status Pending until at least one node shows up, but this is not something the addon manager should wait for. :-) All it requires is the apiserver. I see your new version of the diff implements that. Fine.
@@ -354,5 +405,16 @@ in
};
})
{
systemd.targets.kubelet = {
wantedBy = [ "node-online.target" ];
Could it be renamed to kube-node-online.target? Just node-online.target can be confusing on multi-purpose hosts. Also, I think that kube-proxy should be part of this target.
Sure, I will rename it. And for the kube-proxy, I will check.
Looks good!
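A sketch of the renamed node-level target, with kubelet (and, as suggested, possibly kube-proxy) wanted by it (not the exact commit):

```nix
systemd.targets.kube-node-online = {
  wantedBy = [ "kubernetes.target" ];
  before = [ "kubernetes.target" ];
};
systemd.targets.kubelet = {
  wantedBy = [ "kube-node-online.target" ];
  before = [ "kube-node-online.target" ];
};
# Possibly also:
systemd.services.kube-proxy.wantedBy = [ "kube-node-online.target" ];
```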
@@ -272,11 +272,30 @@ in
###### implementation
config = mkMerge [
(mkIf cfg.enable {
(mkIf cfg.enable (let
apiserverPaths = [
I fear there might be an issue regarding compatibility with easyCerts = false here? It should be allowed to run a less secure cluster with AlwaysAllow as authorizer. I think we should consider putting some of these path-lists in pki.nix, to make it possible to opt out of some of these path dependencies.
Yes, I understand. Will move all occurrences into pki.nix.
Moved the remaining path units and usage in ff382c1
Looks good. Will have to test without easyCerts to see whether everything is still compatible.
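A sketch of the opt-out idea: keep the certificate path-dependencies in pki.nix behind the easyCerts switch, so an AlwaysAllow / easyCerts = false cluster does not inherit them (the exact list of paths is an assumption):

```nix
systemd.paths.kube-apiserver = mkIf (top.apiserver.enable && top.easyCerts) {
  wantedBy = [ "kubernetes.target" ];
  pathConfig.PathExists = [
    top.caFile
    top.apiserver.tlsCertFile
    top.apiserver.tlsKeyFile
  ];
};
```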
There's also this: https://www.freedesktop.org/software/systemd/man/systemd-socket-proxyd.html which retrofits socket activation on any non-socket-activated service.
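An untested sketch of how systemd-socket-proxyd could be wired in front of a non-socket-activated component; the addresses, ports, and unit names are placeholders:

```nix
systemd.sockets.kube-apiserver-proxy = {
  wantedBy = [ "sockets.target" ];
  # The socket systemd listens on; a connection to it pulls the proxy up.
  socketConfig.ListenStream = "127.0.0.1:6444";
};
systemd.services.kube-apiserver-proxy = {
  requires = [ "kube-apiserver.service" "kube-apiserver-proxy.socket" ];
  after = [ "kube-apiserver.service" "kube-apiserver-proxy.socket" ];
  # Forward accepted connections to the real (non-socket-activated) apiserver.
  serviceConfig.ExecStart =
    "${pkgs.systemd}/lib/systemd/systemd-socket-proxyd 127.0.0.1:6443";
};
```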
@@ -73,6 +73,18 @@ let
};
};
mkWaitCurl = { address ? cfg.apiserverAddress, sleep ? 2, path ? "", args ? "-o /dev/null",
Just noticed by accident: when deploying a different deployment with nixops, nix complains about cfg.apiserverAddress being used but not defined. Probably because it has no default value in the options. Have to work around that somehow.
@calbrecht apiserverAddress is mkDefault'ed to the value of masterAddress here: https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/services/cluster/kubernetes/default.nix#L286 . The reason is the assumption that you have only one master node, but it should still be possible to override this. In our production cluster, for example, we have 3 master nodes and a load-balanced address that divides requests between them all. In that case, we want a setup where masterAddress != apiserverAddress.
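As a usage example of the override described above (hostnames are made up): masterAddress stays the single master / cfssl host, while apiserverAddress points at a load-balanced endpoint in front of several masters.

```nix
services.kubernetes = {
  masterAddress = "master1.example.org";
  # Overrides the mkDefault that would otherwise point at masterAddress.
  apiserverAddress = "https://k8s-api.example.org:6443";
};
```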
@johanot Ok, so you have the cfssl service running on the masterAddress, right?
However, the issue here was that I defined the kube-control-plane-online service and target within apiserver.nix but outside of the mkIf cfg.enabled set, which probably led to evaluation of this function declaration in every other nix-build. Fixed this in 154356d.
Looks good!
@calbrecht Sorry I went kind of offline on this one and didn't help seeing it through. Anyway, now I'm "awake" again :-) It seems, unfortunately, that the DNS and RBAC test cases stopped working on master after merging this. Looks like
@johanot, what errors do you mean? When reverting the kubernetes version upgrades back to 1.13.4, the tests run successfully. Does this have to do with restoring the --user and --password flags in the changes from 1.13.4 -> 1.13.5?
@calbrecht Looks like your assumption is right. It does indeed break with 1.13.4 -> 1.13.5. It is sometimes baffling to me what is apparently allowed in a minor kubernetes release. I mean, it clearly breaks the API, but that's somehow ok. :) Just confirms to me that we need to be careful even with a revision bump of kubernetes. The "error" I was talking about is that it looks like the client cert is
@johanot It would be great if you could look into it, because I am unsure how to proceed: whether to generate a new cert or take an existing one, and whether it's ok to rely on the pki within default.nix, etc.
…factor" This reverts commit 7dc6e77, reversing changes made to bce47ea. Motivation for the revert in NixOS#67563
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/kubernetes-network-malfunction-after-upgrading-to-19-09/4620/1
…factor" This reverts commit 7dc6e77, reversing changes made to bce47ea. Motivation for the revert in NixOS#67563 (cherry picked from commit 00975b5)
Motivation for this change
I had a tough time getting to know Kubernetes with all the errors in the logs, caused by services dying because of missing required certificates (which were not generated yet) or simply because services depended on other services that were not started yet.
This PR makes all the kubernetes-related services start in order, even across machines, and ensures that the requirements of each service are met before it starts. The kubernetes tests previously produced about 28K lines of log and ran in about 4m 20s; they are now down to 13K lines and 3m 40s. All services start cleanly on the first boot.
Please see the commit messages for further information about the changes.