Scenario:
- Create a Kubernetes cluster with kubeadm and default settings on GCP
- Install a CNI plugin with default settings in the cluster
- Check the network connectivities in the cluster
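The connectivity checks in the sections below were done roughly along these lines; a minimal sketch, assuming a throwaway busybox test Pod (the target IP addresses and Service names are illustrative placeholders):

```bash
# Create a throwaway test Pod in the Pod network
kubectl run test --image=busybox --restart=Never -- sleep 3600

# Pod to Pod / Pod to node: ping a target IP address from inside the test Pod
kubectl exec test -- ping -c 1 <POD_OR_NODE_IP>

# Pod to Service IP address and Pod to Service DNS name
kubectl exec test -- wget -qO- <SERVICE_IP>
kubectl exec test -- wget -qO- <SERVICE_NAME>

# DNS lookup from inside the Pod
kubectl exec test -- nslookup kubernetes.default
```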
Compared CNI plugins (as of March 2020):
CNI plugin | Works in cloud (1) | Default Pod network CIDR | Default Pod network CIDR defined in YAML (2) | Works if podCIDR not allocated (3) | Can override Pod network CIDR with kubeadm (4) |
---|---|---|---|---|---|
Calico | ❌ No | 192.168.0.0/16 | ✅ Yes | ✅ Yes | ❌ No |
Flannel | ✅ Yes | 10.244.0.0/16 | ✅ Yes | ❌ No | ❌ No |
Weave Net | ✅ Yes | 10.32.0.0/12 | ❌ No | ✅ Yes | ❌ No |
Cilium | ✅ Yes | 10.217.0.0/16 | ❌ No | ✅ Yes | ✅ Yes |
Contiv | ❌ No | 10.1.0.0/16 | ✅ Yes | ✅ Yes | ❌ No |
kube-router | ❌ No | - | ❌ No | ❌ No | ✅ Yes |
Footnotes:
1. If no, it usually means that inter-node Pod communication doesn't work: no messages can be sent to a Pod on a different node (sending to a Pod on the same node works). This is because the CNI plugin probably assumes direct Layer 2 connectivity between nodes, which is not the case in the cloud.
2. If yes, the default Pod network CIDR can be customised by editing the deployment manifest of the CNI plugin.
3. If no, the CNI plugin Pods fail to start up if the `node.spec.podCIDR` field is not set, that is, if the controller manager didn't automatically allocate Pod subnet CIDRs to the nodes.
4. If yes, it's enough to define the Pod network CIDR by specifying it to kubeadm (which will pass it to the `--cluster-cidr` flag of kube-controller-manager); the CNI plugin will reuse the custom Pod subnet CIDRs assigned to each node. If no, this doesn't work, and the CNI plugin keeps using its default Pod network CIDR, even if a different CIDR was specified to kubeadm. More steps have to be taken to customise the Pod network CIDR, such as editing the CNI plugin deployment YAML file or binary.
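Regarding footnotes (3) and (4): what kubeadm actually configured can be checked with commands like the following; a hedged sketch, assuming a standard kubeadm control plane:

```bash
# CIDR passed to kube-controller-manager (a static Pod created by kubeadm)
kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | grep cluster-cidr

# Pod subnet CIDR allocated to each node (empty if none was allocated)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
```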
Calico: unencapsulated, supports NetworkPolicies.
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
See documentation.
- Pod to itself: ✅
- Pod to Pod on same node: ✅
- Pod to own node: ✅
- Pod to Pod on different node: ❌
- Pod to different node: ❌
- Pod to Service IP address: ✅ (if backend Pod is on same node as client Pod)
- Pod to Service DNS name: ❌
- DNS lookup: ❌
- Does not require pre-existing Pod subnet CIDR allocation to nodes
- If `node.spec.podCIDR` is set, Calico does not use it and uses a different Pod subnet CIDR for each node (see the sketch after this list)
- Creates a `calico-node` Pod on each node (host network)
- Additionally, creates a `calico-kube-controllers` Pod on one of the nodes (Pod network)
- Uses Pod network CIDR 192.168.0.0/16 by default
- Assigns each node a /24 subnet CIDR
- CoreDNS Pods are running on the master node, but Pods can't do DNS lookups because Pod-to-Pod communication across nodes doesn't work
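The mismatch between the kubeadm-allocated `node.spec.podCIDR` and the addresses Calico actually hands out can be seen by comparing the two; a hedged sketch:

```bash
# Pod subnet CIDR allocated to each node by the controller manager (if any)
kubectl describe nodes | grep -i podcidr

# IP addresses Calico actually assigned to the Pods
kubectl get pods --all-namespaces -o wide
```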
Flannel: encapsulated, does not support NetworkPolicies.
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
See documentation.
EDIT 2020-04-29: confirmed this behaviour on a three-node cluster on AWS created with terraform-kubeadm-aws (Kubernetes 1.18).
Nodes become `Ready`. However, the `kube-flannel-ds` Pods on each node are in a run-crash loop, and the `coredns` Pods are stuck in `ContainerCreating`.
The `kube-flannel-ds` Pods report the following error message in their log output:
Error registering network: failed to acquire lease: node "<NODE>" pod cidr not assigned
This is because there's no Pod subnet CIDR assigned to the nodes.
Automatic assignment of Pod subnet CIDRs to the nodes happens only when you set the `networking.podSubnet` field in the `kubeadm init` configuration file or set the `--pod-network-cidr` flag of `kubeadm init`.
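For reference, the configuration file variant looks roughly like this; a minimal sketch (the API version matches kubeadm 1.18 as used here, and the CIDR value is illustrative):

```bash
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
networking:
  podSubnet: 10.244.0.0/16
EOF

kubeadm init --config kubeadm-config.yaml
```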
To try to fix this on the existing cluster, you can manually assign a Pod subnet CIDR to each node:
kubectl edit node <NODE>
and change the `spec` field to contain the following:
spec:
podCIDR: 200.200.0.0/24
Do this with a different CIDR for each node, for example, 200.200.0.0/24, 200.200.1.0/24, and 200.200.2.0/24.
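Instead of editing each node interactively, the same assignment can be scripted; a hedged sketch using `kubectl patch` (the node names are illustrative, and `podCIDR` can only be set while it is still empty):

```bash
kubectl patch node worker-1 -p '{"spec":{"podCIDR":"200.200.0.0/24"}}'
kubectl patch node worker-2 -p '{"spec":{"podCIDR":"200.200.1.0/24"}}'
kubectl patch node worker-3 -p '{"spec":{"podCIDR":"200.200.2.0/24"}}'
```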
Then, delete the individual `kube-flannel-ds` Pods to restart them (or wait until they finish their `CrashLoopBackOff` and restart by themselves).
After that, the `kube-flannel-ds` Pods should be running correctly.
However, this doesn't make it work. The logs of the `kube-flannel-ds` Pods now show:
Current network or subnet (10.244.0.0/16, 200.200.0.0/24) is not equal to previous one (0.0.0.0/0, 0.0.0.0/0), trying to recycle old iptables rules
And the Pods delete and recreate some iptables rules.
However, this doesn't make it work, as shown in the connectivities:
- Pod to itself: ✅
- Pod to Pod on same node: ❌
- Pod to own node: ✅
- Pod to Pod on different node: ❌
- Pod to different node: ❌
- Pod to Service IP address: ❌
- Pod to Service DNS name: ❌
- DNS lookup: ❌
- Pod on host network to own node: ✅
- Pod on host network to different node: ✅
- Pod on host network to Pod on same node: ✅
- Pod on host network to Pod on different node: ❌
EDIT 2020-04-29: confirmed this behaviour on a three-node cluster on AWS created with terraform-kubeadm-aws (Kubernetes 1.18).
When using an arbitrary Pod network CIDR (e.g. 200.200.0.0/16) at cluster creation time, the following happens.
The Pod network CIDR can be specified in the `networking.podSubnet` field in the `kubeadm init` config file or with the `--pod-network-cidr` flag of `kubeadm init`. Kubernetes will then automatically assign a Pod subnet CIDR to each node.
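With the flag variant, the cluster for this test would be created roughly like this; a hedged sketch (the CIDR is the arbitrary one from the example above):

```bash
# Create the control plane with a custom Pod network CIDR
kubeadm init --pod-network-cidr=200.200.0.0/16
```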
All nodes get `Ready`, all Flannel Pods run, and the `coredns` Pods get IP addresses from the configured Pod network range and run, but are not ready (`0/1`).
New workload Pods also get IP addresses from the configured Pod network range and run. However, the connectivities are exactly as in the case above where no Pod network CIDR is specified at all.
If you choose precisely 10.244.0.0/16 as the Pod network CIDR, the following happens.
All Flannel and `coredns` Pods become running and ready. All connectivities work:
- Pod to itself: ✅
- Pod to Pod on same node: ✅
- Pod to own node: ✅
- Pod to Pod on different node: ✅
- Pod to different node: ✅
- Pod to Service IP address: ✅
- Pod to Service DNS name: ✅
- DNS lookup: ✅
- Pod on host network to own node: ✅
- Pod on host network to different node: ✅
- Pod on host network to Pod on same node: ✅
- Pod on host network to Pod on different node: ✅
- With Flannel, you must set the Pod network CIDR when you create the cluster, and it must be 10.244.0.0/16
- With all other configurations, Flannel will not work
- Creates a `kube-flannel` Pod on each node
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
See documentation.
- Pod to itself: ✅
- Pod to Pod on same node: ✅
- Pod to own node: ✅
- Pod to Pod on different node: ✅
- Pod to different node: ✅
- Pod to Service IP address: ✅
- Pod to Service DNS name: ✅
- DNS lookup: ✅
- Does not require pre-existing Pod subnet CIDR allocation to nodes
- If you have pre-existing Pod subnet CIDR allocations, Weave Net does not use them but chooses its own Pod subnet CIDR (from a different Pod network CIDR than you specified) for each node
- Assigns a /24 Pod subnet CIDR to each node
- Runs a `weave-net` Pod on each node
Cilium:
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/1.7.0/install/kubernetes/quick-install.yaml
See documentation.
- Pod to itself: ✅
- Pod to Pod on same node: ✅
- Pod to own node: ✅
- Pod to Pod on different node: ✅
- Pod to different node: ✅
- Pod to Service IP address: ✅
- Pod to Service DNS name: ✅
- DNS lookup: ✅
- Pod on host network to own node: ✅
- Pod on host network to different node: ✅
- Pod on host network to Pod on same node: ✅
- Pod on host network to Pod on different node: ✅
- Does not require pre-existing Pod subnet CIDR allocation to nodes
- If you don't specify a Pod network CIDR, Cilium uses 10.217.0.0/16 by default
- If you have pre-existing Pod subnet CIDR allocations, Cilium uses them as expected (see the sketch after this list)
- Runs a `cilium` Pod on each node
- Runs a `cilium-operator` Pod on one of the nodes (may be a worker node)
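Which Pod CIDR a Cilium agent is actually using on a node can be checked by querying the agent; a hedged sketch (assumes the `k8s-app=cilium` label from the quick-install manifest, and the Pod name is a placeholder):

```bash
# Find a Cilium agent Pod
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Inspect the agent's status, including its IPAM allocations
kubectl -n kube-system exec <CILIUM_POD> -- cilium status
```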
Contiv:
kubectl apply -f https://raw.githubusercontent.com/contiv/vpp/master/k8s/contiv-vpp.yaml
See documentation.
After deployment, the nodes remain `NotReady`, and the `contiv-vswitch` Pods that Contiv deploys to each node, as well as the `coredns` Pods, remain in `Pending`.
The reason is that Contiv requires hugepages to be configured on the nodes.
To fix this issue, go to each node, and do the following:
sysctl -w vm.nr_hugepages=512
echo "vm.nr_hugepages=512" >> /etc/sysctl.conf
service kubelet restart
If you use Ansible, execute these commands on all hosts with:
ansible all -b -m shell -a 'sysctl -w vm.nr_hugepages=512; echo "vm.nr_hugepages=512" >> /etc/sysctl.conf; service kubelet restart'
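To check on a node that the hugepages were actually reserved, something like the following can be run on the node itself:

```bash
grep HugePages_Total /proc/meminfo
```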
After that, the nodes should become `Ready` and all the Pods should be scheduled and run.
However, the connectivities are like this:
- Pod to itself: ✅
- Pod to Pod on same node: ✅
- Pod to own node: ✅
- Pod to Pod on different node: ❌
- Pod to different node: ❌
- Pod to Service IP address: ✅ (if backend Pod is on same node as client Pod)
- Pod to Service DNS name: ❌
- DNS lookup: ❌
- Pod on host network to own node: ✅
- Pod on host network to different node: ✅
- Pod on host network to Pod on same node: ✅
- Pod on host network to Pod on different node: ❌
When using an arbitrary Pod network CIDR (e.g. 200.200.0.0/16), the connectivities are as above.
According to the documentation, 10.1.0.0/16 must be used for the Pod network CIDR with kubeadm.
However, even with this Pod network CIDR, the connectivities are still as above.
- Does not require pre-existing Pod subnet CIDR allocation to nodes
- If you specify a Pod network CIDR, Contiv does not use it. It uses the Pod network CIDR 10.1.0.0/16 instead.
- This is hardcoded in the `contiv-vpp.yaml` file
- Deploys a `contiv-vswitch` Pod to each node, as well as a `contiv-ksr`, `contiv-etcd`, and `contiv-crd` Pod to the master node
- Creates a single route for the whole Pod network on each node to a `vpp1` network interface
- This network interface seems to be an FD.io/VPP (Vector Packet Processor) vSwitch, which is a fast, scalable layer 2-4 multi-platform network stack
- See Contiv/VPP overview
- The way the VPP vSwitch connects nodes seems to be similar to Calico, that is, it doesn't work by default in a cloud environment
kube-router:
kubectl apply -f https://raw.githubusercontent.com/cloudnativelabs/kube-router/master/daemonset/kubeadm-kuberouter.yaml
See documentation.
Upon installation, the nodes become `Ready`, the `coredns` Pods are stuck in `ContainerCreating`, and the `kube-router` Pods keep crashing with a log output of:
Failed to get pod CIDR from node spec. kube-router relies on kube-controller-manager to allocate pod CIDR for the node or an annotation `kube-router.io/pod-cidr`
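The error message also points to a `kube-router.io/pod-cidr` node annotation as an alternative to `node.spec.podCIDR`; a hedged sketch of that variant, which was not tested here (node name and CIDR are illustrative):

```bash
kubectl annotate node worker-1 kube-router.io/pod-cidr=200.200.0.0/24
```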
If you try to fix it by adding a `node.spec.podCIDR` field to all nodes (e.g. with `kubectl edit`), all Pods start running, and new Pods also start up correctly and get IP addresses from the assigned Pod subnet CIDRs. However, the connectivities don't work:
- Pod to itself: ✅
- Pod to Pod on same node: ✅
- Pod to own node: ✅
- Pod to Pod on different node: ❌
- Pod to different node: ❌
- Pod to Service IP address: ✅ (if backend Pod is on same node as client Pod)
- Pod to Service DNS name: ❌
- DNS lookup: ❌
- Pod on host network to own node: ✅
- Pod on host network to different node: ✅
- Pod on host network to Pod on same node: ✅
- Pod on host network to Pod on different node: ❌
When specifying a Pod network CIDR to kubeadm at cluster creation, all Pods start up and run correctly (however, the `coredns` Pods are not ready). Pods in the Pod network get IP addresses from the allocated Pod subnet CIDRs.
However, the connectivities are exactly as above.
- Creates a `kube-router` Pod on each node
- Requires pre-allocated Pod subnet CIDRs for each node (otherwise, fails to start up)
- If there are pre-allocated Pod subnet CIDRs, it uses them
- Creates routes for each Pod subnet on each node. The routes for a Pod subnet on a different node go via a `tun` virtual network interface.
There are two different types of CNI plugins, encapsulated and unencapsulated:
- Encapsulated: wrap the Pod network packets into native host network packets and send them between hosts like normal host network packets. All the Pod networking logic is hidden in the CNI plugin agent on each node and the hosts are unaware of any Pod networking.
- Unencapsulated: the hosts and network infrastructure are directly configured to handle and route Pod network packets.
Unencapsulated approaches need to integrate with the underlying infrastructure, i.e. they must be targeted at a specific type of infrastructure. Most are targeted at traditional networks with direct Layer 2 connectivity between nodes. These CNI plugins don't work in cloud environments like GCP. However, there are some unencapsulated approaches that are targeted at specific cloud environments, such as the GCE backend for Flannel.
Encapsulated approaches work on all infrastructure, including cloud environments, because they masquerade cross-node Pod communication as normal host-to-host communication.
So that's the reason that the default modes of Flannel, Weave Net, and Cilium work on GCP, but Calico, Contiv, and kube-router don't.
So, basically, if your cluster is in the cloud, you need to choose either an encapsulated CNI plugin or an unencapsulated one that is targeted at that specific cloud.
The advantage of encapsulated approaches is that they work on all infrastructure and don't need to modify it. The advantage of unencapsulated approaches is that they have less overhead and integrate better with existing debugging tools, as there's no additional layer of abstraction.