Fail to add HA masters in KubeSphere v3.0.0 #4066

Closed
whaifang opened this issue Jul 16, 2021 · 11 comments
Labels: stale (No recent activity in a long period)

Comments


whaifang commented Jul 16, 2021

Hi Team.
Our original cluster, based on kubesphere-all-v2.1.1, is as follows:

  • master0
  • node1
  • node2

We upgraded to v3.0.0 following https://v3-0.docs.kubesphere.io/docs/upgrade/upgrade-with-kubekey/

Then we added master1 and master2 on v3.0.0 to set up HA, following https://v3-0.docs.kubesphere.io/docs/installing-on-linux/cluster-operation/add-new-nodes/:

  • master0
  • master1
  • master2
  • node1
  • node2

config.yaml:

spec:
  hosts:
  - {name: master0, address: 9.30.222.112, internalAddress: 9.30.222.112, user: root, password: ******}
  - {name: master1, address: 9.30.181.110, internalAddress: 9.30.181.110, user: root, password: ******}
  - {name: master2, address: 9.112.254.38, internalAddress: 9.112.254.38, user: root, password: ******}
  - {name: node1, address: 9.30.181.117, internalAddress: 9.30.181.117, user: root, password: ******}
  - {name: node2, address: 9.30.223.102, internalAddress: 9.30.223.102, user: root, password: ******}
  roleGroups:
    etcd:
    - master0
    - master1
    - master2
    master:
    - master0
    - master1
    - master2
    worker:
    - node1
    - node2
  controlPlaneEndpoint:
    domain: lb.kubesphere.local
    address: 9.112.254.207
    port: 6443

But this fails. After we execute ./kk add nodes -f config.yaml:
[root@master0 conf]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master0 Ready master 4d1h v1.17.9 9.30.222.112 CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://18.9.7
master1 NotReady master 3d21h v1.17.9 9.30.181.110 CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://18.9.7
master2 NotReady master 3d21h v1.17.9 9.112.254.38 CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
node1 Ready worker 4d1h v1.17.9 9.30.181.117 CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://18.9.7
node2 Ready worker 4d1h v1.17.9 9.30.223.102 CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://18.9.7

master1 and master2 are NotReady for us:
[root@master1 ~]# systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Wed 2021-07-14 23:38:32 PDT; 20h ago
Docs: http://kubernetes.io/docs/
Main PID: 19529 (kubelet)
CGroup: /system.slice/kubelet.service
└─19529 /usr/local/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=kubesphere/pause:3.1 --node-ip=9.30.181.110 --hostname-override=master1

Jul 15 20:21:24 master1 kubelet[19529]: E0715 20:21:24.511205 19529 kubelet.go:2264] node "master1" not found
Jul 15 20:21:24 master1 kubelet[19529]: E0715 20:21:24.599211 19529 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://9.30.222.112:6443/api/v1/nodes?fieldSelector=metadata.name%3Dmaster1&limit=500&resourceVersion=0: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
Jul 15 20:21:24 master1 kubelet[19529]: E0715 20:21:24.611363 19529 kubelet.go:2264] node "master1" not found
Jul 15 20:21:24 master1 kubelet[19529]: E0715 20:21:24.711571 19529 kubelet.go:2264] node "master1" not found
Jul 15 20:21:24 master1 kubelet[19529]: E0715 20:21:24.799528 19529 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:449: Failed to list *v1.Service: Get https://9.30.222.112:6443/api/v1/services?limit=500&resourceVersion=0: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
Jul 15 20:21:24 master1 kubelet[19529]: E0715 20:21:24.811756 19529 kubelet.go:2264] node "master1" not found
Jul 15 20:21:24 master1 kubelet[19529]: E0715 20:21:24.911945 19529 kubelet.go:2264] node "master1" not found
Jul 15 20:21:25 master1 kubelet[19529]: E0715 20:21:25.012102 19529 kubelet.go:2264] node "master1" not found
Jul 15 20:21:25 master1 kubelet[19529]: E0715 20:21:25.112244 19529 kubelet.go:2264] node "master1" not found
Jul 15 20:21:25 master1 kubelet[19529]: E0715 20:21:25.199911 19529 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://9.30.222.112:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
Jul 15 20:21:25 master1 kubelet[19529]: E0715 20:21:25.212422 19529 kubelet.go:2264] node "master1" not found

[root@master0 ~]# systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2021-07-15 01:15:42 PDT; 19h ago
Docs: http://kubernetes.io/docs/
Main PID: 5625 (kubelet)
Tasks: 27
Memory: 82.8M
CGroup: /system.slice/kubelet.service
└─5625 /usr/local/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=kubesphere/pause:3.1 --node-ip=9.30.222.112 --hostname-override=master0

Jul 15 20:21:28 master0 kubelet[5625]: E0715 20:21:28.027224 5625 pod_workers.go:191] Error syncing pod f6b1e75f-8491-47f0-8b9e-463893f25770 ("redis-6fd6c6d6f9-44ktp_kubesphere-system(f6b1e75f-8491-47f0-8b9e-463893f25770)"), skipping: unmounted volumes=[redis-pvc], unattached volumes=[redis-pvc default-token-l7tkq]: timed out waiting for the condition
Jul 15 20:21:33 master0 kubelet[5625]: E0715 20:21:33.027318 5625 kubelet.go:1681] Unable to attach or mount volumes for pod "openldap-0_kubesphere-system(72e73a3f-8188-4005-a4d9-a5184e9c99f8)": unmounted volumes=[openldap-pvc], unattached volumes=[openldap-pvc default-token-l7tkq]: timed out waiting for the condition; skipping pod
Jul 15 20:21:33 master0 kubelet[5625]: E0715 20:21:33.027354 5625 pod_workers.go:191] Error syncing pod 72e73a3f-8188-4005-a4d9-a5184e9c99f8 ("openldap-0_kubesphere-system(72e73a3f-8188-4005-a4d9-a5184e9c99f8)"), skipping: unmounted volumes=[openldap-pvc], unattached volumes=[openldap-pvc default-token-l7tkq]: timed out waiting for the condition
Jul 15 20:21:39 master0 kubelet[5625]: E0715 20:21:39.027303 5625 pod_workers.go:191] Error syncing pod 3c9b52a0-edc9-470c-9c68-ab744a33a2c6 ("ks-controller-manager-5b7f8cbd6c-h75zk_kubesphere-system(3c9b52a0-edc9-470c-9c68-ab744a33a2c6)"), skipping: failed to "StartContainer" for "ks-controller-manager" with CrashLoopBackOff: "back-off 5m0s restarting failed container=ks-controller-manager pod=ks-controller-manager-5b7f8cbd6c-h75zk_kubesphere-system(3c9b52a0-edc9-470c-9c68-ab744a33a2c6)"
Jul 15 20:21:40 master0 kubelet[5625]: E0715 20:21:40.027521 5625 pod_workers.go:191] Error syncing pod b5cf6111-08d1-4747-8f65-38600059bd00 ("ks-apiserver-869f56b578-98fdg_kubesphere-system(b5cf6111-08d1-4747-8f65-38600059bd00)"), skipping: failed to "StartContainer" for "ks-apiserver" with CrashLoopBackOff: "back-off 5m0s restarting failed container=ks-apiserver pod=ks-apiserver-869f56b578-98fdg_kubesphere-system(b5cf6111-08d1-4747-8f65-38600059bd00)"
Jul 15 20:21:52 master0 kubelet[5625]: E0715 20:21:52.027367 5625 pod_workers.go:191] Error syncing pod b5cf6111-08d1-4747-8f65-38600059bd00 ("ks-apiserver-869f56b578-98fdg_kubesphere-system(b5cf6111-08d1-4747-8f65-38600059bd00)"), skipping: failed to "StartContainer" for "ks-apiserver" with CrashLoopBackOff: "back-off 5m0s restarting failed container=ks-apiserver pod=ks-apiserver-869f56b578-98fdg_kubesphere-system(b5cf6111-08d1-4747-8f65-38600059bd00)"
Jul 15 20:21:52 master0 kubelet[5625]: E0715 20:21:52.028017 5625 pod_workers.go:191] Error syncing pod 3c9b52a0-edc9-470c-9c68-ab744a33a2c6 ("ks-controller-manager-5b7f8cbd6c-h75zk_kubesphere-system(3c9b52a0-edc9-470c-9c68-ab744a33a2c6)"), skipping: failed to "StartContainer" for "ks-controller-manager" with CrashLoopBackOff: "back-off 5m0s restarting failed container=ks-controller-manager pod=ks-controller-manager-5b7f8cbd6c-h75zk_kubesphere-system(3c9b52a0-edc9-470c-9c68-ab744a33a2c6)"
Jul 15 20:21:54 master0 kubelet[5625]: W0715 20:21:54.219451 5625 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/659c68b1-d99a-48e3-9bf6-d8a04d082577/volumes/kubernetes.io~secret/dns-autoscaler-token-pr4p5 and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see kubernetes/kubernetes#69699
Jul 15 20:22:03 master0 kubelet[5625]: E0715 20:22:03.027313 5625 pod_workers.go:191] Error syncing pod 3c9b52a0-edc9-470c-9c68-ab744a33a2c6 ("ks-controller-manager-5b7f8cbd6c-h75zk_kubesphere-system(3c9b52a0-edc9-470c-9c68-ab744a33a2c6)"), skipping: failed to "StartContainer" for "ks-controller-manager" with CrashLoopBackOff: "back-off 5m0s restarting failed container=ks-controller-manager pod=ks-controller-manager-5b7f8cbd6c-h75zk_kubesphere-system(3c9b52a0-edc9-470c-9c68-ab744a33a2c6)"
Jul 15 20:22:05 master0 kubelet[5625]: E0715 20:22:05.027315 5625 pod_workers.go:191] Error syncing pod b5cf6111-08d1-4747-8f65-38600059bd00 ("ks-apiserver-869f56b578-98fdg_kubesphere-system(b5cf6111-08d1-4747-8f65-38600059bd00)"), skipping: failed to "StartContainer" for "ks-apiserver" with CrashLoopBackOff: "back-off 5m0s restarting failed container=ks-apiserver pod=ks-apiserver-869f56b578-98fdg_kubesphere-system(b5cf6111-08d1-4747-8f65-38600059bd00)"

These pods are in CrashLoopBackOff:
kube-system kube-apiserver-master1 0/1 CrashLoopBackOff 729 3d22h
kube-system openebs-localpv-provisioner-77fbd6858d-xk5w6 0/1 CrashLoopBackOff 203 4d1h
kube-system openebs-ndm-5tp84 0/1 CrashLoopBackOff 203 4d1h
kube-system openebs-ndm-operator-59c75c96fc-jpnt9 0/1 CrashLoopBackOff 204 4d1h
kube-system openebs-ndm-sndrv 0/1 CrashLoopBackOff 204 4d1h
kubesphere-monitoring-system kube-state-metrics-5c466fc7b6-lwbf8 2/3 CrashLoopBackOff 219 3d22h

kubesphere-monitoring-system notification-manager-deployment-7ff95b7544-r7n87 0/1 CrashLoopBackOff 219 3d22h
kubesphere-monitoring-system notification-manager-deployment-7ff95b7544-vq78c 0/1 CrashLoopBackOff 219 3d22h
kubesphere-monitoring-system notification-manager-operator-5cbb58b756-xbk8l 1/2 Error 219 3d22h
kubesphere-monitoring-system prometheus-operator-78c5cdbc8f-sbdqb 1/2 CrashLoopBackOff 219 3d22h
kubesphere-system ks-apiserver-869f56b578-98fdg 0/1 CrashLoopBackOff 215 3d22h
kubesphere-system ks-controller-manager-5b7f8cbd6c-h75zk 0/1 CrashLoopBackOff 215 3d22h
kubesphere-system ks-installer-85854b8c8-6l8gh 0/1 CrashLoopBackOff 202 3d22h
kubesphere-system ks-upgrade-wlcqq 0/1 Completed 0 3d22h
kubesphere-system openldap-0 0/1 ContainerCreating 2 4d1h
kubesphere-system redis-6fd6c6d6f9-44ktp 0/1 ContainerCreating 2 4d1h

Why did we encounter so many errors when setting up HA masters with the kk command in v3.0.0?
Have you tested this scenario successfully?


whaifang commented Jul 19, 2021

Hi Team.

Can anyone help with this?
I think the HA setup process did not complete: the folder /etc/cni/net.d was never set up on master1 and master2. We see the error below, and /etc/cni/net.d is empty on both nodes (a quick check is sketched after the log).

[root@master2 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Wed 2021-07-14 23:38:58 PDT; 3 days ago
Docs: http://kubernetes.io/docs/
Main PID: 11569 (kubelet)
Tasks: 18
Memory: 69.5M
CGroup: /system.slice/kubelet.service
└─11569 /usr/local/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs...

Jul 18 18:07:39 master2 kubelet[11569]: W0718 18:07:39.561495 11569 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
Jul 18 18:07:39 master2 kubelet[11569]: E0718 18:07:39.599012 11569 kubelet.go:2264] node "master2" not found
Jul 18 18:07:39 master2 kubelet[11569]: E0718 18:07:39.699189 11569 kubelet.go:2264] node "master2" not found
Jul 18 18:07:39 master2 kubelet[11569]: E0718 18:07:39.799517 11569 kubelet.go:2264] node "master2" not found
Jul 18 18:07:39 master2 kubelet[11569]: E0718 18:07:39.899748 11569 kubelet.go:2264] node "master2" not found
Jul 18 18:07:39 master2 kubelet[11569]: E0718 18:07:39.900884 11569 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://9.30.222.112:6443/api/v1/pods?...
Jul 18 18:07:40 master2 kubelet[11569]: E0718 18:07:39.999981 11569 kubelet.go:2264] node "master2" not found
Jul 18 18:07:40 master2 kubelet[11569]: I0718 18:07:40.024850 11569 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
Jul 18 18:07:40 master2 kubelet[11569]: I0718 18:07:40.089418 11569 kubelet_node_status.go:70] Attempting to register node master2
Jul 18 18:07:40 master2 kubelet[11569]: E0718 18:07:40.100169 11569 kubelet.go:2264] node "master2" not found
Hint: Some lines were ellipsized, use -l to show in full.
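For reference, a minimal way to confirm the missing CNI configuration on the new masters (a diagnostic sketch, not from kk itself; it assumes the default Calico plugin that KubeKey installs):

# Compare the CNI config directory on the healthy master with the new ones.
# On master0 (Ready) this should contain a Calico conflist; on master1/master2 it is empty.
ls -l /etc/cni/net.d/

# Note: the Calico CNI config is normally written by the calico-node pod's install-cni
# container, so the empty directory is likely a symptom of the node failing to register
# (the x509 / "node not found" errors above) rather than the root cause.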

But the etcd cluster is working:
[root@master0 net.d]# etcdctl --endpoints=https://9.30.222.112:2379 --ca-file=/etc/ssl/etcd/ssl/ca.pem --cert-file=/etc/ssl/etcd/ssl/node-master0.pem --key-file=/etc/ssl/etcd/ssl/node-master0-key.pem member list
b10d53fc47158ae: name=etcd3 peerURLs=https://9.112.254.38:2380 clientURLs=https://9.112.254.38:2379 isLeader=false
712a76cc7438cd1d: name=etcd2 peerURLs=https://9.30.181.110:2380 clientURLs=https://9.30.181.110:2379 isLeader=true
8d9c6672b14eb247: name=etcd1 peerURLs=https://9.30.222.112:2380 clientURLs=https://9.30.222.112:2379 isLeader=false
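As a further check (a sketch reusing the same etcdctl v2 flags and certificate paths as above), the cluster health can be verified in addition to the member list:

# Confirm all three etcd members are reachable and healthy.
etcdctl --endpoints=https://9.30.222.112:2379 \
  --ca-file=/etc/ssl/etcd/ssl/ca.pem \
  --cert-file=/etc/ssl/etcd/ssl/node-master0.pem \
  --key-file=/etc/ssl/etcd/ssl/node-master0-key.pem \
  cluster-health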


RolandMa1986 commented Jul 19, 2021

@whaifang Can you paste the error log from adding the new nodes?


whaifang commented Jul 19, 2021

@RolandMa1986 We added master nodes, not worker nodes, and there is no error log when adding the new master nodes; kk reports that the process completed successfully.
The issue appears when we check the cluster after adding the master nodes.
So, have you successfully added HA masters to an existing cluster in v3.0.0?

It seems the LB configuration is not updated in ~/.kube/config; the server still refers to master0, not the LB:
server: https://9.30.222.112:6443

So I think the kk command does not complete all the steps for adding HA masters; please help confirm this.
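If kk did not rewrite it, the kubeconfig can be pointed at the load balancer manually (a sketch using the controlPlaneEndpoint from config.yaml; it assumes the apiserver certificate includes the lb.kubesphere.local SAN, otherwise kubectl will report an x509 error like the ones later in this thread):

# Back up the kubeconfig, then switch the server entry from master0 to the LB endpoint.
cp ~/.kube/config ~/.kube/config.bak
sed -i 's#server: https://9.30.222.112:6443#server: https://lb.kubesphere.local:6443#' ~/.kube/config

# Verify the API server is reachable through the load balancer.
kubectl get nodes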

RolandMa1986 (Member) commented:

What kind of LB are you using for the kube-apiserver? The config shows your LB address is 9.112.254.207, but the kubelet on master1 is trying to connect to 9.30.222.112. Can you check the /etc/hosts and /etc/kubernetes/kubelet.conf configuration? The API server addresses should match.


whaifang commented Jul 19, 2021

@RolandMa1986 Our LB server is 9.112.254.207, set up with nginx. We expected the LB configuration to be updated automatically when the kk command adds the HA masters, but it still points to master0 (9.30.222.112).
Please see the /etc/hosts and /etc/kubernetes/kubelet.conf config for master0 below.

And please note: our original cluster had only one master (master0), so we had not configured an LB server at that time.
The LB server configuration was only added when we added master1 and master2 to set up HA, so it should be updated automatically by the kk command, but it is not.

Has anyone successfully set up HA masters following the same process as mine?
If you have time, please follow my steps and you will see the error.

/etc/hosts
##kubekey hosts BEGIN
9.30.222.112 master0.cluster.local master0
9.30.181.110 master1.cluster.local master1
9.112.254.38 master2.cluster.local master2
9.30.181.117 node1.cluster.local node1
9.30.223.102 node2.cluster.local node2
9.112.254.207 lb.kubesphere.local
##kubekey hosts END

/etc/kubernetes/kubelet.conf:
server: https://9.30.222.112:6443

nginx.conf:
error_log stderr notice;

worker_processes auto;
events {
  multi_accept on;
  use epoll;
  worker_connections 1024;
}

stream {
  upstream kube_apiserver {
    least_conn;
    server 9.30.222.112:6443;
    server 9.30.181.110:6443;
    server 9.112.254.38:6443;
  }

  server {
    listen        0.0.0.0:6443;
    proxy_pass    kube_apiserver;
    proxy_timeout 10m;
    proxy_connect_timeout 1s;
  }
}

RolandMa1986 (Member) commented:

You can edit the kubelet.conf manually and use lb.kubesphere.local as the Kube-apiserver address.
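Concretely, that edit might look like the following on each affected master (a sketch based on the kubelet.conf shown above; adjust the old address if it differs on your nodes):

# Point the kubelet at the load balancer instead of master0, then restart it.
sed -i 's#server: https://9.30.222.112:6443#server: https://lb.kubesphere.local:6443#' /etc/kubernetes/kubelet.conf
systemctl restart kubelet

# Watch the node register with the API server.
journalctl -u kubelet -f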

whaifang (Author) commented:

@RolandMa1986 Hi, I have updated kubelet.conf manually to use lb.kubesphere.local as the kube-apiserver address, and the nodes are Ready now.
[screenshot]

But another error is thrown, and many system pods are still crashing:
[screenshot]

Sometimes we get the error below:
[root@master0 ~]# kubectl get pod --all-namespaces
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
[screenshots]

These two pods are still ContainerCreating:
kubesphere-system openldap-0 0/1 ContainerCreating 2 7d7h
kubesphere-system redis-6fd6c6d6f9-44ktp 0/1 ContainerCreating 2 7d7h
Sometimes it throws the error below:
[screenshots]

Some volume folders do not exist.

So what I mean is: I think the kk command does not complete all the steps for adding HA masters. Could you please check this?
We have tried many times and encountered different errors, so I am sure that if we fix these errors, another one will appear.

Has anyone successfully set up HA masters following the same process as mine?
If you have time, please follow my steps and you will see the error.
I still think the kk command misses some steps, and we cannot succeed.

whaifang (Author) commented:

@RolandMa1986 Any update on this? Hoping you have a good idea.

RolandMa1986 (Member) commented:

> @RolandMa1986 Any update on this? Hoping you have a good idea.

We haven't seen an upgrade failure case similar to yours before, so you may have to troubleshoot it yourself; we can only give you some advice and suggestions.
From the screenshots attached to the previous replies, it seems you have lost your OpenEBS local PVC data for some reason. You can try to rebuild the folder according to the error log.
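As a starting point, the missing directories can be located from the bound PVs and recreated on the node that owns them (a sketch; the exact paths must come from your own PV specs, and /var/openebs/local is only the OpenEBS LocalPV default base path):

# Find the PV bound to each failing PVC and inspect its path and node affinity.
kubectl get pv -o wide
kubectl get pv <pv-name> -o yaml    # <pv-name> is a placeholder; look for the hostPath/local path field

# On the node the PV is pinned to, recreate the directory if it was lost, for example:
mkdir -p /var/openebs/local/<pv-name>   # adjust to the path from the PV spec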


stale bot commented Oct 18, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale bot added the stale (No recent activity in a long period) label on Oct 18, 2021

stale bot commented Nov 18, 2021

This issue is being automatically closed due to inactivity.

stale bot closed this as completed on Nov 18, 2021