This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Upgrade process for Kubernetes is not working (acs-engine not ready for production workload) #2567

Closed
jalberto opened this issue Mar 30, 2018 · 38 comments

@jalberto

jalberto commented Mar 30, 2018

Is this a request for help?:
YES & a BUG REPORT

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
0.14.5

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
kubernetes 1.8.4 to 1.8.10

What happened:
running the `upgrade` command following #2062 caused lots of trouble:

  • the process failed 16 times; it finally succeeded on attempt 17
  • every problem described in #2022 ("upgrade k8s process is broke") is still happening (months after being reported)
  • my existing cluster uses a custom vnet with peering; the peering had to be deleted, the upgrade run, and the peering recreated
  • the etcd install script still tries to install 2.5.2; I modified it manually to install 2.3.7 (the latest 2.x release)
  • after etcd finally installed and the masters were rebooted, there is still no connectivity
  • each upgrade with acs-engine is an adventure and not suitable for production workflows
  • now etcd is running, but there is still no connectivity to the cluster using kubectl

What you expected to happen:
to work

How to reproduce it (as minimally and precisely as possible):
just try to upgrade an existing cluster

Anything else we need to know:
this is really critical, as my prod cluster is down right now

@jalberto
Author

Is it possible to run cloud-init manually to recreate what is missing?

@jalberto
Author

jalberto commented Mar 30, 2018

etcd logs:

Mar 30 16:32:06 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 received vote from 134af6e5d3e35861 at term 4430                                          
Mar 30 16:32:06 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 2a0965170762fd09 at term 4430      
Mar 30 16:32:06 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 91d25258d70ad868 at term 4430      
Mar 30 16:32:07 k8s-master-11577755-0 etcd[48953]: publish error: etcdserver: request timed out                                                               
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 is starting a new election at term 4430                                                   
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 became candidate at term 4431                                                             
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 received vote from 134af6e5d3e35861 at term 4431                                          
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 2a0965170762fd09 at term 4431      
Mar 30 16:32:08 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 91d25258d70ad868 at term 4431      
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 is starting a new election at term 4431                                                   
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 became candidate at term 4432                                                             
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 received vote from 134af6e5d3e35861 at term 4432                                          
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 2a0965170762fd09 at term 4432      
Mar 30 16:32:09 k8s-master-11577755-0 etcd[48953]: 134af6e5d3e35861 [logterm: 1857, index: 265391380] sent vote request to 91d25258d70ad868 at term 4432

kubelet logs:

Mar 30 17:08:49 k8s-master-11577755-0 systemd[1]: Started Kubelet.
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: Flag --non-masquerade-cidr has been deprecated, will be removed in a future version
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: Flag --keep-terminated-pod-volumes has been deprecated, will be removed in a future version
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: Flag --non-masquerade-cidr has been deprecated, will be removed in a future version
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: I0330 17:08:50.064943    5595 feature_gate.go:162] feature gates: map[]
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: I0330 17:08:50.065489    5595 mount_linux.go:196] Detected OS without systemd
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: I0330 17:08:50.065523    5595 client.go:75] Connecting to docker on unix:///var/run/docker.sock
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: I0330 17:08:50.065572    5595 client.go:95] Start docker client with request timeout=2m0s
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: W0330 17:08:50.066686    5595 cni.go:196] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 30 17:08:50 k8s-master-11577755-0 docker[5567]: Error: failed to run Kubelet: could not init cloud provider "azure": No credentials provided for AAD application

@jalberto jalberto changed the title Upgrade process for Kubernetes is not working (even if there are plenty of open issues with workarounds for months) Upgrade process for Kubernetes is not working Mar 30, 2018
@jalberto
Author

so I changed every reference to "2.5.2" to "2.3.7" in my _output dir and tried to upgrade to k8s 1.9, and surprise: etcd 2.5.2 is still being installed on the master

@jackfrancis
Member

Hi @jalberto thanks for your courage. What's the first thing we can look at?

@jalberto
Author

jalberto commented Mar 30, 2018

@jackfrancis thanks for your time

IMHO:

  1. don't mark a VM as "upgraded" if any component is not 100% working (so we can repeat the command until it works)
  2. find out why acs-engine insists on installing 2.5.2 (that version doesn't exist; the latest in the 2.x series is 2.3.7)
  3. figure out how a "reset" can be forced on a node even if it's already marked as upgraded (to mitigate 1)

@jackfrancis
Member

#1 and #3 sound like longer-term improvements to the upgrade implementation, so thanks for your patience while those are not in place (and the burden remains on the user to reconcile cluster state).

Let's figure out why the 2.5.2 etcd bug is still present; that was fixed a while ago. What does "etcd 2.5.2 still trying to be installed in master" mean exactly?

@jalberto
Author

jalberto commented Mar 30, 2018

when running the upgrade command on a working cluster with etcd 2.3.7, the new acs-engine creates the file /opt/azure/containers/setup-etcd.sh with this content:

#!/bin/bash
set -x
source /opt/azure/containers/provision_source.sh
ETCD_VER=v2.5.2
DOWNLOAD_URL=https://acs-mirror.azureedge.net/github-coreos
retrycmd_if_failure 5 5 curl --retry 5 --retry-delay 10 --retry-max-time 30 -L ${DOWNLOAD_URL}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /usr/bin/ --strip-components=1
systemctl daemon-reload
systemctl enable etcd.service
sudo sed -i "1iETCDCTL_ENDPOINTS=https://127.0.0.1:2379" /etc/environment
sudo sed -i "1iETCDCTL_CA_FILE=/etc/kubernetes/certs/ca.crt" /etc/environment
sudo sed -i "1iETCDCTL_KEY_FILE=/etc/kubernetes/certs/etcdclient.key" /etc/environment
sudo sed -i "1iETCDCTL_CERT_FILE=/etc/kubernetes/certs/etcdclient.crt" /etc/environment

clearly the curl command fails to fetch that version, but the script keeps running to the end, so acs-engine thinks it was successful.

@jackfrancis is there a way to run a command on the master to re-run the initial provisioning steps (after removing the /opt/azure/containers/*.complete files)?

Why? Because if I go into each master manually, change that value, and re-run setup-etcd.sh successfully, the global state of the master is still inconsistent (as etcd wasn't ready when some key steps in the process ran).
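
In case it helps anyone reproducing this, a quick way to check what actually got installed and whether etcd is healthy (etcdctl v2 commands; endpoint and cert values taken from the script above, adjust if your layout differs):

# Which etcd binary did the setup script actually put in place?
/usr/bin/etcd --version

# v2 etcdctl health check, using the client cert settings the script
# appends to /etc/environment
sudo ETCDCTL_ENDPOINTS=https://127.0.0.1:2379 \
     ETCDCTL_CA_FILE=/etc/kubernetes/certs/ca.crt \
     ETCDCTL_CERT_FILE=/etc/kubernetes/certs/etcdclient.crt \
     ETCDCTL_KEY_FILE=/etc/kubernetes/certs/etcdclient.key \
     etcdctl cluster-health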

@jalberto
Author

jalberto commented Mar 30, 2018

BTW, point 1 could be achieved just by making every script exit on error; at least that would stop the process.
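
For example, a minimal sketch of what I mean (plain bash; the values are illustrative, not the actual acs-engine script):

#!/bin/bash
# Abort on any error, on use of an unset variable, and on failures inside
# pipelines, so a bad download or a missing value stops the process instead
# of being silently ignored.
set -euo pipefail

ETCD_VER=v2.3.7   # illustrative value
TARBALL=/tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz

# -f makes curl return a non-zero exit code on HTTP errors (e.g. a 404 for
# a version that does not exist), which set -e then turns into a hard stop.
curl -fsSL --retry 5 --retry-delay 10 \
  "https://acs-mirror.azureedge.net/github-coreos/etcd-${ETCD_VER}-linux-amd64.tar.gz" \
  -o "${TARBALL}"

# Refuse to continue (and to mark the VM as upgraded) if the artifact is missing.
[ -s "${TARBALL}" ] || { echo "etcd ${ETCD_VER} download failed" >&2; exit 1; }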

@jackfrancis
Member

There is no way to recreate an original provisioning, no. The only semi-notion of cluster "state" (quotations intentional) lives in the api model on the client side, which as you've discovered is only a fractional representation of the actual cluster, especially w/ respect to the original api model representation vs a newer version of acs-engine.

The ETCD_VER=v2.5.2 derives from the value of etcdVersion in the kubernetesConfig at the time of template generation. E.g., in your api model:

<etc>
    "kubernetesConfig" {
        "etcdVersion": "3.2.16"
    }
<etc>
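
For reference, a quick sketch with jq to double-check (and, if needed, patch) that value in the generated api model before re-running upgrade (replace <dnsPrefix> with your _output directory name):

# Inspect the etcd version that upgrade will generate templates from
jq '.properties.orchestratorProfile.kubernetesConfig.etcdVersion' _output/<dnsPrefix>/apimodel.json

# Patch it in place if it is wrong (keep a backup first)
cp _output/<dnsPrefix>/apimodel.json _output/<dnsPrefix>/apimodel.json.bak
jq '.properties.orchestratorProfile.kubernetesConfig.etcdVersion = "2.3.7"' \
  _output/<dnsPrefix>/apimodel.json.bak > _output/<dnsPrefix>/apimodel.json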

@jalberto
Author

@jackfrancis this can be related: kubernetes/kubernetes#54918
I am using calico and no file is being created in /etc/cni; how can I install it manually?

@jalberto
Author

@jackfrancis I understand it should come from there, but before I ran upgrade, I edited my _output/foo/apimodel.json to fix the etcd version.

Any suggestion? My cluster has been down for many hours now.

@jalberto
Author

jalberto commented Mar 30, 2018

@jackfrancis more context:

root@k8s-master-11577755-0:~# docker ps --format '{{.Image}} - {{.Names}}'
gcrio.azureedge.net/google_containers/hyperkube-amd64:v1.9.5 - wizardly_bartik
gcrio.azureedge.net/google_containers/hyperkube-amd64@sha256:a31961a719a1d0ade89149a6a8db5181cbef461baa6ef049681c31c0e48d9f1e - k8s_kube-controller-manager_kube-controller-manager-k8s-master-11577755-0_kube-system_beaaf22644028e3842cf0847ccb58d15_1
k8s-gcrio.azureedge.net/kube-addon-manager-amd64@sha256:3519273916ba45cfc9b318448d4629819cb5fbccbb0822cce054dd8c1f68cb60 - k8s_kube-addon-manager_kube-addon-manager-k8s-master-11577755-0_kube-system_61d4fa32deceb6175822bf42fb7410f2_1
gcrio.azureedge.net/google_containers/hyperkube-amd64@sha256:a31961a719a1d0ade89149a6a8db5181cbef461baa6ef049681c31c0e48d9f1e - k8s_kube-scheduler_kube-scheduler-k8s-master-11577755-0_kube-system_27fb8458832c33c3b8754aca44f00158_1
k8s-gcrio.azureedge.net/pause-amd64:3.1 - k8s_POD_kube-controller-manager-k8s-master-11577755-0_kube-system_beaaf22644028e3842cf0847ccb58d15_1
k8s-gcrio.azureedge.net/pause-amd64:3.1 - k8s_POD_kube-apiserver-k8s-master-11577755-0_kube-system_a0596cf3432e0574f040528197bc3441_1
k8s-gcrio.azureedge.net/pause-amd64:3.1 - k8s_POD_kube-addon-manager-k8s-master-11577755-0_kube-system_61d4fa32deceb6175822bf42fb7410f2_1
k8s-gcrio.azureedge.net/pause-amd64:3.1 - k8s_POD_kube-scheduler-k8s-master-11577755-0_kube-system_27fb8458832c33c3b8754aca44f00158_1
I0330 19:08:21.516738   35291 kubelet.go:316] Watching apiserver
E0330 19:08:21.517265   35291 file.go:149] Can't process manifest file "/etc/kubernetes/manifests/audit-policy.yaml": /etc/kubernetes/manifests/audit-policy.yaml: couldn't parse as pod(no kind "Policy" is registered for version "audit.k8s.io/v1beta1"), please check manifest file.
E0330 19:08:21.545031   35291 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.240.255.5:443/api/v1/pods?fieldSelector=spec.nodeName%3Dk8s-master-11577755-0&limit=500&resourceVersion=0: dial tcp 10.240.255.5:443: getsockopt: connection refused
E0330 19:08:21.545778   35291 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:480: Failed to list *v1.Node: Get https://10.240.255.5:443/api/v1/nodes?fieldSelector=metadata.name%3Dk8s-master-11577755-0&limit=500&resourceVersion=0: dial tcp 10.240.255.5:443: getsockopt: connection refused
E0330 19:08:21.546344   35291 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:471: Failed to list *v1.Service: Get https://10.240.255.5:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.240.255.5:443: getsockopt: connection refused
W0330 19:08:21.569904   35291 kubelet_network.go:139] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back to "hairpin-veth"
I0330 19:08:21.569952   35291 kubelet.go:577] Hairpin mode set to "hairpin-veth"
W0330 19:08:21.570207   35291 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
I0330 19:08:21.570229   35291 plugins.go:190] Loaded network plugin "cni"
I0330 19:08:21.570291   35291 client.go:80] Connecting to docker on unix:///var/run/docker.sock
I0330 19:08:21.570305   35291 client.go:109] Start docker client with request timeout=2m0s
W0330 19:08:21.572091   35291 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
W0330 19:08:21.575640   35291 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
I0330 19:08:21.575711   35291 plugins.go:190] Loaded network plugin "cni"
I0330 19:08:21.575838   35291 docker_service.go:232] Docker cri networking managed by cni

@jackfrancis
Member

All of this suggests an api model that is not easily reconcilable with current versions of acs-engine. (Of course this is not ideal, just a reflection of the current limitations of what acs-engine does reliably.)

Are you able to build a new cluster and install your workloads on it?

@jalberto
Author

@jackfrancis only if I am able to move the data from the PVs to the new cluster

@jalberto
Author

@jackfrancis this is my apimodel.json

{
  "apiVersion": "vlabs",
  "location": "westeurope",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.8",
      "orchestratorVersion": "1.8.10",
      "kubernetesConfig": {
        "kubernetesImageBase": "gcrio.azureedge.net/google_containers/",
        "clusterSubnet": "10.244.0.0/16",
        "dnsServiceIP": "10.240.255.254",
        "serviceCidr": "10.240.0.0/16",
        "networkPolicy": "calico",
        "maxPods": 50,
        "dockerBridgeSubnet": "172.17.0.1/16",
        "useInstanceMetadata": true,
        "enableRbac": true,
        "enableSecureKubelet": true,
        "privateCluster": {
          "enabled": false
        },
        "gchighthreshold": 85,
        "gclowthreshold": 80,
        "etcdVersion": "2.3.7",
        "etcdDiskSizeGB": "128",
        "addons": [
          {
            "name": "tiller",
            "enabled": true,
            "containers": [
              {
                "name": "tiller",
                "cpuRequests": "50m",
                "memoryRequests": "150Mi",
                "cpuLimits": "50m",
                "memoryLimits": "150Mi"
              }
            ],
            "config": {
              "max-history": "0"
            }
          },
          {
            "name": "aci-connector",
            "enabled": false,
            "containers": [
              {
                "name": "aci-connector",
                "cpuRequests": "50m",
                "memoryRequests": "150Mi",
                "cpuLimits": "50m",
                "memoryLimits": "150Mi"
              }
            ],
            "config": {
              "nodeName": "aci-connector",
              "os": "Linux",
              "region": "westus",
              "taint": "azure.com/aci"
            }
          },
          {
            "name": "kubernetes-dashboard",
            "enabled": true,
            "containers": [
              {
                "name": "kubernetes-dashboard",
                "cpuRequests": "300m",
                "memoryRequests": "150Mi",
                "cpuLimits": "300m",
                "memoryLimits": "150Mi"
              }
            ]
          },
          {
            "name": "rescheduler",
            "enabled": false,
            "containers": [
              {
                "name": "rescheduler",
                "cpuRequests": "10m",
                "memoryRequests": "100Mi",
                "cpuLimits": "10m",
                "memoryLimits": "100Mi"
              }
            ]
          },
          {
            "name": "metrics-server",
            "enabled": false,
            "containers": [
              {
                "name": "metrics-server"
              }
            ]
          }
        ],
        "kubeletConfig": {
          "--address": "0.0.0.0",
          "--allow-privileged": "true",
          "--anonymous-auth": "false",
          "--authorization-mode": "Webhook",
          "--azure-container-registry-config": "/etc/kubernetes/azure.json",
          "--cadvisor-port": "0",
          "--cgroups-per-qos": "true",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-dns": "10.240.255.254",
          "--cluster-domain": "cluster.local",
          "--enforce-node-allocatable": "pods",
          "--event-qps": "0",
          "--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
          "--feature-gates": "",
          "--image-gc-high-threshold": "85",
          "--image-gc-low-threshold": "80",
          "--keep-terminated-pod-volumes": "false",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--max-pods": "110",
          "--network-plugin": "cni",
          "--node-status-update-frequency": "10s",
          "--non-masquerade-cidr": "10.0.0.0/8",
          "--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
          "--pod-manifest-path": "/etc/kubernetes/manifests"
        },
        "controllerManagerConfig": {
          "--allocate-node-cidrs": "true",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-cidr": "10.244.0.0/16",
          "--cluster-name": "k8svl",
          "--cluster-signing-cert-file": "/etc/kubernetes/certs/ca.crt",
          "--cluster-signing-key-file": "/etc/kubernetes/certs/ca.key",
          "--feature-gates": "",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--leader-elect": "true",
          "--node-monitor-grace-period": "40s",
          "--pod-eviction-timeout": "5m0s",
          "--profiling": "false",
          "--root-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--route-reconciliation-period": "10s",
          "--service-account-private-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--terminated-pod-gc-threshold": "5000",
          "--use-service-account-credentials": "true",
          "--v": "2"
        },
        "cloudControllerManagerConfig": {
          "--allocate-node-cidrs": "true",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-cidr": "10.244.0.0/16",
          "--cluster-name": "k8svl",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--leader-elect": "true",
          "--route-reconciliation-period": "10s",
          "--v": "2"
        },
        "apiServerConfig": {
          "--admission-control": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota,DenyEscalatingExec,AlwaysPullImages",
          "--advertise-address": "<kubernetesAPIServerIP>",
          "--allow-privileged": "true",
          "--anonymous-auth": "false",
          "--audit-log-maxage": "30",
          "--audit-log-maxbackup": "10",
          "--audit-log-maxsize": "100",
          "--audit-log-path": "/var/log/audit.log",
          "--audit-policy-file": "/etc/kubernetes/manifests/audit-policy.yaml",
          "--authorization-mode": "Node,RBAC",
          "--bind-address": "0.0.0.0",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--etcd-cafile": "/etc/kubernetes/certs/ca.crt",
          "--etcd-certfile": "/etc/kubernetes/certs/etcdclient.crt",
          "--etcd-keyfile": "/etc/kubernetes/certs/etcdclient.key",
          "--etcd-quorum-read": "true",
          "--etcd-servers": "https://127.0.0.1:2379",
          "--insecure-port": "8080",
          "--kubelet-client-certificate": "/etc/kubernetes/certs/client.crt",
          "--kubelet-client-key": "/etc/kubernetes/certs/client.key",
          "--profiling": "false",
          "--repair-malformed-updates": "false",
          "--secure-port": "443",
          "--service-account-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--service-account-lookup": "true",
          "--service-cluster-ip-range": "10.240.0.0/16",
          "--storage-backend": "etcd2",
          "--tls-cert-file": "/etc/kubernetes/certs/apiserver.crt",
          "--tls-private-key-file": "/etc/kubernetes/certs/apiserver.key",
          "--v": "4"
        }
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "k8svl",
      "vmSize": "Standard_D2_v2",
      "firstConsecutiveStaticIP": "10.240.255.5",
      "storageProfile": "ManagedDisks",
      "oauthEnabled": false,
      "preProvisionExtension": null,
      "extensions": [],
      "distro": "ubuntu",
      "kubernetesConfig": {
        "kubeletConfig": {
          "--address": "0.0.0.0",
          "--allow-privileged": "true",
          "--anonymous-auth": "false",
          "--authorization-mode": "Webhook",
          "--azure-container-registry-config": "/etc/kubernetes/azure.json",
          "--cadvisor-port": "0",
          "--cgroups-per-qos": "true",
          "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
          "--cloud-config": "/etc/kubernetes/azure.json",
          "--cloud-provider": "azure",
          "--cluster-dns": "10.240.255.254",
          "--cluster-domain": "cluster.local",
          "--enforce-node-allocatable": "pods",
          "--event-qps": "0",
          "--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
          "--feature-gates": "",
          "--image-gc-high-threshold": "85",
          "--image-gc-low-threshold": "80",
          "--keep-terminated-pod-volumes": "false",
          "--kubeconfig": "/var/lib/kubelet/kubeconfig",
          "--max-pods": "110",
          "--network-plugin": "cni",
          "--node-status-update-frequency": "10s",
          "--non-masquerade-cidr": "10.0.0.0/8",
          "--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
          "--pod-manifest-path": "/etc/kubernetes/manifests"
        }
      }
    },
    "agentPoolProfiles": [
      {
        "name": "pool01",
        "count": 3,
        "vmSize": "Standard_DS4_v2",
        "osType": "Linux",
        "availabilityProfile": "AvailabilitySet",
        "storageProfile": "ManagedDisks",
        "distro": "ubuntu",
        "kubernetesConfig": {
          "kubeletConfig": {
            "--address": "0.0.0.0",
            "--allow-privileged": "true",
            "--anonymous-auth": "false",
            "--authorization-mode": "Webhook",
            "--azure-container-registry-config": "/etc/kubernetes/azure.json",
            "--cadvisor-port": "0",
            "--cgroups-per-qos": "true",
            "--client-ca-file": "/etc/kubernetes/certs/ca.crt",
            "--cloud-config": "/etc/kubernetes/azure.json",
            "--cloud-provider": "azure",
            "--cluster-dns": "10.240.255.254",
            "--cluster-domain": "cluster.local",
            "--enforce-node-allocatable": "pods",
            "--event-qps": "0",
            "--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
            "--feature-gates": "Accelerators=true",
            "--image-gc-high-threshold": "85",
            "--image-gc-low-threshold": "80",
            "--keep-terminated-pod-volumes": "false",
            "--kubeconfig": "/var/lib/kubelet/kubeconfig",
            "--max-pods": "110",
            "--network-plugin": "cni",
            "--node-status-update-frequency": "10s",
            "--non-masquerade-cidr": "10.0.0.0/8",
            "--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
            "--pod-manifest-path": "/etc/kubernetes/manifests"
          }
        },
        "fqdn": "",
        "preProvisionExtension": null,
        "extensions": []
      }
    ],
    "linuxProfile": {
      "adminUsername": "foo",
      "ssh": {
        "publicKeys": [
          {
            "keyData": ""
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "",
      "secret": ""
    },
    "certificateProfile": {
      "caCertificate": "",
      "caPrivateKey": "",
      "apiServerCertificate": "",
      "apiServerPrivateKey": "",
      "clientCertificate": "",
      "clientPrivateKey": "",
      "kubeConfigCertificate": "",
      "kubeConfigPrivateKey": "",
      "etcdServerCertificate": "",
      "etcdServerPrivateKey": "",
      "etcdClientCertificate": "",
      "etcdClientPrivateKey": "",
      "etcdPeerCertificates": [
        "",
        "",
        "",
        ""
      ]
    }
  }
}

@jalberto
Author

@jackfrancis I generated a new apimodel.json with the latest acs-engine, but there is no significant change in structure:

  • it uses etcd3
  • it enables metrics

So it doesn't seem to be a problem related to my cluster configuration.

@jalberto
Author

more logs (cloud-init output):

+ bash /etc/kubernetes/generate-proxy-certs.sh
[...]
subject=/CN=aggregator/O=system:masters                                                                                                                       
Getting CA Private Key                                                                                                                                        
seq: invalid floating point argument: 'etcdctl'                                                                                                               
Try 'seq --help' for more information.                                                                                                                        
Executed "" times
Error:  client: etcd cluster is unavailable or misconfigured
error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #2: malformed HTTP response "\x15\x03\x01\x00\x02\x02"

Error:  client: etcd cluster is unavailable or misconfigured
error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #2: malformed HTTP response "\x15\x03\x01\x00\x02\x02"

Error:  client: etcd cluster is unavailable or misconfigured
error #0: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #1: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
error #2: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
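
For reference, the places this kind of output can be found on an Ubuntu master (standard cloud-init locations; the Azure-specific directory and file names may vary by version):

# Standard Ubuntu cloud-init logs
sudo less /var/log/cloud-init.log
sudo less /var/log/cloud-init-output.log

# Custom script extension / provisioning output used by acs-engine
# (exact file names may differ between versions, so just list the directory)
sudo ls -l /var/log/azure/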

@jalberto
Author

@jackfrancis I managed to upgrade to 1.9.6 with the correct etcd 2.3.7, but it is still not working: no CNI config is found, plus the errors in the previous comment.
Is the script or the cloud-init config stored in the VM? I would like to just run it manually.

@jackfrancis
Member

I ran an ad hoc test against your api model (acs-engine:v0.14.5) and couldn't get a working cluster:

$ kubectl get nodes -o json
2018/03/30 20:44:08 Error trying to run 'kubectl get nodes':{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

@jackfrancis
Member

About manually setting up calico.

make sure KUBELET_NETWORK_PLUGIN=cni is set in /etc/default/kubelet

make sure DOCKER_OPTS includes --volume=/etc/cni/:/etc/cni:ro --volume=/opt/cni/:/opt/cni:ro in /etc/default/kubelet

Calico does not currently work w/ CNI, so your kubelet runtime config should be --network-plugin=kubenet
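
A small sketch of checking and flipping that on a master (back the file up first; the sed line is illustrative):

# See what the kubelet is currently configured with
grep -E 'KUBELET_NETWORK_PLUGIN|DOCKER_OPTS|network-plugin' /etc/default/kubelet

# Illustrative switch from cni to kubenet, then restart the kubelet unit
sudo cp /etc/default/kubelet /etc/default/kubelet.bak
sudo sed -i 's/--network-plugin=cni/--network-plugin=kubenet/' /etc/default/kubelet
sudo systemctl restart kubelet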

@jalberto
Author

at least you have connectivity :)

must I change calico for azure?

KUBELET_NETWORK_PLUGIN was not in /etc/default/kubelet so I added it, but this flag is there: --network-plugin=cni

the volume flags are present, but the host's /etc/cni is empty

@jackfrancis
Member

If you're using Calico for k8s networkPolicy you have to use kubenet for IPAM. So change to --network-plugin=kubenet

@jalberto
Author

@jackfrancis already changed it and rebooted the VM; nothing new.

On the other hand, I cannot create a new cluster because there is a lack of VMs in westeurope; I already tried 3 sizes.

@jalberto
Author

@jackfrancis I don't mind removing calico if that solves the issue

@jackfrancis
Member

Here is the provision script we run, for reference, if you want to try replaying things manually:

https://github.com/Azure/acs-engine/blob/master/parts/k8s/kubernetesmastercustomscript.sh

(also in /opt/azure/containers/provision.sh)

The configNetworkPolicy function is where the various network options are applied on the host.
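
If you do try replaying it, something along these lines (illustrative only: provision.sh expects the environment the custom script extension normally passes in, e.g. the service principal and etcd cert values, so a bare re-run will not fully reproduce the original provisioning):

# Clear the "already done" markers so the script does not short-circuit
sudo rm -f /opt/azure/containers/*.complete

# Re-run the provision script with tracing, capturing the output for inspection
sudo bash -x /opt/azure/containers/provision.sh 2>&1 | sudo tee /tmp/provision-replay.log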

@jalberto
Author

jalberto commented Mar 30, 2018

@jackfrancis

ETCD_PEER_CERT=$(echo ${ETCD_PEER_CERTIFICATES} | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d',' -f $((${MASTER_INDEX}+1)))                                      
ETCD_PEER_KEY=$(echo ${ETCD_PEER_PRIVATE_KEYS} | cut -d'[' -f 2 | cut -d']' -f 1 | cut -d',' -f $((${MASTER_INDEX}+1)))

returns nothing, and that's just the beginning of the script.

I'm really wondering why this script doesn't just exit if any required value is not present, instead of blindly running commands.

I suspect there is another script that sets up all these (currently empty) vars before this script runs.

Any clue?
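
For example, a minimal sketch of the kind of guard I mean (plain bash; the variable names are from the snippet above):

# Fail immediately with a readable message if a required input is missing,
# instead of running the rest of the script with empty values
: "${ETCD_PEER_CERTIFICATES:?ETCD_PEER_CERTIFICATES is empty - aborting provisioning}"
: "${ETCD_PEER_PRIVATE_KEYS:?ETCD_PEER_PRIVATE_KEYS is empty - aborting provisioning}"
: "${MASTER_INDEX:?MASTER_INDEX is empty - aborting provisioning}"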

@jalberto
Author

every key in /etc/kubernetes/azure.json is empty, so it looks like something in the provisioning step is not working
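
A quick way to check that (jq; the field names are the standard Azure cloud-provider ones such as aadClientId/aadClientSecret, so adjust if your version uses different keys):

# Show which cloud-provider credentials are actually populated on this node
# (only the length of the secret is printed, not the value itself)
sudo jq '{tenantId, subscriptionId, aadClientId, secretLength: (.aadClientSecret | length)}' \
  /etc/kubernetes/azure.json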

@jalberto
Author

I opened a critical case with Azure; I just cannot keep having a down cluster with no solution. Thanks for your help @jackfrancis.

I am really upset with acs-engine. I hope the same code is not used in AKS, or it will be scary to use.

@jackfrancis
Member

acs-engine comes with no operational guarantees; that is the contract users make w/ this tool. I understand why your experience is upsetting, and we take this as valuable feedback that our documentation is not adequately expressing the limits of what will, and what may not, work w/ respect to cluster lifecycle management driven by acs-engine alone.

It is precisely these kinds of limitations in acs-engine that inform AKS's value proposition as a service, and the additional orchestration implementations that separate AKS from acs-engine.

AKS uses a subset of acs-engine as an SDK-like library dependency, and not the entire codebase as-is, which is the workflow you're currently suffering through.

To repeat and emphasize for clarity and transparency (not because it is what you want to hear): acs-engine is a collaborative, open source project to facilitate rapid development of Kubernetes (and other container cluster kit) on Azure. The only support model is the PR process: either the issue process, advocacy, and project maintainer response in the form of a PR + release, or the submission and acceptance of your own PR. By design, this model produces changes in code (and potentially against your existing clusters, or onto new clusters) over the course of days/weeks/months, but not quickly enough to serve as a production operations response.

Once Azure identifies this as a customer-built and maintained cluster, they will deprioritize the issue you’ve opened.

Again, sorry that there is no good news here w/ respect to your current situation, I hope that this transparency is helpful, especially in the long term as it pertains to whether or not acs-engine is an appropriate tool for your Kubernetes cluster management toolkit.

@jalberto
Author

jalberto commented Apr 2, 2018

@jackfrancis I understand what you say, and I understand the risk of using acs-engine. That said, what I don't understand is how an MS/Azure product/project (because it is under the MS umbrella) is in this shape.
I also don't understand how the process of releasing "stable" versions works, where clearly not every previously reported case has been taken into consideration before release.
We can talk about the problems reported here and never fixed; actually, the main problem in this ticket was reported by me months ago. I spent a considerable amount of time digging and pointing out possible solutions, but nothing has been fixed.

At the same time, acs-engine declares itself a "non-supported product" yet says "this is community driven and we listen to your feedback and implement it". If you check my open issues here, you will see I spent a considerable amount of time giving feedback and tangible solutions, but nothing gets implemented or fixed.

What I learnt today is: acs-engine is not supported, but it is not community driven either, as it follows a roadmap not decided by the community. So it's a "community project driven by Azure product interest", and that never works.

Please don't mistake my upset with how MS/Azure/ACS deals with all these problems for a lack of appreciation of the team's work, but the acs team needs clearer direction and a cleaner separation of community-driven vs product-driven.

Final feedback: put in big/bold/shiny letters on the first line of the README: "Don't use this for production workloads under any circumstance".

@jackfrancis
Member

These are valid criticisms and reflect immature aspects of this project:

  • lack of published roadmap
  • inconsistent support response
  • inconsistent community participation

This project started out as a "let's see what happens when we open source the Azure ARM template conveniences to the OSS community on Azure that is interested in prototyping container orchestrator clusters"; i.e., the intent of the open source aspects of the project was intrinsically experimental, rather than a purposeful project with a specific Microsoft-desired outcome. The intent of doing this in the open was to empower folks who were impatient with the maturation process of SLA-backed Azure service offerings (e.g., AKS, which is not yet GA) but whose business goals aligned with this particular tech stack category (e.g., Kubernetes, docker).

Arguably, we can do more to engage community contributions, and improve the above criticisms. We take that feedback seriously.

Consider, though, that the primary objective of this project is to enable folks to iterate and build upon each others' ideas and work to produce novel cluster deployments in Azure. To that end this project continues to add value, with the risk associated with all the above-mentioned caveats.

I would accept the feedback that more disclaimer material would be valuable to warn folks about the support model, but I would push back on your representation of what's dangerous. It's not "production workloads" that acs-engine is operating against: for that it is the Kubernetes API and the way it is configured that matter. The hard work is rationalizing the Azure API w/ Kubernetes-supporting IaaS + Kubernetes runtime config. The intention of acs-engine is to provide tooling for the user to achieve the outcome of a working Kubernetes cluster on Azure. Once that outcome has been achieved, whether or not production workloads should be scheduled onto a particular cluster really depends on the viability of that cluster's configuration as compared to the requirements of the workloads that may land there, including the IaaS configuration underneath. This is not an acs-engine problem: acs-engine merely aims to ease the process of defining, declaring, and applying these IaaS + k8s configurations onto Azure.

I would agree, however, that upgrade + scale functionality in acs-engine in its current state is not an acceptable cluster lifecycle management dependency for a production cluster. Whether or not its limitations are more or less reliable than a hand-rolled cluster lifecycle toolkit is up to the discretion of each user. That reality can be better documented, and we will do so.

Thanks for your continued feedback!

@jalberto
Author

jalberto commented Apr 2, 2018

Thanks for your time and sincerity @jackfrancis. I totally understand the complexity of the project, but I also expect a high-quality outcome from an MS-driven project. I think the main issue is in these words:

"The intention of acs-engine is to provide tooling for the user to achieve the outcome of a working Kubernetes cluster on Azure."

"Working Kubernetes cluster on Azure" can mean different things to different people; for me it includes maintenance — not necessarily "major upgrade" support, but at least proper troubleshooting options and basic maintenance tasks to keep a "working Kubernetes cluster on Azure" over time, and not only "once".

Maybe listing which features are stable and which are not would help, so at least expectations are managed.

Thanks for your time

@jalberto
Author

jalberto commented Apr 3, 2018

@jackfrancis this is another example: #1961

The change is justified, but it is a breaking change, not properly documented and with no upgrade path attached.

This change made me waste 3 days trying to figure out why gitlab-runner was no longer working.

@jalberto
Author

jalberto commented Apr 3, 2018

@jackfrancis this table is an accurate visualisation of my frustration with acs-engine:

[screenshot from 2018-04-03 13-07-12]

So the only way to "fix" a 1.8.4 cluster is to upgrade at least to 1.8.5 so azure-file works as expected (tha this what I was trying to fix with this upgrade) but if you jump "too much" to 1.9.0 it breaks again!

Notice how I need to go to official MS-azure docs to find information about community-driven acs-engine

@jalberto jalberto changed the title Upgrade process for Kubernetes is not working Upgrade process for Kubernetes is not working (acs-engine not ready for production workload) Apr 3, 2018
@andyzhangx
Contributor

Hi @jalberto, for the azure file fileMode and dirMode issue, there have been design changes back and forth; I would suggest using azure file mountOptions to set what you want: https://github.com/andyzhangx/demo/blob/master/linux/azurefile/azurefile-mountoptions.md
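
For completeness, a rough sketch of the kind of StorageClass that doc describes (illustrative values; check the linked doc for the exact details):

# Illustrative azure-file StorageClass using mountOptions to control permissions
kubectl apply -f - <<'EOF'
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile-mountoptions   # illustrative name
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
parameters:
  skuName: Standard_LRS
EOF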

@jalberto
Author

jalberto commented Apr 4, 2018

@andyzhangx agreed; the problem is that you need at least k8s 1.8.5 to use mountOptions, so when I upgraded my cluster from 1.8.4 to 1.8.x everything broke

@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019