Pods test failures after master upgrade 0.19.3 → 0.21.2 #11355

Closed
mbforbes opened this issue Jul 16, 2015 · 31 comments
Labels
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@mbforbes
Contributor

Conditions

  • e2e test version: 0.19.3
  • nodes version: 0.19.3
  • master version: 0.21.2

Tests

  • [FLAKE] Pods should be restarted with a docker exec "cat /tmp/health" liveness probe
  • [FLAKE] Pods should not be restarted with a docker exec "cat /tmp/health" liveness probe
  • [FLAKE] Pods should be restarted with a /healthz http liveness probe
  • [FAIL] Pods should be updated

Flake history

[screenshot: flake history for the tests above]

Output

Pods should be restarted with a docker exec "cat /tmp/health" liveness probe

Pod "liveness-exec" is forbidden: no API token found for service account
e2e-test-5d05436c-2a5a-11e5-916e-42010af01555/default, retry after the token
is automatically created and added to the service account

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/
go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/pods.go:451
creating pod liveness-exec
Expected error:
    <*errors.StatusError | 0xc208bcac80>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: ""},
            Status: "Failure",
            Message: "Pod \"liveness-exec\" is forbidden: no API token found for service account
e2e-test-5d05436c-2a5a-11e5-916e-42010af01555/default, retry after the token
is automatically created and added to the service account",
            Reason: "Forbidden",
            Details: {
                Name: "liveness-exec",
                Kind: "Pod",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 403,
        },
    }
    Pod "liveness-exec" is forbidden: no API token found for service account
e2e-test-5d05436c-2a5a-11e5-916e-42010af01555/default, retry after the token
is automatically created and added to the service account
not to have occurred

Pods should not be restarted with a docker exec "cat /tmp/health" liveness probe

Pod "liveness-exec" is forbidden: no API token found for service
account e2e-test-10b92925-2b3f-11e5-8ace-42010af01555/default, retry after the token
is automatically created and added to the service account

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/
go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/pods.go:477
creating pod liveness-exec
Expected error:
    <*errors.StatusError | 0xc20965a280>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: ""},
            Status: "Failure",
            Message: "Pod \"liveness-exec\" is forbidden: no API token found for service
account e2e-test-10b92925-2b3f-11e5-8ace-42010af01555/default, retry after the token
is automatically created and added to the service account",
            Reason: "Forbidden",
            Details: {
                Name: "liveness-exec",
                Kind: "Pod",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 403,
        },
    }
    Pod "liveness-exec" is forbidden: no API token found for service
account e2e-test-10b92925-2b3f-11e5-8ace-42010af01555/default, retry after the token
is automatically created and added to the service account
not to have occurred

Pods should be restarted with a /healthz http liveness probe

Pod "liveness-http" is forbidden: no API token found for service account
e2e-test-50513bd8-2a76-11e5-948c-42010af01555/default, retry after the token
is automatically created and added to the service account

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/pods.go:504
creating pod liveness-http
Expected error:
    <*errors.StatusError | 0xc2081ffe00>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {SelfLink: "", ResourceVersion: ""},
            Status: "Failure",
            Message: "Pod \"liveness-http\" is forbidden: no API token found for service account e2e-test-1969a2b3-2a04-11e5-9a2a-42010af01555/default, retry after the token is automatically created and added to the service account",
            Reason: "Forbidden",
            Details: {
                Name: "liveness-http",
                Kind: "Pod",
                Causes: nil,
                RetryAfterSeconds: 0,
            },
            Code: 403,
        },
    }
    Pod "liveness-http" is forbidden: no API token found for service account e2e-test-1969a2b3-2a04-11e5-9a2a-42010af01555/default, retry after the token is automatically created and added to the service account

Pods should be updated

may not update fields other than container.image

/go/src/github.com/GoogleCloudPlatform/kubernetes/_output/dockerized/go/src/github.com/GoogleCloudPlatform/kubernetes/test/e2e/pods.go:338
failed to update pod: Pod "pod-update-08769420-2b44-11e5-b3e2-42010af01555" is invalid: spec: invalid value '{Volumes:[{Name:default-token-7sezo VolumeSource:{HostPath: EmptyDir: GCEPersistentDisk: AWSElasticBlockStore: GitRepo: Secret:<*>(0xc209b18b30){SecretName:default-token-7sezo} NFS: ISCSI: Glusterfs: PersistentVolumeClaim: RBD:}}] Containers:[{Name:nginx Image:gcr.io/google_containers/nginx:1.7.9 Command: Args: WorkingDir: Ports:[{Name: HostPort:0 ContainerPort:80 Protocol:TCP HostIP:}] Env: Resources:{Limits:map[cpu:100m] Requests:map[]} VolumeMounts:[{Name:default-token-7sezo ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount}] LivenessProbe:<*>(0xc209d2da70){Handler:{Exec: HTTPGet:<*>(0xc20898c190){Path:/index.html Port:8080 Host: Scheme:HTTP} TCPSocket:} InitialDelaySeconds:30 TimeoutSeconds:1} ReadinessProbe: Lifecycle: TerminationMessagePath:/dev/termination-log ImagePullPolicy:IfNotPresent SecurityContext:}] RestartPolicy:Always TerminationGracePeriodSeconds: ActiveDeadlineSeconds: DNSPolicy:ClusterFirst NodeSelector: ServiceAccountName: NodeName:gke-gke-upgrade-13e30713-node-xg2o HostNetwork:false ImagePullSecrets:}': may not update fields other than container.image

(#11343 improves this error message)

mbforbes added the priority/important-soon label Jul 16, 2015
@liggitt
Member

liggitt commented Jul 16, 2015

Is the failing e2e code at the 19.3 level? If so, it's probably missing the e2e fixes in #10523, which should help the liveness-exec check.

For the last one, I think the rejected spec mutation might be the scheme in the liveness probe defaulting to HTTP when it was previously empty (added in #9965)

@liggitt
Member

liggitt commented Jul 16, 2015

the rejected spec mutation might be the scheme in the liveness probe defaulting to HTTP

Never mind, the defaulting would have happened in the old podspec as well... so there wouldn't be a diff. I can now see that the submitted podspec's serviceAccountName is empty, because the 19.3 e2e client code didn't know about it. Here's the sequence (with a rough illustration after the list):

  1. pod is created by 19.3 e2e client code, probably with no service account
  2. master service account admission sets pod.spec.serviceAccountName to default
  3. 19.3 e2e client code gets the pod, tries to update it, and in the process drops the serviceAccountName field (because its v1 serialization has serviceAccount instead)
  4. master ValidatePodUpdate code fails, because it sees an update trying to change serviceAccountName from "default" to ""
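
To make step 3 concrete, here is a rough, hypothetical illustration using jq and a present-day kubectl (the real e2e client drops the field silently through its old Go types; this just simulates the same effect):

    # Hypothetical illustration only: simulate the 0.19.3 client's view of the
    # pod, which has no "serviceAccountName" field, then send it back as an update.
    kubectl get pod liveness-exec -o json > /tmp/pod.json

    # The old client's v1 types simply don't carry spec.serviceAccountName.
    jq 'del(.spec.serviceAccountName)' /tmp/pod.json > /tmp/pod-as-old-client.json

    # ValidatePodUpdate sees serviceAccountName changing from "default" to ""
    # and rejects the update with the error quoted above.
    kubectl replace -f /tmp/pod-as-old-client.json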

@zmerlynn
Member

Yes, I believe the failing e2e code is at the 0.19.3 level as well. I think @mbforbes is re-attempting pivoting the e2e versions after the master upgrades as well, but we might run into other issues if the nodes are version-skewed. (My hope is that there will be fewer.)

davidopp added the team/master and sig/node labels Jul 16, 2015
@bgrant0607
Member

Filed #11380 for the serviceAccount issue.

@bgrant0607
Member

Since a couple issues have popped up already, I decided to look through API changes between 0.19 and 0.20:
https://github.com/GoogleCloudPlatform/kubernetes/commits/master/pkg/api/v1/types.go

  • pod.status.reason: Additive. kubectl will just print less information if it isn't populated.
  • ObservedGeneration: I believe the client code should work when this field defaults to 0.
  • PersistentVolumeClaim: a Go change, not a JSON change.

So hopefully there aren't more incompatibilities lurking.

@roberthbailey
Contributor

Thanks for checking, Brian. With luck we won't need more patch releases into the 0.20 branch.

@mbforbes
Contributor Author

Important: running a "pure" 0.19.3 (master, nodes, and e2e code all at 0.19.3) has zero flakes for the three flaky pod tests above.

[screenshot: flake history showing zero flakes on pure 0.19.3]

This seems to imply that #10523 didn't prevent these, and the 0.21.2 master is causing the flakes. Could the flakes also be related to the serviceAccount → serviceAccountName issue?

@liggitt
Member

liggitt commented Jul 16, 2015

The field name change caused the pod update failure.

The liveness exec failures were caused by the e2e code in that test not using the standard method for constructing a test namespace, which meant the e2e test wasn't waiting for the namespace to be ready before creating pods. In 19.3, the admission plugin would let pods get created before their service account and API token were ready, but that was fixed after 19.3.

@mbforbes
Contributor Author

@liggitt but by that logic, wouldn't all 0.19.3 e2es see the same flakiness in those tests? In #11355 (comment) I describe a Jenkins job (pictured) that's running pure 0.19.3 and never sees those tests flake. (Sorry if I'm confused; just attempting to understand!)

@liggitt
Member

liggitt commented Jul 17, 2015

@mbforbes admission control changed since 19.3 to require the service account token to exist before admitting a pod. Prior to that, pods that didn't make use of the token (like the liveness-exec pod) were getting admitted when they shouldn't have been, but the tests didn't notice because the pods never attempted to use the API token they were supposed to have available.

In a pure 19.3 env, liveness-exec pods get admitted (incorrectly) before the namespace is ready, but don't fail their test because they aren't depending on anything service-account-token related.

In a 19.3 e2e against a 20.0+ master, liveness-exec pods get rejected (correctly) because the namespace's service accounts and their tokens aren't finished initializing yet. The liveness-exec e2e test in particular needed to use the common method to correctly get a test namespace. The e2e test fixes in https://github.com/GoogleCloudPlatform/kubernetes/pull/10523/files#diff-92d176a1025dcbee0981bb7f16cda942 are applicable.
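
Roughly, the pattern those fixes enforce looks like this (a sketch with made-up names, using plain kubectl rather than the e2e framework): don't create pods until the namespace's default service account has its token.

    NS="e2e-test-example"    # hypothetical test namespace
    # Wait until the namespace's default service account has a token secret.
    until kubectl get serviceaccount default --namespace="$NS" \
          -o jsonpath='{.secrets[*].name}' 2>/dev/null | grep -q .; do
      sleep 1
    done
    # Only now create pods; the ServiceAccount admission controller will
    # find an API token and admit them.
    kubectl create -f liveness-exec-pod.yaml --namespace="$NS"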

@mbforbes
Contributor Author

@liggitt ahhhhh I understand now. Thank you so much for being patient with me.

@n1603

n1603 commented Aug 1, 2015

Can you please tell me what the solution to this is? I am seeing the same error:

kubectl create -f nginx.yml
Error from server: error when creating "nginx.yml": Pod "nginx" is forbidden: no API token found for service account default/default, retry after the token is automatically created and added to the service account

@liggitt
Member

liggitt commented Aug 1, 2015

@n1603 make sure you're starting the apiserver and controller manager with the service account arguments needed to auto-generate service account tokens. See local-up-cluster.sh for an example.

@n1603

n1603 commented Aug 1, 2015

Thanks. I am using Fedora 22 Atomic; the apiserver and controller manager are running. I must be missing something, but I'm not sure what.

@n1603

n1603 commented Aug 2, 2015

Any further suggestions on fixing this?

@liggitt
Member

liggitt commented Aug 2, 2015

Can you show the command-line options you are using to start the apiserver and controller manager?

@n1603

n1603 commented Aug 3, 2015

I used the command below to start the services:

for SERVICES in docker kube-proxy.service kubelet.service etcd.service kube-apiserver.service kube-controller-manager.service kube-scheduler.service; do systemctl restart $SERVICES; systemctl enable $SERVICES; systemctl status $SERVICES ; done

Here is what "kube-apiserver.service" looks like in my install:
#cat "/usr/lib/systemd/system/kube-apiserver.service"

[Unit]
Description=Kubernetes API Server
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
EnvironmentFile=-/etc/kubernetes/config
EnvironmentFile=-/etc/kubernetes/apiserver
User=kube
ExecStart=/usr/bin/kube-apiserver \
    $KUBE_LOGTOSTDERR \
    $KUBE_LOG_LEVEL \
    $KUBE_ETCD_SERVERS \
    $KUBE_API_ADDRESS \
    $KUBE_API_PORT \
    $KUBELET_PORT \
    $KUBE_ALLOW_PRIV \
    $KUBE_SERVICE_ADDRESSES \
    $KUBE_ADMISSION_CONTROL \
    $KUBE_API_ARGS
Restart=on-failure
Type=notify
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

$cat "/usr/lib/systemd/system/kube-controller-manager.service"

[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
EnvironmentFile=-/etc/kubernetes/config
EnvironmentFile=-/etc/kubernetes/controller-manager
User=kube
ExecStart=/usr/bin/kube-controller-manager \
    $KUBE_LOGTOSTDERR \
    $KUBE_LOG_LEVEL \
    $KUBE_MASTER \
    $KUBE_CONTROLLER_MANAGER_ARGS
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

@liggitt
Member

liggitt commented Aug 3, 2015

@n1603 looks like the systemd unit files include the ServiceAccount admission controller without specifying the needed signing key. Not sure what to do about that, since those files don't really have a setup script that can create that key...

To get your setup working, you can do the same thing local-up-cluster.sh is doing (a quick sanity check follows the steps):

  1. Generate a signing key:

    openssl genrsa -out /tmp/serviceaccount.key 2048
    
  2. Update /etc/kubernetes/apiserver:

    KUBE_API_ARGS="--service_account_key_file=/tmp/serviceaccount.key"
    
  3. Update /etc/kubernetes/controller-manager:

    KUBE_CONTROLLER_MANAGER_ARGS="--service_account_private_key_file=/tmp/serviceaccount.key"
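
As a rough sanity check afterwards (assuming kubectl is pointed at this cluster), restart both services and confirm the token controller created a token for the default service account:

    systemctl restart kube-apiserver kube-controller-manager
    # The SECRETS column for "default" should become non-zero, and a
    # default-token-* secret of type kubernetes.io/service-account-token
    # should appear.
    kubectl get serviceaccounts
    kubectl get secrets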
    

@n1603

n1603 commented Aug 4, 2015

This works, thanks a lot.

@hw-qiaolei
Contributor

@liggitt thanks for your answer. I encountered the same issue and solved it by following your steps.

@irfanjs

irfanjs commented Sep 28, 2016

Hi,
I followed these steps:
1: executed the command: openssl genrsa -out /tmp/serviceaccount.ket 2048
2: modified the /etc/kubernetes/apiserver file to add the following:

KUBE_API_ARGS="--service_account_key_file=/tmp/serviceaccount.key"

3: modified /etc/kubernetes/controller-manager to add the following:

KUBE_CONTROLLER_MANAGER_ARGS="--service_account_private_key_file=/tmp/serviceaccount.key"

4: restarted the kube-apiserver and kube-controller-manager services:

service kube-apiserver restart
service kube-controller-manager restart

When I run the command kubectl get events, I get the following error:

No API token found for service account "default" retry after the token is automatically created and added to the service account

How the environment is set up:
I followed Kubernetes the Hard Way: https://github.com/kelseyhightower/kubernetes-the-hard-way
one kube master
one kube node
Both servers are on CentOS 7.2.
kubectl version: server version 1.3 and client version 1.3

kubectl get pods does not return anything.
kubectl get nodes returns the hostname of the kube node server and its status is Ready.

Please suggest what additional checks I need to do to get this resolved.
Regards

@liggitt
Member

liggitt commented Sep 28, 2016

there's a typo in your openssl command above; I assume the file was actually created without the typo.

What service accounts and secrets show up in your namespace?

kubectl get serviceaccounts
kubectl get secrets

@irfanjs

irfanjs commented Sep 28, 2016

thanks Liggitt.
I did not understand what that typo is; I created the key file exactly as per the command. Please let me know.
The command kubectl get secrets does not return anything.
The command kubectl get serviceaccounts returns `default` as an account, and its SECRETS count is 0.

regards

@liggitt
Member

liggitt commented Sep 28, 2016

the typo was "serviceaccount.ket" vs "serviceaccount.key"

do you have the logs from the controller manager?

@irfanjs

irfanjs commented Sep 28, 2016

Sorry to say, but I don't have a /var/log/kube-controller-manager.log file.

I ran the command journalctl -u -f kube-controller-manager and it gives the same error: "No API ..."

@irfanjs

irfanjs commented Sep 29, 2016

please suggest.
regards

@irfanjs

irfanjs commented Sep 30, 2016

This is resolved. I created the secrets manually and attached them to the "default" service account. After that, I restarted the controller manager and the pods started running.
Thanks again, all.
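
Roughly, what that can look like on clusters of this era (a sketch with made-up names; the token controller fills in the token data for secrets of type kubernetes.io/service-account-token carrying the annotation below):

    # sa-token-secret.yaml (hypothetical file name)
    apiVersion: v1
    kind: Secret
    metadata:
      name: default-token-manual
      annotations:
        kubernetes.io/service-account.name: default
    type: kubernetes.io/service-account-token

    # Create it, then reference it from the service account so the
    # ServiceAccount admission controller can find an API token.
    kubectl create -f sa-token-secret.yaml
    kubectl patch serviceaccount default -p '{"secrets":[{"name":"default-token-manual"}]}'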

@kewaike

kewaike commented Feb 3, 2017

@irfanjs, how do you create the secrets and attach them to the "default" service account?

@xaidanwang

I also want to know how to create the secrets and attach them to the "default" service account.

@kundanatre

kundanatre commented Jun 6, 2018

The service account shows 0 secrets attached to it.

NAME      SECRETS   AGE
default   0         6d

Is this problem resolved in kubernetes 1.10?

It is still complaining about the API token. How does the API token get generated?
Do we need to generate it manually, as mentioned by @liggitt?
Even though I have followed the same steps, providing the correct params "--service-account-private-key-file" and "--service-account-key-file" in the respective files, it does not make any difference.

openssl genrsa -out /etc/kubernetes/serviceaccount.key 2048

In apiserver config file

KUBE_API_ARGS="--service-account-key-file=/etc/kubernetes/secret/serviceaccount.key"
KUBE_ADMISSION_CONTROL="--admission-control=NamespaceLifecycle,LimitRanger,ResourceQuota"

Or this approach

KUBE_API_ARGS="--service-account-key-file=/etc/kubernetes/secret/serviceaccount.key"
KUBE_ADMISSION_CONTROL="--admission-control=NamespaceLifecycle,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota

In controller-manager config file
KUBE_CONTROLLER_MANAGER_ARGS="--service-account-private-key-file=/etc/kubernetes/secret/serviceaccount.key --allocate-node-cidrs=true --attach-detach-reconcile-sync-period=1m0s --cluster-cidr=100.64.0.0/16 --cluster-name=k8s.virtual.local --leader-elect=true --kubeconfig=/etc/kubernetes/kubeConfig --service-cluster-ip-range=100.65.0.0/24"

systemctl restart kube-controller-manager

systemctl restart kube-apiserver

vi ReplicationController.yaml

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

kubectl create -f ReplicationController.yaml

kubectl describe rc nginx

Name:         nginx
Namespace:    kube-system
Selector:     app=nginx
Labels:       app=nginx
Annotations:  <none>
Replicas:     0 current / 2 desired
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx
    Port:         80/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason        Age                      From                    Message
  ----     ------        ----                     ----                    -------
  Warning  FailedCreate  <invalid> (x18 over 8m)  replication-controller  Error creating: No API token found for service account "default", retry after the token is automatically created and added to the service account

@j-ibarra

@liggitt thanks, I encountered the same issue and solved it by following your steps. But now:

kubectl logs busybox-9b9f89599-w5ln4

error: You must be logged in to the server (the server has asked for the client to provide credentials ( pods/log busybox-9b9f89599-w5ln4))
