seldon-core-manager pod is getting restarted #1910

Closed
ljainker opened this issue Jun 2, 2020 · 12 comments

@ljainker

ljainker commented Jun 2, 2020

The seldon-core-manager pod that comes as part of the Kubeflow installation is getting restarted.

Logs of the previous pod instance

kubectl logs -p seldon-controller-manager-7db97d6776-wjdjf -n kubeflow

2020-06-02T07:45:51.026Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
2020-06-02T07:45:51.028Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-02T07:45:51.028Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-02T07:45:51.028Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-02T07:45:51.028Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1-seldondeployment"}
2020-06-02T07:45:51.029Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1-seldondeployment"}
2020-06-02T07:45:51.029Z INFO setup starting manager
2020-06-02T07:45:51.029Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2020-06-02T07:45:51.517Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"kubeflow","name":"controller-leader-election-helper","uid":"50c8cacb-6824-47a0-b9bf-4eb3727b43ae","apiVersion":"v1","resourceVersion":"35336"}, "reason": "LeaderElection", "message": "seldon-controller-manager-7db97d6776-wjdjf_149790ec-a4a5-11ea-b989-ca56f630d025 became leader"}
2020-06-02T07:45:51.729Z INFO controller-runtime.controller Starting Controller {"controller": "seldondeployment"}
2020-06-02T07:45:51.731Z INFO controller-runtime.certwatcher Updated current TLS certificate
2020-06-02T07:45:51.731Z INFO controller-runtime.certwatcher Starting certificate watcher
2020-06-02T07:45:51.830Z INFO controller-runtime.controller Starting workers {"controller": "seldondeployment", "worker count": 1}
E0602 09:46:15.320804 1 leaderelection.go:306] error retrieving resource lock kubeflow/controller-leader-election-helper: etcdserver: leader changed
E0602 09:46:19.862457 1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'seldon-controller-manager-7db97d6776-wjdjf_149790ec-a4a5-11ea-b989-ca56f630d025 stopped leading'
2020-06-02T09:46:19.863Z INFO controller-runtime.controller Stopping workers {"controller": "seldondeployment"}
2020-06-02T09:46:19.863Z ERROR setup problem running manager {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
main.main
    /workspace/main.go:114
runtime.main
    /usr/local/go/src/runtime/proc.go:200

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T21:03:42Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

ljainker added the bug and triage (Needs to be triaged and prioritised accordingly) labels on Jun 2, 2020
@ukclivecox
Contributor

This looks like a previous version or another manager is running; it's failing leader election.
How did you install? Was it running OK and then started failing?
Which version of Kubeflow?
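
For example, something along these lines (a rough check; the grep pattern is just an assumption about the pod naming) would show whether more than one manager is running:

# there should be exactly one seldon-controller-manager pod and deployment
kubectl get pods --all-namespaces | grep seldon-controller
kubectl get deployments --all-namespaces | grep seldon-controller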

@ljainker
Author

ljainker commented Jun 3, 2020

Hi @cliveseldon,

Seldon Core is installed as part of the Kubeflow (version 1.0) deployment.
The Seldon Core version installed is 1.0.1.
The pod was running fine in the beginning; then, after 100 minutes or so, it got restarted.

@ukclivecox
Contributor

This may be because it's running into resource limits.

In master kubeflow manifests we have updated to higher values:
https://github.com/kubeflow/manifests/blob/af80269f5137328c1aec2ebd5b705b0bcced2cec/seldon/values.yaml#L64-L68

Until Kubeflow does a new release, you could manually modify the kfdef resources to update the resource requests/limits for the manager.
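
For example, a quick way to try this (a sketch only; the CPU/memory values below are illustrative rather than the exact numbers from the linked manifests, and a direct change may be reverted if kfctl re-applies the kfdef):

# bump the manager's requests/limits and watch for further restarts
kubectl -n kubeflow set resources deployment seldon-controller-manager \
  --requests=cpu=100m,memory=300Mi --limits=cpu=500m,memory=500Mi
kubectl -n kubeflow get pods -w | grep seldon-controller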

@ljainker
Author

ljainker commented Jun 3, 2020

Hi @cliveseldon,
I will try this and let you know.
Thank you

ukclivecox removed the bug and triage (Needs to be triaged and prioritised accordingly) labels on Jun 4, 2020
@ljainker
Author

ljainker commented Jun 8, 2020

Hi @cliveseldon ,

I see the restart happening even after updating the resource requests/limits for the manager.

Error logs:

2020-06-08T05:43:07.458Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
2020-06-08T05:43:07.459Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-08T05:43:07.459Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-08T05:43:07.459Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-08T05:43:07.459Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-08T05:43:07.459Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2020-06-08T05:43:07.459Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2020-06-08T05:43:07.459Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2020-06-08T05:43:07.459Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2020-06-08T05:43:07.459Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1-seldondeployment"}
2020-06-08T05:43:07.459Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1-seldondeployment"}
2020-06-08T05:43:07.459Z INFO setup starting manager
2020-06-08T05:43:07.460Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2020-06-08T05:43:07.490Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"kubeflow","name":"controller-leader-election-helper","uid":"ac9afaa1-dd58-46d3-a093-c69699b4d4b4","apiVersion":"v1","resourceVersion":"18876"}, "reason": "LeaderElection", "message": "seldon-controller-manager-65f89bb6c6-k9qjm_ee0ab373-a94a-11ea-a085-9a1d0cc069c7 became leader"}
2020-06-08T05:43:07.560Z INFO controller-runtime.controller Starting Controller {"controller": "seldondeployment"}
2020-06-08T05:43:07.560Z INFO controller-runtime.certwatcher Updated current TLS certificate
2020-06-08T05:43:07.561Z INFO controller-runtime.certwatcher Starting certificate watcher
2020-06-08T05:43:07.660Z INFO controller-runtime.controller Starting workers {"controller": "seldondeployment", "worker count": 1}
2020-06-08T06:14:17.942Z ERROR setup problem running manager {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
main.main
    /workspace/main.go:114
runtime.main
    /usr/local/go/src/runtime/proc.go:200
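
For reference, one way to confirm the restart is the leader-election exit rather than a resource kill (a hedged check, using the pod name from the log above):

# reason should be Error (the "leader election lost" exit), not OOMKilled
kubectl -n kubeflow get pod seldon-controller-manager-65f89bb6c6-k9qjm \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl -n kubeflow describe pod seldon-controller-manager-65f89bb6c6-k9qjm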

@ukclivecox
Contributor

Is there anything special about your cluster? Which provider: AWS, on-prem?

What I don't understand is why leader election is lost. Does anything change in the namespace when this happens?

@ryandawsonuk
Contributor

Also, do all the other Kubeflow components work? I notice you're using a Kubernetes 1.17 cluster, and I don't think that's supported by Kubeflow yet: kubeflow/kubeflow#4822

@ukclivecox
Contributor

There are several related issues; here is one that might apply:
kubernetes/kubernetes#74340

It seems to be a networking issue. Do you see any errors in the kube-apiserver or related kube-system pods?
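
For example (the kube-apiserver and etcd pod names depend on the node name, so the ones below are placeholders):

# look for restarts or errors in the control-plane pods
kubectl -n kube-system get pods
kubectl -n kube-system logs kube-apiserver-<node-name> --tail=200
kubectl -n kube-system logs etcd-<node-name> --tail=200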

@ljainker
Author

ljainker commented Jun 8, 2020

Hi @cliveseldon and @ryandawsonuk,

I am facing the issue on an on-prem Kubernetes cluster; the same issue was seen when tested on a different platform as well.
There are two other components whose pods restart (tf-job-operator and pytorch-operator); I have raised a ticket: kubeflow/training-operator#1167

Also, we are facing a similar issue to the one mentioned in kubernetes/kubernetes#74340.

@gaocegege
Contributor

Is there anything special about your cluster? Which provider: AWS, on-prem?

What I don't understand is why leader election is lost. Does anything change in the namespace when this happens?

Such problems mainly come from network issues.

@ljainker I think you can take a deeper dive into the network to see whether the API server is healthy. I think it is not caused by Seldon Core, tf-operator, or pytorch-operator.
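
For example, a couple of basic health checks (a minimal sketch; on Kubernetes 1.17 both of these should be available):

# the API server should report ok; repeated leader-election losses usually
# show up here or in etcd as timeouts
kubectl get --raw /healthz
kubectl get componentstatuses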

@hemantha-kumara

Such problems mainly come from network issues.

@gaocegege yes, as mentioned above, we are seeing the network issue kubernetes/kubernetes#74340.
We are checking with our internal team, who can look into and support the k8s network issue.

Thanks for your help and suggestions

@ljainker
Author

ljainker commented Jun 8, 2020

Thank you very much for the support :)
I will now close this issue, as we will be looking into the network issue with our internal team.
