seldon-core-manager pod is getting restarted #1910
Comments
This looks like a previous version or another manager is running. It's failing leader election.
Hi @cliveseldon, Seldon Core is installed as part of the Kubeflow (version 1.0) deployment.
This may be because it's running into resource limits. In the master Kubeflow manifests we have updated to higher values: Until Kubeflow does a new release, you could modify the kfdef resources manually to update the resource requests/limits for the manager.
Hi @cliveseldon, I see the restart happening even after updating the resource requests/limits for the manager. Error logs:
Is there anything special about your cluster? Which provider: AWS, on-prem? What I don't understand is why leader election is lost. Does anything change in the namespace when this happens?
Also, do all the other Kubeflow components work? I notice you're using a Kubernetes 1.17 cluster, and I don't think that's supported by Kubeflow yet: kubeflow/kubeflow#4822
There are several related issues; here is one that might apply: It seems to be a networking issue. Do you see any errors in kube-apiserver or related kube-system pods?
Hi @cliveseldon and @ryandawsonuk, I am facing the issue on an on-prem Kubernetes cluster; the same issue was seen when testing on a different platform as well. We are also facing an issue similar to the one described in kubernetes/kubernetes#74340
Such problems mainly come from network issues. @ljainker, I think you could dig deeper into the network to see whether the API server is healthy. I don't think it is caused by Seldon Core, tf-operator, or pytorch-operator.
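To illustrate why a short API server or etcd disruption can be enough to restart the manager, here is a rough sketch of the leader's renew loop. It is a simplified model, not the client-go implementation: the real loop adds jitter and per-request timeouts. The 15s/10s/2s values are assumed to be the controller-runtime defaults for lease duration, renew deadline, and retry period; the manager in this deployment may be configured differently. `ElectionConfig` and `survives_outage` are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElectionConfig:
    # Assumed defaults; check the manager's flags/options for the real values.
    lease_duration: float = 15.0  # how long followers wait before taking over
    renew_deadline: float = 10.0  # how long the leader keeps retrying a renewal
    retry_period: float = 2.0     # pause between renewal attempts


def survives_outage(outage_seconds, config=ElectionConfig()):
    """Return True if a renewal attempt succeeds before the renew deadline.

    Simplified model: the API server is unreachable for `outage_seconds`
    starting at t=0, and the leader retries every `retry_period` seconds
    until `renew_deadline` has elapsed.
    """
    elapsed = 0.0
    while elapsed < config.renew_deadline:
        if elapsed >= outage_seconds:
            return True  # this attempt reaches the API server again
        elapsed += config.retry_period
    return False  # deadline passed: leadership is given up and the manager exits


if __name__ == "__main__":
    for outage in (3.0, 9.0):
        print(outage, survives_outage(outage))
```

Under these assumptions, an outage of a few seconds is absorbed by the retries, while one that spans the whole renew deadline makes the manager give up leadership and exit, which Kubernetes then reports as a pod restart.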
@gageorge yes, as mentioned above, we are seeing the network issue described in kubernetes/kubernetes#74340. Thanks for your help and suggestions.
Thank you very much for the support :)
The seldon-core-manager pod that comes as part of the Kubeflow installation keeps getting restarted.
Logs of the previous pod instance
kubectl logs -p seldon-controller-manager-7db97d6776-wjdjf -n kubeflow
2020-06-02T07:45:51.026Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
2020-06-02T07:45:51.028Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-02T07:45:51.028Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-02T07:45:51.028Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-02T07:45:51.028Z INFO controller-runtime.controller Starting EventSource {"controller": "seldondeployment", "source": "kind source: /, Kind="}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1alpha2, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1alpha2-seldondeployment"}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1alpha3, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1alpha3-seldondeployment"}
2020-06-02T07:45:51.028Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "machinelearning.seldon.io/v1, Kind=SeldonDeployment", "path": "/mutate-machinelearning-seldon-io-v1-seldondeployment"}
2020-06-02T07:45:51.029Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "machinelearning.seldon.io/v1, Kind=SeldonDeployment", "path": "/validate-machinelearning-seldon-io-v1-seldondeployment"}
2020-06-02T07:45:51.029Z INFO setup starting manager
2020-06-02T07:45:51.029Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2020-06-02T07:45:51.517Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"kubeflow","name":"controller-leader-election-helper","uid":"50c8cacb-6824-47a0-b9bf-4eb3727b43ae","apiVersion":"v1","resourceVersion":"35336"}, "reason": "LeaderElection", "message": "seldon-controller-manager-7db97d6776-wjdjf_149790ec-a4a5-11ea-b989-ca56f630d025 became leader"}
2020-06-02T07:45:51.729Z INFO controller-runtime.controller Starting Controller {"controller": "seldondeployment"}
2020-06-02T07:45:51.731Z INFO controller-runtime.certwatcher Updated current TLS certificate
2020-06-02T07:45:51.731Z INFO controller-runtime.certwatcher Starting certificate watcher
2020-06-02T07:45:51.830Z INFO controller-runtime.controller Starting workers {"controller": "seldondeployment", "worker count": 1}
E0602 09:46:15.320804 1 leaderelection.go:306] error retrieving resource lock kubeflow/controller-leader-election-helper: etcdserver: leader changed
E0602 09:46:19.862457 1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'seldon-controller-manager-7db97d6776-wjdjf_149790ec-a4a5-11ea-b989-ca56f630d025 stopped leading'
2020-06-02T09:46:19.863Z INFO controller-runtime.controller Stopping workers {"controller": "seldondeployment"}
2020-06-02T09:46:19.863Z ERROR setup problem running manager {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
main.main
/workspace/main.go:114
runtime.main
/usr/local/go/src/runtime/proc.go:200
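Most of the log above is normal startup output; the lines that matter are the few related to leader election. As a rough aid for triaging similar logs, the sketch below picks those lines out of a saved copy of the `kubectl logs -p` output. The matched substrings are taken from the log in this issue and may differ in other versions; `leader_election_events` and the event labels are hypothetical names for this example.

```python
import re

# Substrings copied from the log in this issue, mapped to short labels.
PATTERNS = {
    "acquired": re.compile(r"became leader"),
    "lock_error": re.compile(r"error retrieving resource lock"),
    "stopped": re.compile(r"stopped leading"),
    "lost": re.compile(r"leader election lost"),
}


def leader_election_events(log_text):
    """Return (line_number, label) pairs for lines that mention leader election."""
    events = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((number, label))
    return events


# Three lines quoted verbatim from the log above.
SAMPLE = "\n".join([
    "2020-06-02T07:45:51.029Z INFO setup starting manager",
    "E0602 09:46:15.320804 1 leaderelection.go:306] error retrieving resource lock "
    "kubeflow/controller-leader-election-helper: etcdserver: leader changed",
    '2020-06-02T09:46:19.863Z ERROR setup problem running manager {"error": "leader election lost"}',
])

if __name__ == "__main__":
    for number, label in leader_election_events(SAMPLE):
        print(number, label)
```

A `lock_error` shortly followed by `stopped` or `lost`, as in this log, suggests the manager could not reach the lock ConfigMap through the API server rather than a fault in the controller itself.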
Kubernetes version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T21:03:42Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}