Seldon deployment success/failure condition #255

Closed
jamborta opened this issue Oct 15, 2018 · 18 comments

Comments

@jamborta

Any recommendation on how to set the success/failure condition in the deployment specs? Ideally, I would like to track the deployment until it is all green. In the examples here there is a question mark.

@ukclivecox
Contributor

You could use the "status" of the SeldonDeployment which will be updated as the SeldonDeployment is created.

@jamborta
Author

@cliveseldon I tried to add
successCondition: status.state == Available

but the workflow does not seem to pick it up. It just hangs until it times out.

Not sure why as this command returns Available:
kubectl get seldondeployments.machinelearning.seldon.io deployment-name -n seldon -o jsonpath='{.status.state}'
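
As a cross-check outside Argo, the same field can be polled directly. This is a minimal sketch, assuming kubectl is on the PATH and already configured for the cluster. The helper names (read_state, wait_for_state) and the timeout values are made up for illustration, and Failed is assumed to be the state reported on failure:

```python
import subprocess
import time


def read_state(name, namespace="seldon", run=subprocess.run):
    """Return .status.state of a SeldonDeployment, or "" if it is not set."""
    cmd = [
        "kubectl", "get", "seldondeployments.machinelearning.seldon.io", name,
        "-n", namespace, "-o", "jsonpath={.status.state}",
    ]
    result = run(cmd, capture_output=True, text=True)
    # An empty string covers both a kubectl error and a missing status field.
    return result.stdout.strip() if result.returncode == 0 else ""


def wait_for_state(name, timeout_s=300, interval_s=5, read=read_state,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the state is Available (True) or Failed (False).

    Raises TimeoutError if neither state shows up within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        state = read(name)
        if state == "Available":
            return True
        if state == "Failed":
            return False
        sleep(interval_s)
    raise TimeoutError(f"{name}: no Available/Failed state after {timeout_s}s")
```

Calling wait_for_state("deployment-name") would block until one of the two states appears. If it times out while the kubectl command above still prints Available, the problem is more likely on the workflow side than in the SeldonDeployment itself.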

@ukclivecox
Contributor

Yes, that is strange. Is there anything in the Argo logs? You might need to dig into the Argo project to work out why this is not working.

@jamborta
Author

jamborta commented Oct 16, 2018

There is not much in the logs. It looks like the watch is not getting any data back:

time="2018-10-16T18:13:48Z" level=warning msg="cmd.Wait for kubectl get -w command for resource seldondeployment.machinelearning.seldon.io/deployment-name returned error exit status 1"
time="2018-10-16T18:13:48Z" level=info msg="Waiting for resource seldondeployment.machinelearning.seldon.io/deployment-name resulted in retryable error exit status 1"
time="2018-10-16T18:15:08Z" level=info msg="kubectl get seldondeployment.machinelearning.seldon.io/deployment-name -w -o json"
time="2018-10-16T18:15:08Z" level=warning msg="Json reader returned error EOF. Calling kill (usually superfluous)"

@jamborta
Author

@cliveseldon any suggestions?

@ukclivecox
Contributor

Can you see the "status" field in the SeldonDeployment, and does it match your condition?
If so, I would check any low-level Argo logs to see whether the check is erroring.
Happy to discuss on the Slack channel.

@ukclivecox
Contributor

Can you try with the latest 0.2.4-SNAPSHOT images?
From the error, it looks like no status field was returned. This may be a bug in old versions of seldon-core. @gsunner

@jamborta
Author

jamborta commented Oct 29, 2018

@cliveseldon

I updated to 0.2.4-SNAPSHOT, but it seems that the deployment is still not reporting the status field:

time="2018-10-29T18:55:19Z" level=info msg="{\"apiVersion\": \"machinelearning.seldon.io/v1alpha2\",\"kind\": \"SeldonDeployment\",\"metadata\": {\"annotations\": {\"kubectl.kubernetes.io/last-applied-configuration\": \"{\\\"apiVersion\\\":\\\"machinelearning.seldon.io/v1alpha2\\\",\\\"kind\\\":\\\"SeldonDeployment\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"labels\\\":{\\\"app\\\":\\\"seldon\\\"},\\\"name\\\":\\\"dep1\\\",\\\"namespace\\\":\\\"seldon\\\"},\\\"spec\\\":{\\\"annotations\\\":{\\\"deployment_version\\\":\\\"v1\\\",\\\"project_name\\\":\\\"Model Prediction\\\"},\\\"name\\\":\\\"dep1\\\",\\\"oauth_key\\\":\\\"dep1-key1\\\",\\\"oauth_secret\\\":\\\"dep1-secret1\\\",\\\"predictors\\\":[{\\\"annotations\\\":{\\\"predictor_version\\\":\\\"v1\\\"},\\\"componentSpecs\\\":[{\\\"spec\\\":{\\\"containers\\\":[{\\\"image\\\":\\\"repo-name/dep1-lookup:1.7\\\",\\\"imagePullPolicy\\\":\\\"IfNotPresent\\\",\\\"name\\\":\\\"lookup\\\",\\\"resources\\\":{\\\"requests\\\":{\\\"memory\\\":\\\"1Mi\\\"}}}]}},{\\\"spec\\\":{\\\"containers\\\":[{\\\"image\\\":\\\"repo-name/dep1-prediction:2018-10-29-18-45-44\\\",\\\"imagePullPolicy\\\":\\\"IfNotPresent\\\",\\\"name\\\":\\\"predictor\\\",\\\"resources\\\":{\\\"requests\\\":{\\\"memory\\\":\\\"1Mi\\\"}}}],\\\"terminationGracePeriodSeconds\\\":20}},{\\\"spec\\\":{\\\"containers\\\":[{\\\"image\\\":\\\"repo-name:2018-10-29-18-45-44\\\",\\\"imagePullPolicy\\\":\\\"IfNotPresent\\\",\\\"name\\\":\\\"external-to-internal-mapping\\\",\\\"resources\\\":{\\\"requests\\\":{\\\"memory\\\":\\\"1Mi\\\"}}}],\\\"terminationGracePeriodSeconds\\\":20}}],\\\"graph\\\":{\\\"children\\\":[{\\\"children\\\":[{\\\"endpoint\\\":{\\\"type\\\":\\\"REST\\\"},\\\"name\\\":\\\"predictor\\\",\\\"type\\\":\\\"MODEL\\\"}],\\\"endpoint\\\":{\\\"type\\\":\\\"REST\\\"},\\\"name\\\":\\\"external-to-internal-mapping\\\",\\\"parameters\\\":[{\\\"name\\\":\\\"external_to_internal\\\",\\\"type\\\":\\\"BOOL\\\",\\\"value\\\":true}],\\\"type\\\":\\\"TRANSFORMER\\
\"}],\\\"endpoint\\\":{\\\"type\\\":\\\"REST\\\"},\\\"name\\\":\\\"lookup\\\",\\\"type\\\":\\\"TRANSFORMER\\\"},\\\"name\\\":\\\"single-model\\\",\\\"replicas\\\":1}]}}\\n\"},\"clusterName\": \"\",\"creationTimestamp\": \"2018-10-29T18:49:16Z\",\"generation\": 1,\"labels\": {\"app\": \"seldon\"},\"name\": \"dep1\",\"namespace\": \"seldon\",\"resourceVersion\": \"10677\",\"selfLink\": \"/apis/machinelearning.seldon.io/v1alpha2/namespaces/seldon/seldondeployments/dep1\",\"uid\": \"561fda07-dbab-11e8-b216-0216d208bed4\"},\"spec\": {\"annotations\": {\"deployment_version\": \"v1\",\"project_name\": \"Model Prediction\"},\"name\": \"dep1\",\"oauth_key\": \"dep1-key1\",\"oauth_secret\": \"dep1-secret1\",\"predictors\": [{\"annotations\": {\"predictor_version\": \"v1\"},\"componentSpecs\": [{\"spec\": {\"containers\": [{\"image\": \"repo-name/dep1-lookup:1.7\",\"imagePullPolicy\": \"IfNotPresent\",\"name\": \"lookup\",\"resources\": {\"requests\": {\"memory\": \"1Mi\"}}}]}},{\"spec\": {\"containers\": [{\"image\": \"repo-name/dep1-prediction:2018-10-29-18-45-44\",\"imagePullPolicy\": \"IfNotPresent\",\"name\": \"predictor\",\"resources\": {\"requests\": {\"memory\": \"1Mi\"}}}],\"terminationGracePeriodSeconds\": 20}},{\"spec\": {\"containers\": [{\"image\": \"repo-name:2018-10-29-18-45-44\",\"imagePullPolicy\": \"IfNotPresent\",\"name\": \"external-to-internal-mapping\",\"resources\": {\"requests\": {\"memory\": \"1Mi\"}}}],\"terminationGracePeriodSeconds\": 20}}],\"graph\": {\"children\": [{\"children\": [{\"endpoint\": {\"type\": \"REST\"},\"name\": \"predictor\",\"type\": \"MODEL\"}],\"endpoint\": {\"type\": \"REST\"},\"name\": \"external-to-internal-mapping\",\"parameters\": [{\"name\": \"external_to_internal\",\"type\": \"BOOL\",\"value\": true}],\"type\": \"TRANSFORMER\"}],\"endpoint\": {\"type\": \"REST\"},\"name\": \"lookup\",\"type\": \"TRANSFORMER\"},\"name\": \"single-model\",\"replicas\": 1}]}}"
time="2018-10-29T18:55:19Z" level=info msg="failure condition '{status.state == [Failed]}' evaluated false"
time="2018-10-29T18:55:19Z" level=info msg="success condition '{status.state == [Available]}' evaluated false"

Also, the following returns nothing:

kubectl get seldondeployments.machinelearning.seldon.io dep1 -n seldon -o jsonpath='{.status.state}'
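
Both observations are consistent with the object having no status block at all. Here is a rough illustration of why an equality condition cannot match in that case. This is not Argo's actual evaluation code, just a hypothetical lookup/condition_met pair that walks a dotted path through the parsed JSON:

```python
import json


def lookup(obj, dotted_path):
    """Walk a dotted path such as "status.state"; return None if any key is missing."""
    current = obj
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def condition_met(obj, dotted_path, expected):
    """True only when the field exists and equals the expected value."""
    return lookup(obj, dotted_path) == expected


# Trimmed stand-ins for illustration: one object without status, one with it.
without_status = json.loads('{"kind": "SeldonDeployment", "spec": {"name": "dep1"}}')
with_status = json.loads('{"kind": "SeldonDeployment", "status": {"state": "Available"}}')

print(condition_met(without_status, "status.state", "Available"))  # False
print(condition_met(without_status, "status.state", "Failed"))     # False
print(condition_met(with_status, "status.state", "Available"))     # True
```

With no status key, neither the success nor the failure condition can ever be true, so a workflow waiting on them can only end by timing out.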

@ukclivecox
Contributor

Which version of k8s are you running?
Can you show the final output of the SeldonDeployment with:

kubectl get sdep -n seldon dep1 -o json

Can you run this when it's finally ready and running, to confirm that no status field is there?

@jamborta
Author

jamborta commented Oct 29, 2018

I am running Kubernetes 1.10.6.

The model is live (I can use it to predict). The command kubectl get sdep -n seldon dep1 -o json returns:

{
    "apiVersion": "machinelearning.seldon.io/v1alpha2",
    "kind": "SeldonDeployment",
    "metadata": {
        "annotations": {
            "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"machinelearning.seldon.io/v1alpha2\",\"kind\":\"SeldonDeployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"seldon\"},\"name\":\"dep1\",\"namespace\":\"seldon\"},\"spec\":{\"annotations\":{\"deployment_version\":\"v1\",\"project_name\":\"Prediction\"},\"name\":\"dep1\",\"oauth_key\":\"dep1-key1\",\"oauth_secret\":\"dep1-secret1\",\"predictors\":[{\"annotations\":{\"predictor_version\":\"v1\"},\"componentSpecs\":[{\"spec\":{\"containers\":[{\"image\":\"repo-name/dep1-lookup:1.7\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"lookup\",\"resources\":{\"requests\":{\"memory\":\"1Mi\"}}}]}},{\"spec\":{\"containers\":[{\"image\":\"repo-name/dep1-prediction:2018-10-29-18-45-44\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"predictor\",\"resources\":{\"requests\":{\"memory\":\"1Mi\"}}}],\"terminationGracePeriodSeconds\":20}},{\"spec\":{\"containers\":[{\"image\":\"repo-name/dep1-mapping:2018-10-29-18-45-44\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"external-to-internal-mapping\",\"resources\":{\"requests\":{\"memory\":\"1Mi\"}}}],\"terminationGracePeriodSeconds\":20}}],\"graph\":{\"children\":[{\"children\":[{\"endpoint\":{\"type\":\"REST\"},\"name\":\"predictor\",\"type\":\"MODEL\"}],\"endpoint\":{\"type\":\"REST\"},\"name\":\"external-to-internal-mapping\",\"parameters\":[{\"name\":\"external_to_internal\",\"type\":\"BOOL\",\"value\":true}],\"type\":\"TRANSFORMER\"}],\"endpoint\":{\"type\":\"REST\"},\"name\":\"lookup\",\"type\":\"TRANSFORMER\"},\"name\":\"single-model\",\"replicas\":1}]}}\n"
        },
        "clusterName": "",
        "creationTimestamp": "2018-10-29T18:49:16Z",
        "generation": 1,
        "labels": {
            "app": "seldon"
        },
        "name": "dep1",
        "namespace": "seldon",
        "resourceVersion": "10677",
        "selfLink": "/apis/machinelearning.seldon.io/v1alpha2/namespaces/seldon/seldondeployments/dep1",
        "uid": "561fda07-dbab-11e8-b216-0216d208bed4"
    },
    "spec": {
        "annotations": {
            "deployment_version": "v1",
            "project_name": "Prediction"
        },
        "name": "dep1",
        "oauth_key": "dep1-key1",
        "oauth_secret": "dep1-secret1",
        "predictors": [
            {
                "annotations": {
                    "predictor_version": "v1"
                },
                "componentSpecs": [
                    {
                        "spec": {
                            "containers": [
                                {
                                    "image": "repo-name/dep1-lookup:1.7",
                                    "imagePullPolicy": "IfNotPresent",
                                    "name": "lookup",
                                    "resources": {
                                        "requests": {
                                            "memory": "1Mi"
                                        }
                                    }
                                }
                            ]
                        }
                    },
                    {
                        "spec": {
                            "containers": [
                                {
                                    "image": "repo-name/dep1-prediction:2018-10-29-18-45-44",
                                    "imagePullPolicy": "IfNotPresent",
                                    "name": "predictor",
                                    "resources": {
                                        "requests": {
                                            "memory": "1Mi"
                                        }
                                    }
                                }
                            ],
                            "terminationGracePeriodSeconds": 20
                        }
                    },
                    {
                        "spec": {
                            "containers": [
                                {
                                    "image": "repo-name/dep1-mapping:2018-10-29-18-45-44",
                                    "imagePullPolicy": "IfNotPresent",
                                    "name": "external-to-internal-mapping",
                                    "resources": {
                                        "requests": {
                                            "memory": "1Mi"
                                        }
                                    }
                                }
                            ],
                            "terminationGracePeriodSeconds": 20
                        }
                    }
                ],
                "graph": {
                    "children": [
                        {
                            "children": [
                                {
                                    "endpoint": {
                                        "type": "REST"
                                    },
                                    "name": "predictor",
                                    "type": "MODEL"
                                }
                            ],
                            "endpoint": {
                                "type": "REST"
                            },
                            "name": "external-to-internal-mapping",
                            "parameters": [
                                {
                                    "name": "external_to_internal",
                                    "type": "BOOL",
                                    "value": true
                                }
                            ],
                            "type": "TRANSFORMER"
                        }
                    ],
                    "endpoint": {
                        "type": "REST"
                    },
                    "name": "lookup",
                    "type": "TRANSFORMER"
                },
                "name": "single-model",
                "replicas": 1
            }
        ]
    }
}

@ukclivecox
Contributor

Thanks. Are you able to confirm that the status is never set when you create the SeldonDeployment again? We need to understand whether the status is never set, or is initially set and then disappears.

@jamborta
Author

@cliveseldon I investigated this and these are the steps to reproduce:

  • Create a new cluster
  • Deploy a model: after the first deployment, the status appears (another bug I discovered is that if I deploy multiple steps and one of them fails, the status still says Available)
  • Delete the deployed model
  • Deploy the same model again: the status is no longer there
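
The steps above could be scripted roughly as follows. This is a sketch only: it assumes kubectl access to the cluster, a hypothetical manifest file dep1.json holding the SeldonDeployment, and a fixed wait between steps that is a guess rather than a measured value.

```python
import subprocess
import time


def kubectl(*args, run=subprocess.run):
    """Run a kubectl command and return its stdout with whitespace trimmed."""
    result = run(["kubectl", *args], capture_output=True, text=True)
    return result.stdout.strip()


def reproduce(manifest="dep1.json", name="dep1", namespace="seldon",
              wait_s=120, kubectl=kubectl, sleep=time.sleep):
    """Deploy, delete, redeploy; return the state seen after each deployment."""
    seen = []
    for attempt in range(2):
        kubectl("apply", "-f", manifest, "-n", namespace)
        sleep(wait_s)  # crude wait for the pods to come up
        state = kubectl("get", "sdep", name, "-n", namespace,
                        "-o", "jsonpath={.status.state}")
        seen.append(state)
        if attempt == 0:
            kubectl("delete", "sdep", name, "-n", namespace)
            sleep(wait_s)  # give the delete time to finish
    return seen
```

If the behaviour described above is present, the first entry of the returned list would be Available and the second an empty string.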

@ukclivecox
Contributor

Pull request #273 has changed when the status is set. Can you test with 0.2.4-SNAPSHOT?
Make sure you remove the old images if you are using the same Kubernetes instance, otherwise they won't be re-downloaded.

@jamborta
Author

jamborta commented Nov 14, 2018

@cliveseldon I updated to 2.4.0. It works for the first deployment, but fails after re-deploying the same model a few times (the status disappears from the seldondeployment object).

@ukclivecox
Contributor

Can you try with the 0.2.5-SNAPSHOT images? There was a recent fix that should resolve this.

@jamborta
Author

@cliveseldon 0.2.5-SNAPSHOT has not been published to helm (I am using the charts from here: https://storage.googleapis.com/seldon-charts)

@ukclivecox
Contributor

I've pushed the 0.2.5-SNAPSHOT chart to helm, if you can try it.

@ukclivecox
Contributor

Assuming fixed. Please reopen if still an issue.
