initialDelaySeconds: 10 sec is not enough for some models #323

Closed
sasvaritoni opened this issue Dec 3, 2018 · 4 comments

@sasvaritoni
Contributor

I have a model that loads a large amount of data (its init() takes around 45 seconds), and I deploy it with Kubeflow using the seldon-serve-simple-v1alpha2 prototype.

The liveness and readiness probes time out, so the container is restarted in a loop and never finishes initializing.
When I manually increase the initialDelaySeconds values for both probes in the K8s deployment, the pod gets initialized successfully. I tried this via the K8s dashboard by clicking Edit on the model deployment.

What is the proper way to increase these delay values?

The 10 seconds appears to be hardcoded in SeldonDeploymentOperatorImpl.java. Is there a way to change it when generating the SeldonDeployment with Kubeflow?

I suspect this can affect other models whose initialization involves loading a large amount of data.

@ukclivecox
Contributor

Have you tried adding your own liveness and readiness probes in the podTemplateSpec? The operator should only add the default ones if none exist already.
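For illustration, here is a rough Jsonnet sketch of where such probes would sit in a SeldonDeployment predictor: on the model container under componentSpecs[].spec.containers. It assumes the handler-wrapped probe form that Seldon's protobuf-based parsing expects (as the later comment in this thread shows). The container name my-model, the image my-model:0.1, and the 60-second delay are hypothetical placeholders; the delay is only meant to exceed the roughly 45-second init() described above.

```jsonnet
// Sketch only: a predictor fragment whose model container declares its own
// probes, so the operator should not inject its 10-second defaults.
// "my-model", "my-model:0.1" and the 60 s delay are hypothetical values.
{
  componentSpecs: [{
    spec: {
      containers: [{
        name: "my-model",
        image: "my-model:0.1",
        // tcpSocket is nested under "handler" for Seldon's proto-based parser.
        livenessProbe: {
          handler: { tcpSocket: { port: "http" } },
          initialDelaySeconds: 60,
          periodSeconds: 5,
          failureThreshold: 3,
        },
        readinessProbe: {
          handler: { tcpSocket: { port: "http" } },
          initialDelaySeconds: 60,
          periodSeconds: 5,
          failureThreshold: 3,
        },
      }],
    },
  }],
}
```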

@sasvaritoni
Contributor Author

sasvaritoni commented Dec 4, 2018

Do you mean to add it to the prototype?

I have now tried this:

  • Copied serve-simple-v1alpha2.jsonnet to a new file name
  • Edited the readiness and liveness probes under the "containers" section
  • Generated the component with ks (no errors reported)
  • Deployed the component with ks (no errors reported)

Then, "kubectl get seldondeployment ... -o json" returns:
...
"status": {
    "description": "Cannot find field: tcpSocket in message k8s.io.api.core.v1.Probe",
    "state": "Failed"
}

The deployment does not actually show up in K8s.

Maybe I am missing something here.
Is this the right approach, creating a new prototype based on the original seldon-serve-simple with the probes added? Or am I inserting them in the wrong place?

This is the full prototype file I used with the probes added:


// @apiVersion 0.1
// @name io.ksonnet.pkg.seldon-serve-simple-v1alpha2-test
// @description Serve a single seldon model for the v1alpha2 CRD (Seldon 0.2.X)
// @shortDescription Serve a single seldon model
// @param name string Name to give this deployment
// @param image string Docker image which contains this model
// @optionalParam replicas number 1 Number of replicas
// @optionalParam endpoint string REST The endpoint type: REST or GRPC
// @optionalParam pvcName string null Name of PVC
// @optionalParam imagePullSecret string null name of image pull secret

local k = import "k.libsonnet";

local pvcClaim = {
  apiVersion: "v1",
  kind: "PersistentVolumeClaim",
  metadata: {
    name: params.pvcName,
  },
  spec: {
    accessModes: [
      "ReadWriteOnce",
    ],
    resources: {
      requests: {
        storage: "10Gi",
      },
    },
  },
};

local seldonDeployment = {
  apiVersion: "machinelearning.seldon.io/v1alpha2",
  kind: "SeldonDeployment",
  metadata: {
    labels: {
      app: "seldon",
    },
    name: params.name,
    namespace: env.namespace,
  },
  spec: {
    annotations: {
      deployment_version: "v1",
      project_name: params.name,
    },
    name: params.name,
    predictors: [
      {
        annotations: {
          predictor_version: "v1",
        },
        componentSpecs: [{
          spec: {
            containers: [
              {
                image: params.image,
                imagePullPolicy: "IfNotPresent",
                name: params.name,
                volumeMounts+: if params.pvcName != "null" && params.pvcName != "" then [
                  {
                    mountPath: "/mnt",
                    name: "persistent-storage",
                  },
                ] else [],
                livenessProbe: {
                  tcpSocket: {
                    port: "http"
                  },
                  initialDelaySeconds: 100,
                  timeoutSeconds: 1,
                  periodSeconds: 5,
                  successThreshold: 1,
                  failureThreshold: 3
                },
                readinessProbe: {
                  tcpSocket: {
                    port: "http"
                  },
                  initialDelaySeconds: 100,
                  timeoutSeconds: 1,
                  periodSeconds: 5,
                  successThreshold: 1,
                  failureThreshold: 3
                }
              }
            ],
            terminationGracePeriodSeconds: 1,
            imagePullSecrets+: if params.imagePullSecret != "null" && params.imagePullSecret != "" then [
              {
                name: params.imagePullSecret,
              },
            ] else [],
            volumes+: if params.pvcName != "null" && params.pvcName != "" then [
              {
                name: "persistent-storage",
                volumeSource: {
                  persistentVolumeClaim: {
                    claimName: params.pvcName,
                  },
                },
              },
            ] else []
          },
        }],
        graph: {
          children: [
          ],
          endpoint: {
            type: params.endpoint,
          },
          name: params.name,
          type: "MODEL",
        },
        name: params.name,
        replicas: params.replicas,
      },
    ],
  },
};

k.core.v1.list.new([
  pvcClaim,
  seldonDeployment,
])

@ukclivecox
Contributor

You need to specify tcpSocket more explicitly by nesting it under a handler field. We use the Kubernetes protocol buffer definitions for parsing, which are stricter than the OpenAPI versions.

Try something like:

"readinessProbe": {
    "failureThreshold": 3,
    "initialDelaySeconds": 100,
    "periodSeconds": 5,
    "successThreshold": 1,
    "handler": {
        "tcpSocket": {
            "port": "http"
        }
    }
}
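Because the liveness and readiness probes in the prototype above are identical, a small local Jsonnet function can generate both in the handler-wrapped form suggested here. This is a sketch: tcpProbe and initDelay are hypothetical names, and the 100-second value is simply the one used in the prototype in this thread.

```jsonnet
// Hypothetical helper: builds a TCP probe in the form Seldon's
// protobuf-based parser accepts (tcpSocket nested under "handler").
local tcpProbe(initDelay) = {
  handler: {
    tcpSocket: { port: "http" },
  },
  initialDelaySeconds: initDelay,
  timeoutSeconds: 1,
  periodSeconds: 5,
  successThreshold: 1,  // liveness probes require successThreshold = 1
  failureThreshold: 3,
};

// Replacement for the two probe fields of the model container.
{
  livenessProbe: tcpProbe(100),
  readinessProbe: tcpProbe(100),
}
```

With these values the first liveness check runs about 100 seconds after the container starts, and the kubelet restarts the container only after 3 consecutive failures 5 seconds apart. Ignoring probe timeouts, that gives the model roughly 110 seconds to start listening, well above the ~45-second init().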

@sasvaritoni
Contributor Author

Wow, this helped :)
Thanks a lot!
