Batch Processing Exploration for Seldon Core #1413

Closed
axsaucedo opened this issue Feb 6, 2020 · 4 comments



axsaucedo commented Feb 6, 2020

Batch Processing Exploration for Seldon Core

We are currently exploring ways of enabling batch functionality within Seldon Core. The first section defines the terminology, and then we dive into the options for implementing it.

1. Batch types

We have been able to identify two different types of "batch jobs", which we have grouped based on functionality:

1.1. Non-long running batch jobs

1.2. Long running batch jobs

Long running is defined as jobs that would take more than a few dozen seconds to provide a response (and hence the HTTP/gRPC request/response architecture would not be suitable).

The scope of #1413 will be "Non-long running" batch jobs, as from our current research, extending Seldon Core functionality for this type of batch job seems feasible without major modifications.

The latter piece will be outside the scope of #1413, but we would still be interested in exploring it in the medium to long term. At the bottom we do provide some insight into our current thoughts.

2. Requirements

There are three key requirements that were identified for batch processing:

2.1. Jobs that are asynchronous, defined as being able to pull resources from a data source and push resources to another data source when it's done

2.2. Jobs that only run to process the data and terminate when the finite dataset has been processed

2.3. Jobs that encompass [2.2] and can be triggered on a schedule

The way we tackle them is by providing a solution for just [2.1] in isolation (as it's relevant for continuously running async jobs), and then for [2.1], [2.2] and [2.3] together, as that combination is relevant for general batch jobs.

3. Proposed Implementations

3.1. Asynchronous Jobs

[Diagram: asynchronous job architecture, with a data ingestor component in front of the Seldon deployment]

The design consists of one extra component which is in charge of:

- Ingesting data from a data source or data stream (continuously polling)
- Processing request(s) by sending them to the internal engine and waiting for the response
- Sending the response back to the data source or data stream
- Notifying an external system of data point completion (success, failure, etc.)

This implementation would also be able to scale as it consumes from the queue:
[Diagram: multiple data ingestor replicas scaling out as they consume from the queue]

This implementation could be set up as a container within the SeldonDeployment YAML. We can leverage componentSpecs to add an extra container that is not part of the graph, built from an image referenced here as "seldon-data-ingestor", which would be in charge of the actions above:

```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: sklearn
spec:
  name: iris
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldon-data-ingestor:0.1   # extra container, not part of the inference graph
          name: seldon-data-ingestor
    graph:
      children: []
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
```

3.2 Asynchronous Job + Termination + Scheduling

This section aims to achieve [2.1] + [2.2] + [2.3].

Note: Both options below ([3.2.1] and [3.2.2]) rest on the strong assumption that Kube Batch can address our requirements, including the schedule functionality; this will require further investigation to confirm feasibility (https://github.com/kubernetes-sigs/kube-batch).
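
Independently of kube-batch, the scheduling part of the requirement ([2.3]) maps onto a vanilla Kubernetes CronJob wrapping a run-to-completion Job ([2.2]). A minimal sketch, assuming the hypothetical seldon-data-ingestor image from [3.1] exits once the finite dataset has been processed:

```yaml
apiVersion: batch/v1beta1        # CronJob API version current at the time of writing
kind: CronJob
metadata:
  name: seldon-batch-ingest      # hypothetical name
spec:
  schedule: "0 2 * * *"          # trigger the batch run every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: seldon-data-ingestor
            image: seldon-data-ingestor:0.1   # hypothetical ingestor image from [3.1]
          restartPolicy: Never   # the Job completes when the ingestor exits
```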

3.2.1 - Option 1

This option consists of two fully external components: the DataIngestor component and the Batch Job component (which would start both the SeldonDeployment and the DataIngestor, and terminate them once the work is done).

[Diagram: external DataIngestor and batch job components driving the SeldonDeployment]

This design consists of two components:

- An extensible data ingestor container responsible for:
  - Ingesting data from a custom data source
  - Coordinating sending requests to the Executor and waiting for the responses
  - Being able to hold long-running requests (60min+)
  - Uploading results / notifying termination
  - Notifying an external system of batch completion (success, failure)
- A Kubernetes batch component (see the Job sketch after this list) in charge of:
  - Turning everything off when the data ingestor finishes or fails
  - Making logs available when the job is terminated
  - Providing status (running, success, failed, etc.)
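
For illustration, the batch component's responsibilities map closely onto what a plain Kubernetes Job already provides; a minimal sketch, assuming the hypothetical ingestor image from [3.1] and an arbitrary two-hour deadline:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: seldon-batch-ingest      # hypothetical name
spec:
  backoffLimit: 2                # retry a failed ingestor pod at most twice
  activeDeadlineSeconds: 7200    # mark the job failed if it runs longer than 2h (assumption)
  template:
    spec:
      containers:
      - name: seldon-data-ingestor
        image: seldon-data-ingestor:0.1   # hypothetical ingestor image from [3.1]
      restartPolicy: Never       # the pod's exit code marks the job complete or failed
```

Status and logs then come from standard tooling (`kubectl get job seldon-batch-ingest`, `kubectl logs job/seldon-batch-ingest`); what the batch component adds on top is tearing down the SeldonDeployment and notifying external systems.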

Disadvantages:

- Less integrated with the batch component (as it may require its own CRD)
- Potentially harder to handle long-running containers due to the dependency on the load balancer

Advantages:

- Ability to scale up data ingestor pods and SeldonDeployment pods based on the HPA (as per the diagram below)
- No changes / modifications to the Seldon CRD / operator logic
- May require a new operator to handle creation of the SeldonDeployment

This implementation would also be able to scale as more requests are sent by leveraging HPA:

[Diagram: data ingestor pods and SeldonDeployment pods scaling out via the HPA]
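
As a concrete sketch of the scaling setup, assuming the data ingestor runs as a Deployment named seldon-data-ingestor (hypothetical) and that CPU utilisation is a usable scaling signal:

```yaml
apiVersion: autoscaling/v2beta2  # HPA API version current at the time of writing
kind: HorizontalPodAutoscaler
metadata:
  name: seldon-data-ingestor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: seldon-data-ingestor   # hypothetical Deployment wrapping the ingestor
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU crosses 70%
```

The SeldonDeployment's own pods can carry an equivalent HPA, so both sides of the pipeline scale with load.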

3.2.2 - Option 2

[Diagram: data ingestor running inside the SeldonDeployment alongside the model]

This design consists of the same two components as [3.2.1] (the extensible data ingestor container and the Kubernetes batch component, with the same responsibilities as listed above); the difference is that the data ingestor is defined inside the SeldonDeployment itself rather than running as an external component.

Disadvantages:

- This approach wouldn't be able to scale using the HPA when loading data from a database (instead of a stream as per [3.1]): unlike [3.2.1], the data ingestor is inside the SeldonDeployment definition, so there would have to be some very complex logic to split the data across the multiple jobs.

4. Further exploration

The following section is only a high-level exploration of how the "long-running" batch job type could be achieved. The suggestion is that it may be possible to leverage an external framework like Airflow, through a Seldon Engine proxy that would wrap the API as follows:

[Diagram: Seldon Engine proxy wrapping the API for an external workflow framework]

This is something that we'll explore once we have a better understanding of the above (as this wouldn't encompass the async or job-termination requirements outlined above).

@axsaucedo
Contributor Author

Part of #1391

@evankanderson

Possibly of interest if you're talking about streaming systems:

https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/
https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/

They are medium-length, but should give a good set of terminology and baseline patterns as of 2016 for processing non-interactive high-throughput data.

@pisymbol

What is the status and/or roadmap behind this feature?

@ukclivecox
Contributor
