Batch Processing Exploration for Seldon Core #1413

Closed
axsaucedo opened this issue Feb 6, 2020 · 4 comments



axsaucedo commented Feb 6, 2020

Batch Processing Exploration for Seldon Core

We are currently exploring ways of enabling batch functionality within Seldon Core. The first section defines the terminology, and then we dive into the options for implementing it.

1. Batch types

We have been able to identify two different types of "batch jobs", which we have grouped based on functionality:

1.1. Non-long running batch jobs

1.2. Long running batch jobs

Long running is defined as jobs that would take more than a few dozen seconds to provide a response (and hence the HTTP/gRPC request/response architecture would not be suitable).

The scope of #1413 will be "Non-long running" batch jobs, as from our current research, extending Seldon Core functionality for this type of batch job seems feasible without major modifications.

The latter piece will be outside the scope of #1413, but we would still be interested in exploring it in the medium to long term. At the bottom we do provide some insight into our current thoughts.

2. Requirements

There are three key requirements that were identified for batch processing:

2.1. Jobs that are asynchronous, defined as being able to pull resources from a data source and push resources to another data source when it's done

2.2. Jobs that only run to process the data and terminate when the finite dataset has been processed

2.3. Jobs that encompass [2.2] and can be triggered on a schedule

The way we tackle them is by providing a solution for just [2.1] in isolation (as it's relevant for continuously running async jobs), and then for [2.1], [2.2] and [2.3] together, as that combination is relevant for general batch jobs.

3. Proposed Implementations

3.1. Asynchronous Jobs

[Diagram: asynchronous job architecture, with a data ingestor component in front of the Seldon deployment]

The design consists of one extra component which is in charge of:

- Ingesting data from a data source or data stream (continuously polling)
- Processing request(s) by sending them to the internal engine and waiting for the response
- Sending the response back to the data source or data stream
- Notifying an external system of data point completion (success, failure, etc.)

This implementation would also be able to scale as it consumes from the queue:
[Diagram: multiple data ingestor replicas scaling out as they consume from the queue]

This implementation could be set up as a container within the SeldonDeployment YAML. We can leverage componentSpecs to add an extra container that is not part of the graph, built from an image referenced here as "seldon-data-ingestor", which would be in charge of the actions above:

```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: sklearn
spec:
  name: iris
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldon-data-ingestor:0.1   # extra container, not part of the inference graph
          name: seldon-data-ingestor
    graph:
      children: []
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
```

3.2 Asynchronous Job + Termination + Scheduling

This section aims to achieve [2.1] + [2.2] + [2.3].

Note: Both options below ([3.2.1] and [3.2.2]) rest on the strong assumption that Kube Batch can address our requirements, including the schedule functionality; this will require further investigation to confirm feasibility (https://github.com/kubernetes-sigs/kube-batch).
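
Independently of kube-batch, the scheduling part of the requirement ([2.3]) maps onto a vanilla Kubernetes CronJob wrapping a run-to-completion Job ([2.2]). A minimal sketch, assuming the hypothetical seldon-data-ingestor image from [3.1] exits once the finite dataset has been processed:

```yaml
apiVersion: batch/v1beta1        # CronJob API version current at the time of writing
kind: CronJob
metadata:
  name: seldon-batch-ingest      # hypothetical name
spec:
  schedule: "0 2 * * *"          # trigger the batch run every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: seldon-data-ingestor
            image: seldon-data-ingestor:0.1   # hypothetical ingestor image from [3.1]
          restartPolicy: Never   # the Job completes when the ingestor exits
```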

3.2.1 - Option 1

This option consists of two fully external components: the DataIngestor component and the Batch Job component (which would start both the SeldonDeployment and the DataIngestor, and terminate them once the work is done).

[Diagram: external DataIngestor and batch job components driving the SeldonDeployment]

This design consists of two components:

- An extensible data ingestor container responsible for:
  - Ingesting data from a custom data source
  - Coordinating sending requests to the Executor and waiting for the responses
  - Being able to hold long-running requests (60min+)
  - Uploading results / notifying termination
  - Notifying an external system of batch completion (success, failure)
- A Kubernetes batch component (see the Job sketch after this list) in charge of:
  - Turning everything off when the data ingestor finishes or fails
  - Making logs available when the job is terminated
  - Providing status (running, success, failed, etc.)
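
For illustration, the batch component's responsibilities map closely onto what a plain Kubernetes Job already provides; a minimal sketch, assuming the hypothetical ingestor image from [3.1] and an arbitrary two-hour deadline:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: seldon-batch-ingest      # hypothetical name
spec:
  backoffLimit: 2                # retry a failed ingestor pod at most twice
  activeDeadlineSeconds: 7200    # mark the job failed if it runs longer than 2h (assumption)
  template:
    spec:
      containers:
      - name: seldon-data-ingestor
        image: seldon-data-ingestor:0.1   # hypothetical ingestor image from [3.1]
      restartPolicy: Never       # the pod's exit code marks the job complete or failed
```

Status and logs then come from standard tooling (`kubectl get job seldon-batch-ingest`, `kubectl logs job/seldon-batch-ingest`); what the batch component adds on top is tearing down the SeldonDeployment and notifying external systems.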

Disadvantages:

- Less integrated with the batch component (as it may require its own CRD)
- Potentially harder to handle long-running containers due to the dependency on the load balancer

Advantages:

- Ability to scale up data ingestor pods and SeldonDeployment pods based on the HPA (as per the diagram below)
- No changes / modifications to the Seldon CRD / operator logic
- May require a new operator to handle creation of the SeldonDeployment

This implementation would also be able to scale as more requests are sent by leveraging HPA:

[Diagram: data ingestor pods and SeldonDeployment pods scaling out via the HPA]
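
As a concrete sketch of the scaling setup, assuming the data ingestor runs as a Deployment named seldon-data-ingestor (hypothetical) and that CPU utilisation is a usable scaling signal:

```yaml
apiVersion: autoscaling/v2beta2  # HPA API version current at the time of writing
kind: HorizontalPodAutoscaler
metadata:
  name: seldon-data-ingestor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: seldon-data-ingestor   # hypothetical Deployment wrapping the ingestor
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU crosses 70%
```

The SeldonDeployment's own pods can carry an equivalent HPA, so both sides of the pipeline scale with load.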

3.2.2 - Option 2

[Diagram: data ingestor running inside the SeldonDeployment alongside the model]

This design consists of the same two components as [3.2.1] (the extensible data ingestor container and the Kubernetes batch component, with the same responsibilities as listed above); the difference is that the data ingestor is defined inside the SeldonDeployment itself rather than running as an external component.

Disadvantages:

- This approach wouldn't be able to scale using the HPA when loading data from a database (instead of a stream as per [3.1]): unlike [3.2.1], the data ingestor is inside the SeldonDeployment definition, so there would have to be some very complex logic to split the data across the multiple jobs.

4. Further exploration

The following section is only a high-level exploration of how the "long-running" batch job type could be achieved. The suggestion is that it may be possible to leverage an external framework like Airflow, through a Seldon Engine proxy that would wrap the API as follows:

[Diagram: Seldon Engine proxy wrapping the API for an external workflow framework]

This is something that we'll explore once we have a better understanding of the above (as this wouldn't encompass the async or job-termination requirements outlined above).

@axsaucedo
Contributor Author

Part of #1391

@evankanderson

Possibly of interest if you're talking about streaming systems:

https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/
https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/

They are medium-length, but should give a good set of terminology and baseline patterns as of 2016 for processing non-interactive high-throughput data.

@pisymbol

What is the status and/or roadmap behind this feature?

@ukclivecox
Contributor
