Batch processor enhancements through raw data parameter #3702

Closed
axsaucedo opened this issue Oct 26, 2021 — with Board Genius Sync · 1 comment · Fixed by #3718

axsaucedo commented Oct 26, 2021

As discussed, this will be explored in a way that addresses #2657, #3409, #3681 and #3408. More specifically, the batch processor will be extended to support raw JSON inputs in the form of valid SeldonMessage values, which will also support a limited and specified version of microbatching. This will ensure the float-to-int issue no longer occurs. It would also encompass extending the seldon_client to support raw data for predict parameters.

Microbatching will be handled as follows:

Input is

{"names": ["a", "b", "c"], "data": {"ndarray": [[1,2,3]]}, "meta": { "tags": {"internal-id": 1} }}
{"names": ["a", "b", "c"], "data": {"ndarray": [[1,2,3]]}, "meta": { "tags": {"internal-id": 2} }}

If the microbatch value is 1, each request is sent as is. However, if the microbatch value is 2, microbatching is limited to the ndarray and tensor data provided: the merged request is sent without the meta of the individual inputs, and with the names of the first input. As with other requests, it is still sent with the unique batch ID:

{"names": ["a", "b", "c"], "data": {"ndarray": [[1,2,3], [1,2,3]]}, "meta": {"tags": {"batch-uid": …}}}
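
The merging step described above could look roughly like the following sketch. The helper name `build_microbatch` and the `batch_uid` argument are hypothetical and not the batch processor's actual API. It assumes every input line is a valid SeldonMessage carrying `ndarray` data, and it takes `names` from the first input only.

```python
import json


def build_microbatch(raw_lines, batch_uid):
    """Merge raw SeldonMessage JSON lines into a single request.

    Hypothetical helper: only ndarray payloads are handled in this sketch.
    """
    messages = [json.loads(line) for line in raw_lines]
    rows = []
    for message in messages:
        data = message.get("data", {})
        if "ndarray" not in data:
            raise ValueError("this sketch only merges ndarray payloads")
        rows.extend(data["ndarray"])
    return {
        # names are taken from the first input only
        "names": messages[0].get("names", []),
        "data": {"ndarray": rows},
        # per-input meta is dropped; only the batch ID is attached
        "meta": {"tags": {"batch-uid": batch_uid}},
    }


lines = [
    '{"names": ["a", "b", "c"], "data": {"ndarray": [[1,2,3]]}, "meta": {"tags": {"internal-id": 1}}}',
    '{"names": ["a", "b", "c"], "data": {"ndarray": [[1,2,3]]}, "meta": {"tags": {"internal-id": 2}}}',
]
request = build_microbatch(lines, batch_uid="example-uid")
```

The original `meta` of each input has to be kept aside by the caller so it can be merged back into the response rows later.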

Let's say that the response would contain the following data

{"names": ["a", "b", "c"], "data": {"ndarray": [[9,9,9], [8,8,8]]}, "meta": { "tags": {"extra_id": 0}}}

The response would then be split per input, merging the original meta content of each request with the batch meta params (batch uid) and the meta returned by the model, giving the following output:

{"names": ["a", "b", "c"], "data": {"ndarray": [[9,9,9]]}, "meta": {"tags": {"batch-uid": …, "internal-id": 1, "extra_id": 0}}}
{"names": ["a", "b", "c"], "data": {"ndarray": [[8,8,8]]}, "meta": {"tags": {"batch-uid": …, "internal-id": 2, "extra_id": 0}}}
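
A minimal sketch of this split-and-merge step is shown below. The function name `split_response` and its arguments are hypothetical. It assumes each original request contributed exactly one row, and that on a key clash the response tags win over the original request tags (the issue does not specify the precedence).

```python
def split_response(response, original_metas, batch_uid):
    """Split a micro-batch response into one message per original request."""
    rows = response["data"]["ndarray"]
    if len(rows) != len(original_metas):
        raise ValueError("response row count does not match the micro-batch size")
    response_tags = response.get("meta", {}).get("tags", {})
    outputs = []
    for row, meta in zip(rows, original_metas):
        tags = {"batch-uid": batch_uid}
        tags.update(meta.get("tags", {}))  # tags of the original request
        tags.update(response_tags)         # tags returned by the model (assumed to win)
        outputs.append({
            "names": response.get("names", []),
            "data": {"ndarray": [row]},
            "meta": {"tags": tags},
        })
    return outputs


response = {
    "names": ["a", "b", "c"],
    "data": {"ndarray": [[9, 9, 9], [8, 8, 8]]},
    "meta": {"tags": {"extra_id": 0}},
}
original_metas = [
    {"tags": {"internal-id": 1}},
    {"tags": {"internal-id": 2}},
]
outputs = split_response(response, original_metas, batch_uid="example-uid")
```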

axsaucedo added the triage label Oct 26, 2021

RafalSkolasinski commented

To clarify more on meta in the output file: there are three tags inserted by the batch component, e.g. a single row in the output would contain:

{
  ...,
  "meta": {
    "tags": {
      "batch_id": "3d6acd6c-3744-11ec-951d-c3e49d18a2d2",
      "batch_index": 2,
      "batch_instance_id": "3d6b3f88-3744-11ec-951d-c3e49d18a2d2"
    }
  }
}

where:

  • batch_id - unique for batch job as a whole, same value for each instance (row) in the input/output file
  • batch_index - index (number of row) at which given instance in output.txt was present in input.txt
  • batch_instance_id - unique identifier of each instance
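
As an illustration of how the three tags relate, here is a small sketch. The helper `make_batch_tags` is hypothetical, and the use of `uuid.uuid1()` is an assumption based only on the shape of the sample IDs above, not on the batch processor's code.

```python
import uuid


def make_batch_tags(batch_id, batch_index):
    """Build the three batch tags for one instance (hypothetical helper)."""
    return {
        "batch_id": batch_id,                     # shared by every row of the job
        "batch_index": batch_index,               # row position in the input file
        "batch_instance_id": str(uuid.uuid1()),   # unique per instance (assumed uuid1)
    }


batch_id = str(uuid.uuid1())
rows = [{"meta": {"tags": make_batch_tags(batch_id, i)}} for i in range(3)]
```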

With BATCH_SIZE=1 (each request sends only one instance), batch_instance_id is equal to the Seldon-Puid sent to the model server. This is also used as the index when logging into ELK: id = {seldon-puid}, as is.

However, if BATCH_SIZE = N > 1, then a mini-batch contains N instances sent with a single {seldon-puid} = batch_instance_id[0]. These get logged into ELK with id = {seldon-puid}-item[n], where n = 0, ..., N - 1 identifies the instances in the mini-batch, effectively being batch_instance_id[0]-item[n].

The problem is that in output.txt the instances from that mini-batch with n = 1, ..., N - 1 currently have a batch_instance_id that differs from the related seldon-puid, and so are present in ELK under different indices.

We should probably get these in sync and set batch_instance_id to follow the same pattern, so that all instances grouped in a single mini-batch get {seldon-puid}-item[n] with n = 0, ..., N - 1.
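
The proposed pattern could be sketched as follows. The helper name `minibatch_instance_ids` is hypothetical, and the code reads the `-item[n]` notation literally; the exact suffix format used when logging into ELK is an assumption here.

```python
def minibatch_instance_ids(seldon_puid, size):
    """Derive batch_instance_id values for every instance in one mini-batch."""
    if size == 1:
        # A single instance keeps the seldon-puid unchanged.
        return [seldon_puid]
    # Assumed literal format: "<puid>-item[<n>]" for n = 0, ..., size - 1.
    return [f"{seldon_puid}-item[{n}]" for n in range(size)]


ids = minibatch_instance_ids("example-puid", 3)
```

With this scheme, the IDs written to output.txt and the IDs logged into ELK would be derived from the same seldon-puid.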

seldondev removed the triage label Oct 28, 2021 (with Board Genius Sync)