Question: Concurrency in Model REST server #453

Closed · NegatioN opened this issue Feb 20, 2019 · 5 comments · Fixed by #684

Comments

@NegatioN

I was wondering if you have any tests on how the REST API for MODEL-type deployments handles concurrency?
I'm guessing this is the part of the code that handles it, so I would expect the Tornado coroutines to do something: https://github.com/SeldonIO/seldon-core/blob/master/python/seldon_core/model_microservice.py#L228-L262

However, when using ab to throw some concurrent traffic at the API, I see the processing time scale almost linearly with the concurrency level, no matter whether the model is extremely simple (a single-layer model going from 100x1) or more complex (embeddings, RNNs, dot products).

Example:
Note: both the Docker server and ab are currently running on the same machine, so I don't expect everything to fly off the moon, but I did expect some concurrency on a 4-core machine with hyperthreading.
The problem is identical when running the server via the CLI (seldon-core-microservice Model REST), just with less network overhead.

Concurrency 1

ab -p model_input -T "application/x-www-form-urlencoded" -c 1 -n 1000 http://0.0.0.0:5000/predict
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 0.0.0.0 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        Werkzeug/0.14.1
Server Hostname:        0.0.0.0
Server Port:            5000

Document Path:          /predict
Document Length:        167 bytes

Concurrency Level:      1
Time taken for tests:   28.145 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      346000 bytes
Total body sent:        235000
HTML transferred:       167000 bytes
Requests per second:    35.53 [#/sec] (mean)
Time per request:       28.145 [ms] (mean)
Time per request:       28.145 [ms] (mean, across all concurrent requests)
Transfer rate:          12.01 [Kbytes/sec] received
                        8.15 kb/s sent
                        20.16 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    17   28  36.0     26    1039
Waiting:       17   28  36.0     26    1038
Total:         17   28  36.0     26    1039

Percentage of the requests served within a certain time (ms)
  50%     26
  66%     28
  75%     29
  80%     29
  90%     31
  95%     32
  98%     36
  99%     39
 100%   1039 (longest request)

Concurrency 5

ab -p model_input -T "application/x-www-form-urlencoded" -c 5 -n 1000 http://0.0.0.0:5000/predict
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 0.0.0.0 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        Werkzeug/0.14.1
Server Hostname:        0.0.0.0
Server Port:            5000

Document Path:          /predict
Document Length:        167 bytes

Concurrency Level:      5
Time taken for tests:   27.359 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      346000 bytes
Total body sent:        235000
HTML transferred:       167000 bytes
Requests per second:    36.55 [#/sec] (mean)
Time per request:       136.797 [ms] (mean)
Time per request:       27.359 [ms] (mean, across all concurrent requests)
Transfer rate:          12.35 [Kbytes/sec] received
                        8.39 kb/s sent
                        20.74 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:    30  136  81.6    131    1207
Waiting:       29  122  72.2    118    1156
Total:         30  136  81.6    131    1207

Percentage of the requests served within a certain time (ms)
  50%    131
  66%    141
  75%    155
  80%    162
  90%    182
  95%    199
  98%    233
  99%    255
 100%   1207 (longest request)
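
(As a rough Python equivalent of the ab runs above, purely a sketch: it assumes the legacy Seldon REST protocol, where the SeldonMessage JSON is posted as a form field named json, matching the x-www-form-urlencoded content type used here; the payload shape is illustrative only.)

# Sketch of a client-side concurrency check, roughly mirroring the ab runs above.
import json
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://0.0.0.0:5000/predict"
# Illustrative payload; replace with whatever your model actually expects.
PAYLOAD = {"json": json.dumps({"data": {"ndarray": [[0.0] * 100]}})}

def one_request(_):
    start = time.perf_counter()
    requests.post(URL, data=PAYLOAD).raise_for_status()
    return time.perf_counter() - start

def run(concurrency, n=1000):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(one_request, range(n)))
    print(f"c={concurrency}: mean={sum(latencies) / n * 1000:.1f} ms, "
          f"p95={latencies[int(0.95 * n)] * 1000:.1f} ms")

for c in (1, 5):
    run(c)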

Am I doing something wrong here, or is the REST API intended to scale at the pod level rather than at the thread level?

Or are you intending for gRPC to be used in this case?

Sorry for flooding your issues! :)

@NegatioN
Author

And right after posting I found this on the roadmap (#108). I'm guessing that since Flask sits at the top level, we're effectively limited to a single thread?

https://github.com/SeldonIO/seldon-core/blob/master/python/seldon_core/model_microservice.py#L57
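
(A standalone sketch of the limitation, not the seldon-core code itself: depending on the Flask version, app.run() serves requests on a single thread by default, and even a threaded dev server can be held back by the GIL for CPU-bound predict() work, which would match the near-linear latency growth above.)

# Standalone illustration only, not seldon-core code: Flask/Werkzeug's built-in
# dev server. Depending on the Flask version, app.run() is single-threaded by
# default, so concurrent clients simply queue behind each other; even with
# threaded=True, CPU-bound work in predict() can still be serialized by the GIL.
import time

from flask import Flask

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    time.sleep(0.025)  # stand-in for ~25 ms of model inference
    return "ok"

if __name__ == "__main__":
    # With a single-threaded server, 5 concurrent clients see roughly 5x the latency.
    # A pre-forking WSGI server (gunicorn, uWSGI, ...) avoids both limits.
    app.run(host="0.0.0.0", port=5000)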

@ukclivecox
Contributor

Hi, yes, for the roadmap we are presently considering gunicorn or Tornado. We want to prioritise this work and would welcome your feedback. I see SageMaker uses gunicorn. I think there should be a configurable number of threads started; we would welcome your thoughts.
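
(Purely to illustrate what a configurable worker/thread count could look like with gunicorn, a sketch of a gunicorn.conf.py; the environment variable names here are hypothetical, not existing seldon-core options.)

# gunicorn.conf.py -- illustrative sketch only; the env var names are hypothetical.
import multiprocessing
import os

bind = "0.0.0.0:5000"

# Pre-forked worker processes sidestep the GIL for CPU-bound inference.
workers = int(os.environ.get("GUNICORN_WORKERS", multiprocessing.cpu_count()))

# Threads per worker mainly help when predict() spends time waiting on I/O.
threads = int(os.environ.get("GUNICORN_THREADS", 1))
worker_class = "gthread" if threads > 1 else "sync"

Launched with something like gunicorn -c gunicorn.conf.py microservice:app (module path hypothetical), both counts become tunable per deployment.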

@NegatioN
Author

I definitely like the intuitive options available when running TF Serving, like --rest_api_num_threads, which would correspond to the number of threads allowed for Tornado or gunicorn here.

Beyond that I'm not sure I have much to add. I haven't looked at gunicorn or Tornado much. We have some internal Java applications that solve this in a similar way with thread pools. It seems to be the norm, and having that kind of option is very useful for us, at least in a production setting.

@tszumowski

Hi all. This thread is great timing! I was just playing around with some of the Seldon tutorials this morning and dove into the source, curious about how the REST endpoints were being served. I figured I'd chime in with some recent experience with gunicorn.

Regarding a server option, using just the built-in Flask webserver can be dangerous in deployment. Per the docs:

While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well. Some of the options available for properly running Flask in production are documented here.

On that page they list several servers: gunicorn, uWSGI, Tornado, etc. I've found there are a lot of (opinionated) posts online about which is ideal (example, example), and there are also some benchmarks out there for the different servers (example, example).

Our engineering organization has used Tornado extensively in the past. More recently, for ML deployment experimentation, I've been focused on gunicorn, for no reason other than convenience: both AWS SageMaker's scikit-learn example and MLflow's server used it, and it seemed simple, so I figured "why not".

For gunicorn, I believe the power really comes from how the workers are defined and how many workers you can allocate per pod on Kubernetes. Theoretically one can just let Kubernetes scale a single-worker server, but pod scaling is probably less reactive than worker management within a single pod.

For my tests, I was using some Google Cloud datastores within the deployed gunicorn service (a prototype, initially). It turned out the gevent worker had some incompatibilities (link, link), so that may be something to keep in mind too.
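
(One workaround that gets suggested for the gevent/grpc clash is grpc's experimental gevent support, initialised in each worker before any channels are created; sketched below as a gunicorn post_fork hook. Whether it is needed or sufficient depends on the grpcio and gevent versions in play.)

# Sketch only: a server hook in gunicorn.conf.py applying grpc's experimental
# gevent support before the worker creates any channels. Version-dependent.
worker_class = "gevent"

def post_fork(server, worker):
    from grpc.experimental import gevent as grpc_gevent
    grpc_gevent.init_gevent()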

Regardless of the solution Seldon selects, it may be worth adding a disclaimer somewhere in the docs that the REST endpoint currently uses Flask's built-in web server, which Flask recommends against using in production.

@ukclivecox
Contributor

Thanks @tszumowski, very useful info and comments. I think we are probably also viewing gunicorn as the likely addition.

@ukclivecox ukclivecox added this to To do in 0.2.7 Feb 21, 2019
@ukclivecox ukclivecox removed this from To do in 0.2.7 Apr 4, 2019
@ukclivecox ukclivecox modified the milestones: 0.2.x, 0.3.x Jun 3, 2019
agrski added a commit that referenced this issue Dec 2, 2022