gunicorn + nginx issues #479

wghilliard · 2015-12-17T22:25:58Z

(1) I am currently using DIGITS on large data samples (225,000+), and after I train a model I am trying to tune my data samples by using classification stats via the API's classify many. I have noticed that when I attempt to classify more than 4,500 images, the server times out and produces a 502 error. If I append "--timeout=300" to the gunicorn call in /etc/init/nvidia-digits-server.conf, the server instead times out after 60 seconds and produces a 504 error. After some reading I found out that the nginx default is 60 seconds and that you must also add "proxy_read_timeout 300;" to /etc/nginx/sites-available/digits.site. Maybe it was due to my naivety, but I didn't find this fix immediately obvious. Should the docs be modified to cite this issue for users of my skill level?
(2) I noticed that when posting to the API's classify many, the webapp will hang and return either a timeout or an extremely delayed response. I would attribute this to having only a single gunicorn worker. I tried appending "--workers=4" to the aforementioned gunicorn call, which resulted in intermittent 404 errors when viewing jobs via the webapp, along with failed model training. With this in mind, my question is: Is DIGITS designed to run with multiple worker instances? If not, why wrap Flask with gunicorn?
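
For anyone hitting the same pair of errors: a request has to fit under both timeouts. gunicorn's default worker timeout is 30 seconds and nginx's default proxy_read_timeout is 60 seconds, which likely explains the 502 first (gunicorn kills the worker) and the 504 once only the gunicorn limit is raised. The sketch below just reports what each file currently sets. The two paths are the ones named in this post; the `first_value` helper and the `GUNICORN_CONF` / `NGINX_CONF` override variables are hypothetical, and the match assumes values given in plain seconds.

```shell
#!/bin/sh
# Sketch: report the timeout configured at each layer so a mismatch is easy to spot.
# The two default paths come from the post above; GUNICORN_CONF, NGINX_CONF and
# first_value are hypothetical names used only for this illustration.

# Print the first number that follows KEYWORD in FILE, or FALLBACK if none is found.
# Note: this naive match also picks up commented-out lines.
first_value() {   # usage: first_value KEYWORD FILE FALLBACK
    value=$(sed -n "s/.*$1[ =]*\([0-9][0-9]*\).*/\1/p" "$2" 2>/dev/null | head -n 1)
    echo "${value:-$3}"
}

gunicorn_conf=${GUNICORN_CONF:-/etc/init/nvidia-digits-server.conf}
nginx_conf=${NGINX_CONF:-/etc/nginx/sites-available/digits.site}

# Fallbacks are the documented defaults: 30 s for gunicorn, 60 s for nginx.
gunicorn_timeout=$(first_value '--timeout' "$gunicorn_conf" 30)
nginx_timeout=$(first_value 'proxy_read_timeout' "$nginx_conf" 60)

echo "gunicorn --timeout:       ${gunicorn_timeout}s"
echo "nginx proxy_read_timeout: ${nginx_timeout}s"

if [ "$nginx_timeout" -lt "$gunicorn_timeout" ]
then
    echo "nginx will return 504 before gunicorn gives up on the worker"
fi
```

After changing either file, nginx needs a reload and the DIGITS service a restart before the new values take effect.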

lukeyeager · 2015-12-18T15:40:04Z

Re (1):

Thanks for reporting the issue! This is a known issue and needs to be addressed in the release notes/documentation. Here's a previous discussion of the issue:

> I think the best way to solve this would be with the intermediate results / progress bar feature I mentioned earlier. Then the page can return quickly with a "0/1,300,000 images processed" page, and then slowly return the data as it comes through. That's a lot of work, and won't get implemented quickly. For now I'd suggest breaking up your huge textfile into manageable chunks.
> #70 (comment)

The basic issue is that an inference request doesn't return a response until all of the classifications are complete (which could take a very long time). We've prioritized other things for the v3.0 release and didn't have time to get around to it.

So far, we've been treating DIGITS as primarily a training solution without working to make it a quality inference solution:

> DIGITS is not a deployment solution, it is a training system. As such, it has been designed assuming that datasets are created and models are trained within DIGITS. Also, the emphasis has been on making training efficient, not inference.
> #49 (comment)

But realistically, people are using it for inference and it needs to work - or at least not crash the server.

I don't think we can fix this for the v3.0 release because it's a big change that touches some important parts of the code and I don't want to introduce major bugs in-between the RC release and the final release. But it should be documented. I'll work on that.

lukeyeager · 2015-12-18T15:42:10Z

Re (2):

IIRC, I ran into issues with the SocketIO server when trying to use multiple worker processes. Gunicorn is used mainly because it handles server crashes well. Caffe tends to SIGABRT on errors, which can bring down the DIGITS server if it occurs during inference (which is another big issue that needs to get fixed). Gunicorn is nice because it detects when the worker process goes down and restarts it automatically.
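
The restart behavior described here can be pictured with a small shell loop. This is only a rough analogue, not how gunicorn is implemented: `supervise` is a hypothetical helper that reruns a command whenever it exits with a non-zero status, up to a limit. A process killed by SIGABRT shows up in the shell as exit status 134 (128 + 6), so a Caffe abort would take the restart path.

```shell
#!/bin/sh
# Sketch of a restart-on-crash loop, loosely analogous to a process supervisor.
# supervise is a hypothetical helper, not part of gunicorn or DIGITS.

# Run COMMAND; if it exits non-zero, run it again, at most MAX_RESTARTS times.
# Returns 0 once the command succeeds, or the last failing status on giving up.
supervise() {   # usage: supervise MAX_RESTARTS COMMAND [ARGS...]
    max_restarts=$1
    shift
    restarts=0
    while :
    do
        if "$@"
        then
            return 0
        else
            status=$?
        fi
        if [ "$restarts" -ge "$max_restarts" ]
        then
            echo "giving up after $restarts restarts (last status $status)" >&2
            return "$status"
        fi
        restarts=$((restarts + 1))
        echo "worker exited with status $status; restart $restarts" >&2
    done
}
```

A real supervisor also has to handle signals, logging, and back-off between restarts, which this loop leaves out.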

lukeyeager · 2015-12-18T16:00:01Z

BTW, I'm happy to review submissions to fix any of these issues!

wghilliard · 2015-12-18T16:50:07Z

> But realistically, people are using it for inference and it needs to work - or at least not crash the server.

Maybe I'm overstepping the scope of this discussion, but after reviewing the Scheduler implementation it seems like the entirety of the processing is done inside the Flask application itself. Would an alternative implementation that abstracted the application into a 2/3-layer model, explicitly separating Flask from the core neural network functions, be more appropriate in this scenario? For instance, Celery could be utilized to run the Scheduler and other job processes, allowing Flask to assume an observe-and-report role and delegating the heavy lifting to something more suited for it. Just my 2¢!

lukeyeager · 2015-12-18T16:51:41Z

In the meantime, if you want to get some work done, you might want to use a script similar to this:

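#!/bin/sh -e
# Classify each image in a list file with a separate classify_one request.

if [ "$#" -ne 3 ]
then
    echo "Usage: $0 server job_id filename" >&2
    exit 1
fi

server=$1
job_id=$2
filename=$3

# The first whitespace-separated field of each non-blank line is the image.
awk 'NF { print $1 }' "$filename" | while read -r line
do
    echo "$line"
    # Quote the URL so the shell does not treat "?" as a glob character.
    curl -XPOST \
        "$server/models/images/classification/classify_one.json?job_id=$job_id" \
        -F "image_url=$line"
done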

If you break up each image into a separate task, DIGITS will respond much better.
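
A middle ground between one request per image and one huge request is the chunking suggested in the quoted #70 comment. The sketch below splits a list file with `split` and posts each chunk separately. `split_list` and `post_chunks` are hypothetical helpers, and the `classify_many.json` path and `image_list` form field are assumptions modeled on the `classify_one.json` call above, so check them against your DIGITS version before relying on them.

```shell
#!/bin/sh
# Sketch: split a large image list into chunks and submit them one at a time.
# split_list and post_chunks are hypothetical helpers. The classify_many.json
# path and the image_list form field are assumptions, not confirmed API details.

# Split LIST into files of at most LINES lines under OUTDIR and print their paths.
split_list() {   # usage: split_list LIST LINES OUTDIR
    mkdir -p "$3"
    split -l "$2" "$1" "$3/chunk_"
    ls "$3"/chunk_*
}

# Post each chunk as its own request so no single request runs for very long.
post_chunks() {   # usage: post_chunks SERVER JOB_ID OUTDIR
    for chunk in "$3"/chunk_*
    do
        echo "posting $chunk" >&2
        curl -XPOST \
            "$1/models/images/classification/classify_many.json?job_id=$2" \
            -F "image_list=@$chunk"
    done
}
```

Calling `split_list` with a chunk size that finishes well inside the proxy timeout, then `post_chunks` on the same directory, keeps each request short enough to return.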

lukeyeager · 2015-12-18T17:12:03Z

> Maybe I'm overstepping the scope of this discussion, but after reviewing the Scheduler implementation it seems like the entirety of the processing is done inside the Flask application itself. Would an alternative implementation that abstracted the application into a 2/3-layer model, explicitly separating Flask from the core neural network functions, be more appropriate in this scenario? For instance, Celery could be utilized to run the Scheduler and other job processes, allowing Flask to assume an observe-and-report role and delegating the heavy lifting to something more suited for it. Just my 2¢!

That is a bit outside the scope of this particular issue, but those are great suggestions. It'd be great if you could open a new issue for this.

lukeyeager · 2016-10-31T17:06:13Z

gunicorn was removed in #1127.

lukeyeager added the bug label Dec 18, 2015

lukeyeager mentioned this issue Feb 4, 2016: Save job information to a database [DON'T MERGE] #566 (Closed, 9 tasks)

lukeyeager mentioned this issue Feb 17, 2016: Inference jobs #573 (Merged, 15 tasks)

lukeyeager mentioned this issue Feb 29, 2016: View partial results for inference #611 (Open)

lukeyeager closed this as completed Oct 31, 2016
