gunicorn + nginx issues #479

Closed
wghilliard opened this issue Dec 17, 2015 · 7 comments

@wghilliard

  1. I am currently using DIGITS on large data samples (225,000+). After training a model, I tune my data samples using classification stats from the API's classify many endpoint. When I try to classify more than 4,500 images, the server times out and returns a 502 error. If I add "--timeout=300" to the gunicorn call in /etc/init/nvidia-digits-server.conf, the server instead times out after 60 seconds and returns a 504 error. After some reading I found out that the nginx default is 60 seconds, and that you must also add "proxy_read_timeout 300;" to /etc/nginx/sites-available/digits.site. Maybe it was due to my naivety, but I didn't find this fix immediately obvious. Should the docs be modified to mention this issue for users of my skill level?
  2. I noticed that when posting to the API's classify many endpoint, the webapp hangs and either times out or responds after an extremely long delay. I attribute this to having only a single gunicorn worker. I tried adding "--workers=4" to the aforementioned gunicorn call, which resulted in intermittent 404 errors when viewing jobs in the webapp, along with failed model training. With this in mind, my question is: is DIGITS designed to run with multiple worker instances? If not, why wrap Flask with gunicorn?
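
The interaction between the two timeouts in item 1 can be sketched roughly as below. This is an illustration only: `first_timeout` is a hypothetical helper, not part of DIGITS, gunicorn, or nginx. It assumes gunicorn's default worker timeout of 30 seconds and the nginx default of 60 seconds for `proxy_read_timeout`, and it assumes DIGITS sends no data until the whole request finishes, so the nginx read timeout behaves like a total limit.

```shell
#!/bin/sh
# Hypothetical helper: report which layer cuts off a long request first.
# Usage: first_timeout request_seconds [gunicorn_timeout] [nginx_read_timeout]
# Assumed defaults: gunicorn --timeout 30, nginx proxy_read_timeout 60.
first_timeout() {
    request_s=$1
    gunicorn_s=${2:-30}
    nginx_s=${3:-60}
    if [ "$request_s" -lt "$gunicorn_s" ] && [ "$request_s" -lt "$nginx_s" ]; then
        # Neither limit is reached, so the response gets through.
        echo "ok"
    elif [ "$gunicorn_s" -le "$nginx_s" ]; then
        # gunicorn kills the worker first; nginx sees the upstream drop.
        echo "502 (gunicorn --timeout=${gunicorn_s})"
    else
        # nginx stops waiting for the upstream first.
        echo "504 (nginx proxy_read_timeout ${nginx_s})"
    fi
}
```

For a request that takes 120 seconds, `first_timeout 120` reports the 502, `first_timeout 120 300` reports the 504, and `first_timeout 120 300 300` reports `ok`, which matches the sequence described above: both the gunicorn `--timeout` and the nginx `proxy_read_timeout` have to be raised.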
@lukeyeager added the bug label Dec 18, 2015
@lukeyeager
Member

Re (1):

Thanks for reporting the issue! This is a known issue and needs to be addressed in the release notes/documentation. Here's a previous discussion of the issue:

I think the best way to solve this would be with the intermediate results / progress bar feature I mentioned earlier. Then the page can return quickly with a "0/1,300,000 images processed" page, and then slowly return the data as it comes through. That's a lot of work, and won't get implemented quickly. For now I'd suggest breaking up your huge textfile into manageable chunks.
#70 (comment)

The basic issue is that the request doesn't return a response until all of the classifications are complete (which could take a very long time). We've prioritized other things for the v3.0 release and didn't have time to get around to it.
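
The chunking workaround from the quoted comment could look something like this. It is a sketch under assumptions: `split_list` and `classify_chunks` are hypothetical names, the chunk size is arbitrary, and the `classify_many.json` path and the `image_list` form field are my assumption about the REST API, so check them against your DIGITS version before relying on them.

```shell
#!/bin/sh
# Hypothetical helper: split a large image list into fixed-size chunks.
# Produces files named <prefix>aa, <prefix>ab, ... (split's default suffixes).
split_list() {
    list=$1
    lines_per_chunk=$2
    prefix=$3
    split -l "$lines_per_chunk" "$list" "$prefix"
}

# Hypothetical helper: post each chunk as its own classify many request,
# so that no single request runs long enough to hit a timeout.
# ASSUMPTION: endpoint path and the image_list field name are not confirmed
# by this thread; adjust them to match your server.
classify_chunks() {
    server=$1
    job_id=$2
    prefix=$3
    for chunk in "$prefix"*
    do
        echo "posting $chunk" >&2
        curl -XPOST \
            "$server/models/images/classification/classify_many.json?job_id=$job_id" \
            -F "image_list=@$chunk"
    done
}
```

A possible invocation would be `split_list images.txt 1000 chunk_` followed by `classify_chunks localhost <job_id> chunk_`. Pick a chunk size small enough that each request finishes well inside the configured timeouts.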

So far, we've been treating DIGITS as primarily a training solution without working to make it a quality inference solution:

DIGITS is not a deployment solution, it is a training system. As such, it has been designed assuming that datasets are created and models are trained within DIGITS. Also, the emphasis has been on making training efficient, not inference.
#49 (comment)

But realistically, people are using it for inference and it needs to work - or at least not crash the server.

I don't think we can fix this for the v3.0 release because it's a big change that touches some important parts of the code and I don't want to introduce major bugs in-between the RC release and the final release. But it should be documented. I'll work on that.

@lukeyeager
Member

Re (2):

IIRC, I ran into issues with the SocketIO server when trying to use multiple worker processes. Gunicorn is used mainly because it handles server crashes well. Caffe tends to SIGABRT on errors, which can bring down the DIGITS server if it occurs during inference (which is another big issue that needs to get fixed). Gunicorn is nice because it detects when the worker process goes down and restarts it automatically.
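
To make the restart behaviour concrete, here is a toy supervisor loop. It is not how gunicorn is implemented (gunicorn uses a master process that monitors its workers); `supervise` is a hypothetical function that only illustrates the idea of relaunching a worker after an abnormal exit such as a SIGABRT.

```shell
#!/bin/sh
# Hypothetical sketch of "restart the worker when it dies".
# Usage: supervise max_restarts command [args...]
# Re-runs the command until it exits cleanly or the restart budget is spent.
supervise() {
    max_restarts=$1
    shift
    restarts=0
    while :
    do
        status=0
        "$@" || status=$?       # a process killed by SIGABRT typically reports 134
        if [ "$status" -eq 0 ]; then
            return 0            # clean exit: nothing to restart
        fi
        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "giving up after $restarts restarts (last status $status)" >&2
            return "$status"
        fi
        restarts=$((restarts + 1))
        echo "worker exited with status $status; restart $restarts" >&2
    done
}
```

Note that a loop like this only brings the process back. The request that was in flight when the worker died is still lost, which is why the crash during inference remains a separate problem.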

@lukeyeager
Member

BTW, I'm happy to review submissions to fix any of these issues!

@wghilliard
Author

But realistically, people are using it for inference and it needs to work - or at least not crash the server.

Maybe I'm overstepping the scope of this discussion, but after reviewing the Scheduler implementation it seems like the entirety of the processing is done inside the Flask application itself. Would an alternative implementation that split the application into a two- or three-layer model, explicitly separating Flask from the core neural network functions, be more appropriate in this scenario? For instance, Celery could be used to run the Scheduler and other job processes, allowing Flask to assume an observe-and-report role and delegating the heavy lifting to something more suited for it. Just my 2¢!

@lukeyeager
Member

In the meantime, if you want to get some work done, you might want to use a script similar to this:

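#!/bin/sh -e
# Classify every image listed in a file, one request per image.

if [ "$#" -ne 3 ]
then
    echo "Usage: $0 server job_id filename" >&2
    exit 1
fi

server=$1
job_id=$2
filename=$3

# The first whitespace-separated field of each line is the image path or URL.
awk '{ print $1 }' "$filename" | while read -r line
do
    echo "$line"
    # Quote the URL so the shell does not treat "?" as a glob character.
    curl -XPOST \
        "$server/models/images/classification/classify_one.json?job_id=$job_id" \
        -F "image_url=$line"
done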

If you submit each image as a separate request, DIGITS will respond much better.

@lukeyeager
Member

Maybe I'm overstepping the scope of this discussion, but after reviewing the Scheduler implementation it seems like the entirety of the processing is done inside the Flask application itself. Would an alternative implementation that split the application into a two- or three-layer model, explicitly separating Flask from the core neural network functions, be more appropriate in this scenario? For instance, Celery could be used to run the Scheduler and other job processes, allowing Flask to assume an observe-and-report role and delegating the heavy lifting to something more suited for it. Just my 2¢!

That is a bit outside the scope of this particular issue, but those are great suggestions. It'd be great if you could open a new issue for this.

@lukeyeager
Member

gunicorn was removed in #1127.
