Progress notifications in infer_many #616
Conversation
Force-pushed from 05822eb to 78aacb9
Rebased on master and added commits to move
There's a merge conflict from #608.
What happens if the job finishes in 10.01 seconds? Does it reload again after another 10 seconds?
If the client is still connected, then they should reload the page within a second of the job being done, right? How about a 1-minute timeout on the job somehow? Also, it would probably make sense to delete any
This is working pretty well for me, but the progress doesn't seem to be working. I just get 0% for a few minutes, then it jumps up to 100% all of a sudden and immediately the page is refreshed. Not sure how to debug what's going on yet...
That would still work because after another 10 seconds, the timer would reload the page (at which point the job is done and results are displayed). Note that the timer is only started when the job is still running at the point the page is being served by the web app.
That would work for the "web" path. What about the REST API?
Yes, this plus perhaps a flag to tell the
That might have been due to the default batch size of 1024 images at the
The downside is we're having to call the DL framework multiple times, so we have to use pretty large batches, otherwise the overhead of calling into the DL framework is too large. In the TODO list I have added a task to choose the batch size automatically in order to control the frequency of the updates. We'd start with a small batch (e.g. 128 images) and then, depending on the time it took to do the inference, we'd grow or shrink the batch size to aim for one update every e.g. 10 seconds.
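The adaptive scheme described above could be sketched as follows. This is a hypothetical illustration, not DIGITS code; the function name, the clamp bounds, and the 10-second target are assumptions:

```python
# Sketch: scale the batch size so that each batch takes roughly
# TARGET_SECONDS, yielding about one progress update per interval.
TARGET_SECONDS = 10.0
MIN_BATCH, MAX_BATCH = 128, 4096  # assumed bounds

def adapt_batch_size(batch_size, elapsed):
    """Return the next batch size given how long the last batch took."""
    if elapsed <= 0:
        return batch_size
    scaled = int(batch_size * TARGET_SECONDS / elapsed)
    # Clamp so batches stay large enough to amortize framework overhead
    # but small enough to keep updates frequent.
    return max(MIN_BATCH, min(MAX_BATCH, scaled))
```

For example, if a 128-image batch took 5 seconds, the next batch would grow to 256 images; if a 1024-image batch took 20 seconds, it would shrink to 512.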
10-second refresh: Ok, sounds good.
Unclaimed jobs: Good question. How much memory might these jobs potentially take up on the server? Just the network outputs? Maybe a 1 hr timeout instead?
1024 batch size: I haven't dug into the code enough to check yet, but this confuses me. The GPU can't handle a batch size of 1024, so you must be talking about some other "batch". I would expect the server to save the first
Yes, network outputs plus resized images and visualizations. The resized image is used during single image inference to show what the actual network output looks like and also during Top-N classification. Admittedly, we don't need it for multiple image classification, though I figured it is preferable to have a common set of data and let the
The script in
As I mentioned in the previous comment, there are a couple of issues associated with this scheme, so I introduced the notion of a batch in the
Volatile jobs are not persisted to disk. Also, they are automatically deleted by the scheduler one hour after completion.
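The one-hour cleanup rule could be sketched like this. This is a minimal illustration under assumed names, not the actual DIGITS scheduler:

```python
import time

NON_PERSISTENT_JOB_DELETE_TIMEOUT = 3600  # seconds; assumed constant name

def should_delete(job_is_persistent, finished_at, now=None):
    """Return True when a volatile (non-persistent) job finished more
    than an hour ago; persistent or still-running jobs are kept."""
    if job_is_persistent or finished_at is None:
        return False
    now = time.time() if now is None else now
    return now - finished_at > NON_PERSISTENT_JOB_DELETE_TIMEOUT
```

A scheduler loop would periodically evaluate this predicate over its job list and drop the jobs (and their in-memory outputs) for which it returns True.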
Force-pushed from 78aacb9 to b6a5ee4
I have pushed #631 to try to address the issue of unclaimed jobs (along the lines that Luke suggested in #616 (comment)).
Admittedly the different batch size in
Closing - backed up on https://github.com/gheinrich/DIGITS/tree/backup/inference-progress-notifications
Close #611
Summary
This change causes the inference job page to be displayed immediately while the job is running. Progress notifications are reflected in the job status bar to show the estimated time remaining. Results are not shown continuously; however, when the inference job completes, the page is reloaded to display all classifications and statistics.
Implementation details
The inference tool was updated to perform inference in batches of 1024 images by default. This allows progress messages to be sent back to the web app between batches.
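The batched loop might look roughly like this. It is an illustrative sketch, not the actual DIGITS inference tool; `run_batch` and `report_progress` are hypothetical callbacks standing in for the DL framework call and the message back to the web app:

```python
def infer_many(images, run_batch, report_progress, batch_size=1024):
    """Run inference over images in fixed-size batches, emitting a
    progress fraction in [0, 1] after each batch."""
    results = []
    total = len(images)
    for start in range(0, total, batch_size):
        batch = images[start:start + batch_size]
        results.extend(run_batch(batch))
        done = min(start + batch_size, total)
        report_progress(done / float(total))
    return results
```

Note that the update frequency is tied directly to the batch size, which is why the batch size matters for the responsiveness of the status bar.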
The `classify_many` route now serves two purposes:

- kick off a new inference job, when `job_id` is an instance of a classification model (in that case the `image_list` field is required);
- check the current status of the inference job, when `job_id` is an instance of an inference job.

In both cases the route returns a page that shows the current status of the inference job. If the job is done, this route shows the classification results and statistics. If this sounds too convoluted I can add another route.

On second thought it feels like a separate route is cleaner. I moved the `classify_many` route from the `model` blueprint to `digits.inference.images.classification.views.classify_many`. When that route is accessed, the required parameters are `job_id` (the model job ID) and `image_list`. The client is then redirected to `digits.inference.views.show`. This new route shows the current status of the inference process. When the route is accessed before the job is done, progress notifications are shown in the job status bar. When the job is done, a socketio message is sent to the client to reload the page (and show all results). This is slightly racy (the "job done" update might be sent before the client establishes the socketio connection), so a JavaScript timer was added to automatically refresh the page on the client side if no notification is received within the first 10 seconds.

The REST interface is updated in the same way, the only difference being that there is no way to send updates to the client through socketio. To address this, the inference job ID is returned in the initial JSON response. The user can then access the new `digits.inference/<job>.json` route to check the job status and progress. For example, the initial request might look like this (`20160117-142505-b706` is the model job ID). The client can then check the current status of the inference job.

The `progress` field can be used by the client to poll "intelligently", i.e. according to the speed at which progress is being made. Finally, when the job is done, classification results are shown.
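A client-side polling loop using the `progress` field could be sketched as follows. The route name comes from the text above, but the callback shape, field names, and back-off timings are assumptions for illustration:

```python
import time

def wait_for_inference(fetch_status, min_delay=1.0, max_delay=10.0,
                       sleep=time.sleep):
    """Poll until the job is done. fetch_status() is a hypothetical
    callable returning the parsed JSON status dict, e.g. from a GET on
    the digits.inference/<job>.json route."""
    while True:
        status = fetch_status()
        if status.get("progress", 0) >= 1.0:
            return status
        # Poll "intelligently": wait longer while progress is low,
        # check more often as the job nears completion.
        delay = max(min_delay, max_delay * (1.0 - status["progress"]))
        sleep(delay)
```

With this back-off, a job at 0% progress is polled every 10 seconds, while one at 90% is polled every second, roughly matching the refresh behavior of the web path.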
Progress
Future
Apply the same scheme to generic models.