
Progress notifications in infer_many #616

Closed

Conversation

gheinrich (Contributor)

Close #611

Summary

This change causes the inference job page to be displayed immediately while the job is running. Progress notifications are reflected in the job status bar to show the estimated time remaining. Results are not shown continuously; however, when the inference job completes, the page is reloaded to display all classifications and statistics.

Implementation details

The inference tool was updated to perform inference in batches of 1024 images (by default). This allows progress messages to be sent back to the web app.
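
For illustration, the tool-level loop might look roughly like this sketch (the load_fn/infer_fn callables and the progress-message format are placeholders, not the actual DIGITS code):

import logging

logger = logging.getLogger('inference')

def infer_in_batches(image_paths, load_fn, infer_fn, batch_size=1024):
    """Run inference over image_paths in chunks of batch_size.

    load_fn(path)   -> resized image (placeholder for image loading/resizing)
    infer_fn(batch) -> list of per-image outputs (placeholder for infer_many())
    """
    results = []
    total = len(image_paths)
    for start in range(0, total, batch_size):
        batch = [load_fn(p) for p in image_paths[start:start + batch_size]]
        results.extend(infer_fn(batch))
        done = min(start + batch_size, total)
        # a progress message between batches is what lets the web app update
        # the job status bar; the exact format here is an assumption
        logger.info('Processed %d/%d images', done, total)
    return results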

The classify_many route now serves two purposes:
- kick off a new inference job, when job_id refers to a classification model job; in that case the image_list field is required;
- check the current status of an inference job, when job_id refers to an inference job.

In both cases the route returns a page that shows the current status of the inference job. If the job is done this route shows the classification results and statistics. If this sounds too convoluted I can add another route.

On second thought, a separate route feels cleaner. I moved the classify_many route from the model blueprint to digits.inference.images.classification.views.classify_many. When that route is accessed, the required parameters are job_id (the model job ID) and image_list. The client is then redirected to digits.inference.views.show, a new route that shows the current status of the inference process.

When that route is accessed before the job is done, progress notifications are shown in the job status bar. When the job is done, a socketio message is sent to the client to reload the page (and show all results). This is slightly racy (the "job done" update might be sent before the client establishes the socketio connection), so a JavaScript timer was added to automatically refresh the page on the client side if no notification is received within the first 10 seconds.
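
For illustration only, here is a heavily condensed Flask sketch of that two-route scheme; the InferenceJob stub, the in-memory JOBS dict and the exact URLs are stand-ins rather than the actual DIGITS blueprints:

import uuid
from flask import Flask, jsonify, redirect, request, url_for

app = Flask(__name__)
JOBS = {}  # inference job id -> job (stand-in for the DIGITS scheduler)

class InferenceJob(object):
    """Minimal stand-in for an inference job object."""
    def __init__(self, model_job_id, image_list):
        self.id = str(uuid.uuid4())[:8]
        self.model_job_id = model_job_id
        self.image_list = image_list
        self.progress = 0.0
        self.status = 'Initialized'
        self.results = None

@app.route('/inference/images/classification/classify_many', methods=['POST'])
def classify_many():
    """Kick off a new inference job, then redirect to its status page."""
    job = InferenceJob(request.args['job_id'], request.files['image_list'])
    JOBS[job.id] = job
    return redirect(url_for('show', job_id=job.id))

@app.route('/inference/<job_id>')
def show(job_id):
    """Show progress while running; show results (and drop the job) once done."""
    job = JOBS[job_id]
    if job.status == 'Done':
        JOBS.pop(job_id)  # the job is "claimed" once its results are served
        return jsonify(job.results)
    return jsonify(id=job.id, progress=job.progress, status=job.status)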

NOTE: the inference job is automatically removed when the show route is accessed after the job is done, but what do we need to do about "unclaimed" jobs?

The REST interface is updated in the same way, the only difference being that there is no way to send updates to the client through socketio. To address this, the inference job ID is returned in the initial JSON response. The user can then access the new /inference/<job>.json route to check the job status and progress. For example, the initial request might look like this (20160117-142505-b706 is the model job ID):

$ curl --form "image_list=@/tmp/val.txt" http://localhost:5000/inference/images/classification/classify_many.json?job_id=20160117-142505-b706
{
  "id": "20160304-214034-d677", 
  "name": "Classify Many Images", 
  "progress": 0.0, 
  "status": "Initialized"

The client can now check the current status of the inference job:

$ curl http://localhost:5000/inference/20160304-214034-d677.json
{
  "id": "20160304-214034-d677", 
  "name": "Classify Many Images", 
  "progress": 0.6827577010268036, 
  "status": "Running"
}

The progress field can be used by the client to poll "intelligently", i.e. according to the speed at which progress is being made.
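
As an illustration, a REST client could pace its polling along these lines (a rough sketch using the requests library; the completion check and the remaining-time heuristic are assumptions, not part of the API):

import time
import requests

def wait_for_inference(base_url, job_id, min_wait=1.0, max_wait=30.0):
    """Poll /inference/<job_id>.json until the job reports its results."""
    last_progress, last_time = 0.0, time.time()
    wait = min_wait
    while True:
        data = requests.get('%s/inference/%s.json' % (base_url, job_id)).json()
        # the final response carries the classifications (see the example below)
        if 'classifications' in data:
            return data
        progress = float(data.get('progress', 0.0))
        now = time.time()
        if progress > last_progress:
            # estimate time remaining from the observed progress rate and
            # poll a few times within that window
            rate = (progress - last_progress) / (now - last_time)
            wait = min(max((1.0 - progress) / rate / 4.0, min_wait), max_wait)
            last_progress, last_time = progress, now
        time.sleep(wait)

# Example (job id taken from the responses above):
# results = wait_for_inference('http://localhost:5000', '20160304-214034-d677')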

Finally, when the job is done, classification results are shown:

$ curl http://localhost:5000/inference/20160304-214034-d677.json
{
  "classifications": {
    "/home/greg/ws/datasets/mnist/train/0/00037.png": [
      [
        "0", 
         100.0
       ], 
...

Progress

  • batched inference in inference tool
  • send progress notifications to server
  • show job progress immediately
  • reload page when inference job is done
  • timer to detect socketio race between client and server
  • update REST interface
  • classify One
  • classify Many
  • top-N classification
  • update classification tests
  • find out what to do about unclaimed jobs

Future

Apply same scheme to generic models.

@gheinrich changed the title from "Progress notifications in infer_many" to "Progress notifications in infer_many [DON'T MERGE]" on Mar 3, 2016
@gheinrich force-pushed the dev/infer-many-progress-notifs branch from 05822eb to 78aacb9 on March 8, 2016
@gheinrich (Contributor, Author)

Rebased on master and added commits to move classify_one and top_n to new scheme.

@lukeyeager (Member)

There's a merge conflict from #608.

@lukeyeager (Member)

This is slightly racy (the "job done" update might be sent before the client establishes the socketio connection), so a JavaScript timer was added to automatically refresh the page on the client side if no notification is received within the first 10 seconds.

What happens if the job finishes in 10.01 seconds? Does it reload again after another 10 seconds?

@lukeyeager (Member)

NOTE: the inference job is automatically removed when the show route is accessed after the job is done, but what do we need to do about "unclaimed" jobs?

If the client is still connected, then they should reload the page within a second of the job being done, right? How about a 1 minute timeout on the job somehow?

Also, it would probably make sense to delete any InferenceJobs that exist on disk when running scheduler.load_previous_jobs().
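
A rough sketch of what that startup cleanup might look like (hypothetical: the one-directory-per-job layout and the 'inference.marker' file used to recognise an InferenceJob are assumptions, not the actual scheduler code):

import os
import shutil

def purge_stale_inference_jobs(jobs_dir):
    """Remove job directories left behind by unclaimed inference jobs."""
    for name in os.listdir(jobs_dir):
        job_dir = os.path.join(jobs_dir, name)
        # 'inference.marker' stands in for however an InferenceJob would
        # actually be recognised on disk
        if os.path.isdir(job_dir) and \
                os.path.exists(os.path.join(job_dir, 'inference.marker')):
            shutil.rmtree(job_dir, ignore_errors=True)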

@lukeyeager (Member)

This is working pretty well for me, but the progress doesn't seem to be working. I just get 0% for a few minutes, then it jumps up to 100% all of a sudden and immediately the page is refreshed. Not sure how to debug what's going on yet ...

@gheinrich (Contributor, Author)

What happens if the job finishes in 10.01 seconds? Does it reload again after another 10 seconds?

That would still work because after another 10 seconds, the timer would reload the page (at which point the job is done and results are displayed). Note that the timer is only started when the job is still running at the point the page is being served by the web app.

If the client is still connected, then they should reload the page within a second of the job being done, right? How about a 1 minute timeout on the job somehow?

That would work for the "web" path. What about the REST API?

Also, it would probably make sense to delete any InferenceJobs that exist on disk when running scheduler.load_previous_jobs().

Yes, this plus perhaps a flag to tell the Job to delete its job_dir as soon as it completes. That would make those jobs volatile, as they would exist only in RAM until the job is "claimed" or the server is restarted.
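
For illustration, such a flag could be wired into a completion hook roughly like this (the hook, the flag name and the job_dir attribute are all hypothetical):

import shutil

def on_job_completion(job, delete_dir_on_completion=True):
    """Completion hook: optionally drop the job's directory from disk so its
    results only live in memory until the job is claimed."""
    if delete_dir_on_completion and getattr(job, 'job_dir', None):
        shutil.rmtree(job.job_dir, ignore_errors=True)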

This is working pretty well for me, but the progress doesn't seem to be working. I just get 0% for a few minutes, then it jumps up to 100% all of a sudden and immediately the page is refreshed. Not sure how to debug what's going on yet ...

That might have been due to the default batch size of 1024 images at the tools/inference.py level? How many images did you have in the list? I wasn't sure whether it was a good idea to do the inference in batches at this level but I thought that:

  • this would spread the image-resizing work more evenly across the job;
  • this would reduce the total memory footprint (we don't have to keep all images in memory before feeding them to the DL framework);
  • this would factor out some code (no need to implement progress notifications in the framework).

The downside is that we have to call the DL framework multiple times, so we must use fairly large batches, otherwise the overhead of calling into the DL framework is too large. In the TODO list I have added a task to choose the batch size automatically in order to control the frequency of the updates: we'd start with a small batch (e.g. 128 images) and then, depending on the time it took to run inference, grow or shrink the batch size to aim for one update every 10 seconds or so.
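
For illustration, the adaptation could be as simple as scaling the batch size by the ratio of the target update interval to the measured batch time (the clamping bounds here are made up; only the 128-image start and the ~10 second target come from the description above):

def next_batch_size(current_size, elapsed_seconds,
                    target_seconds=10.0, min_size=128, max_size=4096):
    """Scale the batch size by target/elapsed, clamped to [min_size, max_size]."""
    if elapsed_seconds <= 0:
        return current_size
    scaled = int(current_size * target_seconds / elapsed_seconds)
    return max(min_size, min(scaled, max_size))

# Example: a 128-image batch that took 2.5s suggests ~512 images next time:
# next_batch_size(128, 2.5) == 512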

@lukeyeager (Member)

10-second refresh

That would still work because after another 10 seconds, the timer would reload the page (at which point the job is done and results are displayed). Note that the timer is only started when the job is still running at the point the page is being served by the web app.

Ok, sounds good.

Unclaimed jobs

That would work for the "web" path. What about the REST API?

Good question. How much memory might these jobs potentially take up on the server? Just the network outputs? Maybe a 1hr timeout instead?

1024 batch size

That might have been due to the default batch size of 1024 images at the tools/inference.py level?

I haven't dug into the code enough to check yet, but this confuses me. The GPU can't handle a batch size of 1024, so you must be talking about some other "batch".

I would expect the server to save the first num_test_images of the input file to disk, and then let tools/inference.py read it in batches of 64 or 24 or whatever the network was designed for. What are you doing instead?

@gheinrich (Contributor, Author)

How much memory might these jobs potentially take up on the server? Just the network outputs? Maybe a 1hr timeout instead?

Yes, network outputs plus resized images and visualizations. The resized image is used during single image inference to show what the actual network output looks like and also during Top-N classification. Admittedly, we don't need it for multiple image classification, though I figured it is preferable to have a common set of data and let the views choose how to display results.

I would expect the server to save the first num_test_images of the input file to disk, and then let tools/inference.py read it in batches of 64 or 24 or whatever the network was designed for. What are you doing instead?

The script in tools/inference.py uses the framework's infer_many() method to perform inference. infer_many() can take any number of input samples; however, it needs those samples to be resized to the network's input dimensions. Before this PR, this worked as follows:

  • read and resize all images from the list, up to num_test_images;
  • provide the full set of resized images to infer_many() and have the framework do the inference.

As I mentioned in the previous comment, there are a couple of issues associated with this scheme, so I introduced the notion of a batch in the tools/inference.py script: instead of loading and resizing all images at once, I load batches of (by default) 1024 images and pass those batches to infer_many(). So indeed this isn't the same thing as the batch at the GPU level, since infer_many() further splits its input data into smaller, network-dependent batches.

Now the question is: what batch size to use in tools/inference.py? If I use too small a batch size there will be a large overhead in infer_many(), since it reloads the model every time. In the TODO list there is an item for keeping the model resident in memory to reduce this overhead. This sounds easy enough to implement in Caffe but it will be less obvious in Torch, since inference isn't done in Python.
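
For the resident-model TODO, a minimal sketch might look like this (load_model_fn and model.infer_many() are placeholders for however the framework wrapper would be kept alive; this is not actual DIGITS code):

def inference_worker(load_model_fn, requests_q, results_q):
    """Load the model once, then serve (request_id, batch) items from requests_q.

    requests_q and results_q are queue.Queue-like objects; a batch of None is
    the shutdown sentinel.
    """
    model = load_model_fn()   # expensive step, now done once instead of per call
    while True:
        request_id, batch = requests_q.get()
        if batch is None:
            break
        results_q.put((request_id, model.infer_many(batch)))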

@gheinrich force-pushed the dev/infer-many-progress-notifs branch from 78aacb9 to b6a5ee4 on March 14, 2016
@gheinrich (Contributor, Author)

I have pushed #631 to try and address the issue of unclaimed jobs (along the lines that Luke suggested in #616 (comment)).

@gheinrich changed the title from "Progress notifications in infer_many [DON'T MERGE]" to "Progress notifications in infer_many" on Mar 15, 2016
@gheinrich (Contributor, Author)

Admittedly the different batch size in inference.py was rather confusing... so I pushed another change to inherit the batch size from the model. @lukeyeager can you try again on a887e05? You should see much more frequent updates now.

@gheinrich mentioned this pull request on Jun 7, 2016