
Progress notifications in infer_many #616

Closed

Conversation

gheinrich (Contributor)

Close #611

Summary

This change causes the inference job page to be displayed immediately while the job is running. Progress notifications are reflected in the job status bar to show the estimated time remaining. Results are not shown continuously; however, when the inference job completes, the page is reloaded to display all classifications and statistics.

Implementation details

The inference tool was updated to perform inference in batches of 1024 images (by default). This allows progress messages to be sent back to the web app.
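
For illustration, the tool-level loop might look roughly like this sketch (the load_fn/infer_fn callables and the progress-message format are placeholders, not the actual DIGITS code):

import logging

logger = logging.getLogger('inference')

def infer_in_batches(image_paths, load_fn, infer_fn, batch_size=1024):
    """Run inference over image_paths in chunks of batch_size.

    load_fn(path)   -> resized image (placeholder for image loading/resizing)
    infer_fn(batch) -> list of per-image outputs (placeholder for infer_many())
    """
    results = []
    total = len(image_paths)
    for start in range(0, total, batch_size):
        batch = [load_fn(p) for p in image_paths[start:start + batch_size]]
        results.extend(infer_fn(batch))
        done = min(start + batch_size, total)
        # a progress message between batches is what lets the web app update
        # the job status bar; the exact format here is an assumption
        logger.info('Processed %d/%d images', done, total)
    return results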

The classify_many route now serves two purposes:
- kick off a new inference job, when job_id refers to a classification model job; in that case the image_list field is required;
- check the current status of an inference job, when job_id refers to an inference job.

In both cases the route returns a page that shows the current status of the inference job. If the job is done this route shows the classification results and statistics. If this sounds too convoluted I can add another route.

On second thought, a separate route feels cleaner. I moved the classify_many route from the model blueprint to digits.inference.images.classification.views.classify_many. When that route is accessed, the required parameters are job_id (the model job ID) and image_list. The client is then redirected to digits.inference.views.show, a new route that shows the current status of the inference process.

When that route is accessed before the job is done, progress notifications are shown in the job status bar. When the job is done, a socketio message is sent to the client to reload the page (and show all results). This is slightly racy (the "job done" update might be sent before the client establishes the socketio connection), so a JavaScript timer was added to automatically refresh the page on the client side if no notification is received within the first 10 seconds.
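
For illustration only, here is a heavily condensed Flask sketch of that two-route scheme; the InferenceJob stub, the in-memory JOBS dict and the exact URLs are stand-ins rather than the actual DIGITS blueprints:

import uuid
from flask import Flask, jsonify, redirect, request, url_for

app = Flask(__name__)
JOBS = {}  # inference job id -> job (stand-in for the DIGITS scheduler)

class InferenceJob(object):
    """Minimal stand-in for an inference job object."""
    def __init__(self, model_job_id, image_list):
        self.id = str(uuid.uuid4())[:8]
        self.model_job_id = model_job_id
        self.image_list = image_list
        self.progress = 0.0
        self.status = 'Initialized'
        self.results = None

@app.route('/inference/images/classification/classify_many', methods=['POST'])
def classify_many():
    """Kick off a new inference job, then redirect to its status page."""
    job = InferenceJob(request.args['job_id'], request.files['image_list'])
    JOBS[job.id] = job
    return redirect(url_for('show', job_id=job.id))

@app.route('/inference/<job_id>')
def show(job_id):
    """Show progress while running; show results (and drop the job) once done."""
    job = JOBS[job_id]
    if job.status == 'Done':
        JOBS.pop(job_id)  # the job is "claimed" once its results are served
        return jsonify(job.results)
    return jsonify(id=job.id, progress=job.progress, status=job.status)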

NOTE: the inference job is automatically removed when the show route is accessed after the job is done, but what do we need to do about "unclaimed" jobs?

The REST interface is updated in the same way, the only difference being that there is no way to send updates to the client through socketio. To address this, the inference job ID is returned in the initial JSON response. The user can then access the new /inference/<job>.json route to check the job status and progress. For example, the initial request might look like this (20160117-142505-b706 is the model job ID):

$ curl --form "image_list=@/tmp/val.txt" http://localhost:5000/inference/images/classification/classify_many.json?job_id=20160117-142505-b706
{
  "id": "20160304-214034-d677", 
  "name": "Classify Many Images", 
  "progress": 0.0, 
  "status": "Initialized"

The client can now check the current status of the inference job:

$ curl http://localhost:5000/inference/20160304-214034-d677.json
{
  "id": "20160304-214034-d677", 
  "name": "Classify Many Images", 
  "progress": 0.6827577010268036, 
  "status": "Running"
}

The progress field can be used by the client to poll "intelligently", i.e. according to the speed at which progress is being made.
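
As an illustration, a REST client could pace its polling along these lines (a rough sketch using the requests library; the completion check and the remaining-time heuristic are assumptions, not part of the API):

import time
import requests

def wait_for_inference(base_url, job_id, min_wait=1.0, max_wait=30.0):
    """Poll /inference/<job_id>.json until the job reports its results."""
    last_progress, last_time = 0.0, time.time()
    wait = min_wait
    while True:
        data = requests.get('%s/inference/%s.json' % (base_url, job_id)).json()
        # the final response carries the classifications (see the example below)
        if 'classifications' in data:
            return data
        progress = float(data.get('progress', 0.0))
        now = time.time()
        if progress > last_progress:
            # estimate time remaining from the observed progress rate and
            # poll a few times within that window
            rate = (progress - last_progress) / (now - last_time)
            wait = min(max((1.0 - progress) / rate / 4.0, min_wait), max_wait)
            last_progress, last_time = progress, now
        time.sleep(wait)

# Example (job id taken from the responses above):
# results = wait_for_inference('http://localhost:5000', '20160304-214034-d677')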

Finally, when the job is done, classification results are shown:

$ curl http://localhost:5000/inference/20160304-214034-d677.json
{
  "classifications": {
    "/home/greg/ws/datasets/mnist/train/0/00037.png": [
      [
        "0", 
         100.0
       ], 
...

Progress

  • batched inference in inference tool
  • send progress notifications to server
  • show job progress immediately
  • reload page when inference job is done
  • timer to detect socketio race between client and server
  • update REST interface
  • classify One
  • classify Many
  • top-N classification
  • update classification tests
  • find out what to do about unclaimed jobs

Future

Apply same scheme to generic models.

@gheinrich changed the title from "Progress notifications in infer_many" to "Progress notifications in infer_many [DON'T MERGE]" on Mar 3, 2016
@gheinrich force-pushed the dev/infer-many-progress-notifs branch from 05822eb to 78aacb9 on March 8, 2016
@gheinrich (Contributor, Author)

Rebased on master and added commits to move classify_one and top_n to new scheme.

@lukeyeager (Member)

There's a merge conflict from #608.

@lukeyeager (Member)

This is slightly racy (the "job done" update might be sent before the client establishes the socketio connection), so a JavaScript timer was added to automatically refresh the page on the client side if no notification is received within the first 10 seconds.

What happens if the job finishes in 10.01 seconds? Does it reload again after another 10 seconds?

@lukeyeager (Member)

NOTE: the inference job is automatically removed when the show route is accessed after the job is done, but what do we need to do about "unclaimed" jobs?

If the client is still connected, then they should reload the page within a second of the job being done, right? How about a 1 minute timeout on the job somehow?

Also, it would probably make sense to delete any InferenceJobs that exist on disk when running scheduler.load_previous_jobs().
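
A rough sketch of what that startup cleanup might look like (hypothetical: the one-directory-per-job layout and the 'inference.marker' file used to recognise an InferenceJob are assumptions, not the actual scheduler code):

import os
import shutil

def purge_stale_inference_jobs(jobs_dir):
    """Remove job directories left behind by unclaimed inference jobs."""
    for name in os.listdir(jobs_dir):
        job_dir = os.path.join(jobs_dir, name)
        # 'inference.marker' stands in for however an InferenceJob would
        # actually be recognised on disk
        if os.path.isdir(job_dir) and \
                os.path.exists(os.path.join(job_dir, 'inference.marker')):
            shutil.rmtree(job_dir, ignore_errors=True)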

@lukeyeager (Member)

This is working pretty well for me, but the progress doesn't seem to be working. I just get 0% for a few minutes, then it jumps up to 100% all of a sudden and immediately the page is refreshed. Not sure how to debug what's going on yet ...

@gheinrich (Contributor, Author)

What happens if the job finishes in 10.01 seconds? Does it reload again after another 10 seconds?

That would still work because after another 10 seconds, the timer would reload the page (at which point the job is done and results are displayed). Note that the timer is only started when the job is still running at the point the page is being served by the web app.

If the client is still connected, then they should reload the page within a second of the job being done, right? How about a 1 minute timeout on the job somehow?

That would work for the "web" path. What about the REST API?

Also, it would probably make sense to delete any InferenceJobs that exist on disk when running scheduler.load_previous_jobs().

Yes, this plus perhaps a flag to tell the Job to delete its job_dir as soon as it completes. That would make those jobs volatile, as they would exist only in RAM until the job is "claimed" or the server is restarted.
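
For illustration, such a flag could be wired into a completion hook roughly like this (the hook, the flag name and the job_dir attribute are all hypothetical):

import shutil

def on_job_completion(job, delete_dir_on_completion=True):
    """Completion hook: optionally drop the job's directory from disk so its
    results only live in memory until the job is claimed."""
    if delete_dir_on_completion and getattr(job, 'job_dir', None):
        shutil.rmtree(job.job_dir, ignore_errors=True)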

This is working pretty well for me, but the progress doesn't seem to be working. I just get 0% for a few minutes, then it jumps up to 100% all of a sudden and immediately the page is refreshed. Not sure how to debug what's going on yet ...

That might have been due to the default batch size of 1024 images at the tools/inference.py level? How many images did you have in the list? I wasn't sure whether it was a good idea to do the inference in batches at this level but I thought that:

  • this would spread the image-resizing work more evenly across the job;
  • this would reduce the total memory footprint (we don't have to keep all images in memory before feeding them to the DL framework);
  • this would factor out some code (no need to implement progress notifications in the framework).

The downside is that we have to call the DL framework multiple times, so we must use fairly large batches, otherwise the overhead of calling into the DL framework is too large. In the TODO list I have added a task to choose the batch size automatically in order to control the frequency of the updates: we'd start with a small batch (e.g. 128 images) and then, depending on the time it took to run inference, grow or shrink the batch size to aim for one update every 10 seconds or so.
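
For illustration, the adaptation could be as simple as scaling the batch size by the ratio of the target update interval to the measured batch time (the clamping bounds here are made up; only the 128-image start and the ~10 second target come from the description above):

def next_batch_size(current_size, elapsed_seconds,
                    target_seconds=10.0, min_size=128, max_size=4096):
    """Scale the batch size by target/elapsed, clamped to [min_size, max_size]."""
    if elapsed_seconds <= 0:
        return current_size
    scaled = int(current_size * target_seconds / elapsed_seconds)
    return max(min_size, min(scaled, max_size))

# Example: a 128-image batch that took 2.5s suggests ~512 images next time:
# next_batch_size(128, 2.5) == 512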

@lukeyeager (Member)

10-second refresh

That would still work because after another 10 seconds, the timer would reload the page (at which point the job is done and results are displayed). Note that the timer is only started when the job is still running at the point the page is being served by the web app.

Ok, sounds good.

Unclaimed jobs

That would work for the "web" path. What about the REST API?

Good question. How much memory might these jobs potentially take up on the server? Just the network outputs? Maybe a 1hr timeout instead?

1024 batch size

That might have been due to the default batch size of 1024 images at the tools/inference.py level?

I haven't dug into the code enough to check yet, but this confuses me. The GPU can't handle a batch size of 1024, so you must be talking about some other "batch".

I would expect the server to save the first num_test_images of the input file to disk, and then let tools/inference.py read it in batches of 64 or 24 or whatever the network was designed for. What are you doing instead?

@gheinrich (Contributor, Author)

How much memory might these jobs potentially take up on the server? Just the network outputs? Maybe a 1hr timeout instead?

Yes, network outputs plus resized images and visualizations. The resized image is used during single image inference to show what the actual network output looks like and also during Top-N classification. Admittedly, we don't need it for multiple image classification, though I figured it is preferable to have a common set of data and let the views choose how to display results.

I would expect the server to save the first num_test_images of the input file to disk, and then let tools/inference.py read it in batches of 64 or 24 or whatever the network was designed for. What are you doing instead?

The script in tools/inference.py uses the framework's infer_many() method to perform inference. infer_many() can take any number of input samples; however, it needs those samples to be resized to the network's input dimensions. Before this PR, this worked as follows:

  • read and resize all images from the list, up to num_test_images;
  • provide the full set of resized images to infer_many() and have the framework do the inference.

As I mentioned in the previous comment, there are a couple of issues associated with this scheme, so I introduced the notion of a batch in the tools/inference.py script: instead of loading and resizing all images at once, I load batches of (by default) 1024 images and pass those batches to infer_many(). So indeed this isn't the same thing as the batch at the GPU level, since infer_many() further splits its input data into smaller, network-dependent batches.

Now the question is: what batch size to use in tools/inference.py? If I use too small a batch size there will be a large overhead in infer_many(), since it reloads the model every time. In the TODO list there is an item for keeping the model resident in memory to reduce this overhead. This sounds easy enough to implement in Caffe but it will be less obvious in Torch, since inference isn't done in Python.
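
For the resident-model TODO, a minimal sketch might look like this (load_model_fn and model.infer_many() are placeholders for however the framework wrapper would be kept alive; this is not actual DIGITS code):

def inference_worker(load_model_fn, requests_q, results_q):
    """Load the model once, then serve (request_id, batch) items from requests_q.

    requests_q and results_q are queue.Queue-like objects; a batch of None is
    the shutdown sentinel.
    """
    model = load_model_fn()   # expensive step, now done once instead of per call
    while True:
        request_id, batch = requests_q.get()
        if batch is None:
            break
        results_q.put((request_id, model.infer_many(batch)))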

@gheinrich force-pushed the dev/infer-many-progress-notifs branch from 78aacb9 to b6a5ee4 on March 14, 2016
@gheinrich (Contributor, Author)

I have pushed #631 to try and address the issue of unclaimed jobs (along the lines that Luke suggested in #616 (comment)).

@gheinrich changed the title from "Progress notifications in infer_many [DON'T MERGE]" to "Progress notifications in infer_many" on Mar 15, 2016
@gheinrich (Contributor, Author)

Admittedly the different batch size in inference.py was rather confusing... so I pushed another change to inherit the batch size from the model. @lukeyeager can you try again on a887e05? You should see much more frequent updates now.

@gheinrich mentioned this pull request on Jun 7, 2016