Sliding Window, Varying input/output size and Dense, multiscale extraction #189

Closed
akosiorek opened this Issue Mar 5, 2014 · 34 comments

@akosiorek
Contributor

akosiorek commented Mar 5, 2014

[1] enables varying input/output sizes in order to perform multiscale, multiview image processing, so as to bolster classification confidence and to perform localisation and object detection. I wonder if and how it could be implemented in Caffe?

One possibility would be to set blob sizes to their maximum expected values and then account for the actual input size during computation at each layer. I am not familiar enough with the Caffe sources to predict the overhead this approach might cause; I imagine it can lead to redundant memory copying and involved index arithmetic to access the right data.

What are other possibilities? I would be happy to PR it should we be able to work out a decent solution.

[1] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. arXiv:1312.6229 [cs.CV].

@nian-liu

nian-liu commented Mar 5, 2014

I am also concerned about similar issues: does Caffe support multiple input data layers, and situations where multiple layers feed into a higher layer and vice versa?

@mavenlin

Contributor

mavenlin commented Mar 5, 2014

As for convolution, Caffe processes images one by one.
In this sense, the size of each image can vary; the im2col buffer can be preallocated to fit the largest image.
For the inner-product layer, batch mode will no longer work. (But then, a network involving multiscaling would have no inner-product layer.)
Dropout is also not a problem. I didn't read the pooling code, so I have no idea whether it is a problem there.
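To make the preallocation point above concrete, here is a minimal plain-Python sketch (not Caffe's actual code; the function names are made up for illustration) of why an im2col buffer sized for the largest expected image also covers every smaller one:

```python
# Sketch: size the im2col buffer once for the largest expected input and
# reuse it, since a smaller image needs strictly fewer columns.

def conv_out_dim(in_dim, kernel, pad=0, stride=1):
    """Output spatial dimension of a convolution/pooling layer."""
    return (in_dim + 2 * pad - kernel) // stride + 1

def im2col_buffer_elems(channels, height, width, kernel, pad=0, stride=1):
    """Number of elements the im2col buffer needs for one image."""
    out_h = conv_out_dim(height, kernel, pad, stride)
    out_w = conv_out_dim(width, kernel, pad, stride)
    return (channels * kernel * kernel) * (out_h * out_w)

# Preallocate for the largest anticipated image...
max_elems = im2col_buffer_elems(3, 454, 454, kernel=11, stride=4)
# ...and any smaller image fits in the same buffer.
small_elems = im2col_buffer_elems(3, 227, 227, kernel=11, stride=4)
assert small_elems <= max_elems
```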

@shelhamer

Member

shelhamer commented Mar 5, 2014

@kosiorekadam For varying output size with input size, the inner product layers for classification can be made convolutional too, such that the network makes a spatial output map. This can be done in Caffe as-is with the proper network definition. We will try to include an example.

Dense, multiscale feature extraction (that's fast!) is afforded by convolutional architectures if done right, and has been done within the BVLC in Caffe. We hope to publicly release this enhancement before long.

@mavenlin while convolution is the bottleneck in the current pipeline, images of varying dimensions and scale can be accommodated in a single convolutional pass with the right indexing. Essentially, one packs a pyramid or image set into a "plane" for processing through the net. This amortizes the convolutional computation across windows. By careful indexing one can extract the features/output as if they were processed one-by-one.

@forresti is a BVLC member working on amortized computation, reduced memory usage, and further efficiency improvements to Caffe among other projects.
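The equivalence behind "inner product layers can be made convolutional" can be sketched in plain Python (single channel, stride 1, made-up names; purely illustrative, not Caffe code): an FC layer over an HxW input is a convolution whose kernel is HxW, and sliding that kernel over a larger input yields a spatial map of classifier outputs.

```python
# Sketch: apply fully-connected weights convolutionally.
def fc_as_conv(image, weights, stride=1):
    """Slide FC weights (kh x kw) over the image; returns an output map."""
    kh, kw = len(weights), len(weights[0])
    out = []
    for y in range(0, len(image) - kh + 1, stride):
        row = []
        for x in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(image[y + i][x + j] * weights[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

weights = [[1, 0], [0, 1]]               # a tiny 2x2 "fully-connected" layer
crop = [[1, 2], [3, 4]]                  # FC on a 2x2 crop: one scalar output
assert fc_as_conv(crop, weights) == [[5]]

big = [[1, 2, 0], [3, 4, 0], [0, 0, 0]]  # larger input: a 2x2 output map
print(fc_as_conv(big, weights))          # -> [[5, 2], [3, 4]]
```

The top-left cell of the map equals the FC output on the top-left crop, which is why the same weights produce a spatial output map on larger inputs.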

@shelhamer

Member

shelhamer commented Mar 5, 2014

@nian-liu Caffe layers can have multiple inputs and outputs. Caffe networks can have any DAG (directed acyclic graph) structure #114 #129 , so many kinds of branching are supported. Although there aren't examples included yet, it is done by listing multiple outputs in the network definition, which are then automatically connected by inserting split layers.

For multiple inputs, you might find the concatenation layer helpful #125. This combines multiple input images into a single input blob. This could be used for example to process consecutive frames of video together.

@akosiorek

Contributor

akosiorek commented Mar 7, 2014

I've done a bit of code reading, and as I understand it both the convolution and pooling layers can work with changing image sizes. They only have to be preallocated to fit the biggest anticipated image, just as @mavenlin mentioned.

However, this approach results in convolving and pooling a small image with a lot of padding (corresponding to the maximum-size image). To narrow the computation down to the area of the currently processed image, I need to store the size somewhere. I could feed the image to the network and compute the size after each layer inside Net::Forward, or add a couple of fields to the Blob to store the size. Of course, I would have to change the API to allow input of a different size than indicated in the layer_param. Am I correct?
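The bookkeeping described above can be sketched in a few lines of plain Python (hypothetical helper names, not a Caffe API): given each layer's kernel, pad, and stride, the effective size of the current image can be propagated through the net.

```python
# Sketch: track the effective spatial size of the current input through a
# stack of (kernel, pad, stride) layers when blobs are allocated at max size.

def layer_out(size, kernel, pad, stride):
    return (size + 2 * pad - kernel) // stride + 1

def track_sizes(input_size, layers):
    """Return the effective spatial size after each layer."""
    sizes = [input_size]
    for kernel, pad, stride in layers:
        sizes.append(layer_out(sizes[-1], kernel, pad, stride))
    return sizes

# conv1 and pool1 of an AlexNet-like net, for illustration:
net = [(11, 0, 4), (3, 0, 2)]
print(track_sizes(227, net))  # -> [227, 55, 27]
```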

@kloudkl

Contributor

kloudkl commented Mar 9, 2014

Torch7's PyramidPacker does exactly what we want.
@shelhamer, is your internal BVLC implementation the same as Torch7's? If it is, and you cannot open-source it shortly for some reason, we understand, and would like to implement one to benefit everyone as soon as possible.

@shelhamer

Member

shelhamer commented Mar 10, 2014

@kloudkl the BVLC implementation is the same as the Torch7 PyramidPacker at least in spirit; I have not read the Torch7 code yet to compare the details.

Our pipeline is not quite identical, but it is a pack, plane, unpack method.

I agree it is time for dense extraction in Caffe. Since there are several design choices, it is unlikely that the implementation planned here will be identical to the private (and still experimental) implementation. I suggest we move ahead on a public implementation, and then we can compare and draw from the strengths of both implementations in the end. The BVLC pyramid team agrees with this path and will continue work on their implementation too.

PRs for dense + pyramid extraction are welcome!

@sguada

Contributor

sguada commented Mar 10, 2014

@shelhamer making inner-product layers into convolutional layers slows down the process a lot. I ran some tests changing the inner-product layers to convolution layers with 4096 filters, and the running time goes from 1.25 seconds per batch (of 256 227x227 images) to 4.37 seconds, so almost 4x slower.

When I increase the size of the images to 454x454 I have to reduce the batch size to 128, otherwise it doesn't fit in the 12 GB of memory of the K40; then the time per batch is 4.23 seconds, which means the time to process 256 images would be 8.47 seconds.
That would make the network impractical for training, since it would take ~30 days; however, it could be used for testing or deployment.

Maybe a different way of doing the convolutions could help in that case.
Also, by changing the size of the inputs as in #195, one could pass multiple scales independently instead of all together.

@shelhamer

Member

shelhamer commented Mar 10, 2014

Thanks for the timing evaluation @sguada. The convolutional bottleneck is an important target for improvement. @forresti, I think you had some ideas for this?

However, it's important to note the overall efficiency of this scheme. In the 454x454 case an 8x8 classification map is computed, so the convolutional fully-connected net is doing 64x the work in ~8x the time (I did this math in my head, so someone might check me on this).

Further, one need not necessarily compute the classifier densely. One could fuse the dense and selective approaches by densely extracting features (across space and scale as desired), then selectively computing the fully-connected layers at selective-search proposals mapped from the image into feature-space coordinates.

Perhaps #194 might help alleviate the issue: instead of convolutional fully-connected layers, one could tile the inner-product layer weights to compute the classification map as one massive multiplication, although this is of course wasteful in memory.

@sguada

Contributor

sguada commented Mar 10, 2014

@shelhamer your math was almost correct: the final output map is 9x9, so it is in fact doing 81x the work in ~8x the time. However, there is something to look into in the convolutional layer, since when given the same image size, and having to do exactly the same work, it requires 4x the time.

Still, this approach would work for big images, since the extra cost is amortized very quickly.

@forresti and I have been looking into how to speed up the convolution, but so far without success. Maybe for this case, where there are a lot of filters with many channels, it could work.

@forresti

Contributor

forresti commented Mar 10, 2014

@sguada oh cool, thanks for doing some benchmarking with the convolutional fc6 and fc7.

To begin with, I'll see if I can discern why conv is slower than innerproduct for the standard 227x227 setup.

@kloudkl Do you have any thoughts on the computational efficiency of Torch7 for AlexNet and similar deep models? Are there any particularly interesting scenarios where Caffe is much faster or slower than Torch7?

@rodrigob

Contributor

rodrigob commented Mar 10, 2014

It might be interesting to look into the implementation details of OverFeat since it is supposedly optimized for the "dense sliding window" use case.
https://github.com/sermanet/OverFeat

@rodrigob

Contributor

rodrigob commented Mar 10, 2014

Scratch my last comment: for now OverFeat has only released source code for the CPU version, and binaries for the GPU version (?!).

For now, then, we can look at Torch's GPU code:

https://github.com/torch/cunn
https://github.com/torch/cutorch/blob/master/lib/THC/THCTensorConv.cu

@rodrigob

Contributor

rodrigob commented Mar 10, 2014

A nice demo for this new feature would be a face detector similar to
http://eblearn.sourceforge.net/face_detector.html

@kloudkl

Contributor

kloudkl commented Mar 11, 2014

@forresti, I have read some of the Torch7 code but never run it. Unlike Caffe, only a small part of Torch7 is written in CUDA. Torch7 and Theano were benchmarked against each other in the pre-Caffe era. The results largely depend on who does the benchmarking, when (both teams never stop optimizing performance), and on which GPU device they are benchmarked.

To inspire further discussion, here is a summary of material that can be found in many CUDA courses. To gain insight into performance bottlenecks and their root causes, the orthogonal method is profiling. The usual suspects are device utilization and memory-bus utilization: the former can be tuned with the launch configuration (#111); the latter is higher when memory access is coalesced. Latency-hiding techniques can also increase throughput, while warp divergence does the opposite.

If optimization becomes a really high priority, systematically studying the relevant professional techniques will help a lot.

@shelhamer

Member

shelhamer commented Mar 11, 2014

I do not see Torch7 and Theano so much as guides for our computational pipeline and convolutional architecture, but as machine learning / deep learning libraries we can take as inspiration for features.

The central feature relevant to dense and pyramid processing in Torch7 is pyramid packing and unpacking. While optimization of the indexing, convolution, and fully-connected layers will be important for a widely-useful implementation, first we must have an implementation. From pyramid processing we can go in many directions, including for problems other than detection, and of course work on a Caffe reference implementation of OverFeat.

Thanks @kloudkl for the review of CUDA optimization and benchmark history. Perhaps we could have an "on CUDA optimization" section of the developer documentation to keep your pointers together.

The face detector highlighted by @rodrigob would be a nice demo for pyramid processing.

@sergeyk sergeyk added this to the 1.0 milestone Mar 13, 2014

@shelhamer

Member

shelhamer commented Mar 13, 2014

The BVLC pyramid team is working on integrating their implementation into dev ASAP. The only hold-up is the usual integration hacking and a license complication that is being hammered out now. Thanks all for your patience while this feature coalesces.

However, I stand by my original suggestion that a Torch7 style pyramid pack/plane/unpack method be pursued in the community so that we can analyze and improve on the differences. There are many design choices in such a feature.

@shelhamer

Member

shelhamer commented Mar 13, 2014

Re: @forresti's #189 (comment), the convolutional implementation is slowed by the roll / unroll and copy instead of straight dgemm as in the InnerProduct layer.

@rodrigob

Contributor

rodrigob commented Mar 22, 2014

Thanks @shelhamer for looking into the topic. Any update on the BVLC pyramid integration plans? Is there a branch where we can track progress on this topic?

@kloudkl

Contributor

kloudkl commented Mar 23, 2014

It is scheduled for the 1.0 milestone. There is no PR or branch to track yet, but anyone should feel free to develop one.

@shelhamer

Member

shelhamer commented Mar 23, 2014

The BVLC pyramid team hopes to make a public PR in the next week. That said, it was developed somewhat independently of Caffe and is going to take serious effort to integrate, so the appearance of the PR doesn't signal that the feature is ready.

My honest suggestion is that anyone interested pursue the Torch7-style pyramid pack/plane/unpack line of thought. There is understanding, and there are improvements, to be had in comparing implementations. As @kloudkl noted, this is a milestone feature for us, so we could help review and discuss any contributions in this direction.

One could even prototype it in Python instead of coding it directly into the library, to first understand the choices to make. For instance, the Torch7 packing, when convolved as one plane, will not produce the same filter activations as running separate inputs; there will be border effects according to the kernel sizes. Likewise, how should one pad to avoid false edge responses along the negative space where no image is packed? Yet another issue is that a mean image will not work unless it is scaled and applied to each packed image; one might instead use a channel mean that is spatially uniform.

These options are worth exploring in more than a single thread.
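One of the design choices above (padding against cross-image responses) can be prototyped in a few lines of plain Python. This is an illustrative sketch, not the BVLC or Torch7 implementation: it packs same-height images side by side with a gap of at least (kernel - 1) pixels, so that no filter window straddles two images.

```python
# Sketch: pack images into one row of a "plane" with a kernel-safe gap of
# negative space, so convolution responses from neighbors cannot mix.

def pack_row(images, kernel, fill=0):
    """Pack images (lists of rows) side by side with a (kernel-1) gap."""
    gap = kernel - 1
    height = max(len(im) for im in images)
    rows = []
    for y in range(height):
        row = []
        for im in images:
            line = im[y] if y < len(im) else [fill] * len(im[0])
            row.extend(line + [fill] * gap)
        rows.append(row[:-gap] if gap else row)  # drop the trailing gap
    return rows

a = [[1, 1], [1, 1]]
b = [[2, 2, 2], [2, 2, 2]]
packed = pack_row([a, b], kernel=3)
# Width is 2 + gap(2) + 3; no 3-wide window touches both a and b.
assert all(len(r) == 2 + 2 + 3 for r in packed)
```

A real packer would also record each image's offset so the output map can be unpacked per image afterwards.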

@kloudkl

Contributor

kloudkl commented Mar 23, 2014

Although already mentioned in a comment two weeks ago, I think it is still very relevant and useful to post the links related to @clementfarabet's implementation.

  1. torch7-demos / face-detector / PyramidPacker.lua
  2. torch7-demos / face-detector / PyramidUnPacker.lua
  3. Purdue's demonstrative tutorial
@rodrigob

Contributor

rodrigob commented Apr 8, 2014

Now that DenseNet is out, we should be able to close this item soon?

http://arxiv-web3.library.cornell.edu/abs/1404.1869

@moskewcz

moskewcz commented Apr 8, 2014

I just pushed the DenseNet code public and opened PR #308 (not #307, wrong target branch) to discuss the various TODOs and/or integration plans.

@bhack

Contributor

bhack commented May 31, 2014

Please take a look here: http://arxiv.org/abs/1405.3866v1

@bhack bhack referenced this issue in lisa-lab/pylearn2 May 31, 2014

Closed

sliding window #943

@shelhamer

Member

shelhamer commented Jun 12, 2014

#455 might be of interest in the meantime. It shows how to make a fully-convolutional model for dense feature extraction or sliding window classification inference.

@bhack

Contributor

bhack commented Sep 8, 2014

This could also include regression support for a variable-length set of bounding-box coordinates and sizes.

@dasguptar

dasguptar Sep 28, 2014

I was a bit curious regarding the current status of sliding window based dense multiscale extraction. Any plans to integrate it into Caffe anytime soon?

@melgor

melgor commented Oct 27, 2014

Since the talks started 7 months ago, has there been any progress in this area? Has anyone implemented an "Efficient Sliding Window" like in OverFeat?

@EvanWeiner

EvanWeiner Apr 10, 2015

Echo @melgor -- any progress on the "Efficient Sliding Window" like in OverFeat in Caffe?

@melgor

melgor commented Apr 10, 2015

@EvanWeiner, everything is now implemented in Caffe, in a similar fashion to OverFeat.
To run it you need two things:

  • #1313, which enables varying the input/output size. You can use multiple scales of the same image to get multiscale extraction; just call "Reshape" with the new size.
  • #455, which shows how to transform your model into a fully-convolutional one. This acts like an "Efficient Sliding Window".

These two things are merged in Caffe, so you can use them. As an example of the output, take a look here: http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/net_surgery.ipynb

I think this issue can be closed.

@EvanWeiner

EvanWeiner Apr 11, 2015

@melgor Thank you. But how can I use an output matrix like:

[[282 282 281 281 281 281 277 282]
[281 283 283 281 281 281 281 282]
[283 283 283 283 283 283 287 282]
[283 283 283 281 283 283 283 259]
[283 283 283 283 283 283 283 259]
[283 283 283 283 283 283 259 259]
[283 283 283 283 259 259 259 277]
[335 335 283 259 263 263 263 277]]

to locate an object within the photo? These values correspond to ImageNet class indices, but it seems the output has the same or a similar class in all locations. How do I discern a particular object?

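Each cell of such a map comes from one window of the input, so localisation means mapping cells back to input coordinates. Here is a hedged plain-Python sketch; the helper names (`cell_to_window`, `locate`) and the values (total stride 32, 227x227 receptive field, typical of a fully-convolutionalised AlexNet-style net) are assumptions for illustration, not something stated in this thread.

```python
# Sketch: map output-map cell (i, j) back to its input window, then collect
# the windows whose predicted class matches a target class index.

def cell_to_window(i, j, total_stride=32, receptive_field=227):
    """Input-space window (y0, x0, y1, x1) behind output-map cell (i, j)."""
    y0, x0 = i * total_stride, j * total_stride
    return (y0, x0, y0 + receptive_field, x0 + receptive_field)

def locate(class_map, target):
    """All input windows whose predicted class index equals `target`."""
    return [cell_to_window(i, j)
            for i, row in enumerate(class_map)
            for j, cls in enumerate(row) if cls == target]

class_map = [[282, 282], [283, 259]]
print(locate(class_map, 259))  # -> [(32, 32, 259, 259)]
```

When most cells share one class, as in the matrix above, the windows largely overlap the same object; the OverFeat paper combines per-window box regression and score accumulation to get a single detection.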

@melgor

melgor commented Apr 11, 2015

@EvanWeiner you can find more information on the Caffe mailing list: https://groups.google.com/forum/#!searchin/caffe-users/Object$20Detection/caffe-users/5TyzPCEjuRs/7sJA0DXhJ-kJ

There I point out what you could do to detect objects. Read the OverFeat paper; all the information is there.

@shelhamer

Member

shelhamer commented Mar 23, 2017

Closing, as this is handled by fully convolutional networks and their implementation in Caffe through coordinate mapping and cropping.

@shelhamer shelhamer closed this Mar 23, 2017

@shelhamer shelhamer removed this from the 1.0 milestone Mar 23, 2017
