Add Support for Dilated Convolution #3452
This looks pretty nice, thanks @fyu. A few initial comments:
I'll make additional comments inline.
num_dilation_dims == num_spatial_axes_)
<< "dilation must be specified once, or once per spatial dimension "
<< "(dilation specified " << num_dilation_dims << " times; "
<< num_spatial_axes_ << " spatial dims);";
Looks like we should have a four-space indent here for line continuation, following the messages above (and while I'm here, the trailing semicolon in the message should be a period...)
Thanks @longjon and @shelhamer for the fine job managing the Caffe code. I believe the performance difference is due to some additional operations for dilation. Because the original code is already very concise, the difference becomes noticeable at 1% to 2%. For the ND case, I guess the difference would be harder to notice even if we removed the dilation, since the original implementation is more complicated than the 2D case.
@@ -230,33 +255,27 @@ __global__ void col2im_gpu_kernel(const int n, const Dtype* data_col,
  const int w_im = index % width + pad_w;
  const int h_im = (index / width) % height + pad_h;
  const int c_im = index / (width * height);
  int kernel_extent_w = (kernel_w - 1) * dilation_w + 1;
  int kernel_extent_h = (kernel_h - 1) * dilation_h + 1;
Don't worry about this here, because the existing code doesn't follow this convention, but I'll just comment for the future that to me the `h` computations should come before the `w` ones, following argument order and loop order.
Re: 2 above, it looks like the reference implementation is actually upgraded here; so @shelhamer perhaps you can check whether your tests do or do not provide additional functionality. Additionally, the tests feel a little lackluster to me for the following reasons:
Other than that this looks pretty ready except as noted (all the notes being nitpicks). I haven't checked the logic in detail, but @shelhamer can comment on whether it all looks right to him, as he's wrestled with it before. Re: the performance difference (in 2d, I'm not going to worry about nd for now), it looks like most of the gap is in
@fyu @longjon @shelhamer Thanks. Yes, 'dilation' sounds more apt than 'kernel_stride'.
Thanks @gpapan for the input. I guess we all agree that "holes" and "à trous" may not be proper terms, since they refer to the algorithm with a pre-determined filter. We propose to name it "dilated convolution" because of the similarity to the traditional dilation. It will also be easier to communicate if we can keep the name a single word. If we call it input_stride, we may have to change the current "stride" to output_stride to disambiguate, and that would require people to change their concept of stride, which doesn't sound ideal. So I think dilation is better: it keeps the name short, distinct, and relevant. It is definitely great to know that TensorFlow will also support this operation.
Okay, a few observations regarding naming and such. There are three ways you can "add holes" in convolution:
Now, given the above it does seem natural to refer to the spacing parameter for case 1 as a "stride". However,
Also keep in mind that memory layout could be strided (in a parametrized way) in the future (as is supported by cuDNN), which is another use of stride with a different meaning. So even though it's somewhat natural to call this new parameter
Thanks to @longjon's efforts to scrutinize the code, the formatting issues have been cleared up. I also added more tests in convolution to test dilation for both the forward and backward passes. The tests for im2col are also changed so that the special cases for dilation are tested separately. @shelhamer, do you have any further comments on the tests?
@fyu thanks for the PR extending the Caffe convolution and thanks @longjon for leading the review.

For naming

For attribution we can address code and citations by commit message and code comment. The message suggested by @longjon looks good to me (although it could be rewritten as a merge message for the whole PR rather than tied to any single commit):
Citations could be listed by a code comment. As this is not a distinct layer, I believe the best place for these would be above the caffe.proto line that defines

If it is of any help, the citations I know of are these:
Apologies for insisting on that, but I don't consider the fact that the

The term convolution itself was traditionally used in conjunction with

Inventing a new term every time we reuse an algorithm is just misleading in
To be clear, first note that there are (at least) three distinct naming issues in play:
That fact, however, does not resolve 2 above, as à trous does not describe the parameter in question. So, shall we look at the original literature and see what language was used? The earliest reference I can find to the "algorithme à trous" is (from @fyu's citations) Holschneider et al. (1987). (I'll note that that paper does not appear to contain the invention of the term. However, its two citations (both "to appear", both not apparently available) are to a meeting in Marseille and a paper by the same authors, so this is, as best I can tell, the first appearance of the term in print.) This paper is about computing the discrete wavelet transform, which is about correlation with (from the paper) "translated and dilated" versions of a (wavelet) filter. Now the term "dilation" makes a lot of sense in the continuous case, but as Holschneider et al. make things discrete, that language is naturally preserved. For example, in describing the algorithme à trous, I quote:
See also diagram (3.5), which is introduced by the text:
See also the rest of the paper. Although the factor itself is not, as far as I can tell, given a name, the modified filter is consistently called the "dilated filter". For another early reference, see Shensa (1991), "Discrete Wavelet Transforms: The Relationship of the à trous and Mallat Algorithms". From which I quote:
Again, the dilation operator is described in use thus:
And finally note that, in this reference, the term à trous is actually reserved for something more specific than dilation (emphasis sic):
(By the way, "stride" is not a word you will find in either of these papers.) Now, I don't know for sure the process by which @fyu and Vladlen arrived at the name, but the language in their paper suggests that they are aware of the way the terminology is actually used in the literature:
I do not think the distinction being made here is simply "the fact that the filter values are learned" (quoting @gpapan), but rather that the inventors of the algorithme à trous treat convolution with a dilated filter (in those words!) as a component of that algorithm, the remainder of which has no relevance here. (Put another way, dilated convolution could be used to implement the algorithme à trous.) So the operation we are talking about here has been referred to as convolution with a dilated filter for at least as long as the term "algorithme à trous" has existed (with dilation being (as pointed out by @fyu already) exactly the ordinary mathematical term (which, by the way, surely predates morphological dilation (with which, anyway, I don't see the potential for confusion here))). Calling this a renaming is a historical inversion. I'm happy to hear additional well-supported arguments from anyone about what we should call this parameter, but please do check your map against the territory.
Dtype* data_col) {
const int* dilation, Dtype* data_col) {
  // num_axes should be smaller than block size
  DCHECK_LT(10, CAFFE_CUDA_NUM_THREADS);
Should this be `DCHECK_LE`? (I also think we should probably check `num_spatial_axes` here instead of `10`, to lose the constant literal which is subject to change.)
Because `im_shape` and `col_shape` have one more dimension than `num_spatial_axes`, `LT` is proper here. I will finish up the other changes and clean-up. Thanks!
Final(?) action items for merge:
I'll handle the reference to previous code as a merge message.
Hi Jon, just two points:

(1) As stated in my second email, I consider the term "algorithme à trous"

(2) As stated in my first email, my main objection in adopting the term

Having said that, I consider point (1) above more important. In particular,

Best,
@gpapan... respectfully, I am compelled to ask: have you read Holschneider's 1987 paper, including my citations above, and the details of my argument? I have taken pains to check the original source material, and it appears to me (as apparently known by Fisher and Koltun) that "dilated convolution" is a historically more accurate name than "à trous convolution". I am claiming that the operation we are discussing was never called the algorithme à trous, which referred instead to an algorithm for a complete wavelet transform, and that the step of that algorithm consisting of convolution with a dilated filter was called simply "convolution with a dilated filter". I am claiming that the use of the term "à trous" for the convolution itself is in fact a renaming of the operation previously called "convolution with a dilated filter". I have cited and explained in detail the exact same passage from Fisher and Koltun you cited above. Exactly how the term "à trous" came into the computer vision literature is not clear to me. OverFeat cites Fast Scanning, and Fast Scanning does not appear to have a citation. Your own work cites Mallat's 1999 book. From which I'll quote:
Again, the term refers to an algorithm for the wavelet transform, which is not simply a convolution. Later in that section:
Again, the term used to refer to the filter-with-holes is "dilated", not "à trous". Regarding (2), I did not claim that the use of "dilation" for "wavelet filter dilation" predates the morphological use. I'll quote myself:
I claimed that the use of "dilation" here is exactly in accord with standard mathematical usage ("dilation of a function"), and that that usage predates the morphological usage, which I'm well aware of. The mathematical usage is so old it's hard to say exactly when it began, but there is evidence for it in Newton's Principia -- 1687 (e.g., see page 436 of the 1871 reprint of the 1726 edition, and note that "dilate" comes directly from Latin). So, I hope you understand why your reply leaves me feeling as though my argument went unread. Every source I have checked, including your own citation, supports Fisher and Koltun's term as not merely more understandable, but more historically correct. I don't write these comments to bludgeon some preferred term over another, but so that we can all understand the facts of the matter, and all write in a way that accurately references the history. In fact, I had expected to write in support of your point, until I actually checked the literature! If there really is evidence in the other direction (even though by now I've checked nearly all the sources I could think of), please cite it for me, as I have cited the evidence for you!
Hi Jon, thanks for your response, your feedback is well received!

I now read again Holschneider's 1987 paper (was at ICCV last week and

In that sense, using in my CVPR-15 paper (page 396, paragraph "Dense

I had not mentioned the term "convolution with holes" and only used the

Fisher and Koltun use the term "dilated convolution" in

The fact that when you skip subsampling the input by a factor of p, and

As for the source of the word dilation, I agree with you that it has been
Okay, thanks for checking this with me @gpapan! I think we've converged on the facts, I'll just summarize the last points and how they apply here:
[@gpapan by the way, I don't think you can attach things to a GitHub email; you can, however, insert images.]
Agree on all three points, except that I still think a reference to the

Re. 2, the term "rate" would refer to the factor by which the actual input
also add safeguard to avoid unused variable warning and clean code format
@longjon The dilation for deconvolution is actually not implemented, since I don't think it is useful. I put a check in the parsing of the convolution parameters in case people accidentally change the default dilation when using deconvolution. The other concerns should have been addressed. Let me know if there is anything else I can improve.
Looks pretty good but there are still a few issues:
To save a round trip, I've gone ahead and made the changes above in PR #3487; just take a look and let me know if you approve!
Merged as #3487. I did leave the parameter name as

Thanks @fyu for a well-constructed PR for a long-awaited feature! Thanks @gpapan, @tamakoji, and @shelhamer for earlier implementations, feedback, and discussion. @fyu, we're looking forward to seeing models in the zoo!
Great job guys!
Here is a small model zoo featuring dilated/strided kernels from earlier this year: it also requires this feature on pooling layers though, which is quite trivial to implement.
This PR extends im2col to support dilated convolution, as described in the paper Multi-Scale Context Aggregation by Dilated Convolutions. The changes aim to support general combinations of dilation and stride in the convolution layer. Although it is hard to tweak cuDNN to support dilation, im2col seems a natural place to start. Both 2D and ND im2col are changed to support dilation. The test code for the im2col layer is changed to test dilation. Let me know if there are any other cases that may require testing.
In addition, I found that the original implementation of ND im2col and col2im doesn't make use of CUDA memory efficiently. To improve that, I use CUDA shared memory to cache data from global memory.
I ran benchmarks on alexnet to compare the performance between master and this branch. The results are:
master 2D convolution:
dilation 2D convolution:
master ND convolution:
dilation ND convolution:
I ran 50 iterations with batch size 256. The numbers show that supporting dilation adds a small overhead. Using shared memory improves the ND convolution a lot.
Benchmark model for 2D: alexnet.prototxt
Benchmark model for ND: alexnet_nd.prototxt