Add Support for Dilated Convolution #3452

Closed
wants to merge 4 commits into from

Conversation

fyu
Contributor

@fyu fyu commented Dec 15, 2015

This PR extends im2col to support dilated convolution, as described in the paper Multi-Scale Context Aggregation by Dilated Convolutions. The changes aim to support general combinations of dilation and stride in the convolution layer. Although it is hard to tweak cuDNN to support dilation, im2col seems a natural place to start. Both the 2D and ND im2col are changed to support dilation, and the im2col layer tests are updated to exercise dilation. Let me know if there is any other case that may require testing.
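
In essence, dilation only changes which input element each kernel tap reads. A simplified single-channel CPU sketch of the indexing (illustrative only, not the exact diff in this PR):

    // Illustrative single-channel dilated im2col on the CPU. Simplified relative
    // to the actual caffe::im2col_cpu, which also loops over channels.
    template <typename Dtype>
    void im2col_dilated_sketch(const Dtype* data_im, const int height,
        const int width, const int kernel_h, const int kernel_w,
        const int pad_h, const int pad_w, const int stride_h, const int stride_w,
        const int dilation_h, const int dilation_w, Dtype* data_col) {
      // The effective kernel footprint grows to (kernel - 1) * dilation + 1.
      const int output_h =
          (height + 2 * pad_h - ((kernel_h - 1) * dilation_h + 1)) / stride_h + 1;
      const int output_w =
          (width + 2 * pad_w - ((kernel_w - 1) * dilation_w + 1)) / stride_w + 1;
      for (int kh = 0; kh < kernel_h; ++kh) {
        for (int kw = 0; kw < kernel_w; ++kw) {
          for (int h = 0; h < output_h; ++h) {
            for (int w = 0; w < output_w; ++w) {
              // With dilation == 1 this reduces to the ordinary im2col indexing.
              const int h_im = h * stride_h - pad_h + kh * dilation_h;
              const int w_im = w * stride_w - pad_w + kw * dilation_w;
              *(data_col++) =
                  (h_im >= 0 && h_im < height && w_im >= 0 && w_im < width) ?
                  data_im[h_im * width + w_im] : Dtype(0);
            }
          }
        }
      }
    }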

In addition, I found that the original implementation of ND im2col and col2im doesn't make efficient use of CUDA memory. To improve that, I use CUDA shared memory to cache reads from global memory.
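
The pattern is roughly the following (a sketch of the idea only, not the actual kernel): each block copies the small per-axis parameter arrays from global memory into shared memory once, so the per-element loops read them from on-chip memory.

    // Sketch of caching per-axis parameters in shared memory for an ND kernel.
    // Illustrative only; the real ND im2col/col2im kernels do the actual
    // column/image computation after this setup.
    template <typename Dtype, int num_axes>
    __global__ void nd_param_cache_sketch(const int n,
        const int* kernel_shape, const int* pad, const int* stride,
        const int* dilation, const Dtype* data_im, Dtype* data_col) {
      __shared__ int shared_kernel_shape[num_axes];
      __shared__ int shared_pad[num_axes];
      __shared__ int shared_stride[num_axes];
      __shared__ int shared_dilation[num_axes];
      // One thread per axis loads the parameters for the whole block.
      if (threadIdx.x < num_axes) {
        shared_kernel_shape[threadIdx.x] = kernel_shape[threadIdx.x];
        shared_pad[threadIdx.x] = pad[threadIdx.x];
        shared_stride[threadIdx.x] = stride[threadIdx.x];
        shared_dilation[threadIdx.x] = dilation[threadIdx.x];
      }
      __syncthreads();
      // ... the per-element im2col/col2im work would use the shared_* arrays ...
    }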

I ran benchmarks on alexnet to compare the performance between master and this branch. The results are:

master 2D convolution:

Average Forward pass: 352.594 ms.
Average Backward pass: 254.317 ms.
Average Forward-Backward: 607.01 ms.

dilation 2D convolution:

Average Forward pass: 360.493 ms.
Average Backward pass: 255.279 ms.
Average Forward-Backward: 615.875 ms.

master ND convolution:

Average Forward pass: 505.18 ms.
Average Backward pass: 398.236 ms.
Average Forward-Backward: 903.509 ms.

dilation ND convolution:

Average Forward pass: 425.503 ms.
Average Backward pass: 323.731 ms.
Average Forward-Backward: 749.334 ms.

I ran 50 iterations with batch size 256. The numbers show that supporting dilation adds only a small overhead, and that using shared memory improves the ND convolution a lot.

Benchmark model for 2D: alexnet.prototxt
Benchmark model for ND: alexnet_nd.prototxt
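
(For reproduction: these are the numbers reported by Caffe's caffe time tool, run with something like "caffe time -model alexnet.prototxt -iterations 50 -gpu 0", assuming a single-GPU run.)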

@longjon
Contributor

longjon commented Dec 15, 2015

This looks pretty nice, thanks @fyu. A few initial comments:

  1. There is already a similar PR for this functionality in im2col with hole #3232. However, it looks like this one is in better shape (e.g., im2col with hole #3232 is failing Travis and has neither a clean history nor a clean diff), so I'd like to proceed from here. To acknowledge the earlier implementations, I propose adding the following text to the initial commit message:

    An early implementation of this functionality for Caffe was written by @gpapan, which was extended and improved by @tamakoji in a previous implementation of this and the next commit.

    Does that sound accurate and appropriate to everyone involved? @shelhamer, does that sound compatible with how we've handled credit distribution in similar situations in the past?

  2. I believe @shelhamer has written more exhaustive tests for this functionality (e.g., upgrading the reference implementation), which should be included here. @shelhamer, perhaps you can point @fyu to relevant commits to cherry-pick?

  3. @shelhamer and I discussed earlier the issue of what to call the new parameter here; at that time we did not like any of the options we could think of: hole or hole_stride is misleading as pointed out by @tamakoji; kernel_stride is easily confused with stride; and other possibilities are unclear or obscure. dilation seems like the best term I've heard yet, so I don't mind keeping this as-is. Opinions?

  4. The perf difference for 2D im2col seems acceptable to me; do others agree? @fyu, can you explain what the source of the small difference is?

I'll make additional comments inline.

num_dilation_dims == num_spatial_axes_)
<< "dilation must be specified once, or once per spatial dimension "
<< "(dilation specified " << num_dilation_dims << " times; "
<< num_spatial_axes_ << " spatial dims);";
Contributor

Looks like we should have a four-space indent here for line continuation, following the messages above (and while I'm here, the trailing semicolon in the message should be a period...)
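
That is, something like this (reconstructing the surrounding CHECK from the diff context, so the exact condition may differ):

    CHECK(num_dilation_dims == 1 ||
          num_dilation_dims == num_spatial_axes_)
        << "dilation must be specified once, or once per spatial dimension "
        << "(dilation specified " << num_dilation_dims << " times; "
        << num_spatial_axes_ << " spatial dims).";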

@fyu
Contributor Author

fyu commented Dec 15, 2015

Thanks @longjon and @shelhamer for the fine job managing the Caffe code. I believe the performance difference is due to the additional operations needed for dilation. Because the original code is already very concise, the extra work shows up as a 1% to 2% difference. For the ND case, I guess the difference would be harder to notice if we removed the dilation, since the original implementation is more complicated than the 2D case.

@@ -230,33 +255,27 @@ __global__ void col2im_gpu_kernel(const int n, const Dtype* data_col,
const int w_im = index % width + pad_w;
const int h_im = (index / width) % height + pad_h;
const int c_im = index / (width * height);
int kernel_extent_w = (kernel_w - 1) * dilation_w + 1;
int kernel_extent_h = (kernel_h - 1) * dilation_h + 1;
Contributor

Don't worry about this here, because the existing code doesn't follow this convention, but I'll just comment for the future that to me the h computations should come before the w ones, following argument order and loop order.
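
(For context: kernel_extent_* is the effective footprint of the dilated filter, (kernel - 1) * dilation + 1; e.g. a 3x3 kernel with dilation 2 covers a 5x5 input window, and the output size is computed with this extent in place of the raw kernel size.)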

@longjon
Contributor

longjon commented Dec 15, 2015

Re: 2 above, it looks like the reference implementation is actually upgraded here; so @shelhamer perhaps you can check whether your tests do or do not provide additional functionality.

Additionally, the tests feel a little lackluster to me for the following reasons:

  1. I don't see dilated convolution being checked against the reference implementation, as the conv tests use dilation 1.
  2. The im2col tests are modified to use non-one dilation. I'd rather see additional tests for non-one dilation; the current changes make me nervous as we're departing from the common case.

Other than that this looks pretty ready except as noted (all the notes being nitpicks). I haven't checked the logic in detail, but @shelhamer can comment on whether it all looks right to him, as he's wrestled with it before.

Re: the performance difference (in 2D; I'm not going to worry about ND for now), it looks like most of the gap is in im2col. I find it somewhat surprising that a few 32-bit multiplies would make a noticeable difference, but apparently they do. We can always consider a separate kernel or a templated kernel to improve performance if it becomes an issue in the future (which seems unlikely).

@tamakoji

@fyu @longjon @shelhamer Thanks. Yes, 'dilation' sounds more appropriate than 'kernel_stride'.

@fyu
Contributor Author

fyu commented Dec 16, 2015

Thanks @gpapan for the input. I guess we all agree that "holes" and "à trous" may not be proper terms here, since they refer to the algorithm with a pre-determined filter. We propose to name it "dilated convolution" because of its similarity to traditional dilation. It will also be easier to communicate if we can keep the name a single word. If we call it input_stride, we may have to rename the current "stride" to output_stride to disambiguate, which would require people to change their concept of stride; that doesn't sound ideal. So I think dilation keeps the name short, distinct, and relevant. It is definitely great to know that TensorFlow will also support this operation.

@longjon
Contributor

longjon commented Dec 17, 2015

Okay, a few observations regarding naming and such. There are three ways you can "add holes" in convolution:

  1. You can implicitly add holes to the kernel. That's the subject of this patch.
  2. You can implicitly add holes to the input. (Note that there's nothing special distinguishing input from kernel.) That's supported in master using the stride parameter of the Deconvolution layer.
  3. You can implicitly add holes to the output. (If that's confusing, think of the gradient computation, which is the same as the previous case.) That's supported in master using the stride parameter of the Convolution layer.

Now, given the above it does seem natural to refer to the spacing parameter for case 1 as a "stride". However, input_stride seems particularly unfortunate for these reasons:

  1. It's not case 2 above.
  2. It's also reasonably confusable with case 3 above, which could be described as "the stride of the kernel in the input".
  3. In the future, we may support simultaneous specification of all three spacing parameters, and in that case it seems clear that using input_stride for 1 breaks the symmetry of the terms.

Also keep in mind that memory layout could be strided (in a parametrized way) in the future (as is supported by cuDNN), which is another use of stride with a different meaning.

So even though it's somewhat natural to call this new parameter *_stride, I haven't heard a compelling choice for *, and given the current situation where the other two hole-making options are actually different layers, it also seems natural to pick a distinct name to avoid creating a false correspondence.

dilation, while a newly invented term, is at least evocative and not easily confused with regular old stride...

@fyu
Contributor Author

fyu commented Dec 17, 2015

Thanks to @longjon's efforts scrutinizing the code, the formatting issues have been cleared up. I have also added more convolution tests to exercise dilation in both the forward and backward passes. The im2col tests are also changed so that the special cases for dilation are tested separately. @shelhamer, do you have any further comments on the tests?

@shelhamer
Member

@fyu thanks for the PR extending the Caffe convolution and thanks @longjon for leading the review.

For naming, dilation has my vote in order to escape the confusions of *_stride and hole*. Although this collides with the morphological operation, as raised by @gpapan, it is in my opinion more distinct and clear in the context of convolution.

For attribution we can address code and citations by commit message and code comment. The message suggested by @longjon looks good to me (although it could be rewritten as a merge message for the whole PR rather than tied to any single commit):

An early implementation of this functionality for Caffe was written by @gpapan, which was extended and improved by @tamakoji in a previous implementation of this and the next commit.

Citations could be listed in a code comment. As this is not a distinct layer, I believe the best place for these would be above the caffe.proto line that defines dilation. The citations could be grouped into the à trous / wavelet algorithms and then the more recent works on the learned / deep network case. However, I'd argue that full references are best left to the literature; instead of trying to explain exhaustively, I suggest a roll call of method names, like à trous / fast scanning / rarefaction / dilation, to help discover the literature.
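
For instance, something along these lines above the dilation field (a sketch of the shape such a comment could take, not final wording; the field number is illustrative):

    // Factor by which to dilate the kernel: the effective kernel extent becomes
    // (kernel_size - 1) * dilation + 1, i.e. (dilation - 1) implicit zeros are
    // inserted between kernel elements. Related terms in the literature:
    // algorithme a trous, fast scanning, rarefaction.
    repeated uint32 dilation = 18;  // The dilation; defaults to 1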

If it is of any help the citations I know of are these:

  • Holschneider et al. A real-time algorithm for signal analysis with the help of the wavelet transform. 1987
  • Shensa et al. The discrete wavelet transform: wedding the à trous and Mallat algorithms. 1992
  • Mallat et al. A Wavelet Tour of Signal Processing. 1999
  • Giusti et al. Fast image scanning with deep max-pooling convolutional neural networks. 2013.
  • Sermanet et al. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013.
  • Long* and Shelhamer* and Darrell. Fully convolutional networks for semantic segmentation. arXiv 2014.
  • Chen* and Papandreou* et al. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2014.
  • Fisher and Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015.

@gpapan

gpapan commented Dec 18, 2015

Apologies for insisting on that, but I don't consider the fact that the filter values are learned a compelling reason to change the name of the operation from atrous convolution to dilated convolution (which also clashes with the use of an established term in morphological image processing).

The term convolution itself was traditionally used in conjunction with fixed/hand-set filters (e.g., wavelet signal processing). When Yann and others figured out how to learn filter weights in CNNs, they didn't find it necessary to rename the convolution operation/algorithm to something else.

Inventing a new term every time we reuse an algorithm is just misleading in my opinion.

@longjon
Contributor

longjon commented Dec 19, 2015

To be clear, first note that there are (at least) three distinct naming issues in play:

  1. The Caffe layer name. Since we are not implementing new layers here, this is a non-issue; the layer names are Convolution and Im2col.
  2. The name of the parameter specifying how many holes to insert in the filters. This is the only issue I've been referring to so far, as it's somewhat difficult to change.
  3. How the operation is referred to in comments and documentation. On this point, I (partially) agree with @gpapan (echoing somewhat @shelhamer above) that "à trous" is a name commonly used in this context, and the comments should make it clear that this operation is often referred to as the à trous operation.

That fact, however, does not resolve 2 above, as à trous does not describe the parameter in question.

So, shall we look at the original literature and see what language was used?

The earliest reference I can find to the "algorithme à trous" is (from @fyu's citations) Holschneider et al. (1987). (I'll note that that paper does not appear to contain the invention of the term. However, its two citations (both "to appear", both not apparently available) are to a meeting in Marseille and a paper by the same authors, so this is, as best I can tell, the first appearance of the term in print.)

This paper is about computing the discrete wavelet transform, which is about correlation with (from the paper) "translated and dilated" versions of a (wavelet) filter. Now the term "dilation" makes a lot of sense in the continuous case, but as Holschneider et al. make things discrete, that language is naturally preserved. For example, in describing the algorithme à trous, I quote:

The convolutions with filters F_i, which are all the dilated versions of one fixed filter can be realized by an "algorithme à trous".

See also diagram (3.5), which is introduced by the text:

Then the convolution with a dilated filter alpha^{-1} D_2 h is realized as: [diagram showing exactly the operation we are discussing here].

See also the rest of the paper. Although the factor itself is not, as far as I can tell, given a name, the modified filter is consistently called the "dilated filter".

For another early reference, see Shensa (1991), "Discrete Wavelet Transforms: The Relationship of the à trous and Mallat Algorithms". From which I quote:

Decimation...plays a pivotal role in all DWT algorithms...[a]lso of significances [sic] is its transpose, D_kj = delta(k - 2j), which dilates a vector by inserting zeros.

Again, the dilation operator is described in use thus:

First we spread g^dagger to provide space in which to insert the interpolated values. The resulting filter is Dg^dagger.

And finally note that, in this reference, the term à trous is actually reserved for something more specific than dilation (emphasis sic):

This condition, that f be the identity on even points, is sufficiently important to warrant a separate definition: The lowpass filter f is said to be an à trous filter if it satisfies [being the identity on even points].

(By the way, "stride" is not a word you will find in either of these papers.)

Now, I don't know for sure the process by which @fyu and Vladlen arrived at the name, but the language in their paper suggests that they are aware of the way the terminology is actually used in the literature:

Most notably, the algorithme à trous for wavelet decomposition uses dilated filters (Holschneider et al., 1987; Shensa, 1992). We do not apply the term algorithme à trous to our work since the algorithme à trous is an algorithm for wavelet decomposition that applies a single pre-determined filter at successively larger scales to produce a signal decomposition.

I do not think the distinction being made here is simply "the fact that the filter values are learned" (quoting @gpapan), but rather that the inventors of the algorithme à trous treat convolution with a dilated filter (in those words!) as a component of that algorithm, the remainder of which has no relevance here. (Put another way, dilated convolution could be used to implement the algorithme à trous.)

So the operation we are talking about here has been referred to as convolution with a dilated filter for at least as long as the term "algorithme à trous" has existed (with dilation being (as pointed out by @fyu already) exactly the ordinary mathematical term (which, by the way, surely predates morphological dilation (with which, anyway, I don't see the potential for confusion here))). Calling this a renaming is a historical inversion.

I'm happy to hear additional well-supported arguments from anyone about what we should call this parameter, but please do check your map against the territory.

Dtype* data_col) {
const int* dilation, Dtype* data_col) {
// num_axes should be smaller than block size
DCHECK_LT(10, CAFFE_CUDA_NUM_THREADS);
Contributor

Should this be DCHECK_LE? (I also think we should probably check num_spatial_axes here instead of 10, to lose the constant literal which is subject to change.)

Contributor Author

Because im_shape and col_shape have one more dimension than num_spatial_axes, LT is proper here. I will finish up the other changes and clean-up. Thanks!

@longjon
Contributor

longjon commented Dec 19, 2015

Final(?) action items for merge:

  • I've made two additional comments on the latest diff.
  • I wonder if we should provide a test for Deconvolution layer with non-unit dilation?
  • Let's provide a comment in caffe.proto explaining a little bit the meaning of dilation, and referencing in an appropriate way the term "à trous", since that's how many know this operation. I'll let @fyu decide what to write exactly and whether to mention other terms as suggested by @shelhamer.
  • Finally if it's not too much trouble I'd ask that you clean up the history, chiefly by squashing in style fixes and making sure each commit touches only the relevant files.

I'll handle the reference to previous code as a merge message.

@gpapan

gpapan commented Dec 19, 2015

Hi Jon,

just two points:

(1) As stated in my second email, I consider the term "algorithme à trous" or "convolution with holes" well established in the related literature and do not find it appropriate to rename it to "dilated convolution". The justification that Fisher and Koltun use for the renaming is: "We do not apply the term algorithme à trous to our work since the algorithme à trous is an algorithm for wavelet decomposition that applies a single pre-determined filter at successively larger scales to produce a signal decomposition." I find this justification unconvincing. Inventing a new name for an existing operation is misleading and adds unnecessary confusion.

(2) As stated in my first email, my main objection to adopting the term "dilation" for describing the stride in the input tensor is the name clash with dilation filtering (https://www.google.com/webhp?q=dilation+filtering). The term dilation has been used in morphological image processing since at least the 1970s (so your comment that wavelet filter dilation "surely predates morphological dilation" is wrong). The standard reference is J. Serra, Image Analysis and Mathematical Morphology, Academic Press, 1982. You may want to consult any textbook on digital image processing for an introduction to morphological image processing, e.g., http://www.amazon.com/Digital-Image-Processing-Scientific-Inside/dp/0471767778

Having said that, I consider point (1) above more important. In particular, although I explained my position in (2), I think it is your call to decide which term to adopt in Caffe.

Best,
George


@longjon
Contributor

longjon commented Dec 20, 2015

@gpapan... respectfully, I am compelled to ask: have you read Holschneider's 1987 paper, including my citations above, and the details of my argument? I have taken pains to check the original source material, and it appears to me (as apparently known by Fisher and Koltun) that "dilated convolution" is a historically more accurate name than "à trous convolution". I am claiming that the operation we are discussing was never called the algorithme à trous, which referred instead to an algorithm for a complete wavelet transform, and that the step of that algorithm consisting of convolution with a dilated filter was called simply "convolution with a dilated filter". I am claiming that the use of the term "à trous" for the convolution itself is in fact a renaming of the operation previously called "convolution with a dilated filter".

I have cited and explained in detail the exact same passage from Fisher and Koltun you cited above.

Exactly how the term "à trous" came into the computer vision literature is not clear to me. OverFeat cites Fast Scanning, and Fast Scanning does not appear to have a citation. Your own work cites Mallat's 1999 book. From which I'll quote:

A fast dyadic wavelet transform is calculated with a filter bank algorithm called in French the algorithme à trous

Again, the term refers to an algorithm for the wavelet transform, which is not simply a convolution. Later in that section:

Suppose that h and g have respectively K_h and K_g non-zero samples. The "dilated" filters h_j and g_j have the same number of non-zero coefficients.

Again, the term used to refer to the filter-with-holes is "dilated", not "à trous".

Regarding (2), I did not claim that the use of "dilation" for "wavelet filter dilation" predates the morphological use. I'll quote myself:

...with dilation being...exactly the ordinary mathematical term (which, by the way, surely predates morphological dilation

I claimed that the use of "dilation" here is exactly in accord with standard mathematical usage ("dilation of a function"), and that that usage predates the morphological usage, which I'm well aware of. The mathematical usage is so old it's hard to say exactly when it began, but there is evidence for it in Newton's Principia -- 1687 (e.g., see page 436 of the 1871 reprint of the 1726 edition, and note that "dilate" comes directly from Latin).

So, I hope you understand why your reply leaves me feeling as though my argument went unread. Every source I have checked, including your own citation, supports Fisher and Koltun's term as not merely more understandable, but more historically correct. I don't write these comments to bludgeon some preferred term over another, but so that we can all understand the facts of the matter, and all write in a way that accurately references the history. In fact, I had expected to write in support of your point, until I actually checked the literature!

If there really is evidence in the other direction (even though by now I've checked nearly all the sources I could think of), please cite it for me, as I have cited the evidence for you!

@gpapan

gpapan commented Dec 21, 2015

Hi Jon,

thanks for your response, your feedback is well received!

I have now read Holschneider's 1987 paper again (I was at ICCV last week and didn't have access to it). You are right: Holschneider, and also Mallat in his book and 1989 paper, use the term "filter dilated by a factor of p" as a synonym for the signal processing term "filter upsampled by a factor of p". As far as I can tell, they use neither of the terms (1) "convolution with holes" / "à trous convolution" nor (2) "dilated convolution" to describe convolution with upsampled filters.

In that sense, my use of the term "convolution with holes" or "à trous convolution" to describe Eq. (7) in my CVPR-15 paper (page 396, paragraph "Dense feature extraction with the à trous algorithm", http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Papandreou_Modeling_Local_and_2015_CVPR_paper.pdf) is non-standard. Thanks for pointing it out to me; it took a couple of your emails for me to realize it!

I had not mentioned the term "convolution with holes" and only used the term "atrous algorithm" in earlier drafts:
Section 4.2.2 of http://arxiv.org/pdf/1412.0296.pdf (earlier version of the CVPR-15 paper)
Section 3.1 of http://arxiv.org/pdf/1412.7062.pdf (which appeared at ICLR-15)

Fisher and Koltun use the term "dilated convolution" in http://arxiv.org/abs/1511.07122 to describe the exact same operation (modulo correlation vs. convolution; compare their Eq. 2 with Eq. 7 in my CVPR-15 paper). Their use of the term "dilated convolution" is also non-standard AFAIK.

The fact that if you skip subsampling the input by a factor of p, and instead upsample the filter by a factor of p, you get the same output at p times the rate is also a fundamental idea in multirate signal processing, sometimes called the "noble identity" (see Fig. 4.30 in the attached page excerpted from Oppenheim and Schafer's Discrete-Time Signal Processing, Ed. 2, 1998, or Fig. 9 and surrounding text of Vaidyanathan's Proc. IEEE 1990 paper, http://authors.library.caltech.edu/6798/1/VAIprocieee90.pdf). I have not been able to trace back the original reference that discusses this property. In this sense, yet another candidate for naming the parameter of interest is the word "rate" (with default value 1).

As for the source of the word dilation, I agree with you that it has been used before in the mathematical literature, roughly as a synonym for scale, but always in the continuous setting. This is also how it was used by Grossmann and Morlet (https://www.researchgate.net/publication/216027353_Grossmann_A_Morlet_J_Decomposition_of_Hardy_functions_into_square_integrable_wavelets_of_constant_shape_SIAM_J_Math_Anal_15_723-736). Also see the use of the term dilation in Strang's SIAM Review 1989 paper (http://www4.ncsu.edu/~gremaud/REU/strang.pdf) and Strang and Nguyen's Wavelets and Filterbanks 1996 book. Using it to characterize discrete upsampling by inserting zeros was first done by Holschneider et al. after 1985, AFAIK.


@longjon
Contributor

longjon commented Dec 21, 2015

Okay, thanks for checking this with me @gpapan!

I think we've converged on the facts, I'll just summarize the last points and how they apply here:

  1. It does seem that the bigram "dilated convolution" is uncommon, and I did not find it in any reference describing the algorithme à trous. (However, it does seem natural and was not first invoked by Fisher and Koltun; a Googling returns, e.g., a 1985 paper on "Limits of Dilated Convolution Transforms" (http://epubs.siam.org/doi/abs/10.1137/0516041; it's the continuous version)). Since we are not changing layer names, this has relevance only to the description in comments and documentation.
  2. It does look to me like the noble identity is (the Z-transformed version of) exactly what we've been doing in computer vision (and perhaps is what we should all be citing!) Note however that the operation here does not have to be used by commutation with a downsampling; that's just a common case. Rate is another possible term for the parameter, although... aren't higher values here lower rates?
  3. I also have no source for "dilate" applied to a discrete-domained function by inserting values before Holschnieder et al. Whether you view the term as having historical precedence over the morphological usage depends on how closely allied you see the continuous and discrete versions, I guess.

[@gpapan by the way, I don't think you can attach things to a GitHub email; you can, however, insert images.]

@gpapan

gpapan commented Dec 21, 2015

Agree on all three points, except that I still think a reference to the Holschneider algorithm-with-holes paper is relevant (it is essentially the same algorithm and same implementation).

Re. 2, the term "rate" would refer to the factor by which the actual input tensor is oversampled compared to the nominal sampling rate with which the less dense model has been trained. E.g., rate = 4 would mean exactly the same as input_stride = 4 or dilation = 4. Please choose whichever you think best.


also add safeguard to avoid unused variable warning and clean code format
@fyu
Contributor Author

fyu commented Dec 25, 2015

@longjon Dilation for deconvolution is actually not implemented, since I don't think it is useful. I put a check in the convolution parameter parsing in case people accidentally change the default dilation when using deconvolution. The other concerns should have been addressed. Let me know if there is anything else I can improve.

@longjon
Contributor

longjon commented Dec 28, 2015

Looks pretty good but there are still a few issues:

  1. The history is not quite right... some of the style fixes seem to have been squashed into the wrong commits, and still appear in the history (e.g., git log --stat should make sense with respect to the commit messages, and style fixes shouldn't appear in any diff (not just the overall diff), but still do).
  2. Rather than throwing cites into the caffe.proto comment, I'd actually prefer (per @shelhamer's comments above and earlier discussion) a short definition/explanation to clarify the meaning of the parameter.
  3. It seems to me that dilation for deconvolution is implemented, since the underlying computation is shared with (regular) convolution. Note that the only change to conv_layer.cpp is in compute_output_shape, so it seems that the corresponding change in deconv_layer.cpp ought to make everything work. (As for that being useful, well, it's at least useful for Zeiler and Fergus-style deconvolution on a net with dilated convolution...) Is there another reason that dilated deconvolution won't work?
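
For reference, the per-axis shape arithmetic implied by point 3 is roughly the following (illustrative, not the exact compute_output_shape code):

    // Effective footprint of a dilated kernel, and the resulting output extents.
    const int kernel_extent = dilation * (kernel - 1) + 1;
    // Convolution (conv_layer.cpp):
    const int conv_out = (input + 2 * pad - kernel_extent) / stride + 1;
    // Deconvolution (deconv_layer.cpp) inverts the convolution shape:
    const int deconv_out = stride * (input - 1) + kernel_extent - 2 * pad;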

To save a round-trip time, I've gone ahead and made the changes above in PR #3487; just take a look and let me know if you approve!

@longjon
Contributor

longjon commented Dec 28, 2015

Merged as #3487.

I did leave the parameter name as dilation, as it is succinct, evocative, in accord with prior use, and difficult to confuse with existing parameters.

Thanks @fyu for a well-constructed PR for a long-awaited feature! Thanks @gpapan, @tamakoji, and @shelhamer for earlier implementations, feedback, and discussion.

@fyu, we're looking forward to seeing models in the zoo!


@longjon longjon closed this Dec 28, 2015
@longjon longjon mentioned this pull request Dec 29, 2015
@gpapan

gpapan commented Dec 30, 2015

Great job guys!


@naibaf7
Member

naibaf7 commented Jan 2, 2016

Here is a small model zoo featuring dilated/strided kernels from earlier this year:
https://github.com/naibaf7/caffe_neural_models
Here is the technical report:
http://arxiv.org/abs/1509.03371 (read the parts related to SK and USK nets)
which is based on and extends previous work:
http://arxiv.org/abs/1412.4526 (also introducing dilated kernels for pooling and convolution).

It also requires this feature in pooling layers though, which is quite trivial to implement.
