about MaxPool #10

Open
foralliance opened this issue Dec 20, 2020 · 5 comments

Comments


foralliance commented Dec 20, 2020

@Nandan91 @rajatsaini0294 Hi,
For each subspace, the input is HxWxG; after DW + MaxPool + PW, the intermediate attention map is HxWx1, and after Softmax + Expand, the final attention map is HxWxG.
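To make sure I am reading the pipeline correctly, here is a minimal PyTorch sketch of one subspace branch; the 1x1 kernels, the 3x3 stride-1 max pool, and the omitted BN/ReLU are my own simplifying assumptions, not taken from your code:

```python
import torch
import torch.nn as nn

class SubspaceAttention(nn.Module):
    """One subspace branch as I understand it: DW -> MaxPool -> PW (G -> 1) -> Softmax -> Expand.
    Kernel sizes and the missing BN/ReLU are simplifying assumptions."""
    def __init__(self, g):
        super().__init__()
        self.dw = nn.Conv2d(g, g, kernel_size=1, groups=g)            # DW: depthwise conv
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # keeps H x W
        self.pw = nn.Conv2d(g, 1, kernel_size=1)                      # PW: G -> 1 channel

    def forward(self, x):                       # x: (B, G, H, W)
        a = self.pw(self.pool(self.dw(x)))      # (B, 1, H, W): a single map per subspace
        b, _, h, w = a.shape
        a = torch.softmax(a.view(b, 1, -1), dim=-1).view(b, 1, h, w)  # softmax over H*W
        a = a.expand_as(x)                      # Expand: the same map for all G channels
        return x * a                            # reweighted subspace features
```

Dropping `self.pw` and applying the softmax to the pooled G-channel map directly would instead give each channel its own weight, which is the alternative I am asking about below.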

Because the output dimension of this PW operation is 1, the final attention map amounts to a single weight shared by all channels. Why use this PW? Why is it designed so that all channels share one weight?

If the PW operation were removed, i.e., the output of the MaxPool were treated as the final attention map, then each point and each channel would have its own independent weight. Why not design it this way?

Many thanks!

@rajatsaini0294 (Collaborator)

Thanks, foralliance, for the question.

Your question corresponds to Case 3 in Section 3.2 of the paper. Please refer to it.

foralliance (Author) commented Jan 30, 2021

@rajatsaini0294
Thanks for your reply.
You are right.
This PW operation is necessary. Only in this way can interaction between channels be guaranteed.

Another question: if an ordinary convolution whose output dimension is also G were used to replace the PW, this would not only give each point and each channel its own independent weight, but would also keep the interaction between channels within each group. Have you tried such a design?

@rajatsaini0294 (Collaborator)

Do you mean that, without partitioning the input into G groups, a convolution is used to generate G output channels and G attention maps are produced from that?
If I have misunderstood, could you explain your idea in more detail?

@foralliance (Author)

Sorry for not expressing this clearly.

My idea is that everything is exactly the same as in Figure 2; the only difference is that an ordinary convolution whose output dimension is also G replaces the original PW.

This replacement still ensures interaction between the channels within each group, i.e., it captures the cross-channel information you mention in Case 3 of Section 3.2. In addition, it has the extra effect that each point and each channel gets its own independent weight, rather than all channels in a group sharing one weight.
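Concretely, a rough sketch of the variant I have in mind (the 1x1 kernel and the pooling settings are again just my assumptions, for illustration only):

```python
import torch
import torch.nn as nn

class SubspaceAttentionGtoG(nn.Module):
    """Hypothetical variant: same DW + MaxPool front end, but the G -> 1 PW is replaced
    by an ordinary conv with G output channels (1x1 kernel is my assumption), so every
    channel in the subspace gets its own attention map."""
    def __init__(self, g):
        super().__init__()
        self.dw = nn.Conv2d(g, g, kernel_size=1, groups=g)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.mix = nn.Conv2d(g, g, kernel_size=1)   # G -> G: full cross-channel mixing

    def forward(self, x):                       # x: (B, G, H, W)
        a = self.mix(self.pool(self.dw(x)))     # (B, G, H, W): one map per channel
        b, g, h, w = a.shape
        a = torch.softmax(a.view(b, g, -1), dim=-1).view(b, g, h, w)  # softmax over H*W
        return x * a                            # no Expand step is needed
```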

@rajatsaini0294 (Collaborator)

I understand your point. We have not tried this design because it would increase the number of parameters. You are certainly welcome to try it and let us know how it works. :-)
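For a rough sense of that cost, assuming for example G = 16 channels per subspace and 1x1 kernels (example numbers only, not the settings from the paper):

```python
import torch.nn as nn

g = 16  # channels per subspace -- an assumed value, purely for illustration

pw = nn.Conv2d(g, 1, kernel_size=1)    # original PW: G -> 1
full = nn.Conv2d(g, g, kernel_size=1)  # proposed:    G -> G

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(pw), n_params(full))    # 17 vs. 272 parameters per subspace
```

The gap grows with G (and with the kernel size, if a larger kernel is used), and it is paid once per subspace.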
