This seems to be an idea that has been demonstrated by the existing method. #1
First of all, it is really a very interesting work. This paper's strategy is very similar to one I have read: Stand-Alone Self-Attention in Vision Models (SASA). However, I did not find a relevant comparison in the paper, and the authors should probably add more content explaining the difference.

Comments
Hello, and thank you for your interest, and for pointing this out. We actually cited a more recent follow-up work to SASA by the same group of authors, HaloNet. Another big difference between the papers is the application of NA vs. SASA. I hope that answers your question, and feel free to reopen the issue if you have any more questions.
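For readers unfamiliar with the distinction referenced above, here is a minimal, illustrative one-dimensional sketch of the border behavior, assuming the clamped-window formulation described in the NAT paper. The function names are hypothetical, and this is not the repo's actual CUDA kernel:

```python
# Illustrative sketch (not this repo's implementation): how a SASA-style
# sliding window and Neighborhood Attention (NA) pick key positions for a
# query at index i along one spatial axis of length L, with kernel size k.
# Assumes L >= k.

def sasa_neighborhood(i: int, L: int, k: int) -> list[int]:
    """SASA centers a window on i and zero-pads out-of-bounds positions,
    so border queries attend to fewer than k real keys."""
    r = k // 2
    return [j for j in range(i - r, i + r + 1) if 0 <= j < L]

def na_neighborhood(i: int, L: int, k: int) -> list[int]:
    """NA clamps the window so it stays inside the feature map,
    so every query attends to exactly k keys, even at the borders."""
    r = k // 2
    start = min(max(i - r, 0), L - k)
    return list(range(start, start + k))

# At the border (i=0, L=8, k=3):
#   sasa_neighborhood(0, 8, 3) -> [0, 1]     (2 real keys + zero padding)
#   na_neighborhood(0, 8, 3)   -> [0, 1, 2]  (always exactly 3 keys)
```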
Okay, thanks for your reply!
Thanks for the explanations. Would it be possible to also provide a runtime comparison between NAT and Swin? The current paper only compares FLOPs, which do not always correlate with runtime.
@xuxy09 That is true, FLOPs are not a direct measure of time. They are, however, a measure of computational cost, and we are particularly interested in that because the kernel is still not as fast as it can potentially be. As far as runtime goes, both training and inference on classification run at the same throughput as Swin at the Tiny scale, but they grow apart at larger scales, with NAT being slower than Swin. Again, that is only a limitation of the existing implementation, which we expect will change in the near future. You can also refer to issue #13 for details.
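For reference, throughput on classification models is typically measured along these lines. This is a generic sketch, not this repo's benchmarking code, and the `nat_tiny` / `swin_tiny` names at the end are placeholders for any `nn.Module`:

```python
# Minimal GPU throughput measurement sketch: wall-clock images/second,
# which is what can diverge from analytical FLOP counts.
import time
import torch

@torch.no_grad()
def throughput(model: torch.nn.Module, batch_size: int = 64, steps: int = 50) -> float:
    model.cuda().eval()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    for _ in range(10):           # warm-up iterations (kernel compilation, caches)
        model(x)
    torch.cuda.synchronize()      # wait for queued kernels before starting the clock
    start = time.time()
    for _ in range(steps):
        model(x)
    torch.cuda.synchronize()      # make sure all work finished before stopping the clock
    return batch_size * steps / (time.time() - start)

# e.g. compare throughput(nat_tiny) vs. throughput(swin_tiny)
```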
First of all, thank you very much for your contribution to the community. A similar approach seems to be mentioned in the Swin Transformer paper, and Swin's repo contains an implementation of this variant (sliding window). NAT's experimental results for this part of the ablation seem to be consistent with Swin's; could the quantitative gains be caused in part by the narrower but deeper network? I'm very much looking forward to your new CUDA implementation, as I tried a similar idea but gave up because of the speed and memory overhead.
@IDKiro Based on my reading, the sliding window approach seems to be more similar to SASA than to NA. We also observed that NAT-T is just as fast as Swin-T at inference on ImageNet, while their sliding window approach seems to be much slower. The Swin-T-based result on ImageNet happens to be the same, 81.4%, but this is likely coincidental: as you can see in our ablation table, the gap grows to ~0.5% as we shift to our NAT configuration. I would also point out that our segmentation result with that model (not in the paper) was 46.3 mIoU, while the number in Swin's table is 45.8. We also did a detection run with that model, but using Mask R-CNN (the table from Swin uses Cascade Mask R-CNN), and observed it performed on par with Swin (46.1 mAP vs. Swin-T's 46.0), while the sliding window approach seems to do worse than Swin-T.