How would you apply this to a ViT with no clear ERF / TRF? #68

Open
JohnMBrandt opened this issue May 24, 2024 · 1 comment

@JohnMBrandt

This work is very helpful for my research. I am training detectors using a ViT backbone. I have used RFLA for both a ResNet and a ViT backbone and I find that in either case it improves the detection accuracy of small objects compared to NWD RKA.

However, this work is built on the ERF / TRF of the ResNet, which is modeled as a Gaussian derived from the series of convolutional layers in a ResNet. ViTs don't have as clear a way of attributing a receptive field to each pyramid level in an FPN built on the ViT output (e.g. https://openreview.net/pdf?id=Gl8FHfMVTZu). I'm curious whether you have any suggestions for modifying the ERF calculations for a ViT.

Thanks!

@Chasel-Tsui
Owner

Very interesting question. At present, it is hard to estimate the effective receptive field for ViTs. If you want to adapt the pipeline to ViT-based methods, a simple solution may be to directly use the receptive fields (from bottom to top) in this repo for the calculation and discard the redundant ones (for example, if you only have 4 FPN levels in the ViT, you can use the receptive field calculation for the lowest 4 levels from the code). However, I am not sure whether this will perform well.
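
A minimal sketch of that suggestion, assuming a toy convolutional stack rather than the actual values used in this repo: it computes the theoretical receptive field (TRF) per pyramid level with the standard recurrence (`rf += (kernel - 1) * jump`, `jump *= stride`) and then keeps only the lowest levels. The names `level_trfs`, `keep_lowest_levels`, and `toy_levels` are hypothetical, and the kernel/stride pairs are placeholders, not the ResNet configuration.

```python
from typing import List, Tuple

Layer = Tuple[int, int]  # (kernel_size, stride); hypothetical layer description


def level_trfs(levels: List[List[Layer]]) -> List[int]:
    """Return the theoretical receptive field at the end of each pyramid level.

    `levels` lists, bottom to top, the conv layers added at each level.
    The receptive field accumulates across levels.
    """
    rf, jump = 1, 1  # receptive field and cumulative stride at the input
    trfs = []
    for layers in levels:
        for kernel, stride in layers:
            rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
            jump *= stride             # striding spreads later layers further apart
        trfs.append(rf)
    return trfs


def keep_lowest_levels(trfs: List[int], num_fpn_levels: int) -> List[int]:
    """Keep the receptive fields of the lowest `num_fpn_levels` levels, discard the rest."""
    if num_fpn_levels > len(trfs):
        raise ValueError("more FPN levels requested than receptive fields available")
    return trfs[:num_fpn_levels]


# Placeholder 5-level stack (NOT the repo's ResNet definition).
toy_levels = [[(7, 2), (3, 2)], [(3, 2)], [(3, 2)], [(3, 2)], [(3, 2)]]
all_trfs = level_trfs(toy_levels)
vit_trfs = keep_lowest_levels(all_trfs, 4)  # a ViT FPN with only 4 levels
```

Whether CNN-derived receptive fields are a good prior for ViT features is exactly the open question in this thread, so the result should be treated as a starting point to validate empirically.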
