How would you apply this to a ViT with no clear ERF / TRF? #68

JohnMBrandt · 2024-05-24T15:35:37Z

This work is very helpful for my research. I am training detectors using a ViT backbone. I have used RFLA for both a ResNet and a ViT backbone and I find that in either case it improves the detection accuracy of small objects compared to NWD RKA.

However, this work is built on the ERF / TRF of the ResNet, which is modeled as a Gaussian derived from the stack of convolutional layers in a ResNet. ViTs don't have as clear a way of attributing a receptive field to each pyramid level in an FPN built on the ViT output (e.g. https://openreview.net/pdf?id=Gl8FHfMVTZu). I'm curious whether you have any suggestions for modifying the ERF calculations for a ViT.

Thanks!

Chasel-Tsui · 2024-06-23T02:16:24Z

Very interesting question. At present, it is hard to estimate the effective receptive field for ViTs. If you want to adapt the pipeline to ViT-based methods, a simple solution may be to directly use the receptive fields (from bottom to top) computed in this repo and discard the redundant ones (for example, if you only have 4 FPN levels in the ViT, you can use the receptive field calculation for the lowest 4 levels from the code). However, I am not sure whether this approach will perform well.
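For anyone trying the suggestion above, here is a rough sketch of the idea, not the repo's actual code. It uses the standard theoretical receptive field recurrence for a stack of convolution / pooling layers, then keeps only the lowest levels to match a shorter FPN. The function names `theoretical_rf` and `select_lowest_levels`, the toy layer list, and the per-level values in `resnet_level_rfs` are all hypothetical placeholders; substitute the values that the receptive field calculation in this repo actually produces.

```python
from typing import List, Sequence, Tuple


def theoretical_rf(layers: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Theoretical receptive field of a stack of (kernel_size, stride) layers.

    Standard recurrence: each layer widens the receptive field by
    (kernel_size - 1) * jump, where jump is the cumulative stride
    of all earlier layers. Returns (receptive_field, jump).
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf, jump


def select_lowest_levels(level_rfs: Sequence[int], num_levels: int) -> List[int]:
    """Keep the receptive fields of the lowest `num_levels` pyramid levels.

    `level_rfs` must be ordered from bottom (finest) to top (coarsest).
    """
    if num_levels > len(level_rfs):
        raise ValueError("more FPN levels requested than receptive fields available")
    return list(level_rfs[:num_levels])


# Toy stem: 7x7 conv with stride 2, then 3x3 pooling with stride 2.
stem_rf, stem_jump = theoretical_rf([(7, 2), (3, 2)])

# Hypothetical placeholder values, ordered bottom to top.
# Replace with the per-level receptive fields computed by the repo.
resnet_level_rfs = [32, 64, 128, 256, 512]

# A ViT-based FPN with only 4 levels reuses the lowest 4 entries.
vit_level_rfs = select_lowest_levels(resnet_level_rfs, num_levels=4)
```

This only reuses the ResNet-derived values as a prior for the ViT pyramid levels; it does not estimate the ViT's own effective receptive field, so whether it helps would need to be checked empirically.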
