Add support of Vision Transformer #133

Open · Tracked by #136
Yung-zi opened this issue Jan 22, 2022 · 6 comments

Labels: module: methods (Related to torchcam.methods), type: improvement (New feature or request)

Yung-zi commented Jan 22, 2022

🚀 Feature

I really appreciate your great work! However, I have a question: can Layer-CAM be used with a Vision Transformer network? If it does work, what would I need to change?

Motivation & pitch

I'm working on a project related to CAM.

Alternatives

No response

Additional context

No response

@Yung-zi Yung-zi added the type: improvement New feature or request label Jan 22, 2022
@frgfm frgfm added this to the 0.4.0 milestone Feb 1, 2022
@frgfm frgfm added the module: methods Related to torchcam.methods label Feb 1, 2022
frgfm (Owner) commented Feb 1, 2022

Hello @Yung-zi 👋

My apologies, I've been busy with other projects lately!
As of right now, the library is designed to work with CNNs. However, its design only relies on forward activation and backpropagated gradient hooks. So to answer your question: I'd need to run some tests, but as long as the output activation of a given layer has shape (N, C, H, W) and the way it is computed doesn't break backpropagation (i.e. it is differentiable), the library should work without much (perhaps any) change 😄
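For reference, here is a minimal sketch of that hook-based workflow on a CNN, following the usual torchcam usage pattern; the ResNet-18 model and the `layer4` target are illustrative choices, not requirements:

```python
import torch
from torchvision.models import resnet18
from torchcam.methods import LayerCAM

# Any model exposing a layer whose output activation is (N, C, H, W) should do
model = resnet18(pretrained=True).eval()
cam_extractor = LayerCAM(model, "layer4")

input_tensor = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed image
out = model(input_tensor)
# Retrieve the CAM for the top predicted class; its spatial size follows the hooked layer
cams = cam_extractor(out.squeeze(0).argmax().item(), out)
```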

Either way, I intend to spend more time on Vision Transformer compatibility for the next release 👍
If you're interested in helping or providing feedback once it's in progress, let me know!

@frgfm frgfm changed the title Vision Transformer Add support of Vision Transformer Feb 6, 2022
@frgfm frgfm mentioned this issue Feb 6, 2022
Yung-zi (Author) commented Jul 8, 2022

> Hello @Yung-zi 👋
>
> My apologies, I've been busy with other projects lately! As of right now, the library is designed to work with CNNs. However, its design only relies on forward activation and backpropagated gradient hooks. So to answer your question: I'd need to run some tests, but as long as the output activation of a given layer has shape (N, C, H, W) and the way it is computed doesn't break backpropagation (i.e. it is differentiable), the library should work without much (perhaps any) change 😄
>
> Either way, I intend to spend more time on Vision Transformer compatibility for the next release 👍 If you're interested in helping or providing feedback once it's in progress, let me know!

I'm so sorry for the late reply. I tried modifying your code before, but the results didn't look right; maybe I made some mistakes. Have you managed to get it working on a Vision Transformer?

frgfm (Owner) commented Aug 2, 2022

Partially yes!
But I have staged this for the next release anyway so I'll dive into it to make it available :)

frgfm (Owner) commented Dec 31, 2022

Quick update!
As of today, here is the support status of Torchvision transformer architectures:

  • maxvit
  • swin
  • swin_v2
  • vit (so far I can't see a way to make this integration seamless, because of the concatenation on the channel dimension and the dimension swapping; see the sketch below for the kind of reshaping involved)
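As a rough illustration of that dimension juggling, here is a minimal sketch (not part of torchcam; the helper name and the 14×14 default grid for a 224×224 input with 16×16 patches are assumptions) of how ViT token activations could be mapped back to a CNN-style layout:

```python
import torch

def vit_tokens_to_spatial(tokens: torch.Tensor, grid_size: int = 14) -> torch.Tensor:
    """Turn ViT token activations of shape (N, 1 + H*W, C) into a CNN-style
    (N, C, H, W) feature map: drop the class token, then swap the token and
    channel axes before reshaping the token axis into a spatial grid."""
    patch_tokens = tokens[:, 1:, :]  # drop the class token
    n, num_patches, c = patch_tokens.shape
    assert num_patches == grid_size * grid_size, "unexpected number of patch tokens"
    return patch_tokens.transpose(1, 2).reshape(n, c, grid_size, grid_size)
```

A transform along these lines could be applied to the hooked activations and gradients so that the existing (N, C, H, W) machinery keeps working.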

frgfm (Owner) commented Jan 2, 2023

Another update: ViT requires another method, called attention flow!
I'll try to investigate and implement it, but this is a bit more complex than just inverting the axis swap and slicing.
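For context, the paper that introduced attention flow ("Quantifying Attention Flow in Transformers", Abnar & Zuidema, 2020) also describes the simpler attention rollout variant. Here is a minimal, hedged sketch of rollout; the function name and the 0.5/0.5 residual weighting are assumptions, not torchcam code:

```python
import torch

def attention_rollout(attentions):
    """Attention rollout: average each layer's attention over heads, add an
    identity term to account for the residual connection, re-normalize the rows,
    and compose the layers by matrix multiplication.

    attentions: list of (N, num_heads, num_tokens, num_tokens) attention maps,
                ordered from the first to the last encoder block.
    Returns an (N, num_tokens, num_tokens) rollout matrix.
    """
    rollout = None
    for attn in attentions:
        attn = attn.mean(dim=1)                                 # (N, T, T), head average
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = 0.5 * attn + 0.5 * eye                           # residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)            # keep rows normalized
        rollout = attn if rollout is None else torch.bmm(attn, rollout)
    return rollout
```

The class-token row of the result (e.g. rollout[:, 0, 1:]) can then be reshaped to the patch grid to obtain a patch-level heatmap.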

@frgfm frgfm modified the milestones: 0.4.0, 0.4.1, 0.5.0 Oct 19, 2023
YAN-0802 commented

Your excellent work has helped me a lot, thank you! However, I have a question. I downloaded torchcam 0.4.0 and got good visualization results on CNN models, but it didn't work on the ViT model. Here's what happened: since I was working offline, I downloaded the ViT weight file and loaded the model using timm. The result was blue pixels covering the entire image, i.e. no heatmap area was found. What do I need to change in the code to make it work? Or, as you mentioned above, are you still working on it? Thank you for taking time out of your busy schedule.
[Attached image: raw_ILSVRC2012_val_00000024.JPEG]
