ImageNet Pretrained TimesFormer #5
Comments
Hello, is the space attention pretrained from ViT on ImageNet? I was not aware of that. Maybe we can work out a solution by taking a pretrained ViT like this repo does. Would you be interested in something like that?
Hey. That would definitely be good. I think most people, including me, are not able to make the TimeSformer converge without ViT initialization. Unfortunately, I don't have any experience with porting weights or with how exactly weights are organized in torch modules. I see that the author of this repo https://github.com/m-bain/video-transformers, however, has been able to include ViT weight initialization. Maybe this helps you in case you are looking to include this in your repo. Sorry, I would help if I could 😬👍
Kinda late on this, but I tried it out and it works! Check it out here: I managed to load 176 layers from ViT. This is a pretty awesome piece of PyTorch functionality that I had no idea about! Thanks a ton for pointing out that repo, @RaivoKoot. The basic idea is this:

```python
import torch.nn as nn
from timm.models import vision_transformer

vit_model = vision_transformer.vit_base_patch16_224(pretrained=True)
model.load_state_dict(vit_model.state_dict(), strict=False)
```

(don't forget to
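To make the mechanism concrete, here is a minimal, self-contained sketch of how `load_state_dict(..., strict=False)` copies only the parameters whose names match, which is what makes partial ViT-to-video-model initialization possible. `TinyViT` and `TinyVideoModel` are hypothetical stand-ins (not the real TimeSformer or timm classes); the video model shares the ViT layer names but adds an extra temporal layer with no ViT counterpart.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Hypothetical stand-in for a pretrained image model."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(16, 32)
        self.block = nn.Linear(32, 32)

class TinyVideoModel(nn.Module):
    """Hypothetical stand-in for the video model: same names plus a temporal layer."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(16, 32)
        self.block = nn.Linear(32, 32)
        self.temporal_block = nn.Linear(32, 32)  # no counterpart in TinyViT

vit = TinyViT()
video = TinyVideoModel()

# strict=False copies every key whose name and shape match, and
# reports the rest instead of raising an error.
result = video.load_state_dict(vit.state_dict(), strict=False)
print(result.missing_keys)     # temporal layer params, left at random init
print(result.unexpected_keys)  # ViT keys with no match (none here)

# The matching layers now hold the "pretrained" weights.
assert torch.equal(video.patch_embed.weight, vit.patch_embed.weight)
```

The `IncompatibleKeys` result returned by `load_state_dict` is worth printing in practice: it tells you exactly which layers were left uninitialized, which is how one can verify a count like "176 layers loaded from ViT".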
I see you have recently added the TimeSformer model to this repository. In the paper, the authors initialize their model weights from ImageNet-pretrained ViT weights. Does your implementation offer this too? Thanks!