
fix ViT model output + rewrite attention layer + adapt torchvision script #230

Merged 6 commits on May 5, 2023

Conversation

@CarloLucibello (Member) commented Apr 25, 2023:

Current status of this PR: all weights are copied, but the outputs on the test image differ (and Flux's don't make sense).

Closes #231

@theabhirath (Member) commented Apr 25, 2023:

I'm not sure that I want the porting scripts to be generalised to all models. We used one for CNNs because it's convenient, but even there the script is not always directly usable (for example, SqueezeNets require you to remove the reverse added in #229 for the script to work). At best we could perhaps link the scripts somewhere in the docs, but having them in the repo IMO suggests to users that Metalhead somehow cannot be used to train models on their own. The ideal solution, of course, is to train all the models in Metalhead on ImageNet and host those weights. But since there are certain obstacles to that, which I encountered last summer, porting only the CNN weights (and leaving the porting script as a link instead of directly in the repo) is a good middle ground, I feel. However, this is only my opinion. @darsnack and @ToucheSir may also think of something?

@darsnack (Member) commented:
Yeah, that's the same reason that I only linked to it from the model card for the HF upload that used it (for reproducibility). I was wary of suggesting to users that it is a robust way to port weights from torchvision.

Maybe a scripts folder like the one Carlo created is fine. Why not have separate scripts in the folder that work for different sets of models? Since this isn't shipped code for the package, it does not need to be generic or robust. Having something cobbled together that allows us to ship more pre-trained models is the sweet spot. We can include a scripts/README.md that warns very explicitly that these scripts are a starting point and not a robust solution. Once something like FluxML/Flux.jl#2239 lands, we should only need the script once to generate the initial weights that work. Keeping it in the repo would only be to have a historical record for the model cards.

@ToucheSir (Member) commented:
I agree we need not be worried about having a single one-size-fits-all script. If one script per model family/group of model families helps simplify the porting code, that sounds good to me.

@CarloLucibello (Member Author) commented:
The problem with ViT is in the attention module; probably the weights have to be copied in some particular fashion. I will have to investigate further.

I really want to get ViT in because it is the most popular vision backbone these days.
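
For context on the weight-copy issue: torchvision's attention block stores the query/key/value projections fused into a single matrix of shape (3·dim, dim), stacked as [W_q; W_k; W_v]. Below is a minimal sketch of what such a copy could look like on the Julia side, assuming the Flux layer also uses a single fused qkv Dense; the copy_qkv! helper and the layout assumptions are illustrative, not the actual porting script:

```julia
using Flux

# Hypothetical helper: copy a fused (3dim, dim) torchvision projection matrix,
# stacked as [W_q; W_k; W_v], into a fused Flux Dense(dim => 3dim) qkv layer.
function copy_qkv!(qkv::Dense, w_torch::AbstractMatrix)
    dim = size(w_torch, 2)
    @assert size(w_torch, 1) == 3dim
    wq = w_torch[1:dim, :]
    wk = w_torch[(dim + 1):(2dim), :]
    wv = w_torch[(2dim + 1):(3dim), :]
    # If the Flux layer expects the same [q; k; v] stacking along the output
    # dimension, this is a straight copy; a per-head interleaved layout would
    # instead need a permutation of the rows here.
    copyto!(qkv.weight, vcat(wq, wk, wv))
    return qkv
end
```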

@CarloLucibello CarloLucibello changed the title adapt porting script for ViT fix ViT model output and rewrite attention layer May 5, 2023
@CarloLucibello CarloLucibello changed the title fix ViT model output and rewrite attention layer fix ViT model output + rewrite attention layer + adapt torchvision script May 5, 2023
@@ -1,5 +1,5 @@
 """
-    MHAttention(planes::Integer, nheads::Integer = 8; qkv_bias::Bool = false,
+    MultiHeadSelfAttention(planes::Integer, nheads::Integer = 8; qkv_bias::Bool = false,
@CarloLucibello (Member Author):

made the name more informative

         pool === :class ? x -> x[:, 1, :] : seconddimmean),
-        Chain(LayerNorm(embedplanes), Dense(embedplanes, nclasses, tanh_fast)))
+        Chain(LayerNorm(embedplanes), Dense(embedplanes, nclasses)))
@CarloLucibello (Member Author):

this final tanh had no reason to exist

@CarloLucibello (Member Author) commented:
After rewriting the attention layer on top of NNlib and removing the final tanh from ViT, I can reproduce PyTorch's outputs, although there is still a slight discrepancy:

Flux:
    acoustic guitar: 0.90519154
    stage: 0.0040107034
    harmonica: 0.0028614246
    microphone: 0.002621256
    electric guitar: 0.0025401094
PyTorch:
    acoustic guitar: 0.90745604
    stage: 0.0038461224
    harmonica: 0.002782756
    microphone: 0.0025289422
    electric guitar: 0.0023941135

This could be due to differences in the implementation of layer norm; see FluxML/Flux.jl#2220.
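
For reference, here is a rough sketch of the shape of a self-attention layer built on top of NNlib.dot_product_attention; this is an illustrative reconstruction (the type and field names are not from the PR), not the exact layer added here:

```julia
using Flux, NNlib

# Illustrative multi-head self-attention built on NNlib.dot_product_attention.
struct SelfAttention{Q, P}
    qkv::Q      # fused projection, Dense(planes => 3planes)
    proj::P     # output projection, Dense(planes => planes)
    nheads::Int
end

Flux.@functor SelfAttention

function SelfAttention(planes::Integer, nheads::Integer = 8; qkv_bias::Bool = false)
    return SelfAttention(Dense(planes, 3planes; bias = qkv_bias),
                         Dense(planes, planes), nheads)
end

function (m::SelfAttention)(x)                  # x: (planes, seq_len, batch)
    qkv = m.qkv(x)                              # (3planes, seq_len, batch)
    q, k, v = Flux.chunk(qkv, 3; dims = 1)      # three (planes, seq_len, batch) slices
    y, _ = NNlib.dot_product_attention(q, k, v; nheads = m.nheads)
    return m.proj(y)
end
```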

@CarloLucibello (Member Author) commented:
I'm happy with this. If I can get an approval, I'll merge and move on.

@CarloLucibello (Member Author) commented:
Looks like changing the implementation of LayerNorm has little effect:

Flux (LayerNormV2):
    acoustic guitar: 0.9051971
    stage: 0.0040056095
    harmonica: 0.0028621724
    microphone: 0.0026183864
    electric guitar: 0.0025359727
Flux:
    acoustic guitar: 0.90519154
    stage: 0.0040107034
    harmonica: 0.0028614246
    microphone: 0.002621256
    electric guitar: 0.0025401094
PyTorch:
    acoustic guitar: 0.90745604
    stage: 0.0038461224
    harmonica: 0.002782756
    microphone: 0.0025289422
    electric guitar: 0.0023941135

So I don't know why we observe these discrepancies. I added LayerNormV2 to the Layers module but didn't use it anywhere, since I'm not sure it will really be needed.
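
For the record, the difference discussed in FluxML/Flux.jl#2220 comes down to where the epsilon goes. A minimal sketch of the two conventions (illustrative code, not the actual Flux or Metalhead implementation):

```julia
using Statistics

# Flux-style normalisation: epsilon is added to the standard deviation.
function layernorm_eps_outside(x; dims = 1, ϵ = 1f-5)
    μ = mean(x; dims)
    σ = std(x; dims, mean = μ, corrected = false)
    return (x .- μ) ./ (σ .+ ϵ)
end

# PyTorch-style normalisation (the LayerNormV2 convention): epsilon goes
# inside the square root, applied to the variance.
function layernorm_eps_inside(x; dims = 1, ϵ = 1f-5)
    μ = mean(x; dims)
    σ² = var(x; dims, mean = μ, corrected = false)
    return (x .- μ) ./ sqrt.(σ² .+ ϵ)
end
```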

@darsnack (Member) left a review:
A few minor changes before merging, but otherwise this looks good.

Review threads on src/layers/Layers.jl and src/layers/normalise.jl were marked outdated and resolved.
@@ -100,9 +102,10 @@ end
 @functor ViT

 function ViT(config::Symbol; imsize::Dims{2} = (256, 256), patch_size::Dims{2} = (16, 16),
-             pretrain::Bool = false, inchannels::Integer = 3, nclasses::Integer = 1000)
+             pretrain::Bool = false, inchannels::Integer = 3, nclasses::Integer = 1000,
+             qkv_bias=false)
Reviewer (Member):
Unless it is typical to adjust this toggle, I think it should not be exposed going from vit to ViT. The logic in the codebase has been to keep the uppercase exports as simple as possible.

@CarloLucibello (Member Author) commented May 5, 2023:
I had to add it since the default for torchvision is true, while here it is false. The torchvision model is given by

ViT(:base, imsize=(224,224), qkv_bias=true)

I think we should change the defaults here to match that before tagging the breaking release, but this can be done in another PR.

Reviewer (Member):
Okay, so change the default to true and remove the keyword? I assume you almost always want it as true.

@CarloLucibello (Member Author):
Yes, I'll do it in the next PR.

@CarloLucibello CarloLucibello merged commit 278bab6 into master May 5, 2023
@CarloLucibello CarloLucibello deleted the cl/vit branch July 17, 2023 05:52
Development

Successfully merging this pull request may close these issues.

cannot match attention layer output to pytorch's one