Evaluation bug when using GELU vs QuickGELU -- changes the results for the trained models #7

bryant1410 opened this issue Dec 23, 2023 · 0 comments


Hey, I believe there's a bug in how the hard-negative-augmented models are evaluated. Your code uses open_clip, which supports both the original ViT-B/32 architecture with QuickGELU activations (they name it ViT-B-32-quickgelu) and their "standard" variant with plain GELU (ViT-B-32).
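
For context, the two activations differ only in how GELU is computed: QuickGELU is the sigmoid approximation used in the original OpenAI CLIP code. A minimal sketch in PyTorch:

```python
import torch

def quick_gelu(x: torch.Tensor) -> torch.Tensor:
    # QuickGELU as used in the original OpenAI CLIP: a sigmoid approximation of GELU.
    return x * torch.sigmoid(1.702 * x)

# "Standard" GELU, which open_clip's plain ViT-B-32 config uses instead:
gelu = torch.nn.GELU()
```

The two functions are close but not identical, so weights trained with one give noticeably different outputs when run with the other.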

When you use their code, you specify a model (config) name and a pretrained checkpoint, where the pretrained checkpoint is either a name supported for that model or a path. They accept the "openai" checkpoint for both ViT-B-32 and ViT-B-32-quickgelu (and similarly for RN50) because this pretrained name is hardcoded to switch the model to the QuickGELU implementation, regardless of which of the two configs was requested.

The problem is that ViT-B-32 also seems to have been used for evaluation, but with a path passed as pretrained instead of "openai". In that case the hardcoded branch is never triggered, so GELU is used instead of QuickGELU, and this changes the results. This is error-prone behavior on open_clip's part, in my humble opinion. The fix is to use ViT-B-32-quickgelu for evaluation, or to pass the flag --force-quick-gelu.
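
To make the difference concrete, here is a minimal sketch of the loading paths in open_clip (the checkpoint path negclip_ft.pt is a placeholder for one of your shared checkpoints; exact keyword names may vary slightly across open_clip versions):

```python
import open_clip

# Loading a fine-tuned checkpoint from a local path with the plain "ViT-B-32"
# config builds the towers with nn.GELU: the QuickGELU override only kicks in
# when pretrained == "openai" (or when the -quickgelu config / force flag is used).
model_gelu, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="negclip_ft.pt"  # placeholder path
)

# Either of the following builds the architecture the checkpoint was actually
# trained with (QuickGELU, matching the original OpenAI weights):
model_qg, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="negclip_ft.pt"
)
# or, equivalently:
model_qg2, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="negclip_ft.pt", force_quick_gelu=True
)
```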

How do I know that you ran the evaluation this way (i.e., that you hit this bug)? Because when I use GELU I can reproduce your numbers from Table 6, but when I use QuickGELU I get different numbers. I reproduced them using a fork of open_clip and my own SugarCrepe evaluation of the checkpoints you shared.
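
For reference, my per-example SugarCrepe scoring is the standard image-to-text comparison. A minimal sketch, assuming an open_clip model, its tokenizer, and a preprocessed image tensor (helper names here are my own, not from your code):

```python
import torch

@torch.no_grad()
def sugarcrepe_correct(model, tokenizer, image, pos_caption, neg_caption):
    # An example counts as correct when the image is more similar to the
    # positive caption than to the hard negative.
    texts = tokenizer([pos_caption, neg_caption])
    image_features = model.encode_image(image.unsqueeze(0))
    text_features = model.encode_text(texts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    sims = (image_features @ text_features.T).squeeze(0)
    return bool(sims[0] > sims[1])
```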

Below I compare the results I obtained with the ones you reported, for two checkpoints:

Numbers for NegCLIP FT:

| Model | Replace-obj | Replace-att | Replace-rel | Swap-obj | Swap-att | Add-obj | Add-att |
|---|---|---|---|---|---|---|---|
| Reported by you | 92.68 | 85.91 | 76.46 | 75.20 | 75.38 | 88.80 | 82.80 |
| My evaluation with ViT-B-32 | 92.62 | 85.91 | 76.81 | 75.61 | 75.08 | 88.80 | 82.95 |
| My evaluation with ViT-B-32-quickgelu | 93.83 | 88.20 | 74.54 | 75.61 | 76.88 | 89.91 | 85.12 |

Numbers for ViT-B-32 fine-tuned with Replace:

| Model | Replace-obj | Replace-att | Replace-rel | Swap-obj | Swap-att | Add-obj | Add-att |
|---|---|---|---|---|---|---|---|
| Reported by you | 93.46 | 90.36 | 81.01 | 73.98 | 75.23 | 90.93 | 87.86 |
| My evaluation using ViT-B-32 | 93.46 | 90.23 | 80.94 | 73.98 | 75.53 | 90.93 | 88.01 |
| My evaluation using ViT-B-32-quickgelu | 95.34 | 89.97 | 80.01 | 75.61 | 76.58 | 90.93 | 87.27 |

BTW, the original NegCLIP paper also seems to have had this issue.

The numbers improve considerably on other benchmarks, such as ImageNet (I have also tried others). For example, see the numbers for the Replace checkpoint:

| Model | ImageNet |
|---|---|
| My evaluation using ViT-B-32 | 52.9 |
| My evaluation using ViT-B-32-quickgelu | 59.1 |
| My evaluation for OpenAI CLIP | 63.4 |

As we can see, the numbers are much closer to those of the original OpenAI-pretrained CLIP once this bug is fixed.
