
Mean/Standard deviation over training dataset to remove rogue embedding dimensions outside normal magnitudes? #22

Open
BradNeuberg opened this issue Feb 28, 2024 · 7 comments


@BradNeuberg

I've previously been experimenting with the RSICD CLIP model [1], which was trained only on the RSICD dataset. I'm very impressed by how many data sources you've used to train RemoteCLIP, which you document in your paper with this table:

[Image: table from the RemoteCLIP paper listing the training data sources]

In my experiments with RSICD, I've found that normalizing with the mean and standard deviation of the training data is an important part of getting quality results: input imagery then follows the remote sensing distribution set by the training data rather than the mean/std inherited from the parent OpenCLIP model. Remote sensing imagery obviously has a very different distribution than standard consumer photography.

Since you trained on so many datasets, which is a strength, it is unfortunately quite hard for us to compute our own mean/std. Would it be possible for you to compute a per-band mean and std over the datasets you trained with, which you may still have locally? It's probably fine to compute this using a sampling strategy as long as the sample size is large enough.
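In case it helps, here is a minimal sketch of how those per-band statistics could be estimated by sampling. The directory, file pattern, and sample size below are placeholders, not anything from the RemoteCLIP setup:

```python
# Rough sketch: estimate per-band mean/std by randomly sampling training images.
# The directory, glob pattern, and sample size are placeholders.
import random
from pathlib import Path

import numpy as np
from PIL import Image

image_paths = list(Path("training_data").rglob("*.jpg"))
sample_paths = random.sample(image_paths, k=min(10_000, len(image_paths)))

pixel_sum = np.zeros(3, dtype=np.float64)
pixel_sq_sum = np.zeros(3, dtype=np.float64)
pixel_count = 0

for path in sample_paths:
    # Scale pixels to [0, 1] to match the usual CLIP preprocessing convention.
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0
    flat = pixels.reshape(-1, 3)
    pixel_sum += flat.sum(axis=0)
    pixel_sq_sum += (flat ** 2).sum(axis=0)
    pixel_count += flat.shape[0]

mean = pixel_sum / pixel_count
std = np.sqrt(pixel_sq_sum / pixel_count - mean ** 2)
print("per-band mean:", mean)
print("per-band std:", std)
```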

Thanks again for such a great paper and open sourcing your project :)

[1] https://github.com/arampacha/CLIP-rsicd

@BradNeuberg
Author

If the training data has been aggregated on your side and is easily accessible, I can do this myself, but I'm not sure I have access to all of these datasets.

@ChenDelong1999
Owner

Hi, actually in RemoteCLIP we use the default normalization mean/std for training, which can be found in: https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/constants.py#L1-L2
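For anyone following along, those defaults are the standard OpenAI CLIP statistics, which open_clip exposes directly (a quick check, with the values as in the linked file):

```python
# Default normalization used by RemoteCLIP: the standard OpenAI CLIP statistics.
from open_clip.constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD

print(OPENAI_DATASET_MEAN)  # (0.48145466, 0.4578275, 0.40821073)
print(OPENAI_DATASET_STD)   # (0.26862954, 0.26130258, 0.27577711)
```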

We haven't done any ablation on these mean/std values. Since the final performance is not bad, we don't see this as a major bottleneck; using the default parameters also keeps the input images in the domain that the CLIP vision encoder is familiar with.

But it will be interesting to dive deeper into that! Thank you for your suggestion, and we will investigate it further in future work.

@BradNeuberg
Author

Thanks! When working with RSICD CLIP [1] we actually found some interesting "rogue dimensions": even after normalization, some vector components dominated across all generated embeddings. We found that if we normalized these rogue dimensions we got much better class separability and performance on downstream tasks:

[Image: per-dimension embedding values before (left) and after (right) per-dimension standardization]

This is what rogue dimensions look like (on the left). A handful of the 512 dimensions have values that are routinely 2-10x larger than other dimensions. We can choose to treat these by calculating the mean and standard deviation for each dimension over a large enough batch of vectors and then rescaling those dimensions, giving the image on the right. In the right-hand case, each dimension has zero mean and unit standard deviation.

This has the property that when taking differences between vectors, the vector magnitudes are not dominated by the rogue dimensions. This can lead to cleaner cluster separation and better downstream performance on zero-shot tasks.

We suspect the following paper was the first to identify this problem: "All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality" [2]. They found that normalizing and removing these rogue dimensions is important in an applied setting.

We do this normalization using scikit-learn's StandardScaler [3]: we take the RSICD training dataset, compute image and text embeddings for it, and then compute batch statistics to get an appropriate mean and standard deviation for each dimension of the 512-dimensional embedding. This means we end up with 512-length mean and standard deviation vectors for image embeddings, and the same for text embeddings, which we can then use to normalize new embeddings and remove rogue dimensions.
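A rough sketch of that procedure (the embedding files below are placeholders; in practice they would come from running the CLIP image and text encoders over the RSICD training split):

```python
# Hypothetical sketch: fit per-dimension statistics on a batch of CLIP embeddings
# and use them to rescale new embeddings. Array shapes are (num_samples, 512).
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder files standing in for precomputed training-set embeddings.
train_image_embeddings = np.load("rsicd_train_image_embeddings.npy")
train_text_embeddings = np.load("rsicd_train_text_embeddings.npy")

image_scaler = StandardScaler().fit(train_image_embeddings)
text_scaler = StandardScaler().fit(train_text_embeddings)

# image_scaler.mean_ and image_scaler.scale_ are the 512-length statistics
# that could be shared and reused to standardize embeddings at inference time.
new_image_embedding = np.random.randn(1, 512)  # stand-in for a fresh embedding
standardized = image_scaler.transform(new_image_embedding)
```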

We suspect that RemoteCLIP will suffer from these same rogue dimensions, and be greatly helped by normalizing and removing them. Unfortunately, RemoteCLIP was trained on a very diverse set of data that we don't have access to.

If you all are able to, it would greatly help if you could randomly sample some subset of your training data, compute their image and text embeddings, and then provide these 512-length mean and standard deviation values for images and texts. Others have found this kind of normalization important. We are fine to do it ourselves but then would need access to your training corpus. You could also randomly sample some representative subset of your training images and text and put it in a cloud bucket of some kind and we could compute those statistics and provide them here.

[1] https://huggingface.co/flax-community/clip-rsicd-v2
[2] https://arxiv.org/abs/2109.04404
[3] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

@BradNeuberg BradNeuberg changed the title Mean/Standard deviation over training dataset? Mean/Standard deviation over training dataset to remove rogue embedding dimensions outside normal magnitudes? Mar 4, 2024
@ChenDelong1999
Owner

Hmm, that's very interesting, and thanks for your comments! 😀

I'm not sure: do you apply normalization to each individual embedding before evaluation? (e.g., https://github.com/ChenDelong1999/RemoteCLIP/blob/main/retrieval.py#L138-L139)
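To make sure we're comparing the same things, here is a small sketch contrasting the two operations under discussion (illustrative only; whether the linked retrieval.py lines do the first one is for the authors to confirm):

```python
# Sketch contrasting per-embedding L2 normalization with per-dimension
# standardization over a batch. Values here are random stand-ins.
import torch
import torch.nn.functional as F

embeddings = torch.randn(8, 512)  # stand-in batch of CLIP embeddings

# 1) Per-embedding L2 normalization: each row is scaled to unit length,
#    as is typical before computing cosine similarities.
l2_normalized = F.normalize(embeddings, dim=-1)

# 2) Per-dimension standardization: zero mean / unit std per column over the
#    batch, which is what suppresses rogue dimensions.
standardized = (embeddings - embeddings.mean(dim=0)) / embeddings.std(dim=0)
```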

For data samples, the publicly available RSICD, RSITMD, and UCM datasets have both images and captions, and they are used for training RemoteCLIP.

I guess using them can give a good enough estimate of the full RemoteCLIP training data, which I won't be able to release before paper acceptance :(

Btw, I am very curious how much this rogue dimension normalization can improve retrieval/zero-shot performance?

@BradNeuberg
Author

BradNeuberg commented Mar 5, 2024 via email

@ChenDelong1999
Owner

I think calculating these values is not very complicated; we could first try it ourselves and investigate further.

I am now super curious about which types of inputs make these rogue dimensions activate 🤔

Thank you very much for the discussion. Let us try it, and we will get back to you if we have some results!

@BradNeuberg
Author

BradNeuberg commented Mar 6, 2024 via email
