Why does CLIP always need softmax and not simple Cosine Similarity #310

Open
evergreenllc2020 opened this issue Jan 4, 2023 · 5 comments

Comments

@evergreenllc2020

I would like to use CLIP embeddings for text and images in Elasticsearch. It appears that CLIP always needs at least two text inputs for every image and applies a softmax over them. Is there a way to generate embeddings so that I can directly use simple cosine similarity between one text input and one image? Doing the softmax in Elasticsearch over two text inputs and one image embedding at run time is complicated and expensive. For other embeddings, like BERT, we can use cosine similarity directly.

Here is sample code. How can I avoid softmax at runtime and just use one text input per image?

import numpy as np
import torch

with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# L2-normalize the features so the dot product below is a cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
print(image_features.shape)
print(text_features.shape)

# scaled cosine similarities between the image and each text input
similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()
print(similarity)

# softmax over the text inputs
similarity = np.exp(similarity) / np.sum(np.exp(similarity), axis=0)

@Rijgersberg

No need to do the softmax unless you want to do classification. You can do comparisons by computing the cosine similarity between image_features and text_features directly.
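For a single image and a single text prompt, a minimal sketch of that direct comparison, reusing model, image_input, and text_inputs from the snippet above, could look like this:

import torch

with torch.no_grad():
    image_features = model.encode_image(image_input)   # shape: [1, d]
    text_features = model.encode_text(text_inputs)     # shape: [1, d] for a single prompt

# cosine similarity between the one image and the one text, no softmax involved
similarity = torch.cosine_similarity(image_features, text_features)
print(similarity.item())

The embeddings can then be stored one per document (e.g. in Elasticsearch) and compared with plain cosine similarity at query time.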

@fractaldna22

You don't need to do

image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

if you're using cosine_similarity. torch.cosine_similarity(x, y) already normalizes the inputs, by the nature of cosine similarity. Normalizing via x = x / x.norm() and then running the result through cosine_similarity would be dividing by the norm twice, which would seriously degrade the data and damage the accuracy.

@fractaldna22

They only do x /= x.norm() when they use

similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()

since that isn't actually cosine similarity, it's x @ y.T, so x and y need to be divided by their norms first. But if you use

similarity = (torch.cosine_similarity(image_features, text_features).view(-1, image_features.shape[0]).T.mean(1)).mean(0, True)

I'm pretty sure you'd get the similarity for one image and one text, and you can even use that as a loss in a backward pass if you first multiply that similarity by -1.
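A minimal sketch of that loss idea, assuming model, image_input, and text_inputs are defined as above and that gradients should flow into the encoders (so no torch.no_grad()):

import torch

image_features = model.encode_image(image_input)
text_features = model.encode_text(text_inputs)

# negate the cosine similarity so that maximizing similarity minimizes the loss
loss = -torch.cosine_similarity(image_features, text_features).mean()
loss.backward()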

@Externalhappy

Externalhappy commented Aug 25, 2023

So, if we calculate the similarity using

similarity = F.cosine_similarity(x, y)

without normalizing the image and text features, the computed cosine similarity will have the same output as

x = x / x.norm(dim=-1, keepdim=True)
y = y / y.norm(dim=-1, keepdim=True)
similarity = x @ y.T

Is that correct?
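A quick numerical check of that equivalence for a single pair, using random 512-dimensional stand-in features purely for illustration:

import torch
import torch.nn.functional as F

x = torch.randn(1, 512)   # stand-in image features
y = torch.randn(1, 512)   # stand-in text features

a = F.cosine_similarity(x, y)            # normalizes internally, shape [1]

x_n = x / x.norm(dim=-1, keepdim=True)
y_n = y / y.norm(dim=-1, keepdim=True)
b = (x_n @ y_n.T).reshape(-1)            # dot product of unit vectors, shape [1]

print(torch.allclose(a, b))              # True, up to floating point error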

@Dinosaurcubs

You don't need to do

image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

if you're using cosine_similarity. torch.cosine_similarity(x, y) already normalizes the inputs, by the nature of cosine similarity. Normalizing via x = x / x.norm() and then running the result through cosine_similarity would be dividing by the norm twice, which would seriously degrade the data and damage the accuracy.

If I just want to use the visual encoder to get the output visual features for downstream tasks, is it necessary to add image_features /= image_features.norm(dim=-1, keepdim=True)?
