Why does CLIP always need softmax and not simple Cosine Similarity #310
Comments
No need to do the softmax unless you want to do classification. You can do comparisons by computing the cosine similarity between the image and text features.
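For example, a minimal sketch of that comparison (random tensors stand in for real CLIP features, and 512 is just an assumed embedding width):

import torch

# stand-ins for one image embedding and one caption embedding
image_features = torch.randn(1, 512)
text_features = torch.randn(1, 512)

# cosine similarity along the feature dimension; no softmax involved
score = torch.cosine_similarity(image_features, text_features, dim=-1)
print(score)  # a single value in [-1, 1]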
You don't need to do image_features /= image_features.norm(dim=-1, keepdim=True) if you're using cosine_similarity. torch.cosine_similarity(x, y) already normalizes its inputs, by the nature of cosines. Normalizing via x = x / x.norm() and then running the result through cosine_similarity means dividing by the norm twice; that's harmless but redundant, since a unit vector divided by its norm (which is 1) is unchanged, so the manual normalization is simply unnecessary here.
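A quick sanity check of that invariance (a sketch with random tensors, nothing CLIP-specific):

import torch

x = torch.randn(1, 512)
y = torch.randn(1, 512)

# cosine_similarity divides by the norms internally, so pre-normalizing
# the inputs leaves the result unchanged
raw = torch.cosine_similarity(x, y, dim=-1)
pre = torch.cosine_similarity(x / x.norm(dim=-1, keepdim=True),
                              y / y.norm(dim=-1, keepdim=True), dim=-1)
print(torch.allclose(raw, pre))  # True, up to floating-point error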
They only do x /= x.norm() when they use similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy(), since that isn't actually cosine similarity; it's x @ y.T, so x and y need to be divided by their norms first. But if you use similarity = (torch.cosine_similarity(image_features, text_features).view(-1, image_features.shape[0]).T.mean(1)).mean(0, True), I'm pretty sure you'd get the similarity for one image and one text, and you can even use that as a loss in a backward pass if you first multiply the similarity by -1.
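A sketch of both points, the dot-product equivalence and the negated-cosine loss, again with random stand-in tensors:

import torch

img = torch.randn(4, 512)
txt = torch.randn(4, 512)

# dot product of unit vectors == pairwise cosine similarity
img_n = img / img.norm(dim=-1, keepdim=True)
txt_n = txt / txt.norm(dim=-1, keepdim=True)
dot = img_n @ txt_n.T  # (4, 4) similarity matrix

cos = torch.cosine_similarity(img.unsqueeze(1), txt.unsqueeze(0), dim=-1)
print(torch.allclose(dot, cos, atol=1e-6))  # True

# negated cosine similarity of matched pairs as a loss to minimize
loss = -torch.cosine_similarity(img, txt, dim=-1).mean()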
So, if we calculate the similarity using the torch.cosine_similarity expression above, is that correct?
If I just want to use the visual encoder to get the output visual features for downstream tasks, is it necessary to add image_features /= image_features.norm(dim=-1, keepdim=True)?
I would like to use CLIP embeddings for text and images in Elasticsearch. It seems that CLIP always needs at least two text inputs for every image and then applies a softmax over them. Is there a way to generate embeddings so that I can directly use simple cosine similarity between one text input and one image input? Running a softmax over two text inputs and one image embedding in Elasticsearch at query time is complicated and expensive. With other embeddings, like BERT's, we can use cosine similarity directly.
Here is sample code. How can I avoid softmax at runtime and just use one text input per image?
import numpy as np
import torch

# model, image_input and text_inputs come from the usual CLIP setup
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# L2-normalize so the dot product below is a cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
print(image_features.shape)
print(text_features.shape)

# 100 * cosine similarity between the image and each text input
similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()
print(similarity)

# softmax over the text inputs -- the step I want to avoid
similarity = np.exp(similarity) / np.sum(np.exp(similarity), axis=0)
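For what it's worth, a sketch of how the softmax can be dropped entirely: store L2-normalized embeddings, then a single image/text pair is compared with a plain dot product, which is exactly the cosine similarity (model, image_input and text_inputs as in the snippet above, with text_inputs holding just one caption):

import torch

with torch.no_grad():
    # encode once and normalize; these are the vectors you would index
    image_emb = model.encode_image(image_input)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = model.encode_text(text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# dot product of unit vectors == cosine similarity; no second text, no softmax
score = (image_emb @ text_emb.T).item()
print(score)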