Why does CLIP always need softmax and not simple Cosine Similarity #310
Comments
No need to do the softmax unless you want to do classification. You can do comparisons by computing the cosine similarity between the image and text features.
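For example, a minimal sketch of that comparison (random tensors stand in for real CLIP features, and 512 is just an assumed embedding width):

import torch

# stand-ins for one image embedding and one caption embedding
image_features = torch.randn(1, 512)
text_features = torch.randn(1, 512)

# cosine similarity along the feature dimension; no softmax involved
score = torch.cosine_similarity(image_features, text_features, dim=-1)
print(score)  # a single value in [-1, 1]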
You don't need to do image_features /= image_features.norm(dim=-1, keepdim=True) if you're using cosine_similarity. torch.cosine_similarity(x, y) already normalizes its inputs, by the nature of cosines. Normalizing via x = x / x.norm() and then running the result through cosine_similarity means dividing by the norm twice; that's harmless but redundant, since a unit vector divided by its norm (which is 1) is unchanged, so the manual normalization is simply unnecessary here.
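A quick sanity check of that invariance (a sketch with random tensors, nothing CLIP-specific):

import torch

x = torch.randn(1, 512)
y = torch.randn(1, 512)

# cosine_similarity divides by the norms internally, so pre-normalizing
# the inputs leaves the result unchanged
raw = torch.cosine_similarity(x, y, dim=-1)
pre = torch.cosine_similarity(x / x.norm(dim=-1, keepdim=True),
                              y / y.norm(dim=-1, keepdim=True), dim=-1)
print(torch.allclose(raw, pre))  # True, up to floating-point error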
They only do x /= x.norm() when they use similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy(), since that isn't actually cosine similarity; it's x @ y.T, so x and y need to be divided by their norms first. But if you use similarity = (torch.cosine_similarity(image_features, text_features).view(-1, image_features.shape[0]).T.mean(1)).mean(0, True), I'm pretty sure you'd get the similarity for one image and one text, and you can even use that as a loss in a backward pass if you first multiply the similarity by -1.
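A sketch of both points, the dot-product equivalence and the negated-cosine loss, again with random stand-in tensors:

import torch

img = torch.randn(4, 512)
txt = torch.randn(4, 512)

# dot product of unit vectors == pairwise cosine similarity
img_n = img / img.norm(dim=-1, keepdim=True)
txt_n = txt / txt.norm(dim=-1, keepdim=True)
dot = img_n @ txt_n.T  # (4, 4) similarity matrix

cos = torch.cosine_similarity(img.unsqueeze(1), txt.unsqueeze(0), dim=-1)
print(torch.allclose(dot, cos, atol=1e-6))  # True

# negated cosine similarity of matched pairs as a loss to minimize
loss = -torch.cosine_similarity(img, txt, dim=-1).mean()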
So, if we calculate the similarity using the torch.cosine_similarity expression above, is that correct?
If I just want to use the visual encoder to get the output visual features for downstream tasks, is it necessary to add image_features /= image_features.norm(dim=-1, keepdim=True)?
I would like to use CLIP embeddings for text and images in Elasticsearch. It seems that CLIP always needs at least two text inputs for every image and then applies a softmax over them. Is there a way to generate embeddings so that I can directly use simple cosine similarity between one text input and one image input? Running a softmax over two text inputs and one image embedding in Elasticsearch at query time is complicated and expensive. With other embeddings, like BERT's, we can use cosine similarity directly.
Here is sample code. How can I avoid softmax at runtime and just use one text input per image?
import numpy as np
import torch

# model, image_input and text_inputs come from the usual CLIP setup
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# L2-normalize so the dot product below is a cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
print(image_features.shape)
print(text_features.shape)

# 100 * cosine similarity between the image and each text input
similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()
print(similarity)

# softmax over the text inputs -- the step I want to avoid
similarity = np.exp(similarity) / np.sum(np.exp(similarity), axis=0)
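For what it's worth, a sketch of how the softmax can be dropped entirely: store L2-normalized embeddings, then a single image/text pair is compared with a plain dot product, which is exactly the cosine similarity (model, image_input and text_inputs as in the snippet above, with text_inputs holding just one caption):

import torch

with torch.no_grad():
    # encode once and normalize; these are the vectors you would index
    image_emb = model.encode_image(image_input)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = model.encode_text(text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# dot product of unit vectors == cosine similarity; no second text, no softmax
score = (image_emb @ text_emb.T).item()
print(score)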