
Image similarity? #1

Closed

youssefavx opened this issue Jan 5, 2021 · 26 comments

@youssefavx

Incredible work as always you guys! In looking at the Colab, it seems it's possible to do image-to-text similarity but I'm curious if it's possible to compare image similarity as well.

For instance, if I just replace 'text_features' with 'image_features' would that work / be the best way to do this?

image_features /= image_features.norm(dim=-1, keepdim=True)
image_2_features /= image_2_features.norm(dim=-1, keepdim=True)
similarity = image_2_features.cpu().numpy() @ image_features.cpu().numpy().T
@jongwook
Collaborator

jongwook commented Jan 6, 2021

Thanks!

I believe image similarities will work, although we haven't specifically experimented with or evaluated tasks that use image similarity measures.

@youssefavx
Author

Thank you! I'll try it out and see.

@rom1504
Contributor

rom1504 commented Jan 13, 2021

I tried it and it seems to work pretty well (actually better than using an EfficientNet pretrained on classification, for example).

@thoppe

thoppe commented Jan 13, 2021

I've also tried it and it works well. Once you have an image embedding and a sentence embedding, all that CLIP does is a dot product. This essentially means that both the image latents and the text latents are embedded in the same space (and that space is a good one!).
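
A minimal sketch of that idea with the official clip package (the file names below are placeholders): normalise each image embedding to unit length, and the dot product is the cosine similarity.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(path):
    # Encode one image and L2-normalise the embedding.
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image)
    return features / features.norm(dim=-1, keepdim=True)

# Cosine similarity is just the dot product of the unit-length embeddings.
similarity = (embed("image_a.jpg") @ embed("image_b.jpg").T).item()
print(similarity)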

@woctezuma

woctezuma commented Jan 13, 2021

I tried it and it seems to work pretty well (actually better than using an EfficientNet pretrained on classification, for example).

Just to be clear: when you say that it works well (and even "better" than another embedding), is that a feeling based on visual inspection of a few examples, or is it based on an image dataset you had built beforehand to assess the different embeddings? I would not have any issue with either evaluation method; I am asking out of curiosity (and for clarity).

By the way, I am reading your nice blog post: https://rom1504.medium.com/image-embeddings-ed1b194d113e

@rom1504
Contributor

rom1504 commented Jan 13, 2021

Visual inspection for now (putting a few thousand image embeddings from CLIP in a Faiss index and trying a few queries), but I also intend to do a more extensive evaluation later on.

I know there are also image retrieval research datasets and tasks where evaluating this model would give the most reliable and comparable results.
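
As a rough sketch of such an index, assuming the CLIP image embeddings have already been extracted, L2-normalised and stacked into a float32 NumPy array (the file name is a placeholder):

import numpy as np
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed

# (N, 512) float32 array of L2-normalised CLIP image embeddings.
embeddings = np.load("clip_image_embeddings.npy").astype("float32")

# With unit-length vectors, inner-product search is the same as cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Query with the first embedding and retrieve the 10 most similar images.
scores, indices = index.search(embeddings[:1], 10)
print(indices[0], scores[0])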

@woctezuma

woctezuma commented Jan 16, 2021

Results are interesting, but one has to be careful because the model likes to pick up on text in images.

For instance, here, I suspect that the images are considered similar because the word "story" appears.
The query is the top left picture.
In the second to last picture, there is the word "stonies".
It is only in the last picture that there is a visually similar planet shape in the same position as in the query.

Images with the word story

Similarly, here, it is the word "empire" which seems to drive the matching between these images.

Images with the word empire

Here, the word "forest". Etc.

Images with the word forest

It is both impressive... and a bit disappointing if I want to use the model for retrieving similar images.

@scorpionsky

Maybe that is because the words (e.g., "story") appear in the text.

@woctezuma

woctezuma commented Jan 19, 2021

Maybe that is because the words (e.g., "story") appear in the text.

There is zero text. I use the pre-trained model, and I only show the images to it.

By the way, if you want to see more results for yourself, you can refresh this page:
https://damp-brushlands-51855.herokuapp.com/render/
Be careful though, as these are Steam games, and some of them can be aimed at a more mature/adult audience.

If you find some interesting results, feel free to post about it here.

@scorpionsky

Thanks for the clarification. Then the model might have learned to attend to the embedded text regions during training. It is very interesting in both cases.

@woctezuma

I have added a search engine to look for specific Steam games:
https://woctezuma.github.io/steam-svelte-autocomplete/index.html

I am still having fun with this.

@thoppe

thoppe commented Jan 27, 2021

I made an app to explore how image similarity works in CLIP by tying it to the closest images in Unsplash. Like @woctezuma noted, it seems to latch onto images with matching text, but in general there is a lot more going on:

Screenshot from 2021-01-27 14-29-05

Screenshot from 2021-01-27 14-35-48

Screenshot from 2021-01-27 14-38-37

@htoyryla

htoyryla commented Feb 24, 2021

Does anyone have experience using CLIP image encodings to compare two images in a loss function (instead of a pixel loss like MSE, a perceptual loss like LPIPS, or a structural similarity loss like SSIM) to guide image generation (for instance, generating images that match both a text prompt and a given image) or, for example, VAE training? My experiments with this have not been successful: the loss either does not converge, or it does not result in anything like the given image. I wonder if it is the case that the CLIP image encoding is strictly semantic and does not contain enough visual features to properly guide training to produce images which resemble, for instance, the composition of a given image.

EDITED to clarify that comparing two image embeddings is meant.

@thoppe

thoppe commented Feb 24, 2021

@htoyryla there is an active community on Twitter trying to do just that! I found an interesting point about the text vectors vs. the image vectors -- they aren't colocated!

https://twitter.com/metasemantic/status/1356406256802607112

@htoyryla

htoyryla commented Feb 24, 2021

Thanks... my problem at the moment is, by the way, more like finding the image from its image encoding. If the loss converges, the resulting image is nowhere near the reference. I am leaning towards the assumption that the CLIP encoding is not suitable for this, but I've heard others say I should not put the blame on CLIP.
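
For reference, a hedged sketch of the kind of inversion experiment described here, optimising raw pixels against a target CLIP image embedding; the reference file name, learning rate and step count are made up, and the sketch skips CLIP's usual mean/std preprocessing:

import torch
import clip
from PIL import Image

device = "cpu"  # fp32 weights; on CUDA, call model.float() first
model, preprocess = clip.load("ViT-B/32", device=device)

# Target: the CLIP embedding of a reference image.
with torch.no_grad():
    reference = preprocess(Image.open("reference.jpg")).unsqueeze(0).to(device)
    target = model.encode_image(reference)
    target = target / target.norm(dim=-1, keepdim=True)

# Learnable image at the resolution ViT-B/32 expects.
x = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(500):
    features = model.encode_image(x)  # note: x is fed in without CLIP's normalisation
    features = features / features.norm(dim=-1, keepdim=True)
    loss = 1 - (features * target).sum()  # cosine distance to the target embedding
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    x.data.clamp_(0, 1)

In line with the observation above, the loss can keep decreasing while the pixels never come to resemble the reference visually.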

@woctezuma

woctezuma commented Feb 24, 2021

Does anyone have experience using CLIP image encodings in a loss function (instead of a pixel loss like MSE, a perceptual loss like LPIPS, or a structural similarity loss like SSIM) to guide image generation (for instance, generating images that match both a text prompt and a given image)

https://github.com/orpatashnik/StyleCLIP

Diagram

@htoyryla

htoyryla commented Feb 24, 2021

Oh, sorry, it looks like I formulated my question ambiguously. I am specifically looking for a case which uses CLIP to compare the similarity of two images, i.e. a loss calculated from two image embeddings instead of a more conventional image loss (MSE, LPIPS or SSIM), possibly together with CLIP to compare text with image.

The arrangement above uses CLIP in the normal way, to compare text and image embeddings.

In its simplest form, the problem can be formulated as: can you use CLIP to find the image if you have the image embedding?

A practical application I have done is to generate an image with a loss function that combines SSIM(generated_image, reference_image) and the CLIP cosine distance between the embeddings of the generated image and a prompt text, for structural or compositional control. Then somebody suggested using CLIP also for comparing the two images, and to me it looks like that does not work. Maybe cosine similarity between two image embeddings simply does not make a good loss function to guide image generation (while it is obvious that the cosine distance between a text and an image embedding does work).
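
For concreteness, a hedged sketch of that kind of combined objective, assuming the pytorch-msssim package for SSIM; the prompt, reference file name and loss weights are made-up placeholders rather than the exact setup described above:

import torch
import clip
from PIL import Image
from pytorch_msssim import ssim  # assumes the pytorch-msssim package
from torchvision.transforms.functional import to_tensor, resize

device = "cpu"  # fp32 weights; on CUDA, call model.float() first
model, _ = clip.load("ViT-B/32", device=device)

# CLIP's normalisation constants, applied manually to keep the pipeline differentiable.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def clip_embed(img01):
    # img01: (1, 3, 224, 224) tensor with values in [0, 1].
    features = model.encode_image((img01 - CLIP_MEAN) / CLIP_STD)
    return features / features.norm(dim=-1, keepdim=True)

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(["a castle on a hill"]).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    reference = resize(to_tensor(Image.open("reference.jpg").convert("RGB")), [224, 224]).unsqueeze(0)
    reference_features = clip_embed(reference)

def combined_loss(generated, w_ssim=0.5, w_img=0.5):
    # generated: (1, 3, 224, 224) tensor in [0, 1] produced by the generator.
    generated_features = clip_embed(generated)
    text_loss = 1 - (generated_features @ text_features.T).squeeze()        # CLIP text guidance
    ssim_loss = 1 - ssim(generated, reference, data_range=1.0)              # structural/compositional term
    image_loss = 1 - (generated_features @ reference_features.T).squeeze()  # the contested image-image CLIP term
    return text_loss + w_ssim * ssim_loss + w_img * image_loss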

@woctezuma

woctezuma commented Mar 5, 2021

Relevant to some discussion earlier in the thread (and at least an interesting read):

@Sxela

Sxela commented Feb 17, 2022

Results are interesting, but one has to be careful because the model likes to pick up on text in images.

For instance, here, I suspect that the images are considered similar because the word "story" appears. The query is the top left picture. In the second to last picture, there is the word "stonies". It is only in the last picture that there is a visually similar planet shape in the same position as in the query.

Images with the word story

Similarly, here, it is the word "empire" which seems to drive the matching between these images.

Images with the word empire

Here, the word "forest". Etc.

Images with the word forest

It is both impressive... and a bit disappointing if I want to use the model for retrieving similar images.

You can actually remove text (via some text detector) from those images before feeding them to CLIP and see how it goes.
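
A rough sketch of that preprocessing idea, using the easyocr text detector to blank out detected text regions before the image reaches CLIP (the detector choice, confidence threshold and file name are illustrative assumptions, not a tested pipeline):

import easyocr  # assumes the easyocr package for text detection
from PIL import Image, ImageDraw

reader = easyocr.Reader(["en"])

def mask_text(path):
    # Cover detected text regions with grey boxes so CLIP cannot match on the words.
    image = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for box, text, confidence in reader.readtext(path):
        if confidence > 0.3:  # arbitrary threshold
            draw.polygon([tuple(point) for point in box], fill=(128, 128, 128))
    return image

masked = mask_text("screenshot.png")  # then feed preprocess(masked) into model.encode_image as usual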

@matrixgame2018

Hello, I just want to know: will the L2 norm and cosine similarity work in the image-to-image case?

@iremonur

iremonur commented Apr 12, 2022

Oh, sorry, it looks like I formulated my question ambiguously. I am specifically looking for a case which uses CLIP to compare the similarity of two images, i.e. a loss calculated from two image embeddings instead of a more conventional image loss (MSE, LPIPS or SSIM), possibly together with CLIP to compare text with image.

The arrangement above uses CLIP in the normal way, to compare text and image embeddings.

In its simplest form, the problem can be formulated as: can you use CLIP to find the image if you have the image embedding?

A practical application I have done is to generate an image with a loss function that combines SSIM(generated_image, reference_image) and the CLIP cosine distance between the embeddings of the generated image and a prompt text, for structural or compositional control. Then somebody suggested using CLIP also for comparing the two images, and to me it looks like that does not work. Maybe cosine similarity between two image embeddings simply does not make a good loss function to guide image generation (while it is obvious that the cosine distance between a text and an image embedding does work).

Hi @htoyryla, I am working on the same thing as you: I aim to use the CLIP model for image retrieval. Basically, I am trying to retrieve images similar to a specific image that I feed into the CLIP model. Since my image pool comprises just highway images, I have done some transfer learning experiments with autonomous vehicle datasets (I re-trained the pre-trained CLIP model on an autonomous vehicle dataset), but this does not work well.
Have you made any progress?

@woctezuma

woctezuma commented Apr 13, 2022

Slightly relevant: this figure in the DALL-E 2 paper.

Surprisingly, the decoder still recovers Granny Smith apples even when the predicted probability for this label is near 0%.
We also find that our CLIP model is slightly less susceptible to the “pizza” attack than the models investigated in [20].

Picture

Reference [20] is the one which was linked in this post.

@FurkanGozukara

Can anyone post a fully working code sample (a Google Colab notebook would be amazing) that takes image A and image B and calculates their image similarity?

Thank you so much

@woctezuma @iremonur @matrixgame2018 @Sxela @htoyryla @thoppe @scorpionsky @rom1504 @youssefavx @jongwook

@woctezuma

woctezuma commented Mar 11, 2023

First, you extract features as shown in the README:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)

Then you compute similarity between features, e.g. as in the original post or as in the README:

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

@FurkanGozukara

First, you extract features as shown in the README:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)

Then you compute similarity between features, e.g. as in the original post or as in the README:

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

Thanks a lot for the answer.

I came up with the approach below; what do you think?

Also, do you think CLIP is the best available method at the moment for such a task?
Let's say I liked a jacket on a person and I want to find similar jackets in my database.

import torch
from transformers import CLIPImageProcessor, CLIPModel
from PIL import Image

# Load the CLIP model
model_ID = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(model_ID)

preprocess = CLIPImageProcessor.from_pretrained(model_ID)

# Define a function to load an image and preprocess it for CLIP
def load_and_preprocess_image(image_path):
    # Load the image from the specified path
    image = Image.open(image_path)

    # Apply the CLIP preprocessing to the image
    image = preprocess(image, return_tensors="pt")

    # Return the preprocessed image
    return image

# Load the two images and preprocess them for CLIP
image_a = load_and_preprocess_image('/content/a.JPG')["pixel_values"]
image_b = load_and_preprocess_image('/content/e.png')["pixel_values"]

# Calculate the embeddings for the images using the CLIP model
with torch.no_grad():
    embedding_a = model.get_image_features(image_a)
    embedding_b = model.get_image_features(image_b)

# Calculate the cosine similarity between the embeddings
similarity_score = torch.nn.functional.cosine_similarity(embedding_a, embedding_b)

# Print the similarity score
print('Similarity score:', similarity_score.item())

@woctezuma

woctezuma commented Mar 12, 2023

Sorry, I don't have any authority on this subject. I cannot help more or recommend one method over another. 😅

That being said, out of curiosity, I would probably try X-Decoder. I am not sure if one can extract "image features" with this.

Zou, Xueyan, et al. "Generalized Decoding for Pixel, Image, and Language." arXiv preprint arXiv:2212.11270 (2022).

If you have more questions, it would be better to create a new "issue", so that you get more visibility and people are not pinged.
