
Image similarity? #1

Closed

youssefavx opened this issue Jan 5, 2021 · 26 comments

@youssefavx

Incredible work as always you guys! In looking at the Colab, it seems it's possible to do image-to-text similarity but I'm curious if it's possible to compare image similarity as well.

For instance, if I just replace 'text_features' with 'image_features' would that work / be the best way to do this?

image_features /= image_features.norm(dim=-1, keepdim=True)
image_2_features /= image_2_features.norm(dim=-1, keepdim=True)
similarity = image_2_features.cpu().numpy() @ image_features.cpu().numpy().T
@jongwook
Collaborator

jongwook commented Jan 6, 2021

Thanks!

I believe image similarities will work, although we haven't specifically experimented with or evaluated tasks that use image similarity measures.

@youssefavx
Author

Thank you! I'll try it out and see.

@rom1504
Contributor

rom1504 commented Jan 13, 2021

I tried it and it seems to work pretty well (actually better than using an EfficientNet pretrained on classification, for example).

@thoppe

thoppe commented Jan 13, 2021

I've also tried it and it works well. Once you have an image embedding and a sentence embedding, all that CLIP does is a dot product. This essentially means that both the image latents and the text latents are embedded in the same space (and that space is a good one!).
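
A minimal sketch of that idea with the official clip package (the file names below are placeholders): normalise each image embedding to unit length, and the dot product is the cosine similarity.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(path):
    # Encode one image and L2-normalise the embedding.
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image)
    return features / features.norm(dim=-1, keepdim=True)

# Cosine similarity is just the dot product of the unit-length embeddings.
similarity = (embed("image_a.jpg") @ embed("image_b.jpg").T).item()
print(similarity)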

@woctezuma

woctezuma commented Jan 13, 2021

I tried it and it seems to work pretty well (actually better than using an EfficientNet pretrained on classification, for example).

Just to be clear: when you say that it works well (and even "better" than another embedding), is that a feeling based on visual inspection of a few examples, or is it based on an image dataset you had built beforehand to assess the different embeddings? I would not have any issue with either evaluation method; I am asking out of curiosity (and for clarity).

By the way, I am reading your nice blog post: https://rom1504.medium.com/image-embeddings-ed1b194d113e

@rom1504
Contributor

rom1504 commented Jan 13, 2021

Visual inspection for now (putting a few thousand image embeddings from CLIP in a Faiss index and trying a few queries), but I also intend to do a more extensive evaluation later on.

I know there are also image retrieval research datasets and tasks where evaluating this model would give the most reliable and comparable results.
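
As a rough sketch of such an index, assuming the CLIP image embeddings have already been extracted, L2-normalised and stacked into a float32 NumPy array (the file name is a placeholder):

import numpy as np
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed

# (N, 512) float32 array of L2-normalised CLIP image embeddings.
embeddings = np.load("clip_image_embeddings.npy").astype("float32")

# With unit-length vectors, inner-product search is the same as cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Query with the first embedding and retrieve the 10 most similar images.
scores, indices = index.search(embeddings[:1], 10)
print(indices[0], scores[0])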

@woctezuma

woctezuma commented Jan 16, 2021

Results are interesting, but one has to be careful because the model likes to pick up on text in images.

For instance, here, I suspect that the images are considered similar because the word "story" appears.
The query is the top left picture.
In the second to last picture, there is the word "stonies".
It is only in the last picture that there is a visually similar planet shape in the same position as in the query.

Images with the word story

Similarly, here, it is the word "empire" which seems to drive the matching between these images.

Images with the word empire

Here, the word "forest". Etc.

Images with the word forest

It is both impressive... and a bit disappointing if I want to use the model for retrieving similar images.

@scorpionsky

Maybe that is because the words (e.g., "story") appear in the text.

@woctezuma

woctezuma commented Jan 19, 2021

Maybe that is because the words (e.g., "story") appear in the text.

There is zero text. I use the pre-trained model, and I only show the images to it.

By the way, if you want to see more results for yourself, you can refresh this page:
https://damp-brushlands-51855.herokuapp.com/render/
Be careful though, as these are Steam games, and some of them can be aimed at a more mature/adult audience.

If you find some interesting results, feel free to post about it here.

@scorpionsky

Thanks for the clarification. Then the model might have learned to attend to the embedded text regions during training. It is very interesting in both cases.

@woctezuma

I have added a search engine to look for specific Steam games:
https://woctezuma.github.io/steam-svelte-autocomplete/index.html

I am still having fun with this.

@thoppe

thoppe commented Jan 27, 2021

I made an app to explore how image similarity works in CLIP by tying it to the closest images in Unsplash. Like @woctezuma noted, it seems to latch onto images with matching text, but in general there is a lot more going on:

Screenshot from 2021-01-27 14-29-05

Screenshot from 2021-01-27 14-35-48

Screenshot from 2021-01-27 14-38-37

@htoyryla

htoyryla commented Feb 24, 2021

Does anyone have experience using CLIP image encodings to compare two images in a loss function (instead of a pixel loss like MSE, a perceptual loss like LPIPS, or a structural similarity loss like SSIM) to guide image generation (for instance, generating images that match both a text prompt and a given image) or, for example, VAE training? My experiments with this have not been successful: the loss either does not converge, or it does not result in anything like the given image. I wonder if it is the case that the CLIP image encoding is strictly semantic and does not contain enough visual features to properly guide training to produce images which resemble, for instance, the composition of a given image.

EDITED to clarify that comparing two image embeddings is meant.

@thoppe

thoppe commented Feb 24, 2021

@htoyryla there is an active community on Twitter trying to do just that! I found an interesting point about the text vectors vs. the image vectors -- they aren't colocated!

https://twitter.com/metasemantic/status/1356406256802607112

@htoyryla

htoyryla commented Feb 24, 2021

Thanks... my problem at the moment is, by the way, more like finding the image from its image encoding. If the loss converges, the resulting image is nowhere near the reference. I am leaning towards the assumption that the CLIP encoding is not suitable for this, but I've heard others say I should not put the blame on CLIP.
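
For reference, a hedged sketch of the kind of inversion experiment described here, optimising raw pixels against a target CLIP image embedding; the reference file name, learning rate and step count are made up, and the sketch skips CLIP's usual mean/std preprocessing:

import torch
import clip
from PIL import Image

device = "cpu"  # fp32 weights; on CUDA, call model.float() first
model, preprocess = clip.load("ViT-B/32", device=device)

# Target: the CLIP embedding of a reference image.
with torch.no_grad():
    reference = preprocess(Image.open("reference.jpg")).unsqueeze(0).to(device)
    target = model.encode_image(reference)
    target = target / target.norm(dim=-1, keepdim=True)

# Learnable image at the resolution ViT-B/32 expects.
x = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(500):
    features = model.encode_image(x)  # note: x is fed in without CLIP's normalisation
    features = features / features.norm(dim=-1, keepdim=True)
    loss = 1 - (features * target).sum()  # cosine distance to the target embedding
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    x.data.clamp_(0, 1)

In line with the observation above, the loss can keep decreasing while the pixels never come to resemble the reference visually.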

@woctezuma

woctezuma commented Feb 24, 2021

Does anyone have experience using CLIP image encodings in a loss function (instead of a pixel loss like MSE, a perceptual loss like LPIPS, or a structural similarity loss like SSIM) to guide image generation (for instance, generating images that match both a text prompt and a given image)

https://github.com/orpatashnik/StyleCLIP

Diagram

@htoyryla

htoyryla commented Feb 24, 2021

Oh, sorry, it looks like I formulated my question ambiguously. I am specifically looking for a case which uses CLIP to compare the similarity of two images, i.e. a loss calculated from two image embeddings instead of a more conventional image loss (MSE, LPIPS or SSIM), possibly together with CLIP to compare text with image.

The arrangement above uses CLIP in the normal way, to compare text and image embeddings.

In its simplest form, the problem can be formulated as: can you use CLIP to find the image if you have the image embedding?

A practical application I have done is to generate an image with a loss function that combines SSIM(generated_image, reference_image) and the CLIP cosine distance between the embeddings of the generated image and a prompt text, for structural or compositional control. Then somebody suggested using CLIP also for comparing the two images, and to me it looks like that does not work. Maybe cosine similarity between two image embeddings simply does not make a good loss function to guide image generation (while it is obvious that the cosine distance between a text and an image embedding does work).
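
For concreteness, a hedged sketch of that kind of combined objective, assuming the pytorch-msssim package for SSIM; the prompt, reference file name and loss weights are made-up placeholders rather than the exact setup described above:

import torch
import clip
from PIL import Image
from pytorch_msssim import ssim  # assumes the pytorch-msssim package
from torchvision.transforms.functional import to_tensor, resize

device = "cpu"  # fp32 weights; on CUDA, call model.float() first
model, _ = clip.load("ViT-B/32", device=device)

# CLIP's normalisation constants, applied manually to keep the pipeline differentiable.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def clip_embed(img01):
    # img01: (1, 3, 224, 224) tensor with values in [0, 1].
    features = model.encode_image((img01 - CLIP_MEAN) / CLIP_STD)
    return features / features.norm(dim=-1, keepdim=True)

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(["a castle on a hill"]).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    reference = resize(to_tensor(Image.open("reference.jpg").convert("RGB")), [224, 224]).unsqueeze(0)
    reference_features = clip_embed(reference)

def combined_loss(generated, w_ssim=0.5, w_img=0.5):
    # generated: (1, 3, 224, 224) tensor in [0, 1] produced by the generator.
    generated_features = clip_embed(generated)
    text_loss = 1 - (generated_features @ text_features.T).squeeze()        # CLIP text guidance
    ssim_loss = 1 - ssim(generated, reference, data_range=1.0)              # structural/compositional term
    image_loss = 1 - (generated_features @ reference_features.T).squeeze()  # the contested image-image CLIP term
    return text_loss + w_ssim * ssim_loss + w_img * image_loss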

@woctezuma

woctezuma commented Mar 5, 2021

Relevant to some discussion earlier in the thread (and at least an interesting read):

@Sxela

Sxela commented Feb 17, 2022

Results are interesting, but one has to be careful because the model likes to pick up on text in images.

For instance, here, I suspect that the images are considered similar because the word "story" appears. The query is the top left picture. In the second to last picture, there is the word "stonies". It is only in the last picture that there is a visually similar planet shape in the same position as in the query.

Images with the word story

Similarly, here, it is the word "empire" which seems to drive the matching between these images.

Images with the word empire

Here, the word "forest". Etc.

Images with the word forest

It is both impressive... and a bit disappointing if I want to use the model for retrieving similar images.

You can actually remove text (via some text detector) from those images before feeding them to CLIP and see how it goes.
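
A rough sketch of that preprocessing idea, using the easyocr text detector to blank out detected text regions before the image reaches CLIP (the detector choice, confidence threshold and file name are illustrative assumptions, not a tested pipeline):

import easyocr  # assumes the easyocr package for text detection
from PIL import Image, ImageDraw

reader = easyocr.Reader(["en"])

def mask_text(path):
    # Cover detected text regions with grey boxes so CLIP cannot match on the words.
    image = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for box, text, confidence in reader.readtext(path):
        if confidence > 0.3:  # arbitrary threshold
            draw.polygon([tuple(point) for point in box], fill=(128, 128, 128))
    return image

masked = mask_text("screenshot.png")  # then feed preprocess(masked) into model.encode_image as usual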

@matrixgame2018

Hello, I just want to know: will the L2 norm and cosine similarity work in the image-to-image case?

@iremonur

iremonur commented Apr 12, 2022

Oh, sorry, it looks like I formulated my question ambiguously. I am specifically looking for a case which uses CLIP to compare the similarity of two images, i.e. a loss calculated from two image embeddings instead of a more conventional image loss (MSE, LPIPS or SSIM), possibly together with CLIP to compare text with image.

The arrangement above uses CLIP in the normal way, to compare text and image embeddings.

In its simplest form, the problem can be formulated as: can you use CLIP to find the image if you have the image embedding?

A practical application I have done is to generate an image with a loss function that combines SSIM(generated_image, reference_image) and the CLIP cosine distance between the embeddings of the generated image and a prompt text, for structural or compositional control. Then somebody suggested using CLIP also for comparing the two images, and to me it looks like that does not work. Maybe cosine similarity between two image embeddings simply does not make a good loss function to guide image generation (while it is obvious that the cosine distance between a text and an image embedding does work).

Hi @htoyryla, I am working on the same thing as you: I aim to use the CLIP model for image retrieval. Basically, I am trying to retrieve images similar to a specific image that I feed into the CLIP model. Since my image pool comprises just highway images, I have done some transfer learning experiments with autonomous vehicle datasets (I re-trained the pre-trained CLIP model on an autonomous vehicle dataset), but this does not work well.
Have you made any progress?

@woctezuma

woctezuma commented Apr 13, 2022

Slightly relevant: this figure in the DALL-E 2 paper.

Surprisingly, the decoder still recovers Granny Smith apples even when the predicted probability for this label is near 0%.
We also find that our CLIP model is slightly less susceptible to the “pizza” attack than the models investigated in [20].

Picture

Reference [20] is the one which was linked in this post.

@FurkanGozukara

Can anyone post a fully working code sample (a Google Colab notebook would be amazing) that takes image A and image B and calculates their image similarity?

Thank you so much

@woctezuma @iremonur @matrixgame2018 @Sxela @htoyryla @thoppe @scorpionsky @rom1504 @youssefavx @jongwook

@woctezuma

woctezuma commented Mar 11, 2023

First, you extract features as shown in the README:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)

Then you compute similarity between features, e.g. as in the original post or as in the README:

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

@FurkanGozukara

First, you extract features as shown in the README:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)

Then you compute similarity between features, e.g. as in the original post or as in the README:

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

Thanks a lot for the answer.

I came up with the approach below; what do you think?

Also, do you think CLIP is the best available method at the moment for such a task?
Let's say I liked a jacket on a person and I want to find similar jackets in my database.

import torch
from transformers import CLIPImageProcessor, CLIPModel
from PIL import Image

# Load the CLIP model
model_ID = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(model_ID)

preprocess = CLIPImageProcessor.from_pretrained(model_ID)

# Define a function to load an image and preprocess it for CLIP
def load_and_preprocess_image(image_path):
    # Load the image from the specified path
    image = Image.open(image_path)

    # Apply the CLIP preprocessing to the image
    image = preprocess(image, return_tensors="pt")

    # Return the preprocessed image
    return image

# Load the two images and preprocess them for CLIP
image_a = load_and_preprocess_image('/content/a.JPG')["pixel_values"]
image_b = load_and_preprocess_image('/content/e.png')["pixel_values"]

# Calculate the embeddings for the images using the CLIP model
with torch.no_grad():
    embedding_a = model.get_image_features(image_a)
    embedding_b = model.get_image_features(image_b)

# Calculate the cosine similarity between the embeddings
similarity_score = torch.nn.functional.cosine_similarity(embedding_a, embedding_b)

# Print the similarity score
print('Similarity score:', similarity_score.item())

@woctezuma

woctezuma commented Mar 12, 2023

Sorry, I don't have any authority on this subject. I cannot help more or recommend one method over another. 😅

That being said, out of curiosity, I would probably try X-Decoder. I am not sure if one can extract "image features" with this.

Zou, Xueyan, et al. "Generalized Decoding for Pixel, Image, and Language." arXiv preprint arXiv:2212.11270 (2022).

If you have more questions, it would be better to create a new "issue", so that you get more visibility and people are not pinged.
