# Use CLIP embeddings to find similar images and to cluster images.

- Minor AAI - Hogeschool van Amsterdam
- Instructors: Michiel Bontenbal & Maarten Post
- Friday 6 September 2024

In this notebook you'll learn about two applications of cosine similarity: finding similar images and clustering similar images.

First, we'll build an understanding of how the maths behind cosine similarity works, using a basic example.
Next, we'll calculate the similarity between two images.
Third, we'll find similar images in a set of images.
And finally, we'll cluster images based on their similarity.

### Contents
1. Intro:  calculate the cosine similarity (nl: cosinusgelijkenis)
2. Calculate cosine similarity for 2 images
3. Find similar images with cosine similarity
4. Clustering images with cosine similarity and KMeans clustering

### Sources
- https://nl.wikipedia.org/wiki/Cosinusgelijkenis
- https://openai.com/research/clip
- https://medium.com/@jeremy-k/unlocking-openai-clip-part-2-image-similarity-bf0224ab5bb0
- https://www.geeksforgeeks.org/how-to-calculate-cosine-similarity-in-python/
- https://www.geeksforgeeks.org/python-measure-similarity-between-two-sentences-using-cosine-similarity/

## 1. Intro:  calculate the cosine similarity (nl: cosinusgelijkenis)

To calculate the cosine similarity we take two vectors, A and B.

We then calculate their dot product and divide it by the product of their magnitudes (norms).

cos(θ) = (A · B) / (||A|| ||B||)
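
As a worked example of the formula, the sketch below evaluates each term for A = (4, 3) and B = (3, 4), the same values used for the two points in this section. The variable names are illustrative only.

```python
import math

# Example vectors (same values as the two points used in this section)
a = (4, 3)
b = (3, 4)

# Numerator: dot product A · B = 4*3 + 3*4 = 24
dot_product = sum(x * y for x, y in zip(a, b))

# Denominator: product of the magnitudes ||A|| * ||B|| = 5 * 5 = 25
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(x * x for x in b))

cos_theta = dot_product / (norm_a * norm_b)
angle_degrees = math.degrees(math.acos(cos_theta))

print(cos_theta)      # 0.96
print(angle_degrees)  # angle between the two vectors, in degrees
```

A value close to 1 means the two vectors point in almost the same direction; here the angle between them is a little over 16 degrees.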

The cells below apply this formula to two points.

In [None]:
#Define two points

# Choose values [1-10] (integers/floats) to properly plot it below

# point 1
p1x = 4
p1y = 3

#point 2
p2x = 3
p2y = 4

In [None]:
#function to calculate cosine similarity
def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)

cosine_similarity((p1x,p1y), (p2x,p2y))

In [None]:
#plot the points and create a vector from 0,0
%matplotlib inline
import matplotlib.pyplot as plt 

#get points and plot them
points_x = p1x, p2x
points_y = p1y, p2y
plt.plot(points_x, points_y, 'o') 

#plot vectors 
plt.quiver([0, 0], [0, 0], [p1x, p2x], [p1y,p2y], angles='xy', scale_units='xy', scale=1)

#plot axes and show it
plt.xlim(0, 10)
plt.ylim(0, 10)
plt.show()

### Exercise 1: 
1. Play with the points and see what happens.
2. Create two examples:
    - Where cosine similarity is 1
    - Where cosine similarity is 0

In [None]:
# Result cosine similarity is 1
# Both vectors point in the same direction; only their length differs

# point 1
p1x = 3
p1y = 4

#point 2
p2x = 6
p2y = 8

In [None]:
# Result cosine similarity is 0
# The vectors are perpendicular, so their dot product is 0

# point 1
p1x = 0
p1y = 3

#point 2
p2x = 3
p2y = 0

## 2. Calculate cosine similarity for 2 images

In [None]:
# Install the packages first. Comment these lines out if they are already installed.
%pip install torch torchvision
%pip install git+https://github.com/openai/CLIP.git

Here we will use two images and calculate how similar they are. 

To do so we'll create embeddings using CLIP. CLIP offers several models for creating embeddings, but here we'll use a standard Vision Transformer (ViT).

In [None]:
# import packages
from PIL import Image
import torch
import torch.nn as nn
import clip

#check device
device = "cuda" if torch.cuda.is_available() else "cpu"

#select embedding model
model, preprocess = clip.load("ViT-B/32", device=device)#we use a Vision Transformer (ViT) for the embeddings

We will download the image dataset from Hugging Face. Normally you would do this with the Hugging Face Datasets package, but for now it's easier to clone the dataset, because you can then inspect and display the images manually.

In [None]:
!git clone https://huggingface.co/datasets/MichielBontenbal/sim_search_mini

In [None]:
ls

### TO DO: Manually inspect the images

Open the folder 'sim_search_mini' and take a look at the images

In [None]:
#select two images
image1 = "./sim_search_mini/apple1.jpg"
image2= "./sim_search_mini/apple2.jpg"

In [None]:
# Preprocess the images and create their embeddings
# torch.no_grad() skips gradient tracking, which is not needed for inference
with torch.no_grad():
    image1_preprocess = preprocess(Image.open(image1)).unsqueeze(0).to(device)
    image1_features = model.encode_image(image1_preprocess)

    image2_preprocess = preprocess(Image.open(image2)).unsqueeze(0).to(device)
    image2_features = model.encode_image(image2_preprocess)

Inspect the embeddings you created.
Print an embedding, its number of dimensions and its datatype.

In [None]:
#YOUR CODE HERE
print(image1_features)
print(image1_features.ndim)
print(image1_features.dtype)

In [None]:
#calculate the cosine similarity
cos = torch.nn.CosineSimilarity(dim=0)

similarity = cos(image1_features[0],image2_features[0]).item()
similarity = (similarity+1)/2

print("Image similarity: ", similarity)
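
The cell above rescales the raw cosine similarity, which lies in the range [-1, 1], to the range [0, 1] with `(similarity + 1) / 2`. A small sketch of that mapping, using a hypothetical helper named `rescale` that is not part of the notebook:

```python
def rescale(cosine):
    """Map a cosine similarity from [-1, 1] to [0, 1]."""
    return (cosine + 1) / 2

# A few raw cosine values and their rescaled counterparts
for cosine in (-1.0, 0.0, 0.5, 1.0):
    print(cosine, "->", rescale(cosine))
```

Keep this in mind when comparing values: a rescaled score of 0.5 corresponds to a raw cosine similarity of 0.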

## Exercise 2: 
Calculate cosine similarity for images apple1.jpg and elephant1.jpg.  

In [None]:
#YOUR CODE HERE


### Reflect on what you just did:
- How high is the cosine similarity for apple/apple and apple/elephant?
- Explain the difference between apple/apple and apple/elephant.

## 3. Find similar images with cosine similarity

In [None]:
import torch
import clip
from PIL import Image
import os
import itertools
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

print(device)

dataset_folder = './sim_search_mini/'

images = []
for root, dirs, files in os.walk(dataset_folder):
    for file in files:
        # keep JPEG files only
        if file.endswith(('jpg','jpeg')):
            images.append(os.path.join(root, file))


#Embedding of the input image
original_image = './sim_search_mini/face3.jpg'
input_image = preprocess(Image.open(original_image)).unsqueeze(0).to(device)
input_image_features = model.encode_image(input_image)

# Create the cosine similarity module once, outside the loop
cos = torch.nn.CosineSimilarity(dim=0)

result = {}
for img in images:
    with torch.no_grad():
        image_preprocess = preprocess(Image.open(img)).unsqueeze(0).to(device)
        image_features = model.encode_image(image_preprocess)
        sim = cos(image_features[0], input_image_features[0]).item()
        # rescale from [-1, 1] to [0, 1]
        sim = (sim + 1) / 2
        result[img] = sim


sorted_value = sorted(result.items(), key=lambda x:x[1], reverse=True)
sorted_res = dict(sorted_value)

top_3 = dict(itertools.islice(sorted_res.items(), 3))

print(top_3)
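
The ranking step above (sorting the dictionary and slicing off the first three entries) can also be sketched with `heapq.nlargest` from the standard library. The file names and scores below are made up for illustration and are not results from the dataset.

```python
import heapq

# Made-up similarity scores, for illustration only
example_scores = {
    "query.jpg": 1.0,
    "a.jpg": 0.91,
    "b.jpg": 0.62,
    "c.jpg": 0.88,
    "d.jpg": 0.57,
}

# The three entries with the highest score, best first
top_pairs = heapq.nlargest(3, example_scores.items(), key=lambda item: item[1])
example_top_3 = dict(top_pairs)

print(example_top_3)
```

In this made-up example the query image ranks first with a score of 1.0, which is what happens when the query image is itself part of the searched folder.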

In [None]:
#display most similar images
# Import IPython's image class under an alias so it does not shadow PIL's Image
from IPython.display import Image as IPyImage
from IPython.display import display

# file paths of the three best matches, best first
first, second, third = list(top_3.keys())

#original image
img0 = IPyImage(original_image, width = 400)

#top 3
img1 = IPyImage(first, width = 400)
img2 = IPyImage(second, width = 400)
img3 = IPyImage(third, width = 400)

print("The original image is:")
display(img0)
# the best match is the original image itself, because it is part of the folder
print('The duplicate is: ')
display(img1)
print('And the most similar images are:')
display(img2, img3)

## Exercise 3:
Get the image that is least similar to the original image. 

In [None]:
# Select the least similar image and display it

#YOUR CODE HERE

## 4. Clustering images with cosine similarity and KMeans clustering.

In [None]:
#uncomment if necessary
#pip install scikit-learn

In [None]:
#code to ignore warnings
import warnings
warnings.simplefilter('ignore')

In [None]:
#Function to create the embeddings for the images
import torch
from PIL import Image

def get_image_embeddings(image1, image2):
    """Return the CLIP embeddings of two image files."""
    # Kept as globals because a later cell may read embedding1 and embedding2 directly
    global embedding1, embedding2
    with torch.no_grad():
        # Preprocess and encode the first image (on the same device as the model)
        image1_preprocess = preprocess(Image.open(image1)).unsqueeze(0).to(device)
        embedding1 = model.encode_image(image1_preprocess)

        # Preprocess and encode the second image
        image2_preprocess = preprocess(Image.open(image2)).unsqueeze(0).to(device)
        embedding2 = model.encode_image(image2_preprocess)

    return embedding1, embedding2

In [None]:

#Function to calculate cosine similarity
import torch

def calculate_image_similarity(embedding1, embedding2):
    """
    Calculate the cosine similarity between two sets of image features.
    """
    global similarity
    
    # Create a cosine similarity module
    cos = torch.nn.CosineSimilarity(dim=0)

    # Calculate the cosine similarity between the first features of each image
    similarity = cos(embedding1[0], embedding2[0]).item()

    # Scale the similarity to the range [0, 1]
    similarity = (similarity + 1) / 2
    #print(round(similarity,8))
    return similarity

In [None]:
#select the images you want to use
#images_list = ['./images/apple1.jpg', './images/apple2.jpg', './images/banana1.jpg', './images/banana2.jpg', './images/face1.jpg','./images/face2.jpg']

dataset_folder = './sim_search_mini/'

images_list = []
for root, dirs, files in os.walk(dataset_folder):
    for file in files:
        # keep JPEG files only
        if file.endswith(('jpg','jpeg')):
            images_list.append(os.path.join(root, file))
images_list

In [None]:
# Calculate the cosine similarity between every pair of images in the list.
# This may take two minutes or longer.
cos_sim_list = []
for image1 in images_list:
    for image2 in images_list:
        # Use the return values rather than relying on global variables
        embedding1, embedding2 = get_image_embeddings(image1, image2)
        similarity = calculate_image_similarity(embedding1, embedding2)
        cos_sim_list.append(similarity)

print(cos_sim_list)


In [None]:
#convert this list to a numpy array
import numpy as np
num_rows = len(images_list)
cosine_similarity_matrix = np.array(cos_sim_list).reshape(num_rows, -1)
cosine_similarity_matrix
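
The nested loop earlier in this section encodes both images again for every pair. One possible alternative, sketched here with random stand-in vectors instead of real CLIP embeddings, is to store one embedding per image as a row of a matrix, normalize each row to unit length, and obtain all pairwise cosine similarities with a single matrix product. The names `embeddings` and `example_matrix` and the shape (6 images, 512 values each) are assumptions for illustration.

```python
import numpy as np

# Stand-in data: 6 fake "embeddings" of 512 values each (not real CLIP output)
rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(6, 512))

# Normalize every row to unit length
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / norms

# For unit vectors the dot product equals the cosine similarity,
# so one matrix product yields the full pairwise matrix
example_matrix = unit_embeddings @ unit_embeddings.T

# Same rescaling to [0, 1] as used in this notebook
example_matrix = (example_matrix + 1) / 2

print(example_matrix.shape)  # (6, 6)
```

The result is symmetric with ones on the diagonal, because every image is identical to itself. With real data, each image would only need to be encoded once before this step.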

## Exercise 4
For KMeans you'll have to set the number of clusters manually.
Do so in the code below by replacing the question mark.
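
If you are unsure which number of clusters to pick, one common heuristic is to try several values and compare their silhouette scores. The sketch below does this on synthetic 2D points, not on the image data; the names `points`, `silhouette_by_k` and `best_k` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three tight groups of 2D points (not image data)
rng = np.random.default_rng(seed=0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Try several values of k and record the silhouette score for each
silhouette_by_k = {}
for k in range(2, 6):
    model_k = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_k = model_k.fit_predict(points)
    silhouette_by_k[k] = silhouette_score(points, labels_k)

# The k with the highest score is a reasonable first candidate
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k)
print("Highest silhouette score at k =", best_k)
```

A heuristic like this only suggests a starting point; always check the resulting clusters against the actual images.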


In [None]:
# Cluster the images with the K-Means algorithm and show the result using Matplotlib.
# Note: each image is represented by its row of the distance matrix,
# i.e. by its distances to all images, rather than by a point in 2 dimensions.
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Convert cosine similarity to distance (1 - similarity)
distance_matrix = 1 - cosine_similarity_matrix

# Choose the number of clusters (k)
num_clusters = ?

# Perform K-Means clustering
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(distance_matrix)

# Get cluster labels
labels = kmeans.labels_

# Print the cluster assignments
for i, label in enumerate(labels):
    print(f"Image {i} is in cluster {label}")

# Optional: visualize the clustering result
# For visualization only, plot the first two columns of the distance matrix:
# each image's distance to image 0 against its distance to image 1
plt.scatter(distance_matrix[:, 0], distance_matrix[:, 1], c=labels)
plt.title('Clustering of Images based on Cosine Similarity')
plt.xlabel('Distance to image 0')
plt.ylabel('Distance to image 1')
plt.show()

In [None]:
#create a result list and merge it with the images_list to check your result manually
result_list = []
for i, label in enumerate(labels):
    #print(f"Image {i} is in cluster {label}")
    result_list.append(label)
result_list

# Merge the lists into a dictionary
merged_dict = dict(zip(images_list, result_list))

# Print the resulting dictionary
print(merged_dict)

## Conclusion
We've successfully clustered a set of diverse images. To achieve this, we calculated the cosine similarity between each image and all the others. We then used a standard KMeans clustering algorithm to cluster the images and used Matplotlib to plot the results.

### Reflection questions

#### 1. Write down the steps you did to get to the clustering in your own words.


#### 2. Describe an embedding in your own words



#### 3. Hard question: how can we speed up the calculation of the cosine similarity?
Hint: check the datatype of the embedding. 



