A powerful **GPU** is highly recommended. You can use one for free by uploading this notebook to [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb).
<table align="center">
 <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ezponda/intro_deep_learning/blob/main/class/NLP/Image_search.ipynb">
        <img src="https://colab.research.google.com/img/colab_favicon_256px.png"  width="50" height="50" style="padding-bottom:5px;" />Run in Google Colab</a></td>
  <td align="center"><a target="_blank" href="https://github.com/ezponda/intro_deep_learning/blob/main/class/NLP/Image_search.ipynb">
        <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png"  width="50" height="50" style="padding-bottom:5px;" />View Source on GitHub</a></td>
</table>

# Image search

In this notebook, we'll introduce image search using Sentence Transformers, by mapping images and texts into the same vector space. This enables us to perform search and retrieval tasks for images based on textual descriptions.

To achieve this, we'll utilize the [CLIP (Contrastive Language-Image Pretraining)](https://openai.com/research/clip) model, which is designed to learn a joint embedding space for both images and texts.

CLIP is a model developed by OpenAI. It learns visual concepts from natural-language supervision, which lets it transfer to a wide range of tasks without task-specific training.

1. Multimodal Learning: CLIP is a multimodal model that can understand both images and text. It is pretrained on a large dataset containing pairs of images and their associated text captions, learning to associate visual concepts with natural language.

2. Contrastive Learning: CLIP learns by optimizing a contrastive objective. Within each training batch, it is trained to identify which image-caption pairs belong together, treating every other pairing in the batch as a negative example. By learning to score the correct image-text pairs higher than the incorrect ones, the model learns a useful representation for both modalities.

3. Architecture: CLIP uses a Transformer-based architecture for processing text and a Vision Transformer or ResNet architecture for processing images. The image and text encoders are jointly trained, allowing the model to align both modalities in a shared embedding space.
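
To make the contrastive objective in point 2 concrete, here is a minimal NumPy sketch of a CLIP-style symmetric loss on toy embeddings. It is an illustration under assumptions, not CLIP's actual training code: the function names, the fixed `temperature=0.07` and the random vectors standing in for encoder outputs are all chosen for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products equal cosine similarities
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of `image_emb` and row i of `text_emb` are assumed to be a matching pair;
    every other combination in the batch acts as a negative example.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # shape (n, n)
    diag = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

# Toy data: random vectors stand in for the outputs of the two encoders
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))  # each image close to its caption
shuffled_images = aligned_images[[1, 2, 3, 0]]              # every image paired with the wrong caption

loss_aligned = clip_style_loss(aligned_images, text_emb)
loss_shuffled = clip_style_loss(shuffled_images, text_emb)
print(f"aligned pairs:  {loss_aligned:.4f}")
print(f"shuffled pairs: {loss_shuffled:.4f}")
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss is low when each image is closest to its own caption and rises when the pairing is shuffled.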


In [None]:
# Install the sentence-transformers library
!pip install -U sentence-transformers

In [None]:
import sentence_transformers
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from PIL import Image
import glob
import pickle
import zipfile
import copy
from IPython.display import display
from IPython.display import Image as IPImage
import os
from tqdm.autonotebook import tqdm

In [None]:
# First, we load the respective CLIP model
model_name = 'clip-ViT-B-32'
model = SentenceTransformer(model_name)

In [None]:
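import requests
from io import BytesIO

def get_image_from_url(url):
    """Download the image at `url` and return it as a PIL image."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors instead of passing an error page to PIL
    img = Image.open(BytesIO(response.content))
    return img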

To search images, we first need a set of images.

In [None]:
img_url_path = 'https://github.com/ezponda/intro_deep_learning/raw/main/images/'
img_urls = [
    f'{img_url_path}eiffel_tower.jpeg',
    f'{img_url_path}taj_mahal.jpeg',
    f'{img_url_path}colosseum.jpeg',
    f'{img_url_path}great_wall_of_china.jpeg',
    f'{img_url_path}statue_of_liberty.jpeg',
]
images = [get_image_from_url(url) for url in img_urls]

print('Sample images: ')
for url, image in zip(img_urls, images):
    print('_'*50)
    print(f'url: {url}')
    display(image)

In [None]:
img_embeddings = model.encode(images,
                       batch_size=128,
                       convert_to_tensor=True,
                       show_progress_bar=True)
img_embeddings = img_embeddings.cpu()
print(img_embeddings.shape)

Now, let's define a function to perform image search, given a query and a list of image embeddings.

In [None]:
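def image_search(query, model, img_embeddings, images, top_k=2):
    """Print the `top_k` images most similar to `query`, with their cosine similarity."""
    # Embed the query into the same vector space as the images
    query_embedding = model.encode([query])[0]
    similarities = cosine_similarity([query_embedding], img_embeddings)[0]
    # Never ask for more results than there are images
    top_k = min(top_k, len(similarities))
    # argpartition selects the top_k scores without a full sort; argsort then orders only those
    indexes = np.argpartition(similarities, -top_k)[-top_k:]
    indexes = indexes[np.argsort(-similarities[indexes])]
    print(f"Input query: {query}")
    print()
    for ind, sim in zip(indexes.tolist(), similarities[indexes].tolist()):
        print('_'*50)
        print(sim)
        display(images[ind])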

In [None]:
image_search('A building in Paris', model, img_embeddings, images, top_k=2)

In [None]:
image_search('Find me an image of a famous monument in India', model, img_embeddings, images, top_k=2)

In [None]:
image_search('A building in China', model, img_embeddings, images, top_k=2)
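
The ranking step used in `image_search` can be tried out on toy vectors without loading a model. The following sketch assumes hand-made 2-D vectors in place of CLIP embeddings; `top_k_by_cosine` is a hypothetical helper written for this illustration, not part of Sentence Transformers.

```python
import numpy as np

def top_k_by_cosine(query_vec, candidate_vecs, top_k=2):
    """Return (indices, scores) of the top_k candidates by cosine similarity, best first."""
    query_vec = np.asarray(query_vec, dtype=float)
    candidate_vecs = np.asarray(candidate_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector lengths
    sims = candidate_vecs @ query_vec / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top_k = min(top_k, len(sims))
    # argpartition moves the top_k largest scores to the end, in no particular order
    idx = np.argpartition(sims, -top_k)[-top_k:]
    # Sort only those top_k entries, highest similarity first
    idx = idx[np.argsort(-sims[idx])]
    return idx, sims[idx]

# Four toy "image embeddings" and one toy "query embedding"
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.2])

idx, scores = top_k_by_cosine(query, candidates, top_k=2)
print(idx, scores)
```

The query points almost along the first axis, so candidate 0 ranks first and the diagonal vector (candidate 2) second, while the vector pointing the opposite way gets a negative score.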

## Unsplash subset dataset

The [Unsplash dataset](https://unsplash.com/data) is an openly shared, collaborative collection of images.

In [None]:
# Next, we get about 25k images from Unsplash
img_folder = './photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)

    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):   # Download the dataset if it does not exist
        util.http_get('http://sbert.net/datasets/'+photo_filename, photo_filename)

    # Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)

In [None]:
# Now, we need to compute the embeddings.
# To speed things up, we distribute pre-computed embeddings.
# Otherwise, you can also encode the images yourself.
# To encode an image, you can use the following code:
# from PIL import Image
# img_emb = model.encode(Image.open(filepath))
def read_image_from_path(file_path):
    img = Image.open(file_path)
    return img

use_precomputed_embeddings = True

if use_precomputed_embeddings: 
    emb_filename = 'unsplash-25k-photos-embeddings.pkl'
    if not os.path.exists(emb_filename):   # Download the embeddings if they do not exist
        util.http_get('http://sbert.net/datasets/'+emb_filename, emb_filename)
        
    with open(emb_filename, 'rb') as fIn:
        img_names, img_embeddings = pickle.load(fIn)  
    

    print("Images:", len(img_names))
else:
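    # Keep bare file names (not full paths): image_search_from_path joins them with img_folder
    img_paths = glob.glob(os.path.join(img_folder, '*.jpg'))[:5_000]
    img_names = [os.path.basename(path) for path in img_paths]
    print("Images:", len(img_names))
    images = [read_image_from_path(path) for path in img_paths]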
    img_embeddings = model.encode(images, batch_size=128, convert_to_tensor=True, show_progress_bar=True)
    img_embeddings = img_embeddings.cpu()
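
If you encode the images yourself, opening thousands of files at once can use a lot of memory. One possible alternative, sketched here under assumptions, is to load and encode the images in chunks. `encode_in_chunks`, `load_fn` and `encode_fn` are hypothetical names introduced for this example; the stand-in functions below replace `Image.open` and `model.encode` so the sketch runs without the dataset.

```python
import numpy as np

def encode_in_chunks(items, load_fn, encode_fn, chunk_size=256):
    """Load and encode `items` chunk by chunk and return one stacked array.

    load_fn turns one item (e.g. a file name) into an input for the encoder;
    encode_fn maps a list of such inputs to a 2-D array of embeddings.
    """
    parts = []
    for start in range(0, len(items), chunk_size):
        # Only one chunk of loaded inputs is held in memory at a time
        chunk = [load_fn(item) for item in items[start:start + chunk_size]]
        parts.append(np.asarray(encode_fn(chunk)))
    if not parts:
        return np.empty((0, 0))
    return np.concatenate(parts, axis=0)

# Stand-ins so the sketch runs anywhere (assumptions, not the real loader or model)
fake_names = [f"photo_{i}.jpg" for i in range(10)]

def fake_load(name):
    return len(name)  # pretend the "image" is just the length of its file name

def fake_encode(batch):
    return np.array([[float(x), 1.0] for x in batch])  # pretend 2-D embeddings

embeddings = encode_in_chunks(fake_names, fake_load, fake_encode, chunk_size=4)
print(embeddings.shape)
```

In this notebook, `load_fn` could open `os.path.join(img_folder, name)` with PIL and `encode_fn` could wrap `model.encode`, which returns a NumPy array by default.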

In [None]:
def image_search_from_path(query, model, img_embeddings, img_folder, img_names, top_k=2):
    """Like image_search, but loads each result from disk. `query` may be text or a PIL image."""
    query_embedding = model.encode([query])[0]
    similarities = cosine_similarity([query_embedding], img_embeddings)[0]
    # Never ask for more results than there are images
    top_k = min(top_k, len(similarities))
    indexes = np.argpartition(similarities, -top_k)[-top_k:]
    indexes = indexes[np.argsort(-similarities[indexes])]
    print(f"Input query: {query}")
    print()
    for ind, sim in zip(indexes.tolist(), similarities[indexes].tolist()):
        print('_'*50)
        print(sim)
        path = os.path.join(img_folder, img_names[ind])
        # Copy the pixel data so the file is closed before the image is displayed
        with Image.open(path) as f:
            img = f.copy()
        display(img)

In [None]:
image_search_from_path('A building in Paris', model, img_embeddings, img_folder, img_names, top_k=2)

In [None]:
image_search_from_path('A building in China', model, img_embeddings, img_folder, img_names, top_k=2)

In [None]:
image_search_from_path('Two dogs playing in the snow', model, img_embeddings, img_folder, img_names, top_k=2)

## Image-to-Image Search
The same method also works for image-to-image search.

To do this, pass an image (for example, one returned by `get_image_from_url(url)`) as the query instead of a text string.

The search then returns the images most similar to the query image.

In [None]:
img = get_image_from_url(img_urls[0])
image_search_from_path(img, model, img_embeddings, img_folder, img_names, top_k=5)