<a href="https://colab.research.google.com/github/SaketMunda/clip-usage-101/blob/master/clip_interacting_with_unsplash_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CLIP Model on the Unsplash Dataset

This notebook illustrates how the pretrained CLIP neural network model can be used to search the [Unsplash](https://unsplash.com/data) dataset.

It uses examples from the [natural-language-image-search GitHub repo](https://github.com/haltakov/natural-language-image-search).

Inspired by the work of [Vladimir Haltakov](https://twitter.com/haltakov).

## Goal of this Colab

Using this notebook you can search for images from the Unsplash dataset using natural language search. The search is powered by OpenAI's [CLIP](https://openai.com/blog/clip/) neural network.

## Setup Environment

In this section we will set up the environment.

First we need to install CLIP and then make sure that we have torch 1.7.1 with CUDA support.
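
As a quick sanity check after installation, the sketch below reports which PyTorch version is installed and whether CUDA is visible. The helper name `describe_torch_environment` is hypothetical (it is not part of CLIP or PyTorch), and it returns a default report when `torch` is not installed.

```python
import importlib.util


def describe_torch_environment():
    """Return a small report about the local PyTorch install (hypothetical helper)."""
    report = {"torch_installed": False, "torch_version": None, "cuda_available": False}

    # find_spec returns None when the package cannot be found
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    return report


print(describe_torch_environment())
```

If `cuda_available` is `False` on Colab, the runtime type is probably set to CPU; the rest of the notebook still runs, only slower.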

In [1]:
!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
     |████████████████████████████████| 53 kB 2.0 MB/s
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-9vfe9ecs
  Running command git clone -q https://github.com/openai/CLIP.git /tmp/pip-req-build-9vfe9ecs
Building wheels for collected packages: clip
  Building wheel for clip (setup.py) ... done
  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369409 sha256=4344adce2905af51d944d79e268d9575c641afd14196c08371a47a161a11b9ab
  Stored in directory: /tmp/pip-ephem-wheel-cache-ww_bk11w/wheels/fd/b9/c3/5b4470e35ed76e174bff77c92f91da82098d5e35fd5bc8cdac
Successfully

We can now load the pretrained public CLIP model

In [3]:
import clip
import torch

# Load the open CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

100%|████████████████████████████████████████| 338M/338M [00:02<00:00, 148MiB/s]


## Download the precomputed data

In this section the precomputed feature vectors for all photos are downloaded.

To compare the photos from the Unsplash dataset to a text query, we need to compute the feature vector of each photo using CLIP. This is a time-consuming task, so for testing I am using the feature vectors that Vladimir precomputed and uploaded to his [GitHub repo](https://github.com/haltakov/natural-language-image-search/tree/main/unsplash-dataset).

The precomputed data comes in two files:
- `photo_ids.csv` - a list of the photo IDs for all images in the dataset. The photo ID can be used to get the actual photo from Unsplash
- `features.npy` - a matrix containing the precomputed 512 element feature vector for each photo in the dataset.
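
The two files are linked by position: row `i` of the feature matrix is assumed to belong to the `i`-th photo ID. A minimal sketch of that consistency check on small synthetic stand-ins follows; the helper `check_alignment` and the toy IDs are hypothetical and not part of the dataset.

```python
import numpy as np


def check_alignment(photo_ids, features, expected_dim=512):
    """Verify there is one feature row per photo ID and the expected vector size."""
    if features.ndim != 2:
        raise ValueError(f"expected a 2-D matrix, got {features.ndim} dimension(s)")
    if features.shape[0] != len(photo_ids):
        raise ValueError(f"{len(photo_ids)} IDs but {features.shape[0]} feature rows")
    if features.shape[1] != expected_dim:
        raise ValueError(f"expected {expected_dim} columns, got {features.shape[1]}")
    return True


# Toy stand-ins for the contents of photo_ids.csv and features.npy
toy_ids = ["id_a", "id_b", "id_c"]
toy_features = np.zeros((3, 512), dtype=np.float32)

print(check_alignment(toy_ids, toy_features))
```

Running the same check on the real `photo_ids` and `photo_features` after loading would catch a mismatched or truncated download early.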

In [4]:
from pathlib import Path

# Create a folder for the precomputed features (-p avoids an error if it already exists)
!mkdir -p unsplash-dataset

# Download from Github Releases
if not Path('unsplash-dataset/photo_ids.csv').exists():
  !wget https://github.com/haltakov/natural-language-image-search/releases/download/1.0.0/photo_ids.csv -O unsplash-dataset/photo_ids.csv

if not Path('unsplash-dataset/features.npy').exists():
  !wget https://github.com/haltakov/natural-language-image-search/releases/download/1.0.0/features.npy -O unsplash-dataset/features.npy

--2022-11-21 15:17:09--  https://github.com/haltakov/natural-language-image-search/releases/download/1.0.0/photo_ids.csv
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/330162907/ea59cda9-85ee-4657-9fb5-ddad20060ccb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221121%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221121T151709Z&X-Amz-Expires=300&X-Amz-Signature=e65749d8145f50e6f418f052f05a579ccb1ce898983bf3357f595ee92fbb0f30&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=330162907&response-content-disposition=attachment%3B%20filename%3Dphoto_ids.csv&response-content-type=application%2Foctet-stream [following]
--2022-11-21 15:17:09--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/330162907/ea59cda9-85ee-4657-9fb5-

After the files are downloaded, we need to load them using `pandas` and `numpy`.

In [5]:
import pandas as pd
import numpy as np

# Load the photo IDs
photo_ids = pd.read_csv("unsplash-dataset/photo_ids.csv")
photo_ids = list(photo_ids["photo_id"])

# Load the feature vectors
photo_features = np.load("unsplash-dataset/features.npy")

# Convert features to Tensors: Float32 on CPU and Float16 on GPU
if device=="cpu":
  photo_features = torch.from_numpy(photo_features).float().to(device)
else:
  photo_features = torch.from_numpy(photo_features).to(device)


# Print some statistics
print(f"Photos loaded : {len(photo_ids)}")

Photos loaded : 1981161


## Define Functions

Some important functions for processing the data are defined here.

### 1. `encode_search_query` 

This function takes a text description and encodes it into a feature vector using the CLIP model.

In [6]:
def encode_search_query(search_query):
  """
  Function takes a text description in `search_query` variable and 
  encodes it into a feature vector using the CLIP model.
  """
  with torch.no_grad():
    # Encode and normalize the search query using CLIP
    text_encoded = model.encode_text(clip.tokenize(search_query).to(device))
    text_encoded /= text_encoded.norm(dim=-1, keepdim=True)

  # Return the normalized feature vector
  return text_encoded
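
Dividing by the norm matters because, for unit-length vectors, a plain dot product equals the cosine similarity. The sketch below illustrates this with NumPy on made-up vectors; it shows the idea only and is not the CLIP code path. The helper name `l2_normalize` is hypothetical.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length, mirroring `x /= x.norm(dim=-1, keepdim=True)`."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / norms


# Made-up vectors standing in for a text feature and a photo feature
text_vec = np.array([[3.0, 4.0, 0.0]])
photo_vec = np.array([[1.0, 2.0, 2.0]])

# Cosine similarity computed directly from its definition
cosine = (text_vec @ photo_vec.T) / (
    np.linalg.norm(text_vec) * np.linalg.norm(photo_vec)
)

# Dot product of the normalized vectors
dot_of_unit = l2_normalize(text_vec) @ l2_normalize(photo_vec).T

print(np.allclose(cosine, dot_of_unit))
```

Normalizing once up front means every later comparison is a single matrix product, with no per-query division.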

### 2. `find_best_matches` 

This function compares the text feature vector to the feature vectors of all images and returns the IDs of the best matching photos.

In [7]:
def find_best_matches(text_features, photo_features, photo_ids, result_count=3):
  """
  Function compares the text feature vector to the feature vectors of all 
  images and finds the best matches. The function returns the IDs of the best 
  matching photos.  
  """

  # Compute the similarity between the search query and each photo using the Cosine similarity
  similarities = (photo_features @ text_features.T).squeeze(1)

  # Sort the photos by their similarity score
  best_photo_ids = (-similarities).argsort()

  # Return the photo IDs of the best matches
  return [photo_ids[i] for i in best_photo_ids[:result_count]]
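
Sorting every similarity score is simple, but only the top few results are needed. As a possible alternative (not what the function above does), a partial selection finds the best `result_count` entries without a full sort. Below is a NumPy sketch on made-up scores; the helper name `top_k_indices` is hypothetical. In PyTorch, `torch.topk` serves the same purpose.

```python
import numpy as np


def top_k_indices(similarities, k):
    """Return the indices of the k largest scores, ordered best first."""
    k = min(k, len(similarities))
    if k <= 0:
        return np.array([], dtype=np.intp)
    # argpartition places the k largest scores in the last k slots (unordered)
    candidates = np.argpartition(similarities, -k)[-k:]
    # Order only those k candidates by descending score
    return candidates[np.argsort(-similarities[candidates])]


# Made-up similarity scores for five photos
scores = np.array([0.12, 0.31, 0.05, 0.27, 0.40])
best = top_k_indices(scores, k=3)
print(best)
```

For a few million photos either approach is fast enough in practice; the partial selection mainly pays off as the collection grows.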

### 3. `display_photo`

This function displays a photo from Unsplash given its ID, along with a link to the original photo on Unsplash.

In [8]:
from IPython.display import HTML, Image, display

def display_photo(photo_id):
  # Get the URL of the photo resized to have a width of 320px
  photo_image_url = f"https://unsplash.com/photos/{photo_id}/download?w=320"

  # Display the photo
  display(Image(url=photo_image_url))

  # Display the attribution text
  display(HTML(f'Photo on <a target="_blank" href="https://unsplash.com/photos/{photo_id}">Unsplash</a>'))
  print()
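
The URL logic can be separated from the display call, which makes it easy to inspect without rendering anything. Here is a small sketch using the same URL patterns as `display_photo`; the helper name `unsplash_urls` and the example ID are made up for illustration.

```python
def unsplash_urls(photo_id, width=320):
    """Build the resized-image URL and the photo page URL for a photo ID."""
    page_url = f"https://unsplash.com/photos/{photo_id}"
    image_url = f"{page_url}/download?w={width}"
    return image_url, page_url


# "abc123" is a placeholder, not a real photo ID
image_url, page_url = unsplash_urls("abc123")
print(image_url)
print(page_url)
```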

Putting it all together in one function:

In [9]:
def search_unsplash(search_query, photo_features, photo_ids, result_count=3):
  # Encode the search query
  text_features = encode_search_query(search_query)

  # Find the best matches
  best_photo_ids = find_best_matches(text_features, photo_features, photo_ids, result_count)

  # Display the best photos
  for photo_id in best_photo_ids:
    display_photo(photo_id)

## Search Unsplash

Now we are ready to search the dataset using natural language. Let's try doing some queries.

### "Two dogs playing in the snow"

In [10]:
search_query = "Two dogs playing in the snow"

search_unsplash(search_query, photo_features, photo_ids, 3)










### "The feeling when your program finally works"

In [13]:
search_query = "The feeling when your program finally works"

search_unsplash(search_query, photo_features, photo_ids, result_count=3)










### "I want to eat something"

In [16]:
search_query = "I want to eat something"

search_unsplash(search_query, photo_features, photo_ids, result_count=3)










### "A girl in front of Tajmahal"


In [18]:
search_query = "A girl in front of Tajmahal"

search_unsplash(search_query, photo_features, photo_ids, result_count=5)
















### "A car in the woods"

In [19]:
search_query = "A car in the woods"

search_unsplash(search_query, photo_features, photo_ids, result_count=3)








