# Process the Unsplash dataset with CLIP

This notebook processes all the downloaded photos using OpenAI's [CLIP neural network](https://github.com/openai/CLIP). For each image we compute a feature vector of 512 floating-point numbers and store it in a file. These image feature vectors are later compared with text feature vectors.

This step will be significantly faster if you have a GPU, but it will also work on the CPU.

## Load the photos

Collect the paths of all photos from the folder where they were stored.

In [1]:
from pathlib import Path

# Set the path to the photos
dataset_version = "lite"  # Use "lite" or "full"
photos_path = Path("unsplash-dataset") / dataset_version / "photos"

# List all JPGs in the folder
photos_files = list(photos_path.glob("*.jpg"))

# Print some statistics
print(f"Photos found: {len(photos_files)}")

Photos found: 24995


## Load the CLIP net

Load the CLIP net and define the function that computes the feature vectors.

In [2]:
import clip
import torch
from PIL import Image

# Load the open CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Function that computes the feature vectors for a batch of images
def compute_clip_features(photos_batch):
    # Load all the photos from the files
    photos = [Image.open(photo_file) for photo_file in photos_batch]
    
    # Preprocess all photos
    photos_preprocessed = torch.stack([preprocess(photo) for photo in photos]).to(device)

    with torch.no_grad():
        # Encode the photos batch to compute the feature vectors and normalize them
        photos_features = model.encode_image(photos_preprocessed)
        photos_features /= photos_features.norm(dim=-1, keepdim=True)

    # Transfer the feature vectors back to the CPU and convert to numpy
    return photos_features.cpu().numpy()
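
Because the feature vectors are normalized to unit length, comparing them later reduces to a dot product, which equals the cosine similarity. The following is a minimal sketch of that comparison using synthetic vectors instead of real CLIP output. The helper names `normalize` and `rank_photos` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np

def normalize(vectors):
    """Scale each vector to unit L2 norm, as done for the photo features."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)

def rank_photos(photo_features, text_features, top_k=3):
    """Return the indices of the top_k photos most similar to one text vector."""
    # For unit-length vectors the dot product equals the cosine similarity
    similarities = photo_features @ text_features
    # Sort by descending similarity and keep the best top_k indices
    return np.argsort(-similarities)[:top_k]

# Synthetic stand-ins: 5 "photos" and 1 "query", 512 dimensions each
rng = np.random.default_rng(0)
photo_features = normalize(rng.normal(size=(5, 512)))

# Make the query a slightly perturbed copy of photo 2, so photo 2 should rank first
text_features = normalize(photo_features[2] + 0.01 * rng.normal(size=512))

best = rank_photos(photo_features, text_features)
print(best)
```

With real data, `photo_features` would be the array returned by `compute_clip_features`, and the query vector would come from a text encoder that is normalized in the same way.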

## Process all photos

Now we need to compute the features for all photos. We will do that in batches, because it is much more efficient. Tune the batch size so that one batch fits in your GPU memory. The processing on the GPU is fairly fast, so the bottleneck will probably be loading the photos from the disk.

In this step the feature vectors and the photo IDs of each batch are saved to their own files. This makes the whole process more robust: if the run is interrupted, batches that were already saved are skipped on the next run. We will merge the data later.
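
The batching arithmetic can be sketched on its own, without touching any photos. The example below uses the photo count printed earlier (24995) and the batch size of 16 set in the next cell. The generator name `batch_ranges` is a hypothetical helper introduced only for illustration.

```python
import math

def batch_ranges(total_items, batch_size):
    """Yield (batch_index, start, end) so that items[start:end] is one batch."""
    batches = math.ceil(total_items / batch_size)
    for i in range(batches):
        start = i * batch_size
        # The final batch may be shorter than batch_size
        end = min(start + batch_size, total_items)
        yield i, start, end

# Numbers taken from this notebook: 24995 photos, batch size 16
ranges = list(batch_ranges(24995, 16))
print(len(ranges))   # number of batches
print(ranges[-1])    # index range of the last batch
```

Rounding up with `math.ceil` ensures that the leftover photos in the last, shorter batch are not dropped.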

In [3]:
import math
import numpy as np
import pandas as pd

# Define the batch size so that it fits on your GPU. You can also do the processing on the CPU, but it will be slower.
batch_size = 16

# Path where the feature vectors will be stored (create the folder if it is missing)
features_path = Path("unsplash-dataset") / dataset_version / "features"
features_path.mkdir(parents=True, exist_ok=True)

# Compute how many batches are needed
batches = math.ceil(len(photos_files) / batch_size)

# Process each batch
for i in range(batches):
    print(f"Processing batch {i+1}/{batches}")

    batch_ids_path = features_path / f"{i:010d}.csv"
    batch_features_path = features_path / f"{i:010d}.npy"
    
    # Only do the processing if the batch wasn't processed yet
    if not batch_features_path.exists():
        try:
            # Select the photos for the current batch
            batch_files = photos_files[i*batch_size : (i+1)*batch_size]

            # Compute the features and save to a numpy file
            batch_features = compute_clip_features(batch_files)
            np.save(batch_features_path, batch_features)

            # Save the photo IDs to a CSV file
            photo_ids = [photo_file.name.split(".")[0] for photo_file in batch_files]
            photo_ids_data = pd.DataFrame(photo_ids, columns=['photo_id'])
            photo_ids_data.to_csv(batch_ids_path, index=False)
        except Exception as e:
            # Catch problems with the processing to make the process more robust
            print(f'Problem with batch {i}: {e}')

Processing batch 1/1563
Processing batch 2/1563
Processing batch 3/1563
...
Processing batch 1561/1563
...

Merge the features and the photo IDs. The resulting files are `features.npy` and `photo_ids.csv`. Feel free to delete the intermediate results.

In [4]:
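import numpy as np
import pandas as pd

# Collect the per-batch files, skipping merged files left over from a previous run
features_files = sorted(f for f in features_path.glob("*.npy") if f.name != "features.npy")
ids_files = sorted(f for f in features_path.glob("*.csv") if f.name != "photo_ids.csv")

# Load all numpy files
features_list = [np.load(features_file) for features_file in features_files]

# Concatenate the features and store in a merged file
features = np.concatenate(features_list)
np.save(features_path / "features.npy", features)

# Load all the photo IDs and store them in a merged file
photo_ids = pd.concat([pd.read_csv(ids_file) for ids_file in ids_files])
photo_ids.to_csv(features_path / "photo_ids.csv", index=False)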
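
Before deleting the intermediate results, it may be worth checking that the merged features and photo IDs still line up row by row. The sketch below is one possible sanity check, not part of the original pipeline. The function name `check_alignment` and the tolerance are assumptions, and the demo runs on synthetic data.

```python
import numpy as np

def check_alignment(features, photo_ids, expected_dim=512):
    """Raise ValueError if the merged features and photo IDs are inconsistent."""
    if features.ndim != 2 or features.shape[1] != expected_dim:
        raise ValueError(f"Unexpected feature shape: {features.shape}")
    if features.shape[0] != len(photo_ids):
        raise ValueError(f"{features.shape[0]} feature rows but {len(photo_ids)} photo IDs")
    if len(set(photo_ids)) != len(photo_ids):
        raise ValueError("Duplicate photo IDs found")
    # The vectors were normalized before saving, so each norm should be close to 1
    norms = np.linalg.norm(features.astype(np.float32), axis=1)
    if not np.allclose(norms, 1.0, atol=1e-2):
        raise ValueError("Some feature vectors are not unit length")
    return True

# Synthetic demo: 4 unit-length vectors and 4 matching IDs
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(4, 512))
demo_features /= np.linalg.norm(demo_features, axis=1, keepdims=True)
demo_ids = ["a", "b", "c", "d"]

ok = check_alignment(demo_features, demo_ids)
print(ok)
```

On the real files, the inputs would be `np.load(features_path / "features.npy")` and `pd.read_csv(features_path / "photo_ids.csv")["photo_id"].tolist()`. A mismatch in the row count usually means that a batch saved its `.npy` file but not its `.csv` file.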