# Install requirements

This notebook depends on two libraries, Faiss and CLIP.

Faiss: Described in the [Faiss paper](https://arxiv.org/abs/2401.08281), Faiss is a library for efficient similarity search and clustering of dense vectors. It is optimized for searching quickly through millions or billions of high-dimensional vectors.

CLIP: A model designed to relate images and text by learning a joint embedding space, so that an image and a caption describing it map to nearby vectors.

In [1]:
!pip install faiss-cpu  git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-o6zmf4rx
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-o6zmf4rx
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting ftfy (from clip==1.0)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch->clip==1.0)
 

# Import libraries

In [2]:
import os
import clip
import torch
from PIL import Image
import numpy as np
import faiss
from tqdm import tqdm
import gradio as gr

# Load CLIP model

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


100%|███████████████████████████████████████| 338M/338M [00:08<00:00, 40.6MiB/s]


# Prepare image dataset
The image collection is the train folder of the COCO 2017 dataset.

In [5]:
image_folder = "/kaggle/input/2017-2017/train2017/train2017"
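# Sort the listing so the path order is reproducible across runs; a cached
# features file is only valid if it was built with the same ordering.
image_paths = sorted(os.path.join(image_folder, fname) for fname in os.listdir(image_folder) if fname.lower().endswith(('.png', '.jpg', '.jpeg')))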

# Extract features 
This block uses the CLIP model to extract a high-dimensional feature vector (embedding) for every image in the collection.
Each image is first passed through the `preprocess` function returned by `clip.load` earlier, which resizes and center-crops the image, converts it to RGB, and normalizes the pixel values so that the input has the format and dimensions the model expects. Each embedding is then L2-normalized, and the full array is cached in `image_features.npy` so that later runs can skip the extraction.

In [6]:
#Extract features 
features_path = "image_features.npy"
if os.path.exists(features_path):
    image_features = np.load(features_path)
else:
    image_features = []
    for path in tqdm(image_paths, desc="Extracting image features"):
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            feature = model.encode_image(image)
            feature /= feature.norm(dim=-1, keepdim=True)
            image_features.append(feature.cpu().numpy())
    image_features = np.concatenate(image_features, axis=0).astype("float32")
    np.save(features_path, image_features)

Extracting image features: 100%|██████████| 118287/118287 [51:55<00:00, 37.97it/s]
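
To make the normalization step of the preprocessing concrete, here is a minimal NumPy sketch. It assumes the per-channel mean and standard deviation used by the OpenAI CLIP preprocessing pipeline, and it leaves out the resize and center-crop steps. The helper name `normalize_for_clip` is hypothetical and is not part of the `clip` package.

```python
import numpy as np

# Per-channel statistics assumed to match the OpenAI CLIP preprocessing pipeline.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def normalize_for_clip(rgb_uint8):
    """Hypothetical helper: scale an HxWx3 uint8 RGB array to [0, 1],
    standardize each channel, and reorder to channels-first (3xHxW)."""
    x = rgb_uint8.astype(np.float32) / 255.0
    x = (x - CLIP_MEAN) / CLIP_STD      # broadcasts over the last (channel) axis
    return np.transpose(x, (2, 0, 1))   # HWC -> CHW, the layout PyTorch expects

# Stand-in for an image that has already been resized and cropped to 224x224.
dummy = np.zeros((224, 224, 3), dtype=np.uint8)
out = normalize_for_clip(dummy)
print(out.shape, out.dtype)
```

In the notebook itself none of this is needed, because `preprocess` already performs these steps and returns a tensor ready for `model.encode_image`.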


# FAISS index
This cell initializes a FAISS index (`IndexFlatIP`), which performs exact inner-product search, and populates it with all the CLIP features extracted from the image collection. Because the features are L2-normalized, the inner product equals cosine similarity.

In [7]:
#FAISS index
index = faiss.IndexFlatIP(image_features.shape[1])
index.add(image_features)
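
An inner-product index is enough for cosine similarity here because the two are the same quantity once the vectors have unit length. The small NumPy check below uses random vectors as stand-ins for CLIP embeddings (the dimension 512 matches ViT-B/32; the data is made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=512).astype(np.float32)
b = rng.normal(size=512).astype(np.float32)

# Cosine similarity computed directly from the unnormalized vectors.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product after L2-normalizing each vector, as the notebook does.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

print(abs(cosine - inner) < 1e-5)
```

If the stored features or the query were not normalized, `IndexFlatIP` would rank by raw inner product, which also rewards vectors with larger norms.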

# Search functions
After a query image or text is encoded by the corresponding CLIP encoder, the resulting embedding must be L2-normalized, just like the stored image embeddings, so that the inner-product search ranks results by cosine similarity.

In [8]:
#Search functions
def search_by_text(query, top_k=5):
    text = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(text)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    text_features = text_features.cpu().numpy().astype("float32")
    D, I = index.search(text_features, top_k)
    return [image_paths[i] for i in I[0]]

def search_by_image(query_image, top_k=5):
    image = preprocess(query_image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features_query = model.encode_image(image)
        image_features_query /= image_features_query.norm(dim=-1, keepdim=True)
    image_features_query = image_features_query.cpu().numpy().astype("float32")
    D, I = index.search(image_features_query, top_k)
    return [image_paths[i] for i in I[0]]
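
As a rough mental model of what `index.search` returns, the sketch below computes the same kind of `(D, I)` pair with plain NumPy: scores and row indices of the `top_k` largest inner products per query. It is a brute-force illustration on random data, not the Faiss implementation, and `brute_force_search` is a hypothetical name.

```python
import numpy as np

def brute_force_search(database, queries, top_k=5):
    """Hypothetical stand-in for an exact inner-product search.
    Returns (scores, indices), each of shape (n_queries, top_k),
    ordered from most to least similar."""
    scores = queries @ database.T                     # (n_queries, n_database)
    order = np.argsort(-scores, axis=1)[:, :top_k]    # indices of the largest scores
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 8)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)       # L2-normalize each row

query = db[42:43]                                     # reuse a stored vector as the query
D, I = brute_force_search(db, query, top_k=3)
print(I[0], D[0])
```

Because the query is itself a row of the database, the best match is that same row with a score close to 1. The search functions above use `I[0]` in the same way to look up entries of `image_paths`.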

# Gradio
This cell builds a Gradio-based visual search demo that displays the top 5 images most similar to a given text or image query. If both are provided, the text query takes precedence.

In [9]:
#Gradio 
def visual_search(text_query, image_query):
    if text_query:
        results = search_by_text(text_query)
    elif image_query is not None:
        results = search_by_image(image_query)
    else:
        return []
    return [Image.open(p) for p in results]

with gr.Blocks() as demo:
    gr.Markdown("# CLIP Visual Search Engine")
    with gr.Row():
        text_input = gr.Textbox(label="Text Query", placeholder="Describe the image you want to find...")
        image_input = gr.Image(type="pil", label="Or upload an image")
    output_gallery = gr.Gallery(label="Top Results", columns=5, height="auto")
    search_btn = gr.Button("Search")
    search_btn.click(
        fn=visual_search,
        inputs=[text_input, image_input],
        outputs=output_gallery
    )

demo.launch()

* Running on local URL:  http://127.0.0.1:7860
It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

* Running on public URL: https://7392b8c77154e90362.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


