# Install requirements

### Faiss & CLIP:

**Faiss:** According to the [Faiss paper](https://arxiv.org/abs/2401.08281), Faiss, which stands for "Facebook AI Similarity Search," is a powerful and efficient library for similarity search and indexing. It is optimized for searching quickly through millions or billions of high-dimensional vectors. Key features and characteristics of Faiss include:

1. Efficient Vector Search: Faiss is optimized for fast similarity search in large datasets of high-dimensional vectors. It provides both exact and approximate search algorithms, making it suitable for a wide range of use cases.
2. GPU Support: Faiss includes GPU support, allowing users to take advantage of the computational power of modern graphics processing units to accelerate similarity search operations.
3. Diverse Indexing Structures: Faiss provides a variety of indexing structures, including flat indexes, IVF (Inverted File) indexes, HNSW (Hierarchical Navigable Small World) indexes, and more, each tailored to specific data and performance requirements. 
4. Integration with Deep Learning: Faiss is often used together with deep learning models that encode text, images, and other data into high-dimensional vectors. Faiss can then index and search these vectors efficiently.
5. Wide Range of Applications: Faiss is used in various applications, including content-based recommendation systems, similarity-based search engines, image retrieval, and document clustering, to name a few.
6. Scalability: Faiss is designed to handle large datasets, making it suitable for both small-scale projects and large-scale production systems.


**CLIP:** The CLIP (Contrastive Language-Image Pre-training) model, developed by OpenAI, is a multi-modal vision and language model that maps images and text to the same latent space. Since we will use both image and text queries to search for images, we will use the CLIP model to embed our data. 


In [1]:
!pip install faiss-cpu  git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-75ffq4v6
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-75ffq4v6
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.0 kB)
Collecting ftfy (from clip==1.0)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->clip==1.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.1

# Import libraries

In [2]:
import os
import clip
import torch
from PIL import Image
import numpy as np
import faiss
from tqdm import tqdm
import gradio as gr

# Load CLIP model

The model uses a **ViT-B/32** Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
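The contrastive objective described above can be sketched in NumPy. This is a simplified illustration and not CLIP's training code: the function name `clip_style_contrastive_loss` and the fixed `logit_scale` value are assumptions made for the example. Matching (image, text) pairs sit on the diagonal of the similarity matrix, and a cross-entropy loss is applied in both directions.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_contrastive_loss(image_emb, text_emb, logit_scale=10.0):
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # logits[i, j] = scaled similarity between image i and text j.
    logits = logit_scale * image_emb @ text_emb.T
    diag = np.arange(logits.shape[0])
    # Cross-entropy with the matching pair as the target, in both directions.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
image_emb = text_emb + 0.01 * rng.normal(size=(4, 8))   # images aligned with their texts
aligned = clip_style_contrastive_loss(image_emb, text_emb)
shuffled = clip_style_contrastive_loss(image_emb, text_emb[::-1])  # pairs mismatched
print(aligned, shuffled)
```

The loss is lower when each image embedding points in the same direction as its own caption, which is what training pushes the two encoders towards.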

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


100%|███████████████████████████████████████| 338M/338M [00:04<00:00, 88.1MiB/s]


# Prepare image dataset
The image collection is the train folder of the COCO 2017 dataset.

In [4]:
image_folder = "/kaggle/input/coco25k/images"
# sorted() keeps the path order stable so it stays aligned with the cached feature rows
image_paths = [os.path.join(image_folder, fname) for fname in sorted(os.listdir(image_folder)) if fname.lower().endswith(('.png', '.jpg', '.jpeg'))]

# Extract features 
This block extracts a high-dimensional numerical feature vector (embedding) for every image in the collection using the CLIP model.

Each image is first passed through the `preprocess` function returned by `clip.load`. It puts the input into the format and dimensions the CLIP model expects, including resizing, normalization, and colour channel adjustment.
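To make these preprocessing steps concrete, here is a simplified NumPy stand-in. It is a sketch only: `simple_preprocess` is a hypothetical helper, it uses nearest-neighbour sampling where the real pipeline uses bicubic interpolation, and it assumes the input is already an RGB `uint8` array. The mean and standard deviation are the per-channel constants from the openai/CLIP repository.

```python
import numpy as np

# Per-channel statistics used by the openai/CLIP preprocessing pipeline.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def simple_preprocess(rgb, size=224):
    """rgb: uint8 array of shape (H, W, 3). Returns a float32 array (3, size, size)."""
    h, w, _ = rgb.shape
    # Resize so the shorter side equals `size` (nearest-neighbour sampling here).
    scale = size / min(h, w)
    new_h = max(size, round(h * scale))
    new_w = max(size, round(w * scale))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = rgb[rows][:, cols]
    # Center-crop to size x size.
    top, left = (new_h - size) // 2, (new_w - size) // 2
    cropped = resized[top:top + size, left:left + size]
    # Scale to [0, 1], normalize per channel, and move channels first (CHW).
    x = cropped.astype(np.float32) / 255.0
    x = (x - CLIP_MEAN) / CLIP_STD
    return x.transpose(2, 0, 1)

dummy = np.zeros((480, 640, 3), dtype=np.uint8)   # a black 640x480 image
out = simple_preprocess(dummy)
print(out.shape, out.dtype)
```

In the notebook itself, `preprocess` already does all of this and returns a tensor, so this sketch is only meant to show what happens to the pixels.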

In [5]:
#Extract features 
features_path = "image_features.npy"
if os.path.exists(features_path):
    image_features = np.load(features_path)
else:
    image_features = []
    for path in tqdm(image_paths, desc="Extracting image features"):
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        with torch.no_grad():
            feature = model.encode_image(image)
            feature /= feature.norm(dim=-1, keepdim=True)
            image_features.append(feature.cpu().numpy())
    image_features = np.concatenate(image_features, axis=0).astype("float32")
    np.save(features_path, image_features)

Extracting image features: 100%|██████████| 25000/25000 [08:43<00:00, 47.77it/s]


# FAISS index
This step initializes a FAISS index (`IndexFlatIP`) and populates it with all the CLIP features extracted from the image collection. Because the features are L2-normalized, the inner product computed by the index equals cosine similarity.

### IndexFlatIP:

A flat index stores your high-dimensional feature vectors exactly as they are. It is one of the simplest index structures: it does not compress, cluster, or otherwise transform the vectors fed into it, which is why it is called "flat".

Because there is no approximation or clustering of vectors, these indexes produce the most accurate results. Search quality is perfect, but this comes at the cost of longer search times.

With a flat index, the query vector xq is compared against every full-size vector in the index by calculating the distance or inner product to each. This is an exhaustive search.

After calculating all of these distances, the index returns the nearest k as our matches: a k-nearest neighbors (kNN) search.

For flat indexes, that is all we need to do. There is no training, because storing vectors without transformations or clustering leaves no parameters to optimize.
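The exhaustive search described above can be written out in a few lines of NumPy. This is a sketch of the idea rather than Faiss's implementation; `flat_ip_search` is a hypothetical name and the random vectors stand in for real embeddings.

```python
import numpy as np

def flat_ip_search(xb, xq, k):
    """Exhaustive inner-product search: compare each query with every stored vector."""
    scores = xq @ xb.T                          # (n_queries, n_vectors) inner products
    idx = np.argsort(-scores, axis=1)[:, :k]    # indices of the k largest scores per query
    top = np.take_along_axis(scores, idx, axis=1)
    return top, idx

rng = np.random.default_rng(0)
xb = rng.normal(size=(1000, 64)).astype("float32")
xb /= np.linalg.norm(xb, axis=1, keepdims=True)   # unit vectors, as in this notebook
xq = xb[:3]                                       # reuse three stored vectors as queries
D, I = flat_ip_search(xb, xq, k=5)
print(I[:, 0])   # each query's best match is the stored copy of itself
```

The cost grows linearly with the number of stored vectors, because every query touches every row of `xb`.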

When To Use:

* Search quality is a very high priority.

* Search time does not matter, or the index is small (<10K vectors).

In [6]:
#FAISS index
index = faiss.IndexFlatIP(image_features.shape[1])
index.add(image_features)

# Search functions
After the model's encoder turns a query image or text into an embedding, that embedding must be normalized so that an inner-product search against the image embeddings ranks results by cosine similarity.


### Normalization

Vector normalization ensures all vectors have a magnitude of 1, which simplifies the computation of similarity metrics.

* **Cosine similarity** measures the angle between vectors and ignores their magnitudes. With normalized vectors, computing it reduces to a dot product.

* After normalization, the query and each indexed image are compared purely by direction (cosine similarity), so the ranking reflects semantic relevance rather than vector magnitude.

* Without normalization, an embedding with a large magnitude can score higher under inner product than a semantically closer embedding with a smaller magnitude.

**Semantic Meaning:** In the context of embeddings (like those from CLIP or ViT), the direction of the vector in the high-dimensional space often represents its semantic meaning or content. For example, all embeddings for "cats" might point generally in one direction, while all embeddings for "dogs" point in another. The length of the vector might represent other less important factors (like how common the object is, or the "intensity" of the image). By ignoring magnitude, cosine similarity ensures that two items are compared purely on their conceptual or visual similarity, not on incidental differences in their numerical "strength" or "loudness." This makes the search results more relevant to what the user actually means.
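A small numerical example may help here. The 2-D vectors below are made up for illustration: one points almost in the query's direction but is short, the other points 45 degrees away but is long.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])
close_small = np.array([0.9, 0.1])   # nearly the same direction, small magnitude
far_large = np.array([5.0, 5.0])     # 45 degrees away, large magnitude

# Raw inner product rewards magnitude: the far vector gets the higher score.
raw_scores = (float(query @ close_small), float(query @ far_large))

# After L2 normalization the inner product equals cosine similarity,
# and the vector with the closer direction wins.
unit_scores = (float(normalize(query) @ normalize(close_small)),
               float(normalize(query) @ normalize(far_large)))

print(raw_scores)
print(unit_scores)
```

This is why both the indexed features and the query embedding are divided by their norm before they reach `IndexFlatIP`.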



In [7]:
#Search functions
def search_by_text(query, top_k=5):
    text = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(text)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    text_features = text_features.cpu().numpy().astype("float32")
    D, I = index.search(text_features, top_k)
    return [image_paths[i] for i in I[0]]

def search_by_image(query_image, top_k=5):
    image = preprocess(query_image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features_query = model.encode_image(image)
        image_features_query /= image_features_query.norm(dim=-1, keepdim=True)
    image_features_query = image_features_query.cpu().numpy().astype("float32")
    D, I = index.search(image_features_query, top_k)
    return [image_paths[i] for i in I[0]]

Retrieval happens in the `index.search` method, which runs a k-nearest neighbors (kNN) search to find the k vectors most similar to the query vector. The value of k is set through the `top_k` parameter. Because the index uses inner product on L2-normalized vectors, the similarity measure is cosine similarity. Each function returns a list of the retrieved image paths.
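The search functions discard the score array `D`. If the scores are useful, for example to show them in the UI, they can be paired with the paths. The helper below is a hypothetical sketch, not part of the notebook; the arrays are illustrative values shaped like the output of `index.search` for one query. Faiss fills unused result slots with the label -1 when fewer than k vectors are available, so those are skipped.

```python
import numpy as np

def format_results(D, I, paths):
    """Pair each returned index with its path and similarity score (first query only)."""
    results = []
    for score, i in zip(D[0], I[0]):
        if i < 0:   # -1 marks an empty result slot
            continue
        results.append((paths[i], float(score)))
    return results

# Illustrative arrays shaped like index.search output for one query with k=3.
D = np.array([[0.31, 0.28, 0.0]], dtype="float32")   # last score is a placeholder
I = np.array([[2, 0, -1]], dtype="int64")
paths = ["a.jpg", "b.jpg", "c.jpg"]
results = format_results(D, I, paths)
print(results)
```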

# Gradio
This cell builds a Gradio-based visual search demo that displays the top 5 images most similar to a given text or image query.

In [8]:
#Gradio 
def visual_search(text_query, image_query):
    if text_query:
        results = search_by_text(text_query)
    elif image_query is not None:
        results = search_by_image(image_query)
    else:
        return []
    return [Image.open(p) for p in results]

with gr.Blocks() as demo:
    gr.Markdown("# CLIP Visual Search Engine")
    with gr.Row():
        text_input = gr.Textbox(label="Text Query", placeholder="Describe the image you want to find...")
        image_input = gr.Image(type="pil", label="Or upload an image")
    output_gallery = gr.Gallery(label="Top Results", columns=5, height="auto")
    search_btn = gr.Button("Search")
    search_btn.click(
        fn=visual_search,
        inputs=[text_input, image_input],
        outputs=output_gallery
    )

demo.launch()

* Running on local URL:  http://127.0.0.1:7860
It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

* Running on public URL: https://6c08a9446c6f126edb.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## How can we make our search faster?

There are two primary approaches:

**Reduce vector size:** through dimensionality reduction or by reducing the number of bits used to represent each vector's values.

**Reduce search scope:** by clustering vectors or organizing them into tree structures based on certain attributes, similarity, or distance, and then restricting the search to the closest clusters or the most similar branches.
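As a rough illustration of reducing the search scope, the sketch below clusters the vectors with a few k-means steps and then scans only the clusters nearest to the query. It mirrors the idea behind inverted-file (IVF) indexes but is not Faiss's implementation; `build_ivf`, `ivf_search`, and the parameter names `n_clusters` and `nprobe` are assumptions chosen for this example.

```python
import numpy as np

def centroid_distances(x, centroids):
    # Squared L2 distance to each centroid, up to a constant per row of x.
    return (centroids ** 2).sum(axis=1) - 2.0 * x @ centroids.T

def build_ivf(xb, n_clusters, n_iter=10, seed=0):
    """Run a few k-means steps and record which vectors fall in each cluster."""
    rng = np.random.default_rng(seed)
    centroids = xb[rng.choice(len(xb), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmin(centroid_distances(xb, centroids), axis=1)
        for c in range(n_clusters):
            members = xb[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(centroid_distances(xb, centroids), axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centroids, lists

def ivf_search(xq, xb, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters closest to the query, then rank by inner product."""
    probe = np.argsort(centroid_distances(xq, centroids))[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    scores = xb[candidates] @ xq
    order = np.argsort(-scores)[:k]
    return candidates[order], scores[order]

rng = np.random.default_rng(0)
xb = rng.normal(size=(2000, 32)).astype("float32")
xb /= np.linalg.norm(xb, axis=1, keepdims=True)
centroids, lists = build_ivf(xb, n_clusters=16)

ids_fast, _ = ivf_search(xb[7], xb, centroids, lists, k=5, nprobe=2)    # scans 2 of 16 clusters
ids_full, _ = ivf_search(xb[7], xb, centroids, lists, k=5, nprobe=16)   # scans every cluster
print(ids_fast, ids_full)
```

With `nprobe` equal to the number of clusters the search is exhaustive again. Smaller values scan fewer vectors but may miss neighbours that fell into clusters that were not probed.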

Using either of these approaches means that we no longer perform an exhaustive nearest-neighbors search but an approximate nearest-neighbors (ANN) search, because we no longer search the entire, full-resolution dataset. The result is a trade-off that balances search speed against search quality.
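One way to quantify this trade-off is recall: the fraction of the exact top-k neighbours that the approximate search also returns. The sketch below uses uniform 8-bit scalar quantization as a simple instance of reducing vector size. The helper names and the random data are assumptions for illustration, and the recall you see on real embeddings will differ.

```python
import numpy as np

def quantize_uint8(x, lo, hi):
    """Map float values in [lo, hi] to 8-bit codes (one byte instead of four)."""
    scaled = (x - lo) / (hi - lo)
    return np.clip(np.round(scaled * 255), 0, 255).astype(np.uint8)

def dequantize_uint8(codes, lo, hi):
    return codes.astype(np.float32) / 255 * (hi - lo) + lo

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours also found by the approximate search."""
    hits = [len(set(a.tolist()) & set(e.tolist())) for a, e in zip(approx_ids, exact_ids)]
    return sum(hits) / exact_ids.size

rng = np.random.default_rng(0)
xb = rng.normal(size=(2000, 64)).astype("float32")
xb /= np.linalg.norm(xb, axis=1, keepdims=True)
xq = xb[:50] + 0.05 * rng.normal(size=(50, 64)).astype("float32")   # noisy queries

k = 10
exact_ids = np.argsort(-(xq @ xb.T), axis=1)[:, :k]          # exhaustive, full precision

lo, hi = xb.min(), xb.max()
codes = quantize_uint8(xb, lo, hi)                            # compressed database
xb_approx = dequantize_uint8(codes, lo, hi)
approx_ids = np.argsort(-(xq @ xb_approx.T), axis=1)[:, :k]   # search on lossy vectors

size_ratio = codes.nbytes / xb.nbytes
recall = recall_at_k(approx_ids, exact_ids)
print(size_ratio, recall)
```

The compressed vectors take a quarter of the memory, and any drop in recall below 1.0 is the price paid in search quality.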