## Image Embeddings

In [1]:
# We first tried running ChromaDB in memory, but the kernel crashed whenever we added embeddings to it.
# For that reason, we launch ChromaDB in a Docker container instead.

In [2]:
#!pip install chromadb torch open-clip-torch

In [1]:
# Importing useful dependencies
import io
import os
import boto3
import torch
import chromadb
import open_clip
import numpy as np
from PIL import Image
from chromadb.config import Settings

# Set a seed for reproducibility
np.random.seed(10721)
torch.manual_seed(10721)
torch.cuda.manual_seed_all(10721)


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\SakuraSnow\AppData\Local\Programs\Python\Python311\Lib\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "C:\Users\SakuraSnow\AppData\Local\Programs\Python\Python311\Lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "C:\Users\SakuraSnow\AppData\Local\Programs\Python\Python311\Lib\site-packages\ipykernel\

AttributeError: _ARRAY_API not found

In [2]:
# Set up the S3 client for MinIO (MinIO implements the Amazon S3 API)
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000", # MinIO API endpoint
    aws_access_key_id="minioadmin", # User name
    aws_secret_access_key="minioadmin", # Password
)

In [3]:
# Connect to the ChromaDB server running in the Docker container
client = chromadb.HttpClient(host="localhost", port=8000)
# Note: although we set a persistent-directory path when defining the Docker container,
# the embeddings are actually stored inside the container.

# We can use the following line to remove all the stored data in a collection
#client.delete_collection(name="images")

# Create or get the collection named "images"
collection = client.create_collection(name="images", get_or_create=True, embedding_function=None)

In [4]:
# Function that prints the embeddings stored in a collection
def print_stored_embeddings(collection, x=None): # x is the maximum number of files to print
    results = collection.get(include=["documents", "embeddings"])
    for i in range(len(results["documents"])):
        print("ID:", results['ids'][i])
        print("Document:", results["documents"][i])
        print("Embedding (first 5 dims):", results["embeddings"][i][:5])
        print("---")
        if x and (x-1) == i:
            break

# We can use this function to print the embeddings stored in ChromaDB
print_stored_embeddings(collection, x=10)
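
For orientation, `collection.get` returns parallel lists under the keys `ids`, `documents` and `embeddings`, which is why the function above indexes all three with the same `i`. The sketch below uses a made-up stand-in dictionary (not real ChromaDB output) and a hypothetical helper `first_rows` to show the same traversal with `zip` and `islice`.

```python
from itertools import islice

# Made-up stand-in for the result of
# collection.get(include=["documents", "embeddings"]); values are illustrative only.
results = {
    "ids": ["img_1", "img_2", "img_3"],
    "documents": [
        "trusted-zone/images/a.png",
        "trusted-zone/images/b.png",
        "trusted-zone/images/c.png",
    ],
    "embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}

def first_rows(results, x=None):
    """Return up to x (id, document, embedding) rows; all rows if x is None."""
    rows = zip(results["ids"], results["documents"], results["embeddings"])
    return list(islice(rows, x))

for row_id, document, embedding in first_rows(results, x=2):
    print("ID:", row_id)
    print("Document:", document)
    print("Embedding (first 5 dims):", embedding[:5])
    print("---")
```

Pairing the lists with `zip` avoids manual index bookkeeping, and `islice` handles the optional limit without a separate `break`.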

**Creating Embeddings**

Here we generate an embedding for each image stored in the Trusted Zone and store it in the ChromaDB collection `images`. As our pretrained embedding generator we use a CLIP ViT-L/14 model: the architecture comes from OpenAI's CLIP family, and the weights loaded below are the `laion/CLIP-ViT-L-14-laion2B-s32B-b82K` checkpoint, loaded through `open_clip`. Although ChromaDB provides built-in embedding functions, we choose not to use them (the collection is created with `embedding_function=None`) in order to keep full control over the embedding generation process.

In [6]:
# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model + preprocessing
model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:laion/CLIP-ViT-L-14-laion2B-s32B-b82K") # ≈3× bigger than ViT-B-16
model.to(device)
model.eval() # open_clip returns the model in train mode; switch to inference mode



CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-23): 24 x ResidualAttentionBlock(
          (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((1024,), eps=1e-05, elementwi

In [7]:
# ---- Show parameter counts ----
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("Model: CLIP-ViT-L-14")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

Model: CLIP-ViT-L-14 (pretrained='openai')
Total parameters: 427,616,513
Trainable parameters: 427,616,513
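
As a rough sanity check on model size, the parameter count printed above can be converted into an approximate memory footprint for the weights. This sketch assumes 4 bytes per parameter (float32) and ignores activations and framework overhead; `weights_gib` is a hypothetical helper, not part of `open_clip`.

```python
def weights_gib(num_params, bytes_per_param=4):
    """Approximate size of the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

total_params = 427_616_513  # value printed by the cell above

print(f"float32 weights: ~{weights_gib(total_params):.2f} GiB")
print(f"float16 weights: ~{weights_gib(total_params, 2):.2f} GiB")
```

Under the float32 assumption the weights alone come to roughly 1.6 GiB, which is worth keeping in mind when deciding between CPU and GPU.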


In [8]:
# The next function returns the embedding of the given PIL Image
def embed_image(preprocess, model, pil_img):
    img_tensor = preprocess(pil_img).unsqueeze(0).to(device)
    with torch.no_grad():
        feats = model.encode_image(img_tensor)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy().squeeze()

# We can use this function to retrieve an image from our bucket in PIL Image format
def get_image(bucket, key):
    resp = s3.get_object(Bucket=bucket, Key=key)
    body = resp["Body"].read()
    img = Image.open(io.BytesIO(body))
    return img
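
Because `embed_image` divides each feature vector by its norm, the stored embeddings have unit length, and for unit vectors the dot product equals the cosine similarity. The following is a small standalone illustration with made-up 3-dimensional vectors (the real embeddings here have 768 dimensions); the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length, as embed_image does with feats / feats.norm()."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors, for illustration only.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both values agree (11/15, about 0.7333): the dot product of the
# normalized vectors is the cosine similarity of the originals.
print(dot(a_unit, b_unit), cosine(a, b))
```

Normalizing once at embedding time means later similarity comparisons between stored vectors only need a dot product.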

In [15]:
# The next function computes embeddings for the images stored in the Trusted Zone and stores them in the 'images' collection of our ChromaDB
def images_to_embeddings(src_bucket, collection, preprocess, model, src_prefix=""):

    # Incremental id assigned to each image embedding
    id_counter = 0
    
    paginator = s3.get_paginator("list_objects_v2") # It returns objects in pages and not all at once.
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):

        # List of paths (meta_data)
        image_paths = []
        # List of embeddings
        embeddings = []
        # List of unique IDs for each embedding
        ids = []
        
        for obj in page.get("Contents", []):

            key = obj["Key"]

            if obj['Size'] == 0 and key.endswith("/"): # skip the folder itself
                continue

            id_counter += 1

            # Download the image
            img = get_image(src_bucket, key)
            
            # Compute embedding
            vector = embed_image(preprocess, model, img) # A numerical vector of size 768

            print(f"Created embedding for {key} ({len(embeddings)} items in current batch).")

            # Storing data
            image_paths.append(f"{src_bucket}/{key}")
            embeddings.append(vector)
            ids.append(f"img_{id_counter}")

        # Store the embeddings of a page at once (skip pages that contained no images,
        # since ChromaDB rejects an add with empty lists)
        if ids:
            collection.add(
                ids=ids,
                documents=image_paths,
                embeddings=embeddings
            )

            print(f"All embeddings in the current batch were stored successfully in the collection {collection.name}.")
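
One consequence of the counter-based IDs above is that `id_counter` restarts at 0 on every run, so the ID an image receives depends on the listing order. As a hedged alternative (not what this notebook does), the ID could be derived from the object path itself, so that re-running the function assigns the same ID to the same image. `stable_id` and `is_folder_placeholder` below are hypothetical helpers, and the 16-character truncation of the hash is an arbitrary choice.

```python
import hashlib

def is_folder_placeholder(key, size):
    """Mirror of the check in images_to_embeddings: zero-byte keys ending in '/'."""
    return size == 0 and key.endswith("/")

def stable_id(bucket, key):
    """Derive a repeatable ID from the bucket and key (assumption: SHA-1, first 16 hex chars)."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).hexdigest()
    return f"img_{digest[:16]}"

# Made-up listing entries shaped like the "Contents" items of list_objects_v2.
objects = [
    {"Key": "images/", "Size": 0},
    {"Key": "images/example_a.png", "Size": 1024},
    {"Key": "images/example_b.png", "Size": 2048},
]

ids = [
    stable_id("trusted-zone", obj["Key"])
    for obj in objects
    if not is_folder_placeholder(obj["Key"], obj["Size"])
]
print(ids)
```

IDs built this way would pair naturally with `collection.upsert`, which overwrites an existing entry that has the same ID; whether that is preferable to `add` depends on how re-runs should behave.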

In [10]:
# Store embeddings
images_to_embeddings(src_bucket = "trusted-zone", src_prefix = "images/", collection = collection, preprocess = preprocess, model = model)

Created embedding for images/image_1762966866790.png (0 items in current batch).
Created embedding for images/image_1762966866959.png (1 items in current batch).
Created embedding for images/image_1762966867034.png (2 items in current batch).
Created embedding for images/image_1762966867095.png (3 items in current batch).
Created embedding for images/image_1762966867157.png (4 items in current batch).
Created embedding for images/image_1762966867220.png (5 items in current batch).
Created embedding for images/image_1762966867282.png (6 items in current batch).
Created embedding for images/image_1762966867347.png (7 items in current batch).
Created embedding for images/image_1762966867413.png (8 items in current batch).
Created embedding for images/image_1762966867474.png (9 items in current batch).
Created embedding for images/image_1762966867538.png (10 items in current batch).
Created embedding for images/image_1762966867599.png (11 items in current batch).
Created embedding for imag

In [11]:
# Check the embeddings stored in ChromaDB
print_stored_embeddings(collection)

ID: img_1
Document: trusted-zone/images/image_1762966866790.png
Embedding (first 5 dims): [ 0.04946182 -0.02563596 -0.00275467  0.07938568 -0.02156337]
---
ID: img_2
Document: trusted-zone/images/image_1762966866959.png
Embedding (first 5 dims): [-0.02449172 -0.01995861 -0.02604526  0.01625412  0.04409551]
---
ID: img_3
Document: trusted-zone/images/image_1762966867034.png
Embedding (first 5 dims): [-0.04309645  0.08990799 -0.00668683  0.06573138  0.04949455]
---
ID: img_4
Document: trusted-zone/images/image_1762966867095.png
Embedding (first 5 dims): [-0.03896771 -0.01799422 -0.00205197  0.03902427 -0.00297691]
---
ID: img_5
Document: trusted-zone/images/image_1762966867157.png
Embedding (first 5 dims): [-4.7101914e-03 -2.5247540e-02 -5.7388210e-07  3.5136775e-03
  2.6678363e-02]
---
ID: img_6
Document: trusted-zone/images/image_1762966867220.png
Embedding (first 5 dims): [-0.0522121   0.03042585  0.00810968  0.01641891  0.00334097]
---
ID: img_7
Document: trusted-zone/images/image_17