## Combining Text Embeddings and Image Embeddings

The idea is to use a single model that embeds both text and images into a shared vector space, store those embeddings in Chroma, and then run cross-modal similarity searches over them.

For simplicity, we use the "ViT-B-16" model from the open_clip library, which can produce embeddings for both images and texts.
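
To make the shared-space idea concrete, here is a minimal sketch that ranks a few stored items against a query vector by cosine similarity. The three-dimensional vectors and the item names are invented for illustration; real "ViT-B-16" embeddings have 512 dimensions, and a vector store such as Chroma would normally use an index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "shared space": text and image items live side by side (values are invented).
stored = {
    "image_cat": [0.9, 0.1, 0.0],
    "image_car": [0.0, 0.2, 0.9],
    "text_dog": [0.7, 0.6, 0.1],
}

def rank(query, items):
    """Return item ids sorted from most to least similar to the query."""
    return sorted(items, key=lambda k: cosine_similarity(query, items[k]), reverse=True)

# A hypothetical text query whose embedding points roughly in the "cat" direction.
query = [1.0, 0.2, 0.0]
ranking = rank(query, stored)
print(ranking)
```

Because every item is compared in the same space, a text query can return images, other texts, or both.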

In [1]:
# Importing useful dependencies
import io
import torch
import boto3
import chromadb
import open_clip
import numpy as np
from PIL import Image
from io import BytesIO
import ipywidgets as widgets



In [2]:
# Setup S3 client for MinIO (MinIO implements Amazon S3 API)
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000", # MinIO API endpoint
    aws_access_key_id="minioadmin", # User name
    aws_secret_access_key="minioadmin", # Password
)

In [3]:
# Connect to the server (Docker Container)
client = chromadb.HttpClient(host="localhost", port=8000)

# Create or get the collection named "images" (the embeddings of images are from "ViT-B-16" model)
collection_images = client.create_collection(name="images", get_or_create=True, embedding_function=None)

# Create or get the collection named "texts_images" to store embeddings of images and texts created by "ViT-B-16"
collection_texts_images = client.create_collection(name="texts_images", get_or_create=True, embedding_function=None)

In [4]:
# Use the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and its tokenizer
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16") # Tokenizer for texts
model.eval() # Inference mode
model.to(device)



CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-11): 12 x ResidualAttentionBlock(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=768, out_features=3072, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((768,), eps=1e-05, elementwise_affine

As the images are already embedded, the following cells embed the texts with the same "ViT-B-16" model so that both modalities share one vector space. One of the reasons we chose "ViT-B-16" over BERT is that it is faster and more lightweight when creating embeddings.

In [5]:
# Return the L2-normalized embedding of a single text as a 1-D NumPy array.
# Relies on the global `tokenizer` and `device` defined above.
@torch.no_grad()
def embed_text(model, text: str):
    tokens = tokenizer([text]).to(device) # batch of one; long inputs are truncated to the model's context length
    feats = model.encode_text(tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True) # normalize to unit length
    return feats.cpu().numpy()[0]
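
The normalization step matters for retrieval: for unit-length vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2*cos, so ranking by either metric gives the same order. A small sketch with invented vectors (the helper names are illustrative and not part of the notebook):

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm, as embed_text does with feats / feats.norm()."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_ab = dot(a, b)          # cosine similarity, since both vectors have norm 1
dist_ab = squared_l2(a, b)  # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(cos_ab, dist_ab, 2 - 2 * cos_ab)
```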

In [6]:
# Embed the text files stored in the Trusted Zone and add the embeddings to the given ChromaDB collection
def texts_to_embeddings(src_bucket, collection, model, src_prefix=""):

    # Incremental id assigned to each embedding
    id_counter = 0

    paginator = s3.get_paginator("list_objects_v2") # Returns the objects page by page instead of all at once
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):

        # List of paths (metadata)
        file_paths = []
        # List of embeddings
        embeddings = []
        # List of unique IDs for each embedding
        ids = []

        for obj in page.get("Contents", []):

            key = obj["Key"]

            if obj["Size"] == 0 and key.endswith("/"): # skip the folder placeholder itself
                continue

            id_counter += 1

            # Fetch and decode the text file
            response = s3.get_object(Bucket=src_bucket, Key=key)
            body = response["Body"].read().decode("utf-8")

            # Compute embedding
            vector = embed_text(model, body) # A numerical vector of size 512

            print(f"Created embedding for {key} ({len(embeddings)} items in current batch).")

            # Storing data
            file_paths.append(f"{src_bucket}/{key}")
            embeddings.append(vector)
            ids.append(f"text_{id_counter}")

        # Skip pages that contained no text files; there is nothing to add
        if not ids:
            continue

        # Store the texts of a page at once
        collection.add(
            ids=ids,
            documents=file_paths,
            embeddings=embeddings,
        )

        print(f"All embeddings in the current batch are stored successfully in the collection {collection.name}.")
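
The counter-based IDs depend on listing order, so a re-run after files are added or removed could assign a different ID to the same file. One possible alternative, shown here only as a sketch, derives the ID from the bucket and key with a hash. `make_text_id` is a hypothetical helper and is not part of the notebook.

```python
import hashlib

def make_text_id(bucket: str, key: str, prefix: str = "text") -> str:
    """Build a stable ID from the object's location (hypothetical helper)."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).hexdigest()
    # 16 hex characters keep the ID short; collisions are unlikely at this scale.
    return f"{prefix}_{digest[:16]}"

id_a = make_text_id("trusted-zone", "texts/text_1760786400687.txt")
id_b = make_text_id("trusted-zone", "texts/text_1760786400687.txt")
id_c = make_text_id("trusted-zone", "texts/text_1760786400752.txt")
print(id_a)
```

Paired with the collection's `upsert` method instead of `add`, stable IDs like these would let a re-run overwrite existing entries rather than depend on the order in which objects are listed.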

In [7]:
# Store embeddings
texts_to_embeddings(src_bucket="trusted-zone", src_prefix="texts/", collection=collection_texts_images, model=model)

Created embedding for texts/text_1760786400687.txt (0 items in current batch).
Created embedding for texts/text_1760786400752.txt (1 items in current batch).
Created embedding for texts/text_1760786400827.txt (2 items in current batch).
Created embedding for texts/text_1760786400902.txt (3 items in current batch).
Created embedding for texts/text_1760786400988.txt (4 items in current batch).
Created embedding for texts/text_1760786401058.txt (5 items in current batch).
Created embedding for texts/text_1760786401122.txt (6 items in current batch).
Created embedding for texts/text_1760786401209.txt (7 items in current batch).
Created embedding for texts/text_1760786401285.txt (8 items in current batch).
Created embedding for texts/text_1760786401393.txt (9 items in current batch).
Created embedding for texts/text_1760786401463.txt (10 items in current batch).
Created embedding for texts/text_1760786401534.txt (11 items in current batch).
Created embedding for texts/text_1760786401623.txt

In [8]:
# Fetch all embeddings of images
images_data = collection_images.get(include=["embeddings","documents"])
# Copy them to our new collection
collection_texts_images.add(
    ids=images_data["ids"],
    embeddings=images_data["embeddings"],
    documents=images_data["documents"],
)
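
With both modalities in one collection, a cross-modal search reduces to embedding the query and asking Chroma for its nearest neighbours. The sketch below wraps `collection.query`. The helper `search_similar` and the stub collection used to exercise it are hypothetical, as are the IDs and paths the stub returns; in the notebook the vector would come from `embed_text(model, ...)` and the collection would be `collection_texts_images`.

```python
from typing import Any, Dict, List, Sequence

def search_similar(collection: Any, query_vector: Sequence[float], k: int = 5) -> List[Dict[str, Any]]:
    """Return the k nearest stored items as a flat list of dicts (hypothetical helper)."""
    result = collection.query(
        query_embeddings=[list(query_vector)],  # a single query vector
        n_results=k,
        include=["documents", "distances"],
    )
    # The result holds one inner list per query; we sent one query, so take index 0.
    return [
        {"id": i, "document": d, "distance": dist}
        for i, d, dist in zip(result["ids"][0], result["documents"][0], result["distances"][0])
    ]

class StubCollection:
    """Stand-in that mimics the shape of a Chroma query result, for demonstration only."""
    def query(self, query_embeddings, n_results, include):
        return {
            "ids": [["text_1", "image_7"][:n_results]],
            "documents": [["trusted-zone/texts/a.txt", "trusted-zone/images/b.jpg"][:n_results]],
            "distances": [[0.12, 0.48][:n_results]],
        }

hits = search_similar(StubCollection(), [0.1, 0.2, 0.3], k=2)
for hit in hits:
    print(hit["id"], hit["document"], hit["distance"])
```

Since the `documents` field stores each object's path, a hit can be traced back to the original text or image file in the bucket.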