Note for readers on GitHub: This notebook contains a failed attempt at using CLIP embeddings for multimodal retrieval. It fails because our image data at this point is very niche (simple diagrams of specific bee parts), so CLIP doesn't know what to do with it. The biggest symptom is that retrieval is heavily biased towards text embeddings (it won't retrieve image embeddings, no matter the query), probably because the image embeddings were generated poorly. So this notebook is mainly here for bookkeeping. However, if we get more generic image data, it may be a good resource. Also, be wary of the tests at the end; they were written half-heartedly after we knew that CLIP was a sinking ship.

PS: If you do ultimately get new, generic image data, consider checking out BioCLIP.

## In this project, we are going to read in image and text data from an AWS S3 bucket. Then, we will embed them into the same ChromaDB vector store using the Transformers library and `openai/clip-vit-base-patch32`.

### First, we will install the necessary packages

In [None]:
!pip install boto3
!pip install PyPDF2
!pip install langchain
!pip install langchain_community
!pip install chromadb
!pip install torch
!pip install torchvision
!pip install transformers
!pip install Pillow requests scikit-learn





### We will then gather the keys of all documents in our AWS S3 bucket. We will separate the objects into two arrays, one for text data and one for image data.

In [None]:
import boto3
from google.colab import userdata

def get_objects(bucket_name):
  AWS_SERVER_PUBLIC_KEY = userdata.get('AWS_SERVER_PUBLIC_KEY')
  AWS_SERVER_SECRET_KEY = userdata.get('AWS_SERVER_SECRET_KEY')

  s3 = boto3.client('s3',
                    aws_access_key_id=AWS_SERVER_PUBLIC_KEY,
                    aws_secret_access_key=AWS_SERVER_SECRET_KEY)

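  # Note: list_objects_v2 returns at most 1,000 keys per call
  s3_objects = s3.list_objects_v2(Bucket=bucket_name)

  image_document_keys = []
  text_document_keys = []
  # 'Contents' is absent from the response when the bucket is empty
  for obj in s3_objects.get('Contents', []):
    # Match on the full extension (case-insensitive) instead of the last 3 characters
    key = obj["Key"].lower()
    if key.endswith(('.png', '.jpg', '.jpeg')):
      image_document_keys.append(obj["Key"])
    elif key.endswith(('.pdf', '.txt')):
      text_document_keys.append(obj["Key"])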
  print(f"Image keys: {image_document_keys}")
  print(f"Text keys: {text_document_keys}")
  return s3, image_document_keys, text_document_keys
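
Since `list_objects_v2` returns at most 1,000 keys per call, a larger bucket would be silently cut short. Below is a minimal sketch of one way around this, assuming you iterate over the pages yielded by boto3's `list_objects_v2` paginator. `collect_keys` and the two suffix tuples are hypothetical names, not part of the notebook. The function accepts any iterable of response pages, so it can be tried with fake pages and no AWS credentials.

```python
# Hypothetical helper: sort S3 keys into image and text lists across many pages.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")
TEXT_SUFFIXES = (".pdf", ".txt")

def collect_keys(pages):
    """Split object keys from list_objects_v2-style pages into (image_keys, text_keys)."""
    image_keys, text_keys = [], []
    for page in pages:
        # "Contents" is missing from a page that holds no objects
        for obj in page.get("Contents", []):
            key = obj["Key"]
            lowered = key.lower()
            if lowered.endswith(IMAGE_SUFFIXES):
                image_keys.append(key)
            elif lowered.endswith(TEXT_SUFFIXES):
                text_keys.append(key)
    return image_keys, text_keys

# Fake pages standing in for s3.get_paginator("list_objects_v2").paginate(Bucket=...)
fake_pages = [
    {"Contents": [{"Key": "image_1.jpeg"}, {"Key": "paper.pdf"}]},
    {"Contents": [{"Key": "diagram.PNG"}, {"Key": "notes.csv"}]},
    {},  # an empty page
]
image_keys, text_keys = collect_keys(fake_pages)
print(image_keys)
print(text_keys)
```

Keys with other extensions (here `notes.csv`) are ignored, which mirrors the behavior of `get_objects`.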


# Part 1: Data Embedding

### Now, we must write the helper functions to embed our text data as well as our image data.

First, we'll define our data processor, as well as our models

In [None]:
from transformers import CLIPProcessor, CLIPModel, CLIPVisionModelWithProjection, CLIPTextModelWithProjection

# Load CLIP processor and models
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_vision_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
clip_text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

Next, we will write a function to extract text from our PDFs and chunk it.

In [None]:
from io import BytesIO
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Function to read in each text document (PDF or TXT), extract its text, and chunk it
def chunk_text(s3, bucket_name, text_document_keys):
  # Initialize the text chunker with custom parameters
  custom_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 400,
    chunk_overlap  = 30,
    length_function = len
    )

  # Accumulate chunks across all documents
  chunk_sources = []
  text_chunks = []
  chunk_ids = []
  for obj_key in text_document_keys:
    # Download the raw file bytes
    file_bytes = s3.get_object(Bucket=bucket_name, Key=obj_key)[
      "Body"
    ].read()

    if obj_key.lower().endswith(".txt"):
      # Plain-text files need no PDF parsing (assumes UTF-8 encoding)
      text = file_bytes.decode("utf-8", errors="replace")
    else:
      # Extract the PDF's text; guard against pages that yield no text
      pdf = PdfReader(BytesIO(file_bytes))
      text = " ".join(page.extract_text() or "" for page in pdf.pages)

    # Chunk the text
    chunks = custom_text_splitter.create_documents([text])

    # Format chunks for the CLIP text model and for ChromaDB
    for idx, chunk in enumerate(chunks):
      chunk_sources.append({"source": obj_key})
      text_chunks.append(chunk.page_content)
      chunk_ids.append(f"{idx}_{obj_key}")
  # Return only after every document has been processed
  return text_chunks, chunk_sources, chunk_ids
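
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on separators such as paragraphs and spaces). `sliding_chunks` and `make_chunk_ids` are hypothetical helpers that only illustrate the window arithmetic and the `{idx}_{key}` ID scheme used above.

```python
def sliding_chunks(text, chunk_size=400, chunk_overlap=30):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

def make_chunk_ids(obj_key, chunks):
    """Build IDs of the form '<chunk index>_<object key>'."""
    return [f"{idx}_{obj_key}" for idx, _ in enumerate(chunks)]

# Made-up 1,000-character document
text = "".join(str(i % 10) for i in range(1000))
chunks = sliding_chunks(text)
ids = make_chunk_ids("bees.pdf", chunks)
print([len(c) for c in chunks])
print(ids)
```

Because the object key is part of each ID, chunk indices can restart at 0 for every document without producing duplicate IDs in the collection.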

Now that we have our text, we will write the helper function to embed the chunks.

In [None]:


def embed_text_chunks_with_clip(text_chunks):
    text_embeddings = []
    for text in text_chunks:
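        # Process text and generate embeddings. CLIP's text encoder accepts at most
        # 77 tokens, so truncation=True silently drops the tail of longer chunks
        inputs = clip_processor(text=text, return_tensors="pt", padding=True, truncation=True)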
        outputs = clip_text_model(**inputs)
        text_embedding = outputs.text_embeds.squeeze(0).detach().numpy().tolist()
        text_embeddings.append(text_embedding)
    return text_embeddings
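
CLIP compares embeddings by cosine similarity, which ignores vector length. One step that may be worth trying before storing the embeddings is to scale each one to unit length, so that image and text vectors with different norms are compared on direction only. The sketch below is not part of the pipeline above; `l2_normalize` is a hypothetical helper that works on the plain Python lists returned by `embed_text_chunks_with_clip`.

```python
import math

def l2_normalize(embedding):
    """Return a copy of embedding scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in embedding))
    if norm == 0.0:
        return list(embedding)  # leave an all-zero vector unchanged
    return [x / norm for x in embedding]

example = [3.0, 4.0]  # made-up 2-D vector with length 5
unit = l2_normalize(example)
print(unit)  # [0.6, 0.8]
```

If stored embeddings are normalized this way, the query embedding should be normalized too, so that both sides of the comparison live on the same scale.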


Great. Now we will move on to the image data.

In [None]:

from PIL import Image
import requests
def embed_images_with_clip(bucket_name, image_document_keys):
    image_embeddings = []
    image_sources = []
    image_ids = []
    for index, image_key in enumerate(image_document_keys):
        image_url = f"https://{bucket_name}.s3.us-east-1.amazonaws.com/{image_key}"
        image = Image.open(requests.get(image_url, stream=True).raw)

        # Process image and generate embeddings
        inputs = clip_processor(images=image, return_tensors="pt", padding=True)
        outputs = clip_vision_model(**inputs)
        image_embedding = outputs.image_embeds.squeeze(0).detach().numpy().tolist()

        image_embeddings.append(image_embedding)
        image_sources.append({"source": image_url})
        image_ids.append(f"{index}_{image_key}")
    return image_embeddings, image_sources, image_ids
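
Because image and text embeddings end up in the same collection, it can help to record the modality of each entry in its metadata. A hedged sketch: `build_metadata` is a hypothetical helper, and the `"type"` field is an assumed convention, not something the code above writes. With such a field in place, the `where` argument of Chroma's `collection.query` (for example `where={"type": "image"}`) could restrict a query to one modality.

```python
def build_metadata(sources, modality):
    """Attach a modality tag ("image" or "text") to each source entry."""
    if modality not in ("image", "text"):
        raise ValueError(f"unknown modality: {modality}")
    return [{"source": source, "type": modality} for source in sources]

# Made-up sources; "example-bucket" is a placeholder, not a real bucket
image_sources = build_metadata(
    ["https://example-bucket.s3.us-east-1.amazonaws.com/image_1.jpeg"], "image"
)
text_sources = build_metadata(["paper.pdf"], "text")
print(image_sources)
print(text_sources)
```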

Finally, we can write the master function to embed and store all of our data. We'll pass in the s3 client we created when we got our object keys, the name of our bucket, our chromadb collection name, as well as the object key arrays.

In [None]:
import chromadb


def get_and_store_embeddings(s3, bucket_name, collection_name, text_document_keys, image_document_keys):
    # Create Chroma Client (in-memory, so data is lost when the runtime restarts)
    chroma_client = chromadb.Client()

    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        # Informational tag only: Chroma infers the dimensionality (512 for CLIP)
        # from the first embeddings added, and its default distance is squared L2
        metadata={"embedding_dimension": 512}
    )

    # Read in, chunk, and embed text data using CLIPTextModelWithProjection
    text_chunks, chunk_sources, chunk_ids = chunk_text(s3, bucket_name, text_document_keys)
    text_embeddings = embed_text_chunks_with_clip(text_chunks)

    # Add text embeddings to ChromaDB collection
    collection.add(
        documents=text_chunks,
        metadatas=chunk_sources,
        embeddings=text_embeddings,
        ids=chunk_ids
    )

    # Read in and embed image data using CLIPVisionModelWithProjection
    image_embeddings, image_sources, image_ids = embed_images_with_clip(bucket_name, image_document_keys)

    # Add image embeddings to ChromaDB collection
    collection.add(
        documents=image_document_keys,  # Use image keys as documents
        metadatas=image_sources,
        embeddings=image_embeddings,
        ids=image_ids
    )
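
ChromaDB's default distance for a collection is squared L2, not cosine. A cosine collection has to be requested when the collection is created (commonly via `metadata={"hnsw:space": "cosine"}`; treat the exact option as an assumption to check against your Chroma version). The pure-Python sketch below, with hypothetical helper names and made-up vectors, shows why this matters for unnormalized embeddings: the two measures can disagree about which stored vector is closest, while for unit-length vectors squared L2 equals twice the cosine distance.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Made-up 2-D vectors: long_vec points the same way as the query but has a large
# norm; short_vec points elsewhere but sits close to the query in space.
query = [1.0, 0.0]
long_vec = [10.0, 0.0]
short_vec = [0.6, 0.8]

# Squared L2 prefers short_vec, cosine distance prefers long_vec
print(squared_l2(query, long_vec), squared_l2(query, short_vec))
print(cosine_distance(query, long_vec), cosine_distance(query, short_vec))

# For unit vectors the two measures agree up to a factor of 2
a, b = unit(long_vec), unit(short_vec)
print(abs(squared_l2(a, b) - 2 * cosine_distance(a, b)) < 1e-9)
```

Cosine distance never exceeds 2, so a reported distance far above that is a sign the collection is using squared L2.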

# Part 2: Data Retrieval

### Now we can send queries to ChromaDB to retrieve relevant data. The function below takes a query and retrieves the k most semantically similar embeddings, whether they are image or text embeddings. These results may be used as context for RAG, or for other use cases.

In [None]:
def get_relevant_docs_with_clip(collection_name, question, top_k):
    # Generate query embedding using CLIP's text model
    inputs = clip_processor(text=question, return_tensors="pt", padding=True, truncation=True)
    outputs = clip_text_model(**inputs)
    query_embedding = outputs.text_embeds.squeeze(0).detach().numpy().tolist()

    # Query ChromaDB
    chroma_client = chromadb.Client()
    collection = chroma_client.get_or_create_collection(name=collection_name)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
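        where_document={"$contains":"jpeg"},  # <- optional filter: only keeps documents whose text contains "jpeg" (the .jpeg image keys); remove it to search .png images and text chunks too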
    )
    return results
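
To check whether retrieval favors one modality, it helps to count how many of the top-k hits are images versus text chunks. A minimal sketch, assuming the result dictionary has the nested-list shape shown in this notebook's outputs (`ids` as a list of lists) and that image IDs end in an image extension, as produced by `embed_images_with_clip`. `count_hits_by_modality` is a hypothetical helper.

```python
from collections import Counter

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")

def count_hits_by_modality(results):
    """Count image vs. text hits for the first query in a Chroma-style result dict."""
    counts = Counter({"image": 0, "text": 0})
    for hit_id in results["ids"][0]:
        modality = "image" if hit_id.lower().endswith(IMAGE_SUFFIXES) else "text"
        counts[modality] += 1
    return dict(counts)

# Made-up result in the same shape as the collection.query(...) output
fake_results = {"ids": [["28_image_128.jpeg", "0_paper.pdf", "3_image_106.png"]]}
print(count_hits_by_modality(fake_results))  # {'image': 2, 'text': 1}
```

Running this over a handful of varied queries, with the `where_document` filter removed, gives a quick picture of how lopsided the results are.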

### Great, and just like that, we can semantically retrieve data to help answer queries.

### Now, let's feed the query, as well as the context, into an LLM and see what this RAG chatbot is really capable of. (Just kidding, we aren't worried about this right now.)

In [None]:
# from openai import OpenAI

# def get_chat_response(question, context):
#   OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
#   client = OpenAI(api_key=OPENAI_API_KEY)
#   question = question
#   context = " ".join(doc.replace("search_document: ", "") for doc in context['documents'][0]) # Extract and concatenate search_document values
#   content = f"You are a bee expert. Please use the following context to answer the user's question. Context: {context}"
#   completion = client.chat.completions.create(
#       model="gpt-4o",
#       messages=[
#           {"role": "developer", "content": content},  # pass the full prompt, not just the raw context
#           {
#               "role": "user",
#               "content": question
#           }
#       ]
#   )
#   print(f"Context retrieved: {context}")
#   print(f"Response: {completion.choices[0].message.content}")



# Let's Try it Out

In [None]:
bucket_name="testing-generic-rag"
collection_name = 'my_second_collection'
s3, image_document_keys, text_document_keys = get_objects(bucket_name)

Image keys: ['image_100.jpeg', 'image_101.jpeg', 'image_102.jpeg', 'image_103.jpeg', 'image_104.jpeg', 'image_105.jpeg', 'image_106.png', 'image_107.png', 'image_108.jpeg', 'image_109.png', 'image_110.png', 'image_111.png', 'image_112.png', 'image_113.jpeg', 'image_114.jpeg', 'image_115.jpeg', 'image_116.jpeg', 'image_117.jpeg', 'image_118.jpeg', 'image_119.jpeg', 'image_120.jpeg', 'image_121.jpeg', 'image_122.jpeg', 'image_123.jpeg', 'image_124.jpeg', 'image_125.jpeg', 'image_126.jpeg', 'image_127.jpeg', 'image_128.jpeg', 'image_129.jpeg', 'image_130.png', 'image_131.png', 'image_132.jpeg', 'image_133.png', 'image_134.png', 'image_135.png', 'image_136.jpeg', 'image_137.png', 'image_138.png', 'image_139.png', 'image_140.jpeg', 'image_141.png', 'image_142.png', 'image_143.png', 'image_144.jpeg', 'image_50.jpeg', 'image_51.jpeg', 'image_52.jpeg', 'image_53.jpeg', 'image_75.jpeg', 'image_76.jpeg', 'image_77.jpeg', 'image_78.jpeg', 'image_79.jpeg', 'image_80.jpeg', 'image_81.jpeg', 'image_

In [None]:
get_and_store_embeddings(s3 = s3, bucket_name=bucket_name, collection_name=collection_name, text_document_keys=text_document_keys, image_document_keys=image_document_keys)

In [None]:
# chroma_client = chromadb.Client()

# collection = chroma_client.get_or_create_collection(name="newest_test_collection")


# collection.get(include=["embeddings"])

{'ids': ['image_1', 'text_1'],
 'embeddings': array([[-0.10570487,  0.13790806, -0.29611444, ...,  0.86681843,
         -0.01457981,  0.25631657],
        [ 0.32377037, -0.03722687, -0.69636118, ...,  0.12958616,
         -0.4299742 , -0.11625163]]),
 'documents': None,
 'uris': None,
 'data': None,
 'metadatas': None,
 'included': [<IncludeEnum.embeddings: 'embeddings'>]}

In [None]:
question = 'Cutest little animal ever.'
results = get_relevant_docs_with_clip(collection_name, question=question, top_k=20)
print(results)

{'ids': [['28_image_128.jpeg', '52_image_78.jpeg', '69_image_95.jpeg', '15_image_115.jpeg', '20_image_120.jpeg', '65_image_91.jpeg', '50_image_76.jpeg', '22_image_122.jpeg', '56_image_82.jpeg', '68_image_94.jpeg', '18_image_118.jpeg', '71_image_97.jpeg', '19_image_119.jpeg', '27_image_127.jpeg', '67_image_93.jpeg', '25_image_125.jpeg', '73_image_99.jpeg', '3_image_103.jpeg', '54_image_80.jpeg', '51_image_77.jpeg']], 'embeddings': None, 'documents': [['image_128.jpeg', 'image_78.jpeg', 'image_95.jpeg', 'image_115.jpeg', 'image_120.jpeg', 'image_91.jpeg', 'image_76.jpeg', 'image_122.jpeg', 'image_82.jpeg', 'image_94.jpeg', 'image_118.jpeg', 'image_97.jpeg', 'image_119.jpeg', 'image_127.jpeg', 'image_93.jpeg', 'image_125.jpeg', 'image_99.jpeg', 'image_103.jpeg', 'image_80.jpeg', 'image_77.jpeg']], 'uris': None, 'data': None, 'metadatas': [[{'source': 'https://testing-generic-rag.s3.us-east-1.amazonaws.com/image_128.jpeg'}, {'source': 'https://testing-generic-rag.s3.us-east-1.amazonaws.com

### Tester function
This code sends a single image, a single text chunk, and a text query through the same process that our own image/text data goes through above. Its purpose is to explore issues with embedding storage and retrieval.

In [None]:


import chromadb
from transformers import CLIPProcessor, CLIPVisionModelWithProjection, CLIPTextModelWithProjection
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_distances
import numpy as np
from PIL import Image
import requests


def test_image_text_embedding_and_retrieval():
    # Initialize CLIP models and processor
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    clip_vision_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
    clip_text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

    # Example image URL and related sentence
    image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # can change
    related_sentence = "Two cats on a couch"  # can change

    # Load and embed the image
    image = Image.open(requests.get(image_url, stream=True).raw)
    image_inputs = clip_processor(images=image, return_tensors="pt", padding=True)
    image_outputs = clip_vision_model(**image_inputs)
    image_embedding = image_outputs.image_embeds.squeeze(0).detach().numpy().tolist()

    # Embed the related sentence
    text_inputs = clip_processor(text=related_sentence, return_tensors="pt", padding=True, truncation=True)
    text_outputs = clip_text_model(**text_inputs)
    text_embedding = text_outputs.text_embeds.squeeze(0).detach().numpy().tolist()


    # Define a query (e.g., a sentence similar to the image)
    query = "This photo features two cats lying on a bright pink couch, appearing relaxed and comfortable. One cat is a fluffy, long-haired tabby with dark and light brown fur, stretched out with its paws extended and head slightly tilted. The other cat is a short-haired tabby with a more compact body, resting with its head down and legs stretched out. Two remote controls are placed on the couch near the cats, adding a cozy, homey touch to the scene. The vibrant pink fabric of the couch contrasts with the cats’ fur, making them stand out in the composition."  # can change

    # Generate query embedding
    query_inputs = clip_processor(text=query, return_tensors="pt", padding=True, truncation=True)
    query_outputs = clip_text_model(**query_inputs)
    query_embedding = query_outputs.text_embeds.squeeze(0).detach().numpy().tolist()

    # Normalize query embedding (note: the image and text embeddings are left unnormalized)
    query_embedding = normalize([query_embedding], norm="l2")[0].tolist()

    # Calculate cosine distance before storing in ChromaDB
    query_embedding_np = np.array(query_embedding).reshape(1, -1)
    image_embedding_np = np.array(image_embedding).reshape(1, -1)
    text_embedding_np = np.array(text_embedding).reshape(1, -1)

    # cosine_distances returns 1 - cosine similarity, so despite the variable names
    # and the printed label below, lower values mean a closer match
    image_similarity_before = cosine_distances(query_embedding_np, image_embedding_np)[0][0]
    text_similarity_before = cosine_distances(query_embedding_np, text_embedding_np)[0][0]

    print("Cosine similarity before storing in ChromaDB:")
    print(f"  Image similarity: {image_similarity_before}")
    print(f"  Text similarity: {text_similarity_before}")
    print()


    # Initialize ChromaDB client and collection
    chroma_client = chromadb.Client()

    collection = chroma_client.create_collection(name="fake_test_collection", metadata={"embedding_dimension": 512})

    # Add the image and text embeddings to the collection
    collection.add(
        documents=[image_url, related_sentence],  # Use image URL and sentence as documents
        # metadatas=[{"type": "image"}, {"type": "text"}],  # Add metadata to distinguish between image and text
        embeddings=[image_embedding, text_embedding],  # Raw CLIP embeddings (only the query was normalized above)
        ids=["image_1", "text_1"]  # Unique IDs for each document
    )

    # Query the collection
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=2,  # Retrieve top 2 results
    )

    # Print the results
    print("Query Results from ChromaDB:")
    for i, (document, metadata, distance) in enumerate(zip(results["documents"][0], results["metadatas"][0], results["distances"][0])):
        print(f"Result {i + 1}:")
        print(f"  Document: {document}")
        print(f"  Metadata: {metadata}")
        print(f"  Cosine Distance: {distance}")  # Labelled cosine, but Chroma's default space is squared L2, so this value is not bounded by 2
        print()

    retrieved_embedding = collection.get(include=["embeddings"])["embeddings"]
    original_embedding = image_embedding

    # Print both so they can be compared by eye (retrieved_embedding holds every stored vector, image and text)
    print(f"original embedding: {original_embedding}")
    print(f"retrieved embedding: {retrieved_embedding}")


# Run the test function
test_image_text_embedding_and_retrieval()



Cosine similarity before storing in ChromaDB:
  Image similarity: 0.6434045364724016
  Text similarity: 0.5291088754348807

Query Results from ChromaDB:
Result 1:
  Document: Two cats on a couch
  Metadata: None
  Cosine Distance: 65.58438110351562

Result 2:
  Document: http://images.cocodataset.org/val2017/000000039769.jpg
  Metadata: None
  Cosine Distance: 109.9305648803711

original embedding: [-0.10570486634969711, 0.13790805637836456, -0.296114444732666, 0.021248916164040565, -0.06406959891319275, -0.1686188280582428, -0.1351427286863327, -0.0024489674251526594, 0.47376549243927, -0.17626884579658508, 0.24439945816993713, -0.3797180950641632, 0.048325929790735245, -0.13981136679649353, -0.34044772386550903, -0.12675876915454865, -0.23266154527664185, -0.29759618639945984, 0.17886193096637726, 0.049607954919338226, -1.3074297904968262, -0.032436810433864594, 0.4214426875114441, -0.3336309790611267, -0.0473744198679924, 0.29804500937461853, 0.23910123109817505, -0.1842911243438720