# Lab 4.1.3: Multimodal RAG System

**Module:** 4.1 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how CLIP enables image-text similarity search
- [ ] Build a vector database that indexes both images and text
- [ ] Query the database with natural language to find relevant content
- [ ] Combine image retrieval with VLM analysis for RAG
- [ ] Create a complete multimodal search pipeline

---

## 📚 Prerequisites

- Completed: Lab 4.1.1 (Vision-Language Models)
- Knowledge of: Vector databases, embeddings, RAG concepts
- Running in: NGC PyTorch container

---

## 🌍 Real-World Context

Multimodal RAG lets one query search across text and images at the same time. Typical applications:

- **E-commerce**: "Show me dresses similar to this photo but in blue"
- **Medical**: Find X-rays similar to a patient's scan with relevant notes
- **Legal**: Search through document archives with both text and images
- **Creative**: Find reference images that match a mood or concept
- **Education**: Search lecture slides by visual content or topics

---

## 🧒 ELI5: What is Multimodal RAG?

> **Imagine you're organizing a photo album with notes.** You want to find photos and notes about your beach vacation, but you can't remember the exact words you used.
>
> **Regular text search** would only look at your written notes.  
> **Regular image search** would only look at the photos.  
> **Multimodal RAG** understands BOTH! It can:
> - Find beach photos even if you search "sunny ocean day"
> - Find notes about sunsets even if you search with a sunset image
> - Combine results from both to give the best answer
>
> **In AI terms:** Multimodal RAG uses CLIP to create embeddings that live in the same "meaning space" for both images and text. Similar concepts cluster together regardless of whether they're images or text!

---

## Part 1: Environment Setup

Let's set up our environment for multimodal RAG.

In [None]:
# Check GPU
import torch

print("=" * 50)
print("DGX Spark Environment Check")
print("=" * 50)

if torch.cuda.is_available():
    # Named `props` so it does not clash with the `device` string defined later
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Memory: {props.total_memory / 1024**3:.1f} GB")
else:
    print("WARNING: No GPU detected!")

In [None]:
# Install dependencies (run once)
# Version specifiers are quoted so the shell does not treat ">=" as a redirect
# !pip install "chromadb>=0.4.22" "sentence-transformers>=2.3.0" "transformers>=4.45.0" "pillow>=10.0.0" requests

In [None]:
# Import libraries
import gc
import time
import hashlib
import requests
from io import BytesIO
from pathlib import Path
from typing import Optional, Union, List, Dict, Any
from dataclasses import dataclass

import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

print("✅ Libraries imported!")

---

## Part 2: Understanding CLIP Embeddings

CLIP is the foundation of multimodal RAG. It creates embeddings for both images and text in the same vector space.

### 🧒 ELI5: CLIP Embeddings

> **Think of CLIP as a universal translator.** It can "read" an image and an English sentence and tell you how similar they are.
>
> It converts both into "coordinates" in a special 768-dimensional space where:
> - Similar things are close together
> - Different things are far apart
> - Images and text that match end up close to each other!
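
To make the "coordinates" idea concrete, here is a minimal sketch using hand-made 3-dimensional vectors in place of real CLIP embeddings. The vectors, the item names, and the `cosine_similarity` helper are illustrative assumptions, not CLIP output; the point is only that items of either modality can be ranked against one query by the angle between vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (illustrative values, not real CLIP output).
toy_space = {
    "image: photo of a dog": [0.9, 0.1, 0.0],
    "text: a puppy playing": [0.8, 0.2, 0.1],
    "text: a red sports car": [0.0, 0.1, 0.9],
}

# Pretend this is the embedding of the text query "dog".
query = [1.0, 0.0, 0.0]

# Rank every item, image or text, by similarity to the query.
ranked = sorted(
    toy_space,
    key=lambda name: cosine_similarity(query, toy_space[name]),
    reverse=True,
)
for name in ranked:
    print(f"{cosine_similarity(query, toy_space[name]):.2f}  {name}")
```

The real model works the same way in principle, just with learned vectors of much higher dimension.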

In [None]:
from transformers import CLIPModel, CLIPProcessor

# Load CLIP model
print("Loading CLIP model...")
start_time = time.time()

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = clip_model.to(device)
clip_model.eval()

print(f"✅ Loaded in {time.time() - start_time:.1f}s")
print(f"Embedding dimension: {clip_model.config.projection_dim}")

In [None]:
def get_image_embedding(image: Image.Image) -> np.ndarray:
    """Get CLIP embedding for a single image."""
    inputs = clip_processor(images=image, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)
        # Normalize to unit vector
        features = features / features.norm(dim=-1, keepdim=True)
    
    return features.cpu().numpy()[0]

def get_text_embedding(text: str) -> np.ndarray:
    """Get CLIP embedding for text."""
    inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        features = clip_model.get_text_features(**inputs)
        features = features / features.norm(dim=-1, keepdim=True)
    
    return features.cpu().numpy()[0]

def get_batch_embeddings(images: Optional[List[Image.Image]] = None, texts: Optional[List[str]] = None) -> Dict[str, np.ndarray]:
    """Get embeddings for batches of images and/or texts."""
    result = {}
    
    if images:
        inputs = clip_processor(images=images, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        with torch.no_grad():
            features = clip_model.get_image_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
        
        result['image_embeddings'] = features.cpu().numpy()
    
    if texts:
        inputs = clip_processor(text=texts, return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        with torch.no_grad():
            features = clip_model.get_text_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
        
        result['text_embeddings'] = features.cpu().numpy()
    
    return result

print("✅ Embedding functions ready!")

In [None]:
# Demonstrate embedding similarity
# Let's see how CLIP embeddings capture semantic similarity

texts = [
    "a golden retriever playing in the park",
    "a dog running in the grass",
    "a cute puppy",
    "a cat sleeping on a couch",
    "a red sports car",
]

# Get embeddings
embeddings = [get_text_embedding(t) for t in texts]

# Compute pairwise similarity
print("📊 Text Similarity Matrix:")
print("(Higher values = more similar)\n")

# Print header
print(" " * 35, end="")
for i in range(len(texts)):
    print(f"{i+1:6}", end="")
print()

for i, (text_i, emb_i) in enumerate(zip(texts, embeddings)):
    print(f"{i+1}. {text_i[:30]:32}", end="")
    for emb_j in embeddings:
        similarity = np.dot(emb_i, emb_j)
        print(f"{similarity:6.2f}", end="")
    print()

### 🔍 What Just Happened?

In the printed matrix, look for these patterns (exact values depend on the model checkpoint):
- Texts about dogs (1-3) should score highest against each other
- The cat text (4) should score somewhat lower against the dog texts, since both describe pets
- The car text (5) should score lowest against the animal texts

This is the power of CLIP: it captures **semantic meaning**, not just keywords!
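
The nested loop above computes one dot product per pair. As a sketch of the same idea in vectorized form, the whole matrix can be produced with a single matrix product once the embeddings are stacked into rows. The `similarity_matrix` helper and the random stand-in vectors are assumptions for illustration; in the notebook you would pass `np.stack(embeddings)` instead.

```python
import numpy as np

def similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities for row vectors of shape (n, d)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms          # normalize each row to length 1
    return unit @ unit.T               # entry (i, j) is cos(row_i, row_j)

# Random stand-in vectors (not real CLIP embeddings).
rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 8))

sim = similarity_matrix(toy)
print(np.round(sim, 2))
```

The diagonal is always 1 (each vector compared with itself) and the matrix is symmetric, which is a quick sanity check on any similarity matrix you print.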

---

## Part 3: Building the Vector Database

We'll use ChromaDB to store and search our multimodal embeddings.

### 🧒 ELI5: Vector Database

> **A vector database is like a magical library** where instead of organizing books by title, you organize them by what they're about.
>
> Want to find books about "adventure"? The librarian doesn't search titles - they go to the "adventure" section and grab the closest books!
>
> In our case, the "location" of each item is its CLIP embedding.
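
Stripped of its indexing machinery, the "librarian" is a nearest-neighbor search over stored vectors. The sketch below is a hypothetical brute-force stand-in (`TinyVectorIndex` is not part of ChromaDB) that compares the query against every stored vector. ChromaDB instead uses an approximate HNSW index, which avoids scanning everything as the collection grows.

```python
import numpy as np

class TinyVectorIndex:
    """Brute-force cosine search over stored vectors (illustrative only)."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id: str, vector) -> None:
        v = np.asarray(vector, dtype=np.float32)
        self.ids.append(item_id)
        self.vectors.append(v / np.linalg.norm(v))  # store as unit vector

    def query(self, vector, top_k: int = 3):
        q = np.asarray(vector, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q         # cosine similarity per item
        order = np.argsort(-scores)[:top_k]         # highest scores first
        return [(self.ids[i], float(scores[i])) for i in order]

index = TinyVectorIndex()
index.add("dog_text", [0.9, 0.1, 0.0])
index.add("cat_text", [0.7, 0.7, 0.0])
index.add("car_text", [0.0, 0.1, 0.9])

hits = index.query([1.0, 0.0, 0.0], top_k=2)
print(hits)
```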

In [None]:
import chromadb
from chromadb.config import Settings

# Create in-memory ChromaDB client
# (Use PersistentClient for production)
chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))

# Create a collection for our multimodal content
collection = chroma_client.get_or_create_collection(
    name="multimodal_demo",
    metadata={"hnsw:space": "cosine"}  # Use cosine similarity
)

print("✅ ChromaDB collection created!")
print(f"Collection name: {collection.name}")

In [None]:
# Let's create a sample dataset with images and text
# For demo, we'll use some sample image URLs

sample_data = [
    {
        "type": "image",
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg",
        "description": "Orange tabby cat"
    },
    {
        "type": "image",
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/YellowLabradorLooking_new.jpg/1200px-YellowLabradorLooking_new.jpg",
        "description": "Yellow labrador retriever"
    },
    {
        "type": "text",
        "content": "Dogs are loyal companions that love to play fetch and go for walks. They are known as man's best friend.",
    },
    {
        "type": "text",
        "content": "Cats are independent pets that enjoy lounging in sunny spots and playing with toys. They purr when happy.",
    },
    {
        "type": "text",
        "content": "Mountains are majestic natural formations that attract hikers and climbers from around the world.",
    },
    {
        "type": "text",
        "content": "The ocean is vast and mysterious, home to countless marine species from tiny plankton to massive whales.",
    },
]

print(f"Sample dataset: {len(sample_data)} items")

In [None]:
def load_image_from_url(url: str) -> Image.Image:
    """Load image from URL."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")

# Index our sample data
print("Indexing sample data...")

for i, item in enumerate(sample_data):
    # Generate unique ID
    item_id = f"item_{i}"
    
    if item["type"] == "image":
        # Load and embed image
        print(f"  Loading image: {item['description']}...")
        image = load_image_from_url(item["url"])
        embedding = get_image_embedding(image)
        
        metadata = {
            "content_type": "image",
            "description": item["description"],
            "url": item["url"],
        }
        document = item["description"]
        
    else:
        # Embed text
        print(f"  Embedding text: {item['content'][:40]}...")
        embedding = get_text_embedding(item["content"])
        
        metadata = {
            "content_type": "text",
        }
        document = item["content"]
    
    # Add to collection
    collection.add(
        ids=[item_id],
        embeddings=[embedding.tolist()],
        metadatas=[metadata],
        documents=[document],
    )

print(f"\n✅ Indexed {collection.count()} items!")

---

## Part 4: Querying the Multimodal Index

Now we can search our index using natural language or images!

In [None]:
def search(query: str, top_k: int = 3) -> List[Dict]:
    """
    Search the multimodal index with a text query.
    
    Args:
        query: Natural language search query
        top_k: Number of results to return
        
    Returns:
        List of search results with content and similarity scores
    """
    # Get query embedding
    query_embedding = get_text_embedding(query)
    
    # Search ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k,
        include=["metadatas", "documents", "distances"],
    )
    
    # Format results
    formatted = []
    for i in range(len(results["ids"][0])):
        # Convert distance to similarity (ChromaDB returns cosine distance)
        similarity = 1 - results["distances"][0][i]
        
        formatted.append({
            "id": results["ids"][0][i],
            "similarity": similarity,
            "metadata": results["metadatas"][0][i],
            "content": results["documents"][0][i],
        })
    
    return formatted

print("✅ Search function ready!")

In [None]:
# Test search with different queries
queries = [
    "cute furry pet",
    "loyal companion animal",
    "nature and outdoors",
    "swimming and water",
]

for query in queries:
    print(f"\n🔍 Query: '{query}'")
    print("-" * 50)
    
    results = search(query, top_k=3)
    
    for i, result in enumerate(results, 1):
        content_type = result["metadata"]["content_type"]
        similarity = result["similarity"]
        
        if content_type == "image":
            content = f"[Image: {result['metadata']['description']}]"
        else:
            content = result["content"][:60] + "..."
        
        emoji = "🖼️" if content_type == "image" else "📝"
        print(f"  {i}. {emoji} {similarity:.3f} - {content}")

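### 🔍 What Just Happened?

With only six items indexed and `top_k=3`, check the results for these patterns:
- "cute furry pet" should surface cat and dog content, which may be images, text, or both
- "loyal companion animal" should prioritize the dog content
- "nature and outdoors" should find the mountain text
- "swimming and water" should find the ocean text

**The magic**: Images and text are searched together in one index, so relevant content can surface regardless of format. Text-to-text scores are often higher than text-to-image scores, so compare scores within a modality when a ranking looks surprising.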

---

## Part 5: Building a Complete Multimodal RAG Pipeline

Now let's combine everything into a reusable RAG class!

In [None]:
@dataclass
class SearchResult:
    """A single search result."""
    content_type: str  # "image" or "text"
    content: str       # Text content or image description
    score: float       # Similarity score (0-1)
    metadata: Dict[str, Any]
    image: Optional[Image.Image] = None  # Loaded image if applicable


class MultimodalRAG:
    """
    A complete multimodal RAG system using CLIP and ChromaDB.
    """
    
    def __init__(self, collection_name: str = "multimodal_rag"):
        """Initialize the RAG system."""
        self.collection_name = collection_name
        
        # Initialize ChromaDB
        self.client = chromadb.Client(Settings(anonymized_telemetry=False))
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        
        # CLIP model reference (use existing loaded model)
        self.clip_model = clip_model
        self.clip_processor = clip_processor
        self.device = device
    
    def _get_id(self, content: str, content_type: str) -> str:
        """Generate unique ID for content."""
        return hashlib.md5(f"{content_type}:{content}".encode()).hexdigest()
    
    def _embed_image(self, image: Image.Image) -> np.ndarray:
        """Get CLIP embedding for image."""
        inputs = self.clip_processor(images=image, return_tensors="pt", padding=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        with torch.no_grad():
            features = self.clip_model.get_image_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
        
        return features.cpu().numpy()[0]
    
    def _embed_text(self, text: str) -> np.ndarray:
        """Get CLIP embedding for text."""
        inputs = self.clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        with torch.no_grad():
            features = self.clip_model.get_text_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
        
        return features.cpu().numpy()[0]
    
    def add_image(self, image: Image.Image, description: str, metadata: Dict = None) -> str:
        """
        Add an image to the index.
        
        Args:
            image: PIL Image
            description: Text description for display
            metadata: Additional metadata
            
        Returns:
            ID of added item
        """
        item_id = self._get_id(description, "image")
        embedding = self._embed_image(image)
        
        meta = {
            "content_type": "image",
            "description": description,
        }
        if metadata:
            meta.update(metadata)
        
        self.collection.add(
            ids=[item_id],
            embeddings=[embedding.tolist()],
            metadatas=[meta],
            documents=[description],
        )
        
        return item_id
    
    def add_text(self, text: str, metadata: Dict = None) -> str:
        """
        Add text to the index.
        
        Args:
            text: Text content
            metadata: Additional metadata
            
        Returns:
            ID of added item
        """
        item_id = self._get_id(text, "text")
        embedding = self._embed_text(text)
        
        meta = {
            "content_type": "text",
        }
        if metadata:
            meta.update(metadata)
        
        self.collection.add(
            ids=[item_id],
            embeddings=[embedding.tolist()],
            metadatas=[meta],
            documents=[text],
        )
        
        return item_id
    
    def search(self, query: str, top_k: int = 5, content_type: str = None) -> List[SearchResult]:
        """
        Search the index with a text query.
        
        Args:
            query: Natural language search query
            top_k: Number of results
            content_type: Filter by "image" or "text" (None for both)
            
        Returns:
            List of SearchResult objects
        """
        query_embedding = self._embed_text(query)
        
        # Build where clause
        where = None
        if content_type:
            where = {"content_type": content_type}
        
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=top_k,
            where=where,
            include=["metadatas", "documents", "distances"],
        )
        
        search_results = []
        for i in range(len(results["ids"][0])):
            similarity = 1 - results["distances"][0][i]
            meta = results["metadatas"][0][i]
            
            search_results.append(SearchResult(
                content_type=meta["content_type"],
                content=results["documents"][0][i],
                score=similarity,
                metadata=meta,
            ))
        
        return search_results
    
    def search_by_image(self, image: Image.Image, top_k: int = 5) -> List[SearchResult]:
        """
        Search the index using an image query.
        
        Args:
            image: Query image
            top_k: Number of results
            
        Returns:
            List of SearchResult objects
        """
        query_embedding = self._embed_image(image)
        
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=top_k,
            include=["metadatas", "documents", "distances"],
        )
        
        search_results = []
        for i in range(len(results["ids"][0])):
            similarity = 1 - results["distances"][0][i]
            meta = results["metadatas"][0][i]
            
            search_results.append(SearchResult(
                content_type=meta["content_type"],
                content=results["documents"][0][i],
                score=similarity,
                metadata=meta,
            ))
        
        return search_results
    
    def count(self) -> int:
        """Get number of items in index."""
        return self.collection.count()
    
    def clear(self):
        """Clear all items from index."""
        self.client.delete_collection(self.collection_name)
        self.collection = self.client.create_collection(
            name=self.collection_name,
            metadata={"hnsw:space": "cosine"}
        )

print("✅ MultimodalRAG class defined!")
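
One design detail worth noticing: `add_image` derives the item ID from the description, so two different images that share a description map to the same ID (how ChromaDB treats a repeated ID depends on the version). The sketch below illustrates the effect and one possible alternative. Both helper names and the byte strings are hypothetical; in the notebook the bytes could come from `image.tobytes()`.

```python
import hashlib

def id_from_description(description: str) -> str:
    """Mirrors the class's approach: hash a type prefix plus the description."""
    return hashlib.md5(f"image:{description}".encode()).hexdigest()

def id_from_bytes(image_bytes: bytes) -> str:
    """Hypothetical alternative: hash the raw image bytes instead."""
    return hashlib.sha256(image_bytes).hexdigest()

# Stand-in byte strings representing two different image files.
photo_a = b"first image bytes"
photo_b = b"second image bytes"

# Same description gives the same ID, even for different images.
same_description_collides = id_from_description("a cat") == id_from_description("a cat")

# Hashing the content keeps different images apart.
bytes_ids_differ = id_from_bytes(photo_a) != id_from_bytes(photo_b)

print(same_description_collides, bytes_ids_differ)
```

Content-based IDs also make re-indexing idempotent: adding the same file twice yields the same ID rather than a duplicate entry.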

In [None]:
# Create and populate a new RAG instance
rag = MultimodalRAG("production_rag")

# Add more diverse content
print("Building knowledge base...")

# Add images
image_data = [
    ("https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg", "Orange tabby cat with green eyes"),
    ("https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/YellowLabradorLooking_new.jpg/1200px-YellowLabradorLooking_new.jpg", "Yellow labrador retriever dog"),
]

for url, desc in image_data:
    print(f"  Adding image: {desc}")
    img = load_image_from_url(url)
    rag.add_image(img, desc)

# Add text documents
text_data = [
    "Dogs make wonderful family pets due to their loyal and playful nature. They need regular exercise and enjoy activities like walking, running, and playing fetch.",
    "Cats are independent and low-maintenance pets. They are excellent hunters and spend much of their time grooming and sleeping. Cats can live indoors or outdoors.",
    "Golden retrievers are one of the most popular dog breeds. They are known for their friendly temperament and are often used as therapy and service dogs.",
    "The African savanna is home to lions, elephants, giraffes, and many other amazing animals. It features vast grasslands with scattered trees.",
    "Deep sea creatures have adapted to extreme pressure and darkness. Bioluminescent fish create their own light to attract prey in the ocean depths.",
    "The Northern Lights, or Aurora Borealis, are natural light displays in the sky. They occur when solar particles interact with Earth's magnetic field.",
    "Coffee is made from roasted coffee beans, the seeds of berries from the Coffea plant. It is one of the most consumed beverages in the world.",
    "Machine learning is a subset of artificial intelligence that enables computers to learn from data. Neural networks are a popular machine learning architecture.",
]

for text in text_data:
    print(f"  Adding text: {text[:40]}...")
    rag.add_text(text)

print(f"\n✅ Knowledge base built with {rag.count()} items!")

In [None]:
# Test the complete RAG system
test_queries = [
    "friendly pet that plays fetch",
    "wildlife in Africa",
    "lights in the night sky",
    "artificial intelligence and learning",
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"🔍 Query: '{query}'")
    print("="*60)
    
    results = rag.search(query, top_k=3)
    
    for i, r in enumerate(results, 1):
        emoji = "🖼️" if r.content_type == "image" else "📝"
        content_preview = r.content[:70] + "..." if len(r.content) > 70 else r.content
        print(f"\n  {i}. {emoji} Score: {r.score:.3f}")
        print(f"     {content_preview}")

---

## Part 6: Image-Based Search

We can also search using an image as the query!

In [None]:
# Load a query image
query_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Cat_November_2010-1a.jpg/1200px-Cat_November_2010-1a.jpg"
query_image = load_image_from_url(query_image_url)

# Display the query image
plt.figure(figsize=(6, 6))
plt.imshow(query_image)
plt.axis('off')
plt.title("Query Image")
plt.show()

# Search using the image
print("\n🔍 Searching with image query...")
print("=" * 60)

results = rag.search_by_image(query_image, top_k=5)

for i, r in enumerate(results, 1):
    emoji = "🖼️" if r.content_type == "image" else "📝"
    content_preview = r.content[:70] + "..." if len(r.content) > 70 else r.content
    print(f"\n  {i}. {emoji} Score: {r.score:.3f}")
    print(f"     {content_preview}")

### 🔍 What Just Happened?

When we search with a cat image, we expect the results to include:
1. The cat image already in our database (likely the highest similarity)
2. Text about cats
3. Other pet-related content

This is **reverse image search** extended with cross-modal retrieval!

---

## ⚠️ Common Mistakes

### Mistake 1: Not Normalizing Embeddings
```python
# ❌ Wrong: Raw embeddings without normalization
embedding = clip_model.get_image_features(**inputs)
# Different scale embeddings break similarity calculations!

# ✅ Right: Always normalize to unit vectors
embedding = clip_model.get_image_features(**inputs)
embedding = embedding / embedding.norm(dim=-1, keepdim=True)
```
**Why:** The dot product of two vectors equals their cosine similarity only when both are unit vectors, so unnormalized embeddings give scores that depend on vector length.
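
A small numeric sketch of what goes wrong, using made-up 2-dimensional vectors rather than CLIP output: with raw dot products, a long vector pointing in a worse direction can outscore a short vector pointing in nearly the right one.

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

query = np.array([1.0, 0.0])
close_but_short = np.array([0.9, 0.1])  # nearly the same direction, small norm
far_but_long = np.array([5.0, 5.0])     # 45 degrees away, large norm

# Raw dot products reward vector length, not just direction.
raw_scores = (float(query @ close_but_short), float(query @ far_but_long))

# After normalization the dot product is the cosine similarity.
cos_scores = (
    float(unit(query) @ unit(close_but_short)),
    float(unit(query) @ unit(far_but_long)),
)

print("raw dot:", raw_scores)
print("cosine: ", cos_scores)
```

The raw scores rank the distant vector first, while the normalized scores rank the closer direction first.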

---

### Mistake 2: Using Wrong Distance Metric
```python
# ❌ Wrong: Using L2 distance with cosine embeddings
collection = client.create_collection(
    name="my_collection",
    metadata={"hnsw:space": "l2"}  # Wrong for CLIP!
)

# ✅ Right: Use cosine distance
collection = client.create_collection(
    name="my_collection",
    metadata={"hnsw:space": "cosine"}
)
```
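**Why:** CLIP is trained to compare embeddings by cosine similarity, and the `1 - distance` conversion used in this lab only yields a cosine similarity when the collection uses the cosine space.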
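
The following sketch shows why the score conversion breaks under a different metric. It assumes the `l2` space reports squared Euclidean distance and uses random unit vectors as stand-ins for embeddings. For unit vectors, squared L2 distance equals `2 - 2 * cosine_similarity`, so the ranking is unchanged but `1 - distance` no longer gives the cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = rng.normal(size=(2, 16))   # two random stand-in vectors
a /= np.linalg.norm(a)            # normalize to unit length
b /= np.linalg.norm(b)

cosine_sim = float(a @ b)
cosine_dist = 1.0 - cosine_sim
squared_l2 = float(np.sum((a - b) ** 2))

print(f"cosine similarity   : {cosine_sim:.4f}")
print(f"1 - cosine distance : {1 - cosine_dist:.4f}")
print(f"1 - squared L2      : {1 - squared_l2:.4f}  (not the cosine similarity)")
print(f"2 - 2 * cosine      : {2 - 2 * cosine_sim:.4f}  (matches squared L2)")
```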

---

### Mistake 3: Text Too Long for CLIP
```python
# ❌ Wrong: Long text gets truncated silently
text = "Very long document..." * 100  # 77 tokens max!
embedding = get_text_embedding(text)

# ✅ Right: Chunk long text or summarize
# (split_into_chunks is a helper you must supply)
chunks = split_into_chunks(text, max_tokens=70)
for chunk in chunks:
    rag.add_text(chunk)
```
**Why:** CLIP's text encoder accepts at most 77 tokens, and `truncation=True` silently drops everything beyond that. Long documents should be chunked.
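
One way such a chunking helper could look is sketched below. This is an assumed implementation, not part of the lab: it packs whole words greedily and, by default, counts whitespace-separated words as tokens. CLIP's tokenizer usually produces more tokens than words, so the `count_tokens` parameter lets you plug in a real counter.

```python
from typing import Callable, List

def split_into_chunks(
    text: str,
    max_tokens: int = 70,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole words into chunks of at most max_tokens tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

long_text = "Very long document... " * 100
chunks = split_into_chunks(long_text, max_tokens=70)
print(len(chunks), "chunks; first chunk has", len(chunks[0].split()), "words")
```

In the notebook, a counter based on the loaded processor, such as `lambda s: len(clip_processor.tokenizer(s)["input_ids"])`, should give a count closer to what the model sees.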

---

## 🎉 Checkpoint

You've learned:
- ✅ How CLIP creates unified embeddings for images and text
- ✅ Building a vector database with ChromaDB
- ✅ Indexing both images and text in the same space
- ✅ Querying with natural language to find multimodal content
- ✅ Reverse image search across images and text
- ✅ Building a reusable MultimodalRAG class

---

## 🚀 Challenge (Optional)

Build a **Visual Question Answering RAG** that:
1. Takes a question about an image
2. Retrieves relevant context from your knowledge base
3. Uses a VLM to answer the question with the retrieved context

Example: "What type of animal is this and what do they eat?" + image of a cat
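
As a starting hint for step 3, here is a minimal sketch of how retrieved passages could be folded into a prompt for the VLM. The `build_rag_prompt` helper, its prompt wording, and the `max_contexts` parameter are all assumptions for illustration; the exact prompt format your VLM expects may differ.

```python
from typing import List

def build_rag_prompt(question: str, contexts: List[str], max_contexts: int = 3) -> str:
    """Combine retrieved passages and the user question into one prompt string."""
    selected = contexts[:max_contexts]
    context_block = "\n".join(f"[{i}] {c}" for i, c in enumerate(selected, 1))
    return (
        "Use the context below to answer the question about the image.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What type of animal is this and what do they eat?",
    [
        "Cats are independent and low-maintenance pets.",
        "Dogs make wonderful family pets.",
    ],
)
print(prompt)
```

In the notebook, the context list could come from something like `[r.content for r in rag_system.search_by_image(image) if r.content_type == "text"]`.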

In [None]:
# Challenge: Your code here!

def visual_qa_with_rag(
    image: Image.Image,
    question: str,
    rag_system: MultimodalRAG,
    vlm_model = None,
    vlm_processor = None,
) -> str:
    """
    Answer questions about images using RAG context.
    
    Args:
        image: Query image
        question: Question about the image
        rag_system: Multimodal RAG instance
        vlm_model: Vision-language model
        vlm_processor: VLM processor
        
    Returns:
        Answer with RAG context
    """
    # Your implementation here!
    pass

---

## 📖 Further Reading

- [CLIP Paper](https://arxiv.org/abs/2103.00020)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [LlamaIndex Multimodal RAG](https://docs.llamaindex.ai/en/stable/examples/multi_modal/multi_modal_retrieval/)
- [OpenAI CLIP Repository](https://github.com/openai/CLIP)

---

## 🧹 Cleanup

In [None]:
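# Clean up
# Drop every reference to the CLIP model, including the ones held by `rag`
for name in ("rag", "clip_model", "clip_processor"):
    if name in globals():
        del globals()[name]

# Collect garbage first so the freed tensors can be released from the CUDA cache
gc.collect()
print("✅ Cleanup complete!")

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")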

---

## Next Steps

In the next lab, we'll build a **Document AI Pipeline** for processing PDFs with OCR and layout analysis!

➡️ Continue to [Lab 04: Document AI Pipeline](./04-document-ai-pipeline.ipynb)