# Task 14.3: Multimodal RAG

**Module:** 14 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how CLIP creates joint image-text embeddings
- [ ] Build a multimodal index with both images and text
- [ ] Search images using natural language queries
- [ ] Create a complete multimodal RAG pipeline
- [ ] Combine image search with VLM-powered Q&A

---

## Prerequisites

- Completed: Task 14.1 (Vision-Language Demo)
- Knowledge of: Embeddings, vector databases (from Module 13)
- Running in: NGC PyTorch container

---

## Real-World Context

Traditional RAG works with text only. But what if your knowledge base includes images, diagrams, and photos?

**Industry Applications:**
- **E-commerce**: "Show me red dresses similar to this photo"
- **Medical**: "Find X-rays showing similar conditions"
- **Manufacturing**: "Locate defect images matching this description"
- **Real Estate**: "Find properties with modern kitchens like this"
- **Fashion**: "Search for outfits matching this style"

**Why DGX Spark?**
- CLIP ViT-L/14: under 2 GB of weights, small next to a 7B VLM
- Can index thousands of images while running a VLM for Q&A
- Fast embedding generation with Blackwell's tensor cores
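
To make the memory claims above concrete, here is a back-of-the-envelope sketch. The helper names and the parameter count (`APPROX_CLIP_PARAMS`) are assumptions for illustration, not measured values; the only fixed inputs are the 768-dimensional embeddings used in this notebook and the byte widths of float32 and bfloat16.

```python
def embedding_index_bytes(n_items: int, dims: int = 768, bytes_per_value: int = 4) -> int:
    """Raw storage for n_items float32 vectors, ignoring index overhead."""
    return n_items * dims * bytes_per_value


def model_weight_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory for model weights alone (no activations, no optimizer state)."""
    return n_params * bytes_per_param


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} vectors: {embedding_index_bytes(n) / 1e6:,.1f} MB")

APPROX_CLIP_PARAMS = 430_000_000  # assumption: rough parameter count, not measured here
print(f"float32 weights:  {model_weight_bytes(APPROX_CLIP_PARAMS, 4) / 1e9:.2f} GB")
print(f"bfloat16 weights: {model_weight_bytes(APPROX_CLIP_PARAMS, 2) / 1e9:.2f} GB")
```

Even a million raw embeddings take only a few gigabytes, which is why the index and a VLM can plausibly share one device.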

---

## ELI5: What is CLIP and Multimodal Embeddings?

> **Imagine you're organizing a photo album, but you want to find pictures using words, not scrolling!**
>
> CLIP learned by looking at 400 million images with captions. For each pair, it asked:
> - "What numbers describe this image?" (image embedding)
> - "What numbers describe this caption?" (text embedding)
>
> It learned to make the numbers for matching pairs SIMILAR, and non-matching pairs DIFFERENT.
>
> Now, you can:
> 1. Give CLIP all your photos → Get number codes for each
> 2. Type "sunset at the beach" → Get a number code
> 3. Find photos with similar numbers → Beach sunset photos appear!
>
> **In AI terms:**
> - **Contrastive Learning**: Train embeddings so matching pairs are close in vector space
> - **Shared Embedding Space**: Both images and text live in the same 768-dimensional space
> - **Zero-shot**: Works on new images and queries without retraining, though accuracy can drop on domains far from its web-scale training data
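
The contrastive idea described above can be sketched in plain Python on toy 2-D vectors. This is a simplified illustration, not CLIP's training code: the function names are made up for this example, and the temperature of 0.07 is an assumed value (it is the starting value reported in the CLIP paper).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should match text i, and vice versa."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # each caption points the same way as its image
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]   # captions attached to the wrong images

print("aligned pairs loss:", contrastive_loss(aligned_images, aligned_texts))
print("swapped pairs loss:", contrastive_loss(aligned_images, swapped_texts))
```

The loss is low when matching pairs point in the same direction and high when they do not, which is the signal that pulls matching image and text embeddings together during training.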

---

## Part 1: Environment Setup

In [None]:
# Install required packages (run once)
# !pip install transformers sentence-transformers chromadb pillow matplotlib -q

In [None]:
import torch
import gc
from PIL import Image
import numpy as np
import time
import requests
from io import BytesIO
from typing import List, Dict, Optional, Union, Tuple
import warnings
from IPython.display import display
warnings.filterwarnings('ignore')

# Check GPU
print("=" * 50)
print("GPU Configuration")
print("=" * 50)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"Total Memory: {total_memory:.1f} GB")
else:
    print("No GPU available")

print(f"\nPyTorch: {torch.__version__}")

In [None]:
def clear_gpu_memory():
    """Clear GPU memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print("GPU memory cleared!")

def get_memory_usage():
    """Get GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1e9
        return f"Allocated: {allocated:.2f}GB"
    return "No GPU"

def load_image_from_url(url: str) -> Image.Image:
    """Load image from URL."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail early on HTTP errors instead of decoding an error page
    return Image.open(BytesIO(response.content)).convert('RGB')

print("Utility functions loaded!")

---

## Part 2: Understanding CLIP Architecture

### The Two Towers

```
        IMAGE                           TEXT
          │                               │
          ▼                               ▼
  ┌───────────────┐               ┌───────────────┐
  │ Vision Trans. │               │ Text Trans.   │
  │  (ViT-L/14)   │               │  (Transformer)│
  └───────────────┘               └───────────────┘
          │                               │
          ▼                               ▼
  ┌───────────────┐               ┌───────────────┐
  │ Image Embed   │               │ Text Embed    │
  │  [768 dims]   │               │  [768 dims]   │
  └───────────────┘               └───────────────┘
          │                               │
          └──────────┬────────────────────┘
                     ▼
              Cosine Similarity
```

Each tower ends in a projection layer that maps into the SAME 768-dimensional space, so image and text embeddings can be compared directly!
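
As a minimal sketch of that final comparison step, the snippet below ranks a few stand-in image embeddings against a stand-in text embedding. The 3-D vectors and file names are invented for illustration; real CLIP embeddings have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, as if produced by the image tower
image_embeddings = {
    "beach_sunset.jpg": [0.9, 0.1, 0.2],
    "city_night.jpg": [0.1, 0.9, 0.3],
    "forest_trail.jpg": [0.2, 0.2, 0.9],
}

# Hypothetical embedding, as if produced by the text tower for "sunset at the beach"
query_embedding = [0.8, 0.2, 0.1]

ranked = sorted(
    image_embeddings,
    key=lambda name: cosine_similarity(query_embedding, image_embeddings[name]),
    reverse=True,
)

for name in ranked:
    score = cosine_similarity(query_embedding, image_embeddings[name])
    print(f"{name}: {score:.3f}")
```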

---

## Part 3: Loading CLIP

In [None]:
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
print("Loading CLIP model...")

clip_model_id = "openai/clip-vit-large-patch14"

clip_processor = CLIPProcessor.from_pretrained(clip_model_id)
clip_model = CLIPModel.from_pretrained(
    clip_model_id,
    torch_dtype=torch.bfloat16  # bfloat16 halves weight memory versus float32
).to("cuda")

clip_model.eval()
print(f"CLIP loaded! Model: {clip_model_id}")

In [None]:
def get_image_embedding(image: Image.Image) -> np.ndarray:
    """
    Get CLIP embedding for an image.
    
    Args:
        image: PIL Image
        
    Returns:
        Normalized embedding as numpy array
    """
    inputs = clip_processor(images=image, return_tensors="pt")
    inputs = {k: v.to(clip_model.device) for k, v in inputs.items()}
    
    with torch.inference_mode():
        image_features = clip_model.get_image_features(**inputs)
    
    # Cast to float32 (NumPy has no bfloat16 dtype), then normalize to a unit vector
    image_features = image_features.float()
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    
    return image_features.cpu().numpy()[0]

def get_text_embedding(text: str) -> np.ndarray:
    """
    Get CLIP embedding for text.
    
    Args:
        text: Text string
        
    Returns:
        Normalized embedding as numpy array
    """
    inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(clip_model.device) for k, v in inputs.items()}
    
    with torch.inference_mode():
        text_features = clip_model.get_text_features(**inputs)
    
    # Cast to float32 (NumPy has no bfloat16 dtype), then normalize to a unit vector
    text_features = text_features.float()
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    
    return text_features.cpu().numpy()[0]

def get_batch_image_embeddings(images: List[Image.Image], batch_size: int = 8) -> np.ndarray:
    """
    Get CLIP embeddings for a batch of images.
    
    Args:
        images: List of PIL Images
        batch_size: Processing batch size
        
    Returns:
        Array of embeddings [N, 768]
    """
    all_embeddings = []
    
    for i in range(0, len(images), batch_size):
        batch = images[i:i+batch_size]
        inputs = clip_processor(images=batch, return_tensors="pt")
        inputs = {k: v.to(clip_model.device) for k, v in inputs.items()}
        
        with torch.inference_mode():
            features = clip_model.get_image_features(**inputs)
            features = features.float()  # NumPy has no bfloat16 dtype
            features = features / features.norm(dim=-1, keepdim=True)
            all_embeddings.append(features.cpu().numpy())
    
    return np.vstack(all_embeddings)

print("Embedding functions ready!")

In [None]:
# Test the embedding functions

# Create a simple test image
test_image = Image.new('RGB', (224, 224), color='red')

# Get embeddings
img_emb = get_image_embedding(test_image)
text_emb = get_text_embedding("a solid red image")

print(f"Image embedding shape: {img_emb.shape}")
print(f"Text embedding shape: {text_emb.shape}")

# Compute similarity
similarity = np.dot(img_emb, text_emb)
print(f"\nSimilarity between red image and 'a solid red image': {similarity:.4f}")

# Try a non-matching text
blue_emb = get_text_embedding("a solid blue image")
similarity2 = np.dot(img_emb, blue_emb)
print(f"Similarity between red image and 'a solid blue image': {similarity2:.4f}")

### What Just Happened?

The first score should come out higher than the second, meaning CLIP judged that:
- The red image is more similar to "a solid red image" (higher score)
- The red image is less similar to "a solid blue image" (lower score)

This is the foundation of multimodal search!
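
Raw similarity scores for a match and a mismatch often differ only slightly, so a common follow-up step is to scale them and apply a softmax to get label probabilities. The sketch below assumes a scale factor of 100 (an approximation of the learned scale in released CLIP checkpoints) and uses made-up similarity values rather than output from the cell above.

```python
import math


def softmax(scores):
    """Convert scores to probabilities, subtracting the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(similarities, logit_scale=100.0):
    """Scale cosine similarities, then normalize across candidate labels."""
    return softmax([logit_scale * s for s in similarities])


labels = ["a solid red image", "a solid blue image"]
similarities = [0.27, 0.22]  # hypothetical cosine similarities for one image

probs = zero_shot_probs(similarities)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```

A gap of only 0.05 in cosine similarity becomes a confident prediction after scaling, which is why relative ordering matters more than absolute score values.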

---

## Part 4: Creating a Sample Image Dataset

Let's create a small dataset of images to search through.

In [None]:
# Create a synthetic dataset with different colored shapes
from PIL import ImageDraw

def create_sample_image(shape: str, color: str, bg_color: str = "white") -> Image.Image:
    """
    Create a simple image with a colored shape.
    
    Args:
        shape: 'circle', 'square', or 'triangle'
        color: Shape color
        bg_color: Background color
        
    Returns:
        PIL Image
    """
    img = Image.new('RGB', (224, 224), color=bg_color)
    draw = ImageDraw.Draw(img)
    
    if shape == 'circle':
        draw.ellipse([50, 50, 174, 174], fill=color)
    elif shape == 'square':
        draw.rectangle([50, 50, 174, 174], fill=color)
    elif shape == 'triangle':
        draw.polygon([(112, 50), (50, 174), (174, 174)], fill=color)
    
    return img

# Create our dataset
shapes = ['circle', 'square', 'triangle']
colors = ['red', 'blue', 'green', 'yellow', 'purple', 'orange']

dataset = []
for shape in shapes:
    for color in colors:
        img = create_sample_image(shape, color)
        metadata = {
            'shape': shape,
            'color': color,
            'description': f"A {color} {shape} on a white background"
        }
        dataset.append({'image': img, 'metadata': metadata})

print(f"Created dataset with {len(dataset)} images")

# Show a few samples
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(9, 6))
for idx, ax in enumerate(axes.flatten()):
    ax.imshow(dataset[idx]['image'])
    ax.set_title(f"{dataset[idx]['metadata']['color']} {dataset[idx]['metadata']['shape']}")  # Short title instead of a truncated description
    ax.axis('off')
plt.tight_layout()
plt.show()

---

## Part 5: Building the Multimodal Index

We'll use ChromaDB to store our embeddings and enable fast similarity search.

In [None]:
import chromadb
from chromadb.config import Settings
import base64
from io import BytesIO

def image_to_base64(image: Image.Image) -> str:
    """Convert PIL Image to base64 string for storage."""
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode('utf-8')

def base64_to_image(b64_string: str) -> Image.Image:
    """Convert base64 string back to PIL Image."""
    return Image.open(BytesIO(base64.b64decode(b64_string)))

# Create ChromaDB client (in-memory for demo)
chroma_client = chromadb.Client(Settings(
    anonymized_telemetry=False
))

# Create a collection for our images
# Delete if exists (for re-running)
try:
    chroma_client.delete_collection("multimodal_demo")
except Exception:
    pass  # Collection doesn't exist yet (the exception type varies across ChromaDB versions)

collection = chroma_client.create_collection(
    name="multimodal_demo",
    metadata={"hnsw:space": "cosine"}  # Use cosine similarity
)

print("ChromaDB collection created!")

In [None]:
# Index all images
print(f"Indexing {len(dataset)} images...")
start_time = time.time()

# Get all image embeddings
images = [item['image'] for item in dataset]
embeddings = get_batch_image_embeddings(images)

# Add to ChromaDB
for idx, item in enumerate(dataset):
    collection.add(
        ids=[f"img_{idx}"],
        embeddings=[embeddings[idx].tolist()],
        metadatas=[{
            'shape': item['metadata']['shape'],
            'color': item['metadata']['color'],
            'description': item['metadata']['description'],
            'image_b64': image_to_base64(item['image'])
        }]
    )

index_time = time.time() - start_time
print(f"Indexed in {index_time:.1f} seconds")
print(f"Collection contains {collection.count()} items")

---

## Part 6: Searching with Natural Language

Now the magic happens - we can search our image database using plain English!

In [None]:
def search_images(query: str, n_results: int = 5) -> List[Dict]:
    """
    Search images using a natural language query.
    
    Args:
        query: Natural language search query
        n_results: Number of results to return
        
    Returns:
        List of search results with images and metadata
    """
    # Get text embedding for query
    query_embedding = get_text_embedding(query)
    
    # Search in ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results,
        include=['metadatas', 'distances']
    )
    
    # Format results
    formatted_results = []
    for idx in range(len(results['ids'][0])):
        metadata = results['metadatas'][0][idx]
        distance = results['distances'][0][idx]
        
        formatted_results.append({
            'id': results['ids'][0][idx],
            'image': base64_to_image(metadata['image_b64']),
            'shape': metadata['shape'],
            'color': metadata['color'],
            'description': metadata['description'],
            'similarity': 1 - distance  # Convert distance to similarity
        })
    
    return formatted_results

def display_search_results(query: str, results: List[Dict]):
    """Display search results nicely."""
    print(f"\nQuery: '{query}'")
    print("=" * 50)
    
    n_cols = min(len(results), 5)
    fig, axes = plt.subplots(1, n_cols, figsize=(3*n_cols, 3))
    if n_cols == 1:
        axes = [axes]
    
    for idx, (ax, result) in enumerate(zip(axes, results)):
        ax.imshow(result['image'])
        ax.set_title(f"Sim: {result['similarity']:.3f}\n{result['color']} {result['shape']}", fontsize=10)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()

print("Search functions ready!")

In [None]:
# Test some searches!

# Search 1: Find red things
results = search_images("something red", n_results=5)
display_search_results("something red", results)

In [None]:
# Search 2: Find circles
results = search_images("circular shape", n_results=5)
display_search_results("circular shape", results)

In [None]:
# Search 3: More abstract query
results = search_images("warm colors", n_results=5)
display_search_results("warm colors", results)

In [None]:
# Search 4: Cool colors
results = search_images("cool colors like the ocean or sky", n_results=5)
display_search_results("cool colors like the ocean or sky", results)

### What Just Happened?

1. **Query Embedding**: We converted the text query to a 768-dimensional vector
2. **Similarity Search**: ChromaDB found images whose embeddings are closest to the query
3. **Semantic Understanding**: CLIP understands concepts like "warm colors" even though we never explicitly labeled them!

This is the power of CLIP - it has learned general visual-linguistic associations.
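
The similarity search step can be pictured as a brute-force scan. The sketch below is an illustration of the idea, not how ChromaDB works internally (it uses an approximate HNSW index); the function names and the toy 2-D index are assumptions. It also shows why the notebook converts results with `1 - distance`.

```python
import heapq
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, so 0 means identical direction."""
    a, b = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def brute_force_query(query_embedding, index, n_results=3):
    """Score every stored vector and keep the n_results closest ones."""
    scored = [
        (cosine_distance(query_embedding, emb), item_id)
        for item_id, emb in index.items()
    ]
    return heapq.nsmallest(n_results, scored)


toy_index = {
    "img_0": [1.0, 0.0],
    "img_1": [0.7, 0.7],
    "img_2": [0.0, 1.0],
    "img_3": [-1.0, 0.0],
}

hits = brute_force_query([0.9, 0.1], toy_index, n_results=2)
for distance, item_id in hits:
    print(f"{item_id}: distance={distance:.3f}, similarity={1 - distance:.3f}")
```

A scan like this costs one comparison per stored item, which is fine for 18 images; an approximate index trades a little accuracy for much faster lookups on large collections.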

---

## Part 7: Image-to-Image Search

We can also search using an image as the query - find similar images!

In [None]:
def search_by_image(query_image: Image.Image, n_results: int = 5) -> List[Dict]:
    """
    Search images using another image as query.
    
    Args:
        query_image: PIL Image to search with
        n_results: Number of results to return
        
    Returns:
        List of similar images with metadata
    """
    # Get image embedding for query
    query_embedding = get_image_embedding(query_image)
    
    # Search in ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results,
        include=['metadatas', 'distances']
    )
    
    # Format results
    formatted_results = []
    for idx in range(len(results['ids'][0])):
        metadata = results['metadatas'][0][idx]
        distance = results['distances'][0][idx]
        
        formatted_results.append({
            'id': results['ids'][0][idx],
            'image': base64_to_image(metadata['image_b64']),
            'shape': metadata['shape'],
            'color': metadata['color'],
            'description': metadata['description'],
            'similarity': 1 - distance
        })
    
    return formatted_results

print("Image-to-image search ready!")

In [None]:
# Create a query image (not in our dataset)
query_image = create_sample_image('circle', 'pink', 'gray')

print("Query image:")
display(query_image)

# Find similar images
results = search_by_image(query_image, n_results=5)
display_search_results("Similar to pink circle (image query)", results)
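
`search_images` and `search_by_image` differ only in how the query embedding is produced; the result formatting is identical. One possible refactor is a shared helper like the sketch below. The helper name is hypothetical, and image decoding is left out so the example runs without PIL; the input dictionary mimics the nested-list shape the notebook reads from `collection.query`.

```python
from typing import Any, Dict, List


def format_query_results(results: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a single-query result dict into one record per hit."""
    ids = results["ids"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]

    formatted = []
    for item_id, metadata, distance in zip(ids, metadatas, distances):
        formatted.append({
            "id": item_id,
            "shape": metadata["shape"],
            "color": metadata["color"],
            "description": metadata["description"],
            "similarity": 1 - distance,  # cosine distance back to similarity
        })
    return formatted


# Hand-written stand-in for a query response with two hits
fake_results = {
    "ids": [["img_0", "img_6"]],
    "metadatas": [[
        {"shape": "circle", "color": "red", "description": "A red circle on a white background"},
        {"shape": "square", "color": "red", "description": "A red square on a white background"},
    ]],
    "distances": [[0.25, 0.5]],
}

rows = format_query_results(fake_results)
for row in rows:
    print(row["id"], row["color"], row["shape"], f"{row['similarity']:.2f}")
```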

---

## Part 8: Complete Multimodal RAG Pipeline

Now let's build a complete RAG pipeline that:
1. Accepts a natural language question
2. Retrieves relevant images
3. Uses a VLM to answer questions about the retrieved images

### ELI5: Multimodal RAG

> **Imagine you're a librarian with a photo library:**
>
> 1. Someone asks: "What blue things do you have that are round?"
> 2. You search your catalog and find matching photos
> 3. You look at the photos and describe them in detail
>
> That's multimodal RAG - retrieve images, then analyze them!
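
Stripped of models, the retrieve-then-answer flow is just two function calls. The sketch below wires them together with stand-ins: a keyword matcher instead of CLIP and a string template instead of a VLM. All names here (`rag_query`, `keyword_retrieve`, `template_answer`, `CATALOG`) are hypothetical and exist only to show the control flow.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str, int], List[Dict]]
Answerer = Callable[[str, List[Dict]], str]


def rag_query(question: str, retrieve: Retriever, answer: Answerer, n_retrieve: int = 3) -> Dict:
    """Retrieve context for the question, then generate an answer from it."""
    retrieved = retrieve(question, n_retrieve)
    return {
        "question": question,
        "retrieved": retrieved,
        "answer": answer(question, retrieved),
    }


CATALOG = [
    {"description": "A blue circle on a white background"},
    {"description": "A red square on a white background"},
    {"description": "A green triangle on a white background"},
]


def keyword_retrieve(question: str, n: int) -> List[Dict]:
    """Stand-in retriever: rank items by word overlap with the question."""
    words = set(question.lower().replace("?", "").split())
    ranked = sorted(
        CATALOG,
        key=lambda item: len(words & set(item["description"].lower().split())),
        reverse=True,
    )
    return ranked[:n]


def template_answer(question: str, items: List[Dict]) -> str:
    """Stand-in generator: list what was retrieved."""
    return "Found: " + "; ".join(item["description"] for item in items)


result = rag_query("Which blue circle do you have?", keyword_retrieve, template_answer, n_retrieve=1)
print(result["answer"])
```

Passing the retriever and answerer in as callables keeps the pipeline testable without loading either model.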

In [None]:
class MultimodalRAG:
    """
    Complete Multimodal Retrieval-Augmented Generation system.
    
    Combines CLIP for retrieval with a VLM for question answering.
    """
    
    # === Initialization ===
    def __init__(self, clip_model, clip_processor, collection):
        """
        Initialize the Multimodal RAG system.
        
        Args:
            clip_model: Loaded CLIP model
            clip_processor: CLIP processor
            collection: ChromaDB collection
        """
        self.clip_model = clip_model
        self.clip_processor = clip_processor
        self.collection = collection
        self.vlm_model = None
        self.vlm_processor = None
        
    # === Embedding Methods ===
    def get_text_embedding(self, text: str) -> np.ndarray:
        """Get CLIP embedding for text."""
        inputs = self.clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(self.clip_model.device) for k, v in inputs.items()}
        
        with torch.inference_mode():
            features = self.clip_model.get_text_features(**inputs)
            features = features.float()  # NumPy has no bfloat16 dtype
            features = features / features.norm(dim=-1, keepdim=True)
        
        return features.cpu().numpy()[0]
    
    def retrieve(self, query: str, n_results: int = 3) -> List[Dict]:
        """
        Retrieve relevant images for a query.
        
        Args:
            query: Natural language query
            n_results: Number of images to retrieve
            
        Returns:
            List of retrieved images with metadata
        """
        query_embedding = self.get_text_embedding(query)
        
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            include=['metadatas', 'distances']
        )
        
        retrieved = []
        for idx in range(len(results['ids'][0])):
            metadata = results['metadatas'][0][idx]
            distance = results['distances'][0][idx]
            
            retrieved.append({
                'id': results['ids'][0][idx],
                'image': base64_to_image(metadata['image_b64']),
                'metadata': {
                    'shape': metadata['shape'],
                    'color': metadata['color'],
                    'description': metadata['description']
                },
                'similarity': 1 - distance
            })
        
        return retrieved
    
    # === VLM Integration ===
    def load_vlm(self):
        """Load the VLM for question answering."""
        if self.vlm_model is not None:
            return
            
        from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
        
        print("Loading Qwen2-VL for Q&A...")
        
        self.vlm_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
        self.vlm_model = Qwen2VLForConditionalGeneration.from_pretrained(
            "Qwen/Qwen2-VL-7B-Instruct",
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        print("VLM loaded!")
    
    def answer_with_images(self, question: str, images: List[Image.Image]) -> str:
        """
        Answer a question using retrieved images.
        
        Args:
            question: User's question
            images: Retrieved images for context
            
        Returns:
            Answer from the VLM
        """
        if self.vlm_model is None:
            self.load_vlm()
        
        # Create a combined image (side by side)
        total_width = sum(img.width for img in images)
        max_height = max(img.height for img in images)
        combined = Image.new('RGB', (total_width, max_height), 'white')
        
        x_offset = 0
        for img in images:
            combined.paste(img, (x_offset, 0))
            x_offset += img.width
        
        # Prepare prompt
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": combined},
                {"type": "text", "text": f"These are search results from an image database. {question}"}
            ]
        }]
        
        text = self.vlm_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = self.vlm_processor(text=[text], images=[combined], return_tensors="pt", padding=True)
        inputs = inputs.to(self.vlm_model.device)
        
        with torch.inference_mode():
            output_ids = self.vlm_model.generate(
                **inputs,
                max_new_tokens=200,
                do_sample=True,
                temperature=0.7
            )
        
        generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
        response = self.vlm_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
        
        return response
    
    # === Main Query Interface ===
    def query(self, question: str, n_retrieve: int = 3) -> Dict:
        """
        Complete RAG query: retrieve images and answer question.
        
        Args:
            question: User's question
            n_retrieve: Number of images to retrieve
            
        Returns:
            Dictionary with retrieved images and answer
        """
        print(f"\nQuestion: {question}")
        print("-" * 50)
        
        # Retrieve relevant images
        print(f"Retrieving {n_retrieve} relevant images...")
        retrieved = self.retrieve(question, n_retrieve)
        
        # Display retrieved images
        print(f"Retrieved images:")
        for r in retrieved:
            print(f"  - {r['metadata']['description']} (sim: {r['similarity']:.3f})")
        
        # Get answer from VLM
        images = [r['image'] for r in retrieved]
        answer = self.answer_with_images(question, images)
        
        return {
            'question': question,
            'retrieved': retrieved,
            'answer': answer
        }

print("MultimodalRAG class defined!")

In [None]:
# Initialize the Multimodal RAG system
rag = MultimodalRAG(clip_model, clip_processor, collection)

# Let's ask a question!
result = rag.query("What colors are the circles in the database?")

print(f"\nAnswer: {result['answer']}")

In [None]:
# Try another query
result = rag.query("Which shapes have warm colors like red, orange, or yellow?")

print(f"\nAnswer: {result['answer']}")

# Display the retrieved images
fig, axes = plt.subplots(1, len(result['retrieved']), figsize=(9, 3))
for ax, r in zip(axes, result['retrieved']):
    ax.imshow(r['image'])
    ax.set_title(f"{r['metadata']['color']} {r['metadata']['shape']}")
    ax.axis('off')
plt.tight_layout()
plt.show()

---

## Try It Yourself: Build Your Own Image Database

Create a more interesting image database and test the search!

Ideas:
1. Download images from the web and index them
2. Create images with text labels
3. Mix different types of images (photos, icons, diagrams)

<details>
<summary>Hint: Loading Images from URLs</summary>

```python
# Unsplash provides royalty-free images
urls = [
    "https://images.unsplash.com/photo-1514888286974-6c03e2ca1dba?w=400",  # cat
    "https://images.unsplash.com/photo-1587300003388-59208cc962cb?w=400",  # dog
    "https://images.unsplash.com/photo-1501854140801-50d01698950b?w=400",  # nature
]

for url in urls:
    image = load_image_from_url(url)
    embedding = get_image_embedding(image)
    # Add to collection...
```
</details>

In [None]:
# YOUR CODE HERE
# Build your own image database and test the search!



---

## Common Mistakes

### Mistake 1: Not Normalizing Embeddings

```python
# Wrong - a dot product of unnormalized embeddings is not cosine similarity
embedding = clip_model.get_image_features(**inputs)

# Right - always normalize for cosine similarity
embedding = clip_model.get_image_features(**inputs)
embedding = embedding / embedding.norm(dim=-1, keepdim=True)
```
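
A toy example, using made-up 2-D vectors, shows how skipping normalization can flip a ranking: the raw dot product rewards vector length, while cosine similarity looks only at direction.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
same_direction_small = [0.5, 0.0]   # points exactly where the query points, but short
other_direction_large = [3.0, 4.0]  # points elsewhere, but long

print("dot:   ", dot(query, same_direction_small), dot(query, other_direction_large))
print("cosine:", cosine(query, same_direction_small), cosine(query, other_direction_large))
```

The dot product prefers the long, misaligned vector; cosine similarity prefers the aligned one.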

### Mistake 2: Mixing Embedding Types

```python
# Wrong - comparing CLIP embeddings with sentence-transformer embeddings
clip_emb = get_image_embedding(image)
text_emb = sentence_transformer.encode(text)  # Different space!
similarity = cosine_similarity(clip_emb, text_emb)  # Meaningless!

# Right - use CLIP for both
clip_img_emb = clip_model.get_image_features(...)
clip_text_emb = clip_model.get_text_features(...)
similarity = cosine_similarity(clip_img_emb, clip_text_emb)  # Correct!
```

### Mistake 3: Forgetting Image Preprocessing

```python
# Wrong - passing raw image without processor
features = clip_model.get_image_features(image_tensor)  # May fail or give bad results

# Right - always use the processor
inputs = clip_processor(images=image, return_tensors="pt")
features = clip_model.get_image_features(**inputs)  # Correct!
```

### Mistake 4: Storing Images Incorrectly

```python
# Wrong - metadata values must be simple types (str, int, float, bool), not PIL Images
collection.add(metadatas=[{'image': image}])  # Won't work!

# Right - convert to base64 or store path
collection.add(metadatas=[{'image_b64': image_to_base64(image)}])  # Works!
```
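
The sketch below illustrates both halves of this mistake, under the assumption that the store accepts only scalar metadata values (str, int, float, bool). The checker function is hypothetical, and the byte string merely stands in for encoded PNG data. It also shows the cost of the base64 approach: the encoded string is about a third larger than the raw bytes, so storing a file path can be the better choice for big collections.

```python
import base64

ALLOWED_METADATA_TYPES = (str, int, float, bool)  # assumption: scalar-only metadata


def invalid_metadata_keys(metadata: dict) -> list:
    """Return the keys whose values are not simple scalar types."""
    return [
        key for key, value in metadata.items()
        if not isinstance(value, ALLOWED_METADATA_TYPES)
    ]


raw = bytes(range(256)) * 12  # 3072 bytes standing in for PNG data
encoded = base64.b64encode(raw).decode("utf-8")

print("round trip ok:", base64.b64decode(encoded) == raw)
print(f"raw: {len(raw)} bytes, base64: {len(encoded)} characters")

good = {"color": "red", "image_b64": encoded}
bad = {"color": "red", "image": object()}  # object() stands in for a PIL Image
print("invalid keys (good):", invalid_metadata_keys(good))
print("invalid keys (bad): ", invalid_metadata_keys(bad))
```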

---

## Checkpoint

You've learned:
- How CLIP creates joint image-text embeddings
- How to build a multimodal index with ChromaDB
- How to search images using natural language
- How to search for similar images
- How to build a complete multimodal RAG pipeline

### Key Takeaways

1. **CLIP enables cross-modal search**: Same embedding space for images and text
2. **Semantic understanding**: CLIP understands concepts like "warm colors"
3. **RAG extends to images**: Retrieve relevant images, then reason with VLM
4. **Embeddings must be normalized**: A plain dot product equals cosine similarity only for unit-length vectors

---

## Challenge (Optional)

### Build a Hybrid Search System

Create a system that can:
1. Accept queries with both text AND an example image
2. Weight the text and image embeddings (e.g., 70% text, 30% image)
3. Find results that match both criteria

Example: "Find red things that look similar to this circle" + [image of circle]
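
As one possible starting point for step 2, the weighting can be done on the query side: normalize both embeddings, blend them, and normalize again so the result is a valid unit-length query. This is a sketch of a single design option, not the required solution; `combine_embeddings` is a hypothetical helper, and the 2-D vectors stand in for real CLIP embeddings.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def combine_embeddings(text_emb, image_emb, text_weight=0.7):
    """Blend two embeddings from the same space into one unit-length query."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    text_unit = normalize(text_emb)
    image_unit = normalize(image_emb)
    mixed = [
        text_weight * t + (1.0 - text_weight) * i
        for t, i in zip(text_unit, image_unit)
    ]
    return normalize(mixed)


text_emb = [1.0, 0.0]   # stand-in for the text query embedding
image_emb = [0.0, 2.0]  # stand-in for the image query embedding (not unit length)

blended = combine_embeddings(text_emb, image_emb, text_weight=0.7)
print("blended query:", [round(x, 3) for x in blended])
```

The blended vector could then be passed to the same similarity search used for text-only queries. An alternative worth comparing is to run two separate searches and merge the scores.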

In [None]:
# YOUR CHALLENGE CODE HERE

def hybrid_search(
    text_query: str,
    image_query: Image.Image,
    text_weight: float = 0.7,
    n_results: int = 5
) -> List[Dict]:
    """
    Search using both text and image queries.
    
    Args:
        text_query: Natural language query
        image_query: Image to match
        text_weight: Weight for text (image_weight = 1 - text_weight)
        n_results: Number of results
        
    Returns:
        List of matching images
    """
    # TODO: Implement hybrid search
    pass

---

## Further Reading

- [CLIP Paper](https://arxiv.org/abs/2103.00020)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [OpenAI CLIP Blog Post](https://openai.com/blog/clip/)
- [Multimodal RAG Patterns](https://www.llamaindex.ai/blog/multimodal-rag)

---

## Cleanup

In [None]:
# Clean up
if 'rag' in dir() and rag.vlm_model is not None:
    del rag.vlm_model
    del rag.vlm_processor

del clip_model
del clip_processor

clear_gpu_memory()
print(f"Final memory state: {get_memory_usage()}")
print("\nNotebook complete! Ready for the next task.")