# Interleaving Text and Images in Prompts

This notebook demonstrates how to interleave text and images in prompts using MLX-VLM. This technique is useful when you want to:
- Compare multiple images by name or label
- Add context or metadata to each image
- Structure complex multi-image queries

The key is a user message whose `content` list alternates `text` and `image` entries.
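
As a minimal sketch of that structure, the helper below builds an alternating content list from a list of labels. The name `build_interleaved_content` is hypothetical and introduced only for illustration; it is not part of MLX-VLM.

```python
def build_interleaved_content(labels, question):
    """Build a content list: one text label before each image placeholder,
    followed by the final question.

    Hypothetical helper for illustration; not part of MLX-VLM.
    """
    content = []
    for label in labels:
        content.append({"type": "text", "text": label})
        # Placeholder only; the image itself is passed separately.
        content.append({"type": "image"})
    content.append({"type": "text", "text": question})
    return content


sketch_content = build_interleaved_content(
    ["Here is image1.jpg:", "Here is image2.png:"],
    "What are the main differences between image1.jpg and image2.png?",
)
print([entry["type"] for entry in sketch_content])
```

The result can be used as the `content` value of a `"role": "user"` message.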

In [None]:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import get_chat_template

## Load Model and Processor

We'll use Qwen3-VL, which supports multi-image inputs with interleaved text.

In [None]:
# Load model and processor
model, processor = load("mlx-community/Qwen3-VL-2B-Instruct-4bit")
config = model.config

## Example 1: Basic Interleaved Text and Images

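In this example, we'll compare two images by adding a text label before each one. The labels (`image1.jpg`, `image2.png`) are just names used in the prompt; they do not have to match the file paths in `images`. Labeling helps the model tell the images apart when it analyzes them.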

In [None]:
# Define images to compare
images = ["images/cats.jpg", "images/desktop_setup.png"]

# Create message with interleaved text and image placeholders
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is image1.jpg:"},
            {"type": "image"},
            {"type": "text", "text": "Here is image2.png:"},
            {"type": "image"},
            {"type": "text", "text": "What are the main differences between image1.jpg and image2.png?"}
        ]
    }
]

print("Message structure:")
print(messages)

In [None]:
# Apply chat template
prompt = get_chat_template(
    processor,
    messages,
    add_generation_prompt=True
)

print("\nFormatted prompt:")
print(prompt)

In [None]:
# Generate response
output = generate(
    model,
    processor,
    prompt,
    images,
    max_tokens=500,
    temperature=0.7,
    verbose=True
)

## Example 2: Advanced Example with Metadata

This example attaches metadata to each image as text placed before it in the prompt, such as:
- Filenames
- Timestamps
- GPS coordinates
- Other contextual information

This is particularly useful for tasks like:
- Clustering photos by location or time
- Analyzing photo collections
- Creating photo albums or timelines
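
One way to keep such captions consistent across many photos is to generate them from structured records. The sketch below assumes each photo is described by a small dictionary; the field names (`filename`, `taken_at`, `lat`, `lon`) and the helper `metadata_caption` are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def metadata_caption(photo):
    """Render one photo record as a text label to place before its image.

    Hypothetical helper; the record layout is an assumption for this sketch.
    """
    taken = photo["taken_at"].strftime("%Y-%m-%d %H:%M")
    return (
        f"This is {photo['filename']}, taken at {taken} "
        f"at GPS coordinates ({photo['lat']:.4f}, {photo['lon']:.4f}):"
    )


photo = {
    "filename": "cats.jpg",
    "taken_at": datetime(2024, 4, 4, 8, 0),
    "lat": 37.7749,
    "lon": -122.4194,
}
print(metadata_caption(photo))
```

Writing the timestamp in year-month-day form is a design choice here: it avoids the day/month ambiguity of dates written like `4/5/2024`.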

In [None]:
# Create message with detailed metadata for each image
messages_with_metadata = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "This is cats.jpg, it was taken at 8am on 4/4/2024 at GPS coordinates (37.7749, -122.4194):"},
            {"type": "image"},
            {"type": "text", "text": "This is desktop_setup.png, it was taken at 2pm on 4/4/2024 at GPS coordinates (37.7750, -122.4195):"},
            {"type": "image"},
            {"type": "text", "text": "Based on the timestamps and locations, please analyze these images and determine if they might be from the same location or event. What common themes or differences do you notice?"}
        ]
    }
]

print("Message structure with metadata:")
print(messages_with_metadata)

In [None]:
# Apply chat template
prompt_with_metadata = get_chat_template(
    processor,
    messages_with_metadata,
    add_generation_prompt=True
)

print("\nFormatted prompt with metadata:")
print(prompt_with_metadata)

In [None]:
# Generate response
output_with_metadata = generate(
    model,
    processor,
    prompt_with_metadata,
    images,
    max_tokens=500,
    temperature=0.7,
    verbose=True
)

## Example 3: Multiple Images with Individual Descriptions

You can also use this pattern to give each image its own description or question before asking one overall question about all of them.
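
Because the image paths and their text labels live in separate lists, they can drift out of sync as the number of images grows. A minimal sketch that builds both from one list of pairs follows. `build_prompt_inputs` is a hypothetical helper, not an MLX-VLM function, and it assumes image placeholders are matched to the image list in order.

```python
def build_prompt_inputs(described_images, question):
    """Return (image_paths, messages) built from (path, description) pairs.

    Hypothetical helper: keeps image paths and their labels in the same order.
    """
    image_paths = []
    content = []
    for path, description in described_images:
        image_paths.append(path)
        content.append({"type": "text", "text": description})
        content.append({"type": "image"})
    content.append({"type": "text", "text": question})
    return image_paths, [{"role": "user", "content": content}]


paths, sketch_messages = build_prompt_inputs(
    [
        ("images/cats.jpg", "First image - a cozy indoor scene:"),
        ("images/desktop_setup.png", "Second image - a workspace setup:"),
    ],
    "What is the overall mood conveyed by these two environments?",
)
print(paths)
print(len(sketch_messages[0]["content"]))
```

The two returned values could then be passed to `get_chat_template` and `generate` in the same way as a hand-written message list and `images`.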

In [None]:
# Create message with individual context for each image
messages_individual = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "First image - a cozy indoor scene:"},
            {"type": "image"},
            {"type": "text", "text": "Second image - a workspace setup:"},
            {"type": "image"},
            {"type": "text", "text": "What is the overall mood or atmosphere conveyed by these two environments? How might they complement each other in a home?"}
        ]
    }
]

# Apply chat template
prompt_individual = get_chat_template(
    processor,
    messages_individual,
    add_generation_prompt=True
)

# Generate response
output_individual = generate(
    model,
    processor,
    prompt_individual,
    images,
    max_tokens=500,
    temperature=0.7,
    verbose=True
)

## Key Takeaways

1. **Message Structure**: Use a list of dictionaries in the `content` field with alternating `{"type": "text", "text": "..."}` and `{"type": "image"}` entries.

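2. **Order Matters**: The order of text and images in the content list determines how they are presented to the model. Keep the `images` list passed to `generate` in the same order as the `{"type": "image"}` placeholders so each label sits next to the image it describes.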

3. **Flexible Formatting**: You can add as many text segments as needed between images to provide context, labels, or metadata.

4. **Use Cases**:
   - Comparing images with labels ("image1" vs "image2")
   - Adding metadata (timestamps, locations, filenames)
   - Providing individual context for each image
   - Structuring complex multi-image queries

5. **Model Support**: This pattern works with models that support multi-image inputs, such as Qwen3-VL, Pixtral, and others listed in the model configuration.
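
For the ordering point above, a quick sanity check before calling `generate` is to confirm that the number of `{"type": "image"}` placeholders equals the number of image paths. The functions below are a hypothetical sketch, not part of MLX-VLM.

```python
def count_image_placeholders(messages):
    """Count {"type": "image"} entries across all messages.

    Hypothetical helper; string-only content is treated as having no images.
    """
    total = 0
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, str):
            continue
        total += sum(1 for entry in content if entry.get("type") == "image")
    return total


def check_images_match(messages, image_paths):
    """Raise ValueError if placeholders and image paths differ in number."""
    expected = count_image_placeholders(messages)
    if expected != len(image_paths):
        raise ValueError(
            f"{expected} image placeholder(s) but {len(image_paths)} image path(s)"
        )


demo_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is image1.jpg:"},
            {"type": "image"},
            {"type": "text", "text": "Here is image2.png:"},
            {"type": "image"},
        ],
    }
]
# Passes silently: two placeholders, two paths.
check_images_match(demo_messages, ["images/cats.jpg", "images/desktop_setup.png"])
```

This check only compares counts; it cannot tell whether a label is paired with the wrong image.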