# Multi-Image Generation

This example shows how to generate text from multiple images using the supported models: `Qwen2-VL`, `Pixtral`, and `llava-interleaved`.

Multi-image generation allows you to pass a list of images to the model and generate text conditioned on all the images.


In [1]:
from mlx_vlm import load, apply_chat_template, generate
from mlx_vlm.utils import load_image, process_image

In [2]:
images = ["images/cats.jpg", "images/desktop_setup.png"]

messages = [
    {"role": "user", "content": "Describe what you see in the images."}
]

## Qwen2-VL

In [None]:
# Load model and processor
qwen_vl_model, qwen_vl_processor = load("mlx-community/Qwen2-VL-7B-Instruct-4bit")
qwen_vl_config = qwen_vl_model.config

In [4]:
prompt = apply_chat_template(qwen_vl_processor, qwen_vl_config, messages, num_images=len(images))
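
Before generating, it can help to confirm that the prompt holds one image placeholder per image. The sketch below is a plain-string check, not part of `mlx_vlm`: `count_image_placeholders` is a hypothetical helper, and it assumes the prompt is an ordinary string like the one printed in the output further down. The `<|image_pad|>` token is taken from that printed Qwen2-VL prompt; the Pixtral and LLaVA prompts in this notebook show `[IMG]` and `<image>` instead.

```python
def count_image_placeholders(prompt: str, placeholder: str) -> int:
    """Return how many times the image placeholder token occurs in the prompt."""
    return prompt.count(placeholder)


# Prompt text copied from the Qwen2-VL output printed in this notebook.
example_prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Describe what you see in the images."
    "<|vision_start|><|image_pad|><|vision_end|>"
    "<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n"
    "<|im_start|>assistant\n"
)
example_images = ["images/cats.jpg", "images/desktop_setup.png"]

# Hypothetical sanity check: one placeholder per image in the list.
n_placeholders = count_image_placeholders(example_prompt, "<|image_pad|>")
assert n_placeholders == len(example_images), "prompt and image list disagree"
print(n_placeholders)
```

A mismatch here usually means `num_images` passed to `apply_chat_template` does not match the length of the image list.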

In [5]:
qwen_vl_output = generate(
    qwen_vl_model,
    qwen_vl_processor,
    images,
    prompt,
    max_tokens=1000,
    temperature=0.7,
    verbose=True
)

Image: ['images/cats.jpg', 'images/desktop_setup.png'] 

Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe what you see in the images.<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

The image shows a cozy home office setup with a pink blanket covering the desk and chair. There are two cats lounging on the blanket, one on the left side and the other on the right side of the desk. The desk has a computer monitor, keyboard, and mouse. There are also speakers on either side of the monitor, a remote control, and a small plant on the left side of the desk. The wall behind the desk has a framed text that says, "Don't grow up, it's a trap." The overall scene is playful and relaxed, with the cats adding a touch of whimsy to the workspace.
Prompt: 10.047 tokens-per-sec
Generation: 28.392 tokens-per-sec


## Pixtral

In [None]:
# Load model and processor
pixtral_model, pixtral_processor = load("mlx-community/pixtral-12b-4bit")
pixtral_config = pixtral_model.config

In [4]:
prompt = apply_chat_template(pixtral_processor, pixtral_config, messages, num_images=len(images))

In [14]:
# Pixtral requires the images to be resized for multi-image generation.
# The sizes printed below (560x420, 560x347) suggest (560, 560) acts as a bounding box, not an exact shape.
resized_images = [process_image(load_image(image), (560, 560), None) for image in images]
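
The sizes printed in the output below (560x420 and 560x347) are consistent with an aspect-preserving resize that fits each image inside the requested 560x560 box. The following is a minimal sketch of that arithmetic only. It is inferred from the printed sizes and is not confirmed to be how `process_image` works internally. `fit_within` is a hypothetical helper, and the 640x480 input is an illustrative size, not the actual size of `images/cats.jpg`.

```python
def fit_within(size, max_size):
    """Scale (width, height) to fit inside max_size, keeping the aspect ratio."""
    width, height = size
    max_width, max_height = max_size
    # The smaller of the two ratios guarantees that both sides fit in the box.
    ratio = min(max_width / width, max_height / height)
    return int(width * ratio), int(height * ratio)


# Hypothetical 4:3 input; the longer side is scaled to 560.
print(fit_within((640, 480), (560, 560)))  # (560, 420)
```

Under this assumption, two images with different aspect ratios end up with the same longest side but different shapes, which matches the two sizes shown in the generation output.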

In [16]:
pixtral_output = generate(
    pixtral_model,
    pixtral_processor,
    resized_images,
    prompt,
    max_tokens=1000,
    temperature=0.7,
    verbose=True
)

Image: [<PIL.Image.Image image mode=RGB size=560x420 at 0x3ACDD5FF0>, <PIL.Image.Image image mode=RGB size=560x347 at 0x39D697820>] 

Prompt: <s>[INST]Describe what you see in the images.[IMG][IMG][/INST]
The first image shows two cats lying on a pink couch. One cat is on the left side, and the other is on the right side. Both cats appear to be relaxed and comfortable. There are two remote controls on the couch, one near each cat. The background of the image is a plain, light-colored wall.

The second image depicts a home office setup. The main elements include:

1. A wooden desk with a computer monitor in the center.
2. Two black speakers on either side of the monitor.
3. A black office chair in front of the desk.
4. A wooden shelf on the left side of the desk, holding various items including records and a potted plant.
5. A framed poster on the wall behind the desk with the text "Don't Grow Up, It's a Trap."
6. A drum set on the right side of the image, partially visible.

The overal

## Llava-Interleaved

In [None]:
# Load model and processor
llava_model, llava_processor = load("mlx-community/llava-interleave-qwen-0.5b-bf16")
llava_config = llava_model.config

In [4]:
prompt = apply_chat_template(llava_processor, llava_config, messages, num_images=len(images))

In [5]:
llava_output = generate(
    llava_model,
    llava_processor,
    images,
    prompt,
    max_tokens=1000,
    temperature=0.7,
    verbose=True
)

Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.


Image: ['images/cats.jpg', 'images/desktop_setup.png'] 

Prompt: <|im_start|>user
<image><image>
Describe what you see in the images.<|im_end|>
<|im_start|>assistant

The image captures a cozy scene in a room. Two cats, one gray and the other brown and white, are lying on a pink couch. The gray cat is resting its head on the back of the couch, while the brown and white cat is lying on its side. They are both facing the camera, their relaxed postures suggesting a sense of comfort and tranquility.

The room itself is a study in simplicity. A whiteboard hangs on the wall, a black computer monitor sits on a wooden desk, and a black speaker stands tall on a shelf. The walls are painted a light pink, providing a warm and inviting backdrop to the scene.

On the desk, there's a black keyboard and a white mouse, ready for use. A plant sits on a small table next to the desk, adding a touch of nature to the room. A whiteboard eraser is also present on the desk, perhaps used for cleaning or markin