# Approach 3: Unified Multimodal Model

This notebook demonstrates a **Unified Multimodal Model** approach. The previous approaches either chain separate vision and language models (e.g., a captioner followed by an LLM) or bridge them with adapters. A unified multimodal model such as Qwen2-VL is instead trained to process visual and textual inputs together, so the image and the instruction go into the same model in one call.

We will:
1.  Use **Qwen2-VL-2B-Instruct**, a small multimodal model.
2.  Feed it an image and a text instruction.
3.  Get a text description/prompt from the model.
4.  Use that prompt to generate a new image with **Stable Diffusion 1.5**.
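
The four steps above can be sketched as one small function with the two models injected as callables. The names `image_to_image`, `describe`, and `generate` are hypothetical and exist only for illustration. In this notebook those roles are played by Qwen2-VL and Stable Diffusion 1.5, and the stubs below let the sketch run without downloading any model.

```python
from typing import Any, Callable, Tuple


def image_to_image(
    image: Any,
    instruction: str,
    describe: Callable[[Any, str], str],
    generate: Callable[[str], Any],
) -> Tuple[str, Any]:
    """Describe an image with a multimodal model, then generate a new image.

    All names are hypothetical: `describe` stands in for the Qwen2-VL call
    and `generate` for the Stable Diffusion call.
    """
    # Steps 2-3: image + text instruction -> text prompt
    prompt = describe(image, instruction).strip()
    if not prompt:
        raise ValueError("the multimodal model returned an empty description")
    # Step 4: text prompt -> new image
    return prompt, generate(prompt)


# Stub callables so the sketch is runnable on its own.
prompt, new_image = image_to_image(
    image="<input image>",
    instruction="Describe this image.",
    describe=lambda img, text: " a girl and a dog on a beach ",
    generate=lambda p: f"<image generated from: {p}>",
)
print(prompt)
```

Keeping the two models behind plain callables also makes it easy to swap either one without touching the rest of the flow.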

In [None]:
# Install necessary libraries
!pip install -q git+https://github.com/huggingface/transformers accelerate qwen-vl-utils diffusers

import torch
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from diffusers import StableDiffusionPipeline
import gc

# Setup device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

In [None]:
# Load Qwen2-VL-2B-Instruct, the smallest public Qwen2-VL checkpoint.
# float16 halves memory on GPU; on CPU, where half-precision support is limited, fall back to float32.
# device_map="auto" lets accelerate decide where to place the weights.
model_name = "Qwen/Qwen2-VL-2B-Instruct"

print(f"Loading Qwen model: {model_name}...")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
print("Qwen model loaded.")

In [None]:
# Load an example image
image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg" # Dog and girl
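response = requests.get(image_url, stream=True, timeout=30)
response.raise_for_status()  # fail early on a bad download
image = Image.open(response.raw).convert("RGB")
image = image.resize((512, 512))  # Fixed size to match the 512x512 output below; does not preserve aspect ratio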
display(image)

# Prepare the input for Qwen2-VL
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,  # pass the PIL image loaded above so the model sees what is displayed
            },
            {"type": "text", "text": "Describe this image in a concise sentence suitable for an image generation prompt."},
        ],
    }
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow wherever device_map="auto" placed the model

In [None]:
# Generate text response
print("Generating description...")
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"Generated Prompt: {output_text}")
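
Stable Diffusion 1.5 conditions on a CLIP text encoder that reads at most 77 tokens, so a long description is cut off. A minimal sketch of trimming the model output before passing it on is shown below. The helper name `tidy_prompt` and the `max_words` cap are hypothetical, and a word count is only a rough proxy for the tokenizer's token count.

```python
import re


def tidy_prompt(text: str, max_words: int = 50) -> str:
    """Collapse whitespace, keep the first sentence, and cap the word count."""
    text = re.sub(r"\s+", " ", text).strip()
    # Split once at the first whitespace that follows sentence-ending punctuation.
    first_sentence = re.split(r"(?<=[.!?])\s", text, maxsplit=1)[0]
    words = first_sentence.split()
    return " ".join(words[:max_words]).rstrip(".")


print(tidy_prompt("A girl sits on the beach with her dog.  The sun is setting."))
```

In this notebook the helper would be applied as `tidy_prompt(output_text)` before the Stable Diffusion call.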

In [None]:
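# Cleanup to free VRAM for Stable Diffusion.
# Drop every reference to the model and its tensors, then release PyTorch's cached GPU blocks.
del model, processor, inputs, generated_ids, generated_ids_trimmed
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("Memory cleared.")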
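
`del` removes a name, not the object; memory is released only once no reference remains. The sketch below illustrates this with a plain Python object standing in for the model. `FakeModel` and the variable names are hypothetical, and the behavior shown assumes CPython's reference counting.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large model object."""


fake_model = FakeModel()
alias = fake_model               # a second reference held by another variable
probe = weakref.ref(fake_model)  # observes the object without keeping it alive

del fake_model
gc.collect()
still_alive_after_first_del = probe() is not None  # `alias` still refers to it

del alias
gc.collect()
freed_after_second_del = probe() is None  # no references remain

print(still_alive_after_first_del, freed_after_second_del)
```

This is why the cleanup cell deletes every variable that points at the model or its tensors. In a notebook, displayed cell outputs can also be retained by IPython's output history.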

In [None]:
# Load Stable Diffusion 1.5
print("Loading Stable Diffusion...")
sd_pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,  # half precision only on GPU
    use_safetensors=True,
)
sd_pipeline = sd_pipeline.to(device)

# Generate image from the Qwen-generated prompt
print(f"Generating image for prompt: '{output_text}'")
generated_image = sd_pipeline(output_text, height=512, width=512).images[0]

# Display the result
display(generated_image)

### Summary: Unified Multimodal Model

In this approach, we used **Qwen2-VL**, a model trained to understand both images and text natively.

*   **Approach 1 (Pipeline)** would require a separate Image Captioning model (like BLIP) to convert the image to text, and then an LLM to process it.
*   **Approach 2 (Adapters)** would take a frozen LLM (like LLaMA) and a frozen Vision Encoder (like CLIP) and train a small adapter network to connect them.
*   **Approach 3 (Unified)** uses a single model in which the vision encoder and the language model are trained jointly, so the language model attends to visual tokens directly alongside text tokens.

The result is a simpler architecture: the model reasons over the image itself rather than over a lossy text caption, and it does not depend on a small adapter to align two separately trained models.
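
To make "visual tokens" concrete, the sketch below estimates how many tokens one image contributes to the model's input sequence. It assumes a 14-pixel patch size and 2×2 patch merging, taken here to match Qwen2-VL's published configuration, and it assumes each side is rounded to a multiple of 28 pixels. It ignores the processor's minimum and maximum pixel limits, so treat the result as an approximation. The function name `approx_visual_tokens` is hypothetical.

```python
def approx_visual_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Rough count of visual tokens for one image (assumed sizes, no pixel limits)."""
    cell = patch * merge  # pixels per side covered by one merged visual token
    rows = max(1, round(height / cell))
    cols = max(1, round(width / cell))
    return rows * cols


print(approx_visual_tokens(512, 512))
```

Under these assumptions, larger images cost proportionally more tokens, which is one reason to bound the input resolution.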