### Libraries and device

In [1]:
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch
from PIL import Image
import os
from qwen_vl_utils import process_vision_info

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


### Model initialization

In [2]:
qwen_model_name = "Qwen/Qwen2.5-VL-3B-Instruct"

qwen_processor = AutoProcessor.from_pretrained(qwen_model_name, trust_remote_code=True)
print("✅ Qwen processor loaded.")

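qwen_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    qwen_model_name,
    # Half precision on GPU to save memory; float32 on the CPU fallback,
    # where float16 kernels can be slow or unavailable.
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
qwen_model.to(device)
print("✅ Qwen model loaded.")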

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.


✅ Qwen processor loaded.


✅ Qwen model loaded.


### Inference

In [None]:
# Image setup: resize so the shorter side is 384 px, then center-crop to a square
image_path = os.path.join("..", "..", "test_images", "5.jpg")
image = Image.open(image_path).convert("RGB")
W, H = image.size
scale = 384 / min(W, H)
# round() rather than int(): truncation can turn 383.999... into 383
new_W = round(W * scale)
new_H = round(H * scale)
image = image.resize((new_W, new_H), resample=Image.BICUBIC)
W, H = image.size
crop_size = min(W, H)
# Center the square crop along the longer side
left = (W - crop_size) // 2
top = (H - crop_size) // 2
image = image.crop((left, top, left + crop_size, top + crop_size))


# Prepare inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = qwen_processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
image_inputs, video_inputs = process_vision_info(messages)

inputs = qwen_processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    return_tensors="pt",
    padding=True
).to(device)


# Inference
print("Generating...")
with torch.no_grad():
    outputs = qwen_model.generate(
        **inputs,
        max_new_tokens=128,
    )

generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs['input_ids'], outputs)
]
output_text = qwen_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("\n--- Output ---")
print(output_text[0])

Generating...

--- Output ---
The image depicts an artificial satellite in orbit around Earth. The satellite has a rectangular shape with multiple panels, likely solar panels, attached to its exterior. These panels are designed to capture sunlight and convert it into electrical energy, which powers the satellite's systems. The satellite is positioned above the Earth's surface, showing a clear view of the planet's curvature and the blue oceans below. The background features a starry night sky, indicating that the satellite is in space. The overall scene suggests a typical scenario for a communication or observation satellite in operation.
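
The image setup in the inference cell (resize the shorter side to 384 px, then center-crop to a square) can be checked without loading an image. The sketch below only computes the geometry. The helper name `resize_then_center_crop_box` and its `target` parameter are hypothetical and not part of the notebook or of any library.

```python
def resize_then_center_crop_box(width, height, target=384):
    """Hypothetical helper: return (resized_size, crop_box) for a
    shorter-side resize to `target` followed by a center crop."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / min(width, height)
    # round() avoids losing a pixel when the product lands just below an integer;
    # max() guarantees the shorter side is never smaller than the crop.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    # Offsets that center a target x target square inside the resized image
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


size, box = resize_then_center_crop_box(1920, 1080)
print(size, box)  # (683, 384) (149, 0, 533, 384)
```

With Pillow, the two returned values could be applied as `image.resize(size, resample=Image.BICUBIC).crop(box)`, which should produce the same 384x384 input as the cell above.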
