# Video Understanding

In this example, we generate a description of a video using `Qwen2-VL`, `Qwen2.5-VL`, `LLaVA`, or `Idefics3`, with more models coming soon.

This feature is currently in beta and may not work as expected.



## Install Dependencies

!pip install -U mlx-vlm

```shell
micromamba create -n mlx-vlm ipykernel python=3.12 -y
micromamba activate mlx-vlm
pip install mlx-vlm
pip3 install torch torchvision torchaudio
```
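
After installing, it can help to confirm that the packages are visible to the kernel the notebook runs in. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of `mlx-vlm`.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Packages installed in the shell block above.
for name in ("mlx-vlm", "torch", "torchvision"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any entry prints `not installed`, the notebook kernel is probably not using the `mlx-vlm` environment created above.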

## Import Dependencies

In [1]:
from pprint import pprint
from mlx_vlm import load
from mlx_vlm.utils import generate
from mlx_vlm.video_generate import process_vision_info

import mlx.core as mx

This is a beta version of the video understanding. It may not work as expected.


In [2]:
# Load the model and processor.
# A smaller alternative: load("mlx-community/Qwen2.5-VL-7B-Instruct-4bit")
model, processor = load("mlx-community/Qwen2-VL-72B-Instruct-4bit")

Fetching 18 files: 100%|██████████| 18/18 [00:00<00:00, 116149.96it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [3]:
# Messages containing a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "videos/fastmlx_local_ai_hub.mp4",
                "max_pixels": 360 * 360,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)


numpy reader: video_path=videos/fastmlx_local_ai_hub.mp4, total_frames=1134, video_fps=59.941855343141576, time=0.000s
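
The reader log above reports 1134 frames at roughly 59.94 fps, while the message requests `"fps": 1.0`. As a rough sketch of what that implies, the arithmetic below estimates the clip duration and the number of frames kept after resampling. `estimate_sampled_frames` is a hypothetical helper, and the simple rounding is an assumption: the actual video loader may round to a multiple of some frame factor or clamp the count.

```python
def estimate_sampled_frames(total_frames, video_fps, target_fps):
    """Estimate clip duration and frame count after resampling to target_fps."""
    duration_s = total_frames / video_fps
    # Assumption: plain rounding, with at least one frame kept.
    return duration_s, max(1, round(duration_s * target_fps))


duration_s, n_frames = estimate_sampled_frames(
    total_frames=1134, video_fps=59.941855343141576, target_fps=1.0
)
print(f"duration ~ {duration_s:.2f} s, about {n_frames} frames at 1 fps")
```

For this clip the estimate is about 19 frames over roughly 18.9 seconds, so raising `fps` increases the number of visual tokens roughly in proportion.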


In [4]:
# Convert inputs to mlx arrays
input_ids = mx.array(inputs['input_ids'])
pixel_values = mx.array(inputs['pixel_values_videos'])
mask = mx.array(inputs['attention_mask'])
image_grid_thw = mx.array(inputs['video_grid_thw'])

kwargs = {
    "image_grid_thw": image_grid_thw,
}
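
Each row of `video_grid_thw` describes one video as a grid of (temporal steps, patch rows, patch columns). A quick sanity check before generating is to compare the product of those three numbers with the first dimension of the pixel array. The sketch below assumes the Qwen2-VL processor flattens patches so that the two match; `expected_patch_count` is a hypothetical helper and the example grid is made up for illustration.

```python
import numpy as np


def expected_patch_count(grid_thw):
    """Sum t * h * w over every row of a (num_videos, 3) grid array."""
    grid = np.asarray(grid_thw).reshape(-1, 3)
    return int(grid.prod(axis=1).sum())


# Hypothetical grid for illustration: 10 temporal steps of 26 x 26 patches.
example_grid = [[10, 26, 26]]
print(expected_patch_count(example_grid))
```

In the notebook, the analogous check would be comparing `expected_patch_count(np.array(image_grid_thw))` with `pixel_values.shape[0]`, under the assumption stated above.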

In [5]:
kwargs["video"] = "videos/fastmlx_local_ai_hub.mp4"
kwargs["input_ids"] = input_ids
kwargs["pixel_values"] = pixel_values
kwargs["mask"] = mask
response = generate(model, processor, prompt=text, temp=0.7, max_tokens=100, **kwargs)

In [6]:
pprint(response)


('The video depicts a chaotic scene of numerous mobile phones floating in the '
 'air, their screens displaying various data and information. The phones '
 'appear to be in disarray, moving in different directions, creating a sense '
 'of urgency and overwhelming information. The screens show a mix of text, '
 'numbers, and graphs, suggesting a flood of data or notifications. The '
 'background is dark, which makes the illuminated screens stand out even more, '
 'adding to the overall impression of a digital overload.')
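
To ask a different question, or to point at another file, only the message list needs to change. The sketch below wraps the structure from cell 3 in a hypothetical helper, `build_video_messages`; the defaults mirror the values used in this example and are not library defaults.

```python
def build_video_messages(video_path, prompt, max_pixels=360 * 360, fps=1.0):
    """Build a single-turn chat message list with one video and one text query."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "max_pixels": max_pixels,  # per-frame pixel budget
                    "fps": fps,  # requested sampling rate
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_video_messages(
    "videos/fastmlx_local_ai_hub.mp4", "List the on-screen text in this video."
)
print(messages[0]["content"][1]["text"])
```

The resulting `messages` can be passed through the same `apply_chat_template`, `process_vision_info`, and `generate` steps shown above.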


In [8]:
# Open the video and play it inline
from ipywidgets import Video
Video.from_file("videos/fastmlx_local_ai_hub.mp4", width=320, height=240)
