Have the improvements from Awni's video branch been merged into main?

Hi Prince,

Last December I asked Awni to help fix some issues in the mlx-vlm video branch for a customer PoC, so I am wondering whether those improvements have already been merged into the main branch. I use the following steps to build the mlx-vlm environment for the PoC:

pip install git+https://github.com/awni/mlx-vlm.git@video
pip install qwen_vl_utils
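
To confirm which builds actually ended up in the environment after these two steps, something like the following could be run. It is only a sketch: `installed_version` is a helper name made up here, and the distribution names are assumed to match the names used in the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names are assumptions taken from the pip commands above.
for name in ("mlx-vlm", "qwen_vl_utils"):
    print(name, installed_version(name) or "not installed")
```

The version string alone does not prove which branch a build came from, so it is only a first check before comparing against main.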

1. The video branch significantly reduces the memory footprint of video inference with Qwen-VL.
2. It also fixes some bugs during inference, for example failing to read the timestamp in the video frame.

```
python -m mlx_vlm.video_generate --model mlx-community/Qwen2-VL-7B-Instruct-bf16 --max-tokens 500 --prompt "Describe this video" --video /Users/dmp/vlm/test.mp4 --max-pixels 1920 1080 --fps 2.0
date
Thu Mar 13 21:08:42 CST 2025
objc[93822]: Class AVFFrameReceiver is implemented in both /Users/dmp/miniconda3/envs/vlm/lib/python3.11/site-packages/av/.dylibs/libavdevice.61.3.100.dylib (0x1238103a8) and /Users/dmp/miniconda3/envs/vlm/lib/python3.11/site-packages/decord/.dylibs/libavdevice.59.7.100.dylib (0x129cdca10). One of the two will be used. Which one is undefined.
objc[93822]: Class AVFAudioReceiver is implemented in both /Users/dmp/miniconda3/envs/vlm/lib/python3.11/site-packages/av/.dylibs/libavdevice.61.3.100.dylib (0x1238103f8) and /Users/dmp/miniconda3/envs/vlm/lib/python3.11/site-packages/decord/.dylibs/libavdevice.59.7.100.dylib (0x129cdca60). One of the two will be used. Which one is undefined.
Loading model: mlx-community/Qwen2-VL-7B-Instruct-bf16
Fetching 14 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 23850.63it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
==========
Video: /Users/dmp/vlm/test.mp4 

Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|video_pad|><|vision_end|>Describe this video<|im_end|>
<|im_start|>assistant

qwen-vl-utils using decord to read video.
The given max_pixels[2073600] exceeds limit[602112].
Token indices sequence length is longer than the specified maximum sequence length for this model (43224 > 32768). Running this sequence through the model will result in indexing errors
Generating video description...
The video depicts a virtual reality scene featuring a large, yellow excavator situated in a spacious, well-lit room with high ceilings. The excavator is positioned in the center of the room, surrounded by various objects and furniture. The room appears to be a museum or gallery space, as indicated by the presence of large, abstract paintings on the walls and sculptures displayed throughout the room. The excavator is the focal point of the scene, with its large size and bright color making it stand out against the more subdued background. The overall atmosphere of the video is modern and artistic, with a blend of technology and creativity.
==========
Prompt: 289.731 tokens-per-sec
Generation: 29.654 tokens-per-sec
Peak memory: 152.406 GB
Thu Mar 13 21:11:24 CST 2025

```
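
For reference, the numbers in the warnings above can be cross-checked with a few lines of arithmetic. This is only a sketch using values copied from the log; the variable names are mine, and the elapsed time is simply the difference between the two `date` outputs, so it covers model loading as well as generation.

```python
from datetime import datetime

# Values copied from the log above.
requested_max_pixels = 1920 * 1080  # from --max-pixels 1920 1080
pixel_limit = 602112                # limit[...] reported in the log
prompt_tokens = 43224               # tokenized sequence length
model_max_tokens = 32768            # maximum sequence length reported

# Timestamps printed by `date` before and after the run (same day).
start = datetime.strptime("21:08:42", "%H:%M:%S")
end = datetime.strptime("21:11:24", "%H:%M:%S")

pixel_ratio = requested_max_pixels / pixel_limit
token_overflow = prompt_tokens - model_max_tokens
elapsed_seconds = int((end - start).total_seconds())

print(f"requested max_pixels: {requested_max_pixels}")
print(f"ratio to limit: {pixel_ratio:.2f}")
print(f"tokens over the model maximum: {token_overflow}")
print(f"wall-clock time: {elapsed_seconds} s")
```

The product 1920 x 1080 matches the `max_pixels[2073600]` value in the warning. The log does not say whether the frames are then clamped to the limit.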
Thanks.

Nan
