# Running LLaVa-NeXT-Video: a large multi-modal model on Google Colab

LLaVa-NeXT-Video is a new Large Vision-Language Model that enables interaction with videos and images. It builds on a previous series of models, [LLaVa-NeXT](https://huggingface.co/docs/transformers/main/en/model_doc/llava_next), which was trained exclusively on image-text data. The architecture is the same as in LLaVa-NeXT: a decoder-based text model that takes vision hidden states concatenated with text hidden states as input.


<img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">


Somewhat surprisingly, LLaVA-NeXT already performs well at understanding video content thanks to the AnyRes technique it uses. AnyRes represents a high-resolution image as multiple smaller images. This generalizes naturally to videos, because a video can be treated as a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT for videos has several improvements:

- LLaVA-Next-Video, with supervised fine-tuning (SFT) on video data on top of LLaVA-Next, achieves better video understanding and is the current SOTA among open-source models on the [VideoMME benchmark](https://arxiv.org/pdf/2405.21075).
- LLaVA-Next-Video-DPO, which aligns the model's responses with AI feedback using direct preference optimization (DPO), shows a further performance boost.

Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next_video
Project page: https://github.com/LLaVA-VL/LLaVA-NeXT



First we need to install the latest `transformers` from `main`, as the model has just been added. We'll also install `bitsandbytes` to load the model in lower precision for [memory efficiency](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [None]:
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git

DEPRECATION: Loading egg at c:\python312\lib\site-packages\pybkt-1.4.1-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330

[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to c:\users\karis\appdata\local\temp\pip-req-build-_eo_9qku
  Resolved https://github.com/huggingface/transformers.git to commit 1f0b490a2c42eb129dccc69031ccb537058689c4
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting huggingface-hub==1.0.0.rc6 (from transformers==5.0.0.dev0)
  Using cached huggingface_hub-1.0.0rc6-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers==5.0.0.dev0)
  Using cached tokenizers-0.22.1-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Using cached huggingface_hub-1.0.0rc6-py3-none-any.whl (502 kB)
Using cached tokenizers-0

  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git 'C:\Users\karis\AppData\Local\Temp\pip-req-build-_eo_9qku'
ERROR: Could not install packages due to an OSError: [WinError 2] The system cannot find the file specified: 'C:\\Python312\\Scripts\\hf.exe' -> 'C:\\Python312\\Scripts\\hf.exe.deleteme'




In [6]:
# we need av to be able to read the video
!pip install -q av
!pip install pillow
!pip install torchvision
!pip install protobuf
!pip install sentencepiece



## Load the model

In [11]:
!pip install huggingface-hub==0.26.0





Next, we load a model and corresponding processor from the hub.

We will specify a quantization config to load the model in 4 bits. Please refer to this [guide](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for more details.
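As a rough illustration of why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone. It assumes about 7 billion parameters (taken from the "7B" in the checkpoint name, not an exact count) and ignores activations, the KV cache, and quantization overhead, so real usage will be higher. The helper name `approx_weight_memory_gb` is made up for this example.

```python
def approx_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Rough size of the weights in GB (1 GB = 1e9 bytes); ignores all overhead."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: "7B" from the checkpoint name, not an exact count

fp16_gb = approx_weight_memory_gb(NUM_PARAMS, 16)
int4_gb = approx_weight_memory_gb(NUM_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the fp16 footprint, which is what makes the model practical on a single small GPU.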

In [12]:
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = LlavaNextVideoProcessor.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf", 
    trust_remote_code=True,    
)

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    trust_remote_code=True
)

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
`low_cpu_mem_usage` was None, now default to True since model is quantized.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading shards: 100%|██████████| 3/3 [16:39<00:00, 333.20s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:48<00:00, 16.17s/it]


## Preparing the video and image inputs

To read the video we'll use `decord` and sample frames at a fixed stride (65 frames for the example video later in this notebook). You can sample more frames if the video is long. The model was trained with 32 frames, but can ingest more as long as the input stays within the LLM backbone's maximum sequence length.
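One way to get an exact number of frames is to compute evenly spaced frame indices first and then read only those frames. The helper below is a minimal sketch in plain Python; the name `sample_frame_indices` and its behaviour for short videos (return every frame) are assumptions made for illustration, not part of any library.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return up to `num_frames` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        # Short video: keep every frame instead of repeating some.
        return list(range(total_frames))
    step = total_frames / num_frames  # fractional stride between picked frames
    return [int(i * step) for i in range(num_frames)]


indices = sample_frame_indices(total_frames=100, num_frames=8)
print(indices)  # [0, 12, 25, 37, 50, 62, 75, 87]
```

With a video reader that supports integer indexing, the resulting indices can be used to fetch frames one by one, for example `[reader[i] for i in indices]`, where `reader` stands for whatever reader object you use.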

## Prepare a prompt and generate

In the prompt, you refer to a video with the special `<video>` token and to an image with the `<image>` token. To indicate which text comes from the human and which from the model, use `USER` and `ASSISTANT` respectively (note: this holds only for this checkpoint). The format looks as follows:

`USER: <video>\n<prompt> ASSISTANT:`


In other words, you always need to end your prompt with `ASSISTANT:`.


Manually adding `USER` and `ASSISTANT` to your prompt can be error-prone, since each checkpoint expects its own prompt format depending on the backbone language model. Luckily, we can use `apply_chat_template` to make it easier.

Chat templates are special templates written in Jinja and added to the model's config. Whenever we call `apply_chat_template`, the Jinja template is filled in with your text instruction.

To use a chat template, simply build a list of messages with `role` and `content` keys and pass it to the `apply_chat_template()` method. The output is a prompt string that is ready to use. When using chat templates as input for model generation, it's also a good idea to pass `add_generation_prompt=True` so that the prompt ends with the cue for the model's reply. See [the docs](https://huggingface.co/docs/transformers/main/en/chat_templating) for more details.
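To make the idea concrete, here is a plain-Python imitation of what the template produces for this checkpoint. It is only an illustration that mirrors the `USER: ... ASSISTANT:` strings shown in this notebook; the real template is written in Jinja and may differ in whitespace and edge cases, and `render_prompt` is a hypothetical helper, not a `transformers` function.

```python
def render_prompt(conversation: list[dict], add_generation_prompt: bool = True) -> str:
    """Imitate the USER/ASSISTANT prompt format used by this checkpoint."""
    turns = []
    for message in conversation:
        role = message["role"].upper()
        # Visual placeholders go first, each followed by a newline.
        visuals = "".join(
            f"<{item['type']}>\n"
            for item in message["content"]
            if item["type"] in ("image", "video")
        )
        text = "".join(
            item["text"] for item in message["content"] if item["type"] == "text"
        )
        turns.append(f"{role}: {visuals}{text}")
    prompt = " ".join(turns)
    if add_generation_prompt:
        prompt += " ASSISTANT:"  # cue for the model to start its reply
    return prompt


example = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is the police in this video?"},
            {"type": "video"},
        ],
    },
]
print(repr(render_prompt(example)))
```

In practice you should call `processor.apply_chat_template` instead, since it always uses the template that ships with the checkpoint.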

In [None]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "Why is the police in this video?"},
              {"type": "video"},
              ],
      },
]

conversation_2 = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "A police officer is helping a civilian. What is the officer helping him with?"},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)


In [14]:
# As you can see we got the USER: ASSISTANT: format prompt
prompt

'USER: <video>\nWhy is the police in this video? ASSISTANT:'

In [15]:
!pip install torchcodec

Collecting torchcodec
  Downloading torchcodec-0.8.0-cp312-cp312-win_amd64.whl.metadata (9.7 kB)
Downloading torchcodec-0.8.0-cp312-cp312-win_amd64.whl (2.1 MB)
Installing collected packages: torchcodec
Successfully installed torchcodec-0.8.0




In [18]:
!pip install decord==0.6.0

Collecting decord==0.6.0
  Downloading decord-0.6.0-py3-none-win_amd64.whl.metadata (422 bytes)
Downloading decord-0.6.0-py3-none-win_amd64.whl (24.7 MB)
Installing collected packages: decord
Successfully installed decord-0.6.0




In [33]:
from decord import VideoReader, cpu
import torch
video_path = r"C:\Users\karis\WorcesterPolytechnicInstitute\MQP\VLM\videos\edited-storm-body-cam.mp4"

vr = VideoReader(video_path, ctx=cpu())
frames = [vr[i].asnumpy() for i in range(0, len(vr), max(1, len(vr)//64))]  # stride of len(vr)//64: yields at least 64 frames for videos with 64+ frames (65 here)


In [34]:
print(len(frames), frames[0].shape)
# prints the number of sampled frames and the (H, W, 3) shape of the first frame


65 (720, 1280, 3)


In [35]:
print(model.vision_tower is not None)

True


In [36]:
# we still need to call the processor to tokenize the prompt and get pixel_values for videos
from PIL import Image
import numpy as np


inputs = processor(
    text=[prompt_2],
    videos=[frames],
    padding=True,
    return_tensors="pt"
).to(model.device)



In [37]:
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [38]:
print(generated_text)

["USER: \nA police officer is helping a civilian. What is the officer helping him with? ASSISTANT: The officer is assisting the civilian in clearing an overgrown and overgrown area. This could involve activities such as trimming or cutting bushes, removing trees, or maintaining walkways and roads to ensure public safety and accessibility. The police officer's presence suggests that they might be working to enforce local regulations or to address issues within the community. The assistance could be part of a routine service call, a volunteer effort, or a specific action taken to"]
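The decoded text contains the prompt as well as the reply. If only the reply is needed, one simple option is to keep whatever follows the last `ASSISTANT:` marker. The sketch below assumes the `USER:`/`ASSISTANT:` format of this checkpoint; `extract_answer` is a hypothetical helper name.

```python
def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last `marker`, or the whole string if it is absent."""
    _, found, answer = decoded.rpartition(marker)
    return answer.strip() if found else decoded.strip()


decoded = "USER: \nWhat color is the sign? ASSISTANT: The sign in the image is red with white text."
print(extract_answer(decoded))
```

Applied to the list returned by `processor.batch_decode`, this would be `[extract_answer(text) for text in generated_text]`.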


### Generate from images and image+video data

To generate from images, we only have to change the special token to `<image>`, or indicate an "image" modality in the chat template. Let's see how it works.

In [None]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation_image = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this image?"},
              {"type": "image"},
              ],
      },
]

conversation_2_image = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What color is the sign?"},
              {"type": "image"},
              ],
      },
]

prompt_image = processor.apply_chat_template(conversation_image, add_generation_prompt=True)
prompt_2_image = processor.apply_chat_template(conversation_2_image, add_generation_prompt=True)

In [None]:
prompt_image

'USER: <image>\nWhat do you see in this image? ASSISTANT:'

In [None]:
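# image_snowman and image_stop are assumed to be PIL images loaded earlier (their loading code is not shown in this notebook)
inputs = processor([prompt_image, prompt_2_image], images=[image_snowman, image_stop], padding=True, return_tensors="pt").to(model.device)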

In [None]:
generate_kwargs = {"max_new_tokens": 50, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [None]:
print(generated_text)

["USER: \nWhat do you see in this image? ASSISTANT: In the image, I see an animated depiction of a snowman. The snowman appears to be sitting and gazing into the distance, seemingly contemplative. It's dressed in a scarf and hat, and there are two", 'USER: \nWhat color is the sign? ASSISTANT: The sign in the image is red with white text.']


We can feed images and videos in one go instead of running separate generations for images and videos. We can also interleave images with videos inside one prompt, although the model was not trained on such examples.

For processing, just make sure to pass images/videos in the same order as they appear in the prompts, from the first prompt to the last. You can pass all visual data as a flattened list as shown below; only the order matters.
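Because the pairing relies purely on order, a quick sanity check before calling the processor can catch mismatches early. The sketch below counts `<image>` and `<video>` placeholders across all prompts and compares them with the number of visuals supplied. Both helper names are hypothetical, and the check is an extra precaution written for this example rather than something the processor requires you to run.

```python
def count_placeholders(prompts: list[str]) -> dict[str, int]:
    """Count <image> and <video> placeholders across all prompts."""
    return {
        "image": sum(p.count("<image>") for p in prompts),
        "video": sum(p.count("<video>") for p in prompts),
    }


def check_visual_inputs(prompts: list[str], images=(), videos=()) -> None:
    """Raise ValueError if the number of visuals does not match the placeholders."""
    counts = count_placeholders(prompts)
    if counts["image"] != len(images) or counts["video"] != len(videos):
        raise ValueError(
            f"prompts expect {counts['image']} image(s) and {counts['video']} video(s), "
            f"got {len(images)} image(s) and {len(videos)} video(s)"
        )


prompts = [
    "USER: <video>\nWhy is the police in this video? ASSISTANT:",
    "USER: <image>\nWhat do you see in this image? ASSISTANT:",
    "USER: <image>\nWhat color is the sign? ASSISTANT:",
]
print(count_placeholders(prompts))  # {'image': 2, 'video': 1}
```

The check only verifies counts; keeping the visuals in the same order as the placeholders is still up to the caller.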





In [None]:
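# `prompt` is the video prompt built earlier; clip_baby is assumed to be a video clip (a sequence of frames) loaded earlier and not shown in this notebook
inputs = processor([prompt, prompt_image, prompt_2_image], images=[image_snowman, image_stop], videos=[clip_baby], padding=True, return_tensors="pt").to(model.device)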

In [None]:
generate_kwargs = {"max_new_tokens": 40, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(generated_text)

In [None]:
# For multi-turn conversations just continue stacking up messages in the chat template
conversation_multiturn = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this video?"},
              {"type": "video"},
              ],
      },
      {
          "role": "assistant",
          "content": [
              {"type": "text", "text": "I see a baby reading a book."},
              ],
      },
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "Why is it funny?"},
              ],
      },
]

prompt_multiturn = processor.apply_chat_template(conversation_multiturn, add_generation_prompt=True)
print(prompt_multiturn)

USER: <video>
What do you see in this video? ASSISTANT: I see a baby reading a book. USER: Why is it funny? ASSISTANT:
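When chatting interactively, the conversation list can be extended after each generation instead of being written out by hand. The helper below is a small sketch of that bookkeeping; `extend_conversation` is a hypothetical name, and the message layout simply follows the `role`/`content` structure used throughout this notebook.

```python
def extend_conversation(conversation: list[dict], assistant_reply: str, next_question: str) -> list[dict]:
    """Return a new conversation with the model's reply and the next user turn appended."""
    return conversation + [
        {"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]},
        {"role": "user", "content": [{"type": "text", "text": next_question}]},
    ]


first_turn = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see in this video?"},
            {"type": "video"},
        ],
    },
]
chat = extend_conversation(first_turn, "I see a baby reading a book.", "Why is it funny?")
print([message["role"] for message in chat])  # ['user', 'assistant', 'user']
```

The extended list can then be passed to `processor.apply_chat_template` with `add_generation_prompt=True`, exactly as in the cell above. The original list is left unchanged because the helper builds a new one.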
