# Running LLaVa-NeXT-Video: a large multi-modal model on Google Colab

LLaVa-NeXT-Video is a new Large Vision-Language Model that enables interaction with videos and images. The model is based on a previous series of models, [LLaVa-NeXT](https://huggingface.co/docs/transformers/main/en/model_doc/llava_next), which was trained exclusively on image-text data. The architecture is the same as in LLaVa-NeXT: a decoder-based text model that takes vision hidden states concatenated with text hidden states as input.


<img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">


LLaVA-NeXT shows surprisingly strong performance in understanding video content thanks to the AnyRes technique that it uses. AnyRes represents a high-resolution image as multiple smaller images. This generalizes naturally to videos, because a video can be treated as a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT for videos has several improvements:

- LLaVA-Next-Video, with supervised fine-tuning (SFT) on top of LLaVA-Next on video data, achieves better video understanding capabilities and is currently SOTA among open-source models on the [VideoMME benchmark](https://arxiv.org/pdf/2405.21075).
- LLaVA-Next-Video-DPO, which aligns the model response with AI feedback using direct preference optimization (DPO), shows a further performance boost.
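
To make the "image as multiple images" idea concrete, here is a minimal NumPy sketch of tiling. It is not the actual AnyRes implementation: the function name `split_into_tiles` and the 336-pixel tile size are assumptions for illustration, and the real pipeline also resizes or pads the image to a supported grid first.

```python
import numpy as np

def split_into_tiles(image, tile=336):
    """Split an (H, W, 3) image into non-overlapping tile x tile patches.

    Assumes H and W are already multiples of `tile` (hypothetical simplification).
    """
    h, w, c = image.shape
    assert h % tile == 0 and w % tile == 0, "resize or pad the image first"
    rows, cols = h // tile, w // tile
    tiles = (
        image.reshape(rows, tile, cols, tile, c)
        .transpose(0, 2, 1, 3, 4)              # -> (rows, cols, tile, tile, c)
        .reshape(rows * cols, tile, tile, c)   # -> (num_tiles, tile, tile, c)
    )
    return tiles

# A dummy 672x672 image becomes a stack of four 336x336 tiles.
image = np.zeros((672, 672, 3), dtype=np.uint8)
tiles = split_into_tiles(image)
print(tiles.shape)  # (4, 336, 336, 3)
```

The resulting array has the same layout as a short video clip, `(num_frames, height, width, 3)`, which is why a model that handles a set of tiles can also be fed a set of frames.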

Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next_video

Project page: https://github.com/LLaVA-VL/LLaVA-NeXT



First, we need to install the latest `transformers` from `main`, as the model has just been added. We'll also install `bitsandbytes` to load the model in lower precision for [memory efficiency](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [1]:
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-jviqyntc
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-jviqyntc
  Resolved https://github.com/huggingface/transformers.git to commit 1bd604d11c405dfb8b78bda4062d88fc75c17de0
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tokenizers<0.21,>=0.20 (from transformers==4.46.0.dev0)
  Downloading tokenizers-0.20.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tokenizers-0.20.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB)

In [2]:
# we need av to be able to read the video
!pip install -q av


## Load the model

Next, we load a model and corresponding processor from the hub.

We will specify a quantization config to load the model in 4 bits. Please refer to this [guide](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for more details.
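
As a rough back-of-the-envelope check of why 4-bit loading helps on Colab, the sketch below estimates the memory needed to store the weights alone. The helper `weight_memory_gb` and the round figure of 7 billion parameters are assumptions for illustration; the estimate ignores quantization metadata, activations, and any modules that are kept in higher precision.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed just to store the weights, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # assumed round figure for a "7B" checkpoint
for name, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>8}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the weights take roughly 14 GB in float16 and about 3.5 GB in 4 bits, which leaves more headroom for the video inputs on a small GPU.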

In [3]:
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Preparing the video and image inputs

To read the video we'll use `av` and sample 8 frames. You can try sampling more frames if the video is long. The model was trained with 32 frames, but it can ingest more as long as the total input stays within the LLM backbone's maximum sequence length.
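
A minimal sketch of uniform frame sampling, assuming the total frame count of the video is already known. The helper name `uniform_frame_indices` is hypothetical; it only computes which frame indices to decode and does not read any video.

```python
import numpy as np

def uniform_frame_indices(total_frames, num_frames=8):
    """Return `num_frames` frame indices spread evenly over [0, total_frames - 1]."""
    # Short clips: never ask for more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    return np.linspace(0, total_frames - 1, num_frames).astype(int)

indices = uniform_frame_indices(total_frames=300, num_frames=8)
print(indices)
```

Using `np.linspace` with both endpoints included guarantees that the first and the last frame of the clip are always part of the sample.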

In [4]:
import av
import numpy as np

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

In [5]:
!gdown 1j3iJ1GXYcYihijDX56TodfuKWyci7wwa
!unzip raw_video.zip

Downloading...
From (original): https://drive.google.com/uc?id=1j3iJ1GXYcYihijDX56TodfuKWyci7wwa
From (redirected): https://drive.google.com/uc?id=1j3iJ1GXYcYihijDX56TodfuKWyci7wwa&confirm=t&uuid=c17ee0c4-3082-4bdc-8293-99d253928865
To: /content/raw_video.zip
100% 145M/145M [00:02<00:00, 69.7MB/s]
Archive:  raw_video.zip
   creating: raw_video/
  inflating: raw_video/S2AfkLBFuP4.8.mp4  
  inflating: raw_video/RX5jaZv52rQ_280.000_290.000.mp4  
  inflating: raw_video/8aKhvS70_zY_40.000_50.000.mp4  
  inflating: raw_video/IauuGcFwWZg.49.mp4  
  inflating: raw_video/Uq3iKbCNDCM_60.000_70.000.mp4  
  inflating: raw_video/e6Df7Ocuqcs_180.000_190.000.mp4  
  inflating: raw_video/z3xiKSDoZEI_160.000_170.000.mp4  
  inflating: raw_video/4lUOJNaRy3I.3.mp4  
  inflating: raw_video/CgTc_-A_Gzw_0.000_8.000.mp4  
  inflating: raw_video/GfoyjUOE9Ek.8.mp4  
  inflating: raw_video/evy2azZk3kE_40.000_50.000.mp4  
  inflating: raw_video/Fn4ULUxA3KI.21.mp4  
  inflating: raw_video/QJgUEYFR49Y.46.mp4  
  i

In [None]:
# from huggingface_hub import hf_hub_download

# # Download video from the hub
# video_path_1 = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
# video_path_2 = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="karate.mp4", repo_type="dataset")

# container = av.open(video_path_1)

# # sample uniformly 8 frames from the video (we can sample more for longer videos)
# total_frames = container.streams.video[0].frames
# indices = np.arange(0, total_frames, total_frames / 8).astype(int)
# clip_baby = read_video_pyav(container, indices)


# container = av.open(video_path_2)

# # sample uniformly 8 frames from the video (we can sample more for longer videos)
# total_frames = container.streams.video[0].frames
# indices = np.arange(0, total_frames, total_frames / 8).astype(int)
# clip_karate = read_video_pyav(container, indices)

In [21]:
import os
import av
import numpy as np

def process_video(video_path, fps=4, num_frames=8):
    container = av.open(video_path)
    video_stream = container.streams.video[0]
    # video duration in seconds (assumes the stream reports a duration)
    video_duration = video_stream.duration * video_stream.time_base
    original_fps = video_stream.average_rate  # native frame rate of the video

    # Estimate frame counts from the video duration
    total_frames = int(video_duration * original_fps)  # total number of frames in the source video
    # number of frames that sampling at `fps` would yield (computed for reference, not used below)
    target_frame_count = int(video_duration * fps)

    # Uniformly sample a fixed number of frames (8 by default) across the whole video
    indices = np.linspace(0, total_frames - 1, num_frames).astype(int)
    clip = read_video_pyav(container, indices)
    container.close()  # frames are already converted to numpy arrays
    return clip

def process_all_videos_in_folder(folder_path, fps=4):
    video_clips = {}
    for filename in os.listdir(folder_path):
        if filename.endswith('.mp4'):
            video_path = os.path.join(folder_path, filename)
            print(f"Processing video: {filename}")
            clip = process_video(video_path, fps=fps)
            video_clips[filename] = clip
    return video_clips

# Assumes the video files have already been downloaded to this folder
folder_path = "/content/raw_video"  # replace with your folder path
fps = 4  # desired frame rate
video_clips = process_all_videos_in_folder(folder_path, fps=fps)

# The video_clips dict maps each file name to its sampled frames and can be processed further as needed

Processing video: wj-gglKQ3KI_30.000_40.000.mp4
Processing video: 94Y_QnedxB8_11.000_21.000.mp4
Processing video: UbpTXBNIXtQ.18.mp4
Processing video: y2ZbO4KBLnY.3.mp4
Processing video: Sv9fcuRfk2o_70.000_80.000.mp4
Processing video: Ln0Epmhtvao_150.000_160.000.mp4
Processing video: xfNeZaw4o3U_17.000_27.000.mp4
Processing video: Meshocn2mZQ_30.000_40.000.mp4
Processing video: TiFboW8mqd4_30.000_40.000.mp4
Processing video: piYDsIrSvME.27.mp4
Processing video: UmWkmGAu5Hc.14.mp4
Processing video: uLWBKxHSwyY.36.mp4
Processing video: p_o6NQX7lmE_0.000_10.000.mp4
Processing video: 1lfmwRJYins.9.mp4
Processing video: 4lUOJNaRy3I.3.mp4
Processing video: ZZKUxUjdcQw.17.mp4
Processing video: CZc9MZIxSSc_30.000_40.000.mp4
Processing video: ZHSPLH5-zvw_50.000_60.000.mp4
Processing video: brRabaaeoy4.36.mp4
Processing video: vB00cAT5iPo_60.000_70.000.mp4
Processing video: uLcn-Q-TLO8.39.mp4
Processing video: HTBxRn8Sbyc_7.000_17.000.mp4
Processing video: 7FT1WnwOcxA_40.000_50.000.mp4
Processin
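
Note that `process_video` accepts an `fps` argument but always samples a fixed number of frames. If sampling at a target frame rate is what you want, one possible sketch is shown below. The helper `fps_frame_indices` and its `max_frames` cap are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def fps_frame_indices(total_frames, original_fps, target_fps, max_frames=None):
    """Pick frame indices so that roughly `target_fps` frames are kept per second."""
    # Step between kept frames; never sample faster than the source frame rate.
    step = max(float(original_fps) / float(target_fps), 1.0)
    indices = np.arange(0, total_frames, step).astype(int)
    if max_frames is not None and len(indices) > max_frames:
        # Too many frames for the model: thin them out evenly.
        keep = np.linspace(0, len(indices) - 1, max_frames).astype(int)
        indices = indices[keep]
    return indices

# A 10-second clip at 30 fps, sampled at 4 fps, then capped at 8 frames.
print(len(fps_frame_indices(300, 30, 4)))
print(fps_frame_indices(300, 30, 4, max_frames=8))
```

The cap matters because every extra frame adds visual tokens to the input, so an uncapped frame rate can push long videos past the backbone's sequence length.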

In [None]:
len(video_clips)

In [None]:
# sample_clip = None
# for item in video_clips.values():
#   sample_clip = item
#   break

In [None]:
# from matplotlib import pyplot as plt
# from matplotlib import animation
# from IPython.display import HTML

# # np array with shape (frames, height, width, channels)
# video = sample_clip

# fig = plt.figure()
# im = plt.imshow(video[0,:,:,:])

# plt.close() # this is required to not display the generated image

# def init():
#     im.set_data(video[0,:,:,:])

# def animate(i):
#     im.set_data(video[i,:,:,:])
#     return im

# anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],
#                                interval=100)
# HTML(anim.to_html5_video())

## Prepare a prompt and generate

In the prompt, you can refer to a video or an image using the special `<video>` or `<image>` token. To indicate which text comes from the human and which from the model, use `USER` and `ASSISTANT` respectively (note: this holds only for this checkpoint). The format looks as follows:

`USER: <video>\n<prompt> ASSISTANT:`


In other words, you always need to end your prompt with `ASSISTANT:`.
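
As a small illustration of this format, the sketch below builds such a prompt by hand and strips the prompt back out of a decoded generation. Both helper names and the decoded string are hypothetical placeholders, not real model output.

```python
def build_prompt(instruction):
    """Wrap an instruction in the USER/ASSISTANT format used by this checkpoint."""
    return f"USER: <video>\n{instruction} ASSISTANT:"

def extract_answer(decoded_text):
    """Keep only the text that follows the last ASSISTANT: marker."""
    return decoded_text.split("ASSISTANT:")[-1].strip()

prompt = build_prompt("Describe this video.")
print(repr(prompt))

# Hypothetical decoded output, for illustration only
decoded = "USER: \nDescribe this video. ASSISTANT: some generated caption"
print(extract_answer(decoded))
```

Splitting on the marker without a trailing space and then calling `strip()` is slightly more forgiving than splitting on `'ASSISTANT: '`, since it also works when the model emits no space or a newline after the colon.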


Manually adding `USER` and `ASSISTANT` to your prompt can be error-prone, since each checkpoint expects its own prompt format depending on the backbone language model. Luckily, we can use `apply_chat_template` to make it easier.

Chat templates are special templates written in Jinja and shipped with the checkpoint. Whenever we call `apply_chat_template`, the Jinja template is filled in with your text instruction.

To use a chat template, simply build a list of messages with `role` and `content` keys and pass it to the `apply_chat_template()` method. The result is a prompt string that is ready to be passed to the processor. When using chat templates as input for model generation, it's also a good idea to set `add_generation_prompt=True` to add a generation prompt. See [the docs](https://huggingface.co/docs/transformers/main/en/chat_templating) for more details.
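
To give an intuition for what a chat template does, here is a toy pure-Python stand-in that turns a list of messages into the `USER: ... ASSISTANT:` format described above. It is not the real Jinja template of this checkpoint: the function `render_conversation` is hypothetical, and placing the media token before the text is an assumption made to match that format.

```python
def render_conversation(messages, add_generation_prompt=True):
    """Toy stand-in for a chat template: turn a list of messages into a prompt string."""
    parts = []
    for message in messages:
        role = message["role"].upper()  # "user" -> "USER"
        media, text = "", ""
        for item in message["content"]:
            if item["type"] in ("video", "image"):
                media += f"<{item['type']}>\n"  # placeholder token for the visual input
            elif item["type"] == "text":
                text += item["text"]
        parts.append(f"{role}: {media}{text}")
    prompt = " ".join(parts)
    if add_generation_prompt:
        prompt += " ASSISTANT:"  # cue the model to start its answer
    return prompt

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What happens in this clip?"},
            {"type": "video"},
        ],
    },
]
print(repr(render_conversation(conversation)))
```

In practice you should rely on `processor.apply_chat_template`, which applies the template that belongs to the checkpoint rather than a hand-written approximation like this one.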

In [8]:
import json
with open('/content/unimodal_whisper_small_audiocap.json', 'r') as f:
  audio_cap = json.load(f)

In [9]:
audio_cap = audio_cap['annotations']

In [None]:
type(video_clips)

In [10]:
# Each "content" is a list of dicts and you can add image/video/text modalities
def set_prompt(audio_caption):
    PROMPT = f"""
    Given the video and the audio captioning of this video, describe this video.
    Audio captioning is as follows:
    {audio_caption}
    """

    conversation = [
          {
              "role": "user",
              "content": [
                  {"type": "text", "text": PROMPT},
                  {"type": "video"},
                  ],
          },
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    return prompt

In [11]:
audio_cap_dict = {item['video_id']: item['caption'] for item in audio_cap}

In [12]:
SKIPPED_IDS = [
    "p_o6NQX7lmE_0.000_10.000.mp4",
    "xJ-6ewqMyxY_410.000_420.000.mp4",
    "niJg7Q1XLyU_50.000_60.000.mp4",
    "wj-gglKQ3KI_30.000_40.000.mp4"]

In [19]:
import json

# generation settings and output location used when iterating over video_clips
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}
output_file = '/content/multimodal_llava-whisper-generation.json'
# make sure the output directory exists
os.makedirs(os.path.dirname(output_file), exist_ok=True)

# collect the names of videos that already have a generated caption,
# so that a restarted run can skip them
exits_video = []
if os.path.exists(output_file):  # the file does not exist yet on the first run
    with open(output_file, 'r') as file:
        for line in file:
            if not line.strip():
                continue
            data = json.loads(line)
            # names are stored without the extension, so add it back
            exits_video.append(data.get("video_name", "") + '.mp4')



In [22]:
for video_name, video_clip in video_clips.items():
  if video_name in SKIPPED_IDS or video_name in exits_video:
    continue
  video_name = video_name.split('.mp4')[0]
  prompt = set_prompt(audio_cap_dict[video_name])
  # we need to call the processor to tokenize the prompt and get pixel_values for videos
  inputs = processor([prompt], videos=[video_clip], padding=True, return_tensors="pt").to(model.device)
  output = model.generate(**inputs, **generate_kwargs)
  generated_text = processor.batch_decode(output, skip_special_tokens=True)
  # append to the json file
  with open(output_file, 'a') as f:
    f.write(json.dumps({'video_name': video_name, 'generated_text': generated_text[0].split('ASSISTANT: ')[-1]}) + '\n')

In [23]:
import json

# Input file path
input_file = '/content/multimodal_llava-whisper-generation.json'  # Replace with your actual file path
output_file = '/content/multimodal_llava-whisper-ac.json'  # Output file path

annotations = []

# Read the input file line by line
with open(input_file, 'r') as f:
    for line in f:
        # Parse each line as JSON
        data = json.loads(line.strip())

        # Create a new dictionary with the desired format
        annotations.append({
            "video_id": data["video_name"],
            "caption": data["generated_text"].strip()
        })

# Create the final output structure
output_data = {
    "annotations": annotations
}

# Write the formatted output to a JSON file
with open(output_file, 'w') as f:
    json.dump(output_data, f, indent=4)

print(f"Formatted output saved to {output_file}")

Formatted output saved to /content/multimodal_llava-whisper-ac.json
