<a href="https://colab.research.google.com/github/LeoneFabio/Egocentric-Vision/blob/main/LLaVa_NeXT_Video_demo_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running LLaVa-NeXT-Video: a large multi-modal model on Google Colab

LLaVa-NeXT-Video is a Large Vision-Language Model that enables interaction with videos and images. It builds on a previous series of models, [LLaVa-NeXT](https://huggingface.co/docs/transformers/main/en/model_doc/llava_next), which was trained exclusively on image-text data. The architecture is the same as in LLaVa-NeXT: a decoder-based text model that takes vision hidden states concatenated with text hidden states as input.


<img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">


LLaVA-NeXT shows surprisingly strong performance in understanding video content thanks to the AnyRes technique it uses. AnyRes represents a high-resolution image as multiple smaller images. The technique generalizes naturally to videos, because a video can be treated as a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT for videos has several improvements:

- LLaVA-Next-Video, obtained by supervised fine-tuning (SFT) of LLaVA-Next on video data, achieves better video understanding and was, at the time of writing, state of the art among open-source models on the [VideoMME benchmark](https://arxiv.org/pdf/2405.21075).
- LLaVA-Next-Video-DPO, which aligns the model's responses with AI feedback using direct preference optimization (DPO), shows a further performance boost.
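
The tiling idea behind AnyRes can be sketched in a few lines. The helper below, `tile_boxes`, is hypothetical and not part of `transformers`; the tile size of 336 pixels is an assumption made for illustration, and the real preprocessing involves more steps than this grid split (such as resizing the image to a suitable resolution first). For a video, each sampled frame plays a role similar to one tile.

```python
def tile_boxes(width, height, tile=336):
    """Return (left, top, right, bottom) boxes that cover the image in a grid.

    Assumes width and height are already multiples of `tile` (illustrative only).
    """
    if width % tile or height % tile:
        raise ValueError("resize or pad the image to a multiple of the tile size first")
    return [
        (left, top, left + tile, top + tile)
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]

boxes = tile_boxes(672, 672)
print(len(boxes))  # 4
print(boxes[0])    # (0, 0, 336, 336)
```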

- Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next_video
- Project page: https://github.com/LLaVA-VL/LLaVA-NeXT



First, we need to install the latest `transformers` from `main`, as the model has just been added. We'll also install `bitsandbytes` to load the model in lower precision for [memory efficiency](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [1]:
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-zeptulq9
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-zeptulq9
  Resolved https://github.com/huggingface/transformers.git to commit 5cabc75b4bdb2e67935f7195f901afd150746eb3
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.48.0.dev0-py3-none-any.whl size=10328720 sha256=d3bdde9b02b1d849995777b7fbd233d54c3c38d17922c3ea465ecb9617e13f58
  Stored in dir

In [2]:
# we need av to be able to read the video
!pip install -q av


## Load the model

Next, we load a model and corresponding processor from the hub.

We will specify a quantization config to load the model in 4 bits. Please refer to this [guide](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for more details.
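
To get a feel for why 4-bit loading helps, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes about 7 billion parameters (suggested by the checkpoint name, not measured) and ignores quantization overhead, layers that stay in higher precision, activations, and the KV cache, so real usage will be higher. `weight_memory_gb` is a hypothetical helper written for this note.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # assumption: roughly 7B parameters, as the checkpoint name suggests

print(f"fp16:  {weight_memory_gb(num_params, 16):.1f} GB")  # fp16:  14.0 GB
print(f"4-bit: {weight_memory_gb(num_params, 4):.1f} GB")   # 4-bit: 3.5 GB
```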

In [3]:
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Preparing the video and image inputs

To read the video we'll use `av` and sample 8 frames uniformly. You can sample more frames if the video is long. The model was trained with 32 frames, but it can ingest more as long as the total input stays within the maximum sequence length of the LLM backbone.
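
As a small illustration of uniform sampling, the sketch below picks evenly spaced frame indices with integer arithmetic only. `uniform_frame_indices` is a hypothetical helper, not part of `av` or `transformers`; it is one possible alternative to a fractional-step `np.arange`, for which the NumPy documentation suggests `linspace` when the step is not an integer.

```python
def uniform_frame_indices(total_frames, num_frames=8):
    """Pick `num_frames` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    num_frames = min(num_frames, total_frames)  # cannot sample more frames than exist
    return [i * total_frames // num_frames for i in range(num_frames)]

print(uniform_frame_indices(80))      # [0, 10, 20, 30, 40, 50, 60, 70]
print(uniform_frame_indices(100, 8))  # [0, 12, 25, 37, 50, 62, 75, 87]
print(uniform_frame_indices(5, 8))    # [0, 1, 2, 3, 4]
```

The returned list can be used wherever a sequence of frame indices is expected, since it supports indexing and membership tests.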

In [4]:
import av
import numpy as np

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

In [5]:
from huggingface_hub import hf_hub_download

# Download video from the hub
video_path_1 = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
video_path_2 = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="karate.mp4", repo_type="dataset")
video_path_3 = '/content/sample-5s.mp4'  # local file: must be placed in this folder beforehand
container = av.open(video_path_1)

# sample uniformly 8 frames from the video (we can sample more for longer videos)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip_baby = read_video_pyav(container, indices)


container = av.open(video_path_2)

# sample uniformly 8 frames from the video (we can sample more for longer videos)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip_karate = read_video_pyav(container, indices)


container = av.open(video_path_3)

# sample uniformly 8 frames from the video (we can sample more for longer videos)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip_sample = read_video_pyav(container, indices)


In [6]:
from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# np array with shape (frames, height, width, channels)
video = clip_baby

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],interval=100)
HTML(anim.to_html5_video())

## Prepare a prompt and generate

In the prompt, you can refer to a video or an image using the special `<video>` or `<image>` token. To indicate which text comes from the human and which from the model, use `USER` and `ASSISTANT` respectively (note: this holds only for this checkpoint). The format looks as follows:

`USER: <video>\n<prompt> ASSISTANT:`


In other words, you always need to end your prompt with `ASSISTANT:`.


Manually adding `USER` and `ASSISTANT` to your prompt can be error-prone, since each checkpoint expects its own prompt format depending on the backbone language model. Luckily we can use `apply_chat_template` to make it easier.

Chat templates are special templates written in jinja and added to the model's config. Whenever we call `apply_chat_template`, the jinja template is filled in with your text instruction.

To use a chat template, simply build a list of messages with `role` and `content` keys and pass it to the `apply_chat_template()` method. The result is a prompt string in the format the checkpoint expects. When using chat templates as input for model generation, it's also a good idea to pass `add_generation_prompt=True` so that a generation prompt is appended. See [the docs](https://huggingface.co/docs/transformers/main/en/chat_templating) for more details.
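
To make the idea concrete, here is a plain-Python approximation of what the template does for this checkpoint's `USER`/`ASSISTANT` format. `render_prompt` is a hypothetical helper written for illustration; it is not the jinja template shipped with the checkpoint, and other checkpoints will use different formats.

```python
def render_prompt(conversation, add_generation_prompt=True):
    """Hand-written approximation of the USER/ASSISTANT format (illustrative only)."""
    parts = []
    for message in conversation:
        parts.append(message["role"].upper() + ": ")
        # Visual placeholders come first, then the text of the turn
        for item in message["content"]:
            if item["type"] in ("image", "video"):
                parts.append(f"<{item['type']}>\n")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"] + " ")
    if add_generation_prompt:
        parts.append("ASSISTANT:")  # cue for the model to start answering
    return "".join(parts)

example_conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]
print(repr(render_prompt(example_conversation)))
# 'USER: <video>\nWhy is this video funny? ASSISTANT:'
```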

In [7]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "Why is this video funny?"},
              {"type": "video"},
              ],
      },
]

conversation_2 = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this video?"},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)


In [8]:
# As you can see we got the USER: ASSISTANT: format prompt
prompt

'USER: <video>\nWhy is this video funny? ASSISTANT:'

In [32]:
# we still need to call the processor to tokenize the prompt and get pixel_values for videos
inputs = processor([prompt, prompt_2, prompt_2], videos=[clip_baby, clip_karate, clip_sample], padding=True, return_tensors="pt").to(model.device)

In [37]:
generate_kwargs = {"max_new_tokens": 100, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [38]:
print(generated_text)

["USER: \nWhy is this video funny? ASSISTANT: The humor in this video comes from the innocent and enthusiastic nature of the child's reading, which is perceived as a potentially awkward or uncomfortable action. When the girl reads aloud, it's clear she is not reading the book cover to cover, but is reading the cover itself, which is generally considered more surface-level content than the actual book content. This action is humorous because it's not what one would expect someone to do when reading", 'USER: \nWhat do you see in this video? ASSISTANT: The video shows a male practicing karate moves. He appears to be alone and wearing traditional karate attire, including a gi and a black belt signifying a high-level practitioner. He starts by kneeling down onto his right knee with his left knee touching the ground and performs a front kick. The kick is directed upwards towards the sky, demonstrating a powerful, controlled movement. He then executes a front roundhouse kick while in a standi

In [39]:
# Helper to format the generated text
def format_output(outputs):
    formatted = ""
    for i, response in enumerate(outputs, start=1):
        # Split the string into the USER and ASSISTANT parts (only at the first marker)
        user_part, assistant_part = response.split("ASSISTANT:", 1)
        user_question = user_part.replace("USER:", "").strip()
        assistant_answer = assistant_part.strip()

        # Add the formatting
        formatted += f"---\n\n**{i}. USER:**\n{user_question}\n\n**ASSISTANT:**\n{assistant_answer}\n\n"
    return formatted

# Format the text held in generated_text
formatted_output = format_output(generated_text)

# Print the formatted result
print(formatted_output)

---

**1. USER:**
Why is this video funny?

**ASSISTANT:**
The humor in this video comes from the innocent and enthusiastic nature of the child's reading, which is perceived as a potentially awkward or uncomfortable action. When the girl reads aloud, it's clear she is not reading the book cover to cover, but is reading the cover itself, which is generally considered more surface-level content than the actual book content. This action is humorous because it's not what one would expect someone to do when reading

---

**2. USER:**
What do you see in this video?

**ASSISTANT:**
The video shows a male practicing karate moves. He appears to be alone and wearing traditional karate attire, including a gi and a black belt signifying a high-level practitioner. He starts by kneeling down onto his right knee with his left knee touching the ground and performs a front kick. The kick is directed upwards towards the sky, demonstrating a powerful, controlled movement. He then executes a front roundho

### ROUGE and BLEU Metrics

In [40]:
!pip install datasets
!pip install evaluate
!pip install rouge_score
from nltk.translate.bleu_score import sentence_bleu
import evaluate  # use evaluate instead of datasets.load_metric

def extract_assistant_response(generated_text):
    """
    Extract only the assistant's response from the generated text.
    We assume the format is 'USER: ... ASSISTANT: ...'.
    """
    if "ASSISTANT:" in generated_text:
        return generated_text.split("ASSISTANT:")[1].strip()
    return generated_text.strip()

assistant_responses = [extract_assistant_response(text) for text in generated_text]

# Ground truth
reference_text = [
    "The humor in this video stems from the unexpected and unpredictable nature of the baby's actions.",
    "In the video, there is a person wearing a white karate gi performing a technique.",
    "In this video there is a green garden in a park, with green grass and a lot of trees, the sun is shining adn sky is blue"
]

# Compute the BLEU score
def calculate_bleu_score(candidate, references):
    references = [ref.split() for ref in references]  # split into words
    candidate = candidate.split()
    return sentence_bleu(references, candidate)

# Use evaluate for ROUGE
rouge_metric = evaluate.load("rouge")

def calculate_rouge_score(candidate, references):
    results = rouge_metric.compute(predictions=[candidate], references=[references])
    return results

# Compute the scores and format the results
formatted_results = []
for response in assistant_responses:
    bleu_score = calculate_bleu_score(response, reference_text)
    rouge_score = calculate_rouge_score(response, reference_text)

    formatted_results.append({
        "result": response,
        "bleu_score": bleu_score,
        "rouge_score": rouge_score["rouge1"]
    })

# Print the results
for i, res in enumerate(formatted_results, start=1):
    print(f"---\n**{i}. Result:**\n{res['result']}\n")
    print(f"**BLEU Score:** {res['bleu_score']:.4f}")
    print(f"**ROUGE Score:** {res['rouge_score']:.4f}\n")


---
**1. Result:**
The humor in this video comes from the innocent and enthusiastic nature of the child's reading, which is perceived as a potentially awkward or uncomfortable action. When the girl reads aloud, it's clear she is not reading the book cover to cover, but is reading the cover itself, which is generally considered more surface-level content than the actual book content. This action is humorous because it's not what one would expect someone to do when reading

**BLEU Score:** 0.0723
**ROUGE Score:** 0.2474

---
**2. Result:**
The video shows a male practicing karate moves. He appears to be alone and wearing traditional karate attire, including a gi and a black belt signifying a high-level practitioner. He starts by kneeling down onto his right knee with his left knee touching the ground and performs a front kick. The kick is directed upwards towards the sky, demonstrating a powerful, controlled movement. He then executes a front roundhouse kick while in a standing position
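
For intuition about the `rouge1` number reported above, the following is a minimal sketch of a unigram-overlap F1 score. It uses lowercase whitespace tokens, which is an assumption made for brevity and differs from the tokenization in the `evaluate` metric, so it will not reproduce the scores above. `rouge1_f1` is a hypothetical helper and the sentences are toy strings, not model outputs. Unlike the cell above, which scores each response against the whole reference list, the loop pairs each candidate with its own reference.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two strings, using lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy strings, not taken from the model outputs above
pairs = [
    ("a man practices karate", "a person practices karate in a gi"),
    ("the baby reads a book", "the baby reads a book"),
]
for candidate, reference in pairs:
    print(f"{rouge1_f1(candidate, reference):.4f}")
# 0.5455
# 1.0000
```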


### Generate from images and image+video data

To generate from images, we only have to change the special token to `<image>` or indicate an "image" modality in the chat template. Let's see how it works.

In [None]:
# Let's also load 2 images for generation from image data

from PIL import Image
import requests

image_stop = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
image_snowman = Image.open(requests.get("https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg", stream=True).raw)

In [None]:
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation_image = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this image?"},
              {"type": "image"},
              ],
      },
]

conversation_2_image = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What color is the sign?"},
              {"type": "image"},
              ],
      },
]

prompt_image = processor.apply_chat_template(conversation_image, add_generation_prompt=True)
prompt_2_image = processor.apply_chat_template(conversation_2_image, add_generation_prompt=True)

In [None]:
prompt

'USER: <video>\nWhy is this video funny? ASSISTANT:'

In [None]:
inputs = processor([prompt_image, prompt_2_image], images=[image_snowman, image_stop], padding=True, return_tensors="pt").to(model.device)

In [None]:
generate_kwargs = {"max_new_tokens": 50, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)

In [None]:
print(generated_text)

['USER: \nWhat do you see in this image? ASSISTANT: The image features an animated snowman sitting by a small campfire in the middle of a snowy forest. The snowman appears to be smiling, and it is wearing a cap and a scarf with a pattern, suggesting it is dressed', 'USER: \nWhat color is the sign? ASSISTANT: The sign in the image is red and white.']


We can feed images and videos in one go instead of running separate generations for images and videos. We can also interleave images with videos inside one prompt, although the model did not see such examples during training.

For processing, make sure to pass images and videos in the same order as they appear in the prompts, from the first prompt to the last. You can pass all visual data as a flattened list, as shown below; only the order matters.
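
A quick sanity check before calling the processor is to count the placeholders in the prompts and compare them with the lengths of the flat lists. The helpers below are hypothetical and written for illustration; they only check counts, so keeping the items in prompt order is still up to you. Plain strings stand in for the real images and frame arrays.

```python
def count_placeholders(prompts):
    """Count <image> and <video> placeholders across all prompts."""
    num_images = sum(p.count("<image>") for p in prompts)
    num_videos = sum(p.count("<video>") for p in prompts)
    return num_images, num_videos

def check_visual_inputs(prompts, images, videos):
    """Raise if the flat image/video lists do not match the placeholder counts."""
    num_images, num_videos = count_placeholders(prompts)
    if num_images != len(images):
        raise ValueError(f"{num_images} <image> placeholders but {len(images)} images")
    if num_videos != len(videos):
        raise ValueError(f"{num_videos} <video> placeholders but {len(videos)} videos")

example_prompts = [
    "USER: <video>\nWhy is this video funny? ASSISTANT:",
    "USER: <image>\nWhat do you see in this image? ASSISTANT:",
    "USER: <image>\nWhat color is the sign? ASSISTANT:",
]
# Strings stand in for the real PIL images and frame arrays
check_visual_inputs(example_prompts, images=["snowman", "stop_sign"], videos=["baby_clip"])
print(count_placeholders(example_prompts))  # (2, 1)
```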





In [None]:
inputs = processor([prompt, prompt_image, prompt_2_image], images=[image_snowman, image_stop], videos=[clip_baby], padding=True, return_tensors="pt").to(model.device)

In [None]:
generate_kwargs = {"max_new_tokens": 40, "do_sample": True, "top_p": 0.9}

output = model.generate(**inputs, **generate_kwargs)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
print(generated_text)

In [None]:
# For multi-turn conversations just continue stacking up messages in the chat template
conversation_multiturn = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this video?"},
              {"type": "video"},
              ],
      },
      {
          "role": "assistant",
          "content": [
              {"type": "text", "text": "I see a baby reading a book."},
              ],
      },
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "Why is it funny?"},
              ],
      },
]

prompt_multiturn = processor.apply_chat_template(conversation_multiturn, add_generation_prompt=True)
print(prompt_multiturn)

USER: <video>
What do you see in this video? ASSISTANT: I see a baby reading a book. USER: Why is it funny? ASSISTANT:
