<a href="https://colab.research.google.com/github/LeoneFabio/Egocentric-Vision/blob/main/LLaVa_NeXT_Video_demo_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running LLaVa-NeXT-Video: a large multi-modal model on Google Colab

LLaVa-NeXT-Video is a Large Vision-Language Model that enables interaction with videos and images. It builds on an earlier series of models, [LLaVa-NeXT](https://huggingface.co/docs/transformers/main/en/model_doc/llava_next), which was trained exclusively on image-text data. The architecture is the same as in LLaVa-NeXT: a decoder-based text model that takes vision hidden states concatenated with text hidden states as input.


<img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">


LLaVA-NeXT shows surprisingly strong performance in understanding video content thanks to the AnyRes technique it uses. AnyRes represents a high-resolution image as multiple smaller images. The technique generalizes naturally to videos, because a video can be treated as a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT for videos has several improvements:

- LLaVA-Next-Video, with supervised fine-tuning (SFT) on video data on top of LLaVA-Next, achieves better video understanding and is the current SOTA among open-source models on the [VideoMME benchmark](https://arxiv.org/pdf/2405.21075).
- LLaVA-Next-Video-DPO, which aligns the model's responses with AI feedback using direct preference optimization (DPO), shows a further performance boost.

- Transformers docs: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next_video
- Project page: https://github.com/LLaVA-VL/LLaVA-NeXT



First, we need to install the latest `transformers` from `main`, as the model has just been added. We'll also install `bitsandbytes` to load the model in lower precision for [memory efficiency](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

In [2]:
!pip install --upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-c_dwq3yx
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-c_dwq3yx
  Resolved https://github.com/huggingface/transformers.git to commit 42865860ec6dc135972d9555753cb7ee17f51fb4
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
# we need av to be able to read the video
!pip install -q av

## Load the model

Next, we load a model and corresponding processor from the hub.

We will specify a quantization config to load the model in 4 bits. Please refer to this [guide](https://huggingface.co/blog/4bit-transformers-bitsandbytes) for more details.
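As a rough sanity check on why 4-bit loading helps, the weight footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch, not a measurement: the parameter count of 7e9 is an assumption read off the "7B" in the checkpoint name, `estimate_weight_memory_gib` is a hypothetical helper, and the estimate ignores activations, the KV cache, quantization constants, and any layers that stay in higher precision.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3

# Assumption: roughly 7 billion parameters, going by the "7B" in the checkpoint name.
NUM_PARAMS = 7e9

fp16_gib = estimate_weight_memory_gib(NUM_PARAMS, 16)
int4_gib = estimate_weight_memory_gib(NUM_PARAMS, 4)
print(f"fp16 weights: ~{fp16_gib:.1f} GiB, 4-bit weights: ~{int4_gib:.1f} GiB")
```

Actual usage on Colab will be higher than the 4-bit figure, so treat it as a lower bound when picking a GPU runtime.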

In [4]:
from transformers import BitsAndBytesConfig, LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    quantization_config=quantization_config,
    device_map='auto'
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Preparing the video and image inputs

To read the video we'll use `av` and sample 8 frames. You can sample more frames if the video is long. The model was trained with 32 frames, but it can ingest more as long as the total stays within the LLM backbone's maximum sequence length.

In [5]:
import av
import numpy as np

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

In [6]:
import os
import json

def order_videos_by_clip_uid_and_prediction(json_file_path, video_folder_path):
    # Return clip paths in the same order as the queries in the JSON file (None where no clip matches).
    with open(json_file_path, 'r') as file:
        data = json.load(file)

    queries_info = [
        {"clip_uid": query["clip_uid"], "best_prediction": query["best_prediction"]}
        for query in data.get("top_50_queries", [])
    ]
    video_files = [f for f in os.listdir(video_folder_path) if f.endswith('.mp4')]

    video_mapping = {}
    for video in video_files:
        video_name = os.path.splitext(video)[0]

        video_mapping[video_name] = os.path.join(video_folder_path, video)

    ordered_videos = []
    for query_info in queries_info:

        best_prediction_int = [int(val) for val in query_info["best_prediction"]]
        combined_name = f"{query_info['clip_uid']}_{best_prediction_int[0]}_{best_prediction_int[1]}"

        if combined_name in video_mapping:
            ordered_videos.append(video_mapping[combined_name])
        else:
            ordered_videos.append(None)
    return ordered_videos

json_file_path = "/content/top_50_queries.json"
video_folder_path = "/content/cut_clips"

ordered_videos = order_videos_by_clip_uid_and_prediction(json_file_path, video_folder_path)
print(ordered_videos)


['/content/cut_clips/5726971c-b3cc-43ed-8071-f6ee143e417d_0_14.mp4', '/content/cut_clips/5d531ac1-010a-4e67-ba1a-96e485b14968_0_37.mp4', '/content/cut_clips/1731de62-b1b9-4b84-bd55-412cd67e9b3c_0_18.mp4', '/content/cut_clips/9093c341-5f95-4402-a2dd-6a84aa42bf1f_0_45.mp4', '/content/cut_clips/ab094ea2-9251-4f10-945b-c2ab00c5282e_0_7.mp4', '/content/cut_clips/b8654118-84a4-4167-83c9-f268cc15f7b2_0_26.mp4', '/content/cut_clips/75d3fc52-3776-47d4-b7fd-8074d30b06d1_131_180.mp4', '/content/cut_clips/4ba774a8-cd2a-4889-9971-cc91f5c1afd4_11_15.mp4', '/content/cut_clips/1731de62-b1b9-4b84-bd55-412cd67e9b3c_41_56.mp4', '/content/cut_clips/88dcb32f-a537-47de-b3bf-f9149352bbb9_232_242.mp4', '/content/cut_clips/3672773c-6ff8-47c2-9ef9-bb00c65814ef_0_67.mp4', '/content/cut_clips/cab983c1-d36e-4afa-8116-1e2bde4a4a4c_101_116.mp4', '/content/cut_clips/2237fc47-e8c9-4751-9b02-6189913b4b4d_0_3.mp4', '/content/cut_clips/49931037-b822-4c7b-baf4-4626c1e6b6ea_41_153.mp4', '/content/cut_clips/96453857-2454-41
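The file names in the output above follow the `<clip_uid>_<start>_<end>.mp4` pattern that the ordering function reconstructs from each query. A minimal sketch of that naming step in isolation; `clip_file_stem` is a hypothetical helper, and the prediction values passed to it are made up for illustration:

```python
def clip_file_stem(clip_uid: str, best_prediction) -> str:
    """Build the expected file name (without extension) for a predicted segment."""
    # int() truncates toward zero, so 14.9 becomes 14 rather than rounding to 15.
    start, end = (int(value) for value in best_prediction[:2])
    return f"{clip_uid}_{start}_{end}"

# Hypothetical prediction values; only the resulting pattern matters here.
stem = clip_file_stem("5726971c-b3cc-43ed-8071-f6ee143e417d", [0.0, 14.9])
print(stem + ".mp4")  # 5726971c-b3cc-43ed-8071-f6ee143e417d_0_14.mp4
```

Because the lookup is an exact string match, a clip cut with rounded rather than truncated boundaries would not be found and would show up as `None` in the list.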

In [14]:
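clips = []
for video in ordered_videos:
    if video is None:
        # order_videos_by_clip_uid_and_prediction returns None when no clip file matches a query
        raise FileNotFoundError("A query has no matching clip in the video folder")
    container = av.open(video)
    # uniformly sample 8 frames from the video (we can sample more for longer videos)
    total_frames = container.streams.video[0].frames
    indices = np.arange(0, total_frames, total_frames / 8).astype(int)
    clips.append(read_video_pyav(container, indices))
    container.close()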
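The uniform sampling step can also be written with integer arithmetic only, which avoids a floating-point step and makes the edge cases explicit. This is a sketch under stated assumptions: `uniform_frame_indices` is a hypothetical helper (not part of PyAV or `transformers`), and it caps the number of samples for clips shorter than the requested count instead of repeating indices.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 8) -> list[int]:
    """Return up to `num_samples` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0:
        # Some containers report 0 frames when the count is unknown.
        raise ValueError("total_frames must be positive")
    num_samples = min(num_samples, total_frames)
    return [i * total_frames // num_samples for i in range(num_samples)]

print(uniform_frame_indices(80))  # [0, 10, 20, 30, 40, 50, 60, 70]
print(uniform_frame_indices(5))   # [0, 1, 2, 3, 4]
```

The returned list can be passed to `read_video_pyav` in place of the NumPy array, since that function only reads the first and last entries and tests membership.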



In [15]:
from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

# np array with shape (frames, height, width, channels)
video = clips[0]

fig = plt.figure()
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0],interval=100)
HTML(anim.to_html5_video())

## Prepare a prompt and generate

In the prompt, you can refer to the video using the special `<video>` token (or `<image>` for images). To indicate which text comes from the human and which from the model, this checkpoint uses `USER` and `ASSISTANT` respectively (other checkpoints may expect a different format). The format looks as follows:

`USER: <video>\n<prompt> ASSISTANT:`


In other words, you always need to end your prompt with `ASSISTANT:`.


Manually adding `USER` and `ASSISTANT` to your prompt can be error-prone, since each checkpoint expects its own prompt format depending on the backbone language model. Luckily, we can use `apply_chat_template` to make it easier.

Chat templates are special templates written in Jinja and added to the model's config. Whenever we call `apply_chat_template`, the Jinja template is filled in with your text instruction.

To use a chat template, simply build a list of messages with `role` and `content` keys and pass it to the `apply_chat_template()` method. The result is a prompt string that is ready to pass to the processor. When using chat templates as input for model generation, it's also a good idea to set `add_generation_prompt=True` so that a generation prompt is appended. See [the docs](https://huggingface.co/docs/transformers/main/en/chat_templating) for more details.
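To make the target format concrete, here is a minimal sketch that builds the same string by hand for this checkpoint only. `build_prompt` is a hypothetical helper written for illustration; it is not how the processor implements templating, and it would produce the wrong prompt for checkpoints with a different backbone.

```python
def build_prompt(question: str, media_token: str = "<video>") -> str:
    """Hand-written equivalent of the USER/ASSISTANT format described above."""
    return f"USER: {media_token}\n{question} ASSISTANT:"

prompt = build_prompt("Where is the soap?")
print(repr(prompt))  # 'USER: <video>\nWhere is the soap? ASSISTANT:'
```

Comparing a hand-built string like this against the output of `apply_chat_template` is a quick way to check that the template does what you expect.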

In [16]:
import json

json_file_path = '/content/top_50_queries.json'
with open(json_file_path, 'r') as file:
    data = json.load(file)

conversations = []
prompts = []
for query in data.get("top_50_queries", []):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query["query"]},
                {"type": "video"}
            ]
        }
    ]
    conversations.append(conversation)
    # Apply the processor's chat template
    prompts.append(processor.apply_chat_template(conversation, add_generation_prompt=True))





In [17]:
print(prompts)

['USER: <video>\nWhere is the soap? ASSISTANT:', 'USER: <video>\nWhat vegetables did I chop?  ASSISTANT:', 'USER: <video>\nWhat food did I eat? ASSISTANT:', 'USER: <video>\nWhat did I put in the bowl? ASSISTANT:', 'USER: <video>\nwhere did i put my phone? ASSISTANT:', 'USER: <video>\nWhat tool did I use on the machine first? ASSISTANT:', 'USER: <video>\nWhat did I put in the dish? ASSISTANT:', 'USER: <video>\nWhere was the egg before I picked it? ASSISTANT:', 'USER: <video>\nWhat did I pour in the mug? ASSISTANT:', 'USER: <video>\nWho did I talk to at the workshop? ASSISTANT:', 'USER: <video>\nwhat did l put in the bucket? ASSISTANT:', 'USER: <video>\nIn what location did i see the car? ASSISTANT:', 'USER: <video>\nWhere is round brush before picked? ASSISTANT:', 'USER: <video>\nWhere did I wash the rice? ASSISTANT:', 'USER: <video>\nWhat did I take from the fridge? ASSISTANT:', 'USER: <video>\nwhere did  put the clothe before l hanged them? ASSISTANT:', 'USER: <video>\nHow many doughs

In [18]:

inputs = []
for prompt, clip in zip(prompts, clips):
    inputs.append(processor([prompt], videos=[clip], padding=True, return_tensors="pt").to(model.device))

In [19]:
generate_kwargs = {"max_new_tokens": 500, "do_sample": True, "top_p": 0.9}
generated_text = []
for model_inputs in inputs:  # renamed to avoid shadowing the built-in `input`
    output = model.generate(**model_inputs, **generate_kwargs)
    generated_text.append(processor.batch_decode(output, skip_special_tokens=True))

In [20]:
print(generated_text)

[['USER: \nWhere is the soap? ASSISTANT: The soap is on top of the sink counter, near the faucet.'], ['USER: \nWhat vegetables did I chop?  ASSISTANT: The vegetables you cut in the image are not identifiable, but based on the context and common culinary practices, they appear to be green onions or possibly herbs, as they are sliced up and chopped for use in cooking.'], ["USER: \nWhat food did I eat? ASSISTANT: Based on the image provided, it appears that you ate a meal consisting of what looks like pasta with some form of meat sauce and perhaps vegetables mixed in. The pasta appears to be a type of pasta noodle and there's a dish that looks like meat covered with sauce, possibly tomato-based, and there might be other ingredients that are not visible due to the view from above. The meal seems to be fairly hearty and could be a quick-and-easy, possibly a type of comfort food that is popular in many cuisines around the world. If you have any specific questions about the dish or the meal, 

In [22]:
import os
import json

# Format the generated text, showing only the video file name
def format_output(outputs, video_paths, output_json_path="formatted_output.json"):
    formatted = ""
    counter = 1  # Global counter for numbering the responses
    json_results = []  # List of results to save as JSON

    for response_group, video_path in zip(outputs, video_paths):  # Pair each group of responses with its video
        # Keep only the file name from the path
        video_name = os.path.basename(video_path)

        for response in response_group:
            try:
                # Split the string into the USER and ASSISTANT parts
                # (split once, so an answer that itself contains "ASSISTANT:" is still parsed)
                user_part, assistant_part = response.split("ASSISTANT:", 1)
                user_question = user_part.replace("USER:", "").strip()
                assistant_answer = assistant_part.strip()

                # Append the formatted entry to the text
                formatted += (
                    f"---\n\n**{counter}. VIDEO:** {video_name}\n\n"
                    f"**USER:**\n{user_question}\n\n"
                    f"**ASSISTANT:**\n{assistant_answer}\n\n"
                )

                # Append the entry to the JSON results
                json_results.append({
                    "id": counter,
                    "video_name": video_name,
                    "user_question": user_question,
                    "assistant_answer": assistant_answer
                })

                counter += 1
            except ValueError:
                # If the format is not as expected, add a warning to the text and the JSON
                formatted += (
                    f"---\n\n**{counter}. VIDEO:** {video_name}\n\n"
                    f"**ERROR:**\nResponse format invalid: {response}\n\n"
                )

                json_results.append({
                    "id": counter,
                    "video_name": video_name,
                    "error": f"Response format invalid: {response}"
                })

                counter += 1

    # Save the results to a JSON file
    with open(output_json_path, "w") as json_file:
        json.dump(json_results, json_file, indent=4)

    return formatted

# Example usage with the list of video paths
formatted_output = format_output(generated_text, ordered_videos, output_json_path="formatted_output.json")

# Print the formatted result
print(formatted_output)


---

**1. VIDEO:** 5726971c-b3cc-43ed-8071-f6ee143e417d_0_14.mp4

**USER:**
Where is the soap?

**ASSISTANT:**
The soap is on top of the sink counter, near the faucet.

---

**2. VIDEO:** 5d531ac1-010a-4e67-ba1a-96e485b14968_0_37.mp4

**USER:**
What vegetables did I chop?

**ASSISTANT:**
The vegetables you cut in the image are not identifiable, but based on the context and common culinary practices, they appear to be green onions or possibly herbs, as they are sliced up and chopped for use in cooking.

---

**3. VIDEO:** 1731de62-b1b9-4b84-bd55-412cd67e9b3c_0_18.mp4

**USER:**
What food did I eat?

**ASSISTANT:**
Based on the image provided, it appears that you ate a meal consisting of what looks like pasta with some form of meat sauce and perhaps vegetables mixed in. The pasta appears to be a type of pasta noodle and there's a dish that looks like meat covered with sauce, possibly tomato-based, and there might be other ingredients that are not visible due to the view from above. The m

In [23]:
!pip install datasets
!pip install evaluate
!pip install rouge_score

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

### ROUGE and BLEU Metrics

In [None]:

import evaluate
from nltk.translate.bleu_score import sentence_bleu

# Extract the assistant's response
def extract_assistant_response(generated_text):
    """
    Extract only the assistant's response from the generated text.
    We assume the format is 'USER: ... ASSISTANT: ...'.
    """
    if "ASSISTANT:" in generated_text:
        return generated_text.split("ASSISTANT:")[1].strip()
    return generated_text.strip()

# Extract all responses from the list of lists
assistant_responses = []
for response_group in generated_text:
    for response in response_group:
        assistant_responses.append(extract_assistant_response(response))

# Ground truth
reference_text = [
    "The humor in this video stems from the unexpected and unpredictable nature of the baby's actions.",
    "In the video, there is a person wearing a white karate gi performing a technique.",
    "In this video there is a green garden in a park, with green grass and a lot of trees, the sun is shining adn sky is blue"
]

def sentence_bleu_score(candidate, references):
    references = [ref.split() for ref in references]  # Split into words
    candidate = candidate.split()
    return sentence_bleu(references, candidate)

def sacre_bleu_score(candidate, references):
    # `load_metric` was removed from recent `datasets` releases; `evaluate.load` replaces it.
    # The metric needs the `sacrebleu` package and takes raw strings, not token lists.
    sacrebleu = evaluate.load("sacrebleu")
    return sacrebleu.compute(predictions=[candidate], references=[references])

# Use evaluate for ROUGE
rouge_metric = evaluate.load("rouge")

def calculate_rouge_score(candidate, references):
    results = rouge_metric.compute(predictions=[candidate], references=[references])
    return results

# Compute the scores and format the results
formatted_results = []
for response in assistant_responses:
    bleu_score = sentence_bleu_score(response, reference_text)
    rouge_score = calculate_rouge_score(response, reference_text)

    formatted_results.append({
        "result": response,
        "bleu_score": bleu_score,
        "rouge_score": rouge_score["rouge1"]
    })

# Print the results
for i, res in enumerate(formatted_results, start=1):
    print(f"---\n**{i}. Result:**\n{res['result']}\n")
    print(f"**BLEU Score:** {res['bleu_score']:.4f}")
    print(f"**ROUGE Score:** {res['rouge_score']:.4f}\n")



The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
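The warnings above follow from how BLEU combines n-gram precisions: the score is a geometric mean, so a single order with zero overlap drives the whole score to zero. The sketch below illustrates this with a simplified BLEU-style score (single reference, no brevity penalty) and a crude floor on zero precisions. It is not NLTK's implementation or its `SmoothingFunction`; the helper names, the example sentences, and the floor value are assumptions chosen for illustration.

```python
import math
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate token list against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def geometric_mean_precision(candidate, reference, max_n=4, floor=0.0):
    """Geometric mean of 1..max_n-gram precisions; `floor` replaces smaller values."""
    precisions = [max(ngram_precision(candidate, reference, n), floor)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # one empty order is enough to zero the product
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the soap is on the sink".split()
reference = "the soap is near the faucet".split()

unsmoothed = geometric_mean_precision(candidate, reference)
smoothed = geometric_mean_precision(candidate, reference, floor=0.1)
print(f"unsmoothed: {unsmoothed:.4f}, with floor: {smoothed:.4f}")
```

Here the candidate shares unigrams, bigrams, and one trigram with the reference but no 4-gram, so the unsmoothed score is exactly zero while the floored one is not. With NLTK, the corresponding options are the `weights` and `smoothing_function` arguments of `sentence_bleu`.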


---
**1. Result:**
The soap is placed on the edge of the sink, on top of the faucet area, and it appears to be liquid soap. The person seems to be cleaning the faucet or spigot with the soap.

**BLEU Score:** 0.0000
**ROUGE Score:** 0.1887

---
**2. Result:**
In the image provided, you can see that the vegetables being chopped are onions, and it appears that some carrots may also be being cut as well, given the presence of a small portion on the countertop. The onions are being cut on a cutting board or a countertop, possibly to be used in a dish or meal preparation.

**BLEU Score:** 0.0000
**ROUGE Score:** 0.1860

---
**3. Result:**
Based on the image you've provided, it appears that you are eating a meal that includes what looks like a piece of chicken, possibly accompanied by some rice or a rice-based dish, and possibly some vegetables. There are also slices of tomato and a serving of what might be a grilled or fried food item, possibly a type of meat or a vegetable, served alongsid