🌟 **Exquisitely Crafted By - [OEvortex](https://www.youtube.com/channel/@OEvortex)**

🚀 Have questions or feedback about this video? Reach out to OEvortex via the social links below. Your insights and queries are greatly appreciated!

- **YouTube Channel**: [@OEvortex](https://www.youtube.com/@OEvortex)
- **Telegram Group**: [Telegram](https://t.me/vortexcodebase)
- **Discord Server**: [Join the Community on Discord](https://discord.gg/DugWefkN5Z)


----------------------------------------------------------------------
🌟 **The Video-LLaVA model** is an open-source multimodal model produced by fine-tuning a large language model (LLM) on multimodal instruction-following data. It is an auto-regressive language model built on the transformer architecture. The base LLM used for fine-tuning is lmsys/vicuna-13b-v1.5.

**Model Description:**
- The Video-LLaVA model's key feature is its ability to generate responses for interleaved images and videos, even though the training data contains no image-video pairs.
- This is achieved by aligning image and video features into a unified visual representation before projecting them into the LLM's input space.
- The authors' experiments indicate that the two modalities complement each other: the jointly trained model outperforms models designed solely for images or solely for videos.

**Training Dataset:**
- The image pretraining and tuning datasets are sourced from LLaVA.
- The video pretraining dataset is from Valley, while the video tuning dataset is from Video-ChatGPT.

**Getting Started with the Model:**
- To begin using the model, follow the code cells below, which import the essential libraries: PIL, av, transformers, and huggingface_hub.
- The code includes a function that reads video frames with the PyAV decoder, plus cells that generate responses to prompts containing video or image inputs.

**Acknowledgments:**
- The Video-LLaVA authors credit LLaVA as the codebase they built on and acknowledge Video-ChatGPT for contributing evaluation code and the dataset.

**License and Citation:**
- The majority of the project is released under the Apache 2.0 license for non-commercial use, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT.
- Researchers are encouraged to cite and acknowledge the model and code if found beneficial for research purposes.

For comprehensive details, including the paper and resources, please visit the GitHub repository: [Video-LLaVA GitHub](https://github.com/PKU-YuanGroup/Video-LLaVA). 🚀✨
----------------------------------------------------------------------

## Step 1: Install necessary packages

In [None]:
# Install the ffmpeg package for handling multimedia data such as audio and video files
%pip install ffmpeg

# Install the Pillow package for image processing tasks such as opening, manipulating, and saving image files
%pip install pillow

# Install the transformers package, which provides the Video-LLaVA model and processor classes
%pip install -U transformers
# Install sentencepiece, the tokenizer library used by LLaMA-family tokenizers
%pip install -U sentencepiece
# Install the huggingface_hub package for accessing the Hugging Face model hub for pre-trained models
%pip install huggingface_hub

# Install the av package for working with audio and video data in Python
%pip install av

## Step 2: Importing packages

In [None]:
from PIL import Image
import requests
import av
import numpy as np
from huggingface_hub import hf_hub_download  # Only needed if downloading video or image from Hugging Face (specific to this notebook)
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

## Step 3: Define the frame-reading function
⚙️ The `read_video_pyav` function reads selected frames from a video using the PyAV library. It takes an open video container and a sorted list of frame indices as input.

### Function Summary:
1. It initializes an empty list to store the selected frames and rewinds the container to the start.
2. It takes the first and last elements of `indices` as the start and end of the range to decode.
3. It iterates over the decoded frames, stops once it passes the end index, and keeps only the frames whose position appears in `indices`.
4. Finally, it converts each kept frame to RGB and returns them stacked in a NumPy array of shape `(num_frames, height, width, 3)`.

🚀 Because decoding stops at the last requested index, the function never decodes more of the video than it needs. 🎥

In [None]:
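def read_video_pyav(container, indices):
    """Decode the video with PyAV and return the frames at `indices` as an RGB array."""
    frames = []
    container.seek(0)  # rewind to the start of the video
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break  # past the last requested frame, so stop decoding
        if i >= start_index and i in indices:
            frames.append(frame)
    # Stack into one array of shape (num_frames, height, width, 3)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])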
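
To check the selection logic without a real video file, the sketch below runs `read_video_pyav` against stand-in objects. `FakeFrame` and `FakeContainer` are hypothetical helpers invented for this illustration; they mimic only the `seek`, `decode`, and `to_ndarray` calls that the function makes and are not part of PyAV.

```python
import numpy as np


class FakeFrame:
    """Hypothetical stand-in for a decoded video frame; it only knows its index."""

    def __init__(self, index, height=4, width=6):
        self.index = index
        self.height = height
        self.width = width

    def to_ndarray(self, format="rgb24"):
        # Fill the frame with its own index so selected frames are easy to identify.
        return np.full((self.height, self.width, 3), self.index, dtype=np.uint8)


class FakeContainer:
    """Hypothetical stand-in exposing only the methods read_video_pyav calls."""

    def __init__(self, num_frames):
        self.num_frames = num_frames

    def seek(self, offset):
        pass  # nothing to rewind in this fake

    def decode(self, video=0):
        for i in range(self.num_frames):
            yield FakeFrame(i)


def read_video_pyav(container, indices):
    # Same logic as the notebook cell above.
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


clip_demo = read_video_pyav(FakeContainer(32), [0, 4, 8, 12, 16, 20, 24, 28])
print(clip_demo.shape)                   # (8, 4, 6, 3)
print(clip_demo[:, 0, 0, 0].tolist())    # [0, 4, 8, 12, 16, 20, 24, 28]
```

The first axis of the result indexes the selected frames, which is the layout the later cells pass to the processor as `clip`.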

## Download model and processor

In [None]:
model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

## Defining prompt and video_path

In [None]:
prompt = "USER: <video>What is this video about? ASSISTANT:"
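
The prompt follows a fixed pattern: `USER:`, a `<video>` or `<image>` placeholder, the question, then `ASSISTANT:`. As a minimal sketch, a small helper can build such strings so the pattern is not retyped for every question. `build_prompt` is a hypothetical name introduced here for illustration, not part of transformers.

```python
def build_prompt(question, media="video"):
    """Format a single-turn prompt in the "USER: <tag>question ASSISTANT:" style.

    Hypothetical helper for illustration; `media` must be "video" or "image".
    """
    if media not in ("video", "image"):
        raise ValueError(f"media must be 'video' or 'image', got {media!r}")
    return f"USER: <{media}>{question} ASSISTANT:"


print(build_prompt("What is this video about?"))
# USER: <video>What is this video about? ASSISTANT:
print(build_prompt("What is in this image?", media="image"))
# USER: <image>What is in this image? ASSISTANT:
```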

In [None]:
video_path = "/teamspace/studios/this_studio/mixkit-person-watering-a-small-plant-by-hand-33422-medium.mp4"
container = av.open(video_path)

## Step 4: Processing Inputs
🔢 In this step, the `processor` combines the text prompt with the video frames stored in the `clip` variable and returns them as PyTorch tensors (`return_tensors="pt"`).

### What does this cell do?
- 🎥 **Sampling Frames:** The code reads the total number of frames from the video stream and generates 8 evenly spaced frame indices.
- 🔍 **Reading Video Frames:** It extracts the selected frames using the `read_video_pyav` function and stores them in the `clip` variable.
- 🚀 **Processing Inputs:** It passes the text prompt and the video frames to the `processor`, which returns the tensors the model expects.

🌟 Together, these lines turn the video file and the text prompt into model-ready inputs.


In [None]:

total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

inputs = processor(text=prompt, videos=clip, return_tensors="pt")
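
The index computation above assumes the stream reports a positive frame count and that the video has at least 8 frames. The sketch below is one possible variant that makes those assumptions explicit; `sample_frame_indices` is a hypothetical helper name, and it uses `np.linspace` where the cell above uses `np.arange`.

```python
import numpy as np


def sample_frame_indices(total_frames, num_frames=8):
    """Return up to `num_frames` evenly spaced, unique frame indices.

    Hypothetical helper for illustration, not part of PyAV or transformers.
    """
    if total_frames <= 0:
        # Some containers do not store a frame count, in which case 0 is reported.
        raise ValueError("total_frames must be positive; the stream may not report its length")
    # endpoint=False keeps every index strictly below total_frames.
    indices = np.linspace(0, total_frames, num=num_frames, endpoint=False).astype(int)
    # Short videos can map several sample points to the same frame; drop the repeats.
    return np.unique(indices)


print(sample_frame_indices(240).tolist())  # [0, 30, 60, 90, 120, 150, 180, 210]
print(sample_frame_indices(3).tolist())    # [0, 1, 2]
```

For a video shorter than `num_frames`, this returns fewer indices, so the resulting clip would contain fewer than 8 frames.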


## Step 5: Generating Output
🔮 In this step, the model's `generate` method produces output token IDs from the prepared `inputs`. Note that `max_length=80` caps the total sequence length, which counts the prompt tokens as well as the newly generated ones; `max_new_tokens` is the alternative parameter for limiting only the generated tokens.

### What does this cell do?
- 🚀 **Generating Output:** `model.generate(**inputs, max_length=80)` returns the generated token IDs.
- 📄 **Decoding Output:** `processor.batch_decode` converts the IDs back to text, skipping special tokens and leaving tokenization spaces unchanged (`clean_up_tokenization_spaces=False`).

🌟 This completes inference: the text prompt and video frames go in, and the model's text response comes out.


In [None]:
# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
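
The decoded string contains the prompt as well as the model's reply, because `generate` returns the prompt tokens followed by the new tokens. A minimal sketch for keeping only the reply, assuming the `ASSISTANT:` marker from the prompt format used in this notebook; `extract_answer` is a hypothetical helper, and the sample string is made up for the demonstration rather than real model output.

```python
def extract_answer(decoded_text, marker="ASSISTANT:"):
    """Return the text after the last `marker`, or the whole text if it is absent.

    Hypothetical helper for illustration.
    """
    _, found, answer = decoded_text.rpartition(marker)
    return answer.strip() if found else decoded_text.strip()


# Made-up decoded string, used only to demonstrate the helper.
sample = "USER: What is this video about? ASSISTANT: A person waters a small plant."
print(extract_answer(sample))  # A person waters a small plant.
```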

## Generate a response from a mix of images and videos

In [None]:
# Generate from images and videos mix
image_path = r"/teamspace/studios/this_studio/photo_2024-03-14_12-55-17.jpg"
image = Image.open(image_path)
# Alternatively, load an image from a URL:
# url = "http://images.cocodataset.org/val2017/000000039769.jpg"
# image = Image.open(requests.get(url, stream=True).raw)
prompt = [
    "USER: <image>What is in this image? ASSISTANT:",
    "USER: <video>What is this video about? ASSISTANT:"
]
inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))