Video2LLM

Transform video content into a format that LLMs can understand.

Video2LLM converts video frames into a single, comprehensive image, enabling you to ask a Visual LLM questions about the video.
Since LLMs can process images as input, the tool packages your video as a sequence of frames in one image, which the model can then analyze to answer questions about the video's content.

This project aims to make it simple to use video content with any LLM by exporting the image. However, you can also ask questions directly using the video_gpt.py script.

This project is experimental, and I’m actively researching and refining the approach.

Feedback and suggestions are encouraged.
I'm also preparing a video on this topic, which you'll find on my YouTube channel.

I have since been made aware of https://cookbook.openai.com/examples/gpt_with_vision_for_video_understanding, which takes a similar approach.
It's good to have validation, though that cookbook may be more reliable than this project.


🎥 Sample

Here's a video of a book with pages being turned by the wind:

(Sample clip)
Source video on Pixabay

Let's ask a Visual LLM: "What direction is the wind blowing?"
(LLM output screenshot)

💬 Why This Question?

This question requires understanding the flow of the video: the wind's direction can only be inferred from the sequence of events across frames, not from any single frame.

📼 How is the Video Processed?

Video2LLM processes the video by sampling frames at a specified rate, resizing them, and concatenating them into a single image. This image represents the flow of the video, making it possible for the LLM to analyze and respond accurately.

(Generated output image)

This image can also be used in any other visual LLM.
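
For reference, here is a minimal sketch of this sampling-and-concatenation step, using the OpenCV and Pillow packages listed in the requirements. The function name, the horizontal strip layout, and the resize width are illustrative assumptions, not the actual convert_video.py implementation:

    # Illustrative sketch only; see convert_video.py for the real implementation.
    import cv2
    from PIL import Image

    def video_to_strip(video_path, max_frames=20, fps_sampling=10, frame_width=256):
        cap = cv2.VideoCapture(video_path)
        video_fps = cap.get(cv2.CAP_PROP_FPS) or fps_sampling
        step = max(1, round(video_fps / fps_sampling))  # keep every Nth frame

        frames = []
        index = 0
        while len(frames) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                # OpenCV reads frames as BGR; convert to RGB for Pillow.
                image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                scale = frame_width / image.width
                image = image.resize((frame_width, int(image.height * scale)))
                frames.append(image)
            index += 1
        cap.release()
        if not frames:
            raise ValueError("No frames could be read from the video")

        # Concatenate the resized frames into one image (a single row here).
        strip = Image.new("RGB", (sum(f.width for f in frames), frames[0].height))
        x = 0
        for f in frames:
            strip.paste(f, (x, 0))
            x += f.width
        return strip

    video_to_strip("/path/to/video.mp4").save("output_image.jpg")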

💬 The Prompt Used

The generated image was then sent to a Visual Large Language Model (LLM) with the following prompt:

You are observing a video. First, provide a brief sentence that explains what you observe in relation to the question. Then, answer the question directly. The input should be treated as a video.

Question: What direction is the wind blowing?

See video_gpt.py for the full implementation.
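
For illustration, here is a hedged sketch of how the image and prompt could be sent to an OpenAI vision-capable model using the official Python SDK. The model name and message layout are assumptions for this example, not necessarily what video_gpt.py does:

    # Illustrative sketch only; see video_gpt.py for the real implementation.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("output_image.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "You are observing a video. First, provide a brief sentence that "
        "explains what you observe in relation to the question. Then, answer "
        "the question directly. The input should be treated as a video.\n\n"
        "Question: What direction is the wind blowing?"
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; this choice is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)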

⚙️ Setup

📋 Requirements

  • Python 3.x
  • Required Python packages: typer, opencv-python-headless, Pillow, openai

🛠 Installation

  1. Clone the repository:

    git clone https://github.com/DiogoNeves/Video2LLM.git
    cd Video2LLM
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set the OpenAI API Key: Make sure you have your OpenAI API key set as an environment variable:

    export OPENAI_API_KEY="your_openai_api_key"
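
Before running the scripts, you can sanity-check that Python can see the key (a quick check, not part of the project):

    python -c "import os; print('OPENAI_API_KEY set:', bool(os.getenv('OPENAI_API_KEY')))"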

🚀 Usage

Basic Usage

  1. Ask a Question About the Video
    Use the video_gpt.py script to ask a Visual LLM a question about your video. The script converts the video into an image and sends it to the LLM, which responds based on the video content.

    python video_gpt.py /path/to/video.mp4 "What is happening in this video?"
  2. Create Images to Use in Other Models
    Use the convert_video.py script to generate an image representing the video frames. This image can then be used with any other Visual LLM.

    python convert_video.py /path/to/video.mp4 --output output_image.jpg

Advanced Usage

  • video_gpt.py
    Convert a video to an image and ask a question about the content:

    python video_gpt.py /path/to/video.mp4 "Describe the actions in this video." --max-frames 30 --fps-sampling 5
  • convert_video.py
    Generate an image from a video:

    python convert_video.py /path/to/video.mp4 --output output_image.jpg --max-frames 30 --fps-sampling 5

Arguments:

  • --max-frames: Maximum number of frames to extract from the video. Increasing this value allows processing a longer segment of the video.
  • --fps-sampling: Frames per second to sample from the video. Lowering this value captures a longer segment of the video with fewer frames.

⚠️ Important Considerations

  • Video Duration: The duration of video that can be processed depends on the --max-frames and --fps-sampling settings. The default configuration processes 2 seconds of video (20 frames at 10 fps); see the quick calculation after this list.

  • Model Context Size: Not all videos will fit within the context size of the model. Longer videos or higher frame rates may produce images too large to be fully processed by some LLMs. Adjust the parameters accordingly to ensure the output image is suitable for your model's context window.
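
As a back-of-the-envelope check, the covered duration is simply the frame budget divided by the sampling rate:

    # Approximate seconds of video covered by the sampled frames.
    max_frames, fps_sampling = 20, 10   # the defaults
    print(max_frames / fps_sampling)    # 2.0 seconds
    print(30 / 5)                       # 6.0 seconds with --max-frames 30 --fps-sampling 5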

🤝 Contribution

I welcome suggestions and prompt improvements! If you have ideas for how to enhance the tool or ways to make the prompts more effective, feel free to share them. Your feedback is valuable to the ongoing development of Video2LLM.

📬 How to Reach Me

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
