Video2LLM

Transform video content into a format that LLMs can understand.

Video2LLM converts video frames into a single, comprehensive image, enabling you to ask a Visual LLM questions about the video.
Since LLMs can process images as input, the tool packages your video as a sequence of frames in one image, which the model can then analyze to answer questions about the video's content.

This project aims to make it simple to use video content with any LLM by exporting the image. However, you can also ask questions directly using the video_gpt.py script.

This project is experimental, and I’m actively researching and refining the approach.

Feedback and suggestions are encouraged.
I'm also preparing a video on this topic, which you'll find on my YouTube channel.

I have since been made aware of https://cookbook.openai.com/examples/gpt_with_vision_for_video_understanding, which takes a similar approach.
It's good to have validation, though that cookbook may be more reliable than this project.


🎥 Sample

Here's a video of a book with pages being turned by the wind:

(Sample clip)
Source video on Pixabay

Let's ask a Visual LLM: "What direction is the wind blowing?"
(LLM output screenshot)

💬 Why This Question?

This question requires understanding the flow of the video: the wind's direction can only be inferred from the sequence of events across frames, not from any single frame.

📼 How is the Video Processed?

Video2LLM processes the video by sampling frames at a specified rate, resizing them, and concatenating them into a single image. This image represents the flow of the video, making it possible for the LLM to analyze and respond accurately.

(Generated output image)

This image can also be used in any other visual LLM.
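
For reference, here is a minimal sketch of this sampling-and-concatenation step, using the OpenCV and Pillow packages listed in the requirements. The function name, the horizontal strip layout, and the resize width are illustrative assumptions, not the actual convert_video.py implementation:

    # Illustrative sketch only; see convert_video.py for the real implementation.
    import cv2
    from PIL import Image

    def video_to_strip(video_path, max_frames=20, fps_sampling=10, frame_width=256):
        cap = cv2.VideoCapture(video_path)
        video_fps = cap.get(cv2.CAP_PROP_FPS) or fps_sampling
        step = max(1, round(video_fps / fps_sampling))  # keep every Nth frame

        frames = []
        index = 0
        while len(frames) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                # OpenCV reads frames as BGR; convert to RGB for Pillow.
                image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                scale = frame_width / image.width
                image = image.resize((frame_width, int(image.height * scale)))
                frames.append(image)
            index += 1
        cap.release()
        if not frames:
            raise ValueError("No frames could be read from the video")

        # Concatenate the resized frames into one image (a single row here).
        strip = Image.new("RGB", (sum(f.width for f in frames), frames[0].height))
        x = 0
        for f in frames:
            strip.paste(f, (x, 0))
            x += f.width
        return strip

    video_to_strip("/path/to/video.mp4").save("output_image.jpg")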

💬 The Prompt Used

The generated image was then sent to a Visual Large Language Model (LLM) with the following prompt:

You are observing a video. First, provide a brief sentence that explains what you observe in relation to the question. Then, answer the question directly. The input should be treated as a video.

Question: What direction is the wind blowing?

See video_gpt.py for the full implementation.
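
For illustration, here is a hedged sketch of how the image and prompt could be sent to an OpenAI vision-capable model using the official Python SDK. The model name and message layout are assumptions for this example, not necessarily what video_gpt.py does:

    # Illustrative sketch only; see video_gpt.py for the real implementation.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("output_image.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "You are observing a video. First, provide a brief sentence that "
        "explains what you observe in relation to the question. Then, answer "
        "the question directly. The input should be treated as a video.\n\n"
        "Question: What direction is the wind blowing?"
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; this choice is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)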

⚙️ Setup

📋 Requirements

  • Python 3.x
  • Required Python packages: typer, opencv-python-headless, Pillow, openai

🛠 Installation

  1. Clone the repository:

    git clone https://github.com/DiogoNeves/Video2LLM.git
    cd Video2LLM
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set the OpenAI API Key: Make sure you have your OpenAI API key set as an environment variable:

    export OPENAI_API_KEY="your_openai_api_key"
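
Before running the scripts, you can sanity-check that Python can see the key (a quick check, not part of the project):

    python -c "import os; print('OPENAI_API_KEY set:', bool(os.getenv('OPENAI_API_KEY')))"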

🚀 Usage

Basic Usage

  1. Ask a Question About the Video
    Use the video_gpt.py script to ask a Visual LLM a question about your video. The script converts the video into an image and sends it to the LLM, which responds based on the video content.

    python video_gpt.py /path/to/video.mp4 "What is happening in this video?"
  2. Create Images to Use in Other Models
    Use the convert_video.py script to generate an image representing the video frames. This image can then be used with any other Visual LLM.

    python convert_video.py /path/to/video.mp4 --output output_image.jpg

Advanced Usage

  • video_gpt.py
    Convert a video to an image and ask a question about the content:

    python video_gpt.py /path/to/video.mp4 "Describe the actions in this video." --max-frames 30 --fps-sampling 5
  • convert_video.py
    Generate an image from a video:

    python convert_video.py /path/to/video.mp4 --output output_image.jpg --max-frames 30 --fps-sampling 5

Arguments:

  • --max-frames: Maximum number of frames to extract from the video. Increasing this value allows processing a longer segment of the video.
  • --fps-sampling: Frames per second to sample from the video. Lowering this value captures a longer segment of the video with fewer frames.

⚠️ Important Considerations

  • Video Duration: The duration of video that can be processed depends on the --max-frames and --fps-sampling settings. The default configuration processes 2 seconds of video (20 frames at 10 fps); see the quick calculation after this list.

  • Model Context Size: Not all videos will fit within the context size of the model. Longer videos or higher frame rates may produce images too large to be fully processed by some LLMs. Adjust the parameters accordingly to ensure the output image is suitable for your model's context window.
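
As a back-of-the-envelope check, the covered duration is simply the frame budget divided by the sampling rate:

    # Approximate seconds of video covered by the sampled frames.
    max_frames, fps_sampling = 20, 10   # the defaults
    print(max_frames / fps_sampling)    # 2.0 seconds
    print(30 / 5)                       # 6.0 seconds with --max-frames 30 --fps-sampling 5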

🤝 Contribution

I welcome suggestions and prompt improvements! If you have ideas for how to enhance the tool or ways to make the prompts more effective, feel free to share them. Your feedback is valuable to the ongoing development of Video2LLM.

📬 How to Reach Me

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
