# Libraries

In [1]:
import cv2
import numpy as np
import requests
import json
import base64
from PIL import Image
from io import BytesIO
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import pandas as pd

# Video Question Answering System using LLaVA

Our `VideoQASystem` builds on [LLaVA](https://llava-vl.github.io/), an open-source multimodal model that combines vision and language understanding. LLaVA was designed for single-image interactions, so we extend it to video by sampling frames, querying the model on each frame in turn, and combining the per-frame responses. The system shows how open-source multimodal models can relate visual content to natural language queries for video analysis.

**System Architecture**

1. **Video Processing**:
  - Samples frames at uniform intervals across the video (4 frames by default)
  - Sends each frame to LLaVA independently, together with the question
  - Combines the per-frame observations to approximate temporal understanding

2. **Question Answering Pipeline**:
  - Takes free-form questions about video content
  - Analyzes each frame with LLaVA's multimodal capabilities
  - Synthesizes observations into coherent answers using the Llama 3 LLM
  - Displays video alongside answers for reference

3. **Question Generation**:
  - Generates relevant questions from video descriptions
  - Creates diverse questions focusing on visible content
  - Automatically answers generated questions using multimodal analysis
  - Shows both video and original description
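The uniform sampling in step 1 can be illustrated on its own. The sketch below mirrors the `np.linspace` call used in `extract_frames`; the helper name `sample_frame_indices` and the guards for empty or very short videos are assumptions added for illustration, not part of the class.

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int = 4) -> list[int]:
    """Return evenly spaced frame indices, including the first and last frame."""
    if total_frames <= 0:
        return []
    # Never ask for more frames than the video contains
    num_frames = min(num_frames, total_frames)
    # dtype=int truncates the fractional positions to whole frame indices
    return np.linspace(0, total_frames - 1, num_frames, dtype=int).tolist()

# A 300-frame clip sampled at 4 positions
print(sample_frame_indices(300))  # [0, 99, 199, 299]
```

Because the endpoints are always included, the first and last frames of the clip are always analyzed, and the remaining samples fall at equal intervals between them.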

**Implementation Details**
- Uses the open-source Ollama API to access LLaVA (per-frame analysis) and Llama 3 (answer synthesis and question generation)
- Converts frames to base64 for API compatibility
- Handles both direct questions and auto-generated questions
- Provides comprehensive multimodal analysis with video display

**Key Features**
- Zero-shot multimodal understanding
- Multi-frame temporal analysis
- Natural language interaction
- Visual context preservation
- Automated question generation and answering

By processing frames sequentially and combining the per-frame observations, the system extends LLaVA's image-level understanding to video using only open-source tools, which keeps the approach accessible and easy to customize for different applications.

In [2]:
class VideoQASystem:
    def __init__(self, df, video_base_path="test_1k_compress"):
        self.df = df
        self.video_base_path = video_base_path

    def extract_frames(self, video_id: str, num_frames: int = 4):
        """Extract frames from video"""
        video_path = f"{self.video_base_path}/{video_id}.mp4"
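        cap = cv2.VideoCapture(video_path)
        if not cap.isOpened():
            raise FileNotFoundError(f"Could not open video: {video_path}")
        
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        if total_frames <= 0:
            cap.release()
            return []
        # Evenly spaced indices from the first to the last frame
        frames_to_sample = np.linspace(0, total_frames-1, num_frames, dtype=int)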
        
        frames = []
        for frame_idx in frames_to_sample:
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
            ret, frame = cap.read()
            if ret:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frames.append(frame)
        
        cap.release()
        return frames

    def frame_to_base64(self, frame):
        """Convert numpy array frame to base64 string"""
        success, buffer = cv2.imencode('.jpg', cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        if not success:
            raise ValueError("Could not encode image")
        return base64.b64encode(buffer).decode('utf-8')

    def display_video(self, video_id: str):
        """Display video and its description"""
        video_desc = self.df[self.df['video_id'] == video_id]['sentence'].iloc[0]
        video_path = f"{self.video_base_path}/{video_id}.mp4"
        
        print("\n" + "="*50)
        print(f"📽️ Video Analysis: {video_id}")
        print(f"📝 Description: {video_desc}")
        print("="*50 + "\n")
        
        display(HTML(f"""
        <video width="500" controls>
            <source src="{video_path}" type="video/mp4">
            Your browser does not support the video tag.
        </video>
        """))

    def answer_question(self, video_id: str, question: str):
        """Analyze video frames sequentially with LLaVA"""
        try:
            # Display video and description first
            self.display_video(video_id)

            # Get video description
            video_desc = self.df[self.df['video_id'] == video_id]['sentence'].iloc[0]

            print(f"❓ Question: {question}\n")
            print("🔄 Analyzing frames...")

            frames = self.extract_frames(video_id, num_frames=4)

            # Analyze each frame
            frame_insights = []
            for i, frame in enumerate(frames, 1):
                print(f"Processing frame {i}/{len(frames)}...")
                frame_prompt = f"""Frame {i}/{len(frames)}
                Question: {question}
                Describe what you see in this frame that helps answer the question."""

                response = requests.post(
                    'http://opmlgpubuild01:11434/api/generate',
                    json={
                        'model': 'llava',
                        'prompt': frame_prompt,
                        'images': [self.frame_to_base64(frame)],
                        'stream': False
                    }
                )
                response.raise_for_status()  # surface HTTP errors instead of silently skipping the frame

                frame_insight = response.json().get('response', '').strip()
                if frame_insight:
                    frame_insights.append(frame_insight)

            # Synthesize final answer
            if frame_insights:
                synthesis_prompt = f"""Based on these observations:
                {' '.join(frame_insights)}

                Provide a clear and concise answer to: {question}"""

                final_response = requests.post(
                    'http://opmlgpubuild01:11434/api/generate',
                    json={
                        'model': 'llama3',
                        'prompt': synthesis_prompt,
                        'stream': False
                    }
                )
                final_response.raise_for_status()  # surface HTTP errors from the synthesis call

                answer = final_response.json().get('response', '').strip()
                print(f"\n💡 Answer: {answer}")
                return answer
            else:
                return "Could not analyze video frames."

        except Exception as e:
            return f"Error: {str(e)}"

    def analyze_video_with_questions(self, video_id: str):
        """Complete video analysis with auto-generated questions and answers"""
        # First display the video
        self.display_video(video_id)
        
        # Get video description
        video_desc = self.df[self.df['video_id'] == video_id]['sentence'].iloc[0]
        
        # Generate questions from description
        print("🤔 Generating relevant questions...")
        prompt = f"""Given this video description: "{video_desc}"
        Generate 5 specific and relevant questions about what might be shown in the video.
        Questions should focus on visible actions, objects, and details.
        Make questions diverse and interesting.
        Return only the questions, one per line."""
        
        try:
            response = requests.post(
                'http://opmlgpubuild01:11434/api/generate',
                json={
                    'model': 'llama3',
                    'prompt': prompt,
                    'stream': False
                }
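            )
            response.raise_for_status()
            
            # Keep only lines containing a question and strip list numbering or bullets
            questions = [q.strip().lstrip('0123456789.)- ') 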
                        for q in response.json()['response'].strip().split('\n')
                        if '?' in q][:5]
            
            # Answer each question
            print("\n📋 Questions & Answers:")
            for i, question in enumerate(questions, 1):
                print(f"\n{'-'*50}")
                print(f"Q{i}: {question}")
                print(f"A: {self.answer_question(video_id, question)}")
            
            print(f"\n{'-'*50}")
            
        except Exception as e:
            print(f"Error in analysis: {str(e)}")



# Data

We use the 1K-video test subset of the [MSR-VTT dataset](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/), the Microsoft Research Video to Text dataset designed for open-domain video captioning.

In [3]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/text_video_search.zip -O
! unzip -q -o text_video_search.zip



In [4]:
data = pd.read_csv('MSRVTT_JSFUSION_test.csv')[['video_id', 'sentence']]
data.head()

| | video_id | sentence |
|---|---|---|
| 0 | video9770 | a person is connecting something to system |
| 1 | video9771 | a little girl does gymnastics |
| 2 | video7020 | a woman creating a fondant baby and flower |
| 3 | video9773 | a boy plays grand theft auto 5 |
| 4 | video7026 | a man is giving a review on a vehicle |


# Video Q&A

The `answer_question` method takes a video ID and a question as input. It first displays the video and its original description. It then extracts 4 uniformly spaced frames from the video and sends them one at a time to LLaVA, a multimodal model, which describes what in each frame is relevant to the question. Finally, Llama 3 combines these frame-by-frame observations into a single answer that draws on all sampled parts of the video.
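The synthesis step can be sketched in isolation. The version below labels each observation with its frame position so the temporal order stays explicit in the prompt; that labelling is an assumption of this sketch (the class above simply joins the observations with spaces), and `build_synthesis_prompt` is a hypothetical helper name.

```python
def build_synthesis_prompt(question: str, frame_insights: list[str]) -> str:
    """Combine per-frame observations into one prompt for a text-only LLM."""
    total = len(frame_insights)
    # Prefix each non-empty observation with "Frame i/N" to preserve ordering
    observations = "\n".join(
        f"Frame {i}/{total}: {insight.strip()}"
        for i, insight in enumerate(frame_insights, 1)
        if insight.strip()
    )
    return (
        "Based on these observations:\n"
        f"{observations}\n\n"
        f"Provide a clear and concise answer to: {question}"
    )

prompt = build_synthesis_prompt(
    "What happens in the video?",
    ["A sealed box sits on a table.", "Hands open the box."],
)
print(prompt)
```

Whether explicit frame labels improve the synthesized answer would need to be checked empirically; the sketch only shows one way to keep the order visible to the second model.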

In [5]:
# Initialize the system
video_qa = VideoQASystem(data)

# Pick a test video and a question
video_id = "video8869"
question = "What are the main activities happening in this video?"

# Get answer
answer = video_qa.answer_question(video_id, question)
print(f"\nQ: {question}")
print(f"A: {answer}")


📽️ Video Analysis: video8869
📝 Description: a girl shows a pack of toy building blocks



❓ Question: What are the main activities happening in this video?

🔄 Analyzing frames...
Processing frame 1/4...
Processing frame 2/4...
Processing frame 3/4...
Processing frame 4/4...

💡 Answer: Based on the observations, the main activities happening in this video appear to be:

1. Unpacking or opening a package containing small items, such as Lego bricks.
2. Handling and assembling Lego bricks to create something, possibly working on a small project or playing with the Lego set.
3. Using special connectors and decorative elements to build and configure the Lego structure.

These activities suggest that the video is likely focused on creative play or building with Lego sets, and may be part of a larger activity or project.

Q: What are the main activities happening in this video?
A: Based on the observations, the main activities happening in this video appear to be:

1. Unpacking or opening a package containing small items, such as Lego bricks.
2. Handling and assembling Lego bricks 

## Automatic Video Analysis

The `analyze_video_with_questions` method processes a video in two steps. First, it passes the video's text description to an LLM (Llama 3) to generate relevant questions. Then, for each generated question, it calls `answer_question`, which runs the LLaVA frame analysis and displays the video and its description alongside the answer. The result is an analysis of the video content built from automatically generated and answered questions.
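Turning the LLM's free-text reply into a clean list of questions is the fragile part of this step, since models often add numbering, bullets, or an introductory line. A minimal sketch of that parsing, using a regular expression instead of the `lstrip` call in the class, might look like this; `parse_questions` is a hypothetical helper name and the sample reply is made up for illustration.

```python
import re

def parse_questions(raw: str, limit: int = 5) -> list[str]:
    """Extract up to `limit` questions from a line-per-question LLM reply."""
    questions = []
    for line in raw.splitlines():
        if "?" not in line:
            continue  # skip preambles and blank lines
        # Remove a leading "1.", "2)", "-", "*" or "•" marker, if present
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])?\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions[:limit]

reply = """Here are some questions:
1. What color are the blocks?
2) Who is holding the pack?
- Is the pack opened on camera?"""
print(parse_questions(reply))
```

Filtering on the presence of `?` is a simple heuristic: it drops introductory text, but it would also drop a question the model forgot to punctuate.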

In [6]:
# Analyze a video with auto-generated questions
video_id = "video8869"
video_qa.analyze_video_with_questions(video_id)


📽️ Video Analysis: video8869
📝 Description: a girl shows a pack of toy building blocks



🤔 Generating relevant questions...

📋 Questions & Answers:

--------------------------------------------------
Q1: Are the toy building blocks being built into a specific structure or shape?

📽️ Video Analysis: video8869
📝 Description: a girl shows a pack of toy building blocks



❓ Question: Are the toy building blocks being built into a specific structure or shape?

🔄 Analyzing frames...
Processing frame 1/4...
Processing frame 2/4...
Processing frame 3/4...
Processing frame 4/4...

💡 Answer: Based on the observations, it appears that the person is likely building according to a design plan, possibly following instructions from an included LEGO instruction manual. While the specific shape or structure being built is not clearly visible due to the angle of the photo and the positioning of the person's hands and the blocks, the fact that the person is carefully placing a block suggests that they are intentionally building a specific design rather than simply playing with the blocks.
A: Based on the observations, it appears that the person is likely building according to a design plan, possibly following instructions from an included LEGO instruction manual. While the specific shape or structure being built is not clearly visible due to the angle of the photo and

❓ Question: Can you identify any distinctive colors or patterns on the blocks themselves?

🔄 Analyzing frames...
Processing frame 1/4...
Processing frame 2/4...
Processing frame 3/4...

💡 Answer: Yes, based on the observations, some distinctive colors and patterns that can be identified on the LEGO blocks include:

* Solid colors such as yellow, red, blue, white, purple, green, and possibly black
* Patterns such as stripes, dots, textures, and geometric shapes like rectangles and circles
* Additional decorations or stickers on some blocks adding to their visual diversity
* Various shades of these colors, including different hues of yellow, green, red, and blue

These colors and patterns suggest a wide range of options for creating different scenes or structures with the LEGO blocks.
A: Yes, based on the observations, some distinctive colors and patterns that can be identified on the LEGO blocks include:

* Solid colors such as yellow, red, blue, white, purple, green, and possibly black

❓ Question: Does the girl demonstrate any unique techniques for stacking or connecting the blocks?

🔄 Analyzing frames...
Processing frame 1/4...
Processing frame 2/4...
Processing frame 3/4...
Processing frame 4/4...

💡 Answer: Based on the provided observation, it is difficult to determine if the girl demonstrates any unique techniques for stacking or connecting the blocks without more context or specific knowledge about the type of blocks and how they are intended to fit together. The image shows a person's hands interacting with Lego blocks in a typical manner, suggesting that she may be using standard manual assembly methods rather than any extraordinary or innovative approaches.
A: Based on the provided observation, it is difficult to determine if the girl demonstrates any unique techniques for stacking or connecting the blocks without more context or specific knowledge about the type of blocks and how they are intended to fit together. The image shows a person's hands interactin

❓ Question: Are there any other toys or objects visible in the background that might be relevant to the building activity?

🔄 Analyzing frames...
Processing frame 1/4...
Processing frame 2/4...
Processing frame 3/4...
Processing frame 4/4...

💡 Answer: Yes, there are several small Lego-like bricks scattered around on the table, suggesting that the person is either organizing or sorting them for a building project. Additionally, there appears to be a small toy car among the toys, which may also be related to the construction activity, as it could serve as part of the structure being built or simply as a decoration or additional piece to add to the Lego creation.
A: Yes, there are several small Lego-like bricks scattered around on the table, suggesting that the person is either organizing or sorting them for a building project. Additionally, there appears to be a small toy car among the toys, which may also be related to the construction activity, as it could serve as part of the structu

❓ Question: Do the blocks appear to have different shapes or textures, such as spheres, cylinders, or flat pieces?

🔄 Analyzing frames...
Processing frame 1/4...
Processing frame 2/4...
Processing frame 3/4...
Processing frame 4/4...

💡 Answer: Yes, the blocks appear to have different shapes and textures. The image shows cylindrical pieces (standard LEGO bricks), flat plates, and rectangular bricks with varying colors and designs. There may also be spherical or other non-rectangular pieces present, although it is difficult to determine their exact shape due to the low resolution of the photo.
A: Yes, the blocks appear to have different shapes and textures. The image shows cylindrical pieces (standard LEGO bricks), flat plates, and rectangular bricks with varying colors and designs. There may also be spherical or other non-rectangular pieces present, although it is difficult to determine their exact shape due to the low resolution of the photo.

-----------------------------------------

# Additional Thoughts 

**Additional Enhancements for Video Q&A System:**

**Advanced Model Integration**

- Direct integration with [MiniGPT4-video](https://vision-cair.github.io/MiniGPT4-video/):
   * Replace frame-by-frame LLaVA analysis with single pass through MiniGPT4-video
   * Better temporal understanding and more coherent answers
   * Supports longer video context windows

- Object and action detection:
   * Use a detection model (for example, one built on DINOv2 features) to detect key objects and activities
   * Add detection results to prompts ("I see a person cooking, holding a pan...")
   * Create richer context for more accurate answers

- Scene segmentation:
   * Detect scene changes in video
   * Split questions/answers by relevant scenes
   * Enable temporal localization ("in the first scene...")
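As a rough illustration of the scene segmentation idea, the sketch below flags a scene change whenever the mean absolute pixel difference between consecutive frames exceeds a threshold. The function name `detect_scene_changes` and the threshold value are assumptions chosen for the example; a practical system would more likely compare color histograms or use a dedicated scene-detection library.

```python
import numpy as np

def detect_scene_changes(frames: list[np.ndarray], threshold: float = 30.0) -> list[int]:
    """Return indices i where frame i differs strongly from frame i-1."""
    changes = []
    for i in range(1, len(frames)):
        # Cast to float so the subtraction cannot wrap around in uint8
        prev = frames[i - 1].astype(np.float32)
        curr = frames[i].astype(np.float32)
        if np.abs(curr - prev).mean() > threshold:
            changes.append(i)
    return changes

# Synthetic clip: two dark frames, two bright frames, one dark frame
dark = np.zeros((4, 4, 3), dtype=np.uint8)
bright = np.full((4, 4, 3), 200, dtype=np.uint8)
print(detect_scene_changes([dark, dark, bright, bright, dark]))  # [2, 4]
```

The returned indices could then be used to sample frames per scene rather than uniformly across the whole video.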

**Improved Analysis**

- Multi-turn conversations:
   * Keep conversation history and video context
   * Allow follow-up questions about previous answers
   * Enable clarifications and deeper exploration

- Timestamp-specific questions:
   * Link answers to specific video timestamps
   * Allow questions about specific moments
   * Navigate to relevant parts of video when answering

- Visual highlighting:
   * Overlay detected objects/actions on video
   * Show relevant regions for each answer
   * Help users understand model's focus areas

- Cross-video references:
   * Compare similar scenes across videos
   * Answer questions spanning multiple videos
   * Find related content in video database
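For the timestamp-specific questions item above, linking an answer to a moment in the video mostly comes down to converting sampled frame indices into timestamps. The sketch below assumes the frame rate is known (for example, read from `cv2.CAP_PROP_FPS`); `frame_to_timestamp` is a hypothetical helper name.

```python
def frame_to_timestamp(frame_idx: int, fps: float) -> str:
    """Convert a frame index to an MM:SS.cc timestamp string."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Work in whole centiseconds to avoid floating-point formatting surprises
    total_cs = round(frame_idx / fps * 100)
    minutes, cs = divmod(total_cs, 6000)
    return f"{minutes:02d}:{cs // 100:02d}.{cs % 100:02d}"

# Frames sampled from a 30 fps clip
for idx in (0, 450, 1830):
    print(idx, "->", frame_to_timestamp(idx, fps=30.0))
```

Attaching such a timestamp to each per-frame observation would let the final answer cite where in the video the evidence was seen.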

This enhanced system would combine specialized video understanding with rich contextual information from detectors, enabling more precise and interactive video analysis.