# Video Exploration with Qwen3-VL-4B-Instruct

This notebook demonstrates how to use the **Qwen3-VL-4B-Instruct** vision-language model to analyze and describe videos, both frame by frame and as whole clips.

Qwen3-VL is a state-of-the-art multimodal model from Alibaba's Qwen team that can:
- Understand images and videos
- Answer questions about visual content
- Perform OCR in 32 languages
- Handle spatial grounding and visual reasoning
- Support up to 256K token context (expandable to 1M)

Reference: [Qwen3-VL GitHub](https://github.com/QwenLM/Qwen3-VL) | [Model Card](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)


In [None]:
# Install required packages for Databricks MLR 16.4 LTS
%pip install -U opencv-python-headless transformers accelerate qwen-vl-utils "numpy<2"
%restart_python


## Setup

First, let's set the Unity Catalog (UC) catalog, schema, and volume where the video files live.


In [None]:
import os
import cv2
import torch
from PIL import Image
import numpy as np

catalog = 'brian_ml_dev'
schema = 'image_processing'
raw_data = 'raw_data'

video_folder = f'/Volumes/{catalog}/{schema}/{raw_data}'
print(f"Video folder: {video_folder}")


## List Available Videos


In [None]:
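# List video files in the volume (the extension list is an assumption; adjust as needed)
video_extensions = ('.mp4', '.mov', '.avi', '.mkv', '.webm')
files = sorted(f for f in os.listdir(video_folder) if f.lower().endswith(video_extensions))
print(f"Found {len(files)} video file(s):")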
for idx, file in enumerate(files):
    print(f"  {idx}: {file}")

files


## Load Video and Extract Frames

Let's load the first video and extract some sample frames for analysis.


In [None]:
# Select first video file
video_file = os.path.join(video_folder, files[0])
print(f"Selected video: {files[0]}")

# Open video capture
capture = cv2.VideoCapture(video_file)

# Get video properties
fps = capture.get(cv2.CAP_PROP_FPS)
total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
duration = total_frames / fps if fps > 0 else 0

print("Video properties:")
print(f"  FPS: {fps}")
print(f"  Total frames: {total_frames}")
print(f"  Duration: {duration:.2f} seconds")

# Extract roughly one frame per second (falls back to every 30 frames if FPS is unknown)
# max(1, ...) keeps the interval non-zero for videos that report less than 1 fps
frame_interval = max(1, round(fps)) if fps > 0 else 30
sample_frames = []
frame_indices = []

frame_index = 0
while True:
    success, frame = capture.read()
    if not success:
        break
    
    # Sample frames at intervals
    if frame_index % frame_interval == 0:
        sample_frames.append(frame)
        frame_indices.append(frame_index)
        # Stop decoding once we have enough frames for this demo (max 10)
        if len(sample_frames) >= 10:
            break

    frame_index += 1

capture.release()

print(f"\nExtracted {len(sample_frames)} sample frames at indices: {frame_indices}")
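
The sampling rule used above (one frame per second of video, capped at a fixed count) can also be computed up front from the video metadata. The following is a minimal sketch; `sample_frame_indices` is a hypothetical helper, not part of OpenCV, and it assumes the frame count reported by the container is accurate.

```python
def sample_frame_indices(total_frames: int, fps: float, max_frames: int = 10) -> list[int]:
    """Return frame indices spaced about one second apart, capped at max_frames."""
    if total_frames <= 0:
        return []
    # Guard against missing or sub-1 fps metadata so the step is never zero.
    step = max(1, round(fps)) if fps > 0 else 30
    return list(range(0, total_frames, step))[:max_frames]


# A 10-second clip at 30 fps yields one index per second.
print(sample_frame_indices(total_frames=300, fps=30.0))
# [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
```

With the indices known in advance, each frame could be fetched by seeking with `capture.set(cv2.CAP_PROP_POS_FRAMES, idx)` before `capture.read()` instead of decoding every frame. Seek accuracy depends on the codec, so this is worth verifying on your own files.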


## Display Sample Frame


In [None]:
# Display first sample frame
if len(sample_frames) > 0:
    bgr_image = sample_frames[0]
    rgb_array = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(rgb_array)
    display(pil_image)
    print(f"Frame {frame_indices[0]} - Image size: {pil_image.size}")


## Load Qwen3-VL-4B-Instruct Model

Qwen3-VL-4B-Instruct is a compact 4.8B parameter vision-language model that:
- Is loaded through the `Qwen3VLForConditionalGeneration` class in `transformers` (the class imported below)
- Supports multimodal inputs (text, images, and videos)
- Uses `qwen-vl-utils` for processing vision inputs
- Runs efficiently in bfloat16 precision on GPU

The model uses a chat template format with structured messages containing both text and vision inputs.
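
As a minimal sketch of that message structure, a single user turn pairs one image entry with one text entry. The helper name `build_image_message` is hypothetical and is not part of `transformers` or `qwen-vl-utils`.

```python
from typing import Any


def build_image_message(image: Any, prompt: str) -> list[dict]:
    """Build a one-turn chat message holding an image and a text prompt.

    `image` is passed through untouched; this notebook passes a PIL image.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_image_message("frame_0.jpg", "Describe this image in detail.")
print([part["type"] for part in messages[0]["content"]])
# ['image', 'text']
```

A list of this shape is what the analysis function in this notebook hands to `processor.apply_chat_template` and `process_vision_info`.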


In [None]:
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the Qwen3-VL-4B-Instruct model
# Note: Qwen3VLForConditionalGeneration requires a transformers release that includes Qwen3-VL support
model_name = "Qwen/Qwen3-VL-4B-Instruct"

print("Loading Qwen3-VL-4B-Instruct model...")
print("This may take a few minutes on first run...")

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(model_name)

print("Model loaded successfully!")
print(f"Model device: {model.device}")
print(f"Model dtype: {model.dtype}")


## Analyze Frame with Qwen VL

Let's ask the model to describe what's happening in the video frame.


In [None]:
def analyze_frame_with_qwen(pil_image, prompt="Describe this image in detail."):
    """
    Analyze a single frame using Qwen VL model.
    
    Args:
        pil_image: PIL Image object
        prompt: Question or instruction for the model
    
    Returns:
        Generated text description
    """
    # Prepare the conversation format
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": pil_image,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]
    
    # Prepare for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)
    
    # Generate response
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=512)
    
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    
    return output_text[0]

print("Analysis function ready!")
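
The trimming step in the function above is there because `generate` returns the prompt tokens followed by the newly generated tokens, so the prompt has to be sliced off before decoding. A minimal sketch with plain Python lists standing in for tensors (the token values are made up for illustration):

```python
def trim_prompt_tokens(input_ids: list[list[int]], generated_ids: list[list[int]]) -> list[list[int]]:
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


prompt_ids = [[101, 7, 8, 9]]              # tokens of the prompt
full_output = [[101, 7, 8, 9, 42, 43, 2]]  # prompt tokens followed by new tokens
print(trim_prompt_tokens(prompt_ids, full_output))
# [[42, 43, 2]]
```

Without this step, the decoded text would start with a copy of the prompt.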


In [None]:
# Analyze first frame
if len(sample_frames) > 0:
    print("Analyzing first frame...\n")
    
    # Convert frame to PIL Image
    bgr_image = sample_frames[0]
    rgb_array = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(rgb_array)
    
    # Get detailed description
    description = analyze_frame_with_qwen(pil_image, "Describe this image in detail. What objects, people, or activities do you see?")
    
    print(f"Frame {frame_indices[0]} Analysis:")
    print("=" * 80)
    print(description)
    print("=" * 80)


## Analyze Multiple Frames

Let's analyze several frames from the video to understand the temporal progression.


In [None]:
# Analyze multiple frames with different prompts
analysis_results = []

prompts = [
    "What is the main subject of this image?",
    "Describe the scene and any notable activities.",
    "What objects can you identify in this image?",
    "Describe the environment and setting.",
]

num_frames_to_analyze = min(4, len(sample_frames))

for i in range(num_frames_to_analyze):
    bgr_image = sample_frames[i]
    rgb_array = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(rgb_array)
    
    prompt = prompts[i % len(prompts)]
    
    print(f"\nAnalyzing Frame {frame_indices[i]}...")
    print(f"Prompt: {prompt}")
    
    result = analyze_frame_with_qwen(pil_image, prompt)
    
    analysis_results.append({
        'frame_index': frame_indices[i],
        'prompt': prompt,
        'response': result
    })
    
    print(f"Response: {result}")
    print("-" * 80)

print(f"\nCompleted analysis of {len(analysis_results)} frames")
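
To follow the temporal progression, it can help to attach a timestamp to each result. The sketch below assumes the result dictionaries have the `frame_index` key used above and that `fps` is the value read from the video; `add_timestamps` and the `timestamp_s` field are hypothetical names.

```python
def add_timestamps(results: list[dict], fps: float) -> list[dict]:
    """Return copies of the result rows with a 'timestamp_s' field in seconds."""
    annotated = []
    for row in results:
        # Without a valid frame rate the timestamp cannot be computed.
        seconds = round(row["frame_index"] / fps, 2) if fps > 0 else None
        annotated.append({**row, "timestamp_s": seconds})
    return annotated


demo_results = [
    {"frame_index": 0, "prompt": "p0", "response": "r0"},
    {"frame_index": 90, "prompt": "p1", "response": "r1"},
]
for row in add_timestamps(demo_results, fps=30.0):
    print(row["frame_index"], row["timestamp_s"])
```

In this notebook, the call would be `add_timestamps(analysis_results, fps)`.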


## Custom Query Interface

Try asking your own questions about the frames!


In [None]:
# Interactive analysis - change this prompt to ask different questions
custom_prompt = "Count the number of people visible in this image."

if len(sample_frames) > 0:
    bgr_image = sample_frames[0]
    rgb_array = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(rgb_array)
    
    print(f"Query: {custom_prompt}\n")
    response = analyze_frame_with_qwen(pil_image, custom_prompt)
    print(f"Response: {response}")


## Direct Video Analysis (No Frame Decomposition)

Qwen3-VL natively supports video inputs! Instead of manually extracting frames, we can pass the video file directly to the model. This approach:
- Leverages Qwen3-VL's temporal understanding across the entire video
- Uses the 256K token context window for long-form video comprehension
- Eliminates the need for manual frame sampling
- Provides holistic video understanding rather than frame-by-frame analysis

Based on the [qwen-vl-utils examples](https://github.com/QwenLM/Qwen3-VL/tree/main/qwen-vl-utils), we can use video paths directly in the message format.
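
Before passing a long video, it is useful to estimate how much visual input the `fps` and `max_pixels` settings imply. The sketch below is a rough estimate only: `estimate_video_budget` is a hypothetical helper, and it ignores the frame-count limits and rounding that `qwen-vl-utils` applies internally.

```python
import math


def estimate_video_budget(duration_s: float, sample_fps: float = 1.0,
                          max_pixels: int = 360 * 420) -> dict:
    """Roughly estimate sampled frames and the pixel ceiling for one video."""
    # One frame every 1/sample_fps seconds, with at least one frame overall.
    frames = max(1, math.ceil(duration_s * sample_fps))
    return {"frames": frames, "max_total_pixels": frames * max_pixels}


# A two-minute clip sampled at 1 fps.
print(estimate_video_budget(duration_s=120.0))
```

Lowering `fps` or `max_pixels` shrinks both numbers, which is the main lever when a video does not fit in GPU memory.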


In [None]:
def analyze_video_direct(video_path, prompt="Describe what happens in this video in detail."):
    """
    Analyze an entire video using Qwen3-VL's native video understanding.
    No frame extraction needed - the model processes the video directly.
    
    Args:
        video_path: Path to the video file
        prompt: Question or instruction for the model
    
    Returns:
        Generated text description of the video
    """
    # Prepare the conversation format with video input
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,  # Direct video path
                    "max_pixels": 360 * 420,  # Control video resolution for memory efficiency
                    "fps": 1.0,  # Sample rate for video frames (1 fps = 1 frame per second)
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]
    
    # Prepare for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)
    
    # Generate response
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=1024)
    
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    
    return output_text[0]

print("Direct video analysis function ready!")
print("Note: Processing full videos requires more memory than single frames.")


In [None]:
# Analyze the first video directly (without frame extraction)
if len(files) > 0:
    print(f"Analyzing video: {files[0]}")
    print("This may take a moment as the model processes the entire video...\n")
    
    # Analyze with different prompts to showcase temporal understanding
    video_prompts = [
        "Describe what happens in this video in detail.",
        "What is the main activity or scene shown in this video?",
        "Summarize the key events and objects visible throughout this video.",
    ]
    
    for i, prompt in enumerate(video_prompts):
        print(f"\n{'='*80}")
        print(f"Query {i+1}: {prompt}")
        print('='*80)
        
        result = analyze_video_direct(video_file, prompt)
        
        print(f"\nResponse:\n{result}")
        
        if i < len(video_prompts) - 1:
            print("\n" + "-"*80)
    
    print(f"\n{'='*80}")
    print("Direct video analysis complete!")


### Comparison: Frame-by-Frame vs Direct Video Analysis

**Frame-by-Frame Analysis (OpenCV):**
- ✅ Fine-grained control over which frames to analyze
- ✅ Lower memory usage per inference call
- ✅ Can process specific moments in time
- ❌ Loses temporal context between frames
- ❌ Requires manual frame extraction and management
- ❌ One `generate` call per frame

**Direct Video Analysis (Native Qwen3-VL):**
- ✅ Holistic understanding with temporal context
- ✅ Automatic frame sampling during preprocessing (controlled by `fps`)
- ✅ Single `generate` call for the entire video
- ✅ Can use the long (256K-token) context window for longer videos
- ❌ Higher memory requirements
- ❌ Less control over specific frames
- ❌ May be slower for very long videos

**Recommendation:** Use direct video analysis for understanding overall video content and temporal relationships. Use frame-by-frame for detailed analysis of specific moments or when memory is constrained.


## Save Results to DataFrame

Let's structure our analysis results and save them for further processing.


In [None]:
import pandas as pd

# Create pandas DataFrame from results
results_df = pd.DataFrame(analysis_results)
results_df['video_file'] = files[0]

display(results_df)

print(f"\nDataFrame shape: {results_df.shape}")


## Save to Delta Table (Optional)

We can save our analysis results to a Delta table for future reference.


In [None]:
# Convert to Spark DataFrame
spark_df = spark.createDataFrame(results_df)

# Define table name
results_table = f"{catalog}.{schema}.qwen_vl_video_analysis"

# Save to Delta table
spark_df.write.mode('append').saveAsTable(results_table)

print(f"Results saved to: {results_table}")


## Summary

This notebook demonstrated:
1. Loading video files from Unity Catalog Volumes
2. **Frame-by-Frame Analysis:** Extracting sample frames from videos using OpenCV for detailed moment analysis
3. **Direct Video Analysis:** Using Qwen3-VL's native video input for temporal understanding (no frame extraction needed)
4. Using **Qwen3-VL-4B-Instruct** for visual understanding with custom prompts
5. Processing vision inputs with `qwen-vl-utils` (supports both image and video types)
6. Comparing two approaches: frame-level vs video-level analysis
7. Structuring and saving results to Delta tables

### Key Insights
- Qwen3-VL-4B is a compact yet powerful multimodal model (4.8B params)
- Supports 256K context window (expandable to 1M tokens)
- **Natively handles both images and videos** - can process entire videos with temporal context
- Uses structured message format for multimodal inputs
- Frame-by-frame approach offers precision; direct video analysis offers holistic understanding

### Next Steps
- Batch process multiple videos using Spark for distributed inference
- Implement hybrid approach: direct video for summary + frame analysis for key moments
- Fine-tune the model on domain-specific data (traffic, surveillance, etc.)
- Combine with object detection (DETR) for comprehensive analysis
- Optimize video processing parameters (fps, max_pixels) for different use cases
- Explore multi-video comparison and temporal event detection
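
The hybrid approach listed above could be sketched as follows. This is an illustration under assumptions, not a tested pipeline: `hybrid_analysis` is a hypothetical helper, and the two model calls are injected as callables so that the control flow can be checked with stubs and without a GPU.

```python
from typing import Any, Callable, Iterable


def hybrid_analysis(
    video_path: str,
    key_frames: Iterable[tuple[int, Any]],
    summarize_video: Callable[[str], str],
    describe_frame: Callable[[Any], str],
) -> dict:
    """Combine one whole-video summary with per-frame descriptions."""
    summary = summarize_video(video_path)
    key_moments = [
        {"frame_index": idx, "response": describe_frame(frame)}
        for idx, frame in key_frames
    ]
    return {"video": video_path, "summary": summary, "key_moments": key_moments}


# Stub callables stand in for the model calls.
demo = hybrid_analysis(
    "clip.mp4",
    [(0, "frame-0"), (30, "frame-30")],
    summarize_video=lambda path: f"summary of {path}",
    describe_frame=lambda frame: f"description of {frame}",
)
print(demo["summary"])
print(len(demo["key_moments"]))
```

In this notebook, `summarize_video` could be `analyze_video_direct`, `describe_frame` could be a small wrapper that converts a BGR frame to a PIL image and calls `analyze_frame_with_qwen`, and `key_frames` could be `zip(frame_indices, sample_frames)`.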
