SceneIQ

Turn hours of video into structured intelligence dossiers in minutes.

SceneIQ is an adaptive multimodal AI system that processes long-form video—podcasts, webinars, senate hearings, technical lectures—and generates comprehensive intelligence dossiers with speaker attribution, topic maps, key quotes, and highlight suggestions. All in 0.15× real-time on a single AMD MI300X GPU.

Why SceneIQ?

Humans can't scale to the volume of video content produced daily. A 3-hour congressional hearing requires 6+ hours of human review to extract actionable insights. SceneIQ does it in 28 minutes.

Speed: 3-hour video → full dossier in 28 min (0.15× real-time)
Quality: Speaker-attributed transcript, adaptive keyframe selection (40–60 moments vs. 600+ dense frames)
Accuracy: Qwen3-VL-32B-Thinking vision + VibeVoice-ASR for robust multimodal analysis
Scale: Process 10+ hours per day on a single GPU

Architecture

Input Video
    ↓
┌─ Extract ─────────────────────────────────┐
│ • Adaptive keyframe selection (hybrid strategy)  │
│ • Audio extraction at 24kHz                     │
│ • Real-time frame deduplication                 │
└───────────────────┬────────────────────────┘
                    ↓
        ┌───────────┴───────────┐
        ↓                       ↓
    [Vision]              [Audio]        (parallel)
    (Qwen3-VL-32B)     (VibeVoice ASR)
        ↓                       ↓
    Frame Analysis        Speaker-Attributed
    + Thinking Traces     Transcript
        └───────────┬───────────┘
                    ↓
            ┌── Synthesis ──────────────┐
            │ Qwen3-VL-32B-Thinking     │
            │ generates structured JSON │
            └───────────┬───────────────┘
                        ↓
                  Intelligence Dossier
         • Executive brief
         • Topic map with timestamps
         • Speaker claims & quotes
         • Highlight clip suggestions
         • Benchmark metrics

Key Design:

Parallel Processing: Audio transcription runs concurrently with vision analysis, not sequentially. Vision takes ~13 min for 3h video; audio takes ~5 min. Total: ~13 min, not 18 min.
Adaptive Keyframes: 8×8 grayscale thumbnail mean-absolute-diff deduplication reduces frame count by 90% without losing content changes (slide transitions, scene cuts).
Streaming Output: SSE (Server-Sent Events) streams analysis phases, metrics, and report chunks to a React UI in real-time.

Tech Stack

Component	Technology
Vision Model	Qwen3-VL-32B-Thinking (on vLLM)
Audio Model	VibeVoice-ASR-HF + faster-whisper (fallback)
GPU Backend	AMD MI300X (192GB VRAM) + ROCm
Inference Engine	vLLM OpenAI-compatible API
Backend	FastAPI + LangGraph orchestration
Frontend	React 18 + SSE streaming

Getting Started

Prerequisites

AMD MI300X GPU with ROCm 6.x+
Python 3.10+
Node.js 18+
50GB free disk (model cache)

Installation

# Clone the repo
git clone https://github.com/yourusername/sceneiq.git
cd sceneiq

# Backend setup
python3 -m venv venv
source venv/bin/activate
pip install -r backend/requirements.txt

# Frontend setup
cd frontend
npm ci
npm run build
cd ..

# Bootstrap script (recommended for fresh AMD instance)
bash scripts/bootstrap.sh [HF_TOKEN]

Run Locally

# Terminal 1: Start vision model (Qwen3-VL)
bash start_vllm.sh

# Terminal 2: Start audio service (VibeVoice)
cd scripts && python3 vibeVoice_service.py

# Terminal 3: Start backend
source venv/bin/activate
uvicorn backend.main:app --host 0.0.0.0 --port 8002

# Terminal 4: Start frontend
cd frontend && PORT=7860 npm run start

# Open http://localhost:7860

Quick Test

# Run end-to-end pipeline test
BASE_URL=http://localhost bash scripts/test_pipeline.sh

Demo

Pre-computed result for the 2023 Senate Judiciary AI Hearing (3h 12m, 5 speakers):

Video Duration: 3h 12m (11,520 seconds)
Total Processing Time: 28 minutes
Real-Time Factor: 0.15×
Keyframes Selected: 47 (from 648 dense equivalent)
Vision Time: 13 min @ 17s/frame
Audio Time: 5 min (VibeVoice primary, faster-whisper fallback)
Synthesis Time: ~8 min (JSON report generation + streaming)

Output Dossier Sections:

Executive brief (1–2 paragraphs)
Topic map (5–7 topics with timestamps)
Speaker claims (per-speaker key statements)
Best quotes (top 10 verbatim quotes)
Highlight clip suggestions (3–5 key moments to extract)
Visual context moments (keyframe annotations)
Entities (people, organizations, legislation)
Contradictions/open questions
Repurposable social media posts

Configuration

Environment variables:

# Models
WHISPER_MODEL=large-v3          # faster-whisper fallback size
HF_TOKEN=hf_xxx                 # Hugging Face API key

# Services
VIBEVOICE_API_URL=http://localhost:8001
QWEN_API_URL=http://localhost:8000

# Pipeline tuning
SCENEIQ_DEDUP_THRESHOLD=0.08    # Frame dedup: 0–1 (lower = more aggressive)
SCENEIQ_MAX_SYNTHESIS_CHUNKS=40 # Cap on transcript chunks sent to synthesis

Benchmark Results

Tested on AMD MI300X (192GB VRAM):

Duration	Keyframes	Total Time	RTF	Vision	Audio	Synthesis
10 min	8	3 min	0.30×	2 min	20s	40s
1 hour	15	9 min	0.15×	7 min	1 min	2 min
3 hours	47	28 min	0.15×	13 min	5 min	8 min

Note: RTF stabilizes around 0.15× for videos >30 min due to fixed synthesis overhead.

Architecture Decisions

Why Qwen3-VL-32B-Thinking?

Instruction-tuned vision-language model with extended reasoning (thinking tokens)
32B parameter size fits MI300X with room for batching
Supports long context (32K tokens) for multi-frame analysis

Why VibeVoice-ASR over Whisper?

Native speaker diarization (who said what) without post-processing
Optimized for diverse audio quality and accents
Lower latency than Whisper with comparable accuracy

Why Parallel Audio+Vision?

Sequential pipeline would add 5 min to total time (13m + 5m = 18m)
Parallel: max(13m, 5m) = 13m effective time
Saves ~27% overall latency with minimal implementation complexity

Limitations & Future Work

Live streaming: Currently requires pre-recorded video files
Languages: English audio/video only (Qwen3-VL and VibeVoice support)
Output size: Dossier is ~5–10KB JSON; no multimedia embedding
Privacy: All processing happens on your GPU; no data sent externally

Future:

Multi-language support (with multilingual models)
Live video stream ingestion (RTMP/HLS)
Custom dossier templates (user-defined report schema)
Batch processing API for 100+ videos

Contributing

Contributions welcome. Areas of interest:

Additional keyframe strategies (optical flow, attention maps)
Report template customization
Performance optimizations (quantization, pruning)
E2E tests for pipeline phases

Please open an issue or PR with clear description of changes.

Citation

If you use SceneIQ in research, please cite:

@software{sceneiq2025,
  title = {SceneIQ: Adaptive Multimodal Dossiers on AMD MI300X},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/yourusername/sceneiq}
}

License

MIT License. See LICENSE for details.

Acknowledgments

Qwen3-VL for vision-language modeling
VibeVoice for speaker diarization
vLLM for efficient inference
AMD ROCm for GPU support

Questions? Open an issue or reach out.

Try the live demo: [Coming soon — will add link once deployed]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SceneIQ

Why SceneIQ?

Architecture

Tech Stack

Getting Started

Prerequisites

Installation

Run Locally

Quick Test

Demo

Configuration

Benchmark Results

Architecture Decisions

Why Qwen3-VL-32B-Thinking?

Why VibeVoice-ASR over Whisper?

Why Parallel Audio+Vision?

Limitations & Future Work

Contributing

Citation

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
backend		backend
frontend		frontend
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
start_vllm.sh		start_vllm.sh

Folders and files

Latest commit

History

Repository files navigation

SceneIQ

Why SceneIQ?

Architecture

Tech Stack

Getting Started

Prerequisites

Installation

Run Locally

Quick Test

Demo

Configuration

Benchmark Results

Architecture Decisions

Why Qwen3-VL-32B-Thinking?

Why VibeVoice-ASR over Whisper?

Why Parallel Audio+Vision?

Limitations & Future Work

Contributing

Citation

License

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages