Turn hours of video into structured intelligence dossiers in minutes.
SceneIQ is an adaptive multimodal AI system that processes long-form video (podcasts, webinars, senate hearings, technical lectures) and generates comprehensive intelligence dossiers with speaker attribution, topic maps, key quotes, and highlight suggestions. It runs at about 0.15× real time on a single AMD MI300X GPU.
Humans can't scale to the volume of video content produced daily. A 3-hour congressional hearing requires 6+ hours of human review to extract actionable insights. SceneIQ does it in 28 minutes.
- Speed: 3-hour video → full dossier in 28 min (0.15× real-time)
- Quality: Speaker-attributed transcript, adaptive keyframe selection (40–60 moments vs. 600+ dense frames)
- Accuracy: Qwen3-VL-32B-Thinking vision + VibeVoice-ASR for robust multimodal analysis
- Scale: Process 10+ hours per day on a single GPU
                    Input Video
                         ↓
┌─ Extract ───────────────────────────────────────┐
│ • Adaptive keyframe selection (hybrid strategy) │
│ • Audio extraction at 24kHz                     │
│ • Real-time frame deduplication                 │
└────────────────────────┬────────────────────────┘
                         ↓
             ┌───────────┴───────────┐
             ↓                       ↓
          [Vision]                [Audio] (parallel)
      (Qwen3-VL-32B)          (VibeVoice ASR)
             ↓                       ↓
      Frame Analysis        Speaker-Attributed
     + Thinking Traces          Transcript
             └───────────┬───────────┘
                         ↓
           ┌── Synthesis ──────────────┐
           │ Qwen3-VL-32B-Thinking     │
           │ generates structured JSON │
           └─────────────┬─────────────┘
                         ↓
               Intelligence Dossier
               • Executive brief
               • Topic map with timestamps
               • Speaker claims & quotes
               • Highlight clip suggestions
               • Benchmark metrics
Key Design:
- Parallel Processing: Audio transcription runs concurrently with vision analysis rather than after it. For a 3-hour video, vision takes ~13 min and audio ~5 min, so the two stages together finish in ~13 min instead of 18 min.
- Adaptive Keyframes: Frames are deduplicated by comparing 8×8 grayscale thumbnails using mean absolute difference, which cuts the frame count by roughly 90% while still catching content changes such as slide transitions and scene cuts.
- Streaming Output: SSE (Server-Sent Events) streams analysis phases, metrics, and report chunks to a React UI in real-time.
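The adaptive keyframe idea can be illustrated with a minimal sketch. This is not SceneIQ's actual implementation: the function names, the block-averaging downsample, and the rule "keep a frame when its thumbnail differs from the last kept one by more than the threshold" are all assumptions made for illustration. The default of 0.08 is borrowed from `SCENEIQ_DEDUP_THRESHOLD`, but the project may interpret that value differently.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # 2-D grayscale image, pixel values 0-255


def thumbnail_8x8(frame: Frame) -> List[float]:
    """Block-average a grayscale frame down to 8x8 (64 values scaled to 0-1)."""
    h, w = len(frame), len(frame[0])
    bh, bw = h // 8, w // 8  # block size; assumes the frame is at least 8x8
    thumb = []
    for by in range(8):
        for bx in range(8):
            total = sum(
                frame[y][x]
                for y in range(by * bh, (by + 1) * bh)
                for x in range(bx * bw, (bx + 1) * bw)
            )
            thumb.append(total / (bh * bw * 255.0))
    return thumb


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized thumbnails."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: Sequence[Frame], threshold: float = 0.08) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    kept: List[int] = []
    last = None
    for i, frame in enumerate(frames):
        thumb = thumbnail_8x8(frame)
        # Hypothetical rule: always keep the first frame, then keep on large change.
        if last is None or mean_abs_diff(thumb, last) > threshold:
            kept.append(i)
            last = thumb
    return kept
```

Comparing against the last kept frame, rather than the immediately preceding one, means a slow fade still produces a keyframe once the accumulated change crosses the threshold.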
| Component | Technology |
|---|---|
| Vision Model | Qwen3-VL-32B-Thinking (on vLLM) |
| Audio Model | VibeVoice-ASR-HF + faster-whisper (fallback) |
| GPU Backend | AMD MI300X (192GB VRAM) + ROCm |
| Inference Engine | vLLM OpenAI-compatible API |
| Backend | FastAPI + LangGraph orchestration |
| Frontend | React 18 + SSE streaming |
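For readers unfamiliar with the SSE streaming listed in the stack, the wire format is plain text: each message is a set of `field: value` lines terminated by a blank line. The sketch below shows one way to encode pipeline updates in that format. The event names (`phase`, `metrics`) and payload keys are hypothetical and not taken from the SceneIQ codebase.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def format_sse(event: str, data: Dict) -> str:
    """Encode one Server-Sent Events message, terminated by a blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def stream_updates(updates: Iterable[Tuple[str, Dict]]) -> Iterator[str]:
    """Yield SSE-encoded messages for a sequence of (event, data) pairs."""
    for event, data in updates:
        yield format_sse(event, data)


if __name__ == "__main__":
    # Hypothetical updates a pipeline run might emit.
    demo = [
        ("phase", {"name": "vision", "status": "started"}),
        ("metrics", {"keyframes": 47}),
    ]
    for message in stream_updates(demo):
        print(message, end="")
```

A generator like `stream_updates` could be handed to FastAPI's `StreamingResponse` with `media_type="text/event-stream"`, and the browser side can consume it with the standard `EventSource` API.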
Prerequisites:
- AMD MI300X GPU with ROCm 6.x+
- Python 3.10+
- Node.js 18+
- 50GB free disk (model cache)
# Clone the repo
git clone https://github.com/yourusername/sceneiq.git
cd sceneiq
# Backend setup
python3 -m venv venv
source venv/bin/activate
pip install -r backend/requirements.txt
# Frontend setup
cd frontend
npm ci
npm run build
cd ..
# Bootstrap script (recommended for fresh AMD instance)
bash scripts/bootstrap.sh [HF_TOKEN]
# Terminal 1: Start vision model (Qwen3-VL)
bash start_vllm.sh
# Terminal 2: Start audio service (VibeVoice)
cd scripts && python3 vibeVoice_service.py
# Terminal 3: Start backend
source venv/bin/activate
uvicorn backend.main:app --host 0.0.0.0 --port 8002
# Terminal 4: Start frontend
cd frontend && PORT=7860 npm run start
# Open http://localhost:7860
# Run end-to-end pipeline test
BASE_URL=http://localhost bash scripts/test_pipeline.sh
Pre-computed result for the 2023 Senate Judiciary AI Hearing (3h 12m, 5 speakers):
- Video Duration: 3h 12m (11,520 seconds)
- Total Processing Time: 28 minutes
- Real-Time Factor: 0.15×
- Keyframes Selected: 47 (from 648 dense equivalent)
- Vision Time: 13 min @ 17s/frame
- Audio Time: 5 min (VibeVoice primary, faster-whisper fallback)
- Synthesis Time: ~8 min (JSON report generation + streaming)
Output Dossier Sections:
- Executive brief (1–2 paragraphs)
- Topic map (5–7 topics with timestamps)
- Speaker claims (per-speaker key statements)
- Best quotes (top 10 verbatim quotes)
- Highlight clip suggestions (3–5 key moments to extract)
- Visual context moments (keyframe annotations)
- Entities (people, organizations, legislation)
- Contradictions/open questions
- Repurposable social media posts
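To give a feel for what a dossier with these sections might look like as structured JSON, here is a small sketch using dataclasses. The class and field names are hypothetical, cover only a few of the sections listed above, and do not reflect SceneIQ's actual report schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Topic:
    title: str
    start_s: float  # topic start, seconds from the beginning of the video
    end_s: float


@dataclass
class Quote:
    speaker: str
    text: str
    timestamp_s: float


@dataclass
class Dossier:
    """Hypothetical subset of the dossier sections."""
    executive_brief: str
    topic_map: List[Topic] = field(default_factory=list)
    best_quotes: List[Quote] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so the result is JSON-safe.
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    # Placeholder content only; not real pipeline output.
    dossier = Dossier(
        executive_brief="Placeholder summary.",
        topic_map=[Topic("Placeholder topic", 0.0, 600.0)],
        best_quotes=[Quote("Speaker 1", "Placeholder quote.", 42.0)],
    )
    print(dossier.to_json())
```

Keeping timestamps as plain seconds makes it straightforward to map topics and quotes back to clip boundaries.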
Environment variables:
# Models
WHISPER_MODEL=large-v3 # faster-whisper fallback size
HF_TOKEN=hf_xxx # Hugging Face API key
# Services
VIBEVOICE_API_URL=http://localhost:8001
QWEN_API_URL=http://localhost:8000
# Pipeline tuning
SCENEIQ_DEDUP_THRESHOLD=0.08 # Frame dedup: 0–1 (lower = more aggressive)
SCENEIQ_MAX_SYNTHESIS_CHUNKS=40 # Cap on transcript chunks sent to synthesis
Tested on AMD MI300X (192GB VRAM):
| Duration | Keyframes | Total Time | RTF | Vision | Audio | Synthesis |
|---|---|---|---|---|---|---|
| 10 min | 8 | 3 min | 0.30× | 2 min | 20s | 40s |
| 1 hour | 15 | 9 min | 0.15× | 7 min | 1 min | 2 min |
| 3 hours | 47 | 28 min | 0.15× | 13 min | 5 min | 8 min |
Note: RTF is higher for short videos because fixed overhead makes up a larger share of the run; it settles around 0.15× for videos longer than 30 minutes.
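The RTF column is simply processing time divided by video duration. The short sketch below reproduces the table's figures; the function name is illustrative, and the last row assumes the "3 hours" entry refers to the 3h 12m hearing benchmark, since the keyframe count and total time match.

```python
def real_time_factor(processing_s: float, duration_s: float) -> float:
    """Processing time divided by media duration; below 1.0 is faster than real time."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return processing_s / duration_s


if __name__ == "__main__":
    # (label, video duration in minutes, total processing time in minutes)
    rows = [
        ("10 min", 10, 3),
        ("1 hour", 60, 9),
        ("3h 12m", 192, 28),  # assumption: the table's "3 hours" row
    ]
    for label, duration_min, total_min in rows:
        rtf = real_time_factor(total_min * 60, duration_min * 60)
        print(f"{label}: RTF = {rtf:.2f}x")
```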
Why Qwen3-VL-32B-Thinking:
- Instruction-tuned vision-language model with extended reasoning (thinking tokens)
- 32B parameter size fits MI300X with room for batching
- Supports long context (32K tokens) for multi-frame analysis
Why VibeVoice-ASR:
- Native speaker diarization (who said what) without post-processing
- Optimized for diverse audio quality and accents
- Lower latency than Whisper with comparable accuracy
Why parallel processing:
- A sequential pipeline would spend 18 min on vision plus audio (13 min + 5 min)
- Parallel: max(13 min, 5 min) = 13 min effective time
- Saves 5 min, about 28% of the combined vision and audio stage, with minimal implementation complexity
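The max-versus-sum effect can be demonstrated with a small `asyncio` sketch. The sleeps stand in for the real vision and audio calls, and the stage names and durations are placeholders; SceneIQ's own orchestration (LangGraph) may structure this differently.

```python
import asyncio
import time
from typing import List, Tuple


async def run_stage(name: str, seconds: float) -> str:
    """Stand-in for a real model call; just waits for the given time."""
    await asyncio.sleep(seconds)
    return name


async def run_parallel(vision_s: float, audio_s: float) -> Tuple[List[str], float]:
    """Run both stages concurrently and report wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(
        run_stage("vision", vision_s),
        run_stage("audio", audio_s),
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    # Scaled-down durations: 0.3 s and 0.1 s instead of 13 min and 5 min.
    results, elapsed = asyncio.run(run_parallel(0.3, 0.1))
    print(results, f"elapsed about {elapsed:.2f}s (sum would be 0.40s)")
```

`asyncio.gather` returns results in the order the awaitables were passed, so the vision result always comes first regardless of which stage finishes earlier.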
Limitations and notes:
- Live streaming: Currently requires pre-recorded video files
- Languages: English audio/video only (Qwen3-VL and VibeVoice support)
- Output size: Dossier is ~5–10KB JSON; no multimedia embedding
- Privacy: All processing happens on your GPU; no data sent externally
Future:
- Multi-language support (with multilingual models)
- Live video stream ingestion (RTMP/HLS)
- Custom dossier templates (user-defined report schema)
- Batch processing API for 100+ videos
Contributions welcome. Areas of interest:
- Additional keyframe strategies (optical flow, attention maps)
- Report template customization
- Performance optimizations (quantization, pruning)
- E2E tests for pipeline phases
Please open an issue or PR with a clear description of your changes.
If you use SceneIQ in research, please cite:
@software{sceneiq2025,
title = {SceneIQ: Adaptive Multimodal Dossiers on AMD MI300X},
author = {Your Name},
year = {2025},
url = {https://github.com/yourusername/sceneiq}
}
MIT License. See LICENSE for details.
- Qwen3-VL for vision-language modeling
- VibeVoice for speaker diarization
- vLLM for efficient inference
- AMD ROCm for GPU support
Questions? Open an issue or reach out.
Try the live demo: [Coming soon — will add link once deployed]