Skip to content

LalwaniPalash/sceneiq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SceneIQ

Turn hours of video into structured intelligence dossiers in minutes.

SceneIQ is an adaptive multimodal AI system that processes long-form video—podcasts, webinars, senate hearings, technical lectures—and generates comprehensive intelligence dossiers with speaker attribution, topic maps, key quotes, and highlight suggestions. All in 0.15× real-time on a single AMD MI300X GPU.

Why SceneIQ?

Humans can't scale to the volume of video content produced daily. A 3-hour congressional hearing requires 6+ hours of human review to extract actionable insights. SceneIQ does it in 28 minutes.

  • Speed: 3-hour video → full dossier in 28 min (0.15× real-time)
  • Quality: Speaker-attributed transcript, adaptive keyframe selection (40–60 moments vs. 600+ dense frames)
  • Accuracy: Qwen3-VL-32B-Thinking vision + VibeVoice-ASR for robust multimodal analysis
  • Scale: Process 10+ hours per day on a single GPU

Architecture

Input Video
    ↓
┌─ Extract ─────────────────────────────────┐
│ • Adaptive keyframe selection (hybrid strategy)  │
│ • Audio extraction at 24kHz                     │
│ • Real-time frame deduplication                 │
└───────────────────┬────────────────────────┘
                    ↓
        ┌───────────┴───────────┐
        ↓                       ↓
    [Vision]              [Audio]        (parallel)
    (Qwen3-VL-32B)     (VibeVoice ASR)
        ↓                       ↓
    Frame Analysis        Speaker-Attributed
    + Thinking Traces     Transcript
        └───────────┬───────────┘
                    ↓
            ┌── Synthesis ──────────────┐
            │ Qwen3-VL-32B-Thinking     │
            │ generates structured JSON │
            └───────────┬───────────────┘
                        ↓
                  Intelligence Dossier
         • Executive brief
         • Topic map with timestamps
         • Speaker claims & quotes
         • Highlight clip suggestions
         • Benchmark metrics

Key Design:

  • Parallel Processing: Audio transcription runs concurrently with vision analysis, not sequentially. Vision takes ~13 min for 3h video; audio takes ~5 min. Total: ~13 min, not 18 min.
  • Adaptive Keyframes: 8×8 grayscale thumbnail mean-absolute-diff deduplication reduces frame count by 90% without losing content changes (slide transitions, scene cuts).
  • Streaming Output: SSE (Server-Sent Events) streams analysis phases, metrics, and report chunks to a React UI in real-time.

Tech Stack

Component Technology
Vision Model Qwen3-VL-32B-Thinking (on vLLM)
Audio Model VibeVoice-ASR-HF + faster-whisper (fallback)
GPU Backend AMD MI300X (192GB VRAM) + ROCm
Inference Engine vLLM OpenAI-compatible API
Backend FastAPI + LangGraph orchestration
Frontend React 18 + SSE streaming

Getting Started

Prerequisites

  • AMD MI300X GPU with ROCm 6.x+
  • Python 3.10+
  • Node.js 18+
  • 50GB free disk (model cache)

Installation

# Clone the repo
git clone https://github.com/yourusername/sceneiq.git
cd sceneiq

# Backend setup
python3 -m venv venv
source venv/bin/activate
pip install -r backend/requirements.txt

# Frontend setup
cd frontend
npm ci
npm run build
cd ..

# Bootstrap script (recommended for fresh AMD instance)
bash scripts/bootstrap.sh [HF_TOKEN]

Run Locally

# Terminal 1: Start vision model (Qwen3-VL)
bash start_vllm.sh

# Terminal 2: Start audio service (VibeVoice)
cd scripts && python3 vibeVoice_service.py

# Terminal 3: Start backend
source venv/bin/activate
uvicorn backend.main:app --host 0.0.0.0 --port 8002

# Terminal 4: Start frontend
cd frontend && PORT=7860 npm run start

# Open http://localhost:7860

Quick Test

# Run end-to-end pipeline test
BASE_URL=http://localhost bash scripts/test_pipeline.sh

Demo

Pre-computed result for the 2023 Senate Judiciary AI Hearing (3h 12m, 5 speakers):

  • Video Duration: 3h 12m (11,520 seconds)
  • Total Processing Time: 28 minutes
  • Real-Time Factor: 0.15×
  • Keyframes Selected: 47 (from 648 dense equivalent)
  • Vision Time: 13 min @ 17s/frame
  • Audio Time: 5 min (VibeVoice primary, faster-whisper fallback)
  • Synthesis Time: ~8 min (JSON report generation + streaming)

Output Dossier Sections:

  • Executive brief (1–2 paragraphs)
  • Topic map (5–7 topics with timestamps)
  • Speaker claims (per-speaker key statements)
  • Best quotes (top 10 verbatim quotes)
  • Highlight clip suggestions (3–5 key moments to extract)
  • Visual context moments (keyframe annotations)
  • Entities (people, organizations, legislation)
  • Contradictions/open questions
  • Repurposable social media posts

Configuration

Environment variables:

# Models
WHISPER_MODEL=large-v3          # faster-whisper fallback size
HF_TOKEN=hf_xxx                 # Hugging Face API key

# Services
VIBEVOICE_API_URL=http://localhost:8001
QWEN_API_URL=http://localhost:8000

# Pipeline tuning
SCENEIQ_DEDUP_THRESHOLD=0.08    # Frame dedup: 0–1 (lower = more aggressive)
SCENEIQ_MAX_SYNTHESIS_CHUNKS=40 # Cap on transcript chunks sent to synthesis

Benchmark Results

Tested on AMD MI300X (192GB VRAM):

Duration Keyframes Total Time RTF Vision Audio Synthesis
10 min 8 3 min 0.30× 2 min 20s 40s
1 hour 15 9 min 0.15× 7 min 1 min 2 min
3 hours 47 28 min 0.15× 13 min 5 min 8 min

Note: RTF stabilizes around 0.15× for videos >30 min due to fixed synthesis overhead.

Architecture Decisions

Why Qwen3-VL-32B-Thinking?

  • Instruction-tuned vision-language model with extended reasoning (thinking tokens)
  • 32B parameter size fits MI300X with room for batching
  • Supports long context (32K tokens) for multi-frame analysis

Why VibeVoice-ASR over Whisper?

  • Native speaker diarization (who said what) without post-processing
  • Optimized for diverse audio quality and accents
  • Lower latency than Whisper with comparable accuracy

Why Parallel Audio+Vision?

  • Sequential pipeline would add 5 min to total time (13m + 5m = 18m)
  • Parallel: max(13m, 5m) = 13m effective time
  • Saves ~27% overall latency with minimal implementation complexity

Limitations & Future Work

  • Live streaming: Currently requires pre-recorded video files
  • Languages: English audio/video only (Qwen3-VL and VibeVoice support)
  • Output size: Dossier is ~5–10KB JSON; no multimedia embedding
  • Privacy: All processing happens on your GPU; no data sent externally

Future:

  • Multi-language support (with multilingual models)
  • Live video stream ingestion (RTMP/HLS)
  • Custom dossier templates (user-defined report schema)
  • Batch processing API for 100+ videos

Contributing

Contributions welcome. Areas of interest:

  • Additional keyframe strategies (optical flow, attention maps)
  • Report template customization
  • Performance optimizations (quantization, pruning)
  • E2E tests for pipeline phases

Please open an issue or PR with clear description of changes.

Citation

If you use SceneIQ in research, please cite:

@software{sceneiq2025,
  title = {SceneIQ: Adaptive Multimodal Dossiers on AMD MI300X},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/yourusername/sceneiq}
}

License

MIT License. See LICENSE for details.

Acknowledgments

  • Qwen3-VL for vision-language modeling
  • VibeVoice for speaker diarization
  • vLLM for efficient inference
  • AMD ROCm for GPU support

Questions? Open an issue or reach out.

Try the live demo: [Coming soon — will add link once deployed]

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors