This project is a configurable backend for building a video assistant around a single video.
The assistant can:
- extract speech when the video has audio
- describe visual moments from sampled frames
- generate a summary of the video
- search the video with natural-language queries
- answer questions about the video from retrieved evidence
- return timestamps for matched moments
The runtime builds a searchable index for one video by combining audio evidence and visual evidence.
The current pipeline is:
- Extract audio from the video with ffmpeg.
- Run speech transcription with NVIDIA Riva if audio is available.
- Sample frames from the video with ffmpeg.
- Caption those frames with an NVIDIA vision-language model.
- Chunk the transcript into search passages.
- Merge transcript chunks and visual captions into one searchable corpus.
- Embed that corpus with an NVIDIA embedding model.
- Optionally rerank search hits with an NVIDIA reranking model.
- Generate a video-level summary.
- Answer questions from retrieved evidence.
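The transcript chunking step can be sketched as a greedy packer. This is a minimal illustration, not the project's actual implementation: the `Segment` class and `chunk_segments` function are hypothetical names, and the limits mirror the `chunking` values shown in the configuration example (`max_chunk_duration_seconds`, `max_chunk_characters`, `overlap_segments`).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timed transcript segment (hypothetical shape)."""
    start: float
    end: float
    text: str


def chunk_segments(segments, max_duration=24.0, max_chars=700, overlap=1):
    """Greedily pack segments into chunks, repeating `overlap` segments between chunks."""
    chunks = []
    current = []
    for seg in segments:
        candidate = current + [seg]
        duration = candidate[-1].end - candidate[0].start
        # Character count of the joined text, including one space between segments.
        chars = sum(len(s.text) for s in candidate) + len(candidate) - 1
        # Close the current chunk when adding this segment would exceed a limit.
        if current and (duration > max_duration or chars > max_chars):
            chunks.append(current)
            current = current[-overlap:] if overlap > 0 else []
        current = current + [seg]
    if current:
        chunks.append(current)
    return [
        {
            "start_seconds": c[0].start,
            "end_seconds": c[-1].end,
            "text": " ".join(s.text for s in c),
        }
        for c in chunks
    ]


segments = [
    Segment(0.0, 10.0, "a"),
    Segment(10.0, 20.0, "b"),
    Segment(20.0, 30.0, "c"),
    Segment(30.0, 40.0, "d"),
]
passages = chunk_segments(segments)
```

With these inputs each passage spans two segments and shares one segment with its neighbor, so a sentence cut at a chunk boundary is still searchable from either side.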
The important design point is that the assistant does not search the raw video directly. It searches an intermediate index made of:
- transcript chunks
- visual frame captions
- embeddings for retrieval
- a generated summary
- Audio transcription when the video has usable speech.
- Visual indexing even when the video has no transcript.
- Timestamped search results.
- Search over both spoken content and visible content.
- Video-level summary generation.
- Question answering over one indexed video.
- Cached JSON indexes that can be reused later.
- For transcript matches, timestamps refer to the transcript chunk span.
- For visual matches, timestamps refer to the sampled frame timestamp.
- Visual search is based on sampled frames plus generated captions, not object detection or full video grounding.
- If a frame sample misses an event, that moment may not be retrievable.
- If the vision-language caption is vague or wrong, retrieval quality drops.
- Silent videos can still be indexed visually, but questions that depend on spoken content cannot be answered from the index.
- QA is retrieval-augmented generation, not guaranteed factual reasoning over the raw video.
- The system currently works video-by-video. Multi-video conversational memory is not implemented as a first-class feature.
The repository is organized as:
- src/: backend package
- tests/: unit tests runnable with uv
- experiments/notebook/: notebooks for live testing
- experiments/videos/: local sample videos for experiments
The main package lives under src/video_search_tool.
AppConfig centralizes the whole runtime configuration:
- API endpoints
- model ids
- media extraction parameters
- chunking parameters
- embedding settings
- search settings
- generation settings
- transcription backend settings
File: config.py
Two ffmpeg-based extractors are used:
- audio extraction for transcription
- frame extraction for visual indexing
File: media.py
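As a rough sketch of what the two extractors might run, the helpers below build ffmpeg argument lists for audio and frame extraction. The function names and the output file pattern are assumptions for illustration; the ffmpeg options themselves (`-vn`, `-ac`, `-ar`, `-vf fps=...`, `-frames:v`) are standard, and the defaults mirror the `media` values in the configuration example.

```python
def audio_extract_command(video_path, wav_path, ffmpeg="ffmpeg", sample_rate=16000):
    """Build an ffmpeg command that writes mono 16-bit PCM WAV audio."""
    return [
        ffmpeg, "-y",
        "-i", str(video_path),
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample for the speech model
        "-c:a", "pcm_s16le",
        str(wav_path),
    ]


def frame_extract_command(video_path, out_dir, ffmpeg="ffmpeg", interval=2.0, max_frames=120):
    """Build an ffmpeg command that samples one frame every `interval` seconds."""
    return [
        ffmpeg, "-y",
        "-i", str(video_path),
        "-vf", f"fps=1/{interval}",    # one output frame per interval
        "-frames:v", str(max_frames),  # stop after the frame budget is reached
        f"{out_dir}/frame_%05d.jpg",   # hypothetical output naming pattern
    ]


def frame_timestamp(index, interval=2.0):
    """Approximate timestamp in seconds of the index-th sampled frame (0-based)."""
    return index * interval


audio_cmd = audio_extract_command("path/to/video.mp4", "out.wav")
frame_cmd = frame_extract_command("path/to/video.mp4", "frames")
```

Either list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.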
The NVIDIA integration layer handles:
- embeddings
- reranking
- multimodal chat completions
- Riva transcription
File: nim.py
VideoIndexer creates the final VideoIndex.
VideoSearcher performs semantic retrieval and optional reranking over the indexed chunks.
File: service.py
The assistant layer adds:
- frame captioning
- summary generation
- question answering using retrieved evidence
File: assistant.py
Indexes are saved as JSON and reloaded later.
File: storage.py
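A minimal sketch of the save and reload round trip, assuming the index is already a JSON-serializable dictionary. `save_index` and `load_index` are hypothetical names and may not match the functions in storage.py.

```python
import json
from pathlib import Path


def save_index(index: dict, path) -> None:
    """Write the index as UTF-8 JSON, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(index, indent=2), encoding="utf-8")


def load_index(path) -> dict:
    """Read a previously saved index back into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Because embeddings are stored inside the file, reloading an index avoids repeating transcription, captioning, and embedding calls for the same video.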
The persisted video index contains:
- video_id
- video_path
- indexed_at
- metadata
- summary
- transcript_segments
- chunks
Each chunk is either:
- an audio chunk from transcript text
- a visual chunk from a frame caption
Each chunk also stores:
- start_seconds
- end_seconds
- label
- embedding
This is why one search query can return either a spoken moment or a visual moment.
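The index layout described above could be modeled with dataclasses along these lines. The top-level field names come from the list of persisted fields; the `modality` and `text` field names on a chunk, and all sample values, are assumptions for illustration only.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Chunk:
    modality: str          # "audio" or "visual" (field name is an assumption)
    text: str              # transcript text or frame caption (field name is an assumption)
    start_seconds: float
    end_seconds: float
    label: str
    embedding: list = field(default_factory=list)


@dataclass
class VideoIndex:
    video_id: str
    video_path: str
    indexed_at: str
    metadata: dict
    summary: str
    transcript_segments: list
    chunks: list


# Placeholder values; a real index is produced by the indexing pipeline.
index = VideoIndex(
    video_id="demo",
    video_path="path/to/video.mp4",
    indexed_at="2024-01-01T00:00:00Z",
    metadata={},
    summary="",
    transcript_segments=[],
    chunks=[
        Chunk("audio", "hello", 0.0, 4.0, "Transcript"),
        Chunk("visual", "a street at night", 6.0, 6.0, "Frame"),
    ],
)
payload = asdict(index)  # nested dataclasses become plain dicts, ready for json.dumps
```

Keeping both chunk kinds in one list with a shared shape is what lets a single similarity search rank spoken and visual moments together.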
The default configuration currently uses:
- Transcription: nvidia/parakeet-1_1b-rnnt-multilingual-asr
- Embeddings: nvidia/llama-nemotron-embed-1b-v2
- Reranking: nvidia/llama-nemotron-rerank-1b-v2
- Vision-language chat / summary / frame captioning / QA: nvidia/nemotron-nano-12b-v2-vl
- The Riva path is selected with the hosted function-id.
- The embedding model is asymmetric.
- Indexed chunks are embedded with input_type="passage".
- User queries are embedded with input_type="query".
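To illustrate the asymmetric setup, the sketch below builds the two kinds of request body. The `embedding_payload` helper is hypothetical, and the body shape (`model`, `input`, `input_type`) is an assumption based on the OpenAI-style embeddings format; check it against the endpoint contract actually used in nim.py.

```python
EMBEDDING_MODEL = "nvidia/llama-nemotron-embed-1b-v2"


def embedding_payload(texts, input_type, model=EMBEDDING_MODEL):
    """Build a request body for indexing ("passage") or searching ("query")."""
    if input_type not in ("passage", "query"):
        raise ValueError(f"unsupported input_type: {input_type!r}")
    return {"model": model, "input": list(texts), "input_type": input_type}


# Index time: chunks are embedded as passages.
index_body = embedding_payload(["a street at night", "hello and welcome"], "passage")
# Search time: the user query is embedded as a query.
query_body = embedding_payload(["when does the monument appear?"], "query")
```

Mixing the two types up (for example, embedding a query as a passage) would still return vectors, but similarity scores would likely degrade, so validating `input_type` early is a cheap safeguard.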
This project is designed for uv.
Install the project and the notebook/dev tools with:
uv sync --all-groups
ffmpeg must be installed and available on PATH.
On Windows, a practical install path is:
winget install -e --id Gyan.FFmpeg
Verify it with:
ffmpeg -version
The code expects an NVIDIA API key in:
NVIDIA_API_KEY
Example PowerShell session:
$env:NVIDIA_API_KEY="your-key-here"
uv run python -m unittest discover -s tests -v
uv run video-search print-config
uv run video-search index --video path/to/video.mp4 --output artifacts/video.index.json
This will:
- extract media evidence
- create embeddings
- build a summary
- store the JSON index
uv run video-search search --index artifacts/video.index.json --query "when does the Arc de Triomphe appear?"
uv run video-search ask --index artifacts/video.index.json --question "What happens in this video and when does the main monument appear?"
The main live testing notebook is:
- 01_full_feature_test.ipynb
It is meant to test the real backend end to end on local videos.
The notebook covers:
- path setup
- prerequisite checks
- index creation and cache reuse
- summary inspection
- transcript preview
- search tests
- QA tests
- manual playground cells
Indexes generated by the notebook are cached under:
experiments/notebook/artifacts/indexes
Temporary extraction files are written under:
experiments/notebook/artifacts/work
The CLI accepts a JSON config through --config.
Every major part of the runtime is parameterized:
- API endpoints
- selected model ids
- frame sampling rate
- maximum frame count
- whether indexing should continue without audio
- transcript chunk size
- embedding batch size
- search result limits
- reranking behavior
- summary / QA generation token budgets
- Riva backend settings
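One common way to support such a file is to layer its values over built-in defaults. The sketch below shows a recursive merge; `merge_config` is a hypothetical helper, and whether the CLI actually accepts partial files (rather than requiring every section) is an assumption that should be checked against config.py.

```python
import copy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied, merging nested sections key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "media": {"frame_interval_seconds": 2.0, "max_frames": 120},
    "search": {"result_limit": 5, "reranking_enabled": True},
}
overrides = {"media": {"frame_interval_seconds": 1.0}}
config = merge_config(defaults, overrides)
```

Here only the frame interval changes; `max_frames` and the whole `search` section keep their default values, and the `defaults` dictionary itself is left untouched.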
Minimal example:
{
  "api": {
    "api_key_env": "NVIDIA_API_KEY",
    "embedding_url": "https://integrate.api.nvidia.com/v1/embeddings",
    "reranking_url": "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-nemotron-rerank-1b-v2/reranking",
    "chat_url": "https://integrate.api.nvidia.com/v1/chat/completions",
    "transcription_url": "https://integrate.api.nvidia.com/v1/audio/transcriptions",
    "timeout_seconds": 120.0
  },
  "models": {
    "transcription_model": "nvidia/parakeet-1_1b-rnnt-multilingual-asr",
    "embedding_model": "nvidia/llama-nemotron-embed-1b-v2",
    "reranking_model": "nvidia/llama-nemotron-rerank-1b-v2",
    "vision_language_model": "nvidia/nemotron-nano-12b-v2-vl"
  },
  "media": {
    "ffmpeg_binary": "ffmpeg",
    "audio_sample_rate": 16000,
    "frame_interval_seconds": 2.0,
    "max_frames": 120,
    "continue_without_audio": true
  },
  "chunking": {
    "max_chunk_duration_seconds": 24.0,
    "max_chunk_characters": 700,
    "overlap_segments": 1
  },
  "search": {
    "candidate_pool_size": 8,
    "result_limit": 5,
    "min_semantic_score": 0.1,
    "reranking_enabled": true,
    "answer_evidence_count": 6
  },
  "generation": {
    "frame_caption_max_tokens": 200,
    "summary_max_tokens": 400,
    "answer_max_tokens": 500,
    "temperature": 0.2
  },
  "transcription": {
    "backend": "riva_grpc",
    "server_uri": "grpc.nvcf.nvidia.com:443",
    "use_ssl": true,
    "function_id": "71203149-d3b7-4460-8231-1be2543a1fca",
    "riva_model_name": null,
    "language_code": "en-US",
    "max_alternatives": 1,
    "enable_word_time_offsets": true,
    "automatic_punctuation": true,
    "profanity_filter": false,
    "verbatim_transcripts": false
  }
}
When you run a query:
- The query is embedded with the NVIDIA embedding model as a query.
- Stored chunks are already embedded as passage.
- Cosine similarity produces an initial candidate set.
- Optional reranking reorders the best candidates.
- Results are returned with text, modality, and timestamps.
Search can retrieve:
- transcript matches
- visual frame-caption matches
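The retrieval steps above can be sketched in plain Python. This is an illustrative approximation rather than the code in service.py: `rank_chunks` is a hypothetical function, the `rerank` callable stands in for the optional reranking model, and the defaults mirror the `search` values in the configuration example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_embedding, chunks, candidate_pool_size=8, result_limit=5,
                min_semantic_score=0.1, rerank=None):
    """Score chunks, drop weak matches, optionally rerank, and keep the top results."""
    scored = [(cosine_similarity(query_embedding, c["embedding"]), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] >= min_semantic_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    candidates = scored[:candidate_pool_size]
    if rerank is not None:
        # Hypothetical hook: takes and returns a list of (score, chunk) pairs.
        candidates = rerank(candidates)
    return candidates[:result_limit]


chunks = [
    {"label": "a", "modality": "audio", "embedding": [1.0, 0.0]},
    {"label": "b", "modality": "visual", "embedding": [0.0, 1.0]},
    {"label": "c", "modality": "visual", "embedding": [1.0, 1.0]},
]
hits = rank_chunks([1.0, 0.0], chunks)
```

In this toy example chunk "b" is orthogonal to the query and falls below the score threshold, so only "a" and "c" are returned, in that order.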
When you ask a question:
- The system first searches the indexed chunks.
- The best hits become the evidence set.
- The assistant prompt includes:
- the video summary
- the retrieved evidence
- the user question
- The NVIDIA chat model generates the final answer.
That means QA quality depends directly on:
- frame sampling quality
- frame caption quality
- transcript quality
- retrieval quality
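A sketch of how the three prompt ingredients might be assembled into one message. The function names, the prompt wording, and the `modality` and `text` keys on each evidence item are assumptions; the actual prompt in assistant.py may differ.

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS for display in the prompt."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_qa_prompt(summary, evidence, question):
    """Combine the video summary, retrieved evidence, and user question."""
    lines = ["Video summary:", summary, "", "Evidence:"]
    for number, hit in enumerate(evidence, start=1):
        span = (
            f"{format_timestamp(hit['start_seconds'])}"
            f"-{format_timestamp(hit['end_seconds'])}"
        )
        lines.append(f"[{number}] ({hit['modality']}, {span}) {hit['text']}")
    lines += [
        "",
        f"Question: {question}",
        "Answer using only the evidence above and cite timestamps.",
    ]
    return "\n".join(lines)


prompt = build_qa_prompt(
    "A walking tour of a city.",
    [{"modality": "visual", "start_seconds": 75.0, "end_seconds": 75.0,
      "text": "A large stone arch at the end of an avenue."}],
    "When does the monument appear?",
)
```

Numbering each evidence item and including its time span gives the model something concrete to cite, which is how answers can carry timestamps back to the user.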
The repository includes unit tests for:
- chunking
- storage round-trip
- indexing behavior
- silent-video handling
- multimodal indexing
- assistant QA behavior
Run them with:
uv run python -m unittest discover -s tests -v
Main test files:
- test_chunking.py
- test_service.py
- test_storage.py
- test_assistant.py
- This is a backend-oriented project, not a finished UI chatbot application.
- There is no persistent conversation memory layer beyond the current indexed video.
- Visual retrieval is approximate because it relies on sampled frames and generated captions.
- Some NVIDIA endpoints are strict about request shape; the implementation reflects the current API contracts used by this project.
- End-to-end quality depends on the selected model endpoints remaining compatible with the configured request formats.
This project is currently a multimodal video indexing and retrieval backend with a QA layer on top.
It is capable of:
- turning one video into a searchable index
- handling videos with or without useful audio
- searching for visible moments and spoken moments
- generating summaries
- answering grounded questions with timestamps
It is not yet a perfect video-grounding system, but it is a solid modular base for a chatbot-like assistant over local videos.