# 🎥 VidGenie : Your Smart Video Assistant

## Preprocessing 
1. Download Videos (Using yt-dlp)
2. Extract Metadata(Title, Description, Video URI)
3. Convert Video to Audio (For transcription)
4. Check for Existing Transcript (WebVTT format)
5. Generate Transcript (Using Speech-to-Text )

#### Technologies Used:  yt-dlp, ffmpeg

In [None]:
import os
import json
import cv2
import ffmpeg
import yt_dlp
import whisper
from pathlib import Path
from typing import Optional, List, Dict
from webvtt import WebVTT, Caption
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image


1. **Downloads videos** using `yt-dlp`, extracting metadata (title, description, path).  
2. **Fetches subtitles** (manual or auto-generated) in `.vtt` format.  
3. **Handles errors** gracefully during downloads.  

In [None]:
class VideoDownloader:
    def __init__(self, base_dir: str = "./shared_data/videos"):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def download_video(self, url: str, video_id: str) -> Optional[Dict]:
        output_dir = self.base_dir / video_id
        output_dir.mkdir(exist_ok=True)
        ydl_opts = {
            'format': 'best',
            'outtmpl': str(output_dir / '%(title)s.%(ext)s'),
            'quiet': True,
            'ignoreerrors': True,
            'writedescription': True,  # Fetch video description
        }
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            try:
                info = ydl.extract_info(url, download=True)
            except Exception as e:
                print(f"Error during video download/extraction: {e}")
                return None

            if not info:
                return None
            video_path = Path(ydl.prepare_filename(info))
            if not video_path.exists():
                return None
            return {
                "video_path": video_path,
                "title": info.get("title", ""),
                "description": info.get("description", ""),
                "url": url,
            }

    def download_subtitles(self, url: str, video_id: str) -> Optional[Path]:
        output_dir = self.base_dir / video_id
        output_dir.mkdir(exist_ok=True)
        ydl_opts = {
            'skip_download': True,
            'writesubtitles': True,
            'writeautomaticsub': True,
            'subtitleslangs': ['en'],
            'subtitlesformat': 'vtt',
            'outtmpl': str(output_dir / '%(title)s.%(ext)s'),
            'quiet': True,
            'ignoreerrors': True
        }
        try:
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                info = ydl.extract_info(url, download=True)
        except Exception as e:
            print(f"Error during subtitle download/extraction: {e}")
            return None

        if not info:
            return None

        requested_subtitles = info.get('requested_subtitles')
        if requested_subtitles:
            subs = requested_subtitles.get('en', {})
            if subs and (sub_path := Path(subs.get('filepath', ''))).exists():
                return sub_path

        # Check for existing vtt files in the output directory
        for f in output_dir.glob("*.en.vtt"):
            return f

        # If no subtitles were downloaded, return None
        return None


1. **Model Initialization** – Loads the Whisper ASR model for speech-to-text processing.  
2. **Subtitle Processing** – Utilizes existing subtitles if available; otherwise, generates new subtitles.  
3. **Audio Extraction** – Converts video audio to a compatible format using `ffmpeg`.  
4. **Speech Transcription** – Transcribes extracted audio and translates it into English.  
5. **Subtitle Formatting & Saving** – Converts transcriptions into `.vtt` format and stores them.

In [None]:
class TranscriptProcessor:
    def __init__(self):
        self.model = whisper.load_model("base")

    def process_subtitles(self, video_path: Path, sub_path: Optional[Path]) -> Path:
        video_dir = video_path.parent
        if sub_path and sub_path.exists() and sub_path.stat().st_size > 0:
            return sub_path
        return self._generate_subtitles(video_path, video_dir)

    def _generate_subtitles(self, video_path: Path, output_dir: Path) -> Path:
        audio_path = output_dir / "temp_audio.wav"
        vtt_path = output_dir / "generated_subtitles.vtt"
        (
            ffmpeg.input(str(video_path))
            .output(str(audio_path), acodec='pcm_s16le', ar=16000, ac=1)
            .run(quiet=True)
        )
        result = self.model.transcribe(str(audio_path), task="translate", language="en", fp16=False)
        self._create_vtt(result["segments"], vtt_path)
        if audio_path.exists():
            audio_path.unlink()
        return vtt_path

    def _create_vtt(self, segments: List[Dict], output_path: Path):
        vtt = WebVTT()
        for seg in segments:
            caption = Caption(
                self._format_time(seg['start']),
                self._format_time(seg['end']),
                seg['text'].strip()
            )
            vtt.captions.append(caption)
        vtt.save(str(output_path))

    @staticmethod
    def _format_time(seconds: float) -> str:
        hours = int(seconds // 3600)
        mins = int((seconds % 3600) // 60)
        secs = seconds % 60
        return f"{hours:02}:{mins:02}:{secs:06.3f}"


1. **Model Initialization** – Loads the BLIP model for image captioning and sets up video frame processing.
2. **Video Frame Extraction** – Captures frames at specified intervals from the video.
3. **Frame Description Generation** – Uses the BLIP model to generate captions for each extracted frame.
4. **Subtitle Matching** – Matches subtitles to frames based on timestamp alignment.
5. **Metadata Collection** – Organizes and stores frame paths, timestamps, subtitles, and descriptions.

In [None]:
class FrameProcessor:
    def __init__(self):
        from transformers import BlipProcessor, BlipForConditionalGeneration
        from PIL import Image
        import cv2
        self.processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
        self.model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def process_video(self, video_path: Path, vtt_path: Path) -> List[Dict]:
        video_dir = video_path.parent
        frames_dir = video_dir / "frames"
        frames_dir.mkdir(exist_ok=True)
        subtitles = self._parse_vtt(vtt_path)
        import cv2
        cap = cv2.VideoCapture(str(video_path))
        fps = cap.get(cv2.CAP_PROP_FPS)
        frame_interval = int(round(fps))
        metadata = []
        frame_count = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_count % frame_interval == 0:
                timestamp = frame_count / fps
                frame_description = self._generate_frame_description(frame)
                self._process_frame(frame, frame_count, timestamp, frames_dir, subtitles, metadata, frame_description)
            frame_count += 1
        cap.release()
        return metadata

    def _generate_frame_description(self, frame) -> str:
        from PIL import Image
        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        inputs = self.processor(pil_image, return_tensors="pt")
        out = self.model.generate(**inputs)
        return self.processor.decode(out[0], skip_special_tokens=True)

    def _process_frame(self, frame, count: int, timestamp: float, frames_dir: Path, subtitles: List[Dict], metadata: List, frame_description: str):
        frame_path = frames_dir / f"frame_{count}_time_{timestamp:.2f}.jpg"
        cv2.imwrite(str(frame_path), frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
        metadata.append({
            "frame_path": str(frame_path.relative_to(frames_dir.parent)),
            "timestamp": round(timestamp, 2),
            "subtitles": self._get_matching_subtitles(subtitles, timestamp),
            "frame_description": frame_description
        })

    def _parse_vtt(self, path: Path) -> List[Dict]:
        return [{
            "start": self._time_to_sec(caption.start),
            "end": self._time_to_sec(caption.end),
            "text": caption.text.strip()
        } for caption in WebVTT().read(str(path)).captions]

    def _get_matching_subtitles(self, subtitles: List[Dict], timestamp: float) -> List[str]:
        return [sub['text'] for sub in subtitles if sub['start'] <= timestamp <= sub['end']]

    @staticmethod
    def _time_to_sec(time_str: str) -> float:
        parts = list(map(float, time_str.replace(',', '.').split(':')))
        if len(parts) == 3:
            return parts[0] * 3600 + parts[1] * 60 + parts[2]
        elif len(parts) == 2:
            return parts[0] * 60 + parts[1]
        return 0.0


1. **Video Pipeline**: Downloads and processes 5 YouTube videos, extracting metadata and subtitles.
2. **Subtitle Handling**: Downloads or generates subtitles for each video, processes them into a readable format.
3. **Frame Processing**: Extracts frames from each video and generates associated metadata.
4. **Metadata Generation**: Combines title, description, transcript, and frame data for each video.
5. **Metadata Storage**: Saves the processed metadata for all 5 videos into individual JSON files.

In [None]:
def process_video_pipeline(url: str, video_id: str):
    downloader = VideoDownloader()
    transcript_processor = TranscriptProcessor()
    frame_processor = FrameProcessor()

    # Step 1: Download video and extract metadata
    video_info = downloader.download_video(url, video_id)
    if not video_info or not video_info["video_path"].exists():
        print("Video download failed.")
        return

    # Step 2: Download or generate subtitles
    sub_path = downloader.download_subtitles(url, video_id)
    if not sub_path:
       print("No subtitles downloaded, generating subtitles...")
    vtt_path = transcript_processor.process_subtitles(video_info["video_path"], sub_path)

    # Step 3: Process frames and generate descriptions
    frame_metadata = frame_processor.process_video(video_info["video_path"], vtt_path)

    # Step 4: Prepare final metadata
    metadata = {
        "title": video_info["title"],
        "description": video_info["description"],
        "transcript": [{
            "start_time": seg["start"],
            "end_time": seg["end"],
            "text": seg["text"]
        } for seg in frame_processor._parse_vtt(vtt_path)],
        "frames": frame_metadata,
        "video_uri": video_info["url"],
    }

    # Step 5: Save metadata
    metadata_path = video_info["video_path"].parent / "metadata.json"
    import json
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)

    print(f"Metadata saved to {metadata_path}")



if __name__ == "__main__":
    video_urls = [
        "https://www.youtube.com/watch?v=ftDsSB3F5kg",
        "https://www.youtube.com/watch?v=kKFrbhZGNNI",
        "https://www.youtube.com/watch?v=6qUxwZcTXHY",
        "https://www.youtube.com/watch?v=MspNdsh0QcM",
        "https://www.youtube.com/watch?v=Kf57KGwKa0w"
    ]
    
    for idx, url in enumerate(video_urls, 1):
        video_id = f"video_{idx}"
        process_video_pipeline(url, video_id)

## Retrieval & Generation

1. Metadata Chunking (JSON, Dict, List)
2. Embedding Generation (SentenceTransformer)
3. Vector Database: ChromaDB, Python (chromadb library)
4. Data Storage & Retrieval(for vector storage and querying)


#### Technologies used : json, ChromaDB, SentenceTransformer

In [7]:
#IMPORTING NECESSARY LIBRARIES

import chromadb
from sentence_transformers import SentenceTransformer
from chromadb.config import Settings

- **Chunking**: Divides metadata into smaller segments with title, description, transcript, and frame data.
- **Frame Matching**: Associates frame descriptions with corresponding transcript segments.
- **Return**: Outputs a list of structured metadata chunks.

In [8]:
class MetadataChunker:
    def __init__(self, metadata: Dict):
        self.metadata = metadata

    def chunk_metadata(self) -> List[Dict]:
        """
        Chunk the metadata into smaller, meaningful pieces.
        Each chunk will include:
        - Title
        - Description
        - Transcript segment (start_time, end_time, text)
        - Frame description (if available for the segment)
        - Video URI
        """
        chunks = []

        # Chunk based on transcript segments
        for transcript_segment in self.metadata["transcript"]:
            chunk = {
                "title": self.metadata["title"],
                "description": self.metadata["description"],
                "start_time": transcript_segment["start_time"],
                "end_time": transcript_segment["end_time"],
                "text": transcript_segment["text"],
                "video_uri": self.metadata["video_uri"],
            }

            # Add frame descriptions if available for the segment
            frame_descriptions = []
            for frame in self.metadata["frames"]:
                if transcript_segment["start_time"] <= frame["timestamp"] <= transcript_segment["end_time"]:
                    frame_descriptions.append(frame["frame_description"])
            chunk["frame_descriptions"] = frame_descriptions

            chunks.append(chunk)

        return chunks

- **Initialization**: Initializes a SentenceTransformer model for embedding generation.
- **Text Preparation**: Combines relevant fields from each chunk into a text string.
- **Embedding Generation**: Generates embeddings for the prepared text using the model.

In [9]:
class EmbeddingGenerator:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def generate_embeddings(self, chunks: List[Dict]) -> List[List[float]]:
        """
        Generate embeddings for each chunk's text.
        """
        texts = [self._prepare_text(chunk) for chunk in chunks]
        return self.model.encode(texts).tolist()

    @staticmethod
    def _prepare_text(chunk: Dict) -> str:
        """
        Prepare the text for embedding by combining relevant fields.
        """
        text = f"Title: {chunk['title']}\nDescription: {chunk['description']}\nTranscript: {chunk['text']}"
        if chunk["frame_descriptions"]:
            text += f"\nFrame Descriptions: {' '.join(chunk['frame_descriptions'])}"
        return text


- **Initialization**: Sets up ChromaDB client and collection.
- **Storing Chunks**: Stores chunks, embeddings, documents, and metadata.
- **Document Preparation**: Prepares text document with title, description, and transcript.
- **Metadata Preparation**: Prepares metadata (start time, end time, video URI).

In [10]:
class VectorDB:
    def __init__(self, db_path: str = "./chroma_db"):
        self.client = chromadb.Client(Settings(persist_directory=db_path, is_persistent=True))
        self.collection = self.client.get_or_create_collection(name="video_metadata")

    def store_chunks(self,
                     chunks: List[Dict], embeddings: List[List[float]]):
        """
        Store chunks and their embeddings in the vector database.
        """
        ids = [str(i) for i in range(len(chunks))]
        documents = [self._prepare_document(chunk) for chunk in chunks]
        metadatas = [self._prepare_metadata(chunk) for chunk in chunks]

        self.collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=documents,
            metadatas=metadatas,
        )

    @staticmethod
    def _prepare_document(chunk: Dict) -> str:
        """
        Prepare the document text for storage.
        """
        return f"Title: {chunk['title']}\nDescription: {chunk['description']}\nTranscript: {chunk['text']}"

    @staticmethod
    def _prepare_metadata(chunk: Dict) -> Dict:
        """
        Prepare metadata for storage.
        """
        return {
            "start_time": chunk["start_time"],
            "end_time": chunk["end_time"],
            "video_uri": chunk["video_uri"],
        }


- **Retrieval**: Converts the query into an embedding, queries the database for relevant chunks, and returns the top results.
- **Results Formatting**: Formats the results into a list of relevant chunks with video URI, start time, and text.

In [11]:
class Retriever:
    def __init__(self, db_path: str = "./chroma_db"):
        self.client = chromadb.Client(Settings(persist_directory=db_path, is_persistent=True))
        self.collection = self.client.get_collection(name="video_metadata")
        self.embedding_generator = EmbeddingGenerator()

    def retrieve(self, query: str, top_k: int = 3) -> List[Dict]:
        """
        Retrieve the most relevant chunks based on the user's query.
        """
        query_embedding = self.embedding_generator.generate_embeddings([{"title": "", "description": "", "text": query, "frame_descriptions": []}])[0]
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
        )

        # Format the results
        retrieved_chunks = []
        for i in range(len(results["ids"][0])):
            retrieved_chunks.append({
                "video_uri": results["metadatas"][0][i]["video_uri"],
                "start_time": results["metadatas"][0][i]["start_time"],
                "text": results["documents"][0][i],
            })

        return retrieved_chunks


- **Initialization**: Sets up the directory path, VectorDB, and embedding generator.
- **Metadata Processing**: Iterates through metadata files, loads them, chunks the data, and generates embeddings.
- **Storing in VectorDB**: Stores the chunked data and embeddings in the vector database.

In [12]:
def process_all_metadata_for_vectordb(metadata_dir: str):
    """
    Process all metadata files in the directory and store them in the vector database.
    """
    metadata_dir = Path(metadata_dir)
    vectordb = VectorDB()
    embedding_generator = EmbeddingGenerator()

    # Iterate through all metadata files
    for metadata_file in metadata_dir.glob("**/metadata.json"):
        print(f"Processing {metadata_file}...")

        # Load metadata
        with open(metadata_file, "r") as f:
            metadata = json.load(f)

        # Chunk metadata
        chunker = MetadataChunker(metadata)
        chunks = chunker.chunk_metadata()

        # Generate embeddings
        embeddings = embedding_generator.generate_embeddings(chunks)

        # Store in VectorDB
        vectordb.store_chunks(chunks, embeddings)

    print(f"All metadata processed and stored in VectorDB.")


- **Initialization**: Retrieves relevant results from the vector database based on the user’s query.
- **Query Execution**: Calls `Retriever` to fetch top results for the query.
- **Results Display**: Prints out the video URI, start time, and corresponding text for each retrieved result.
- **Main Execution**: Processes all metadata and then queries the VectorDB with a specific question.

In [13]:
def query_vectordb(query: str):
    """
    Query the vector database and retrieve relevant results.
    """
    retriever = Retriever()
    results = retriever.retrieve(query)

    print("Retrieved Results:")
    for result in results:
        print(f"Video URI: {result['video_uri']}")
        print(f"Start Time: {result['start_time']}")
        print(f"Text: {result['text']}")
        print("-" * 50)


if __name__ == "__main__":
    # Step 1: Process all metadata files and store in VectorDB
    metadata_dir = "./shared_data/videos"
    process_all_metadata_for_vectordb(metadata_dir)

    # Step 2: Query the VectorDB
    query = "After completing any story, what is the next crucial step"
    query_vectordb(query)

Processing shared_data/videos/video_2/metadata.json...
Processing shared_data/videos/video_5/metadata.json...


Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
Insert of existing embedding ID: 3
Insert of existing embedding ID: 4
Insert of existing embedding ID: 5
Insert of existing embedding ID: 6
Insert of existing embedding ID: 7
Insert of existing embedding ID: 8
Insert of existing embedding ID: 9
Insert of existing embedding ID: 10
Insert of existing embedding ID: 11
Insert of existing embedding ID: 12
Insert of existing embedding ID: 13
Insert of existing embedding ID: 14
Insert of existing embedding ID: 15
Insert of existing embedding ID: 16
Insert of existing embedding ID: 17
Insert of existing embedding ID: 18
Insert of existing embedding ID: 19
Insert of existing embedding ID: 20
Insert of existing embedding ID: 21
Insert of existing embedding ID: 22
Insert of existing embedding ID: 23
Insert of existing embedding ID: 24
Insert of existing embedding ID: 25
Insert of existing embedding ID: 26
Insert of existing embedding ID: 27
In

Processing shared_data/videos/video_4/metadata.json...


Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
Insert of existing embedding ID: 3
Insert of existing embedding ID: 4
Insert of existing embedding ID: 5
Insert of existing embedding ID: 6
Insert of existing embedding ID: 7
Insert of existing embedding ID: 8
Insert of existing embedding ID: 9
Insert of existing embedding ID: 10
Insert of existing embedding ID: 11
Insert of existing embedding ID: 12
Insert of existing embedding ID: 13
Insert of existing embedding ID: 14
Insert of existing embedding ID: 15
Insert of existing embedding ID: 16
Insert of existing embedding ID: 17
Insert of existing embedding ID: 18
Insert of existing embedding ID: 19
Insert of existing embedding ID: 20
Insert of existing embedding ID: 21
Insert of existing embedding ID: 22
Insert of existing embedding ID: 23
Insert of existing embedding ID: 24
Insert of existing embedding ID: 25
Insert of existing embedding ID: 26
Insert of existing embedding ID: 27
In

Processing shared_data/videos/video_3/metadata.json...


Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
Insert of existing embedding ID: 3
Insert of existing embedding ID: 4
Insert of existing embedding ID: 5
Insert of existing embedding ID: 6
Insert of existing embedding ID: 7
Insert of existing embedding ID: 8
Insert of existing embedding ID: 9
Insert of existing embedding ID: 10
Insert of existing embedding ID: 11
Insert of existing embedding ID: 12
Insert of existing embedding ID: 13
Insert of existing embedding ID: 14
Insert of existing embedding ID: 15
Insert of existing embedding ID: 16
Insert of existing embedding ID: 17
Insert of existing embedding ID: 18
Insert of existing embedding ID: 19
Insert of existing embedding ID: 20
Insert of existing embedding ID: 21
Insert of existing embedding ID: 22
Insert of existing embedding ID: 23
Insert of existing embedding ID: 24
Insert of existing embedding ID: 25
Insert of existing embedding ID: 26
Insert of existing embedding ID: 27
In

Processing shared_data/videos/video_1/metadata.json...


Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
Insert of existing embedding ID: 3
Insert of existing embedding ID: 4
Insert of existing embedding ID: 5
Insert of existing embedding ID: 6
Insert of existing embedding ID: 7
Insert of existing embedding ID: 8
Insert of existing embedding ID: 9
Insert of existing embedding ID: 10
Insert of existing embedding ID: 11
Insert of existing embedding ID: 12
Insert of existing embedding ID: 13
Insert of existing embedding ID: 14
Insert of existing embedding ID: 15
Insert of existing embedding ID: 16
Insert of existing embedding ID: 17
Insert of existing embedding ID: 18
Insert of existing embedding ID: 19
Insert of existing embedding ID: 20
Insert of existing embedding ID: 21
Insert of existing embedding ID: 22
Insert of existing embedding ID: 23
Insert of existing embedding ID: 24
Insert of existing embedding ID: 25
Insert of existing embedding ID: 26
Insert of existing embedding ID: 27
In

All metadata processed and stored in VectorDB.
Retrieved Results:
Video URI: https://www.youtube.com/watch?v=MspNdsh0QcM
Start Time: 249.55
Text: Title: स्टोरीबोर्ड का निर्माण भाग - 1
Description: For more information and related videos visit us on http://www.digitalgreen.org/
Transcript: how to convert a story into a script and story board
--------------------------------------------------
Video URI: https://www.youtube.com/watch?v=MspNdsh0QcM
Start Time: 249.56
Text: Title: स्टोरीबोर्ड का निर्माण भाग - 1
Description: For more information and related videos visit us on http://www.digitalgreen.org/
Transcript: how to convert a story into a script and story board
Thank you
--------------------------------------------------
Video URI: https://www.youtube.com/watch?v=MspNdsh0QcM
Start Time: 245.84
Text: Title: स्टोरीबोर्ड का निर्माण भाग - 1
Description: For more information and related videos visit us on http://www.digitalgreen.org/
Transcript: what a story board is and
how to convert a s

--------------------------------------------------------------------------**END**----------------------------------------------------------------------------------------