# Setup Tasks

Installation: Intructions below

* ffmpeg
* tesseract


Beyond pip installing the relevant libraries, you also need to do the following to ensure that full youtube processing occurs (for the vision transformer portion).

**Need to install ffmpeg and add to path:https://www.gyan.dev/ffmpeg/builds/
get the ffmpeg-git-full.7z, extract it, add the bin to your System Environment path**

* https://www.7-zip.org/ is where you can download the extractor specific to your machine
* then you right click the ffmpeg-git-full.7z file, and choose 'Extract here'
* rename the file to 'ffmpeg-download', then move it to a newly created folder with the path 'C:\ffmpeg'
* Add new path in system environment variables with the following 'C:\ffmpeg\ffmpeg_download\bin'
* In a new Command Prompt, confirm correct installation with ffmpeg -version

**Need to install tesseract and do the same: https://github.com/UB-Mannheim/tesseract/wiki**

* After installation with its default options chosen, the default path should be 'C:\Program Files\Tesseract-OCR'
* Add this path in system environment variables
* In a new Command Prompt, confirm correct installation with tesseract --version

Note: You likely will need to manually type in the new paths in system environment variables as copy and paste won't work


## Install Libraries

In [None]:
!pip install PyPDF2
!pip install python-docx
!pip install youtube_transcript_api
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install os
!pip install re
!pip install opencv-python
!pip install moviepy
!pip install pytesseract
!pip install transformers youtube_transcript_api pytesseract moviepy opencv-python-headless yt-dlp requests
!pip install tf-keras 
!pip install faiss-cpu
!pip install google-generativeai

In [None]:
!pip uninstall -y sentence-transformers transformers huggingface_hub

!pip install transformers==4.30.0
!pip install huggingface-hub==0.14.1
!pip install sentence-transformers==2.2.2


## Import Libraries

In [1]:
# Standard library
import os
import json
import re
import glob
import pickle
import warnings
from typing import List, Dict, Any, Tuple
from difflib import SequenceMatcher 

# 3rd-party libraries
import PyPDF2  # type: ignore
import docx  # type: ignore
from docx import Document
import cv2
import numpy as np
import pytesseract
import yt_dlp
import faiss
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity 
from youtube_transcript_api import YouTubeTranscriptApi
from moviepy.video.io.VideoFileClip import VideoFileClip
import moviepy.config as mpy_config
from sentence_transformers import SentenceTransformer
from transformers import (
    ViTForImageClassification, ViTFeatureExtractor,
    BlipProcessor, BlipForConditionalGeneration,
    CLIPProcessor, CLIPModel,
    pipeline, AutoTokenizer
)

# FFmpeg and environment setup
import os
import shutil
import warnings

# Add FFmpeg to PATH
os.environ["PATH"] += os.pathsep + r"C:\ffmpeg\ffmpeg_download\bin"
os.environ["IMAGEIO_FFMPEG_LOGLEVEL"] = "quiet"

# Check if FFmpeg is available
FFMPEG_INSTALLED = shutil.which("ffmpeg") is not None
print("FFmpeg installed:", FFMPEG_INSTALLED)

# Other setup
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")


# Torch device setup
import torch
device = 0 if torch.cuda.is_available() else -1
print(f"Using {'GPU' if device == 0 else 'CPU'} for summarization")


  from .autonotebook import tqdm as notebook_tqdm


FFmpeg installed: False
Using GPU for summarization


## Set Designated Folder Path for your input files to be stored

In [2]:
set_folderpath = 'C:/Users/ethan/Documents/GitHub/MemoraVault/Documents'

## Add Youtube Links Through Code

You can manually create or open a file called 'youtube.txt' in the designated folderpath in the format of one valid url per line. This code block allows a user to do it within this file to add youtube urls to analyze. 

In [None]:
def is_valid_youtube_url(url):
    """Basic validation for YouTube URLs."""
    pattern = r"(https?://)?(www\.)?(youtube\.com/watch\?v=|youtu\.be/)[\w\-]{11}"
    return re.match(pattern, url) is not None

def create_or_update_youtube_txt(folder_path):
    youtube_file = None
    for f in os.listdir(folder_path):
        if f.lower() == "youtube.txt":
            youtube_file = f
            break

    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
        print(f"Created folder: {folder_path}")

    if youtube_file:
        file_path = os.path.join(folder_path, youtube_file)
        print(f"File already exists: {file_path}")
    else:
        file_path = os.path.join(folder_path, "YouTube.txt")
        with open(file_path, "w") as f:
            pass
        print(f"Created new file: {file_path}")

    print("\nEnter YouTube URLs (one per line). Type 'done' to finish:\n")

    with open(file_path, "a") as file:
        while True:
            user_input = input("> ").strip()
            if user_input.lower() == "done":
                break
            elif is_valid_youtube_url(user_input):
                file.write(user_input + "\n")
                print("Added.")
            else:
                print("Invalid YouTube URL. Please try again.")

    print(f"\nFinished updating: {file_path}")

In [None]:
if __name__ == "__main__":
    folder = set_folderpath
    create_or_update_youtube_txt(set_folderpath)


# File Intake Processor

#### PDF and Word Processor Functions

In [3]:
def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    # Replace all whitespace (including newlines) with a single space
                    cleaned_text = " ".join(page_text.split())
                    text += cleaned_text + "\n"
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")
    return text
def extract_text_from_word(doc_path):
    text = ""
    try:
        doc = docx.Document(doc_path)
        text = "\n".join(para.text for para in doc.paragraphs)
    except Exception as e:
        print(f"Error processing {doc_path}: {e}")
    return text
def process_documents(folder_path, output_folder):
    """
    Process all PDF and DOCX files in the folder and convert them into a list of
    content items in the format expected by the RAG pipeline. Then, append the entire
    list into a single JSON file in the output folder.
    """
    extracted_items = []
    os.makedirs(output_folder, exist_ok=True)  # Ensure output folder exists
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if filename.lower().endswith('.pdf'):
            print(f"Processing PDF: {filename}")
            pdf_text = extract_text_from_pdf(file_path)
            if pdf_text.strip():
                content_item = {
                    "video_id": filename,       # Use the filename as an ID
                    "frame": "full_text",       # Placeholder for full document
                    "timestamp": "0:00",        # No timestamp for a full document
                    "timestamp_seconds": 0,     # Default value
                    "content": pdf_text
                }
                extracted_items.append(content_item)
        elif filename.lower().endswith('.docx'):
            print(f"Processing Word Document: {filename}")
            word_text = extract_text_from_word(file_path)
            if word_text.strip():
                content_item = {
                    "video_id": filename,       # Use the filename as an ID
                    "frame": "full_text",
                    "timestamp": "0:00",
                    "timestamp_seconds": 0,
                    "content": word_text
                }
                extracted_items.append(content_item)
    output_file = os.path.join(output_folder, "extracted_text.json")
    # Check if the JSON file exists. If so, load its existing content.
    if os.path.exists(output_file):
        with open(output_file, "r", encoding="utf-8") as f:
            try:
                existing_items = json.load(f)
            except json.JSONDecodeError:
                existing_items = []
    else:
        existing_items = []
    # Append new items to the existing ones.
    existing_items.extend(extracted_items)
    # Write the combined list back to the JSON file.
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(existing_items, f, indent=4, ensure_ascii=False)
    print(f"Extracted text saved to: {output_file}")

#### Youtube Processor Function

In [4]:
def extract_video_id(url):
    """Extracts the YouTube video ID from a URL."""
    match = re.search(r"(?:v=|\/)([0-9A-Za-z_-]{11}).*", url)
    return match.group(1) if match else None
def get_youtube_transcript(video_url, output_folder):
    """Fetches and saves the transcript from a YouTube video as a .docx without timestamps."""
    video_id = extract_video_id(video_url)
    if not video_id:
        print(f"Could not extract video ID from: {video_url}")
        return None
    os.makedirs(output_folder, exist_ok=True)
    transcript_path = os.path.join(output_folder, f"{video_id}_transcript.docx")
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        combined_text = " ".join([entry["text"] for entry in transcript])
        doc = Document()
        doc.add_heading(f"YouTube Transcript - {video_id}", level=1)
        doc.add_paragraph(combined_text)
        doc.save(transcript_path)
        print(f"Transcript saved as Word doc: {transcript_path}")
        return transcript_path
    except Exception as e:
        print(f"Error fetching transcript: {e}")
        return None

### Combined Processor Function

In [5]:
def smart_process(folder_path):
    has_docs = False
    has_youtube = False
    output_folder = os.path.join(os.getcwd(), "output_data")
    os.makedirs(output_folder, exist_ok=True)  # Ensure output folder exists
    for filename in os.listdir(folder_path):
        if filename.lower().endswith(('.pdf', '.docx')):
            has_docs = True
        elif filename.lower() == 'youtube.txt':
            has_youtube = True
    # Process local PDF and Word documents
    if has_docs:
        print("Document files detected. Processing with `process_documents()`...")
        process_documents(folder_path, output_folder=output_folder)
    # Process YouTube URLs → save transcripts as .docx → then process those too
    if has_youtube:
        print("YouTube links detected. Fetching transcripts and saving as .docx...")
        youtube_file = os.path.join(folder_path, "YouTube.txt")
        with open(youtube_file, "r", encoding="utf-8") as f:
            urls = [line.strip() for line in f if line.strip()]
        for url in urls:
            print(f"Processing YouTube transcript: {url}")
            get_youtube_transcript(url, output_folder=output_folder)
        # Process the .docx files generated from YouTube transcripts
        print("Processing YouTube transcripts as documents...")
        process_documents(output_folder, output_folder=output_folder)
    if not has_docs and not has_youtube:
        print("No supported files (PDF, DOCX, or YouTube.txt) found in the folder.")

In [6]:
if __name__ == "__main__":
    smart_process(set_folderpath)

Document files detected. Processing with `process_documents()`...
Processing Word Document: AML Quiz 1 Study Guide.docx
Processing PDF: AML Quiz 1 Study Guide.pdf
Extracted text saved to: c:\Users\ethan\Documents\GitHub\MemoraVault\output_data\extracted_text.json
YouTube links detected. Fetching transcripts and saving as .docx...
Processing YouTube transcript: https://www.youtube.com/watch?v=Ilg3gGewQ5U
Transcript saved as Word doc: c:\Users\ethan\Documents\GitHub\MemoraVault\output_data\Ilg3gGewQ5U_transcript.docx
Processing YouTube transcripts as documents...
Processing Word Document: Ilg3gGewQ5U_transcript.docx
Extracted text saved to: c:\Users\ethan\Documents\GitHub\MemoraVault\output_data\extracted_text.json


# ViT for YouTube Videos

## Part 1: Initial Processing

In [9]:
# Load Vision Transformer (ViT) and BLIP models
vit_model_name = "google/vit-base-patch16-224"
blip_model_name = "Salesforce/blip-image-captioning-base"

feature_extractor = ViTFeatureExtractor.from_pretrained(vit_model_name)
vit_model = ViTForImageClassification.from_pretrained(vit_model_name)

blip_processor = BlipProcessor.from_pretrained(blip_model_name)
blip_model = BlipForConditionalGeneration.from_pretrained(blip_model_name)

# Ensure FFmpeg is installed
FFMPEG_INSTALLED = os.system("ffmpeg -version") == 0

# Create output directories
os.makedirs("videos", exist_ok=True)
os.makedirs("processed_frames", exist_ok=True)
os.makedirs("output_data", exist_ok=True)

# Define label mapping
def get_label_name(label_index):
    labels = vit_model.config.id2label  # Get label mapping from model config
    return labels.get(label_index, "unknown")

def extract_video_id(url):
    """Extracts the YouTube video ID from a URL."""
    import re
    match = re.search(r"(?:v=|\/)([0-9A-Za-z_-]{11}).*", url)
    return match.group(1) if match else None
    

def download_youtube_video(url):
    """Downloads a YouTube video and returns the saved file path."""
    video_id = extract_video_id(url)
    if not video_id:
        print(f"Could not extract video ID from: {url}")
        return None

    output_path = f"videos/{video_id}.mp4"

    # Skip download if video already exists
    if os.path.exists(output_path):
        print(f"Video already exists: {output_path}")
        return output_path

    ydl_opts = {
        'format': 'bestvideo+bestaudio/best',
        'outtmpl': output_path,
        'merge_output_format': 'mp4',
        'postprocessors': [{'key': 'FFmpegVideoConvertor', 'preferedformat': 'mp4'}],
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        try:
            ydl.download([url])
            return output_path
        except Exception as e:
            print(f"Video download failed: {e}")
            return None


def extract_frames(video_path, output_folder, interval=5):
    """Extracts frames from the video every 'interval' seconds."""
    os.makedirs(output_folder, exist_ok=True)
    clip = VideoFileClip(video_path)
    frame_times = np.arange(0, clip.duration, interval)

    frame_paths = []
    for time in frame_times:
        frame = clip.get_frame(time)
        frame_file = os.path.join(output_folder, f"frame_{int(time)}.jpg")
        cv2.imwrite(frame_file, cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        frame_paths.append(frame_file)

    clip.close()
    print(f"Extracted {len(frame_paths)} frames from video.")
    return frame_paths

def classify_image(image_path):
    """Classifies an image using ViT to determine content type."""
    image = Image.open(image_path).convert("RGB")
    inputs = feature_extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = vit_model(**inputs)
    predicted_label = outputs.logits.argmax(-1).item()
    return get_label_name(predicted_label)

def generate_caption(image_path):
    """Generates a caption for an image using BLIP."""
    image = Image.open(image_path).convert("RGB")
    inputs = blip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        caption_ids = blip_model.generate(**inputs)
    return blip_processor.decode(caption_ids[0], skip_special_tokens=True)

def extract_text(image_path):
    """Extracts text from images using OCR (Tesseract)."""
    image = cv2.imread(image_path)
    return pytesseract.image_to_string(image)

def process_video(video_url):
    """Main function to process the video: download, extract frames, analyze them."""
    # Ensure FFmpeg is available
    if not FFMPEG_INSTALLED:
        print("FFmpeg is not installed. Install it before running the script.")
        return

    video_path = download_youtube_video(video_url)
    if not video_path or not os.path.exists(video_path):
        print("Video download failed.")
        return

    print(f"Downloaded video: {video_path}")

    # Extract frames
    video_id = extract_video_id(video_url)
    output_folder = f"processed_frames/{video_id}"
    frame_paths = extract_frames(video_path, output_folder)

    # Process each frame
    processed_data = []
    for frame_path in frame_paths:
        label = classify_image(frame_path)
        caption = generate_caption(frame_path)
        text = extract_text(frame_path)

        processed_data.append({
            "frame": os.path.basename(frame_path),
            "label": label,
            "caption": caption,
            "extracted_text": text
        })

    # Save processed data as JSON
    json_path = f"output_data/{video_id}_processed.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(processed_data, f, indent=4, ensure_ascii=False)

    if processed_data:
        sample = processed_data[0]
        print(f"Processed {len(processed_data)} frames. Sample output:")
        print(f"- Frame: {sample['frame']}")
        print(f"- Label: {sample['label']}")
        print(f"- Caption: {sample['caption'][:100]}...")
        print(f"- OCR Text: {sample['extracted_text'][:100]}...\n")
    else:
        print("No frames were processed.")
    
    print(f"Saved JSON: {json_path}")
    
    return processed_data


def process_youtube_list(youtube_txt_path="YouTube.txt"):
    """Reads YouTube URLs from a text file and processes each video."""
    if not os.path.exists(youtube_txt_path):
        print(f" YouTube.txt file not found at: {youtube_txt_path}")
        return

    with open(youtube_txt_path, "r", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]

    if not urls:
        print(" No valid URLs found in YouTube.txt.")
        return

    print(f" Found {len(urls)} video(s). Starting processing...\n")

    for url in urls:
        print(f" Processing: {url}")
        process_video(url)
        print("-" * 60)


In [10]:
if __name__ == "__main__":
    folder = set_folderpath

    youtube_txt_path = os.path.join(folder, "YouTube.txt")
    process_youtube_list(youtube_txt_path)


 Found 1 video(s). Starting processing...

 Processing: https://www.youtube.com/watch?v=P87CA4rI__w
Video already exists: videos/P87CA4rI__w.mp4
Downloaded video: videos/P87CA4rI__w.mp4
{'video_found': True, 'audio_found': True, 'metadata': {'major_brand': 'isom', 'minor_version': '512', 'compatible_brands': 'isomiso2mp41', 'encoder': 'Lavf61.9.107'}, 'inputs': [{'streams': [{'input_number': 0, 'stream_number': 0, 'stream_type': 'video', 'language': 'eng', 'default': True, 'size': [1920, 1080], 'bitrate': 221, 'fps': 29.97002997002997, 'codec_name': 'vp9', 'profile': '(Profile 0)', 'metadata': {'Metadata': '', 'handler_name': 'VideoHandler', 'vendor_id': '[0][0][0][0]'}}, {'input_number': 0, 'stream_number': 1, 'stream_type': 'audio', 'language': 'eng', 'default': True, 'fps': 48000, 'bitrate': 112, 'metadata': {'Metadata': '', 'handler_name': 'SoundHandler', 'vendor_id': '[0][0][0][0]'}}], 'input_number': 0}], 'duration': 375.26, 'bitrate': 341, 'start': 0.0, 'default_video_input_numb

## Part 2: Filtering Through Processed Data

#### How Caption Similarity is Calculated

The script calculates caption similarity using Sentence Transformers (BERT-based embeddings). Each caption is converted into a numerical vector representation using the "all-MiniLM-L6-v2" model. These embeddings capture the semantic meaning of the text. To compare two captions, the script uses cosine similarity, which measures how close the two vectors are in high-dimensional space. If the similarity score is greater than **0.85**, the captions are considered duplicates, and the older frame may be replaced with the newer one.

#### How Image Similarity is Calculated

The script calculates image similarity using CLIP embeddings, which encode images into feature vectors that represent their content. Each frame is passed through OpenAI’s CLIP model, generating a vector that describes the image in a way that captures semantic and structural details. The script then uses cosine similarity to compare these vectors—if the similarity is greater than **0.85**, the images are considered visually redundant. In such cases, the older frame is replaced by the newer one.

In [32]:
# Load CLIP model for image similarity
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load Sentence Transformer for caption similarity
caption_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def extract_frame_number(frame_name):
    """Extracts the numerical part of the frame name."""
    match = re.search(r'frame_(\d+)', frame_name)
    return int(match.group(1)) if match else float('inf')

def extract_video_id_from_filename(filename):
    """Extracts the YouTube video ID from a filename like 'Ilg3gGewQ5U_processed.json'"""
    match = re.search(r'([0-9A-Za-z_-]{11})_processed', filename)
    return match.group(1) if match else None

def get_image_embedding(image_path):
    """Extracts CLIP embedding for an image."""
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = clip_model.get_image_features(**inputs)
    return embedding.squeeze().numpy()

def get_caption_embedding(caption):
    """Extracts sentence embedding for a caption using Sentence Transformers."""
    return caption_model.encode(caption)

def cosine_sim(vec1, vec2):
    """Computes cosine similarity between two vectors."""
    return cosine_similarity([vec1], [vec2])[0][0]

def filter_frames(json_file, similarity_threshold=0.85, image_threshold=0.85, base_image_folder="processed_frames"):
    """
    Filters frames based on both caption similarity and image similarity.
    """
    video_id = extract_video_id_from_filename(os.path.basename(json_file))
    if not video_id:
        raise ValueError("Could not extract video ID from JSON filename.")
    
    image_folder = f"{base_image_folder}/{video_id}"
    
    with open(json_file, "r", encoding="utf-8") as f:
        data = json.load(f)

    seen_captions = {}  # Stores caption embeddings
    seen_images = {}  # Stores image embeddings
    filtered_frames = []  # Stores selected frames

    for frame in data:
        caption = frame["caption"]
        frame_path = f"{image_folder}/{frame['frame']}"
        frame_number = extract_frame_number(frame["frame"])

        # Compute embeddings
        caption_embedding = get_caption_embedding(caption)
        image_embedding = get_image_embedding(frame_path)

        # Check caption similarity
        found_similar_caption = None
        for seen_caption, seen_caption_embedding in seen_captions.items():
            if cosine_sim(caption_embedding, seen_caption_embedding) > similarity_threshold:
                found_similar_caption = seen_caption
                break

        # Check image similarity
        found_similar_image = None
        for seen_image_path, seen_image_embedding in seen_images.items():
            if cosine_sim(image_embedding, seen_image_embedding) > image_threshold:
                found_similar_image = seen_image_path
                break

        if found_similar_caption or found_similar_image:
            # If a similar frame exists, replace it only if the new frame is more recent
            existing_frame_index = next(
                (i for i, f in enumerate(filtered_frames) if f["caption"] == found_similar_caption or f["frame"] == found_similar_image),
                None
            )

            if existing_frame_index is not None and extract_frame_number(filtered_frames[existing_frame_index]["frame"]) < frame_number:
                filtered_frames[existing_frame_index] = frame  # Replace old frame with newer one
                seen_captions[found_similar_caption] = caption_embedding
                seen_images[found_similar_image] = image_embedding
        else:
            filtered_frames.append(frame)
            seen_captions[caption] = caption_embedding
            seen_images[frame_path] = image_embedding

    # Sort frames by their chronological order
    filtered_frames.sort(key=lambda x: extract_frame_number(x["frame"]))

    # Save the filtered data
    output_file = json_file.replace(".json", "_filtered.json")
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(filtered_frames, f, indent=4, ensure_ascii=False)

    print(f"Filtered JSON saved to: {output_file}")
    return filtered_frames



  return torch.load(checkpoint_file, map_location="cpu")
Downloading .gitattributes: 100%|██████████████████████████████████████████████████| 1.23k/1.23k [00:00<00:00, 154kB/s]
Downloading config.json: 100%|████████████████████████████████████████████████████████| 190/190 [00:00<00:00, 31.7kB/s]
Downloading README.md: 100%|██████████████████████████████████████████████████████| 10.5k/10.5k [00:00<00:00, 1.49MB/s]
Downloading config.json: 100%|█████████████████████████████████████████████████████████| 612/612 [00:00<00:00, 122kB/s]
Downloading (…)ce_transformers.json: 100%|████████████████████████████████████████████| 116/116 [00:00<00:00, 23.7kB/s]
Downloading data_config.json: 100%|███████████████████████████████████████████████| 39.3k/39.3k [00:00<00:00, 3.78MB/s]
Downloading model.safetensors: 100%|██████████████████████████████████████████████| 90.9M/90.9M [00:02<00:00, 32.0MB/s]
Downloading model.onnx: 100%|█████████████████████████████████████████████████████| 90.4M/90.4M [00:02<

In [33]:
if __name__ == "__main__":
    output_data_folder = os.path.join(os.getcwd(), "output_data")

    if not os.path.exists(output_data_folder):
        print(f"No 'output_data' folder found at: {output_data_folder}")
    else:
        processed_files = [
            os.path.join(output_data_folder, f)
            for f in os.listdir(output_data_folder)
            if f.endswith("_processed.json")
        ]

        print(f"Found {len(processed_files)} processed files.")
        for file in processed_files:
            print(f"\n🔍 Filtering: {file}")
            filter_frames(file)


Found 2 processed files.

🔍 Filtering: C:\Users\nehab\Documents\GitHub\MemoraVault\output_data\Ilg3gGewQ5U_processed.json
Filtered JSON saved to: C:\Users\nehab\Documents\GitHub\MemoraVault\output_data\Ilg3gGewQ5U_processed_filtered.json

🔍 Filtering: C:\Users\nehab\Documents\GitHub\MemoraVault\output_data\P87CA4rI__w_processed.json
Filtered JSON saved to: C:\Users\nehab\Documents\GitHub\MemoraVault\output_data\P87CA4rI__w_processed_filtered.json


# RAG Time

 1. Vectorizing all the content being fetched (Data Ingestion)
 2. Storing all the embeddings into a vector database and retrieving it (Retrieval)
 3. Querying the LLM (Generation)

## Part 1: Vectorization of all the content being fetched

In [7]:
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits the input text into smaller chunks.
    Args:
        text: The input text to split.
        chunk_size: Maximum number of words per chunk.
        overlap: Number of overlapping words between consecutive chunks.
    Returns:
        A list of text chunks.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += (chunk_size - overlap)
    return chunks
def process_documents(folder_path: str, chunk_size: int = 500, overlap: int = 50) -> List[Dict[str, Any]]:
    """
    Process all PDF and DOCX files with chunking for large documents.
    Assumes the existence of helper functions:
      - extract_text_from_pdf(pdf_path)
      - extract_text_from_word(doc_path)
    For each file, if the text is longer than chunk_size words, the text is split into chunks.
    Each chunk (or full text, if not too long) is formatted into a content item.
    Returns:
        List of content items.
    """
    extracted_items = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if filename.lower().endswith('.pdf'):
            print(f"Processing PDF: {filename}")
            pdf_text = extract_text_from_pdf(file_path)
            if pdf_text.strip():
                words = pdf_text.split()
                if len(words) > chunk_size:
                    chunks = chunk_text(pdf_text, chunk_size, overlap)
                    for i, chunk in enumerate(chunks):
                        content_item = {
                            "video_id": filename,
                            "frame": f"chunk_{i+1}",
                            "timestamp": f"chunk_{i+1}",
                            "timestamp_seconds": i,
                            "content": chunk
                        }
                        extracted_items.append(content_item)
                else:
                    content_item = {
                        "video_id": filename,
                        "frame": "full_text",
                        "timestamp": "0:00",
                        "timestamp_seconds": 0,
                        "content": pdf_text
                    }
                    extracted_items.append(content_item)
        elif filename.lower().endswith('.docx'):
            print(f"Processing Word Document: {filename}")
            word_text = extract_text_from_word(file_path)
            if word_text.strip():
                words = word_text.split()
                if len(words) > chunk_size:
                    chunks = chunk_text(word_text, chunk_size, overlap)
                    for i, chunk in enumerate(chunks):
                        content_item = {
                            "video_id": filename,
                            "frame": f"chunk_{i+1}",
                            "timestamp": f"chunk_{i+1}",
                            "timestamp_seconds": i,
                            "content": chunk
                        }
                        extracted_items.append(content_item)
                else:
                    content_item = {
                        "video_id": filename,
                        "frame": "full_text",
                        "timestamp": "0:00",
                        "timestamp_seconds": 0,
                        "content": word_text
                    }
                    extracted_items.append(content_item)
    return extracted_items
def load_json_files(json_folder: str) -> List[Dict[str, Any]]:
    """
    Load all JSON files from a folder and extract their content.
    Handles both:
      - Document extraction JSON (list of dicts with "video_id" and "content")
      - YouTube-formatted JSON (with frames that may contain "caption", "extracted_text", or "label")
    Args:
        json_folder: Path to folder containing JSON files
    Returns:
        List of dictionaries with extracted content.
    """
    json_files = glob.glob(os.path.join(json_folder, "*.json"))
    all_content = []
    for json_file in json_files:
        file_basename = os.path.basename(json_file)
        print(f"Processing: {file_basename}")
        try:
            with open(json_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
            # Check if the file is in document extraction format
            if isinstance(data, list):
                if len(data) > 0 and all(isinstance(item, dict) and "video_id" in item and "content" in item for item in data):
                    print(f"  Detected document extraction JSON format")
                    all_content.extend(data)
                    print(f"  Added {len(data)} content items from {file_basename}")
                else:
                    # Process as YouTube formatted JSON
                    file_content = []
                    for i, frame in enumerate(data):
                        frame_name = frame.get("frame", f"frame_{i}")
                        text_content = []
                        if frame.get("caption"):
                            text_content.append(f"Caption: {frame['caption']}")
                        if frame.get("extracted_text"):
                            text_content.append(f"Text: {frame['extracted_text']}")
                        if frame.get("label"):
                            text_content.append(f"Visual: {frame['label']}")
                        if text_content:
                            try:
                                # Try to extract a numeric time from the frame name
                                time_str = frame_name.replace("frame_", "").replace(".jpg", "")
                                frame_time = int(time_str)
                            except:
                                frame_time = i * 5
                            time_min = frame_time // 60
                            time_sec = frame_time % 60
                            time_formatted = f"{time_min}:{time_sec:02d}"
                            content_item = {
                                "video_id": file_basename.replace("_processed.json", ""),
                                "frame": frame_name,
                                "timestamp": time_formatted,
                                "timestamp_seconds": frame_time,
                                "content": " ".join(text_content)
                            }
                            file_content.append(content_item)
                    all_content.extend(file_content)
                    print(f"  Added {len(file_content)} content items from {file_basename}")
            else:
                print(f"  Skipping {file_basename}: Unrecognized format")
        except Exception as e:
            print(f"Error processing {file_basename}: {e}")
    print(f"Total content items extracted: {len(all_content)}")
    return all_content

### Conversion of .json files into embeddings 

We are using the all-miniLM-l6-v2 model for vectorizing the json files into embeddings. 

The hugging face link: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [8]:
def vectorize_content(content_items: List[Dict[str, Any]],
                      model_name: str = "all-MiniLM-L6-v2",
                      chunk_size: int = 500,
                      overlap: int = 50) -> Tuple[np.ndarray, List[Dict[str, Any]]]:
    """
    Convert content items to vector embeddings. For long texts, perform chunking.
    
    Args:
        content_items: List of content items with text.
        model_name: Name of the SentenceTransformer model to use.
        chunk_size: Maximum number of words in each chunk.
        overlap: Number of overlapping words between chunks.
        
    Returns:
        Tuple of (embeddings array, updated content items with chunk info).
    """
    # Load the embedding model
    model = SentenceTransformer(model_name)
    print(f"Loaded embedding model: {model_name}")
    
    chunked_items = []
    for item in content_items:
        text = item["content"]
        # Chuking
        if len(text.split()) > chunk_size:
            chunks = chunk_text(text, chunk_size=chunk_size, overlap=overlap)
            for i, chunk in enumerate(chunks):
                new_item = item.copy()
                new_item["content"] = chunk
                new_item["chunk_index"] = i
                chunked_items.append(new_item)
        else:
            chunked_items.append(item)
    
    texts = [item["content"] for item in chunked_items]
    print(f"Generating embeddings for {len(texts)} chunks/items...")
    embeddings = model.encode(texts, show_progress_bar=True)
    
    return embeddings, chunked_items

def save_vectors(embeddings: np.ndarray, content_items: List[Dict[str, Any]], output_folder: str):
    """
    Save the vector embeddings and content items.
    
    Args:
        embeddings: NumPy array of embeddings.
        content_items: List of content items.
        output_folder: Folder to save the files.
    """
    os.makedirs(output_folder, exist_ok=True)
    
    embeddings_path = os.path.join(output_folder, "embeddings.npy")
    np.save(embeddings_path, embeddings)
    
    content_path = os.path.join(output_folder, "content_items.pkl")
    with open(content_path, 'wb') as f:
        pickle.dump(content_items, f)
    
    print(f"Saved embeddings to {embeddings_path}")
    print(f"Saved content items to {content_path}")

### Main Function call for chunking, vectorizing data and storing it

In [9]:
def vec_main():
    # Locate current file path (base folder where script is running)
    base_folder = os.getcwd()

    # Folder with processed JSON files
    json_folder = os.path.join(base_folder, "output_data")

    # Create a new folder for vector storage
    output_folder = os.path.join(base_folder, "vectorized_data")
    os.makedirs(output_folder, exist_ok=True)
    
    content_items = load_json_files(json_folder)
    print(f"Extracted {len(content_items)} total content items from all JSON files")
    
    embeddings, content_items = vectorize_content(content_items)
    print(f"Generated embeddings with shape: {embeddings.shape}")
    
    save_vectors(embeddings, content_items, output_folder)
    
    print("Vectorization complete!")


In [10]:
if __name__ == "__main__":
    vec_main()

Processing: AML Quiz 1 Study Guide_transcript.json
  Skipping AML Quiz 1 Study Guide_transcript.json: Unrecognized format
Processing: extracted_text.json
  Detected document extraction JSON format
  Added 4 content items from extracted_text.json
Processing: Ilg3gGewQ5U_processed.json
  Added 154 content items from Ilg3gGewQ5U_processed.json
Processing: Ilg3gGewQ5U_processed_filtered.json
  Added 28 content items from Ilg3gGewQ5U_processed_filtered.json
Processing: Ilg3gGewQ5U_transcript.json
  Skipping Ilg3gGewQ5U_transcript.json: Unrecognized format
Processing: Know the Bear Facts_transcript.json
  Skipping Know the Bear Facts_transcript.json: Unrecognized format
Total content items extracted: 186
Extracted 186 total content items from all JSON files
Loaded embedding model: all-MiniLM-L6-v2
Generating embeddings for 200 chunks/items...


Batches: 100%|██████████| 7/7 [00:00<00:00, 30.14it/s]

Generated embeddings with shape: (200, 384)
Saved embeddings to c:\Users\ethan\Documents\GitHub\MemoraVault\vectorized_data\embeddings.npy
Saved content items to c:\Users\ethan\Documents\GitHub\MemoraVault\vectorized_data\content_items.pkl
Vectorization complete!





## Part 2: Storing all the embeddings into a vector database

We will use the FAISS for searching embeddings

Link for FAISS : https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

In [11]:
def load_vectorized_data(embeddings_path: str, content_path: str):
    """
    Load the saved embeddings (NumPy array) and content items (pickle file).
    """
    embeddings = np.load(embeddings_path)
    with open(content_path, 'rb') as f:
        content_items = pickle.load(f)
    return embeddings, content_items

def create_faiss_index(embeddings: np.ndarray) -> faiss.Index:
    """
    Create a FAISS index using cosine similarity (normalize + inner product).
    """
    embeddings = embeddings.astype("float32")
    
    # Normalizing vectors for cosine similarity
    faiss.normalize_L2(embeddings)
    
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)  # Inner product for cosine similarity
    index.add(embeddings)
    print(f"FAISS index has {index.ntotal} vectors.")
    return index

def search_index(query: str, model: SentenceTransformer, index: faiss.Index, content_items: list, k: int = 5):
    """
    Convert the query to an embedding, search the FAISS index for the top-k nearest neighbors,
    and return the corresponding content items with their scores.
    """
    query_embedding = model.encode(query, show_progress_bar=False)
    query_embedding = np.array([query_embedding]).astype("float32")
    
    
    faiss.normalize_L2(query_embedding)
    
    scores, indices = index.search(query_embedding, k)
    
    results = []
    for score, idx in zip(scores[0], indices[0]):
        results.append((score, content_items[idx]))
    return results


## Change below accordingly 

embeddings_path = "vectorized_data/embeddings.npy"
content_path = "vectorized_data/content_items.pkl"

# Load vectorized data
embeddings, content_items = load_vectorized_data(embeddings_path, content_path)
print(f"Loaded embeddings with shape: {embeddings.shape}")
print(f"Loaded {len(content_items)} content items.")

# Create the FAISS index
index = create_faiss_index(embeddings)

# Load the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
print("Setup complete!")

Loaded embeddings with shape: (200, 384)
Loaded 200 content items.
FAISS index has 200 vectors.
Setup complete!


### Testing of the Index

In [13]:
# Define your query text
query = "What is confounder bias?"

results = search_index(query, model, index, content_items, k=3)

for rank, (dist, item) in enumerate(results, start=1):
    print(f"Rank {rank}:")
    print(f"  Distance: {dist:.4f}")
    print(f"  Video ID: {item['video_id']}")
    print(f"  Frame: {item['frame']}")
    print(f"  Timestamp: {item['timestamp']}")
    print(f"  Content: {item['content']}")
    print("")

Rank 1:
  Distance: 0.2837
  Video ID: Ilg3gGewQ5U_transcript.docx
  Frame: full_text
  Timestamp: 0:00
  Content: a bias, which is all then plugged into something like the sigmoid squishification function, or a ReLU. So there are three different avenues that can team up together to help increase that activation. You can increase the bias, you can increase the weights, and you can change the activations from the previous layer. Focusing on how the weights should be adjusted, notice how the weights actually have differing levels of influence. The connections with the brightest neurons from the preceding layer have the biggest effect since those weights are multiplied by larger activation values. So if you were to increase one of those weights, it actually has a stronger influence on the ultimate cost function than increasing the weights of connections with dimmer neurons, at least as far as this one training example is concerned. Remember, when we talk about gradient descent, we don't j

# Using a LLM To Implement the RAG

## Using Gemini 

In [14]:
import google.generativeai as genai
from getpass import getpass 

In [15]:
def generate_answer(query: str, retrieved_context: list = None) -> str:
    """
    Generate an answer using the retrieved context and Gemini model.
    
    Args:
        query: The user's query.
        retrieved_context: Optional list of tuples (score, content_item).
        
    Returns:
        The generated answer as a string.
    """
    context_text = ""
    if retrieved_context:
        context_text = "\n\n".join([f"{item['content']}" for score, item in retrieved_context])

    prompt = (
        "You are an expert tutor. Use the context provided below to answer the following question.\n\n"
        f"Context:\n{context_text}\n\n"
        "Question:\n"
        f"{query}\n\n"
        "Answer:"
    )

    response = model.generate_content(prompt)
    return response.text.strip()


**Use the Following Link to Create a Free Gemini API Key:** 
https://ai.google.dev/gemini-api/docs/api-key

In [16]:
api_key = getpass("Enter your Gemini API key: ")
genai.configure(api_key = api_key)
model = genai.GenerativeModel(model_name="gemini-1.5-flash")
encoder_model = SentenceTransformer("all-MiniLM-L6-v2")

In [17]:
query = "What is confounder bias?"
results = search_index(query, encoder_model, index, content_items, k=3)  # This comes from the FAISS Search
answer = generate_answer(query, results)
print("Generated Answer:\n", answer)

Generated Answer:
 Confounder bias occurs when a confounding variable is correlated with the independent variable (possibly causally) and is also causally related to the dependent variable.  This leads to biased results, preventing the observation of the true relationship between the independent and dependent variables.  The example provided uses ice cream consumption and sunburns, where temperature is the confounding variable affecting both.


In [18]:
query = "What is confounder bias?"
answer = generate_answer(query)
print("Generated Answer:\n", answer)

Generated Answer:
 Based on the provided context, which is empty, I cannot answer the question about confounder bias.  To answer this question accurately, I need information about confounding variables and their effect on the interpretation of data.  Confounder bias, also known as confounding, occurs when a third variable influences both the exposure and the outcome, thereby distorting the association between them.  Without further information, a complete and accurate definition cannot be given.
