## The following RAG notebook covers the following: 
### 1. Vectorizing all the content being fetched (Data Ingestion)
### 2. Storing all the embeddings into a vector database and retrieving it (Retrieval)
### 3. Querying the LLM (Generation)

## Part 1: Vectorization of all the content being fetched

### Loading and preparing the JSON Data (Data Ingestion)

Install the below dependencies first

In [1]:
#pip install sentence-transformers 

In [2]:
#!pip install tf-keras

In [1]:
import os
import json
import glob
import numpy as np
from sentence_transformers import SentenceTransformer
import pickle
from typing import List, Dict, Any, Tuple

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits the input text into smaller chunks.
    
    Args:
        text: The input text to split.
        chunk_size: Maximum number of words per chunk.
        overlap: Number of overlapping words between consecutive chunks.
        
    Returns:
        A list of text chunks.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += (chunk_size - overlap)
    return chunks

def load_json_files(json_folder: str) -> List[Dict[str, Any]]:
    """
    Load all JSON files from a folder and extract their content.
    
    Args:
        json_folder: Path to folder containing JSON files
        
    Returns:
        List of dictionaries with extracted content.
    """
    json_files = glob.glob(os.path.join(json_folder, "*.json"))
    all_content = []
    
    for json_file in json_files:
        video_id = os.path.basename(json_file).replace("_processed.json", "")
        print(f"Processing: {json_file}")
        
        try:
            with open(json_file, 'r', encoding='utf-8') as f:
                data = json.load(f)
            
            # Process each frame in the JSON
            for i, frame in enumerate(data):
                frame_name = frame.get("frame", f"frame_{i}")
                text_content = []
                
                if frame.get("caption"):
                    text_content.append(f"Caption: {frame['caption']}")
                if frame.get("extracted_text"):
                    text_content.append(f"Text: {frame['extracted_text']}")
                if frame.get("label"):
                    text_content.append(f"Visual: {frame['label']}")
                
                if text_content:
                    frame_time = 0
                    try:
                        time_str = frame_name.replace("frame_", "").replace(".jpg", "")
                        frame_time = int(time_str)
                    except:
                        frame_time = i * 5
                    
                    time_min = frame_time // 60
                    time_sec = frame_time % 60
                    time_str = f"{time_min}:{time_sec:02d}"
                    
                    content_item = {
                        "video_id": video_id,
                        "frame": frame_name,
                        "timestamp": time_str,
                        "timestamp_seconds": frame_time,
                        "content": " ".join(text_content)
                    }
                    
                    all_content.append(content_item)
            
            print(f"  Extracted {len(all_content)} content items so far")
            
        except Exception as e:
            print(f"Error processing {json_file}: {e}")
    
    return all_content


  from .autonotebook import tqdm as notebook_tqdm


### Conversion of .json files into embeddings 

We are using the all-miniLM-l6-v2 model for vectorizing the json files into embeddings. 

The hugging face link: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [2]:
def vectorize_content(content_items: List[Dict[str, Any]],
                      model_name: str = "all-MiniLM-L6-v2",
                      chunk_size: int = 500,
                      overlap: int = 50) -> Tuple[np.ndarray, List[Dict[str, Any]]]:
    """
    Convert content items to vector embeddings. For long texts, perform chunking.
    
    Args:
        content_items: List of content items with text.
        model_name: Name of the SentenceTransformer model to use.
        chunk_size: Maximum number of words in each chunk.
        overlap: Number of overlapping words between chunks.
        
    Returns:
        Tuple of (embeddings array, updated content items with chunk info).
    """
    # Load the embedding model
    model = SentenceTransformer(model_name)
    print(f"Loaded embedding model: {model_name}")
    
    chunked_items = []
    for item in content_items:
        text = item["content"]
        # Chuking
        if len(text.split()) > chunk_size:
            chunks = chunk_text(text, chunk_size=chunk_size, overlap=overlap)
            for i, chunk in enumerate(chunks):
                new_item = item.copy()
                new_item["content"] = chunk
                new_item["chunk_index"] = i
                chunked_items.append(new_item)
        else:
            chunked_items.append(item)
    
    texts = [item["content"] for item in chunked_items]
    print(f"Generating embeddings for {len(texts)} chunks/items...")
    embeddings = model.encode(texts, show_progress_bar=True)
    
    return embeddings, chunked_items

def save_vectors(embeddings: np.ndarray, content_items: List[Dict[str, Any]], output_folder: str):
    """
    Save the vector embeddings and content items.
    
    Args:
        embeddings: NumPy array of embeddings.
        content_items: List of content items.
        output_folder: Folder to save the files.
    """
    os.makedirs(output_folder, exist_ok=True)
    
    embeddings_path = os.path.join(output_folder, "embeddings.npy")
    np.save(embeddings_path, embeddings)
    
    content_path = os.path.join(output_folder, "content_items.pkl")
    with open(content_path, 'wb') as f:
        pickle.dump(content_items, f)
    
    print(f"Saved embeddings to {embeddings_path}")
    print(f"Saved content items to {content_path}")

### Main Function call for chunking, vectorizing data and storing it

In [3]:
def main():
    # Configuration folders
    json_folder = '/Users/advaith/Desktop/MSBA Related coursework/Spring term/Deep Learning/Final Project/Data to be considered'
    output_folder = '/Users/advaith/Desktop/MSBA Related coursework/Spring term/Deep Learning/Final Project/vectorized_data' 
    
    content_items = load_json_files(json_folder)
    print(f"Extracted {len(content_items)} total content items from all JSON files")
    
    embeddings, content_items = vectorize_content(content_items)
    print(f"Generated embeddings with shape: {embeddings.shape}")
    
    save_vectors(embeddings, content_items, output_folder)
    
    print("Vectorization complete!")

if __name__ == "__main__":
    main()

Processing: /Users/advaith/Desktop/MSBA Related coursework/Spring term/Deep Learning/Final Project/Data to be considered/Ilg3gGewQ5U_processed_filtered.json
  Extracted 38 content items so far
Extracted 38 total content items from all JSON files
Loaded embedding model: all-MiniLM-L6-v2
Generating embeddings for 38 chunks/items...


Batches: 100%|██████████| 2/2 [00:00<00:00,  5.70it/s]

Generated embeddings with shape: (38, 384)
Saved embeddings to /Users/advaith/Desktop/MSBA Related coursework/Spring term/Deep Learning/Final Project/vectorized_data/embeddings.npy
Saved content items to /Users/advaith/Desktop/MSBA Related coursework/Spring term/Deep Learning/Final Project/vectorized_data/content_items.pkl
Vectorization complete!





<br>

## Part 2: Storing all the embeddings into a vector database

We will use the FAISS for searching embeddings

Link for FAISS : https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

In [7]:
#pip install faiss-cpu

In [4]:
import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import pickle
import glob
from typing import List, Dict, Any, Tuple

def load_vectorized_data(embeddings_path: str, content_path: str):
    """
    Load the saved embeddings (NumPy array) and content items (pickle file).
    """
    embeddings = np.load(embeddings_path)
    with open(content_path, 'rb') as f:
        content_items = pickle.load(f)
    return embeddings, content_items

def create_faiss_index(embeddings: np.ndarray) -> faiss.Index:
    """
    Create a FAISS index using cosine similarity (normalize + inner product).
    """
    embeddings = embeddings.astype("float32")
    
    # Normalizing vectors for cosine similarity
    faiss.normalize_L2(embeddings)
    
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)  # Inner product for cosine similarity
    index.add(embeddings)
    print(f"FAISS index has {index.ntotal} vectors.")
    return index

def search_index(query: str, model: SentenceTransformer, index: faiss.Index, content_items: list, k: int = 5):
    """
    Convert the query to an embedding, search the FAISS index for the top-k nearest neighbors,
    and return the corresponding content items with their scores.
    """
    query_embedding = model.encode(query, show_progress_bar=False)
    query_embedding = np.array([query_embedding]).astype("float32")
    
    
    faiss.normalize_L2(query_embedding)
    
    scores, indices = index.search(query_embedding, k)
    
    results = []
    for score, idx in zip(scores[0], indices[0]):
        results.append((score, content_items[idx]))
    return results


## Change below accordingly 

embeddings_path = "vectorized_data/embeddings.npy"
content_path = "vectorized_data/content_items.pkl"

# Load vectorized data
embeddings, content_items = load_vectorized_data(embeddings_path, content_path)
print(f"Loaded embeddings with shape: {embeddings.shape}")
print(f"Loaded {len(content_items)} content items.")

# Create the FAISS index
index = create_faiss_index(embeddings)

# Load the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
print("Setup complete!")

Loaded embeddings with shape: (38, 384)
Loaded 38 content items.
FAISS index has 38 vectors.
Setup complete!


### Testing of the Index

In [5]:
# Define your query text
query = "How does backpropagation work?"

results = search_index(query, model, index, content_items, k=5)

for rank, (dist, item) in enumerate(results, start=1):
    print(f"Rank {rank}:")
    print(f"  Distance: {dist:.4f}")
    print(f"  Video ID: {item['video_id']}")
    print(f"  Frame: {item['frame']}")
    print(f"  Timestamp: {item['timestamp']}")
    print(f"  Content: {item['content']}")
    print("")

Rank 1:
  Distance: 0.4931
  Video ID: Ilg3gGewQ5U_processed_filtered.json
  Frame: frame_700.jpg
  Timestamp: 11:40
  Content: Caption: the codel's book Text: def backprop(self, x, y):

"""Return a tuple **(nabla_b, nabla_w)** representing the

gradient for the cost function C_x. ~*‘nabla_b** and

*“nabla_w** are layer-by-layer lists of numpy arrays, similar

to ‘‘self.biases** and ‘*self.weights**."""

Nabla_b = [np.zeros(b.shape) for b in self.biases]

nabla_w = [np.zeros(w.shape) for w in self.weights]

# feedforward

activation = x

activations = [x] # list to store all the activations, layer by layer

zs = [] # list to store all the z vectors, layer by layer

for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation)+b
zs.append(z)
activation = self.non_linearity(z)
activations. append(activation)

# backward pass

delta = self.cost_derivative(activations[-1], y) * \
self.d_non_linearity(zs[-1])

nabla_b[-1] = delta

nabla_w[-1] = np.dot(delta, activations [-2].transpo

<br>

## Using a LLM To Implement the RAG

### Using Gemini 

In [3]:
#pip install google-generativeai

In [6]:
def generate_answer(query: str, retrieved_context: list) -> str:
    """
    Generate an answer using the retrieved context and Gemini model.
    
    Args:
        query: The user's query.
        retrieved_context: List of tuples (score, content_item) from retrieval.
        
    Returns:
        The generated answer as a string.
    """
    # Combine retrieved content into a single context string
    context_text = "\n\n".join([f"{item['content']}" for score, item in retrieved_context])
    
    # Construct the prompt by including the retrieved context and the user query
    prompt = (
        "You are an expert tutor. Use the context provided below to answer the following question.\n\n"
        "Context:\n"
        f"{context_text}\n\n"
        "Question:\n"
        f"{query}\n\n"
        "Answer:"
    )
    response = model.generate_content(prompt)
    
    # Extract the answer from the response
    answer = response.text.strip()
    return answer



In [7]:
import google.generativeai as genai
from getpass import getpass 
api_key = getpass("Enter your Gemini API key: ")
genai.configure(api_key = api_key)
model = genai.GenerativeModel(model_name="gemini-1.5-flash")
encoder_model = SentenceTransformer("all-MiniLM-L6-v2")

Enter your Gemini API key:  ········


In [8]:
query = "How does backpropagation work?"
results = search_index(query, encoder_model, index, content_items, k=5)  # This comes from the FAISS Search
answer = generate_answer(query, results)
print("Generated Answer:\n", answer)

Generated Answer:
 Backpropagation calculates the gradient of the cost function with respect to the network's weights and biases.  The process begins with a feedforward pass, computing the activations (outputs) of each layer and storing them.  Then, a backward pass commences.  It starts at the output layer, calculating the error (delta) using the cost derivative and the derivative of the activation function. This error is then propagated back layer by layer.  For each layer, the gradient of the biases is simply the error (delta) at that layer.  The gradient of the weights is the product of the error (delta) and the previous layer's activations, appropriately transposed. This process iteratively uses the weights from the next layer and the derivative of the activation function to compute the error signal for the previous layer.  The algorithm leverages the chain rule of calculus to efficiently compute these gradients.  The final output is a pair of layer-by-layer lists: `nabla_b` (gradi