# This section:

# *1. Embeds each transcript chunk using sentence-transformers,*

# *2. Stores the vectors in a Qdrant collection for fast similarity retrieval,*

# *3. Saves the chunk metadata, both as the Qdrant payload and as a pickle file, for use in querying or RAG-style applications.*

#### This section takes the text chunks created earlier and transforms them into numerical embeddings (fixed-length vectors) that can be used to compare meaning between chunks, for example to find which part of a transcript best matches a user's question.

#### These embeddings are stored in a Qdrant collection configured for cosine distance, a vector search structure that lets you quickly find the most similar chunks of text. This is the backbone for tasks like semantic search, where you want to retrieve the most relevant part of a video transcript given a natural language query.
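To make the retrieval idea concrete, here is a minimal sketch of a similarity search done by brute force in plain Python. The three-dimensional vectors are made up for illustration and stand in for real sentence embeddings; the helper names `cosine_similarity` and `top_k` are hypothetical and not part of this notebook. A vector database gives the same kind of ranking without scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up 3-dimensional vectors standing in for real sentence embeddings.
toy_chunks = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_chunks, k=2)
for index, score in results:
    print(f"chunk {index}: similarity {score:.3f}")
```

Chunks whose vectors point in nearly the same direction as the query score close to 1, which is why cosine distance is a common choice for sentence embeddings.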

#### It also saves the metadata (which video each chunk belongs to, when it starts, and which topic it was assigned) so that once a matching chunk is found, it can be tied back to its original video and timestamp.
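As a small illustration of tying a match back to its source, the sketch below turns a chunk's metadata into a link that starts playback at the chunk's start time. It assumes the videos are hosted on YouTube and that `start` is in seconds; neither is stated in this notebook, and `chunk_link` is a hypothetical helper.

```python
def chunk_link(payload):
    """Build a watch URL that starts playback at the chunk's start second."""
    video_id = payload["video_id"]
    start_seconds = int(payload["start"])
    # Assumption: YouTube-style URL with a "t" parameter in seconds.
    return f"https://www.youtube.com/watch?v={video_id}&t={start_seconds}s"

# Hypothetical payload with the same fields as the chunk metadata.
example_payload = {"video_id": "-Hv6OPTlUZU", "chunk_id": 3, "start": 5}
print(chunk_link(example_payload))
```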

### Import required libraries

In [1]:
# --- Part 3: Chunk Embeddings + Upload to Qdrant ---
import pandas as pd
import numpy as np
import uuid
import pickle
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance




In [2]:
# --- Step 0: Load processed chunks ---
chunk_df = pd.read_csv("VideoProj_chunks.csv")  # Contains 'text', 'video_id', 'chunk_id', 'start'

# --- Step 1: Apply BERTopic for topic modeling ---
print("Running BERTopic on transcript chunks...")
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(chunk_df["text"])
chunk_df["topic"] = topics
print(f"Topics assigned: {len(set(topics))} unique topics")

# --- Step 2: Load sentence embedding model ---
print("Loading sentence transformer...")
model = SentenceTransformer('all-MiniLM-L6-v2')

# --- Step 3: Generate embeddings ---
print("📐 Encoding transcript chunks into vectors...")
embeddings = model.encode(chunk_df["text"].tolist(), show_progress_bar=True)
embeddings = np.array(embeddings).astype('float32')  # Ensure a float32 NumPy array before upload

# --- Step 4: Connect to Qdrant ---
print("Connecting to Qdrant and storing vectors...")
qdrant = QdrantClient(host="localhost", port=6333)

# --- Create collection if not already created ---
if not qdrant.collection_exists("video_chunks"):
    qdrant.create_collection(
        collection_name="video_chunks",
        vectors_config=VectorParams(size=embeddings.shape[1], distance=Distance.COSINE)
    )
else:
    print("Collection 'video_chunks' already exists.")

# --- Sanitize payload to be JSON-safe ---
chunk_df = chunk_df.fillna("")  # Replace NaNs with empty strings

def sanitize_payload(row):
    return {
        "video_id": str(row["video_id"]),
        "chunk_id": int(row["chunk_id"]),
        "text": str(row["text"]),
        "start": int(row["start"]),
        "topic": int(row["topic"])
    }

# --- Build all points ---
points = [
    PointStruct(
        id=str(uuid.uuid4()),
        vector=vec.tolist(),
        payload=sanitize_payload(row)
    )
    for vec, (_, row) in zip(embeddings, chunk_df.iterrows())
]

# --- Upload in batches to avoid size limits ---
print("Uploading to Qdrant in batches...")
batch_size = 1000
total_batches = (len(points) + batch_size - 1) // batch_size  # Ceiling division
for i in range(0, len(points), batch_size):
    batch = points[i:i + batch_size]
    qdrant.upsert(collection_name="video_chunks", points=batch)
    print(f"Uploaded batch {i // batch_size + 1} of {total_batches}")

Running BERTopic on transcript chunks...
Topics assigned: 692 unique topics
Loading sentence transformer...
📐 Encoding transcript chunks into vectors...


Batches:   0%|          | 0/1105 [00:00<?, ?it/s]

Connecting to Qdrant and storing vectors...
Collection 'video_chunks' already exists.
Uploading to Qdrant in batches...
Uploaded batch 1 of 36
Uploaded batch 2 of 36
Uploaded batch 3 of 36
Uploaded batch 4 of 36
Uploaded batch 5 of 36
Uploaded batch 6 of 36
Uploaded batch 7 of 36
Uploaded batch 8 of 36
Uploaded batch 9 of 36
Uploaded batch 10 of 36
Uploaded batch 11 of 36
Uploaded batch 12 of 36
Uploaded batch 13 of 36
Uploaded batch 14 of 36
Uploaded batch 15 of 36
Uploaded batch 16 of 36
Uploaded batch 17 of 36
Uploaded batch 18 of 36
Uploaded batch 19 of 36
Uploaded batch 20 of 36
Uploaded batch 21 of 36
Uploaded batch 22 of 36
Uploaded batch 23 of 36
Uploaded batch 24 of 36
Uploaded batch 25 of 36
Uploaded batch 26 of 36
Uploaded batch 27 of 36
Uploaded batch 28 of 36
Uploaded batch 29 of 36
Uploaded batch 30 of 36
Uploaded batch 31 of 36
Uploaded batch 32 of 36
Uploaded batch 33 of 36
Uploaded batch 34 of 36
Uploaded batch 35 of 36
Uploaded batch 36 of 36
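The output above shows that the collection already existed when this cell ran. Because each point gets a fresh random `uuid.uuid4()` ID, re-running the upload against an existing collection adds new points rather than overwriting the earlier ones. One possible way around this, sketched below, is to derive the ID deterministically from `video_id` and `chunk_id`. The namespace value and the `stable_point_id` helper are assumptions for illustration and are not part of this notebook.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "video_chunks")

def stable_point_id(video_id, chunk_id):
    """Derive the same UUID string every time for a given (video_id, chunk_id)."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{video_id}:{chunk_id}"))

first = stable_point_id("-Hv6OPTlUZU", 0)
again = stable_point_id("-Hv6OPTlUZU", 0)
other = stable_point_id("-Hv6OPTlUZU", 1)
print(first == again)  # same chunk, same ID
print(first == other)  # different chunk, different ID
```

With IDs like these in place of `str(uuid.uuid4())`, a repeated upsert of the same chunk would replace the existing point instead of creating a duplicate.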


### Save metadata and preview

In [3]:
# --- Step 5: Save metadata ---
with open("VideoProj_metadata.pkl", "wb") as f:
    pickle.dump(chunk_df, f)
print("Metadata saved to: VideoProj_metadata.pkl")

# --- Step 6: Preview ---
print("\nPreview of stored data:")
print(chunk_df[['video_id', 'chunk_id', 'start', 'topic']].head())

Metadata saved to: VideoProj_metadata.pkl

Preview of stored data:
      video_id  chunk_id  start  topic
0  -Hv6OPTlUZU         0      0     40
1  -Hv6OPTlUZU         1      2     83
2  -Hv6OPTlUZU         2      4     18
3  -Hv6OPTlUZU         3      5    621
4  -Hv6OPTlUZU         4      7     16
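The `topic` column in the preview can be summarised to see how chunks are spread across topics. BERTopic labels documents it cannot place in any cluster with topic `-1`, so a count of unique topics may include that outlier label. The sketch below counts chunks per topic while skipping it. The topic list is made up for illustration, and `topic_sizes` is a hypothetical helper.

```python
from collections import Counter

def topic_sizes(topics, outlier_label=-1):
    """Count chunks per topic, leaving out the outlier label."""
    return Counter(t for t in topics if t != outlier_label)

# Made-up topic assignments; -1 marks chunks treated as outliers.
toy_topics = [40, 83, 18, -1, 40, 16, -1, 40]
sizes = topic_sizes(toy_topics)
print(sizes.most_common(2))
```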
