# Multimodal RAG with Video Data using KDB.AI and TwelveLabs

##### Note: This example requires KDB.AI Server and a Twelve Labs API key. Sign up for a free KDB.AI Server trial [here](https://trykdb.kx.com/kdbaiserver/signup/) and create a Twelve Labs account if needed.

TwelveLabs builds multimodal models with a specific focus on video. We can use these models to generate multimodal embeddings and to chat with the sections of our video that we retrieve. Here we build a multimodal RAG pipeline using KDB.AI and TwelveLabs.

This notebook demonstrates building a multimodal Retrieval-Augmented Generation (RAG) system capable of answering questions about video content. It leverages:

* **pytubefix & moviepy:** To download and process video.
* **Twelve Labs:** For video understanding, indexing, and search capabilities.
* **KDB.AI:** As the vector database to store and search video segment embeddings efficiently.

### Agenda:
0. Setup
1. Download Data and Initialize Clients
2. Set Up KDB.AI Vector Database Table
3. Load Data and Search
4. Analyzing Search Results
5. Cleanup

Relevant Links:
* [KDB.AI](https://kdb.ai/)
* [Twelve Labs](https://twelvelabs.io/)

## 0. Setup

Install the required packages, import libraries, and set up API credentials.

### Install Required Dependencies
First we will install the necessary Python packages including kdbai-client, pytubefix for YouTube downloads, moviepy for video processing, and twelvelabs for video understanding and search capabilities. These packages will enable us to download videos, process them, and build our multimodal RAG system.

In [None]:
!pip install kdbai-client pytubefix twelvelabs moviepy==2.1.2

### Import Core Libraries

In [43]:
from moviepy import VideoFileClip
from pytubefix import YouTube
from pytubefix.cli import on_progress
import pandas as pd

import kdbai_client as kdbai
from kdbai_client import Session
from twelvelabs import TwelveLabs

### Set Up API Keys and KDB.AI Connection Details
Now we need to set up our API keys for the various services we'll be using.
For each key, we'll first check if it's already set as an environment variable.
If not, we'll prompt the user to enter it securely.

In [44]:
import os
import getpass

os.environ["TWELVE_LABS_API_KEY"] = (
    os.getenv("TWELVE_LABS_API_KEY")
    or getpass.getpass("TWELVE_LABS_API_KEY: ")
)

os.environ["KDBAI_API_KEY"] = (
    os.getenv("KDBAI_API_KEY")
    or getpass.getpass("KDB.AI API Key: ")
)

os.environ["KDBAI_ENDPOINT"] = (
    os.getenv("KDBAI_ENDPOINT")
    or input("KDB.AI Endpoint URL: ")
)
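
The three assignments above repeat one pattern: read an environment variable, and prompt only if it is missing. A minimal sketch of that pattern as a reusable function follows. The name `get_secret` and its parameters are hypothetical and not part of either SDK.

```python
import getpass
import os
from typing import Optional


def get_secret(name: str, prompt: Optional[str] = None, hidden: bool = True) -> str:
    """Return the value of environment variable `name`, prompting if it is unset."""
    value = os.getenv(name)
    if not value:
        # getpass hides the typed value; input() echoes it (fine for a URL)
        ask = getpass.getpass if hidden else input
        value = ask(f"{prompt or name}: ")
    if not value:
        raise ValueError(f"{name} is required but was not provided")
    # Store the value so later cells can keep using os.getenv(name)
    os.environ[name] = value
    return value


# Example calls (each prompts only when the variable is not already set):
# get_secret("TWELVE_LABS_API_KEY")
# get_secret("KDBAI_ENDPOINT", prompt="KDB.AI Endpoint URL", hidden=False)
```

Raising on an empty value is a design choice of this sketch: it surfaces a missing credential here rather than as an authentication error in a later cell.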

## 1. Download Data and Initialize Clients

Next we will download our YouTube video and initialize KDB.AI and TwelveLabs clients.

### Download Video

Now we download a YouTube video for analysis. We specify the video URL and local save path, create the destination directory, and use pytubefix to download the highest-resolution stream returned by `get_highest_resolution()`.

In [46]:
video_url = "https://www.youtube.com/watch?v=d_qvLDhkg00"  # Example: Central Limit Theorem video
output_vid = "./content/video_data/input_vid.mp4"
output_dir = os.path.dirname(output_vid)

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

print(f"Downloading video from {video_url} to {output_vid}...")

yt = YouTube(video_url, on_progress_callback=on_progress)
stream = yt.streams.get_highest_resolution()

if stream:
    stream.download(output_path=output_dir, filename=os.path.basename(output_vid))
    print(f"\nVideo downloaded successfully to {output_vid}")
else:
    print("Error: Could not find a suitable video stream.")

Downloading video from https://www.youtube.com/watch?v=d_qvLDhkg00 to ./content/video_data/input_vid.mp4...

Video downloaded successfully to ./content/video_data/input_vid.mp4


### Initialize TwelveLabs/Create Task

Let's initialize the TwelveLabs client, create a video embedding task with the Marengo-retrieval-2.7 model, and wait for the task to finish.

In [None]:
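# Create the client and submit the downloaded video for embedding
tl = TwelveLabs(api_key=os.getenv('TWELVE_LABS_API_KEY'))
task = tl.embed.task.create(
    model_name="Marengo-retrieval-2.7",
    video_file=output_vid
)
# Block until the embedding task finishes; the final status is shown as the cell output
task.wait_for_done()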

'ready'

We can now retrieve the embeddings for each clip.

In [48]:
segs = task.retrieve(embedding_option=["visual-text"]).video_embedding.segments
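
Before storing the segments, it can help to sanity-check what came back. The sketch below is illustrative only: `describe_segments` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the three attributes this notebook reads from each segment (`start_offset_sec`, `end_offset_sec`, `embeddings_float`).

```python
from types import SimpleNamespace


def describe_segments(segments):
    """Return basic statistics for a list of embedding segments."""
    if not segments:
        raise ValueError("no segments were returned")
    # Every segment should share one embedding dimensionality
    dims = {len(s.embeddings_float) for s in segments}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return {
        "count": len(segments),
        "dim": dims.pop(),
        "start_sec": min(s.start_offset_sec for s in segments),
        "end_sec": max(s.end_offset_sec for s in segments),
    }


# Stand-in data so the sketch runs without calling the API
demo = [
    SimpleNamespace(start_offset_sec=0.0, end_offset_sec=6.0, embeddings_float=[0.1, 0.2, 0.3]),
    SimpleNamespace(start_offset_sec=6.0, end_offset_sec=12.0, embeddings_float=[0.0, 0.5, 0.1]),
]
print(describe_segments(demo))
```

In the notebook itself, the call would presumably be `describe_segments(segs)`.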

### Connect to KDB.AI session and Database
Let's initialize our KDB.AI session to connect to our default database.

In [None]:
# Initialize KDB.AI Session and Database connection
session = Session(endpoint=os.getenv("KDBAI_ENDPOINT"), api_key=os.getenv("KDBAI_API_KEY"))
db = session.database("default")
print(f"KDB.AI Session initialized for endpoint: {os.getenv('KDBAI_ENDPOINT')}")

## 2. Set Up KDB.AI Vector Database Table

Define the KDB.AI table schema and configure the vector index for storing and searching the video embeddings.

### Define KDB.AI Table Schema and Indexes
Now we'll define our table schema with four columns:
- segment_id: A unique identifier for each video segment
- start_offset_sec: The starting timestamp of the segment, in seconds
- end_offset_sec: The ending timestamp of the segment, in seconds
- embeddings: The vector representation of the segment content

We'll also configure a `qHnsw` vector index (an HNSW-based index type) with cosine similarity as the metric to enable efficient similarity search across our video embeddings.


In [50]:
# Embedding dimensionality, read from the first segment
dim = len(segs[0].embeddings_float)

schema = [
    {"name":"segment_id",       "type":"str"},
    {"name":"start_offset_sec", "type":"float64"},
    {"name":"end_offset_sec",   "type":"float64"},
    {"name":"embeddings",       "type":"float32s"}
]

indexes = [{
    "type":   "qHnsw",
    "name":   "idx_emb",
    "column": "embeddings",
    "params": {"dims": dim, "metric": "CS"}
}]


print("KDB.AI schema and index defined successfully.")

KDB.AI schema and index defined successfully.
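
For intuition about the cosine similarity metric chosen for the index, here is a pure-Python sketch of the quantity it is based on. It is written for illustration and is not how KDB.AI computes it internally; `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity is 1
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors: similarity is 0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because the measure depends only on direction, two embeddings that differ in magnitude but point the same way count as identical matches.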


### Create KDB.AI Table

In this step, we will:
1. Drop any existing table named 'video_chunks' (ignoring the error raised if it does not exist)
2. Create a new 'video_chunks' table using our predefined schema and indexes
3. Store a reference to the table in the `table` variable for later use

In [51]:
try:
    db.table("video_chunks").drop()
except kdbai.KDBAIException:
    pass

table = db.create_table("video_chunks", schema=schema, indexes=indexes)
print(f"KDB.AI table '{table.name}' created successfully.")

KDB.AI table 'video_chunks' created successfully.


## 3. Load Data and Search

We now load the prepared video chunk data (segment IDs, start/end timestamps, and embeddings) into the KDB.AI vector table.

### Inserting Data into KDB.AI Table
Let's insert our data into KDB.AI and query it to make sure it was inserted correctly:

In [52]:
rows = []
for seg in segs:
    rows.append({
        "segment_id":       f"{int(seg.start_offset_sec)}",
        "start_offset_sec": seg.start_offset_sec,
        "end_offset_sec":   seg.end_offset_sec,
        "embeddings":       seg.embeddings_float
    })
    
df = pd.DataFrame(rows)

table.insert(df)

display(table.query(limit=5))

Unnamed: 0,segment_id,start_offset_sec,end_offset_sec,embeddings
0,0,0.0,6.0,"[0.025205607, 0.0032751139, -0.014859959, 0.02..."
1,6,6.0,12.0,"[0.02405941, 0.018603785, -0.02103669, 0.04805..."
2,12,12.0,18.0,"[0.023615949, 0.013139918, -0.022509707, 0.047..."
3,18,18.0,24.0,"[0.0229081, 0.015123129, -0.028940378, 0.04754..."
4,24,24.0,30.0,"[0.0034697105, 0.0023037025, -0.023470787, 0.0..."
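
This video yields few enough rows to insert in one call. For much longer videos, one might prefer to insert in smaller batches. Whether that is needed depends on the deployment, so treat the sketch below as an assumption-laden illustration; `iter_batches` is a hypothetical helper.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Stand-in rows; in the notebook these would be the dicts built from `segs`
demo_rows = [{"segment_id": str(i * 6)} for i in range(5)]
for batch in iter_batches(demo_rows, batch_size=2):
    print([row["segment_id"] for row in batch])
```

In the notebook, each batch could then be wrapped with `pd.DataFrame(batch)` and passed to `table.insert`, mirroring the cell above.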


### Running Vector Search on Video Content
Now we'll perform a vector search using the query "central limit theorem". We'll embed the query text with Twelve Labs' Marengo-retrieval-2.7 model, retrieve the top 3 most relevant video segments (n=3) from KDB.AI based on embedding similarity, and display these video clips in the notebook.

In [53]:
query = "central limit theorem"

q_emb = tl.embed.create(model_name="Marengo-retrieval-2.7", text=query)
q_vec = q_emb.text_embedding.segments[0].embeddings_float

hits  = table.search(vectors={"idx_emb": [q_vec]}, n=3)[0]

from IPython.display import display

# Cut each retrieved segment out of the source video and play it inline
for index in range(len(hits)):
    hit = hits.iloc[index]
    clip = VideoFileClip(output_vid).subclipped(hit.start_offset_sec, hit.end_offset_sec)
    display(clip.display_in_notebook(width=500))



Looks like we got accurate snippets! They all seem to be about the Central Limit Theorem.
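
Since each hit carries a start offset, it may be handy to link back to the same moment in the original YouTube video. The sketch below assumes the commonly used `t=<seconds>s` URL parameter. `timestamp_link` and `format_mmss` are hypothetical helper names.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def format_mmss(seconds: float) -> str:
    """Format a number of seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def timestamp_link(video_url: str, start_sec: float) -> str:
    """Return `video_url` with a `t` parameter pointing at `start_sec`."""
    parts = urlparse(video_url)
    # Drop any existing `t` parameter, then append the new one
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "t"]
    query.append(("t", f"{int(start_sec)}s"))
    return urlunparse(parts._replace(query=urlencode(query)))


url = "https://www.youtube.com/watch?v=d_qvLDhkg00"
print(format_mmss(126.0), timestamp_link(url, 120.0))
```

Applied to the search results, this could be called once per row with `hit.start_offset_sec`.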

## 4. Analyzing Search Results
In this step we will use the TwelveLabs Pegasus model to summarize each of the retrieved clips.

### Indexing Video Content
Next, we'll create an index for our video using Twelve Labs' models. This index will enable advanced search and generation capabilities, allowing us to extract detailed information and generate summaries from specific segments of the video.

In [57]:
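idx = None
try:
    idx = tl.index.create(
        name="clt-demo",
        models=[
            {"name":"marengo2.7","options":["visual","audio"]},
            {"name":"pegasus1.2","options":["visual","audio"]}
        ]
    )
except Exception:
    # Creation fails if "clt-demo" already exists; reuse that index by name
    # rather than whichever index happens to be listed first
    idx = next(i for i in tl.index.list() if i.name == "clt-demo")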

vid_task = tl.task.create(index_id=idx.id, file=output_vid)
vid_task.wait_for_done()
video_id = vid_task.video_id

### Summarize Clips

We can use the TwelveLabs summarize endpoint to summarize each clip by stating its start and end offsets in the prompt. This produces one summary per retrieved clip. An alternative is to combine the clips and pass them to an open-source video chat model.

In [58]:
print("\n📝 Summaries:")
for index in range(len(hits)):  # Iterate using index to access rows
    h = hits.iloc[index]  # Access row using .iloc
    start = h.start_offset_sec
    end   = h.end_offset_sec
    prompt = (
        f"Summarize the content of the video between {start:.1f}s and {end:.1f}s, "
        "with an emphasis on the Central Limit Theorem."
    )
    summary = tl.generate.summarize(
        video_id=video_id,
        type="summary",
        prompt=prompt
    )
    print(f"\n— Segment {start:.1f}–{end:.1f}s —\n{summary.summary}")


📝 Summaries:

— Segment 120.0–126.0s —
The video delves into the intricacies of the normal distribution, also known as the Gaussian distribution, and its significance in probability theory. The narrator begins by posing a question about the special place of the Gaussian function, \( e^{-x^2} \), in probability theory, setting the stage for an exploration of why this function is so important.
The discussion then transitions to a refresher on the Central Limit Theorem (CLT), which states that as you add multiple copies of a random variable, such as rolling a weighted die many times or letting a ball bounce off pegs repeatedly, the distribution of the sum tends to approximate a normal distribution. The CLT asserts that as the sum grows larger, this approximation becomes increasingly accurate. However, the video does not just restate the theorem; it aims to provide a deeper understanding of why the Gaussian function is the central limit.
The video introduces the concept of convolution, a 
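
When retrieved segments are adjacent or overlapping, it may be useful to merge them into contiguous time ranges before summarizing or combining clips. The sketch below is one way to do that. `merge_segments` and its `gap` parameter are hypothetical and not part of either SDK.

```python
def merge_segments(spans, gap=0.0):
    """Merge (start, end) spans that overlap or lie within `gap` seconds of each other."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= gap:
            # Extend the previous range instead of starting a new one
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]


# Two adjacent 6-second segments and one isolated segment
print(merge_segments([(126.0, 132.0), (120.0, 126.0), (300.0, 306.0)]))
```

With the search results from this notebook, the input could be built as `list(zip(hits.start_offset_sec, hits.end_offset_sec))`.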


## 5. Cleanup

Remove the KDB.AI table created during this notebook session to free up resources.


### Cleanup: Dropping the KDB.AI Table
Once finished with the table, it is best practice to drop it.

In [56]:
table.drop()