<a href="https://colab.research.google.com/github/video-db/videodb-cookbook/blob/main/docs/integrations/llama-index/simple_video_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
# VideoDB Retriever 

### RAG: Multimodal Search on Videos and Stream Video Results 📺

Constructing a RAG pipeline for text is relatively straightforward, thanks to the tools developed for parsing, indexing, and retrieving text data. 

However, adapting RAG models for video content presents a greater challenge. Videos combine visual, auditory, and textual elements, requiring more processing power and sophisticated video pipelines.

While Large Language Models (LLMs) excel with text, they fall short in helping you consume or create video clips. `VideoDB` provides a sophisticated database abstraction for your MP4 files, enabling the use of LLMs on your video data. With VideoDB, you can not only analyze but also `instantly watch video streams` of your search results.

> [VideoDB](https://videodb.io) is a serverless database designed to streamline the storage, search, editing, and streaming of video content. VideoDB offers random access to sequential video data by building indexes and developing interfaces for querying and browsing video content. Learn more at [docs.videodb.io](https://docs.videodb.io).


In this notebook, we introduce `VideoDBRetriever`, a tool specifically designed to simplify the creation of RAG pipelines for video content, without any hassle of dealing with complex video infrastructure.

&nbsp;
## 🛠️️ Setup 

---

### 🔑 Requirements

To connect to VideoDB, simply get the API key and create a connection. This can be done by setting the `VIDEO_DB_API_KEY` environment variable. You can get it from 👉🏼 [VideoDB Console](https://console.videodb.io). ( Free for first 50 uploads, **No credit card required!** )

Get your `OPENAI_API_KEY` from OpenAI platform for `llama_index` response synthesizer.

<!-- > Set the `OPENAI_API_KEY` & `VIDEO_DB_API_KEY` environment variable with your API keys. -->

In [43]:
import os

os.environ["VIDEO_DB_API_KEY"] = ""
os.environ["OPENAI_API_KEY"] = ""


### 📦 Installing Dependencies

To get started, we'll need to install the following packages:

- `llama-index`
- `llama-index-retrievers-videodb`
- `videodb`

In [None]:
%pip install videodb
%pip install llama-index 

In [None]:
%pip install llama-index-retrievers-videodb

## 🛠 Building Multimodal RAG

---


Implementing this multimodal search query involves following steps with VideoDB:

1. 🎬 **Upload and Index the Video**:
     - Upload the video and get the video object.
     - `index_scenes`  function to detect and recognize events, such as theft, within the video footage.
     - `index_spoken_words` function to index spoken words of the news anchor to enable keyword search.
2. 🧩 **Query Transformation**: Divide query into two parts that can be used with respective scene and spoken indexes.
3. 🔎 **Finding Relevant nodes for each modality**: Using `VideoDBRetriever` find relevant nodes from Spoken Index and Scene Index 
4. ✏️ **Viewing the result : Text**: Use Relevant Nodes to sythesize a text reponse Integrating the results from both indexes for precise video segment identification. 
5. 🎥 **Viewing the result : Video Clip**: Integrating the results from both indexes for precise video segment identification. 

### 📋 Step 1: Connect to VideoDB and Ingest Data

Let's upload a our video file first.

You can use any `public url`, `Youtube link` or `local file` on your system. 

> ✨ First 50 uploads are free!

In [44]:
from videodb import connect

# connect to VideoDB
conn = connect()
coll = conn.get_collection()

# upload videos to default collection in VideoDB
print("Uploading Video")
video = conn.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")
print(f"Video uploaded with ID: {video.id}")


# video = coll.get_video("m-56f55058-62b6-49c4-bbdc-43c0badf4c0b")

Uploading Video
Video uploaded with ID: m-0ccadfc8-bc8c-4183-b83a-543946460e2a


> * `coll = conn.get_collection()` : Returns default collection object.
> * `coll.get_videos()` : Returns list of all the videos in a collections.
> * `coll.get_video(video_id)`: Returns Video object from given`video_id`.


### 📸🗣️ Step 2: Index the Video on different Modalities

#### 🗣️ Indexing Spoken Content

In [45]:
print("Indexing spoken content in Video...")
video.index_spoken_words()

Indexing spoken content in Video...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:23<00:00,  4.27it/s]


#### 📸️ Indexing Visual Content

To learn more about Scene Index, explore the following guides:

- [Quickstart Guide](https://github.com/video-db/videodb-cookbook/blob/main/quickstart/Scene%20Index%20QuickStart.ipynb) guide provides a step-by-step introduction to Scene Index. It's ideal for getting started quickly and understanding the primary functions.

- [Scene Extraction Options Guide](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/playground_scene_extraction.ipynb) delves deeper into the various options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction based on different needs and preferences.

In [46]:
from videodb import SceneExtractionType

print("Indexing Visual content in Video...")

#Index scene content
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 2, "select_frames": ["first", "last"]},
    prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)

print(f"Scene Index successful with ID: {index_id}")


Indexing Visual content in Video...
Scene Index successful with ID: f3eef7aee2a0ff58


### 🛠️ Step3: Simple Multimodal RAG 

To construct a multimodal RAG pipeline, follow these steps:

- 🧩 **Segment the Query**: Split the query into two parts, each corresponding to the scene and spoken indexes.
- 🔎 **Retrieve Nodes**: Use each segment of the query to fetch relevant nodes from the respective modalities.
- 💬 **Generate Text**: Synthesize a text-based answer using the retrieved nodes.
- 🎥 **Extract Video Clips**: Identify and compile relevant video clips based on the retrieved nodes.

Now that the videos are indexed, leverage `VideoDBRetriever` to access the pertinent nodes from the VideoDB.

#### 🧩 Query Transformation

In [47]:
from llama_index.llms.openai import OpenAI


def split_spoken_visual_query(query):
    transformation_prompt = """
    Divide the following query into two distinct parts: one for spoken content and one for visual content. The spoken content should refer to any narration, dialogue, or verbal explanations and The visual content should refer to any images, videos, or graphical representations. Format the response strictly as:\nSpoken: <spoken_query>\nVisual: <visual_query>\n\nQuery: {query}
    """
    prompt = transformation_prompt.format(query=query)
    response = OpenAI(model="gpt-4").complete(prompt)
    divided_query = response.text.strip().split("\n")
    spoken_query = divided_query[0].replace("Spoken:", "").strip()
    scene_query = divided_query[1].replace("Visual:", "").strip()
    return spoken_query, scene_query 


query = "Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy"
spoken_query, scene_query = split_spoken_visual_query(query)
print("Query for Spoken retriever : ", spoken_query)
print("Query for Scene retriever : ", scene_query)

Query for Spoken retriever :  Discuss the formation of the solar system
Query for Scene retriever :  Visualize the milky way galaxy


##### 🔎 Finding Relevant nodes for each modality

In [48]:
from llama_index.retrievers.videodb import VideoDBRetriever
from llama_index.core import get_response_synthesizer

# VideoDBRetriever by default uses the default collection in the VideoDB
spoken_retriever = VideoDBRetriever(video=video.id, search_type="semantic", index_type="spoken_word", score_threshold=0.1)
scene_retriever = VideoDBRetriever(video=video.id, search_type="semantic", index_type="scene", scene_index_id=index_id, score_threshold=0.1)

nodes_spoken_index = spoken_retriever.retrieve(spoken_query)
nodes_scene_index = scene_retriever.retrieve(scene_query)

#### ️💬️ Viewing the result : Text

In [49]:
response_synthesizer = get_response_synthesizer()

response = response_synthesizer.synthesize(
    query, nodes=nodes_scene_index+nodes_spoken_index
)
print(response)

The narrator discusses the formation of the solar system when mentioning how it came into being about 4.5 billion years ago due to the collapse of a cloud of interstellar gas and dust, forming a solar nebula. To visualize the Milky Way Galaxy, the images depict a detailed illustration showing a spiral galaxy with multiple arms swirling from a bright central bulge, highlighting the position of our Solar System within one of the minor spiral arms known as the Orion Spur.


#### 🎥 Viewing the result : Video Clip


From each modality we have retrieved result that are relevant to the query in the relevant modality (semantic & scene/ visual, in this case).

Each node has `start` and `end` fields in meatadata. which represent the time the node represents. 

There are two ways to combine these search results:

- **Union**: This method takes all the timestamps from every node, creating a comprehensive list that includes every relevant time, even if some timestamps appear in only one modality.

<img src="https://raw.githubusercontent.com/video-db/videodb-cookbook-assets/main/images/guides/multimodal_quickstart_union.png" alt="Example Image" width="500"/>

- **Intersection**: This method only includes timestamps from every node, resulting in a smaller list with times that are universally relevant across all modalities.

<img src="https://raw.githubusercontent.com/video-db/videodb-cookbook-assets/main/images/guides/multimodal_quickstart_intersection.png" alt="Example Image" width="500"/>


Depending on which method you prefer, you can pass the appropriate argument to the `combine_results()` function below.

In [64]:

def merge_intervals(intervals):
    if not intervals:
        return []
    intervals.sort(key=lambda x: x[0])
    merged = [intervals[0]]
    for interval in intervals[1:]:
        if interval[0] <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], interval[1])
        else:
            merged.append(interval)
    return merged

# Define a Function to Find Intersections
def process_shots(l1, l2, operation):
    def intersection(intervals1, intervals2):
        i, j = 0, 0
        result = []
        while i < len(intervals1) and j < len(intervals2):
            low = max(intervals1[i][0], intervals2[j][0])
            high = min(intervals1[i][1], intervals2[j][1])
            if low < high:
                result.append([low, high])
            if intervals1[i][1] < intervals2[j][1]:
                i += 1
            else:
                j += 1
        return result

    if operation.lower() == "intersection":
        return intersection(merge_intervals(l1), merge_intervals(l2))
    elif operation.lower() == "union":
        return merge_intervals(l1 + l2)
    else:
        raise ValueError("Invalid operation. Please choose 'intersection' or 'union'.")


def combine_results(spoken_results, scene_results, operation):
    spoken_timestamps = [[shot.node.metadata['start'], shot.metadata['end']] for shot in spoken_results]
    scene_timestamps = [[shot.node.metadata['start'], shot.metadata['end']] for shot in scene_results]
    result = process_shots(spoken_timestamps, scene_timestamps, operation)
    print("Spoken results: ", spoken_timestamps)
    print("Scene results: ", scene_timestamps)
    print("Combined results: ", result)
    return result


# Combine results
results = combine_results(nodes_scene_index, nodes_spoken_index, "union")

Spoken results:  [[26.026, 30.03]]
Scene results:  [[0.0, 37.63]]
Combined results:  [[0.0, 37.63]]


In [51]:
from videodb import play_stream
print(f"Multimodal Query: {query}")
stream_link = video.generate_stream(results)
play_stream(stream_link)

Multimodal Query: Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy


'https://console.videodb.io/player?url=https://dseetlpshk2tb.cloudfront.net/v3/published/manifests/133575ed-6c9e-4368-800e-60fc94fa2b53.m3u8'

&nbsp;
## Configuring `VideoDBRetriever`
---

### ⚙️ Retriever for only one Video
You can pass the `id` of the video object to search in only that video. 
```python
VideoDBRetriever(video="my_video.id")
```

### ⚙️ Retriever for a set of Video/ Collection
You can pass the `id` of the Collection to search in only that Collection. 
```python
VideoDBRetriever(collection="my_coll.id")
```

### ⚙️ Retriever for different type of Indexes
```python
spoken_word = VideoDBRetriever(index_type="spoken_word", search_type="semantic")

scene_retriever = VideoDBRetriever(index_type="scene", scene_index_id="my_index_id", search_type="semantic")
```

### ⚙️ Configuring Search Type of Retriever 
`search_type` determines the search method used to retrieve nodes against given query 
```python
keyword_spoken_search = VideoDBRetriever(search_type="keyword", index_type="spoken_word")

semantic_scene_search = VideoDBRetriever(search_type="semantic", index_type="spoken_word")
```

### ⚙️ Configure threshold parameters  
- `result_threshold`: is the threshold for number of results returned by retriever; the default value is `5`
- `score_threshold`: only nodes with score higher than `score_threshold` will be returned by retriever; the default value is `0.2`  

```python
custom_retriever = VideoDBRetriever(result_threshold=2, score_threshold=0.5)
```

### ✨ Incorporating VideoDB in your existing Llamaindex RAG Pipeline
---

If you want to use a Vector Index of your own choice, you can fetch all Transcript Nodes and Visual Nodes of a video, and then index or incorporate them into your existing LlamaIndex pipeline.

#### 🗣 Fetching Transcript Nodes

You can fetch transcript nodes using `Video.get_transcript()`

To configure the segmenter, use the `segmenter` and `length` arguments.

Possible values for segmenter are:
- `Segmenter.time`: Segments the video based on the specified `length` in seconds.
- `Segmenter.word`: Segments the video based on the word count specified by `length`

In [53]:
from videodb import Segmenter
from llama_index.core.schema import TextNode


# Fetch all Transcript Nodes
nodes_transcript_raw = video.get_transcript(segmenter=Segmenter.time, length=60)

# Convert the raw transcript nodes to TextNode objects
nodes_transcript = [
    TextNode(
        text=node["text"],
        metadata={key: value for key, value in node.items() if key != "text"},
    )
    for node in nodes_transcript_raw
]


#### 📸 Fetching Scene Nodes

In [None]:
# Fetch all Scenes
scenes = video.get_scene_index(index_id)

# Convert the scenes to TextNode objects
nodes_scenes = [
    TextNode(
        text=node["description"],
        metadata={key: value for key, value in node.items() if key != "description"},
    )
    for node in scenes
]

### 🔄 Simple RAG Pipeline with Transcript + Scene Nodes

In [58]:
from llama_index.core import VectorStoreIndex

# Index both Transcript and Scene Nodes
index = VectorStoreIndex(nodes_scenes + nodes_transcript)
q = index.as_query_engine()

The narrator discusses the location of our Solar System within the Milky Way galaxy, emphasizing its position in one of the minor spiral arms known as the Orion Spur. The images provided offer visual representations of the Milky Way's structure, with labels indicating the specific location of the Solar System within the galaxy.


#### ️💬️ Viewing the result : Text

In [None]:
res = q.query(
    "Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy"
)
print(res)

#### 🎥 Viewing the result : Video Clip

In [67]:
from videodb import play_stream

relevant_timestamps = [
    [node.metadata["start"], node.metadata["end"]] for node in res.source_nodes
]

stream_url = video.generate_stream(merge_intervals(relevant_timestamps))
play_stream(stream_url)

## 🏃‍♂️ Next Steps
---

In this guide, we built a Simple Multimodal RAG for Videos Using VideoDB, Llamaindex, and OpenAI

You can optimize the pipeline by incorporating more advanced techniques like
- Build a Search on Video Collection
- Optimize Query Transformation
- More methods to combine retrieved nodes from different modalities
- Experiment with Different RAG pipelines like Knowledge Graph


To learn more about Scene Index, explore the following guides:

- [Quickstart Guide](https://github.com/video-db/videodb-cookbook/blob/main/quickstart/Scene%20Index%20QuickStart.ipynb) 
- [Scene Extraction Options](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/playground_scene_extraction.ipynb)
- [Advanced Visual Search](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/advanced_visual_search.ipynb)
- [Custom Annotation Pipelines](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/custom_annotations.ipynb)

## 👨‍👩‍👧‍👦 Support & Community
---

If you have any questions or feedback. Feel free to reach out to us 🙌🏼

* [Discord](https://colab.research.google.com/corgiredirector?site=https%3A%2F%2Fdiscord.gg%2Fpy9P639jGz)
* [GitHub](https://github.com/video-db)
* [Email](mailto:ashu@videodb.io)
