# 📸🗣️ Multimodal Quickstart 

<a href="https://colab.research.google.com/github/video-db/videodb-cookbook/blob/main/quickstart/Multimodal_Quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction 

Let’s first look at the example query that we want to unlock in our video library. 

> 📸🗣️ __Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy__

Implementing this multimodal search query with VideoDB involves the following steps:

1. 🎬 **Upload and Index the Video**:
     - Upload the video and get the video object.
     - Use the `index_scenes` function to describe and index the visual content of each scene.
     - Use the `index_spoken_words` function to index the narrator's speech so it can be searched.
2. 🧩 **Query Transformation**: Divide the query into two parts, one for the scene index and one for the spoken index.
3. 🔎 **Perform Search**: Use the two queries to search for relevant segments in their respective indexes.
4. 🔀 **Combine Search Results of Both Modalities**: Integrate the results from both indexes to identify the precise video segments.
5. **Stream the Footage**: Generate and play video streams using the segments.

## Setup
---

### 📦  Installing packages 

In [None]:
%pip install openai
%pip install videodb

### 🔑 API keys
Before proceeding, make sure you have access to [VideoDB](https://videodb.io) and [OpenAI](https://openai.com). If not, sign up for API access on the respective platforms.

> Get your API key from [VideoDB Console](https://console.videodb.io). (Free for the first 50 uploads, **no credit card required**) 🎉

In [1]:
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["VIDEO_DB_API_KEY"] = ""
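
Because the cell above leaves both keys empty, a quick check can catch a missing key before a later API call fails. The following is a minimal sketch; `missing_keys` is a hypothetical helper introduced here for illustration and is not part of VideoDB or OpenAI.

```python
import os


def missing_keys(names, env=None):
    """Return the names that are unset or blank in the given environment mapping."""
    # Hypothetical helper: defaults to the real process environment.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


required = ["OPENAI_API_KEY", "VIDEO_DB_API_KEY"]
missing = missing_keys(required)
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.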

## Building Multimodal Search

---

### 📋 Step 1: Connect to VideoDB

Gear up by establishing a connection to VideoDB.

In [8]:
from videodb import connect
from videodb import play_stream

# Connect to VideoDB; connect() reads the API key from the VIDEO_DB_API_KEY environment variable
conn = connect()
coll = conn.get_collection()

### 🎬 Step 2: Upload the Video 

In [31]:
video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")

In [32]:
stream = video.generate_stream()

### 📸🗣️ Step 3: Index the Video on different Modalities

#### 🗣️ Indexing Spoken Content

In [33]:
# Index spoken content

video.index_spoken_words()


#### 📸️ Indexing Visual Content

To learn more about Scene Index, explore the following guides:

- The [Quickstart Guide](https://github.com/video-db/videodb-cookbook/blob/main/quickstart/Scene%20Index%20QuickStart.ipynb) provides a step-by-step introduction to Scene Index. It's ideal for getting started quickly and understanding the primary functions.

- The [Scene Extraction Options Guide](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/playground_scene_extraction.ipynb) delves deeper into the options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction for different needs.

In [34]:
from videodb import SceneExtractionType

# Index scene content
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 2, "select_frames": ["first", "last"]},
    prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)

[{'description': 'The scene presented in the images showcases a mesmerizing view of the night sky. It\'s a clear starry sky filled with countless small, sparkling stars scattered across a dark expanse. A prominent feature is the Milky Way galaxy, represented as a thick, cloudy band that stretches diagonally across the sky. This band is a mix of bright and dark areas, with a noticeable pinkish-purple hue indicating regions of dense star clusters and interstellar dust.\n\nIn terms of text and graphics, the second image includes overlays that provide context. At the top right corner, there is a small, yellow rectangular logo. At the bottom, a black banner with text appears. The text on the banner includes:\n\n- The word “SCIENCE” in large white letters, followed by the number “101” in bright yellow.\n- Below that, there are three smaller sections of text:\n  - "ANATOMY” on the left side.\n  - "SPACE” in the middle.\n  - "DISCOVERIES” on the right side.\n- On the far right of the black ban
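
To build intuition for the `extraction_config` used above, the sketch below splits a duration into consecutive fixed-length windows, which is the general idea behind time-based extraction with `"time": 2`. This is an illustration under assumptions, not VideoDB's implementation: `time_based_windows` is a hypothetical helper, and the 7-second duration is made up.

```python
def time_based_windows(duration, step):
    """Split [0, duration] into consecutive windows of `step` seconds.

    The final window may be shorter than `step`.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + step, duration)
        windows.append((start, end))
        start = end
    return windows


# Made-up duration of 7 seconds, split every 2 seconds
windows = time_based_windows(duration=7.0, step=2.0)
print(windows)
```

With `"select_frames": ["first", "last"]`, each such window would presumably contribute its first and last frame to the description prompt.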

### 🧩 Step 4: Query Transformation

In [35]:
from openai import OpenAI

transformation_prompt = """
Divide the following query into two distinct parts: one for spoken content and one for visual content. The spoken content should refer to any narration, dialogue, or verbal explanations. The visual content should refer to any images, videos, or graphical representations. Format the response strictly as:\nSpoken: <spoken_query>\nVisual: <visual_query>\n\nQuery: {query}
"""

# Initialize OpenAI client
client = OpenAI()


def divide_query(query):
    # Use the OpenAI client to create a chat completion with a structured prompt
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": transformation_prompt.format(query=query)}
        ],
    )

    message = response.choices[0].message.content
    divided_query = message.strip().split("\n")
    spoken_query = divided_query[0].replace("Spoken:", "").strip()
    visual_query = divided_query[1].replace("Visual:", "").strip()

    return spoken_query, visual_query


# Test the query
query = "Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy"

spoken_query, visual_query = divide_query(query)
print(f"Spoken Query: {spoken_query}")
print(f"Visual Query: {visual_query}")

Spoken Query: Show me where the narrator discusses the formation of the solar system
Visual Query: Visualize the milky way galaxy
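
The parsing in `divide_query` relies on the model returning exactly two lines in a fixed order. A model may add blank lines or change the order, so a slightly more defensive parser can help. The sketch below is one possible approach; `parse_divided_query` is a hypothetical helper, and the sample reply is made up.

```python
def parse_divided_query(message):
    """Extract the spoken and visual parts from a 'Spoken: ... / Visual: ...' reply."""
    spoken_query, visual_query = None, None
    for line in message.splitlines():
        line = line.strip()
        lowered = line.lower()
        # Match the labels case-insensitively and ignore unrelated or blank lines
        if lowered.startswith("spoken:"):
            spoken_query = line[len("spoken:"):].strip()
        elif lowered.startswith("visual:"):
            visual_query = line[len("visual:"):].strip()
    if not spoken_query or not visual_query:
        raise ValueError(f"Unexpected response format: {message!r}")
    return spoken_query, visual_query


# Made-up reply with an extra blank line between the two parts
reply = "Spoken: formation of the solar system\n\nVisual: milky way galaxy\n"
print(parse_divided_query(reply))
```

Raising an error on an unexpected format makes a bad reply visible immediately instead of passing an empty query on to the search step.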


### 🔍 Step 5: Performing Searches
Now that we have our divided queries, let's perform searches on both the spoken word and scene indexes:

🗣️ **Searching from Spoken Index**

In [36]:
from videodb import SearchType, IndexType

# Perform the search using the spoken query
spoken_results = video.search(
    query=spoken_query,
    index_type=IndexType.spoken_word,
    search_type=SearchType.semantic,
)

# View the results
spoken_results.play()

'https://console.videodb.io/player?url=https://stream.videodb.io/v3/published/manifests/2520e156-0a62-4d66-88c0-8f447f63b1d7.m3u8'

📸️ **Searching from Scene Index**

In [40]:
# Perform the search using the visual query
scene_results = video.search(
    query=visual_query, 
    index_type=IndexType.scene,
    search_type=SearchType.semantic,
    score_threshold=0.1,
    dynamic_score_percentage=100,
)

# View the results
scene_results.play()

'https://console.videodb.io/player?url=https://stream.videodb.io/v3/published/manifests/6e3570a9-bf3a-4612-925e-f158566c9d79.m3u8'

### 🔀 Step 6: Combining Spoken & Scene Search results

Each search result provides a list of timestamps that are relevant to the query in its modality (spoken and scene/visual, in this case).
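
As a minimal sketch of what "a list of timestamps" means here, the example below turns shot-like objects into `[start, end]` pairs. It assumes only that each shot exposes `start` and `end` attributes, as the shots returned by `get_shots()` do later in this guide. `shots_to_intervals` and `total_duration` are hypothetical helpers, and the stand-in shots are made up.

```python
from types import SimpleNamespace


def shots_to_intervals(shots):
    """Convert objects with `start` and `end` attributes into sorted [start, end] pairs."""
    return sorted([shot.start, shot.end] for shot in shots)


def total_duration(intervals):
    """Sum the lengths of the intervals (assumes they do not overlap)."""
    return sum(end - start for start, end in intervals)


# Made-up stand-ins for search result shots
example_shots = [
    SimpleNamespace(start=26.0, end=32.0),
    SimpleNamespace(start=12.0, end=14.0),
]
intervals = shots_to_intervals(example_shots)
print(intervals)
print(total_duration(intervals))
```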

There are two ways to combine these search results:

- **Union**: This method takes all the timestamps from every search result, creating a comprehensive list that includes every relevant time, even if some timestamps appear in only one result.

<img src="https://raw.githubusercontent.com/video-db/videodb-cookbook-assets/main/images/guides/multimodal_quickstart_union.png" alt="Example Image" width="500"/>

- **Intersection**: This method only includes timestamps that appear in all the search results, resulting in a smaller list with times that are universally relevant across all results.

<img src="https://raw.githubusercontent.com/video-db/videodb-cookbook-assets/main/images/guides/multimodal_quickstart_intersection.png" alt="Example Image" width="500"/>


Depending on which method you prefer, you can pass the appropriate argument to the `combine_results()` function below.

In [41]:
# Define a function that combines two lists of [start, end] intervals by union or intersection
def process_shots(l1, l2, operation):
    def merge_intervals(intervals):
        if not intervals:
            return []
        # Work on sorted copies so the caller's lists are not modified in place
        intervals = sorted((list(interval) for interval in intervals), key=lambda x: x[0])
        merged = [intervals[0]]
        for interval in intervals[1:]:
            if interval[0] <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], interval[1])
            else:
                merged.append(interval)
        return merged

    def intersection(intervals1, intervals2):
        i, j = 0, 0
        result = []
        while i < len(intervals1) and j < len(intervals2):
            low = max(intervals1[i][0], intervals2[j][0])
            high = min(intervals1[i][1], intervals2[j][1])
            if low < high:
                result.append([low, high])
            if intervals1[i][1] < intervals2[j][1]:
                i += 1
            else:
                j += 1
        return result

    if operation.lower() == "intersection":
        return intersection(merge_intervals(l1), merge_intervals(l2))
    elif operation.lower() == "union":
        return merge_intervals(l1 + l2)
    else:
        raise ValueError("Invalid operation. Please choose 'intersection' or 'union'.")


def combine_results(spoken_results, scene_results, operation):
    spoken_timestamps = [[shot.start, shot.end] for shot in spoken_results.get_shots()]
    scene_timestamps = [[shot.start, shot.end] for shot in scene_results.get_shots()]
    result = process_shots(spoken_timestamps, scene_timestamps, operation)
    print("Spoken results: ", spoken_timestamps)
    print("Scene results: ", scene_timestamps)
    print("Combined results: ", result)
    return result


# Combine results
results = combine_results(spoken_results, scene_results, "intersection")

Spoken results:  [[0.0, 37.63]]
Scene results:  [[12.012, 14.014], [26.026, 32.032], [34.034, 36.036]]
Combined results:  [[12.012, 14.014], [26.026, 32.032], [34.034, 36.036]]
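
To see how the two operations differ on the same input, the sketch below applies both to made-up timestamps. It is a compact, self-contained illustration rather than a replacement for `process_shots`; `union_intervals` and `intersect_intervals` are hypothetical names.

```python
def union_intervals(a, b):
    """Merge all intervals from both lists, joining any that overlap or touch."""
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def intersect_intervals(a, b):
    """Keep only the time ranges covered by both lists.

    Assumes the intervals within each list do not overlap one another.
    """
    result = []
    for s1, e1 in a:
        for s2, e2 in b:
            low, high = max(s1, s2), min(e1, e2)
            if low < high:
                result.append([low, high])
    return sorted(result)


# Made-up timestamps in seconds
spoken = [[0, 10], [20, 30]]
scene = [[5, 12], [25, 28], [40, 45]]
print("Union:       ", union_intervals(spoken, scene))      # [[0, 12], [20, 30], [40, 45]]
print("Intersection:", intersect_intervals(spoken, scene))  # [[5, 10], [25, 28]]
```

Union keeps any moment matched by either modality, while intersection keeps only the moments matched by both. That is why the combined results above equal the scene results: every scene shot lies inside the single spoken interval.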


### 🪄 Step 7: View Combined Results

Finally, let's generate a stream of the intersecting segments and play it:

In [42]:
from videodb import play_stream
print(f"Multimodal Query: {query}")
stream_link = video.generate_stream(results)
play_stream(stream_link)

Multimodal Query: Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy


'https://console.videodb.io/player?url=https://stream.videodb.io/v3/published/manifests/a267b391-061a-4a05-ab00-5d2d732d0709.m3u8'

## Further Steps
---

In this guide, we've demonstrated how to perform multimodal search on educational videos using VideoDB and OpenAI. 

To learn more about Scene Index, explore the following guides:

- [Quickstart Guide](https://github.com/video-db/videodb-cookbook/blob/main/quickstart/Scene%20Index%20QuickStart.ipynb) 
- [Scene Extraction Options](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/playground_scene_extraction.ipynb)
- [Advanced Visual Search](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/advanced_visual_search.ipynb)
- [Custom Annotation Pipelines](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/custom_annotations.ipynb)


If you have any questions or feedback, feel free to reach out to us 🙌🏼

* [Discord](https://discord.gg/py9P639jGz)
* [GitHub](https://github.com/video-db)
* [Email](mailto:ashu@videodb.io)