# 📸🗣️ Multimodal Quickstart 

<a href="https://colab.research.google.com/github/video-db/videodb-cookbook/blob/main/quickstart/Multimodal_Quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction 

This tutorial demonstrates how to perform multimodal search on videos using VideoDB and OpenAI. We'll accomplish this by searching both spoken and scene indexes and finding their intersection points.

In this tutorial, we'll:

1. Upload and index a sample educational video about the solar system
2. Use OpenAI to divide a query into spoken and visual components
3. Perform searches in both spoken word and scene indexes
4. Find intersecting segments between the two search results
5. Visualize the final combined results

## Setup
---

### 📦  Installing packages 

In [None]:
%pip install openai
%pip install videodb

### 🔑 API keys
Before proceeding, make sure you have access to [VideoDB](https://videodb.io) and [OpenAI](https://openai.com). If you don't, sign up for API access on the respective platforms.

> Get your API key from [VideoDB Console](https://console.videodb.io). (Free for the first 50 uploads, **no credit card required**) 🎉

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["VIDEO_DB_API_KEY"] = ""
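An empty key usually only shows up later as an authentication error. As a minimal sketch, assuming you'd rather fail early, the check below lists any key that is unset or blank. `missing_keys` is a hypothetical helper written for this example, not part of either SDK.

```python
import os

# The two environment variables this notebook sets above
REQUIRED_KEYS = ("OPENAI_API_KEY", "VIDEO_DB_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_keys()
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching your real environment.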

## Building Multimodal Search

---

### 📋 Step 1: Connect to VideoDB

Start by establishing a connection to VideoDB.

In [None]:
from videodb import connect

# Connect to VideoDB; connect() picks up the VIDEO_DB_API_KEY environment variable set above
conn = connect()
coll = conn.get_collection()

### 🎬 Step 2: Upload the Video 

In [None]:
video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")

### 📸🗣️ Step 3: Index the Video on Different Modalities

#### 🗣️ Indexing Spoken Content

In [None]:
# Index spoken content

video.index_spoken_words()

#### 📸️ Indexing Visual Content

To learn more about Scene Index, explore the following guides:

- [Quickstart Guide](https://github.com/video-db/videodb-cookbook/blob/main/quickstart/Scene%20Index%20QuickStart.ipynb) provides a step-by-step introduction to Scene Index. It's ideal for getting started quickly and understanding the primary functions.

- [Scene Extraction Options Guide](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/playground_scene_extraction.ipynb) delves deeper into the various options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction based on different needs and preferences.





In [None]:
from videodb import SceneExtractionType

# Index scene content
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 2, "select_frames": ["first", "last"]},
    prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)

### 🧩 Step 4: Query Transformation

In [9]:
from openai import OpenAI

transformation_prompt = """
Divide the following query into two distinct parts: one for spoken content and one for visual content. The spoken content should refer to any narration, dialogue, or verbal explanations. The visual content should refer to any images, videos, or graphical representations. Format the response strictly as:\nSpoken: <spoken_query>\nVisual: <visual_query>\n\nQuery: {query}
"""

# Initialize OpenAI client
client = OpenAI()


def divide_query(query):
    # Use the OpenAI client to create a chat completion with a structured prompt
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": transformation_prompt.format(query=query)}
        ],
    )

    message = response.choices[0].message.content
    divided_query = message.strip().split("\n")
    spoken_query = divided_query[0].replace("Spoken:", "").strip()
    visual_query = divided_query[1].replace("Visual:", "").strip()

    return spoken_query, visual_query


# Test the query
query = "Find the segment where the narrator talks about the terrestrial planets and show images of Mercury, Venus, Earth, and Mars"
# Example query 2 :- Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy.

spoken_query, visual_query = divide_query(query)
print(f"Spoken Query: {spoken_query}")
print(f"Visual Query: {visual_query}")

Spoken Query: Find the segment where the narrator talks about the terrestrial planets
Visual Query: Show images of Mercury, Venus, Earth, and Mars
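`divide_query` reads the reply by line position, so a blank line or swapped order in the model's answer would put the wrong text in each variable. The sketch below, which assumes the reply still uses the `Spoken:` and `Visual:` labels requested in the prompt, matches on the labels instead. `parse_divided_query` is a hypothetical helper for illustration, and the sample reply is made up.

```python
def parse_divided_query(message):
    """Extract the spoken and visual parts from a 'Spoken: ... / Visual: ...' reply."""
    parts = {}
    for line in message.splitlines():
        # Split on the first colon only, so colons inside the query text survive
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if sep and key in ("spoken", "visual"):
            parts[key] = value.strip()
    if "spoken" not in parts or "visual" not in parts:
        raise ValueError(f"Unexpected response format: {message!r}")
    return parts["spoken"], parts["visual"]


# Made-up reply with an extra blank line between the two parts
reply = "Spoken: talks about the terrestrial planets\n\nVisual: images of Mercury, Venus, Earth, and Mars"
print(parse_divided_query(reply))
```

Raising an error on an unexpected format is one possible design choice; retrying the request is another.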


### 🔍 Step 5:  Performing Searches

🗣️ **Searching the Spoken Index**

In [10]:
from videodb import SearchType, IndexType

# Perform the search using the spoken query
spoken_results = video.search(
    query=spoken_query,
    index_type=IndexType.spoken_word,
    search_type=SearchType.semantic,
)

# View the results
spoken_results.play()

'https://console.videodb.io/player?url=https://dseetlpshk2tb.cloudfront.net/v3/published/manifests/7824e096-9a04-4bf0-8f98-1aac9723edc7.m3u8'

📸️ **Searching the Scene Index**

In [11]:
# Perform the search using the visual query
scene_results = video.search(
    query=visual_query, 
    index_type=IndexType.scene,
    search_type=SearchType.semantic
)

# View the results
scene_results.play()

'https://console.videodb.io/player?url=https://dseetlpshk2tb.cloudfront.net/v3/published/manifests/fec5712e-1e5b-419a-b86a-d1c8104d8e4b.m3u8'

### 🔀 Step 6: Combining Spoken & Scene Search results

Each search result provides a list of time ranges (start and end timestamps) that match the query in its own modality.

There are two ways to combine these search results:

- **Union**: This method takes all the timestamps from every search result, creating a comprehensive list that includes every relevant time, even if some timestamps appear in only one result.

![](https://raw.githubusercontent.com/video-db/videodb-cookbook-assets/main/images/guides/multimodal_quickstart_union.png)


- **Intersection**: This method only includes timestamps that appear in all the search results, resulting in a smaller list with times that are universally relevant across all results.

![](https://raw.githubusercontent.com/video-db/videodb-cookbook-assets/main/images/guides/multimodal_quickstart_intersection.png)
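To make the difference concrete, here is a deliberately coarse sketch that treats each result as a set of whole seconds and uses Python's built-in set operations. The whole-second granularity, the `seconds` helper, and the toy timestamps are assumptions made for this illustration only; real search results use fractional timestamps.

```python
def seconds(intervals):
    """Expand [start, end) ranges into the set of whole seconds they cover."""
    return {t for start, end in intervals for t in range(start, end)}


# Toy results, in whole seconds
spoken = [[0, 4], [8, 10]]
scene = [[2, 6]]

union = sorted(seconds(spoken) | seconds(scene))
intersection = sorted(seconds(spoken) & seconds(scene))

print("Union:", union)                # [0, 1, 2, 3, 4, 5, 8, 9]
print("Intersection:", intersection)  # [2, 3]
```

The union keeps every second that either search matched, while the intersection keeps only the seconds that both searches matched.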


Depending on which method you prefer, pass `"union"` or `"intersection"` as the `operation` argument to the `combine_results()` function below.

In [23]:
# Define helpers that combine two lists of [start, end] intervals by intersection or union
def process_shots(l1, l2, operation):
    def merge_intervals(intervals):
        if not intervals:
            return []
        intervals.sort(key=lambda x: x[0])
        merged = [intervals[0]]
        for interval in intervals[1:]:
            if interval[0] <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], interval[1])
            else:
                merged.append(interval)
        return merged

    def intersection(intervals1, intervals2):
        i, j = 0, 0
        result = []
        while i < len(intervals1) and j < len(intervals2):
            low = max(intervals1[i][0], intervals2[j][0])
            high = min(intervals1[i][1], intervals2[j][1])
            if low < high:
                result.append([low, high])
            if intervals1[i][1] < intervals2[j][1]:
                i += 1
            else:
                j += 1
        return result

    if operation.lower() == "intersection":
        return intersection(merge_intervals(l1), merge_intervals(l2))
    elif operation.lower() == "union":
        return merge_intervals(l1 + l2)
    else:
        raise ValueError("Invalid operation. Please choose 'intersection' or 'union'.")


def combine_results(spoken_results, scene_results, operation):
    spoken_timestamps = [[shot.start, shot.end] for shot in spoken_results.get_shots()]
    scene_timestamps = [[shot.start, shot.end] for shot in scene_results.get_shots()]
    result = process_shots(spoken_timestamps, scene_timestamps, operation)
    print("Spoken results: ", spoken_timestamps)
    print("Scene results: ", scene_timestamps)
    print("Combined results: ", result)
    return result


# Combine results
results = combine_results(spoken_results, scene_results, "intersection")

Spoken results:  [[37.026, 79.33]]
Scene results:  [[44.044, 46.046], [54.054, 56.056]]
Combined results:  [[44.044, 46.046], [54.054, 56.056]]
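An intersection can come back empty when the two modalities never overlap in time. As a minimal sketch, assuming you would rather show something than nothing in that case, the hypothetical helper below falls back to the union. The timestamps are toy values, not output from this video.

```python
def pick_timeline(intersection, union):
    """Prefer the intersection; fall back to the union when nothing overlaps."""
    if intersection:
        return intersection, "intersection"
    return union, "union"


# Toy timestamps (in seconds) where the two modalities never overlap
timeline, source = pick_timeline([], [[5.0, 9.0], [20.0, 24.0]])
print(source, timeline)
```

Returning the label alongside the timeline lets you tell the viewer which strategy produced the result.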


### 🪄 Step 7: View Combined Results

In [25]:
from videodb import play_stream

stream_link = video.generate_stream(results)
play_stream(stream_link)

'https://console.videodb.io/player?url=https://dseetlpshk2tb.cloudfront.net/v3/published/manifests/6f920601-41ca-4977-9767-8e5ecebdc3b9.m3u8'

## Further Steps
---

In this guide, we've demonstrated how to perform multimodal search on educational videos using VideoDB and OpenAI. 

To learn more about Scene Index, explore the following guides:

- [Quickstart Guide](https://github.com/video-db/videodb-cookbook/blob/main/quickstart/Scene%20Index%20QuickStart.ipynb) 
- [Scene Extraction Options](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/playground_scene_extraction.ipynb)
- [Advanced Visual Search](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/advanced_visual_search.ipynb)
- [Custom Annotation Pipelines](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/custom_annotations.ipynb)


If you have any questions or feedback, feel free to reach out to us 🙌🏼

* [Discord](https://discord.gg/py9P639jGz)
* [GitHub](https://github.com/video-db)
* [VideoDB](https://videodb.io)
* [Email](mailto:ashu@videodb.io)