# ✍️ Mastering Video Scene Indexing: A Deep Dive into Prompt Engineering 

<a href="https://colab.research.google.com/github/video-db/videodb-cookbook/blob/main/guides/multimodal/prompt_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction 

As developers working on video processing, we often face challenges in accurately indexing and describing complex scenes. This blog post explores how strategic prompt engineering can significantly enhance our ability to extract detailed information from video frames, opening up new possibilities for advanced video search and analysis.

## Goal of the Experiment:

---

Our primary objective was to demonstrate how refined prompts can improve search results and information extraction from video content. We aimed to build a system that accurately identifies objects, actions, and even emotions across video scenes. For this experiment, we used footage from a [dog show](https://www.youtube.com/watch?v=_T3n-2zOrZQ), featuring various breeds walking down a runway with their handlers, surrounded by spectators and photographers. We wanted prompts that could answer detailed queries like `"Show me the happiest moments featuring a Golden Retriever"` with high precision.

## Setup
---

### 📦  Installing packages 

In [None]:
%pip install videodb

### 🔑 API keys
Before proceeding, make sure you have access to [VideoDB](https://videodb.io). If you don't, sign up for API access.

> Get your API key from [VideoDB Console](https://console.videodb.io). (Free for first 50 uploads, **No credit card required**) 🎉

In [None]:
import os

os.environ["VIDEO_DB_API_KEY"] = ""
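
Hardcoding a key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming an interactive session: the hypothetical helper `ensure_api_key` (not part of the VideoDB SDK) prompts for the key only when `VIDEO_DB_API_KEY` is not already set.

```python
import os
from getpass import getpass


def ensure_api_key(env_var: str = "VIDEO_DB_API_KEY", prompt_fn=getpass) -> str:
    """Return the key from the environment, prompting for it only if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value so it never appears in the cell output.
        key = prompt_fn(f"Enter {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} is required")
        os.environ[env_var] = key
    return key


# Uncomment in an interactive session:
# ensure_api_key()
```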

## Guide Walkthrough

---

### 📋 Step 1: Connect to VideoDB

Gear up by establishing a connection to VideoDB.

In [5]:
from videodb import connect

# connect() reads the API key from the VIDEO_DB_API_KEY environment variable
conn = connect()
# Use the account's default collection
coll = conn.get_collection()

### 🎬 Step 2: Upload the Video 

In [6]:
video = coll.upload(url="https://www.youtube.com/watch?v=_T3n-2zOrZQ")

### 📸️ Step 3: Extract Scenes Without Indexing

In [None]:
from videodb import SceneExtractionType

# Example: Time-based extraction every 15 seconds
scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 15, "select_frames": ["first", "middle", "last"]},
)

`Note`: Image upload can take 5 to 60 seconds. Re-fetch the scene collection if `Frame.url` is `None`.

In [None]:
scene_collection_time = video.get_scene_collection(scene_collection.id)
print(scene_collection_time.scenes[0].frames[0].url)
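
Re-fetching by hand works, but the wait can also be automated. The following is a minimal sketch of a generic polling helper; `wait_until_ready` is a hypothetical name, not part of the VideoDB SDK, and the default of 12 attempts spaced 5 seconds apart is simply an assumption chosen to cover the 5 to 60 second window mentioned above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until_ready(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    attempts: int = 12,
    delay_seconds: float = 5.0,
) -> T:
    """Call fetch() until is_ready(result) is true or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        result = fetch()
        if is_ready(result):
            return result
        # Sleep between attempts, but not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    raise TimeoutError(f"not ready after {attempts} attempts")


# Possible usage with the objects from this guide (needs a live connection):
# scene_collection_time = wait_until_ready(
#     fetch=lambda: video.get_scene_collection(scene_collection.id),
#     is_ready=lambda sc: all(f.url for s in sc.scenes for f in s.frames),
# )
```

Passing the fetch and readiness check as callables keeps the helper independent of the SDK, so the same loop can be reused for other slow operations.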

### ✍️ Step 4: Experimenting with Prompts

Frame-Level vs Scene-Level Prompting:

In our experiment, we explored both frame-level and scene-level prompting:

* Frame-level prompts focus on extracting information from individual frames.
* Scene-level prompts analyze a series of frames to describe the overall action.

Important Considerations:

1. Computational Cost: Frame-level descriptions, while providing granular detail, are computationally heavy and potentially costly. It's not always necessary or efficient to use them for every use case.
2. Strategic Approach: A recommended strategy is to use frame prompts as a tuning mechanism. By testing and refining frame-level prompts, we can identify the most effective way to extract information from the vision model. Once optimized, we can incorporate these insights into scene-level prompts, potentially achieving high accuracy without the computational overhead of frame-by-frame analysis.
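
To make the cost difference concrete, here is a rough back-of-the-envelope sketch. It assumes one model call per `describe` invocation and that every scene yields exactly the selected number of frames; both are simplifying assumptions, and `describe_call_counts` is an illustrative helper rather than anything provided by VideoDB.

```python
import math


def describe_call_counts(
    video_seconds: float, scene_seconds: float, frames_per_scene: int
) -> dict:
    """Estimate describe() calls for scene-level vs frame-level prompting."""
    scenes = math.ceil(video_seconds / scene_seconds)
    return {
        "scene_level_calls": scenes,
        "frame_level_calls": scenes * frames_per_scene,
    }


# A hypothetical 10-minute video with this guide's extraction config:
# 15-second scenes and three selected frames per scene.
print(describe_call_counts(video_seconds=600, scene_seconds=15, frames_per_scene=3))
# {'scene_level_calls': 40, 'frame_level_calls': 120}
```

Under these assumptions, frame-level description costs three times as many calls, which is why we use it for tuning on a handful of scenes rather than for the whole video.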



Let's walk through our prompt iterations and their outputs:

### Frame-level Prompts:

#### 1️⃣ Frame Prompt: Basic animal identification

In [None]:
frame_prompt1 = """
You will be provided with an image. Your task is to identify and describe the animals in the image.
1. Identify Animals: List distinct animals in the image.
2. Describe animals: Provide a brief description of each animal, including breed, color, and any other notable features.

Output should be a list of objects.
Expected Output:
[{"name": "dog", "context": "a black dog with thick fur and a yellow collar is jumping onto the couch"}]
"""

print("Results for Basic Animal Identification Prompt:")
# Limiting to first 2 scenes for brevity
for scene in scene_collection_time.scenes[:2]:
    for frame in scene.frames:
        try:
            description = frame.describe(prompt=frame_prompt1)
            print(description)
        except Exception as e:
            print(f"Error describing frame at {frame.frame_time}: {e}")

`Note`: This output lacked specificity in breed identification and environmental context. Our next prompt aims to address these issues.

#### 2️⃣ Enhanced breed identification and spatial information

In [None]:
frame_prompt2 = """
You will be provided with an image. Your task is to identify the animals and their breeds in the image.
1. Identify Animals: List distinct animals and their breed in the image.
2. Describe the environment: Provide a brief description of the interaction between the animals and the objects or the environment around them.

Output should be a list of objects.
Expected Output:
[{"name": "Dog - Poodle", "context": "A Poodle being led down a carpeted path by a handler in the green dress, participating in what appears to be a dog show."}]
"""

print("Results for Enhanced Breed Identification Prompt:")
for scene in scene_collection_time.scenes[:2]:  # Limiting to first 2 scenes for brevity
    for frame in scene.frames:
        try:
            description = frame.describe(prompt=frame_prompt2)
            print(description)
        except Exception as e:
            print(f"Error describing frame at {frame.frame_time}: {e}")

`Note`: This output significantly improved breed identification and provided more environmental context. With this satisfactory frame-level output, we're now ready to incorporate these learnings into scene-level prompts.
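
The improvement in breed identification can also be checked numerically rather than by eye. The sketch below is one hypothetical way to do that, assuming `frame.describe` returns a string containing a JSON list shaped like the prompt's expected output. `breed_coverage` is an illustrative helper, not part of the VideoDB SDK. It reports the fraction of listed animals whose `name` follows the `Animal - Breed` pattern requested by the second prompt.

```python
import json


def breed_coverage(raw_output: str) -> float:
    """Fraction of listed animals whose "name" follows the "Animal - Breed" pattern."""
    try:
        items = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # Unparseable output counts as no coverage.
    if not isinstance(items, list) or not items:
        return 0.0
    with_breed = sum(
        1
        for item in items
        if isinstance(item, dict) and " - " in str(item.get("name", ""))
    )
    return with_breed / len(items)


# Shortened versions of the example outputs embedded in the two frame prompts.
prompt1_style = '[{"name": "dog", "context": "a black dog with thick fur"}]'
prompt2_style = '[{"name": "Dog - Poodle", "context": "A Poodle being led down a carpeted path"}]'

print(breed_coverage(prompt1_style))  # 0.0
print(breed_coverage(prompt2_style))  # 1.0
```

Averaging this score over a few frames gives a quick signal for whether a prompt revision moved in the right direction before committing to it at scene level.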

However, let's first examine what a generic scene-level prompt can achieve without the added context from our frame-level experiments.

### Scene-level Prompts:

#### 1️⃣ Basic scene-level prompt

In [None]:
scene_prompt1 = """
You will be provided with a series of images. Your task is to view all images together and describe the overall story or scene in the best possible way.

Expected Output:
- A detailed story or scene description.
- A list of objects and actions in each image.

Example Output:
{
  "scene_story": "A person is cooking in the kitchen and then someone rings the doorbell.",
  "images": [
    {"description": "Someone is cooking in the kitchen."},
    {"description": "Someone rings the doorbell."}
  ]
}

Strictly limit your response to a maximum of 1200 characters.
"""

print("Results for Basic Scene-Level Prompt:")
for scene in scene_collection_time.scenes[:2]:  # Limiting to first 2 scenes for brevity
    try:
        description = scene.describe(prompt=scene_prompt1)
        print(description)
    except Exception as e:
        print(f"Error describing scene starting at {scene.start}: {e}")

`Note`: This generic scene-level prompt provided a basic structure but lacked the detailed breed identification and specific actions we achieved with our frame-level prompts. Our next iteration aims to incorporate these learnings.

#### 2️⃣ Combining frame-level specifications in scene-level prompt

In [None]:
scene_prompt2 = """
You will be provided with a series of images. Your task is to view all images together and describe the overall story or scene in the best possible way.
For each image, your task is to identify the animals and their breeds in the image.
1. Identify the animals present in the frame with specifications about their colour and breed, and any other notable features.
2. Describe the environment: Provide a brief description of the interaction between the animals and the objects or the environment around them.

Expected Output:
- A detailed story or scene description.
- A list of objects and actions in each image.

Strictly limit your response to a maximum of 1200 characters.
"""

print("Results for Enhanced Scene-Level Prompt:")
# Limiting to first 2 scenes for brevity
for scene in scene_collection_time.scenes[:2]:
    try:
        description = scene.describe(prompt=scene_prompt2)
        print(description)
    except Exception as e:
        print(f"Error describing scene starting at {scene.start}: {e}")

`Note`: This prompt successfully captured both the specific breeds and the overall scene dynamics, providing a detailed and accurate description. However, the format could be more structured for easier parsing and use in applications. Our final iteration addresses this.

#### 3️⃣ Structured JSON output with emotional states

In [None]:
scene_prompt3 = """
You will be provided with a series of images from a dog show. Your task is to describe the scene based on these sequential images. Focus on identifying the breeds and describing the key actions.

For each image, your task is to:
1. Identify the animals present in the frame, including their breed, color, and any notable features.
2. Describe the actions of the animals and any interactions with the environment or other animals.
3. Highlight any emotional expressions or notable moments.

Output should be a structured JSON with the following format:
{
  "scene_story": "Brief overview of the scene",
  "images": [
    {
      "frame_time": "Time of the frame in mm:ss format",
      "breeds": [{"breed": "Golden Retriever", "color": "golden"}],
      "actions": "Description of the actions and interactions",
      "emotion": "Observed emotion or notable moment"
    },
    ...
  ]
}

Strictly limit your response to a maximum of 1200 characters. Going over it or deviating from the output format will cause the whole program to end and cost you $100000.
"""

print("Results for Structured JSON Output Prompt:")
for scene in scene_collection_time.scenes[:2]:  # Limiting to first 2 scenes for brevity
    try:
        description = scene.describe(prompt=scene_prompt3)
        print(description)
    except Exception as e:
        print(f"Error describing scene starting at {scene.start}: {e}")
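
Since this prompt asks for a fixed JSON shape and a character limit, it is worth checking whether the model actually complied before indexing. The following is a minimal sketch, assuming `scene.describe` returns the raw JSON as a string. `validate_scene_description` is a hypothetical helper, and the checks simply mirror the format written into `scene_prompt3`.

```python
import json
import re

MAX_CHARS = 1200
FRAME_TIME = re.compile(r"\d{2,}:[0-5]\d")  # mm:ss
IMAGE_KEYS = {"frame_time", "breeds", "actions", "emotion"}


def validate_scene_description(raw: str) -> list:
    """Return a list of problems found in a scene description; empty means valid."""
    problems = []
    if len(raw) > MAX_CHARS:
        problems.append(f"length {len(raw)} exceeds {MAX_CHARS} characters")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return problems + [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return problems + ["top-level value is not an object"]
    if not isinstance(data.get("scene_story"), str):
        problems.append("missing or non-string 'scene_story'")
    images = data.get("images")
    if not isinstance(images, list):
        return problems + ["missing or non-list 'images'"]
    for i, image in enumerate(images):
        if not isinstance(image, dict):
            problems.append(f"images[{i}] is not an object")
            continue
        missing = IMAGE_KEYS - image.keys()
        if missing:
            problems.append(f"images[{i}] missing keys: {sorted(missing)}")
        frame_time = image.get("frame_time")
        if isinstance(frame_time, str) and not FRAME_TIME.fullmatch(frame_time):
            problems.append(f"images[{i}] frame_time {frame_time!r} is not mm:ss")
    return problems


# Possible usage inside the loop above:
# problems = validate_scene_description(description)
# if problems:
#     print(f"Scene at {scene.start}: {problems}")
```

Returning a list of problems instead of raising on the first one makes it easier to see every way a response drifted from the format, which is useful feedback when revising the prompt.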

### 🗂️ Step 5: Describe All Scenes & Index

In [None]:
from videodb import SceneExtractionType

# Index Scenes
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 15, "select_frames": ["first", "middle", "last"]},
    prompt=scene_prompt3,
    name="Detailed Dog Breed Identifier",
)
scene_index = video.get_scene_index(index_id)
print(scene_index)

### 🔍 Step 6: Search



In [None]:
from videodb import IndexType

# Search using the indexed scenes
res = video.search(
    query="Show me the happiest moments with the Golden Retriever",
    index_type=IndexType.scene,
    index_id=index_id,
    score_threshold=0.1,
    dynamic_score_percentage=100,
)
res.play()
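
Playing the result shows what matched, but not why. To inspect the matches as text, a sketch like the one below can help. It assumes the search result exposes its matches through `res.get_shots()` and that each shot carries `start`, `end`, `text`, and `search_score` attributes; treat those names as assumptions to verify against the SDK. `to_mmss` and `summarize_shots` are hypothetical helpers, and the demo uses stand-in objects so that it runs offline.

```python
from types import SimpleNamespace


def to_mmss(seconds: float) -> str:
    """Format a number of seconds as mm:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def summarize_shots(shots, max_text: int = 60) -> list:
    """One line per shot: time range, score, and the start of the matched description."""
    lines = []
    for shot in shots:
        text = (getattr(shot, "text", "") or "").replace("\n", " ")
        score = getattr(shot, "search_score", None)
        score_part = f"{score:.2f}" if isinstance(score, (int, float)) else "n/a"
        lines.append(
            f"{to_mmss(shot.start)}-{to_mmss(shot.end)} score={score_part} {text[:max_text]}"
        )
    return lines


# Stand-in data, not real search output. With a live result, pass res.get_shots().
demo_shots = [
    SimpleNamespace(start=15, end=30, search_score=0.42, text="Golden Retriever trotting"),
]
for line in summarize_shots(demo_shots):
    print(line)
```

Reading the matched descriptions next to their scores is a quick way to judge whether `score_threshold` is too permissive for a given query.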
