# 🗒️ Lecture and Meeting Videos into Concise Notes with VideoDB

<a href="https://colab.research.google.com/github/video-db/videodb-cookbook/blob/main/examples/lecture_notes_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Taking notes during lectures or meetings can be tough, especially when trying to capture both what's on the screen and what's being said. To make it easier, we've developed a tool with VideoDB that combines different modalities into searchable notes. Now, you can quickly find and share key moments from any session, whether it's a specific slide or a crucial point in the discussion.

## Setup
---

### 📦  Installing packages

Before diving in, make sure you have the necessary tools installed:

* VideoDB: For video indexing and search.
* OpenAI: To generate text summaries.
* Markdown2: For converting markdown text into HTML.

To get started, install these packages:

In [None]:
%pip install videodb openai markdown2

## 🔑 API keys
Before proceeding, make sure you have API keys for [VideoDB](https://videodb.io) and [OpenAI](https://openai.com). If you don't, sign up for API access on the respective platforms.

> Get your API key from [VideoDB Console](https://console.videodb.io). (Free for first 50 uploads, **no credit card required**) 🎉

In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["VIDEO_DB_API_KEY"] = ""
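
Leaving either key blank usually shows up later as an authentication error. A small pre-flight check can catch it earlier. The sketch below uses a hypothetical helper, `missing_keys`, which is not part of VideoDB or OpenAI; it only inspects a mapping such as `os.environ`.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "VIDEO_DB_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Example with a stand-in mapping instead of the real environment
example_env = {"OPENAI_API_KEY": "sk-example", "VIDEO_DB_API_KEY": ""}
print(missing_keys(example_env))  # ['VIDEO_DB_API_KEY']
```

Calling `missing_keys()` with no arguments checks the real environment, so an empty list means both keys are in place.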

---

## 📋 Step 1: Connect to VideoDB

In [None]:
from videodb import connect

conn = connect()
coll = conn.get_collection()

## 🎬 Step 2: Upload the Video

Next, let's upload our sample video:

You can use any public URL, YouTube link, or local file on your system.

This approach works particularly well with:

* University lectures that include PowerPoint slides and detailed discussions.
* Conference presentations where speakers use slides to illustrate their points.
* Team meetings that involve screen sharing and in-depth explanations.

In this tutorial, we're using an [explainer video](https://www.youtube.com/watch?v=QJNwK2uJyGs) focusing on “Introduction to Arrays”. This example is perfect for demonstrating how to capture and summarize key points from both visual aids and spoken explanations in an educational setting.

In [None]:
# Upload a video by url
video = coll.upload(url="https://www.youtube.com/watch?v=QJNwK2uJyGs")
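
To switch between a URL and a local file without editing the upload call, one option is to build the keyword argument from the source string. This is a minimal sketch: `upload_kwargs` is a hypothetical helper, and it assumes `coll.upload` accepts a `file_path` argument for local files in addition to the `url` argument shown above, so check the SDK version you have installed.

```python
from pathlib import Path


def upload_kwargs(source):
    """Map a source string to a keyword argument for coll.upload().

    Assumption: upload() takes `url` for remote media and `file_path`
    for local files.
    """
    if source.startswith(("http://", "https://")):
        return {"url": source}
    return {"file_path": str(Path(source).expanduser())}


print(upload_kwargs("https://www.youtube.com/watch?v=QJNwK2uJyGs"))
print(upload_kwargs("lectures/arrays.mp4"))  # hypothetical local path

# With a live connection: video = coll.upload(**upload_kwargs(source))
```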

## 📸🗣️ Step 3: Index the Video on Different Modalities

Now comes the exciting part - we're going to index our video in two ways:

1. Indexing spoken content (what's being said in the video)
2. Indexing visual content (what's being shown in the video)

#### 🗣️ Indexing Spoken Content

In [None]:
# Index spoken content

video.index_spoken_words()

100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:47<00:00,  2.12it/s]


#### 👁️ Indexing Visual Content

To learn more about Scene Index, explore the following guides:

* [Quickstart Guide](https://github.com/video-db/videodb-cookbook/blob/main/quickstart/Scene%20Index%20QuickStart.ipynb) provides a step-by-step introduction to Scene Index. It's ideal for getting started quickly and understanding the primary functions.
* [Scene Extraction Options Guide](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/playground_scene_extraction.ipynb) delves deeper into the various options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction based on different needs and preferences.

**1. Finding the Right Configuration for Scene Extraction**

In this case, the quality of our notes depends on identifying and extracting every slide in the video. We use shot-based extraction to detect transitions between slides, so that each slide becomes its own scene. This involves some testing: start with a higher threshold to see the initial results, then lower it until all important slides are captured.

Let’s dive in and see how it works:

In [None]:
from videodb import SceneExtractionType
from IPython.display import Image, display
import requests

# Helper function that will help us view the Scene Collection Images
def display_scenes(scenes, images=True):
    for scene in scenes:
        print(f"{scene.id} : {scene.start}-{scene.end}")
        if images:
            for frame in scene.frames:
                # Download the frame image and render it inline
                im = Image(requests.get(frame.url, stream=True).content)
                display(im)
        print("----")


scene_collection_default = video.extract_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={"threshold": 25, "frame_count": 1}
)
display_scenes(scene_collection_default.scenes)

In this example, we begin with a higher threshold, but this may cause some important slides to be overlooked. To make sure we capture every slide, we'll adjust the threshold to a lower setting and re-run the extraction.

In [None]:
from videodb import SceneExtractionType

# Extract scenes using shot-based extraction
scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={"threshold": 10, "frame_count": 1}
)
display_scenes(scene_collection.scenes)
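
Besides eyeballing the frames, it can help to compare the two runs numerically. A lower scene count with one unusually long scene suggests that several slides were merged. The sketch below uses a hypothetical helper, `extraction_report`, and a stand-in `Scene` tuple with made-up timings; it relies only on the `start` and `end` attributes that `display_scenes` already reads.

```python
from collections import namedtuple

# Stand-in for extracted scene objects; only start and end (seconds) are used
Scene = namedtuple("Scene", ["start", "end"])


def extraction_report(scenes):
    """Return the scene count and the longest scene duration in seconds."""
    durations = [scene.end - scene.start for scene in scenes]
    return {"count": len(durations), "longest": max(durations, default=0.0)}


# Made-up timings for illustration, not output from the tutorial video
high_threshold = [Scene(0.0, 95.0), Scene(95.0, 140.0)]
low_threshold = [Scene(0.0, 30.0), Scene(30.0, 95.0), Scene(95.0, 140.0)]

print(extraction_report(high_threshold))  # {'count': 2, 'longest': 95.0}
print(extraction_report(low_threshold))   # {'count': 3, 'longest': 65.0}
```

With real data, the same function could be called on `scene_collection_default.scenes` and `scene_collection.scenes`.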

Now that we have found the right configuration for Scene Indexing, it's like we've found the perfect match—let's commit to indexing those scenes ✨!

**2. Indexing the Scenes for Easy Search**

You can use the default prompt or customize it to fit your needs. Whether you want general notes or something more specific, tailoring the prompt lets you get exactly the information you're looking for.

In [None]:
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={"threshold": 10, "frame_count": 1},
    prompt="Imagine you are a student studying for an exam. You are watching a given video lecture slide. Take notes from this slide and also think through if you can gather more information around the topic mentioned in the slide."
)
scene_index = video.get_scene_index(index_id)

## 📋 Step 4: Process the Video and Summarize Each Scene

In this step, we'll combine the scene descriptions (equivalent to slide descriptions) and transcripts from each scene's duration, then ask the LLM to generate a comprehensive note. This approach allows our notes to integrate information from multiple modalities, including spoken words and visual content.

![](https://raw.githubusercontent.com/video-db/videodb-cookbook-assets/main/images/guides/lecture_notes_1.png)
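
The time-window matching used in the next cell is easiest to see on a toy transcript. The sketch below repeats the `extract_transcript_for_scene` function with made-up word timings. Because the window is half-open (`start <= t < end`), a word that starts exactly on a scene boundary is assigned to the later scene, so no word is counted twice.

```python
def extract_transcript_for_scene(transcript, start_time, end_time):
    """Join the text of all items whose start falls in [start_time, end_time)."""
    return " ".join(
        item["text"] for item in transcript if start_time <= item["start"] < end_time
    )


# Made-up word timings in seconds, for illustration only
sample_transcript = [
    {"start": 0.0, "text": "Arrays"},
    {"start": 4.5, "text": "store"},
    {"start": 10.0, "text": "Each"},
    {"start": 12.0, "text": "element"},
]

print(extract_transcript_for_scene(sample_transcript, 0.0, 10.0))   # Arrays store
print(extract_transcript_for_scene(sample_transcript, 10.0, 20.0))  # Each element
```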

In [None]:
import openai

summarize_prompt = """
Scene Description: {scene_description}
Scene Transcript: {scene_transcript}

You are an AI tasked with creating concise summary notes for scenes from a video, integrating both visual and spoken content. For each scene, you will receive a description of the visual content (e.g., slides, images, on-screen text) and a transcript of the spoken content. Your goal is to synthesize this information into a single, coherent summary that captures the main points discussed by the narrator at the time of each specific slide.

Task:
Identify key points from the spoken content (transcript) and determine their relevance.
Integrate these key points with the most relevant visual elements (e.g., slides, images) shown during the scene.
Focus on providing a cohesive summary that reflects the main discussion and key concepts presented.
Constraints:
It is not necessary to describe each visual element separately.
Only include details that contribute to a clear understanding of the main topic discussed.
The goal is to create a concise overall summary that encapsulates the essence of the explanation given during the scene.
The summary should not exceed 200 characters.
Output Format:
Provide the summary in Markdown format under the heading "### Scene Summary." The summary should seamlessly blend the spoken content and visual context, presenting a unified description of the main ideas conveyed during the scene.
Example:
Visual Content Description:

"A slide titled 'Introduction to Quantum Computing' with a diagram illustrating qubits and superposition."
Spoken Content Transcript:

"In this slide, we discuss the basic concepts of quantum computing, focusing on the principle of superposition. Superposition allows qubits to exist in multiple states simultaneously, unlike classical bits."
Example Output:
Scene Summary: Introduction to Quantum Computing
This scene introduces the fundamental concepts of quantum computing, focusing on the principle of superposition. The slide features a diagram showing qubits in multiple states, contrasting with classical bits."""


def extract_transcript_for_scene(transcript, start_time, end_time):
    return " ".join(
        [item["text"] for item in transcript if start_time <= item["start"] < end_time]
    )

def summarize_scene(scene_description, scene_transcript):
    client = openai.Client()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant tasked with creating concise summary notes for scenes from a video, integrating both visual and spoken content. For each scene, you will receive a description of the visual content and a transcript of the spoken content. Your goal is to synthesize this information into a coherent summary that highlights the main points discussed while presenting the scene.",
            },
            {
                "role": "user",
                "content": summarize_prompt.format(
                    scene_description=scene_description,
                    scene_transcript=scene_transcript,
                ),
            },
        ],
    )

    return response.choices[0].message.content

def process_video(transcript, scenes):
    summarized_scenes = []

    for scene in scenes:
        scene_transcript = extract_transcript_for_scene(
            transcript, scene["start"], scene["end"]
        )
        summary = summarize_scene(scene["description"], scene_transcript)

        summarized_scenes.append(
            {
                "start": scene["start"],
                "end": scene["end"],
                "image_url": scene["image_url"],
                "summary": summary,
            }
        )

    return summarized_scenes

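# Pair each scene-index record with its extracted scene by position.
# This assumes both lists are in the same order with the same length.
scenes = [
    {
        "start": scene.start,
        "end": scene.end,
        "image_url": scene.frames[0].url,
        "description": record["description"],
    }
    for record, scene in zip(scene_index, scene_collection.scenes)
]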
transcript = video.get_transcript()
summary = process_video(transcript, scenes)
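
The cell above pairs index records with scenes by position using `zip`, which silently truncates if the two lists differ in length or order. A more defensive option, sketched below, is to match on start time instead. This assumes each scene-index record carries a `start` key alongside `description`; `pair_by_start`, the `Scene` stand-in, and the sample values are hypothetical.

```python
from collections import namedtuple

# Stand-in for extracted scene objects; only start and end (seconds) are used
Scene = namedtuple("Scene", ["start", "end"])


def pair_by_start(index_records, scenes, tolerance=0.05):
    """Pair each scene with the index record whose start time is closest.

    Raises ValueError if no record starts within `tolerance` seconds.
    """
    pairs = []
    for scene in scenes:
        record = min(index_records, key=lambda r: abs(r["start"] - scene.start))
        if abs(record["start"] - scene.start) > tolerance:
            raise ValueError(f"No index record near {scene.start}s")
        pairs.append((scene, record))
    return pairs


# Made-up data: the records arrive in a different order than the scenes
records = [
    {"start": 30.0, "end": 95.0, "description": "Slide 2"},
    {"start": 0.0, "end": 30.0, "description": "Slide 1"},
]
demo_scenes = [Scene(0.0, 30.0), Scene(30.0, 95.0)]

pairs = pair_by_start(records, demo_scenes)
print([record["description"] for _, record in pairs])  # ['Slide 1', 'Slide 2']
```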

## Step 5: Display Summaries

Here we display summaries of video scenes using HTML. The summaries include images, scene timing, and text descriptions formatted with Markdown. The `create_summary_html` function generates the HTML structure, styling each summary as a card with an image and formatted text. The `display_summaries` function renders the generated HTML in an interactive, scrollable format, making it easy to visualize and review multiple scenes at once.



In [None]:

from IPython.display import HTML, display
import markdown2

def create_summary_html(summaries):
    html_content = """
    <style>
        .summary-container {
            display: flex;
            overflow-x: auto;
            padding: 20px 0;
            scroll-snap-type: x mandatory;
            background-color: #f0f0f0;
        }
        .summary-item {
            flex: 0 0 auto;
            width: 300px;
            margin-right: 20px;
            border: 1px solid #ddd;
            padding: 15px;
            scroll-snap-align: start;
            background-color: white;
            box-shadow: 0 2px 5px rgba(0,0,0,0.1);
        }
        .summary-item img {
            width: 100%;
            height: 180px;
            object-fit: cover;
        }
        .summary-item h3 {
            margin-top: 10px;
            color: #333;
        }
        .markdown-body {
            font-family: Arial, sans-serif;
            line-height: 1.6;
            color: #333;
        }
        .markdown-body p {
            margin-bottom: 10px;
        }
    </style>
    <div class="summary-container">
    """

    for i, summary in enumerate(summaries, 1):
        markdown_summary = markdown2.markdown(summary['summary'])
        html_content += f"""
        <div class="summary-item">
            <img src="{summary['image_url']}" alt="Scene {i}" onerror="this.onerror=null;this.src='https://via.placeholder.com/300x180.png?text=Image+Not+Available';">
            <h3>Scene {i} ({summary['start']:.2f}s - {summary['end']:.2f}s)</h3>
            <div class="markdown-body">{markdown_summary}</div>
        </div>
        """

    html_content += "</div>"
    return html_content

def display_summaries(summaries):
    html = create_summary_html(summaries)
    display(HTML(html))

display_summaries(summary)
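
To share the notes outside the notebook, the HTML fragment returned by `create_summary_html` could be written to a standalone file. The sketch below shows one way to do that with the standard library only; `save_notes_html` is a hypothetical helper, and the demo fragment stands in for the real output.

```python
import html
import tempfile
from pathlib import Path


def save_notes_html(fragment, path, title="Lecture notes"):
    """Wrap an HTML fragment in a minimal page and write it to disk."""
    page = (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>{fragment}</body></html>\n"
    )
    out = Path(path)
    out.write_text(page, encoding="utf-8")
    return out


# Demo with a placeholder fragment written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    saved = save_notes_html("<div>demo</div>", Path(tmp) / "notes.html")
    print(saved.name)  # notes.html
```

In the notebook, the call would look like `save_notes_html(create_summary_html(summary), "notes.html")`.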

## Conclusion
---

In this tutorial, we combined spoken-word indexing with shot-based scene indexing to extract key points from lecture and meeting videos. Using VideoDB's multimodal indexing, the resulting workflow produces notes that capture both the discussion and the slides shown alongside it.

This approach is highly adaptable and can be applied to various scenarios where summarization is essential, such as:

* Summarizing Board Meetings: Capture key decisions by linking spoken discussions with slides or whiteboard content.
* Creating Training Notes: Merge screen recordings with spoken instructions to develop comprehensive step-by-step summaries.
* Documenting Q&A Sessions: Summarize questions and answers by indexing on-screen content with verbal responses for easy reference.

### Further Resources
To learn more about Scene Index, explore the following guides:

- [Quickstart Guide](https://github.com/video-db/videodb-cookbook/blob/main/quickstart/Scene%20Index%20QuickStart.ipynb)
- [Scene Extraction Options](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/playground_scene_extraction.ipynb)
- [Advanced Visual Search](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/advanced_visual_search.ipynb)
- [Custom Annotation Pipelines](https://github.com/video-db/videodb-cookbook/blob/main/guides/scene-index/custom_annotations.ipynb)


### 👨‍👩‍👧‍👦 Support & Community

If you have any questions or feedback, feel free to reach out to us 🙌🏼

* [Discord](https://discord.gg/py9P639jGz)
* [GitHub](https://github.com/video-db)
* [Email](mailto:ashu@videodb.io)
