SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

SPDX-License-Identifier: Apache-2.0

# Part 2: Deploying Metropolis VSS for Search and Summarization

## Q&A and Graph-RAG

In the last notebook, we learned how to use the [Blueprint for video search and summarization](https://build.nvidia.com/nvidia/video-search-and-summarization/blueprintcard) to summarize a video.

In this notebook, we will explore additional features of the VSS blueprint, specifically Q&A using Graph-RAG. We will demonstrate these features on warehouse videos to illustrate how VSS can be applied in real-world scenarios.

### Learning Objectives:
The goals of this notebook are to:
- Explore Q&A on videos through VSS REST APIs
- Understand Graph-RAG Components
- Visualize the knowledge graph in Neo4j

### Table of Contents

**[Set Up the Environment](#Set-Up-the-Environment)**  
**[Exploring Q&A on Videos](#Exploring-Q&A-on-Videos)**  
&nbsp;&nbsp;&nbsp;&nbsp;[Upload Video File](#Upload-Video-File)  
&nbsp;&nbsp;&nbsp;&nbsp;[Ingest and Process Video](#Ingest-and-Process-Video)  
&nbsp;&nbsp;&nbsp;&nbsp;[Ask Questions](#Ask-Questions)  

**[Understanding Graph-RAG Components](#Understanding-Graph-RAG-Components)**  
&nbsp;&nbsp;&nbsp;&nbsp;[G-Extraction/Indexing](#G-Extraction/Indexing)  
&nbsp;&nbsp;&nbsp;&nbsp;[G-Retriever](#G-Retriever)  
&nbsp;&nbsp;&nbsp;&nbsp;[G-Generation](#G-Generation)  
&nbsp;&nbsp;&nbsp;&nbsp;[Let's Try a Few More Questions](#Let's-Try-a-Few-More-Questions)  

**[Graph-RAG Visualization](#Graph-RAG-Visualization)**  
&nbsp;&nbsp;&nbsp;&nbsp;[Cypher Queries](#Cypher-Queries)  

**[Review](#Review)**

#### Q&A with VSS

VSS supports Question-Answering (Q&A) functionality via **Vector-RAG** and **Graph-RAG**. Vector-RAG is the only method supported for live stream processing, while Graph-RAG is designed specifically for video-based queries.

**Q&A with Vector-RAG:** Captions generated by the VLM, along with their embeddings, are stored in Milvus DB. Given a query, the top five most relevant chunks are retrieved, re-ranked using ```llama-3.2-nv-rerankqa-1b-v2```, and passed to an LLM to generate the final answer.
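
As a rough illustration of the retrieval step described above, the sketch below ranks caption chunks by cosine similarity to a query embedding and keeps the top `k`. It is not the VSS implementation: the in-memory list stands in for Milvus DB, the three-dimensional embeddings are made up, the re-ranking step is omitted, and the names `cosine_similarity` and `retrieve_top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, chunks, k=5):
    """Return the k captions whose embeddings are most similar to the query.

    chunks is a list of (caption, embedding) pairs.
    """
    scored = [(cosine_similarity(query_embedding, emb), caption) for caption, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]


# Toy embeddings, invented for illustration only.
toy_chunks = [
    ("0-10s: a forklift drives down the aisle", [0.9, 0.1, 0.0]),
    ("10-20s: a worker carries a box", [0.1, 0.9, 0.0]),
    ("20-30s: the loading dock is empty", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]  # pretend embedding of a forklift question

top = retrieve_top_k(toy_query, toy_chunks, k=2)
print(top)
```

In a real pipeline the retrieved chunks would then be re-ranked and handed to the LLM as context.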

**Q&A with Graph-RAG:** To capture the complex information produced by the VLM, a knowledge graph is built and stored during video ingestion. An LLM converts the dense captions into a set of nodes, edges, and associated properties, and the resulting knowledge graph is stored in a graph database. Captions and embeddings, generated with ```llama-3.2-nv-embedqa-1b-v2```, are also linked to these entities. By using Graph-RAG techniques, an LLM can access this information to extract key insights for Q&A.

<img alt="VSS CA-RAG Diagram" src="assets/VSS_CA-RAG.png" width=1000>

---
### Part 0: Set Up the Environment

We will be using the same VSS server as the previous notebook. Let's verify that it is up and running.

In [None]:
vss_url = "http://localhost:8100"

warehouse_video = "assets/warehouse.mp4"
keynote_video = "assets/keynote_clip.mp4"
traffic_video = "assets/traffic.mp4"

In [7]:
health_endpoint = vss_url + "/health/ready" #check the status of the VSS server
upload_file_endpoint = vss_url + "/files" #upload and manage files
summarize_endpoint = vss_url + "/summarize" #summarize uploaded content
qna_endpoint = vss_url + "/chat/completions" #ask questions for ingested video

The next cell installs all the necessary Python packages for this notebook.

In [None]:
import sys 
python_exe = sys.executable
!{python_exe} -m pip install -r requirements.txt 

In [9]:
#helper function to verify responses 
import json
import requests
from IPython.display import Markdown, display

def check_response(response, text=False):
    print(f"Response Code: {response.status_code}")
    if response.status_code == 200:
        print("Response Status: Success")
        if text:
            print(response.text)
            return response.text
        else:
            print(json.dumps(response.json(), indent=4))
            return response.json()
    else:
        print("Response Status: Error")
        print(response.text)
        return None 

Let's use the health endpoint to verify your VSS instance is running. **Make sure the following cell outputs "Response Code: 200" before proceeding.**

In [10]:
resp = requests.get(health_endpoint)
resp = check_response(resp, text=True)

Response Code: 200
Response Status: Success
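
If the server is still starting, the health check may need a few attempts. The sketch below is one possible way to poll until a probe reports HTTP 200. The `wait_until_ready` helper and its parameters are hypothetical and not part of VSS; the probe is passed in as a callable so that the example runs offline with a stand-in.

```python
import time


def wait_until_ready(probe, attempts=10, delay_seconds=5.0, sleep=time.sleep):
    """Call probe() until it returns 200 or the attempts run out.

    probe is any zero-argument callable that returns an HTTP status code.
    Returns True if a 200 was seen, otherwise False.
    """
    for attempt in range(1, attempts + 1):
        try:
            status = probe()
        except Exception as exc:  # e.g. connection refused while the server starts
            status = None
            print(f"Attempt {attempt}: probe failed ({exc})")
        if status == 200:
            return True
        if attempt < attempts:
            sleep(delay_seconds)
    return False


# Stand-in probe that reports 503 twice and then 200, so the sketch runs offline.
fake_statuses = iter([503, 503, 200])
ready = wait_until_ready(lambda: next(fake_statuses), attempts=5, sleep=lambda s: None)
print(ready)

# Against a live server, the probe could be:
# wait_until_ready(lambda: requests.get(health_endpoint).status_code)
```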



Next, let's save the configured VLM model so we can use it in future requests. 

In [None]:
try:
    resp = requests.get(vss_url + "/models")
    resp = check_response(resp)
    configured_vlm = resp["data"][0]["id"]
except Exception as e:
    print(f'Server not ready: {e}')

In [None]:
print(f"Configured VLM: {configured_vlm}")

---
### Part 1: Exploring Q&A on Videos

Please refer to the previous lab for exploring all REST API endpoints. Below we will use REST APIs to upload a file, start video processing with chat enabled, and then try out a few questions.

<!-- ![Warehouse Scene](images/warehouse.png) -->

<video width="1000"
       src="assets/warehouse.mp4"  
       controls>
</video>

---
#### 1.1: Upload Video File

Let's start by uploading a video and storing the file-id from the response.

In [None]:
with open(warehouse_video, "rb") as file:
    files = {"file": ("warehouse_video", file)} #provide the file content along with a file name 
    data = {"purpose":"vision", "media_type":"video"}
    response = requests.post(upload_file_endpoint, data=data, files=files) #post file upload request 
response = check_response(response)
video_id = response["id"] #save file ID for summarization request

To view all the uploaded files, send a GET request to the ```/files``` endpoint. 

In [None]:
resp = requests.get(upload_file_endpoint, params={"purpose":"vision"})
resp = check_response(resp)

---
#### 1.2 Ingest and Process Video

Next, let's process the video to generate dense captions and a knowledge graph. This step can take a couple of minutes.
- First, we'll set the prompts
- Then, we'll call the summarize API to ingest the video
- Note that we set ```enable_chat``` to True to create the knowledge graph

In [15]:
prompts = {
    "vlm_prompt": "You are a warehouse monitoring system. Describe the events in this warehouse and look for any anomalies. "
                            "Start each sentence with start and end timestamp of the event.",
    
    "caption_summarization": "You will be given captions from sequential clips of a video. Aggregate captions in the format "
                             "start_time:end_time:caption based on whether captions are related to one another or create a continuous scene.",
    
    "summary_aggregation": "Based on the available information, generate a summary that captures the important events in the video. "
                           "The summary should be organized chronologically and in logical sections. This should be a concise, "
                           "yet descriptive summary of all the important events. The format should be intuitive and easy for a "
                           "user to read and understand what happened. Format the output in Markdown so it can be displayed nicely."
}

In [None]:
summarize_payload = {
    "id": video_id,
    "prompt": prompts['vlm_prompt'],
    "caption_summarization_prompt": prompts['caption_summarization'],
    "summary_aggregation_prompt": prompts['summary_aggregation'],
    "model": configured_vlm,
    "chunk_duration": 10,
    "chunk_overlap_duration": 0,
    "summarize": True,  #processes the video and generates a summary
    "enable_chat": True  #enables knowledge graph creation
}

response = requests.post(summarize_endpoint, json=summarize_payload)
response = check_response(response)
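
The request above sets `chunk_duration` to 10 and `chunk_overlap_duration` to 0. As a rough illustration of what these two parameters mean, the sketch below computes chunk windows under the assumption that both values are in seconds and that each chunk starts `chunk_duration - chunk_overlap_duration` seconds after the previous one. The `chunk_windows` function and the 35-second video length are hypothetical; the actual splitting logic inside VSS may differ.

```python
def chunk_windows(video_duration, chunk_duration, chunk_overlap_duration=0):
    """Return (start, end) windows, in seconds, that cover the whole video."""
    if chunk_duration <= 0:
        raise ValueError("chunk_duration must be positive")
    if not 0 <= chunk_overlap_duration < chunk_duration:
        raise ValueError("overlap must be >= 0 and smaller than chunk_duration")

    step = chunk_duration - chunk_overlap_duration
    windows = []
    start = 0
    while start < video_duration:
        end = min(start + chunk_duration, video_duration)
        windows.append((start, end))
        if end >= video_duration:
            break  # the last window reached the end of the video
        start += step
    return windows


no_overlap = chunk_windows(35, 10, 0)
with_overlap = chunk_windows(35, 10, 2)
print(no_overlap)
print(with_overlap)
```

With no overlap the windows are back to back; with an overlap, consecutive windows share a few seconds, at the cost of more chunks to process.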

---
#### 1.3: Ask Questions

Once the video is processed, the ```/chat/completions``` endpoint can be called to ask a question.

<img alt="Q&A endpoint" src="assets/qna_swagger.png" width=1000>

In [17]:
#helper function to ask question for a specific video
def qna(query, video_id=video_id):
    print(video_id)

    payload = {
        "id": video_id,
        "messages": [{"content": query, "role": "user"}],
        "model": configured_vlm
    }

    try:
        response = requests.post(qna_endpoint, json=payload)
        if response.status_code == 200:
            response_data = response.json()
            # Extract the answer content; guard against a missing or empty "choices" list
            choices = response_data.get("choices") or [{}]
            answer = choices[0].get("message", {}).get("content", "")
            return answer if answer else "No answer received."
        else:
            return f"Failed to get a response. Status code: {response.status_code}. Response: {response.text}"
    
    except requests.RequestException as e:
        return f"An error occurred: {e}"

In [None]:
qna("Was there any forklift in the scene?")

In [None]:
qna("Was the worker carrying the box wearing PPE?")

You will be able to try more questions later in the notebook.

---
### Part 2: Understanding Graph-RAG Components

<img alt="GraphRAG Diagram" src="assets/GraphRAG.png" width=800>

---
#### Graph-Extraction/Indexing

##### Dense Captions to Graph Conversion:
The Graph Extractor uses an LLM to analyze dense captions or any text input and identify key entities, actions, and relationships within the text.

##### Example:
Given a warehouse video scene caption like:  
*"A worker places a heavy box on the conveyor belt, and the box falls due to improper placement."*

- The LLM can extract entities such as:
  - **Worker** (Person)
  - **Box** (Object)
  - **Conveyor Belt** (Equipment)

- Relationships identified might include:
  - **"Worker places box on conveyor belt"**
  - **"Box falls due to improper placement"**

These entities and relationships are represented as nodes and edges in a Neo4j graph. Captions and embeddings, generated with `llama-3.2-nv-embedqa-1b-v2`, are also linked to these entities. These can provide descriptive answers to user queries.
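
To make the node-and-edge representation concrete, here is a minimal in-memory sketch of the example above. It is only an illustration: VSS stores the graph in Neo4j, the `KnowledgeGraph` class is hypothetical, and the relationship names (`PLACES`, `PLACED_ON`, `FALLS_DUE_TO`) and the `Reason` node are assumptions chosen to mirror the example caption.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Minimal graph: typed nodes plus (source, relation, target) edges."""
    nodes: dict = field(default_factory=dict)  # node name -> label
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, name, label):
        self.nodes[name] = label

    def add_edge(self, source, relation, target):
        # Only connect nodes that were extracted as entities.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError(f"unknown node in edge {source!r} -> {target!r}")
        self.edges.append((source, relation, target))

    def neighbors(self, name):
        """Outgoing (relation, target) pairs for a node."""
        return [(rel, dst) for src, rel, dst in self.edges if src == name]


kg = KnowledgeGraph()
kg.add_node("Worker", "Person")
kg.add_node("Box", "Object")
kg.add_node("Conveyor Belt", "Equipment")
kg.add_node("Improper Placement", "Reason")

kg.add_edge("Worker", "PLACES", "Box")
kg.add_edge("Box", "PLACED_ON", "Conveyor Belt")
kg.add_edge("Box", "FALLS_DUE_TO", "Improper Placement")

print(kg.neighbors("Box"))
```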


---
#### Graph-Retriever

##### Cypher Query Generation:
The Graph Retriever leverages an LLM to process user queries and translate them into structured Cypher queries suitable for graph-based searches.

##### Example:
If the user query is:  
*"What caused the box to fall?"*

- The LLM identifies the key entities (e.g., "box") and the desired information (e.g., cause of fall).  
- It then generates a structured Cypher query for the graph:

```cypher
MATCH (b:Object)-[:PLACED_ON]->(c:Equipment), (b)-[:FALLS_DUE_TO]->(r:Reason)
WHERE b.name = 'Box'
RETURN r
```

This query, executed on the knowledge graph, retrieves the relevant information, enabling users to query complex relationships within the graph.
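
When a query like this is assembled in code, values derived from the user's question are usually passed as Cypher parameters (the `$name` syntax) instead of being pasted into the query text, which avoids quoting problems. The sketch below shows that idea for the example query. The `build_cause_query` helper is hypothetical, and how the query and parameters are handed to the database depends on the client library in use.

```python
def build_cause_query(object_name):
    """Return (cypher_text, parameters) asking why a named object fell."""
    query = (
        "MATCH (b:Object)-[:PLACED_ON]->(c:Equipment), "
        "(b)-[:FALLS_DUE_TO]->(r:Reason) "
        "WHERE b.name = $name "
        "RETURN r"
    )
    # The object name travels separately from the query text.
    return query, {"name": object_name}


query, params = build_cause_query("Box")
print(query)
print(params)
```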


---
#### Graph-Generation

Once the Graph Retriever processes the user query and fetches a relevant subgraph (entities, relationships, and captions) from the knowledge graph, **G-Generation** utilizes an LLM to analyze and synthesize the retrieved data into a coherent and meaningful response.

##### Example:
If the user query is:  
*"What caused the box to fall?"*  

The Graph Retriever might fetch the subgraph containing:
- **Nodes**: 
  - Object (**Box**)
  - Equipment (**Conveyor Belt**)
  - Reason (**Improper Placement**)
- **Relationships**:
  - **"Box placed on conveyor belt"**
  - **"Box falls due to improper placement"**
- **Caption**:
  - **"A worker places a heavy box on the conveyor belt, and the box falls due to improper placement."**

G-Generation processes this data, combining the graph structure and its properties, to generate a response such as:  
*"The box fell because it was improperly placed on the conveyor belt."*
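
One way to picture this step is as prompt assembly: the retrieved nodes, relationships, and captions are flattened into text and placed in front of the question. The sketch below does this for the example subgraph. The prompt layout and the `build_generation_prompt` function are hypothetical; the prompt that VSS actually sends to the LLM is not shown in this notebook.

```python
def build_generation_prompt(question, nodes, relationships, captions):
    """Flatten a retrieved subgraph into a text prompt for an LLM."""
    lines = ["Answer the question using only the context below.", "", "Entities:"]
    lines += [f"- {name} ({label})" for name, label in nodes]
    lines += ["", "Relationships:"]
    lines += [f"- {src} -[{rel}]-> {dst}" for src, rel, dst in relationships]
    lines += ["", "Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


prompt = build_generation_prompt(
    question="What caused the box to fall?",
    nodes=[("Box", "Object"), ("Conveyor Belt", "Equipment"), ("Improper Placement", "Reason")],
    relationships=[
        ("Box", "PLACED_ON", "Conveyor Belt"),
        ("Box", "FALLS_DUE_TO", "Improper Placement"),
    ],
    captions=[
        "A worker places a heavy box on the conveyor belt, "
        "and the box falls due to improper placement."
    ],
)
print(prompt)
```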

---

#### 2.1 Let's Try a Few More Questions

In [None]:
qna("What could be some possible safety issues in this warehouse?")

In [None]:
qna("When did the forklift appear?")

In [None]:
qna("Describe the warehouse setting in detail.")

In [None]:
# qna("Enter your question")

In [None]:
# qna("Enter your question")

---
### Part 3: Graph-RAG Visualization

In this section, we will explore and visualize the knowledge graph stored in the Neo4j database. Using the py2neo Python library, we will run queries to fetch specific parts of the graph and render them visually. This visualization helps in inspecting the structure and relationships in the graph, providing a clear representation of the data stored in the database.

##### Sample graph from Neo4j Visualizer Dashboard

<img alt="Graph Diagram" src="assets/graph_neo4j.png" width=1000>

In [None]:
# Helper functions - No need to understand the following code cell

from py2neo import Graph
import networkx as nx
import matplotlib.pyplot as plt
import textwrap

def visualize_neo4j_query(query, host="localhost", port=7687, user="neo4j", password="password"):
    try:
        graph = Graph(f"bolt://{host}:{port}", auth=(user, password))
    except Exception as e:
        print(f"Error connecting to Neo4j: {e}")
        return

    try:
        result = graph.run(query)
        G = nx.DiGraph()

        for record in result:
            path = record["p"]
            for rel in path.relationships:
                start_node = rel.start_node
                end_node = rel.end_node

                start_label = start_node.get("name", start_node.get("id", f"Node_{start_node.identity}"))
                end_label = end_node.get("name", end_node.get("id", f"Node_{end_node.identity}"))

                # Wrap labels for better readability if they are too long
                start_label = '\n'.join(textwrap.wrap(start_label, width=20))
                end_label = '\n'.join(textwrap.wrap(end_label, width=20))

                G.add_node(start_label)
                G.add_node(end_label)
                G.add_edge(start_label, end_label, label=rel.__class__.__name__)

        plt.figure(figsize=(15, 10))
        
        pos = nx.spring_layout(G, seed=42, k=0.5, iterations=50)
        
        nx.draw_networkx_nodes(G, pos, node_color='lightgreen', node_size=2500)
        nx.draw_networkx_labels(G, pos, font_size=8)
        edges = nx.draw_networkx_edges(G, pos, arrowstyle='-|>', arrowsize=10)
        edge_labels = nx.get_edge_attributes(G, 'label')
        nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_color='blue')

        plt.title("Neo4j Graph Visualization")
        plt.show()

    except Exception as e:
        print(f"Error running query or visualizing the graph: {e}")


def get_neo4j_query_text(query, host="localhost", port=7687, user="neo4j", password="password"):
    try:
        graph = Graph(f"bolt://{host}:{port}", auth=(user, password))
    except Exception as e:
        print(f"Error connecting to Neo4j: {e}")
        return

    try:
        result = graph.run(query)
        output = []

        for record in result:
            path = record["p"]
            for rel in path.relationships:
                start_node = rel.start_node
                end_node = rel.end_node

                start_label = start_node.get("name", start_node.get("id", f"Node_{start_node.identity}"))
                end_label = end_node.get("name", end_node.get("id", f"Node_{end_node.identity}"))
                rel_type = type(rel).__name__

                output.append(f"{start_label} - {rel_type} - {end_label}")

        if not output:
            return "No results found."

        return "\n".join(output)

    except Exception as e:
        print(f"Error running query or processing the results: {e}")
        return None



---
#### 3.1 Cypher Queries

<span style="color:red"><b>NOTE: You might have to modify the entity and relationship names in the following Cypher queries based on the actual generated graph.</b></span>

##### Visualizing Who Wears What

Let's see what the sub-graph of all entities connected by the keyword "WEARS" looks like:

The following Cypher query retrieves and visualizes relationships where people are wearing items. It matches all `WEARS` relationships in the graph and returns the paths to better understand the connections.

**Cypher Query:**
```cypher
MATCH p=()-[r:WEARS]->() 
RETURN p
```

In [None]:
visualize_neo4j_query("MATCH p=()-[r:WEARS]->() RETURN p")

##### Visualizing Sub-Graph where a Person with id='worker' is shown with WEARS relation/edge

The following Cypher query retrieves information about a specific person (identified by `worker`) and what the worker is wearing. It matches the `WEARS` relationship between the person and the clothing item, returning the path and details of the item.

**Cypher Query:**

```cypher
MATCH p=(person)-[r:WEARS]->(item)
WHERE person.id = 'worker'
RETURN p
```


In [None]:
visualize_neo4j_query("MATCH p=(person)-[r:WEARS]->(item) WHERE person.id = 'worker' RETURN p")

##### Fetching Particular Nodes with a Cypher Query

Suppose you want to ask a question where the agent needs to know what is located in the warehouse.

The following Cypher query retrieves information about objects located in spaces whose `id` starts with "warehouse". This is expressed with the regular expression `"warehouse.*"` on the item `id`. The query matches the `LOCATED_IN` relationship between objects and the location, and returns the details of each node.

**Cypher Query:**

```cypher
MATCH p=(person)-[r:LOCATED_IN]->(item)
WHERE item.id =~ "warehouse.*"
RETURN p
```

In [None]:
query = """
MATCH p=(person)-[r:LOCATED_IN]->(item)
WHERE item.id =~ "warehouse.*"
RETURN p
"""

text_output = get_neo4j_query_text(query)
print(text_output)
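
A note on the `=~` operator used above: in Cypher it applies a regular expression that must match the entire string, so `"warehouse.*"` matches ids that start with "warehouse" and is case-sensitive. The sketch below mimics that behavior with Python's `re.fullmatch` to show which ids would pass the filter. The ids are made up, `matches_cypher_regex` is a hypothetical helper, and Cypher uses Java regular expressions, so only simple patterns like this one are guaranteed to behave the same way.

```python
import re


def matches_cypher_regex(value, pattern):
    """Mimic Cypher's =~ operator, which must match the whole string."""
    return re.fullmatch(pattern, value) is not None


# Made-up node ids for illustration.
ids = ["warehouse", "warehouse floor", "Warehouse", "main warehouse", "loading dock"]

starts_with = [i for i in ids if matches_cypher_regex(i, "warehouse.*")]
anywhere_any_case = [i for i in ids if matches_cypher_regex(i, "(?i).*warehouse.*")]

print(starts_with)
print(anywhere_any_case)
```

If the generated graph uses different capitalization or prefixes, a pattern such as `(?i).*warehouse.*` in the `WHERE` clause is a more forgiving filter.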

---
### Part 4: Enhancing video understanding - Audio transcription and CV Pipeline

The VSS 2.3.0 GA release adds several new features, including multi-live stream, burst mode ingestion, a CV pipeline, and audio transcription.

- **Multi-live stream and burst clip modes**: Concurrently process hundreds of live streams or pre-recorded video files. This is useful to scale your video AI agents.
- **Audio transcription**: Convert speech to text for a multimodal understanding of a scene. This is useful for use cases where audio is a key component, such as instructional videos, keynotes, team meetings, or company training content.
- **Computer vision pipeline**: Enhance accuracy by tracking objects in a scene with zero-shot object detection and using bounding boxes and segmentation masks with Set-of-Mark (SoM), which guides vision-language models using a predefined set of reference points or labels to improve detection.



#### 4.1: Audio Transcription

By default, VSS uses the Riva ASR NIM, which provides state-of-the-art automatic speech recognition (ASR) models for multiple languages. 

To enable audio transcription in the VSS ingestion pipeline, set ```ENABLE_AUDIO=true``` in `.env` while setting up VSS, and provide the details of the Riva ASR model in the configuration.

- [Enabling audio with Docker Compose](https://docs.nvidia.com/vss/latest/content/quickstart_docker.html#step-4-deploy-riva-asr-nim-optional)
- [Enabling audio with Helm Chart](https://docs.nvidia.com/vss/latest/content/run_via.html#enabling-audio)

We will explore this feature with a new video that contains audio: a clip from the GTC Keynote 2025.

<video width="1000"
       src="assets/keynote_clip.mp4"  
       controls>
</video>

In [None]:
with open(keynote_video, "rb") as file:
    files = {"file": ("keynote_video", file)} #provide the file content along with a file name 
    data = {"purpose":"vision", "media_type":"video"}
    response1 = requests.post(upload_file_endpoint, data=data, files=files) #post file upload request 
response1 = check_response(response1)
keynote_video_id_1 = response1["id"] #save file ID for summarization request without audio

with open(keynote_video, "rb") as file:
    files = {"file": ("keynote_video", file)} #provide the file content along with a file name 
    data = {"purpose":"vision", "media_type":"video"}
    response2 = requests.post(upload_file_endpoint, data=data, files=files) #post file upload request 
response2 = check_response(response2)
keynote_video_id_2 = response2["id"] #save file ID for summarization request with audio

In [None]:
prompts = {
    "vlm_prompt": "Write a concise and clear dense caption for the provided NVIDIA GTC Keynote video presented by Jensen Huang, focusing on the technology launches and visual presentation",
    
    "caption_summarization": "You should summarize the following events of a conference. The output should be in bullet points with timestamps. Do not return anything else except the bullet points.",
    
    "summary_aggregation": "You are a video description service. Given the video captions and audio transcripts, aggregate them to a concise summary with timestamps. The output should only contain bullet points."
}

In [None]:
payload = {
    "id": keynote_video_id_1,
    "prompt": prompts['vlm_prompt'],
    "caption_summarization_prompt": prompts['caption_summarization'],
    "summary_aggregation_prompt": prompts['summary_aggregation'],
    "model": configured_vlm,
    "chunk_duration": 10,
    "chunk_overlap_duration": 0,
    "summarize": True,  
    "enable_chat": True,
    "enable_audio": False  #Processing video without audio
}

response = requests.post(summarize_endpoint, json=payload)
response = check_response(response)
summary_no_audio = response["choices"][0]["message"]["content"]

In [None]:
payload = {
    "id": keynote_video_id_2,
    "prompt": prompts['vlm_prompt'],
    "caption_summarization_prompt": prompts['caption_summarization'],
    "summary_aggregation_prompt": prompts['summary_aggregation'],
    "model": configured_vlm,
    "chunk_duration": 10,
    "chunk_overlap_duration": 0,
    "summarize": True,  
    "enable_chat": True,
    "enable_audio": True  #Processing video with audio
}

response = requests.post(summarize_endpoint, json=payload)
response = check_response(response)
summary_with_audio = response["choices"][0]["message"]["content"]
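
The two requests above differ only in the video ID and the `enable_audio` flag. As an optional refactoring sketch, a small helper can hold the shared fields and accept the per-request differences as keyword arguments. The `build_summarize_payload` function is hypothetical, and the prompts, IDs, and model name in the demo are placeholders so that the example runs without a server.

```python
def build_summarize_payload(video_id, prompts, model, **overrides):
    """Build a summarize request body with shared defaults plus overrides."""
    payload = {
        "id": video_id,
        "prompt": prompts["vlm_prompt"],
        "caption_summarization_prompt": prompts["caption_summarization"],
        "summary_aggregation_prompt": prompts["summary_aggregation"],
        "model": model,
        "chunk_duration": 10,
        "chunk_overlap_duration": 0,
        "summarize": True,
        "enable_chat": True,
    }
    payload.update(overrides)  # e.g. enable_audio=True
    return payload


# Placeholder values for an offline demonstration.
demo_prompts = {
    "vlm_prompt": "caption prompt",
    "caption_summarization": "caption summarization prompt",
    "summary_aggregation": "summary aggregation prompt",
}
no_audio = build_summarize_payload("video-1", demo_prompts, "demo-vlm", enable_audio=False)
with_audio = build_summarize_payload("video-2", demo_prompts, "demo-vlm", enable_audio=True)

differing = sorted(key for key in no_audio if no_audio[key] != with_audio[key])
print(differing)
```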

In [None]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h2> Without Audio </h2>
    {summary_no_audio}
  </div>
  <div style="flex: 1;">
  <h2> With Audio </h2>
    \n{summary_with_audio}
  </div>
</div>
"""

Markdown(markdown_string)

Notice how we were able to get more accurate summaries by enabling audio while keeping all other parameters the same. This can be useful for transcription-heavy videos such as training sessions, lectures, and instructional videos. 

Now, let's look into another feature that could be easily enabled within VSS.

---
#### 4.2: Computer Vision Pipeline

The CV metadata generated by the CV pipeline is utilized to improve the accuracy of Video Search and Summarization in two ways:

* The data processing pipeline uses the metadata to generate VLM inputs with overlaid object IDs, masks, and so on. This improves the accuracy of the VLM and enables Set-of-Mark (SoM) prompting.

* The metadata is attached to the VLM-generated dense captions and passed to the retrieval pipeline for further processing and indexing.

Following are the steps to initialize the CV pipeline in VSS. Once initialized, users can choose to enable or disable the CV pipeline for individual summarization requests.

Let's look at this using a sample video from the previous notebook: a traffic intersection.

<video width="1000"
       src="assets/traffic.mp4"  
       controls>
</video>

In [None]:
with open(traffic_video, "rb") as file:
    files = {"file": ("traffic_video", file)} #provide the file content along with a file name 
    data = {"purpose":"vision", "media_type":"video"}
    response1 = requests.post(upload_file_endpoint, data=data, files=files) #post file upload request 
response1 = check_response(response1)
traffic_video_id = response1["id"]

In [None]:
prompts = {
    "vlm_prompt": "You are an intelligent traffic system. You must monitor and take note of all traffic related events. Start each event description with a start and end time stamp.",
    
    "caption_summarization": "You will be given captions from sequential clips of a video. Aggregate captions in the format start_time:end_time:caption based on whether captions are related to one another or create a continuous scene",
    
    "summary_aggregation": "Based on the available information, generate a traffic report that is organized chronologically and in logical sections. Give each section a descriptive heading of what occurs and the time range. This should be a concise, yet descriptive summary of all the important events. The format should be intuitive and easy for a user to read and understand what happened. Format the output in Markdown so it can be displayed nicely."
}

In [None]:
payload = {
    "id": traffic_video_id,
    "prompt": prompts['vlm_prompt'],
    "caption_summarization_prompt": prompts['caption_summarization'],
    "summary_aggregation_prompt": prompts['summary_aggregation'],
    "model": configured_vlm,
    "chunk_duration": 10,
    "chunk_overlap_duration": 0,
    "summarize": True,  
    "enable_chat": True,
    "enable_cv_metadata": False  #Processing video without CV metadata
}

response = requests.post(summarize_endpoint, json=payload)
response = check_response(response)
summary_no_cv = response["choices"][0]["message"]["content"]

In [None]:
answer_no_cv = qna("Which cars collided?", traffic_video_id)
print(answer_no_cv)

Now, to use the CV metadata effectively, we need to update the first prompt as shown below so that the VLM uses IDs in event descriptions.

In [None]:
updated_vlm_prompt = (
    "You are an intelligent traffic system. The provided video is "
    "a processed clip where each vehicle is overlaid with an ID. "
    "You must monitor and take note of all traffic related events. "
    "Start each event description with a start and end time stamp "
    "of the event, and use vehicle IDs in event description."
)

In [None]:
payload = {
    "id": traffic_video_id,
    "prompt": updated_vlm_prompt,
    "caption_summarization_prompt": prompts['caption_summarization'],
    "summary_aggregation_prompt": prompts['summary_aggregation'],
    "model": configured_vlm,
    "chunk_duration": 10,
    "chunk_overlap_duration": 0,
    "summarize": True,  
    "enable_chat": True,
    "enable_cv_metadata": True,  #Processing video with CV metadata
    "cv_pipeline_prompt": "vehicle"
}

response = requests.post(summarize_endpoint, json=payload)
response = check_response(response)
summary_with_cv = response["choices"][0]["message"]["content"]

In [None]:
answer_with_cv = qna("Which cars collided?", traffic_video_id)
print(answer_with_cv)

In [None]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
    <h2>Summary without CV Metadata</h2>
    {summary_no_cv}
  </div>
  <div style="flex: 1;">
    <h2>Summary with CV Metadata</h2>
    {summary_with_cv}
  </div>
</div>
"""

Markdown(markdown_string)

Finally, let's compare the Q&A results.

In [None]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
    <h2>Q&A without CV Metadata</h2>
    {answer_no_cv}
  </div>
  <div style="flex: 1;">
    <h2>Q&A with CV Metadata</h2>
    {answer_with_cv}
  </div>
</div>
"""

Markdown(markdown_string)

With CV metadata enabled, the VLM was able to use the overlaid vehicle IDs and give context in the summary. This makes it easier to ground information in the original input video. 

---
### Review

In this notebook you learned the following:
- How to enable chat in VSS to ask questions about a video
- How Graph-RAG extraction from dense captions works under the hood
- How to visualize the knowledge graph to understand its structure and relationships
- How to enable and use features like audio transcription and CV metadata to improve accuracy for specific use-cases

---
### [Additional Material]: Finetuning a Custom VLM

In Video Search and Summarization (VSS), users can leverage a custom or fine-tuned Vision-Language Model (VLM).

To support this, NVIDIA provides a set of resources for finetuning VLMs using the TAO FTMS service.

#### Finetuning VLMs with Train Adapt and Optimize Finetuning Microservices (TAO FTMS)

We offer a container and resources via [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/containers/vlm-lita-finetuning-ea).

#### Reference Notebooks

You can find the full set of [example notebooks](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/vlm-lita-getting-started-ea/files) for the Standard Operating Procedure (SOP) use case.

**Step-1. Data Preparation for SOP Use Case: `01_Data_Labeling.ipynb`**
- Video chunking (frame/window segmentation)
- Label creation in LLaVA-compatible format
- Supports: MCQ (Multiple Choice QA), GQA (General QA), and BQA (Binary QA)

**Step-2. Finetuning the VLM with TAO FTMS: `vila.ipynb`**
- Setup for training using the VLM LITA container
- Use the curated videos and labels generated in Step-1

**Step-3. SOP Agent (Standard Operating Procedure): `03_VLM_Agent.ipynb`**
- Example QA agent to test VLM capabilities
- Compare model outputs before and after fine-tuning

> **Tip:** Use this setup to enhance VLM performance on your domain-specific video data. To integrate the fine-tuned VLM into VSS, follow these [steps](https://docs.nvidia.com/vss/latest/content/installation-vlms.html#local-ngc-models-vila-nvila).