<center><a href="https://www.nvidia.cn/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

# 4a. VSS: Summarization and CA-RAG

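The previous labs covered the theory behind multimodal models. This lab shows how to orchestrate multimodal models into a practical data workflow, using the [Video Search and Summarization (VSS) AI Blueprint](https://build.nvidia.com/nvidia/video-search-and-summarization/blueprintcard) as inspiration. The blueprint is a reference implementation that uses NVIDIA NIM to provide summarization and Q&A over video at scale.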

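#### Learning Objectives:
The goals of this notebook are to:
* Test the VSS REST API for file management and summarization
* Implement summarization and Context-Aware RAG (CA-RAG)
* Configure VSS options through the REST API to fit a specific summarization use case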

![VSS architecture diagram](images/vss_arch_diagram.png)

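## 4.1 Set Up the Environment

A VSS instance has already been set up and is reachable at the address shown below.

Two sample videos are also provided in the ```data``` folder:
1) A synthetically generated video of a traffic intersection
2) A video of a drone flying around a bridge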

In [1]:
vss_url = "http://via-server:8000"
traffic_video = "data/traffic.mp4"
bridge_video = "data/bridge.mp4"

Next, import the following libraries.

In [2]:
import json
import requests
from pymilvus import MilvusClient
from IPython.display import Markdown, Video, display
import time

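## 4.2 VSS API Overview

VSS exposes REST API endpoints for interacting with the blueprint. These endpoints are the interface for integrating VSS into your own applications or services.

The API covers the following areas:
- Alerts
- Files
- Health Check
- Live Stream
- Metrics
- Models
- Recommended Config
- Summarization


To view the REST API endpoints and schemas, open the Swagger documentation at http://localhost:8000/docs while VSS is running.
Port 8000 is the VSS backend port and may need to be adjusted for your configuration.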

![VSS Swagger](images/swagger_docs.png)

A successful request to a REST API endpoint returns status code 200 with a response in JSON format. The helper function below validates a response and helps debug errors.

In [3]:
#helper function to verify responses 
def check_response(response, text=False):
    print(f"Response Code: {response.status_code}")
    if response.status_code == 200:
        print("Response Status: Success")
        if text:
            print(response.text)
            return response.text
        else:
            print(json.dumps(response.json(), indent=4))
            return response.json()
    else:
        print("Response Status: Error")
        print(response.text)
        return None 

We will explore the following endpoints:

In [4]:
files_endpoint = vss_url + "/files" #upload and manage files
summarize_endpoint = vss_url + "/summarize" #summarize uploaded content 
health_endpoint = vss_url + "/health/ready" #check the status of the VSS server
models_endpoint = vss_url + "/models" #view the configured LLM in VSS

The health check endpoint verifies that your VSS instance is running. It should return status code 200.

In [5]:
resp = requests.get(vss_url + "/health/ready")
resp = check_response(resp, text=True)

Response Code: 200
Response Status: Success
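A server that is still starting may not answer the health check right away. The sketch below polls until a 200 arrives or a timeout passes. The function name `wait_until_ready` and its parameters are hypothetical and not part of VSS; `probe` is any zero-argument callable that returns an HTTP status code, for example `lambda: requests.get(health_endpoint, timeout=5).status_code`.

```python
import time


def wait_until_ready(probe, timeout_s=60.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns 200 or timeout_s elapses.

    Returns True if the service reported ready, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except Exception:
            # Connection errors are expected while the server is starting.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing the probe as a callable keeps the helper independent of any particular HTTP client.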



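The models endpoint returns the LLM available for summarization requests. This depends on how VSS was configured at launch. The LLM can be configured to point to any OpenAI-compatible LLM.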

In [6]:
resp = requests.get(vss_url + "/models")
resp = check_response(resp)

Response Code: 200
Response Status: Success
{
    "object": "list",
    "data": [
        {
            "id": "vila-1.5",
            "created": 1745458450,
            "object": "model",
            "owned_by": "NVIDIA",
            "api_type": "internal"
        }
    ]
}


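## 4.3 Video Summarization

This section shows how to upload a video file and send a request to VSS to generate a simple summary of a two-minute traffic intersection video.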

In [7]:
Video(traffic_video, width=1000)

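### 4.3.1 Upload a File

The first step in summarizing a video with VSS is to upload the video file through the REST API. Several endpoints are available for working with files.

![File endpoints](images/file_endpoints.png)

To send a request that contains a video file, open the file in "rb" mode to read its binary content, then attach it to the request body as a file. The request should also set ```purpose``` to "vision" and ```media_type``` to "video". A single image can be uploaded as well, with ```media_type``` set to "image", to summarize an individual image file.

The request is then posted to the ```/files``` endpoint as a multipart form.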

In [8]:
with open(traffic_video, "rb") as file:
    files = {"file": ("traffic_video", file)} #provide the file content along with a file name 
    data = {"purpose":"vision", "media_type":"video"}
    response = requests.post(files_endpoint, data=data, files=files) #post file upload request 
response = check_response(response)
video_id = response["id"] #save file ID for summarization request

Response Code: 200
Response Status: Success
{
    "id": "2cbd8f30-4ba2-443b-b2e9-1e7c2b904658",
    "bytes": 8141473,
    "filename": "traffic_video",
    "purpose": "vision",
    "media_type": "video"
}


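Once the file is posted, a unique ID is returned. This ID is used to reference the uploaded file in summarization requests.

To list all uploaded files, send a GET request to the ```/files``` endpoint.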

In [9]:
resp = requests.get(files_endpoint, params={"purpose":"vision"})
resp = check_response(resp)

Response Code: 200
Response Status: Success
{
    "data": [
        {
            "id": "2cbd8f30-4ba2-443b-b2e9-1e7c2b904658",
            "bytes": 8141473,
            "filename": "traffic_video",
            "purpose": "vision",
            "media_type": "video"
        }
    ],
    "object": "list"
}


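### 4.3.2 Summarization

Once a video or image has been uploaded, the ```/summarize``` endpoint can be called to generate a summary.

![Summarize endpoint](images/summarize_endpoint.png)

The request body should include the video ID along with the prompts and model options. These options are explored in detail later in this notebook.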

In [10]:
body = {
    "id": video_id, #id of file returned after upload 
    "prompt": "Write a caption based on the video clip.",
    "caption_summarization_prompt": "Combine sequential captions to create more concise descriptions.",
    "summary_aggregation_prompt": "Write a summary of the video. ",
    "model": "vila-1.5",
    "max_tokens": 1024,
    "temperature": 0.8,
    "top_p": 0.8,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0
}
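This notebook builds several request bodies that differ in only a few fields, so a small helper can reduce copy-paste mistakes. The following is a minimal sketch, not part of the VSS API: `make_summarize_body` is a hypothetical name, its defaults mirror the values in the cell above, and it only accepts the fields used in this notebook even though the endpoint may accept others.

```python
def make_summarize_body(video_id, prompt, caption_summarization_prompt,
                        summary_aggregation_prompt, **overrides):
    """Return a /summarize request body; keyword overrides replace defaults."""
    body = {
        "id": video_id,
        "prompt": prompt,
        "caption_summarization_prompt": caption_summarization_prompt,
        "summary_aggregation_prompt": summary_aggregation_prompt,
        "model": "vila-1.5",
        "max_tokens": 1024,
        "temperature": 0.8,
        "top_p": 0.8,
        "chunk_duration": 20,
        "chunk_overlap_duration": 0,
    }
    # Reject misspelled option names instead of silently sending them.
    unknown = set(overrides) - set(body)
    if unknown:
        raise ValueError(f"unknown summarize options: {sorted(unknown)}")
    body.update(overrides)
    return body
```

For example, `make_summarize_body(video_id, prompt, caption_summarization_prompt, summary_aggregation_prompt, temperature=0.2)` changes only the temperature.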

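The request body is then posted to the ```/summarize``` endpoint to start summarization. Depending on the video length and the configuration options, this request can take some time to return.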

In [11]:
response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
generic_summary = response["choices"][0]["message"]["content"]
summary_id = response["id"] #save to inspect later

Response Code: 200
Response Status: Success
{
    "id": "d8305ffe-a3e5-415c-b2d4-3aca6c8b1cfe",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "Unfortunately, it seems like there are two different descriptions provided, and I'll summarize each one separately.\n\n**Summary 1:**\nThe video shows various scenes of cars driving and navigating through intersections and a roundabout. It starts with a gray car driving on the right side of the road, followed by a car stopping at a stop sign. Then, it shows a top-down view of a roundabout with multiple lanes, traffic signs, and various vehicles moving smoothly around it. Finally, a yellow car is seen turning left at an intersection.\n\n**Summary 2:**\nThe video shows a car accident at an intersection, with a red car and a yellow car colliding and causing damage. A police car arrives at the scene, and the video captures the aftermath of the collision. Th

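The response includes the summary output, some metadata, and a unique request ID. The summary is returned in a format similar to the OpenAI API specification, so it can be extracted from the message of the first entry in the choices list: ```response["choices"][0]["message"]["content"]```. Run the next cell to display the summary output in the notebook.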

In [12]:
display(Markdown("### Summary Output")) 
markdown_string = "\n".join(f"> {line}" for line in generic_summary.splitlines())
display(Markdown(markdown_string)) #render summary output as markdown

### Summary Output

> Unfortunately, it seems like there are two different descriptions provided, and I'll summarize each one separately.
> 
> **Summary 1:**
> The video shows various scenes of cars driving and navigating through intersections and a roundabout. It starts with a gray car driving on the right side of the road, followed by a car stopping at a stop sign. Then, it shows a top-down view of a roundabout with multiple lanes, traffic signs, and various vehicles moving smoothly around it. Finally, a yellow car is seen turning left at an intersection.
> 
> **Summary 2:**
> The video shows a car accident at an intersection, with a red car and a yellow car colliding and causing damage. A police car arrives at the scene, and the video captures the aftermath of the collision. The scene then shifts to a top-down view of the intersection, showing the movement of the cars as they navigate the intersection. Finally, the video displays a bird's-eye view of the intersection with the damaged cars and a white fire truck stationary, with no visible people or movement.
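Indexing with `response["choices"][0]["message"]["content"]` raises an exception when the request failed (the `check_response` helper returns `None` in that case) or when the message carries no `content` key. A defensive variant might look like the sketch below; `extract_summary` is a hypothetical helper name.

```python
def extract_summary(response_json):
    """Return the summary text from an OpenAI-style response, or None."""
    if not isinstance(response_json, dict):
        return None
    choices = response_json.get("choices") or []
    if not choices:
        return None
    message = choices[0].get("message") or {}
    # .get() returns None when the message has no "content" key.
    return message.get("content")
```

Returning `None` instead of raising lets the calling cell decide whether to retry or report the failure.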

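The summary output is short and does not capture much detail or any timestamp information. This is because the parameters in the summarization request were all very generic. To improve the output, we can tune the ```prompt```, ```caption_summarization_prompt```, ```summary_aggregation_prompt```, chunk duration, chunk overlap duration, ```temperature```, and ```top_p```. To understand how to configure a summarization request well, let's look at how the summarization workflow operates.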

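## 4.4 Summarization Workflow and CA-RAG

Summarization is a multi-stage workflow made up of a series of VLM and LLM calls. The workflow is GPU-accelerated and runs with optimized VLM and LLM NIMs. To produce an informative summary, the LLM combines the video details generated by the VLM with a vector database; together these form the Context-Aware RAG module of VSS.

When a summarization request is posted, the input video is first split into many smaller segments, or "chunks". The size of these chunks is set by the ```chunk_duration``` parameter. A typical chunk is between 10 and 60 seconds long.

The VLM processes the video chunks in parallel. For each chunk, it samples multiple frames from the segment and then generates a text description of the events that occur in that chunk. This text output is shaped by the ```prompt``` parameter.

![Summarization diagram](images/summarization_diagram.png)


A batch of the dense captions generated by the VLM is passed to the LLM together with the ```caption_summarization_prompt``` so that the captions are condensed and repeated information is removed. The diagram shows a batch size of 2 for this step. The batch size can be set in the CA-RAG configuration YAML file used when VSS starts.

The final summarization step is a single LLM call that takes the condensed captions and produces the final summary output. This step is controlled by the ```summary_aggregation_prompt``` parameter.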
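The three stages described above can be pictured with a schematic sketch. This is not the VSS implementation: `summarize_pipeline`, `fake_vlm`, and `fake_llm` are hypothetical stand-ins that only show how captions flow through batching and aggregation, and the real workflow runs the VLM calls in parallel on GPU-backed NIMs.

```python
def summarize_pipeline(chunks, vlm, llm, prompt, caption_summarization_prompt,
                       summary_aggregation_prompt, batch_size=2):
    """Schematic flow: caption each chunk, condense in batches, aggregate."""
    # Stage 1: the VLM writes one dense caption per chunk.
    captions = [vlm(prompt, chunk) for chunk in chunks]

    # Stage 2: the LLM condenses the captions one batch at a time.
    condensed = []
    for i in range(0, len(captions), batch_size):
        batch = captions[i:i + batch_size]
        condensed.append(llm(caption_summarization_prompt, "\n".join(batch)))

    # Stage 3: a single LLM call turns the condensed captions into the summary.
    return llm(summary_aggregation_prompt, "\n".join(condensed))


# Stand-in models so the sketch runs without a GPU or a VSS server.
def fake_vlm(prompt, chunk):
    return f"caption({chunk})"


def fake_llm(prompt, text):
    return "[" + prompt + ": " + text.replace("\n", " | ") + "]"


result = summarize_pipeline(["c0", "c1", "c2"], fake_vlm, fake_llm,
                            "describe", "condense", "summarize")
print(result)
```

With three chunks and a batch size of 2, the second stage makes two LLM calls (one for the first two captions, one for the last), and the third stage makes one.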

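### 4.4.1 Milvus Vector Database

As the VLM generates dense captions, the captions are also sent to an embedding model and stored in a Milvus vector database. This later powers vector RAG for Q&A.

Every summary has an ID, and each ID has a Milvus collection that stores the captions the VLM generated for each chunk of the video. We can use the pymilvus library to view these captions in the Milvus database.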

In [13]:
from pymilvus import MilvusClient

#connect to Milvus DB started by VSS
client = MilvusClient(uri="http://via-server:19530")
res = client.list_collections() #print available collections
print(res)

['summary_till_now_d8305ffe_a3e5_415c_b2d4_3aca6c8b1cfe']


In [14]:
summary_id = summary_id.replace("-", "_") #convert ID from summarization request into Milvus collection ID
collection_name = f"summary_till_now_{summary_id}"
print(collection_name)

summary_till_now_d8305ffe_a3e5_415c_b2d4_3aca6c8b1cfe


Using the ID from the summarization request, the Milvus collection containing the dense captions can be loaded.

In [15]:
client.load_collection(collection_name=collection_name) #load collection associated with the previous summarization request. 
res = client.get_load_state(collection_name=collection_name)
print(res)

{'state': <LoadState: Loaded>}


Once the collection is loaded, the dense captions can be retrieved and inspected.

In [16]:
res = client.query(
    collection_name=collection_name,
    limit=10
)

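The following cell prints one record stored in Milvus. The record corresponds to a group of dense captions that were batched together and embedded. The key-value pairs attached to the record store caption metadata such as timestamps.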

In [17]:
for k,v in res[0].items():
    if k == "vector": #skip embedding vector from being printed
        continue 
    print(f"{k}: {v}")

pts_offset_ns: 0
request_id: d8305ffe-a3e5-415c-b2d4-3aca6c8b1cfe
start_pts: 0
is_last: False
start_ntp_float: 0.0
is_first: True
file: /tmp/assets/2cbd8f30-4ba2-443b-b2e9-1e7c2b904658/traffic_video
pk: 457561457844922008
end_ntp_float: 20.0
text: <0.00> <20.00> A gray car is driving on the right side of the road.
end_ntp: 1970-01-01T00:00:20.000Z
cv_meta: []
chunkIdx: 0
streamId: 2cbd8f30-4ba2-443b-b2e9-1e7c2b904658
end_pts: 20000000000
start_ntp: 1970-01-01T00:00:00.000Z
batch_i: 0


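When trying to improve summary quality, it is very helpful to inspect the captions stored in Milvus, because they are the VLM output from which the summary is generated. If the dense captions do not capture the necessary details, the summary will lack them too.

The next section shows how to tune the configuration to improve the summary output.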
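When reviewing many captions, it can help to split the `text` field into its timestamps and caption. The sketch below assumes the `<start> <end> caption` layout seen in the single record printed above; other records or VSS versions may format the field differently, and `parse_caption` is a hypothetical helper name.

```python
import re

# Matches "<start> <end> caption", e.g. "<0.00> <20.00> A gray car ...".
_CAPTION_RE = re.compile(
    r"^\s*<(\d+(?:\.\d+)?)>\s*<(\d+(?:\.\d+)?)>\s*(.*)$", re.DOTALL
)


def parse_caption(text):
    """Return (start_s, end_s, caption), or None if the layout differs."""
    match = _CAPTION_RE.match(text)
    if match is None:
        return None
    start, end, caption = match.groups()
    return float(start), float(end), caption.strip()
```

Applied to each record's `text` value, this gives a quick view of which time ranges have thin or missing descriptions.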

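## 4.5 Summarization Configuration - Intelligent Traffic System

This section covers the available configuration options and how to tune them so that VSS acts as an intelligent traffic system that can generate traffic reports.

Several options can be tuned to shape the summary output. The most important are the prompts given to the VLM and the LLM.
These prompts can be supplied as defaults through a configuration file or passed directly to the summarize endpoint when making a request.

![Summarization prompts](images/summarization_prompts_diagram.png)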

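### 4.5.1 Prompts

A set of three prompts controls the three stages of summary generation. The following shows how to improve the generic prompts used in Section 4.3 to produce a more informative traffic summary.

#### VLM Prompt

The VLM prompt should give the model enough information to know what to look for in the video. If the summary is missing important details, the likely cause is that the VLM did not extract them from the video chunks in the first place.

A prompt that covers these three aspects usually works well:

1) Persona
2) Details
3) Format

For example:

> "You are an intelligent traffic system. You must monitor and take note of all traffic related events. Start each event description with a start and end time stamp."


Giving the VLM the persona of an intelligent traffic system steers its responses toward the details needed for a traffic report. The prompt can then name the specific details to focus on, such as traffic events. Finally, a summary report usually needs timestamp information, so the VLM must be told to include timestamps in its descriptions.

When VSS chunks the video and passes the sampled frames of a chunk to the VLM, it also attaches timestamps so the model knows where each frame falls in the video. The model can then use this information to associate events in its output with the time they occur.

With this more specific prompt, the VLM produces more detailed descriptions that contain the relevant information, which is essential for a good summary.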
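The three-part structure (persona, details, format) can be kept explicit by assembling the prompt from its parts, which makes it easy to swap one part when adapting to a new use case. This is only an illustrative sketch; `build_vlm_prompt` is a hypothetical helper and VSS simply receives the final string.

```python
def build_vlm_prompt(persona, details, output_format):
    """Join the persona, details, and format parts into one prompt string."""
    cleaned = []
    for part in (persona, details, output_format):
        part = " ".join(part.split())  # collapse stray whitespace
        if not part:
            continue
        if not part.endswith((".", "!", "?")):
            part += "."  # make sure each part ends as a sentence
        cleaned.append(part)
    return " ".join(cleaned)


prompt_example = build_vlm_prompt(
    persona="You are an intelligent traffic system",
    details="You must monitor and take note of all traffic related events",
    output_format="Start each event description with a start and end time stamp",
)
print(prompt_example)
```

The printed string matches the traffic prompt used in this section.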

In [18]:
prompt = "You are an intelligent traffic system. You must monitor and take note of all traffic related events. Start each event description with a start and end time stamp."

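#### LLM Caption Summarization Prompt

The text descriptions generated by the VLM are often repetitive across sequential or overlapping chunks. For very long videos, this can waste many tokens when generating the final summary. To condense the VLM captions, an LLM combines the VLM outputs into more concise descriptions.

This prompt usually stays the same across use cases, because it only needs to instruct the LLM to combine similar descriptions.

For example:
>"You will be given captions from sequential clips of a video. Aggregate captions in the format start_time:end_time:caption based on whether captions are related to one another or create a continuous scene"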

In [19]:
caption_summarization_prompt = "You will be given captions from sequential clips of a video. Aggregate captions in the format start_time:end_time:caption based on whether captions are related to one another or create a continuous scene"

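#### LLM Summary Aggregation Prompt

The summary aggregation prompt is used to generate the final summary returned by the summarize endpoint. A single LLM call combines all of the aggregated captions to produce the summary output.

This prompt should restate which details to include and any formatting options. Keep in mind that at this stage the summary can only contain details present in the aggregated captions produced by the previous stages.

For example:
>"Based on the available information, generate a traffic report that is organized chronologically and in logical sections. Give each section a descriptive heading of what occurs and the time range. This should be a concise, yet descriptive summary of all the important events. The format should be intuitive and easy for a user to read and understand what happened. Format the output in Markdown so it can be displayed nicely."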


In [20]:
summary_aggregation_prompt = "Based on the available information, generate a traffic report that is organized chronologically and in logical sections. Give each section a descriptive heading of what occurs and the time range. This should be a concise, yet descriptive summary of all the important events. The format should be intuitive and easy for a user to read and understand what happened. Format the output in Markdown so it can be displayed nicely."

In [21]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": "vila-1.5",
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "chunk_duration": 20,
    "chunk_overlap_duration": 0
}

response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
summary = response["choices"][0]["message"]["content"]


Response Code: 200
Response Status: Success
{
    "id": "33493818-e5f1-407d-9ac2-253c24558c6f",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "tool_calls": [],
                "role": "assistant"
            }
        }
    ],
    "created": 1745480029,
    "model": "vila-1.5",
    "media_info": {
        "type": "offset",
        "start_offset": 0,
        "end_offset": 130
    },
    "object": "summarization.completion",
    "usage": {
        "query_processing_time": 39,
        "total_chunks_processed": 7
    }
}


Run the following cells to display the summaries from the generic prompts and the tuned prompts side by side.

In [22]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Generic Prompts </h1>
    {generic_summary}
  </div>
  <div style="flex: 1;">
  <h1> Tuned Prompts </h1>
    \n{summary}
  </div>
</div>
"""

In [23]:
Markdown(markdown_string)


<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Generic Prompts </h1>
    Unfortunately, it seems like there are two different descriptions provided, and I'll summarize each one separately.

**Summary 1:**
The video shows various scenes of cars driving and navigating through intersections and a roundabout. It starts with a gray car driving on the right side of the road, followed by a car stopping at a stop sign. Then, it shows a top-down view of a roundabout with multiple lanes, traffic signs, and various vehicles moving smoothly around it. Finally, a yellow car is seen turning left at an intersection.

**Summary 2:**
The video shows a car accident at an intersection, with a red car and a yellow car colliding and causing damage. A police car arrives at the scene, and the video captures the aftermath of the collision. The scene then shifts to a top-down view of the intersection, showing the movement of the cars as they navigate the intersection. Finally, the video displays a bird's-eye view of the intersection with the damaged cars and a white fire truck stationary, with no visible people or movement.
  </div>
  <div style="flex: 1;">
  <h1> Tuned Prompts </h1>
    
**Traffic Report**
================

### **Scene 1: Normal Traffic Flow (10:00 AM - 10:30 AM)**

* 10:00 AM - 10:30 AM: A sequence of cars drives through the intersection, including a black car, a silver car, a white car, and another black car, all moving in alternating directions.
* 10:00 AM - 10:30 AM: A variety of vehicles, including a black car, a green car, a yellow school bus, a red car, and a black truck, drive through the intersection in a sequence.

### **Scene 2: Intersection Activity (40.00 - 100.00)**

* 40.00 - 60.00: A red fire truck with flashing lights drives through the intersection, followed by a black truck, then a red car, a yellow school bus, a yellow car, and a blue car.
* 60.00 - 80.00: Cars drive through the intersection, making various turns: a yellow car makes a left turn, a red car makes a right turn, a yellow car makes a right turn, a red car makes a left turn, and a yellow car makes a left turn.
* 80.00 - 100.00: A red car and a yellow car collide at an intersection. The red car drives away, while the yellow car remains stationary. A police car arrives, and the officer exits to approach the yellow car.

### **Scene 3: Accident Aftermath (10:00 AM - 10:25 AM)**

* 10:00 AM - 10:25 AM: A red car and a yellow car drive through the intersection, followed by a black car with flashing lights. A white fire truck with red and yellow stripes arrives at the intersection, driving behind the black car. The police officer hands over the report to the driver of the yellow car and advises them on what to do next. The driver of the yellow car drives away from the scene.
  </div>
</div>


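The summary produced with the tuned prompts should be much more informative than the one produced with the generic prompts. It contains more details relevant to the traffic video, includes event timestamps, and is presented in an easy-to-read format.

### 4.5.2 Chunk Duration

Besides the prompts, the ```chunk_duration``` (also called the chunk size) matters depending on the use case. The chunk size determines the time granularity at which the VLM views the video.

![chunk duration](images/chunk_duration.png)

The number of chunks and frames processed by the VLM can be calculated as follows: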

$$ \text{Number of Chunks} = \frac{\text{Video Length (s)}}{\text{Chunk Size (s)}} $$
$$ \text{Processed Frames} = \text{Frames per Chunk} \times \text{Number of Chunks} $$
$$ \text{Processed Frames} = \text{Frames per Chunk} \times \frac{\text{Video Length (s)}}{\text{Chunk Size (s)}} $$


Now plug in some concrete numbers.

- Video length = 2 minutes (120 seconds)
- Chunk size = 5 seconds
- Frames per chunk = 10 (the VSS default)

$$ \text{Number of Chunks} = \frac{120}{5} = 24 \text{ Chunks} $$
$$ \text{Processed Frames} = 10 \times 24 = 240 \text{ Frames} $$


If the chunk size is changed to 30 seconds:

$$ \text{Number of Chunks} = \frac{120}{30} = 4 \text{ Chunks} $$
$$ \text{Processed Frames} = 10 \times 4 = 40 \text{ Frames} $$
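The same arithmetic can be checked in a few lines of Python. `chunk_stats` is a hypothetical helper. Rounding up with `math.ceil` is an assumption that the formulas do not state; it is consistent with the tuned-prompt response earlier in this notebook, which reports `total_chunks_processed: 7` for an `end_offset` of 130 with 20-second chunks. Chunk overlap is ignored here.

```python
import math


def chunk_stats(video_length_s, chunk_size_s, frames_per_chunk=10):
    """Return (number_of_chunks, processed_frames), ignoring chunk overlap."""
    if chunk_size_s <= 0:
        raise ValueError("chunk_size_s must be positive")
    # Assumption: a partial chunk at the end still counts as one chunk.
    n_chunks = math.ceil(video_length_s / chunk_size_s)
    return n_chunks, n_chunks * frames_per_chunk


print(chunk_stats(120, 5))   # 5-second chunks
print(chunk_stats(120, 30))  # 30-second chunks
```

For the two configurations above this gives 24 chunks with 240 frames and 4 chunks with 40 frames.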



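According to the formulas, a smaller chunk size means more frames from the video are processed and used to generate the summary, while a larger chunk size means fewer frames are used.

With fewer frames (larger chunk size), summary generation is faster, but finer details or fast events may be missed because the VLM sees fewer frames from the video.

With more frames (smaller chunk size), summary generation is slower, but the summary contains more detail and is more likely to capture quick details and events, such as a car passing quickly through the intersection.

The best chunk size depends on the use case and must be tuned to find the right balance between processing time and the temporal resolution of the summary.

In addition, ```chunk_overlap_duration``` can be added to the summarization request to configure overlap between chunks. This helps capture events that occur at chunk boundaries.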
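To see how overlap shifts chunk boundaries, the sketch below lists the start and end of each chunk. It assumes each new chunk starts `chunk_duration - chunk_overlap_duration` seconds after the previous one. That stride is inferred from the `00:00-00:30`, `00:25-00:55` ranges in the 30-second summary output in this notebook and is not a documented guarantee; `chunk_boundaries` is a hypothetical helper name.

```python
def chunk_boundaries(video_length_s, chunk_duration_s, overlap_s=0):
    """Return a list of (start_s, end_s) pairs covering the video."""
    if video_length_s <= 0 or chunk_duration_s <= 0:
        raise ValueError("lengths must be positive")
    if not 0 <= overlap_s < chunk_duration_s:
        raise ValueError("overlap must be >= 0 and shorter than the chunk")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_duration_s, video_length_s)
        bounds.append((start, end))
        if end >= video_length_s:
            break
        # Assumed stride: the next chunk begins overlap_s before this one ends.
        start += chunk_duration_s - overlap_s
    return bounds


print(chunk_boundaries(130, 30, 5))
```

With a 5-second overlap, an event that straddles the 30-second mark appears in full in the second chunk, which starts at 25 seconds.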

The following cells show a side-by-side comparison of summaries generated with a 30-second chunk size and a 5-second chunk size.

In [24]:
prompt = "You are an intelligent traffic monitoring system that will be given a clip from a camera overlooking a four way intersection. You must inspect the clip and write a detailed description of what occurs. End each sentence with a timestamp."
caption_summarization_prompt = "If any descriptions have the same meaning and are sequential then combine them under one sentence and merge the time stamps to a range. Format the timestamps as 'mm:ss'"
summary_aggregation_prompt = "Write out a detailed time line based on the descriptions. The output should be a bulleted list in the format 'mm:ss-mm:ss Description' that includes the timestamp and description of what occurred."

In [25]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": "vila-1.5",
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "chunk_duration": 30,
    "chunk_overlap_duration": 5
}

start_t = time.time() 
response = requests.post(summarize_endpoint, json=body)
summary_30_time = time.time() - start_t 
response = check_response(response)
summary_30 = response["choices"][0]["message"]["content"]

Response Code: 200
Response Status: Success
{
    "id": "a9c44c00-dbbe-4003-ae97-32bcbc957a14",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "Here is the detailed timeline in the format 'mm:ss-mm:ss Description':\n\n\u2022 00:00-00:30 The clip begins with an empty intersection, with no vehicles or pedestrians in sight. The road markings are clearly visible, and the surrounding area is well-maintained with trees and street lamps. As the clip progresses, a black car enters the intersection from the top left corner, followed by a red car from the top right corner. The black car continues straight, while the red car turns right. Shortly after, a yellow school bus enters the intersection from the top right corner and turns left. The clip ends with the black car and the yellow school bus driving away from the intersection.\n\n\u2022 00:25-00:55 The clip continues with a green car driving through th

In [26]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": "vila-1.5",
    "max_tokens": 1024,
    "temperature": 0.2,
    "top_p": 0.8,
    "chunk_duration": 5,
    "chunk_overlap_duration": 1
}

start_t = time.time() 
response = requests.post(summarize_endpoint, json=body)
summary_5_time = time.time() - start_t
response = check_response(response)
summary_5 = response["choices"][0]["message"]["content"]

Response Code: 200
Response Status: Success
{
    "id": "7590978b-da78-4f65-bdbb-e247ac54ccb0",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "Here is the detailed timeline in the format 'mm:ss-mm:ss Description':\n\n\u2022 00:00-00:13 The clip shows a four-way intersection with a stop sign on the left side. There are no vehicles or pedestrians in the clip. The clip is taken during the daytime and the shadows indicate it is either morning or late afternoon.\n\u2022 00:12-00:17 A car is seen driving down the road and then turning left.\n\u2022 00:16-00:21 A black car is seen driving through the intersection.\n\u2022 00:20-00:41 The clip shows a four-way intersection with a stop sign on the left side. A black car is seen driving through the intersection. A green car is then seen driving through the intersection and turning right. The clip also shows a yellow school bus driving down the road and 

In [27]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Chunk Size: 30 seconds </h1>
  <h1> Generation time: {round(summary_30_time, 2)} seconds </h1>
    \n{summary_30}
  </div>
  <div style="flex: 1;">
  <h1> Chunk Size: 5 seconds </h1>
  <h1> Generation time: {round(summary_5_time, 2)} seconds </h1>
    \n{summary_5}
  </div>
</div>
"""

In [28]:
Markdown(markdown_string)


<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Chunk Size: 30 seconds </h1>
  <h1> Generation time: 35.78 seconds </h1>
    
Here is the detailed timeline in the format 'mm:ss-mm:ss Description':

• 00:00-00:30 The clip begins with an empty intersection, with no vehicles or pedestrians in sight. The road markings are clearly visible, and the surrounding area is well-maintained with trees and street lamps. As the clip progresses, a black car enters the intersection from the top left corner, followed by a red car from the top right corner. The black car continues straight, while the red car turns right. Shortly after, a yellow school bus enters the intersection from the top right corner and turns left. The clip ends with the black car and the yellow school bus driving away from the intersection.

• 00:25-00:55 The clip continues with a green car driving through the intersection, followed by a yellow school bus. A red fire truck then enters the intersection, followed by a black truck. The fire truck then turns left, and the black truck continues straight.

• 00:50-01:20 A yellow school bus is driving through the intersection.

• 01:15-01:45 A red car is seen driving down the road and then turns into the intersection.

• 01:40-02:10 A red car and a yellow car are seen driving through a four-way intersection with a stop sign visible on the left side. The clip is taken from an aerial view, and the vehicles are moving in a clockwise direction. The red car is stopped at the intersection, while the yellow car is moving forward, and a white fire truck with red and black stripes is also moving forward on the right side of the intersection.
  </div>
  <div style="flex: 1;">
  <h1> Chunk Size: 5 seconds </h1>
  <h1> Generation time: 74.81 seconds </h1>
    
Here is the detailed timeline in the format 'mm:ss-mm:ss Description':

• 00:00-00:13 The clip shows a four-way intersection with a stop sign on the left side. There are no vehicles or pedestrians in the clip. The clip is taken during the daytime and the shadows indicate it is either morning or late afternoon.
• 00:12-00:17 A car is seen driving down the road and then turning left.
• 00:16-00:21 A black car is seen driving through the intersection.
• 00:20-00:41 The clip shows a four-way intersection with a stop sign on the left side. A black car is seen driving through the intersection. A green car is then seen driving through the intersection and turning right. The clip also shows a yellow school bus driving down the road and turning right at the intersection. A red truck is also seen driving through the intersection.
• 00:40-00:49 A red car is seen driving through the intersection and then down the road, where it passes a black truck driving in the opposite direction.
• 00:48-00:57 A yellow school bus is driving through the intersection and then turns right.
• 00:56-01:01 A yellow car is seen driving through the intersection and turning left.
• 01:00-01:05 The clip shows a yellow car driving through the intersection. The car is the only vehicle visible in the clip.
• 01:04-01:13 The clip shows a four-way intersection with a stop sign on the left side. There are no vehicles or pedestrians in the clip.
• 01:12-01:21 The clip begins with a view of a four-way intersection with white lines marking the lanes and a stop sign visible on the left. The sky is clear, and the road is empty. As the clip progresses, a red car enters the intersection from the left, followed by a yellow car from the right. The red car stops at the intersection, and the yellow car continues through the intersection.
• 01:20-01:33 The clip shows a red car and a yellow car driving through the intersection. The red car is on the left side of the intersection and the yellow car is on the right side. The red car is moving towards the right and the yellow car is moving towards the left. The clip is taken from an aerial view and the intersection has white lines and a stop sign. The red car is moving forward and the yellow car is moving backward.
• 01:32-01:41 A red car is seen driving through the intersection and then a black car with flashing lights is seen driving through the intersection. The clip shows a black and white police car with flashing lights and a red car with a yellow car behind it. The police car is in the middle of the intersection and the red car is behind it. The yellow car is behind the red car. The clip is taken from an aerial view.
• 02:00-02:13 The clip shows a red car and a yellow car driving through the intersection from one side to the other, with the red car on the left and the yellow car on the right, as seen from an aerial view. The intersection has white lines and arrows indicating the direction of traffic.
• 02:04-02:09 A black and white police car with flashing lights and the word "POLICE" on the side drives through the intersection, followed by a red car.
• 02:12-02:17 The clip shows a red sedan, a yellow sports car, and a black police car with flashing lights on the roof, with the police car parked on the side of the road and the sports car driving down the road.
• 02:16-02:30 The clip shows a four-way intersection with a stop sign visible on the left side. There are three vehicles visible: a red car, a yellow car, and either a white fire truck with red and yellow stripes, a black car with flashing lights, or a black and white police car. The red car is stopped at the intersection, while the yellow car is moving forward. The other vehicle is either parked on the right side, turning right, or approaching the intersection from the right side. The clip is taken from an aerial view, and the vehicles are moving in a clockwise direction.
  </div>
</div>


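Comparing the two summaries, the one generated with the 5-second chunk size should contain more detail and finer-grained timestamp information, at the cost of a longer generation time.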

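### 4.5.3 Model Parameters

The summarization API also accepts parameters that control how the LLM behaves when generating the summary. The important parameters are:
- max_tokens
- temperature
- top_p

The ```max_tokens``` parameter sets the maximum length of the generated summary. A summary that exceeds max_tokens is truncated. If you find that summaries are being cut off, increase max_tokens. If summaries are too verbose, decrease max_tokens.

The ```temperature``` and ```top_p``` parameters affect the probabilities used to choose the next output token. A higher temperature makes the selection probability more uniform across all tokens. A high top_p allows a wider variety of tokens to be selected. This diversity is useful for creative writing and idea generation, but it can also lead to hallucinations.

For use cases that need repeatable results and few hallucinations, use lower values for ```temperature``` and ```top_p```.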
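The effect of these two parameters can be illustrated on a toy distribution. The sketch below shows the standard temperature-scaled softmax and nucleus (top-p) filtering in plain Python. It is a generic illustration with made-up logits, not the sampling code used by VSS or its LLM backend.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.1)
high = softmax_with_temperature(logits, 2.0)
print("T=0.1:", [round(p, 3) for p in low])
print("T=2.0:", [round(p, 3) for p in high])
print("top_p=0.1 keeps", sorted(top_p_filter(high, 0.1)))
print("top_p=0.9 keeps", sorted(top_p_filter(high, 0.9)))
```

At low temperature nearly all probability sits on the top token, while at high temperature it is spread across the candidates. A small top_p then keeps only the most likely token, and a large top_p keeps more of them.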

The following cells compare summaries generated with high and low ```temperature``` and ```top_p``` values.

In [29]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": "vila-1.5",
    "max_tokens": 1024,
    "temperature": 0.9,
    "top_p": 0.9,
    "chunk_duration": 20
}

response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
summary_high_t = response["choices"][0]["message"]["content"]

Response Code: 200
Response Status: Success
{
    "id": "03c623a0-8f33-4fc7-841e-882e71e1e08c",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "Here is the detailed timeline in the format 'mm:ss-mm:ss Description':\n\n\u2022 00:00-01:00 The video shows a series of images taken from an aerial perspective of a traffic intersection. The images capture the movement of various vehicles, including cars, trucks, and a yellow school bus. The vehicles are seen entering and exiting the intersection, with some waiting at the stop sign and others moving through the intersection. The colors of the vehicles vary, with the yellow school bus being the most distinctive. A white car is visible driving down the road. A red fire truck, a black truck, a yellow car, a red car, and a blue car are also seen navigating the intersection, with some turning left or right. The images are taken during the daytime, and the s

In [30]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": "vila-1.5",
    "max_tokens": 1024,
    "temperature": 0.1,
    "top_p": 0.1,
    "chunk_duration": 20
} 
response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
summary_low_t = response["choices"][0]["message"]["content"]

Response Code: 200
Response Status: Success
{
    "id": "a09eea89-e0b9-4973-a160-b8abad0d0102",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "Here is the detailed timeline in the format 'mm:ss-mm:ss Description':\n\n\u2022 00:00-00:20 The clip begins with a view of an empty intersection with white lines marking the lanes and a stop sign visible on the left. The sky is clear and the shadows are long, indicating it is either early morning or late afternoon.\n\u2022 00:20-01:40 A car enters the intersection from the top left and turns right, followed by another car entering from the top right and turning left. The cars are dark in color, and the intersection is surrounded by greenery and a few street lamps.\n\u2022 01:40-02:04 Various cars are seen driving through the intersection, including a black car, a red car, a yellow car, and a combination of a red and yellow car.\n\u2022 02:04-02:04 A re

In [31]:
markdown_string = f"""
<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Low Temperature, Low Top P</h1>
    \n{summary_low_t}
  </div>
  <div style="flex: 1;">
  <h1> High Temperature, High Top P </h1>
    \n{summary_high_t}
  </div>
</div>
"""
Markdown(markdown_string)


<div style="display: flex; gap: 20px;">
  <div style="flex: 1;">
  <h1> Low Temperature, Low Top P</h1>
    
Here is the detailed timeline in the format 'mm:ss-mm:ss Description':

• 00:00-00:20 The clip begins with a view of an empty intersection with white lines marking the lanes and a stop sign visible on the left. The sky is clear and the shadows are long, indicating it is either early morning or late afternoon.
• 00:20-01:40 A car enters the intersection from the top left and turns right, followed by another car entering from the top right and turning left. The cars are dark in color, and the intersection is surrounded by greenery and a few street lamps.
• 01:40-02:04 Various cars are seen driving through the intersection, including a black car, a red car, a yellow car, and a combination of a red and yellow car.
• 02:04-02:04 A red car and a yellow car are seen driving through the intersection. The red car is driving straight, the yellow car is turning left, and a black car with flashing lights on the roof is turning right.
  </div>
  <div style="flex: 1;">
  <h1> High Temperature, High Top P </h1>
    
Here is the detailed timeline in the format 'mm:ss-mm:ss Description':

• 00:00-01:00 The video shows a series of images taken from an aerial perspective of a traffic intersection. The images capture the movement of various vehicles, including cars, trucks, and a yellow school bus. The vehicles are seen entering and exiting the intersection, with some waiting at the stop sign and others moving through the intersection. The colors of the vehicles vary, with the yellow school bus being the most distinctive. A white car is visible driving down the road. A red fire truck, a black truck, a yellow car, a red car, and a blue car are also seen navigating the intersection, with some turning left or right. The images are taken during the daytime, and the shadows indicate that the sun is low in the sky, possibly during the late afternoon.
• 01:00-01:20 The yellow car is initially stopped at the intersection, but then begins to turn left, crossing the stop line on the intersection.
• 01:20-01:40 The red car is stationary and the yellow car is moving.
• 01:40-02:10 The scene begins with two cars, a red one and a yellow one, driving away from the intersection. They are followed by a police car that enters the intersection. Shortly after, a fire truck enters the intersection, and then a black car with a siren is stopped at the intersection with its flashers on.
  </div>
</div>


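Adjusting temperature and top_p may not change the output quality much. Review the side-by-side comparison above and see whether you can spot any differences between the two outputs. For summarization, setting both temperature and top_p to 0.2 generally gives good output quality. The prompts and the chunk size usually affect output quality more than temperature and top_p do.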
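
To build some intuition for why low values make the output nearly deterministic, the toy sketch below applies the common textbook definitions of temperature scaling and top-p (nucleus) filtering to a made-up next-token distribution. It is not the VSS or VILA implementation; the logits and the "high" settings (1.0 and 0.9) are hypothetical values chosen only for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Hypothetical scores for four candidate tokens (not taken from any model).
toy_logits = [2.0, 1.5, 0.5, 0.0]

low = top_p_filter(apply_temperature(toy_logits, 0.1), 0.1)
high = top_p_filter(apply_temperature(toy_logits, 1.0), 0.9)
print("low settings keep token indices:", sorted(low))
print("high settings keep token indices:", sorted(high))
```

With these toy numbers, the low settings leave a single candidate token, while the high settings leave three to sample from. That is the mechanism behind the more varied wording in a high-temperature summary.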

## 4.6 Challenge - Bridge Inspection Use Case

A video of a drone flying over an old bridge is provided in this notebook's data folder as ```bridge.mp4``` (the path stored in `bridge_video`).

In [32]:
Video(bridge_video, width=1000)

First, open and upload the ```bridge.mp4``` video file.

In [33]:
with open(bridge_video, "rb") as file:
    files = {"file": ("bridge_video", file)}
    data = {"purpose":"vision", "media_type":"video"}
    response = requests.post(files_endpoint, data=data, files=files)
response = check_response(response)
video_id = response["id"]

Response Code: 200
Response Status: Success
{
    "id": "6f4065dd-a312-462c-81c8-06eebcbaa70a",
    "bytes": 112950948,
    "filename": "bridge_video",
    "purpose": "vision",
    "media_type": "video"
}
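
As an optional sanity check after an upload, the sketch below compares the local file size with the `bytes` value in the response. It assumes that `bytes` reports the size of the stored file, which the output above suggests but this notebook does not state. The helper name `upload_size_matches` is hypothetical, and the demo uses a throwaway file and a mocked response instead of the VSS server.

```python
import os
import tempfile

def upload_size_matches(local_path, upload_response):
    """Return True when the server-reported byte count equals the local file size."""
    return os.path.getsize(local_path) == upload_response.get("bytes")

# Self-contained demo: a 5-byte temporary file and mocked response dictionaries.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"12345")
demo_ok = upload_size_matches(tmp.name, {"bytes": 5})
demo_bad = upload_size_matches(tmp.name, {"bytes": 4})
os.remove(tmp.name)
print(demo_ok, demo_bad)
```

In this notebook the equivalent call would be `upload_size_matches(bridge_video, response)`, run after `check_response` has returned the parsed upload response.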


Tune the following parameters to produce the most informative summary of the bridge's condition:

In [34]:
#Fill in these parameters (replace each None before running the next cell)
prompt = None #Consider what the VLM needs to look for 
caption_summarization_prompt = None
summary_aggregation_prompt = None #Include what the report should have and any formatting requirements 
chunk_duration = None
temperature = None
top_p = None
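
As one possible starting point for the parameters above, the sketch below collects hypothetical values in a dictionary and assembles a request body with the same field names this notebook sends to the summarize endpoint. The prompt texts are illustrations, not a reference solution. The `chunk_duration`, `temperature`, and `top_p` values reuse numbers that appear earlier in this lab, and the helper `build_summarize_body` is an assumed name, not part of VSS.

```python
# Hypothetical starting values; tune them for your own video and use case.
example_params = {
    "prompt": (
        "Describe the visible condition of the bridge: cracks, rust, "
        "missing or damaged parts, and vegetation growth."
    ),
    "caption_summarization_prompt": (
        "Combine the captions into concise bullet points in the format "
        "start_time-end_time: observation."
    ),
    "summary_aggregation_prompt": (
        "Write an inspection report with sections for structure, surface, "
        "and surroundings, and list each issue with its timestamp."
    ),
    "chunk_duration": 20,
    "temperature": 0.2,
    "top_p": 0.2,
}

def build_summarize_body(video_id, params, model="vila-1.5", max_tokens=1024):
    """Assemble the request body, refusing to proceed with unset values."""
    unset = [name for name, value in params.items() if value is None or value == ""]
    if unset:
        raise ValueError(f"Fill in these parameters first: {', '.join(unset)}")
    return {"id": video_id, "model": model, "max_tokens": max_tokens, **params}

# "demo-video-id" is a placeholder; in the notebook, pass the real video_id.
demo_body = build_summarize_body("demo-video-id", example_params)
print(sorted(demo_body))
```

Checking for unset values before the request gives a clear error message locally, which is easier to act on than an error returned by the server.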

In [None]:
body = {
    "id": video_id,
    "prompt": prompt,
    "caption_summarization_prompt": caption_summarization_prompt,
    "summary_aggregation_prompt": summary_aggregation_prompt,
    "model": "vila-1.5",
    "max_tokens": 1024,
    "temperature": temperature,
    "top_p": top_p,
    "chunk_duration": chunk_duration
}

response = requests.post(summarize_endpoint, json=body)
response = check_response(response)
summary = response["choices"][0]["message"]["content"]
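
Note that `check_response` returns `None` when the status code is not 200, so indexing the result directly raises a `TypeError` if the request fails. A minimal defensive sketch is shown below. The helper name `extract_summary` is hypothetical, and the mocked response only mirrors the shape of the summarize output shown earlier in this notebook.

```python
def extract_summary(response_json):
    """Return the summary text, or None when the response is missing or malformed."""
    if not isinstance(response_json, dict):
        return None  # check_response returns None on a non-200 status
    choices = response_json.get("choices") or []
    if not choices:
        return None
    message = choices[0].get("message") or {}
    return message.get("content")

# Mocked response shaped like the summarize output, used here instead of a live call.
ok = {
    "choices": [
        {"finish_reason": "stop", "index": 0, "message": {"content": "demo summary"}}
    ]
}
print(extract_summary(ok))
print(extract_summary(None))
```

In the cell above, `summary = extract_summary(response)` followed by a check for `None` would let the notebook report a failed request instead of stopping with a traceback.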

In [None]:
#render the summary output 
display(Markdown("### Summary Output"))
markdown_string = "\n".join(f"> {line}" for line in summary.splitlines())
display(Markdown(markdown_string))

## Next Steps

In this lab, we learned how to use VSS to summarize videos. In the next lab, we will implement a question-answering system.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)