# Processing and narrating a video with ZhipuAI GLM's visual capabilities

This cookbook demonstrates how to use GLM's visual capabilities with a video. GLM-4 does not take video as input directly, but we can use vision and the new 128K context window to describe the static frames of a whole video at once.

**The model's video understanding is still limited, so the descriptions produced by this code will not be highly detailed.**

First, set the API key used to call the model.

In [7]:
import os
from zhipuai import ZhipuAI
# Replace the placeholder with your own key; ZhipuAI() reads ZHIPUAI_API_KEY from the environment.
os.environ["ZHIPUAI_API_KEY"] = "your api key"

client = ZhipuAI()
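
Hardcoding the key in a notebook makes it easy to leak or to leave the placeholder in place by accident. As a minimal sketch, assuming the key has already been exported in the shell, the hypothetical helper `require_api_key` below (not part of the `zhipuai` package) fails early with a clear message when the variable is missing or still holds the placeholder text.

```python
import os


def require_api_key(env=os.environ, name="ZHIPUAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    `require_api_key` is a hypothetical helper for illustration only.
    """
    key = env.get(name, "").strip()
    # Reject an empty value and the placeholder string used in this notebook.
    if not key or key == "your api key":
        raise RuntimeError(f"Set the {name} environment variable to a real API key")
    return key
```

The returned value could then be passed explicitly, for example `client = ZhipuAI(api_key=require_api_key())`.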

Next, pass the video to the model. Note that only the `GLM-4V-Plus` model supports video understanding. We encode the video file as a base64 string and send it to the model for analysis.

In [8]:
import base64

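video_path = "data/video_1.mp4"
# Read the video file and encode its bytes as a base64 string.
with open(video_path, 'rb') as video_file:
    video_base = base64.b64encode(video_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="glm-4v-plus",  # the only model that supports video understanding
    temperature=0.0,
    top_p=0,
    messages=[
        {
            "role": "user",
            "content": [
                # The base64-encoded video goes in the "url" field.
                {
                    "type": "video_url",
                    "video_url": {
                        "url": video_base
                    }
                },
                # Prompt, in English: "Please describe the setting of this video in detail
                # and what the puppy is doing, and tell it to me in a fun way."
                {
                    "type": "text",
                    "text": "请仔细描述这个视频的环境，图中的小狗在干啥，以有趣的方式讲给我听"
                }
            ]
        }
    ]
)
print(response.choices[0].message)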


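CompletionMessage(content='小狗在视频中正在玩耍,它正在跳进一个水坑里,溅起水花,看起来非常开心。它不停地跳进水坑里,又跑出来,然后再跳进去,好像永远都不会累一样。它的毛发湿透了,但似乎并不在意,只是不停地玩耍。有时候它会停下来,甩甩身上的水,然后再继续玩耍。看起来非常享受这个时刻,好像在享受水的清凉和自由的感觉。', role='assistant', tool_calls=None)

In English, the model's reply reads: "The puppy in the video is playing. It is jumping into a puddle, splashing water, and looks very happy. It keeps jumping into the puddle, running out, and jumping back in, as if it would never get tired. Its fur is soaked, but it does not seem to mind and just keeps playing. Sometimes it stops to shake the water off its body and then carries on playing. It seems to be really enjoying the moment, as if savoring the coolness of the water and the feeling of freedom."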
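
The encoding and message-building steps in the request above can be pulled into small reusable functions. This is only a sketch: `encode_video` and `build_video_messages` are hypothetical helper names, and the 20 MB default for `max_bytes` is an assumption for illustration, not a value taken from this cookbook, so check the current ZhipuAI documentation for the real upload limits.

```python
import base64
import os

# Assumed limit for illustration only; verify against the ZhipuAI documentation.
MAX_VIDEO_BYTES = 20 * 1024 * 1024


def encode_video(path, max_bytes=MAX_VIDEO_BYTES):
    """Return the file at `path` as a base64 string, refusing oversized files."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes; limit is {max_bytes} bytes")
    with open(path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode("utf-8")


def build_video_messages(video_b64, prompt):
    """Build a `messages` list in the same shape as the request above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_b64}},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

With these in place, the call would become `client.chat.completions.create(model="glm-4v-plus", messages=build_video_messages(encode_video(video_path), prompt))`, where `prompt` is the question to ask about the video.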


Video understanding typically has a long response time; expect to wait several tens of seconds for a reply.
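
To see how long a request actually takes, the call can be wrapped in a small timer. The sketch below uses only the standard library; `timed_call` is a hypothetical helper and does not contact the API itself.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call `func` with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `response, seconds = timed_call(client.chat.completions.create, model="glm-4v-plus", messages=messages)` would return the response together with the wall-clock time of the request, assuming `messages` holds the payload shown earlier.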