# Tutorial 11: PLLaVA Video Understanding Model Inference

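We recommend running this tutorial on a SCOW AI cluster, using the code from the [PLLaVA-NPU repository](https://github.com/SunnyMass/PLLaVA-NPU).

This section uses the [pllava-7b](https://huggingface.co/ermu2001/pllava-7b) model to demonstrate inference with a multimodal video understanding model.

The tutorial proceeds in four steps:
1. Environment setup
2. Model download
3. Adapting the 3D pooling operator
4. Model inference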

## 1. Environment setup

We recommend using one 910B NPU for this tutorial.

Create a conda virtual environment and install the required libraries as follows.

```bash
conda create -n pllava python=3.10
conda activate pllava
pip install -r requirement.txt
```
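After installation, it can help to confirm that PyTorch can actually see the NPU before loading a 7B model. The snippet below is a minimal sketch, not part of the PLLaVA-NPU repository: the helper name `npu_available` is hypothetical, and it assumes the Ascend adapter package `torch_npu` is what provides the `npu` device in this environment. It returns `False` instead of raising when either package is missing.

```python
import importlib.util


def npu_available():
    """Return True only if torch and torch_npu import and report a usable NPU."""
    # Assumption: the 'npu' device is provided by the torch_npu adapter package.
    for name in ("torch", "torch_npu"):
        if importlib.util.find_spec(name) is None:
            return False
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it registers torch.npu)
        return bool(torch.npu.is_available())
    except Exception:
        # Driver or runtime problems surface here; treat them as "not available".
        return False


print("NPU available:", npu_available())
```

If this prints `False` on an Ascend machine, checking the driver with `npu-smi info` is a reasonable next step.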

## 2. Model download

Download the pllava-7b model as follows.

```bash
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download ermu2001/pllava-7b --local-dir pllava-7b
```

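## 3. Adapting the 3D pooling operator

Problem in the original code:

Running the original code on an NPU fails with `RuntimeError: adaptive_avg_pool3d only support D=1 && H=1 && W=1 current!`. The error means that the NPU does not support 3D adaptive pooling to an arbitrary output size, i.e. `torch._C._nn.adaptive_avg_pool3d` only works when the output size is 1 in every dimension. Working around the error by changing the pooling size produces an incorrect, misaligned mapping at inference time, which leads to hallucinations. Switching to 2D pooling or representing the temporal dimension in some other way does not reduce these hallucinations, whereas inference on a GPU produces accurate results. The 3D pooling function therefore has to be adapted at its root.

`AdaptiveAvgPool3d((T_out, H_out, W_out))` average-pools each video (shape [C, T, H, W]) over the three time and space dimensions to the specified shape [C, T_out, H_out, W_out]. For example, with an input of [B, C, T=8, H=14, W=14] and `--pooling_shape 4-12-12`, the output shape is [B, C, 4, 12, 12], which reduces the number of tokens.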
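The shape arithmetic in this example can be checked on CPU, where `AdaptiveAvgPool3d` supports arbitrary output sizes. This is an illustrative sketch only: the batch size and channel count (`B=1`, `C=8`) are placeholders chosen to keep the tensor small and are not the model's real dimensions.

```python
import torch
import torch.nn as nn

# Sizes from the example in the text; B and C are small placeholder values.
B, C, T, H, W = 1, 8, 8, 14, 14
pool = nn.AdaptiveAvgPool3d((4, 12, 12))

x = torch.randn(B, C, T, H, W)  # CPU tensor
y = pool(x)

# Number of spatio-temporal positions (tokens) before and after pooling.
tokens_before = T * H * W
tokens_after = y.shape[2] * y.shape[3] * y.shape[4]
print(tuple(y.shape), tokens_before, tokens_after)
```

Here the pooled tensor keeps `B` and `C` unchanged, while the number of spatio-temporal positions drops from 8 × 14 × 14 = 1568 to 4 × 12 × 12 = 576.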


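Solution:

Because the NPU does not support several 3D pooling operators, including `AdaptiveAvgPool3d`, we implement an NPU-compatible 3D pooling function by hand. It downsamples the temporal and spatial dimensions without changing the numerical semantics, so inference matches the GPU and avoids misaligned mappings and hallucinations. The function reproduces the behavior of `AdaptiveAvgPool3d` exactly, provided that the input temporal and spatial sizes are divisible by the target sizes. This covers most practical configurations, runs efficiently on the NPU, and keeps the reduction of visual feature tokens along the time and space dimensions consistent.

In **PLLaVA/models/pllava/modeling_pllava.py**, replace `self.pooling` in `class PllavaMultiModalProjector(nn.Module):` with the custom implementation below. This implementation has been verified to produce consistent outputs on GPU and NPU.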

```python
def adaptive_avg_pool3d_manual(self, x, output_size):
    """
    NPU-compatible replacement for AdaptiveAvgPool3d.

    x: [B, C, D, H, W]
    output_size: (d_out, h_out, w_out)
    """
    B, C, D, H, W = x.shape
    d_out, h_out, w_out = output_size
    assert D % d_out == 0 and H % h_out == 0 and W % w_out == 0, "Input size must be divisible by output size"

    # Block (kernel) size along each pooled dimension
    kd = D // d_out
    kh = H // h_out
    kw = W // w_out

    # Reshape to 8 dims: split each of D/H/W into (number of blocks, elements per block)
    x = x.view(B, C, d_out, kd, h_out, kh, w_out, kw)  # [B, C, d_out, kd, h_out, kh, w_out, kw]
    x = x.mean(dim=(3, 5, 7))  # average over the kd, kh, kw dimensions
    return x  # shape [B, C, d_out, h_out, w_out]
```
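The claimed equivalence with `AdaptiveAvgPool3d` in the divisible case can be checked on CPU against `torch.nn.functional.adaptive_avg_pool3d`. The following is a sketch under stated assumptions: `block_mean_pool3d` is a hypothetical standalone copy of the method (without `self`, and using `reshape` so it also accepts non-contiguous inputs), and the test sizes (16, 24, 24) to (4, 12, 12) are illustrative values chosen to satisfy the divisibility requirement. Note that an input with H = 14 and a target of 12 would not satisfy it and would trigger the assertion.

```python
import torch
import torch.nn.functional as F


def block_mean_pool3d(x, output_size):
    """Standalone copy of the reshape-and-mean pooling, for testing on CPU."""
    B, C, D, H, W = x.shape
    d_out, h_out, w_out = output_size
    assert D % d_out == 0 and H % h_out == 0 and W % w_out == 0, \
        "Input size must be divisible by output size"
    # Split each pooled dimension into (blocks, elements per block), then average per block.
    x = x.reshape(B, C, d_out, D // d_out, h_out, H // h_out, w_out, W // w_out)
    return x.mean(dim=(3, 5, 7))


torch.manual_seed(0)
x = torch.randn(2, 8, 16, 24, 24)            # illustrative sizes, all divisible
ref = F.adaptive_avg_pool3d(x, (4, 12, 12))  # reference result on CPU
out = block_mean_pool3d(x, (4, 12, 12))

max_diff = (ref - out).abs().max().item()
print(tuple(out.shape), max_diff)
```

When the sizes divide evenly, adaptive average pooling uses equal, non-overlapping windows, so the two results should agree up to floating-point rounding.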

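## 4. Model inference

PLLaVA-7B is an open-source multimodal chat model for video understanding tasks. It builds on an image large model and is further fine-tuned on video instruction data, which gives it video content understanding and question answering capabilities. Its underlying language model is llava-hf/llava-v1.6-vicuna-7b-hf, a Transformer-based model with autoregressive generation.

During inference, the `npu-smi info` command shows the NPU status.

We recommend running the code from the [PLLaVA-NPU repository](https://github.com/SunnyMass/PLLaVA-NPU) directly.

Run command: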


```bash
python run_demo.py \
  --video_path path_to_1-2.mp4 \
  --prompt "describe this video in detail" \
  --pretrained_model_name_or_path path_to_pllava7b \
  --weight_dir path_to_pllava7b \
  --use_lora \
  --num_frames 16 \
  --conv_mode plain \
  --max_new_tokens 128 \
  --video_caption
```

Pass `--video_caption` only for the video captioning task; omit it for other video understanding tasks.

```python
import torch
import torch_npu  # noqa: F401  (assumed installed; registers the 'npu' device unless already imported elsewhere)
import time
import cv2
from PIL import Image
from argparse import ArgumentParser
import torchvision.transforms as transforms

from tasks.eval.eval_utils import conv_templates, ChatPllava
from tasks.eval.model_utils import load_pllava

SYSTEM = """You are a powerful Video Magic ChatBot, a large vision-language assistant. 
You are able to understand the video content that the user provides and assist the user in a video-language related task.
The user might provide you with the video and maybe some extra noisy information to help you out or ask you a question. Make use of the information in a proper way to be competent for the job.
### INSTRUCTIONS:
1. Follow the user's instruction.
2. Be critical yet believe in yourself.
"""
SYSTEM2 = """
Describe this video. Pay attention to all objects in the video. The description should be useful for AI to re-generate the video. The description should be no more than six sentences. Here are some examples of good descriptions: 1. A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about. 2. Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field. 3. Drone view of waves crashing against the rugged cliffs along Big Sur's garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff's edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff’s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.
"""


def load_video(video_path, num_segments=4, resolution=336):
    """Sample up to num_segments evenly spaced frames and resize them to a square."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total_frames / num_segments) for i in range(num_segments)]

    frames = []
    resize = transforms.Resize((resolution, resolution))
    idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if idx in indices:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes frames as BGR
            pil_img = Image.fromarray(frame)
            pil_img = resize(pil_img)
            frames.append(pil_img)
        idx += 1
    cap.release()
    return frames


def parse_args():
    parser = ArgumentParser()
    parser.add_argument('--video_path', type=str, required=True)
    parser.add_argument('--prompt', type=str, required=True)
    parser.add_argument('--num_frames', type=int, default=4)
    parser.add_argument('--pretrained_model_name_or_path', type=str, required=True)
    parser.add_argument('--weight_dir', type=str, default=None)
    parser.add_argument('--use_lora', action='store_true')
    parser.add_argument('--lora_alpha', type=int, default=4)
    parser.add_argument('--conv_mode', type=str, default='plain')
    parser.add_argument('--max_new_tokens', type=int, default=200)
    parser.add_argument('--num_beams', type=int, default=1)
    parser.add_argument('--temperature', type=float, default=1.0)
    parser.add_argument('--video_caption', action='store_true')
    return parser.parse_args()


def main():
    args = parse_args()

    print("📦 Loading model...")
    model, processor = load_pllava(
        repo_id=args.pretrained_model_name_or_path,
        num_frames=args.num_frames,
        use_lora=args.use_lora,
        lora_alpha=args.lora_alpha,
        weight_dir=args.weight_dir,
    )
    model = model.to('npu').eval()
    print(f"Model device: {next(model.parameters()).device}")
    chat = ChatPllava(model, processor)

    print("📽️ Loading video frames...")
    frames = load_video(args.video_path, args.num_frames)
    img_list = [frames]  # must be a nested list: [ [PIL, PIL, PIL, ...] ]

    print("💬 Asking and answering...")
    conv = conv_templates[args.conv_mode].copy()
    # Use the captioning system prompt only when --video_caption is set
    if args.video_caption:
        conv = chat.ask(args.prompt, conv, SYSTEM2)
    else:
        conv = chat.ask(args.prompt, conv, SYSTEM)

    start_time = time.time()
    llm_message, _, _ = chat.answer(
        conv=conv,
        img_list=img_list,
        max_new_tokens=args.max_new_tokens,
        num_beams=args.num_beams,
        temperature=args.temperature
    )
    elapsed = time.time() - start_time

    print(f"\n⏱️ Inference took {elapsed:.2f} seconds")
    print("\n===== FINAL ANSWER =====\n")
    print(llm_message.strip())


if __name__ == '__main__':
    main()
```
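The frame sampling in `load_video` can be examined in isolation. The sketch below copies only the index computation into a hypothetical helper, `sample_indices`, so it runs without OpenCV or a video file. The frame counts are made-up values for illustration.

```python
def sample_indices(total_frames, num_segments):
    """Same index formula as load_video: evenly spaced frame positions."""
    return [int(i * total_frames / num_segments) for i in range(num_segments)]


# A 160-frame clip sampled with --num_frames 16: one frame every 10 frames.
long_clip = sample_indices(160, 16)
print(long_clip)

# A clip shorter than num_segments produces repeated indices.
short_clip = sample_indices(2, 4)
print(short_clip, "unique:", len(set(short_clip)))
```

Because `load_video` appends a frame at most once per decoded position, repeated indices mean a very short clip yields fewer frames than `--num_frames`. Whether the model handles that gracefully is not covered here, so using clips with at least `--num_frames` frames is the safer choice.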

To deploy the Gradio demo, run:

```bash
sh ./scripts/demo.sh
```

The server URL can be configured at the end of `PLLaVA/tasks/eval/demo/pllava_demo.py`.

```python
demo.launch(
    server_name="0.0.0.0",
    server_port=10034,
    root_path="/ai/api/proxy/ascend-k8s/relative/master/30003/proxy/10034"
)
```

![gradio](https://github.com/PKUHPC/scow-for-ai-tutorial-ascend/raw/tutorial11/tutorial11_PLLaVA视频理解模型推理/gradio.png)

---

> Authors: 张天戈; 林荣群; 贾川民; 马思伟
>
> Contact: tgzhang@stu.pku.edu.cn