# Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

This tutorial walks through two usage examples:
* Chat with Video-CCAM
* Evaluate Video-CCAM on the supported benchmarks

## Setup

In [None]:
%pip install -qU pip torch transformers accelerate peft decord pysubs2 imageio hf_transfer

## Initialize Video-CCAM

* Download the Video-CCAM models to a local directory

In [None]:
%%bash
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download JaronTHU/Video-CCAM-4B-v1.1 --local-dir <your_local_path_1>
# HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download JaronTHU/Video-CCAM-9B-v1.1 --local-dir <your_local_path_2>
# HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download JaronTHU/Video-CCAM-14B-v1.1 --local-dir <your_local_path_3>

* Video-CCAM relies on the weights of SigLIP-SO400M (vision encoder) and one of three language models: Phi-3-mini-4k-instruct (4B), Yi-1.5-9B-Chat (9B), or Phi-3-medium-4k-instruct (14B). Video-CCAM downloads them automatically if they are not already in the Hugging Face cache.
* If you have already downloaded these weights to local directories, pass the paths as `llm_name_or_path` and `vision_encoder_name_or_path` to load the local copies instead.

In [None]:
import os
import torch
from transformers import AutoModel

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

videoccam = AutoModel.from_pretrained(
    '<your_local_path_1>',  # the directory passed to --local-dir in the download step
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    _attn_implementation='flash_attention_2',
    # llm_name_or_path='<your_local_llm_path>',
    # vision_encoder_name_or_path='<your_local_vision_encoder_path>'
)
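
The optional `llm_name_or_path` and `vision_encoder_name_or_path` arguments can also be assembled programmatically, so that a local directory is forwarded only when it actually exists. The following is a minimal sketch: the helper name `build_load_kwargs` is hypothetical and not part of Video-CCAM, and it assumes that omitting an argument falls back to the automatic download described above.

```python
import os
from typing import Dict, Optional


def build_load_kwargs(
    llm_path: Optional[str] = None,
    vision_encoder_path: Optional[str] = None,
) -> Dict[str, str]:
    """Return the optional path arguments for AutoModel.from_pretrained.

    Hypothetical helper (not part of Video-CCAM): a path is forwarded only
    if it points to an existing directory; otherwise the key is omitted.
    """
    candidates = {
        'llm_name_or_path': llm_path,
        'vision_encoder_name_or_path': vision_encoder_path,
    }
    return {
        key: path
        for key, path in candidates.items()
        if path is not None and os.path.isdir(path)
    }


# Example: a path that does not exist is dropped from the result.
extra_kwargs = build_load_kwargs(llm_path='/nonexistent/llm_dir')
print(extra_kwargs)
```

The resulting dictionary could then be unpacked into the `from_pretrained` call with `**extra_kwargs`, in place of the commented-out arguments.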

## Chat with Video-CCAM

In [None]:
from PIL import Image
from eval import load_decord

messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            'content': '<video>\nDescribe this video in detail.'
        }
    ]
]

images = [
    Image.open('assets/example_image.jpg').convert('RGB'),
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

response = videoccam.chat(messages, images, max_new_tokens=512, do_sample=False)

print(response)
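
For intuition about `sample_type='uniform'` and `num_frames`, the sketch below picks evenly spaced frame indices from a video. It illustrates uniform sampling in general, assuming a midpoint-of-segment convention; `uniform_frame_indices` is a hypothetical name, and the actual `load_decord` implementation in `eval` may choose its indices differently.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick up to `num_frames` evenly spaced indices from `total_frames`.

    Illustrative only: the video is split into `num_frames` equal segments
    and the frame at the middle of each segment is selected.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Fewer frames than requested: return every frame once.
        return list(range(total_frames))
    segment = total_frames / num_frames
    return [int(segment * (i + 0.5)) for i in range(num_frames)]


# A 320-frame clip sampled at 32 frames yields one index per 10-frame segment.
print(uniform_frame_indices(total_frames=320, num_frames=32)[:4])  # [5, 15, 25, 35]
```

Under this convention, raising `num_frames` shortens the gap between sampled frames, which is why the long-video benchmarks further down use a larger value than the chat example.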

## Evaluate Video-CCAM on benchmarks

In [None]:
# MVBench
from eval import evaluate_mvbench

evaluate_mvbench(
    videoccam,
    '<mvbench_path>',
    '<output_path>',
    num_frames=32,
    batch_size=2,
)

In [None]:
# Video-MME
from eval import evaluate_videomme

evaluate_videomme(
    videoccam,
    '<videomme_path>',
    '<output_path>',
    sample_config=dict(
        sample_type='uniform',
        num_frames=96
    ),
    batch_size=1
)

In [None]:
# MLVU
from eval import evaluate_mlvu

evaluate_mlvu(
    videoccam,
    '<mlvu_path>',
    '<output_path>',
    sample_config=dict(
        sample_type='uniform',
        num_frames=96
    ),
    batch_size=1
)

In [None]:
# VideoVista
from eval import evaluate_videovista

evaluate_videovista(
    videoccam,
    '<videovista_path>',
    '<output_path>',
    sample_config=dict(
        sample_type='uniform',
        num_frames=96
    ),
    batch_size=1
)
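
The Video-MME, MLVU, and VideoVista cells above share the same sampling settings, so they can be driven from a single loop. The sketch below is one possible way to do that: `run_benchmarks` is a hypothetical helper, the evaluation functions are supplied by the caller, and it assumes each one accepts `(model, data_path, output_path, sample_config=..., batch_size=...)` as in the cells above. MVBench is left out because its cell takes `num_frames` directly.

```python
from typing import Any, Callable, Dict, Tuple


def run_benchmarks(
    model: Any,
    benchmarks: Dict[str, Tuple[Callable[..., Any], str]],
    output_path: str,
    num_frames: int = 96,
    batch_size: int = 1,
) -> Dict[str, Any]:
    """Call each evaluation function with a shared uniform sampling config.

    `benchmarks` maps a label to (evaluate_fn, data_path). Hypothetical
    helper; it only forwards the arguments shown in the cells above.
    """
    results = {}
    for name, (evaluate_fn, data_path) in benchmarks.items():
        results[name] = evaluate_fn(
            model,
            data_path,
            output_path,
            sample_config=dict(sample_type='uniform', num_frames=num_frames),
            batch_size=batch_size,
        )
    return results
```

With the imports from the cells above, a call might look like `run_benchmarks(videoccam, {'Video-MME': (evaluate_videomme, '<videomme_path>'), 'MLVU': (evaluate_mlvu, '<mlvu_path>')}, '<output_path>')`. Whether the evaluation functions return a value is not specified here, so the collected results may all be `None`.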