# Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

This tutorial walks through two usage examples:
* Chat with Video-CCAM using image and video inputs
* Evaluate Video-CCAM on the supported benchmarks (MVBench, Video-MME, MLVU, VideoVista)

## Setup

In [None]:
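# core dependencies (pillow provides the `PIL` module imported in the chat example)
%pip install -qU pip torch transformers decord pysubs2 imageio accelerate pillow
# FlashAttention support, required for attn_implementation='flash_attention_2'
%pip install -q flash-attn --no-build-isolation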

## Initialize Video-CCAM

* Download the Video-CCAM model weights (or reuse a local copy), then load the model, tokenizer, and image processor

In [1]:
import os

import torch
from huggingface_hub import snapshot_download
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# If the model is already downloaded, replace the call below with its local path.
model_path = snapshot_download(repo_id='JaronTHU/Video-CCAM-7B-v1.2')
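
The call above stores the snapshot in the Hugging Face cache. To keep the weights in a directory of your own choosing instead, a minimal sketch could look like the following. The helper name `resolve_model_path` and the `config.json` check are assumptions made for illustration; `snapshot_download` and its `local_dir` argument come from `huggingface_hub`.

```python
from pathlib import Path


def resolve_model_path(repo_id: str, local_dir: str) -> str:
    """Return local_dir if it already holds a snapshot, otherwise download into it.

    Hypothetical helper: a directory counts as "already downloaded" when it
    contains a config.json file, which is an assumption of this sketch.
    """
    target = Path(local_dir)
    if (target / 'config.json').is_file():
        return str(target)
    # Imported lazily so the local-path branch works without network access.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=str(target))
```

For example, `model_path = resolve_model_path('JaronTHU/Video-CCAM-7B-v1.2', './models/Video-CCAM-7B-v1.2')` would download on the first run and reuse the directory afterwards (the `./models/...` path is only a placeholder).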

In [2]:
videoccam = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='cuda:0',
    attn_implementation='flash_attention_2'
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

image_processor = AutoImageProcessor.from_pretrained(model_path)

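
Loading with `attn_implementation='flash_attention_2'` requires the `flash-attn` package from the Setup section. As a sketch, and assuming the model's remote code also accepts the `'sdpa'` implementation (not verified here for Video-CCAM), the implementation could be chosen at runtime. `pick_attn_implementation` is a hypothetical helper, not part of the Video-CCAM code.

```python
import importlib.util


def pick_attn_implementation(preferred: str = 'flash_attention_2',
                             fallback: str = 'sdpa') -> str:
    """Return `preferred` when the flash_attn package is importable, else `fallback`."""
    has_flash_attn = importlib.util.find_spec('flash_attn') is not None
    return preferred if has_flash_attn else fallback


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result can then be passed as `attn_implementation=attn_implementation` to `AutoModel.from_pretrained`.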

## Chat with Video-CCAM

In [3]:
from PIL import Image
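# `eval` here is the local eval module expected next to this notebook, not Python's built-in eval()
from eval import load_decord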

messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            # Chinese for: "Please describe this video in detail."
            'content': '<video>\n请仔细描述这个视频。'
        }
    ]
]

images = [
    [Image.open('assets/example_image.jpg').convert('RGB')],
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

response = videoccam.chat(messages, images, tokenizer, image_processor, max_new_tokens=512, do_sample=False)

print(response)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


["The image depicts a serene and picturesque landscape featuring a wooden dock extending into a calm lake. The dock is constructed from weathered wooden planks, showing signs of age and exposure to the elements. It leads directly towards a small, distant structure that appears to be a small pier or platform, also made of wood. The water in the lake is still, creating a perfect reflection of the surrounding scenery, including the dock, the distant pier, and the lush greenery.\n\nIn the background, a dense forest of evergreen trees stretches across the landscape, leading up to a range of mountains. The mountains are partially covered in snow, indicating a higher altitude and possibly a colder climate. The sky above is overcast, with a blanket of clouds that diffuse the light, giving the scene a soft, muted appearance. The overall atmosphere is tranquil and peaceful, evoking a sense of solitude and natural beauty.\n\nThe composition of the image is balanced, with the dock and the pier ser
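
The video in the chat example is loaded with `sample_type='uniform'` and `num_frames=32`. The exact behaviour is defined by `load_decord` in the local `eval` module. As an illustration only, one common way to pick uniformly spaced frames is to take the midpoint of each of `num_frames` equal segments. The function `uniform_frame_indices` below is hypothetical and may differ from what `load_decord` actually does.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick up to num_frames indices spread evenly over [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError('total_frames and num_frames must be positive')
    if num_frames >= total_frames:
        # Short clip: keep every frame instead of repeating indices.
        return list(range(total_frames))
    segment = total_frames / num_frames
    # Midpoint of each segment, rounded down to a valid frame index.
    return [int(segment * (i + 0.5)) for i in range(num_frames)]


print(uniform_frame_indices(total_frames=10, num_frames=4))  # prints [1, 3, 6, 8]
```

With decord, indices like these could be passed to `VideoReader.get_batch` to decode only the selected frames.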

## Evaluate Video-CCAM on benchmarks

In [None]:
# MVBench
from eval import evaluate_mvbench

evaluate_mvbench(
    videoccam,
    tokenizer,
    image_processor,
    '<mvbench_path>',
    '<output_path>',
    num_frames=32,
    batch_size=16,
)

In [None]:
# Video-MME
from eval import evaluate_videomme

evaluate_videomme(
    videoccam,
    tokenizer,
    image_processor,
    '<videomme_path>',
    '<output_path>',
    sample_config=dict(
        sample_type='uniform',
        num_frames=96
    ),
    batch_size=1
)

In [None]:
# MLVU
from eval import evaluate_mlvu

evaluate_mlvu(
    videoccam,
    tokenizer,
    image_processor,
    '<mlvu_path>',
    '<output_path>',
    sample_config=dict(
        sample_type='uniform',
        num_frames=96
    ),
    batch_size=1
)

In [None]:
# VideoVista
from eval import evaluate_videovista

evaluate_videovista(
    videoccam,
    tokenizer,
    image_processor,
    '<videovista_path>',
    '<output_path>',
    sample_config=dict(
        sample_type='uniform',
        num_frames=96
    ),
    batch_size=1
)