GitHub - RenShuhuai-Andy/TimeChat: [CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou

News

[24.06.04] Add FAQ, see FAQ.md.
[24.06.04] Release zero-shot evaluation results of TimeChat-7b on several VideoLLM benchmarks (e.g., MVBench, TempCompass, etc.), see EVAL.md.
[24.01.09] Release TimeChat-7b 🤗 checkpoint and local demo.
[23.12.27] 🤗 Release the instruction-tuning dataset of TimeIT.
[23.12.06] Release the initial version of TimeChat.

Introduction

TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions:
- (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame;
- (2) a sliding video Q-Former that produces a variable-length video token sequence to accommodate videos of various durations.
We also construct an instruction-tuning dataset named TimeIT, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance.
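To make the sliding-window idea concrete, here is a minimal sketch of how a fixed-size window moved over the sampled frames yields a token sequence whose length grows with the video. It is an illustration only, not the repository's implementation: the function names, and the window size, stride, and tokens-per-window values in the usage lines, are assumptions chosen for the example.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame-index pairs (end exclusive) for each window.

    Simplification in this sketch: trailing frames that do not fit a full
    stride step are dropped; a video shorter than one window gets a single
    truncated window.
    """
    if num_frames <= 0:
        return []
    last_start = max(num_frames - window, 0)
    return [(s, min(s + window, num_frames))
            for s in range(0, last_start + 1, stride)]


def video_token_count(num_frames, window, stride, tokens_per_window):
    """Each window is compressed to a fixed number of tokens (assumed here),
    so the total sequence length scales with the number of windows."""
    return len(sliding_windows(num_frames, window, stride)) * tokens_per_window


if __name__ == "__main__":
    # Hypothetical settings: 32-frame windows, stride 32, 32 tokens per window.
    for frames in (32, 96, 192):
        print(frames, "frames ->", video_token_count(frames, 32, 32, 32), "tokens")
```

Under these assumed settings, a longer video simply contributes more windows, which is what lets the token sequence length track the video duration instead of being fixed.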

Example Outputs

An illustration of temporal localization capability of TimeChat

Examples for dense video captioning (left), temporal video grounding (middle), and video highlight detection (right)

Fine-tuned Checkpoints

The following checkpoints store only the learnable parameters (positional embedding layers, Time-aware Frame Encoder, Sliding Video Q-Former, linear projection layers, and LoRA).

Checkpoint	LLM backbone	Link	Note
TimeChat-2-7B-Finetuned	LLaMA-2 7B	link	Fine-tuned on the instruction-tuning data from TimeIT-104K (asr version) and Valley-73K (previous version of current Valley-65K)

Usage

Environment Preparation

First, install ffmpeg.

apt update
apt install ffmpeg

Then, create a conda environment:

conda env create -f environment.yml
conda activate timechat
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
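After installation, a quick sanity check can confirm that ffmpeg is on the PATH and that the pinned PyTorch packages are present. The sketch below is not part of the TimeChat repo; `check_environment` is a hypothetical helper, and the version pins are copied from the pip command above (local build suffixes such as `+cu113` are ignored when comparing).

```python
import shutil
from importlib import metadata

# Version pins copied from the pip command above.
EXPECTED = {"torch": "1.12.1", "torchvision": "0.13.1", "torchaudio": "0.12.1"}


def check_environment(expected=EXPECTED):
    """Return a dict mapping each requirement to True/False."""
    report = {"ffmpeg": shutil.which("ffmpeg") is not None}
    for package, pinned in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Strip a local build suffix like "+cu113" before comparing.
        report[package] = (
            installed is not None and installed.split("+")[0] == pinned
        )
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or unexpected version'}")
```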

Prerequisites

Before fine-tuning your own model (or reproducing our TimeChat model), make sure you have obtained the following checkpoints:

Pre-trained Image Encoder (EVA ViT-g)

wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth

Pre-trained Image Q-Former (InstructBLIP Q-Former)

wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth

Pre-trained Language Decoder (LLaMA-2-7B) and Video Encoder (Video Q-Former of Video-LLaMA)

Use git-lfs to download weights of Video-LLaMA (7B):

git lfs install
git clone https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned

Instruction-tuned TimeChat-7B

git lfs install
git clone https://huggingface.co/ShuhuaiRen/TimeChat-7b

The file structure looks like:

ckpt/
|-- Video-LLaMA-2-7B-Finetuned/
    |-- llama-2-7b-chat-hf/
    |-- VL_LLaMA_2_7B_Finetuned.pth
|-- instruct-blip/
    |-- instruct_blip_vicuna7b_trimmed.pth
|-- eva-vit-g/
    |-- eva_vit_g.pth
|-- timechat/
    |-- timechat_7b.pth
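Before launching training or the demo, it can help to verify that the downloads landed in this layout. The following is a small sketch, not a script shipped with the repo: `missing_checkpoints` is a hypothetical helper, and the default root `ckpt` assumes you run it from the directory that contains the `ckpt/` folder.

```python
from pathlib import Path

# Relative paths taken from the file structure listed above.
EXPECTED_PATHS = [
    "Video-LLaMA-2-7B-Finetuned/llama-2-7b-chat-hf",
    "Video-LLaMA-2-7B-Finetuned/VL_LLaMA_2_7B_Finetuned.pth",
    "instruct-blip/instruct_blip_vicuna7b_trimmed.pth",
    "eva-vit-g/eva_vit_g.pth",
    "timechat/timechat_7b.pth",
]


def missing_checkpoints(ckpt_root="ckpt"):
    """Return the expected paths that do not exist under ckpt_root."""
    root = Path(ckpt_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for path in missing:
            print("  ", path)
    else:
        print("All expected checkpoints found.")
```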

How to Run Demo Locally

Please refer to our Jupyter demo (demo.ipynb).

Instruction-Tuning

Data

For now, the fine-tuning dataset consists of:

104K time-sensitive instructions from TimeIT [link]
- see DATA.md
73K (now 65K) video-based instructions from Valley [link]

Script

Tuning

Configure the checkpoint and dataset paths in stage2_finetune_time104k_valley72k.yaml.

conda activate timechat
torchrun --nproc_per_node=8 train.py --cfg-path train_configs/stage2_finetune_time104k_valley72k.yaml

Evaluation

Configure the checkpoint and dataset paths in timechat.yaml.

Configure the downstream task in eval.sh.

bash eval.sh

Recommended GPUs

Instruction-tuning: 8xV100 (32G)
Inference: 1xA100 (40G/80G) or 1xA6000

Acknowledgement

We are grateful to the following awesome projects, which TimeChat builds on:

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
EVA-CLIP: Improved Training Techniques for CLIP at Scale
LLaMA: Open and Efficient Foundation Language Models
VideoChat: Chat-Centric Video Understanding
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Terms of Use

TimeChat is a research preview intended for non-commercial use only. You must NOT use TimeChat for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that could violate these guidelines.

Citation

If you find our project useful, please consider starring our repo and citing our paper as follows:

@article{Ren2023TimeChat,
  title={TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding},
  author={Shuhuai Ren and Linli Yao and Shicheng Li and Xu Sun and Lu Hou},
  journal={ArXiv},
  year={2023},
  volume={abs/2312.02051},
}
