This is the official PyTorch implementation of EZ-CLIP: Efficient Zero-Shot Video Action Recognition [arXiv]

Updates

Trained models are available for download via Google Drive.

Overview

Introduction

In this study, we present EZ-CLIP, a simple and efficient adaptation of CLIP that addresses the challenges of adapting an image-based model to video. EZ-CLIP uses temporal visual prompting for temporal adaptation, which requires no fundamental changes to the core CLIP architecture and preserves its generalization abilities. We also introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion, improving how well the model learns from video data.

Prerequisites

We provide a requirements.txt file listing the required libraries. Inside your environment (for example, a conda environment), install them with pip install -r requirements.txt.

Model Zoo

NOTE: All models in our experiments below use the publicly available ViT-B/16 based CLIP model.

Zero-shot results

All models are trained on Kinetics-400 and then evaluated directly on downstream datasets.

Model	Input	HMDB-51	UCF-101	Kinetics-600	Checkpoint
EZ-CLIP (ViT-B/16)	8x224	52.9	79.1	70.1	link
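
To make "evaluated directly" concrete, here is a minimal sketch of CLIP-style zero-shot classification: class names are embedded as text, the video is embedded from its frames, and the class with the highest cosine similarity is predicted. It assumes the video embedding is a plain mean of per-frame embeddings, which does not model EZ-CLIP's temporal prompting. The toy vectors stand in for encoder outputs, and the function names and class names are hypothetical, not taken from this repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding (an assumption)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[d] for f in frame_embeddings) / n for d in range(dim)]


def zero_shot_predict(frame_embeddings, text_embeddings, class_names):
    """Return the best-matching class name and the per-class cosine scores."""
    video = l2_normalize(mean_pool(frame_embeddings))
    scores = []
    for text in text_embeddings:
        text = l2_normalize(text)
        scores.append(sum(v * t for v, t in zip(video, text)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores


# Toy stand-ins for encoder outputs: 2 frames, 2 classes, 3 dimensions.
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
label, scores = zero_shot_predict(frames, texts, ["archery", "bowling"])
print(label, [round(s, 3) for s in scores])
```

No training on the downstream dataset is involved: only the list of class names changes between datasets.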

Base-to-novel generalization results

Here, we divide each dataset into base and novel classes. All models are trained on base classes and evaluated on both base and novel classes.

Dataset	Input	Base Acc.	Novel Acc.	HM	Model
K-400	8x224	73.1	60.6	66.3	link
HMDB-51	8x224	77.0	58.2	66.3	link
UCF-101	8x224	94.4	77.9	85.4	link
SSV2	8x224	16.6	13.3	14.8	link
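
The HM column can be reproduced from the other two, assuming HM denotes the harmonic mean of base and novel accuracy (the usual convention in base-to-novel benchmarks, and consistent with every row above). The function name below is illustrative.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of two accuracies; 0.0 if both are zero."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0
    return 2 * base_acc * novel_acc / total


# (base accuracy, novel accuracy) pairs copied from the table above.
results = {
    "K-400": (73.1, 60.6),
    "HMDB-51": (77.0, 58.2),
    "UCF-101": (94.4, 77.9),
    "SSV2": (16.6, 13.3),
}

for dataset, (base, novel) in results.items():
    print(f"{dataset}: HM = {harmonic_mean(base, novel):.1f}")
```

Because the harmonic mean is pulled toward the smaller value, a model cannot score well on HM by doing well on base classes alone.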

Data Preparation

We first extract videos into frames for fast reading. Please refer to 'Dataset_creation_scripts' for data pre-processing. We have successfully trained on Kinetics, UCF101, and HMDB51.
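
Once frames are on disk, a loader has to pick a fixed number of them per clip (8 in the tables above). The sketch below shows one common way to do that, taking the center frame of each of 8 equal segments. This sampling strategy is an assumption for illustration and may differ from what the scripts in this repository do; the function name is hypothetical.

```python
def sample_frame_indices(num_frames: int, num_segments: int = 8) -> list:
    """Pick one frame index from the center of each equal-length segment."""
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    seg_len = num_frames / num_segments
    # Clamp to the last valid index; short videos simply repeat frames.
    return [
        min(num_frames - 1, int(seg_len * i + seg_len / 2))
        for i in range(num_segments)
    ]


print(sample_frame_indices(80))  # an 80-frame video
print(sample_frame_indices(4))   # a video shorter than the segment count
```

Sampling indices rather than decoding the whole video is what makes pre-extracted frames fast to read: only the selected image files need to be opened.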

Training

# Train
python train.py --config configs/K-400/k400_train.yaml

Testing

# Test 
python test.py --config configs/ucf101/UCF_zero_shot_testing.yaml

Citation

If you find the code and pre-trained models useful for your research, please consider citing our paper:

@article{ez2022clip,
  title={EZ-CLIP: Efficient Zeroshot Video Action Recognition},
  author={Ahmad, Shahzad and Chanda, Sukalpa and Rawat, Yogesh S},
  journal={arXiv preprint arXiv:2312.08010},
  year={2024}
}

Acknowledgments

Our code is based on ActionCLIP.
