LaViP: Language-Grounded Visual Prompting

This repository contains the official implementation of our AAAI 2024 paper.

Abstract

We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks. By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder, eliminating the need to modify or add to the model's parameters. Due to this design choice, our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained. We will empirically demonstrate that, compared to prior art, grounding visual prompts with language enhances both the accuracy and speed of adaptation. Moreover, our algorithm excels in base-to-novel class generalization, overcoming limitations of visual prompting and exhibiting the capacity to generalize beyond seen classes. We thoroughly assess and evaluate our method across a variety of image recognition datasets, such as EuroSAT, UCF101, DTD, and CLEVR, spanning different learning situations, including few-shot learning, base-to-novel class generalization, and transfer learning.


Research Highlights

  • Language-Grounded, Input-Dependent Dynamic Visual Prompting: To the best of our knowledge, this is the first work to explore language-grounded, input-dependent visual prompting without relying on an external model. To this end, we devise low-rank learnable vectors. Grounding the prompts in language both improves classification accuracy and yields roughly 3x faster convergence (an illustrative sketch follows this list).
  • New Algorithm for Extending Visual Prompting Beyond Seen Classes: We propose a Kronecker-based encoding fusion method that generalizes visual prompting to classes not seen during training.
  • Support for Visual Prompting in Gradient-Free Environments: Equipped with language-grounded, input-aware prompts, LaViP can be used in gradient-free settings (i.e., where access to model parameters and backpropagation is forbidden) with more than 15x faster convergence.
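
The snippet below is a minimal, illustrative sketch of the idea behind the first two highlights, not the official implementation: a padding-style visual prompt is produced as the product of a low-rank factor derived from the class-name text embeddings (language grounding) and a factor predicted from the image embedding (input dependence), and a Kronecker product composes encodings so prompts can be built for unseen classes. All names, dimensions, and hyperparameters are placeholder assumptions.

import torch
import torch.nn as nn

class LowRankPromptGenerator(nn.Module):
    """Sketch: visual prompt P = A @ B, with A derived from language and B from the image."""

    def __init__(self, text_dim=512, img_dim=512, rank=4, img_size=224, pad=28):
        super().__init__()
        self.img_size, self.rank = img_size, rank
        m, n = 3 * img_size, img_size
        # Left factor A (m x r) comes from the pooled class-name text embedding.
        self.text_to_A = nn.Linear(text_dim, m * rank)
        # Right factor B (r x n) is predicted from the image embedding.
        self.img_to_B = nn.Linear(img_dim, rank * n)
        # Only a frame of width `pad` around the image is perturbed.
        mask = torch.ones(1, 3, img_size, img_size)
        mask[:, :, pad:img_size - pad, pad:img_size - pad] = 0
        self.register_buffer("mask", mask)

    def forward(self, img_emb, text_emb):
        # img_emb: (B, img_dim) image features; text_emb: (text_dim,) pooled text features.
        b = img_emb.shape[0]
        A = self.text_to_A(text_emb).view(3 * self.img_size, self.rank)   # (m, r)
        B = self.img_to_B(img_emb).view(b, self.rank, self.img_size)      # (B, r, n)
        prompt = torch.einsum("mr,brn->bmn", A, B).view(b, 3, self.img_size, self.img_size)
        return prompt * self.mask  # added to the normalized input image

def kron_fuse(factor_base, factor_novel):
    """Kronecker-based fusion of two small encoding factors; the result's shape is the
    elementwise product of the input shapes, letting prompts be composed for classes
    not seen during training (sketch of the idea only)."""
    return torch.kron(factor_base, factor_novel)

Only the two small projection layers are learned; the vision-language backbone stays frozen, which is what makes the approach parameter-efficient and compatible with black-box adaptation.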


Experiments

  • Main performance (Tables 1, 2, and 3 of the paper)

    • 12 transfer learning benchmarks - [Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, Resisc45, CLEVR, UCF101]

Setup

  • Run the following commands to create the environment.
    • Note that we slightly modified Dassl.pytorch into mm_dassl for flexible experiments.
# Clone this repo
git clone https://github.com/NilakshanKunananthaseelan/LaViP.git
cd LaViP

# Create a conda environment
conda create -y -n lavip python=3.9.16

# Activate the environment
conda activate lavip

# Install torch and torchvision
# Please refer to https://pytorch.org/ if you need a different cuda version
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia


# Install dependencies
cd mm_dassl
pip install -r requirements.txt
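
Optionally, verify the environment with a quick check that the pinned builds were installed and that the CUDA runtime is visible (the expected versions come from the install commands above):

# Optional sanity check for the environment created above
import torch
import torchvision

print(torch.__version__)          # expected: 1.13.1
print(torchvision.__version__)    # expected: 0.14.1
print(torch.cuda.is_available())  # should be True if the CUDA 11.7 build sees a GPU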

Data preparation


Run

Transfer learning benchmarks

  • Move to the LaViP/scripts/method_name directory

  • On the target dataset, run the commands with the dataset-specific configurations shown below:

# for few-shot learning {1:dataset,2:epoch,3:init_lr}
sh fsl_train.sh dtd 300 1.0

# for base-to-novel generalization {1:dataset, 2:epoch, 3:init_lr}
sh base2new_train.sh dtd 300 1.0

# for transfer learning {1:dataset, 2:epoch, 3:init_lr}
sh train_full.sh dtd 300 1.0

# for gradient free learning {1:dataset, 2:epoch, 3:moms, 4:gamma, 5:spsa_c}
sh bb_train.sh dtd 300 0.9 0.2 0.005  
# for VP (white-box), specify {1:dataset, 2:epoch, 3:lr}
sh tl_bench.sh svhn 1000 40.0
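
For reference, the gradient-free setting typically relies on an SPSA-style estimator that approximates the gradient from two forward passes. Mapping the script arguments moms, gamma, and spsa_c to momentum, perturbation decay, and initial perturbation scale is our assumption about bb_train.sh; the helper below is an illustrative sketch rather than LaViP's exact update.

import torch

def spsa_step(params, loss_fn, step, lr, moms=0.9, gamma=0.2, spsa_c=0.005, m_buf=None):
    """One SPSA update: estimate the gradient of loss_fn at `params`
    from only two (black-box) loss evaluations, then apply momentum."""
    c_t = spsa_c / (step + 1) ** gamma                                     # decaying perturbation size
    delta = (torch.randint(0, 2, params.shape) * 2 - 1).to(params.dtype)   # Rademacher +-1 directions
    loss_plus = loss_fn(params + c_t * delta)
    loss_minus = loss_fn(params - c_t * delta)
    grad_est = (loss_plus - loss_minus) / (2 * c_t) * delta                # two-point gradient estimate
    m_buf = grad_est if m_buf is None else moms * m_buf + grad_est         # momentum accumulator
    return params - lr * m_buf, m_buf

Because only loss values are queried, no access to the encoder's weights or backpropagation graph is needed, which is what the black-box (bb_train.sh) setting assumes.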


Acknowledgements

Our overall experimental pipeline is based on the CoOp and CoCoOp repositories. For baseline construction, we borrowed and adapted code from the repositories of VP, MaPLe, BAR, and BlackVIP. We appreciate the authors (Zhou et al., Bahng et al., Khattak et al., Tsai et al., Oh et al.) for sharing their code.
