Word and Descriptor Soups 🍜 [CVPR 2024] [ArXiv]

Code in this repo uses code from multimodal prompt learning, which in turn uses code from Co-CoOp and CoOp.

⏳ Installation

  • Install dassl library and other requirements.
# Instructions borrowed from

git clone
cd Dassl.pytorch/
pip install -r requirements.txt
python develop
cd ..

pip install open_clip_torch
pip install pytorch_metric_learning
  • Create a directory somewhere called data/. Download all 15 zip files from this shared Google Drive and unzip them into data/. The resulting file tree should look like:
|-- caltech-101
|-- dtd
|-- eurosat
|-- fgvc_aircraft
|-- food-101
|-- imagenet
|-- imagenet-adversarial
|-- imagenet-rendition
|-- imagenet-sketch
|-- imagenetv2
|-- oxford_flowers
|-- oxford_pets
|-- stanford_cars
|-- sun397
|-- ucf101
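
Before running anything, it can help to confirm that all 15 folders from the tree above are in place. The following is a minimal sketch, not part of the repository; the helper name `missing_datasets` is hypothetical, and the folder names are copied from the tree above.

```python
from pathlib import Path

# Folder names copied from the file tree above.
EXPECTED_DATASETS = [
    "caltech-101", "dtd", "eurosat", "fgvc_aircraft", "food-101",
    "imagenet", "imagenet-adversarial", "imagenet-rendition",
    "imagenet-sketch", "imagenetv2", "oxford_flowers", "oxford_pets",
    "stanford_cars", "sun397", "ucf101",
]


def missing_datasets(data_dir):
    """Return the expected dataset folders that are absent from data_dir.

    Hypothetical helper: it only checks that each folder exists, not
    that its contents are complete.
    """
    root = Path(data_dir)
    return [name for name in EXPECTED_DATASETS if not (root / name).is_dir()]
```

Calling `missing_datasets("data")` from the directory that contains data/ should return an empty list once every archive has been unzipped.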

Alternatively, follow the download instructions in installing datasets. Some of the dataset links there are stale, and you may also need to reorganize the directory structure.

Modify the following two lines in to reflect where you have your data/ dir and where you want the pretrained CLIP weights to be cached (which could be many gigabytes)

parser.add_argument('--cache_dir', default = "", type =str) # set to directory where you want large pretrained model weights to be cached
parser.add_argument('--data_dir', default = "", type =str)  # set to parent directory of data/
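
As an illustration of how these two defaults behave, here is a small self-contained sketch using only the standard `argparse` module. The paths are placeholders, and `build_parser` is a hypothetical wrapper, not a function from this repo. Note that `--data_dir` is the parent of data/, so if your datasets live in /path/to/data, the value is /path/to.

```python
import argparse


def build_parser():
    """Hypothetical wrapper around the two options shown above."""
    parser = argparse.ArgumentParser()
    # Placeholder paths: replace with your own locations.
    parser.add_argument('--cache_dir', default="/path/to/cache", type=str)
    parser.add_argument('--data_dir', default="/path/to", type=str)
    return parser


# With no command-line flags, argparse falls back to the defaults.
defaults = build_parser().parse_args([])

# A flag given on the command line takes precedence over the default.
overridden = build_parser().parse_args(["--data_dir", "/mnt/datasets"])
```

Whether every script in the repo forwards these flags is not something this sketch verifies; editing the defaults as described above is the documented route.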

🍜 Descriptor soups

(1) Generate Description Features

First, calculate the descriptor features on ImageNet. Use preprocess/ This Python file reads from preprocess/descriptions.list, a sorted list of 4227 unique GPT descriptors, each of which begins with a space and ends in a period. Currently, we use a pretrained model for these features.
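
If you substitute your own descriptor list, a quick format check along these lines may save a failed run. This is a sketch under assumptions: it supposes the file stores one descriptor per line, and the helper names `read_descriptors` and `check_descriptors` are hypothetical. The sample descriptors in the usage note are made up for illustration.

```python
def read_descriptors(path):
    """Read descriptors, assuming one per line (an assumption about the file).

    Only the trailing newline is stripped so the leading space survives.
    """
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def check_descriptors(descriptors):
    """Return a list of problems; an empty list means the format looks fine."""
    problems = []
    for d in descriptors:
        if not d.startswith(" "):
            problems.append(f"missing leading space: {d!r}")
        if not d.endswith("."):
            problems.append(f"missing final period: {d!r}")
    if len(set(descriptors)) != len(descriptors):
        problems.append("list contains duplicates")
    return problems
```

For example, `check_descriptors([" which has fur.", " which is round."])` returns an empty list, while a descriptor without the leading space or the final period is reported.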

Run: python preprocess/ --dataset ImageNet

This will save a tuple of (description strings, description features) in cache/description_features__ViT-B-16_openai.tensor

(2) Calculate greedy descriptor soups

This needs to be done for each random seed of the ImageNet training split!


python preprocess/ --dataset ImageNet --seed 1
python preprocess/ --dataset ImageNet --seed 2
python preprocess/ --dataset ImageNet --seed 3

This will save the greedily selected descriptors in cache/good_descriptions_seed1__ViT-B-16_openai.list as a list.
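
To give a feel for what "greedily selected" means, here is a toy, pure-Python sketch of generic greedy forward selection over descriptors. It is not the repository's implementation: the function names, the stopping rule (stop when accuracy no longer improves), and the tiny list-based features are all assumptions made for illustration. Each descriptor contributes one text feature per class, the soup classifier is the average over the selected descriptors, and a descriptor is added only if it raises accuracy on the labelled training features.

```python
def accuracy(class_weights, image_feats, labels):
    """Fraction of images whose highest-scoring class equals the label."""
    correct = 0
    for feat, label in zip(image_feats, labels):
        scores = [sum(w * x for w, x in zip(weights, feat))
                  for weights in class_weights]
        if scores.index(max(scores)) == label:
            correct += 1
    return correct / len(labels)


def mean_classifier(text_feats, members):
    """Average the per-class text features over the selected descriptors."""
    n_classes = len(text_feats[0])
    dim = len(text_feats[0][0])
    return [[sum(text_feats[m][c][d] for m in members) / len(members)
             for d in range(dim)]
            for c in range(n_classes)]


def greedy_soup(text_feats, image_feats, labels, max_members):
    """Greedy forward selection (illustrative stopping rule, see lead-in).

    text_feats[i][c] is the feature of class c under descriptor i.
    Returns the selected descriptor indices and the final accuracy.
    """
    selected, best_acc = [], 0.0
    for _ in range(max_members):
        candidates = [i for i in range(len(text_feats)) if i not in selected]
        if not candidates:
            break
        scored = [(accuracy(mean_classifier(text_feats, selected + [i]),
                            image_feats, labels), i)
                  for i in candidates]
        # Prefer higher accuracy; break ties by the lower index.
        acc, idx = max(scored, key=lambda t: (t[0], -t[1]))
        if selected and acc <= best_acc:
            break  # no remaining candidate improves the soup
        selected.append(idx)
        best_acc = acc
    return selected, best_acc
```

In practice the features would be CLIP embeddings stored as tensors, and the selection would run on the cached description features from step (1); the lists here only keep the sketch dependency-free.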

Example logs: example_logs/example_get_greedy_descriptor_soup_output.txt

Proceed to Zero-shot comparisons section for evaluation.

🍜 Word soups

(1) Get Word Features

preprocess/words.list contains 10,000 most common English words minus swear words. They have a space prepended. We can use the same preprocess/ to generate the text features from individual words.
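
A word list in this format could be assembled roughly as follows. This is a hedged sketch: `build_word_list` is a hypothetical helper, the inputs are placeholders, and it assumes the filter is applied after taking the most frequent words (so the result may hold slightly fewer than `limit` entries). The repository ships its own words.list, so this is only relevant if you want to swap in another vocabulary.

```python
def build_word_list(words_by_frequency, blocklist, limit=10000):
    """Take the `limit` most frequent words, drop blocked ones, prepend a space.

    words_by_frequency must already be ordered from most to least common.
    """
    blocked = {w.lower() for w in blocklist}
    top = words_by_frequency[:limit]
    return [" " + w for w in top if w.lower() not in blocked]
```

The prepended space mirrors the format described above for entries in preprocess/words.list.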

Run: python preprocess/ --dataset ImageNet --descriptions preprocess/words.list --savename word_features

This will save a tuple of (words, word features) in cache/word_features__ViT-B-16_openai.tensor

(2) Calculate greedy word soups

This needs to be done for each random seed of the ImageNet training split!


python preprocess/ --dataset ImageNet --seed 1 --n_descriptors 8
python preprocess/ --dataset ImageNet --seed 2 --n_descriptors 8
python preprocess/ --dataset ImageNet --seed 3 --n_descriptors 8

This will save the greedily selected descriptors in cache/word_soup_descriptors_seed1__ViT-B-16_openai.list as a list.
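
Since evaluation depends on one soup file per seed, a small check like the one below can confirm that all of them exist. It is a sketch only: the helper names are hypothetical, and the file-name pattern for seeds 2 and 3 is extrapolated from the seed 1 names quoted in this README.

```python
import os


def soup_cache_path(prefix, seed, model="ViT-B-16", pretrained="openai",
                    cache_dir="cache"):
    """Build a soup file name following the seed 1 examples in this README."""
    return f"{cache_dir}/{prefix}_seed{seed}__{model}_{pretrained}.list"


def missing_soups(prefixes=("good_descriptions", "word_soup_descriptors"),
                  seeds=(1, 2, 3)):
    """Return the expected soup files that are not on disk yet."""
    paths = [soup_cache_path(p, s) for p in prefixes for s in seeds]
    return [p for p in paths if not os.path.exists(p)]
```

Run `missing_soups()` from the repository root; an empty list suggests both soups have been computed for all three seeds.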

Example logs: example_logs/example_get_greedy_word_soup_output.txt

Proceed to Zero-shot comparisons section for evaluation.

🧪 Baselines

Results are printed in CSV format at the end of each experiment, so you can copy and paste them directly into a spreadsheet.

Zero-shot comparisons

For all ZS methods presented in Table 3 of the paper (OpenAI handcrafted ensemble, GPT, descriptor soup, token offset, word soup), run:

sh scripts/ 0 ViT-B-16 openai 512

Example logs: example_logs/example_run_pt_eval_ViT-B-16_openai_output.txt

For WaffleCLIP with 16 members, run:

sh scripts/ 16

Example logs: example_logs/example_waffle_descriptors_eval_output.txt

Few-shot OOD comparisons

These scripts train on 3 random splits of 16-shot ImageNet-1K. "XD Mean" stands for the average test accuracy on 10 OOD datasets. "DG Mean" stands for the average test accuracy on 4 domain-shifted versions of ImageNet. You can verify these results by running the indicated bash script and pasting the CSV-formatted results at the end of the output into a spreadsheet.

| Method | Command to run | XD Mean | DG Mean |
|---|---|---|---|
| CLIP-adapter | scripts/ 6e-3 ViT-B-16 512 | 65.02 | 58.12 |
| bitfit | scripts/ 1.25e-4 ViT-B-16 512 | 66.05 | 59.12 |
| Cross Entropy | scripts/ 2e-5 ViT-B-16 512 | 66.80 | 60.39 |
| Cross Entropy + word soup + diversity loss | scripts/ 0.25 10 | 67.43 | 61.32 |
| ClipOOD | scripts/ 2e-5 ViT-B-16 512 | 66.50 | 60.47 |
| ClipOOD + word soup + diversity loss | scripts/ 0.25 10 | 67.42 | 61.23 |
| CoOp | scripts/ 8e-5 ViT-B-16 512 | 66.52 | 59.25 |
| CoOp + word soup + diversity loss | scripts/ 0.25 10 | 67.30 | 60.25 |
| KgCoOp | scripts/ 4e-5 ViT-B-16 512 | 66.16 | 58.64 |
| LoRA | scripts/ 1e-5 ViT-B-16 512 | 66.19 | 57.93 |
| MaPLe | scripts/ 0.025 ViT-B-16 512 | 66.44 | 59.32 |
| MaPLe + word soup + diversity loss | scripts/ | 66.65 | 60.20 |
| ProDA | scripts/ 3.2e-4 ViT-B-16 512 | 66.23 | 58.83 |
| ProGrad | scripts/ 1.28e-3 ViT-B-16 512 | 66.48 | 58.96 |
| ResBlock-adapter | scripts/ 2.5e-3 ViT-B-16 512 | 65.55 | 59.48 |
| SSF | scripts/ 1e-4 ViT-B-16 512 | 65.86 | 58.44 |
| VPT | scripts/ 0.8 ViT-B-16 512 | 65.16 | 58.42 |
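
The two summary columns are plain averages over per-dataset accuracies, which the sketch below reproduces. The grouping is an assumption inferred from the data/ tree: the 10 non-ImageNet folders are treated as the OOD datasets and the 4 ImageNet variants as the domain-shifted ones. The function name `summarize` and the dictionary keys are hypothetical; the actual CSV column names printed by the scripts may differ.

```python
# Assumed grouping, inferred from the folder names under data/.
XD_DATASETS = [
    "caltech-101", "oxford_pets", "stanford_cars", "oxford_flowers",
    "food-101", "fgvc_aircraft", "sun397", "dtd", "eurosat", "ucf101",
]
DG_DATASETS = [
    "imagenetv2", "imagenet-sketch", "imagenet-adversarial",
    "imagenet-rendition",
]


def summarize(acc_by_dataset):
    """Return (XD Mean, DG Mean) rounded to two decimals.

    acc_by_dataset maps a dataset name to its test accuracy in percent.
    """
    xd = sum(acc_by_dataset[d] for d in XD_DATASETS) / len(XD_DATASETS)
    dg = sum(acc_by_dataset[d] for d in DG_DATASETS) / len(DG_DATASETS)
    return round(xd, 2), round(dg, 2)
```

Feed it the per-dataset accuracies from one run (or their average over the 3 seeds) to compare against the table above.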

🧪 More experiments

Base to novel setting

First, generate features for each training dataset:

For descriptor features:

for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101; do
  python preprocess/ --dataset "$dataset" --subsample_classes base
done

For word features:

for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101; do
  python preprocess/ --dataset "$dataset" --descriptions words.list --savename word_features --subsample_classes base
done

To get greedy descriptor soup:

for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101; do
  sh scripts/ablations/ "$dataset"
done

To get greedy word soup:

for dataset in ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101; do
  sh scripts/ablations/ "$dataset"
done

Then run training using the provided bash scripts, for example:

sh scripts/ 5e-05 > run_ce_with_eval.btn.sh_5e-05.o

See any bash script called scripts/*
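
For readers unfamiliar with the base-to-novel protocol, the `--subsample_classes base` flag used above restricts a dataset to its "base" classes. The sketch below follows the convention from the CoOp and Co-CoOp code that this repo builds on, where the class list is split in half, the first half being base and the second half new. Whether this repository splits in exactly this way is an assumption, and `subsample_classes` here is an illustrative stand-alone function.

```python
import math


def subsample_classes(class_names, subsample="all"):
    """Split classes into base (first half) and new (second half).

    Assumes the CoOp-style convention; with an odd number of classes,
    the extra class goes to the base split.
    """
    if subsample not in ("all", "base", "new"):
        raise ValueError(f"unknown subsample mode: {subsample!r}")
    names = list(class_names)
    if subsample == "all":
        return names
    m = math.ceil(len(names) / 2)
    return names[:m] if subsample == "base" else names[m:]
```

For example, with five classes `a` to `e`, the base split is `a, b, c` and the new split is `d, e`.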

CoOp soft descriptor ensemble baseline

Run scripts/ablations/ which logs to train_softd.o and outputs:

  • cache/soft_descriptors/random_8_10_token_8_ensemble/8_random_10_token_word_chains_seed1.list_e0.soft
  • cache/soft_descriptors/random_8_10_token_8_ensemble/8_random_10_token_word_chains_seed2.list_e0.soft
  • cache/soft_descriptors/random_8_10_token_8_ensemble/8_random_10_token_word_chains_seed3.list_e0.soft

Each of these files is a list of 8 soft descriptors.

To evaluate: (reference scripts/ablations/

More baselines

Many more baselines are available in the scripts/ablations folder. Run them as needed.

