Official code release for the ICML 2026 paper "Test-Time Debiasing with Probabilistic Prompts via Wasserstein Distance in Vision-Language Models".
- [April 2026] W4D has been accepted by ICML 2026!
W4D is a lightweight test-time debiasing framework for vision-language models. Instead of correcting a query with a single debiased point, W4D models a distribution of prompt-induced query perturbations and aligns the resulting query distribution with group reference distributions using a Wasserstein-based objective.
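The idea of comparing a distribution of perturbed queries against group reference distributions can be illustrated with a small NumPy sketch. This is not the released implementation: the Gaussian perturbation stands in for prompt-induced perturbations, the sliced Wasserstein-1 approximation is a swapped-in technique chosen for brevity, and the function names (`perturb_query`, `wasserstein_1d`, `sliced_wasserstein`) and synthetic data are hypothetical.

```python
import numpy as np


def perturb_query(query, n_samples, scale, rng):
    """Draw noisy copies of a query embedding (stand-in for prompt-induced perturbations)."""
    noise = rng.normal(0.0, scale, size=(n_samples, query.shape[0]))
    samples = query[None, :] + noise
    # Re-normalize so all samples lie on the unit sphere.
    return samples / np.linalg.norm(samples, axis=1, keepdims=True)


def wasserstein_1d(x, y):
    """Wasserstein-1 distance between two equal-sized 1D samples."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))


def sliced_wasserstein(a, b, n_projections, rng):
    """Average 1D Wasserstein-1 distance over random unit directions."""
    directions = rng.normal(size=(n_projections, a.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return float(np.mean([wasserstein_1d(a @ d, b @ d) for d in directions]))


rng = np.random.default_rng(0)
dim, n = 16, 64
query = rng.normal(size=dim)
query /= np.linalg.norm(query)

query_dist = perturb_query(query, n, scale=0.05, rng=rng)
# Two synthetic "group reference" sets; real ones would come from image embeddings.
group_a = perturb_query(query, n, scale=0.05, rng=rng)
shifted = query + 0.5 * rng.normal(size=dim)
group_b = perturb_query(shifted / np.linalg.norm(shifted), n, scale=0.05, rng=rng)

dist_a = sliced_wasserstein(query_dist, group_a, 32, rng)
dist_b = sliced_wasserstein(query_dist, group_b, 32, rng)
print(f"distance to group A: {dist_a:.4f}, to group B: {dist_b:.4f}")
```

In this toy setup the query distribution sits closer to group A than to group B; a debiasing objective would aim to balance such distances across groups.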
.
├── create_dataset.py # Precompute CLIP image embeddings and 5-fold splits
├── envs.txt # Full environment snapshot from the original setup
├── experimental_configs/ # YAML configs for all released experiments
├── queries.py # Query class definitions
├── query_templates/ # Query / augmentation templates
├── runner.py # Convenience launcher for multi-fold experiments
├── w4d.py # Main W4D evaluation script
└── w4d_utils.py # Metrics and helper utilities
We recommend Python 3.10+ and a separate Conda environment.
conda create -n w4d python=3.10 -y
conda activate w4d
pip install --upgrade pip
pip install -r requirements.txt

envs.txt is the full package dump from the original environment. requirements.txt is a cleaned dependency list for this repository.
If you use GPU acceleration, install the PyTorch build that matches your CUDA version. The original environment used torch 2.8.0 and torchvision 0.23.0.
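A quick way to confirm which PyTorch build ended up in the environment is a short check like the one below. It is a minimal sketch using only standard PyTorch attributes; the helper name `describe_torch` is hypothetical and not part of this repository.

```python
import importlib.util


def describe_torch():
    """Report the installed PyTorch build, or note that it is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "torch": torch.__version__,
        # CUDA version the wheel was built against; None for CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch())
```

If `built_for_cuda` is `None` or `cuda_available` is `False` on a GPU machine, the installed wheel likely does not match the local CUDA setup.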
This release supports the datasets and query settings used in the paper:
- CelebA: hair-color queries, debiasing with respect to gender
- FairFace: stereotype queries, debiasing with respect to gender or race
- UTKFace: stereotype queries, debiasing with respect to gender or race
- FACET: job queries and stereotype queries, debiasing with respect to gender or skin tone
Note: the current config filenames facet_job_race_*.yml correspond to skin-tone debiasing because FACET provides skin-tone annotations rather than race labels.
This repository does not ship the raw datasets. After obtaining a dataset, organize your files locally and precompute CLIP image embeddings with create_dataset.py.
The script writes:
- data/<dataset>_featurized_<clip-model>.jsonl
- data/fold_indices/<dataset>_featurized_<clip-model>_folds.jsonl
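The naming pattern above can be used to check whether the precomputed files are already in place before launching an experiment. The sketch below assumes that `<dataset>` and `<clip-model>` are filled with the values passed to `--dataset_name` and `--_MODEL_NAME`; the helper `expected_outputs` is hypothetical and not part of the repository.

```python
from pathlib import Path


def expected_outputs(dataset, clip_model, root="data"):
    """Build the two output paths following the documented naming pattern."""
    stem = f"{dataset}_featurized_{clip_model}"
    features = Path(root) / f"{stem}.jsonl"
    folds = Path(root) / "fold_indices" / f"{stem}_folds.jsonl"
    return features, folds


features, folds = expected_outputs("celeba", "clip-vit-base-patch16")
for path in (features, folds):
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```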
Example for CelebA:
python create_dataset.py \
--dataset_name celeba \
--_MODEL_NAME clip-vit-base-patch16 \
--data_path /path/to/celeba_root/ \
--meta_data_file_name classification_label/CelebAMask-HQ-attribute-anno.txt

Example for FairFace:
python create_dataset.py \
--dataset_name fairface \
--_MODEL_NAME clip-vit-base-patch16 \
--data_path /path/to/fairface_root/ \
--meta_data_file_name fairface_label_train.csv

create_dataset.py currently contains dataset-specific metadata parsers for:
- celeba
- fairface
- utkface
- facet
Please make sure the metadata columns match the expectations inside the script.
Run one fold with a selected config:
python w4d.py \
--enum 0 \
--config experimental_configs/celeba_hair_gender_clip-vit-base-patch16.yml

Run the default 5-fold launcher:
python runner.py

To evaluate a different setting, either:
- pass another YAML file to w4d.py, or
- edit the configs list in runner.py
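For a one-off setting, the per-fold commands can also be generated from a short script instead of editing runner.py. The sketch below assumes that `--enum` is a zero-based fold index and that the five folds are numbered 0 to 4; it does not describe how runner.py itself is implemented, and `fold_commands` is a hypothetical helper.

```python
import shlex
import sys


def fold_commands(config, n_folds=5):
    """Build one w4d.py command per fold (assumes --enum is the fold index)."""
    return [
        [sys.executable, "w4d.py", "--enum", str(fold), "--config", config]
        for fold in range(n_folds)
    ]


config = "experimental_configs/celeba_hair_gender_clip-vit-base-patch16.yml"
commands = fold_commands(config)
for cmd in commands:
    # Print the command; swap in subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to verify the config path and fold indices before committing to a full run.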
Parts of this codebase were inspired by and adapted from the excellent BEND-VLM repository.
We thank the authors for open-sourcing their implementation and supporting reproducible research in debiasing for vision-language models.
If you use this code, please cite the corresponding paper once the final bibliographic information is available.
