PowerCLIP augments standard CLIP training with a region-phrase alignment loss that pairs SAM-generated image regions with syntactic phrases extracted from captions.
PowerCLIP extends OpenCLIP with three key components:
- SAM region extraction -- Segment Anything Model generates semantic regions per image, converted to patch-grid token indices (CSR format).
- Parse-tree phrase extraction -- spaCy-based constituency parsing extracts NP/PP/VP/S phrases from captions, aligned to the CLIP tokenizer.
- Softplus region-phrase scoring -- A softplus-based scoring function aligns region features with phrase features during training.
The model uses average pooling for both vision and text encoders (instead of CLIP's default CLS/EoT token pooling).
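The pooling difference can be sketched in a few lines. This is an illustrative example only: the token count, feature width, and the choice of whether the class token participates in the average are assumptions, not the repo's exact implementation.

```python
import numpy as np

# Illustrative ViT output: one [CLS] token followed by 196 patch tokens,
# each a 512-dimensional feature (shapes are assumptions for this sketch).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((197, 512))

cls_pooled = tokens[0]             # standard CLIP: take the [CLS] (or EoT) token
avg_pooled = tokens.mean(axis=0)   # PowerCLIP: average over the token sequence

print(cls_pooled.shape, avg_pooled.shape)  # (512,) (512,)
```

Both produce a single embedding per input; only the reduction over the token sequence changes.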
| Resource | Hugging Face |
|---|---|
| Pre-trained model (SAM) | KMasaki/PowerCLIP-ViT-B-16-CC12M |
| Pre-trained model (SAM2) | KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M |
| Pre-processed dataset (SAM) | KMasaki/cc12m-sam-parse-tree |
| Pre-processed dataset (SAM2) | KMasaki/cc12m-sam2-parse-tree |
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

Note: The pre-processed CC12M datasets (the output of the steps below) are publicly available on Hugging Face:
- SAM regions: KMasaki/cc12m-sam-parse-tree
- SAM2 regions: KMasaki/cc12m-sam2-parse-tree
You can skip this section if you use the pre-processed datasets directly.
PowerCLIP uses WebDataset tar archives. Starting from a standard image-text WebDataset (.jpg, .txt, .json per sample), run the following two preprocessing steps:
Adds .njson files containing phrase token indices (CSR format) to each tar.
```bash
python -m training.precompute_parse_tree \
    --data /path/to/cc12m-wds/ \
    --model ViT-B-16 \
    --workers 4
```

Output directory: `cc12m-wds_parse_tree/`
Reads the parse-tree tar and adds .samlens.npy + .samcat.npy (region token indices in CSR format) per sample.
Using SAM (multi-GPU):
```bash
torchrun --nproc_per_node 8 -m training.precompute_sam \
    --src-dir /path/to/cc12m-wds_parse_tree \
    --out-dir /path/to/cc12m-wds_sam_parse_tree \
    --patch-size 16 --image-size 224
```

Using SAM2 (multi-GPU):
```bash
torchrun --nproc_per_node 8 -m training.precompute_sam2 \
    --src-dir /path/to/cc12m-wds_parse_tree \
    --out-dir /path/to/cc12m-wds_sam2_parse_tree \
    --patch-size 16 --image-size 224
```

Each sample in the final tar contains:
```
{key}.jpg          # Image
{key}.txt          # Caption
{key}.json         # Metadata
{key}.njson        # Parse-tree phrase indices (CSR)
{key}.samlens.npy  # SAM region lengths (CSR)
{key}.samcat.npy   # SAM region token indices (CSR)
```
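Decoding the CSR-style region annotations of one sample can be sketched as follows. The array values below are made up for illustration, but the convention (per-region lengths in `.samlens.npy`, concatenated patch-grid indices in `.samcat.npy`) follows the file layout described above.

```python
import numpy as np

# Stand-ins for one sample's arrays (values are hypothetical):
samlens = np.array([3, 2])             # {key}.samlens.npy: tokens per region
samcat = np.array([0, 1, 14, 5, 6])    # {key}.samcat.npy: concatenated indices

# CSR decode: cumulative lengths give the offset of each region's slice.
offsets = np.concatenate(([0], np.cumsum(samlens)))
regions = [samcat[offsets[i]:offsets[i + 1]] for i in range(len(samlens))]

print([r.tolist() for r in regions])  # [[0, 1, 14], [5, 6]]
```

Here the first (largest-area) region covers patch-grid tokens 0, 1, and 14, and the second covers tokens 5 and 6.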
```bash
bash scripts/train.sh
```

Key arguments (see `scripts/train.sh`):
| Argument | Default | Description |
|---|---|---|
| `--vision_pool_type` | `avg` | Vision encoder pooling (`avg` or `cls`) |
| `--text_pool_type` | `avg` | Text encoder pooling (`avg` or `argmax`) |
| `--sam_loss_ratio` | `0.1` | Weight of region-phrase alignment loss (`0` = standard CLIP) |
| `--sam_regions_topk` | `5` | Max SAM regions per image |
| `--sam_regions_topk_random` | flag | Randomly sample topk regions. Without this flag, the first topk regions (descending area order) are used deterministically. |
| `--use_parse_phrases` | flag | Use parse-tree phrases for alignment |
| `--softplus_scoring` | flag | Enable softplus scoring (PowerCLIP core) |
| `--softplus_tau` | `0.001` | Temperature for softplus scoring |
| `--softplus_alpha` | `0.75` | Mixing weight for softplus alignment scores |
Install clip-benchmark:
```bash
pip install clip-benchmark
```

Important: PowerCLIP uses average pooling (`--vision_pool_type avg`, `--text_pool_type avg`), which differs from standard CLIP's CLS/EoT token pooling. clip-benchmark assumes the default pooling, so you need to modify its model loading to use average pooling. Specifically, set `model.visual.pool_type = "avg"` and `model.text.pool_type = "avg"` after loading the checkpoint (or patch the model config accordingly).
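The post-load patch can be sketched as below. The `SimpleNamespace` objects are stand-ins so the snippet runs without a checkpoint; in practice `model` would come from loading the real OpenCLIP model, and the default `pool_type` values shown are assumptions.

```python
from types import SimpleNamespace

# Stand-in for a freshly loaded OpenCLIP model; the attribute paths
# (model.visual.pool_type, model.text.pool_type) follow the note above.
model = SimpleNamespace(
    visual=SimpleNamespace(pool_type="tok"),   # assumed default vision pooling
    text=SimpleNamespace(pool_type="argmax"),  # assumed default text pooling
)

# PowerCLIP checkpoints were trained with average pooling in both towers,
# so override both pool types after loading the checkpoint:
model.visual.pool_type = "avg"
model.text.pool_type = "avg"
```

Without this override, clip-benchmark would pool with the default CLS/EoT strategy and report scores that do not match the trained model.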
This repository is based on OpenCLIP.

