PowerCLIP augments standard CLIP training with a region-phrase alignment loss that pairs SAM-generated image regions with syntactic phrases extracted from captions.
PowerCLIP extends OpenCLIP with three key components:
- SAM region extraction -- Segment Anything Model generates semantic regions per image, converted to patch-grid token indices (CSR format).
- Parse-tree phrase extraction -- spaCy-based constituency parsing extracts NP/PP/VP/S phrases from captions, aligned to the CLIP tokenizer.
- Softplus region-phrase scoring -- A softplus-based scoring function aligns region features with phrase features during training.
The model uses average pooling for both vision and text encoders (instead of CLIP's default CLS/EoT token pooling).
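The pooling difference can be sketched in a few lines. This is an illustrative example only: the token count, feature width, and the choice of whether the class token participates in the average are assumptions, not the repo's exact implementation.

```python
import numpy as np

# Illustrative ViT output: one [CLS] token followed by 196 patch tokens,
# each a 512-dimensional feature (shapes are assumptions for this sketch).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((197, 512))

cls_pooled = tokens[0]             # standard CLIP: take the [CLS] (or EoT) token
avg_pooled = tokens.mean(axis=0)   # PowerCLIP: average over the token sequence

print(cls_pooled.shape, avg_pooled.shape)  # (512,) (512,)
```

Both produce a single embedding per input; only the reduction over the token sequence changes.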
| Resource | Hugging Face |
|---|---|
| Pre-trained model (SAM) | KMasaki/PowerCLIP-ViT-B-16-CC12M |
| Pre-trained model (SAM2) | KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M |
| Pre-processed dataset (SAM) | KMasaki/cc12m-sam-parse-tree |
| Pre-processed dataset (SAM2) | KMasaki/cc12m-sam2-parse-tree |
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

Note: The pre-processed CC12M datasets (the output of the steps below) are publicly available on Hugging Face:
- SAM regions: KMasaki/cc12m-sam-parse-tree
- SAM2 regions: KMasaki/cc12m-sam2-parse-tree
You can skip this section if you use the pre-processed datasets directly.
PowerCLIP uses WebDataset tar archives. Starting from a standard image-text WebDataset (.jpg, .txt, .json per sample), run the following two preprocessing steps:
Adds .njson files containing phrase token indices (CSR format) to each tar.
```bash
python -m training.precompute_parse_tree \
    --data /path/to/cc12m-wds/ \
    --model ViT-B-16 \
    --workers 4
```

Output directory: `cc12m-wds_parse_tree/`
Reads the parse-tree tar and adds .samlens.npy + .samcat.npy (region token indices in CSR format) per sample.
Using SAM (multi-GPU):
```bash
torchrun --nproc_per_node 8 -m training.precompute_sam \
    --src-dir /path/to/cc12m-wds_parse_tree \
    --out-dir /path/to/cc12m-wds_sam_parse_tree \
    --patch-size 16 --image-size 224
```

Using SAM2 (multi-GPU):
```bash
torchrun --nproc_per_node 8 -m training.precompute_sam2 \
    --src-dir /path/to/cc12m-wds_parse_tree \
    --out-dir /path/to/cc12m-wds_sam2_parse_tree \
    --patch-size 16 --image-size 224
```

Each sample in the final tar contains:
```
{key}.jpg          # Image
{key}.txt          # Caption
{key}.json         # Metadata
{key}.njson        # Parse-tree phrase indices (CSR)
{key}.samlens.npy  # SAM region lengths (CSR)
{key}.samcat.npy   # SAM region token indices (CSR)
```
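Decoding the CSR-style region annotations of one sample can be sketched as follows. The array values below are made up for illustration, but the convention (per-region lengths in `.samlens.npy`, concatenated patch-grid indices in `.samcat.npy`) follows the file layout described above.

```python
import numpy as np

# Stand-ins for one sample's arrays (values are hypothetical):
samlens = np.array([3, 2])             # {key}.samlens.npy: tokens per region
samcat = np.array([0, 1, 14, 5, 6])    # {key}.samcat.npy: concatenated indices

# CSR decode: cumulative lengths give the offset of each region's slice.
offsets = np.concatenate(([0], np.cumsum(samlens)))
regions = [samcat[offsets[i]:offsets[i + 1]] for i in range(len(samlens))]

print([r.tolist() for r in regions])  # [[0, 1, 14], [5, 6]]
```

Here the first (largest-area) region covers patch-grid tokens 0, 1, and 14, and the second covers tokens 5 and 6.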
```bash
bash scripts/train.sh
```

Key arguments (see `scripts/train.sh`):
| Argument | Default | Description |
|---|---|---|
| `--vision_pool_type` | `avg` | Vision encoder pooling (`avg` or `cls`) |
| `--text_pool_type` | `avg` | Text encoder pooling (`avg` or `argmax`) |
| `--sam_loss_ratio` | `0.1` | Weight of region-phrase alignment loss (`0` = standard CLIP) |
| `--sam_regions_topk` | `5` | Max SAM regions per image |
| `--sam_regions_topk_random` | flag | Randomly sample topk regions. Without this flag, the first topk regions (descending area order) are used deterministically. |
| `--use_parse_phrases` | flag | Use parse-tree phrases for alignment |
| `--softplus_scoring` | flag | Enable softplus scoring (PowerCLIP core) |
| `--softplus_tau` | `0.001` | Temperature for softplus scoring |
| `--softplus_alpha` | `0.75` | Mixing weight for softplus alignment scores |
Install clip-benchmark:
```bash
pip install clip-benchmark
```

Important: PowerCLIP uses average pooling (`--vision_pool_type avg`, `--text_pool_type avg`), which differs from standard CLIP's CLS/EoT token pooling. clip-benchmark assumes the default pooling, so you need to modify its model loading to use average pooling. Specifically, set `model.visual.pool_type = "avg"` and `model.text.pool_type = "avg"` after loading the checkpoint (or patch the model config accordingly).
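The post-load patch can be sketched as below. The `SimpleNamespace` objects are stand-ins so the snippet runs without a checkpoint; in practice `model` would come from loading the real OpenCLIP model, and the default `pool_type` values shown are assumptions.

```python
from types import SimpleNamespace

# Stand-in for a freshly loaded OpenCLIP model; the attribute paths
# (model.visual.pool_type, model.text.pool_type) follow the note above.
model = SimpleNamespace(
    visual=SimpleNamespace(pool_type="tok"),   # assumed default vision pooling
    text=SimpleNamespace(pool_type="argmax"),  # assumed default text pooling
)

# PowerCLIP checkpoints were trained with average pooling in both towers,
# so override both pool types after loading the checkpoint:
model.visual.pool_type = "avg"
model.text.pool_type = "avg"
```

Without this override, clip-benchmark would pool with the default CLS/EoT strategy and report scores that do not match the trained model.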
This repository is based on OpenCLIP.

