
PowerCLIP: Powerset Alignment for Contrastive Pre-Training (CVPR 2026)

PowerCLIP augments standard CLIP training with a region-phrase alignment loss that pairs SAM-generated image regions with syntactic phrases extracted from captions.

Overview

PowerCLIP extends OpenCLIP with three key components:

  1. SAM region extraction -- Segment Anything Model generates semantic regions per image, converted to patch-grid token indices (CSR format).
  2. Parse-tree phrase extraction -- spaCy-based constituency parsing extracts NP/PP/VP/S phrases from captions, aligned to the CLIP tokenizer.
  3. Softplus region-phrase scoring -- A softplus-based scoring function aligns region features with phrase features during training.

The model uses average pooling for both vision and text encoders (instead of CLIP's default CLS/EoT token pooling).
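Concretely, average pooling is just a masked mean over the encoder's token features. The following is a minimal NumPy sketch; the shapes and masking convention are illustrative assumptions, not the repository's code:

```python
import numpy as np

def avg_pool(tokens, mask):
    """Average-pool token features, ignoring padded positions.

    tokens: (seq_len, dim) token features from the encoder
    mask:   (seq_len,) 1 for valid tokens, 0 for padding
    """
    m = mask[:, None].astype(tokens.dtype)
    return (tokens * m).sum(axis=0) / m.sum()

tokens = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]])
mask = np.array([1, 1, 0])
print(avg_pool(tokens, mask))  # mean of the two valid tokens: [2. 3.]
```

In contrast, CLS/EoT pooling would select a single token's feature vector instead of averaging.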

Resources

| Resource | Link |
| --- | --- |
| Pre-trained model (SAM) | KMasaki/PowerCLIP-ViT-B-16-CC12M |
| Pre-trained model (SAM2) | KMasaki/PowerCLIP-SAM2-ViT-B-16-CC12M |
| Pre-processed dataset (SAM) | KMasaki/cc12m-sam-parse-tree |
| Pre-processed dataset (SAM2) | KMasaki/cc12m-sam2-parse-tree |

Installation

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Data Preparation

Note: The pre-processed CC12M datasets (the output of the steps below) are publicly available on Hugging Face; see the Resources section above. You can skip this section if you use the pre-processed datasets directly.

PowerCLIP uses WebDataset tar archives. Starting from a standard image-text WebDataset (.jpg, .txt, .json per sample), run the following two preprocessing steps:

Step 1: Parse-tree phrase extraction

Adds .njson files containing phrase token indices (CSR format) to each tar.

python -m training.precompute_parse_tree \
    --data /path/to/cc12m-wds/ \
    --model ViT-B-16 \
    --workers 4

Output directory: cc12m-wds_parse_tree/

Step 2: SAM region extraction

Reads the parse-tree tar and adds .samlens.npy + .samcat.npy (region token indices in CSR format) per sample.

Using SAM (multi-GPU):

torchrun --nproc_per_node 8 -m training.precompute_sam \
    --src-dir /path/to/cc12m-wds_parse_tree \
    --out-dir /path/to/cc12m-wds_sam_parse_tree \
    --patch-size 16 --image-size 224

Using SAM2 (multi-GPU):

torchrun --nproc_per_node 8 -m training.precompute_sam2 \
    --src-dir /path/to/cc12m-wds_parse_tree \
    --out-dir /path/to/cc12m-wds_sam2_parse_tree \
    --patch-size 16 --image-size 224
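The mask-to-token conversion behind this step can be sketched as follows. With a 224-pixel image and patch size 16, each SAM region mask is reduced to the indices it covers on the 14×14 patch grid. This is illustrative only; the actual pipeline's coverage threshold and tie-breaking are assumptions:

```python
import numpy as np

def mask_to_patch_indices(mask, patch_size=16, threshold=0.5):
    """Map a binary region mask (H, W) to flat patch-grid token indices.

    A patch is assigned to the region when at least `threshold` of its
    pixels are covered by the mask.
    """
    h, w = mask.shape
    gh, gw = h // patch_size, w // patch_size
    patches = mask[:gh * patch_size, :gw * patch_size].reshape(
        gh, patch_size, gw, patch_size)
    coverage = patches.mean(axis=(1, 3))          # (gh, gw) fraction covered
    return np.flatnonzero(coverage >= threshold)  # flat indices into the grid

mask = np.zeros((224, 224), dtype=np.float64)
mask[:16, :32] = 1.0                 # region covering the first two patches
print(mask_to_patch_indices(mask))   # [0 1]
```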

Final tar structure

Each sample in the final tar contains:

{key}.jpg           # Image
{key}.txt           # Caption
{key}.json          # Metadata
{key}.njson         # Parse-tree phrase indices (CSR)
{key}.samlens.npy   # SAM region lengths (CSR)
{key}.samcat.npy    # SAM region token indices (CSR)
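The CSR (compressed sparse row) convention shared by the phrase and region files can be illustrated in a few lines: one array holds the per-region lengths, a second holds all indices concatenated. The dtype and exact field layout here are assumptions for illustration:

```python
import numpy as np

# Variable-length token-index lists for one sample, e.g. three SAM regions.
regions = [[0, 1, 14, 15], [2, 3], [100, 101, 115]]

# CSR packing: one array of lengths, one concatenation of all indices.
samlens = np.array([len(r) for r in regions], dtype=np.int64)
samcat = np.concatenate([np.asarray(r, dtype=np.int64) for r in regions])

# Unpacking with prefix sums restores the original per-region lists.
offsets = np.concatenate([[0], np.cumsum(samlens)])
unpacked = [samcat[offsets[i]:offsets[i + 1]].tolist()
            for i in range(len(samlens))]
assert unpacked == regions
```

This layout lets ragged per-sample region lists be stored as two flat .npy arrays without padding.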

Training

bash scripts/train.sh

Key arguments (see scripts/train.sh):

| Argument | Default | Description |
| --- | --- | --- |
| --vision_pool_type | avg | Vision encoder pooling (avg or cls) |
| --text_pool_type | avg | Text encoder pooling (avg or argmax) |
| --sam_loss_ratio | 0.1 | Weight of the region-phrase alignment loss (0 = standard CLIP) |
| --sam_regions_topk | 5 | Max SAM regions per image |
| --sam_regions_topk_random | flag | Randomly sample the top-k regions; without this flag, the first k regions (in descending area order) are used deterministically |
| --use_parse_phrases | flag | Use parse-tree phrases for alignment |
| --softplus_scoring | flag | Enable softplus scoring (the PowerCLIP core) |
| --softplus_tau | 0.001 | Temperature for softplus scoring |
| --softplus_alpha | 0.75 | Mixing weight for softplus alignment scores |
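The exact softplus scoring function is defined in the paper; purely as a shape-level illustration of where the tau and alpha arguments could enter, one might write the following. This is an assumption for intuition, not the repository's loss:

```python
import numpy as np

def softplus_score(sim, tau=0.001, alpha=0.75):
    # Hypothetical sketch: tau * softplus(sim / tau) is a smooth
    # approximation of max(sim, 0) that sharpens as tau -> 0;
    # alpha mixes the rectified score with the raw similarity.
    soft = tau * np.logaddexp(0.0, sim / tau)  # numerically stable softplus
    return alpha * soft + (1.0 - alpha) * sim

print(softplus_score(np.array([-1.0, 0.0, 1.0])))
```

With the default tau of 0.001 the softplus term is nearly a hard rectifier, so tau mainly controls how softly near-zero similarities are suppressed.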

Evaluation

Install clip-benchmark:

pip install clip-benchmark

Important: PowerCLIP uses average pooling (--vision_pool_type avg, --text_pool_type avg), which differs from standard CLIP's CLS/EoT token pooling. clip-benchmark assumes the default pooling, so you need to modify its model loading to use average pooling. Specifically, set model.visual.pool_type = "avg" and model.text.pool_type = "avg" after loading the checkpoint (or patch the model config accordingly).

Acknowledgement

This repository is based on OpenCLIP.
