This repository is the official implementation of CsdCLIP.
Despite the remarkable success of CLIP models in text-to-image retrieval, their performance remains suboptimal when confronted with complex queries encompassing multiple subjects, attributes, or relations. We introduce CsdCLIP, a novel training-free framework that significantly improves zero-shot retrieval performance.
CsdCLIP first uses a large language model to optimize a complex query, then decomposes this optimized query into multiple logically related clauses. Subsequently, it performs a composite search with a plug-and-play architecture that integrates seamlessly with existing CLIP-based systems. Furthermore, prevailing evaluation metrics, such as R@k, are often insufficient for comprehensively assessing a model's true compositional capabilities in handling complex queries. To address this evaluative gap, we construct TIGR-100k, a novel benchmark dataset specifically designed for complex query evaluation, which consists of 1,044 bilingual complex query pairs with multi-level relevance grading (highly relevant, moderately relevant, and irrelevant) images, along with hierarchical evaluation metrics assessing both coverage and ranking quality. Extensive experiments across multiple CLIP variants demonstrate that CsdCLIP consistently elevates highly relevant images to the top positions, with significant improvements in text-to-image retrieval for complex queries.
Highlights:
- We introduce a novel benchmark dataset, along with two metrics, GRP and NIPR, to provide a more fine-grained evaluation for complex semantic search in text-to-image retrieval.
- We propose CsdCLIP, a training-free composite search algorithm that decomposes a complex query into multiple clauses and then performs an efficient composite search. Its plug-and-play design allows for seamless integration into existing CLIP-based retrieval systems.
- Through extensive experiments and detailed visualization analyses, we demonstrate the effectiveness of the proposed benchmark and the composite search method.
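As a rough illustration of the composite-search idea (a toy sketch, not the exact scoring used in this repository), AND clauses can be aggregated by averaging per-clause similarities while NOT clauses subtract a penalty; `alpha` is a hypothetical penalty weight:

```python
import numpy as np

def composite_score(image_feat, and_feats, not_feats, alpha=1.0):
    """Toy composite score over L2-normalized feature vectors:
    mean similarity across AND clauses, minus a penalty for the
    most-similar NOT clause. Not the scoring from the paper."""
    and_sim = np.mean([image_feat @ f for f in and_feats])
    not_sim = max((image_feat @ f for f in not_feats), default=0.0)
    return float(and_sim - alpha * max(not_sim, 0.0))

# Toy demo with 2-D unit features
img = np.array([1.0, 0.0])
clause = np.array([1.0, 0.0])
print(composite_score(img, [clause], []))  # image matches the single AND clause: 1.0
```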
The directory structure of the Tigr-100k dataset is as follows:

data
├── en_complex_1044_final.jsonl
├── zh_complex_1044_final.jsonl
├── imgs/
└── lmdb/
    ├── imgs/
    ├── pairs/
    └── pairs_zh/

- English and Chinese text pairs are constructed and stored as jsonl files, one record per line:

{"text_id": "de7d1fdf-b0eb-46e3-8d62-c0c0de086b67", "text": "A woman in a green dress dancing in front of York City Hall with no shoes | Time: none, place: York City Hall, person: none | Search intention: complex | Phrase, word expression optimization: A woman in a green dress dancing in front of City Hall | Multi-agent logical relationship analysis: -AND: < a woman wearing a green dress, a woman in front of City Hall, a woman dancing >, -NOT: < shoes >\n", "image_ids": ["d11884d9-832d-4c38-b3a7-22fbf6472cf5", ...], "image_xiangsi_ids": ["be86418e-74bb-427d-9ae1-7ff92b3af847", ...]}

- imgs/ holds the HR (highly relevant) and MR (moderately relevant) images.
- All pairs and images (including HR, MR, and IR images) are also provided as LMDB databases under lmdb/.
Run the following command to create the environment and install the required third-party libraries:
conda create -n CsdCLIP python=3.10
conda activate CsdCLIP
cd CsdCLIP
pip install -r requirements.txt
export PROJECT_ROOT="$(pwd)"
- Tigr-100k dataset:
The Tigr-100k dataset is available at Tigr100k and is ready to use upon download. Please place it in the ./data directory.
- public retrieval datasets:
Our experiments are conducted on four public retrieval datasets: MSCOCO, Flickr30k, wikiDO, and Urban1k. Please download and format them, then place them under ${DATAPATH} as follows:
${DATAPATH}
├── tigr100k
| ├── lmdb/
| | ├── pairs/
| | ├── pairs_zh/
| | └── imgs/
| ├── en_tigr100k.jsonl
| └── zh_tigr100k.jsonl
├── coco
| ├── lmdb/
| | ├── pairs/
| | └── imgs/
| └── en_coco.jsonl
└── other datasets

Be sure to download the model locally and replace the path in the bash script with the correct one before fine-tuning:
# LoRA fine-tune Qwen3-1.5B
sh ./LLM_finetune/lora.sh
# Generate rewrite text using the fine-tuned model
sh ./LLM_finetune/reason.sh

Refer to CLIP and CN-CLIP to install CLIP and CN-CLIP.
# CsdCLIP with CLIP-RN50, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale RN50 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with CLIP-ViT-L/14, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale ViT-L/14 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with CN-CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name cnclip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with CN-CLIP-ViT-L/14, Top-10 retrieval
python ./model/model.py --model_name cnclip --model_scale ViT-L/14 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10

Refer to Long-CLIP to clone the Long-CLIP repo into the CsdCLIP/ directory and rename the package to LongCLIP. Then download the checkpoints to the CsdCLIP/LongCLIP/checkpoints/ directory.
# CsdCLIP with Long-CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name longclip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with FG-CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name fgclip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10

CsdCLIP features a plug-and-play architecture. To integrate a new CLIP-like model (e.g., myclip), you only need to modify two files. The new model must provide the following two capabilities:
- Image encoding: a method that takes a preprocessed image tensor and returns a normalized feature vector.
- Text encoding: a method that takes tokenized text and returns a normalized feature vector.
| Capability | Standard CLIP-style API | HuggingFace-style API (e.g., FG-CLIP) |
|---|---|---|
| Load model | `model, preprocess = load(scale, device)` | `model = AutoModel.from_pretrained(name)` |
| Image preprocess | `preprocess(image)` → Tensor | `processor(images=image, return_tensors='pt')` |
| Encode image | `model.encode_image(image_tensor)` | `model.get_image_features(pixel_values)` |
| Tokenize text | `tokenize([text])` → Tensor | `tokenizer([text], ...)` → input_ids |
| Encode text | `model.encode_text(token_tensor)` | `model.get_text_features(input_ids)` |
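One way to bridge the two API styles is a thin adapter that exposes the standard `encode_image` / `encode_text` interface on top of a HuggingFace-style model. This is an illustrative sketch, not code from this repository; the class name is hypothetical, and features are handled as numpy arrays here for simplicity (real code would use torch tensors):

```python
import numpy as np

class HFClipAdapter:
    """Hypothetical adapter: wraps a HuggingFace-style model (providing
    get_image_features / get_text_features) behind the encode_image /
    encode_text interface that a CLIP-style caller expects."""

    def __init__(self, model, processor, tokenizer):
        self.model, self.processor, self.tokenizer = model, processor, tokenizer

    @staticmethod
    def _l2norm(x):
        # Normalize each feature row to unit length
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    def encode_image(self, image):
        inputs = self.processor(images=image, return_tensors="np")
        return self._l2norm(self.model.get_image_features(**inputs))

    def encode_text(self, texts):
        inputs = self.tokenizer(texts, return_tensors="np")
        return self._l2norm(self.model.get_text_features(**inputs))
```

With such a wrapper, registering `myclip` reduces to constructing the adapter in the model loader and leaving the retrieval code unchanged.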
# CsdCLIP with CLIP-RN50, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale RN50 --dataset_path $PROJECT_ROOT/${DATAPATH}/${DATASET_NAME} --gpu 0 --num_res 10
# CsdCLIP with CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/${DATASET_NAME} --gpu 0 --num_res 10
# CsdCLIP with CLIP-ViT-L/14, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale ViT-L/14 --dataset_path $PROJECT_ROOT/${DATAPATH}/${DATASET_NAME} --gpu 0 --num_res 10

If you find our work helpful for your research, please consider giving a citation:
to be determined