This repository is the official implementation of CsdCLIP.
Despite the remarkable success of CLIP models in text-to-image retrieval, their performance remains suboptimal when confronted with complex queries encompassing multiple subjects, attributes, or relations. We introduce CsdCLIP, a novel training-free framework that significantly improves zero-shot retrieval performance.
CsdCLIP first uses a large language model to optimize a complex query, then decomposes this optimized query into multiple logically related clauses. Subsequently, it performs a composite search with a plug-and-play architecture that integrates seamlessly with existing CLIP-based systems. Furthermore, prevailing evaluation metrics, such as R@k, are often insufficient for comprehensively assessing a model's true compositional capabilities in handling complex queries. To address this evaluative gap, we construct TIGR-100k, a novel benchmark dataset specifically designed for complex query evaluation, which consists of 1,044 bilingual complex query pairs with multi-level relevance grading (highly relevant, moderately relevant, and irrelevant) images, along with hierarchical evaluation metrics assessing both coverage and ranking quality. Extensive experiments across multiple CLIP variants demonstrate that CsdCLIP consistently elevates highly relevant images to the top positions, with significant improvements in text-to-image retrieval for complex queries.
Highlights:
- We introduce a novel benchmark dataset, along with two metrics, GRP and NIPR, to provide a more fine-grained evaluation for complex semantic search in text-to-image retrieval.
- We propose CsdCLIP, a training-free composite search algorithm that decomposes a complex query into multiple clauses and then performs an efficient composite search. Its plug-and-play design allows for seamless integration into existing CLIP-based retrieval systems.
- Through extensive experiments and detailed visualization analyses, we demonstrate the effectiveness of the proposed benchmark and the composite search method.
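As a rough illustration of the composite-search idea (a toy sketch, not the exact scoring used in this repository), AND clauses can be aggregated by averaging per-clause similarities while NOT clauses subtract a penalty; `alpha` is a hypothetical penalty weight:

```python
import numpy as np

def composite_score(image_feat, and_feats, not_feats, alpha=1.0):
    """Toy composite score over L2-normalized feature vectors:
    mean similarity across AND clauses, minus a penalty for the
    most-similar NOT clause. Not the scoring from the paper."""
    and_sim = np.mean([image_feat @ f for f in and_feats])
    not_sim = max((image_feat @ f for f in not_feats), default=0.0)
    return float(and_sim - alpha * max(not_sim, 0.0))

# Toy demo with 2-D unit features
img = np.array([1.0, 0.0])
clause = np.array([1.0, 0.0])
print(composite_score(img, [clause], []))  # image matches the single AND clause: 1.0
```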
The directory structure of the Tigr-100k dataset is as follows:

data
├── en_complex_1044_final.jsonl
├── zh_complex_1044_final.jsonl
├── imgs/
└── lmdb/
    ├── imgs/
    ├── pairs/
    └── pairs_zh/

- English and Chinese text pairs are constructed and stored as jsonl files, one record per line:

{"text_id": "de7d1fdf-b0eb-46e3-8d62-c0c0de086b67", "text": "A woman in a green dress dancing in front of York City Hall with no shoes | Time: none, place: York City Hall, person: none | Search intention: complex | Phrase, word expression optimization: A woman in a green dress dancing in front of City Hall | Multi-agent logical relationship analysis: -AND: < a woman wearing a green dress, a woman in front of City Hall, a woman dancing >, -NOT: < shoes >\n", "image_ids": ["d11884d9-832d-4c38-b3a7-22fbf6472cf5", ...], "image_xiangsi_ids": ["be86418e-74bb-427d-9ae1-7ff92b3af847", ...]}

- imgs/ holds the HR (highly relevant) and MR (moderately relevant) images.
- All pairs and images (including HR, MR, and IR images) are also provided as LMDB databases under lmdb/.
Run the following command to create the environment and install the required third-party libraries:
conda create -n CsdCLIP python=3.10
conda activate CsdCLIP
cd CsdCLIP
pip install -r requirements.txt
export PROJECT_ROOT="$(pwd)"
- Tigr-100k dataset:
The Tigr-100k dataset is available at Tigr100k and is ready to use upon download. Please place it in the ./data directory.
- public retrieval datasets:
Our experiments are conducted on four public retrieval datasets: MSCOCO, Flickr30k, wikiDO, and Urban1k. Please download and format them, then place them under ${DATAPATH} as follows:
${DATAPATH}
├── tigr100k
| ├── lmdb/
| | ├── pairs/
| | ├── pairs_zh/
| | └── imgs/
| ├── en_tigr100k.jsonl
| └── zh_tigr100k.jsonl
├── coco
| ├── lmdb/
| | ├── pairs/
| | └── imgs/
| └── en_coco.jsonl
└── other datasets

Be sure to download the model locally and replace the path in the bash script with the correct one before fine-tuning:
# LoRA fine-tune Qwen3-1.5B
sh ./LLM_finetune/lora.sh
# Generate rewrite text using the fine-tuned model
sh ./LLM_finetune/reason.sh

Refer to CLIP and CN-CLIP to install CLIP and CN-CLIP.
# CsdCLIP with CLIP-RN50, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale RN50 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with CLIP-ViT-L/14, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale ViT-L/14 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with CN-CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name cnclip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with CN-CLIP-ViT-L/14, Top-10 retrieval
python ./model/model.py --model_name cnclip --model_scale ViT-L/14 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10

Refer to Long-CLIP to clone the Long-CLIP repo into the CsdCLIP/ directory and rename the package to LongCLIP. Then download the checkpoints to the CsdCLIP/LongCLIP/checkpoints/ directory.
# CsdCLIP with Long-CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name longclip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10
# CsdCLIP with FG-CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name fgclip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/tigr100k --gpu 0 --num_res 10

CsdCLIP features a plug-and-play architecture. To integrate a new CLIP-like model (e.g., myclip), you only need to modify two files. The new model must provide the following two capabilities:
- Image encoding: a method that takes a preprocessed image tensor and returns a normalized feature vector.
- Text encoding: a method that takes tokenized text and returns a normalized feature vector.
| Capability | Standard CLIP-style API | HuggingFace-style API (e.g., FG-CLIP) |
|---|---|---|
| Load model | `model, preprocess = load(scale, device)` | `model = AutoModel.from_pretrained(name)` |
| Image preprocess | `preprocess(image)` → Tensor | `processor(images=image, return_tensors='pt')` |
| Encode image | `model.encode_image(image_tensor)` | `model.get_image_features(pixel_values)` |
| Tokenize text | `tokenize([text])` → Tensor | `tokenizer([text], ...)` → input_ids |
| Encode text | `model.encode_text(token_tensor)` | `model.get_text_features(input_ids)` |
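One way to bridge the two API styles is a thin adapter that exposes the standard `encode_image` / `encode_text` interface on top of a HuggingFace-style model. This is an illustrative sketch, not code from this repository; the class name is hypothetical, and features are handled as numpy arrays here for simplicity (real code would use torch tensors):

```python
import numpy as np

class HFClipAdapter:
    """Hypothetical adapter: wraps a HuggingFace-style model (providing
    get_image_features / get_text_features) behind the encode_image /
    encode_text interface that a CLIP-style caller expects."""

    def __init__(self, model, processor, tokenizer):
        self.model, self.processor, self.tokenizer = model, processor, tokenizer

    @staticmethod
    def _l2norm(x):
        # Normalize each feature row to unit length
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    def encode_image(self, image):
        inputs = self.processor(images=image, return_tensors="np")
        return self._l2norm(self.model.get_image_features(**inputs))

    def encode_text(self, texts):
        inputs = self.tokenizer(texts, return_tensors="np")
        return self._l2norm(self.model.get_text_features(**inputs))
```

With such a wrapper, registering `myclip` reduces to constructing the adapter in the model loader and leaving the retrieval code unchanged.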
# CsdCLIP with CLIP-RN50, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale RN50 --dataset_path $PROJECT_ROOT/${DATAPATH}/${DATASET_NAME} --gpu 0 --num_res 10
# CsdCLIP with CLIP-ViT-B/16, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale ViT-B/16 --dataset_path $PROJECT_ROOT/${DATAPATH}/${DATASET_NAME} --gpu 0 --num_res 10
# CsdCLIP with CLIP-ViT-L/14, Top-10 retrieval
python ./model/model.py --model_name clip --model_scale ViT-L/14 --dataset_path $PROJECT_ROOT/${DATAPATH}/${DATASET_NAME} --gpu 0 --num_res 10

If you find our work helpful for your research, please consider giving a citation:
to be determined