
CLIP-based Fusion-modal Reconstructing Hashing for Unsupervised Large-scale Cross-modal Retrieval

This repository contains the authors' PyTorch implementation for the IJMIR-23 paper "CLIP-based Fusion-modal Reconstructing Hashing for Unsupervised Large-scale Cross-modal Retrieval".

Introduction

Cross-modal hashing encodes multimedia data into a common binary hash space in which the correlations among samples from different modalities can be effectively measured. Deep cross-modal hashing further improves retrieval performance because deep neural networks can generate more semantically relevant features and hash codes. Existing unsupervised hashing methods generally have two limitations: (1) they fail to adequately capture the latent semantic relevance and co-existent information across different modalities; (2) they typically construct a similarity matrix to guide hash-code learning, which suffers from inaccurate similarities and leads to sub-optimal retrieval performance. To address these issues, we propose CLIP-based Fusion-modal Reconstructing Hashing (CFRH) for large-scale unsupervised cross-modal retrieval. First, we use CLIP to encode the features of the visual modality and learn a common hash-code representation space with modality-specific autoencoders. Second, we propose an efficient fusion approach to construct a semantically complementary affinity matrix that maximizes the potential semantic relevance of instances from different modalities. Furthermore, to retain the intrinsic semantic similarity of all similar pairs in the learned hash codes, we design a similarity-reconstruction objective based on semantic complementation, which yields high-quality hash-code representations.
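To make the fusion step more concrete, below is a minimal PyTorch sketch of how a semantically complementary affinity matrix can be built from intra-modal cosine similarities of CLIP image and text features. The function name `fused_affinity`, the feature dimension, and the simple weighted-sum fusion rule are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def fused_affinity(img_feat: torch.Tensor, txt_feat: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Illustrative fusion of intra-modal cosine-similarity matrices.

    img_feat, txt_feat: (N, D) CLIP image/text features for the same N training instances.
    alpha: fusion weight (hypothetical; the paper's fusion rule may differ).
    """
    img = F.normalize(img_feat, dim=1)            # unit-norm rows, so a matmul gives cosine similarity
    txt = F.normalize(txt_feat, dim=1)
    S_img = img @ img.t()                         # (N, N) image-image affinity
    S_txt = txt @ txt.t()                         # (N, N) text-text affinity
    return alpha * S_img + (1.0 - alpha) * S_txt  # weighted semantic complementation


if __name__ == "__main__":
    # Random features stand in for real CLIP outputs.
    S = fused_affinity(torch.randn(8, 512), torch.randn(8, 512))
    print(S.shape)  # torch.Size([8, 8])
```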


Dependencies

Please install the following packages:

  • Python (>=3.8)
  • pytorch
  • torchvision
  • h5py
  • CLIP
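
Once the packages are installed, a quick sanity check is to load a CLIP backbone and extract image and text features, as in the sketch below. The image path and text prompts are placeholders, and the ViT-B/32 backbone is our assumption; check main.py for the backbone the repository actually uses. Pillow is needed for image loading.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone choice is illustrative

# Placeholder inputs: replace with samples from your dataset.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # (1, 512) for ViT-B/32
    text_features = model.encode_text(text)     # (2, 512)

print(image_features.shape, text_features.shape)
```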

Datasets

For the datasets, we follow the setup from the Deep Cross-Modal Hashing GitHub repository (Jiang, CVPR 2017). You can download the pre-extracted features of these datasets from:

Implementation

Here we provide the implementation of our proposed model, along with the dataset files. The repository is organised as follows:

  • data/ contains the necessary dataset files for NUS-WIDE, MIRFlickr, and MS-COCO;
  • models.py contains the implementation of the model;

Finally, main.py puts all of the above together and can be used to execute a full training run on MIRFlickr, NUS-WIDE, or MS-COCO.
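
For orientation, the sketch below shows the general shape of a modality-specific autoencoder for hash learning: an encoder maps (CLIP) features to a K-bit relaxed code through tanh, and a decoder reconstructs the input features. The class name, layer sizes, and code length are assumptions for illustration and do not necessarily match models.py.

```python
import torch
import torch.nn as nn

class ModalityAutoEncoder(nn.Module):
    """Schematic modality-specific autoencoder for hash learning (illustrative only)."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 1024, code_len: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_len), nn.Tanh(),  # relaxed binary codes in (-1, 1)
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_len, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),             # reconstruct the input features
        )

    def forward(self, feat: torch.Tensor):
        code = self.encoder(feat)
        recon = self.decoder(code)
        return code, recon

# At retrieval time, binary codes are obtained as torch.sign(code).
```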

Process

  • Place the datasets in data/
  • Set the experiment parameters in main.py.
  • Train a model:
python main.py
  • Modify the parameter EVAL = True in main.py for evaluation:
python main.py
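
Evaluation in cross-modal hashing is conventionally reported as mean average precision (mAP) over Hamming ranking. The snippet below is a minimal, self-contained sketch of that metric for ±1 codes and multi-hot labels; it is not the evaluation code shipped in this repository, and all names are our own.

```python
import torch

def map_hamming(query_codes, retrieval_codes, query_labels, retrieval_labels):
    """Mean average precision over Hamming ranking (illustrative implementation).

    query_codes / retrieval_codes: {-1, +1} float tensors of shape (Nq, K) / (Nr, K).
    query_labels / retrieval_labels: multi-hot float tensors of shape (Nq, C) / (Nr, C).
    """
    K = query_codes.shape[1]
    # For ±1 codes, Hamming distance = (K - inner product) / 2.
    dist = 0.5 * (K - query_codes @ retrieval_codes.t())
    aps = []
    for i in range(query_codes.shape[0]):
        relevant = (query_labels[i] @ retrieval_labels.t() > 0).float()  # shares >= 1 label
        if relevant.sum() == 0:
            continue
        order = dist[i].argsort()                    # rank by increasing Hamming distance
        hits = relevant[order]
        ranks = torch.arange(1, hits.numel() + 1, dtype=torch.float)
        precision_at_k = hits.cumsum(0) / ranks
        aps.append((precision_at_k * hits).sum() / hits.sum())
    return torch.stack(aps).mean()
```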

Citation

If you find our work or the code useful, please consider citing our paper:

@article{mingyong2023clip,
  title={CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval},
  author={Mingyong, Li and Yewen, Li and Mingyuan, Ge and Longfei, Ma},
  journal={International Journal of Multimedia Information Retrieval},
  volume={12},
  number={1},
  pages={2},
  year={2023},
  publisher={Springer}
}
