Methods Summary of Conventional Image-Text Matching

Catalogue

Algorithm-oriented Works

*Generic-Feature Extraction*

(NeurIPS2013_DeViSE) DeViSE: A Deep Visual-Semantic Embedding Model.
Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov.
[paper]

(TACL2014_SDT-RNN) Grounded Compositional Semantics for Finding and Describing Images with Sentences.
Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng.
[paper]

(NeurIPSws2014_UVSE) Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models.
Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel.
[paper] [code] [demo]

(NeurIPS2014_DeFrag) Deep fragment embeddings for bidirectional image sentence mapping.
Andrej Karpathy, Armand Joulin, Li Fei-Fei.
[paper]

(ICCV2015_m-CNN) Multimodal Convolutional Neural Networks for Matching Image and Sentence.
Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li.
[paper]

(CVPR2015_DCCA) Deep Correlation for Matching Images and Text.
Fei Yan, Krystian Mikolajczyk.
[paper]

(CVPR2015_FV) Associating Neural Word Embeddings with Deep Image Representations using Fisher Vectors.
Benjamin Klein, Guy Lev, Gil Sadeh, Lior Wolf.
[paper]

(CVPR2015_DVSA) Deep Visual-Semantic Alignments for Generating Image Descriptions.
Andrej Karpathy, Li Fei-Fei.
[paper]

(NeurIPS2015_STV) Skip-thought Vectors.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler.
[paper]

(CVPR2016_SPE) Learning Deep Structure-Preserving Image-Text Embeddings.
Liwei Wang, Yin Li, Svetlana Lazebnik.
[paper]

(ICCV2017_HM-LSTM) Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding.
Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, Gang Hua.
[paper]

(ICCV2017_RRF-Net) Learning a Recurrent Residual Fusion Network for Multimodal Matching.
Yu Liu, Yanming Guo, Erwin M. Bakker, Michael S. Lew.
[paper]

(CVPR2017_2WayNet) Linking Image and Text with 2-Way Nets.
Aviv Eisenschtat, Lior Wolf.
[paper]

(ACMMM2018_WSJE) Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval.
Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, Amit K. Roy-Chowdhury.
[paper]

(WACV2018_SEAM) Fast Self-Attentive Multimodal Retrieval.
Jônatas Wehrmann, Maurício Armani Lopes, Martin D More, Rodrigo C. Barros.
[paper] [code]

(CVPR2018_CSE) End-to-end Convolutional Semantic Embeddings.
Quanzeng You, Zhengyou Zhang, Jiebo Luo.
[paper]

(CVPR2018_CHAIN-VSE) Bidirectional Retrieval Made Simple.
Jonatas Wehrmann, Rodrigo C. Barros.
[paper] [code]

(CVPR2018_SCO) Learning Semantic Concepts and Order for Image and Sentence Matching.
Yan Huang, Qi Wu, Liang Wang.
[paper]

(NC2019_MDM) Bidirectional image-sentence retrieval by local and global deep matching.
Lin Ma, Wenhao Jiang, Zequn Jie, Xu Wang.
[paper]

(ACMMM2019_SAEM) Learning Fragment Self-Attention Embeddings for Image-Text Matching.
Yiling Wu, Shuhui Wang, Guoli Song, Qingming Huang.
[paper] [code]

(ICCV2019_VSRN) Visual Semantic Reasoning for Image-Text Matching.
Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu.
[paper] [code]

(ICCV2019_LIWE) Language-Agnostic Visual-Semantic Embeddings.
Jonatas Wehrmann, Maurício Armani Lopes, Douglas Souza, Rodrigo Barros.
[paper] [code] [demo]

(CVPR2019_Personality) Engaging Image Captioning via Personality.
Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason Weston.
[paper]

(CVPR2019_PVSE) Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.
Yale Song, Mohammad Soleymani.
[paper] [code]

(Access2020_GSLS) Combining Global and Local Similarity for Cross-Media Retrieval.
Zhixin Li, Feng Ling, Canlong Zhang, Huifang Ma.
[paper]

(Access2020_M3A) Multi-Modal Memory Enhancement Attention Network for Image-Text Matching.
Zhong Ji, Zhigang Lin, Haoran Wang, Yuqing He.
[paper]

(ICPR2020_TERN) Transformer Reasoning Network for Image-Text Matching and Retrieval.
Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato.
[paper] [code]

(TOMM2020_TERAN) Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders.
Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet.
[paper] [code]

(TOMM2020_NIS) Upgrading the Newsroom: An Automated Image Selection System for News Articles.
Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, Karl Aberer.
[paper] [slides] [demo]

(TCSVT2020_MFM) Matching Image and Sentence With Multi-Faceted Representations.
Lin Ma, Wenhao Jiang, Zequn Jie, Yu-Gang Jiang, Wei Liu.
[paper]

(TCSVT2020_DSRAN) Learning Dual Semantic Relations with Graph Attention for Image-Text Matching.
Keyu Wen, Xiaodong Gu, Qingrong Cheng.
[paper] [code]

(ICMR2020_VRACR) Visual Relations Augmented Cross-modal Retrieval.
Yutian Guo, Jingjing Chen, Hao Zhang, Yu-Gang Jiang.
[paper]

(WACV2020_SGM) Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval.
Sijin Wang, Ruiping Wang, Ziwei Yao, Shiguang Shan, Xilin Chen.
[paper]

(ACMMM2020_CAMERA) Context-Aware Multi-View Summarization Network for Image-Text Matching.
Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, Qi Tian.
[paper] [code]

(EMNLP2021_DXR) Cross-Modal Retrieval Augmentation for Multi-Modal Classification.
Shir Gur, Natalia Neverova, Chris Stauffer, Ser-Nam Lim, Douwe Kiela, Austin Reiter.
[paper]

(arXiv2021_T-EMDE) T-EMDE: Sketching-based global similarity for cross-modal retrieval.
Barbara Rychalska, Mikolaj Wieczorek, Jacek Dabrowski.
[paper]

(SoMeT2021_LGSGM) A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval.
Manh-Duy Nguyen, Binh T. Nguyen, Cathal Gurrin.
[paper]

(ACLIJCNLP2021_HEI) Hashing based Efficient Inference for Image-Text Matching.
Rong-Cheng Tu, Lei Ji, Huaishao Luo, Botian Shi, Heyan Huang, Nan Duan, Xian-Ling Mao.
[paper]

(CSAE2021_SVSEN) Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval.
Zhixian Zeng, Jianjun Cao, Guoquan Jiang, Nianfeng Weng, Yuxin Xu, Zibo Nie.
[paper]

(ACMMM2021_SMFEA) Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval.
Xuri Ge, Fuhai Chen, Joemon M. Jose, Zhilong Ji, Zhongqin Wu, Xiao Liu.
[paper]

(IJCAI2021_SSP) Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective.
Yang Yang, Chubing Zhang, Yi-Chu Xu, Dianhai Yu, De-Chuan Zhan, Jian Yang.
[paper]

(CVPR2021_GPO) Learning the Best Pooling Strategy for Visual Semantic Embedding.
Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, Changhu Wang.
[paper] [code]

(CVPR2021_PG) Discrete-continuous Action Space Policy Gradient-based Attention for Image-Text Matching.
Shiyang Yan, Li Yu, Yuan Xie.
[paper] [code]

(ICCV2021_WCGL) Wasserstein Coupled Graph Learning for Cross-Modal Retrieval.
Yun Wang, Tong Zhang, Xueya Zhang, Zhen Cui, Yuge Huang, Pengcheng Shen, Shaoxin Li, Jian Yang.
[paper]

(EACL2023_ADPOOL) Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective.
Zijian Zhang, Chang Shu, Ya Xiao, Yuan Shen, Di Zhu, Jing Xiao, Youxin Chen, Jey Han Lau, Qian Zhang, Zheng Lu.
[paper] [code]

(TOMM2022_CGMN) Cross-modal Graph Matching Network for Image-text Retrieval.
Yuhao Cheng, Xiaoguang Zhu, Jiuchao Qian, Fei Wen, Peilin Liu.
[paper] [code]

(NAACL2022_MOTIS) Leaner and Faster: Two-Stage Model Compression for Lightweight Text-Image Retrieval.
Siyu Ren, Kenny Q. Zhu.
[paper] [code]

(TPAMI2022_VSRN++) Image-Text Embedding Learning via Visual and Textual Semantic Reasoning.
Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu.
[paper]

(ACMMM2022_CFM) Synthesizing Counterfactual Samples for Effective Image-Text Matching.
Hao Wei, Shuhui Wang, Xinzhe Han, Zhe Xue, Bin Ma, Xiaoming Wei, Xiaolin Wei.
[paper] [code]

(ACMMM2022_RGF) Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval.
Dongqing Wu, Huihui Li, Cang Gu, Lei Guo, Hang Liu.
[paper]

(ACMMM2022_TEAM) Token Embeddings Alignment for Cross-Modal Retrieval.
Chen-Wei Xie, Jianmin Wu, Yun Zheng, Pan Pan, Xian-Sheng Hua.
[paper]

(TIP2024_USER) USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval.
Yan Zhang, Zhong Ji, Di Wang, Yanwei Pang, Xuelong Li.
[paper] [code]

(TCSVT2023_ESA) ESA: External Space Attention Aggregation for Image-Text Retrieval.
Hongguang Zhu, Chunjie Zhang, Yunchao Wei, Shujuan Huang, Yao Zhao.
[paper] [code]

(ICASSP2023_RVSE) Semantic-Preserving Augmentation for Robust Image-Text Retrieval.
Sunwoo Kim, Kyuhong Shim, Luong Trung Nguyen, Byonghyo Shim.
[paper]

(ACMMM2023_GTMIS) Giving Text More Imagination Space for Image-text Matching.
Xinfeng Dong, Longfei Han, Dingwen Zhang, Li Liu, Junwei Han, Huaxiang Zhang.
[paper]

(CVPR2023_MSRM) Multilateral Semantic Relations Modeling for Image Text Retrieval.
Zheng Wang, Zhenwei Gao, Kangshuai Guo, Yang Yang, Xiaoming Wang, Heng Tao Shen.
[paper]

(CVPR2023_HREM) Learning Semantic Relationship Among Instances for Image-Text Matching.
Zheren Fu, Zhendong Mao, Yan Song, Yongdong Zhang.
[paper] [code]

(TCSVT2024_IMEB) Fast, Accurate, and Lightweight Memory-Enhanced Embedding Learning Framework for Image-Text Retrieval.
Zhe Li, Lei Zhang, Kun Zhang, Yongdong Zhang, Zhendong Mao.
[paper]

*Cross-Modal Interaction*

(CVPR2015_NIC) Show and Tell: A Neural Image Caption Generator.
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.
[paper]

(ICLR2015_m-RNN) Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN).
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille.
[paper] [code]

(CVPR2015_LRCN) Long-term Recurrent Convolutional Networks for Visual Recognition and Description.
Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell.
[paper]

(CVPR2017_DAN) Dual Attention Networks for Multimodal Reasoning and Matching.
Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim.
[paper]

(CVPR2017_sm-LSTM) Instance-aware Image and Sentence Matching with Selective Multimodal LSTM.
Yan Huang, Wei Wang, Liang Wang.
[paper]

(ECCV2018_CITE) Conditional Image-Text Embedding Networks.
Bryan A. Plummer, Paige Kordas, M. Hadi Kiapour, Shuai Zheng, Robinson Piramuthu, Svetlana Lazebnik.
[paper]

(ECCV2018_SCAN) Stacked Cross Attention for Image-Text Matching.
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He.
[paper] [code]

(CVPR2018_DSVE-Loc) Finding beans in burgers: Deep semantic-visual embedding with localization.
Martin Engilberge, Louis Chevallier, Patrick Pérez, Matthieu Cord.
[paper]

(arXiv2019_R-SCAN) Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators.
Kuang-Huei Lee, Hamid Palangi, Xi Chen, Houdong Hu, Jianfeng Gao.
[paper]

(arXiv2019_ParNet) ParNet: Position-aware Aggregated Relation Network for Image-Text matching.
Yaxian Xia, Lun Huang, Wenmin Wang, Xiaoyong Wei, Jie Chen.
[paper]

(TIS2021_TOD-Net) Target-Oriented Deformation of Visual-Semantic Embedding Space.
Takashi Matsubara.
[paper]

(ACML2019_SAVE) Multi-Scale Visual Semantics Aggregation with Self-Attention for End-to-End Image-Text Matching.
Zhuobin Zheng, Youcheng Ben, Chun Yuan.
[paper]

(ICMR2019_OAN) Improving What Cross-Modal Retrieval Models Learn through Object-Oriented Inter- and Intra-Modal Attention Networks.
Po-Yao Huang, Vaibhav, Xiaojun Chang, Alexander Georg Hauptmann.
[paper] [code]

(ACMMM2019_BFAN) Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching.
Chunxiao Liu, Zhendong Mao, An-An Liu, Tianzhu Zhang, Bin Wang, Yongdong Zhang.
[paper] [code]

(ACMMM2019_MTFN) Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking.
Tan Wang, Xing Xu, Yang Yang, Alan Hanjalic, Heng Tao Shen, Jingkuan Song.
[paper] [code]

(IJCAI2019_RDAN) Multi-Level Visual-Semantic Alignments with Relation-Wise Dual Attention Network for Image and Text Matching.
Zhibin Hu, Yongsheng Luo, Jiong Lin, Yan Yan, Jian Chen.
[paper]

(IJCAI2019_PFAN) Position Focused Attention Network for Image-Text Matching.
Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, Xin Fan.
[paper] [code]

(ICCV2019_CAMP) CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval.
Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, Jing Shao.
[paper] [code]

(ICCV2019_SAN) Saliency-Guided Attention Network for Image-Sentence Matching.
Zhong Ji, Haoran Wang, Jungong Han, Yanwei Pang.
[paper] [code]

(TC2020_SMAN) SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval.
Zhong Ji, Haoran Wang, Jungong Han, Yanwei Pang.
[paper]

(TMM2020_PFAN++) PFAN++: Bi-Directional Image-Text Retrieval with Position Focused Attention Network.
Yaxiong Wang, Hao Yang, Xiuxiu Bai, Xueming Qian, Lin Ma, Jing Lu, Biao Li, Xin Fan.
[paper] [code]

(TNNLS2020_CASC) Cross-Modal Attention With Semantic Consistence for Image-Text Matching.
Xing Xu, Tan Wang, Yang Yang, Lin Zuo, Fumin Shen, Heng Tao Shen.
[paper] [code]

(AAAI2020_DP-RNN) Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching.
Tianlang Chen, Jiebo Luo.
[paper]

(AAAI2020_ADAPT) Adaptive Cross-modal Embeddings for Image-Text Alignment.
Jonatas Wehrmann, Camila Kolling, Rodrigo C. Barros.
[paper] [code]

(CVPR2020_CAAN) Context-Aware Attention Network for Image-Text Retrieval.
Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li.
[paper]

(CVPR2020_MMCA) Multi-Modality Cross Attention Network for Image and Sentence Matching.
Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, Feng Wu.
[paper]

(CVPR2020_IMRAM) IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval.
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han.
[paper] [code]

(arXiv2021_CCRS) More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints.
Yuxiao Chen, Jianbo Yuan, Long Zhao, Rui Luo, Larry Davis, Dimitris N. Metaxas.
[paper]

(ICMR2022_SSAMT) Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval.
Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, Haijun Shan, Xuanjing Huang, Jianqing Fan.
[paper]

(AISTATS2023_SwAMP) SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval.
Minyoung Kim.
[paper]

(BMVC2021_RELAX) Image-Text Alignment using Adaptive Cross-attention with Transformer Encoder for Scene Graphs.
Juyong Song, Sunghyun Choi.
[paper]

(ACMMM2021_CSCC) Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for Image-Text Matching.
Pengpeng Zeng, Lianli Gao, Xinyu Lyu, Shuaiqi Jing, Jingkuan Song.
[paper]

(IJCAI2021_SHAN) Step-Wise Hierarchical Alignment Network for Image-Text Matching.
Zhong Ji, Kexin Chen, Haoran Wang.
[paper]

(EMNLP2021_ISERI) Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval.
Haoliang Liu, Tan Yu, Ping Li.
[paper]

(TIP2021_MEMBER) Memorize, Associate and Match: Embedding Enhancement via Fine-Grained Alignment for Image-Text Retrieval.
Jiangtong Li, Liu Liu, Li Niu, Liqing Zhang.
[paper]

(SIGIR2021_HAN) Heterogeneous Attention Network for Effective and Efficient Cross-modal Retrieval.
Tan Yu, Yi Yang, Yi Li, Lin Liu, Hongliang Fei, Ping Li.
[paper]

(SIGIR2021_CAEMCL) Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval.
Yi He, Xin Liu, Yiu-Ming Cheung, Shu-Juan Peng, Jinhan Yi, Wentao Fan.
[paper]

(SIGIR2021_DIME) Dynamic Modality Interaction Modeling for Image-Text Retrieval.
Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, Liqiang Nie.
[paper] [code]

(TMM2022_UARDAN) Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching.
Kun Zhang, Zhendong Mao, An-An Liu, Yongdong Zhang.
[paper]

(WACV2022_GraDual) GraDual: Graph-based Dual-modal Representation for Image-Text Matching.
Siqu Long, Soyeon Caren Han, Xiaojun Wan, Josiah Poon.
[paper]

(AAAI2022_CMCAN) Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching.
Huatian Zhang, Zhendong Mao, Kun Zhang, Yongdong Zhang.
[paper] [code]

(CVPR2022_NAAF) Negative-Aware Attention Framework for Image-Text Matching.
Kun Zhang, Zhendong Mao, Quan Wang, Yongdong Zhang.
[paper] [code]

(ICME2023_VSL) Image-text Retrieval via preserving main Semantics of Vision.
Xu Zhang, Xinzheng Niu, Philippe Fournier-Viger, Xudong Dai.
[paper] [code]

(WACV2023_CMSEI) Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval.
Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Joemon M. Jose.
[paper]

(ACMMM2023_PPAF) Progressive Positive Association Framework for Image and Text Retrieval.
Wenhui Li, Yan Wang, Yuting Su, Lanjun Wang, Weizhi Nie, An-An Liu.
[paper]

(ACMMM2023_DCIN) Towards Deconfounded Image-Text Matching with Causal Inference.
Wenhui Li, Xinqi Su, Dan Song, Lanjun Wang, Kun Zhang, An-An Liu.
[paper]

(ACMMM2023_RCTRN) Reservoir Computing Transformer for Image-Text Retrieval.
Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Penghong Wang, Jinqiao Shi, Xiaopeng Fan.
[paper]

(ACMMM2023_FNE) Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination.
Haoxuan Li, Yi Bin, Junrong Liao, Yang Yang, Heng Tao Shen.
[paper] [code]

(TIP2023_TGDT) Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training.
Chong Liu, Yuqi Zhang, Hongsong Wang, Weihua Chen, Fan Wang, Yan Huang, Yi-Dong Shen, Liang Wang.
[paper] [code]

(NeurIPS2023_DiffusionITM) Are Diffusion Models Vision-And-Language Reasoners?
Benno Krojer, Elinor Poole-Dayan, Vikram Voleti, Christopher Pal, Siva Reddy.
[paper] [code]

(TIP2023_RCAR) Plug-and-Play Regulators for Image-Text Matching.
Haiwen Diao, Ying Zhang, Wei Liu, Xiang Ruan, Huchuan Lu.
[paper] [code]

(TIP2024_DBL) Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching.
Haiwen Diao, Ying Zhang, Shang Gao, Xiang Ruan, Huchuan Lu.
[paper] [code]

*Similarity Measurement*

(ICLR2016_Order-emb) Order-Embeddings of Images and Language.
Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun.
[paper] [code]

(CVPR2020_HOAD) Visual-Semantic Matching by Exploring High-Order Attention and Distraction.
Yongzhi Li, Duo Zhang, Yadong Mu.
[paper]

(CVPR2020_GSMN) Graph Structured Network for Image-Text Matching.
Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang.
[paper] [code]

(ICML2020_GOT) Graph Optimal Transport for Cross-Domain Alignment.
Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, Jingjing Liu.
[paper] [code]

(EMNLP2020_WD-Match) Wasserstein Distance Regularized Sequence Representation for Text Matching in Asymmetrical Domains.
Weijie Yu, Chen Xu, Jun Xu, Liang Pang, Xiaopeng Gao, Xiaozhao Wang, Ji-Rong Wen.
[paper] [code]

(AAAI2021_SGRAF) Similarity Reasoning and Filtration for Image-Text Matching.
Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu.
[paper] [code]

(arXiv2022_TSHSR) Two-stream Hierarchical Similarity Reasoning for Image-text Matching.
Ran Chen, Hanli Wang, Lei Wang, Sam Kwong.
[paper]

(ICIP2022_RGN) Relation-Guided Network for Image-Text Retrieval.
Yulou Yang, Hao Shen, Ming Yang.
[paper]

(TCSVT2022_HAT) Hierarchical Feature Aggregation Based on Transformer for Image-Text Matching.
Xinfeng Dong, Huaxiang Zhang, Lei Zhu, Liqiang Nie, Li Liu.
[paper]

(ACMMM2022_BiKA) Image-Text Matching with Fine-Grained Relational Dependency and Bidirectional Attention-Based Generative Networks.
Jianwei Zhu, Zhixin Li, Yufei Zeng, Jiahui Wei, Huifang Ma.
[paper]

(ACMMM2022_CAliC) CAliC: Accurate and Efficient Image-Text Retrieval via Contrastive Alignment and Visual Contexts Modeling.
Hongyu Gao, Chao Zhu, Mengyin Liu, Weibo Gu, Hongfa Wang, Wei Liu, Xu-cheng Yin.
[paper]

(SIGIR2023_LEAPRR) Learnable Pillar-based Re-ranking for Image-Text Retrieval.
Leigang Qu, Meng Liu, Wenjie Wang, Zhedong Zheng, Liqiang Nie, Tat-Seng Chua.
[paper] [code]

(arXiv2023_Listwise) Integrating Listwise Ranking into Pairwise-based Image-Text Retrieval.
Zheng Li, Caili Guo, Xin Wang, Zerun Feng, Yanjun Wang.
[paper] [code]

(ACMMM2023_X-Dim) Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching.
Kun Zhang, Lei Zhang, Bo Hu, Mengxiao Zhu, Zhendong Mao.
[paper] [code]

(CVPR2023_CHAN) Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network.
Zhengxin Pan, Fangyu Wu, Bailing Zhang.
[paper] [code]

(CVPR2023_DivE) Improving Cross-Modal Retrieval with Set of Diverse Embeddings.
Dongwon Kim, Namyup Kim, Suha Kwak.
[paper] [project]

(TIP2023_RCAR) Plug-and-Play Regulators for Image-Text Matching.
Haiwen Diao, Ying Zhang, Wei Liu, Xiang Ruan, Huchuan Lu.
[paper] [code]

*Uncertainty Learning*

(CVPR2021_PCME) Probabilistic Embeddings for Cross-Modal Retrieval.
Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, Diane Larlus.
[paper] [code]

(arXiv2022_UCMRPR) Uncertainty-based Cross-Modal Retrieval with Probabilistic Representations.
Leila Pishdad, Ran Zhang, Konstantinos G. Derpanis, Allan Jepson, Afsaneh Fazly.
[paper]

(ACMMM2022_P2RM) Point to Rectangle Matching for Image Text Retrieval.
Zheng Wang, Zhenwei Gao, Xing Xu, Yadan Luo, Yang Yang, Heng Tao Shen.
[paper]

(NeurIPS2022_DAA) A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval.
Hao Li, Jingkuan Song, Lianli Gao, Pengpeng Zeng, Haonan Zhang, Gongfu Li.
[paper] [code]

(arXiv2023_PCME++) Improved Probabilistic Image-Text Representations.
Sanghyuk Chun.
[paper] [code]

(arXiv2023_UAMVSE) Uncertainty-Aware Multi-View Visual Semantic Embedding.
Wenzhang Wei, Zhipeng Gui, Changguang Wu, Anqi Zhao, Xingguang Wang, Huayi Wu.
[paper]

(NeurIPS2023_PAU) Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval.
Hao Li, Jingkuan Song, Lianli Gao, Xiaosu Zhu, Heng Tao Shen.
[paper] [code]

*Noisy Correspondence*

(NeurIPS2021_NCR) Learning with Noisy Correspondence for Cross-modal Matching.
Zhenyu Huang, Guocheng Niu, Xiao Liu, Wenbiao Ding, Xinyan Xiao, Hua Wu, Xi Peng.
[paper] [code]

(ACMMM2022_NRCCR) Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning.
Yabing Wang, Jianfeng Dong, Tianxiang Liang, Minsong Zhang, Rui Cai, Xun Wang.
[paper] [code]

(ACMMM2022_DECL) Deep Evidential Learning with Noisy Correspondence for Cross-modal Retrieval.
Yang Qin, Dezhong Peng, Xi Peng, Xu Wang, Peng Hu.
[paper] [code]

(ACMMM2022_ELRCMR) Early-Learning regularized Contrastive Learning for Cross-Modal Retrieval with Noisy Labels.
Tianyuan Xu, Xueliang Liu, Zhen Huang, Dan Guo, Richang Hong, Meng Wang.
[paper]

(TMM2023_RVTR) Robust Video-Text Retrieval via Noisy Pair Calibration.
Huaiwen Zhang, Yang Yang, Fan Qi, Shengsheng Qian, Changsheng Xu.
[paper]

(TMM2023_CTPR) Learning From Noisy Correspondence With Tri-Partition for Cross-Modal Matching.
Zerun Feng, Zhimin Zeng, Caili Guo, Zheng Li, Lin Hu.
[paper]

(TPAMI2023_RCL) Cross-Modal Retrieval With Partially Mismatched Pairs.
Peng Hu, Zhenyu Huang, Dezhong Peng, Xu Wang, Xi Peng.
[paper] [code]

(CVPR2023_MSCN) Noisy Correspondence Learning with Meta Similarity Correction.
Haochen Han, Kaiyao Miao, Qinghua Zheng, Minnan Luo.
[paper] [code]

(CVPR2023_BiCro) BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency.
Shuo Yang, Zhaopan Xu, Kai Wang, Yang You, Hongxun Yao, Tongliang Liu, Min Xu.
[paper] [code]

(ICCV2023_NoC) Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning.
Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh.
[paper] [code]

(NeurIPS2023_CRCL) Cross-modal Active Complementary Learning with Self-refining Correspondence.
Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, Peng Hu.
[paper] [code]

(AAAI2024_NPC) Negative Pre-aware for Noisy Cross-modal Matching.
Xu Zhang, Hao Li, Mang Ye.
[paper] [code]

(AAAI2024_SREM) Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation.
Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Xiaojun Chang, Jingdong Wang.
[paper]

(TIP2024_CREAM) Cross-modal Retrieval with Noisy Correspondence via Consistency Refining and Mining.
Xinran Ma, Mouxing Yang, Yunfan Li, Peng Hu, Jiancheng Lv, Xi Peng.
[paper] [code]

(ICLR2024_Norton) Multi-granularity Correspondence Learning from Long-term Noisy Videos.
Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, Xi Peng.
[paper] [code]

(CVPR2024_RDE) Noisy-Correspondence Learning for Text-to-Image Person Re-identification.
Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, Peng Hu.
[paper] [code]

(CVPR2024_L2RM) Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval.
Haochen Han, Qinghua Zheng, Guang Dai, Minnan Luo, Jingdong Wang.
[paper] [code]

(arXiv2024_REPAIR) REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence.
Ruochen Zheng, Jiahao Hong, Changxin Gao, Nong Sang.
[paper]

*Commonsense Learning*

(KSEM2019_SCKR) Semantic Modeling of Textual Relationships in Cross-Modal Retrieval.
Jing Yu, Chenghao Yang, Zengchang Qin, Zhuoqian Yang, Yue Hu, Weifeng Zhang.
[paper] [code]

(IJCAI2019_SCG) Knowledge Aware Semantic Concept Expansion for Image-Text Matching.
Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, Nan Duan.
[paper]

(ECCV2020_CVSE) Consensus-Aware Visual-Semantic Embedding for Image-Text Matching.
Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, Lin Ma.
[paper] [code]

(ECCV2022_CODER) CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval.
Haoran Wang, Dongliang He, Wenhao Wu, Boyang Xia, Min Yang, Fu Li, Yunlong Yu, Zhong Ji, Errui Ding, Jingdong Wang.
[paper]

(TOMM2023_MKVSE) MKVSE: Multimodal Knowledge Enhanced Visual-semantic Embedding for Image-Text Retrieval.
Duoduo Feng, Xiangteng He, Yuxin Peng.
[paper] [code]

(SPL2023_CKSTN) The Style Transformer with Common Knowledge Optimization for Image-Text Retrieval.
Wenrui Li, Zhengyu Ma, Jinqiao Shi, Xiaopeng Fan.
[paper]

(ACMMM2023_EKDM) External Knowledge Dynamic Modeling for Image-text Retrieval.
Song Yang, Qiang Li, Wenhui Li, Min Liu, Xuanya Li, Anan Liu.
[paper]

*Adversarial Learning*

(ACMMM2017_ACMR) Adversarial Cross-Modal Retrieval.
Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen.
[paper] [code]

(COLING2018_CAS) Learning Visually-Grounded Semantics from Contrastive Adversarial Samples.
Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, Jian Sun.
[paper] [code]

(CVPR2018_GXN) Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models.
Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, Gang Wang.
[paper]

(ICCV2019_TIMAM) Adversarial Representation Learning for Text-to-Image Matching.
Nikolaos Sarafianos, Xiang Xu, Ioannis A. Kakadiaris.
[paper]

(CVPR2019_UniVSE) Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations.
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma.
[paper]

(ICPR2020_ADDR) Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization.
Li Ren, Kai Li, LiQiang Wang, Kien Hua.
[paper]

(PR2021_ITMeetsAL) Integrating Information Theory and Adversarial Learning for Cross-modal Retrieval.
Wei Chen, Yu Liu, Erwin M. Bakker, Michael S. Lew.
[paper]

(ICCV2021_AACH) Adversarial Attack on Deep Cross-Modal Hamming Retrieval.
Chao Li, Shangqian Gao, Cheng Deng, Wei Liu, Heng Huang.
[paper]

(ACMMM2022_DCMHT) Differentiable Cross-modal Hashing via Multimodal Transformers.
Junfeng Tu, Xueliang Liu, Zongxiang Lin, Richang Hong, Meng Wang.
[paper]

*Loss Function*

(TPAMI2018_TBNN) Learning Two-Branch Neural Networks for Image-Text Matching Tasks.
Liwei Wang, Yin Li, Jing Huang, Svetlana Lazebnik.
[paper] [code]

(BMVC2018_VSE++) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler.
[paper] [code]

(ECCV2018_CMPL) Deep Cross-Modal Projection Learning for Image-Text Matching.
Ying Zhang, Huchuan Lu.
[paper] [code]

(ACLws2019_kNN-loss) A Strong and Robust Baseline for Text-Image Matching.
Fangyu Liu, Rongtian Ye.
[paper]

(ICASSP2019_NAA) A Neighbor-aware Approach for Image-text Matching.
Chunxiao Liu, Zhendong Mao, Wenyu Zang, Bin Wang.
[paper]

(CVPR2019_PVSE) Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.
Yale Song, Mohammad Soleymani.
[paper] [code]

(CVPR2019_SoDeep) SoDeep: a Sorting Deep net to learn ranking loss surrogates.
Martin Engilberge, Louis Chevallier, Patrick Pérez, Matthieu Cord.
[paper] [code]

(TOMM2020_Dual-Path) Dual-path Convolutional Image-Text Embeddings with Instance Loss.
Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, Yi-Dong Shen.
[paper] [code]

(AAAI2020_HAL) HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs.
Fangyu Liu, Rongtian Ye, Xun Wang, Shuaipeng Li.
[paper] [code]

(AAAI2020_CVSE++) Ladder Loss for Coherent Visual-Semantic Embedding.
Mo Zhou, Zhenxing Niu, Le Wang, Zhanning Gao, Qilin Zhang, Gang Hua.
[paper] [code]

(CVPR2020_MPL) Universal Weighting Metric Learning for Cross-Modal Matching.
Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen.
[paper] [code]

(ECCV2020_PSN) Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval.
Christopher Thomas, Adriana Kovashka.
[paper] [code]

(ECCV2020_AOQ) Adaptive Offline Quintuplet Loss for Image-Text Matching.
Tianlang Chen, Jiajun Deng, Jiebo Luo.
[paper] [code]

(ACMMM2021_Meta-SPN) Meta Self-Paced Learning for Cross-Modal Matching.
Jiwei Wei, Xing Xu, Zheng Wang, Guoqing Wang.
[paper]

(ICCVws2021_IMRL) Hard-Negatives or Non-Negatives? A Hard-Negative Selection Strategy for Cross-Modal Retrieval Using the Improved Marginal Ranking Loss.
Damianos Galanopoulos, Vasileios Mezaris.
[paper]

(CVPR2021_MRL) Learning Cross-Modal Retrieval with Noisy Labels.
Peng Hu, Xi Peng, Hongyuan Zhu, Liangli Zhen, Jie Lin.
[paper]

(TPAMI2021_LESS) Learning to Embed Semantic Similarity for Joint Image-text retrieval.
Noam Malali, Yosi Keller.
[paper]

(ECIR2022_DLMLG) Do Lessons from Metric Learning Generalize to Image-Caption Retrieval?
Maurits Bleeker, Maarten de Rijke.
[paper] [code]

(WACV2022_SAM) Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching.
Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas.
[paper]

(arXiv2022_BCLS) Image-Text Retrieval with Binary and Continuous Label Supervision.
Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, Ying Jin, Yufeng Zhang.
[paper]

(arXiv2022_UnifiedL) Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval.
Zheng Li, Caili Guo, Xin Wang, Zerun Feng, Jenq-Neng Hwang, Zhongtian Du.
[paper]

(TMM2022_DCD) Dynamic Contrastive Distillation for Image-Text Retrieval.
Jun Rao, Liang Ding, Shuhan Qi, Meng Fang, Yang Liu, Li Shen, Dacheng Tao.
[paper]

(ICIP2022_IMC) Intra-Modal Constraint Loss For Image-Text Retrieval.
Jianan Chen, Lu Zhang, Qiong Wang, Cong Bai, Kidiyo Kpalma.
[paper] [code]

(PR2022_LSEH) Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval.
Yan Gong, Georgina Cosma.
[paper] [code]

(IJCAI2022_MV-VSE) Multi-View Visual Semantic Embedding.
Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, Xijun Xue.
[paper]

(WACV2023_GOAL) Dissecting Deep Metric Learning Losses for Image-Text Retrieval.
Hong Xuan, Xi Chen.
[paper] [code]

(arXiv2023_MCAD) MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval.
Youbo Lei, Feifei He, Chen Chen, Yingbin Mo, Si Jia Li, Defeng Xie, Haonan Lu.
[paper]

(arXiv2023_SelHN) Selectively Hard Negative Mining for Alleviating Gradient Vanishing in Image-Text Matching.
Zheng Li, Caili Guo, Xin Wang, Zerun Feng, Zhongtian Du.
[paper]

(CVPR2024_CUSA) Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval.
Hailang Huang, Zhijie Nie, Ziqiao Wang, Ziyu Shang.
[paper] [code]

(TIP2024_DBL) Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching.
Haiwen Diao, Ying Zhang, Shang Gao, Xiang Ruan, Huchuan Lu.
[paper] [code]

Task-oriented Works

*Un-Supervised or Semi-Supervised*

(ECCV2018_VSA-AE-MMD) Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach.
Angelo Carraggi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara.
[paper]

(ACMMM2019_A3VSE) Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment.
Po-Yao Huang, Guoliang Kang, Wenhe Liu, Xiaojun Chang, Alexander G. Hauptmann.
[paper]

*Zero-Shot or Few-Shot*

(CVPR2017_DEM) Learning a Deep Embedding Model for Zero-Shot Learning.
Li Zhang, Tao Xiang, Shaogang Gong.
[paper] [code]

(AAAI2019_GVSE) Few-shot image and sentence matching via gated visual-semantic embedding.
Yan Huang, Yang Long, Liang Wang.
[paper]

(ICCV2019_ACMM) ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching.
Yan Huang, Liang Wang.
[paper]

*Continual Learning*

(ICCV2023_CTP) CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation.
Hongguang Zhu, Yunchao Wei, Xiaodan Liang, Chunjie Zhang, Yao Zhao.
[paper] [code]

(ACMMM2023_C2MR) C2MR: Continual Cross-Modal Retrieval for Streaming Multi-modal Data.
Huaiwen Zhang, Yang Yang, Fan Qi, Shengsheng Qian, Changsheng Xu.
[paper]

(ACMMM2023_CMITR) Knowledge Decomposition and Replay: A Novel Cross-modal Image-Text Retrieval Continual Learning Method.
Rui Yang, Shuang Wang, Huan Zhang, Siyuan Xu, YanHe Guo, Xiutiao Ye, Biao Hou, Licheng Jiao.
[paper]

*Identification Learning*

(ICCV2015_LSTM-Q+I) VQA: Visual question answering.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh.
[paper]

(CVPR2016_Word-NN) Learning Deep Representations of Fine-grained Visual Descriptions.
Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee.
[paper]

(CVPR2017_GNA-RNN) Person search with natural language description.
Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, Xiaogang Wang.
[paper] [code]

(ICCV2017_IATV) Identity-aware textual-visual matching with latent co-attention.
Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, Xiaogang Wang.
[paper]

(WACV2018_PWM-ATH) Improving text-based person search by spatial matching and adaptive threshold.
Tianlang Chen, Chenliang Xu, Jiebo Luo.
[paper]

(ECCV2018_CMPL) Deep Cross-Modal Projection Learning for Image-Text Matching.
Ying Zhang, Huchuan Lu.
[paper] [code]

(ECCV2018_GLA) Improving deep visual representation for person re-identification by global and local image-language association.
Dapeng Chen, Hongsheng Li, Xihui Liu, Yantao Shen, Jing Shao, Zejian Yuan, Xiaogang Wang.
[paper]

(ICASSP2019_MCCL) Language person search with mutually connected classification loss.
Yuyu Wang, Chunjuan Bo, Dong Wang, Shuang Wang, Yunwei Qi, Huchuan Lu.
[paper]

(ACMMM2019_A-GANet) Deep adversarial graph attention convolution network for text-based person search.
Jiawei Liu, Zheng-Jun Zha, Richang Hong, Meng Wang, Yongdong Zhang.
[paper]

(ICCV2019_TIMAM) Adversarial Representation Learning for Text-to-Image Matching.
Nikolaos Sarafianos, Xiang Xu, Ioannis A. Kakadiaris.
[paper]

(ICCV2019_FTD) Fusing Two Directions in Cross-Domain Adaption for Real Life Person Search by Language.
Kai Niu, Yan Huang, Liang Wang.
[paper]

(CVPR2019_DSCMR) Deep Supervised Cross-modal Retrieval.
Liangli Zhen, Peng Hu, Xu Wang, Dezhong Peng.
[paper] [code]

(TOMM2020_Dual-Path) Dual-path Convolutional Image-Text Embeddings with Instance Loss.
Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, Yi-Dong Shen.
[paper] [code]

(TIP2020_MIA) Improving Description-based Person Re-identification by Multi-granularity Image-text Alignment.
Kai Niu, Yan Huang, Wanli Ouyang, Liang Wang.
[paper]

(WACV2020_CMAAM) Text-based Person Search via Attribute-aided Matching.
Surbhi Aggarwal, R. Venkatesh Babu, Anirban Chakraborty.
[paper]

(ACMMM2020_HGAN) Hierarchical Gumbel Attention Network for Text-based Person Search.
Kecheng Zheng, Wu Liu, Jiawei Liu, Zheng-Jun Zha, Tao Mei.
[paper]

(AAAI2020_PMA) Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search.
Ya Jing, Chenyang Si, Junbo Wang, Wei Wang, Liang Wang, Tieniu Tan.
[paper]

(ECCV2020_ViTAA) ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language.
Zhe Wang, Zhiyuan Fang, Jun Wang, Yezhou Yang.
[paper] [code]

(arXiv2021_NAFS) Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search.
Chenyang Gao, Guanyu Cai, Xinyang Jiang, Feng Zheng, Jun Zhang, Yifei Gong, Pai Peng, Xiaowei Guo, Xing Sun.
[paper] [code]

(arXiv2021_SSAN) Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification.
Zefeng Ding, Changxing Ding, Zhiyin Shao, Dacheng Tao.
[paper] [code]

(ICASSP2022_SAF) Learning Semantic-Aligned Feature Representation for Text-based Person Search.
Shiping Li, Min Cao, Min Zhang.
[paper] [code]

(PR2021_ITMeetsAL) Integrating Information Theory and Adversarial Learning for Cross-modal Retrieval.
Wei Chen, Yu Liu, Erwin M. Bakker, Michael S. Lew.
[paper]

(ACMMM2021_DSSL) DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval.
Aichun Zhu, Zijie Wang, Yifeng Li, Xili Wan, Jing Jin, Tian Wang, Fangqiang Hu, Gang Hua.
[paper] [code]

(BMVC2021_TextReID) Text-Based Person Search with Limited Data.
Xiao Han, Sen He, Li Zhang, Tao Xiang.
[paper] [code]

(IJCAI2021_MGEL) Text-based Person Search via Multi-Granularity Embedding Learning.
Chengji Wang, Zhiming Luo, Yaojin Lin, Shaozi Li.
[paper]

(ICCV2021_LapsCore) LapsCore: Language-Guided Person Search via Color Reasoning.
Yushuang Wu, Zizheng Yan, Xiaoguang Han, Guanbin Li, Changqing Zou, Shuguang Cui.
[paper]

(arXiv2022_MANet) Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search.
Shuanglin Yan, Hao Tang, Liyan Zhang, Jinhui Tang.
[paper]

(ACMMM2022_PCDA) Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval.
Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao.
[paper]

(ACMMM2022_C2A2) Cross-modal Co-occurrence Attributes Alignments for Person Search by Language.
Kai Niu, Linjiang Huang, Yan Huang, Peng Wang, Liang Wang, Yanning Zhang.
[paper]

(ACMMM2022_LBUL) Look Before You Leap: Improving Text-based Person Retrieval by Learning A Consistent Cross-modal Common Manifold.
Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, Yifeng Li.
[paper]

(ACMMM2022_CAIBC) CAIBC: Capturing All-round Information Beyond Color for Text-based Person Retrieval.
Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, Yifeng Li.
[paper]

(ACMMM2022_LGUR) Learning Granularity-Unified Representations for Text-to-Image Person Re-identification.
Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, Changxing Ding.
[paper] [code]

(AAAI2022_AXM-Net) AXM-Net: Implicit Cross-Modal Feature Alignment for Person Re-identification.
Ammarah Farooq, Muhammad Awais, Josef Kittler, Syed Safwan Khalid.
[paper]

(ECCVW2022_IVT) See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval.
Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, Xiao Wang.
[paper] [code]

(ECCV2022_SRCF) A Simple and Robust Correlation Filtering method for text-based person search.
Wei Suo, Mengyang Sun, Kai Niu, Yiqi Gao, Peng Wang, Yanning Zhang, Qi Wu.
[paper] [code]

(TIP2023_CFine) CLIP-Driven Fine-grained Text-Image Person Re-identification.
Shuanglin Yan, Neng Dong, Liyan Zhang, Jinhui Tang.
[paper] [code]

(ACMMM2023_GTR) Text-based Person Search without Parallel Image-Text Data.
Yang Bai, Jingyao Wang, Min Cao, Chen Chen, Ziqiang Cao, Liqiang Nie, Min Zhang.
[paper]

(CVPR2023_IRRA) Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval.
Ding Jiang, Mang Ye.
[paper] [code]

*Scene-Text Learning*

(ECCV2018_SS) Single Shot Scene Text Retrieval.
Lluís Gómez, Andrés Mafla, Marçal Rusiñol, Dimosthenis Karatzas.
[paper] [code_Tensorflow] [code_Pytorch]

(WACV2020_PHOC) Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features.
Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas.
[paper] [code]

(WACV2021_MMRG) Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval.
Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas.
[paper] [code]

(WACV2021_StacMR) StacMR: Scene-Text Aware Cross-Modal Retrieval.
Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas.
[paper] [code]

*Related Works*

(Machine Learning 2010) Large scale image annotation: learning to rank with joint word-image embeddings.
Jason Weston, Samy Bengio, Nicolas Usunier.
[paper]

(NeurIPS2013_Word2Vec) Distributed Representations of Words and Phrases and their Compositionality.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean.
[paper]

(CVPR2017_DVSQ) Deep Visual-Semantic Quantization for Efficient Image Retrieval.
Yue Cao, Mingsheng Long, Jianmin Wang, Shichen Liu.
[paper]

(ACL2018_ILU) Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search.
Jamie Kiros, William Chan, Geoffrey Hinton.
[paper]

(AAAI2018_VSE-ens) VSE-ens: Visual-Semantic Embeddings with Efficient Negative Sampling.
Guibing Guo, Songlin Zhai, Fajie Yuan, Yuan Liu, Xingwei Wang.
[paper]

(ECCV2018_HTG) An Adversarial Approach to Hard Triplet Generation.
Yiru Zhao, Zhongming Jin, Guo-jun Qi, Hongtao Lu, Xian-sheng Hua.
[paper]

(ECCV2018_WebNet) CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images.
Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, Dinglong Huang.
[paper] [code]

(CVPR2018_BUTD) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang.
[paper] [code]

(EMNLP2019_GMMR) Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations.
Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann.
[paper]

(EMNLP2019_MIMSD) Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents.
Jack Hessel, Lillian Lee, David Mimno.
[paper] [code]

(ICCV2019_DRNet) Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid.
Zhanghui Kuang, Yiming Gao, Guanbin Li, Ping Luo, Yimin Chen, Liang Lin, Wayne Zhang.
[paper]

(ICCV2019_Align2Ground) Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment.
Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran.
[paper]

(CVPR2019_TIRG) Composing Text and Image for Image Retrieval - An Empirical Odyssey.
Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays.
[paper]

(SIGIR2019_PAICM) Prototype-guided Attribute-wise Interpretable Scheme for Clothing Matching.
Xianjing Han, Xuemeng Song, Jianhua Yin, Yinglong Wang, Liqiang Nie.
[paper]

(SIGIR2019_NCR) Neural Compatibility Ranking for Text-based Fashion Matching.
Suthee Chaidaroon, Mix Xie, Yi Fang, Alessandro Magnani.
[paper]

(arXiv2020_Tweets) Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval.
Hadi Abdi Khojasteh, Ebrahim Ansari, Parvin Razzaghi, Akbar Karimi.
[paper]

(arXiv2020_TIMNet) Weakly-Supervised Feature Learning via Text and Image Matching.
Gongbo Liang, Connor Greenwell, Yu Zhang, Xiaoqin Wang, Ramakanth Kavuluru, Nathan Jacobs.
[paper] [code]

(ECCV2020_InfoNCE) Contrastive Learning for Weakly Supervised Phrase Grounding.
Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem.
[paper] [code]

(ECCV2020_JVSM) Learning Joint Visual Semantic Matching Embeddings for Language-guided Retrieval.
Yanbei Chen, Loris Bazzani.
[paper]

(CVPR2020_POS-SCAN) More Grounded Image Captioning by Distilling Image-Text Matching Model.
Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang.
[paper] [code]

(COLING2020_VSE-Probing) Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case.
Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes.
[paper] [code]

(TMM2021_ALGCN) Adaptive Label-aware Graph Convolutional Networks for Cross-Modal Retrieval.
Shengsheng Qian, Dizhan Xue, Quan Fang, Changsheng Xu.
[paper]

(ICASSP2021_DAQN) Deep Adversarial Quantization Network for Cross-Modal Retrieval.
Yu Zhou, Yong Feng, Mingliang Zhou, Baohua Qiang, Leong Hou U, Jiajie Zhu.
[paper] [code]

(ACMMM2021_DARR) Database-adaptive Re-ranking for Enhancing Cross-modal Image Retrieval.
Rintaro Yanagi, Ren Togo, Takahiro Ogawa, Miki Haseyama.
[paper]

(AAAI2021_DAGNN) Dual Adversarial Graph Neural Networks for Multi-label Cross-modal Retrieval.
Shengsheng Qian, Dizhan Xue, Huaiwen Zhang, Quan Fang, Changsheng Xu.
[paper]

(ICCV2021_Ask&Confirm) Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval with Partial Query.
Guanyu Cai, Jun Zhang, Xinyang Jiang, Yifei Gong, Lianghua He, Fufu Yu, Pai Peng, Xiaowei Guo, Feiyue Huang, Xing Sun.
[paper] [code]

(TOMM2024_SSJDN) Scale-Semantic Joint Decoupling Network for Image-text Retrieval in Remote Sensing.
Chengyu Zheng, Ning Song, Ruoyu Zhang, Lei Huang, Zhiqiang Wei, Jie Nie.
[paper]

(arXiv2023_HMRN) Hierarchical Matching and Reasoning for Multi-Query Image Retrieval.
Zhong Ji, Zhihao Li, Yan Zhang, Haoran Wang, Yanwei Pang, Xuelong Li.
[paper] [code]

(arXiv2023_VLDD) Vision-Language Dataset Distillation.
Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky.
[paper]

(ACMMM2023_RTC) Relation Triplet Construction for Cross-modal Text-to-Video Retrieval.
Xue Song, Jingjing Chen, Yu-Gang Jiang.
[paper]

(NeurIPS2023_Diffusion-Classifier) Text-to-Image Diffusion Models are Zero-Shot Classifiers.
Kevin Clark, Priyank Jaini.
[paper]

(ICCV2023_Diffusion-Classifier) Your Diffusion Model is Secretly a Zero-Shot Classifier.
Alexander C. Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, Deepak Pathak.
[paper] [code]

Posted in

(JImaging2021_Survey) On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval.
Yan Gong, Georgina Cosma, Hui Fang.
[paper] [code]

(SIGIR2022_Survey) Where Does the Performance Improvement Come From? A Reproducibility Concern about Image-Text Retrieval.
Jun Rao, Fei Wang, Liang Ding, Shuhan Qi, Yibing Zhan, Weifeng Liu, Dacheng Tao.
[paper] [code]

(IJCAI2022_Survey) Image-text Retrieval: A Survey on Recent Research and Development.
Min Cao, Shiping Li, Juntao Li, Liqiang Nie, Min Zhang.
[paper]

(FTCGV2022_Survey) Vision-Language Pre-training: Basics, Recent Advances, and Future Trends.
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao.
[paper]

(IJMIR2023_Survey) Deep Learning for Video-Text Retrieval: a Review.
Cunjuan Zhu, Qi Jia, Wei Chen, Yanming Guo, Yu Liu.
[paper]

(arXiv2023_Survey) A Survey on Image-text Multimodal Models.
Ruifeng Guo, Jingxuan Wei, Linzhuang Sun, Bihui Yu, Guiyong Chang, Dawei Liu, Sibo Zhang, Zhengbing Yao, Mingjun Xu, Liping Bu.
[paper] [code]

(EACL2021_CxC) Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO.
Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, Yinfei Yang.
[paper] [code]

(ECCV2022_ECCVCaption) ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO.
Sanghyuk Chun, Wonjae Kim, Song Park, Minsuk Chang, Seong Joon Oh.
[paper] [code]