
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Ziyu Liu* · Zeyi Sun* · Yuhang Zang · Wei Li · Pan Zhang · Xiaoyi Dong · Yuanjun Xiong · Dahua Lin · Jiaqi Wang

Submitted to arXiv

📖 Paper | 🏠 Homepage

In this paper, we highlight the potential of combining retrieving and ranking with multi-modal large language models to revolutionize perception tasks such as fine-grained recognition, zero-shot image recognition, and few-shot object recognition. Motivated by the limited zero-shot/few-shot performance of CLIP and MLLMs on fine-grained datasets, our RAR designs a pipeline that uses an MLLM to rank the retrieved results. Our proposed approach can be seamlessly integrated into various MLLMs for real-world applications where the variety and volume of categories continuously expand. Our method opens up new avenues for research in augmenting the MLLM's abilities with a retrieving-augmented solution and could be beneficial for other tasks such as reasoning and generation in future works.


📢 News

  • 🚀 [03/25/2024] We are excited to announce the release of our fine-tuning data, along with the code used to generate it. Our sample JSON data is based on the FGVC-Aircraft dataset. You are encouraged to extend your research and experiments to additional datasets to uncover even more possibilities!
  • 🚀 [03/20/2024] We have uploaded part of our code to GitHub, including Fine-Grained Visual Recognition and Few-Shot Image Recognition. More updates are coming soon!
  • 🚀 [03/20/2024] Our work has been submitted to arXiv.

💡 Highlights

  • 🔥 We conduct an in-depth analysis of the strengths and weaknesses of VLMs and MLLMs in processing fine-grained datasets.
  • 🔥 Our RAR can be seamlessly integrated into various MLLMs in a plug-and-play manner.
  • 🔥 Through rigorous testing across 11 classification datasets and 2 object detection datasets, we demonstrate that our method outperforms baselines on a variety of visual recognition tasks.

🛠️ Usage

📃 Contents

  • 🛠️ Install
  • 💾 Prepare Data
  • 📅 Generate finetune data
  • 🔍 Few-Shot Image Classification
  • 🔍 Fine-Grained Visual Recognition

🛠️ Install

If you are not using Linux, do NOT proceed; see the instructions for macOS and Windows.

  1. Clone this repository and navigate to the RAR folder:
git clone https://github.com/Liuziyu77/RAR.git
cd RAR
  2. Prepare the environment step by step:
conda create -n rar python=3.10.13 -y  # create the RAR conda environment
conda activate rar  # activate the environment and install dependencies

💾 Prepare Data

Navigate to the CLIP-Cls folder and prepare the data following the instructions there.

📅 Generate finetune data

In our experiments, we have finetuned several MLLMs (Multimodal Large Language Models). The purpose of finetuning these models is to tap into their classification potential, enabling the MLLMs to provide answers in a standardized format. This facilitates the processing of our final results.

Within the finetune folder, we have included an .ipynb file for generating finetune data. The JSON file in this folder, based on the FGVC-Aircraft dataset, contains pre-generated finetune data. With minor format adjustments, this JSON file can be used for finetuning models such as LLaVA, InternLM-XComposer, Qwen, and others.

A finetune data example is shown below:

{
    "id": 0,
    "image": [
        "your picture path"
    ],
    "conversations": [
        {
            "from": "user",
            "value": "Here is a image:<Img index=1><image></Img>. Please play the role of a aircraft classification expert,
                     and sort the provided categories from high to low according to the top 5 similarity with the input image.
                     Here are the optional categories:['707-320', 'DC-8', 'DC-6', 'L-1011', '707-320']."
        },
        {
            "from": "assistant",
            "value": "['707-320', '707-320', 'DC-8', 'DC-6', 'L-1011']"
        }
    ]
}
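
For reference, below is a minimal sketch of how such records could be assembled in Python. The samples list, the output file name, and the rule for building the target ranking are illustrative assumptions, not the repository's exact generation code; the .ipynb in the finetune folder is the authoritative version.

import json

# Hypothetical inputs: image path, ground-truth label, and the top-5
# candidate labels retrieved from memory for that image.
samples = [
    ("your picture path", "707-320", ["707-320", "DC-8", "DC-6", "L-1011", "707-320"]),
]

records = []
for idx, (image_path, label, candidates) in enumerate(samples):
    prompt = (
        "Here is an image:<Img index=1><image></Img>. Please play the role of an "
        "aircraft classification expert, and sort the provided categories from high "
        "to low according to the top 5 similarity with the input image. "
        f"Here are the optional categories:{candidates}."
    )
    # Stable sort that moves every occurrence of the ground-truth label to
    # the front, matching the example answer above.
    answer = sorted(candidates, key=lambda c: c != label)
    records.append({
        "id": idx,
        "image": [image_path],
        "conversations": [
            {"from": "user", "value": prompt},
            {"from": "assistant", "value": str(answer)},
        ],
    })

with open("finetune_data.json", "w") as f:
    json.dump(records, f, indent=4)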

🔍 Few-Shot Image Classification

📋 Build memory

Navigate to the Few_shot folder, and run build_memory.ipynb step by step to construct the external memory. When you finish the steps above, three files will be generated:

{dataset_name}_{shot_number}_shot_database.txt
{dataset_name}_{shot_number}_shot_img_index.index
predictions_{shot_number}_shot_knn.pth

# File names differ per dataset and shot number, e.g.:
# caltech101_4_shot_database.txt
# eurosat_8_shot_img_index.index
# predictions_16_shot_knn.pth

The index file stores the index of the image embeddings that make up the memory. The txt file lists filenames and labels in the corresponding order. The pth file contains test results obtained with the CLIP+KNN method; you can use the code in CLIP_Cls to test its accuracy.
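
For intuition, here is a minimal sketch of the build-memory idea: embed the few-shot support images, write (filename, label) pairs to the txt file in the same order, and store the normalized embeddings in a FAISS index. The CLIP backbone, the txt format, and the support_set variable are assumptions for illustration; build_memory.ipynb is the authoritative version.

import clip  # OpenAI CLIP, one possible embedding backbone
import faiss
import numpy as np
import torch
from PIL import Image

# Hypothetical 4-shot support set: (filename, label) pairs.
support_set = [("images/ant_0001.jpg", "ant"), ("images/ant_0002.jpg", "ant")]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

embeddings = []
with open("caltech101_4_shot_database.txt", "w") as f:
    for filename, label in support_set:
        image = preprocess(Image.open(filename)).unsqueeze(0).to(device)
        with torch.no_grad():
            feat = model.encode_image(image)
        feat = feat / feat.norm(dim=-1, keepdim=True)  # normalize for cosine search
        embeddings.append(feat.cpu().numpy().astype("float32"))
        f.write(f"{filename} {label}\n")  # same order as the index entries

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(512)  # ViT-B/32 image features are 512-dimensional
index.add(np.concatenate(embeddings, axis=0))
faiss.write_index(index, "caltech101_4_shot_img_index.index")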

📋 Retrieve and Rank

After that, you can test retrieve-and-rank by running retrieve_and_rerank.py. A new pth file will be saved; it records the answers of the VLM after ranking the retrieved results. Before you run retrieve_and_rerank.py, three parameters need to be changed:

shot_number = 4
top_k = 5
dataset_name = 'caltech101'

shot_number = 4 corresponds to the 4-shot setting, top_k controls the number of retrieved items, and dataset_name decides which dataset is tested.
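
Conceptually, the script retrieves the top_k nearest support labels from the FAISS memory for each test image and then asks the MLLM to rank them. The sketch below illustrates that flow; ask_mllm is a placeholder for your MLLM call, and test_images, the txt format, and the output file name are assumptions rather than the script's exact code.

import faiss
import numpy as np
import torch

shot_number = 4
top_k = 5
dataset_name = 'caltech101'

# Load the memory built by build_memory.ipynb.
index = faiss.read_index(f"{dataset_name}_{shot_number}_shot_img_index.index")
with open(f"{dataset_name}_{shot_number}_shot_database.txt") as f:
    labels = [line.split()[-1] for line in f if line.strip()]

def retrieve_candidates(query_feat):
    """Return the labels of the top_k support images nearest to the query."""
    _, ids = index.search(query_feat.reshape(1, -1).astype("float32"), top_k)
    return [labels[i] for i in ids[0]]

def ask_mllm(image_path, candidates):
    """Placeholder: prompt the (finetuned) MLLM to rank `candidates` for the image."""
    raise NotImplementedError

test_images = []  # fill with (image_path, CLIP image feature) pairs for the test split
results = {}
for image_path, query_feat in test_images:
    results[image_path] = ask_mllm(image_path, retrieve_candidates(query_feat))

torch.save(results, f"predictions_{shot_number}_shot_rar.pth")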

🔍 Fine-Grained Visual Recognition

In this experiment, our testing is based on FineR. Therefore, the first step is to clone that project using the git clone command and install its required environment.

After that, navigate to the Fine-Grained Visual Recognition folder and run build_memory.ipynb step by step to build the memory for five datasets (Pets37, Dogs120, Cars196, Flowers102 and Bird200). We have prepared the pre-built memory indices and category names in the Fine-Grained Visual Recognition/database folder, which is organized as shown below:

├── database/
│   ├── Pets37/
│   │   ├── classnames.txt
│   │   ├── paired_data_pets.txt
│   │   ├── pets37_database.index
│   │   └── pets37_database.txt
│   ├── Dog120/
│   ├── Flowers102/
│   ├── Cars196/
│   └── Bird200/

Next, you can run the provided Fine-Grained Visual Recognition/retrieve_test.ipynb notebook to use our retrieval method for reselecting names. When you get the names, replace the names in FineR/experiments/pet37/guess/pet_llm_gussed_names_3.json with the new ones, and run sh FineR/scripts_eval/p_pipe.sh to evaluate sACC and cACC.
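
The name-replacement step can also be scripted. A small sketch, assuming the guess file stores a flat JSON list of class names (inspect the actual file and adapt if FineR uses a different structure):

import json

guess_file = "FineR/experiments/pet37/guess/pet_llm_gussed_names_3.json"

# Hypothetical: the class names reselected by retrieve_test.ipynb.
new_names = ["Abyssinian", "Bengal", "Birman"]

# Assumption: the file is a flat JSON list of guessed names.
with open(guess_file, "w") as f:
    json.dump(new_names, f, indent=4)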

✒️ Citation

@misc{liu2024rar,
      title={RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition}, 
      author={Ziyu Liu and Zeyi Sun and Yuhang Zang and Wei Li and Pan Zhang and Xiaoyi Dong and Yuanjun Xiong and Dahua Lin and Jiaqi Wang},
      year={2024},
      eprint={2403.13805},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

Usage and License Notices: The data and code are intended and licensed for research use only.