
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Ziyu Liu* · Zeyi Sun* · Yuhang Zang · Wei Li · Pan Zhang · Xiaoyi Dong · Yuanjun Xiong · Dahua Lin · Jiaqi Wang

Submitted to arXiv

📖 Paper | 🏠 Homepage

In this paper, we highlight the potential of combining retrieving and ranking with multi-modal large language models to revolutionize perception tasks such as fine-grained recognition, zero-shot image recognition, and few-shot object recognition. Motivated by the limited zero-shot/few-shot performance of CLIP and MLLMs on fine-grained datasets, our RAR designs a pipeline that uses an MLLM to rank the retrieved results. Our proposed approach can be seamlessly integrated into various MLLMs for real-world applications where the variety and volume of categories continuously expand. Our method opens up new avenues for research in augmenting the MLLM's abilities with a retrieving-augmented solution and could be beneficial for other tasks such as reasoning and generation in future works.


📢 News

  • 🚀 [03/25/2024] We are excited to announce the release of our fine-tuning data, along with the code used to generate it. Our sample JSON data is based on the FGVC-Aircraft dataset. You are encouraged to extend your research and experiments to additional datasets to uncover even more possibilities!
  • 🚀 [03/20/2024] We have uploaded part of our code to GitHub, including Fine-Grained Visual Recognition and Few-Shot Image Recognition. More updates are coming soon!
  • 🚀 [03/20/2024] Our work has been submitted to arXiv.

💡 Highlights

  • 🔥 We conduct an in-depth analysis of the strengths and weaknesses of VLMs and MLLMs in processing fine-grained datasets.
  • 🔥 Our RAR can be seamlessly integrated into various MLLMs in a plug-and-play manner.
  • 🔥 Through rigorous testing across 11 classification datasets and 2 object detection datasets, we demonstrate that our method outperforms baselines on a variety of visual recognition tasks.

🛠️ Usage

📃 Contents

  • 🛠️ Install
  • 💾 Prepare Data
  • 📅 Generate finetune data
  • 🔍 Few-Shot Image Classification
  • 🔍 Fine-Grained Visual Recognition

🛠️ Install

If you are not using Linux, do NOT proceed; see the instructions for macOS and Windows.

  1. Clone this repository and navigate to the RAR folder:
git clone https://github.com/Liuziyu77/RAR.git
cd RAR
  2. Prepare the environment step by step:
conda create -n rar python=3.10.13 -y  # create the RAR conda environment
conda activate rar  # activate the environment and install dependencies

💾 Prepare Data

Navigate to the CLIP-Cls folder and prepare the data following the instructions there.

📅 Generate finetune data

In our experiments, we have finetuned several MLLMs (Multimodal Large Language Models). The purpose of finetuning these models is to tap into their classification potential, enabling the MLLMs to provide answers in a standardized format. This facilitates the processing of our final results.

Within the finetune folder, we have included an .ipynb file for generating finetune data. The JSON file in this folder, based on the FGVC-Aircraft dataset, contains pre-generated finetune data. With minor format adjustments, this JSON file can be used for finetuning models such as LLaVA, InternLM-XComposer, Qwen, and others.

A finetune data example is shown below:

{
    "id": 0,
    "image": [
        "your picture path"
    ],
    "conversations": [
        {
            "from": "user",
            "value": "Here is a image:<Img index=1><image></Img>. Please play the role of a aircraft classification expert,
                     and sort the provided categories from high to low according to the top 5 similarity with the input image.
                     Here are the optional categories:['707-320', 'DC-8', 'DC-6', 'L-1011', '707-320']."
        },
        {
            "from": "assistant",
            "value": "['707-320', '707-320', 'DC-8', 'DC-6', 'L-1011']"
        }
    ]
}
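
For reference, below is a minimal sketch of how such records could be assembled in Python. The samples list, the output file name, and the rule for building the target ranking are illustrative assumptions, not the repository's exact generation code; the .ipynb in the finetune folder is the authoritative version.

import json

# Hypothetical inputs: image path, ground-truth label, and the top-5
# candidate labels retrieved from memory for that image.
samples = [
    ("your picture path", "707-320", ["707-320", "DC-8", "DC-6", "L-1011", "707-320"]),
]

records = []
for idx, (image_path, label, candidates) in enumerate(samples):
    prompt = (
        "Here is an image:<Img index=1><image></Img>. Please play the role of an "
        "aircraft classification expert, and sort the provided categories from high "
        "to low according to the top 5 similarity with the input image. "
        f"Here are the optional categories:{candidates}."
    )
    # Stable sort that moves every occurrence of the ground-truth label to
    # the front, matching the example answer above.
    answer = sorted(candidates, key=lambda c: c != label)
    records.append({
        "id": idx,
        "image": [image_path],
        "conversations": [
            {"from": "user", "value": prompt},
            {"from": "assistant", "value": str(answer)},
        ],
    })

with open("finetune_data.json", "w") as f:
    json.dump(records, f, indent=4)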

🔍 Few-Shot Image Classification

📋 Build memory

Navigate to the Few_shot folder, and run build_memory.ipynb step by step to construct the external memory. When you finish the steps above, three files will be generated:

{dataset_name}_{shot_number}_shot_database.txt
{dataset_name}_{shot_number}_shot_img_index.index
predictions_{shot_number}_shot_knn.pth

# File names differ per dataset and shot number, e.g.:
# caltech101_4_shot_database.txt
# eurosat_8_shot_img_index.index
# predictions_16_shot_knn.pth

The index file stores the index of the image embeddings that make up the memory. The txt file lists filenames and labels in the corresponding order. The pth file contains test results obtained with the CLIP+KNN method; you can use the code in CLIP_Cls to test its accuracy.
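
For intuition, here is a minimal sketch of the build-memory idea: embed the few-shot support images, write (filename, label) pairs to the txt file in the same order, and store the normalized embeddings in a FAISS index. The CLIP backbone, the txt format, and the support_set variable are assumptions for illustration; build_memory.ipynb is the authoritative version.

import clip  # OpenAI CLIP, one possible embedding backbone
import faiss
import numpy as np
import torch
from PIL import Image

# Hypothetical 4-shot support set: (filename, label) pairs.
support_set = [("images/ant_0001.jpg", "ant"), ("images/ant_0002.jpg", "ant")]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

embeddings = []
with open("caltech101_4_shot_database.txt", "w") as f:
    for filename, label in support_set:
        image = preprocess(Image.open(filename)).unsqueeze(0).to(device)
        with torch.no_grad():
            feat = model.encode_image(image)
        feat = feat / feat.norm(dim=-1, keepdim=True)  # normalize for cosine search
        embeddings.append(feat.cpu().numpy().astype("float32"))
        f.write(f"{filename} {label}\n")  # same order as the index entries

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(512)  # ViT-B/32 image features are 512-dimensional
index.add(np.concatenate(embeddings, axis=0))
faiss.write_index(index, "caltech101_4_shot_img_index.index")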

📋 Retrieve and Rank

After that, you can test retrieve-and-rank by running retrieve_and_rerank.py. A new pth file will be saved; it records the answers of the VLM after ranking the retrieved results. Before you run retrieve_and_rerank.py, three parameters need to be changed:

shot_number = 4
top_k = 5
dataset_name = 'caltech101'

shot_number = 4 corresponds to the 4-shot setting, top_k controls the number of retrieved items, and dataset_name decides which dataset is tested.
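
Conceptually, the script retrieves the top_k nearest support labels from the FAISS memory for each test image and then asks the MLLM to rank them. The sketch below illustrates that flow; ask_mllm is a placeholder for your MLLM call, and test_images, the txt format, and the output file name are assumptions rather than the script's exact code.

import faiss
import numpy as np
import torch

shot_number = 4
top_k = 5
dataset_name = 'caltech101'

# Load the memory built by build_memory.ipynb.
index = faiss.read_index(f"{dataset_name}_{shot_number}_shot_img_index.index")
with open(f"{dataset_name}_{shot_number}_shot_database.txt") as f:
    labels = [line.split()[-1] for line in f if line.strip()]

def retrieve_candidates(query_feat):
    """Return the labels of the top_k support images nearest to the query."""
    _, ids = index.search(query_feat.reshape(1, -1).astype("float32"), top_k)
    return [labels[i] for i in ids[0]]

def ask_mllm(image_path, candidates):
    """Placeholder: prompt the (finetuned) MLLM to rank `candidates` for the image."""
    raise NotImplementedError

test_images = []  # fill with (image_path, CLIP image feature) pairs for the test split
results = {}
for image_path, query_feat in test_images:
    results[image_path] = ask_mllm(image_path, retrieve_candidates(query_feat))

torch.save(results, f"predictions_{shot_number}_shot_rar.pth")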

🔍 Fine-Grained Visual Recognition

In this experiment, our testing is based on FineR. Therefore, the first step is to clone that project using the git clone command and install its required environment.

After that, navigate to the Fine-Grained Visual Recognition folder and run build_memory.ipynb step by step to build the memory for five datasets (Pets37, Dogs120, Cars196, Flowers102 and Bird200). We have prepared the pre-built memory indices and category names in the Fine-Grained Visual Recognition/database folder, which is organized as shown below:

├── database/
│   ├── Pets37/
│   │   ├── classnames.txt
│   │   ├── paired_data_pets.txt
│   │   ├── pets37_database.index
│   │   └── pets37_database.txt
│   ├── Dog120/
│   ├── Flowers102/
│   ├── Cars196/
│   └── Bird200/

Next, you can run the provided Fine-Grained Visual Recognition/retrieve_test.ipynb notebook to use our retrieval method for reselecting names. When you get the names, replace the names in FineR/experiments/pet37/guess/pet_llm_gussed_names_3.json with the new ones, and run sh FineR/scripts_eval/p_pipe.sh to evaluate sACC and cACC.
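
The name-replacement step can also be scripted. A small sketch, assuming the guess file stores a flat JSON list of class names (inspect the actual file and adapt if FineR uses a different structure):

import json

guess_file = "FineR/experiments/pet37/guess/pet_llm_gussed_names_3.json"

# Hypothetical: the class names reselected by retrieve_test.ipynb.
new_names = ["Abyssinian", "Bengal", "Birman"]

# Assumption: the file is a flat JSON list of guessed names.
with open(guess_file, "w") as f:
    json.dump(new_names, f, indent=4)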

✒️ Citation

@misc{liu2024rar,
      title={RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition}, 
      author={Ziyu Liu and Zeyi Sun and Yuhang Zang and Wei Li and Pan Zhang and Xiaoyi Dong and Yuanjun Xiong and Dahua Lin and Jiaqi Wang},
      year={2024},
      eprint={2403.13805},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

Usage and License Notices: The data and code are intended and licensed for research use only.