Mix-ViT: Mixing Attentive Vision Transformer for Ultra-Fine-Grained Visual Categorization

Official PyTorch implementation of Mix-ViT: Mixing Attentive Vision Transformer for Ultra-Fine-Grained Visual Categorization, accepted by Pattern Recognition.

If you use the code in this repo for your work, please cite the following BibTeX entry:

@article{yu2023mix,
  title={Mix-ViT: Mixing attentive vision transformer for ultra-fine-grained visual categorization},
  author={Yu, Xiaohan and Wang, Jun and Zhao, Yang and Gao, Yongsheng},
  journal={Pattern Recognition},
  volume={135},
  pages={109131},
  year={2023},
  publisher={Elsevier}
}

Abstract

Ultra-fine-grained visual categorization (ultra-FGVC) moves down the taxonomy level to classify sub-granularity categories of fine-grained objects. This inevitably poses a challenge, i.e., classifying highly similar objects with limited samples, which impedes the performance of recent advanced vision transformer methods. To that end, this paper introduces Mix-ViT, a novel mixing attentive vision transformer to address the above challenge towards improved ultra-FGVC. The core design is a self-supervised module that mixes the high-level sample tokens and learns to predict whether a token has been substituted after attentively substituting tokens. This drives the model to understand the contextual discriminative details among inter-class samples. By incorporating such a self-supervised module, the network gains more knowledge from the intrinsic structure of the input data and thus improves its generalization capability with limited training samples. The proposed Mix-ViT achieves competitive performance on seven publicly available datasets, demonstrating the potential of vision transformers compared to CNNs for the first time in addressing the challenging ultra-FGVC tasks.
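
The token-substitution idea described in the abstract can be illustrated with a small sketch. This is not the code in this repo: it uses plain Python lists instead of PyTorch tensors, and the function name `mix_tokens`, the parameter `num_substitute`, and the rule "substitute the most-attended tokens" are assumptions made only for illustration. It shows one possible reading of "attentively substituting tokens": tokens of one sample are replaced by tokens of another at the positions with the highest attention, and a binary label per token records whether it was substituted, which is the target a self-supervised head would learn to predict.

```python
def mix_tokens(tokens_a, tokens_b, attention_a, num_substitute):
    """Substitute the most-attended tokens of sample A with tokens of sample B.

    Hypothetical helper for illustration only (not the repo's implementation).
    Returns (mixed_tokens, labels), where labels[i] is 1 if position i was
    substituted and 0 otherwise.
    """
    if not (len(tokens_a) == len(tokens_b) == len(attention_a)):
        raise ValueError("tokens_a, tokens_b and attention_a must have equal length")

    # Rank positions by attention score, highest first (assumed selection rule).
    ranked = sorted(range(len(tokens_a)), key=lambda i: attention_a[i], reverse=True)
    chosen = set(ranked[:num_substitute])

    # Take the token from sample B at the chosen positions, else keep sample A.
    mixed = [tokens_b[i] if i in chosen else tokens_a[i] for i in range(len(tokens_a))]
    # Per-token targets for the "was this token substituted?" prediction task.
    labels = [1 if i in chosen else 0 for i in range(len(tokens_a))]
    return mixed, labels


mixed, labels = mix_tokens(
    ["a0", "a1", "a2", "a3"],
    ["b0", "b1", "b2", "b3"],
    [0.10, 0.70, 0.05, 0.15],
    num_substitute=2,
)
print(mixed)   # ['a0', 'b1', 'a2', 'b3']
print(labels)  # [0, 1, 0, 1]
```

In a real model the tokens would be high-level feature vectors and the labels would supervise a per-token binary classifier trained alongside the classification loss.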

Prerequisites

The following packages are required to run the scripts:

Python >= 3.6
PyTorch = 1.8
Torchvision
Apex
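
Before running the scripts, it may help to confirm that the environment matches the list above. The following is a minimal sketch using only the standard library; the import names `torch`, `torchvision`, and `apex` are assumed to correspond to the listed packages, and the helper names are hypothetical. It does not check the PyTorch version.

```python
import importlib.util
import sys

# Import names assumed for the packages listed in the prerequisites.
REQUIRED_MODULES = ("torch", "torchvision", "apex")


def python_ok(minimum=(3, 6)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python >= 3.6:", python_ok())
    print("Missing packages:", missing_modules() or "none")
```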

Download Google pre-trained ViT models

Get the models from this link: ViT-B_16, ViT-B_32... Replace {MODEL_NAME} in the command below with the name of the model you want.

wget https://storage.googleapis.com/vit_models/imagenet21k/{MODEL_NAME}.npz
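
If `wget` is not available, the same download can be sketched in Python with the standard library. The URL pattern is taken from the command above; the function names and the `dest_dir` parameter are hypothetical and only serve this example.

```python
import os
import urllib.request

# URL prefix taken from the wget command above.
BASE_URL = "https://storage.googleapis.com/vit_models/imagenet21k"


def checkpoint_url(model_name):
    """Build the download URL for a model name such as "ViT-B_16"."""
    return f"{BASE_URL}/{model_name}.npz"


def download_checkpoint(model_name, dest_dir="."):
    """Download the checkpoint into `dest_dir` and return the local path."""
    dest = os.path.join(dest_dir, f"{model_name}.npz")
    urllib.request.urlretrieve(checkpoint_url(model_name), dest)
    return dest
```

For example, `download_checkpoint("ViT-B_16")` would save `ViT-B_16.npz` in the current directory.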

Dataset

You can download the datasets from the links below:

Run the experiments

Use the scripts in the scripts directory to train the model, e.g., to train on the SoybeanGene dataset:

$ sh scripts/train_soybean_gene.sh

Download Trained Models

Trained models: Google Drive

Acknowledgment

Our project references code from the following repos. Thanks for their work and for sharing it.
