Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), because existing approaches rely on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy that constrains the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%.
# 1. Clone repository
git clone <your-repo-url>
cd PAND
# 2. Create conda environment
conda create -n pand_env python=3.9
conda activate pand_env
# 3. Install PyTorch (adjust CUDA version as needed)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# 4. Install dependencies
pip install -r requirements.txt
pip install open-clip-torch

Dataset configuration files live under configs/data/attributes/:

configs/data/attributes/
├── 0_CUB_200_2011.yaml # CUB-200-2011 dataset
├── 1_FGVC_AIRCRAFT.yaml # FGVC Aircraft dataset
├── 4_OxfordIIITPet.yaml # Oxford Pets dataset
├── 5_StanfordDogs.yaml # Stanford Dogs dataset
└── ...
Edit the config file for your dataset:
# Example: configs/data/attributes/0_CUB_200_2011.yaml
data_root: /path/to/your/CUB_200_2011 # Change this to your dataset path
class_num: 200
prompt_tmpl: "a photo of a {}, a type of bird."
classes:
1: "Black_footed_Albatross"
2: "Laysan_Albatross"
# ... (class names)Key fields to modify:
- data_root: Path to your dataset directory
- class_num: Number of classes
- prompt_tmpl: Template for text prompts (optional)
- classes: Class ID to name mapping
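For reference, here is a minimal sketch of how these fields could be consumed to build CLIP text prompts. It is illustrative only; the project's actual loader, the underscore handling, and the file path used here are assumptions.

import yaml

# Illustrative: combine prompt_tmpl and classes into per-class text prompts.
with open("configs/data/attributes/0_CUB_200_2011.yaml") as f:
    cfg = yaml.safe_load(f)

# e.g. "a photo of a Black footed Albatross, a type of bird."
prompts = [
    cfg["prompt_tmpl"].format(name.replace("_", " "))
    for _, name in sorted(cfg["classes"].items())
]
assert len(prompts) == cfg["class_num"]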
Run the automated script:
bash scripts/run_coop_stagePSC.sh

What this script does:
- Extracts CLIP image features
- Trains CoOp with pre-extracted features
- Generates learned_text_features.pt for Stage-NSD
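For orientation, below is a minimal sketch of the feature-extraction step with open_clip. The backbone and pretrained tag ("ViT-B-32", "openai") and the helper function are assumptions for illustration; the actual script controls the backbone, batching, caching, and dataset paths.

import torch
import open_clip
from PIL import Image

# Illustrative sketch: pre-extract a CLIP image feature for a single image.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model.eval()

@torch.no_grad()
def extract_clip_feature(image_path):
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)  # L2-normalized image feature

Features for the whole dataset would be stacked and cached to disk, then reused when training CoOp.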
Configuration:
Edit scripts/run_coop_stagePSC.sh to modify:
- DATASET: Dataset name (e.g., 0_CUB_200_2011)
- DATA_ROOT: Path to your dataset
- GPUS: GPU devices to use
Train the student with PAND (Stage-NSD):

python src/train.py \
data/attributes=0_CUB_200_2011 \
model=coop_kd \
model.net.student.model_name=resnet18 \
model.net.teacher.coop_text_features=/path/to/learned_text_features.pt \
model.kd_criterion.use_coop=true \
model.kd_criterion.use_nlrd=true \
model.kd_criterion.nlrd_k=3 \
model.kd_criterion.nlrd_lambda=1.0 \
model.kd_criterion.nlrd_weight=1.0 \
trainer=ddp \
trainer.devices=4 \
trainer.max_epochs=300 \
data.batch_size=128

CoOp hyperparameters (Stage-PSC; see the sketch below for how n_ctx is used):
- n_ctx: Number of learnable context tokens (default: 16)
- lr: Learning rate (default: 0.002)
- epochs: Training epochs (default: 200)
- batch_size: Batch size (default: 32)
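The sketch below illustrates the CoOp idea behind n_ctx: a shared set of learnable context tokens prepended to frozen class-name embeddings before the CLIP text encoder. The class and argument names (PromptLearnerSketch, class_name_embeds) are hypothetical; the actual implementation lives in the CoOp codebase.

import torch
import torch.nn as nn

class PromptLearnerSketch(nn.Module):
    """Illustrative CoOp-style prompt learner (not the project's implementation)."""
    def __init__(self, class_name_embeds, n_ctx=16):
        super().__init__()
        ctx_dim = class_name_embeds.shape[-1]
        # n_ctx learnable context vectors, shared across all classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Frozen token embeddings of class names: (num_classes, name_len, ctx_dim)
        self.register_buffer("class_name_embeds", class_name_embeds)

    def forward(self):
        n_cls = self.class_name_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # [learned context tokens] + [class-name tokens], fed to the frozen text encoder
        return torch.cat([ctx, self.class_name_embeds], dim=1)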
Key parameters:
- model: Config file (kda for the baseline, coop_kd for PAND)
- model.net.student.model_name: Student architecture (resnet18, mobilenet_v2)
- model.kd_criterion.use_coop: Enable CoOp text features
- model.kd_criterion.use_nlrd: Enable NLRD loss (see the sketch below)
- model.kd_criterion.nlrd_weight: NLRD loss weight (0~1)
- trainer.devices: Number of GPUs
- data.batch_size: Batch size (recommend 128 for 4 GPUs)
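As a rough illustration of what use_nlrd and nlrd_k control, here is a minimal sketch of a neighborhood-based relational distillation term. The function name and the exact formulation are assumptions made for illustration; see the paper for the actual NLRD loss.

import torch
import torch.nn.functional as F

def nlrd_loss_sketch(student_feats, teacher_feats, k=3, temperature=1.0):
    # student_feats, teacher_feats: (B, D) batch features
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    sim_t = t @ t.t()                    # teacher pairwise cosine similarities
    sim_s = s @ s.t()                    # student pairwise cosine similarities
    sim_t.fill_diagonal_(float("-inf"))  # exclude self from the neighborhood
    # Teacher defines each sample's local neighborhood: its top-k neighbors in the batch.
    topk_vals, topk_idx = sim_t.topk(k, dim=1)
    student_vals = sim_s.gather(1, topk_idx)
    # Match the student's distribution over those neighbors to the teacher's.
    p_t = F.softmax(topk_vals / temperature, dim=1)
    log_p_s = F.log_softmax(student_vals / temperature, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")

In training, such a term would be scaled by nlrd_weight and added to the other losses.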
This project builds upon the following excellent works:
- VL2Lite: Visual-Language Knowledge Distillation Framework
- CoOp: Context Optimization for Prompt Learning
- NRKD: Neighborhood-based Relational Knowledge Distillation
If you find this work helpful, please consider citing:
@misc{luo2026pandpromptawareneighborhooddistillation,
title={PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification},
author={Qiuming Luo and Yuebing Li and Feng Li and Chang Kong},
year={2026},
eprint={2602.07768},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.07768},
}

This project is licensed under the MIT License.