VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

This is the official PyTorch implementation of:

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

CVPR 2026

Highlights

Language-Free: VisualAD removes the text encoder entirely and learns anomaly/normal prototypes purely in the visual feature space.
Two Learnable Tokens: An anomaly token and a normal token are inserted into a frozen ViT, interacting with patch tokens through multi-layer self-attention to encode normality and abnormality.
SCA & SAF Modules: Spatial-Aware Cross-Attention (SCA) injects fine-grained spatial evidence into the tokens; Self-Alignment Function (SAF) recalibrates patch features before anomaly scoring.
13 Benchmarks: State-of-the-art performance across 6 industrial and 7 medical zero-shot anomaly detection benchmarks.
Backbone Agnostic: Adapts seamlessly to CLIP (ViT-L/14@336px) and DINOv2 (ViT-L/14).

Main Results

Image-Level ZSAD Performance (AUROC / F1-max / AP)

Dataset	WinCLIP	APRIL-GAN	AnomalyCLIP	AdaCLIP	VisualAD (CLIP)	VisualAD (DINOv2)
MVTec-AD	90.4 / 92.7 / 95.6	86.1 / 90.4 / 93.6	91.6 / 92.7 / 96.2	92.0 / 92.7 / 96.4	92.2 / 93.2 / 96.7	90.1 / 92.4 / 94.8
VisA	75.6 / 78.2 / 78.8	77.4 / 78.6 / 80.9	81.0 / 80.3 / 84.4	79.7 / 79.6 / 83.2	84.7 / 82.5 / 87.6	83.1 / 81.4 / 86.8
BTAD	68.2 / 67.8 / 70.9	73.7 / 68.7 / 69.9	88.7 / 86.0 / 90.6	90.0 / 87.2 / 91.5	94.9 / 93.9 / 97.0	88.2 / 84.7 / 89.7
KSDD2	93.5 / 86.4 / 94.2	90.4 / 82.9 / 92.0	91.9 / 84.5 / 93.4	94.9 / 90.3 / 96.2	98.0 / 93.9 / 98.3	97.7 / 93.1 / 98.1
DAGM	91.8 / 75.8 / 79.5	94.4 / 80.3 / 83.9	98.0 / 90.6 / 92.4	98.3 / 91.5 / 94.2	99.5 / 95.0 / 97.8	93.2 / 83.9 / 86.1
DTD-Synthetic	95.1 / 94.1 / 97.7	85.5 / 89.1 / 94.0	93.7 / 94.3 / 97.4	92.1 / 92.4 / 96.3	97.5 / 96.6 / 99.1	91.0 / 94.4 / 97.4

Pixel-Level ZSAD Performance (AUROC / F1-max / AP / PRO)

Dataset	WinCLIP	APRIL-GAN	AnomalyCLIP	AdaCLIP	VisualAD (CLIP)	VisualAD (DINOv2)
MVTec-AD	82.3 / 24.8 / 18.2 / 62.0	87.5 / 42.3 / 39.1 / 43.7	91.0 / 38.9 / 34.4 / 81.7	88.5 / 43.9 / 41.0 / 47.6	90.8 / 43.9 / 41.2 / 87.5	91.3 / 47.4 / 45.4 / 88.6
VisA	73.2 / 9.0 / 5.4 / 51.1	93.8 / 32.6 / 26.2 / 86.5	95.4 / 27.6 / 20.7 / 86.4	95.1 / 33.8 / 29.2 / 71.3	95.8 / 34.6 / 28.4 / 91.0	95.3 / 35.2 / 29.9 / 88.2
BTAD	72.7 / 18.5 / 12.9 / 27.3	91.3 / 40.1 / 37.7 / 21.0	93.0 / 47.1 / 41.5 / 71.0	87.7 / 42.3 / 36.6 / 17.1	91.1 / 49.8 / 43.1 / 80.4	93.4 / 42.6 / 38.7 / 76.7
DTD-Synthetic	79.5 / 16.1 / 9.8 / 51.5	94.9 / 60.4 / 61.0 / 33.8	97.5 / 55.8 / 52.5 / 87.9	95.1 / 58.4 / 56.1 / 34.3	98.1 / 64.3 / 65.5 / 94.8	96.7 / 65.8 / 67.7 / 92.4

All backbones use ViT-L/14@336px (CLIP) or ViT-L/14 (DINOv2). Full results including medical benchmarks (OCT17, BrainMRI, Brain_AD, HIS, CVC-ClinicDB, Endo, Kvasir) are available in our paper.

Getting Started

1. Environment

pip install -r requirements.txt

Main dependencies: PyTorch >= 2.0, torchvision, timm, scikit-learn, scipy, tqdm.

2. Data Preparation

We adopt the same dataset structure and JSON format as AnomalyCLIP. Please refer to their repository for dataset download and organization details.

We also provide scripts to generate the required JSON metadata files:

python generate_dataset_json/mvtec.py
python generate_dataset_json/visa.py

Supported datasets: MVTec-AD, VisA, BTAD, KSDD2, DAGM, DTD-Synthetic, OCT17, BrainMRI, Brain_AD, HIS, CVC-ClinicDB, Endo, Kvasir.

3. Pre-trained Weights

We provide pre-trained checkpoints with the CLIP (ViT-L/14@336px) backbone:

Training Set	Checkpoint	Evaluation Set
VisA	`weight/train_on_visa/CLIP.pth`	MVTec-AD and other cross-dataset benchmarks
MVTec-AD	`weight/train_on_mvtec/CLIP.pth`	VisA

4. Quick Start

Run the full cross-dataset training and evaluation pipeline (MVTec-AD <-> VisA) with a single command:

bash scripts/CLIP.sh

Please modify the dataset paths in scripts/CLIP.sh before running.

Citation

If you find this work useful, please consider citing:

@article{hou2026visualad,
  title={VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer},
  author={Hou, Yanning and Li, Peiyuan and Liu, Zirui and Wang, Yitong and Ruan, Yanran and Qiu, Jianfeng and Xu, Ke},
  journal={arXiv preprint arXiv:2603.07952},
  year={2026}
}

Acknowledgements

This project builds upon CLIP and DINOv2. We thank the authors of AnomalyCLIP and AdaCLIP for their open-source implementations.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
VisualAD_lib		VisualAD_lib
generate_dataset_json		generate_dataset_json
scripts		scripts
utils		utils
weight		weight
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
dataset.py		dataset.py
requirements.txt		requirements.txt
test.py		test.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

Highlights

Main Results

Image-Level ZSAD Performance (AUROC / F1-max / AP)

Pixel-Level ZSAD Performance (AUROC / F1-max / AP / PRO)

Getting Started

1. Environment

2. Data Preparation

3. Pre-trained Weights

4. Quick Start

Citation

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

Highlights

Main Results

Image-Level ZSAD Performance (AUROC / F1-max / AP)

Pixel-Level ZSAD Performance (AUROC / F1-max / AP / PRO)

Getting Started

1. Environment

2. Data Preparation

3. Pre-trained Weights

4. Quick Start

Citation

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages