A Reference-Comparison Framework for Generalizable AI-Generated Image Detection
🇨🇳 中文文档 · 📄 Paper · 📊 Results · 🚀 Quick Start · 📚 Citation
"Perception is a process of hypothesis testing." Richard L. Gregory, 1980
MIRROR reframes AI-generated image (AIGI) detection: instead of binary artifact classification, it casts detection as a Reference-Comparison process. A learnable, orthogonal Memory Bank explicitly encodes the manifold of real images. Each input is projected onto this manifold via sparse top-$k$ attention to construct an Ideal Reference, and the comparison residual between the input and its reference becomes a generator-agnostic detection signal.
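A minimal sketch of this mechanism in PyTorch, with illustrative names and shapes (this is not the released implementation; `ideal_reference`, the slot count, and the default `k` are assumptions):

```python
import torch
import torch.nn.functional as F

def ideal_reference(features: torch.Tensor, memory_bank: torch.Tensor, k: int = 8):
    """Project input patch features onto the real-image manifold.

    features:    (N, D) patch embeddings from the frozen encoder
    memory_bank: (M, D) learnable prototypes encoding the real manifold
    Returns the Ideal Reference (N, D) and the comparison residual (N, D).
    """
    # Cosine similarity between each patch and every memory slot.
    sim = F.normalize(features, dim=-1) @ F.normalize(memory_bank, dim=-1).T  # (N, M)

    # Sparse top-k attention: each patch attends only to its k nearest slots.
    topk_val, topk_idx = sim.topk(k, dim=-1)                                  # (N, k)
    weights = topk_val.softmax(dim=-1)                                        # (N, k)
    reference = (weights.unsqueeze(-1) * memory_bank[topk_idx]).sum(dim=1)    # (N, D)

    # The residual is what the real-image manifold cannot explain:
    # the generator-agnostic detection signal.
    residual = features - reference
    return reference, residual
```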
This shift unlocks two properties that prior detectors lack:
- 📈 Backbone scalability. Accuracy keeps climbing as DINOv3 scales from Base to Huge, while NPR / UnivFD / DDA saturate.
- 👁️ Superhuman robustness. On the human-imperceptible split of our Human-AIGI benchmark, MIRROR reaches 89.6% across 27 generators, surpassing both lay users and CV experts.
- 2026.04 Inference code and DINOv3-H+ weights released.
- 2026.03 Paper released on arXiv.
- Coming soon Training code, full checkpoint zoo, and the Human-AIGI Benchmark.
| Highlight | Description |
|---|---|
| 🔄 New paradigm | Reference-Comparison instead of artifact hunting |
| 🏆 State of the art | +2.1% on 6 standard benchmarks, +8.1% on 7 in-the-wild benchmarks |
| 👁️ Superhuman | 89.6% on the Human-AIGI hard subset, beating lay users and CV experts |
| 📈 Scales with backbone | Sustained gains from DINOv3-Base to -Huge; competitors saturate |
| 🧠 Reality memory bank | Learnable, orthogonal memory bank that explicitly encodes the real-image manifold |
| 🔬 First-of-its-kind benchmark | Psychophysically curated Human-AIGI set, 50 participants, 27 generators |
MIRROR is a two-phase framework (see fig/method.pdf).

**Phase 1 (memory-bank learning).** A frozen DINOv3 encoder extracts patch-level features from real images only, and a learnable, orthogonal memory bank is trained on these features to explicitly encode the real-image manifold.

**Phase 2 (detector training).** Each input is projected onto the learned manifold via sparse top-$k$ attention to construct its Ideal Reference, and the detector is trained on the comparison residual between the input and this reference, a generator-agnostic signal. See the paper for the exact training objective.
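The paper's objective is not reproduced here; as one purely hypothetical illustration of how an orthogonal memory bank might be encouraged, a Gram-matrix penalty could look like:

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(memory_bank: torch.Tensor) -> torch.Tensor:
    """Hypothetical regularizer (not the paper's loss): pushes memory
    slots toward mutual orthogonality so each slot encodes a distinct
    direction of the real-image manifold."""
    m = F.normalize(memory_bank, dim=-1)   # (M, D), unit-norm slots
    gram = m @ m.T                         # (M, M) pairwise cosines
    eye = torch.eye(gram.size(0), device=gram.device)
    return ((gram - eye) ** 2).mean()      # drive off-diagonals to zero
```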
All numbers are Balanced Accuracy (%) with format-aligned inputs (PNG re-encoded as JPG).
| Category | Benchmark | Prior SOTA | DINOv2-L | DINOv3-L | DINOv3-H+ | Δ vs SOTA |
|---|---|---|---|---|---|---|
| Standard | AIGCDetectBenchmark | 84.7 (B-Free) | 90.5 | 91.7 | 97.3 | +12.6 |
| | GenImage | 89.6 (B-Free) | 91.3 | 94.2 | 99.8 | +10.2 |
| | UnivFakeDetect | 87.8 (B-Free) | 84.6 | 88.2 | 92.4 | +4.6 |
| | Synthbuster | 96.5 (DDA) | 97.0 | 98.1 | 99.2 | +2.7 |
| | EvalGEN | 96.6 (DDA) | 98.1 | 99.0 | 99.8 | +3.2 |
| | DRCT-2M | 99.2 (B-Free) | 92.8 | 93.0 | 93.0 | −6.2 |
| In-the-wild | Chameleon | 83.5 (DDA) | 85.4 | 90.7 | 94.6 | +11.1 |
| | SynthWildx | 94.6 (B-Free) | 88.9 | 93.1 | 95.1 | +0.5 |
| | WildRF | 92.6 (B-Free) | 92.2 | 96.7 | 97.8 | +5.2 |
| | AIGIBench | 84.4 (DDA) | 85.6 | 90.5 | 94.9 | +10.5 |
| | CO-SPY | 80.3 (DDA) | 87.4 | 91.3 | 97.4 | +17.1 |
| | RR-Dataset | 70.3 (DDA) | 76.8 | 78.9 | 88.3 | +18.0 |
| | BFree-Online | 87.1 (B-Free) | 84.3 | 83.0 | 97.6 | +10.5 |
Aggregate: +2.1% average on 6 standard benchmarks, +8.1% average on 7 in-the-wild benchmarks vs the previous SOTA.
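For concreteness, a hedged sketch of the format alignment and the metric (the exact JPEG quality and preprocessing are assumptions; only the PNG-to-JPG re-encoding and the Balanced Accuracy definition come from this README):

```python
from io import BytesIO
from PIL import Image
from sklearn.metrics import balanced_accuracy_score

def align_format(path: str, quality: int = 95) -> Image.Image:
    """Re-encode an image as JPEG in memory so real and fake inputs
    share compression statistics (quality=95 is an assumption)."""
    with Image.open(path) as img:
        buf = BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    out = Image.open(buf)
    out.load()  # decode now so the in-memory buffer can be released
    return out

# Balanced Accuracy = mean of per-class recalls (real vs. fake),
# robust to class imbalance in the test set.
y_true = [0, 0, 1, 1]   # 0 = real, 1 = fake
y_pred = [0, 1, 1, 1]
print(balanced_accuracy_score(y_true, y_pred))  # 0.75
```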
A psychophysically curated benchmark covering 27 generators, with a hard subset selected from a 50-participant study using accuracy, confidence, and response time. Designed to measure when detectors cross the Superhuman Crossover line.
| Method | Hard subset Acc. (%) |
|---|---|
| Lay users (untrained) | ~ 55 |
| CV experts (no forensics training) | ~ 73 |
| MIRROR (DINOv3-H+) | 89.6 |
See the paper for full psychophysics and the per-generator breakdown.
- [x] Inference code
- [x] DINOv3-H+ inference weights
- [ ] Training code (Phase 1 + Phase 2)
- [ ] Full checkpoint zoo (DINOv2-L / DINOv3-L / DINOv3-H+)
- [ ] Human-AIGI Benchmark public release
Python 3.10+ is recommended.
```bash
git clone https://github.com/handsome-rich/MIRROR.git
cd MIRROR

# Install PyTorch first per your CUDA version: https://pytorch.org
pip install torch torchvision tqdm pillow numpy scikit-learn transformers peft
```

| File | Purpose | Link |
|---|---|---|
| `checkpoint-h-cur.pth` | Phase 2 detector checkpoint | Google Drive |
| `mirror_phase1.pth` | Phase 1 memory-bank weights | Google Drive |
| `dinov3-huge/` | DINOv3-H+ backbone | official DINOv3 release |
Place them under `weight/`:

```text
weight/
├── checkpoint-h-cur.pth   # Phase 2 detector
├── mirror_phase1.pth      # Phase 1 memory bank
└── dinov3-huge/           # DINOv3-Huge backbone
    ├── config.json
    └── model.safetensors
```
```bash
python inference.py \
    --model_path ./weight/checkpoint-h-cur.pth \
    --memory_path ./weight/mirror_phase1.pth \
    --backbone_path ./weight/dinov3-huge \
    --base_data_path /path/to/your/dataset \
    --benchmarks Chameleon \
    --batch_size 128 \
    --device cuda \
    --use_amp
```

`--base_data_path` should point at a root that holds one folder per benchmark:
```text
base_data_path/
├── AIGC_bm/                 # AIGCDetectBenchmark
├── UniversalFakeDetect/     # UnivFD
├── synthbuster/             # Synthbuster
├── GenEval-JPEG/            # EvalGEN
├── Chameleon/test/
├── WildRF/test/
├── synthwildx/
├── AIGIBench/
├── CO-SPY-In-the-Wild/
├── drct/
├── RRDataset/
└── B-Free/
```
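A quick sanity check that your layout matches this tree (folder names taken from the listing above; `inference.py` itself may resolve paths differently):

```python
from pathlib import Path

# Benchmark folder names expected under --base_data_path.
EXPECTED = [
    "AIGC_bm", "UniversalFakeDetect", "synthbuster", "GenEval-JPEG",
    "Chameleon", "WildRF", "synthwildx", "AIGIBench",
    "CO-SPY-In-the-Wild", "drct", "RRDataset", "B-Free",
]

root = Path("/path/to/your/dataset")
missing = [name for name in EXPECTED if not (root / name).is_dir()]
print("missing benchmark folders:", missing or "none")
```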
| Flag | Type | Description |
|---|---|---|
| `--model_path` | str | Phase 2 checkpoint (`.pth`) |
| `--memory_path` | str | Phase 1 memory-bank weights |
| `--backbone_path` | str | DINOv3 backbone directory |
| `--base_data_path` | str | Root directory containing benchmark sub-folders |
| `--benchmarks` | list | Benchmarks to evaluate, e.g. `Chameleon GenImage` |
| `--batch_size` | int | Per-device batch size |
| `--device` | str | `cuda` or `cpu` |
| `--use_amp` | flag | Enable mixed-precision inference |
| `--output_dir` | str | Where CSV reports go (default `./results`) |
CSV reports land at `results/{benchmark}_{timestamp}.csv` with columns `Acc`, `Bal_Acc`, `Real_Acc`, and `Fake_Acc`.
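To post-process a run, the reports can be loaded with pandas (assuming the timestamp in the filename sorts lexicographically, so the last match is the newest):

```python
import glob
import pandas as pd

# Pick the most recent Chameleon report; columns per the README:
# Acc, Bal_Acc, Real_Acc, Fake_Acc.
latest = sorted(glob.glob("results/Chameleon_*.csv"))[-1]
report = pd.read_csv(latest)
print(report[["Acc", "Bal_Acc", "Real_Acc", "Fake_Acc"]])
```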
If MIRROR helps your research, please cite:
```bibtex
@article{liu2026mirror,
  title   = {MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection},
  author  = {Liu, Ruiqi and Cui, Manni and Qin, Ziheng and Yan, Zhiyuan and Chen, Ruoxin and Han, Yi and Li, Zhiheng and Chen, Junkai and Chen, ZhiJin and Lin, Kaiqing and others},
  journal = {arXiv preprint arXiv:2602.02222},
  year    = {2026}
}
```

- Issues: github.com/handsome-rich/MIRROR/issues
- Email: ruiqi.liu24@nlpr.ia.ac.cn