This repository accompanies the survey:
Consistency in Diffusion-Based Visual Generation: A Survey
Song Yan, Wei Zhai, Chenfeng Wang, Ruixuan Li, Zhangping Yang, Yancheng Cai, Tao Zhang, Ling Wang, Yunwei Lan, Yujie He, Yang Cao, Min Li, Zheng-Jun Zha.
Preprints, 2026. DOI: 10.20944/preprints202606.0870.v1.
Paper: https://www.preprints.org/manuscript/202606.0870/v1
This repository provides a curated companion resource for the survey. It collects representative papers, methods, benchmarks, datasets, evaluators, and diagnostic resources related to consistency in diffusion-based visual generation. Rather than organizing the literature only by task or modality, the repository follows the survey's central question: what should a generated visual output agree with?
- Paper overview
- Why consistency matters
- Three consistency relations
- Evaluation view
- Optimization view
- Repository organization
- External consistency
- Internal consistency
- Normative consistency
- Machine-readable resources
- Coverage labels
- Contribution guide
- Citation
Diffusion models have become a standard foundation for visual generation, including text-to-image synthesis, image editing, personalization, video generation, and 3D-aware content creation. As these models are deployed in increasingly structured settings, visual fidelity alone is no longer sufficient. A generated result may look plausible while still omitting prompt details, breaking object-attribute relations, losing identity across instances, drifting over time, disagreeing across viewpoints, or violating human preference, safety, commonsense, and physical-plausibility requirements.
The survey treats these problems as instances of a broader requirement: consistency. In this view, consistency is not a single metric or task-specific property. It is a family of agreement relations between generated content and a target of agreement. The target may be external to the sample, internal to a set or sequence of generated outputs, or normative with respect to evaluative criteria. This relation-based view makes it possible to compare method families, evaluation protocols, and trade-offs across task boundaries.
The paper makes three organizing moves:
- Relation-based taxonomy. The survey distinguishes external, internal, and normative consistency according to the object of agreement.
- Evaluation protocol abstraction. Consistency claims are analyzed through observation units, agreement targets, evidence sources, and evaluation outputs.
- Optimization-locus analysis. Methods are compared according to where they impose consistency: before sampling, at the condition interface, during denoising, across coupled outputs, or after generation.
This repository operationalizes that structure as a maintainable resource list. The long method and benchmark sections below are intended as a navigable literature map rather than a replacement for the full survey discussion.
A visually high-quality diffusion output can still be inconsistent. Typical examples include prompt omissions in text-to-image generation, attribute binding errors in compositional prompts, over-editing in image editing, subject identity drift in personalization, temporal flicker in video generation, inconsistent geometry in multi-view generation, unsafe outputs, and physically implausible action consequences.
The survey argues that these failures are easier to reason about when grouped by their agreement target instead of by surface task. For example, prompt following, layout control, and instruction editing are all forms of agreement with an external specification. Identity persistence, multi-view coherence, and video continuity are forms of agreement among generated states. Preference alignment, safety, fairness, and physical plausibility require agreement with normative or world-level criteria.
Figure 1. Three consistency relations in diffusion-based generation. External consistency measures agreement with prompts, layouts, references, or editing instructions. Internal consistency measures compatibility among generated parts, views, frames, or instances. Normative consistency measures agreement with human-centered criteria such as preference, safety, and value constraints, as well as world-centered criteria such as physical, commonsense, or causal plausibility.
| Relation | Agreement target | Typical failures | Typical settings |
|---|---|---|---|
| External consistency | Agreement with prompts, references, controls, masks, layouts, poses, or edit instructions | Prompt omission, attribute binding error, counting error, control mismatch, over-editing | Text-to-image generation, structural control, instruction editing, inpainting, virtual try-on, typography |
| Internal consistency | Agreement among generated subjects, views, frames, shots, or story states | Identity drift, view inconsistency, temporal flicker, state forgetting, narrative discontinuity | Personalization, multi-view/3D generation, video generation, story visualization |
| Normative consistency | Agreement with preference, safety, fairness, physical plausibility, commonsense, or causal/world-state criteria | Low preference, unsafe output, erased benign concepts, physical violation, causal failure | Preference optimization, safety editing, concept erasure, physical/world-model evaluation |
Figure 2. Evaluation protocol for consistency claims. A consistency evaluation should make explicit what is being observed, what the output should agree with, which evidence source supports the judgment, and what form the evaluation output takes. Observation units may include a single image, an edited pair, an identity-conditioned set, a multi-view bundle, or a video/story sequence. Evidence sources may include MLLM/VQA checks, similarity signals, set- or sequence-level diagnostics, learned evaluators, human preference studies, and stress-test protocols.
This view separates a method's stated consistency goal from the evidence used to verify it. It also exposes common mismatches, such as evaluating a sequence-level claim with only frame-level metrics, or using broad preference scores to support a more specific claim about prompt faithfulness, identity preservation, or physical plausibility.
Figure 3. Optimization loci for enforcing consistency. Consistency can be strengthened before sampling, at the condition interface, during denoising, across coupled outputs, or after generation. These loci lead to different trade-offs among persistence, controllability, realism, diversity, memory cost, and modularity.
The survey uses this view to compare broad mechanism families without reducing the discussion to a single algorithmic template. Some methods improve the input or condition before generation; others modify the denoising dynamics, couple multiple outputs, use memory or correspondence mechanisms, apply reward or safety optimization, or filter/verify outputs after generation. The appropriate locus depends on the target relation and on which trade-offs are acceptable for the application.
The resource list below follows the paper's three-relation taxonomy:
- External consistency: agreement with user-specified or task-provided conditions, including prompts, layouts, masks, poses, reference images, and edit instructions.
- Internal consistency: agreement among generated subjects, views, frames, instances, scenes, or story states.
- Normative consistency: agreement with evaluative criteria such as preference, aesthetics, safety, fairness, physical plausibility, commonsense, causal structure, and world-state validity.
Within each relation, entries are grouped into Methods, Benchmarks & Evaluators, and Datasets & Data Resources. Each entry follows this format:
Title (venue/year or source) — short explanation of what consistency issue it helps study.
For very recent or not-yet-proceedings papers, the venue is marked as arXiv / project / venue TBD rather than guessed.
External consistency asks whether generated content follows externally specified conditions: text prompts, layouts, boxes, masks, depth maps, poses, reference images, editing instructions, or other user/task controls.
- GLIDE (arXiv 2022) — Early text-guided diffusion model supporting prompt-conditioned generation and editing.
- Imagen (NeurIPS 2022 / arXiv) — High-fidelity text-to-image diffusion model emphasizing language understanding.
- Latent Diffusion Models (CVPR 2022) — Latent-space diffusion backbone widely used for controllable generation and editing.
- Composable Diffusion Models (ECCV 2022) — Combines multiple diffusion score functions for compositional generation.
- Structured Diffusion Guidance (arXiv 2022) — Uses structured guidance signals to improve prompt-object alignment. (Same paper as StructureDiffusion, L68)
- StructureDiffusion (arXiv 2022) — Parses prompts into structured representations to improve compositional text-to-image generation.
- Attend-and-Excite (SIGGRAPH 2023) — Manipulates cross-attention maps to reduce missing objects and improve prompt coverage Paper
- BoxDiff (ICCV 2023) — Training-free box-constrained generation for spatially grounded text-to-image synthesis Paper
- Composer (ICML 2023) — Composes heterogeneous visual conditions for controllable image synthesis Paper
- MultiDiffusion (ICML 2023) — Fuses multiple diffusion paths to satisfy spatial and regional generation constraints.
- LLM-grounded Diffusion (ICLR 2024) — Uses LLM planning to turn complex prompts into layout-grounded generation constraints.
- SynGen (ICCV 2023) — Uses syntactic guidance to improve compositional text-to-image generation.
- RPG: Recaption, Plan, and Generate (arXiv 2024) — Uses MLLM-based recaptioning and planning for complex prompt following Paper
- CONFORM (arXiv / venue TBD) — Improves object-attribute alignment through contrastive or correspondence-driven prompt grounding.
- Divide-and-Bind (arXiv / venue TBD) — Decomposes complex prompts and binds objects to attributes or relations.
- Linguistic Binding in Diffusion (arXiv / venue TBD) — Studies or improves language-binding failures in text-to-image diffusion.
- Promptist (arXiv 2022) — Optimizes prompts to improve text-to-image generation quality and alignment.
- BeautifulPrompt (AAAI 2024 / arXiv) — Refines user prompts for stronger image generation quality and faithfulness.
- Prompt Expansion for Text-to-Image (topic / resource) — Expands underspecified prompts to reduce ambiguity in generation.
- Prompt Decomposition for T2I (topic / resource) — Decomposes prompts into atomic semantic constraints for evaluation or guidance.
- ControlNet (ICCV 2023) — Adds trainable side branches for depth, edge, pose, segmentation, and other controls Paper
- GLIGEN (CVPR 2023) — Grounds generation with boxes and phrase-level grounding tokens Paper
- T2I-Adapter (AAAI 2024) — Uses lightweight adapters for structural conditions such as sketch, depth, and pose Paper
- IP-Adapter (arXiv 2023) — Adds image-prompt conditioning while preserving text compatibility Paper
- AnyDoor (CVPR 2024) — Performs zero-shot object-level customization and insertion Paper
- FreeDoM (ICCV 2023) — Applies training-free energy guidance for conditional diffusion tasks Paper
- HumanSD (ICCV 2023) — Generates human images under native skeleton guidance Paper
- UniControl (NeurIPS 2023 / arXiv) — Provides a unified framework for multiple controllable generation signals.
- Uni-ControlNet (arXiv 2023) — Unifies multi-condition ControlNet-style conditioning.
- Ctrl-Adapter (ICLR 2025) — Uses efficient adapters for diverse spatial and structural controls.
- UniCon (ICLR 2025) — Designs unidirectional information flow for stronger large-scale condition control.
- InstanceDiffusion (CVPR 2024) — Supports instance-level control over object placement and attributes.
- ControlNet++ (arXiv / venue TBD) — Improves ControlNet-style conditioning quality and efficiency.
- ControlNet-XS (arXiv / venue TBD) — Compresses controllable generation modules for efficient deployment.
- ControlLoRA (arXiv / venue TBD) — Uses LoRA-style lightweight control adaptation.
- SparseCtrl (arXiv / venue TBD) — Controls image/video generation from sparse visual conditions.
- SemanticControl (arXiv / venue TBD) — Handles loose or weakly aligned semantic controls.
- LayoutDiffusion (CVPR 2023 / arXiv) — Conditions diffusion generation on layout annotations.
- LayoutDM (CVPR 2023 / arXiv) — Models layout-to-image synthesis through diffusion.
- SceneComposer (arXiv / venue TBD) — Composes scene-level controls for complex generation.
- Scene Graph Diffusion (arXiv / venue TBD) — Uses scene graphs for relation-aware image synthesis.
- DetDiffusion (arXiv / venue TBD) — Integrates detection-like constraints into image generation.
- Grounded Diffusion (topic / resource) — General family of grounding-based diffusion methods.
- SAM-guided Diffusion Editing (topic / resource) — Uses segmentation masks to localize editing constraints.
- Diffusion Posterior Sampling (ICLR 2023) — Uses measurement likelihoods to guide inverse-problem diffusion.
- Universal Guidance for Diffusion Models (ICML 2023 Workshop) — Applies generic guidance losses during sampling.
- Classifier Guidance (NeurIPS 2021) — Uses classifier gradients to steer diffusion samples.
- Classifier-Free Guidance (NeurIPS 2021 workshop / arXiv) — Steers conditional generation without an external classifier.
- SDEdit (ICLR 2022) — Edits images by adding noise and denoising under new guidance.
- Prompt-to-Prompt (ICLR 2023) — Controls cross-attention to edit prompts while preserving layout/content Paper
- Null-Text Inversion (CVPR 2023) — Inverts real images for more faithful prompt-based editing.
- DiffEdit (ICLR 2023) — Computes semantic edit masks from prompt differences Paper
- InstructPix2Pix (CVPR 2023) — Trains a diffusion editor to follow natural-language instructions Paper
- InstructDiffusion (CVPR 2024) — Unifies several visual instruction tasks in diffusion models Paper
- Imagic (CVPR 2023) — Edits real images by optimizing text embeddings and model weights.
- Paint-by-Example (CVPR 2023) — Uses exemplar images to guide localized editing Paper
- Plug-and-Play Diffusion Features (CVPR 2023) — Injects diffusion features to preserve structure during editing.
- Pix2Pix-Zero (ICCV 2023) — Performs zero-shot image-to-image translation through cross-attention guidance.
- MasaCtrl (ICCV 2023) — Uses mutual self-attention to preserve structure across synthesis/editing Paper
- LEDITS++ (arXiv 2023) — Performs lightweight semantic editing and concept erasure.
- DragonDiffusion (ICLR 2024 / arXiv) — Supports object moving, resizing, and fine-grained interactive editing Paper
- DragDiffusion (CVPR 2024) — Enables point-based drag editing with diffusion priors Paper
- FreeDrag (CVPR 2024 / arXiv) — Improves drag editing without model finetuning.
- DiffEditor (arXiv / venue TBD) — Provides an editing pipeline for localized diffusion modifications.
- SEGA (arXiv 2023) — Steers semantic directions during diffusion sampling.
- Emu Edit (CVPR 2024 / arXiv) — Uses instruction data for high-quality image editing.
- SmartEdit (CVPR 2024 / arXiv) — Combines MLLMs and diffusion for instruction-based editing.
- BrushNet (ECCV 2024 / arXiv) — Adds a dedicated inpainting branch for masked image editing.
- PowerPaint (ECCV 2024 / arXiv) — Supports versatile object removal, insertion, and inpainting.
- Inpaint Anything (arXiv 2023) — Combines segmentation and diffusion inpainting.
- TextDiffuser (NeurIPS 2023) — Improves text rendering inside generated images.
- TextDiffuser-2 (arXiv 2023) — Improves multilingual and layout-aware text rendering.
- AnyText (ICLR 2024) — Generates and edits multilingual text in images.
- GlyphDraw (NeurIPS 2023 / arXiv) — Uses glyph-level information for visual text generation.
- GlyphControl (arXiv / venue TBD) — Adds explicit glyph constraints for controllable typography.
- TryOnDiffusion (CVPR 2023) — Uses diffusion for virtual try-on with garment-person consistency.
- StableVITON (CVPR 2024) — Adapts stable diffusion to virtual try-on Paper
- IDM-VTON (ECCV 2024) — Improves image-based virtual try-on with diffusion Paper
- CatVTON (arXiv 2024) — Provides a lightweight virtual try-on framework Paper
- OOTDiffusion (arXiv 2024) — Generates outfits and try-on images under reference constraints Paper
- LaDI-VTON (ACM MM 2023 / arXiv) — Uses latent diffusion for virtual try-on.
- AnyDressing (arXiv / venue TBD) — Handles generalized dressing and garment transfer constraints.
- PosterCraft (arXiv / venue TBD) — Studies layout- and text-aware poster generation.
- CreatiPoster (arXiv / venue TBD) — Generates visually structured poster layouts.
- PosterMaker (arXiv / venue TBD) — Uses diffusion for controllable poster design.
- TIFA (ICCV 2023) — Evaluates prompt faithfulness using generated question-answer pairs.
- GenEval (NeurIPS 2023 workshop / arXiv) — Tests object presence, counting, colors, positions, and attribute binding.
- T2I-CompBench (NeurIPS 2023) — Measures compositional alignment across attributes, relations, and complex prompts.
- GenEval2 (arXiv / venue TBD) — Extends prompt-following evaluation with harder and less saturated cases.
- HRS-Bench (ICCV 2023) — Provides holistic evaluation of T2I capabilities, robustness, fairness, and bias. Paper
- DPG-Bench (arXiv 2024) — Uses dense prompts to evaluate semantic and relation following.
- GenAI-Bench / VQAScore (ECCV 2024) — Evaluates text-to-visual generation through VQA-style image/video scoring.
- DrawBench (Imagen / NeurIPS 2022 resource) — Human-evaluation prompt suite for text-to-image generation.
- PartiPrompts (arXiv 2022) — Large prompt set for evaluating compositional and high-level prompt following.
- DSG: Davidsonian Scene Graph evaluation (arXiv / venue TBD) — Converts prompts to scene-graph-like checks for semantic consistency.
- VIEScore (arXiv / venue TBD) — Uses vision-language evaluators for image-text alignment.
- EditBench (CVPR 2023) — Benchmarks text-guided image inpainting and edit preservation.
- ConceptBed (arXiv / venue TBD) — Evaluates concept learning and reusable concept binding.
- CountBench (resource / venue TBD) — Tests numerical object-counting consistency in generated images.
- SpatialBench (resource / venue TBD) — Tests spatial relation following.
- ObjectAttributeBench (resource / venue TBD) — Tests object-attribute binding.
- RelationBench (resource / venue TBD) — Tests relational semantics in text-to-image generation.
- TypographyBench (resource / venue TBD) — Evaluates rendered text accuracy in generated images.
- VTON evaluation suites (resource) — Evaluate garment preservation and person-garment alignment.
- Human pose generation evaluation (resource) — Evaluates pose-conditioned human generation.
- MagicBrush (NeurIPS 2023 Datasets and Benchmarks) — Instruction-guided image editing dataset with multi-turn annotations.
- InstructPix2Pix dataset (CVPR 2023 resource) — Synthetic instruction-edit pairs for image editing Paper
- COCO Captions (ECCV 2014 / dataset) — Common image-caption source for prompt grounding.
- Visual Genome (IJCV 2017 / dataset) — Dense object, attribute, and relation annotations.
- OpenImages (dataset) — Large-scale object and visual relationship annotations.
- ADE20K (CVPR 2017 / dataset) — Scene parsing annotations for structural control.
- LAION-5B (NeurIPS 2022 Datasets and Benchmarks) — Web-scale image-text pretraining data.
- LAION-Aesthetics (dataset resource) — Aesthetic-filtered image-text data.
- CC3M (ACL 2018 / dataset) — Web image-caption data for vision-language pretraining.
- CC12M (CVPR 2021 / dataset) — Larger conceptual-caption dataset.
- SA-1B (ICCV 2023 / dataset) — Large-scale segmentation masks for editing and control.
- LVIS (CVPR 2019 / dataset) — Long-tail instance annotations for object-level diagnostics.
- DeepFashion (CVPR 2016 / dataset) — Fashion data for virtual try-on and garment consistency.
- VITON-HD (CVPR 2021 workshop / dataset) — High-resolution virtual try-on data.
- DressCode (CVPR 2022 / dataset) — Multi-category virtual try-on dataset.
- OpenPose / pose datasets (resource) — Pose supervision for human-conditioned generation.
- RefCOCO (dataset) — Referring-expression grounding resource.
- GQA (CVPR 2019 / dataset) — Visual-question-answering resource for compositional reasoning.
- CLEVR (CVPR 2017 / dataset) — Synthetic compositional reasoning dataset.
- OCR/text rendering corpora (resource) — Text-image data for typography generation.
Internal consistency asks whether generated states remain mutually compatible across identities, subjects, views, frames, videos, or story sequences.
- Textual Inversion (ICLR 2023) — Learns new textual tokens for personalized concepts Paper
- DreamBooth (CVPR 2023) — Finetunes T2I models for subject-driven generation.
- Custom Diffusion (CVPR 2023) — Efficiently customizes multiple concepts through parameter-efficient updates Paper
- Perfusion (SIGGRAPH 2023) — Uses key-locking to preserve personalized concept identity.
- SVDiff (arXiv 2023) — Parameter-efficient personalization via singular-vector updates.
- P+ (arXiv 2023) — Expands textual inversion representation capacity.
- NeTI (arXiv 2023) — Uses neural textual inversion for richer concept embedding.
- ProSpect (SIGGRAPH 2023 / arXiv) — Personalizes without heavy finetuning.
- DisenBooth (arXiv 2023) — Disentangles identity and context for personalization.
- SuTI (arXiv 2023) — Scalable subject-driven text-to-image personalization.
- BLIP-Diffusion (NeurIPS 2023) — Uses pretrained subject representations for controllable subject generation Paper
- ELITE (ICCV 2023) — Encodes visual concepts into textual embeddings for fast personalization Paper
- FastComposer (NeurIPS 2023) — Enables tuning-free multi-subject generation Paper
- Subject-Diffusion (ICCV 2023) — Supports open-domain personalized subject generation Paper
- PhotoMaker (CVPR 2024) — Uses stacked ID embeddings for realistic human personalization Paper
- InstantID (arXiv 2024) — Provides zero-shot identity-preserving generation Paper
- IP-Adapter-FaceID (arXiv 2023/2024) — Preserves face identity through image-prompt adapters Paper
- PuLID (arXiv 2024) — Supports pure and lightning ID customization Paper
- InfiniteYou (arXiv / venue TBD) — Explores scalable identity-consistent personalization.
- RealCustom (arXiv / venue TBD) — Focuses on realistic personalized concept generation.
- InstantCharacter (arXiv / venue TBD) — Builds fast character-consistent generation.
- ConsiStory (arXiv 2024) — Training-free consistent character generation across images Paper
- StoryDiffusion (NeurIPS 2024) — Uses consistent self-attention for long-range image/video generation Paper
- StyleAligned (SIGGRAPH 2024) — Shares attention to preserve style across generated sets.
- The Chosen One (SIGGRAPH Asia 2024) — Generates consistent characters across text-to-image outputs.
- ConsistentID (arXiv 2024) — Preserves identity in portrait and character generation.
- CharaConsist (arXiv / venue TBD) — Studies fine-grained character consistency.
- MagicID (arXiv / venue TBD) — Provides ID-conditioned video customization.
- PersonalVideo (arXiv / venue TBD) — Customizes video generation with personalized identity.
- Phantom (arXiv / venue TBD) — Explores subject-consistent video generation.
- Preserve and Personalize (ICLR 2026) — Preserves distributional behavior while personalizing concepts.
- ConceptPrism (CVPR 2026 / project) — Disentangles concepts for personalized diffusion.
- Zero-1-to-3 (ICCV 2023) — Generates novel views from one image Paper
- One-2-3-45 (arXiv 2023) — Produces multi-view images and 3D assets from a single image Paper
- Zero123++ (arXiv 2023) — Improves single-image novel-view generation.
- Cascade-Zero123 (arXiv 2023) — Cascades view generation for stronger 3D consistency Paper
- Consistent123 (arXiv 2023) — Encourages cross-view consistency in novel-view synthesis.
- SyncDreamer (ICLR 2024) — Synchronizes multi-view diffusion generation Paper
- MVDream (ICLR 2024) — Generates multi-view images with 3D-aware diffusion Paper
- Wonder3D (CVPR 2024) — Reconstructs 3D assets from single images through multi-view diffusion Paper
- ViewDiff (CVPR 2024) — Enforces 3D consistency for text-to-image multi-view generation.
- EscherNet (CVPR 2024) — Performs scalable view synthesis under camera changes.
- DreamGaussian (ICLR 2024) — Uses 3D Gaussians for fast text/image-to-3D generation Paper
- LGM (ECCV 2024) — Reconstructs 3D Gaussians from sparse or generated views Paper
- GRM (ECCV 2024) — Builds large Gaussian reconstruction models.
- Instant3D (arXiv / venue TBD) — Accelerates 3D generation from sparse visual evidence.
- TripoSR (arXiv 2024) — Fast feed-forward 3D reconstruction from a single image Paper
- CRM (arXiv / venue TBD) — Uses reconstruction priors for consistent 3D asset generation.
- LRM (ICLR 2024 / arXiv) — Learns large reconstruction models for image-to-3D.
- VideoLDM (CVPR 2023) — Extends latent diffusion to video generation.
- Text2Video-Zero (ICCV 2023) — Adapts image diffusion to zero-shot video generation Paper
- Tune-A-Video (ICCV 2023) — Tunes a T2I model for video generation from one video Paper
- AnimateDiff (ICLR 2024) — Adds motion modules to personalized image diffusion models Paper
- FateZero (ICCV 2023) — Uses attention fusion for zero-shot video editing Paper
- Video-P2P (arXiv 2023) — Extends Prompt-to-Prompt-style editing to videos Paper
- TokenFlow (ICLR 2024) — Propagates diffusion features to improve temporal video editing consistency.
- CoDeF (CVPR 2024) — Uses content deformation fields for temporally consistent video processing.
- Rerender A Video (SIGGRAPH Asia 2023) — Performs zero-shot text-guided video-to-video translation.
- COVE (arXiv 2024) — Uses correspondence guidance for video editing Paper
- VideoCrafter (arXiv 2023) — Open video diffusion framework Paper
- VideoCrafter2 (CVPR 2024) — Improves high-quality video diffusion generation Paper
- ModelScopeT2V (project / 2023) — Open text-to-video generation system.
- Make-A-Video (arXiv 2022) — Generates videos from text using image-text and video data.
- Imagen Video (arXiv 2022) — Cascaded video diffusion model.
- Phenaki (ICLR 2023 / arXiv) — Generates long videos from open-domain prompts.
- VideoFusion (CVPR 2023 / arXiv) — Uses decomposed diffusion for video generation.
- Latte (TMLR / arXiv) — Applies latent diffusion transformers to video generation.
- VideoPoet (ICML 2024 / arXiv) — Multimodal video generation and editing model.
- Lumiere (SIGGRAPH 2024 / arXiv) — Space-time diffusion model for coherent video generation.
- Sora technical report (technical report 2024) — Large-scale video generation model emphasizing world simulation properties.
- MovieDreamer (arXiv / venue TBD) — Studies hierarchical long visual sequence generation.
- TaleCrafter (arXiv / venue TBD) — Generates multi-character visual stories.
- One-Prompt-One-Story (arXiv / venue TBD) — Aims at consistent story generation from a single prompt.
- Animate-A-Story (arXiv 2023) — Generates storytelling videos with retrieval and control signals Paper
- MotionStream (ICLR 2026) — Supports real-time video generation with interactive motion control.
- VideoDirectorGPT (arXiv 2023) — Uses LLM planning for multi-scene video generation.
- ShotAdapter (arXiv / venue TBD) — Adapts video generation for multi-shot consistency.
- VideoBooth (arXiv 2023) — Customizes video generation to a subject or concept.
- DreamVideo (arXiv 2023) — Personalizes video generation with subject-aware priors.
- Vlogger (arXiv 2024) — Generates talking/head or human-centric video content.
- MagicAnimate (CVPR 2024) — Animates human images under motion guidance Paper
- AnimateAnyone (CVPR 2024 / arXiv) — Animates reference characters with strong identity preservation.
- Champ (arXiv 2024) — Enables controllable and consistent human animation.
- MVG-Bench (arXiv 2024) — Evaluates multi-view generation consistency.
- MET3R (arXiv 2024) — Measures 3D-aware multi-view consistency from generated images.
- VBench (CVPR 2024) — Comprehensive video generation benchmark including subject/background and temporal consistency.
- Video-Bench (CVPR 2025) — Human-aligned video generation benchmark.
- EvalCrafter (CVPR 2024) — Evaluates generated videos along visual, text-video, and motion dimensions.
- FETV (NeurIPS 2023 Datasets and Benchmarks) — Fine-grained open-domain text-to-video evaluation benchmark. Paper
- ViStoryBench (CVPR 2026 / preprint) — Evaluates story visualization, character consistency, and narrative coherence. Paper
- T2V-CompBench (arXiv / venue TBD) — Tests compositional text-to-video generation.
- VideoScore (arXiv / venue TBD) — Provides learned or automatic video generation quality scoring.
- VideoPhy temporal subset (ICLR 2025) — Uses physical video checks as temporal/world consistency diagnostics.
- Long-video consistency evaluation (resource) — Focuses on long-horizon entity and scene persistence.
- Character consistency benchmark (resource) — Tests identity preservation across generated sets.
- Multi-view consistency metrics (resource) — Measures cross-view geometric compatibility.
- Story visualization benchmark (resource) — Tests narrative and character persistence in story sequences.
- Video editing consistency metrics (resource) — Measures preservation and temporal stability after video editing.
- CLIP frame consistency (metric family) — Uses semantic features to estimate cross-frame consistency.
- DINO tracking consistency (metric family) — Uses self-supervised features for object/region persistence.
- Identity similarity metrics (metric family) — Evaluates subject or face identity preservation.
- Face recognition metrics (metric family) — Uses face recognition models for identity consistency.
- LPIPS temporal smoothness (metric family) — Measures perceptual smoothness across frames.
- MeViS (ICCV 2023) — Motion-expression video segmentation data useful for temporal grounding.
- MOSE (ICCV 2023 / dataset) — Video object segmentation data with complex occlusions.
- TAO (ECCV 2020) — Long-tail tracking data for object persistence diagnostics.
- VSPW (CVPR 2021) — Video scene parsing dataset for scene-state continuity.
- nuScenes (CVPR 2020) — Driving dataset useful for dynamic-scene consistency.
- KITTI (IJRR 2013 / dataset) — Autonomous-driving visual dataset for geometry and temporal checks.
- Waymo Open Dataset (CVPR 2020 / dataset) — Large-scale driving data for world and motion consistency.
- DAVIS (CVPR 2016) — Video object segmentation data for temporal preservation.
- YouTube-VOS (ECCV 2018 / dataset) — Large-scale video object segmentation data.
- LaSOT (CVPR 2019) — Long-term single-object tracking dataset.
- TrackingNet (ECCV 2018 / dataset) — Large-scale object tracking data.
- Objaverse (CVPR 2023 / dataset) — Large 3D object dataset for view and 3D generation.
- Objaverse-XL (NeurIPS 2023 Datasets and Benchmarks) — Web-scale 3D object data.
- CO3D (ICCV 2021 / dataset) — Common objects in 3D data for view consistency.
- RealEstate10K (dataset) — Camera-trajectory video data for novel-view synthesis.
- ScanNet (CVPR 2017 / dataset) — RGB-D scene data for geometry-aware generation.
- ShapeNet (arXiv 2015 / dataset) — 3D shape dataset for object-level 3D generation.
- Google Scanned Objects (dataset) — High-quality scanned object assets.
- MVImgNet (CVPR 2023 / dataset) — Multi-view image dataset for object-centric reconstruction.
- Kubric (CVPR 2022 / dataset generator) — Synthetic video/scene data generation for controlled temporal diagnostics. Paper
Normative consistency asks whether generated content satisfies evaluative principles such as preference, aesthetics, safety, fairness, concept restrictions, physical plausibility, commonsense, action consequence, and world-state validity.
- Pick-a-Pic / PickScore (NeurIPS 2023) — Collects pairwise preferences and trains a preference scorer.
- ImageReward (NeurIPS 2023) — Learns a general human preference reward model for T2I images. Paper
- HPS (ICCV 2023 / arXiv) — Scores generated images according to human preference.
- HPSv2 (arXiv 2023) — Refines human preference scoring and benchmark coverage.
- HPSv3 (arXiv 2025) — Extends preference evaluation to broader text-image distributions. Paper
- MPS (CVPR 2024) — Models multi-dimensional human preferences for T2I generation.
- VisionReward (AAAI 2026) — Learns multi-dimensional reward signals for image and video generation.
- Diffusion-DPO (NeurIPS 2023 / arXiv) — Applies direct preference optimization to diffusion models.
- DDPO (ICLR 2024) — Trains diffusion models with reinforcement learning rewards.
- AlignProp (ICLR 2024 / arXiv) — Backpropagates reward gradients through diffusion sampling.
- DPOK (arXiv 2023) — Applies KL-regularized policy optimization to diffusion models.
- D3PO (arXiv 2024) — Optimizes diffusion policies from preference data.
- SPO (CVPR 2025) — Performs step-by-step preference optimization for aesthetic post-training.
- DSPO (ICLR 2025) — Aligns diffusion models using direct score preference optimization.
- RankDPO (ICCV 2025) — Uses ranked preference data for scalable T2I preference optimization.
- CMPO / CaPO (CVPR 2025) — Calibrates multiple preferences for diffusion alignment.
- Diffusion-NPO (ICLR 2026) — Performs negative preference optimization for diffusion alignment.
- BranchGRPO (ICLR 2026) — Uses branch-level GRPO-style optimization for diffusion generation.
- Flow-GRPO (arXiv / venue TBD) — Applies GRPO-like preference learning to flow/diffusion sampling.
- RLAIF for Diffusion (topic / resource) — Uses AI feedback instead of human feedback for diffusion alignment.
- Safe Latent Diffusion (CVPR 2023) — Adds safety guidance during latent diffusion sampling.
- Erasing Concepts from Diffusion Models (ICCV 2023) — Removes undesirable concepts from diffusion weights.
- Ablating Concepts (ICCV 2023) — Ablates target concepts while retaining general model behavior.
- Unified Concept Editing (WACV 2024 / arXiv) — Edits multiple concepts in diffusion models.
- MACE (CVPR 2024 / arXiv) — Scales concept erasure to many concepts.
- Forget-Me-Not (arXiv 2023) — Uses attention control to forget concepts.
- Ring-A-Bell (NeurIPS 2023 workshop / arXiv) — Red-teams concept erasure using adversarial prompts.
- ACE (arXiv / venue TBD) — Studies robust anti-editing concept erasure.
- Editing Massive Concepts (arXiv / venue TBD) — Edits or suppresses many concepts at scale.
- SalUn (ICLR 2024 / arXiv) — Uses saliency-guided unlearning for generative models.
- ESD (arXiv 2023) — Erases stable-diffusion concepts through targeted training.
- ConceptPrune (arXiv / venue TBD) — Removes concepts by pruning or editing model components.
- Responsible Text-to-Image Diffusion (ICML 2026 / project) — Studies controllable and interpretable safe/fair generation.
- T2VSafetyBench methods (arXiv 2024) — Studies safety evaluation and intervention for text-to-video models.
- SafeGen (arXiv / venue TBD) — Improves safety during generative sampling.
- Safety Checker / post-hoc filters (system resource) — Filters generated outputs after sampling.
- NSFW prompt filtering (system resource) — Screens prompts before generation.
- Adversarial prompt defense (topic / resource) — Defends against jailbreak prompts in visual generation.
- Jailbreak-resistant diffusion (topic / resource) — Studies robust safety under prompt attacks.
- Concept restoration after erasure (topic / resource) — Diagnoses benign capability loss after safety editing.
- UniSim (ICLR 2024) — Learns interactive real-world simulators for action-conditioned generation.
- Genie (ICML 2024) — Generates interactive environments from videos.
- GAIA-1 (arXiv / technical report 2023) — Builds a generative world model for autonomous driving.
- WorldDreamer (arXiv 2024) — Generates driving videos with world-model priors.
- DriveDreamer (ECCV 2024 / arXiv) — Generates driving scenarios with structured controls.
- DriveDreamer-2 (arXiv / venue TBD) — Extends driving world generation to longer/higher-quality videos.
- Vista (arXiv / venue TBD) — Studies video world models for controllable environments.
- Pandora (arXiv / venue TBD) — Explores world modeling through video generation.
- Cosmos World Foundation Models (technical report / arXiv) — Studies large-scale world foundation models.
- HunyuanWorld / Hunyuan World (technical report / arXiv) — Generates 3D/world environments using generative world modeling.
- World-consistent Video Diffusion (topic / resource) — Enforces geometry, dynamics, and state consistency in video generation.
- Physics-guided Diffusion (topic / resource) — Injects physical constraints into diffusion sampling or training.
- Simulator-guided Diffusion (topic / resource) — Uses simulators or constraints to steer generation.
- Verifier-guided Generation (topic / resource) — Uses post-hoc or in-loop verifiers to reject inconsistent samples.
- Causal Video Generation (topic / resource) — Studies causal state transitions in generated videos.
- Object-state-change Generation (topic / resource) — Focuses on object state changes and action consequences.
- Embodied Diffusion World Models (topic / resource) — Connects diffusion generation with embodied planning and control.
- Pick-a-Pic / PickScore (NeurIPS 2023) — Pairwise preference data and reward model for T2I outputs.
- ImageReward (NeurIPS 2023) — Learned reward model for human preference evaluation.
- HPSv2 (arXiv 2023) — Human preference benchmark for T2I evaluation.
- HPSv3 (arXiv 2025) — Wide-spectrum preference benchmark and reward model.
- VisionReward (AAAI 2026) — Multi-dimensional image/video preference evaluator.
- MPS evaluation (CVPR 2024) — Evaluates multiple dimensions of human preference.
- Aesthetic score models (metric family) — Score visual aesthetics for generated images.
- LAION aesthetic predictor (dataset/model resource) — Provides aesthetic scores for LAION-like data.
- Six-CD (arXiv / venue TBD) — Evaluates concept removal and benign retention.
- I2P (arXiv 2023) — Prompt benchmark for inappropriate image generation risks.
- Unsafe Diffusion benchmark (resource) — Evaluates unsafe output generation.
- T2VSafetyBench (arXiv 2024) — Safety benchmark for text-to-video generation.
- Concept removal benchmarks (resource) — Measures erasure success and collateral damage.
- Benign retention benchmarks (resource) — Tests whether safe editing harms benign generations.
- Red-teaming prompts (resource) — Stress-tests safety filters and concept removal.
- PhyBench (arXiv 2024) — Static physical commonsense benchmark for T2I.
- VideoPhy (ICLR 2025) — Physical commonsense benchmark for generated videos.
- PhyCoBench (arXiv 2024) — Optical-flow-guided physical coherence benchmark.
- PhyGenBench (arXiv 2024) — Physical-law benchmark for video generation.
- VideoPhy-2 (ICLR 2026) — Action-centric physical commonsense benchmark.
- T2VPhysBench (arXiv / venue TBD) — Tests first-principles physical consistency in T2V.
- T2VWorldBench (arXiv / venue TBD) — Evaluates world knowledge, commonsense, and causal plausibility.
- Physics-IQ (WACV 2026) — Tests physical principles in generative video models.
- PhyWorldBench (arXiv 2025) — Benchmarks physical realism in text-to-video generation.
- VideoVerse (arXiv 2025) — World-model-oriented T2V evaluation.
- PhyEduVideo (WACV 2026) — Physics-education-oriented video benchmark.
- PhyWorld (ICML 2025) — Studies how far video generation is from physical world models.
- OSCBench (arXiv / venue TBD) — Tests object state change and action consequence.
- Morpheus (arXiv / venue TBD) — Evaluates physical reasoning in video generation.
- World-model Video Evaluation (resource) — General benchmarks for video-as-world-model behavior.
- Pick-a-Pic dataset (NeurIPS 2023) — Pairwise human preferences for generated images.
- ImageRewardDB (NeurIPS 2023 resource) — Human preference annotations for reward training.
- HPD / HPSv2 data (arXiv 2023 resource) — Human preference data for T2I evaluation.
- HPDv3 / HPSv3 data (arXiv 2025 resource) — Larger preference dataset for wide-spectrum evaluation.
- MPS preference data (CVPR 2024) — Multi-dimensional preference labels.
- LAION-Aesthetics (dataset resource) — Image-text data filtered by aesthetic scores.
- AVA Aesthetics (CVPR 2012 / dataset) — Aesthetic image-quality annotations.
- I2P prompts (arXiv 2023 resource) — Inappropriate prompt set for safety testing.
- NSFW prompt resources (resource) — Prompts for unsafe content testing.
- Concept erasure prompt sets (resource) — Prompts for target concept removal.
- Physical commonsense prompts (resource) — Prompts testing static physical plausibility.
- Video physical prompts (resource) — Prompts testing dynamic physical plausibility.
- Driving world-model datasets (resource) — Driving data for action-conditioned world models.
- Ego4D (CVPR 2022 / dataset) — Egocentric video data for embodied and action reasoning.
- Something-Something V2 (ICCV 2017 / dataset) — Human-object interaction videos for action/state understanding.
- CLEVRER (ICLR 2020 / dataset) — Synthetic videos for physical and causal reasoning.
- PHYRE (NeurIPS 2019 / benchmark) — Physical reasoning environments.
- IntPhys (arXiv 2018 / dataset) — Intuitive physics video dataset.
- CLEVR (CVPR 2017 / dataset) — Synthetic visual reasoning data.
- Kubric (CVPR 2022) — Synthetic scene/video generator useful for controlled physical diagnostics.
resources/benchmark_coverage.csv: benchmark, dataset, evaluator, and diagnostic-resource coverage map.resources/related_surveys.csv: prior survey positioning.resources/taxonomy_methods.csv: compact mapping from taxonomy nodes to representative methods and resources.resources/selected_bibtex.bib: selected BibTeX entries.
| Label | Meaning |
|---|---|
| P/C | prompt and compositional faithfulness |
| S/E | structural control and edit preservation |
| ID | subject/identity persistence |
| V/T | multi-view, temporal, or narrative coherence |
| N/S | preference, safety, or value alignment |
| P/W | physical, causal, or world-grounded plausibility |
Coverage values: H = direct/dedicated coverage, M = partial/adaptable coverage, L = indirect/low coverage.
Contributions are welcome. Please include the resource title, BibTeX key, venue/year, official paper URL, official project/code URL if available, resource type, modality, primary consistency relation, coverage values, and a short diagnostic-use/blind-spot description.
Use the issue template: Add or correct a resource.
Some 2025--2026 papers may initially appear as arXiv or project-page entries before official proceedings metadata is stable. When official BibTeX becomes available, please update resources/selected_bibtex.bib and any corresponding table entries.
When adding links, prefer official repositories or project pages over unofficial reimplementations. If no stable official repository exists, leave the code URL blank in the CSV table. Entries with arXiv-search links are included as expansion placeholders and should be replaced by stable paper/project links when available.
If this survey or resource list is useful, please cite:
@article{yan2026consistency,
title = {Consistency in Diffusion-Based Visual Generation: A Survey},
author = {Yan, Song and Zhai, Wei and Wang, Chenfeng and Li, Ruixuan and Yang, Zhangping and Cai, Yancheng and Zhang, Tao and Wang, Ling and Lan, Yunwei and He, Yujie and Cao, Yang and Li, Min and Zha, Zheng-Jun},
year = {2026},
doi = {10.20944/preprints202606.0870.v1},
url = {https://www.preprints.org/manuscript/202606.0870/v1},
note = {Preprints}
}This repository is released under the MIT License.






