MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

Breaking the Visual Tokenization Trade-off — Generation ∧ Understanding, Not Generation ∨ Understanding.

Paper · Project Page · Model Zoo · Quick Start

TL;DR: Unified visual tokenizers suffer from Manifold Misalignment: pixel gradients and semantic gradients destructively interfere. MUSE resolves this via Topological Orthogonality, decoupling structure into attention topology (W_Q, W_K) and semantics into feature values (W_V). Result: gFID 3.08 (tying VTP, the best prior result in the tokenizer comparison) and Linear Probe 85.2% (surpassing its own teacher, InternViT-300M, at 82.5%).

Highlights

🎯 Mutual Reinforcement, Not Trade-off

Unlike prior unified tokenizers trapped in a zero-sum game, MUSE achieves genuine synergy — structurally aligned reconstruction actively refines semantic perception.

| Metric | MUSE | Reference |
| --- | --- | --- |
| gFID ↓ | 3.08 | 3.08 (VTP) |
| Zero-Shot Acc ↑ | 77.1% | 75.7% (UniLIP) |
| Linear Probe ↑ | 85.2% | 82.5% (InternViT-300M teacher) |
| Seg. mIoU ↑ | 46.5 | 36.8 (UniLIP) |
| MMVP ↑ | 74.8 | 72.7 (UniLIP) |

🧠 Key Insight: Gradient Orthogonality

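Semantic gradients naturally concentrate in the value projection W_V, while structural gradients concentrate in the query and key projections W_Q and W_K. MUSE builds this separation into the architecture, so the two objectives no longer interfere destructively.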

Method

Manifold Misalignment & Topological Orthogonality

The core challenge: pixel reconstruction wants to unfold the latent manifold for detail, while semantic alignment wants to collapse it for invariance. Naively combining them causes destructive gradient interference.
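One common way to make "destructive interference" concrete is the cosine similarity between the two loss gradients on the same shared parameters: a negative value means that, to first order, a step that lowers one loss raises the other. The sketch below is a generic diagnostic, not code from this repository; `gradient_cosine` and the toy vectors are hypothetical.

```python
import math

def gradient_cosine(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero gradient neither helps nor conflicts
    return dot / (norm_a * norm_b)

# Toy gradients on the same shared parameters (made-up numbers).
g_pixel = [1.0, 2.0, -1.0]
g_semantic = [-1.0, -2.0, 0.5]

cos = gradient_cosine(g_pixel, g_semantic)
conflict = cos < 0.0  # negative cosine: the two objectives pull apart
print(f"cosine={cos:.3f} conflict={conflict}")
```

In a real training run the two vectors would be the flattened gradients of the reconstruction and semantic losses with respect to the same weights, logged every few hundred steps.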

MUSE resolves this via the Synergistic Block, which physically decouples the two objectives:

Topology Stream (W_Q, W_K) → structural gradients refine the attention routing graph
Semantic Stream (W_V) → semantic gradients update feature content values
Stop-Gradient (//) isolates the semantic branch from reconstruction gradients

This transforms interference into mutual reinforcement — a single architecture, two orthogonal optimization subspaces.
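The split rests on a property of standard self-attention: the routing matrix softmax(QK^T / sqrt(d)) depends only on W_Q and W_K, while W_V only affects the content that gets routed. The pure-Python sketch below illustrates that property on a toy single-head layer. It is not the repository's `SynergisticBlock`, it does not model the stop-gradient, and all function names and numbers are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(x, w_q, w_k, w_v):
    """Return (routing, output) for single-head self-attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(w_q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    routing = softmax_rows(scores)        # depends on w_q and w_k only
    return routing, matmul(routing, v)    # w_v enters only here

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy tokens, dimension 2
w_q = [[0.5, -0.2], [0.1, 0.3]]
w_k = [[0.4, 0.0], [-0.3, 0.2]]
w_v = [[1.0, 0.0], [0.0, 1.0]]
w_v_new = [[2.0, -1.0], [0.5, 0.0]]

routing_a, out_a = attention(x, w_q, w_k, w_v)
routing_b, out_b = attention(x, w_q, w_k, w_v_new)
print("routing unchanged:", routing_a == routing_b)
print("content changed:", out_a != out_b)
```

Changing W_V leaves the routing graph untouched, which is why updates driven by a semantic loss on W_V cannot, by themselves, disturb the attention topology learned through W_Q and W_K.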

Three-Stage Progressive Training

MUSE follows an information-theoretic curriculum: structure first, then semantics, then synergy.

| Stage | Name | What Learns | What's Frozen | Key Objective |
| --- | --- | --- | --- | --- |
| 1 | Topology Warmup | Connector (`W_Q`, `W_K`) | Encoder + Semantic Proj. | `L_topo`: align attention topology with DINO teacher |
| 2 | Semantic Injection | Connector (`W_V`) + Proj. | Encoder | `L_ITC`: anchor feature values to CLIP manifold |
| 3 | Synergistic Tuning | Full model | DINO teacher only | All losses: `L_rec` + `L_topo` + `L_ITC` + `L_GAN` |
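The schedule can be read as a small lookup table: which parameter groups update in each stage, and which losses are summed. The sketch below encodes the table under three assumptions that the README does not state: the group names are hypothetical, any group not listed under "What Learns" is treated as frozen in Stages 1 and 2, and losses are combined with unit weights. Only the key objective of each stage is modeled.

```python
# Hypothetical parameter-group names; the repository's module names may differ.
ALWAYS_FROZEN = {"dino_teacher"}

STAGES = {
    1: {"trainable": {"connector.w_q", "connector.w_k"}, "losses": ("L_topo",)},
    2: {"trainable": {"connector.w_v", "semantic_proj"}, "losses": ("L_ITC",)},
    3: {"trainable": "all", "losses": ("L_rec", "L_topo", "L_ITC", "L_GAN")},
}

def is_trainable(group, stage):
    """Return True if a parameter group receives updates in the given stage."""
    if group in ALWAYS_FROZEN:
        return False
    trainable = STAGES[stage]["trainable"]
    return trainable == "all" or group in trainable

def total_loss(loss_values, stage, weights=None):
    """Weighted sum of the losses listed for a stage (unit weights assumed)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * loss_values[name]
               for name in STAGES[stage]["losses"])

groups = ["encoder", "connector.w_q", "connector.w_k", "connector.w_v",
          "semantic_proj", "dino_teacher"]
for stage in (1, 2, 3):
    print(stage, [g for g in groups if is_trainable(g, stage)])
```

In a PyTorch training script, the result of `is_trainable` would typically be written to each parameter's `requires_grad` flag before the optimizer is built.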

Results

Tokenizer Comparison

MUSE breaks the generative-semantic trade-off, establishing a new Pareto frontier:

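| Method | Type | rFID ↓ | gFID ↓ | ZS Acc ↑ | LP Acc ↑ | mIoU ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| VQGAN | Gen. | 1.28 | 5.20 | — | — | 15.4 |
| VA-VAE | Gen. | 0.46 | 3.92 | — | — | 18.5 |
| UniLIP | Unified | 0.74 | 3.62 | 75.7 | 83.6 | 36.8 |
| VTP | Unified | 0.73 | 3.08 | 71.2 | 81.4 | 32.1 |
| MUSE | Unified | 0.62 | 3.08 | 77.1 | 85.2 | 46.5 |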
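The Pareto claim can be checked directly from the table. The sketch below copies the reported numbers and lists the non-dominated rows in two views: all five methods on rFID and gFID, and the three unified tokenizers on all five metrics. The helper names are hypothetical, and higher-is-better metrics are negated so that lower is better everywhere.

```python
# (rFID, gFID) for every method in the table; both are lower-is-better.
GENERATION = {
    "VQGAN": (1.28, 5.20),
    "VA-VAE": (0.46, 3.92),
    "UniLIP": (0.74, 3.62),
    "VTP": (0.73, 3.08),
    "MUSE": (0.62, 3.08),
}

# Unified tokenizers on all five columns; ZS, LP and mIoU are negated.
UNIFIED = {
    "UniLIP": (0.74, 3.62, -75.7, -83.6, -36.8),
    "VTP": (0.73, 3.08, -71.2, -81.4, -32.1),
    "MUSE": (0.62, 3.08, -77.1, -85.2, -46.5),
}

def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(scores):
    """Names of the rows that no other row dominates, sorted alphabetically."""
    return sorted(
        name for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    )

print(pareto_front(GENERATION))
print(pareto_front(UNIFIED))
```

On rFID and gFID alone, VA-VAE stays on the front next to MUSE because its rFID of 0.46 is the lowest reported. Among unified tokenizers, MUSE is the only non-dominated row across all five metrics.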

Unified Multimodal Model (UMM)

When integrated into a full UMM pipeline, MUSE enables high-quality generation and editing without compromising perception:

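| Model | MMB ↑ | MMVP ↑ | GenEval ↑ | WISE ↑ | Edit Bkg. ↑ |
| --- | --- | --- | --- | --- | --- |
| InternVL3 (specialist) | 78.2 | 72.7 | — | — | — |
| FLUX.1-dev (specialist) | — | — | 0.76 | 0.50 | — |
| UniLIP | 72.6 | 72.7 | 0.78 | 0.62 | 0.79 |
| MUSE | 73.4 | 74.8 | 0.82 | 0.65 | 0.87 |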

Qualitative Results

Attention Maps — MUSE vs. Baselines

MUSE faithfully mirrors the precise, ground-truth-like attention patterns of the DINO teacher, while VQGAN scatters across textures and UniLIP produces overly diffuse maps.
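A visual comparison like this can be backed by a number. One simple option, shown below as an assumption rather than as the metric or the `L_topo` loss used in the paper, is the mean row-wise KL divergence from the teacher's attention map to the student's. Each row is one query token's distribution over key tokens; the function names and toy maps are hypothetical.

```python
import math

def row_kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def attention_divergence(teacher, student):
    """Mean row-wise KL(teacher || student) between two attention maps."""
    if len(teacher) != len(student):
        raise ValueError("attention maps must have the same number of rows")
    return sum(row_kl(t, s) for t, s in zip(teacher, student)) / len(teacher)

# Toy 2x3 attention maps; every row sums to 1.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
focused = [[0.6, 0.3, 0.1], [0.1, 0.7, 0.2]]
diffuse = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

print(f"focused: {attention_divergence(teacher, focused):.3f}")
print(f"diffuse: {attention_divergence(teacher, diffuse):.3f}")
```

A map that tracks the teacher scores close to zero, while an overly diffuse map scores noticeably higher.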

Text-to-Image Generation

Complex attribute binding, accurate spatial reasoning, and realistic textures across diverse prompts.

Image Editing

Localized semantic modifications while strictly maintaining global layout and background consistency.

Model Zoo

| Model | Backbone | Params | gFID ↓ | LP Acc ↑ | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| MUSE-1B | InternVL3-1B + SANA-0.6B | 496M | 3.08 | 85.2 | Hugging Face |
| MUSE-3B | InternVL3-2B + SANA-1.6B | — | — | — | Coming Soon |

Installation

# Clone the repository
git clone https://github.com/your-org/MUSE.git
cd MUSE

# Create conda environment (recommended)
conda create -n muse python=3.10 -y
conda activate muse

# Install dependencies
pip install -e .
# Or:
pip install -r requirements.txt

Prerequisites

Download the following pretrained models:

| Model | Role | Source |
| --- | --- | --- |
| InternVL3-1B / 2B | Vision backbone encoder | Hugging Face |
| DC-AE (SANA) | Pixel decoder | Hugging Face |
| DINOv3-ViT-H+ | Topology teacher (Stages 1 and 3) | Custom checkpoint |
| CLIP-ViT-L-14 | Text encoder for ITC (Stages 2–3) | OpenCLIP |

Quick Start

Training the Tokenizer

export CHECKPOINT_DIR=/path/to/pretrained/models
export DATA_DIR=/path/to/datasets

# Stage 1: Topology Warmup — align attention with DINO teacher
bash tools/train_stage1.sh muse_1b

# Stage 2: Semantic Injection — anchor features to CLIP manifold
bash tools/train_stage2.sh muse_1b

# Stage 3: Synergistic Tuning — full co-optimization
bash tools/train_stage3.sh muse_1b

Evaluation

# Reconstruction metrics (rFID, PSNR, SSIM, LPIPS)
bash tools/evaluate.sh \
    configs/muse_1b/stage3.yaml \
    /path/to/checkpoint.bin

# Linear probe on ImageNet-1K
python scripts/linear_probe.py \
    --config configs/muse_1b/stage3.yaml \
    --checkpoint /path/to/checkpoint.bin

# Zero-shot ImageNet classification
python scripts/zero_shot.py \
    --config configs/muse_1b/stage3.yaml \
    --checkpoint /path/to/checkpoint.bin

# ADE20K segmentation probe (mIoU)
bash tools/segment_probe.sh muse \
    --config configs/muse_1b/stage3.yaml \
    --checkpoint /path/to/checkpoint.bin \
    --train-url "/path/to/ade20k-train-{000000..000020}.tar" \
    --val-url "/path/to/ade20k-validation-{000000..000002}.tar"
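The `{000000..000020}` notation in the shard URLs is a WebDataset-style brace range. Because the arguments are double-quoted, the shell passes them through literally and the data loader expands them. The sketch below is a stdlib stand-in that shows what the expansion produces; `expand_shards` is a hypothetical helper and handles only a single zero-padded numeric range, not the full brace syntax.

```python
import re

_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")

def expand_shards(pattern):
    """Expand the first `{start..end}` numeric range in a shard pattern."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep the zero padding, e.g. 000007
    return [
        pattern[:match.start()] + str(i).zfill(width) + pattern[match.end():]
        for i in range(int(start), int(end) + 1)
    ]

train = expand_shards("/path/to/ade20k-train-{000000..000020}.tar")
val = expand_shards("/path/to/ade20k-validation-{000000..000002}.tar")
print(len(train), train[0], train[-1])
print(len(val), val[-1])
```

The range is inclusive on both ends, so the training pattern above names 21 shards and the validation pattern names 3.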

Inference

# Single image reconstruction
python scripts/inference.py \
    --config configs/muse_1b/stage3.yaml \
    --checkpoint /path/to/checkpoint.bin \
    --image_path /path/to/image.jpg \
    --output_dir outputs/inference

# Attention map visualization
bash tools/visualize_attention.sh single \
    --config configs/muse_1b/stage3.yaml \
    --checkpoint /path/to/checkpoint.bin \
    --image /path/to/image.jpg \
    --output outputs/attention_viz
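For a quick sanity check on a single reconstruction, without running the full evaluation script, PSNR can be computed by hand. The sketch below uses the standard definition, 10 * log10(MAX^2 / MSE), on flat lists of 8-bit pixel values. It is not the repository's evaluator; `psnr` and the toy pixel values are hypothetical, and loading the image into a list is left out.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

original = [0, 64, 128, 255]        # toy 8-bit pixel values
reconstructed = [2, 60, 130, 250]

print(f"PSNR: {psnr(original, reconstructed):.2f} dB")
```

Higher is better, and identical inputs give infinity.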

Data Format

MUSE uses WebDataset (.tar) format for scalable data loading:

shard-000000.tar
├── 00000.jpg          # Image
├── 00000.txt          # Caption (Stages 2–3)
├── 00001.jpg
├── 00001.txt
└── ...

For segmentation probing, each shard additionally contains *.seg.png (ADE20K labels).
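The layout follows the usual WebDataset convention: files that share the part of the name before the first dot belong to one sample, and everything after that dot is the field name (`jpg`, `txt`, `seg.png`). The stdlib-only sketch below builds a tiny in-memory shard and groups it that way. It illustrates the file layout only, is not the repository's `dataloader.py`, and uses hypothetical helper names and placeholder bytes.

```python
import io
import tarfile
from collections import defaultdict

def make_shard(samples):
    """Build an in-memory .tar shard from {key: {extension: bytes}}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, files in samples.items():
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf

def read_samples(fileobj):
    """Group shard members by the part of the name before the first dot."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)

shard = make_shard({
    "00000": {"jpg": b"<jpeg bytes>", "txt": b"a red bicycle"},
    "00001": {"jpg": b"<jpeg bytes>", "txt": b"two cats", "seg.png": b"<png bytes>"},
})
samples = read_samples(shard)
print(sorted(samples))
print(sorted(samples["00001"]))
```

Keeping the caption and label files next to the image under the same key is what lets a streaming loader yield complete samples without an index file.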

Project Structure

MUSE/
├── muse/                              # Core library
│   ├── models/
│   │   ├── muse_vit.py                #   MUSE_ViT + SynergisticBlock
│   │   ├── base_model.py              #   Save/load utilities
│   │   ├── ema_model.py               #   EMA model wrapper
│   │   ├── discriminator.py           #   PatchGAN discriminator
│   │   ├── lpips.py                   #   LPIPS perceptual metric
│   │   └── perceptual_loss.py         #   LPIPS + ConvNeXt-S perceptual
│   ├── losses/
│   │   └── muse_loss.py               #   Pixel + Perceptual + GAN + Topo + ITC
│   ├── data/
│   │   └── dataloader.py              #   WebDataset loader
│   ├── evaluation/
│   │   ├── evaluator.py               #   rFID / PSNR / SSIM / LPIPS
│   │   └── inception.py               #   InceptionV3 for FID
│   └── utils/
│       ├── viz_utils.py               #   Attention visualization pipeline
│       ├── train_utils.py             #   Training helpers
│       ├── lr_schedulers.py           #   LR schedule (cosine / constant)
│       └── logger.py                  #   Logging setup
├── scripts/                           # Entry-point scripts
│   ├── train_stage{1,2,3}.py          #   Three-stage training
│   ├── evaluate.py                    #   Batch reconstruction eval
│   ├── inference.py                   #   Single-image reconstruction
│   ├── linear_probe.py                #   ImageNet linear probe
│   ├── zero_shot.py                   #   Zero-shot classification
│   ├── zero_shot_meta.py              #   ImageNet class names + templates
│   ├── segment_probe.py               #   ADE20K segmentation probe
│   └── visualize_attention.py         #   Attention map visualization
├── configs/
│   ├── muse_1b/                       #   MUSE-1B configs (stage1–3)
│   └── muse_3b/                       #   MUSE-3B configs (stage1–3)
├── tools/                             # Shell launch scripts
│   ├── train_stage{1,2,3}.sh
│   ├── evaluate.sh
│   ├── visualize_attention.sh
│   └── segment_probe.sh
├── asst/                              # Paper figures & assets
├── requirements.txt
├── setup.py
└── LICENSE                            # Apache 2.0

Citation

If you find MUSE useful in your research, please consider citing:

@inproceedings{muse2026,
  title     = {MUSE: Resolving Manifold Misalignment in Visual Tokenization
               via Topological Orthogonality},
  author    = {Panqi Yang and Haodong Jing and Jiahao Chao and Tingyan Xiang and Li Lin and Yao Hu and Yang Luo and Yongqiang Ma},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}

Acknowledgements

MUSE builds upon several excellent open-source projects:

InternVL3 — Vision backbone
SANA / DC-AE — Pixel decoder
DINOv3 — Structural topology teacher
OpenCLIP — Text encoder for semantic anchoring

License

This project is licensed under the Apache License 2.0.
