# WeightFormer: Linear-Time Global Visual Modeling without Explicit Attention

Ruize He\*, Dongchen Han\*, Gao Huang ✉️

Tsinghua University
Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that self-attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level global sequence modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling.
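To make the reframing concrete, here is one standard way to read single-head self-attention as a two-layer MLP whose weights are predicted from the input (a sketch of the idea, not necessarily the paper's exact derivation):

$$
\mathrm{Attn}(X)_i \;=\; \mathrm{softmax}\!\Big(\tfrac{x_i W_Q K^\top}{\sqrt{d}}\Big)\, V
\;=\; \sigma\big(x_i\, W_1(X)\big)\, W_2(X),
\qquad K = X W_K,\; V = X W_V,
$$

where $W_1(X) = W_Q K^\top / \sqrt{d}$ and $W_2(X) = V$ are input-dependent weight matrices and $\sigma$ is the softmax nonlinearity: a per-token MLP whose $N$-dimensional hidden layer is parameterized by the whole sequence, which is exactly what makes the computation quadratic. A linear-complexity alternative instead predicts compact MLP weights from a global summary. Below is a minimal, illustrative PyTorch sketch of that idea; the layer sizes, the mean-pooled summary, and the `DynamicMLP` name are our own assumptions for illustration, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class DynamicMLP(nn.Module):
    """Illustrative dynamic-parameterization layer (NOT the actual
    WeightFormer block): a token-wise two-layer MLP whose weights are
    predicted from a mean-pooled summary of the sequence, so global
    context reaches every token at O(N) cost."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.dim, self.hidden = dim, hidden
        # Hypothetical weight-prediction head: global summary -> MLP weights.
        self.predict = nn.Linear(dim, 2 * dim * hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim); summarize the sequence by mean pooling.
        ctx = x.mean(dim=1)                                # (B, dim)
        w1, w2 = self.predict(ctx).split(self.dim * self.hidden, dim=-1)
        w1 = w1.view(-1, self.dim, self.hidden)            # (B, dim, hidden)
        w2 = w2.view(-1, self.hidden, self.dim)            # (B, hidden, dim)
        # Apply the per-sample MLP token-wise; the dynamic weights act as
        # a compressed representation of the global context.
        h = torch.relu(torch.einsum('bnd,bdh->bnh', x, w1))
        return torch.einsum('bnh,bhd->bnd', h, w2)

x = torch.randn(2, 196, 64)                                # 14x14 patch tokens
print(DynamicMLP(64, 128)(x).shape)                        # torch.Size([2, 196, 64])
```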
## Setup

Create the environment with uv:

```bash
uv sync
source .venv/bin/activate
```

or, alternatively, with conda:

```bash
conda create -n weightformer python=3.12 -y
conda activate weightformer
pip install -r requirements.txt
```

## Data preparation

Prepare ImageNet-1K in the standard format:
```
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class2
    │   ├── img2.jpeg
    │   └── ...
    └── ...
```
Then update the dataset path in `cfg/*.yaml`.
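The layout above is the standard class-subdirectory format, so a quick sanity check with torchvision's `ImageFolder` can confirm the data is readable before launching a run (the path shown is an example; point it at your copy):

```python
from torchvision import datasets, transforms

# Sanity-check the directory layout; ImageFolder expects exactly the
# class-subdirectory structure shown above.
ds = datasets.ImageFolder(
    "imagenet/train",  # example path; match the one set in cfg/*.yaml
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]),
)
print(f"{len(ds)} images across {len(ds.classes)} classes")
```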
## Training from scratch

```bash
torchrun --nproc_per_node=8 main.py --cfg cfg/wfm_t.yaml
```

## Evaluation

```bash
torchrun --nproc_per_node=8 main.py --eval --cfg cfg/wfm_t.yaml --resume wfm-t.pth
```

Replace `wfm_t.yaml` with the desired config for the T, S, or B variants.
## Image generation

To use WeightFormer in DiT-style image generation, replace DiT's `models.py` with `model/wfm_dit.py`. Follow the setup instructions from fast-DiT (recommended) or DiT for training and evaluation.
## Acknowledgement

This project is built upon DeiT, Swin Transformer, and DiT.
## Citation

```bibtex
@article{he2026weightformer,
  title={Linear-Time Global Visual Modeling without Explicit Attention},
  author={He, Ruize and Han, Dongchen and Huang, Gao},
  journal={arXiv preprint arXiv:2605.01711},
  year={2026}
}
```