# A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction
It is widely acknowledged that unifying understanding, generation, and editing has become an inevitable trend, and the autoregressive paradigm is a natural, representative choice for achieving it. To advance this direction and establish a benchmark, we introduce Wallaroo, a straightforward autoregressive baseline that uses vanilla next-token prediction to unify multi-modal understanding, image generation, and image editing in a single model. Wallaroo also supports multi-resolution image input and output, as well as bilingual prompts in both Chinese and English. In a nutshell, Wallaroo serves as a comprehensive baseline for comparison.
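The unified objective can be illustrated with a minimal, self-contained sketch. The vocabulary sizes and the uniform stand-in model below are assumptions for illustration only, not Wallaroo's actual configuration: the idea is that text tokens and discrete image codes share one index space, so a single next-token cross-entropy covers understanding, generation, and editing alike.

```python
import math

TEXT_VOCAB = 1000          # hypothetical text vocabulary size
IMAGE_CODEBOOK = 16384     # LLamaGen-style codebook size (assumption)
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK

def to_unified(token_id, modality):
    """Offset image codes past the text vocabulary so both share one index space."""
    return token_id if modality == 'text' else TEXT_VOCAB + token_id

# An interleaved sequence: a short text prompt followed by image tokens.
sequence = ([to_unified(t, 'text') for t in [5, 17, 42]]
            + [to_unified(c, 'image') for c in [7, 301]])

def avg_nll(next_token_probs, seq):
    """Average next-token negative log-likelihood; the loss has the same
    form at text positions and image positions."""
    total = 0.0
    for i in range(1, len(seq)):
        probs = next_token_probs(seq[:i])
        total += -math.log(probs[seq[i]])
    return total / (len(seq) - 1)

# A uniform distribution stands in for the transformer's softmax output.
uniform = lambda prefix: [1.0 / UNIFIED_VOCAB] * UNIFIED_VOCAB
loss = avg_nll(uniform, sequence)  # equals math.log(UNIFIED_VOCAB)
```

Because every position is scored by the same softmax over the unified vocabulary, no modality-specific head or loss is needed.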
- Install 64-bit Python 3.10.14 and PyTorch 2.4.0+cu121.
- Install Python libraries with:

  ```shell
  pip3 install -r requirements.txt
  ```

- Download the Wallaroo 7B checkpoint.
- Download the LLamaGen tokenizer.
- Download the VLMEvalKit
- Add the following code at line 313 of vlm/qwen2_vl/model.py to allow loading the Wallaroo 7B checkpoint:
```python
else:
    self.model = MODEL_CLS.from_pretrained(
        model_path, torch_dtype='auto', device_map='auto',
        attn_implementation='flash_attention_2'
    )
    load_wallaroo = True
    if load_wallaroo:
        load_from = "path/to/checkpoint"  # set to the Wallaroo 7B checkpoint path
        resume_checkpoint = torch.load(load_from, map_location="cpu")
        # Remap Wallaroo checkpoint keys to the Qwen2-VL module layout.
        new_dict = {}
        for key, value in resume_checkpoint['state_dict'].items():
            if 'visual' in key:
                # Vision tower: only the 'wallaroo' prefix changes.
                new_dict[key.replace('wallaroo', 'model')] = value
            elif 'model' in key:
                # Language model: rename 'model' before swapping the prefix,
                # so the freshly inserted 'model' is not itself renamed.
                new_dict[key.replace('model', 'language_model').replace('wallaroo', 'model')] = value
            elif 'lm_head' in key:
                new_dict['lm_head.weight'] = value
        # m/u hold the missing and unexpected keys, respectively.
        m, u = self.model.load_state_dict(new_dict, strict=False)
        del resume_checkpoint
        self.model.eval()
```
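As a sanity check, the key remapping above can be exercised in isolation. The example keys below are hypothetical illustrations of the naming scheme, not keys guaranteed to appear in the actual checkpoint:

```python
def remap_key(key):
    """Reproduce the checkpoint key remapping from the snippet above."""
    if 'visual' in key:
        return key.replace('wallaroo', 'model')
    elif 'model' in key:
        # 'model' must be renamed before the 'wallaroo' prefix is swapped,
        # otherwise the newly inserted 'model' prefix would also be renamed.
        return key.replace('model', 'language_model').replace('wallaroo', 'model')
    elif 'lm_head' in key:
        return 'lm_head.weight'
    return key

# Hypothetical example keys (for illustration only):
print(remap_key('wallaroo.visual.patch_embed.proj.weight'))
# -> model.visual.patch_embed.proj.weight
print(remap_key('wallaroo.model.layers.0.self_attn.q_proj.weight'))
# -> model.language_model.layers.0.self_attn.q_proj.weight
```

Loading with `strict=False` is what tolerates any keys the remapping leaves unmatched.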
- Follow the instructions in VLMEvalKit to evaluate multi-modal understanding.

To evaluate text-to-image generation:

```shell
cd scripts/evaluate
sh test_ar_t2i.sh
```

To evaluate image editing:

```shell
cd scripts/evaluate
sh test_ar_i2i.sh
```
See examples/wallaroo/ar_wallaroo_7. This folder contains the config YAML files and the corresponding training Python files for the different stages. Detailed commands can be found in train_script.sh.
```
@article{Zhu2026Simple,
  title   = {A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction},
  author  = {Jie Zhu and Hanghang Ma and Jia Wang and Yayong Guan and Yanbing Zeng and Lishuai Gao and Junqiang Wu and Jie Hu and Leye Wang},
  journal = {arXiv preprint arXiv:2603.04980},
  year    = {2026}
}
```
This work is built on Qwen2.5 VL, Show-o, and LLamaGen. Thanks for their wonderful open-source work.
