MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

This repository accompanies the paper MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration (arXiv:2603.28254) and includes the implementations, training scripts, and reproduction entry points for MuonEq across both GPU and Ascend NPU experiments.

The repository currently has three main parts:

llm-opt-baseline-gpu/cifar10: quick optimizer comparisons on CIFAR-10
llm-opt-baseline-gpu/llm-baselines: GPU-side Transformer / LLM baselines with MuonEq
llm-opt-baseline-npu/llama2_pretrain: Ascend NPU pretraining experiments

This is not a new training framework built from scratch. llm-opt-baseline-gpu/llm-baselines and llm-opt-baseline-npu/llama2_pretrain are experimental directories derived from public codebases and extended locally. We keep most of their original structure because it is the most practical setup for reproduction and follow-up work.

Overview

MuonEq adds a lightweight equilibration / normalization step before Muon orthogonalization, so that the update matrix is better conditioned when it enters the zeropower / Newton-Schulz iteration. This repository is the experimental codebase used for the paper rather than a single unified training framework.
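To make the idea concrete, here is a minimal NumPy sketch of "equilibrate, then orthogonalize". It is an illustration under assumptions, not the repository's implementation: `equilibrate_rows` assumes the `"row"` mode simply rescales each row to unit L2 norm (the paper's exact scaling may differ), and `newton_schulz` uses the quintic coefficients from the public Muon reference implementation. Both function names are hypothetical.

```python
import numpy as np

def equilibrate_rows(G, eps=1e-7):
    """Rescale each row of G to unit L2 norm (ASSUMED form of the "row" mode)."""
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    return G / (row_norms + eps)

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with Muon's quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315       # coefficients from the public Muon code
    X = G / (np.linalg.norm(G) + eps)        # Frobenius normalization bounds singular values by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # iterate on the wide orientation (smaller Gram matrix)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
# A badly row-scaled "update": row magnitudes differ by orders of magnitude.
G = rng.standard_normal((4, 6)) * np.array([[1e-3], [1.0], [10.0], [100.0]])

G_eq = equilibrate_rows(G)
update = newton_schulz(G_eq, steps=5)

print("cond before / after equilibration:", np.linalg.cond(G), np.linalg.cond(G_eq))
print("singular values of the update:", np.linalg.svd(update, compute_uv=False))
```

With a fixed, small number of Newton-Schulz steps, very small singular values may not be pushed close to 1, which is why conditioning the input first can matter.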

If you just want to decide where to start, use the table below:

| Goal | Recommended Entry Point | Notes |
| --- | --- | --- |
| Run the smallest possible experiment | `llm-opt-baseline-gpu/cifar10` | Single-machine setup, good for checking `MuonEq` behavior and command flow |
| Run GPU LLM baselines | `llm-opt-baseline-gpu/llm-baselines` | Includes `MuonEq` and related optimizer variants |
| Reproduce Ascend NPU pretraining experiments | `llm-opt-baseline-npu/llama2_pretrain` | Includes multi-node scripts and cluster-specific arguments |

Repository Layout

llm-opt-baseline-gpu/
  cifar10/
  llm-baselines/

llm-opt-baseline-npu/
  llama2_pretrain/

Quick Start

1. CIFAR-10

If you want the fastest way to sanity-check MuonEq, start with CIFAR-10:

cd llm-opt-baseline-gpu
python cifar10/compare_resnet_optimizers.py --epochs 1 --num-runs 1 --num-workers 0

In this branch, the Muon learning-rate scaling follows Keller Jordan's implementation in cifar10-airbench.

For more commands and argument details, see llm-opt-baseline-gpu/cifar10/README.md.

2. GPU LLM Baselines

llm-opt-baseline-gpu/llm-baselines is used for GPU-side Transformer / LLM baseline runs, including MuonEq and related optimizer variants.

In this branch, the Muon learning-rate scaling follows the RMS-matched strategy used in MoonshotAI's Moonlight.
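As a rough illustration of RMS matching: an orthogonalized update of shape `[A, B]` has RMS of about `1 / sqrt(max(A, B))`, so Moonlight rescales the learning rate by `0.2 * sqrt(max(A, B))` to bring the update RMS to roughly 0.2, in line with typical AdamW updates. The sketch below encodes that rule. The helper name `rms_matched_lr` is hypothetical, and whether this branch uses the same 0.2 constant is an assumption; check `llm-opt-baseline-gpu/llm-baselines/src/optim/muoneq.py` for the actual behavior.

```python
import math

def rms_matched_lr(lr, shape, target_rms=0.2):
    """Scale lr for a 2D weight of the given shape, following Moonlight's published rule.

    target_rms=0.2 is Moonlight's constant; its use in this repository is ASSUMED.
    """
    rows, cols = shape[0], shape[1]
    return lr * target_rms * math.sqrt(max(rows, cols))

# Example: a 1024 x 4096 projection with a base learning rate of 1e-3.
print(rms_matched_lr(1e-3, (1024, 4096)))
```

One consequence of this rule is that the same base `lr` and weight decay can be shared between the Muon-updated matrices and the AdamW-updated parameters.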

Install dependencies:

cd llm-opt-baseline-gpu/llm-baselines
pip install -r requirements.txt

Run a basic training job:

python ./src/main.py --config_format base

For MuonEq sweep scripts, see llm-opt-baseline-gpu/llm-baselines/scripts/optimizers_compare/readme.md.

3. NPU Pretraining

llm-opt-baseline-npu/llama2_pretrain is used for Ascend NPU pretraining experiments. The main paper results use the cosine scheduler; the scripts also retain wsd (warmup-stable-decay) support for follow-up runs.
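For readers unfamiliar with the two schedule shapes, the sketch below shows generic cosine and warmup-stable-decay learning-rate functions. It is not taken from the repository's scripts: the function names, the linear warmup, the linear decay phase, and every default value are assumptions chosen for illustration.

```python
import math

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup, then cosine decay from peak_lr to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def wsd_lr(step, total_steps, peak_lr, warmup_steps=0, decay_steps=0, min_lr=0.0):
    """Linear warmup, constant plateau, then linear decay over the last decay_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * frac

# Compare the two shapes at a few points of a 100-step run.
for s in (0, 10, 50, 90, 100):
    print(s, cosine_lr(s, 100, 1.0, warmup_steps=10),
          wsd_lr(s, 100, 1.0, warmup_steps=10, decay_steps=20))
```

The practical difference is that a WSD run can be extended or branched from the plateau, while a cosine run commits to its total step count up front.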

Install dependencies:

cd llm-opt-baseline-npu/llama2_pretrain
pip install -r requirements.txt

Before running, you need at least:

C4_DATA_DIR
TOKENIZER_PATH
--nodes or MULTI_NODE_HOSTS for multi-node runs
If you pass short node IDs instead of full IPs in --nodes, you also need to set MULTI_NODE_HOST_PREFIX or --host-prefix to match your network. The default prefix in the scripts is 10.0.0.
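The requirements above can be checked before launching a job. The following is a hypothetical pre-flight helper, not part of the repository: `missing_env` and `expand_nodes` are illustrative names, and the assumption that a short node ID is joined to the prefix with a dot (so `131` with prefix `10.0.1` becomes `10.0.1.131`) should be verified against the scripts themselves.

```python
import os

REQUIRED_ENV = ("C4_DATA_DIR", "TOKENIZER_PATH")

def missing_env(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]

def expand_nodes(nodes, host_prefix="10.0.0"):
    """Expand a comma-separated node list into full addresses.

    ASSUMPTION: entries without a dot are short IDs and are joined to
    host_prefix with a dot; entries containing a dot are kept as-is.
    """
    hosts = []
    for item in nodes.split(","):
        item = item.strip()
        if not item:
            continue
        hosts.append(item if "." in item else f"{host_prefix}.{item}")
    return hosts

print(missing_env({"C4_DATA_DIR": "/path/to/c4"}))
print(expand_nodes("131,132", host_prefix="10.0.1"))
```

Running a check like this first avoids discovering a missing variable only after a multi-node launch has started.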

The 2026compare experiment entry points live under llm-opt-baseline-npu/llama2_pretrain/experiments/2026compare/, mainly as the multi_node_sweep_*.sh and multi_node_main_*.sh scripts. A minimal example:

cd llm-opt-baseline-npu/llama2_pretrain/experiments/2026compare
C4_DATA_DIR=/path/to/c4 \
TOKENIZER_PATH=t5-base \
VISIBLE_NPUS=0,1,2,3,4,5,6,7 \
NUM_NPUS_PER_JOB=8 \
bash multi_node_sweep_350m.sh --nodes 10.0.1.131,10.0.1.132 adamw

For more details on multi-node arguments, see llm-opt-baseline-npu/llama2_pretrain/experiments/2026compare/multi_node_usage.md.

Using MuonEq Directly

If you want to call MuonEq directly in your own code, the two implementations live at:

NPU: llm-opt-baseline-npu/llama2_pretrain/optimizers/muon_variants/muoneq.py
GPU: llm-opt-baseline-gpu/llm-baselines/src/optim/muoneq.py

Both classes are named MuonEq, and the core API is aligned. The main difference is the import path:

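# Use ONE of the two imports, depending on the platform.

# NPU (module path is relative to llm-opt-baseline-npu/llama2_pretrain)
from optimizers.muon_variants.muoneq import MuonEq

# GPU (module path is relative to llm-opt-baseline-gpu/llm-baselines/src)
# from optim.muoneq import MuonEq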

optimizer = MuonEq(
    lr=1e-3,
    wd=0.1,
    muon_params=muon_params,
    adamw_params=adamw_params,
    momentum=0.95,
    nesterov=True,
    ns_steps=5,
    adamw_betas=(0.95, 0.95),
    normalize_mode="row",   # also supports "rowcol" and "col"
    phase=None,             # can switch from row/col to row after phase
    zeropower_mode="native" # or "spc"
)
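The constructor takes two parameter lists, `muon_params` and `adamw_params`. A common convention for Muon-style optimizers (used, for example, in Moonlight's public example code) is to send weight matrices to Muon and to keep embeddings, the output head, and 1D parameters such as biases and norm scales on AdamW. The sketch below follows that convention. The helper `split_params` and the name keywords are hypothetical and depend on your model; the repository's training scripts may split parameters differently. The demo uses stand-in objects so it runs without PyTorch; with a real model you would pass `model.named_parameters()`.

```python
from types import SimpleNamespace

def split_params(named_params, adamw_keywords=("embed", "lm_head")):
    """Split (name, param) pairs into Muon and AdamW lists.

    ASSUMPTION: parameters with ndim >= 2 go to Muon unless their name
    contains one of adamw_keywords; everything else goes to AdamW.
    """
    muon_params, adamw_params = [], []
    for name, param in named_params:
        if not getattr(param, "requires_grad", True):
            continue  # frozen parameters are not optimized
        is_matrix = param.ndim >= 2
        on_adamw_list = any(key in name for key in adamw_keywords)
        if is_matrix and not on_adamw_list:
            muon_params.append(param)
        else:
            adamw_params.append(param)
    return muon_params, adamw_params

# Stand-ins for model.named_parameters(); only .ndim is inspected here.
fake_named_params = [
    ("embed_tokens.weight", SimpleNamespace(ndim=2)),
    ("layers.0.attn.q_proj.weight", SimpleNamespace(ndim=2)),
    ("layers.0.norm.weight", SimpleNamespace(ndim=1)),
    ("lm_head.weight", SimpleNamespace(ndim=2)),
]
muon_params, adamw_params = split_params(fake_named_params)
print(len(muon_params), len(adamw_params))
```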

If you want to reuse the pretraining script in this repository directly, you can also call:

llm-opt-baseline-npu/llama2_pretrain/scripts/pretrain_c4_dist.py

The corresponding optimizer names are:

muoneq-row
muoneq-rowcol
muoneq-col

Results

Selected results from LLaMA2-1B training up to 21B tokens:

[Figures: Train Loss vs Tokens; Val Loss vs Tokens; Val Loss vs Train Time]

Environment and Dependencies

The GPU baseline and NPU pretraining parts maintain separate dependency sets:

llm-opt-baseline-gpu/llm-baselines/requirements.txt
llm-opt-baseline-npu/llama2_pretrain/requirements.txt

In practice, it is safer to use separate environments for these two parts instead of forcing them into one shared environment.

License

The root-level code is released under MIT; see LICENSE.

The third-party experiment directories preserved in this repository keep their own license files:

llm-opt-baseline-gpu/llm-baselines/LICENSE
llm-opt-baseline-npu/llama2_pretrain/LICENSE

Citation

If this repository or the paper is useful for your work, please cite:

@article{chang2026muoneq,
  title={MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration},
  author={Chang, Da and Shi, Qiankun and Zhang, Lvgang and Li, Yu and Zhang, Ruijie and Lu, Yao and Liu, Yongxiang and Yuan, Ganzhao},
  journal={arXiv preprint arXiv:2603.28254},
  year={2026}
}
