Hyunji Jung*, Sungbin Shin*, Namhoon Lee (*equal contribution)
POSTECH
ICML 2026 · Paper
We identify the root cause of delay sensitivity in asynchronous pipeline parallelism as basis misalignment, which breaks Adam's coordinate-wise adaptivity and induces oscillation in its update directions. Under gradient delay, these oscillations cause the updates computed from delayed gradients to be misaligned with, or even opposite to, their non-delayed counterparts, which harms optimization. We propose Basis Rotation, which realigns the optimization space with the Hessian eigenbasis, restoring Adam's adaptivity and eliminating delay sensitivity.
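To make the idea of running Adam in a rotated basis concrete, here is a minimal toy sketch in pure Python. It is not the paper's Algorithm 1 and not the repository's `basisrotation` optimizer: the function names, the 2x2 setting, and the hand-picked orthogonal matrix `Q` (standing in for an estimated Hessian eigenbasis) are all assumptions made for illustration.

```python
import math


def matvec(q, vec):
    """Return q @ vec for a 2x2 matrix q stored as nested lists."""
    return [q[0][0] * vec[0] + q[0][1] * vec[1],
            q[1][0] * vec[0] + q[1][1] * vec[1]]


def transpose(q):
    """Return the transpose of a 2x2 matrix."""
    return [[q[0][0], q[1][0]], [q[0][1], q[1][1]]]


def rotated_adam_step(grad, q, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update computed in the basis spanned by the columns of q.

    Hypothetical helper: q is assumed to be orthogonal and supplied by the caller.
    """
    state["t"] += 1
    g_rot = matvec(transpose(q), grad)  # express the gradient in the rotated basis
    update_rot = []
    for i, g in enumerate(g_rot):
        # Standard Adam moment estimates, kept per rotated coordinate.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** state["t"])
        v_hat = state["v"][i] / (1 - beta2 ** state["t"])
        update_rot.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return matvec(q, update_rot)  # map the update back to parameter coordinates


# Assumed eigenbasis: a 45-degree rotation (columns are the basis vectors).
c = s = math.sqrt(0.5)
Q = [[c, -s], [s, c]]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}

# The gradient [1, 1] lies entirely along the first column of Q.
step = rotated_adam_step([1.0, 1.0], Q, state, lr=0.1)
print(step)
```

In this toy case the gradient lies along a single rotated coordinate, so the normalized step has length close to `lr` along that direction, and the orthogonal coordinate receives no update.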
The experiments were tested with Python 3.12. We use uv for environment management.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv python pin 3.12
uv sync
```

Datasets are downloaded from Hugging Face on first use. Set `HF_HOME` to control where they are cached:

```bash
export HF_HOME=/scratch/data  # or any directory with sufficient space
```

See `run.bash` for a complete example. The script can be configured for any number of pipeline stages by adjusting `nstages`.
```bash
# Example: 32-stage pipeline on 8 GPUs (4 stages per GPU)
ngpus=8
nstages=32
master_port=12345

# Launch one process per pipeline stage; stages are assigned to GPUs round-robin.
for rank in $(seq 0 $(($nstages - 1))); do
    local_rank=$(($rank % $ngpus))
    .venv/bin/python main_with_runtime.py \
        --module models.gptn \
        --n_layer 32 --n_embd 384 --n_head 6 --block_size 512 \
        --config_path models/gptn/layers=32/stage${nstages}.json \
        -d openwebtext \
        --optimizer basisrotation \
        --rotation_geometry bi --approx_source 2nd \
        --subspace_update_frequency 10 \
        --lr 1e-3 --lr_warmup --lr_policy cosine --epochs 250 \
        --clip_grad 1 --recompute \
        --rank $rank --local_rank $local_rank \
        --master_addr localhost --master_port $master_port \
        --distributed_backend gloo > rank${rank}.log 2>&1 &
done
wait  # block until every stage process has exited
```

| Argument | Values | Description |
|---|---|---|
| `--optimizer` | `basisrotation`, `adamw`, `nadamw` | Optimizer choice. `basisrotation` is Adam with Basis Rotation (Algorithm 1). |
| `--rotation_geometry` | `bi`, `uni` | Rotation geometry: bilateral (two-sided) or unilateral (one-sided). Corresponds to … |
| `--approx_source` | `2nd`, `1st` | Approximation source for eigenbasis estimation: second-order covariance ( … |
| `--subspace_update_frequency` | int | How often to refresh the rotation basis (in steps). |

| Argument | Description |
|---|---|
| `--config_path` | Path to stage configuration JSON (e.g., `models/gptn/layers=32/stage4.json`). Maps model submodules to pipeline stages. |
| `--rank` | Rank of this process. |
| `--local_rank` | Local GPU index within the node. |
| `--stash_to_cpu` | Offload stashed weight versions to CPU to reduce GPU memory usage. |
| `--recompute` | Recompute activations in the backward pass (saves memory at the cost of extra compute). |

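
The launch loop derives `--local_rank` from `--rank` as `rank % ngpus`, so stages are spread over GPUs round-robin. The short sketch below reproduces that arithmetic in Python to show which stages share a GPU. The helper name `stages_per_gpu` is hypothetical and not part of the codebase.

```python
from collections import defaultdict


def stages_per_gpu(nstages, ngpus):
    """Group pipeline-stage ranks by local GPU index, using local_rank = rank % ngpus."""
    placement = defaultdict(list)
    for rank in range(nstages):
        placement[rank % ngpus].append(rank)
    return dict(placement)


# Same numbers as the example launch: 32 stages on 8 GPUs.
placement = stages_per_gpu(nstages=32, ngpus=8)
for gpu, ranks in sorted(placement.items()):
    print(f"GPU {gpu}: stages {ranks}")
```

With 32 stages on 8 GPUs, each GPU hosts 4 stages, and GPU 0 receives ranks 0, 8, 16, and 24.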
The GPT model is defined in `models/gptn/` and split into pipeline stages via JSON configs in `models/gptn/layers=<N>/stage<K>.json`. Available configs: `stage1` (single GPU), `stage2`, `stage4`, `stage8`, `stage16`, `stage32`.

To add a new model size, define the architecture in `models/` and create the corresponding stage JSON files.
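
As an illustration of the naming scheme, the following sketch builds the value passed to `--config_path` from a layer count and a stage count. The helper `stage_config_path` is hypothetical, and it assumes that the stage counts listed above are the only ones shipped. Whether every `layers=<N>` directory contains all of them is not confirmed here.

```python
# Stage counts listed in this README; assumed to be the only shipped configs.
AVAILABLE_STAGES = (1, 2, 4, 8, 16, 32)


def stage_config_path(n_layer, nstages):
    """Return the stage config path following models/gptn/layers=<N>/stage<K>.json."""
    if nstages not in AVAILABLE_STAGES:
        raise ValueError(
            f"no stage{nstages}.json listed; choose one of {AVAILABLE_STAGES} "
            "or create a new stage JSON file"
        )
    return f"models/gptn/layers={n_layer}/stage{nstages}.json"


print(stage_config_path(32, 4))
```

A check like this fails early when `nstages` is set to a value that has no matching JSON file, instead of failing once per launched process.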
This codebase builds upon the following open-source projects:
```bibtex
@inproceedings{jung2026mitigating,
  title={Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation},
  author={Jung, Hyunji and Shin, Sungbin and Lee, Namhoon},
  booktitle={International Conference on Machine Learning},
  year={2026}
}
```