🔺 Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Official PyTorch implementation of Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention.

Ali Hatamizadeh, Yejin Choi, and Jan Kautz.

🌟 Why Gated DeltaNet-2?

Linear attention compresses an unbounded KV cache into a fixed-size recurrent state. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Prior delta-rule models (Gated DeltaNet, Kimi Delta Attention) tie erasing and writing to a single scalar gate — even though they act on different axes of the state.

Gated DeltaNet-2 decouples these two roles:

✂️ Channel-wise Erase Gate b_t — selects which key-side coordinates of the decayed state are read and removed
✍️ Channel-wise Write Gate w_t — selects which value-side coordinates of the new content are committed
🌀 Channel-wise Decay — inherited from KDA for fine-grained global forgetting
🔁 Strict Generalization — recovers KDA when both gates collapse to the same scalar, and Gated DeltaNet when the decay also collapses
⚡ Hardware-efficient Training — fast-weight WY chunkwise algorithm with gate-aware backward, fused in Triton

📐 The Gated Delta Rule-2

Given an erase gate b_t ∈ [0,1]^{d_k}, a write gate w_t ∈ [0,1]^{d_v}, and channel-wise decay D_t = Diag(α_t), the recurrent state evolves as:

S_t = (I − k_t (b_t ⊙ k_t)ᵀ) D_t S_{t−1}  +  k_t (w_t ⊙ v_t)ᵀ

Compared with KDA, the right factor of the rank-one erase becomes channel-selective on the key axis, and the write term becomes channel-selective on the value axis. The two decisions no longer share a single scalar.

📊 Results

We train all models at 1.3B parameters on 100B tokens of FineWeb-Edu, matched in recurrent state size, and compare against Mamba-2, Gated DeltaNet, KDA, and Mamba-3 (SISO and MIMO).

Language Modeling and Commonsense Reasoning

Gated DeltaNet-2 achieves the best average across both recurrent-only and hybrid settings:

Model	Wiki ppl ↓	LMB ppl ↓	LMB acc ↑	Avg. acc ↑
Recurrent
Mamba-2	16.79	12.38	45.24	51.82
Gated DeltaNet	16.40	11.89	49.62	52.07
KDA	16.81	11.68	48.13	52.28
Mamba-3 (MIMO)	16.45	11.66	47.82	52.39
Gated DeltaNet-2	15.90	11.41	48.09	53.11
Hybrid (+ SWA)
Transformer	19.22	13.72	48.32	50.86
Gated DeltaNet	16.00	10.82	48.71	52.25
KDA	16.01	10.66	49.21	52.68
Mamba-3 (MIMO)	15.81	10.92	49.82	52.72
Gated DeltaNet-2	15.62	10.43	50.90	53.97

Long-context Retrieval (RULER)

Gated DeltaNet-2 is strongest where memory editing matters most — particularly the interference-heavy multi-key needle-in-a-haystack settings:

Model	S-NIAH-2 @4K	S-NIAH-3 @2K	MK-NIAH-1 @4K
Recurrent
Gated DeltaNet	87.2	54.2	27.8
KDA	89.0	63.2	28.0
Mamba-3 (MIMO)	64.2	72.4	18.0
Gated DeltaNet-2	93.0	89.8	37.8
Hybrid
Gated DeltaNet	57.3	91.2	44.8
KDA	56.0	93.4	40.4
Mamba-3 (MIMO)	53.0	98.4	46.6
Gated DeltaNet-2	57.9	99.0	48.0

Real-world Retrieval

Across SWDE, SQuAD, FDA, TriviaQA, NQ, and DROP, Gated DeltaNet-2 leads the recurrent and hybrid frontier:

Setting	Mamba-2	GDN	KDA	Mamba-3 (MIMO)	GDN-2
Recurrent avg.	26.84	28.09	28.67	28.35	29.88
Hybrid avg.	39.74	39.11	40.14	40.11	42.28

Throughput

Gated DeltaNet-2 retains near-flat scaling with sequence length on a single H100 (training, hybrid 1.3B), with only a small constant overhead over KDA for the added channel-wise gates.

🔧 What's New in the Update Rule

Method	Decay	Erase	Write
Mamba-2	scalar	—	scalar
Gated DeltaNet	scalar	scalar `β_t`	scalar `β_t`
KDA	channel-wise	scalar `β_t`	scalar `β_t`
Gated DeltaNet-2	channel-wise	channel-wise `b_t`	channel-wise `w_t`

Ablations confirm both gates contribute, with the erase gate b_t accounting for most of the gain — consistent with its role in selectively protecting or revising key-side associations in the recurrent state.

📢 Latest Updates

05/21/2026: 🔥 Code Release: Train your own Gated DeltaNet-2 on FineWeb-Edu
Watch this space for more exciting updates!

🚀 Getting Started

Training Your Model

Launch your training with our streamlined command:

python ../pretrain.py \
--train_data_dir ${TRAIN_DATA} \
--val_data_dir ${VALIDATION_DATA} \
--output_root ${SAVE_DIR} \
--exp_name ${NAME} \
--model_name ${MODEL} \
--train_config ${CONFIG} \
--eval_iters ${EVAL_ITERS} \
--learning_rate ${LR} \
--micro_batch_size ${MICRO_BATCH_SIZE}

💡 Pro Tip: Add --interactive_job --debug for interactive debugging sessions!

Default Recipe

We train 1.3B-parameter models on 100B tokens of FineWeb-Edu with:

AdamW, peak LR 4e-4, weight decay 0.1, gradient clip 1.0
Cosine schedule with 1B-token warmup
Global batch size 0.5M tokens, sequence length 4K
Hybrid models use a 2K sliding-window attention size
16 heads, d_k = d_v = 128, matched recurrent state size against Mamba-2/3 baselines

📜 License

Licensed under the NVIDIA Source Code License-NC. See LICENSE for details.

🙏 Acknowledgements

Built on the shoulders of giants:

📖 Citation

If you find this work useful, please consider citing:

@article{hatamizadeh2026gdn2,
  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},
  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},
  journal = {arXiv preprint},
  year    = {2026}
}

⭐ Support Us

If you find this work useful, please consider:

Starring the repository
Citing our paper
Contributing to the codebase

Join us in pushing the boundaries of linear attention! 🚀

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets		assets
lit_gpt		lit_gpt
paper		paper
scripts		scripts
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
cache.py		cache.py
data.py		data.py
pretrain.py		pretrain.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔺 Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

🌟 Why Gated DeltaNet-2?

📐 The Gated Delta Rule-2

📊 Results

Language Modeling and Commonsense Reasoning

Long-context Retrieval (RULER)

Real-world Retrieval

Throughput

🔧 What's New in the Update Rule

📢 Latest Updates

🚀 Getting Started

Training Your Model

Default Recipe

📜 License

🙏 Acknowledgements

📖 Citation

⭐ Support Us

Star History

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🔺 Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

🌟 Why Gated DeltaNet-2?

📐 The Gated Delta Rule-2

📊 Results

Language Modeling and Commonsense Reasoning

Long-context Retrieval (RULER)

Real-world Retrieval

Throughput

🔧 What's New in the Update Rule

📢 Latest Updates

🚀 Getting Started

Training Your Model

Default Recipe

📜 License

🙏 Acknowledgements

📖 Citation

⭐ Support Us

Star History

About

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages