NVlabs/GatedDeltaNet-2

🔺 Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Official PyTorch implementation of Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention.

Ali Hatamizadeh, Yejin Choi, and Jan Kautz.

🌟 Why Gated DeltaNet-2?

Linear attention compresses an unbounded KV cache into a fixed-size recurrent state. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Prior delta-rule models (Gated DeltaNet, Kimi Delta Attention) tie erasing and writing to a single scalar gate, even though the two operations act on different axes of the state.

Gated DeltaNet-2 decouples these two roles:

  • ✂️ Channel-wise Erase Gate b_t: selects which key-side coordinates of the decayed state are read and removed
  • ✍️ Channel-wise Write Gate w_t: selects which value-side coordinates of the new content are committed
  • 🌀 Channel-wise Decay: inherited from KDA for fine-grained global forgetting
  • Strict Generalization: recovers KDA when both gates collapse to the same scalar, and Gated DeltaNet when the decay also collapses
  • ⚡ Hardware-efficient Training: fast-weight WY chunkwise algorithm with gate-aware backward, fused in Triton

The Gated Delta Rule-2

Given an erase gate b_t ∈ [0,1]^{d_k}, a write gate w_t ∈ [0,1]^{d_v}, and channel-wise decay D_t = Diag(α_t), the recurrent state evolves as:

S_t = (I − k_t (b_t ⊙ k_t)ᵀ) D_t S_{t−1} + k_t (w_t ⊙ v_t)ᵀ

Compared with KDA, the right factor of the rank-one erase becomes channel-selective on the key axis, and the write term becomes channel-selective on the value axis. The two decisions no longer share a single scalar.
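For intuition, the recurrence above can be written as a naive, token-by-token reference step. This is a minimal pure-Python sketch, not the repository's fused chunkwise Triton kernel; the function name `gdn2_step` and the list-of-lists state layout (d_k rows, d_v columns) are assumptions made for illustration.

```python
def gdn2_step(S, k, v, b, w, alpha):
    """Apply one Gated Delta Rule-2 update to a d_k x d_v state (hypothetical helper).

    S: state as a list of d_k rows, each with d_v entries
    k, b, alpha: length-d_k key, erase gate, and channel-wise decay
    v, w: length-d_v value and write gate
    """
    d_k, d_v = len(k), len(v)
    # D_t S_{t-1}: scale row i of the state by alpha[i].
    decayed = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # (b ⊙ k)^T (D_t S_{t-1}): what the gated key reads from the decayed state.
    read = [sum(b[i] * k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # Remove the read content along k, then commit the gated value along k.
    return [
        [decayed[i][j] - k[i] * read[j] + k[i] * w[j] * v[j] for j in range(d_v)]
        for i in range(d_k)
    ]


# Toy example: d_k = 2, d_v = 1, unit-norm key on the first coordinate.
S_prev = [[1.0], [2.0]]
k, v, alpha = [1.0, 0.0], [3.0], [0.5, 0.5]

full_edit = gdn2_step(S_prev, k, v, b=[1.0, 1.0], w=[1.0], alpha=alpha)
write_only = gdn2_step(S_prev, k, v, b=[0.0, 0.0], w=[1.0], alpha=alpha)
erase_only = gdn2_step(S_prev, k, v, b=[1.0, 1.0], w=[0.0], alpha=alpha)
print(full_edit, write_only, erase_only)
```

In this toy case, opening the erase gate while closing the write gate clears the slot addressed by k without committing anything new, and reversing the gates adds the new value on top of the decayed content. A single shared scalar gate cannot select these two behaviors independently.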

📊 Results

We train all models at 1.3B parameters on 100B tokens of FineWeb-Edu, matched in recurrent state size, and compare against Mamba-2, Gated DeltaNet, KDA, and Mamba-3 (SISO and MIMO).

Language Modeling and Commonsense Reasoning

Gated DeltaNet-2 achieves the best average across both recurrent-only and hybrid settings:

| Model | Wiki ppl ↓ | LMB ppl ↓ | LMB acc ↑ | Avg. acc ↑ |
| --- | --- | --- | --- | --- |
| Recurrent | | | | |
| Mamba-2 | 16.79 | 12.38 | 45.24 | 51.82 |
| Gated DeltaNet | 16.40 | 11.89 | 49.62 | 52.07 |
| KDA | 16.81 | 11.68 | 48.13 | 52.28 |
| Mamba-3 (MIMO) | 16.45 | 11.66 | 47.82 | 52.39 |
| Gated DeltaNet-2 | 15.90 | 11.41 | 48.09 | 53.11 |
| Hybrid (+ SWA) | | | | |
| Transformer | 19.22 | 13.72 | 48.32 | 50.86 |
| Gated DeltaNet | 16.00 | 10.82 | 48.71 | 52.25 |
| KDA | 16.01 | 10.66 | 49.21 | 52.68 |
| Mamba-3 (MIMO) | 15.81 | 10.92 | 49.82 | 52.72 |
| Gated DeltaNet-2 | 15.62 | 10.43 | 50.90 | 53.97 |

Long-context Retrieval (RULER)

Gated DeltaNet-2 is strongest where memory editing matters most, particularly in the interference-heavy multi-key needle-in-a-haystack settings:

| Model | S-NIAH-2 @4K | S-NIAH-3 @2K | MK-NIAH-1 @4K |
| --- | --- | --- | --- |
| Recurrent | | | |
| Gated DeltaNet | 87.2 | 54.2 | 27.8 |
| KDA | 89.0 | 63.2 | 28.0 |
| Mamba-3 (MIMO) | 64.2 | 72.4 | 18.0 |
| Gated DeltaNet-2 | 93.0 | 89.8 | 37.8 |
| Hybrid | | | |
| Gated DeltaNet | 57.3 | 91.2 | 44.8 |
| KDA | 56.0 | 93.4 | 40.4 |
| Mamba-3 (MIMO) | 53.0 | 98.4 | 46.6 |
| Gated DeltaNet-2 | 57.9 | 99.0 | 48.0 |

Real-world Retrieval

Across SWDE, SQuAD, FDA, TriviaQA, NQ, and DROP, Gated DeltaNet-2 leads the recurrent and hybrid frontier:

| Setting | Mamba-2 | GDN | KDA | Mamba-3 (MIMO) | GDN-2 |
| --- | --- | --- | --- | --- | --- |
| Recurrent avg. | 26.84 | 28.09 | 28.67 | 28.35 | 29.88 |
| Hybrid avg. | 39.74 | 39.11 | 40.14 | 40.11 | 42.28 |
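To read the size of the gap directly from these averages, the short sketch below computes the margin of GDN-2 over the strongest baseline in each setting. The numbers are copied from the table; the helper name `margin_over_best_baseline` is hypothetical and used only for illustration.

```python
# Averages copied from the real-world retrieval table.
averages = {
    "Recurrent": {"Mamba-2": 26.84, "GDN": 28.09, "KDA": 28.67,
                  "Mamba-3 (MIMO)": 28.35, "GDN-2": 29.88},
    "Hybrid": {"Mamba-2": 39.74, "GDN": 39.11, "KDA": 40.14,
               "Mamba-3 (MIMO)": 40.11, "GDN-2": 42.28},
}


def margin_over_best_baseline(row, ours="GDN-2"):
    """Return the strongest baseline and the margin of `ours` over it, in points."""
    baselines = {name: score for name, score in row.items() if name != ours}
    best = max(baselines, key=baselines.get)
    return best, round(row[ours] - baselines[best], 2)


for setting, row in averages.items():
    best, margin = margin_over_best_baseline(row)
    print(f"{setting}: +{margin:.2f} over {best}")
```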

Throughput

Gated DeltaNet-2 retains near-flat scaling with sequence length on a single H100 (training, hybrid 1.3B), with only a small constant overhead over KDA for the added channel-wise gates.

🔧 What's New in the Update Rule

| Method | Decay | Erase | Write |
| --- | --- | --- | --- |
| Mamba-2 | scalar | - | scalar |
| Gated DeltaNet | scalar | scalar β_t | scalar β_t |
| KDA | channel-wise | scalar β_t | scalar β_t |
| Gated DeltaNet-2 | channel-wise | channel-wise b_t | channel-wise w_t |

Ablations confirm both gates contribute, with the erase gate b_t accounting for most of the gain. This is consistent with its role in selectively protecting or revising key-side associations in the recurrent state.
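The rows of the table can be related by direct substitution into the Gated Delta Rule-2. The following is a sketch derived from the update rule stated earlier, written in the same state layout; the all-ones vector \mathbf{1} and the scalar decay a_t are symbols introduced here for illustration.

```latex
\begin{align*}
% Both gates collapse to one scalar: b_t = \beta_t \mathbf{1},\; w_t = \beta_t \mathbf{1}
S_t &= \left(I - \beta_t\, k_t k_t^{\top}\right) \mathrm{Diag}(\alpha_t)\, S_{t-1}
       + \beta_t\, k_t v_t^{\top} \\
% The decay also collapses: \alpha_t = a_t \mathbf{1}
S_t &= a_t \left(I - \beta_t\, k_t k_t^{\top}\right) S_{t-1}
       + \beta_t\, k_t v_t^{\top}
\end{align*}
```

The first line matches the KDA row of the table (channel-wise decay, scalar erase and write), and the second matches the Gated DeltaNet row (all three scalar).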

📢 Latest Updates

  • 05/21/2026: 🔥 Code Release: Train your own Gated DeltaNet-2 on FineWeb-Edu
  • Watch this space for more exciting updates!

🚀 Getting Started

Training Your Model

Launch your training with our streamlined command:

python ../pretrain.py \
--train_data_dir ${TRAIN_DATA} \
--val_data_dir ${VALIDATION_DATA} \
--output_root ${SAVE_DIR} \
--exp_name ${NAME} \
--model_name ${MODEL} \
--train_config ${CONFIG} \
--eval_iters ${EVAL_ITERS} \
--learning_rate ${LR} \
--micro_batch_size ${MICRO_BATCH_SIZE}

💡 Pro Tip: Add --interactive_job --debug for interactive debugging sessions!

Default Recipe

We train 1.3B-parameter models on 100B tokens of FineWeb-Edu with:

  • AdamW, peak LR 4e-4, weight decay 0.1, gradient clip 1.0
  • Cosine schedule with 1B-token warmup
  • Global batch size 0.5M tokens, sequence length 4K
  • Hybrid models use a 2K sliding-window attention size
  • 16 heads, d_k = d_v = 128, matched recurrent state size against Mamba-2/3 baselines
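The recipe implies a few derived quantities that are useful when adapting it. The sketch below works them out and shows one possible warmup-plus-cosine schedule. It assumes "0.5M" means exactly 500,000 tokens, a linear warmup shape, a final learning rate of zero, and one d_k × d_v state per head; none of these details are stated in the recipe, and the function `lr_at` is hypothetical rather than part of the repository.

```python
import math

# Values from the recipe; reading "0.5M" as 500,000 is an assumption.
TOTAL_TOKENS = 100_000_000_000
WARMUP_TOKENS = 1_000_000_000
GLOBAL_BATCH_TOKENS = 500_000
PEAK_LR = 4e-4
NUM_HEADS, D_K, D_V = 16, 128, 128

total_steps = TOTAL_TOKENS // GLOBAL_BATCH_TOKENS    # optimizer steps in the run
warmup_steps = WARMUP_TOKENS // GLOBAL_BATCH_TOKENS  # steps spent warming up
# Assumes one d_k x d_v recurrent state per head in each recurrent layer.
state_entries_per_layer = NUM_HEADS * D_K * D_V


def lr_at(step, min_lr=0.0):
    """Linear warmup, then cosine decay (warmup shape and min_lr are assumptions)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


print(total_steps, warmup_steps, state_entries_per_layer)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

Under these assumptions the run takes 200,000 optimizer steps, of which 2,000 are warmup, and each recurrent layer carries 262,144 state entries.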

📜 License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

Licensed under the NVIDIA Source Code License-NC. See LICENSE for details.

Acknowledgements

Built on the shoulders of giants:

📖 Citation

If you find this work useful, please consider citing:

@article{hatamizadeh2026gdn2,
  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},
  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},
  journal = {arXiv preprint},
  year    = {2026}
}

⭐ Support Us

If you find this work useful, please consider:

  • Starring the repository
  • Citing our paper
  • Contributing to the codebase

Join us in pushing the boundaries of linear attention! 🚀
