A PyTorch implementation of Nested Learning (NL) and the HOPE sequence model.
Ali Behrouz, Meisam Razaviyayn, Peiling Zhong, Vahab Mirrokni
Nested Learning: The Illusion of Deep Learning Architectures, NeurIPS 2025.
-
đ§ Paper: Nested Learning: The Illusion of Deep Learning Architectures (NeurIPS 2025)
-
đ» This repo (full PyTorch implementation):
https://github.com/AMIRMOHAMMAD-OSS/Google_NestedLearning -
đ Hands-on Colab tutorial (PyTorch, train on any HF dataset):
The Colab notebook gives a minimal, didactic example of a HOPE-style model and training loop entirely in one file, while this repository contains a more modular and extensible implementation.
Standard deep learning treats a model as:
one parameter vector + one global optimizer + one gradient flow.
Nested Learning (NL) reframes this as:
a system of nested optimization problems, each with:
- its own context flow (what it sees over time),
- its own update frequency,
- its own learning rule,
- and its own memory.
Under NL, familiar components become associative memories:
- A linear layer trained by SGD is a memory mapping inputs â âsurprise signalsâ (gradients).
- Momentum buffers are memories of past gradients.
- Linear attention memories (like KV states) are online compressions of the sequence.
This lens helps explain:
- In-context learning as on-the-fly memory compression of context,
- Optimizers as learned memory systems,
- The limits of âjust stack more layersâ in terms of effective computation depth.
HOPE is a sequence model that operationalizes NL with two main components:
-
Fast associative memory (working memory)
- A linear-attentionâstyle state update (
FastKV):
[ M_{t+1} = M_t + \phi(k_t)^\top v_t ] - Acts as an online memory over the current context.
- Updates are local and do not require backpropagation through the update rule.
- A linear-attentionâstyle state update (
-
Continuum Memory System (CMS)
- A chain of MLP blocks, each living at a different time scale (update frequency).
- For example:
- Level 0: updates every token,
- Level 1: updates every few tokens,
- Level 2: updates every many tokens.
- This recovers a continuum of memories from short-term to long-term.
Together, HOPE becomes a self-modifying sequence model:
- outer parameters trained in the usual way,
- inner components that adapt during inference via fast, local update rules.
Key differences:
-
Multiple time scales, not a single static backbone
- Transformers: layers are fixed after pretraining; all updates happen off-line.
- HOPE: separates fast (per-step) and slow (infrequent) components via CMS.
-
Optimizers as memories
- NL views optimizers as gradient memories (e.g., momentum â associative memory).
- This repo includes variants where inner updates use L2-regressionâstyle rules rather than pure dot-product Hebbian updates.
-
Explicit in-context adaptation
- In a Transformer, in-context learning is emergent.
- In HOPE, context is explicitly compressed into a fast memory + CMS levels, and parts of the model are designed to update on-the-fly.
-
Better alignment with continual learning
- Fast levels adapt quickly but can forget,
- Slow levels change rarely and store more stable structure,
- This avoids relying on a single âshort-term memoryâ bottleneck.
-
Self-modification without meta-gradients
- Inner learning rules are local and do not require second-order gradients or unrolled meta-learning.
- This keeps the architecture more scalable and easier to implement.
-
Fast associative memory (FastKV)
- Linear-attentionâstyle fast weight update.
- Implements keyâvalue accumulation and retrieval with a feature map Ï(·).
- File:
nl/modules/fast_kv.py
-
Continuum Memory System (CMS)
- Chain of MLP âmemory levelsâ with configurable update frequencies.
- Each level has its own optimizer / update schedule.
- File:
nl/memory/cms.py
-
HOPE language model
- Combines token embeddings, FastKV, CMS, and an inner learning rule.
- Includes the L2-regressionâstyle gradient step derived in the paper.
- File:
nl/models/hope.py
-
Training
scripts/train.py- Loads config files (
configs/*.yaml) - Prepares data via Hugging Face Datasets
- Trains HOPE on next-token prediction
- Loads config files (
-
Perplexity evaluation
scripts/eval_ppl.py- Loads a trained checkpoint
- Computes validation/test perplexity
-
Sampling
scripts/sample.py- Loads a trained checkpoint
- Generates text with temperature / top-k / top-p
-
Example model config (HOPE, Colab-friendly size):
configs/hope_colab.yaml -
Example data config (WikiText-2):
configs/data_wikitext2.yaml
You can add your own .yaml files to configure:
- model size,
- CMS depth / update frequencies,
- dataset paths or HF dataset names,
- optimization hyperparameters.
If you prefer a single-file, PyTorch-only walkthrough, use the Colab notebook:
The notebook:
- Re-implements a HOPE-style model directly in the notebook (no imports from this repo).
- Shows how to:
- load a Hugging Face text dataset,
- build a simple LM dataset (blocks of tokens),
- define a FastKV + CMS model,
- train with cross-entropy LM loss,
- generate text samples.
- Is designed as a readable reference for people who want to understand the approach before diving into the full codebase.
If you just want to try the idea on your own text or HF dataset, start from this Colab.
Clone the repo and install dependencies:
git clone https://github.com/AMIRMOHAMMAD-OSS/Google_NestedLearning
cd Google_NestedLearning
pip install -r requirements.txtMake sure you have a working Python 3.10+ and PyTorch with CUDA if you want GPU training.
python scripts/train.py --config configs/hope_colab.yaml --data configs/data_wikitext2.yamlThis will:
- Download WikiText-2 via Hugging Face,
- Tokenize and pack it into fixed-length sequences,
- Train a HOPE model with a moderate hidden size on a single GPU,
- Save checkpoints under
outputs/<exp_name>/.
python scripts/eval_ppl.py --checkpoint outputs/<exp_name>/last.ckpt --config_model configs/hope_colab.yaml --data configs/data_wikitext2.yamlThis reports token-level negative log-likelihood and perplexity on the validation or test split.
python scripts/sample.py --checkpoint outputs/<exp_name>/last.ckpt --prompt "Nested learning suggests" --max_new_tokens 150 --temperature 0.8 --top_k 50 --top_p 0.95Adjust prompt, temperature, top_k, and top_p to explore different generation behaviours.
You have two main options:
-
Pick a dataset from https://huggingface.co/datasets
Examples:imdb,ag_news,bookcorpusopen, etc. -
Create a data config (e.g.
configs/data_mytext.yaml) specifying:- HF dataset name,
- optional configuration,
- which field contains the raw text.
-
Point
scripts/train.pyto your custom data config:python scripts/train.py --config configs/hope_colab.yaml --data configs/data_mytext.yaml
You can mirror what is done in configs/data_wikitext2.yaml and adapt it for your dataset.
If you have plain .txt files:
- Either:
- Wrap them as a Hugging Face
textdataset in your data config, or - Use the Colab notebook and replace the dataset-loading cell with:
from datasets import load_dataset ds = load_dataset("text", data_files={"train": "your_file.txt"})
- Wrap them as a Hugging Face
- Once your text is exposed as a stream of strings, the rest of the pipeline (tokenization, block packing, training, sampling) stays unchanged.
- This repository is a research-oriented implementation, not a production library.
- The NeurIPS camera-ready paper is a compressed version; some details and ablations live in the (forthcoming) arXiv version.
- Where the paper is ambiguous, the implementation follows:
- standard fast-weight / linear-attention practice for the memory update,
- a clean separation between inner updates and outer gradients (no backprop through inner loops).
If you notice a discrepancy between the paper and the code, feel free to open an issue or PR.
@inproceedings{behrouz2025nestedlearning,
author = {Ali Behrouz and Meisam Razaviyayn and Peiling Zhong and Vahab Mirrokni},
title = {Nested Learning: The Illusion of Deep Learning Architectures},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2025}
}If you use this codebase or the Colab notebook in your work, please cite:
@software{amirmohammad_nl_hope_2025,
author = {AmirMohammad},
title = {{NL-HOPE}: Nested Learning and {HOPE} Sequence Model in PyTorch},
year = {2025},
url = {https://github.com/AMIRMOHAMMAD-OSS/Google_NestedLearning},
note = {GitHub repository},
version = {latest}
}