Paper: http://arxiv.org/abs/2510.02228
Authors: Maximilian Beck, Kajetan Schweighofer, Sebastian Böck, Sebastian Lehner, Sepp Hochreiter
This repository contains the code and data for our research on xLSTM scaling laws. We investigate the scaling behavior and computational characteristics of xLSTM architectures compared to dense multi-head Transformers across varying model sizes, training datasets, and context lengths.
- Training Efficiency: xLSTM models demonstrate competitive performance across nearly 5 orders of magnitude of compute, with favorable loss-to-FLOP ratios compared to Transformers
- Compute-Optimal Models: Analysis reveals that optimal xLSTM models tend to be larger than optimal Transformer models for a given compute budget, with model size characteristics that remain consistent across different context lengths
- Inference Characteristics: xLSTM exhibits reduced time-to-first-token (TTFT) and context-independent step times at 16k sequence length, with scaling advantages that increase with context length
- Scaling Behavior: xLSTM maintains power-law scaling relationships even in high token-to-parameter ratio training regimes
Our study encompasses model sizes from 80M to 7B parameters, training datasets from 2B to 2T tokens, and examines both training scaling laws and inference-time properties. The results provide insights into xLSTM as an alternative architecture for applications involving long context processing.
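As a rough orientation for the compute range involved, the standard C ≈ 6·N·D approximation for training FLOPs (an illustrative assumption here, not necessarily the exact FLOP accounting used in the paper) already spans close to five orders of magnitude between the smallest and largest configurations mentioned above:

```python
import math

# Back-of-the-envelope training-compute estimate with the common
# C ~= 6 * N * D approximation (N = parameters, D = training tokens).
# Illustrative assumption only -- not the paper's exact FLOP accounting.
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

c_small = approx_training_flops(80e6, 2e9)   # smallest setting: ~1e18 FLOPs
c_large = approx_training_flops(7e9, 2e12)   # largest setting:  ~8e22 FLOPs

print(f"compute span: ~{math.log10(c_large / c_small):.1f} orders of magnitude")
# -> compute span: ~4.9 orders of magnitude
```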
This repository is organized into the following main components:
xlstm_scaling_laws/
├── data/ # Run data for our dataset of training runs
├── data_lnd_fits/ # Results of our parametric L(N,D) fits
├── notebooks/ # Jupyter notebooks for analysis and visualization
├── scripts/ # Training and evaluation scripts
├── xlstm_scaling_laws/ # Main library for the scaling law analysis
│ ├── analysis/ # Analysis modules for different scaling law experiments
│ ├── common/ # Common utilities and data loading functions
│ ├── fitting/ # Statistical fitting functions for scaling laws
│ ├── flops/ # FLOP counting for different model architectures
│ ├── load_data/ # Data loading and preprocessing utilities
│ ├── model_accounting/ # Model parameter, FLOPs and MemOps accounting
│ └── params/ # Parameter counting for different architectures
├── requirements.txt # Python dependencies
└── README.md # This file
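As a rough illustration of the kind of accounting done in params/ and model_accounting/, the following sketch counts parameters for a dense multi-head Transformer with the usual d_ff = 4·d_model assumption. This is a generic back-of-the-envelope formula, not the repository's exact accounting (which also handles other architectures and details such as norms and biases):

```python
def approx_transformer_params(n_layers: int, d_model: int, vocab_size: int,
                              tied_embeddings: bool = True) -> int:
    """Rough parameter count for a dense pre-norm Transformer.

    Per layer: ~4*d^2 for attention (Q, K, V and output projections)
    plus ~8*d^2 for the MLP (assuming d_ff = 4*d_model), i.e. ~12*d^2.
    Embeddings add vocab_size * d_model (doubled if untied).
    Norms and biases are ignored in this sketch.
    """
    per_layer = 12 * d_model ** 2
    embedding = vocab_size * d_model * (1 if tied_embeddings else 2)
    return n_layers * per_layer + embedding

# Example: a GPT-2-medium-like shape lands near 350M parameters.
print(approx_transformer_params(n_layers=24, d_model=1024, vocab_size=50257))
```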
We provide all experiment logs and run data in several pickle files in the data/ folder.
In xlstm_scaling_laws/common/ we provide the functions to load and access the raw training log data extracted from wandb. In xlstm_scaling_laws/load_data we provide functions to extract the preprocessed data for our scaling law analyses. Please have a look at the notebooks in notebooks/paper_plots/ for examples of how to access and visualize the data.
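If you want to poke at the raw run data outside of the provided loaders, the pickles can be opened with standard tooling. A minimal sketch (the file name below is a placeholder, so list the data/ folder for the actual pickle files; the exact object layout may differ):

```python
import pickle
from pathlib import Path

import pandas as pd

# Placeholder file name -- list the data/ folder for the actual pickle files.
run_data_path = Path("data") / "example_runs.p"

with run_data_path.open("rb") as f:
    runs = pickle.load(f)

# Depending on how the object was serialized it may already be a DataFrame
# or a collection of records; pandas can normalize either for a quick look.
df = runs if isinstance(runs, pd.DataFrame) else pd.DataFrame(runs)
print(df.head())
```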
The notebooks/ directory contains interactive Jupyter notebooks organized into:
- paper_plots/ - Notebooks reproducing all figures from our paper
- experiment_setup/ - Notebooks for setting up our IsoFLOP experiments
- flop_calculations/ - FLOP and arithmetic intensity calculations (see the sketch below)
- inference_time/ - Notebooks for fitting our inference time models
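As a generic example of the kind of arithmetic intensity calculation covered in flop_calculations/, here is a plain roofline-style matmul estimate (an illustrative sketch, not code taken from the notebooks):

```python
def matmul_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOPs per byte) of a dense (m x k) @ (k x n) matmul.

    FLOPs: 2*m*k*n (a multiply and an add per output contribution).
    Ideal memory traffic (each operand touched once): (m*k + k*n + m*n) elements.
    """
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

# Large square matmuls (prefill-like) have high intensity and are compute-bound ...
print(f"{matmul_arithmetic_intensity(4096, 4096, 4096):.0f} FLOPs/byte")
# ... while skinny decode-style matmuls (one token at a time) are memory-bound.
print(f"{matmul_arithmetic_intensity(1, 4096, 4096):.2f} FLOPs/byte")
```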
The scripts/ directory contains the scripts for running the parametric L(N, D) fits on our dataset of training runs.
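For orientation, parametric L(N, D) fits are commonly based on the Chinchilla-style form L(N, D) = E + A/N^α + B/D^β. The snippet below is a minimal, self-contained sketch of fitting that form to synthetic data with scipy; it is not the exact fitting procedure used in scripts/, which may differ in parameterization, loss function, and optimization strategy:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_nd(x, E, A, B, alpha, beta):
    """Chinchilla-style parametric loss: L(N, D) = E + A / N^alpha + B / D^beta."""
    N, D = x
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, L) data generated from known parameters, with a bit of noise.
rng = np.random.default_rng(0)
N = rng.uniform(8e7, 7e9, size=200)    # model parameters
D = rng.uniform(2e9, 2e12, size=200)   # training tokens
true_params = (1.7, 400.0, 1100.0, 0.34, 0.28)
L = loss_nd((N, D), *true_params) + rng.normal(0.0, 0.01, size=200)

# Fit the five parameters (E, A, B, alpha, beta) by nonlinear least squares.
popt, _ = curve_fit(
    loss_nd, (N, D), L,
    p0=(2.0, 100.0, 1000.0, 0.3, 0.3),
    maxfev=20000,
)
print(dict(zip(["E", "A", "B", "alpha", "beta"], np.round(popt, 3))))
```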
Please cite our paper if you use this codebase or otherwise find our work valuable:
@article{beck:25xlstmscaling,
  title = {{xLSTM Scaling Laws}: Competitive Performance with Linear Time-Complexity},
  author = {Maximilian Beck and Kajetan Schweighofer and Sebastian Böck and Sebastian Lehner and Sepp Hochreiter},
  year = {2025},
  volume = {2510.02228},
  journal = {arXiv},
  primaryclass = {cs.LG},
  url = {http://arxiv.org/abs/2510.02228}
}