Code for End-to-End Context Compression at Scale. An LCLM is an encoder–decoder soft-token compressor: an encoder maps a long input to a short sequence of latent tokens, and a decoder consumes those latents in place of the original tokens.
📄 arXiv · 🤗 Checkpoints · Eval datasets
git clone https://github.com/LeonLixyz/LCLM && cd LCLM
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
# If flash-attn fails to build:
uv run pip install flash-attn --no-build-isolation

If you need nvcc: conda install -c nvidia cuda-nvcc.
LCLM/
├── latent_context/ # Model package: LCLM, LatentEncoder, Adapter,
│ # LCLMProcessor, from_pretrained.
├── inference/ # Inference entry points.
│ ├── hf.py # reference HF path (single process, single GPU)
│ ├── vllm_inference/ # two-stage vLLM CLI
│ │ ├── encode.py # HF encoder → embeds.pt
│ │ └── decode.py # vLLM decoder reads embeds.pt
│ └── examples/ # runnable demos + eval drivers (see README)
├── train/ # Training entry points.
│ ├── launch_train.py # CLI
│ └── trainer.py # training loop, checkpointing, auto-resume
├── scripts/ # Launch wrappers + YAML configs.
│ ├── run_pipeline.sh # end-to-end (adapter → enc/dec continual pretrain → SFT)
│ ├── convert_checkpoint.sh
│ ├── experiment_config/ # per-experiment YAMLs
│ ├── pretrain_config/ # pretrain-stage YAMLs
│ └── distributed_configs/ # accelerate / deepspeed / fsdp
├── agent/ # Agent app — EXPAND(i) tool over compressed segments.
├── data/ # Training datasets, collators, dynamic packing.
└── utils/ # Helpers + checkpoint-conversion shell scripts.
Text to compress should be wrapped between <|memory_start|> and
<|memory_end|> in the prompt. See
inference/examples/README.md for
runnable demos and the RULER NIAH eval driver.
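As a minimal sketch of the wrapping convention above, the hypothetical helper below (build_prompt is not part of the repository) places the text to compress between the two markers and appends an instruction after it. The position of the instruction and the newline separator are assumptions; inference/examples/README.md remains the reference for the exact prompt layout.

```python
MEMORY_START = "<|memory_start|>"
MEMORY_END = "<|memory_end|>"


def build_prompt(context: str, instruction: str) -> str:
    """Wrap the text to compress in memory markers, then append the instruction.

    Assumption: the instruction follows the compressed span on a new line.
    """
    return f"{MEMORY_START}{context}{MEMORY_END}\n{instruction}"


if __name__ == "__main__":
    print(build_prompt("A very long document ...", "Summarize the document."))
```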
from latent_context import LCLM
model = LCLM.from_pretrained("latent-context/0.6b-4b-LCLM-16x")
# see inference/hf.py for generate_text

vLLM inference runs in two stages: first the HF encoder compresses every prompt into latent
tokens written to a .pt file, then vLLM reads that file and decodes
generations from the latents.
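If the two stages need to be driven from a script rather than typed by hand, a wrapper along the following lines may help. It is only a sketch: encode_cmd, decode_cmd, and run_two_stage are hypothetical names, and the only flags used are the ones that appear in this README's encode and decode commands.

```python
import subprocess
import sys

CHECKPOINT = "latent-context/0.6b-4b-LCLM-16x"


def encode_cmd(prompts_jsonl, embeds_pt, checkpoint=CHECKPOINT):
    """Stage 1: the HF encoder writes latent tokens to a .pt file."""
    return [sys.executable, "-m", "inference.vllm_inference.encode",
            "--checkpoint", checkpoint,
            "--prompts-jsonl", prompts_jsonl, "--out", embeds_pt]


def decode_cmd(embeds_pt, out_jsonl, checkpoint=CHECKPOINT):
    """Stage 2: vLLM reads the .pt file and writes completions."""
    return [sys.executable, "-m", "inference.vllm_inference.decode",
            "--checkpoint", checkpoint,
            "--embeds-pt", embeds_pt, "--out", out_jsonl]


def run_two_stage(prompts_jsonl, embeds_pt, out_jsonl):
    # check=True stops before decoding if the encode stage fails.
    subprocess.run(encode_cmd(prompts_jsonl, embeds_pt), check=True)
    subprocess.run(decode_cmd(embeds_pt, out_jsonl), check=True)
```

Calling run_two_stage("prompts.jsonl", "embeds.pt", "completions.jsonl") from the repository root would issue the same two invocations in order.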
python -m inference.vllm_inference.encode \
--checkpoint latent-context/0.6b-4b-LCLM-16x \
--prompts-jsonl prompts.jsonl --out embeds.pt
python -m inference.vllm_inference.decode \
--checkpoint latent-context/0.6b-4b-LCLM-16x \
--embeds-pt embeds.pt --out completions.jsonl

Training is driven by a single experiment YAML that defines four stages: adapter warm-up → encoder continual pretrain → decoder continual pretrain → SFT. Each stage runs under accelerate (DeepSpeed by default), and the pipeline converts the distributed checkpoint to the HF layout between stages.
OUTPUT_DIR=./checkpoints bash scripts/run_pipeline.sh \
scripts/experiment_config/0.6b-4b-cs4-mean-w1024-causal-mlp-O0.yaml

OUTPUT_DIR is required; everything else lives in the YAML.

| Path | What's in it |
|---|---|
| scripts/experiment_config/ | Full end-to-end runs. Naming: {enc}-{dec}-cs{N}-{pooling}-w{W}-{mask}-{adapter}-O{O}.yaml, e.g. 0.6b-4b-cs16-mean-w1024-bidirectional-mlp-O0.yaml. |
| scripts/pretrain_config/ | Pretrain-only sweeps over adapter / encoder layouts. Naming: {pooling}-w{W}-{mask}-{adapter}-O{O}.yaml. |
| scripts/distributed_configs/ | Accelerate launcher configs: deepspeed_zero{1,2,3}*.yaml, fsdp_*.yaml, ddp_multi_node.yaml. |

To match the released checkpoints, the relevant axes are
pooling=mean, mask=causal, adapter=mlp, boundary_overlap=0,
encoder_window_size=1024. Pick the csN matching the compression
ratio you want (4 / 8 / 16).
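The naming scheme can be made concrete with a small parser and builder. This is a sketch, not repository code: ExperimentName and parse_experiment_name are hypothetical, and it assumes that w{W} corresponds to encoder_window_size, that O{O} corresponds to boundary_overlap, and that no individual field contains a hyphen.

```python
import re
from dataclasses import dataclass

_PATTERN = re.compile(
    r"^(?P<enc>[^-]+)-(?P<dec>[^-]+)-cs(?P<cs>\d+)-(?P<pooling>[^-]+)-"
    r"w(?P<window>\d+)-(?P<mask>[^-]+)-(?P<adapter>[^-]+)-O(?P<overlap>\d+)\.yaml$"
)


@dataclass(frozen=True)
class ExperimentName:
    enc: str      # encoder size, e.g. "0.6b"
    dec: str      # decoder size, e.g. "4b"
    cs: int       # compression ratio (the csN field)
    pooling: str
    window: int   # assumed to be encoder_window_size
    mask: str
    adapter: str
    overlap: int  # assumed to be boundary_overlap

    def filename(self) -> str:
        return (f"{self.enc}-{self.dec}-cs{self.cs}-{self.pooling}-"
                f"w{self.window}-{self.mask}-{self.adapter}-O{self.overlap}.yaml")


def parse_experiment_name(name: str) -> ExperimentName:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an experiment config name: {name!r}")
    g = match.groupdict()
    return ExperimentName(g["enc"], g["dec"], int(g["cs"]), g["pooling"],
                          int(g["window"]), g["mask"], g["adapter"],
                          int(g["overlap"]))


# Settings that match the released checkpoints, at a 4x compression ratio.
released = ExperimentName("0.6b", "4b", 4, "mean", 1024, "causal", "mlp", 0)
print(released.filename())
```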
# launch_train.py is the CLI; trainer.py owns the loop.
accelerate launch \
--config_file scripts/distributed_configs/deepspeed_zero1.yaml \
-m train.launch_train \
--config scripts/experiment_config/0.6b-4b-cs4-mean-w1024-causal-mlp-O0.yaml \
--stage 1 \
--output_dir ./checkpoints

To run the full pipeline under a different backend, swap the accelerate config:
DIST_TRAIN_CONFIG=scripts/distributed_configs/fsdp_hybrid_shard.yaml \
DISTRIBUTED_TYPE=fsdp \
OUTPUT_DIR=./checkpoints bash scripts/run_pipeline.sh \
scripts/experiment_config/0.6b-4b-cs4-mean-w1024-causal-mlp-O0.yaml

| Var | Default | What it does |
|---|---|---|
| OUTPUT_DIR | (required) | Where checkpoints get written. |
| AUTO_RESUME | true | Resume from latest matching checkpoint each SAVE_STEPS. |
| RESUME_FROM_CHECKPOINT | "" | Resume from a specific HF checkpoint. |
| DISTRIBUTED_TYPE | deepspeed | deepspeed or fsdp. |
| DIST_TRAIN_CONFIG | scripts/distributed_configs/deepspeed_zero1_multi_node.yaml | Accelerate config path. |
| DS_HOSTFILE | unset | DeepSpeed hostfile for multi-node. |

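For launching the pipeline from Python, the environment variables in the table can be assembled explicitly before calling the script. The sketch below is illustrative only: pipeline_env and run_pipeline are hypothetical helpers, and defaults are deliberately left to scripts/run_pipeline.sh rather than duplicated here.

```python
import os
import subprocess

ALLOWED_DISTRIBUTED_TYPES = ("deepspeed", "fsdp")


def pipeline_env(output_dir, distributed_type=None, dist_train_config=None,
                 base_env=None):
    """Return an environment for run_pipeline.sh with the given overrides."""
    if not output_dir:
        raise ValueError("OUTPUT_DIR is required")
    env = dict(os.environ if base_env is None else base_env)
    env["OUTPUT_DIR"] = output_dir
    if distributed_type is not None:
        if distributed_type not in ALLOWED_DISTRIBUTED_TYPES:
            raise ValueError(f"DISTRIBUTED_TYPE must be one of {ALLOWED_DISTRIBUTED_TYPES}")
        env["DISTRIBUTED_TYPE"] = distributed_type
    if dist_train_config is not None:
        env["DIST_TRAIN_CONFIG"] = dist_train_config
    return env


def run_pipeline(experiment_yaml, **overrides):
    # Unset variables fall back to the defaults inside the script.
    subprocess.run(["bash", "scripts/run_pipeline.sh", experiment_yaml],
                   env=pipeline_env(**overrides), check=True)
```

Validating DISTRIBUTED_TYPE before launch surfaces a typo immediately instead of after the job has been scheduled.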
scripts/convert_checkpoint.sh converts a raw FSDP / DeepSpeed
checkpoint to the HF-style {decoder, encoder, adapter}/ layout the
LCLM loader (and the published checkpoints) use. The pipeline calls it
between stages automatically. See utils/checkpoints/ for the inner
scripts and train/trainer.py for the checkpoint / resume logic.
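Before loading a converted checkpoint, a quick check that the three component directories exist can catch an incomplete conversion. This is a sketch under the assumption that each of decoder, encoder, and adapter is a subdirectory of the checkpoint root; missing_components is a hypothetical helper and does not inspect the files inside each directory.

```python
from pathlib import Path

EXPECTED_COMPONENTS = ("decoder", "encoder", "adapter")


def missing_components(checkpoint_dir):
    """Return the expected component subdirectories that are absent."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_COMPONENTS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_components("./checkpoints")
    if missing:
        print("incomplete checkpoint, missing:", ", ".join(missing))
    else:
        print("checkpoint layout looks complete")
```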
@misc{li2026endtoendcontextcompressionscale,
title={End-to-End Context Compression at Scale},
author={Ang Li and Sean McLeish and Haozhe Chen and Nimit Kalra and Zaiqian Chen and Artem Gazizov and Venkata Anoop Suhas Kumar Morisetty and Bhavya Kailkhura and Harshitha Menon and Zhuang Liu and Brian R. Bartoldson and Tom Goldstein and Sanae Lotfi and Micah Goldblum and Pavel Izmailov},
year={2026},
eprint={2606.09659},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.09659},
}