LCLM — Latent Context Language Models

Code for End-to-End Context Compression at Scale. An LCLM is an encoder–decoder soft-token compressor: an encoder maps a long input to a short sequence of latent tokens, and a decoder consumes those latents in place of the original tokens.
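
To make the shape arithmetic concrete, here is a toy sketch of compressing a sequence of token vectors into fewer latent vectors. It is not the repo's encoder, which is a learned model. The function `mean_pool_chunks` and its `chunk_size` parameter are hypothetical names used only for illustration, and plain mean pooling over fixed chunks is an assumed simplification.

```python
from typing import List

Vector = List[float]


def mean_pool_chunks(token_vectors: List[Vector], chunk_size: int) -> List[Vector]:
    """Average each run of `chunk_size` consecutive vectors into one latent vector.

    Toy stand-in for an encoder (assumption, not the LCLM implementation).
    The final chunk may be shorter than `chunk_size`.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    latents: List[Vector] = []
    for start in range(0, len(token_vectors), chunk_size):
        chunk = token_vectors[start:start + chunk_size]
        dim = len(chunk[0])
        latents.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return latents


# Eight 2-dimensional "token embeddings" compressed 4x into two latents.
tokens = [[float(i), float(2 * i)] for i in range(8)]
latents = mean_pool_chunks(tokens, chunk_size=4)
print(len(tokens), "->", len(latents))  # 8 -> 2
print(latents)  # [[1.5, 3.0], [5.5, 11.0]]
```

The decoder would then see two vectors where it previously saw eight, which is the sense in which latents stand in for the original tokens.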

📄 arXiv · 🤗 Checkpoints · Eval datasets

Install

git clone https://github.com/LeonLixyz/LCLM && cd LCLM
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
# If flash-attn fails to build:
uv run pip install flash-attn --no-build-isolation

If you need nvcc: conda install -c nvidia cuda-nvcc.

Repo layout

LCLM/
├── latent_context/        # Model package: LCLM, LatentEncoder, Adapter,
│                          # LCLMProcessor, from_pretrained.
├── inference/             # Inference entry points.
│   ├── hf.py              #   reference HF path (single process, single GPU)
│   ├── vllm_inference/    #   two-stage vLLM CLI
│   │   ├── encode.py      #     HF encoder → embeds.pt
│   │   └── decode.py      #     vLLM decoder reads embeds.pt
│   └── examples/          #   runnable demos + eval drivers (see README)
├── train/                 # Training entry points.
│   ├── launch_train.py    #   CLI
│   └── trainer.py         #   training loop, checkpointing, auto-resume
├── scripts/               # Launch wrappers + YAML configs.
│   ├── run_pipeline.sh    #   end-to-end (adapter → enc/dec continual pretrain → SFT)
│   ├── convert_checkpoint.sh
│   ├── experiment_config/ #   per-experiment YAMLs
│   ├── pretrain_config/   #   pretrain-stage YAMLs
│   └── distributed_configs/  # accelerate / deepspeed / fsdp
├── agent/                 # Agent app — EXPAND(i) tool over compressed segments.
├── data/                  # Training datasets, collators, dynamic packing.
└── utils/                 # Helpers + checkpoint-conversion shell scripts.

Inference

Wrap the text to compress between <|memory_start|> and <|memory_end|> in the prompt. See inference/examples/README.md for runnable demos and the RULER NIAH eval driver.
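
A minimal sketch of building such a prompt follows. The two marker strings come from the text above. The helper name `build_prompt`, the newline separator, and placing the question after the memory span are assumptions, so check the demos in inference/examples for the exact layout the checkpoints expect.

```python
MEMORY_START = "<|memory_start|>"
MEMORY_END = "<|memory_end|>"


def build_prompt(context: str, question: str) -> str:
    """Wrap `context` in the memory markers and append the question after it.

    Hypothetical helper: only the marker strings are taken from the README.
    """
    return f"{MEMORY_START}{context}{MEMORY_END}\n{question}"


prompt = build_prompt("A long document to compress.", "What is the document about?")
print(prompt)
```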

HF inference

from latent_context import LCLM
model = LCLM.from_pretrained("latent-context/0.6b-4b-LCLM-16x")
# see inference/hf.py for generate_text

vLLM inference

Two stages: first the HF encoder compresses every prompt into latent tokens written to a .pt file, then vLLM reads that file and decodes generations from the latents.

python -m inference.vllm_inference.encode \
    --checkpoint latent-context/0.6b-4b-LCLM-16x \
    --prompts-jsonl prompts.jsonl --out embeds.pt
python -m inference.vllm_inference.decode \
    --checkpoint latent-context/0.6b-4b-LCLM-16x \
    --embeds-pt embeds.pt --out completions.jsonl
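
If you prefer to drive the two stages from Python, the sketch below assembles both command lines so the decode stage always reads the file the encode stage wrote. It uses only the module paths and flags shown above. The function name `two_stage_commands` is hypothetical, and the commands are assumed to run from the repo root.

```python
import shlex
import sys
from typing import List, Tuple


def two_stage_commands(checkpoint: str, prompts_jsonl: str,
                       embeds_pt: str, out_jsonl: str) -> Tuple[List[str], List[str]]:
    """Return (encode_cmd, decode_cmd) sharing one checkpoint and one embeds file."""
    encode = [sys.executable, "-m", "inference.vllm_inference.encode",
              "--checkpoint", checkpoint,
              "--prompts-jsonl", prompts_jsonl, "--out", embeds_pt]
    decode = [sys.executable, "-m", "inference.vllm_inference.decode",
              "--checkpoint", checkpoint,
              "--embeds-pt", embeds_pt, "--out", out_jsonl]
    return encode, decode


encode_cmd, decode_cmd = two_stage_commands(
    "latent-context/0.6b-4b-LCLM-16x", "prompts.jsonl", "embeds.pt", "completions.jsonl")
print(shlex.join(encode_cmd))
print(shlex.join(decode_cmd))

# To execute them (encode must finish before decode starts):
# import subprocess
# subprocess.run(encode_cmd, check=True)
# subprocess.run(decode_cmd, check=True)
```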

Training

Training is driven by a single experiment YAML that defines four stages: adapter warm-up → encoder continual pretrain → decoder continual pretrain → SFT. Each stage runs under accelerate (DeepSpeed by default), and the pipeline converts the distributed checkpoint to the HF layout between stages.

One-line full pipeline

OUTPUT_DIR=./checkpoints bash scripts/run_pipeline.sh \
    scripts/experiment_config/0.6b-4b-cs4-mean-w1024-causal-mlp-O0.yaml

OUTPUT_DIR is required; everything else lives in the YAML.

Configs

| Path | What's in it |
| --- | --- |
| `scripts/experiment_config/` | Full end-to-end runs. Naming: `{enc}-{dec}-cs{N}-{pooling}-w{W}-{mask}-{adapter}-O{O}.yaml`, e.g. `0.6b-4b-cs16-mean-w1024-bidirectional-mlp-O0.yaml`. |
| `scripts/pretrain_config/` | Pretrain-only sweeps over adapter / encoder layouts. Naming: `{pooling}-w{W}-{mask}-{adapter}-O{O}.yaml`. |
| `scripts/distributed_configs/` | Accelerate launcher configs: `deepspeed_zero{1,2,3}.yaml`, `fsdp_.yaml`, `ddp_multi_node.yaml`. |

To match the released checkpoints, the relevant axes are pooling=mean, mask=causal, adapter=mlp, boundary_overlap=0, encoder_window_size=1024. Pick the csN matching the compression ratio you want (4 / 8 / 16).
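
The experiment-config naming scheme is regular enough to parse. The sketch below splits a filename into its axes and checks them against the released-checkpoint settings listed above. The helper names are hypothetical, and mapping `w{W}` to encoder_window_size and `O{O}` to boundary_overlap is an assumption inferred from the values in this section.

```python
import re
from typing import Dict, Union

# {enc}-{dec}-cs{N}-{pooling}-w{W}-{mask}-{adapter}-O{O}.yaml
CONFIG_NAME = re.compile(
    r"^(?P<enc>[^-]+)-(?P<dec>[^-]+)-cs(?P<cs>\d+)-(?P<pooling>[^-]+)"
    r"-w(?P<window>\d+)-(?P<mask>[^-]+)-(?P<adapter>[^-]+)-O(?P<overlap>\d+)\.yaml$"
)


def parse_config_name(filename: str) -> Dict[str, Union[str, int]]:
    """Split an experiment config filename into its named axes."""
    match = CONFIG_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized config name: {filename!r}")
    fields: Dict[str, Union[str, int]] = dict(match.groupdict())
    for key in ("cs", "window", "overlap"):
        fields[key] = int(fields[key])
    return fields


def matches_released_axes(fields: Dict[str, Union[str, int]]) -> bool:
    """True if the axes equal the released-checkpoint settings (csN is free)."""
    return (fields["pooling"] == "mean" and fields["mask"] == "causal"
            and fields["adapter"] == "mlp" and fields["overlap"] == 0
            and fields["window"] == 1024)


fields = parse_config_name("0.6b-4b-cs16-mean-w1024-bidirectional-mlp-O0.yaml")
print(fields["cs"], fields["mask"])   # 16 bidirectional
print(matches_released_axes(fields))  # False, because the mask is not causal
```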

Single stage

# launch_train.py is the CLI; trainer.py owns the loop.
accelerate launch \
    --config_file scripts/distributed_configs/deepspeed_zero1.yaml \
    -m train.launch_train \
    --config scripts/experiment_config/0.6b-4b-cs4-mean-w1024-causal-mlp-O0.yaml \
    --stage 1 \
    --output_dir ./checkpoints

FSDP

Swap the accelerate config:

DIST_TRAIN_CONFIG=scripts/distributed_configs/fsdp_hybrid_shard.yaml \
DISTRIBUTED_TYPE=fsdp \
OUTPUT_DIR=./checkpoints bash scripts/run_pipeline.sh \
    scripts/experiment_config/0.6b-4b-cs4-mean-w1024-causal-mlp-O0.yaml

Env vars

| Var | Default | What it does |
| --- | --- | --- |
| `OUTPUT_DIR` | (required) | Where checkpoints get written. |
| `AUTO_RESUME` | `true` | Resume from latest matching checkpoint each `SAVE_STEPS`. |
| `RESUME_FROM_CHECKPOINT` | `""` | Resume from a specific HF checkpoint. |
| `DISTRIBUTED_TYPE` | `deepspeed` | `deepspeed` or `fsdp`. |
| `DIST_TRAIN_CONFIG` | `scripts/distributed_configs/deepspeed_zero1_multi_node.yaml` | Accelerate config path. |
| `DS_HOSTFILE` | unset | DeepSpeed hostfile for multi-node. |
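
When launching the pipeline from a Python driver, it can help to validate these variables before anything starts. The sketch below mirrors the defaults from the table and fails early if `OUTPUT_DIR` is missing or `DISTRIBUTED_TYPE` is not one of the two documented values. `pipeline_env` is a hypothetical helper, and run_pipeline.sh presumably applies its own defaults, so treat the explicit defaults here as a convenience rather than a requirement.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the table above; DS_HOSTFILE is left unset.
DEFAULTS = {
    "AUTO_RESUME": "true",
    "RESUME_FROM_CHECKPOINT": "",
    "DISTRIBUTED_TYPE": "deepspeed",
    "DIST_TRAIN_CONFIG": "scripts/distributed_configs/deepspeed_zero1_multi_node.yaml",
}


def pipeline_env(overrides: Mapping[str, str],
                 base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Merge base environment, table defaults, and overrides, then validate."""
    env = dict(os.environ if base is None else base)
    for key, value in DEFAULTS.items():
        env.setdefault(key, value)
    env.update(overrides)
    if not env.get("OUTPUT_DIR"):
        raise ValueError("OUTPUT_DIR is required")
    if env["DISTRIBUTED_TYPE"] not in ("deepspeed", "fsdp"):
        raise ValueError("DISTRIBUTED_TYPE must be 'deepspeed' or 'fsdp'")
    return env


env = pipeline_env({
    "OUTPUT_DIR": "./checkpoints",
    "DISTRIBUTED_TYPE": "fsdp",
    "DIST_TRAIN_CONFIG": "scripts/distributed_configs/fsdp_hybrid_shard.yaml",
})
print(env["DISTRIBUTED_TYPE"], env["DIST_TRAIN_CONFIG"])

# To launch from the repo root:
# import subprocess
# subprocess.run(["bash", "scripts/run_pipeline.sh",
#                 "scripts/experiment_config/0.6b-4b-cs4-mean-w1024-causal-mlp-O0.yaml"],
#                env=env, check=True)
```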

Checkpoint conversion

scripts/convert_checkpoint.sh converts a raw FSDP / DeepSpeed checkpoint to the HF-style {decoder, encoder, adapter}/ layout the LCLM loader (and the published checkpoints) use. The pipeline calls it between stages automatically. See utils/checkpoints/ for the inner scripts and train/trainer.py for the checkpoint / resume logic.
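
As a quick sanity check after conversion, the sketch below reports which of the three expected subdirectories are missing from a checkpoint directory. It only tests for directory presence. What each subdirectory must contain is not described here, and `missing_components` is a hypothetical helper, not part of the repo.

```python
import tempfile
from pathlib import Path
from typing import List, Union

EXPECTED_SUBDIRS = ("decoder", "encoder", "adapter")


def missing_components(checkpoint_dir: Union[str, Path]) -> List[str]:
    """Return the expected subdirectory names that are absent from `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Demo on a throwaway directory that lacks the adapter.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "decoder").mkdir()
    (Path(tmp) / "encoder").mkdir()
    print(missing_components(tmp))  # ['adapter']
```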

Citation

@misc{li2026endtoendcontextcompressionscale,
      title={End-to-End Context Compression at Scale},
      author={Ang Li and Sean McLeish and Haozhe Chen and Nimit Kalra and Zaiqian Chen and Artem Gazizov and Venkata Anoop Suhas Kumar Morisetty and Bhavya Kailkhura and Harshitha Menon and Zhuang Liu and Brian R. Bartoldson and Tom Goldstein and Sanae Lotfi and Micah Goldblum and Pavel Izmailov},
      year={2026},
      eprint={2606.09659},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.09659},
}
