Website: liquid-memory.vercel.app
Liquid Memory cuts the cost of long-context LLM workloads in two completely different ways. Pick the one that matches what you're building.
Mode 1 - the proxy. For teams whose app calls a cloud LLM (OpenAI, Anthropic, Gemini, DeepSeek, Llama-as-a-service, etc.) and who want to cut the per-token bill.
- A local Mistral-class model reads and chunks your big document.
- It extracts only the facts that matter for the question, structured as JSON.
- A 5-6x smaller fact pack gets forwarded to your cloud model of choice for synthesis.
- Your existing code changes by one line (the base URL); the response shape stays identical.
Open and runnable from this repo. See Quickstart below.
Mode 2 - the attention library. For teams training or self-hosting their own LLM who want linear-time attention so they can fit 128K-token contexts on a single H100 (where standard MultiheadAttention OOMs past 32K).
- Drop-in replacement for torch.nn.MultiheadAttention. One line in a PyTorch model.
- Parallel mode for prefill + O(1)-per-token recurrent decode.
- BF16 in, BF16 out. Precision-safe mode auto-engaged above 64K tokens.
- Compatible with torch.compile. Works with Hugging Face Llama / Mistral / Qwen attention by patching self_attn.
Licensed separately (compiled kernel + Ed25519 license key). See README_INTEGRATION.md for the full integration guide, or email the founders to request access.
| If you... | Use |
|---|---|
| Build an app that calls OpenAI/Anthropic/Gemini and want a smaller monthly bill | Mode 1 |
| Train your own LLM and want to extend its context window without buying more GPUs | Mode 2 |
| Serve your own open-source LLM (Llama, DeepSeek, Mistral, Qwen) and want fewer GPUs for the same context | Mode 2 |
| Need an air-gapped / fully-offline stack with no cloud LLM calls at all | Mode 1 in fully-local config, or Mode 2 — see the Fully-local section below |
| Are an enterprise with regulated data that can't leave your VPC | Mode 1 (proxy runs in your VPC) or Mode 2 (library runs inside your training stack) |
Enterprise teams are getting hit by two problems at once.
First, the cloud token tax. Standard RAG and long-context API workflows send raw documents upstream, even when most of the text is boilerplate or filler. A 100-page compliance file, legal packet, or lab report becomes an expensive prompt before any reasoning has happened.
Second, the privacy radius. Shipping raw internal documents to an external API expands the blast radius of every request. Sensitive pricing, legal language, customer data, operating procedures, and proprietary research all move further than they need to.
Liquid Memory's proxy changes that flow. It treats the local GPU as a sieve, not as the final brain.
The first stage runs a local Mistral-class model through vLLM. Documents are chunked with overlap, processed with PagedAttention, and distilled into compact structured facts. This stage strips away repetition and noise while preserving the high-signal evidence needed for downstream reasoning.
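For intuition, overlap chunking is just a sliding window over the tokenised document. A minimal sketch follows; the chunk and overlap sizes here are illustrative, not the proxy's actual defaults:

```python
def chunk_with_overlap(token_ids, chunk_size=2048, overlap=256):
    """Sliding-window chunking sketch; liquid_proxy.py's real sizes and tokenizer differ."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last window already reaches the end of the document
    return chunks
```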
The extraction path is built to fail soft. If the local model returns malformed JSON, Liquid Memory falls back to regex extraction, bracket repair, and raw-evidence recovery blocks so information is never silently dropped.
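As a rough sketch of that fail-soft chain (the function name and recovery order here are illustrative; the real logic lives in liquid_proxy.py):

```python
import json
import re

def parse_facts_failsoft(raw: str) -> dict:
    """Try strict JSON first, then progressively looser recovery. Sketch only."""
    # 1. Strict parse.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Bracket repair: grab the outermost {...} span and retry.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # 3. Raw-evidence recovery: keep the text so nothing is silently dropped.
    return {"raw_evidence": raw}
```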
The second stage routes the compressed fact pack through LiteLLM. This gives one proxy surface for many synthesis backends without rewriting application code. The target is selected through the SYNTHESIS_MODEL env var, so teams can switch between Gemini, OpenAI, Anthropic, DeepSeek, or a local OpenAI-compatible server with a config change instead of an architectural rewrite.
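Under the hood this is plain LiteLLM routing. A minimal sketch, assuming the fact pack has already been extracted (the prompt wording is illustrative):

```python
import os
import litellm

def synthesise(fact_pack_json: str, question: str) -> str:
    """Send the compressed fact pack to whichever backend SYNTHESIS_MODEL names."""
    response = litellm.completion(
        model=os.environ["SYNTHESIS_MODEL"],  # e.g. "gemini/...", "openai/...", "anthropic/..."
        messages=[
            {"role": "user", "content": f"Evidence:\n{fact_pack_json}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```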
In practice, the pattern looks like this:
- Local GPU reads the large document.
- vLLM filters and compresses the content into structured evidence.
- LiteLLM forwards only the compressed facts to the selected synthesis model.
- The synthesis model generates the final answer from a far smaller prompt.
Liquid Memory is designed to produce up to 99% token compression before the cloud call. A document that would normally consume tens of thousands of remote prompt tokens can be reduced to a compact evidence bundle with only the useful facts preserved.
The local stage is optimized for throughput with vLLM PagedAttention, which improves GPU memory efficiency and keeps large-document extraction fast enough for live enterprise workflows. The result is a system that lowers cloud spend, improves privacy posture, and still preserves answer quality by sending the synthesis model only what it needs.
Don't take the chart on the website at face value. Reproduce it:
git clone https://github.com/Jamie2111/liquid_memory.git
cd liquid_memory
pip install -r requirements.txt
huggingface-cli login # Llama-3.1-8B is gated; accept the license once
# Terminal A - start the proxy (downloads Mistral-7B on first run)
export GEMINI_API_KEY="any-string-here" # dry_run skips the synthesis call
uvicorn liquid_proxy:app --host 0.0.0.0 --port 8000
# Terminal B - run the benchmark
python benchmark.py

benchmark.py is a deliberately tight ~50-line script you can read end-to-end in one minute before running it. It loads Llama-3.1-8B-Instruct, generates a 32K-token synthetic document, runs one forward pass on the full document with torch.cuda.max_memory_allocated() bookkeeping, POSTs the same document to the local proxy with dry_run=True, runs the same forward pass on the compressed fact pack, and prints the VRAM delta. No hidden constants, no telemetry, no dependencies you don't already need.
Output looks like:
[load] loading meta-llama/Llama-3.1-8B-Instruct
[doc] building a 32,768-token synthetic document
[run-1] baseline forward pass on the full 32,768-token document
peak VRAM: 64.43 GB
[proxy] POST http://localhost:8000/v1/hybrid_chat (dry_run=True, no cloud-synthesis tokens billed)
[run-2] forward pass on the Liquid Memory pack (5,742 tokens)
peak VRAM: 11.06 GB
>>> 53.37 GB saved (82.8%) <<<
A larger configurable harness lives at benchmark/benchmark_vram.py. It accepts --model, --tokens, --baseline-only, --json, and runs on a small non-gated model (Qwen/Qwen2-0.5B-Instruct) by default so it works on consumer GPUs. Use it for tuning, CI integration, or regression tracking; use the root benchmark.py for the headline proof to send to a prospect.
The default configuration uses Gemini for synthesis. Any LiteLLM-supported provider works the same way - swap SYNTHESIS_MODEL and the matching API key.
export GEMINI_API_KEY="your-gemini-api-key"
export SYNTHESIS_MODEL="gemini/gemini-3.1-flash-lite"
uvicorn liquid_proxy:app --host 0.0.0.0 --port 8000

The proxy now exposes two POST endpoints. Pick the one that matches the integration shape you want.
Point any OpenAI client (openai-python, openai-node, LangChain's ChatOpenAI, LlamaIndex, etc.) at http://localhost:8000/v1 and call chat.completions.create(...) as you would against api.openai.com. The proxy decides which messages entry is the "large document" (longest user message) and which is the "question" (last user message), runs the local extraction + compression pipeline, forwards a smaller fact pack to the synthesis backend selected by SYNTHESIS_MODEL, and returns a standard OpenAI chat.completion response.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
resp = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": large_document_text},
{"role": "user", "content": "Summarise the key risks."},
],
)
print(resp.choices[0].message.content)

That's the literal "change one line of code" integration. No restructuring of your messages array, no new SDK to learn. Existing tooling that speaks the OpenAI chat API (eval frameworks, prompt managers, observability sidecars) keeps working.
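The document/question split described above is easy to reason about. Roughly (a sketch of the stated heuristic, not the proxy's literal code):

```python
def split_document_and_question(messages: list[dict]) -> tuple[str, str]:
    """Longest user message is treated as the document; last user message as the question."""
    user_contents = [m["content"] for m in messages if m["role"] == "user"]
    document = max(user_contents, key=len)
    question = user_contents[-1]
    return document, question
```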
v1 limitations. The following OpenAI features return a clean 501 with an OpenAI-shaped error body so client SDKs raise typed exceptions on them rather than silently misbehaving:
- stream=True
- tools / functions
- n > 1
Use /v1/hybrid_chat below for those cases until v2 ships streaming + tool-calling support.
Smoke test. A 4-test smoke script ships at the repo root:
pip install openai
python test_openai_compat.py

Validates: basic request/response shape, single-message handling, the 501 path for streaming, and the 501 path for tools. Uses ?dry_run=1 internally so the test does not burn any cloud-synthesis tokens.
For workloads that want to control extraction-task wording, document/question split, and dry-run behavior explicitly:
curl -X POST http://127.0.0.1:8000/v1/hybrid_chat \
-H "Content-Type: application/json" \
--data '{
"large_document": "...",
"extraction_task": "Extract every quantitative fact and named entity.",
"final_user_prompt": "Summarise the key risks in this document."
}'

The OpenAI-compatible endpoint delegates to this same pipeline internally - the only difference is the request/response shape.
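To exercise the endpoint from Python without spending synthesis tokens, the dry_run flag (the same one benchmark.py and the smoke test use) can be supplied as well. A minimal sketch; big_report.txt is a placeholder, and the sketch assumes the query-parameter form of the flag, as in ?dry_run=1 above:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/hybrid_chat",
    params={"dry_run": 1},  # skips the cloud synthesis call; extraction still runs locally
    json={
        "large_document": open("big_report.txt").read(),
        "extraction_task": "Extract every quantitative fact and named entity.",
        "final_user_prompt": "Summarise the key risks in this document.",
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```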
For air-gapped, regulated, or cost-sensitive deployments you can run everything locally - no cloud LLM calls at any stage. The proxy uses LiteLLM's OpenAI-compatible adapter, so any local server that speaks the OpenAI chat-completions schema works as a synthesis backend. The most common patterns:
# Pattern A - Terminal 1: serve DeepSeek-V3 as an OpenAI-compatible endpoint on port 9000
vllm serve deepseek-ai/DeepSeek-V3 \
--host 0.0.0.0 --port 9000 \
--max-model-len 32768
# Terminal 2: start the Liquid Memory proxy, pointing synthesis at the local DeepSeek
export OPENAI_API_KEY="not-used-but-litellm-requires-the-var"
export OPENAI_API_BASE="http://localhost:9000/v1"
export SYNTHESIS_MODEL="openai/deepseek-ai/DeepSeek-V3"
uvicorn liquid_proxy:app --host 0.0.0.0 --port 8000

Now the proxy reads the document with Mistral (locally), extracts a compressed fact pack, and forwards the pack to DeepSeek (locally). Nothing leaves the host.
# Pattern B - serve Llama-3.1-70B with text-generation-inference
docker run --gpus all -p 9000:80 ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.1-70B-Instruct
# Point Liquid Memory at it
export OPENAI_API_BASE="http://localhost:9000/v1"
export SYNTHESIS_MODEL="openai/meta-llama/Llama-3.1-70B-Instruct"vllm serve Qwen/Qwen2.5-7B-Instruct --port 9000 --max-model-len 32768
export OPENAI_API_BASE="http://localhost:9000/v1"
export SYNTHESIS_MODEL="openai/Qwen/Qwen2.5-7B-Instruct"- Regulated industries (defense, healthcare, finance, legal) where prompts contain controlled data that cannot leave the perimeter.
- High-volume internal workloads where the per-token cloud rate dominates total cost; running on amortised hardware can be cheaper at scale.
- Sovereign-cloud deployments in jurisdictions where US/EU LLM APIs are restricted.
- Reproducibility-bound workflows (research, audit, regulatory submissions) where model drift on a hosted endpoint is unacceptable.
The trade-off is GPU footprint: you need enough VRAM to host both the extraction model (Mistral-7B, ~14 GB) and the synthesis model (DeepSeek-V3 needs a lot more; Llama-3.1-70B around 140 GB unquantised; Qwen2.5-7B around 14 GB). Pattern C fits on a single 80 GB H100; Pattern A typically needs a multi-GPU node.
liquid_memory/
├── benchmark.py # 50-line zero-trust VRAM proof (Llama-3.1-8B)
├── liquid_proxy.py # FastAPI proxy (Mode 1)
├── benchmark/
│ └── benchmark_vram.py # Configurable harness (--model, --tokens, etc.)
├── dist_public/ # AOT artifacts (Mode 2, license-gated)
│ ├── LiquidMemory_AOTI.pt2
│ ├── liquid_memory_auth.so
│ ├── AOTI_METADATA.json
│ └── MANIFEST.md
├── README.md # this file
├── README_INTEGRATION.md # Mode 2 - attention-library integration guide
├── requirements.txt
└── .gitignore
A drop-in replacement for torch.nn.MultiheadAttention with linear-time scaling and stable execution to 128K tokens.
from liquid_memory import LiquidMemory
# In your Transformer block, swap:
# self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
# for:
self.attn = LiquidMemory(d_model, n_heads, batch_first=True)
# Call exactly as you would nn.MultiheadAttention:
out, _ = self.attn(x, x, x, is_causal=True)

That's the whole change. Forward shapes, return types, and masking semantics (is_causal, key_padding_mask) are preserved.
For autoregressive decode there's an additional step(token) method that runs in O(1) per token regardless of context length - the recurrent hidden state lives on the module and persists across calls.
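A hedged sketch of what that looks like in a decode loop; the constructor call mirrors the swap above, but the exact tensor shape step() expects and its state-reset semantics are specified in README_INTEGRATION.md, so the shapes below are assumptions:

```python
import torch
from liquid_memory import LiquidMemory  # license-gated; requires LM_PRIVATE_KEY at process start

attn = LiquidMemory(4096, 32, batch_first=True).to("cuda", torch.bfloat16)

# Prefill: one parallel pass over the prompt, same call shape as nn.MultiheadAttention.
prompt = torch.randn(1, 1024, 4096, device="cuda", dtype=torch.bfloat16)
out, _ = attn(prompt, prompt, prompt, is_causal=True)

# Decode: O(1) per token via step(); the recurrent state persists on the module.
# Assumed input shape (batch, 1, embed_dim) - check README_INTEGRATION.md for the real contract.
token = torch.randn(1, 1, 4096, device="cuda", dtype=torch.bfloat16)
for _ in range(16):
    step_out = attn.step(token)
    # ...project step_out to logits, sample the next token, embed it...
    token = torch.randn(1, 1, 4096, device="cuda", dtype=torch.bfloat16)
```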
- Linear time and memory. 128K-token contexts fit on a single H100 in BF16; vanilla MultiheadAttention OOMs at ~32K.
- Per-token decode. An O(1) step() call instead of O(N) work per generation step. Long-running chat sessions stay cheap.
- torch.compile compatible. The kernel registers as a custom op and gets treated as a black box; surrounding PyTorch fuses normally.
- Hugging Face patch path. A two-screen recipe patches LlamaAttention (or any architecture-specific attention class) cleanly. See README_INTEGRATION.md § 3.2.
- Long-context training. Teams building a foundation model who want to extend context to 128K without quadratic VRAM blow-up. Train at full sequence length on the GPUs you already have instead of buying more.
- Self-hosted inference. Teams serving Llama-3.1-70B, Qwen2.5, Mistral, DeepSeek, etc. on their own hardware. Patch self_attn once, serve 5x longer contexts per GPU.
- Research / ablations. Use Liquid Memory as a non-quadratic baseline in long-context experiments.
- On-prem enterprise LLMs. Regulated industries running their own model behind the firewall who want longer contexts without scaling out the cluster.
The module ships with a one-shot comparison harness:
from liquid_memory import LiquidMemory
import torch
LiquidMemory.benchmark(
embed_dim=4096,
num_heads=32,
batch_size=1,
seq_lengths=(2048, 8192, 32_768, 131_072),
dtype=torch.bfloat16,
device="cuda",
)

Sample output (H100, BF16):
seq_len │ LM mem │ TX mem │ LM ms │ TX ms │ speedup
─────────────────────────────────────────────────────────────────────
2048 │ 0.21GB │ 0.34GB │ 1.2 │ 2.7 │ 2.3x
8192 │ 0.84GB │ 5.12GB │ 4.7 │ 41.3 │ 8.8x
32768 │ 3.36GB │ OOM │ 18.6 │ OOM │ n/a
131072 │ 13.42GB │ OOM │ 74.2 │ OOM │ n/a
The compiled kernel and the liquid_memory.py wrapper module are distributed under a commercial license. The compiled artifacts in dist_public/ are publicly readable so you can verify SHA-256 against MANIFEST.md before installation, but executing the kernel requires a provisioned Ed25519 license key set via LM_PRIVATE_KEY at process start.
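A minimal way to run that check before installation; this just prints digests for manual comparison, since MANIFEST.md's exact listing format isn't reproduced here:

```python
import hashlib
from pathlib import Path

# Print SHA-256 digests of the shipped artifacts for manual comparison
# against the values listed in dist_public/MANIFEST.md.
for artifact in ("LiquidMemory_AOTI.pt2", "liquid_memory_auth.so", "AOTI_METADATA.json"):
    path = Path("dist_public") / artifact
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    print(f"{digest}  {artifact}")
```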
To request a license / pilot, email the founders (see Contact below).
For security issues: same emails; we acknowledge within one business day.
Liquid Memory is not another chatbot wrapper. It is cost-control and privacy infrastructure for AI teams that want to push the long-context frontier without paying for noise.
That makes it an unusually broad wedge - it sells into TWO completely different buyers:
- App developers burning $20k+/mo on OpenAI/Anthropic long-context calls. They install the proxy (Mode 1) and the bill drops in one billing cycle.
- Foundation-model teams and self-hosters paying for the GPUs that quadratic attention forces them to over-provision. They patch one line of PyTorch (Mode 2) and serve 4x longer contexts on the same cluster.
Local GPUs do the filtering. Synthesis models - cloud or local - do the reasoning. The business stops paying for noise.
© 2026 Liquid Memory, Inc. Distribution and reverse engineering of the compiled kernel are governed by your license agreement.