Stream large LLMs from CPU RAM to VRAM on demand — run models that don't fit, without buying more GPU memory.
TorusSparse wraps any HuggingFace CausalLM so that:
- FFN weights live in pinned system RAM — attention layers, embeddings, and norms stay in VRAM permanently.
- Async PCIe streaming — a background thread prefetches the next layer's weights while the GPU computes the current one, hiding transfer latency.
- JEPA predictor — a tiny MLP predicts which weights will be needed next and schedules prefetches ahead of time.
- Zero model changes — a transparent wrapper; your HuggingFace model code is unchanged.
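The split described in the first bullet can be pictured as a partition over parameter names. The sketch below is illustrative only: it assumes Llama-style parameter names in which FFN weights contain `.mlp.`, and `partition_params` is a hypothetical helper, not part of the TorusSparse API. The byte sizes are example values.

```python
def partition_params(named_sizes, ffn_marker=".mlp."):
    """Split parameters into VRAM-resident and RAM-streamed groups.

    named_sizes: dict mapping parameter name -> size in bytes.
    ffn_marker:  substring assumed to identify FFN weights (Llama-style naming).
    """
    resident, streamed = {}, {}
    for name, nbytes in named_sizes.items():
        target = streamed if ffn_marker in name else resident
        target[name] = nbytes
    return resident, streamed


# Example sizes in bytes; with a real model they would come from
# param.numel() * param.element_size().
params = {
    "model.embed_tokens.weight": 131_072_000,
    "model.layers.0.self_attn.q_proj.weight": 8_388_608,
    "model.layers.0.mlp.gate_proj.weight": 23_068_672,
    "model.layers.0.mlp.up_proj.weight": 23_068_672,
    "model.layers.0.mlp.down_proj.weight": 23_068_672,
    "model.layers.0.input_layernorm.weight": 4_096,
}
resident, streamed = partition_params(params)
print(f"resident in VRAM:  {sum(resident.values()) / 1e6:.1f} MB")
print(f"streamed from RAM: {sum(streamed.values()) / 1e6:.1f} MB")
```

Other architectures name their FFN modules differently, so the marker would need adjusting per model family.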
Hardware: RTX 2070 Super (8.6 GB VRAM), PCIe 3.0 x16, 34 GB DDR4
| Model | Type | Baseline VRAM | TorusSparse VRAM | VRAM saved | Speed |
|---|---|---|---|---|---|
| TinyLlama-1.1B | Dense | 3037 MB | 915 MB | −2122 MB | 7.4 tok/s |
| Phi-2 2.7B | Dense | 8027 MB | 2573 MB | −5454 MB | 2.7 tok/s |
| Mistral-7B | Dense | OOM | ~6 GB | runs at all | ~1 tok/s |
| OLMoE-1B-7B (64 experts) | MoE | OOM | 3390 MB | runs at all | 0.9 tok/s |
Text output is identical character-for-character to the baseline in all runs.
OLMoE-1B-7B is a Mixture-of-Experts model with 64 experts per layer (7B total parameters, 1B active per token). It physically cannot run on an 8.6 GB GPU without offloading. TorusSparse runs it with a 92.9% prefetch hit rate.
The speed overhead is not a code problem — it is physics. Every FFN layer's weights must cross the PCIe bus once per token:
TinyLlama: 22 layers × 66 MB = 1.45 GB/token ÷ 16 GB/s PCIe 3.0 ≈ 91 ms/token → ~11 tok/s ceiling
Phi-2: 32 layers × 100 MB = 3.2 GB/token ÷ 16 GB/s PCIe 3.0 = 200 ms/token → 5 tok/s ceiling
On PCIe 4.0 / 5.0 (2–4× bandwidth) or Apple Silicon (unified memory, ~400 GB/s, no PCIe bottleneck), the speed overhead shrinks dramatically. On M2/M3 Max, the memory bus is fast enough that the overhead nearly disappears.
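The ceiling arithmetic above is easy to reproduce for other models or buses. The helper below is a back-of-the-envelope sketch: `pcie_ceiling` is a hypothetical name, it assumes every FFN layer crosses the bus exactly once per token, and it ignores compute time and transfer overlap. Applied to the per-layer sizes quoted above, it gives ceilings close to 11 and 5 tok/s on a 16 GB/s link.

```python
def pcie_ceiling(num_layers, ffn_mb_per_layer, bandwidth_gb_s=16.0):
    """Return (ms_per_token, tokens_per_second) for a transfer-bound decode loop."""
    mb_per_token = num_layers * ffn_mb_per_layer
    seconds = mb_per_token / (bandwidth_gb_s * 1000.0)  # 1 GB = 1000 MB
    return seconds * 1000.0, 1.0 / seconds


# Nominal x16 bandwidths; measured throughput is usually somewhat lower.
for name, layers, mb in [("TinyLlama", 22, 66), ("Phi-2", 32, 100)]:
    for label, bw in [("PCIe 3.0 x16", 16.0), ("PCIe 4.0 x16", 32.0)]:
        ms, tps = pcie_ceiling(layers, mb, bw)
        print(f"{name:10s} {label}: {ms:6.1f} ms/token, ceiling {tps:4.1f} tok/s")
```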
The right use cases:
- Running a model that would OOM at baseline (Mistral-7B on 8 GB VRAM)
- Freeing VRAM for other workloads while still running inference
- Apple Silicon with unified memory — large models at near-baseline speed
git clone https://github.com/cogitanai/torussparse
cd torussparse
pip install -e .

Requirements: Python ≥ 3.10, PyTorch ≥ 2.2, Transformers ≥ 4.40, Accelerate ≥ 0.29
CUDA is strongly recommended. Without it the system falls back to CPU-only mode — no VRAM savings, but all code paths still run correctly (useful for development and testing).
Optional extras:
pip install rich # required for chat.py terminal UI
pip install fastapi uvicorn # required for serve.py API server

python scripts/chat.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
python scripts/chat.py --model microsoft/phi-2
python scripts/chat.py --model mistralai/Mistral-7B-Instruct-v0.2 --skip-baseline

The chat CLI shows a loading spinner, streams tokens in real time, and keeps conversation history. Commands: /reset to clear history, /quit or Ctrl+C to exit.
pip install fastapi uvicorn
python scripts/serve.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8000

Then point any OpenAI-compatible client at http://127.0.0.1:8000. It speaks the same streaming /v1/chat/completions format that OpenClaw, Open WebUI, and similar tools expect.
# Verify the server is up:
curl http://127.0.0.1:8000/health
# Send a chat message:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

- Start the TorusSparse server (see above).
- Open OpenClaw → Settings → LLM Providers → Add Provider.
- Set API Type to `OpenAI Compatible`.
- Set Base URL to `http://localhost:8000/v1`.
- Set API Key to any non-empty string (e.g. `torussparse`; the server ignores it).
- Set Model to the model name you started the server with (e.g. `TinyLlama/TinyLlama-1.1B-Chat-v1.0`).
- Save. OpenClaw will now route all LLM calls through TorusSparse.
- Start the TorusSparse server.
- Open WebUI → Settings → Connections → OpenAI API.
- Set URL to `http://localhost:8000/v1` and API key to any string.
- The model will appear in the model selector.
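For scripted access without a GUI client, the streaming endpoint can also be read with the Python standard library. This is a minimal sketch, assuming the server emits standard OpenAI-style server-sent events (`data: {json}` lines terminated by `data: [DONE]`); `iter_deltas` and `stream_chat` are hypothetical helper names, not part of TorusSparse.

```python
import json
import urllib.request


def iter_deltas(lines):
    """Yield text fragments from OpenAI-style server-sent-event lines (bytes)."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or [{}]
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def stream_chat(prompt, model, base_url="http://127.0.0.1:8000/v1"):
    """POST a streaming chat request and yield text fragments as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_deltas(resp)  # the response object iterates line by line


# Usage, with the server from the previous section running:
#   for tok in stream_chat("Hello", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
#       print(tok, end="", flush=True)
```

Keeping the parser separate from the network call makes it easy to check against recorded responses without a running server.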
from torus_sparse import TorusSparseModel
model = TorusSparseModel.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Single response
response = model.generate("Explain the sky in two sentences.", max_new_tokens=128)
print(response)
# Streaming
for token in model.stream_generate("Tell me a story.", max_new_tokens=256):
    print(token, end="", flush=True)

python scripts/chat.py --model mistralai/Mistral-7B-Instruct-v0.2 --weight-quant int8

INT8 uses per-channel symmetric quantisation, reducing FFN RAM usage by ~50% with minor quality degradation. Recommended when a model is too large to fit in system RAM at fp16.
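Per-channel symmetric quantisation stores one float scale per output channel plus one signed byte per weight, which is where the roughly 50% saving over fp16 comes from. The sketch below shows the common textbook formulation on plain Python lists to stay dependency-free. It is an assumption about the general technique, not a copy of the TorusSparse implementation, which may differ in rounding or clamping details.

```python
def quantize_per_channel(weight):
    """Symmetric INT8 quantisation with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # guard all-zero rows
        # Round to the nearest integer and clamp to the symmetric range [-127, 127].
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights: w is approximately q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.40, -1.00, 0.30], [0.03, 0.01, -0.04]]
q, scales = quantize_per_channel(weight)
restored = dequantize(q, scales)
max_err = max(abs(a - b) for ra, rb in zip(weight, restored) for a, b in zip(ra, rb))
print(q)  # [[51, -127, 38], [95, 32, -127]]
print(f"max abs error: {max_err:.5f}")
```

Because each row has its own scale, the small-magnitude second row keeps the same relative precision as the first, which a single per-tensor scale would not give.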
python scripts/monitor_benchmark.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tokens 100
python scripts/monitor_benchmark.py --model microsoft/phi-2 --tokens 100
python scripts/monitor_benchmark.py --model mistralai/Mistral-7B-Instruct-v0.2 --tokens 50 --skip-baseline

No GPU and no model downloads required for the test suite:
# Full test suite (82 tests, ~90 seconds, CPU only)
python -m pytest tests/ -q
# Quick smoke test (synthetic model, no downloads, ~10 seconds)
python scripts/smoke_run.py

GPU executes layer i
│
▼ (post-forward hook)
restore_layer(i) — swap out i's VRAM weights, flip double-buffer
│
▼
on_layer_complete(i)
Dense fast path: submit TransferJob(i + lookahead_horizon)
_apply_prefetch(i + 1) — swap in if ready
MoE slow path: JEPA predicts which K experts will fire → submit those only
Background thread (MemoryStream):
consume_guard() — wait for GPU guard event (no CPU stall)
slot.reset() — clear write pointer
slot.write(...) — async DMA pinned RAM → VRAM slot (secondary CUDA stream)
post TransferResult → result queue
Layer i+1:
if prefetch ready: param.data = VRAM slot view (zero-copy pointer swap)
else: sync-copy fallback (pre-hook)
GPU executes layer i+1
The double-buffer prevents the GPU from reading a slot while the DMA thread is writing into it. A torch.cuda.Event recorded on the compute stream is used as the synchronisation barrier — no CPU stall.
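The handshake can be illustrated with a CPU-only analogy. In the sketch below, two semaphores per slot stand in for the CUDA events, a worker thread stands in for the DMA stream, and a list copy stands in for the transfer. All names are hypothetical and the real code synchronises on the GPU rather than blocking Python threads.

```python
import threading

NUM_SLOTS = 2  # double buffer


def run_pipeline(layers):
    """Stream each layer's weights through two slots while 'computing' on them."""
    slots = [None] * NUM_SLOTS
    ready = [threading.Semaphore(0) for _ in range(NUM_SLOTS)]  # slot holds fresh weights
    free = [threading.Semaphore(1) for _ in range(NUM_SLOTS)]   # slot may be overwritten
    outputs = []

    def transfer_worker():
        for i, weights in enumerate(layers):
            s = i % NUM_SLOTS
            free[s].acquire()         # wait until compute has released this slot
            slots[s] = list(weights)  # stands in for the pinned RAM -> VRAM copy
            ready[s].release()        # signal that the slot can be read

    worker = threading.Thread(target=transfer_worker, daemon=True)
    worker.start()
    for i in range(len(layers)):
        s = i % NUM_SLOTS
        ready[s].acquire()              # block only if the prefetch is not done yet
        outputs.append(sum(slots[s]))   # stands in for the layer's forward pass
        free[s].release()               # hand the slot back to the transfer thread
    worker.join()
    return outputs


print(run_pipeline([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

With two slots the worker can fill slot `(i + 1) % 2` while layer `i` is still being read, and it can never overwrite a slot that compute has not yet released.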
| Module | Class | Responsibility |
|---|---|---|
| `config.py` | `TorusSparseConfig` | All tuneable constants |
| `wrapper.py` | `TorusSparseModel` | HuggingFace wrapper, weight partitioning, hook registration, streaming |
| `router/manifold.py` | `TorusManifold` | Torus geometry + block index (dense and MoE) |
| `router/predictor.py` | `JEPAPredictor` | Latent trajectory MLP + online adaptation |
| `engine/cache_manager.py` | `CacheManager` | Double-buffer VRAM slots + zero-copy pointer swap |
| `engine/memory_stream.py` | `MemoryStream` | Background thread + secondary CUDA stream |
| `engine/lookahead.py` | `LookaheadCoordinator` | JEPA → Manifold → Stream → Swap orchestration |
| `utils/benchmarker.py` | `PCIeBenchmarker` | PCIe bandwidth probe + buffer-size recommendation |
| `scripts/chat.py` | - | Streaming terminal chat UI (rich) |
| `scripts/serve.py` | - | OpenAI-compatible API server (fastapi) |
| `scripts/monitor_benchmark.py` | - | Side-by-side benchmark with hardware monitoring |
The Sparse Autoencoder (SAE) and JEPA predictor are optional. By default, the system uses a deterministic fast path for standard sequential transformers that bypasses JEPA routing entirely and still achieves a 95%+ prefetch hit rate. Training is only needed for MoE models (Mixtral, Phi-3.5-MoE, etc.), where expert routing is non-deterministic.
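For MoE models, the quantity that training improves is how often the predicted experts match the ones the router actually selects. The sketch below shows one plausible way to pick the top-K experts from predictor scores and to measure a prefetch hit rate. Both helpers and the scores are hypothetical, and the project may define its hit rate differently; `k` plays the role of the `moe_expert_prefetch_k` setting.

```python
def top_k_experts(scores, k=2):
    """Indices of the k highest-scoring experts, i.e. the ones to prefetch."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def prefetch_hit_rate(predicted_per_step, fired_per_step):
    """Fraction of fired experts whose weights had already been prefetched."""
    hits = total = 0
    for predicted, fired in zip(predicted_per_step, fired_per_step):
        hits += len(set(predicted) & set(fired))
        total += len(fired)
    return hits / total if total else 1.0


# Hypothetical predictor scores for 4 experts over 3 decode steps.
scores = [[0.1, 0.7, 0.9, 0.2], [0.8, 0.1, 0.3, 0.6], [0.2, 0.5, 0.4, 0.9]]
fired = [[2, 1], [0, 2], [3, 1]]  # experts the real router selected
predicted = [top_k_experts(s, k=2) for s in scores]
print(predicted)                                     # [[2, 1], [0, 3], [3, 1]]
print(f"{prefetch_hit_rate(predicted, fired):.3f}")  # 0.833
```

Every miss falls back to a synchronous copy, so a higher hit rate means fewer stalls on the PCIe transfer.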
# Train the Sparse Autoencoder
python scripts/train_sae.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--output models/tinyllama_sae.pt \
--steps 2000
# Train the JEPA predictor
python scripts/train_jepa.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--sae models/tinyllama_sae.pt \
--output models/tinyllama_predictor.pt \
--steps 2000
# Run with trained weights
python scripts/monitor_benchmark.py \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--sae models/tinyllama_sae.pt \
--predictor models/tinyllama_predictor.pt \
--tokens 100

from torus_sparse.config import TorusSparseConfig
cfg = TorusSparseConfig(
pcie_buffer_mb=128, # VRAM slot size; auto-sized at runtime if too small
lookahead_horizon=2, # layers ahead to prefetch
vram_resident_fraction=0.0, # fraction of early layers to keep permanently in VRAM
weight_quant="none", # "none" or "int8" (per-channel symmetric)
moe_expert_prefetch_k=2, # experts to prefetch per MoE layer
)

- Speed vs VRAM tradeoff is real: on PCIe 3.0 hardware, expect 5–10× throughput reduction versus baseline. This is PCIe bandwidth physics, not a code bug.
- No INT4 support: INT4 quantisation conflicts with `pin_memory()` DMA. INT8 is the minimum granularity that works cleanly.
- JEPA routing disabled for sequential models: the deterministic fast path is always used for standard transformers. JEPA routing (SAE + predictor) is active only for MoE models.
- Windows Python 3.13 + datasets: avoid `from datasets import load_dataset`; it triggers a pyarrow crash. The training scripts use `urllib` directly to avoid this.
GNU Affero General Public License v3.0 — see LICENSE.