TorusSparse

Stream large LLMs from CPU RAM to VRAM on demand — run models that don't fit, without buying more GPU memory.

TorusSparse wraps any HuggingFace CausalLM so that:

FFN weights live in pinned system RAM — attention layers, embeddings, and norms stay in VRAM permanently.
Async PCIe streaming — a background thread prefetches the next layer's weights while the GPU computes the current one, hiding transfer latency.
JEPA predictor — for MoE models, a tiny MLP predicts which expert weights will be needed next and schedules prefetches ahead of time. Dense models skip it and use a deterministic layer-order fast path.
Zero model changes — a transparent wrapper; your HuggingFace model code is unchanged.

Measured Results

Hardware: RTX 2070 Super (8.6 GB VRAM), PCIe 3.0 x16, 34 GB DDR4

Model	Type	Baseline VRAM	TorusSparse VRAM	VRAM saved	TorusSparse speed
TinyLlama-1.1B	Dense	3037 MB	915 MB	2122 MB	7.4 tok/s
Phi-2 2.7B	Dense	8027 MB	2573 MB	5454 MB	2.7 tok/s
Mistral-7B	Dense	OOM	~6 GB	runs at all	~1 tok/s
OLMoE-1B-7B (64 experts)	MoE	OOM	3390 MB	runs at all	0.9 tok/s

Text output is identical character-for-character to the baseline in all runs.

OLMoE-1B-7B is a Mixture-of-Experts model with 64 experts per layer (7B total parameters, 1B active per token). Its weights alone take roughly 14 GB at fp16, so it cannot run on an 8.6 GB GPU without offloading. TorusSparse runs it with a 92.9% prefetch hit rate.

The speed overhead is not a code problem — it is physics. Every FFN layer's weights must cross the PCIe bus once per token:

TinyLlama: 22 layers × 66 MB = 1.45 GB/token ÷ 16 GB/s PCIe 3.0 ≈ 91 ms ≈ 11 tok/s ceiling
Phi-2:     32 layers × 100 MB = 3.2 GB/token  ÷ 16 GB/s PCIe 3.0 = 200 ms = 5 tok/s ceiling

On PCIe 4.0 / 5.0 (2–4× the bandwidth of PCIe 3.0), the same arithmetic raises the ceiling proportionally. On Apple Silicon (unified memory, ~400 GB/s on M2/M3 Max, no PCIe bottleneck), the overhead should nearly disappear. Both are projections from bus bandwidth; the measurements above cover only the PCIe 3.0 system.
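The ceiling arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch, not part of TorusSparse: the function name `decode_ceiling` is made up for illustration, the layer counts and per-layer sizes are the ones quoted above, and the 32 GB/s figure for PCIe 4.0 x16 is a nominal assumption (real buses deliver somewhat less).

```python
def decode_ceiling(n_layers, ffn_mb_per_layer, bus_gb_per_s):
    """Return (ms_per_token, tokens_per_s) if the bus transfer were the only cost.

    Assumes every FFN layer crosses the bus exactly once per generated token
    and that transfers run at the full quoted bus bandwidth.
    """
    gb_per_token = n_layers * ffn_mb_per_layer / 1000.0
    seconds_per_token = gb_per_token / bus_gb_per_s
    return seconds_per_token * 1000.0, 1.0 / seconds_per_token


# Layer counts and per-layer FFN sizes as quoted in this README.
MODELS = [("TinyLlama", 22, 66), ("Phi-2", 32, 100)]
# Nominal bandwidths; the PCIe 4.0 value is an assumption for comparison.
BUSES = [("PCIe 3.0 x16", 16.0), ("PCIe 4.0 x16", 32.0)]

for model_name, layers, mb in MODELS:
    for bus_name, gb_per_s in BUSES:
        ms, tps = decode_ceiling(layers, mb, gb_per_s)
        print(f"{model_name:10s} {bus_name:13s} {ms:6.1f} ms/token  {tps:5.1f} tok/s ceiling")
```

Measured speeds (7.4 and 2.7 tok/s) sit below these ceilings because compute, attention, and imperfect overlap add to the transfer time.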

The right use cases:

Running a model that would OOM at baseline (Mistral-7B on 8 GB VRAM)
Freeing VRAM for other workloads while still running inference
Apple Silicon with unified memory — large models at near-baseline speed

Installation

git clone https://github.com/cogitanai/torussparse
cd torussparse
pip install -e .

Requirements: Python ≥ 3.10, PyTorch ≥ 2.2, Transformers ≥ 4.40, Accelerate ≥ 0.29

CUDA is strongly recommended. Without it the system falls back to CPU-only mode — no VRAM savings, but all code paths still run correctly (useful for development and testing).

Optional extras:

pip install rich                  # required for chat.py terminal UI
pip install fastapi uvicorn       # required for serve.py API server

Quick Start

Terminal chat (streaming, token by token)

python scripts/chat.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
python scripts/chat.py --model microsoft/phi-2
python scripts/chat.py --model mistralai/Mistral-7B-Instruct-v0.2 --skip-baseline

The chat CLI shows a loading spinner, streams tokens in real time, and keeps conversation history. Commands: /reset to clear history, /quit or Ctrl+C to exit.

OpenAI-compatible API server (for OpenClaw, Open WebUI, Jan, LM Studio, etc.)

pip install fastapi uvicorn
python scripts/serve.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8000

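Then point any OpenAI-compatible client at http://127.0.0.1:8000. Most clients ask for a base URL that includes the /v1 suffix (http://127.0.0.1:8000/v1), as in the setup steps below. The server speaks the same streaming /v1/chat/completions format that OpenClaw, Open WebUI, and similar tools expect.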

# Verify the server is up:
curl http://127.0.0.1:8000/health

# Send a chat message:
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
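The same request can be made from Python with only the standard library. The sketch below assumes the server emits the usual OpenAI streaming shape (server-sent `data:` lines carrying JSON chunks with `choices[0].delta.content`, terminated by `data: [DONE]`); that shape is inferred from the compatibility claim above, not checked against serve.py. The helper names `iter_stream_text` and `stream_chat` are illustrative.

```python
import json
import urllib.request


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI format
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(prompt, model, base_url="http://127.0.0.1:8000/v1"):
    """POST a streaming chat request and yield text fragments as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # Iterating an HTTPResponse yields one line of bytes at a time.
        yield from iter_stream_text(response)


# Usage, with the server from the previous step running:
# for piece in stream_chat("Hello", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
#     print(piece, end="", flush=True)
```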

Connecting to OpenClaw

Start the TorusSparse server (see above).
Open OpenClaw → Settings → LLM Providers → Add Provider.
Set API Type to OpenAI Compatible.
Set Base URL to http://localhost:8000/v1.
Set API Key to any non-empty string (e.g. torussparse — the server ignores it).
Set Model to the model name you started the server with (e.g. TinyLlama/TinyLlama-1.1B-Chat-v1.0).
Save. OpenClaw will now route all LLM calls through TorusSparse.

Connecting to Open WebUI

Start the TorusSparse server.
Open WebUI → Settings → Connections → OpenAI API.
Set URL to http://localhost:8000/v1 and API key to any string.
The model will appear in the model selector.

Python API

from torus_sparse import TorusSparseModel

model = TorusSparseModel.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Single response
response = model.generate("Explain the sky in two sentences.", max_new_tokens=128)
print(response)

# Streaming
for token in model.stream_generate("Tell me a story.", max_new_tokens=256):
    print(token, end="", flush=True)

INT8 quantisation (halves RAM footprint)

python scripts/chat.py --model mistralai/Mistral-7B-Instruct-v0.2 --weight-quant int8

INT8 uses per-channel symmetric quantisation, reducing FFN RAM usage by ~50% with minor quality degradation. Recommended when a model is too large to fit in system RAM at fp16.
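For readers unfamiliar with the term, here is a minimal pure-Python sketch of per-channel symmetric INT8 quantisation. It illustrates the general technique only and is not TorusSparse's code: treating each row of the weight matrix as one channel is an assumption, and the function names are hypothetical.

```python
def quantize_per_channel_int8(weight):
    """Quantise a 2-D weight (list of rows, one row per channel) to int8.

    Symmetric: zero maps to zero and each channel uses a single scale
    chosen so that its largest magnitude maps to 127.
    Returns (int8_rows, scales).
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max((abs(v) for v in row), default=0.0)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid dividing by zero
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows, scales):
    """Recover approximate float weights: w is roughly q * scale, per channel."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.5, -1.0, 0.2], [100.0, -40.0, 3.0]]
q, s = quantize_per_channel_int8(weights)
restored = dequantize_per_channel(q, s)
for original_row, restored_row in zip(weights, restored):
    print(original_row, "->", [round(v, 4) for v in restored_row])
```

Each weight shrinks from two bytes (fp16) to one, plus one scale per channel, which is where the roughly 50% saving comes from. The rounding error per weight is at most half a scale step, which is why the quality loss is small but not zero.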

Run-to-run benchmark

python scripts/monitor_benchmark.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tokens 100
python scripts/monitor_benchmark.py --model microsoft/phi-2 --tokens 100
python scripts/monitor_benchmark.py --model mistralai/Mistral-7B-Instruct-v0.2 --tokens 50 --skip-baseline

Verify the Install

No GPU and no model downloads required for the test suite:

# Full test suite (82 tests, ~90 seconds, CPU only)
python -m pytest tests/ -q

# Quick smoke test (synthetic model, no downloads, ~10 seconds)
python scripts/smoke_run.py

How It Works

GPU executes layer i
        │
        ▼  (post-forward hook)
restore_layer(i)    — swap out i's VRAM weights, flip double-buffer
        │
        ▼
on_layer_complete(i)
  Dense fast path:  submit TransferJob(i + lookahead_horizon)
                    _apply_prefetch(i + 1) — swap in if ready
  MoE slow path:    JEPA predicts which K experts will fire → submit those only

Background thread (MemoryStream):
  consume_guard()   — wait for GPU guard event (no CPU stall)
  slot.reset()      — clear write pointer
  slot.write(...)   — async DMA pinned RAM → VRAM slot (secondary CUDA stream)
  post TransferResult → result queue

Layer i+1:
  if prefetch ready: param.data = VRAM slot view  (zero-copy pointer swap)
  else:              sync-copy fallback (pre-hook)
  GPU executes layer i+1

The double-buffer prevents the GPU from reading a slot while the DMA thread is writing into it. A torch.cuda.Event recorded on the compute stream is used as the synchronisation barrier — no CPU stall.
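The overlap pattern can be sketched on the CPU with plain threads. This is only an analogy under stated assumptions: the real system uses pinned memory, a secondary CUDA stream, and torch.cuda.Event, whereas this toy uses list copies and threading.Event. The class and method names are hypothetical, and it prefetches one layer ahead rather than a configurable horizon.

```python
import queue
import threading


class DoubleBufferedPrefetcher:
    """Two slots: the worker fills the idle slot while compute reads the other."""

    def __init__(self, layers):
        self.layers = layers                      # stand-in for weights in pinned RAM
        self.slots = [None, None]                 # stand-in for two VRAM slots
        self.ready = [threading.Event(), threading.Event()]
        self.jobs = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            index = self.jobs.get()
            if index is None:                     # shutdown sentinel
                return
            slot = index % 2
            self.slots[slot] = list(self.layers[index])   # the "DMA copy"
            self.ready[slot].set()

    def prefetch(self, index):
        """Queue a copy of layer `index` into the slot it maps to."""
        if index < len(self.layers):
            self.ready[index % 2].clear()
            self.jobs.put(index)

    def acquire(self, index):
        """Block only if the copy for layer `index` has not finished yet."""
        self.ready[index % 2].wait()
        return self.slots[index % 2]

    def close(self):
        self.jobs.put(None)
        self.worker.join()


def run_forward(layers, x):
    prefetcher = DoubleBufferedPrefetcher(layers)
    prefetcher.prefetch(0)
    for i in range(len(layers)):
        weights = prefetcher.acquire(i)
        prefetcher.prefetch(i + 1)                # targets the other slot, overlaps compute
        x = sum(w * x for w in weights)           # stand-in for the layer's compute
    prefetcher.close()
    return x


print(run_forward([[1, 2], [3], [0.5, 0.5]], 1))
```

Layer i + 1 is only queued after layer i - 1 has finished with the shared slot, which plays the role of the guard event: the writer never touches a slot the reader is still using.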

Architecture

Module	Class	Responsibility
`config.py`	`TorusSparseConfig`	All tuneable constants
`wrapper.py`	`TorusSparseModel`	HuggingFace wrapper, weight partitioning, hook registration, streaming
`router/manifold.py`	`TorusManifold`	Torus geometry + block index (dense and MoE)
`router/predictor.py`	`JEPAPredictor`	Latent trajectory MLP + online adaptation
`engine/cache_manager.py`	`CacheManager`	Double-buffer VRAM slots + zero-copy pointer swap
`engine/memory_stream.py`	`MemoryStream`	Background thread + secondary CUDA stream
`engine/lookahead.py`	`LookaheadCoordinator`	JEPA → Manifold → Stream → Swap orchestration
`utils/benchmarker.py`	`PCIeBenchmarker`	PCIe bandwidth probe + buffer-size recommendation
`scripts/chat.py`	—	Streaming terminal chat UI (`rich`)
`scripts/serve.py`	—	OpenAI-compatible API server (`fastapi`)
`scripts/monitor_benchmark.py`	—	Side-by-side benchmark with hardware monitoring

Training the Routing Components

The Sparse Autoencoder (SAE) and JEPA predictor are optional. For standard sequential transformers the system defaults to a deterministic fast path that bypasses JEPA routing entirely and still achieves a 95%+ prefetch hit rate. Training is only needed for MoE models (Mixtral, Phi-3.5-MoE, etc.), where the experts that fire depend on the input and cannot be known in advance.

# Train the Sparse Autoencoder
python scripts/train_sae.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output models/tinyllama_sae.pt \
  --steps 2000

# Train the JEPA predictor
python scripts/train_jepa.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --sae models/tinyllama_sae.pt \
  --output models/tinyllama_predictor.pt \
  --steps 2000

# Run with trained weights
python scripts/monitor_benchmark.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --sae models/tinyllama_sae.pt \
  --predictor models/tinyllama_predictor.pt \
  --tokens 100

Configuration

from torus_sparse.config import TorusSparseConfig

cfg = TorusSparseConfig(
    pcie_buffer_mb=128,            # VRAM slot size; auto-sized at runtime if too small
    lookahead_horizon=2,           # layers ahead to prefetch
    vram_resident_fraction=0.0,    # fraction of early layers to keep permanently in VRAM
    weight_quant="none",           # "none" or "int8" (per-channel symmetric)
    moe_expert_prefetch_k=2,       # experts to prefetch per MoE layer
)
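To make moe_expert_prefetch_k concrete, the sketch below picks the k highest-scoring experts from a predictor's output and scores the result against the experts the router actually used. Both helpers are hypothetical, and the hit-rate definition (fraction of used experts that were already prefetched) is an assumption; the README does not say how its 92.9% figure is computed.

```python
def select_experts_to_prefetch(predicted_scores, k):
    """Return the indices of the k experts with the highest predicted score."""
    ranked = sorted(
        range(len(predicted_scores)),
        key=lambda expert: predicted_scores[expert],
        reverse=True,
    )
    return ranked[:k]


def prefetch_hit_rate(prefetched_per_step, used_per_step):
    """Fraction of experts actually used that had been prefetched beforehand."""
    hits = 0
    total = 0
    for prefetched, used in zip(prefetched_per_step, used_per_step):
        prefetched_set = set(prefetched)
        hits += sum(1 for expert in used if expert in prefetched_set)
        total += len(used)
    return hits / total if total else 1.0


# Toy example with 4 experts and k=2 (scores are made up).
step_scores = [[0.1, 0.7, 0.2, 0.9], [0.6, 0.1, 0.5, 0.2]]
prefetched = [select_experts_to_prefetch(scores, k=2) for scores in step_scores]
actually_used = [[3, 1], [0, 1]]
print(prefetched, prefetch_hit_rate(prefetched, actually_used))
```

A larger k raises the hit rate at the cost of more PCIe traffic per layer, so the setting trades bandwidth against the number of synchronous fallback copies.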

Known Limitations

Speed vs VRAM tradeoff is real — on PCIe 3.0 hardware, expect 5–10× throughput reduction versus baseline. This is PCIe bandwidth physics, not a code bug.
No INT4 support — INT4 quantisation conflicts with pin_memory() DMA. INT8 is the minimum granularity that works cleanly.
JEPA routing disabled for sequential models — the deterministic fast path is always used for standard transformers. JEPA routing (SAE + predictor) is active only for MoE models.
Windows Python 3.13 + datasets — avoid from datasets import load_dataset; it triggers a pyarrow crash. The training scripts use urllib directly to avoid this.

License

GNU Affero General Public License v3.0 — see LICENSE.
