CogitanAI/TorusSparse

TorusSparse

Stream large LLMs from CPU RAM to VRAM on demand — run models that don't fit, without buying more GPU memory.

TorusSparse wraps any HuggingFace CausalLM so that:

  • FFN weights live in pinned system RAM — attention layers, embeddings, and norms stay in VRAM permanently.
  • Async PCIe streaming — a background thread prefetches the next layer's weights while the GPU computes the current one, hiding transfer latency.
  • JEPA predictor — a tiny MLP predicts which weights will be needed next and schedules prefetches ahead of time.
  • Zero model changes — a transparent wrapper; your HuggingFace model code is unchanged.

Measured Results

Hardware: RTX 2070 Super (8 GiB VRAM, i.e. 8.6 GB), PCIe 3.0 x16, 34 GB DDR4

| Model | Type | Baseline VRAM | TorusSparse VRAM | VRAM saved | Speed |
|---|---|---|---|---|---|
| TinyLlama-1.1B | Dense | 3037 MB | 915 MB | −2122 MB | 7.4 tok/s |
| Phi-2 2.7B | Dense | 8027 MB | 2573 MB | −5454 MB | 2.7 tok/s |
| Mistral-7B | Dense | OOM | ~6 GB | runs at all | ~1 tok/s |
| OLMoE-1B-7B (64 experts) | MoE | OOM | 3390 MB | runs at all | 0.9 tok/s |

Text output is identical character-for-character to the baseline in all runs.

OLMoE-1B-7B is a Mixture-of-Experts model with 64 experts per layer (7B total parameters, 1B active per token). It cannot run on an 8.6 GB GPU without offloading. TorusSparse runs it with a 92.9% prefetch hit rate.

The speed overhead is not a code problem — it is physics. Every FFN layer's weights must cross the PCIe bus once per token:

TinyLlama: 22 layers × 66 MB  = 1.45 GB/token ÷ 16 GB/s PCIe 3.0 ≈ 91 ms  → ~11 tok/s ceiling
Phi-2:     32 layers × 100 MB = 3.2 GB/token  ÷ 16 GB/s PCIe 3.0 = 200 ms → 5 tok/s ceiling
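
The ceiling arithmetic above can be reproduced in a few lines of Python. This is a back-of-the-envelope sketch, not part of TorusSparse: the function name is illustrative, 16 GB/s is the nominal PCIe 3.0 x16 figure used above, and it assumes every FFN layer crosses the bus exactly once per token with no overlap or caching.

```python
def pcie_token_ceiling(num_layers: int, ffn_mb_per_layer: float,
                       bus_gb_per_s: float = 16.0):
    """Return (ms_per_token, tokens_per_s) when all FFN weights cross the bus once per token."""
    gb_per_token = num_layers * ffn_mb_per_layer / 1000.0  # decimal units, as in the text
    seconds = gb_per_token / bus_gb_per_s
    return seconds * 1000.0, 1.0 / seconds


for name, layers, mb in [("TinyLlama", 22, 66), ("Phi-2", 32, 100)]:
    ms, tps = pcie_token_ceiling(layers, mb)
    print(f"{name}: {ms:.0f} ms/token, {tps:.1f} tok/s ceiling")
```

Passing a larger `bus_gb_per_s` (for example 32.0 as a nominal PCIe 4.0 x16 figure) shows how the ceiling scales with bus bandwidth.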

On PCIe 4.0 / 5.0 (2–4× bandwidth) or Apple Silicon (unified memory, ~400 GB/s, no PCIe bottleneck), the speed overhead shrinks dramatically. On M2/M3 Max, the memory bus is fast enough that the overhead nearly disappears.

The right use cases:

  • Running a model that would OOM at baseline (Mistral-7B on 8 GB VRAM)
  • Freeing VRAM for other workloads while still running inference
  • Apple Silicon with unified memory — large models at near-baseline speed

Installation

git clone https://github.com/cogitanai/torussparse
cd torussparse
pip install -e .

Requirements: Python ≥ 3.10, PyTorch ≥ 2.2, Transformers ≥ 4.40, Accelerate ≥ 0.29

CUDA is strongly recommended. Without it the system falls back to CPU-only mode — no VRAM savings, but all code paths still run correctly (useful for development and testing).

Optional extras:

pip install rich                  # required for chat.py terminal UI
pip install fastapi uvicorn       # required for serve.py API server

Quick Start

Terminal chat (streaming, token by token)

python scripts/chat.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
python scripts/chat.py --model microsoft/phi-2
python scripts/chat.py --model mistralai/Mistral-7B-Instruct-v0.2 --skip-baseline

The chat CLI shows a loading spinner, streams tokens in real time, and keeps conversation history. Commands: /reset to clear history, /quit or Ctrl+C to exit.

OpenAI-compatible API server (for OpenClaw, Open WebUI, Jan, LM Studio, etc.)

pip install fastapi uvicorn
python scripts/serve.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 8000

Then point any OpenAI-compatible client at http://127.0.0.1:8000. It speaks the same streaming /v1/chat/completions format that OpenClaw, Open WebUI, and similar tools expect.

# Verify the server is up:
curl http://127.0.0.1:8000/health

# Send a chat message:
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
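
The same streaming request can be made from Python with only the standard library. This is a minimal sketch: `iter_deltas` and `stream_chat` are illustrative names, and it assumes the server emits OpenAI-style server-sent events (`data: {json}` lines terminated by `data: [DONE]`). That framing is what "the same streaming format" usually means, but it has not been checked against serve.py.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel (assumed, as in the OpenAI API)
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def stream_chat(prompt: str, model: str,
                base_url: str = "http://127.0.0.1:8000") -> Iterator[str]:
    """POST to /v1/chat/completions with stream=true and yield tokens as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # the response iterates line by line
        yield from iter_deltas(resp)
```

With the server running, `for piece in stream_chat("Hello", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"): print(piece, end="", flush=True)` prints the reply as it is generated.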

Connecting to OpenClaw

  1. Start the TorusSparse server (see above).
  2. Open OpenClaw → Settings → LLM Providers → Add Provider.
  3. Set API Type to OpenAI Compatible.
  4. Set Base URL to http://localhost:8000/v1.
  5. Set API Key to any non-empty string (e.g. torussparse — the server ignores it).
  6. Set Model to the model name you started the server with (e.g. TinyLlama/TinyLlama-1.1B-Chat-v1.0).
  7. Save. OpenClaw will now route all LLM calls through TorusSparse.

Connecting to Open WebUI

  1. Start the TorusSparse server.
  2. In Open WebUI, go to Settings → Connections → OpenAI API.
  3. Set URL to http://localhost:8000/v1 and API key to any string.
  4. The model will appear in the model selector.

Python API

from torus_sparse import TorusSparseModel

model = TorusSparseModel.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Single response
response = model.generate("Explain the sky in two sentences.", max_new_tokens=128)
print(response)

# Streaming
for token in model.stream_generate("Tell me a story.", max_new_tokens=256):
    print(token, end="", flush=True)

INT8 quantisation (halves RAM footprint)

python scripts/chat.py --model mistralai/Mistral-7B-Instruct-v0.2 --weight-quant int8

INT8 uses per-channel symmetric quantisation, reducing FFN RAM usage by ~50% with minor quality degradation. Recommended when a model is too large to fit in system RAM at fp16.
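
To illustrate what per-channel symmetric quantisation means, here is a pure-Python sketch. It is not the TorusSparse implementation: it assumes one scale per output row (the source does not say which axis is the channel), and real code would operate on tensors rather than lists.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel(weights: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Quantise each row (assumed output channel) to int8 with its own symmetric scale."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # all-zero row: any scale works
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> Matrix:
    """Recover approximate float weights: w ≈ q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.50, -1.00, 0.25], [0.02, 0.01, -0.04]]
q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)
max_err = max(abs(a - b)
              for r0, r1 in zip(weights, restored)
              for a, b in zip(r0, r1))
print(q, max_err)
```

Each weight drops from 2 bytes (fp16) to 1 byte plus one scale per row, which is where the roughly 50% saving comes from. Because every row has its own scale, the rounding error is bounded by half of that row's scale, so small-magnitude rows are not swamped by large ones.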

Run-to-run benchmark

python scripts/monitor_benchmark.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tokens 100
python scripts/monitor_benchmark.py --model microsoft/phi-2 --tokens 100
python scripts/monitor_benchmark.py --model mistralai/Mistral-7B-Instruct-v0.2 --tokens 50 --skip-baseline

Verify the Install

No GPU and no model downloads required for the test suite:

# Full test suite (82 tests, ~90 seconds, CPU only)
python -m pytest tests/ -q

# Quick smoke test (synthetic model, no downloads, ~10 seconds)
python scripts/smoke_run.py

How It Works

GPU executes layer i
        │
        ▼  (post-forward hook)
restore_layer(i)    — swap out i's VRAM weights, flip double-buffer
        │
        ▼
on_layer_complete(i)
  Dense fast path:  submit TransferJob(i + lookahead_horizon)
                    _apply_prefetch(i + 1) — swap in if ready
  MoE slow path:    JEPA predicts which K experts will fire → submit those only

Background thread (MemoryStream):
  consume_guard()   — wait for GPU guard event (no CPU stall)
  slot.reset()      — clear write pointer
  slot.write(...)   — async DMA pinned RAM → VRAM slot (secondary CUDA stream)
  post TransferResult → result queue

Layer i+1:
  if prefetch ready: param.data = VRAM slot view  (zero-copy pointer swap)
  else:              sync-copy fallback (pre-hook)
  GPU executes layer i+1

The double-buffer prevents the GPU from reading a slot while the DMA thread is writing into it. A torch.cuda.Event recorded on the compute stream is used as the synchronisation barrier — no CPU stall.
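
The prefetch-or-fallback control flow in the diagram can be modelled on the CPU with a worker thread. This sketch is a simplified stand-in, not the real MemoryStream: `Prefetcher` and `run_forward` are hypothetical names, the `load` callable replaces the pinned RAM → VRAM copy, and the double-buffer slots and CUDA events are left out entirely.

```python
import queue
import threading
from typing import Callable, Dict, List, Optional, Tuple


class Prefetcher:
    """Background loader with a synchronous fallback on a miss."""

    def __init__(self, load: Callable[[int], bytes]) -> None:
        self._load = load
        self._jobs: "queue.Queue[Optional[int]]" = queue.Queue()
        self._ready: Dict[int, bytes] = {}
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self) -> None:
        while True:
            layer = self._jobs.get()
            if layer is None:  # shutdown sentinel
                return
            data = self._load(layer)  # stands in for the async DMA transfer
            with self._lock:
                self._ready[layer] = data

    def submit(self, layer: int) -> None:
        self._jobs.put(layer)

    def take(self, layer: int) -> Tuple[bytes, bool]:
        """Return (weights, hit); on a miss, load synchronously."""
        with self._lock:
            data = self._ready.pop(layer, None)
        if data is not None:
            return data, True
        return self._load(layer), False

    def close(self) -> None:
        self._jobs.put(None)
        self._thread.join()


def run_forward(num_layers: int, lookahead: int,
                load: Callable[[int], bytes]) -> Tuple[List[bytes], int]:
    """Walk the layers once, prefetching `lookahead` layers ahead; return (weights used, hits)."""
    pf = Prefetcher(load)
    for i in range(min(lookahead, num_layers)):
        pf.submit(i)  # warm-up before the first layer (an assumption of this sketch)
    used, hits = [], 0
    for i in range(num_layers):
        weights, hit = pf.take(i)
        hits += hit
        used.append(weights)  # "GPU executes layer i"
        if i + lookahead < num_layers:
            pf.submit(i + lookahead)  # after layer i completes, schedule a later layer
    pf.close()
    return used, hits
```

Whether a given layer is a hit depends on thread timing, but the weights each layer sees are the same either way. That is the property that keeps output identical to the baseline.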


Architecture

| Module | Class | Responsibility |
|---|---|---|
| config.py | TorusSparseConfig | All tuneable constants |
| wrapper.py | TorusSparseModel | HuggingFace wrapper, weight partitioning, hook registration, streaming |
| router/manifold.py | TorusManifold | Torus geometry + block index (dense and MoE) |
| router/predictor.py | JEPAPredictor | Latent trajectory MLP + online adaptation |
| engine/cache_manager.py | CacheManager | Double-buffer VRAM slots + zero-copy pointer swap |
| engine/memory_stream.py | MemoryStream | Background thread + secondary CUDA stream |
| engine/lookahead.py | LookaheadCoordinator | JEPA → Manifold → Stream → Swap orchestration |
| utils/benchmarker.py | PCIeBenchmarker | PCIe bandwidth probe + buffer-size recommendation |
| scripts/chat.py | | Streaming terminal chat UI (rich) |
| scripts/serve.py | | OpenAI-compatible API server (fastapi) |
| scripts/monitor_benchmark.py | | Side-by-side benchmark with hardware monitoring |

Training the Routing Components

The Sparse Autoencoder (SAE) and JEPA predictor are optional. For standard sequential transformers, the system defaults to a deterministic fast path that bypasses JEPA routing entirely and still achieves a 95%+ prefetch hit rate. Training is needed only for MoE models (Mixtral, Phi-3.5-MoE, etc.), where expert routing is non-deterministic.

# Train the Sparse Autoencoder
python scripts/train_sae.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output models/tinyllama_sae.pt \
  --steps 2000

# Train the JEPA predictor
python scripts/train_jepa.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --sae models/tinyllama_sae.pt \
  --output models/tinyllama_predictor.pt \
  --steps 2000

# Run with trained weights
python scripts/monitor_benchmark.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --sae models/tinyllama_sae.pt \
  --predictor models/tinyllama_predictor.pt \
  --tokens 100
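
For MoE models, the routing step reduces to picking the K experts the predictor considers most likely and then checking how many of the experts the router actually used were already prefetched. The sketch below illustrates that idea only. The scores are made up, the function names are hypothetical, and the hit-rate definition (prefetched ∩ used, divided by used) is an assumption rather than the metric the benchmark script necessarily reports.

```python
from typing import List, Sequence, Set


def top_k_experts(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring experts, best first."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def prefetch_hit_rate(predicted: Sequence[Set[int]],
                      routed: Sequence[Set[int]]) -> float:
    """Fraction of experts the router used that had already been prefetched."""
    needed = sum(len(r) for r in routed)
    hits = sum(len(p & r) for p, r in zip(predicted, routed))
    return hits / needed if needed else 1.0


# Hypothetical predictor scores for 4 experts in each of 2 MoE layers.
layer_scores = [[0.1, 0.7, 0.05, 0.15], [0.4, 0.1, 0.3, 0.2]]
predicted = [set(top_k_experts(s, k=2)) for s in layer_scores]
routed = [{1, 3}, {0, 1}]  # experts the real router selected (made up)
print(predicted, prefetch_hit_rate(predicted, routed))
```

Here the first layer is predicted perfectly and the second layer misses one expert, so three of the four required experts were prefetched. Any missed expert would have to be loaded through the synchronous fallback.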

Configuration

from torus_sparse.config import TorusSparseConfig

cfg = TorusSparseConfig(
    pcie_buffer_mb=128,            # VRAM slot size; auto-sized at runtime if too small
    lookahead_horizon=2,           # layers ahead to prefetch
    vram_resident_fraction=0.0,    # fraction of early layers to keep permanently in VRAM
    weight_quant="none",           # "none" or "int8" (per-channel symmetric)
    moe_expert_prefetch_k=2,       # experts to prefetch per MoE layer
)
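
As a rough illustration of what `vram_resident_fraction` controls, the sketch below splits layer indices into a resident prefix and a streamed remainder. The helper name and the round-down rule are assumptions for illustration. The library may round differently or partition on another basis.

```python
from typing import List, Tuple


def split_layers(num_layers: int,
                 vram_resident_fraction: float) -> Tuple[List[int], List[int]]:
    """Split layer indices into (resident, streamed), keeping the earliest layers resident."""
    if not 0.0 <= vram_resident_fraction <= 1.0:
        raise ValueError("vram_resident_fraction must be in [0, 1]")
    n_resident = int(num_layers * vram_resident_fraction)  # assumption: round down
    layers = list(range(num_layers))
    return layers[:n_resident], layers[n_resident:]


resident, streamed = split_layers(22, 0.25)
print(f"{len(resident)} resident, {len(streamed)} streamed per token")
```

Every layer moved into the resident set is one fewer FFN transfer per token, at the cost of the VRAM that layer occupies permanently.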

Known Limitations

  • Speed vs VRAM tradeoff is real — on PCIe 3.0 hardware, expect 5–10× throughput reduction versus baseline. This is PCIe bandwidth physics, not a code bug.
  • No INT4 support — INT4 quantisation conflicts with pin_memory() DMA. INT8 is the minimum granularity that works cleanly.
  • JEPA routing disabled for sequential models — the deterministic fast path is always used for standard transformers. JEPA routing (SAE + predictor) is active only for MoE models.
  • Windows Python 3.13 + datasets — avoid from datasets import load_dataset; it triggers a pyarrow crash. The training scripts use urllib directly to avoid this.

License

GNU Affero General Public License v3.0 — see LICENSE.
