
LocalAI4Jax

Local-LLM stack for cosmos-engine-v2 and cosmos-engine-pantheon, built specifically to run on an old Dell Inspiron i5 with no GPU. Drop-in replacement for the Anthropic API that doesn't cost anything.

LocalAI4Jax is a small focused toolkit that does one thing: get the two cosmos-engine projects running on a low-end laptop without any cloud API keys. It does not try to be a general LLM gateway. It does not try to support every model. It supports the models that actually run well on a Dell Inspiron i5 with 8 GB of RAM and no GPU, and it wires them into the two specific engines they need to drive.

Built for Jax when his Claude API budget ran out.


What you get

  1. A one-shot installer — installs Ollama, picks the right model tier for your hardware, pulls the curated models, and runs a smoke test. One command on Linux/macOS, a six-step list on Windows.
  2. A hardware check — reports what you've got and recommends a tier before you download multi-gigabyte models that won't fit anyway.
  3. A drop-in anthropic SDK shim — cosmos-engine-v2 uses the real Anthropic SDK; this shim makes client.messages.create() work locally with one extra import line at the top of v2's entry point.
  4. A pantheon config patch — cosmos-engine-pantheon already speaks OpenAI-compat HTTP, so wiring it in is just env vars or a six-line edit.
  5. Verify and speed-test scripts — prove the install works end-to-end and benchmark each model on your specific hardware so you know what to expect.

The five-minute path (Linux / macOS)

# 1. install LocalAI4Jax — installs Ollama, pulls models, smoke-tests
git clone https://github.com/JacobFlorio/LocalAI4Jax.git
cd LocalAI4Jax
./install.sh

# 2. wire your existing cosmos-engine-v2 — auto-patches orchestrator.py
python wire_v2.py /path/to/cosmos-engine-v2

# 3. (recommended) better chapter prose
export LOCALAI_MAP_SONNET=qwen2.5:3b-instruct

# 4. prove the shim works on your machine before launching the real engine
python examples/v2_shim_smoke.py

# 5. run v2 the same way you would have before
cd /path/to/cosmos-engine-v2/cosmos-engine
python orchestrator.py

That's the whole thing for cosmos-engine-v2. No manual edits. wire_v2.py inserts a 4-line bootstrap block at the top of v2's orchestrator.py (with a backup), and from then on every import anthropic in v2 silently routes to local Ollama. Undo any time with python wire_v2.py --revert /path/to/cosmos-engine-v2.
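The routing trick the bootstrap relies on is registering a replacement module in sys.modules before anything imports the real SDK. Here is a minimal, self-contained sketch of that technique — the names and placeholder return value are illustrative, not the repo's actual code:

```python
# Sketch of the sys.modules technique bootstrap.py uses.
# The shim object here is a stand-in, not the real anthropic_shim.
import sys
import types

# Build a replacement module before anything imports the real SDK.
shim = types.ModuleType("anthropic")
shim.Anthropic = lambda **kwargs: "local-routed client"  # placeholder

# Registering it under the SDK's name means every later
# `import anthropic` resolves to the shim instead of the real package.
sys.modules["anthropic"] = shim

import anthropic  # no network, no API key

print(anthropic.Anthropic())  # -> local-routed client
```

Because Python caches imports in sys.modules, the swap has to happen before v2's own `import anthropic` runs — which is exactly why wire_v2.py inserts the bootstrap at the top of orchestrator.py.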

For cosmos-engine-pantheon, no patcher is needed — pantheon is already designed for local. Just set four env vars (see patches/pantheon_local.md) and run its orchestrator the way you always have.

Quickstart (Windows)

See INSTALL_WINDOWS.md. Six steps, all copy-pasteable.


Hardware tiers

local_ai_4_jax/hardware_check.py reads your RAM and picks one of three tiers from models.yaml:

| Tier | RAM floor | Models pulled | Notes |
|------|-----------|---------------|-------|
| minimal | 4 GB | llama3.2:1b | Last-resort fallback. Quality is rough. Speed is great. |
| standard | 8 GB | llama3.2:1b, llama3.2:3b, qwen2.5:3b-instruct | The Inspiron tier. Default for an 8 GB laptop with no GPU. |
| comfortable | 16 GB | llama3.2:1b, llama3.2:3b, qwen2.5:7b-instruct, phi3.5 | Adds Phi-3.5 for structured output and a 7B chronicler. |

The standard tier loads one model at a time. Ollama swaps them in and out on demand — slower than holding both resident, but it fits in 8 GB. If you have more than ~12 GB you can run two Ollama instances on different ports and avoid swap entirely; see Path C in patches/pantheon_local.md.
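The tier pick reduces to "highest floor that fits." A sketch mirroring the table above — the real logic lives in local_ai_4_jax/hardware_check.py and may differ:

```python
# Illustrative tier selection matching the table above; the actual
# implementation in hardware_check.py may weigh CPU/disk as well.
TIERS = [
    ("comfortable", 16),
    ("standard", 8),
    ("minimal", 4),
]

def pick_tier(ram_gb: float) -> str:
    """Return the highest tier whose RAM floor fits within ram_gb."""
    for name, floor in TIERS:
        if ram_gb >= floor:
            return name
    return "minimal"  # below 4 GB: last-resort fallback

print(pick_tier(8))   # -> standard
print(pick_tier(16))  # -> comfortable
```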

Run the check anytime to see what you've got:

python -m local_ai_4_jax.hardware_check

How v2 and pantheon get wired in

The two projects are wired completely differently:

cosmos-engine-pantheon — env vars only

Pantheon is already designed for local. Its llm_client.py calls a local OpenAI-compatible HTTP endpoint. Ollama serves exactly that endpoint at http://127.0.0.1:11434/v1/chat/completions. To wire it in, just set four env vars:

export AGENT_SERVER_URL=http://127.0.0.1:11434
export AGENT_MODEL=llama3.2:3b
export CHRONICLER_SERVER_URL=http://127.0.0.1:11434
export CHRONICLER_MODEL=qwen2.5:3b-instruct
python orchestrator.py

That's the whole thing. No code changes. Three paths (env vars, edit config.py, run two Ollama instances) are documented in patches/pantheon_local.md.
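Under the hood, pantheon's requests to that endpoint are ordinary OpenAI-style chat completions. A sketch of the payload shape — field names follow the OpenAI chat-completions format Ollama serves; pantheon's exact internals are an assumption:

```python
# Shape of the request pantheon effectively sends to Ollama's
# /v1/chat/completions endpoint. Payload construction only; the
# actual HTTP call is commented out so this runs offline.
import json

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = chat_payload("llama3.2:3b", "Speak as Zeus in two sentences.")
print(json.dumps(payload, indent=2))

# With Ollama running, this is the call pantheon's llm_client.py makes:
# requests.post("http://127.0.0.1:11434/v1/chat/completions", json=payload)
```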

cosmos-engine-v2 — one import line

v2 is harder because it imports the real Anthropic SDK and calls anthropic.Anthropic(api_key=...).messages.create(...). There's no env var to redirect that.

LocalAI4Jax solves it with a monkey-patch bootstrap. Add one line at the top of v2's orchestrator.py:

import local_ai_4_jax.bootstrap   # routes Anthropic SDK calls to local Ollama

That's it. Every subsequent import anthropic in v2 loads our local-routed replacement instead of the real SDK. v2's source stays untouched. The shim covers messages.create() and messages.stream() — the two surfaces v2 actually uses. Anthropic model names like claude-sonnet-4-6 map to local models like llama3.2:3b automatically. Map overrides are env-var configurable.
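The model-name mapping with env-var overrides can be sketched as follows. LOCALAI_MAP_SONNET and the sonnet→llama3.2:3b default come from this README; the haiku default and the lookup-by-family approach are assumptions about the shim's internals:

```python
# Hedged sketch of the claude -> local model mapping with env
# overrides. Only LOCALAI_MAP_SONNET is documented in this README;
# the rest is illustrative.
import os

DEFAULTS = {
    "sonnet": "llama3.2:3b",
    "haiku": "llama3.2:1b",
}

def map_model(anthropic_name: str) -> str:
    """Map e.g. 'claude-sonnet-4-6' to a local Ollama model name."""
    for family, local in DEFAULTS.items():
        if family in anthropic_name:
            # Env var like LOCALAI_MAP_SONNET overrides the default.
            return os.environ.get(f"LOCALAI_MAP_{family.upper()}", local)
    return DEFAULTS["sonnet"]  # unrecognized names fall back

print(map_model("claude-sonnet-4-6"))  # -> llama3.2:3b (no override set)
os.environ["LOCALAI_MAP_SONNET"] = "qwen2.5:3b-instruct"
print(map_model("claude-sonnet-4-6"))  # -> qwen2.5:3b-instruct
```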

Full instructions: patches/cosmos_v2_local.md.

Verify before you run the real engine

Two example scripts prove the wiring before you launch a full pantheon or v2 run:

python examples/pantheon_smoke.py    # tests the OpenAI-compat path pantheon uses
python examples/v2_shim_smoke.py     # tests the Anthropic shim v2 uses

If either script prints a god speaking in two sentences with a green ✓, the engine will work. If it fails, fix that first — the failure mode will be the same when you run the full engine, but much harder to debug.


What's in the box

LocalAI4Jax/
├── README.md                       # this file
├── INSTALL_WINDOWS.md              # six-step Windows install
├── install.sh                      # one-shot Linux/Mac installer
├── models.yaml                     # curated model tiers
├── requirements.txt                # tiny — requests, psutil, pyyaml
├── pyproject.toml
├── LICENSE                         # MIT
│
├── local_ai_4_jax/
│   ├── __init__.py
│   ├── hardware_check.py           # RAM/CPU/disk report + tier pick
│   ├── ollama_helpers.py           # health check, pulls, chat completion
│   ├── anthropic_shim.py           # drop-in Anthropic SDK replacement
│   ├── bootstrap.py                # monkey-patch sys.modules['anthropic']
│   ├── verify.py                   # end-to-end smoke test
│   └── speed_test.py               # tok/sec benchmark per model
│
├── examples/
│   ├── pantheon_smoke.py           # proves pantheon's expected endpoint works
│   └── v2_shim_smoke.py            # proves the Anthropic shim works
│
└── patches/
    ├── pantheon_local.md           # how to wire pantheon (3 paths)
    └── cosmos_v2_local.md          # how to wire v2 (one import line)

Speed expectations

On a Dell Inspiron i5 with 8 GB of RAM and no GPU, ballpark numbers from the speed test:

| Model | Short turn (~100 tok) | Long chronicle (~1500 tok) |
|-------|-----------------------|----------------------------|
| llama3.2:1b | ~5–15 seconds | ~1.5–3 minutes |
| llama3.2:3b | ~15–40 seconds | ~3–7 minutes |
| qwen2.5:3b | ~20–45 seconds | ~4–8 minutes |

Numbers vary wildly by i5 generation and how busy the rest of your laptop is. A 4th-gen i5 from 2014 will be at the slow end. A 12th-gen mobile i5 from 2022 will be at the fast end. Run python -m local_ai_4_jax.speed_test on your actual machine to get real numbers.

These speeds are slow compared to Claude or GPT-4, but they are free and they run completely offline. For a narrative engine that writes one chapter every few minutes, that's plenty.


Why Ollama and not raw llama.cpp

llama.cpp is the engine under the hood, but Ollama wraps it in a way that makes the install path tractable:

  • single-binary install on Linux / macOS / Windows
  • ollama pull llama3.2:3b handles model download, quantization choice, and disk caching automatically
  • exposes an OpenAI-compatible endpoint at http://127.0.0.1:11434/v1 that pantheon already speaks unmodified
  • automatic model load/unload — multiple models can be "installed" without all being resident in RAM
  • huge community, well-documented when something goes wrong

If you'd rather use raw llama.cpp's llama-server binary, the same endpoints will work — the patches in patches/ are agnostic about which server is behind the URL. Ollama is just the easiest path.


Limitations

  • No GPU support is required, but if you have one Ollama will use it automatically. NVIDIA, AMD ROCm, and Apple Metal are all detected.
  • No vision / multimodal. The local models can't see images. The Anthropic shim flattens any image content blocks in v2's user messages to a placeholder string so the call doesn't crash.
  • No tool use / function calling. v2 doesn't use these, and pantheon doesn't either, so the shim doesn't implement them. If you need them, add them to local_ai_4_jax/anthropic_shim.py — it's ~250 lines and designed to be readable.
  • Slow on old hardware. There is no fix for this — local CPU inference of a 3B model on a 2015 i5 will always be slow. The speed test gives you accurate per-model numbers so you can decide which model to keep loaded.
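The image-flattening behaviour described above can be sketched in a few lines. The placeholder wording and exact block handling are assumptions about anthropic_shim.py, not its verbatim code:

```python
# Illustrative flattening of Anthropic-style content blocks for a
# text-only local model. Placeholder text is an assumption.
def flatten_content(blocks) -> str:
    """Collapse a list of content blocks to plain text."""
    parts = []
    for block in blocks:
        if block.get("type") == "text":
            parts.append(block["text"])
        elif block.get("type") == "image":
            # Local models have no vision; substitute a marker
            # instead of crashing on the unsupported block.
            parts.append("[image omitted: local model has no vision]")
    return "\n".join(parts)

msg = [
    {"type": "text", "text": "Describe this map."},
    {"type": "image", "source": {"type": "base64", "data": "..."}},
]
print(flatten_content(msg))
```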

License

MIT. Copy it, modify it, point it at a different engine, point it at different models. See LICENSE.

Companion projects: cosmos-engine-pantheon, cosmos-engine-v2, pantheon-council, PantheonLive.
