Local-LLM stack for cosmos-engine-v2 and cosmos-engine-pantheon, built specifically to run on an old Dell Inspiron i5 with no GPU. A drop-in replacement for the Anthropic API that costs nothing to run.
LocalAI4Jax is a small focused toolkit that does one thing: get the two cosmos-engine projects running on a low-end laptop without any cloud API keys. It does not try to be a general LLM gateway. It does not try to support every model. It supports the models that actually run well on a Dell Inspiron i5 with 8 GB of RAM and no GPU, and it wires them into the two specific engines they need to drive.
Built for Jax when his Claude API budget ran out.
- A one-shot installer — installs Ollama, picks the right model tier for your hardware, pulls the curated models, and runs a smoke test. One command on Linux/macOS, a six-step list on Windows.
- A hardware check — reports what you've got and recommends a tier before you download multi-gigabyte models that won't fit anyway.
- A drop-in `anthropic` SDK shim — cosmos-engine-v2 uses the real Anthropic SDK; this shim makes `client.messages.create()` work locally with one extra import line at the top of v2's entry point.
- A pantheon config patch — cosmos-engine-pantheon already speaks OpenAI-compat HTTP, so wiring it up is just env vars or a six-line edit.
- Verify and speed-test scripts — prove the install works end-to-end and benchmark each model on your specific hardware so you know what to expect.
```bash
# 1. install LocalAI4Jax — installs Ollama, pulls models, smoke-tests
git clone https://github.com/JacobFlorio/LocalAI4Jax.git
cd LocalAI4Jax
./install.sh

# 2. wire your existing cosmos-engine-v2 — auto-patches orchestrator.py
python wire_v2.py /path/to/cosmos-engine-v2

# 3. (recommended) better chapter prose
export LOCALAI_MAP_SONNET=qwen2.5:3b-instruct

# 4. prove the shim works on your machine before launching the real engine
python examples/v2_shim_smoke.py

# 5. run v2 the same way you would have before
cd /path/to/cosmos-engine-v2/cosmos-engine
python orchestrator.py
```

That's the whole thing for cosmos-engine-v2. No manual edits. `wire_v2.py` inserts a 4-line bootstrap block at the top of v2's `orchestrator.py` (with a backup), and from then on every `import anthropic` in v2 silently routes to local Ollama. Undo any time with `python wire_v2.py --revert /path/to/cosmos-engine-v2`.
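The patch-with-backup approach can be sketched in a few lines. This is illustrative, not wire_v2.py's actual implementation — the bootstrap string and backup naming are assumptions:

```python
from pathlib import Path
import shutil

BOOTSTRAP = "import local_ai_4_jax.bootstrap  # route Anthropic SDK calls to Ollama\n"

def patch(orchestrator: Path) -> None:
    """Prepend the bootstrap import once, keeping a .bak copy for --revert."""
    text = orchestrator.read_text()
    if BOOTSTRAP in text:  # idempotent: never double-patch
        return
    shutil.copy2(orchestrator, orchestrator.with_suffix(".py.bak"))
    orchestrator.write_text(BOOTSTRAP + text)

def revert(orchestrator: Path) -> None:
    """Restore the pre-patch backup if one exists."""
    backup = orchestrator.with_suffix(".py.bak")
    if backup.exists():
        shutil.copy2(backup, orchestrator)
```

The backup file is what makes `--revert` safe: the original source is never rewritten from memory, only copied back.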
For cosmos-engine-pantheon, no patcher is needed — pantheon is already
designed for local. Just set four env vars (see patches/pantheon_local.md)
and run its orchestrator the way you always have.
On Windows, see `INSTALL_WINDOWS.md`: six steps, all copy-pasteable.
`local_ai_4_jax/hardware_check.py` reads your RAM and picks one of three tiers from `models.yaml`:
| Tier | RAM floor | Models pulled | Notes |
|---|---|---|---|
| minimal | 4 GB | llama3.2:1b | Last-resort fallback. Quality is rough. Speed is great. |
| standard | 8 GB | llama3.2:1b, llama3.2:3b, qwen2.5:3b-instruct | The Inspiron tier. Default for an 8 GB laptop with no GPU. |
| comfortable | 16 GB | llama3.2:1b, llama3.2:3b, qwen2.5:7b-instruct, phi3.5 | Adds Phi-3.5 for structured output and a 7B chronicler. |
The standard tier loads one model at a time. Ollama swaps them in and out
on demand — slower than holding both resident, but it fits in 8 GB. If you
have more than ~12 GB you can run two Ollama instances on different ports
and avoid swap entirely; see Path C in patches/pantheon_local.md.
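The tier pick described above amounts to a threshold lookup against reported RAM. A minimal sketch of that logic (the thresholds mirror the table; the function name and structure are ours, not hardware_check.py's actual code):

```python
# Tiers mirror models.yaml: (name, RAM floor in GB), checked highest first.
TIERS = [
    ("comfortable", 16),
    ("standard", 8),
    ("minimal", 4),
]

def pick_tier(ram_gb: float) -> str:
    """Return the highest tier whose RAM floor fits the reported RAM."""
    for name, floor in TIERS:
        if ram_gb >= floor:
            return name
    return "minimal"  # below every floor: fall back to the smallest models
```

In the real tool the `ram_gb` input would come from something like `psutil.virtual_memory().total / 2**30` (psutil is in requirements.txt).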
Run the check anytime to see what you've got:

```bash
python -m local_ai_4_jax.hardware_check
```

The two projects are wired completely differently:
Pantheon is already designed for local. Its `llm_client.py` calls a local OpenAI-compatible HTTP endpoint. Ollama serves exactly that endpoint at `http://127.0.0.1:11434/v1/chat/completions`. To wire it in, just set four env vars:
```bash
export AGENT_SERVER_URL=http://127.0.0.1:11434
export AGENT_MODEL=llama3.2:3b
export CHRONICLER_SERVER_URL=http://127.0.0.1:11434
export CHRONICLER_MODEL=qwen2.5:3b-instruct
python orchestrator.py
```

That's the whole thing. No code changes. Three paths (env vars, editing `config.py`, running two Ollama instances) are documented in `patches/pantheon_local.md`.
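For reference, what pantheon ends up sending is a plain OpenAI-style chat completion request. A minimal sketch of the payload shape using only the stdlib (the helper name is ours; the endpoint path is the one Ollama serves):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the OpenAI-compatible request Ollama serves at /v1/chat/completions."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Sending it (requires a running Ollama instance):
# req = build_chat_request("http://127.0.0.1:11434", "llama3.2:3b", "Speak, god.")
# reply = json.loads(urllib.request.urlopen(req).read())
# text = reply["choices"][0]["message"]["content"]
```

Any server that accepts this shape — Ollama, llama-server, anything OpenAI-compatible — will work behind the same env vars.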
v2 is harder because it imports the real Anthropic SDK and calls `anthropic.Anthropic(api_key=...).messages.create(...)`. There's no env var to redirect that.
LocalAI4Jax solves it with a monkey-patch bootstrap. Add one line
at the top of v2's `orchestrator.py`:

```python
import local_ai_4_jax.bootstrap  # routes Anthropic SDK calls to local Ollama
```

That's it. Every subsequent `import anthropic` in v2 loads our local-routed replacement instead of the real SDK. v2's source stays untouched. The shim covers `messages.create()` and `messages.stream()` — the two surfaces v2 actually uses. Anthropic model names like `claude-sonnet-4-6` map to local models like `llama3.2:3b` automatically. Map overrides are env-var configurable.
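The name mapping can be pictured as a family-substring match with an env-var override. This is a sketch: `LOCALAI_MAP_SONNET` appears in the quick start above, but the default routes and the other `LOCALAI_MAP_*` names here are our assumptions, not the shim's exact table:

```python
import os

# Illustrative defaults; LOCALAI_MAP_<FAMILY> env vars override them.
DEFAULTS = {
    "haiku": "llama3.2:1b",
    "sonnet": "llama3.2:3b",
    "opus": "qwen2.5:3b-instruct",
}

def map_model(anthropic_name: str) -> str:
    """Map an Anthropic model id like 'claude-sonnet-4-6' to a local Ollama model."""
    for family, local in DEFAULTS.items():
        if family in anthropic_name:
            return os.environ.get(f"LOCALAI_MAP_{family.upper()}", local)
    return DEFAULTS["sonnet"]  # unknown names fall back to the mid-size model
```

This is why `export LOCALAI_MAP_SONNET=qwen2.5:3b-instruct` upgrades chapter prose without touching v2: the sonnet-family calls simply resolve to a different local model.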
Full instructions: patches/cosmos_v2_local.md.
Two example scripts prove the wiring before you launch a full pantheon or v2 run:
```bash
python examples/pantheon_smoke.py   # tests the OpenAI-compat path pantheon uses
python examples/v2_shim_smoke.py    # tests the Anthropic shim v2 uses
```

If either script prints a god speaking in two sentences with a green ✓, the engine will work. If it fails, fix that first — the failure mode will be the same when you run the full engine, but much harder to debug.
```text
LocalAI4Jax/
├── README.md                # this file
├── INSTALL_WINDOWS.md       # six-step Windows install
├── install.sh               # one-shot Linux/Mac installer
├── models.yaml              # curated model tiers
├── requirements.txt         # tiny — requests, psutil, pyyaml
├── pyproject.toml
├── LICENSE                  # MIT
│
├── local_ai_4_jax/
│   ├── __init__.py
│   ├── hardware_check.py    # RAM/CPU/disk report + tier pick
│   ├── ollama_helpers.py    # health check, pulls, chat completion
│   ├── anthropic_shim.py    # drop-in Anthropic SDK replacement
│   ├── bootstrap.py         # monkey-patch sys.modules['anthropic']
│   ├── verify.py            # end-to-end smoke test
│   └── speed_test.py        # tok/sec benchmark per model
│
├── examples/
│   ├── pantheon_smoke.py    # proves pantheon's expected endpoint works
│   └── v2_shim_smoke.py     # proves the Anthropic shim works
│
└── patches/
    ├── pantheon_local.md    # how to wire pantheon (3 paths)
    └── cosmos_v2_local.md   # how to wire v2 (one import line)
```
On a Dell Inspiron i5 with 8 GB of RAM and no GPU, ballpark numbers from the speed test:
| Model | Short turn (~100 tok) | Long chronicle (~1500 tok) |
|---|---|---|
| llama3.2:1b | ~5–15 seconds | ~1.5–3 minutes |
| llama3.2:3b | ~15–40 seconds | ~3–7 minutes |
| qwen2.5:3b | ~20–45 seconds | ~4–8 minutes |
Numbers vary wildly by i5 generation and how busy the rest of your laptop
is. A 4th-gen i5 from 2014 will be at the slow end. A 12th-gen mobile i5
from 2022 will be at the fast end. Run `python -m local_ai_4_jax.speed_test`
on your actual machine to get real numbers.
These speeds are slow compared to Claude or GPT-4, but they are free and they run completely offline. For a narrative engine that writes one chapter every few minutes, that's plenty.
llama.cpp is the engine under the hood, but Ollama wraps it in a way that makes the install path tractable:
- single-binary install on Linux / macOS / Windows
- `ollama pull llama3.2:3b` handles model download, quantization choice, and disk caching automatically
- exposes an OpenAI-compatible endpoint at `http://127.0.0.1:11434/v1` that pantheon already speaks unmodified
- automatic model load/unload — multiple models can be "installed" without all being resident in RAM
- huge community, well-documented when something goes wrong
If you'd rather use raw llama.cpp's llama-server binary, the same
endpoints will work — the patches in patches/ are agnostic about which
server is behind the URL. Ollama is just the easiest path.
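If you do go the raw llama.cpp route, the swap is roughly this (the GGUF path is a placeholder; check your llama.cpp build for exact flags):

```bash
# llama.cpp's llama-server also speaks /v1/chat/completions
./llama-server -m /path/to/llama3.2-3b.Q4_K_M.gguf --port 11434

# then the pantheon env vars from above work unchanged:
export AGENT_SERVER_URL=http://127.0.0.1:11434
```

The trade-off is that you manage model files and quantization choices yourself instead of letting `ollama pull` handle them.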
- No GPU support is required, but if you have one Ollama will use it automatically. NVIDIA, AMD ROCm, and Apple Metal are all detected.
- No vision / multimodal. The local models can't see images. The Anthropic shim flattens any image content blocks in v2's user messages to a placeholder string so the call doesn't crash.
- No tool use / function calling. v2 doesn't use these, and pantheon doesn't either, so the shim doesn't implement them. If you need them, add them to `local_ai_4_jax/anthropic_shim.py` — it's ~250 lines and designed to be readable.
- Slow on old hardware. There is no fix for this — local CPU inference of a 3B model on a 2015 i5 will always be slow. The speed test gives you accurate per-model numbers so you can decide which model to keep loaded.
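The image-flattening behavior described above can be pictured like this. The helper name and placeholder text are illustrative, not the shim's exact code:

```python
def flatten_content(blocks) -> str:
    """Collapse Anthropic-style content blocks to plain text for a text-only model."""
    if isinstance(blocks, str):  # the Messages API also accepts a bare string
        return blocks
    parts = []
    for block in blocks:
        if block.get("type") == "text":
            parts.append(block["text"])
        elif block.get("type") == "image":
            parts.append("[image omitted: local model has no vision]")
    return "\n".join(parts)
```

The point is that a message containing an image block degrades to a visible placeholder instead of a crash when it reaches a text-only local model.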
MIT. Copy it, modify it, point it at a different engine, point it at different models. See LICENSE.
Companion projects: cosmos-engine-pantheon, cosmos-engine-v2, pantheon-council, PantheonLive.