Skip to content

2imi9/LocalPilot

Repository files navigation

LocalPilot

teaser

One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026.

LocalPilot extends the original autoresearch with a web-enhanced research loop: a visual web agent (MolmoWeb-4B or MolmoWeb-8B) browses recent arXiv papers and a local code agent (Devstral / Qwen-Coder) generates experiment scripts — all running on your own GPU, no cloud APIs required.

Results

Comparing 78 baseline experiments vs 53 web-enhanced experiments, both starting from the same config:

Baseline Web-Enhanced
Best val_bpb 1.122049 1.118972
Total improvement −0.000489 −0.003566
Experiments used 78 53 (−32%)
Final streak 0/10 plateau 0/33 plateau at better value

Web-enhanced search achieves 7.3x more improvement in fewer experiments. All 5 improvements are traceable to specific arXiv papers (2023–2026).

How it works

The repo builds on the original three core files:

  • prepare.py — one-time data prep (downloads FineWeb, trains BPE tokenizer)
  • train.py — the single file the agent edits (model, optimizer, training loop)
  • program.md — agent instructions

LocalPilot adds:

  • localpilot/browse.py — MolmoWeb-4B/8B visual web agent for arXiv paper retrieval
  • experiments/run_baseline.py — greedy hill-climbing runner (Condition A baseline)
  • experiments/run_web.py — paper-grounded experiment runner (Condition B enhanced)
  • localpilot/config.py — hardware-aware model selection (auto-detects VRAM, picks best local model)
  • localpilot/analyze.py — result analysis and figure generation
  • localpilot.yaml — optional config overrides

Quick start

Requirements: Single NVIDIA GPU (8+ GB VRAM), Python 3.10+, uv

# 1. Clone and install
git clone https://github.com/2imi9/LocalPilot.git
cd LocalPilot
uv sync

# 2. Download data and tokenizer (one-time, ~2 min)
uv run prepare.py

# 3. Check your hardware and recommended models
python -m localpilot.config --show

# 4. Download models for your GPU
python -m localpilot.config --download-web-agent    # MolmoWeb-4B (~8 GB) or MolmoWeb-8B (~18 GB)
python -m localpilot.config --download-code-agent   # auto-selected based on VRAM

# 5. Run a single training test (~2 min)
uv run train.py

Choosing your models

LocalPilot auto-selects models based on your GPU VRAM:

python -m localpilot.config --models
  Available Web Agent Models:
  Key            VRAM   Description
  MolmoWeb-4B     8 GB  4B visual web agent — fits most GPUs        <- RTX 3080 and up
  MolmoWeb-8B    18 GB  8B visual web agent — state-of-the-art      <- RTX 4090 / 5090 *

  * MolmoWeb-8B is based on Qwen3-8B + SigLIP2, surpasses GPT-4o SoM agents (arXiv:2601.10611)

  Available Code Agent Models:
  Key                   VRAM   SWE-bench  Description
  [ ] Devstral-24B-Q8   25 GB     68.0%   Maximum quality
  [Y] Devstral-24B-Q6   20 GB     67.5%   High quality          <- RTX 4090 / 5090
  [Y] Devstral-24B-Q4   14 GB     66.0%   Good quality          <- RTX 3090 / 4080
  [Y] Qwen-Coder-14B-Q6 12 GB     37.0%   Solid coder           <- RTX 3080
  [Y] Qwen-Coder-7B-Q4   5 GB     33.0%   Lightweight           <- RTX 3060
  [Y] Qwen-Coder-7B-CPU  0 GB     33.0%   CPU only (any machine)

Override in localpilot.yaml:

web_agent: MolmoWeb-8B
code_agent: Devstral-24B-Q4

Or via environment variable:

LOCALPILOT_CODE_AGENT=Qwen-Coder-7B-Q4 python experiments/run_web.py

Running experiments

Baseline (random greedy search):

python experiments/run_baseline.py

Web-enhanced (paper-grounded search):

python experiments/run_web.py

Analyze results:

python -m localpilot.analyze
# Outputs: figures/fig1_trajectory.png, table1_summary.tsv

Starting a new project

# Creates a fresh LocalPilot project in a new directory
.\scripts\new_project.ps1 -Name "MyResearch" -Dest "C:\Projects"

VRAM usage (sequential, never simultaneous)

Phase What runs VRAM
Browse arXiv MolmoWeb-4B or MolmoWeb-8B ~8–18 GB
Generate experiment script Code agent (Devstral/Qwen) 14–25 GB
Training train.py ~6 GB

All three phases are sequential — the models load and unload between phases, so a 20 GB GPU comfortably handles Devstral Q6 + MolmoWeb-4B + training without overlap. A 24 GB GPU (e.g. RTX 5090) can also run MolmoWeb-8B for higher web agent quality.

Cost

The expensive part — training — always runs locally. The code agent is optional: use a local model (free) or an external API.

Training cost (always local)

LocalPilot (local GPU) Cloud H100 (Lambda)
Per experiment (~5 min) ~$0.0016 electricity ~$0.207
53-experiment run $0.09 ~$11.00
Cost per 1M tokens trained $0.000007 ~$0.045
Savings ~120× cheaper

Calculated at $0.13/kWh (US average), RTX 5090 Laptop GPU at 150W TDP, vs Lambda H100 at $2.49/hr.

Code agent cost (per 53-experiment run)

The code agent generates experiment patches — typically ~5K input + ~1K output tokens per call.

Code agent Per experiment 53 experiments Notes
Local Devstral / Qwen (electricity) ~$0.00 ~$0.00 Runs on your GPU between training phases
Claude Haiku 3.5 (API) ~$0.003 ~$0.16 Cheapest frontier option
Claude Sonnet 3.5 (API) ~$0.030 ~$1.59 High quality
GPT-4o (API) ~$0.040 ~$2.12 High quality

Total for 53 experiments (training + code agent):

  • Local models only: ~$0.09 (electricity)
  • Local training + Claude Haiku API: ~$0.25
  • Local training + GPT-4o API: ~$2.21 — still 5× cheaper than full cloud H100

Note: This excludes hardware amortization. If you already own the GPU (e.g. for gaming), the marginal cost is just electricity + any API fees.

Design choices

  • Single file to modify. The agent only touches train.py. Diffs are always reviewable.
  • Fixed time budget. ~2 min per experiment regardless of model size or batch size. Experiments are directly comparable.
  • Local models only. No cloud APIs. MolmoWeb-4B/8B and the code agent both run on your GPU.
  • Hardware-aware. localpilot/config.py detects your VRAM and picks the best model that fits — MolmoWeb-8B on ≥18 GB, MolmoWeb-4B on ≥8 GB.

Platform support

Requires a single NVIDIA GPU. For other platforms see the original autoresearch forks.

License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors