73 changes: 45 additions & 28 deletions README.md
@@ -31,29 +31,35 @@ them. Full write-up in

## How it works

Four files that matter:
Four things that matter:

- **`config.json`** — FreqTrade config, fixed. Pairs, timeframe, fees, dry-run
wallet, timerange. The agent does not touch this.
- **`prepare.py`** — one-time data download from Binance via FreqTrade's Python
API. The agent does not touch this.
- **`run.py`** — in-process backtest. Calls FreqTrade's `Backtesting` class
directly (no CLI), extracts key metrics, prints a parseable `---` summary
block to stdout. The agent does not touch this.
- **`user_data/strategies/AutoResearch.py`** — **the only file the agent edits**.
Contains the full strategy: indicators, entry/exit logic, ROI/stoploss. This
is the `train.py` equivalent of Karpathy's setup.
- **`run.py`** — in-process **batch backtest**. Discovers every `.py` under
`user_data/strategies/` (skipping files prefixed `_`), runs FreqTrade's
`Backtesting` for each, and prints one `---` summary block per strategy.
The agent does not touch this.
- **`user_data/strategies/`** — **the directory the agent owns**. Each `.py`
is one strategy; up to 3 active at a time. Agent creates / evolves / forks
/ kills strategies here. `_template.py.example` is the skeleton reference.
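
The discovery rule described above (every `.py` under `user_data/strategies/`, skipping `_`-prefixed files such as the template) can be sketched as follows. This is an illustrative sketch, not `run.py`'s actual code; the function name is hypothetical.

```python
from pathlib import Path

def discover_strategies(strategies_dir: str = "user_data/strategies") -> list[Path]:
    """Sketch of the discovery convention: every .py file in the
    directory counts as one strategy, except files whose name starts
    with "_" (so _template.py.example is ignored twice over — by
    prefix and by extension)."""
    return sorted(
        p for p in Path(strategies_dir).glob("*.py")
        if not p.name.startswith("_")
    )
```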

Plus:

- **`program.md`** — the autonomous-research instructions the human points the
LLM agent at. Direct analog of Karpathy's `program.md`.
- **`results.tsv`** — the journal. `commit | sharpe | max_dd | status | description`.
Git-ignored so it survives `git reset --hard` when the agent rolls back a
failed experiment — past lessons stay available even when experimental
LLM agent at.
- **`results.tsv`** — event log. Schema: `commit | event | strategy_name | sharpe | max_dd | note`.
Events: `create | evolve | stable | fork | kill`. Gitignored so it survives
`git reset --hard` — past lessons stay available even when experimental
commits get thrown away.
- **`analysis.ipynb`** — post-hoc read of `results.tsv` once the loop has
collected some data.
- **`analysis.ipynb`** — post-hoc read: per-strategy trajectories, cap
utilization, event distribution, note word frequency.
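
Appending one round's row in the `results.tsv` schema above can be sketched like this. The column order comes from the bullet; the function name is illustrative, not part of the repo.

```python
import csv
import os

COLUMNS = ["commit", "event", "strategy_name", "sharpe", "max_dd", "note"]

def append_event(path, commit, event, strategy_name, sharpe, max_dd, note):
    """Append one event row to the gitignored TSV journal,
    writing the header first if the file does not exist yet."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        if write_header:
            w.writerow(COLUMNS)
        w.writerow([commit, event, strategy_name, sharpe, max_dd, note])
```

Because rows are only ever appended, the file doubles as a chronological event log — the row order is the event order, which is what `analysis.ipynb` relies on.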

*(v0.1.0 used a single `AutoResearch.py` file that the agent mutated in place.
That mode anchored the agent on one paradigm for all 99 rounds. v0.2.0
switched to multi-strategy; v0.1.0 is archived under [`versions/0.1.0/`](versions/0.1.0/)
with a full [retrospective](versions/0.1.0/retrospective.md).)*

## Requirements

@@ -79,11 +85,15 @@ uv sync
# 4. One-time data download (~a few minutes)
uv run prepare.py

# 5. Sanity check — run the baseline backtest
uv run run.py > run.log 2>&1 && grep "^---" -A 12 run.log
# 5. Sanity check — with no strategies yet, run.py should report
# "no strategies found" and exit. That's expected — the agent creates
# 1-3 starting strategies during setup before the first real backtest.
uv run run.py > run.log 2>&1; echo "exit=$?"
```

If step 5 prints a `---` block ending with a `pairs:` line, you're ready.
If step 5 prints `no strategies found...` and `exit=2`, you're ready. (An
actual backtest run only starts once the agent has created at least one
strategy file.)
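
Once strategies exist, `run.py` is described as printing one `---`-delimited summary block per strategy. A minimal sketch of pulling those blocks out of `run.log`, assuming `key: value` lines between `---` delimiters (the exact keys are assumptions, not the real output format):

```python
def parse_summary_blocks(log_text: str) -> list[dict]:
    """Collect ----delimited summary blocks from backtest output.
    Each block becomes a dict of its "key: value" lines."""
    blocks, current = [], None
    for line in log_text.splitlines():
        if line.strip() == "---":
            if current is None:
                current = {}            # opening delimiter
            else:
                blocks.append(current)  # closing delimiter
                current = None
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return blocks
```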

## Running the agent

@@ -131,27 +141,34 @@ Auto-Quant/
├── analysis.ipynb # post-hoc analysis
├── user_data/
│ ├── strategies/
│ │ └── AutoResearch.py # THE one file the agent edits
│ │ ├── _template.py.example # skeleton the agent copies from
│ │ └── <agent-created files>.py # up to 3 active at a time
│ ├── data/ # gitignored — downloaded OHLCV
│ └── backtest_results/ # gitignored — FreqTrade outputs
└── results.tsv # gitignored — agent's journal
├── versions/ # frozen snapshots of past runs
└── results.tsv # gitignored — agent's event log
```

## Design notes

- **Agent only modifies one file.** All other files are the evaluation
contract. This is the single biggest design decision; it keeps diffs
reviewable and prevents Goodharting the metric.
- **Agent owns one directory, not one file.** `user_data/strategies/` is its
  workspace; everything else is the evaluation contract. Up to 3 strategies
simultaneously, hard cap. Multi-strategy exists specifically to fight
the single-paradigm anchoring that v0.1.0 exhibited.
- **No CLI indirection.** The agent only runs `uv run prepare.py` and
`uv run run.py`. `run.py` uses FreqTrade's `Backtesting` class in-process,
so startup is fast and errors surface as real Python stack traces.
- **`results.tsv` is gitignored.** When the agent reverts a failed
experiment with `git reset --hard`, the journal of what was tried
survives. Essential to avoid re-trying the same bad ideas.
- **LLM decides keep/discard, not a scalar rule.** Backtest Sharpe on a
finite window is noisy. Rather than `if new_sharpe > old_sharpe: keep`,
the agent reads the full summary block and decides based on sharpe +
drawdown + trade count + its own read on the asset.
- **`results.tsv` is a gitignored event log.** Each round, the agent appends
rows (one per strategy touched, with event type: create/evolve/stable/fork/kill).
It survives `git reset --hard` so past lessons stay available even when
experimental commits get thrown away.
- **LLM decides keep/kill, not a scalar rule.** Sharpe on a finite window
is noisy and gameable. Agent reads the full per-strategy summary blocks
and decides inline which strategies to evolve, fork, or kill — the
program.md rules force action but not which action.
- **Stagnation rule.** A strategy can't sit idle for more than 3 consecutive
stable rounds — agent must evolve, fork, or kill it. With only 3 slots,
dead weight is expensive.
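
The stagnation rule can be checked mechanically from the event log. A sketch, assuming the `results.tsv` schema above with rows in chronological order (the function name is illustrative):

```python
def stagnant_strategies(events, max_stable=3):
    """Given (event, strategy_name) pairs in chronological order,
    return the strategies that have sat through max_stable consecutive
    'stable' rounds — the ones now due to be evolved, forked, or killed."""
    streaks = {}
    for event, name in events:
        if event == "stable":
            streaks[name] = streaks.get(name, 0) + 1
        elif event == "kill":
            streaks.pop(name, None)  # dead strategies can't stagnate
        else:
            streaks[name] = 0        # create/evolve/fork reset the streak
    return [name for name, n in streaks.items() if n >= max_stable]
```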

## License

94 changes: 10 additions & 84 deletions analysis.ipynb
@@ -3,138 +3,64 @@
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Auto-Quant Experiment Analysis\n",
"\n",
"Analysis of autonomous strategy-iteration results from `results.tsv`.\n",
"\n",
"Run this after the agent has collected some experiments."
]
"source": "# Auto-Quant Experiment Analysis — v0.2.0 multi-strategy event log\n\nReads `results.tsv` (event log schema: `commit | event | strategy_name | sharpe | max_dd | note`) and produces per-strategy timelines, active-slot utilization, event distributions, and note word frequency.\n\nRun this after the agent has accumulated some rounds of events."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"df = pd.read_csv(\"results.tsv\", sep=\"\\t\")\n",
"df[\"sharpe\"] = pd.to_numeric(df[\"sharpe\"], errors=\"coerce\")\n",
"df[\"max_dd\"] = pd.to_numeric(df[\"max_dd\"], errors=\"coerce\")\n",
"df[\"status\"] = df[\"status\"].str.strip().str.lower()\n",
"\n",
"print(f\"Total experiments: {len(df)}\")\n",
"print(f\"Columns: {list(df.columns)}\")\n",
"df.head(10)"
]
"source": "import pandas as pd\nimport matplotlib.pyplot as plt\nimport numpy as np\n\ndf = pd.read_csv(\"results.tsv\", sep=\"\\t\")\ndf[\"sharpe\"] = pd.to_numeric(df[\"sharpe\"], errors=\"coerce\")\ndf[\"max_dd\"] = pd.to_numeric(df[\"max_dd\"], errors=\"coerce\")\ndf[\"event\"] = df[\"event\"].str.strip().str.lower()\n\n# For fork events, strategy_name is \"parent→child\" — split into two columns\n# so we can treat the child as the new strategy moving forward.\ndef _canonical(name: str) -> str:\n if isinstance(name, str) and \"→\" in name:\n return name.split(\"→\", 1)[1]\n return name\n\ndf[\"parent\"] = df[\"strategy_name\"].map(\n lambda s: s.split(\"→\", 1)[0] if isinstance(s, str) and \"→\" in s else None\n)\ndf[\"strategy\"] = df[\"strategy_name\"].map(_canonical)\n\n# Give each row a time-ordered index (the row order IS the event order\n# because we append as we go). Keep \"commit\" for reference.\ndf = df.reset_index(drop=True)\ndf[\"round_idx\"] = df.index\n\nprint(f\"Total events: {len(df)}\")\nprint(f\"Unique strategies ever seen: {df['strategy'].nunique()}\")\nprint(f\"Columns: {list(df.columns)}\")\ndf.head(10)"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"counts = df[\"status\"].value_counts()\n",
"print(\"Experiment outcomes:\")\n",
"print(counts.to_string())\n",
"\n",
"n_keep = counts.get(\"keep\", 0)\n",
"n_discard = counts.get(\"discard\", 0)\n",
"n_crash = counts.get(\"crash\", 0)\n",
"n_decided = n_keep + n_discard\n",
"if n_decided > 0:\n",
" print(f\"\\nKeep rate: {n_keep}/{n_decided} = {n_keep / n_decided:.1%}\")\n",
" print(f\"Crash rate: {n_crash}/{len(df)} = {n_crash / len(df):.1%}\")"
]
"source": "# Event type distribution\nevent_counts = df[\"event\"].value_counts()\nprint(\"Events:\")\nprint(event_counts.to_string())\n\nprint(f\"\\nTotal strategies created: {event_counts.get('create', 0) + event_counts.get('fork', 0)}\")\nprint(f\"Total strategies killed: {event_counts.get('kill', 0)}\")\nprint(f\"Evolve moves: {event_counts.get('evolve', 0)}\")\nprint(f\"Stable rounds: {event_counts.get('stable', 0)}\")\n\n# Which strategies are still alive at the end?\n# A strategy is alive iff its most recent event is NOT 'kill'\nlast_event_per = df.groupby(\"strategy\").tail(1).set_index(\"strategy\")\nalive = last_event_per[last_event_per[\"event\"] != \"kill\"].index.tolist()\ndead = last_event_per[last_event_per[\"event\"] == \"kill\"].index.tolist()\nprint(f\"\\nStill alive at end ({len(alive)}): {alive}\")\nprint(f\"Killed during run ({len(dead)}): {dead}\")"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# All KEPT experiments — the ones that stuck\n",
"kept = df[df[\"status\"] == \"keep\"].copy().reset_index(drop=True)\n",
"print(f\"KEPT experiments ({len(kept)}):\\n\")\n",
"for i, row in kept.iterrows():\n",
" print(f\" #{i:3d} sharpe={row['sharpe']:.4f} dd={row['max_dd']:.1f}% {row['description']}\")"
]
"source": "# Peak sharpe per strategy (across its lifetime)\npeak = df.dropna(subset=[\"sharpe\"]).groupby(\"strategy\")[\"sharpe\"].max().sort_values(ascending=False)\nprint(\"Peak Sharpe per strategy:\\n\")\nfor name, s in peak.items():\n last_evt = last_event_per.loc[name, \"event\"]\n status = \"☠ killed\" if last_evt == \"kill\" else \"✓ alive\"\n print(f\" {status} {name:30s} peak_sharpe={s:.4f}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sharpe Frontier Over Time\n",
"\n",
"Running max of sharpe as experiments accumulate. The plateau shows where\n",
"the agent stopped finding improvements."
]
"source": "## Per-strategy Sharpe trajectories\n\nEach active line = one strategy's sharpe over its lifetime. Gaps indicate\n`stable` rounds where the strategy wasn't modified. A line ending in a dot\nmeans the strategy was killed at that point. Scatter markers by event type\nlet you see where the agent chose to evolve / fork / kill each strategy."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "df_ordered = df.reset_index(drop=True)\n\n# Running best must only consider KEEP experiments. Discarded runs\n# (including retroactive discards — e.g. agent rolling back Goodhart\n# wins) were explicitly rejected and should not move the frontier.\nkept_sharpe = df_ordered[\"sharpe\"].where(df_ordered[\"status\"] == \"keep\")\ndf_ordered[\"running_max_sharpe\"] = kept_sharpe.cummax()\n\nstatus_color = {\"keep\": \"tab:green\", \"discard\": \"tab:gray\", \"crash\": \"tab:red\"}\n\nfig, ax = plt.subplots(figsize=(10, 5))\nfor status, color in status_color.items():\n mask = df_ordered[\"status\"] == status\n ax.scatter(df_ordered.index[mask], df_ordered[\"sharpe\"][mask],\n alpha=0.5, c=color, label=status)\nax.plot(df_ordered.index, df_ordered[\"running_max_sharpe\"],\n color=\"red\", label=\"running best (keep-only)\")\nax.set_xlabel(\"experiment #\")\nax.set_ylabel(\"sharpe\")\nax.set_title(\"Sharpe frontier\")\nax.legend()\nax.grid(alpha=0.3)\nplt.show()"
"source": "fig, ax = plt.subplots(figsize=(12, 6))\n\nevent_marker = {\"create\": \"o\", \"evolve\": \".\", \"fork\": \"^\", \"stable\": \",\", \"kill\": \"x\"}\n\nfor name, g in df.dropna(subset=[\"sharpe\"]).groupby(\"strategy\"):\n g = g.sort_values(\"round_idx\")\n ax.plot(g[\"round_idx\"], g[\"sharpe\"], alpha=0.5, label=name)\n for evt, marker in event_marker.items():\n sub = g[g[\"event\"] == evt]\n if len(sub) and evt != \"stable\": # don't clutter with stable markers\n ax.scatter(sub[\"round_idx\"], sub[\"sharpe\"], marker=marker, s=50, alpha=0.9)\n\n# Mark kills explicitly (last-event-of-strategy with no sharpe)\nfor name, g in df.groupby(\"strategy\"):\n last = g.iloc[-1]\n if last[\"event\"] == \"kill\":\n # put an x at y=0 at the kill position to signal strategy ended\n ax.scatter([last[\"round_idx\"]], [0], marker=\"x\", c=\"red\", s=80)\n\nax.axhline(0, color=\"gray\", linewidth=0.5, alpha=0.5)\nax.set_xlabel(\"event # (chronological)\")\nax.set_ylabel(\"sharpe\")\nax.set_title(\"Per-strategy Sharpe trajectories\")\nax.legend(loc=\"best\", fontsize=8)\nax.grid(alpha=0.3)\nplt.show()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sharpe vs Drawdown Scatter\n",
"\n",
"Any kept experiment with drawdown way worse than baseline is probably\n",
"over-fitting to a specific regime."
]
"source": "## Active strategy count over time\n\nWith a hard cap of 3, the count should mostly sit at 3 (agent keeps slots full)\nand occasionally dip to 2 when a strategy is killed without immediate replacement.\nSustained dips below 3 mean the agent isn't filling the cap — either exploring\ncautiously or running out of ideas."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(figsize=(8, 6))\n",
"colors = {\"keep\": \"green\", \"discard\": \"gray\", \"crash\": \"red\"}\n",
"for status, g in df.groupby(\"status\"):\n",
" ax.scatter(g[\"max_dd\"], g[\"sharpe\"], alpha=0.6, label=status, c=colors.get(status, \"black\"))\n",
"ax.set_xlabel(\"max drawdown %\")\n",
"ax.set_ylabel(\"sharpe\")\n",
"ax.set_title(\"Sharpe vs drawdown, colored by outcome\")\n",
"ax.legend()\n",
"ax.grid(alpha=0.3)\n",
"plt.show()"
]
"source": "# Walk the event log, maintain a set of active strategies, record count per event.\n# `fork` is counted as a create for the child (+1), `kill` as -1, create as +1.\nactive = set()\ncounts = []\nfor _, row in df.iterrows():\n name = row[\"strategy\"]\n evt = row[\"event\"]\n if evt == \"create\" or evt == \"fork\":\n active.add(name)\n elif evt == \"kill\":\n active.discard(name)\n # evolve / stable don't change membership\n counts.append(len(active))\n\ndf[\"active_count\"] = counts\n\nfig, ax = plt.subplots(figsize=(12, 3))\nax.plot(df[\"round_idx\"], df[\"active_count\"], drawstyle=\"steps-post\")\nax.axhline(3, color=\"red\", linestyle=\"--\", alpha=0.3, label=\"cap (3)\")\nax.set_xlabel(\"event #\")\nax.set_ylabel(\"active strategies\")\nax.set_title(\"Cap utilization over time\")\nax.set_ylim(0, 4)\nax.legend()\nax.grid(alpha=0.3)\nplt.show()"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Description Word Frequency\n",
"\n",
"What themes show up in the agent's own descriptions of what it tried?\n",
"A rough proxy for which directions it explored."
]
"source": "## Note word frequency\n\nRough proxy for what paradigms, indicators, and failure modes the agent\nthought about during this run. Skim the top 30 — if you see only one\nparadigm family dominating, anchoring is creeping in."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"import re\n",
"\n",
"text = \" \".join(df[\"description\"].dropna().astype(str).str.lower().tolist())\n",
"words = re.findall(r\"[a-z]{3,}\", text)\n",
"stop = {\"the\", \"and\", \"for\", \"with\", \"this\", \"that\", \"from\", \"was\", \"too\", \"add\", \"added\", \"use\", \"using\"}\n",
"top = Counter(w for w in words if w not in stop).most_common(25)\n",
"print(\"Top words across all descriptions:\")\n",
"for w, c in top:\n",
" print(f\" {c:4d} {w}\")"
]
"source": "from collections import Counter\nimport re\n\ntext = \" \".join(df[\"note\"].dropna().astype(str).str.lower().tolist())\nwords = re.findall(r\"[a-z]{3,}\", text)\nstop = {\n \"the\", \"and\", \"for\", \"with\", \"this\", \"that\", \"from\", \"was\", \"too\",\n \"add\", \"added\", \"use\", \"using\", \"but\", \"all\", \"not\", \"has\", \"have\",\n \"trade\", \"trades\", \"run\", \"ran\", \"same\", \"than\", \"more\", \"less\",\n \"still\", \"then\", \"one\", \"two\", \"new\", \"old\", \"now\",\n}\ntop = Counter(w for w in words if w not in stop).most_common(30)\nprint(\"Top 30 words across all notes:\")\nfor w, c in top:\n print(f\" {c:4d} {w}\")"
}
],
"metadata": {