RouteWorks · ypollak2 · Jun 4, 2026
diff --git a/docs/ROUTERARENA_IMPROVEMENT_PLAN.md b/docs/ROUTERARENA_IMPROVEMENT_PLAN.md
@@ -0,0 +1,127 @@
+# RouterArena Submission Improvement Plan
+
+> Legitimate optimization roadmap for the `llm-router` submission.
+>
+> **Background**: An earlier optimization attempt (Lever #3, commit `401ad54`)
+> used per-query accuracy data from the prior evaluation to decide which
+> queries to reroute to a stronger model. The maintainer correctly flagged
+> this as test-set leakage. The branch was reset to baseline (`065cca5`)
+> and the score was invalidated.
+>
+> This plan documents what **is** allowed and the concrete paths to legitimately
+> improve the submission beyond the 0.7139 honest baseline.
+
+## Submission policy — allowed vs forbidden signals
+
+### ✅ Allowed (uses prompt content or a-priori knowledge)
+
+| Signal | Examples |
+|---|---|
+| Prompt text | keywords, length, language, presence of `\boxed{}`, code blocks |
+| `Global Index` prefix | `MMLUPro_physics_*`, `AIME_*`, `LiveCodeBench_*` |
+| `Dataset name` / `Category` from the dataset | "Mathematics", "Code", "Music" — these are dataset metadata, not labels |
+| Generated answer characteristics | length, formatting, refusal patterns |
+| Inference success/failure | `success=False`, empty `generated_answer`, API error code |
+| Cross-model agreement | two models produce the same letter → high confidence (ensemble/self-consistency) |
+
+### ❌ Forbidden (test-set leakage)
+
+| Signal | Why it leaks |
+|---|---|
+| `prediction.accuracy` | Direct ground-truth label |
+| `prediction.cost` filtered by correctness | Indirect — cost-of-correct varies by model accuracy |
+| `optimality entries` to select per-query winners | Uses observed accuracy across models |
+| Routing chain ordered by per-query historical accuracy | Same as Lever #3 — even via heuristics |
+
+**Rule of thumb**: if a routing decision could not have been made without first
+seeing the evaluation's scored results, the decision uses the test set.
+
+## Roadmap
+
+Tiers are ordered by ROI (expected score lift per dollar and per engineering hour).
+
+### Tier 1 — Free wins (no cost, ~1-3 hours each)
+
+#### 1A. Self-consistency for multiple-choice queries
+- **Mechanism**: detect MC by prompt pattern (presence of `A.`, `B.`, ..., `Provide the correct letter choice`). For each MC query, run 3 sampling passes from the same cheap model with `temperature=0.7`. Majority-vote the `\boxed{X}` answer.
+- **Why legitimate**: only uses prompt format detection + model outputs; never reads accuracy.
+- **Expected lift**: +0.02 to +0.04 accuracy on the ~80% of queries that are MC.
+- **Cost**: 3x inference on MC queries — still <$2 total with current model pool.
+- **Risk**: low. When the three samples disagree, the majority answer is usually right.
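+
+A minimal sketch of the vote step, assuming MC answers are single letters inside `\boxed{}`. The helper names (`extract_boxed_letter`, `majority_vote`) are hypothetical and not part of the existing adapter, and returning `None` on a tie is an assumption rather than something the plan specifies.
+
+```python
+import re
+from collections import Counter
+from typing import Optional
+
+# Matches the letter inside \boxed{X}; assumes single-letter MC answers.
+BOXED_LETTER = re.compile(r"\\boxed\{\s*([A-Za-z])\s*\}")
+
+
+def extract_boxed_letter(answer: str) -> Optional[str]:
+    """Return the last boxed letter in a generated answer, upper-cased."""
+    matches = BOXED_LETTER.findall(answer)
+    return matches[-1].upper() if matches else None
+
+
+def majority_vote(answers: list[str]) -> Optional[str]:
+    """Majority-vote the boxed letters; None if nothing was boxed or on a tie."""
+    letters = [x for x in map(extract_boxed_letter, answers) if x is not None]
+    if not letters:
+        return None
+    counts = Counter(letters).most_common()
+    if len(counts) > 1 and counts[0][1] == counts[1][1]:
+        return None  # tie: the caller decides what to do
+    return counts[0][0]
+```
+
+A caller would pass in the three sampled completions and, when the vote returns `None`, fall back to whatever single-pass answer it would have used otherwise. The vote reads only model outputs, so it stays on the allowed side of the policy table.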
+
+#### 1B. Better system prompts per task family
+- **Mechanism**: currently no system prompt is set. Add explicit CoT/format guidance per dataset prefix:
+  - Math (`AIME`, `MATH`, `MathQA`, `MMLUPro_physics`) → "Think step by step. Show your reasoning. Put the final numerical answer in `\boxed{X}`."
+  - MC knowledge (`MMLUPro_*`, `ArcMMLU_*`) → "Read the question, eliminate clearly wrong options, then state the answer in `\boxed{X}`."
+  - Code (`LiveCodeBench_*`) → "Implement the function correctly. Return only working code in the requested format."
+- **Why legitimate**: dataset prefix is metadata, not a label.
+- **Expected lift**: +0.01 to +0.03 accuracy.
+- **Cost**: $0 (just code).
+- **Risk**: very low.
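+
+One way the prefix-to-prompt mapping could look, as a sketch. The prompt strings are the ones listed above; the function name `system_prompt_for` and the rule order are assumptions. In particular, the plan does not say which prompt wins for `MMLUPro_physics`, which matches both the math list and `MMLUPro_*`; this sketch checks the math prefixes first.
+
+```python
+from typing import Optional
+
+MATH_PROMPT = (
+    r"Think step by step. Show your reasoning. "
+    r"Put the final numerical answer in \boxed{X}."
+)
+MC_PROMPT = (
+    r"Read the question, eliminate clearly wrong options, "
+    r"then state the answer in \boxed{X}."
+)
+CODE_PROMPT = (
+    "Implement the function correctly. "
+    "Return only working code in the requested format."
+)
+
+# (prefixes, prompt) pairs checked in order. Specific prefixes come first so
+# MMLUPro_physics gets the math prompt (assumed precedence, not from the plan).
+PROMPT_RULES = [
+    (("AIME", "MATH", "MathQA", "MMLUPro_physics"), MATH_PROMPT),
+    (("MMLUPro_", "ArcMMLU_"), MC_PROMPT),
+    (("LiveCodeBench_",), CODE_PROMPT),
+]
+
+
+def system_prompt_for(global_index: str) -> Optional[str]:
+    """Pick a system prompt from the Global Index prefix; None if no rule matches."""
+    for prefixes, prompt in PROMPT_RULES:
+        if global_index.startswith(prefixes):
+            return prompt
+    return None
+```
+
+Queries with an unlisted prefix get `None`, which keeps today's behavior (no system prompt) for them.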
+
+#### 1C. Investigate robustness drop
+- The last evaluation showed robustness score 0.30 → 0.236 even though `predictions-robustness.json` wasn't modified. Investigate:
+  - Are the robustness queries scored against the main `predictions.json` instead of the dedicated robustness file?
+  - Does the deepseek-v3.2 model output format affect robustness scoring differently?
+- **Expected lift**: potentially recover +0.06 if it was a routing-side artifact.
+
+### Tier 2 — Modest cost ($5-15 OpenRouter)
+
+#### 2A. Per-dataset specialist routing
+- **Mechanism**: based on the `Dataset name` field from the dataset (a-priori knowledge of which models excel at which subjects, NOT measured from this evaluation), route:
+  - `LiveCodeBench` → `Qwen3-Coder-Next` (already a specialist in the existing adapter)
+  - `AIME`, `MATH` → strong reasoning model (DeepSeek-V3.2 or Sonnet 4)
+  - `MusicTheoryBench` → strong general (Gemini Flash, Sonnet 4)
+  - `QANTA_*` → strong knowledge (Sonnet 4 if affordable, else best workhorse)
+- **Why legitimate**: a-priori knowledge of dataset difficulty + public benchmark performance, NOT measured from RouterArena labels.
+- **Expected lift**: +0.02 to +0.04.
+- **Cost**: ~$5-10 depending on model mix.
+- **Risk**: medium — the a-priori reasoning has to be defensible (publish your rationale).
+
+#### 2B. Confidence-based fallback
+- **Mechanism**: run cheap model first. If output is:
+  - shorter than 50 chars, OR
+  - lacks `\boxed{}` when expected, OR
+  - contains refusal patterns ("I don't know", "I cannot determine"), OR
+  - has reasoning_tokens but empty content (thinking-model failure)
+  → automatically retry the same query on a stronger model.
+- **Why legitimate**: trigger uses OUTPUT features, never accuracy.
+- **Expected lift**: +0.01 to +0.02.
+- **Cost**: ~10-20% of queries escalate at ~$0.005 each = $4-8.
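+
+The four triggers above can be sketched as a single predicate. This is an illustration only: `needs_escalation` and its parameters are hypothetical names, and the refusal list contains just the two phrases quoted in the plan.
+
+```python
+# Lower-cased refusal phrases taken from the plan; extend as needed.
+REFUSAL_PATTERNS = ("i don't know", "i cannot determine")
+
+
+def needs_escalation(
+    content: str,
+    reasoning_tokens: int = 0,
+    expects_boxed: bool = True,
+    min_chars: int = 50,
+) -> bool:
+    """Decide from OUTPUT features alone whether to retry on a stronger model."""
+    text = content.strip()
+    # Thinking-model failure: reasoning tokens were spent but no content came back.
+    if reasoning_tokens > 0 and not text:
+        return True
+    # Suspiciously short output.
+    if len(text) < min_chars:
+        return True
+    # A boxed answer was expected but is missing.
+    if expects_boxed and "\\boxed{" not in text:
+        return True
+    # Explicit refusal or hedge.
+    lowered = text.lower()
+    return any(pattern in lowered for pattern in REFUSAL_PATTERNS)
+```
+
+Note that under the 50-character rule a bare `\boxed{B}` reply also escalates, so the threshold directly drives the share of queries that pay for a second call.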
+
+### Tier 3 — Real engineering (higher cost, higher lift)
+
+#### 3A. Two-model ensemble vote on math/code/reasoning categories
+- **Mechanism**: for `AIME`, `MATH`, `LiveCodeBench`, `MMLUPro_physics`, `MMLUPro_chemistry`, `MMLUPro_computer science`, run two different models in parallel. Compare the extracted `\boxed{X}` answers. If they agree → use it. If they disagree → run a third model as tiebreaker.
+- **Why legitimate**: dataset-category selection only; agreement is across model outputs, not against ground truth.
+- **Expected lift**: +0.02 to +0.05 (these categories are ~20% of queries).
+- **Cost**: ~2.5x inference on those categories = $5-8 total.
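+
+A sketch of the agree-or-tiebreak logic, assuming the boxed answers have already been extracted. `ensemble_answer` is a hypothetical name, the third model is passed as a callable so it only runs on disagreement, and the fallback when all three differ is an assumption the plan leaves open.
+
+```python
+from typing import Callable, Optional
+
+
+def ensemble_answer(
+    first: Optional[str],
+    second: Optional[str],
+    tiebreak: Callable[[], Optional[str]],
+) -> Optional[str]:
+    """Return the agreed answer, querying the tiebreaker only on disagreement."""
+    if first is not None and first == second:
+        return first
+    third = tiebreak()  # lazily run the third model
+    if third is not None and third in (first, second):
+        return third
+    # No two models agree: fall back to the first available answer (assumption).
+    for candidate in (first, second, third):
+        if candidate is not None:
+            return candidate
+    return None
+```
+
+If the three models cost about the same per query, the ~2.5x figure corresponds to the tiebreaker firing on roughly half of these queries (2 + 0.5 = 2.5).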
+
+#### 3B. Add Claude Sonnet 4 / GPT-4o to the model pool
+- **Mechanism**: add to config; route based on category (Tier 2A) or as confidence fallback (Tier 2B).
+- **Why legitimate**: just adding a model. The routing rules that use it must follow Tier 1/2 principles.
+- **Expected lift**: +0.03 to +0.05 if combined with smart per-category routing.
+- **Cost**: $10-20 for 1500-2000 routed queries.
+
+## Realistic ceiling
+
+| Combination | Expected legitimate score |
+|---|---|
+| Baseline (current) | 0.7139 |
+| Tier 1A + 1B + 1C | 0.74-0.76 |
+| + Tier 2A + 2B | 0.76-0.78 |
+| + Tier 3A + 3B | 0.78-0.81 |
+
+Sqwish (#1) is at 0.7527. Tier 1 alone could beat it honestly, but only if it lands in the upper half of its 0.74-0.76 range.
+
+## Sequencing recommendation
+
+1. **Start with Tier 1B + 1C** (free, fast, low risk). Re-eval. If we land in 0.72-0.74 territory, great.
+2. **Then Tier 1A (self-consistency)**. Re-eval. If we cross Sqwish honestly, take a victory lap and document the strategy.
+3. **Then Tier 2A** if more headroom is wanted.
+4. **Tier 3** only if pushing for #1 with margin.
+
+After each step:
+- Run `scripts/check_submission_integrity.py` before pushing
+- Open an audit log in `docs/submission_audit.md` noting what changed and why it's legitimate
+- Only force-push to the PR branch when the integrity check passes
diff --git a/docs/SESSION_RESTART_CHECKLIST.md b/docs/SESSION_RESTART_CHECKLIST.md
@@ -0,0 +1,185 @@
+# Session Restart Checklist
+
+> Use this when starting a new Claude Code session to continue RouterArena
+> optimization work, after the v10.1.1 submission cleanup and the llm-router
+> hook + dashboard fixes.
+
+## 1. Pre-flight (run outside Claude Code)
+
+### 1.0 Reap leaked subprocesses from prior sessions
+
+Cancelled MCP tool calls and aborted background tasks leave orphan
+processes that drain RAM and GPU. After hours of work, these accumulate
+and make Ollama calls take 30-90s instead of 1-2s, which then makes
+new MCP `llm_*` calls in Claude Code appear stuck.
+
+**Symptoms of the leak**:
+- Other Claude Code sessions hang on `Calling llm-router...`
+- `vm_stat` shows `Pages free: <10000` (very low free memory)
+- `ps aux | grep "pytest tests/" | grep -v grep` prints any lines
+- `ps aux | grep "codex exec"` shows processes older than 1 hour
+
+**Cleanup commands** (run before starting a new session):
+
+```bash
+# Kill leaked background pytest runs from prior sessions
+pkill -f "pytest tests/"
+
+# Kill stuck Codex CLI subprocesses (cancelled llm_query/llm_code calls
+# leave these — Codex CLI doesn't auto-die on tool-call cancellation)
+pkill -f "codex exec"
+
+# Unload competing Ollama models — keep only the fast one (qwen2.5:7b).
+# Loading both qwen3.5 (8.5GB) and qwen2.5 (4.8GB) creates GPU
+# contention and slows inference dramatically on M1/M2.
+curl -s -X POST http://localhost:11434/api/generate \
+  -d '{"model":"qwen3.5:latest","keep_alive":0}' > /dev/null
+
+# Verify Ollama is now fast (should respond in ~1s)
+curl -s --max-time 10 http://localhost:11434/api/generate \
+  -d '{"model":"qwen2.5:7b","prompt":"hi","stream":false}' \
+  | python3 -c 'import json,sys; d=json.load(sys.stdin); print("Ollama latency: %.2fs" % (d["total_duration"]/1e9))'
+
+# Confirm free memory is healthy (Pages free > 100000 ≈ 1.5GB+)
+vm_stat | head -3
+```
+
+If `Pages free` is still very low after this, close other heavy apps
+(Cursor, large browser tabs, IDE plugins) before starting the session.
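+
+To turn the `Pages free` count into gigabytes, a small parser can read the page size from the `vm_stat` header instead of assuming it. This is a sketch for macOS output only; `free_memory_gb` is a hypothetical helper, not an existing script in the repo.
+
+```python
+import re
+
+
+def free_memory_gb(vm_stat_output: str) -> float:
+    """Convert the 'Pages free' count in vm_stat output to gigabytes."""
+    # The header reads "... (page size of N bytes)"; N is 16384 on Apple Silicon.
+    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
+    # Counts are printed with a trailing period, e.g. "Pages free:   100000."
+    pages_free = int(re.search(r"Pages free:\s+(\d+)\.", vm_stat_output).group(1))
+    return pages_free * page_size / 1e9
+
+
+# Usage on macOS (not executed here):
+#   import subprocess
+#   out = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
+#   print(f"free: {free_memory_gb(out):.2f} GB")
+```
+
+With 16384-byte pages, the 100000-page threshold above works out to about 1.6 GB.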
+
+### 1.1 Restart the llm-router MCP server with new config
+
+The currently-running MCP server (PID was 2100 in the cleanup session) was
+started before:
+- `qwen2.5:7b` was installed (faster Ollama model)
+- `~/.llm-router/config.yaml` had `ollama_budget_models` updated
+- `~/.llm-router/policies/subscription_first.yaml` was written
+- llm-router PR #21 (dashboard + enforce-route fixes) was merged
+
+Restart it so a new Claude Code session picks up the new config:
+
+```bash
+# Find and kill the existing MCP server
+ps aux | grep llm-router-sse | grep -v grep
+# Replace 2100 with the actual PID from the previous line
+kill 2100
+sleep 2
+
+# Optional: activate the subscription-first policy
+export LLM_ROUTER_POLICY=subscription_first
+
+# Restart
+cd /Users/yali.pollak/Projects/llm-router && uv run llm-router-sse 17891 &
+disown
+```
+
+### 1.2 Verify the new config is live
+
+```bash
+# Should show qwen2.5:7b ahead of qwen3.5
+grep ollama_budget_models ~/.llm-router/config.yaml
+
+# Should show the subscription_first policy file
+ls ~/.llm-router/policies/
+
+# Should respond fast (~2s for short prompts)
+curl -s --max-time 10 http://localhost:11434/api/generate \
+  -d '{"model":"qwen2.5:7b","prompt":"What is 2+2?","stream":false}' \
+  | python3 -c 'import json,sys; d=json.load(sys.stdin); print("latency: %.2fs" % (d.get("total_duration", 0)/1e9))'
+```
+
+### 1.3 Pull the latest llm-router improvements
+
+If llm-router PR #21 is merged:
+
+```bash
+cd /Users/yali.pollak/Projects/llm-router
+git checkout main
+git pull
+```
+
+If still open, work off the PR branch:
+
+```bash
+cd /Users/yali.pollak/Projects/llm-router
+git checkout fix/dashboard-and-enforce-route
+git pull
+```
+
+## 2. Start a new Claude Code session
+
+Open a fresh terminal, `cd /Users/yali.pollak/Projects/RouterArena`, run `claude`.
+
+The new session should have:
+- ✅ Faster Ollama (qwen2.5:7b at 2-3s warm vs the old 12-18s)
+- ✅ Fixed enforce-route hook (loop-detection auto-pivot, read-only Bash allowlist, correct messaging)
+- ✅ Dashboard tracking (claude_usage table writes from session_spend)
+
+## 3. Brief Claude on the situation
+
+Open the new session with this context (paste as the first message):
+
+> Continue the RouterArena optimization work for PR #132. The previous session's
+> Lever #3 was rejected as test-set leakage and the branch was reset to baseline
+> `065cca5`. Read `docs/ROUTERARENA_IMPROVEMENT_PLAN.md` and pick up at Tier 1
+> (free wins: better system prompts + self-consistency + robustness investigation).
+> Before any submission, run `uv run python scripts/check_submission_integrity.py`
+> and ensure it exits 0.
+
+## 4. Before any submission to the PR
+
+Run the integrity check:
+
+```bash
+cd /Users/yali.pollak/Projects/RouterArena
+uv run python scripts/check_submission_integrity.py \
+  --predictions router_inference/predictions/llm-router.json \
+  --baseline router_inference/predictions/llm-router.json.bak.honest \
+  --scripts-dir scripts --scripts-dir router_inference/router \
+  --plan /tmp/reassignment_plan.json
+```
+
+It must exit 0 ("ALL CHECKS PASSED"). If it exits 1, fix the leak before pushing.
+
+## 5. Working with the optimization branch
+
+The legitimate-improvement plan lives on branch `feat/legitimate-improvement-plan`.
+Build new optimization work as additional commits on top, OR create a child branch:
+
+```bash
+git checkout feat/legitimate-improvement-plan
+git checkout -b feat/tier-1-system-prompts  # example
+# ...do work...
+uv run pytest tests/test_submission_integrity.py  # verify checks still pass
+uv run python scripts/check_submission_integrity.py  # verify state is clean
+```
+
+When ready to submit:
+
+```bash
+git push fork HEAD:submit-llm-router-v10.1.1 --force-with-lease
+gh pr comment 132 --repo RouteWorks/RouterArena --body "/evaluate"
+```
+
+## 6. Quick reference — what changed since last session
+
+| File | What |
+|---|---|
+| `docs/ROUTERARENA_IMPROVEMENT_PLAN.md` | Tiered roadmap of legitimate optimizations |
+| `docs/SESSION_RESTART_CHECKLIST.md` | This file |
+| `scripts/check_submission_integrity.py` | Pre-submission leakage detector |
+| `tests/test_submission_integrity.py` | 13 tests proving the detector catches the Lever #3 pattern |
+| `~/.llm-router/config.yaml` | qwen2.5:7b prioritized over slow qwen3.5 |
+| `~/.llm-router/policies/subscription_first.yaml` | Chain: gemini_cli → codex → claude haiku → ollama |
+| llm-router PR #21 | Dashboard savings tracking + enforce-route deadlock recovery |
+| RouterArena PR #132 | Reset to clean baseline (`065cca5` → `792bd52`) |
+
+## 7. Known follow-ups
+
+- Robustness score drop (0.30 → 0.236 in the prior submission). The robustness
+  predictions file wasn't modified, so the change came from another input.
+  Worth understanding before the next submission cycle.
+- The llm-router enforce-route patches are not yet deployed to the live hook
+  at `~/.claude/hooks/llm-router-enforce-route.py` — they live in
+  `src/llm_router/hooks/enforce-route.py` and need to be copied or symlinked
+  once PR #21 is merged.