127 changes: 127 additions & 0 deletions docs/ROUTERARENA_IMPROVEMENT_PLAN.md
@@ -0,0 +1,127 @@
# RouterArena Submission Improvement Plan

> Legitimate optimization roadmap for the `llm-router` submission.
>
> **Background**: An earlier optimization attempt (Lever #3, commit `401ad54`)
> used per-query accuracy data from the prior evaluation to decide which
> queries to reroute to a stronger model. The maintainer correctly flagged
> this as test-set leakage. The branch was reset to baseline (`065cca5`)
> and the score was invalidated.
>
> This plan documents what **is** allowed and the concrete paths to legitimately
> improve the submission beyond the 0.7139 honest baseline.

## Submission policy — allowed vs forbidden signals

### ✅ Allowed (uses prompt content or a-priori knowledge)

| Signal | Examples |
|---|---|
| Prompt text | keywords, length, language, presence of `\boxed{}`, code blocks |
| `Global Index` prefix | `MMLUPro_physics_*`, `AIME_*`, `LiveCodeBench_*` |
| `Dataset name` / `Category` from the dataset | "Mathematics", "Code", "Music" — these are dataset metadata, not labels |
| Generated answer characteristics | length, formatting, refusal patterns |
| Inference success/failure | `success=False`, empty `generated_answer`, API error code |
| Cross-model agreement | two models produce the same letter → high confidence (ensemble/self-consistency) |

### ❌ Forbidden (test-set leakage)

| Signal | Why it leaks |
|---|---|
| `prediction.accuracy` | Direct ground-truth label |
| `prediction.cost` filtered by correctness | Indirect — cost-of-correct varies by model accuracy |
| `optimality entries` to select per-query winners | Uses observed accuracy across models |
| Routing chain ordered by per-query historical accuracy | Same as Lever #3 — even via heuristics |

**Rule of thumb**: if a routing decision could not have been made without first
running the evaluation, the decision uses the test set.
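
The rule of thumb can be expressed as a small guard over the inputs a routing rule reads. This is a minimal sketch, not the repository's integrity checker: the field names in `FORBIDDEN_FIELDS` and the `leaked_fields` helper are hypothetical and only mirror the tables above.

```python
# Hypothetical field names mirroring the "Forbidden" table; adjust to the real schema.
FORBIDDEN_FIELDS = {"accuracy", "optimality"}


def leaked_fields(features_used):
    """Return the forbidden fields (sorted) that a routing rule reads."""
    return sorted(set(features_used) & FORBIDDEN_FIELDS)


# A rule that reads only prompt-side signals is clean.
print(leaked_fields(["prompt", "global_index", "dataset_name"]))
# A rule that also reads per-query accuracy is flagged.
print(leaked_fields(["prompt", "accuracy"]))
```

A guard like this only catches direct reads; indirect leakage (for example, cost filtered by correctness) still needs review of how each feature was derived.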

## Roadmap

Tiers are ordered by ROI (expected score lift per dollar and per engineering hour).

### Tier 1 — Free wins (no cost, ~1-3 hours each)

#### 1A. Self-consistency for multiple-choice queries
- **Mechanism**: detect MC by prompt pattern (presence of `A.`, `B.`, ..., `Provide the correct letter choice`). For each MC query, run 3 sampling passes from the same cheap model with `temperature=0.7`. Majority-vote the `\boxed{X}` answer.
- **Why legitimate**: only uses prompt format detection + model outputs; never reads accuracy.
- **Expected lift**: +0.02 to +0.04 accuracy on the ~80% of queries that are MC.
- **Cost**: 3x inference on MC queries — still <$2 total with current model pool.
- **Risk**: low. When the samples disagree, the majority answer is usually right.
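
The voting step could look roughly like the sketch below. It assumes the sampled outputs are already available as strings; `extract_choice` and `majority_vote` are illustrative names, and taking the last `\boxed{}` in an output is an assumption about where the final answer appears.

```python
import re
from collections import Counter

# Matches a single boxed letter such as \boxed{B}
BOXED = re.compile(r"\\boxed\{\s*([A-Za-z])\s*\}")


def extract_choice(text):
    """Return the last boxed letter in a model output, upper-cased, or None."""
    matches = BOXED.findall(text)
    return matches[-1].upper() if matches else None


def majority_vote(outputs):
    """Majority-vote the boxed letters across samples; None if nothing parsed."""
    votes = [c for c in map(extract_choice, outputs) if c is not None]
    if not votes:
        return None
    # most_common orders by count; equal counts keep first-seen order
    return Counter(votes).most_common(1)[0][0]


samples = [r"... so \boxed{B}", r"I think \boxed{C}", r"Therefore \boxed{b}"]
print(majority_vote(samples))
```

With three samples a 1-1-1 split is possible; this sketch then returns the first parsed answer, which is a design choice rather than part of the plan.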

#### 1B. Better system prompts per task family
- **Mechanism**: currently no system prompt is set. Add explicit CoT/format guidance per dataset prefix:
  - Math (`AIME`, `MATH`, `MathQA`, `MMLUPro_physics`) → "Think step by step. Show your reasoning. Put the final numerical answer in `\boxed{X}`."
  - MC knowledge (`MMLUPro_*`, `ArcMMLU_*`) → "Read the question, eliminate clearly wrong options, then state the answer in `\boxed{X}`."
  - Code (`LiveCodeBench_*`) → "Implement the function correctly. Return only working code in the requested format."
- **Why legitimate**: dataset prefix is metadata, not a label.
- **Expected lift**: +0.01 to +0.03 accuracy.
- **Cost**: $0 (just code).
- **Risk**: very low.
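
One way to wire this up is a prefix table checked in order. The sketch below assumes the `Global Index` string is available per query; `SYSTEM_PROMPTS` and `system_prompt_for` are hypothetical names, not part of the existing adapter.

```python
# (Global Index prefixes, system prompt). Order matters: the first match wins,
# so MMLUPro_physics must be listed before the generic MMLUPro_ prefix.
SYSTEM_PROMPTS = [
    (("AIME", "MATH", "MathQA", "MMLUPro_physics"),
     r"Think step by step. Show your reasoning. "
     r"Put the final numerical answer in \boxed{X}."),
    (("MMLUPro_", "ArcMMLU_"),
     r"Read the question, eliminate clearly wrong options, "
     r"then state the answer in \boxed{X}."),
    (("LiveCodeBench_",),
     "Implement the function correctly. "
     "Return only working code in the requested format."),
]


def system_prompt_for(global_index):
    """Return the system prompt for a query, or None to keep the current behaviour."""
    for prefixes, prompt in SYSTEM_PROMPTS:
        if global_index.startswith(prefixes):  # str.startswith accepts a tuple
            return prompt
    return None


print(system_prompt_for("MMLUPro_physics_12"))
print(system_prompt_for("QANTA_7"))
```

The lookup reads only the dataset prefix, so it stays within the allowed signals listed earlier.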

#### 1C. Investigate robustness drop
- The last evaluation showed the robustness score dropping from 0.30 to 0.236 even though `predictions-robustness.json` wasn't modified. Investigate:
  - Are the robustness queries scored against the main `predictions.json` instead of the dedicated robustness file?
  - Does the deepseek-v3.2 model output format affect robustness scoring differently?
- **Expected lift**: potentially recover +0.06 if it was a routing-side artifact.

### Tier 2 — Modest cost ($5-15 OpenRouter)

#### 2A. Per-dataset specialist routing
- **Mechanism**: based on the `Dataset name` field from the dataset (a-priori knowledge of which models excel at which subjects, NOT measured from this evaluation), route:
  - `LiveCodeBench` → `Qwen3-Coder-Next` (already a specialist in the existing adapter)
  - `AIME`, `MATH` → strong reasoning model (DeepSeek-V3.2 or Sonnet 4)
  - `MusicTheoryBench` → strong general (Gemini Flash, Sonnet 4)
  - `QANTA_*` → strong knowledge (Sonnet 4 if affordable, else best workhorse)
- **Why legitimate**: a-priori knowledge of dataset difficulty + public benchmark performance, NOT measured from RouterArena labels.
- **Expected lift**: +0.02 to +0.04.
- **Cost**: ~$5-10 depending on model mix.
- **Risk**: medium — the a-priori reasoning has to be defensible (publish your rationale).

#### 2B. Confidence-based fallback
- **Mechanism**: run cheap model first. If output is:
  - shorter than 50 chars, OR
  - lacks `\boxed{}` when expected, OR
  - contains refusal patterns ("I don't know", "I cannot determine"), OR
  - has reasoning_tokens but empty content (thinking-model failure)

  → automatically retry the same query on a stronger model.
- **Why legitimate**: trigger uses OUTPUT features, never accuracy.
- **Expected lift**: +0.01 to +0.02.
- **Cost**: ~10-20% of queries escalate at ~$0.005 each = $4-8.
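
The trigger conditions above can be sketched as a single predicate over the cheap model's output. The function name, its parameters, and the `REFUSALS` tuple are assumptions for illustration; only the four conditions come from the plan.

```python
# Refusal phrases from the plan, lower-cased for case-insensitive matching
REFUSALS = ("i don't know", "i cannot determine")


def needs_escalation(content, expects_boxed=False, reasoning_tokens=0, min_chars=50):
    """Decide from output features alone whether to retry on a stronger model."""
    text = content.strip()
    # Thinking-model failure: reasoning was produced but no visible answer
    if reasoning_tokens > 0 and not text:
        return True
    # Suspiciously short output
    if len(text) < min_chars:
        return True
    # Missing the expected final-answer marker
    if expects_boxed and "\\boxed{" not in text:
        return True
    # Explicit refusal or uncertainty
    lowered = text.lower()
    return any(phrase in lowered for phrase in REFUSALS)


print(needs_escalation("", reasoning_tokens=120))
print(needs_escalation("Reasoning step. " * 5 + r"\boxed{B}", expects_boxed=True))
```

Note that the 50-character threshold also escalates terse but valid answers such as a bare `\boxed{B}`, so `min_chars` may need tuning per task family. The predicate never reads accuracy, which keeps it on the allowed side of the policy.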

### Tier 3 — Real engineering (higher cost, higher lift)

#### 3A. Two-model ensemble vote on math/code/reasoning categories
- **Mechanism**: for `AIME`, `MATH`, `LiveCodeBench`, `MMLUPro_physics`, `MMLUPro_chemistry`, `MMLUPro_computer science`, run two different models in parallel. Compare extracted `\boxed{X}`. If they agree → use it. If disagree → run a third model as tiebreaker.
- **Why legitimate**: dataset-category selection only; agreement is across model outputs, not against ground truth.
- **Expected lift**: +0.02 to +0.05 (these categories are ~20% of queries).
- **Cost**: ~2.5x inference on those categories = $5-8 total.
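
The agree-or-tiebreak logic might be sketched as follows, assuming each model's answer has already been extracted (or is `None` when extraction failed). `ensemble_answer` is a hypothetical helper, and the fallback for a three-way disagreement is an assumption, since the plan does not specify one.

```python
def ensemble_answer(answer_a, answer_b, run_tiebreaker):
    """Combine two model answers, calling a third model only on disagreement.

    run_tiebreaker is a zero-argument callable so the third model is
    invoked lazily, which keeps the extra cost to disagreeing queries.
    """
    if answer_a is not None and answer_a == answer_b:
        return answer_a
    answer_c = run_tiebreaker()
    if answer_c is not None and answer_c in (answer_a, answer_b):
        return answer_c
    # Assumption: on a three-way split, fall back to the first available answer
    for answer in (answer_a, answer_b, answer_c):
        if answer is not None:
            return answer
    return None


print(ensemble_answer("B", "B", lambda: "C"))
print(ensemble_answer("A", "B", lambda: "B"))
```

Agreement is measured between model outputs only, never against ground truth, which is what keeps this approach legitimate.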

#### 3B. Add Claude Sonnet 4 / GPT-4o to the model pool
- **Mechanism**: add to config; route based on category (Tier 2A) or as confidence fallback (Tier 2B).
- **Why legitimate**: just adding a model. The routing rules that use it must follow Tier 1/2 principles.
- **Expected lift**: +0.03 to +0.05 if combined with smart per-category routing.
- **Cost**: $10-20 for 1500-2000 routed queries.

## Realistic ceiling

| Combination | Expected legitimate score |
|---|---|
| Baseline (current) | 0.7139 |
| Tier 1A + 1B + 1C | 0.74-0.76 |
| + Tier 2A + 2B | 0.76-0.78 |
| + Tier 3A + 3B | 0.78-0.81 |

Sqwish (#1) is at 0.7527. Tier 1 alone could beat it honestly if the lift lands in the upper half of the 0.74-0.76 estimate.

## Sequencing recommendation

1. **Start with Tier 1B + 1C** (free, fast, low risk). Re-eval. If we land in 0.72-0.74 territory, great.
2. **Then Tier 1A (self-consistency)**. Re-eval. If we cross Sqwish honestly, take a victory lap and document the strategy.
3. **Then Tier 2A** if more headroom is wanted.
4. **Tier 3** only if pushing for #1 with margin.

After each step:
- Run `scripts/check_submission_integrity.py` before pushing
- Add an audit-log entry to `docs/submission_audit.md` noting what changed and why it's legitimate
- Only force-push to the PR branch when the integrity check passes
185 changes: 185 additions & 0 deletions docs/SESSION_RESTART_CHECKLIST.md
@@ -0,0 +1,185 @@
# Session Restart Checklist

> Use this when starting a new Claude Code session to continue RouterArena
> optimization work, after the v10.1.1 submission cleanup and the llm-router
> hook + dashboard fixes.

## 1. Pre-flight (run outside Claude Code)

### 1.0 Reap leaked subprocesses from prior sessions

Cancelled MCP tool calls and aborted background tasks leave orphan
processes that drain RAM and GPU. After hours of work, these accumulate
and make Ollama calls take 30-90s instead of 1-2s, which then makes
new MCP `llm_*` calls in Claude Code appear stuck.

**Symptoms of the leak**:
- Other Claude Code sessions hang on `Calling llm-router...`
- `vm_stat` shows `Pages free: <10000` (very low free memory)
- `ps aux | grep -c "[p]ytest tests/"` returns more than 0 (the `[p]` bracket keeps `grep` from counting its own process)
- `ps aux | grep "[c]odex exec"` shows processes older than 1 hour

**Cleanup commands** (run before starting a new session):

```bash
# Kill leaked background pytest runs from prior sessions
pkill -f "pytest tests/"

# Kill stuck Codex CLI subprocesses (cancelled llm_query/llm_code calls
# leave these — Codex CLI doesn't auto-die on tool-call cancellation)
pkill -f "codex exec"

# Unload competing Ollama models — keep only the fast one (qwen2.5:7b).
# Loading both qwen3.5 (8.5GB) and qwen2.5 (4.8GB) creates GPU
# contention and slows inference dramatically on M1/M2.
curl -s -X POST http://localhost:11434/api/generate \
-d '{"model":"qwen3.5:latest","keep_alive":0}' > /dev/null

# Verify Ollama is now fast (should respond in ~1s)
curl -s --max-time 10 http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:7b","prompt":"hi","stream":false}' \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); print("Ollama latency: %.2fs" % (d["total_duration"]/1e9))'

# Confirm free memory is healthy (Pages free > 100000 ≈ 1.5GB+)
vm_stat | head -3
```

If `Pages free` is still very low after this, close other heavy apps
(Cursor, large browser tabs, IDE plugins) before starting the session.
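
To turn the `Pages free` count into bytes without guessing the page size, the `vm_stat` header can be parsed as well. This is a small sketch with an illustrative sample string; `free_bytes` is a hypothetical helper, and the numbers in `sample` are made up.

```python
import re


def free_bytes(vm_stat_output):
    """Convert the 'Pages free' line of vm_stat output into bytes."""
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    pages_free = int(re.search(r"Pages free:\s+(\d+)\.", vm_stat_output).group(1))
    return page_size * pages_free


# Illustrative sample in the shape vm_stat prints on Apple Silicon (16 KB pages)
sample = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              100000.
Pages active:                            250000.
"""
print(f"{free_bytes(sample) / 1e9:.2f} GB free")
```

On a real machine the input could come from `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout`.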

### 1.1 Restart the llm-router MCP server with new config

The currently-running MCP server (PID was 2100 in the cleanup session) was
started before:
- `qwen2.5:7b` was installed (faster Ollama model)
- `~/.llm-router/config.yaml` had `ollama_budget_models` updated
- `~/.llm-router/policies/subscription_first.yaml` was written
- llm-router PR #21 (dashboard + enforce-route fixes) was merged

Restart it so a new Claude Code session picks up the new config:

```bash
# Find and kill the existing MCP server
ps aux | grep llm-router-sse | grep -v grep
# Replace 2100 with the actual PID from the previous line
kill 2100
sleep 2

# Optional: activate the subscription-first policy
export LLM_ROUTER_POLICY=subscription_first

# Restart
cd /Users/yali.pollak/Projects/llm-router && uv run llm-router-sse 17891 &
disown
```

### 1.2 Verify the new config is live

```bash
# Should show qwen2.5:7b ahead of qwen3.5
grep ollama_budget_models ~/.llm-router/config.yaml

# Should show the subscription_first policy file
ls ~/.llm-router/policies/

# Should respond fast (~2s for short prompts)
curl -s --max-time 10 http://localhost:11434/api/generate \
  -d '{"model":"qwen2.5:7b","prompt":"What is 2+2?","stream":false}' \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); print("latency: %.2fs" % (d.get("total_duration",0)/1e9))'
```

### 1.3 Pull the latest llm-router improvements

If llm-router PR #21 is merged:

```bash
cd /Users/yali.pollak/Projects/llm-router
git checkout main
git pull
```

If still open, work off the PR branch:

```bash
cd /Users/yali.pollak/Projects/llm-router
git checkout fix/dashboard-and-enforce-route
git pull
```

## 2. Start a new Claude Code session

Open a fresh terminal, `cd /Users/yali.pollak/Projects/RouterArena`, run `claude`.

The new session should have:
- ✅ Faster Ollama (qwen2.5:7b at 2-3s warm vs the old 12-18s)
- ✅ Fixed enforce-route hook (loop-detection auto-pivot, read-only Bash allowlist, correct messaging)
- ✅ Dashboard tracking (claude_usage table writes from session_spend)

## 3. Brief Claude on the situation

Open the new session with this context (paste as the first message):

> Continue the RouterArena optimization work for PR #132. The previous session's
> Lever #3 was rejected as test-set leakage and the branch was reset to baseline
> `065cca5`. Read `docs/ROUTERARENA_IMPROVEMENT_PLAN.md` and pick up at Tier 1
> (free wins: better system prompts + self-consistency + robustness investigation).
> Before any submission, run `uv run python scripts/check_submission_integrity.py`
> and ensure it exits 0.

## 4. Before any submission to the PR

Run the integrity check:

```bash
cd /Users/yali.pollak/Projects/RouterArena
uv run python scripts/check_submission_integrity.py \
--predictions router_inference/predictions/llm-router.json \
--baseline router_inference/predictions/llm-router.json.bak.honest \
--scripts-dir scripts --scripts-dir router_inference/router \
--plan /tmp/reassignment_plan.json
```

It must exit 0 ("ALL CHECKS PASSED"). If it exits non-zero, fix the leak before pushing.
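
If the check is driven from a script rather than by hand, the exit code can gate the push. The sketch below is an assumption about how one might wrap it: `check_passes` is a hypothetical helper, and the two stand-in commands only simulate exit codes 0 and 1 so the example runs anywhere.

```python
import subprocess
import sys


def check_passes(cmd):
    """Run a command and return True only when it exits with status 0."""
    return subprocess.run(cmd).returncode == 0


# Stand-ins for the real integrity-check invocation shown above
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

for name, cmd in (("passing", passing), ("failing", failing)):
    verdict = "safe to push" if check_passes(cmd) else "do not push"
    print(f"{name}: {verdict}")
```

In practice `cmd` would be the full `uv run python scripts/check_submission_integrity.py ...` argument list from the block above.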

## 5. Working with the optimization branch

The legitimate-improvement plan lives on branch `feat/legitimate-improvement-plan`.
Build new optimization work as additional commits on top, OR create a child branch:

```bash
git checkout feat/legitimate-improvement-plan
git checkout -b feat/tier-1-system-prompts # example
# ...do work...
uv run pytest tests/test_submission_integrity.py # verify checks still pass
uv run python scripts/check_submission_integrity.py # verify state is clean
```

When ready to submit:

```bash
git push fork HEAD:submit-llm-router-v10.1.1 --force-with-lease
gh pr comment 132 --repo RouteWorks/RouterArena --body "/evaluate"
```

## 6. Quick reference — what changed since last session

| File | What |
|---|---|
| `docs/ROUTERARENA_IMPROVEMENT_PLAN.md` | Tiered roadmap of legitimate optimizations |
| `docs/SESSION_RESTART_CHECKLIST.md` | This file |
| `scripts/check_submission_integrity.py` | Pre-submission leakage detector |
| `tests/test_submission_integrity.py` | 13 tests proving the detector catches the Lever #3 pattern |
| `~/.llm-router/config.yaml` | qwen2.5:7b prioritized over slow qwen3.5 |
| `~/.llm-router/policies/subscription_first.yaml` | Chain: gemini_cli → codex → claude haiku → ollama |
| llm-router PR #21 | Dashboard savings tracking + enforce-route deadlock recovery |
| RouterArena PR #132 | Reset to clean baseline (`065cca5` → `792bd52`) |

## 7. Known follow-ups

- Robustness score drop (0.30 → 0.236 in the prior submission). The robustness
  predictions file wasn't modified, so the change came from another input.
  Worth understanding before the next submission cycle.
- The llm-router enforce-route patches are not yet deployed to the live hook
  at `~/.claude/hooks/llm-router-enforce-route.py`; they live in
  `src/llm_router/hooks/enforce-route.py` and need to be copied/symlinked
  on PR #21 merge.