v0.5 — paper-5 saturation-direction lever shipped

@caiovicentino caiovicentino released this 09 May 15:37
· 241 commits to main since this release

Paper-5 published 2026-05-09

Saturation-Direction Lever: A Five-Class Taxonomy of Probe Causality in Qwen3.6-27B
https://openinterp.org/research/papers/saturation-direction-probe-levers

Headline finding

α=−100 robustness theorem: the L31 pre_tool capability probe direction produces a +33 to +40pp probe-vs-random pushdown gap across code distributions on which Qwen3.6-27B's pass rate ranges from roughly 7% to 89% (HumanEval+MBPP, BigCodeBench, Codeforces ≥2000). The α=−100 locus is saturation-independent at moderate amplitude.
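
To make the "probe-vs-random pushdown gap" concrete, here is a minimal sketch of how such a number could be computed from per-problem pass/fail outcomes. The definition used (drop in pass rate under probe-direction steering minus the drop under random-direction steering, in percentage points) is an assumption for illustration; the paper's exact formula is not restated in these notes. The function names and toy data are hypothetical.

```python
def pass_rate(outcomes):
    """Percentage of problems that passed (outcomes is a list of bools)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def pushdown_gap_pp(baseline, probe_steered, random_steered):
    """Assumed definition: (drop under probe steering) - (drop under random steering).

    All three arguments are pass/fail lists over the same problem set.
    The result is in percentage points (pp).
    """
    base = pass_rate(baseline)
    probe_drop = base - pass_rate(probe_steered)
    random_drop = base - pass_rate(random_steered)
    return probe_drop - random_drop


# Toy outcomes, NOT taken from the paper.
baseline = [True] * 8 + [False] * 2        # 80% pass rate, unsteered
probe_steered = [True] * 4 + [False] * 6   # 40% with the probe direction
random_steered = [True] * 7 + [False] * 3  # 70% with a random direction

gap = pushdown_gap_pp(baseline, probe_steered, random_steered)
print(f"probe-vs-random pushdown gap: {gap:+.0f}pp")
```

Under this definition the baseline cancels, so the gap equals the random-steered pass rate minus the probe-steered pass rate. That keeps the comparison meaningful across distributions with very different baseline pass rates.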

Phase coverage (this release)

  • Phase 0 — Smoke G1+G4 PASS
  • Phase 1 — N=20 stratified, 12k captures, G4 GREEN
  • Phase 2 — Differential probes, AUROC up to 0.958
  • Phase 5d/6/6c — N=99 + methodology sweep
  • Phase 7 — L43 pre_tool epiphenomenal (softmax-temp artifact)
  • Phase 8 — L55 thinking template-locked
  • Phase 10 — RG L55 mid_think first lever (+30pp pushup at α=+200)
  • Phase 11/11b — 4/4 capability sites pushdown-asymmetric (+30 to +60pp)
  • Phase 12 — Persona falsifier #1 → motivates saturation-direction theory
  • Phase 11c — Cross-distribution BCB +33pp at α=−100
  • Phase 11d — Codeforces ≥2000 +40pp at α=−100; falsifier #2 walks back saturation-magnitude corollary
  • Phase 11e — Multi-site Codeforces (in progress at release)
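
The "pushdown at α=−100" and "pushup at α=+200" entries above refer to steering a hidden state along a probe direction. A minimal sketch of one common form of that intervention follows, assuming additive steering along a unit-normalized direction; whether the paper normalizes the direction, and at which token positions it intervenes, is not stated in these notes. The helper names and the toy vectors are hypothetical.

```python
import math


def unit(vec):
    """Return vec scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vec]


def projection(hidden, direction):
    """Scalar projection of hidden onto the unit-normalized direction."""
    d = unit(direction)
    return sum(h * di for h, di in zip(hidden, d))


def steer(hidden, direction, alpha):
    """Assumed intervention: h' = h + alpha * d_hat (additive steering)."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(hidden, d)]


# Toy 3-dimensional example, NOT real activations.
hidden = [1.0, 2.0, 2.0]
probe_direction = [0.0, 3.0, 4.0]

pushed_down = steer(hidden, probe_direction, alpha=-100.0)
before = projection(hidden, probe_direction)
after = projection(pushed_down, probe_direction)
print(f"projection before: {before:.1f}, after: {after:.1f}")
```

With a unit-normalized direction, the projection onto the probe direction shifts by exactly α, which makes α directly comparable to the hidden-state norm ‖h‖.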

Methodology contributions (mandatory at every site)

  1. Random K-matched probe baseline (paper-3 §3.1)
  2. Control-token normalization for log-prob shifts (paper-3 §3.2)
  3. Structural-rigidity α-sweep at α >> ‖h‖ (paper-3 §3.3)
  4. Whitespace-stripped behavioral flip metric (paper-3 §3.4)

Three of these four checks caught a confident-but-wrong claim during the work.
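
Two of the checks listed above are simple enough to sketch in a few lines. The code below is an illustrative reading of items 2 and 4, not the definitions from paper-3: it assumes control-token normalization subtracts the mean log-prob shift of control tokens from the target token's shift, and that a behavioral flip is any output change that survives removing all whitespace. Function names and toy inputs are hypothetical.

```python
def control_normalized_shift(target_shift, control_shifts):
    """Assumed form: target log-prob shift minus the mean control-token shift."""
    if not control_shifts:
        raise ValueError("need at least one control token")
    return target_shift - sum(control_shifts) / len(control_shifts)


def strip_whitespace(text):
    """Remove all whitespace so formatting-only differences do not count."""
    return "".join(text.split())


def flip_rate(baseline_outputs, steered_outputs):
    """Fraction of prompts whose output changes after whitespace is stripped."""
    if len(baseline_outputs) != len(steered_outputs):
        raise ValueError("output lists must have the same length")
    if not baseline_outputs:
        raise ValueError("output lists must be non-empty")
    flips = sum(
        strip_whitespace(a) != strip_whitespace(b)
        for a, b in zip(baseline_outputs, steered_outputs)
    )
    return flips / len(baseline_outputs)


# Toy values, NOT from the paper.
shift = control_normalized_shift(-2.0, [-0.5, -1.5])

baseline_outputs = ["return x + 1", "print( a )", "yes"]
steered_outputs = ["return x+1\n", "print(b)", "no"]
rate = flip_rate(baseline_outputs, steered_outputs)
print(f"normalized shift: {shift:.1f}, flip rate: {rate:.2f}")
```

In the toy data the first pair differs only in spacing and a trailing newline, so it does not count as a flip. Likewise, a log-prob shift that control tokens share is discounted rather than attributed to the probe direction.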

Two pre-registered falsification cycles

  • #1 Phase 12 persona — predicted pushup-asymmetric (continuous-gradient class), observed pushdown → falsifies categorical-vs-continuous frame, motivates saturation-direction principle
  • #2 Phase 11d Codeforces — predicted pushdown collapse + pushup emergence at lowest saturation (saturation-magnitude corollary), observed saturation-INDEPENDENT pushdown → walks back corollary, refines to α=−100 robustness theorem

Companion artifacts

  • HF dataset — caiovicentino1/agent-probe-guard-qwen36-27b (probe weights for v0.1 SDK; v0.2 will swap L43 → L31 pre_tool from Phase 11)
  • PyPI — pip install openinterp (v0.3.0+)
  • Web — paper-5 + 4 prior papers at openinterp.org/research
  • License — Apache-2.0 throughout

Reproducibility

12 standalone Colab notebooks live in notebooks/. The build scripts in scripts/build_nb_swebench_v*.py regenerate every notebook from source. Total compute is ~6.5h on an RTX 6000 Blackwell from a cold start.