v4.0.0: production grade, still lazy
The hardening release. Three new reflexes, about ten lines of prompt:
- One runnable check. Non-trivial logic leaves behind the smallest test that fails if the logic breaks. No frameworks, no fixtures. One-liners stay test-free.
- Ceilings are named. A
ponytail:shortcut with a known limit (global lock, O(n²) scan) must name the limit and the upgrade path in the comment. - Robust beats flimsy at equal size. Between two same-size stdlib options, take the one that is correct on edge cases.
Benchmarked
Six tasks, three arms, same model, adversarial security and concurrency probes. Every arm passes every probe. Then the agreement ends:
| No skill | Caveman | Ponytail v4 | |
|---|---|---|---|
| Lines of code | 3,629 | 1,440 | 490 |
| Agent tokens | 430,697 | 290,546 | 229,370 |
| Surprise-extension lines | 1,115 | 413 | 96 |
Full data and methodology: benchmarks/
Also
- All cross-agent rule files updated (Cursor, Windsurf, Cline, Copilot,
AGENTS.md) ponytail-reviewno longer flags the minimal check as bloat- README grew a chart