Releases: Pyrecall/Pyrecall
v0.10.1 — Reliability & correctness patch
What's new
This is a focused patch release fixing four bugs discovered and resolved after v0.10.0.
🐛 Bug fixes
Replay buffer deduplication — fixes #11
ReplayBuffer.add() now deduplicates entries by SHA-256 hash. Calling model.learn("data.jsonl") twice no longer inflates the buffer with duplicate examples or skews replay_mix_ratio. Duplicates are logged at DEBUG level. total_seen counts unique examples only, keeping reservoir sampling probabilities honest.
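The interaction between hash-based deduplication and reservoir sampling can be sketched as follows. This is a minimal illustration, not Pyrecall's actual code: the class name `DedupReservoir`, the canonical-JSON hashing, and the choice to remember hashes of evicted items are all assumptions made for the example.

```python
import hashlib
import json
import random


class DedupReservoir:
    """Hypothetical stand-in for a replay buffer (not Pyrecall's ReplayBuffer)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.total_seen = 0      # counts unique examples only
        self._hashes = set()     # SHA-256 digests of everything seen so far
        self._rng = random.Random(seed)

    @staticmethod
    def _digest(example):
        # Canonical JSON so that key order does not change the hash.
        payload = json.dumps(example, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def add(self, example):
        """Return True if the example was new, False if it was a duplicate."""
        digest = self._digest(example)
        if digest in self._hashes:
            return False         # duplicate: buffer and counters stay untouched
        self._hashes.add(digest)
        self.total_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Algorithm R: keep the new item with probability capacity / total_seen.
            slot = self._rng.randrange(self.total_seen)
            if slot < self.capacity:
                self.items[slot] = example
        return True
```

Because `total_seen` only advances for unique examples, feeding the same file twice leaves the replacement probability unchanged, which is the property the fix is meant to protect.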
Single-item category severity — fixes #10
CategoryComparison.severity previously returned MINOR for every custom benchmark category with only one prompt, because cohen_d was forced to 0.0 when n_items < 2. It now falls back to threshold-based delta buckets (MINOR < 0.05, MODERATE < 0.15, SEVERE < 0.30, CRITICAL ≥ 0.30). A new severity_method field ("effect_size" | "delta") is exposed in to_dict() and the report table marks single-item rows with n/a * and a footer note.
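A rough sketch of the fallback, using the delta buckets listed above. The function names are hypothetical, and treating the delta as the absolute size of the score drop is an assumption; only the thresholds and the two `severity_method` values come from the notes.

```python
def severity_from_delta(drop):
    """Bucket the size of a score drop using the thresholds from the notes."""
    size = abs(drop)  # assumption: only the magnitude of the drop matters here
    if size < 0.05:
        return "MINOR"
    if size < 0.15:
        return "MODERATE"
    if size < 0.30:
        return "SEVERE"
    return "CRITICAL"


def severity_method(n_items):
    """Effect size needs at least two items; otherwise fall back to delta."""
    return "effect_size" if n_items >= 2 else "delta"
```

With a single-prompt category that dropped by 0.20, `severity_method(1)` selects `"delta"` and `severity_from_delta(0.20)` lands in the SEVERE bucket rather than defaulting to MINOR.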
Rollback baseline persistence across CLI runs — fixes #13
rollback() now persists the new baseline to a model-namespaced .current_baseline file so check() uses the correct snapshot after a process restart. Previously the baseline was in-memory only, so pyrecall rollback before_v1 && pyrecall check would compare against the wrong snapshot. _set_baseline() is now the single authoritative place for both in-memory and on-disk state.
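The idea of one setter owning both in-memory and on-disk state might look roughly like this. It is a sketch under stated assumptions: `BaselineStore`, the directory layout, and the way the model name is turned into a folder name are invented for illustration; only the `.current_baseline` file name comes from the notes.

```python
from pathlib import Path


class BaselineStore:
    """Hypothetical helper; Pyrecall's real _set_baseline() may differ."""

    def __init__(self, snapshot_dir, model_name):
        # Assumed namespacing: one sub-directory per model.
        safe_name = model_name.replace("/", "__")
        self._path = Path(snapshot_dir) / safe_name / ".current_baseline"
        self._baseline = None

    def set_baseline(self, name):
        """Update the in-memory value and the file in one place."""
        self._baseline = name
        self._path.parent.mkdir(parents=True, exist_ok=True)
        tmp = self._path.with_name(self._path.name + ".tmp")
        tmp.write_text(name, encoding="utf-8")
        tmp.replace(self._path)  # rename so readers never see a partial file

    def get_baseline(self):
        """Fall back to the persisted value after a process restart."""
        if self._baseline is None and self._path.exists():
            self._baseline = self._path.read_text(encoding="utf-8").strip()
        return self._baseline
```

A fresh `BaselineStore` pointed at the same directory and model returns the name written by an earlier process, which mirrors the `pyrecall rollback before_v1 && pyrecall check` scenario.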
Actionable error when baseline snapshot is missing — fixes #9
check() now catches FileNotFoundError from load_snapshot() and re-raises as PyrecallError with the snapshot name, directory path, and a clear hint — matching the pattern already used in diff(). Previously a stale baseline name from a previous session would produce a raw FileNotFoundError with no guidance.
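The catch-and-re-raise pattern described here can be sketched as below. The `PyrecallError` class and `load_snapshot()` function are simplified stand-ins defined locally, and the `<name>.json` file layout and hint wording are assumptions, not the library's actual behaviour.

```python
from pathlib import Path


class PyrecallError(Exception):
    """Local stand-in for the library's error type."""


def load_snapshot(snapshot_dir, name):
    # Assumed layout: one JSON file per snapshot.
    path = Path(snapshot_dir) / f"{name}.json"
    return path.read_text(encoding="utf-8")  # raises FileNotFoundError if absent


def load_baseline_or_explain(snapshot_dir, name):
    """Translate a raw FileNotFoundError into an actionable error."""
    try:
        return load_snapshot(snapshot_dir, name)
    except FileNotFoundError as exc:
        raise PyrecallError(
            f"Baseline snapshot '{name}' was not found in {snapshot_dir}. "
            "Hint: take a new snapshot or roll back to one that exists."
        ) from exc
```

Chaining with `from exc` keeps the original traceback available for debugging while the user sees the snapshot name, the directory, and a hint.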
Contributors
Big thanks to @Sid294 for shipping the rollback baseline fix (#38), reviewing the replay buffer and severity PRs, and pushing back on the effect-size naming — all of it made this release better.
Full changelog: v0.10.0...v0.10.1
v0.10.0 — log-likelihood scoring, Cohen's d, 160 benchmarks
What's new in v0.10.0
Log-likelihood scoring (new default)
Snapshot scoring now uses log-likelihood by default instead of cosine similarity. This asks "how probable does the model consider the correct answer?" rather than comparing embedding vectors — giving a direct, reliable signal that is consistent with the methodology used by EleutherAI's lm-evaluation-harness.
The legacy cosine scoring method is still available via scoring_method="cosine" for backwards compatibility.
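To make the idea concrete: given the probability the model assigns to each token of the reference answer, the log-likelihood is the sum of the per-token log-probabilities. The sketch below uses made-up probabilities and hypothetical function names; whether Pyrecall normalizes by answer length is not stated in these notes, so the per-token mean is shown only as one possible variant.

```python
import math


def answer_log_likelihood(token_probs):
    """Sum of log-probabilities over the reference-answer tokens."""
    return sum(math.log(p) for p in token_probs)


def mean_token_log_likelihood(token_probs):
    """Length-normalized variant (an assumption, not confirmed by the notes)."""
    return answer_log_likelihood(token_probs) / len(token_probs)


# Illustrative numbers only: a model that is confident in the correct answer
# scores higher (closer to zero) than one that is unsure.
confident = [0.9, 0.8, 0.95]
unsure = [0.3, 0.2, 0.25]
assert answer_log_likelihood(confident) > answer_log_likelihood(unsure)
```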
Cohen's d effect size in forgetting reports
The forgetting report now includes a Cohen's d column computed from paired per-item deltas across benchmark prompts. This replaces the raw threshold comparison as the primary severity signal:
| Severity | Cohen's d |
|---|---|
| OK | Δ ≥ 0 |
| MINOR | \|d\| < 0.2 |
| MODERATE | 0.2 ≤ \|d\| < 0.5 |
| SEVERE | 0.5 ≤ \|d\| < 0.8 |
| CRITICAL | \|d\| ≥ 0.8 |
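One common way to compute an effect size from paired per-item deltas is the mean delta divided by the sample standard deviation of the deltas. The sketch below assumes that formula; the notes do not spell out Pyrecall's exact computation, and the function names and zero-variance handling are illustrative choices. The severity mapping follows the table above.

```python
import math
import statistics


def paired_cohens_d(before, after):
    """Mean of per-item deltas divided by their sample standard deviation."""
    deltas = [a - b for a, b in zip(after, before)]
    if len(deltas) < 2:
        return 0.0  # not enough items to estimate a spread
    mean = statistics.mean(deltas)
    spread = statistics.stdev(deltas)
    if spread == 0.0:
        # Every item moved by the same amount: no spread to divide by.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / spread


def severity_from_effect_size(d, mean_delta):
    """Map an effect size to a severity label using the table's cut-offs."""
    if mean_delta >= 0:
        return "OK"
    size = abs(d)
    if size < 0.2:
        return "MINOR"
    if size < 0.5:
        return "MODERATE"
    if size < 0.8:
        return "SEVERE"
    return "CRITICAL"
```

For example, scores of `[0.8, 0.9, 0.7, 0.85]` before and `[0.7, 0.85, 0.5, 0.8]` after give a mean delta of about -0.1 with a spread of about 0.07, so the drop is large relative to its variability and maps to CRITICAL.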
160 benchmark prompts across 8 categories
The default benchmark suite expanded from 64 to 160 prompts (20 per category):
| Category | What it probes |
|---|---|
| reasoning | Math, logic, pattern recognition |
| instruction_following | Lists, rewrites, format constraints |
| coding | Write, debug, and explain Python |
| general_knowledge | Science, history, geography |
| safety | Refusals, harm avoidance, ethics |
| multilingual | Translation, cross-lingual comprehension |
| tool_use | Function calls, structured JSON output |
| advanced_math | Algebra, calculus, combinatorics |
Upgrade
pip install --upgrade pyrecall
Note: Snapshots taken with v0.9.x used cosine scoring. If you compare a v0.9.x snapshot against a v0.10.0 snapshot, pyrecall will warn you that the scoring methods differ and the scores are not directly comparable. Retake your baseline snapshot after upgrading.
v0.9.0 — Per-category forgetting thresholds
What's new
Per-category forgetting thresholds let you set tighter (or looser) sensitivity per skill — e.g. flag safety at 3% while keeping the global threshold at 10%.
Python API
model = Model(
    "meta-llama/Llama-3.2-1B",
    forgetting_threshold=0.10,
    category_thresholds={
        "safety": 0.03,
        "coding": 0.15,
    },
)
CLI
pyrecall init --model meta-llama/Llama-3.2-1B \
    --category-threshold safety=0.03 \
    --category-threshold coding=0.15
pyrecall check --category-threshold safety=0.03
pyrecall diff before_v1 after_v1 --category-threshold safety=0.03
CLI flags override config-file values for that run only. --json output includes the effective threshold per category so results are always reproducible.
Full changelog
- ForgettingDetector accepts category_thresholds: dict[str, float]
- ForgettingReport._threshold_for(category) resolves the effective threshold
- degraded_skills and to_dict() use per-category thresholds
- --category-threshold flag on init, check, and diff
- Config merging: CLI overrides saved config values per run
- 7 new tests in test_detector.py
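The threshold resolution and config merging listed above can be sketched as follows. All three function names are hypothetical, and the parsing of `name=value` pairs is an assumption about how the `--category-threshold` flag is interpreted; this is not the library's implementation.

```python
def parse_category_thresholds(pairs):
    """Turn ["safety=0.03", "coding=0.15"] into {"safety": 0.03, "coding": 0.15}."""
    thresholds = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected CATEGORY=VALUE, got {pair!r}")
        thresholds[name] = float(value)
    return thresholds


def merge_thresholds(saved, cli):
    """CLI values win for this run; the saved config itself is not modified."""
    merged = dict(saved)
    merged.update(cli)
    return merged


def threshold_for(category, category_thresholds, global_threshold):
    """Use the per-category value if present, else the global threshold."""
    return category_thresholds.get(category, global_threshold)
```

With a saved config of `{"safety": 0.05}` and a CLI flag `safety=0.03`, the effective safety threshold for the run is 0.03, while a category with no entry falls back to the global 0.10.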
v0.8.0 — custom benchmark suites
What's new
Custom benchmark suites
Register your own domain-specific prompts so pyrecall can detect forgetting in skills the built-in 64 benchmarks don't cover.
pyrecall benchmark add nautical.jsonl
pyrecall benchmark add legal.jsonl --name legal_domain
pyrecall benchmark list
pyrecall benchmark remove legal_domain --yes
Benchmark file format (JSONL):
{"prompt": "What does port mean on a ship?", "reference_answer": "The left side when facing the bow.", "category": "nautical"}
Once registered, custom benchmarks run automatically alongside the built-ins on every pyrecall snapshot, with no extra steps needed.
from pyrecall.benchmarks import CustomBenchmarkManager
mgr = CustomBenchmarkManager()
mgr.add("nautical.jsonl")
model = Model("meta-llama/Llama-3.2-1B")
model.snapshot("before_v1")  # runs 64 built-in + all custom prompts
Install
pip install pyrecall==0.8.0
Full changelog
- CustomBenchmarkManager class with add, suites, remove, load_all, count
- pyrecall benchmark add/list/remove CLI subcommands
- Model._run_benchmarks() now merges default + custom prompts automatically
- 28 new tests
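Before registering a benchmark file, it can help to check that every line matches the JSONL format shown above. The sketch below is a standalone validator, not part of Pyrecall; the function name is hypothetical, and treating all three fields as required is an assumption based on the single example record.

```python
import json

# Assumed required keys, taken from the example record in the notes.
REQUIRED_FIELDS = ("prompt", "reference_answer", "category")


def parse_benchmark_lines(lines):
    """Parse JSONL benchmark records, reporting the first malformed line."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        records.append(record)
    return records
```

Passing an open file handle works as well as a list of strings, for example `parse_benchmark_lines(open("nautical.jsonl", encoding="utf-8"))`.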
v0.7.0 — pyrecall compare
What's new
pyrecall compare — side-by-side snapshot table
Compare any number of snapshots in a single Rich table. Each snapshot becomes a column, each skill category a row. Best score per row is highlighted green, worst is red — regressions and recoveries are immediately visible across a full training progression.
pyrecall compare before_v1 after_v1 after_v2 after_v3
pyrecall compare before_v1 after_v1 --json | jq '.categories.coding'
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Category             ┃ before_v1   ┃ after_v1    ┃ after_v2    ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ overall              │ 0.850       │ 0.831       │ 0.844       │
│ reasoning            │ 0.812       │ 0.809       │ 0.815       │
│ coding               │ 0.834       │ 0.641       │ 0.720       │
│ safety               │ 0.901       │ 0.899       │ 0.900       │
└──────────────────────┴─────────────┴─────────────┴─────────────┘
No model loading required — reads directly from stored snapshot data.
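The per-row highlighting amounts to finding the best and worst snapshot for each category. The sketch below works on a plain nested dictionary; the input shape and the function name are assumptions for illustration and do not reflect Pyrecall's stored snapshot format. The scores are the ones from the example table.

```python
def best_and_worst(scores_by_snapshot):
    """Return {category: (best_snapshot, worst_snapshot)} for each row."""
    categories = []
    for scores in scores_by_snapshot.values():
        for category in scores:
            if category not in categories:
                categories.append(category)  # keep first-seen row order
    result = {}
    for category in categories:
        row = {
            snapshot: scores[category]
            for snapshot, scores in scores_by_snapshot.items()
            if category in scores
        }
        result[category] = (max(row, key=row.get), min(row, key=row.get))
    return result


snapshots = {
    "before_v1": {"overall": 0.850, "reasoning": 0.812, "coding": 0.834, "safety": 0.901},
    "after_v1": {"overall": 0.831, "reasoning": 0.809, "coding": 0.641, "safety": 0.899},
    "after_v2": {"overall": 0.844, "reasoning": 0.815, "coding": 0.720, "safety": 0.900},
}
highlights = best_and_worst(snapshots)
```

Here the coding row resolves to `("before_v1", "after_v1")`: the regression after the first training step and the partial recovery in `after_v2` are visible without loading the model.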
Also in this release (v0.6.1)
- NeptuneTracker — log snapshot scores to Neptune alongside W&B and MLflow
- --log-neptune and --neptune-project flags on snapshot and learn commands
- pip install pyrecall[neptune]
Install
pip install pyrecall==0.7.0
Full changelog
- pyrecall compare command with --json flag
- NeptuneTracker class in pyrecall.trackers
- neptune>=1.0.0 added as [neptune] optional dep, included in [trackers]
- 14 new tests (8 for compare, 6 for Neptune)
v0.6.0 — pyrecall diff command
What's new
pyrecall diff command
Compare any two snapshots offline, without loading the model weights:
pyrecall diff before_v1 after_v2
Fast, works offline, and exits with code 2 when forgetting is detected, so you can drop it into CI as a lightweight gate.
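A CI gate built on that exit code might look like the sketch below. Only the meaning of exit code 2 comes from the notes; treating 0 as "no forgetting" and any other code as a tooling error is an assumption, and the function name is hypothetical.

```python
import subprocess
import sys

FORGETTING_EXIT_CODE = 2  # documented in the release notes


def forgetting_detected(command):
    """Run a command and interpret its exit code as a forgetting gate."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == FORGETTING_EXIT_CODE:
        return True
    if result.returncode == 0:  # assumed to mean "no forgetting"
        return False
    raise RuntimeError(
        f"command failed with exit code {result.returncode}: {result.stderr.strip()}"
    )


def main():
    # Intended usage in CI; requires pyrecall and both snapshots to be present.
    if forgetting_detected(["pyrecall", "diff", "before_v1", "after_v2"]):
        sys.exit("Forgetting detected; failing the build.")
```

Separating code 2 from other non-zero codes keeps a missing snapshot or a crash from being reported as a forgetting regression.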
Expanded benchmark suite — 64 prompts, 8 categories
Three new skill categories added:
| Category | What it probes |
|---|---|
| multilingual | Translation, cross-lingual comprehension, language identification |
| tool_use | Function calls, structured JSON output, tool selection |
| advanced_math | Algebra, calculus, combinatorics, proof by induction |
Together with the existing five categories (reasoning, instruction_following, coding, general_knowledge, safety), the suite now runs 64 benchmark prompts per snapshot.
Upgrade
pip install --upgrade pyrecall