Releases: Pyrecall/Pyrecall

v0.10.1 — Reliability & correctness patch

12 Jun 22:09
35b6a4d

What's new

This is a focused patch release fixing four bugs discovered and resolved after v0.10.0.


🐛 Bug fixes

Replay buffer deduplication — fixes #11
ReplayBuffer.add() now deduplicates entries by SHA-256 hash. Calling model.learn("data.jsonl") twice no longer inflates the buffer with duplicate examples or skews replay_mix_ratio. Duplicates are logged at DEBUG level. total_seen counts unique examples only, keeping reservoir sampling probabilities honest.
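
The interaction between hash-based deduplication and reservoir sampling can be sketched as follows. This is a minimal illustration, not pyrecall's actual implementation: the class name `DedupReservoir`, its fields, and the choice to hash a canonical JSON encoding of each example are assumptions made for this example.

```python
import hashlib
import json
import random


class DedupReservoir:
    """Hypothetical fixed-size replay buffer with SHA-256 deduplication."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen_hashes = set()
        self.total_seen = 0  # counts unique examples only
        self._rng = random.Random(seed)

    @staticmethod
    def _digest(example):
        # Canonical JSON so that key order does not change the hash.
        payload = json.dumps(example, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def add(self, example):
        """Return True if the example was new, False if it was a duplicate."""
        digest = self._digest(example)
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        self.total_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Algorithm R: keep the new item with probability capacity / total_seen.
            slot = self._rng.randrange(self.total_seen)
            if slot < self.capacity:
                self.items[slot] = example
        return True


buffer = DedupReservoir(capacity=3)
data = [{"prompt": f"q{i}", "answer": f"a{i}"} for i in range(5)]
first_pass = [buffer.add(ex) for ex in data]
second_pass = [buffer.add(ex) for ex in data]  # simulates learning the same file twice
```

Because duplicates return before `total_seen` is incremented, the replacement probability `capacity / total_seen` depends only on the number of distinct examples, which is the property the fix is meant to preserve.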

Single-item category severity — fixes #10
CategoryComparison.severity previously returned MINOR for every custom benchmark category with only one prompt, because cohen_d was forced to 0.0 when n_items < 2. It now falls back to threshold-based delta buckets (MINOR < 0.05, MODERATE < 0.15, SEVERE < 0.30, CRITICAL ≥ 0.30). A new severity_method field ("effect_size" | "delta") is exposed in to_dict() and the report table marks single-item rows with n/a * and a footer note.
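
The fallback can be pictured as a simple bucket lookup. The sketch below uses the thresholds listed above; the function names are hypothetical, and it assumes the delta is passed as the size of the score drop (how improvements are labelled on this path is not covered here).

```python
def severity_from_delta(drop):
    """Map the size of a score drop to a severity label using the delta buckets."""
    drop = abs(drop)
    if drop < 0.05:
        return "MINOR"
    if drop < 0.15:
        return "MODERATE"
    if drop < 0.30:
        return "SEVERE"
    return "CRITICAL"


def pick_severity_method(n_items):
    # Effect size needs at least two paired items; otherwise fall back to delta.
    return "effect_size" if n_items >= 2 else "delta"


labels = [severity_from_delta(d) for d in (0.02, 0.10, 0.20, 0.30)]
```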

Rollback baseline persistence across CLI runs — fixes #13
rollback() now persists the new baseline to a model-namespaced .current_baseline file so check() uses the correct snapshot after a process restart. Previously the baseline was in-memory only, so pyrecall rollback before_v1 && pyrecall check would compare against the wrong snapshot. _set_baseline() is now the single authoritative place for both in-memory and on-disk state.
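
A rough sketch of the persistence idea, assuming a layout in which each model gets its own subdirectory holding a `.current_baseline` file. The helper names, the directory layout, and the way the model name is made filesystem-safe are all hypothetical; only the file name comes from the notes above.

```python
import tempfile
from pathlib import Path


def baseline_file(snapshot_dir, model_name):
    # Namespace by model so two models sharing a directory do not collide.
    safe = model_name.replace("/", "__")
    return Path(snapshot_dir) / safe / ".current_baseline"


def set_baseline(snapshot_dir, model_name, snapshot_name):
    # Single place that writes the on-disk state.
    path = baseline_file(snapshot_dir, model_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(snapshot_name, encoding="utf-8")


def get_baseline(snapshot_dir, model_name, default=None):
    path = baseline_file(snapshot_dir, model_name)
    if path.exists():
        return path.read_text(encoding="utf-8").strip()
    return default


with tempfile.TemporaryDirectory() as tmp:
    set_baseline(tmp, "meta-llama/Llama-3.2-1B", "before_v1")
    # A fresh process would start here and read the same file.
    restored = get_baseline(tmp, "meta-llama/Llama-3.2-1B")
    other = get_baseline(tmp, "some/other-model")
```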

Actionable error when baseline snapshot is missing — fixes #9
check() now catches FileNotFoundError from load_snapshot() and re-raises as PyrecallError with the snapshot name, directory path, and a clear hint — matching the pattern already used in diff(). Previously a stale baseline name from a previous session would produce a raw FileNotFoundError with no guidance.
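
The catch-and-re-raise pattern looks roughly like this. `PyrecallError` and `load_snapshot` are stand-ins defined locally for the sketch, and the `.json` file extension and the wording of the hint are assumptions rather than the library's actual behaviour.

```python
from pathlib import Path


class PyrecallError(Exception):
    """Stand-in for the library's error type, defined here for the sketch."""


def load_snapshot(directory, name):
    # Stand-in loader: raises FileNotFoundError when the file is absent.
    return (Path(directory) / f"{name}.json").read_text(encoding="utf-8")


def load_baseline_or_explain(directory, name):
    try:
        return load_snapshot(directory, name)
    except FileNotFoundError as exc:
        # Chain the original error so the traceback is still available.
        raise PyrecallError(
            f"Baseline snapshot '{name}' was not found in '{directory}'. "
            "Take a new snapshot or point the baseline at an existing one."
        ) from exc


try:
    load_baseline_or_explain("/nonexistent-pyrecall-dir", "before_v1")
except PyrecallError as err:
    message = str(err)
```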


Contributors

Big thanks to @Sid294 for shipping the rollback baseline fix (#38), reviewing the replay buffer and severity PRs, and pushing back on the effect-size naming — all of it made this release better.

Full changelog: v0.10.0...v0.10.1

v0.10.0 — log-likelihood scoring, Cohen's d, 160 benchmarks

12 Jun 16:35
2eae7a8

What's new in v0.10.0

Log-likelihood scoring (new default)

Snapshot scoring now uses log-likelihood by default instead of cosine similarity. Rather than comparing embedding vectors, it asks how probable the model considers the correct answer. This measures the model's output distribution directly and is consistent with the methodology used by EleutherAI's lm-evaluation-harness.

The legacy cosine scoring method is still available via scoring_method="cosine" for backwards compatibility.
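
To make the idea concrete, here is a minimal sketch of log-likelihood scoring on toy numbers. The per-token probabilities are invented for illustration; in practice they would come from the model's output distribution over the reference answer. Whether pyrecall normalises by answer length or rescales the result is not stated in these notes, so both helpers are assumptions.

```python
import math


def answer_log_likelihood(token_probs):
    """Sum of log-probabilities the model assigned to each reference token."""
    return sum(math.log(p) for p in token_probs)


def mean_log_likelihood(token_probs):
    # Length-normalised variant so long answers are not penalised for length alone.
    return answer_log_likelihood(token_probs) / len(token_probs)


# Toy probabilities for the tokens of one reference answer,
# before and after a hypothetical fine-tuning run.
before = [0.9, 0.8, 0.95]
after = [0.6, 0.3, 0.7]

ll_before = answer_log_likelihood(before)
ll_after = answer_log_likelihood(after)
```

A lower log-likelihood after training means the model now finds the correct answer less probable, which is the signal a forgetting check looks for.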

Cohen's d effect size in forgetting reports

The forgetting report now includes a Cohen's d column computed from paired per-item deltas across benchmark prompts. This replaces the raw threshold comparison as the primary severity signal:

Severity   Cohen's d
OK         Δ ≥ 0
MINOR      |d| < 0.2
MODERATE   0.2 ≤ |d| < 0.5
SEVERE     0.5 ≤ |d| < 0.8
CRITICAL   |d| ≥ 0.8
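
One common way to compute Cohen's d from paired deltas is to divide the mean per-item delta by the sample standard deviation of those deltas. The sketch below applies that formula and the bands from the table; the function names are hypothetical, and the exact estimator pyrecall uses is an assumption.

```python
import math
import statistics


def paired_cohens_d(before, after):
    """Cohen's d for paired samples: mean per-item delta over its std dev."""
    deltas = [a - b for a, b in zip(after, before)]
    if len(deltas) < 2:
        return 0.0  # undefined with a single item
    mean_delta = statistics.mean(deltas)
    spread = statistics.stdev(deltas)
    if spread == 0:
        # Every item moved by the same amount: no effect, or an unbounded one.
        return 0.0 if mean_delta == 0 else math.copysign(math.inf, mean_delta)
    return mean_delta / spread


def severity_from_effect_size(before, after):
    deltas = [a - b for a, b in zip(after, before)]
    if statistics.mean(deltas) >= 0:
        return "OK"
    d = abs(paired_cohens_d(before, after))
    if d < 0.2:
        return "MINOR"
    if d < 0.5:
        return "MODERATE"
    if d < 0.8:
        return "SEVERE"
    return "CRITICAL"


consistent_drop = severity_from_effect_size(
    [0.8, 0.9, 0.7, 0.85], [0.7, 0.85, 0.5, 0.8]
)
noisy = severity_from_effect_size([0.8, 0.8, 0.8, 0.8], [0.9, 0.7, 0.85, 0.74])
```

A consistent drop across items yields a large |d| even when each individual delta is modest, while a noisy mix of gains and losses yields a small one.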

160 benchmark prompts across 8 categories

The default benchmark suite expanded from 64 to 160 prompts (20 per category):

Category               What it probes
reasoning              Math, logic, pattern recognition
instruction_following  Lists, rewrites, format constraints
coding                 Write, debug, and explain Python
general_knowledge      Science, history, geography
safety                 Refusals, harm avoidance, ethics
multilingual           Translation, cross-lingual comprehension
tool_use               Function calls, structured JSON output
advanced_math          Algebra, calculus, combinatorics

Upgrade

pip install --upgrade pyrecall

Note: Snapshots taken with v0.9.x used cosine scoring. If you compare a v0.9.x snapshot against a v0.10.0 snapshot, pyrecall will warn you that scoring methods differ and the scores are not directly comparable. Retake your baseline snapshot after upgrading.

v0.9.0 — Per-category forgetting thresholds

12 Jun 07:00
56e7efa

What's new

Per-category forgetting thresholds let you set tighter (or looser) sensitivity per skill — e.g. flag safety at 3% while keeping the global threshold at 10%.

Python API

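from pyrecall import Model  # import path assumed; the original snippet omits it

model = Model(
    "meta-llama/Llama-3.2-1B",
    forgetting_threshold=0.10,  # global default for every category
    category_thresholds={
        "safety": 0.03,  # tighter: flag a 3% drop
        "coding": 0.15,  # looser: tolerate up to a 15% drop
    },
)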

CLI

pyrecall init --model meta-llama/Llama-3.2-1B \
    --category-threshold safety=0.03 \
    --category-threshold coding=0.15

pyrecall check --category-threshold safety=0.03
pyrecall diff before_v1 after_v1 --category-threshold safety=0.03

CLI flags override config-file values for that run only. --json output includes the effective threshold per category so results are always reproducible.
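
The override behaviour can be sketched as a per-run merge in which CLI values win over saved values, and anything not listed falls back to the global threshold. The function names and the `CATEGORY=VALUE` parsing details below are hypothetical, written only to illustrate the precedence.

```python
def parse_category_thresholds(pairs):
    """Turn ['safety=0.03', 'coding=0.15'] into {'safety': 0.03, 'coding': 0.15}."""
    result = {}
    for pair in pairs:
        name, _, value = pair.partition("=")
        if not name or not value:
            raise ValueError(f"expected CATEGORY=VALUE, got {pair!r}")
        result[name.strip()] = float(value)
    return result


def effective_thresholds(global_threshold, saved, cli, categories):
    # Build a new dict so the saved config is left untouched for later runs.
    merged = {**saved, **cli}
    return {c: merged.get(c, global_threshold) for c in categories}


saved = {"safety": 0.05, "coding": 0.15}
cli = parse_category_thresholds(["safety=0.03"])
effective = effective_thresholds(
    0.10, saved, cli, ["safety", "coding", "reasoning"]
)
```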

Full changelog

  • ForgettingDetector accepts category_thresholds: dict[str, float]
  • ForgettingReport._threshold_for(category) resolves the effective threshold
  • degraded_skills and to_dict() use per-category thresholds
  • --category-threshold flag on init, check, and diff
  • Config merging: CLI overrides saved config values per run
  • 7 new tests in test_detector.py

v0.8.0 — custom benchmark suites

12 Jun 06:15
071e694

What's new

Custom benchmark suites

Register your own domain-specific prompts so pyrecall can detect forgetting in skills that the 64 built-in benchmark prompts don't cover.

pyrecall benchmark add nautical.jsonl
pyrecall benchmark add legal.jsonl --name legal_domain
pyrecall benchmark list
pyrecall benchmark remove legal_domain --yes

Benchmark file format (JSONL):

{"prompt": "What does port mean on a ship?", "reference_answer": "The left side when facing the bow.", "category": "nautical"}
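
Before registering a file, it can help to check that every line parses and carries the expected keys. The sketch below is a standalone validator, not part of pyrecall; it assumes that all three keys shown in the example line are required, which the notes do not state explicitly.

```python
import json

REQUIRED_KEYS = ("prompt", "reference_answer", "category")


def parse_benchmark_lines(lines):
    """Parse JSONL lines, skipping blanks and rejecting incomplete records."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {number}: missing {', '.join(missing)}")
        records.append(record)
    return records


sample = [
    '{"prompt": "What does port mean on a ship?", '
    '"reference_answer": "The left side when facing the bow.", '
    '"category": "nautical"}',
    "",
]
records = parse_benchmark_lines(sample)
```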

Once registered, custom benchmarks run automatically alongside the built-ins on every pyrecall snapshot — no extra steps needed.

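from pyrecall import Model  # import path assumed; the original snippet omits it
from pyrecall.benchmarks import CustomBenchmarkManager

# Register the custom suite once; later snapshots pick it up automatically.
mgr = CustomBenchmarkManager()
mgr.add("nautical.jsonl")

model = Model("meta-llama/Llama-3.2-1B")
model.snapshot("before_v1")  # runs 64 built-in + all custom prompts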

Install

pip install pyrecall==0.8.0

Full changelog

  • CustomBenchmarkManager class with add, suites, remove, load_all, count
  • pyrecall benchmark add/list/remove CLI subcommands
  • Model._run_benchmarks() now merges default + custom prompts automatically
  • 28 new tests

v0.7.0 — pyrecall compare

12 Jun 05:41
8e540ab

What's new

pyrecall compare — side-by-side snapshot table

Compare any number of snapshots in a single Rich table. Each snapshot becomes a column, each skill category a row. Best score per row is highlighted green, worst is red — regressions and recoveries are immediately visible across a full training progression.

pyrecall compare before_v1 after_v1 after_v2 after_v3
pyrecall compare before_v1 after_v1 --json | jq '.categories.coding'
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Category             ┃  before_v1  ┃   after_v1  ┃   after_v2  ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ overall              │  0.850      │  0.831      │  0.844      │
│ reasoning            │  0.812      │  0.809      │  0.815      │
│ coding               │  0.834      │  0.641      │  0.720      │
│ safety               │  0.901      │  0.899      │  0.900      │
└──────────────────────┴─────────────┴─────────────┴─────────────┘

No model loading required — reads directly from stored snapshot data.
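
The per-row highlighting reduces to finding the highest and lowest score in each category. Here is a small sketch using the numbers from the table above; the nested-dict shape is a hypothetical stand-in for the stored snapshot data, not the actual `--json` schema.

```python
def best_and_worst(row):
    """Return the snapshot names with the highest and lowest score in one row."""
    best = max(row, key=row.get)
    worst = min(row, key=row.get)
    return best, worst


scores = {
    "coding": {"before_v1": 0.834, "after_v1": 0.641, "after_v2": 0.720},
    "safety": {"before_v1": 0.901, "after_v1": 0.899, "after_v2": 0.900},
}
highlights = {category: best_and_worst(row) for category, row in scores.items()}
```

On ties, `max` and `min` return the first matching snapshot, so a real renderer would need its own rule for highlighting equal scores.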

Also in this release (v0.6.1)

  • NeptuneTracker — log snapshot scores to Neptune alongside W&B and MLflow
  • --log-neptune and --neptune-project flags on snapshot and learn commands
  • pip install pyrecall[neptune]

Install

pip install pyrecall==0.7.0

Full changelog

  • pyrecall compare command with --json flag
  • NeptuneTracker class in pyrecall.trackers
  • neptune>=1.0.0 added as [neptune] optional dep, included in [trackers]
  • 14 new tests (8 for compare, 6 for Neptune)

v0.6.0 — pyrecall diff command

12 Jun 00:13
deab2d6

What's new

pyrecall diff command

Compare any two snapshots offline, without loading the model weights:

pyrecall diff before_v1 after_v2

Fast, works offline, and exits with code 2 when forgetting is detected — drop it into CI as a lightweight gate.
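
A CI step can branch on that exit code. The sketch below wraps a command and classifies its result; only exit code 2 is documented above, so treating 0 as a pass and every other code as an error is an assumption. A tiny Python child process stands in for `pyrecall diff` so the example runs without the tool installed.

```python
import subprocess
import sys

FORGETTING_EXIT_CODE = 2  # documented above for `pyrecall diff`


def run_gate(command):
    """Run a command and classify its exit status for a CI step."""
    code = subprocess.run(command).returncode
    if code == 0:
        return "pass"
    if code == FORGETTING_EXIT_CODE:
        return "forgetting"
    return "error"


# In CI the command would be ["pyrecall", "diff", "before_v1", "after_v2"].
# Here a child process that exits with 2 stands in for it.
fake_diff = [sys.executable, "-c", "import sys; sys.exit(2)"]
outcome = run_gate(fake_diff)
```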

Expanded benchmark suite — 64 prompts, 8 categories

Three new skill categories added:

Category       What it probes
multilingual   Translation, cross-lingual comprehension, language identification
tool_use       Function calls, structured JSON output, tool selection
advanced_math  Algebra, calculus, combinatorics, proof by induction

Together with the existing five categories (reasoning, instruction_following, coding, general_knowledge, safety), the suite now runs 64 benchmark prompts per snapshot.

Upgrade

pip install --upgrade pyrecall

Full changelog

v0.5.1...v0.6.0