Releases · Pyrecall/Pyrecall

12 Jun 22:09

Arths17

v0.10.1

35b6a4d

v0.10.1 — Reliability & correctness patch Latest

What's new

This is a focused patch release fixing four bugs discovered and resolved after v0.10.0.

🐛 Bug fixes

Replay buffer deduplication — fixes #11
ReplayBuffer.add() now deduplicates entries by SHA-256 hash. Calling model.learn("data.jsonl") twice no longer inflates the buffer with duplicate examples or skews replay_mix_ratio. Duplicates are logged at DEBUG level. total_seen now counts unique examples only, so reservoir sampling probabilities stay correct.
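The interaction between hash-based deduplication and reservoir sampling can be sketched as follows. This is a minimal illustration, not pyrecall's ReplayBuffer: the class name `DedupReservoir`, the canonical-JSON hashing, and the seed parameter are all assumptions made for the example.

```python
import hashlib
import json
import random


class DedupReservoir:
    """Hypothetical stand-in for a replay buffer; not pyrecall's ReplayBuffer."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.total_seen = 0           # counts unique examples only
        self._seen_hashes = set()
        self._rng = random.Random(seed)

    @staticmethod
    def _digest(example):
        # Canonical JSON (sorted keys) so key order does not change the hash.
        payload = json.dumps(example, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def add(self, example):
        """Return True if the example was new, False if it was a duplicate."""
        digest = self._digest(example)
        if digest in self._seen_hashes:
            return False
        self._seen_hashes.add(digest)
        self.total_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Algorithm R: the new item survives with probability capacity / total_seen.
            slot = self._rng.randrange(self.total_seen)
            if slot < self.capacity:
                self.items[slot] = example
        return True


buf = DedupReservoir(capacity=2)
data = [{"prompt": "a"}, {"prompt": "b"}, {"prompt": "c"}]
for _ in range(2):                    # simulate learning from the same file twice
    for example in data:
        buf.add(example)
```

If duplicates were allowed to increment `total_seen`, the replacement probability `capacity / total_seen` would shrink faster than the number of distinct examples justifies, which is the skew the fix is described as removing.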

Single-item category severity — fixes #10
CategoryComparison.severity previously returned MINOR for every custom benchmark category with only one prompt, because cohen_d was forced to 0.0 when n_items < 2. It now falls back to threshold-based delta buckets (MINOR < 0.05, MODERATE < 0.15, SEVERE < 0.30, CRITICAL ≥ 0.30). A new severity_method field ("effect_size" | "delta") is exposed in to_dict() and the report table marks single-item rows with n/a * and a footer note.
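The fallback logic can be sketched as below. The function names and the convention that `drop` is the magnitude of the score decrease are assumptions for illustration; only the cutoffs and the two `severity_method` values come from the release notes (the effect-size cutoffs are the ones listed for v0.10.0).

```python
# Upper bounds per label; anything at or above the last bound is CRITICAL.
DELTA_CUTOFFS = [("MINOR", 0.05), ("MODERATE", 0.15), ("SEVERE", 0.30)]
EFFECT_CUTOFFS = [("MINOR", 0.2), ("MODERATE", 0.5), ("SEVERE", 0.8)]


def bucket(value, cutoffs):
    """Return the first label whose upper bound exceeds value."""
    for label, upper in cutoffs:
        if value < upper:
            return label
    return "CRITICAL"


def classify(drop, cohen_d, n_items):
    """Hypothetical helper: returns (severity, severity_method)."""
    if n_items < 2:
        # Cohen's d is undefined for a single item, so use the raw delta.
        return bucket(abs(drop), DELTA_CUTOFFS), "delta"
    return bucket(abs(cohen_d), EFFECT_CUTOFFS), "effect_size"
```

With a single-prompt category and a drop of 0.20, this sketch reports SEVERE via the delta method instead of the MINOR that a forced d of 0.0 would produce.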

Rollback baseline persistence across CLI runs — fixes #13
rollback() now persists the new baseline to a model-namespaced .current_baseline file so check() uses the correct snapshot after a process restart. Previously the baseline was in-memory only, so pyrecall rollback before_v1 && pyrecall check would compare against the wrong snapshot. _set_baseline() is now the single authoritative place for both in-memory and on-disk state.
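One way to keep in-memory and on-disk baseline state in a single place is sketched below. The `BaselineStore` class, the directory layout, and the way the model name is turned into a folder name are hypothetical; the notes only state that the file is called .current_baseline and is namespaced by model.

```python
import tempfile
from pathlib import Path


class BaselineStore:
    """Hypothetical sketch of a single authoritative baseline setter."""

    def __init__(self, directory, model_name):
        # Assumed namespacing: one sub-directory per model.
        safe_name = model_name.replace("/", "__")
        self._path = Path(directory) / safe_name / ".current_baseline"
        self._baseline = None

    def set_baseline(self, name):
        # Update memory and disk together so they cannot drift apart.
        self._baseline = name
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._path.write_text(name, encoding="utf-8")

    def get_baseline(self):
        # A fresh process has no in-memory value, so fall back to the file.
        if self._baseline is None and self._path.exists():
            self._baseline = self._path.read_text(encoding="utf-8").strip()
        return self._baseline


with tempfile.TemporaryDirectory() as tmp:
    # First "process": rollback sets the baseline.
    BaselineStore(tmp, "meta-llama/Llama-3.2-1B").set_baseline("before_v1")
    # Second "process": a new object sees the persisted value.
    restored = BaselineStore(tmp, "meta-llama/Llama-3.2-1B").get_baseline()
    other_model = BaselineStore(tmp, "some-org/other-model").get_baseline()
```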

Actionable error when baseline snapshot is missing — fixes #9
check() now catches FileNotFoundError from load_snapshot() and re-raises as PyrecallError with the snapshot name, directory path, and a clear hint — matching the pattern already used in diff(). Previously a stale baseline name from a previous session would produce a raw FileNotFoundError with no guidance.
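The catch-and-re-raise pattern described here looks roughly like the sketch below. `SnapshotError`, the `.json` file extension, and the hint wording are placeholders chosen for the example; pyrecall's own PyrecallError and message text may differ.

```python
import tempfile
from pathlib import Path


class SnapshotError(Exception):
    """Placeholder for a library-specific error type."""


def load_snapshot(directory, name):
    # Raises FileNotFoundError when the snapshot file does not exist.
    return (Path(directory) / f"{name}.json").read_text(encoding="utf-8")


def check(directory, baseline_name):
    try:
        return load_snapshot(directory, baseline_name)
    except FileNotFoundError as exc:
        # Re-raise with the name, the directory, and a next step for the user.
        raise SnapshotError(
            f"Baseline snapshot '{baseline_name}' not found in {directory}. "
            "Hint: take a new baseline snapshot or roll back to an existing one."
        ) from exc


with tempfile.TemporaryDirectory() as tmp:
    try:
        check(tmp, "before_v1")
        message = ""
    except SnapshotError as err:
        message = str(err)
        cause = err.__cause__
```

Chaining with `from exc` keeps the original FileNotFoundError available in the traceback for debugging.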

Contributors

Big thanks to @Sid294 for shipping the rollback baseline fix (#38), reviewing the replay buffer and severity PRs, and pushing back on the effect-size naming — all of it made this release better.

Full changelog: v0.10.0...v0.10.1

12 Jun 16:35

Arths17

v0.10.0

2eae7a8

v0.10.0 — log-likelihood scoring, Cohen's d, 160 benchmarks

What's new in v0.10.0

Log-likelihood scoring (new default)

Snapshot scoring now uses log-likelihood by default instead of cosine similarity. This asks "how probable does the model consider the correct answer?" rather than comparing embedding vectors. The result is a direct signal that is consistent with the methodology used by EleutherAI's lm-evaluation-harness.

The legacy cosine scoring method is still available via scoring_method="cosine" for backwards compatibility.
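As a rough illustration of log-likelihood scoring, the sketch below sums the log-probability a model assigns to each token of the reference answer. It works on plain lists of logits and is an assumption-laden toy: the function names are invented here, and the notes do not say whether pyrecall normalizes by answer length or how it tokenizes.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def answer_log_likelihood(step_logits, answer_token_ids):
    """Sum of log P(token_t | context) over the reference answer's tokens.

    step_logits[t] holds the model's logits at the position that predicts
    answer_token_ids[t].
    """
    total = 0.0
    for logits, token_id in zip(step_logits, answer_token_ids):
        total += log_softmax(logits)[token_id]
    return total


# Toy 3-token vocabulary, 2-token answer.
step_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
favoured = answer_log_likelihood(step_logits, [0, 1])    # tokens the model prefers
disfavoured = answer_log_likelihood(step_logits, [1, 0])
uniform = answer_log_likelihood([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

A higher (less negative) value means the model considers the reference answer more probable.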

Cohen's d effect size in forgetting reports

The forgetting report now includes a Cohen's d column computed from paired per-item deltas across benchmark prompts. This replaces the raw threshold comparison as the primary severity signal:

| Severity | Condition |
| --- | --- |
| OK | Δ ≥ 0 |
| MINOR | \|d\| < 0.2 |
| MODERATE | 0.2 ≤ \|d\| < 0.5 |
| SEVERE | 0.5 ≤ \|d\| < 0.8 |
| CRITICAL | \|d\| ≥ 0.8 |
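A paired effect size of this kind can be computed as the mean of the per-item deltas divided by their sample standard deviation. The sketch below uses that variant as an assumption; the notes do not specify which Cohen's d formula pyrecall applies, and `paired_cohens_d` is a name invented for the example.

```python
import math
import statistics


def paired_cohens_d(before, after):
    """Effect size of per-item score changes (after minus before)."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    deltas = [a - b for b, a in zip(before, after)]
    if len(deltas) < 2:
        return 0.0                      # standard deviation undefined for one item
    mean = statistics.mean(deltas)
    spread = statistics.stdev(deltas)   # sample standard deviation (n - 1)
    if spread == 0.0:
        # Identical deltas: no change at all, or an unbounded effect.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / spread


before = [0.80, 0.90, 0.70, 0.85]
after = [0.60, 0.80, 0.65, 0.70]
d = paired_cohens_d(before, after)
```

For these made-up scores every item drops, so d is strongly negative and its magnitude falls in the CRITICAL band of the table above.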

160 benchmark prompts across 8 categories

The default benchmark suite expanded from 64 to 160 prompts (20 per category):

| Category | What it probes |
| --- | --- |
| `reasoning` | Math, logic, pattern recognition |
| `instruction_following` | Lists, rewrites, format constraints |
| `coding` | Write, debug, and explain Python |
| `general_knowledge` | Science, history, geography |
| `safety` | Refusals, harm avoidance, ethics |
| `multilingual` | Translation, cross-lingual comprehension |
| `tool_use` | Function calls, structured JSON output |
| `advanced_math` | Algebra, calculus, combinatorics |

Upgrade

pip install --upgrade pyrecall

Note: Snapshots taken with v0.9.x used cosine scoring. If you compare a v0.9.x snapshot against a v0.10.0 snapshot, pyrecall will warn you that scoring methods differ and the scores are not directly comparable. Retake your baseline snapshot after upgrading.

12 Jun 07:00

Arths17

v0.9.0

56e7efa

v0.9.0 — Per-category forgetting thresholds

What's new

Per-category forgetting thresholds let you set tighter (or looser) sensitivity per skill — e.g. flag safety at 3% while keeping the global threshold at 10%.

Python API

model = Model(
    "meta-llama/Llama-3.2-1B",
    forgetting_threshold=0.10,
    category_thresholds={
        "safety": 0.03,
        "coding": 0.15,
    },
)

CLI

pyrecall init --model meta-llama/Llama-3.2-1B \
    --category-threshold safety=0.03 \
    --category-threshold coding=0.15

pyrecall check --category-threshold safety=0.03
pyrecall diff before_v1 after_v1 --category-threshold safety=0.03

CLI flags override config-file values for that run only. --json output includes the effective threshold per category so results are always reproducible.
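The override order described above (global threshold, then saved config, then CLI flags for the current run) can be sketched like this. Both function names are hypothetical and the parsing rules are an assumption based on the `name=value` form shown in the CLI examples.

```python
def parse_category_threshold(arg):
    """Parse one 'category=value' flag value into a (name, float) pair."""
    name, sep, value = arg.partition("=")
    if not sep or not name:
        raise ValueError(f"expected CATEGORY=VALUE, got {arg!r}")
    return name, float(value)


def effective_thresholds(global_threshold, config_overrides, cli_args, categories):
    """Resolve the threshold used for each category in this run."""
    merged = dict(config_overrides)                 # saved config values
    merged.update(parse_category_threshold(a) for a in cli_args)  # CLI wins
    return {c: merged.get(c, global_threshold) for c in categories}


resolved = effective_thresholds(
    global_threshold=0.10,
    config_overrides={"safety": 0.05, "coding": 0.15},
    cli_args=["safety=0.03"],
    categories=["safety", "coding", "reasoning"],
)
```

Because the merge happens on a copy, the saved config is left unchanged, which matches the "for that run only" behavior.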

Full changelog

ForgettingDetector accepts category_thresholds: dict[str, float]
ForgettingReport._threshold_for(category) resolves the effective threshold
degraded_skills and to_dict() use per-category thresholds
--category-threshold flag on init, check, and diff
Config merging: CLI overrides saved config values per run
7 new tests in test_detector.py

12 Jun 06:15

Arths17

v0.8.0

071e694

v0.8.0 — custom benchmark suites

What's new

Custom benchmark suites

Register your own domain-specific prompts so pyrecall can detect forgetting in skills the built-in 64 benchmarks don't cover.

pyrecall benchmark add nautical.jsonl
pyrecall benchmark add legal.jsonl --name legal_domain
pyrecall benchmark list
pyrecall benchmark remove legal_domain --yes

Benchmark file format (JSONL):

{"prompt": "What does port mean on a ship?", "reference_answer": "The left side when facing the bow.", "category": "nautical"}

Once registered, custom benchmarks run automatically alongside the built-ins on every pyrecall snapshot — no extra steps needed.

from pyrecall import Model  # assumed import path; the original snippet omits it
from pyrecall.benchmarks import CustomBenchmarkManager

mgr = CustomBenchmarkManager()
mgr.add("nautical.jsonl")  # register the custom suite

model = Model("meta-llama/Llama-3.2-1B")
model.snapshot("before_v1")  # runs 64 built-in + all custom prompts
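Before registering a file, it can help to check that every line matches the JSONL format shown above. The validator below is a standalone sketch that assumes all three fields from the example (prompt, reference_answer, category) are required; pyrecall's own validation rules are not described in these notes.

```python
import json

REQUIRED_KEYS = ("prompt", "reference_answer", "category")


def validate_benchmark_lines(lines):
    """Parse JSONL lines and fail with the line number on the first problem."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue                                  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        records.append(record)
    return records


sample = [
    '{"prompt": "What does port mean on a ship?", '
    '"reference_answer": "The left side when facing the bow.", '
    '"category": "nautical"}',
    "",
]
records = validate_benchmark_lines(sample)
```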

Install

pip install pyrecall==0.8.0

Full changelog

CustomBenchmarkManager class with add, suites, remove, load_all, count
pyrecall benchmark add/list/remove CLI subcommands
Model._run_benchmarks() now merges default + custom prompts automatically
28 new tests

12 Jun 05:41

Arths17

v0.7.0

8e540ab

v0.7.0 — pyrecall compare

What's new

`pyrecall compare` — side-by-side snapshot table

Compare any number of snapshots in a single Rich table. Each snapshot becomes a column, each skill category a row. Best score per row is highlighted green, worst is red — regressions and recoveries are immediately visible across a full training progression.

pyrecall compare before_v1 after_v1 after_v2 after_v3
pyrecall compare before_v1 after_v1 --json | jq '.categories.coding'

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Category             ┃  before_v1  ┃   after_v1  ┃   after_v2  ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ overall              │  0.850      │  0.831      │  0.844      │
│ reasoning            │  0.812      │  0.809      │  0.815      │
│ coding               │  0.834      │  0.641      │  0.720      │
│ safety               │  0.901      │  0.899      │  0.900      │
└──────────────────────┴─────────────┴─────────────┴─────────────┘

No model loading required — reads directly from stored snapshot data.
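The best/worst highlighting amounts to a per-row maximum and minimum over stored scores. The sketch below reproduces that selection on the numbers from the sample table; the dictionary layout and the `best_and_worst` helper are assumptions and do not reflect pyrecall's snapshot storage format or its Rich rendering.

```python
def best_and_worst(scores_by_snapshot):
    """For each category, return (best_snapshot, worst_snapshot)."""
    first = next(iter(scores_by_snapshot.values()))
    result = {}
    for category in first:
        row = {name: scores[category] for name, scores in scores_by_snapshot.items()}
        result[category] = (max(row, key=row.get), min(row, key=row.get))
    return result


# Scores copied from the example table above.
snapshots = {
    "before_v1": {"overall": 0.850, "reasoning": 0.812, "coding": 0.834, "safety": 0.901},
    "after_v1": {"overall": 0.831, "reasoning": 0.809, "coding": 0.641, "safety": 0.899},
    "after_v2": {"overall": 0.844, "reasoning": 0.815, "coding": 0.720, "safety": 0.900},
}
highlights = best_and_worst(snapshots)
```

In this data the coding row shows the regression in after_v1 and a partial recovery in after_v2, which is the pattern the table is meant to make visible.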

Also in this release (v0.6.1)

NeptuneTracker — log snapshot scores to Neptune alongside W&B and MLflow
--log-neptune and --neptune-project flags on snapshot and learn commands
pip install pyrecall[neptune]

Install

pip install pyrecall==0.7.0

Full changelog

pyrecall compare command with --json flag
NeptuneTracker class in pyrecall.trackers
neptune>=1.0.0 added as [neptune] optional dep, included in [trackers]
14 new tests (8 for compare, 6 for Neptune)

12 Jun 00:13

Arths17

v0.6.0

deab2d6

v0.6.0 — pyrecall diff command

What's new

`pyrecall diff` command

Compare any two snapshots offline, without loading the model weights:

pyrecall diff before_v1 after_v2

Fast, works offline, and exits with code 2 when forgetting is detected — drop it into CI as a lightweight gate.
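A CI gate built on that exit code might look like the sketch below. Only exit code 2 is documented in these notes; treating 0 as a pass and every other code as a tooling error is an assumption, as is the `diff_gate` helper itself. The runner is injectable so the logic can be exercised without pyrecall installed.

```python
import subprocess

FORGETTING_EXIT_CODE = 2  # documented: returned when forgetting is detected


def diff_gate(baseline, candidate, run=subprocess.run):
    """Run `pyrecall diff` and translate its exit code into a status string."""
    proc = run(["pyrecall", "diff", baseline, candidate])
    if proc.returncode == 0:
        return "pass"          # assumed meaning of exit code 0
    if proc.returncode == FORGETTING_EXIT_CODE:
        return "forgetting"
    return "error"             # assumed: any other code is a tooling failure


def fake_run(cmd):
    # Stand-in for subprocess.run that pretends forgetting was detected.
    return subprocess.CompletedProcess(cmd, FORGETTING_EXIT_CODE)


status = diff_gate("before_v1", "after_v2", run=fake_run)
```

In a real pipeline the job would fail the build when the status is "forgetting" and surface "error" separately, so that a broken setup is not mistaken for a model regression.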

Expanded benchmark suite — 64 prompts, 8 categories

Three new skill categories added:

| Category | What it probes |
| --- | --- |
| `multilingual` | Translation, cross-lingual comprehension, language identification |
| `tool_use` | Function calls, structured JSON output, tool selection |
| `advanced_math` | Algebra, calculus, combinatorics, proof by induction |

Together with the existing five categories (reasoning, instruction_following, coding, general_knowledge, safety), the suite now runs 64 benchmark prompts per snapshot.

Upgrade

pip install --upgrade pyrecall

Full changelog

v0.5.1...v0.6.0
