28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,34 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [1.3.0] - 2026-03-21

### Added

- **Quality benchmark overhaul** — replaced three broken metrics (keywordRetention, factRetention, negationErrors) with five meaningful ones: task-based probes (~70 across 13 scenarios), information density, compressed-only quality score, negative compression detection, and summary coherence checks.
- **Task-based probes** — hand-curated per-scenario checks that verify whether specific critical information (identifiers, code patterns, config values) survives compression. Probe failures surface real quality issues.
- **LLM-as-judge scoring** (`--llm-judge` flag) — optional LLM evaluation of compression quality. Multi-provider support: OpenAI, Anthropic, Gemini (`@google/genai`), Ollama. Display-only, not used for regression testing.
- **Gemini provider** for LLM benchmarks via `GEMINI_API_KEY` env var (default model: `gemini-2.5-flash`).
- **Opt-in feature comparison** (`--features` flag) — runs quality benchmark with each opt-in feature enabled to measure their impact vs baseline.
- **Quality history documentation** (`docs/quality-history.md`) — version-over-version quality tracking across v1.0.0, v1.1.0, v1.2.0 with opt-in feature impact analysis.
- **Min-output-chars probes** to catch over-aggressive compression.
- **Code block language aliases** in benchmarks (typescript/ts, python/py, yaml/yml).
- New npm scripts: `bench:quality:judge`, `bench:quality:features`.
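The task-based probe idea above can be sketched in a few lines. This is a hypothetical shape — the interface and function names are illustrative, not the benchmark's actual API:

```typescript
// Hypothetical probe shape; names are illustrative, not the real API.
interface Probe {
  scenario: string; // which benchmark scenario the probe belongs to
  mustContain: string; // critical token that must survive compression
  minOutputChars?: number; // floor that catches over-aggressive compression
}

// A probe passes when the compressed text still contains the critical
// token and the output is not suspiciously short.
function runProbe(probe: Probe, compressed: string): boolean {
  if (probe.minOutputChars !== undefined && compressed.length < probe.minOutputChars) {
    return false;
  }
  return compressed.includes(probe.mustContain);
}
```

A probe for an environment-config scenario might assert, for example, that `GEMINI_API_KEY` survives compression and that the output keeps a minimum length.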

### Changed

- Coherence and negative compression regression thresholds now track increases from baseline, not just zero-to-nonzero transitions.
- Information density regression check only applies when compression actually occurs (ratio > 1.01).
- Quality benchmark table now shows: `Ratio EntRet CodeOK InfDen Probes Pass NegCp Coher CmpQ`.
- `analyzeQuality()` accepts optional `CompressOptions` for feature testing.
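The two regression-check changes above — baseline-relative tracking and the ratio gate — can be sketched as follows. The sample shape and function name are assumptions for illustration, not the benchmark's real API:

```typescript
// Hypothetical sample shape; field names are illustrative.
interface QualitySample {
  ratio: number; // originalChars / compressedChars
  infoDensity: number; // information density score
  coherenceFailures: number;
  negativeCompressions: number;
}

function findRegressions(baseline: QualitySample, current: QualitySample): string[] {
  const issues: string[] = [];
  // Density check only applies when compression actually occurred.
  if (current.ratio > 1.01 && current.infoDensity < baseline.infoDensity) {
    issues.push('information density dropped');
  }
  // Track any increase from baseline, not just zero-to-nonzero transitions.
  if (current.coherenceFailures > baseline.coherenceFailures) {
    issues.push('coherence failures increased');
  }
  if (current.negativeCompressions > baseline.negativeCompressions) {
    issues.push('negative compressions increased');
  }
  return issues;
}
```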

### Removed

- `keywordRetention` metric (tautological — 100% on 12/13 scenarios).
- `factRetention` and `factCount` metrics (fragile regex-based fact extractor).
- `negationErrors` metric (noisy, rarely triggered).
- `extractFacts()` and `analyzeSemanticFidelity()` functions.

## [1.2.0] - 2026-03-20

### Added
9 changes: 7 additions & 2 deletions CLAUDE.md
@@ -14,6 +14,11 @@ npm run format # Prettier write
npm run format:check # Prettier check
npm run bench # Run benchmark suite
npm run bench:save # Run, save baseline, regenerate docs/benchmark-results.md
npm run bench:quality # Run quality benchmark (probes, coherence, info density)
npm run bench:quality:save # Save quality baseline
npm run bench:quality:check # Compare against quality baseline
npm run bench:quality:judge # Run with LLM-as-judge (requires API key)
npm run bench:quality:features # Compare opt-in features vs baseline
```

Run a single test file:
@@ -65,7 +70,7 @@ main ← develop ← feature branches
- **TypeScript:** ES2020 target, NodeNext module resolution, strict mode, ESM-only
- **Unused params** must be prefixed with `_` (ESLint enforced)
- **Prettier:** 100 char width, 2-space indent, single quotes, trailing commas, semicolons
- **Tests:** Vitest 4, test files in `tests/`, coverage via `@vitest/coverage-v8` (Node 20+ only)
- **Node version:** ≥18 (.nvmrc: 22)
- **Tests:** Vitest 4, test files in `tests/`, coverage via `@vitest/coverage-v8`
- **Node version:** ≥20 (.nvmrc: 22)
- **Always run `npm run format` before committing** — CI enforces `format:check`
- **No author/co-author attribution** in commits, code, or docs
4 changes: 2 additions & 2 deletions README.md
@@ -32,11 +32,11 @@ const { messages: originals } = uncompress(compressed, verbatim);

No API keys. No network calls. Runs synchronously by default. Under 2ms for typical conversations.

The classifier is content-aware, not domain-specific. It preserves structured data (code, JSON, SQL, tables, citations, formulas) and compresses surrounding prose — optimized for LLM conversations and technical documentation.
The classifier is content-aware, not domain-specific. It preserves structured data (code, JSON, SQL, tables, citations, formulas) and compresses surrounding prose — making it useful anywhere dense reference material is mixed with natural language: LLM conversations, legal briefs, medical records, technical documentation, support logs.

## Key findings

The deterministic engine achieves **1.3-6.1x compression with zero latency and zero cost.** It scores sentences, packs a budget, strips filler — and in most scenarios, it compresses tighter than an LLM. LLM summarization is opt-in for cases where semantic understanding improves quality. See [Benchmarks](docs/benchmarks.md) for methodology and [Benchmark Results](docs/benchmark-results.md) for the latest numbers and version history.
The deterministic engine achieves **1.3-6.1x compression with zero latency and zero cost.** It scores sentences, packs a budget, strips filler — and in most scenarios, it compresses tighter than an LLM. LLM summarization is opt-in for cases where semantic understanding improves quality. See [Benchmarks](docs/benchmarks.md) for methodology, [Benchmark Results](docs/benchmark-results.md) for the latest numbers, and [Quality History](docs/quality-history.md) for version-over-version quality tracking.
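The score-pack-strip pipeline described here can be illustrated with a minimal sketch. This is the general technique only — the scoring heuristic and function names below are toy assumptions, not the library's implementation:

```typescript
// Toy heuristic, not the real classifier: reward digits/identifiers,
// penalize filler openers.
function scoreSentence(s: string): number {
  let score = s.match(/[0-9_`]/g)?.length ?? 0;
  if (/^(Basically|Honestly|As mentioned)/.test(s)) score -= 5;
  return score;
}

// Greedy budget packing: keep the highest-scoring sentences that fit,
// then restore original order so the output still reads coherently.
function compressProse(sentences: string[], budgetChars: number): string {
  const scored = sentences
    .map((s, i) => ({ s, i, score: scoreSentence(s) }))
    .sort((a, b) => b.score - a.score);
  const kept: { s: string; i: number }[] = [];
  let used = 0;
  for (const c of scored) {
    if (used + c.s.length <= budgetChars) {
      kept.push(c);
      used += c.s.length;
    }
  }
  return kept.sort((a, b) => a.i - b.i).map((c) => c.s).join(' ');
}
```

Greedy packing is what makes the engine deterministic and fast: no network call, no sampling, just a single scoring pass and a sort.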

## Features
