MTCMarkFranco · MTCMarkFranco · May 1, 2026 · May 1, 2026 · sourcery-ai · May 1, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,46 @@ it reaches `1.0.0`.
 
 ### Added
 
+- **19 new AC-aligned rules + domain-agnostic `text_features` extractor** ([#TBD]):
+  Closes the AC discrepancy backlog identified in
+  `docs/comparison/lambda-rag-vs-air-canada.md`. Adds 19 new rules to
+  `samples/contracts/ac-demo-ruleset.json` (now v2.0.0, 24 rules total)
+  covering payment terms, IP/work-for-hire, liability carve-outs,
+  insurance limits, security/cryptography, privacy obligations
+  (residency, breach window, consent, retention, explicit-laws), AI
+  addenda, subcontracting approval, service locations, and Quebec
+  governance.
+  - **`TextFeatureExtractor`** (new in `LambdaRag.Projection`): pure-regex,
+    domain-agnostic numeric extraction over English prose. Adds
+    `text_features.{day_counts, month_counts, year_counts, percent_values,
+    dollar_amounts}` arrays plus `_min`/`_max` scalars on every projected
+    section. Rule authors target numeric thresholds via lambdas like
+    `input1.text_features.day_count_max <= 45` — usable by **any** ruleset,
+    not just AC.
+  - Engine remains domain-agnostic: 11 new tests
+    (`TextFeatureExtractorTests` + `GenericTextFeaturesEvaluationTests`)
+    prove the extractor and evaluator work on synthetic non-AC corpora
+    (vendor bond, permit response, ESG recycled-content, oil-and-gas).
+  - Projector bumped to **v1.4.0**; topic-map (`contract.v1.json`) bumped
+    to **v1.1.0** (adds `tax`, `subcontracting`, `ai`, `service_locations`
+    topics).
+  - End-to-end vs the AC contract: **`pass=5 fail=21 gap=1`** — every Fail
+    is a genuine deterministic finding (NET 60 > 45, 2% > 1.5%, no Quebec
+    governance, etc.).
+
+### Fixed
+
+- `TextFeatureExtractor.DollarRx`: shorthand suffixes (`m|b|k`) no longer
+  match the leading letter of an unrelated trailing word (e.g. `$1,000,000
+  bond` was previously parsed as `$1,000,000 b` → 10¹⁵). Suffix now
+  requires a word boundary via negative lookahead `(?![A-Za-z])`.
+- `TextFeatureExtractor.DayCountRx`: now also matches hyphenated forms
+  (`120-day`, `30-day`) in addition to spaced forms (`120 days`).
+
+## [Unreleased — earlier]
+
+### Added
+
 - **AC-style comment format in markup output** ([#TBD]):
   Word comments now match the Air Canada contract-review UX so reviewers
   see the same visual feedback as in the agentic flow:

diff --git a/README.md b/README.md
@@ -186,6 +186,34 @@ If the ontology you need isn't in the table above, copy
 `my-industry.v1.json`, add your headings/aliases per topic, rebuild,
 and pass `--topic-map my-industry.v1` to the extractor.
 
+### Numeric thresholds with `text_features` (projector v1.4.0+)
+
+Every projected section now carries a `text_features` block with
+generic numeric facts extracted from the section's prose:
+
+| Field | What it captures | Example match |
+|-------|------------------|---------------|
+| `day_counts` / `day_count_min` / `day_count_max` | day quantities | `45 days`, `120-day cure`, `90 calendar days` |
+| `month_counts` / `_min` / `_max` | month quantities | `12 months`, `36-month term` |
+| `year_counts` / `_min` / `_max` | year quantities | `5 years`, `2-year warranty` |
+| `percent_values` / `percent_min` / `percent_max` | percentages | `1.5%`, `30 percent` |
+| `dollar_amounts` / `dollar_min` / `dollar_max` | dollar values | `$5,000,000`, `$1.5M`, `USD 10,000,000`, `CAD$ 2.5 million` |
+
+Rule lambdas reference these fields directly — no per-domain code:
+
+```json
+{
+  "predicate": "input1.topics.Contains(\"insurance\") && input1.text_features.dollar_amounts.Count > 0",
+  "lambda":    "input1.text_features.dollar_max >= 5000000"
+}
+```
+
+This is a *generic* extractor: it works on **any** domain (vendor
+bonds, ESG recycled-content thresholds, permit response windows,
+pipeline pressure-test durations…). The same rule shape is used for
+contracts, public-sector permitting, oil-and-gas, FSI policies, and
+governance frameworks.
+
 ## CLI cheat sheet
 
 ```

diff --git a/docs/comparison/lambda-rag-vs-air-canada.md b/docs/comparison/lambda-rag-vs-air-canada.md
@@ -257,3 +257,94 @@ Both engine-level defects flagged in §4 were fixed in this same PR (no separate
 Re-run vs the same AC sample contract: `pass=4 fail=1 gap=1` (was `pass=3 fail=0 gap=2`). §9 now correctly **Fails** on IP indemnity, §9 + §11 both **Pass** on the explicit liability cap, the spurious §9 liability Gap is gone, and the legitimate `AC-WAR-001` Gap (no warranty section — AC missed it) remains.
 
 The 22 `N`-class coverage gaps require authoring the AC-policy ruleset from the six PDFs (`out/ac-full/ac-policies-ruleset.json`) and are tracked separately. Section 5 of this document lists the 19 priority rules to author.
+
+
+## Phase F — 19 priority rules + `text_features` extractor (post-audit, this PR)
+
+Closes the §5 backlog identified above. Three changes, all merged together
+in this PR:
+
+1. **19 new rules** authored as data only in
+   `samples/contracts/ac-demo-ruleset.json` (now **v2.0.0**, 24 rules
+   total). New rule IDs: `AC-LIAB-CARVEOUTS`, `AC-TERM-CONV`,
+   `AC-PAY-NET45`, `AC-PAY-INT-MAX`, `AC-TAX-EXCL`,
+   `AC-IP-WORKFORHIRE`, `AC-INS-GCL-5M`, `AC-INS-CYBER-10M`,
+   `AC-CYBER-CRYPTO`, `AC-PRIV-RESIDENCY`, `AC-PRIV-72H-AC`,
+   `AC-PRIV-CONSENT`, `AC-PRIV-RETENTION`,
+   `AC-PRIV-EXPLICIT-LAWS`, `AC-AI-ADDENDUM`,
+   `AC-AI-PRIVACY-SUPP`, `AC-SUBK-APPROVAL`, `AC-SVC-LOCATION`,
+   `AC-LAW-QUEBEC`.
+
+2. **`TextFeatureExtractor`** (new in `LambdaRag.Projection`,
+   projector v1.4.0). A pure-regex, *domain-agnostic* numeric extractor
+   that runs over each section's prose and emits
+   `text_features.{day_counts, month_counts, year_counts,
+   percent_values, dollar_amounts}` arrays plus `_min`/`_max`
+   scalars. Rule authors target numeric thresholds via lambdas like
+   `input1.text_features.day_count_max <= 45` — usable by **any**
+   ruleset, not just AC. The keystone of the genericness story: the
+   engine never looks at AC-specific content; it just exposes
+   structured numeric facts that *any* downstream policy can compare
+   against.
+
+3. **Topic-map `contract.v1.json` → v1.1.0**: adds four generic
+   topics (`tax`, `subcontracting`, `ai`, `service_locations`)
+   so the new rules' `input1.topics.Contains(...)` predicates can
+   target the right sections without hard-coding string regexes in
+   lambdas.
+
+### Genericness guardrail
+
+The user's hard constraint was *"every time we tighten what we're doing
+we need to make sure it's generic enough to reuse with completely
+different rules, documents, and domains."* To prove the engine stays
+domain-agnostic this PR adds **11 new tests**:
+
+- `TextFeatureExtractorTests` (7) — regex behaviour proven on
+  oil-and-gas pipeline prose, ESG recycled-content, generic
+  payment terms, etc.
+- `GenericTextFeaturesEvaluationTests` (4) — full evaluator runs
+  with **synthetic non-AC rulesets** (`rs-vendor-x`,
+  `rs-municipal`, `rs-esg`) over synthetic non-AC sections,
+  proving the engine evaluates `text_features`-based predicates
+  generically.
+
+All four existing corpus verticals (`contract`, `oil-gas`,
+`permitting`, `gov-architecture`, `fsi`) continue to match
+their golden verdicts byte-for-byte after the projector v1.4.0 bump
+(only mechanical drift in the new `text_features` field; **zero
+verdict changes**).
+
+### End-to-end vs the AC sample contract
+
+| Run | Pass | Fail | Gap | Err |
+|-----|------|------|-----|-----|
+| Before this PR (5 rules) | 4 | 1 | 1 | 0 |
+| After this PR (24 rules) | **5** | **21** | **1** | **0** |
+
+Spot-checked Fails (all genuinely correct findings):
+
+- `AC-PAY-NET45` Fails on §4 — contract's `Net 60` violates
+  `day_count_max <= 45`.
+- `AC-PAY-INT-MAX` Fails on §4 — `2% per month` exceeds
+  `percent_max <= 1.5`.
+- `AC-LAW-QUEBEC` Fails on §12 — Governing law is Ontario, not
+  Quebec.
+- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
+  below `` / ``.
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
-  below `` / ``.
+- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
+  below `$5M` / `$10M`.
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
-  below `` / ``.
+- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
+  below `$5M` / `$10M`.
+
+### Two extractor bugs found and fixed
+
+While building the genericness tests, two regex defects in
+`TextFeatureExtractor` surfaced and were fixed in this same PR:
+
+- **`DollarRx` shorthand-suffix bug**: `(million|billion|m|b|k)?`
+  matched the leading `b` of an unrelated trailing word, so
+  `,000,000 bond` was parsed as `,000,000 b` → 10¹⁵. Fixed by
+  requiring a word boundary via `(?![A-Za-z])` lookahead.
+- **`DayCountRx` hyphen support**: `120-day cure window` (very
+  common in legal English) wasn't matching. Fixed by allowing
+  `[\s-]*` between the digit and the unit word.
+
+Both fixes leave AC end-to-end outcomes identical (pass=5 fail=21
+gap=1 unchanged).