Phase 3: 19 AC-aligned rules + generic text_features extractor #49
MTCMarkFranco merged 1 commit into main
* Adds 19 new rules to samples/contracts/ac-demo-ruleset.json (now v2.0.0,
24 rules total) covering payment terms, IP, liability carve-outs,
insurance limits, security/cryptography, privacy obligations, AI addenda,
subcontracting, service locations, and Quebec governance.
* New TextFeatureExtractor in LambdaRag.Projection: pure-regex,
domain-agnostic numeric extraction over prose. Adds text_features.{
day_counts, month_counts, year_counts, percent_values, dollar_amounts}
arrays + _min/_max scalars. Rule lambdas target numeric thresholds
generically (e.g. input1.text_features.day_count_max <= 45).
* Projector bumped to v1.4.0; topic-map contract.v1.json bumped to v1.1.0
(adds tax / subcontracting / ai / service_locations topics).
* Engine remains domain-agnostic. 11 new tests prove this:
- TextFeatureExtractorTests (7) — regex behaviour on synthetic non-AC
prose (oil-gas, ESG, payment terms, etc.).
- GenericTextFeaturesEvaluationTests (4) — full evaluator runs on
synthetic non-AC rulesets (vendor bond, permit response, ESG).
* Two regex defects fixed:
- DollarRx shorthand suffixes (m|b|k) no longer match the leading
letter of an unrelated trailing word ($1,000,000 bond was being parsed
as 10^15, with the b of "bond" read as billion). The suffix must now be
followed by a (?![A-Za-z]) negative lookahead.
- DayCountRx now matches hyphenated forms (120-day cure window) in
addition to spaced forms (120 days).
* Corpus goldens regenerated for two clean docs (oil-gas, permitting):
pure mechanical drift in text_features content; outcome counts
(passed/failed/gap) byte-identical pre/post.
* AC contract end-to-end: pass=5 fail=21 gap=1 (was 4/1/1 with 5 rules).
Every Fail spot-checked — all genuine deterministic findings.
Tests: 189/189 passing (174 unit + 15 idempotency).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reviewer's Guide
Implements a domain-agnostic TextFeatureExtractor wired into the contract projector, updates the topic map and AC demo ruleset to use the new numeric features and topics, fixes two money/day-count regex bugs, and adds tests and documentation so text_features remain generic while improving AC contract coverage from 5 to 24 rules with unchanged golden verdicts.
Sequence diagram for projecting a contract section with text_features and evaluating a numeric rule
sequenceDiagram
participant Client
participant Projector as DeterministicContractProjector
participant TopicMap as ContractTopicMap
participant TFE as TextFeatureExtractor
participant Store as ProjectedDocumentStore
participant Evaluator
Client->>Projector: ProjectAsync(parsedDocument)
Projector->>TopicMap: LoadDefaultTopicMap()
TopicMap-->>Projector: topics, aliases
loop For each section
Projector->>Projector: extract bodyText
Projector->>TFE: Extract(bodyText)
TFE-->>Projector: JsonObject text_features
Projector->>Projector: build section JSON
Projector-->>Store: add ProjectedSection(text_features, topics, text)
end
Projector-->>Client: ProjectedDocument
Client->>Evaluator: Evaluate(ProjectedDocument, ruleset)
loop For each rule
Evaluator->>Evaluator: bind inputs (sections)
Evaluator->>Evaluator: execute lambda
Evaluator->>Evaluator: access input1.text_features.day_count_max
Evaluator-->>Client: rule verdict (Pass/Fail/Gap)
end
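The evaluation half of the diagram can be sketched in a few lines. This is a Python illustration rather than the project's C# engine: the rule and section shapes, the `payment` topic name, the `DEMO-NET45` id, and the per-section verdict list are all assumptions made for the example. Only the `text_features` keys and the Pass/Fail/Gap vocabulary come from the PR.

```python
# Hypothetical rule and section shapes; only the text_features keys come from the PR.
def evaluate(rule, sections):
    """Return one verdict per bound section, or a single Gap when nothing binds."""
    bound = [s for s in sections if rule["topic"] in s["topics"]]
    if not bound:
        return ["Gap"]
    verdicts = []
    for section in bound:
        try:
            verdicts.append("Pass" if rule["predicate"](section) else "Fail")
        except KeyError:
            # Assumption: a missing scalar (no numbers found) is reported as a Gap.
            verdicts.append("Gap")
    return verdicts


net45 = {
    "id": "DEMO-NET45",  # hypothetical rule id
    "topic": "payment",  # hypothetical topic name
    "predicate": lambda s: s["text_features"]["day_count_max"] <= 45,
}

sections = [
    {"topics": ["payment"], "text_features": {"day_counts": [60], "day_count_min": 60, "day_count_max": 60}},
    {"topics": ["payment"], "text_features": {"day_counts": [30], "day_count_min": 30, "day_count_max": 30}},
    {"topics": ["payment"], "text_features": {"day_counts": []}},
    {"topics": ["tax"], "text_features": {"day_counts": []}},
]

print(evaluate(net45, sections))  # ['Fail', 'Pass', 'Gap']
```

The predicate only reads a numeric scalar, so the same `evaluate` function would work for a permit-response or vendor-bond rule without change.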
Class diagram for TextFeatureExtractor and DeterministicContractProjector changes
classDiagram
class DeterministicContractProjector {
+string Id
+string Version
+string Domain
+JsonObject Schema
+Task~ProjectedDocument~ ProjectAsync(ParsedDocument parsed, CancellationToken cancellationToken)
-static TopicMap LoadDefaultTopicMap()
}
class TextFeatureExtractor {
<<static>>
-Regex DayCountRx
-Regex MonthCountRx
-Regex YearCountRx
-Regex PercentRx
-Regex DollarRx
-Regex DollarSpelledRx
+JsonObject Extract(string text)
-List~long~ SortedLongs(Regex rx, string text)
-List~double~ SortedDoubles(Regex rx, string text, int parseGroup)
-List~long~ ExtractDollars(string text)
-bool TryParseDollar(string mantissa, string suffix, out long amount)
-JsonArray ToArray(IEnumerable~long~ values)
-JsonArray ToArray(IEnumerable~double~ values)
}
class ProjectedSection {
+JsonArray day_counts
+JsonArray month_counts
+JsonArray year_counts
+JsonArray percent_values
+JsonArray dollar_amounts
+long day_count_min
+long day_count_max
+long month_count_min
+long month_count_max
+long year_count_min
+long year_count_max
+double percent_min
+double percent_max
+long dollar_min
+long dollar_max
+JsonObject topic_scores
+double topic_density
+bool is_operative_for_topic
+string inherited_from
+string text
}
DeterministicContractProjector --> TextFeatureExtractor : uses
DeterministicContractProjector --> ProjectedSection : builds
TextFeatureExtractor --> ProjectedSection : populates text_features
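To make the array-plus-scalar naming in the diagram concrete, here is a minimal Python sketch of the output shape for the three count features. The key names (`day_counts`, `day_count_min`, `day_count_max`, and so on) follow the diagram. The regex patterns are simplified stand-ins, and omitting the scalars when nothing is found is an assumption, since the PR does not say how empty arrays are handled.

```python
import re

# (array key, scalar prefix, pattern). Key names follow the class diagram;
# the patterns are simplified assumptions, not the project's regexes.
_FEATURES = [
    ("day_counts", "day_count", re.compile(r"\b(\d+)(?:\s+|-)days?\b", re.I)),
    ("month_counts", "month_count", re.compile(r"\b(\d+)(?:\s+|-)months?\b", re.I)),
    ("year_counts", "year_count", re.compile(r"\b(\d+)(?:\s+|-)years?\b", re.I)),
]


def extract(text):
    """Build a text_features-style dict: sorted arrays plus _min/_max scalars."""
    features = {}
    for array_key, scalar_prefix, rx in _FEATURES:
        values = sorted(int(m.group(1)) for m in rx.finditer(text))
        features[array_key] = values
        if values:  # assumption: scalars are omitted when the array is empty
            features[f"{scalar_prefix}_min"] = values[0]
            features[f"{scalar_prefix}_max"] = values[-1]
    return features


sample = extract("Pay within 60 days; cure period is a 120-day window over 2 years.")
print(sample["day_counts"], sample["day_count_max"])  # [60, 120] 120
```

The scalar prefix is not simply the array key plus `_min`: the diagram pairs `percent_values` with `percent_min` and `dollar_amounts` with `dollar_min`, which is why the sketch carries an explicit prefix per feature.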
Hey - I've found 3 issues, and left some high level feedback:
- The `PercentRx` regex currently only matches values with a `%` sign, but the README and rule examples mention handling phrases like `30 percent`; either broaden the regex to cover the `percent`/`per cent` wording or adjust the documentation/examples to match the actual behavior.
- The `DollarSpelledRx` comment mentions handling fully spelled-out amounts like `five million dollars`, but the implementation only matches numeric forms followed by `million|billion`; align the comment with the actual pattern or extend the regex to support spelled-out numbers if that’s intended.
- In `TextFeatureExtractor` XML docs you reference `dollar_amounts_max`, but the emitted scalar is `dollar_max`; consider tightening these doc/property name references to avoid confusion for rule authors.
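For the first point, one possible broadening is sketched below in Python. The "current" pattern is an assumption reconstructed from the review text (a number followed by `%`), since the actual `PercentRx` source is not shown here. The broadened pattern is only one way to accept `percent` and `per cent`.

```python
import re

# Assumed shape of the current behaviour: a number followed by a literal % sign.
PERCENT_RX_CURRENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")

# Hypothetical broadened form: also accepts "percent" and "per cent".
# The trailing \b keeps "percentage" and "percentile" from matching.
PERCENT_RX_BROAD = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|per\s?cent\b)", re.IGNORECASE)


def percent_values(rx, text):
    """Return every matched percentage as a float, sorted ascending."""
    return sorted(float(m.group(1)) for m in rx.finditer(text))


text = "interest of 1.5 per cent per month, capped at 30 percent or 2%"
print(percent_values(PERCENT_RX_CURRENT, text))  # [2.0]
print(percent_values(PERCENT_RX_BROAD, text))    # [1.5, 2.0, 30.0]
```

If the regex is left as is, the alternative in the comment applies: the README and rule examples should use the `%` form only.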
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `PercentRx` regex currently only matches values with a `%` sign, but the README and rule examples mention handling phrases like `30 percent`; either broaden the regex to cover the `percent`/`per cent` wording or adjust the documentation/examples to match the actual behavior.
- The `DollarSpelledRx` comment mentions handling fully spelled-out amounts like `five million dollars`, but the implementation only matches numeric forms followed by `million|billion`; align the comment with the actual pattern or extend the regex to support spelled-out numbers if that’s intended.
- In `TextFeatureExtractor` XML docs you reference `dollar_amounts_max`, but the emitted scalar is `dollar_max`; consider tightening these doc/property name references to avoid confusion for rule authors.
## Individual Comments
### Comment 1
<location path="src/LambdaRag.Projection/Projectors/TextFeatureExtractor.cs" line_range="50-51" />
<code_context>
+ @"(?:\$|USD\s*\$?|CAD\s*\$?|US\$|CAD\$)\s*(\d{1,3}(?:[,\s]\d{3})*(?:\.\d+)?)\s*(million|billion|[mbk])?(?![A-Za-z])",
+ RegexOptions.IgnoreCase | RegexOptions.Compiled);
+
+ // Spelled-out dollar amounts: "five million dollars" / "$5 million"
+ private static readonly Regex DollarSpelledRx = new(
+ @"(\d{1,3}(?:[,\s]\d{3})*(?:\.\d+)?)\s*(million|billion)\s*(?:dollars|USD|CAD)",
+ RegexOptions.IgnoreCase | RegexOptions.Compiled);
</code_context>
<issue_to_address>
**issue:** Comment and regex for spelled-out dollar amounts are inconsistent and miss cases like "five million dollars".
`DollarSpelledRx` only accepts numeric mantissas (`\d{1,3}...`), so it doesn’t handle fully spelled-out amounts like "five million dollars"—only forms like "$5 million", which are already covered by `DollarRx`. Either expand this regex to support word-based amounts (e.g., "one", "five", "twenty") or update the comment to describe its actual numeric-only behavior to avoid misleading future rule authors.
</issue_to_address>
### Comment 2
<location path="docs/comparison/lambda-rag-vs-air-canada.md" line_range="333-334" />
<code_context>
+ `percent_max <= 1.5`.
+- `AC-LAW-QUEBEC` Fails on §12 — Governing law is Ontario, not
+ Quebec.
+- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
+ below `` / ``.
+
+### Two extractor bugs found and fixed
</code_context>
<issue_to_address>
**issue (typo):** The example for insurance limits has missing threshold values (empty backticks).
In the bullet for `AC-INS-GCL-5M` / `AC-INS-CYBER-10M`, the text `below `` / ``.` looks like an unfilled placeholder for the actual threshold amounts. Please replace these with the intended limits (e.g. `$5M / $10M`) so the rule is clear.
```suggestion
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
+ below `$5M` / `$10M`.
```
</issue_to_address>
### Comment 3
<location path="docs/comparison/lambda-rag-vs-air-canada.md" line_range="271" />
<code_context>
+ `AC-LIAB-CARVEOUTS`, `AC-TERM-CONV`,
</code_context>
<issue_to_address>
**question (typo):** Check whether the rule ID `AC-IP-WORKFORHIRE` is intentionally spelled without a hyphen.
The identifier differs from the natural-language phrase (“work-for-hire”) and the changelog (`IP/work-for-hire`). Please confirm whether `WORKFORHIRE` is the intended canonical form, or if this should be adjusted (e.g., `AC-IP-WORK-FOR-HIRE`) for consistency with your naming conventions.
</issue_to_address>
// Spelled-out dollar amounts: "five million dollars" / "$5 million"
private static readonly Regex DollarSpelledRx = new(
**issue:** Comment and regex for spelled-out dollar amounts are inconsistent and miss cases like "five million dollars".
`DollarSpelledRx` only accepts numeric mantissas (`\d{1,3}...`), so it doesn’t handle fully spelled-out amounts like "five million dollars", only forms like "$5 million", which are already covered by `DollarRx`. Either expand this regex to support word-based amounts (e.g., "one", "five", "twenty") or update the comment to describe its actual numeric-only behavior to avoid misleading future rule authors.
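If the first option is taken, a word-based mantissa could look roughly like the Python sketch below. The word table is deliberately tiny and hypothetical. The numeric alternative and the `million|billion` and `dollars|USD|CAD` parts are copied from the pattern quoted in the review. Compound forms such as "twenty five" are out of scope for this sketch.

```python
import re

# Small, hypothetical word table; a real one would need far more coverage.
_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "twenty": 20, "fifty": 50, "hundred": 100,
}
_SCALE = {"million": 10**6, "billion": 10**9}

# Longest words first so a short word never shadows a longer one.
_WORD_ALT = "|".join(sorted(_WORDS, key=len, reverse=True))

SPELLED_RX = re.compile(
    r"\b(\d{1,3}(?:[,\s]\d{3})*(?:\.\d+)?|" + _WORD_ALT + r")"
    r"\s*(million|billion)\s*(?:dollars|USD|CAD)",
    re.IGNORECASE,
)


def spelled_dollar_amounts(text):
    """Return amounts for '<number or word> million|billion dollars|USD|CAD'."""
    amounts = []
    for m in SPELLED_RX.finditer(text):
        raw = m.group(1).lower()
        mantissa = _WORDS[raw] if raw in _WORDS else float(re.sub(r"[,\s]", "", raw))
        amounts.append(int(mantissa * _SCALE[m.group(2).lower()]))
    return sorted(amounts)


print(spelled_dollar_amounts("a cap of five million dollars and 2.5 billion USD"))
# [5000000, 2500000000]
```

The leading `\b` matters here: without it, the `ten` inside "often" would be accepted as a mantissa.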
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
  below `` / ``.
**issue (typo):** The example for insurance limits has missing threshold values (empty backticks).
In the bullet for `AC-INS-GCL-5M` / `AC-INS-CYBER-10M`, the text `below `` / ``.` looks like an unfilled placeholder for the actual threshold amounts. Please replace these with the intended limits (e.g. `$5M / $10M`) so the rule is clear.
Suggested change:
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
  below `$5M` / `$10M`.
`samples/contracts/ac-demo-ruleset.json` (now **v2.0.0**, 24 rules
total). New rule IDs: `AC-LIAB-CARVEOUTS`, `AC-TERM-CONV`,
`AC-PAY-NET45`, `AC-PAY-INT-MAX`, `AC-TAX-EXCL`,
`AC-IP-WORKFORHIRE`, `AC-INS-GCL-5M`, `AC-INS-CYBER-10M`,
**question (typo):** Check whether the rule ID `AC-IP-WORKFORHIRE` is intentionally spelled without a hyphen.
The identifier differs from the natural-language phrase (“work-for-hire”) and the changelog (`IP/work-for-hire`). Please confirm whether `WORKFORHIRE` is the intended canonical form, or if this should be adjusted (e.g., `AC-IP-WORK-FOR-HIRE`) for consistency with your naming conventions.
Closes the §5 backlog from `docs/comparison/lambda-rag-vs-air-canada.md` (19 priority rules) and adds a domain-agnostic numeric-feature extractor that any future ruleset can use.
What's in this PR
1. 19 new rules in `samples/contracts/ac-demo-ruleset.json` (now v2.0.0, 24 rules total)
Payment terms (NET-45, ≤1.5%/month interest), tax-exclusivity, IP/work-for-hire, liability carve-outs, insurance limits ($5M GCL, $10M cyber), security/cryptography, privacy (residency, 72h breach, consent, retention, explicit-laws), AI addenda, subcontracting approval, service-location, Quebec governance.
2. `TextFeatureExtractor` (new, projector v1.4.0)
Pure-regex, domain-agnostic numeric extraction. Every section now exposes:
- `text_features.day_counts` / `day_count_min` / `day_count_max`
- `text_features.month_counts` / `year_counts` / `percent_values` / `dollar_amounts` (each with `_min` / `_max` scalars, e.g. `percent_max`, `dollar_max`)
Rule lambdas reference these directly: `input1.text_features.day_count_max <= 45`. Usable by any ruleset: vendor bonds, permit response windows, ESG thresholds, oil-and-gas pressure tests, etc. Engine code never mentions AC or contracts.
3. Topic-map `contract.v1.json` → v1.1.0
Adds `tax`, `subcontracting`, `ai`, `service_locations` topics so new rule predicates can target sections via `input1.topics.Contains(...)` without regex hacks in lambdas.
Genericness guardrail
Hard requirement: the engine must remain reusable across rulesets, documents, and domains. Proved by 11 new tests:
- `TextFeatureExtractorTests` (7): regex on synthetic non-AC prose.
- `GenericTextFeaturesEvaluationTests` (4): full evaluator runs with synthetic non-AC rulesets over synthetic non-AC sections (vendor bond, permit response, ESG recycled-content).
All existing corpus verticals (contract / oil-gas / permitting / gov-architecture / fsi) match goldens byte-for-byte after the projector v1.4.0 bump (pure mechanical drift in the new `text_features` field; zero verdict changes).
Two extractor bugs found and fixed
- `DollarRx` shorthand suffixes (`m|b|k`) were matching the leading letter of unrelated trailing words: `$1,000,000 bond` parsed as 10¹⁵. Fixed with a `(?![A-Za-z])` negative lookahead.
- `DayCountRx` now also handles hyphenated `120-day` (very common in legal English) in addition to spaced `120 days`.
Both fixes leave AC end-to-end outcomes identical.
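Both fixes can be reproduced in a short Python sketch. The "new" dollar pattern is ported from the `DollarRx` quoted in the review. The "old" pattern is assumed to be the same expression without the lookahead, and both day-count patterns are hypothetical stand-ins because `DayCountRx` is not shown. The multipliers for `k` and `m` are assumptions; `b` as 10⁹ is what the 10¹⁵ result implies.

```python
import re

_PREFIX = r"(?:\$|USD\s*\$?|CAD\s*\$?|US\$|CAD\$)"
_MANTISSA = r"(\d{1,3}(?:[,\s]\d{3})*(?:\.\d+)?)"
_SUFFIX = r"(million|billion|[mbk])?"
_BODY = _PREFIX + r"\s*" + _MANTISSA + r"\s*" + _SUFFIX

# Assumption: the pre-fix pattern was the same expression without the lookahead.
DOLLAR_RX_OLD = re.compile(_BODY, re.IGNORECASE)
DOLLAR_RX_NEW = re.compile(_BODY + r"(?![A-Za-z])", re.IGNORECASE)

_MULTIPLIER = {"k": 10**3, "m": 10**6, "million": 10**6, "b": 10**9, "billion": 10**9}


def dollar_amounts(rx, text):
    """Parse every dollar match into a whole number, sorted ascending."""
    amounts = []
    for m in rx.finditer(text):
        mantissa = float(re.sub(r"[,\s]", "", m.group(1)))
        suffix = (m.group(2) or "").lower()
        amounts.append(int(mantissa * _MULTIPLIER.get(suffix, 1)))
    return sorted(amounts)


# Hypothetical day-count patterns: spaced only vs. spaced or hyphenated.
DAY_RX_OLD = re.compile(r"\b(\d+)\s+days?\b", re.IGNORECASE)
DAY_RX_NEW = re.compile(r"\b(\d+)(?:\s+|-)days?\b", re.IGNORECASE)


def day_counts(rx, text):
    return sorted(int(m.group(1)) for m in rx.finditer(text))


print(dollar_amounts(DOLLAR_RX_OLD, "a $1,000,000 bond"))  # [1000000000000000]
print(dollar_amounts(DOLLAR_RX_NEW, "a $1,000,000 bond"))  # [1000000]
print(day_counts(DAY_RX_OLD, "a 120-day cure window"))     # []
print(day_counts(DAY_RX_NEW, "a 120-day cure window"))     # [120]
```

With the lookahead in place the engine backtracks past the `b` of "bond" and the space before it, so the match ends at `$1,000,000` with no suffix.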
End-to-end vs the AC sample contract
Every Fail spot-checked — all genuine deterministic findings (NET 60 > 45, 2% > 1.5%, no Quebec, insurance limits short, etc.).
Tests
189/189 passing (174 unit + 15 idempotency). Was 173 + 15 = 188 → +11 new + 2 corpus-golden regen on mechanical drift.
Out of scope (per user)
Summary by Sourcery
Add a domain-agnostic numeric text feature extractor to the contract projector, expand the AC demo ruleset with additional policy-aligned rules, and update documentation and tests to validate generic reuse across domains.