Phase 3: 19 AC-aligned rules + generic text_features extractor #49

Merged
MTCMarkFranco merged 1 commit into main from
phase3-ac-rules-19
May 1, 2026

@MTCMarkFranco MTCMarkFranco commented May 1, 2026

Closes the §5 backlog from docs/comparison/lambda-rag-vs-air-canada.md (19 priority rules) and adds a domain-agnostic numeric-feature extractor that any future ruleset can use.

What's in this PR

1. 19 new rules in samples/contracts/ac-demo-ruleset.json (now v2.0.0, 24 rules total)

Payment terms (NET-45, ≤1.5%/month interest), tax-exclusivity, IP/work-for-hire, liability carve-outs, insurance limits ($5M GCL, $10M cyber), security/cryptography, privacy (residency, 72h breach, consent, retention, explicit-laws), AI addenda, subcontracting approval, service-location, Quebec governance.

2. TextFeatureExtractor (new, projector v1.4.0)

Pure-regex, domain-agnostic numeric extraction. Every section now exposes:

  • text_features.day_counts / day_count_min / day_count_max
  • text_features.month_counts / year_counts / percent_values / dollar_amounts (each with _min/_max)

Rule lambdas reference these directly: input1.text_features.day_count_max <= 45. Usable by any ruleset — vendor bonds, permit response windows, ESG thresholds, oil-and-gas pressure tests, etc. Engine code never mentions AC or contracts.
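
To make the shape of text_features concrete, here is a minimal Python sketch of the idea. It is not the C# TextFeatureExtractor from this PR: the two patterns below are stand-ins (the real DayCountRx and PercentRx are not shown here), and only day counts and percentages are covered. It follows the behaviour described in this PR, where arrays are always present and the _min/_max scalars appear only when values exist.

```python
import re

# Stand-in patterns (assumptions): the PR's actual DayCountRx / PercentRx may differ.
DAY_RX = re.compile(r"\b(\d{1,4})[\s-]*days?\b", re.IGNORECASE)  # "30 days", "120-day"
PERCENT_RX = re.compile(r"(\d+(?:\.\d+)?)\s*%")                   # "2%", "1.5 %"


def extract_text_features(text: str) -> dict:
    """Return sorted numeric arrays plus min/max scalars for one section."""
    days = sorted(int(m.group(1)) for m in DAY_RX.finditer(text))
    percents = sorted(float(m.group(1)) for m in PERCENT_RX.finditer(text))

    # Arrays are always emitted, even when empty.
    features = {"day_counts": days, "percent_values": percents}

    # Scalars are emitted only when at least one value was found.
    if days:
        features["day_count_min"] = days[0]
        features["day_count_max"] = days[-1]
    if percents:
        features["percent_min"] = percents[0]
        features["percent_max"] = percents[-1]
    return features


section = (
    "The operator shall respond within 30 days; a 120-day cure window applies. "
    "Late fees accrue at 2% per month."
)
features = extract_text_features(section)

# A NET-45 style threshold, mirroring input1.text_features.day_count_max <= 45.
within_45_days = features["day_count_max"] <= 45
```

Because the scalars are absent for sections without numbers, a rule that reads day_count_max would typically guard on the array being non-empty first.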

3. Topic-map contract.v1.json → v1.1.0

Adds tax, subcontracting, ai, service_locations topics so new rule predicates can target sections via input1.topics.Contains(...) without regex hacks in lambdas.

Genericness guardrail

Hard requirement: the engine must remain reusable across rulesets, documents, and domains. Proved by 11 new tests:

  • TextFeatureExtractorTests (7) — regex on synthetic non-AC prose.
  • GenericTextFeaturesEvaluationTests (4) — full evaluator runs with synthetic non-AC rulesets over synthetic non-AC sections (vendor bond, permit response, ESG recycled-content).

All existing corpus verticals (contract / oil-gas / permitting / gov-architecture / fsi) keep identical verdicts after the projector v1.4.0 bump (the only golden diff is pure mechanical drift in the new text_features field; zero verdict changes).

Two extractor bugs found and fixed

  • DollarRx shorthand suffixes (m|b|k) were matching the leading letter of unrelated trailing words: $1,000,000 bond parsed as 10¹⁵. Fixed with (?![A-Za-z]) lookahead.
  • DayCountRx now also handles hyphenated 120-day (very common in legal English) in addition to spaced 120 days.

Both fixes leave AC end-to-end outcomes identical.
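
The suffix bug is easy to reproduce outside the engine. The sketch below ports the DollarRx pattern quoted later in this PR's review to Python's re module and compares it with the same pattern minus the (?![A-Za-z]) lookahead. The helper name dollars and the SCALE table are illustrative assumptions, not the project's TryParseDollar.

```python
import re

PREFIX = r"(?:\$|USD\s*\$?|CAD\s*\$?|US\$|CAD\$)\s*"
MANTISSA = r"(\d{1,3}(?:[,\s]\d{3})*(?:\.\d+)?)"
SUFFIX = r"\s*(million|billion|[mbk])?"

# Without the lookahead, the "b" of "bond" is consumed as a billion suffix.
BUGGY_RX = re.compile(PREFIX + MANTISSA + SUFFIX, re.IGNORECASE)
# With the lookahead, a suffix is rejected when another letter follows it.
FIXED_RX = re.compile(PREFIX + MANTISSA + SUFFIX + r"(?![A-Za-z])", re.IGNORECASE)

SCALE = {"k": 10**3, "m": 10**6, "million": 10**6, "b": 10**9, "billion": 10**9}


def dollars(rx, text):
    """Return the sorted whole-dollar amounts matched by rx in text."""
    amounts = []
    for m in rx.finditer(text):
        mantissa = re.sub(r"[,\s]", "", m.group(1))   # drop thousands separators
        suffix = (m.group(2) or "").lower()
        amounts.append(round(float(mantissa) * SCALE.get(suffix, 1)))
    return sorted(amounts)


sample = "Vendor shall post a $1,000,000 bond."
buggy = dollars(BUGGY_RX, sample)   # suffix "b" taken from "bond"
fixed = dollars(FIXED_RX, sample)   # suffix rejected, plain amount kept
```

With the lookahead in place the regex backtracks to a match with no suffix, so genuine shorthand such as $5M or $10 million still scales while $1,000,000 bond stays at one million.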

End-to-end vs the AC sample contract

| Run | Pass | Fail | Gap | Err |
|---|---|---|---|---|
| Before (5 rules) | 4 | 1 | 1 | 0 |
| After (24 rules) | 5 | 21 | 1 | 0 |

Every Fail spot-checked — all genuine deterministic findings (NET 60 > 45, 2% > 1.5%, no Quebec, insurance limits short, etc.).
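
As an illustration of how findings like NET 60 > 45 and 2% > 1.5% fall out of text_features, here is a small Python sketch. Everything in it is an assumption made for the example: the section values are invented, "payment" is a hypothetical topic name, the mapping of rule IDs to checks is inferred from their names, and treating "no matching section" as a Gap is a guess at the evaluator's semantics rather than a description of it.

```python
# Hypothetical projected section: field names follow the PR, values are made up.
payment_section = {
    "topics": ["payment"],  # assumed topic name
    "text_features": {
        "day_counts": [60], "day_count_min": 60, "day_count_max": 60,
        "percent_values": [2.0], "percent_min": 2.0, "percent_max": 2.0,
    },
}

# Threshold checks in the spirit of the rule lambdas (IDs from the PR, bodies assumed).
rules = {
    "AC-PAY-NET45": lambda s: s["text_features"]["day_count_max"] <= 45,
    "AC-PAY-INT-MAX": lambda s: s["text_features"]["percent_max"] <= 1.5,
}


def evaluate(check, sections, topic):
    """Apply check to every section tagged with topic and return a verdict."""
    targets = [s for s in sections if topic in s["topics"]]
    if not targets:
        return "Gap"  # assumed meaning: nothing in the document addresses the rule
    return "Pass" if all(check(s) for s in targets) else "Fail"


verdicts = {
    rule_id: evaluate(check, [payment_section], "payment")
    for rule_id, check in rules.items()
}
```

Both checks fail for this section because 60 exceeds 45 and 2.0 exceeds 1.5, which is the deterministic kind of finding the spot-check refers to.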

Tests

189/189 passing (174 unit + 15 idempotency). Was 173 + 15 = 188 → +11 new + 2 corpus-golden regen on mechanical drift.

Out of scope (per user)

  • French translations of new rule strings — user is doing these later today.

Summary by Sourcery

Add a domain-agnostic numeric text feature extractor to the contract projector, expand the AC demo ruleset with additional policy-aligned rules, and update documentation and tests to validate generic reuse across domains.

New Features:

  • Introduce a TextFeatureExtractor that enriches projected sections with generic numeric text_features for days, months, years, percentages, and dollar amounts, including min/max aggregates.
  • Extend the AC demo contract ruleset to version 2.0.0 with 19 additional rules covering payment terms, tax, IP/work-for-hire, liability carve-outs, insurance, security, privacy, AI, subcontracting, service locations, and Quebec governance.
  • Expand the contract topic map to version 1.1.0 with new generic topics for tax, subcontracting, AI, and service locations.
  • Document how to author numeric-threshold rules using text_features in the README.

Bug Fixes:

  • Correct dollar-amount regex handling so shorthand suffixes (m/b/k) do not overmatch trailing words and mis-scale values.
  • Extend day-count regex handling to recognize hyphenated forms like 120-day in addition to spaced variants.

Enhancements:

  • Bump the deterministic contract projector to version 1.4.0 and include text_features in projected section schemas and outputs.
  • Describe the new AC-aligned rules and generic text_features extractor in the comparison doc and changelog.
  • Ensure existing corpus goldens across multiple domains remain stable by regenerating expected verdict payloads with the new text_features field.

Documentation:

  • Add comparison-doc section summarizing the new AC rules, text_features extractor, genericness guardrails, and updated AC end-to-end results.
  • Update the README with guidance and examples for using text_features to express numeric thresholds in rules.

Tests:

  • Add unit tests for TextFeatureExtractor to validate deterministic, domain-agnostic numeric extraction across varied prose and formats.
  • Add evaluation tests that run synthetic non-contract rulesets over synthetic documents to prove text_features-based predicates work generically without AC- or contract-specific coupling.
  • Regenerate golden verdicts for existing corpora to reflect the additional text_features field while preserving verdict outcomes.

…xtractor

* Adds 19 new rules to samples/contracts/ac-demo-ruleset.json (now v2.0.0,
  24 rules total) covering payment terms, IP, liability carve-outs,
  insurance limits, security/cryptography, privacy obligations, AI addenda,
  subcontracting, service locations, and Quebec governance.

* New TextFeatureExtractor in LambdaRag.Projection: pure-regex,
  domain-agnostic numeric extraction over prose. Adds text_features.{
  day_counts, month_counts, year_counts, percent_values, dollar_amounts}
  arrays + _min/_max scalars. Rule lambdas target numeric thresholds
  generically (e.g. input1.text_features.day_count_max <= 45).

* Projector bumped to v1.4.0; topic-map contract.v1.json bumped to v1.1.0
  (adds tax / subcontracting / ai / service_locations topics).

* Engine remains domain-agnostic. 11 new tests prove this:
  - TextFeatureExtractorTests (7) — regex behaviour on synthetic non-AC
    prose (oil-gas, ESG, payment terms, etc.).
  - GenericTextFeaturesEvaluationTests (4) — full evaluator runs on
    synthetic non-AC rulesets (vendor bond, permit response, ESG).

* Two regex defects fixed:
  - DollarRx shorthand suffixes (m|b|k) no longer match the leading
    letter of unrelated trailing words ($1,000,000 bond was being parsed
    as 10^15). Suffix now requires (?![A-Za-z]) word-boundary lookahead.
  - DayCountRx now matches hyphenated forms (120-day cure window) in
    addition to spaced forms (120 days).

* Corpus goldens regenerated for two clean docs (oil-gas, permitting):
  pure mechanical drift in text_features content; outcome counts
  (passed/failed/gap) byte-identical pre/post.

* AC contract end-to-end: pass=5 fail=21 gap=1 (was 4/1/1 with 5 rules).
  Every Fail spot-checked — all genuine deterministic findings.

Tests: 189/189 passing (174 unit + 15 idempotency).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

sourcery-ai Bot commented May 1, 2026

Reviewer's Guide

Implements a domain-agnostic TextFeatureExtractor wired into the contract projector, updates the topic map and AC demo ruleset to use the new numeric features and topics, fixes two money/day-count regex bugs, and adds tests and documentation so text_features remain generic while improving AC contract coverage from 5 to 24 rules with unchanged golden verdicts.

Sequence diagram for projecting a contract section with text_features and evaluating a numeric rule

sequenceDiagram
    participant Client
    participant Projector as DeterministicContractProjector
    participant TopicMap as ContractTopicMap
    participant TFE as TextFeatureExtractor
    participant Store as ProjectedDocumentStore
    participant Evaluator

    Client->>Projector: ProjectAsync(parsedDocument)
    Projector->>TopicMap: LoadDefaultTopicMap()
    TopicMap-->>Projector: topics, aliases
    loop For each section
        Projector->>Projector: extract bodyText
        Projector->>TFE: Extract(bodyText)
        TFE-->>Projector: JsonObject text_features
        Projector->>Projector: build section JSON
        Projector-->>Store: add ProjectedSection(text_features, topics, text)
    end
    Projector-->>Client: ProjectedDocument

    Client->>Evaluator: Evaluate(ProjectedDocument, ruleset)
    loop For each rule
        Evaluator->>Evaluator: bind inputs (sections)
        Evaluator->>Evaluator: execute lambda
        Evaluator->>Evaluator: access input1.text_features.day_count_max
        Evaluator-->>Client: rule verdict (Pass/Fail/Gap)
    end

Class diagram for TextFeatureExtractor and DeterministicContractProjector changes

classDiagram
    class DeterministicContractProjector {
        +string Id
        +string Version
        +string Domain
        +JsonObject Schema
        +Task~ProjectedDocument~ ProjectAsync(ParsedDocument parsed, CancellationToken cancellationToken)
        -static TopicMap LoadDefaultTopicMap()
    }

    class TextFeatureExtractor {
        <<static>>
        -Regex DayCountRx
        -Regex MonthCountRx
        -Regex YearCountRx
        -Regex PercentRx
        -Regex DollarRx
        -Regex DollarSpelledRx
        +JsonObject Extract(string text)
        -List~long~ SortedLongs(Regex rx, string text)
        -List~double~ SortedDoubles(Regex rx, string text, int parseGroup)
        -List~long~ ExtractDollars(string text)
        -bool TryParseDollar(string mantissa, string suffix, out long amount)
        -JsonArray ToArray(IEnumerable~long~ values)
        -JsonArray ToArray(IEnumerable~double~ values)
    }

    class ProjectedSection {
        +JsonArray day_counts
        +JsonArray month_counts
        +JsonArray year_counts
        +JsonArray percent_values
        +JsonArray dollar_amounts
        +long day_count_min
        +long day_count_max
        +long month_count_min
        +long month_count_max
        +long year_count_min
        +long year_count_max
        +double percent_min
        +double percent_max
        +long dollar_min
        +long dollar_max
        +JsonObject topic_scores
        +double topic_density
        +bool is_operative_for_topic
        +string inherited_from
        +string text
    }

    DeterministicContractProjector --> TextFeatureExtractor : uses
    DeterministicContractProjector --> ProjectedSection : builds
    TextFeatureExtractor --> ProjectedSection : populates text_features

File-Level Changes

Change Details Files
Add domain-agnostic numeric TextFeatureExtractor and integrate it into the deterministic contract projector schema and projection output.
  • Introduce TextFeatureExtractor to compute sorted numeric features (days, months, years, percentages, dollar amounts) plus *_min/_max scalars from section text using pure regex and invariant parsing.
  • Wire text_features into DeterministicContractProjector output, bumping projector version to 1.4.0 and extending the projected JSON schema to include the text_features object.
  • Ensure extractor output is deterministic, domain-agnostic, and safe for generic rule lambdas (arrays always present, scalars only when values exist).
src/LambdaRag.Projection/Projectors/TextFeatureExtractor.cs
src/LambdaRag.Projection/Projectors/DeterministicContractProjector.cs
Expand AC demo ruleset and contract topic map to support 19 new AC-aligned rules using generic topics and text_features.
  • Author 19 additional AC-aligned rules (payment terms, tax, IP, liability, insurance, security, privacy, AI, subcontracting, service locations, Quebec law) using predicates and lambdas over topics and text_features, and bump ruleset version to 2.0.0.
  • Extend contract.v1 topic map to v1.1.0 by adding generic topics (tax, subcontracting, ai, service_locations) so rules can target sections without regex in lambdas.
samples/contracts/ac-demo-ruleset.json
src/LambdaRag.Projection/TopicMaps/contract.v1.json
Document the new text_features capability and AC-aligned rules in README, CHANGELOG, and comparison doc.
  • Add README section describing text_features fields, example numeric-threshold rule lambdas, and cross-domain applicability.
  • Update CHANGELOG with the new projector version, ruleset expansion, topic-map changes, and end-to-end AC contract metrics.
  • Extend lambda-rag-vs-air-canada comparison doc with a new phase describing the 19 rules, text_features extractor, topic map changes, tests, and AC contract evaluation results.
README.md
CHANGELOG.md
docs/comparison/lambda-rag-vs-air-canada.md
Add tests to enforce genericness of text_features and to validate extractor behavior across domains.
  • Create TextFeatureExtractor unit tests using non-contract, multi-vertical prose to validate extraction of days, months, years, percentages, and dollar amounts plus determinism and empty-text behavior.
  • Create GenericTextFeaturesEvaluationTests that build synthetic non-contract rulesets and projected documents and assert correct evaluation of predicates and lambdas over text_features, including defensive Count-based predicates and absence of error verdicts.
  • Regenerate golden verdict files for all corpora to include the new text_features field while keeping verdict outcomes identical.
tests/LambdaRag.UnitTests/Projection/TextFeatureExtractorTests.cs
tests/LambdaRag.UnitTests/Evaluation/GenericTextFeaturesEvaluationTests.cs
tests/Goldens/corpus/**/expected-verdict.json
Fix two extractor regex bugs affecting dollar shorthand suffixes and hyphenated day counts.
  • Tighten DollarRx with a negative lookahead so shorthand suffixes m
b

@MTCMarkFranco MTCMarkFranco merged commit fb7d0ca into main May 1, 2026
1 of 2 checks passed
@MTCMarkFranco MTCMarkFranco deleted the phase3-ac-rules-19 branch May 1, 2026 16:17

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 3 issues, and left some high level feedback:

  • The PercentRx regex currently only matches values with a % sign, but the README and rule examples mention handling phrases like 30 percent; either broaden the regex to cover the percent/per cent wording or adjust the documentation/examples to match the actual behavior.
  • The DollarSpelledRx comment mentions handling fully spelled-out amounts like five million dollars, but the implementation only matches numeric forms followed by million|billion; align the comment with the actual pattern or extend the regex to support spelled-out numbers if that’s intended.
  • In TextFeatureExtractor XML docs you reference dollar_amounts_max, but the emitted scalar is dollar_max; consider tightening these doc/property name references to avoid confusion for rule authors.
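
One way the first point could be addressed is to broaden the pattern rather than narrow the docs. The Python sketch below is only an illustration: PERCENT_SIGN_ONLY approximates the behaviour the review describes (the actual PercentRx is not shown in this PR page), and PERCENT_BROAD is a hypothetical variant that also accepts the percent / per cent wording.

```python
import re

# Approximation of the reviewed behaviour: only the % sign is recognised.
PERCENT_SIGN_ONLY = re.compile(r"(\d+(?:\.\d+)?)\s*%")
# Hypothetical broadened pattern: "%", "percent", or "per cent".
# The \b keeps "percentage" from matching.
PERCENT_BROAD = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|per\s?cent\b)", re.IGNORECASE)


def percent_values(rx, text):
    """Return the sorted percentage values matched by rx in text."""
    return sorted(float(m.group(1)) for m in rx.finditer(text))


text = "At least 30 percent recycled content, 12 per cent by 2027, capped at 1.5%."
sign_only = percent_values(PERCENT_SIGN_ONLY, text)
broad = percent_values(PERCENT_BROAD, text)
```

Broadening the pattern changes which values feed percent_min and percent_max, so any such change would need the same golden-regeneration check described in the PR.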
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `PercentRx` regex currently only matches values with a `%` sign, but the README and rule examples mention handling phrases like `30 percent`; either broaden the regex to cover the `percent`/`per cent` wording or adjust the documentation/examples to match the actual behavior.
- The `DollarSpelledRx` comment mentions handling fully spelled-out amounts like `five million dollars`, but the implementation only matches numeric forms followed by `million|billion`; align the comment with the actual pattern or extend the regex to support spelled-out numbers if that’s intended.
- In `TextFeatureExtractor` XML docs you reference `dollar_amounts_max`, but the emitted scalar is `dollar_max`; consider tightening these doc/property name references to avoid confusion for rule authors.

## Individual Comments

### Comment 1
<location path="src/LambdaRag.Projection/Projectors/TextFeatureExtractor.cs" line_range="50-51" />
<code_context>
+        @"(?:\$|USD\s*\$?|CAD\s*\$?|US\$|CAD\$)\s*(\d{1,3}(?:[,\s]\d{3})*(?:\.\d+)?)\s*(million|billion|[mbk])?(?![A-Za-z])",
+        RegexOptions.IgnoreCase | RegexOptions.Compiled);
+
+    // Spelled-out dollar amounts: "five million dollars" / "$5 million"
+    private static readonly Regex DollarSpelledRx = new(
+        @"(\d{1,3}(?:[,\s]\d{3})*(?:\.\d+)?)\s*(million|billion)\s*(?:dollars|USD|CAD)",
+        RegexOptions.IgnoreCase | RegexOptions.Compiled);
</code_context>
<issue_to_address>
**issue:** Comment and regex for spelled-out dollar amounts are inconsistent and miss cases like "five million dollars".

`DollarSpelledRx` only accepts numeric mantissas (`\d{1,3}...`), so it doesn’t handle fully spelled-out amounts like "five million dollars"—only forms like "$5 million", which are already covered by `DollarRx`. Either expand this regex to support word-based amounts (e.g., "one", "five", "twenty") or update the comment to describe its actual numeric-only behavior to avoid misleading future rule authors.
</issue_to_address>

### Comment 2
<location path="docs/comparison/lambda-rag-vs-air-canada.md" line_range="333-334" />
<code_context>
+  `percent_max <= 1.5`.
+- `AC-LAW-QUEBEC` Fails on §12 — Governing law is Ontario, not
+  Quebec.
+- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
+  below `` / ``.
+
+### Two extractor bugs found and fixed
</code_context>
<issue_to_address>
**issue (typo):** The example for insurance limits has missing threshold values (empty backticks).

In the bullet for `AC-INS-GCL-5M` / `AC-INS-CYBER-10M`, the text `below `` / ``.` looks like an unfilled placeholder for the actual threshold amounts. Please replace these with the intended limits (e.g. `$5M / $10M`) so the rule is clear.

```suggestion
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
+  below `$5M` / `$10M`.
```
</issue_to_address>

### Comment 3
<location path="docs/comparison/lambda-rag-vs-air-canada.md" line_range="271" />
<code_context>
+   `AC-LIAB-CARVEOUTS`, `AC-TERM-CONV`,
</code_context>
<issue_to_address>
**question (typo):** Check whether the rule ID `AC-IP-WORKFORHIRE` is intentionally spelled without a hyphen.

The identifier differs from the natural-language phrase (“work-for-hire”) and the changelog (`IP/work-for-hire`). Please confirm whether `WORKFORHIRE` is the intended canonical form, or if this should be adjusted (e.g., `AC-IP-WORK-FOR-HIRE`) for consistency with your naming conventions.
</issue_to_address>

Comment on lines +50 to +51
// Spelled-out dollar amounts: "five million dollars" / "$5 million"
private static readonly Regex DollarSpelledRx = new(

issue: Comment and regex for spelled-out dollar amounts are inconsistent and miss cases like "five million dollars".

DollarSpelledRx only accepts numeric mantissas (\d{1,3}...), so it doesn’t handle fully spelled-out amounts like "five million dollars"—only forms like "$5 million", which are already covered by DollarRx. Either expand this regex to support word-based amounts (e.g., "one", "five", "twenty") or update the comment to describe its actual numeric-only behavior to avoid misleading future rule authors.
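
If the "expand the regex" option were taken, a word-based mantissa could look roughly like the Python sketch below. It is a deliberately small illustration, not a proposal for the actual DollarSpelledRx: the WORD_VALUES table covers only a handful of number words, compound forms such as "twenty-five" are out of scope, and all names here are hypothetical.

```python
import re

# Small illustrative vocabulary (assumption); a real table would be larger.
WORD_VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "twenty": 20, "fifty": 50,
}
SCALES = {"million": 10**6, "billion": 10**9}

SPELLED_RX = re.compile(
    r"\b(" + "|".join(WORD_VALUES) + r")\s+(million|billion)\s+(?:dollars|USD|CAD)\b",
    re.IGNORECASE,
)


def spelled_dollars(text):
    """Return sorted amounts for phrases like 'five million dollars'."""
    return sorted(
        WORD_VALUES[m.group(1).lower()] * SCALES[m.group(2).lower()]
        for m in SPELLED_RX.finditer(text)
    )


amounts = spelled_dollars("A bond of five million dollars and a cap of Twenty Billion USD.")
```

The alternative the review mentions, rewording the comment to say the pattern is numeric-only, avoids growing the vocabulary table at all.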

Comment on lines +333 to +334
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
below `` / ``.

issue (typo): The example for insurance limits has missing threshold values (empty backticks).

In the bullet for AC-INS-GCL-5M / AC-INS-CYBER-10M, the text below `` / ``. looks like an unfilled placeholder for the actual threshold amounts. Please replace these with the intended limits (e.g. $5M / $10M) so the rule is clear.

Suggested change
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
below `` / ``.
- `AC-INS-GCL-5M` / `AC-INS-CYBER-10M` Fail on insurance limits
+ below `$5M` / `$10M`.

`samples/contracts/ac-demo-ruleset.json` (now **v2.0.0**, 24 rules
total). New rule IDs: `AC-LIAB-CARVEOUTS`, `AC-TERM-CONV`,
`AC-PAY-NET45`, `AC-PAY-INT-MAX`, `AC-TAX-EXCL`,
`AC-IP-WORKFORHIRE`, `AC-INS-GCL-5M`, `AC-INS-CYBER-10M`,

question (typo): Check whether the rule ID AC-IP-WORKFORHIRE is intentionally spelled without a hyphen.

The identifier differs from the natural-language phrase (“work-for-hire”) and the changelog (IP/work-for-hire). Please confirm whether WORKFORHIRE is the intended canonical form, or if this should be adjusted (e.g., AC-IP-WORK-FOR-HIRE) for consistency with your naming conventions.
