
feat: Add Canadian PII protection (PIPEDA) #22951

Merged
krrishdholakia merged 6 commits into main from krrishdholakia/canadian-pii-template
Mar 7, 2026

Conversation

@krrishdholakia
Member

@krrishdholakia krrishdholakia commented Mar 6, 2026

Summary

Adds 6 new Canadian PII regex pattern detectors and a comprehensive PIPEDA-compliant policy template to the LiteLLM content filter guardrails:

New Patterns (patterns.json):

  • ca_sin – Social Insurance Numbers (PIPEDA Privacy Act)
  • ca_ohip – Ontario Health Insurance Plan numbers (PHIPA)
  • ca_on_drivers_licence – Ontario driver's licences
  • ca_immigration_doc – IRCC immigration documents (UCI, work/study permits, IMM refs)
  • ca_bank_account – Canadian bank account routing (transit-institution-account format)
  • ca_postal_code – Canadian postal codes (excludes invalid chars per Canada Post spec)

New Policy Template (policy_templates.json):

  • canadian-pii-protection – 5 sub-guardrails grouping patterns by data type (government IDs, health/drivers, immigration, financial, contact)

All patterns include contextual keyword matching with bilingual support (English + French keywords) to reduce false positives. The template complements the existing passport_canada pattern for comprehensive Canadian PII protection.

Type

🆕 New Feature

…y template

Adds 6 new Canadian PII regex pattern detectors to patterns.json:
- ca_sin: Social Insurance Number (PIPEDA Privacy Act, Income Tax Act)
- ca_ohip: Ontario Health Insurance Plan Number (PHIPA)
- ca_on_drivers_licence: Ontario driver's licence (HTA, PIPEDA)
- ca_immigration_doc: IRCC immigration docs (UCI, work/study permits, IMM refs)
- ca_bank_account: Canadian bank account routing (transit-institution-account)
- ca_postal_code: Canadian postal code (Canada Post spec)

Adds comprehensive policy template 'canadian-pii-protection' (id: canadian-pii-protection)
with 5 sub-guardrails grouping patterns by data type. All patterns include contextual
keyword matching (English + French keywords where applicable) to reduce false positives.
Complements existing passport_canada pattern.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 635ad6d898


Comment on lines +550 to +554
"pattern": "\\b\\d{3}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b",
"category": "Canadian PII Patterns",
"description": "Detects Canadian Social Insurance Numbers (9-digit federal identifier, dashed or spaced format)",
"keyword_pattern": "\\b(?:SIN|social\\s*insurance|social\\s*insurance\\s*number|numéro\\s*d'assurance\\s*sociale|NAS)\\b",
"allow_word_numbers": true


P1 Badge Allow unformatted SINs in ca_sin regex

The new ca_sin detector only matches values with separators (123-456-789 or 123 456 789), so common SIN 123456789 inputs are not masked. This also makes allow_word_numbers: true ineffective: _convert_word_number_sequence() produces contiguous digits and _find_pattern_spans() validates them with regex.fullmatch(digits), which can never satisfy a separator-required regex, so spelled-out SINs are skipped as well.
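The mismatch described in this comment can be reproduced with the standard `re` module alone. This is a minimal sketch, assuming (as the comment states) that the engine validates a converted digit string with a full match; `SIN_OPTIONAL_SEP` is an illustrative variant and not a pattern taken from the PR.

```python
import re

# Pattern as submitted in the PR: a separator is required between groups.
SIN_REQUIRED_SEP = re.compile(r"\b\d{3}[\-\s]\d{3}[\-\s]\d{3}\b")
# Hypothetical variant with optional separators (illustration only).
SIN_OPTIONAL_SEP = re.compile(r"\b\d{3}[\-\s]?\d{3}[\-\s]?\d{3}\b")

# A word-number conversion would yield contiguous digits like this.
contiguous = "123456789"

required_hit = SIN_REQUIRED_SEP.fullmatch(contiguous) is not None
optional_hit = SIN_OPTIONAL_SEP.fullmatch(contiguous) is not None
dashed_hit = SIN_REQUIRED_SEP.fullmatch("123-456-789") is not None

print(required_hit, optional_hit, dashed_hit)  # False True True
```

A separator-required pattern can therefore never accept the contiguous digit string, which is why the spelled-out path is effectively dead for the submitted regex.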


{
"name": "ca_ohip",
"display_name": "OHIP (Ontario Health Insurance Plan Number)",
"pattern": "\\b\\d{4}[\\-\\s]?\\d{3}[\\-\\s]?\\d{3}[\\-\\s]?[A-Z]{2}\\b",


P1 Badge Make OHIP version code optional in ca_ohip

This pattern requires a trailing 2-letter version code ([A-Z]{2}), so it masks 1234-567-890-AB but misses 10-digit OHIP numbers (1234-567-890 / 1234567890). Because the policy claims to protect OHIP numbers generally, this creates a real under-masking gap for common user input formats.


@greptile-apps
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR adds 9 new PII detection patterns (6 Canadian federal/provincial: ca_sin, ca_ohip, ca_on_drivers_licence, ca_immigration_doc, ca_bank_account, ca_postal_code; and 3 University of Toronto institutional: uoft_student_id, uoft_utorid, uoft_tcard) along with a canadian-pii-protection policy template organizing them into 6 sub-guardrails. A companion UI compliance prompt file and frameworkMeta entries are also included. The PR addresses several issues from a previous review round (leading-zero SINs, allow_word_numbers inconsistency, UCI separator optionality, uoft_tcard/credit-card collision via keyword narrowing).

Key observations:

  • The ca_ohip pattern mandates the two-letter version code ([A-Z]{2}), which is technically correct but silently misses the widely-used bare 10-digit OHIP format. This coverage gap should be explicitly documented or reconsidered.
  • The ca_postal_code regex uses an inconsistent character class encoding: the first-position class uses explicit enumeration while positions 3/5 use the range V-Z (which silently includes W). Both are correct per Canada Post spec, but the inconsistency makes the code harder to audit.
  • test_visa_card_masked in test_ca_policy_e2e.py uses a generic "REDACTED" assertion that doesn't pin down which pattern triggered, making it less useful as a regression guard.
  • test_student_card_no_longer_triggers_tcard in test_uoft_policy_e2e.py only checks that the TCard label is absent, but doesn't assert the original number itself passes through unmodified.
  • The test files use @pytest.mark.asyncio and call ContentFilterGuardrail.apply_guardrail() with plain-text inputs — these are local regex tests with no real network calls, consistent with the repository's CI testing requirements.

Confidence Score: 3/5

  • The PR is additive-only (new patterns, template, tests, and UI prompts) with no changes to existing logic, making it safe to merge with low risk of regressions; however, the ca_ohip coverage gap and the uoft_utorid/uoft_tcard collision risk in combined guardrail contexts warrant review before merging.
  • Score reflects that: (1) all changes are purely additive and cannot break existing functionality; (2) the new patterns have been tested with unit and E2E tests; (3) however, the ca_ohip pattern has an intentional but undocumented coverage gap for bare 10-digit OHIP numbers; (4) the uoft_utorid pattern remains broad and relies entirely on keyword context, and (5) several pre-existing review thread concerns (broad UTORid pattern, TCard/credit-card collision in combined template) have been partially addressed but not fully resolved.
  • litellm/proxy/guardrails/guardrail_hooks/litellm_content_filter/patterns.json (OHIP version code requirement, UTORid pattern breadth, TCard collision risk); tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_ca_policy_e2e.py (imprecise Visa assertion); tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_uoft_policy_e2e.py (incomplete TCard non-trigger assertion).

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/guardrails/guardrail_hooks/litellm_content_filter/patterns.json | Adds 9 new patterns (6 Canadian federal/provincial and 3 UofT institutional). ca_ohip mandatory version code may miss real OHIP numbers; uoft_utorid pattern remains very broad despite keyword safeguard; uoft_tcard 16-digit pattern conflicts with payment-card patterns in combined guardrail. |
| policy_templates.json | Adds canadian-pii-protection template with 6 sub-guardrails; correctly tagged with PIPEDA and FIPPA; us_phone reuse for Canadian phone detection is explained in the description; template structure is consistent with existing templates. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_ca_patterns.py | Unit tests for all 6 new Canadian patterns; covers positive/negative cases well; compact format tests are appropriately included as negative tests; no network calls, pure regex pattern testing. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_ca_policy_e2e.py | E2E tests for the Canadian policy; test_visa_card_masked uses a generic "REDACTED" assertion rather than a pattern-specific label, which reduces test precision; test_ohip_compact_masked tests a format without separators, but OHIP without version code is not tested. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_uoft_patterns.py | Solid unit tests for all 3 UofT patterns; boundary conditions (min/max digits and letters) are covered; no network calls. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_uoft_policy_e2e.py | Good E2E tests for UofT patterns; test_student_card_no_longer_triggers_tcard documents deliberate exclusion of "student card" keyword from TCard; assertion that UOFT_TCARD_REDACTED is absent is correct but doesn't verify the raw number also passes through. |
| ui/litellm-dashboard/src/data/canadianPiiCompliancePrompts.ts | Comprehensive UI compliance prompts covering all 9 new patterns with pass/fail cases; framework metadata for PIPEDA and FIPPA is included; prompts align with the E2E test cases. |
| ui/litellm-dashboard/src/data/compliancePrompts.ts | Minimal change: imports and spreads canadianPiiCompliancePrompts; adds two frameworkMeta entries for PIPEDA and FIPPA frameworks; change is additive and non-breaking. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Incoming LLM Request Text] --> B{ca-pii-government-identifiers\nca_sin + passport_canada}
    B --> C{ca-pii-health-drivers\nca_ohip + ca_on_drivers_licence}
    C --> D{ca-pii-immigration\nca_immigration_doc}
    D --> E{ca-pii-financial\nca_bank_account + credit_card\n+ visa + mastercard + amex + iban}
    E --> F{ca-pii-contact-information\nemail + us_phone + ca_postal_code}
    F --> G{ca-pii-institutional-ids\nuoft_student_id + uoft_utorid + uoft_tcard}
    G --> H[Masked Output to LLM]

    subgraph PerPattern [Per-Pattern Engine Logic]
        KW[keyword_pattern match?] -->|No| PASS[Pass through unchanged]
        KW -->|Yes| REX[regex pattern match?]
        REX -->|No| PASS
        REX -->|Yes| MASK[Apply redaction label]
    end

    A -.->|Each sub-guardrail applies this logic per pattern| PerPattern
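The per-pattern logic in the flowchart can be sketched in a few lines of Python. This is a simplified illustration, not the LiteLLM engine: the helper name `mask_with_keyword_gate`, the case-insensitive keyword search, the shortened keyword list, and the `[CA_SIN_REDACTED]` label are all assumptions made for the example.

```python
import re


def mask_with_keyword_gate(text: str, pattern: str, keyword_pattern: str, label: str) -> str:
    """Hypothetical helper mirroring the flowchart: keyword check first, then regex."""
    # No keyword anywhere in the text: pass through unchanged.
    if not re.search(keyword_pattern, text, flags=re.IGNORECASE):
        return text
    # Keyword present: every regex match in the whole text is masked.
    return re.sub(pattern, label, text)


SIN = r"\b[1-9]\d{2}[\-\s]\d{3}[\-\s]\d{3}\b"
SIN_KEYWORDS = r"\b(?:SIN|social\s*insurance)\b"  # shortened list for the example
LABEL = "[CA_SIN_REDACTED]"  # assumed label format

with_keyword = mask_with_keyword_gate("My SIN is 123-456-789", SIN, SIN_KEYWORDS, LABEL)
without_keyword = mask_with_keyword_gate("Order 123-456-789 shipped", SIN, SIN_KEYWORDS, LABEL)

print(with_keyword)     # My SIN is [CA_SIN_REDACTED]
print(without_keyword)  # Order 123-456-789 shipped
```

Because the gate in this sketch is evaluated once per text rather than per match, a single keyword enables masking of every matching number in that text, which is the behaviour several of the review comments are concerned about.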

Comments Outside Diff (4)

  1. litellm/proxy/guardrails/guardrail_hooks/litellm_content_filter/patterns.json, line 22-27 (link)

    ca_ohip version code is mandatory — misses common bare 10-digit format

    Real OHIP cards frequently display only the 10-digit number without the two-letter version code (the version code is on the back of the card and often omitted in forms, medical records, and user input). The current pattern \\b\\d{4}[\\-\\s]?\\d{3}[\\-\\s]?\\d{3}[\\-\\s]?[A-Z]{2}\\b requires [A-Z]{2} and the test test_without_version_code_not_matched explicitly confirms this exclusion.

    This is a meaningful coverage gap: a message like "My OHIP number is 3456 789 012" (a common user input format) would not be detected at all.

    Making the version code optional would broaden coverage at the cost of slightly more false positives from 10-digit numbers:

    If the intentional design is to require the version code (to avoid false-positives from phone numbers, account numbers, etc.), that trade-off should be explicitly noted in the description field so future maintainers understand why 10-digit-only OHIP numbers are deliberately excluded.
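The trade-off can be checked directly with `re`. The sketch below compares the submitted pattern with a variant that makes the version code optional; `OHIP_OPTIONAL` is an illustrative assumption and not necessarily the exact expression the reviewer proposed.

```python
import re

# Pattern as submitted: the two-letter version code is mandatory.
OHIP_CURRENT = re.compile(r"\b\d{4}[\-\s]?\d{3}[\-\s]?\d{3}[\-\s]?[A-Z]{2}\b")
# Hypothetical variant: version code wrapped in an optional group.
OHIP_OPTIONAL = re.compile(r"\b\d{4}[\-\s]?\d{3}[\-\s]?\d{3}(?:[\-\s]?[A-Z]{2})?\b")

samples = ["1234-567-890-AB", "3456 789 012", "1234567890"]

current_hits = [bool(OHIP_CURRENT.search(s)) for s in samples]
optional_hits = [bool(OHIP_OPTIONAL.search(s)) for s in samples]

print(current_hits)   # [True, False, False]
print(optional_hits)  # [True, True, True]
```

The optional variant also matches any bare 10-digit number, so it leans even more heavily on the keyword context to avoid false positives.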

  2. litellm/proxy/guardrails/guardrail_hooks/litellm_content_filter/patterns.json, line 58-63 (link)

    ca_postal_code regex has inconsistent character class encoding for positions 3 and 5

    The first-position character class [ABCEGHJ-NPRSTVXY] uses explicit letter enumeration (no range ending in X/Y), intentionally excluding W and Z as first characters per Canada Post spec.

    The third and fifth position character classes [ABCEGHJ-NPRSTV-Z] use the range V-Z, which silently includes W (ASCII 87, between V and X) and Z (ASCII 90). While W and Z are valid in positions 3 and 5 per the Canada Post spec, the use of a range here makes the inclusion of W non-obvious to reviewers and maintainers who might otherwise assume the same exclusion rules apply to all positions.

    Consider making the character class for positions 3/5 as explicit as position 1, to make the difference deliberate and readable:

    Or at minimum add a note to the description field explaining that W and Z are valid in positions 3 and 5 but not position 1.
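The difference between the two character classes can be made visible by enumerating which letters each one accepts. In this sketch the fully enumerated class `THIRD_FIFTH_EXPLICIT` is an illustrative rewrite, not text taken from the review suggestion.

```python
import re
import string

FIRST = re.compile(r"[ABCEGHJ-NPRSTVXY]")                 # position 1, as in the PR
THIRD_FIFTH_RANGE = re.compile(r"[ABCEGHJ-NPRSTV-Z]")     # positions 3/5, as in the PR
THIRD_FIFTH_EXPLICIT = re.compile(r"[ABCEGHJKLMNPRSTVWXYZ]")  # hypothetical explicit form


def accepted(regex: re.Pattern) -> set:
    """Return the uppercase ASCII letters a single-character class accepts."""
    return {c for c in string.ascii_uppercase if regex.fullmatch(c)}


only_in_later_positions = accepted(THIRD_FIFTH_RANGE) - accepted(FIRST)
same_letters = accepted(THIRD_FIFTH_RANGE) == accepted(THIRD_FIFTH_EXPLICIT)

print(sorted(only_in_later_positions))  # ['W', 'Z']
print(same_letters)                     # True
```

The explicit form accepts exactly the same letters as the range form, so switching is a readability change only.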


  3. tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_ca_policy_e2e.py, line 882-897 (link)

    test_visa_card_masked uses an imprecise assertion

    The assertion assert "REDACTED" in output only verifies that some redaction occurred, not that the Visa pattern specifically triggered. If a future change caused the credit_card generic pattern to suppress the visa pattern (or vice versa), this test would still pass while the wrong label was applied.

    Since setup_canadian_guardrail() includes both credit_card and visa patterns, 4111111111111111 (a canonical Visa test number) could be labeled [CREDIT_CARD_REDACTED] or [VISA_REDACTED] depending on execution order. A more robust assertion would accept either expected label explicitly:

    assert "[VISA_REDACTED]" in output or "[CREDIT_CARD_REDACTED]" in output
    assert "4111111111111111" not in output

    This makes the test's intent clear and will catch cases where neither pattern triggered.

  4. tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_uoft_policy_e2e.py, line 1372-1383 (link)

    test_student_card_no_longer_triggers_tcard only checks label absence, not that the number passes through

    The assertion only verifies that the UOFT_TCARD_REDACTED label is not applied. It does not assert that the raw 16-digit number 4111111111111111 is still present in the output (i.e., that it was NOT masked by any other pattern in the UofT guardrail).

    In the current setup this is incidentally fine — the UofT guardrail has no other pattern that would match a 16-digit all-digit string without a TCard keyword. But a stronger assertion would make the test's purpose clear and guard against future regressions:

    assert "UOFT_TCARD_REDACTED" not in output
    assert "4111111111111111" in output  # number should pass through unmasked
    assert output == text  # entire message should be unchanged

Last reviewed commit: 5d0ea2e

{
"name": "ca_sin",
"display_name": "SIN (Canadian Social Insurance Number)",
"pattern": "\\b\\d{3}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b",
Contributor

ca_sin allows invalid leading-zero SINs

Canadian Social Insurance Numbers with a first digit of 0 are never assigned by Service Canada. The current pattern \\b\\d{3}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b would match numbers like 000-000-000 or 012-345-678, which can never be real SINs. If a keyword like "SIN" appears in proximity to any 0XX-XXX-XXX numeric string (e.g., an order number or reference code), this will produce a false-positive redaction.
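A quick check of this claim compares the submitted pattern with a variant whose first digit is restricted to 1-9. The sample values are made up for the example.

```python
import re

SIN_ANY_FIRST = re.compile(r"\b\d{3}[\-\s]\d{3}[\-\s]\d{3}\b")
SIN_NONZERO_FIRST = re.compile(r"\b[1-9]\d{2}[\-\s]\d{3}[\-\s]\d{3}\b")

candidates = ["012-345-678", "000-000-000", "123-456-789"]

matched_before = [c for c in candidates if SIN_ANY_FIRST.search(c)]
matched_after = [c for c in candidates if SIN_NONZERO_FIRST.search(c)]

print(matched_before)  # ['012-345-678', '000-000-000', '123-456-789']
print(matched_after)   # ['123-456-789']
```

The leading `\b` matters here: without it, the restricted pattern could still start matching from a later digit inside a longer number.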

Restricting the first digit to [1-9] tightens the match to actually-assigned number ranges:

Suggested change
"pattern": "\\b\\d{3}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b",
"pattern": "\\b[1-9]\\d{2}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b",

"category": "Canadian PII Patterns",
"description": "Detects Canadian Social Insurance Numbers (9-digit federal identifier, dashed or spaced format)",
"keyword_pattern": "\\b(?:SIN|social\\s*insurance|social\\s*insurance\\s*number|numéro\\s*d'assurance\\s*sociale|NAS)\\b",
"allow_word_numbers": true
Contributor

allow_word_numbers inconsistent with other Canadian patterns

All five other new Canadian patterns set "allow_word_numbers": false, but ca_sin sets it to true. allow_word_numbers: true causes the engine to additionally scan for spelled-out number sequences (e.g., "one two three four five six seven eight nine") and attempt to match them against the regex after converting them to digits. For a SIN detector this is an unusual choice and appears inconsistent with the rest of the Canadian pattern set. If intentional, a comment in the description would clarify the reasoning; if unintentional, it should be set to false to match the other patterns.

Suggested change
"allow_word_numbers": true
"allow_word_numbers": false

{
"name": "ca_immigration_doc",
"display_name": "Immigration Document (Canada IRCC)",
"pattern": "\\b(?:\\d{4}[\\-\\s]?\\d{4}[\\-\\s]?\\d{2}|[TUFW]\\d{8,10}|IMM[\\-\\s]?\\d{4,5})\\b",
Contributor

ca_immigration_doc UCI sub-pattern is overly broad

The first alternative \\d{4}[\\-\\s]?\\d{4}[\\-\\s]?\\d{2} matches any 10-digit sequence in XXXX-XXXX-XX format. This overlaps with common non-immigration identifiers: compact date strings (2024-01-24-15), credit-card-like partial numbers, order/reference numbers, and even phone numbers formatted as 1234-5678-90.

While the keyword_pattern provides proximity context, the engine returns early if any keyword match exists anywhere in the text — meaning a paragraph that mentions "immigration" even once will activate matching for every XXXX-XXXX-XX number in that paragraph. A tighter expression (e.g., requiring the separator to be present) would reduce false positives:

Suggested change
"pattern": "\\b(?:\\d{4}[\\-\\s]?\\d{4}[\\-\\s]?\\d{2}|[TUFW]\\d{8,10}|IMM[\\-\\s]?\\d{4,5})\\b",
"pattern": "\\b(?:\\d{4}[\\-\\s]\\d{4}[\\-\\s]\\d{2}|[TUFW]\\d{8,10}|IMM[\\-\\s]?\\d{4,5})\\b",

Making the separators in the UCI sub-pattern mandatory (removing ?) restricts matches to numbers explicitly written in XXXX-XXXX-XX form, which is the canonical UCI display format.
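The effect of the suggested change on the UCI alternative alone can be sketched as follows; only the first alternative of the full `ca_immigration_doc` pattern is reproduced here.

```python
import re

# UCI alternative as submitted: separators optional.
UCI_OPTIONAL_SEP = re.compile(r"\b\d{4}[\-\s]?\d{4}[\-\s]?\d{2}\b")
# UCI alternative after the suggested change: separators required.
UCI_MANDATORY_SEP = re.compile(r"\b\d{4}[\-\s]\d{4}[\-\s]\d{2}\b")

bare = "1234567890"
formatted = "1234-5678-90"

optional_hits = (bool(UCI_OPTIONAL_SEP.search(bare)), bool(UCI_OPTIONAL_SEP.search(formatted)))
mandatory_hits = (bool(UCI_MANDATORY_SEP.search(bare)), bool(UCI_MANDATORY_SEP.search(formatted)))

print(optional_hits)   # (True, True)
print(mandatory_hits)  # (False, True)
```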

Comment on lines +2617 to +2619
"pattern_type": "prebuilt",
"pattern_name": "amex",
"action": "MASK"
Contributor

us_phone pattern used in a Canadian PII template

The ca-pii-contact-information guardrail references the us_phone prebuilt pattern. While this is functionally correct — Canadian numbers use the same North American Numbering Plan (NANP) as US numbers — it is surprising to see a US-named pattern inside a template explicitly branded as "Canadian PII Protection (PIPEDA)". Users configuring this template may not realise they are activating a US-labelled pattern, which could create confusion during audits or compliance reviews.

Consider either documenting this in the guardrail_info.description (e.g., "Masks North American (NANP) phone numbers, covering both Canada and the US") or requesting a ca_phone alias pattern that maps to the same regex as us_phone.


Adds:
- test_ca_patterns.py: 30 unit tests for regex pattern matching (SIN, OHIP,
  driver's licence, immigration docs, bank account, postal code)
- test_ca_policy_e2e.py: 27 end-to-end tests running the full
  ContentFilterGuardrail pipeline with MASK action — validates detection
  of real PII and pass-through of clean prompts
- canadianPiiCompliancePrompts.ts: 21-prompt compliance dataset for UI
  evaluation, wired into the main compliancePrompts framework

Fixes keyword_pattern alternation ordering in patterns.json — longer
alternatives (e.g. "social insurance number") now precede shorter ones
("social insurance") to avoid excessive gap-word count when the regex
engine selects the shorter match first.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
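The alternation-ordering fix in this commit relies on how Python's `re` engine resolves alternatives: it takes the first alternative that allows the overall match to succeed, not the longest one. A minimal sketch using two of the SIN keywords (the sample sentence is made up):

```python
import re

# Shorter alternative listed first (the ordering before the fix).
SHORT_FIRST = re.compile(r"\b(?:social\s*insurance|social\s*insurance\s*number)\b")
# Longer alternative listed first (the ordering after the fix).
LONG_FIRST = re.compile(r"\b(?:social\s*insurance\s*number|social\s*insurance)\b")

text = "my social insurance number is 123-456-789"

short_first_match = SHORT_FIRST.search(text).group(0)
long_first_match = LONG_FIRST.search(text).group(0)

print(repr(short_first_match))  # 'social insurance'
print(repr(long_first_match))   # 'social insurance number'
```

With the shorter keyword matched, the word "number" is left between the keyword and the value, which is the extra gap word the commit message refers to.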
…36 tests)

Add 3 UofT institutional identifiers (student/employee number, UTORid, TCard)
covered under Ontario FIPPA. Includes pattern definitions, policy template
sub-guardrail, compliance prompts, unit tests, and e2e tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
)
output = result.get("texts", [])[0]
assert "[UOFT_UTORID_REDACTED]" in output
assert "smithj12" not in result
Contributor

This assertion checks the wrong variable — it tests whether "smithj12" is a key in the returned dictionary (which it never is), not whether the string appears in the masked output text. The check will always pass regardless of whether masking actually occurred.

Suggested change
assert "smithj12" not in result
assert "smithj12" not in output
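The reason the original assertion can never fail is that `in` on a dictionary tests its keys. A small sketch, assuming the result has the `{"texts": [...]}` shape implied by `result.get("texts", [])[0]` in the test; the sample text is made up and deliberately left unmasked:

```python
# Simulated guardrail result in which masking did NOT happen.
unmasked_result = {"texts": ["My UTORid is smithj12"]}
output = unmasked_result.get("texts", [])[0]

# Checks dictionary keys, so it passes even though the value leaked.
passes_on_dict = "smithj12" not in unmasked_result
# Checks the output string, so it correctly detects the leak.
passes_on_output = "smithj12" not in output

print(passes_on_dict, passes_on_output)  # True False
```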

Comment on lines +20 to +80
def setup_canadian_guardrail(self):
    """
    Setup guardrail with all Canadian PII patterns (mimics the policy template)
    """
    patterns = [
        # Government identifiers
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="ca_sin",
            action=ContentFilterAction.MASK,
        ),
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="passport_canada",
            action=ContentFilterAction.MASK,
        ),
        # Health & drivers
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="ca_ohip",
            action=ContentFilterAction.MASK,
        ),
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="ca_on_drivers_licence",
            action=ContentFilterAction.MASK,
        ),
        # Immigration
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="ca_immigration_doc",
            action=ContentFilterAction.MASK,
        ),
        # Financial
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="ca_bank_account",
            action=ContentFilterAction.MASK,
        ),
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="credit_card",
            action=ContentFilterAction.MASK,
        ),
        # Contact info
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="email",
            action=ContentFilterAction.MASK,
        ),
        ContentFilterPattern(
            pattern_type="prebuilt",
            pattern_name="ca_postal_code",
            action=ContentFilterAction.MASK,
        ),
    ]

    return ContentFilterGuardrail(
        guardrail_name="canadian-pii-protection",
        patterns=patterns,
    )
Contributor

The setup_canadian_guardrail() function omits the three uoft_* patterns (uoft_student_id, uoft_utorid, uoft_tcard) that are part of the ca-pii-institutional-ids sub-guardrail in the actual policy template. The file docstring claims to test "the complete policy with all Canadian PII patterns," but the University of Toronto institutional identifiers are missing from this setup.

Consider either:

  1. Adding the uoft_* patterns to setup_canadian_guardrail() if the intent is end-to-end testing of the full template, or
  2. Updating the docstring to accurately reflect the current scope (e.g., "tests the complete Canadian federal/provincial/contact PII patterns" and noting that institutional patterns are tested separately in test_uoft_policy_e2e.py).

Fix test_utorid_masked checking `result` (dict) instead of `output` (string).
Update test_ca_policy_e2e.py docstring to clarify scope vs UofT tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +2618 to +2621
"pattern_type": "prebuilt",
"pattern_name": "amex",
"action": "MASK"
},
Contributor

Missing "PIPEDA" tag

The policy template is titled "Canadian PII Protection (PIPEDA)" and its ID is "canadian-pii-protection", but the tags array does not include "PIPEDA": only "PII Protection", "Canada", and "FIPPA" are present. Users or tooling that filters or searches templates by tag for PIPEDA compliance will not find this template.

Suggested change
"pattern_type": "prebuilt",
"pattern_name": "amex",
"action": "MASK"
},
"tags": [
"PII Protection",
"Canada",
"PIPEDA",
"FIPPA"
],

- Tighten ca_postal_code keyword_pattern: replace broad "address" with
  specific compound terms (mailing/street/shipping/home address)
- Add missing "PIPEDA" tag to policy_templates.json for discoverability
- Add us_phone pattern to test_ca_policy_e2e.py setup to match deployed template
- Add phone number e2e test for complete coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@krrishdholakia
Member Author

@greptile please re-review this

Comment on lines +611 to +618
"name": "uoft_utorid",
"display_name": "UTORid (University of Toronto Login)",
"pattern": "\\b[a-z]{2,6}\\d{1,4}\\b",
"category": "Canadian Institutional Identifiers (FIPPA)",
"description": "Detects University of Toronto UTORid login identifiers (e.g. smithj12). Context keywords are essential due to the generic format.",
"keyword_pattern": "\\b(?:utorid|utoid|utor\\s*id|acorn|quercus|uoft\\s*login|university\\s*of\\s*toronto\\s*login|portal\\s*login|weblogin|shibboleth)\\b",
"allow_word_numbers": false
},
Contributor

uoft_utorid pattern is dangerously broad

The pattern \\b[a-z]{2,6}\\d{1,4}\\b matches an enormous class of common tokens: variable names (db2, str1, api42), error codes (err404), CSS classes (sm2, lg1), port names (eth0), and many others.

The design relies entirely on the keyword_pattern for correctness, but the keyword list includes terms that appear commonly in non-UTORid contexts:

  • "acorn" — a word in everyday English ("acorn tree", "the ACORN organization")
  • "quercus" — also a plant genus name appearing in biology/ecology contexts
  • "id\\s*card" — extremely common phrase

Once any of these keywords matches anywhere in the text, every [a-z]{2,6}\\d{1,4} token in the entire document becomes a match candidate. For example:

"The ACORN protocol uses db2 for storage and api42 for the endpoint"

Here db2 and api42 would both be masked as UTORid even though neither is a UTORid. The existing test_generic_short_word_no_context_passes test (in test_uoft_policy_e2e.py) only guards against the no-keyword case.

Consider tightening the pattern to the documented UTORid format: a surname-initial suffix structure. At minimum, the pattern could require the alphabetic prefix to be 4+ characters (matching the real minimum surname abbreviation), reducing collisions with 2-3-letter abbreviations that are ubiquitous in technical text.
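The collision and the effect of a longer alphabetic prefix can be sketched as below. `UTORID_TIGHTER` is one possible reading of the "4+ characters" suggestion, not a pattern from the PR, and the sample sentence extends the reviewer's example with a made-up login.

```python
import re

UTORID_BROAD = re.compile(r"\b[a-z]{2,6}\d{1,4}\b")    # as submitted
UTORID_TIGHTER = re.compile(r"\b[a-z]{4,6}\d{1,4}\b")  # hypothetical: prefix of 4+ letters

text = "The ACORN protocol uses db2 for storage and api42 for the endpoint; login smithj12"

broad_matches = UTORID_BROAD.findall(text)
tighter_matches = UTORID_TIGHTER.findall(text)

print(broad_matches)    # ['db2', 'api42', 'smithj12']
print(tighter_matches)  # ['smithj12']
```

The tighter variant would still match longer technical tokens such as `port8080`, so it reduces rather than eliminates the dependence on keyword context.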

Comment on lines +55 to +80
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="ca_bank_account",
    action=ContentFilterAction.MASK,
),
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="credit_card",
    action=ContentFilterAction.MASK,
),
# Contact info
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="email",
    action=ContentFilterAction.MASK,
),
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="us_phone",
    action=ContentFilterAction.MASK,
),
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="ca_postal_code",
    action=ContentFilterAction.MASK,
),
Contributor

Financial card patterns missing from test setup

setup_canadian_guardrail() includes credit_card but omits visa, mastercard, amex, and iban, all of which are present in the actual policy template's ca-pii-financial guardrail definition (policy_templates.json lines 2608–2626). There are also no E2E test cases that verify detection of Visa, Mastercard, Amex, or IBAN numbers.

As a result, if any of those four patterns were misconfigured or changed in a way that breaks them within the Canadian template, no test would catch it.

Consider adding:

  1. The four missing patterns to the setup_canadian_guardrail() patterns list:
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="visa",
    action=ContentFilterAction.MASK,
),
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="mastercard",
    action=ContentFilterAction.MASK,
),
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="amex",
    action=ContentFilterAction.MASK,
),
ContentFilterPattern(
    pattern_type="prebuilt",
    pattern_name="iban",
    action=ContentFilterAction.MASK,
),
  2. At least one E2E test asserting that a Visa/Mastercard number in text is masked by the Canadian guardrail.

Comment on lines +619 to 627
{
"name": "uoft_tcard",
"display_name": "TCard Number (University of Toronto Campus ID)",
"pattern": "\\b\\d{16}\\b",
"category": "Canadian Institutional Identifiers (FIPPA)",
"description": "Detects University of Toronto TCard campus ID card numbers (16-digit). Context keywords are critical to distinguish from payment card numbers.",
"keyword_pattern": "\\b(?:tcard\\s*number|t-card\\s*number|campus\\s*card\\s*number|campus\\s*id\\s*number|library\\s*card\\s*number|student\\s*card\\s*number|id\\s*card\\s*number|tcard|t-card|campus\\s*card|campus\\s*id|library\\s*card|student\\s*card|id\\s*card|university\\s*card)\\b",
"allow_word_numbers": false
}
Contributor

uoft_tcard will mislabel credit card numbers as TCard in combined template

The uoft_tcard pattern \\b\\d{16}\\b is structurally identical to the 16-digit Visa/Mastercard format (Amex numbers are 15 digits, so they do not collide). In the canadian-pii-protection policy template, ca-pii-institutional-ids (containing uoft_tcard) and ca-pii-financial (containing visa, mastercard, amex) are applied as separate guardrails processed sequentially.

If ca-pii-institutional-ids runs before ca-pii-financial, a credit card number that appears in text alongside any of the TCard keywords ("student card", "id card", "campus card", etc.) will be redacted as [UOFT_TCARD_REDACTED] instead of the appropriate payment-card redaction label. The keyword list for uoft_tcard includes very common terms:

"student\\s*card\\s*number|id\\s*card\\s*number|student\\s*card"

For example:

"Please update your student card: 4111111111111111"

→ masked as [UOFT_TCARD_REDACTED] (incorrect) instead of [VISA_REDACTED] (expected).

The description acknowledges the overlap ("Context keywords are critical to distinguish from payment card numbers"), but in a combined multi-guardrail template this distinction is no longer reliable. Consider:

  • Documenting the processing order of the guardrails in the template description, or
  • Excluding well-known credit-card-adjacent keywords (e.g., "student\\s*card" in payment contexts) from the uoft_tcard keyword_pattern
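The order dependence can be illustrated with two simplified rules applied one after another. This is a sketch under stated assumptions: the Visa regex is a stand-in rather than the shipped `visa` pattern, the keyword check is assumed to have already passed, and `apply_in_order` is a hypothetical helper, not a LiteLLM function.

```python
import re

# Simplified stand-ins for the two rules (keyword gating omitted).
TCARD = (re.compile(r"\b\d{16}\b"), "[UOFT_TCARD_REDACTED]")
VISA = (re.compile(r"\b4\d{15}\b"), "[VISA_REDACTED]")


def apply_in_order(text: str, rules) -> str:
    """Apply each (regex, label) rule in sequence; earlier rules consume matches first."""
    for regex, label in rules:
        text = regex.sub(label, text)
    return text


text = "Please update your student card: 4111111111111111"

tcard_first = apply_in_order(text, [TCARD, VISA])
visa_first = apply_in_order(text, [VISA, TCARD])

print(tcard_first)  # Please update your student card: [UOFT_TCARD_REDACTED]
print(visa_first)   # Please update your student card: [VISA_REDACTED]
```

Whichever rule runs first replaces the digits, so the second rule has nothing left to match. That is why either documenting the order or narrowing the TCard keywords resolves the ambiguity.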

…coverage

- ca_sin: reject leading-zero SINs ([1-9]\d{2}), set allow_word_numbers to false
- ca_immigration_doc: require separators in UCI pattern (prevent bare \d{10} match)
- uoft_utorid: qualify generic keywords (acorn -> acorn login, quercus -> quercus login)
- uoft_tcard: remove generic keywords (student card, id card, library card) that
  overlap with credit card contexts; keep only UofT-specific terms (tcard, campus card)
- Add visa/mastercard/amex/iban patterns to test_ca_policy_e2e.py setup to match
  deployed template; add Visa card masking test
- Add test verifying "student card" no longer triggers TCard redaction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
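
The ca_sin and ca_immigration_doc changes listed above can be illustrated with before/after regexes. The patterns below are assumptions written for this sketch (the exact expressions in patterns.json are not shown here); only the `[1-9]\d{2}` prefix and the "require separators" rule come from the commit message.

```python
import re

# Hypothetical "before" and "after" SIN patterns; only the leading
# [1-9]\d{2} group is taken from the commit message.
SIN_BEFORE = re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{3}\b")
SIN_AFTER = re.compile(r"\b[1-9]\d{2}[- ]?\d{3}[- ]?\d{3}\b")

# Hypothetical 10-digit UCI patterns: optional vs. required hyphens.
UCI_BEFORE = re.compile(r"\b\d{2}-?\d{4}-?\d{4}\b")
UCI_AFTER = re.compile(r"\b\d{2}-\d{4}-\d{4}\b")


def matches(pattern, text):
    """True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


# A leading-zero number is accepted before the change and rejected after it.
print(matches(SIN_BEFORE, "012-345-678"), matches(SIN_AFTER, "012-345-678"))
# A bare 10-digit run no longer counts as a UCI; the hyphenated form still does.
print(matches(UCI_BEFORE, "1234567890"), matches(UCI_AFTER, "1234567890"))
print(matches(UCI_AFTER, "12-3456-7890"))
```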
krrishdholakia merged commit ff8e01d into main on Mar 7, 2026
32 of 42 checks passed
andyren-png pushed a commit to andyren-png/litellm that referenced this pull request Mar 7, 2026
* feat: Add Canadian PII protection patterns and PIPEDA-compliant policy template

Adds 6 new Canadian PII regex pattern detectors to patterns.json:
- ca_sin: Social Insurance Number (PIPEDA Privacy Act, Income Tax Act)
- ca_ohip: Ontario Health Insurance Plan Number (PHIPA)
- ca_on_drivers_licence: Ontario driver's licence (HTA, PIPEDA)
- ca_immigration_doc: IRCC immigration docs (UCI, work/study permits, IMM refs)
- ca_bank_account: Canadian bank account routing (transit-institution-account)
- ca_postal_code: Canadian postal code (Canada Post spec)

Adds comprehensive policy template 'canadian-pii-protection' (id: canadian-pii-protection)
with 5 sub-guardrails grouping patterns by data type. All patterns include contextual
keyword matching (English + French keywords where applicable) to reduce false positives.
Complements existing passport_canada pattern.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* feat: Add Canadian PII compliance dataset and tests (57 tests)

Adds:
- test_ca_patterns.py: 30 unit tests for regex pattern matching (SIN, OHIP,
  driver's licence, immigration docs, bank account, postal code)
- test_ca_policy_e2e.py: 27 end-to-end tests running the full
  ContentFilterGuardrail pipeline with MASK action — validates detection
  of real PII and pass-through of clean prompts
- canadianPiiCompliancePrompts.ts: 21-prompt compliance dataset for UI
  evaluation, wired into the main compliancePrompts framework

Fixes keyword_pattern alternation ordering in patterns.json — longer
alternatives (e.g. "social insurance number") now precede shorter ones
("social insurance") to avoid excessive gap-word count when the regex
engine selects the shorter match first.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
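
The alternation-ordering fix in this commit relies on how Python's `re` module handles `|`: alternatives are tried left to right and the first one that succeeds wins, even if a later one would match more text. A minimal sketch follows; the gap-word counting in `words_after_keyword` is a simplified, hypothetical stand-in for how the guardrail measures the distance between keyword and number.

```python
import re

text = "My social insurance number is 123-456-789"

# Same alternatives, different order.
short_first = re.compile(r"\b(?:social\s*insurance|social\s*insurance\s*number)\b")
long_first = re.compile(r"\b(?:social\s*insurance\s*number|social\s*insurance)\b")


def words_after_keyword(pattern, text):
    """Return the matched keyword and the number of words between it and the first digit."""
    m = pattern.search(text)
    gap = re.match(r"\D*", text[m.end():]).group()
    return m.group(), len(gap.split())


print(words_after_keyword(short_first, text))  # shorter keyword, larger gap
print(words_after_keyword(long_first, text))   # longer keyword, smaller gap
```

With the shorter alternative first, the word "number" is left outside the keyword match and is counted as part of the gap.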

* feat: Add University of Toronto FIPPA identifier patterns and tests (36 tests)

Add 3 UofT institutional identifiers (student/employee number, UTORid, TCard)
covered under Ontario FIPPA. Includes pattern definitions, policy template
sub-guardrail, compliance prompts, unit tests, and e2e tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: Correct test assertion bug and inaccurate docstring

Fix test_utorid_masked checking `result` (dict) instead of `output` (string).
Update test_ca_policy_e2e.py docstring to clarify scope vs UofT tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
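
The assertion bug fixed in this commit is easy to miss because `in` on a dict tests only its keys. The sketch below uses hypothetical values and a hypothetical result shape (a dict with a `"texts"` list); the real return value of the guardrail may differ.

```python
# Hypothetical identifier and result shape, for illustration only.
secret = "doej1234"

# Pretend masking FAILED: the identifier is still present in the text.
result = {"texts": [f"My UTORid is {secret}"]}
output = result["texts"][0]

# Buggy check: looks for the identifier among the dict keys, so it
# passes no matter what the text contains.
buggy_check_passes = secret not in result

# Correct check: looks for the identifier inside the output string.
correct_check_passes = secret not in output

print(buggy_check_passes, correct_check_passes)
```

The buggy check reports success even though the identifier leaked, which is why the test had to compare against the string.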

* fix: Address Greptile review feedback

- Tighten ca_postal_code keyword_pattern: replace broad "address" with
  specific compound terms (mailing/street/shipping/home address)
- Add missing "PIPEDA" tag to policy_templates.json for discoverability
- Add us_phone pattern to test_ca_policy_e2e.py setup to match deployed template
- Add phone number e2e test for complete coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
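
The effect of tightening the ca_postal_code keywords can be sketched with two keyword regexes. Both patterns are assumptions built from the terms named in this commit message, not the expressions in patterns.json, and `has_keyword` is a hypothetical helper.

```python
import re

# Hypothetical keyword patterns: broad "address" vs. specific compound terms.
BROAD = re.compile(r"\b(?:postal\s*code|address)\b", re.IGNORECASE)
TIGHT = re.compile(
    r"\b(?:postal\s*code|mailing\s*address|street\s*address"
    r"|shipping\s*address|home\s*address)\b",
    re.IGNORECASE,
)


def has_keyword(pattern, text):
    """True if the text contains one of the pattern's context keywords."""
    return pattern.search(text) is not None


unrelated = "Reply to my email address and quote ticket K1A 0B1"
relevant = "My mailing address ends with K1A 0B1"

print(has_keyword(BROAD, unrelated), has_keyword(TIGHT, unrelated))
print(has_keyword(BROAD, relevant), has_keyword(TIGHT, relevant))
```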

* fix: Tighten patterns to reduce false positives and add missing test coverage

- ca_sin: reject leading-zero SINs ([1-9]\d{2}), set allow_word_numbers to false
- ca_immigration_doc: require separators in UCI pattern (prevent bare \d{10} match)
- uoft_utorid: qualify generic keywords (acorn -> acorn login, quercus -> quercus login)
- uoft_tcard: remove generic keywords (student card, id card, library card) that
  overlap with credit card contexts; keep only UofT-specific terms (tcard, campus card)
- Add visa/mastercard/amex/iban patterns to test_ca_policy_e2e.py setup to match
  deployed template; add Visa card masking test
- Add test verifying "student card" no longer triggers TCard redaction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>