feat: Add Canadian PII protection (PIPEDA) #22951
…y template
Adds 6 new Canadian PII regex pattern detectors to patterns.json:
- ca_sin: Social Insurance Number (PIPEDA Privacy Act, Income Tax Act)
- ca_ohip: Ontario Health Insurance Plan Number (PHIPA)
- ca_on_drivers_licence: Ontario driver's licence (HTA, PIPEDA)
- ca_immigration_doc: IRCC immigration docs (UCI, work/study permits, IMM refs)
- ca_bank_account: Canadian bank account routing (transit-institution-account)
- ca_postal_code: Canadian postal code (Canada Post spec)
Adds comprehensive policy template 'canadian-pii-protection' (id: canadian-pii-protection) with 5 sub-guardrails grouping patterns by data type. All patterns include contextual keyword matching (English + French keywords where applicable) to reduce false positives. Complements existing passport_canada pattern.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 635ad6d898
| "pattern": "\\b\\d{3}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b", | ||
| "category": "Canadian PII Patterns", | ||
| "description": "Detects Canadian Social Insurance Numbers (9-digit federal identifier, dashed or spaced format)", | ||
| "keyword_pattern": "\\b(?:SIN|social\\s*insurance|social\\s*insurance\\s*number|numéro\\s*d'assurance\\s*sociale|NAS)\\b", | ||
| "allow_word_numbers": true |
Allow unformatted SINs in ca_sin regex
The new ca_sin detector only matches values with separators (123-456-789 or 123 456 789), so common SIN 123456789 inputs are not masked. This also makes allow_word_numbers: true ineffective: _convert_word_number_sequence() produces contiguous digits and _find_pattern_spans() validates them with regex.fullmatch(digits), which can never satisfy a separator-required regex, so spelled-out SINs are skipped as well.
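The mismatch described in this comment can be reproduced with the standard `re` module alone. The sketch below assumes, as the comment states, that converted word-numbers arrive as contiguous digits and are validated with a full match. `RELAXED_SIN` is a hypothetical variant with optional separators, shown only for contrast; it is not necessarily the fix the PR adopts.

```python
import re

# Pattern as submitted: a separator is required between each 3-digit group.
STRICT_SIN = re.compile(r"\b\d{3}[\-\s]\d{3}[\-\s]\d{3}\b")
# Hypothetical alternative: separators optional, so contiguous digits also match.
RELAXED_SIN = re.compile(r"\b\d{3}[\-\s]?\d{3}[\-\s]?\d{3}\b")

# The last sample is what a word-number conversion would yield (no separators).
samples = ["123-456-789", "123 456 789", "123456789"]

strict_hits = [bool(STRICT_SIN.fullmatch(s)) for s in samples]
relaxed_hits = [bool(RELAXED_SIN.fullmatch(s)) for s in samples]

print(strict_hits)   # [True, True, False]
print(relaxed_hits)  # [True, True, True]
```

With the strict pattern, the contiguous form can never pass a full match, which is why `allow_word_numbers: true` has no effect there.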
| { | ||
| "name": "ca_ohip", | ||
| "display_name": "OHIP (Ontario Health Insurance Plan Number)", | ||
| "pattern": "\\b\\d{4}[\\-\\s]?\\d{3}[\\-\\s]?\\d{3}[\\-\\s]?[A-Z]{2}\\b", |
Make OHIP version code optional in ca_ohip
This pattern requires a trailing 2-letter version code ([A-Z]{2}), so it masks 1234-567-890-AB but misses 10-digit OHIP numbers (1234-567-890 / 1234567890). Because the policy claims to protect OHIP numbers generally, this creates a real under-masking gap for common user input formats.
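One way to make the version code optional is to wrap the trailing separator and letters in an optional non-capturing group. The pattern below is a hypothetical variant for illustration, not the pattern from the PR; the sample values are made up.

```python
import re

# Hypothetical ca_ohip variant: the 2-letter version code (and its separator)
# is wrapped in an optional group.
OHIP_OPTIONAL_VC = re.compile(
    r"\b\d{4}[\-\s]?\d{3}[\-\s]?\d{3}(?:[\-\s]?[A-Z]{2})?\b"
)

samples = ["1234-567-890-AB", "1234-567-890", "1234567890", "1234567890AB", "123-456"]
ohip_matches = {s: bool(OHIP_OPTIONAL_VC.fullmatch(s)) for s in samples}

for sample, matched in ohip_matches.items():
    print(f"{sample!r}: {matched}")
```

The trade-off is that any bare 10-digit run, such as an unformatted phone number, now satisfies the regex, so the keyword context carries more of the burden.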
Greptile Summary
This PR adds 9 new PII detection patterns (6 Canadian federal/provincial and 3 UofT institutional).
Confidence Score: 3/5
| Filename | Overview |
|---|---|
| litellm/proxy/guardrails/guardrail_hooks/litellm_content_filter/patterns.json | Adds 9 new patterns (6 Canadian federal/provincial and 3 UofT institutional). ca_ohip mandatory version code may miss real OHIP numbers; uoft_utorid pattern remains very broad despite keyword safeguard; uoft_tcard 16-digit pattern conflicts with payment-card patterns in combined guardrail. |
| policy_templates.json | Adds canadian-pii-protection template with 6 sub-guardrails; correctly tagged with PIPEDA and FIPPA; us_phone reuse for Canadian phone detection is explained in the description; template structure is consistent with existing templates. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_ca_patterns.py | Unit tests for all 6 new Canadian patterns; covers positive/negative cases well; compact format tests are appropriately included as negative tests; no network calls — pure regex pattern testing. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_ca_policy_e2e.py | E2E tests for the Canadian policy; test_visa_card_masked uses a generic "REDACTED" assertion rather than pattern-specific label which reduces test precision; test_ohip_compact_masked tests a format without separators but OHIP without version code is not tested. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_uoft_patterns.py | Solid unit tests for all 3 UofT patterns; boundary conditions (min/max digits and letters) are covered; no network calls. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_uoft_policy_e2e.py | Good E2E tests for UofT patterns; test_student_card_no_longer_triggers_tcard documents deliberate exclusion of "student card" keyword from TCard; assertion that UOFT_TCARD_REDACTED is absent is correct but doesn't verify the raw number also passes through. |
| ui/litellm-dashboard/src/data/canadianPiiCompliancePrompts.ts | Comprehensive UI compliance prompts covering all 9 new patterns with pass/fail cases; framework metadata for PIPEDA and FIPPA is included; prompts align with the E2E test cases. |
| ui/litellm-dashboard/src/data/compliancePrompts.ts | Minimal change: imports and spreads canadianPiiCompliancePrompts; adds two frameworkMeta entries for PIPEDA and FIPPA frameworks; change is additive and non-breaking. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Incoming LLM Request Text] --> B{ca-pii-government-identifiers\nca_sin + passport_canada}
B --> C{ca-pii-health-drivers\nca_ohip + ca_on_drivers_licence}
C --> D{ca-pii-immigration\nca_immigration_doc}
D --> E{ca-pii-financial\nca_bank_account + credit_card\n+ visa + mastercard + amex + iban}
E --> F{ca-pii-contact-information\nemail + us_phone + ca_postal_code}
F --> G{ca-pii-institutional-ids\nuoft_student_id + uoft_utorid + uoft_tcard}
G --> H[Masked Output to LLM]
subgraph PerPattern [Per-Pattern Engine Logic]
KW[keyword_pattern match?] -->|No| PASS[Pass through unchanged]
KW -->|Yes| REX[regex pattern match?]
REX -->|No| PASS
REX -->|Yes| MASK[Apply redaction label]
end
A -.->|Each sub-guardrail applies this logic per pattern| PerPattern
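The per-pattern logic in the flowchart (keyword gate, then regex, then mask) can be sketched in a few lines. This is a simplified illustration and not the engine's code: `mask_with_keyword_gate` is a hypothetical helper, and the case-insensitive keyword search and the `[LABEL_REDACTED]` output format are assumptions. The real engine may also apply proximity rules that the flowchart does not show.

```python
import re


def mask_with_keyword_gate(text: str, pattern: str, keyword_pattern: str, label: str) -> str:
    """Hypothetical helper: mask `pattern` matches only if a keyword is present."""
    # KW step: no keyword anywhere in the text means pass through unchanged.
    if re.search(keyword_pattern, text, flags=re.IGNORECASE) is None:
        return text
    # REX + MASK steps: replace every regex match with a redaction label.
    return re.sub(pattern, f"[{label}_REDACTED]", text)


SIN_PATTERN = r"\b\d{3}[\-\s]\d{3}[\-\s]\d{3}\b"
SIN_KEYWORDS = r"\b(?:SIN|social\s*insurance\s*number|social\s*insurance)\b"

with_keyword = mask_with_keyword_gate(
    "My SIN is 123-456-789", SIN_PATTERN, SIN_KEYWORDS, "CA_SIN"
)
without_keyword = mask_with_keyword_gate(
    "Order 123-456-789 shipped", SIN_PATTERN, SIN_KEYWORDS, "CA_SIN"
)

print(with_keyword)     # My SIN is [CA_SIN_REDACTED]
print(without_keyword)  # Order 123-456-789 shipped
```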
Comments Outside Diff (4)
-
litellm/proxy/guardrails/guardrail_hooks/litellm_content_filter/patterns.json, line 22-27
ca_ohip version code is mandatory, so the common bare 10-digit format is missed
Real OHIP cards frequently display only the 10-digit number without the two-letter version code (the version code is on the back of the card and often omitted in forms, medical records, and user input). The current pattern \\b\\d{4}[\\-\\s]?\\d{3}[\\-\\s]?\\d{3}[\\-\\s]?[A-Z]{2}\\b requires [A-Z]{2}, and the test test_without_version_code_not_matched explicitly confirms this exclusion.
This is a meaningful coverage gap: a message like "My OHIP number is 3456 789 012" (a common user input format) would not be detected at all.
Making the version code optional would broaden coverage at the cost of slightly more false positives from 10-digit numbers.
If the intentional design is to require the version code (to avoid false positives from phone numbers, account numbers, etc.), that trade-off should be explicitly noted in the description field so future maintainers understand why 10-digit-only OHIP numbers are deliberately excluded.
-
litellm/proxy/guardrails/guardrail_hooks/litellm_content_filter/patterns.json, line 58-63
ca_postal_code regex has inconsistent character class encoding for positions 3 and 5
The first-position character class [ABCEGHJ-NPRSTVXY] uses explicit letter enumeration (no range ending in X/Y), intentionally excluding W and Z as first characters per Canada Post spec.
The third and fifth position character classes [ABCEGHJ-NPRSTV-Z] use the range V-Z, which silently includes W (ASCII 87, between V and X) and Z (ASCII 90). While W and Z are valid in positions 3 and 5 per the Canada Post spec, the use of a range here makes the inclusion of W non-obvious to reviewers and maintainers who might otherwise assume the same exclusion rules apply to all positions.
Consider making the character class for positions 3/5 as explicit as position 1, to make the difference deliberate and readable.
Or at minimum add a note to the description field explaining that W and Z are valid in positions 3 and 5 but not position 1.
-
tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_ca_policy_e2e.py, line 882-897
test_visa_card_masked uses an imprecise assertion
The assertion assert "REDACTED" in output only verifies that some redaction occurred, not that the Visa pattern specifically triggered. If a future change caused the credit_card generic pattern to suppress the visa pattern (or vice versa), this test would still pass while the wrong label was applied.
Since setup_canadian_guardrail() includes both credit_card and visa patterns, 4111111111111111 (a canonical Visa test number) could be labeled [CREDIT_CARD_REDACTED] or [VISA_REDACTED] depending on execution order. A more robust assertion would accept either expected label explicitly:
    assert "[VISA_REDACTED]" in output or "[CREDIT_CARD_REDACTED]" in output
    assert "4111111111111111" not in output
This makes the test's intent clear and will catch cases where neither pattern triggered.
-
tests/test_litellm/proxy/guardrails/guardrail_hooks/content_filter/test_uoft_policy_e2e.py, line 1372-1383
test_student_card_no_longer_triggers_tcard only checks label absence, not that the number passes through
The assertion only verifies that the UOFT_TCARD_REDACTED label is not applied. It does not assert that the raw 16-digit number 4111111111111111 is still present in the output (i.e., that it was NOT masked by any other pattern in the UofT guardrail).
In the current setup this is incidentally fine: the UofT guardrail has no other pattern that would match a 16-digit all-digit string without a TCard keyword. But a stronger assertion would make the test's purpose clear and guard against future regressions:
    assert "UOFT_TCARD_REDACTED" not in output
    assert "4111111111111111" in output  # number should pass through unmasked
    assert output == text  # entire message should be unchanged
Last reviewed commit: 5d0ea2e
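For the ca_postal_code comment above, the effect of the V-Z range can be checked directly by enumerating which uppercase letters each character class accepts. `THIRD_FIFTH_EXPLICIT` is a hypothetical explicit spelling of the class; the other two classes are copied from the comment.

```python
import re
import string

FIRST_POS = re.compile(r"[ABCEGHJ-NPRSTVXY]")
THIRD_FIFTH_RANGE = re.compile(r"[ABCEGHJ-NPRSTV-Z]")
# Hypothetical explicit form: same letters, no trailing range.
THIRD_FIFTH_EXPLICIT = re.compile(r"[ABCEGHJ-NPRSTVWXYZ]")


def accepted(char_class: re.Pattern) -> set:
    """Return the set of uppercase ASCII letters the class matches."""
    return {ch for ch in string.ascii_uppercase if char_class.fullmatch(ch)}


# Letters allowed in positions 3/5 but not in position 1.
extra_in_later_positions = sorted(accepted(THIRD_FIFTH_RANGE) - accepted(FIRST_POS))
# The range form and the explicit form accept exactly the same letters.
same_letters = accepted(THIRD_FIFTH_RANGE) == accepted(THIRD_FIFTH_EXPLICIT)

print(extra_in_later_positions)  # ['W', 'Z']
print(same_letters)              # True
```

Spelling the class out changes no behaviour; it only makes the W/Z difference visible in the source.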
| { | ||
| "name": "ca_sin", | ||
| "display_name": "SIN (Canadian Social Insurance Number)", | ||
| "pattern": "\\b\\d{3}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b", |
ca_sin allows invalid leading-zero SINs
Canadian Social Insurance Numbers with a first digit of 0 are never assigned by Service Canada. The current pattern \\b\\d{3}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b would match numbers like 000-000-000 or 012-345-678, which can never be real SINs. If a keyword like "SIN" appears in proximity to any 0XX-XXX-XXX numeric string (e.g., an order number or reference code), this will produce a false-positive redaction.
Restricting the first digit to [1-9] tightens the match to actually-assigned number ranges:
| "pattern": "\\b\\d{3}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b", | |
| "pattern": "\\b[1-9]\\d{2}[\\-\\s]\\d{3}[\\-\\s]\\d{3}\\b", |
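A quick check of the suggested change, using only the two patterns quoted in this comment and made-up sample values:

```python
import re

ORIGINAL_SIN = re.compile(r"\b\d{3}[\-\s]\d{3}[\-\s]\d{3}\b")
TIGHTENED_SIN = re.compile(r"\b[1-9]\d{2}[\-\s]\d{3}[\-\s]\d{3}\b")

candidates = ["000-000-000", "012-345-678", "123-456-789"]

original_accepts = [c for c in candidates if ORIGINAL_SIN.fullmatch(c)]
tightened_accepts = [c for c in candidates if TIGHTENED_SIN.fullmatch(c)]

print(original_accepts)   # ['000-000-000', '012-345-678', '123-456-789']
print(tightened_accepts)  # ['123-456-789']
```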
| "category": "Canadian PII Patterns", | ||
| "description": "Detects Canadian Social Insurance Numbers (9-digit federal identifier, dashed or spaced format)", | ||
| "keyword_pattern": "\\b(?:SIN|social\\s*insurance|social\\s*insurance\\s*number|numéro\\s*d'assurance\\s*sociale|NAS)\\b", | ||
| "allow_word_numbers": true |
allow_word_numbers inconsistent with other Canadian patterns
All five other new Canadian patterns set "allow_word_numbers": false, but ca_sin sets it to true. allow_word_numbers: true causes the engine to additionally scan for spelled-out number sequences (e.g., "one two three four five six seven eight nine") and attempt to match them against the regex after converting them to digits. For a SIN detector this is an unusual choice and appears inconsistent with the rest of the Canadian pattern set. If intentional, a comment in the description would clarify the reasoning; if unintentional, it should be set to false to match the other patterns.
| "allow_word_numbers": true | |
| "allow_word_numbers": false |
| { | ||
| "name": "ca_immigration_doc", | ||
| "display_name": "Immigration Document (Canada IRCC)", | ||
| "pattern": "\\b(?:\\d{4}[\\-\\s]?\\d{4}[\\-\\s]?\\d{2}|[TUFW]\\d{8,10}|IMM[\\-\\s]?\\d{4,5})\\b", |
ca_immigration_doc UCI sub-pattern is overly broad
The first alternative \\d{4}[\\-\\s]?\\d{4}[\\-\\s]?\\d{2} matches any 10-digit sequence, with or without separators at the XXXX-XXXX-XX positions. This overlaps with common non-immigration identifiers: credit-card-like partial numbers, order/reference numbers, unformatted 10-digit phone numbers, and numbers formatted as 1234-5678-90.
While the keyword_pattern provides proximity context, the engine returns early if any keyword match exists anywhere in the text — meaning a paragraph that mentions "immigration" even once will activate matching for every XXXX-XXXX-XX number in that paragraph. A tighter expression (e.g., requiring the separator to be present) would reduce false positives:
| "pattern": "\\b(?:\\d{4}[\\-\\s]?\\d{4}[\\-\\s]?\\d{2}|[TUFW]\\d{8,10}|IMM[\\-\\s]?\\d{4,5})\\b", | |
| "pattern": "\\b(?:\\d{4}[\\-\\s]\\d{4}[\\-\\s]\\d{2}|[TUFW]\\d{8,10}|IMM[\\-\\s]?\\d{4,5})\\b", |
Making the separators in the UCI sub-pattern mandatory (removing ?) restricts matches to numbers explicitly written in XXXX-XXXX-XX form, which is the canonical UCI display format.
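The difference between the two UCI alternatives can be seen in isolation. The sketch below extracts only the first alternative from each version of the pattern; the sample values are made up.

```python
import re

# First alternative only, before and after the suggested change.
UCI_OPTIONAL_SEP = re.compile(r"\b\d{4}[\-\s]?\d{4}[\-\s]?\d{2}\b")
UCI_REQUIRED_SEP = re.compile(r"\b\d{4}[\-\s]\d{4}[\-\s]\d{2}\b")

samples = ["1234-5678-90", "1234 5678 90", "1234567890", "4165551234"]

optional_hits = [s for s in samples if UCI_OPTIONAL_SEP.fullmatch(s)]
required_hits = [s for s in samples if UCI_REQUIRED_SEP.fullmatch(s)]

print(optional_hits)  # all four samples
print(required_hits)  # ['1234-5678-90', '1234 5678 90']
```

Requiring separators excludes bare 10-digit runs, but any number already written as XXXX-XXXX-XX still matches, so the keyword context remains necessary.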
| "pattern_type": "prebuilt", | ||
| "pattern_name": "amex", | ||
| "action": "MASK" |
us_phone pattern used in a Canadian PII template
The ca-pii-contact-information guardrail references the us_phone prebuilt pattern. While this is functionally correct — Canadian numbers use the same North American Numbering Plan (NANP) as US numbers — it is surprising to see a US-named pattern inside a template explicitly branded as "Canadian PII Protection (PIPEDA)". Users configuring this template may not realise they are activating a US-labelled pattern, which could create confusion during audits or compliance reviews.
Consider either documenting this in the guardrail_info.description (e.g., "Masks North American (NANP) phone numbers, covering both Canada and the US") or requesting a ca_phone alias pattern that maps to the same regex as us_phone.
Adds:
- test_ca_patterns.py: 30 unit tests for regex pattern matching (SIN, OHIP,
driver's licence, immigration docs, bank account, postal code)
- test_ca_policy_e2e.py: 27 end-to-end tests running the full
ContentFilterGuardrail pipeline with MASK action — validates detection
of real PII and pass-through of clean prompts
- canadianPiiCompliancePrompts.ts: 21-prompt compliance dataset for UI
evaluation, wired into the main compliancePrompts framework
Fixes keyword_pattern alternation ordering in patterns.json — longer
alternatives (e.g. "social insurance number") now precede shorter ones
("social insurance") to avoid excessive gap-word count when the regex
engine selects the shorter match first.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
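The alternation-ordering fix described in this commit relies on how backtracking regex engines such as Python's `re` behave: at a given position the first alternative that succeeds wins, even if a later one would match more text. The sketch below uses a made-up sentence, and `gap_words` is a hypothetical helper that counts the words between the keyword match and the number; the engine's actual gap-word logic is not shown in this PR thread.

```python
import re

text = "my social insurance number is 123-456-789"

shorter_first = re.compile(r"\b(?:social\s*insurance|social\s*insurance\s*number)\b")
longer_first = re.compile(r"\b(?:social\s*insurance\s*number|social\s*insurance)\b")

short_match = shorter_first.search(text)
long_match = longer_first.search(text)

number_start = text.index("123-456-789")


def gap_words(match: re.Match) -> list:
    """Hypothetical helper: words between the keyword match and the number."""
    return text[match.end():number_start].split()


print(short_match.group(0), gap_words(short_match))  # social insurance ['number', 'is']
print(long_match.group(0), gap_words(long_match))    # social insurance number ['is']
```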
…36 tests)
Add 3 UofT institutional identifiers (student/employee number, UTORid, TCard) covered under Ontario FIPPA. Includes pattern definitions, policy template sub-guardrail, compliance prompts, unit tests, and e2e tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| ) | ||
| output = result.get("texts", [])[0] | ||
| assert "[UOFT_UTORID_REDACTED]" in output | ||
| assert "smithj12" not in result |
This assertion checks the wrong variable — it tests whether "smithj12" is a key in the returned dictionary (which it never is), not whether the string appears in the masked output text. The check will always pass regardless of whether masking actually occurred.
| assert "smithj12" not in result | |
| assert "smithj12" not in output |
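The bug is easy to reproduce without the guardrail. The sketch below uses a hand-built `result` dict whose text is deliberately left unmasked; its shape follows the `result.get("texts", [])[0]` access in the test.

```python
# Simulated guardrail result in which masking did NOT happen.
result = {"texts": ["My UTORid is smithj12"]}
output = result.get("texts", [])[0]

# `in` on a dict tests its keys, so this is True regardless of the text.
passes_vacuously = "smithj12" not in result
# `in` on the string tests the content, so this correctly reports the leak.
detects_leak = "smithj12" not in output

print(passes_vacuously)  # True
print(detects_leak)      # False
```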
| def setup_canadian_guardrail(self): | ||
| """ | ||
| Setup guardrail with all Canadian PII patterns (mimics the policy template) | ||
| """ | ||
| patterns = [ | ||
| # Government identifiers | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="ca_sin", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="passport_canada", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| # Health & drivers | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="ca_ohip", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="ca_on_drivers_licence", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| # Immigration | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="ca_immigration_doc", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| # Financial | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="ca_bank_account", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="credit_card", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| # Contact info | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="email", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="ca_postal_code", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| ] | ||
|
|
||
| return ContentFilterGuardrail( | ||
| guardrail_name="canadian-pii-protection", | ||
| patterns=patterns, | ||
| ) |
The setup_canadian_guardrail() function omits the three uoft_* patterns (uoft_student_id, uoft_utorid, uoft_tcard) that are part of the ca-pii-institutional-ids sub-guardrail in the actual policy template. The file docstring claims to test "the complete policy with all Canadian PII patterns," but the University of Toronto institutional identifiers are missing from this setup.
Consider either:
- Adding the uoft_* patterns to setup_canadian_guardrail() if the intent is end-to-end testing of the full template, or
- Updating the docstring to accurately reflect the current scope (e.g., "tests the complete Canadian federal/provincial/contact PII patterns" and noting that institutional patterns are tested separately in test_uoft_policy_e2e.py).
Fix test_utorid_masked checking `result` (dict) instead of `output` (string).
Update test_ca_policy_e2e.py docstring to clarify scope vs UofT tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| "pattern_type": "prebuilt", | ||
| "pattern_name": "amex", | ||
| "action": "MASK" | ||
| }, |
Missing "PIPEDA" tag
The policy template is titled "Canadian PII Protection (PIPEDA)" and its ID is "canadian-pii-protection", but the tags array does not include "PIPEDA"; only "PII Protection", "Canada", and "FIPPA" are present. Users or tooling that filters/searches templates by tag looking for PIPEDA compliance will not find this template.
| "pattern_type": "prebuilt", | |
| "pattern_name": "amex", | |
| "action": "MASK" | |
| }, | |
| "tags": [ | |
| "PII Protection", | |
| "Canada", | |
| "PIPEDA", | |
| "FIPPA" | |
| ], |
- Tighten ca_postal_code keyword_pattern: replace broad "address" with specific compound terms (mailing/street/shipping/home address)
- Add missing "PIPEDA" tag to policy_templates.json for discoverability
- Add us_phone pattern to test_ca_policy_e2e.py setup to match deployed template
- Add phone number e2e test for complete coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile please re-review this
| "name": "uoft_utorid", | ||
| "display_name": "UTORid (University of Toronto Login)", | ||
| "pattern": "\\b[a-z]{2,6}\\d{1,4}\\b", | ||
| "category": "Canadian Institutional Identifiers (FIPPA)", | ||
| "description": "Detects University of Toronto UTORid login identifiers (e.g. smithj12). Context keywords are essential due to the generic format.", | ||
| "keyword_pattern": "\\b(?:utorid|utoid|utor\\s*id|acorn|quercus|uoft\\s*login|university\\s*of\\s*toronto\\s*login|portal\\s*login|weblogin|shibboleth)\\b", | ||
| "allow_word_numbers": false | ||
| }, |
uoft_utorid pattern is dangerously broad
The pattern \\b[a-z]{2,6}\\d{1,4}\\b matches an enormous class of common tokens: variable names (db2, str1, api42), error codes (err404), CSS classes (sm2, lg1), port names (eth0), and many others.
The design relies entirely on the keyword_pattern for correctness, but the keyword list includes terms that appear commonly in non-UTORid contexts:
- "acorn": a word in everyday English ("acorn tree", "the ACORN organization")
- "quercus": also a plant genus name appearing in biology/ecology contexts
- "id\\s*card": extremely common phrase
Once any of these keywords matches anywhere in the text, every [a-z]{2,6}\\d{1,4} token in the entire document becomes a match candidate. For example:
"The ACORN protocol uses db2 for storage and api42 for the endpoint"
Here db2 and api42 would both be masked as UTORid even though neither is a UTORid. The existing test_generic_short_word_no_context_passes test (in test_uoft_policy_e2e.py) only guards against the no-keyword case.
Consider tightening the pattern to the documented UTORid format: a surname-initial suffix structure. At minimum, the pattern could require the alphabetic prefix to be 4+ characters (matching the real minimum surname abbreviation), reducing collisions with 2-3-letter abbreviations that are ubiquitous in technical text.
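The breadth of the pattern, and the effect of requiring a longer alphabetic prefix, can be seen on the example sentence from this comment. `UTORID_LONG_PREFIX` is a hypothetical tightening along the lines suggested above, and "login smithj12" is appended here only to include one genuine-looking UTORid.

```python
import re

UTORID_BROAD = re.compile(r"\b[a-z]{2,6}\d{1,4}\b")
# Hypothetical tightening: require at least 4 letters before the digits.
UTORID_LONG_PREFIX = re.compile(r"\b[a-z]{4,6}\d{1,4}\b")

text = (
    "The ACORN protocol uses db2 for storage and api42 for the endpoint; "
    "login smithj12"
)

broad_hits = UTORID_BROAD.findall(text)
tight_hits = UTORID_LONG_PREFIX.findall(text)

print(broad_hits)  # ['db2', 'api42', 'smithj12']
print(tight_hits)  # ['smithj12']
```

A longer prefix only reduces collisions: tokens such as `http2` or `port80` would still match, so keyword quality still matters.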
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="ca_bank_account", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="credit_card", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| # Contact info | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="email", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="us_phone", | ||
| action=ContentFilterAction.MASK, | ||
| ), | ||
| ContentFilterPattern( | ||
| pattern_type="prebuilt", | ||
| pattern_name="ca_postal_code", | ||
| action=ContentFilterAction.MASK, | ||
| ), |
Financial card patterns missing from test setup
setup_canadian_guardrail() includes credit_card but omits visa, mastercard, amex, and iban, all of which are present in the actual policy template's ca-pii-financial guardrail definition (policy_templates.json lines 2608–2626). There are also no E2E test cases that verify detection of Visa, Mastercard, Amex, or IBAN numbers.
As a result, if any of those four patterns were misconfigured or changed in a way that breaks them within the Canadian template, no test would catch it.
Consider adding:
- The four missing patterns to the setup_canadian_guardrail() patterns list:
    ContentFilterPattern(
        pattern_type="prebuilt",
        pattern_name="visa",
        action=ContentFilterAction.MASK,
    ),
    ContentFilterPattern(
        pattern_type="prebuilt",
        pattern_name="mastercard",
        action=ContentFilterAction.MASK,
    ),
    ContentFilterPattern(
        pattern_type="prebuilt",
        pattern_name="amex",
        action=ContentFilterAction.MASK,
    ),
    ContentFilterPattern(
        pattern_type="prebuilt",
        pattern_name="iban",
        action=ContentFilterAction.MASK,
    ),
- At least one E2E test asserting that a Visa/Mastercard number in text is masked by the Canadian guardrail.
| { | ||
| "name": "uoft_tcard", | ||
| "display_name": "TCard Number (University of Toronto Campus ID)", | ||
| "pattern": "\\b\\d{16}\\b", | ||
| "category": "Canadian Institutional Identifiers (FIPPA)", | ||
| "description": "Detects University of Toronto TCard campus ID card numbers (16-digit). Context keywords are critical to distinguish from payment card numbers.", | ||
| "keyword_pattern": "\\b(?:tcard\\s*number|t-card\\s*number|campus\\s*card\\s*number|campus\\s*id\\s*number|library\\s*card\\s*number|student\\s*card\\s*number|id\\s*card\\s*number|tcard|t-card|campus\\s*card|campus\\s*id|library\\s*card|student\\s*card|id\\s*card|university\\s*card)\\b", | ||
| "allow_word_numbers": false | ||
| } |
uoft_tcard will mislabel credit card numbers as TCard in combined template
The uoft_tcard pattern \\b\\d{16}\\b is structurally identical to the 16-digit Visa/Mastercard format (Amex numbers are 15 digits and do not match it). In the canadian-pii-protection policy template, ca-pii-institutional-ids (containing uoft_tcard) and ca-pii-financial (containing visa, mastercard, amex) are applied as separate guardrails processed sequentially.
If ca-pii-institutional-ids runs before ca-pii-financial, a credit card number that appears in text alongside any of the TCard keywords ("student card", "id card", "campus card", etc.) will be redacted as [UOFT_TCARD_REDACTED] instead of the appropriate payment-card redaction label. The keyword list for uoft_tcard includes very common terms:
"student\\s*card\\s*number|id\\s*card\\s*number|student\\s*card"
For example:
"Please update your student card: 4111111111111111"
→ masked as [UOFT_TCARD_REDACTED] (incorrect) instead of [VISA_REDACTED] (expected).
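The structural overlap can be confirmed with widely used card-network test numbers. This sketch checks only the shape of the uoft_tcard regex; it says nothing about the order in which the guardrails run.

```python
import re

TCARD = re.compile(r"\b\d{16}\b")

test_numbers = {
    "visa": "4111111111111111",        # 16 digits
    "mastercard": "5555555555554444",  # 16 digits
    "amex": "378282246310005",         # 15 digits
}

tcard_overlap = {name: bool(TCARD.fullmatch(num)) for name, num in test_numbers.items()}

print(tcard_overlap)  # {'visa': True, 'mastercard': True, 'amex': False}
```

Any 16-digit payment card number is therefore a TCard candidate once a TCard keyword is present, which is why the keyword list needs to stay UofT-specific.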
The description acknowledges the overlap ("Context keywords are critical to distinguish from payment card numbers"), but in a combined multi-guardrail template this distinction is no longer reliable. Consider:
- Documenting the processing order of the guardrails in the template description, or
- Excluding well-known credit-card-adjacent keywords (e.g., "student\\s*card" in payment contexts) from the uoft_tcard keyword_pattern
…coverage
- ca_sin: reject leading-zero SINs ([1-9]\d{2}), set allow_word_numbers to false
- ca_immigration_doc: require separators in UCI pattern (prevent bare \d{10} match)
- uoft_utorid: qualify generic keywords (acorn -> acorn login, quercus -> quercus login)
- uoft_tcard: remove generic keywords (student card, id card, library card) that
overlap with credit card contexts; keep only UofT-specific terms (tcard, campus card)
- Add visa/mastercard/amex/iban patterns to test_ca_policy_e2e.py setup to match
deployed template; add Visa card masking test
- Add test verifying "student card" no longer triggers TCard redaction
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: Add Canadian PII protection patterns and PIPEDA-compliant policy template
Adds 6 new Canadian PII regex pattern detectors to patterns.json:
- ca_sin: Social Insurance Number (PIPEDA Privacy Act, Income Tax Act)
- ca_ohip: Ontario Health Insurance Plan Number (PHIPA)
- ca_on_drivers_licence: Ontario driver's licence (HTA, PIPEDA)
- ca_immigration_doc: IRCC immigration docs (UCI, work/study permits, IMM refs)
- ca_bank_account: Canadian bank account routing (transit-institution-account)
- ca_postal_code: Canadian postal code (Canada Post spec)
Adds comprehensive policy template 'canadian-pii-protection' (id: canadian-pii-protection)
with 5 sub-guardrails grouping patterns by data type. All patterns include contextual
keyword matching (English + French keywords where applicable) to reduce false positives.
Complements existing passport_canada pattern.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* feat: Add Canadian PII compliance dataset and tests (57 tests)
Adds:
- test_ca_patterns.py: 30 unit tests for regex pattern matching (SIN, OHIP,
driver's licence, immigration docs, bank account, postal code)
- test_ca_policy_e2e.py: 27 end-to-end tests running the full
ContentFilterGuardrail pipeline with MASK action — validates detection
of real PII and pass-through of clean prompts
- canadianPiiCompliancePrompts.ts: 21-prompt compliance dataset for UI
evaluation, wired into the main compliancePrompts framework
Fixes keyword_pattern alternation ordering in patterns.json — longer
alternatives (e.g. "social insurance number") now precede shorter ones
("social insurance") to avoid excessive gap-word count when the regex
engine selects the shorter match first.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* feat: Add University of Toronto FIPPA identifier patterns and tests (36 tests)
Add 3 UofT institutional identifiers (student/employee number, UTORid, TCard)
covered under Ontario FIPPA. Includes pattern definitions, policy template
sub-guardrail, compliance prompts, unit tests, and e2e tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: Correct test assertion bug and inaccurate docstring
Fix test_utorid_masked checking `result` (dict) instead of `output` (string).
Update test_ca_policy_e2e.py docstring to clarify scope vs UofT tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: Address Greptile review feedback
- Tighten ca_postal_code keyword_pattern: replace broad "address" with
specific compound terms (mailing/street/shipping/home address)
- Add missing "PIPEDA" tag to policy_templates.json for discoverability
- Add us_phone pattern to test_ca_policy_e2e.py setup to match deployed template
- Add phone number e2e test for complete coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: Tighten patterns to reduce false positives and add missing test coverage
- ca_sin: reject leading-zero SINs ([1-9]\d{2}), set allow_word_numbers to false
- ca_immigration_doc: require separators in UCI pattern (prevent bare \d{10} match)
- uoft_utorid: qualify generic keywords (acorn -> acorn login, quercus -> quercus login)
- uoft_tcard: remove generic keywords (student card, id card, library card) that
overlap with credit card contexts; keep only UofT-specific terms (tcard, campus card)
- Add visa/mastercard/amex/iban patterns to test_ca_policy_e2e.py setup to match
deployed template; add Visa card masking test
- Add test verifying "student card" no longer triggers TCard redaction
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
Summary
Adds 6 new Canadian PII regex pattern detectors and a comprehensive PIPEDA-compliant policy template to the LiteLLM content filter guardrails:
New Patterns (patterns.json):
- ca_sin: Social Insurance Numbers (PIPEDA Privacy Act)
- ca_ohip: Ontario Health Insurance Plan numbers (PHIPA)
- ca_on_drivers_licence: Ontario driver's licences
- ca_immigration_doc: IRCC immigration documents (UCI, work/study permits, IMM refs)
- ca_bank_account: Canadian bank account routing (transit-institution-account format)
- ca_postal_code: Canadian postal codes (excludes invalid chars per Canada Post spec)
New Policy Template (policy_templates.json):
- canadian-pii-protection: 5 sub-guardrails grouping patterns by data type (government IDs, health/drivers, immigration, financial, contact)
All patterns include contextual keyword matching with bilingual support (English + French keywords) to reduce false positives. The template complements the existing passport_canada pattern for comprehensive Canadian PII protection.
Type
🆕 New Feature