fix: PII scrubber no longer mistakes the generic word 「公司」 ("company") for a company name #25
Merged
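A real pilot user said 「我的工作價值對於公司代表什麼」 ("What does the value of my work mean to the company?"). The scrubber treated a sentence fragment plus the generic word 「公司」 as a company name and redacted it to [Company H], so the bot's quoted reply came out nonsensical.

Reviewed by Gemini and Codex, decided by Claude (legal-suffix-only approach):
- Chinese company names match only formal legal suffixes (股份有限公司 / 有限責任公司 / 有限公司); bare 「公司」 is dropped. Chinese has no word segmentation, so a regex cannot tell a real company name from a sentence fragment.
- English drops the global re.I (company names are recognized by their initial capital, and re.I let "the company" match); the suffix group becomes inline (?i:...), and the bare Company suffix is removed.

Constraint: pilot hotfix; stay regex-only (no NER)
Rejected: industry-keyword whitelist (「科技公司」 "tech company" and 「顧問公司」 "consulting company" are themselves generic categories) + negative lookbehind (whack-a-mole, and Python restricts lookbehind to fixed width)
Directive: accept missing informal forms (e.g. 「台積電公司」); pii.py is not a compliance-grade line of defense, and the consent copy already asks users to remove PII themselves

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>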
This was referenced May 18, 2026
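Summary
Pilot hotfix (follow-up to #23 and #24). A real pilot user said 「我的工作價值對於公司代表什麼」 ("What does the value of my work mean to the company?"), and the PII scrubber redacted the whole span 「工作價值對於公司」 to [Company H], so the bot quoted it back as 「[Company H]代表什麼」 ("What does [Company H] mean?"). 「公司」 ("company") is a generic word, not a company name.
Root cause
The Chinese branch of COMPANY_RE, [一-鿿]{2,8}(?:公司|股份有限公司), matches bare 「公司」, so it swallows whatever sentence fragment happens to sit in front of 「公司」. The English branch used a global re.I, which lets [A-Z] match lowercase letters, and its suffix list included bare Company, so "the company" and "my company" also matched.
Fix (reviewed by Gemini and Codex; Claude ruled for legal-suffix-only)
- Chinese: match only formal legal suffixes (股份有限公司 / 有限責任公司 / 有限公司) and drop bare 「公司」.
- English: remove the global re.I (company names are recognized by their initial capital), make the suffix group inline (?i:...), and remove the bare Company suffix.
Rejected (proposed by Gemini, BLOCKed by Codex, BLOCK upheld by Claude): an industry-keyword whitelist (「科技公司」 "tech company" and 「顧問公司」 "consulting company" are themselves generic categories, so it does not improve precision) + negative lookbehind (whack-a-mole).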
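The before/after behavior can be sketched roughly as follows. This is a minimal reconstruction, not the code in pii.py: only the Chinese "before" branch is quoted in this PR, so the English name pattern, the English suffix list (Inc / Corp / Ltd / LLC), and the names COMPANY_RE_BEFORE, COMPANY_RE_AFTER, and find_companies are assumptions made for illustration. The scoped flag (?i:...) needs Python 3.6 or later.

```python
import re

# Chinese branch. "Before" is the pattern quoted in the root-cause section;
# "after" keeps only the formal legal suffixes named in the fix.
ZH_BEFORE = r"[\u4e00-\u9fff]{2,8}(?:公司|股份有限公司)"
ZH_AFTER = r"[\u4e00-\u9fff]{2,8}(?:股份有限公司|有限責任公司|有限公司)"

# English branch. ASSUMPTION: the real name pattern and suffix list are not
# shown in the PR; these are illustrative stand-ins.
EN_NAME = r"[A-Z][A-Za-z0-9&]*(?:\s+[A-Z][A-Za-z0-9&]*)*"
EN_BEFORE = EN_NAME + r"\s+(?:Inc\.?|Corp\.?|Ltd\.?|LLC|Company)"
# After: no bare Company suffix; only the suffix group is case-insensitive.
EN_AFTER = EN_NAME + r"\s+(?i:Inc\.?|Corp\.?|Ltd\.?|LLC)"

# Before: a global re.I makes [A-Z] match lowercase letters too.
COMPANY_RE_BEFORE = re.compile("|".join([ZH_BEFORE, EN_BEFORE]), re.I)
# After: no global flag, so a name must start with a real capital letter.
COMPANY_RE_AFTER = re.compile("|".join([ZH_AFTER, EN_AFTER]))


def find_companies(text, pattern):
    """Return every substring of `text` that `pattern` would redact."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    samples = [
        "我的工作價值對於公司代表什麼",
        "the company",
        "my company",
        "台積電股份有限公司",
        "Apple Inc.",
    ]
    for text in samples:
        before = find_companies(text, COMPANY_RE_BEFORE)
        after = find_companies(text, COMPANY_RE_AFTER)
        print(f"{text!r}: before={before} after={after}")
```

Under these assumptions, the generic sentence and the lowercase phrases match only the "before" pattern, while the two formal names still match the "after" pattern.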
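Trade-offs
Informal forms (「台積電公司」, "Google company") are knowingly missed. pii.py is an interview pre-processing step, not a compliance-grade line of defense, and the consent copy already asks users to remove PII themselves. The UX and data damage from over-redacting generic words outweighs the privacy cost of under-redacting.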
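Verification
Four new tests: the generic word 「公司」 is not redacted, formal company names are still redacted, English company names are still redacted, and "the company" / "my company" no longer match.
Sanity:
對於公司 → [], 台積電股份有限公司 → redact, the company → [], Apple Inc. → redact.
Review chain
Gemini (multiple proposals) + Codex (engineering robustness) → the two disagreed on the approach → Claude (Architect) ruled for Codex's approach → Codex implemented it → Claude verified independently (all 275 tests green).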
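The four new tests listed under Verification might look roughly like the sketch below. The test names, the find_companies helper, and the English part of the pattern are hypothetical; the actual test file is not shown in this PR. The functions use plain asserts, so they would run under pytest or by calling them directly.

```python
import re

# ASSUMPTION: a stand-in for the post-fix COMPANY_RE. The Chinese suffixes come
# from this PR; the English name pattern and suffix list are illustrative only.
COMPANY_RE = re.compile(
    r"[\u4e00-\u9fff]{2,8}(?:股份有限公司|有限責任公司|有限公司)"
    r"|[A-Z][A-Za-z0-9&]*(?:\s+[A-Z][A-Za-z0-9&]*)*"
    r"\s+(?i:Inc\.?|Corp\.?|Ltd\.?|LLC)"
)


def find_companies(text):
    """Hypothetical helper: list the spans the scrubber would redact."""
    return [m.group(0) for m in COMPANY_RE.finditer(text)]


def test_generic_word_is_not_redacted():
    assert find_companies("我的工作價值對於公司代表什麼") == []
    assert find_companies("對於公司") == []


def test_formal_chinese_name_is_still_redacted():
    assert find_companies("台積電股份有限公司") == ["台積電股份有限公司"]


def test_english_name_is_still_redacted():
    assert find_companies("Apple Inc.") == ["Apple Inc."]


def test_lowercase_company_phrases_do_not_match():
    assert find_companies("the company") == []
    assert find_companies("my company") == []
```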
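Notes
Pilot hotfix: after merging, it must be deployed to the VPS (vm.2ch.tw, git pull + restart).
Residual: a run of Chinese context directly in front of a formal company name can still be swallowed (e.g. 「我在台積電股份有限公司」 also takes 「我在」, "I am at"). This is an acceptable, minor over-eat that is outside the scope of this bug; to be revisited later.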
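The residual over-eat can be reproduced with a short sketch. It assumes the post-fix Chinese branch is the CJK character class followed by the three legal suffixes; the name ZH_COMPANY_RE is hypothetical. Because the prefix class accepts any 2 to 8 CJK characters, the regex has no way to know where the sentence ends and the company name begins.

```python
import re

# ASSUMPTION: stand-in for the post-fix Chinese branch of COMPANY_RE.
ZH_COMPANY_RE = re.compile(
    r"[\u4e00-\u9fff]{2,8}(?:股份有限公司|有限責任公司|有限公司)"
)

text = "我在台積電股份有限公司"
match = ZH_COMPANY_RE.search(text)

# The matched span starts at 「我在」, not at 「台積電」: the leading context
# is absorbed into the prefix class along with the real name.
print(match.group(0))
```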
🤖 Generated with Claude Code