fix: _is_sensitive silently drops topic notes (token-economics-of-recall.md flagged as a secret) #1169
Closed
edudatsuj45 wants to merge 1 commit into
…f-recall.md

The generic keyword patterns (credential/secret/password/token) flagged any filename containing the keyword as a standalone word, silently dropping prose documents whose descriptive slug merely mentions the topic:

token-economics-of-recall.md -> skipped as sensitive (a note ABOUT tokens)
password-policy-discussion.md -> skipped as sensitive

Follow-up to the Graphify-Labs#436 -> Graphify-Labs#718 -> Graphify-Labs#920 lineage: the remaining failure class is keyword-as-topic-word in multi-word descriptive filenames.

Fix: split the generic keyword patterns out of _SENSITIVE_PATTERNS and only count a match when the keyword is load-bearing in the name:

- the keyword ends the stem (api_token.txt, github-personal-access-token.txt, oauth_token.json) - secret stores name their contents, and the content noun is the head of the compound, which comes last; or
- the stem has <= 2 words (token.txt, token_config.yaml, secret_handler.txt).

A keyword buried mid-phrase in a >= 3-word slug is a topic word, not a credential store. Specific patterns (.pem/.env/id_rsa/.netrc/aws_credentials) are unchanged and still always apply.

All existing contracts preserved: api_token.txt, oauth_token.json, token.txt, token_config.yaml, secret_handler.txt, passwords.py, credentials.json still flagged; tokenizer.py / tokenize.py still clean.

Adds 6 regression tests including dotfile (.token), plural (tokens.txt), and multi-word-keyword (my_private_key.txt) edge cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
safishamsi added a commit that referenced this pull request Jun 7, 2026
#1170: replace nohup with cross-platform Python detach in git hooks. Git for Windows MSYS has no nohup, so post-commit/post-checkout hooks silently failed. Now uses subprocess.Popen with DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP on Windows, start_new_session=True on POSIX. Quoting-safe (argv list). Fixes #1161.

#1169: fix _is_sensitive false positives on topic-mentioning filenames. token-economics-of-recall.md and password-policy-discussion.md were silently dropped as secrets. Generic keywords (token/secret/password) now only fire when the keyword ends the filename stem or the stem is ≤2 words. Specific patterns (.env/.pem/id_rsa etc.) remain unconditional.

#1165: fix multi-word endpoint resolution in _score_nodes. graphify path "AuthService" "UserRepo" never fired the exact-match bonus because per-token comparison never equalled the full label. Now joins normalized tokens and compares against the full label and its tokenized form. O(1) per node, affects query_graph and shortest_path uniformly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
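The detach approach named in the #1170 entry can be sketched as follows. This is a minimal illustration assuming only the flags quoted in the commit message; the helper name spawn_detached and the choice to silence stdin/stdout/stderr are assumptions, not the code that landed.

```python
import subprocess
import sys


def spawn_detached(argv: list[str]) -> subprocess.Popen:
    """Start argv in the background so the calling git hook can return at once.

    Hypothetical helper: passing an argv list (no shell string) avoids quoting issues.
    """
    kwargs = {
        # Assumption: the background job should not hold the hook's terminal.
        "stdin": subprocess.DEVNULL,
        "stdout": subprocess.DEVNULL,
        "stderr": subprocess.DEVNULL,
    }
    if sys.platform == "win32":
        # These constants only exist on Windows, so they are read inside this branch.
        kwargs["creationflags"] = (
            subprocess.DETACHED_PROCESS | subprocess.CREATE_NEW_PROCESS_GROUP
        )
    else:
        # POSIX: run the child in a new session, detached from the hook's session.
        kwargs["start_new_session"] = True
    return subprocess.Popen(argv, **kwargs)
```

A hook would call something like spawn_detached([sys.executable, "-m", "some_module"]) and exit without waiting; the module name here is a placeholder.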
Collaborator
Landed in a8dbbe5. Generic keywords (token/secret/password) now only fire when the keyword ends the filename stem or the stem is ≤2 words. Specific patterns (.env/.pem/id_rsa etc.) remain unconditional. token-economics-of-recall.md passes through; tokens.txt is still caught. Thanks!
Problem
The generic keyword patterns in _SENSITIVE_PATTERNS flag any filename containing token/secret/password/credential as a standalone word, so prose documents whose descriptive slug merely mentions the topic are silently dropped from the graph:

- token-economics-of-recall.md
- password-policy-discussion.md

This is the remaining failure class after the #436 → #718 → #920 lineage. Hit in the wild: an Obsidian-style memory vault where a note on token economics vanished from the graph with no visible warning (skipped_sensitive is returned but nothing surfaces it, which is exactly the silent-data-loss failure mode described in #718's closing observation).

Fix
Split the two generic keyword patterns out of _SENSITIVE_PATTERNS and only count a match when the keyword is load-bearing in the filename:

- the keyword ends the stem (api_token.txt, oauth_token.json, github-personal-access-token.txt): secret stores name their contents, and the content noun is the head of the compound, which comes last in English; or
- the stem has ≤ 2 words (token.txt, token_config.yaml, secret_handler.txt).

A keyword buried mid-phrase in a ≥ 3-word slug is a topic word, not a credential store. The specific patterns (.pem/.env/id_rsa/.netrc/aws_credentials/_SENSITIVE_DIRS) are unchanged and still always apply.

The end-of-stem check runs before word counting, so multi-word keywords survive their own separator: my_private_key.txt is still flagged even though splitting on _ would break private_key apart. Leading dots are stripped before stem extraction so dotfiles like .token keep their keyword.

Behavior table
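| Filename | After this change |
| --- | --- |
| token-economics-of-recall.md | not flagged |
| password-policy-discussion.md | not flagged |
| api_token.txt / oauth_token.json (#920) | flagged |
| token.txt / tokens.txt / .token | flagged |
| token_config.yaml / secret_handler.txt (#920) | flagged |
| github-personal-access-token.txt | flagged |
| my_private_key.txt | flagged |
| passwords.py / credentials.json | flagged |
| tokenizer.py / tokenize.py (#718) | not flagged |
| .env / server.pem / id_rsa / .ssh/… | flagged |

Tests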
All 13 existing test_sensitive_* contracts pass unchanged; adds 6 regression tests covering the topic-slug false positive plus the dotfile, plural, end-of-long-name, and multi-word-keyword edge cases.

🤖 Generated with Claude Code
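For readers following the thread, the rule described under Fix can be sketched roughly as below. This is a minimal illustration, not the merged code: the function name generic_keyword_is_load_bearing, the keyword list, and both regexes are assumptions, since the real _SENSITIVE_PATTERNS and _is_sensitive are not quoted in this thread. It covers only the generic-keyword rule and leaves out the specific patterns (.env, .pem, id_rsa and so on), which stay unconditional.

```python
import re

# Assumed keyword set (singular/plural forms); the project's real patterns may differ.
_KEYWORD = r"(?:credentials?|secrets?|passwords?|tokens?|private[_-]?keys?)"
# Keyword at the very end of the stem, preceded by the start or by a separator.
_ENDS_WITH_KEYWORD = re.compile(rf"(?:^|[\W_]){_KEYWORD}$")
_WHOLE_WORD = re.compile(_KEYWORD)


def generic_keyword_is_load_bearing(filename: str) -> bool:
    """Return True when a generic keyword should count as a sensitive-file signal."""
    # Strip leading dots first so a dotfile such as ".token" keeps its keyword.
    name = filename.lower().lstrip(".")
    stem = name.rsplit(".", 1)[0] if "." in name else name

    # 1. End-of-stem check runs first, so "private_key" is matched before
    #    splitting on "_" could break it apart.
    if _ENDS_WITH_KEYWORD.search(stem):
        return True

    # 2. Otherwise the keyword only counts in short names (at most two words).
    words = [w for w in re.split(r"[\W_]+", stem) if w]
    return len(words) <= 2 and any(_WHOLE_WORD.fullmatch(w) for w in words)
```

Under these assumptions, api_token.txt and token_config.yaml are flagged, while token-economics-of-recall.md and tokenizer.py are not, which matches the behavior described above.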