Follow-up to #436. The fix in commit `4738e88` removed the full-path check against `str(path)`, so directory names no longer trip the filter. However, the filename keyword regex was left unchanged, and the word-boundary fix proposed by the original reporter was never applied. Two failure classes remain.
Two remaining failure modes
The keyword pattern at `detect.py` line 36 is still substring-matched against `path.name`. Combined with the fact that it applies to all files (regardless of extension), it silently drops these from `detect()` and therefore from the entire graph:
| Filename | Substring matched | Code or secret? |
| --- | --- | --- |
| `password-reset.ts` | `password` | Code (email-template helper) |
| `AuthOauthAccessToken.model.ts` | `token` | Code (Sequelize model) |
| `test.search-tokenizer.ts` | `token` (in `tokenizer`) | Code |
| `password_manager.py` | `password` | Code (likely) |
| `JwtTokenValidator.java` | `token` | Code |
These are real cases from a 1,873-file SvelteKit codebase. There are no actual secrets in any of these files — they're source code that happens to mention auth/token/password concepts in their filenames.
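Why #436's fix didn't fully resolve this
The original reporter proposed adding word boundaries to the keyword regex. The actual fix in `4738e88` was to remove `or p.search(full)` from `_is_sensitive`, so directory paths no longer trip the filter. The word-boundary refinement to the regex itself was not included.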
Word boundaries would help with some cases (tokenizer.ts, AuthOauthAccessToken.model.ts, daniel-ai-secretary) but not filenames where the keyword IS a standalone word (password-reset.ts, access-token.ts, jwt-token-validator.ts).
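To make the distinction concrete, here is a minimal sketch comparing a substring match with a word-bounded one. The keyword list and both regexes are assumptions for illustration (the real pattern lives in `detect.py` and is not reproduced here), and `matches` is a hypothetical helper.

```python
import re

# Assumed keyword list, for illustration only; not copied from detect.py.
SUBSTRING = re.compile(r"(secret|token|password|credential)", re.IGNORECASE)
# Same keywords with word boundaries and an optional plural "s".
BOUNDED = re.compile(r"\b(secret|token|password|credential)s?\b", re.IGNORECASE)


def matches(pattern: re.Pattern, name: str) -> bool:
    """Return True if the pattern is found anywhere in the filename."""
    return pattern.search(name) is not None


names = [
    "test.search-tokenizer.ts",       # keyword embedded in a longer word
    "AuthOauthAccessToken.model.ts",  # keyword embedded in a CamelCase name
    "password-reset.ts",              # keyword is a standalone word
    "access-token.ts",                # keyword is a standalone word
]

for name in names:
    print(f"{name:32} substring={matches(SUBSTRING, name)} bounded={matches(BOUNDED, name)}")
```

In Python's `re`, `\b` treats letters, digits and `_` as word characters, so the bounded pattern stops matching inside `tokenizer` and `AccessToken` but still matches `password-reset.ts` and `access-token.ts`, where `-` and `.` delimit the keyword.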
Root insight
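The keyword filter is the only generic pattern in `_SENSITIVE_PATTERNS`. Every other entry is highly specific:
- `\.env|envrc` → extension/dotfile
- `\.(pem|key|p12|pfx|cert|crt|der|p8)$` → cryptographic key extensions
- `id_rsa|id_dsa|id_ecdsa|id_ed25519` → SSH key names
- `\.netrc|\.pgpass|\.htpasswd` → exact credential-store names
- `aws_credentials|gcloud_credentials|service.account` → cloud credential file names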
These specific patterns target data files that store raw secrets. The keyword pattern is over-broad because it applies to all files including source code. Source code is, by definition, code — not credential storage. If a project commits secrets into .ts/.py/etc., that's a different security problem and graphify is the wrong line of defense.
Suggested fix
Two changes in _is_sensitive:

    def _is_sensitive(path: Path) -> bool:
        """Return True if this file likely contains secrets and should be skipped."""
        # Skip the generic keyword check for known source-code extensions:
        # source files are code, not credential storage. The dotfile/cert/SSH-key
        # patterns still apply (a file like 'foo.pem' is sensitive even if it
        # accidentally has a code-like name).
        name = path.name
        is_code = (
            path.suffix.lower() in CODE_EXTENSIONS
            or _shebang_file_type(path) == FileType.CODE
        )
        for pattern in _SENSITIVE_PATTERNS:
            if pattern is _GENERIC_KEYWORD_PATTERN and is_code:
                continue
            if pattern.search(name):
                return True
        return False

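Plus restructure `_SENSITIVE_PATTERNS` so the keyword pattern is identifiable (named constant or split into its own variable), and add word boundaries to it for the cases where it does still apply.
Combined effect:
- `tokenizer.ts`: passes (word boundaries reject `\btoken\b` inside `tokenizer`)
- `daniel-ai-secretary/foo.ts`: passes (already fixed by #436)
- `password-reset.ts`: passes (code extension, keyword filter skipped)
- `JwtTokenValidator.java`: passes (code extension)
- `secrets.json`: still flagged (data file + word-bounded match on `secret`)
- `database-credentials.yml`: still flagged (data file + word-bounded match)
- `.env`, `.env.local`, `id_rsa`, `aws.pem`: still flagged (specific patterns unchanged)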
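The combined behaviour can be checked with a self-contained sketch along the following lines. Several things here are assumptions rather than the project's real code: the `CODE_EXTENSIONS` set is a small stand-in, the keyword list and its optional plural `s` are guesses, the specific patterns are copied from the summary in this issue rather than from `detect.py`, and the shebang check is omitted so the example does not need to read files.

```python
import re
from pathlib import Path

# Hypothetical stand-in for the project's real CODE_EXTENSIONS.
CODE_EXTENSIONS = {".ts", ".js", ".py", ".java", ".go", ".rs"}

# Generic keyword pattern as a named constant, with word boundaries.
# The keyword list and the optional plural "s" are assumptions.
_GENERIC_KEYWORD_PATTERN = re.compile(
    r"\b(secret|token|password|credential)s?\b", re.IGNORECASE
)

# Specific patterns as summarised in this issue (not the verbatim source).
_SPECIFIC_PATTERNS = [
    re.compile(r"\.env|envrc"),
    re.compile(r"\.(pem|key|p12|pfx|cert|crt|der|p8)$"),
    re.compile(r"id_rsa|id_dsa|id_ecdsa|id_ed25519"),
    re.compile(r"\.netrc|\.pgpass|\.htpasswd"),
    re.compile(r"aws_credentials|gcloud_credentials|service.account"),
]

_SENSITIVE_PATTERNS = [*_SPECIFIC_PATTERNS, _GENERIC_KEYWORD_PATTERN]


def is_sensitive(path: Path) -> bool:
    """Return True if the filename looks like credential storage."""
    name = path.name
    is_code = path.suffix.lower() in CODE_EXTENSIONS
    for pattern in _SENSITIVE_PATTERNS:
        # Source files skip only the generic keyword check.
        if pattern is _GENERIC_KEYWORD_PATTERN and is_code:
            continue
        if pattern.search(name):
            return True
    return False


for candidate in [
    "tokenizer.ts", "daniel-ai-secretary/foo.ts", "password-reset.ts",
    "JwtTokenValidator.java", "secrets.json", "database-credentials.yml",
    ".env", ".env.local", "id_rsa", "aws.pem",
]:
    print(f"{candidate:28} sensitive={is_sensitive(Path(candidate))}")
```

Comparing with `is` works because the keyword pattern is one shared compiled object, which is why the restructuring into a named constant matters.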
Optional additional observation
The original reporter also flagged that users get no warning when files are dropped as sensitive. The skipped count is in `detected['skipped_sensitive']` but is never surfaced to stdout. On a repo where the heuristic misfires, this is silent data loss, which is exactly what made our case hard to diagnose: we had to compare `collect_files()` output to the manifest to notice the gap. Worth a separate enhancement: log a one-line warning at the end of `detect()` when N files or N% of files are skipped as sensitive, so users can audit.
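One possible shape for that warning is sketched below. The function name, its thresholds and the message wording are all hypothetical. It also assumes `detected['skipped_sensitive']` holds either a count or a collection of paths, since the issue does not say which.

```python
from typing import Optional


def sensitive_skip_warning(
    detected: dict,
    total_files: int,
    min_count: int = 1,
    min_ratio: float = 0.01,
) -> Optional[str]:
    """Build a one-line warning, or return None if nothing should be reported.

    Warns when at least `min_count` files OR at least `min_ratio` of all
    files were skipped as sensitive.
    """
    skipped = detected.get("skipped_sensitive", 0)
    # Accept either a plain count or a collection of skipped paths.
    if isinstance(skipped, (list, tuple, set)):
        count = len(skipped)
    else:
        count = int(skipped)

    if count == 0 or total_files <= 0:
        return None

    ratio = count / total_files
    if count < min_count and ratio < min_ratio:
        return None

    return (
        f"warning: {count} of {total_files} files ({ratio:.1%}) "
        "were skipped as sensitive; review them if this looks wrong"
    )


message = sensitive_skip_warning({"skipped_sensitive": 3}, total_files=1873)
if message:
    print(message)
```

Returning the message instead of printing it leaves the choice of stdout, stderr or a logger to the caller at the end of `detect()`.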
Real-world impact
On a 1,873-file SvelteKit codebase: 3 files silently dropped, including a Sequelize model that's referenced 30+ times across the codebase. Larger projects with auth/payment/identity domains will hit dozens.