
_is_sensitive() still drops legitimate source files: word-boundary fix from #436 wasn't applied + keyword filter shouldn't apply to source code extensions #718


@jippi

Follow-up to #436. The fix in commit `4738e88` removed the full-path check (`str(path)`), so directory names no longer trip the filter. However, the filename keyword regex was left unchanged, and the word-boundary fix proposed by the original reporter was never applied. Two failure classes remain.

Two remaining failure modes

detect.py line 36:

re.compile(r'(credential|secret|passwd|password|token|private_key)', re.IGNORECASE),

This is still substring-matched against `path.name`. Combined with the fact that it applies to all files (regardless of extension), it silently drops these from detect() and therefore from the entire graph:

| Filename | Substring matched | Code or secret? |
| --- | --- | --- |
| password-reset.ts | password | Code (email-template helper) |
| AuthOauthAccessToken.model.ts | token | Code (Sequelize model) |
| test.search-tokenizer.ts | token (in tokenizer) | Code |
| password_manager.py | password | Code (likely) |
| JwtTokenValidator.java | token | Code |

These are real cases from a 1,873-file SvelteKit codebase. There are no actual secrets in any of these files — they're source code that happens to mention auth/token/password concepts in their filenames.

Why #436's fix didn't fully resolve this

The original reporter's proposal was:

re.compile(r'\b(credential|secret|passwd|password|token|private_key)\b', re.IGNORECASE),

The actual fix in `4738e88` was to remove `or p.search(full)` from `_is_sensitive`, so directory paths no longer trip the filter. The word-boundary refinement to the regex itself was not included.

Word boundaries would help with some cases (tokenizer.ts, AuthOauthAccessToken.model.ts, daniel-ai-secretary) but not filenames where the keyword IS a standalone word (password-reset.ts, access-token.ts, jwt-token-validator.ts).
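The difference between the two regexes can be checked directly. This is a minimal sketch using only the standard `re` module; the names `SUBSTRING`, `BOUNDED`, and `results` are illustrative and not taken from detect.py. Note that `\b` treats `_` as a word character, so snake_case names such as password_manager.py also stop matching once boundaries are added.

```python
import re

KEYWORDS = r'(credential|secret|passwd|password|token|private_key)'

# Current behaviour: plain substring match on the filename.
SUBSTRING = re.compile(KEYWORDS, re.IGNORECASE)
# Proposed in #436: the same keywords wrapped in word boundaries.
BOUNDED = re.compile(r'\b' + KEYWORDS + r'\b', re.IGNORECASE)

names = [
    'tokenizer.ts',
    'AuthOauthAccessToken.model.ts',
    'password_manager.py',
    'password-reset.ts',
    'access-token.ts',
    'jwt-token-validator.ts',
]

# Map each filename to (substring match?, word-bounded match?).
results = {
    name: (bool(SUBSTRING.search(name)), bool(BOUNDED.search(name)))
    for name in names
}

for name, (substring_hit, bounded_hit) in results.items():
    print(f"{name}: substring={substring_hit} bounded={bounded_hit}")
```

The first three names stop matching with boundaries; the last three still match, because `-` and `.` are not word characters and the keyword stands alone.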

Root insight

The keyword filter is the only generic pattern in _SENSITIVE_PATTERNS. Every other entry is highly specific:

  • \.env|envrc → extension/dotfile
  • \.(pem|key|p12|pfx|cert|crt|der|p8)$ → cryptographic key extensions
  • id_rsa|id_dsa|id_ecdsa|id_ed25519 → SSH key names
  • \.netrc|\.pgpass|\.htpasswd → exact credential-store names
  • aws_credentials|gcloud_credentials|service.account → cloud credential file names

These specific patterns target data files that store raw secrets. The keyword pattern is over-broad because it applies to all files including source code. Source code is, by definition, code — not credential storage. If a project commits secrets into .ts/.py/etc., that's a different security problem and graphify is the wrong line of defense.

Suggested fix

Two changes. First, skip the generic keyword check for source files in _is_sensitive:

def _is_sensitive(path: Path) -> bool:
    """Return True if this file likely contains secrets and should be skipped."""
    # Skip the generic keyword check for known source-code extensions —
    # source files are code, not credential storage. The dotfile/cert/SSH-key
    # patterns still apply (a file like 'foo.pem' is sensitive even if it
    # accidentally has a code-like name).
    name = path.name
    is_code = path.suffix.lower() in CODE_EXTENSIONS or _shebang_file_type(path) == FileType.CODE
    for pattern in _SENSITIVE_PATTERNS:
        if pattern is _GENERIC_KEYWORD_PATTERN and is_code:
            continue
        if pattern.search(name):
            return True
    return False

Plus restructure _SENSITIVE_PATTERNS so the keyword pattern is identifiable (named constant or split into its own variable), and add word boundaries to it for the cases where it does apply:

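_GENERIC_KEYWORD_PATTERN = re.compile(
    # Optional plural "s": a bare \bsecret\b would not match "secrets.json",
    # because the trailing "s" is a word character.
    r'\b(credentials?|secrets?|passwd|passwords?|tokens?|private_key)\b',
    re.IGNORECASE,
)
_SENSITIVE_PATTERNS = [
    # ... extension/exact-name patterns unchanged ...
    _GENERIC_KEYWORD_PATTERN,
    # ...
]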

Combined effect:

  • tokenizer.ts — passes (word boundaries reject \btoken\b inside tokenizer)
  • daniel-ai-secretary/foo.ts — passes (already fixed by #436, which stopped directory names from tripping the filter)
  • password-reset.ts — passes (code extension, keyword filter skipped)
  • JwtTokenValidator.java — passes (code extension)
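  • secrets.json — still flagged (data file + word-bounded match, provided the pattern allows the plural "secrets"; a bare \bsecret\b does not match it)
  • database-credentials.yml — still flagged (data file + word-bounded match, with the same plural caveat for "credentials")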
  • .env, .env.local, id_rsa, aws.pem — still flagged (specific patterns unchanged)
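The list above can be checked with a self-contained sketch. It rests on several assumptions: `CODE_EXTENSIONS` here is a small stand-in set rather than graphify's real one, the shebang check is omitted, only three of the specific patterns are reproduced (copied from the summary under "Root insight", flags unknown), and the name `is_sensitive_sketch` is hypothetical. The generic pattern allows an optional plural `s`, since a bare `\b(secret|credential)\b` would not match `secrets.json` or `database-credentials.yml`.

```python
import re
from pathlib import PurePath

# Stand-in for graphify's CODE_EXTENSIONS (assumption: the real set is larger).
CODE_EXTENSIONS = {'.ts', '.py', '.java', '.js', '.go', '.rs'}

# Generic keyword pattern: word boundaries plus optional plural "s".
GENERIC_KEYWORD_PATTERN = re.compile(
    r'\b(credentials?|secrets?|passwd|passwords?|tokens?|private_key)\b',
    re.IGNORECASE,
)

# Subset of the specific patterns, as summarized in this issue.
SPECIFIC_PATTERNS = [
    re.compile(r'\.env|envrc'),
    re.compile(r'\.(pem|key|p12|pfx|cert|crt|der|p8)$'),
    re.compile(r'id_rsa|id_dsa|id_ecdsa|id_ed25519'),
]


def is_sensitive_sketch(path: PurePath) -> bool:
    """Specific patterns always apply; the keyword pattern skips code files."""
    name = path.name
    if any(p.search(name) for p in SPECIFIC_PATTERNS):
        return True
    if path.suffix.lower() in CODE_EXTENSIONS:
        return False
    return bool(GENERIC_KEYWORD_PATTERN.search(name))


candidates = [
    'tokenizer.ts', 'daniel-ai-secretary/foo.ts', 'password-reset.ts',
    'JwtTokenValidator.java', 'secrets.json', 'database-credentials.yml',
    '.env', '.env.local', 'id_rsa', 'aws.pem',
]
for candidate in candidates:
    print(f"{candidate}: sensitive={is_sensitive_sketch(PurePath(candidate))}")
```

Checking the specific patterns first keeps a file such as `foo.pem` sensitive regardless of how its extension is classified, which matches the intent of the comment in the suggested `_is_sensitive`.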

Optional additional observation

The original reporter also flagged that users get no warning when files are dropped as sensitive. The skipped count is in detected['skipped_sensitive'] but never surfaced to stdout. On a repo where the heuristic misfires, this is silent data loss, which is exactly what made our case hard to diagnose (we had to compare collect_files() output to the manifest to notice the gap). Worth a separate enhancement: log a one-line warning at the end of detect() when at least N files, or N% of files, are skipped as sensitive, so users can audit.
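A rough sketch of what such a warning could look like. The helper name `sensitive_skip_warning`, its thresholds, and the message wording are all hypothetical. It also assumes `detected['skipped_sensitive']` holds a list of skipped paths, which this issue does not confirm; if it is a plain count, the preview part would have to go.

```python
from typing import Optional, Sequence


def sensitive_skip_warning(
    skipped: Sequence[str],
    total_files: int,
    min_count: int = 1,
    min_ratio: float = 0.01,
    preview: int = 5,
) -> Optional[str]:
    """Return a one-line warning if enough files were skipped, else None.

    Warns when the skipped count reaches min_count OR the skipped share
    reaches min_ratio.
    """
    count = len(skipped)
    ratio = count / total_files if total_files else 0.0
    if count == 0 or (count < min_count and ratio < min_ratio):
        return None
    shown = ', '.join(skipped[:preview])
    more = f' (+{count - preview} more)' if count > preview else ''
    return (
        f'warning: {count} of {total_files} files ({ratio:.1%}) skipped as '
        f'sensitive: {shown}{more}'
    )


# Illustrative input: file names borrowed from the reproducer in this issue.
msg = sensitive_skip_warning(
    ['password-reset.ts', 'AuthOauthAccessToken.model.ts', 'access-token.ts'],
    total_files=1873,
)
if msg:
    print(msg)
```

Listing the first few skipped names in the message lets a user spot a misfire without diffing file lists by hand.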

Reproducer

from pathlib import Path
import tempfile
from graphify.detect import _is_sensitive

with tempfile.TemporaryDirectory() as d:
    d = Path(d)
    for name in ('password-reset.ts', 'AuthOauthAccessToken.model.ts',
                 'access-token.ts', 'JwtTokenValidator.java'):
        f = d / name
        f.write_text('export const x = 1')
        print(f"{name}: sensitive={_is_sensitive(f)}")

# Output (current behaviour):
#   password-reset.ts: sensitive=True
#   AuthOauthAccessToken.model.ts: sensitive=True
#   access-token.ts: sensitive=True
#   JwtTokenValidator.java: sensitive=True

All four are source code; none should be flagged.

Real-world impact

On a 1,873-file SvelteKit codebase, 3 files were silently dropped, including a Sequelize model that's referenced 30+ times across the codebase. Larger projects with auth/payment/identity domains could plausibly hit dozens.

Environment

  • graphifyy 0.7.5
  • Python 3.14
  • macOS
