_is_sensitive() still drops legitimate source files: word-boundary fix from #436 wasn't applied + keyword filter shouldn't apply to source code extensions

Follow-up to #436. The fix in commit `4738e88` removed the full-path check against `str(path)`, so directory names no longer trip the filter, but the **filename** keyword regex was left unchanged and the **word-boundary fix proposed by the original reporter was never applied**. Two failure classes remain.

## Two remaining failure modes

`detect.py` line 36:

```python
re.compile(r'(credential|secret|passwd|password|token|private_key)', re.IGNORECASE),
```

This is still substring-matched against `path.name`. Because it applies to **all** files regardless of extension, it silently drops these from `detect()` and therefore from the entire graph:

| Filename | Substring matched | Code or secret? |
|---|---|---|
| `password-reset.ts` | `password` | Code (email-template helper) |
| `AuthOauthAccessToken.model.ts` | `token` | Code (Sequelize model) |
| `test.search-tokenizer.ts` | `token` (in `tokenizer`) | Code |
| `password_manager.py` | `password` | Code (likely) |
| `JwtTokenValidator.java` | `token` | Code |

These are real cases from a 1,873-file SvelteKit codebase. There are no actual secrets in any of these files — they're source code that happens to mention auth/token/password concepts in their filenames.

## Why #436's fix didn't fully resolve this

The original reporter's proposal was:

```python
re.compile(r'\b(credential|secret|passwd|password|token|private_key)\b', re.IGNORECASE),
```

The actual fix in `4738e88` was to remove `or p.search(full)` from `_is_sensitive`, so directory paths no longer trip the filter. The word-boundary refinement to the regex itself was not included.

Word boundaries would help with some cases (`tokenizer.ts`, `AuthOauthAccessToken.model.ts`, `daniel-ai-secretary`) but **not** filenames where the keyword IS a standalone word (`password-reset.ts`, `access-token.ts`, `jwt-token-validator.ts`).
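
The split can be checked with a few lines of standalone Python. This is a minimal sketch that uses only the two regexes quoted in this issue and does not import graphify; the helper name `matches` is made up for illustration.

```python
import re

KEYWORDS = r'(credential|secret|passwd|password|token|private_key)'

# Current behaviour: plain substring match on the filename.
substring = re.compile(KEYWORDS, re.IGNORECASE)
# Proposal from #436: the same keywords wrapped in word boundaries.
bounded = re.compile(r'\b' + KEYWORDS + r'\b', re.IGNORECASE)


def matches(pattern: re.Pattern, name: str) -> bool:
    """Return True if the pattern matches anywhere in the filename."""
    return pattern.search(name) is not None


names = [
    'tokenizer.ts',
    'AuthOauthAccessToken.model.ts',
    'password_manager.py',
    'password-reset.ts',
    'access-token.ts',
    'jwt-token-validator.ts',
]

for name in names:
    print(f"{name}: substring={matches(substring, name)} "
          f"bounded={matches(bounded, name)}")
```

Every name matches the substring pattern. With word boundaries, the first three stop matching and the last three still match. Note that `_` is a word character in Python's `re`, which is why `password_manager.py` escapes the bounded pattern; the same would hold for any snake_case filename, code or data.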

## Root insight

The keyword filter is the only generic pattern in `_SENSITIVE_PATTERNS`. Every other entry is highly specific:

- `\.env|envrc` → extension/dotfile
- `\.(pem|key|p12|pfx|cert|crt|der|p8)$` → cryptographic key extensions
- `id_rsa|id_dsa|id_ecdsa|id_ed25519` → SSH key names
- `\.netrc|\.pgpass|\.htpasswd` → exact credential-store names
- `aws_credentials|gcloud_credentials|service.account` → cloud credential file names

These specific patterns target **data files** that store raw secrets. The keyword pattern is over-broad because it applies to **all** files including source code. Source code is, by definition, code — not credential storage. If a project commits secrets into `.ts`/`.py`/etc., that's a different security problem and graphify is the wrong line of defense.

## Suggested fix

Two changes in `_is_sensitive`:

```python
def _is_sensitive(path: Path) -> bool:
    """Return True if this file likely contains secrets and should be skipped."""
    # Skip the generic keyword check for known source-code extensions —
    # source files are code, not credential storage. The dotfile/cert/SSH-key
    # patterns still apply (a file like 'foo.pem' is sensitive even if it
    # accidentally has a code-like name).
    name = path.name
    is_code = path.suffix.lower() in CODE_EXTENSIONS or _shebang_file_type(path) == FileType.CODE
    for pattern in _SENSITIVE_PATTERNS:
        if pattern is _GENERIC_KEYWORD_PATTERN and is_code:
            continue
        if pattern.search(name):
            return True
    return False
```

Plus restructure `_SENSITIVE_PATTERNS` so the keyword pattern is identifiable (a named constant or its own variable), and add word boundaries to it for the cases where it does apply. A strictly word-bounded `secret` or `credential` would not match `secrets.json` or `database-credentials.yml`, because the trailing `s` is a word character, so the pattern also needs to accept plural forms:

```python
_GENERIC_KEYWORD_PATTERN = re.compile(
    r'\b(credentials?|secrets?|passwd|passwords?|tokens?|private_key)\b',
    re.IGNORECASE,
)
_SENSITIVE_PATTERNS = [
    # ... extension/exact-name patterns unchanged ...
    _GENERIC_KEYWORD_PATTERN,
    # ...
]
```

Combined effect:
- `tokenizer.ts`: passes (code extension; word boundaries also reject `token` inside `tokenizer`)
- `daniel-ai-secretary/foo.ts`: passes (already fixed by #436)
- `password-reset.ts`: passes (code extension, keyword filter skipped)
- `JwtTokenValidator.java`: passes (code extension)
- `secrets.json`: still flagged (data file + word-bounded match on `secrets`)
- `database-credentials.yml`: still flagged (data file + word-bounded match on `credentials`)
- `.env`, `.env.local`, `id_rsa`, `aws.pem`: still flagged (specific patterns unchanged)
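
The combined effect can be checked in isolation. The following is a minimal sketch, not graphify's actual code: `CODE_EXTENSIONS` here is a small hypothetical subset, the specific patterns are approximations of the entries listed under "Root insight", the shebang check is omitted, and the keyword pattern is assumed to accept plural forms (a strictly word-bounded `secret` does not match `secrets.json`).

```python
import re
from pathlib import Path

# Hypothetical subset; the real CODE_EXTENSIONS in graphify is assumed to be larger.
CODE_EXTENSIONS = {'.ts', '.js', '.py', '.java'}

# Assumption: plurals accepted so that 'secrets.json' still matches.
GENERIC_KEYWORD_PATTERN = re.compile(
    r'\b(credentials?|secrets?|passwd|passwords?|tokens?|private_key)\b',
    re.IGNORECASE,
)

# Approximations of the specific patterns quoted in this issue.
SENSITIVE_PATTERNS = [
    re.compile(r'(\.env|envrc)', re.IGNORECASE),
    re.compile(r'\.(pem|key|p12|pfx|cert|crt|der|p8)$', re.IGNORECASE),
    re.compile(r'(id_rsa|id_dsa|id_ecdsa|id_ed25519)'),
    GENERIC_KEYWORD_PATTERN,
]


def is_sensitive_sketch(path: Path) -> bool:
    """Apply specific patterns to every file, the keyword pattern only to non-code files."""
    name = path.name
    is_code = path.suffix.lower() in CODE_EXTENSIONS
    for pattern in SENSITIVE_PATTERNS:
        if pattern is GENERIC_KEYWORD_PATTERN and is_code:
            continue  # source files are exempt from the generic keyword check
        if pattern.search(name):
            return True
    return False


for name in ('tokenizer.ts', 'password-reset.ts', 'JwtTokenValidator.java',
             'secrets.json', 'database-credentials.yml',
             '.env', '.env.local', 'id_rsa', 'aws.pem'):
    print(f"{name}: sensitive={is_sensitive_sketch(Path(name))}")
```

Under these assumptions the three source files print `sensitive=False` and the remaining six print `sensitive=True`. The identity check (`pattern is ...`) only works if the keyword pattern is a single shared object, which is the reason for pulling it out into a named constant.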

## Optional additional observation

The original reporter also flagged that **users get no warning when files are dropped as sensitive**. The skipped count is in `detected['skipped_sensitive']` but never surfaced to stdout. On a repo where the heuristic misfires, this is silent data loss, which is exactly what made our case hard to diagnose (we had to compare `collect_files()` output to the manifest to notice the gap). Worth a separate enhancement: log a one-line warning at the end of `detect()` when more than N files, or more than N% of files, are skipped as sensitive, so users can audit.
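
One possible shape for that warning, as a rough sketch only. The helper name `sensitive_skip_warning` and both thresholds are hypothetical, and since the exact type of `detected['skipped_sensitive']` is not shown here, the sketch accepts either a count or a list of skipped paths.

```python
from typing import Optional, Sequence, Union


def sensitive_skip_warning(
    skipped: Union[int, Sequence[str]],
    total_files: int,
    min_count: int = 1,       # hypothetical threshold: absolute number of files
    min_ratio: float = 0.01,  # hypothetical threshold: fraction of all files
) -> Optional[str]:
    """Return a one-line warning, or None if neither threshold is reached."""
    count = skipped if isinstance(skipped, int) else len(skipped)
    if total_files <= 0 or count == 0:
        return None
    ratio = count / total_files
    if count < min_count and ratio < min_ratio:
        return None
    return (
        f"warning: {count} of {total_files} files ({ratio:.1%}) were skipped "
        "as sensitive; check detected['skipped_sensitive'] to audit"
    )


message = sensitive_skip_warning(3, 1873)
if message is not None:
    print(message)
```

With the defaults above, any non-zero skip count produces a warning; raising `min_count` would keep the output quiet for repos that skip only a handful of genuine credential files.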

## Reproducer

```python
from pathlib import Path
import tempfile
from graphify.detect import _is_sensitive

with tempfile.TemporaryDirectory() as d:
    d = Path(d)
    for name in ('password-reset.ts', 'AuthOauthAccessToken.model.ts',
                 'access-token.ts', 'JwtTokenValidator.java'):
        f = d / name
        f.write_text('export const x = 1')
        print(f"{name}: sensitive={_is_sensitive(f)}")

# Output (current behaviour):
#   password-reset.ts: sensitive=True
#   AuthOauthAccessToken.model.ts: sensitive=True
#   access-token.ts: sensitive=True
#   JwtTokenValidator.java: sensitive=True
```

All four are source code; none should be flagged.

## Real-world impact

On a 1,873-file SvelteKit codebase: 3 files silently dropped, including a `Sequelize` model that's referenced 30+ times across the codebase. Larger projects with auth/payment/identity domains will hit dozens.

## Environment

- graphifyy 0.7.5
- Python 3.14
- macOS

