fix: greedy verb-chain extraction (#27) — 0.1.4-alpha#28

Merged: Aaronontheweb merged 2 commits into dev from fix/27-greedy-verb-extraction on May 12, 2026

Conversation

@Aaronontheweb
Owner

Summary

Closes #27.

Replaces the static BashArity lookup table with a "stop at non-verb-like token" heuristic. The parser walks consecutive verb-like Word tokens from the start of each clause, transparently consuming flag-with-value pairs, and stops at the first non-verb-like token. Known FILE verbs (cat, ls, bash, cd, chmod, grep, find, …) keep a 1-token verb chain so per-verb positional-arg classification still fires for bare-name targets like cat README and ln src dst.

The new BashVerbs.IsVerbLikeToken predicate is a strict allow-list: Word kind, length 1–64, leading [a-z], body [a-z0-9._-]. Using an allow-list rather than a negation of LooksLikePath keeps the predicate conservative for unknown shapes: it rejects flags, paths, env-var refs, URLs, globs, and uppercase user-named identifiers without per-case predicate logic.
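The allow-list rule can be illustrated with a short Python sketch. It is not the library's C# implementation: the function name `is_verb_like`, the string-valued `kind` parameter, and the single-regex formulation are assumptions made for illustration, and only the character and length rules come from the description above.

```python
import re

# Hypothetical mirror of the stated rule: 1-64 characters in total,
# a leading [a-z], and every remaining character drawn from [a-z0-9._-].
_VERB_LIKE = re.compile(r"[a-z][a-z0-9._-]{0,63}")


def is_verb_like(text: str, kind: str = "Word") -> bool:
    """Return True only for Word tokens that fit the allow-list shape."""
    return kind == "Word" and _VERB_LIKE.fullmatch(text) is not None


if __name__ == "__main__":
    samples = ["git", "my-pod", "s3", "--status", "./src", "$HOME",
               "https://example.com", "*.txt", "InitialCreate", "README"]
    for sample in samples:
        print(f"{sample!r:24} {is_verb_like(sample)}")
```

Because the rule lists what is accepted rather than what is rejected, any token shape nobody anticipated falls on the rejecting side by default.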

Clause.Verb is now documented as a convenience hint, not a security contract. Consumers needing security-grade matching should pattern-prefix match against the raw token stream. The deliberate over-extraction on bare-word args (git push origin main → [git, push, origin, main]) is the security-correct default for auto-proposed approval patterns: a subsequent variation re-prompts rather than silently auto-grants.
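A minimal sketch of what pattern-prefix matching against the raw token stream could look like on the consumer side. The helper name `matches_prefix` and the use of plain string lists for tokens are assumptions for illustration; SPEC.md §6.1.1 is the authoritative guidance.

```python
from typing import Sequence


def matches_prefix(pattern: Sequence[str], raw_tokens: Sequence[str]) -> bool:
    """True when raw_tokens begins with every token of pattern, in order."""
    if len(raw_tokens) < len(pattern):
        return False
    return list(raw_tokens[:len(pattern)]) == list(pattern)


if __name__ == "__main__":
    # Pattern auto-proposed from the over-extracted chain.
    approved = ["git", "push", "origin", "main"]
    print(matches_prefix(approved, "git push origin main".split()))  # same command
    print(matches_prefix(approved, "git push origin prod".split()))  # variation

    # A shorter pattern, as the old [git, push] chain would have proposed.
    broad = ["git", "push"]
    print(matches_prefix(broad, "git push origin prod".split()))
```

With the longer pattern, the `git push origin prod` variation fails to match and would prompt again. The shorter `[git, push]` pattern would have matched it silently.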

Behavior changes (examples):

  • git push origin main → [git, push, origin, main] (was [git, push])
  • git worktree list and arbitrary CLI subcommand chains → fully extracted
  • freshdesk ticket list --status open → [freshdesk, ticket, list] (was [freshdesk])
  • kubectl get pods my-pod → [kubectl, get, pods, my-pod]
  • aws s3 cp src dst → [aws, s3, cp, src, dst]
  • dotnet ef migrations add InitialCreate → [dotnet, ef, migrations, add] (stops at uppercase)
  • cat README → still [cat] (FileVerb carveout preserves IsPath)
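The examples above can be reproduced with a simplified Python sketch of the greedy walk. It is an illustration under stated assumptions, not the code in BashCommandParser.ParseClauseSegment: tokens are plain whitespace-split words with no quoting, `FILE_VERBS` is an abbreviated stand-in for the real table (`ln` is inferred from the `ln src dst` example), and the `git -C` entry in `FLAGS_WITH_VALUE` is hypothetical.

```python
import re
from typing import Dict, FrozenSet, List

VERB_LIKE = re.compile(r"[a-z][a-z0-9._-]{0,63}")

# Assumed, abbreviated tables; the real ones live in BashVerbs.
FILE_VERBS: FrozenSet[str] = frozenset(
    {"cat", "ls", "bash", "cd", "chmod", "grep", "find", "ln"})
FLAGS_WITH_VALUE: Dict[str, FrozenSet[str]] = {
    "git": frozenset({"-C"}),  # hypothetical entry for illustration
}


def extract_verb_chain(words: List[str]) -> List[str]:
    """Walk verb-like words from the clause start; stop at the first other token."""
    if not words or not VERB_LIKE.fullmatch(words[0]):
        return []
    first = words[0]
    if first in FILE_VERBS:
        return [first]  # FileVerb carveout: keep a 1-token chain
    flags = FLAGS_WITH_VALUE.get(first, frozenset())  # looked up once per clause
    chain = [first]
    i = 1
    while i < len(words):
        word = words[i]
        if word.startswith("-"):
            eq = word.find("=")
            name = word if eq < 0 else word[:eq]
            if name not in flags:
                break  # unknown flag: not verb-like, so the walk stops
            i += 1 if eq >= 0 else 2  # skip the flag, plus its value if separate
            continue
        if not VERB_LIKE.fullmatch(word):
            break
        chain.append(word)
        i += 1
    return chain


if __name__ == "__main__":
    for command in ["git push origin main",
                    "freshdesk ticket list --status open",
                    "dotnet ef migrations add InitialCreate",
                    "cat README",
                    "git -C /repo push origin"]:
        print(command, "->", extract_verb_chain(command.split()))
```

In this sketch a flag that is not in the per-verb table ends the walk, which is why `--status` terminates the `freshdesk` chain, while a listed flag such as the hypothetical `git -C` is skipped together with its value.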

Scope:

  • Removes BashVerbs.BashArity table + ProbeArity() method.
  • Adds BashVerbs.IsVerbLikeToken(BashToken) predicate.
  • Rewrites verb-extraction loop in BashCommandParser.ParseClauseSegment (greedy walk + FileVerb 1-token carveout + flag-with-value consumption).
  • SPEC.md updates: §3 VerbChain, §4 grammar, §6.1 (full rewrite), new §6.1.1 consumer pattern-matching guidance, §7 flag-with-value note, §12 worked examples, §15 versioning, §16 sequencing.
  • 7 new corpus entries (132–138); 11 existing entries flipped to new shape; 8 unit tests updated.
  • Version: 0.1.3-alpha → 0.1.4-alpha.

Test plan

  • dotnet build -c Release clean (0 warnings, 0 errors)
  • dotnet test -c Release — 394 passed, 0 failed
  • pwsh ./scripts/Add-FileHeaders.ps1 -Verify passes
  • dotnet pack -c Release -o ./bin/nuget produces ShellSyntaxTree.0.1.4-alpha.nupkg
  • Public API surface unchanged (VerbChain / Clause shape locked)

Replace the static BashArity lookup table with a "stop at non-verb-like
token" heuristic. The parser walks consecutive verb-like Word tokens
from the start of each clause, transparently consuming flag-with-value
pairs, and stops at the first non-verb-like token. Known FILE verbs
(cat, ls, bash, cd, chmod, grep, find, ...) keep a 1-token verb chain
so per-verb positional-arg classification still fires for bare-name
targets like `cat README` and `ln src dst`.

The new IsVerbLikeToken predicate is a strict allow-list: Word kind,
length 1-64, leading [a-z], body [a-z0-9._-]. This naturally rejects
flags, paths, env-var refs, URLs, globs, and uppercase user-named
identifiers without per-case predicate logic.

Clause.Verb is now documented as a convenience hint, not a security
contract. Consumers needing security-grade matching should pattern-
prefix match against the raw token stream; the deliberate over-
extraction on bare-word args (`git push origin main` ->
[git, push, origin, main]) is the security-correct default for
auto-proposed approval patterns.

Removes BashArity table and ProbeArity() method entirely. SPEC.md
gets a full rewrite of section 6.1 (verb-chain extraction), a new
6.1.1 (consumer pattern-matching guidance), updated grammar and
worked examples, and a versioning note acknowledging that pre-v0.1.0
alphas may include behavior course-corrections.

Corpus: 7 new entries (132-138) cover the issue's headline cases;
11 existing entries flip to the new shape. 8 unit tests in
BashCommandParserTests updated. All 394 tests pass; clean build,
headers verified, packs as ShellSyntaxTree.0.1.4-alpha.

Drop the `quotedFirstVerb` flag and gate the walk solely on `firstVerb is
not null` — the QuotedString branch no longer needs a sentinel because
it falls through with `firstVerb == null` and the loop short-circuits.

Cache the inner `HashSet<string>` from `FlagsWithValue` once into
`flagsForVerb` instead of re-hashing `firstVerb` on every flag-token
iteration. Inline the `=`-position scan via a single `IndexOf('=')` call
so the `--flag=value` short-circuit doesn't traverse the flag string
twice; this lets us delete the now-unused `StripEqualsValue` and
`HasInlineEqualsValue` helpers.
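For readers unfamiliar with the single-scan idea, here is a rough Python analogue of splitting a `--flag=value` token with one search for `=`. The helper name `split_flag` is hypothetical, and the actual change inlines the equivalent `IndexOf('=')` call in C# rather than adding a helper.

```python
from typing import Optional, Tuple


def split_flag(token: str) -> Tuple[str, Optional[str]]:
    """Split on the first '=' using a single scan of the token."""
    eq = token.find("=")  # one pass; the index serves both the test and the slice
    if eq < 0:
        return token, None  # no inline value
    return token[:eq], token[eq + 1:]


if __name__ == "__main__":
    print(split_flag("--status=open"))
    print(split_flag("--status"))
```

The same index answers both questions the two removed helpers answered separately: whether an inline value exists, and where the flag name ends.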

Trim the 20-line block comment at the top of `ParseClauseSegment` to
just the load-bearing invariants (FileVerb carveout + ordering with
flag-with-value consumption). Drop mid-loop comments that narrate the
control flow. Add a 4-element capacity hint to `verbTokens` to avoid
the first realloc on the typical case.

In `BashVerbs.cs`, revert the `FileVerbs` remarks paragraph that leaked
parser usage details and rewrite the `IsVerbLikeToken` doc to capture
the WHY of the strict allow-list (vs. negation-of-LooksLikePath)
without re-listing all the rejection categories.

All 394 tests still pass; no behavior change.
@Aaronontheweb Aaronontheweb merged commit f06dcef into dev May 12, 2026
2 checks passed
@Aaronontheweb Aaronontheweb deleted the fix/27-greedy-verb-extraction branch May 12, 2026 03:31

Development

Successfully merging this pull request may close these issues.

Stop verb-chain extraction at non-verb-like tokens
