Stop verb-chain extraction at non-verb-like tokens #27

@Aaronontheweb

Problem

Clause.Verb currently relies on BashArity table lookups (default 2-token, with hand-curated 3-token entries for docker compose and bun run). This breaks down in three ways:

  1. Multi-token CLI subcommands without table entries are truncated. freshdesk ticket list --status open, git worktree list, kubectl get pods, aws s3 cp, dotnet ef migrations add all extract a too-short verb chain because their root verbs aren't in the table at the appropriate arity.

  2. The table approach can't scale. Custom/private CLIs (freshdesk and any internal company tool) and the long tail of cloud-CLI subcommand groups will never appear in any curated table. Maintenance cost is unbounded.

  3. Even with a complete table, the git X Y ambiguity is structurally undecidable. git push origin main (3-token chain git push origin?) vs git worktree list (3-token chain git worktree list) look identical to the parser without per-CLI semantic knowledge. No syntactic rule disambiguates branch names from subcommand names.

Proposed change

Parser side: improve the heuristic

Replace BashArity-table lookup with a "stop at non-verb-like token" heuristic. A token is "verb-like" if it's a bare identifier (lowercase letters + hyphens + dots, no special chars, not too long) and not a flag, not a path-shape, not a value-like token (numeric, URL, env-var ref).

The walk consumes consecutive verb-like tokens from the start of the clause, treating known flag-with-value pairs (existing FlagsWithValue table) as transparent — -C /repo is consumed without breaking the verb-chain walk. Stop at the first non-verb-like token.
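
A minimal Python sketch of the walk described above, offered as an illustration rather than the project's actual implementation. The names `is_verb_like`, `extract_verb_chain`, `FLAGS_WITH_VALUE`, and `MAX_VERB_LEN` are hypothetical, as are the length cap of 32 and the single `-C` table entry. It also assumes digits are allowed after the first letter, which the `aws s3 cp` corpus case appears to require.

```python
import re
import shlex

# Hypothetical stand-in for the existing FlagsWithValue table; the real
# table's contents are not shown in this issue.
FLAGS_WITH_VALUE = {"-C"}

# Assumed cap for "not too long"; the issue does not give a number.
MAX_VERB_LEN = 32

# Bare identifier: starts with a lowercase letter, then lowercase letters,
# digits (assumption), hyphens, or dots. This rejects flags (leading '-'),
# paths and URLs (contain '/' or ':'), numerics (leading digit), and
# env-var refs (leading '$').
_BARE_IDENT = re.compile(r"[a-z][a-z0-9.-]*")


def is_verb_like(token: str) -> bool:
    """Return True if the token could plausibly be a (sub)command verb."""
    if len(token) > MAX_VERB_LEN:
        return False
    return _BARE_IDENT.fullmatch(token) is not None


def extract_verb_chain(tokens: list[str]) -> list[str]:
    """Consume leading verb-like tokens, skipping known flag/value pairs."""
    chain: list[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # After the root command, a known flag-with-value pair is
        # transparent: skip both the flag and its value.
        if chain and tok in FLAGS_WITH_VALUE and i + 1 < len(tokens):
            i += 2
            continue
        if not is_verb_like(tok):
            break  # stop at the first non-verb-like token
        chain.append(tok)
        i += 1
    return chain


print(extract_verb_chain(shlex.split("git -C /repo worktree list --porcelain")))
# ['git', 'worktree', 'list']
print(extract_verb_chain(shlex.split("chmod 755 file")))
# ['chmod']
```

Bare-word arguments such as `origin` or `my-pod` pass `is_verb_like` in this sketch, so it over-extracts on them just as the heuristic is expected to.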

Effect on the same examples:

| Command | Old extraction (BashArity-based) | New extraction (stop-at-non-verb-like) |
| --- | --- | --- |
| freshdesk ticket list --status open | [freshdesk] (1-token default) | [freshdesk, ticket, list] (stops at --status) |
| git -C /repo worktree list --porcelain | [git, worktree] | [git, worktree, list] (stops at --porcelain) |
| kubectl get pods my-pod | [kubectl, get] | [kubectl, get, pods, my-pod] (over-extracts; see below) |
| git push origin main | [git, push] | [git, push, origin, main] (over-extracts; see below) |
| cat /etc/foo | [cat] | [cat] (stops at /etc/foo path) |
| chmod 755 file | [chmod] | [chmod] (stops at 755 numeric) |

The over-extraction cases (bare-word args like origin, main, my-pod) are unfixable at the parser layer — they're indistinguishable from subcommand verbs without per-CLI semantic knowledge. The new heuristic is strictly better than today's BashArity-based extraction, just not perfect on bare-word args.

Documentation side: scope Clause.Verb correctly

Update SPEC.md to make explicit:

  • Clause.Verb is a convenience hint, not a security contract. It's a best-effort identification of the canonical verb chain. Consumers using it for display, audit dedup, or other non-load-bearing purposes can rely on it.
  • Consumers needing security-grade verb identification walk the token stream directly. The pattern-matching algorithm should be: "command matches pattern iff first N command tokens equal pattern's verb prefix, where N = pattern's verb-prefix length." This punts the depth choice to the user (via the pattern they author) and eliminates the parser's responsibility to guess.
  • The BashArity table will not grow to enumerate CLI subcommand structures. Existing entries (docker compose, bun run) stay for the convenience hint; new ones are not added. The table is a small set of well-known multi-word verb idioms, not an exhaustive registry.
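
The pattern-depth matching rule quoted above can be sketched in Python as follows. This is a hedged illustration: `verb_prefix` and `matches` are hypothetical names, and the pattern syntax (a verb prefix ended by a `*` wildcard token) is inferred from the `git push *` example in this issue, not from SPEC.md. The sketch compares raw tokens literally and does not handle flags that appear between verbs.

```python
import shlex


def verb_prefix(pattern: str) -> list[str]:
    """Return the pattern's tokens up to (not including) the first '*'."""
    prefix: list[str] = []
    for tok in shlex.split(pattern):
        if tok == "*":
            break
        prefix.append(tok)
    return prefix


def matches(command_tokens: list[str], pattern: str) -> bool:
    """True iff the first N command tokens equal the pattern's verb prefix,
    where N is the length of that prefix."""
    prefix = verb_prefix(pattern)
    n = len(prefix)
    # An empty prefix is rejected so that a bare '*' never matches everything.
    return n > 0 and command_tokens[:n] == prefix


print(matches(["git", "push", "origin", "main"], "git push *"))  # True
print(matches(["git", "pull"], "git push *"))                    # False
```

Because N comes from the pattern, the matcher never consults Clause.Verb, so parser over-extraction cannot widen or narrow a grant.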

Why over-extraction is acceptable for security

When the parser over-extracts (e.g., [git, push, origin, main] instead of [git, push]), pattern-depth-driven matching handles it correctly:

  • User's pattern git push * has verb-prefix length 2.
  • Command's first 2 tokens are [git, push].
  • Match check: do the first 2 tokens equal the pattern's verb prefix? ✓ MATCH.

Conversely, when a prompt auto-proposes a pattern, greedy over-extraction is the security-correct default. Auto-proposed pattern for git push origin main is git push origin main * (over-specific). This is better than git push * because:

  • A subsequent git push wrongremote wrongbranch doesn't auto-grant.
  • Re-prompts on variation are audit checkpoints, not friction.
  • Operators wanting broader grants opt in explicitly via CLI (netclaw approvals trust-verb 'git push *').

False-negative (re-prompt) is recoverable; false-positive (silent destructive grant) is not. Narrow-by-default favors the recoverable failure mode.
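
To make the narrow-by-default argument concrete, here is a small Python sketch comparing an auto-proposed pattern against a broader one. `propose_pattern` and `grants` are hypothetical helpers written for this illustration; the only pattern syntax assumed is the trailing ` *` used in the examples above.

```python
def propose_pattern(verb_chain: list[str]) -> str:
    """Build the (deliberately over-specific) pattern from the full chain."""
    return " ".join(verb_chain) + " *"


def grants(pattern: str, command: str) -> bool:
    """True iff the command starts with the pattern's verb prefix."""
    parts = pattern.split()
    # Drop the trailing wildcard token, if present, to get the verb prefix.
    prefix = parts[:-1] if pattern.endswith(" *") else parts
    tokens = command.split()
    return bool(prefix) and tokens[: len(prefix)] == prefix


narrow = propose_pattern(["git", "push", "origin", "main"])
broad = "git push *"

print(narrow)                                              # git push origin main *
print(grants(narrow, "git push origin main --force"))      # True
print(grants(narrow, "git push wrongremote wrongbranch"))  # False: re-prompt
print(grants(broad, "git push wrongremote wrongbranch"))   # True: silent grant
```

The narrow pattern forces a re-prompt on the unexpected remote and branch, while the broad pattern would have approved it without asking.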

Suggested test cases for the corpus

input:    "freshdesk ticket list --status open"
expected: verb=[freshdesk, ticket, list]

input:    "git -C /repo worktree list --porcelain"
expected: verb=[git, worktree, list]

input:    "kubectl get pods"
expected: verb=[kubectl, get, pods]

input:    "kubectl get pods my-pod"
expected: verb=[kubectl, get, pods, my-pod]   # over-extracts; consumer handles via pattern depth

input:    "aws s3 cp src dst"
expected: verb=[aws, s3, cp, src, dst]   # over-extracts on bare-word path-like args

input:    "git push origin main"
expected: verb=[git, push, origin, main]   # over-extracts on bare-word branch/remote names

input:    "cat /etc/passwd"
expected: verb=[cat]                       # stops at path

input:    "chmod 755 file"
expected: verb=[chmod]                     # stops at numeric mode

input:    "echo --version"
expected: verb=[echo]                      # stops at flag

input:    "ls -la /tmp"
expected: verb=[ls]                        # stops at flag

Non-goals

  • Per-CLI semantic knowledge baked into the parser (no git-specific or kubectl-specific subcommand tables).
  • Disambiguating bare-word args from subcommand verbs.
  • Any UI/UX choices about how consumers display or match these verb chains. Those are consumer concerns.

Severity

Medium. Today's behavior under-extracts for any CLI not in the table, which causes pattern-matching false negatives for consumers (a saved approval pattern doesn't match the command the user thought it would). The proposed fix is strictly better: it moves from "always 2-token unless in table" to "stop at non-verb-like", which handles flags and paths correctly without any per-CLI knowledge. Bare-word over-extraction remains; consumers handle that via pattern-depth-driven matching.

Prior discussion

See comments below for the path that led here — earlier proposals to extend the BashArity table with curated entries, plus a multi-option specificity-picker prompt UX, were both rejected. The table approach can't scale to unknown CLIs; multi-option pickers don't survive translation to text-only channel adapters. The current proposal punts depth choice to consumers and keeps the parser stateless about CLI semantics.
