
CapturePolicy.NAMED_ONLY breaks JDK-compatible group(int) indexing #69

@jbachorik

Summary

When CapturePolicy.NAMED_ONLY is used, Reggie converts unnamed capturing groups (...) to non-capturing (?:...) before compiling. This renumbers the surviving (named) groups sequentially, so MatchResult.group(int) no longer uses JDK-compatible 1-based indices. Any caller that pre-computed group indices from the raw pattern string using JDK counting rules will silently extract the wrong field (or null).

Reproducer

// Pattern has two unnamed groups before the named group.
// JDK: group 1 = first unnamed, group 2 = second unnamed, group 3 = named
// Reggie NAMED_ONLY: unnamed -> (?:...), so named group becomes group 1
String pattern = "(a)(b)(?<name>c)";
ReggieOptions opts = ReggieOptions.builder()
    .capturePolicy(CapturePolicy.NAMED_ONLY)
    .build();
ReggieMatcher m = Reggie.compile(pattern, opts);
MatchResult r = m.match("abc");

// JDK-compatible expectation:
System.out.println(r.group(3)); // expected "c", actual: IndexOutOfBoundsException or null
System.out.println(r.group(1)); // JDK returns "a" (unnamed); actual: "c" (first named)

The same issue manifests in practice with any grok-style pattern that contains unnamed structural groups (e.g. the ipv6 rule, which expands to ((([0-9A-Fa-f]{1,4}:){7}...)) with ~30 unnamed groups). Code that resolves named field positions by counting all capturing parens in the raw string — as java.util.regex.Pattern does — receives wrong values for every field when NAMED_ONLY is active.
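For reference, the JDK numbering that the reproducer's comments assume can be checked with java.util.regex alone. This is a minimal sketch; the class and helper names (JdkGroupBaseline, jdkGroup) are made up for illustration and are not part of Reggie.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JdkGroupBaseline {
    /** Returns group(index) for a full match of regex against input, using only java.util.regex. */
    static String jdkGroup(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalStateException("no match for: " + input);
        }
        return m.group(index);
    }

    public static void main(String[] args) {
        String pattern = "(a)(b)(?<name>c)";
        // The JDK counts every capturing '(' left to right, named or not.
        System.out.println(jdkGroup(pattern, "abc", 1)); // prints a
        System.out.println(jdkGroup(pattern, "abc", 3)); // prints c
    }
}
```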

Expected behaviour

MatchResult.group(int) should maintain JDK-compatible 1-based group numbering regardless of CapturePolicy. Groups that were unnamed and made non-capturing by NAMED_ONLY should occupy their original index slot and return null (consistent with JDK behaviour for a non-participating group). This makes Reggie a drop-in replacement for java.util.regex.Matcher from the perspective of numeric group access.
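The "slot stays, value is null" behaviour requested here mirrors what the JDK already does for a group that did not take part in the match. A small sketch using only java.util.regex (the class name JdkNullSlot and the helper match are illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JdkNullSlot {
    /** Compiles regex with the JDK engine and requires a full match of input. */
    static Matcher match(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalStateException("no match for: " + input);
        }
        return m;
    }

    public static void main(String[] args) {
        // Only the second alternative participates when matching "b".
        Matcher m = match("(a)|(?<name>b)", "b");
        System.out.println(m.groupCount());   // prints 2: both slots still exist
        System.out.println(m.group(1));       // prints null: group 1 did not participate
        System.out.println(m.group(2));       // prints b
        System.out.println(m.group("name"));  // prints b
    }
}
```

Under the proposal, an unnamed group dropped by NAMED_ONLY would look like group 1 above: still counted, but always null.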

Alternatively, document clearly that NAMED_ONLY invalidates group(int) for numeric access, and that callers must switch to group(String name) — but this breaks the java.util.regex.Matcher drop-in contract.

Actual behaviour

With NAMED_ONLY, named groups are renumbered starting from 1 in their order of appearance in the transformed (unnamed-groups-removed) pattern. A call to group(n) using the JDK-computed index returns a different group or throws.
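The effect can be approximated without Reggie by applying the same rewrite by hand and compiling the result with the JDK. This assumes that Reggie's NAMED_ONLY transformation is equivalent to replacing each unnamed ( with (?: as the summary describes; the class and method names below are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TransformedNumbering {
    /** Compiles regex with the JDK engine and requires a full match of input. */
    static Matcher match(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalStateException("no match for: " + input);
        }
        return m;
    }

    /** True if group(index) is rejected because no such group exists in the pattern. */
    static boolean throwsOutOfBounds(Matcher m, int index) {
        try {
            m.group(index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Hand-rewritten form of "(a)(b)(?<name>c)" with unnamed groups made non-capturing.
        Matcher m = match("(?:a)(?:b)(?<name>c)", "abc");
        System.out.println(m.groupCount());           // prints 1
        System.out.println(m.group(1));               // prints c: the named group moved to index 1
        System.out.println(throwsOutOfBounds(m, 3));  // prints true: index 3 no longer exists
    }
}
```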

Workaround

Callers can pre-compute an int[] jdkToReggieIndex remapping at compile time and translate indices before calling group(int), but this forces every Reggie adapter to re-implement the group-counting logic that Reggie already performs internally.
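A rough sketch of such a remapping is shown below. It is not Reggie code: the class GroupIndexRemap, the method jdkToReggieIndex and the NO_GROUP marker are hypothetical, and it assumes named groups are renumbered from 1 in order of appearance, as described under "Actual behaviour". The scanner is deliberately simplified: it skips backslash escapes, \Q...\E quoting and character classes, but does not handle COMMENTS-mode (?x) comments or a literal ] placed first inside a class.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupIndexRemap {
    /** Marks a JDK index whose group was unnamed and so has no slot under NAMED_ONLY. */
    static final int NO_GROUP = -1;

    /** Maps each JDK group index (array position) to the assumed NAMED_ONLY index. */
    static int[] jdkToReggieIndex(String regex) {
        List<Integer> map = new ArrayList<>();
        map.add(0); // group 0 is the whole match in both numberings
        int named = 0;      // named groups seen so far
        int classDepth = 0; // nesting depth inside [...] character classes
        int n = regex.length();
        int i = 0;
        while (i < n) {
            char c = regex.charAt(i);
            if (c == '\\') {
                if (regex.startsWith("\\Q", i)) {
                    // Quoted literal: skip to the matching \E, or to the end if unterminated.
                    int end = regex.indexOf("\\E", i + 2);
                    i = (end < 0) ? n : end + 2;
                } else {
                    i += 2; // skip the escaped character
                }
                continue;
            }
            if (c == '[') {
                classDepth++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
            } else if (c == '(' && classDepth == 0) {
                if (i + 1 < n && regex.charAt(i + 1) == '?') {
                    // (?<name> is capturing; (?<= and (?<! are lookbehinds; other (?... forms do not capture.
                    boolean isNamed = regex.startsWith("(?<", i)
                            && i + 3 < n
                            && regex.charAt(i + 3) != '='
                            && regex.charAt(i + 3) != '!';
                    if (isNamed) {
                        map.add(++named);
                    }
                } else {
                    map.add(NO_GROUP); // unnamed capturing group: keeps a JDK slot, no Reggie slot
                }
            }
            i++;
        }
        int[] out = new int[map.size()];
        for (int k = 0; k < out.length; k++) {
            out[k] = map.get(k);
        }
        return out;
    }
}
```

For the reproducer pattern this yields {0, -1, -1, 1}, so an adapter would call r.group(map[3]) in place of r.group(3) and treat NO_GROUP as null.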

Environment

  • Reggie version: 0.3.0
  • Java: 21
  • Discovered while integrating Reggie into the Datadog logs-backend grok parsing stack as a drop-in replacement for java.util.regex
